The Technology Sounding Board
E7 - Data Mesh
In the previous episode we briefly mentioned the Data Mesh. In this one, we look into it in detail. We answer four questions:
- What is a Data Mesh and how is it different from the Data Warehouse, Data Lake or Data Lake-house that we’ve been discussing so far?
- Why do I say that most enterprises, at this point in time, shouldn’t be investing in the Data Mesh?
- How can we tell if we’re one of the few enterprises that should indeed be pursuing this idea right now?
- If we’re not, can we capture any of the benefits of the Data Mesh without actually implementing one?
Is it right for your company...listen in and draw your own conclusions.
In last month’s episode, as the first of three on Modern Data Platforms, we talked about the difference between Data Warehouses and Data Lakes, and we introduced the concept of the Data Lake-house. I said that it was the best of both worlds solution that I strongly recommend as the right option for storage of analytical data in today’s enterprises.
Right at the close, I mentioned the latest “shiny thing” that has attracted a lot of attention, the Data Mesh. I said that I didn’t think this was a good alternative to the Data Lake-house for most enterprises at this time, and I stand by that advice, but that implies that there are at least some for whom it would be.
So, what exactly is a Data Mesh, when would it be a good idea and when would it not? …let’s talk about it.
[Intro Music]
Welcome to The Technology Sounding Board, I’m your host, Michael R. Gilbert, and today we’re talking about the Data Mesh. There are four things I want to cover in this episode:
1) What is a Data Mesh and how is it different from the Data Warehouse, Data Lake or Data Lake-house that we’ve been discussing so far?
2) Why do I say that most enterprises, at this point in time, shouldn’t be investing in the Data Mesh?
3) How can we tell if we’re one of the few enterprises that should indeed be pursuing this idea right now?
4) If we’re not, can we capture any of the benefits of the Data Mesh without actually implementing one?
This being a Podcast on the Data Mesh, you can guess that the answer to the last one is… yes, or I wouldn’t have asked it!
So, there’s the hook to get you to listen to the end, but before we get started with any of that, I want to revisit something I said in last month’s episode on Data Warehouses and Data Lakes. I said…
If the benefits of the data accrue predominantly to the consumers of that data, but the costs accrue instead disproportionately to its producers, then, unless you have an external marketplace for it, i.e., you're in the business of selling data, failing to account for this imbalance will derail any and all data projects.
Let’s be honest, that’s a bit of a mouthful, and it might take a bit of head scratching to figure out what I meant, so let’s take a minute to unpack it with an analogy.
Imagine, if you will, that your neighbor wants to borrow your garden hose from time to time. They don’t have one and, given the current watering restrictions, they aren’t allowed to use their automatic sprinkler system – which is actually not too hard to imagine if you happen to live in Central Texas, as I currently do.
Now these are your neighbors. You live in the same neighborhood, kinda by definition. Saving their plants and trees from dying helps you directly as it improves the neighborhood you live in, but besides, it’s just the right thing to do…but still, you have your own life to look after, your own house to tend to, your own plants to water. Letting the neighbor borrow the hose when they need it, providing that they come and get it, then put it back again as soon as they’re done with it so that it’s ready for your own use when you need it…well that’s no big deal and I’m sure that most people would be more than happy to help out.
On the other hand, if your neighbor wants you to unhook your hose, walk to their house, hook it up, water their plants, unhook it and then hook it back up to your own house again, that might be asking a bit much. Maybe that’s ok on an occasion…while they’re off on vacation for a week or two perhaps…but it’s not going to happen in the long run. They are asking you to do all the work, while they get most of the benefit. At some point, if they can’t do it themselves, they need to hire a professional gardener and not rely on the goodwill of neighbors.
Well, the same thing is true inside an Enterprise. People will happily help out other departments on a one-off basis, and they will happily share excess resource capacity with you if you can use it without detriment to them. After all, we’re all working for the same Enterprise, so we want to help where we can, and besides, it’s just the right thing to do. But at the end of the day, we all have our own jobs to do and a limited budget to do it with, so the degree to which you can expect one department to do work that benefits another, for any extended time, is naturally limited.
Internal marketplaces can help if you can create some kind of governance mechanism that lets credit flow back to the doer of the work…perhaps where you cross-charge time, or perhaps where it’s baked into performance objectives, but this only works to a limited degree…when push comes to shove, we will look after our own workload first.
This is just natural human behavior, and if we don’t design systems that work in the face of natural human behavior, we can’t be shocked and surprised when they don’t really work at all.
I have a phrase I like to use a lot: always make sure you put gravity on your side. Gravity will ultimately win any battle, no matter how much energy you set against it, so don’t fight it. Natural human behavior is like gravity; it is what it is and won’t suddenly change just because we currently find it inconvenient. Keep it on your side and don’t fight it unless you like losing.
So, with that firmly in our minds, let’s get back to the Data Mesh…
First proposed by Zhamak Dehghani in 2019, the Data Mesh is basically the answer to the question, what would happen if we brought the fundamental ideas behind micro-services to the world of data?
Now we haven’t talked about micro-services yet… and we will definitely want to dedicate a future episode or two to that topic… but most people in the technology world are aware that we transitioned away from the idea of vast, monolithic applications to micro-services architectures for a number of reasons.
Perhaps most important was that it was increasingly difficult to scale these large, monolithic applications. In the world of the internet, with its hugely variable demands, scaling at very short notice is critical. It also helps in bringing multiple, smaller teams together to work on larger problems, even if those teams have very different skills and use very different technology stacks (perhaps because their part of the problem is better served by different technologies).
Key to all this is the idea known as Domain Driven Design. This is the name that Eric Evans, in his 2003 book of the same name, gave to a method of breaking down large application spaces into a set of subparts – each separately designed around the specific sub-domain that they address.
Perhaps an example might be, in the context of creating an eCommerce solution, the fact that a shopping cart has very different requirements than a warehouse management system. They might both be necessary components of the overall eCommerce system, but by treating each one as a separate domain and designing its part of the overall solution separately, we get a much better fit to their particular problem spaces. We can leverage multiple small teams that understand their part of the problem very well, even if they only have a basic understanding of the end-to-end problem, and even if the language each team uses to describe its sub-problem is very different from that of every other team. There is no need for, and no advantage to, these sub-solutions sharing a single, common data store, or being programmed in the same language, or even running on the same machines.
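For anyone reading this transcript who’d like to see what that separation looks like in practice, here’s a tiny, purely illustrative sketch in Python – the class and field names are my own invention, not taken from Evans’ book. The point is simply that the cart team and the warehouse team each model a “product” in their own language, sharing nothing but an identifier.

```python
# Illustrative only: two bounded contexts, each with its own model of a "product".
# Neither domain needs to share a data store or a class with the other.
from dataclasses import dataclass


@dataclass
class CartItem:
    # The shopping-cart domain cares about price and quantity for checkout.
    sku: str
    display_name: str
    unit_price: float
    quantity: int


@dataclass
class StockedItem:
    # The warehouse domain cares about where the item physically lives.
    sku: str
    bin_location: str
    units_on_hand: int
    reorder_threshold: int


# The only thing the two domains agree on is the shared identifier (the SKU);
# everything else is designed around each team's own sub-problem.
```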
It turns out that the large Data Warehouses that we have traditionally used to store enterprise-wide data…that data we intend to use for analytical purposes…suffer from many of the same problems as we had with the old Monolithic approach to creating operational systems. Zhamak asks the question: can a similar solution be used to address them?
Her proposed solution, the Data Mesh, is not an architecture, nor is it a technology – although it relies on elements of both. She defines it as a “sociotechnical paradigm”, but I don’t think that adds much clarity without a considerable detour for which we do not have the time, so I’ll just say that it is a significant element of one possible enterprise-wide data strategy. A strategy that impacts both the enterprise architecture and the organizational operating model.
This is not something that IT, or even some Data Group, can implement. Either an organization decides to adopt the Data Mesh, from the Executive, all the way down, or they don’t. It is no small endeavor and will take any non-trivial company years to complete the transition, involving every department in the enterprise to get it done.
The potential benefit, to an organization that is positioned to be able to adopt it, is large enough to make that effort worthwhile, but it should be noted that there is a very significant network effect to it. That is to say the benefit of rolling it out to a tiny part of the organization is very small, almost certainly net negative, but that it gets exponentially larger as the area that is covered by it increases…this really is an all or nothing proposition.
The Data Mesh is designed around four basic principles, the first of which is exactly the idea of Domain Driven Design, applied to the realm of Data. They are:
1) Domain Ownership of Data
2) Data as a Product
3) A Self-Service Data Infrastructure Platform
…and
4) Federated Computational Governance
As we walk through these principles, it should become clear that it is that first principle, Domain Ownership of Data, that Zhamak sees as pivotal to the whole approach. The other 3 are really there to offset the costs and disadvantages that would stem from an attempt to implement the first without them.
Now here I want to pause and stress that, as we walk through these principles and highlight the problems with implementing them, it may appear that I am criticizing Zhamak’s work. Nothing could be further from the truth. I strongly encourage anyone who is interested in this concept to read her book (Data Mesh – Delivering Data-Driven Value at Scale, published by O’Reilly just last year). It is a well thought out, lucid and articulate discussion of the concepts and ideas, and she doesn’t just whitewash over the challenges inherent in implementing them. She clearly addresses exactly those challenges that I will be highlighting here.
I hope and believe that, should she ever listen to this episode herself, she wouldn’t disagree with any of the points that I am going to make.
With all that said, and before I outline the high-level meaning of the four principles above, let me make a couple of observations. Firstly, in her book she uses a mythical company, Daff, to help make otherwise intangible ideas more concrete for the reader. She does this very effectively. Daff is essentially a company that might otherwise be Spotify or any similar vendor. The business they are in actually is the business of information. That’s what they sell, that’s what they buy, and that’s the be-all and end-all of their existence. It’s very easy to see how Domain Ownership of Data fits with Data as a Product, because Data is already a Product in companies like these. This will be a much bigger leap for the vast majority of today’s enterprises that aren’t information centric by design, and there can be little doubt that she chose this example exactly because of its more natural fit to the Data Mesh concept.
Secondly, the book centers on a mythical company, exactly because there are no companies that have implemented the Data Mesh yet, and so there can be no real-world case studies. As such, we have to recognize that the benefits that are shown to arise from Daff’s adoption of the Data Mesh are theoretical. They are the benefits we might believe and hope to see. They represent a better world that we are trying to bring into existence, not one that we have seen already and so can know to be possible.
In a similar vein, anyone who has ever developed an idea into an actual product knows that the further you are from the stone-cold granite face of reality, the more perfect your idea is. It’s only as we start to make the necessary compromises that we are forced to make to materialize it that the defects and deficiencies in our thinking become as real as the product itself.
None of this means we shouldn’t strive to create such wonders as we can imagine, only that we should temper our excitement and make small, testable, steps wherever we can.
So back to the four principles…
The idea behind the Domain Ownership of Data principle is to take responsibility for the data away from today’s centralized data teams that don’t really understand the environment, or domain, that gives rise to this specific data. Instead, we should give it to the individual operational teams that do – those that live in that exact domain every day. For example, sure, Marketing understands marketing data better than any other group in the enterprise, so shouldn’t the Marketing team have responsibility for it?
In this case the owner seems obvious, but how can we always find the right owner for a given set of data? Who should have the responsibility for identifying, creating, and maintaining each of these data domains and their contents? Zhamak argues that this should be the responsibility of the group most closely associated with the underlying domain itself. To help find the right owner, she identifies three domain data archetypes, as she puts it:
Source-Aligned, Aggregate and Consumer-Aligned.
Source-aligned refers to that data which is generated by our operational systems…the Customer domain data might be a good example, and this probably comes directly from our CRM. The data in the CRM is operational data and is generally accessed, for both read and update, one customer at a time – she is not referring to this. She is very specifically not suggesting that you should simply expose the data directly from the operational systems, in this case the CRM – this is a well understood anti-pattern that leads to all sorts of unpleasantness. She means this data but published in a manner specifically formatted for analytics – intended to be read in bulk and not generally updated after the fact. For those more familiar with Data Lake terminology, we might consider this “raw” data. It comes from an existing operational system, so the owners of this domain are obvious…the owners of the operational system – in this case the Customer Service team.
Aggregate data is simply the data formed from the joining of two or more source-aligned data domains (for example, combining the Orders domain data with the Customers domain data to create Orders that are ‘enriched’ with more Customer details). This sort of intermediate data domain is typically created by today’s data groups, en masse, to produce more usable intermediate data for downstream analytics and reports. Zhamak doesn’t want to see this data created in this way. Part of her philosophy is that all Data Domains contain data that is valuable on its own, not as a step towards value if used in conjunction with something else.
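To make the aggregate idea concrete for transcript readers, here’s a minimal, hypothetical sketch of that Orders-plus-Customers enrichment using pandas – the column names and values are invented purely for illustration.

```python
# A minimal sketch of an "aggregate" domain: enriching Orders with Customer
# attributes. Column names here are hypothetical, just to show the shape of it.
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1001, 1002],
    "customer_id": ["C-01", "C-02"],
    "order_total": [250.00, 99.50],
})

customers = pd.DataFrame({
    "customer_id": ["C-01", "C-02"],
    "customer_name": ["Acme Corp", "Globex"],
    "region": ["Central Texas", "Pacific Northwest"],
})

# The enriched view joins the two source-aligned domains on customer_id.
enriched_orders = orders.merge(customers, on="customer_id", how="left")
print(enriched_orders)
```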
So, if Aggregate data is necessary, but only in support of creating an element from the next domain archetype, that is consumer-aligned data, then this aggregate data should be part of the internal data pipeline that creates that consumer-aligned data, and not separately exposed. We’re going to talk about this in more detail shortly when we discuss what she calls the Data Quantum, but essentially, she means the encapsulation of everything necessary to make up a unit of useful data within a single wrapper. Hiding the data pipeline in this way is part of abstracting away the mechanics of the data to emphasize its value.
Alright then, let’s talk about Consumer-aligned data domains. They would consume the source-aligned data domains and in turn publish exactly the output of these downstream analytics we have been talking about…perhaps a Machine Learning generated list of products that should be actively marketed to our customers based on their previous purchasing patterns, combined with patterns from other customers that are similar in some way. So, who should own this data domain? Well, it serves Sales, right? I mean the purpose of producing this list, let’s call it the Potential Targets list, is to be able to better pitch our offerings to our customers – or more specifically, pitch the right offerings to the right customers. That is something that Sales needs in their day-to-day activity – and hence they are best placed to understand it, and so they should own it.
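Purely to illustrate what a consumer-aligned data product like that Potential Targets list might be built from, here’s a toy sketch using scikit-learn’s nearest-neighbors model – the customers, products and purchase history are all made up, and a real implementation would look very different.

```python
# A toy sketch of a consumer-aligned data product: a "Potential Targets" list
# built from purchase history. The data and model choice are illustrative only.
import numpy as np
from sklearn.neighbors import NearestNeighbors

products = ["widget", "gadget", "gizmo", "doohickey"]
customers = ["C-01", "C-02", "C-03"]

# Rows are customers, columns are products; 1 means a previous purchase.
purchases = np.array([
    [1, 1, 0, 0],   # C-01
    [1, 0, 1, 0],   # C-02
    [0, 1, 0, 1],   # C-03
])

# Find, for each customer, the most similar other customer by purchase pattern.
model = NearestNeighbors(n_neighbors=2, metric="cosine").fit(purchases)
_, neighbor_indices = model.kneighbors(purchases)

for i, customer in enumerate(customers):
    similar = neighbor_indices[i][1]  # index 0 is the customer themselves
    # Suggest products the similar customer bought that this customer hasn't.
    targets = [p for j, p in enumerate(products)
               if purchases[similar][j] and not purchases[i][j]]
    print(customer, "->", targets)
```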
What she appears to be most strongly against is all of these domains being owned by some singular, centralized Data Group. They can’t possibly understand all the underlying domains, at least not as well as the members of those actual domain teams, and so can’t be expected to envision the needs and concerns of the data better than that operational group could.
I don’t disagree with her logic, but here’s where the problem I started with rears its ugly head – the problem of aligning with natural human behavior. Sales owning the Potential Targets data domain isn’t particularly problematic (ignoring the need for them to have Machine Learning skills for the moment)…after all, getting this information, and getting it right, helps them drive their primary concern – more sales. But how do you motivate the Customer Service team to create usable data products from their CRM data – an activity which requires significant effort and skill, but which will not particularly benefit the Customer Service team directly?
Zhamak doesn’t ignore this problem in her book, quite the opposite. She talks about the need to adopt the Data Mesh as a corporate wide change in philosophy – to raise the idea of data to the same level of concern as that of the core products or services that the enterprise sells – and to build the appropriate performance metrics into each team so that we reward, and therefore encourage, the right behavior.
She is also clear about the fact that this will take the creation of new roles, including Data Product Owner, Data Product Developer and several others we will cover later, and that these roles will need to be embedded in operational teams across the organization. Without this type of action, you cannot expect to see the company-wide behavioral change that it would require for any level of success.
This leads us nicely into her second principle, that of Data as a Product. If you are successful in creating distributed data domain teams within distinct parts of the business, you are going to risk creating data silos. Decisions on the design and use of the data will be made locally and will differ across the organization. Priorities will be set to benefit the owning team rather than the enterprise as a whole. We know from many earlier attempts to distribute application teams across the organization, that silos are bad. To help address this, Zhamak proposes that we treat Data as a Product, so what does she mean by that?
Data as a Product is an extension of the idea that the data itself has value, and so the domain team should view it as they would any other Product, i.e., make it as consumable as possible, as cheaply and effectively as possible, just as they would if they were creating an API to allow consumers to buy donuts, or whatever the company sells. To be clear, she is not restricting this idea to just the data that you might sell. Exactly as there is a modern trend to treat internal technology as a product and create “product teams” around it – a design philosophy that we will examine in more detail in a future Podcast – she argues that the same can, and should, be done with internal data.
For each Data Product, there should be a Data Product Owner that continually attempts to improve and manage the quality of the data product that the domain team owns and a team of Data Product Developers capable of creating these encapsulated data domains we mentioned earlier, what she refers to as Data Quantum.
It’s an odd name, and she has her reasons for coining it rather than simply saying data service, for example, but to avoid wading into unnecessary detail, let’s just say that she wants the data wrapped up in a container that provides a set of APIs for both access and control. By access I mean both input and output of the data. The control APIs would support discoverability, reading metadata, defining policies, etc. In Zhamak’s Data Mesh, data is not a passive entity, but one that is active in the sense that it can apply policies and modify itself appropriately to make itself more consumable. This is not just a table in a database.
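For the curious, here’s one way you might imagine those access and control APIs bundled into a single wrapper. To be clear, this is my own rough interpretation sketched in Python, not Zhamak’s specification – the method names are hypothetical.

```python
# My own rough interpretation of a "data quantum" wrapper, not Zhamak's
# specification: one object that exposes both access ports (move the data in
# and out) and control ports (discover it, read its metadata, apply policies).
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable


class DataQuantum(ABC):
    # --- access APIs ---
    @abstractmethod
    def read(self) -> Iterable[Dict[str, Any]]:
        """Serve the analytical data itself, e.g. rows, files or a table handle."""

    @abstractmethod
    def ingest(self, records: Iterable[Dict[str, Any]]) -> None:
        """Accept new data from upstream; any internal pipeline stays hidden here."""

    # --- control APIs ---
    @abstractmethod
    def describe(self) -> Dict[str, Any]:
        """Return metadata: schema, owner, freshness, quality guarantees."""

    @abstractmethod
    def apply_policy(self, policy_name: str, config: Dict[str, Any]) -> None:
        """Accept a governance policy (e.g. masking, retention) and enforce it."""
```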
She also doesn’t want to see traditional, externally visible, data pipelines moving data from domain to domain. Whereas she recognizes that this may indeed need to happen at various points, any such movement should be contained within the Data Quantum themselves. In the case of Source-Aligned data, for example, the cleansing and deduplication that is required to ensure that it has value to downstream activities without further processing occurs within either the input or output APIs. Consumers of this data see only clean, reliable, trustworthy data. They don’t have to care how this happened. Creating this cleansing capability is a responsibility of that Data Domain’s Data Product team, not some external Data Group.
We’ve already talked about the motivation problem in getting an operational team to own this effort, but there’s also a second issue here. Doing this requires dedicated, specialist resources with considerable skill in every Data Product team. Remember, within companies that adopt the Data Mesh philosophy, there should be many of these teams distributed all over the company. This is the exact opposite of the Division of Labor through centralization of high skilled activities, which has been a well-known economic driver since Adam Smith first penned The Wealth of Nations, in 1776!
Again, Zhamak isn’t ignoring this. To address it, at least partially, she proposes Principle 3 – the Self-Serve Data Platform. Conceptually, this is simple. It’s a technology platform that automates the tasks of creating, discovering, consuming, and governing the Data from the Mesh…the Data Quantum plug in to this platform, but more than that, the platform helps create the Data Quantum themselves. The Data Product Developers only need to create any domain-specific transformations, write the applicable build and run-time tests that the platform can use to assure data quality, produce the metadata that describes the data, and declare any infrastructure requirements it might have…the platform takes care of everything else. So yes, there is still work to be done by each product team, but the data platform reduces this cost by automating the heavy lifts. The idea is to reduce both the overall volume of work and the skill level required to do the remaining bits.
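To give a feel for that division of labor, here’s a hypothetical example of the small declaration a Data Product Developer might hand to such a platform – every key and value here is invented purely for illustration; the platform itself would be expected to generate and operate everything else.

```python
# A hypothetical data product declaration. None of these keys or values come
# from a real product; they just show what the developer declares versus what
# the (imagined) platform would take care of automatically.
customer_data_product_spec = {
    "name": "customer_profiles",
    "domain": "customer_service",
    "owner": "customer-service-data-product-owner@example.com",
    # The only bespoke code: a domain-specific transformation.
    "transformation": "transforms/clean_and_deduplicate_customers.py",
    # Build-time and run-time tests the platform runs to assure data quality.
    "tests": [
        "no_duplicate_customer_ids",
        "email_addresses_are_well_formed",
        "row_count_within_10_percent_of_prior_day",
    ],
    # Metadata the platform publishes for discoverability.
    "metadata": {
        "description": "Cleansed, deduplicated customer records for analytics.",
        "refresh_schedule": "daily",
    },
    # Declared infrastructure needs; provisioning is the platform's job.
    "infrastructure": {"storage": "lakehouse", "compute": "small"},
}
```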
The problem is equally simple…no such platform exists at this point – or at least none that you can buy commercially. Several organizations are building this kind of platform internally, but they are all very different implementations and at varying levels of completeness. We are a long way off this being commercially available.
In the meantime, any organization walking the Data Mesh path is going to need to invest in a significant Data Platform Development team, with very high skill levels. You can see why this is much better suited to a company for which Information is already their product…skills like this are common across companies like Spotify or Netflix…not so much at Acme Meat Products.
Her final principle is that of Federated Computational Governance. Anyone who has been in the data world for any length of time can understand why a single, centralized data governance team is not great…you need input from all across the organization in order to get data to be as usable as possible – a single central group can’t hope to understand the workings of the entire organization. All of the technology concerns, all of the legal concerns, all of the security concerns, etc., etc.
Putting together a federated governance team with representatives from every significant group across the organization is both difficult and expensive – back to the problem of incentives again – but it is the only path that can succeed. The difference here is that Zhamak has this group defining policies which can then be given to the Data Platform and enforced automatically at runtime. This is a great idea, but of course it pre-supposes that principle 3 is already in play, that you have a Self-Serve Data Platform.
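As a small illustration of what “computational” governance could mean in practice, here’s a sketch of a policy expressed as code that a self-serve platform might apply to every data product at read time – the policy, column names and masking rule are all invented for this example.

```python
# A small sketch of "computational" governance: the federated group agrees a
# policy once, expresses it as code, and the platform applies it to every data
# product at read time. Column names and the masking rule are purely illustrative.
from typing import Any, Dict, Iterable, List

PII_COLUMNS = {"email", "phone_number"}  # agreed centrally, applied everywhere


def mask_pii(rows: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Replace agreed PII fields with a masked value before data leaves the platform."""
    masked = []
    for row in rows:
        masked.append({
            key: ("***" if key in PII_COLUMNS else value)
            for key, value in row.items()
        })
    return masked


print(mask_pii([{"customer_id": "C-01", "email": "jane@example.com", "region": "TX"}]))
```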
Ok, so that’s been a lot already, but I hope you now understand, at least at a high-level, what the Data Mesh is and why I say that this isn’t where most enterprises will want to go, at least right now. We can see that the companies that are going to be best suited to this will have a number of common attributes:
Firstly, they are already going to be aligned in a Domain-Oriented manner – meaning there is a 1:1 relationship between operational groups and a dedicated technology team that supports them, and data is at the core of everything they do. This will typically be because they *are* a technology company and what they sell *is* information.
Secondly, they are going to have a very high level of technical skill across the entire organization…again, with companies that sell information, this is probably an automatic checkbox. However, more and more companies are getting a higher and higher level of tech savviness, so perhaps this is increasingly less of a restriction – it’s still going to be a small minority at this moment in time I suspect.
Lastly, they are going to have an Executive-level commitment to making Data as a Product work. Many companies like to *say* that they are committed to being a “Data Driven” organization, but please don’t underestimate the enormous cost and effort that will be needed to make this successful, both technically and in terms of actually seeing better profitability post-implementation. Talking about it isn’t commitment; putting cold, hard cash on the line is.
Well, what if you’ve heard enough to believe that the Data Mesh isn’t for you, or at least not now? Are there lessons we can take away from Zhamak’s work that can make us better right now, with technology that’s easy to get and that we might even already have? Yes, there are.
If you listened to the previous episode, you will not be shocked to hear that I think the basis from which to start is the Data Lake-house. Zhamak tends to lump data lakes and data warehouses together in the “old legacy way of thinking bucket” and to be sure, you can certainly adopt newer technologies but keep older ways of thinking and achieve very little improvement. But that’s not necessarily the way it must be.
If we abstract the last episode down to its very core, we can say that Data Warehouses are a way of storing all corporate data in a highly structured manner. The trap is that structuring everything, and keeping it that way, is both almost impossible and extremely expensive.
Data Lakes allow us to push in any and all data without worrying about structuring it first, and so they make things much cheaper and quicker. Zhamak might say that this will almost inevitably result in what we might call a Data Swamp…a collection of old, mostly stale data that no one understands, making it hard even to find the few treasures that lie within…and she’d have a good point.
The Data Lake-house lets us get the speed and agility of the data lake, and then superimpose structure and control, as and where it is beneficial to do so. Bifurcating the lake-house into a certified space and an uncertified one allows us to enable domain-based management of prototypical data products, at least for those that are identified by a business group as potentially useful, and then leverage our existing Data Team to industrialize that data product as and when (or if) it becomes demonstrably valuable.
By empowering the consuming group to get started on defining a new data domain, we are putting the ownership where it should be…with the employees that will get direct benefit from its creation and maintenance. The very fact that they need it will give them the motivation they need, and the potential benefit gives them the justification to invest in the skills (whether that’s training internal staff or bringing in temporary staff or contractors that already have those skills) to get it done. It’s a prototype, a proof of value, so it doesn’t need to be perfect at this stage…once it is proven, then the existing Data Group can take it further from a technical point of view, while the consuming group remains as the Data Domain owner, guiding its growth.
To reduce the barrier to getting started, we’ll still want to invest in making our data platform as self-serve as we can, but for this implementation, it doesn’t need to be anywhere near as sophisticated as it would need to be to support an actual Data Mesh. We can have our Data Team start by creating guidelines and advice for those data consumers that want to start populating the uncertified regions of the lake-house…proposed storage locations and naming standards so that the data can be more easily discovered, for example. Maybe then they can expose a metadata repository and provide some self-serve interface to allow its easy population.
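To show the kind of lightweight guidance I mean, here’s an invented example of a path convention and a simple registration helper that a Data Team might publish for the uncertified zone – none of this refers to a real tool; it’s just the level of ceremony we’re aiming for at the prototype stage.

```python
# A tiny illustration of guidance for the uncertified zone: a predictable path
# convention plus a lightweight metadata registration step. The layout and the
# register_dataset helper are invented purely for this example.
from datetime import date


def uncertified_path(domain: str, dataset: str) -> str:
    """Propose a predictable location so prototype data products can be found."""
    return f"lakehouse/uncertified/{domain}/{dataset}/as_of={date.today():%Y-%m-%d}"


def register_dataset(catalog: dict, domain: str, dataset: str, owner: str,
                     description: str) -> None:
    """Record who owns a prototype dataset and what it is for, in a simple catalog."""
    catalog[f"{domain}.{dataset}"] = {
        "path": uncertified_path(domain, dataset),
        "owner": owner,
        "description": description,
    }


catalog: dict = {}
register_dataset(catalog, "sales", "potential_targets", "sales-analytics@example.com",
                 "Prototype list of customers to target, built from purchase history.")
print(catalog)
```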
At first this can be a voluntary thing…use the proposed guidance and it will be easier to onboard later if it’s successful, but over time more and more of these steps can be automated. Instead of guidance on where to put data in the lake-house, a web UI is created that will create it for you. Step by step, little by little, we can work on building out that self-serve platform that can lower the barrier to good data governance.
There are tools available today that can help create such an environment, which is the subject of the next episode… a look at those tools and applications we have commercially available to us right now, that can help us build out a modern data platform for our organizations. One with as much self-serve capability as we can get. As data mesh enabled products begin to emerge, as I am certain they will, this home-grown automation can slowly be replaced and more and more of the Data Mesh concepts can be adopted.
What I’m outlining here isn’t a poor man’s approach to the Data Mesh, not even close, but it will get you using your data far more productively, today, and it is a step in the direction of the Data Mesh should you choose to eventually go that far.
So that’s what I would be doing if this was my problem to address today. I hope you found that helpful. As always, the transcript for this and all previous episodes can be found on the website at www.thetechnologysoundingboard.com. If you get a chance, stop by and leave us a review or a comment. Until next time…