The Technology Sounding Board
E6 - Data Lakes vs Data Warehouses
Most companies already have a Data Warehouse to some degree, and some have delivered more value than others, but these days the chatter is mostly around Data Lakes instead. What's the difference, and which, if either, should today's Enterprises be investing in? Is there a middle ground between the two worth pursuing? Spoiler alert: the answer is yes, and we'll be talking about it in this episode.
I was talking with a CEO the other day who was lamenting the endless spend on technology for things that were, as he put it quite frankly, hard to understand the value of. He'd just been listening to his data analytics team telling him that they needed to invest in a data lake. Even after listening to their explanation, he wasn't clear what a data lake was, why it was different from the incredibly expensive data warehouse they already had, and why tech people needed to keep making up weird names for all these things. I smiled, offered to buy him a coffee and help him answer the first two of those questions. The last one would have to wait for another day. So, data lakes vs data warehouses, let's talk about it. Welcome to the Technology Sounding Board. I'm your host, Michael R. Gilbert, and we're talking about data lakes, data warehouses and the differences between them. Perhaps it's best to start with a brief review of the conditions that led to the development of each of them. We'll go in chronological order and start with the data warehouse, but before we do, I'd like to put an idea in your head that will hopefully make more and more sense as we go on, and that is this: if the benefits of the data accrue predominantly to the consumers of that data, but the costs accrue instead disproportionately to its producers, then, unless you have an external marketplace for it (i.e. you're in the business of selling data), failing to account for this imbalance will derail any and all data projects.
Michael R. Gilbert:Alright, so with that behind us, why did we create data warehouses in the first place? Well, in the beginning things were easy, that's to say, they were incredibly difficult and expensive, but universally so. All data processing for a given firm was done by a centralized team using vast monolithic mainframes. Getting data in or out of them was slow and expensive, but they were the only source of the truth. Note that phrase, as it will re-emerge shortly. During the 80s and 90s, we saw the explosion of microprocessor-based computing. First we saw departments having their own Unix boxes, and then we saw PCs everywhere. Tech skills also grew wildly, and soon enough every department could have their own application written to solve their specific problems, and data exploded. Unfortunately, this data was also all over the place, both geographically and philosophically. Different groups within the same company would have different views of parts of the company's performance and, worse, depending on how you interpreted them, you came up with very different opinions on how things were going. So what was to be done about it? This couldn't go on. Such was the environment into which the data warehouse was born. Organizations everywhere were coming to the realization that they needed one source of the truth.
Michael R. Gilbert:Now, I'm not sure which consulting company came up with this phrase in relation to company data. The earliest reference I know of is Galileo, and he was referring to God. But whoever it was that spun it into a corporate data mantra was extremely successful, and by the turn of the century just about every large company was investing in a centralized data warehouse under this banner. The idea was simple enough: let's get feeds from all these operational systems. That's our name for the systems that actually do the work, as opposed to the reporting systems that would stem from the data warehouses we're talking about, but we're getting ahead of ourselves. So let's take these feeds of data, perhaps on a daily basis, and load them all into one massive centralized database that IT would look after and keep in good shape. Then we can build the aforementioned reporting applications on top of this data warehouse, and we can trust that everyone is seeing a picture built on the whole view of the company and everyone will be getting the same answer. Of course, as simple as the idea is, the devil is in the details.
Michael R. Gilbert:The first problem is that all of these operational systems were built individually to do their work, without worrying about all the other systems that overlap with them. This meant that the way the data was mapped out, the data schema as we would say, was different in each system. In order to look up an entity that we might understand, say a customer, you'd need an identifier, a key. Well, the key for the same customer in different systems was incredibly unlikely to be the same, so relating two views of the same customer that came from two different operational systems was difficult and fraught with opportunity for errors. Solving this problem led to the idea of the ETL process, that's extract, transform and load. You would extract the data from the operational system, transform that data into a standardized schema for the data warehouse, and that would mean changing the data layout and updating all the keys that it used to a standardized set to be used whenever that entity, in our example the customer, was referenced. Then you would load this transformed data into the right place in the data warehouse. And there's the main rub. Writing these ETLs is a vast, complicated and highly skilled operation. It includes designing a data schema that's massive, large enough to map every interesting entity in the whole company (and you can imagine just how many different things any non-trivial organization might want to track), and then creating and maintaining key mappings for each of these operational systems to those entities. But it's worse than it seems.
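For readers of the transcript who like to see things written down, here's a minimal sketch of what one of those ETL steps might look like. Everything in it, the system names, the customer keys and the schemas, is invented purely for illustration; real ETL pipelines are far more involved than this.

```python
# A hypothetical sketch of one ETL step: pull a customer record from an
# operational system's feed, remap its local key and layout to the warehouse's
# standardized schema, and load it. All names here are invented.

# Hand-maintained mapping from each source system's customer key to the
# warehouse's standardized customer key.
CUSTOMER_KEY_MAP = {
    ("order_system", "C-10042"): "CUST-000071",
    ("billing_system", "8831"):  "CUST-000071",   # same customer, different source key
}

def extract(source_row: dict) -> dict:
    """Pretend we've pulled a raw row from an operational system's nightly feed."""
    return source_row

def transform(system: str, row: dict) -> dict:
    """Reshape the source layout into the warehouse schema and swap in the standard key."""
    return {
        "customer_key": CUSTOMER_KEY_MAP[(system, row["cust_id"])],
        "customer_name": row["name"].strip().title(),
        "source_system": system,
    }

def load(warehouse: list, row: dict) -> None:
    """Stand-in for inserting into the warehouse's customer dimension table."""
    warehouse.append(row)

warehouse_customers: list = []
raw = extract({"cust_id": "C-10042", "name": "ACME industrial "})
load(warehouse_customers, transform("order_system", raw))
print(warehouse_customers)
```

The key mapping table is the part that has to be built and maintained for every operational system feeding the warehouse, and it's that ongoing maintenance that makes the whole exercise so expensive.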
Michael R. Gilbert:In the UK we talk about projects being like the painting of the Forth Bridge. It's a steel bridge about two miles long, which crosses the Firth of Forth in Scotland. Being made of steel, it's important that the paint is in good shape or it would rust. There's a painting crew that starts at one end and works towards the other in a process that takes three to four years, which is about how long the paint will last in the nasty weather that is native to Scotland. On reaching the end, they must immediately return to the beginning and start putting down the next coat. Such it is with data warehouses. It takes a vast effort to map all the systems we use into it, but we're continuously adding to and changing our operational systems. Each time we do, the data warehouse and the ETL processes must change.
Michael R. Gilbert:Another major challenge is that the databases we use for our operational systems are designed for very different needs than we have in a data warehouse. Operational systems tend to write new data or update old data frequently, each time a customer places a new order, for example, whereas data is typically added to the warehouse every night and isn't really changed after that. It's read a lot, however, and because a data warehouse contains all the data from all the operational systems for all time, they are very, very large. Traditional databases don't scale well, and certainly not cheaply. That means systems that support data warehouses tend to be very large and very expensive. If you recall the idea that we started with, that there's a fatal imbalance in play when the benefits of the data accrue to the consumers of it but the costs accrue disproportionately to the producers, perhaps you can see where this is going. Getting data into the data warehouse is a very slow and expensive project, but the act of putting it in generates no value at all. The value comes from the small percentage actually used, and it's usually used by different people than the ones who are charged with putting it in there in the first place. This is not a recipe for high-speed success and innovation.
Michael R. Gilbert:Meanwhile, back in Gotham City, things have not been standing still in the technology space. If we peg the birth of the data warehouse at approximately the turn of the century, then by 2010, and certainly by 2020, things had changed radically. The web had exploded, leading to massive sources of new data, including data from outside the company. The cloud had made storage and compute both very cheap and readily available, and the world of data science had become a mainstream thing. Teams like the marketing department and the newly formed data analytics teams wanted to be able to import data into their new digital platforms and test ideas within days, releasing new products in weeks or, at the very most, months. They weren't interested in waiting months or even years to get new data feeds into the data warehouse. They weren't willing to invest the kind of money that IT wanted for it either, and, in any case, they often weren't using the typical SQL skill sets that they'd need in order to access a traditional data warehouse. They had new languages like Python and R that read data in from files, not databases.
Michael R. Gilbert:At the same time, a couple of new technological advancements had arrived: 1) the creation of a massively distributed filing system that could spread data across thousands of cheap, commodity-level machines, each breaking its part of the data into thousands of smaller files, and 2) a technique called MapReduce, which allowed you to run queries and calculations across these thousands of small files to come up with a consolidated answer with blistering speed, using very cheap cloud-based resources. To understand MapReduce and why it's game-changing, imagine that you want to know how many type A widgets you sold last year, and that last year's sales are written out as a list showing everything sold, with each individual sale appearing on its own line in the order it was sold. It'd probably be a very long list, right? Well, you could answer the question by starting at the top of the list, running down it and keeping a tally of all the type A widgets you find, and when you get to the bottom, you'd have your answer. Okay, now imagine every day's sales are on a different list, 365 of them. Now we could give each list to a different person, this would be mapping the problem, and each person could keep a tally of the type A widget sales from their list. When they're all done, they could give that tally back to you, and all you need to do is add up 365 numbers and you have your answer. That would be the reduce step.
Michael R. Gilbert:You can see that we can extend this idea. Imagine that every one of your 20,000 stores has its own daily sales list. Now, instead of giving our 365 people a single list, we give them 20,000 lists each, one for each store's sales on that day, our map stage again. They in turn give one of their lists to each of 20,000 people working for them, another map stage. Each of their people tallies the sales of type A widgets on their list and hands back their answer to our 365 workers, who sum up those 20,000 per-store answers to get a daily total across all stores, a reduce stage. Finally, they hand back the 365 daily answers to us and we sum them up to get the answer for all stores across the whole year, another reduce stage. See how we broke down a huge problem into tiny steps that we can spread across cheap, low-skilled workers? That's the idea behind MapReduce. It's very, very powerful, and it allows us to run queries across absolutely immense amounts of data incredibly quickly by leveraging the burst compute capabilities that the cloud can offer at very low cost. We might be using a lot of hardware, but only for a few seconds, or even parts of seconds, at a time.
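To make the widget example concrete in code, here's a toy sketch of the map and reduce steps. The sales data is invented, each "list" stands for one store-day of sales, and in a real system each map step would run on a different machine rather than in a simple loop.

```python
# A toy sketch of MapReduce for the type A widget tally. Data is invented:
# each inner list is one store-day of sales as (product, quantity) pairs.
from functools import reduce

daily_lists = [
    [("widget_A", 3), ("widget_B", 1), ("widget_A", 2)],   # store 1, day 1
    [("widget_B", 5), ("widget_A", 4)],                    # store 2, day 1
    [("widget_A", 1)],                                     # store 1, day 2
]

def tally_type_a(sales_list):
    """Map step: one worker tallies the type A widgets on a single list."""
    return sum(qty for product, qty in sales_list if product == "widget_A")

# Each list could be handed to a different worker (or machine) in parallel.
partial_tallies = [tally_type_a(lst) for lst in daily_lists]

# Reduce step: add up the partial answers to get the overall total.
total_type_a = reduce(lambda a, b: a + b, partial_tallies, 0)
print(total_type_a)   # 10
```

The power comes from the fact that the map steps are independent of each other, so you can throw as many cheap machines at them as you like and the wall-clock time barely grows with the size of the data.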
Michael R. Gilbert:Let's throw in one last idea, that of columnar data stores. You see, a traditional database would store these lists exactly as you would imagine, by writing out a whole line of data for each individual sale. It starts with the store identifier, then the date and time, then the customer, then the quantity, then the product and the unit price, extended price, discount, tax, blah, blah, blah, until we've captured everything we need to know about that particular sale. Then we move on to the next line for the next sale and write that down. It's logical and it's exactly how we think. It's not particularly useful for the way we want to analyze the data, though. Take our previous MapReduce example. We need to know the store and the date to split the lists out across our different workers, but remember, this data is recorded by store and by day, so that's probably information we already have by virtue of which list we're reading anyway. The only other two pieces of information we need are which product this is, so we can ignore anything that isn't a type A widget, and how many we sold, so that we can update the tally appropriately. We don't care about the price, the tax, the customer or anything else. So what if we still write these lists out, but we write out each entry for the first column for the whole day, then each entry for the second column, the third column and so on? Now I only need to give my workers the two columns they want, and I can completely ignore the data that I don't need, and maybe don't even understand anyway.
Michael R. Gilbert:Moving data takes time. Not moving data is really, really quick. So you can see how storing data in this format is incredibly useful. Of course, you couldn't do this in an operational data store. The whole point is that you have to know that you've written all of the data elements for column 1 before you can start writing column 2, but if you're still selling products for that day at that store, you don't know that this is the last one. In fact, at the start of the day, you're seriously hoping it isn't. But once you're done with the day and there can't be any more sales, you certainly could store it this way.
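Here's a rough illustration of the difference between the two layouts, using invented records for a single store-day. It's only a sketch of the idea; real columnar formats such as Parquet add compression, encoding and metadata on top.

```python
# Row-oriented: one complete record per sale, exactly how we think about it.
row_oriented = [
    {"product": "widget_A", "qty": 2, "customer": "c1", "unit_price": 9.99, "tax": 0.80},
    {"product": "widget_B", "qty": 1, "customer": "c2", "unit_price": 4.50, "tax": 0.36},
    {"product": "widget_A", "qty": 5, "customer": "c3", "unit_price": 9.99, "tax": 2.00},
]

# Column-oriented: once the day is closed, the same data written column by column.
column_oriented = {
    "product":    ["widget_A", "widget_B", "widget_A"],
    "qty":        [2, 1, 5],
    "customer":   ["c1", "c2", "c3"],
    "unit_price": [9.99, 4.50, 9.99],
    "tax":        [0.80, 0.36, 2.00],
}

# To answer "how many type A widgets?", a worker only needs two columns and can
# skip the rest entirely, which means moving far less data.
products, quantities = column_oriented["product"], column_oriented["qty"]
total = sum(q for p, q in zip(products, quantities) if p == "widget_A")
print(total)   # 7
```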
Michael R. Gilbert:Enter the data lake. Again, the idea is simple. Forget about boiling the ocean, painting the Forth Bridge or whatever metaphor you like here. Forget about ETLs and standard schemas. Any time we want to do some research, we'll figure out the data we actually need. If it's already in the data lake, great. Otherwise, we'll just extract it from wherever it is and load it into the data lake, in whatever structure it already has, just as a set of plain old flat files. If we're loading data every night, we'll just load the new data up as a new file. No need to insert it into the old file, we'll just add a new one.
Michael R. Gilbert:We don't do any transformations at this stage. We don't design any standard schemas or build out complex key mappings. If we have to relate data from two different data sources with different schemas and keys, we'll add that mapping to our query logic at the time. We'll cross that bridge when we come to it, so to speak. We also don't need to build indexes to accelerate the queries, as the queries we write just scan through the whole data set from one end to the other, and they're so fast anyway that there's very little point in trying to build indexes. We don't even bother understanding the whole file necessarily, just the data elements we need for the analysis we want to do at this time. Remember, we're only going to send the columns we care about to our workers, so who cares what the data in the other columns means? If we ever do need to update the information, let's say we get new sales information from two weeks ago for a store whose systems were offline at the time that we got the feeds for that day, we don't try to insert the new data. We just find the files for that store-day combination, delete them and write new files with the corrected data back in their place.
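As a bare-bones sketch of that file-level behaviour, here's what landing a daily feed and replacing a store-day might look like. Local folders stand in for cloud object storage, and the paths, partitioning scheme and filenames are all invented for the example.

```python
# A minimal sketch of plain data lake file handling. Paths are hypothetical.
from pathlib import Path

LAKE_ROOT = Path("data_lake/sales")

def land_daily_feed(store: str, day: str, csv_text: str) -> Path:
    """Add a new day's raw feed as a new file; nothing existing is touched."""
    partition = LAKE_ROOT / f"store={store}" / f"day={day}"
    partition.mkdir(parents=True, exist_ok=True)
    path = partition / "sales.csv"
    path.write_text(csv_text)
    return path

def replace_store_day(store: str, day: str, corrected_csv_text: str) -> Path:
    """'Update' in a plain data lake: delete the old files and write new ones."""
    partition = LAKE_ROOT / f"store={store}" / f"day={day}"
    for old_file in partition.glob("*.csv"):
        old_file.unlink()          # the old data is simply gone after this
    return land_daily_feed(store, day, corrected_csv_text)

land_daily_feed("0042", "2024-03-01", "product,qty\nwidget_A,3\n")
replace_store_day("0042", "2024-03-01", "product,qty\nwidget_A,4\n")
```

Note that nothing here records who made the change or keeps the old file around, which is exactly the weakness we'll come back to in a moment.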
Michael R. Gilbert:If all this sounds a bit like the Wild West brought to the world of data, well it is, and that leads to a few issues we'll talk about in a bit. But first let's look at the problems it solves. The cost to the data producers is pretty small. Just let us have access to your daily data output. No need to transform it. We'll take it in whatever format it's already in. No need to explain the data in detail. We might ask you about a few columns of it that matter to us and we'll take it from there.
Michael R. Gilbert:The costs of putting the files into the data lake and constructing the queries all fall on the consumers of that data, and they're only doing the work on the things that they care about. There is some overhead, of course. Someone's got to set up the data lake in the first place, and there is some cost to storing the data and running the queries, but it's very small in comparison to the cost of a data warehouse. As for the speed of getting it done, adding new data has no impact on any existing data, so there's no need for planning or heavy governance. New analysis can be enabled in days or even hours, rather than the months or even years that it used to take. The downsides are simple. Because there is no standardized structure and no mapping into standard key sets, every query writer has to create their own mapping every time. Because there's no formal governance enforced, it's possible that we will be duplicating efforts and potentially duplicating storage. Remember that storage is cheap, but it's not free.
Michael R. Gilbert:There are other things that we've come to rely on in the data warehouse world too. In a data warehouse, if we update something, or even delete something we didn't intend to, one call to IT and suddenly the change is undone, like magic; the underlying database technology allows for this. Furthermore, if changes keep getting made to the data that are incorrect, every change can be tracked right down to who made it and when, so we can trace the problem back to its root and fix it. In the data lake world, all we have is a series of files. In order to change any data, we delete the old files and write new ones. Make a mistake and the old data is just plain gone. You have to go back to the source and get another copy, if it still exists, because the old files are simply deleted. There's no tracing what happened to them or who did what, and when; it's just gone.
Michael R. Gilbert:The obvious question is, can we get a hybrid of these two worlds? Can we have our cake and eat it, so to speak? Well, if you combine a data warehouse and a data lake, what would you get? A data lake house, perhaps, and no, I'm not making that up. That is indeed what they call it, and no, I can't explain tech humor. I can explain what it is, though. Go back to the analogy that I gave you earlier about lists of sales by store and by day. Imagine that each of these lists is on a piece of paper. Well, if you put all these lists into an envelope, you could still use the lists just like you did before, but you could write extra information, metadata as we would call it, on the envelope, and this could cure a few problems for you.
Michael R. Gilbert:First, let's tackle the schema problem. As we work with these lists, we figure out what each of the columns means and how it relates to columns in other lists. We could write down these definitions on the envelope. Then the next person to work with this data can start from there. They don't have to rediscover what we've already done, and they can add more details as they find them in their own work. Slowly, the data lake will start to become better and better defined, much like a data warehouse.
Michael R. Gilbert:Next, if we update the data, instead of deleting the old pages that are wrong, we could mark the bad pages on the envelope and just put the new pages in as well. We can note who we are, when we're making the change, and why, just like we would in a data warehouse. Now, if something went wrong, we could look at this change log written on the envelope and, if needed, we could reinstate the old pages to undo the change. After all, they're still there inside the envelope. We just need to change the markings on the outside of the envelope to say that they are good, and to mark the new pages we had replaced them with as bad instead. Now we have the same type of auditability and reliability that we had in the old data warehouse world.
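To show the envelope idea in code, here's a toy metadata log kept alongside the data files. It only illustrates the principle; the file names, log format and functions are all invented, and real lake house table formats do this far more robustly under the hood.

```python
# A toy "envelope": a metadata log next to the data files that records which
# files are currently valid, plus who changed what, when and why.
import json
from datetime import datetime, timezone
from pathlib import Path

LOG_PATH = Path("data_lake/sales/_metadata_log.json")

def _load_log() -> list:
    return json.loads(LOG_PATH.read_text()) if LOG_PATH.exists() else []

def _save_log(entries: list) -> None:
    LOG_PATH.parent.mkdir(parents=True, exist_ok=True)
    LOG_PATH.write_text(json.dumps(entries, indent=2))

def record_change(added: list, removed: list, who: str, why: str) -> None:
    """Append a change entry instead of physically deleting anything."""
    entries = _load_log()
    entries.append({
        "when": datetime.now(timezone.utc).isoformat(),
        "who": who,
        "why": why,
        "added": added,      # files that are now valid
        "removed": removed,  # files marked bad, but still on disk for undo
    })
    _save_log(entries)

def current_files() -> set:
    """Replay the log to work out which files a query should read today."""
    valid: set = set()
    for entry in _load_log():
        valid |= set(entry["added"])
        valid -= set(entry["removed"])
    return valid

record_change(["store=0042/day=2024-03-01/sales_v2.csv"],
              ["store=0042/day=2024-03-01/sales.csv"],
              who="analytics_team", why="late feed correction")
print(current_files())
```

Undoing a change is just another log entry that swaps the added and removed lists, because the old pages never actually left the envelope.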
Michael R. Gilbert:Finally, we can add some of the governance back to our Wild West by dividing the data lake into two parts. One part we might call uncertified, for example. This works just like we've described so far. Anyone is free to do with it what they like, but any data you use from it, you use at your own risk. If it's valuable to you, great, but you're responsible for checking its accuracy. As we find data that is particularly useful and that many people want to use on an ongoing basis, we refer it to the same type of governance team that we had for our data warehouse: people with high skill levels and good enterprise knowledge who can validate the definitions that have evolved through our explorations and give them, well, the official blessing. The data, and indeed queries, that are so blessed can be moved into the certified part of the data lake, and all data sources in the certified region can now only be managed and controlled through IT's normal processes. That way, everyone in the company can access any data from this certified region and know that they're good to go, that it can be trusted. In a way, we're building out the old data warehouse idea, but this time using much cheaper technology and not investing money in any part of it until it's already proven its worth. So, to wrap up, should you be investing in a data warehouse or a data lake? Neither. You should be investing in a data lake house, something with the speed and flexibility of a data lake, but with much of the strength and governance, where it's needed, of a data warehouse.
Michael R. Gilbert:Now, just before I leave the topic of data stores, there's a new shiny thing that people are getting excited about. The data mesh. I'm not going to cover that here. It deserves a podcast of its own. It's a big subject. But if you're wondering if you should be investing in that instead, I'd say probably not. There are going to be specific use cases where it'll make a lot of sense, but not for 99% of today's enterprises.
Michael R. Gilbert:The problem isn't one of technology but of behaviour, and that stems from the warning I started with. If the benefits of the data accrue predominantly to the consumers of the data, but the costs accrue instead disproportionately to its producers, then failing to account for this imbalance will derail any and all data projects. The data mesh falls foul of this, I'm afraid, and until we find a way to correct that imbalance, I won't be recommending it to my clients. As I said, for those that are interested, I will put out a podcast on this topic in the future, but for now, I hope you have a great time hanging out in your new data lake house, and don't forget to bring a good book to read. Yeah, sorry, couldn't resist that. Tech humour again. Thanks for listening. I hope you enjoyed it and I hope you now know a little bit more about the various options for data storage in today's enterprise environments. As always, the transcript of this podcast can be found on the website at https://www.thetechnologysoundingboard.com. If you get a chance, stop by and leave us a review or a comment. Until next time.