Do you have a complex data storage setup? Are you overwhelmed by the number of ETL components your company supports? Is your company feature-focused at the moment, with minimal time and resources to develop a platform? If so, you will find this piece useful. We will show why a data lake is a natural answer to system complexity and a great fit for quickly growing companies.
A traditional way to integrate multiple, diverse data sources is to build a data warehouse. A data warehouse is a great thing to have, as it serves as a single source of truth and, if it’s companywide, it integrates multiple different business domains into a single, clean, normalized datastore. Having that makes it very easy to build multiple data marts, each providing very specific business insights.
If you already have a data warehouse and are doing a great job of keeping it up to date with all the changes in your business, my sincere congratulations. You have an ideal system.
But unfortunately, the traditional scenario with the data warehouse is not all that widespread among companies. The main reasons for this are the following:
Let’s consider the following real-world example. The Company has a great business idea and decides to build ProductA in-house. Everything goes according to plan; customers are happy with the new product and request more features. ProductA grows:
As soon as the Company is in a good spot in the market, it looks for expansion opportunities and acquires ProductB to cover a complementary business domain. And of course, ProductB has its own architecture, including its own data storage and reporting. So now the technical landscape looks like this:
The Company does a great job refactoring common components and makes its way towards a single platform. For example, separate operational storage is created for the intersecting part of the domain, and common data storage for reporting is introduced to take the reporting load off operational storage. But most likely ProductA and ProductB still keep major components of their architecture.
From an engineering point of view, this situation means that there are:
In the meantime, from a business point of view:
This setup is neither good nor bad. It’s just a natural result for a quickly growing business. Let’s look at requests coming to the Engineering team supporting this architecture.
And the natural answer from Engineering to this need is more ETLs, which results in the following downsides:
If this looks at least partially familiar to you, then a data lake is what you need.
Considering all the scenarios described above and all the pros and cons of the solutions a company usually comes up with, here is a set of qualities that an alternative approach should have:
So wait. What is a data lake, essentially? It’s scalable BLOB storage combined with a query engine that usually provides a SQL-like interface. Examples of such a setup include HDFS + Hive, HDFS + Spark SQL, and Amazon S3 + Amazon Athena.
This type of setup allows you to dump your operational data stores and any other data you have to a single location and then run queries across different data sets using the query engine.
All your data is available on demand with a SQL interface to access it. The latter is really important because SQL knowledge, at least at a basic level, is very widespread among engineers. This means operational teams and data analysts will be able to use this setup by themselves, keeping Engineering team resources for more technically challenging features.
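The idea of landing data from different products in one place and querying across it can be illustrated with a small sketch. It is only a stand-in: an in-memory SQLite database plays the role of the BLOB storage plus query engine, and the table names, columns, and rows are invented for the example rather than taken from a real setup.

```python
import sqlite3

# Assumption: SQLite stands in for "BLOB storage + SQL query engine".
# Table names and sample rows are hypothetical.
def build_demo_lake() -> sqlite3.Connection:
    """Load two 'dumped' operational data sets into a single location."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE product_a_customers (customer_id INTEGER, name TEXT);
        CREATE TABLE product_b_orders (order_id INTEGER, customer_id INTEGER, amount REAL);
        INSERT INTO product_a_customers VALUES (1, 'Acme'), (2, 'Globex');
        INSERT INTO product_b_orders VALUES (10, 1, 120.0), (11, 1, 30.0), (12, 2, 75.5);
        """
    )
    return conn

# One SQL query spans data that originally lived in two separate products.
CROSS_PRODUCT_QUERY = """
    SELECT c.name, COUNT(o.order_id) AS orders, SUM(o.amount) AS revenue
    FROM product_a_customers AS c
    JOIN product_b_orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.name
    ORDER BY c.name
"""

def revenue_by_customer(conn: sqlite3.Connection) -> list:
    """Run the cross-product query and return (name, orders, revenue) rows."""
    return conn.execute(CROSS_PRODUCT_QUERY).fetchall()

if __name__ == "__main__":
    for row in revenue_by_customer(build_demo_lake()):
        print(row)
```

In a real data lake the same kind of query would run through the query engine over files in BLOB storage; the point here is only that anyone with basic SQL skills can write it.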
Let’s check out how to quickly and efficiently implement a data lake.
First, a data lake has to be filled with data. What exactly should be used as a source depends on your case, but in general, you would use either a primary instance of the data store or some sort of copy of it:
The next step is to build a job that extracts data from the data sources and stores it in the BLOB storage. As an example of such a job, we built a configurable, generic Spark job running in an Amazon EMR cluster, which extracts a database table by table and stores it in S3. To make this data discoverable, we added functionality to register the extracted data in the AWS Glue Data Catalog.
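The article does not show the job itself, so the following is only a rough sketch of what "configurable" and "table by table" could look like. The bucket name, path layout, catalog naming, and the `extract` and `register` callables are all assumptions made for illustration; in a real job, `extract` might be a Spark JDBC read followed by a Parquet write, and `register` a call to the data catalog.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class TableExtract:
    source_table: str   # table name in the source database
    target_path: str    # where the extracted files land in BLOB storage
    catalog_table: str  # name under which the data is registered


def build_extract_plan(bucket: str, database: str, tables: Iterable[str],
                       run_date: date) -> List[TableExtract]:
    """Turn job configuration into one extract step per table.

    The s3://bucket/database/table/dt=YYYY-MM-DD/ layout is a hypothetical
    convention, not something prescribed by S3 or by the article.
    """
    plan = []
    for table in tables:
        path = f"s3://{bucket}/{database}/{table}/dt={run_date.isoformat()}/"
        plan.append(TableExtract(table, path, f"{database}_{table}"))
    return plan


def run_plan(plan: List[TableExtract],
             extract: Callable[[str, str], None],
             register: Callable[[str, str], None]) -> int:
    """Extract each table, then register it; returns the number of tables."""
    for item in plan:
        extract(item.source_table, item.target_path)     # copy data out
        register(item.catalog_table, item.target_path)   # make it discoverable
    return len(plan)


if __name__ == "__main__":
    demo = build_extract_plan("example-lake", "product_a",
                              ["users", "orders"], date(2020, 1, 15))
    run_plan(demo,
             extract=lambda table, path: print("extract", table, "->", path),
             register=lambda name, path: print("register", name, "at", path))
```

Keeping the plan separate from the code that touches the database and the storage makes the job easy to configure per source and easy to dry-run.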
At this stage, we have a data lake that is filled and ready for processing. With only this stage implemented, your team can either start implementing cross-product reports (if you have specific requirements prepared) or start research to come up with a set of useful insights. In any case, eventually either the analyst team or Engineering comes up with a set of stable queries. So why not take the next step and start running them automatically on a regular basis, refreshing these insights with new data, say, every night?
To achieve this level of automation, the Stage 1 setup is missing at least two components:
Additionally, it is a best practice to treat your data lake-based insights as a separate project with its own source repository and CI/CD pipeline. This approach ensures that every new insight (which is essentially a SQL script) passes all the stages from Dev to Prod, which helps ensure quality.
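As one possible illustration of what a CI stage for such a project could check, the sketch below runs every insight query against a tiny fixture data set and fails if any query does not execute. SQLite is used purely as a stand-in for the real query engine, and the fixture table and the two insight queries are hypothetical.

```python
import sqlite3
from typing import Dict, List, Tuple

# Assumption: a small fixture that mimics the shape of the lake tables.
FIXTURE_SQL = """
    CREATE TABLE orders (order_id INTEGER, product TEXT, amount REAL);
    INSERT INTO orders VALUES
        (1, 'ProductA', 10.0), (2, 'ProductB', 5.0), (3, 'ProductA', 2.5);
"""

# Assumption: each insight is a named SQL script kept in the repository.
INSIGHTS: Dict[str, str] = {
    "revenue_by_product": (
        "SELECT product, SUM(amount) AS revenue "
        "FROM orders GROUP BY product ORDER BY product"
    ),
    "order_count": "SELECT COUNT(*) AS orders FROM orders",
}


def run_insights(insights: Dict[str, str]) -> Dict[str, List[Tuple]]:
    """Execute every insight against the fixture and collect its rows.

    A query that references a missing table or column raises
    sqlite3.OperationalError, which would fail the pipeline stage.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(FIXTURE_SQL)
    results: Dict[str, List[Tuple]] = {}
    try:
        for name in sorted(insights):
            results[name] = conn.execute(insights[name]).fetchall()
    finally:
        conn.close()
    return results


if __name__ == "__main__":
    for name, rows in run_insights(INSIGHTS).items():
        print(name, rows)
```

The same function could be called from a nightly scheduler once the queries are promoted to Prod, with the fixture replaced by the real lake.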
A data lake tackles complexity:
A data lake is efficient:
It already looks great, but this is not the end state yet. In the next articles of the data lake series, we will show how to extend this basic setup with orchestration to enable multiple independent ETL jobs running in parallel. Stay tuned.
If you want to know how a data lake can be a good response to complexity for your business, GreenM’s team of architects would be happy to schedule a free whiteboarding session to collaboratively think through the right solution with you.
Copyright © 2020 GreenM, Inc. All rights reserved.