Data Lake as an Efficient Response to Complexity
Do you have a complex data storage setup? Are you overwhelmed by the number of ETL components your company supports? Is your company feature-focused at the moment, with minimal time and resources to develop a platform? If so, you will find this piece very useful. We will show why a data lake is a natural answer to system complexity and an excellent fit for quickly growing companies.
Why is a structured data lake a good response to complexity?
A traditional way to integrate multiple, diverse data sources is to build a data warehouse. A data warehouse is a great thing to have: it serves as a single source of truth and, if it is company-wide, it integrates multiple business domains into a single, clean, normalized data store. Having that makes it very easy to build multiple data marts, each providing very specific business insights.
If you already have a data warehouse and are doing a great job of keeping it up to date with all the changes in your business, my sincere congratulations. You have an ideal system.
But unfortunately, the traditional data warehouse scenario is not all that widespread among companies. The main reasons are the following:
- A data warehouse works well only when it matches the business domain logic and the relationships inside it. As such, it requires either a lot of upfront design or a lot of continuous work to keep it up to date with quickly changing business demands. At first this means you can't bring your product to market quickly; later it means you need a dedicated resource to support the data warehouse instead of building features valuable for the business.
- Traditional data warehouse software like MS SQL Server or Oracle is quite expensive, both in terms of licenses and in terms of the hardware resources required to run it.
Let's consider the following real-world example. The Company has a great business idea and decides to build ProductA in-house. Everything goes according to plan; customers are happy with the new product and request more features. ProductA grows:
- the Company builds new components to support new features
- some back-end processes are introduced with their own metadata storage
- eventually, reporting is introduced
As soon as the Company reaches a good spot in the market, it looks for expansion opportunities and acquires ProductB to cover a complementary business domain. And of course, ProductB has its own architecture, including its own data storage and reporting. So now the Company runs two largely independent technical stacks.
The Company does a great job refactoring common components and makes its way towards a single platform. For example, separate operational storage is created for the intersecting part of the domain, and common reporting storage is introduced to take the reporting load off the operational stores. But most likely ProductA and ProductB still keep the major components of their original architecture.
From an engineering point of view, this situation means that there are:
- Multiple independent apps/modules (data stores)
- Heterogeneous storage types/vendors (MS SQL, MySQL, NoSQL, etc.)
- Different historical data retention policies (ProductA might keep the last 6 months of data, while ProductB keeps a full history due to SLA obligations)
- A separate team owning each module
In the meantime, from a business point of view:
- ProductA and ProductB cover essential parts of the business domain
- ProductA and ProductB have the same customer base (at least partially)
- ProductA and ProductB have all the necessary data to provide more sophisticated and valuable insights
This setup is neither good nor bad; it's just a natural result of a quickly growing business. Now let's look at the requests coming to the Engineering team supporting this architecture: cross-product reports, demos, and other on-demand asks that need data from both products.
And the natural engineering answer to these requests is more ETL jobs, which results in the following downsides:
- A long feedback loop for customer-facing development (demos, custom reports, on-demand requests)
- A lot of engineering resources consumed
- Code-base growth
- Unnecessary load on production data stores
- High cost
If at least part of this looks familiar to you, then a data lake is what you need.
Why is a data lake-based approach so efficient?
Considering the scenario described above and the pros and cons of the solutions a company usually comes up with, here is the set of qualities an alternative approach should have:
- Contain all the data
- SQL-like interface
- Fast enough for on-demand analysis
- Cheap
So wait, what is a data lake, essentially? It's scalable BLOB storage combined with a query engine that usually provides a SQL-like interface. Examples of such a setup include HDFS + Hive, HDFS + Spark SQL, Amazon S3 + Amazon Athena, and so on.
This type of setup allows you to dump your operational data stores and any other data you have to a single location and then run queries across different data sets using the query engine.
All your data is available on-demand, with a SQL interface to access it. The latter is really important because at least basic SQL knowledge is very widespread among engineers. This means operational teams and data analysts will be able to use this setup by themselves, keeping Engineering team resources free for more technically challenging features.
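For instance, once data from both products has been dumped into the lake, a cross-product question becomes a single query. Below is a minimal Spark SQL sketch of such an on-demand query; the paths, view names, and columns are hypothetical placeholders.

```python
# A minimal sketch of an on-demand cross-product query with Spark SQL.
# Bucket paths, view names, and columns are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cross-product-insight").getOrCreate()

# Register the dumped data sets as temporary views.
spark.read.parquet("s3://example-data-lake/raw/product_a/customers/") \
    .createOrReplaceTempView("a_customers")
spark.read.parquet("s3://example-data-lake/raw/product_b/subscriptions/") \
    .createOrReplaceTempView("b_subscriptions")

# One SQL statement now spans data that used to live in two unrelated systems.
shared = spark.sql("""
    SELECT a.customer_id, a.signup_date, b.plan, b.renewal_date
    FROM a_customers a
    JOIN b_subscriptions b ON a.customer_id = b.customer_id
""")
shared.show()
```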
Let’s check out how to quickly and efficiently implement a data lake.
First, the data lake has to be filled with data. What exactly serves as a source depends on your case, but in general you would use either the primary instance of a data store or some sort of copy of it:
- A backup restored on another server. This option is very widespread, as backups are taken regularly anyway, so it is just a matter of creating an automated process to restore them on another server (see the sketch after this list).
- A read-only replica or shadow instance for an always-on setup. This option is preferable when you need a high frequency of updates in reporting, but your solution is not ready to adopt a streaming approach.
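As an illustration of the first option, here is a minimal sketch of automating a backup restore, assuming the source database runs on Amazon RDS; all identifiers below are hypothetical.

```python
# A minimal sketch of automating the "restore a backup on another server" option,
# assuming the source database runs on Amazon RDS. All identifiers are hypothetical.
import boto3

rds = boto3.client("rds")

# Spin up a throwaway instance from the latest nightly snapshot; the extraction
# job reads from it and the instance can be deleted afterwards.
rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier="product-a-lake-source",
    DBSnapshotIdentifier="product-a-nightly-snapshot",
    DBInstanceClass="db.m5.large",
)
```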
The next step is to build a job that extracts data from the data sources and stores it in the BLOB storage. As an example of such a job, we built a configurable, generic Spark job running in an Amazon EMR cluster which extracts a database table by table and stores it in S3. To make this data discoverable, we added functionality to register the extracted data in the AWS Glue Data Catalog.
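We don't reproduce our exact job here, but a minimal PySpark sketch of the same idea could look like the following, assuming the appropriate JDBC driver is available on the cluster; the connection settings, table names, and S3 locations are hypothetical.

```python
# A minimal sketch of a table-by-table extraction job in PySpark.
# Connection settings, table names, and the S3 location are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-extract").getOrCreate()

JDBC_URL = "jdbc:sqlserver://product-a-replica:1433;databaseName=ProductA"
TABLES = ["customers", "orders", "events"]        # tables to extract
TARGET = "s3://example-data-lake/raw/product_a"   # landing area in the lake

for table in TABLES:
    df = (spark.read.format("jdbc")
          .option("url", JDBC_URL)
          .option("dbtable", table)
          .option("user", "lake_reader")
          .option("password", "***")               # fetch from a secrets store in practice
          .load())
    # A columnar format keeps later Athena / Spark SQL scans cheap and fast.
    df.write.mode("overwrite").parquet(f"{TARGET}/{table}/")
```

The landing locations can then be registered in the Glue Data Catalog, for example by pointing a Glue crawler at the landing prefix, so that Athena and Spark SQL can discover the tables by name.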
At this stage, we have a data lake filled in and ready for processing. With only this stage implemented, your team can either start implementing cross-product reports (if you have specific requirements prepared) or start research to come up with a set of useful insights. In any case, eventually either the analyst team or engineering arrives at a set of stable queries. So why not take the next step and run them automatically on a regular basis, so the insights are refreshed with new data, say, every night?
To achieve this level of automation, the Stage 1 setup is missing at least two components:
- A generic component which can run a set of SQL scripts from pre-defined storage. For example, a Spark job in an EMR cluster which runs Spark SQL scripts from an S3 bucket, or an AWS Batch job which runs scripts from an S3 bucket using Amazon Athena (see the sketch after this list).
- A trigger which starts script execution. For example, a scheduled CloudWatch Events rule to trigger the job on a fixed schedule, or an S3 PutObject event to trigger the job as soon as new files are available for processing.
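Here is a minimal sketch of what the first component could look like with Amazon Athena as the query engine; the bucket names, prefixes, and database name are hypothetical.

```python
# A minimal sketch of the "script runner" component, assuming Amazon Athena as
# the query engine. Bucket names, prefixes, and the database are hypothetical.
import time

import boto3

s3 = boto3.client("s3")
athena = boto3.client("athena")

SCRIPTS_BUCKET = "example-data-lake"
SCRIPTS_PREFIX = "insights/sql/"
RESULTS_LOCATION = "s3://example-data-lake/insights/results/"
DATABASE = "data_lake"

def run_all_scripts():
    pages = s3.get_paginator("list_objects_v2").paginate(
        Bucket=SCRIPTS_BUCKET, Prefix=SCRIPTS_PREFIX)
    for page in pages:
        for obj in page.get("Contents", []):
            if not obj["Key"].endswith(".sql"):
                continue
            sql = s3.get_object(Bucket=SCRIPTS_BUCKET,
                                Key=obj["Key"])["Body"].read().decode("utf-8")
            query_id = athena.start_query_execution(
                QueryString=sql,
                QueryExecutionContext={"Database": DATABASE},
                ResultConfiguration={"OutputLocation": RESULTS_LOCATION},
            )["QueryExecutionId"]
            # Poll until the query finishes so failures show up in the job logs.
            while True:
                state = athena.get_query_execution(
                    QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
                if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
                    print(obj["Key"], state)
                    break
                time.sleep(5)

if __name__ == "__main__":
    run_all_scripts()
```

For the second component, a scheduled CloudWatch Events rule with a cron expression, or an S3 event notification on the landing prefix, can start this job without any manual involvement.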
Additionally, it is a best practice to treat your data lake-based insights as a separate project with its own source repository and CI/CD pipeline. This ensures that every new insight (which is essentially a SQL script) passes all the stages from Dev to Prod and meets the same quality bar as application code.
Summary
Data lake tackles complexity:
- It combines data from multiple heterogeneous sources into a single homogeneous data store, hiding the details and complexity of each product's technical stack.
- The basic data lake setup is simple, so it does not turn into yet another monster to support.
- The data lake's SQL interface is a standard one; no advanced knowledge is required to leverage its power.
Data lake is efficient:
- The setup described above doesn't require a lot of time and resources to implement. In our experience, similar projects have been implemented by analyst teams with the support of a single engineer in two to three months. That covers both stages: the data lake filled in, and the useful insight queries running automatically on a predefined schedule.
- Ongoing support and extension with new insights do not require engineering involvement. A new insight is just a new SQL query.
- There is a very short and efficient feedback loop for on-demand requests like sales demos, research, and incident resolution. The raw data is there and available for queries at any time.
It already looks great, but this isn't the peak yet. In the next articles of the data lake series, we will show how to extend this basic setup with orchestration to enable multiple independent ETL jobs running in parallel. Stay tuned.
If you want to know how a data lake can be a good response to complexity for your business, GreenM's architect team would be happy to schedule a free whiteboarding session and collaboratively think through the right solution with you.