Choosing a data lake or data warehouse for analyzing data
As a CEO of a data analytics firm, I’m often asked what the best strategy is for storing large amounts of data that will be analyzed at a later date. The purpose of this article is to define and explain the benefits of using a data lake and a data warehouse.
What is a data lake?
A system designed as a central repository for all types of data, both structured and unstructured.
What is a data warehouse?
A system designed to store structured data used for reporting and data analytics.
While both a data lake and a data warehouse store data and enable in-depth analytics, customers should evaluate four components when deciding whether to implement a data lake or a data warehouse.
Type of Data
Depending on who you talk to, there are numerous types of data: big data, machine data, real-time data, and so on. For the purposes of this article, I am going to organize data into two categories, structured and unstructured. Structured data fits into a tabular format with relationships between the rows and the columns.
Unstructured data is everything else, such as videos, pictures, and emails. Data warehouses are a better fit for structured data, whereas data lakes retain both structured and unstructured data.
Business objectives
When assessing a client’s data needs, I find that clients typically fall into two groups. In Group A, the client has data from multiple disparate sources and wants to consolidate it all in one location. They might have an idea of what they want to do with the data, but most importantly they want it to be flexible and scalable for whatever they may want to analyze in the future. In Group B, the client knows exactly what they want to analyze, and they need that analysis to be very efficient so they can make specific business decisions. These clients tend to be willing to spend the time cleaning the data and putting it into a uniform schema to enable faster queries.
The client in Group A would opt for creating a data lake. Data lakes are cost-effective for storing sizable amounts of data from many sources. They are flexible and do not require the data to be cleaned or segmented into a schema before it is stored. The client in Group B would opt to build a data warehouse. A data warehouse is much more efficient for analyzing historical data for specific decisions.
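The difference between storing data as-is and storing it in a uniform schema can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the `sales` table, and its columns are hypothetical, an in-memory SQLite database stands in for a warehouse, and a plain list of JSON strings stands in for a lake.

```python
import json
import sqlite3

# Hypothetical incoming records from different sources; their shapes differ on purpose.
raw_records = [
    {"source": "crm", "customer_id": 1, "amount": "19.99"},
    {"source": "support_email", "body": "My order arrived late."},
    {"source": "web", "customer_id": 2, "amount": 5, "referrer": "ad"},
]

# Lake style: keep every record exactly as it arrived, with no schema required.
lake = [json.dumps(record) for record in raw_records]

# Warehouse style: a fixed schema is enforced when the data is written.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales (customer_id INTEGER NOT NULL, amount REAL NOT NULL)"
)

rejected = 0
for record in raw_records:
    try:
        warehouse.execute(
            "INSERT INTO sales (customer_id, amount) VALUES (?, ?)",
            (record.get("customer_id"), record.get("amount")),
        )
    except sqlite3.IntegrityError:
        rejected += 1  # the record does not fit the schema (missing required fields)

# The warehouse answers the question it was designed for with one simple query.
total = warehouse.execute("SELECT SUM(amount) FROM sales").fetchone()[0]

# The lake still holds everything; structure is applied only when it is read.
emails = [
    json.loads(line)["body"]
    for line in lake
    if json.loads(line)["source"] == "support_email"
]
```

The email never makes it into the `sales` table, but it is still available in the lake for a future question nobody has asked yet. That is the trade-off between the two groups in miniature.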
Data users
Typically, business analysts prefer working within a data warehouse structure containing pertinent information that has been cleaned and processed for the analysis and reports they are creating. Data scientists, on the other hand, want to explore what is in the data. Data lakes contain a wide array of data, allowing data scientists to aggregate and combine data in ways that a set schema did not anticipate. Data engineers also tend to prefer data lakes, as data lakes are great for storing incoming structured and unstructured data. The data engineer can then take the data they need and create a data pipeline to feed such items as a data warehouse, an analytics data mart, or a payment processing system.
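A data pipeline of the kind a data engineer might build from a lake into a warehouse can be sketched as an extract, transform, and load loop. This is a minimal sketch, not a production design: the function name `run_pipeline`, the `orders` table, and the sample records are all assumptions made for illustration, and SQLite again stands in for the warehouse.

```python
import json
import sqlite3


def run_pipeline(lake_lines, conn):
    """Read raw JSON lines, keep the records that can be cleaned, and load them."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, total REAL NOT NULL)"
    )
    loaded = 0
    for line in lake_lines:
        record = json.loads(line)  # extract
        try:
            # transform: coerce the fields into the types the schema expects
            row = (int(record["order_id"]), float(record["total"]))
        except (KeyError, TypeError, ValueError):
            continue  # unusable for this table; the record simply stays in the lake
        conn.execute("INSERT OR REPLACE INTO orders VALUES (?, ?)", row)  # load
        loaded += 1
    conn.commit()
    return loaded


# Hypothetical raw lake contents: mixed shapes and inconsistent types.
lake_lines = [
    '{"order_id": "101", "total": "42.50"}',
    '{"order_id": 102, "total": 10}',
    '{"type": "email", "body": "Where is my order?"}',
    '{"order_id": 103, "total": "n/a"}',
]

conn = sqlite3.connect(":memory:")
loaded = run_pipeline(lake_lines, conn)
rows = conn.execute("SELECT order_id, total FROM orders ORDER BY order_id").fetchall()
```

Only the two records that can be cleaned into the `orders` schema are loaded. The analyst then queries a tidy table, while the data scientist can still go back to the raw lines in the lake.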
Size of the data
Data lakes were designed as an efficient way to store massive amounts of data and are cheaper than a data warehouse in a couple of ways. First, the actual cost of storing data in a data lake is lower. Second, a data lake is built to collect all types of data, so no time is spent on the sorting and segmenting that a data warehouse requires.
Comparison chart of a data lake versus a data warehouse:
Remember, when deciding whether to implement a data lake or a data warehouse, evaluate the type of data, the data users, the size of the data, and the business objectives. At times it might make sense to use both, depending on the business objective you are trying to achieve. So, if you’re in the evaluation process, the Scalesology team invites you to reach out for a conversation. We can listen and recommend specific platforms and services depending on your needs and goals. We look forward to having a conversation with you.