Data Lake Platforms
Updated: Aug 25, 2020
Gain a competitive advantage by organizing and structuring data into data lakes, enabling digital dashboards and advanced analytical tools to uncover patterns and insights.
What is a data lake?
A data lake is a scalable, central data repository that can store all of your organization’s structured, semi-structured, and unstructured data as-is so that it can be analyzed later.
The adoption of data lakes is becoming more relevant as businesses are increasingly focused on gaining a competitive advantage with their data by incorporating advanced analytics into their daily decision-making as well as long-term strategies. As the data lake has evolved over the last several years, so too has its underlying architecture.
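The "store as-is, analyze later" idea in the definition above can be illustrated with a minimal Python sketch. It is not tied to any particular platform: the in-memory dictionary `raw_lake`, the file paths, and the sample records are all hypothetical stand-ins for real object storage. The point it shows is that raw CSV and JSON payloads are kept unchanged, and a schema is applied only when the data is read for analysis.

```python
import csv
import io
import json

# Hypothetical "lake": raw payloads kept exactly as they arrived, keyed by path.
raw_lake = {
    "sales/orders.csv": "order_id,amount\n1,19.99\n2,5.00\n",
    "web/events.jsonl": (
        '{"order_id": 1, "page": "/checkout"}\n'
        '{"order_id": 2, "page": "/cart"}\n'
    ),
}


def read_orders(lake):
    """Apply a schema (int id, float amount) to the raw CSV at read time."""
    reader = csv.DictReader(io.StringIO(lake["sales/orders.csv"]))
    return [
        {"order_id": int(row["order_id"]), "amount": float(row["amount"])}
        for row in reader
    ]


def read_events(lake):
    """Parse the semi-structured JSON lines at read time."""
    lines = lake["web/events.jsonl"].splitlines()
    return [json.loads(line) for line in lines if line.strip()]


def revenue_by_page(lake):
    """Join the two raw sources and total the order amounts per page."""
    amounts = {o["order_id"]: o["amount"] for o in read_orders(lake)}
    totals = {}
    for event in read_events(lake):
        page = event["page"]
        totals[page] = totals.get(page, 0.0) + amounts.get(event["order_id"], 0.0)
    return totals
```

Calling `revenue_by_page(raw_lake)` combines both sources without either one having been reshaped on the way in, which is the property that distinguishes a lake from a repository that enforces a schema at load time.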
Evaluating Critical Success Factors
Based on our experience designing, building and deploying data lakes, these are the top critical success factors to consider before selecting a data lake platform:
Data lake environment (on-premise, cloud, and hybrid solutions) – Cost, security, scalability, ease of deployment, and maintenance are all factors to consider when deciding where to deploy your data lake. Data lakes can be deployed on-premise, in the cloud, or as a hybrid of the two. On-premise solutions can be the better choice if you need the data locked down, since you have total control over the environment. Cloud solutions, however, are typically the better choice if your top priorities are cost, scalability, and ease of deployment and maintenance. Some organizations take a hybrid approach, with some data residing on-premise while other data is stored in the cloud. According to recent market research (e.g., Gartner, Research and Markets, and Market Research Future), cloud-based solutions are becoming the preferred platform for modern data and analytics applications, and their adoption is expected to increase significantly in the near future; however, the right answer depends on what matters most to your business.
Data sources – Whether your data sources are structured, semi-structured, or unstructured, real-time or batch, ensure the platform can ingest and query your data simply and efficiently.
Data governance capabilities – Don’t turn your data lake into a data swamp. Verify that the platform can catalog, index, and secure your data for quality and reliability, and ensure you have data governance policies in place.
Integration of analytical tools – Because analytics is the end goal of a data lake, make sure that the data lake can integrate with cloud-native or third-party tools for data processing, visualization, and advanced analytics. Each cloud provider lists the analytic tools that work with its platform.
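To make the data governance factor above more concrete, here is a minimal Python sketch of what "catalog, index, and secure" can mean in practice. It is an illustration only, not any vendor's catalog API: the `CatalogEntry` and `Catalog` classes, their fields, and the role names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical metadata record describing one dataset in the lake."""
    path: str
    data_format: str
    owner: str
    tags: set = field(default_factory=set)
    allowed_roles: set = field(default_factory=set)


class Catalog:
    """Toy catalog: registers datasets, indexes them by tag, checks access."""

    def __init__(self):
        self._entries = {}    # path -> CatalogEntry (the catalog)
        self._tag_index = {}  # tag -> set of paths (the index)

    def register(self, entry):
        self._entries[entry.path] = entry
        for tag in entry.tags:
            self._tag_index.setdefault(tag, set()).add(entry.path)

    def find_by_tag(self, tag):
        """Return the paths of all datasets carrying the given tag, sorted."""
        return sorted(self._tag_index.get(tag, set()))

    def can_read(self, path, role):
        """Deny by default: unknown paths and unlisted roles get no access."""
        entry = self._entries.get(path)
        return entry is not None and role in entry.allowed_roles
```

A dataset that is never registered cannot be found by tag and cannot be read by any role in this sketch, which mirrors the idea that uncataloged data is what turns a lake into a swamp.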
Data Lake Architectures
While there are many data lake platforms, the following provides a high-level summary of the industry leaders:
The Original – Hadoop
The first data lakes were implemented on-premise using Hadoop, which is an open-source platform that stores and analyzes massive amounts of structured and unstructured data using a distributed file system.
The main components of Hadoop are the Hadoop Distributed File System (HDFS), MapReduce, and Yet Another Resource Negotiator (YARN). HDFS stores and replicates large data sets across multiple machines; MapReduce performs distributed, parallel processing on the data; and YARN manages resource allocation and job scheduling.
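The MapReduce programming model mentioned above can be sketched in a few lines of plain Python using the classic word-count example. This is a single-process illustration of the map, shuffle, and reduce steps, not Hadoop's actual API: the function names are hypothetical, and the distribution across machines that HDFS and YARN provide is not modeled.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(splits):
    """Run map over every split, shuffle the pairs, then reduce per key."""
    mapped = [pair for split in splits for pair in map_phase(split)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
```

In a real cluster, each split would be mapped on the node that holds that block of data and the reduce calls would run in parallel; here every step runs sequentially so the data flow is easy to follow.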
Hadoop has a mature ecosystem that offers additional services such as in-memory data processing (Spark), building data pipelines (Pig, Hive), and creating scalable machine learning applications (Mahout).
Hadoop is available both on-premise and in the cloud.
The Big 3 Cloud Providers – Amazon, Azure and Google
Whereas Hadoop can be on-premise or in the cloud, the data lake solutions from Amazon, Azure and Google, with which Scalesology has considerable experience, are only offered in the cloud.
Amazon – Amazon offers an automated solution that can be deployed quickly and is both highly available and cost effective. You can also build a custom data lake using a combination of its services, with Simple Storage Service (S3) as the central data storage layer. An Amazon differentiator is its extensive and solid suite of products for building custom data lake solutions that will meet most needs, such as Amazon Athena for quickly querying data in S3, Amazon Kinesis products for ingesting and analyzing streaming data, and Amazon SageMaker for developing and implementing machine learning applications.
Azure – Azure offers three solutions to build a data lake:
1. Azure Data Lake Storage (ADLS) is a hyperscale data storage repository built upon the HDFS standard.
2. Data Lake Analytics performs on-demand distributed analytics utilizing Microsoft’s query language U-SQL and existing .NET, R, or Python libraries.
3. HDInsight runs open-source analytics frameworks like Apache Hadoop, Spark and Hive.
An Azure differentiator is its hyperscale data storage solution, which Microsoft says can hold individual files 200x larger than other cloud stores allow.
Google – Google offers a data lake solution by integrating several of its product offerings, the main component being Google Cloud Storage (GCS) for data storage. A Google differentiator is the strength of its machine learning and AI solutions, such as BigQuery, which is a multi-cloud data warehouse that has built-in machine learning capabilities, and AI Platform, which is an environment for quick development and deployment of AI projects.
The Disruptor – Snowflake
Aside from the Big 3, there is one data lake solution that is challenging these cloud provider giants.
Snowflake touts its solution as a cloud data platform with an architecture that combines the performance of a data warehouse with the flexibility of a data lake repository in a single environment. As a data lake, Snowflake can ingest structured and semi-structured data (JSON, CSV, tables, Parquet, etc.). As a multi-cloud platform, Snowflake can also seamlessly consume data from Amazon S3, Azure Data Lake Storage or Google Cloud Storage data lakes.
A Snowflake differentiator is its ease of use and its separation of storage and computing power, which allows data storage and compute resources to scale seamlessly and independently of each other for rapid results.
With the volume and complexity of data rapidly growing in our digitized world, data lakes are emerging as a robust solution for storing and analyzing big data. If your organization has decided to implement a data lake but is struggling to get started, our team at Scalesology can help guide you through the process. Give us a call and we can discuss the best fit for your organization.