Implementing a data lake for your organization

Jake Zatecky
Sep 14, 2020

Updated: Feb 4

Is your organization ready to uncover insights from data stored in multiple systems and files? Does the task at hand feel daunting and unclear? It doesn’t have to be. At Scalesology, we’re experts at designing and implementing data lakes to help organizations analyze and uncover insights from their data.

To data lake or not

If your organization has a lot of data that is unstructured, continuously generated, and numbers in the billions of records, then a data lake may be the right choice. However, if your organization's data originates mostly from traditional databases, a data warehouse may be a better choice.

The difference between a data lake and a data warehouse can be confusing. In short, a data lake is a scalable, centralized repository that stores data as-is, whereas a data warehouse is a database optimized for analyzing relational data.

The two do not have to be mutually exclusive. Depending on your organization's needs, it may make sense to have both a data lake, which serves as a repository for the raw data, and a data warehouse, which serves as a single source of truth for curated data.

Identify and catalog your data sources

The first step on the road to implementing a data lake is to identify your data sources and define their attributes. How you collect your sources into the data lake depends on some key characteristics:

Location – Where does the data originate? Does it come from a database, a sensor device, or social media? Is it available through the open internet or a private network? How your organization accesses data can greatly impact its ingestion.
Data type – Is the data structured or unstructured? Does it conform to a well-defined schema, or is it log data? The structure or lack thereof of a data source will determine how to store it on the data platform.
Frequency – How often does the data update? Does it happen at regular intervals or does it occur continuously throughout the day? Some ingestion tools can stream information into the data lake in real time while others import batches of data on a fixed schedule.
Size – What is the volume of data? Are there billions of line items, or a handful of records? Both your ingestion and storage mechanisms need to scale with the current and future workloads.

Identifying these characteristics will help your organization choose the right platform, tools, and technologies, as well as assist data scientists in the discovery phase of analytics.
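
As an illustration, the four characteristics above can be written down as a small, machine-readable catalog. The following is a minimal Python sketch, not a prescribed format: the DataSource fields, the example sources, and the suggest_ingestion rule are all hypothetical and exist only to show the idea.

```python
from dataclasses import dataclass
from enum import Enum


class Frequency(Enum):
    STREAMING = "streaming"  # data arrives continuously
    BATCH = "batch"          # data arrives at regular intervals


@dataclass(frozen=True)
class DataSource:
    """One catalog entry (hypothetical fields mirroring the list above)."""
    name: str
    location: str        # where the data originates
    structured: bool     # True if it conforms to a well-defined schema
    frequency: Frequency
    approx_records: int  # rough current volume


def suggest_ingestion(source: DataSource) -> str:
    """Return a coarse ingestion style. Illustrative rule only."""
    if source.frequency is Frequency.STREAMING:
        return "stream"
    return "scheduled batch"


# Made-up example sources, for illustration only.
catalog = [
    DataSource("orders_db", "private-network database", True,
               Frequency.BATCH, 5_000_000),
    DataSource("device_telemetry", "sensor devices", False,
               Frequency.STREAMING, 2_000_000_000),
]

for source in catalog:
    print(f"{source.name}: {suggest_ingestion(source)}")
```

Even a simple catalog like this gives data engineers and data scientists a shared starting point when choosing tools and exploring what the lake contains.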

Define data governance policies

While a data lake provides a convenient, central location to access your organization’s data, it also presents challenges when considering regulations, data governance policies, and secure access. Is your organization in a regulated industry? Are there limits to how much data you can store and for how long? Who should have access to the data lake, and to which sections? Do you keep audit logs of who is accessing data? Who is responsible for implementing these measures?

Your organization should answer questions like these and articulate a clear data governance policy. It is best to do this early, because the more fleshed out your policies are at the outset, the easier they are to implement. Many data services offered by the major cloud providers have mechanisms to help organizations fulfill their data governance specifications.
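
To make the questions above concrete, here is a minimal Python sketch of how retention limits, role-based access, and audit logging might be expressed in code. Everything in it (ZonePolicy, AuditLog, the zone and role names, the 365-day retention period) is a hypothetical illustration. In practice, these rules would usually be enforced by your cloud provider's access-control and lifecycle features rather than by hand-written code.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass(frozen=True)
class ZonePolicy:
    """Governance rules for one section of the data lake (hypothetical)."""
    zone: str
    retention_days: int       # how long data may be kept
    allowed_roles: frozenset  # who may read this section


@dataclass
class AuditLog:
    """In-memory stand-in for a real audit trail."""
    entries: list = field(default_factory=list)

    def record(self, user: str, zone: str, allowed: bool) -> None:
        self.entries.append((user, zone, allowed))


def can_access(policy: ZonePolicy, user: str, role: str, audit: AuditLog) -> bool:
    """Check a role against the policy and log the attempt either way."""
    allowed = role in policy.allowed_roles
    audit.record(user, policy.zone, allowed)
    return allowed


def is_expired(policy: ZonePolicy, stored_on: date, today: date) -> bool:
    """True if the data has been kept longer than the retention period."""
    return today - stored_on > timedelta(days=policy.retention_days)


# Made-up policy and users, for illustration only.
raw_zone = ZonePolicy("raw", retention_days=365,
                      allowed_roles=frozenset({"data_engineer"}))
audit = AuditLog()

print(can_access(raw_zone, "ana", "data_engineer", audit))
print(can_access(raw_zone, "bob", "analyst", audit))
print(is_expired(raw_zone, date(2020, 1, 1), date(2021, 6, 1)))
```

Writing policies down in this form, even as a sketch, tends to expose unanswered questions early, such as which roles exist and who owns each section of the lake.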

Automate your ingestion

The most intensive implementation aspect of a data lake involves connecting the data sources to the centralized repository. On this front, a wide range of options exists, from cloud-native services like Amazon Kinesis and Azure Event Hubs, to Hadoop ecosystem tools like Apache Flume, Kafka, or Sqoop, to more customized solutions written in Python, Java, JavaScript, etc. and run in AWS Lambda functions, Azure Functions, or Google Cloud Functions.

Which tools your organization has available will depend on the platform. Choose the one that best fits with your existing applications and your future needs. Refer to our article on data lake platforms for more insight on the available platform options.

Whatever tools you choose, make sure to consider the following:

Scaling – The ingestion should scale with increasing workloads. This “future-proofs” the process and allows data scientists to access fresh copies of the data lake without significant lags in data ingestion.
Ease of use – Prioritize ingestion tools that are easy to implement and maintain. Writing your own scripts may be easy at first but can present long-term maintenance problems, especially when bringing on new resources.
Fault-tolerance – Errors happen. You should anticipate that an ingestion process will fail. Choose solutions that can recover from errors or minimize corruption of the data lake. If writing your own ingestion scripts, this should be a central focus.
Automated alerts – When errors happen, it is best to fix them as early as possible. Do not let your data lake suffer from silent ingestion failures for weeks before realizing that critical information is missing. Enable automatic alerting to resolve issues as they arise.
Security – The tool and platform should ingest information securely over encrypted protocols. Thankfully, most major tools and platforms have encryption built-in, but for any systems your organization manages, security should be a top consideration.
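
If you do write your own ingestion scripts, the fault-tolerance and alerting points above might look something like the following Python sketch. It retries a failing ingestion step with exponential backoff and raises an alert only after the final attempt fails. The names ingest_with_retry, flaky_ingest, and the alert callback are hypothetical. The sketch also assumes the ingestion step is safe to re-run (idempotent), which you would need to guarantee separately to avoid duplicate or partial data.

```python
import time
from typing import Callable


def ingest_with_retry(
    ingest: Callable[[], int],
    alert: Callable[[str], None],
    max_attempts: int = 3,
    base_delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run `ingest`, retrying with exponential backoff.

    Calls `alert` and re-raises the error if every attempt fails.
    Assumes `ingest` is safe to run more than once.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return ingest()
        except Exception as error:
            if attempt == max_attempts:
                alert(f"ingestion failed after {attempt} attempts: {error}")
                raise
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
            sleep(base_delay_seconds * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


# Simulated source that fails twice and then succeeds (illustration only).
attempts = {"count": 0}


def flaky_ingest() -> int:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("source unavailable")
    return 100  # number of records loaded


alerts = []  # stand-in for an email, chat, or paging integration
loaded = ingest_with_retry(flaky_ingest, alerts.append, sleep=lambda s: None)
print(loaded, alerts)
```

Passing sleep and alert in as parameters keeps the sketch easy to test and lets you swap in whatever notification channel your team already uses. Managed ingestion services typically provide retry and alerting features of their own, so check those before building this yourself.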

The next step, on to analytics

The first step to unlocking insights from your organization’s data involves storing it in the data lake. However, ingestion is just the beginning. Now that individuals can access the data in a centralized repository, the next steps involve data discovery, profiling, model building, machine learning, and much more. We will do a deep dive into analytics in a future blog post.

Conclusion

Data provides a wealth of insights to organizations. To succeed in the digital era, organizations must consider how they store and access this information. A data lake provides a straightforward way for organizations to both ingest and explore their data at any scale. Ready to talk about how to organize your data? Scalesology can walk through a data assessment and recommend the right platform and approach to ensure meaningful insights from your data. We are here to help you succeed in your analytics journey.