A data lake¹ is a pool of data from multiple sources. Unlike a data warehouse, it can store both structured and unstructured data, which can be processed and analyzed later. This removes much of the overhead of traditional database architectures, which commonly require lengthy ETL and data modeling at ingestion time (to impose schema-on-write).
With ever-growing volumes of data being collected, and the need to use that data to build solutions and strategies, organizations face a major challenge in maintaining these massive pools of data and extracting valuable business insights from them. If the data in a data lake is not well curated, the lake can fill up with random information that is difficult to manage and consume, turning it into a data swamp. Therefore, before going forward with a data lake, it’s important to be aware of the best practices for designing, implementing, and operationalizing it.
Let’s look at the best practices that help build an efficient data lake.
- Data Ingestion
- Addressing Business Problem: It’s always better to question the need for a data lake before diving straight into it. Opt for one only if the business problem demands it. It is important to stay committed to the problem and find its answer; if building a data lake then turns out to be the right way to go, great! A common misconception is that data lakes and databases are the same. The basics of a data lake should be clear, and it should be implemented for the right use cases.
In general, data lakes are suitable for analyzing data from diverse sources, especially when the initial data cleansing is problematic. Data lakes also provide near-unlimited scalability and flexibility at a reasonable cost. Let’s look at some use cases where industries use data lakes:
- Healthcare: There is a lot of unstructured data in medical services (e.g., doctors’ notes and clinical information) and a constant need for real-time insights. Data lakes are therefore a better fit for companies in healthcare and insurance, as they give access to both structured and unstructured data.
- Transportation: Data lakes allow you to transform raw data into structured data that is ready for SQL analytics, data science, and machine learning with low latency. Raw data can be retained at a low cost for future use in machine learning and analytics. In the transportation industry, the business insights derived from the data can help companies reduce their costs and increase their profits.
- Schema Discovery Upon Ingest: It’s generally not a good idea to wait until the data is already in the lake to find out what it contains. Having visibility into the schema, and a general idea of what the data contains, as it is being streamed into the lake eliminates the need for ‘blind ETLing’ or reliance on partial samples for schema discovery later on.
- Ensure Zero Data Loss: Ingestion can be in batch or streaming form. The data lake must ensure zero data loss and write the data exactly-once or at-least-once. Duplicate or missed events can significantly hurt the reliability of the data stored in your lake, but exactly-once processing is notoriously difficult to implement, particularly when the storage layer is only eventually (not instantly) consistent. The data lake must also handle schema variability, ensure that data is written in the most optimized data format into the right partitions, and provide the ability to re-ingest data when needed.
- Persist Data In The Raw State: It’s always good to persist data in its original state so that it can be repurposed whenever new business requirements emerge. Furthermore, raw data is great for exploration and discovery-oriented analytics (e.g., mining, clustering, and segmentation), which work well with large samples, detailed data, and data anomalies (outliers, nonstandard data).
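One common way to approximate the exactly-once behavior described under Ensure Zero Data Loss is to accept at-least-once delivery and make the write idempotent. The sketch below is a toy illustration under stated assumptions: the `IdempotentSink` class, the `event_id` field, and the in-memory key set are all hypothetical, and a real lake would need the "seen" state to be durable and updated atomically with the write.

```python
import hashlib
import json


class IdempotentSink:
    """Toy sink that tolerates redelivery by remembering keys it has written."""

    def __init__(self):
        self.rows = []       # stands in for files written to the lake
        self._seen = set()   # keys of events already persisted (in memory only)

    @staticmethod
    def event_key(event):
        # Prefer a producer-assigned id (assumed field name: "event_id");
        # otherwise fall back to a hash of the canonical JSON payload.
        if "event_id" in event:
            return str(event["event_id"])
        payload = json.dumps(event, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def write(self, event):
        """Return True if the event was written, False if it was a duplicate."""
        key = self.event_key(event)
        if key in self._seen:
            return False
        self.rows.append(event)
        self._seen.add(key)
        return True


sink = IdempotentSink()
batch = [
    {"event_id": 1, "user": "a"},
    {"event_id": 2, "user": "b"},
    {"event_id": 1, "user": "a"},  # redelivered after a simulated retry
]
results = [sink.write(event) for event in batch]
```

With this approach the producer can retry freely (so no event is missed), and the sink discards the resulting duplicates.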
- Data Transformation
- Columnar Data Formats For Read Analytics: Columnar storage makes the data easy and efficient to read, so it is better to store data that will be used for analytics in a format such as Apache Parquet³ or ORC⁴. In addition to being optimized for reads, these file formats have the advantage of being open-source rather than proprietary, which means they can be read by a variety of analytics services.
- Partition Data: Partitioning the data helps reduce query costs and improves performance by limiting the amount of data the query engine needs to scan to return the results for a specific query. Data is commonly partitioned by timestamp (by minute, by hour, or by day), and the size of the partition should depend on the type of query you intend to run. One can also partition by time, geography, or line of business (LOB) to reduce data scans, and tune partition granularity to the data set under consideration (for example, by hour vs. by second).
- Chunk Up Small Files: Many small files can be merged (compacted) into larger ones asynchronously to reduce network overhead.
- Perform Stats-based Cost-based Optimization: A cost-based optimizer⁵ (CBO) uses statistics to generate efficient query execution plans that can improve performance. Statistics also help you understand the optimizer’s decisions, such as why it chooses a nested loop join instead of a hash join, and therefore the performance of a query. Dataset statistics such as file size, row count, and histograms of values can be collected to optimize queries through join reordering. Column and table statistics are critical for estimating predicate selectivity and the cost of a plan, and certain advanced rewrites require column statistics.
- Use Z-order Indexed Materialized Views For Cost-based Optimization: A materialized view⁶ is like a query whose result is materialized and stored in a table. When a user query is found to be compatible with the query associated with a materialized view, the user query can be rewritten in terms of the materialized view. This technique speeds up the user query because most of its result has been precomputed. A z-order index serves queries that filter on any combination of the indexed columns, rather than only on the single column the data is sorted by.
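To make the z-order idea above concrete: a Z-order (Morton) key is built by interleaving the bits of several column values, so rows that are close in every indexed column tend to end up close together in the sort order. The sketch below is a minimal two-column version for small non-negative integers; the function name and the sample points are illustrative assumptions, and real engines also have to normalize other data types before interleaving.

```python
def interleave_bits(x, y, bits=16):
    """Return the Morton (Z-order) code of two non-negative integers."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)       # bits of x go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)   # bits of y go to odd positions
    return code


# Sort rows by the interleaved key instead of by a single column.
points = [(0, 0), (3, 0), (1, 1), (0, 3), (2, 2), (3, 3)]
z_sorted = sorted(points, key=lambda p: interleave_bits(p[0], p[1]))
```

Because neither column dominates the sort key, files written in this order tend to cover a narrow range of both columns, which lets per-file min/max statistics skip files for filters on either one.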
- Data Governance
- Maintaining Data Catalogs: Data should be cataloged and identified, with sensitive data clearly labeled. A data catalog helps users discover and profile datasets for integrity by enriching metadata through different mechanisms, documenting datasets, and supporting a search interface.
- Ensuring Correct Metadata For Search: It’s important for every piece of data in a data lake to have information about it (metadata). Creating metadata is a common way for enterprises to organize their data and prevent a data lake from turning into a data swamp. It acts as a tagging system that helps people search for different kinds of data. Without metadata, people accessing the data may not know how to search for the information they need.
- Set A Retention Policy: Data should not be stored forever in a data lake, as doing so incurs storage costs and may also lead to compliance-related issues. It is therefore better to have appropriate retention policies for incoming data.
- Privacy/Security: A key component of a healthy data lake is privacy and security, including role-based access control, authentication, authorization, and encryption of data at rest and in motion. A data lake security plan needs to address the following five important challenges:
- Data access control – The standard approach calls for using built-in Identity and Access Management (IAM) controls from the cloud vendor.
- Data protection – Encryption of data at rest is a requirement of most information security standards.
- Data leak prevention – Most major data leaks come from within the organization, sometimes inadvertently and sometimes intentionally. Fine-grained access control is critical to preventing data leaks. This means limiting access at the row, column, and even cell level, with anonymization to obfuscate data correctly.
- Prevent accidental deletion of data – Data resiliency through automated replicas does not prevent an application (or developers/users) from corrupting data or accidentally deleting it. To prevent accidental deletion, it is recommended to first set the correct access policies for the data lake. This includes applying account and file-level access control using the security features provided by the cloud service. It is also recommended to routinely create copies of critical data in another data lake, which can be used to recover from data corruption or deletion incidents.
- Data governance, privacy, and compliance – Every enterprise must deal with its users’ data responsibly to avoid the reputation damage of a major data breach. The system must be designed to quickly enable compliance with industry and data privacy regulations.
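The fine-grained access control described under data leak prevention can be illustrated with a small sketch that applies a row filter, drops columns a role may not see, and pseudonymizes sensitive ones. Everything here is hypothetical: the role names, the `POLICIES` layout, and the column names are assumptions for illustration, and in practice these rules would usually be enforced by the cloud vendor's or query engine's own access controls rather than by application code.

```python
import hashlib

# Hypothetical policy table: which columns each role sees in clear text,
# which are masked, and which rows the role may read at all.
POLICIES = {
    "analyst": {
        "clear": {"region", "amount"},
        "masked": {"email"},
        "row_filter": lambda row: row["region"] == "EU",
    },
    "admin": {
        "clear": {"email", "region", "amount", "ssn"},
        "masked": set(),
        "row_filter": lambda row: True,
    },
}


def pseudonymize(value):
    # Truncated unsalted hash: enough for a demo, not strong anonymization.
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]


def read_as(role, rows):
    """Return the rows and columns visible to the given role."""
    policy = POLICIES[role]
    visible_rows = []
    for row in rows:
        if not policy["row_filter"](row):
            continue  # row-level restriction
        visible = {}
        for column, value in row.items():
            if column in policy["clear"]:
                visible[column] = value
            elif column in policy["masked"]:
                visible[column] = pseudonymize(value)
            # columns in neither set are dropped entirely
        visible_rows.append(visible)
    return visible_rows


rows = [
    {"email": "a@example.com", "region": "EU", "amount": 10, "ssn": "111"},
    {"email": "b@example.com", "region": "US", "amount": 20, "ssn": "222"},
]
analyst_view = read_as("analyst", rows)
admin_view = read_as("admin", rows)
```

The analyst sees only EU rows, with the email replaced by a pseudonym and the `ssn` column removed, while the admin sees both rows unchanged.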
Data lakes allow organizations to hold, manage, and exploit diverse data to their benefit. In reality, however, some data lakes fail to serve their purpose because of their complexity. This complexity may be induced by several factors, one of which is improper data ingestion. Building a sound data ingestion² strategy is vital for succeeding with your enterprise data lakes.
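As one piece of such an ingestion strategy, the schema-discovery-upon-ingest practice can be sketched as a running schema that is updated as each record arrives, instead of being inferred later from samples. The example assumes newline-delimited JSON input; the function names and the sample records are hypothetical.

```python
import json
from collections import defaultdict


def observe(schema, record):
    """Fold one record's field names and value types into the running schema."""
    for field, value in record.items():
        schema[field].add(type(value).__name__)
    return schema


def discover_schema(lines):
    """Build a field -> sorted list of observed types map from JSON lines."""
    schema = defaultdict(set)
    for line in lines:
        observe(schema, json.loads(line))
    return {field: sorted(types) for field, types in schema.items()}


stream = [
    '{"id": 1, "name": "a"}',
    '{"id": "2", "name": "b", "age": 30}',  # id changes type, age is new
]
schema = discover_schema(stream)
```

A field that maps to more than one type (here `id`) or that appears only in later records (here `age`) is an early signal of schema drift that can be handled before the data lands in the lake.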
Data generation and data collection across semi-structured and unstructured formats are both bursty and continuous. Inspecting, exploring, and analyzing these datasets in their raw form is tedious because the analytical engines scan the entire data set across multiple files. The practices listed under Data Transformation above are ways to reduce the data scanned and the query overhead.
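Partitioning is one way to reduce the data scanned: if files are laid out by time, a query over a time range only needs to open the partitions that overlap it. The sketch below assumes a Hive-style `dt=.../hour=...` directory layout and a hypothetical `events` table; neither is prescribed by the text.

```python
from datetime import datetime, timedelta


def partition_path(table, ts):
    """Build a Hive-style path partitioned by day and hour."""
    return f"{table}/dt={ts:%Y-%m-%d}/hour={ts:%H}"


def partitions_for_range(table, start, end):
    """List the hourly partitions that overlap the interval [start, end)."""
    current = start.replace(minute=0, second=0, microsecond=0)
    paths = []
    while current < end:
        paths.append(partition_path(table, current))
        current += timedelta(hours=1)
    return paths


# A query for 22:30 to 01:00 touches three hourly partitions, not the whole table.
paths = partitions_for_range(
    "events",
    datetime(2024, 1, 1, 22, 30),
    datetime(2024, 1, 2, 1, 0),
)
```

The same function shows the granularity trade-off: partitioning by hour instead of by day means fewer bytes scanned per query but many more directories to manage.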
Don’t wait until after your data lake is built to think about data quality. Having a well-crafted data governance⁷ strategy in place from the start is a fundamental practice for any big data project, helping to ensure consistent, common processes and responsibilities.
Following the above best practices will help create and maintain a sustainable and healthy data lake. By devising the right strategy for collecting and storing data, one can reduce the cost of storage, make data access efficient and cost-effective, and ensure data security.
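Retention, one of the governance practices listed above, is an example of a process that can be written down as an explicit rule set from the start. The following sketch assumes hypothetical path prefixes and retention periods, and takes the object listing as plain `(key, last_modified)` pairs rather than calling any particular storage API.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical retention rules: path prefix -> maximum allowed age.
RETENTION = {
    "raw/": timedelta(days=365),
    "staging/": timedelta(days=30),
}


def expired(objects, now):
    """Return the keys whose age exceeds the retention of their prefix."""
    to_delete = []
    for key, last_modified in objects:
        for prefix, max_age in RETENTION.items():
            if key.startswith(prefix) and now - last_modified > max_age:
                to_delete.append(key)
                break
    return to_delete


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
objects = [
    ("raw/a.json", datetime(2023, 1, 1, tzinfo=timezone.utc)),
    ("raw/b.json", datetime(2024, 5, 1, tzinfo=timezone.utc)),
    ("staging/c.parquet", datetime(2024, 4, 1, tzinfo=timezone.utc)),
    ("curated/d.parquet", datetime(2020, 1, 1, tzinfo=timezone.utc)),
]
doomed = expired(objects, now)
```

Objects under a prefix with no rule (here `curated/`) are never selected, so the default in this sketch is to keep data unless a policy says otherwise.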