Executive Summary

Without effective and comprehensive validation, a data lake becomes a data swamp, with no clear link to business value creation.

With the accelerating adoption of cloud data lakes as the data platform of choice, the need to validate data in real time has become critical. Accurate, consistent, and reliable data fuels algorithms, operational processes, and effective decision-making.

Existing approaches to data validation are rule-based: resource-intensive, time-consuming, costly, and unable to scale to thousands of data assets.

Business Impact of Data Quality issues in Data Lake

The following examples from Global 2000 organizations demonstrate the need to establish data quality checks on every data asset present in the data lake.

Scenario 1: New subscribers of an insurance company could not access telehealth services for more than a week.

Root Cause: The data engineering team was not aware that the insurance company had been onboarded as a new client, and the ETL jobs did not pick up the enrollment files that landed in the Azure data lake.

Scenario 2: Commodity traders at a trading company could not find user-level credit information for a certain group of users on a Monday morning – the report was blank – leading to a two-hour disruption in trading activities.

Root Cause: The credit file received from another application arrived with the credit field empty and was not checked before being loaded into BigQuery.

Scenario 3: Supply chain executives at a restaurant chain were surprised by a report showing that consumption in the UK had doubled in May.

Root Cause: Because of a processing error, the current month's consumption file was appended to April's consumption file and stored in the AWS data lake.

Current Approach and Challenges

The current focus of cloud data lake projects is on data ingestion, the process of moving data from multiple sources (often in different formats) into a single destination. After ingestion, data moves through the data pipeline, which is where data errors and issues begin to surface. Our research estimates that, on average, 30-40% of any analytics project is spent identifying and fixing data issues. In extreme cases, the project is abandoned entirely.

Current data validation approaches are designed to establish data quality rules for one container/bucket/table at a time. As a result, implementing Data Quality (DQ) checks for thousands of buckets/containers incurs significant cost.

Machine Learning (ML) based approach for Data Quality

Instead of deriving data quality rules through profiling, analysis, and consultation, standardized unsupervised machine learning algorithms can be applied at scale across the data lake to determine acceptable data patterns for each table/bucket/container. Several open-source ML packages offer suitable outlier/anomaly detection algorithms, such as DBSCAN, association rule mining, and principal component analysis.
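As an illustration only, the sketch below applies DBSCAN from scikit-learn to the numeric columns of a single table and treats records that fall outside any dense cluster as outliers. The file name, column names, and DBSCAN parameters are hypothetical placeholders, not recommendations from this paper.

```python
# Minimal sketch: learning "acceptable patterns" for one table with DBSCAN and
# flagging records that fall outside them. Column names, the file path, and the
# eps/min_samples values are illustrative assumptions.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

df = pd.read_parquet("orders_2023_05.parquet")  # hypothetical table extract

numeric_cols = ["order_amount", "quantity", "discount_pct"]
X = StandardScaler().fit_transform(df[numeric_cols].dropna())

# DBSCAN assigns the label -1 to points that do not belong to any dense
# cluster; here that serves as the "does not match learned patterns" signal.
labels = DBSCAN(eps=0.7, min_samples=25).fit_predict(X)
print(f"Records outside learned patterns: {(labels == -1).mean():.2%}")
```

Because the fit is unsupervised, the same step can be repeated per table/bucket/container, which is what allows the approach to scale without hand-written rules.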

Outliers can be classified through the lens of standardized data quality dimensions, as shown below; a brief sketch of how several of these checks might be computed follows the list:

1. Freshness – reports whether the data has arrived before the next step of the process.

2. Completeness – reports the completeness of contextually important fields.

3. Conformity – reports conformity to the pattern, length, and format of contextually important fields.

4. Uniqueness – reports uniqueness at the record level.

5. Drift – reports the drift of key categorical and continuous fields from historical information.

6. Anomaly – reports volume and value anomalies in critical columns.
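The sketch below shows how a few of these dimensions (completeness, uniqueness, and drift) might be computed with pandas. The field names, file names, and the total-variation drift measure are illustrative assumptions, not prescriptions from this paper; in practice the thresholds would be learned from historical runs rather than hard-coded.

```python
# Minimal sketch of completeness, uniqueness, and drift checks on one asset.
# Field names and file names are hypothetical.
import pandas as pd

def completeness(df: pd.DataFrame, key_fields: list) -> dict:
    """Share of non-null values for contextually important fields."""
    return {col: 1 - df[col].isna().mean() for col in key_fields}

def uniqueness(df: pd.DataFrame, record_key: list) -> float:
    """Share of records that are unique at the record-key level."""
    return 1 - df.duplicated(subset=record_key).mean()

def categorical_drift(current: pd.Series, history: pd.Series) -> float:
    """Total variation distance between current and historical category mix."""
    p = current.value_counts(normalize=True)
    q = history.value_counts(normalize=True)
    cats = p.index.union(q.index)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in cats)

df_now = pd.read_parquet("enrollments_2023_05.parquet")   # hypothetical files
df_prev = pd.read_parquet("enrollments_2023_04.parquet")

print(completeness(df_now, ["member_id", "plan_code"]))
print(uniqueness(df_now, ["member_id", "enrolled_at"]))
print(categorical_drift(df_now["plan_code"], df_prev["plan_code"]))
```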

Value comparison

The benefits of ML-based Data Quality fit broadly in two categories: quantitative and qualitative. While the quantitative benefits make the most powerful argument in a business case, the value of the qualitative benefits should not be ignored.

Value Dimension: Cost Reduction
Traditional Approach: Establishing DQ checks takes approximately 8 to 16 resource hours per bucket/container; for a 1,000-bucket/container data lake, the cost is approximately $800K-$1,600K.
ML-Based Approach: Establishing DQ checks takes approximately 2 to 4 compute hours plus 1 resource hour per bucket/container; for a 1,000-bucket/container data lake, the cost is approximately $150K-$200K, including the cost of initial setup.

Value Dimension: Time to Market
Traditional Approach: One year – it would take a team of 8 resources to establish DQ checks for approximately 1,000 buckets/containers.
ML-Based Approach: Three months – it would take a team of 2 resources to establish DQ checks for approximately 1,000 buckets/containers.

Value Dimension: Risk Reduction
Traditional Approach: The rule-based approach addresses known risks, often missing newer types of data risk.
ML-Based Approach: The ML-based approach addresses both known risks (such as completeness and conformity) and emerging risks such as changes in data density.
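As a back-of-the-envelope check of the cost figures above, the totals are consistent with a blended rate of roughly $100 per resource hour; that rate is an assumption, and the compute-hour pricing and setup cost behind the ML-based figure are not stated here, so only the resource-hour portion of that estimate is reproduced.

```python
# Rough reproduction of the cost comparison, under an assumed blended rate of
# $100 per resource hour. Compute-hour pricing and initial setup cost for the
# ML-based approach are not given, so only its resource-hour portion is shown.
buckets = 1000
rate_per_resource_hour = 100  # assumed, in USD

traditional = (8 * buckets * rate_per_resource_hour,
               16 * buckets * rate_per_resource_hour)        # (800_000, 1_600_000)
ml_resource_portion = 1 * buckets * rate_per_resource_hour   # 100_000
print(traditional, ml_resource_portion)
```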

Conclusion

Data is the most valuable asset for organizations, yet current approaches to validating it are full of operational challenges, leading to a trust deficit and to time-consuming, costly methods for fixing data errors. There is an urgent need to adopt a standardized, autonomous approach to validating the cloud data lake to prevent it from becoming a data swamp.

About the Author

Angsuman Dutta, CTO and Co-Founder, FirstEigen

Angsuman Dutta is the CTO and co-founder of FirstEigen. He is an entrepreneur, investor, and corporate strategist with experience in building software businesses that scale and drive value. He has provided Information Governance and Data Quality advisory services to numerous Fortune 500 companies for two decades and has successfully launched several businesses, including Pricchaa Inc. He is a recognized thought leader and has published numerous articles on Information Governance.

He earned a B.S. in engineering from the Indian Institute of Technology, Kharagpur, an M.S. in Computer Science from the Illinois Institute of Technology, and an MBA in Analytical Finance and Strategy from the University of Chicago.

Linkedin Profile: https://www.linkedin.com/in/dutta-a-225240230/
