Do your product dashboards look funky? Are your quarterly reports stale? Is the dataset you're using broken or just plain wrong? These problems affect almost every team, yet they're usually addressed on an ad hoc basis and in a reactive manner. If you answered yes to any of the questions above, this book is for you.
Many data engineering teams today face the "good pipelines, bad data" problem. It doesn't matter how advanced your data infrastructure is if the data you're piping is bad. In this book, Barr Moses, Lior Gavish, and Molly Vorwerck from the data reliability company Monte Carlo explain how to tackle data quality and trust at scale by leveraging best practices and technologies used by some of the world's most innovative companies. The book shows you how to:
- Build more trustworthy and reliable data pipelines
- Write scripts to run data checks and identify broken pipelines with data observability
- Program your own data quality monitors from scratch
- Develop and lead data quality initiatives at your company
- Generate a dashboard to highlight your company's key data assets
- Automate data lineage graphs across your data ecosystem
- Build anomaly detectors for your critical data assets
The book covers all the relevant aspects of data quality and data reliability. It is a solid read for both newcomers and advanced engineers; however, it gets a bit lost in repetitive and redundant considerations.
## Notes

Data quality is the health of data at any stage in its life cycle. New trends: migration to the cloud; more data sources; increasingly complex data pipelines; more specialized data teams; decentralized data teams; streaming data; the data lakehouse; the data mesh.
Operational data: generated by day-to-day business processes and used to quickly update systems and processes. Analytical data: the data behind data-driven business decisions, intended for more robust and efficient analysis. Throughput versus latency trade-off.
- Data warehouse (Redshift, BigQuery, Snowflake): schema-on-write access; the structure of the data is set at the instant it enters the warehouse. Drawbacks: limited flexibility, SQL-only support, frictional workflows.
- Data lake: allows manipulation at the file level, with schema-on-read access; the structure of the data is inferred when we want to use it. Strengths: decoupled storage and compute, support for distributed compute, customization and interoperability, largely built on open-source technologies, ability to handle unstructured, weakly structured, and raw data, support for non-SQL models (PySpark). Challenges: data integrity not guaranteed, technical debt, few quality checks, more endpoints.
- Data lakehouse: high-performance SQL (Presto, Spark); more robust schemas, such as Parquet; atomicity, consistency, isolation, durability (ACID); managed services.
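A minimal sketch of the schema-on-write versus schema-on-read distinction, using PySpark (one of the non-SQL models mentioned above). This is my own illustration, not code from the book; the file name `events.json` and the column names are assumptions.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("schema-demo").getOrCreate()

# Schema-on-read (data lake style): the structure is inferred only when the file is read.
inferred_df = spark.read.json("events.json")
inferred_df.printSchema()

# Schema-on-write style: an explicit schema is enforced up front, surfacing surprises early.
explicit_schema = StructType([
    StructField("event_id", StringType(), nullable=False),
    StructField("user_id", StringType(), nullable=True),
    StructField("created_at", TimestampType(), nullable=True),
])
typed_df = spark.read.schema(explicit_schema).json("events.json")
typed_df.printSchema()
```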
Collecting, cleaning, transforming, and testing data. Data downtime: periods of time when your data is partial, erroneous, missing, or inaccurate. ETL vs. ELT (in ELT, the data is loaded into the target system first and then transformed, which speeds up processing). Data quality tests: null values, volume, distribution, uniqueness, known invariants. Unit testing platforms: dbt, Great Expectations, Deequ. Apache Airflow can enforce service level agreements (SLAs) for the maximum amount of time a task should take, or similar metrics; circuit breakers stop unreliable data from flowing into production, serving as an implicit guarantee.
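A minimal sketch of the basic checks listed above (nulls, volume, uniqueness) acting as a simple circuit breaker. This is my own illustration in pandas, not code from the book; the table and column names are made up.

```python
import pandas as pd

def run_quality_checks(df: pd.DataFrame, key_column: str, min_rows: int) -> None:
    # Volume check: the batch should contain at least the expected number of rows.
    if len(df) < min_rows:
        raise ValueError(f"Volume check failed: {len(df)} rows < expected {min_rows}")

    # Null check: the key column must be fully populated.
    null_count = df[key_column].isna().sum()
    if null_count > 0:
        raise ValueError(f"Null check failed: {null_count} null values in {key_column!r}")

    # Uniqueness check: the key column must not contain duplicates.
    duplicate_count = df[key_column].duplicated().sum()
    if duplicate_count > 0:
        raise ValueError(f"Uniqueness check failed: {duplicate_count} duplicate keys")

orders = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.0, 5.5, 7.25]})
run_quality_checks(orders, key_column="order_id", min_rows=1)  # passes silently
```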
Monitoring and anomaly detection. Known unknowns: easily predicted issues such as null values, freshness problems, and schema changes. Unknown unknowns: data downtime, stale data, data drift over time. Anomaly detection is an unsupervised task: the anomalous behavior is not knowable at training time, and there is no ground truth. False positives: predicted anomalous, actually OK. False negatives: predicted OK, actually anomalous. Recall (out of all the genuine anomalies, TP + FN, how many did we catch?) is often more important than precision (out of all the predicted positives, how many are correct?). That's because we want to catch anomalies even at the cost of more false positives: it is better, or less bad, to raise a false bomb alarm (false positive) than to miss and fail to report an actual bomb (false negative). Approaches: rule definitions or hard thresholding; autoregressive models for time series; exponential smoothing; clustering; hyperparameter tuning; ensemble model frameworks.
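A minimal sketch of one of those approaches: exponential smoothing combined with a hard threshold on the deviation. This is my own illustration on made-up daily row counts, not the authors' detector.

```python
import numpy as np

def detect_anomalies(values, alpha=0.3, z_threshold=3.0):
    """Flag points whose deviation from an exponentially smoothed baseline is unusually large."""
    smoothed = values[0]          # initialize the baseline with the first observation
    residual_history = []         # past deviations from the baseline
    flags = []
    for x in values:
        residual = x - smoothed
        # Compare against the spread of past residuals (after a short warm-up period).
        if len(residual_history) >= 3:
            std = float(np.std(residual_history)) or 1.0   # guard against zero spread
            flags.append(abs(residual) / std > z_threshold)
        else:
            flags.append(False)
        residual_history.append(residual)
        smoothed = alpha * x + (1 - alpha) * smoothed      # exponential smoothing update
    return flags

# Daily row counts for a table; the final value is a sudden drop we would want to catch.
daily_row_counts = [1000, 1020, 995, 1010, 1005, 990, 1015, 120]
print(detect_anomalies(daily_row_counts))  # only the last element is flagged True
```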
Data reliability: unit testing; functional testing; integration testing. Data observability pillars: freshness, distribution (defined ranges), volume, schema, lineage (related upstream and downstream assets). Metrics: TTD (time to detection), TTR (time to resolution), SLAs (service level agreements), SLIs (service level indicators), SLOs (service level objectives). Compute the cost of downtime as (TTD + TTR) × downtime hourly cost. Define data reliability SLAs, measure SLIs, track SLOs. Data quality dimensions: completeness, timeliness, validity, accuracy, consistency, uniqueness.
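The downtime-cost formula above, spelled out as a tiny function; the figures in the example are made up for illustration, not numbers from the book.

```python
def downtime_cost(ttd_hours: float, ttr_hours: float, hourly_cost: float) -> float:
    """Cost of an incident = (time to detect + time to resolve) * cost per downtime hour."""
    return (ttd_hours + ttr_hours) * hourly_cost

# Example: 4 hours to detect, 6 hours to resolve, $500 per downtime hour -> $5,000.
print(downtime_cost(ttd_hours=4, ttr_hours=6, hourly_cost=500))
```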
Data platform layers:
- Data ingestion
- Data storage and processing
- Data transformation and modelling
- Business intelligence and analytics
- Data observability
- Data discovery and governance
Fixing data quality issues at scale. Software development life cycle: plan, code, build, test, release, deploy, operate, monitor. Root cause analysis:
- Look at the lineage
- Look at the code
- Look at the data
- Look at the operational environment
- Leverage your peers
Post-mortem best practices: frame everything as a learning experience; assess the readiness for future incidents; document each post-mortem and share it with the broader data team; revisit the service level agreements (SLAs).
Building end-to-end lineage. Data lineage refers to a map of a dataset's journey throughout its life cycle, from ingestion to visualization, tracing its relationships between upstream source systems and downstream dependencies. Desirable properties of a lineage solution: fast time to value; secure architecture; automation; integration with popular data tools; extraction of column-level information.
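A minimal sketch of lineage represented as a directed graph, so downstream impact and upstream root-cause candidates can be found by traversal. This uses networkx as an assumed helper library and hypothetical table names; it is not the book's implementation.

```python
import networkx as nx

lineage = nx.DiGraph()
lineage.add_edges_from([
    ("raw.orders", "staging.orders_clean"),
    ("staging.orders_clean", "analytics.daily_revenue"),
    ("analytics.daily_revenue", "dashboard.revenue_report"),
])

# If raw.orders breaks, every descendant is potentially affected (downstream impact).
print(sorted(nx.descendants(lineage, "raw.orders")))

# Upstream root-cause candidates for a broken dashboard.
print(sorted(nx.ancestors(lineage, "dashboard.revenue_report")))
```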
Democratizing data quality. Treating data like a product: reliability and observability; scalability; extensibility; usability; security and compliance; release discipline and roadmap. Benefits: increased data accessibility and democratization, faster ROI on data, time savings for the data team, more precise insights. Data product managers help build the internal tooling and platforms needed to achieve those goals. Invest in self-serve tooling and prioritize data quality/reliability across the maturity stages: reactive, proactive, automated, scalable. Align the data product's goals with the goals of the business.
Certifying the data: build out data observability capabilities; determine data owners; understand what good data looks like; set clear SLAs, SLIs, and SLOs for the most important datasets; develop communication and incident management processes; determine a mechanism to tag the data as certified; train the data team and downstream consumers.
Supporting hypergrowth as a decentralized data operation: dedicated analysts work within their business units while still maintaining a close connection with the core analytics team.
Increase data literacy, the ability to read, write, and communicate about data in a way that drives value and impact for the organization, and create a data catalog, analogous to a physical library catalog, as an inventory of metadata.
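Purely as an illustration of what "an inventory of metadata" might contain, here is a minimal catalog-entry sketch; the field names and example values are my assumptions, not the book's schema.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    table_name: str                 # fully qualified asset name
    description: str                # what the dataset contains and why it exists
    owner: str                      # team or person accountable for the asset
    update_schedule: str            # e.g. "daily at 06:00 UTC"
    columns: dict = field(default_factory=dict)  # column name -> description

entry = CatalogEntry(
    table_name="analytics.daily_revenue",
    description="Daily revenue aggregated per region.",
    owner="analytics-team",
    update_schedule="daily at 06:00 UTC",
    columns={"region": "Sales region code", "revenue_usd": "Total revenue in USD"},
)
print(entry.table_name, "owned by", entry.owner)
```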
Data Quality Fundamentals" offers a comprehensive examination of data quality across the entire data pipeline. The book is structured to address key considerations essential for maintaining high-quality data, exploring aspects from data ingestion to end-user consumption.
### Data Ingestion

The book starts with the fundamentals of data ingestion, emphasizing the importance of clean and well-documented data sources. The authors stress the need for validating and cleaning data at the point of entry to avoid downstream quality issues.
### Data Storage

In discussing data storage, the book highlights best practices for maintaining data integrity and consistency. It covers techniques such as normalization, the use of primary keys, and indexing, all of which are critical for preserving data quality in storage systems.
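A small sketch, not from the book, of two of the storage-side techniques mentioned here (primary keys and indexing), using Python's built-in sqlite3 module; the table and column names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,   -- uniqueness enforced by the database
        email TEXT NOT NULL,               -- completeness enforced at write time
        country TEXT
    )
""")
# An index speeds up lookups on a frequently filtered column.
conn.execute("CREATE INDEX idx_customers_country ON customers (country)")
conn.execute(
    "INSERT INTO customers (customer_id, email, country) VALUES (1, 'a@example.com', 'IT')"
)
print(conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0])
```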
### Data Transformation

Transformation processes are dissected to show how data can be enriched or corrupted. The book provides practical advice on ensuring transformations are transparent and reversible, promoting practices like version control.
### Data Lineage

While discussing data lineage, the book provides an overview of tracking data flow through various systems. However, this section can be somewhat repetitive and lacks the technical depth that some readers might expect. Despite this, the emphasis on the importance of data lineage in understanding data transformations and ensuring accountability is clear.
### Data Consumption

The book culminates with the end-user stage, explaining how to present data in a way that maintains its integrity and supports accurate decision-making. Techniques for building dashboards and reports that accurately reflect the underlying data are discussed.
## Data Observability and Its Impacts
### Data Observability

One of the strengths of the book is its comprehensive discussion of data observability. It explains how monitoring, logging, and alerting mechanisms can provide insights into data health and detect quality issues early. The authors argue that observability is crucial for maintaining trust in data systems.
### Impact of Data Downtime

The book poignantly addresses the impacts of data downtime, noting how disruptions can lead to significant business and operational setbacks. Real-world examples are used to illustrate the costs of poor data quality, emphasizing the importance of robust data practices to mitigate such risks.
## Critique: Repetition and Vagueness
### Repetitiveness

A notable critique of the book is its tendency to be repetitive, especially in sections where fundamental concepts are reiterated without adding new insights. This can detract from the reader's engagement and make some parts of the book feel redundant.
### Vague Technical Structure

In some areas, particularly when discussing advanced topics like data lineage, the book tends to be vague. It often lacks the technical depth that data professionals might be seeking, opting instead for high-level overviews. This could leave readers wanting more detailed guidance on implementing the concepts discussed.
## Conclusion "Data Quality Fundamentals" is a valuable resource for understanding the essentials of maintaining data quality across the pipeline. Despite its occasional repetitiveness and lack of technical detail in certain areas, the book offers practical advice and highlights the critical importance of data observability and the impacts of data downtime. It serves as a useful guide for both beginners and seasoned professionals looking to enhance their data quality practices.
Data quality is crucial for businesses, as inconsistent and untrustworthy data can cost companies money and erode customer trust.
Measuring data quality through metrics such as completeness, timeliness, validity, accuracy, consistency, and uniqueness is essential.
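As a small illustration (my own, not the book's code) of scoring two of these dimensions, completeness and uniqueness, over a pandas DataFrame; the column names and values are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4, None],
    "signup_date": ["2024-01-02", None, "2024-01-05", "2024-01-07", "2024-01-09"],
})

completeness = df.notna().mean().mul(100).round(1)   # % of non-null values per column
uniqueness = df["customer_id"].dropna().is_unique    # are the IDs free of duplicates?

print(completeness.to_dict())          # {'customer_id': 80.0, 'signup_date': 80.0}
print("customer_id unique:", uniqueness)  # False, because of the duplicated ID 2
```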
Here are 10 key takeaways that will revolutionize your approach to data quality:
1️⃣ Data Downtime: Inconsistent, untrustworthy data can cost companies millions and erode customer trust. Prioritize data reliability to avoid costly disasters.
2️⃣ Trustworthy Data: Remember, "No data is better than bad data." Ensure accuracy, completeness, and timeliness to establish trust and make informed decisions.
3️⃣ Meaningful Metrics: Measure completeness, timeliness, validity, accuracy, consistency, and uniqueness to assess and improve data quality effectively.
4️⃣ Data Governance: Understand data sourcing, ownership, and usage to ensure data integrity and accountability throughout the organization.
6️⃣ Pillars of Data Observability: Focus on freshness, distribution, volume, schema, and lineage to build a solid foundation for high-quality data.
6️⃣ Building Data Infrastructure: Master data ingestion, storage, processing, transformation, BI & analytics, observability, and governance to unleash the power of data.
7️⃣ Impact of Poor Data: Poor data quality hampers decision-making, efficiency, and business success. Prioritize data excellence for optimal results.
8️⃣ Cultivating a Data-Driven Culture: Foster data literacy, educate stakeholders, and promote the value of high-quality data throughout the organization.
9️⃣ Democratize Data Insights: Empower teams with transparent data, embrace imperfections, and democratize data-driven decision-making.
🔟 Measure and Improve: Assess the impact of poor data quality, identify critical data sets, and continuously enhance data quality for better outcomes.