Data quality checks validate data to ensure it is accurate, reliable, and fit for consumption. According to a Gartner study, poor data quality costs organizations an average of $12.9 million annually, so it is essential to ensure the quality of the data businesses use to make decisions.
This article introduces the fundamentals of data quality checks, common issues faced with data quality, and 7 tests to ensure the data is accurate. Finally, we do a comparative analysis of the popular data testing tools.
What is Data Quality Testing?
Data quality testing is a process for checking whether the data in our system is correct, reliable, and useful. Imagine you have a list of names and phone numbers. Data quality testing ensures that every name matches the right number, like validating that Alice’s number isn’t accidentally listed under Bob. It is important to find and fix these errors so that decisions using this data are based on accurate information.
Common Data Quality Issues
A variety of issues can affect data quality, such as human error, flawed data collection processes, system errors, or problems introduced during data migration and integration.
Some of the common data quality issues are as listed below:
- Data Inaccuracies, such as incorrect or outdated information
- Data Inconsistencies, where data does not match across different sources or records
- Data Duplication, where the same data appears multiple times, creating redundancy and confusion
- Incomplete data, characterized by missing values or gaps in information
- Poor data formatting, such as when date fields are inconsistently entered (For example: 01/01/2000, 2000-01-01 and 1st Jan 2000)
Consequences of Poor Data Quality
Poor data quality can impact companies’ business decisions, leading to financial loss, wasted resources, and lost opportunities. For example, a business relying on inaccurate customer data for its marketing campaigns may end up targeting the wrong customers and seeing reduced conversion rates. Similarly, in healthcare, patient safety can be at risk if medical records are inaccurate or incomplete.
If incorrect information leads to customer dissatisfaction, poor data quality can damage a company’s reputation. Ultimately, it can hinder an organization’s growth and the value it delivers to its customers.
7 Data Quality Checks
1. Freshness Checks
Imagine an e-commerce platform that relies on hourly updates of its inventory to display stock to its users. Data freshness checks can help verify whether the data is updated within the expected time interval.
To implement freshness checks, we can use a script that compares the timestamp of the last update for each inventory record with the current time. If the data is stale beyond a certain threshold, it might trigger an alert or update process.
# Example Python script to check data freshness
from datetime import datetime, timedelta
# Simulate the timestamp of the last data update
last_update = datetime(2023, 4, 1, 12, 30)
# Define the freshness threshold (e.g., 2 hours)
freshness_threshold = timedelta(hours=2)
# Check if the data is fresh
current_time = datetime.now()
if current_time - last_update > freshness_threshold:
    print("Data is stale. Triggering update process.")
    # Trigger update process (e.g., through an API call)
else:
    print("Data is fresh.")
Freshness SLIs
Freshness Service Level Indicators (SLIs) provide a quantitative measure of how up-to-date the data is. It could be the time lag between when data is expected and when it becomes available. For example, in an e-commerce platform, the SLI can be the time a sale transaction takes to appear in the analytics dashboard.
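As a rough illustration, a freshness SLI can be computed as the lag between when the data was expected and when it actually arrived; the timestamps and the 15-minute target below are purely illustrative.
# Example Python sketch to compute a freshness SLI (illustrative values)
from datetime import datetime, timedelta
# Hypothetical timestamps: when the data was expected vs. when it arrived
expected_time = datetime(2023, 4, 1, 13, 0)
arrival_time = datetime(2023, 4, 1, 13, 12)
# The SLI is the lag between expected and actual availability
freshness_lag = arrival_time - expected_time
print(f"Freshness lag: {freshness_lag}")
# Compare against an illustrative target of 15 minutes
if freshness_lag > timedelta(minutes=15):
    print("Freshness SLI breached.")
else:
    print("Freshness SLI within target.")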
2. Volume Tests
Data volume tests determine whether a system can handle large volumes of data at peak times. You can identify potential bottlenecks and scalability issues by simulating heavy loads. For data volume testing, we can gradually increase the number of transactions per second (TPS) until the system shows signs of strain, as sketched below.
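A minimal sketch of such a ramp is shown below, assuming a hypothetical `process_transaction` function that stands in for the system under test; in practice, a dedicated load-testing tool would drive real traffic and measure latency.
# Example Python sketch of a gradual volume ramp (process_transaction is a stand-in)
import time
def process_transaction():
    # Placeholder for the real work (e.g., an API call or a database write)
    time.sleep(0.001)
# Ramp up the target transactions per second and watch how long each batch takes
for tps in [10, 50, 100, 200]:
    start = time.perf_counter()
    for _ in range(tps):
        process_transaction()
    elapsed = time.perf_counter() - start
    print(f"Target TPS: {tps}, batch completed in {elapsed:.2f}s")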
Missing Data
Missing data can lead to incomplete analysis and biased results. For example, in a dataset representing sales over time, missing entries for certain days can skew the understanding of sales performance. To identify missing data, we can use the below script to count the number of expected versus actual records:
# Example Python script to check for missing data
import pandas as pd
# load data
df = pd.read_csv('sales_data.csv')
# expected number of records per day (e.g., number of stores)
expected_records = 100
# count actual records per day
actual_records = df.groupby('date').size()
# identify dates with missing records
missing_data_dates = actual_records[actual_records < expected_records].index.tolist()
# report missing data
print("Data was missing on the following dates: ", missing_data_dates)
Too Much Data
More data than expected can make a system slow or unresponsive. This can lead to increased response times, slow processing, or even system crashes. For instance, when a database table becomes too large, query performance might degrade noticeably. A query like the following might take significantly longer to execute on a large table:
SELECT *
FROM large_table
WHERE some_column = 'some_value';
To resolve this issue, we can look into solutions like database partitioning, query optimization, or scaling out the database across multiple servers to distribute the load. We can also index the relevant columns to make the queries faster:
CREATE INDEX idx_some_column
ON large_table(some_column);
By creating an index on `some_column`, we can retrieve the data more efficiently, improving query performance even when the data volume is high.
Volume SLIs
Volume SLIs are used to define the performance of a system with respect to data volume. For example, it could be the throughput of a data pipeline when processing terabytes of data. When these SLIs do not meet the expectations, it may indicate an issue with the data volume.
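For instance, a simple volume SLI could be the pipeline's throughput in records per second; the record count, window, and target below are illustrative assumptions.
# Example Python sketch to compute a throughput SLI (illustrative numbers)
records_processed = 5_000_000       # records handled in the last window
window_seconds = 3_600              # a one-hour window
throughput = records_processed / window_seconds
print(f"Throughput: {throughput:.0f} records/second")
# Flag the pipeline if throughput falls below an assumed target of 1,000 records/second
if throughput < 1_000:
    print("Throughput SLI below target - investigate data volume issues.")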
3. NULL Values Tests
NULL values refer to missing or unknown data in a database. They can skew results, cause errors in calculations, and impact decision-making.
In SQL, we can test for NULL values using the `IS NULL` and `IS NOT NULL` operators. The query below selects all users whose age is NULL:
SELECT name, age
FROM users
WHERE age IS NULL;
To fix issues with NULLs, we can set default values or use COALESCE to substitute meaningful alternative values, wherever applicable:
UPDATE users
SET age = 18
WHERE age IS NULL;
The above SQL statement sets a default age of 18 for all users whose age is not specified. By addressing NULLs, we can ensure the completeness of the data.
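Alternatively, `COALESCE` can substitute a fallback value at query time without changing the stored data; a minimal sketch using the same `users` table (the fallback of 18 is illustrative):
-- Return a fallback age of 18 at query time when age is NULL,
-- leaving the stored data unchanged
SELECT name, COALESCE(age, 18) AS age
FROM users;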
4. Numeric Distribution Tests
Numeric distribution tests are useful when dealing with continuous or numeric data. They analyze the distribution of the data to ensure it aligns with expected patterns and ranges.
For example, statistical tests such as Kolmogorov-Smirnov or Shapiro-Wilk are used to test for normality, whereas Q-Q plots can visualize how closely the data follows a specific distribution. We can also detect outliers, skewness, and kurtosis through numeric distribution tests.
# Example Python script to check data normality
import numpy as np
import scipy.stats as stats
# Generate sample data
data = np.random.normal(loc=0, scale=1, size=100)
# Perform a Kolmogorov-Smirnov test to check for normality
ks_statistic, p_value = stats.kstest(data, 'norm')
print(f'KS Statistic: {ks_statistic}, P-value: {p_value}')
# If the p-value is less than 0.05, the distribution is likely not normal
if p_value < 0.05:
    print("The distribution is not normal.")
else:
    print("The distribution is normal.")
Inaccurate Data
Inaccurate data refers to data that contains errors, outliers or anomalies. This could be due to human error during data entry, measurement errors, processing errors, etc. We can use numeric distribution tests to identify unusual distributions in the data. For example, we can apply a normality test to check if the data follows a Gaussian distribution. If it deviates significantly, there might be inaccuracies in the data.
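As a simple sketch, a z-score check can flag values that sit unusually far from the rest of the distribution; the simulated data and the threshold of 3 standard deviations below are illustrative choices.
# Example Python sketch to flag potential outliers using z-scores
import numpy as np
# Simulated data with one injected anomaly
data = np.append(np.random.normal(loc=50, scale=5, size=100), 500)
z_scores = (data - data.mean()) / data.std()
# Values more than 3 standard deviations from the mean are flagged
outliers = data[np.abs(z_scores) > 3]
print("Potential outliers:", outliers)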
Data Variety
Data variety refers to the diversity of data formats and types within a dataset. This includes structured data (like numbers and dates), semi-structured data (like XML files), and unstructured data (like text, video, and audio). Numeric distribution tests can assess how values are distributed and suggest whether numeric data from different sources or units (for example, amounts in USD vs INR) can be reliably compared or combined, as sketched below.
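As a rough sketch, a two-sample Kolmogorov-Smirnov test can indicate whether two numeric samples (for example, transaction amounts from two sources after conversion to a common currency) are similar enough to combine; the data here is simulated.
# Example Python sketch comparing two numeric samples before combining them
import numpy as np
import scipy.stats as stats
# Simulated amounts from two sources, already converted to a common currency
amounts_source_a = np.random.normal(loc=100, scale=20, size=500)
amounts_source_b = np.random.normal(loc=100, scale=20, size=500)
# Two-sample KS test: a small p-value suggests the distributions differ
ks_statistic, p_value = stats.ks_2samp(amounts_source_a, amounts_source_b)
print(f'KS Statistic: {ks_statistic:.3f}, P-value: {p_value:.3f}')
if p_value < 0.05:
    print("Distributions differ - review before combining the sources.")
else:
    print("Distributions look comparable.")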
5. Uniqueness Tests
Suppose we have a scenario where primary keys or unique constraints need to be enforced on the data. We can use uniqueness tests to ensure that every record across a certain set of fields is unique and no duplicates are present.
For example, each user should have a unique ID in a `Users` table, as duplicates can lead to errors in data retrieval. Below is a Python script to demonstrate how a uniqueness test can be conducted on a dataset:
# Example Python script to check data uniqueness
import pandas as pd
# Sample DataFrame with user_id column
data = {
    'user_id': [101, 102, 103, 101],  # Notice the duplicate user_id '101'
    'name': ['Alice', 'Bob', 'Charlie', 'Daisy']
}
df = pd.DataFrame(data)
# Uniqueness test
if df['user_id'].is_unique:
    print("All values in the 'user_id' column are unique.")
else:
    print("There are duplicate values in the 'user_id' column.")
# Identifying duplicate entries
duplicates = df[df.duplicated('user_id')]
print(duplicates)
Once non-unique records are identified, we can clean the data by removing or consolidating duplicates, and implement constraints to prevent the insertion of duplicate records.
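For example, pandas can drop the duplicate rows found above, keeping the first occurrence of each `user_id`; whether to drop, merge, or manually review duplicates depends on the use case.
# Example Python sketch to remove duplicate user_id rows (keeps the first occurrence)
import pandas as pd
df = pd.DataFrame({
    'user_id': [101, 102, 103, 101],
    'name': ['Alice', 'Bob', 'Charlie', 'Daisy']
})
df_clean = df.drop_duplicates(subset='user_id', keep='first')
print(df_clean)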
6. Referential Integrity Tests
Referential integrity tests check the consistency of relationships between data across multiple tables in a relational database. They ensure that a value in the child table’s foreign key always points to an existing record in the parent table.
For example, consider a database of an online bookstore where there is a `Books` table with a primary key `BookID`, and an `Orders` table with a foreign key column `BookID` pointing to the `Books` table. Referential integrity ensures that every `BookID` listed in the `Orders` table must exist in the `Books` table.
To resolve issues with referential integrity, we can use cascading updates and deletes. This means that changes in the parent table are automatically reflected in the linked foreign key values in the child table. If a book is deleted from the `Books` table, its corresponding orders will also be deleted from the `Orders` table to preserve referential integrity.
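A minimal sketch of how this relationship can be declared in SQL, assuming the `Books` and `Orders` tables described above (exact syntax varies by database):
-- Enforce referential integrity with cascading deletes,
-- so orders referencing a deleted book are removed automatically
ALTER TABLE Orders
ADD CONSTRAINT fk_orders_books
FOREIGN KEY (BookID) REFERENCES Books(BookID)
ON DELETE CASCADE;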
7. String Patterns
String patterns in data can lead to issues if they are inconsistent and do not follow a standard format. For example, we might have dates in different formats such as “MM/DD/YYYY”, “DD-MM-YYYY”, or “1st January 2020”.
This can result in incorrect sorting, inability to accurately compare dates, and challenges in data integration. Similarly, text data may have spelling differences (e.g., Analyse vs Analyze), case differences (e.g., ‘New York’ vs ‘new york’), or unwanted whitespace.
To resolve these issues, we can use pattern matching through regular expressions to check whether a string is in the correct format. For dates, we can ensure that all entries match a standardized pattern such as “YYYY-MM-DD”. For text data, we can check that all entries start with a capital letter or match a list of valid values. Similarly, trimming functions can remove extra whitespace.
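As an illustration, the sketch below uses a regular expression to flag dates that do not match the “YYYY-MM-DD” pattern and then trims and normalizes text values; the sample values are made up.
# Example Python sketch to validate date formats and clean up text values
import re
dates = ['2020-01-01', '01/01/2020', '1st January 2020']
date_pattern = re.compile(r'^\d{4}-\d{2}-\d{2}$')
# Flag entries that do not follow the YYYY-MM-DD pattern
invalid_dates = [d for d in dates if not date_pattern.match(d)]
print("Dates not in YYYY-MM-DD format:", invalid_dates)
# Trim whitespace and normalize casing for text values
cities = ['  new york ', 'New York']
cleaned = [c.strip().title() for c in cities]
print(cleaned)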
Popular Data Testing Tools
Data testing tools use automated checks to ensure error detection and data consistency. They enable organizations to make reliable data-driven decisions by ensuring the data used is accurate. Some of the most popular data testing tools are as mentioned below:
- Deequ: It is an open-source library built on top of Apache Spark, and allows users to write unit tests for data. It provides metrics to measure data completeness and uniqueness to ensure data quality.
- Great Expectations: It is an open-source framework that simplifies the process of validating, documenting, and profiling data. It includes a library of common unit tests which enables users to create “expectations” or assertions, against which they can check their incoming data.
- Torch by Acceldata: It is an enterprise-level data observability platform that helps with data monitoring and validation. It focuses mainly on anomaly detection.
- Monte Carlo: It is a data reliability platform that uses machine learning to understand data, and alerts teams about issues preemptively.
- Databand: It is a data observability tool that integrates with existing workflows and provides insights into data health, pipeline performance, and resource management.
The below table compares the key features of these popular data testing tools:
| Factor | Deequ | Great Expectations | Torch by Acceldata | Monte Carlo | Databand |
| --- | --- | --- | --- | --- | --- |
| Open-source | Yes | Yes | No | No | Yes |
| Type of Tool | Data testing | Data testing and profiling | Data observability | Data reliability | Data observability |
| Technical Knowledge Required | High: Spark and Scala-based | Moderate: Python-based | Moderate: User-friendly; some advanced features | High: Data systems and monitoring data flows | Moderate: Python-based |
| Integrations | AWS (natively on EMR, S3) | Multiple databases, cloud warehouses, APIs | Hadoop, Spark, Kafka, APIs | Multiple data sources, cloud services, APIs | Multiple data sources through APIs |
Conclusion
It is crucial for any organization relying on data-driven decision-making to maintain high data quality. Common issues such as incorrect, inconsistent, incomplete, or duplicate data can significantly affect data quality. To mitigate these challenges, we can implement data quality checks such as freshness checks, uniqueness tests, and referential integrity tests. In this way, organizations can ensure the reliability of their data and use it to make well-informed decisions in a data-driven world.
FAQ
1. What are some common data quality issues?
Some common data quality issues are inaccurate data, inconsistent data, incomplete data, duplicate data, and poor data formatting.
2. What are the 7 tests for data quality checks?
The 7 tests for data quality checks are: Freshness checks, Volume tests, NULL values tests, Numeric distribution tests, Uniqueness tests, Referential integrity tests, and String patterns.
3. What are the Six Data Quality Dimensions?
The six data quality dimensions are: Accuracy, Completeness, Consistency, Timeliness, Uniqueness, and Validity.