Data quality checks validate data to ensure it is accurate, reliable, and fit for consumption. According to a Gartner study, poor data quality costs organizations an average of $12.9 million annually, so it is essential to ensure the quality of the data businesses use to make decisions.

This article introduces the fundamentals of data quality checks, common data quality issues, and 7 tests to ensure your data is accurate. Finally, we compare popular data testing tools.

What is Data Quality Testing?

Data quality testing is a process for checking whether the data in our system is correct, reliable, and useful. Imagine you have a list of names and phone numbers. Data quality testing ensures that every name matches the right number, like validating that Alice’s number isn’t accidentally listed under Bob. It is important to find and fix these errors so that decisions using this data are based on accurate information.

Common Data Quality Issues

A variety of issues can affect data quality, such as human error, flawed data collection processes, system errors, or problems introduced during data migration and integration.

Some of the most common data quality issues are listed below:

  • Data Inaccuracies, such as incorrect or outdated information
  • Data Inconsistencies, where data does not match across different sources or records
  • Data Duplication, where the same data appears multiple times, creating redundancy and confusion
  • Incomplete data, characterized by missing values or gaps in information
  • Poor data formatting, such as when date fields are inconsistently entered (For example: 01/01/2000, 2000-01-01 and 1st Jan 2000)

Consequences of Poor Data Quality

Poor data quality can impact companies’ business decisions, leading to financial loss, wasted resources, and lost opportunities. For example, a business relying on inaccurate customer data for its marketing campaigns may target the wrong customers and see reduced conversion rates. Similarly, in healthcare, patient safety can be at risk if medical records are inaccurate or incomplete.

If incorrect information leads to customer dissatisfaction, poor data quality can damage a company’s reputation. Ultimately, it can hinder an organization’s growth and the value it delivers to its customers.

7 Data Quality Checks

1. Freshness checks

Imagine an e-commerce platform that relies on hourly updates of its inventory to display stock to its users. Data freshness checks can help verify whether the data is updated within the expected time interval.

To implement freshness checks, we can use a script that compares the timestamp of the last update for each inventory record with the current time. If the data is stale beyond a certain threshold, it might trigger an alert or update process.

# Example Python script to check data freshness
from datetime import datetime, timedelta

# Simulate the timestamp of the last data update
last_update = datetime(2023, 4, 1, 12, 30)

# Define the freshness threshold (e.g., 2 hours)
freshness_threshold = timedelta(hours=2)

# Check if the data is fresh
current_time = datetime.now()
if current_time - last_update > freshness_threshold:
    print("Data is stale. Triggering update process.")
    # Trigger update process (e.g., through an API call)
else:
    print("Data is fresh.")

Freshness SLIs

Freshness Service Level Indicators (SLIs) provide a quantitative measure of how up-to-date the data is. It could be the time lag between when data is expected and when it becomes available. For example, in an e-commerce platform, the SLI can be the time a sale transaction takes to appear in the analytics dashboard.
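
For illustration, a freshness lag SLI can be computed as the difference between when data was expected and when it actually arrived. Below is a minimal sketch using hypothetical timestamps:

# Example Python sketch to compute a freshness lag SLI (illustrative timestamps)
from datetime import datetime

expected_at = datetime(2023, 4, 1, 13, 0)   # when the data was expected
arrived_at = datetime(2023, 4, 1, 13, 12)   # when it actually became available

# Freshness lag in minutes: the SLI we would track against an SLO target
freshness_lag_minutes = (arrived_at - expected_at).total_seconds() / 60
print(f"Freshness lag: {freshness_lag_minutes:.1f} minutes")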

2. Volume Tests

Data volume tests determine whether a system can handle large volumes of data at peak times. You can identify potential bottlenecks and scalability issues by simulating heavy load. For data volume testing, we can gradually increase the number of transactions per second (TPS) until the system shows signs of strain, as sketched below.
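
The following is a simplified, single-threaded sketch of that idea; real load tests use dedicated tooling and concurrency. The `send_transaction()` function is a hypothetical placeholder for a client call:

# Example Python sketch of a gradual TPS ramp (send_transaction() is hypothetical)
import time

def send_transaction():
    # Placeholder: submit one transaction and return True if it succeeded
    # within an acceptable latency. Replace with a real client call.
    return True

for tps in range(100, 1100, 100):  # ramp from 100 to 1000 transactions per step
    failures = 0
    start = time.time()
    for _ in range(tps):
        if not send_transaction():
            failures += 1
    elapsed = time.time() - start
    print(f"Target TPS: {tps}, elapsed: {elapsed:.2f}s, failures: {failures}")
    if failures > tps * 0.01:  # more than 1% failures indicates strain
        print("System shows signs of strain at this load.")
        break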

Missing Data

Missing data can lead to incomplete analysis and biased results. For example, in a dataset representing sales over time, missing entries for certain days can skew the understanding of sales performance. To identify missing data, we can use the script below to compare the number of expected versus actual records:

# Example Python script to check data volume
import pandas as pd

# Load data
df = pd.read_csv('sales_data.csv')

# Expected number of records per day (e.g., number of stores)
expected_records = 100

# Count actual records per day
actual_records = df.groupby('date').size()

# Identify dates with missing records
missing_data_dates = actual_records[actual_records < expected_records].index.tolist()

# Report missing data
print("Data was missing on the following dates: ", missing_data_dates)

Too Much Data

More data than expected can make a system slow or unresponsive. This can lead to increased response times, slow processing, or even system crashes. For instance, when a database table becomes too large, query performance might degrade noticeably. The following query might take significantly longer to execute on a large table:

SELECT *
FROM large_table
WHERE some_column = 'some_value';

To resolve this issue, we can look into solutions like database partitioning, query optimization, or scaling out the database across multiple servers to distribute the load. We can also index the relevant columns to make the queries faster:

CREATE INDEX idx_some_column
ON large_table(some_column);

By creating an index on `some_column`, we can retrieve the data more efficiently, improving query performance even when the data volume is high.

Volume SLIs

Volume SLIs measure the performance of a system with respect to data volume. For example, an SLI could be the throughput of a data pipeline when processing terabytes of data. When these SLIs fall short of expectations, it may indicate an issue with the data volume.
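
As a minimal illustration with hypothetical numbers, a throughput SLI is a simple ratio of records processed to elapsed time, compared against a target:

# Example Python sketch of a throughput SLI (illustrative numbers)
records_processed = 5_000_000   # records handled in the measurement window
elapsed_seconds = 600           # length of the measurement window
slo_target = 7_000              # hypothetical target: records per second

throughput = records_processed / elapsed_seconds
print(f"Throughput SLI: {throughput:.0f} records/second")
if throughput < slo_target:
    print("Volume SLI below target; investigate the pipeline.")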

3. NULL Values Tests

NULL values represent missing or unknown data in a database. They can skew results, cause errors in calculations, and impact decision-making.

In SQL, we can test for NULL values using the `IS NULL` and `IS NOT NULL` operators. The query below selects all users whose age is NULL:

SELECT name, age
FROM users
WHERE age IS NULL;

To fix issues with NULLs, we can set default values or use COALESCE to substitute meaningful alternatives wherever applicable:

UPDATE users
SET age = 18
WHERE age IS NULL;

The above SQL statement sets a default age of 18 for all users whose age is not specified. By addressing NULLs, we can ensure the completeness of the data.
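
The same kind of check can also be run before data reaches the database. Below is a minimal pandas sketch, assuming a hypothetical `users.csv` file with an `age` column:

# Example Python sketch to check for NULL (missing) values with pandas
import pandas as pd

df = pd.read_csv('users.csv')  # hypothetical file with 'name' and 'age' columns

# Count missing values per column
print(df.isnull().sum())

# Fill missing ages with a default value, similar to the SQL UPDATE above
df['age'] = df['age'].fillna(18)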

4. Numeric Distribution Tests

Numeric distribution tests are useful when dealing with continuous or numeric data. They analyze the distribution of the data to ensure it aligns with expected patterns and ranges.

For example, statistical tests such as Kolmogorov-Smirnov or Shapiro-Wilk are used to test for normality, whereas Q-Q plots can visualize how closely the data follows a specific distribution. We can also detect outliers, skewness, and kurtosis through numeric distribution tests.

# Example Python script to check data normality
import numpy as np
import scipy.stats as stats

# Generate sample data
data = np.random.normal(loc=0, scale=1, size=100)

# Perform a Kolmogorov-Smirnov test to check for normality
ks_statistic, p_value = stats.kstest(data, 'norm')
print(f'KS Statistic: {ks_statistic}, P-value: {p_value}')

# If the p-value is less than 0.05, the distribution is likely not normal
if p_value < 0.05:
    print("The distribution is not normal.")
else:
    print("The distribution is normal.")

Inaccurate Data

Inaccurate data refers to data that contains errors, outliers or anomalies. This could be due to human error during data entry, measurement errors, processing errors, etc. We can use numeric distribution tests to identify unusual distributions in the data. For example, we can apply a normality test to check if the data follows a Gaussian distribution. If it deviates significantly, there might be inaccuracies in the data.
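
For instance, here is a minimal sketch of outlier detection using the interquartile range (IQR) rule, assuming a small, hypothetical list of order amounts; the 1.5 × IQR threshold is a common convention, not a hard rule:

# Example Python sketch to flag outliers using the IQR rule (illustrative data)
import numpy as np

order_amounts = np.array([19, 20, 20, 21, 22, 23, 500])  # 500 looks suspicious

q1, q3 = np.percentile(order_amounts, [25, 75])
iqr = q3 - q1

# Values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are flagged as potential inaccuracies
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = order_amounts[(order_amounts < lower) | (order_amounts > upper)]
print("Potential outliers:", outliers)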

Data Variety

Data variety refers to the diversity of data formats and types within a dataset. This includes structured data (like numbers and dates), semi-structured data (like XML files), and unstructured data (like text, video, and audio). Numeric distribution tests can assess the distribution of data and suggest whether different types of numeric data (e.g., currency: USD vs INR) can be reliably compared or combined.

5. Uniqueness Tests

Suppose we have a scenario where primary keys or unique constraints need to be enforced on the data. We can use uniqueness tests to ensure that every record across a certain set of fields is unique and no duplicates are present.

For example, each user should have a unique ID in a `Users` table, as duplicates can lead to errors in data retrieval. Below is a Python script to demonstrate how a uniqueness test can be conducted on a dataset:

# Example Python script to check data uniqueness
import pandas as pd

# Sample DataFrame with user_id column
data = {
    'user_id': [101, 102, 103, 101],  # Notice the duplicate user_id '101'
    'name': ['Alice', 'Bob', 'Charlie', 'Daisy']
}
df = pd.DataFrame(data)

# Uniqueness test
if df['user_id'].is_unique:
    print("All values in the 'user_id' column are unique.")
else:
    print("There are duplicate values in the 'user_id' column.")
    # Identifying duplicate entries
    duplicates = df[df.duplicated('user_id')]
    print(duplicates)

Once non-unique records are identified, we can clean the data by removing or consolidating duplicates, and implement constraints to prevent the insertion of duplicate records.
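
As a minimal sketch of the cleanup step, assuming the same sample data as above, duplicates can be dropped while keeping the first occurrence of each `user_id`:

# Example Python sketch to remove duplicate records with pandas
import pandas as pd

df = pd.DataFrame({
    'user_id': [101, 102, 103, 101],
    'name': ['Alice', 'Bob', 'Charlie', 'Daisy']
})

# Keep the first occurrence of each user_id and drop the rest
deduplicated = df.drop_duplicates(subset='user_id', keep='first')
print(deduplicated)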

6. Referential Integrity Tests

Referential integrity tests check the consistency of relationships between data across multiple tables in a relational database. They ensure that a value in the child table’s foreign key always points to an existing record in the parent table.

For example, consider a database of an online bookstore where there is a `Books` table with a primary key `BookID`, and an `Orders` table with a foreign key column `BookID` pointing to the `Books` table. Referential integrity ensures that every `BookID` listed in the `Orders` table must exist in the `Books` table.
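
A simple way to test this outside the database is to look for orphaned foreign keys. Below is a minimal sketch, assuming the two tables are loaded into hypothetical pandas DataFrames named `books` and `orders`:

# Example Python sketch of a referential integrity check with pandas
import pandas as pd

books = pd.DataFrame({'BookID': [1, 2, 3]})
orders = pd.DataFrame({'OrderID': [10, 11, 12], 'BookID': [1, 2, 99]})  # 99 has no parent

# Orders whose BookID does not exist in the Books table violate referential integrity
orphaned = orders[~orders['BookID'].isin(books['BookID'])]
print("Orders referencing missing books:")
print(orphaned)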

To resolve issues with referential integrity, we can use cascading updates and deletes. This means that changes in the parent table are automatically reflected in the linked foreign key values in the child table. If a book is deleted from the `Books` table, its corresponding orders will also be deleted from the `Orders` table to preserve referential integrity.

7. String Patterns

String patterns in data can lead to issues if they are inconsistent and do not adhere to a standard format. For example, we might have dates in different formats such as “MM/DD/YYYY”, “DD-MM-YYYY”, or “1st January 2020”.

This can result in incorrect sorting, inability to accurately compare dates, and challenges in data integration. Similarly, text data may have spelling differences (e.g., Analyse vs Analyze), case differences (e.g., ‘New York’ vs ‘new york’), or unwanted whitespace.

To resolve these issues, we can use pattern matching with regular expressions to check whether a string is in the correct format. For dates, we can ensure that all entries match a standardized pattern such as “YYYY-MM-DD”. For text data, we can check whether all entries start with a capital letter or match a list of valid values. Similarly, trimming functions can be used to remove extra whitespace.
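
Here is a minimal sketch of such a pattern check in Python, assuming a hypothetical list of date strings; entries that do not match the “YYYY-MM-DD” pattern are flagged for correction:

# Example Python sketch to validate string patterns with a regular expression
import re

dates = ["2020-01-01", "01/01/2020", "1st January 2020", "2020-02-15"]

# Match the standardized YYYY-MM-DD format
pattern = re.compile(r"^\d{4}-\d{2}-\d{2}$")

invalid = [d for d in dates if not pattern.match(d.strip())]
print("Entries not matching YYYY-MM-DD:", invalid)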

Popular Data Testing Tools

Data testing tools use automated checks to detect errors and ensure data consistency. They enable organizations to make reliable data-driven decisions by ensuring the data used is accurate. Some of the most popular data testing tools are described below:

  • Deequ: An open-source library built on top of Apache Spark that allows users to write unit tests for data. It provides metrics to measure data completeness and uniqueness to ensure data quality.
  • Great Expectations: An open-source framework that simplifies the process of validating, documenting, and profiling data. It includes a library of common unit tests that lets users create “expectations,” or assertions, against which they can check their incoming data.
  • Torch by Acceldata: An enterprise-level data observability platform that helps with data monitoring and validation. It focuses mainly on anomaly detection.
  • Monte Carlo: A data reliability platform that uses machine learning to understand data and alert teams about issues preemptively.
  • Databand: A data observability tool that integrates with existing workflows and provides insights into data health, pipeline performance, and resource management.

The table below compares the key features of these popular data testing tools:

| Factor | Deequ | Great Expectations | Torch by Acceldata | Monte Carlo | Databand |
|---|---|---|---|---|---|
| Open-source | Yes | Yes | No | No | Yes |
| Type of Tool | Data testing | Data testing and profiling | Data observability | Data reliability | Data observability |
| Technical Knowledge Required | High: Spark and Scala-based | Moderate: Python-based | Moderate: User-friendly; some advanced features | High: Data systems and monitoring data flows | Moderate: Python-based |
| Integrations | AWS (natively on EMR, S3) | Multiple databases, cloud warehouses, APIs | Hadoop, Spark, Kafka, APIs | Multiple data sources, cloud services, APIs | Multiple data sources through APIs |

Conclusion

Maintaining high data quality is crucial for any organization relying on data-driven decision-making. Common issues such as incorrect, inconsistent, incomplete, or duplicate data can significantly affect data quality. To mitigate these challenges, we can implement data quality checks such as freshness checks, uniqueness tests, and referential integrity tests. By doing so, organizations can ensure the reliability of their data and use it to make well-informed decisions in this data-driven world.

FAQ

1. What are some common data quality issues?

Some common data quality issues are inaccurate data, inconsistent data, incomplete data, duplicate data, and poor data formatting.

2. What are the 7 tests for data quality checks?

The 7 tests for data quality checks are: Freshness checks, Volume tests, NULL values tests, Numeric distribution tests, Uniqueness tests, Referential integrity tests, and String patterns.

3. What are the Six Data Quality Dimensions?

The six data quality dimensions are: Accuracy, Completeness, Consistency, Timeliness, Uniqueness, and Validity.

Sakshi Kulshreshtha is a Data Engineer with 4+ years of experience in various domains, including finance and travel. Her specialization lies in Big Data Engineering tools like Spark, Hadoop, Hive, SQL, and Airflow for batch processing. Her work focuses on architecting data pipelines for collecting, storing and analyzing terabytes of data at scale. She also specializes in cloud-native technologies and is a certified AWS Solutions Architect Associate.