The exponential growth of data has made high-quality data an indispensable business asset. Gartner estimates that poor-quality data costs organizations an average of $12.9 million annually.

AWS Glue is a powerful managed ETL service whose data quality features help keep your data accurate, complete, and reliable. Its data quality rules automate checks and monitoring, reducing the chance of costly errors and enabling better-informed decisions.

What is a Data Quality Rule?

A data quality rule is a pre-defined standard or criterion for judging the quality of a dataset. It tells us whether the data meets expected norms, and whether it is accurate, complete, and consistent. In AWS Glue, such rules help you automate the data validation process. Each rule independently represents one condition checked against your data. For example, a rule may validate that a given column holds no null values, or that dates fall within a certain range of values.
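In AWS Glue, these conditions are written in DQDL (Data Quality Definition Language), which you will meet in the steps later in this article. As a minimal sketch, the two example checks above could be expressed as follows (the column names are hypothetical, and the exact date-literal syntax should be confirmed against the DQDL reference):

Rules = [
    IsComplete "customer_id",
    ColumnValues "order_date" between "2020-01-01" and "2030-01-01"
]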


Secure ETL/ELT with Zero Data Loss Using Hevo

Hevo ensures that your data transfers are secure and reliable, with no data loss. Experience end-to-end encryption for complete peace of mind.

  • Zero Data Loss: Safeguard your data from source to destination.
  • Complete Data Security: Enjoy encrypted transfers to protect your sensitive information.
  • Efficient ETL/ELT: Process data in real-time without compromising security or performance.

Trust Hevo to handle your ETL/ELT needs with unparalleled efficiency and security.

Get Started with Hevo for Free

Data Quality Rule Categories

AWS Glue groups data quality rules into categories according to their purpose and the aspect of the data they check, each targeting a different facet of data integrity. The most common categories are described below, followed by example DQDL rules.

  • Completeness: Validates that the selected fields contain no null or empty values.
  • Accuracy: Checks that data entries are correct and follow the specified format or values.
  • Consistency: Checks that similar data entries are represented the same way across datasets.
  • Uniqueness: Ensures that values in the dataset are unique, which is necessary to avoid duplicates.
  • Timeliness: Indicates whether the data is fresh enough to reflect the most current values, usually determined by its availability.
  • Validity: Checks whether data falls within specified ranges or conforms to the expected standard.
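As a sketch, one illustrative DQDL rule per category, in the same order as the list above, might look like the following. All column names and the regular expression are hypothetical, and the exact set of rule types available depends on your version of AWS Glue Data Quality:

Rules = [
    IsComplete "customer_id",
    ColumnValues "email" matches "[^@]+@[^@]+",
    ColumnDataType "age" = "Integer",
    IsUnique "transaction_id",
    DataFreshness "last_updated" <= 24 hours,
    ColumnValues "rating" in [1, 2, 3, 4, 5]
]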

Create a Data Quality Rule in AWS Glue

Creating a data quality rule in AWS Glue on a Data Catalog table is straightforward. Follow the steps given below:

Step 1: Access the AWS Management Console

Log in to your AWS account and select AWS Glue from the list of services.

Step 2: Choose the Data Catalog

In the navigation pane, select Data Catalog. Click Databases to find your database.

Step 3: Select Your Table

Identify the table for which you want to create data quality rules. It must already exist in the Data Catalog, typically created by an AWS Glue crawler or added manually.

Step 4: Open the Data Quality Tab

On the table details page, click the Data Quality tab to access the data quality features.

[Image: Test table overview]

Step 5: Create a Ruleset

In the Rulesets section, click Create data quality rules. This opens the DQDL (Data Quality Definition Language) editor, where you can define your rules.

[Image: Create Data Quality Rules overview]

Step 6: Define Your Rules

In the DQDL editor, you can either write rules manually or use the Recommend rules feature to build rules automatically from the characteristics of your data. For example, you can add rules such as the following:

Rules = [
    IsComplete "columnName",
    ColumnLength "columnName" > 10
]
[Image: Define Data Quality Rules]
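By contrast, the Recommend rules feature derives rules from statistics computed over your data, so its output depends entirely on the dataset. A recommended ruleset might look roughly like this sketch (all column names, value sets, and thresholds are hypothetical):

Rules = [
    RowCount between 900 and 1100,
    IsComplete "order_id",
    Uniqueness "order_id" > 0.95,
    ColumnValues "status" in ["NEW", "SHIPPED", "DELIVERED"]
]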

Step 7: Save Your Ruleset

After specifying your rules, save your ruleset. This will let you run evaluations against this set of quality checks later.

Step 8: Run a Data Quality Evaluation

To evaluate your dataset against the newly created ruleset, navigate back to the Data Quality tab, select your ruleset from the list, and click Run.

[Image: Run the Data Quality Rule]

Running the evaluation confirms that your data meets the predefined quality standards, so you can make informed decisions based on valid information.

Edit a Data Quality Rule in AWS Glue

Follow these steps to edit a data quality rule in AWS Glue:

Step 1: Log in to the AWS Management Console

Access your AWS account and navigate to AWS Glue.

Step 2: Navigate to the Data Catalog

In the navigation pane, open the Data Catalog, select Databases, and choose the database containing your table.

[Image: Select your Database]

Step 3: Select Your Table

Click on the table for which you want to edit data quality rules.

Step 4: Access the Data Quality Tab

On the table details page, go to the Data Quality tab.

[Image: Test table overview]

Step 5: Locate Your Ruleset

In the Rulesets section, select the ruleset you wish to edit.

[Image: Load your Data Quality Rules]

Step 6: Edit the Ruleset

Click on Actions, then select Edit. The DQDL editor will open with your existing rules displayed.

[Image: Edit Rules]

Step 7: Modify Your Rules

Make changes to your rules as needed. For example, you might change a rule from:

ColumnExists "record_id" to ColumnExists "name"
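Assuming, for illustration, that the ruleset contains this check alongside the earlier example rules, the edited ruleset would then read:

Rules = [
    ColumnExists "name",
    IsComplete "columnName",
    ColumnLength "columnName" > 10
]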

Step 8: Save Changes

After editing your rules, save your changes to update the ruleset.

[Image: Save the Edits]

Step 9: Re-run the Evaluation if Necessary

Run a new evaluation with the updated ruleset to see how the changes affect your data quality.

[Image: Run the Edits]

Conclusion

Data must be of high quality for an organization to derive usable insight from the information it handles. AWS Glue offers a powerful set of features for creating and managing data quality rules, helping you maintain the accuracy, completeness, and reliability of your datasets over time. These rules enable organizations to reduce the risks that poor data quality naturally brings.

Frequently Asked Questions (FAQs) on Data Quality Rules

What are the five rules of data quality?

The five most commonly cited rules (or dimensions) are accuracy, completeness, consistency, timeliness, and validity.

What is the purpose of data quality management?

The primary purpose is to ensure that organizational data is accurate and valid for decision-making.

How to manage data quality?

Effective management starts with clear definitions of acceptable quality, supported by regular audits, automated monitoring tools such as AWS Glue, staff training on best practices, and a culture of continuous improvement in how organizational data is handled.

Raju is a Certified Data Engineer and Data Science & Analytics Specialist with over 8 years of experience in the technical field and 5 years in the data industry. He excels in providing end-to-end data solutions, from extraction and modeling to deploying dynamic data pipelines and dashboards. His enthusiasm for data architecture and visualization motivates him to create informative technical content that simplifies complicated concepts for data practitioners and business leaders.