The exponential growth of data in recent years has made high-quality data an indispensable business asset. According to Gartner's estimate, poor-quality data costs organizations an average of $12.9 million annually.
AWS Glue is a powerful managed ETL service that provides data quality features to ensure your data is accurate, complete, and reliable. Glue's data quality rules automate checks and monitoring, reducing the chance of costly errors and helping you make more informed decisions.
What is a Data Quality Rule?
A data quality rule is a pre-defined standard or criterion for judging the quality of a dataset. It tells you whether the data meets expected norms: whether it is accurate, complete, and consistent. In AWS Glue, such rules let you automate data validation. Each rule independently represents one condition on the quality of your data; for example, a rule may validate that a given column holds no null values or that dates fall within a certain range.
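For illustration, a minimal DQDL ruleset expressing those two example checks might look like the following; the column names customer_id and order_date are hypothetical:
Rules = [
    IsComplete "customer_id",
    ColumnValues "order_date" > (now() - 30 days)
]
The first rule fails if customer_id contains any null values; the second fails if any order_date value is more than 30 days old.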
To learn more about Data Quality:
- What is Data Quality Management?
- How to Improve Data Quality: Tips & Strategies
- Most Common Data Quality Issues
Hevo ensures that your data transfers are secure and reliable, with no data loss. Experience end-to-end encryption for complete peace of mind.
- Zero Data Loss: Safeguard your data from source to destination.
- Complete Data Security: Enjoy encrypted transfers to protect your sensitive information.
- Efficient ETL/ELT: Process data in real-time without compromising security or performance.
Trust Hevo to handle your ETL/ELT needs with unparalleled efficiency and security.
Get Started with Hevo for Free
Data Quality Rule Categories
AWS Glue groups data quality rules into categories according to their purpose and the aspect of the data they check. Each category targets a different dimension of data integrity; common categories are shown in the table below, with a DQDL example for each after the table:
| Category | Description |
| --- | --- |
| Completeness | Validates that the selected fields do not contain null or empty values. |
| Accuracy | Checks that data entries are correct and follow the specified format or values. |
| Consistency | Checks that equivalent data entries are represented the same way across datasets. |
| Uniqueness | Ensures that values in the dataset are unique, guarding against duplicates. |
| Timeliness | Checks whether the data is fresh enough to reflect the most up-to-date values, usually determined by availability. |
| Validity | Checks whether data falls within specified ranges or conforms to the expected standard. |
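To make these categories concrete, here is an illustrative (not exhaustive) DQDL ruleset pairing one rule type with each category, in table order: Completeness, Accuracy, Consistency, Uniqueness, Timeliness, and Validity. The column names, regex, and thresholds are hypothetical, and the ReferentialIntegrity rule assumes a second dataset registered under the alias reference:
Rules = [
    IsComplete "customer_id",
    ColumnValues "email" matches "[a-zA-Z0-9._%-]+@[a-zA-Z0-9.-]+",
    ReferentialIntegrity "customer_id" "reference.customer_id" = 1.0,
    IsUnique "order_id",
    DataFreshness "order_date" <= 24 hours,
    ColumnValues "status" in ["PENDING", "SHIPPED", "DELIVERED"]
]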
Create a Data Quality Rule in AWS Glue
Creating a data quality rule in AWS Glue is straightforward when working with Data Catalog tables. Follow the steps below:
Step 1: Access the AWS Management Console
Log in to your AWS account and open AWS Glue from the list of services.
Step 2: Choose the Data Catalog
In the navigation pane, select Data Catalog. Click Databases to find your database.
Step 3: Select Your Table
Identify the table for which you would like to create data quality rules. It should be a table that already exists in the Data Catalog, typically created by an AWS Glue crawler or added manually.
Step 4: Open the Data Quality Tab
On the table details page, click the Data Quality tab to access the data quality features.
Step 5: Create a Ruleset
In the Rulesets section, click Create data quality rules. This opens the Data Quality Definition Language (DQDL) editor, where you can define your rules.
Step 6: Define Your Rules
In the DQDL editor, you can write rules manually or use the Recommend rules feature to generate rules automatically from the characteristics of your data. For example, you can add rules such as:
Rules = [
IsComplete "columnName",
ColumnLength "columnName" > 10
]
Step 7: Save Your Ruleset
After specifying your rules, save your ruleset. This will let you run evaluations against this set of quality checks later.
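If you prefer scripting to console clicks, the same ruleset can be created with the AWS SDK for Python (boto3). Below is a minimal sketch, assuming a hypothetical orders table in a sales_db database; the ruleset name is also made up:
import boto3

glue = boto3.client("glue")

# Attach a named DQDL ruleset to a Data Catalog table.
glue.create_data_quality_ruleset(
    Name="orders_basic_checks",  # hypothetical ruleset name
    Description="Basic completeness and length checks",
    Ruleset='Rules = [ IsComplete "columnName", ColumnLength "columnName" > 10 ]',
    TargetTable={"DatabaseName": "sales_db", "TableName": "orders"},
)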
Step 8: Run a Data Quality Evaluation
To evaluate your dataset against the newly created ruleset, navigate back to the Data Quality tab, select your ruleset from the list, and click Run. You may be prompted to choose an IAM role that AWS Glue can use to access the data.
Running the evaluation confirms that your data meets the predefined quality standards, so that decisions are based on valid information.
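The evaluation can also be triggered programmatically. Here is a hedged boto3 sketch reusing the hypothetical names from the earlier sketch; GlueDataQualityRole is a placeholder for an IAM role that can read the table:
import time
import boto3

glue = boto3.client("glue")

# Start an evaluation run of the ruleset against the catalog table.
run = glue.start_data_quality_ruleset_evaluation_run(
    DataSource={"GlueTable": {"DatabaseName": "sales_db", "TableName": "orders"}},
    Role="GlueDataQualityRole",  # placeholder IAM role
    RulesetNames=["orders_basic_checks"],
)

# Poll until the run reaches a terminal state.
while True:
    status = glue.get_data_quality_ruleset_evaluation_run(RunId=run["RunId"])
    if status["Status"] in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(15)

# Print each rule's outcome (PASS, FAIL, or ERROR).
for result_id in status.get("ResultIds", []):
    result = glue.get_data_quality_result(ResultId=result_id)
    for rule in result["RuleResults"]:
        print(rule["Name"], rule["Result"])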
Edit a Data Quality Rule in AWS Glue
Follow the steps below to edit a data quality rule in AWS Glue:
Step 1: Log in to the AWS Management Console
Access your AWS account and navigate to AWS Glue.
Step 2: Navigate to the Data Catalog
Navigate to the Data Catalog, then select Databases, and choose the relevant database containing your table.
Step 3: Select Your Table
Click on the table for which you want to edit data quality rules.
Step 4: Access the Data Quality Tab
On the table details page, go to the Data Quality tab.
Step 5: Locate Your Ruleset
In the Rulesets section, select the ruleset you wish to edit.
Step 6: Edit the Ruleset
Click on Actions, then select Edit. The DQDL editor will open with your existing rules displayed.
Step 7: Modify Your Rules
Make changes to your rules as needed. For example, you might change a rule from:
ColumnExists "record_id"
to:
ColumnExists "name"
Step 8: Save Changes
After editing your rules, save your changes to update the ruleset.
Step 9: Re-run the Evaluation if Necessary
You can run a new evaluation using this updated ruleset to see how these changes affect data quality.
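Programmatically, the same edit is a single boto3 call that replaces the ruleset's DQDL in place, again using the hypothetical ruleset name from the earlier sketches:
import boto3

glue = boto3.client("glue")

# Overwrite the existing DQDL with the corrected rule.
glue.update_data_quality_ruleset(
    Name="orders_basic_checks",
    Ruleset='Rules = [ ColumnExists "name" ]',
)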
Conclusion
Data should always be of high quality for an organization to draw usable insight from the information it handles. AWS Glue offers a set of powerful features to create and manage data quality rules, maintaining the accuracy, completeness, and reliability of your datasets over time. These rules help organizations reduce the risks that naturally follow from poor data quality.
Frequently Asked Questions (FAQs) on Data Quality Rules
What are the five rules of data quality?
The five most commonly cited rules are accuracy, completeness, consistency, timeliness, and validity.
What is the purpose of data quality management?
The primary purpose is to ensure that organizational data are accurate and valid for decision-making.
How to manage data quality?
Effective data quality management combines clear definitions of acceptable quality, regular audits, automated monitoring tools such as AWS Glue, staff training on best practices, and a culture of continuous improvement in handling organizational data.