Data Quality Assessments
Important
Data quality assessments can only be carried out on valid AGS4 data files. The files MUST be validated first.
Overview
It has been possible to validate AGS data format files for many years using proprietary AGS 'checkers', and now the open source AGS Python library. However, the 'checkers' did not report on the quality of the actual data. They did report warnings or FYI messages about the data, but these were tailored for data import into specific software.
The ability to check data quality has been introduced in version 3 of the AGS desktop validator. It allows users to create bespoke data quality checks at a project, organisation or system level.
Process
Data quality requirements are defined in a JSON file. The assessment reports the quality of the data against the criteria set out in that file.
Note
The syntax of a JSON file is relatively simple; the format is defined at json.org. There are many online resources that provide help, file formatters and syntax checkers (linters).
Data quality is a detailed subject. This tool has been created using principles set out in the DAMA® Data Management Body of Knowledge (DAMA-DMBOK®).
The data quality assessment is an automated profiling tool that reports data quality as a qualitative term, categorising the data as good or poor quality.
Details of the data quality assessment are provided below.
A predefined data quality assessment has been implemented for National Highways, to pre-check data that is to be loaded on the Geotechnical and Drainage asset Management Service (GDMS). Details can be seen here.
In summary, the AGS data file is converted to an in-memory SpatiaLite (SQLite) database with simple tables that reflect the AGS groups. The heading rows that set out the TYPE and UNIT requirements are extracted to a single 'UNTY' table in the database. The data quality requirements are written as SQL statements, which are then run against the database.
Note
SpatiaLite is an extension of SQLite, a standalone relational database. It adds spatial functions, such as the ability to query the locations of exploratory holes against site boundaries.
The advantages of this approach are as follows:
- SQL is a natural query language and is extensively documented
- No coding experience is needed to generate queries (other than SQL).
- SQL statements can be tested during production using open source applications such as DB Browser.
- Conversion to a SQL database enables data types to be set as defined in the AGS format e.g. Numeric.
- Conversion to SpatiaLite enables spatial queries to be carried out against the exploratory point locations.
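The conversion and query steps can be sketched with Python's standard library `sqlite3` module (plain SQLite here; the real tool loads the SpatiaLite extension, and the table contents below are illustrative, not from a real AGS file):

```python
import sqlite3

# Illustrative rows for the LOCA group; a real file would have many more
# headings. These values are made up for the example.
loca_rows = [
    ("BH01", "CP", 12.5),
    ("BH02", "CP", None),   # missing ground level
    ("BH03", "TP", 3.2),
]

conn = sqlite3.connect(":memory:")  # in-memory database, as the tool uses
conn.execute("CREATE TABLE LOCA (LOCA_ID TEXT, LOCA_TYPE TEXT, LOCA_GL REAL)")
conn.executemany("INSERT INTO LOCA VALUES (?, ?, ?)", loca_rows)

# A data quality requirement expressed as SQL: the percentage of LOCA_GL
# values that are populated (COUNT(column) ignores NULLs).
pct = conn.execute(
    "SELECT 100.0 * COUNT(LOCA_GL) / COUNT(*) FROM LOCA"
).fetchone()[0]
print(round(pct, 2))  # 66.67
```

Because the data is held in typed SQL columns, numeric comparisons like this work directly, which is the main advantage of converting the AGS file before querying it.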
How to specify JSON data quality requirements
Data quality requirements are specified in a JSON file. An example JSON file with two data quality requirements is shown below. Additional requirements are added to the "rqmts" value array.
The key elements are:
- The plain language description of the data quality requirement.
- The SQL query that returns a value from the data.
- The pass criteria that denote whether the data is good or poor quality.
It is very important to write the plain language description of the requirement carefully. It helps the reader understand the results, and it helps specify and test the SQL statement.
Details of how to write SQL statements are beyond the scope of this guidance. There are many online resources available, as the SQL language has been in use for around 50 years.
Note
SQL queries must be written to return a number, which will be used in the pass criteria. SQL queries can be tested against AGS data by converting the AGS file to SpatiaLite (see file conversion pages in this guidance) and using applications such as DB Browser to run the queries.
The pass criteria comprises a simple comparison operator and a number.
```json
{
  "file data": {
    "profile name": "GDMS check",
    "version": "draft",
    "purpose": "Validation for upload onto GDMS",
    "organisation": "National Highways",
    "rqmts": [
      {
        "ref": "NH_GDMS_5",
        "description": "Check that 75% of LOCA_GL in LOCA table is populated",
        "dimension": "consistency",
        "sql": "SELECT printf ('%.2f', ((SELECT COUNT(*) FROM LOCA)-(SELECT COUNT(*) FROM LOCA WHERE length(LOCA_GL) = 0 or LOCA_GL is null))/((SELECT COUNT(*) FROM LOCA) / 100.0));",
        "error": "LOCA_GL column has missing values",
        "pass_criteria": [">=", 75]
      },
      {
        "ref": "NH_GDMS_2.1",
        "sql": "SELECT COUNT(*) FROM PROJ WHERE LENGTH(PROJ_NAME) > 0;",
        "description": "Check if PROJECT NAME/ID Exists in PROJ table",
        "error": "Error reading PROJ NAME/ID",
        "pass_criteria": [">=", 1]
      }
    ]
  }
}
```
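A minimal runner for such a file might look like the following sketch (the real implementation is in data_quality_checks_sl.py; the table contents and the single embedded requirement here are illustrative). Note the pass criteria is [">=", 1], i.e. the check passes when at least one populated PROJ_NAME row exists:

```python
import json
import operator
import sqlite3

# Comparison operators allowed in pass_criteria.
OPS = {"=": operator.eq, ">": operator.gt, ">=": operator.ge,
       "<": operator.lt, "<=": operator.le}

# A single-requirement profile, embedded as a string for the example.
profile = json.loads("""
{"file data": {"rqmts": [
    {"ref": "NH_GDMS_2.1",
     "description": "Check if PROJECT NAME/ID Exists in PROJ table",
     "sql": "SELECT COUNT(*) FROM PROJ WHERE LENGTH(PROJ_NAME) > 0;",
     "error": "Error reading PROJ NAME/ID",
     "pass_criteria": [">=", 1]}
]}}
""")

# Stand-in for the converted AGS file: one PROJ row with a project name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PROJ (PROJ_NAME TEXT)")
conn.execute("INSERT INTO PROJ VALUES ('A1 Improvement Scheme')")

results = []
for rqmt in profile["file data"]["rqmts"]:
    value = conn.execute(rqmt["sql"]).fetchone()[0]
    op, threshold = rqmt["pass_criteria"]
    passed = OPS[op](value, threshold)
    results.append((rqmt["ref"], "good" if passed else "poor"))

print(results)  # [('NH_GDMS_2.1', 'good')]
```

Each requirement is independent, so adding a new check is purely a matter of appending another object to the "rqmts" array.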
Field Descriptions
| Field | Type | Required | Description |
|---|---|---|---|
| `ref` | string | Yes | Unique reference ID (e.g., "NH_GDMS_2.1") |
| `sql` | string | Yes | SQL query to execute against the database |
| `description` | string | Yes | Human-readable description of what the rule checks |
| `error` | string | Yes | Message displayed when rule fails |
| `pass_criteria` | list[string, number] | Yes | Denotes the pass threshold for 'good' quality data |
pass_criteria Values
Pass criteria comprise a two-element JSON array. JSON arrays are denoted by square brackets. The first item in the array is a comparison operator and the second is a numeric value. The comparison operator is specified as a JSON string (enclosed in double quotes) and the numeric value is a number.
For example:
["=", 0]
[">", 75]
["<=", 10]
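Evaluating a criterion is then a small lookup from the operator string to a comparison function. A minimal sketch in Python (the function name is illustrative, not part of the tool):

```python
import operator

# Map the JSON comparison strings onto Python's comparison functions.
OPS = {"=": operator.eq, ">": operator.gt, ">=": operator.ge,
       "<": operator.lt, "<=": operator.le}

def passes(value, criteria):
    """Return True when the number a query returned meets the criteria."""
    op, threshold = criteria
    return OPS[op](value, threshold)

print(passes(80.0, [">=", 75]))  # True  -> 'good' quality
print(passes(5, ["=", 0]))       # False -> 'poor' quality
```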
Results/Output
The assessment provides a tabulated output of results.
It includes details about the input file, the requirements file and the tests carried out.
This can be copied to the clipboard, or saved to a file using the menu items at the top of the data quality assessment window.
An example is shown below:
Example Output:
```text
============================================================
AGS Data quality check
Date time: 2026-03-26 13:15:00
============================================================
Inputs
AGS filename: 732813_v5.ags
Data quality definition filename: nh_gdms_requirements.json
Data quality definition version: 1.0
Data quality profile purpose: Validation for upload onto GDMS
Data quality profile organisation: National Highways
============================================================
Results
Table of tests carried out
Test ref   Description                                                     result   data
                                                                                    quality
---------- --------------------------------------------------------------- -------- ---------
NH_GDMS_1  Check if PROJECT ID Exists in PROJ table                        true     good
NH_GDMS_2  Check if TRAN_ISNO Exists in TRAN table                         true     good
NH_GDMS_3  Check if TRAN_AGS Exists in TRAN table                          true     good
NH_GDMS_4  Check that LOCA_TYPE in LOCA table contains no empty values     false    poor
NH_GDMS_5  Check that 75% of LOCA_GL in LOCA table is populated            true     good
NH_GDMS_6  Check that 75% of LOCA_FDEP in LOCA table contains positive     true     good
           numbers
NH_GDMS_7  Check that LOCA_STAR and LOCA_ENDD data types are in the format false    poor
           'yyyy-mm-dd'
NH_GDMS_8  Percentage of exploratory holes with (both) (OS) co-ordinates:  true     good
           pass >=75%
NH_GDMS_9  Greater than 75% OSGB grid coordinates within England           true     good
============================================================
Summary
Number of tests carried out: 9
Number of tests passed: 7
============================================================
End of Report
```
System Architecture
Data Flow
```mermaid
graph TD
    B[data_quality_checks_sl.py]
    B --> C[Converts AGS file to SpatiaLite database]
    C --> D[Loads data quality specification]
    D --> E[Executes data quality tests: SQL validation queries]
    E --> F[Evaluates results against criteria]
    F --> G[Outputs data quality report including pass/fail]
```
Data quality module
data_quality_checks_sl.py
Note
This needs to be updated following merge
AGS data quality check module. This module isn't used at the moment: it uses SQLite and has been superseded by data_quality_checks_sl.py, which uses SpatiaLite.
create_dicts(gname, dframe)
Create a dataframe from UNIT and TYPE data.
Source code in data_quality_checks.py
exportags2db(ags_file_path, dbpath)
Main process to convert an AGS file to SQL tables.
Source code in data_quality_checks.py
parse_criteria(crit)
Split the pass criteria string.
Source code in data_quality_checks.py
carry_out_tests(conn, profile_qry, purpose, organisation)
Example test code. Loops over the input requirements and writes the test results.
Source code in data_quality_checks.py
format_results(dct_results, ags_fnm, dq_fnm, dq_ver)
Formats the data in the results dictionary. Output is a file-like object that can be used in an application.
Source code in data_quality_checks.py