
Data Quality Assessments

Important

Data quality assessments can only be carried out on valid AGS4 data files. The files MUST be validated first.

Overview

AGS format data files have been validated for many years using proprietary AGS 'checkers', and more recently the open source AGS Python library. However, the 'checkers' did not report on the quality of the actual data: they reported warnings or FYI messages about the data, but these were tailored to data import into specific software.

The ability to check data quality has been introduced in version 3 of the AGS desktop validator. It allows users to create bespoke data quality checks at a project, organisation or system level.

Process

Data quality requirements are defined in a JSON file. The assessment reports the quality of the data against the criteria set out in that file.

Note

The syntax of a JSON file is relatively simple; the definition can be found at json.org. There are many online resources that provide help, including file formatters and syntax checkers (linters).

Data quality is a detailed subject. This tool has been created using principles set out in the DAMA® Data Management Body of Knowledge (DAMA-DMBOK®).

The data quality assessment is an automated profiling tool that reports data quality as a qualitative term, based on the categorisation of data as good or poor quality.

Details of the data quality assessment are provided below.

A predefined data quality assessment has been implemented for National Highways, to pre-check data that is to be loaded on the Geotechnical and Drainage asset Management Service (GDMS). Details can be seen here.

In summary, the AGS data file is converted to an in-memory SpatiaLite (SQLite) database with simple tables that reflect the AGS groups. The heading rows that set out the TYPE and UNIT requirements are extracted to a single 'UNTY' table in the database. The data quality requirements are written as SQL statements, which are then run against the database.

Note

SpatiaLite is a standalone relational database engine. It is an extension of SQLite that adds spatial functions, such as the ability to query the locations of exploratory holes against site boundaries.

The advantages of this approach are as follows:

  1. SQL is a natural query language and is extensively documented.
  2. No coding experience is needed to generate queries (other than SQL).
  3. SQL statements can be tested during production using open source applications such as DB Browser.
  4. Conversion to a SQL database enables data types to be set as defined in the AGS format e.g. Numeric.
  5. Conversion to SpatiaLite enables spatial queries to be carried out against the exploratory point locations.
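The value of setting data types (advantage 4) can be seen with a short sqlite3 sketch: depth values stored as text compare lexicographically, while numeric values compare correctly. The table and sample values here are hypothetical, for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The same depths stored both as text and as numbers.
conn.execute("CREATE TABLE LOCA (LOCA_ID TEXT, FDEP_TEXT TEXT, FDEP_NUM REAL)")
conn.executemany("INSERT INTO LOCA VALUES (?, ?, ?)",
                 [("BH01", "9.5", 9.5), ("BH02", "10.2", 10.2)])

# Text comparison is lexicographic: '9.5' > '10' is true, which is wrong.
text_hits = conn.execute(
    "SELECT COUNT(*) FROM LOCA WHERE FDEP_TEXT > '10'").fetchone()[0]
# Numeric comparison behaves as expected: only 10.2 exceeds 10.
num_hits = conn.execute(
    "SELECT COUNT(*) FROM LOCA WHERE FDEP_NUM > 10").fetchone()[0]
print(text_hits, num_hits)  # 2 1
```

This is why the conversion step applies the AGS TYPE definitions before queries are run.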

How to specify json data quality requirements

Data quality requirements are specified in a JSON file. An example JSON file containing two data quality requirements is shown below. Additional requirements are added to the "rqmts" value array.

The key elements are:

  1. The plain language description of the data quality requirement.
  2. The SQL query that returns a value from the data.
  3. The pass criteria that denote whether the data is good or poor quality.

It is very important to write the plain language description of the requirement: it helps the reader understand the results, and it helps when specifying and testing the SQL statement.

Details of how to write SQL statements are beyond the scope of this guidance. There are many online resources available, as SQL has been in use since the 1970s.

Note

SQL queries must be written to return a number, which will be used in the pass criteria. SQL queries can be tested against AGS data by converting the AGS file to SpatiaLite (see file conversion pages in this guidance) and using applications such as DB Browser to run the queries.
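As an illustration, a requirement query can be tried against a plain SQLite database before it is added to the JSON file. The LOCA table and sample values below are hypothetical; the query itself is the LOCA_GL completeness check from the example requirements file.

```python
import sqlite3

# Build a small in-memory database with a hypothetical LOCA table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE LOCA (LOCA_ID TEXT, LOCA_GL REAL)")
conn.executemany(
    "INSERT INTO LOCA VALUES (?, ?)",
    [("BH01", 52.1), ("BH02", 48.7), ("BH03", None), ("BH04", 50.3)],
)

# Percentage of LOCA rows with LOCA_GL populated (as in requirement NH_GDMS_5).
sql = """
SELECT printf('%.2f',
    ((SELECT COUNT(*) FROM LOCA)
     - (SELECT COUNT(*) FROM LOCA WHERE length(LOCA_GL) = 0 OR LOCA_GL IS NULL))
    / ((SELECT COUNT(*) FROM LOCA) / 100.0));
"""
value = float(conn.execute(sql).fetchone()[0])
print(value)  # 3 of 4 rows populated -> 75.0, so [">=", 75] would pass
```

The same query can of course be run interactively in an application such as DB Browser.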

The pass criteria comprises a simple comparison operator and a number.

{
  "file data": {
    "profile name": "GDMS check",
    "version": "draft",
    "purpose": "Validation for upload onto GDMS",
    "organisation": "National Highways",
    "rqmts": [
      {
        "ref": "NH_GDMS_5",
        "description": "Check that 75% of LOCA_GL in LOCA table is populated",
        "dimension": "consistency",
        "sql": "SELECT printf ('%.2f', ((SELECT COUNT(*) FROM LOCA)-(SELECT COUNT(*) FROM LOCA WHERE length(LOCA_GL) = 0 or LOCA_GL is null))/((SELECT COUNT(*) FROM LOCA) / 100.0));",
        "error": "LOCA_GL column has missing values",
        "pass_criteria": [">=", 75]
      },
      {
        "ref": "NH_GDMS_2.1",
        "sql": "SELECT COUNT(*) FROM PROJ WHERE LENGTH(PROJ_NAME) > 0;",
        "description": "Check if PROJECT NAME/ID Exists in PROJ table",
        "error": "Error reading PROJ NAME/ID",
        "pass_criteria": ["=", 0]
      }
    ]
  }
}

Field Descriptions

Field          Type                  Required  Description
-------------  --------------------  --------  --------------------------------------------------
ref            string                Yes       Unique reference ID (e.g. "NH_GDMS_2.1")
sql            string                Yes       SQL query to execute against the database
description    string                Yes       Human-readable description of what the rule checks
error          string                Yes       Message displayed when the rule fails
pass_criteria  list[string, number]  Yes       The pass threshold for 'good' quality data
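To illustrate the fields above, a requirements file can be read and checked with a few lines of Python. The minimal requirement shown here is hypothetical, not one of the National Highways rules.

```python
import json

# A hypothetical, minimal requirements document with a single rule.
doc = json.loads("""
{
  "file data": {
    "profile name": "example",
    "version": "draft",
    "purpose": "illustration",
    "organisation": "example org",
    "rqmts": [
      {
        "ref": "EX_1",
        "description": "Check that the PROJ table contains exactly one record",
        "sql": "SELECT COUNT(*) FROM PROJ;",
        "error": "PROJ table does not contain exactly one record",
        "pass_criteria": ["=", 1]
      }
    ]
  }
}
""")

# Confirm each requirement carries the fields listed in the table above.
required = {"ref", "sql", "description", "error", "pass_criteria"}
for rqmt in doc["file data"]["rqmts"]:
    missing = required - rqmt.keys()
    print(rqmt["ref"], "missing:", sorted(missing))  # EX_1 missing: []
```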

pass_criteria Values

Pass criteria comprise a two-element JSON array. JSON arrays are denoted by square brackets. The first item in the array is a comparison operator and the second is a numeric value. The comparison operator is specified as a JSON string (enclosed in quotation marks) and the numeric value is written as a number.

For example:

["=", 0]
[">", 75]
["<=", 10]

The full JSON file used to carry out a data quality assessment for data uploaded to the National Highways GDMS database can be seen in these help pages.


Results/Output

The assessment provides a tabulated output of results.

It includes details about the input file, the requirements file and the tests carried out.

This can be copied to the clipboard, or saved to a file using the menu items at the top of the data quality assessment window.

An example is shown below:

Example Output:

============================================================
AGS Data quality check
Date time: 2026-03-26 13:15:00
============================================================
Inputs

AGS filename: 732813_v5.ags
Data quality definition filename: nh_gdms_requirements.json
Data quality definition version: 1.0
Data quality profile purpose: Validation for upload onto GDMS
Data quality profile organisation: National Highways
============================================================
Results

Table of tests carried out

Test ref    Description                                                      result    data
                                                                                       quality
----------  ---------------------------------------------------------------  --------  ---------
NH_GDMS_1   Check if PROJECT ID Exists in PROJ table                         true      good
NH_GDMS_2   Check if TRAN_ISNO Exists in TRAN table                          true      good
NH_GDMS_3   Check if TRAN_AGS Exists in TRAN table                           true      good
NH_GDMS_4   Check that LOCA_TYPE in LOCA table contains no empty values      false     poor
NH_GDMS_5   Check that 75% of LOCA_GL in LOCA table is populated             true      good
NH_GDMS_6   Check that 75% of LOCA_FDEP in LOCA table contains positive      true      good
            numbers
NH_GDMS_7   Check that LOCA_STAR and LOCA_ENDD data types are in the format  false     poor
            'yyyy-mm-dd'
NH_GDMS_8   Percentage of exploratory holes with (both) (OS) co-ordinates:   true      good
            pass >=75%
NH_GDMS_9   Greater than 75% OSGB grid coordinates within England            true      good

============================================================
Summary

Number of tests carried out: 9
Number of tests passed: 7
============================================================
          End of Report

System Architecture

Data Flow

graph TD
B[data_quality_checks_sl.py]
B --> C[Converts AGS file to SpatiaLite database]
C --> D[Loads data quality specification]
D --> E[Executes data quality tests SQL validation queries]
E --> F[Evaluates results against Criteria]
F --> G[Outputs Data Quality report including Pass/Fail]

Data quality module

data_quality_checks_sl.py

Note

This needs to be updated following merge

AGS data quality check module. This module isn't used at the moment: it uses SQLite and has been superseded by data_quality_checks_sl.py, which uses SpatiaLite.

create_dicts(gname, dframe)

Create a dataframe from unit and type data

Source code in data_quality_checks.py
def create_dicts(gname: str, dframe: pd.DataFrame) -> pd.DataFrame:
    """create dataframe from untit and type data"""
    df = dframe.transpose()
    df = df.reset_index()
    df.insert(0, 'group', gname)
    df.rename({'index': 'heading', 0: 'unit', 1: 'type'}, axis=1, inplace=True)
    df = df.iloc[1:]
    return df

exportags2db(ags_file_path, dbpath)

Main process to convert AGS file to SQL tables

Source code in data_quality_checks.py
def exportags2db(ags_file_path: str, dbpath: str):
    """ Main process to convert AGS file to SQL tables"""
    print(ags_file_path)
    print("Starting to convert ags to db")
    if dbpath == "memory":
        print("creating database")
        conn = sqlite3.connect(":memory:")
        print("Database connection created")
    else:
        if os.path.exists(dbpath):
            os.remove(dbpath)
        conn = sqlite3.connect(dbpath)

    try:
        tables, headings = AGS4.AGS4_to_dataframe(ags_file_path)
        print(ags_file_path)

        # ----populate database tables------
        lst2_df = []
        for key, value in tables.items():
            print(key)
            # include in here option to remove empty columns
            lst_df = create_dicts(key, value.iloc[:2])
            df = AGS4.convert_to_numeric(value)
            df.to_sql(key, conn, if_exists='replace', index=False)
            lst2_df.append(lst_df)
        # Create a single table for the UNIT and TYPE info
        df_dict = pd.concat(lst2_df, axis=0, ignore_index=True)

        df_dict = df_dict.sort_values(by=['group', 'heading'])
        print("create UNTY")
        df_dict.to_sql('UNTY', conn, if_exists='replace', index=False)
    except Exception as err:
        print(f"Unexpected {err=}, {type(err)=}")
        raise
    print("Conversion to SQLite complete")
    return conn

parse_criteria(crit)

Split a pass criteria string such as 'gt_75' into operator and value

Source code in data_quality_checks.py
def parse_criteria(crit: str) -> tuple:
    """Split a pass criteria string such as 'gt_75' into (operator, value)"""
    try:
        str_crit = crit.split("_")
        return str_crit[0], float(str_crit[1])
    except (IndexError, ValueError):
        print(f"invalid pass criteria: {crit}")
        raise

carry_out_tests(conn, profile_qry, purpose, organisation)

Loops over the input queries, runs each one and evaluates the result against its pass criteria

Source code in data_quality_checks.py
def carry_out_tests(conn: sqlite3.Connection, profile_qry: list, purpose: str,
                    organisation: str) -> dict:
    """Loops over the input queries, runs each one and
       evaluates the result against its pass criteria
    """
    print("Starting to carry out tests")
    cur = conn.cursor()

    dct_dqprofile_res = {"purpose": purpose, "organisation": organisation}
    lst_results = []
    for qry in profile_qry:
        cur.execute(qry["sql"])
        result = float(cur.fetchall()[0][0])
        operator, threshold = parse_criteria(qry["pass_criteria"])
        passed = False
        if operator == "e":
            passed = result == threshold
        elif operator == "gt":
            # Note: 'gt' is evaluated as greater than or equal to
            passed = result >= threshold
        if passed:
            dct_res = {"Test ref": qry['ref'],
                       "Description": qry['description'],
                       "result": "true", "data\nquality": "good"}
        else:
            dct_res = {"Test ref": qry['ref'],
                       "Description": qry['description'],
                       "result": "false", "data\nquality": "poor"}
        lst_results.append(dct_res)
    # Following line sorts results - make optional param
    lst_results = sorted(lst_results, key=lambda x: x['Test ref'])
    dct_dqprofile_res.update({"results": lst_results})

    conn.close()
    return dct_dqprofile_res

format_results(dct_results, ags_fnm, dq_fnm, dq_ver)

Formats the data in the results dictionary. Output is a file-like object that can be used in an application

Source code in data_quality_checks.py
def format_results(dct_results:dict, ags_fnm:str, dq_fnm:str, dq_ver:str):
    """Formats the data in the restlts dictionary
    Output is a file-like opject that can be used in an application"""

    int_no_of_tests = len(dct_results["results"])
    int_tests_passed = len([i for i in dct_results['results']
                            if i['data\nquality']=="good"])

    f_out = StringIO()
    f_out.write("=" * 60)
    f_out.write("\nAGS Data quality check")
    f_out.write(f"\nDate time: {datetime.now().replace(second=0, microsecond=0)}")
    f_out.write('\n')
    f_out.write("=" * 60)
    f_out.write(f"\nInputs")
    f_out.write('\n')
    f_out.write(f"\nAGS filename: {ags_fnm}")
    f_out.write(f"\nData quality definition filename: {dq_fnm}")
    f_out.write(f"\nData quality definition version: {dq_ver}")
    f_out.write(f"\nData quality profile purpose: {dct_results['purpose']}")
    f_out.write(f"\nData quality profile organisation: {dct_results['organisation']}")
    f_out.write('\n')
    f_out.write("=" * 60)
    f_out.write('\nResults')
    f_out.write('\n')
    f_out.write('\nTable of tests carried out')
    f_out.write('\n')
    f_out.write('\n')
    f_out.write(tabulate(dct_results["results"], headers="keys", maxcolwidths=[None, 65, None, 5]))
    f_out.write('\n')
    f_out.write('\n')
    f_out.write("=" * 60)
    f_out.write('\nSummary')
    f_out.write('\n')
    f_out.write(f"\nNumber of tests carried out: {int_no_of_tests}")
    f_out.write('\n')
    f_out.write(f"Nmber of tests passed: {int_tests_passed}")
    f_out.write('\n')
    f_out.write("=" * 60)
    f_out.write('\n          End of Report')

    return f_out