
Data Quality Assessments

Important

Data quality assessments can only be carried out on valid AGS4 data files. The files MUST be validated first.

Overview

AGS format data files have been validated for many years using proprietary AGS 'checkers', and more recently the open source AGS Python library. However, the 'checkers' did not report on the quality of the actual data: they reported warnings or FYI messages about the data, but these were tailored to data import into specific software.

The ability to check data quality has been introduced in version 3 of the AGS desktop validator. It allows users to create bespoke data quality checks at a project, organisation or system level.

Process

Data quality requirements are defined in a JSON file. The assessment reports the quality of the data against the criteria set out in that file.

Note

The syntax of a JSON file is relatively simple; the definition can be found at json.org. There are many online resources that provide help, including file formatters and syntax checkers (linters).

Data quality is a detailed subject. This tool has been created using principles set out in the DAMA® Data Management Body of Knowledge (DAMA-DMBOK®).

The data quality assessment is an automated profiling tool that reports data quality as a qualitative term, based on the categorisation of data as good or poor quality.

Details of the data quality assessment are provided below.

A predefined data quality assessment has been implemented for National Highways, to pre-check data that is to be loaded on the Geotechnical and Drainage asset Management Service (GDMS). Details can be seen here.

In summary, the AGS data file is converted to an in-memory SpatiaLite (SQLite) database with simple tables that reflect the AGS groups. The heading rows that set out the TYPE and UNIT requirements are extracted to a single 'UNTY' table in the database. The data quality requirements are written as SQL statements, which are then run against the database.

Note

SpatiaLite is a standalone relational database engine. It is an extension of SQLite that adds spatial functions, such as the ability to query the locations of exploratory holes against site boundaries.

The advantages of this approach are as follows:

  1. SQL is a natural query language and is extensively documented.
  2. No coding experience is needed to generate queries (other than SQL).
  3. SQL statements can be tested during production using open source applications such as DB Browser.
  4. Conversion to a SQL database enables data types to be set as defined in the AGS format e.g. Numeric.
  5. Conversion to SpatiaLite enables spatial queries to be carried out against the exploratory point locations.
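The value of setting data types (advantage 4) can be seen with a short sqlite3 sketch: depth values stored as text compare lexicographically, while numeric values compare correctly. The table and sample values here are hypothetical, for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The same depths stored both as text and as numbers.
conn.execute("CREATE TABLE LOCA (LOCA_ID TEXT, FDEP_TEXT TEXT, FDEP_NUM REAL)")
conn.executemany("INSERT INTO LOCA VALUES (?, ?, ?)",
                 [("BH01", "9.5", 9.5), ("BH02", "10.2", 10.2)])

# Text comparison is lexicographic: '9.5' > '10' is true, which is wrong.
text_hits = conn.execute(
    "SELECT COUNT(*) FROM LOCA WHERE FDEP_TEXT > '10'").fetchone()[0]
# Numeric comparison behaves as expected: only 10.2 exceeds 10.
num_hits = conn.execute(
    "SELECT COUNT(*) FROM LOCA WHERE FDEP_NUM > 10").fetchone()[0]
print(text_hits, num_hits)  # 2 1
```

This is why the conversion step applies the AGS TYPE definitions before queries are run.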

How to specify json data quality requirements

Data quality requirements are specified in a JSON file. An example JSON file containing two data quality requirements is shown below. Additional requirements are added to the "rqmts" value array.

The key elements are:

  1. The plain language description of the data quality requirement.
  2. The SQL query that returns a value from the data.
  3. The pass criteria that denote whether the data is good or poor quality.

It is very important to write the plain language description of the requirement: it helps the reader understand the results, and it helps when specifying and testing the SQL statement.

Details of how to write SQL statements are beyond the scope of this guidance. There are many online resources available, as SQL has been in use since the 1970s.

Note

SQL queries must be written to return a number, which will be used in the pass criteria. SQL queries can be tested against AGS data by converting the AGS file to SpatiaLite (see file conversion pages in this guidance) and using applications such as DB Browser to run the queries.
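As an illustration, a requirement query can be tried against a plain SQLite database before it is added to the JSON file. The LOCA table and sample values below are hypothetical; the query itself is the LOCA_GL completeness check from the example requirements file.

```python
import sqlite3

# Build a small in-memory database with a hypothetical LOCA table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE LOCA (LOCA_ID TEXT, LOCA_GL REAL)")
conn.executemany(
    "INSERT INTO LOCA VALUES (?, ?)",
    [("BH01", 52.1), ("BH02", 48.7), ("BH03", None), ("BH04", 50.3)],
)

# Percentage of LOCA rows with LOCA_GL populated (as in requirement NH_GDMS_5).
sql = """
SELECT printf('%.2f',
    ((SELECT COUNT(*) FROM LOCA)
     - (SELECT COUNT(*) FROM LOCA WHERE length(LOCA_GL) = 0 OR LOCA_GL IS NULL))
    / ((SELECT COUNT(*) FROM LOCA) / 100.0));
"""
value = float(conn.execute(sql).fetchone()[0])
print(value)  # 3 of 4 rows populated -> 75.0, so [">=", 75] would pass
```

The same query can of course be run interactively in an application such as DB Browser.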

The pass criteria comprises a simple comparison operator and a number.

{
  "file data": {
    "profile name": "GDMS check",
    "version": "draft",
    "purpose": "Validation for upload onto GDMS",
    "organisation": "National Highways",
    "rqmts": [
      {
        "ref": "NH_GDMS_5",
        "description": "Check that 75% of LOCA_GL in LOCA table is populated",
        "dimension": "consistency",
        "sql": "SELECT printf ('%.2f', ((SELECT COUNT(*) FROM LOCA)-(SELECT COUNT(*) FROM LOCA WHERE length(LOCA_GL) = 0 or LOCA_GL is null))/((SELECT COUNT(*) FROM LOCA) / 100.0));",
        "error": "LOCA_GL column has missing values",
        "pass_criteria": [">=", 75]
      },
      {
        "ref": "NH_GDMS_2.1",
        "sql": "SELECT COUNT(*) FROM PROJ WHERE LENGTH(PROJ_NAME) > 0;",
        "description": "Check if PROJECT NAME/ID Exists in PROJ table",
        "error": "Error reading PROJ NAME/ID",
        "pass_criteria": ["=", 0]
      }
    ]
  }
}

Field Descriptions

Field          Type                  Required  Description
-------------  --------------------  --------  --------------------------------------------------
ref            string                Yes       Unique reference ID (e.g. "NH_GDMS_2.1")
sql            string                Yes       SQL query to execute against the database
description    string                Yes       Human-readable description of what the rule checks
error          string                Yes       Message displayed when the rule fails
pass_criteria  list[string, number]  Yes       The pass threshold for 'good' quality data
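To illustrate the fields above, a requirements file can be read and checked with a few lines of Python. The minimal requirement shown here is hypothetical, not one of the National Highways rules.

```python
import json

# A hypothetical, minimal requirements document with a single rule.
doc = json.loads("""
{
  "file data": {
    "profile name": "example",
    "version": "draft",
    "purpose": "illustration",
    "organisation": "example org",
    "rqmts": [
      {
        "ref": "EX_1",
        "description": "Check that the PROJ table contains exactly one record",
        "sql": "SELECT COUNT(*) FROM PROJ;",
        "error": "PROJ table does not contain exactly one record",
        "pass_criteria": ["=", 1]
      }
    ]
  }
}
""")

# Confirm each requirement carries the fields listed in the table above.
required = {"ref", "sql", "description", "error", "pass_criteria"}
for rqmt in doc["file data"]["rqmts"]:
    missing = required - rqmt.keys()
    print(rqmt["ref"], "missing:", sorted(missing))  # EX_1 missing: []
```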

pass_criteria Values

Pass criteria comprise a two-element JSON array. JSON arrays are denoted by square brackets. The first item in the array is a comparison operator and the second is a numeric value. The comparison operator is specified as a JSON string (enclosed in quotation marks) and the numeric value is written as a number.

For example:

["=", 0]
[">", 75]
["<=", 10]

The full JSON file used to carry out a data quality assessment for data uploaded to the National Highways GDMS database can be seen in these help pages.


Results/Output

The assessment provides a tabulated output of results.

It includes details about the input file, the requirements file and the tests carried out.

This can be copied to the clipboard, or saved to a file using the menu items at the top of the data quality assessment window.

An example is shown below:

Example Output:

============================================================
AGS Data quality check
Date time: 2026-03-26 13:15:00
============================================================
Inputs

AGS filename: 732813_v5.ags
Data quality definition filename: nh_gdms_requirements.json
Data quality definition version: 1.0
Data quality profile purpose: Validation for upload onto GDMS
Data quality profile organisation: National Highways
============================================================
Results

Table of tests carried out

Test ref    Description                                                      result    data
                                                                                       quality
----------  ---------------------------------------------------------------  --------  ---------
NH_GDMS_1   Check if PROJECT ID Exists in PROJ table                         true      good
NH_GDMS_2   Check if TRAN_ISNO Exists in TRAN table                          true      good
NH_GDMS_3   Check if TRAN_AGS Exists in TRAN table                           true      good
NH_GDMS_4   Check that LOCA_TYPE in LOCA table contains no empty values      false     poor
NH_GDMS_5   Check that 75% of LOCA_GL in LOCA table is populated             true      good
NH_GDMS_6   Check that 75% of LOCA_FDEP in LOCA table contains positive      true      good
            numbers
NH_GDMS_7   Check that LOCA_STAR and LOCA_ENDD data types are in the format  false     poor
            'yyyy-mm-dd'
NH_GDMS_8   Percentage of exploratory holes with (both) (OS) co-ordinates:   true      good
            pass >=75%
NH_GDMS_9   Greater than 75% OSGB grid coordinates within England            true      good

============================================================
Summary

Number of tests carried out: 9
Number of tests passed: 7
============================================================
          End of Report

System Architecture

Data Flow

graph TD
B[data_quality_checks_sl.py]
B --> C[Converts AGS file to SpatiaLite database]
C --> D[Loads data quality specification]
D --> E[Executes data quality tests SQL validation queries]
E --> F[Evaluates results against Criteria]
F --> G[Outputs Data Quality report including Pass/Fail]

Data quality module

data_quality_checks_sl.py

Note

This needs to be updated following merge

AGS data quality check module. This module isn't used at the moment: it uses SQLite and has been superseded by data_quality_checks_sl.py, which uses SpatiaLite.

create_dicts(gname, dframe)

Create a dataframe from unit and type data

Source code in data_quality_checks.py
def create_dicts(gname: str, dframe: pd.DataFrame) -> pd.DataFrame:
    """create dataframe from untit and type data"""
    df = dframe.transpose()
    df = df.reset_index()
    df.insert(0, 'group', gname)
    df.rename({'index': 'heading', 0: 'unit', 1: 'type'}, axis=1, inplace=True)
    df = df.iloc[1:]
    return df

exportags2db(ags_file_path, dbpath)

Main process to convert AGS file to SQL tables

Source code in data_quality_checks.py
def exportags2db(ags_file_path: str, dbpath: str):
    """ Main process to convert AGS file to SQL tables"""
    print(ags_file_path)
    print("Starting to convert ags to db")
    if dbpath == "memory":
        print("creating database")
        conn = sqlite3.connect(":memory:")
        print("Database connection created")
    else:
        if os.path.exists(dbpath):
            os.remove(dbpath)
        conn = sqlite3.connect(dbpath)

    try:
        tables, headings = AGS4.AGS4_to_dataframe(ags_file_path)
        print(ags_file_path)

        # ----populate database tables------
        lst2_df = []
        for key, value in tables.items():
            print(key)
            # include in here option to remove empty columns
            lst_df = create_dicts(key, value.iloc[:2])
            df = AGS4.convert_to_numeric(value)
            df.to_sql(key, conn, if_exists='replace', index=False)
            lst2_df.append(lst_df)
        # Create a single table for the UNIT and TYPE info
        df_dict = pd.concat(lst2_df, axis=0, ignore_index=True)

        df_dict = df_dict.sort_values(by=['group', 'heading'])
        print("create UNTY")
        df_dict.to_sql('UNTY', conn, if_exists='replace', index=False)
    except Exception as err:
        print(f"Unexpected {err=}, {type(err)=}")
        raise
    print("Conversion to SQLite complete")
    return conn

parse_criteria(crit)

Split a pass criteria string such as 'gt_75' into operator and value

Source code in data_quality_checks.py
def parse_criteria(crit: str) -> tuple:
    """Split a pass criteria string such as 'gt_75' into (operator, value)"""
    try:
        str_crit = crit.split("_")
        return str_crit[0], float(str_crit[1])
    except (IndexError, ValueError):
        print(f"invalid pass criteria: {crit}")
        raise

carry_out_tests(conn, profile_qry, purpose, organisation)

Loops over the input queries, runs each one and evaluates the result against its pass criteria

Source code in data_quality_checks.py
def carry_out_tests(conn: sqlite3.Connection, profile_qry: list, purpose: str,
                    organisation: str) -> dict:
    """Loops over the input queries, runs each one and
       evaluates the result against its pass criteria
    """
    print("Starting to carry out tests")
    cur = conn.cursor()

    dct_dqprofile_res = {"purpose": purpose, "organisation": organisation}
    lst_results = []
    for qry in profile_qry:
        cur.execute(qry["sql"])
        result = float(cur.fetchall()[0][0])
        operator, threshold = parse_criteria(qry["pass_criteria"])
        passed = False
        if operator == "e":
            passed = result == threshold
        elif operator == "gt":
            # Note: 'gt' is evaluated as greater than or equal to
            passed = result >= threshold
        if passed:
            dct_res = {"Test ref": qry['ref'],
                       "Description": qry['description'],
                       "result": "true", "data\nquality": "good"}
        else:
            dct_res = {"Test ref": qry['ref'],
                       "Description": qry['description'],
                       "result": "false", "data\nquality": "poor"}
        lst_results.append(dct_res)
    # Following line sorts results - make optional param
    lst_results = sorted(lst_results, key=lambda x: x['Test ref'])
    dct_dqprofile_res.update({"results": lst_results})

    conn.close()
    return dct_dqprofile_res

format_results(dct_results, ags_fnm, dq_fnm, dq_ver)

Formats the data in the results dictionary. Output is a file-like object that can be used in an application

Source code in data_quality_checks.py
def format_results(dct_results:dict, ags_fnm:str, dq_fnm:str, dq_ver:str):
    """Formats the data in the restlts dictionary
    Output is a file-like opject that can be used in an application"""

    int_no_of_tests = len(dct_results["results"])
    int_tests_passed = len([i for i in dct_results['results']
                            if i['data\nquality']=="good"])

    f_out = StringIO()
    f_out.write("=" * 60)
    f_out.write("\nAGS Data quality check")
    f_out.write(f"\nDate time: {datetime.now().replace(second=0, microsecond=0)}")
    f_out.write('\n')
    f_out.write("=" * 60)
    f_out.write(f"\nInputs")
    f_out.write('\n')
    f_out.write(f"\nAGS filename: {ags_fnm}")
    f_out.write(f"\nData quality definition filename: {dq_fnm}")
    f_out.write(f"\nData quality definition version: {dq_ver}")
    f_out.write(f"\nData quality profile purpose: {dct_results['purpose']}")
    f_out.write(f"\nData quality profile organisation: {dct_results['organisation']}")
    f_out.write('\n')
    f_out.write("=" * 60)
    f_out.write('\nResults')
    f_out.write('\n')
    f_out.write('\nTable of tests carried out')
    f_out.write('\n')
    f_out.write('\n')
    f_out.write(tabulate(dct_results["results"], headers="keys", maxcolwidths=[None, 65, None, 5]))
    f_out.write('\n')
    f_out.write('\n')
    f_out.write("=" * 60)
    f_out.write('\nSummary')
    f_out.write('\n')
    f_out.write(f"\nNumber of tests carried out: {int_no_of_tests}")
    f_out.write('\n')
    f_out.write(f"Nmber of tests passed: {int_tests_passed}")
    f_out.write('\n')
    f_out.write("=" * 60)
    f_out.write('\n          End of Report')

    return f_out