Athora Data Quality (ADQ)

Athora Data Quality (ADQ) is a tool for measuring and monitoring data quality. The framework is designed as a self-service tool for all employees within Athora Netherlands. In addition to full self-service, it also supports a more controlled setup in which data quality rules are defined by the owners of a data domain. The framework is built for use with Databricks and the Athora Data Quality User Interface, but can be extended to support other engines as well.

How does Athora Data Quality work?

Athora Data Quality works by defining data quality criteria. There are two types of criteria:

  1. Profiling Criteria: Profiling criteria profile the columns of a single table, measuring values on a row-by-row basis.

  2. Business Rules: Business rules can measure a combination of multiple columns and tables, and allow you to freely define rules in SQL syntax. Business rules can also join multiple datasets together to check whether the data meets certain requirements.

Technical components of the Athora Data Quality Framework

ADQ lets you define data quality rules through an API and monitor the quality of your data through a web interface. In addition to the web interface and the API, Athora Data Quality also contains a set of Python tools that allow you to measure the quality of your data and interact with the ADQ framework. These tools can be used from any Python environment, including Jupyter notebooks and Databricks.

The web interface and the Python tools both use the API to interact with the framework. The API acts as the core of the framework, which makes it possible to integrate ADQ with other tools and systems.

graph BT;
    A[Web Interface] --> B[API];
    C[Python Package] --> B;
    B --> D[Database];
    E[3rd Party Tools] --> B;
    F[Databricks] --> C;
    G[Jupyter] --> C;
    H[Other Python Clients] --> C;

Figure 1: Architecture of Athora Data Quality

Quickstart: measuring data using a Jupyter/Databricks notebook

Step 1: Install the Python package

Install the latest version of the ADQ package using the following command:

%pip install adq

You can also install a specific version of the package by specifying the version number. For example, to install version 0.1.0, use the following command:

%pip install adq==0.1.0

Or use a compatible-release specifier to automatically pick up minor and patch updates while excluding major updates. See PEP 440 for more information.

%pip install adq~=0.1

Step 2: Configure the client

The client can be configured in two ways: you can either use environment variables or pass the configuration directly to the client.

Option I: Using environment variables

You can set the following environment variables to configure the client:

  • DQ__CLIENT__ID: The client ID that is used to authenticate with the API (example: e6f53d0a-0004-4a2a-84f7-3c394c783b99).

  • DQ__CLIENT__SECRET: The client secret that is used to authenticate with the API (example: client_secret).

  • DQ__CLIENT__API_URL: The URL of the API (example: https://api.dataquality.athora.nl).

  • DQ__CLIENT__API_TENANT_ID: The tenant ID (example: e6f53d0a-0004-4a2a-84f7-3c394c783b99).

  • DQ__CLIENT__API_AUTHORITY: The authority (example: https://login.microsoftonline.com/e6f53d0a-0004-4a2a-84f7-3c394c783b99).

  • DQ__CLIENT__API_SCOPE: The scope of the API (example: https://[APP_ID]/Measurement.Post).

This will automatically configure the client with the correct settings.
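If you are working in a notebook, you can also set these variables from Python with `os.environ` before importing the package. This is a sketch with placeholder values; substitute your own credentials:

```python
import os

# Placeholder values -- replace with your own credentials and tenant details.
os.environ["DQ__CLIENT__ID"] = "00000000-0000-0000-0000-000000000000"
os.environ["DQ__CLIENT__SECRET"] = "client_secret"
os.environ["DQ__CLIENT__API_URL"] = "https://api.dataquality.athora.nl"
os.environ["DQ__CLIENT__API_TENANT_ID"] = "00000000-0000-0000-0000-000000000000"
os.environ["DQ__CLIENT__API_AUTHORITY"] = (
    "https://login.microsoftonline.com/00000000-0000-0000-0000-000000000000"
)
os.environ["DQ__CLIENT__API_SCOPE"] = "https://[APP_ID]/Measurement.Post"
```

Note that environment variables set this way only apply to the current Python process, so they need to be set before the client is first configured.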

Option II: Passing the configuration directly to the client

You can also pass the configuration directly to the client. This is useful when you want to use multiple clients in the same environment.


from adq.client import ClientSettings, HttpClient, BusinessRulePostClient, MeasurementGet

settings = ClientSettings(
  CLIENT_ID="e6f53d0a-0004-4a2a-84f7-3c394c783b99",
  CLIENT_SECRET="client_secret",
  API_URL="https://api.dataquality.athora.nl",
  API_TENANT_ID="e6f53d0a-0004-4a2a-84f7-3c394c783b99",
  API_AUTHORITY="https://login.microsoftonline.com/e6f53d0a-0004-4a2a-84f7-3c394c783b99",
  API_SCOPE="https://[APP_ID]/Measurement.Post"
)

client = HttpClient(settings)

Step 3: Configure a domain and datasource

Before you can start measuring data, you need to configure a domain and a datasource: the domain is the data domain you are measuring data for, and the datasource is the source the data comes from. You can use the API documentation to see which domains and datasources are available, and to add or modify them.

In the future, you will be able to configure the domain and datasource using the web interface.

Step 4: Use cell magic to run a measurement

Import the package in order to use the cell magic.

import adq.client

Now you can use the %%profile cell magic to start a ProfilingCriteria measurement (see the explanation below).

%%profile 
name: profiling_criteria_name
domain: domain_name
table: catalog.schema.table
datasource: datasource_name

columns:
- name: column_name
  datatype: string
  empty:
    allowed: false
    values:
    - NULL      
  checks:
  - enumeration:
    - value1
    ...

And you can use the %%validate cell magic to start a BusinessRule measurement (see the explanation below).

%%validate

name: "Each policy should have a single person"
domain: YourDomain
datasource: DatasourceName
selection_query: "SELECT * FROM policy LEFT JOIN person ON policy.person_id = person.id"
rules:
- rule: "Each policy should have a person"
  query: "person.id IS NOT NULL"
- rule: "Each policy should only contain a single person"
  query: "ROW_NUMBER() OVER (PARTITION BY policy.id) = 1"
  ...

Step 5: Check the result of your measurement

After running the cell magic, you can check the result of your measurement by using the MeasurementGet client.

measurement = _  # in a notebook, `_` holds the output of the previous cell
print(measurement.model_dump_json(indent=2))

The measurement object is a Python object containing the result of your measurement, and you can explore it to inspect the details.
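As a sketch of what that exploration might look like: model_dump_json(indent=2) returns a JSON string, which you can load and inspect like any other JSON document. The field names below are hypothetical placeholders, not the actual ADQ schema:

```python
import json

# Hypothetical example of what a serialized measurement might look like;
# the real ADQ field names may differ.
result_json = """
{
  "id": "abc-123",
  "domain": "YourDomain",
  "passed": 98,
  "failed": 2
}
"""

# Load the JSON string into a plain dict and inspect individual fields.
result = json.loads(result_json)
print(result["failed"])  # -> 2
```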

Step 6: Check the data in a Spark DataFrame

from adq.client import create_and_register_spark_dataframes
create_and_register_spark_dataframes(measurement)

This will create a Spark DataFrame and register a view with the name ADQResult in your Spark session. You can use this view to explore the result of your measurement in a SQL cell.

%sql
SELECT * FROM ADQResult

Step 7: Explore the result of your measurement in the web interface

You can explore the result of your measurement in the web interface at https://dataquality.athora.nl. Log in using your Athora credentials, then navigate to the Measurements tab to see the result of your measurement. The ID of your measurement is the ID of the measurement object that you printed in step 5. You can also open a measurement directly by navigating to https://dataquality.athora.nl/measurements/{measurement_id}.
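For convenience, the direct link can be built from the measurement ID, for example when sharing results from a notebook. This is a minimal sketch; the ID used here is a placeholder, and the attribute holding the ID on the measurement object is an assumption:

```python
# Placeholder ID -- in practice, take the ID from the measurement
# object printed in step 5.
measurement_id = "abc-123"

# Build the direct link to the measurement in the web interface.
url = f"https://dataquality.athora.nl/measurements/{measurement_id}"
print(url)
```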