Profiling Criteria

A Profiling Criteria object is defined by speciying Table and one or multiple Column(s). Under each column, you can define one or multiple Checks, a Datatype and an Empty object. The Checks are the actual rules that you want to apply to the data. The Datatype is the expected datatype of the column and the Empty object defines what is considered an empty value and how to handle it.

        flowchart TD;
  checktypes["`equal
  not equal
  greater than
  less than
  range
  regex
  custom
  `"];
  subgraph criteria;
    table --> columns;
    columns --> check;
    columns --> datatype;
    columns --> empty;
  end;
  subgraph checkstypes;
    check --> checktypes;
  end;
  subgraph enumerations;
    check --> enumeration;
    direction LR;
    enumeration --> dimensions;
  end;
    

Checks

The Checks are the actual rules that you want to apply to the rows of your column. Each check results in a True or False value per cell. Currently, we support the following checks:

  • equal: Checks if the value is equal to a certain value

  • not_equal: Checks if the value is not equal to a certain value

  • greater_than: Checks if the value is greater than a certain value

  • greater_than_or_equal_to: Checks if the value is greater than or equal to a certain value

  • less_than: Checks if the value is less than a certain value

  • less_than_or_equal_to: Checks if the value is less than or equal to a certain value

  • range: Checks if the value is within a certain range

  • regex: Checks if the value matches a certain regular expression

  • enumeration: Checks if the value is in a certain list of values

  • custom: Allows you to define a custom check in SQL syntax. The expression should result in a boolean value.

Datatype

The Datatype object under a column defines the expected datatype of the column. It checks whether a field can be casted to the expected datatype. If it cannot be casted, the field is considered invalid and checks will be ignored. Checks are applied to the column with casted value. This means that if a field cannot be casted to the expected datatype, it will not be checked. And that the type of the value of the check will be inferred from the expected datatype.

Empty

The Empty object defines what is considered an empty value and how to handle it. The Empty object has two properties: values and allowed. The values property defines what is considered an empty value. The allowed property defines how to handle empty values. If a value is empty, the checks will be ignored. Checks in the list always act as an AND statement. This means that all checks should be True in order for the row to be considered valid. However, you can use a or statement to define multiple checks where only one (or more) of the checks should be True. See YAML reference for more information.

YAML Example

# Example of a Profiling Criteria in YAML format
name: Person
columns: 
- name: id
  datatype: integer
  empty:
    values:
    - NULL
    - ""
    allowed: True
  checks:
  - greater_than: 0

- name: name
  datatype: string
  empty:
    values:
      - NULL
    allowed: False
  checks:
  - regex: "^[A-Z][a-z]*$"

- name: age
  datatype: integer
  empty:
    values:
    - NULL
    - ""
    allowed: False
  checks:
  - greater_than_or_equal_to: 0
  - less_than_or_equal_to: 120

Custom Properties

Predefined tags can be given to multiple parts within the API. These tags can be used to e.g. group or catagorize to Dimensions, Enumerations and Profiling Criteria. Within the Profiling Criteria, it’s possible to tag multiple parts, like columns, objects and criteria.

End Points

At the following location the API references can be found: https://app-dq-dev-01.azurewebsites.net/docs#/. For Custom Properties, the following end points can be used:

Manage:
/custom_properties #POST for adding a new selection
/custom_properties/{id} #PUT for changing an existing selection

Find (GET):
/custom_properties #For returning all selections
/custom_properties/{id} #For finding a specific selection
/custom_properties/find_one #To find specific selections without pagination
/custom_properties/find  #To find specific selection with pagination

YAML Reference

ProfilingCriteria

name: string [REQUIRED]
columns: list[ProfilingColumn] [REQUIRED]
keys: list[ObjectKey] [OPTIONAL] # For reporting/monitoring purposes
requires_review: bool [OPTIONAL] # When enabled, a review window will appear in the web interface
filters: list[str | SCD2DateFilter | ActiveFilter] [OPTIONAL] # Filters to apply to the table before measuring
categories: list[Category] [OPTIONAL] # Advanced feature for grouping rows, do not use unless you know what you are doing
datasource: string [REQUIRED] # The datasource to measure
table: string [REQUIRED] # The table to measure
domain: string [REQUIRED] # The domain to measure

ProfilingColumn

name: string [REQUIRED] # The name of the column to profile
datatype: binary | boolean | byte | char | date | decimal | double | float | tinyint | smallint | int | integer | bigint | interval | long | map | null | short | string | struct | table_type | timestamp | timestamp_ntz | user_defined_type [REQUIRED] # The expected datatype of the column, will be used to cast the column before checking
critical_when_fails_exceed: CountThreshold | PercentageThreshold [OPTIONAL] # For reporting/monitoring purposes
empty: Empty [OPTIONAL] # What is considered an empty value and how to handle it
checks:
- equal: datetime | int | float | str
- not_equal: datetime | int | float | str
- greater_than: datetime | int | float | str
- greater_than_or_equal_to: datetime | int | float | str
- less_than: datetime | int | float | str
- less_than_or_equal_to: datetime | int | float | str
- length: int
  operator: = | != | > | >= | < | <=
- range:
    min: datetime | int | float | str
    max: datetime | int | float | str
- regex: string
- enumeration: list[datetime | int | float | str]
- or: 
  - equal: datetime | int | float | str
  - not_equal: datetime | int | float | str
  ...

Critical when fails exceed

Used to define a threshold for the number of failed checks. If the number of failed checks exceeds the threshold, the column is considered critical.

CountThreshold

count: int [REQUIRED]

PercentageThreshold

percentage: int [REQUIRED]

Empty

values: NULL | "" | 0 [REQUIRED] # The values that are considered empty
allowed: bool [REQUIRED] # How to handle empty values