Profiling Criteria
A Profiling Criteria object is defined by speciying Table and one or multiple Column(s). Under each column, you can define one or multiple Checks, a Datatype and an Empty object. The Checks are the actual rules that you want to apply to the data. The Datatype is the expected datatype of the column and the Empty object defines what is considered an empty value and how to handle it.
flowchart TD;
checktypes["`equal
not equal
greater than
less than
range
regex
custom
`"];
subgraph criteria;
table --> columns;
columns --> check;
columns --> datatype;
columns --> empty;
end;
subgraph checkstypes;
check --> checktypes;
end;
subgraph enumerations;
check --> enumeration;
direction LR;
enumeration --> dimensions;
end;
Checks
The Checks are the actual rules that you want to apply to the rows of your column. Each check results in a True or False value per cell. Currently, we support the following checks:
equal: Checks if the value is equal to a certain valuenot_equal: Checks if the value is not equal to a certain valuegreater_than: Checks if the value is greater than a certain valuegreater_than_or_equal_to: Checks if the value is greater than or equal to a certain valueless_than: Checks if the value is less than a certain valueless_than_or_equal_to: Checks if the value is less than or equal to a certain valuerange: Checks if the value is within a certain rangeregex: Checks if the value matches a certain regular expressionenumeration: Checks if the value is in a certain list of valuescustom: Allows you to define a custom check in SQL syntax. The expression should result in a boolean value.
Datatype
The Datatype object under a column defines the expected datatype of the column. It checks whether a field can be casted to the expected datatype. If it cannot be casted, the field is considered invalid and checks will be ignored. Checks are applied to the column with casted value. This means that if a field cannot be casted to the expected datatype, it will not be checked. And that the type of the value of the check will be inferred from the expected datatype.
Empty
The Empty object defines what is considered an empty value and how to handle it. The Empty object has two properties: values and allowed. The values property defines what is considered an empty value. The allowed property defines how to handle empty values. If a value is empty, the checks will be ignored. Checks in the list always act as an AND statement. This means that all checks should be True in order for the row to be considered valid. However, you can use a or statement to define multiple checks where only one (or more) of the checks should be True. See YAML reference for more information.
YAML Example
# Example of a Profiling Criteria in YAML format
name: Person
columns:
- name: id
datatype: integer
empty:
values:
- NULL
- ""
allowed: True
checks:
- greater_than: 0
- name: name
datatype: string
empty:
values:
- NULL
allowed: False
checks:
- regex: "^[A-Z][a-z]*$"
- name: age
datatype: integer
empty:
values:
- NULL
- ""
allowed: False
checks:
- greater_than_or_equal_to: 0
- less_than_or_equal_to: 120
Custom Properties
Predefined tags can be given to multiple parts within the API. These tags can be used to e.g. group or catagorize to Dimensions, Enumerations and Profiling Criteria. Within the Profiling Criteria, it’s possible to tag multiple parts, like columns, objects and criteria.
End Points
At the following location the API references can be found: https://app-dq-dev-01.azurewebsites.net/docs#/. For Custom Properties, the following end points can be used:
Manage:
/custom_properties #POST for adding a new selection
/custom_properties/{id} #PUT for changing an existing selection
Find (GET):
/custom_properties #For returning all selections
/custom_properties/{id} #For finding a specific selection
/custom_properties/find_one #To find specific selections without pagination
/custom_properties/find #To find specific selection with pagination
YAML Reference
ProfilingCriteria
name: string [REQUIRED]
columns: list[ProfilingColumn] [REQUIRED]
keys: list[ObjectKey] [OPTIONAL] # For reporting/monitoring purposes
requires_review: bool [OPTIONAL] # When enabled, a review window will appear in the web interface
filters: list[str | SCD2DateFilter | ActiveFilter] [OPTIONAL] # Filters to apply to the table before measuring
categories: list[Category] [OPTIONAL] # Advanced feature for grouping rows, do not use unless you know what you are doing
datasource: string [REQUIRED] # The datasource to measure
table: string [REQUIRED] # The table to measure
domain: string [REQUIRED] # The domain to measure
ProfilingColumn
name: string [REQUIRED] # The name of the column to profile
datatype: binary | boolean | byte | char | date | decimal | double | float | tinyint | smallint | int | integer | bigint | interval | long | map | null | short | string | struct | table_type | timestamp | timestamp_ntz | user_defined_type [REQUIRED] # The expected datatype of the column, will be used to cast the column before checking
critical_when_fails_exceed: CountThreshold | PercentageThreshold [OPTIONAL] # For reporting/monitoring purposes
empty: Empty [OPTIONAL] # What is considered an empty value and how to handle it
checks:
- equal: datetime | int | float | str
- not_equal: datetime | int | float | str
- greater_than: datetime | int | float | str
- greater_than_or_equal_to: datetime | int | float | str
- less_than: datetime | int | float | str
- less_than_or_equal_to: datetime | int | float | str
- length: int
operator: = | != | > | >= | < | <=
- range:
min: datetime | int | float | str
max: datetime | int | float | str
- regex: string
- enumeration: list[datetime | int | float | str]
- or:
- equal: datetime | int | float | str
- not_equal: datetime | int | float | str
...
Critical when fails exceed
Used to define a threshold for the number of failed checks. If the number of failed checks exceeds the threshold, the column is considered critical.
CountThreshold
count: int [REQUIRED]
PercentageThreshold
percentage: int [REQUIRED]
Empty
values: NULL | "" | 0 [REQUIRED] # The values that are considered empty
allowed: bool [REQUIRED] # How to handle empty values