# Profiling Criteria A `Profiling Criteria` object is defined by speciying `Table` and one or multiple `Column(s)`. Under each column, you can define one or multiple `Checks`, a `Datatype` and an `Empty` object. The `Checks` are the actual rules that you want to apply to the data. The `Datatype` is the *expected* datatype of the column and the `Empty` object defines *what* is considered an empty value and *how* to handle it. ```{mermaid} flowchart TD; checktypes["`equal not equal greater than less than range regex custom `"]; subgraph criteria; table --> columns; columns --> check; columns --> datatype; columns --> empty; end; subgraph checkstypes; check --> checktypes; end; subgraph enumerations; check --> enumeration; direction LR; enumeration --> dimensions; end; ``` ## Checks The `Checks` are the actual rules that you want to apply to the rows of your column. Each check results in a `True` or `False` value per cell. Currently, we support the following checks: - `equal`: Checks if the value is equal to a certain value - `not_equal`: Checks if the value is not equal to a certain value - `greater_than`: Checks if the value is greater than a certain value - `greater_than_or_equal_to`: Checks if the value is greater than or equal to a certain value - `less_than`: Checks if the value is less than a certain value - `less_than_or_equal_to`: Checks if the value is less than or equal to a certain value - `range`: Checks if the value is within a certain range - `regex`: Checks if the value matches a certain regular expression - `enumeration`: Checks if the value is in a certain list of values - `custom`: Allows you to define a custom check in SQL syntax. The expression should result in a boolean value. ## Datatype The `Datatype` object under a column defines the expected datatype of the column. It checks whether a field can be casted to the expected datatype. If it cannot be casted, the field is considered invalid and checks will be ignored. Checks are applied to the column with casted value. This means that if a field cannot be casted to the expected datatype, it will not be checked. And that the type of the value of the check will be inferred from the expected datatype. ## Empty The `Empty` object defines what is considered an empty value and how to handle it. The `Empty` object has two properties: `values` and `allowed`. The `values` property defines what is considered an empty value. The `allowed` property defines how to handle empty values. If a value is empty, the checks will be ignored. Checks in the list always act as an `AND` statement. This means that all checks should be `True` in order for the row to be considered valid. However, you can use a `or` statement to define multiple checks where only one (or more) of the checks should be `True`. See YAML reference for more information. ### YAML Example ```yaml # Example of a Profiling Criteria in YAML format name: Person columns: - name: id datatype: integer empty: values: - NULL - "" allowed: True checks: - greater_than: 0 - name: name datatype: string empty: values: - NULL allowed: False checks: - regex: "^[A-Z][a-z]*$" - name: age datatype: integer empty: values: - NULL - "" allowed: False checks: - greater_than_or_equal_to: 0 - less_than_or_equal_to: 120 ``` ### Custom Properties Predefined tags can be given to multiple parts within the API. These tags can be used to e.g. group or catagorize to Dimensions, Enumerations and Profiling Criteria. Within the Profiling Criteria, it's possible to tag multiple parts, like columns, objects and criteria. ### End Points At the following location the API references can be found: https://app-dq-dev-01.azurewebsites.net/docs#/. For Custom Properties, the following end points can be used: ``` Manage: /custom_properties #POST for adding a new selection /custom_properties/{id} #PUT for changing an existing selection Find (GET): /custom_properties #For returning all selections /custom_properties/{id} #For finding a specific selection /custom_properties/find_one #To find specific selections without pagination /custom_properties/find #To find specific selection with pagination ``` ## YAML Reference ### ProfilingCriteria ```yaml name: string [REQUIRED] columns: list[ProfilingColumn] [REQUIRED] keys: list[ObjectKey] [OPTIONAL] # For reporting/monitoring purposes requires_review: bool [OPTIONAL] # When enabled, a review window will appear in the web interface filters: list[str | SCD2DateFilter | ActiveFilter] [OPTIONAL] # Filters to apply to the table before measuring categories: list[Category] [OPTIONAL] # Advanced feature for grouping rows, do not use unless you know what you are doing datasource: string [REQUIRED] # The datasource to measure table: string [REQUIRED] # The table to measure domain: string [REQUIRED] # The domain to measure ``` ### ProfilingColumn ```yaml name: string [REQUIRED] # The name of the column to profile datatype: binary | boolean | byte | char | date | decimal | double | float | tinyint | smallint | int | integer | bigint | interval | long | map | null | short | string | struct | table_type | timestamp | timestamp_ntz | user_defined_type [REQUIRED] # The expected datatype of the column, will be used to cast the column before checking critical_when_fails_exceed: CountThreshold | PercentageThreshold [OPTIONAL] # For reporting/monitoring purposes empty: Empty [OPTIONAL] # What is considered an empty value and how to handle it checks: - equal: datetime | int | float | str - not_equal: datetime | int | float | str - greater_than: datetime | int | float | str - greater_than_or_equal_to: datetime | int | float | str - less_than: datetime | int | float | str - less_than_or_equal_to: datetime | int | float | str - length: int operator: = | != | > | >= | < | <= - range: min: datetime | int | float | str max: datetime | int | float | str - regex: string - enumeration: list[datetime | int | float | str] - or: - equal: datetime | int | float | str - not_equal: datetime | int | float | str ... ``` ### Critical when fails exceed Used to define a threshold for the number of failed checks. If the number of failed checks exceeds the threshold, the column is considered critical. #### CountThreshold ```yaml count: int [REQUIRED] ``` #### PercentageThreshold ```yaml percentage: int [REQUIRED] ``` #### Empty ```yaml values: NULL | "" | 0 [REQUIRED] # The values that are considered empty allowed: bool [REQUIRED] # How to handle empty values ```