Entities
A dataset is a collection of data entities (columns), each of which has a name, a type, and a role.
For example, "Country" is a categorical (data type) dataset feature (role).
Entity role
To represent the different pieces of data involved in the ML decision process, Superwise supports the following roles:
Role | Description |
---|---|
ID | Unique identifier per row or prediction. The ID entity lets you send labels that are matched to previously sent predictions, so that performance metrics can be computed. Each schema must include exactly one data entity with an 'ID' role. Note: Both string and int32 data types are supported for the unique ID, but we recommend a string UUID to support high volumes of data. |
Timestamp | Indicates when the prediction took place. The timestamp column lets us present model metrics according to the actual prediction date rather than the time the data was sent to the platform. Each schema must include exactly one data entity with a 'Timestamp' role. |
Feature | A data entity that is being used as input by the model. |
Prediction probability | A probability value that was generated by the model that can be used for classification use cases. |
Prediction value | A model output value. Typically a boolean for binary classification models and a numeric value for regression models. |
Label | The actual ground truth the model is trying to predict. The label should have the same data type as the prediction value. |
Label weight | Used when a single label should represent more than one observation, for example in semi-supervised cases where one labeled observation stands in for several cases. |
Metadata | Available data that is not used directly as a feature by the model and therefore won't be factored into model drift calculations, but can be used for analytical purposes such as segmentation or dimension breakdowns. |
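The constraints in the role table (exactly one 'ID', exactly one 'Timestamp', and a label that shares the prediction value's data type) can be sketched as a small pre-flight check. This is an illustration only: the `validate_roles` function and the dictionary layout with `name`, `type`, and `role` keys are hypothetical and are not part of the Superwise SDK or API.

```python
from collections import Counter

# Roles that the role table says must appear exactly once per schema.
REQUIRED_ONCE = ("ID", "Timestamp")


def validate_roles(schema):
    """Return a list of problems found in a hypothetical schema description.

    `schema` is a list of dicts with 'name', 'type', and 'role' keys.
    This layout is an assumption made for the example.
    """
    errors = []
    counts = Counter(entity["role"] for entity in schema)
    for role in REQUIRED_ONCE:
        if counts[role] != 1:
            errors.append(
                f"expected exactly one '{role}' entity, found {counts[role]}"
            )

    # The label should have the same data type as the prediction value.
    labels = [e for e in schema if e["role"] == "Label"]
    predictions = [e for e in schema if e["role"] == "Prediction value"]
    for label in labels:
        for prediction in predictions:
            if label["type"] != prediction["type"]:
                errors.append(
                    f"label '{label['name']}' is {label['type']} but prediction "
                    f"'{prediction['name']}' is {prediction['type']}"
                )
    return errors


example_schema = [
    {"name": "prediction_id", "type": "String", "role": "ID"},
    {"name": "prediction_ts", "type": "Timestamp", "role": "Timestamp"},
    {"name": "country", "type": "Categorical", "role": "Feature"},
    {"name": "is_fraud_pred", "type": "Boolean", "role": "Prediction value"},
    {"name": "is_fraud", "type": "Boolean", "role": "Label"},
]

print(validate_roles(example_schema))
```

Running a check like this before sending data surfaces a missing or duplicated ID or Timestamp entity early, instead of at ingestion time.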
Data entity type
Each data entity has its own data format. Based on the data type, our platform calculates the relevant metrics for each data entity to provide you with full observability. For example, entropy is computed automatically for categorical entities, while variance is computed for numeric entities (to view the full set of metrics according to each data type, see here).
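To make the two metrics named above concrete, here is a minimal sketch of type-dependent metric selection. It assumes Shannon entropy with log base 2 and population variance; the platform's exact definitions are not stated here, and the `summarize` helper is hypothetical.

```python
import math
from collections import Counter
from statistics import pvariance


def entropy(values):
    """Shannon entropy (base 2) of the observed category frequencies."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def summarize(values, data_type):
    """Pick a metric by data type, mirroring the example in the text."""
    if data_type == "Categorical":
        return {"entropy": entropy(values)}
    if data_type == "Numeric":
        # Population variance is an assumption; a sample variance is equally plausible.
        return {"variance": pvariance(values)}
    return {}


print(summarize(["Red", "Blue", "Red", "Blue"], "Categorical"))
print(summarize([50, 18, 63, 7], "Numeric"))
```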
Supported data formats:
Data type | Possible values | Example |
---|---|---|
Numeric | Float or int | Age: 50, 18, 63, 7, ...; Amount: 1011.0, 23.7, 674.3, ... |
Categorical | Object or string | Color: Red, Blue, Yellow, ...; Country: England, Israel, USA, ... Note: categorical entities with more than 200 categories (e.g., first name, street name, IP address) are treated as "Sparse" features, and no metric other than "Missing values" is calculated for them. |
Boolean | 0/1 or true/false | Fraud: true/false; Is_active: 0/1 |
Timestamp | yyyy-mm-dd hh:mm:ss.SSS | Prediction_TS: '2021-12-04 18:27:20.213' |
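Two details from the table lend themselves to a quick local check before sending data: the timestamp format and the 200-category threshold for "Sparse" features. The sketch below uses only the Python standard library; the helper names are hypothetical, and the threshold is taken from the note in the Categorical row.

```python
from datetime import datetime

# "Over 200 categories" is treated as Sparse, per the Categorical row.
SPARSE_THRESHOLD = 200


def parse_timestamp(text):
    """Parse a 'yyyy-mm-dd hh:mm:ss.SSS' string.

    %f accepts one to six fractional digits, so millisecond values parse as-is.
    """
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S.%f")


def is_sparse(values):
    """True when a categorical column has more distinct values than the threshold."""
    return len(set(values)) > SPARSE_THRESHOLD


ts = parse_timestamp("2021-12-04 18:27:20.213")
print(ts.isoformat(timespec="milliseconds"))

street_names = [f"street_{i}" for i in range(500)]
print(is_sparse(street_names))
```

A string that does not match the format raises `ValueError`, which makes malformed timestamps easy to catch before upload.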