Datasets

A dataset is a collection of data entities that define the schema of a model version. Datasets allow you to compare your production data with different collections of data.

Dataset and a model version

When creating a new model version, the uploaded dataset will automatically determine the model version's schema. The dataset's entities (including name, datatype, and role) will accompany the model throughout the life of the version.
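As a rough illustration of how a schema can be derived from an uploaded dataset, the sketch below builds one entity per column with a name, datatype, and role. It is not the Superwise implementation: the `Entity` class, the `infer_schema` helper, the datatype labels, and the role names ("feature", "label") are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    # Hypothetical shape of a data entity: the three attributes named above.
    name: str
    datatype: str
    role: str


def infer_datatype(values):
    """Guess a coarse datatype from the non-missing values of one column."""
    present = [v for v in values if v is not None]
    if present and all(isinstance(v, bool) for v in present):
        return "boolean"
    if present and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    ):
        return "numeric"
    return "categorical"


def infer_schema(rows, roles):
    """Build one Entity per column; columns missing from `roles` default to 'feature'."""
    names = list(rows[0].keys())
    return [
        Entity(n, infer_datatype([r.get(n) for r in rows]), roles.get(n, "feature"))
        for n in names
    ]


rows = [
    {"age": 34, "plan": "basic", "churned": False},
    {"age": 51, "plan": "pro", "churned": True},
]
schema = infer_schema(rows, roles={"churned": "label"})
```

Because the entities are frozen, the schema produced here cannot be changed afterward, which mirrors the idea that it accompanies the version for its whole life.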

Once you connect a dataset to the Superwise platform, you automatically get statistics and analytical information about your data, such as "Min value," "Max value," "Feature importance," and "Missing values." You can see this information on the Analytics screen by filtering the table by the relevant dataset.
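To make the simpler statistics concrete, here is a minimal sketch of how the minimum, maximum, and missing-value count could be computed for a single entity. The `summarize_entity` function and the row format are assumptions made for illustration; the platform calculates these values for you, and feature importance is left out because it depends on the model.

```python
def summarize_entity(rows, name):
    """Return min, max, and missing-value count for one column of a dataset."""
    values = [row.get(name) for row in rows]
    present = [v for v in values if v is not None]
    return {
        "min_value": min(present) if present else None,
        "max_value": max(present) if present else None,
        # A value counts as missing if the key is absent or set to None.
        "missing_values": len(values) - len(present),
    }


rows = [{"age": 34}, {"age": None}, {"age": 51}, {}]
stats = summarize_entity(rows, "age")
```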

Match entities

The same data entities can appear in multiple datasets, models, and versions across a project. Moreover, each data entity can have its own distribution within different datasets. To analyze and monitor new production data more efficiently across the project, we use the match entities capability to match those entities that appear in different models across the same project.

Matching entities uses distribution attributes that have already been calculated, thereby allowing us to compare new data to existing data from different datasets (i.e., apples to apples).

By connecting new data entities to existing ones within a project, the matching entities feature enables you to create segments and policies that are shared across models.
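The idea can be sketched as a lookup of each new entity against the entities already known in the project. The matching rule used here (same normalized name and same datatype) is an assumption chosen for illustration, as are the `match_entities` function and the entity IDs; the source does not describe how Superwise decides that two entities match.

```python
def match_entities(project_entities, new_schema):
    """Split a new schema into entities that match existing ones and entities that do not.

    project_entities: dict mapping (normalized name, datatype) -> existing entity ID
    new_schema: dict mapping entity name -> datatype
    """
    matched, unmatched = {}, []
    for name, datatype in new_schema.items():
        key = (name.strip().lower(), datatype)  # assumed rule: name + datatype
        if key in project_entities:
            matched[name] = project_entities[key]
        else:
            unmatched.append(name)
    return matched, unmatched


project_entities = {
    ("age", "numeric"): "entity-001",
    ("plan", "categorical"): "entity-002",
}
new_schema = {"Age": "numeric", "plan": "categorical", "tenure": "numeric"}
matched, unmatched = match_entities(project_entities, new_schema)
```

Entities that resolve to the same ID can then share segments and policies, while unmatched entities such as `tenure` would be treated as new.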

Dataset as drift reference

To calculate drift metrics for your model's entities, you need to decide which reference dataset to compare against. Is the data drifting from the training dataset, or from the testing dataset?
Superwise allows you to calculate drift against any dataset, as long as it matches the model's entities.
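The sketch below shows how the choice of reference dataset changes the drift value for a single numeric entity. It uses the Population Stability Index (PSI) as a stand-in metric, which is an assumption: the source does not say which drift metric Superwise computes. The `psi` function, the bin count, and the sample values are all hypothetical.

```python
import math


def psi(reference, production, bins=4, eps=1e-6):
    """Population Stability Index of `production` relative to `reference`.

    Bin edges are taken from the reference data, so the result depends on
    which reference dataset is chosen.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant reference

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # Floor each share at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    ref, prod = shares(reference), shares(production)
    return sum((p - r) * math.log(p / r) for r, p in zip(ref, prod))


training = [1, 2, 3, 4, 5, 6, 7, 8]
testing = [2, 3, 4, 5, 6, 7, 8, 9]
production = [5, 6, 7, 8, 7, 8, 8, 9]

drift_vs_training = psi(training, production)
drift_vs_testing = psi(testing, production)
```

A PSI of 0 means the two distributions fall into the reference bins in identical proportions, and larger values indicate stronger drift. The same production data yields a different score for each reference, which is why the reference dataset has to be chosen deliberately.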