3. Uploading a dataset

Uploading a dataset is an essential step when you create a version in the Superwise platform. The dataset acts as the model baseline: Superwise uses it as reference data when computing analytics and pre-configured metrics.

📘

What is a dataset in Superwise?

Read more on the dataset concept here

🚧

Dataset file limitations

  • The combined size of all baseline data files must not exceed 100MB
  • The dataset must contain columns for ID and Timestamp
  • The first row must contain the feature/entity names
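
The limits above can be checked locally before uploading. The following is a minimal sketch and not part of the Superwise SDK: the function name check_dataset_files, the column names passed to it, and the reading of "100MB" as 100 * 1024 * 1024 bytes are all assumptions.

```python
import csv
import os

# Assumption: "100MB" is read as 100 MiB; adjust if the platform uses a decimal limit.
MAX_TOTAL_BYTES = 100 * 1024 * 1024


def check_dataset_files(paths, id_column, timestamp_column):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []

    # The size limit applies to all files together, not to each file.
    total_bytes = sum(os.path.getsize(path) for path in paths)
    if total_bytes > MAX_TOTAL_BYTES:
        problems.append(f"total size {total_bytes} bytes exceeds {MAX_TOTAL_BYTES}")

    for path in paths:
        # The first row is treated as the header row of column names.
        with open(path, newline="", encoding="utf-8") as handle:
            header = next(csv.reader(handle), [])
        # Both the ID and the timestamp column must appear in the header.
        for required in (id_column, timestamp_column):
            if required not in header:
                problems.append(f"{path}: missing column '{required}'")
    return problems
```

For example, check_dataset_files(["training.csv"], "id", "ts") returns an empty list when the file is small enough and its header contains both columns.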

Creating a new dataset via the console

There are two ways to create a dataset from the Superwise console:

  • While creating a version: this process is described in the next section, Create a version.
  • From the datasets screen: proceed as follows.

To open the datasets screen, click the dataset icon in the left menu bar within your project.

1. Upload dataset file: Click New dataset.
The 'add dataset' dialog is displayed. Name the dataset and select the type (training, test, validation or other). Then click on the Upload file box to choose the CSV file you want to use and click Next.


2. Confirm the data types: The Superwise platform lets you define roles and data types for your dataset columns. Your dataset must include the roles id and time stamp. Any other entities in the dataset that are not assigned a role or data type are automatically treated as features.

📘

ID data type

We support both string and int32 data types for unique IDs.
However, our recommendation is to use string UUID to support high volumes of data.

Make sure you set the ID column's data type to categorical.
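
If your data has no suitable ID column yet, string UUIDs can be generated with the Python standard library. This is a minimal sketch; the helper name make_string_ids is hypothetical and not part of the Superwise SDK.

```python
import uuid


def make_string_ids(n):
    """Return n unique string IDs, each a random UUID (version 4)."""
    return [str(uuid.uuid4()) for _ in range(n)]
```

With a pandas DataFrame, this could be used as training_data["id"] = make_string_ids(len(training_data)), assuming the ID column is named "id".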


3. Select the roles: Assign a role (such as ID, time stamp, label, or prediction value) to the relevant columns.


4. Match entities: If the entities in your uploaded dataset already exist in the project, you can use 'Match entities' to enable cross-model actions (such as creating segments or policies on multiple models) on those entities.

📘

More info

Read more on the Match entities concept


Creating a new dataset via the SDK

1. Create a dataset file: A dataset file is a structured data file that contains all the columns of your model data. It can be your set for training, testing, validation, or another use.
The dataset file must contain a Timestamp column and an ID column, as shown in the example below.

from io import StringIO

import pandas as pd
import requests

# Download the example baseline CSV and save a local copy for upload
url = 'https://gitlab.com/superwise.ai-public/integration/-/raw/main/getting_started/data/baseline.csv?inline=false'
training_data = pd.read_csv(StringIO(requests.get(url).text))
training_data.to_csv("training.csv", index=False)
training_data.head()

2. Define roles and data types
The Superwise platform allows you to define roles and data types for your dataset columns. Your dataset must include the roles id and time stamp. Any other entities that appear in the dataset but are not assigned a role or data type will be automatically selected as features by the system.
The SDK has a built-in method that infers the data types automatically, or you can specify them manually.
Specifying roles and data types is optional: anything you do not specify is inferred automatically by Superwise.

# Dataset is assumed to live in the same module as DatasetType
from superwise.models.dataset import Dataset, DatasetType
from superwise.resources.superwise_enums import DataEntityRole
from superwise.resources.superwise_enums import FeatureType

# `sw`, `project`, and `training_data_filepath` are assumed to be defined already
dataset = Dataset(name="Example training dataset",
                  files=[training_data_filepath],
                  project_id=project.id,
                  type=DatasetType.TRAIN.value,
                  roles={
                    DataEntityRole.LABEL.value: ["price"],
                    DataEntityRole.PREDICTION_VALUE.value: ["prediction"],
                    DataEntityRole.TIMESTAMP.value: "ts",
                    DataEntityRole.ID.value: "id"
                    },
                  # Pay attention! This is only an example!
                  # If you choose to override the dtypes, you must define the dtypes of all schema columns
                  # (and not only the ones you want to override)
                  dtypes={
                    "carat": FeatureType.NUMERIC.value,
                    "color": FeatureType.CATEGORICAL.value,
                    "x": FeatureType.BOOLEAN.value
                  })
# Another option is to let Superwise infer the dtypes by using
# dtypes=infer_dtype(training_data)

# Create the dataset in Superwise; this may take some time to process
dataset = sw.dataset.create(dataset)

from superwise.controller.infer import infer_dtype

roles = {
    "id": "id",
    "time stamp": "ts",
    "prediction value": ["prediction"],
    "label": ["price"]
}

dtypes = infer_dtype(training_data)
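
Before creating the dataset, it can help to confirm that every column named in the roles exists, and that a dtypes override still covers every column. The following is a minimal sketch under stated assumptions: both helper names are hypothetical, the dtype strings are placeholders rather than real FeatureType values, and infer_dtype is assumed to return a dict mapping column names to data types.

```python
def missing_role_columns(roles, columns):
    """Return a dict of role -> column names that are absent from the dataset."""
    missing = {}
    for role, value in roles.items():
        # A role maps to a single column name or to a list of names.
        names = value if isinstance(value, list) else [value]
        absent = [name for name in names if name not in columns]
        if absent:
            missing[role] = absent
    return missing


def complete_dtypes(inferred, overrides):
    """Apply overrides on top of inferred dtypes so every column stays covered."""
    unknown = set(overrides) - set(inferred)
    if unknown:
        raise KeyError(f"overrides reference unknown columns: {sorted(unknown)}")
    return {**inferred, **overrides}


# Illustrative values only; the column list is hypothetical.
columns = ["id", "ts", "prediction", "price", "carat", "color"]
roles = {
    "id": "id",
    "time stamp": "ts",
    "prediction value": ["prediction"],
    "label": ["price"],
}
print(missing_role_columns(roles, columns))  # → {}
```

Starting from the inferred mapping and overriding only selected entries is one way to satisfy the rule that an overridden dtypes argument must define every schema column.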

3. Create the dataset: After you create a dataset file and configure the roles and data types, you can create the dataset. The "match entities" process is applied automatically to any entities in the same project that have the same name, type, and role.
You can also specify a type for the dataset. The default type is Training.

# Dataset is assumed to live in the same module as DatasetType
from superwise.models.dataset import Dataset, DatasetType

dataset_type = DatasetType.TRAIN

dataset = Dataset(name="train", files=["training.csv"], project_id=2, dtypes=dtypes, roles=roles, type=dataset_type)
dataset = sw.dataset.create(dataset)
