3. Upload dataset
Uploading a dataset is an essential step when you create a version in the Superwise platform. It helps us create the version's schema and is used by Superwise as reference data when carrying out analytics and pre-configured metrics.
What is a dataset in Superwise?
Read more on the dataset concept here
Dataset file limitations
- The combined size of all dataset data files must not exceed 100MB
- The dataset must contain columns for ID and Timestamp
- The first row must contain the feature/entity names
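The limitations above can be checked locally before uploading. The following is a minimal sketch, not part of the Superwise SDK: the helper name `check_dataset_files` is hypothetical, the column names `id` and `ts` are assumptions borrowed from the SDK example later on this page, "100MB" is read as 100 MiB, and only CSV files are handled.

```python
import os
import tempfile

import pandas as pd

MAX_TOTAL_BYTES = 100 * 1024 * 1024  # assumption: "100MB" read as 100 MiB


def check_dataset_files(paths, id_column="id", timestamp_column="ts"):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []

    # Limitation 1: the combined size of all files
    total = sum(os.path.getsize(p) for p in paths)
    if total > MAX_TOTAL_BYTES:
        problems.append(f"total size {total} bytes exceeds {MAX_TOTAL_BYTES}")

    for path in paths:
        # Limitations 2 and 3: read only the header row and look for the required columns
        columns = pd.read_csv(path, nrows=0).columns
        for required in (id_column, timestamp_column):
            if required not in columns:
                problems.append(f"{path}: missing column '{required}'")
    return problems


# Usage with a tiny throwaway file
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.csv")
    pd.DataFrame({"id": ["a"], "ts": ["2022-01-01"], "carat": [0.3]}).to_csv(good, index=False)
    print(check_dataset_files([good]))  # → []
```

Returning a list of problems rather than raising on the first one makes it easier to fix a file in a single pass.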
Creating a new dataset via the console
There are two ways to create a dataset from the Superwise console:
- When you are creating a version: You can read about this process in the next section, Create version
- From the datasets screen: To do this, proceed as follows:
Start by going to the dataset screen - simply click on the dataset icon in the left menu bar within your project.
1. Upload dataset file: Click New dataset. The 'add dataset' dialog is displayed. Name the dataset and select the type (training, test, validation, or other). Then click the Upload file box to choose the CSV/Parquet file you want to use, and click Next.
2. Confirm your approval for the data types included: The Superwise platform allows you to define roles and data types for your dataset columns. Your dataset must include the roles id and time stamp. Any other entities that appear in the dataset but are not assigned a role or data type will be automatically selected as features by the system.
ID data type
We support both string and int32 data types for unique IDs.
However, we recommend using string UUIDs to support high volumes of data. Make sure you set the ID column as categorical.
3. Select the roles:
4. Match entities: If the entities of your uploaded dataset already exist in the project, you can use 'Match entities' to allow cross-model actions (like creating segments or policies on multiple models) on those entities.
More info
Read more on the Match entities concept
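For the ID recommendation in step 2 above (string UUIDs rather than int32), one way to add such a column before uploading is sketched below. The `records` DataFrame and its columns are hypothetical sample data; marking the column as categorical still happens in the console or through the dataset's data types.

```python
import uuid

import pandas as pd

# Hypothetical records that arrive without an ID column
records = pd.DataFrame({"carat": [0.23, 0.31, 0.29], "price": [326, 335, 334]})

# One random UUID per row, stored as a plain string
records["id"] = [str(uuid.uuid4()) for _ in range(len(records))]

# IDs have to be unique to identify individual records
assert records["id"].is_unique
```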
Creating a new dataset via the SDK
1. Create a dataset file - A dataset file is a structured data file that contains all the columns of your model data. It can be your set for training, testing, validation, or another use.
The dataset file must contain a Timestamp and an ID column, as presented in the example below:
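from io import StringIO
import pandas as pd
import requests
# Download the sample baseline CSV and save a local copy as training.csv
url = 'https://gitlab.com/superwise.ai-public/integration/-/raw/main/getting_started/data/baseline.csv?inline=false'
training_data = pd.read_csv(StringIO(requests.get(url).text))
training_data.to_csv("training.csv", index=False)
training_data.head()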
2. Define roles and data types
The Superwise platform allows you to define roles and data types for your dataset columns. Your dataset must include the roles id and time stamp. Any other entities that appear in the dataset but are not assigned a role or data type will be automatically selected as features by the system.
The SDK has a built-in method that can automatically infer the datatypes, or you can specify them manually.
These roles and datatypes are optional. This means that when they are not specified, they will be inferred automatically by Superwise.
from superwise.models.dataset import Dataset
from superwise.resources.superwise_enums import DataEntityRole, FeatureType, DatasetType
# Path to the file saved in step 1; `sw` and `project` are assumed to exist from the earlier steps
training_data_filepath = "training.csv"
dataset = Dataset(
    name="Example training dataset",
    files=[training_data_filepath],
    project_id=project.id,
    type=DatasetType.TRAIN.value,
    roles={
        DataEntityRole.LABEL.value: ["price"],
        DataEntityRole.PREDICTION_VALUE.value: ["prediction"],
        DataEntityRole.TIMESTAMP.value: "ts",
        DataEntityRole.ID.value: "id",
    },
    dtypes={
        "carat": FeatureType.NUMERIC.value,
        "color": FeatureType.CATEGORICAL.value,
        "x": FeatureType.BOOLEAN.value,
    },
)
# If dtypes is not set, the types are inferred automatically
# Create the dataset in Superwise; this may take some time to process
dataset = sw.dataset.create(dataset)
# Alternatively, define the roles as plain strings and let the SDK infer the data types
from superwise.controller.infer import infer_dtype
roles = {
    "id": "id",
    "time stamp": "ts",
    "prediction value": ["prediction"],
    "label": ["price"],
}
dtypes = infer_dtype(training_data)
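Before relying on inferred types, it can help to look at what pandas itself sees in each column and decide which ones to override manually. The sketch below is only an illustration of that review step: `suggest_types` is a hypothetical helper, its labels are plain strings rather than SDK values, the `is_premium` column is invented sample data, and none of this describes how `infer_dtype` works internally.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def suggest_types(df):
    """Map each column to a rough label based on its pandas dtype."""
    suggestions = {}
    for column in df.columns:
        if is_bool_dtype(df[column]):  # check bool first: pandas also treats bool as numeric
            suggestions[column] = "boolean"
        elif is_numeric_dtype(df[column]):
            suggestions[column] = "numeric"
        else:
            suggestions[column] = "categorical"
    return suggestions


sample = pd.DataFrame({"carat": [0.23, 0.31], "color": ["E", "J"], "is_premium": [True, False]})
print(suggest_types(sample))
# → {'carat': 'numeric', 'color': 'categorical', 'is_premium': 'boolean'}
```

A column that looks numeric to pandas but is really an identifier or a code is a typical candidate for a manual override.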
3. Create dataset - After you create a dataset file and configure the roles and datatypes, you can create the dataset. The "match entities" process is automatically applied to any entities in the same project that have the same name, type, and role.
You can also specify a type for the dataset. The default value for type is Training.
from superwise.models.dataset import DatasetType
from superwise.models.dataset import Dataset
dataset_type = DatasetType.TRAIN
# file type can be csv/parquet
dataset = Dataset(name="train", files=["training.csv"], project_id=2, dtypes=dtypes, roles=roles, type=dataset_type)
dataset = sw.dataset.create(dataset)
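The matching rule described in step 3 (same name, type, and role) can be pictured as a comparison of three-part keys. The sketch below only illustrates that rule with plain Python; the `Entity` tuple, the `matching_entities` helper, and the sample values are hypothetical, and this is not the platform's implementation.

```python
from typing import NamedTuple


class Entity(NamedTuple):
    name: str
    type: str
    role: str


def matching_entities(existing, uploaded):
    """Return uploaded entities whose (name, type, role) already exists in the project."""
    known = set(existing)
    return [entity for entity in uploaded if entity in known]


existing = [Entity("carat", "numeric", "feature"), Entity("color", "categorical", "feature")]
uploaded = [Entity("carat", "numeric", "feature"), Entity("color", "numeric", "feature")]

print(matching_entities(existing, uploaded))
# → [Entity(name='carat', type='numeric', role='feature')]
```

Here `color` is not matched because its type differs, even though the name and role are the same.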
Create dataset from dataframe
from superwise.models.dataset import Dataset
dataset = Dataset.generate_dataset_from_dataframe(name="dataset_name", dataframe=df, project_id=project.id, dtypes=dtypes, roles=roles, type=dataset_type)
dataset = sw.dataset.create(dataset)