DatasetIdentifier

Bases: ABC

Base class for all the possible ways to identify a dataset. You must provide a DatasetIdentifier to a FinetuneConfig to supply input data for the FinetuningRun.

There are three ways to provide a DatasetIdentifier to a fine-tuning run: a public URL, a local path, or the ID of a dataset you have already uploaded. Each option is described in turn.

A "public" URL means a link that you can access without logging in or providing an API key.

The dataset must be a zip file.

from baseten.training import PublicUrl
# A URL to a publicly accessible dataset as a zip file
dataset = PublicUrl("https://cdn.baseten.co/docs/production/DreamboothSampleDataset.zip")
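Because this option expects a zip file, a dataset folder has to be archived before you host it. The following is a minimal sketch using only the Python standard library. The helper names `zip_dataset` and `list_archive` are hypothetical, and placing the files at the archive root is an assumption, since the required internal layout of the zip is not specified here.

```python
import shutil
import zipfile
from pathlib import Path


def zip_dataset(dataset_dir: str, output_stem: str) -> Path:
    """Archive dataset_dir into <output_stem>.zip and return the zip path.

    Hypothetical helper: files are stored at the archive root (an assumption).
    Keep output_stem outside dataset_dir so the archive does not include itself.
    """
    source = Path(dataset_dir)
    if not source.is_dir():
        raise NotADirectoryError(f"{source} is not a directory")
    # make_archive appends the ".zip" suffix to output_stem itself
    archive = shutil.make_archive(output_stem, "zip", root_dir=source)
    return Path(archive)


def list_archive(zip_path: Path) -> list[str]:
    """Return the sorted file names stored in the archive, as a sanity check."""
    with zipfile.ZipFile(zip_path) as zf:
        return sorted(name for name in zf.namelist() if not name.endswith("/"))
```

Once the resulting zip is hosted somewhere that requires no login or API key, its URL is what you would pass to PublicUrl.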

If you have your dataset on the local machine that you're running your Python code on, you can use the LocalPath option to upload it as part of your fine-tuning script.

from baseten.training import LocalPath
# A path to a local folder with your dataset
dataset = LocalPath("/path/to/my-dataset", "dog_pics")

If your fine-tuning script is running on one machine and your dataset lives on another, or you want to upload a dataset once and use it in multiple fine-tuning runs, you'll want to upload the dataset separately.

This approach uses a shell command rather than Python. In your terminal, run:

baseten dataset upload -n my-dataset -t DREAMBOOTH ~/path/to/my-dataset/

You should see:

Upload Progress: 100% |█████████████████████████████████████████████████████████
INFO 🔮 Upload successful!🔮

Dataset ID:
DATASET_ID

Then, for your fine-tuning config (your Python code), you'll use:

from baseten.training import Dataset
# The ID of a dataset already uploaded to Blueprint
dataset = Dataset("DATASET_ID")
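If you script the upload step, the dataset ID can be pulled out of the CLI output shown above instead of being copied by hand. This is a minimal sketch, assuming the output keeps the format printed here (a `Dataset ID:` label followed by the ID on the next non-empty line). The function name `extract_dataset_id` is hypothetical and is not part of the baseten package.

```python
def extract_dataset_id(cli_output: str) -> str:
    """Return the ID that follows the 'Dataset ID:' label in the CLI output.

    Hypothetical helper: assumes the output format shown in this document.
    """
    lines = [line.strip() for line in cli_output.splitlines()]
    for index, line in enumerate(lines):
        if line == "Dataset ID:":
            # The ID is the first non-empty line after the label
            for candidate in lines[index + 1:]:
                if candidate:
                    return candidate
    raise ValueError("No dataset ID found in CLI output")


# Sample text mirroring the upload output shown above
sample_output = "INFO 🔮 Upload successful!🔮\n\nDataset ID:\nDATASET_ID\n"
dataset_id = extract_dataset_id(sample_output)
```

The extracted string is what you would then pass to Dataset in your fine-tuning config.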

Dataset

Dataset(dataset_id: str)

Bases: DatasetIdentifier

A Dataset hosted on Baseten

Example:

from baseten.training import Dataset
dataset = Dataset("DATASET_ID")

Parameters:

Name | Type | Description | Default
dataset_id | str | The ID of the dataset hosted on Baseten | required

LocalPath

LocalPath(path: str, dataset_name: Optional[str] = None)

Bases: DatasetIdentifier

A local dataset to be uploaded

Example:

from baseten.training import LocalPath
dataset = LocalPath("./my-dataset")

Parameters:

Name | Type | Description | Default
path | str | The absolute or relative path to the dataset directory on your local machine | required
dataset_name | Optional[str] | The name to assign the dataset once it is uploaded to Baseten. If None, the uploaded directory's name will be used instead. | None
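The fallback for dataset_name can be illustrated with a short sketch. How the library actually derives the name is not shown here, so this is an assumption that it takes the last component of the path; `resolve_dataset_name` is a hypothetical helper, not a baseten function.

```python
from pathlib import Path
from typing import Optional


def resolve_dataset_name(path: str, dataset_name: Optional[str] = None) -> str:
    """Return the explicit name if given, else the directory's own name.

    Hypothetical helper: mirrors the documented fallback, not the library's code.
    """
    if dataset_name is not None:
        return dataset_name
    # resolve() normalizes "./my-dataset" and trailing slashes
    # before the last path component is taken
    return Path(path).resolve().name
```

Under this assumption, `LocalPath("./my-dataset")` would upload a dataset named `my-dataset`, while `LocalPath("/path/to/my-dataset", "dog_pics")` would name it `dog_pics`.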

PublicUrl

PublicUrl(url: str)

Bases: DatasetIdentifier

A dataset hosted at a publicly accessible URL

Example:

from baseten.training import PublicUrl
dataset = PublicUrl("https://cdn.baseten.co/docs/production/DreamboothSampleDataset.zip")

Parameters:

Name | Type | Description | Default
url | str | The URL of a zip file to use as a dataset. This URL must be publicly accessible (unauthenticated). | required