
Uploading a dataset

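To fine-tune a model, you need a dataset. There are three ways to provide your dataset to Blueprint: upload it from your computer, point to a public URL, or reference a dataset you have already uploaded.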

Before uploading a dataset, make sure your local setup is complete.

Uploading a dataset from your computer

Uses LocalPath.

If your dataset lives on your local machine, pass a LocalPath object to your FinetuningConfig and your dataset will be uploaded automatically when your FinetuningRun is created.

If your dataset is a directory (e.g. images for Dreambooth, Stable Diffusion):

from baseten.training import LocalPath

dataset = LocalPath(path="./my-dataset-directory", name="my-cool-dataset")

If your dataset is a single file (e.g. csv for FLAN-T5):

from baseten.training import LocalPath

dataset = LocalPath(path="./my-dataset.csv", name="my-cool-dataset")
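Because the upload only happens when the FinetuningRun is created, a mistyped path may not surface until then. A minimal sketch of a local pre-check, using only the standard library, follows. The helper name `dataset_kind` is hypothetical and not part of the baseten package; it only inspects the filesystem.

```python
from pathlib import Path


def dataset_kind(path: str) -> str:
    """Return "directory" or "file" for an existing dataset path.

    Hypothetical helper (not part of baseten): it only checks the local
    filesystem so that a bad path fails early.
    """
    p = Path(path)
    if p.is_dir():
        # An empty directory is almost certainly a mistake.
        if not any(p.iterdir()):
            raise ValueError(f"Dataset directory is empty: {p}")
        return "directory"
    if p.is_file():
        return "file"
    raise FileNotFoundError(f"Dataset path does not exist: {p}")
```

Once the check passes, the same path string can be passed to LocalPath exactly as in the examples above.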

Uploading a dataset from a URL

Uses PublicUrl.

If your dataset is hosted at a publicly accessible URL, you can point to it by creating a PublicUrl object.

from baseten.training import PublicUrl

dataset = PublicUrl(url="https://cdn.baseten.co/docs/production/DreamboothSampleDataset.zip")

The URL must point to a zip archive. Even if your dataset is a single file, it must still be zipped, e.g. https://cdn.baseten.co/docs/production/DatasetRecipes.csv.zip.
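One way to produce such an archive for a single file is Python's standard zipfile module. This is a sketch under assumptions: the helper name `zip_single_file` is hypothetical, and placing the file at the root of the archive under its own name is a guess based on the sample URL above, not a documented Blueprint requirement.

```python
import zipfile
from pathlib import Path


def zip_single_file(file_path: str) -> Path:
    """Create `<file name>.zip` next to the file and return the archive path.

    Hypothetical helper: the archive layout (file at the archive root)
    is an assumption, not something Blueprint documents here.
    """
    source = Path(file_path)
    if not source.is_file():
        raise FileNotFoundError(f"Not a file: {source}")
    # my-dataset.csv -> my-dataset.csv.zip, in the same directory.
    archive = source.with_name(source.name + ".zip")
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # arcname drops any parent directories from the stored path.
        zf.write(source, arcname=source.name)
    return archive
```

After hosting the resulting archive somewhere publicly accessible, its URL is what you would pass to PublicUrl.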

Using a dataset already uploaded to Blueprint

Uses Dataset.

If you have already uploaded a dataset to Blueprint, you can use it by instantiating a Dataset object and accessing your dataset by its ID.

from baseten.training import Dataset

dataset = Dataset(dataset_id="my-dataset-id")

Uploading datasets manually

If you want to upload a dataset and get a Dataset ID, use this CLI command:

Note: baseten dataset upload is a shell command, not Python code. Run it in your terminal rather than in a Python session.

Open a terminal window and run:

baseten dataset upload --name my-cool-dataset --training-type DREAMBOOTH ./my-dataset-directory

Notes:

  • If the --name parameter is not provided, Blueprint will name your dataset based on the directory name.
  • If you're doing a Full Stable Diffusion run, instead use --training-type CLASSIC_STABLE_DIFFUSION.
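If you drive uploads from a script, the command above can be assembled programmatically. The sketch below only builds the argument list from the flags shown in this section (--name and --training-type); the function name `build_upload_command` is hypothetical, and the code does not run the CLI or parse its output.

```python
import shlex
from typing import List, Optional


def build_upload_command(
    dataset_path: str,
    training_type: str = "DREAMBOOTH",
    name: Optional[str] = None,
) -> List[str]:
    """Build the argument list for `baseten dataset upload`.

    Hypothetical helper: uses only the flags documented above.
    """
    cmd = ["baseten", "dataset", "upload"]
    # --name is optional; omit it to let Blueprint use the directory name.
    if name is not None:
        cmd += ["--name", name]
    cmd += ["--training-type", training_type, dataset_path]
    return cmd


# Print the command as it would be typed in a terminal.
print(shlex.join(build_upload_command("./my-dataset-directory", name="my-cool-dataset")))
# prints: baseten dataset upload --name my-cool-dataset --training-type DREAMBOOTH ./my-dataset-directory
```

The returned list can be handed to subprocess.run(cmd, check=True) on a machine where the baseten CLI is installed and your local setup is complete. Passing a list rather than a single string avoids shell quoting problems with paths that contain spaces.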