DatasetIdentifier
Bases: ABC
Base class for all the possible ways to identify a dataset. You must provide a DatasetIdentifier to a FinetuneConfig to supply input data for the FinetuningRun.
There are three ways to provide a DatasetIdentifier to a fine-tuning run: a public URL (PublicUrl), a dataset on your local machine (LocalPath), or a dataset uploaded to Baseten ahead of time (Dataset).
With PublicUrl, "public" means a link that you can access without logging in or providing an API key. The dataset must be a zip file.
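Because a dataset served from a public URL must be a zip file, a dataset directory has to be archived before it is hosted. The following is a minimal sketch using only the Python standard library; `zip_dataset` is a hypothetical helper, not part of the baseten package, and the file layout Baseten expects inside the archive is not specified here.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def zip_dataset(dataset_dir: str, output_stem: str) -> Path:
    """Pack the contents of dataset_dir into <output_stem>.zip and return its path.

    Hypothetical helper for illustration; not part of the baseten package.
    """
    source = Path(dataset_dir)
    if not source.is_dir():
        raise NotADirectoryError(f"{source} is not a directory")
    # make_archive appends the ".zip" suffix itself and returns the archive path.
    archive = shutil.make_archive(output_stem, "zip", root_dir=source)
    return Path(archive)


if __name__ == "__main__":
    # Build a throwaway dataset so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        data_dir = Path(tmp) / "my-dataset"
        data_dir.mkdir()
        (data_dir / "sample.txt").write_text("placeholder")
        archive_path = zip_dataset(str(data_dir), str(Path(tmp) / "my-dataset"))
        with zipfile.ZipFile(archive_path) as zf:
            print(zf.namelist())
```

The resulting archive still needs to be hosted somewhere that serves it without authentication before its URL can be passed to PublicUrl.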
If your dataset is on the same machine that runs your Python code, you can use the LocalPath option to upload it as part of your fine-tuning script.
If your fine-tuning script is running on one machine and your dataset lives on another, or you want to upload a dataset once and use it in multiple fine-tuning runs, you'll want to upload the dataset separately.
This approach uses a shell command, not a line of Python. In a terminal, run the dataset upload command.
On success, you should see:
Upload Progress: 100% |█████████████████████████████████████████████████████████
INFO 🔮 Upload successful!🔮
Dataset ID: DATASET_ID
Then, in your fine-tuning config (your Python code), reference the uploaded dataset by passing this ID to Dataset.
Dataset
Bases: DatasetIdentifier
A Dataset hosted on Baseten
Example:
from baseten.training import Dataset
dataset = Dataset(dataset_id="DATASET_ID")
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset_id | str | The ID of the dataset hosted on Baseten | required |
LocalPath
Bases: DatasetIdentifier
A local dataset to be uploaded
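Example:
from baseten.training import LocalPath
dataset = LocalPath(path="/path/to/my-dataset", dataset_name="my-dataset")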
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | The absolute or relative path to the dataset directory on your local machine | required |
dataset_name | Optional[str] | The name to assign the dataset once it is uploaded to Baseten. If None, the uploaded directory's name is used instead. | None |
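The dataset_name fallback described in the table can be illustrated with a small sketch. `resolve_dataset_name` is a hypothetical helper written for this page, not a function in the baseten package, and Baseten may derive the name differently (for example, without resolving relative paths).

```python
from pathlib import Path
from typing import Optional


def resolve_dataset_name(path: str, dataset_name: Optional[str] = None) -> str:
    """Return dataset_name if given, otherwise the directory's own name.

    Hypothetical illustration of the documented default; not Baseten's code.
    """
    if dataset_name is not None:
        return dataset_name
    # resolve() normalizes relative paths and trailing slashes, so
    # "./data/my-dataset/" and "data/my-dataset" yield the same name.
    return Path(path).resolve().name


if __name__ == "__main__":
    print(resolve_dataset_name("data/my-dataset/"))
    print(resolve_dataset_name("data/my-dataset", "custom-name"))
```

Passing dataset_name explicitly is the safer choice when several directories share the same final path component.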
PublicUrl
Bases: DatasetIdentifier
A dataset hosted at a publicly accessible URL
Example:
from baseten.training import PublicUrl
dataset = PublicUrl("https://cdn.baseten.co/docs/production/DreamboothSampleDataset.zip")
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL of a zip file to use as a dataset. This URL must be publicly accessible (unauthenticated) | required |
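Before passing a URL to PublicUrl, it can help to catch obvious mistakes locally. The sketch below is a string-level check only: `looks_like_public_zip_url` is a hypothetical helper, it does not contact the server, and it assumes the URL path ends in `.zip`, which a valid public zip URL is not strictly required to do.

```python
from urllib.parse import urlparse


def looks_like_public_zip_url(url: str) -> bool:
    """Heuristic check that url could point at an unauthenticated zip file.

    Hypothetical helper for illustration; it inspects the string only and
    cannot confirm that the file is actually reachable.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    # Credentials embedded in the URL mean the link is not unauthenticated.
    if parsed.username or parsed.password:
        return False
    return parsed.path.lower().endswith(".zip")


if __name__ == "__main__":
    sample = "https://cdn.baseten.co/docs/production/DreamboothSampleDataset.zip"
    print(looks_like_public_zip_url(sample))
```

A quick manual test of the "public" requirement is to open the link in a private browser window, where no login session or API key is available.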