Creating a dataset for FLAN-T5

Formatting

FLAN-T5 is an encoder-decoder language model trained by Google. Fine-tuning FLAN-T5 with Blueprint requires a CSV file containing at least two columns, representing the input (source sequence) and desired output (target sequence). For example, when fine-tuning FLAN-T5 to summarize news articles, the input would be the article and the output would be its summary.

Your CSV file can have as many columns as you'd like, but Blueprint will check that at least two exist. In your FlanT5BaseConfig, you can specify the names of your source and target columns. By default, Blueprint looks for columns titled "source" and "target".
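As a concrete illustration, the sketch below builds a tiny two-column dataset with Python's standard csv module and then re-reads it, checking the same things Blueprint checks: at least two columns, with the default "source" and "target" names. The example rows are invented; a real dataset would be written to a .csv file on disk rather than an in-memory buffer.

```python
import csv
import io

# Two invented example rows; on disk you would write to a real .csv file
# instead of an io.StringIO buffer.
rows = [
    {"source": "Summarize: The city council approved the new transit plan...",
     "target": "The council approved a transit plan."},
    {"source": "Summarize: Researchers announced a battery breakthrough...",
     "target": "Researchers announced a breakthrough in battery design."},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["source", "target"])
writer.writeheader()
writer.writerows(rows)

# Re-read and sanity-check the file the way Blueprint will:
# at least two columns, with the expected column names.
buf.seek(0)
reader = csv.DictReader(buf)
assert reader.fieldnames is not None and len(reader.fieldnames) >= 2
assert {"source", "target"} <= set(reader.fieldnames)
```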

To upload and use your dataset in a FinetuningRun, you'll need to create a DatasetIdentifier object.

Tips

  • Instruction Finetuning: FLAN-T5 is designed to follow instructions. To take advantage of this, prepend an instruction to your input sequences. For example, instead of passing a news article directly, prepend "Summarize:" or "Summarize the following text:". The FLAN collection of instruction templates provides many more examples.
  • Quality and Quantity: The better your dataset, the better the fine-tuned model. Gather as many high-quality input/output sequence examples as possible. Manually inspect your dataset to remove bad examples or formatting errors, as these can negatively affect your model's performance.
  • Balanced Data: Ensure your dataset represents diverse examples and avoids biases. This helps create a more versatile and accurate fine-tuned model.
  • Data Preprocessing: Clean and preprocess your data to remove irrelevant information, HTML tags, excessive white spaces, or any other noisy elements. This helps the model focus on the task at hand and improves its performance.
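The instruction and preprocessing tips above can be sketched together: clean each text of HTML tags and excess whitespace, then prepend an instruction to the source sequence. The regex-based cleaner below is a minimal illustration (a real pipeline might use a proper HTML parser), and the helper names are our own, not part of Blueprint.

```python
import re

def clean(text: str) -> str:
    """Strip HTML tags and collapse runs of whitespace (minimal sketch)."""
    text = re.sub(r"<[^>]+>", " ", text)      # drop HTML tags
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text

def to_example(article: str, summary: str) -> dict:
    """Build one source/target row, prepending an instruction for FLAN-T5."""
    return {
        "source": "Summarize the following text: " + clean(article),
        "target": clean(summary),
    }

row = to_example("<p>The  mayor   announced a new park.</p>",
                 "A new park was announced.")
```

Running `to_example` on every raw (article, summary) pair yields rows ready to write into the source/target CSV described above.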

FAQ

How large should my dataset be?

The ideal dataset size depends on your specific task and the complexity of your input/output sequences. Generally, a larger dataset provides better results. However, diminishing returns may be observed beyond a certain size. Start with a few thousand examples and iteratively expand your dataset, while monitoring the model's performance.

How should I handle very long input sequences?

We recommend keeping the length of input sequences below 512 tokens (with the FLAN-T5 tokenizer, 1 token corresponds to roughly 4 characters of English text).

Can I fine-tune FLAN-T5 for multiple tasks simultaneously?

Yes, you can create a dataset that combines different tasks by including various types of input-output pairs. Ensure that your instructions clearly indicate the desired task for each input sequence to help the model understand and generate the correct output.
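In other words, a multi-task dataset is just one CSV whose rows carry different instructions; the instruction in each source sequence tells the model which task to perform. A minimal sketch with invented rows:

```python
import random

# Each row's instruction disambiguates the task; the model learns both
# tasks from one combined dataset.
summarization = {
    "source": "Summarize the following text: The committee met for three hours...",
    "target": "The committee held a three-hour meeting.",
}
translation = {
    "source": "Translate to French: Good morning.",
    "target": "Bonjour.",
}
dataset = [summarization, translation]

# Shuffling interleaves the tasks so neither dominates a contiguous
# stretch of training examples.
random.shuffle(dataset)
```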

What's next?

Use your dataset to create a fine-tuning run