Creating a dataset for LLaMA

Formatting

LLaMA is an LLM for text generation, like GPT. Fine-tuning LLaMA with Blueprint requires a CSV file containing at least two columns, representing the input (source sequence) and the desired output (target sequence). For example, when fine-tuning LLaMA to summarize news articles, the input would be the article and the output would be its summary.

Your CSV file can have as many columns as you'd like, but Blueprint will specifically check that at least two columns exist. In your LlamaConfig, you can specify the names of your source and target columns. By default, Blueprint will look for columns titled "source" and "target".
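For instance, a small question-answering dataset using the default column names could be written out like this (a minimal sketch using pandas; only the "source" and "target" column names come from the defaults above):

```python
import pandas as pd

# Each row pairs an input (source) with its desired output (target).
rows = [
    {"source": "What is the capital of France?", "target": "Paris."},
    {"source": "Who wrote Moby-Dick?", "target": "Herman Melville."},
]

# Blueprint requires at least these two columns; extra columns are allowed.
pd.DataFrame(rows).to_csv("qa_dataset.csv", index=False)
```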

To upload and use your dataset in a FinetuningRun, you'll need to create a DatasetIdentifier object.
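As a hedged sketch of how the pieces fit together — the import path, constructor arguments, and keyword names below are assumptions, not confirmed Blueprint API, so check the Blueprint reference for exact signatures:

```python
# All import paths and parameter names here are assumptions --
# verify them against the Blueprint reference documentation.
from blueprint import DatasetIdentifier, FinetuningRun, LlamaConfig

config = LlamaConfig(source_column="source", target_column="target")  # assumed kwargs
dataset = DatasetIdentifier("qa_dataset.csv")                         # assumed signature
run = FinetuningRun(dataset=dataset, config=config)                   # assumed kwargs
```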

LLaMA is not licensed for commercial use

The LLaMA model is not currently licensed for commercial use. LLaMA fine-tuning is offered for research purposes only.

Tips

  • Prompt Structure: Fine-tuning LLaMA can be significantly improved by adding some structure to your prompt. For example, if your dataset consists of questions and answers about your favorite book:

    source: The following is a question from a user about a book. Please answer the question as succinctly as possible. Input: {question} Response:
    target: {answer}

Prompt structure

When imposing a prompt structure, it's important that you use the same structure when prompting your fine-tuned model. For example, with a new question, the prompt from above would look like this:

The following is a question from a user about a book. Please answer the question as succinctly as possible. Input: {question} Response:
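To keep training and inference prompts in sync, it can help to define the template once and reuse it. A minimal Python sketch (the template text comes from the example above; the helper name is illustrative):

```python
PROMPT_TEMPLATE = (
    "The following is a question from a user about a book. "
    "Please answer the question as succinctly as possible. "
    "Input: {question} Response:"
)

def build_prompt(question: str) -> str:
    """Format a question with the same structure used during fine-tuning."""
    return PROMPT_TEMPLATE.format(question=question)

# At training time, the source column holds build_prompt(question) and the
# target column holds the answer. At inference time, send
# build_prompt(new_question) to the fine-tuned model.
print(build_prompt("Who is the narrator of the story?"))
```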

  • Quality and Quantity: The better your dataset, the better the fine-tuned model. Gather as many high-quality input/output sequence examples as possible. Manually inspect your dataset to remove bad examples or formatting errors, as these can negatively affect your model's performance.
  • Balanced Data: Ensure your dataset represents diverse examples and avoids biases. This helps create a more versatile and accurate fine-tuned model.
  • Data Preprocessing: Clean and pre-process your data to remove irrelevant information, HTML tags, excessive white spaces, or any other noisy elements. This helps the model focus on the task at hand and improves its performance.
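A minimal cleaning sketch along the lines of the preprocessing tip above (the helper name and regexes are illustrative, not part of Blueprint):

```python
import re

def clean_text(text: str) -> str:
    """Strip HTML tags and collapse excessive whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)  # drop HTML tags
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    return text.strip()

print(clean_text("<p>Some   <b>noisy</b> \n input</p>"))
# -> "Some noisy input"
```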

FAQ

How large should my dataset be?

The ideal dataset size depends on your specific task and the complexity of your input/output sequences. Generally, a larger dataset provides better results. However, diminishing returns may be observed beyond a certain size. Start with a few thousand examples and iteratively expand your dataset, while monitoring the model's performance.

How should I handle very long input sequences?

We recommend keeping the length of input sequences below 512 tokens (for the LLaMA tokenizer, 1 token ≈ 4 characters). You can specify a max_length in your LlamaConfig. Examples longer than this maximum will be truncated during training, which may affect your model's quality.
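A rough way to flag rows at risk of truncation, using the ≈4-characters-per-token heuristic above (the helper function and the 512 cutoff mirror this recommendation; neither is a Blueprint API):

```python
import pandas as pd

MAX_TOKENS = 512  # recommended ceiling from above

def approx_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for LLaMA's tokenizer."""
    return len(text) // 4

df = pd.read_csv("qa_dataset.csv")
too_long = df[df["source"].map(approx_tokens) > MAX_TOKENS]
print(f"{len(too_long)} rows may be truncated at {MAX_TOKENS} tokens")
```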

What's next?

Use your dataset to create a fine-tuning run