Finetuning T5
Overview
T5, or the Text-To-Text Transfer Transformer, is a popular general-purpose NLP model proposed by Google in 2020. It is worth experimenting with because of its versatility across text-processing tasks. However, a simple and clean codebase for model training is often hard to come by. Luckily, I found this repository (which is in turn based on this repository) on T5 training.
Environment
The training environment is Compute Canada, a high-performance computing cluster that provides high-end hardware such as NVIDIA A100 GPUs. Python 3.10 is used along with the packages listed in requirements.txt. Any missing package has to be installed at runtime.
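A minimal setup sketch, assuming a virtualenv-based workflow on a login node; the module name and version are assumptions and may differ on your cluster:

```bash
# Minimal environment setup on a Compute Canada login node.
# Module names/versions are assumptions; check `module avail python`.
module load python/3.10

# Create and activate a project-local virtual environment.
virtualenv --no-download venv
source venv/bin/activate

# Install the pinned dependencies (prefer the cluster wheelhouse with --no-index).
pip install --no-index --upgrade pip
pip install -r requirements.txt
```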
Dataset
The training dataset can be accessed here. It was formed by selecting causal questions from several popular question-answering datasets such as NQ and MS MARCO. The original datasets' formats were also edited to share the uniform columns | id | question | question_processed | context | context_processed | answer | answer_processed |, so the same metric script can be used to evaluate all of them.
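As a quick sanity check, the splits can be loaded with the `datasets` library. This is only a sketch that assumes the data is distributed as CSV; the file name below is hypothetical:

```python
# Sketch: load the unified training split and check its columns.
# "causal_train.csv" is a hypothetical placeholder for wherever you saved the dataset.
from datasets import load_dataset

dataset = load_dataset("csv", data_files={"train": "causal_train.csv"})["train"]
print(dataset.column_names)
# ['id', 'question', 'question_processed', 'context',
#  'context_processed', 'answer', 'answer_processed']
```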
Training
Training uses the HuggingFace API, with the general process being (a sketch follows the list):
- instantiate HuggingFace model
- instantiate HuggingFace tokenizer
- instantiate HuggingFace trainer with training arguments
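A minimal sketch of this setup, assuming the `t5-base` checkpoint; the hyperparameter values and variable names are illustrative only:

```python
# Sketch: instantiate the tokenizer, the model, and the Seq2Seq trainer.
# The checkpoint name and hyperparameter values are illustrative assumptions.
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

training_args = Seq2SeqTrainingArguments(
    output_dir="t5-qa-checkpoints",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=3e-4,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,  # prepared by the preprocessing functions below
    tokenizer=tokenizer,
)
trainer.train()
```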
For a more detailed code breakdown of the above, you can refer to the previous RoBERTa training post, which is similar, or go to the repositories linked in the overview section and look at the source code directly. However, a few functions in the preprocessing stage deserve attention:
- The `build_input()` function concatenates the question and the context for extractive QA; a simple newline separator is enough. Notice how passing the batch as a list takes advantage of the batched processing capability of `dataset.map()`, which speeds up preprocessing (see the sketch after this list).
- Pay attention to the `encoded_inputs["labels"]` part of `tokenize_function_train`. Notice how the `pad_token_id` is replaced by `-100`. This is a HuggingFace convention signalling that no loss should be computed on padding tokens, consistent with PyTorch's cross-entropy loss implementation, where `-100` is the default ignore index.
- The training can then be started with a `Seq2SeqTrainer` and `Seq2SeqTrainingArguments`.
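Below is a minimal sketch of these two preprocessing functions, assuming the unified column names from the dataset section; the length limits are illustrative assumptions:

```python
# Sketch of the two preprocessing steps discussed above.
# Column names follow the unified dataset format; the length limits are assumptions.
max_input_length = 512
max_target_length = 64


def build_input(batch):
    # Batched map(): `batch` is a dict of lists, so one input string is built per example.
    batch["input_text"] = [
        f"{question}\n{context}"
        for question, context in zip(
            batch["question_processed"], batch["context_processed"]
        )
    ]
    return batch


def tokenize_function_train(batch):
    encoded_inputs = tokenizer(
        batch["input_text"],
        max_length=max_input_length,
        truncation=True,
        padding="max_length",
    )
    targets = tokenizer(
        batch["answer_processed"],
        max_length=max_target_length,
        truncation=True,
        padding="max_length",
    )
    # Replace pad token ids in the labels with -100 so padding is ignored by the loss.
    encoded_inputs["labels"] = [
        [token if token != tokenizer.pad_token_id else -100 for token in label]
        for label in targets["input_ids"]
    ]
    return encoded_inputs


# batched=True passes whole batches to the functions, speeding up preprocessing.
tokenized_train = dataset.map(build_input, batched=True).map(
    tokenize_function_train, batched=True
)
```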
Training with accelerate
Accelerate is a library that enables straightforward distributed and mixed-precision training with PyTorch. The HuggingFace `Trainer` class supports accelerate by default, but you need to create a YAML configuration file and launch the script via `accelerate launch` instead of simply `python`. To create the configuration file, run:
```bash
accelerate config --config_file CONFIG_FILE(str)
```
This will start a short interactive questionnaire, with questions such as `In which compute environment are you running?`, which you answer in the command line. The generated file will be saved to the location given by the `CONFIG_FILE` argument. An example of a finished configuration file is:
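(The exact contents depend on your answers and your accelerate version; the following is a sketch for single-node training on 2 GPUs with fp16, not a verbatim file.)

```yaml
# Sketch of an accelerate config for one node, 2 GPUs, fp16 mixed precision.
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: fp16
num_machines: 1
num_processes: 2
gpu_ids: all
machine_rank: 0
main_training_function: main
rdzv_backend: static
same_network: true
use_cpu: false
```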
You can then launch your script with accelerate using:
```bash
accelerate launch --config_file CONFIG_FILE(str) <script.py>
```
On Compute Canada, after allocating your resources with `sbatch`, accelerate is able to take care of the distributed training automatically, without you having to explicitly set up classes such as `DistributedDataParallel`. Training T5 for 3 epochs on the full SQuAD 2.0 dataset with accelerate and 2 GPUs cuts the training time significantly, from around 50 h down to around 20 h.
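A sketch of a corresponding job script, assuming a single node with two A100s; the account name, resource values, module version, and file names are placeholders:

```bash
#!/bin/bash
# Sketch of a Slurm job script for 2-GPU accelerate training.
# Account, resources, module version, and file names are placeholders.
#SBATCH --account=def-someuser
#SBATCH --gres=gpu:a100:2
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=24:00:00

module load python/3.10
source venv/bin/activate

accelerate launch --config_file accelerate_config.yaml train_t5.py
```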
Evaluation
For evaluation, we can predict through the trainer, or we can explicitly instantiate the finetuned model and loop through the test data using a `DataLoader`, like so:
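(The following is a minimal sketch rather than the repository's actual code; `tokenized_test`, the checkpoint path, and the batch size are illustrative assumptions.)

```python
# Sketch: load the finetuned checkpoint, batch the test set, and collect predictions.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
checkpoint_dir = "t5-qa-checkpoints/checkpoint-best"  # hypothetical path
tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint_dir).to(device).eval()

# tokenized_test is assumed to be tokenized the same way as the training split.
tokenized_test.set_format(type="torch", columns=["input_ids", "attention_mask"])
test_loader = DataLoader(tokenized_test, batch_size=32)

predictions = []
for batch in test_loader:
    predictions.extend(run_model(batch))
```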
where `run_model` is defined as:
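(Again a sketch, not the original implementation: greedy generation followed by decoding back to answer strings.)

```python
# Sketch: greedy generation for one batch, decoded back to answer strings.
def run_model(batch):
    input_ids = batch["input_ids"].to(device)
    attention_mask = batch["attention_mask"].to(device)
    with torch.no_grad():
        outputs = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_new_tokens=64,
        )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```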
Then, having the `predictions` and the true answers, we can define a metric function to compute different metrics (such as EM and F1) and output them in a specified format (such as JSON).
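A minimal sketch of such a function, using standard SQuAD-style answer normalization for EM and token-level F1; here `answers` is assumed to be the list of gold answer strings aligned with `predictions`:

```python
# Sketch: SQuAD-style EM and F1 over predictions and gold answers, written to JSON.
import json
import re
import string
from collections import Counter


def normalize(text):
    # Lowercase, drop punctuation and articles, collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def f1_score(prediction, truth):
    pred_tokens = normalize(prediction).split()
    truth_tokens = normalize(truth).split()
    common = Counter(pred_tokens) & Counter(truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


def compute_metrics(predictions, answers, out_file="metrics.json"):
    em = sum(normalize(p) == normalize(a) for p, a in zip(predictions, answers)) / len(predictions)
    f1 = sum(f1_score(p, a) for p, a in zip(predictions, answers)) / len(predictions)
    results = {"em": em, "f1": f1}
    with open(out_file, "w") as f:
        json.dump(results, f, indent=2)
    return results
```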
A sample prediction is: