Finetuning RoBERTa on SQUAD2.0
Environment
I used Compute Canada's Narval cluster as the environment for this experiment. Narval is a high-performance computing cluster offering access to high-end GPUs (NVIDIA A100). The cluster is accessed through login nodes, but the actual training is performed on compute nodes, which do not have internet access. Thus, both the model and the dataset have to be downloaded in advance. Starting from scratch, the general process is as follows:
1. ssh to a login node
2. create a virtual environment and install the necessary packages
3. download the dataset and the model
4. preprocess the data
5. schedule training on a compute node (and optional postprocessing)
6. evaluate the model with metrics
All nodes (including compute nodes) have access to a shared filesystem, so whatever is downloaded from the login nodes can be accessed by the compute nodes.
Setup virtual environment
Compute Canada uses Python virtual environments instead of Anaconda. Before creating one, we need to load the appropriate software modules, such as Python. The following steps generally suffice to create a functional environment able to train HuggingFace models (HuggingFace being the de facto hub for open-source deep learning at the time of writing):
- load the necessary modules (here arrow is needed to manipulate large datasets):

module load gcc arrow python/3.10

- create a virtual environment and install packages:

python -m venv venv
source venv/bin/activate
pip install transformers datasets ...
At runtime, there might be errors caused by missing packages. These can typically be fixed by simply installing the missing package with pip install. Conflicting requirements are another story, but luckily they don't happen too often.
Downloading model and dataset
With the HuggingFace API, downloading the model and dataset is very easy, since models and datasets are typically uploaded to the HuggingFace Hub (model hub link, dataset hub link) and follow the same streamlined syntax conventions. The tokenizer associated with a model can also be found in the model hub and uses the same syntax as the model. For RoBERTa and SQUAD2.0, we can do:
- loading and saving model (from online hub)
- loading and saving dataset (from online hub)
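A minimal sketch of both steps, using the standard transformers and datasets APIs (the local save paths are illustrative):

from transformers import AutoTokenizer, AutoModelForQuestionAnswering
from datasets import load_dataset

# download the model and its tokenizer from the HuggingFace Hub, then save them locally
model = AutoModelForQuestionAnswering.from_pretrained("roberta-base")
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model.save_pretrained("./roberta-base-local")
tokenizer.save_pretrained("./roberta-base-local")

# download the SQUAD2.0 dataset from the Hub and save it to disk
dataset = load_dataset("squad_v2")
dataset.save_to_disk("./squad_v2-local")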
To load the model and dataset locally during training, we can use the corresponding load functions like so:
- loading model locally:
- loading dataset locally:
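A corresponding sketch, assuming the local paths used above:

from transformers import AutoTokenizer, AutoModelForQuestionAnswering
from datasets import load_from_disk

# load the model and tokenizer from the local directory saved earlier
model = AutoModelForQuestionAnswering.from_pretrained("./roberta-base-local")
tokenizer = AutoTokenizer.from_pretrained("./roberta-base-local")

# load the dataset from the local copy on disk
dataset = load_from_disk("./squad_v2-local")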
Notice that the same function from_pretrained() is used to load the model both remotely and locally; it accepts either a Hub model ID or a local path, which makes for a clean API.
Preprocessing
Preprocessing the dataset needs to be done according to the model's specifications. For RobertaForQuestionAnswering (which is what the auto model actually instantiates), the model expects start_positions and end_positions as inputs, which the original SQUAD2.0 dataset does not provide, so we have to build them ourselves. Moreover, as SQUAD2.0 contains unanswerable questions, we need to handle those as well. For answerable questions, we determine the start and end positions by finding the indices of the tokens where the answer starts and ends within the context. For unanswerable questions, we can simply return (0, 0). The following preprocess function does the job:
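A sketch of such a function, loosely following the standard HuggingFace question-answering recipe (the max_length, truncation, and padding settings are illustrative choices, and the tokenizer is the one loaded earlier):

def preprocess(examples):
    # tokenize question/context pairs; long contexts are simply truncated here (no sliding window)
    tokenized = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=384,
        padding="max_length",
        return_offsets_mapping=True,
    )
    start_positions = []
    end_positions = []
    for i, offsets in enumerate(tokenized["offset_mapping"]):
        answers = examples["answers"][i]
        # unanswerable question: point both positions at the first (<s>) token
        if len(answers["answer_start"]) == 0:
            start_positions.append(0)
            end_positions.append(0)
            continue
        start_char = answers["answer_start"][0]
        end_char = start_char + len(answers["text"][0])
        # locate the token span that belongs to the context (sequence id 1)
        sequence_ids = tokenized.sequence_ids(i)
        context_start = sequence_ids.index(1)
        context_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)
        # if the answer was truncated away, treat the example as unanswerable
        if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
            continue
        # walk the character offsets to find the start/end token indices
        idx = context_start
        while idx <= context_end and offsets[idx][0] <= start_char:
            idx += 1
        start_positions.append(idx - 1)
        idx = context_end
        while idx >= context_start and offsets[idx][1] >= end_char:
            idx -= 1
        end_positions.append(idx + 1)
    tokenized["start_positions"] = start_positions
    tokenized["end_positions"] = end_positions
    return tokenized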
This preprocess function can be mapped over the dataset, either example by example or in batches, using the map() function:
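For example (the columns dropped here are those of the original SQUAD2.0 splits):

# apply the preprocessing to every split; batched=True processes many examples per call
tokenized_dataset = dataset.map(
    preprocess,
    batched=True,
    remove_columns=dataset["train"].column_names,
)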
Batched mapping is recommended as it is typically faster on multi-core CPUs.
Training
With the processed dataset and the model in place, we can now train the model. Training can be done in several ways. One way is to loop through the dataset, define the loss function, and perform backpropagation ourselves; this offers more control but requires more code, especially when training in parallel across multiple devices. Another way is to use HuggingFace's Trainer API; this offers less control but requires less code, and the trainer handles many of the internal complexities of training. The Trainer class requires training arguments, for example:
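A sketch of such arguments (the hyperparameters and output directory below are illustrative, not the exact values used in this experiment):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./roberta-squad2-checkpoints",  # where checkpoints are written
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
)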
Then, we can pass the model, dataset, and training arguments to the Trainer class and start training:
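Roughly as follows (the final save path is illustrative):

from transformers import Trainer, default_data_collator

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=default_data_collator,
    tokenizer=tokenizer,
)

trainer.train()
trainer.save_model("./roberta-squad2-final")  # explicit save of the final model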
By default, the trainer saves a model checkpoint every 500 training steps, so even if you don't explicitly save the model or the training is interrupted, you can still load the trained model from the latest checkpoint.
To schedule training on a compute node, you submit a job script, containing everything from module loading to the training command, through the SLURM scheduler. You also need to specify the resources to be allocated; the more resources you request, the longer the wait time. A typical script is:
#!/bin/bash
#SBATCH --job-name=roberta-squad2
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gpus-per-node=a100:1
#SBATCH --cpus-per-task=4
#SBATCH --mem=127000M
#SBATCH --time=03:00:00
#SBATCH --account=<account_name>
#SBATCH --mail-type=begin
#SBATCH --mail-type=end
#SBATCH --mail-user=<email_address>
#SBATCH --output=<output_name>
module load gcc arrow python/3.10
cd <path_to_project_folder>
source venv/bin/activate
python <path_to_training_script.py>
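Assuming this script is saved as, say, train_roberta.sh (the name is arbitrary), it is submitted from a login node with sbatch:

sbatch train_roberta.sh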
Evaluation
After training is done, we probably want to evaluate the performance of our model. This sometimes requires some postprocessing, as in the case of RobertaForQuestionAnswering and SQUAD2.0. Since this is extractive question answering, the model outputs, for each token, a logit (score) for being the start or end of the answer span. To make the predictions human readable, we first need to convert them to text. The following function does the trick:
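A sketch of such a postprocessing function for a single example. It assumes the validation features keep their offset_mapping with the offsets of non-context tokens set to None (a common convention), and that each example yields a single tokenized feature; n_best and max_answer_len are illustrative values:

import numpy as np

def postprocess(start_logits, end_logits, offset_mapping, context, n_best=20, max_answer_len=30):
    # score of predicting "no answer" (both positions on the first token)
    null_score = start_logits[0] + end_logits[0]
    best_text, best_score = "", None
    # only consider the n_best highest-scoring start/end candidates
    start_candidates = np.argsort(start_logits)[-n_best:][::-1]
    end_candidates = np.argsort(end_logits)[-n_best:][::-1]
    for s in start_candidates:
        for e in end_candidates:
            # skip tokens outside the context, inverted spans, and overly long spans
            if offset_mapping[s] is None or offset_mapping[e] is None:
                continue
            if e < s or e - s + 1 > max_answer_len:
                continue
            score = start_logits[s] + end_logits[e]
            if best_score is None or score > best_score:
                best_score = score
                best_text = context[offset_mapping[s][0]:offset_mapping[e][1]]
    # if no span beats the null score, the question is predicted unanswerable
    if best_score is None or best_score < null_score:
        return ""
    return best_text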
We can then apply this function to the model predictions like so:
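For instance, with illustrative names for the tokenized validation features (tokenized_validation) and the raw validation split (validation_dataset), and assuming one feature per example:

import json

# offset_mapping is not a model input, so drop it before prediction
raw_predictions = trainer.predict(tokenized_validation.remove_columns(["offset_mapping"]))
all_start_logits, all_end_logits = raw_predictions.predictions

predictions = {}
for i, example in enumerate(validation_dataset):
    predictions[example["id"]] = postprocess(
        all_start_logits[i],
        all_end_logits[i],
        tokenized_validation[i]["offset_mapping"],
        example["context"],
    )

# the official evaluation script expects a JSON file mapping question id to answer text
with open("predictions.json", "w") as f:
    json.dump(predictions, f)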
With the predictions in the text format expected by the official Evaluation Script v2.0, we can simply run the script to get metrics for our model:
python <path_to_eval_script.py> <path_to_true_answers.json> <path_to_predictions.json> -o <path_to_output_metrics.json>
For this experiment, I got:
This is not amazing, but passable. A performance like this indicates that the training code is working and can serve as a basis for further optimization.