Jao Ming
5 min read · Sep 4, 2023

Parameter Efficient Fine-Tuning with a “Medium” Language Model like BERT

Generated from Bing Image Creator with the prompt “BERT and PEFT Fine-tuning Model”

Large Language Models (LLMs) have taken the world by storm with their reasoning prowess and ability to perform multiple tasks within a single prompt. However, in some cases using an LLM is akin to bringing a cannon to a knife fight: surely an advantage, but probably impractical. In some scenarios, “Medium” Language Models that have “only” a few hundred million parameters suffice. For example, encoder models like BERT can still be used to generate embeddings for Retrieval Augmented Generation (RAG) applications. They also remain performant on simpler tasks like sentiment analysis or general classification. As such, I wanted to apply the newer fine-tuning techniques to “Medium” sized models.

Parameter Efficient Fine-Tuning, or PEFT for short, is a family of techniques that fine-tune a transformer model while training only a small number of parameters. The technique used here, LoRA (Low-Rank Adaptation), represents the weight updates with two smaller matrices (called update matrices) through low-rank decomposition. The original model weights stay frozen, so the number of trainable weights drops significantly. You can think of it like the Boosting algorithm: if the base transformer is weak, the update matrices “fix” its weaknesses. These update matrices are called adapters.
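To put a number on that reduction, here is a minimal sketch of the arithmetic for a single 768 × 768 weight matrix, the size of the attention projections in a BERT-base model. The rank r = 8 is an assumption picked for illustration, and lora_param_count is a hypothetical helper, not part of any library.

```python
def lora_param_count(d_out: int, d_in: int, r: int) -> int:
    """Parameters in the two update matrices B (d_out x r) and A (r x d_in)."""
    return d_out * r + r * d_in

d = 768   # hidden size of a BERT-base attention projection
r = 8     # assumed rank, for illustration only

full = d * d                      # weights in the frozen matrix W
lora = lora_param_count(d, d, r)  # weights in the two update matrices

print(full)  # 589824
print(lora)  # 12288
print(f"{100 * lora / full:.2f}% of the full matrix")  # 2.08% of the full matrix
```

With these assumed numbers, the two update matrices hold about 2% of the weights of the matrix they adapt.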

There are many articles and papers that benchmark models fine-tuned with adapters against traditionally fine-tuned ones. Even more prevalent are articles that cover PEFT for LLMs only. Annoyed at not being able to find articles about PEFT for BERT and other “Medium” sized models, I am making my contribution here.


First step is always to install all the relevant packages

pip install transformers datasets evaluate accelerate peft
  • transformers: key transformers package for training and the gateway to models available on huggingface_hub
  • datasets: the gateway to datasets available on huggingface_hub
  • evaluate: to provide evaluation metrics
  • accelerate: needed for using Trainer on PyTorch models.
  • peft: provides LoraConfig and get_peft_model, which are used later on

Next we import all the relevant classes from the packages

import numpy as np
import evaluate
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer

Now we bring in the model and tokenizer that we want to fine-tune

model_name = "nlptown/bert-base-multilingual-uncased-sentiment"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

Dataset preparation

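dataset = load_dataset("yelp_review_full")
# Keep only 1000 shuffled examples per split to limit compute
small_train_dataset = dataset["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = dataset["test"].shuffle(seed=42).select(range(1000))

def tokenize_function(examples):
    # Pad or truncate every review to the model's maximum length
    return tokenizer(examples["text"], padding="max_length", truncation=True, return_tensors="pt")

tokenized_train_dataset = small_train_dataset.map(tokenize_function, batched=True)
tokenized_eval_dataset = small_eval_dataset.map(tokenize_function, batched=True)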

For this simple implementation, I used the yelp_review_full dataset that can be found on the Hugging Face Hub. Because I didn’t have the compute power to train on the full dataset, I decided to select only 1000 data points each for training and evaluation. All I really wanted to test was whether the code worked. After that, we just tokenize the text in the dataset so that it can be passed into the model for training.

The tokenizing process is essential as the BERT model will be looking out for either input_ids or inputs_embeds. The tokenizer transforms the text data into token ids and stores them in a column named input_ids.
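To make the idea of input_ids concrete, here is a toy sketch of what a tokenizer does when asked to pad and truncate. Everything in it is an assumption for illustration: toy_vocab and toy_tokenize are hypothetical, the ids are made up, and a real BERT tokenizer splits text into WordPiece sub-words rather than on whitespace.

```python
# Toy vocabulary with made-up ids; the real vocabulary is far larger.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
             "the": 4, "food": 5, "was": 6, "great": 7}

def toy_tokenize(text: str, max_length: int = 8) -> dict:
    """Whitespace-split, map words to ids, then truncate and pad to max_length."""
    words = text.lower().split()
    ids = [toy_vocab["[CLS]"]]
    ids += [toy_vocab.get(w, toy_vocab["[UNK]"]) for w in words]
    ids = ids[: max_length - 1] + [toy_vocab["[SEP]"]]   # mimics truncation=True
    mask = [1] * len(ids)                                # 1 marks a real token
    pad = max_length - len(ids)                          # mimics padding="max_length"
    return {"input_ids": ids + [toy_vocab["[PAD]"]] * pad,
            "attention_mask": mask + [0] * pad}

encoded = toy_tokenize("The food was great")
print(encoded["input_ids"])       # [2, 4, 5, 6, 7, 3, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 1, 0, 0]
```

Every example ends up with the same length, which is what lets the examples be stacked into batches during training.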

Declaring the function to use for calculating the metric

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
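As a small worked example of what this function computes, the sketch below does the same argmax-then-compare step in plain Python. The logits and labels are made up for illustration and do not come from the model.

```python
# Made-up logits for three reviews over the five star classes (indices 0..4).
logits = [
    [0.1, 0.2, 0.3, 2.5, 0.4],   # highest score at index 3
    [1.9, 0.2, 0.1, 0.0, 0.3],   # highest score at index 0
    [0.0, 0.1, 0.2, 0.3, 1.1],   # highest score at index 4
]
labels = [3, 1, 4]

# Plain-Python equivalent of np.argmax(logits, axis=-1)
predictions = [row.index(max(row)) for row in logits]

# Accuracy is the fraction of predictions that match the labels
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

print(predictions)  # [3, 0, 4]
print(accuracy)     # 0.6666666666666666
```

Two of the three predictions match their labels, so the accuracy is 2/3.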

This is where we start using the PEFT techniques. To make use of PEFT, we need to decide 1. which modules in the transformer architecture to apply the low-rank decomposition to, and 2. which task type we are fine-tuning for.

Modules simply refers to the sets of matrices in the transformer architecture. If you’re familiar with the way the attention layer works, you’d know that there are 3 kinds of matrices involved: the Key, Value and Query. In the LoRA research paper, the Query and Value matrices were targeted, and that is what we will be doing as well. To find out what the layers are called, you can simply print(model).

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(105879, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (classifier): Linear(in_features=768, out_features=5, bias=True)
)

As you can see above, the Query and Value layers are simply called query and value.

from peft import LoraConfig, get_peft_model, TaskType

# Define LoRA Config
lora_config = LoraConfig(
    target_modules=["query", "value"],
    task_type=TaskType.SEQ_CLS,  # this is necessary
)

# add LoRA adapter
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # see % trainable parameters

You can read up on the other arguments in the PEFT documentation for LoraConfig. But just take note that the target_modules and task_type arguments are necessary.
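To get a feel for how the choice of target_modules affects the adapter size, here is a rough sketch that counts only the LoRA weights, using the layer count and dimensions from the printed architecture. The rank R = 8 is an assumption (the config above does not set one), and lora_params is a hypothetical helper. The figure reported by print_trainable_parameters() can differ, since it counts every trainable weight in the wrapped model and not just the LoRA matrices.

```python
NUM_LAYERS = 12   # "(0-11): 12 x BertLayer" in the printout
HIDDEN = 768      # in_features and out_features of query, key and value
R = 8             # assumed LoRA rank, for illustration only

def lora_params(target_modules: list) -> int:
    """LoRA weights added when each listed 768 x 768 projection gets an adapter."""
    per_module = HIDDEN * R + R * HIDDEN   # update matrices B and A
    return NUM_LAYERS * len(target_modules) * per_module

print(lora_params(["query", "value"]))         # 294912
print(lora_params(["query", "key", "value"]))  # 442368
```

Adding the key projection to the targets grows the adapter by half again, which is the kind of trade-off to weigh when picking modules.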

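training_args = TrainingArguments(output_dir="bert_peft_trainer")
bert_peft_trainer = Trainer(
    model=model, args=training_args,  # model is the PEFT-wrapped model
    train_dataset=tokenized_train_dataset,  # training dataset requires column input_ids
    eval_dataset=tokenized_eval_dataset, compute_metrics=compute_metrics,
)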

Thereafter, the training process is similar to what you can find in the guides on Hugging Face: call bert_peft_trainer.train() to start training. Voila! Your model has been fine-tuned with PEFT.

If you save the model with bert_peft_trainer.model.save_pretrained("bert-peft"), you will see that only the adapter files are saved: a config file in .json and a model file in .bin (or .safetensors, depending on the PEFT version).

You can even merge the adapter into the original model! Just note that the adapter can only be merged with the original model, not the quantised version. I say this because it is possible to quantise a model and use the quantised version to train the adapter. If you’re interested, go take a look at QLoRA.

In order to merge the models, simply follow this code

from peft import PeftModel
# Reload the original (non-quantised) base model
original_model = AutoModelForSequenceClassification.from_pretrained(model_name)
original_with_adapter = PeftModel.from_pretrained(
    original_model, "bert-peft"  # bert-peft; the folder of the saved adapter
)
merged_model = original_with_adapter.merge_and_unload()
merged_model.save_pretrained("merged-model")

And there you go. The files located in the folder merged-model will be the model file of the original model integrated with the weights from the adapter.
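If you are curious what merging does to the weights, the sketch below walks through the arithmetic on a tiny example in plain Python. The matrices, the rank of 1 and the scale of 1.0 (which stands in for lora_alpha / r) are all made-up values for illustration, and matmul and add are hypothetical helpers.

```python
def matmul(X, Y):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

W = [[1.0, 2.0], [3.0, 4.0]]   # frozen base weight (2 x 2)
B = [[0.5], [1.0]]             # update matrix B (2 x 1), rank r = 1
A = [[2.0, 0.0]]               # update matrix A (1 x 2)
scale = 1.0                    # stands in for lora_alpha / r

# Merging folds the scaled product B @ A into the base weight
delta = [[scale * v for v in row] for row in matmul(B, A)]
W_merged = add(W, delta)

x = [[1.0], [1.0]]             # a column-vector input
# Adapter kept separate: W x + scale * B (A x)
adapter_out = [[scale * v for v in row] for row in matmul(B, matmul(A, x))]
with_adapter = add(matmul(W, x), adapter_out)
# Adapter merged: a single matrix product
merged = matmul(W_merged, x)

print(W_merged)                # [[2.0, 2.0], [5.0, 4.0]]
print(with_adapter == merged)  # True
```

Both paths give the same output, which is why the merged model needs no adapter files at inference time.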