Domain Training Your LLM

Jao Ming
6 min read · Nov 4, 2023


A Llama working out. Generated by DALL-E.

There are plenty of resources on the internet that detail the various steps required for fine-tuning your foundation model. However, as I was searching around, I still ended up referring back and forth between materials before settling on an approach I could actually take on. My business problem was this:

I would like to fine-tune a foundation model / LLM for my business domain. However, my industry has a ton of jargon, abbreviations and acronyms that are not commonly found on the internet. Is there some way I can factor in these domain-specific tokens when fine-tuning?

In this post, I will cover the reasoning behind the approach and the code needed to fine-tune your LLM on domain data, and how to account for the domain-specific tokens by updating the Tokenizer and Embedding layer of the model.

Idea

When wondering how to approach the problem, the first idea I had was to simply update the vocabulary of the model. Although a rather simple idea, following through with the solution was not as straightforward as I had expected.

If I just update the Tokenizer with the new words, the model will not work optimally because it is not tuned to receive such tokens. The Embedding layer of the model will not recognise the new token and will not have an embedding vector to associate with it. To make matters worse, even if the Embedding layer can be resized, the question of how to initialise the embedding vectors for these new tokens becomes a problem. And it certainly doesn't help that the Embedding layer can normally only be trained with a full fine-tuning, which requires A TON of computational resources.

After pondering a bit more, I came up with a solution for it.

Currently, the way LLMs handle words that are not in their vocabulary is to either 1. tokenize them as an Unknown token, or 2. construct the word out of sub-words, which means that your new word is made up of more tokens than necessary. Since this is what the model is already doing, that is how I am going to instantiate the embedding vectors of new words: I will average the embedding vectors of the sub-word or unknown tokens and use the result to represent the new word.
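To make this concrete, here is a quick illustration of the sub-word behaviour (the word "bancassurance" stands in for a domain term, and the exact split is just indicative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# "bancassurance" stands in for a domain-specific term the model has never seen
print(tokenizer.tokenize("bancassurance"))
# e.g. ['▁banc', 'ass', 'urance']: one word becomes several sub-word tokens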

After updating the Tokenizer and Embedding layer with the new Domain words, I will then use a LoRA adapter to account for the update in vocabulary.

Let me explain with code.

Adding Domain Tokens

I experimented with the Llama-2 model that has been fine-tuned for chat, meta-llama/Llama-2-7b-chat-hf. Because I'm using the Llama-2 model, I needed to sign in to my Hugging Face account. Besides that, the rest of the code should be self-explanatory.

The goal of this section is to update the vocabulary of the model by adding the domain words as new tokens in the Tokenizer and adding their respective embedding vectors into the Embedding layer of the model.

In the code below, I'm simply loading up the Llama-2 model and quantising it down to 8-bit in preparation for a QLoRA fine-tuning. I'm also making sure that the token used for padding is the end-of-sentence token.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer

import huggingface_hub
huggingface_hub.login(token="{add your HF token here}")

base_model_name = "meta-llama/Llama-2-7b-chat-hf"
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    device_map="auto",
    load_in_8bit=True
)

base_model.config.use_cache = False

tokenizer = AutoTokenizer.from_pretrained(base_model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

The next step is to prepare a training dataset. Of course, this should be whatever data you have for your particular domain, but for this example I'll be using the alpaca-cleaned dataset from Hugging Face.

Preparation of datasets is and will always be the most tedious part of fine-tuning a machine learning model. There are so many different kinds of datasets out there with varying formats for training: some have a Question-and-Answer format (i.e. Question: {text}, Answer: {text}), some have a Chatbot format (i.e. User A: {prompt}, User B: {reply}), and so on and so forth.

To help cut through all of these variations, it helps to remember that at the end of the day you are essentially just training a Transformer decoder model. That is to say, you just need to provide a piece of text, and the model trains by taking each token x in that text as the label and all the tokens before it as the predictors. So it doesn't matter which format you choose, as long as you use the same one during inference. In this project, I used the "instruction", "input" and "response" format.
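As a minimal sketch of what that means in practice (the prompt text here is arbitrary): for causal language modelling you pass the same token ids as both inputs and labels, and the model shifts them internally so that every position is predicted from the tokens before it.

# Toy illustration of the next-token-prediction objective
sample_text = "### Instruction:\nSummarise the report.\n\n### Response:\nDone."
input_ids = tokenizer(sample_text, return_tensors="pt").input_ids.to(base_model.device)

# labels = input_ids: the model shifts them by one position internally
loss = base_model(input_ids=input_ids, labels=input_ids).loss
print(loss)

With that in mind, here is the dataset and the formatting function used in this project: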

dataset = load_dataset("yahma/alpaca-cleaned")
subset_dataset = dataset["train"].shuffle(seed=9).select(range(1000))

def format_data(sample):
    return f"""### Instruction:
{sample["instruction"]}

### Input:
{sample['input']}

### Response:
{sample['output']}
"""

Now comes the part where we add the new tokens into the Tokenizer and Embedding layer. The function oov_proportion() simply calculates the proportion of new tokens that do not exist in the Tokenizer's existing vocabulary.

The more interesting function is the one below it, called add_new_tokens(). It takes in the Model, Tokenizer and a list of new tokens, and performs the following:

  1. Filters away tokens that already exist in the Tokenizer’s vocabulary
  2. Records how the Tokenizer currently breaks each new word into sub-words, before the word is added
  3. Adds all the new tokens into the Tokenizer’s vocabulary
  4. Resizes the number of token embeddings in the Embedding layer
  5. Replaces the randomly initialised embedding vector of each new word with the average of its sub-word embeddings from Step 2, giving it a reasonably relevant starting point
def oov_proportion(domain_words, tokenizer):
    """Calculate the proportion of words that are not in the tokenizer vocabulary"""
    vocab = tokenizer.vocab.keys()
    oov = [i not in vocab for i in domain_words]
    return sum(oov) / len(oov)


def add_new_tokens(model, tokenizer, new_tokens):
    """Adds new tokens into the tokenizer and embedding layer"""
    new_tokens = list(set(new_tokens) - set(tokenizer.vocab.keys()))

    # record the sub-word break-down of each new word *before* adding it to the
    # tokenizer, otherwise tokenize() would simply return the newly added token itself
    subword_ids = [
        tokenizer.convert_tokens_to_ids(tokenizer.tokenize(token))
        for token in new_tokens
    ]

    n_new_tokens = tokenizer.add_tokens(new_tokens)
    print(f"{n_new_tokens} tokens added to tokenizer")
    model.resize_token_embeddings(len(tokenizer))

    with torch.no_grad():
        for i, ids in enumerate(reversed(subword_ids), start=1):
            # initialise each new embedding row as the mean of its sub-word embeddings
            model.model.embed_tokens.weight[-i, :] = model.model.embed_tokens.weight[ids].mean(dim=0)
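To give a feel for how these functions are used, here is a short sketch. The domain_words list is hypothetical; in practice it would come from your own glossary of jargon, abbreviations and acronyms.

# Hypothetical list of domain-specific terms; replace with your own glossary
domain_words = ["EBITDAR", "retrocession", "bancassurance"]

print(oov_proportion(domain_words, tokenizer))  # e.g. 1.0 if none are in the vocabulary

add_new_tokens(base_model, tokenizer, domain_words)

# each new word should now map to a single token instead of several sub-words
print(tokenizer.tokenize("bancassurance"))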

And with that you’re done with the data portion of the process! Now it’s just the fine-tuning of the model.

Fine-tuning with PEFT Adapters

Because we want the LLM to account for the changes in the Tokenizer and Embedding layer, we need to fine-tune it on data that includes the new words, basically teaching the LLM to align these new vectors with the expected tasks.

For that, we just have to perform a standard PEFT fine-tuning using the transformers, peft and trl packages.

# PEFT configurations
lora_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

# Creating a PEFT model - Attaching PEFT configs to the base_model
base_model = prepare_model_for_kbit_training(base_model)
base_model = get_peft_model(base_model, lora_config)

# Fine-tuning configurations
training_args = TrainingArguments(
    output_dir="output-model",
    auto_find_batch_size=True,
    learning_rate=1e-3,
    num_train_epochs=1,
    logging_dir="logs",
    logging_strategy="steps",
    logging_steps=500,
    warmup_ratio=0.03,
    lr_scheduler_type="constant",
)

# Setting up the actual Trainer
trainer = SFTTrainer(
    model=base_model,
    tokenizer=tokenizer,
    train_dataset=subset_dataset,
    peft_config=lora_config,
    packing=True,
    max_seq_length=2048,
    formatting_func=format_data,
    args=training_args
)

# train
trainer.train()

# save model
trainer.save_model()

Some key things to note with the training process for LLMs are:

  1. We need to specify the task_type in the PEFT configuration, otherwise it will raise an error saying the training can’t proceed without the correct configurations
  2. If you want the fine-tuning to be even more efficient, make sure you are using a quantised version of the base_model so that you are performing QLoRA instead of plain LoRA
  3. Because we are just passing in a piece of text, the formatting_func argument is important as it formats each record from the dataset into that piece of text before passing it to the Trainer for fine-tuning

With that, we are done! You now have a PEFT adapter and a Tokenizer that have been fine-tuned for your domain-specific words!
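As a final, hedged sketch of how you might load everything back for inference (the paths and prompt are illustrative): because the LoRA adapter here only wraps q_proj and v_proj, the extended embedding rows are not stored inside the adapter, so the vocabulary extension has to be rebuilt on a freshly loaded base model (or the resized embedding matrix saved separately) before attaching the adapter.

from peft import PeftModel

# the tokenizer with the added domain tokens was saved by trainer.save_model()
tokenizer = AutoTokenizer.from_pretrained("output-model")

model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    device_map="auto",
    load_in_8bit=True
)
# rebuild the extended vocabulary on the fresh base model, then attach the adapter
model.resize_token_embeddings(len(tokenizer))
model = PeftModel.from_pretrained(model, "output-model")

prompt = "### Instruction:\nExplain what bancassurance means.\n\n### Input:\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))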
