Accelerating AI with Huggingface Models on Inferentia1

Jao Ming
Dec 12, 2023
Gradient Speed from FreePik

Overview of Inferentia

AWS has unveiled the 2nd version of Inferentia, its purpose-built AI accelerator chip (not a GPU) that has been optimised for model inference. The 1st version of Inferentia is already a fantastic compute offering: AWS touts up to 2.3x higher throughput, much lower latency, and up to 70% lower cost per inference when benchmarked against comparable EC2 instances. Inferentia2 levels up with up to 4x higher throughput and up to 10x lower latency than Inferentia1, but at 3x the cost.

With these performance benchmarks, it is natural for data scientists and ML engineers to gravitate towards Inferentia. However, some conditions make Inferentia a poor fit, or rule it out entirely, for some teams. Here are the ones that applied to me when working on my projects:

  1. GenAI models are only supported on Inferentia2, because Inferentia1 does not support transformer decoders. Both versions support transformer encoders.
  2. Inferentia2 is only available in the US right now, which could be a key reason for anyone to use Inferentia1 instead.
  3. Neither version of Inferentia is guaranteed to support every transformer architecture. You can refer to the AWS documentation for more information.

For my project, we needed an embedding model hosted in Singapore that could be flexible on transformer architectures. This meant that Inferentia1 was the way to go.

So for those who need to deploy Huggingface models outside the US, or want to deploy embedding models at a lower cost than Inferentia2, this article will show you how to get a Huggingface model to work on Inferentia1.


There are 4 parts to preparing a Huggingface model for deployment on Inferentia1.

  1. Installation of relevant dependencies
  2. Conversion of Huggingface model (Torch model) into Neuron format
  3. Packaging the Tokenizer and Inference code with the Neuron formatted model
  4. Setting up a SageMaker Endpoint using the Neuron model payload package

Handling Dependencies

The packages for converting Torch format models into Neuron format models can be quite volatile, given the pace of development in the AI space right now. Huggingface has an API to help perform this conversion. However, as of the publication of this article, the API does not work. Hopefully the method I share here will help you work around that.

These are the packages that I had to download for my Neuron conversion to work. The extra index-url is needed to pull from AWS’s pip repository.


If you’re in a Jupyter notebook, remember to restart your kernel after installing these packages.
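Because these packages move quickly, it can help to record exactly which versions ended up installed once the kernel has restarted. Here is a minimal sketch using only the standard library. The helper name and the package names in the usage lines are assumptions for illustration, not necessarily the exact list I installed.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version or None} for each requested distribution name."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


# Hypothetical package list; replace it with whatever you installed.
for name, version in installed_versions(
    ["optimum-neuron", "torch-neuron", "transformers"]
).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Keeping this output next to the notebook makes it easier to reproduce a working conversion later, when newer releases may behave differently.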

Neuron Conversion

For the conversion, instead of using the Huggingface API, we will be using the script in the optimum-neuron package. Out of all the methods I could find online, this seemed to be the most consistent way for me to perform the conversion.

import time
import shutil
import subprocess
import sagemaker
from transformers import AutoTokenizer
from sagemaker.s3 import S3Uploader


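# Create a SageMaker session and look up its default S3 bucket and execution role
sess = sagemaker.Session()
sagemaker_session_bucket = sess.default_bucket()
role = sagemaker.get_execution_role()
# Recreate the session pinned to that bucket
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)
print(f"Bucket: {sagemaker_session_bucket}")
print(f"Role: {role}")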

I ran each of these code chunks in a separate cell of my Jupyter notebook.

  • HF_MODEL: The Huggingface model that you’d like to convert into a Neuron model and deploy on Inferentia1
  • HF_TASK: The task type of your model. For embedding models, the task is called “feature-extraction”
  • SEQ_LENGTH: Inferentia requires a fixed sequence length for your transformer model. A fixed input sequence length lets the model be optimised for the hardware at inference time
  • BATCH_SIZE: Fixed for the same reason as SEQ_LENGTH
  • OUTPUT_FOLDER: The folder that the converted Neuron model is exported into
  • MODEL_NAME: The name for your exported Neuron model
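For reference, the variables could be set up along these lines. The model name and task come from this article. The sequence length, batch size, folder name and model name are placeholder values I am assuming for illustration, and passing the output folder as the final positional argument is my assumption about how the export command is completed. The helper function name is also made up.

```python
import shlex

HF_MODEL = "BAAI/bge-base-en-v1.5"  # embedding model discussed in this article
HF_TASK = "feature-extraction"      # task name for embedding models
SEQ_LENGTH = 384                    # assumed value, pick what fits your inputs
BATCH_SIZE = 1                      # assumed value
OUTPUT_FOLDER = "bge_neuron"        # hypothetical folder name
MODEL_NAME = "bge-base-neuron"      # hypothetical model name


def build_export_command(model, task, seq_length, batch_size, output_folder):
    """Assemble the optimum-cli call as an argument list (usable with subprocess.run)."""
    return [
        "optimum-cli", "export", "neuron",
        "--model", model,
        "--task", task,
        "--sequence_length", str(seq_length),
        "--batch_size", str(batch_size),
        "--disable-validation",
        output_folder,  # assumed: output directory as the positional argument
    ]


command = build_export_command(HF_MODEL, HF_TASK, SEQ_LENGTH, BATCH_SIZE, OUTPUT_FOLDER)
print(shlex.join(command))
```

Building the command as a list also means it can be run outside a notebook with subprocess.run(command, check=True), without relying on the ! prefix.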

Once all of these variables are set, you can run this code. Since I’m doing this in a Jupyter notebook, I put a ! in front of my CLI command.

!optimum-cli export neuron \
--model {HF_MODEL} \
--task {HF_TASK} \
--sequence_length {str(SEQ_LENGTH)} \
--batch_size {str(BATCH_SIZE)} \
--disable-validation \

This conversion is actually a compilation of the Huggingface model into Neuron format, and the compilation runs on CPU. This means that instead of splurging on a GPU instance for this step, you can just choose a compute-optimised CPU instance (e.g. c5).

Last window of logs for Neuron compilation

After the code finishes running, you should see these logs printed. As you can see, the compilation took about 2.5 minutes for the BAAI embedding model.
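Since the model is compiled for one fixed input shape, every request has to be padded or truncated to exactly SEQ_LENGTH tokens at inference time. The tokenizer normally handles this for you. The pure-Python sketch below only illustrates the idea, and the function name, pad id and token ids are made up for the example.

```python
def pad_or_truncate(token_ids, seq_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly seq_length long."""
    kept = list(token_ids[:seq_length])        # truncate inputs that are too long
    padding = seq_length - len(kept)
    input_ids = kept + [pad_id] * padding      # pad short inputs on the right
    attention_mask = [1] * len(kept) + [0] * padding  # 1 = real token, 0 = padding
    return input_ids, attention_mask


ids, mask = pad_or_truncate([101, 7592, 2088, 102], seq_length=8)
print(ids)   # [101, 7592, 2088, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

A real tokenizer also takes care of details this sketch skips, such as keeping the closing special token when it truncates.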

Packaging the Neuron Model

Now that we have compiled a Neuron model, it should sit in the folder that you declared earlier. The next step is to download the tokenizer for the same model into the OUTPUT_FOLDER, and then add an inference script for the Neuron model under OUTPUT_FOLDER/

Settling the tokenizer is simpler.

tokenizer = AutoTokenizer.from_pretrained(HF_MODEL)
# save the tokenizer files next to the compiled model
tokenizer.save_pretrained(OUTPUT_FOLDER)

However, the inference script can get a bit tricky.

%%writefile -a {OUTPUT_FOLDER}/code/ # only for jupyter notebook
# for BAAI/bge-base-en-v1.5
import os
from transformers import AutoConfig, AutoTokenizer
import torch
import torch.neuron

# To use one neuron core per worker
os.environ["NEURON_RT_NUM_CORES"] = "1"

# saved weights name

def model_fn(model_dir):
    # load tokenizer and neuron model from model_dir
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = torch.jit.load(os.path.join(model_dir, AWS_NEURON_TRACED_WEIGHTS_NAME))
    model_config = AutoConfig.from_pretrained(model_dir)

    return model, tokenizer, model_config

def predict_fn(data, model_tokenizer_model_config):
    # unpack model, tokenizer and model config
    model, tokenizer, model_config = model_tokenizer_model_config

    # create embeddings for inputs
    inputs = data.pop("inputs", data)
    embeddings = tokenizer(
    # convert to tuple for neuron model
    neuron_inputs = tuple(embeddings.values())

    # run prediction
    with torch.no_grad():
        # the inputs are a tuple, so they are unpacked positionally with *
        model_output = model(*neuron_inputs)[0]  # last_hidden_state
        sentence_embeddings = model_output[:, 0]  # first vector for each payload

    sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
    return [{"inputs": inputs, "embeddings": sentence_embeddings[0].tolist()}]

There are a few key things to note here. The first is the use of the magic command %%writefile in Jupyter notebook to write this code into a file under the stated directory. The second is the model_output of the Neuron model. As shown in the code example, we take the first element of the output, which is the last_hidden_state. But depending on what type of Huggingface model you choose, you may have to tweak this accordingly. For example, if you use a classification model, you’d need to apply a softmax to the output.

    with torch.no_grad():
        predictions = model(*neuron_inputs)[0]
        scores = torch.nn.Softmax(dim=1)(predictions)

    # return dictionary, which will be json serializable
    return [{"label": model_config.id2label[item.argmax().item()], "score": item.max().item()} for item in scores]

Something like this instead.

Once your tokenizer, inference script and Neuron model are all situated in the same folder, we can compress and export it. Remember to keep the inference script in a sub-folder named “code”.

%cd {OUTPUT_FOLDER}
!tar zcvf {MODEL_NAME}.tar.gz *
%cd ..

This was the Jupyter notebook code cell that I used. I am not sure why, but I get an error when I run the tar line from outside the OUTPUT_FOLDER. The fix was to cd into the folder to perform the compression.
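If you would rather not cd in and out of the folder, Python's tarfile module can build the same kind of archive from anywhere by controlling the path stored for each entry. This is a sketch under the assumption that OUTPUT_FOLDER holds the Neuron model, the tokenizer files and the code sub-folder directly. The function name is my own and this is not the cell I actually ran.

```python
import os
import tarfile


def package_model(output_folder, archive_path):
    """Write a .tar.gz whose entries are relative to output_folder (no parent prefix)."""
    archive_abs = os.path.abspath(archive_path)
    with tarfile.open(archive_path, "w:gz") as tar:
        for entry in sorted(os.listdir(output_folder)):
            full_path = os.path.join(output_folder, entry)
            if os.path.abspath(full_path) == archive_abs:
                continue  # do not add the archive to itself
            # arcname drops the folder prefix, like running tar from inside the folder;
            # directories such as code/ are added recursively
            tar.add(full_path, arcname=entry)
    return archive_path


# Usage with the notebook variables:
# package_model(OUTPUT_FOLDER, f"{MODEL_NAME}.tar.gz")
```

The arcname argument is what keeps the files at the top level of the archive, which is the same layout the cd approach produces.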

Items compressed into .tar file

After the compression, you should see a list of files that have been compressed into the .tar file like this. Now that we have the Neuron model payload, we can proceed with the deployment.

Deployment onto Inferentia1

In preparation for deployment, we need to first upload the .tar file into an S3 bucket.

s3_model_path = f"s3://{sess.default_bucket()}/neuron/embeddings"
s3_model_uri = S3Uploader.upload(

We do this because when we create a model object on SageMaker, we need to reference a model file that is located in S3. For this deployment, we are essentially doing these 3 things:

  1. Create a model object in SageMaker with reference to the .tar file in S3
  2. Create a SageMaker endpoint configuration referencing the model object
  3. Create a SageMaker endpoint using the endpoint configuration
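To make the three steps concrete, this is roughly what they look like when done by hand with a boto3 "sagemaker" client. It is a sketch rather than what I ran: the function name, the "-config" and "-endpoint" naming suffixes, the single-instance variant and the ml.inf1.xlarge default are assumptions for illustration, and image_uri stands for whichever Neuron inference container you use.

```python
def deploy_manually(sm_client, name, image_uri, model_data_url, role_arn,
                    instance_type="ml.inf1.xlarge"):
    """Run the three SageMaker steps explicitly with a boto3 'sagemaker' client."""
    # 1. Model object pointing at the container image and the .tar.gz in S3
    sm_client.create_model(
        ModelName=name,
        PrimaryContainer={"Image": image_uri, "ModelDataUrl": model_data_url},
        ExecutionRoleArn=role_arn,
    )
    # 2. Endpoint configuration referencing the model object
    sm_client.create_endpoint_config(
        EndpointConfigName=f"{name}-config",
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": name,
            "InitialInstanceCount": 1,
            "InstanceType": instance_type,
        }],
    )
    # 3. Endpoint created from the endpoint configuration
    return sm_client.create_endpoint(
        EndpointName=f"{name}-endpoint",
        EndpointConfigName=f"{name}-config",
    )


# Usage (requires boto3 and AWS credentials; variable names are placeholders):
# import boto3
# deploy_manually(boto3.client("sagemaker"), "bge-neuron", image_uri, s3_model_uri, role)
```

Passing the client in as an argument keeps the function easy to try out with a stub before pointing it at a real account.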

But instead of doing all of this manually, SageMaker has an API to handle it for us.

from sagemaker.huggingface.model import HuggingFaceModel

huggingface_model = HuggingFaceModel(

huggingface_model._is_compiled_model = True

predictor = huggingface_model.deploy(

The HuggingFaceModel class describes the model object that SageMaker creates when we deploy. We are detailing the location of the model artifact (S3), what environment variables are required, and what Docker image is to be used with this. The Docker container is referenced by these 3 arguments: transformers_version, pytorch_version, and py_version. You can find the Docker image in this repository. That said, I have tried other combinations of arguments that work but do not have matching Docker images in the repository, so I just take it as a reference.

Once the endpoint is created, we can invoke the endpoint using boto3, or just call .predict() from the predictor.
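As a rough sketch of the boto3 route, assuming the endpoint runs the embedding inference script from this article (which reads an "inputs" key and returns a list with an "embeddings" entry). The function name and the endpoint name in the usage lines are made up.

```python
import json


def embed(runtime_client, endpoint_name, text):
    """Send one text to the endpoint and return the parsed JSON response."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps({"inputs": text}),
    )
    # The response body is a stream; read it and decode the JSON payload
    return json.loads(response["Body"].read())


# Usage (requires boto3 and AWS credentials; the endpoint name is hypothetical):
# import boto3
# result = embed(boto3.client("sagemaker-runtime"), "my-neuron-endpoint", "Hello world")
# vector = result[0]["embeddings"]
```

With the predictor object from the deployment step, the equivalent call would be predictor.predict({"inputs": "Hello world"}).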


And there you have it: some code and some explanation of how I compiled a Neuron model from a Huggingface model and deployed it on an Inferentia1 instance. I hope this has been helpful and works for everyone. Enjoy the performance improvements and cost savings!