AI Model Inferencing with ONNX on GPU (CUDA)

Recently I've been working a lot with different AI models for a variety of use cases, ranging from simple object classification to massive crowd-counting models. While working with these models, I noticed that integrating a model into custom code is always difficult: the architecture is never included in the weight files themselves, so every model has to be wired up differently.

That's when I came across a format named ONNX, which instantly piqued my interest with promises such as:

  • Improved Inference Performance
  • Interoperability (the architecture and the weights travel together in a single file; the sketch after this list shows what that looks like)
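
Because the graph travels with the weights, the consuming code can ask the model file what it expects instead of hard-coding model-specific knowledge. As a minimal sketch (assuming the onnx Python package is installed and a hypothetical model.onnx file is on disk):

# Minimal sketch: inspect an ONNX model's graph and I/O.
# Assumption: the `onnx` package is installed and `model.onnx` is a placeholder path.
import onnx

model = onnx.load("model.onnx")

# The graph (architecture) is stored alongside the weights, so we can read
# the expected inputs and outputs without any model-specific code.
for inp in model.graph.input:
    dims = [d.dim_value or d.dim_param for d in inp.type.tensor_type.shape.dim]
    print("input:", inp.name, dims)
for out in model.graph.output:
    print("output:", out.name)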

Prerequisites

To get started with ONNX, there are two tasks to complete:

  1. Configure the ONNX Execution Provider
  2. Get an ONNX Model to infer

Configuring the ONNX Execution Provider for CUDA

To be able to run on our GPU, I chose to use the CUDA Execution Provider, which has two major components that need to be installed:

  • CUDA
  • cuDNN

Installing CUDA

CUDA is the easier of the two to install since it is publicly available without a login wall. Before we start, check whether CUDA is already installed by running the commands below; both should complete without errors:

# Check if CUDA is available
nvidia-smi
nvcc --version

If the above commands are not available, we can install CUDA (on WSL 2) by running the following:

If you want to install it on another target, check the links below for the full archive and the versions supported by ONNX Runtime.
# View all: https://developer.nvidia.com/cuda-toolkit-archive
# View ONNX Supported: https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#requirements
CUDA_VERSION_MAJOR=11
CUDA_VERSION_MINOR=5
CUDA_VERSION_PATCH=1

# Install CUDA
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin
sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/${CUDA_VERSION_MAJOR}.${CUDA_VERSION_MINOR}.${CUDA_VERSION_PATCH}/local_installers/cuda-repo-wsl-ubuntu-${CUDA_VERSION_MAJOR}-${CUDA_VERSION_MINOR}-local_${CUDA_VERSION_MAJOR}.${CUDA_VERSION_MINOR}.${CUDA_VERSION_PATCH}-1_amd64.deb
sudo dpkg -i cuda-repo-wsl-ubuntu-${CUDA_VERSION_MAJOR}-${CUDA_VERSION_MINOR}-local_${CUDA_VERSION_MAJOR}.${CUDA_VERSION_MINOR}.${CUDA_VERSION_PATCH}-1_amd64.deb
sudo apt-key add /var/cuda-repo-wsl-ubuntu-${CUDA_VERSION_MAJOR}-${CUDA_VERSION_MINOR}-local/7fa2af80.pub
sudo apt-get update
sudo apt-get -y install cuda

# Add CUDA to Path
# https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#post-installation-actions
# Note: add those to ~/.bashrc at the bottom
export PATH=/usr/local/cuda-11.5/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-11.5/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
Note: Do not forget to add these export lines to the bottom of your ~/.bashrc file

Installing cuDNN

Now on to the other requirement: installing cuDNN.

OS="wsl-ubuntu"
CUDA_VERSION_MAJOR=11
CUDA_VERSION_MINOR=4
CUDA_VERSION_PATCH=0
CUDNN_VERSION_MAJOR=8
CUDNN_VERSION_MINOR=2
CUDNN_VERSION_PATCH=4

wget https://developer.download.nvidia.com/compute/cuda/repos/${OS}/x86_64/cuda-${OS}.pin 

sudo mv cuda-${OS}.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/${OS}/x86_64/7fa2af80.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/${OS}/x86_64/ /"
sudo apt-get update
sudo apt-get install libcudnn8=${CUDNN_VERSION_MAJOR}.${CUDNN_VERSION_MINOR}.${CUDNN_VERSION_PATCH}-1+cuda${CUDA_VERSION_MAJOR}.${CUDA_VERSION_MINOR}
sudo apt-get install libcudnn8-dev=${CUDNN_VERSION_MAJOR}.${CUDNN_VERSION_MINOR}.${CUDNN_VERSION_PATCH}-1+cuda${CUDA_VERSION_MAJOR}.${CUDA_VERSION_MINOR}
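
With CUDA and cuDNN in place, it is worth verifying that ONNX Runtime can actually see the CUDA Execution Provider. A minimal check, assuming the GPU build of ONNX Runtime is installed (e.g. via pip install onnxruntime-gpu):

# Minimal check: confirm ONNX Runtime exposes the CUDA Execution Provider.
# Assumption: the GPU build is installed, e.g. `pip install onnxruntime-gpu`.
import onnxruntime

providers = onnxruntime.get_available_providers()
print(providers)

if "CUDAExecutionProvider" not in providers:
    raise SystemExit(
        "CUDAExecutionProvider not found: check the CUDA/cuDNN installation "
        "and make sure the GPU build of ONNX Runtime (onnxruntime-gpu) is installed."
    )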

Running the YOLOX Model

Now that we have all the required software for the CUDA Execution Provider installed, we can start running a model and inferring from it. Luckily for us, the YOLOX team provides a ready-to-use ONNX model, so let's download it and put it to work!

To keep the code part short, you can find the entire project in my public repository; the main thing to know is how we run inference on an ONNX model.

Running inference on an ONNX model is as easy as creating a session that specifies the model path as well as the execution provider to use:

# Create an ONNX Inference Session backed by the CUDA Execution Provider
import onnxruntime

session = onnxruntime.InferenceSession(
    path_or_bytes=args.model,
    providers=["CUDAExecutionProvider"]
)

After that is done, we just need to provide it with an input:

# Feed the preprocessed image to the model's first input,
# adding a batch dimension with img[None, :, :, :]
ort_inputs = {session.get_inputs()[0].name: img[None, :, :, :]}
output = session.run(None, ort_inputs)
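
For completeness, here is a rough sketch of the kind of preprocessing that produces img above. Assumptions: OpenCV and NumPy are installed, a plain resize to the 640x640 input shape, and a float32 CHW layout; the repository's actual preprocessing (e.g. YOLOX's letterbox padding) may differ.

# Rough preprocessing sketch (assumption: plain resize, no letterbox padding)
import cv2
import numpy as np

img = cv2.imread("./images/example.jpg")           # HWC, BGR, uint8
img = cv2.resize(img, (640, 640))                  # match the model's 640x640 input
img = img.transpose(2, 0, 1).astype(np.float32)    # HWC -> CHW, float32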

After which we can run all of this and receive our result:

# Create directories
mkdir model; mkdir images; mkdir output

# Download Model
wget -O ./model/yolox_l.onnx https://github.com/Megvii-BaseDetection/YOLOX/releases/download/0.1.1rc0/yolox_l.onnx 

# Download Example
wget -O ./images/example.jpg https://www.ikea.com/ext/ingkadam/m/7f8f282fb240f466/original/PH179208-crop001.jpg?f=m

# Run
python src/main.py -m ./model/yolox_l.onnx -i ./images/example.jpg -o ./output -s 0.3 --input_shape 640,640

Conclusion

As you can see from the above, it is quite easy to run inference on an ONNX model and get results from it! I will definitely be using ONNX more in the future to further reduce the complexity of my AI pipeline code.