December 2, 2021 - infrastructure nvidia

Enabling CUDA with PyTorch on Nvidia Jetson and Python 3.9

Xavier Geerinck


Did you enjoy reading? Or do you want to stay up-to-date of new Articles?

Consider sponsoring me or providing feedback so I can continue creating high-quality articles!

At my current customer, Proximus, I am working on putting Edge AI to use for the vision of "TODO-INSERT-VISION". To achieve this, we had to enable CUDA on the Nvidia Jetson devices that will analyze our sensor data and run inference on it.


Let's first go through the prerequisites. These are important, because otherwise we won't be able to run our container correctly.


  1. Have JetPack installed on your Nvidia Jetson Nano
  2. Have PyTorch compiled for Python 3.9

The Nvidia on Docker packages

To have GPU support in your containers, it is important to understand the different packages out there and how they work together in achieving this.

All packages below are summarized under the term nvidia-docker which refers to a collection of components:

  • libnvidia-container: Main required package, it is container-runtime agnostic and provides a wrapper CLI that different runtimes can invoke to inject NVIDIA GPU support into their containers.
  • nvidia-container-toolkit: Includes a script that implements the interface required by the runC prestart hook.
  • nvidia-container-runtime: A wrapper around runC that injects the nvidia-container-toolkit script into it.
  • nvidia-docker2: Takes the script associated with nvidia-container-runtime and registers it in Docker's /etc/docker/daemon.json so we can run docker run --runtime=nvidia ...

Note that all of them are needed, even though some documentation tells you to install only nvidia-docker2. This is because Kubernetes with Docker 19.03 needs nvidia-docker2 to pass GPU information, since it doesn't support the --gpus flag yet.

💡 More details can be found in this post by Kevin Klues.

Ensure the packages described above are available on the system

dpkg -l | grep nvidia
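
To script this check, the sketch below parses the text printed by dpkg -l and reports which of the four components listed earlier are not installed. The function name missing_nvidia_packages is a hypothetical helper, and matching on a name prefix is an assumption made because the actual package names can carry a suffix (for example a version digit after libnvidia-container).

```python
# Component names from the list above, matched as prefixes (assumption:
# installed package names may carry a suffix such as a version digit).
REQUIRED_PREFIXES = (
    "libnvidia-container",
    "nvidia-container-toolkit",
    "nvidia-container-runtime",
    "nvidia-docker2",
)


def missing_nvidia_packages(dpkg_output: str) -> list:
    """Return the required components that have no installed package."""
    installed = set()
    for line in dpkg_output.splitlines():
        fields = line.split()
        # "ii" in the first column marks an installed, configured package.
        if len(fields) >= 2 and fields[0] == "ii":
            # Drop an architecture qualifier such as ":arm64" from the name.
            installed.add(fields[1].split(":")[0])
    return [
        prefix
        for prefix in REQUIRED_PREFIXES
        if not any(name.startswith(prefix) for name in installed)
    ]
```

The input can come from subprocess.run(["dpkg", "-l"], capture_output=True, text=True).stdout; an empty result means all four components were found.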


To verify the setup, we can run the following checks.

Check Docker Runtime

Run the command below

sudo docker info | grep Runtime

That should show the below, with Default Runtime: nvidia

Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux nvidia runc
Default Runtime: nvidia
WARNING: No blkio weight support
WARNING: No blkio weight_device support
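
For an automated check, the same information can be read from the text of docker info. This is a minimal sketch: default_runtime is a hypothetical helper that only looks for the "Default Runtime" line shown above.

```python
def default_runtime(docker_info_output: str):
    """Return the value of the "Default Runtime" line, or None if absent."""
    for line in docker_info_output.splitlines():
        # docker info indents its fields, so strip before splitting.
        key, separator, value = line.strip().partition(":")
        if separator and key == "Default Runtime":
            return value.strip()
    return None
```

A return value other than "nvidia" points to the same configuration problem described next.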

If this is not the case, make sure nvidia-docker is installed correctly and that the default runtime is configured in /etc/docker/daemon.json:

{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}

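The daemon configuration can also be validated before restarting Docker. The sketch below assumes the file holds a complete JSON object; nvidia_is_default is a hypothetical helper, not part of Docker or the NVIDIA packages.

```python
import json


def nvidia_is_default(daemon_json_text: str) -> bool:
    """Check that the nvidia runtime is both registered and the default."""
    config = json.loads(daemon_json_text)
    runtimes = config.get("runtimes", {})
    return config.get("default-runtime") == "nvidia" and "nvidia" in runtimes
```

Passing it the contents of /etc/docker/daemon.json returns True only when both keys are present.
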
Nvidia Jetson Board Details

We can get detailed information about the board we are running on through the JetsonHacks utilities. Run the following:

git clone
cd jetsonUtilities

Which will give something such as:

NVIDIA Jetson Nano (Developer Kit Version)
L4T 32.6.1 [ JetPack 4.6 ]
Ubuntu 18.04.5 LTS
Kernel Version: 4.9.253-tegra
CUDA 10.2.300
CUDA Architecture: 5.3
OpenCV version: 4.1.1
OpenCV Cuda: NO
Vision Works:
VPI: ii libnvvpi1 1.1.11 arm64 NVIDIA Vision Programming Interface library
Vulcan: 1.2.70
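
Since the PyTorch wheel is built against a specific CUDA version, it can be handy to pull the version numbers out of this summary in a script. The sketch below parses only the line formats shown above; parse_board_summary is a hypothetical helper and the regular expressions are assumptions based on that sample output.

```python
import re


def parse_board_summary(summary: str) -> dict:
    """Extract the L4T, JetPack and CUDA versions from the summary text."""
    l4t = re.search(r"L4T\s+([\d.]+)\s+\[\s*JetPack\s+([\d.]+)\s*\]", summary)
    # Matches "CUDA 10.2.300" but not "CUDA Architecture: 5.3".
    cuda = re.search(r"^CUDA\s+([\d.]+)", summary, re.MULTILINE)
    return {
        "l4t": l4t.group(1) if l4t else None,
        "jetpack": l4t.group(2) if l4t else None,
        "cuda": cuda.group(1) if cuda else None,
    }
```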

Verify Nvidia Runtime for Containers

If you have followed the prerequisites, you should be able to run the below.

# Introduced with JetPack 4.2
# Allow Docker containers access to underlying GPU
sudo nvidia-docker version

Running the Device Query example on our GPU

Let's get started with the first example running in a container. For this we will use the commonly used deviceQuery sample, which queries the GPU and prints the specifications of the devices it finds.

Create a Dockerfile with the content below:

ARG ARCH=aarch64
RUN apt-get update \
&& apt-get install -y --no-install-recommends make g++ git ca-certificates
RUN git clone --branch v${V_CUDA} /tmp/cuda-samples
WORKDIR /tmp/cuda-samples/Samples/deviceQuery
RUN make clean && make TARGET_ARG=${ARCH}
CMD [ "./deviceQuery" ]

Finally, we can build this container now and run it:

# Build the docker image
sudo docker build -t m18x/nvidia-jetson-device-query .
# Run the docker image
sudo docker run --rm -it m18x/nvidia-jetson-device-query

Running PyTorch with Python 3.9 on our GPU

Note: for this example, you need to build PyTorch for Python 3.9

Going to a more complex example, let's build a container that invokes PyTorch and prints out the version.

Create a file with the following content:

import torch
print(f"Torch Version: {torch.__version__}")
print(f"Torch CUDA Is Available: {torch.cuda.is_available()}")
print(f"Torch CUDA Version: {torch.version.cuda}")
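
As an optional extension, the check can go one step further and run a small computation on the GPU instead of only reporting that CUDA is available. This is a minimal sketch: cuda_smoke_test is a hypothetical helper that takes the imported torch module as an argument.

```python
def cuda_smoke_test(torch):
    """Run a tiny computation on the GPU and return a one-line report."""
    if not torch.cuda.is_available():
        return "CUDA not available"
    values = torch.ones(1024, device="cuda")  # allocate directly on the GPU
    total = values.sum().item()  # reduce on the GPU, copy the scalar to the host
    return f"{torch.cuda.get_device_name(0)}: sum = {total}"
```

After import torch, calling print(cuda_smoke_test(torch)) prints the device name and the computed sum when the GPU is usable.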

Package it all up with a Dockerfile that installs Python 3.9, adds our Python-specific PyTorch wheel with CUDA enabled, and runs our application.

ARG ARCH=aarch64
ENV DEBIAN_FRONTEND=noninteractive
# Install Dependencies
# Required for PyTorch: libomp5 libopenblas-dev
RUN apt-get update \
&& apt-get install -y --no-install-recommends software-properties-common apt-utils curl libomp5 libopenblas-dev
# Configure Python 3.9
RUN add-apt-repository ppa:deadsnakes/ppa \
&& apt-get update \
&& apt-get install -y python${V_PYTHON} python${V_PYTHON}-dev python${V_PYTHON}-venv python${V_PYTHON_MAJOR}-tk \
&& rm /usr/bin/python \
&& rm /usr/bin/python3 \
&& ln -s $(which python${V_PYTHON}) /usr/bin/python \
&& ln -s $(which python${V_PYTHON}) /usr/bin/python${V_PYTHON_MAJOR} \
&& curl --silent --show-error | python
# Install PyTorch
COPY torch-1.10.0a0+git36449ea-cp39-cp39-linux_aarch64.whl /tmp/torch-1.10.0a0+git36449ea-cp39-cp39-linux_aarch64.whl
RUN pip install /tmp/torch-1.10.0a0+git36449ea-cp39-cp39-linux_aarch64.whl
# Install Dependencies
RUN pip install numpy
# Application
COPY /code/
CMD [ "python", "" ]

Finally, we can again build this container and run it with:

# Build the docker image
sudo docker build -t m18x/nvidia-jetson-pytorch-test .
# Run the docker image
sudo docker run --rm -it m18x/nvidia-jetson-pytorch-test

k3s agent -s -t ${NODE_TOKEN}

Running the PyTorch example on Kubernetes

Kubernetes Setup for GPU

As a final step, we want to run the PyTorch example on a Kubernetes cluster. This might sound trivial, but as noted above, Kubernetes does not natively support the --gpus flag yet. By following the commands below, however, we can get it working quite easily.

Start by creating your Kubernetes cluster. Since we are running on ARM, I typically use K3S, which is more lightweight. We can set it up with the following commands:

# Server:
# --kubelet-arg=--feature-gates=DevicePlugins=true: Enables us to use the nvidia-device-plugin to get available
curl -sfL | K3S_TOKEN="XAVIER_HIS_SECRET_TOKEN" INSTALL_K3S_EXEC="--cluster-init --docker --write-kubeconfig-mode 644 --write-kubeconfig $HOME/.kube/config --kubelet-arg=feature-gates=DevicePlugins=true" sh -s -
# Agent:
curl -sfL | sh -s agent --server --token "XAVIER_HIS_SECRET_TOKEN" --docker
curl -sfL | sh -s agent --server https://your-hostname:6443 --token "XAVIER_HIS_SECRET_TOKEN" --docker
# On the master, label the worker nodes
kubectl label nodes node-1 node-2

Once the commands above finish, we should be able to execute an NVIDIA container on Kubernetes:

sudo kubectl run -i -t nvidia --rm --image=m18x/nvidia-jetson-device-query --restart=Never

If it completes successfully, it should output our device information:

$ sudo kubectl run -i -t nvidia --rm --image=m18x/nvidia-jetson-device-query --restart=Never
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA Tegra X1"
CUDA Driver Version / Runtime Version 10.2 / 10.2
CUDA Capability Major/Minor version number: 5.3
Total amount of global memory: 1980 MBytes (2076119040 bytes)
( 1) Multiprocessors, (128) CUDA Cores/MP: 128 CUDA Cores
GPU Max Clock rate: 922 MHz (0.92 GHz)
Memory Clock rate: 13 Mhz
Memory Bus Width: 64-bit
L2 Cache Size: 262144 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: Yes
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: No
Supports Cooperative Kernel Launch: No
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.2, NumDevs = 1
Result = PASS
pod "nvidia" deleted
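
When this check runs unattended, the interesting parts of the output are the device name, the compute capability, and the final result line. The sketch below extracts those three values from the text above; summarize_device_query is a hypothetical helper and relies only on the line formats visible in this output.

```python
import re


def summarize_device_query(output: str) -> dict:
    """Pull the device name, compute capability and result from deviceQuery."""
    name = re.search(r'^\s*Device \d+: "(.+)"', output, re.MULTILINE)
    capability = re.search(
        r"CUDA Capability Major/Minor version number:\s*([\d.]+)", output
    )
    return {
        "device": name.group(1) if name else None,
        "capability": capability.group(1) if capability else None,
        "passed": "Result = PASS" in output,
    }
```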

Or, if we run the other container we built:

$ sudo kubectl run -i -t nvidia --rm --image=m18x/nvidia-jetson-pytorch-test --restart=Never
If you don't see a command prompt, try pressing enter.
Torch Version: 1.10.0a0+git36449ea
Torch CUDA Is Available: True
Torch CUDA Version: 10.2
pod "nvidia" deleted

Enabling Nvidia Device Plugin for POD YAML Deployments with GPU support

⚠️ The steps below will not work and end with an error: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: exec: "--fail-on-init-error=false": executable file not found in $PATH: unknown. I am, however, including them for future reference and to illustrate the state of support on the Jetson. For now, you should be able to use the GPU by tagging your nodes for GPU use and having the default runtime set.

The above will start a container, but it won't be registered under pods automatically. For that we need to create a manifest, but how can we request GPUs in there? Normally we would simply use the Nvidia Device Plugin, which exposes a GPU resource limit for our YAML. However, this DaemonSet depends on NVML, the library that also backs nvidia-smi, and neither is available for Tegra devices.

Luckily, the people at WindRiver figured out how to make this plugin available for us.

# Config your user first (git am needs a committer identity; the address is a placeholder)
git config --global user.email "you@example.com"
# Clone the Device Plugin repository and apply the patches
git clone -b 1.0.0-beta6
cd k8s-device-plugin
git am 000*.patch
# Build the device plugin container
sudo docker build -t nvidia/k8s-device-plugin:1.0.0-beta6 -f docker/arm64/Dockerfile.ubuntu16.04 .
# Apply it on the cluster
kubectl apply -f nvidia-device-plugin.yml

There is a prebuilt container available at m18x/k8s-device-plugin:1.0.0-beta6
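
For reference, once a device plugin does register GPUs with the cluster (which, per the warning above, is not yet the case on the Jetson), a pod requests one through the nvidia.com/gpu resource limit. The sketch below builds such a manifest as JSON, which kubectl accepts just like YAML. The helper gpu_pod_manifest and the pod name are illustrative assumptions; the image is the device query container built earlier.

```python
import json


def gpu_pod_manifest(name: str, image: str, gpus: int = 1) -> dict:
    """Build a Pod manifest that asks the scheduler for `gpus` NVIDIA GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Extended resource exposed by the NVIDIA device plugin.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


print(json.dumps(gpu_pod_manifest("nvidia", "m18x/nvidia-jetson-device-query"), indent=2))
```

The printed JSON can be saved to a file and applied with kubectl apply -f, or piped straight into kubectl apply -f -.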


Xavier Geerinck © 2020
