November 25, 2021 - infrastructure nvidia ai

Enabling CUDA with PyTorch on Nvidia Jetson and Python 3.9

Xavier Geerinck

@XavierGeerinck

Did you enjoy reading? Or do you want to stay up-to-date of new Articles?

Consider sponsoring me or providing feedback so I can continue creating high-quality articles!

For a use case, I wanted to use the Nvidia Jetson for edge inference. One bottleneck was that my software required a Python version newer than 3.6, while the PyTorch packages that Nvidia provides for the Jetson were only built for Python 3.6.

Searching around, I found quite a lot of people struggling with this, which made me look into a solution that automates building an ARM wheel of PyTorch that runs on Nvidia devices (e.g. the Nvidia Jetson Nano) and thus supports CUDA.

The entire process took me around 11 full days, starting off with figuring out how to build the Dockerfile and ending with automating the CI process.

Note: I also decided to utilize this article as an entry for a running Hackathon by Dev.to for GitHub Actions

You might wonder why we want to automate this. When I first compiled this on my Nvidia Jetson Nano, I couldn't get the build past ~80%, even with 16 GB of swap space (the device only has 2 GB of memory). Building it on my personal PC took ~6 hours to complete. That is large and slow enough to be worth automating.

The final source code can be found on GitHub

Contributions

In any project of this size, specific contributions are made. I believe this project makes the following:

  • Install CUDA on non-GPU devices
  • Compile PyTorch with CUDA enabled on non-GPU devices
  • Compile PyTorch for Python > 3.6
  • Build for ARM with CI through Docker Buildx

Project Outline

As a best practice, I always love to include the project outline of how I tackled the issue above (to share my thought process):

  1. Create a Dockerfile that builds the Wheel
    • How do I build for CUDA? (Hardest part)
    • How do I cross build?
  2. Create GitHub Action
    • How do I cross build? Can I run for ARM specifically?

Dockerfile Creation - Building PyTorch for ARM and Python > 3.6

The hardest part BY FAR is compiling PyTorch for ARM and Python > 3.6 with CUDA enabled, because we build on a non-GPU device where CUDA is not available. I split the Dockerfile into specific sections, of which the first 3 are setup steps that can be run in parallel:

  1. Set up CUDA
  2. Set up PyTorch (the cloning takes a while)
  3. Set up Python 3.9
  4. Compile PyTorch
    • I have Nvidia Jetson optimisations included here, thanks QEngineering!
  5. Dockerfile Result

Below you can find the explanation of the respective steps. For the final Dockerfile, optimizations were made to decrease the Docker layer sizes (by grouping the RUN instructions into one command).

Setting up CUDA

For CUDA, we do not have the CUDA libraries, nor do we have access to them! There is, however, a trick that allows us to get CUDA loaded in: we copy over the public key of the NVIDIA Jetson repository so that apt trusts it, add the repository to our sources, and then use our package manager to install the libraries:

V_CUDA=10.2
V_CUDA_DASH=10-2
V_L4T=r32.6
# Add the public key
echo "[Builder] Adding the Jetson Public Key"
curl https://repo.download.nvidia.com/jetson/jetson-ota-public.asc > /etc/apt/trusted.gpg.d/jetson-ota-public.asc
echo "deb https://repo.download.nvidia.com/jetson/common ${V_L4T} main" >> /etc/apt/sources.list.d/nvidia-l4t-apt-source.list
echo "deb https://repo.download.nvidia.com/jetson/t186 ${V_L4T} main" >> /etc/apt/sources.list.d/nvidia-l4t-apt-source.list
# Install the CUDA Libraries
echo "[Builder] Installing CUDA System"
apt-get update
apt-get install -y --no-install-recommends \
    cuda-libraries-$V_CUDA_DASH \
    cuda-libraries-dev-$V_CUDA_DASH \
    cuda-nvtx-$V_CUDA_DASH \
    cuda-minimal-build-$V_CUDA_DASH \
    cuda-license-$V_CUDA_DASH \
    cuda-command-line-tools-$V_CUDA_DASH \
    libnvvpi1 vpi1-dev
# Link CUDA to /usr/local/cuda
ln -s /usr/local/cuda-$V_CUDA /usr/local/cuda
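The version strings in this step are related: the package suffix is the CUDA version with the dot swapped for a dash, and the two apt entries differ only in the repository name. A minimal sketch of deriving them from two inputs follows. The helper names cuda_dash and jetson_apt_sources are hypothetical and not part of the project; 10.2 and r32.6 are the versions used in this article.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; they only print text and install nothing.

# Turn a CUDA version such as "10.2" into the package suffix "10-2".
cuda_dash() {
    printf '%s\n' "$1" | tr '.' '-'
}

# Print the two apt source lines for a given L4T release (e.g. "r32.6").
jetson_apt_sources() {
    for repo in common t186; do
        printf 'deb https://repo.download.nvidia.com/jetson/%s %s main\n' "$repo" "$1"
    done
}

cuda_dash 10.2
jetson_apt_sources r32.6
```

The output of jetson_apt_sources could be redirected into /etc/apt/sources.list.d/nvidia-l4t-apt-source.list instead of writing both lines by hand.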

When we eventually start compiling, we will see CUDA enabled in our CMake output 🥳

# USE_CUDA : ON
# Split CUDA : OFF
# CUDA static link : OFF
# USE_CUDNN : OFF
# USE_EXPERIMENTAL_CUDNN_V8_API: OFF
# CUDA version : 10.2
# CUDA root directory : /usr/local/cuda
# CUDA library : /usr/local/cuda/lib64/stubs/libcuda.so
# cudart library : /usr/local/cuda/lib64/libcudart.so
# cublas library : /usr/local/cuda/lib64/libcublas.so
# cufft library : /usr/local/cuda/lib64/libcufft.so
# curand library : /usr/local/cuda/lib64/libcurand.so
# nvrtc : /usr/local/cuda/lib64/libnvrtc.so
# CUDA include path : /usr/local/cuda/include
# NVCC executable : /usr/local/cuda/bin/nvcc
# NVCC flags : <CUT>
# CUDA host compiler : /usr/bin/clang
# NVCC --device-c : OFF
# USE_TENSORRT : OFF
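Since a full build takes hours, it can help to fail early when CUDA was not picked up. A minimal sketch, assuming the CMake summary has been saved to a log file or is piped in. The function name cuda_enabled is hypothetical, and the pattern only targets the summary format shown above:

```shell
#!/bin/sh
# Reads a CMake summary on stdin and succeeds only if it reports USE_CUDA as ON.
# Hypothetical helper; it does not run CMake itself.
cuda_enabled() {
    grep -Eq 'USE_CUDA[[:space:]]*:[[:space:]]*ON'
}

# Example with two lines copied from the summary above
if printf '%s\n' '# USE_CUDA : ON' '# USE_CUDNN : OFF' | cuda_enabled; then
    echo "CUDA is enabled in this build"
fi
```

In a build script this could be used as `cuda_enabled < build.log || exit 1`, where build.log is whatever file the CMake output was captured to.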

Setting up PyTorch

In a separate Docker stage, we set up PyTorch and clone it to the working directory (in our case /build/pytorch).

V_PYTORCH=v1.10.0
# Downloads PyTorch to /build/pytorch
git clone --recursive --branch ${V_PYTORCH} https://github.com/pytorch/pytorch /build/pytorch

Setting up Python 3.9

We install our Python version through the deadsnakes PPA and link it as the default one.

As a best practice we should use a venv, but since this runs in a Docker container, this should suffice.

# Setting up Python 3.9
RUN add-apt-repository ppa:deadsnakes/ppa \
&& apt-get update \
&& apt-get install -y python${V_PYTHON} python${V_PYTHON}-dev python${V_PYTHON}-venv python${V_PYTHON_MAJOR}-tk \
&& rm /usr/bin/python \
&& rm /usr/bin/python3 \
&& ln -s $(which python${V_PYTHON}) /usr/bin/python \
&& ln -s $(which python${V_PYTHON}) /usr/bin/python${V_PYTHON_MAJOR} \
&& curl --silent --show-error https://bootstrap.pypa.io/get-pip.py | python
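Because the whole point of this step is a Python newer than 3.6, a small guard can confirm that the relinked interpreter is the expected one. A minimal sketch, where the function name version_gt is hypothetical and versions are compared numerically so that 3.10 counts as newer than 3.6:

```shell
#!/bin/sh
# Succeeds if version $1 is strictly greater than version $2.
# Only MAJOR and MINOR are compared; a patch component is ignored.
version_gt() (
    major1=${1%%.*}
    minor1=${1#*.}; minor1=${minor1%%.*}
    major2=${2%%.*}
    minor2=${2#*.}; minor2=${minor2%%.*}
    [ "$major1" -gt "$major2" ] ||
        { [ "$major1" -eq "$major2" ] && [ "$minor1" -gt "$minor2" ]; }
)

version_gt 3.9 3.6 && echo "3.9 is new enough"
```

Inside the container, the first argument could come from the interpreter itself, for example `version_gt "$(python -c 'import sys; print("%d.%d" % sys.version_info[:2])')" 3.6`.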

Compiling PyTorch

The last step in the Dockerfile is to compile PyTorch. For this we set the correct environment variables to enable CUDA and to speed up the build by turning off parts we do not need (e.g. MKLDNN, NNPACK, XNNPACK, ...).

We also configure it to use clang. The Nvidia Jetson has ARM NEON registers, which clang supports, whereas GCC fails on this code with a "no expression" error.

For our source, we utilize the other layer we created and just copy it from there.

COPY --from=downloader-pytorch /build/pytorch /build/pytorch
WORKDIR /build/pytorch
# PyTorch - Build - Prerequisites
# Set clang as compiler
# clang supports the ARM NEON registers
# GNU GCC will give "no expression error"
ARG CC=clang
ARG CXX=clang++
# Build
RUN rm build/CMakeCache.txt || : \
    && sed -i -e "/^if(DEFINED GLIBCXX_USE_CXX11_ABI)/i set(GLIBCXX_USE_CXX11_ABI 1)" CMakeLists.txt \
    && pip install -r requirements.txt \
    && python setup.py bdist_wheel
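The snippet above leaves out the feature switches themselves. As a rough sketch, the same switches can be expressed as shell exports, for example when running the build outside Docker. The values are taken from the full Dockerfile in this article. The function name jetson_build_env and the MAX_JOBS cap are my own additions and are not part of the original build:

```shell
#!/bin/sh
# Export the build switches before running `python setup.py bdist_wheel`.
# Hypothetical helper; optional argument: number of parallel compile jobs.
jetson_build_env() {
    export CC=clang CXX=clang++
    export USE_CUDA=ON
    # Compute capabilities of the Jetson family (5.3 Nano/TX1, 6.2 TX2, 7.2 Xavier)
    export TORCH_CUDA_ARCH_LIST="5.3;6.2;7.2"
    # Turn off components that are not needed for this build
    export USE_MKLDNN=0 USE_NNPACK=0 USE_XNNPACK=0 USE_QNNPACK=0
    export USE_NCCL=0 USE_DISTRIBUTED=0 BUILD_TEST=0
    # Assumption, not in the original Dockerfile: cap compile jobs to limit memory use
    export MAX_JOBS="${1:-2}"
}

jetson_build_env 4
echo "USE_CUDA=$USE_CUDA MAX_JOBS=$MAX_JOBS"
```

Lowering MAX_JOBS trades build time for memory, which may matter on machines where the build otherwise runs out of RAM.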

Copying the result as an Artifact

Docker Buildx is amazing in the sense that we can use the --output type=local,dest=. option to write files to our local filesystem, so Docker builds the image and we can export the build result as an artifact.

To achieve this, we start from the scratch image and copy our result into it from the builder stage. Our / path will then contain all the built PyTorch wheels (e.g. torch-1.10.0a0+git36449ea-cp39-cp39-linux_aarch64.whl).

FROM scratch as artifact
COPY --from=builder /pytorch/dist/* /
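The wheel filename encodes the Python version and platform it was built for, which gives a cheap sanity check on the exported artifact. A minimal sketch using the example filename above, where the function names are hypothetical and the filename is assumed to have no optional build tag:

```shell
#!/bin/sh
# Wheel names follow: name-version-pythontag-abitag-platformtag.whl

# Print the platform tag (last dash-separated field).
wheel_platform_tag() (
    name=${1##*/}          # strip any directory part
    name=${name%.whl}      # strip the extension
    printf '%s\n' "${name##*-}"
)

# Print the Python tag (third field from the end).
wheel_python_tag() (
    name=${1##*/}
    name=${name%.whl}
    name=${name%-*}        # drop the platform tag
    name=${name%-*}        # drop the ABI tag
    printf '%s\n' "${name##*-}"
)

wheel=wheels/torch-1.10.0a0+git36449ea-cp39-cp39-linux_aarch64.whl
wheel_python_tag "$wheel"     # prints cp39
wheel_platform_tag "$wheel"   # prints linux_aarch64
```

A CI step could compare these tags against the expected cp39 and linux_aarch64 before uploading the artifact.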

Dockerfile Result

Finally, the full Dockerfile will look like this:

# ##################################################################################
# Setup Nvidia CUDA for Jetson
# ##################################################################################
FROM ubuntu:18.04 as cuda-devel
# Configuration Arguments
ARG V_CUDA_MAJOR=10
ARG V_CUDA_MINOR=2
ARG V_L4T_MAJOR=32
ARG V_L4T_MINOR=6
ENV V_CUDA=${V_CUDA_MAJOR}.${V_CUDA_MINOR}
ENV V_CUDA_DASH=${V_CUDA_MAJOR}-${V_CUDA_MINOR}
ENV V_L4T=r${V_L4T_MAJOR}.${V_L4T_MINOR}
# Expose environment variables everywhere
ENV CUDA=${V_CUDA_MAJOR}.${V_CUDA_MINOR}
# Accept default answers for everything
ENV DEBIAN_FRONTEND=noninteractive
# Fix CUDA info
ARG DPKG_STATUS
# Add NVIDIA repo/public key and install VPI libraries
RUN echo "$DPKG_STATUS" >> /var/lib/dpkg/status \
&& echo "[Builder] Installing Prerequisites" \
&& apt-get update \
&& apt-get install -y --no-install-recommends ca-certificates software-properties-common curl gnupg2 apt-utils \
&& echo "[Builder] Installing CUDA Repository" \
&& curl https://repo.download.nvidia.com/jetson/jetson-ota-public.asc > /etc/apt/trusted.gpg.d/jetson-ota-public.asc \
&& echo "deb https://repo.download.nvidia.com/jetson/common ${V_L4T} main" >> /etc/apt/sources.list.d/nvidia-l4t-apt-source.list \
&& echo "deb https://repo.download.nvidia.com/jetson/t186 ${V_L4T} main" >> /etc/apt/sources.list.d/nvidia-l4t-apt-source.list \
&& echo "[Builder] Installing CUDA System" \
&& apt-get update \
&& apt-get install -y --no-install-recommends \
cuda-libraries-${V_CUDA_DASH} \
cuda-libraries-dev-${V_CUDA_DASH} \
cuda-nvtx-${V_CUDA_DASH} \
cuda-minimal-build-${V_CUDA_DASH} \
cuda-license-${V_CUDA_DASH} \
cuda-command-line-tools-${V_CUDA_DASH} \
libnvvpi1 vpi1-dev \
&& ln -s /usr/local/cuda-${V_CUDA} /usr/local/cuda \
&& rm -rf /var/lib/apt/lists/*
# Update environment
ENV LIBRARY_PATH=/usr/local/cuda/lib64/stubs
RUN ln -fs /usr/share/zoneinfo/Europe/Brussels /etc/localtime
# ##################################################################################
# Create PyTorch Docker Layer
# We do this separately since else we need to keep rebuilding
# ##################################################################################
FROM --platform=$BUILDPLATFORM ubuntu:18.04 as downloader-pytorch
# Configuration Arguments
# https://github.com/pytorch/pytorch
ARG V_PYTORCH=v1.10.0
# https://github.com/pytorch/vision
ARG V_PYTORCHVISION=v0.11.1
# https://github.com/pytorch/audio
ARG V_PYTORCHAUDIO=v0.10.0
# Install Git Tools
RUN apt-get update \
&& apt-get install -y --no-install-recommends software-properties-common apt-utils git \
&& rm -rf /var/lib/apt/lists/* \
&& apt-get clean
# Accept default answers for everything
ENV DEBIAN_FRONTEND=noninteractive
# Clone Source
RUN git clone --recursive --branch ${V_PYTORCH} https://github.com/pytorch/pytorch
# ##################################################################################
# Build PyTorch for Jetson (with CUDA)
# ##################################################################################
FROM cuda-devel as builder
# Configuration Arguments
ARG V_PYTHON_MAJOR=3
ARG V_PYTHON_MINOR=9
ENV V_PYTHON=${V_PYTHON_MAJOR}.${V_PYTHON_MINOR}
# Accept default answers for everything
ENV DEBIAN_FRONTEND=noninteractive
# Download Common Software
RUN apt-get update \
&& apt-get install -y clang build-essential bash ca-certificates git wget cmake curl software-properties-common ffmpeg libsm6 libxext6 libffi-dev libssl-dev xz-utils zlib1g-dev liblzma-dev
# Setting up Python 3.9
WORKDIR /install
RUN add-apt-repository ppa:deadsnakes/ppa \
&& apt-get update \
&& apt-get install -y python${V_PYTHON} python${V_PYTHON}-dev python${V_PYTHON}-venv python${V_PYTHON_MAJOR}-tk \
&& rm /usr/bin/python \
&& rm /usr/bin/python${V_PYTHON_MAJOR} \
&& ln -s $(which python${V_PYTHON}) /usr/bin/python \
&& ln -s $(which python${V_PYTHON}) /usr/bin/python${V_PYTHON_MAJOR} \
&& curl --silent --show-error https://bootstrap.pypa.io/get-pip.py | python
# PyTorch - Build - Source Code Setup
# copy everything from the downloader-pytorch layer /pytorch to /pytorch on this one
COPY --from=downloader-pytorch /pytorch /pytorch
WORKDIR /pytorch
# PyTorch - Build - Prerequisites
# Set clang as compiler
# clang supports the ARM NEON registers
# GNU GCC will give "no expression error"
ARG CC=clang
ARG CXX=clang++
# Set path to ccache
ARG PATH=/usr/lib/ccache:$PATH
# Other arguments
ARG USE_CUDA=ON
ARG USE_CUDNN=ON
ARG BUILD_CAFFE2_OPS=0
ARG USE_FBGEMM=0
ARG USE_FAKELOWP=0
ARG BUILD_TEST=0
ARG USE_MKLDNN=0
ARG USE_NNPACK=0
ARG USE_XNNPACK=0
ARG USE_QNNPACK=0
ARG USE_PYTORCH_QNNPACK=0
ARG TORCH_CUDA_ARCH_LIST="5.3;6.2;7.2"
ARG USE_NCCL=0
ARG USE_SYSTEM_NCCL=0
ARG USE_OPENCV=0
ARG USE_DISTRIBUTED=0
# Build
RUN cd /pytorch \
&& rm build/CMakeCache.txt || : \
&& sed -i -e "/^if(DEFINED GLIBCXX_USE_CXX11_ABI)/i set(GLIBCXX_USE_CXX11_ABI 1)" CMakeLists.txt \
&& pip install -r requirements.txt \
&& python setup.py bdist_wheel \
&& cd ..
# ##################################################################################
# Prepare Artifact
# ##################################################################################
FROM scratch as artifact
COPY --from=builder /pytorch/dist/* /

GitHub Action Creation

Now that PyTorch is finally compiling, it is time to automate the build and publish the wheels as an artifact on GitHub (this way we can always trigger it ourselves and kick off the build process). I want the build to start automatically as soon as a release is published, so our action has the following outline:

Action Outline

  1. When a release is created trigger the action
  2. Clone the repository
  3. Setup Docker with Buildx
  4. Run our container
  5. Copy over the Built Wheel to an artifact on GitHub

Used Actions

As for actions, the following actions could be reused:

  1. docker/setup-buildx-action
    • I cross-compile for ARM on AMD64 machines in the pipeline
  2. docker/setup-qemu-action
    • Configure QEMU to be able to compile for ARM and install the QEMU static binaries
  3. actions/checkout
    • Check out a repo
  4. actions/cache
    • Allow us to cache the docker layers
  5. actions/upload-artifact
    • Upload the output of a directory to GitHub artifacts

Result

This finally results in the following GitHub Actions workflow:

name: ci
# https://docs.github.com/en/actions/learn-github-actions/events-that-trigger-workflows
on:
  push:
    branches: [ main ]
  release:
    types: [ created ]
jobs:
  build_wheels:
    runs-on: ubuntu-latest
    steps:
      # <version>: pin each action to the release tag you want to use
      - name: Checkout Code
        uses: actions/checkout@<version>
      - name: Set up QEMU
        uses: docker/setup-qemu-action@<version>
      - name: Set up Docker Buildx
        id: buildx
        uses: docker/setup-buildx-action@<version>
      - name: Cache Docker layers
        uses: actions/cache@<version>
        with:
          path: /tmp/.buildx-cache
          key: ${{ runner.os }}-buildx-${{ github.sha }}
          restore-keys: |
            ${{ runner.os }}-buildx-
      # The cache flags read the restored cache and write the new one
      # that the "Move cache" step below expects
      - name: Build Docker Image
        run: |
          docker buildx build \
            --platform=linux/arm64 \
            --progress=plain \
            --cache-from type=local,src=/tmp/.buildx-cache \
            --cache-to type=local,dest=/tmp/.buildx-cache-new \
            --output type=local,dest=./wheels \
            --file Dockerfile.jetson .
      - name: Upload Artifacts
        uses: actions/upload-artifact@<version>
        with:
          name: wheels
          path: |
            wheels/*.whl
      # This ugly bit is necessary if you don't want your cache to grow forever
      # till it hits GitHub's limit of 5GB.
      # Temp fix
      # https://github.com/docker/build-push-action/issues/252
      # https://github.com/moby/buildkit/issues/1896
      - name: Move cache
        run: |
          rm -rf /tmp/.buildx-cache
          mv /tmp/.buildx-cache-new /tmp/.buildx-cache

Future Work

Extra Optimizations to be made

Some extra support could still be added for the Nvidia Jetson Nano by adapting the source code, but this is currently out of scope for this project. These optimisations can be found in QEngineering's post.

GitHub Actions Improvements

Currently, build arguments are included but not yet used. In theory, they can be added to the docker buildx command to build for other Python versions:

Python 3.8

docker buildx build \
    --platform=linux/arm64 \
    --progress=plain \
    --build-arg V_PYTHON_MAJOR=3 \
    --build-arg V_PYTHON_MINOR=8 \
    --output type=local,dest=./wheels \
    --file Dockerfile.jetson .
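To build several Python versions in one go, the command can be generated per version. A minimal sketch that only prints the commands, assuming the Dockerfile's V_PYTHON_MAJOR and V_PYTHON_MINOR arguments. The function name buildx_cmd and the per-version output directories are hypothetical:

```shell
#!/bin/sh
# Print (not run) the buildx command for one Python version given as MAJOR.MINOR.
buildx_cmd() (
    major=${1%%.*}
    minor=${1#*.}
    printf 'docker buildx build --platform=linux/arm64 --progress=plain'
    printf ' --build-arg V_PYTHON_MAJOR=%s --build-arg V_PYTHON_MINOR=%s' "$major" "$minor"
    printf ' --output type=local,dest=./wheels/py%s%s --file Dockerfile.jetson .\n' "$major" "$minor"
)

for version in 3.8 3.9; do
    buildx_cmd "$version"
done
```

Printing first makes it easy to review the commands. Piping a line into sh, as in `buildx_cmd 3.8 | sh`, would then start the actual build.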

Conclusion

This project was definitely not easy: builds took a long time, and I had to figure out where to build and how to automate it. By sharing this, I hope to help the community use GPUs more easily with the latest versions.

In the next article, I hope to show you how to run an actual AI model with CUDA enabled on the Nvidia Jetson Nano and Python > 3.6 😉

References

All of the above was possible by some contributions of others:


Xavier Geerinck © 2020
