March 12, 2021 - ai ml rl

Reinforcement Learning on Azure Kubernetes Service (AKS) with Ray RLlib

Xavier Geerinck


Did you enjoy reading? Or do you want to stay up to date with new articles?

Consider sponsoring me or providing feedback so I can continue creating high-quality articles!

As with any machine learning algorithm, a compute pool is required to train your model, and reinforcement learning is no different. Through the cloud we now have access to on-demand compute, so we only pay for what we use. But how do we actually reap these benefits? In this post I will explain how you can set up your own reinforcement learning cluster utilizing Azure Kubernetes Service and Spot Instances!

Why Spot Instances

Before we start, I would like to introduce Azure Spot Virtual Machines! You might wonder why we are setting up an Azure Kubernetes cluster with spot instances. The reason is that Spot Instances allow us to utilize "unused" capacity, while getting a significant cost saving in return!

Now, for machine learning use cases with many workers, costs can typically rise quite high, but our jobs don't mind if a worker occasionally drops out. When utilizing spot instances, the Virtual Machines backing our workers can be taken away whenever someone else requires that capacity; in our use case, they are then simply replaced by another instance. This typically costs us a bit of time, but the cost savings are well worth it!


Enough about Spot Instances! How can we actually get started on creating this ourselves? For this, I created the following upfront:

  • A service principal with access rights on our subscription
  • A local Python version that matches the head node (in our example this is 3.7.7), which we can set up by running pyenv install 3.7.7; pyenv local 3.7.7; pip install --upgrade pip (see my previous article on setting up pyenv)
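If you still need to create that service principal, a sketch with the Azure CLI could look like this — the name is arbitrary and the subscription ID is a placeholder for your own:

```shell
# Create a service principal with Contributor rights on the subscription;
# the name is arbitrary and the subscription ID is a placeholder.
az ad sp create-for-rbac \
  --name "ray-aks-sp" \
  --role "Contributor" \
  --scopes "/subscriptions/<subscription-id>"
```

The command prints the appId, password, and tenant you will need to feed into the Terraform variables.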

Creating our Cluster

For the automation part, I utilized a pre-generated script, which we can find on GitHub, that spins up a Kubernetes cluster with a Standard_DS2_v2 node pool (our main Ray node) and a node pool of spot instances.

In the default case, it will utilize Standard_DS2_v2 instances, but feel free to tune this to utilize heavier instances!

A simple terraform init followed by terraform apply will now create the cluster and return its context!

Once this is done, I configured the kubectl context by editing the ~/.kube/config file and adding a context there.
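Alternatively, instead of editing ~/.kube/config by hand, the Azure CLI can merge the context for us. A sketch, where the resource group and cluster names are placeholders for whatever the Terraform script created:

```shell
# Merge the AKS credentials into ~/.kube/config and switch to the new context;
# resource group and cluster name are placeholders.
az aks get-credentials \
  --resource-group "<resource-group>" \
  --name "<cluster-name>"

# Verify we are talking to the right cluster
kubectl config current-context
kubectl get nodes
```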

Install Ray RLlib on Kubernetes

With the cluster created, we can now start training! First, install some dependencies:

pip install kubernetes
pip install 'ray[rllib]'

Once these are installed, we can utilize a demo configuration from the Ray project that sets up a small cluster containing one head node pod and two autoscaling worker pods, each requiring only 1 CPU and 0.5 GiB of RAM.

Save that configuration under ray-example.yaml and open it!

⚠ Make sure to adapt the file, since otherwise the pods won't get scheduled on the spot instances!

When opened, add the following tolerations under the spec: definition of the worker pods — AKS taints spot node pools with kubernetes.azure.com/scalesetpriority=spot:NoSchedule, so without this the workers will never land on them:

tolerations:
- key: "kubernetes.azure.com/scalesetpriority"
  operator: "Equal"
  value: "spot"
  effect: "NoSchedule"

When done, execute ray up ray-example.yaml, which will spin up our cluster!

Finally, after this finishes, we can run ray monitor ray-example.yaml to see the statistics of our running cluster. Or, if we prefer a GUI, we can run ray dashboard ray-example.yaml to forward the cluster dashboard port locally.

Running an example on our cluster

Ok, so our cluster is set up; let's now run a simple experiment on it!

To do this, we can take one of two approaches:

  1. We package our code in a container and run it on the Kubernetes cluster
  2. We forward the Ray head service port with kubectl -n ray port-forward svc/example-cluster-ray-head 10001:10001 and connect through the Ray client with ray.util.connect("127.0.0.1:10001")
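The second option can be sketched as follows. This is an illustrative script, not from the original post; it assumes the port-forward command is already running in another terminal and that your local Ray version matches the one on the cluster:

```python
import ray

# Connect to the Ray head service that was forwarded locally
# (assumes `kubectl -n ray port-forward svc/example-cluster-ray-head 10001:10001`
# is active, and that the local Python/Ray versions match the cluster's).
ray.util.connect("127.0.0.1:10001")

@ray.remote
def square(x):
    return x * x

# These tasks execute on the cluster's (spot) workers, not locally.
futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]
```

If the connection hangs or errors out, a version mismatch between the local client and the cluster is the usual suspect — this is exactly why we pinned Python 3.7.7 earlier.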


Xavier Geerinck © 2020

Twitter - LinkedIn