March 12, 2021 - ai ml rl

Reinforcement Learning with Azure Kubernetes Service (AKS) with Ray RLLib

Xavier Geerinck

@XavierGeerinck

Did you enjoy reading? Or do you want to stay up-to-date of new Articles?

Consider sponsoring me or providing feedback so I can continue creating high-quality articles!

Besides spinning up a Ray cluster on a Kubernetes cluster, it's also possible to deploy it directly on Azure (it's actually even easier)!

Prerequisites

Make sure you have ray installed and are connected to Azure.

# Install ray + azure
pip install ray azure-cli azure-core
# Authenticate on Azure + set subscription id
az login
az account set -s <SUBSCRIPTION_ID>
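As a quick sanity check before continuing, the small Python sketch below verifies that the two command-line tools used in this article (`ray` and `az`) can be found on the PATH. The helper name `missing_tools` is made up for this illustration; it is not part of Ray or the Azure CLI.

```python
import shutil

# CLI tools this walkthrough relies on: `ray` (cluster launcher) and `az` (Azure CLI).
REQUIRED_TOOLS = ("ray", "az")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required tools are available.")
```

This only checks that the executables exist; it does not confirm that `az login` succeeded or that the right subscription is selected.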

Installing Ray on Azure

To get started, we can use the Ray autoscaler together with the example-full.yaml configuration file from the Ray repository. Download this file, edit it to your needs, and then execute the command below.

⚠️ You can edit the example-full.yaml file to use a different subscription than the one you are logged in with (the default).

In this example, I added the following extra commands:

head_setup_commands:
- pip install azure-cli-core==2.20.0 azure-mgmt-compute==19.0.0 azure-mgmt-msi==1.0.0 azure-mgmt-network==18.0.0
- sudo apt-get install libglib2.0-0
worker_setup_commands: []
setup_commands: []

Finally, start the cluster with:

ray up example-full.yaml

⚠️ At the time of writing this article, there was a small issue (PR #14750) in the upstream version. To deploy correctly, make sure to include worker_setup_commands: [] and setup_commands: [] in your deploy file, otherwise they will be overwritten.
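To avoid forgetting one of these keys, the check can be automated. The sketch below works on the configuration as a plain Python dict and fills in an empty list for every setup-command key that is missing. The function name `with_explicit_setup_commands` is hypothetical and only serves this illustration.

```python
# Keys that should always be present in the deploy file (see the warning above).
REQUIRED_KEYS = ("setup_commands", "head_setup_commands", "worker_setup_commands")


def with_explicit_setup_commands(config):
    """Return a shallow copy of `config` in which every setup-command key exists."""
    patched = dict(config)
    for key in REQUIRED_KEYS:
        # Keep existing values, add an empty list only where the key is absent.
        patched.setdefault(key, [])
    return patched


example = {"head_setup_commands": ["sudo apt-get install libglib2.0-0"]}
print(with_explicit_setup_commands(example))
```

To apply this to a real file, the dict could be loaded and written back with PyYAML (`yaml.safe_load` and `yaml.safe_dump`), assuming PyYAML is available in your environment.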

Once this setup finishes, we will see something like this as output:

Acquiring an up-to-date head node
Launched a new head node
Fetching the new head node
<1/1> Setting up head node
Prepared bootstrap config
New status: waiting-for-ssh
[1/7] Waiting for SSH to become available
Running `uptime` as a test.
Fetched IP: <MASKED_IP>
ssh: connect to host <MASKED_IP> port 22: Connection refused
SSH still not available (SSH command failed.), retrying in 5 seconds.
# -- snipped
Warning: Permanently added '<MASKED_IP>' (ECDSA) to the list of known hosts.
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.
09:45:57 up 1 min, 1 user, load average: 2.38, 0.77, 0.28
Shared connection to <MASKED_IP> closed.
Success.
Updating cluster configuration. [hash=60c1cc4dff2c06f8a558dd628bc149cd3fad461d]
New status: syncing-files
[2/7] Processing file mounts
Shared connection to <MASKED_IP> closed.
/home/ubuntu/.ssh/id_rsa.pub from /home/xavier/.ssh/id_rsa.pub
Shared connection to <MASKED_IP> closed.
Shared connection to <MASKED_IP> closed.
[3/7] No worker file mounts to sync
New status: setting-up
[4/7] Running initialization commands
Warning: Permanently added '<MASKED_IP>' (ECDSA) to the list of known hosts.
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.
Connection to <MASKED_IP> closed.
Warning: Permanently added '<MASKED_IP>' (ECDSA) to the list of known hosts.
Connection to <MASKED_IP> closed.
[5/7] Initalizing command runner
Warning: Permanently added '<MASKED_IP>' (ECDSA) to the list of known hosts.
Shared connection to <MASKED_IP> closed.
nightly-py37: Pulling from rayproject/ray
5d3b2c2d21bb: Pull complete
3fc2062ea667: Pull complete
75adf526d75b: Pull complete
cb9cc0ffd7d7: Pull complete
20e6bba2821c: Pull complete
5f94c257d7a8: Pull complete
8d2d31defa88: Pull complete
0dc6a7b56a50: Pull complete
96fa1d3e5cdd: Pull complete
Digest: sha256:f3f7961c9b2fba6f870027b279fada0f5f53bdd02b23c95310c57bf6ab4c154c
Status: Downloaded newer image for rayproject/ray:nightly-py37
docker.io/rayproject/ray:nightly-py37
Shared connection to <MASKED_IP> closed.
NVIDIA-SMI has failed because it could not communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
Shared connection to <MASKED_IP> closed.
2021-03-19 10:47:53,745 WARNING command_runner.py:901 -- Nvidia Container Runtime is present, but no GPUs found.
Shared connection to <MASKED_IP> closed.
bc7d51d7d412b7286c45dc1c1ac5ba256f31b322619d7102c55d6dd6815c0d32
Shared connection to <MASKED_IP> closed.
# -- snipped python packages installation
Successfully installed PyJWT-1.7.1 azure-cli-core-2.20.0 azure-mgmt-compute-19.0.0 azure-mgmt-core-1.2.2 azure-mgmt-network-18.0.0 cryptography-3.3.2 knack-0.8.0rc2 msal-1.10.0
Shared connection to <MASKED_IP> closed.
[7/7] Starting the Ray runtime
Did not find any active Ray processes.
Shared connection to <MASKED_IP> closed.
Local node IP: <LOCAL_NODE_IP>
2021-03-19 02:48:49,516 INFO services.py:1256 -- View the Ray dashboard at http://127.0.0.1:8265
--------------------
Ray runtime started.
--------------------
Next steps
To connect to this Ray runtime from another node, run
ray start --address='<LOCAL_NODE_IP>:6379' --redis-password='<REDIS_PW>'
Alternatively, use the following Python code:
import ray
ray.init(address='auto', _redis_password='<REDIS_PW>')
If connection fails, check your firewall settings and network configuration.
To terminate the Ray runtime, run
ray stop
Shared connection to <MASKED_IP> closed.
New status: up-to-date
Useful commands
Monitor autoscaling with
ray exec /home/xavier/Projects/azure-rllib/rw-train/azure/deploy.yaml 'tail -n 100 -f /tmp/ray/session_latest/logs/monitor*'
Connect to a terminal on the cluster head:
ray attach /home/xavier/Projects/azure-rllib/rw-train/azure/deploy.yaml
Get a remote shell to the cluster manually:
ssh -tt -o IdentitiesOnly=yes -i ~/.ssh/id_rsa ubuntu@<MASKED_IP> docker exec -it ray_container /bin/bash

Running a Test

Now that our cluster is up and running, it's useful to check what it can do, so let's run a test.

Create a python file named cartpole.py with the following content:
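The listing below is a minimal sketch of such a file, not necessarily the exact script used for this article. It assumes the Ray 1.x style `tune.run` API, RLlib installed on the cluster (for example through `pip install "ray[rllib]"`), PyTorch available on the nodes, and the Gym environment id `CartPole-v0`. The worker count and stop criteria are arbitrary values chosen for a short test run.

```python
"""cartpole.py: train PPO on CartPole with RLlib (illustrative sketch)."""


def build_config(num_workers=2):
    """Return an RLlib config dict; every value here is an assumption."""
    return {
        "env": "CartPole-v0",        # Gym environment id
        "num_workers": num_workers,  # rollout workers spread over the cluster
        "framework": "torch",        # assumes PyTorch is installed on the nodes
    }


# Training stops as soon as one of these is reached (arbitrary test values).
STOP_CRITERIA = {"episode_reward_mean": 150, "training_iteration": 30}


def main():
    try:
        import ray
        from ray import tune
    except ImportError:
        print("Ray with RLlib is not installed in this environment.")
        return

    try:
        # "auto" connects to the Ray cluster already running on this node.
        ray.init(address="auto")
    except ConnectionError:
        print("No running Ray cluster found; start one with `ray up` first.")
        return

    tune.run("PPO", config=build_config(), stop=STOP_CRITERIA)


if __name__ == "__main__":
    main()
```

Keeping the configuration in `build_config` makes it easy to change the number of workers without touching the rest of the script.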

Once that is created, we can submit it to the cluster with:

ray submit deploy.yaml cartpole.py --start --tmux
ray attach deploy.yaml --tmux

The first command starts the cluster if needed and runs our script in a tmux session; the second attaches to that session so we can follow the training output.

Conclusion

The Ray library is simply amazing in what it does and how it does it: running distributed compute clusters in the cloud has been made very easy. Combined with Spot instances, Ray is a clear choice whenever we are working on workloads such as Reinforcement Learning!

Xavier Geerinck © 2020
