
Training the Continuous Lunar Lander with Reinforcement Learning, RLLib and PPO


For an upcoming blog post, I would like a robotic arm to land a Lunar Lander autonomously. In Part 1 I explained how we can build such a robotic arm; now we need to go deeper into how we can train an agent in a simulated environment (before deploying it on a physical device).

That's where this article comes in. Here I want to explain how we can train an agent on the "Continuous Lunar Lander" environment in the OpenAI Gym, so that we can use the results in Part 2 to control the actual robot arm.


For this article, I want to use the Roadwork-RL framework in combination with RLLib to train a Lunar Lander to land. The normal Lunar Lander has a discrete action space, so I decided to focus on the Continuous Lunar Lander instead, which has a Box action space. But more about this later.

  • Roadwork-RL: Roadwork-RL is a framework that acts as the wrapper between our environments and the training algorithms. Its goal is to abstract the physical or virtual environment to allow training frameworks such as RLLib to act on it. Roadwork-RL utilizes the gRPC RPC system for performant communication between the environments and the training environment.
  • RLLib: RLLib is a framework built on Ray that abstracts away Reinforcement Learning algorithms and is capable of running them in a scalable way.

At the end of this article you should be able to train a lunar-lander environment yourself.

Lunar Lander Environment

The Lunar Lander example is available in the OpenAI Gym in a Discrete and a Continuous variant. The goal is to land a Lunar Lander between 2 flag poles, as close to the landing pad as possible, making sure that both legs are touching the ground. We control the Lunar Lander by taking actions and get a reward in return, as is normal in Reinforcement Learning.

Note: We will be using the LunarLanderContinuous environment because we can map it easily to our Xbox Controller thumbsticks, and because it's more interesting.

To decompose this problem, we can state the following in terms of its action space, observation space and reward function:

Observation Space: The observation space is a "Box" containing 8 values in the range [ $-\infty$, $\infty$ ]. These values are:

  • Position X
  • Position Y
  • Velocity X
  • Velocity Y
  • Angle
  • Angular Velocity
  • Is left leg touching the ground: 0 OR 1
  • Is right leg touching the ground: 0 OR 1
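
To make this layout concrete, the eight values can be unpacked into named fields. This is only a minimal sketch: the `LanderObservation` class, its field names and the `parse_observation` function are illustrative assumptions (they are not part of OpenAI Gym or Roadwork-RL), and the field order simply follows the list above.

```python
from typing import NamedTuple, Sequence


class LanderObservation(NamedTuple):
    """Hypothetical helper: names for the 8 observation values listed above."""
    x: float
    y: float
    vx: float
    vy: float
    angle: float
    angular_velocity: float
    left_leg_contact: bool
    right_leg_contact: bool


def parse_observation(obs: Sequence[float]) -> LanderObservation:
    """Convert a raw 8-element observation into named fields."""
    if len(obs) != 8:
        raise ValueError(f"expected 8 values, got {len(obs)}")
    # The first six values are kinematics, the last two are leg contact flags.
    *kinematics, left, right = obs
    return LanderObservation(*(float(v) for v in kinematics), bool(left), bool(right))
```

With this in place, `parse_observation(state).left_leg_contact` reads more clearly than `state[6]` when debugging an episode.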

Action Space:

  • Discrete (Discrete Action Space with 4 values):
      • 0 = Do Nothing
      • 1 = Fire Left Engine
      • 2 = Fire Main Engine
      • 3 = Fire Right Engine
  • Continuous (Box Action Space with 2 values between -1 and +1):
      • Value 1: [-1.0, +1.0] for main engine where [-1.0, 0.0] = Off and [0.0, +1.0] = On
      • Value 2: [-1.0, +1.0] for the side engines where
          • [-1.0, -0.5]: Left Engine
          • [-0.5, 0.5]: Off
          • [0.5, 1.0]: Right Engine
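
The continuous ranges above can be read as a small decoding function. The sketch below is illustrative only: `describe_continuous_action` is a hypothetical name, and which side a boundary value (exactly 0.0 or ±0.5) falls on is an assumption, since the ranges in the list overlap at their edges.

```python
from typing import Sequence, Tuple


def describe_continuous_action(action: Sequence[float]) -> Tuple[str, str]:
    """Map a 2-value continuous action to (main engine, side engine) states.

    Thresholds follow the ranges listed above; the handling of exact
    boundary values is an assumption of this sketch.
    """
    main, side = action
    if not (-1.0 <= main <= 1.0 and -1.0 <= side <= 1.0):
        raise ValueError("both action values must lie in [-1.0, +1.0]")

    # Value 1: main engine is off for the negative half of the range.
    main_state = "on" if main > 0.0 else "off"

    # Value 2: dead zone in the middle, left/right engine at the extremes.
    if side < -0.5:
        side_state = "left"
    elif side > 0.5:
        side_state = "right"
    else:
        side_state = "off"
    return main_state, side_state
```

This is also why the continuous variant maps well onto thumbsticks: each stick axis already produces a value between -1 and +1.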

Reward Function:

The Reward Function is a bit more complex and consists of multiple components:

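  • Moving from the top of the screen to the landing pad and reaching zero speed is worth roughly 100 to 140 points
  • Moving away from the landing pad gives a negative reward
  • The episode ends when the lander crashes or comes to rest, which adds -100 or +100 points respectively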
  • Each leg with ground contact gets +10
  • Firing the main engine is -0.3 per frame
  • Firing the side engine is -0.03 per frame
  • Solved is 200 points
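
As a rough worked example, the fixed components in this list can be tallied by hand. The sketch below is a simplification under stated assumptions: it deliberately ignores the position and speed based part of the reward (the first two bullets), and `fixed_reward_components` and the constant names are hypothetical, not taken from the Gym source.

```python
MAIN_ENGINE_COST = 0.3    # per frame, from the list above
SIDE_ENGINE_COST = 0.03   # per frame, from the list above
LEG_CONTACT_BONUS = 10.0  # per leg touching the ground
TERMINAL_REWARD = 100.0   # +100 for coming to rest, -100 for crashing
SOLVED_THRESHOLD = 200.0  # episode score considered "solved"


def fixed_reward_components(main_frames: int, side_frames: int,
                            legs_on_ground: int, crashed: bool) -> float:
    """Sum the fuel, leg-contact and terminal terms (movement reward excluded)."""
    fuel = MAIN_ENGINE_COST * main_frames + SIDE_ENGINE_COST * side_frames
    legs = LEG_CONTACT_BONUS * legs_on_ground
    terminal = -TERMINAL_REWARD if crashed else TERMINAL_REWARD
    return legs + terminal - fuel
```

For instance, a clean landing on both legs after 100 main-engine frames and 50 side-engine frames gives 20 + 100 - 31.5 = 88.5 points from these terms, so the remainder needed to reach 200 has to come from the movement reward and from burning less fuel.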

Training the Lunar Lander

Now let's get on to the training part of this article. Normally, we would install the OpenAI Gym, write our own Python training code and tune it until the lander lands correctly.

However, many algorithms are already implemented by libraries (such as RLLib in our case), so we want to re-use those as much as possible.

Running the Roadwork Simulation Server

We install Roadwork-RL, which provides the Simulation Server. We can start it with ./Scripts/linux/run-server.sh openai $(pwd)/../output-server/lunar-lander:

(base) xanrin@DESKTOP-C1BL10B:/mnt/e/Projects/roadwork-rl/src$ ./Scripts/linux/run-server.sh openai $(pwd)/../output-server/lunar-lander
Installing Dependencies
Running Server: openai
- Output Directory: /mnt/e/Projects/roadwork-rl/src/../output-server/lunar-lander
OUTPUT_DIRECTORY: /mnt/e/Projects/roadwork-rl/src/../output-server/lunar-lander
Starting server. Listening on port 50050.

The server is now ready to receive requests.
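
Before starting the client, it can be handy to confirm that something is actually listening on port 50050. The following is a minimal sketch using only the Python standard library; `server_is_listening` is a hypothetical helper, and it only checks that a TCP connection can be opened, not that the gRPC service behind it is healthy.

```python
import socket


def server_is_listening(host: str = "localhost", port: int = 50050,
                        timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises OSError (e.g. connection refused, timeout)
        # when nothing accepts connections on the given address.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `server_is_listening()` with the defaults matches the host and port used in the client configuration later in this article.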

Running the Roadwork Client with RLLib

Since the server can now receive requests, let's continue and create our Client implementation. This Client uses RLLib for the training algorithm and connects to the server through the RayEnvironment that Roadwork provides.

To create such a client, we create a new folder that contains a train.py and an infer.py file with the following content, starting with train.py:


import os
import gym
import ray
from ray.rllib.agents import ppo
from roadwork.client import RayEnvironment as RwRayEnvironment

CHECKPOINT_DIR = "/mnt/e/Projects/roadwork-rl/output-server/lunar-lander-continuous-checkpoint"
CHECKPOINT_FILE = "last_checkpoint.out"


# Start Ray before creating the trainer
ray.init()

# Configure RLLib with The Roadwork Environment
trainer = ppo.PPOTrainer(env=RwRayEnvironment, config={ "env_config": {
    "rw_sim": "openai",
    "rw_env": "LunarLanderContinuous-v2",
    "rw_grpc_host": "localhost",
    "rw_grpc_port": 50050
}})

print(f"Starting training, you can view progress through `tensorboard --logdir={CHECKPOINT_DIR}` and opening http://localhost:6006")

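# Attempt to restore from checkpoint if possible.
if os.path.exists(f"{CHECKPOINT_DIR}/{CHECKPOINT_FILE}"):
    # The pointer file holds the path of the most recent checkpoint
    with open(f"{CHECKPOINT_DIR}/{CHECKPOINT_FILE}") as f:
        checkpoint_path = f.read().strip()
    print("Restoring from checkpoint path", checkpoint_path)
    trainer.restore(checkpoint_path)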

while True:
    results = trainer.train()

    rw_date = results["date"]
    rw_timesteps_total = results["timesteps_total"]
    rw_time_total_s = results["time_total_s"]
    rw_episode_reward_mean = results["episode_reward_mean"]

    print(f"{rw_date} INFO Step: {rw_timesteps_total}. Time Elapsed: {rw_time_total_s}s Mean Reward: {rw_episode_reward_mean}")

    checkpoint_path = trainer.save(CHECKPOINT_DIR)
    print("--> Last checkpoint", checkpoint_path)
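    # Remember the latest checkpoint so the next run can resume from it
    with open(f"{CHECKPOINT_DIR}/{CHECKPOINT_FILE}", "w") as f:
        f.write(checkpoint_path)

And infer.py:
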
import os
import gym
import ray
from ray.rllib.agents import ppo
from roadwork.client import RayEnvironment as RwRayEnvironment

CHECKPOINT_DIR = "/mnt/e/Projects/roadwork-rl/output-server/lunar-lander-continuous-checkpoint"
CHECKPOINT_FILE = "checkpoint_107/checkpoint-107" # TODO: Adapt this to your checkpoint file


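# Create Agent
ray.init()

config = {
    "rw_sim": "openai",
    "rw_env": "LunarLanderContinuous-v2",
    "rw_grpc_host": "localhost",
    "rw_grpc_port": 50050
}

test_agent = ppo.PPOTrainer(env=RwRayEnvironment, config={ "env_config": config})
# Load the trained weights from the checkpoint selected above
test_agent.restore(f"{CHECKPOINT_DIR}/{CHECKPOINT_FILE}")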

# Run Inference
env = RwRayEnvironment(config)

done = False
state = env.reset()
cumulative_reward = 0

while not done:
    action = test_agent.compute_action(state)
    print(f"Taking action: {action}")
    state, reward, done, _ = env.step(action)
    print(f"Got reward: {reward}")
    cumulative_reward += reward

# env.monitor_stop()
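
Instead of hard-coding the checkpoint path in infer.py, one option is to read the last_checkpoint.out pointer file that train.py is meant to maintain. The sketch below assumes that this file contains nothing but the path of the latest checkpoint; `read_last_checkpoint` is a hypothetical helper and not part of Roadwork-RL or RLLib.

```python
import os
from typing import Optional


def read_last_checkpoint(checkpoint_dir: str,
                         pointer_file: str = "last_checkpoint.out") -> Optional[str]:
    """Return the checkpoint path stored in the pointer file, or None if absent."""
    pointer_path = os.path.join(checkpoint_dir, pointer_file)
    if not os.path.exists(pointer_path):
        return None
    with open(pointer_path) as f:
        checkpoint_path = f.read().strip()
    # Treat an empty pointer file the same as a missing one.
    return checkpoint_path or None
```

The returned path could then be passed to the agent's restore call, falling back to a manually chosen checkpoint when the function returns None.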

Once these files are created, we can run them by going to the Roadwork-RL src/ directory and executing:

# Start Server
./Scripts/linux/run-server.sh openai $(pwd)/../output-server/lunar-lander-continuous

# Start Training
./Scripts/linux/experiment-train.sh python lunar-lander-continuous

# Start Inference
./Scripts/linux/experiment-infer.sh python lunar-lander-continuous

This will start up and show that checkpoint files are being created:

--> Last checkpoint /mnt/e/Projects/roadwork-rl/output-server/lunar-lander-continuous-checkpoint/checkpoint_29/checkpoint-29
2020-06-07_13-10-52 INFO Step: 120000. Time Elapsed: 658.788571357727s Mean Reward: 23.702257953854424
--> Last checkpoint /mnt/e/Projects/roadwork-rl/output-server/lunar-lander-continuous-checkpoint/checkpoint_30/checkpoint-30
2020-06-07_13-11-18 INFO Step: 124000. Time Elapsed: 684.3401682376862s Mean Reward: 26.25080527652681
--> Last checkpoint /mnt/e/Projects/roadwork-rl/output-server/lunar-lander-continuous-checkpoint/checkpoint_31/checkpoint-31
2020-06-07_13-11-48 INFO Step: 128000. Time Elapsed: 714.5010991096497s Mean Reward: 31.681706762331306

After roughly 300,000 steps, training reaches a good mean reward, and we can use the resulting checkpoint in the infer step. Inference then creates a video in our output-server directory, which is the video shown at the beginning of this article.
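
Since the training loop runs until it is interrupted, a simple stopping rule can help decide when the reward is "good". The sketch below is one possible approach, not the one used for the results above: `SolvedTracker` is a hypothetical helper, and both the 200-point threshold (taken from the "Solved" score) and the window of 5 iterations are assumptions you may want to tune.

```python
from collections import deque
from typing import Deque


class SolvedTracker:
    """Hypothetical helper: flags when the recent mean reward stays above a threshold."""

    def __init__(self, threshold: float = 200.0, window: int = 5) -> None:
        self.threshold = threshold
        self.recent: Deque[float] = deque(maxlen=window)

    def update(self, episode_reward_mean: float) -> bool:
        """Record one training iteration.

        Returns True once the window is full and every value in it
        meets the threshold.
        """
        self.recent.append(episode_reward_mean)
        return (len(self.recent) == self.recent.maxlen
                and min(self.recent) >= self.threshold)
```

Inside the training loop this could be used as `if tracker.update(results["episode_reward_mean"]): break`, placed after the checkpoint has been saved.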


I hope that after reading this article, it is clear how you can train the Lunar Lander example yourself. In the next article, we will use this knowledge to demonstrate landing the lander with the Robotic Arm.