Gradually adopting Autonomous Systems in Production

Reinforcement Learning is something I have been covering for a while now! Lately, I have seen this space grow tremendously, but I keep seeing the same question come up: "How can a company with existing infrastructure and applications (= brownfield) adopt Reinforcement Learning?"

To answer this question, I would love to share a lesson I learned a while ago after reading about Facebook's amazing ReAgent platform. In it, they explain a great philosophy for going from an existing hardcoded rule engine towards full Reinforcement Learning models, as illustrated below.

Identifying Use Cases

Adopting Reinforcement Learning should be seen as a phased process. This way you maintain full control while improving your application step by step. Often, however, people already get stuck at the very first step: finding the right candidates for Reinforcement Learning.

The rule of thumb is often stated as: "When you can model it as a Markov Decision Process, it's suitable for Reinforcement Learning". This means that when we are in a state s and take an action a, we transition to a next state s' with a specific probability. The next state thus depends on the current state and action, but, given those two, it is conditionally independent of all previous states and actions (= it adheres to the Markov Property).
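Written out, the Markov property says that once the current state s and action a are known, the rest of the history adds no information about the next state s'. The notation S_t and A_t for the state and action at step t is the standard textbook one, not something taken from ReAgent:

```latex
P(S_{t+1} = s' \mid S_t = s, A_t = a, S_{t-1}, A_{t-1}, \dots, S_0, A_0)
  = P(S_{t+1} = s' \mid S_t = s, A_t = a)
```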

An application is a good fit for a Reinforcement Learning model when it can be modelled as a Markov Decision Process

Now, when you are just starting out with Reinforcement Learning, this is not always easy to remember. So let's explain it with an example:

In everyday life, you are either hungry or full. Depending on your state (hungry or full), you take an action (eat or don't eat), which results in a transition. We can go from hungry to full by eating, but we can also stay hungry by not eating. No matter which state we start in, we always know which states we can end up in next.
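The hunger example fits in a few lines as a tiny MDP. This is a minimal sketch: the state and action names and the transition probabilities are made up for illustration. Note that `step` only looks at the current state and action, which is exactly the Markov property.

```python
import random

# Hypothetical toy MDP for the hunger example: P(s' | s, a) per (state, action).
# The probabilities are illustrative, not measured.
TRANSITIONS = {
    ("hungry", "eat"): {"full": 0.9, "hungry": 0.1},
    ("hungry", "dont_eat"): {"hungry": 1.0},
    ("full", "eat"): {"full": 1.0},
    ("full", "dont_eat"): {"full": 0.7, "hungry": 0.3},
}

def step(state: str, action: str, rng: random.Random) -> str:
    """Sample the next state; it depends only on the current state and action."""
    outcomes = TRANSITIONS[(state, action)]
    return rng.choices(list(outcomes), weights=list(outcomes.values()), k=1)[0]

def reachable(state: str) -> set[str]:
    """All states reachable in one step from `state`, over every action."""
    return {
        nxt
        for (src, _), outs in TRANSITIONS.items()
        if src == state
        for nxt, p in outs.items()
        if p > 0
    }
```

`reachable` makes the last sentence above concrete: from either state we know up front which states we can land in.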

As a more practical example, let's take the task of balancing a balloon on our hand. Depending on how the balloon is balanced, we move our hand along the X, Y, and Z axes. Since both the balloon's position and our hand movements take continuous values, this is a continuous control problem.
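To make "continuous" concrete, here is a toy model of the balloon. Everything in it is an assumption for illustration: the `BalloonState` fields, the simplified dynamics in `step`, and a `reward` that scores the distance to a balanced point 1 m above the hand. What matters is that states and actions are real-valued vectors rather than a few discrete options.

```python
import math
from dataclasses import dataclass

@dataclass
class BalloonState:
    # Hypothetical state: balloon offset from the hand along x and y,
    # and its height z above the hand, all continuous values in meters.
    x: float
    y: float
    z: float

def step(state: BalloonState, action: tuple[float, float, float],
         wind: float = 0.0) -> BalloonState:
    """Toy dynamics (an assumption): moving the hand by (dx, dy, dz) shifts the
    balloon's offset relative to the hand the opposite way; wind pushes along x."""
    dx, dy, dz = action
    return BalloonState(state.x - dx + wind, state.y - dy, state.z - dz)

def reward(state: BalloonState) -> float:
    """Higher is better: negative distance from the balanced point (0, 0, 1)."""
    return -math.sqrt(state.x ** 2 + state.y ** 2 + (state.z - 1.0) ** 2)
```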

Adopting Reinforcement Learning

Now that we have identified our use cases, it's time to actually adopt Reinforcement Learning. Let's go over the different phases as illustrated above.

Phase 1: Hardcoded Rules

The first thing we do is hardcode our rules. Remember our balloon? We now hardcode rules of the form: IF a certain situation happens, THEN we perform a certain action.
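A Phase 1 controller can be as simple as a few hand-written IF/THEN branches. In this sketch the 5 cm dead zone and the fixed 5 cm correction are arbitrary numbers chosen for illustration, and only the x axis is handled to keep it short.

```python
# Hypothetical thresholds, hand-picked rather than learned.
DEAD_ZONE = 0.05   # meters of drift we tolerate before reacting
CORRECTION = 0.05  # meters to move the hand when a rule fires

def hardcoded_policy(offset_x: float) -> float:
    """Phase 1 sketch: fixed IF/THEN rules, no data and no training engine.
    `offset_x` is how far the balloon has drifted along x (positive = right);
    the return value is how far to move the hand along x."""
    if offset_x > DEAD_ZONE:     # IF the balloon drifts right, THEN move the hand right
        return CORRECTION
    if offset_x < -DEAD_ZONE:    # IF it drifts left, THEN move the hand left
        return -CORRECTION
    return 0.0                   # no rule fired: keep the hand still
```

Any situation these branches were not written for, such as a sudden gust of wind, still gets one of these three fixed responses.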

Observations we should make here:

  • We are not training on data; we coded all of our own knowledge into the rules
  • We are not using any training engine; everything was predetermined by us
  • If we did not cover every case with a rule, something will eventually happen that we cannot handle

Phase 2: Parameter Store

Next, we go a step further and store the parameters of our rules, so that we can incorporate feedback as it arrives. Back to the balloon: when we take a certain action, did it help keep the balloon balanced or not?

E.g. we have a parameterized model that takes an action based on the state; when we then receive a reward, we adapt the parameter accordingly.
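One way to sketch Phase 2 is to keep the rule's shape (move the hand proportionally to the drift) but store its gain as a parameter that feedback can change. The update rule is an assumption: an epsilon-greedy multi-armed bandit over a few candidate gains, picked for simplicity and not because ReAgent necessarily works this way. `ParameterStore` and its methods are hypothetical names.

```python
import random

class ParameterStore:
    """Phase 2 sketch (hypothetical): stored candidate gains whose average
    reward is updated from feedback, using an epsilon-greedy bandit."""

    def __init__(self, candidate_gains, epsilon=0.1, seed=0):
        self.gains = list(candidate_gains)
        self.counts = [0] * len(self.gains)
        self.mean_reward = [0.0] * len(self.gains)
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def choose(self) -> int:
        # Explore a random gain with probability epsilon, else exploit the best one.
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.gains))
        return max(range(len(self.gains)), key=lambda i: self.mean_reward[i])

    def act(self, arm: int, offset_x: float) -> float:
        # Same rule shape as Phase 1, but the gain now comes from the store.
        return self.gains[arm] * offset_x

    def update(self, arm: int, reward: float) -> None:
        # Incremental running average of the reward observed for this gain.
        self.counts[arm] += 1
        self.mean_reward[arm] += (reward - self.mean_reward[arm]) / self.counts[arm]
```

In a simulated loop you would call `choose`, apply `act`, measure the remaining drift, and feed its negative back through `update`. Note that which gain gets chosen does not depend on the circumstances (wind or no wind); the gain is only tuned on average.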

Our observations have now changed:

  • We are actually training (we change our parameters) based on the feedback we receive
  • We are using a training engine that looks at the state
  • What we have not done yet, however, is make sense of the context around this state! E.g. is it windy or not for our balloon?

Phase 3: Full RL-Model

This last observation brings us to the problem of "context". This is where a full-blown Reinforcement Learning model comes into play. We now add context to our model, so that it can learn how specific parts of the environment, and the nuances we encounter in them, should influence its actions.

E.g. when the state tells us that it is windy and the balloon is in a specific position, we take a different action than when it is not windy and the balloon is in that same position.
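A minimal way to sketch Phase 3 is to put the context directly into the state and learn with tabular Q-learning. This is a simplified stand-in for a full RL model: the discretized offsets, the one-step wind push, the hand moves, and all hyperparameters are assumptions made for this example, and a production system would typically replace the table with a learned function approximator.

```python
import random
from collections import defaultdict

# Hypothetical discretized balloon: offset from the hand runs from -3 to 3
# (0 = balanced); the context is a flag saying whether it is windy.
POSITIONS = list(range(-3, 4))
ACTIONS = (-2, -1, 0, 1, 2)  # hand moves: negative = left, positive = right

def env_step(offset: int, windy: bool, action: int) -> tuple[int, int]:
    """Toy dynamics (an assumption): wind pushes the balloon one step right and
    the hand move shifts it back; offsets are clipped to [-3, 3]."""
    nxt = max(-3, min(3, offset + (1 if windy else 0) - action))
    return nxt, -abs(nxt)  # reward is highest (0) when the balloon is balanced

def greedy(q, offset, windy):
    """Best known hand move for this position and context (ties go to the first action)."""
    return max(ACTIONS, key=lambda a: q[(offset, windy, a)])

def train_q(episodes=2000, steps=20, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning in which the context (windy or not) is part of the state."""
    rng = random.Random(seed)
    q = defaultdict(float)  # q[(offset, windy, action)] -> estimated return
    for _ in range(episodes):
        offset, windy = rng.choice(POSITIONS), rng.random() < 0.5
        for _ in range(steps):
            # Epsilon-greedy action selection.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = greedy(q, offset, windy)
            nxt, reward = env_step(offset, windy, action)
            # Q-learning target: reward plus discounted best value of the next state.
            target = reward + gamma * max(q[(nxt, windy, a)] for a in ACTIONS)
            q[(offset, windy, action)] += alpha * (target - q[(offset, windy, action)])
            offset = nxt
    return q
```

After `q = train_q()`, the same balloon position should map to different moves depending on the context: `greedy(q, 0, True)` moves the hand right to cancel the wind, while `greedy(q, 0, False)` keeps it still.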

Finally, our observations look as follows:

  • We are training (we change our parameters) based on the feedback
  • We are using a more complex training engine that looks not only at the current state but also at the context it is in


Looking at the above, we can say that an application can gradually be adapted to incorporate a Reinforcement Learning model. By going through these phases one at a time, we adopt it in a way that lets us understand what is happening, as well as keep control when something goes wrong.