As seen in the previous article, we now know the general concept of Reinforcement Learning. But how do we actually get towards solving our third challenge: "Temporal Credit Assignment"?

To solve this, we first need to introduce a generalization of our reinforcement models. When we look at these models, we can see that we are modeling decision-making situations where the outcomes of these situations are partly random and partly under the control of the decision maker. This is what we call the Markov Decision Process or MDP - we say that it satisfies the Markov Property. But let's go a bit deeper in this.

Markov Decision Processes

Markov Property

If our state representation is as effective as having a full history, then we say that our model fulfills the requirements of the Markov Property.

To illustrate this with an example, think of playing Tic-Tac-Toe. When we are able to take a decision based on the current state, rather than needing to know the whole history, then we say that we satisfy the conditions of the Markov Property.

Or in more general terms:

"The future is independent of the past given the present"

We say that we can go from one Markov State $s$ to the successor state $s'$ by defining the state transition probability, which is defined by $P{ss'} = P[S{t+1} = s' \mid S_t = s]$.

Markov Process or Markov Chain

A Markov Process is a memoryless random process where we take a sequence of random states that fulfill the Markov Property requirements. Or in a definition:

A Markov Process is a tuple $$ where:

  • $S$ is a (finite) set of states
  • $P$ is a state transition probability matrix, $P{ss'} = P[S{t+1} = s' \mid S_t = s]$.

With our $P$ matrix being written as:

$$P = \begin{bmatrix}P{11} & … & P{1n} \ \vdots & … & \vdots \ P{n1} & … & P{nn} \ \end{bmatrix}$$

where each row of the matrix sums to 1.

Let's illustrate this with an example. Let's say that we want to represent weather conditions. How can we predict the weather on the following days?

When we have this transition matrix:

$$P = \begin{bmatrix}0.9 & 0.1 \ 0.5 & 0.5\end{bmatrix}$$

Then we can see that we will have a 90% chance of a sunny day following on a current sunny day and a 50% chance of a rainy day when we currently have a rainy day.

Representing this as a graph results in:


Markov Reward Process (MRP)

To come to the fact of taking decisions, as we do in Reinforcement Learning. We introduce something called "reward". This will help us choose an action, based on the current environment and the reward we will get for it.

The Markov Reward Process is an extension on the original Markov Process, but with adding rewards to it. Written in a definition:

A Markov Reward Process is a tuple $$ where:

  • $S$ is a (finite) set of states
  • $P$ is a state transition probability matrix, $P{ss'} = P[S{t+1} = s' \mid S_t = s]$.
  • $R$ is a reward function, $Rs = E[r{t+1} \mid S_t = s]$
  • $γ$ is a discount factor, $γ \in [0,1]$

Which means that we will add a reward of going to certain states. When we map this on our earlier example:


By adding this reward, we can find an optimal path for a couple of days when we are in the lead of deciding. Let's imagine that we can play god here, what path would you take? Well we would like to try and take the path that stays "sunny" the whole time, but why? Well because that means that we would end up with the highest reward possible.


But how do we calculate the complete return that we will get? Well this is represented by the following formula:

$$Gt = R{t+1} + R{t+2} + … + Rn$$

This however results in a couple of problems:

  • We tend to stop exploring (we choose the option with the highest reward every time)
  • Possibility of infinite returns in a cyclic Markov Process

Which is why we added a new factor called the discount factor. This factor will decrease the reward we get of taking the same action over time. Adding this to our original formula results in:

$$Gt = R{t+1} + γR{t+2} + … + γ^nRn = \sum^{\infty}{k=0}γ^kR{t + k + 1}$$

Markov Decision Process (MDP)

A Markov Decision Process is a Markov reward process with decisions. It is an environment in which all states are Markov.

We can now finalize our definition towards:

A Markov Decision Process is a tuple $$ where:

  • $S$ is a (finite) set of states
  • $A$ is a finite set of actions
  • $P$ is a state transition probability matrix, $P{ss'}^a = P[S{t+1} = s' \mid St = s, At = a]$.
  • $R$ is a reward function, $Rs^a = E[r{t+1} \mid St = s, At = a]$
  • $γ$ is a discount factor, $γ \in [0,1]$