Into Markov Chains

11 May 2016

The last post talked about Markov Chains, so the next step is to use these chains to implement a reward process that is the basis for Reinforcement Learning. The reward process (Specifically the Markov Reward Process) is composed of a state space, a state transition matrix, a reward function, and a discount factor. The discount factor just represents how much importance the agent puts on getting a reward immediately versus getting a reward later. When the discount factor is 0, the agent will do an action that will give it a reward immediately, when the factor is 1, then the agent places an importance on delayed gratification. For example, if someone offers me a cake and my discount factor is 0, I am looking for an instant gratification, so I would eat the cake immediately. But when my discount factor is one, I politely say no to the cake and eat an apple instead. The second decision is better for my health so I was aiming for a long term gratification.

Every state in the state diagram should be labeled with a reward. This reward could be based on a point system. For the state diagram that I made in my last post, I would say that sleeping gives me a reward of 1 point, studying gives me a reward of 10 points, eating gives me a reward of 3 points and hanging out online is -14 points (because hanging out online tends to be a time-waster and it is the most addictive compared to other states on this diagram). The reward and discount factor can be mixed into one equation that keeps track of how much the agent is getting as it moves from one state to another. This equation is called the return. In any Markov Reward Process, we want to maximize our return.

Now that we have defined the return, we can move on to the value function. the value function represents the value of a function after many state transitions. It is dependent on the discount factor. If the discount factor is 0, then the value of that state is the same as the reward of that state since the agent wants an instant reward. But as the discount factor moves away from 0, the state value of a certain state will not be the same as the reward of that state since the long term reward will be of a higher priority to the agent.

You can make a Bellman function using the properties of a value function. The Bellman function states that the value of a state is based on the reward of that state added with the values of the outgoing states multiplied with the probabilities of coming out of your state and going to the next states. Simply put, the Bellman equation is: v = r + d + p * v where v is the value, r is the reward, d is the discount, and p is the transition probability matrix. The Bellman equation is very important because it breaks a big problem into “bite-sized” sub problems that we are capable of solving.

The concept of Reinforcement Learning is thankfully not complicated at all. In my last post, I was wondering how I was going to implement a reward process for my agent. David Silver’s second Reinforcement Learning lecture answered that. He said that all Reinforcement Learning problems can be conceptualized in the form of Markov Chains. An awesome tutorial for Markov chains that I used is here .

A Markov Chain is a process of a state space (just a list of states) that can be drawn as a graph. The process of a Markov Chain is based on a probability of a cursor moving from one state to another. I am familiar with state spaces because I studied Finite State Machines last semester. For my class final project, I created a state machine and I used it to implement a random number generator. The good thing about Markov Chains is that it does not need to remember all of the past events, it only uses that current event to decide the next state. This is called the Markov Property.

The Markov Chain can be modeled in 2 ways, with directed graphs; where vertices represent the state and the edges represent the probabilities, and a transition matrix. The transition matrix is used more often because a Markov Chain can get very complex. These complexities can get so big that it may be impossible or tedious to model a Markov Chain using a graph. This is why a transition matrix is almost always used to model Markov Chains. "Every state in the state space is included once as a row and again as a column. And each cell in the matrix tells you the probability of transitioning from its row state to its column state."- Powell.

I won’t get into the specifics of transition matrix mathematics (because I don't know any of it yet), but I will create transition matrix. This transition matrix models my day to day life in the summer. The main activities that I do in the summer is sleep, eat, study, and hang out online. I am going to use numbers between 0 and 1 to model the probabilities where a 10 percent chance is 0.10 and a 50 percent chance is .50. I created this transition matrix by first creating a state diagram, but I am unable to put a state diagram on here at the moment. My state diagram starts from sleeping and it goes on from there. Every outgoing edge from any one of my states should add up to 1 because all probabilities are supposed to add up to 100 percent.

The transition matrix below conceptualizes a Markov Chain beautifully. It shows all of the transitions from a state to other states, for example the probability that I go from sleeping to sleeping again is 0.1, the probability that I go from eating to hanging out online is .75. The columns of the matrix are supposed to represent the arrows going from each state so each column should add up to 1 because there is 100 percent probability that something will happen no matter what state I am in.

Sleeping Eating Studying Hanging out online
Sleeping 0.1 0.25 0.50 0.075
Eating 0.30 0.05 0.10 0.75
Studying 0.02 0.175 0.30 0.15
Hanging out online 0.67 0.75 0.10 0.70

The next step is figuring out how to model the reward process using Markov Chains.

There are three methods of solving a Markov Decision process, these methods are, Dynamic Programming, Monte Carlo, and Temporal Difference Learning.

Dynamic Programming is a method where you know the full Markov Decision Process. The pros of this method, is that there is a specific algorithm that you can use to find your value function. The way this method is taught is “grid world” This is a grid that contains the value functions and determines the optimal policy that each state will contain until the state terminates. I am still trying to understand this grid world, so I can’t go into the specifics aspects of it.

Reinforcement learning depends on optimizing value functions. There also some situations where you don’t know what your Markov Decision Process is. This is an area that is suited for the Monte Carlo Method. The Monte Carlo Method relies on random sampling and it is also used for numerical integration. This random sampling is akin to the agent trying out everything in its environment and finding out what maximizes it’ s value function. This means that the Monte Carlo method learns from experience.

Temporal Difference Learning is a fundamental concept in reinforcement learning because it combines Dynamic Programming and the Monte Carlo Method. This method, like Monte Carlo also learns from experience, and it doesn’t have a Markov Decision Process. Unlike the Monte Carlo Method, Temporal Difference learning does not need the state to terminate to get the value function. Like Dynamic Programming, Temporal Difference Learning updates the value functions on the fly, the episode does not have to terminate. This is called bootstrapping.

The methods that solve a Markov Decision Process have the same steps. The first step is the policy evaluation that estimates the value function for a policy. The second step is finding the optimal policy, which is improving the policy when you have a value function.


Powell, Victor. "Markov Chains Explained Visually." Explained Visually. N.p., n.d. Web. 11 May 2016.

Silver, David. "Markov Decision Processes." SpringerReference (n.d.): n. pag. 2015. Web. 5 May 2016.

Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. Cambridge, MA: MIT, 1998. Print.