Tabulated Reinforcement Learning


  In general, the main idea behind reinforcement learning is that you have an Agent, Environment, State, Actions, and Rewards.

  In the environment we have an agent. This agent takes in some type of input from the environment, this input is the agents state. Based on its state, the agent takes some action. This action brings the agent to a new state. Depending on if this new state is good or bad, the agent receives a reward/punishment. The diagram below helps to visualize this process:

RL Diagram
Note the order of events written in blue

  To further drive the point home, if you think of yourself as an agent the entire universe is your environment. You have 5 senses, or 5 ways to receive input. At any particular time, your current state is what you are seeing, hearing, smelling, feeling, and tasting. You perform an action based on your current state. These actions eventually lead to either a reward (promotion at work maybe) or a punishment (getting fired). The idea is that you will learn from these punishments and rewards, so that your future actions will lead to better rewards.

  To get some more jargon out of the way, an episode is one full cycle of the agent interacting with the environment. For example, if the environment was tic tac toe an episode would be one game of tic tac toe.

Agent Memory/Value Table

  As the agent interacts with the environment, we can store all the states that the agent has been in and assign a value to each of these states. A state's value is a measure of how many rewards the agent can expect to receive for the rest of the episode (for right now, understanding what this means quantitatively is not important). I will explain how to assign the correct values in a moment, but for now just know that it is possible.

  For example, say our agent is just starting its first game of tic tac toe. The environment is the game tic tac toe, and its current state would be the empty game board. The agent would store this unique state, say s0, in its memory. Say our agent then takes a random action, call it a0. Recall that taking an action brings a new state, let's call it s1. Now say our agent continues play with his opponent and ends up losing the game (it will receive a punishment at this point).

  Since the agent lost, we can say that every state the agent was in during this episode has low value. Now say our agent starts his second game and is back in s0. Now, it can look in its memory and see what happens if the action a0 is taken. It will see that action a0 will lead to state s1, which has low value since last time it was in this state it ended up losing.

  Since the agent is trying to win, it will only pick actions that will lead to states with high value. To start game 2 the agent avoids action a0, since it has low value, and picks another action at random, say a2. Again, the agent continues play; but this time he ends up winning. Since the agent won, we can say that every state the agent was in during episode 2 has high value. Hopefully the figure below is helpful in visualizing this process.

state values
Note the values are actually numbers, we will get into finding the correct numerical value of each state next.

  You might see a problem with what we have so far. What if the action a0 really is the best action to take, and it was another action taken after a0 that caused us to lose that game? As of now, if our agent always avoids a0 it will never see that it is the best action to take. To solve this, we always make our agent have a chance to take a random action. If our agent takes a random action, it has a chance to take a0 and give that action another shot. We call this exploring the state space. We usually have the agent take completely random actions at first, then as time goes on we slowly reduce the chance that the agent takes a random action. An algorithm that does this is discussed in the tic tac toe project linked to at the bottom of this page.

  The agent will continue this process until it knows the best action to take in every state it can possible be in. To summarize, we are literally just telling the agent to try everything, recording everything that happens, and finding which actions lead to the best states (the states that lead to the largest reward).

Finding the value of each state

  To find the correct value of each state, we start by setting the value of each state to 0. Now, a states correct value is equal to the expected total future reward. See the figure below to help let this soak in.


  It is best to start at the last state, since in implementation we start at the end and go back. In S5 we receive a reward of +1. Therefore, the value of S5 is 1. Now S4 does not get a reward, but it leads to S5 which gives a reward of +1. Therefore, S4 has a value of the reward it has received + the value of S5, or a value of 0+1=1. Similarly, V(S3) = R3 + V(S4) = -1+0+1 = 0. In general, we can say;

Value Function
The Value of a state = The reward received in that state + the value of the next state

  It is important to realize that we are sampling over many episodes, so instead of setting the value explicitly to Rt + V(St+1), we turn this into an iterative process and just take a small step in the direction of Rt + V(St+1). This brings us to our value update function:

Value Update Function
This should look a lot like gradient descent. If this function is repeatedly applied, it will make the state values converge to their correct values. If it doesn't make sense as to why we take a small step, keep it in the back of your head and hopefully it will make more sense in implementation.

Theoretical Section Summary

  In summary, the agent begins by playing several episodes. While it is playing these episodes, sometimes it takes a random action to explore the state space, and other times the agent takes whatever action leads to the state with highest value. After each episode, we update the value of each state the agent was in during the episode using the value update function, starting from the last state and moving backwards to the first state. Eventually this will lead us to the correct value for each state. If then the agent only takes actions that will lead to the state with highest reward it will have mastered the game (hopefully).

  Alright, enough with the boring stuff. If all this theory is still a bit fuzzy to you, don't worry; true learning happens when we apply this stuff in the. So, without further ado, let's build an agent that kicks ass in tic tac toe!