How to get a program to play tic tac toe using reinforcement learning
Writing a program that learns to play tic tac toe is a good first step toward understanding how reinforcement learning works. For this project, I assume you have already been introduced to the theory of tabulated reinforcement learning, including what the value of a state is and how to find it. I will go in depth into implementing this theory in code. Once you finish this project, you are only one neural network away from getting an agent to learn to play Atari games!
First, we will need a tic tac toe game to be our environment. I made a base game of tic tac toe here.
I will first show how to enumerate our states, with each state being a specific configuration of the game board. Then there are two functions standing between us and tic tac toe autonomy: the get_action function and the update_values function. Once those are done, you should be able to build two agents and have them play each other until they have both mastered the game of tic tac toe.
Representing states as numbers
As mentioned, we need to store each state the agent encounters in its memory. For tic tac toe, we are defining the current configuration of the game board as the state. In my implementation, the game board is described as a list called 'spot_placeholders'. The problem is that if we store each state, which is itself a list, in another list, we end up with a huge list full of lists. This is very inefficient. Instead of having a list full of lists, let's aim for a list full of integers, with each integer representing a unique state.
To figure out how to represent each state as a number, let's start by manually representing a few states and trying to identify a pattern. Note that in the list 'spot_placeholders', a 1 means there is an X in that spot and a 2 means there is an O in that spot.
- State: [0,0,0,0,0,0,0,0,0] = 0
- State: [1,0,0,0,0,0,0,0,0] = 1
- State: [2,0,0,0,0,0,0,0,0] = 2
- State: [0,1,0,0,0,0,0,0,0] = 3
- State: [1,1,0,0,0,0,0,0,0] = 4
- State: [2,1,0,0,0,0,0,0,0] = 5
- State: [0,2,0,0,0,0,0,0,0] = 6
- State: [1,2,0,0,0,0,0,0,0] = 7
- State: [2,2,0,0,0,0,0,0,0] = 8
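The pattern above is just each board read as a base-3 number, with the first spot as the least significant digit. A minimal sketch of the encoding (the function name is my illustrative choice and may differ from the linked code):

```python
def state_to_number(spot_placeholders):
    """Encode a board (list of 9 ints in {0, 1, 2}) as a unique integer.

    Each spot is one base-3 digit, least significant digit first, so
    every distinct board maps to a distinct integer from 0 to 3**9 - 1.
    """
    state = 0
    for i, spot in enumerate(spot_placeholders):
        state += spot * 3 ** i
    return state

state_to_number([2, 1, 0, 0, 0, 0, 0, 0, 0])  # 2*1 + 1*3 = 5, matching the table
```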
You can find the code for this function here. If this function doesn't make sense to you, don't stress about it. The encoding is very specific to tic tac toe and probably won't be used much in future projects. If you just copy and paste the code without fully understanding how it works, not much harm will be done to your understanding of RL. Just know that you can improve efficiency by enumerating states in some way.
The Epsilon Greedy Algorithm
The epsilon greedy algorithm is how we will decide which action our agent will take. We want the agent to take random actions at first, but once it starts getting the hang of things we want it to play to win. So what we do is set a variable ε that represents the agent's chance of taking a random action, and slowly decrease ε over time. The pseudocode is this:
- Initialize ε = 1
- Generate a random number between 0 and 1
- If this random number is < ε, explore the state space (take a random action).
- If this random number is ≥ ε, perform the action that leads to the state with the highest value.
- Decrease ε by a small amount
ε continues to fall until it reaches some pre-determined minimum. You can think of ε as the probability that the agent takes a random action.
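In Python, the strategy might look something like this (the names possible_actions and value_of, and the decay constants, are illustrative placeholders rather than the exact ones from my code):

```python
import random

def epsilon_greedy(epsilon, possible_actions, value_of):
    """Pick an action: random with probability epsilon, else greedy.

    `possible_actions` is any non-empty list of candidate actions, and
    `value_of` maps an action to the value of the state it leads to.
    """
    if random.random() < epsilon:
        return random.choice(possible_actions)  # explore
    return max(possible_actions, key=value_of)  # exploit

# Decay epsilon after each episode, but never below a floor:
EPSILON_MIN, DECAY = 0.05, 0.001
epsilon = 1.0
epsilon = max(EPSILON_MIN, epsilon - DECAY)
```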
The Get Action Function
Currently, our get_action function asks the user which spot they would like to go in. Since we are making this completely autonomous, you are free to clear out the get_action function and start from scratch. What we want to do is find all the possible next states the agent can be in, then find which of those states has the highest value. We do this by looping through the game board (spot_placeholders); if a spot is open, we record what the new state would be if the agent went in that free spot. We then look up the value of this state in the agent's state value table. If this state has the highest value, we move to the spot that leads to it.
The pseudocode for my implementation of this function is the following:
- Check whose turn it is
- Find all the next possible states that player can be in.
- Perform ε greedy to either take a random action or be greedy (take the best action).
- If greedy, loop through the next possible states and find the state with the highest value.
- Try to perform this action; if the agent doesn't have any of these states in its table, take a random action.
The code for my implementation of this function can be found here. Note that I also added the epsilon attribute to the player class for the ε greedy strategy.
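As a rough sketch, get_action could look like the following. The parameter names (mark, state_values, encode) are my illustrative choices, and instead of falling back to a random action for unseen states, this sketch simply treats an unseen state's value as 0, so it may differ from the linked implementation:

```python
import random

def get_action(spot_placeholders, mark, state_values, epsilon, encode):
    """Choose a spot for the player placing `mark` (1 for X, 2 for O).

    `state_values` is the agent's table mapping encoded states to values;
    `encode` turns a board list into its state number.
    """
    # Find every open spot and the state the board would be in afterwards.
    candidates = []
    for spot, value in enumerate(spot_placeholders):
        if value == 0:
            next_board = list(spot_placeholders)
            next_board[spot] = mark
            candidates.append((spot, encode(next_board)))

    if random.random() < epsilon:
        return random.choice(candidates)[0]  # explore

    # Greedy: pick the spot whose resulting state has the highest known
    # value; states not yet in the table default to a neutral 0.
    return max(candidates, key=lambda c: state_values.get(c[1], 0.0))[0]
```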
The Update Values Function
Currently we can make two agents play each other, but these two agents have no memory whatsoever. It's like every game they play is their first ever game of tic tac toe. So, let's fix that.
What we do is keep a list of all the states visited in an episode. At the end of the episode, we call a function that applies the value update to every state in this list, for each agent. Notice that we only receive a reward at the last state of the episode, so we can simply set the last state's value to the reward. The pseudocode for my implementation is the following:
- Set the value of the last state equal to either 1 or -1
- Append any new states to the players memory
- Apply the update function starting from the second to last state and moving towards the first state
The code to my implementation is here. I added a call to the update function in the play game function here too.
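As a sketch, the backward pass could look like this. The names and the learning rate alpha are illustrative; the update itself is the standard "nudge each state toward its successor" rule, V(s) ← V(s) + α[V(s') − V(s)]:

```python
def update_values(state_history, state_values, reward, alpha=0.1):
    """Back up values through one episode's visited states.

    `state_history` lists the encoded states the agent saw, in order;
    `reward` is +1 for a win and -1 for a loss. Only the final state
    gets the raw reward; every earlier state is pulled a fraction
    `alpha` of the way toward the value of the state that followed it.
    """
    # The last state's value is simply the reward received there.
    state_values[state_history[-1]] = reward

    # Walk from the second-to-last state back to the first.
    for i in range(len(state_history) - 2, -1, -1):
        s, s_next = state_history[i], state_history[i + 1]
        v = state_values.get(s, 0.0)
        state_values[s] = v + alpha * (state_values[s_next] - v)
```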
Putting it all together
We are now able to get two agents to play each other and become tic tac toe pros all on their own. You can find the code for my finished project here. I added a test function so that you can play against the computer. You will find that if you go first, the agent doesn't play very well at all; this is because tic tac toe is heavily biased towards whoever goes first. Adjusting training time, the size of the rewards, and other hyperparameters may help the agent play better when it goes second.
In summary, we are making an agent play the game of tic tac toe thousands of times and recording what happens in every state of the game. We then assign a value to each state and correct this value using the value update function. When the agent is trying to win, it will always move to the state with the highest value. We call this tabulated RL since we keep all the states and their values in a table.
But what if this was a 4x4 or 5x5 or even a 10x10 game of tic tac toe? The number of possible states grows exponentially as the game board grows, so our value table would eventually be far too large to be practical! If only there were a way to just approximate the value of each state. If only there existed a function that approximates the value of each possible action, given the current state...
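To put rough numbers on that exponential growth, here is a quick back-of-the-envelope count (an upper bound, since many configurations are unreachable in real play):

```python
def num_raw_states(n):
    """Upper bound on board configurations for an n-by-n board:
    each of the n*n spots is empty, X, or O, giving 3 ** (n * n)
    raw states. The exponential trend is what matters."""
    return 3 ** (n * n)

# 3x3 already has 19,683 raw states; 4x4 jumps past 43 million.
for n in (3, 4, 5):
    print(n, num_raw_states(n))
```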
Well, as always, there is such a function, and we can learn it using a neural network. So if you are ready to graduate from the land of tabulated RL, come here, where we learn how deep RL works.