The CartPole problem is the Hello World of Reinforcement Learning, originally described in 1985 by Sutton et al. The environment is a pole balanced on a cart. Here I walk through a simple solution using Pytorch. The ipython notebook is up on Github.
The cartpole environment’s state is described by a 4-tuple:
( x position of cart, x velocity of cart, angular position of pole, angular velocity of pole )
At every timestep, the physics simulation is updated. The input is 0 or 1, depending on whether we want to move the cart to the left or to the right.
With the OpenAI Gym environment, we don’t have to code up the physics simulation, it comes out of the box. We just have to:
import gym env = gym.make('CartPole-v1')
-v1 in the environment spec makes each episode run for 500 steps.
CartPole-v0 only runs for 200 steps.
Before we get into neural networks and Reinforcement Learning (RL), let’s play around with the environment to get some intuition. The basic simulation loop is:
state = env.reset() while True: action = select_action(state) state, _, done, _ = env.step(action) env.render() if done: break
env.render() will open a GUI window and show you the cartpole.
The environment will return
done=True if either 500 timesteps have elapsed (episode success) or if the pole has fallen over (angular position of the pole has reached +- 12 degrees) or the cart has left the simulation space (cart position has reached +- 2.4), in which case the episode failed.
To make the above snippet work, we just have to supply a
select_action() function, which given the state, returns what to do: move the cart left or right. Let’s see what happens if we supply a random agent:
def select_action_random(state): if random() < 0.5: return 0 else: return 1
Obviously this will not perform very well, but it’s a start. How about something smarter: if the pole is falling to the right, let’s move the cart to the right to compensate, and vice versa:
def select_action_simple(state): if state < 0: return 0 else: return 1
This should perform better than our random agent, but how much better? Let’s write a simple function which counts, on average, how far our agent survives (out of 500, as a ratio), and returns the average, a number between 0 and 1, 1 being a perfect score:
def goodness_score(select_action, num_episodes=100): num_steps = 500 ts =  for episode in range(num_episodes): state = env.reset() for t in range(1, num_steps+1): action = select_action(state) state, _, done, _ = env.step(action) if done: break ts.append(t) score = sum(ts) / (len(ts)*num_steps) return score
If we run this for the above 2 test functions, like:
We see that the random scores 0.04, and simple 0.08. So the simple is better than random, but still very far from 1. Let’s try something better: how about if we also add the angular velocity:
def select_action_good(state): if state+state < 0: return 0 else: return 1 goodness_score(select_action_good)
Turns out this simple test function solves the problem pretty much perfectly, it gets a score of almost 1.0 (meaning the simulation survives all the way to 500 steps). We got lucky with this ansatz, but there’s no lucky shortcut to solving Starcraft, so let’s move on to Reinforcement Learning.
Essentially we will build a neural network which will try to guess/learn the
select_state() function above. The input state is 4 doubles, and the output is 2 doubles (the probability of going left and right, sum to 1). Since we got lucky above with
select_action_good(), we know a small neural network will do, it just has to learn to add the right components:
class PolicyNN(nn.Module): def __init__(self): super(PolicyNN, self).__init__() self.fc = nn.Linear(4, 2) def forward(self, x): x = self.fc(x) return F.softmax(x, dim=1)
select_action() that goes along with the network is:
def select_action_from_policy(model, state): state = torch.from_numpy(state).float().unsqueeze(0) probs = model(state) m = Categorical(probs) action = m.sample() return action.item(), m.log_prob(action) def select_action_from_policy_best(model, state): state = torch.from_numpy(state).float().unsqueeze(0) probs = model(state) if probs > probs: return 0 else: return 1
select_action_from_policy() runs the network on the state, and returns the left/right output according to the probabilities (and also a probability, see later).
select_action_from_policy_best() can be used after training, it always returns the action with the higher probability.
The next question is, how do we train our network? In supervised learning like MNIST, we train our network on independent samples, and for each sample, we know what the desired response is, and construct a loss function (from some sort of distance-like metric) from that. Here, there is no explicit training data to tell us whether in a given state a right/left prediction was good or bad. But it's actually not that hard to constuct a loss function.
Taking a step back, what is our goal in the CartPole environment? The goal is that the agent should survive as long as possible, 500 steps in this simulation. So as a first idea, we could simply try to use the
(1 - t/500) normalized length of the simulation (how far the agent got) as a loss function. Then we can use gradient descent to search for a minimum of the loss function (maximum length of simulation). Assuming we do a gradient descent after each episode (simulation run), the code would be:
def train_wont_work(num_episodes=1000): num_steps = 500 for episode in range(num_episodes): state = env.reset() for t in range(1, num_steps+1): action = select_action_from_policy(state) state, _, done, _ = env.step(action) if done: break loss = 1.0 - t / num_steps # this doesn't actually work, because # the loss function is not an explicit # function of the model's output; it's # a function of book keeping variables optimizer.zero_grad() loss.backward() # AttributeError: 'float' object has no attribute 'backward' optimizer.step()
However, this won't work. The problem is,
t is a book keeping
int variable, it's not a Pytorch variable. It's not derived from (not a function of) the neural network's output, so we can't take the derivative of it (with respect t o the network's weights)AZ.
But from the above idea, we can get something that works. At each timestep, let's say that the reward is how long the simulation survived after the timestep, multiplied by the network's output that we took (left/right probability). The loss is the negative sum of all rewards. This sounds good, because: (i) the global minimum of the this loss will be when the agent survives all the way, and takes the moves that lead there with 1 probability (ii) it's a function of the network's outputs (the probabilities). In code:
def train_simple(num_episodes=10*1000): num_steps = 500 ts =  for episode in range(num_episodes): state = env.reset() probs =  for t in range(1, num_steps+1): action, prob = select_action_from_policy(model, state) probs.append(prob) state, _, done, _ = env.step(action) if done: break loss = 0 for i, prob in enumerate(probs): loss += -1 * (t - i) * prob print(episode, t, loss.item()) optimizer.zero_grad() loss.backward() optimizer.step() ts.append(t) # check stopping condition: if len(ts) > 10 and sum(ts[-10:])/10.0 >= num_steps * 0.95: print('Stopping training, looks good...') return train_simple()
This works! In a couple hundred episodes, the simulation stops, because the network successfully balances the pole:
print(goodness_score(lambda state: select_action_from_policy_best(model, state))) >>> 1.0
This "derivation" should give an intuitive understanding of RL. There is one bit that we jumped over: we use the log probabilities in the loss function, not the regular probabilities (see last line in
select_action_from_policy()). The training loop doesn't work if we change it to use the regular probabilities. For an explanation, see this derivation of the policy gradient method, which is what we actually implemented here.
Another important note is that here we didn't use a discount factor (usually called
gamma; what we did here is the same as setting
gamma=1, ie. no decay of rewards), and the training loop still converges. Most RL problems use a discount factor, because there is an assumption that whatever action we took at time t influenced what happens after, but as time goes on it becomes less important (the importance decays). Check the official Pytorch CartPole example for an implementation with a discount factor; interestingly, it doesn't seem to have better convergence properties than this naive implementation.