# Solving a Reinforcement Learning Problem Using Cross-Entropy Method

• Date: 2020-06-05 07:31:53

## Agent Creation Using Deep Neural Networks

After a parenthesis of three posts introducing the basics of Deep Learning and PyTorch, in this post we put the focus back on Reinforcement Learning.

In previous posts we advanced that an Agent makes decisions to solve complex decision-making problems under uncertainty. For this purpose the Agent employs a policy, a strategy that determines the next action *a* based on the current state *s*.

Even for fairly simple environments, we can have a variety of policies. Then we need a method to automatically find optimal policies. From this post onwards we will explore different methods to obtain a policy that allows an Agent to make decisions.

In this post we will start with the Cross-Entropy method. Despite its simplicity, this method works well in basic environments and is easy to implement, which makes it an ideal baseline method to try.

## Overview

Remember that a policy, denoted by π(a|s), says which action the Agent should take for every state observed. In this post we will consider that the core of our Agent will be a neural network that produces the policy. We can refer to the methods that solve this type of problem as policy gradient methods, which train the neural network with the goal of maximizing the expected Return (G).

In practice, the policy is usually represented as a probability distribution over actions (that the Agent can take at a given state), which makes it very similar to the classification problem presented before (in the Deep Learning post), with the number of classes being equal to the number of actions we can carry out. In our case the output of our neural network is an action vector that represents a probability distribution:

We refer to it as a stochastic policy, because it returns a probability distribution over actions rather than returning a single deterministic action.
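To make the difference concrete, here is a minimal sketch in NumPy. The `logits` values and the `softmax` helper are made up for illustration and are not taken from the post's code; the point is only that a stochastic policy samples from the distribution, while a deterministic one always picks the same action.

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    z = logits - np.max(logits)      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical raw network outputs for four possible actions.
logits = np.array([0.5, 1.5, 0.1, -0.3])
probs = softmax(logits)

rng = np.random.default_rng(seed=0)
stochastic_action = rng.choice(len(probs), p=probs)   # sampled, can differ between calls
greedy_action = int(np.argmax(probs))                 # deterministic, always the same
```

Sampling keeps some exploration alive: actions with a lower probability are still tried from time to time.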

## How to improve our policy?

We want a policy, that is, a probability distribution, and we initialize it at random. We then improve the policy by playing a few games and adjusting it (the parameters of the neural network) so that it performs better. We repeat this process so that our policy gradually gets better. One algorithm that can be used for this is the Cross-Entropy method.

## Training Dataset

Since we will consider a neural network as the heart of this first Agent, we need to find some way to obtain data that we can treat as a training dataset, which includes input data and their respective labels.

During the Agent’s lifetime, its experience is presented as episodes. Every episode is a sequence of observations of states that the Agent has got from the Environment, actions it has issued, and Rewards for these actions.

Imagine that our Agent has played several such episodes. For every episode, we can calculate the Return (total reward) that the Agent has claimed. Remember that an Agent tries to accumulate as much total Reward as possible by interacting with the Environment.

Again, for simplicity we will use the Frozen-Lake example. To understand what’s going on, we need to look deeper at the Reward structure of the Frozen-Lake Environment. We get the reward of `1.0` only when we reach the goal, and this Reward says nothing about how good each episode was. Was it quick and efficient? Or did we make many rounds on the lake before we randomly stepped into the final cell? We don’t know; it’s just `1.0` reward and that’s it.

Let’s imagine that we already have the Agent programmed and we use it to create 4 episodes, which we will then visualize with the `.render()` method already presented:

Note that due to randomness in the Environment and the way that the Agent selects actions to take, the episodes have different lengths and also show different Rewards. Obviously an episode that has a Reward of `1.0` is better than one that has a reward of `0.0`. What about episodes that end with the same reward?

It is clear that we can consider some episodes “better” than others, e.g. the third is shorter than the second. To capture this, we can use the discount factor gamma = 0.9 presented previously. In this case, the Return (G) for shorter episodes will be higher than the Return for longer ones.
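As a small worked example, the sketch below computes the discounted Return for three hypothetical episodes. The episode lengths are made up, and we assume the convention that the first step is discounted by gamma to the power 0; other indexing conventions shift the values by a constant factor but keep the ordering.

```python
GAMMA = 0.9  # discount factor used in the text

def discounted_return(rewards, gamma=GAMMA):
    """Sum of gamma**t * r_t over the steps of one episode."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Hypothetical episodes: reward 1.0 only on the step that reaches the goal.
short_episode = [0.0] * 5 + [1.0]    # 6 steps
long_episode = [0.0] * 9 + [1.0]     # 10 steps
failed_episode = [0.0] * 4           # never reached the goal

g_short = discounted_return(short_episode)    # 0.9 ** 5
g_long = discounted_return(long_episode)      # 0.9 ** 9
g_failed = discounted_return(failed_episode)  # 0.0
```

Both successful episodes collect the same Reward of `1.0`, but the shorter one ends up with the higher Return, which is exactly the ordering we want.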

Let’s illustrate these four episodes with a diagram where each cell represents the Agent’s step in the episode and its Return:

## Cross-Entropy Algorithm

The core of the Cross-Entropy method is simple: generate episodes, throw away bad episodes and train on better ones. So, a summary of the steps of the method can be described as follows:

1. Play a number of episodes in the Environment using our current Agent model.
2. Calculate the Return for every episode and decide on a return boundary. Usually, we use some percentile of all returns.
3. Throw away all episodes with a return below the return boundary.
4. Train the neural network of the Agent using episode steps (tuples <s,a,r>) from the remaining “elite” episodes, using the states *s* as the input and the issued actions *a* as the label (desired output).
5. Repeat from step 1 until we become satisfied with the result.
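Steps 2 to 4 above can be sketched as a small filtering function. This is only an illustration under our own assumptions: the name `filter_elite`, the 70th percentile, and the made-up batch are hypothetical and not taken from the post's code.

```python
import numpy as np

def filter_elite(episodes, percentile=70):
    """Keep only the steps of episodes whose return reaches the boundary.

    `episodes` is a list of (episode_return, steps) pairs, where `steps`
    is a list of (state, action, reward) tuples.
    """
    returns = [episode_return for episode_return, _ in episodes]
    bound = np.percentile(returns, percentile)    # step 2: return boundary

    train_states, train_actions = [], []
    for episode_return, steps in episodes:
        if episode_return < bound:                # step 3: drop weak episodes
            continue
        for state, action, _reward in steps:      # step 4: states in, actions as labels
            train_states.append(state)
            train_actions.append(action)
    return train_states, train_actions, bound

# Made-up batch of four episodes (returns and steps are illustrative only).
batch = [
    (0.00, [(0, 1, 0.0), (4, 2, 0.0)]),
    (0.00, [(0, 2, 0.0)]),
    (0.59, [(0, 2, 0.0), (1, 1, 0.0), (2, 1, 1.0)]),
    (0.73, [(0, 1, 0.0), (4, 1, 0.0), (8, 2, 1.0)]),
]
states, actions, bound = filter_elite(batch)
```

With this batch only the episode with the highest return survives, so its three states and actions become the training inputs and labels.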

A variant of the method, which we will discuss in the next post, is that we can keep “elite” episodes for a longer time. The default version of the algorithm samples episodes from the Environment, trains on the best ones, and throws them away. However, when the number of successful episodes is small, the “elite” episodes can be maintained longer, keeping them for several iterations to train on them.

## The Environment

The Environment is the source of data from which we are going to create the dataset that will be used to train the neural network of our Agent.

## Episode steps

The Agent will start from a random policy, where the probability of all actions is uniform, and while training, the Agent will hopefully learn from data obtained from the Environment to optimize its policy toward reaching the optimal policy.

The data that comes from the Environment are episode steps that should be expressed with tuples of the form <s,a,r> (state, action and Reward), which are obtained in each timestep as indicated in the following scheme:
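The loop that produces these tuples could look roughly like the sketch below. To keep it runnable without Gym, it uses `ToyCorridor`, a hypothetical stand-in environment that only mimics the `reset()` and `step()` interface used later in this post; `EpisodeStep` and `play_episode` are likewise names chosen for illustration.

```python
import random
from collections import namedtuple

EpisodeStep = namedtuple("EpisodeStep", ["state", "action", "reward"])

class ToyCorridor:
    """Hypothetical stand-in for a Gym environment: walk right from cell 0 to cell 3."""

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # Action 1 moves one cell to the right; any other action stays in place.
        if action == 1:
            self.pos += 1
        done = self.pos == 3
        reward = 1.0 if done else 0.0
        return self.pos, reward, done, {}

def play_episode(env, choose_action, max_steps=100):
    """Record one <s, a, r> tuple per timestep until the episode ends."""
    steps = []
    state = env.reset()
    for _ in range(max_steps):
        action = choose_action(state)
        next_state, reward, done, _ = env.step(action)
        steps.append(EpisodeStep(state, action, reward))
        state = next_state
        if done:
            break
    return steps

random.seed(0)
steps = play_episode(ToyCorridor(), lambda s: random.choice([0, 1]))
```

Each stored tuple pairs the state the Agent saw with the action it took there, which is what later serves as input and label for training.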

## Coding the Environment

Let’s code it. We must first import several packages:

```python
import numpy as np
import torch
import torch.nn as nn
import gym
import gym.spaces
```

We will start by creating the not slippery Environment (in the next post we will discuss more about the slippery version):

```python
env = gym.make('FrozenLake-v0', is_slippery=False)
```

Our state space is discrete, which means that it’s just a number from zero to fifteen inclusive (our current position in the grid). The action space is also discrete, from zero to three.

Our neural network expects a vector of numbers. To get this, we can apply the traditional one-hot encoding of discrete inputs (presented in this previous post), which means that the input to our network will have 16 numbers with zero everywhere except the index that we will encode. To simplify the code, we can use the `ObservationWrapper` class from Gym and implement our `OneHotWrapper` class:

```python
class OneHotWrapper(gym.ObservationWrapper):
    def __init__(self, env):
        super(OneHotWrapper, self).__init__(env)
        self.observation_space = gym.spaces.Box(
            0.0, 1.0, (env.observation_space.n,), dtype=np.float32)

    def observation(self, observation):
        # Start from an all-zeros vector and set the current cell to 1.0
        r = np.copy(self.observation_space.low)
        r[observation] = 1.0
        return r

env = OneHotWrapper(env)
```
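To see what the network will actually receive, here is the same encoding as a standalone NumPy sketch. The helper `one_hot` is ours, written only for illustration, and does not depend on Gym.

```python
import numpy as np

N_STATES = 16  # cells in the 4x4 Frozen-Lake grid

def one_hot(state, n_states=N_STATES):
    """Encode a discrete state index as a float32 vector."""
    vec = np.zeros(n_states, dtype=np.float32)
    vec[state] = 1.0
    return vec

encoded = one_hot(5)   # all zeros except a 1.0 at index 5
```

The original state index can always be recovered with `np.argmax(encoded)`, so no information is lost by the encoding.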

As a summary, we have in `env` an Environment (not slippery Frozen-Lake) that we will use to obtain data to train our Agent.

## The Agent

We have already advanced that our Agent is based on a neural network. Let’s see how to code this neural network and how it is used to perform the selection of actions that an Agent does.

## The model

Our model’s core is a one-hidden-layer neural network with 32 neurons using a Sigmoid activation function. There is nothing special about our neural network. We start with an arbitrary number of layers and number of neurons.

```python
obs_size = env.observation_space.shape[0]   # length of the one-hot vector
n_actions = env.action_space.n              # number of discrete actions
HIDDEN_SIZE = 32

net = nn.Sequential(
    nn.Linear(obs_size, HIDDEN_SIZE),
    nn.Sigmoid(),
    nn.Linear(HIDDEN_SIZE, n_actions)
)
```

The neural network takes a single observation from the Environment as an input vector and outputs a number for every action we can perform, from which we obtain a probability distribution over actions. A straightforward way to proceed would be to include a softmax nonlinearity after the last layer. However, remember from a previous post that we avoid applying softmax inside the network to increase the numerical stability of the training process. Rather than calculating softmax and then calculating the Cross-Entropy loss, in this example we use the PyTorch class `nn.CrossEntropyLoss`, which combines both softmax and Cross-Entropy in a single, more numerically stable expression. `nn.CrossEntropyLoss` requires raw, unnormalized values from the neural network (also called logits).
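The stability argument can be illustrated with a small NumPy sketch. It is not PyTorch's actual implementation, only the general log-sum-exp idea; both function names and the example logits are made up for this illustration.

```python
import numpy as np

def naive_cross_entropy(logits, label):
    """Softmax first, then -log(p[label]); breaks down for large logits."""
    probs = np.exp(logits) / np.sum(np.exp(logits))
    return -np.log(probs[label])

def stable_cross_entropy(logits, label):
    """Same quantity computed directly from the logits via log-sum-exp."""
    shifted = logits - np.max(logits)               # largest entry becomes 0
    log_sum_exp = np.log(np.sum(np.exp(shifted)))
    return -(shifted[label] - log_sum_exp)

small = np.array([1.0, 2.0, 0.5, -1.0])
large = np.array([1000.0, 2000.0, 500.0, -1000.0])

# Silence the overflow warnings that the naive version triggers on purpose.
with np.errstate(over="ignore", invalid="ignore"):
    naive_small = naive_cross_entropy(small, 1)
    naive_large = naive_cross_entropy(large, 1)     # exp(2000) overflows, result is nan

stable_small = stable_cross_entropy(small, 1)
stable_large = stable_cross_entropy(large, 1)       # stays finite
```

For moderate logits both versions agree, but only the version that works directly on the logits survives extreme values, which is why we feed raw logits to the loss.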

## Optimizer and Loss function

Other “hyperparameters”, such as the Loss function and the Optimizer, are also chosen almost arbitrarily for this example:

```python
import torch.optim as optim

objective = nn.CrossEntropyLoss()
optimizer = optim.SGD(params=net.parameters(), lr=0.001)
```

As we will see, the method is robust and converges very quickly, giving us plenty of room to choose the hyperparameters.

## Get an Action

This abstraction makes our Agent very simple: it needs to pass the observation (state) that it receives from the Environment to the neural network model and perform random sampling using the resulting probability distribution to get an action to carry out:

```python
sm = nn.Softmax(dim=1)

def select_action(state):
    state_t = torch.FloatTensor([state])                      # Line 1
    act_probs_t = sm(net(state_t))                            # Line 2
    act_probs = act_probs_t.data.numpy()[0]                   # Line 3
    action = np.random.choice(len(act_probs), p=act_probs)    # Line 4
    return action
```

Line 1: As a first step, this function needs to transform the state into a tensor so that it can be fed to our neural network. At every iteration, we convert our current observation (a NumPy array of 16 positions) to a PyTorch tensor and pass it to the model to obtain action probabilities. Remember that our neural network model needs tensors as input data.

Line 2: As a consequence of using `nn.CrossEntropyLoss`, we need to remember to apply softmax every time we need to get probabilities from our neural network output.

Line 3: We need to convert the output tensor (remember that the model and softmax function return tensors) into a NumPy array. This array will have the same 2D structure as the input, with the batch dimension on axis 0, so we need to get the first batch element to obtain a 1D vector of action probabilities.

Line 4: With the probability distribution of actions, we can obtain the actual action for the current step by sampling this distribution using NumPy’s function `random.choice()`.

## Training the Agent

In the next figure we show a screenshot of the training loop indicating the general steps of the Cross-Entropy algorithm:

In order not to make this post too long, we leave the detailed explanation of this algorithm for the next post. Remember that the entire code of this post can be found on GitHub. For now I simply propose to run the code of this loop and see the results. Just to mention that we consider a Reward of 80% to be a good result.
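Until then, the supervised update at the heart of each iteration can be sketched as follows. This is not the training loop from the repository: the batch of “elite” states and actions is made up, the learning rate is raised to `0.1` so the effect is visible in a few steps, and `train_step` is a name chosen for this illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins with the same shapes as the post's network.
net = nn.Sequential(nn.Linear(16, 32), nn.Sigmoid(), nn.Linear(32, 4))
objective = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

# Made-up "elite" batch: one-hot states 0, 1, 2 and the actions taken there.
elite_states = torch.eye(16)[[0, 1, 2]]    # shape (3, 16)
elite_actions = torch.tensor([2, 2, 1])    # shape (3,), integer class labels

def train_step(states, actions):
    """One supervised update: push the logits toward the elite actions."""
    optimizer.zero_grad()
    logits = net(states)                   # raw scores, no softmax here
    loss = objective(logits, actions)      # softmax + cross-entropy in one call
    loss.backward()
    optimizer.step()
    return loss.item()

losses = [train_step(elite_states, elite_actions) for _ in range(50)]
```

In the full method this update is repeated on a fresh batch of elite episodes at every iteration, rather than on a fixed batch as in this sketch.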

## Test the Agent

In any case, what remains now is to see if the Agent really makes good decisions. To check this, we can create a new Environment (`test_env`), and check if our Agent is able to reach the Goal cell (we will use the `.render()` method in the code to make it more visual):

```python
test_env = OneHotWrapper(gym.make('FrozenLake-v0', is_slippery=False))
state = test_env.reset()
test_env.render()

is_done = False
while not is_done:
    action = select_action(state)
    new_state, reward, is_done, _ = test_env.step(action)
    test_env.render()
    state = new_state

print("reward = ", reward)
```

If we try it several times we will see that it does it well enough:

## What next?

In the next post we will describe in detail the training loop (which we have skipped in this post) as well as see how we can improve the learning of the Agent taking into account a better neural network (with more neurons or different activation functions). Also we will consider the variant of the method that keeps “elite” episodes for several iterations of the training process. See you in the following post.