Indy's Weblog
2016 Oct 02

Reinforcement Learning Basics


These are my notes on the RL course by David Silver.

UCL Course on RL

Reinforcement learning is a general method for making optimal decisions. It appears in various fields of science under different names: in psychology (classical/operant conditioning), in economics (bounded rationality), in mathematics (operations research) and in neuroscience (the reward system).

The reward system (dopamine system) takes up a large part of the human brain.

RL is a trial-and-error paradigm. There is no supervisor, only a reward signal: learning is performed by selecting the actions that maximise the total reward. The reward signal is the feedback, and it may be delayed. RL happens over a sequence of time steps.

The RL algorithm resides inside an agent that interacts with an environment. The agent takes actions depending on the state of the environment, and those actions in turn influence the environment. What the agent perceives therefore depends on the actions it takes. This is in contrast to other machine learning techniques, where the data is static and the algorithm does not interact with the data in a way that influences its state.
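The interaction described above can be sketched as a simple loop. This is only an illustration and not from the course: `CorridorEnv`, `random_agent` and `run_episode` are hypothetical names, and the environment is a made-up five-cell corridor in which the agent is rewarded for reaching the right-hand end.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: cells 0..4, goal at cell 4."""

    def __init__(self):
        self.position = 0

    def step(self, action):
        # action is -1 (left) or +1 (right); the position is clamped to 0..4.
        self.position = max(0, min(4, self.position + action))
        done = self.position == 4
        reward = 1.0 if done else -0.1  # small cost per step, bonus at the goal
        return self.position, reward, done


def random_agent(observation, rng):
    # Ignores the observation and picks an action uniformly at random.
    return rng.choice([-1, 1])


def run_episode(env, agent, rng, max_steps=1000):
    observation, total_reward, steps = env.position, 0.0, 0
    for _ in range(max_steps):
        action = agent(observation, rng)              # the agent acts
        observation, reward, done = env.step(action)  # the environment responds
        total_reward += reward
        steps += 1
        if done:
            break
    return total_reward, steps


total, steps = run_episode(CorridorEnv(), random_agent, random.Random(0))
print(f"episode finished after {steps} steps, total reward {total:.1f}")
```

Note that the observations the agent receives depend entirely on the actions it chose, which is the point made above about the data not being static.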


Examples of RL problems:

  • Control systems, such as an agent manoeuvring a car, helicopter or other physical machine
  • Board games where the current move has consequences (chess, backgammon, Go)
  • Managing investments

These are in contrast to other machine learning tasks such as image classification, where decisions are discrete and independent of previous ones.


RL is based on the reward hypothesis, which says that all goals can be described by maximisation of expected cumulative reward.

A reward Rt is a scalar feedback signal that indicates how well the agent is doing at step t. The agent's goal is to maximise the cumulative reward. Rewards can be positive or negative depending on performance. The agent's actions have long-term consequences, so it may be better to sacrifice short-term gain for a better outcome later (likely to avoid local optima).
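As a small worked example of sacrificing short-term gain, the sketch below compares two made-up reward sequences. The numbers and the names `greedy` and `patient` are assumptions chosen purely for illustration.

```python
def cumulative_reward(rewards):
    """Sum of the scalar rewards received over an episode."""
    return sum(rewards)


# Hypothetical reward sequences for two behaviours over five steps.
greedy = [2, 0, 0, 0, 0]       # grabs a small reward immediately
patient = [-1, -1, 0, 0, 10]   # pays a cost first, larger payoff later

print(cumulative_reward(greedy))   # 2
print(cumulative_reward(patient))  # 8
```

The patient behaviour looks worse after the first two steps but ends with the higher cumulative reward.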

[Figure: reinforcement learning diagram]

The history is composed of the observations, actions and rewards up to the current time.

    History Ht = A1,O1,R1, ..., At,Ot,Rt

For a robot or an embodied agent, this would be the sensorimotor stream. The next state of the whole system depends on this history.

State is a useful summary of the history. If the whole history had to be analysed for every action and reward, the system as a whole would not function effectively, so the state is defined as a function of the history:

    St = f(Ht)
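To make St = f(Ht) concrete, here is a minimal sketch with two possible choices of f. The `Step` record and both function names are hypothetical; many other summaries of the history are possible.

```python
from collections import namedtuple

# One time step of the history: action taken, observation and reward received.
Step = namedtuple("Step", ["action", "observation", "reward"])


def last_observation_state(history):
    """f(Ht) that keeps only the most recent observation."""
    return history[-1].observation


def last_k_observations_state(history, k=2):
    """f(Ht) that keeps the k most recent observations, as a tuple."""
    return tuple(step.observation for step in history[-k:])


history = [
    Step(action="right", observation="cell_1", reward=-0.1),
    Step(action="right", observation="cell_2", reward=-0.1),
    Step(action="left", observation="cell_1", reward=-0.1),
]

print(last_observation_state(history))        # cell_1
print(last_k_observations_state(history, 2))  # ('cell_2', 'cell_1')
```

Both functions discard most of the history, which is what makes the state cheap to work with.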

The environment state is largely not visible to the agent. The agent only gets a constrained window of information about the state of the environment, and this window also depends on the agent's previous and current actions.

Information State (Markov state)

A Markov state contains all useful information from the history. The next state depends only on the current state: the future is independent of the past given the present.

    P[St+1 | St] = P[St+1 | S1, ..., St]
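The Markov property can be illustrated with a small chain in which the prediction for the next state consults only the current state. The two-state weather chain and its probabilities are made up for this sketch.

```python
import random

# Hypothetical two-state chain: P[next state | current state].
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}


def next_state_distribution(history):
    """Distribution over the next state; only the last state is consulted."""
    return TRANSITIONS[history[-1]]


def sample_next_state(history, rng):
    dist = next_state_distribution(history)
    states = list(dist)
    return rng.choices(states, weights=[dist[s] for s in states])[0]


short_history = ["rainy", "sunny"]
long_history = ["sunny", "sunny", "rainy", "rainy", "sunny"]

# Same current state, different pasts: the prediction is identical.
same = next_state_distribution(short_history) == next_state_distribution(long_history)
print(same)  # True
```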

Some systems do not have a Markov state (e.g. quantum systems).

  • Fully observable environment (Markov Decision Process, MDP): the agent directly observes the environment state.

      Agent state = Env state = Information state
  • Partially observable environment (Partially Observable Markov Decision Process, POMDP): the agent observes the environment only indirectly, so agent state != env state.

In the partially observable case the agent must construct its own state, for example by using:

  • The complete history (naive approach)
  • A Bayesian belief state (a probability distribution over which environment state the agent is in)
  • A recurrent neural network
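The Bayesian option in the list above can be sketched as a single belief-update step. This is an illustration under stated assumptions: the door example, the sensor probabilities and the name `update_belief` are all hypothetical, and the transition table stands for the effect of whichever action the agent just took.

```python
def update_belief(belief, transition, observation_prob, observation):
    """One Bayes-filter step: predict with the transition model, then
    reweight by how likely the observation is in each state."""
    states = list(belief)
    # Predict: probability of each next state before seeing the observation.
    predicted = {
        s_next: sum(belief[s] * transition[s][s_next] for s in states)
        for s_next in states
    }
    # Correct: multiply by the observation likelihood and normalise.
    unnormalised = {
        s: predicted[s] * observation_prob[s][observation] for s in states
    }
    total = sum(unnormalised.values())
    return {s: p / total for s, p in unnormalised.items()}


# Hypothetical example: is a door open or closed? The door does not move
# (identity transition), and the sensor is noisy.
belief = {"open": 0.5, "closed": 0.5}
transition = {
    "open": {"open": 1.0, "closed": 0.0},
    "closed": {"open": 0.0, "closed": 1.0},
}
observation_prob = {
    "open": {"see_open": 0.8, "see_closed": 0.2},
    "closed": {"see_open": 0.3, "see_closed": 0.7},
}

belief = update_belief(belief, transition, observation_prob, "see_open")
print({s: round(p, 3) for s, p in belief.items()})
```

After one "see_open" reading the belief shifts towards the door being open, without the agent ever observing the true state directly.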

Components of an RL Agent

  1. Policy - Determines the agent's behaviour. Maps from state to action
  2. Value function - Estimate of how good a state or an action is
  3. Model - The agent's representation of the environment (which could be incomplete, wrong or correct)

Policy

A policy can be deterministic, where the action is predetermined for a given state, or stochastic, where the action is drawn from a probability distribution conditioned on the state.
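The difference between the two kinds of policy can be shown with plain lookup tables. The state names, action names and probabilities below are made up for this sketch.

```python
import random

ACTIONS = ["left", "right"]

# Deterministic policy: one fixed action for each state.
deterministic_policy = {"cell_0": "right", "cell_1": "right", "cell_2": "left"}

# Stochastic policy: a probability for each action in each state.
stochastic_policy = {
    "cell_0": {"left": 0.1, "right": 0.9},
    "cell_1": {"left": 0.5, "right": 0.5},
    "cell_2": {"left": 0.7, "right": 0.3},
}


def act_deterministic(state):
    return deterministic_policy[state]


def act_stochastic(state, rng):
    probs = stochastic_policy[state]
    return rng.choices(ACTIONS, weights=[probs[a] for a in ACTIONS])[0]


rng = random.Random(0)
print(act_deterministic("cell_1"))                       # always "right"
print([act_stochastic("cell_1", rng) for _ in range(5)])  # varies between calls
```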

Value Function

A value function is a prediction of expected future reward, used to estimate how good or bad a state is. Formally, it is the expectation of future rewards conditioned on the current state, with later rewards discounted so that immediate rewards are valued more than later ones.
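A rough sketch of this definition: compute the discounted return of each sampled episode, then average the returns to estimate the value of the state the episodes started from. The discount factor 0.9, the sample episodes and both function names are assumptions for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """R1 + gamma*R2 + gamma^2*R3 + ... for one sampled episode."""
    g = 0.0
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g


def monte_carlo_value(episodes, gamma=0.9):
    """Estimate a state's value as the average return over sampled episodes
    that all start in that state."""
    returns = [discounted_return(rewards, gamma) for rewards in episodes]
    return sum(returns) / len(returns)


# The same reward is worth more when it arrives sooner.
print(f"{discounted_return([1, 0, 0]):.2f}")  # 1.00
print(f"{discounted_return([0, 0, 1]):.2f}")  # 0.81

# Hypothetical reward sequences observed from one start state.
episodes = [[0, 0, 1], [0, 1], [1]]
print(f"{monte_carlo_value(episodes):.2f}")   # 0.90
```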


Model

A model predicts what the environment will do next. The transition model predicts what the next state will be, and the reward model predicts what the next reward will be. There are also model-free systems, i.e. ones in which we do not model the environment.
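One simple way to obtain such a model is to count what happened after each state-action pair. The `TabularModel` class below is a hypothetical sketch of that idea, not a method taken from the course.

```python
from collections import defaultdict


class TabularModel:
    """Hypothetical count-based model of transitions and rewards."""

    def __init__(self):
        self.next_state_counts = defaultdict(lambda: defaultdict(int))
        self.reward_sums = defaultdict(float)
        self.visits = defaultdict(int)

    def update(self, state, action, reward, next_state):
        key = (state, action)
        self.next_state_counts[key][next_state] += 1
        self.reward_sums[key] += reward
        self.visits[key] += 1

    def transition_probs(self, state, action):
        """Estimated distribution over next states (empty if never visited)."""
        key = (state, action)
        n = self.visits[key]
        return {s: c / n for s, c in self.next_state_counts[key].items()}

    def expected_reward(self, state, action):
        """Average observed reward; assumes the pair was visited at least once."""
        key = (state, action)
        return self.reward_sums[key] / self.visits[key]


model = TabularModel()
for _ in range(3):
    model.update("s0", "go", 0.0, "s1")
model.update("s0", "go", 1.0, "s2")

print(model.transition_probs("s0", "go"))  # {'s1': 0.75, 's2': 0.25}
print(model.expected_reward("s0", "go"))   # 0.25
```

With few samples the estimates can be far from the truth, which matches the remark that a model may be incomplete or wrong.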

Taxonomy of RL Agents

Two broad categories

  • (a)
    • Value based - No policy necessary
    • Policy based - No value function necessary
    • Actor-Critic - Combination of the two above, with both a policy and a value function
  • (b)
    • Model free - Policy and/or value function, and no model
    • Model based - Policy and/or value function, with a model

Sub problems within RL

Two fundamental problems in sequential decision making are learning and planning.


Learning

The environment is initially unknown. The agent interacts with the environment to learn about it and improve its policy (behaviour). The agent essentially builds up a model of the environment by interacting with it.


Planning

The environment is fully known and modelled in some manner (e.g. by differential equations). The agent performs computations with the model (e.g. tree search) and improves its policy (behaviour).
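As a sketch of planning with a known model, the code below runs value iteration on a tiny hand-written MDP. Value iteration is used here in place of the tree search mentioned above because it is shorter to write; the MDP, its rewards and the function names are all assumptions.

```python
# Known model (hypothetical): MDP[state][action] is a list of
# (probability, next_state, reward) triples. "end" is terminal.
MDP = {
    "start": {
        "left": [(1.0, "end", 1.0)],    # small reward straight away
        "right": [(1.0, "mid", 0.0)],   # nothing now, more later
    },
    "mid": {"go": [(1.0, "end", 2.0)]},
    "end": {},
}


def action_value(outcomes, values, gamma):
    """Expected reward plus discounted value of the next state."""
    return sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)


def value_iteration(mdp, gamma=0.9, sweeps=50):
    values = {s: 0.0 for s in mdp}
    for _ in range(sweeps):
        for state, actions in mdp.items():
            if actions:  # terminal states keep value 0
                values[state] = max(
                    action_value(outcomes, values, gamma)
                    for outcomes in actions.values()
                )
    return values


def greedy_policy(mdp, values, gamma=0.9):
    return {
        state: max(actions, key=lambda a: action_value(actions[a], values, gamma))
        for state, actions in mdp.items()
        if actions
    }


values = value_iteration(MDP)
policy = greedy_policy(MDP, values)
print(policy)  # {'start': 'right', 'mid': 'go'}
```

No interaction with an environment takes place here: the agent improves its policy purely by computing with the model.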

Exploration vs Exploitation

To optimise the reward, there has to be a balance between exploration (finding out more about the environment) and exploitation (performing actions that are already known to yield reward).
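One common way to strike this balance is epsilon-greedy action selection, sketched below on a multi-armed bandit. The technique is not named in these notes; the arm means, the noise level and the parameter values are assumptions for illustration.

```python
import random


def epsilon_greedy_bandit(true_means, epsilon=0.1, steps=2000, seed=0):
    """Play a bandit with Gaussian rewards; return estimates and pull counts."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    estimates = [0.0] * n_arms
    counts = [0] * n_arms
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                           # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        # Incremental update of the running mean reward for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts


estimates, counts = epsilon_greedy_bandit([0.1, 0.5, 0.9])
print("pulls per arm:", counts)
print("estimated means:", [round(e, 2) for e in estimates])
```

With epsilon set to 0 the agent never explores and can lock onto whichever arm happened to look good first; with epsilon set to 1 it never uses what it has learned.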

Prediction and Control

Prediction is evaluating the future reward given a policy, whereas control is finding the best policy. In RL in general, we have to solve the prediction problem in order to solve the control problem.
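The relationship between the two problems can be sketched by alternating them: evaluate the current policy (prediction), then act greedily with respect to the resulting values (a step towards control). The two-state MDP and all names below are hypothetical.

```python
GAMMA = 0.9

# Hypothetical deterministic MDP: MDP[state][action] = (next_state, reward).
MDP = {
    "a": {"stay": ("a", 0.0), "move": ("b", 1.0)},
    "b": {"stay": ("b", 2.0), "move": ("a", 0.0)},
}


def evaluate(policy, sweeps=200):
    """Prediction: value of each state when following a fixed policy."""
    values = {s: 0.0 for s in MDP}
    for _ in range(sweeps):
        for s in MDP:
            next_state, reward = MDP[s][policy[s]]
            values[s] = reward + GAMMA * values[next_state]
    return values


def improve(values):
    """Towards control: pick the greedy action in each state."""
    return {
        s: max(MDP[s], key=lambda a: MDP[s][a][1] + GAMMA * values[MDP[s][a][0]])
        for s in MDP
    }


policy = {"a": "stay", "b": "stay"}
for _ in range(5):  # alternate prediction and improvement
    values = evaluate(policy)
    policy = improve(values)

print(policy)  # {'a': 'move', 'b': 'stay'}
print({s: round(v, 2) for s, v in values.items()})
```

Each improvement step relies on the values produced by the evaluation step, which is why prediction has to be solved on the way to control.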