# Reinforcement Learning¶

### a brief survey¶

Let's talk about reinforcement learning (RL).

• What is it?
• Why is it important?
• How can I implement it?

# What is AI?¶

A perceiving agent within an environment that takes actions to maximize its chances of achieving a specific set of goals.

### PEAS¶

(via Kai-Zhan Lee)

• Performance measure How well is the agent acting to achieve its goals?
• Environment What is there besides the agent?
• Actuators What actions can the agent perform?
• Sensors What does the agent perceive?

### Scenario¶

You design autonomous cars. You lucked out on your team: now you have to design the entire intelligent agent.

Luckily, you remembered Neil's talk.

• Performance measure How well is the agent acting to achieve its goals?
• Environment What is there besides the agent?
• Actuators What actions can the agent perform?
• Sensors What does the agent perceive?
• Performance measure # accidents; travel time; comfort
• Environment roads, traffic, pedestrians
• Actuators steering wheel, accelerator, brake, signals
• Sensors LIDAR, sonar, RGB camera, GPS, odometry, ECU, ...

# What is RL?¶

### RL vs ML¶

• In machine learning, we're given data with the objective of:
• generalizing trends/responses in existing data to new data, or
• discovering structure in (unlabeled) data
• In reinforcement learning, we're given data with the objective of:
• interacting with the data to accomplish some goal as effectively as possible

### Applications¶

• Play chess (anticipate moves and evaluate desirability of certain positions/moves)
• Robot control (manipulate environment while balancing internal state)
• For all examples: a decision-making agent interacts with its environment
• seeks to achieve some goal despite uncertainty about the environment
• agent affects the future state of the environment, but cannot fully predict that future

## Markov chains¶

http://setosa.io/ev/markov-chains/

How do we model sequences of states?

Markov property. For positive integer $n$, possible states $i_0, \dots, i_n$, and a sequence of states $X_0, \dots, X_m$ $$P(X_n=i_n \, \big{\vert} \, X_{n-1} = i_{n-1}) = P\big{(}X_n=i_n \, \big{\vert} \, X_{n-1}=i_{n-1},\, X_{n-2}=i_{n-2}, \dots, X_0 = i_0 \big{)}$$

That is, to determine the conditional probability of the next state, we depend only on the present state.

Markov chains are finite state machines with probabilistic state transitions.

Say you want to model the way Neil talks.

1. Map each $\textrm{word} \rightarrow \textrm{state}$
2. $P(X_n\,\big{\vert}\,X_{n-1}) = \frac{\textrm{# times sequence }X_{n-1}, X_n\textrm{ appears }}{\textrm{# times }X_{n-1} \textrm{ appears}}$

e.g.

• $P(\texttt{"LinkedIn"}\,\big{\vert}\,\texttt{"on"}) = 0.2$
• $P(\texttt{"calories"}\,\big{\vert}\,\texttt{"count"}) = 0.8$
• $P(\texttt{"mentee"}\,\big{\vert}\,\texttt{"best"}) = 0.6$

Markov decision processes (MDP) are Markov chains with actions and rewards.

• $\mathcal{S}$ : the set of states, with starting states $s_0 \sim p(s_0)$
• $\mathcal{A}$ : the set of actions
• $\mathcal{T}(s_{t+1}\,\big{\vert}\,s_t,a_t)$ : the set of transitions mapping states and their actions at time $t$ to states at time $t+1$
• $\mathcal{R}(s_t,a_t,s_{t+1})$ : the (instantaneous) reward function
• $\gamma \in [0,1]$ : the discount factor, used to weigh immediate rewards more heavily
• $\pi$: the policy mapping $\mathcal{S} \rightarrow p(\mathcal{A}=a\,\big{\vert}\,\mathcal{S})$
• What action should the agent take given an environment in a certain state?
• For some action $a_t \in \mathcal{A}$ and state $s_t \in \mathcal{S}$, $\pi$ is a function: $a_t, s_{t+1} = \pi(s_t)$
• $\mathcal{G}$ : accumulated rewards; our RL objective function
• e.g. $G_t = \mathcal{R}_{t+1} + \mathcal{R}_{t+2} + \dots + \mathcal{R}_{T}$
• e.g. $G_t = \sum_{k=1}^{T}\gamma ^k \mathcal{R}_{t+k}$

### The goal of an RL agent¶

is to find the optimal policy:

$$\pi ^* = \operatorname*{argmax}_\pi \mathbf{E}[\mathcal{G}\big{\vert}\pi]$$

where optimal means we maximize long-term reward ($\mathcal{G}$).

When you have a pset due at 5:00 and it's 4:58
When you have a pset due at 5:00 and it's 4:58

basic policy: $\pi^{\textrm{basic}} =$

1. Lift right foot
2. Extend right foot
3. Lower right foot
4. $\dots$

reward: $R = \begin{cases} +k & \textrm{moving forward} \\ 0 & \textrm{upright} \\ -k & \textrm{fallen} \end{cases}$

### How do we find the optimal policy $\pi^*$?¶

#### Value functions¶

• Estimate the value, or expected return (expected cumulative future reward), of being in a state.
• How good is it for the agent to be in some state?
• State-value function for policy $\pi$: $V^{\pi}(s) = \mathbf{E}[\mathcal{R}\,\big{\vert}\,s,\pi]$
• How good is it for the agent to take some action from some state?
• Action-value function for policy $\pi$: $Q^{\pi}(s,a) = \mathbf{E}[\mathcal{R}\,\big{\vert}\,s,a,\pi]$
• We can relate our value functions:
• $V^{\pi}(s) = \max_a Q^{\pi}(s, a)$

### Putting it all together¶

• The optimal policy returns the actions maximizing expected return. $$\pi ^* = \operatorname*{argmax}_\pi \mathbf{E}[\mathcal{G}\big{\vert}\pi] \\ = \operatorname*{argmax}_a \sum_{t} \mathcal{T}(s_{t+1} \, \big{\vert} \, s_t, a)\,\mathcal{G(t)}$$
• Solve this using policy iteration; iteratively update $Q^*$

$$Q^*(s_t, a_t) \leftarrow Q^{\pi}(s_t, a_t) + \alpha \delta$$

where $\delta$ is error, e.g. an undesirable outcome; then the $Q$-value at state $s_t$ for action $a_t$ is reduced.

### Where does Deep-* fit in?¶

• We don't always have access to a model of the environment ($\mathcal{T}, \mathcal{R}$)
• But we can approximate the future behavior of the environment using neural networks

• Backpropagate error (low rewards) through the network
• More formally, approximate the function $Q^{\pi}$ using a neural network

# Demo¶

## Challenges in RL¶

1. Exploration-exploitation tradeoff. Should we explore new states (and potentially reap more reward), or exploit current knowledge (play it safe?)
• e.g. speed vs. safety for a self-driving car
2. Delayed reward. If we take many actions and only receive a substantial reward later, which action(s) should we attribute the reward to?
• e.g. Chess
3. Generalization. It's impractical to visit every possible state - how do we know which ones will enable maximal reward?

# Moving forward¶

• Learn by practice
• Machine learning reading group