Let's talk about reinforcement learning (`RL`).

**What** is it? **Why** is it important? **How** can I implement it?

*A perceiving agent within an environment that takes actions to maximize its chances of achieving a specific set of goals.*

(via Kai-Zhan Lee)

- **Performance measure**: How well is the agent acting to achieve its goals?
- **Environment**: What is there besides the agent?
- **Actuators**: What actions can the agent perform?
- **Sensors**: What does the agent perceive?

You design autonomous cars. You lucked out on your team: now you have to design the entire intelligent agent.

Luckily, you remembered Neil's talk.


- **Performance measure**: # accidents; travel time; comfort
- **Environment**: roads, traffic, pedestrians
- **Actuators**: steering wheel, accelerator, brake, signals
- **Sensors**: LIDAR, sonar, RGB camera, GPS, odometry, ECU, ...
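For concreteness, here's a minimal sketch of that PEAS description as a plain Python config object; the `PEASSpec` class is a hypothetical name for illustration, not a standard API:

```python
from dataclasses import dataclass

# Hypothetical sketch: a PEAS description captured as a plain config object.
@dataclass
class PEASSpec:
    performance_measures: list
    environment: list
    actuators: list
    sensors: list

car_agent = PEASSpec(
    performance_measures=["# accidents", "travel time", "comfort"],
    environment=["roads", "traffic", "pedestrians"],
    actuators=["steering wheel", "accelerator", "brake", "signals"],
    sensors=["LIDAR", "sonar", "RGB camera", "GPS", "odometry", "ECU"],
)
```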

## `RL` vs `ML`

- In `machine learning`, we're given data with the objective of:
  - generalizing trends/responses in existing data to new data, or
  - discovering structure in (unlabeled) data
- In `reinforcement learning`, we're given data with the objective of **interacting with the data** to accomplish some goal as effectively as possible

- Play chess (anticipate moves and evaluate the desirability of certain positions/moves)
- Robot control (manipulate the environment while balancing internal state)

**For all examples**: a decision-making **agent** interacts with its environment
- The agent seeks to achieve some goal despite uncertainty about the environment
- The **agent** affects the future state of the environment, but cannot fully predict that future

*(see http://setosa.io/ev/markov-chains/)*

How do we model sequences of states?

**Markov property.**
For a positive integer $n$, possible states $i_0, \dots, i_n$, and a sequence of states $X_0, \dots, X_n$:
$$P(X_n = i_n \mid X_{n-1} = i_{n-1}) = P(X_n = i_n \mid X_{n-1} = i_{n-1},\, X_{n-2} = i_{n-2},\, \dots,\, X_0 = i_0)$$

**That is,** the conditional probability of the next state depends only on the present state.

**Markov chains** are finite state machines with probabilistic state transitions.

Say you want to model the way Neil talks.

- Map each $\textrm{word} \rightarrow \textrm{state}$
- Estimate $P(X_n \mid X_{n-1}) = \frac{\textrm{\# times the sequence } X_{n-1}, X_n \textrm{ appears}}{\textrm{\# times } X_{n-1} \textrm{ appears}}$

*e.g.*

- $P(\texttt{"LinkedIn"} \mid \texttt{"on"}) = 0.2$
- $P(\texttt{"calories"} \mid \texttt{"count"}) = 0.8$
- $P(\texttt{"mentee"} \mid \texttt{"best"}) = 0.6$
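A minimal sketch of this counting estimate in Python; the toy corpus below is made up for illustration:

```python
from collections import Counter, defaultdict

def bigram_transition_probs(words):
    """Estimate P(X_n | X_{n-1}) from bigram counts in a word sequence."""
    pair_counts = Counter(zip(words, words[1:]))  # times the sequence (w_prev, w) appears
    word_counts = Counter(words[:-1])             # times w_prev appears as a predecessor
    probs = defaultdict(dict)
    for (prev, curr), n in pair_counts.items():
        probs[prev][curr] = n / word_counts[prev]
    return probs

corpus = "count calories count calories count steps".split()
P = bigram_transition_probs(corpus)
print(P["count"])  # {'calories': 0.667, 'steps': 0.333}
```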

**Markov decision processes** (MDPs) are Markov chains with *actions* and *rewards*.

- $\mathcal{S}$ : the set of states, with *starting states* $s_0 \sim p(s_0)$
- $\mathcal{A}$ : the set of actions
- $\mathcal{T}(s_{t+1} \mid s_t, a_t)$ : the transition function, giving the probability of reaching state $s_{t+1}$ from state $s_t$ under action $a_t$
- $\mathcal{R}(s_t, a_t, s_{t+1})$ : the (instantaneous) reward function
- $\gamma \in [0,1]$ : the discount factor, used to weigh immediate rewards more heavily than distant ones

- $\pi$ : the *policy*, mapping $\mathcal{S} \rightarrow p(\mathcal{A} = a \mid \mathcal{S})$
  - What action should the agent take given an environment in a certain state?
  - For some action $a_t \in \mathcal{A}$ and state $s_t \in \mathcal{S}$, a deterministic policy is a function: $a_t = \pi(s_t)$

- $\mathcal{G}$ : the accumulated reward, or *return*; our `RL` objective function
  - *e.g.* $G_t = \mathcal{R}_{t+1} + \mathcal{R}_{t+2} + \dots + \mathcal{R}_T$
  - *e.g.* $G_t = \sum_{k=1}^{T} \gamma^k \mathcal{R}_{t+k}$
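As a sketch, these components can be written out as plain Python data for a made-up two-state problem (all values are illustrative assumptions):

```python
# A toy MDP (hypothetical values, purely illustrative).
states = ["upright", "fallen"]   # S
actions = ["step", "rest"]       # A
gamma = 0.9                      # discount factor

# T[(s, a)] -> {s': probability}  (transition distribution)
T = {
    ("upright", "step"): {"upright": 0.8, "fallen": 0.2},
    ("upright", "rest"): {"upright": 1.0},
    ("fallen", "step"):  {"fallen": 1.0},
    ("fallen", "rest"):  {"fallen": 1.0},
}

# R[(s, a, s')] -> instantaneous reward
R = {
    ("upright", "step", "upright"): +1.0,  # moved forward
    ("upright", "step", "fallen"):  -1.0,  # fell over
    ("upright", "rest", "upright"):  0.0,  # merely upright
    ("fallen", "step", "fallen"):   -1.0,
    ("fallen", "rest", "fallen"):   -1.0,
}
```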

## Goal of an `RL` agent

The goal is to find the *optimal* policy:

$$\pi^* = \operatorname*{argmax}_\pi \mathbf{E}[\mathcal{G} \mid \pi]$$

where *optimal* means we maximize long-term reward ($\mathcal{G}$).

*basic policy*: $\pi^{\textrm{basic}} = $

- Lift right foot
- Extend right foot
- Lower right foot
- $\dots$

*reward*:
$$R = \begin{cases}
+k & \textrm{moving forward} \\
0 & \textrm{upright} \\
-k & \textrm{fallen}
\end{cases}$$
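A quick sketch connecting this reward to the discounted return $G_t$ defined earlier; the state labels and the value of $k$ are assumptions for illustration:

```python
def reward(state, k=1.0):
    # +k when moving forward, 0 when merely upright, -k when fallen
    return {"moving forward": +k, "upright": 0.0, "fallen": -k}[state]

def discounted_return(rewards, gamma=0.9):
    # G_t = sum_{k=1..T} gamma^k * R_{t+k}, matching the return defined above
    return sum(gamma**i * r for i, r in enumerate(rewards, start=1))

episode = ["moving forward", "moving forward", "fallen"]
print(discounted_return([reward(s) for s in episode]))  # 0.9 + 0.81 - 0.729 = 0.981
```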

- Estimate the value, or expected return (expected cumulative future reward), of being in a state.

*How good is it for the agent to be in some state?*
**State-value function** for policy $\pi$: $V^{\pi}(s) = \mathbf{E}[\mathcal{G} \mid s, \pi]$

*How good is it for the agent to take some action from some state?*
**Action-value function** for policy $\pi$: $Q^{\pi}(s,a) = \mathbf{E}[\mathcal{G} \mid s, a, \pi]$

- We can relate our value functions: under the optimal policy, $V^{*}(s) = \max_a Q^{*}(s, a)$
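In code, this relation is a one-liner over a Q-table (like the one produced by the Q-learning sketch below):

```python
def v_star(Q, state, actions):
    # V*(s) = max over actions a of Q*(s, a)
    return max(Q[(state, a)] for a in actions)
```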

- The optimal policy returns the actions maximizing expected return: $$\pi^* = \operatorname*{argmax}_\pi \mathbf{E}[\mathcal{G} \mid \pi] \quad \Longrightarrow \quad \pi^*(s) = \operatorname*{argmax}_a Q^{*}(s, a)$$

- Solve this iteratively (e.g. via *policy iteration* or temporal-difference methods): repeatedly update our estimate of $Q^*$

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,\delta$$

where $\alpha$ is a learning rate and $\delta$ is the *error* in the current estimate. After an undesirable outcome, $\delta$ is negative, so the $Q$-value at state $s_t$ for action $a_t$ is reduced.
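One concrete instance of this update rule is tabular Q-learning, where $\delta$ is the temporal-difference error. A minimal sketch, assuming a hypothetical Gym-style environment with `reset()`, `step(a)`, and an `actions` list:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning: Q(s,a) <- Q(s,a) + alpha * delta."""
    Q = defaultdict(float)  # Q[(state, action)], initialized to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy: explore with probability epsilon, else exploit
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # temporal-difference error: target minus current estimate
            best_next = max(Q[(s_next, act)] for act in env.actions)
            delta = r + gamma * best_next - Q[(s, a)]
            Q[(s, a)] += alpha * delta  # the update from above
            s = s_next
    return Q
```

The ε-greedy choice here is one simple answer to the exploration-exploitation tradeoff discussed below.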

## Where does `Deep-*` fit in?

- We don't always have access to a model of the environment ($\mathcal{T}$, $\mathcal{R}$)
- But we *can* approximate the future behavior of the environment using neural networks
- More formally: approximate the function $Q^{\pi}$ with a neural network
- Backpropagate the error (low rewards) through the network
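A minimal sketch of such a Q-network in PyTorch; the architecture and sizes are arbitrary assumptions, and a practical deep-RL agent would also need pieces like experience replay and a target network:

```python
import torch
import torch.nn as nn

# A small neural network approximating Q(s, .) -> one value per action.
class QNetwork(nn.Module):
    def __init__(self, state_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)  # Q-values, one per action

q_net = QNetwork()
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# One toy training step: regress Q(s,a) toward the TD target r + gamma * max_a' Q(s',a').
s = torch.randn(1, 4); a = 0; r = 1.0; s_next = torch.randn(1, 4); gamma = 0.9
with torch.no_grad():
    target = r + gamma * q_net(s_next).max()
loss = (q_net(s)[0, a] - target) ** 2  # squared TD error
optimizer.zero_grad()
loss.backward()   # backpropagate the error through the network
optimizer.step()
```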

## Challenges in `RL`

**Exploration-exploitation tradeoff.** Should we explore new states (and potentially reap more reward), or exploit current knowledge (and play it safe)? *e.g.* speed vs. safety for a self-driving car

**Delayed reward.** If we take many actions and only receive a substantial reward later, which action(s) should we attribute the reward to? *e.g.* chess

**Generalization.** It's impractical to visit every possible state; how do we know which ones will enable maximal reward?

- A fantastic book by `Sutton` & `Barto`: *Reinforcement Learning: An Introduction*
- Learn by practice
- Machine learning reading group