Markov Decision Process
- States: $S$, the set of possible states
- (Transition) Model: $T(s, a, s') = \Pr(s' \mid s, a)$
- Actions: $A(s)$, e.g., up, down, left, right
- Reward: $R(s)$, $R(s, a)$, or $R(s, a, s')$; all three forms are mathematically equivalent
- Markovian property: only the present state matters for the transition model.
- Stationary: the transition model (and rewards) do not change over time.
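The components above can be written down directly as data and functions. Below is a minimal Python sketch, assuming the classic 4x3 grid world with an 80/10/10 slip model, a goal and a pit in the rightmost column, and a small step cost; those specific numbers, the cell choices, and all names are illustrative assumptions, not part of the notes.

```python
# A minimal MDP sketch: states S, actions A(s), transition model T(s, a, s'),
# and reward R(s). Grid layout, slip probabilities, and reward values are assumed.
from typing import List, Tuple

State = Tuple[int, int]          # grid cell (row, col)
Action = str                     # "up", "down", "left", "right"

GRID_ROWS, GRID_COLS = 3, 4      # assumed 4x3 grid-world layout
STATES: List[State] = [(r, c) for r in range(GRID_ROWS) for c in range(GRID_COLS)]

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def A(s: State) -> List[Action]:
    """Actions available in state s (the same four moves everywhere here)."""
    return ["up", "down", "left", "right"]

def _step(s: State, a: Action) -> State:
    """Deterministic result of taking a in s; stay put if the move leaves the grid."""
    r, c = s
    dr, dc = MOVES[a]
    r2, c2 = r + dr, c + dc
    return (r2, c2) if 0 <= r2 < GRID_ROWS and 0 <= c2 < GRID_COLS else s

def T(s: State, a: Action, s_next: State) -> float:
    """Transition model Pr(s' | s, a): 0.8 intended move, 0.1 each perpendicular slip."""
    perpendicular = {"up": ("left", "right"), "down": ("left", "right"),
                     "left": ("up", "down"), "right": ("up", "down")}
    prob = 0.0
    if _step(s, a) == s_next:
        prob += 0.8
    for slip in perpendicular[a]:
        if _step(s, slip) == s_next:
            prob += 0.1
    return prob

def R(s: State) -> float:
    """Reward R(s): +1 at the goal, -1 at the pit, small step cost elsewhere (assumed)."""
    if s == (0, 3):
        return 1.0
    if s == (1, 3):
        return -1.0
    return -0.04
```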
Solution => Policy: $\pi(s) \to a$; the optimal policy $\pi^*$ maximizes expected long-term reward (here: up, up, right, right, right)
Why a policy instead of a plan (a fixed trace of actions)? (see the sketch after this list)
- works everywhere: a policy prescribes an action for every state, not just the states along one intended path
- robust against the stochastic transition model: if an action slips, the policy still says what to do from the state actually reached
- Delayed reward: an action's consequences may only show up in the reward many steps later
- Minor changes to the reward function can change the optimal policy => the reward function encodes domain knowledge
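To make the "works everywhere / robust to stochasticity" point concrete, here is a small sketch contrasting a plan with a policy. The one-dimensional corridor, the slip probability of 0.2, and all names are hypothetical assumptions; the point is only that a plan commits to a fixed trace of actions, while a policy keeps choosing based on whatever state the agent actually lands in.

```python
# Plan (fixed action trace) vs. policy (state -> action map) under a stochastic model.
# Hypothetical corridor world; SLIP and the corridor length are assumed values.
import random

N = 6                 # corridor cells 0..5, goal at cell 5
SLIP = 0.2            # with prob 0.2 the agent stays put instead of moving (assumed)

def step(s: int, a: str) -> int:
    """Stochastic transition: the intended move succeeds with prob 1 - SLIP."""
    if random.random() < SLIP:
        return s                                   # slipped: state unchanged
    return min(N - 1, s + 1) if a == "right" else max(0, s - 1)

plan = ["right"] * 5                               # a trace: assumes every step succeeds
policy = {s: "right" for s in range(N)}            # defined for *every* state

def run_plan(start: int = 0) -> int:
    s = start
    for a in plan:                                 # executes a fixed number of actions, then stops
        s = step(s, a)
    return s                                       # may fall short of the goal if any step slipped

def run_policy(start: int = 0, max_steps: int = 50) -> int:
    s = start
    for _ in range(max_steps):
        if s == N - 1:                             # reached the goal
            break
        s = step(s, policy[s])                     # always consults the current state
    return s

if __name__ == "__main__":
    random.seed(0)
    print("plan ends at:  ", run_plan())           # often short of cell 5 because of slips
    print("policy ends at:", run_policy())         # reaches cell 5 given enough steps
```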