Algorithms for Walking, Running, Swimming, Flying, and Manipulation

© Russ Tedrake, 2024



**Note:** These are working notes used for a course being taught
at MIT. They will be updated throughout the Spring 2024 semester. Lecture videos are available on YouTube.


For notational simplicity, let's start the discussion with discrete time, states, actions, and even observations. This allows us to have a lossless representation of the probability masses over these quantities, and to write them as vectors.

A partially-observable Markov decision process (POMDP) is a stochastic
state-space dynamical system with discrete states $X$, actions $U$,
observations $Y$, initial condition probabilities $P_{x_0}(x)$, transition
probabilities $P_x(x'|x,u)$, observation probabilities $P_y(y | x)$, and a
deterministic reward function $R: X \times U \rightarrow \Re$.

An important idea for POMDPs is the notion of a "belief state" of a
POMDP: \[[b_n]_i = \Pr(x_n = x_i | u_0, y_0, ..., u_{n-1}, y_{n-1}).\] Here
$b_n \in \mathcal{B}_n \subset \Delta^{|X|}$, where $\Delta^n \subset \Re^n$
is the probability simplex in $\Re^n$,
and we use $[b_n]_i$ to denote the $i$th element. $b_n$ is a sufficient
statistic for the POMDP, in the sense that it captures all information
about the history that can be used to predict the future state (and
therefore also observations, and rewards) of the process. Using $h_n =
[u_0, y_0, ..., u_{n-1}, y_{n-1}]$ to denote the history of all actions and
observations, we have $$\Pr(x_{n+1} = x | b_n, h_n, u_n) = \Pr(x_{n+1} = x
| b_n, u_n).$$ It is also *sufficient for optimal control*, in the
sense that the optimal policy can be written as $u_n^* = \pi^*(b_n,
n).$
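This sufficiency claim can be checked numerically. The sketch below uses a hypothetical two-state, two-observation POMDP (all matrices invented for illustration, with the convention used in these notes that $y_n$ is an observation of the post-transition state): the distribution over the next state computed by brute-force enumeration over entire state trajectories matches the one computed from the recursively updated belief.

```python
import itertools
import numpy as np

# Hypothetical two-state POMDP with one action (numbers invented).
Px0 = np.array([0.6, 0.4])            # initial state probabilities
Px = np.array([[0.8, 0.3],            # Px[i, j] = P(x' = x_i | x = x_j, u)
               [0.2, 0.7]])
Py = np.array([[0.9, 0.2],            # Py[k, i] = P(y = y_k | x = x_i)
               [0.1, 0.8]])

ys = [0, 1, 0]                        # an arbitrary observation history y_0..y_2

# Recursive belief filter: b_{n+1} proportional to D_{y_n} @ Px @ b_n.
b = Px0.copy()
for y in ys:
    b = np.diag(Py[y]) @ Px @ b
    b /= b.sum()                      # normalizer is Pr(y_n | b_n, u_n)

# Brute force: Pr(x_3, y_0..y_2) by enumerating all state trajectories.
joint = np.zeros(2)
for traj in itertools.product(range(2), repeat=4):   # (x_0, x_1, x_2, x_3)
    p = Px0[traj[0]]
    for t in range(3):
        p *= Px[traj[t + 1], traj[t]] * Py[ys[t], traj[t + 1]]
    joint[traj[3]] += p
brute = joint / joint.sum()           # condition on the observation history

print(np.allclose(b, brute))          # prints True
```

The recursion is just the forward algorithm: it distributes the trajectory sum one time step at a time, so no information relevant to predicting $x_{n+1}$ is lost by compressing the history into $b_n$.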

The belief state is *observable*, with the optimal Bayesian
estimator/observer given in (belief-)state-space form by
\begin{gather*} [b_{n+1}]_i = f_i(b_n, u_n, y_n) = \frac{P_y(y_n | x_i)
\sum_{j=1}^{|X|} P_x(x_i|x_j,u_n) [b_n]_j}{\Pr(y_n | b_n,
u_n)},\end{gather*} where $[b_0]_i = P_{x_0}(x_i)$, and $$\Pr(y_k | b, u) =
\sum_{x' \in X} P_y(y_k | x') \sum_{j=1}^{|X|} P_x(x' | x_j, u) [b]_j = {\bf
P}_y(y_k|x) {\bf P}_{x,u} b,$$ where ${\bf P}_y(y_k|x)$ is the
$|X|$-element row vector of observation probabilities given $y_k$, and
${\bf P}_{x,u}$ is the $|X| \times |X|$ matrix of transition probabilities
given $u$. We can even write $$b_{n+1} = f(b_n, u_n, y_n) = \frac{{\bf
D}_{y_n} {\bf P}_{x,u_n} b_{n}}{\Pr(y_n|b_n, u_n)},$$ where ${\bf D}_{y_k}$
is the diagonal matrix with $P_y(y_k|x_i)$ as the $i$th diagonal entry.
Moreover, we can use this belief update to form the "belief MDP" for
planning by marginalizing over the future observations, giving
\begin{align*} \Pr(b_{n+1}=b|b_n, u_n) = \sum_{\{y | b = f(b_n, u_n,
y)\}} \Pr(y | b_n, u_n) = \sum_{\{k | b=f(b_n, u_n, y_k)\}} {\bf
P}_y(y_k|x) {\bf P}_{x,u_n} b_n.\end{align*} Note that although $b_n$ is represented with
a vector of real values, it actually lives in a finite space,
$\mathcal{B}_n$, which is why we use probability mass and MDP notation
here. It is well known that the optimal policy $\pi^*(b_n, n)$ is the
optimal state feedback for the belief MDP.
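As a concrete sketch of the belief MDP, the snippet below enumerates the successor beliefs from a given belief under a single action: each possible observation $y$ maps $b$ to one successor belief $f(b, u, y)$, reached with probability $\Pr(y | b, u)$. The two-state, two-observation matrices are invented for illustration.

```python
import numpy as np

# Hypothetical two-state, two-observation POMDP with one action.
Px = np.array([[0.8, 0.3],        # Px[i, j] = P(x' = x_i | x = x_j, u)
               [0.2, 0.7]])
Py = np.array([[0.9, 0.2],        # Py[k, i] = P(y = y_k | x = x_i)
               [0.1, 0.8]])

def successor_beliefs(b):
    """Belief-MDP transitions from b under the single action: a list of
    (successor belief, probability) pairs, one per observation."""
    out = []
    for y in range(Py.shape[0]):
        unnormalized = np.diag(Py[y]) @ Px @ b    # D_y @ P_{x,u} @ b
        p_y = unnormalized.sum()                  # Pr(y | b, u)
        out.append((unnormalized / p_y, p_y))
    return out

b = np.array([0.5, 0.5])
for b_next, p in successor_beliefs(b):
    print(b_next, p)
# The observation probabilities marginalize to one, so this is a valid
# (finite-support) transition distribution over beliefs:
print(sum(p for _, p in successor_beliefs(b)))
```

From any reachable belief there are at most $|Y|$ successor beliefs per action, which is why $\mathcal{B}_n$ stays finite even though beliefs are real-valued vectors.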

Information-gathering actions...

It's worth noting that, even for fully-actuated robots, planning in
belief space is *underactuated*. Even in the linear-Gaussian setting,
the dimension of the belief space is the number of means (equal to the
number of deterministic states) plus the number of covariances (equal to
the number of deterministic states squared), minus one if we take into
account the constraint that the probabilities have to integrate to one. No
robot has enough actuators to instantaneously control all of these degrees
of freedom.
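To make that count concrete, here is the arithmetic from the paragraph above as a tiny helper (the function name is ours, not from the text):

```python
def belief_dim(n):
    """Dimension of the Gaussian belief space for n deterministic states:
    n means + n*n covariance entries, minus 1 for the normalization
    constraint that the probabilities integrate to one."""
    return n + n * n - 1

# Even a modest 12-state system yields a belief space far larger than
# any plausible number of actuators:
print(belief_dim(12))   # prints 155
```

The quadratic growth in the covariance terms is what makes the actuator count lose so badly as the state dimension grows.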

A number of other techniques that we have discussed are intimately related to this notion of planning under uncertainty and belief-space, even if that connection is not typically made explicitly. For instance, if we perform policy search in a partially-observable setting, using a dynamic policy, then the state realization in the policy is best thought of as an (approximate) belief state. These days it's also common to learn value functions or Q-functions with internal state; again the interpretation should be that the observations are being collated into an (approximate) belief state.

Let us contrast this, however, with the "student-teacher" architecture that is sometimes used in RL -- where one first trains a state-feedback policy with RL, then uses supervised learning to turn this into an output feedback policy. This optimal state-feedback policy can be fundamentally different -- it will not take any information-gathering actions...

- L. P. Kaelbling, M. L. Littman, and A. R. Cassandra, "Planning and Acting in Partially Observable Stochastic Domains", Artificial Intelligence, vol. 101, January 1998.
