Underactuated Robotics

Algorithms for Walking, Running, Swimming, Flying, and Manipulation

Russ Tedrake

© Russ Tedrake, 2024

Note: These are working notes used for a course being taught at MIT. They will be updated throughout the Spring 2024 semester. Lecture videos are available on YouTube.


Planning Under Uncertainty

Partially-observable Markov decision processes (POMDPs)

For notational simplicity, let's start the discussion with discrete time, states, actions, and even observations. This allows us to have a lossless representation of the probability masses over these quantities, and to write them as vectors.

A partially-observable Markov decision process (POMDP) is a stochastic state-space dynamical system with discrete states $X$, actions $U$, observations $Y$, initial condition probabilities $P_{x_0}(x)$, transition probabilities $P_x(x'|x,u)$, observation probabilities $P_y(y | x)$, and a deterministic reward function $R: X \times U \rightarrow \Re$ [Kaelbling98]. Our goal is to maximize the finite-horizon expected rewards, $E\left[\sum_{n=1}^N R(x_n, u_n)\right].$
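As a concrete (made-up) illustration of these ingredients, here is a minimal numpy sketch of how a tiny discrete POMDP might be stored; the two-state/two-action/two-observation sizes and all of the numbers are placeholders, not an example from the text.

```python
import numpy as np

# A tiny, made-up POMDP with |X| = 2 states, |U| = 2 actions, |Y| = 2 observations.
num_states, num_actions, num_observations = 2, 2, 2

# P_x[u][i, j] = Pr(x' = x_i | x = x_j, u): one |X| x |X| column-stochastic
# transition matrix per action.
P_x = np.array([
    [[0.9, 0.2], [0.1, 0.8]],   # action 0
    [[0.5, 0.5], [0.5, 0.5]],   # action 1
])

# P_y[k, i] = Pr(y = y_k | x = x_i): a |Y| x |X| observation matrix.
P_y = np.array([[0.85, 0.15],
                [0.15, 0.85]])

# R[i, u] = deterministic reward for taking action u in state x_i.
R = np.array([[ 1.0, -0.5],
              [-1.0,  0.5]])

# Initial state distribution P_{x_0}, which will also serve as the initial belief.
b0 = np.array([0.5, 0.5])

# Sanity checks: the columns of each P_x[u] and of P_y sum to one.
assert np.allclose(P_x.sum(axis=1), 1.0) and np.allclose(P_y.sum(axis=0), 1.0)
```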

An important idea for POMDPs is the notion of the "belief state" of a POMDP: \[[b_n]_i = \Pr(x_n = x_i | u_0, y_0, \ldots, u_{n-1}, y_{n-1}).\] Here $b_n \in \mathcal{B}_n \subset \Delta^{|X|}$, where $\Delta^n \subset \Re^n$ is the $n$-dimensional simplex and we use $[b_n]_i$ to denote the $i$th element. $b_n$ is a sufficient statistic for the POMDP, in the sense that it captures all information about the history that can be used to predict the future state (and therefore also the observations and rewards) of the process. Using $h_n = [u_0, y_0, \ldots, u_{n-1}, y_{n-1}]$ to denote the history of all actions and observations, we have $$\Pr(x_{n+1} = x | b_n, h_n, u_n) = \Pr(x_{n+1} = x | b_n, u_n).$$ It is also sufficient for optimal control, in the sense that the optimal policy can be written as $u_n^* = \pi^*(b_n, n).$

The belief state is observable, with the optimal Bayesian estimator/observer given in (belief-)state-space form by \begin{gather*} [b_{n+1}]_i = f_i(b_n, u_n, y_n) = \frac{P_y(y_n | x_i) \sum_{j=1}^{|X|} P_x(x_i|x_j,u_n) [b_n]_j}{\Pr(y_n | b_n, u_n)},\end{gather*} initialized with $[b_0]_i = P_{x_0}(x_i)$, where $$\Pr(y_j | b, u) = \sum_{x' \in X} P_y(y_j | x') \sum_{i=1}^{|X|} P_x(x' | x_i, u) [b]_i = {\bf P}_y(y_j|x) {\bf P}_{x,u} b,$$ with ${\bf P}_y(y_j|x)$ the $|X|$-element row vector of observation probabilities and ${\bf P}_{x,u}$ the $|X| \times |X|$ matrix of transition probabilities given $u$. We can even write $$b_{n+1} = f(b_n, u_n, y_n) = \frac{{\bf D}_n {\bf P}_{x,u_n} b_{n}}{\Pr(y_n|b_n, u_n)},$$ where ${\bf D}_j$ is the diagonal matrix with $P_y(y_j|x_i)$ as its $i$th diagonal entry, and ${\bf D}_n$ denotes the matrix corresponding to the observation $y_n$. Moreover, we can use this belief update to form the "belief MDP" for planning by marginalizing over the future observations, giving the transition probabilities \begin{align*} \Pr(b_{n+1}=b|b_n, u_n) = \sum_{\{y \,|\, b = f(b_n, u_n, y)\}} \Pr(y | b_n, u_n) = \sum_{\{j \,|\, b=f(b_n, u_n, y_j)\}} {\bf P}_y(y_j|x) {\bf P}_{x,u_n} b_n.\end{align*} Note that although $b_n$ is represented with a vector of real values, it actually lives in a finite space, $\mathcal{B}_n$, which is why we use probability mass and MDP notation here. It is well known that the optimal policy $\pi^*(b_n, n)$ is the optimal state feedback for the belief MDP.
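To make the update concrete, here is a minimal numpy sketch of one step of this Bayesian filter, $b_{n+1} = {\bf D}_n {\bf P}_{x,u_n} b_n / \Pr(y_n|b_n, u_n)$; the function name `belief_update` and the small matrices are placeholders chosen for illustration (same conventions as the sketch above).

```python
import numpy as np

def belief_update(b, u, y, P_x, P_y):
    """One step of the Bayesian filter: b' = D_y P_{x,u} b / Pr(y | b, u).

    b   : length-|X| belief vector (sums to one).
    u   : action index, selecting the transition matrix P_x[u].
    y   : observation index, selecting the row P_y[y, :].
    P_x : array of shape (|U|, |X|, |X|) with P_x[u][i, j] = Pr(x'=x_i | x_j, u).
    P_y : array of shape (|Y|, |X|) with P_y[k, i] = Pr(y_k | x_i).
    """
    predicted = P_x[u] @ b             # prediction step: P_{x,u} b
    unnormalized = P_y[y] * predicted  # measurement update: D_y P_{x,u} b
    prob_y = unnormalized.sum()        # Pr(y | b, u), the normalizer
    return unnormalized / prob_y, prob_y


# Example with made-up numbers.
P_x = np.array([[[0.9, 0.2], [0.1, 0.8]],
                [[0.5, 0.5], [0.5, 0.5]]])
P_y = np.array([[0.85, 0.15],
                [0.15, 0.85]])
b = np.array([0.5, 0.5])
b, prob_y = belief_update(b, u=0, y=1, P_x=P_x, P_y=P_y)
print(b, prob_y)   # updated belief and the likelihood of the observation
```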

Cheese Maze

Make a simple visualization where one can walk through the maze and watch the belief states update. Could be fun to have them do it again, without the visualization of the map (only the belief states).
There might be a few more useful thoughts in my "model-reduction for pomdps" draft: https://www.overleaf.com/project/5fbc14452ec47212266a4de5

Continuous states, actions, and observations

Light-dark domain

Finite-parameterizations of distributions

Multi-modal Gaussians, particles, histograms, deep nets.
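As one sketch of a sample-based parameterization (the dynamics, noise models, and particle count below are all made up for illustration), a particle filter approximates the belief with a set of samples that are propagated through the transition model and then reweighted and resampled using the observation likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)

def particle_belief_update(particles, u, y, dynamics, obs_likelihood):
    """Approximate belief update with N samples (one particle-filter step).

    particles      : (N, nx) array of state samples approximating b_n.
    dynamics       : function (x, u, rng) -> x', a sample from P_x(x'|x,u).
    obs_likelihood : function (y, x) -> P_y(y|x), evaluated per particle.
    """
    # Prediction: push every particle through the stochastic dynamics.
    predicted = np.array([dynamics(x, u, rng) for x in particles])
    # Measurement update: weight by the observation likelihood, then resample.
    weights = np.array([obs_likelihood(y, x) for x in predicted])
    weights /= weights.sum()
    idx = rng.choice(len(predicted), size=len(predicted), p=weights)
    return predicted[idx]

# Made-up scalar example: a noisy single integrator with Gaussian observations.
dynamics = lambda x, u, rng: x + u + 0.1 * rng.standard_normal(x.shape)
obs_likelihood = lambda y, x: np.exp(-0.5 * np.sum((y - x) ** 2) / 0.2 ** 2)

particles = rng.standard_normal((500, 1))   # samples from the prior belief
particles = particle_belief_update(particles, u=0.1, y=np.array([0.3]),
                                   dynamics=dynamics,
                                   obs_likelihood=obs_likelihood)
print(particles.mean(axis=0), particles.std(axis=0))
```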

Why plan in belief space?

Information-gathering actions...

It's worth noting that, even for fully-actuated robots, planning in belief space is underactuated. Even in the linear-Gaussian setting, the dimension of the belief space is the number of means (equal to the number of deterministic states) plus the number of covariances (equal to the number of deterministic states squared), minus one if we take into account the constraint that the probabilities have to integrate to one. No robot has enough actuators to instantaneously control all of these degrees of freedom.
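A quick, made-up count makes the mismatch explicit (the state and actuator dimensions below are arbitrary):

```python
# Made-up example: a robot with nx deterministic states and nu actuators.
nx, nu = 12, 6

# A Gaussian belief carries a mean (nx numbers) plus a covariance matrix
# (nx*nx entries, or nx*(nx+1)//2 independent ones if we exploit symmetry).
belief_dim_full = nx + nx * nx              # 156
belief_dim_sym = nx + nx * (nx + 1) // 2    # 90
print(belief_dim_full, belief_dim_sym, "belief coordinates vs", nu, "actuators")
```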

Trajectory optimization

SLAG iLQG (following LQG from the Output Feedback chapter), iLEG (papers by Buchli, then Righetti)

Sampling-based planning

Learned state representations

A number of other techniques that we have discussed are intimately related to this notion of planning under uncertainty and belief-space, even if that connection is not typically made explicitly. For instance, if we perform policy search in a partially-observable setting, using a dynamic policy, then the state realization in the policy is best thought of as an (approximate) belief state. These days it's also common to learn value functions or Q-functions with internal state; again the interpretation should be that the observations are being collated into an (approximate) belief state.
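As a sketch of this interpretation (the architecture, sizes, and random weights below are made up), a dynamic output-feedback policy carries an internal state that is updated from the stream of actions and observations, in direct analogy to the belief update $b_{n+1} = f(b_n, u_n, y_n)$:

```python
import numpy as np

class RecurrentPolicy:
    """A minimal dynamic (output-feedback) policy: the hidden state h is
    updated from the observation stream and is best thought of as an
    approximate belief state."""

    def __init__(self, ny, nu, nh, rng=np.random.default_rng(0)):
        self.W_h = 0.1 * rng.standard_normal((nh, nh))
        self.W_y = 0.1 * rng.standard_normal((nh, ny))
        self.W_u = 0.1 * rng.standard_normal((nh, nu))
        self.K = 0.1 * rng.standard_normal((nu, nh))
        self.h = np.zeros(nh)

    def step(self, y, u_prev):
        # Internal-state update: the analog of the belief update b' = f(b, u, y).
        self.h = np.tanh(self.W_h @ self.h + self.W_y @ y + self.W_u @ u_prev)
        # The action is a function of the internal (approximate belief) state.
        return self.K @ self.h


policy = RecurrentPolicy(ny=3, nu=2, nh=16)
u = np.zeros(2)
for y in np.random.default_rng(1).standard_normal((5, 3)):   # fake observations
    u = policy.step(y, u)
print(u)
```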

Let us contrast this, however, with the "student-teacher" architecture that is sometimes used in RL -- where one first trains a state-feedback policy with RL, then uses supervised learning to turn it into an output-feedback policy. The resulting policy can be fundamentally different from the optimal output-feedback policy -- in particular, it will not take any information-gathering actions.
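Here is a minimal, made-up sketch of that student-teacher pipeline (the stand-in "teacher", the noisy observation model, and the least-squares student are all placeholders): the student is fit by regression to the teacher's state-feedback actions from observations alone, so nothing in its training signal rewards acting to reduce uncertainty.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1) A teacher policy u = pi_star(x) with access to the full state x
#    (here just a stand-in linear rule, not the result of RL).
nx, ny = 4, 4
pi_star = lambda x: -x.sum(keepdims=True)

# 2) A student that only sees a noisy observation y, fit by supervised
#    regression (behavior cloning) to imitate the teacher's actions.
X = rng.standard_normal((1000, nx))            # states visited by the teacher
Y = X + 0.3 * rng.standard_normal((1000, ny))  # partial/noisy observations
U = np.array([pi_star(x) for x in X])          # teacher action labels

# Least-squares student: a static map from the current observation to the action.
W, *_ = np.linalg.lstsq(Y, U, rcond=None)

# At best, the student recovers the teacher's action averaged over the states
# consistent with y; it never learns to act so as to gather more information.
print(np.abs(Y @ W - U).mean())
```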

AIS

References

  1. Leslie Pack Kaelbling, Michael L. Littman, and Anthony R. Cassandra, "Planning and Acting in Partially Observable Stochastic Domains", Artificial Intelligence, vol. 101, 1998.
