Algorithms for Walking, Running, Swimming, Flying, and Manipulation

© Russ Tedrake, 2023


**Note:** These are working notes used for a course being taught
at MIT. They will be updated throughout the Spring 2023 semester. Lecture videos are available on YouTube.


My goal of presenting a relatively consumable survey of a few of the main ideas is perhaps more important in this chapter than any other. It's been said that "robust control is encrypted" (as in you need to know the secret code to get in). The culture in the robust control community has been to leverage high-powered mathematics, sometimes at the cost of offering simpler explanations. This is unfortunate, I think, because robotics and machine learning would benefit from a richer connection to these tools, and are perhaps destined to reinvent many of them.

The classic reference for robust control is

So far in the notes, we have concerned ourselves primarily with known,
deterministic systems. In the stochastic
systems chapter, we started our study of nonlinear dynamics of
stochastic systems, which can be beautiful! In this chapter we will begin
to consider computational tools for analysis and control of those systems.
Stochasticity can come in many forms... we may not know the governing
equations (e.g. the coefficient of friction in the joints), our robot may
be walking on unknown terrain,
subject to unknown disturbances, or even be picking up unknown objects.
There are a number of mathematical frameworks for considering this
uncertainty; for our purposes, this chapter will generalize our thinking
to equations of the form: $$\dot\bx = {\bf f}(\bx, \bu, \bw, t) \qquad
\text{or} \qquad \bx[n+1] = {\bf f}(\bx[n], \bu[n], \bw[n], n),$$ where
$\bw$ is a new *random* input signal to the equations capturing all of
this potential variability. Although it is certainly possible to work in
continuous time, and treat $\bw(t)$ as a continuous-time random signal
(cf. Wiener
process), the notation and intuition are a bit simpler when we work with
$\bw[n]$ as a discrete-time random signal. For this reason, we'll devote
our attention in this chapter to the discrete-time systems.

In order to simulate equations of this form, or to design controllers
against them, we need to define the random process that generates $\bw[n]$.
It is typical to assume the values $\bw[n]$ are independent and identically
distributed (i.i.d.), meaning that $\bw[i]$ and $\bw[j]$ are independent
(and therefore uncorrelated) when $i \neq j$, and are drawn from the same
distribution. As a result, we typically define our distribution via a
probability density $\bw[n] \sim p_\bw(\bw).$
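To make the discrete-time form concrete, here is a minimal simulation sketch. The function `rollout`, the example dynamics `f`, the feedback `policy`, and the noise level are all hypothetical choices made for illustration; the only thing taken from the text is the update $\bx[n+1] = {\bf f}(\bx[n], \bu[n], \bw[n], n)$ with a fresh i.i.d. draw of $\bw[n]$ at every step.

```python
import random


def rollout(f, policy, x0, sample_w, num_steps):
    """Simulate x[n+1] = f(x[n], u[n], w[n], n) and return the state trajectory."""
    xs = [x0]
    for n in range(num_steps):
        u = policy(xs[-1], n)
        w = sample_w()  # a fresh, independent draw at every step (i.i.d.)
        xs.append(f(xs[-1], u, w, n))
    return xs


# Hypothetical example system: a scalar plant with an additive disturbance.
def f(x, u, w, n):
    return 0.9 * x + 0.1 * u + w


def policy(x, n):
    return -x  # simple proportional feedback, chosen only for illustration


rng = random.Random(0)  # seeded so that the rollout is reproducible
trajectory = rollout(
    f, policy, x0=1.0, sample_w=lambda: rng.gauss(0.0, 0.05), num_steps=50
)
print(len(trajectory), trajectory[-1])
```

Passing the sampler in as a function keeps the choice of $p_\bw$ separate from the dynamics, so the same rollout can be reused with a different disturbance model.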

This modeling framework is rich enough for us to convey the key ideas;
but it is not quite sufficient for all of the systems I am interested in.
In some cases, even the *dimension of the state vector* may change in different
realizations of the problem! Consider, for instance, the case of a robot
manipulating random numbers of dishes in a sink. I do not know many control
formulations that handle this type of randomness well, and I consider this
a top priority to think more about! (We'll begin to address it in the output feedback chapter.)

Given a stochastic model, what sort of cost function should we write in order to capture the desired aspect of performance? There are a number of natural choices, which provide nice trade-offs between capturing the desired phenomena and computational tractability.

Remember that $\bx[n]$ is now a random variable, so we want to write our cost and constraints using its distribution, described e.g. by $p_n(\bx)$. Broadly speaking, one might specify a cost in a handful of ways:

- **Average cost**, $E[ \sum_n \ell(\bx[n], \bu[n]) ]$, or some time-averaged/discounted variant. This is by far the most popular choice (especially in reinforcement learning), because it preserves the dynamic programming recursion for additive cost.
- **Worst-case cost**, e.g. $\min_{\bu[\cdot]}\max_{\bx[0], \bw[\cdot]} \sum_n \ell(\bx[n], \bu[n]).$ When we are able to upper-bound this cost and push down on that bound, this is sometimes called *guaranteed cost control*.
- **Relative worst-case** (aka "gain bounds"). Even for linear systems, bounding the absolute performance (for all $\bw$) leads to vacuous bounds; we can do much better by evaluating our performance metric relative to the magnitude of the disturbance input.
- **Value at Risk (VaR)** and other related metrics from economics and operations research. $\text{VaR}(\alpha)$ is the maximum cost if we ignore the tails of the distribution (worse outcomes whose combined probability is at most $\alpha$). Note that VaR is easy to understand, but the related quantity "conditional value at risk" (CVaR) has superior mathematical properties, including connections to convex optimization Majumdar20.
- **Regret** formulations try to minimize the difference between the designed controller and e.g. an optimal controller in some class that has access to privileged information, such as knowing a priori the disturbances $\bw[n].$ These formulations have become popular in the field of online optimization, and we are now starting to see "regret bounds" for control.

Constraints can be specified in similar ways:

- **Chance constraints**, which take the form e.g. $\Pr[ g(\bx) > 0 ] \le \alpha.$ For instance, we might like to guarantee that the probability that our airplane will crash into a tree is less than $\alpha$. Even for Gaussian uncertainty and polytopic constraints, these quantities can be hard to evaluate exactly; we often end up making approximations Scher22, though there is some hope to provide rigorous bounds with Lyapunov-like arguments Steinhardt11a.
- **Worst-case constraints**, which are the limit of the chance constraints when we take $\alpha \rightarrow 0$, tend to be much more amenable to computation, and connect directly to reachability analysis.
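Several of these quantities are easy to estimate from samples, which can help build intuition for how they differ. The sketch below is a sample-based illustration only, not a recommended estimator: `risk_metrics` and `chance_violation` are hypothetical helper names, and the convention used for the empirical VaR (drop the worst $\lfloor \alpha n \rfloor$ samples and report the largest remaining cost) is one of several possible conventions.

```python
import math


def risk_metrics(costs, alpha):
    """Empirical average, worst case, VaR(alpha), and CVaR(alpha) of sampled costs."""
    ordered = sorted(costs)
    n = len(ordered)
    # Number of worst samples whose combined empirical probability is at most
    # alpha (capped so that at least one sample always remains).
    k = min(math.floor(alpha * n), n - 1)
    average = sum(ordered) / n
    worst = ordered[-1]
    # VaR: the largest cost once those k tail samples are ignored.
    var = ordered[n - k - 1]
    # CVaR: the mean of the ignored tail (the worst case if the tail is empty).
    cvar = sum(ordered[n - k:]) / k if k > 0 else worst
    return average, worst, var, cvar


def chance_violation(samples, g):
    """Monte Carlo estimate of Pr[g(x) > 0]."""
    return sum(1 for x in samples if g(x) > 0) / len(samples)


costs = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
average, worst, var, cvar = risk_metrics(costs, alpha=0.25)
print(average, worst, var, cvar)  # prints 3.875 9.0 5.0 7.5
```

In this small data set the single outlier (9.0) sets the worst case, is ignored entirely by VaR, and still influences CVaR, which averages over the tail.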

The term "robust control" is typically associated with the class of
techniques that try to guarantee some *worst-case* performance or a
worst-case bound (e.g. the gain bounds). The term "stochastic optimal
control" or "stochastic control" is typically used as a catch-all for other
methods that reason about stochasticity, without necessarily providing the
strict robustness guarantees.

It's important to realize that proponents of robust control are not
necessarily pessimistic by nature; there is a philosophical stance about
how difficult it can be to model the true distribution of randomness that a
control system will face in the world. Worst-case analysis typically
requires only a definition of the *set* of possible values that the
random variable can take, and not a detailed *distribution* over those
possible values. It may be more practical to operationalize a concept like
"this UAV will not crash so long as the peak wind gusts are below 35 mph"
than to require a detailed distribution over wind gusts. But
this simpler specification does come at some cost -- worst-case
certificates are often pessimistic about the true performance of a system,
and optimizing worst-case performance can lead to conservatism in the
control design.

The Bellman equation.

Discounted cost. Infinite-horizon average cost.

We already had a quick preview of stochastic optimal control in one of the cases where it is particularly easy: finite Markov Decision Processes (MDPs).

Let's consider a stochastic extension of the (discrete-time) LQR problem, where the system is now subjected to additive Gaussian white noise: \begin{gather*} \bx[n+1] = \bA\bx[n] + \bB\bu[n] + \bw[n],\\ E\left[\bw[i]\right] = 0, \quad E\left[ \bw[i]\bw^T[j] \right] = \delta_{ij}{\bf \Sigma_w},\end{gather*} where $\delta_{ij}$ is one if $i=j$ and zero otherwise, and ${\bf \Sigma_w}$ is the covariance matrix of the disturbance, and we take the average cost: $$\min E\left[ \sum_{n=0}^\infty \bx^T[n]\bQ\bx[n] + \bu^T[n] \bR\bu[n] \right], \qquad \bQ=\bQ^T \succeq 0, \bR = \bR^T \succ 0.$$ Note that we saw one version of this already when we discussed policy search for LQR.

In the standard LQR derivation, we started by assuming a quadratic form for the cost-to-go function. Is that a suitable starting place for this stochastic version? We've already studied the dynamics of a linear system subjected to Gaussian noise, and learned that (at best) we should expect it to have a Gaussian stationary distribution. But that sounds problematic, no? If the controller cannot drive the system to zero and keep it at zero in order to stop accruing cost, then won't the infinite-horizon cost be infinite (regardless of the controller)?

Yes. The cost-to-go for this problem is infinite! But it turns out that it still has enough structure for us to work with. In particular, let's "guess" a cost-to-go function of the form: $$J_n(\bx) = \bx^T {\bf S}_n \bx + c_n,$$ where $c_n$ is a (scalar) constant. Now we can write the Bellman equation and do some algebra: \begin{align*} J_n(\bx) &= \min_\bu E_\bw\left[\bx^T \bQ \bx + \bu^T\bR\bu + [\bA\bx + \bB\bu + \bw]^T{\bf S}_{n+1}[\bA\bx + \bB\bu + \bw] + c_{n+1}\right] \\ & \begin{split}= \min_\bu &[\bx^T \bQ \bx + \bu^T\bR\bu + [\bA\bx + \bB\bu]^T{\bf S}_{n+1}[\bA\bx + \bB\bu] + c_{n+1} \\ &+ E_\bw\left[ [\bA\bx + \bB\bu]^T{\bf S}_{n+1} \bw + \bw^T{\bf S}_{n+1}[\bA\bx + \bB\bu]\right] + E_\bw[\bw^T {\bf S}_{n+1} \bw]]\end{split} \\ &= \min_\bu [\bx^T \bQ \bx + \bu^T\bR\bu + [\bA\bx + \bB\bu]^T{\bf S}_{n+1}[\bA\bx + \bB\bu] + c_{n+1} + \tr({\bf S}_{n+1}{\bf \Sigma_w})]. \end{align*} The second line follows by simply pulling all of the deterministic terms outside of the expected value. The third line follows by observing that $\bx$ and $\bu$ are uncorrelated with $\bw,$ and $E[\bw] = 0$, so those cross terms are zero in expectation, and that $E_\bw[\bw^T {\bf S}_{n+1} \bw] = \tr({\bf S}_{n+1} E_\bw[\bw\bw^T]) = \tr({\bf S}_{n+1}{\bf \Sigma_w}).$ Notice that, apart from the constants $c_{n+1}$ and $\tr({\bf S}_{n+1}{\bf \Sigma_w})$, which do not depend on $\bu$, the remaining terms are exactly the same as the deterministic (discrete-time) LQR version.

Remarkably, we can therefore achieve our dynamic programming recursion by using ${\bf S}_n$ as the solution of the discrete Riccati equation, and $c_n = c_{n+1} + \tr({\bf S}_{n+1}{\bf \Sigma_w}).$ As we take $n\rightarrow \infty$, ${\bf S}_n$ converges to the steady-state solution to the algebraic Riccati equation, and only $c_n$ grows to infinity. As a result, even though the cost-to-go is infinite, the optimal control is still well defined: it is the same $\bu^* = -{\bf K}\bx$ that we obtain from the deterministic LQR problem!
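The recursion is easy to watch numerically in the scalar case. The sketch below is an illustration under assumed values ($a = b = q = r = 1$, noise variance $0.1$, zero terminal cost); the helper names are hypothetical. For a scalar disturbance, $E[w S_{n+1} w] = S_{n+1}\sigma_w^2$, so that is the amount by which the constant grows at each step, while $S_n$ follows the usual deterministic Riccati update.

```python
def stochastic_lqr_scalar(a, b, q, r, sigma2, num_steps):
    """Backward recursion for J_n(x) = S_n x^2 + c_n on a scalar system.

    Hypothetical helper for x[n+1] = a x[n] + b u[n] + w[n], Var(w) = sigma2.
    Returns the list of (S_n, c_n) pairs, starting from the terminal step.
    """
    S, c = 0.0, 0.0  # assumed terminal cost-to-go of zero
    history = [(S, c)]
    for _ in range(num_steps):
        # The constant picks up E[w S w] = S * sigma2 from the next step.
        c = c + S * sigma2
        # Scalar discrete-time Riccati update (same as deterministic LQR).
        S = q + a * a * S - (a * b * S) ** 2 / (r + b * b * S)
        history.append((S, c))
    return history


def lqr_gain(a, b, r, S):
    """Feedback gain K for u = -K x given the converged S."""
    return a * b * S / (r + b * b * S)


history = stochastic_lqr_scalar(
    a=1.0, b=1.0, q=1.0, r=1.0, sigma2=0.1, num_steps=200
)
S_final, c_final = history[-1]
c_prev = history[-2][1]
K = lqr_gain(1.0, 1.0, 1.0, S_final)
print(f"S = {S_final:.6f}, K = {K:.6f}")
print(f"c grows by {c_final - c_prev:.6f} per step")
```

With these values $S_n$ settles at the golden ratio $(1+\sqrt{5})/2$ and the gain does not depend on `sigma2` at all, while `c` keeps increasing by a fixed amount at every step of the recursion.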

Take note of the fact that the true cost-to-go blows up to infinity.
In reinforcement learning, for instance, it is common practice to avoid
this blow-up by considering discounted-cost formulations,
$$\sum_{n=0}^\infty \gamma^n \ell(\bx[n], \bu[n]),\quad 0 < \gamma <
1,$$ or average-cost formulations, $$\lim_{N\rightarrow \infty}
\frac{1}{N} \sum_{n=0}^N \ell(\bx[n], \bu[n]).$$ These are satisfactory
solutions to the problem, but please make sure to understand why they
*must* be used.

The LQR derivation above assumed that the disturbances $\bw[n]$ were independent and identically distributed (i.i.d.). But many of the disturbances we would like to model are not i.i.d. For instance, consider a UAV flying in the wind. The wind is correlated over time, sometimes building up to gusts, but even those gusts are relatively long compared to the sampling period of a control system.

In fact, the standard models of wind are typically the output of a
linear low-pass filter driven by a Gaussian i.i.d. random signal.
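A minimal sketch of this idea, under assumptions not stated in the text: the low-pass filter is taken to be first order, $d[n+1] = \rho\, d[n] + w[n]$ with a hypothetical pole $\rho = 0.9$, and the plant in `augmented_step` is a made-up scalar system. The point is only that the filter state can be appended to the plant state, so that the augmented system is again driven by i.i.d. noise.

```python
import random


def simulate_colored_noise(pole, sigma, num_steps, seed=0):
    """Pass i.i.d. Gaussian noise through d[n+1] = pole * d[n] + w[n]."""
    rng = random.Random(seed)
    d = 0.0
    samples = []
    for _ in range(num_steps):
        samples.append(d)
        d = pole * d + rng.gauss(0.0, sigma)
    return samples


def lag_one_correlation(samples):
    """Empirical correlation between consecutive samples."""
    mean = sum(samples) / len(samples)
    centered = [s - mean for s in samples]
    num = sum(x * y for x, y in zip(centered[:-1], centered[1:]))
    den = sum(x * x for x in centered)
    return num / den


def augmented_step(x, d, u, w, a=0.95, b=0.1, pole=0.9):
    """One step of the augmented state (x, d); only w is i.i.d."""
    return a * x + b * u + d, pole * d + w


gusts = simulate_colored_noise(pole=0.9, sigma=1.0, num_steps=20000)
print(lag_one_correlation(gusts))  # close to the filter pole
```

The filtered signal is strongly correlated from one step to the next, even though the signal driving the filter is not.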

Which cost function should we use to do worst-case design and analysis
for linear systems? Certainly we can put an absolute bound on the
magnitude of $\bw$, and perform reachability analysis (we will do it
below). But knowing that the dependency on $\bw$ is linear allows us to
do something more natural, which can lead to tighter bounds. In
particular, we expect the magnitude of the deviation in $\bx$ compared to
the undisturbed case to be proportional to the magnitude of the
disturbance, $\bw$. So the natural bound for a linear system is a
*relative* bound on the magnitude of the response (from zero initial
conditions) relative to the magnitude of the disturbance.

Typically, this is done with a scalar "$L_2$ gain", $\gamma$, defined as: \begin{align*}\argmin_\gamma \quad \subjto& \quad \sup_{\bw(\cdot):\, \int \|\bw(t)\|^2 dt < \infty} \frac{\int_0^T \| \bx(t) \|^2 dt}{\int_0^T \| \bw(t) \|^2dt} \le \gamma^2, \qquad \text{or} \\ \argmin_\gamma \quad \subjto& \quad \sup_{\bw[\cdot]:\, \sum_n \|\bw[n]\|^2 < \infty} \frac{\sum_{n=0}^N \|\bx[n]\|^2}{\sum_{n=0}^N \| \bw[n] \|^2} \le \gamma^2.\end{align*} The name "$L_2$ gain" comes from the use of the $L_2$ norm on the signals $\bw(t)$ and $\bx(t)$ (the $\ell_2$ norm in discrete time), which is assumed only to be finite.

More often, these gains are written not in terms of $\bx[n]$ directly, but in terms of some "performance output", $\bz[n]$. For instance, if we would like to bound the cost of a quadratic regulator objective as a function of the magnitude of the disturbance, we can solve $$ \min_\gamma \quad \subjto \quad \sup_{\bw[\cdot]} \frac{\sum_{n=0}^N \|\bz[n]\|^2}{\sum_{n=0}^N \| \bw[n] \|^2} \le \gamma^2, \qquad \bz[n] = \begin{bmatrix}\sqrt{\bQ} \bx[n] \\ \sqrt{\bR} \bu[n]\end{bmatrix}.$$ This is a simple but important idea, and understanding it is the key to understanding the language around robust control. In particular the $\mathcal{H}_2$ norm of a system (from input $\bw$ to output $\bz$) is the energy of the impulse response; when $\bz$ is chosen to represent the quadratic regulator cost as above, it corresponds to the expected LQR cost. The $\mathcal{H}_\infty$ norm of a system (from $\bw$ to $\bz$) is the peak, over all frequencies, of the largest singular value of the transfer function; it corresponds to the $L_2$ gain.
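To see the connection between the frequency-domain peak and the energy ratio, consider a small numerical sketch. Everything specific here is an assumption for illustration: the scalar system $x[n+1] = a x[n] + w[n]$, $z[n] = c\,x[n]$ with $a = 0.5$, the frequency grid, and the helper names. The transfer function is $c/(e^{j\omega} - a)$, so the peak gain can be found by a sweep over $\omega \in [0, \pi]$ and compared against simulated energy ratios.

```python
import cmath
import math


def hinf_norm_scalar(a, c, num_freqs=2001):
    """Peak gain of G(z) = c / (z - a) on the unit circle (requires |a| < 1)."""
    peak = 0.0
    for k in range(num_freqs):
        omega = math.pi * k / (num_freqs - 1)  # grid over [0, pi]
        gain = abs(c / (cmath.exp(1j * omega) - a))
        peak = max(peak, gain)
    return peak


def energy_ratio(a, c, w):
    """sum ||z[n]||^2 / sum ||w[n]||^2 from zero initial conditions."""
    x, z_energy, w_energy = 0.0, 0.0, 0.0
    for wn in w:
        z_energy += (c * x) ** 2
        w_energy += wn ** 2
        x = a * x + wn
    return z_energy / w_energy


gamma = hinf_norm_scalar(a=0.5, c=1.0)
constant = energy_ratio(a=0.5, c=1.0, w=[1.0] * 2000)
alternating = energy_ratio(a=0.5, c=1.0, w=[(-1.0) ** n for n in range(2000)])
print(gamma ** 2, constant, alternating)
```

For this system the peak is at $\omega = 0$, so a constant disturbance approaches the bound $\gamma^2 = 4$ from below as the horizon grows, while a rapidly alternating disturbance produces a much smaller ratio.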

One of the mechanisms for certifying an $L_2$ gain for a system
comes from a generalization of Lyapunov analysis to consider the contributions of system inputs via the so-called "dissipation
inequalities". Dissipation inequalities are a general tool, and
$L_2$-gain analysis is only one application of them; for a broader
treatment see

Informally, the idea is to generalize the Lyapunov conditions, $V(\bx) \succ 0, \dot{V}(\bx) \preceq 0,$ into the more general form $$V(\bx) \succ 0, \quad \dot{V}(\bx) \preceq s(\bx, \bw),$$ where $\bw$ is the input of interest in this setting, and $s()$ is a scalar quantity representing a "supply rate". Once a system has input, the value of $V$ may go up or it may go down, but if we can bound the way that it goes up by a simple function of the input, then we may still be able to provide input-to-state or input-output bounds for the system. Integrating both sides of the derivative condition with respect to time yields: $$\forall \bx(0),\quad V(\bx(T)) \le V(\bx(0)) + \int_0^T s(\bx(t), \bw(t))dt.$$

To obtain a bound on the $L_2$ gain between input $\bw(t)$ and output $\bz(t)$, the supply rate of interest is $$s(\bx,\bw) = \gamma^2 \|\bw\|^2 - \|\bz\|^2,$$ which yields $$\forall \bx(0),\quad V(\bx(T)) \le V(\bx(0)) + \int_0^T \gamma^2 \|\bw(t)\|^2dt - \int_0^T \|\bz(t)\|^2dt .$$ Now, since this must hold for all $\bx(0)$, it holds for $\bx(0) = 0$, where we take $V(0) = 0$. Furthermore, we know $V(\bx(T))$ is non-negative, so we also have $$0 \le V(\bx(T)) \le \int_0^T \gamma^2 \|\bw(t)\|^2dt - \int_0^T \|\bz(t)\|^2dt .$$ Therefore, if we can find a $V$ that satisfies the dissipation inequality for this supply rate, we have certified that $\gamma$ is an $L_2$ gain for the system: $$ \frac{\int_0^T \| \bz(t) \|^2 dt}{\int_0^T \| \bw(t) \|^2dt} \le \gamma^2.$$
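As a small worked example of this recipe (the scalar system and the quadratic candidate are chosen here purely for illustration), take $\dot{x} = -a x + w$ with $a > 0$, output $z = x$, and candidate storage function $V(x) = p x^2$ with $p > 0$. Then $$\dot{V} = 2 p x (-a x + w) = -2 a p\, x^2 + 2 p\, x w,$$ and the dissipation inequality $\dot{V} \le \gamma^2 w^2 - x^2$ is equivalent to $$\begin{bmatrix} x \\ w \end{bmatrix}^T \begin{bmatrix} 2ap - 1 & -p \\ -p & \gamma^2 \end{bmatrix} \begin{bmatrix} x \\ w \end{bmatrix} \ge 0 \quad \forall x, w.$$ This holds exactly when the matrix is positive semidefinite, i.e. when $2ap - 1 \ge 0$ and $\gamma^2 (2ap - 1) \ge p^2$. Minimizing $\gamma^2 = \frac{p^2}{2ap - 1}$ over $p > \frac{1}{2a}$ gives $p = \frac{1}{a}$ and $$\gamma = \frac{1}{a},$$ which matches the peak of the frequency response $\left| \frac{1}{j\omega + a} \right|$ at $\omega = 0$. So for this system the quadratic storage function certifies the tightest possible $L_2$ gain.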

Coming soon...

Let's consider a robust variant of the LQR problem: \begin{gather*} \min_{\bu[\cdot]} \max_{\bw[\cdot]} \sum_{n=0}^\infty \bx^T[n]\bQ\bx[n] + \bu^T[n] \bR\bu[n] - \gamma^2 \bw^T[n]\bw[n],\\ \bx[n+1] = \bA\bx[n] + \bB\bu[n] + \bB_\bw \bw[n],\\ \bQ=\bQ^T \succeq 0, \bR = \bR^T \succ 0. \end{gather*} The reason for this choice of cost function will become clear in the derivation, but the intuition is that we want to reward the controller for having a small response to large $\bw[\cdot]$. Note that unlike the stochastic LQR formulation, here we do not specify the distribution over $\bw[n]$, and in fact we don't even restrict it to a bounded set. All we know is that at each time step, an omniscient adversary is allowed to choose the $\bw[n]$ that tries to maximize this objective.

In fact, since we don't need to specify a continuous-time random process and the continuous-time derivation is both cleaner and by now, I think, more familiar, let's do this one in continuous time. \begin{gather*} \min_{\bu(\cdot)} \max_{\bw(\cdot)} \int_{0}^\infty dt \left[\bx^T(t)\bQ\bx(t) + \bu^T(t) \bR\bu(t) - \gamma^2 \bw^T(t)\bw(t)\right],\\ \dot\bx(t) = \bA\bx(t) + \bB\bu(t) + \bB_\bw \bw(t),\\ \bQ=\bQ^T \succeq 0, \bR = \bR^T \succ 0. \end{gather*} We will once again guess a quadratic cost-to-go function: $$J(\bx) = \bx^T {\bf S} \bx, \quad {\bf S} = {\bf S}^T \succ 0.$$ The dynamic programming recursion still holds for this problem, resulting in the Bellman equation: $$\forall \bx,\quad 0 = \min_\bu \max_\bw \left[\bx^T\bQ\bx + \bu^T\bR\bu - \gamma^2 \bw^T\bw + \pd{J^*}{\bx}[\bA\bx + \bB\bu + \bB_\bw \bw]\right].$$ Since the inside is a concave quadratic form over $\bw$, we can solve for the adversary by finding the maximum: \begin{gather*} \pd{}{\bw} = -2 \gamma^2 \bw^T + 2\bx^T {\bf S} \bB_\bw,\\ \bw^* = \frac{1}{\gamma^2} \bB_\bw^T {\bf S} \bx.\end{gather*} This is the disturbance input that can cause the largest cost; the optimal play for the adversary. Now we can solve the remainder as we did with the original LQR problem: \begin{gather*} \pd{}{\bu} = 2 \bu^T \bR + 2\bx^T {\bf S} \bB,\\ \bu^* = -\bR^{-1} \bB^T {\bf S} \bx = -\bK \bx.\end{gather*} Substituting this back into the Bellman equation gives: $$0 = \bQ + {\bf S}\left[ \gamma^{-2} \bB_\bw \bB_\bw^T - \bB \bR^{-1} \bB^T \right]{\bf S} + {\bf S} \bA + \bA^T {\bf S},$$ which is the original LQR Riccati equation with one additional term involving $\gamma.$ And like the original LQR Riccati equation, we must ask whether it has a positive-definite solution for ${\bf S}$. It turns out that if the system is stabilizable and $\gamma$ is large enough, then it does have a PD solution.
However, as we reduce $\gamma$ towards zero, there will be some threshold $\gamma$ beneath which this Riccati equation does not have a PD solution. Intuitively, if $\gamma$ is too small, then the adversary is rewarded for injecting disturbances that are so large as to break the convergence.

Here is the fascinating thing: the $\gamma$ in this robust LQR cost function can be interpreted as an $L_2$ gain for the system. Recall that when we were making connections between the Bellman equation and Lyapunov equations, we observed that the Bellman equation can be written as $\forall \bx, \quad \dot{J}^*(\bx) \le -\ell(\bx,\bu).$ Here this yields: $$\forall \bx, \quad \dot{J}^*(\bx) \le \gamma^2 \bw^T(t)\bw(t) - \bx^T(t)\bQ\bx(t) - \bu^T(t) \bR\bu(t).$$ This means that the cost-to-go is a valid storage function: it satisfies the dissipation inequality for the supply rate that provides an $L_2$ gain for the performance output $\bz = \begin{bmatrix}\sqrt{\bQ} \bx \\ \sqrt{\bR} \bu\end{bmatrix}.$ Moreover, we can find the minimal $L_2$ gain by finding the minimal $\gamma > 0$ for which the Riccati equation has a positive-definite solution. And given the properties of the Riccati equation, this can be done with a line search in $\gamma$.
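A scalar sketch of that line search follows. It is an illustration under assumptions rather than a general solver: the system $\dot{x} = a x + b u + b_w w$ is scalar, the parameter values ($a = -1$, $b = b_w = q = r = 1$) are arbitrary, the helper names are hypothetical, and bisection is used as the line search. In the scalar case the Riccati equation reads $0 = q + S^2 (b_w^2/\gamma^2 - b^2/r) + 2 a S$, which can be solved in closed form for each candidate $\gamma$.

```python
import math


def robust_riccati_scalar(gamma, a, b, bw, q, r):
    """Positive root S of 0 = q + S^2 (bw^2/gamma^2 - b^2/r) + 2 a S, or None.

    Takes the root for which the closed loop a - m S is not unstable,
    where m = b^2/r - bw^2/gamma^2.
    """
    m = b * b / r - bw * bw / (gamma * gamma)
    disc = a * a + m * q
    if disc < 0.0:
        return None  # no real solution
    if abs(m) < 1e-12:
        # The equation degenerates to the linear one 0 = q + 2 a S.
        return -q / (2.0 * a) if a < 0.0 else None
    S = (a + math.sqrt(disc)) / m
    return S if S > 0.0 else None


def minimal_gamma(a, b, bw, q, r, lo=1e-3, hi=1e3, iters=80):
    """Bisection on gamma, assuming lo is infeasible and hi is feasible."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if robust_riccati_scalar(mid, a, b, bw, q, r) is None:
            lo = mid
        else:
            hi = mid
    return hi


gamma_star = minimal_gamma(a=-1.0, b=1.0, bw=1.0, q=1.0, r=1.0)
S_star = robust_riccati_scalar(gamma_star, -1.0, 1.0, 1.0, 1.0, 1.0)
K_star = 1.0 * S_star / 1.0  # K = b S / r
print(gamma_star, S_star, K_star)
```

For these values the search converges to $\gamma \approx 1/\sqrt{2}$ with gain $K \approx 1$, which agrees with minimizing the closed-loop gain $\sqrt{1 + K^2}/(1 + K)$ by hand. For very large $\gamma$ the solution approaches the ordinary LQR solution $S = \sqrt{2} - 1$.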

Because the $L_2$ gain is the $\mathcal{H}_\infty$-norm of the system, this recipe of searching for the smallest $\gamma$ and taking the Riccati solution is an instance of $\mathcal{H}_\infty$ control design.

The standard criticism of $\mathcal{H}_2$ optimal control is that minimizing the expected value does not allow any guarantees on performance. The standard criticism of $\mathcal{H}_\infty$ optimal control is that it concerns itself with the worst case, and may therefore be conservative, especially because distributions over possible disturbances chosen a priori may be unnecessarily conservative. One might hope that we could get some of this performance back if we are able to update our models of uncertainty online, adapting to the statistics of the disturbances we actually receive. This is one of the goals of adaptive control.

One of the fundamental problems in online adaptive control is the trade-off between exploration and exploitation. Some inputs might drive the system to build more accurate models of the dynamics / uncertainty quickly, which could lead to better performance. But how can we formalize this trade-off?

There has been some nice progress on this challenge in machine
learning in the setting of (contextual) multi-armed
bandit problems. For our purposes, you can think of bandits as a
limiting case of an optimal control problem where there are no dynamics
(the effects of one control action do not affect the results of the next
control action). In this simpler setting, the online optimization
community has developed exploration-exploitation strategies based on the
notion of minimizing *regret* -- typically the accumulated
difference in the performance achieved by my online algorithm vs the
performance that would have been achieved if I had been privy to the data
from my experiments before I started. This has led to methods that make
use of concepts like upper-confidence bound (UCB) and more recently
bounds using a least-squares confidence bound.
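For readers who have not seen a UCB strategy, here is a minimal sketch of the classic UCB1 rule on a Bernoulli bandit. The arm means, horizon, and function name are hypothetical, and this is the textbook (non-contextual) variant rather than any of the more recent algorithms. "Regret" here is measured against always pulling the best arm.

```python
import math
import random


def ucb1(means, horizon, seed=0):
    """Run UCB1 on Bernoulli arms; return pull counts and cumulative regret."""
    rng = random.Random(seed)
    num_arms = len(means)
    counts = [0] * num_arms
    totals = [0.0] * num_arms
    regret = 0.0
    best = max(means)
    for t in range(1, horizon + 1):
        if t <= num_arms:
            arm = t - 1  # pull every arm once to initialize
        else:
            # Empirical mean plus an exploration bonus that shrinks as an
            # arm is pulled more often.
            arm = max(
                range(num_arms),
                key=lambda i: totals[i] / counts[i]
                + math.sqrt(2.0 * math.log(t) / counts[i]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
        regret += best - means[arm]  # expected loss relative to the best arm
    return counts, regret


counts, regret = ucb1(means=[0.2, 0.5, 0.8], horizon=5000)
print(counts, regret)
```

The exploration bonus is what handles the trade-off: rarely pulled arms look optimistic and get tried again, but the bonus decays quickly enough that most pulls end up on the best arm.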

In the last few years, we've seen these results translated into the setting of linear optimal control...

Coming soon...

L2-gain with dissipation inequalities. Finite-time verification with sums of squares.

Occupation Measures

SOS-based design

- "Essentials of Robust Control", Prentice Hall, 1997.
- "Robust Control Design using H-infinity Methods", Springer-Verlag, 2000.
- "How should a robot assess risk? towards an axiomatic theory of risk in robotics", Robotics Research, pp. 75--84, 2020.
- "Elliptical Slice Sampling for Probabilistic Verification of Stochastic Systems with Signal Temporal Logic Specifications", To appear in the Proceedings of Hybrid Systems: Computation and Control (HSCC), 2022.
- "Finite-time Regional Verification of Stochastic Nonlinear Systems", International Journal of Robotics Research, vol. 31, no. 7, pp. 901--923, June, 2012.
- "Robust Post-Stall Perching with a Fixed-Wing UAV", PhD thesis, Massachusetts Institute of Technology, September, 2014.
- "Convex Optimization", Cambridge University Press, 2004.
- "Dissipation inequalities in systems theory: An introduction and recent results", Invited Lectures of the International Congress on Industrial and Applied Mathematics 2007, pp. 23--42, 2009.
- "Linear Matrix Inequalities in Control", Online Draft, pp. 293, 2015.
- "Optimal stochastic linear systems with exponential performance criteria and their relation to deterministic differential games", IEEE Transactions on Automatic Control, vol. 18, no. 2, pp. 124--131, Apr, 1973.
- "A Tutorial on Geometric Programming", Optimization and Engineering, vol. 8, no. 1, pp. 67--127, 2007.
- "Risk-sensitive optimal control", Wiley New York, vol. 20, 1990.
- "Beyond UCB: Optimal and Efficient Contextual Bandits with Regression Oracles", Proceedings of the 37th International Conference on Machine Learning, vol. 119, pp. 3199--3210, 13--18 Jul, 2020.
- "End-to-end training of deep visuomotor policies", The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334--1373, 2016.
- "Robust model predictive control of constrained linear systems with bounded disturbances", Automatica, vol. 41, no. 2, pp. 219--224, 2005.
- "Linear Encodings for Polytope Containment Problems", Proceedings of the 2019 IEEE 58th Conference on Decision and Control (CDC), pp. 8, 2019.
