Algorithms for Walking, Running, Swimming, Flying, and Manipulation

© Russ Tedrake, 2019

How to cite these notes

**Note:** These are working notes used for a course being taught at
MIT. They will be updated throughout the Spring 2019 semester. Lecture videos
are available on YouTube.

This book is about building robots that move with speed, efficiency, and grace. I believe that this can only be achieve through a tight coupling between mechanical design, passive dynamics, and nonlinear control synthesis. Therefore, these notes contain selected material from dynamical systems theory, as well as linear and nonlinear control.

These notes also reflect a deep belief in computational algorithms playing an essential role in finding and optimizing solutions to complex dynamics and control problems. Algorithms play an increasingly central role in modern control theory; these days even rigorous mathematicians consider finding convexity in a problem (therefore making it amenable to an efficient computational solution) almost tantamount to an analytical result. Therefore, the notes necessarily also cover selected material from optimization theory, motion planning, and machine learning.

Although the material in the book comes from many sources, the presentation is targeted very specifically at a handful of robotics problems. Concepts are introduced only when and if they can help progress the capabilities we are trying to develop. Many of the disciplines that I am drawing from are traditionally very rigorous, to the point where the basic ideas can be hard to penetrate for someone that is new to the field. I've made a conscious effort in these notes to keep a very informal, conversational tone even when introducing these rigorous topics, and to reference the most powerful theorems but only to prove them when that proof would add particular insights without distracting from the mainstream presentation. I hope that the result is a broad but reasonably self-contained and readable manuscript that will be of use to any enthusiastic roboticist.

The material in these notes is organized into a few main parts. "Model Systems" introduces a series of increasingly complex dynamical systems and overviews some of the relevant results from the literature for each system. "Nonlinear Planning and Control" introduces quite general computational algorithms for reasoning about those dynamical systems, with optimization theory playing a central role. Many of these algorithms treat the dynamical system as known and deterministic until the last chapters in this part which introduce stochasticity and robustness. "Estimation and Learning" follows this up with techniques from statistics and machine learning which capitalize on this viewpoint to introduce additional algorithms which can operate with less assumptions on knowing the model or having perfect sensors. The book closes with an "Appendix" that provides slightly more introduction (and references) for the main topics used in the course.

The order of the chapters was chosen to make the book valuable as a reference. When teaching the course, however, I take a spiral trajectory through the material, introducing robot dynamics and control problems one at a time, and introducing only the techniques that are required to solve that particular problem.

All of the examples and algorithms in this book, plus many more, are now
available as a part of our open-source software project:

Please see the appendix
for specific instructions for using

Robots today move far too conservatively, and accomplish only a fraction of the tasks and achieve a fraction of the performance that they are mechanically capable of. In many cases, we are still fundamentally limited by control technology which matured on rigid robotic arms in structured factory environments. The study of underactuated robotics focuses on building control systems which use the natural dynamics of the machines in an attempt to achieve extraordinary performance in terms of speed, efficiency, or robustness.

Let's start with some examples, and some videos.

The world of robotics changed when, in late 1996, Honda Motor Co. announced that they had been working for nearly 15 years (behind closed doors) on walking robot technology. Their designs have continued to evolve, resulting in a humanoid robot they call ASIMO (Advanced Step in Innovative MObility). For nearly 20 years, Honda's robots were widely considered to represent the state of the art in walking robots, although there are now many robots with designs and performance very similar to ASIMO's. We will dedicate effort to understanding a few of the details of ASIMO when we discuss algorithms for walking... for now I just want you to become familiar with the look and feel of ASIMO's movements [watch the asimo video below now].

I hope that your first reaction is to be incredibly impressed with the
quality and versatility of ASIMO's movements. Now take a second look.
Although the motions are very smooth, there is something a little
unnatural about ASIMO's gait. It feels a little like an astronaut
encumbered by a heavy space suit. In fact this is a reasonable analogy...
ASIMO is walking a bit like somebody that is unfamiliar with his/her
dynamics. Its control system is using high-gain feedback, and therefore
considerable joint torque, to cancel out the natural dynamics of the
machine and strictly follow a desired trajectory. This control approach
comes with a stiff penalty. ASIMO uses roughly 20 times the energy
(scaled) that a human uses to walk on the flat (measured by cost of
transport)

For contrast, let's now consider a very different type of walking
robot, called a passive dynamic walker (PDW). This "robot" has no motors,
no controllers, no computer, but is still capable of walking stably down a
small ramp, powered only by gravity [watch videos above now]. Most people
will agree that the passive gait of this machine is more natural than
ASIMO's; it is certainly more efficient. Passive walking machines have a
long history - there are patents for passively walking toys dating back to
the mid 1800's. We will discuss, in detail, what people know about the
dynamics of these machines and what has been accomplished experimentally.
This most impressive passive dynamic walker to date was built by Steve
Collins in Andy Ruina's lab at Cornell

Passive walkers demonstrate that the high-gain, dynamics-cancelling feedback approach taken on ASIMO is not a necessary one. In fact, the dynamics of walking is beautiful, and should be exploited - not cancelled out.

The world is just starting to see what this vision could look like. This video from Boston Dynamics is one of my favorites of all time:

This result is a marvel of engineering (the mechanical design alone is amazing...). In this class, we'll teach you the computational tools to required to make robots perform this way. We'll also try to reason about how robust these types of maneuvers are and can be. Don't worry, if you do not have a super lightweight, super capable, and super durable humanoid, then a simulation will be provided for you.

The story is surprisingly similar in a very different type of machine. Modern airplanes are extremely effective for steady-level flight in still air. Propellers produce thrust very efficiently, and today's cambered airfoils are highly optimized for speed and/or efficiency. It would be easy to convince yourself that we have nothing left to learn from birds. But, like ASIMO, these machines are mostly confined to a very conservative, low angle-of-attack flight regime where the aerodynamics on the wing are well understood. Birds routinely execute maneuvers outside of this flight envelope (for instance, when they are landing on a perch), and are considerably more effective than our best aircraft at exploiting energy (eg, wind) in the air.

As a consequence, birds are extremely efficient flying machines; some
are capable of migrating thousands of kilometers with incredibly small
fuel supplies. The wandering albatross can fly for hours, or even days,
without flapping its wings - these birds exploit the shear layer formed by
the wind over the ocean surface in a technique called dynamic soaring.
Remarkably, the metabolic cost of flying for these birds is
indistinguishable from the baseline metabolic cost

Birds are also incredibly maneuverable. The roll rate of a highly
acrobatic aircraft (e.g, the A-4 Skyhawk) is approximately 720
deg/sec

Although many impressive statistics about avian flight have been
recorded, our understanding is partially limited by experimental
accessibility - it is quite difficult to carefully measure birds (and the
surrounding airflow) during their most impressive maneuvers without
disturbing them. The dynamics of a swimming fish are closely related, and
can be more convenient to study. Dolphins have been known to swim
gracefully through the waves alongside ships moving at 20
knots

Despite a long history of success in industrial applications, and the huge potential for consumer applications, we still don't have robot arms that can perform any meaningful tasks in the home. Admittedly, the perception problem (using sensors to detect/localize objects and understand the scene) for home robotics is incredibly difficult. But even if we were given a perfect perception system, our robots are still a long way from performing basic object manipulation tasks with the dexterity and versatility of a human.

Most robots that perform object manipulation today use a stereotypical pipeline. First, we enumerate a handful of contact locations on the hand (these points, and only these points, are allowed to contact the world). Then, given a localized object in the environment, we plan a collision-free trajectory for the arm that will move the hand into a "pre-grasp" location. At this point the robot closes it's eyes (figuratively) and closes the hand, hoping that the pre-grasp location was good enough that the object will be successfully grasped using e.g. only current feedback in the fingers to know when to stop closing. "Underactuated hands" make this approach more successful, but the entire approach really only works well for enveloping grasps.

The enveloping grasps approach may actually be sufficient for a number
of simple pick-and-place tasks, but it is a very poor representation of
how humans do manipulation. When humans manipulate objects, the contact
interactions with the object and the world are very rich -- we often use
pieces of the environment as fixtures to reduce uncertainty, we commonly
*exploit* slipping behaviors (e.g. for picking things up, or
reorienting it in the hand), and our brains don't throw NaNs if we use the
entire surface of our arms to e.g. manipulate a large object.

By the way, in most cases, if the robots fail to make contact at the anticipated contact times/locations, bad things can happen. The results are hilarious and depressing at the same time. (Let's fix that!)

Classical control techniques for robotics are based on the idea that feedback can be used to override the dynamics of our machines. These examples suggest that to achieve outstanding dynamic performance (efficiency, agility, and robustness) from our robots, we need to understand how to design control systems which take advantage of the dynamics, not cancel them out. That is the topic of this course.

Surprisingly, many formal control ideas do not support the idea of "exploiting" the dynamics. Optimal control formulations (which we will study in depth) allow it in principle, but optimal control of nonlinear systems is still a relatively ad hoc discipline. Sometimes I joke that in order to convince a control theorist to consider the dynamics, you have to do something drastic, like taking away her control authority - remove a motor, or enforce a torque-limit. These issues have created a formal class of systems, the underactuated systems, for which people have begun to more carefully consider the dynamics of their machines in the context of control.

According to Newton, the dynamics of mechanical systems are second order ($F = ma$). Their state is given by a vector of positions, $\bq$ (also known as the configuration vector), and a vector of velocities, $\dot{\bq}$, and (possibly) time. The general form for a second-order controllable dynamical system is: $$\ddot{\bq} = {\bf f}(\bq,\dot{\bq},\bu,t),$$ where $\bu$ is the control vector. As we will see, the dynamics for many of the robots that we care about turn out to be affine in commanded torque, so let's consider a slightly constrained form: \begin{equation}\ddot{\bq} = {\bf f}_1(\bq,\dot{\bq},t) + {\bf f}_2(\bq,\dot{\bq},t)\bu \label{eq:f1_plus_f2}.\end{equation}

Notice that whether or not a control system is underactuated may depend
on the state of the system or even on time, although for most systems
(including all of the systems in this book) underactuation is a global
property of the system. We will refer to a system as underactuated if it
is underactuated in *all* states and times. In practice, we often
refer informally to systems as fully actuated as long as they are fully
actuated in *most* states (e.g., a "fully-actuated" system might still
have joint limits or lose rank at a kinematic singularity). Admittedly,
this permits the existence of a gray area, where it might feel awkward to
describe the *system* as either fully- or underactuated (we should
instead only describe its states); we'll see examples, for instance, of
hybrid systems like walking robots that are fully-actuated in some modes
but underactuated in others. I will still informally refer to these
systems as being underactuated whenever reasoning about the underactuation
is useful/necessary for developing a control strategy.

Consider the simple robot manipulator illustrated above. As described in the Appendix, the equations of motion for this system are quite simple to derive, and take the form of the standard "manipulator equations": $${\bf M}(\bq)\ddot\bq + \bC(\bq,\dot\bq)\dot\bq = \btau_g(\bq) + {\bf B}\bu.$$ It is well known that the inertia matrix, ${\bf M}(\bq)$ is (always) uniformly symmetric and positive definite, and is therefore invertible. Putting the system into the form of equation \ref{eq:f1_plus_f2} yields: \begin{align*}\ddot{\bq} =& {\bf M}^{-1}(\bq)\left[ \btau_g(\bq) + \bB\bu - \bC(\bq,\dot\bq)\dot\bq \right].\end{align*} Because ${\bf M}^{-1}(\bq)$ is always full rank, we find that a system described by the manipulator equations is fully-actuated if and only if $\bB$ is full row rank. For this particular example, $\bq = [\theta_1,\theta_2]^T$ and $\bu = [\tau_1,\tau_2]^T$ (motor torques at the joints), and $\bB = {\bf I}_{2 \times 2}$. The system is fully actuated.

I personally learn best when I can experiment and get some physical intuition. The companion software for the course should make it easy for you to see this system in action. To try it, make sure you've followed the installation instructions in the Appendix, then open Python and try the following lines in your Python console:

It's worth taking a peek at the file that describes the robot. URDF and SDF are two of the standard formats, and they can be used to describe even very complicated robots (like the Boston Dynamics humanoid).

We can also use

While the basic double pendulum is fully actuated, imagine the somewhat bizarre case that we have a motor to provide torque at the elbow, but no motor at the shoulder. In this case, we have $\bu = \tau_2$, and $\bB(\bq) = [0,1]^T$. This system is clearly underactuated. While it may sound like a contrived example, it turns out that it is almost exactly the dynamics we will use to study as our simplest model of walking later in the class.

The matrix ${\bf f}_2$ in equation \ref{eq:f1_plus_f2} always has dim$[\bq]$ rows, and dim$[\bu]$ columns. Therefore, as in the example, one of the most common cases for underactuation, which trivially implies that ${\bf f}_2$ is not full row rank, is dim$[\bu] < $ dim$[\bq]$. This is the case when a robot has joints with no motors. But this is not the only case. The human body, for instance, has an incredible number of actuators (muscles), and in many cases has multiple muscles per joint; despite having more actuators than position variables, when I jump through the air, there is no combination of muscle inputs that can change the ballistic trajectory of my center of mass (barring aerodynamic effects). My control system is underactuated.

For completeness, let's generalize the definition of underactuation to systems beyond the second-order control affine systems.

It is easy to see that equation \ref{eq:underactuated_def} is a sufficient condition for underactuation. This definition can also be extended to discrete-time systems and/or differential inclusions.

A quick note about notation. When describing the dynamics of rigid-body systems in this class, I will use $\bq$ for configurations (positions), $\dot{\bq}$ for velocities, and use $\bx$ for the full state ($\bx = [\bq^T,\dot{\bq}^T]^T$). There is an important limitation to this convention (3D angular velocity should not be represented as the derivative of 3D pose) described in the Appendix, but it will keep the notes cleaner. Unless otherwise noted, vectors are always treated as column vectors. Vectors and matrices are bold (scalars are not).

Fully-actuated systems are dramatically easier to control than underactuated systems. The key observation is that, for fully-actuated systems with known dynamics (e.g., ${\bf f}_1$ and ${\bf f}_2$ are known for a second-order control-affine system), it is possible to use feedback to effectively change a nonlinear control problem into a linear control problem. The field of linear control is incredibly advanced, and there are many well-known solutions for controlling linear systems.

The trick is called feedback linearization. When ${\bf f}_2$ is full row
rank, it is invertible

Let's say that we would like our simple double pendulum to act like a
simple single pendulum (with damping), whose dynamics are given by:
\begin{align*} \ddot \theta_1 &= -\frac{g}{l}\sin\theta_1 -b\dot\theta_1 \\
\ddot\theta_2 &= 0. \end{align*} This is easily achieved using

Since we are embedding a nonlinear dynamics (not a linear one), we refer to this as "feedback cancellation", or "dynamic inversion". This idea can, and does, make control look easy - for the special case of a fully-actuated deterministic system with known dynamics. For example, it would have been just as easy for me to invert gravity. Observe that the control derivations here would not have been any more difficult if the robot had 100 joints.

As always, make soure you take a moment to read through the source code (linked above).

The underactuated systems are not feedback linearizable. Therefore, unlike fully-actuated systems, the control designer has no choice but to reason about the nonlinear dynamics of the plant in the control design. This dramatically complicates feedback controller design.

Although the dynamic constraints due to missing actuators certainly embody the spirit of this course, many of the systems we care about could be subject to other dynamic constraints as well. For example, the actuators on our machines may only be mechanically capable of producing some limited amount of torque, or there may be a physical obstacle in the free space with which we cannot permit our robot to come into contact with.

In practice it can be useful to separate out constraints which depend only on the input, e.g. $\phi(\bu)\ge0$, such as actuator limits, as they can often be easier to handle than state constraints. An obstacle in the environment might manifest itself as one or more constraints that depend only on position, e.g. $\phi(\bq)\ge0$.

By our generalized definition of underactuation, we can see that input constraints can certainly cause a system to be underactuated. State (only) constraints are more subtle -- in general these actually reduce the dimensionality of the state space, therefore requiring less dimensions of actuation to achieve "full" control, but we only reap the benefits if we are able to perform the control design in the "minimal coordinates" (which is often difficult).

Input and state constraints can complicate control design in similar ways to having an insufficient number of actuators, (i.e., further limiting the set of the feasible trajectories), and often require similar tools to find a control solution.

You might have heard of the term "nonholonomic system" (see e.g.

Contrast the wheeled robot example with a robot on train tracks. The train tracks correspond to a holonomic constraint: the track constraint can be written directly in terms of the configuration $\bq$ of the system, without using the velocity ${\bf \dot{q}}$. Even though the track constraint could also be written as a differential constraint on the velocity, it would be possible to integrate this constraint to obtain a constraint on configuration. The track restrains the possible configurations of the system.

A nonholonomic constraint like the no-side-slip constraint on the wheeled vehicle certainly results in an underactuated system. The converse is not necessarily true—the double pendulum system which is missing an actuator is underactuated but would not typically be called a nonholonomic system. Note that the Lagrangian equations of motion are a constraint of the form \[\bphi(\bq,{\bf \dot{q}},{\bf \ddot{q}},t) = 0,\] so do not qualify as a nonholonomic constraint.

The control of underactuated systems is an open and interesting problem in controls. Although there are a number of special cases where underactuated systems have been controlled, there are relatively few general principles. Now here's the rub... most of the interesting problems in robotics are underactuated:

- Legged robots are underactuated. Consider a legged machine with $N$ internal joints and $N$ actuators. If the robot is not bolted to the ground, then the degrees of freedom of the system include both the internal joints and the six degrees of freedom which define the position and orientation of the robot in space. Since $\bu \in \Re^N$ and $\bq \in \Re^{N+6}$, equation \ref{eq:underactuated_def} is satisfied.
- (Most) Swimming and flying robots are underactuated. The story is the same here as for legged machines. Each control surface adds one actuator and one DOF. And this is already a simplification, as the true state of the system should really include the (infinite-dimensional) state of the flow.
- Robot manipulation is (often) underactuated. Consider a fully-actuated robotic arm. When this arm is manipulating an object with degrees of freedom (even a brick has six), it can become underactuated. If force closure is achieved, and maintained, then we can think of the system as fully-actuated, because the degrees of freedom of the object are constrained to match the degrees of freedom of the hand. That is, of course, unless the manipulated object has extra DOFs (for example, any object that is deformable).

Even control systems for fully-actuated and otherwise unconstrained systems can be improved using the lessons from underactuated systems, particularly if there is a need to increase the efficiency of their motions or reduce the complexity of their designs.

This course is based on the observation that there are new computational tools from optimization theory, control theory, motion planning, and even machine learning which can be used to design feedback control for underactuated systems. The goal of this class is to develop these tools in order to design robots that are more dynamic and more agile than the current state-of-the-art.

The target audience for the class includes both computer science and mechanical/aero students pursuing research in robotics. Although I assume a comfort with linear algebra, ODEs, and Python, the course notes aim to provide most of the material and references required for the course.

I have a confession: I actually think that the material we'll cover in these notes is valuable far beyond robotics. I think that systems theory provides a powerful language for organizing computation in exceedingly complex systems -- especially when one is trying to program and/or analyze systems with continuous variables in a feedback loop (which happens throughout computer science and engineering, by the way). I hope you find these tools to be broadly useful, even if you don't have a humanoid robot capable of performing a backflip at your immediate disposal.

Our goals for this chapter are modest: we'd like to understand the dynamics of a pendulum.

Why a pendulum? In part, because the dynamics of a majority of our multi-link robotics manipulators are simply the dynamics of a large number of coupled pendula. Also, the dynamics of a single pendulum are rich enough to introduce most of the concepts from nonlinear dynamics that we will use in this text, but tractable enough for us to (mostly) understand in the next few pages.

The Lagrangian derivation of the equations of motion (as described in the appendix) of the simple pendulum yields: \begin{equation*} m l^2 \ddot\theta(t) + mgl\sin{\theta(t)} = Q. \end{equation*} We'll consider the case where the generalized force, $Q$, models a damping torque (from friction) plus a control torque input, $u(t)$: $$Q = -b\dot\theta(t) + u(t).$$

Let us first consider the dynamics of the pendulum if it is driven in a particular simple way: a torque which does not vary with time: \begin{equation} ml^2 \ddot\theta + b\dot\theta + mgl \sin\theta = u_0. \end{equation}

.You can experiment with this system in

(use the slider at the top to adjust the torque input).

These are relatively simple differential equations, so if I give you $\theta(0)$ and $\dot\theta(0)$, then you should be able to integrate them to obtain $\theta(t)$... right? Although it is possible, integrating even the simplest case ($b = u = 0$) involves elliptic integrals of the first kind; there is relatively little intuition to be gained here.

This is in stark contrast to the case of linear systems, where much of our understanding comes from being able to explicitly integrate the equations. For instance, for a simple linear system we have $$\dot{q} = a q \quad \rightarrow \quad q(t) = q(0) e^{at},$$ and we can immediately understand that the long-term behavior of the system is a (stable) decaying exponential if $a<0$, an (unstable) growing exponential if $a>0$, and that the system does nothing if $a=0$. Here we are with certainly one of the simplest nonlinear systems we can imagine, and we can't even solve this system?

All is not lost. If what we care about is the long-term behavior of the
system, then there are a number of techniques we can apply. In this
chapter, we will start by investigating graphical solution methods. These
methods are described beautifully in a book by Steve
Strogatz

Let's start by studying a special case -- intuitively when $b\dot\theta \gg ml^2\ddot\theta$ -- which via dimensional analysis (using the natural frequency $\sqrt{\frac{g}{l}}$ to match units) occurs when $b \sqrt\frac{l}{g} \gg ml^2$. This is the case of heavy damping, for instance if the pendulum was moving in molasses. In this case, the damping term dominates the acceleration term, and we have: $$ml^2 \ddot\theta + b\dot\theta \approx b\dot\theta = u_0 - mgl\sin\theta.$$ In other words, in the case of heavy damping, the system looks approximately first order. This is a general property of heavily damped systems, such as fluids at very low Reynolds number.

I'd like to ignore one detail for a moment: the fact that $\theta$ wraps around on itself every $2\pi$. To be clear, let's write the system without the wrap-around as: \begin{equation}b\dot{x} = u_0 - mgl\sin{x}.\label{eq:overdamped_pend_ct}\end{equation} Our goal is to understand the long-term behavior of this system: to find $x(\infty)$ given $x(0)$. Let's start by plotting $\dot{x}$ vs $x$ for the case when $u_0=0$:

The first thing to notice is that the system has a number of *fixed
points* or *steady states*, which occur whenever $\dot{x} = 0$.
In this simple example, the zero-crossings are $x^* = \{..., -\pi, 0, \pi,
2\pi, ...\}$. When the system is in one of these states, it will never
leave that state. If the initial conditions are at a fixed point, we know
that $x(\infty)$ will be at the same fixed point.

Next let's investigate the behavior of the system in the local vicinity
of the fixed points. Examining the fixed point at $x^* = \pi$, if the
system starts just to the right of the fixed point, then $\dot{x}$ is
positive, so the system will move away from the fixed point. If it starts
to the left, then $\dot{x}$ is negative, and the system will move away in
the opposite direction. We'll call fixed-points which have this property
*unstable*. If we look at the fixed point at $x^* = 0$, then the
story is different: trajectories starting to the right or to the left will
move back towards the fixed point. We will call this fixed point
*locally stable*. More specifically, we'll distinguish between
three types of local stability:

- Locally stable
*in the sense of Lyapunov*(i.s.L.). A fixed point, $x^*$ is locally stable i.s.L. if for every small $\epsilon > 0$, I can produce a $\delta > 0$ such that if $\| x(0) - x^* \| < \delta$ then $\forall t$ $\| x(t) - x^*\| < \epsilon$. In words, this means that for any ball of size $\epsilon$ around the fixed point, I can create a ball of size $\delta$ which guarantees that if the system is started inside the $\delta$ ball then it will remain inside the $\epsilon$ ball for all of time. - Locally
*asymptotically stable*. A fixed point is locally asymptotically stable if $x(0) = x^* + \epsilon$ implies that $\lim_{t\rightarrow \infty} x(t) = x^*$. - Locally
*exponentially stable*. A fixed point is locally exponentially stable if $x(0) = x^* + \epsilon$ implies that $\| x(t) - x^* \| < Ce^{-\alpha t}$, for some positive constants $C$ and $\alpha$.

An initial condition near a fixed point that is stable in the sense of
Lyapunov may never reach the fixed point (but it won't diverge), near an
asymptotically stable fixed point will reach the fixed point as $t
\rightarrow \infty$, and near an exponentially stable fixed point will
reach the fixed point with a bounded rate. An exponentially stable fixed
point is also an asymptotically stable fixed point, but the converse is
not true. Asymptotic stability and Lyapunov stability, however, are
distinct notions -- it is actually possible to have a system that is
asymptotically stable but not stable i.s.L.†*finite-time stability*;
we will see examples of this later in the book, but it is a difficult
topic to penetrate with graphical analysis. Rigorous nonlinear system
analysis is rich with subtleties and surprises. Moreover, these
differences actually matter -- the code that we will write to stabilize
the systems will be subtley different depending on what type of stability
we want, and it can make or break the success of our methods.

Our graph of $\dot{x}$ vs. $x$ can be used to convince ourselves of i.s.L. and asymptotic stability by visually inspecting $\dot{x}$ in the vicinity of a fixed point. Even exponential stability can be inferred if the function can be bounded away from the origin by a negatively-sloped line through the fixed point, since it implies that the nonlinear system will converge at least as fast as the linear system represented by the straight line. I will graphically illustrate unstable fixed points with open circles and stable fixed points (i.s.L.) with filled circles.

Next, we need to consider what happens to initial conditions which begin farther from the fixed points. If we think of the dynamics of the system as a flow on the $x$-axis, then we know that anytime $\dot{x} > 0$, the flow is moving to the right, and $\dot{x} < 0$, the flow is moving to the left. If we further annotate our graph with arrows indicating the direction of the flow, then the entire (long-term) system behavior becomes clear:

For instance, we can see that any initial condition $x(0) \in
(-\pi,\pi)$ will result in $\lim_{t\rightarrow \infty} x(t) = 0$. This
region is called the *basin of attraction* of the fixed point at
$x^* = 0$. Basins of attraction of two fixed points cannot overlap, and
the manifold separating two basins of attraction is called the
*separatrix*. Here the unstable fixed points, at $x^* = \{..,
-\pi, \pi, 3\pi, ...\}$ form the separatrix between the basins of
attraction of the stable fixed points.

As these plots demonstrate, the behavior of a first-order one dimensional system on a line is relatively constrained. The system will either monotonically approach a fixed-point or monotonically move toward $\pm \infty$. There are no other possibilities. Oscillations, for example, are impossible. Graphical analysis is a fantastic analysis tool for many first-order nonlinear systems (not just pendula); as illustrated by the following example:

One last piece of terminology. In the neuron example, and in many
dynamical systems, the dynamics were parameterized; in this case by a
single parameter, $w$. As we varied $w$, the fixed points of the system
moved around. In fact, if we increase $w$ through $w=1$, something
dramatic happens - the system goes from having one fixed point to having
three fixed points. This is called a *bifurcation*. This
particular bifurcation is called a pitchfork bifurcation. We often draw
bifurcation diagrams which plot the fixed points of the system as a
function of the parameters, with solid lines indicating stable fixed
points and dashed lines indicating unstable fixed points, as seen in the
figure:

Our pendulum equations also have a (saddle-node) bifurcation when we
change the constant torque input, $u_0$. Finally, let's return to the
original equations in $\theta$, instead of in $x$. Only one point to
make: because of the wrap-around, this system will *appear* have
oscillations. In fact, the graphical analysis reveals that the pendulum
will turn forever whenever $|u_0| > mgl$, but now you understand
that this is not an oscillation, but an instability with $\theta
\rightarrow \pm \infty$.

Consider again the system $$ml^2 \ddot\theta = u_0 - mgl \sin\theta -
b\dot\theta,$$ this time with $b = 0$. This time the system dynamics are
truly second-order. We can always think of any second-order system as
(coupled) first-order system with twice as many variables. Consider a
general, autonomous (not dependent on time), second-order system,
$$\ddot{q} = f(q,\dot q,u).$$ This system is equivalent to the
two-dimensional first-order system \begin{align*} \dot x_1 =& x_2 \\ \dot
x_2 =& f(x_1,x_2,u), \end{align*} where $x_1 = q$ and $x_2 = \dot q$.
Therefore, the graphical depiction of this system is not a line, but a
vector field where the vectors $[\dot x_1, \dot x_2]^T$ are plotted over
the domain $(x_1,x_2)$. This vector field is known as the *phase
portrait* of the system.

In this section we restrict ourselves to the simplest case when $u_0 = 0$. Let's sketch the phase portrait. First sketch along the $\theta$-axis. The $x$-component of the vector field here is zero, the $y$-component is $-mgl\sin\theta.$ As expected, we have fixed points at $\pm \pi, ...$ Now sketch the rest of the vector field. Can you tell me which fixed points are stable? Some of them are stable i.s.L., none are asymptotically stable.

You might wonder how we drew the black contour lines in the figure above. We could have obtained them by simulating the system numerically, but those lines can be easily obtained in closed-form. Directly integrating the equations of motion is difficult, but at least for the case when $u_0 = 0$, we have some additional physical insight for this problem that we can take advantage of. The kinetic energy, $T$, and potential energy, $U$, of the pendulum are given by $$T = \frac{1}{2}I\dot\theta^2, \quad U = -mgl\cos(\theta),$$ where $I=ml^2$ and the total energy is $E(\theta,\dot\theta) = T(\dot\theta)+U(\theta)$. The undamped pendulum is a conservative system: total energy is a constant over system trajectories. Using conservation of energy, we have: \begin{gather*} E(\theta(t),\dot\theta(t)) = E(\theta(0),\dot\theta(0)) = E_0 \\ \frac{1}{2} I \dot\theta^2(t) - mgl\cos(\theta(t)) = E_0 \\ \dot\theta(t) = \pm \sqrt{\frac{2}{I}\left[E_0 + mgl\cos\left(\theta(t)\right)\right]} \end{gather*} Using this, if you tell me $\theta$ I can determine one of two possible values for $\dot\theta$, and the solution has all of the richness of the black countour lines from the plot. This equation has a real solution when $\cos(\theta) > \cos(\theta_{max})$, where $$\theta_{max} = \begin{cases} \cos^{-1}\left( \frac{E}{mgl} \right), & E < mgl \\ \pi, & \text{otherwise}. \end{cases}$$ Of course this is just the intuitive notion that the pendulum will not swing above the height where the total energy equals the potential energy. As an exercise, you can verify that differentiating this equation with respect to time indeed results in the equations of motion.

Now what happens if we add a constant torque? If you visualize the bifurcation diagram, you'll see that the fixed points come together, towards $q = \frac{\pi}{2}, \frac{5\pi}{2}, ...$, until they disappear. One fixed-point is unstable, and one is stable.

Before we continue, let me now give you the promised example of a system that is asymptotically stable, but not stable i.s.L.. We can accomplish this with a very pendulum-like example (written here in polar coordinates):

The system \[ \dot{r} = r(1-r) \\ \dot\theta = \sin^2 (\frac{\theta}{2}) \] is asymptotically stable to $x^* = [1,0]^T$, but is not stable i.s.L.

Take a minute to draw the vector field of this (you can draw each coordinate independently, if it helps) to make sure you understand. Note that to wrap-around rotation is convenient but not essential -- we could have written the same dynamical system in cartesian coordinates without wrapping. And if this feel too arbitrary, we will see it happen in practice when we introduce the energy-shaping swing-up controller for the pendulum in the next chapter.

Now let's add damping back. You can still add torque to move the fixed points (in the same way).

With damping, the downright fixed point of the pendulum now becomes
asymptotically stable (in addition to stable i.s.L.). Is it also
exponentially stable? How can we tell? One technique is to linearize the
system at the fixed point. A smooth, time-invariant, nonlinear system
that is locally exponentially stable *must* have a stable
linearization; we'll discuss linearization more in the next chapter.

Here's a thought exercise. If $u$ is no longer a constant, but a
function $\pi(\theta,\dot{\theta})$, then how would you choose $\pi$ to
stabilize the vertical position. Feedback linearization is the trivial
solution, for example: $$u = \pi(\theta,\dot{\theta}) = 2mgl\sin\theta.$$
But these plots we've been making tell a different story. How would you
shape the natural dynamics - at each point pick a $u$ from the stack of
phase plots - to stabilize the vertical fixed point *with minimal
torque effort*? This is exactly the way that I would like you to
think about control system design. And we'll give you your first solution
techniques -- using dynamic programming -- in the next lecture.

The simple pendulum is fully actuated. Given enough torque, we can produce any number of control solutions to stabilize the originally unstable fixed point at the top (such as designing a feedback controller to effectively invert gravity).

The problem begins to get interesting (a.k.a. becomes underactuated) if we impose a torque-limit constraint, $|u|\le u_{max}$. Looking at the phase portraits again, you can now visualize the control problem. Via feedback, you are allowed to change the direction of the vector field at each point, but only by a fixed amount. Clearly, if the maximum torque is small (smaller than $mgl$), then there are some states which cannot be driven directly to the goal, but must pump up energy to reach the goal. Furthermore, if the torque-limit is too severe and the system has damping, then it may be impossible to swing up to the top. The existence of a solution, and number of pumps required to reach the top, is a non-trivial function of the initial conditions and the torque-limits.

Although this system is very simple, its solution requires much of the same reasoning necessary for controlling much more complex underactuated systems; this problem will be a work-horse for us as we introduce new algorithms throughout this book.

A great deal of work in the control of underactuated systems has been done in the context of low-dimensional model systems. These model systems capture the essence of the problem without introducing all of the complexity that is often involved in more real-world examples. In this chapter we will focus on two of the most well-known and well-studied model systems--the Acrobot and the Cart-Pole. After we have developed some tools, we will see that they can be applied directly to other model systems; we will give a number of examples using Quadrotors. All of these systems are trivially underactuated, having less actuators than degrees of freedom.

The Acrobot is a planar two-link robotic arm in the vertical plane
(working against gravity), with an actuator at the elbow, but no actuator at
the shoulder. It was first described in detail in

The Acrobot is representative of the primary challenge in underactuated robots. In order to swing up and balance the entire system, the controller must reason about and exploit the state-dependent coupling between the actuated degree of freedom and the unactuated degree of freedom. It is also an important system because, as we will see, it closely resembles one of the simplest models of a walking robot.

Figure 3.1 illustrates the model parameters used in our analysis. $\theta_1$ is the shoulder joint angle, $\theta_2$ is the elbow (relative) joint angle, and we will use $\bq = [\theta_1,\theta_2]^T$, $\bx = [\bq,\dot\bq]^T$. The zero configuration is with both links pointed directly down. The moments of inertia, $I_1,I_2$ are taken about the pivots. The task is to stabilize the unstable fixed point $\bx = [\pi,0,0,0]^T$.

We will derive the equations of motion for the Acrobot using the method of Lagrange. The kinematics are given by: \begin{equation} \bx_1 = \begin{bmatrix} l_1 s_1 \\ -l_1 c_1 \end{bmatrix}, \quad \bx_2 = \bx_1 + \begin{bmatrix} l_2 s_{1+2} \\ - l_2 c_{1+2} \end{bmatrix} . \end{equation} The energy is given by: \begin{gather} T = T_1 + T_2, \quad T_1 = \frac{1}{2} I_1 \dot{q}_1^2 \\ T_2 = \frac{1}{2} ( m_2 l_1^2 + I_2 + 2 m_2 l_1 l_{c2} c_2 ) \dot{q}_1^2 + \frac{1}{2} I_2 \dot{q}_2^2 + (I_2 + m_2 l_1 l_{c2} c_2 ) \dot{q}_1 \dot{q}_2 \\ % from expanding sum of point masses U = -m_1 g l_{c1} c_1 - m_2 g (l_1 c_1 + l_{c2} c_{1+2}) \end{gather} Entering these quantities into the Lagrangian yields the equations of motion: \begin{gather} (I_1 + I_2 + m_2 l_1^2 + 2m_2 l_1 l_{c2} c_2) \ddot{q}_1 + (I_2 + m_2 l_1 l_{c2} c_2)\ddot{q}_2 - 2m_2 l_1 l_{c2} s_2 \dot{q}_1 \dot{q}_2 \\ \quad -m_2 l_1 l_{c2} s_2 \dot{q}_2^2 + m_1 g l_{c1}s_1 + m_2 g (l_1 s_1 + l_{c2} s_{1+2}) = 0 \\ (I_2 + m_2 l_1 l_{c2} c_2) \ddot{q}_1 + I_2 \ddot{q}_2 + m_2 l_1 l_{c2} s_2 \dot{q}_1^2 + m_2 g l_{c2} s_{1+2} = \tau \end{gather} In standard, manipulator equation form: $$\bM(\bq)\ddot\bq + \bC(\bq,\dot\bq)\dot\bq = \btau_g(\bq) + \bB\bu,$$ using $\bq = [\theta_1,\theta_2]^T$, $\bu = \tau$ we have: \begin{gather} \bM(\bq) = \begin{bmatrix} I_1 + I_2 + m_2 l_1^2 + 2m_2 l_1 l_{c2} c_2 & I_2 + m_2 l_1 l_{c2} c_2 \\ I_2 + m_2 l_1 l_{c2} c_2 & I_2 \end{bmatrix},\label{eq:Hacrobot}\\ \bC(\bq,\dot{\bq}) = \begin{bmatrix} -2 m_2 l_1 l_{c2} s_2 \dot{q}_2 & -m_2 l_1 l_{c2} s_2 \dot{q}_2 \\ m_2 l_1 l_{c2} s_2 \dot{q}_1 & 0 \end{bmatrix}, \\ \btau_g(\bq) = \begin{bmatrix} -m_1 g l_{c1}s_1 - m_2 g (l_1 s_1 + l_{c2}s_{1+2}) \\ -m_2 g l_{c2} s_{1+2} \end{bmatrix}, \quad \bB = \begin{bmatrix} 0 \\ 1 \end{bmatrix}. \end{gather}

You can experiment with the Acrobot dynamics in

The other model system that we will investigate here is the cart-pole system, in which the task is to balance a simple pendulum around its unstable equilibrium, using only horizontal forces on the cart. Balancing the cart-pole system is used in many introductory courses in control, including 6.003 at MIT, because it can be accomplished with simple linear control (e.g. pole placement) techniques. In this chapter we will consider the full swing-up and balance control problem, which requires a full nonlinear control treatment.

The figure shows our parameterization of the system. $x$ is the horizontal position of the cart, $\theta$ is the counter-clockwise angle of the pendulum (zero is hanging straight down). We will use $\bq = [x,\theta]^T$, and $\bx = [\bq,\dot\bq]^T$. The task is to stabilize the unstable fixed point at $\bx = [0,\pi,0,0]^T.$

The kinematics of the system are given by \begin{equation}\bx_1 = \begin{bmatrix} x \\ 0 \end{bmatrix}, \quad \bx_2 = \begin{bmatrix} x + l\sin\theta \\ -l\cos\theta \end{bmatrix}. \end{equation} The energy is given by \begin{align} T=& \frac{1}{2} (m_c + m_p)\dot{x}^2 + m_p \dot{x}\dot\theta l \cos{\theta} + \frac{1}{2}m_p l^2 \dot\theta^2 \\ U =& -m_p g l \cos\theta. \end{align} The Lagrangian yields the equations of motion: \begin{gather} (m_c + m_p)\ddot{x} + m_p l \ddot\theta \cos\theta - m_p l \dot\theta^2 \sin\theta = f \\ m_p l \ddot{x} \cos\theta + m_p l^2 \ddot\theta + m_p g l \sin\theta = 0 \end{gather} In standard, manipulator equation form: $$\bM(\bq)\ddot\bq + \bC(\bq,\dot\bq)\dot\bq = \btau_g(\bq) + \bB\bu,$$ using $\bq = [x,\theta]^T$, $\bu = f$, we have: \begin{gather*} \bM(\bq) = \begin{bmatrix} m_c + m_p & m_p l \cos\theta \\ m_p l \cos\theta & m_p l^2 \end{bmatrix}, \quad \bC(\bq,\dot{\bq}) = \begin{bmatrix} 0 & -m_p l \dot\theta \sin\theta \\ 0 & 0 \end{bmatrix}, \\ \btau_g(\bq) = \begin{bmatrix} 0 \\ - m_p gl \sin\theta \end{bmatrix}, \quad \bB = \begin{bmatrix} 1 \\ 0 \end{bmatrix} \end{gather*} In this case, it is particularly easy to solve directly for the accelerations: \begin{align} \ddot{x} =& \frac{1}{m_c + m_p \sin^2\theta}\left[ f+m_p \sin\theta (l \dot\theta^2 + g\cos\theta)\right] \label{eq:ddot_x}\\ \ddot{\theta} =& \frac{1}{l(m_c + m_p \sin^2\theta)} \left[ -f \cos\theta - m_p l \dot\theta^2 \cos\theta \sin\theta - (m_c + m_p) g \sin\theta \right] \label{eq:ddot_theta} \end{align} In some of the analysis that follows, we will study the form of the equations of motion, ignoring the details, by arbitrarily setting all constants to 1: \begin{gather} 2\ddot{x} + \ddot\theta \cos\theta - \dot\theta^2 \sin\theta = f \label{eq:simple}\\ \ddot{x}\cos\theta + \ddot\theta + \sin\theta = 0. \label{eq:simple2} \end{gather}

You can experiment with the Cart-Pole dynamics in

Quadrotors have become immensely popular over the last few years -- advances in outrunner motors from the hobby community made them powerful, light-weight, and inexpensive! They are strong enough to carry an interesting payload (e.g. of sensors for mapping / photographing the environment), but dynamic enough to move relatively quickly. The most interesting dynamics start showing up at higher speeds, when the propellors start achieving lift from the airflow across them due to horizontal motion, or when they are close enough to the ground to experience significant ground-effect, but we won't try to capture those effects (yet) here.

When the quadrotor revolution started to happen, I predicted that it would be followed quickly by a realization that fixed-wing vehicles are better for most applications. Propellors are almost optimally efficient for producing thrust -- making quadrotors very efficient for hovering -- but to be efficient in forward flight you probably want to have an airfoil. Wings are a really good idea! But I was wrong -- quadrotors have completely dominated fixed-wings for commercial UAVs. Perhaps it's only because they are easier to control? Maybe there is still hope...

We can get started with an extremely simple model of a quadrotor that is restricted to live in the plane. In that case, it actually only needs two propellors, but calling it a "birotor" doesn't have the same ring to it. The equations of motion are almost trivial, since it is only a single rigid body, and certainly fit into our standard manipulator equations: \begin{gather} m \ddot{x} = -(u_1 + u_2)\sin\theta, \label{eq:quad_x}\\ m \ddot{y} = (u_1 + u_2)\cos\theta - mg, \label{eq:quad_y}\\ I \ddot\theta = r (u_1 - u_2) \label{eq:quad_theta} \end{gather}

See `drake/examples/Quadrotor`

. Will fill in the text
description/derivation here soon! The most interesting part is that the
model needs to include the moment produced by the rotating props,
otherwise the system linearized about the hovering configuration is
actually not controllable.

For both the Acrobot and the Cart-Pole systems, we will begin by designing a linear controller which can balance the system when it begins in the vicinity of the unstable fixed point. To accomplish this, we will linearize the nonlinear equations about the fixed point, examine the controllability of this linear system, then using linear quadratic regulator (LQR) theory to design our feedback controller.

Although the equations of motion of both of these model systems are relatively tractable, the forward dynamics still involve quite a few nonlinear terms that must be considered in any linearization. Let's consider the general problem of linearizing a system described by the manipulator equations.

We can perform linearization around a fixed point, $(\bx^*, \bu^*)$, using a Taylor expansion: \begin{equation} \dot\bx = {\bf f}(\bx,\bu) \approx {\bf f}(\bx^*,\bu^*) + \left[ \pd{\bf f}{\bx}\right]_{\bx=\bx^*,\bu=\bu^*} (\bx - \bx^*) + \left[ \pd{\bf f}{\bu}\right]_{\bx=\bx^*,\bu=\bu^*} (\bu - \bu^*) \end{equation} Let us consider the specific problem of linearizing the manipulator equations around a (stable or unstable) fixed point. In this case, ${\bf f}(\bx^*,\bu^*)$ is zero, and we are left with the standard linear state-space form: \begin{align} \dot\bx =& \begin{bmatrix} \dot\bq \\ \bM^{-1}(\bq) \left[ \btau_g(\bq) + {\bf B}(\bq)\bu - \bC(\bq,\dot\bq)\dot\bq \right] \end{bmatrix},\\ \approx& {\bf A}_{lin} (\bx-\bx^*) + \bB_{lin} (\bu - \bu^*), \end{align} where ${\bf A}_{lin}$, and $\bB_{lin}$ are constant matrices. Let us define $\bar\bx = \bx - \bx^*, \bar\bu = \bu - \bu^*$, and write $$\dot{\bar\bx} = {\bf A}_{lin}\bar\bx + \bB_{lin}\bar\bu.$$ Evaluation of the Taylor expansion around a fixed point yields the following, very simple equations, given in block form by: \begin{align} {\bf A}_{lin} =& \begin{bmatrix} {\bf 0} & {\bf I} \\ \bM^{-1} \pd{\btau_g}{\bq} + \sum_{j} \bM^{-1}\pd{\bB_j}{\bq} u_j & {\bf 0} \end{bmatrix}_{\bx=\bx^*,\bu=\bu^*} \\ \bB_{lin} =& \begin{bmatrix} {\bf 0} \\ \bM^{-1} \bB \end{bmatrix}_{\bx=\bx^*, \bu=\bu^*} \end{align} where $\bB_j$ is the $j$th column of $\bB$. Note that the term involving $\pd{\bM^{-1}}{q_i}$ disappears because $\btau_g + \bB\bu - \bC\dot{\bq}$ must be zero at the fixed point. All of the $\bC\dot\bq$ derivatives drop out, too, because $\dot{\bq}^* = 0$, and any terms with $\bC$ drop out as well, since centripetal and centrifugal forces are zero at when velocity is zero. In many cases, including both the Acrobot and Cart-Pole systems (but not the Quadrotors), the matrix $\bB(\bq)$ is a constant, so the $\pd\bB{\bq}$ terms also drop out.

Linearizing around the (unstable) upright point, we have: \begin{gather} \left[\pd{\bf \tau_g}{\bq}\right]_{\bx=\bx^*} = \begin{bmatrix} g (m_1 l_{c1} + m_2 l_1 + m_2 l_{c2}) & m_2 g l_{c2} \\ m_2 g l_{c2} & m_2 g l_{c2} \end{bmatrix} \end{gather} The linear dynamics follow directly from these equations and the manipulator form of the Acrobot equations.

Linearizing around the unstable fixed point in this system, we have: \begin{gather} \left[\pd{\btau_g}{\bq}\right]_{\bx=\bx^*} = \begin{bmatrix} 0 & 0 \\ 0 & m_p g l \end{bmatrix} \end{gather} Again, the linear dynamics follow simply.

Studying the properties of the linearized system can tell us some
things about the (local) properties of the nonlinear system. For
instance, having a stable linearization implies local exponential
stability of the nonlinear system

For the linear system $$\dot{\bx} = {\bf A}\bx + \bB\bu,$$ we can integrate this linear system in closed form, so it is possible to derive the exact conditions of controllability. In particular, for linear systems it is sufficient to demonstrate that there exists a control input which drives any initial condition to the origin.

Let us first examine a special case, which falls short as a general tool but may be more useful for understanding the intuition of controllability. Let's perform an eigenvalue analysis of the system matrix ${\bf A}$, so that: $${\bf A}{\bf v}_i = \lambda_i {\bf v}_i,$$ where $\lambda_i$ is the $i$th eigenvalue, and ${\bf v}_i$ is the corresponding (right) eigenvector. There will be $n$ eigenvalues for the $n \times n$ matrix ${\bf A}$. Collecting the (column) eigenvectors into the matrix ${\bf V}$ and the eigenvalues into a diagonal matrix ${\bf \Lambda}$, we have $${\bf A}{\bf V} = {\bf V}{\bf \Lambda}.$$ Here comes our primary assumption: let us assume that each of these $n$ eigenvalues takes on a distinct value (no repeats). With this assumption, it can be shown that the eigenvectors ${\bf v}_i$ form a linearly independent basis set, and therefore ${\bf V}^{-1}$ is well-defined.

We can continue our eigenmodal analysis of the linear system by defining the modal coordinates, ${\bf r}$, with: $$\bx = {\bf V}{\bf r},\quad \text{or}\quad {\bf r} = {\bf V}^{-1}\bx.$$ In modal coordinates, the dynamics of the linear system are given by $$\dot{\bf r} = {\bf V}^{-1} {\bf A} {\bf V} {\bf r} + {\bf V}^{-1} \bB \bu = {\bf \Lambda} {\bf r} + {\bf V}^{-1}\bB \bu.$$ This illustrates the power of modal analysis; in modal coordinates, the dynamics diagonalize yielding independent linear equations: $$\dot{r}_i = \lambda_i r_i + \sum_j \beta_{ij} u_j,\quad {\bf \beta} = {\bf V}^{-1} \bB.$$

Now the concept of controllability becomes clear. Input $j$ can
influence the dynamics in modal coordinate $i$ if and only if ${\bf
\beta}_{ij} \neq 0$. In the special case of non-repeated eigenvalues,
having control over each individual eigenmode is sufficient to (in
finite time) regulate all of the eigenmodes

A more general solution to the controllability issue, which removes
our assumption about the eigenvalues, can be obtained by examining the
time-domain solution of the linear equations. The solution of this
system is $$\bx(t) = e^{{\bf A}t} \bx(0) + \int_0^{t} e^{{\bf A}(t -
\tau)} \bB \bu(\tau) d\tau.$$ Without loss of generality, lets consider
the that the final state of the system is zero. Then we have: $$\bx(0)
= - \int_0^{t_f} e^{-{\bf A}\tau}\bB \bu(\tau) d\tau.$$ You might be
wondering what we mean by $e^{{\bf A}t}$; a scalar raised to the power
of a matrix..? Recall that $e^{z}$ is actually defined by a convergent
infinite sum: $$e^{z} =1 + z + \frac{1}{2} z^2 + \frac{1}{6} z^3 + ...
.$$ The notation $e^{{\bf A}t}$ uses the same definition: $$e^{{\bf A}t}
= {\bf I} + {\bf A}t + \frac{1}{2}({\bf A}t)^2 + \frac{1}{6}({\bf A}t)^3
+ ... .$$ Not surprisingly, this has many special forms. For instance,
if ${\bf A}$ is diagonalizable, $e^{{\bf A}t} = {\bf
V}e^{{\bf\Lambda}t}{\bf V}^{-1},$ where ${\bf A} = {\bf V \Lambda
V}^{-1}$ is the eigenvalue decomposition of ${\bf A}$

Although we only treated the case of a scalar $u$, it is possible to
extend the analysis to a vector $\bu$ of size $m$, yielding the
condition $$\text{rank} \begin{bmatrix} \bB & {\bf AB} & {\bf A}^2\bB &
\cdots & {\bf A}^{n-1}\bB \end{bmatrix}_{n \times (nm)} = n.$$ In Matlab
(using the Control Systems Toolbox), you can obtain the controllability
matrix using `Cm = ctrb(A,B)`

, and evaluate its rank with
`rank(Cm)`

.

Note that a linear feedback to change the eigenvalues of the
eigenmodes is not sufficient to accomplish our goal of getting to the
goal in finite time. In fact, the open-loop control to reach the goal
is easily obtained with a final-value LQR problem, and (for ${\bf
R}={\bf I}$) is actually a simple function of the controllability
Grammian

Analysis of the controllability of both the Acrobot and Cart-Pole systems reveals that the linearized dynamics about the upright are, in fact, controllable. This implies that the linearized system, if started away from the zero state, can be returned to the zero state in finite time. This is potentially surprising - after all the systems are underactuated. For example, it is interesting and surprising that the Acrobot can balance itself in the upright position without having a shoulder motor.

The controllability of these model systems demonstrates an extremely
important, point: *An underactuated system is not necessarily an
uncontrollable system.* Underactuated systems cannot follow
arbitrary trajectories, but that does not imply that they cannot arrive
at arbitrary points in state space. However, the trajectory required to
place the system into a particular state may be arbitrarily complex.

The controllability analysis presented here is for linear time-invariant (LTI) systems. A comparable analysis exists for linear time-varying (LTV) systems. We will even see extensions to nonlinear systems; although it will often be referred to by the synonym of "reachability" analysis.

Controllability tells us that a trajectory to the fixed point exists, but does not tell us which one we should take or what control inputs cause it to occur. Why not? There are potentially infinitely many solutions. We have to pick one.

The tools for controller design in linear systems are very advanced.
In particular, as we describe in the linear
optimal control chapter, one can easily design an optimal feedback
controller for a regulation task like balancing, so long as we are
willing to linearize the system around the operating point and define
optimality in terms of a quadratic cost function: $$J(\bx_0) =
\int_0^\infty \left[ \bx^T(t) \bQ \bx(t) + \bu^T(t) \bR \bu(t) \right]dt,
\quad \bx(0)=\bx_0, \bQ=\bQ^T>0, \bR=\bR^T>0.$$ The linear feedback
matrix $\bK$ used as $$\bu(t) = - \bK\bx(t),$$ is the so-called optimal
linear quadratic regulator (LQR). Even without understanding the
detailed derivation, we can quickly become practitioners of LQR.

```
K =
LinearQuadraticRegulator(A, B, Q, R)
```

for linear
systems. It also provides a version ```
controller =
LinearQuadraticRegulator(system, context, Q, R)
```

that
will linearize the system for you around an equilibrium and return the
controller (in the original coordinates). Therefore, to use LQR, one
simply needs to define the symmetric positive-definite cost matrices,
${\bf Q}$ and ${\bf R}$. In their most common form, ${\bf Q}$ and ${\bf
R}$ are positive diagonal matrices, where the entries $Q_{ii}$ penalize
the relative errors in state variable $x_i$ compared to the other state
variables, and the entries $R_{ii}$ penalize actions in $u_i$.
Take a minute to play around with the LQR controller for the Acrobot

and Cart-Pole

Make sure that you take a minute to look at the code which runs during these examples. Can you set the ${\bf Q}$ and ${\bf R}$ matrices differently, to improve the performance?

Simulation of the closed-loop response with LQR feedback shows that the task is indeed completed - and in an impressive manner. Oftentimes the state of the system has to move violently away from the origin in order to ultimately reach the origin. Further inspection reveals the (linearized) closed-loop dynamics are in fact non-minimum phase (acrobot has 3 right-half zeros, cart-pole has 1).

LQR works essentially out of the box for Quadrotors, if linearized around a nominal fixed point (where the non-zero thrust from the propellers is balancing gravity).

For the planar quadrotor, try:

For the 3D quadrotor, try:

Note: you'll need to open a web-browser to see the 3D visualization; the instructions are printed to the console output

Note that LQR, although it is optimal for the linearized system, is
not necessarily the best linear control solution for maximizing basin of
attraction of the fixed-point. The theory of *robust
control*(e.g.,

In the introductory chapters, we made the point that the underactuated
systems are not feedback linearizable. At least not completely. Although we
cannot linearize the full dynamics of the system, it is still possible to
linearize a portion of the system dynamics. The technique is called
*partial feedback linearization*.

Consider the cart-pole example. The dynamics of the cart are affected by the motions of the pendulum. If we know the model, then it seems quite reasonable to think that we could create a feedback controller which would push the cart in exactly the way necessary to counter-act the dynamic contributions from the pendulum - thereby linearizing the cart dynamics. What we will see, which is potentially more surprising, is that we can also use a feedback controller for the cart to feedback linearize the dynamics of the passive pendulum joint.

We'll use the term *collocated* partial feedback linearization to
describe a controller which linearizes the dynamics of the actuated joints.
What's more surprising is that it is often possible to achieve
*non-collocated* partial feedback linearization - a controller which
linearizes the dynamics of the unactuated joints. The treatment presented
here follows from

Starting from the equations \ref{eq:simple} and \ref{eq:simple2}, we have \begin{gather*} \ddot\theta = -\ddot{x}c - s \\ % \ddot\theta = -\frac{1}{l} (\ddot{x} c + g s) \ddot{x}(2-c^2) - sc - \dot\theta^2 s = f \end{gather*} Therefore, applying the feedback control \begin{equation} f = (2 - c^2) \ddot{x}^d - sc - \dot\theta^2 s % f = (m_c + m_p) u + m_p (u c + g s) c - m_p l \dot\theta^2 s \end{equation} results in \begin{align*} \ddot{x} =& \ddot{x}^d \\ \ddot{\theta} =& -\ddot{x}^dc - s, \end{align*} which are valid globally.

Starting again from equations \ref{eq:simple} and \ref{eq:simple2}, we have \begin{gather*} \ddot{x} = -\frac{\ddot\theta + s}{c} \\ \ddot\theta(c - \frac{2}{c}) - 2 \tan\theta - \dot\theta^2 s = f \end{gather*} Applying the feedback control \begin{equation} f = (c - \frac{2}{c}) \ddot\theta^d - 2 \tan\theta - \dot\theta^2 s \end{equation} results in \begin{align*} \ddot\theta =& \ddot\theta^d \\ \ddot{x} =& -\frac{1}{c} \ddot\theta^d - \tan\theta. \end{align*} Note that this expression is only valid when $\cos\theta \neq 0$. This is not surprising, as we know that the force cannot create a torque when the beam is perfectly horizontal.

For systems that are trivially underactuated (torques on some joints, no torques on other joints), we can, without loss of generality, reorganize the joint coordinates in any underactuated system described by the manipulator equations into the form: \begin{align} \bM_{11} \ddot{\bq}_1 + \bM_{12} \ddot{\bq}_2 &= \btau_1, \label{eq:passive_dyn}\\ \bM_{21} \ddot{\bq}_1 + \bM_{22} \ddot{\bq}_2 &= \btau_2 + \bu, \label{eq:active_dyn} \end{align} with $\bq \in \Re^n$, $\bq_1 \in \Re^l$, $\bq_2 \in \Re^m$, $l=n-m$. $\bq_1$ represents all of the passive joints, and $\bq_2$ represents all of the actuated joints, and the $\btau = \btau_g - \bC\dot\bq$ terms capture all of the Coriolis and gravitational terms, and $$\bM(\bq) = \begin{bmatrix} \bM_{11} & \bM_{12} \\ \bM_{21} & \bM_{22} \end{bmatrix}.$$ Fortunately, because $\bM$ is uniformly (e.g. for all $\bq$) positive definite, $\bM_{11}$ and $\bM_{22}$ are also positive definite.

Performing the same substitutions into the full manipulator
equations, we get:
\begin{gather}
\ddot\bq_1 = \bM_{11}^{-1} \left[ \btau_1 - \bM_{12} \ddot\bq_2
\right] \\
(\bM_{22} - \bM_{21} \bM_{11}^{-1} \bM_{12})
\ddot\bq_2 - \btau_2 + \bM_{21} \bM_{11}^{-1} \btau_1 = \bu
\end{gather}
It can be easily shown that the matrix $(\bM_{22} - \bM_{21}
\bM_{11}^{-1} \bM_{12})$ is invertible

\begin{gather} \ddot\bq_2 = \bM_{12}^+ \left[ \btau_1 - \bM_{11} \ddot\bq_1 \right] \\ (\bM_{21} - \bM_{22} \bM_{12}^+ \bM_{11}) \ddot\bq_1 - \btau_2 + \bM_{22} \bM_{12}^+ \btau_1 = \bu \end{gather} where $\bM_{12}^+$ is a Moore-Penrose pseudo-inverse. This inverse provides a unique solution when the rank of $\bM_{12}$ equals $l$, the number of passive degrees of freedom in the system (it cannot be more, since the matrix only has $l$ rows). This rank condition is sometimes called the property of "Strong Inertial Coupling". It is state dependent. A system has Global Strong Inertial Coupling if it exhibits Strong Inertial Coupling in every state.

In general, we can define some combination of active and passive
joints that we would like to control. This combination is sometimes
called a "task space".

If the actuated joints are commanded so that \begin{equation} \ddot\bq_2 = \bar\bH^+ \left [\ddot\by^d - \dot\bH\dot\bq - \bH_1 \bM_{11}^{-1}\btau_1 \right], \label{eq:q2cmd} \end{equation} where $\bar{\bH} = \bH_2 - \bH_1 \bM_{11}^{-1} \bM_{12}.$ and $\bar\bH^+$ is the right Moore-Penrose pseudo-inverse, $$\bar\bH^+ = \bar\bH^T (\bar\bH \bar\bH^T)^{-1},$$ then we have \begin{equation} \ddot\by = \ddot\by^d.\end{equation} subject to \begin{equation}\text{rank}\left(\bar{\bH} \right) = p, \label{eq:rank_condition}\end{equation}

**Proof Sketch.**
Differentiating the output function we have
\begin{gather*}
\dot\by = \bH \dot\bq \\
\ddot\by = \dot\bH \dot\bq + \bH_1 \ddot\bq_1 + \bH_2 \ddot\bq_2.
\end{gather*}
Solving \ref{eq:passive_dyn} for the dynamics of the unactuated joints
we have:
\begin{equation}
\ddot\bq_1 = \bM_{11}^{-1} (\btau_1 - \bM_{12} \ddot\bq_2) \label{eq:q1cmd}
\end{equation}
Substituting, we have
\begin{align}
\ddot\by =& \dot\bH \dot\bq + \bH_1 \bM_{11}^{-1}(\btau_1 -
\bM_{12}\ddot\bq_2) + \bH_2 \ddot\bq_2 \\
=& \dot\bH \dot\bq + \bar{\bH} \ddot\bq_2 + \bH_1 \bM_{11}^{-1}\btau_1 \\
=& \ddot\by^d
\end{align}
Note that the last line required the rank condition
($\ref{eq:rank_condition}$) on $\bar\bH$ to ensure that the rows
of $\bar{\bH}$ are linearly independent, allowing $\bar\bH
\bar\bH^+ = \bI$.

In order to execute a task space trajectory one could command
$$\ddot\by^d = \ddot{\bar{\by}}^d + \bK_d (\dot{\bar{\by}}^d - \dot\by) +
\bK_p (\bar{\by}^d -\by).$$ Assuming the internal dynamics are stable,
this yields converging error dynamics, $(\bar{\by}^d - \by)$, when
$\bK_p,\bK_d > 0$

Consider the task of trying to track a desired kinematic trajectory with the endpoint of pendulum in the cart-pole system. With one actuator and kinematic constraints, we might be hard-pressed to track a trajectory in both the horizontal and vertical coordinates. But we can at least try to track a trajectory in the vertical position of the end-effector.

Using the task-space PFL derivation, we have: \begin{gather*} y = h(\bq) = -l \cos\theta \\ \dot{y} = l \dot\theta \sin\theta \end{gather*} If we define a desired trajectory: $$\bar{y}^d(t) = \frac{l}{2} + \frac{l}{4} \sin(t),$$ then the task-space controller is easily implemented using the derivation above.

The task space derivation above provides a convenient generalization
of the partial feedback linearization (PFL)

If we choose ${\bf y} = \bq_1$ (non-collocated), we have $$\bH_1 =
\bI, \bH_2 = 0, \dot\bH = 0, \bar{\bH}=-\bM_{11}^{-1}\bM_{12}.$$ The
rank condition ($\ref{eq:rank_condition}$) requires that
$\text{rank}(\bM_{12}) = l$, in which case we can write
$\bar{\bH}^+=-\bM_{12}^+\bM_{11}$, reducing the rank condition to
precisely the "Strong Inertial Coupling" condition described in

Recall the phase portraits that we used to understand the dynamics of
the undamped, unactuated, simple pendulum ($u=b=0$). The orbits of this
phase plot were defined by contours of constant energy. One very special
orbit, known as a *homoclinic* orbit, is the orbit which passes
through the unstable fixed point. In fact, visual inspection will reveal
that any state that lies on this homoclinic orbit must pass into the
unstable fixed point. Therefore, if we seek to design a nonlinear
feedback control policy which drives the simple pendulum from any initial
condition to the unstable fixed point, a very reasonable strategy would be
to use actuation to regulate the energy of the pendulum to place it on
this homoclinic orbit, then allow the system dynamics to carry us to the
unstable fixed point.

This idea turns out to be a bit more general than just for the simple pendulum. As we will see, we can use similar concepts of `energy shaping' to produce swing-up controllers for the acrobot and cart-pole systems. It's important to note that it only takes one actuator to change the total energy of a system.

Although a large variety of swing-up controllers have been proposed
for these model systems (c.f.

Recall the equations of motion for the undamped simple pendulum were given by $$ml^2 \ddot\theta + mgl\sin\theta = u.$$ The total energy of the simple pendulum is given by $$E = \frac{1}{2} m l^2 \dot\theta^2 - mgl\cos\theta.$$ To understand how to control the energy, observe that \begin{align*} \dot{E} =& ml^2\dot\theta \ddot\theta + \dot\theta mgl\sin\theta \\ =& \dot\theta \left[ u - mgl\sin\theta \right] + \dot\theta mgl\sin\theta \\ =& u\dot\theta. \end{align*} In words, adding energy to the system is simple - simply apply torque in the same direction as $\dot\theta$. To remove energy, simply apply torque in the opposite direction (e.g., damping).

To drive the system to the homoclinic orbit, we must regulate the energy of the system to a particular desired energy, $$E^d = mgl.$$ If we define $\tilde{E} = E - E^d$, then we have $$\dot{\tilde{E}} = \dot{E} = u\dot\theta.$$ If we apply a feedback controller of the form $$u = -k \dot\theta \tilde{E},\quad k>0,$$ then the resulting error dynamics are $$\dot{\tilde{E}} = - k \dot\theta^2 \tilde{E}.$$ These error dynamics imply an exponential convergence: $$\tilde{E} \rightarrow 0,$$ except for states where $\dot\theta=0$. The essential property is that when $E > E^d$, we should remove energy from the system (damping) and when $E < E^d$, we should add energy (negative damping). Even if the control actions are bounded, the convergence is easily preserved.

This is a nonlinear controller that will push all system trajectories
to the unstable equilibrium. But does it make the unstable equilibrium
locally stable? *It is asymptotically stable, but not stable
i.s.L.* Small perturbations may cause the system to drive all of the
way around the circle in order to once again return to the unstable
equilibrium. For this reason, once trajectories come into the vicinity of
our swing-up controller, we prefer to switch to our LQR balancing
controller to performance to complete the task.

Take a minute to play around with the energy shaping + LQR controller for swinging up the pendulum

Make sure that you take a minute to look at the code which runs during these examples. Note the somewhat arbitrary threshold for switching to the LQR controller. We'll give a much more satisfying answer for that in the chapter on Lyapunov methods.

Having thought about the swing-up problem for the simple pendulum,
let's try to apply the same ideas to the cart-pole system. The basic
idea, from *of the pendulum* (a unit mass, unit length, simple
pendulum in unit gravity) is given by: $$E(\bx) = \frac{1}{2} \dot\theta^2
- \cos\theta.$$
The desired energy, equivalent to the energy at the desired fixed-point,
is $$E^d = 1.$$ Again defining $\tilde{E}(\bx) =
E(\bx) - E^d$, we now observe that
\begin{align*}
\dot{\tilde{E}}(\bx) =& \dot{E}(\bx) = \dot\theta \ddot\theta +
\dot\theta s \\
% \dot\tilde{E} = ml^2 \dot\theta \ddot\theta + mgl s
=& \dot\theta [ -uc - s] + \dot\theta s \\
% - ml \dot\theta [ u c + g s ] + mgl s
=& - u\dot\theta \cos\theta.
% - ml u \dot\theta c
\end{align*}
Therefore, if we design a controller of the form $$u = k
\dot\theta\cos\theta \tilde{E},\quad k>0$$ the result is $$\dot{\tilde{E}}
= - k \dot\theta^2 \cos^2\theta \tilde{E}.$$ This
guarantees that $| \tilde{E} |$ is non-increasing, but isn't quite enough
to guarantee that it will go to zero. For example, if $\theta =
\dot\theta = 0$, the system will never move. However, if we have that
$$\int_0^t \dot\theta^2(t') \cos^2 \theta(t') dt' \rightarrow \infty,\quad
\text{as } t \rightarrow \infty,$$ then we have $\tilde{E}(t) \rightarrow
0$. This condition is satisfied for all but the trivial constant
trajectories at fixed points.

Now we return to the full system dynamics (which includes the cart).

Swing-up control for the acrobot can be accomplished in much the same
way.

Use collocated PFL. ($\ddot{q}_2 = \ddot{q}_2^d$). $$E(\bx) = \frac{1}{2}\dot\bq^T\bM\dot{\bq} + U(\bx).$$ $$ E_d = U(\bx^*).$$ $$\bar{u} = \dot{q}_1 \tilde{E}.$$ $$\ddot{q}_2^d = - k_1 q_2 - k_2 \dot{q}_2 + k_3 \bar{u},$$

Extra PD terms prevent proof of asymptotic convergence to homoclinic
orbit. Proof of another energy-based controller in

The energy shaping controller for swing-up presented here is a pretty faithful representative from the field of nonlinear underactuated control. Typically these control derivations require some clever tricks for simplifying or canceling out terms in the nonlinear equations, then some clever Lyapunov function to prove stability. In these cases, PFL was used to simplify the equations, and therefore the controller design.

These controllers are important, representative, and relevant. But clever tricks with nonlinear equations seem to be fundamentally limited. Most of the rest of the material presented in this book will emphasize more general computational approaches to formulating and solving these and other control problems.

Although the "swing up and balance" task is quite interesting for the acrobot and cart-pole systems, quadrotors tend to inspire a slightly different question. First of all, stabilizing a hovering fixed point is relatively easy on a quadrotor -- even our LQR controller will have an extremely large region of attraction. But people also tend to want to do more with their quadrotors than just have them hover in place. The task we'll consider here is trajectory planning: how can you find a feasible trajectory through state space for the quadrotor, even if there are obstacles to avoid that are only known at runtime?

Trajectory design, and especially trajectory optimization, is a big idea that we will explore more thoroughly later in the text. But there is one idea that I would like to present here, because in addition to being a very satisfying solution for quadrotors, it is philosophically quite close to the idea of partial feedback linearization. That idea is called differential flatness.

One of the most important lessons from partial feedback linearization, is the idea that if you have $m$ actuators, then you basically get to control exactly $m$ quantities of your system. With the "task space" partial feedback linearization, we showed that these can be chosen in a fairly general way. Of course there are some requirements -- captured above by the rank condition which checks for "inertial coupling". Differential flatness exploits this same idea (choosing $m$ outputs), but in the opposite direction. The idea is that, for some systems, if you give me a trajectory in those $m$ coordinates, it may in fact dictate what all of the states/actuators must have been doing. The fact that you can only execute a subset of possible trajectories can, in this case, make trajectory planning much easier!

Let's start with an example...

The planar quadrotor model described above has 3 degrees of freedom ($x,y,\theta$) but only two actuators (one for each propellor). My claim is that, if you give me a trajectory for just the location of the center of mass: $x(t), y(t), \forall t \in [t_0, t_f],$ then I will be able to infer what $\theta(t)$ must be over that same interval in order to be feasible. Furthermore, I can even infer $\bu(t)$. There is one technical condition required: the trajectory you give me for $x(t)$ and $y(t)$ needs to be continuously differentiable (four times).

To see this, first observe that from ($\ref{eq:quad_x}$) and ($\ref{eq:quad_y}$) we have $$\frac{-m \ddot{x}}{ m \ddot{y} + mg} = \frac{(u_1 + u_2)\sin\theta}{(u_1+u_2)\cos\theta} = \tan\theta.$$ In words, given $\ddot{x}(t)$ and $\ddot{y}(t)$, I can solve for $\theta$(t). I can differentiate this relationship (in time) twice more to obtain $\ddot\theta$. Using ($\ref{eq:quad_theta}$) with ($\ref{eq:quad_x}$) or ($\ref{eq:quad_y}$) give us two linear equations for $u_1$ and $u_2$ which are easily solved.

Now you can see why we need the original trajectories to be smooth -- the solution to $\bu(t)$ depends on $\ddot\theta(t)$ which depends on $\frac{d^4 x(t)}{dt^4}$ and $\frac{d^4 y(t)}{dt^4}$; we need those derivatives to exist along the entire trajectory.

What's more -- if we ignore input limits for a moment -- *any*
sufficiently smooth trajectory of $x(t), y(t)$ is feasible, so if I can
simply find one that avoids the obstacles, then I have designed my state
trajectory. As we will see, optimizing even high-degree
piecewise-polynomials is actually an easy problem (it works out to be a
quadratic program), assuming the constraints are convex. In practice,
this means that once you have determined whether to go left or right
around the obstacles, trajectory design is easy and fast.

I've coded up a simple example of that for you here:

The example above demonstrates that the planar quadrotor system is
differentially flat in the outputs $x(t)$, $y(t)$. More generally, if we
have a dynamical system $$\dot{\bx} = f(\bx, \bu),$$ and we design some
output coordinates (essentially "task-space"): $$\bz(t) = h\left(\bx, \bu,
\frac{d\bu}{dt}, ..., \frac{d^k\bu}{dt^k}\right),$$ such that
\begin{gather*} \bx (t) = \bx\left(\bz, \frac{d\bz}{dt}, ...,
\frac{d^k\bz}{dt^k}\right), \\ \bu (t) = \bu\left(\bz, \frac{d\bz}{dt}, ..,
\frac{d^k\bz}{dt^k}\right),\end{gather*} then we say that the system $f$ is
differentially flat in the outputs $\bz$

I'm not aware of any numerical recipes for showing that a system is differentially flat nor for finding potential flat outputs, but I admit I haven't worked on it nor looked for those recipes. That would be interesting! I think of differential flatness as a property that one must find for your system -- typically via a little mechanical intuition and a lot of algebra. Once found, it can be very useful.

Probably the most famous example of differential flatness is on
the full 3D quadrotors.

A few things to note about these examples, just so we also understand the limitations. First, the technique above is great for designing trajectories, but additional work is required to stabilizing those trajectories (we'll cover that topic in more detail later in the notes). Trajectory stabilization benefits greatly from good state estimation, and the examples above were all performed in a motion capture arena. Also, the "simple" version of the trajectory design that is fast enough to be used for flying through moving hoops is restricted to convex optimization formulations -- which means one has to hard-code apriori the decisions about whether to go left or right / up or down around each obstacle.

The acrobot and cart-pole systems are just two of the model systems used heavily in underactuated control research. Other examples include:

- Pendubot
- Inertia wheel pendulum
- Furuta pendulum (horizontal rotation and vertical pendulum)
- Hovercraft

Practical legged locomotion is one of the fundamental problems in robotics; we've seen amazing progress over the last few years, but there are still some fundamental problems. Much of the recent progress is due to improvements in hardware -- a legged robot must carry all of its sensors, actuators and power and traditionally this meant underpowered motors that act as far-from-ideal force/torque sources -- but Boston Dynamics and other companies have made incredible progress here. The control systems implemented on these systems, though, are still surprisingly heuristic -- they require dramatically higher bandwidth and lower latency that the human motor control system and still perform worse in challenging environments. I may be biased, but I think a lot of the work left to do involves thinking about underactuated control.

In this chapter we'll introduce some of the simple models of walking and robots, the control problems that result, and a very brief summary of some of the control solutions described in the literature. Compared to the robots that we have studied so far, our investigations of legged locomotion will require additional tools for thinking about limit cycle dynamics and dealing with impacts.

In many of the systems that we have studied so far, we have analyzed the stability of a fixed-point, or even an (infinite-horizon) trajectory. For walking systems the natural equivalent is to talk about the stability of periodic solutions -- a fixed "gait" is a cycle that repeats footstep after footstep. So we begin our discussion with a discussion of the stability of a cycle.

A limit cycle is an asymptotically stable or unstable periodic
orbit†

$$\ddot{q} + \mu (q^2 - 1)\dot{q} + q = 0, \quad \mu>0$$ One can think of this system as almost a simple spring-mass-damper system, except that it has nonlinear damping. In particular, the velocity term dissipates energy when $|q|>1$, and adds energy when $|q|<1$. Therefore, it is not terribly surprising to see that the system settles into a stable oscillation from almost any initial conditions (the exception is the state $q=0,\dot{q}=0$). This can be seen nicely in the phase portrait below.

However, as you can see from the plot of system trajectories in the time domain, then a slightly different picture emerges. Although the phase portrait clearly reveals that all trajectories converge to the same orbit, the time domain plot reveals that these trajectories do not necessarily synchronize in time.

The Van der Pol oscillator clearly demonstrates what we would think of as a stable limit cycle, but also exposes the subtlety in defining this limit cycle stability. Neighboring trajectories do not necessarily converge on a stable limit cycle. In contrast, defining the stability of a particular trajectory (parameterized by time) is relatively easy.

Let's make a few quick points about the existence of closed-orbits. If we
can define a closed region of phase space which does not contain any fixed
points, then it must contain a closed-orbit

One definition for the stability of a limit cycle uses the method of
Poincaré. Let's consider an $n$ dimensional dynamical system,
$\dot{\bx} = {\bf f}(\bx).$ Define an $n-1$ dimensional *surface of
section*, $S$. We will also require that $S$ is transverse to the
flow (i.e., all trajectories starting on $S$ flow through $S$, not
parallel to it). The Poincaré map (or return map) is a mapping from
$S$ to itself: $$\bx_p[n+1] = {\bf P}(\bx_p[n]),$$ where $\bx_p[n]$ is
the state of the system at the $n$th crossing of the surface of section.
Note that we will use the notation $\bx_p$ to distinguish the state of the
discrete-time system from the continuous time system; they are related by
$\bx_p[n] = \bx(t_c[n])$, where $t_c[n]$ is the time of the $n$th crossing
of $S$.

Since the full system is two dimensional, the return map dynamics are one dimensional. One dimensional maps, like one dimensional flows, are amenable to graphical analysis. To define a Poincaré section for the Van der Pol oscillator, let $S$ be the line segment where $q = 0$, $\dot{q} > 0$.

If ${\bf P}(\bx_p)$ exists for all $\bx_p \in S$, then this method turns the stability analysis for a limit cycle into the stability analysis of a fixed point on a discrete map. In practice it is often difficult or impossible to find ${\bf P}$ analytically, but it can be obtained quite reasonably numerically. Once ${\bf P}$ is obtained, we can infer local limit cycle stability with an eigenvalue analysis. There will always be a single eigenvalue of 0 - corresponding to perturbations along the limit cycle which do not change the state of first return. The limit cycle is considered locally exponentially stable if all remaining eigenvalues, $\lambda_i$, have magnitude less than one, $|\lambda_i|<1$.

In fact, it is often possible to infer more global stability properties
of the return map by examining ${\bf P}$. *unimodal*
maps.

A particularly graphical method for understanding the dynamics of a one-dimensional iterated map is with the staircase method. Sketch the Poincaré map and also the line of slope one. Fixed points are the crossings with the unity line. Asymptotically stable if $|\lambda| < 1$. Unlike one dimensional flows, one dimensional maps can have oscillations (happens whenever $\lambda < 0$).

One of the earliest models of walking was proposed by
McMahon

McGeer

The most elementary model of passive dynamic walking, first used in the
context of walking by

- Collisions with ground are inelastic and impulsive (only angular momentum is conserved around the point of collision).
- The stance foot acts as a pin joint and does not slip.
- The transfer of support at the time of contact is instantaneous (no double support phase).
- $0 \le \gamma < \frac{\pi}{2}$, $0 < \alpha < \frac{\pi}{2}$, $l > 0$.

Note that the coordinate system used here is slightly different than for the simple pendulum ($\theta=0$ is at the top, and the sign of $\theta$ has changed).

The most comprehensive analysis of the rimless wheel was done by

The dynamics of the system when one leg is on the ground are given by $$\ddot\theta = \frac{g}{l}\sin(\theta).$$ If we assume that the system is started in a configuration directly after a transfer of support ($\theta(0^+) = \gamma-\alpha$), then forward walking occurs when the system has an initial velocity, $\dot\theta(0^+) > \omega_1$, where $$\omega_1 = \sqrt{ 2 \frac{g}{l} \left[ 1 - \cos \left (\gamma-\alpha \right) \right]}.$$ $\omega_1$ is the threshold at which the system has enough kinetic energy to vault the mass over the stance leg and take a step. This threshold is zero for $\gamma = \alpha$ and does not exist for $\gamma > \alpha$. The next foot touches down when $\theta(t) = \gamma+\alpha$, at which point the conversion of potential energy into kinetic energy yields the velocity $$\dot\theta(t^-) = \sqrt{ \dot\theta^2(0^+) + 4\frac{g}{l} \sin\alpha \sin\gamma }.$$ $t^-$ denotes the time immediately before the collision.

The angular momentum around the point of collision at time $t$ just before the next foot collides with the ground is $$L(t^-) = -ml^2\dot\theta(t^-) \cos(2\alpha).$$ The angular momentum at the same point immediately after the collision is $$L(t^+) = -ml^2\dot\theta(t^+).$$ Assuming angular momentum is conserved, this collision causes an instantaneous loss of velocity: $$\dot\theta(t^+) = \dot\theta(t^-) \cos(2\alpha).$$

Given the stance dynamics, the collision detection logic ($\theta = \gamma \pm \alpha$), and the collision update, we can numerically simulate this simple model. Doing so reveals something remarkable... the wheel starts rolling, but then one of two things happens: it either runs out of energy and stands still, or it quickly falls into a stable periodic solution. Convergence to this stable periodic solution appears to happen from a huge range of initial conditions. Try it yourself.

Optionally you can specify the initial conditions. Here are a few good values to try:

One of the fantastic things about the rimless wheel is that the dynamics are exactly the undamped simple pendulum that we've thought so much about. So you will recognize the phase portraits of the system below - they are centered around the unstable fixed point of the simple pendulum. Phase plots of the trajectories from the various initial conditions recommended above are plotted below.

Take a minute to make sure you can follow these trajectories through state space for each of the initial conditions. It's non-trivial, but worth the effort.

We can now derive the angular velocity at the beginning of each
stance phase as a function of the angular velocity of the previous
stance phase. First, we will handle the case where $\gamma \le \alpha$
and $\dot\theta_n^+ > \omega_1$. The "step-to-step return map",
factoring losses from a single collision, the resulting map is:
$$\dot\theta^{+}_{n+1} = \cos(2\alpha) \sqrt{ ({\dot\theta_n}^{+})^{2} +
4\frac{g}{l}\sin\alpha \sin\gamma}.$$ where the $\dot{\theta}^{+}$
indicates the velocity just *after* the energy loss at impact has
occurred.

Using the same analysis for the remaining cases, we can complete the return map. The threshold for taking a step in the opposite direction is $$\omega_2 = - \sqrt{2 \frac{g}{l} [1 - \cos(\alpha + \gamma)]}.$$ For $\omega_2 < \dot\theta_n^{+} < \omega_1,$ we have $$\dot\theta_{n+1}^{+} = -\dot\theta_n^{+} \cos(2\alpha).$$ Finally, for $\dot\theta_n^{+} < \omega_2$, we have $$\dot\theta_{n+1}^{+} = - \cos(2\alpha)\sqrt{(\dot\theta_n^{+})^2 - 4 \frac{g}{l} \sin\alpha \sin\gamma}.$$ Altogether, we have (for $\gamma \le \alpha$) $$\dot\theta_{n+1}^{+} = \begin{cases} \cos(2\alpha) \sqrt{(\dot\theta_n^{+})^2 + 4 \frac{g}{l} \sin\alpha \sin\gamma}, & \text{ } \omega_1 < \dot\theta_n^{+} \\ -\dot\theta_n^{+} \cos(2\alpha), & \text{ } \omega_2 < \dot\theta_n^{+} < \omega_1 \\ -\cos(2\alpha) \sqrt{(\dot\theta_n^{+})^2 - 4\frac{g}{l} \sin\alpha \sin\gamma}, & \dot\theta_n^{+} < w_2 \end{cases}.$$

Notice that the return map is undefined for $\dot\theta_n = \{ \omega_1, \omega_2 \}$, because from these configurations, the wheel will end up in the (unstable) equilibrium point where $\theta = 0$ and $\dot\theta = 0$, and will therefore never return to the map.

This return map blends smoothly into the case where $\gamma > \alpha$. In this regime, $$\dot\theta_{n+1}^{+} = \begin{cases} \cos(2\alpha) \sqrt{(\dot\theta_n^{+})^2 + 4 \frac{g}{l} \sin\alpha \sin\gamma}, & \text{ } 0 \le \dot\theta_n^{+} \\ -\dot\theta_n^{+} \cos(2\alpha), & \text{ } \omega_2 < \dot\theta_n^{+} < 0 \\ -\cos(2\alpha) \sqrt{(\dot\theta_n^{+})^2 - 4\frac{g}{l} \sin\alpha \sin\gamma}, & \dot\theta_n^{+} \le w_2 \end{cases}.$$ Notice that the formerly undefined points at $\{\omega_1,\omega_2\}$ are now well-defined transitions with $\omega_1 = 0$, because it is kinematically impossible to have the wheel statically balancing on a single leg.

For a fixed point, we require that $\dot\theta_{n+1}^{+} = \dot\theta_n^{+} = \omega^*$. Our system has two possible fixed points, depending on the parameters: $$ \omega_{stand}^* = 0,~~~ \omega_{roll}^* = \cot(2 \alpha) \sqrt{4 \frac{g}{l} \sin\alpha\sin\gamma}.$$ The limit cycle plotted above illustrates a state-space trajectory in the vicinity of the rolling fixed point. $\omega_{stand}^*$ is a fixed point whenever $\gamma < \alpha$. $\omega_{roll}^*$ is a fixed point whenever $\omega_{roll}^* > \omega_1$. It is interesting to view these bifurcations in terms of $\gamma$. For small $\gamma$, $\omega_{stand}$ is the only fixed point, because energy lost from collisions with the ground is not compensated for by gravity. As we increase $\gamma$, we obtain a stable rolling solution, where the collisions with the ground exactly balance the conversion of gravitational potential to kinetic energy. As we increase $\gamma$ further to $\gamma > \alpha$, it becomes impossible for the center of mass of the wheel to be inside the support polygon, making standing an unstable configuration.

The rimless wheel models only the dynamics of the stance leg, and simply assumes that there will always be a swing leg in position at the time of collision. To remove this assumption, we take away all but two of the spokes, and place a pin joint at the hip. To model the dynamics of swing, we add point masses to each of the legs. For actuation, we first consider the case where there is a torque source at the hip - resulting in swing dynamics equivalent to an Acrobot (although in a different coordinate frame).

In addition to the modeling assumptions used for the rimless wheel, we
also assume that the swing leg retracts in order to clear the ground
without disturbing the position of the mass of that leg. This model, known
as the compass gait, is well studied in the literature using numerical
methods

The state of this robot can be described by 4 variables: $\theta_{st},\theta_{sw},\dot\theta_{st}$, and $\dot\theta_{sw}$. The abbreviation $st$ is shorthand for the stance leg and $sw$ for the swing leg. Using $\bq = [ \theta_{st}, \theta_{sw} ]^T$ and $\bu = \tau$, we can write the dynamics as $$\bM(\bq)\ddot\bq + \bC(\bq,\dot\bq)\dot\bq = \btau_g(q) + \bB\bu,$$ with \begin{gather*} \bM = \begin{bmatrix} (m_h+m)l^2 + ma^2 & -mlb\cos(\theta_{st}-\theta_{sw}) \\ -mlb\cos(\theta_{st}-\theta_{sw}) & mb^2 \end{bmatrix}\\ \bC = \begin{bmatrix} 0 & -mlb\sin(\theta_{st}-\theta_{sw})\dot\theta_{sw} \\ mlb\sin(\theta_{st}-\theta_{sw})\dot\theta_{st} & 0 \end{bmatrix} \\ \btau_g(q) = \begin{bmatrix} (m_hl + ma + ml)g\sin(\theta_{st}) \\ -mbg\sin(\theta_{sw}) \end{bmatrix},\\ \bB = \begin{bmatrix} -1 \\ 1 \end{bmatrix} \end{gather*} and $l=a+b$.

The foot collision is an instantaneous change of velocity caused by a impulsive force at the foot that brings the foot to rest. The update to the velocities can be calculated using the derivation for impulsive collisions derived in the appendix. To use it, we proceed with the following steps:

- Add a "floating base" to the system by adding a free (x,y) joint at the pre-impact stance toe, $\bq_{fb} = [x,y,\theta_{st},\theta_{sw}]^T.$
- Calculate the mass matrix for this new system, call it $\bM_{fb}$.
- Write the position of the (pre-impact) swing toe as a function $\bphi(\bq_{fb})$. Define the Jacobian of this function: $\bJ = \frac{\partial \bphi}{\partial \bq_{fb}}.$
- The post-impact velocity that ensures that $\dot\bphi = 0$ after the collision is given by $$\dot\bq^+ = \left[ \bI - \bM_{fb}^{-1}\bJ^T[\bJ\bM_{fb}^{-1}\bJ^T]^{-1}\bJ\right] \dot\bq^-,$$ noting that $\dot{x}^- = \dot{y}^- = 0$, and that you need only read out the last two elements of $\dot{\bq}^+$. The velocity of the post-impact stance foot will be zero by definition, and the new velocity of the pre-impact stance foot can be completely determined from the minimal coordinates.
- Switch the stance and swing leg positions and velocities.

Numerical integration of these equations reveals a stable limit cycle,
plotted below. Notice that when the left leg is in stance (top part of
the plot), the trajectory quite resembles the familiar pendulum dynamics
of the rimless wheel. But this time, when the leg impacts, it takes a
long arc around as the swing leg before returning to stance. The impacts
are also clearly visible as discontinuous changes (decreases) in velocity.
The dependence of this limit cycle on the system parameters has been
studied extensively in

The basin of attraction of the stable limit cycle is a narrow band of states surrounding the steady state trajectory. Although the simplicity of this model makes it analytically attractive, this lack of stability makes it difficult to implement on a physical device.

To achieve a more anthropomorphic gait, as well as to acquire better
foot clearance and ability to walk on rough terrain, we want to model a
walker that includes knee

At the beginning of each step, the system is modeled as a three-link
pendulum, like the ballistic
walker

After this collision, the knee is locked and we switch to the compass gait model with a different mass distribution. In other words, the system becomes a two-link pendulum. Again, the heel-strike is modeled as an inelastic collision. After the collision, the legs switch instantaneously. After heel-strike then, we switch back to the ballistic walker's three-link pendulum dynamics. This describes a full step cycle of the kneed walker, which is shown above.

By switching between the dynamics of the continuous three-link and two-link pendulums with the two discrete collision events, we characterize a full point-feed kneed walker walking cycle. After finding appropriate parameters for masses and link lengths, a stable gait is found. As with the compass gait's limit cycle, there is a swing phase (top) and a stance phase (bottom). In addition to the two heel-strikes, there are two more instantaneous velocity changes from the knee-strikes as marked in the figure. This limit cycle is traversed clockwise.

The model is a point mass, $m$, on top of a massless, springy leg with rest length of $l_0$, and spring constant $k$. The state of the system is given by the $x,y$ position of the center of mass, and the length, $l$, and angle $\theta$ of the leg. Like the rimless wheel, the dynamics are modeled piecewise - with one dynamics governing the flight phase, and another governing the stance phase.

Flight Phase. State variables: $\bx = [x,y,\dot{x},\dot{y}]^T.$ Dynamics are $$\dot{\bx} = \begin{bmatrix} \dot{x} \\ \dot{y} \\ 0 \\ - g \end{bmatrix}.$$

Stance Phase. State variables: $\bx = [r, \theta, \dot{r}, \dot\theta]^T.$ Kinematics are $$x = \begin{bmatrix} - r \sin\theta \\ r \cos\theta \end{bmatrix},$$ Energy is given by $$T = \frac{m}{2} (\dot{r}^2 + r^2 \dot\theta^2 ), \quad U = mgr\cos\theta + \frac{k}{2}(r_0 - r)^2.$$ Plugging these into Lagrange yields: \begin{gather} m \ddot{r} - m r \dot\theta^2 + m g \cos\theta - k (r_0 - r) = 0 \\ m r^2 \ddot{\theta} + 2mr\dot{r}\dot\theta - mgr \sin\theta = 0 \end{gather}

The passive dynamic walkers and hopping robots described in the last chapter capture the fundamental dynamics of legged locomotion -- dynamics which are fundamentally nonlinear and punctuated by impulsive events due to making and breaking contact with the environment. But if you start reading the literature on humanoid robots, or many-legged robots like quadrupeds, then you will find a quite different set of ideas taking center stage: ideas like the "zero-moment point" and footstep planning. The goal of this chapter is to penetrate that world.

I'd like to start the discussion with a model that might seem quite far from the world of legged robots, but I think it's a very useful way to think about the problem.

Imagine you have a flying vehicle modeled as a single rigid body in a gravity field with some number of force "thrusters" attached. We'll describe the configuration of the vehicle by its orientation, $\theta$ and the location of its center of mass $(x,z)$. The vector from the center of mass to thruster $i$ is given by $r_i$, yielding the equations of motion: \begin{align*} m\ddot{x} =& \sum_i F_{i,x},\\ m\ddot{z} =& \sum_i F_{i,z} - mg,\\ I\ddot\theta =& \sum_i \left[ r_i \times F_i \right]_y, \end{align*} where I've used $F_{i,x}$ to represent the component of the force in the $x$ direction and have taken just the $y$-axis component of the cross product on the last line.

Our goal is to move the hovercraft to a desired position/orientation and to keep it there. If we take the inputs to be $F_i$, then the dynamics are affine (linear plus a constant term). As such, we can stabilize a stabilizable fixed point using a change of coordinates plus LQR or even plan/track a desired trajectory using time-varying LQR. If I were to add additional linear constraints, for instance constraining $F_{min} \le F_i \le F_{max}$, then I can still use linear model-predictive control (MPC) to plan and stabilize a desired motion. By most accounts, this is a relatively easy control problem. (Note that it would be considerably harder if my control input or input constraints depend on the orientation, $\theta$, but we don't need that here yet).

Now imagine that you just a small number of thrusters (let's say two) each with severe input constraints. To make things more interesting, let us say that you're allowed to move the thrusters, so $\dot{r}_i$ becomes an additional control input, but with some important limitations: you have to turn the thrusters off when they are moving (e.g. $|F_i||\dot{r}_i| = 0$) and there is a limit to how quickly you can move the thrusters $|\dot{r}|_i \le v_{max}$. This problem is now a lot more difficult, due especially to the fact that constraints of the form $u_1 u_2 = 0$ are non-convex.

I find this a very useful thought exercise; somehow our controller needs to make an effectively discrete decision to turn off a thruster and move it. The framework of optimal control should support this - you are sacrificing a short-term loss of control authority for the long-term benefit of having the thruster positioned where you would like, but we haven't developed tools yet that deal well with this discrete plus continuous decision making. We'll need to address that here.

Unfortunately, although the problem is already difficult, we need to continue to add constraints. Now let's say additionally, that the thrusters can only be turned on in certain locations, as cartooned here:

The union of these regions need not form a convex set. Furthermore, these locations could be specified in the robot frame, but may also be specified in the world frame, which I'll call ${\bf p}_i$: $${\bf p}_i = {\bf r}_i - \begin{bmatrix} x \\ 0 \\ z \end{bmatrix}.$$ This problem still feels like it should be tractable, but it's definitely getting hard.

In my view, the hovercraft problem above is a central component of the walking problem. If we consider a walking robot with massless legs, then the feet are exactly like movable thrusters. As above, they are highly constrained - they can only produce force when they are touching the ground, and (typically) they can only produce forces in certain directions, for instance as described by a "friction cone" (you can push on the ground but not pull on the ground, and with Coulomb friction the forces tangential to the ground must be smaller than the forces normal to the ground, as described by a coefficient of friction, e.g. $|F_{\parallel}| < \mu |F_{\perp}|$).

The constraints on where you can put your feet / thrusters will depend on the kinematics of your leg, and the speed at which you can move them will depend on the full dynamics of the leg -- these are difficult constraints to deal with. But the actual dynamics of the rigid body are actually still affine, and still simple!

We don't actually need to have massless legs for this discussion to
make sense. If we use the coordinates $x,z$ to describe the location of
the center of mass (CoM) of the entire robot, and $m$ to represent the
entire mass of the robot, then the first two equations remain unchanged.
The center of mass is a configuration dependent point, and does not have
an orientation, but one important generalization of the orientation
dynamics is given by the centroidal momentum matrix, $A(\bq)$, where
$A(\bq)\dot{\bq}$ captures the linear and angular momentum of the robot
around the center of mass

In the previous chapter we devoted relatively a lot of attention to dynamics of impact, characterized for instance by a guard that resets dynamics in a hybrid dynamical system. In those models we used impulsive ground-reaction forces (these forces are instantaneously infinite, but doing finite work) in order to explain the discontinuous change in velocity required to avoid penetration with the ground. This story can be extended naturally to the dynamics of the center of mass.

For an articulated robot, though, there are a number of possible strategies for the legs which can effect the dynamics of the center of mass. For example, if the robot hits the ground with a stiff leg like the rimless wheel, then the angular momentum about the point of collision will be conserved, but any momentum going into the ground will be lost. However, if the robot has a spring leg and a massless toe like the SLIP model, then no energy need be lost.

One strategy that many highly-articulated legged robots use is to keep their center of mass at a constant height, $$z = c \quad \Rightarrow \quad \dot{z} = \ddot{z} = 0,$$ and minimize their angular momentum about the center of mass (here $\ddot\theta=0$). Using this strategy, if the swinging foot lands roughly below the center of mass, then even with a stiff leg there is no energy dissipated in the collision - all of the momentum is conserved. This often (but does not always) justify ignoring the impacts in the center of mass dynamics of the system.

While not the only important case, it is extremely common for our robots to be walking over flat, or nearly flat terrain. In this situation, even if the robot is touching the ground in a number of places (e.g., two heels and two toes), the constraints on the center of mass dynamics can be summarized very efficiently.

First, on flat terrain $F_{i,z}$ represents the force that is normal to the surface at contact point $i$. If we assume that the robot can only push on the ground (and not pull), then this implies $$\forall i, F_{i,z} \ge 0 \Rightarrow \sum_i F_{i,z} \ge 0 \Rightarrow \ddot{z} \ge -g.$$ In other words, if I cannot pull on the ground, then my center of mass cannot accelerate towards the ground faster than gravity.

Furthermore, if we use a Coulomb model of friction on the ground, with friction coefficient $\mu$, then $$\forall i, |F_{i,x}| \le \mu F_{i,z} \Rightarrow \sum_i |F_{i,x}| \le \mu \sum_i F_z \Rightarrow |\ddot{x}| \le \mu (\ddot{z} + g).$$ For instance, if I keep my center of mass at a constant height, then $\ddot{z}=0$ and $|\ddot{x}| \le \mu g$; this is a nice reminder of just how important friction is if you want to be able to move around in the world.

Even better, let us define the "center of pressure" (CoP) as the point on the ground where $$x_{cop} = \frac{\sum_i p_{i,x} F_{i,z}}{\sum_i F_{i,z}},$$ and since all $p_{i,z}$ are equal on flat terrain, $z_{cop}$ is just the height of the terrain. It turns out that the center of pressure is a "zero-moment point" (ZMP) -- a property that we will demonstrate below -- and moment-balance equation gives us a very important relationship between the location of the CoP and the dynamics of the CoM: \[ (m\ddot{z} + mg) (x_{cop} - x) = (z_{cop} - z) m\ddot{x} - I\ddot\theta. \] If we use the strategy proposed above for ignoring collision dynamics, $\ddot{z} = \ddot{\theta} = 0$, then we have $z - z_{cop}$ is a constant height $h$, and the result is the famous "ZMP equations": \[ \ddot{x} = -\frac{g}{h} (x_{cop}-x). \] So the location of the center of pressure completely determines the acceleration of the center of mass, and vice versa! What's more, this relationship is affine -- a property that we can exploit in a number of ways.

As an example, we can easily relate constraints on the CoP to constraints on $\ddot{x}$. It is easy to verify from the definition that the CoP must live inside the convex hull of the ground contact points. Therefore, if we use the $\ddot{z}=\ddot\theta=0$ strategy, then this directly implies strict bounds on the possible accelerations of the center of mass given the locations of the ground contacts.

The zero-moment point (ZMP) is discussed very frequently in the current
literature on legged robots. It also has an unfortunate tendency to be
surrounded by some confusion; a number of people have defined the ZMP is
slightly different ways (see e.g.

First let us recall that for rigid body systems I can always summarize
the contributions from many external forces as a single *wrench*
(force and torque) on the object -- this is simply because the $F_i$ terms
enter our equations of motion linearly. Specifically, given any point in
space, $r$, in coordinates relative to $(x,z)$:

I can re-write the equations of motion as \begin{align*} m\ddot{x} =& \sum_i F_{i,x} = F_{net,x},\\ m\ddot{z} =& \sum_i F_{i,z} - mg = F_{net,z} - mg,\\ I\ddot\theta =& \sum_i \left[ r_i \times F_i \right]_y = ({\bf r} \times {\bf F}_{net})_y + \tau_{net}, \end{align*} where ${\bf F}_{net} = \sum_i {\bf F}_i$ and the value of $\tau_{net}$ depends on the location ${\bf r}$. For some choices of ${\bf r}$, we can make $\tau_{net}=0$. Examining \[ ({\bf r} \times {\bf F}_{net})_y = r_z F_{net,x} - r_x F_{net,z} = \left[ r_i \times F_i \right]_y, \] we can see that when $F_{net,z} > 0$ there is an entire line of solutions, $r_x = a r_z + b$, including one that will intercept the terrain. For walking robots, it is this point on the ground from which the external wrench can be described by a single force vector (and no moment) that is the famous "zero-moment point" (ZMP). Substituting the back in to replace $F_{net}$, we can see that \[ r_x = \frac{r_z m \ddot{x} - I \ddot\theta}{m\ddot{z} + mg}. \] If we assume that $\ddot{z}=\ddot{\theta}=0$ and replace the relative coordinates with the global coordinates ${\bf r} = {\bf p} - [x,0,z]^T$, then we arrive at precisely the equation presented above.

Furthermore, since \[\left[ r_i \times F_i \right]_y = \sum_i \left(
r_{i,z} F_{i,x} - r_{i,x} F_{i,z} \right), \] and for *flat terrain*
we have \[ r_z F_{net,x} = \sum_i r_{i,z} F_{i,x}, \] then we can see that
this ZMP is exactly the CoP: \[ r_x = \frac{\sum_i r_{i,x} F_{i,z}}{
F_{net,z} }. \]

In three dimensions, we solve for the point on the ground where $\tau_{net,y} = \tau_{net,x} = 0$, but allow $\tau_{net,z} \ne 0$ to extract the analogous equations in the $y$-axis: \[ r_y = \frac{r_z m \ddot{y} - I \ddot\theta}{m\ddot{z} + mg}. \]

Coming soon. For a description of our approach with Atlas, see

Coming soon. For a description of our approach with Atlas, see

Coming soon. For a description of our approach with Atlas, see

My goals for this chapter are to build intuition for the beautiful and rich behavior of nonlinear dynamical system that are subjected to random (noise/disturbance) inputs. So far we have focused primarily on systems described by \[ \dot{\bx}(t) = f(\bx(t),\bu(t)) \quad \text{or} \quad \bx[n+1] = f(\bx[n],\bu[n]). \] In this chapter, I would like to broaden the scope to think about \[ \dot{\bx}(t) = f(\bx(t),\bu(t),\bw(t)) \quad \text{or} \quad \bx[n+1] = f(\bx[n],\bu[n],\bw[n]), \] where this additional input $\bw$ is the (vector) output of some random process. In other words, we can begin thinking about stochastic systems by simply understanding the dynamics of our existing ODEs subjected to an additional random input.

This form is extremely general as written. $\bw(t)$ can represent time-varying random disturbances (e.g. gusts of wind), or even constant model errors/uncertainty. One thing that we are not adding, yet, is measurement uncertainty. There is a great deal of work on observability and state estimation that study the question of how you can infer the true state of the system given noise sensor readings. For this chapter we are assuming perfect measurements of the full state, and are focused instead on the way that "process noise" shapes the long-term dynamics of the system.

I will also stick primarily to discrete time dynamics for this chapter,
simply because it is easier to think about the output of a discrete-time
random process, $\bw[n]$, than a $\bw(t)$. But you should know that all of the
ideas work in continuous time, too. Also, most of our examples will take the
form of *additive noise*: \[ \bx[n+1] = f(\bx[n],\bu[n]) + \bw[n], \]
which is a particular useful and common specialization of our general form.
And this form doesn't give up much -- the disturbance on step $k$ can pass
through the nonlinear function $f$ on step $k+1$ giving rich results -- but is
often much easier to work with.

Let's start by looking at some simple examples.

Let's consider (a time-reversed version of) one of my favorite
one-dimensional systems: \[ \dot{x} = x - x^3. \]

A reasonable discrete-time
approximation of these dynamics, now with additive noise, is \[ x[n+1] =
x[n] + h (x[n] - x[n]^3 + w[n]). \] When $w[n]=0$, this system has the
same fixed points and stability properties as the continuous time system.
But let's examine the system when $w[n]$ is instead the result of a
*zero-mean Gaussian white noise process*, defined by: \begin{gather*}
\forall n, E\left[ w[n] \right] = 0,\\ E\left[ w[i]w[j] \right] =
\begin{cases} \sigma^2, & \text{ if } i=j,\\ 0, & \text{ otherwise.}
\end{cases} \end{gather*} Here $\sigma$ is the standard deviation of the
Gaussian.

When you simulate this system for small values of $\sigma$, you will see trajectories move roughly towards one of the two fixed points (for the deterministic system), but every step is modified by the noise. In fact, even if the trajectory were to arrive exactly at what was once a fixed point, it is almost surely going to move again on the very next step. In fact, if we plot many runs of the simulation from different initial conditions all on the same plot, we will see something like the figure below.

During any individual simulation, the state jumps around randomly for all time, and could even transition from the vicinity of one fixed point to the other fixed point. Another way to visualize this output is to animate a histogram of the particles over time.

You can run this demo for yourself:

Let's take a moment to appreciate the implications of what we've just
observed. Every time that we've analyzed a system to date, we've asked
questions like "given x[0], what is the long-term behavior of the system,
$\lim_{n\rightarrow\infty} x[n]$?", but now $x[n]$ is a *random
variable*. The trajectories of this system do not converge, and the
system does not exhibit any form of stability that we've introduced so far.

All is not lost. If you watch the animation closely, you might notice
the *distribution* of this random variable is actually very well
behaved. This is the key idea for this chapter.

Let us use $p_n(x)$ to denote the
probability density function over the random variable $x$ at time $n$.
Note that this density function must satisfy \[ \int_{-\infty}^\infty p_n(x)
dx = 1.\] It is actually possible to write the *dynamics of the
probability density* with the simple relation \[ p_{n+1}(x) =
\int_{-\infty}^\infty p(x|x') p_n(x') dx', \] where $p(x|x')$ encodes the
stochastic dynamics as a conditional distribution of the next state (here
$x$) as a function of the current state (here $x'$). Dynamical systems that
can be encoded in this way are known as *continuous-state Markov
Processes*, and the governing equation above is often referred to as the
"master
equation" for the stochastic process. In fact this update is even
linear(!) ; a fact that can enable closed-form solutions to some impressive
long-term statistics, like mean time to failure or first passage
times

The slightly more general form of the master equation, which works for multivariable distributions with state-domain ${\cal X}$, and systems with control inputs $\bu$, is \[ p_{n+1}(\bx) = \int_{\cal X} p(\bx|\bx',\bu) p_n(\bx') d\bx'. \] This is a (continuous-state) Markov Decision Process.

In the example above, the histogram is our numerical approximation of the
probability density. If you simulate it a few times, you will probably
believe that, although the individual trajectories of the system *do
not* converge, the probability distribution actually *does* converge
to what's known as a *stationary distribution* -- a fixed point of the
master equation. Instead of thinking about the dynamics of the
trajectories, we need to start thinking about the dynamics of the
distribution.

Let's consider the one-dimensional linear system with additive noise: \[ x[n+1] = a x[n] + w[n]. \] When $w[n]=0$, the system is stable (to the origin) for $-1 < a < 1$. Let's make sure that we can understand the dynamics of the master equation for the case when $w[n]$ is Gaussian white noise with standard deviation $\sigma$.

First, recall that the probability density function of a Gaussian with mean $\mu$ is given by \[ p(w) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(w-\mu)^2}{2\sigma^2}}. \] When conditioned on $x[n]$, the distribution given by the dynamics subjected to mean-zero Gaussian white noise is simply another Gaussian, with the mean given by $ax[n]$: \[ p(x[n+1]|x[n]) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x[n+1]-ax[n])^2}{2\sigma^2}}, \] yielding the master equation \[ p_{n+1}(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \int_{-\infty}^\infty e^{-\frac{(x-ax')^2}{2\sigma^2}} p_n(x') dx'. \]

Now here's the magic. Let's push a distribution, $p_n(x)$, which is zero-mean, with standard deviation $\sigma_n$, through the master equation: \begin{align*} p_{n+1}(x) =& \frac{1}{\sqrt{2 \pi \sigma^2}}\frac{1}{\sqrt{2 \pi \sigma_n^2}} \int_{-\infty}^\infty e^{-\frac{(x-ax')^2}{2\sigma^2}} e^{-\frac{(x')^2}{2\sigma_n^2}} dx',\\ =& \frac{1}{\sqrt{2 \pi (\sigma^2+a^2 \sigma_n^2)}} e^{-\frac{x^2}{2(\sigma^2 + a^2 \sigma_n^2)}}. \end{align*} The result is another mean-zero Gaussian with $\sigma_{n+1}^2 = \sigma^2 + a^2 \sigma_n^2$. This result generalizes to the multi-variate case, too, and might be familiar to you e.g. from the process update of the Kalman filter.

Taking it a step further, we can see that a stationary distribution for this system is given by a mean-zero Gaussian with \[ \sigma_*^2 = \frac{\sigma^2}{1-a^2}. \] Note that this distribution is well defined when $-1 < a < 1$ (only when the system is stable).

The stationary distribution of the linear Gaussian system reveals the fundamental and general balance between two terms in the governing equations of any stochastic dynamical system: the stability of the deterministic system is bringing trajectories together (smaller $a$ means faster convergence of the deterministic system and results in a more narrow distribution), but the noise in the system is forcing trajectories apart (larger $\sigma$ means larger noise and results in a wider distribution).

Given how rich the dynamics can be for deterministic nonlinear systems, you can probably imagine that the possible long-term dynamics of the probability are also extremely rich. If we simply flip the signs in the dynamics we examined above, we'll get our next example:

Now let's consider the discrete-time approximation of \[ \dot{x} = -x + x^3, \] again with additive noise: \[ x[n+1] = x[n] + h (-x[n] + x[n]^3 + w[n]). \] With $w[n]=0$, the system has only a single stable fixed point (at the origin), whose deterministic region of attraction is given by $x \in (-1,1)$. If we again simulate the system from a set of random initial conditions and plot the histogram, we will see something like the figure below.

Be sure to watch the animation. Or better yet, run the simulation for yourself by changing the sign of the derivative in the example and re-running:

Now try increasing the noise (e.g. pre-multiply the noise input, $w$, in the dynamics by a scalar like 2).

Click here to see the animation

What is the stationary distribution for this system? In fact, there isn't one. Although we initially collect probability density around the stable fixed point, you should notice a slow leak -- on every step there is some probability of transitioning past unstable fixed points and getting driven by the unstable dynamics off towards infinity. If we run the simulation long enough, there won't be any probability density left at $x=0$.

One more example; this is a fun one. Let's think about one of the simplest examples that we had for a system that demonstrates limit cycle stability -- the Van der Pol oscillator -- but now we'll add Gaussian white noise, $$\ddot{q} + (q^2 - 1)\dot{q} + q = w(t),$$ Here's the question: if we start with a small set of initial conditions centered around one point on the limit cycle, then what is the long-term behavior of this distribution?

Since the long-term behavior of the deterministic system is periodic, it would be very logical to think that the state distribution for this stochastic system would fall into a stable periodic solution, too. But think about it a little more, and then watch the animation (or run the simulation yourself).

Click here to see the animation (first 20 seconds)

Click here to see the particles after 2000 seconds of simulation

The explanation is simple: the periodic solution of the system is only
*orbitally stable*; there is no stability along the limit cycle. So
any disturbances that push the particles along the limit cycle will go
unchecked. Eventually, the distribution will "mix" along the entire
cycle. Perhaps surprisingly, this system that has a limit cycle stability
when $w=0$ eventually reaches a stationary distribution (fixed point) in
the master equation.

Coming soon. Read the paper

Sensor models. Beam model from probabilistic robotics. RGB-D dropouts.

Perception subsystem. Output of a perception system is not Gaussian noise, it's missed detections/drop-outs...

Distributions over tasks/environments.

Coming soon. Uncertainty quantification (e.g., linear guassian approximation; discretize then closed form solutions using the transition matrix....). Monte-Carlo. Particle sim/filter. Rare event simulation

L2-gain with dissipation inequalities. Finite-time verification with sums of squares.

In chapter 2, we spent some time thinking about the phase portrait of the
simple pendulum, and concluded with a challenge: can we design a nonlinear
controller to *reshape* the phase portrait, with a very modest amount
of actuation, so that the upright fixed point becomes globally stable? With
unbounded torque, feedback linearization solutions (e.g., invert gravity) can
work well, but can also require an unnecessarily large amount of control
effort. The energy-based swing-up control solutions presented for the acrobot
and cart-pole systems are considerably more appealing, but required some
cleverness and might not scale to more complicated systems. Here we
investigate another approach to the problem, using computational optimal
control to synthesize a feedback controller directly.

In this chapter, we will introduce optimal control - a control design
process using optimization. This approach is powerful for a number of
reasons. First and foremost, it is very general - allowing us to specify
the goal of control equally well for fully- or under-actuated, linear or
nonlinear, deterministic or stochastic, and continuous or discrete
systems. Second, it permits concise descriptions of potentially very
complex desired behaviours, specifying the goal of control as an scalar
objective (plus a list of constraints). Finally, and most importantly,
optimal control is very amenable to numerical solutions.

The fundamental idea in optimal control is to formulate the goal of
control as the *long-term* optimization of a scalar cost function.
Let's introduce the basic concepts by considering a system that is even
simpler than the simple pendulum.

Consider the double integrator system $$\ddot{q} = u, \quad |u| \le 1.$$ If you would like a mechanical analog of the system (I always do), then you can think about this as a unit mass brick moving along the x-axis on a frictionless surface, with a control input which provides a horizontal force, $u$. The task is to design a control system, $u = \pi(\bx,t)$, $\bx=[q,\dot{q}]^T$ to regulate this brick to $\bx = [0,0]^T$.

In order to formulate this control design problem using optimal control, we must define a scalar objective which scores the long-term performance of running each candidate control policy, $\pi(\bx,t)$, from each initial condition, $(\bx_0,t_0)$, and a list of constraints that must be satisfied. For the task of driving the double integrator to the origin, one could imagine a number of optimal control formulations which would accomplish the task, e.g.:

- Minimum time: $\min_\pi t_f,$ subject to $\bx(t_0) = \bx_0,$ $\bx(t_f) = {\bf 0}.$
- Quadratic cost: $\min_\pi \int_{t_0}^{\infty} \left[ \bx^T(t) {\bf Q} \bx(t) \right] dt,$ ${\bf Q}\succ0$.

Note that the input limits, $|u|\le1$ are also required to make this problem well-posed; otherwise both optimizations would result in the optimal policy using infinite control input to approach the goal infinitely fast. Besides input limits, another common approach to limiting the control effort is to add an additional quadratic cost on the input (or "effort"), e.g. $\int \left[ \bu^T(t) {\bf R} \bu(t) \right] dt,$ ${\bf R}\succ0$. This could be added to either formulation above. We will examine many of these formulations in some details in the examples worked out at the end of this chapter.

Optimal control has a long history in robotics. For instance, there has been a great deal of work on the minimum-time problem for pick-and-place robotic manipulators, and the linear quadratic regulator (LQR) and linear quadratic regulator with Gaussian noise (LQG) have become essential tools for any practicing controls engineer. With increasingly powerful computers and algorithms, the popularity of numerical optimal control has grown at an incredible pace over the last few years.

For more intuition, let's do an informal derivation of the solution to the minimum time problem for the double integrator with input constraints: \begin{align*} \minimize_{\pi} \quad & t_f\\ \subjto \quad & \bx(t_0) = \bx_0, \\ & \bx(t_f) = {\bf 0}, \\ & \ddot{q}(t) = u(t), \\ & |u| \le 1. \end{align*} What behavior would you expect an optimal controller exhibit?

Your intuition might tell you that the best thing that the brick can
do, to reach the goal in minimum time with limited control input, is to
accelerate maximally towards the goal until reaching a critical point,
then hitting the brakes in order to come to a stop exactly at the goal.
This would be called a *bang-bang* control policy; these are often
optimal for systems with bounded input, and it is in fact optimal for the
double integrator, although we will not prove it until we have developed
more tools.

Let's work out the details of this bang-bang policy. First, we can figure out the states from which, when the brakes are fully applied, the system comes to rest precisely at the origin. Let's start with the case where $q(0) < 0$, and $\dot{q}(0)>0$, and "hitting the brakes" implies that $u=-1$ . Integrating the equations, we have \begin{gather*} \ddot{q}(t) = u = -1 \\\dot{q}(t) = \dot{q}(0) - t \\ q(t) = q(0) + \dot{q}(0) t - \frac{1}{2} t^2. \end{gather*} Substituting $t = \dot{q}(0) - \dot{q}$ into the solution give $\dot{q}$ reveals that the system orbits are parabolic arcs: \[ q = -\frac{1}{2} \dot{q}^2 + c_{-}, \] with $c_{-} = q(0) + \frac{1}{2}\dot{q}^2(0)$.

Similarly, the solutions for $u=1$ are $q = \frac{1}{2} \dot{q}^2 + c_{+}$, with $c_{+}=q(0)-\frac{1}{2}\dot{q}^2(0)$.

Perhaps the most important of these orbits are the ones that pass directly through the origin (e.g., $c_{-}=0$). Following our initial logic, if the system is going slower than this $\dot{q}$ for any $q$, then the optimal thing to do is to slam on the accelerator ($u=-\text{sgn}(q)$). If it's going faster than the $\dot{q}$ that we've solved for, then still the best thing to do is to brake; but inevitably the system will overshoot the origin and have to come back. We can summarize this policy with: \[ u = \begin{cases} +1 & \text{if } (\dot{q} < 0 \text{ and } q \le \frac{1}{2} \dot{q}^2) \text{ or } (\dot{q}\ge 0 \text{ and } q < -\frac{1}{2} \dot{q}^2) \\ 0 & \text{if } q=0 \text{ and } \dot{q}=0 \\ -1 & \text{otherwise} \end{cases} \]

and illustrate some of the optimal solution trajectories:

And for completeness, we can compute the optimal time to the goal by solving for the amount of time required to reach the switching surface plus the amount of time spent moving along the switching surface to the goal. With a little algebra, you will find that the time to goal, $T(\bx)$, is given by \[ T(\bx) = \begin{cases} 2\sqrt{\frac{1}{2}\dot{q}^2-q} - \dot{q} & \text{for } u=+1 \text{ regime}, \\ 0 & \text{for } u=0, \\ \dot{q} + 2\sqrt{\frac{1}{2}\dot{q}^2+q} & \text{for } u=-1, \end{cases} \] plotted here:

Notice that the function is continuous (though not smooth), even though the policy is discontinuous.

As we begin to develop theoretical and algorithmic tools for optimal
control, we will see that some formulations are much easier to deal with
than others. One important example is the dramatic simplification that
can come from formulating objective functions using *additive
cost*, because they often yield recursive solutions. In the additive
cost formulation, the long-term "score" for a trajectory can be written
as $$\int_0^T g(x(t),u(t)) dt,$$ where $g()$ is the instantaneous cost
(also referred to as the "running cost"), and $T$ can be either a finite
real number or $\infty$. We will call a problem specification with a
finite $T$ a "finite-horizon" problem, and $T=\infty$ an
"infinite-horizon" problem. Problems and solutions for infinite-horizon
problems tend to be more elegant, but care is required to make sure that
the integral converges for the optimal controller (typically by having an
achievable goal that allows the robot to accrue zero-cost).

The quadratic cost function suggested in the double integrator example above is clearly written as an additive cost. At first glance, our minimum-time problem formulation doesn't appear to be of this form, but we actually can write it as an additive cost problem using an infinite horizon and the instantaneous cost $$g(x,u) = \begin{cases} 0 & \text{if } x=0, \\ 1 & \text{otherwise.} \end{cases}$$

We will examine a number of approaches to solving optimal control
problems throughout the next few chapters. For the remainder of this
chapter, we will focus on additive-cost problems and their solution via
*dynamic programming*.

For systems with continuous states and continuous actions, dynamic
programming is a set of theoretical ideas surrounding additive cost optimal
control problems. For systems with a finite, discrete set of states and a
finite, discrete set of actions, dynamic programming also represents a set
of very efficient numerical *algorithm* which can compute optimal
feedback controllers. Many of you will have learned it before as a tool for
graph search.

Imagine you have a directed graph with states (or nodes) $\{s_1,s_2,...\} \in S$ and "actions" associated with edges labeled as $\{a_1,a_2,...\} \in A$, as in the following trivial example:

Let us also assume that each edge has an associate weight or cost, using $g(s,a)$ to denote the cost of being in state $s$ and taking action $a$. Furthermore we will denote the transition "dynamics" using \[ s[n+1] = f(s[n],a[n]). \] For instance, in the graph above, $f(s_1,a_1) = s_2$.

There are many algorithms for finding (or approximating) the optimal
path from a start to a goal on directed graphs. In dynamic programming,
the key insight is that we can find the shortest path from every node by
solving recursively for the optimal *cost-to-go* (the cost that will
be accumulated when running the optimal controller) from every node to the
goal. One such algorithm starts by initializing an estimate $\hat{J}^*=0$
for all $s_i$, then proceeds with an iterative algorithm which sets
\begin{equation} \hat{J}^*(s_i) \Leftarrow \min_{a \in A} \left[ g(s_i,a) +
\hat{J}^*\left({f(s_i,a)}\right) \right]. \label{eq:value_update}
\end{equation} In software, $\hat{J}^*$ can be represented as a vector with
dimension equal to the number of discrete states. This algorithm,
appropriately known as *value iteration*, is guaranteed to converge
to the optimal cost-to-go up to a constant factor, $\hat{J}^* \rightarrow
J^* + c$ *batch* -- e.g. the estimate is
updated for all $i$ at once -- but the *asynchronous* version where
states are updated one at a time is also known to converge, so long as
every state is eventually updated infinitely often. Assuming the graph has
a goal state with a zero-cost self-transition, then this cost-to-go
function represents the weighted shortest distance to the goal.

Value iteration is an amazingly simple algorithm, but it accomplishes
something quite amazing: it efficiently computes the long-term cost of an
optimal policy from *every* state by iteratively evaluating the
one-step cost. If we know the optimal cost-to-go, then it's easy to extract
the optimal policy, $a = \pi^*(s)$: \begin{equation} \pi^*(s_i) = \argmin_a
\left[ g(s_i,a) + J^*\left( f(s_i,a) \right) \right].
\label{eq:policy_update} \end{equation} It's a simple algorithm, but playing
with an example can help our intuition.

Imagine a robot living in a grid (finite state) world. Wants to get
to the goal location. Possibly has to negotiate cells with obstacles.
Actions are to move up, down, left, right, or do nothing.

You can run value iteration for the double integrator (using
barycentric interpolation to interpolate between nodes) in

Please do take some time to try different cost functions by editing the code yourself.

Let's take a minute to appreciate how amazing this is. Our solution to finding the optimal controller for the double integrator wasn't all that hard, but it required some mechanical intuition and solutions to differential equations. The resulting policy was non-trivial -- bang-bang control with a parabolic switching surface. The value iteration algorithm doesn't use any of this directly -- it's a simple algorithm for graph search. But remarkably, it can generate effectively the same policy with just a few moments of computation.

It's important to note that there *are* some differences between
the computed policy and the optimal policy that we derived, due to
discretization errors. We will ask you to explore these in the
problems.

The real value of this numerical solution, however, is unlike our analytical solution for the double integrator, we can apply this same algorithm to any number of dynamical systems virtually without modification. Let's apply it now to the simple pendulum, which was intractable analytically.

You can run value iteration for the simple pendulum (using barycentric
interpolation to interpolate between nodes) in

Again, you can easily try different cost functions by editing the code yourself.

I find the graph search algorithm extremely satisfying as a first step, but also become quickly frustrated by the limitations of the discretization required to use it. In many cases, we can do better; coming up with algorithms which work more natively on continuous dynamical systems. We'll explore those extensions in this section.

It's important to understand that the value iteration equations,
equations (\ref{eq:value_update}) and (\ref{eq:policy_update}), are more
than just an algorithm. They are also sufficient conditions for
optimality: if we can produce a $J^*$ and $\pi^*$ which satisfy these
equations, then $\pi^*$ must be an optimal controller. There are an
analogous set of conditions for the continuous systems. For a system
$\dot{\bx} = f(\bx,\bu)$ and an infinite-horizon additive cost
$\int_0^\infty g(\bx,\bu)dt$, we have: \begin{gather} 0 = \min_\bu \left[
g(\bx,\bu) + \pd{J^*}{\bx}f(\bx,\bu) \right], \label{eq:HJB} \\
\pi^*(\bx) = \argmin_\bu \left[ g(\bx,\bu) + \pd{J^*}{\bx}f(\bx,\bu)
\right]. \end{gather} Equation \ref{eq:HJB} is known as the
*Hamilton-Jacobi-Bellman* (HJB) equation. *time
derivative* depends on the first-order partial derivatives over
state, which we get in the finite-time deriviation; Eq \ref{eq:HJB} is
the steady-state solution of the Hamilton-Jacobi equation.

The following statement is adapted from Proposition 3.2.1 of

As a tool for verifying optimality, the HJB equations are actually surprisingly easy to work with: we can verify optimality for an infinite-horizon objective without doing any integration; we simply have to check a derivative condition on the optimal cost-to-go function $J^*$. Let's see this play out on the double integrator example.

Consider the problem of regulating the double integrator (this time without input limits) to the origin using a quadratic cost: $$ g(\bx,\bu) = q^2 + \dot{q}^2 + u^2. $$ I claim (without derivation) that the optimal controller for this objective is $$\pi(\bx) = -q - \sqrt{3}\dot{q}.$$ To convince you that this is indeed optimal, I have produced the following cost-to-go function: $$J(\bx) = \sqrt{3} q^2 + 2 q \dot{q} + \sqrt{3} \dot{q}^2.$$

Taking \begin{gather*} \pd{J}{q} = 2\sqrt{3} q + 2\dot{q}, \qquad \pd{J}{\dot{q}} = 2q + 2\sqrt{3}\dot{q}, \end{gather*} we can write \begin{align*} g(\bx,\bu) + \pd{J}{\bx}f(\bx,\bu) &= q^2 + \dot{q}^2 + u^2 + (2\sqrt{3} q + 2\dot{q}) \dot{q} + (2q + 2\sqrt{3}\dot{q}) u \end{align*} This is a convex quadratic function in $u$, so we can find the minimum with respect to $u$ by finding where the gradient with respect to $u$ evaluates to zero. \[ \pd{}{u} \left[ g(\bx,\bu) + \pd{J}{\bx} f(\bx,\bu) \right] = 2u + 2q + 2\sqrt{3}\dot{q}. \] Setting this equal to $0$ and solving for $u$ yields: $$u^* = -q - \sqrt{3} \dot{q},$$ thereby confirming that our policy $\pi$ is in fact the minimizer. Substituting $u^*$ back into the HJB reveals that the right side does in fact simplify to zero. I hope you are convinced!

Note that evaluating the HJB for the time-to-go of the minimum-time problem for the double integrator will also reveal that the HJB is satisfied wherever that gradient is well-defined. This is certainly mounting evidence in support of our bang-bang policy being optimal, but since $\pd{J}{\bx}$ is not defined everywhere, it does not actually satisfy the requirements of the sufficiency theorem as stated above.

We still face a few barriers to actually using the HJB in an algorithm. The first barrier is the minimization over $u$. When the action set was discrete, as in the graph search version, we could evaluate the one-step cost plus cost-to-go for every possible action, and then simply take the best. For continuous action spaces, in general we cannot rely on the strategy of evaluating a finite number of possible $\bu$'s to find the minimizer.

All is not lost. In the quadratic cost double integrator example above, we were able to solve explicitly for the minimizing $\bu$ in terms of the cost-to-go. It turns out that this strategy will actually work for a number of the problems we're interested in, even when the system (which we are given) or cost function (which we are free to pick, but which should be expressive) gets more complicated.

Recall that I've already tried to convinced you that a majority of the
systems of interest are *control affine*, e.g. I can write \[
f(\bx,\bu) = f_1(\bx) + f_2(\bx)\bu. \] We can make another dramatic
simplification by restricting ourselves to instantaneous cost functions of
the form \[ g(\bx,\bu) = g_1(\bx) + \bu^T {\bf R} \bu, \qquad {\bf R}={\bf
R}^T \succ 0. \] In my view, this is not very restrictive - many of the
cost functions that I find myself choosing to write down can be expressed
in this form. Given these assumptions, we can write the HJB as \[ 0 =
\min_{\bu} \left[ g_1(\bx) + \bu^T {\bf R} \bu + \pd{J}{\bx} \left[
f_1(\bx) + f_2(\bx)\bu \right]\right]. \] Since this is a positive
quadratic function in $\bu$, if the system does not have any constraints
on $\bu$, then we can solve in closed-form for the minimizing $\bu$ by
taking the gradient of the right-hand side: \[ \pd{}{\bu} = 2\bu^T {\bf R}
+ \pd{J}{\bx} f_2(\bx) = 0, \] and setting it equal to zero to obtain \[
\bu^* = -\frac{1}{2}{\bf R}^{-1}f_2^T(\bx) \pd{J}{\bx}^T.\] If there are
linear constraints on the input, such as torque limits, then more
generally this could be solved (at any particular $\bx$) as a quadratic
program.

What happens in the case where our system is not control affine or if
we really do need to specify an instantaneous cost function on $\bu$
that is not simply quadratic? If the goal is to produce an iterative
algorithm, like value iteration, then one common approach is to make a
(positive-definite) quadratic approximation in $\bu$ of the HJB, and
updating that approximation on every iteration of the algorithm. This
broad approach is often referred to as *differential dynamic
programming* (c.f.

The other major barrier to using the HJB in a value iteration algorithm is that the estimated optimal cost-to-go function, $\hat{J}^*$, must somehow be represented with a finite set of numbers, but we don't yet know anything about the potential form it must take. In fact, knowing the time-to-goal solution for minimum-time problem with the double integrator, we see that this function might need to be non-smooth for even very simple dynamics and objectives.

One natural way to parameterize $\hat{J}^*$ -- a scalar valued-function defined over the state space -- is to define the values on a mesh. This approach then admits algorithms with close ties to the relatively very advanced numerical methods used to solve other partial differential equations (PDEs), such as the ones that appear in finite element modeling or fluid dynamics. One important difference, however, is that our PDE lives in the dimension of the state space, while many of the mesh representations from the other disciplines are optimized for two or three dimensional space. Also, our PDE may have discontinuities (or at least discontinuous gradients) at locations in the state space which are not known apriori.

A slightly more general view of the problem would describe the mesh (and the associated interpolation functions) as just one form of representations for function approximation. Using a deep convolutional neural network to represent the cost-to-go would also fall under the domain of function approximation, perhaps representing the other extreme in terms of complexity; using deep networks in approximate dynamic programming is now called deep reinforcement learning. We will discuss it more later in the book.

If we approximate $J^*$ with a finitely-parameterized function $\hat{J}_\balpha^*$, with parameter vector $\balpha$, then this immediately raises many important questions:

- What if the true cost-to-go function does not live in the prescribed function class; e.g., there does not exist an $\balpha$ which satisfies the sufficiency conditions for all $\bx$? (This seems very likely to be the case.)
- What update should we apply in the iterative algorithm?
- What can we say about it's convergence?

In general, many of these questions will have to go unanswered. But
there are some answers available for the special case of *linear
function approximators*. A linear function approximator takes the
form: \[ \hat{J}^*_\balpha(\bx) = \sum_i \alpha_i \psi_i(\bx) =
\bpsi^T(\bx) \balpha, \] where $\bpsi(\bx)$ is a vector of potentially
nonlinear features. Common examples of features include polynomials,
radial basis functions, or most interpolation schemes used on a mesh.
The distinguishing feature of a linear function approximator is the
ability to exactly solve for $\balpha$ in order to represent a desired
function optimally, in a least-squares sense. For instance, if we have
a desired set of $K$ input-output points, $(\bx_k,J^d_k)$, then we can
write down the closed-form solution to \[ \minimize_\balpha \sum_k
\left(\hat{J}_\balpha(\bx_k) - J^d_k \right)^2. \] Using the least
squares solution in a value iteration update is sometimes referred to as
*fitted value iteration*, where $\bx_k$ are some number of samples
taken from the continuous space and for discrete-time systems the
iterative approximate solution to \begin{gather*} J^*(\bx_0) =
\min_{u[\cdot]} \sum_{n=0}^\infty g(\bx[n],\bu[n]), \\ \text{ s.t. }
\bx[n+1] = f(\bx[n], \bu[n]), \bx[0] = \bx_0\end{gather*} becomes
\begin{gather*} J^d_k = \min_\bu \left[ g(\bx_k,\bu) +
\hat{J}^*_\alpha\left({f(\bx_k,\bu)}\right) \right], \\ \balpha
\Leftarrow \argmin_\balpha \sum_k \left(\hat{J}^*_\balpha(\bx_k) - J^d_k
\right)^2. \end{gather*} For linear function approximators, this is
simply: \begin{gather*} \balpha \Leftarrow \begin{bmatrix}
\bpsi^T(\bx_1) \\ \vdots \\ \bpsi^T(\bx_K)\end{bmatrix}^+
\begin{bmatrix} J^d_1 \\ \vdots \\ J^d_K \end{bmatrix}, \end{gather*}
where the $^+$ notation refers to a Moore-Penrose pseudoinverse.
Remarkably, for linear function approximators, this update is still
known to converge to the globally optimal $\balpha^*$.

Imagine that we use a mesh to approximate the cost-to-go function
over that state space with $K$ mesh points $\bx_k$. We would like to
perform the value iteration update: \begin{equation} \forall k,
\hat{J}^*(\bx_k) \Leftarrow \min_\bu \left[ g(\bx_k,\bu) +
\hat{J}^*\left({f(\bx_k,\bu)}\right) \right],
\label{eq:mesh_value_iteration} \end{equation} but must deal with the
fact that $f(\bx_k,\bu)$ might not result in a next state that is
directly at a mesh point. Most interpolation schemes for a mesh can be
written as some weighted combination of the values at nearby mesh
points, e.g. \[ \hat{J}^*(\bx) = \sum_i \beta_i(\bx) \hat{J}^*(\bx_i),
\quad \sum_i \beta_i = 1 \] with $\beta_i$ the relative weight of the
$i$th mesh point. In *at a mesh
point* is the value at the mesh point itself), the update in Eq.
(\ref{eq:mesh_value_iteration}) is precisely the least-squares update
(and it achieves zero error).

For solutions to systems with continuous-time dynamics, I have to
uncover one of the details that I've so far been hiding to keep the
notation simpler. Let us consider a problem with a finite-horizon:
\begin{gather*} \min_{\bu[\cdot]} \sum_{n=0}^N g(\bx[n],\bu[n]), \\
\text{ s.t. } \bx[n+1] = f(\bx[n], \bu[n]), \bx[0] = \bx_0\end{gather*}
In fact, the way that we compute this is by solving the *time-varying
cost-to-go function* backwards in time: \begin{gather*}J^*(\bx,N) =
\min_\bu g(\bx, \bu) \\ J^*(\bx,n-1) = \min_\bu \left[ g(\bx, \bu) +
J^*(f(\bx,\bu), n) \right]. \end{gather*} The convergence of the value
iteration update is equivalent to solving this time-varying cost-to-go
backwards in time until it reaches a steady-state solution (the
infinite-horizon solution). Which explains why value iteration only
converges if the optimal cost-to-go is bounded.

Now let's consider the continuous-time version. Again, we have a
time-varying cost-to-go, $J^*(\bx,t)$. Now $$\frac{dJ^*}{dt} =
\pd{J^*}{\bx}f(\bx,\bu) + \pd{J^*}{t},$$ and our sufficiency condition
is $$0 = \min_\bu \left[g(\bx, \bu) + \pd{J^*}{\bx}f(\bx,\bu) +
\pd{J^*}{t} \right].$$ But since $\pd{J^*}{t}$ doesn't depend on $\bu$,
we can pull it out of the $\min$ and write the (true) HJB:
$$-\pd{J^*}{t} = \min_\bu \left[g(\bx, \bu) + \pd{J^*}{\bx}f(\bx,\bu)
\right].$$ The standard numerical recipe

Probably most visible and consistent campaign using numerical HJB
solutions in applied control (at least in robotics) has come from Claire Tomlin's group
at Berkeley. Their work leverages Ian Mitchell's Level
Set Toolbox, which solves the Hamilton-Jacobi PDEs on a Cartesian
mesh using the technique cartooned above, and even includes the
minimum-time problem for the double integrator as a tutorial
example

One of the most amazing features of the dynamic programming, additive cost approach to optimal control is the relative ease with which it extends to optimizing stochastic systems.

For discrete systems, we can generalize our dynamics on a graph by adding action-dependent transition probabilities to the edges. This new dynamical system is known as a Markov Decision Process (MDP), and we write the dynamics completely in terms of the transition probabilities \[ \Pr(s[n+1] = s' | s[n] = s, a[n] = a). \] For discrete systems, this is simply a big lookup table. The cost that we incur for any execution of system is now a random variable, and so we formulate the goal of control as optimizing the expected cost, e.g. \begin{equation} J^*(s[0]) = \min_{a[\cdot]} E \left[ \sum_{n=0}^\infty g(s[n],a[n]) \right]. \label{eq:stochastic_dp_optimality_cond} \end{equation} Note that there are many other potential objectives, such as minimizing the worst-case error, but the expected cost is special because it preserves the dynamic programming recursion: \[ J^*(s) = \min_a E \left[ g(s,a) + J^*(s')\right] = \min_a \left[ g(s,a) + \sum_{s'} \Pr(s'|s,a) J^*(s') \right].\] Remarkably, if we use these optimality conditions to construct our value iteration algorithm \[ \hat{J}(s) \Leftarrow \min_a \left[ g(s,a) + \sum_{s'} \Pr(s'|s,a) \hat{J}(s') \right],\] then this algorithm has the same strong convergence guarantees of its counterpart for deterministic systems. And it is essentially no more expensive to compute!

There is a particularly cute observation to be made here. Let's assume that we have discrete control inputs and discrete-time dynamics, but a continuous state space. Recall the fitted value iteration on a mesh algorithm described above. In fact, the resulting update is exactly the same as if we treated the system as a discrete state MDP with $\beta_i$ representing the probability of transitioning to state $x_i$! This sheds some light on the impact of discretization on the solutions -- discretization error here will cause a sort of diffusion corresponding to the probability of spreading across neighboring nodes.

For discrete MDPs, value iteration is a magical algorithm because it is simple, but known to converge to the global optimal, $J^*$. However, other important algorithms are known; one of the most important is a solution approach using linear programming. This formulation provides an alternative view, but may also be more generalizable and even more efficient for some instances of the problem.

Recall the optimality conditions from Eq. (\ref{eq:stochastic_dp_optimality_cond}). If we describe the cost-to-go function as a vector, $J_i = J(s_i)$, then these optimality conditions can be rewritten in vector form as \begin{equation} \bJ = \min_a \left[ \bg(a) + \bT(a) \bJ \right], \label{eq:vector_stochastic_dp} \end{equation} where $g_i(a) = g(s_i,a)$ is the cost vector, and $T_{i,j}(a) = \Pr(s_j|s_i,a)$ is the transition probability matrix. Let us take $\bJ$ as the vector of decision variables, and replace Eq. (\ref{eq:vector_stochastic_dp}) with the constraints: \begin{equation} \forall a, \bJ \le \bg(a) + \bT(a) \bJ.\end{equation} Clearly, for finite $a$, this is finite list of linear constraints, and for any vector $\bJ$ satisfying these constraints we have $\bJ \le \bJ^*$ (elementwise). Now write the linear program: \begin{gather*} \maximize_\bJ \quad \bc^T \bJ, \\ \subjto \quad \forall a, \bJ \le \bg(a) + \bT(a) \bJ, \end{gather*} where $c$ is any positive vector. The solution to this problem is $\bJ = \bJ^*$.

Perhaps even more interesting is that this approach can be generalized
to linear function approximators. Taking a vector form of my notation
above, now we have a matrix of features with $\bPsi_{i,j} =
\psi^T_j(\bx_i)$ and we can write the LP \begin{gather} \maximize_\balpha
\quad \bc^T \bPsi \balpha, \\ \subjto \quad \forall a, \bPsi \balpha \le
\bg(a) + \bT(a) \bPsi \balpha. \end{gather} This time the solution is not
necessarily optimal, because $\bPsi \balpha^*$ only approximates $\bJ^*$,
and the relative values of the elements of $\bc$ (called the
"state-relevance weights") can determine the relative tightness of the
approximation for different features

While solving the dynamic programming problem for continuous systems is very hard in general, there are a few very important special cases where the solutions are very accessible. Most of these involve variants on the case of linear dynamics and quadratic cost. The simplest case, called the linear quadratic regulator (LQR), is formulated as stabilizing a time-invariant linear system to the origin.

The linear quadratic regulator is likely the most important and influential result in optimal control theory to date. In this chapter we will derive the basic algorithm and a variety of useful extensions.

Consider a linear time-invariant system in state-space form, $$\dot{\bx} = {\bf A}\bx + \bB\bu,$$ with the infinite-horizon cost function given by $$J = \int_0^\infty \left[ \bx^T {\bf Q} \bx + \bu^T {\bf R} \bu \right] dt, \quad {\bf Q} = {\bf Q}^T \ge {\bf 0}, {\bf R} = {\bf R}^T > 0.$$ Our goal is to find the optimal cost-to-go function $J^*(\bx)$ which satisfies the HJB: $$\forall \bx, \quad 0 = \min_\bu \left[ \bx^T {\bf Q} \bx + \bu^T {\bf R} \bu + \pd{J^*}{\bx} \left( {\bf A}\bx + \bB\bu \right) \right].$$

There is one important step here -- it is well known that for this problem the optimal cost-to-go function is quadratic. This is easy to verify. Let us choose the form: $$J^*(\bx) = \bx^T {\bf S} \bx, \quad {\bf S} = {\bf S}^T \succeq 0.$$ The gradient of this function is $$\pd{J^*}{\bx} = 2 \bx^T {\bf S}.$$

Since we have guaranteed, by construction, that the terms inside the $\min$ are quadratic and convex (because ${\bf R} \succ 0$), we can take the minimum explicitly by finding the solution where the gradient of those terms vanishes: $$\pd{}{\bu} = 2\bu^T {\bf R} + 2 \bx^T {\bf S} \bB = 0.$$ This yields the optimal policy $$\bu^* = \pi^*(\bx) = - {\bf R}^{-1} \bB^T {\bf S} \bx = - \bK \bx.$$

Inserting this back into the HJB and simplifying yields $$0 = \bx^T
\left[ {\bf Q} - {\bf S B R}^{-1}\bB^T{\bf S} + 2{\bf SA} \right]\bx.$$ All
of the terms here are symmetric except for the $2{\bf SA}$, but since
$\bx^T{\bf SA}\bx = \bx^T{\bf A}^T{\bf S}\bx$, we can write $$0 = \bx^T
\left[ {\bf Q} - {\bf S B R}^{-1}\bB^T{\bf S} + {\bf SA} + {\bf A}^T{\bf S}
\right]\bx.$$ and since this condition must hold for all $\bx$, it is
sufficient to consider the matrix equation $$0 = {\bf S} {\bf A} + {\bf A}^T
{\bf S} - {\bf S} \bB {\bf R}^{-1} \bB^T {\bf S} + {\bf Q}.$$ This extremely
important equation is a version of the *algebraic Riccati equation*.
Note that it is quadratic in ${\bf S}$, making its solution non-trivial, but
it is well known that the equation has a single positive-definite solution
if and only if the system is controllable and there are good numerical
methods for finding that solution, even in high-dimensional problems. Both
the optimal policy and optimal cost-to-go function are available from
```
(K,S) =
LinearQuadraticRegulator(A,B,Q,R)
```

.

If the appearance of the quadratic form of the cost-to-go seemed mysterious, consider that the solution to the linear system $\dot\bx = (\bA - \bB\bK)\bx$ takes the form $\bx(t) = e^{(\bA-\bB\bK)t}\bx(0)$, and try inserting this back into the integral cost function.

It is worth examining the form of the optimal policy more closely. Since the value function represents cost-to-go, it would be sensible to move down this landscape as quickly as possible. Indeed, $-{\bf S}\bx$ is in the direction of steepest descent of the value function. However, not all directions are possible to achieve in state-space. $-\bB^T {\bf S} \bx$ represents precisely the projection of the steepest descent onto the control space, and is the steepest descent achievable with the control inputs $\bu$. Finally, the pre-scaling by the matrix ${\bf R}^{-1}$ biases the direction of descent to account for relative weightings that we have placed on the different control inputs. Note that although this interpretation is straight-forward, the slope that we are descending (in the value function, ${\bf S}$) is a complicated function of the dynamics and cost.

LQR is extremely relevant to us even though our primary interest is in nonlinear dynamics, because it can provide a local approximation of the optimal control solution for the nonlinear system. Given the nonlinear system $\dot{\bx} = f(\bx,\bu)$, and a stabilizable operating point, $(\bx_0,\bu_0)$, with $f(\bx_0,\bu_0) = 0.$ We can define a relative coordinate system $$\bar\bx = \bx - \bx_0, \quad \bar\bu = \bu - \bu_0,$$ and observe that $$\dot{\bar\bx} = \dot{\bx} = f(\bx,\bu),$$ which we can approximate with a first-order Taylor expansion to $$\dot{\bar\bx} \approx f(\bx_0,\bu_0) + \pd{f(\bx_0,\bu_0)}{\bx} (\bx - \bx_0) + \pd{f(\bx_0,\bu_0)}{\bu} (\bu - \bu_0) = {\bf A}\bar{\bx} + \bB\bar\bu.$$

Similarly, we can define a quadratic cost function in the error coordinates, or take a (positive-definite) second-order approximation of a nonlinear cost function about the operating point (linear and constant terms in the cost function can be easily incorporated into the derivation by parameterizing a full quadratic form for $J^*$, as seen in the Linear Quadratic Tracking derivation below).

The resulting controller takes the form $\bar\bu^* = -{\bf K}\bar\bx$
or $$\bu^* = \bu_0 - {\bf K} (\bx - \bx_0).$$ For convenience,
```
controller =
LinearQuadraticRegulator(system, context, Q, R)
```

on most dynamical
systems (including block diagrams built up of many subsystems); it will
perform the linearization for you.

Recall that the cost-to-go for finite-horizon problems is time-dependent, and therefore the HJB sufficiency condition requires an additional term for $\pd{J^*}{t}$. $$ \forall \bx, \forall t\in[t_0,t_f],\quad 0 = \min_\bu \left[ g(\bx,\bu) + \pd{J^*}{\bx}f(\bx,\bu) + \pd{J^*}{t} \right]. $$

Consider systems governed by an LTI state-space equation of the form
$$\dot{\bx} = {\bf A}\bx + \bB\bu,$$ and a finite-horizon cost function
with
\begin{gather*}
h(\bx) = \bx^T {\bf Q}_f \bx,\quad {\bf Q}_f = {\bf Q}_f^T \ge {\bf 0}
\\ g(\bx,\bu,t) = \bx^T {\bf Q} \bx + \bu^T {\bf R} \bu, \quad {\bf Q} =
{\bf Q}^T \ge 0, {\bf R}={\bf R}^T>0
\end{gather*}
Writing the HJB, we have $$ 0 = \min_\bu \left[\bx^T {\bf Q} \bx + \bu^T
{\bf R}\bu + \pd{J^*}{\bx} \left({\bf A} \bx + \bB \bu \right) +
\pd{J^*}{t} \right]. $$ Due to the positive definite quadratic form on
$\bu$, we can find the minimum by setting the gradient to zero:
\begin{gather*}
\pd{}{\bu} = 2 \bu^T {\bf R} + \pd{J^*}{\bx} \bB = 0 \\
\bu^* = \pi^*(\bx,t) = - \frac{1}{2}{\bf R}^{-1} \bB^T \pd{J^*}{\bx}^T
\end{gather*}
In order to proceed, we need to investigate a particular form for the
cost-to-go function, $J^*(\bx,t)$. Let's try a solution of the form:
$$J^*(\bx,t) = \bx^T {\bf S}(t) \bx, \quad {\bf S}(t) = {\bf S}^T(t) >
{\bf 0}.$$ In this case we have $$\pd{J^*}{\bx} = 2 \bx^T {\bf S}(t),
\quad \pd{J*}{t} = \bx^T \dot{\bf S}(t) \bx,$$ and therefore
\begin{gather*} \bu^* = \pi^*(\bx,t) = - {\bf R}^{-1} \bB^T {\bf S}(t) \bx
\\ 0 = \bx^T \left[ {\bf Q} - {\bf S}(t) \bB {\bf R}^{-1} \bB^T {\bf S}(t)
+ {\bf S}(t) {\bf A} + {\bf A}^T {\bf S}(t) + \dot{\bf S}(t) \right]
\bx.\end{gather*}
Therefore, ${\bf S}(t)$ must satisfy the condition (known as the
continuous-time *differential Riccati equation*): $$-\dot{\bf S}(t)
= {\bf S}(t) {\bf A} + {\bf A}^T {\bf S}(t) - {\bf S}(t) \bB {\bf R}^{-1}
\bB^T {\bf S}(t) + {\bf Q},$$ and the terminal condition $${\bf S}(T) =
{\bf Q}_f.$$ Since we were able to satisfy the HJB with the minimizing
policy, we have met the sufficiency condition, and have found the optimal
policy and optimal cost-to-go function.

Note that the infinite-horizon LQR solution described in the prequel is exactly the steady-state solution of this equation, defined by $\dot{\bf S}(t) = 0$. Indeed, for controllable systems this equation is stable (backwards in time), and as expected the finite-horizon solution converges on the infinite-horizon solution as the horizon time limits to infinity.

The derivation above holds even if the dynamics are given by $$\dot{\bx} = {\bf A}(t)\bx + {\bf B(t)}\bu.$$ Similarly, the cost functions ${\bf Q}$ and ${\bf R}$ can also be time-varying. This is quite surprising, as the class of time-varying linear systems is a quite general class of systems. It requires essentially no assumptions on how the time-dependence enters, except perhaps that if ${\bf A}$ or $\bB$ is discontinuous in time then one would have to use the proper techniques to accurately integrate the differential equation. As we will see in the chapter on trajectory optimization, one of the most powerful applications involves linearizing around a nominal trajectory of a nonlinear system and using LQR to provide a trajectory controller.

For completeness, we consider a slightly more general form of the linear quadratic regulator. The standard LQR derivation attempts to drive the system to zero. Consider now the problem: \begin{gather*} \dot{\bx} = {\bf A}\bx + \bB\bu \\ h(\bx) = (\bx - \bx^d(T))^T {\bf Q}_f (\bx - \bx^d(T)), \quad {\bf Q}_f = {\bf Q}_f^T \ge 0 \\ g(\bx,\bu,t) = (\bx - \bx^d(t))^T {\bf Q} (\bx - \bx^d(t)) + (\bu - \bu^d(t))^T {\bf R} (\bu - \bu^d(t)),\\ {\bf Q} = {\bf Q}^T \ge 0, {\bf R}={\bf R}^T>0 \end{gather*} Now, guess a solution \begin{gather*} J^*(\bx,t) = \bx^T {\bf S_2}(t) \bx + \bx^T {\bf s_1}(t) + s_0(t), \quad {\bf S_2}(t) = {\bf S_2}^T(t) > {\bf 0}. \end{gather*} In this case, we have $$\pd{J^*}{\bx} = 2 \bx^T {\bf S_2}(t) + {\bf s_1}^T(t),\quad \pd{J^*}{t} = \bx^T \dot{\bf S_2}(t) \bx + \bx^T \dot{\bf s_1}(t) + \dot{s_0}(t).$$ Using the HJB, $$ 0 = \min_\bu \left[(\bx - \bx^d(t))^T {\bf Q} (\bx - \bx^d(t)) + (\bu - \bu^d(t))^T {\bf R} (\bu - \bu^d(t)) + \pd{J^*}{\bx} \left({\bf A} \bx + \bB \bu \right) + \pd{J^*}{t} \right], $$ we have \begin{gather*} \pd{}{\bu} = 2 (\bu - \bu^d(t))^T{\bf R} + (2\bx^T{\bf S_2}(t) + {\bf s_1}^T(t))\bB = 0,\\ \bu^*(t) = \bu^d(t) - {\bf R}^{-1} \bB^T\left[{\bf S_2}(t)\bx + \frac{1}{2}{\bf s_1}(t)\right] \end{gather*} The HJB can be satisfied by integrating backwards \begin{align*} -\dot{\bf S_2}(t) =& {\bf Q} - {\bf S_2}(t) \bB {\bf R}^{-1} {\bf B}^T {\bf S_2}(t) + {\bf S_2}(t) {\bf A} + {\bf A}^T {\bf S_2}(t)\\ -\dot{\bf s_1}(t) =& -2 {\bf Q} \bx^d(t) + \left[{\bf A}^T - {\bf S}_2 \bB {\bf R}^{-1} \bB^T \right]{\bf s_1}(t) + 2{\bf S_2}(t) \bB \bu^d(t)\\ -\dot{s_0}(t) =& \bx^d(t)^T {\bf Q} \bx^d(t) - \frac{1}{4} {\bf s_1}^T(t) \bB {\bf R}^{-1} \bB^T {\bf s_1}(t) + {\bf s_1}(t)^T \bB \bu^d(t), \end{align*} from the final conditions \begin{align*} {\bf S_2}(T) =& {\bf Q}_f\\ {\bf s_1}(T) =& -2 {\bf Q}_f \bx^d(T) \\ s_0(T) =& \left[\bx^d(T)\right]^T {\bf Q}_f \left[ \bx^d(T) \right]. \end{align*} Notice that the solution for ${\bf S_2}$ is the same as the simpler LQR derivation, and is symmetric (as we assumed). Note also that $s_0(t)$ has no effect on control (even indirectly), and so can often be ignored.

A quick observation about the quadratic form, which might be helpful in debugging. We know that $J(x,t)$ must be uniformly positive. This is true iff ${\bf S}_2>0$ and $s_0 > \frac{1}{4} {\bf s}_1^T {\bf S}_2^{-1} {\bf s}_1$, which comes from evaluating the function at $x_{min}$ defined by $\pd{}{x} = 0$.

The finite-horizon LQR formulation can be used to impose a strict final boundary value condition by setting an infinite ${\bf Q}_f$. However, integrating the Riccati equation backwards from an infinite initial condition isn't very practical. To get around this, let us consider solving for ${\bf P}(t) = {\bf S}(t)^{-1}$. Using the matrix relation $\frac{d {\bf S}^{-1}}{dt} = - {\bf S}^{-1} \frac{d {\bf S}}{dt} {\bf S}^{-1}$, we have: $$-\dot{\bf P}(t) = - {\bf P}(t){\bf Q P}(t) + {\bf B R}^{-1} \bB - {\bf A P}(t) - {\bf P}(t){\bf A}^T,$$ with the final conditions $${\bf P}(T) = 0.$$ This Riccati equation can be integrated backwards in time for a solution.

It is very interesting, and powerful, to note that, if one chooses
${\bf Q} = 0$, therefore imposing no position cost on the trajectory
before time $T$, then this inverse Riccati equation becomes a linear ODE
which can be solved explicitly.

Optimal control provides a powerful framework for formulating control
problems using the language of optimization. But solving optimal control
problems for nonlinear systems is hard! In many cases, we don't really care
about finding the *optimal* controller, but would be satisfied with any
controller that is guaranteed to accomplish the specified task. In many
cases, we still formulate these problems using computational tools from
optimization, and in this chapter we'll learn about tools that can provide
guaranteed control solutions for systems that are beyond the complexity for
which we can find the optimal feedback.

There are many excellent books on Lyapunov analysis; for instance

Let's start with our favorite simple example.

Recall that the equations of motion of the damped simple pendulum are given by \[ ml^2 \ddot{\theta} + mgl\sin\theta = -b\dot{\theta}, \] which I've written with the damping on the right-hand side to remind us that it is an external torque that we've modeled.

These equations represent a simple second-order differential equation; in chapter 2 we discussed at some length what was known about the solutions to this differential equation--even without damping the equation for $\theta(t)$ as a function of $\theta(0)$, $\dot{\theta}(0)$, involves elliptic integrals of the first kind, and with damping we don't even have that. Since we couldn't provide a solution analytically, in chapter 2 we resorted to a graphical analysis, and confirmed the intuition that there are fixed points in the system (at $\theta = k\pi$ for every integer $k$) and that the fixed points at $\theta = 2\pi k$ are asymptotically stable with a large basin of attraction. The graphical analysis gave us this intuition, but can we actually prove this stability property? In a way that might also work for much more complicated systems?

One route forward was from looking at the total system energy (kinetic + potential), which we can write down: \[ E(\theta,\dot{\theta}) = \frac{1}{2} ml^2\dot{\theta}^2 - mgl \cos\theta. \] Recall that the contours of this energy function are the orbits of the undamped pendulum.

A natural route to proving the stability of the downward fixed points is by arguing that energy decreases for the damped pendulum (with $b>0$) and so the system will eventually come to rest at the minimum energy, $E = -mgl$, which happens at $\theta=2\pi k$. Let's make that argument slightly more precise.

Evaluating the time derivative of the energy reveals \[ \frac{d}{dt} E = - b\dot\theta^2 \le 0. \] This is sufficient to demonstrate that the energy will never increase, but it doesn't actually prove that the energy will converge to the minimum. To take that extra step, we must observe that for most of the states with $\dot{E} = 0$, which for $b>0$ is the manifold where $\dot\theta=0$, the system can't actually remain in that state, but will rather pass through and enter another region where it continues to lose energy. In fact, the fixed points are the only subset of this $\dot{E}=0$ manifold where the system can stay on the manifold. Therefore, unless the system is at a fixed point, it will continue to dissipate energy. And therefore we can conclude that as $t\rightarrow \infty$, the system will indeed come to rest at a fixed point (though it could be any fixed point with an energy less than or equal to the initial energy in the system, $E(0)$).

This is an important example. It demonstrated that we could use a relatively simple function, the system energy, to describe something about the long-term dynamics of the pendulum even though the actual trajectories of the system are (analytically) very complex. It also demonstrated one of the subtleties of using an energy-like function that is non-increasing (instead of strictly decreasing) to prove asymptotic stability.

Lyapunov functions generalize this notion of an energy function to more
general systems, which might not be stable in the sense of some mechanical
energy. If I can find any positive function, call it $V(\bx)$, that gets
smaller over time as the system evolves, then I can potentially use $V$ to
make a statement about the long-term behavior of the system. $V$ is called
a *Lyapunov function*.

Recall that we defined three separate notions for stability of a fixed-point of a nonlinear system: stability i.s.L., asymptotic stability, and exponential stability. We can use Lyapunov functions to demonstrate each of these, in turn.

Given a system $\dot{\bx} = f(\bx)$, with $f$ continuous, and for some region ${\cal B}$ around the origin (specifically an open subset of $\mathbf{R}^n$ containing the origin), if I can produce a scalar, continuously-differentiable function $V(\bx)$, such that \begin{gather*} V(\bx) > 0, \forall \bx \in {\cal B} \ne 0 \quad V(0) = 0, \text{ and} \\ \dot{V}(\bx) = \pd{V}{\bx} f(\bx) \le 0, \forall \bx \in {\cal B} \ne 0 \quad \dot{V}(0) = 0, \end{gather*} then the origin $(\bx = 0)$ is stable in the sense of Lyapunov (i.s.L.).

If, additionally, we have $$\dot{V}(\bx) = \pd{V}{\bx} f(\bx) < 0, \forall \bx \in {\cal B} \ne 0,$$ then the origin is (locally) asymptotically stable. And if we have $$\dot{V}(\bx) = \pd{V}{\bx} f(\bx) \le -\alpha V(x), \forall \bx \in {\cal B} \ne 0,$$ for some $\alpha>0$, then the origin is (locally) exponentially stable.

Note that for the sequel we will use the notation $V \succ 0$ to denote a
*positive-definite function*, meaning that $V(0)=0$ and $V(\bx)>0$
for all $\bx\ne0$ (and also $V \succeq 0$ for positive semi-definite, $V
\prec 0$ for negative-definite functions).

The intuition here is exactly the same as for the energy argument we made in the pendulum example: since $\dot{V}(x)$ is always zero or negative, the value of $V(x)$ will only get smaller (or stay the same) as time progresses. Inside the subset ${\cal B}$, for every $\epsilon$-ball, I can choose a $\delta$ such that $|x(0)|^2 < \delta \Rightarrow |x(t)|^2 < \epsilon, \forall t$ by choosing $\delta$ sufficiently small so that the sublevel-set of $V(x)$ for the largest value that $V(x)$ takes in the $\delta$ ball is completely contained in the $\epsilon$ ball. Since the value of $V$ can only get smaller (or stay constant) in time, this gives stability i.s.L.. If $\dot{V}$ is strictly negative away from the origin, then it must eventually get to the origin (asymptotic stability). The exponential condition is implied by the fact that $\forall t>0, V(\bx(t)) \le V(\bx(0)) e^{-\alpha t}$.

Notice that the system analyzed above, $\dot{\bx}=f(\bx)$, did not have any control inputs. Therefore, Lyapunov analysis is used to study either the passive dynamics of a system or the dynamics of a closed-loop system (system + control in feedback). We will see generalizations of the Lyapunov functions to input-output systems later in the text.

The notion of a fixed point being stable i.s.L. is inherently a local notion of stability (defined with $\epsilon$- and $\delta$- balls around the origin), but the notions of asymptotic and exponential stability can be applied globally. The Lyapunov theorems work for this case, too, with only minor modification.

Given a system $\dot{\bx} = f(\bx)$, with $f$ continuous, if I can produce a scalar, continuously-differentiable function $V(\bx)$, such that \begin{gather*} V(\bx) \succ 0, \\ \dot{V}(\bx) = \pd{V}{\bx} f(\bx) \prec 0, \text{ and} \\ V(\bx) \rightarrow \infty \text{ whenever } ||x||\rightarrow \infty,\end{gather*} then the origin $(\bx = 0)$ is globally asymptotically stable (G.A.S.).

If additionally we have that $$\dot{V}(\bx) \preceq -\alpha V(\bx),$$ for some $\alpha>0$, then the origin is globally exponentially stable.

The new condition, on the behavior as $||\bx|| \rightarrow \infty$ is known as "radially unbounded", and is required to make sure that trajectories cannot diverge to infinity even as $V$ decreases; it is only required for global stability analysis.

Perhaps you noticed the disconnect between the statement above and the argument that we made for the stability of the pendulum. In the pendulum example, using the mechanical energy resulted in a Lyapunov function that was only negative semi-definite, but we eventually argued that the fixed points were asymptotically stable. That took a little extra work, involving an argument about the fact that the fixed points were the only place that the system could stay with $\dot{E}=0$; every other state with $\dot{E}=0$ was only transient. We can formalize this idea for the more general Lyapunov function statements--it is known as LaSalle's Theorem.

Given a system $\dot{\bx} = f(\bx)$ with $f$ continuous. If we can
produce a scalar function $V(\bx)$ with continuous derivatives for
which we have $$V(\bx) \succ 0,\quad \dot{V}(\bx) \preceq 0,$$ and
$V(\bx)\rightarrow \infty$ as $||\bx||\rightarrow \infty$, then $\bx$
will converge to the largest *invariant set* where $\dot{V}(\bx) =
0$.

To be clear, an *invariant set*, ${\cal G}$, of the dynamical
system is a set for which $\bx(0)\in{\cal G} \Rightarrow \forall t>0,
\bx(t) \in {\cal G}$. In other words, once you enter the set you never
leave. The "largest invariant set" need not be connected; in fact for the
pendulum example each fixed point is an invariant set, so the largest
invariant set is the *union* of all the fixed points of the
system. There are also variants of LaSalle's Theorem which work over
a region.

Finding a Lyapunov function which $\dot{V} \prec 0$ is more difficult
than finding one that has $\dot{V} \preceq 0$. LaSalle's theorem gives us
the ability to make a statement about *asymptotic* stability even in
this case. In the pendulum example, every state with $\dot\theta=0$ had
$\dot{E}=0$, but only the fixed points are in the largest invariant
set.

At this point, you might be wondering if there is any relationship between Lyapunov functions and the cost-to-go functions that we discussed in the context of dynamic programming. After all, the cost-to-go functions also captured a great deal about the long-term dynamics of the system in a scalar function. We can see the connection if we re-examine the HJB equation \[ 0 = \min_\bu \left[ g(\bx,\bu) + \pd{J^*}{\bx}f(\bx,\bu). \right] \]Let's imagine that we can solve for the optimizing $\bu^*(\bx)$, then we are left with $ 0 = g(\bx,\bu^*) + \pd{J^*}{\bx}f(\bx,\bu^*) $ or simply \[ \dot{J}^*(\bx) = -g(\bx,\bu^*) \qquad \text{vs} \qquad \dot{V}(\bx) \preceq 0. \] In other words, in optimal control we must find a cost-to-go function which matches this gradient for every $\bx$; that's very difficult and involves solving a potentially high-dimensional partial differential equation. By contrast, Lyapunov analysis is asking for much less - any function which is going downhill (at any rate) for all states. This can be much easier, for theoretical work, but also for our numerical algorithms. Also note that if we do manage to find the optimal cost-to-go, $J^*(\bx)$, then it can also serve as a Lyapunov function so long as $g(\bx,\bu^*(\bx)) \succeq 0$.

One of the primary limitations in Lyapunov analysis as I have presented it so far is that it is potentially very difficult to come up with suitable Lyapunov function candidates for interesting systems, especially for underactuated systems. ("Underactuated" is almost synonymous with "interesting" in my vocabulary.) Even if somebody were to give me a Lyapunov candidate for a general nonlinear system, the Lyapunov conditions can be difficult to check -- for instance, how would I check that $\dot{V}$ is strictly negative for all $\bx$ except the origin if $\dot{V}$ is some arbitrarily complicated nonlinear function over a vector $\bx$?

In this section, we'll look at some computational approaches to verifying the Lyapunov conditions, and even to searching for the description of the Lyapunov functions themselves.

If you're imagining numerical algorithms to check the Lyapunov conditions
for complicated Lyapunov functions and complicated dynamics, the first
thought is probably that we can evaluate $V$ and $\dot{V}$ at a large number
of sample points and check whether $V$ is positive and $\dot{V}$ is
negative. This
does work, and could potentially be combined with some smoothness or
regularity assumptions to generalize beyond the sample points. *for all $\bx$* without
dense sampling; these will also give us additional leverage in formulating
the search for Lyapunov functions.

Let's take a moment to see how things play out for linear systems.

Imagine you have a linear system, $\dot\bx = {\bf A}\bx$, and can find a Lyapunov function $$V(\bx) = \bx^T {\bf P} \bx, \quad {\bf P} = {\bf P^T} \succ 0,$$ which also satisfies $$\dot{V}(\bx) = \bx^T {\bf PA} \bx + \bx^T {\bf A}^T {\bf P}\bx \prec 0.$$ Then the origin is globally asymptotically stable.

Note that the radially-unbounded condition is satisfied by ${\bf P} \succ 0$, and that the derivative condition is equivalent to the matrix condition $${\bf PA} + {\bf A}^T {\bf P} \prec 0.$$

For stable linear systems the existence of a quadratic Lyapunov function is actually a necessary (as well as sufficient) condition. Furthermore, a Lyapunov function can always be found by finding the positive-definite solution to the matrix Lyapunov equation \begin{equation}{\bf PA} + {\bf A}^T{\bf P} = - {\bf Q},\label{eq:algebraic_lyapunov} \end{equation} for any ${\bf Q}={\bf Q}^T\succ 0$.

This is a very powerful result - for nonlinear systems it will be potentially difficult to find a Lyapunov function, but for linear systems it is straight-forward. In fact, this result is often used to propose candidates for non-linear systems, e.g., by linearizing the equations and solving a local linear Lyapunov function which should be valid in the vicinity of a fixed point.

Lyapunov analysis for linear systems has an extremely important
connection to convex optimization. In particular, we could have also
formulated the Lyapunov conditions for linear systems above using
*semi-definite programming* (SDP). Semidefinite programming is a
subset of convex optimization -- an extremely important class of problems
for which we can produce efficient algorithms that are guaranteed find the
global optima solution (up to a numerical tolerance and barring any
numerical difficulties).

If you don't know much about convex optimization or want a quick refresher, please take a few minutes to read the optimization preliminaries in the appendix. The main requirement for this section is to appreciate that it is possible to formulate efficient optimization problems where the constraints include specifying that one or more matrices are positive semi-definite (PSD). These matrices must be formed from a linear combination of the decision variables. For a trivial example, the optimization $$\min_a a,\quad \subjto \begin{bmatrix} a & 0 \\ 0 & 1 \end{bmatrix} \succeq 0,$$ returns $a = 0$ (up to numerical tolerances).

The value in this is immediate for linear systems. For example, we can
formulate the search for a Lyapunov function for the linear system
$\dot\bx = {\bf A} \bx$ by using the parameters ${\bf p}$ to populate a
symmetric matrix ${\bf P}$ and then write the SDP: \[ \find_{\bf p} \quad
\subjto \quad {\bf P} \succeq 0, \quad {\bf PA} + {\bf A}^T {\bf P}
\preceq 0.\] Note that you would probably never use that particular
formulation, since there specialized algorithms for solving the simple
Lyapunov equation which are more efficient and more numerically stable.
But the SDP formulation does provide something new -- we can now easily
formulate the search for a *"common Lyapunov function"* for uncertain
linear systems.

Suppose you have a system governed by the equations $\dot\bx = {\bf A}\bx$, where the matrix ${\bf A}$ is unknown, but its uncertain elements can be bounded. There are a number of ways to write down this uncertainty set; let us choose to write this by describing ${\bf A}$ as the convex combination of a number of known matrices, $${\bf A} = \sum_{i} \beta_i {\bf A}_i, \quad \sum_i \beta_i = 1, \quad \forall i, \beta_i > 0.$$ This is just one way to specify the uncertainty; geometrically it is describing a polygon of uncertain parameters (in the space of elements of ${\bf A}$ with each ${\bf A}_i$ as one of the vertices in the polygon.

Now we can formulate the search for a common Lyapunov function using \[ \find_{\bf p} \quad \subjto \quad {\bf P} \succeq 0, \quad \forall_i, {\bf PA}_i + {\bf A}_i^T {\bf P} \preceq 0.\] The solver will then return a matrix ${\bf P}$ which satisfies all of the constraints, or return saying "problem is infeasible". It can easily be verified that if ${\bf P}$ satisfies the Lyapunov condition at all of the vertices, then it satisfies the condition for every ${\bf A}$ in the set: \[ {\bf P}(\sum_i \beta_i {\bf A}_i) + (\sum_i \beta_i {\bf A}_i)^T {\bf P} = \sum_i \beta_i ({\bf P A}_i + {\bf A}_i^T {\bf P}) \preceq 0, \] since $\forall i$, $\beta_i > 0$. Note that, unlike the simple Lyapunov equation for a known linear system, this condition being satisfied is a sufficient but not a necessary condition -- it is possible that the set of uncertain matrices ${\bf A}$ is robustly stable, but that this stability cannot be demonstrated with a common quadratic Lyapunov function.

You can try this example for yourself in

As always, make sure that you open up the code and take a look.

This example is very important because it establishes a connection between Lyapunov functions and (convex) optimization. But so far we've only demonstrated this connection for linear systems where the PSD matrices provide a magical recipe for establishing the positivity of the (quadratic) functions for all $\bx$. Is there any hope of extending this type of analysis to more general nonlinear systems? Surprisingly, it turns out that there is.

It turns out that in the same way that we can use SDP to search over
the positive quadratic equations, we can generalize this to search over
the positive polynomial equations. To be clear, for quadratic equations
we have \[ {\bf P}\succeq 0 \quad \Rightarrow \quad \bx^T {\bf P} \bx \ge
0, \quad \forall \bx. \] It turns out that we can generalize this to \[
{\bf P}\succeq 0 \quad \Rightarrow \quad {\bf m}^T(\bx) {\bf P} {\bf
m}(\bx) \ge 0, \quad \forall \bx, \] where ${\bf m}(\bx)$ is a vector of
polynomial equations, typically chosen as a vector of *monomials*
(polynomials with only one term). The set of positive polynomials
parameterized in this way is exactly the set of polynomials that can be
written as a *sum of squares*

Even better, there is quite a bit that is known about how to choose
the terms in ${\bf m}(\bx)$. For example, if you give me the polynomial
\[ p(x) = 2 - 4x + 5x^2 \] and ask if it is positive for all real $x$, I
can convince you that it is by producing the sums-of-squares
factorization \[ p(x) = 1 + (1-2x)^2 + x^2, \] and I know that I can
formulate the problem without needing any monomials of degree greater
than 1 (the square-root of the degree of $p$) in my monomial vector. In
practice, what this means for us is that people have authored
optimization front-ends which take a high-level specification with
constraints on the positivity of polynomials and they automatically
generate the SDP problem for you, without having to think about what
terms should appear in ${\bf m}(\bx)$ (c.f.

As it happens, the particular choice of ${\bf m}(\bx)$ can have a huge
impact on the numerics of the resulting semidefinite program (and on your
ability to solve it with commercial solvers).

We will write optimizations using sums-of-squares constraints as \[ p(\bx) \sos \] as shorthand for the constraint that $p(\bx) \ge 0$ for all $\bx$, as demonstrated by finding a sums-of-squares decomposition.

This is surprisingly powerful. It allows us to use convex optimization to solve what appear to be very non-convex optimization problems.

Consider the following famous non-linear function of two variables
(called the "six-hump-camel": $$p(x) = 4x^2 + xy - 4y^2 - 2.1x^4 + 4y^4
+ \frac{1}{3}x^6.$$ This function has six local minima, two of them
being global minima

By formulating a simple sums-of-squares optimization, we can actually find the minimum value of this function (technically, it is only a lower bound, but in this case and many cases, it is surprisingly tight) by writing: \begin{align*} \max_\lambda \ \ & \lambda \\ \text{s.t. } & p(x) - \lambda \text{ is sos.} \end{align*} Go ahead and play with the code (most of the lines are only for plotting; the actual optimization problem is nice and simple to formulate).

Note that this finds the minimum value, but does not actually
produce the $\bx$ value which mimimizes it. This is possible

Now we can see that it may be possible to generalize the optimization
approach using SDP to search for Lyapunov functions for linear systems to
searching for Lyapunov functions for at least the polynomial systems:
$\dot\bx = f(\bx),$ where $f$ is a vector-valued polynomial function. If
we parameterize a *fixed-degree* Lyapunov candidate as a polynomial
with unknown coefficients, e.g., \[ V_\alpha(\bx) = \alpha_0 + \alpha_1
x_1 + \alpha_2 x_2 + \alpha_3 x_1x_2 + \alpha_4 x_1^2 + ..., \] then the
time-derivative of $V$ is also a polynomial, and I can formulate the
optimization: \begin{align*} \find_\alpha, \quad \subjto \quad&
V_\alpha(\bx) \sos \\ & -\dot{V}_\alpha(\bx) = -\pd{V_\alpha}{\bx} f(\bx)
\sos. \end{align*} Because this is a convex optimization, the solver will
return a solution if one exists.

This example is example 7.2 from *fixed
* Lyapunov function $V(x) = x_0^2 + 2x_1^2$, test if $\dot{V}(x)$ is
negative definite.

The numerical solution can be written in a few lines of code, and is a convex optimization.

Verifying a candidate Lyapunov function is all well and good, but
the real excitement starts when we use optimization to *find* the
Lyapunov function. In the following code, we parameterize $V(x)$ as
the polynomial containing all monomials up to degree 2, with the
coefficients as decision variables: $$V(x) = c_0 + c_1x_0 + c_2x_1 +
c_3x_0^2 + c_4 x_0x_1 + c_5 x_1^2.$$ We will set the scaling
(arbitrarily) to avoid numerical issues by setting $V(0)=0$, $V([1,0])
= 1$. Then we write: \begin{align*} \find_{\bc} \ \ \subjto \ \ &
V\text{ is sos, } \\ & -\dot{V} \text{ is sos.}\end{align*}

Up to numerical convergence tolerance, it discovers the same coefficients that we chose above (zeroing out the unnecessary terms).

It is important to remember that there are a handful of gaps which make
the existence of this solution a sufficient condition (for proving that
every sub-level set of $V$ is an invariant set of $f$) instead of a
necessary one. First, there is no guarantee that a stable polynomial
system can be verified using a polynomial Lyapunov function (of any
degree, and in fact there are known counter-examples

Despite these caveats, I have found this formulation to be surprisingly effective in practice. Intuitively, I think that this is because there is relatively a lot of flexibility in the Lyapunov conditions -- if you can find one function which is a Lyapunov function for the system, then there are also many "nearby" functions which will satisfy the same constraints.

There is another very important connection between Lyapunov functions and
the concept of an invariant set: *any sub-level set of a Lyapunov
function is also an invariant set*. This gives us the ability to use
sub-level sets of a Lyapunov function as approximations of the region of
attraction for nonlinear systems.

Given a system $\dot{\bx} = f(\bx)$ with $f$ continuous, if we can find a scalar function $V(\bx) \succ 0$ and a sub-level set $${\cal G}: \{ \bx | V(\bx) < \rho \}$$ on which $$\forall \bx \in {\cal G}, \dot{V}(\bx) \preceq 0,$$ then ${\cal G}$ is an invariant set. By LaSalle, $\bx$ will converge to the largest invariant subset of ${\cal G}$ on which $\dot{V}=0$.

Furthermore, if $\dot{V}(\bx) \prec 0$ in ${\cal G}$, then the origin is locally asymptotically stable and the set ${\cal G}$ is inside the region of attraction of this fixed point. Alternatively, if $\dot{V}(\bx) \preceq 0$ in ${\cal G}$ and $\bx = 0$ is the only invariant subset of ${\cal G}$ where $\dot{V}=0$, then the origin is asymptotically stable and the set ${\cal G}$ is inside the region of attraction of this fixed point.

Consider the first-order, one-dimensional system $\dot{x} = -x + x^3.$ We can quickly understand this system using our tools for graphical analysis.

In the vicinity of the origin, $\dot{x}$ looks like $-x$, and as we move away it looks increasingly like $x^3$. There is a stable fixed point at the origin and unstable fixed points at $\pm 1$. In fact, we can deduce visually that the region of attraction to the stable fixed point at the origin is $\bx \in (-1,1)$. Let's see if we can demonstrate this with a Lyapunov argument.

First, let us linearize the dynamics about the origin and use the Lyapunov equation for linear systems to find a candidate $V(\bx)$. Since the linearization is $\dot{x} = -x$, if we take ${\bf Q}=1$, then we find ${\bf P}=\frac{1}{2}$ is the positive definite solution to the algebraic Lyapunov equation (\ref{eq:algebraic_lyapunov}). Proceeding with $$V(\bx) = \frac{1}{2} x^2,$$ we have $$\dot{V} = x (-x + x^3) = -x^2 + x^4.$$ This function is zero at the origin, negative for $|x| < 1$, and positive for $|x| > 1$. Therefore we can conclude that the sub-level set $V < \frac{1}{2}$ is invariant and the set $x \in (-1,1)$ is inside the region of attraction of the nonlinear system. In fact, this estimate is tight.

While we will defer most discussions on robustness analysis until later
in the notes, there is a simple and powerful idea that can be presented
now: the idea of a *common Lyapunov function*. Imagine that you
have a model of a dynamical system but that you are uncertain about some
of the parameters. For example, you have a model of a quadrotor, and are
fairly confident about the mass and lengths (both of which are easy to
measure), but are not confident about the moment of inertia. One approach
to robustness analysis is to define a bounded uncertainty, which could
take the form of $$\dot{\bx} = f_\alpha(\bx), \quad \alpha_{min} \le
\alpha \le \alpha_{max},$$ with $\alpha$ a vector of uncertain parameters
in your model. Richer specifications of the uncertainty bounds are also
possible, but this will serve for our examples.

In standard Lyapunov analysis, we are searching for a function that
goes downhill for all $\bx$ to make statements about the long-term
dynamics of the system. In common Lyapunov analysis, we can make many
similar statements about the long-term dynamics of an uncertain system if
we can find a single Lyapunov function that goes downhill *for all
possible values of $\alpha$*. If we can find such a function, then we
can use it to make statements with all of the variations we've discussed
(local, regional, or global; in the sense of Lyapunov, asymptotic, or
exponential).

Let's consider the same one-dimensional example used above, but add
an uncertain parameter into the dynamics. In particular, consider the
system: $$\dot{x} = -x + \alpha x^3, \quad \frac{3}{4} < \alpha <
\frac{3}{2}.$$ Plotting the graph of the one-dimensional dynamics for a
few values of $\alpha$, we can see that the fixed point at the origin is
still stable, but the *robust region of attraction* to this fixed
point (shaded in blue below) is smaller than the region of attraction
for the system with $\alpha=1$.

Taking the same Lyapunov candidate as above, $V = \frac{1}{2} x^2$, we have $$\dot{V} = -x^2 + \alpha x^4.$$ This function is zero at the origin, and negative for all $\alpha$ whenever $x^2 > \alpha x^4$, or $$|x| < \frac{1}{\sqrt{\alpha_{max}}} = \sqrt{\frac{2}{3}}.$$ Therefore, we can conclude that $|x| < \sqrt{\frac{2}{3}}$ is inside the robust region of attraction of the uncertain system.

Not all forms of uncertainty are as simple to deal with as the gain
uncertainty in that example. For many forms of uncertainty, we might not
even know the location of the fixed points of the uncertain system. In
this case, we can often still use common Lyapunov functions to give some
guarantees about the system, such as guarantees of *robust set
invariance*. For instance, if you have uncertain parameters on a
quadrotor model, you might be ok with the quadrotor stabilizing to a pitch
of $0.01$ radians, but you would like to guarantee that it definitely does
not flip over and crash into the ground.

Now consider the system: $$\dot{x} = -x + x^3 + \alpha, \quad -\frac{1}{4} < \alpha < \frac{1}{4}.$$ Plotting the graph of the one-dimensional dynamics for a few values of $\alpha$, this time we can see that the fixed point is not necessarily at the origin; the location of the fixed point moves depending on the value of $\alpha$. But we should be able to guarantee that the uncertain system will stay near the origin if it starts near the origin, using an invariant set argument.

Taking the same Lyapunov candidate as above, $V = \frac{1}{2} x^2$,
we have $$\dot{V} = -x^2 + x^4 + \alpha x.$$ This Lyapunov function
allows us to easily verify, for instance, that $V \le \frac{1}{3}$ is a
*robust invariant set*, because whenever $V = \frac{1}{3}$, we
have $$\forall \alpha \in [\alpha_{min},\alpha_{max}],\quad
\dot{V}(x,\alpha) < 0.$$ Therefore $V$ can never start at less than
one-third and cross over to greater than one-third. To see this, see
that $$ V=\frac{1}{3} \quad \Rightarrow \quad x = \pm \sqrt{\frac{2}{3}}
\quad \Rightarrow \quad \dot{V} = -\frac{2}{9} \pm \alpha
\sqrt{\frac{2}{3}} < 0, \forall \alpha \in
\left[-\frac{1}{4},\frac{1}{4} \right]. $$ Note that not all sub-level
sets of this invariant set are invariant. For instance $V <
\frac{1}{32}$ does not satisfy this condition, and by visual inspection
we can see that it is in fact not robustly invariant.

Now we have arrived at the tool that I believe can be a work-horse for many serious robotics applications. Most of our robots are not actually globally stable (that's not because they are robots -- if you push me hard enough, I will fall down, too), which means that understanding the regions where a particular controller can be guaranteed to work can be of critical importance.

Sums-of-squares optimization effectively gives us an oracle which we can ask: is this polynomial positive for all $\bx$? To use this for regional analysis, we have to figure out how to modify our questions to the oracle so that the oracle will say "yes" or "no" when we ask if a function is positive over a certain region which is a subset of $\Re^n$. That trick is called the S-procedure. It is closely related to the Lagrange multipliers from constrained optimization, and has deep connections to "Positivstellensatz" from algebraic geometry.

Consider a scalar polynomial, $p(\bx)$, and a semi-algebraic set $g(\bx) \preceq 0$, where $g$ is a vector of polynomials. If I can find a polynomial "multiplier", $\lambda(\bx)$, such that \[ p(\bx) + \lambda^T(\bx) g(\bx) \sos, \quad \text{and} \quad \lambda(\bx) \sos, \] then this is sufficient to demonstrate that $$p(\bx)\ge 0 \quad \forall \bx \in \{ g(\bx) \le 0 \}.$$ To convince yourself, observe that when $g(\bx) \le 0$, it is only harder to be positive, but when $g(\bx) > 0$, it is possible for the combined function to be SOS even if $p(\bx)$ is negative.

We can also handle equality constraints with only a minor modification -- we no longer require the multiplier to be positive. If I can find a polynomial "multiplier", $\lambda(\bx)$, such that \[ p(\bx) + \lambda^T(\bx) g(\bx) \sos \] then this is sufficient to demonstrate that $$p(\bx)\ge 0 \quad \forall \bx \in \{ g(\bx) = 0 \}.$$ Here the intuition is that $\lambda(x)$ can add arbitrary positive terms to help me be SOS, but those terms contribute nothing precisely when $g(x)=0$.

Let's return to our example from above: \[ \dot{x} = -x + x^3 \] and try to use SOS optimization to demonstrate that the region of attraction of the fixed point at the origin is $x \in (-1,1)$, using the Lyapunov candidate $V = x^2.$

First, define the multiplier polynomial, \[ \lambda(x) = c_0 + c_1 x + c_2 x^2 + c_3 x^3 + c_4 x^4. \] Then define the optimization \begin{align*} \find_{\bf c} \quad & \\ \subjto \quad& - \dot{V}(x) - \lambda(x) (1-V(x)) \sos \\ & \lambda(x) \sos \end{align*}

You can try this example for yourself in

While the example above only verifies that the one-sub-level set of the pre-specified Lyapunov candidate is negative (certifying the ROA that we already understood), we can generalize the optimization to allow us to search for the largest sub-level set (with the objective using a convex approximation for volume). We can even search for the coefficients of the Lyapunov function using an iteration of convex optimizations. There are a number of variations and nuances in the various formulations, and some basic rescaling tricks that can help make the numerics of the problem better for the solvers.

`regionOfAttraction`

method that is defined for objects of
type `PolynomialSystem`

. This makes it as simple as, for
instance:

```
cd(fullfile(getDrakePath,'examples'));
% create a new CubicPolynomialExample object
p = CubicPolynomialExample();
% compute region of attraction
% the levelset V<1 is the region of attraction
V=regionOfAttraction(p,Point(p.getStateFrame,0));
% display the polynomial representation of V that results
display(V.getPoly);
```

which yields the console output:
```
[ (1)*x1^2 ]
```

The algorithm successfully found that $x \in (-1,1)$ is the region of attraction.
Remember that although we have tried to make it convenient to call
these functions, they are not a black box. I highly recommend opening
up the `regionOfAttraction`

method and understanding how it
works.

```
cd(fullfile(getDrakePath,'examples'));
VanDerPol.run();
```

LQR gives the cost-to-go which can be used as the Lyapunov candidate. Otherwise, use a Lyapunov equation. You may not even need to search for a better Lyapunov function, but rather just ask the question: what is the largest region of attraction that can be demonstrated for the nonlinear system using the Lyapunov function from linear analysis?

Coming soon... For a nice explanation of how rigid-body kinematics are
polynomial, see

We opened this chapter using our intuition about energy to discuss stability on the simple pendulum. Now we'll replace that intuition with convex optimization (because it will also work for more difficult systems where our intuition fails).

But wait! The equations of motion of the simple pendulum are not
polynomial, are they (they have a $\sin(\theta)$ in them)? Actually
nearly all rigid-body systems can be written as polynomials given a
change of coordinates. Indeed, the rigid-body-assumption is simply an
assumption that points on the same body stay a constant distance apart,
and Euclidean distance is certainly a polynomial. For a more thorough
discussion see, for instance,

For the purposes of this example, let's change coordinates from $[\theta,\dot\theta]^T$ to $\bx = [s,c,\dot\theta]^t$, where $s \equiv \sin\theta$ and $c \equiv \cos\theta$. Then we can write the pendulum dynamics as $$\dot\bx = \begin{bmatrix} c \dot\theta \\ -s \dot\theta \\ -\frac{1}{m l^2} \left( b \dot\theta + mgls \right) \end{bmatrix}.$$

Now let's parameterize a Lyapunov candidate $V(s,c,\dot\theta)$ as
the polynomial with unknown coefficients which contains all monomials
up to degree 2: $$V = \alpha_0 + \alpha_1 s + \alpha_2 c + ...
\alpha_{9} s^2 + \alpha_{10} sc + \alpha_{11} s\dot\theta.$$ Now we'll
formulate the feasibility problem: \[ \find_{\bf \alpha} \quad \subjto
\quad V \sos, \quad -\dot{V} \sos.\] In fact, this is asking too much
-- really $\dot{V}$ only needs to be negative when $s^2+c^2=1$. We can
accomplish this with the S-procedure, and instead write \[ \find_{{\bf
\alpha},\lambda} \quad \subjto \quad V \sos, \quad -\dot{V}
-\lambda(\bx)(s^2+c^2-1) \sos.\] (Recall that $\lambda(\bx)$ is another
polynomial with free coefficients which the optimization can use to
make terms arbitrarily more positive when $s^2+c^2 \neq 1$.) Finally,
for style points, in the code example in

As always, make sure that you open up the code and take a look. The result is a Lyapunov function that looks familiar (visualized as a contour plot here):

Aha! Not only does the optimization finds us coefficients for the Lyapunov function which satisfy our Lyapunov conditions, but the result looks a lot like mechanical energy. In fact, the result is a little better than energy... there are some small extra terms added which prove exponential stability without having to invoke LaSalle's Theorem.

In practice, you can also Taylor approximate any smooth nonlinear function using polynomials. This can be an effective strategy in practice, because you can limit the degrees of the polynomial, and because in many cases it is possible to provide conservative bounds on the errors due to the approximation.

I've argued that optimal control is a powerful framework for specifying complex behaviors with simple objective functions, letting the dynamics and constraints on the system shape the resulting feedback controller (and vice versa!). But the computational tools that we've provided so far have been limited in some important ways. The numerical approaches to dynamic programming which involve putting a mesh over the state space do not scale well to systems with state dimension more than four or five. Linearization around a nominal operating point (or trajectory) allowed us to solve for locally optimal control policies (e.g. using LQR) for even very high-dimensional systems, but the effectiveness of the resulting controllers is limited to the region of state space where the linearization is a good approximation of the nonlinear dynamics. The computational tools for Lyapunov analysis from the last chapter can provide, among other things, an effective way to compute estimates of those regions. But we have not yet provided any real computational tools for approximate optimal control that work for high-dimensional systems beyond the linearization around a goal. That is precisely the goal for this chapter.

The big change that is going to allow us to scale to high-dimensional
systems is that we are going to give up the goal of solving for the optimal
feedback controller for the entire state space, and instead attempt to find
an optimal control solution that is valid from only a single initial
condition. Instead of representing this as a feedback control function, we
can represent this solution as a *trajectory*, $\bx(t), \bu(t)$,
typically defined over a finite interval.

Given an initial condition, $\bx_0$, and an input trajectory $\bu(t)$ defined over a finite interval, $t\in[t_0,t_f]$, we can compute the long-term (finite-horizon) cost of executing that trajectory using the standard additive-cost optimal control objective, \[ J_{\bu(\cdot)}(\bx_0) = \int_{t_0}^{t_f} g(\bx(t),\bu(t)) dt. \] We will write the trajectory optimization problem as \begin{align*} \min_{\bu(\cdot)} \quad & \int_{t_0}^{t_f} g(\bx(t),\bu(t)) dt \\ \subjto \quad & \forall t, \dot{\bx}(t) = f(\bx(t),\bu(t)), \\ & \bx(t_0) = \bx_0. \\ \end{align*} Some trajectory optimization problems may also include additional constraints, such as collision avoidance (e.g., where the constraint is that the signed distance between the robot's geometry and the obstacles stays positive) or input limits (e.g. $\bu_{min} \le \bu \le \bu_{max}$ ), which can be defined for all time or some subset of the trajectory.

As written, the optimization above is an optimization over continuous
trajectories. In order to formulate this as a numerical optimization, we
must parameterize it with a finite set of numbers. Perhaps not
surprisingly, there are many different ways to write down this
parameterization, with a variety of different properties in terms of speed,
robustness, and accuracy of the results. We will outline just a few of the
most popular below; I would recommend

Before we dive in, we need to take a moment to understand the optimization tools that we will be using. In the graph-search dynamic programming algorithm, we were magically able to provide an iterative algorithm that was known to converge to the optimal cost-to-go function. With LQR we were able to reduce the problem to a matrix Riccati equation, for which we have scalable numerical methods to solve. In the Lyapunov analysis chapter, we were able to formulate a very specific kind of optimization problem--a semi-definite program (or SDP)--which is a subset of convex optimization, and relied on custom solvers like Gurobi or Mosek to solve the problems for us. Convex optimization is a hugely important subset of nonlinear optimization, in which we can guarantee that the optimization has no "local minima". In this chapter we won't be so lucky; most of the optimizations that we formulate may have local minima and the solution techniques will at best only guarantee that they give a locally optimal solution.

The generic formulation of a nonlinear optimization problem is \[
\min_z c(z) \quad \subjto \quad \bphi(z) \le 0, \] where $z$ is a
vector of *decision variables*, $c$ is a scalar *objective
function* and $\phi$ is a vector of *constraints*. Note
that, although we write $\phi \le 0$, this formulation captures
positivity constraints on the decision variables (simply multiply the
constraint by $-1$) and equality constraints (simply list both $\phi\le0$
and $-\phi\le0$) as well.

The picture that you should have in your head is a nonlinear, potentially non-convex objective function defined over (multi-dimensional) $z$, with a subset of possible $z$ values satisfying the constraints.

Note that minima can be the result of the objective function having zero
derivative *or* due to a sloped objective up against a
constraint.

Numerical methods for solving these optimization problems require an
initial guess, $\hat{z}$, and proceed by trying to move down the objective
function to a minima. Common approaches include *gradient descent*,
in which the gradient of the objective function is computed or estimated, or
second-order methods such as *sequential quadratic programming (SQP)*
which attempts to make a local quadratic approximation of the objective
function and local linear approximations of the constraints and solves a
quadratic program on each iteration to jump directly to the minimum of the
local approximation.

While not strictly required, these algorithms can often benefit a great
deal from having the gradients of the objective and constraints computed
explicitly; the alternative is to obtain them from numerical
differentiation. Beyond pure speed considerations, I strongly prefer to
compute the gradients explicitly because it can help avoid numerical
accuracy issues that can creep in with finite difference methods. The
desire to calculate these gradients will be a major theme in the discussion
below, and we have gone to great lengths to provide explicit gradients of
our provided functions and automatic differentiation of user-provided
functions in

When I started out, I was of the opinion that there is nothing difficult
about implementing gradient descent or even a second-order method, and I
wrote all of the solvers myself. I now realize that I was wrong. The
commercial solvers available for nonlinear programming are substantially
higher performance than anything I wrote myself, with a number of tricks,
subtleties, and parameter choices that can make a huge difference in
practice. Some of these solvers can exploit sparsity in the problem (e.g.,
if the constraints operate in a sparse way on the decision variables).
Nowadays, we make heaviest use of SNOPT

Let us start by representing the finite-time trajectories, $\bx(t)$
and $\bu(t)$ $\forall t\in[t_0,t_f]$, by their values at a series of
*break points*, $t_0,t_1,t_2,t_3$, and denoting the values at those
points (aka the *knot points*) $\bx_0,...,\bx_N$ and
$\bu_0,...,\bu_N$, respectively.

Then perhaps the simplest mapping of the trajectory optimization problem onto a nonlinear program is to fix the break points at even intervals, $dt$, and use Euler integration to write \begin{align*} \minimize_{\bx_1,...,\bx_N,\bu_0,...,\bu_{N-1}} & \sum_{n=0}^{N-1} g(\bx_n,\bu_n)dt \\ & \bx_{n+1} = \bx_n + f(\bx_n,\bu_n)dt, \quad \forall n\in [0,N-1]. \end{align*} Note that the decision variables here are $(\bx_1,...,\bx_N,\bu_0,...,\bu_{N-1})$, because $\bx_0$ is given, and $\bu_{N}$ does not appear in the cost nor any of the constraints. It is easy to generalize this approach to add additional costs or constraints on $\bx$ and/or $\bu$. (Note that this formulation does not actually benefit from the additive cost structure, so more general cost formulations are also possible.) Computing the gradients of the objective and constraints is essentially as simple as computing the gradients of $g()$ and $f()$.

If you take a moment to think about what the direct transcription
algorithm is doing, you will see that by satisfying the dynamic
constraints, the optimization is effectively solving the (Euler
approximation of the) differential equation. But instead of marching
forward through time, it is minimizing the inconsistency at each of the
knot points simultaneously. While it's easy enough to generalize the
constraints to use higher-order integration schemes, paying attention to
the trade-off between the number of times the constraint must be evaluated
vs the density of the knot points in time, if the goal is to obtain smooth
$\bx$ trajectory solutions then another approach quickly emerges: the
approach taken by the so-called *collocation methods*.

In direct collocation (c.f.,

It turns out that in this sweet spot, the only decision variables we need in our optimization are the values $\bu(t)$ and $\bx(t)$ at the break points of the spline. You might think that you would need the coefficients of the cubic spline parameters, but you do not. For the first order interpolation on $\bu(t)$, given $\bu(t_k)$ and $\bu(t_{k+1})$, we can solve for every value $\bu(t)$ over the internval $t \in [k, k+1]$. But we also have everything that we need for the cubic spline: given $\bx(t_k)$ and $\bu(t_k)$, we can compute $\dot\bx(t_k) = f (\bx(t_k), \bu(t_k))$; and the four values $\bx(t_k), \bx(t_{k+1}), \dot\bx (t_k), \dot\bx(t_{k+1})$ completely define all of the parameters of the cubic spline over the interval $t\in[k, k+1]$. This is very convenient, because it is easy for us to add additional constraints to $\bu$ and $\bx$ at the knot points (and would have been relatively harder to have to convert every constraint into constraints on the spline coefficients.

So far, given $\dot{\bx} = f(\bx,\bu)$, a list of break points $t_0, .
t_N$, and arbitrary values $\bx(t_k), \bu(t_k)$, we can extract the
spline representation of $\bu(t)$ and $\bx(t)$. So far we haven't added
any actual constraints to our mathematical program. Now, just like in
direct transcription, we need a constraint that will tell the optimizer
how $\bx(t_{k+1})$ is related to $\bx(t_k)$, etc. In direct collocation,
the constraint that we add is that the derivative of the spline at the
*collocation points* matches the dynamics. In particular, for our
sweet spot, if we choose the collocation points to be the midpoints of
the spline, then we have that \begin{gather*} t_{c} =
\frac{1}{2}\left(t_k + t_{k+1}\right), \qquad h = t_{k+1} - t_k, \\
\bu(t_c) = \frac{1}{2}\left(\bu(t_k) + \bu(t_{k+1})\right), \\ \bx(t_c) =
\frac{1}{2}\left(\bx(t_k) + \bx(t_{k+1})\right) + \frac{h}{8}
\left(\dot\bx(t_k) - \dot\bx(t_{k+1})\right), \\ \dot\bx(t_c) =
-\frac{3}{2h}\left(\bx(t_k) - \bx(t_{k+1})\right) - \frac{1}{4}
\left(\dot\bx(t_k) + \dot\bx(t_{k+1})\right). \end{gather*} These
equations come directly from the equations that fit the cubic spline to
the end points/derivatives then interpolate them at the midpoint. They
give us precisely what we need to add the constraint to our optimization
that $\dot{\bx(t_c}) = f(\bx(t_c), \bu(t_c))$, for each segment of the
spline.

Once again, direct collocation effectively integrates the equations of
motion by satisfying the constraints of the optimization -- this time
producing an integration of the dynamics that is accurate to third-order
with effectively two evaluations of the plant dynamics per segment (since
we use $\dot\bx(t_k)$ for two intervals).

Direct collocation also easily solves the minimum-time problem for the double integrator. The performance is similar to direct transcription, but the convergence is visibly different. Try it for yourself:

In the methods described above, by asking the optimization package to perform the numerical integration of the equations of motion, we are effectively over-parameterizing the problem. In fact, the optimization is perfectly well defined if we restrict the decision variables to $\bu_0,...,\bu_{N-1}$ only--we can compute $\bx_1,...,\bx_{N}$ ourselves by knowing $\bx_0$, and $\bu(\cdot)$. This is exactly the approach taken in the shooting methods.

Again, providing gradients of the objectives and constraints to the solver is not strictly required -- most solvers will obtain them from finite differences if they are not provided -- but I feel strongly that the solvers are faster and more robust when exact gradients are provided. Now that we have removed the $\bx$ decision variables from the program, we have to take a little extra care to compute the gradients. This is still easily accomplished using the chain rule. To be concise (and slightly more general), let us define $\bx[n+1]=f_d(\bx[n],\bu[n])$ as the discrete-time approximation of the continuous dynamics; for example, the forward Euler integration scheme used above would give $f_d(\bx[n],\bu[n]) = \bx[n]+f(\bx[n],\bu[n])dt.$ Then we have \[ \pd{J}{\bu_k} = \sum_{n=0}^{N-1} \left( \pd{g(\bx[n],\bu[n])}{\bx[n]} \pd{\bx[n]}{\bu_k} + \pd{g(\bx[n],\bu[n])}{\bu_k} \right), \] where the gradient of the state with respect to the inputs can be computed during the "forward simulation", \[ \pd{\bx[n+1]}{\bu_k} = \pd{f_d(\bx[n],\bu[n])}{\bx[n]} \pd{\bx[n]}{\bu_k} + \pd{f_d(\bx[n],\bu[n])}{\bu_k}. \] These simulation gradients can also be used in the chain rule to provide the gradients of any constraints. Note that there are a lot of terms to keep around here, on the order of (state dim) $\times$ (control dim) $\times$ (number of timesteps). Ouch. Note also that many of these terms are zero; for instance with the Euler integration scheme above $\pd{\bu[n]}{\bu_k} = 0$ if $k\ne n$. (If this looks like I'm mixing two notations here, recall that I'm using $\bu_k$ to represent the decision variable and $\bu[n]$ to represent the input used in the $n$th step of the simulation.)

By solving for $\bx(\cdot)$ ourselves, we've removed a large number of constraints from the optimization. If no additional state constraints are present, and the only gradients we need to compute are the gradients of the objective, then a surprisingly efficient algorithm emerges. I'll give the steps here without derivation, but will derive it in the Pontryagin section below:

- Simulate forward: $$\bx[n+1] = f_d(\bx[n],\bu_n),$$ from $\bx[0] = \bx_0$
- Calculate backwards: $$\lambda[n-1] = \pd{g(\bx[n],\bu[n])}{\bx[n]}^T + \pd{f(\bx[n],\bu[n])}{\bx[n]}^T \lambda[n],$$ from $\lambda[N-1]=0$
- Extract the gradients: $$\pd{J}{\bu[n]} = \pd{g(\bx[n],\bu[n])}{\bu[n]} + \lambda[n]^T \pd{f(\bx[n],\bu[n])}{\bu[n]},$$ with $\pd{J}{\bu_k} = \sum_n \pd{J}{\bu[n]}\pd{\bu[n]}{\bu_k}$.

Here $\lambda[n]$ is a
vector the same size as $\bx[n]$ which has an interpretation as
$\lambda[n]=\pd{J}{\bx[n+1]}^T$. The equation governing $\lambda$ is
known as the *adjoint equation*, and it represents a dramatic
efficient improvement over calculating the huge number of simulation
gradients described above. In case you are interested, the adjoint
equation is known as the *backpropagation algorithm* in the
neural networks literature and it is one of the primary reasons that
training neural networks became so popular in the 1980's; super fast GPU
implementations of this algorithm are also one of the reasons that
*deep learning* is incredibly popular right now (the availability
of massive training databases is perhaps the other main reason).

To take advantage of this efficiency, advocates of the shooting methods often use penalty methods instead of enforcing hard state constraints; instead of telling the solver about the constraint explicitly, you simply add an additional term to the cost function which gives a large penalty commensurate with the amount by which the constraint is violated. These are not quite as accurate and can be harder to tune (you'd like the cost to be high compared to other costs, but making it too high can lead to numerical conditioning issues), but they can work.

Although the decision about which algorithm is best may depend on the situation, in our work we have come to favor the direct collocation method (and occasionally direct transcription) for most of our work. There are a number of arguments for and against each approach; I will try to discuss a few of them here.

Numerical conditioning. Tail wagging the dog.

Sparse constraints. Potential for parallel evaluation of the constraints (computing the dynamics and their derivatives are often the most expensive part).

to avoid local minima. direct transcription and collocation can take an initial guess in $\bx$, too.

Another potential advantage of the direct transcription and collocation methods is that the dynamics constraints can be written in implicit form.

There are number of useful variations to the problem formulations I've presented above. By far the most common is the addition of a terminal cost, e.g.: $$\min \quad h(\bx[N]) + \sum_{n=0}^{N-1} g(\bx[n],\bu[n]).$$ These terms are easily added to the cost function in the any of methods, and the adjoint equations of the shooting method simply require the a modified terminal condition $$\lambda[N-1] = \pd{h(\bx[N])}{\bx[N]}$$ when computing the gradients.

Another common modification is including the spacing of the breakpoints as additional decision variables. This is particularly easy in the direct transcription and collocation methods, and can also be worked into the shooting methods. Typically we add a lower-bound on the time-step so that they don't all vanish to zero.

One potential complaint about the direct transcription and collocation algorithms is that we tend to use simplistic numerical integration methods and often fix the integration timestep (e.g. by choosing Euler integration and selecting a $dt$); it is difficult to bound the resulting integration errors in the solution. One tantalizing possibility in the shooting methods is that the forward integration could be accomplished by more sophisticated methods, like variable-step integration. But I must say that I have not had much success with this approach, because while the numerical accuracy of any one function evaluation might be improved, these integration schemes do not necessarily give smooth outputs as you make incremental changes to the initial conditions and control (changing $\bu_2$ by $\epsilon$ could result in taking a different number of steps in the integration scheme). This lack of smoothness can wreak havoc on the convergence of the optimization. If numerical accuracy is a premium, then I think you will have more success by imposing consistency constraints (e.g. as in the Runge-Kutta 4th order simulation with 5th order error checking method) as additional constraints on the time-steps; shooting methods do not have any particular advantage here.

The tools that we've been developing for numerical trajectory optimization are closely tied to theorems from (analytical) optimal control. Let us take one section to appreciate those connections.

What precisely does it mean for a trajectory, $\bx(\cdot),\bu(\cdot)$, to
be locally optimal? It means that if I were to perturb that trajectory in
any way (e.g. change $\bu_3$ by $\epsilon$), then I would either incur
higher cost in my objective function or violate a constraint. For an
unconstrained optimization, a *necessary condition* for local
optimality is that the gradient of the objective at the solution be exactly
zero. Of course the gradient can also vanish at local maxima or saddle
points, but it certainly must vanish at local minima. We can generalize this
argument to constrained optimization using *Lagrange multipliers*.

Given the equality-constrained optimization problem $$\minimize_\bz g(\bz) \quad \subjto \quad \bphi(\bz) = 0,$$ where $\bphi$ is a vector. Define a vector $\lambda$ of Lagrange multipliers, the same size as $\phi$, and the scalar Lagrangian function, $$L(\bz,{\bf \lambda}) = g(\bz) + \lambda^T\phi(\bz).$$ A necessary condition for $\bz^*$ to be an optimal value of the constrained optimization is that the gradients of $L$ vanish with respect to both $\bz$ and $\lambda$: $$\pd{L}{\bz} = 0, \quad \pd{L}{\lambda} = 0.$$ Note that $\pd{L}{\lambda} = \phi(\bz)$, so requiring this to be zero is equivalent to requiring the constraints to be satisfied.

Consider the following optimization: $$\min_{x,y} x+y, \quad \subjto \quad x^2 + y^2 = 1.$$ The level sets of $x+y$ are straight lines with slope $-1$, and the constraint requires that the solution lives on the unit circle.

Simply by inspection, we can determine that the optimal solution should be $x=y=-\frac{\sqrt{2}}{2}.$ Let's make sure we can obtain the same result using Lagrange multipliers.

Formulating $$L = x+y+\lambda(x^2+y^2-1),$$ we can take the derivatives and solve \begin{align*} \pd{L}{x} = 1 + 2\lambda x = 0 \quad & \Rightarrow & \lambda = -\frac{1}{2x}, \\ \pd{L}{y} = 1 + 2\lambda y = 1 - \frac{y}{x} = 0 \quad & \Rightarrow & y = x,\\ \pd{L}{\lambda} = x^2+y^2-1 = 2x^2 -1 = 0 \quad & \Rightarrow & x = \pm \frac{1}{\sqrt{2}}. \end{align*} Given the two solutions which satisfy the necessary conditions, the negative solution is clearly the minimizer of the objective.

Let us use Lagrange multipliers to derive the necessary conditions for our constrained trajectory optimization problem in discrete time \begin{align*} \minimize_{\bx_1,...,\bx_N,\bu_0,...,\bu_{N-1}} & \sum_{n=0}^{N-1} g(\bx[n],\bu[n]),\\ \subjto \quad & \bx[n+1] = f_d(\bx[n],\bu[n]). \end{align*} Formulate the Lagrangian, \[ L(\bx[\cdot],\bu[\cdot],\lambda[\cdot]) = \sum_{n=0}^{N-1} g(\bx[n],\bu[n]) + \sum_{n=0}^{N-1} \lambda^T[n] \left(f_d(\bx[n],\bu[n]) - \bx[n+1]\right), \] and set the derivatives to zero to obtain the adjoint equation method described for the shooting algorithm above: \begin{gather*} \forall n\in[0,N-1], \pd{L}{\lambda[n]} = f_d(\bx[n],\bu[n]) - \bx[n+1] = 0 \Rightarrow \bx[n+1] = f(\bx[n],\bu[n]) \\ \forall n\in[0,N-1], \pd{L}{\bx[n]} = \pd{g(\bx[n],\bu[n])}{\bx} + \lambda^T[n] \pd{f_d(\bx[n],\bu[n])}{\bx} - \lambda^T[n-1] = 0 \\ \quad \Rightarrow \lambda[n-1] = \pd{g(\bx[n],\bu[n])}{\bx}^T + \pd{f_d(\bx[n],\bu[n])}{\bx}^T \lambda[n]. \\ \pd{L}{\bx[N]} = -\lambda[N-1] = 0 \Rightarrow \lambda[N-1] = 0 \\ \forall n\in[0,N-1], \pd{L}{\bu[n]} = \pd{g(\bx[n],\bu[n])}{\bu} + \lambda^T[n] \pd{f_d(\bx[n],\bu[n])}{\bu} = 0. \end{gather*} Therefore, if we are given an initial condition $\bx_0$ and an input trajectory $\bu[\cdot]$, we can verify that it satisfies the necessary conditions for optimality by simulating the system forward in time to solve for $\bx[\cdot]$, solving the adjoint equation backwards in time to solve for $\lambda[\cdot]$, and verifying that $\pd{L}{\bu[n]} = 0$ for all $n$. The fact that $\pd{J}{\bu} = \pd{L}{\bu}$ when $\pd{L}{\bx} = 0$ and $\pd{L}{\lambda} = 0$ follows from some basic results in the calculus of variations.

You won't be surprised to hear that these necessary conditions have an analogue in continuous time. I'll state it here without derivation. Given the initial conditions, $\bx_0$, a continuous dynamics, $\dot\bx = f(\bx,\bu)$, and the instantaneous cost $g(\bx,\bu)$, for a trajectory $\bx(\cdot),\bu(\cdot)$ defined over $t\in[t_0,t_f]$ to be optimal it must satisfy the conditions that \begin{align*} \forall t\in[t_0,t_f],\quad & \dot{\bx} = f(\bx,\bu), \quad &\bx(0)=\bx_0\\ \forall t\in[t_0,t_f],\quad & -\dot\lambda = \pd{g}{\bx}^T + \pd{f}{\bx}^T \lambda, \quad &\lambda(T) = 0\\ \forall t\in[t_0,t_f],\quad & \pd{g}{\bu} + \lambda^T\pd{f}{\bu} = 0.& \end{align*}

In fact the statement can be generalized even beyond this to the case
where $\bu$ has constraints. The result is known as Pontryagin's minimum
principle -- giving *necessary conditions* for a trajectory to be
optimal.

Adapted from

Note that the terms which are minimized in the final line of the theorem are commonly referred to as the Hamiltonian of the optimal control problem, $$H(\bx,\bu,\lambda,t) = g(\bx,\bu) + \lambda^T f(\bx,\bu).$$ It is distinct from, but inspired by, the Hamiltonian of classical mechanics. Remembering that $\lambda$ has an interpretation as $\pd{J}{\bx}^T$, you should also recognize it from the HJB.

An important special case. Linear/Quadratic objectives results in an LP/QP $\Rightarrow$ Convex optimization.

Once we have obtained a locally optimal trajectory from trajectory optimization, there is still work to do...

Take $\bar\bx(t) = \bx(t) - \bx_0(t)$, and $\bar\bu(t) = \bu(t)-\bu_0(t)$, then apply finite-horizon LQR (see the LQR chapter).

The term "motion planning" is a hopelessly general term which almost certainly encompasses the dynamic programming, feedback design, and trajectory optimization algorithms that we have already discussed. However, there are a number of algorithms and ideas that we have not yet discussed which have grown from the idea of formulating motion planning as a search problem -- for instance searching for a path from a start to a goal in a graph which is too large solve completely with dynamic programming. Some, but certainly not all, of these algorithms sacrifice optimality in order to find any path if it exists, and the notion of a planner being "complete" -- guaranteed to find a path if one exists -- is highly valued. This is precisely our goal for this chapter, to add some additional tools that will be able to provide some form of solutions for our most geometrically complex, highly non-convex, robot control problems.

*motion planning* typically refers to problems
where the planning domain is continuous (e.g. continuous state space,
continuous action space), but many motion planning algorithms trace their
origins back to ideas in discrete domains (e.g., graph search).

For this chapter, we will consider the following problem formulation: given a system defined by the nonlinear dynamics (in continuous- or discrete-time) $$\dot{\bx} = f(\bx,\bu) \quad \text{or} \quad \bx[n+1] = f(\bx[n],\bu[n]),$$ and given a start state $\bx(0) = \bx_0$ and a goal region ${\cal G}$, find any finite-time trajectory from $\bx_0$ to to $\bx \in {\cal G}$ if such a trajectory exists.

A long history... some people feel that the route to creating intelligent machines is to collect large ontologies of knowledge, and then perform very efficient search. (The more modern view of AI is focused instead on machine learning, the right answer probably involves pieces of both.) Samuels checker players, Deep Blue playing chess, theorem proving, Cyc, IBM Watson,...

One of the key ideas is the use of "heuristics" to guide the search. "Is it possible to find and optimal path from the start to a goal without visiting every node?". $A^*$. Admissible heuristics. Example: google maps.

Online planning. $D^*$, $D^*$-Lite.

If you remember how we introduced dynamic programming initially as a graph search, you'll remember that there were some challenges in discretizing the state space. Let's assume that we have discretized the continuous space into some finite set of discrete nodes in our graph. Even if we are willing to discretize the action space for the robot (this might be even be acceptable in practice), we had a problem where discrete actions from one node in the graph, integrated over some finite interval $h$, are extremely unlikely to land exactly on top of another node in the graph. To combat this, we had to start working on methods for interpolating the value function estimate between nodes.

Interpolation can work well if you are trying to solve for the cost-to-go function over the entire state space, but it's less compatible with search methods which are trying to find just a single path through the space. If I start in node 1, and land between node 2 and node 3, then which node to I continue to expand from?

One approach to avoiding this problem is to build a *search tree*
as the search executes, instead of relying on a predefined mesh
discretization. This tree will contains nodes rooted in the continuous
space at exactly the points where system can reach.

Another other problem with any fixed mesh discretization of a continuous
space, or even a fixed discretization of the action space, is that unless
we have specific geometric / dynamic insights into our continuous system,
it very difficult to provide a *complete* planning algorithm. Even
if we can show that no path to the goal exists on the tree/graph, how can
we be certain that there is no path for the continuous system? Perhaps a
solution would have emerged if we had discretized the system differently,
or more finely?

One approach to addressing this second challenge is to toss out the
notion of fixed discretizations, and replace them with random sampling
(another approach would be to adaptively add resolution to the
discretization as the algorithm runs). Random sampling, e.g. of the action
space, can yield algorithms that are *probabilistically complete*
for the continuous space -- if a solution to the problem exists, then a
probabilistically complete algorithm will find that solution with
probability 1 as the number of samples goes to infinity.

With these motivations in mind, we can build what is perhaps the simplest probabilistically complete algorithm for finding a path from the a starting state to some goal region with in a continuous state and action space:

Let us denote the data structure which contains the tree as ${\cal T}$. The algorithm is very simple:

- Initialize the tree with the start state: ${\cal T} \leftarrow \bx_0$.
- On each iteration:
- Select a random node, $\bx_{rand}$, from the tree, ${\cal T}$
- Select a random action, $\bu_{rand}$, from a distribution over feasible actions.
- Compute the dynamics: $\bx_{new} = f(\bx_{rand},\bu_{rand})$
- If $\bx_{new} \in {\cal G}$, then terminate. Solution found!
- Otherwise add the new node to the tree, ${\cal T} \leftarrow \bx_{new}$.

```
T = struct('parent',zeros(1,1000),'node',zeros(2,1000)); % pre-allocate memory for the "tree"
for i=2:size(T.parent,2)
T.parent(i) = randi(i-1);
x_rand = T.node(:,T.parent(i));
u_rand = 2*rand(2,1)-1;
x_new = x_rand+u_rand;
if (15<=x_new(1) && x_new(1)<=20 && 15<=x_new(2) && x_new(2)<=20)
disp('Success!'); break;
end
T.node(:,i) = x_new;
end
clf;
line([T.node(1,T.parent(2:end));T.node(1,2:end)],[T.node(2,T.parent(2:end));T.node(2,2:end)],'Color','k');
patch([15,15,20,20],[15,20,20,15],'g')
axis([-10,25,-10,25]);
```

Again, this algorithm We're nowhere close to the goal yet, and it's not exactly a hard problem.

While the idea of generating a tree of feasible points has clear advantages, we have lost the ability to cross off a node (and therefore a region of space) once it has been explored. It seems that, to make randomized algorithms effective, we are going to at the very least need some form of heuristic for encouraging the nodes to spread out and explore the space.

Multi-query planning with PRMs, ...

RRT*, RRT-sharp, RRTx, ...

Kinodynamic-RRT*, LQR-RRT(*)

Complexity bounds and dispersion limits

Not sure yet whether randomness is fundamental here, or whether is a temporary "crutch" until we understand geometric and dynamic planning better.

Cell decomposition...

Mixed-integer planning.

Approximate decompositions for complex environments (e.g. IRIS)

So far, most of our recommendations for control design have been relatively "local" -- leveraging trajectory planning/optimization as a tool and our ability to locally stabilize trajectories for even very complex systems using linear optimal control. This is in stark contrast to the dynamic programming / value iteration methods that we started with, which attempt to solve for a control policy for every possible state; unfortunately, the dynamic programming methods as presented are restricted to relatively low dimensional state spaces. What is missing so far is algorithms for synthesizing feedback controllers that scale to large state spaces and produce controllers that are, hopefully, less "local" than trajectory stabilization.

In this chapter, we will explore another very natural idea: let us parameterize a controller with some decision variables, and then search over those decision variables directly in order to achieve a task and/or optimize a performance objective. We'll refer to this broad class of methods as "policy search" or, when optimization methods are used, "policy optimization".

Consider a static full-state feedback policy, $$\bu = \bpi_\balpha(\bx),$$ where $\bpi$ is potentially a nonlinear function, and $\balpha$ is the vector of parameters that describe the controller. The control might take time as an input, or might even have it's own internal state, but let's start with this simple form.

How should we write an objective function for optimizing $\balpha$? The
approach that we used for trajectory optimization is quite reasonable --
the objective was typically to minimize an integral cost over some time
horizon (be it finite or infinite). But in trajectory optimization, the
cost is only ever defined based on forward simulation from a single initial
condition. We used the same additive cost structures in dynamic
programming, where the Hamilton-Bellman-Jacobi equation provided optimality
conditions for optimizing an additive cost from *every* initial
condition; at least in the idealized equations, we were able to get away
with saying $\forall \bx, \minimize_\bu ...$.

But now we are playing a different game. If we are searching over the some finitely parameterized policy, $\bpi_{\balpha}$, we can almost never expect to be optimal for every state -- and we need to somehow define the relevant importance of different states. For finite-time, a distribution over initial conditions. For infinite horizon, what really matters is the stationary distribution (which depends on the policy). Let's start with the distribution over initial conditions.

Searching directly for $K$ with an LQR objective is known to be bad. The objective is non-convex, and the set of stabilizing controllers is not a convex set. (TODO: Give the 2D example)

The discussion of walking and running robots in Chapter 4 motivated the notion of limit cycle stability. Linear systems are not capable of producing stable limit cycle behavior, so this rich topic is unique to nonlinear systems design and analysis. Furthermore, the tools that are required to design, stabilize, and verify limit cycles will have applicability beyond simple periodic motions.

The first natural question we must ask is, given a system $\dot{\bx} = f(\bx)$, or a control system $\dot{x} = f(\bx,\bu)$, how do we go about finding periodic solutions which may be passively stable, open-loop stable, or stabilized with closed-loop feedback? It turns out that the trajectory optimization tools that we developed in the last chapter are very well suited to this task.

I introduced the trajectory optimization tools as means for optimizing a
single trajectory of the system from a particular known initial condition.
But they can be used in a slightly different way, too -- in some
formulations we might choose to have the initial conditions as one of the
decision variables. In fact, the simplest formulation for finding a
periodic solution is to leave the entire trajectory as a decision variable,
but to add *periodicity constraints* enforcing that the initial
conditions and the final conditions match. Some care is needed here; if
you're not careful then these constraints will all be met if the entire
trajectory remains at a fixed point in the system. In the example below, we
constrain that the initial and final conditions lie on the surface of
section that we used before, and additionally constrain that the velocity at
the initial/final points is non-zero (to avoid the fixed point
solution).

Note that although we introduced trajectory optimization as searching over control inputs $\bu$, this formulation is now rich enough that it solves an important problem even for passive systems -- with no control input at all.

Recall the dynamics of the Van der Pol oscillator given by $$\ddot{q} + \mu (q^2 - 1) \dot{q} + q = 0, \quad \mu>0,$$ which exhibited a stable limit cycle.

Formulate the direct collocation optimization: \begin{align*} \find_{\bx_0,...,\bx_N,dt} \quad \\ \subjto \quad & q_0 = 0, \quad \dot{q}_0 > 0, \\ & \bx_N = \bx_0, \text{(periodicity constraint)}\\ & \text{collocation dynamic constraints} \end{align*}

Try it yourself:

As always, make sure that you take a look at the code. Poke around. Try changing some things.

One of the things that you should notice in the code is that I provide an initial guess for the solver. In most of the examples so far I've been able to avoid doing that--the solver takes small random numbers as a default initial guess and solves from there. But for this problem, I found that it was getting stuck in a local minima. Adding the initial guess that the solution moves around a circle in state space was enough.

Recall the important distinction between stability of a trajectory in
time and stability of a limit cycle was that the limit cycle does not
converge in phase -- trajectories near the cycle converge to the cycle,
but trajectories on the cycle may not converge with each other. This is
type of stability, also known as *orbital stability* can be written
as stability to the manifold described by cycle $\bx^*(t)$, \[ \min_\tau
|| x(t) - x^*(\tau) || \rightarrow 0.\] In the case of limit cycles, this
manifold is a periodic solution with $\bx^*(t+t_{period}) = \bx^*(t)$.
Depending on exactly how that convergence happens, we can define orbital
stability in the sense of Lyapunov, asymptotic orbital stability,
exponential orbital stability, or even finite-time orbital stability.

In order to prove that a system is orbitally stable (locally, over a region, or globally), or to analyze the region of attraction of a limit cycle, we can use a Lyapunov function. In particular, we would like to consider Lyapunov functions which have the form cartooned below; they vanish to zero everywhere along the cycle, and are strictly positive everywhere away from the cycle.

How can we parameterize this class of functions? For arbitrary cycles this could be very difficult in the original coordinates. For simple cycles like in the cartoon, one could imagine using polar coordinates. More generally, we will define a new coordinate system relative to the orbit, with coordinates

- $\tau$ - the phase along the orbit
- $\bx_\perp(\tau)$ - the remaining coordinates, linearly independent from $\tau$.

Given a state $\bx$ in the original coordinates, we must define a smooth mapping $\bx \rightarrow (\tau, \bx_\perp)$ to this new coordinate system. For example, for a simple ring oscillator we might have:

In general, for an $n$-dimensional state space, $\tau$ will always be a scalar, and $\bx_\perp$ will be an $(n-1)$-dimensional vector defining the remaining coordinates relative to $\bx^*(\tau)$. In fact, although we use the notation $\bx_\perp$ the coordinate system need not be strictly orthogonal to the orbit, but must simply beThe value of this construction for
Lyapunov analysis was proposed in

For a dynamical system $\dot\bx = f(\bx)$ with $\bx \in \Re^n$, $f$
continuous, a continuous periodic solution $\bx^*(\tau)$, and a smooth
mapping $\bx \rightarrow (\tau,\bx_\perp)$ where $\bx_\perp$ vanishes on
$\bx^*$, then for some $n-1$ dimensional ball ${\cal B}$ around the
origin, if I can produce a $V(\bx_\perp,\tau)$ such that \begin{gather*}
\forall \tau, V(0,\tau) = 0, \\ \forall \tau, \forall \bx_\perp \in
{\cal B}, \bx_\perp \ne 0, V(\bx_\perp,\tau) > 0, \end{gather*} with
\begin{gather*} \forall \tau, \dot{V}(0,\tau) = 0, \\ \forall \tau,
\forall \bx_\perp \in {\cal B}, \bx_\perp \ne 0, \dot{V}(\bx_\perp,\tau)
< 0, \end{gather*} then the solution $\bx^*(t)$ is *locally orbitally
asymptotically stable*.

Perhaps the simplest oscillator is the first-order system which converges to the unit circle. In cartesian coordinates, the dynamics are \begin{align*} \dot{x}_1 =& x_2 -\alpha x_1 \left( 1 - \frac{1}{\sqrt{x_1^2+x_2^2}}\right) \\ \dot{x}_2 =& -x_1 -\alpha x_2 \left( 1 - \frac{1}{\sqrt{x_1^2+x_2^2}}\right), \end{align*} where $\alpha$ is a positive scalar gain.

If we take the transverse coordinates to be the polar coordinates, shifted so that $x_\perp$ is zero on the unit circle, \begin{align*}\tau =& \text{atan2}(-x_2,x_1) \\ x_\perp =& \sqrt{x_1^2+x_2^2}-1, \end{align*} which is valid when $x_\perp>-1$, then the simple transverse dynamics are revealed: \begin{align*} \dot\tau =& 1 \\ \dot{x}_\perp =& -\alpha x_\perp. \end{align*} Taking a Lyapunov candidate $V(x_\perp,\tau) = x_\perp^2$, we can verify that $$\dot{V} = -2 \alpha x_\perp^2 \prec 0, \quad \forall x_\perp > -1.$$ This demonstrates that the limit cycle is locally asymptotically stable, and furthermore that the invariant open-set $V < 1$ is inside the region of attraction of that limit cycle. In fact, we know that all $x_\perp > -1$ are in the region of attraction that limit cycle, but this is not proven by the Lyapunov argument above.

Let's compare this approach with the approach that we used in the chapter on walking robots, where we used a Poincaré map analysis to investigate limit cycle stability. In transverse coordinates approach, there is an additional burden to construct the coordinate system along the entire trajectory, instead of only at a single surface of section. In fact, the transverse coordinates approach is sometimes referred to as a "moving Poincaré section". But the reward for this extra work is that we can check a condition that only involves the instantaneous dynamics, $f(\bx)$, as opposed to having to integrate the dynamics over an entire cycle to generate the discrete Poincaré map, $\bx_p[n+1] = P(\bx_p[n])$. While computing $P$ in closed-form happened to be possible for the simple rimless wheel model, it is rarely possible for more complex models. As we will see below, this approach will also be more compatible with designing continuous feedback controller that stabilize the limit cycle.

In the case of Lyapunov analysis around a fixed point, there was an important special case: for stable linear systems, we actually have a recipe for constructing a Lyapunov function. As a result, for nonlinear systems, we often found it convenient to begin our search by linearizing around the fixed point and using the Lyapunov candidate for the linear system as an initial guess. That same approach can be extended to limit cycle analysis.

Coming soon. If you are interested, see

So far we have developed a fairly strong toolbox for planning and control with "smooth" systems -- systems where the equations of motion are described by a function $\dot{\bx} = f(\bx,\bu)$ which is smooth everywhere. But our discussion of the simple models of legged robots illustrated that the dynamics of making and breaking contact with the world are more complex -- these are often modeled as hybrid dynamics with impact discontinuities at the collision event and constrained dynamics during contact (with either soft or hard constraints).

My goal for this chapter is to extend our computational tools into this richer class of models. Many of our core tools still work: trajectory optimization, Lyapunov analysis (e.g. with sums-of-squares), and LQR all have natural equivalents.

Recall how we modeled the dynamics of the simple legged robots. First, we derived the equations of motion (independently) for each possible contact configuration -- for example, in the spring-loaded inverted pendulum (SLIP) model we had one set of equations governing the $(x,y)$ position of the mass during the flight phase, and a completely separate set of equations written in polar coordinates, $(r,\theta)$, describing the stance phase. Then we did a little additional work to describe the transitions between these models -- e.g., in SLIP we transitioned from flight to stance when the foot first touches the ground. When simulating this model, it means that we have a discrete "event" which occurs at the moment of foot collision, and an immediate discontinuous change to the state of the robot (in this case we even change out the state variables).

The language of *hybrid systems* gives us a rich language for
describing systems like this, and a suite of tools for simulating them.
*autonomous* hybrid systems because the internal dynamics can cause
the discrete changes without any exogeneous input; in contrast to, for
instance, the model of a power-train where a change in gears comes as an
external input.*modes* each described by (ideally smooth)
continuous dynamics, a set of *guards* which here are continuous
functions who's zero-level set describes the conditions which trigger an
event, and a set of *resets* which describe the discrete update to the
state that is triggered by the guard. Each guard is associated with a
particular mode, and we can have multiple guards per mode. Every guard has
at most one reset. You will occasionally here guards referred to as
"witness functions", since they play that role in simulation, and resets
are sometimes referred to as "transition functions".

The imagery that I like to keep in my head for hybrid systems is illustrated below for a simple example of a robot's heel striking the ground. A solution trajectory of the hybrid system has a continuous trajectory inside each mode, punctionated by discrete updates when the trajectory hits the zero-level set of the guard (here the distance between the heel and the ground becomes zero), with the reset describing the discrete change in the state variables.

There is some work to do in order to derive the equations of motion in
this form. Do you remember how we did it for the rimless wheel and compass
gait examples? In both cases we assumed that exactly one foot was
attached to the ground and that it would not slip, this allowed us to write
the Lagrangian as if there was a pin joint attaching the foot to the ground
to obtain the equations of motion. For the SLIP model, we derived the
flight phase and stance phase using separate Lagrangian equations each with
different state representations. I would describe this as the *minimal
coordinates* modeling approach -- it is elegant and has some important
computational advantages that we will come to appreciate in the algorithms
below. But it's a lot of work! For instance, if we also wanted to consider
friction in the foot contact of the rimless wheel, we would have to derive
yet another set of equations to describe the sliding mode (adding, for
instance, a prismatic joint that moved the foot along the ramp), plus the
guards which compute the contact force for a given state and the distance
from the boundary of the friction cone, and on and on.

Fortunately, there is an alternative modeling approach for deriving the
modes, guards, and resets for contact that is less work. We can instead
model the robot in the *floating-base coordinates* -- we add a
fictitious six degree-of-freedom "floating-base" joint connecting some part
of the robot to the world (in planar models, we use just three
degrees-of-freedom, e.g. $(x,y,\theta)$). We can derive the equations of
motion for the floating-base robot once, without considering contact, then
add the additional constraints that come from being in contact as contact
forces which get applied to the bodies. The resulting manipulator equations
take the form \begin{equation}\bM({\bq})\ddot{\bq} +
\bC(\bq,\dot{\bq})\dot\bq = \btau_g(\bq) + \bB\bu + \sum_i \bJ_i^T(\bq)
\blambda_i,\end{equation} where $\blambda_i$ are the constraint forces and
$\bJ_i$ are the constraint Jacobians. Conveniently, if the guard function
in our contact equations is the signed distance from contact,
$\phi_i(\bq)$, then this Jacobian is simply $\bJ_i(\bq) =
\pd{\phi_i}{\bq}$. I've written the basic derivations for the common cases
(position constraints, velocity constraints, impact equations, etc) in the
appendix. What is
important to understand here is that this is an alternative formulation for
the equations governing the modes, guards, and resets, but that is it no
longer a minimal coordinate system -- the equations of motion are written
in $2N$ state variables but the system might actually be constrained to
evolve only along a lower dimensional manifold (if we write the rimless
wheel equations with three configuration variables for the floating base,
it still only rotates around the toe when it is in stance and is inside the
friction cone). This will have implications for our algorithms.

Finally, there are many recipes for simulating these systems. At one extreme, we have integrators that do explicit event detection -- we register the guards of our system as "witness functions" for the event and the integrator ensures that it obtains a high-accuracy integration step at the event. For the complicated simulations that you see in computer graphics, coarser approximations are often used, such as the time-stepping approximations which solve a small optimization problem at every integration step to compute time-averaged contact forces to summarize the impact events without explicit event detection. I'll leave the details to the appendix, but will use both approaches in this chapter for control and analysis.

Transverse stabilization (build off limit cycle chapter)

Reinforcement learning (RL) is a collection of algorithms for solving the
same optimal control problem that we've focused on through the text, but the
real gems from the RL literature are the algorithms for (almost) *black-box
optimization* of stochastic optimal control problems. The idea of an
algorithm that only has a "black-box" interface to the optimization problem
means that it can obtain (potentially noisy) samples of the optimal cost via
trial and error, but does not have access to the underlying model, and does
not have direct access to complete gradient information.

This is a hard problem! In general we cannot expect RL algorithms to
optimize as quickly as more structured optimization, and we can typically only
guarantee convergence to a local optima at best. But the framework is
extremely general, so it can be applied to problems that are inaccessible to
any of the other algorithms we've examined so far. My favorite examples for
reinforcement learning are control in complex fluid dynamics (e.g.

In this chapter we will examine one particular style of reinforcement learning that attempts to find good controllers by explicitly parameterizing a family of policies (e.g. via a parameter vector $\alpha$), then searching directly for the parameters that optimize the long-term cost. For a stochastic optimal control problem with our favorite additive form of the objective, this might look like: \begin{equation} \min_\alpha E \left[ \sum_{n=0}^N g(\bx[n],\bu[n]) \right]\end{equation} where the random variables are drawn from probability densities \begin{gather*}\bx[0] \sim p_0(\bx),\\ \bx[n] \sim p(\bx[n] | \bx[n-1], \bu[n-1]), \\ \bu[n] \sim p_\alpha(\bu[n] | \bx[n]).\end{gather*} The last equation is a probabilistic representation of the control policy -- on each time-step actions $\bu$ are drawn from a distribution that is conditioned on the current state, $\bx$.

Of course, the controls community has investigated ideas like this, too,
for instance under the umbrellas of *extremum-seeking control* and
*iterative learning control*. I'll try to make connections whenever
possible.

One of the standard approaches to policy search in RL is to estimate the
gradient of the expected long-term cost with respect to the policy
parameters by evaluating some number of sample trajectories, then performing
(stochastic) gradient descent. Many of these so-called "policy gradient"
algorithms leverage a derivation that was first described in the context of
the REINFORCE algorithm

Let's start with a simpler optimization over a stochastic function: $$\min_\alpha E\left[ g(\bx) \right] \text{ with } \bx \sim p_\alpha(\bx) $$ I hope the notation is clear? $\bx$ is a random vector, drawn from the distribution $p_\alpha(\bx)$, with the subscript indicating that the distribution depends on the parameter vector $\alpha$. What is the gradient of this function? The REINFORCE derivation is \begin{align*} \frac{\partial}{\partial \alpha} E \left[ g(\bx) \right] =& \frac{\partial}{\partial \alpha} \int d\bx ~ g(\bx) p_\alpha(\bx) \\ =& \int d\bx ~ g(\bx) \frac{\partial}{\partial \alpha} p_\alpha(\bx) \\ =& \int d\bx ~ g(\bx) p_\alpha(\bx) \frac{\partial}{\partial \alpha} \log p_\alpha(\bx) \\ =& E\left[ g(\bx) \frac{\partial}{\partial \alpha} \log p_\alpha(\bx) \right]. \end{align*} To achieve this, we used the derivative of the logarithm: \begin{gather*} y = \log u \\ \frac{\partial y}{\partial u} = \frac{1}{u} \frac{\partial u}{\partial x} \end{gather*} to write $$\frac{\partial}{\partial \alpha} p_\alpha(\bx) = p_\alpha(\bx) \frac{\partial}{\partial \alpha} \log p_\alpha(\bx).$$ This suggests a simple Monte Carlo algorithm for estimating the policy gradient: draw $N$ random samples $\bx_i$ then estimate the gradient as $$ \frac{\partial}{\partial \alpha} E \left[ g(\bx) \right] \approx \frac{1}{N} \sum_i g(\bx_i) \frac{\partial}{\partial \alpha} \log p_\alpha(\bx_i).$$

This trick is even more potent in the optimal control case. For a finite-horizon problem we have \begin{align*} \frac{\partial}{\partial \alpha} E \left[ \sum_{n=0}^N g(\bx[n], \bu[n]) \right] =& \sum_{n=0}^N \int d\bx[n] d\bu[n] ~ g(\bx[n],\bu[n]) \frac{\partial}{\partial \alpha} p_\alpha(\bx[n],\bu[n]) \\ =& E\left[ \sum_{n=0}^N g(\bx[n],\bu[n]) \frac{\partial}{\partial \alpha} \log p_\alpha(\bx[n],\bu[n]) \right], \end{align*} where $$ p_\alpha(\bx[n],\bu[n]) = p_0(\bx[0]) \left( \prod_{k=1}^n p(\bx[k] | \bx[k-1],\bu[k-1]) \right) \left(\prod_{k=0}^n p_\alpha(\bu[k] | \bx[k]) \right).$$ Taking the $\log$ we have $$\log p_\alpha(\bx[n],\bu[n]) = \log p_0(\bx[0]) + \sum_{k=1}^n \log p(\bx[k] | \bx[k-1],\bu[k-1]) + \sum_{k=0}^n \log p_\alpha(\bu[k] | \bx[k]).$$ Only the last terms depend on $\alpha$, which yields $$\frac{\partial}{\partial \alpha} E \left[ \sum_{n=0}^N g(\bx[n], \bu[n]) \right] = E\left[ \sum_{n=0}^N g(\bx[n],\bu[n]) \sum_{k=0}^n \frac{\partial}{\partial \alpha} \log p_\alpha(\bu[k]|\bx[k]) \right]. $$

This update should surprise you. It says that I can find the gradient of the long-term cost by taking only the gradient of the policy... but not the gradient of the plant, etc! The intuition is that one can obtain the gradient by evaluating the policy along a number of (random) trajectory roll-outs of the closed-loop system, evaluating the cost on each, then increasing the probability in the policy of taking the actions correlated with lower long-term costs.

This derivation is often presented as **the** policy gradient
derivation. The identity is certainly correct, but I'd prefer that you
think of this as just one way to obtain the policy gradient. It's a
particularly clever one in that it uses exactly the information that we
have available in reinforcement learning -- we have access to the
instantaneous costs, $g(\bx[n], \bu[n])$, and to the policy -- so are
allowed to take gradients of the policy -- but does not require any
understanding whatsoever of the plant model. Although it's clever, it is
not particularly efficient -- the Monte Carlo approximations of the
expected value have a high variance, so many samples are required for an
accurate estimate. Other derivations are possible, some even simpler, and
others that **do** make use of the plant gradients if you have access
to them, and these will perform differently in the finite-sample
approximations.

For the remainder of this section, I'd like to dig in and try to understand the nature of the stochastic update a little better, to help you understand what I mean.

Let's take a step back and think more generally about how one can use gradient descent in black-box (unconstrained) optimization. Imagine you have a simple optimization problem: $$\min_\alpha g(\alpha),$$ and you can directly evaluate $g()$, but not $\frac{\partial g}{\partial \alpha}$. How can you perform gradient descent?

A standard technique for estimating the gradients, in lieu of
analytical gradient information, is the method of finite
differences

What if each function evaluation is expensive? Perhaps it means
picking up a physical robot and running it for 10 seconds for each
evaluation of $g()$. Suddenly there is a premium on trying to optimize
the cost function with the fewest number of evaluations on $g()$. This is
the name of the game in reinforcement learning -- it's often called RL's
*sample complexity*. Can we perform gradient descent using less
evaluations?

This leads us to the question of doing approximate gradient descent, or
"stochastic" gradient descent. Thinking of the cost landscape as a
Lyapunov function†

Instead of sampling each dimension independently, consider making a single small random change, $\bbeta$, to the parameter vector, $\balpha$. Now consider the update of the form: $$\Delta \balpha= -\eta [ g(\balpha + \bbeta) - g(\balpha)]\bbeta.$$ The intuition here is that if $g(\balpha+\bbeta) < g(\balpha)$, then $\bbeta$ was a good change and we'll move in the direction of $\bbeta$. If the cost increased, then we'll move in the opposite direction. Assuming the function is smooth and $\bbeta$ is small, then we will always move downhill on the Lyapunov function.

Even more interesting, on average, the update is actually in the direction of the true gradient. To see this, again we can assume the function is locally smooth and $\bbeta$ is small to write the Taylor expansion: $$g(\balpha + \bbeta) \approx g(\balpha) + \pd{g}{\balpha}\bbeta.$$ Now we have \begin{align*}\Delta \balpha \approx& -\eta \left[ \pd{g}{\alpha} \bbeta \right] \bbeta = -\eta \bbeta \bbeta^T \pd{g}{\balpha}^T \\ \avg{\Delta \balpha} \approx& -\eta \avg{\bbeta \bbeta^T} \pd{g}{\balpha}^T \end{align*} If we select each element of $\bbeta$ independently from a distribution with zero mean and variance $\sigma_\beta^2$, or $\avg{ \beta_i } = 0, \avg{ \beta_i \beta_j } = \sigma_\beta^2 \delta_{ij}$, then we have $$\avg{\Delta \balpha} \approx -\eta \sigma_\beta^2 \pd{g}{\balpha}^T.$$ Note that the distribution $p_{\alpha}(\bx)$ need not be Gaussian, but it is the variance of the distribution which determines the scaling on the gradient.

The weight perturbation update above requires us to evaluate the function $g$ twice for every update of the parameters. This is low compared to the method of finite differences, but let us see if we can do better. What if, instead of evaluating the function twice per update [$g(\balpha+\bbeta)$ and $g(\balpha)$], we replace the evaluation of $g(\balpha)$ with an estimator, $b = \hat{g}(\balpha)$, obtained from previous trials? In other words, consider an update of the form: \begin{equation} \Delta \balpha= - \frac{\eta}{\sigma_\beta^2} [ g(\balpha + \bbeta) - b]\bbeta \label{eq:weight_perturbation_baseline} \end{equation} The estimator can take any of a number of forms, but perhaps the simplest is based on the update (after the $n$th trial): $$b[n+1] = \gamma g[n] + (1 - \gamma)b[n],\quad b[0] = 0, 0\le\gamma\le1,$$ where $\gamma$ parameterizes the moving average. Let's compute the expected value of the update, using the same Taylor expansion that we used above: \begin{align*} E[\Delta \balpha] =& -\frac{\eta}{\sigma_\beta^2} \avg{ [g(\balpha) + \pd{g}{\balpha}\bbeta - b]\bbeta} \\ =& -\frac{\eta}{\sigma_\beta^2} \avg{[g(\balpha)-b]\bbeta} -\frac{\eta}{\sigma_\beta^2} \avg{\bbeta \bbeta^T}\pd{g}{\balpha}^T \\ =& - \eta \pd{g}{\balpha}^T \end{align*} In other words, the baseline does not effect our basic result - that the expected update is in the direction of the gradient. Note that this computation works for any baseline estimator which is uncorrelated with $\bbeta$, as it should be if the estimator is a function of the performance in previous trials.

Although using an estimated baseline doesn't effect the average update, it can have a dramatic effect on the performance of the algorithm. As we will see in a later section, if the evaluation of $g$ is stochastic, then the update with a baseline estimator can actually outperform an update using straight function evaluations.

Let's take the extreme case of $b=0$. This seems like a very bad idea... on each step we will move in the direction of every random perturbation, but we will move more or less in that direction based on the evaluated cost. In average, we will still move in the direction of the true gradient, but only because we will eventually move more downhill than we move uphill. It feels very naive.

Now let's consider the simple form of the REINFORCE update: \begin{align*} \frac{\partial}{\partial \alpha} E \left[ g(\bx) \right] = E\left[ g(\bx) \frac{\partial}{\partial \alpha} \log p_\alpha(\bx) \right]. \end{align*} It turns out that weight perturbation is a REINFORCE algorithm. To see this, take $\bx = \balpha + \bbeta$, $\bbeta \in N(0,\sigma^2)$, to have \begin{gather*} p_{\balpha}(\bx) = \frac{1}{(2\pi\sigma^2)^N} e^{\frac{-(\bx-\balpha)^T(\bx - \balpha)}{2\sigma^2}} \\ \log p_{\balpha}(\bx) = \frac{-(\bx-\balpha)^T(\bx - \balpha)}{2\sigma^2} + \text{terms that don't depend on }\balpha \\ \pd{}{\balpha} \ln p_{\balpha}(\bx) = \frac{1}{\sigma^2} (\balpha - \bx)^T = \frac{1}{\sigma^2} \bbeta^T. \\ \end{gather*} If we use only a single trial for each Monte-Carlo evaluation, then the REINFORCE update is $$\Delta \balpha = -\frac{\eta}{\sigma^2} g(\balpha + \bbeta) \bbeta,$$ which is precisely the weight perturbation update (the crazy version with $b=0$ discussed above). Although it will move in the direction of the gradient on average, it can be highly inefficient.

The policy gradient "trick" from REINFORCE using log probabilities provides one means of estimating the true policy gradient. It is not the only way to obtain the policy gradient... in fact the trivial weight-perturbation update obtains the same gradient for the case of a policy with a mean linear in $\alpha$ and a fixed diagonal covariance. It's cleverness lies in the fact that it makes use of the information we do have (instantaneous cost values and the gradients of the policy), but it's inefficiency comes from having a potentially very high variance. Variance reduction for policy gradient, often through baseline estimation, continues to be an active area of research.

The simplicity of the REINFORCE / weight perturbation updates makes it
tempting to apply them to problems of arbitrary complexity. But a major
concern for the algorithm is its performance - although we have shown that
the update is in the direction of the true gradient *on average*, it
may still require a prohibitive number of computations to obtain a local
minima.

In this section, we will investigate the performance of the weight
perturbation algorithm by investigating its *signal-to-noise* ratio
(SNR). This idea was explored in

For the weight perturbation update we have: $$\avg{\Delta \balpha}^T\avg{\Delta \balpha} = \eta^2 \pd{g}{\balpha} \pd{g}{\balpha}^T = \eta^2 \sum_{i=1}^{N} \left( \pd{g}{\alpha_i} \right)^2,$$ \begin{align*} \avg{(\Delta \balpha)^T(\Delta \balpha)} =& \frac{\eta^2}{\sigma_\beta^4} \avg{\left[g(\balpha+\bbeta) - g(\balpha)\right]^2 \bbeta^T\bbeta} \\ \approx& \frac{\eta^2}{\sigma_\beta^4} \avg{ \left[ \pd{g}{\balpha} \bbeta \right]^2 \bbeta^T\bbeta } \\ =& \frac{\eta^2}{\sigma_\beta^4} \avg{ \left(\sum_i \pd{g}{\alpha_i}\beta_i\right) \left( \sum_j \pd{g}{\alpha_j} \beta_j\right) \left(\sum_k \beta_k^2 \right) } \\ =& \frac{\eta^2}{\sigma_\beta^4} \avg{ \sum_i \pd{g}{\alpha_i}\beta_i \sum_j \pd{g}{\alpha_j} \beta_j \sum_k \beta_k^2 } \\ =& \frac{\eta^2}{\sigma_\beta^4} \sum_{i,j,k} \pd{g}{\alpha_i} \pd{g}{\alpha_j} \avg{ \beta_i \beta_j \beta_k^2 }\\ \avg{\beta_i \beta_j \beta_k^2} = \begin{cases} 0 & i \neq j \\ \sigma_\beta^4 & i=j\neq k \\ \mu_4(\beta) & i = j =k \end{cases}& \\ =& \eta^2 (N-1) \sum_i \left( \pd{g}{\alpha_i} \right)^2 + \frac{\eta^2 \mu_4(z)}{\sigma_\beta^4} \sum_i \left(\pd{g}{\alpha_i}\right)^2 \end{align*} where $\mu_n(z)$ is the $n$th central moment of $z$: $$\mu_n(z) = \avg{(z - \avg{z})^n}.$$ Putting it all together, we have: $$\text{SNR} = \frac{1}{N - 2 + \frac{\mu_4(\beta_i)}{\sigma_\beta^4}}.$$

For $\beta_i$ drawn from a Gaussian distribution, we have $\mu_1 = 0, \mu_2 = \sigma^2_\beta, \mu_3 = 0, \mu_4 = 3 \sigma^4_\beta,$ simplifying the above expression to: $$\text{SNR} = \frac{1}{N+1}.$$

For $\beta_i$ drawn from a uniform distribution over [-a,a], we have $\mu_1 = 0, \mu_2 = \frac{a^2}{3} = \sigma_\beta^2, \mu_3 = 0, \mu_4 = \frac{a^4}{5} = \frac{9}{5}\sigma_\beta^4,$ simplifying the above to: $$\text{SNR} = \frac{1}{N-\frac{1}{5}}.$$

Performance calculations using the SNR can be used to design the parameters of the algorithm in practice. For example, based on these results it is apparent that noise added through a uniform distribution yields better gradient estimates than the Gaussian noise case for very small $N$, but that these differences are neglibile for large $N$.

Similar calculations can yield insight into picking the size of the additive noise (scaled by $\sigma_\beta$). The calculations in this section seem to imply that larger $\sigma_\beta$ can only reduce the variance, overcoming errors or noise in the baseline estimator, $\tilde{b}$; this is a short-coming of our first-order Taylor expansion. If the cost function is not linear in the parameters, then an examination of higher-order terms reveals that large $\sigma_\beta$ can increase the SNR. The derivation with a second-order Taylor expansion is left for an exercise.

The

**NOTE: If you are taking the MIT class, we have provided a Docker
workflow for you to make sure you are kept up to date week by week. Do NOT
use the instructions below. Follow
these instructions instead.**

First, pick your platform (click on your OS):

**Ubuntu Linux (Bionic)** |
Ubuntu Linux
(Xenial) |
Mac Homebrew
Ubuntu Linux
(Bionic) |
**Ubuntu Linux (Xenial)** |
Mac Homebrew
Ubuntu Linux
(Bionic) |
Ubuntu Linux (Xenial) | **Mac Homebrew**

`git clone https://github.com/RussTedrake/underactuated.git`

and install the prerequisites using the platform-specific installation script provided:

```
sudo underactuated/scripts/setup/ubuntu/18.04/install_prereqs
export PYTHONPATH=`pwd`/underactuated/src:${PYTHONPATH}
```

```
sudo underactuated/scripts/setup/ubuntu/16.04/install_prereqs
export PYTHONPATH=`pwd`/underactuated/src:${PYTHONPATH}
```

```
underactuated/scripts/setup/mac/install_prereqs
export PYTHONPATH=`pwd`/underactuated/src:${PYTHONPATH}
```

Although I hope that power users will build

`curl -o drake.tar.gz https://drake-packages.csail.mit.edu/drake/nightly/drake-latest-bionic.tar.gz`

`curl -o drake.tar.gz https://drake-packages.csail.mit.edu/drake/nightly/drake-latest-xenial.tar.gz`

`curl -o drake.tar.gz https://drake-packages.csail.mit.edu/drake/nightly/drake-latest-mac.tar.gz`

```
sudo tar -xvzf drake.tar.gz -C /opt
sudo /opt/drake/share/drake/setup/install_prereqs
export PYTHONPATH=/opt/drake/lib/python2.7/site-packages:${PYTHONPATH}
```

```
sudo tar -xvzf drake.tar.gz -C /opt
/opt/drake/share/drake/setup/install_prereqs
export PYTHONPATH=/opt/drake/lib/python2.7/site-packages:${PYTHONPATH}
export PATH=/usr/local/opt/python/libexec/bin:${PATH}
```

```
python -c 'import pydrake; print(pydrake.__file__)'
```

Further tips and instructions are available at the python documentation page on the drake website.

Note: All of the examples in the textbook supplement currently assume
that working directory of your python interpreter is the
`underactuated/src`

directory.

The examples you'll find throughout the text are provided so that
they can be run in any python interpreter. However, many people
*love* the Jupyter Notebook
workflow, and the code should work for you there, as well.

Here is an example that
shows you how to tell `matplotlib`

to perform well in the
notebook (including with animations).

The next chapter in the Appendix provides a tutorial for working with
dynamical systems in

The best overview we have for the remaining capabilities in

This chapter provides a short tutorial for modeling input-output dynamical
systems in `drake::systems`

that provide an API for building complex dynamical
systems by connecting simple dynamical systems into a block diagram. The
modeling approach is very much inspired by MATLAB's Simulink (though without
the graphical user interface). Like Simulink, `drake::systems`

provides a library of existing systems and also makes it easy for you to
define your own new dynamical systems. The python wrappers in
`pydrake`

used in these notes provide a convenient set of
bindings on top of this systems framework, allowing you to write new
systems in either Python or C++, and to use them interchangeably.

A major difference between modeling dynamical systems in

We'll return to the more advanced modeling tools in a minute, but let us start with some very simple examples.

A quick note on documentation. The C++ classes are beautifully documented
in doxygen. I currently
recommend using this doxygen as the primary source of documentation even
for your Python development... most of the `pydrake`

API maps
directly to that C++ API.

In this section, we will describe how you can write your own dynamics
class in pydrake. Every user system should derive from the
`pydrake.systems.framework.LeafSystem`

class.

However, many dynamical systems we are interested in are represented by
a simple vector field in state-space form, where we often use $x$ to denote the
state vector, $u$ for the input vector, and $y$ for the output vector. To
make it even easier to implement systems in this form, we provide another
subclass `pydrake.systems.framework.VectorSystem`

that provides
an excellent starting point.

**Continuous-Time Systems.** Consider a basic continuous-time
nonlinear input-output dynamical system described by the following
state-space equations: \begin{gather*} \dot{x} = f(t,x,u), \\ y = g(t,x,u).
\end{gather*} In `pydrake`

, you can instantiate a system of this
form where $f()$ and $g()$ are anything that you can write into a Python
function, as illustrated by the following example:

Consider the system \begin{gather*}\dot{x} = -x + x^3,\\ y =
x.\end{gather*} This system has zero inputs, one (continuous) state
variable, and one output. It can be implemented in

**Discrete-Time Systems**. Implementing a basic discrete-time system
in

Consider the system \begin{gather*}x[n+1] = x^3[n],\\ y[n] =
x[n].\end{gather*} This system has zero inputs, one (discrete) state
variable, and one output. It can be implemented in

Although these simple VectorSystems are a nice way to get started, in fact Drake supports authoring a wide variety of systems with multiple inputs and outputs, mixed discrete- and continuous- dynamics, hybrid dynamics with guards and resets, systems with constraints, and even stochastic systems. To expose more of the underlying system framework, you should derive your system from LeafSystem directly, instead of using the simplified VectorSystem interface.

Once you have acquired a System object describing the dynamics of
interest, the most basic thing that you can do is to simulate it. This is
accomplished with the `pydrake.framework.analysis.Simulator`

class. This class provides access to a rich suite of numerical integration
routines, with support for variable-step integration, stiff solvers, and
event detection.

In order to view the data from a simulation after the simulation has run,
you should add a `pydrake.framework.primitives.SignalLogger`

system to your diagram.

Use the following code to instantiate the
`SimpleContinuousTimeSystem `

, simulate it, and plot the
results:

This should generate the plot below

The code for running the discrete time system is basically identical:

For many systems, the simulation will run much faster than real time.
If you would like to tell the simulator to slow down (if possible) to some
multiple of the real time clock, then consider using the
`set_target_realtime_rate`

method of the `Simulator`

.
This is useful, for example, if you are trying to animate your robot as it
simulates, and would like the benefit of physical intuition, or if you are
trying to use the simulation as a part of a multi-process real-time control
system.

If you were looking carefully, you might have noticed a few instances of
the word "context" in the code snippets above. The Context
is a core concept in the

Systems know how to create an instance of a Context (see CreateDefaultContext). In the simulation example above, the Simulator created a Context for us. We retreived the Context from the Simulator in order to set the initial conditions (state) of the system before running the simulation.

Note that a Context is not completely defined unless all of the input ports are connected (simulation and other method calls will fail if they are not). For input ports that are not directly tied to the output of anther system, consider using the FixInputPort method of the Context.

The real modeling power of `DiagramBuilder`

class to ```
AddSystem
```

s and to `Connect`

input ports to output ports or to
expose them as inputs/output of the diagram. Then we call ```
Build
```

to generate the new `Diagram`

instance, which is just
another `System`

in the framework, and can be simulated or
analyzed using the entire suite of tools.

In the example below, we connect two subsystems (the plant and the
visualizer), and expose the input of the plant as an input to the
`Diagram`

being constructed:

Just like any other `System`

, a `Diagram`

has
a `Context`

. You can always extract the context of a
subsystem using `Diagram.GetSubSystemContext`

or
`Diagram.GetMutableSubsystemContext`

. (If it is not clear
yet, "mutable" means you have write access, otherwise the data is
`const`

; we go to some length in the systems framework to
protect algorithms / users from violating the fundamental systems
abstractions).

The equations of motion for a standard robot can be derived using the
method of Lagrange. Using $T$ as the total kinetic energy of the system,
and $U$ as the total potential energy of the system, $L = T-U$, and $Q_i$ as
the generalized force corresponding to $q_i$, the Lagrangian dynamic
equations are: \begin{equation} \frac{d}{dt}\pd{L}{\dot{q}_i} - \pd{L}{q_i}
= Q_i.\end{equation} If you are not comfortable with these equations, then
any good book chapter on rigid body mechanics can bring you up to speed --
try

Consider the simple double pendulum with torque actuation at both
joints and all of the mass concentrated in two points (for simplicity).
Using $\bq = [\theta_1,\theta_2]^T$, and ${\bf p}_1,{\bf p}_2$ to denote
the locations of $m_1,m_2$, respectively, the kinematics of this system
are:
\begin{eqnarray*}
{\bf p}_1 =& l_1\begin{bmatrix} s_1 \\ - c_1 \end{bmatrix}, &{\bf p}_2 =
{\bf p}_1 + l_2\begin{bmatrix} s_{1+2} \\ - c_{1+2} \end{bmatrix} \\
\dot{{\bf p}}_1 =& l_1 \dot{q}_1\begin{bmatrix} c_1 \\ s_1 \end{bmatrix},
&\dot{{\bf p}}_2 = \dot{{\bf p}}_1 + l_2 (\dot{q}_1+\dot{q}_2) \begin{bmatrix} c_{1+2} \\ s_{1+2} \end{bmatrix}
\end{eqnarray*}
Note that $s_1$ is shorthand for $\sin(q_1)$, $c_{1+2}$ is shorthand for
$\cos(q_1+q_2)$, etc. From this we can write the kinetic and potential
energy:
\begin{align*}
T =& \frac{1}{2} m_1 \dot{\bf p}_1^T \dot{\bf p}_1 + \frac{1}{2} m_2
\dot{\bf p}_2^T \dot{\bf p}_2 \\
=& \frac{1}{2}(m_1 + m_2) l_1^2 \dot{q}_1^2 + \frac{1}{2} m_2 l_2^2 (\dot{q}_1 + \dot{q}_2)^2 + m_2 l_1 l_2 \dot{q}_1 (\dot{q}_1 + \dot{q}_2) c_2 \\
U =& m_1 g y_1 + m_2 g y_2 = -(m_1+m_2) g l_1 c_1 - m_2 g l_2 c_{1+2}
\end{align*}
Taking the partial derivatives $\pd{T}{q_i}$, $\pd{T}{\dot{q}_i}$, and
$\pd{U}{q_i}$ ($\pd{U}{\dot{q}_i}$ terms are always zero), then
$\frac{d}{dt}\pd{T}{\dot{q}_i}$, and plugging them into the Lagrangian,
reveals the equations of motion:
\begin{align*}
(m_1 + m_2) l_1^2 \ddot{q}_1& + m_2 l_2^2 (\ddot{q}_1 + \ddot{q}_2) + m_2 l_1 l_2 (2 \ddot{q}_1 + \ddot{q}_2) c_2 \\
&- m_2 l_1 l_2 (2 \dot{q}_1 + \dot{q}_2) \dot{q}_2 s_2 + (m_1 + m_2) l_1 g s_1 + m_2 g l_2 s_{1+2} = \tau_1 \\
m_2 l_2^2 (\ddot{q}_1 + \ddot{q}_2)& + m_2 l_1 l_2 \ddot{q}_1 c_2 + m_2 l_1 l_2
\dot{q}_1^2 s_2 + m_2 g l_2 s_{1+2} = \tau_2
\end{align*}
As we saw in chapter 1, numerically integrating (and animating) these
equations in

If you crank through the Lagrangian dynamics for a few simple robotic
manipulators, you will begin to see a pattern emerge - the resulting
equations of motion all have a characteristic form. For example, the
kinetic energy of your robot can always be written in the form:
\begin{equation} T = \frac{1}{2} \dot{\bq}^T \bM(\bq)
\dot{\bq},\end{equation} where $\bM$ is the state-dependent inertia matrix
(aka mass matrix). This observation affords some insight into general
manipulator dynamics - for example we know that ${\bf M}$ is always positive
definite, and symmetric

Continuing our abstractions, we find that the equations of motion of a
general robotic manipulator (without kinematic loops) take the form
\begin{equation}{\bf M}(\bq)\ddot{\bq} + \bC(\bq,\dot{\bq})\dot{\bq} =
\btau_g(\bq) + {\bf B}\bu, \label{eq:manip} \end{equation} where $\bq$ is
the state vector, ${\bf M}$ is the inertia matrix, $\bC$ captures Coriolis
forces, and $\btau_g$ is the gravity vector. The matrix $\bB$ maps control
inputs $\bu$ into generalized forces. Note that we pair $\bM\ddot\bq +
\bC\dot\bq$ on the left side because "... the equations of motion depend on
the choice of coordinates $\bq$. For this reason neither $\bM\ddot\bq$ nor
$\bC\dot\bq$ individually should be thought of as a generalized force; only
their sum is a force"

The manipulator equations are very general, but they do define some important characteristics. For example, $\ddot{\bq}$ is (state-dependent) linearly related to the control input, $\bu$. This observation justifies the control-affine form of the dynamics assumed throughout the notes.

Note that we have chosen to use the notation of second-order systems
(with $\dot{\bq}$ and $\ddot{\bq}$ appearing in the equations) throughout
this book. Although I believe it provides more clarity, there is an
important limitation to this notation: it is impossible to describe 3D
rotations in "minimal coordinates" using this notation without introducing
kinematic singularities (like the famous "gimbal lock"). For instance, a
common singularity-free choice for representing a 3D rotation is the unit
quaternion, described by 4 real values (plus a norm constraint). However we
can still represent the rotational velocity without singularities using just
3 real values. This means that the length of our velocity vector is no
longer the same as the length of our position vector. For this reason, you
will see that most of the software in

If our robot has closed-kinematic chains, for instance those that arise from a four-bar linkage, then we need a little more. The Lagrangian machinery above assumes "minimal coordinates"; if our state vector $\bq$ contains all of the links in the kinematic chain, then we do not have a minimal parameterization -- each kinematic loop adds (at least) one constraint so should remove (at least) one degree of freedom. Although some constraints can be solved away, the more general solution is to use the Lagrangian to derive the dynamics of the unconstrained system (a kinematic tree without the closed-loop constraints), then add additional generalized forces that ensure that the constraint is always satisfied.

Consider the constraint equation \begin{equation}\bh(\bq) = 0.\end{equation} For the case of the kinematic closed-chain, this can be the kinematic constraint that the location of one end of the chain is equal to the location of the other end of the chain. The equations of motion can be written \begin{equation}\bM({\bq})\ddot{\bq} + \bC(\bq,\dot{\bq})\dot\bq = \btau_g(\bq) + \bB\bu + \bH^T(\bq) \blambda,\end{equation} where $\bH(\bq) = \pd\bh{\bq}$ and ${\blambda}$ is the constraint force. Let's use the shorthand \begin{equation} \bM({\bq})\ddot{\bq} = \btau(q,\dot{q},u) + \bH^T(\bq) \blambda. \label{eq:manip_short} \end{equation}

Using \begin{gather}\dot\bh = \bH\dot\bq,\\ \ddot\bh = \bH \ddot\bq + \dot\bH \dot\bq, \label{eq:ddoth} \end{gather} we can solve for $\blambda$, by observing that when the constraint is imposed, $\bh=0$ and therefore $\dot\bh=0$ and $\ddot\bh=0$. Inserting the dynamics (\ref{eq:manip_short}) into (\ref{eq:ddoth}) yields \begin{equation}\blambda =- (\bH \bM^{-1} \bH^T)^+ (\bH \bM^{-1} \btau + \dot\bH\dot\bq).\end{equation} The $^+$ notation refers to a Moore-Penrose pseudo-inverse. In many cases, this matrix will be full rank (for instance, in the case of multiple independent four-bar linkages) and the traditional inverse could have been used. When the matrix drops rank (multiple solutions), then the pseudo-inverse will select the solution with the smallest constraint forces in the least-squares sense.

To combat numerical "constraint drift", one might like to add a restoring force in the event that the constraint is not satisfied to numerical precision. To accomplish this, rather than solving for $\ddot\bh = 0$ as above, we can instead solve for \begin{equation}\ddot\bh = \bH\ddot\bq + \dot\bH\dot\bq = -2\alpha \dot\bh - \alpha^2 \bh,\end{equation} where $\alpha>0$ is a stiffness parameter. This is known as Baumgarte's stabilization technique, implemented here with a single parameter to provide a critically-damped response. Carrying this through yields \begin{equation} \blambda =- (\bH \bM^{-1} \bH^T)^+ (\bH \bM^{-1} \btau + (\dot{\bH} + 2\alpha \bH)\dot{\bq} + \alpha^2 \bh). \end{equation}

Consider the constraint equation \begin{equation}\bh_v(\bq,\dot{\bq}) = 0,\end{equation} where $\pd{\bh_v}{\dot{\bq}} \ne 0.$ These are less common, but arise when, for instance, a joint is driven through a prescribed motion. Here, the manipulator equations are given by \begin{equation}\bM\ddot{\bq} = \btau + \pd{\bh_v}{\dot{\bq}}^T \blambda.\end{equation} Using \begin{equation} \dot\bh_v = \pd{\bh_v}{\bq} \dot{\bq} + \pd{\bh_v}{\dot{\bq}}\ddot{\bq},\end{equation} setting $\dot\bh_v = 0$ yields \begin{equation}\blambda = - \left( \pd{\bh_v}{\dot{\bq}} \bM^{-1} \pd{\bh_v}{\dot{\bq}} \right)^+ \left[ \pd{\bh_v}{\dot{\bq}} \bM^{-1} \btau + \pd{\bh_v}{\bq} \dot{\bq} \right].\end{equation}

Again, to combat constraint drift, we can ask instead for $\dot\bh_v = -\alpha \bh_v$, which yields \begin{equation}\blambda = - \left( \pd{\bh_v}{\dot{\bq}} \bM^{-1} \pd{\bh_v}{\dot{\bq}} \right)^+ \left[ \pd{\bh_v}{\dot{\bq}} \bM^{-1} \btau + \pd{\bh_v}{\bq} \dot{\bq} + \alpha \bh_v \right].\end{equation}

Now let $\phi(\bq)$ indicate the relative (signed) distance between two rigid bodies. For rigid contact, we would like to enforce the unilateral constraint: \begin{equation} \phi(\bq) \geq 0. \end{equation}

There are three main approaches used to modeling contact with "rigid" body systems: 1) rigid contact approximated with stiff compliant contact, 2) hybrid models with collision event detection, impulsive reset maps, and continuous (constrained) dynamics between collision events, and 3) rigid contact approximated with time-averaged forces (time-stepping). Each modeling approach has advantages/disadvantages for different applications...

Most compliant contact models are conceptually straight-forward: we will implement contact forces using a stiff spring (and damper) that produces forces to resist penetration (and grossly model the dissipation of collision and/or friction). However, the devil is in the details.

The first detail comes in defining the signed distance function, $\phi(\bq)$. It is natural to define the distance between two objects that are not in collision as the smallest Euclidean distance between any two points on the bodies, and the distance for two bodies in collision as the (negative) maximum penetration depth. Unfortunately, computing the penetration depth for arbitrary bodies is a very complex numerical geometry problem, and is generally considered too expensive to be computed exactly inside, for instance, a simulation loop.

The second detail is about where we should put the spring/dampers. One candidate would be to add a spring along the line of maximal penetration, but even if one could compute it accurately this line might not be unique, and can move discontinuously with small changes in $\bq$. (This is one of the reasons that simulated robots can "explode" in a number of modern simulators; as we will see the time-stepping methods can have the same pitfall). Also a single point force cannot capture effects like torsional friction, and performs badly in some very simple cases (imaging a box sitting on a table). As a result, many of the most effective algorithms for contact restrict themselves to very limited/simplified geometry. For instance, one can place "point contacts" (zero-radius spheres) in the four corners of a robot's foot; in this case adding forces at exactly those four points as they penetrate some other body/mesh can give more consistent contact force locations/directions. A surprising number of simulators, especially for legged robots, do this.

In practice, most collision detection algorithms will return a list of "collision point pairs" given the list of bodies (with one point pair per pair of bodies in collision, or more if they are using the aforementioned "multi-contact" heuristics), and our contact force model simply attaches springs between these pairs. Given a point-pair, $p_A$ and $p_B$, both in world coordinates, ...

For this simple case, we can define the normal force direction as... the frictional force as... and write a spring-law...

In order to tightly approximate the (nearly) rigid contact that most
robots make with the world, the stiffness of these springs must be quite
high. For instance, I might want my 180kg humanoid robot model to
penetrate into the ground no more than 1mm during steady-state standing.
The challenge with adding stiff springs to our model is that this results
in stiff
differential equations (it is not a coincidence that the word
*stiff* is the common term for this in both springs and ODEs). As
a result, the best implementations of compliant contact for simulation
must use stiff ODE solvers, and these models are often difficult or
impossible to use in e.g. trajectory/feedback optimization.

Some of the advantages of this approach include (1) the fact that it is easy to implement, at least for simple geometries, (2) by virtue of being a continuous-time model it can be simulated with error-controlled integrators, and (3) the tightness of the approximation of "rigid" contact can be controlled through relatively intuitive parameters.

The collision event is described by the zero-crossings (from positive to negative) of $\phi$. Let us start by assuming frictionless collisions, allowing us to write \begin{equation}\bM\ddot{\bq} = \btau + \Phi^T \lambda,\end{equation} where $\Phi = \pd{\phi}{\bq}$ and $\lambda \ge 0$ is now a (scalar) impulsive force that is well-defined when integrated over the time of the collision (denoted $t_c^-$ to $t_c^+$). Integrate both sides of the equation over that (instantaneous) interval: \begin{align*}\int_{t_c^-}^{t_c^+} dt\left[ \bM \ddot{\bq} \right] = \int_{t_c^-}^{t_c^+} dt \left[ \btau + \Phi^T \lambda \right] \end{align*} Since $\bM$, $\btau$, and $\bPhi$ are constants over this interval, we are left with $$\bM\dot{\bq}^+ - \bM\dot{\bq}^- = \Phi^T \int_{t_c^-}^{t_c^+} \lambda dt,$$ where $\dot{\bq}^+$ is short-hand for $\dot{\bq}(t_c^+)$. Multiplying both sides by $\Phi \bM^{-1}$, we have $$ \Phi \dot{\bq}^+ - \Phi\dot{\bq}^- = \Phi \bM^{-1} \Phi^T \int_{t_c^-}^{t_c^+} \lambda dt.$$ After the collision, we have $\dot\phi^+ = -e \dot\phi^-$, with $0 \le e \le 1$ denoting the coefficient of restitution, yielding: $$ \Phi \dot{\bq}^+ - \Phi\dot{\bq}^- = -(1+e)\Phi\dot{\bq}^-,$$ and therefore $$\int_{t_c^-}^{t_c^+} \lambda dt = - (1+e)\left[ \Phi \bM^{-1} \Phi^T \right]^+ \Phi \dot{\bq}^-.$$ Substituting this back in above results in \begin{equation}\dot{\bq}^+ = \left[ \bI - (1+e)\bM^{-1} \Phi^T \left[ \Phi \bM^{-1} \Phi^T \right]^+ \Phi \right] \dot{\bq}^-.\end{equation}

We can add friction at the contact. Rigid bodies will almost always
experience contact at only a point, so we typically ignore torsional
friction

We can put the bilateral constraint equations and the impulsive collision equations together to implement a hybrid model for unilateral constraints of the form. Let us generalize \begin{equation}\bphi(\bq) \ge 0,\end{equation} to be the vector of all pairwise (signed) distance functions between rigid bodies; this vector describes all possible contacts. At every moment, some subset of the contacts are active, and can be treated as a bilateral constraint ($\bphi_i=0$). The guards of the hybrid model are when an inactive constraint becomes active ($\bphi_i>0 \rightarrow \bphi_i=0$), or when an active constraint becomes inactive ($\blambda_i>0 \rightarrow \blambda_i=0$ and $\dot\phi_i > 0$). Note that collision events will (almost always) only result in new active constraints when $e=0$, e.g. we have pure inelastic collisions, because elastic collisions will rarely permit sustained contact.

If the contact involves Coulomb friction, then the transitions
between sticking and sliding can be modeled as additional hybrid
guards, or can be solved efficiently as a complementarity problem

The equations of motions for our machines get complicated quickly.
Fortunately, for robots with a tree-link kinematic structure, there are very
efficient and natural recursive algorithms for generating these equations of
motion. For a detailed reference on these methods, see

We have a short introduction to mathematical programming in

Coming soon... a collection of tips and formulations that can make seemingly non-smooth constraints smooth, seemingly non-convex constraints convex, etc.