Underactuated Robotics

Algorithms for Walking, Running, Swimming, Flying, and Manipulation

Russ Tedrake

© Russ Tedrake, 2020

Note: These are working notes used for a course being taught at MIT. They will be updated throughout the Spring 2020 semester. Lecture videos are available on YouTube.


Policy Search

So far, most of our recommendations for control design have been relatively "local" -- leveraging trajectory planning/optimization as a tool and our ability to locally stabilize trajectories for even very complex systems using linear optimal control. This is in stark contrast to the dynamic programming / value iteration methods that we started with, which attempt to solve for a control policy for every possible state; unfortunately, the dynamic programming methods as presented are restricted to relatively low dimensional state spaces. What is missing so far are algorithms for synthesizing feedback controllers that scale to large state spaces and produce controllers that are, hopefully, less "local" than trajectory stabilization.

In this chapter, we will explore another very natural idea: let us parameterize a controller with some decision variables, and then search over those decision variables directly in order to achieve a task and/or optimize a performance objective. We'll refer to this broad class of methods as "policy search" or, when optimization methods are used, "policy optimization".

Problem formulation

Consider a static full-state feedback policy, $$\bu = \bpi_\balpha(\bx),$$ where $\bpi$ is potentially a nonlinear function, and $\balpha$ is the vector of parameters that describe the controller. The controller might take time as an input, or might even have its own internal state, but let's start with this simple form.

How should we write an objective function for optimizing $\balpha$? The approach that we used for trajectory optimization is quite reasonable -- the objective was typically to minimize an integral cost over some time horizon (be it finite or infinite). But in trajectory optimization, the cost is only ever defined based on forward simulation from a single initial condition. We used the same additive cost structures in dynamic programming, where the Hamilton-Jacobi-Bellman equation provided optimality conditions for optimizing an additive cost from every initial condition; at least in the idealized equations, we were able to get away with saying $\forall \bx, \minimize_\bu ...$.

But now we are playing a different game. If we are searching over some finitely parameterized policy, $\bpi_{\balpha}$, we can almost never expect to be optimal for every state -- and we need to somehow define the relative importance of different states. For finite-time problems, this can be done with a distribution over initial conditions. For infinite-horizon problems, what really matters is the stationary distribution (which depends on the policy). Let's start with the distribution over initial conditions.
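
One way to make this concrete, as a sketch under assumed notation: write $\mathcal{D}$ for a distribution over initial conditions, $\ell(\bx, \bu)$ for a running cost, $T$ for the horizon, and $\dot\bx = f(\bx, \bu)$ for the dynamics (none of these symbols are defined above; they are placeholders for illustration). The finite-time policy search problem could then be posed as $$\min_\balpha \; E_{\bx_0 \sim \mathcal{D}} \left[ \int_0^T \ell\big(\bx(t), \bpi_\balpha(\bx(t))\big) \, dt \right], \quad \text{subject to } \dot\bx(t) = f\big(\bx(t), \bpi_\balpha(\bx(t))\big), \; \bx(0) = \bx_0.$$ In practice the expectation would typically have to be approximated, for instance by averaging the cost over a set of sampled initial conditions.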

Controller parameterizations

more coming soon...

It might be tempting to optimize an LQR problem by searching directly over the feedback parameters, ${\bf K}$. It was shown only quite recently that this search can converge to the optimal ${\bf K}$ [1]. But other, and likely more efficient, parameterizations are also known [2].
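
To get a feel for what searching directly over ${\bf K}$ looks like, here is a minimal sketch in Python (NumPy only). It is not the algorithm analyzed in the convergence result cited above; it is plain gradient descent with a backtracking line search on a continuous-time LQR cost, using the standard closed-form expression for the gradient in terms of two Lyapunov equations. The double-integrator plant, the choices ${\bf Q} = {\bf I}$, ${\bf R} = 1$, the identity covariance `Sigma0` for the initial state, the initial gain `K0`, and all function names are assumptions made for illustration.

```python
import numpy as np

def lyap(A, Q):
    """Solve A X + X A^T + Q = 0 with Kronecker products (fine for small systems)."""
    n = A.shape[0]
    I = np.eye(n)
    M = np.kron(A, I) + np.kron(I, A)
    return np.linalg.solve(M, -Q.flatten()).reshape(n, n)

def cost_and_grad(K, A, B, Q, R, Sigma0):
    """Expected LQR cost E[x0' P x0] for u = -K x, and its gradient w.r.t. K."""
    Acl = A - B @ K
    if np.max(np.linalg.eigvals(Acl).real) >= 0:
        return np.inf, None  # the cost is unbounded for a non-stabilizing K
    P = lyap(Acl.T, Q + K.T @ R @ K)  # cost-to-go matrix for this K
    Sigma = lyap(Acl, Sigma0)         # time-integrated closed-loop state covariance
    return np.trace(P @ Sigma0), 2.0 * (R @ K - B.T @ P) @ Sigma

def policy_search_lqr(K0, A, B, Q, R, Sigma0, iters=500):
    """Gradient descent on K; backtracking keeps every iterate stabilizing."""
    K = K0.copy()
    cost, grad = cost_and_grad(K, A, B, Q, R, Sigma0)
    for _ in range(iters):
        step = 1.0
        while step > 1e-12:
            K_new = K - step * grad
            new_cost, new_grad = cost_and_grad(K_new, A, B, Q, R, Sigma0)
            # Armijo sufficient-decrease test; an infinite cost always fails it.
            if new_cost <= cost - 1e-4 * step * np.sum(grad**2):
                break
            step *= 0.5
        else:
            return K  # no acceptable step found: treat as converged
        K, cost, grad = K_new, new_cost, new_grad
    return K

# Illustrative problem (an assumption): double integrator with Q = I, R = 1.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])
Sigma0 = np.eye(2)
K0 = np.array([[1.0, 1.0]])  # must be stabilizing to start

K = policy_search_lqr(K0, A, B, Q, R, Sigma0)
print(K)
```

For this particular example the Riccati equation can be solved by hand and gives ${\bf K} = [1, \sqrt{3}]$, so the search result can be checked against it. Note that the search has to start from a stabilizing gain, and that the line search rejects any step that leaves the stabilizing set, since the cost is infinite there.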

The case of output feedback is not as nice. Searching directly over static output feedback controllers, $\bu = -{\bf K}\by$, is known to be hard -- even the set of stabilizing ${\bf K}$ matrices can be disconnected. We can see that with a simple example (given to me once during a conversation with Alex Megretski).

Parameterizations of Static Output Feedback

Consider the single-input, single-output LTI system $$\dot{\bx} = {\bf A}\bx + {\bf B} u, \quad y = {\bf C}\bx,$$ with $${\bf A} = \begin{bmatrix} 0 & 0 & 2 \\ 1 & 0 & 0 \\ 0 & 1 & 0\end{bmatrix}, \quad {\bf B} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \quad {\bf C} = \begin{bmatrix} 1 & 1 & 3 \end{bmatrix}.$$ Here the linear static-output-feedback policy can be written as $u = -ky$, with a single scalar parameter $k$.
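
To see how $k$ enters, it may help to write out the closed-loop system explicitly. Substituting $u = -ky$ gives $\dot\bx = ({\bf A} - k{\bf B}{\bf C})\bx$, with $${\bf A} - k{\bf B}{\bf C} = \begin{bmatrix} -k & -k & 2 - 3k \\ 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}.$$ This matrix is in companion form, so its characteristic polynomial can be read off from the first row: $$\det(s{\bf I} - {\bf A} + k{\bf B}{\bf C}) = s^3 + k s^2 + k s + (3k - 2).$$ The closed-loop eigenvalues are the roots of this cubic.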

Go ahead and make a plot of the maximum real part of the closed-loop eigenvalues (the eigenvalues of ${\bf A} - k{\bf B}{\bf C}$) as a function of $k$. The system is stable only when this quantity is less than zero. You'll find that the set of stabilizing $k$'s is disconnected.
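
A minimal sketch of that experiment in Python (NumPy only) follows. The grid of gains, the small stability tolerance `tol`, and the helper names are assumptions chosen for illustration; the arrays `ks` and `max_real` can be handed to any plotting library to produce the plot itself.

```python
import numpy as np

A = np.array([[0.0, 0.0, 2.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
B = np.array([[1.0], [0.0], [0.0]])
C = np.array([[1.0, 1.0, 3.0]])

def max_real_eig(k):
    """Largest real part among the eigenvalues of the closed-loop matrix A - k B C."""
    return np.max(np.linalg.eigvals(A - k * B @ C).real)

def stabilizing_intervals(ks, tol=1e-9):
    """Group consecutive stabilizing grid points into (first, last) pairs."""
    intervals, start, prev = [], None, None
    for k in ks:
        stable = max_real_eig(k) < -tol
        if stable and start is None:
            start = k                        # a stable run begins here
        elif not stable and start is not None:
            intervals.append((start, prev))  # the run ended at the previous point
            start = None
        prev = k
    if start is not None:
        intervals.append((start, prev))      # run extends to the end of the grid
    return intervals

ks = np.linspace(0.0, 4.0, 801)              # gains to test (illustrative range)
max_real = np.array([max_real_eig(k) for k in ks])
intervals = stabilizing_intervals(ks)
print(intervals)
```

On this grid the sweep should report two separate intervals of stabilizing gains. That matches what the Routh-Hurwitz conditions give for this closed loop, namely $k > 2/3$ together with $(k-1)(k-2) > 0$, so the stabilizing set is $2/3 < k < 1$ or $k > 2$ (the second interval is cut off by the upper end of the grid).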

Trajectory-based policy search

Lyapunov-based approaches to policy search.

Approximate Dynamic Programming


[1] Maryam Fazel, Rong Ge, Sham M. Kakade, and Mehran Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. arXiv:1801.05039 [cs.LG], 2018.
[2] John Roberts, Ian Manchester, and Russ Tedrake. Feedback controller parameterizations for reinforcement learning. In Proceedings of the 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2011.
