Behavior cloning

Behavior cloning Pomerleau88+Bain95 attempts to learn a policy directly from the policy's input-output data using supervised learning. In the case of full-state feedback, where we know that the optimal policy can be represented as a simple function of the state, this is typically cast as a simple regression problem, e.g. $\min_\theta \sum_{i} |\bu_i - \pi_\theta(\bx_i)|^2,$ where $\theta$ are the parameters of e.g. the neural network. In the richer case of output feedback, then this becomes a sequence learning problem (we are learning a dynamical system which transforms a sequence of observations into a sequence of actions).

Famously, Large Language Models (LLMs) are trained with behavior cloning (and then fine-tuned to make them more aligned with human preferences Ziegler19+Zhang24). OpenAI's GPT models are autoregressive models that predict the next token given a "context" of recent tokens Radford18+Radford19+Brown20+Bubeck23. More recently, we've seen the extension to multi-modal models, such as visually-conditioned language models (VLMs) like Flamingo, GPT-4V, and LLaVa Liu23 with an increasing wealth of open-source reproductions Karamcheti24+Laurençon24.

Is predicting actions fundamentally different from next-token prediction in language? There are a few reasons why it might be. Actions are continuous and high-dimensional, whereas language tokens are discrete. Our control systems get put into the feedback loop with physics, and have to deal with stochasticity from the environment that LLMs don't experience. Google DeepMind released the RT series of "vision-language-action" (VLA) models (RT-1 Brohan22, RT-2 Brohan23, and RT-X O’Neill24) which started to show that predicting actions may not be so fundamentally different than language, at least for simple pick-and-place tasks. Then in 2023, there were two main lines of work which convincingly demonstrated this capability could even extend to surprisngly dexterous manipulation. Those were the Diffusion Policy Chi24 and the Action Chunking Transformer (ACT) from the ALOHA project Zhao23. Of course, these were built on a long line of progress on "visuomotor policies" which we'll discuss below.

Diffusion Policy started as an intern project(!) by Cheng Chi at TRI, and blossomed into a great collaboration with Shuran Song.

Diffusion Policy and ALOHA seem to have been the watershed results for dexterous manipulation in robotics. Since that time, the internet is now rich with videos of highly dexterous manipulation from all sorts of robots ranging from very low cost manipulators up to and including humanoid robots with dexterous hands.

In one sense, it might seem a little disappointing that, after we've spent so much time in these notes exploring the rich mathematical foundations of dynamics and control, that behavior cloning from human teleop demonstrations can outperform out best methods for some class of problems (which are arguably more about understanding the world than about dynamics and control). But think of it this way: using supervised learning is an awfully clever way to explore the space of policy parameterizations, and has accelerated us into bigger questions about using cameras in the feedback loop, learning multitask/foundation models, and leveraging structure (such as 3D geometry / objectness) in our representations or not. The success of LLMs (and now multimodal models) is undeniable, and it would be a mistake to ignore the great new possibilities that have opened up. And I do think that there is something fundamentally about imitation learning for manipulation: in many cases the "rules of the game" (e.g. task specification) is not fully described by physics -- I've come to appreciate that humans bring an amazing amount of background knowledge / common sense to bear when they are performing even relatively simple manipulation tasks. I'm very confident that our knowledge of dynamics and control will help us penetrate the new vistas enabled by these high-capacity models.

Let me make one more important point. Sometimes people say that BC is fundamentally limited because it can never outperform the human demonstrator. But one of the early results from behavior cloning was that this limit that is not strictly true. cite the classic result; i forget which classic paper it came from. Chowdery23 figure 6 is relevant, but they seem to hedge a bit because perhaps the metric isn't great. Certainly Chat-GPT can produce text output that by some metrics surpasses the capabilities of any one human. Here is a recent case study of this phenomenon for chess.

There is also a precident for combining BC with other methods to improve beyond the original demonstrations. For DeepMind's AlphaGoSilver16, behavior cloning was only the first step -- it was enough to get us to a strong Go player, but not enough to be championship level. Adding "self-play" and improvement via Monte-Carlo tree search was fundamental to "mastering the game of Go".

Visuomotor policies (aka control from pixels)

In 2016, a few years after the start of the deep learning revolution, Levine16 introduced the concept of a "visuomotor policy", and produced fairly stunning videos of robots performing complicated tasks. The name "visuomotor" is chosen to emphasize that the policy is trained to predict actions directly from RGB camera images, using architectures similar to the one pictured below; in a sense this was a return to the early ideas of Pomerleau88, but now powered with deep learning architectures and pretraining.

The (now) classical visuomotor policy architecture, adapted from Levine16+Florence20. The emphasis here is that $z$ is a *learned state representation* which encodes the state of the environment.

My own journey with imitation learning started with a project led by Pete Florence and Lucas Manuelli on using a particular form of self-supervised learning for 3D geometries to train the policies Florence20. For me, this was the first time I got to really experience directly how the model was trained, and I got to experience the incredibly impressive robustness that was possible in the roll-outs, even with a relatively modest amount of robot training data.

Writing feedback controllers which operate directly on the RGB camera images was something entirely new for me. We had been using RGB-D cameras to do some amount of visual state estimation / pose estimation before that, but were leaning pretty heavily on the depth channel and very explicit 3D reasoning/matching. But I remember around the time of our first imitation learning project, I asked the students "if you could only choose one, RGB or depth, which would you choose". They chose RGB. There are many situations where a task is unclear or even ambiguous when looking only at a depth image. The ambiguities can often be resolved with the addition of RGB, and there are many depth cues in RGB that allow us and our visuomotor policies to be successfull for 3D tasks even without an explicit depth sensor. Diffusion Policy and ACT, for instance, achieve their amazing performance by consuming RGB (no depth).

In my experience, control theorists have very satisfying answers to almost any dynamics and control problem. But (with a few rare exceptions) they didn't do computer vision. This was something entirely new. The sensor model -- the mapping from, e.g. the state and parameters of a MultibodyPlant to the output image $y$ -- is potentially a full game-engine-quality renderer. Even though there are lots of projects now on making differentiable renderers, these can only do so much because the pixelation process is inherently very local/non-smooth. Going from the image back into a manageable intermediate (latent) state representation, $z$, started becoming viable with the rise of deep networks for perception. Data/learning does feel fundamental here -- mapping from RGB into a meaningful representation for control is more about the statistics of natural scenes than about the model-based physics of propagating light.

This, for me, was the first main lesson I took from our imitation learning work: closing feedback loops by directly consuming RGB at control rates is now possible, and is incredibly powerful for robust performance in visually complex settings. Imitation learning (and reinforcement learning) have so far enabled this in a way that model-based control pipelines which require explicit state (or belief) estimation do not. (Of course, techniques like teacher-student distillation Chen20+Hwangbo19 and Chou23 might help us bridge that gap.)

Behavior cloning as sequence modeling

Supervised learning in a feedback loop: dealing with distribution shift

One of the famous challenges in imitation learning is the problem of distribution shift. Imagine that you are training a policy for driving a car...

DAGGER Ross11

Teacher/Student.

Dealing with suboptimal and multimodal demonstrations

Discrete maze example w/ symmetries. Push-T example. is multimodality at odds with (local) stability? cite max's neurips paper.

References

Brenna D Argall and Sonia Chernova and Manuela Veloso and Brett Browning, "A survey of robot learning from demonstration", Robotics and autonomous systems, vol. 57, no. 5, pp. 469--483, 2009.

J.A. Bagnell, "An Invitation to Imitation", Tech. Report, CMU-RI-TR-15-08, March, 2015.

Takayuki Osa and Joni Pajarinen and Gerhard Neumann and J Andrew Bagnell and Pieter Abbeel and Jan Peters and others, "An algorithmic perspective on imitation learning", Foundations and Trends in Robotics, vol. 7, no. 1-2, pp. 1--179, 2018.

Dean A Pomerleau, "Alvinn: An autonomous land vehicle in a neural network", Advances in neural information processing systems, vol. 1, 1988.

Michael Bain and Claude Sammut, "A Framework for Behavioural Cloning.", Machine Intelligence 15 , pp. 103--129, 1995.

Daniel M Ziegler and Nisan Stiennon and Jeffrey Wu and Tom B Brown and Alec Radford and Dario Amodei and Paul Christiano and Geoffrey Irving, "Fine-tuning language models from human preferences", arXiv preprint arXiv:1909.08593, 2019.

Shengyu Zhang and Linfeng Dong and Xiaoya Li and Sen Zhang and Xiaofei Sun and Shuhe Wang and Jiwei Li and Runyi Hu and Tianwei Zhang and Fei Wu and Guoyin Wang, "Instruction Tuning for Large Language Models: A Survey", , 2024.

Alec Radford and Karthik Narasimhan and Tim Salimans and Ilya Sutskever and others, "Improving language understanding by generative pre-training", , 2018.

Alec Radford and Jeffrey Wu and Rewon Child and David Luan and Dario Amodei and Ilya Sutskever and others, "Language models are unsupervised multitask learners", OpenAI blog, vol. 1, no. 8, pp. 9, 2019.

Tom Brown and Benjamin Mann and Nick Ryder and Melanie Subbiah and Jared D Kaplan and Prafulla Dhariwal and Arvind Neelakantan and Pranav Shyam and Girish Sastry and Amanda Askell and others, "Language models are few-shot learners", Advances in neural information processing systems, vol. 33, pp. 1877--1901, 2020.

Sébastien Bubeck and Varun Chandrasekaran and Ronen Eldan and Johannes Gehrke and Eric Horvitz and Ece Kamar and Peter Lee and Yin Tat Lee and Yuanzhi Li and Scott Lundberg and others, "Sparks of artificial general intelligence: Early experiments with gpt-4", arXiv preprint arXiv:2303.12712, 2023.

Haotian Liu and Chunyuan Li and Qingyang Wu and Yong Jae Lee, "Visual Instruction Tuning", , 2023.

Siddharth Karamcheti and Suraj Nair and Ashwin Balakrishna and Percy Liang and Thomas Kollar and Dorsa Sadigh, "Prismatic vlms: Investigating the design space of visually-conditioned language models", arXiv preprint arXiv:2402.07865, 2024.

Hugo Laurençon and Léo Tronchon and Matthieu Cord and Victor Sanh, "What matters when building vision-language models?", , 2024.

Anthony Brohan and Noah Brown and Justice Carbajal and Yevgen Chebotar and Joseph Dabis and Chelsea Finn and Keerthana Gopalakrishnan and Karol Hausman and Alex Herzog and Jasmine Hsu and Julian Ibarz and Brian Ichter and Alex Irpan and Tomas Jackson and Sally Jesmonth and Nikhil Joshi and Ryan Julian and Dmitry Kalashnikov and Yuheng Kuang and Isabel Leal and Kuang-Huei Lee and Sergey Levine and Yao Lu and Utsav Malla and Deeksha Manjunath and Igor Mordatch and Ofir Nachum and Carolina Parada and Jodilyn Peralta and Emily Perez and Karl Pertsch and Jornell Quiambao and Kanishka Rao and Michael Ryoo and Grecia Salazar and Pannag Sanketi and Kevin Sayed and Jaspiar Singh and Sumedh Sontakke and Austin Stone and Clayton Tan and Huong Tran and Vincent Vanhoucke and Steve Vega and Quan Vuong and Fei Xia and Ted Xiao and Peng Xu and Sichun Xu and Tianhe Yu and Brianna Zitkovich, "RT-1: Robotics Transformer for Real-World Control at Scale", arXiv preprint arXiv:2212.06817 , 2022.

Anthony Brohan and Noah Brown and Justice Carbajal and Yevgen Chebotar and Xi Chen and Krzysztof Choromanski and Tianli Ding and Danny Driess and Avinava Dubey and Chelsea Finn and Pete Florence and Chuyuan Fu and Montse Gonzalez Arenas and Keerthana Gopalakrishnan and Kehang Han and Karol Hausman and Alex Herzog and Jasmine Hsu and Brian Ichter and Alex Irpan and Nikhil Joshi and Ryan Julian and Dmitry Kalashnikov and Yuheng Kuang and Isabel Leal and Lisa Lee and Tsang-Wei Edward Lee and Sergey Levine and Yao Lu and Henryk Michalewski and Igor Mordatch and Karl Pertsch and Kanishka Rao and Krista Reymann and Michael Ryoo and Grecia Salazar and Pannag Sanketi and Pierre Sermanet and Jaspiar Singh and Anikait Singh and Radu Soricut and Huong Tran and Vincent Vanhoucke and Quan Vuong and Ayzaan Wahid and Stefan Welker and Paul Wohlhart and Jialin Wu and Fei Xia and Ted Xiao and Peng Xu and Sichun Xu and Tianhe Yu and Brianna Zitkovich, "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control", arXiv preprint arXiv:2307.15818 , 2023.

Abby O’Neill and Abdul Rehman and Abhiram Maddukuri and Abhishek Gupta and Abhishek Padalkar and Abraham Lee and Acorn Pooley and Agrim Gupta and Ajay Mandlekar and Ajinkya Jain and others, "Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0", 2024 IEEE International Conference on Robotics and Automation (ICRA) , pp. 6892--6903, 2024.

Cheng Chi and Zhenjia Xu and Siyuan Feng and Eric Cousineau and Yilun Du and Benjamin Burchfiel and Russ Tedrake and Shuran Song, "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion", , 2024.

Tony Z Zhao and Vikash Kumar and Sergey Levine and Chelsea Finn, "Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware", arXiv preprint arXiv:2304.13705, 2023.

David Silver and Aja Huang and Chris J Maddison and Arthur Guez and Laurent Sifre and George Van Den Driessche and Julian Schrittwieser and Ioannis Antonoglou and Veda Panneershelvam and Marc Lanctot and others, "Mastering the game of Go with deep neural networks and tree search", nature, vol. 529, no. 7587, pp. 484--489, 2016.

Sergey Levine and Chelsea Finn and Trevor Darrell and Pieter Abbeel, "End-to-end training of deep visuomotor policies", The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334--1373, 2016.

Peter Florence and Lucas Manuelli and Russ Tedrake, "Self-Supervised Correspondence in Visuomotor Policy Learning", IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 492-499, April, 2020. [ link ]

Dian Chen and Brady Zhou and Vladlen Koltun and Philipp Krahenbuhl, "Learning by cheating", Conference on Robot Learning , pp. 66--75, 2020.

Jemin Hwangbo and Joonho Lee and Alexey Dosovitskiy and Dario Bellicoso and Vassilios Tsounis and Vladlen Koltun and Marco Hutter, "Learning agile and dynamic motor skills for legged robots", Science Robotics, vol. 4, no. 26, pp. eaau5872, 2019.

Glen Chou and Russ Tedrake, "Synthesizing Stable Reduced-Order Visuomotor Policies for Nonlinear Systems via Sums-of-Squares Optimization", Proceedings of the IEEE 62nd Annual Conference on Decision and Control (CDC) , 2023. [ link ]

Stéphane Ross and Geoffrey Gordon and Drew Bagnell, "A reduction of imitation learning and structured prediction to no-regret online learning", Proceedings of the fourteenth international conference on artificial intelligence and statistics , pp. 627--635, 2011.

Lili Chen and Kevin Lu and Aravind Rajeswaran and Kimin Lee and Aditya Grover and Misha Laskin and Pieter Abbeel and Aravind Srinivas and Igor Mordatch, "Decision transformer: Reinforcement learning via sequence modeling", Advances in neural information processing systems, vol. 34, pp. 15084--15097, 2021.

Scott Reed and Konrad Zolna and Emilio Parisotto and Sergio Gomez Colmenarejo and Alexander Novikov and Gabriel Barth-Maron and Mai Gimenez and Yury Sulsky and Jackie Kay and Jost Tobias Springenberg and others, "A generalist agent", arXiv preprint arXiv:2205.06175, 2022.

Moo Jin Kim and Karl Pertsch and Siddharth Karamcheti and Ted Xiao and Ashwin Balakrishna and Suraj Nair and Rafael Rafailov and Ethan Foster and Grace Lam and Pannag Sanketi and Quan Vuong and Thomas Kollar and Benjamin Burchfiel and Russ Tedrake and Dorsa Sadigh and Sergey Levine and Percy Liang and Chelsea Finn, "OpenVLA: An Open-Source Vision-Language-Action Model", arXiv preprint arXiv:2406.09246, 2024. [ link ]

Nur Muhammad Shafiullah and Zichen Cui and Ariuntuya (Arty) Altanzaya and Lerrel Pinto, "Behavior Transformers: Cloning k modes with one stone", Advances in Neural Information Processing Systems , vol. 35, pp. 22955--22968, 2022.

Seungjae Lee and Yibin Wang and Haritheja Etukuru and H Jin Kim and Nur Muhammad Mahi Shafiullah and Lerrel Pinto, "Behavior generation with latent actions", arXiv preprint arXiv:2403.03181, 2024.

Pete Florence and Corey Lynch and Andy Zeng and Oscar A Ramirez and Ayzaan Wahid and Laura Downs and Adrian Wong and Johnny Lee and Igor Mordatch and Jonathan Tompson, "Implicit behavioral cloning", Conference on Robot Learning , pp. 158--168, 2022.

Christopher M Bishop, "Mixture density networks", , 1994.

Cheng Chi and Zhenjia Xu and Chuer Pan and Eric Cousineau and Benjamin Burchfiel and Siyuan Feng and Russ Tedrake and Shuran Song, "Universal Manipulation Interface: In-The-Wild Robot Teaching Without In-The-Wild Robots", , 2024.

Preetum Nakkiran and Arwen Bradley and Hattie Zhou and Madhu Advani, "Step-by-Step Diffusion: An Elementary Tutorial", , 2024.

Jiaming Song and Chenlin Meng and Stefano Ermon, "Denoising Diffusion Implicit Models", International Conference on Learning Representations , 2020.

Frank Permenter and Chenyang Yuan, "Interpreting and Improving Diffusion Models Using the Euclidean Distance Function", arXiv preprint arXiv:2306.04848, 2023.

Yilun Xu and Mingyang Deng and Xiang Cheng and Yonglong Tian and Ziming Liu and Tommi Jaakkola, "Restart Sampling for Improving Generative Processes", Advances in Neural Information Processing Systems , vol. 36, pp. 76806--76838, 2023.

Jascha Sohl-Dickstein and Eric Weiss and Niru Maheswaranathan and Surya Ganguli, "Deep unsupervised learning using nonequilibrium thermodynamics", International conference on machine learning , pp. 2256--2265, 2015.

Aapo Hyvarinen, "Estimation of Non-Normalized Statistical Models by Score Matching", Journal of Machine Learning Research, vol. 6, pp. 695–708, 2005.

Pascal Vincent, "A connection between score matching and denoising autoencoders", Neural computation, vol. 23, no. 7, pp. 1661--1674, 2011.

Yang Song and Stefano Ermon, "Generative Modeling by Estimating Gradients of the Data Distribution", Advances in Neural Information Processing Systems , vol. 32, 2019.

Jonathan Ho and Ajay Jain and Pieter Abbeel, "Denoising diffusion probabilistic models", Advances in neural information processing systems, vol. 33, pp. 6840--6851, 2020.

Michael Janner and Yilun Du and Joshua Tenenbaum and Sergey Levine, "Planning with Diffusion for Flexible Behavior Synthesis", International Conference on Machine Learning , pp. 9902--9915, 2022.

Adam Block and Ali Jadbabaie and Daniel Pfrommer and Max Simchowitz and Russ Tedrake, "Provable Guarantees for Generative Behavior Cloning: Bridging Low-Level Stability and High-Level Behavior", Thirty-seventh Conference on Neural Information Processing Systems , 2023.

Yu Zhang and Qiang Yang, "A survey on multi-task learning", IEEE Transactions on Knowledge and Data Engineering, vol. 34, no. 12, pp. 5586--5609, 2021.

Rishi Bommasani and Drew A Hudson and Ehsan Adeli and Russ Altman and Simran Arora and Sydney von Arx and Michael S Bernstein and Jeannette Bohg and Antoine Bosselut and Emma Brunskill and others, "On the opportunities and risks of foundation models", arXiv preprint arXiv:2108.07258, 2021.

Dhruv Shah and Ajay Sridhar and Nitish Dashora and Kyle Stachowicz and Kevin Black and Noriaki Hirose and Sergey Levine, "ViNT: A Foundation Model for Visual Navigation", Conference on Robot Learning , pp. 711--733, 2023.

Abhishek Padalkar and Acorn Pooley and Ajinkya Jain and Alex Bewley and Alex Herzog and Alex Irpan and Alexander Khazatsky and Anant Rai and Anikait Singh and Anthony Brohan and others, "Open x-embodiment: Robotic learning datasets and rt-x models", arXiv preprint arXiv:2310.08864, 2023.

Underactuated Robotics

Imitation Learning

Behavior cloning

Visuomotor policies (aka control from pixels)

Behavior cloning as sequence modeling

Supervised learning in a feedback loop: dealing with distribution shift

Dealing with suboptimal and multimodal demonstrations

Architectures for visuomotor policies

Desiderata

Output/action decoders

(Multi-modal) input encoders

Diffusion Policy

Denoising Diffusion models

Diffusion Policy

Diffusion Policy for Linear Policies

State-feedback

Output-feedback

Action sequence prediction

Inverse reinforcement learning

Vistas

Multitask / foundation models for control

Distributed decentralized learning (aka "fleet learning")

Be rigorous

References