Feed aggregator

Soft robots exhibit complex nonlinear dynamics with a large number of degrees of freedom, making their modelling and control challenging. Typically, reduced-order models in time or space are used to address these challenges, but the resulting simplification limits soft robot control accuracy and restricts their range of motion. In this work, we introduce an end-to-end learning-based approach for fully dynamic modelling of any general robotic system that does not rely on predefined structures, learning dynamic models of the robot directly in the visual space. The generated models possess the same dimensionality as the observation space, so model complexity is determined by the sensory system without explicitly decomposing the problem. To validate the effectiveness of our proposed method, we apply it to a fully soft robotic manipulator, and we demonstrate its applicability to controller development through an open-loop optimization-based controller. We achieve a wide range of dynamic control tasks, including shape control, trajectory tracking, and obstacle avoidance, using a model derived from just 90 min of real-world data. Our work thus provides the most comprehensive strategy to date for controlling a general soft robotic system, without constraints on the shape, properties, or dimensionality of the system.
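The pipeline described above can be sketched, in heavily simplified form, as fitting a dynamics model directly in the observation space and then optimizing an open-loop action sequence against it. The sketch below is illustrative only: a least-squares linear model and a random-shooting planner on synthetic data stand in for the paper's learned visual model and real robot.

```python
import numpy as np

# Illustrative sketch only (not the paper's method): fit a one-step
# dynamics model directly in the observation space, then do open-loop
# control by optimizing an action sequence against the learned model.

def make_dataset(n=200, obs_dim=16, act_dim=2, seed=0):
    # Synthetic stand-in for logged robot data: o_{t+1} = A o_t + B a_t.
    rng = np.random.default_rng(seed)
    A = np.eye(obs_dim) * 0.95
    B = rng.normal(0.0, 0.1, (obs_dim, act_dim))
    obs = rng.normal(size=(n, obs_dim))
    act = rng.normal(size=(n, act_dim))
    nxt = obs @ A.T + act @ B.T
    return obs, act, nxt

def fit_linear_model(obs, act, nxt):
    # Least-squares fit of o_{t+1} ~ [o_t, a_t] W in observation space.
    X = np.hstack([obs, act])
    W, *_ = np.linalg.lstsq(X, nxt, rcond=None)
    return W

def plan_open_loop(W, o0, goal, act_dim, horizon=5, n_samples=256, seed=1):
    # Random shooting: sample action sequences, roll each out through
    # the learned model, keep the one whose final predicted observation
    # lands closest to the goal observation.
    rng = np.random.default_rng(seed)
    candidates = rng.normal(size=(n_samples, horizon, act_dim))
    best_seq, best_cost = None, np.inf
    for seq in candidates:
        o = np.array(o0, dtype=float)
        for a in seq:
            o = np.hstack([o, a]) @ W
        cost = np.linalg.norm(o - goal)
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq, best_cost
```

The appeal of working in observation space is visible even here: the model's input and output dimensions are set entirely by the sensor, not by any hand-chosen reduced-order structure.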

Introduction: Robots present an opportunity to enhance healthcare delivery. Rather than targeting complete automation and nurse replacement, collaborative robots, or “cobots”, might be designed to allow nurses to focus on high-value caregiving. While many institutions are now investing in these platforms, there is little publicly available data on how cobots are being developed, implemented, and evaluated to determine if and how they support nursing practice in the real world.

Methods: This systematic review investigates the current state of cobotic technologies designed to assist nurses in hospital settings, their intended applications, and impacts on nurses and patient care. A comprehensive database search identified 28 relevant peer-reviewed articles published since 2018 which involve real studies with robotic platforms in simulated or actual clinical contexts.

Results: Few cobots were explicitly designed to reduce nursing workload through administrative or logistical assistance. Most included studies were patient-centered rather than nurse-centered, but covered assistance with tasks like medication delivery, vital-sign monitoring, and social interaction. Most applications emerged from India, with limited evidence from the United States despite the commercial availability of nurse-assistive cobots. Robots ranged from proof-of-concept to commercially deployed systems.

Discussion: This review highlights the need for further published studies on cobotic development and evaluation. A larger body of evidence is needed to recognize current limitations and pragmatic opportunities to assist nurses and patients using state-of-the-art robotics. Committed research-practice partnerships and human-centered design are needed to discover the right opportunities for cobotic assistance and to guide the technical development of nurse-centered cobotic solutions.

The Expanded Endoscopic Endonasal Approach, one of the best examples of endoscopic neurosurgery, allows access to the skull base through the natural orifice of the nostril. Current standard instruments lack articulation, limiting operative access and surgeon dexterity, and thus could benefit from robotic articulation. In this study, a handheld robotic system with a series of detachable end-effectors for this approach is presented. The system comprises interchangeable articulated 2/3-degrees-of-freedom 3 mm instruments that expand the operative workspace and enhance the surgeon’s dexterity, an ergonomically designed handheld controller with a rotating joystick-body that can be placed at the position most comfortable for the user, and the accompanying control box. The robotic instruments were experimentally evaluated for their workspace, structural integrity, and force-delivery capabilities. The entire system was then tested in a pre-clinical context during a phantom feasibility test, followed by a cadaveric pilot study with a cohort of surgeons of varied clinical experience. Results from this series of experiments suggested enhanced dexterity and adequate robustness, supporting feasibility in a clinical context as well as improvement over current neurosurgical instruments.

Social technology can improve the quality of older adults’ (OAs’) social lives and mitigate negative mental and physical health outcomes. According to Nowland et al.’s model of the relationship between social Internet use and loneliness, people can engage with technology either to stimulate social interaction (stimulation hypothesis) or to disengage from their real world (disengagement hypothesis). External events, such as the long periods of social isolation during the COVID-19 pandemic, can also affect which of these two patterns people follow. We examined how the COVID-19 pandemic affected the social challenges OAs faced and their expectations for robot technology to solve those challenges. We conducted two participatory design (PD) workshops with OAs, one during and one after the COVID-19 pandemic. During the pandemic, OAs’ primary concern was distanced communication with family members, with a prevalent desire for technology to assist with it. They also wanted to share experiences socially; as such, OAs’ attitudes toward technology during the pandemic could be explained mostly by the stimulation hypothesis. After the pandemic, however, their focus shifted towards their own wellbeing. Social isolation and loneliness were already significant issues for OAs, and these were exacerbated by the COVID-19 pandemic. OAs’ attitudes toward technology after the pandemic could therefore be explained mostly by the disengagement hypothesis. This reflects OAs’ current situation: they have become further digitally excluded by the rapid technological development during the pandemic. Both during and after the pandemic, OAs found it important to have technologies that were easy to use, which would reduce their digital exclusion. After the pandemic, we found this especially in relation to newly developed technologies meant to help people keep in touch at a distance.
To effectively integrate these technologies and avoid excluding large parts of the population, society must address the social challenges faced by OAs.



Video Friday is your weekly selection of awesome robotics videos, collected by your friends at IEEE Spectrum robotics. We also post a weekly calendar of upcoming robotics events for the next few months. Please send us your events for inclusion.

RoboCup 2024: 17–22 July 2024, EINDHOVEN, NETHERLANDS
ICSR 2024: 23–26 October 2024, ODENSE, DENMARK
Cybathlon 2024: 25–27 October 2024, ZURICH

Enjoy today’s videos!

Do you have trouble multitasking? Cyborgize yourself through muscle stimulation to automate repetitive physical tasks while you focus on something else.

[ SplitBody ]

By combining a 5,000 frame-per-second (FPS) event camera with a 20-FPS RGB camera, roboticists from the University of Zurich have developed a much more effective vision system that keeps autonomous cars from crashing into stuff, as described in the current issue of Nature.

[ Nature ]

Mitsubishi Electric has been awarded the GUINNESS WORLD RECORDS title for the fastest robot to solve a puzzle cube. The robot’s time of 0.305 second beat the previous record of 0.38 second, for which it received a GUINNESS WORLD RECORDS certificate on 21 May 2024.

[ Mitsubishi ]

Sony’s AIBO is celebrating its 25th anniversary, which seems like a long time, and it is. But back then, the original AIBO could check your email for you. Email! In 1999!

I miss Hotmail.

[ AIBO ]

SchniPoSa: schnitzel with french fries and a salad.

[ Dino Robotics ]

Cloth-folding is still a really hard problem for robots, but progress was made at ICRA!

[ ICRA Cloth Competition ]

Thanks, Francis!

MIT CSAIL researchers enhance robotic precision with sophisticated tactile sensors in the palm and agile fingers, setting the stage for improvements in human-robot interaction and prosthetic technology.

[ MIT ]

We present a novel adversarial attack method designed to identify failure cases in any type of locomotion controller, including state-of-the-art reinforcement-learning-based controllers. Our approach reveals the vulnerabilities of black-box neural network controllers, providing valuable insights that can be leveraged to enhance robustness through retraining.

[ Fan Shi ]
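As a rough illustration of the idea in that abstract (not the authors' algorithm), a black-box failure search can be as simple as random search over bounded disturbance sequences, scored by how close each rollout gets to failure. The toy rollout below stands in for a physics simulator, and every name in it is hypothetical.

```python
import numpy as np

def rollout(controller, perturb):
    # Toy stand-in for a simulator: balance erodes whenever the
    # controller fails to compensate the injected disturbance;
    # a lower final height means closer to falling.
    height = 1.0
    for p in perturb:
        action = controller(p)
        height -= 0.05 * abs(p - action)
    return height

def attack(controller, steps=50, eps=0.3, iters=200, seed=0):
    # Black-box random search: keep the disturbance sequence (within
    # an eps-ball) that drives the controller closest to failure.
    rng = np.random.default_rng(seed)
    best = np.zeros(steps)
    best_h = rollout(controller, best)
    for _ in range(iters):
        cand = rng.uniform(-eps, eps, steps)
        h = rollout(controller, cand)
        if h < best_h:
            best, best_h = cand, h
    return best, best_h
```

The same loop works for any controller that can be queried but not inspected, which is exactly the black-box setting the abstract describes; the discovered failure cases can then be folded back into training data.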

In this work, we investigate a novel integrated flexible OLED display technology used as a robotic skin interface to improve robot-to-human communication in a real industrial setting at Volkswagen, in a collaborative human-robot motor-assembly task. The interface was implemented in a workcell and validated qualitatively with a small group of operators (n=9) and quantitatively with a large group (n=42). The validation results showed that using flexible OLED technology could improve the operators’ attitude toward the robot; increase their intention to use the robot; enhance their perceived enjoyment, social influence, and trust; and reduce their anxiety.

[ Paper ]

Thanks, Bram!

We introduce InflatableBots, shape-changing inflatable robots for large-scale encountered-type haptics in VR. Unlike traditional inflatable shape displays, which are immobile and limited in interaction areas, our approach combines mobile robots with fan-based inflatable structures. This enables safe, scalable, and deployable haptic interactions on a large scale.

[ InflatableBots ]

We present a bioinspired passive dynamic foot in which the claws are actuated solely by the impact energy. Our gripper simultaneously resolves the issue of smooth absorption of the impact energy and fast closure of the claws by linking the motion of an ankle linkage and the claws through soft tendons.

[ Paper ]

In this video, a 3-UPU exoskeleton robot for a wrist joint is designed and controlled to perform wrist extension, flexion, radial-deviation, and ulnar-deviation motions in stroke-affected patients. This is the first time a 3-UPU robot has been used effectively for any kind of task.

“UPU” stands for “universal-prismatic-universal” and refers to the actuators—the prismatic joints between two universal joints.

[ BAS ]

Thanks, Tony!

BRUCE Got Spot-ted at ICRA2024.

[ Westwood Robotics ]

Parachutes: maybe not as good of an idea for drones as you might think.

[ Wing ]

In this paper, we propose a system for the artist-directed authoring of stylized bipedal walking gaits, tailored for execution on robotic characters. To demonstrate the utility of our approach, we animate gaits for a custom, free-walking robotic character, and show, with two additional in-simulation examples, how our procedural animation technique generalizes to bipeds with different degrees of freedom, proportions, and mass distributions.

[ Disney Research ]

The European drone project Labyrinth aims to keep new and conventional air traffic separate, especially in busy airspaces such as those expected in urban areas. The project provides a new drone-traffic service and illustrates its potential to improve the safety and efficiency of civil land, air, and sea transport, as well as emergency and rescue operations.

[ DLR ]

This Carnegie Mellon University Robotics Institute seminar, by Kim Baraka at Vrije Universiteit Amsterdam, is on the topic “Why We Should Build Robot Apprentices and Why We Shouldn’t Do It Alone.”

For robots to be able to truly integrate human-populated, dynamic, and unpredictable environments, they will have to have strong adaptive capabilities. In this talk, I argue that these adaptive capabilities should leverage interaction with end users, who know how (they want) a robot to act in that environment. I will present an overview of my past and ongoing work on the topic of human-interactive robot learning, a growing interdisciplinary subfield that embraces rich, bidirectional interaction to shape robot learning. I will discuss contributions on the algorithmic, interface, and interaction design fronts, showcasing several collaborations with animal behaviorists/trainers, dancers, puppeteers, and medical practitioners.

[ CMU RI ]




This study proposes a modular platform to improve the adoption of gamification in conventional physical rehabilitation programs. The effectiveness of rehabilitation is correlated with a patient’s adherence to the program. This adherence can be diminished by factors such as motivation, feedback, and isolation. Gamification is a means of adding game-like elements to a traditionally non-game activity, and it has been shown to be effective in providing a more engaging experience and improving adherence. The platform is made of three main parts: a central hardware hub, various wired and wireless sensors, and a software program with a streamlined user interface. The software interface and hardware peripherals were all designed to be simple to use by either a medical specialist or an end-user patient, without the need for technical training. A usability study was performed with a group of university students and a group of medical specialists. Using the System Usability Scale, the system received average scores of 69.25 ± 20.14 and 72.5 ± 17.16 from the students and medical specialists, respectively. We also present a framework that attempts to assist in selecting commercial games that are viable for physical rehabilitation.
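The System Usability Scale averages quoted above follow the standard SUS scoring formula: odd-numbered (positively worded) items contribute their score minus 1, even-numbered (negatively worded) items contribute 5 minus their score, and the sum is scaled by 2.5 to a 0–100 range. A minimal sketch:

```python
def sus_score(responses):
    # responses: ten Likert ratings in questionnaire order,
    # 1 (strongly disagree) .. 5 (strongly agree).
    if len(responses) != 10 or any(not 1 <= r <= 5 for r in responses):
        raise ValueError("expected ten ratings in 1..5")
    odd = sum(r - 1 for r in responses[0::2])   # items 1, 3, 5, 7, 9
    even = sum(5 - r for r in responses[1::2])  # items 2, 4, 6, 8, 10
    return (odd + even) * 2.5                   # scale to 0..100
```

Group means such as the 69.25 and 72.5 reported here are simply the average of `sus_score` over each respondent group.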

Introduction: Flow state, the optimal experience resulting from the equilibrium between perceived challenge and skill level, has been extensively studied in various domains. However, its occurrence in industrial settings has remained relatively unexplored. Notably, the literature predominantly focuses on Flow within mentally demanding tasks, which differ significantly from industrial tasks. Consequently, our understanding of emotional and physiological responses to varying challenge levels, specifically in the context of industry-like tasks, remains limited.

Methods: To bridge this gap, we investigate how facial emotion estimation (valence, arousal) and Heart Rate Variability (HRV) features vary with the perceived challenge levels during industrial assembly tasks. Our study involves an assembly scenario that simulates an industrial human-robot collaboration task with three distinct challenge levels. As part of our study, we collected video, electrocardiogram (ECG), and NASA-TLX questionnaire data from 37 participants.

Results: Our results demonstrate a significant difference in mean arousal and heart rate between the low-challenge (Boredom) condition and the other conditions. We also found a noticeable trend-level difference in mean heart rate between the adaptive (Flow) and high-challenge (Anxiety) conditions. Similar differences were also observed in a few other temporal HRV features like Mean NN and Triangular index. Considering the characteristics of typical industrial assembly tasks, we aim to facilitate Flow by detecting and balancing the perceived challenge levels. Leveraging our analysis results, we developed an HRV-based machine learning model for discerning perceived challenge levels, distinguishing between low and higher-challenge conditions.

Discussion: This work deepens our understanding of emotional and physiological responses to perceived challenge levels in industrial contexts and provides valuable insights for the design of adaptive work environments.
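For reference, the two temporal HRV features named in the results (Mean NN and the triangular index) are straightforward to compute from a series of NN (normal-to-normal) interbeat intervals. The sketch below is a generic implementation, not the study's code; the 1/128-s histogram bin width for the triangular index follows common HRV-analysis convention.

```python
import numpy as np

def mean_nn(nn_ms):
    # Mean NN: average interbeat interval in milliseconds.
    return float(np.mean(nn_ms))

def triangular_index(nn_ms, bin_ms=1000.0 / 128):
    # HRV triangular index: total number of NN intervals divided by the
    # count in the tallest bin of the NN-interval histogram.
    # (Assumes some variability in the series, so the histogram
    # has at least one bin.)
    nn = np.asarray(nn_ms, dtype=float)
    edges = np.arange(nn.min(), nn.max() + bin_ms, bin_ms)
    counts, _ = np.histogram(nn, bins=edges)
    return nn.size / counts.max()
```

Features like these, computed over sliding ECG windows, are the kind of inputs a challenge-level classifier such as the one described above would consume.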

Force is crucial for learning psychomotor skills in laparoscopic tissue manipulation. Fundamentals of Laparoscopic Surgery (FLS), a commonly used part-task training program for basic laparoscopic skills, nevertheless measures only time and position accuracy. FLS is employed in most laparoscopic training systems, including box trainers and virtual reality (VR) simulators. However, many laparoscopic VR simulators lack force feedback and measure tissue damage solely through visual feedback based on virtual collisions, and the few VR simulators that do provide force feedback offer only subjective force metrics. To provide an objective force assessment for haptic skills training in VR simulators, we extend the FLS part tasks to haptic-based FLS (HFLS), focusing on controlled force exertion. We interface the simulated HFLS part tasks with a customized bi-manual haptic simulator that offers five degrees of freedom (DOF) of force feedback. The proposed tasks are evaluated through face and content validity among laparoscopic surgeons of varying experience levels. The results show that trainees perform better in HFLS tasks. The average Likert scores observed for face and content validity are greater than 4.6 ± 0.3 and 4 ± 0.5 for all part tasks, indicating the simulator’s acceptance among subjects for its appearance and functionality. Face and content validation also show the need to improve haptic realism, which is likewise observed in existing simulators. To enhance the accuracy of force rendering, we incorporated a laparoscopic tool force model into the simulation. We studied the effectiveness of the model through a psychophysical study that measures the just noticeable difference (JND) for the laparoscopic gripping task. The study reveals an insignificant decrease in gripping-force JND, suggesting that a simple linear model could be sufficient for gripper force feedback and that a non-linear LapTool force model does not affect force perception in the 0.5–2.5 N range. Further study is required to understand the usability of the force model in laparoscopic training at higher force ranges. Additionally, a construct validity study of HFLS would confirm the applicability of the developed simulator for training surgeons with different levels of experience.
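A JND such as the one measured here is commonly estimated with an adaptive staircase. The sketch below shows a generic 1-up/2-down staircase, which converges near the 70.7%-correct point of the psychometric function; it is an illustration of the technique, not the study's exact protocol, and the simulated observer replaces a human participant.

```python
import numpy as np

def staircase_jnd(detect_prob, start=1.0, step=0.05, trials=400, seed=0):
    # detect_prob(delta): probability the observer notices a force
    # difference of `delta` newtons on a single trial.
    rng = np.random.default_rng(seed)
    delta, streak, last_dir, reversals = start, 0, 0, []
    for _ in range(trials):
        if rng.random() < detect_prob(delta):
            streak += 1
            if streak == 2:                  # two correct -> make it harder
                streak = 0
                if last_dir == +1:
                    reversals.append(delta)  # direction change: record it
                delta, last_dir = max(delta - step, step), -1
        else:                                # one wrong -> make it easier
            streak = 0
            if last_dir == -1:
                reversals.append(delta)
            delta, last_dir = delta + step, +1
    return float(np.mean(reversals[-8:]))    # average the late reversals
```

Running two such staircases, one against each force model, and comparing the resulting JNDs is the shape of the comparison the abstract reports.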



This post was originally published on the author’s personal blog.

Last year’s Conference on Robot Learning (CoRL) was the biggest CoRL yet, with over 900 attendees, 11 workshops, and almost 200 accepted papers. While there were a lot of cool new ideas (see this great set of notes for an overview of technical content), one particular debate seemed to be front and center: Is training a large neural network on a very large dataset a feasible way to solve robotics?

Of course, some version of this question has been on researchers’ minds for a few years now. However, in the aftermath of the unprecedented success of ChatGPT and other large-scale “foundation models” on tasks that were thought to be unsolvable just a few years ago, the question was especially topical at this year’s CoRL. Developing a general-purpose robot, one that can competently and robustly execute a wide variety of tasks of interest in any home or office environment that humans can, has been perhaps the holy grail of robotics since the inception of the field. And given the recent progress of foundation models, it seems possible that scaling existing network architectures by training them on very large datasets might actually be the key to that grail.

Given how timely and significant this debate seems to be, I thought it might be useful to write a post centered around it. My main goal here is to try to present the different sides of the argument as I heard them, without bias towards any side. Almost all the content is taken directly from talks I attended or conversations I had with fellow attendees. My hope is that this serves to deepen people’s understanding around the debate, and maybe even inspire future research ideas and directions.

I want to start by presenting the main arguments I heard in favor of scaling as a solution to robotics.

Why Scaling Might Work
  • It worked for Computer Vision (CV) and Natural Language Processing (NLP), so why not robotics? This was perhaps the most common argument I heard, and the one that seemed to excite most people given recent models like GPT-4V and SAM. The point here is that training a large model on an extremely large corpus of data has recently led to astounding progress on problems thought to be intractable just 3-4 years ago. Moreover, doing this has led to a number of emergent capabilities, where trained models are able to perform well at a number of tasks they weren’t explicitly trained for. Importantly, the fundamental method here of training a large model on a very large amount of data is general and not somehow unique to CV or NLP. Thus, there seems to be no reason why we shouldn’t observe the same incredible performance on robotics tasks.
    • We’re already starting to see some evidence that this might work well: Chelsea Finn, Vincent Vanhoucke, and several others pointed to the recent RT-X and RT-2 papers from Google DeepMind as evidence that training a single model on large amounts of robotics data yields promising generalization capabilities. Russ Tedrake of Toyota Research Institute (TRI) and MIT pointed to the recent Diffusion Policies paper as showing a similar surprising capability. Sergey Levine of UC Berkeley highlighted recent efforts and successes from his group in building and deploying a robot-agnostic foundation model for navigation. All of these works are somewhat preliminary in that they train a relatively small model with a paltry amount of data compared to something like GPT-4V, but they certainly do seem to point to the fact that scaling up these models and datasets could yield impressive results in robotics.
  • Progress in data, compute, and foundation models are waves that we should ride: This argument is closely related to the above one, but distinct enough that I think it deserves to be discussed separately. The main idea here comes from Rich Sutton’s influential essay: The history of AI research has shown that relatively simple algorithms that scale well with data always outperform more complex/clever algorithms that do not. A nice analogy from Karol Hausman’s early career keynote is that improvements to data and compute are like a wave that is bound to happen given the progress and adoption of technology. Whether we like it or not, there will be more data and better compute. As AI researchers, we can either choose to ride this wave, or we can ignore it. Riding this wave means recognizing all the progress that’s happened because of large data and large models, and then developing algorithms, tools, datasets, etc. to take advantage of this progress. It also means leveraging large pre-trained models from vision and language that currently exist or will exist for robotics tasks.
  • Robotics tasks of interest lie on a relatively simple manifold, and training a large model will help us find it: This was something rather interesting that Russ Tedrake pointed out during a debate in the workshop on robustly deploying learning-based solutions. The manifold hypothesis as applied to robotics roughly states that, while the space of possible tasks we could conceive of having a robot do is impossibly large and complex, the tasks that actually occur practically in our world lie on some much lower-dimensional and simpler manifold of this space. By training a single model on large amounts of data, we might be able to discover this manifold. If we believe that such a manifold exists for robotics — which certainly seems intuitive — then this line of thinking would suggest that robotics is not somehow different from CV or NLP in any fundamental way. The same recipe that worked for CV and NLP should be able to discover the manifold for robotics and yield a shockingly competent generalist robot. Even if this doesn’t exactly happen, Tedrake points out that attempting to train a large model for general robotics tasks could teach us important things about the manifold of robotics tasks, and perhaps we can leverage this understanding to solve robotics.
  • Large models are the best approach we have to get at “common sense” capabilities, which pervade all of robotics: Another thing Russ Tedrake pointed out is that “common sense” pervades almost every robotics task of interest. Consider the task of having a mobile manipulation robot place a mug onto a table. Even if we ignore the challenging problems of finding and localizing the mug, there are a surprising number of subtleties to this problem. What if the table is cluttered and the robot has to move other objects out of the way? What if the mug accidentally falls on the floor and the robot has to pick it up again, re-orient it, and place it on the table? And what if the mug has something in it, so it’s important it’s never overturned? These “edge cases” are actually much more common than it might seem, and often are the difference between success and failure for a task. Moreover, these seem to require some sort of ‘common sense’ reasoning to deal with. Several people argued that large models trained on a large amount of data are the best way we know of to yield some aspects of this ‘common sense’ capability. Thus, they might be the best way we know of to solve general robotics tasks.

As you might imagine, there were a number of arguments against scaling as a practical solution to robotics. Interestingly, almost no one directly disputes that this approach could work in theory. Instead, most arguments fall into one of two buckets: (1) arguing that this approach is simply impractical, and (2) arguing that even if it does kind of work, it won’t really “solve” robotics.

Why Scaling Might Not Work

It’s impractical
  • We currently just don’t have much robotics data, and there’s no clear way we’ll get it: This is the elephant in pretty much every large-scale robot learning room. The Internet is chock-full of data for CV and NLP, but not at all for robotics. Recent efforts to collect very large datasets have required tremendous amounts of time, money, and cooperation, yet have yielded a very small fraction of the amount of vision and text data on the Internet. CV and NLP got so much data because they had an incredible “data flywheel”: tens of millions of people connecting to and using the Internet. Unfortunately for robotics, there seems to be no reason why people would upload a bunch of sensory input and corresponding action pairs. Collecting a very large robotics dataset seems quite hard, and given that we know that a lot of important “emergent” properties only showed up in vision and language models at scale, the inability to get a large dataset could render this scaling approach hopeless.
  • Robots have different embodiments: Another challenge with collecting a very large robotics dataset is that robots come in a large variety of different shapes, sizes, and form factors. The output control actions that are sent to a Boston Dynamics Spot robot are very different to those sent to a KUKA iiwa arm. Even if we ignore the problem of finding some kind of common output space for a large trained model, the variety in robot embodiments means we’ll probably have to collect data from each robot type, and that makes the above data-collection problem even harder.
  • There is extremely large variance in the environments we want robots to operate in: For a robot to really be “general purpose,” it must be able to operate in any practical environment a human might want to put it in. This means operating in any possible home, factory, or office building it might find itself in. Collecting a dataset that has even just one example of every possible building seems impractical. Of course, the hope is that we would only need to collect data in a small fraction of these, and the rest will be handled by generalization. However, we don’t know how much data will be required for this generalization capability to kick in, and it very well could also be impractically large.
  • Training a model on such a large robotics dataset might be too expensive/energy-intensive: It’s no secret that training large foundation models is expensive, both in terms of money and in energy consumption. GPT-4V — OpenAI’s biggest foundation model at the time of this writing — reportedly cost over US $100 million and 50 million kWh of electricity to train. This is well beyond the budget and resources that any academic lab can currently spare, so a larger robotics foundation model would need to be trained by a company or a government of some kind. Additionally, depending on how large both the dataset and model itself for such an endeavor are, the costs may balloon by another order of magnitude or more, which might make it completely infeasible.
Even if it works as well as in CV/NLP, it won’t solve robotics
  • The 99.X problem and long tails: Vincent Vanhoucke of Google Robotics started a talk with a provocative assertion: Most — if not all — robot learning approaches cannot be deployed for any practical task. The reason? Real-world industrial and home applications typically require 99.X percent or higher accuracy and reliability. What exactly that means varies by application, but it’s safe to say that robot learning algorithms aren’t there yet. Most results presented in academic papers top out at 80 percent success rate. While that might seem quite close to the 99.X percent threshold, people trying to actually deploy these algorithms have found that it isn’t so: getting higher success rates requires disproportionately more effort as we get closer to 100 percent. That means going from 85 to 90 percent might require just as much — if not more — effort than going from 40 to 80 percent. Vincent asserted in his talk that getting up to 99.X percent is a fundamentally different beast than getting even up to 80 percent, one that might require a whole host of new techniques beyond just scaling.
    • Existing big models don’t get to 99.X percent even in CV and NLP: As impressive and capable as current large models like GPT-4V and DETIC are, even they don’t achieve 99.X percent or higher success rate on previously unseen tasks. Current robotics models are very far from this level of performance, and I think it’s safe to say that the entire robot learning community would be thrilled to have a general model that does as well on robotics tasks as GPT-4V does on NLP tasks. However, even if we had something like this, it wouldn’t be at 99.X percent, and it’s not clear that it’s possible to get there by scaling either.
  • Self-driving car companies have tried this approach, and it doesn’t fully work (yet): This is closely related to the above point, but important and subtle enough that I think it deserves to stand on its own. A number of self-driving car companies — most notably Tesla and Wayve — have tried training such an end-to-end big model on large amounts of data to achieve Level 5 autonomy. Not only do these companies have the engineering resources and money to train such models, but they also have the data. Tesla in particular has a fleet of over 100,000 cars deployed in the real world that it is constantly collecting and then annotating data from. These cars are driven by expert human drivers, making the data well-suited to large-scale supervised learning. And despite all this, Tesla has so far been unable to produce a Level 5 autonomous driving system. That’s not to say their approach doesn’t work at all. It competently handles a large number of situations — especially highway driving — and serves as a useful Level 2 (i.e., driver assist) system. However, it’s far from 99.X percent performance. Moreover, data seems to suggest that Tesla’s approach is faring far worse than Waymo or Cruise, which both use much more modular systems. While it isn’t inconceivable that Tesla’s approach could end up catching up to and surpassing its competitors’ performance in a year or so, the fact that it hasn’t worked yet should perhaps serve as evidence that the 99.X percent problem is hard to overcome for a large-scale ML approach. Moreover, given that self-driving is a special case of general robotics, Tesla’s case should give us reason to doubt the large-scale model approach as a full solution to robotics, especially in the medium term.
  • Many robotics tasks of interest are quite long-horizon: Accomplishing any task requires taking a number of correct actions in sequence. Consider the relatively simple problem of making a cup of tea given an electric kettle, water, a box of tea bags, and a mug. Success requires pouring the water into the kettle, turning it on, then pouring the hot water into the mug, and placing a tea-bag inside it. If we want to solve this with a model trained to output motor torque commands given pixels as input, we’ll need to send torque commands to all 7 motors at around 40 Hz. Let’s suppose that this tea-making task requires 5 minutes. That requires 7 * 40 * 60 * 5 = 84,000 correct torque commands. This is all just for a stationary robot arm; things get much more complicated if the robot is mobile, or has more than one arm. It is well-known that error tends to compound with longer horizons for most tasks. This is one reason why — despite their ability to produce long sequences of text — even LLMs cannot yet produce completely coherent novels or long stories: small deviations from a true prediction over time tend to add up and yield extremely large deviations over long horizons. Given that most, if not all, robotics tasks of interest require sending at least thousands, if not hundreds of thousands, of torques in just the right order, even a fairly well-performing model might really struggle to fully solve these robotics tasks.
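The compounding-error argument above can be made concrete with a back-of-the-envelope calculation. A minimal sketch, assuming the same illustrative numbers as the tea-making example (7 motors, 40 Hz, 5 minutes) plus a hypothetical independent per-command success rate that is not drawn from any real system:

```python
# Back-of-the-envelope model of the long-horizon compounding problem.
# All numbers are illustrative assumptions, not measurements.

motors = 7           # stationary 7-motor arm, as in the tea-making example
hz = 40              # torque commands per second
task_seconds = 5 * 60

total_commands = motors * hz * task_seconds
print(total_commands)  # 84000, matching the arithmetic in the text

# Even a very high per-command success rate compounds badly over 84,000 steps,
# if we (crudely) assume each command succeeds or fails independently:
per_command_success = 0.99999
task_success = per_command_success ** total_commands
print(round(task_success, 2))  # ~0.43
```

Under this admittedly crude independence assumption, even five-nines per-command reliability yields a task that fails more often than it succeeds, which is the essence of the long-horizon argument.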

Okay, now that we’ve sketched out all the main points on both sides of the debate, I want to spend some time diving into a few related points. Many of these are responses to the above points on the ‘against’ side, and some of them are proposals for directions to explore to help overcome the issues raised.

Miscellaneous Related Arguments
We can probably deploy learning-based approaches robustly

One point that gets brought up a lot against learning-based approaches is the lack of theoretical guarantees. At the time of this writing, we know very little about neural network theory: we don’t really know why they learn well, and more importantly, we don’t have any guarantees on what values they will output in different situations. On the other hand, most classical control and planning approaches that are widely used in robotics have various theoretical guarantees built-in. These are generally quite useful when certifying that systems are safe.

However, there seemed to be general consensus amongst a number of CoRL speakers that this point is perhaps given more significance than it should. Sergey Levine pointed out that most of the guarantees from controls aren’t really that useful for a number of real-world tasks we’re interested in. As he put it: “self-driving car companies aren’t worried about controlling the car to drive in a straight line, but rather about a situation in which someone paints a sky onto the back of a truck and drives in front of the car,” thereby confusing the perception system. Moreover, Scott Kuindersma of Boston Dynamics talked about how they’re deploying RL-based controllers on their robots in production, and are able to get the confidence and guarantees they need via rigorous simulation and real-world testing. Overall, I got the sense that while people feel that guarantees are important, and encouraged researchers to keep trying to study them, they don’t think that the lack of guarantees for learning-based systems means that they cannot be deployed robustly.

What if we strive to deploy Human-in-the-Loop systems?

In one of the organized debates, Emo Todorov pointed out that existing successful ML systems, like Codex and ChatGPT, work well only because a human interacts with and sanitizes their output. Consider the case of coding with Codex: it isn’t intended to directly produce runnable, bug-free code, but rather to act as an intelligent autocomplete for programmers, thereby making the overall human-machine team more productive than either alone. In this way, these models don’t have to achieve the 99.X percent performance threshold, because a human can help correct any issues during deployment. As Emo put it: “humans are forgiving, physics is not.”

Chelsea Finn responded to this by largely agreeing with Emo. She strongly agreed that all successfully-deployed and useful ML systems have humans in the loop, and so this is likely the setting that deployed robot learning systems will need to operate in as well. Of course, having a human operate in the loop with a robot isn’t as straightforward as in other domains, since having a human and robot inhabit the same space introduces potential safety hazards. However, it’s a useful setting to think about, especially if it can help address issues brought on by the 99.X percent problem.

Maybe we don’t need to collect that much real world data for scaling

A number of people at the conference were thinking about creative ways to overcome the real-world data bottleneck without actually collecting more real world data. Quite a few of these people argued that fast, realistic simulators could be vital here, and there were a number of works that explored creative ways to train robot policies in simulation and then transfer them to the real world. Another set of people argued that we can leverage existing vision, language, and video data and then just ‘sprinkle in’ some robotics data. Google’s recent RT-2 model showed how taking a large model trained on internet scale vision and language data, and then just fine-tuning it on a much smaller set of robotics data can produce impressive performance on robotics tasks. Perhaps through a combination of simulation and pretraining on general vision and language data, we won’t actually have to collect too much real-world robotics data to get scaling to work well for robotics tasks.

Maybe combining classical and learning-based approaches can give us the best of both worlds

As with any debate, there were quite a few people advocating the middle path. Scott Kuindersma of Boston Dynamics titled one of his talks “Let’s all just be friends: model-based control helps learning (and vice versa)”. Throughout his talk and the subsequent debates, he expressed his strong belief that, in the short to medium term, the best path towards reliable real-world systems involves combining learning with classical approaches. In her keynote speech for the conference, Andrea Thomaz talked about how such a hybrid system — using learning for perception and a few skills, and classical SLAM and path-planning for the rest — is what powers a real-world robot that’s deployed in tens of hospital systems in Texas (and growing!). Several papers explored how classical controls and planning, together with learning-based approaches, can enable much more capability than either system on its own. Overall, most people seemed to argue that this ‘middle path’ is extremely promising, especially in the short to medium term, but perhaps in the long-term either pure learning or an entirely different set of approaches might be best.

What Can/Should We Take Away From All This?

If you’ve read this far, chances are that you’re interested in some set of takeaways/conclusions. Perhaps you’re thinking “this is all very interesting, but what does all this mean for what we as a community should do? What research problems should I try to tackle?” Fortunately for you, there were a number of interesting suggestions on this that seemed to have some consensus.

We should pursue the direction of trying to just scale up learning with very large datasets

Despite the various arguments against scaling solving robotics outright, most people seem to agree that scaling in robot learning is a promising direction to be investigated. Even if it doesn’t fully solve robotics, it could lead to a significant amount of progress on a number of hard problems we’ve been stuck on for a while. Additionally, as Russ Tedrake pointed out, pursuing this direction carefully could yield useful insights about the general robotics problem, as well as current learning algorithms and why they work so well.

We should also pursue other existing directions

Even the most vocal proponents of the scaling approach were clear that they don’t think everyone should be working on this. It’s likely a bad idea for the entire robot learning community to put all its eggs in one basket, especially given all the reasons to believe scaling won’t fully solve robotics. Classical robotics techniques have gotten us quite far, and led to many successful and reliable deployments: pushing forward on them or integrating them with learning techniques might be the right way forward, especially in the short to medium term.

We should focus more on real-world mobile manipulation and easy-to-use systems

Vincent Vanhoucke made an observation that most papers at CoRL this year were limited to tabletop manipulation settings. While there are plenty of hard tabletop problems, things generally get a lot more complicated when the robot — and consequently its camera view — moves. Vincent speculated that it’s easy for the community to fall into a local minimum where we make a lot of progress that’s specific to the tabletop setting and therefore not generalizable. A similar thing could happen if we work predominantly in simulation. Avoiding these local minima by working on real-world mobile manipulation seems like a good idea.

Separately, Sergey Levine observed that a big reason why LLMs have seen so much excitement and adoption is that they’re extremely easy to use, especially by non-experts. One doesn’t have to know about the details of training an LLM, or perform any tough setup, to prompt and use these models for their own tasks. Most robot learning approaches are currently far from this. They often require significant knowledge of their inner workings to use, and involve very significant amounts of setup. Perhaps thinking more about how to make robot learning systems easier to use and widely applicable could help improve adoption and potentially scalability of these approaches.

We should be more forthright about things that don’t work

There seemed to be a broadly-held complaint that many robot learning approaches don’t adequately report negative results, and this leads to a lot of unnecessary repeated effort. Additionally, perhaps patterns might emerge from consistent failures of things that we expect to work but don’t actually work well, and this could yield novel insight into learning algorithms. There is currently no good incentive for researchers to report such negative results in papers, but most people seemed to be in favor of designing one.

We should try to do something totally new

There were a few people who pointed out that all current approaches — be they learning-based or classical — are unsatisfying in a number of ways. There seem to be a number of drawbacks with each of them, and it’s very conceivable that there is a completely different set of approaches that ultimately solves robotics. Given this, it seems useful to try to think outside the box. After all, every one of the current approaches that’s part of the debate was only made possible because the few researchers who introduced them dared to think against the popular grain of their times.

Acknowledgements: Huge thanks to Tom Silver and Leslie Kaelbling for providing helpful comments, suggestions, and encouragement on a previous draft of this post.


1 In fact, this was the topic of a popular debate hosted at a workshop on the first day; many of the points in this post were inspired by the conversation during that debate.



This post was originally published on the author’s personal blog.

Last year’s Conference on Robot Learning (CoRL) was the biggest CoRL yet, with over 900 attendees, 11 workshops, and almost 200 accepted papers. While there were a lot of cool new ideas (see this great set of notes for an overview of technical content), one particular debate seemed to be front and center: Is training a large neural network on a very large dataset a feasible way to solve robotics?1

Of course, some version of this question has been on researchers’ minds for a few years now. However, in the aftermath of the unprecedented success of ChatGPT and other large-scale “foundation models” on tasks that were thought to be unsolvable just a few years ago, the question was especially topical at this year’s CoRL. Developing a general-purpose robot, one that can competently and robustly execute a wide variety of tasks of interest in any home or office environment that humans can, has been perhaps the holy grail of robotics since the inception of the field. And given the recent progress of foundation models, it seems possible that scaling existing network architectures by training them on very large datasets might actually be the key to that grail.

Given how timely and significant this debate seems to be, I thought it might be useful to write a post centered around it. My main goal here is to try to present the different sides of the argument as I heard them, without bias towards any side. Almost all the content is taken directly from talks I attended or conversations I had with fellow attendees. My hope is that this serves to deepen people’s understanding around the debate, and maybe even inspire future research ideas and directions.

I want to start by presenting the main arguments I heard in favor of scaling as a solution to robotics.

Why Scaling Might Work
  • It worked for Computer Vision (CV) and Natural Language Processing (NLP), so why not robotics? This was perhaps the most common argument I heard, and the one that seemed to excite most people given recent models like GPT-4V and SAM. The point here is that training a large model on an extremely large corpus of data has recently led to astounding progress on problems thought to be intractable just 3-4 years ago. Moreover, doing this has led to a number of emergent capabilities, where trained models are able to perform well at a number of tasks they weren’t explicitly trained for. Importantly, the fundamental method here of training a large model on a very large amount of data is general and not somehow unique to CV or NLP. Thus, there seems to be no reason why we shouldn’t observe the same incredible performance on robotics tasks.
    • We’re already starting to see some evidence that this might work well: Chelsea Finn, Vincent Vanhoucke, and several others pointed to the recent RT-X and RT-2 papers from Google DeepMind as evidence that training a single model on large amounts of robotics data yields promising generalization capabilities. Russ Tedrake of Toyota Research Institute (TRI) and MIT pointed to the recent Diffusion Policies paper as showing a similar surprising capability. Sergey Levine of UC Berkeley highlighted recent efforts and successes from his group in building and deploying a robot-agnostic foundation model for navigation. All of these works are somewhat preliminary in that they train a relatively small model with a paltry amount of data compared to something like GPT-4V, but they certainly do seem to point to the fact that scaling up these models and datasets could yield impressive results in robotics.
  • Progress in data, compute, and foundation models are waves that we should ride: This argument is closely related to the above one, but distinct enough that I think it deserves to be discussed separately. The main idea here comes from Rich Sutton’s influential essay: The history of AI research has shown that relatively simple algorithms that scale well with data always outperform more complex/clever algorithms that do not. A nice analogy from Karol Hausman’s early career keynote is that improvements to data and compute are like a wave that is bound to happen given the progress and adoption of technology. Whether we like it or not, there will be more data and better compute. As AI researchers, we can either choose to ride this wave, or we can ignore it. Riding this wave means recognizing all the progress that’s happened because of large data and large models, and then developing algorithms, tools, datasets, etc. to take advantage of this progress. It also means leveraging large pre-trained models from vision and language that currently exist or will exist for robotics tasks.
  • Robotics tasks of interest lie on a relatively simple manifold, and training a large model will help us find it: This was something rather interesting that Russ Tedrake pointed out during a debate in the workshop on robustly deploying learning-based solutions. The manifold hypothesis as applied to robotics roughly states that, while the space of possible tasks we could conceive of having a robot do is impossibly large and complex, the tasks that actually occur practically in our world lie on some much lower-dimensional and simpler manifold of this space. By training a single model on large amounts of data, we might be able to discover this manifold. If we believe that such a manifold exists for robotics — which certainly seems intuitive — then this line of thinking would suggest that robotics is not somehow different from CV or NLP in any fundamental way. The same recipe that worked for CV and NLP should be able to discover the manifold for robotics and yield a shockingly competent generalist robot. Even if this doesn’t exactly happen, Tedrake points out that attempting to train a large model for general robotics tasks could teach us important things about the manifold of robotics tasks, and perhaps we can leverage this understanding to solve robotics.
  • Large models are the best approach we have to get at “common sense” capabilities, which pervade all of robotics: Another thing Russ Tedrake pointed out is that “common sense” pervades almost every robotics task of interest. Consider the task of having a mobile manipulation robot place a mug onto a table. Even if we ignore the challenging problems of finding and localizing the mug, there are a surprising number of subtleties to this problem. What if the table is cluttered and the robot has to move other objects out of the way? What if the mug accidentally falls on the floor and the robot has to pick it up again, re-orient it, and place it on the table? And what if the mug has something in it, so it’s important it’s never overturned? These “edge cases” are actually much more common that it might seem, and often are the difference between success and failure for a task. Moreover, these seem to require some sort of ‘common sense’ reasoning to deal with. Several people argued that large models trained on a large amount of data are the best way we know of to yield some aspects of this ‘common sense’ capability. Thus, they might be the best way we know of to solve general robotics tasks.

As you might imagine, there were a number of arguments against scaling as a practical solution to robotics. Interestingly, almost no one directly disputes that this approach could work in theory. Instead, most arguments fall into one of two buckets: (1) arguing that this approach is simply impractical, and (2) arguing that even if it does kind of work, it won’t really “solve” robotics.

Why Scaling Might Not WorkIt’s impractical
  • We currently just don’t have much robotics data, and there’s no clear way we’ll get it: This is the elephant in pretty much every large-scale robot learning room. The Internet is chock-full of data for CV and NLP, but not at all for robotics. Recent efforts to collect very large datasets have required tremendous amounts of time, money, and cooperation, yet have yielded a very small fraction of the amount of vision and text data on the Internet. CV and NLP got so much data because they had an incredible “data flywheel”: tens of millions of people connecting to and using the Internet. Unfortunately for robotics, there seems to be no reason why people would upload a bunch of sensory input and corresponding action pairs. Collecting a very large robotics dataset seems quite hard, and given that we know that a lot of important “emergent” properties only showed up in vision and language models at scale, the inability to get a large dataset could render this scaling approach hopeless.
  • Robots have different embodiments: Another challenge with collecting a very large robotics dataset is that robots come in a large variety of different shapes, sizes, and form factors. The output control actions that are sent to a Boston Dynamics Spot robot are very different to those sent to a KUKA iiwa arm. Even if we ignore the problem of finding some kind of common output space for a large trained model, the variety in robot embodiments means we’ll probably have to collect data from each robot type, and that makes the above data-collection problem even harder.
  • There is extremely large variance in the environments we want robots to operate in: For a robot to really be “general purpose,” it must be able to operate in any practical environment a human might want to put it in. This means operating in any possible home, factory, or office building it might find itself in. Collecting a dataset that has even just one example of every possible building seems impractical. Of course, the hope is that we would only need to collect data in a small fraction of these, and the rest will be handled by generalization. However, we don’t know how much data will be required for this generalization capability to kick in, and it very well could also be impractically large.
  • Training a model on such a large robotics dataset might be too expensive/energy-intensive: It’s no secret that training large foundation models is expensive, both in terms of money and in energy consumption. GPT-4V — OpenAI’s biggest foundation model at the time of this writing — reportedly cost over US $100 million and 50 million KWh of electricity to train. This is well beyond the budget and resources that any academic lab can currently spare, so a larger robotics foundation model would need to be trained by a company or a government of some kind. Additionally, depending on how large both the dataset and model itself for such an endeavor are, the costs may balloon by another order-of-magnitude or more, which might make it completely infeasible.
Even if it works as well as in CV/NLP, it won’t solve robotics
  • The 99.X problem and long tails: Vincent Vanhoucke of Google Robotics started a talk with a provocative assertion: Most — if not all — robot learning approaches cannot be deployed for any practical task. The reason? Real-world industrial and home applications typically require 99.X percent or higher accuracy and reliability. What exactly that means varies by application, but it’s safe to say that robot learning algorithms aren’t there yet. Most results presented in academic papers top out at 80 percent success rate. While that might seem quite close to the 99.X percent threshold, people trying to actually deploy these algorithms have found that it isn’t so: getting higher success rates requires asymptotically more effort as we get closer to 100 percent. That means going from 85 to 90 percent might require just as much — if not more — effort than going from 40 to 80 percent. Vincent asserted in his talk that getting up to 99.X percent is a fundamentally different beast than getting even up to 80 percent, one that might require a whole host of new techniques beyond just scaling.
    • Existing big models don’t get to 99.X percent even in CV and NLP: As impressive and capable as current large models like GPT-4V and DETIC are, even they don’t achieve 99.X percent or higher success rate on previously-unseen tasks. Current robotics models are very far from this level of performance, and I think it’s safe to say that the entire robot learning community would be thrilled to have a general model that does as well on robotics tasks as GPT-4V does on NLP tasks. However, even if we had something like this, it wouldn’t be at 99.X percent, and it’s not clear that it’s possible to get there by scaling either.
  • Self-driving car companies have tried this approach, and it doesn’t fully work (yet): This is closely related to the above point, but important and subtle enough that I think it deserves to stand on its own. A number of self-driving car companies — most notably Tesla and Wayve — have tried training such an end-to-end big model on large amounts of data to achieve Level 5 autonomy. Not only do these companies have the engineering resources and money to train such models, but they also have the data. Tesla in particular has a fleet of over 100,000 cars deployed in the real world that it is constantly collecting and then annotating data from. These cars are being teleoperated by experts, making the data ideal for large-scale supervised learning. And despite all this, Tesla has so far been unable to produce a Level 5 autonomous driving system. That’s not to say their approach doesn’t work at all. It competently handles a large number of situations — especially highway driving — and serves as a useful Level 2 (i.e., driver assist) system. However, it’s far from 99.X percent performance. Moreover, data seems to suggest that Tesla’s approach is faring far worse than Waymo or Cruise, which both use much more modular systems. While it isn’t inconceivable that Tesla’s approach could end up catching up and surpassing its competitors performance in a year or so, the fact that it hasn’t worked yet should serve as evidence perhaps that the 99.X percent problem is hard to overcome for a large-scale ML approach. Moreover, given that self-driving is a special case of general robotics, Tesla’s case should give us reason to doubt the large-scale model approach as a full solution to robotics, especially in the medium term.
  • Many robotics tasks of interest are quite long-horizon: Accomplishing any task requires taking a number of correct actions in sequence. Consider the relatively simple problem of making a cup of tea given an electric kettle, water, a box of tea bags, and a mug. Success requires pouring the water into the kettle, turning it on, then pouring the hot water into the mug, and placing a tea-bag inside it. If we want to solve this with a model trained to output motor torque commands given pixels as input, we’ll need to send torque commands to all 7 motors at around 40 Hz. Let’s suppose that this tea-making task requires 5 minutes. That requires 7 * 40 * 60 * 5 = 84,000 correct torque commands. This is all just for a stationary robot arm; things get much more complicated if the robot is mobile, or has more than one arm. It is well-known that error tends to compound with longer-horizons for most tasks. This is one reason why — despite their ability to produce long sequences of text — even LLMs cannot yet produce completely coherent novels or long stories: small deviations from a true prediction over time tend to add up and yield extremely large deviations over long-horizons. Given that most, if not all robotics tasks of interest require sending at least thousands, if not hundreds of thousands, of torques in just the right order, even a fairly well-performing model might really struggle to fully solve these robotics tasks.

Okay, now that we’ve sketched out all the main points on both sides of the debate, I want to spend some time diving into a few related points. Many of these are responses to the above points on the ‘against’ side, and some of them are proposals for directions to explore to help overcome the issues raised.

Miscellaneous Related Arguments

We can probably deploy learning-based approaches robustly

One point that gets brought up a lot against learning-based approaches is the lack of theoretical guarantees. At the time of this writing, we know very little about neural network theory: we don’t really know why they learn well, and more importantly, we don’t have any guarantees on what values they will output in different situations. On the other hand, most classical control and planning approaches that are widely used in robotics have various theoretical guarantees built-in. These are generally quite useful when certifying that systems are safe.

However, there seemed to be general consensus amongst a number of CoRL speakers that this point is perhaps given more significance than it deserves. Sergey Levine pointed out that most of the guarantees from controls aren’t really that useful for a number of real-world tasks we’re interested in. As he put it: “self-driving car companies aren’t worried about controlling the car to drive in a straight line, but rather about a situation in which someone paints a sky onto the back of a truck and drives in front of the car,” thereby confusing the perception system. Moreover, Scott Kuindersma of Boston Dynamics talked about how they’re deploying RL-based controllers on their robots in production, and are able to get the confidence and guarantees they need via rigorous simulation and real-world testing. Overall, I got the sense that while people feel that guarantees are important, and encouraged researchers to keep trying to study them, they don’t think that the lack of guarantees for learning-based systems means that these systems cannot be deployed robustly.

What if we strive to deploy Human-in-the-Loop systems?

In one of the organized debates, Emo Todorov pointed out that existing successful ML systems, like Codex and ChatGPT, work well only because a human interacts with and sanitizes their output. Consider the case of coding with Codex: it isn’t intended to directly produce runnable, bug-free code, but rather to act as an intelligent autocomplete for programmers, thereby making the overall human-machine team more productive than either alone. In this way, these models don’t have to achieve the 99.X percent performance threshold, because a human can help correct any issues during deployment. As Emo put it: “humans are forgiving, physics is not.”

Chelsea Finn responded by largely agreeing with Emo: all successfully deployed and useful ML systems have humans in the loop, and so this is likely the setting that deployed robot learning systems will need to operate in as well. Of course, having a human operate in the loop with a robot isn’t as straightforward as in other domains, since having a human and robot inhabit the same space introduces potential safety hazards. However, it’s a useful setting to think about, especially if it can help address issues brought on by the 99.X percent problem.

Maybe we don’t need to collect that much real world data for scaling

A number of people at the conference were thinking about creative ways to overcome the real-world data bottleneck without actually collecting more real-world data. Quite a few of these people argued that fast, realistic simulators could be vital here, and there were a number of works that explored creative ways to train robot policies in simulation and then transfer them to the real world. Another set of people argued that we can leverage existing vision, language, and video data and then just ‘sprinkle in’ some robotics data. Google’s recent RT-2 model showed how taking a large model trained on internet-scale vision and language data, and then just fine-tuning it on a much smaller set of robotics data, can produce impressive performance on robotics tasks. Perhaps through a combination of simulation and pretraining on general vision and language data, we won’t actually have to collect too much real-world robotics data to get scaling to work well for robotics tasks.
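As a toy illustration of this “pretrain large, fine-tune on a little robotics data” recipe, the sketch below freezes a stand-in backbone (a fixed random projection, used here purely as an illustrative assumption, not the actual RT-2 architecture) and fits only a small linear head on a small synthetic dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a backbone pretrained on internet-scale data: a fixed,
# frozen random feature map (illustrative only).
W_backbone = rng.normal(size=(64, 8))

def backbone_features(x: np.ndarray) -> np.ndarray:
    """Frozen features; never updated during fine-tuning."""
    return np.tanh(x @ W_backbone)

# A small synthetic "robotics" dataset: observations with scalar action
# targets that are, by construction, expressible on top of the features.
X = rng.normal(size=(128, 64))
y = backbone_features(X) @ rng.normal(size=8)

# Fine-tune only the 8-parameter linear head with gradient descent on MSE;
# the backbone is left untouched, so its features can be precomputed.
feats = backbone_features(X)
w_head = np.zeros(8)
for _ in range(2000):
    grad = feats.T @ (feats @ w_head - y) / len(X)
    w_head -= 0.1 * grad

mse = float(np.mean((feats @ w_head - y) ** 2))  # ends up near zero
```

Only the eight head parameters are ever updated; the point of the sketch is that a small task-specific dataset can be enough when the heavy lifting is done by a pretrained representation.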

Maybe combining classical and learning-based approaches can give us the best of both worlds

As with any debate, there were quite a few people advocating the middle path. Scott Kuindersma of Boston Dynamics titled one of his talks “Let’s all just be friends: model-based control helps learning (and vice versa)”. Throughout his talk, and the subsequent debates, he expressed his strong belief that in the short to medium term, the best path towards reliable real-world systems involves combining learning with classical approaches. In her keynote speech for the conference, Andrea Thomaz talked about how such a hybrid system — using learning for perception and a few skills, and classical SLAM and path-planning for the rest — is what powers a real-world robot that’s deployed in tens of hospital systems in Texas (and growing!). Several papers explored how classical controls and planning, together with learning-based approaches, can enable much more capability than any system on its own. Overall, most people seemed to argue that this ‘middle path’ is extremely promising, especially in the short to medium term, but perhaps in the long term either pure learning or an entirely different set of approaches might be best.

What Can/Should We Take Away From All This?

If you’ve read this far, chances are that you’re interested in some set of takeaways/conclusions. Perhaps you’re thinking “this is all very interesting, but what does all this mean for what we as a community should do? What research problems should I try to tackle?” Fortunately for you, there were a number of interesting suggestions on this that seemed to have some consensus.

We should pursue the direction of trying to just scale up learning with very large datasets

Despite the various arguments against scaling solving robotics outright, most people seem to agree that scaling in robot learning is a promising direction to be investigated. Even if it doesn’t fully solve robotics, it could lead to a significant amount of progress on a number of hard problems we’ve been stuck on for a while. Additionally, as Russ Tedrake pointed out, pursuing this direction carefully could yield useful insights about the general robotics problem, as well as current learning algorithms and why they work so well.

We should also pursue other existing directions

Even the most vocal proponents of the scaling approach were clear that they don’t think everyone should be working on this. It’s likely a bad idea for the entire robot learning community to put its eggs in the same basket, especially given all the reasons to believe scaling won’t fully solve robotics. Classical robotics techniques have gotten us quite far and led to many successful and reliable deployments: pushing forward on them, or integrating them with learning techniques, might be the right way forward, especially in the short to medium term.

We should focus more on real-world mobile manipulation and easy-to-use systems

Vincent Vanhoucke made an observation that most papers at CoRL this year were limited to tabletop manipulation settings. While there are plenty of hard tabletop problems, things generally get a lot more complicated when the robot — and consequently its camera view — moves. Vincent speculated that it’s easy for the community to fall into a local minimum where we make a lot of progress that’s specific to the tabletop setting and therefore not generalizable. A similar thing could happen if we work predominantly in simulation. Avoiding these local minima by working on real-world mobile manipulation seems like a good idea.

Separately, Sergey Levine observed that a big reason why LLMs have seen so much excitement and adoption is because they’re extremely easy to use, especially by non-experts. One doesn’t have to know the details of training an LLM, or perform any tough setup, to prompt and use these models for their own tasks. Most robot learning approaches are currently far from this. They often require significant knowledge of their inner workings to use, and involve very significant amounts of setup. Perhaps thinking more about how to make robot learning systems easier to use and widely applicable could help improve adoption and potentially scalability of these approaches.

We should be more forthright about things that don’t work

There seemed to be a broadly-held complaint that many robot learning approaches don’t adequately report negative results, and this leads to a lot of unnecessary repeated effort. Additionally, perhaps patterns might emerge from consistent failures of things that we expect to work but don’t actually work well, and this could yield novel insight into learning algorithms. There is currently no good incentive for researchers to report such negative results in papers, but most people seemed to be in favor of designing one.

We should try to do something totally new

There were a few people who pointed out that all current approaches — be they learning-based or classical — are unsatisfying in a number of ways. There seem to be a number of drawbacks with each of them, and it’s very conceivable that there is a completely different set of approaches that ultimately solves robotics. Given this, it seems useful to try to think outside the box. After all, every one of the current approaches that’s part of the debate was only made possible because the few researchers who introduced them dared to think against the popular grain of their times.

Acknowledgements: Huge thanks to Tom Silver and Leslie Kaelbling for providing helpful comments, suggestions, and encouragement on a previous draft of this post.


1 In fact, this was the topic of a popular debate hosted at a workshop on the first day; many of the points in this post were inspired by the conversation during that debate.

Although robots are being applied in various situations in modern society, some people avoid them or do not feel comfortable interacting with them. Designs that allow robots to interact appropriately with people will make a positive impression on them, resulting in a better evaluation of robots and helping to solve this problem. To establish such a design, this study conducted two scenario-based experiments focusing on the politeness of the robot’s conversation and behavior, and examined the impressions formed when the robot succeeds or slightly fails at a task. These two experiments revealed that regardless of whether the partner is a robot or a human, politeness affected not only the impression of the interaction but also the expectation of better task results on the next occasion. Although the effect of politeness on preference toward robot agents was smaller than toward human agents when the agents failed a task, people were more likely to interact with polite robot and human agents again because they thought that they would not fail the next time. This study revealed that politeness motivates people to interact with robots repeatedly even if they make minor mistakes, suggesting that politeness-oriented design is important for encouraging human-robot interaction.

Soft grippers are garnering increasing attention for their adeptness in conforming to diverse objects, particularly delicate items, without warranting precise force control. This attribute proves especially beneficial in unstructured environments and dynamic tasks such as food handling. Human hands, owing to their elevated dexterity and precise motor control, exhibit the ability to delicately manipulate complex food items, such as small or fragile objects, by dynamically adjusting their grasping configurations. Furthermore, with their rich sensory receptors and hand-eye coordination that provide valuable information involving the texture and form factor, real-time adjustments to avoid damage or spill during food handling appear seamless. Despite numerous endeavors to replicate these capabilities through robotic solutions involving soft grippers, matching human performance remains a formidable engineering challenge. Robotic competitions serve as an invaluable platform for pushing the boundaries of manipulation capabilities, simultaneously offering insights into the adoption of these solutions across diverse domains, including food handling. Serving as a proxy for the future transition of robotic solutions from the laboratory to the market, these competitions simulate real-world challenges. Since 2021, our research group has actively participated in RoboSoft competitions, securing victories in the Manipulation track in 2022 and 2023. Our success was propelled by the utilization of a modified iteration of our Retractable Nails Soft Gripper (RNSG), tailored to meet the specific requirements of each task. The integration of sensors and collaborative manipulators further enhanced the gripper’s performance, facilitating the seamless execution of complex grasping tasks associated with food handling. This article encapsulates the experiential insights gained during the application of our highly versatile soft gripper in these competition environments.

Companion robots are aimed to mitigate loneliness and social isolation among older adults by providing social and emotional support in their everyday lives. However, older adults’ expectations of conversational companionship might substantially differ from what current technologies can achieve, as well as from other age groups like young adults. Thus, it is crucial to involve older adults in the development of conversational companion robots to ensure that these devices align with their unique expectations and experiences. The recent advancement in foundation models, such as large language models, has taken a significant stride toward fulfilling those expectations, in contrast to the prior literature that relied on humans controlling robots (i.e., Wizard of Oz) or limited rule-based architectures that are not feasible to apply in the daily lives of older adults. Consequently, we conducted a participatory design (co-design) study with 28 older adults, demonstrating a companion robot using a large language model (LLM), and design scenarios that represent situations from everyday life. The thematic analysis of the discussions around these scenarios shows that older adults expect a conversational companion robot to engage in conversation actively in isolation and passively in social settings, remember previous conversations and personalize, protect privacy and provide control over learned data, give information and daily reminders, foster social skills and connections, and express empathy and emotions. Based on these findings, this article provides actionable recommendations for designing conversational companion robots for older adults with foundation models, such as LLMs and vision-language models, which can also be applied to conversational robots in other domains.



Video Friday is your weekly selection of awesome robotics videos, collected by your friends at IEEE Spectrum robotics. We also post a weekly calendar of upcoming robotics events for the next few months. Please send us your events for inclusion.

RoboCup 2024: 17–22 July 2024, EINDHOVEN, NETHERLANDS
ICSR 2024: 23–26 October 2024, ODENSE, DENMARK
Cybathlon 2024: 25–27 October 2024, ZURICH

Enjoy today’s videos!

NAVER 1784 is the world’s largest robotics testbed. The Starbucks on the second floor of 1784 is the world’s most unique Starbucks, with more than 100 service robots called “Rookie” delivering Starbucks drinks to meeting rooms and private seats, and various experiments with a dual-arm robot.

[ Naver ]

If you’re gonna take a robot dog with you on a hike, the least it could do is carry your backpack for you.

[ Deep Robotics ]

Obligatory reminder that phrases like “no teleoperation” without any additional context can mean many different things.

[ Astribot ]

This video is presented at the ICRA 2024 conference and summarizes recent results of our Learning AI for Dextrous Manipulation Lab. It demonstrates how our learning AI methods allowed for breakthroughs in dextrous manipulation with the mobile humanoid robot DLR Agile Justin. Although the core of the mechatronic hardware is almost 20 years old, only the advent of learning AI methods enabled a level of dexterity, flexibility and autonomy coming close to human capabilities.

[ TUM ]

Thanks Berthold!

Hands of blue? Not a good look.

[ Synaptic ]

With all the humanoid stuff going on, there really should be more emphasis on intentional contact—humans lean and balance on things all the time, and robots should too!

[ Inria ]

LimX Dynamics W1 is now more than a wheeled quadruped. By evolving into a biped robot, W1 maneuvers slickly on two legs in different ways: non-stop 360° rotation, upright free gliding, slick maneuvering, random collision and self-recovery, and step walking.

[ LimX Dynamics ]

Animal brains use less data and energy compared to current deep neural networks running on Graphics Processing Units (GPUs). This makes it hard to develop tiny autonomous drones, which are too small and light for heavy hardware and big batteries. Recently, the emergence of neuromorphic processors that mimic how brains function has made it possible for researchers from Delft University of Technology to develop a drone that uses neuromorphic vision and control for autonomous flight.

[ Science ]

In the beginning of the universe, all was darkness — until the first organisms developed sight, which ushered in an explosion of life, learning and progress. AI pioneer Fei-Fei Li says a similar moment is about to happen for computers and robots. She shows how machines are gaining “spatial intelligence” — the ability to process visual data, make predictions and act upon those predictions — and shares how this could enable AI to interact with humans in the real world.

[ TED ]



Navigation of mobile agents in unknown, unmapped environments is a critical task for achieving general autonomy. Recent advancements in combining Reinforcement Learning with Deep Neural Networks have shown promising results in addressing this challenge. However, the inherent complexity of these approaches, characterized by multi-layer networks and intricate reward objectives, limits their autonomy, increases memory footprint, and complicates adaptation to energy-efficient edge hardware. To overcome these challenges, we propose a brain-inspired method that employs a shallow architecture trained by a local learning rule for self-supervised navigation in uncharted environments. Our approach achieves performance comparable to a state-of-the-art Deep Q Network (DQN) method with respect to goal-reaching accuracy and path length, with a similar (slightly lower) number of parameters, operations, and training iterations. Notably, our self-supervised approach combines novelty-based and random walks to alleviate the need for objective reward definition and enhance agent autonomy. At the same time, the shallow architecture and local learning rule do not call for error backpropagation, decreasing the memory overhead and enabling implementation on edge neuromorphic processors. These results contribute to the potential of embodied neuromorphic agents utilizing minimal resources while effectively handling variability.

We recently developed a biomimetic robotic eye with six independent tendons, each controlled by their own rotatory motor, and with insertions on the eye ball that faithfully mimic the biomechanics of the human eye. We constructed an accurate physical computational model of this system, and learned to control its nonlinear dynamics by optimising a cost that penalised saccade inaccuracy, movement duration, and total energy expenditure of the motors. To speed up the calculations, the physical simulator was approximated by a recurrent neural network (NARX). We showed that the system can produce realistic eye movements that closely resemble human saccades in all directions: their nonlinear main-sequence dynamics (amplitude-peak eye velocity and duration relationships), cross-coupling of the horizontal and vertical movement components leading to approximately straight saccade trajectories, and the 3D kinematics that restrict 3D eye orientations to a plane (Listing’s law). Interestingly, the control algorithm had organised the motors into appropriate agonist-antagonist muscle pairs, and the motor signals for the eye resembled the well-known pulse-step characteristics that have been reported for monkey motoneuronal activity. We here fully analyse the eye-movement properties produced by the computational model across the entire oculomotor range and the underlying control signals. We argue that our system may shed new light on the neural control signals and their couplings within the final neural pathways of the primate oculomotor system, and that an optimal control principle may account for a wide variety of oculomotor behaviours. The generated data are publicly available at https://data.ru.nl/collections/di/dcn/DSC_626870_0003_600.

Teleoperation allows workers to safely control powerful construction machines; however, its primary reliance on visual feedback limits the operator’s efficiency in situations with stiff contact or poor visibility, hindering its use for assembly of pre-fabricated building components. Reliable, economical, and easy-to-implement haptic feedback could fill this perception gap and facilitate the broader use of robots in construction and other application areas. Thus, we adapted widely available commercial audio equipment to create AiroTouch, a naturalistic haptic feedback system that measures the vibration experienced by each robot tool and enables the operator to feel a scaled version of this vibration in real time. Accurate haptic transmission was achieved by optimizing the positions of the system’s off-the-shelf accelerometers and voice-coil actuators. A study was conducted to evaluate how adding this naturalistic type of vibrotactile feedback affects the operator during telerobotic assembly. Thirty participants used a bimanual dexterous teleoperation system (Intuitive da Vinci Si) to build a small rigid structure under three randomly ordered haptic feedback conditions: no vibrations, one-axis vibrations, and summed three-axis vibrations. The results show that users took advantage of both tested versions of the naturalistic haptic feedback after gaining some experience with the task, causing significantly lower vibrations and forces in the second trial. Subjective responses indicate that haptic feedback increased the realism of the interaction and reduced the perceived task duration, task difficulty, and fatigue. As hypothesized, higher haptic feedback gains were chosen by users with larger hands and for the smaller sensed vibrations in the one-axis condition. 
These results elucidate important details for effective implementation of naturalistic vibrotactile feedback and demonstrate that our accessible audio-based approach could enhance user performance and experience during telerobotic assembly in construction and other application domains.
