The core of our project lies in the integration of embodied intelligence, enabling robots to perceive, reason, and act within real-world household environments. Unlike traditional AI systems that focus solely on static tasks, embodied intelligence demands dynamic interaction with environments, integrating visual perception, physical action, and natural language understanding. By leveraging a Vision-Language-Action (VLA) framework, we aim to enhance robots' abilities to execute complex manipulation tasks in indoor settings, relying heavily on data from continuous visual and action streams.
Our data collection pipeline captures continuous video and image sequences from household environments as robots navigate and manipulate objects. This visual data covers not only static elements such as room layouts and furniture but also dynamic factors such as object placements, occlusions, and lighting changes. By collecting high-quality data from real-world environments, we aim to train rich, context-aware models that understand the spatial relationships between objects, so robots can reason about how to move through these spaces and adapt as they change.
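To make this concrete, the sketch below shows one plausible way to organize a captured scene episode. The class and field names (`SceneFrame`, `camera_pose`, `detected_objects`) are illustrative assumptions, not our actual storage schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical record for one frame of household visual data; field names
# and contents are illustrative, not the project's actual format.
@dataclass
class SceneFrame:
    timestamp: float                 # capture time in seconds
    rgb_path: str                    # path to the stored RGB image
    depth_path: str                  # depth map supporting spatial reasoning
    camera_pose: Tuple[float, ...]   # 6-DoF camera pose (x, y, z, roll, pitch, yaw)
    detected_objects: List[str] = field(default_factory=list)  # e.g. ["vase", "shelf"]

# A full episode groups frames recorded while the robot moves through one scene.
@dataclass
class SceneEpisode:
    scene_id: str                    # which household / room the episode covers
    frames: List[SceneFrame] = field(default_factory=list)
```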
In parallel, we gather robot manipulation data through wrist-mounted cameras on robot arms, which capture the robot's interaction with objects. The data collected includes visual streams, end-effector dynamics, and action sequences—such as grasping, moving, and placing objects—across a wide range of tasks. This combination of high-precision tracking and video data allows us to train models capable of executing precise manipulation actions in real-world environments. Through this, the system learns to perform tasks such as object sorting, moving items between locations, or interacting with novel objects, all while adjusting for environmental variables like clutter and lighting.
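A minimal sketch of how one manipulation step might be recorded is shown below, assuming a wrist-mounted RGB camera and an arm with a parallel gripper; the names and dimensions are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative layout for a single manipulation step; not the project's
# actual data format.
@dataclass
class ManipulationStep:
    wrist_rgb_path: str          # frame from the wrist-mounted camera
    ee_position: List[float]     # end-effector position (x, y, z) in metres
    ee_orientation: List[float]  # end-effector orientation as a quaternion (qx, qy, qz, qw)
    gripper_width: float         # current gripper opening in metres
    action: List[float]          # commanded end-effector delta pose plus gripper command

# An episode pairs a task description with the ordered steps that solved it.
@dataclass
class ManipulationEpisode:
    task: str                    # e.g. "place the cup on the shelf"
    steps: List[ManipulationStep] = field(default_factory=list)
```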
Training the models involves combining these visual and action data with natural language instructions. The integration of pretrained large language models (LLMs) and vision-language models (VLMs) enables robots to interpret open-ended, human-given commands and translate them into actionable tasks. This process makes use of multimodal fusion, where the robot learns to combine information from visual inputs and semantic instructions, mapping them to physical actions. The system understands natural language commands like “move the red vase from the kitchen to the shelf in the living room,” enabling it to navigate, interact, and manipulate objects across various household contexts.
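The snippet below is a minimal sketch of the multimodal-fusion idea: visual features and instruction features are projected into a shared space, fused, and decoded into a continuous action. The module sizes, names, and the assumption of frozen upstream encoders are illustrative, not our actual architecture.

```python
import torch
import torch.nn as nn

# Sketch of a vision-language-action policy head (PyTorch).
class VLAPolicy(nn.Module):
    def __init__(self, vis_dim=768, txt_dim=768, hidden=512, action_dim=7):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hidden)   # project image patch features
        self.txt_proj = nn.Linear(txt_dim, hidden)   # project instruction token features
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.action_head = nn.Linear(hidden, action_dim)  # e.g. delta pose + gripper

    def forward(self, vis_feats, txt_feats):
        # vis_feats: (B, N_patches, vis_dim) from a pretrained vision encoder
        # txt_feats: (B, N_tokens, txt_dim) from a pretrained language model
        tokens = torch.cat([self.vis_proj(vis_feats), self.txt_proj(txt_feats)], dim=1)
        fused = self.fusion(tokens)                  # joint attention over both modalities
        return self.action_head(fused.mean(dim=1))   # one action per control step
```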
Our end-to-end training pipeline is designed to process and merge multimodal data, allowing robots to not only navigate unfamiliar environments but also to perform precise manipulation tasks, such as placing or adjusting objects in response to spoken commands. By grounding the language in visual perception and action prediction, robots learn how to handle open-vocabulary tasks, meaning they can adapt to previously unseen objects without requiring specific training for each new scenario. This capacity for zero-shot generalization ensures that the system can perform tasks across diverse environments and respond flexibly to user needs.
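As a small example of the merging step, the function below batches paired samples of image, tokenized instruction, and expert action for training. The dictionary keys and tensor shapes are assumptions chosen for illustration rather than a description of our actual pipeline.

```python
import torch

# Minimal batching sketch: each sample is a dict with an image tensor,
# a tokenized instruction, and an expert action vector.
def collate_samples(samples):
    images = torch.stack([s["image"] for s in samples])              # (B, 3, H, W)
    instructions = torch.nn.utils.rnn.pad_sequence(
        [s["instruction_ids"] for s in samples], batch_first=True    # (B, T_max), padded
    )
    actions = torch.stack([s["action"] for s in samples])            # (B, action_dim)
    return {"image": images, "instruction_ids": instructions, "action": actions}
```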
The underlying model relies on expert demonstrations for imitation learning, where robots observe and replicate human actions. This method enables the system to learn generalizable policies that can later be applied to novel tasks or environments. Over time, the system evolves through continual data collection, improving its ability to understand complex household environments and complete tasks with increasing proficiency.
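The sketch below shows the core of such an imitation-learning update: the policy is regressed onto expert actions from demonstrations. Here `policy` and `demo_loader` stand in for the VLA model and a loader over demonstration batches, and the choice of an MSE loss on continuous actions is an illustrative assumption.

```python
import torch
import torch.nn as nn

# One epoch of behaviour cloning over expert demonstrations (sketch).
def imitation_epoch(policy, demo_loader, optimizer, device="cpu"):
    criterion = nn.MSELoss()
    policy.train()
    for batch in demo_loader:
        obs = batch["image"].to(device)
        instr = batch["instruction_ids"].to(device)
        expert_action = batch["action"].to(device)

        predicted_action = policy(obs, instr)          # map (observation, instruction) -> action
        loss = criterion(predicted_action, expert_action)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```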
In summary, this approach combines natural language processing, visual perception, and robotic manipulation in a unified framework, allowing for intelligent interaction in dynamic, real-world environments. The system is capable of handling tasks with varying complexity, from simple object movements to more intricate organizational tasks, all while being adaptable to changes in the environment and open to new, unseen scenarios.
The embodied-AI data we generated has passed usability validation. Below is a demo arm trained on our data: it automatically detects debris on the ground and stores it in the designated location.
ROBOTIN's data network is a decentralized data network designed to eliminate information monopolies, data misuse, and privacy issues while improving data-acquisition efficiency and scenario effectiveness.