Recently, active vision has reemerged as an important concept in manipulation, since visual occlusion occurs more frequently when the main cameras are mounted on robot heads. Reflecting on the occlusion issue, we identify its essence as the absence of information needed for task completion. This motivates a more fundamental problem, Exploratory and Focused Manipulation (EFM): actively collecting information to complete challenging manipulation tasks that require exploration or focus. As an initial attempt to address this problem, we establish the EFM-10 benchmark, which consists of 4 categories of tasks that align with our definition (10 tasks in total). We further propose a Bimanual Active Perception (BAP) strategy, which leverages one arm to provide active vision and the other arm to provide force sensing while it manipulates. Based on this idea, we collect a dataset named BAPData for the tasks in EFM-10. With this dataset, we verify the effectiveness of the BAP strategy via imitation learning. We hope that the EFM-10 benchmark, together with the BAP strategy, can become a cornerstone for future research in this direction.
Instruction: Pick a [The-Required-Color] toy from one of the compartments of the cabinet and place it on the table.
Instruction: Pick the toy of the same color as the plate in [The-Specified-Compartment] and place it on the plate.
Instruction: Pass the [The-Required-Color] cup and hang it onto the rack.
Instruction: Pass the [The-Required-Color] cup and place it on the coaster.
Instruction: Push the box to the lined area.
Instruction: Plug the USB light into the charger.
Instruction: Place the bread dough on the tray and brush it with oil.
Instruction: Place the nail on the scrap of silver paper and knock the nail in.
Instruction: Insert the cable of the same color as the port.
Instruction: Plug the USB charger into the [Left/Middle/Right] port of the power strip.
We conducted preliminary experiments on 4 tasks in EFM-10 to verify the idea of leveraging the non-operating arm to provide eye-in-hand active vision. We compare three different settings of the visual context captured by the active view.
According to the results, it is desirable to capture both the manipulated area and the operating end effector in the active view, since this provides direct clues about how the operating end effector should adjust its pose.
We benchmark representative manipulation policies on EFM-10 to reveal their respective strengths and weaknesses. The evaluation results are shown in the following table.
We can observe that:
We further carry out experiments on enhancing policies with force sensing, considering two EFM tasks that involve fine-grained operation. We devise a variant of the GR-MG policy, as illustrated in the following image and sketched in code below. Concretely, we encode the current force/torque reading with a linear layer, append the encoded embedding as well as a query token to the original input sequence, and train the model to predict a chunk of future force/torque values based on the final representation of the query token.
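For concreteness, here is a minimal sketch of this force-sensing extension, assuming a transformer backbone with token dimension `d_model`. All names (`ForceSensingHead`, `obs_tokens`, `ft_reading`, etc.) are illustrative, not the actual GR-MG implementation.

```python
import torch
import torch.nn as nn

class ForceSensingHead(nn.Module):
    def __init__(self, d_model: int = 512, ft_dim: int = 6, chunk_len: int = 10):
        super().__init__()
        # Encode the current 6-DoF force/torque reading into one token.
        self.ft_encoder = nn.Linear(ft_dim, d_model)
        # Learnable query token whose final representation predicts future F/T.
        self.ft_query = nn.Parameter(torch.zeros(1, 1, d_model))
        # Predict a chunk of future force/torque values from the query token.
        self.ft_predictor = nn.Linear(d_model, chunk_len * ft_dim)
        self.chunk_len, self.ft_dim = chunk_len, ft_dim

    def extend_inputs(self, obs_tokens: torch.Tensor, ft_reading: torch.Tensor):
        """Append the encoded F/T token and the query token to the sequence.

        obs_tokens: (B, T, d_model) original input tokens of the policy
        ft_reading: (B, ft_dim) current force/torque measurement
        """
        ft_token = self.ft_encoder(ft_reading).unsqueeze(1)       # (B, 1, d_model)
        query = self.ft_query.expand(obs_tokens.size(0), -1, -1)  # (B, 1, d_model)
        return torch.cat([obs_tokens, ft_token, query], dim=1)    # (B, T+2, d_model)

    def predict_future_ft(self, backbone_out: torch.Tensor):
        """Read out the future F/T chunk from the query token's final state."""
        query_repr = backbone_out[:, -1]                          # (B, d_model)
        return self.ft_predictor(query_repr).view(-1, self.chunk_len, self.ft_dim)
```

During training, a regression loss between the predicted chunk and the recorded future force/torque trajectory would be added to the policy's original objective; the exact loss and weighting are not specified here.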
With force sensing, the success rates increase substantially on both tasks. More importantly, the averages of the maximum vertical force decrease by 29% and 22% (relative to the average ranges), indicating that a form of force-compliant control emerges from our neural network. To analyze this phenomenon qualitatively, we visualize a rollout by the GR-MG policy with force sensing.
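As a reference for how the force metric above could be computed, here is a small sketch; treating axis 2 of the F/T reading as the vertical force and normalizing by the average per-rollout range are our reading of the text, not a confirmed implementation.

```python
import numpy as np

def max_vertical_force_stats(rollouts: list[np.ndarray]) -> tuple[float, float]:
    """Each rollout is a (T, 6) array of F/T readings; axis 2 is assumed to be Fz.

    Returns the average of per-rollout maximum |Fz| and the average Fz range;
    force reductions are then expressed relative to the average range.
    """
    max_forces = [np.abs(r[:, 2]).max() for r in rollouts]
    ranges = [r[:, 2].max() - r[:, 2].min() for r in rollouts]
    return float(np.mean(max_forces)), float(np.mean(ranges))

# Relative decrease between a baseline and the force-sensing variant:
# (avg_max_baseline - avg_max_force_sensing) / avg_range
```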
It can be observed that when the hand-held USB light came into contact with one side of the USB port, the model forecasted that the vertical force would increase and controlled the end effector in a way that prevented an abrupt rise in vertical force. When the USB light was plugged into the port, the model also correctly forecasted that the vertical force would decrease. These observations strongly support the significance of force sensing.
To offer more insight into how manipulation policies can be further improved to address EFM tasks, we examine the typical failure modes observed during our experiments.
[Task]: Toy-Find
[Instruction]: Pick a blue toy from one of the compartments of the cabinet and place it on the table.
[Task]: Toy-Match
[Instruction]: Pick the toy of the same color as the plate in the right compartment and place it on the plate.
[Task]: Cable-Match
[Instruction]: Insert the cable of the same color as the port.
[Task]: Cup-Place
[Instruction]: Pass the red cup to the right arm and place it on the coaster.
[Task]: Cup-Hang
[Instruction]: Pass the white cup to the right arm and hang it onto the rack.
[Task]: Box-Push
[Instruction]: Push the box to the lined area.
[Task]: Light-Plug
[Instruction]: Plug the USB light into the charger.
[Task]: Bread-Brush
[Instruction]: Place the bread dough on the tray and brush it with oil.
[Task]: Nail-Knock
[Instruction]: Place the nail on the scrap of silver paper and knock the nail in.
[Task]: Charger-Plug
[Instruction]: Plug the USB charger into the middle port of the power strip.
Our BAPData and source code will be released upon acceptance under a Creative Commons Attribution-ShareAlike 4.0 International License.