Towards Exploratory and Focused Manipulation with Bimanual Active Perception: A New Problem, Benchmark and Strategy

Abstract

Recently, active vision has reemerged as an important concept in manipulation, since visual occlusion occurs more frequently when the main cameras are mounted on robot heads. We reflect on the visual occlusion issue and identify its essence as the absence of information useful for task completion. Motivated by this, we formulate the more fundamental problem of Exploratory and Focused Manipulation (EFM): actively collecting information to complete challenging manipulation tasks that require exploration or focus. As an initial attempt to address this problem, we establish the EFM-10 benchmark, which consists of 4 categories of tasks that align with our definition (10 tasks in total). We further propose a Bimanual Active Perception (BAP) strategy, which leverages one arm to provide active vision and the other arm to provide force sensing while manipulating. Based on this idea, we collect a dataset named BAPData for the tasks in EFM-10. With this dataset, we verify the effectiveness of the BAP strategy in an imitation learning setting. We hope that the EFM-10 benchmark, together with the BAP strategy, can become a cornerstone for future research in this direction.

Examples of the 10 Tasks in EFM-10 (from our BAPData)

Semantically Exploratory Tasks

Toy-Find

Instruction: Pick a [The-Required-Color] toy from one of the compartments of the cabinet and place it on the table.

Toy-Match

Instruction: Pick the toy of the same color as the plate in [The-Specified-Compartment] and place it on the plate.

Exploratory Tasks Involving Visual Occlusion

Cup-Hang

Instruction: Pass the [The-Required-Color] cup and hang it onto the rack.

Cup-Place

Instruction: Pass the [The-Required-Color] cup and place it on the coaster.

Box-Push

Instruction: Push the box to the lined area.

Delicate Tasks Requiring Focus

Light-Plug

Instruction: Plug the USB light into the charger.

Bread-Brush

Instruction: Place the bread dough on the tray and brush it with oil.

Nail-Knock

Instruction: Place the nail on the scrap of silver paper and knock the nail in.

Complex Tasks Requiring Both Exploration and Focus

Cable-Match

Instruction: Insert the cable of the same color as the port.

Charger-Plug

Instruction: Plug the USB charger into the [Left/Middle/Right] port of the power strip.

Experiments

Does the Eye-in-Hand Active Vision provided by the non-operating arm help?

We conduct preliminary experiments on 4 tasks in EFM-10 to verify the idea of leveraging the non-operating arm to provide eye-in-hand active vision, comparing three different settings of the visual context captured by the active view.

According to the results, the active view should capture both the manipulated area and the operating end effector, as seeing them together provides direct clues about how the end effector should adjust its pose.
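To make this criterion concrete, here is a minimal sketch, written by us for illustration rather than taken from the paper, of scoring candidate eye-in-hand viewpoints so that both the operating end effector and the manipulated area stay in frame. It assumes known 3D positions of both points and a calibrated pinhole camera; all function names and parameters below are hypothetical.

```python
# Illustrative sketch (not the paper's method): score an active viewpoint
# by checking that both the operating end effector and the manipulated
# area project inside the image, preferring views that keep them central.
import numpy as np

def project(K, T_cam_world, p_world):
    """Project a 3D world point into pixels with a pinhole camera model."""
    p_cam = T_cam_world[:3, :3] @ p_world + T_cam_world[:3, 3]
    if p_cam[2] <= 0:               # point is behind the camera
        return None
    uv = K @ (p_cam / p_cam[2])
    return uv[:2]

def in_frame(uv, width=640, height=480, margin=40):
    """Check that a pixel lies inside the image with a safety margin."""
    return (uv is not None
            and margin <= uv[0] <= width - margin
            and margin <= uv[1] <= height - margin)

def score_viewpoint(K, T_cam_world, p_effector, p_target):
    """Higher is better; views missing either point are ruled out."""
    uv_e = project(K, T_cam_world, p_effector)
    uv_t = project(K, T_cam_world, p_target)
    if not (in_frame(uv_e) and in_frame(uv_t)):
        return -np.inf
    center = np.array([320.0, 240.0])   # image center for 640x480
    return -(np.linalg.norm(uv_e - center) + np.linalg.norm(uv_t - center))
```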

How do representative policy models work in our EFM-10 Benchmark?

We benchmark the performance of representative manipulation policies on EFM-10 to reveal their respective pros and cons. The evaluation results are shown in the following table.

“*” indicates that the policy is trained with BAPData, collected based on our BAP strategy.
[Table: benchmarking results on EFM-10]

We can observe that:

  • Single-task policies ACT and DP cannot fulfill language-driven, semantically exploratory tasks such as Toy-Find.
  • DP is superior to ACT at handling tasks with multimodal action distributions, such as Box-Push, but inferior to ACT on fine-grained tasks such as Nail-Knock and Charger-Plug.
  • Pi-0 shows stronger instruction-following ability than GR-MG and more easily masters EFM tasks that do not involve fine-grained operation, such as Toy-Find, Cup-Hang and Cup-Place.
  • None of these policies performs well on EFM tasks that involve extremely fine-grained operation, such as Light-Plug and Charger-Plug.

Can Force/Torque sensing help?

We further carry out experiments on enhancing policies with force sensing, considering two EFM tasks that involve fine-grained operation. We devise a variant of the GR-MG policy, as illustrated in the following image. Concretely, we encode the current Force/Torque reading with a linear layer, append the encoded embedding as well as a query token to the original input sequence, and train the model to predict the chunk of future Force/Torque values based on the final representation of the query token.

A variant of the GR-MG policy with force sensing, developed by us.
[Figure: model architecture]
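For concreteness, below is a minimal PyTorch-style sketch of the modification described above: a linear encoder for the current Force/Torque reading, a learnable query token appended to the input sequence, and a head that regresses the chunk of future Force/Torque values from the query token's final representation. The dimensions, the backbone interface, and all names are our own assumptions and may differ from the actual GR-MG variant.

```python
# Sketch of the force-sensing extension, under assumed dimensions and a
# generic transformer backbone interface; names are illustrative.
import torch
import torch.nn as nn

class ForceSensingHead(nn.Module):
    def __init__(self, d_model=512, ft_dim=6, chunk_len=10):
        super().__init__()
        self.ft_encoder = nn.Linear(ft_dim, d_model)           # encode current F/T reading
        self.query = nn.Parameter(torch.zeros(1, 1, d_model))  # learnable query token
        self.ft_predictor = nn.Linear(d_model, chunk_len * ft_dim)
        self.chunk_len, self.ft_dim = chunk_len, ft_dim

    def extend_inputs(self, tokens, ft_now):
        """Append the encoded F/T token and the query token to the
        original input sequence of the policy backbone.
        tokens: (B, L, d_model); ft_now: (B, ft_dim)."""
        b = tokens.shape[0]
        ft_tok = self.ft_encoder(ft_now).unsqueeze(1)   # (B, 1, d_model)
        query = self.query.expand(b, -1, -1)            # (B, 1, d_model)
        return torch.cat([tokens, ft_tok, query], dim=1)

    def predict_future_ft(self, hidden):
        """Regress a chunk of future F/T values from the final
        representation of the query token (last sequence position)."""
        q_repr = hidden[:, -1]                          # (B, d_model)
        return self.ft_predictor(q_repr).view(-1, self.chunk_len, self.ft_dim)
```

The future-F/T prediction can then be supervised with a simple regression loss (e.g., MSE against the recorded F/T chunk) alongside the original action objective.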
“Avg. Fz Max” denotes the average of the maximum vertical force exerted by the operating end effector.
[Figure: force-sensing results]
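As a small reproduction aid, the sketch below computes “Avg. Fz Max” as we read its definition: take the peak vertical force within each trial, then average across trials. The F/T layout (Fz at index 2) and the per-trial averaging are our assumptions.

```python
# Compute "Avg. Fz Max": per-trial peak vertical force, averaged over trials.
import numpy as np

def avg_fz_max(trials):
    """trials: list of (T_i, 6) arrays of F/T readings; Fz assumed at index 2."""
    return float(np.mean([np.max(np.abs(t[:, 2])) for t in trials]))
```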

With force sensing, the success rates on the two tasks increase substantially. More importantly, the averages of maximum vertical force decrease by 29% and 22% (relative to the average ranges), indicating that a form of force compliance control is achieved by our neural network. To analyze this phenomenon qualitatively, we visualize a rollout of the GR-MG policy with force sensing.

It can be observed that when the hand-held USB light came into contact with one side of the USB port, the model forecasted that the vertical force would increase and controlled the end effector in a way that prevented an abrupt rise in vertical force. When the USB light was plugged into the port, the model also correctly forecasted that the vertical force would decrease. These observations strongly support the significance of force sensing.

What are the typical Failure Modes?

To provide more insight into how manipulation policies can be further improved to address EFM tasks, we look into the typical failure modes observed during our experiments.

Fail to accurately condition the action on the semantic context

[Task]: Toy-Find
[Instruction]: Pick a blue toy from one of the compartments of the cabinet and place it on the table.

[Task]: Toy-Match
[Instruction]: Pick the toy of the same color as the plate in the right compartment and place it on the plate.

[Task]: Cable-Match
[Instruction]: Insert the cable of the same color as the port.

Fail to find the optimal active viewpoint

[Task]: Cup-Place
[Instruction]: Pass the red cup to the right arm and place it on the coaster.

Fail to adapt to corner cases

[Task]: Cup-Hang
[Instruction]: Pass the white cup to the right arm and hang it onto the rack.

[Task]: Box-Push
[Instruction]: Push the box to the lined area.

Inadequate spatial perception/reasoning that leads to subtly wrong positioning

[Task]: Light-Plug
[Instruction]: Plug the USB light into the charger.

[Task]: Bread-Brush
[Instruction]: Place the bread dough on the tray and brush it with oil.

[Task]: Nail-Knock
[Instruction]: Place the nail on the scrap of silver paper and knock the nail in.

[Task]: Charger-Plug
[Instruction]: Plug the USB charger into the middle port of the power strip.

License

Our BAPData and source code will be released upon acceptance under a Creative Commons Attribution-ShareAlike 4.0 International License.