Sparse2Act

Learning Action-Aligned Sparse 3D Representations for Cross-Domain Robot Manipulation

Anonymous

Sparse2Act uses task-space actions as geometric supervision for a masked point-cloud encoder, then reuses only the encoder initialization for downstream policies.

Abstract

Sparse 3D manipulation policies represent object geometry, spatial relations, and end-effector motion in a shared metric workspace. This structure suggests a simple pretraining signal: the demonstrated action should be inferable from the current 3D geometry. We introduce Sparse2Act, an action-aligned pretraining framework for sparse point-cloud encoders. Sparse2Act groups each point cloud into local sparse 3D tokens, masks a subset of tokens, and trains the encoder to regress the demonstrated action from the remaining visible tokens. After pretraining, the lightweight regression head is discarded and only the pretrained encoder is used to initialize downstream policies, so each downstream policy can operate in its preferred action space. On simulated manipulation benchmarks, our pretraining enables downstream diffusion policies to reach an average success rate of 86.9% on LIBERO-10 after 500 fine-tuning steps, and the same pretrained encoder transfers to Meta-World-5 with an average success rate of 73.4%. We further demonstrate sim-to-real deployment by pretraining the encoder on simulated workspace actions and fine-tuning with a small set of real-robot joint-space demonstrations. These results show that action-aligned pretraining is a simple and effective signal for learning control-relevant sparse 3D representations for robot manipulation.

Project Overview Video

Method Overview

Framework overview. Sparse2Act converts raw point clouds into local sparse 3D tokens, encodes the visible tokens with 3D positional structure, and pretrains the encoder to predict task-space alignment actions from masked observations. After pretraining, only the encoder initialization is reused by downstream policies, allowing each policy to keep its own action parameterization.
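The pipeline described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: farthest point sampling with nearest-neighbor grouping, the mask ratio, the mean-pooled linear "encoder", and the 7-dimensional action are all assumptions chosen to keep the example small, and every function name is hypothetical.

```python
import numpy as np


def farthest_point_sample(points, num_centers, rng):
    """Greedily pick indices of well-spread points (assumed tokenizer step)."""
    n = points.shape[0]
    chosen = np.empty(num_centers, dtype=np.int64)
    chosen[0] = rng.integers(n)
    dist = np.full(n, np.inf)
    for i in range(1, num_centers):
        diff = points - points[chosen[i - 1]]
        # Track each point's squared distance to its closest chosen center.
        dist = np.minimum(dist, np.einsum("ij,ij->i", diff, diff))
        chosen[i] = int(np.argmax(dist))
    return chosen


def build_tokens(points, num_tokens, group_size, rng):
    """Group a point cloud into local tokens: a center plus its nearest points."""
    centers = points[farthest_point_sample(points, num_tokens, rng)]   # (T, 3)
    d2 = ((points[None, :, :] - centers[:, None, :]) ** 2).sum(-1)     # (T, N)
    nn_idx = np.argsort(d2, axis=1)[:, :group_size]                    # (T, K)
    groups = points[nn_idx] - centers[:, None, :]                      # center-relative
    return centers, groups


def visible_token_indices(num_tokens, mask_ratio, rng):
    """Randomly mask a fraction of tokens and return the visible indices."""
    num_masked = int(round(mask_ratio * num_tokens))
    return np.sort(rng.permutation(num_tokens)[num_masked:])


def toy_action_loss(centers, groups, visible, weights, action):
    """Stand-in for encoder + regression head: mean-pool visible tokens,
    apply a linear map, and score against the demonstrated action (MSE)."""
    feats = np.concatenate(
        [centers[visible], groups[visible].reshape(len(visible), -1)], axis=1
    )
    pred = feats.mean(axis=0) @ weights
    return float(np.mean((pred - action) ** 2))


rng = np.random.default_rng(0)
points = rng.random((256, 3))                  # synthetic point cloud
centers, groups = build_tokens(points, num_tokens=16, group_size=8, rng=rng)
visible = visible_token_indices(16, mask_ratio=0.5, rng=rng)
weights = rng.normal(size=(3 + 8 * 3, 7))      # assumed 7-D task-space action
action = rng.normal(size=7)                    # placeholder demonstrated action
loss = toy_action_loss(centers, groups, visible, weights, action)
```

In an actual training loop the pooled linear map would be replaced by the learned encoder and regression head, and only the encoder weights would be carried over to the downstream policy.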

Simulation Results

In-Domain Adaptation

In-domain adaptation on LIBERO-10, success rate (%): DP3 29.1, DP 50.5, SpatialVLA 55.5, π₀ 85.2, Sparse2Act (ours) 86.9.

LIBERO-10. Sparse2Act reaches 86.9% average success after 500 fine-tuning steps.

Cross-Domain Transfer

Cross-domain transfer to Meta-World-5, success rate (%): FVP 28.4, DP3 42.4, AFRO 48.8, LIBERO pretraining 73.4, Meta-World pretraining 85.6.

LIBERO to Meta-World-5. The transferred encoder reaches 73.4% average success.

Action-aligned pretraining provides a strong in-domain initialization while preserving substantial performance under cross-domain transfer.
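To put the two charts on a common footing, the short calculation below derives margins and a transfer-retention ratio from the reported numbers only. The helper names are hypothetical, and it assumes that the "LIBERO pretraining" and "Meta-World pretraining" entries are both Sparse2Act encoders that differ only in pretraining data.

```python
# Success rates (%) as reported in the two charts above.
libero10 = {"DP3": 29.1, "DP": 50.5, "SpatialVLA": 55.5, "pi0": 85.2, "Sparse2Act": 86.9}
metaworld5 = {
    "FVP": 28.4,
    "DP3": 42.4,
    "AFRO": 48.8,
    "LIBERO pretrain": 73.4,
    "MW pretrain": 85.6,
}


def margin_over_best_baseline(results, ours, exclude=()):
    """Percentage-point gap between `ours` and the best other listed method."""
    baselines = [v for k, v in results.items() if k != ours and k not in exclude]
    return round(results[ours] - max(baselines), 1)


def retention(transferred, in_domain):
    """Share of in-domain performance kept under transfer, in percent."""
    return round(100.0 * transferred / in_domain, 1)


in_domain_margin = margin_over_best_baseline(libero10, "Sparse2Act")
transfer_margin = margin_over_best_baseline(
    metaworld5, "LIBERO pretrain", exclude=("MW pretrain",)
)
kept = retention(metaworld5["LIBERO pretrain"], metaworld5["MW pretrain"])

print(in_domain_margin, transfer_margin, kept)  # 1.7 24.6 85.7
```

Under that assumption, the LIBERO-pretrained encoder keeps roughly 86% of the success rate obtained with Meta-World pretraining while staying about 25 points above the strongest listed baseline.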

Sim-to-Real Transfer

Comparison with DP3

Real-Robot Tasks

Simulation and Real Observations

Simulation point cloud
Real point cloud

Additional Videos

All videos are shown at 4× speed.