T-Rex: Tactile-ReactiveDexterous Manipulation

Abstract

The ability to react dynamically to tactile signals has long been considered crucial to agile, human-level dexterity. Yet contemporary learning-based Vision-Language-Action (VLA) models for robotic manipulation generally either overlook the tactile modality altogether or are limited to encoders that capture only static cues. This gap stems from three obstacles: the scarcity of diverse training data and standardized evaluation, architectural constraints in current VLA models, and the limitations of static tactile encoders. T-Rex pushes the frontier of tactile-reactive manipulation by addressing all three. We open-source a large-scale, 100-hour tactile-rich dataset collected via a novel, data-efficient recipe that prioritizes elementary motor primitives. To exploit naturally high-frequency touch signals without sacrificing the capabilities of existing VLAs, we introduce a variable-rate Mixture-of-Transformer (MoT) architecture equipped with a novel temporal tactile VQ-VAE encoder. We demonstrate the effectiveness of tactile-reactive policies on 12 manipulation tasks requiring delicate force control and deformable-object manipulation, achieving over 30% higher average success rate than the strongest baseline.

Contributions

T-Rex redesigns the full pipeline across data, architecture, and real-world tasks:

  • The T-Rex Dataset — an open-source, 100-hour, tactile-synchronized teleoperation dataset organized around diverse object × motor-primitive combinations, filling the tactile gap in vision-only training data.
  • A three-stage training paradigm — human egocentric pre-training, tactile-rich mid-training, and lightweight fine-tuning — the first complete recipe for tactile dexterous manipulation.
  • The T-Rex Model — a variable-rate Mixture-of-Transformer that splits control into a low-rate action expert and a high-rate tactile expert providing reactive residual refinements.
  • A temporal tactile VQ-VAE encoder that compresses high-frequency touch into compact, drift-robust tokens of temporal force and contact patterns.
  • A 12-task real-world benchmark on a 58-DoF bimanual dexterous robot, spanning force control, deformation, and bimanual coordination — and state-of-the-art results, beating the strongest baseline by over 30% average success rate.

The T-Rex Dataset

Most robot-manipulation datasets are built around parallel grippers or grasp-centric hands, offering limited coverage of tactile-rich interactions. The T-Rex Dataset is a large-scale, tactile-synchronized robot “play” corpus for mid-training. Rather than chasing a few long tasks, it follows a deliberately data-efficient recipe that prioritizes broad coverage of elementary motor primitives — short, reusable contact-rich behaviors that compose into complex skills. Every episode is structured around an object × motor-primitive combination: pairing 207 household objects with 22 motor primitives and pruning the physically infeasible pairs yields 502 unique, meaningful combinations, each with ~17 demonstrations.

Data is collected on a bimanual Dexmate Vega-1 with two 7-DoF arms and two 22-DoF Sharpa Wave dexterous hands (five fingertip tactile sensors per hand). Perception combines a head ZED X Mini stereo camera with two wide-view wrist cameras; teleoperation uses Manus gloves and VIVE trackers routed through the same control pipeline used at deployment. Every episode is a time-aligned bundle at 30 Hz: three RGB streams, bimanual proprioception, SE(3) wrist poses, per-fingertip tactile (a deformation depth map plus a 6-axis wrench for all ten fingertips), and a language instruction — beneath a 300 Hz low-level control thread.

50h
Open-Sourced Teleop Play
200+
Daily Objects
22
Motion Primitives
3
Tactile Modalities
7.9%7.5%8.6%5.9%14.0%9.7%13.3%6.7%14.1%12.3%
  • Personal Care
  • Hardware & Tools
  • Toys
  • Kitchen
  • Wrapping & Tape
  • Electronics
  • Fabric & Cloth
  • Clothing
  • Containers
  • Paper & Writing
0246Hours (h)wraplift and placegrasp and liftfoldcutreachinsertpresswipepeelassembleextracttwistshakedispensedisassemblesqueezepouropenclosescrewunscrew
Personal CareToysWrapping & TapeElectronicsKitchenHardware & ToolsContainersFabric & ClothPaper & WritingClothing100101102# Episodes (log scale)bandagestress ballsyringegrip strengthenertoothpastesoaplipsticksoap boxmedicine bottlehand sanitizermirrorlint roller sheetmakeup brushsanitizersurgical tapebandaidcompact mirrormedicine boxfoam earplugsearplugpill capsulecubetoy bananatoy tower of hanoilegostoy pegtoy piano keytoy pocket knifetoy train tracktoy train tracksbellbasketballhandbellmaracaspiggy banktambourinemagnetic blockstrain tracksbubble wrapaluminum foilplastic wrapadhesive wraptapecling filmgluecellophaneconstruction paperwrapping paperduck tapeparchment paperfoilmasking tapeorigami papercrepe paperglue bottlewindow clingairpodsbuttonflip phonecameraethernet cableswitch on offkeyboardcalculatorphonefile adapterroutertimerusb cablejoysticklaptoptv remotecablemousespeakercalculator keyscreen protectorside buttonstopwatchpower adaptercharging cableplugbatteryflashlighton off switchmeasuring cupalmondcondimentsugarpipettecoffee filterknifespoonforksandwich pressbeakerglass cuptweezersdishkettleoil cruetpitcherteatest tubewatering canair blowerkeyscrewscrewdriverscissorstoolboxmatchesbox of matchestablecombination lockzip tiefoldable boardpipekeycardpvc pipesclamshell containerdispenser bottlebowlbottle capdrawerbottlecupjar lidjewelry boxglasses casecarafepackagepen bagpencil casebucketpaper plateplateplastic platepaper bagwater bottlelunch bagegg cartonhinged tinspray bottlemustard capcarafe lidspongenapkinscarftissueribbonclothstringmuslinvelcro strapbandanapaper towelcheeseclothmemory foamtowelarm wrapnettingsashgel padthreadfelt sheethandkerchiefyarnstrawtissue paperbookwhiteboardcoinpaperbumper stickerstaplermapsticky notecardmarkerpenreceipt paperstickerphoto albumdollar billshipping labelfilepaper clampenvelope sealpostage stampfolderpaper clippen and captextbookcanvas paperenvelopepenn capreceiptbackpackshortsglasseshoodieshirtjacketlong pantscoatbeltshoehair clipapron
T-Rex Dataset statistics — top-left: share of demonstration time across task categories; top-right: hours of data per motor primitive; bottom: the long-tail distribution of demonstrations across 200+ household objects.

Explore the dataset

Browse a 500-trajectory random subset, filter by object and motion primitive, and resample on demand — all in your browser. Click the preview to open the full interactive visualizer.

Open visualizer →Open the dataset visualizer →

Model & Architecture

T-Rex is built around one idea: contact-rich dexterity needs two clocks running at once. Vision and language tell the robot what to do and roughly how to move, but they are too slow and too coarse to manage the millisecond-scale forces that decide whether an egg cracks, a card slips, or a bulb threads cleanly. T-Rex therefore splits the policy into a low-frequency visuomotor planning stream and a high-frequency tactile refinement stream, fused inside one shared transformer via a Mixture-of-Transformer-Experts backbone with three experts: a latent expert that predicts future visual representations, an action expert that plans a coarse action chunk, and a lightweight tactile expert that adds high-frequency residual corrections. The figure below shows how these experts fit together.

T-Rex model architecture
Figure 2: The T-Rex model — a variable-rate Mixture-of-Transformer with a latent expert, a slow action expert, and a fast, lightweight tactile expert that performs high-frequency residual refinement via an asynchronous cascaded flow-matching scheme.
The architecture in motion — the asynchronous cascaded flow-matching rollout.

The interaction is an asynchronous cascaded flow-matching scheme. The flow trajectory is split at τ = 0.4: the action expert integrates the upper segment (τ : 1 → 0.4) over 6 steps to produce a coarse plan, whose vision-language context is cached as a frozen snapshot. The tactile expert clones that cache and finishes the lower segment (τ : 0.4 → 0) over just 4 cheap steps, conditioned on live tactile tokens — and re-fires at intra-chunk offsets {0, 4, 8, 12}, four fast tactile ticks per slow visuomotor tick. Because the expensive vision/planning compute is amortized across many fast ticks, per-step cost is dominated by four lightweight tactile steps — fast enough for real closed-loop reaction.

Temporal tactile VQ-VAE encoder

Tactile feedback carries two complementary signals: temporal force dynamics (how contact forces evolve) and spatial contact geometry (edges, slip, shear). T-Rex encodes each separately. A per-finger VQ-VAEcompresses a 16-frame window of raw 6-D force/torque through a 1D temporal convolutional encoder into a 256-D embedding, then vector-quantizes it to a learned codebook (K = 64) — one discrete, drift-robust token per finger. EMA codebook updates with periodic re-seeding prevent collapse, and a magnitude-weighted reconstruction loss keeps the codebook from collapsing onto the dominant no-contact state. The current unquantized force vector is also projected directly to preserve low-latency present-moment information, and a frozen ResNet-derived encoder turns each fingertip’s deformation map into geometry-aware features. Concatenated, these form the tactile tokens the fast expert consumes — it never re-runs the vision tower.

Training recipe

T-Rex is trained in three stages that progressively transfer large-scale human priors into tactile-reactive control. (1) Human egocentric pre-training on 22,889 hours of first-person video gives the latent and action experts broad visuomotor and language priors (no tactile yet). (2) Tactile-grounded mid-training on the 100-hour T-Rex dataset adapts the action expert to robot observations and trains the tactile expert from scratch as a high-frequency refiner, with a delay augmentation that matches the visual/tactile staleness seen at deployment. (3) Skill-specific post-training fine-tunes on ~100 demonstrations per task. An auxiliary future-visual-prediction objective keeps the rapid tactile reflexes grounded in task context.

Demonstrations

Real-world autonomous policy rollouts on the bimanual dexterous platform. Pick a task to watch the policy execute it; click the video to expand it.

Browse all 12 tasks in detail →

Results

T-Rex is evaluated against six representative dexterous-manipulation and VLA baselines on 12 tactile-reactive tasks spanning three difficulty families: force-sensitive contact, deformation-aware manipulation, and the hardest bimanual force-deformation tasks. Each task is evaluated over 16 randomized trials, with progress rubrics for multi-stage tasks. T-Rex is the top method on every one of the 12 tasks.

T-Rex (Ours)
65%
EgoScale
35%
π0.5
17%
Tactile-VLA
15%
RDP
6%
π0.5 + tactile
6%
ViTacFormer
3%

T-Rex reaches a 65% macro-average+30 absolute points over the strongest baseline (EgoScale, 35%), and over 20× the weakest, across all 12 tactile-reactive tasks.

Figure 3: Average success rate (%) across the 12 tactile-reactive manipulation tasks, averaged over 16 rollouts per task. T-Rex is best on every task.

Two findings stand out. First, large-scale pre-training is essential: policies trained from scratch on ~100 demos (ViTacFormer, RDP) collapse, while EgoScale’s egocentric pre-training makes it the strongest baseline. Second, tactile integration must be done right: naively bolting tactile signals onto a pretrained VLA actually hurts (π0.5 + tactile scores below plain π0.5). T-Rex combines large-scale pre-training, tactile-grounded mid-training, and tactile-reactive control to win across the board.

Per-task results

MethodFlip PageTransfer EggWipe PlateApply PasteSplit CupSort MahjongOpen LockRefill TabletAcid-BaseExtract CardDeal PokerScrew BulbAvg
ViTacFormer9041470002213
RDP128182692001276
Tactile-VLA381424021278094111815
π0.53617281318325124891117
π0.5 + tactile892724142073006
EgoScale68443438333619124341281835
T-Rex (Ours)96756966786547417670573565
Force-sensitive contact Deformation-aware Bimanual force–deformation
Figure 4: Per-task success rate (%). The bottom row is T-Rex; the right-most column is the macro-average across all 12 tasks.

Ablations

Every component earns its place. Removing touch entirely is the most damaging; the temporal VQ-VAE force encoder, the asynchronous cascade, and both training stages each contribute measurably.

Tactile modality & encoding

Full model (Ours)
65
MLP force + VQ-VAE force
59
MLP force + deform
58
Deformation only
54
w/o Tactile
42

Removing touch entirely is the most damaging (−23 pts); the temporal VQ-VAE force encoder is what makes touch pay off.

Asynchronous cascaded flow matching

Full model (Ours)
65
w/o Async (synchronous)
60

Decoupling low-frequency planning from high-frequency tactile refinement gives a consistent +5 pts.

Training recipe (pre-train × mid-train)

Pre-train ✓ + Mid-train ✓
67
Pre-train ✓ only
45
Mid-train ✓ only
34
From scratch
18

From an 18-pt scratch baseline, tactile-grounded mid-training and human pre-training each help; together they reach the full recipe.

Figure 5: Ablations (average success over 6 representative tasks). Tactile feedback, the temporal VQ-VAE encoder, the asynchronous cascade, and both training stages each contribute.

We also sweep the cascaded denoising split step Kslow. An intermediate split is best across all three contact-rich tasks: too small a split leaves the action expert with too few visuomotor priors for the tactile expert to refine, while too large a split starves the tactile expert of capacity to fold in feedback.

0%

Apply Toothpaste

606264666812468Success Rate (%)

Split Cup

707274767812468Success Rate (%)

Extract Card

646668707212468Success Rate (%)
Denoise Split Step
Figure 6: Ablation on the cascaded denoising split step Kslow — success rate (%) vs. split step for three representative tasks. An intermediate split consistently wins. Hover any point for its value; drag the bar to scrub.

Finally, T-Rex is markedly more data-efficient. With tactile-grounded mid-training (blue), success climbs far faster as post-training demonstrations grow than training from scratch without mid-training (green) — the gap is largest in the low-data regime.

0%

Apply Toothpaste

020406080102050100200Success Rate (%)

Split Cup

1030507090102050100200Success Rate (%)

Extract Card

020406080102050100200Success Rate (%)
Number of post-training trajectories
Figure 7: Data efficiency — success rate (%) vs. number of post-training trajectories, with (blue) and without (green) T-Rex tactile mid-training. Hover a point for its value; click a legend entry to isolate a curve.

Failure cases

Six recurring failure modes point to where tactile-reactive dexterity still has headroom:

Execution Progress
Screw Lightbulb failure rollout — Object collision
Object collisionScrew Lightbulb. The red box marks the contact issue behind the failure. Hover a card above to switch modes; click the strip to enlarge.

Limitations & future directions

  • Hard, tightly-toleranced long-horizon tasks where good teleoperation is difficult — future work could add reinforcement learning or online refinement beyond behavior cloning.
  • Tactile hardware bottlenecks — sensor distortion, calibration drift across devices, and the absence of dense palm sensing for true whole-hand manipulation.
  • Toward unified, richer tactile sensing — representations that generalize across heterogeneous sensors, and whole-hand hardware with dense coverage beyond the fingertips.

Conclusion

T-Rex brings large-scale pre-training and high-frequency touch together for contact-rich bimanual manipulation. A Mixture-of-Transformer model combines asynchronous tactile refinement with a temporal tactile VQ-VAE, processing touch at high frequency without slowing the main policy. Trained with human-video pre-training and an open-source 100-hour tactile-synchronized dataset, T-Rex outperforms existing dexterous and tactile-aware VLA baselines by an average of 30% across 12 real-world tactile-reactive tasks while substantially improving data efficiency — a practical recipe for tactile-reactive dexterous control.

Citation

The T-Rex paper is available on arXiv. If you find this work useful, please cite:

@misc{niu2026trextactilereactivedexterousmanipulation,
      title={T-Rex: Tactile-Reactive Dexterous Manipulation},
      author={Dantong Niu and Zhuoyang Liu and Zekai Wang and Boning Shao and Zhao-Heng Yin and Anirudh Pai and Yuvan Sharma and Stefano Saravalle and Ruijie Zheng and Jing Wang and Ryan Punamiya and Mengda Xu and Yuqi Xie and Yunfan Jiang and Letian Fu and Konstantinos Kallidromitis and Matteo Gioia and Junyi Zhang and Jiaxin Ge and Haiwen Feng and Fabio Galasso and Wei Zhan and David M. Chan and Yutong Bai and Roei Herzig and Jiahui Lei and Fei-Fei Li and Ken Goldberg and Jitendra Malik and Pieter Abbeel and Yuke Zhu and Danfei Xu and Jim and Fan and Trevor Darrell},
      year={2026},
      eprint={2606.17055},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2606.17055},
}

This project page is built on the open-source ENPIRE project-page template — many thanks to its authors.