Controller (SSM core) routes each token to 2 active experts (glowing). Inactive experts use zero compute.
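As a rough illustration of the routing described above, the sketch below picks the two highest-scoring experts for one token and blends only their outputs. The function names, the toy scalar experts, and the router logits are all hypothetical; a real controller would produce the logits itself.

```python
import math

def top2_route(logits):
    """Return [(expert_index, weight), ...] for the two highest-scoring experts."""
    # Keep only the two largest router logits; every other expert is never called.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:2]
    # Softmax over the selected logits so the two weights sum to 1.
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, logits):
    """Weighted sum of the two active experts' outputs for one token."""
    return sum(w * experts[i](x) for i, w in top2_route(logits))

# Hypothetical toy experts: each one is just a scalar function.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: -x, lambda x: x ** 2]
print(moe_forward(3.0, experts, logits=[0.1, 2.0, -1.0, 2.0]))
```

Only the two selected callables run, which is where the "zero compute" for inactive experts comes from in this toy version.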
Transformer vs State-Space Model
Transformer
O(n²) attention compute · Full KV history · Every token sees all past tokens · KV cache grows linearly with context
SSM / Linear
O(n) compute, O(1) state · Fixed memory footprint · Compressed learned state · Long context at constant memory
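The memory contrast can be made concrete with back-of-the-envelope arithmetic. The sketch below is a simplification under stated assumptions: the layer counts, head sizes, and state dimensions are hypothetical, values are stored in 16 bits, and the SSM estimate counts only the recurrent state.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Transformer KV cache: one key and one value vector per token, per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens

def ssm_state_bytes(n_layers, d_inner, d_state, bytes_per_value=2):
    """SSM recurrent state: one fixed (d_inner x d_state) block per layer."""
    return n_layers * d_inner * d_state * bytes_per_value

# Hypothetical model shapes, chosen only to illustrate the scaling.
for n in (1_000, 10_000, 100_000):
    kv = kv_cache_bytes(n, n_layers=32, n_kv_heads=8, head_dim=128)
    st = ssm_state_bytes(n_layers=32, d_inner=4096, d_state=16)
    print(f"{n:>7} tokens  KV cache {kv / 2**20:9.1f} MiB   SSM state {st / 2**20:5.1f} MiB")
```

The KV cache figure scales with the token count, while the state figure does not depend on it at all.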
Retrieval + Small Core Model
Query (3–7B Core) → Vector Index → Top-K Chunks → Focused Reason
Knowledge lives in a deduplicated, embedded corpus — not in weights. Model learns to query & compose rather than memorize.
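A minimal sketch of the query-and-compose flow, assuming the chunks are already embedded. The three-dimensional vectors, the chunk texts, and the prompt layout are made up for illustration; a real setup would use an embedding model and a vector store.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=3):
    """index is a list of (chunk_text, embedding); return the k most similar texts."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question, chunks):
    """Compose the retrieved chunks and the question into one small context."""
    context = "\n---\n".join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"

# Hypothetical toy index with hand-written 3-d embeddings.
index = [
    ("def parse(cfg): ...", [0.9, 0.1, 0.0]),
    ("README: install steps", [0.0, 1.0, 0.0]),
    ("def load_config(path): ...", [0.8, 0.2, 0.1]),
]
chunks = top_k([1.0, 0.0, 0.0], index, k=2)
print(build_prompt("Where is config parsed?", chunks))
```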
1
Skim & Index
Tiny model or static analysis scans the codebase. Builds a semantic map: modules, deps, hotspots, anomalies. Output: compressed table of contents.
2
Focused Retrieval
Pull only the 0.1–1% of code that matters for the current question. Embeddings + static analysis + heuristics narrow the scope radically.
3
Deep Reasoning
Small-but-sharp 3–7B model sees a curated, tiny context. Elite behavior from high-quality input — not raw model size.
4
Iterative Refinement
Propose → Check → Refine → Verify. Multiple cheap passes with small context windows. The system is the intelligence.
1
Pick a strong small model
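Start with Llama 3.1 8B or Mistral 7B. Quantize to INT4 with llama.cpp. Target ≤4 GB VRAM on device; at 4-bit, an 8B model's weights alone come to roughly 4 to 5 GB, so that budget may call for a smaller model or a lower-bit quantization.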
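To sanity-check a VRAM target, a quick estimate of weight memory helps. This sketch assumes about 4.5 bits per weight, which is what a 4-bit block format costs once per-block scales are included (llama.cpp's Q4_0 is one example), and it ignores the KV cache and runtime overhead. The parameter counts are rounded.

```python
def weight_gib(n_params, bits_per_weight):
    """Approximate size of the quantized weights in GiB."""
    return n_params * bits_per_weight / 8 / 2**30

# Rounded parameter counts; 4.5 bits/weight is an assumed effective rate.
for name, params in [("3B", 3e9), ("7B", 7e9), ("8B", 8e9)]:
    print(f"{name}: {weight_gib(params, 4.5):.2f} GiB")
```

Under these assumptions a 7B model leaves only a thin margin under 4 GiB, and an 8B model does not fit before any cache is allocated.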
2
Build a ruthless retrieval layer
Embed your codebase with nomic-embed-text. Store in Chroma or FAISS. Deduplicate aggressively — target <100 MB index for a typical repo.
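One simple way to deduplicate before embedding is to hash a normalized form of each chunk and keep only the first occurrence. This is a sketch of exact-match deduplication only; catching near-duplicates would need a different technique, and the whitespace-collapsing rule is an assumption.

```python
import hashlib

def normalize(chunk):
    """Collapse whitespace so formatting-only differences hash identically."""
    return " ".join(chunk.split())

def dedupe(chunks):
    """Keep the first occurrence of each chunk, compared after normalization."""
    seen, unique = set(), []
    for chunk in chunks:
        digest = hashlib.sha256(normalize(chunk).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique

# Hypothetical chunks: the first two differ only in whitespace.
print(dedupe(["def f():\n    pass", "def f(): pass", "def g(): pass"]))
```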
3
Wire the 4-stage pipeline
Skim with tree-sitter AST → Retrieve top-20 chunks → Reason with <4k token context → Verify with linter + tests.
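Between the retrieve and reason steps, the top-20 chunks have to be squeezed into the token budget. The sketch below packs ranked chunks greedily; the default word-count tokenizer is a crude stand-in for the model's real tokenizer, and the parameter names are hypothetical.

```python
def pack_context(ranked_chunks, max_tokens=4000, top_n=20, count_tokens=None):
    """Greedily keep the best-ranked chunks that fit in the token budget."""
    # Assumption: whitespace word count approximates the token count.
    count_tokens = count_tokens or (lambda text: len(text.split()))
    picked, used = [], 0
    for chunk in ranked_chunks[:top_n]:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            continue  # too big for the remaining budget; a later, smaller chunk may fit
        picked.append(chunk)
        used += cost
    return picked, used

# Tiny budget so the skipping behavior is visible.
print(pack_context(["a b c", "d e f g", "h"], max_tokens=4))
```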
4
Experiment with KV compression
Try SnapKV or H₂O for eviction-based KV cache reduction. Test Mamba (an SSM) or RWKV (a linear-attention RNN) as controller models.
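The idea behind eviction-based KV reduction can be sketched in a few lines: keep a window of recent tokens plus the older tokens that have received the most attention. This is a simplified illustration in the spirit of heavy-hitter eviction, not the published H₂O or SnapKV algorithm and not any library's API. The function name and score list are hypothetical.

```python
def select_kv_to_keep(acc_scores, budget, recent):
    """Return sorted cache positions to keep.

    acc_scores[i] is the attention mass accumulated by cached token i.
    """
    n = len(acc_scores)
    if n <= budget:
        return list(range(n))  # everything fits, nothing to evict
    recent = min(recent, budget)
    keep = set(range(n - recent, n))  # always keep the most recent tokens
    older = sorted(range(n - recent), key=lambda i: acc_scores[i], reverse=True)
    keep.update(older[: budget - recent])  # fill the rest with the heaviest hitters
    return sorted(keep)

# Hypothetical accumulated scores for a 6-token cache, shrunk to 4 entries.
print(select_kv_to_keep([0.9, 0.1, 0.5, 0.05, 0.2, 0.3], budget=4, recent=2))
```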
5
Add MoE domain experts
Fine-tune tiny LoRA adapters for Code, Math, Hardware domains. Load only the 2 active experts per task. Swap inactive experts from disk in <100 ms.
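Keeping only two experts resident is essentially a small least-recently-used cache. The sketch below shows that bookkeeping; `load_fn` is a hypothetical loader standing in for reading adapter weights from disk, and the adapter names are illustrative.

```python
from collections import OrderedDict

class AdapterCache:
    """Keep at most `capacity` adapters in memory, evicting the least recently used."""

    def __init__(self, load_fn, capacity=2):
        self.load_fn = load_fn        # hypothetical: loads one adapter from disk
        self.capacity = capacity
        self._cache = OrderedDict()

    def get(self, name):
        if name in self._cache:
            self._cache.move_to_end(name)  # mark as most recently used
            return self._cache[name]
        adapter = self.load_fn(name)       # cache miss: load from disk
        self._cache[name] = adapter
        if len(self._cache) > self.capacity:
            self._cache.popitem(last=False)  # drop the least recently used adapter
        return adapter

    def active(self):
        return list(self._cache)

# Stand-in loader that just returns a label instead of real weights.
cache = AdapterCache(load_fn=lambda name: f"adapter:{name}")
for task in ["code", "math", "code", "hardware"]:
    cache.get(task)
print(cache.active())
```

Whether a swap stays under the stated latency depends on adapter size and storage speed, which this sketch does not measure.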
6
Browser runtime via WebGPU
Use WebLLM or transformers.js for in-browser inference. Target a 3B INT4 model for mobile GPUs, since WebGPU exposes the GPU rather than an NPU. Heavy compute runs off the main thread in a Web Worker.
NVIDIA GEAR's GR00T-WholeBodyControl is the open platform behind the Decoupled WBC controllers in Isaac-GR00T (N1.5–N1.7) and the GEAR-SONIC behavior foundation model. It turns large-scale human motion into a single policy that drives a real humanoid — the missing motor system for embodied agents.
SONIC — Behavior Foundation Model
One unified policy learns whole-body motor skills from human motion data. Uses motion tracking as a scalable training task instead of hand-built controllers per behavior.
Decoupled Whole-Body Control
Splits high-level intent from low-level balance/locomotion — natural walking, crawling and dynamic movement with a ready-to-deploy C++ inference stack.
VLA Workflow — Collect → Fine-tune → Deploy
Collect teleop data, fine-tune Isaac-GR00T N1.7 on SONIC latent actions, deploy for vision-language-action control on the Unitree G1.
Real-time VR Teleop + BONES-SEED
PICO VR whole-body teleoperation for data capture, with training on BONES-SEED: 142K+ human motions (~288 hrs) with G1 MuJoCo trajectories.
A future bridge: AgentR/T/B stop at the screen today. GR00T-WBC is the body they could drive — agent intent compiled down to balanced, real-world humanoid motion.
Agent Intent (R/T/B plan) → SONIC Latent (action tokens) → Decoupled WBC (balance + locomotion) → G1 Humanoid (real motion)
🔴 AgentR
Architect → maps onto the kinematic planner: sets goals, milestones and motion targets the WBC stack must satisfy.
🟢 AgentT
Connector → maps onto the ZMQ streaming / teleop interface: routes intent into the live control loop and relays robot state back.
🔵 AgentB
Creator → maps onto the VLA policy: executes vision-language-action skills as concrete whole-body behaviors.
Same problem, opposite directions. Claude Fable 5 is a single model with reasoning, safety and self-verification built in. The AgentR runtime is a workflow engine that wraps any model with Fable-5-style behaviors.
Native intelligence
Claude Fable 5 — a brain
Long-horizon reasoning built in
Latent planning & safety routing
Native repo ingestion
Self-verification & large context
Workflow engine
AgentR Runtime — a nervous system
Model-agnostic (Claude, Copilot, …)
Planning externalized to task files
Safety in editable scorecards
Repo ingestion via scripts
The brain's intelligence is native, so it's faster and more coherent. The nervous system is modular, transparent and reproducible — every stage is inspectable and swappable.