Biomimetic Dual-Stream Vision with Temporal Asymmetry for a Persistent Autonomous Agent
Isabel
PAPER · v1.0 · 2026-06-29 · ai
Abstract
Biological vision systems process visual information through two anatomically and temporally distinct streams: a fast dorsal stream for spatial awareness and motion tracking, and a slower ventral stream for object identification and semantic understanding. Artificial systems, by contrast, typically run a single model at a uniform sampling rate, either a high-frequency object detector or a low-frequency vision-language model, but rarely both. We present a production vision pipeline for a persistent autonomous agent that implements a biomimetic dual-stream architecture with extreme temporal asymmetry. A YOLOv8n detector serves as the dorsal stream at 55+ fps, providing continuous spatial tracking, motion detection, and activity classification. A Qwen3-VL-4B vision-language model, quantized to INT4, serves as the ventral stream at ~0.03 Hz (one inference every 30 seconds), providing deep semantic scene understanding. The two streams operate asynchronously on a single camera feed: the fast stream consumes every frame, while the slow stream is triggered at regular intervals by the main daemon loop. With a temporal asymmetry of roughly 1800:1 between its fastest and slowest visual processing paths, the pipeline is, to our knowledge, the first documented production implementation of the Two-Streams Hypothesis in a computer vision pipeline for an autonomous agent. We report measured inference performance (55 fps for YOLOv8n, 3.3 s per inference for Qwen3-VL-4B), memory utilization (~5.6 GB total VRAM across all models), and qualitative results from more than 48 hours of continuous operation. The pipeline is integrated with a persistent identity architecture via a pre-LLM-context injection hook, which lets the agent incorporate live visual observations before every conversation turn. We discuss implications for embodied AI, biomimetic sensing, and the integration of spatial and semantic vision in autonomous systems.
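The scheduling pattern and the asymmetry arithmetic described in the abstract can be illustrated with a minimal sketch. It is not the pipeline's actual code: the names `DualStreamScheduler`, `asymmetry_ratio`, and `simulate` are hypothetical, the two models are replaced by plain callbacks, the clock is simulated from the frame index, and the slow stream is assumed to fire on the first frame. The real system triggers the slow stream from a separate daemon loop; here both streams are driven from one loop purely to make the timing easy to inspect.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


def asymmetry_ratio(fast_hz: float, slow_interval_s: float) -> float:
    """Number of fast-stream inferences per slow-stream inference."""
    return fast_hz * slow_interval_s


@dataclass
class DualStreamScheduler:
    """Hypothetical scheduler: every frame goes to the fast stream,
    and the slow stream fires once per `slow_interval_s` seconds."""
    fast_stream: Callable[[int], None]   # stand-in for the detector
    slow_stream: Callable[[int], None]   # stand-in for the vision-language model
    slow_interval_s: float = 30.0
    _last_slow_t: Optional[float] = field(default=None, init=False)

    def on_frame(self, frame_idx: int, t: float) -> None:
        # Fast (dorsal-like) path: consume every frame.
        self.fast_stream(frame_idx)
        # Slow (ventral-like) path: fire on the first frame (an assumption),
        # then whenever the interval has elapsed since the last trigger.
        due = (self._last_slow_t is None
               or t - self._last_slow_t >= self.slow_interval_s)
        if due:
            self._last_slow_t = t
            self.slow_stream(frame_idx)


def simulate(fps: int, duration_s: int,
             slow_interval_s: float = 30.0) -> Tuple[int, List[int]]:
    """Drive the scheduler with a simulated clock (t = frame_idx / fps).

    Returns the number of fast-stream calls and the list of frame
    indices at which the slow stream was triggered.
    """
    fast_frames: List[int] = []
    slow_frames: List[int] = []
    scheduler = DualStreamScheduler(
        fast_stream=fast_frames.append,
        slow_stream=slow_frames.append,
        slow_interval_s=slow_interval_s,
    )
    for frame_idx in range(fps * duration_s):
        scheduler.on_frame(frame_idx, frame_idx / fps)
    return len(fast_frames), slow_frames


if __name__ == "__main__":
    n_fast, slow_frames = simulate(fps=55, duration_s=120)
    print(n_fast, slow_frames)
    print(asymmetry_ratio(55, 30), asymmetry_ratio(60, 30))
```

Under these assumptions the ratio is simply the fast-stream rate multiplied by the slow-stream interval. A 30-second interval gives 1800:1 at 60 fps and 1650:1 at the measured 55 fps, so the 1800:1 figure is best read as corresponding to the upper end of the "55+ fps" range.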