SCOPE
A Blender-based simulation and benchmark for evaluating language-driven PTZ camera agents. Tested 19 SLM+VLM combinations across 536 tasks. Published at HRI '26.
PTZ cameras are everywhere in real deployments (security, broadcasting, robotics), but there is no good way to evaluate whether a language model can actually control one. SCOPE builds that benchmark. The system pairs a Small Language Model (planner) with a Vision-Language Model (perception) and evaluates them in a Blender simulation whose camera API is identical to real hardware. The 536 tasks span 8 categories, including counting, spatial reasoning, OCR, multi-step commands, and comparative relational analysis. The best configuration, Qwen3-30B-A3B + Moondream3, reached 73.8% accuracy. The interesting finding: mixture-of-experts architectures outperformed dense models across the board, which wasn't obvious going in.
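To make the planner/perception split concrete, here is a minimal sketch of such a control loop. All names (`PTZCamera`, `plan`, `perceive`, the motion limits) are illustrative assumptions, not SCOPE's actual API; the real system prompts an SLM like Qwen3 for planning and a VLM like Moondream for captions.

```python
from dataclasses import dataclass


@dataclass
class PTZCamera:
    """Toy PTZ interface; a simulator or real camera would expose the
    same pan/tilt/zoom calls, which is what makes sim-to-real transfer
    of the agent code plausible."""
    pan: float = 0.0    # degrees
    tilt: float = 0.0   # degrees
    zoom: float = 1.0   # magnification

    def move(self, pan: float, tilt: float, zoom: float) -> None:
        # Clamp to assumed hardware limits.
        self.pan = max(-170.0, min(170.0, pan))
        self.tilt = max(-30.0, min(90.0, tilt))
        self.zoom = max(1.0, min(20.0, zoom))


def perceive(camera: PTZCamera) -> str:
    """Stand-in for the VLM: caption the current view. A real system
    would render a frame from the simulator and query the model."""
    return "two people near the left edge of the frame"


def plan(task: str, caption: str) -> dict:
    """Stand-in for the SLM planner: map the task plus the latest
    caption to a camera action. A real planner would be prompted
    with the camera API and the perception output."""
    if "left" in caption:
        return {"pan": -30.0, "tilt": 0.0, "zoom": 2.0}
    return {"pan": 30.0, "tilt": 0.0, "zoom": 2.0}


def run_episode(task: str, camera: PTZCamera, max_steps: int = 3) -> PTZCamera:
    # Closed loop: perceive, plan, act, repeat.
    for _ in range(max_steps):
        caption = perceive(camera)
        action = plan(task, caption)
        camera.move(**action)
    return camera


cam = run_episode("count the people in the scene", PTZCamera())
```

The point of the sketch is the interface boundary: the planner never sees pixels, only the VLM's text, so either model can be swapped independently, which is what makes a 19-combination sweep tractable.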
- Python
- Blender
- Qwen3
- Moondream
- vLLM
- LLM-as-Judge