Local Agent Engine
Groove ships a full agentic runtime for local models. Not a chat wrapper -- a real agent loop with tool calling, file editing, command execution, and streaming. Any model that runs on your machine gets the same orchestration stack as Claude Code or Codex: multi-agent coordination, context rotation, journalist synthesis, adaptive routing, and token tracking.
Zero cloud tokens. Fully offline. Your code never leaves your machine.
How It Works
When you spawn an agent with the Local Models provider, Groove doesn't launch a CLI process. Instead, it runs an agent loop inside the daemon that:
- Sends your prompt to any OpenAI-compatible API (Ollama, llama-server, vLLM, or any custom endpoint)
- Parses the model's response -- text or tool calls
- Executes tools: reads files, writes code, runs commands, searches the codebase
- Feeds results back to the model and loops until the task is done
- Streams every token to the GUI in real-time
The agent loop plugs directly into all of Groove's coordination systems. Local agents appear in the agent tree, get introduced to the team, respect file locks, show up in the journalist's synthesis, track tokens in the dashboard, and auto-rotate when their context window fills up.
Getting Started
Step 1: Install Ollama
Groove uses Ollama as its default local inference backend. If you don't have it:
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.ai/install.sh | shYou don't need to start the Ollama server manually -- Groove auto-starts it when needed.
Step 2: Pick a Model
Open the GUI at localhost:31415 and click the Models tab in the left activity bar.
The Recommended tab shows models that fit your hardware, sorted by quality. Each card shows:
- Model name and tier -- heavy (deep reasoning), medium (general coding), light (quick tasks)
- Description -- what the model excels at
- Download size -- how much disk space it needs
- RAM needed -- matched against your system (green = fits, yellow = tight, red = too large)
- Headroom -- percentage of RAM left over for the OS and other apps
Click Pull on any model. Groove auto-starts the Ollama server if it's not running, downloads the model, and marks it as installed. The whole flow is one click.
Start with Qwen 2.5 Coder
Qwen 2.5 Coder is the best local coding model as of early 2026. The 7B version runs on any machine with 8 GB of RAM. All sizes have excellent tool calling support.
Step 3: Spawn a Local Agent
- Click Spawn Agent in the top bar
- Choose a role (e.g.,
fullstack,frontend, or type a custom role) - Select Local Models as the provider
- Pick your model from the dropdown -- only installed models appear
- Click Spawn
The agent starts, introduces itself, and waits for your instructions.
Step 4: Chat
Click the agent in the tree to open its panel. Type a message in the chat input and hit Enter.
While the agent is thinking, you'll see a pulsing dot indicator. When the response arrives, it renders as a single chat bubble. The agent maintains full conversation history -- you can have a back-and-forth dialogue just like with Claude Code.
When the agent uses tools (reading files, running commands, etc.), you'll see tool calls and results in the activity feed.
The Model Browser
The Models tab has three sections:
Recommended
Curated picks for your hardware. Groove detects your RAM, CPU, and GPU, then filters the catalog to show only models that will actually run on your machine. Sorted by quality -- the biggest model that fits your system is listed first.
Models you've already installed show a green Installed badge and "Ready" status. No need to pull them again -- they persist across Groove updates.
Installed
All models currently on your machine. Shows the model ID, quantization level, size, context window, and category. You can delete models you no longer need to free up disk space.
Search (HuggingFace)
Search the entire HuggingFace model library for GGUF files. Type a query like "qwen coder" or "deepseek" and browse results.
Click any result to expand it and see every GGUF variant -- different quantization levels with file sizes and RAM estimates. Each variant is color-coded:
- Green RAM -- fits your system comfortably
- Yellow RAM -- tight fit, will work but might be slow
- Red RAM + "too large" -- won't fit, download button is disabled
Click download to start pulling. Progress is shown in real-time at the top of the Models view.
Setting Up in Settings
The Settings page shows a Local Models card with your current status:
- Not set up -- shows a "Set Up Ollama" button that walks you through installation
- No models pulled -- shows "Pull Models" (opens the Ollama model manager) and "Models Tab" (takes you to the browser)
- Ready -- shows your installed model count with links to manage models
There's always a back button and a link to the Models tab, so you can't get stuck.
Tool Calling
Local agents have seven tools that mirror what Claude Code provides:
| Tool | What It Does |
|---|---|
read_file | Read file contents with line numbers, offset/limit support |
write_file | Create or overwrite files, auto-creates parent directories |
edit_file | Targeted string replacement in existing files |
run_command | Execute shell commands with timeout and output capture |
search_files | Find files by glob pattern (like src/**/*.ts) |
search_content | Search file contents by regex (like grep) |
list_directory | List files and directories with sizes |
All tools enforce security:
- Path validation -- no traversal outside the working directory
- Scope enforcement -- writes check the lock manager before proceeding. If another agent owns a file, the write is blocked.
- Command sandboxing -- shell commands run with timeouts and output limits
- Output truncation -- large outputs are capped to prevent context window blowup
Recommended Models
Models with native tool/function calling support work best:
| Model | Parameters | RAM Needed | Context | Best For |
|---|---|---|---|---|
| Qwen 2.5 Coder 7B | 7B | 8 GB | 32K | General coding, scripts, quick tasks |
| Qwen 2.5 Coder 14B | 14B | 16 GB | 32K | Complex features, debugging |
| Qwen 2.5 Coder 32B | 32B | 24 GB | 32K | Architecture-level work, rivals GPT-4o |
| DeepSeek R1 14B | 14B | 12 GB | 64K | Chain-of-thought debugging |
| Llama 3.1 8B | 8B | 8 GB | 128K | Large context, general coding |
| Codestral 25B | 25B | 18 GB | 32K | Multi-language, autocomplete |
| Gemma 4 26B | 26B | 16 GB | 32K | Strong reasoning per watt |
| Mistral 7B | 7B | 8 GB | 32K | Fast and efficient |
Interactive Chat
Local agents support full interactive chat. After spawning, you can:
- Send instructions -- type in the chat input and hit Enter
- Multi-turn conversation -- the agent maintains its full message history
- See thinking state -- pulsing dots show while the agent processes your message
- Watch tool calls -- file reads, writes, and commands appear in the activity feed
This is not one-shot. You can have a back-and-forth dialogue, ask follow-up questions, change direction, and iterate -- just like chatting with Claude Code.
Context Rotation for Local Models
Context rotation is even more valuable for local models than for cloud providers. Cloud models have 200K+ context windows. Local models typically have 4K to 128K. They fill up faster, and quality degrades sooner.
Groove's adaptive rotation system handles this automatically:
- The agent loop tracks token usage from every API response
- Context usage is calculated as
input_tokens / context_window - Groove's rotator checks every 15 seconds -- when usage hits the adaptive threshold, it triggers
- The journalist generates a handoff brief summarizing the agent's work
- The old session is killed, a fresh one spawns with the brief injected
- The new session picks up where the old one left off -- fresh context, peak quality
The adaptive threshold starts at 75% and learns from session quality. Good sessions push the threshold up (allow more context before rotation). Bad sessions pull it down (rotate sooner). The system converges on the optimal rotation point for each model.
Multi-Model Teams
Run different models for different agents on the same project:
Agent: planner → Qwen 2.5 Coder 32B (heavy reasoning)
Agent: frontend → Qwen 2.5 Coder 7B (fast, lightweight)
Agent: backend → DeepSeek R1 14B (debugging focus)
Agent: docs → Llama 3.1 8B (large context for docs)Or mix local and cloud:
Agent: planner → Claude Code Opus (deep planning)
Agent: frontend → Local: Qwen 7B (zero cost)
Agent: backend → Claude Code Sonnet (fast cloud)
Agent: QC → Local: Qwen 32B (free verification)Each agent works independently. Groove coordinates them regardless of whether they're running in the cloud or on your GPU.
Fully Offline
With local models and Ollama, the entire stack runs on your machine:
- Groove daemon at
localhost:31415 - Ollama inference at
localhost:11434 - Models stored in Ollama's model directory
- No network calls, no API keys, no data exfiltration
- Works on air-gapped machines and restricted networks
This makes Groove the first multi-agent orchestration system that can run completely offline with full agentic capabilities.
Using Any OpenAI-Compatible Endpoint
Groove works with any server that speaks the OpenAI /v1/chat/completions format:
- llama-server (llama.cpp) -- best performance, GPU offloading, multi-model
- vLLM -- high-throughput serving
- LM Studio -- desktop app with built-in server
- text-generation-webui -- popular web interface with API
As long as it serves /v1/chat/completions with tool/function calling support, Groove's agent loop can use it.
Limitations
- Model quality varies -- smaller models make more mistakes and may produce unreliable tool calls. Qwen 2.5 Coder 7B+ is the minimum recommended for agentic work.
- Speed depends on hardware -- GPU acceleration is strongly recommended. Apple Silicon Macs with unified memory work well. CPU-only inference is slow for models above 7B.
- No hot-swap -- changing models requires killing and respawning the agent.
- Context windows are smaller -- more frequent rotations than cloud models. Groove handles this automatically, but very large tasks may need more rotations.
Next Steps
- Adaptive Model Routing -- how Groove picks the right model for each task
- Context Rotation -- deep dive into the rotation system
- The Journalist -- how project context is maintained across sessions
