Local Agent Engine

Groove ships a full agentic runtime for local models. Not a chat wrapper -- a real agent loop with tool calling, file editing, command execution, and streaming. Any model that runs on your machine gets the same orchestration stack as Claude Code or Codex: multi-agent coordination, context rotation, journalist synthesis, adaptive routing, and token tracking.

Zero cloud tokens. Fully offline. Your code never leaves your machine.

How It Works

When you spawn an agent with the Local Models provider, Groove doesn't launch a CLI process. Instead, it runs an agent loop inside the daemon that:

Sends your prompt to any OpenAI-compatible API (Ollama, llama-server, vLLM, or any custom endpoint)
Parses the model's response -- text or tool calls
Executes tools: reads files, writes code, runs commands, searches the codebase
Feeds results back to the model and loops until the task is done
Streams every token to the GUI in real-time

The agent loop plugs directly into all of Groove's coordination systems. Local agents appear in the agent tree, get introduced to the team, respect file locks, show up in the journalist's synthesis, track tokens in the dashboard, and auto-rotate when their context window fills up.

Getting Started

Step 1: Install Ollama

Groove uses Ollama as its default local inference backend. If you don't have it:

bash

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.ai/install.sh | sh

You don't need to start the Ollama server manually -- Groove auto-starts it when needed.

Step 2: Pick a Model

Open the GUI at localhost:31415 and click the Models tab in the left activity bar.

The Recommended tab shows models that fit your hardware, sorted by quality. Each card shows:

Model name and tier -- heavy (deep reasoning), medium (general coding), light (quick tasks)
Description -- what the model excels at
Download size -- how much disk space it needs
RAM needed -- matched against your system (green = fits, yellow = tight, red = too large)
Headroom -- percentage of RAM left over for the OS and other apps

Click Pull on any model. Groove auto-starts the Ollama server if it's not running, downloads the model, and marks it as installed. The whole flow is one click.

Start with Qwen 2.5 Coder

Qwen 2.5 Coder is the best local coding model as of early 2026. The 7B version runs on any machine with 8 GB of RAM. All sizes have excellent tool calling support.

Step 3: Spawn a Local Agent

Click Spawn Agent in the top bar
Choose a role (e.g., fullstack, frontend, or type a custom role)
Select Local Models as the provider
Pick your model from the dropdown -- only installed models appear
Click Spawn

The agent starts, introduces itself, and waits for your instructions.

Step 4: Chat

Click the agent in the tree to open its panel. Type a message in the chat input and hit Enter.

While the agent is thinking, you'll see a pulsing dot indicator. When the response arrives, it renders as a single chat bubble. The agent maintains full conversation history -- you can have a back-and-forth dialogue just like with Claude Code.

When the agent uses tools (reading files, running commands, etc.), you'll see tool calls and results in the activity feed.

The Model Browser

The Models tab has three sections:

Installed

All models currently on your machine. Shows the model ID, quantization level, size, context window, and category. You can delete models you no longer need to free up disk space.

Search (HuggingFace)

Search the entire HuggingFace model library for GGUF files. Type a query like "qwen coder" or "deepseek" and browse results.

Click any result to expand it and see every GGUF variant -- different quantization levels with file sizes and RAM estimates. Each variant is color-coded:

Green RAM -- fits your system comfortably
Yellow RAM -- tight fit, will work but might be slow
Red RAM + "too large" -- won't fit, download button is disabled

Click download to start pulling. Progress is shown in real-time at the top of the Models view.

Setting Up in Settings

The Settings page shows a Local Models card with your current status:

Not set up -- shows a "Set Up Ollama" button that walks you through installation
No models pulled -- shows "Pull Models" (opens the Ollama model manager) and "Models Tab" (takes you to the browser)
Ready -- shows your installed model count with links to manage models

There's always a back button and a link to the Models tab, so you can't get stuck.

Tool Calling

Local agents have seven tools that mirror what Claude Code provides:

Tool	What It Does
`read_file`	Read file contents with line numbers, offset/limit support
`write_file`	Create or overwrite files, auto-creates parent directories
`edit_file`	Targeted string replacement in existing files
`run_command`	Execute shell commands with timeout and output capture
`search_files`	Find files by glob pattern (like `src/*/.ts`)
`search_content`	Search file contents by regex (like grep)
`list_directory`	List files and directories with sizes

All tools enforce security:

Path validation -- no traversal outside the working directory
Scope enforcement -- writes check the lock manager before proceeding. If another agent owns a file, the write is blocked.
Command sandboxing -- shell commands run with timeouts and output limits
Output truncation -- large outputs are capped to prevent context window blowup

Recommended Models

Models with native tool/function calling support work best:

Model	Parameters	RAM Needed	Context	Best For
Qwen 2.5 Coder 7B	7B	8 GB	32K	General coding, scripts, quick tasks
Qwen 2.5 Coder 14B	14B	16 GB	32K	Complex features, debugging
Qwen 2.5 Coder 32B	32B	24 GB	32K	Architecture-level work, rivals GPT-4o
DeepSeek R1 14B	14B	12 GB	64K	Chain-of-thought debugging
Llama 3.1 8B	8B	8 GB	128K	Large context, general coding
Codestral 25B	25B	18 GB	32K	Multi-language, autocomplete
Gemma 4 26B	26B	16 GB	32K	Strong reasoning per watt
Mistral 7B	7B	8 GB	32K	Fast and efficient

Interactive Chat

Local agents support full interactive chat. After spawning, you can:

Send instructions -- type in the chat input and hit Enter
Multi-turn conversation -- the agent maintains its full message history
See thinking state -- pulsing dots show while the agent processes your message
Watch tool calls -- file reads, writes, and commands appear in the activity feed

This is not one-shot. You can have a back-and-forth dialogue, ask follow-up questions, change direction, and iterate -- just like chatting with Claude Code.

Context Rotation for Local Models

Context rotation is even more valuable for local models than for cloud providers. Cloud models have 200K+ context windows. Local models typically have 4K to 128K. They fill up faster, and quality degrades sooner.

Groove's adaptive rotation system handles this automatically:

The agent loop tracks token usage from every API response
Context usage is calculated as input_tokens / context_window
Groove's rotator checks every 15 seconds -- when usage hits the adaptive threshold, it triggers
The journalist generates a handoff brief summarizing the agent's work
The old session is killed, a fresh one spawns with the brief injected
The new session picks up where the old one left off -- fresh context, peak quality

The adaptive threshold starts at 75% and learns from session quality. Good sessions push the threshold up (allow more context before rotation). Bad sessions pull it down (rotate sooner). The system converges on the optimal rotation point for each model.

Multi-Model Teams

Run different models for different agents on the same project:

Agent: planner     → Qwen 2.5 Coder 32B (heavy reasoning)
Agent: frontend    → Qwen 2.5 Coder 7B  (fast, lightweight)
Agent: backend     → DeepSeek R1 14B     (debugging focus)
Agent: docs        → Llama 3.1 8B        (large context for docs)

Or mix local and cloud:

Agent: planner     → Claude Code Opus    (deep planning)
Agent: frontend    → Local: Qwen 7B      (zero cost)
Agent: backend     → Claude Code Sonnet  (fast cloud)
Agent: QC          → Local: Qwen 32B     (free verification)

Each agent works independently. Groove coordinates them regardless of whether they're running in the cloud or on your GPU.

Fully Offline

With local models and Ollama, the entire stack runs on your machine:

Groove daemon at localhost:31415
Ollama inference at localhost:11434
Models stored in Ollama's model directory
No network calls, no API keys, no data exfiltration
Works on air-gapped machines and restricted networks

This makes Groove the first multi-agent orchestration system that can run completely offline with full agentic capabilities.

Using Any OpenAI-Compatible Endpoint

Groove works with any server that speaks the OpenAI /v1/chat/completions format:

llama-server (llama.cpp) -- best performance, GPU offloading, multi-model
vLLM -- high-throughput serving
LM Studio -- desktop app with built-in server
text-generation-webui -- popular web interface with API

As long as it serves /v1/chat/completions with tool/function calling support, Groove's agent loop can use it.

Limitations

Model quality varies -- smaller models make more mistakes and may produce unreliable tool calls. Qwen 2.5 Coder 7B+ is the minimum recommended for agentic work.
Speed depends on hardware -- GPU acceleration is strongly recommended. Apple Silicon Macs with unified memory work well. CPU-only inference is slow for models above 7B.
No hot-swap -- changing models requires killing and respawning the agent.
Context windows are smaller -- more frequent rotations than cloud models. Groove handles this automatically, but very large tasks may need more rotations.

Next Steps

Adaptive Model Routing -- how Groove picks the right model for each task
Context Rotation -- deep dive into the rotation system
The Journalist -- how project context is maintained across sessions

Local Agent Engine ​

How It Works ​

Getting Started ​

Step 1: Install Ollama ​

Step 2: Pick a Model ​

Step 3: Spawn a Local Agent ​

Step 4: Chat ​

The Model Browser ​

Recommended ​

Installed ​

Search (HuggingFace) ​

Setting Up in Settings ​

Tool Calling ​

Recommended Models ​

Interactive Chat ​

Context Rotation for Local Models ​

Multi-Model Teams ​

Fully Offline ​

Using Any OpenAI-Compatible Endpoint ​

Limitations ​

Next Steps ​