Setting Up a Local AI Coding Stack with Continue.dev + Ollama (2026)

GitHub Copilot Pro costs $10/month. Copilot Pro+ costs $39. The code around your cursor gets sent to Microsoft’s servers with every completion request, and your proprietary business logic goes with it. For solo developers that might feel acceptable. For anyone working under an NDA, at a company with a “no cloud AI” policy, or just building something they don’t want pre-trained into someone else’s next model, it isn’t.

The good news: the open-source alternative has caught up. Continue.dev (2.5 million VS Code installs, 32,000+ GitHub stars) paired with Ollama gives you inline autocomplete, a full chat panel, and an agent mode that can read and edit across files — all running locally, all private after the initial model download.

The setup is not complicated, but most tutorials get one thing wrong: they tell you to use a single model for everything. That’s how you end up with either laggy autocomplete or underpowered chat. The working setup uses two models simultaneously.


The dual-model principle

Autocomplete and chat have different requirements. Autocomplete has to respond in under 500ms, or you’ll have typed past the suggestion before it appears. That rules out anything bigger than about 3B parameters on most consumer hardware. Chat and code generation can take 3–5 seconds for a complete response, and you’ll accept that tradeoff for quality.

The setup:

  • Autocomplete model: qwen2.5-coder:1.5b — small enough to generate completions faster than you can read them on any modern GPU, and still usable on CPU
  • Chat/edit model: qwen2.5-coder:7b (8GB VRAM) or qwen2.5-coder:32b (24GB VRAM) — real reasoning power for generating functions, explaining code, writing tests

Continue.dev assigns these via roles in config.yaml. The small model never interrupts your thinking; the big model handles everything you explicitly ask for.


Hardware requirements

| Setup | GPU / RAM | Autocomplete model | Chat model | Notes |
|---|---|---|---|---|
| CPU-only laptop | 16GB RAM | qwen2.5-coder:1.5b | qwen2.5-coder:7b | Chat is slow; autocomplete still responsive |
| 8GB VRAM | RTX 4060 / 3060 | qwen2.5-coder:1.5b | qwen2.5-coder:7b | Both on GPU; comfortable for daily use |
| 12–16GB VRAM | RTX 4060 Ti 16GB / 3080 | qwen2.5-coder:1.5b | qwen2.5-coder:7b | Comfortable; room for 14B models |
| 24GB VRAM | RTX 3090 / 4090 | qwen2.5-coder:1.5b | qwen2.5-coder:32b | Full power; 35 tok/s on RTX 4090 Q4 |
| Apple Silicon | M2 Pro 18GB+ / M3 Pro 36GB+ | qwen2.5-coder:1.5b | qwen2.5-coder:7b or 32b | Unified memory; M3 Pro 36GB handles 32B |
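
The VRAM tiers in the table follow from simple arithmetic on model size. Here is a rough sketch of that arithmetic. It assumes about 4.5 bits per weight for a Q4-style quantization and a flat 2GB of headroom for the KV cache and runtime overhead. Both numbers are rule-of-thumb assumptions, not values published by Ollama, and the parameter counts are the nominal ones from the model tags.

```python
def weight_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes).

    bits_per_weight=4.5 is an assumed average for Q4-style quantization.
    """
    return params_billion * bits_per_weight / 8


def fits(vram_gb: float, models_billion: list[float], headroom_gb: float = 2.0) -> bool:
    """True if all models plus an assumed headroom fit in VRAM at the same time."""
    needed = sum(weight_gb(p) for p in models_billion) + headroom_gb
    return needed <= vram_gb


# Chat model + autocomplete model loaded together, as in the dual-model setup.
print(fits(8, [7, 1.5]))    # 7B + 1.5B on an 8GB card
print(fits(16, [32, 1.5]))  # 32B + 1.5B on a 16GB card
print(fits(24, [32, 1.5]))  # 32B + 1.5B on a 24GB card
```

Under these assumptions the 7B pair needs roughly 6.8GB and the 32B pair roughly 20.8GB, which lines up with the 8GB and 24GB rows above. Treat the result as a sanity check before a download, not a guarantee; longer context lengths raise the real footprint.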

If you’re on a laptop without a dedicated GPU and want to test the 32B model before committing to a GPU purchase, RunPod community instances with RTX 4090s rent for around $0.34/hr — enough to evaluate whether 32B quality is worth the hardware investment for your workflow.


Model selection: what the benchmarks actually say

qwen2.5-coder:7b — On HumanEval (the standard Python code-generation benchmark), Qwen2.5-Coder 7B scores 88.4% pass@1, beating GPT-4’s 87.1%. For a model that fits on an 8GB gaming GPU, that’s not a typo. This is the default recommendation for most setups: capable enough for real coding tasks, VRAM footprint small enough to leave the OS usable.

qwen2.5-coder:32b — On LiveCodeBench, which tests code generation against problems that post-date the model’s training cutoff (so memorization can’t inflate scores), the 32B model scores 37.2% — compared to GPT-4o at 29.2% and Claude 3.5 Sonnet at 32.1%. At 35 tokens/sec on an RTX 4090 with Q4 quantization, it’s responsive enough for conversational coding. The catch: you need 24GB VRAM.

deepseek-r1:32b — Not a coding-specialist model, but useful for the “explain what this function does and what edge cases it misses” tasks that pure coding models handle less well. Same VRAM requirement as qwen2.5-coder:32b. Worth pulling as a second chat model if you have the storage.

qwen2.5-coder:1.5b — The dedicated autocomplete model. Skip trying to use this for chat — the 1.5B parameter count is too constrained for meaningful dialogue. Its only job is to finish your current line or suggest the next 2–3 lines, and it does that job faster than the cursor blink rate.
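
To see what a throughput figure means for chat latency, divide the response length by tokens per second. A small sketch using the 35 tok/s figure quoted above; the response lengths are illustrative assumptions, and the calculation ignores the time spent processing the prompt.

```python
def response_seconds(tokens: int, tokens_per_second: float) -> float:
    """Time to generate a response of the given length, generation only."""
    return tokens / tokens_per_second


def tokens_within(seconds: float, tokens_per_second: float) -> int:
    """How many tokens can be generated inside a time budget."""
    return int(seconds * tokens_per_second)


RATE = 35.0  # tok/s, the RTX 4090 Q4 figure quoted for the 32B model

print(tokens_within(3, RATE), tokens_within(5, RATE))  # tokens in a 3s and 5s wait
print(round(response_seconds(500, RATE), 1))           # seconds for a 500-token answer
```

At 35 tok/s, a 3–5 second wait covers about 105–175 tokens, which is a short function. A 500-token answer takes a little over 14 seconds, so it helps that the chat panel streams output as it is generated.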


Installation

Step 1: Install Ollama

Download from ollama.com. On Windows, it installs as a system service. On macOS:

brew install ollama

On Linux:

curl -fsSL https://ollama.com/install.sh | sh

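Verify the install, then check that the server is answering. The curl call should print “Ollama is running”; if the connection is refused, start the server with ollama serve and try again: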

ollama --version
curl http://localhost:11434

Ollama v0.24.0 (released May 14, 2026) is the current stable release as of this writing.

Step 2: Pull your models

# Autocomplete model — fast, small
ollama pull qwen2.5-coder:1.5b

# Chat model — pick one based on your VRAM
ollama pull qwen2.5-coder:7b      # 8GB VRAM
# or
ollama pull qwen2.5-coder:32b     # 24GB VRAM

# Optional reasoning model for complex explanations
ollama pull deepseek-r1:32b

Verify what’s installed:

ollama list
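
If you’d rather script that check, Ollama’s HTTP API lists installed models at /api/tags. A minimal sketch using only the Python standard library. The REQUIRED list and the helper names are hypothetical, chosen for this example, and the network call is kept in its own function so the comparison logic can be tested without a running server.

```python
import json
from urllib.request import urlopen

# Model tags this guide's 8GB setup expects (adjust for the 32B variant).
REQUIRED = ["qwen2.5-coder:1.5b", "qwen2.5-coder:7b"]


def installed_models(tags_payload: dict) -> set[str]:
    """Extract model names from a parsed /api/tags response."""
    return {m["name"] for m in tags_payload.get("models", [])}


def missing_models(tags_payload: dict, required: list[str] = REQUIRED) -> list[str]:
    """Return the required tags that are not installed, in the order given."""
    have = installed_models(tags_payload)
    return [name for name in required if name not in have]


def fetch_tags(base_url: str = "http://localhost:11434", timeout: float = 5.0) -> dict:
    """Query a running Ollama server; raises URLError if it is unreachable."""
    with urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)
```

With Ollama running, missing_models(fetch_tags()) should return an empty list once both pulls have finished.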

Step 3: Install Continue.dev in VS Code

Open VS Code → Extensions panel → search “Continue” → Install. The extension’s identifier is continue.continue.

For JetBrains IDEs: the JetBrains plugin exists (plugin ID: 22707) but note that as of 2026 the JetBrains plugin is community-maintained. The Continue team recommends the Continue CLI as an alternative for non-VS Code editors.


Configuring config.yaml

After installing Continue.dev, open the command palette (Ctrl+Shift+P) and run Continue: Open Config. This opens ~/.continue/config.yaml.

Here’s a complete working configuration for the dual-model setup:

name: Local AI Coding Stack
version: 0.0.1
schema: v1

models:
  # --- Chat and code generation ---
  - name: Qwen2.5-Coder 7B (Chat)
    provider: ollama
    model: qwen2.5-coder:7b
    apiBase: http://localhost:11434
    roles:
      - chat
      - edit
      - apply
    defaultCompletionOptions:
      contextLength: 8192
      temperature: 0.1

  # --- Inline autocomplete ---
  - name: Qwen2.5-Coder 1.5B (Autocomplete)
    provider: ollama
    model: qwen2.5-coder:1.5b
    apiBase: http://localhost:11434
    roles:
      - autocomplete
    autocompleteOptions:
      disable: false
      maxPromptTokens: 1024
      debounceDelay: 250
      modelTimeout: 150
      maxSuffixPercentage: 0.2
      prefixPercentage: 0.3
      onlyMyCode: true

If you have 24GB VRAM, swap qwen2.5-coder:7b for qwen2.5-coder:32b in the chat model block. Everything else stays the same.
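
If you switch between machines, one way to keep the two variants in sync is to generate the file from a template. The sketch below is an illustration, not part of Continue.dev: render_config is a hypothetical helper, the display names are made up, and the autocompleteOptions block is left out for brevity. It only returns text; writing it to ~/.continue/config.yaml is up to you.

```python
from textwrap import dedent

AUTOCOMPLETE_MODEL = "qwen2.5-coder:1.5b"


def render_config(chat_model: str = "qwen2.5-coder:7b") -> str:
    """Return config.yaml text with the given Ollama tag in the chat block."""
    return dedent(f"""\
        name: Local AI Coding Stack
        version: 0.0.1
        schema: v1

        models:
          - name: Chat ({chat_model})
            provider: ollama
            model: {chat_model}
            apiBase: http://localhost:11434
            roles:
              - chat
              - edit
              - apply
            defaultCompletionOptions:
              contextLength: 8192
              temperature: 0.1

          - name: Autocomplete ({AUTOCOMPLETE_MODEL})
            provider: ollama
            model: {AUTOCOMPLETE_MODEL}
            apiBase: http://localhost:11434
            roles:
              - autocomplete
        """)


# Only the chat model tag changes between the 8GB and 24GB setups.
print(render_config("qwen2.5-coder:32b"))
```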

Save the file. Continue.dev picks up config changes immediately without restarting VS Code.

Using the extension

  • Chat: Press Ctrl+L (or Cmd+L on Mac) to open the chat panel. Highlight code first to include it as context.
  • Edit: Highlight code, press Ctrl+I, type an instruction (“add docstring”, “add error handling for empty list”).
  • Autocomplete: Just type. Suggestions appear automatically after the debounce delay. Tab to accept, Esc to dismiss.

Tuning autocomplete performance

The three numbers that matter most in autocompleteOptions:

debounceDelay (250ms in the config above): How long Continue.dev waits after you stop typing before sending a completion request. On fast hardware you can drop this to 150ms. On slower machines, or if you find a constant stream of suggestions distracting, raise it to 400ms.

modelTimeout (default 150ms): Continue.dev cancels a completion request and tries again if Ollama hasn’t responded within this window. If you’re seeing frequent missed completions, raise to 300ms. If you’re on a fast GPU and completions are arriving late, lower to 100ms.

maxPromptTokens (default 1024): The context window sent to the autocomplete model. More tokens = better awareness of surrounding code = more accurate completions. But larger prompts slow the 1.5B model down. 1024 is the right balance for most codebases.

If autocomplete feels too aggressive, triggering on every pause, raise debounceDelay first. Note that onlyMyCode: true (already in the config above) does something different: per Continue’s config reference, it limits the context gathered for completions to code inside your repository, so library and dependency sources stay out of the prompt.
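
Rather than guessing at modelTimeout, you can time the autocomplete model on your own hardware. The sketch below posts a short prompt to Ollama’s /api/generate endpoint and suggests a timeout from the measured latencies. The 90th-percentile rule, the rounding to 50ms steps, and the function names are assumptions made for this example, not Continue.dev behavior. The timing also covers a full 16-token response, so read it as a rough upper bound.

```python
import json
import math
import time
from urllib.request import Request, urlopen


def measure_latency_ms(model: str = "qwen2.5-coder:1.5b",
                       prompt: str = "def add(a, b):\n    return ",
                       base_url: str = "http://localhost:11434",
                       max_tokens: int = 16) -> float:
    """Time one non-streaming completion against a running Ollama server."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": max_tokens},
    }).encode()
    req = Request(f"{base_url}/api/generate", data=body,
                  headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urlopen(req, timeout=30) as resp:
        resp.read()
    return (time.perf_counter() - start) * 1000


def suggest_timeout_ms(latencies_ms: list[float], percentile: int = 90,
                       step: int = 50) -> int:
    """Pick the given percentile of the samples, rounded up to the next step."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    idx = math.ceil(percentile * len(ordered) / 100) - 1
    return math.ceil(ordered[idx] / step) * step
```

A typical run would collect ten or so samples with measure_latency_ms() after a warm-up request (the first call includes model load time), then pass the list to suggest_timeout_ms().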


Agent mode

Continue.dev’s agent mode (triggered via the chat panel with @codebase or file references) lets the model read your project structure, open files, and propose multi-file edits. With qwen2.5-coder:7b or larger, it handles tasks like “add a unit test for every public function in this file” or “find all places where we call this deprecated API and update them.”

To enable tool use in agent mode, add capabilities: [tool_use] to your chat model block:

  - name: Qwen2.5-Coder 7B (Chat)
    provider: ollama
    model: qwen2.5-coder:7b
    apiBase: http://localhost:11434
    roles:
      - chat
      - edit
      - apply
    capabilities:
      - tool_use

Agent mode works well for single-file tasks and small refactors. For large cross-repository changes, expect it to lose track of context. That is a consequence of the 8k contextLength set for the 7B model in the config above, not of Continue.dev itself. The 32B model with a 32k context handles more ambitious tasks.
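
To get a feel for how quickly that window fills up, here is a back-of-the-envelope estimate. It assumes roughly 4 characters per token, a common rule of thumb that varies by language and tokenizer, and reserves 1,024 tokens for the model’s reply. Both values are assumptions for illustration.

```python
import math


def estimate_tokens(num_chars: int, chars_per_token: float = 4.0) -> int:
    """Rough token count from character count (assumed 4 chars per token)."""
    return math.ceil(num_chars / chars_per_token)


def fits_context(num_chars: int, context_length: int = 8192,
                 reply_reserve: int = 1024) -> bool:
    """True if the text fits in the window with room left for the reply."""
    return estimate_tokens(num_chars) <= context_length - reply_reserve


# A hypothetical 100,000-character set of files attached to one request.
print(fits_context(100_000))                        # 8k window
print(fits_context(100_000, context_length=32768))  # 32k window
```

Under these assumptions an 8k window holds about 28,000 characters of source, and a 32k window about 127,000. A handful of medium-sized files is enough to overflow the smaller one.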


Honest take

For day-to-day coding — inline completions, asking a model to write a function you’ve described, reviewing a diff — this stack is genuinely competitive with Copilot Pro. Qwen2.5-Coder 7B’s 88.4% HumanEval score is not a rounding error. It handles real Python, TypeScript, Go, and Rust tasks well.

Where it falls short: complex multi-file refactors, long chains of dependent edits, and anything requiring knowledge newer than the model’s training cutoff. The 32B model closes most of those gaps but requires hardware most home setups don’t have yet.

The autocomplete latency is the real differentiator compared to a year ago. With a 1.5B model on a GPU, completions appear faster than Copilot’s cloud round-trip. On CPU-only the 1.5B is slower but still usable; the 7B chat model on CPU is where you’ll notice the wait.

If your use case is primarily backend services or data engineering in Python or TypeScript, and you have at minimum an 8GB VRAM GPU, this stack replaces Copilot without meaningful productivity loss. If you’re working across unfamiliar codebases where Copilot’s training breadth matters, the 32B model is where parity lives.

For comparisons to other local inference options, see our vLLM vs Ollama deep dive and the best local AI models by VRAM tier roundup. If you’re still deciding what GPU to buy for this setup, the GPU buying guide covers the current options at each price point. For a full feature review of Continue.dev including team plan options and JetBrains support, see the Continue.dev review on AICoderScope.


Last updated May 16, 2026. Model benchmarks and pricing change frequently; verify current figures before making hardware decisions.

