Local RAG in 2026: Build a Private Document AI That Never Leaves Your Machine

Tags: rag, local-ai, ollama, open-webui, privacy, embeddings, tutorial

RAG (Retrieval-Augmented Generation) is the difference between asking your LLM questions about its training data and asking it questions about your documents. With a local setup, every PDF, contract, manual, or research paper you upload stays on your hardware — no API call to OpenAI, no document upload to a third-party server, no data retention policy to read.

Three paths are covered here: no-code with Open WebUI, no-code with AnythingLLM, and a Python pipeline for developers who want full control. All three run entirely offline once set up.

How RAG actually works

When a cloud service answers questions about a document, one of two things happens: either the whole document is stuffed into the context window (expensive, limited by window size), or it’s chunked, embedded into a vector database, and the most relevant chunks are retrieved on each query. The second approach is RAG.

Local RAG runs every component on your machine:

  1. Ingest: Your document gets split into fixed-size chunks (typically 512 tokens)
  2. Embed: Each chunk is converted to a vector by an embedding model
  3. Store: Vectors are saved to a local vector database
  4. Retrieve: Your question is embedded and matched against the stored vectors, and the top-k most similar chunks are selected
  5. Answer: The LLM answers using only the retrieved context injected into its prompt
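The five steps above can be sketched in plain Python. This is a toy illustration, not how any of the tools below implement it: the `embed` function is a bag-of-words stand-in for a real embedding model, the "vector database" is a list in memory, chunks are measured in words rather than tokens, and all function names are made up for this example.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def ingest(document, chunk_words=8):
    """Step 1: split into fixed-size chunks (words here, tokens in a real pipeline)."""
    words = document.split()
    return [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]

def build_store(chunks):
    """Steps 2-3: embed each chunk and keep (vector, text) pairs in memory."""
    return [(embed(c), c) for c in chunks]

def retrieve(store, question, k=2):
    """Step 4: embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(store, key=lambda item: cosine(q, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]

def build_prompt(question, context_chunks):
    """Step 5: inject the retrieved context into the prompt sent to the LLM."""
    context = "\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

document = (
    "The warranty covers parts and labor for two years. "
    "Returns are accepted within thirty days of purchase. "
    "Shipping is free on orders above fifty dollars."
)
store = build_store(ingest(document))
top_chunks = retrieve(store, "How long is the warranty?", k=1)
prompt = build_prompt("How long is the warranty?", top_chunks)
```

A real pipeline swaps `embed` for a neural embedding model and the list for a vector database, but the control flow stays the same.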

The privacy implication: your documents never leave your machine. Embedding happens locally, retrieval happens locally, and the LLM runs locally. Compare this against what common tools actually phone home — the gap is significant.

Pick your embedding model first

The embedding model determines retrieval quality and is the first decision to make. These three run via Ollama, and the top option runs fine on CPU with no GPU required:

| Model | Size | Params | Context | MTEB score | Best for |
| --- | --- | --- | --- | --- | --- |
| nomic-embed-text v1.5 | 274 MB | 137M | 8,192 tokens | 62.39 | General use, CPU-only machines |
| mxbai-embed-large | 670 MB | 334M | 512 tokens | 64.68 | Higher accuracy, short document chunks |
| snowflake-arctic-embed2 | ~600 MB | 303M | 8,192 tokens | Competitive MTEB-R | Multilingual documents |

For context on those MTEB scores: nomic-embed-text at 62.39 matches OpenAI’s text-embedding-3-small (62.3). mxbai-embed-large at 64.68 matches OpenAI’s text-embedding-3-large (64.6). Both run locally at zero marginal cost.

The mxbai-embed-large caveat: its 512-token context window means any chunk longer than roughly 380 words gets truncated. If your documents have dense, long paragraphs, nomic-embed-text’s 8,192-token context handles them cleanly. mxbai-embed-large wins on accuracy for short, well-segmented content.
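A quick way to spot chunks at risk of truncation is to estimate token counts from word counts. The sketch below assumes the ratio implied by the figure above (512 tokens for roughly 380 words); real tokenizers vary by language and content, so treat the result as a rough screen and not an exact count. The function names are illustrative.

```python
# Ratio implied by "512 tokens is roughly 380 words"; an approximation only.
TOKENS_PER_WORD = 512 / 380

def estimated_tokens(text):
    """Estimate the token count of a text from its whitespace-separated words."""
    return round(len(text.split()) * TOKENS_PER_WORD)

def fits_context(text, max_tokens):
    """True if the estimated token count fits the embedding model's context."""
    return estimated_tokens(text) <= max_tokens

short_chunk = "word " * 300   # about 300 words
long_chunk = "word " * 600    # about 600 words

fits_mxbai = fits_context(long_chunk, 512)     # mxbai-embed-large context
fits_nomic = fits_context(long_chunk, 8192)    # nomic-embed-text context
```

Here the 600-word chunk is estimated to exceed a 512-token window while fitting comfortably in an 8,192-token one.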

Pull whichever you’re starting with:

ollama pull nomic-embed-text
# or for higher accuracy:
ollama pull mxbai-embed-large

Embedding models are separate from chat models in Ollama — you need both pulled before any RAG pipeline works.
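To see the embedding model working on its own, independent of any chat model, you can call Ollama's HTTP API directly. This is a minimal sketch that assumes Ollama is running at its default local address and that the `/api/embed` endpoint is available in your installed version; the helper names are made up for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local address

def build_embed_request(texts, model="nomic-embed-text"):
    """Build the JSON body for Ollama's /api/embed endpoint."""
    return {"model": model, "input": list(texts)}

def embed_texts(texts, model="nomic-embed-text", base_url=OLLAMA_URL):
    """Send texts to a running Ollama server and return one vector per text."""
    body = json.dumps(build_embed_request(texts, model)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/embed",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read())["embeddings"]
```

With the server running, `vectors = embed_texts(["first chunk", "second chunk"])` should return two vectors. If the call fails with a model-not-found error, the embedding model has not been pulled yet.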

Path 1: Open WebUI — zero config, browser-based

If you already have Open WebUI running with Ollama, RAG is a few settings changes away. If you haven’t set it up yet, the full setup walkthrough is at /blog/open-webui-multi-user-auth-family-setup-2026/.

Step 1 — Configure the embedding model:

Admin Panel → Settings → Documents:

  • Embedding Model Engine: Ollama
  • Embedding Model: nomic-embed-text
  • Chunk Size: 512
  • Chunk Overlap: 64
  • Hybrid Search: toggle on (this blends vector similarity with keyword matching, improving recall for specific terms like product names or version numbers)
  • Save
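To make the idea behind hybrid search concrete, here is a minimal sketch of blending a vector similarity score with a keyword match score. It is not Open WebUI's actual implementation: the equal weighting, the keyword scoring rule, and the similarity values are all assumptions chosen for illustration.

```python
import re

def tokenize(text):
    """Lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def keyword_score(query, chunk):
    """Fraction of query terms that appear verbatim in the chunk."""
    query_terms = set(tokenize(query))
    chunk_terms = set(tokenize(chunk))
    return len(query_terms & chunk_terms) / len(query_terms) if query_terms else 0.0

def hybrid_score(vector_similarity, query, chunk, alpha=0.5):
    """Blend: alpha weights the vector score, (1 - alpha) the keyword score."""
    return alpha * vector_similarity + (1 - alpha) * keyword_score(query, chunk)

query = "firmware X200 reset"
candidates = [
    # (chunk text, hypothetical vector similarity from the embedding model)
    ("Resetting the device restores factory defaults.", 0.80),
    ("X200 firmware reset requires holding the power button.", 0.70),
]
ranked = sorted(candidates, key=lambda c: hybrid_score(c[1], query, c[0]), reverse=True)
```

On vector similarity alone the first chunk would rank higher; once exact matches on "X200" and "firmware" are counted, the second chunk moves to the top, which is the behavior you want for product names and version numbers.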

Step 2 — Fix Ollama’s default context length (critical):

Ollama defaults to a 2,048-token context window and silently truncates any prompt that exceeds it, so retrieved chunks can be dropped without an error. For RAG to work well, you need at least 8,192.

Admin Panel → Models → select your chat model → Advanced Parameters → set num_ctx to 8192. For long documents with many retrieved chunks, push this to 16384.
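The arithmetic behind that setting can be worked out explicitly. The sketch below is a rough budget, assuming 200 tokens for instructions plus the question and 512 tokens reserved for the answer; both numbers are illustrative assumptions, and chat history or a long system prompt would add to them.

```python
def required_context(k, chunk_tokens, prompt_overhead=200, answer_reserve=512):
    """Tokens needed for k chunks plus instructions, question, and the answer."""
    return k * chunk_tokens + prompt_overhead + answer_reserve

def smallest_num_ctx(needed, options=(2048, 4096, 8192, 16384, 32768)):
    """Pick the first window size in options that covers the need."""
    for size in options:
        if size >= needed:
            return size
    raise ValueError("needed context exceeds the largest option")

# Five 512-token chunks: 5 * 512 + 200 + 512 = 3272 tokens
needed_k5 = required_context(k=5, chunk_tokens=512)
# Ten 512-token chunks: 10 * 512 + 200 + 512 = 5832 tokens
needed_k10 = required_context(k=10, chunk_tokens=512)

window_k5 = smallest_num_ctx(needed_k5)
window_k10 = smallest_num_ctx(needed_k10)
```

Even the five-chunk case overflows a 2,048-token window, and the ten-chunk case already needs 8,192 before any conversation history is counted.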

Step 3 — Create a knowledge base:

Workspace → Knowledge → + New Knowledge → give it a name (e.g., “Product Manuals”) → upload files. Open WebUI processes documents asynchronously; wait for the spinner to clear before querying.

Supported formats as of 2026: PDF, DOCX, TXT, Markdown, CSV. Complex DOCX formatting (tracked changes, nested tables) can lose fidelity — plain text and PDF are the most reliable.

Step 4 — Use it in chat:

In a new chat session, type # and the knowledge collection name appears as an autocomplete option. Select it to attach to the session. Every query now retrieves from your indexed documents before the LLM responds.

One limitation to know: if you change your chunk size or embedding model after documents are already indexed, existing documents in knowledge bases retain their original chunking. New uploads use the updated settings. You’d need to delete and re-upload existing files to re-index them with new settings.

Path 2: AnythingLLM — desktop app, no terminal

AnythingLLM is a desktop application built specifically for document chat. It bundles its own vector database (LanceDB), chunking logic, and a GUI for every step — no Docker, no terminal, drag-and-drop documents. As of May 2026 it has 53,000+ GitHub stars and is actively maintained.

The app itself needs roughly 2 GB RAM. Running a local LLM alongside it requires whatever your chosen Ollama model needs separately.

Install and connect to Ollama:

Download from useanything.com — the installer is around 500 MB. On first launch:

  1. Settings → LLM Preference → Ollama. The app auto-detects localhost:11434
  2. Select your chat model (Qwen2.5 7B for a balance of speed and quality; Llama 3.2 3B if you’re on a low-VRAM machine)
  3. Settings → Embedding Preference → Ollama → select nomic-embed-text
  4. Save and close settings

Create a workspace and upload documents:

Workspaces are the unit of organization — a project folder, chat history, and document collection in one. Click + New Workspace, name it, then drag and drop PDFs into the document panel. AnythingLLM chunks and embeds automatically. When the spinner clears, the documents are queryable.

The workspace isolation model is better than Open WebUI for multi-project use: documents in Workspace A are invisible to Workspace B. If you’re running separate projects — client work, personal research, a codebase’s documentation — this prevents cross-contamination in retrieval.

The trade-off: AnythingLLM’s default chunk size is on the larger side. For documents where you’re looking up specific numbers or dates, reducing the chunk size in Settings → Embedder → Chunk Configuration improves precision at the cost of needing more retrieved chunks to cover the same context.

Path 3: Python with LangChain + Ollama

For developers building applications or needing full control over the pipeline — custom preprocessing, re-ranking, hybrid retrieval, or integration into existing code.

Install dependencies:

pip install langchain langchain-ollama langchain-community langchain-text-splitters faiss-cpu pypdf

Build the pipeline:

from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_community.vectorstores import FAISS
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load all PDFs from a directory
loader = PyPDFDirectoryLoader("./docs/")
documents = loader.load()

# Chunk: chunk_size counts characters by default, not tokens.
# ~2,048 characters approximates 512 tokens (assuming ~4 characters per token).
splitter = RecursiveCharacterTextSplitter(chunk_size=2048, chunk_overlap=256)
chunks = splitter.split_documents(documents)

# Embed locally with Ollama (768-dimensional vectors)
embeddings = OllamaEmbeddings(model="nomic-embed-text")

# Build and save vector store to disk
db = FAISS.from_documents(chunks, embeddings)
db.save_local("faiss_index")

# Query pipeline: retrieve the top-k chunks and "stuff" them into one prompt.
# Written out by hand so it does not depend on the legacy RetrievalQA chain.
llm = ChatOllama(model="llama3.2", num_ctx=8192)
retriever = db.as_retriever(search_kwargs={"k": 5})

question = "What are the key findings in section 3?"
context = "\n\n".join(doc.page_content for doc in retriever.invoke(question))
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)

response = llm.invoke(prompt)
print(response.content)

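To reload the index in a later session, pass the safety flag shown below. Recent LangChain versions (including 0.3+) require it because part of the saved index is stored with pickle, so only load index files you created yourself: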

db = FAISS.load_local(
    "faiss_index",
    embeddings,
    allow_dangerous_deserialization=True
)

The k parameter: k=5 retrieves the 5 most similar chunks. For specific factoid questions, k=3 is usually enough. For multi-part analytical questions or long documents, k=8–10 prevents important context from being missed. Higher k increases prompt length, latency, and the risk of noisy context — there’s a diminishing returns curve past k=8 for most use cases.
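The noise risk at higher k is easy to see with a small example. The scores below are hypothetical similarity values, not measurements, and `top_k` is an illustrative helper that mirrors what a retriever does internally.

```python
import heapq

def top_k(scored_chunks, k):
    """Return the k highest-scoring (score, text) pairs, best first."""
    return heapq.nlargest(k, scored_chunks, key=lambda pair: pair[0])

def weakest_score(scored_chunks, k):
    """Similarity of the least relevant chunk that still makes the cut."""
    return min(score for score, _ in top_k(scored_chunks, k))

# Hypothetical similarity scores for five indexed chunks
scored_chunks = [
    (0.91, "chunk A"),
    (0.42, "chunk D"),
    (0.88, "chunk B"),
    (0.35, "chunk E"),
    (0.61, "chunk C"),
]

best_three = top_k(scored_chunks, 3)
floor_at_3 = weakest_score(scored_chunks, 3)
floor_at_5 = weakest_score(scored_chunks, 5)
```

Raising k from 3 to 5 pulls in chunks with much lower similarity, which lengthens the prompt with text that may have little to do with the question.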

For developers who want more than FAISS, the same pipeline works with Chroma (langchain_community.vectorstores.Chroma) as a drop-in replacement with a persistent database rather than a flat file. Chroma also supports metadata filtering, which matters when you need to restrict retrieval to documents from a specific date range or source.

Chunking strategy: what the numbers say

Chunk size is the most impactful tuning parameter in a local RAG setup and also the most frequently misconfigured. Chroma’s published research on recursive splitting vs semantic chunking found:

  • Recursive splitting at 400 tokens → 85–90% retrieval recall
  • Semantic chunking (splits at meaning boundaries rather than token count) → 91–92% recall

The improvement from semantic chunking is real but not dramatic. Start with recursive splitting at 512 tokens. Only move to semantic chunking if you observe consistent missed retrievals — it has higher indexing cost and more moving parts.

On overlap: the standard recommendation is 10–20% (64 tokens for a 512-token chunk). A January 2026 analysis found overlap had no measurable benefit for SPLADE retrieval specifically, but for dense vector search, 10–15% overlap is low-cost insurance against a boundary split cutting a critical sentence in half.
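Overlap is simplest to understand as a sliding window whose step is smaller than its width. The following sketch works on an already-tokenized list and uses the 512/64 numbers from above; the function name is illustrative, and real splitters also try to break on paragraph and sentence boundaries.

```python
def chunk_with_overlap(tokens, chunk_size=512, overlap=64):
    """Slide a window of chunk_size tokens, stepping by chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks

tokens = [f"t{i}" for i in range(1000)]  # stand-in for a tokenized document
chunks = chunk_with_overlap(tokens)
shared = chunks[0][-64:] == chunks[1][:64]  # tail of one chunk opens the next
```

A sentence that straddles the boundary at token 512 appears whole in at least one of the two neighboring chunks, as long as it is shorter than the overlap.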

Matching chunk size to content type matters more than squeezing the last 2% out of a retrieval benchmark:

| Content type | Recommended chunk size |
| --- | --- |
| Dense technical documentation, research papers | 512–1024 tokens |
| Legal documents, contracts | 256–512 tokens |
| FAQ documents, short-answer lookup | 128–256 tokens |
| Mixed document types | 512 tokens as a safe default |

What runs on CPU vs GPU

The embedding models in this guide run on CPU without a meaningful throughput penalty. nomic-embed-text at 274 MB processes documents in the background: indexing is a one-time cost, and retrieval at query time takes milliseconds regardless of hardware.

The LLM doing the answering is where GPU matters. With 8 GB VRAM, a 7B–8B model handles synthesis well while the embedding model stays on CPU. With 16+ GB VRAM, a 14B–32B model produces noticeably better reasoning over multi-document retrieved context.
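A back-of-the-envelope estimate shows where those VRAM tiers come from. The sketch assumes 4-bit quantized weights and a flat 1.5 GB allowance for the KV cache and runtime buffers; both are illustrative assumptions, and actual usage depends on the quantization format and the context length you set.

```python
def model_memory_gb(params_billions, bits_per_weight=4, overhead_gb=1.5):
    """Rough footprint: weights at the given quantization plus a fixed overhead."""
    weight_gb = params_billions * bits_per_weight / 8  # 1B params at 8 bits is about 1 GB
    return weight_gb + overhead_gb

def fits(params_billions, vram_gb):
    """True if the estimated footprint fits in the given VRAM."""
    return model_memory_gb(params_billions) <= vram_gb

estimate_7b = model_memory_gb(7)     # 3.5 GB weights + 1.5 GB overhead
estimate_14b = model_memory_gb(14)   # 7.0 GB weights + 1.5 GB overhead
estimate_32b = model_memory_gb(32)   # 16.0 GB weights + 1.5 GB overhead
```

Under these assumptions a 7B model fits in 8 GB with room to spare, a 14B model needs the 16 GB tier, and a 32B model needs more than 16 GB.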

No local GPU? A hybrid approach still works: run the embedding model and vector store locally on CPU (nomic-embed-text is explicitly designed to run well on CPU), and route LLM inference to a cloud GPU. RunPod Serverless charges per inference call rather than by the hour, which is practical for occasional document Q&A rather than continuous workloads. Your full documents and the embedding index stay on your machine in this setup, but the query and the retrieved chunks, which are excerpts of your documents, do go to the inference endpoint.

For hardware sizing context, see the system RAM guide and the NVMe SSD guide — fast NVMe cuts initial model loading time, and adequate system RAM prevents the LLM from having to page.

Honest take: which path for which situation

Open WebUI is the right call if you already run it for chat and want to add document Q&A without a separate app. The hybrid search option is genuinely useful. The limitation: Open WebUI wasn’t designed specifically for document work, so workspace isolation is weaker than AnythingLLM and managing many separate document collections gets messy.

AnythingLLM wins when document chat is the primary use case. The workspace model is cleaner, the desktop app removes Docker friction, and drag-and-drop document management is faster to iterate on. The trade-off is less flexibility on the inference backend.

LangChain Python is for developers building something, not for personal document chat. Full control over every parameter, but you’re writing and owning the pipeline code. The FAISS index file is portable — useful if you want to build the index once and ship it with an application.

What all three get right: documents stay on your machine. The embedding model runs locally. The vector store is local. The privacy audit covers what each tool actually phones home — local RAG stacks hold up cleanly.

The one thing that will trip you up regardless of which path you take: forgetting to increase Ollama’s context length from its 2,048-token default. Do that first, before anything else, and the rest of the setup is straightforward.

Prices and availability current as of May 2026. Verify hardware and API pricing before purchasing.

Last updated May 23, 2026. Prices and specs change; verify current rates before purchasing.