This guide is based on Olivier Duvelleroy's NEXUS project—a personal RAG system he built in a single weekend to solve the file-limit problem with AI tools. If you haven't read that story, start there for context on why this matters.
Here, we'll walk through exactly how to build your own version.
What You'll Need
- A computer with 16GB+ RAM (the model runs locally)
- ~10GB free disk space (for the model and index)
- Basic comfort with the terminal (you'll copy/paste commands)
- Documents to index (PDFs, Word docs, text files)
- 2-4 hours (mostly waiting for downloads)
The Architecture
Before we start, let's understand what we're building:
┌──────────────────────────────────────────────────┐
│                  YOUR DOCUMENTS                  │
│       (PDFs, Word docs, text files, etc.)        │
└─────────────────────────┬────────────────────────┘
                          │
                          ▼
┌──────────────────────────────────────────────────┐
│               CHUNKING & EMBEDDING               │
│    Split into paragraphs → Convert to vectors    │
│         (sentence-transformers library)          │
└─────────────────────────┬────────────────────────┘
                          │
                          ▼
┌──────────────────────────────────────────────────┐
│                FAISS VECTOR INDEX                │
│  Fast similarity search across all your chunks   │
└─────────────────────────┬────────────────────────┘
           ┌──────────────┴─────────────────┐
           │                                │
           ▼                                ▼
┌─────────────────────┐   ┌───────────────────────────────────┐
│    YOUR QUESTION    │   │         RETRIEVED CHUNKS          │
│                     │──▶│     (Top 5-10 most relevant)      │
└─────────────────────┘   └─────────────────┬─────────────────┘
                                            │
                                            ▼
                          ┌───────────────────────────────────┐
                          │        LOCAL LLM (Ollama)         │
                          │    Question + Context → Answer    │
                          │            (Qwen3:8b)             │
                          └───────────────────────────────────┘
NEXUS Architecture: Index once, query instantly
The key insight: retrieval is fast (a few seconds), while generation is slower (1-2 minutes on local hardware). This is fine because you're trading speed for privacy and for the ability to search your entire document collection rather than the handful of files a chat tool lets you upload.
Step 1: Install Ollama ~10 min
Ollama is a local AI runtime that makes it dead simple to run open-source models. Think of it as "Docker for LLMs."
On Mac: Download from ollama.com/download (or install with Homebrew: brew install ollama).
On Windows: Download from ollama.com/download.
On Linux:
curl -fsSL https://ollama.com/install.sh | sh
Verify it's working:
ollama --version
Step 2: Pull a Model ~20 min
We'll use Qwen3:8b—a capable model that runs well on consumer hardware. It's 4-bit quantized, meaning it's compressed to use less memory while maintaining quality.
ollama pull qwen3:8b
This downloads about 5GB. Go grab coffee.
Test it:
ollama run qwen3:8b "What is RAG in AI?"
You should see a response about Retrieval-Augmented Generation. If your fan starts spinning, that's normal—your CPU is doing the work.
Performance Note
Local inference is slower than cloud APIs. Expect 1-2 minutes per response on a typical laptop. This is the tradeoff for privacy and no API costs. If you need speed, you can swap in a cloud API later—the architecture supports both.
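If you'd rather sanity-check Ollama from Python (which is how we'll drive it later), it also exposes a local REST API. A minimal sketch, assuming Ollama is running on its default port 11434 and using only the standard library:

```
import json
from urllib.request import Request, urlopen

# Ask the local Ollama server for a one-off completion (non-streaming).
payload = json.dumps({
    "model": "qwen3:8b",
    "prompt": "In one sentence, what is retrieval-augmented generation?",
    "stream": False,
}).encode("utf-8")

request = Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urlopen(request) as response:
    print(json.loads(response.read())["response"])
```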
Step 3: Set Up Python Environment ~15 min
Create a clean Python environment for the project:
# Create project directory
mkdir nexus && cd nexus
# Create virtual environment
python -m venv venv
# Activate it
source venv/bin/activate # Mac/Linux
# or: venv\Scripts\activate # Windows
# Install dependencies
pip install langchain langchain-community sentence-transformers faiss-cpu pypdf docx2txt
What we're installing:
- langchain: Orchestration framework for LLM pipelines
- sentence-transformers: Creates embeddings from text
- faiss-cpu: Facebook's vector similarity search
- pypdf, docx2txt: Document parsers for PDFs and Word (.docx) files
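Before moving on, it's worth confirming that the heavier packages installed cleanly. A quick sanity check (run it inside the activated virtual environment):

```
# Verify the core libraries import and report their versions.
import faiss
import langchain
import sentence_transformers

print("faiss:", faiss.__version__)
print("langchain:", langchain.__version__)
print("sentence-transformers:", sentence_transformers.__version__)
```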
Step 4: Create the Indexing Script ~30 min
This script reads your documents, splits them into chunks, creates embeddings, and stores them in a FAISS index.
Create index_documents.py:
import os
from pathlib import Path
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader, Docx2txtLoader, TextLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
# Configuration
DOCS_PATH = "./documents" # Put your docs here
INDEX_PATH = "./faiss_index"
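# Chunk size is measured in characters; the overlap keeps a little shared text
# between neighboring chunks so an idea isn't cut in half at a boundary.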
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200
def load_documents(docs_path):
    """Load all documents from the specified directory."""
    documents = []
    for file_path in Path(docs_path).rglob("*"):
        if file_path.suffix.lower() == ".pdf":
            loader = PyPDFLoader(str(file_path))
        elif file_path.suffix.lower() in [".docx", ".doc"]:
            loader = Docx2txtLoader(str(file_path))
        elif file_path.suffix.lower() in [".txt", ".md"]:
            loader = TextLoader(str(file_path))
        else:
            continue
        try:
            documents.extend(loader.load())
            print(f"Loaded: {file_path.name}")
        except Exception as e:
            print(f"Error loading {file_path.name}: {e}")
    return documents
def main():
    print("Loading documents...")
    documents = load_documents(DOCS_PATH)
    print(f"Loaded {len(documents)} document pages")

    print("Splitting into chunks...")
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        separators=["\n\n", "\n", " ", ""]
    )
    chunks = text_splitter.split_documents(documents)
    print(f"Created {len(chunks)} chunks")

    print("Creating embeddings (this takes a while)...")
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )

    print("Building FAISS index...")
    vectorstore = FAISS.from_documents(chunks, embeddings)

    print(f"Saving index to {INDEX_PATH}...")
    vectorstore.save_local(INDEX_PATH)
    print("Done! Index ready for queries.")

if __name__ == "__main__":
    main()
Step 5: Create the Query Script ~30 min
This script loads the index, retrieves relevant chunks for your question, and sends them to the local LLM for an answer.
Create query.py:
import sys
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
INDEX_PATH = "./faiss_index"
MODEL_NAME = "qwen3:8b"
TOP_K = 5 # Number of chunks to retrieve
# Custom prompt for grounded answers
PROMPT_TEMPLATE = """Use the following context to answer the question.
If you cannot answer based on the context, say so.
Always cite which documents support your answer.
Context:
{context}
Question: {question}
Answer:"""
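# The template above keeps answers grounded in the retrieved context and asks
# the model to name the documents it relied on.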
def main():
    if len(sys.argv) < 2:
        print("Usage: python query.py 'Your question here'")
        sys.exit(1)
    question = " ".join(sys.argv[1:])

    print("Loading index...")
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )
    vectorstore = FAISS.load_local(
        INDEX_PATH, embeddings, allow_dangerous_deserialization=True
    )

    print("Connecting to Ollama...")
    llm = Ollama(model=MODEL_NAME, temperature=0.1)

    print("Creating retrieval chain...")
    prompt = PromptTemplate(
        template=PROMPT_TEMPLATE,
        input_variables=["context", "question"]
    )
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vectorstore.as_retriever(search_kwargs={"k": TOP_K}),
        chain_type_kwargs={"prompt": prompt},
        return_source_documents=True
    )

    print(f"\nQuestion: {question}\n")
    print("Thinking... (this takes 1-2 minutes)\n")
    result = qa_chain({"query": question})

    print("=" * 60)
    print("ANSWER:")
    print("=" * 60)
    print(result["result"])
    print("\n" + "=" * 60)
    print("SOURCES:")
    print("=" * 60)
    for doc in result["source_documents"]:
        source = doc.metadata.get("source", "Unknown")
        print(f"- {source}")

if __name__ == "__main__":
    main()
Step 6: Index Your Documents (time varies)
Create a documents folder and add your files:
mkdir documents
# Copy your PDFs, Word docs, and text files here
Run the indexing script:
python index_documents.py
This will take a while depending on how many documents you have. For 100 documents, expect 10-20 minutes.
Checkpoint
You should now have a faiss_index folder containing your vectorized knowledge base. This only needs to be rebuilt when you add new documents.
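If you want to confirm what was built, the saved index can be loaded and inspected in a few lines. A minimal sketch, assuming the default INDEX_PATH from the indexing script:

```
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
vectorstore = FAISS.load_local(
    "./faiss_index", embeddings, allow_dangerous_deserialization=True
)

# The underlying FAISS index reports how many chunk vectors it holds.
print("Vectors in index:", vectorstore.index.ntotal)
```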
Step 7: Query Your Knowledge Base
Now the fun part. Ask questions:
python query.py "What did customers say about pricing in our research?"
python query.py "Summarize the main themes from the Gartner reports"
python query.py "Find quotes about AI adoption challenges"The system will:
- Convert your question to an embedding (instant)
- Find the 5 most relevant chunks (instant)
- Send question + context to the local LLM (1-2 min)
- Return a grounded answer with sources
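Since retrieval is nearly instant while generation takes minutes, it can be handy to preview which chunks a question would pull back before involving the LLM at all. A minimal sketch of a separate script (call it preview_chunks.py; the name is just a suggestion), reusing the same index and embedding model:

```
import sys

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

question = " ".join(sys.argv[1:]) or "What did customers say about pricing?"

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
vectorstore = FAISS.load_local(
    "./faiss_index", embeddings, allow_dangerous_deserialization=True
)

# similarity_search returns the top-k chunks without ever calling the LLM.
for doc in vectorstore.similarity_search(question, k=5):
    print(f"--- {doc.metadata.get('source', 'Unknown')}")
    print(doc.page_content[:200])
    print()
```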
Real Example: What NEXUS Actually Outputs
Here's a real example from Olivier's system. He asked NEXUS to find relevant quotes for each chapter of a book he's writing about the "context problem" in enterprise data:
Part 6 — Context problem
Data exists, but meaning is lost across systems
| Chapter | Expanded quote | Reference |
|---|---|---|
| Setup | "If a user is spending their time in Outlook or in CRM, how do you take that insight and make sure it gets to them in that application? Otherwise, it just stays disconnected from action." | Qual Transcript 2 |
| Exploration | "All these tools have siloed, contextual metadata. A lot of the understanding sits in one person's head. When they leave or switch roles, that context disappears." | Qual Transcript 1 |
| Practical | "We are very clear that Confluence is the source for documentation. If you want to understand definitions or logic, there is one place to go. That consistency matters when people start self-serving." | Qual Transcript 5 |
| Synthesis | "AI models don't understand business rules or why decisions were made. Without decision memory and context, agents will act quickly but incorrectly." | Qual Transcript 5 |
This is the magic: In 30 seconds, NEXUS retrieved the most relevant quotes from 40+ interview transcripts, matched them to book chapters, and cited the exact source. No manual searching. No missed documents.
What's Next
This is a working foundation. Here's how to extend it:
| Improvement | Difficulty | Impact |
|---|---|---|
| Add a simple web UI (Gradio/Streamlit) | Easy | Much better UX |
| Swap in GPT-4 for faster inference | Easy | 10x faster responses |
| Add incremental indexing | Medium | Faster updates |
| Support images/diagrams (vision model) | Hard | Much richer context |
| Add memory across sessions | Medium | Conversational queries |
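To make the "incremental indexing" row concrete: the FAISS store can absorb new chunks without a full rebuild. A rough sketch, assuming new files land in a hypothetical ./documents/new folder and reusing load_documents from Step 4:

```
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

from index_documents import load_documents  # reuse the loader from Step 4

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
vectorstore = FAISS.load_local(
    "./faiss_index", embeddings, allow_dangerous_deserialization=True
)

# Chunk only the newly added documents, then merge them into the existing index.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
new_chunks = splitter.split_documents(load_documents("./documents/new"))
vectorstore.add_documents(new_chunks)
vectorstore.save_local("./faiss_index")
print(f"Added {len(new_chunks)} new chunks")
```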
The Hybrid Future
Olivier's insight: this doesn't have to be all-local forever. The architecture supports a hybrid approach:
- Local retrieval for governance (your documents never leave your machine)
- Optional cloud inference for speed/quality when content is non-sensitive
You control the tradeoff. That's the point.
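In practice, the only line that has to change for cloud inference is the llm line in query.py; the index and the retrieval step stay on your machine. A sketch using OpenAI through LangChain (assumes pip install langchain-openai, an OPENAI_API_KEY in your environment, and an illustrative model name):

```
from langchain_openai import ChatOpenAI

# Retrieval still happens against the local FAISS index; only the final
# question-plus-context prompt is sent to the cloud model.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.1)
```

Everything else in query.py, including the prompt and the retriever, stays the same.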
"At no time should we be afraid of diving into the details these days. AI tools can guide you through implementation; you are mostly limited by your curiosity, creativity, and intent."
— Olivier Duvelleroy
You now have your own context engine. What will you ask it?