
RAG Agents for Knowledge Workers: Building Your Second Brain with AI

Marcus Rivera

Full-stack developer and agent builder. Covers coding assistants and dev tools.

March 17, 2026 · 15 min read

Building RAG Agents as Personal Knowledge Assistants: A Practical Guide

Retrieval-Augmented Generation (RAG) has emerged as the dominant pattern for building AI agents that can reason over private or domain-specific knowledge. While the concept seems straightforward—retrieve relevant documents, then generate answers—the implementation details separate a frustrating chatbot from a genuinely useful personal knowledge assistant.

This guide covers the practical engineering decisions, trade-offs, and implementation patterns I've found effective after building multiple RAG systems for personal and professional use.

The Architecture of a Personal RAG Agent

A personal knowledge assistant differs from enterprise RAG systems in several key ways:

  • Single-user optimization: You can tune aggressively for one person's use case
  • Diverse document types: Personal knowledge spans PDFs, emails, notes, bookmarks, code snippets, and more
  • Evolving context: The knowledge base grows and changes frequently
  • High tolerance for latency: Personal use cases can accept 2-3 second response times

The core architecture consists of four components:

┌─────────────────────────────────────────────────────────┐
│                    User Query                            │
└──────────────────────┬──────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────┐
│              Query Processing & Routing                 │
│  (Intent detection, query expansion, routing)           │
└──────────────────────┬──────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────┐
│              Retrieval Pipeline                          │
│  (Vector search, keyword search, reranking)             │
└──────────────────────┬──────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────┐
│              Generation with Context                    │
│  (LLM synthesis, citation, source attribution)         │
└─────────────────────────────────────────────────────────┘
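
In code terms, the flow reduces to three calls chained together. This is only an orchestration sketch — the function names are illustrative placeholders for the concrete components built in the rest of this guide:

def answer(query: str) -> str:
    processed = process_query(query)      # intent detection, expansion, routing
    hits = retrieve(processed)            # vector + keyword search, fusion, reranking
    return generate_answer(query, hits)   # LLM synthesis with citations and sources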

Document Ingestion: The Foundation Most People Get Wrong

The quality of your RAG system is bounded by your ingestion pipeline. Most tutorials focus on PDF parsing, but a personal knowledge assistant needs to handle multiple formats gracefully.

Setting Up a Multi-Format Ingestion Pipeline

import os
from pathlib import Path
from typing import Dict, List, Optional
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Document:
    content: str
    metadata: Dict
    source_type: str
    created_at: Optional[datetime] = None

class DocumentIngestionPipeline:
    """Handles multiple document formats for personal knowledge bases."""
    
    def __init__(self, chunk_size=1000, chunk_overlap=200):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.processors = {
            '.pdf': self._process_pdf,
            '.md': self._process_markdown,
            '.txt': self._process_text,
            '.html': self._process_html,
            '.json': self._process_json,
            '.py': self._process_code,
            '.js': self._process_code,
        }
    
    def ingest_directory(self, directory: str) -> List[Document]:
        """Recursively ingest all supported files from a directory."""
        documents = []
        path = Path(directory)
        
        for file_path in path.rglob('*'):
            if file_path.suffix in self.processors:
                try:
                    doc = self.processors[file_path.suffix](file_path)
                    if doc:
                        documents.extend(self._chunk_document(doc))
                except Exception as e:
                    print(f"Error processing {file_path}: {e}")
        
        return documents
    
    def _process_pdf(self, file_path: Path) -> Document:
        """Extract text from PDF with layout awareness."""
        import pymupdf  # modern PyMuPDF import (use `import fitz` on older versions)
        import pymupdf4llm  # produces cleaner Markdown than raw PyMuPDF text extraction
        
        md_content = pymupdf4llm.to_markdown(str(file_path))
        with pymupdf.open(str(file_path)) as pdf:
            page_count = pdf.page_count
        
        return Document(
            content=md_content,
            metadata={
                'source': str(file_path),
                'filename': file_path.name,
                'format': 'pdf',
                'pages': page_count,
            },
            source_type='pdf',
            created_at=datetime.fromtimestamp(file_path.stat().st_mtime)
        )
    
    def _process_markdown(self, file_path: Path) -> Document:
        """Process markdown with header extraction."""
        content = file_path.read_text(encoding='utf-8')
        
        # Extract headers for metadata
        headers = []
        for line in content.split('\n'):
            if line.startswith('#'):
                headers.append(line.strip('# ').strip())
        
        return Document(
            content=content,
            metadata={
                'source': str(file_path),
                'filename': file_path.name,
                'format': 'markdown',
                'headers': ' | '.join(headers[:5]),  # first 5 headers, joined so the value stays a scalar (ChromaDB metadata requirement)
            },
            source_type='markdown',
            created_at=datetime.fromtimestamp(file_path.stat().st_mtime)
        )
    
    def _process_text(self, file_path: Path) -> Document:
        """Process plain-text files."""
        content = file_path.read_text(encoding='utf-8', errors='ignore')
        
        return Document(
            content=content,
            metadata={
                'source': str(file_path),
                'filename': file_path.name,
                'format': 'text',
            },
            source_type='text',
            created_at=datetime.fromtimestamp(file_path.stat().st_mtime)
        )
    
    def _process_html(self, file_path: Path) -> Document:
        """Strip markup from HTML and keep the readable text."""
        import re
        raw = file_path.read_text(encoding='utf-8', errors='ignore')
        text = re.sub(r'<(script|style)[^>]*>.*?</\1>', ' ', raw, flags=re.S)
        text = re.sub(r'<[^>]+>', ' ', text)
        text = re.sub(r'\s+', ' ', text).strip()
        
        return Document(
            content=text,
            metadata={
                'source': str(file_path),
                'filename': file_path.name,
                'format': 'html',
            },
            source_type='html',
            created_at=datetime.fromtimestamp(file_path.stat().st_mtime)
        )
    
    def _process_json(self, file_path: Path) -> Document:
        """Pretty-print JSON so keys and values stay searchable as text."""
        import json
        data = json.loads(file_path.read_text(encoding='utf-8'))
        
        return Document(
            content=json.dumps(data, indent=2, ensure_ascii=False),
            metadata={
                'source': str(file_path),
                'filename': file_path.name,
                'format': 'json',
            },
            source_type='json',
            created_at=datetime.fromtimestamp(file_path.stat().st_mtime)
        )
    
    def _process_code(self, file_path: Path) -> Document:
        """Process code files with syntax awareness."""
        content = file_path.read_text(encoding='utf-8')
        
        # Add file path as context
        enhanced_content = f"File: {file_path}\n\n{content}"
        
        return Document(
            content=enhanced_content,
            metadata={
                'source': str(file_path),
                'filename': file_path.name,
                'format': 'code',
                'language': file_path.suffix[1:],
                'lines': len(content.split('\n')),
            },
            source_type='code',
            created_at=datetime.fromtimestamp(file_path.stat().st_mtime)
        )
    
    def _chunk_document(self, doc: Document) -> List[Document]:
        """Split document into chunks with overlap."""
        from langchain.text_splitter import RecursiveCharacterTextSplitter
        
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=self.chunk_size,
            chunk_overlap=self.chunk_overlap,
            separators=["\n\n", "\n", " ", ""]
        )
        
        chunks = splitter.split_text(doc.content)
        
        chunked_docs = []
        for i, chunk in enumerate(chunks):
            chunk_metadata = doc.metadata.copy()
            chunk_metadata.update({
                'chunk_index': i,
                'total_chunks': len(chunks),
                'chunk_id': f"{doc.metadata['source']}_{i}",
            })
            
            chunked_docs.append(Document(
                content=chunk,
                metadata=chunk_metadata,
                source_type=doc.source_type,
                created_at=doc.created_at
            ))
        
        return chunked_docs
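
Under the assumption that your notes live in a local folder, a minimal end-to-end run of the pipeline looks like this:

pipeline = DocumentIngestionPipeline(chunk_size=800, chunk_overlap=150)
chunks = pipeline.ingest_directory("./my_notes")  # any folder of supported file types

sources = {c.metadata['source'] for c in chunks}
print(f"Ingested {len(chunks)} chunks from {len(sources)} files")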

Key Ingestion Decisions

Chunk size matters more than you think. Note that RecursiveCharacterTextSplitter above counts characters rather than tokens (figure roughly four characters per token), so scale the settings accordingly; a token-aware variant is sketched after this list. For a personal knowledge assistant:

  • 500-800 tokens: Best for factual Q&A, code snippets, and definitions
  • 1000-1500 tokens: Good for conceptual explanations and how-to guides
  • 2000+ tokens: Use only for documents that lose meaning when split (legal contracts, narratives)
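
If you want these budgets enforced in tokens rather than characters, LangChain's splitter can count with tiktoken. A sketch, where the per-type budgets are illustrative defaults to tune for your own corpus:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Rough token budgets per content type (assumption: adjust to your documents)
CHUNK_TOKENS = {'code': 600, 'markdown': 1200, 'pdf': 1000, 'text': 800}

def splitter_for(source_type: str) -> RecursiveCharacterTextSplitter:
    """Build a token-counting splitter matching the guidance above."""
    size = CHUNK_TOKENS.get(source_type, 800)
    return RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        encoding_name="cl100k_base",  # tokenizer used by OpenAI embedding models
        chunk_size=size,
        chunk_overlap=size // 5,
    )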

Metadata is your secret weapon: it lets you filter and scope retrieval later (see the sketch after this list). Always capture:

  • Source file path and name
  • Creation/modification dates
  • Document type and format
  • Any structural elements (headers, sections)
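
That metadata pays off at query time. For example, with the ChromaDB collection set up later in this guide, a search can be scoped to a single format; this sketch assumes `query_embedding` has already been computed:

results = collection.query(
    query_embeddings=[query_embedding.tolist()],
    n_results=5,
    where={"format": "pdf"},  # only chunks whose metadata marks them as PDFs
)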

Embedding Strategies: Choosing the Right Vector Representation

The embedding model determines what your system can "understand." Here's what actually works in practice:

Comparing Embedding Models for Personal Use

| Model | Dimensions | Strengths | Weaknesses | Best For |
| --- | --- | --- | --- | --- |
| text-embedding-3-small (OpenAI) | 1536 | Good balance of quality/cost | Requires API calls | General purpose |
| text-embedding-3-large (OpenAI) | 3072 | Excellent quality | Expensive for large KBs | High-stakes retrieval |
| all-MiniLM-L6-v2 (Sentence Transformers) | 384 | Fast, runs locally | Lower quality on complex queries | Quick prototypes |
| bge-base-en-v1.5 (BAAI) | 768 | Strong retrieval performance | Slightly slower than MiniLM | Production local systems |
| mxbai-embed-large-v1 (MixedBread) | 1024 | State-of-the-art open source | Requires more compute | Best open-source option |

Implementing a Flexible Embedding System

import numpy as np
from typing import List, Optional
import hashlib

class EmbeddingManager:
    """Manages multiple embedding models with caching."""
    
    def __init__(self, model_name: str = "text-embedding-3-small"):
        self.model_name = model_name
        self.cache = {}  # Simple in-memory cache
        self._init_model(model_name)
    
    def _init_model(self, model_name: str):
        """Initialize the appropriate embedding model."""
        if "text-embedding" in model_name:
            # OpenAI models
            import openai
            self.client = openai.OpenAI()
            self.embed_func = self._openai_embed
        else:
            # Local models via Sentence Transformers
            from sentence_transformers import SentenceTransformer
            self.model = SentenceTransformer(model_name)
            self.embed_func = self._local_embed
    
    def _openai_embed(self, texts: List[str]) -> np.ndarray:
        """Embed using OpenAI API with batching."""
        response = self.client.embeddings.create(
            input=texts,
            model=self.model_name
        )
        return np.array([item.embedding for item in response.data])
    
    def _local_embed(self, texts: List[str]) -> np.ndarray:
        """Embed using local Sentence Transformers model."""
        return self.model.encode(texts, convert_to_numpy=True)
    
    def embed_documents(self, documents: List[str], batch_size: int = 100) -> np.ndarray:
        """Embed documents with caching and batching, preserving input order."""
        results: List[Optional[np.ndarray]] = [None] * len(documents)
        
        # Check cache first, remembering each miss's original position
        uncached_indices = []
        uncached_texts = []
        
        for i, text in enumerate(documents):
            cache_key = hashlib.md5(text.encode()).hexdigest()
            if cache_key in self.cache:
                results[i] = self.cache[cache_key]
            else:
                uncached_indices.append(i)
                uncached_texts.append(text)
        
        # Process uncached documents in batches
        for batch_start in range(0, len(uncached_texts), batch_size):
            batch = uncached_texts[batch_start:batch_start + batch_size]
            batch_indices = uncached_indices[batch_start:batch_start + batch_size]
            batch_embeddings = self.embed_func(batch)
            
            # Cache results and slot them back into their original positions
            for pos, text, embedding in zip(batch_indices, batch, batch_embeddings):
                cache_key = hashlib.md5(text.encode()).hexdigest()
                self.cache[cache_key] = embedding
                results[pos] = embedding
        
        return np.array(results)
    
    def embed_query(self, query: str) -> np.ndarray:
        """Embed a single query."""
        return self.embed_func([query])[0]
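
A quick sanity check that the manager behaves as expected (the hosted model needs OPENAI_API_KEY set; the exact numbers will vary):

embedder = EmbeddingManager("text-embedding-3-small")

docs = [
    "RAG combines retrieval with LLM generation.",
    "BM25 is a classic keyword ranking function.",
]
doc_vecs = embedder.embed_documents(docs)
query_vec = embedder.embed_query("What is retrieval-augmented generation?")

# Cosine similarity between the query and each document
sims = doc_vecs @ query_vec / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
)
print(sims)  # the first sentence should score highest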

Hybrid Embedding Strategy

For a personal knowledge assistant, I recommend a hybrid approach:

class HybridEmbeddingManager:
    """Uses multiple embedding models for different content types."""
    
    def __init__(self):
        self.models = {
            'general': EmbeddingManager("text-embedding-3-small"),
            'code': EmbeddingManager("mixedbread-ai/mxbai-embed-large-v1"),  # Better for code
            'technical': EmbeddingManager("BAAI/bge-base-en-v1.5"),  # Good for technical docs
        }
    
    def get_model_for_content(self, content_type: str) -> EmbeddingManager:
        """Select the best model based on content type."""
        if content_type in ['code', 'programming', 'software']:
            return self.models['code']
        elif content_type in ['technical', 'scientific', 'research']:
            return self.models['technical']
        else:
            return self.models['general']
    
    def embed_documents(self, documents: List[str], batch_size: int = 100) -> np.ndarray:
        """Fall back to the general-purpose model when no content type is given."""
        return self.models['general'].embed_documents(documents, batch_size=batch_size)
    
    def embed_query(self, query: str) -> np.ndarray:
        """Embed a query with the general-purpose model."""
        return self.models['general'].embed_query(query)
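
Routing in practice looks like this (each local model is downloaded from Hugging Face the first time it is used):

hybrid = HybridEmbeddingManager()

chunk = "def retry(fn, attempts=3): ..."
model = hybrid.get_model_for_content('code')   # picks the code-oriented embedder
vector = model.embed_documents([chunk])[0]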

Retrieval Optimization: Beyond Basic Similarity Search

Vector similarity search is just the starting point. Here's how to build a retrieval pipeline that actually works:

Implementing a Multi-Stage Retrieval Pipeline

import numpy as np
from typing import List, Dict, Tuple
from dataclasses import dataclass

@dataclass
class RetrievedDocument:
    content: str
    metadata: Dict
    score: float
    retrieval_method: str

class AdvancedRetrievalPipeline:
    """Multi-stage retrieval with reranking and hybrid search."""
    
    def __init__(self, vector_store, embedding_manager):
        self.vector_store = vector_store
        self.embedding_manager = embedding_manager
        
        # Initialize reranker
        from sentence_transformers import CrossEncoder
        self.reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
        
        # BM25 for keyword search
        from rank_bm25 import BM25Okapi
        self.bm25 = None
        self.corpus = []
        self.documents = []  # full chunk dicts (content + metadata) for building results
    
    def index_documents(self, documents: List[Dict]):
        """Index documents for keyword search and keep them for result construction.
        
        The vector store itself is populated by the caller (see the assistant below),
        so this only builds the BM25 index and a local copy of the chunks."""
        self.documents = documents
        self.corpus = [doc['content'] for doc in documents]
        
        # Prepare for BM25
        tokenized_corpus = [doc.split() for doc in self.corpus]
        self.bm25 = BM25Okapi(tokenized_corpus)
    
    def retrieve(self, query: str, top_k: int = 10) -> List[RetrievedDocument]:
        """Multi-stage retrieval pipeline."""
        
        # Stage 1: Initial retrieval (vector + keyword)
        vector_results = self._vector_search(query, top_k * 2)
        keyword_results = self._keyword_search(query, top_k * 2)
        
        # Stage 2: Reciprocal Rank Fusion
        fused_results = self._reciprocal_rank_fusion(
            [vector_results, keyword_results],
            weights=[0.7, 0.3]  # Weight vector search higher
        )
        
        # Stage 3: Reranking with cross-encoder
        reranked_results = self._rerank_results(query, fused_results[:top_k * 2])
        
        # Stage 4: Diversity filtering
        diverse_results = self._mmr_diversity_filter(
            reranked_results, 
            lambda_param=0.7,  # Balance relevance vs diversity
            top_k=top_k
        )
        
        # Wrap (index, score) pairs into RetrievedDocument objects for the caller
        return [
            RetrievedDocument(
                content=self.documents[idx]['content'],
                metadata=self.documents[idx]['metadata'],
                score=score,
                retrieval_method='hybrid'
            )
            for idx, score in diverse_results[:top_k]
        ]
    
    def _vector_search(self, query: str, top_k: int) -> List[Tuple[int, float]]:
        """Semantic search using embeddings (ChromaDB query API)."""
        query_embedding = self.embedding_manager.embed_query(query)
        
        results = self.vector_store.query(
            query_embeddings=[query_embedding.tolist()],
            n_results=top_k
        )
        
        # Ids were created as "doc_{index}" during ingestion, so recover the corpus index
        hits = []
        for doc_id, dist in zip(results['ids'][0], results['distances'][0]):
            idx = int(doc_id.rsplit('_', 1)[-1])
            hits.append((idx, 1 - dist))  # convert cosine distance to similarity
        return hits
    
    def _keyword_search(self, query: str, top_k: int) -> List[Tuple[int, float]]:
        """BM25 keyword search."""
        if not self.bm25:
            return []
        
        tokenized_query = query.split()
        scores = self.bm25.get_scores(tokenized_query)
        
        # Get top-k indices
        top_indices = np.argsort(scores)[::-1][:top_k]
        return [(idx, scores[idx]) for idx in top_indices]
    
    def _reciprocal_rank_fusion(
        self, 
        result_lists: List[List[Tuple[int, float]]],
        weights: List[float],
        k: int = 60
    ) -> List[Tuple[int, float]]:
        """Combine multiple result lists using RRF."""
        fused_scores = {}
        
        for weight, results in zip(weights, result_lists):
            for rank, (doc_id, score) in enumerate(results):
                if doc_id not in fused_scores:
                    fused_scores[doc_id] = 0
                fused_scores[doc_id] += weight * (1 / (k + rank + 1))
        
        # Sort by fused score
        sorted_docs = sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
        return sorted_docs
    
    def _rerank_results(
        self, 
        query: str, 
        results: List[Tuple[int, float]]
    ) -> List[Tuple[int, float]]:
        """Rerank using cross-encoder for precision."""
        if not results:
            return []
        
        # Prepare pairs for reranking
        pairs = [(query, self.corpus[idx]) for idx, _ in results]
        
        # Get reranker scores
        rerank_scores = self.reranker.predict(pairs)
        
        # Combine with original scores (weighted)
        reranked = []
        for (idx, orig_score), rerank_score in zip(results, rerank_scores):
            combined_score = 0.3 * orig_score + 0.7 * rerank_score
            reranked.append((idx, combined_score))
        
        # Sort by combined score
        return sorted(reranked, key=lambda x: x[1], reverse=True)
    
    def _mmr_diversity_filter(
        self,
        results: List[Tuple[int, float]],
        lambda_param: float = 0.5,
        top_k: int = 10
    ) -> List[Tuple[int, float]]:
        """Maximal Marginal Relevance for diversity."""
        if not results:
            return []
        
        selected = []
        candidates = results.copy()
        
        # Select first document (highest score)
        selected.append(candidates.pop(0))
        
        while len(selected) < top_k and candidates:
            best_score = -1
            best_idx = -1
            
            for i, (doc_id, score) in enumerate(candidates):
                # Calculate relevance to query
                relevance = score
                
                # Calculate similarity to already selected documents
                max_similarity = 0
                for selected_id, _ in selected:
                    # Simple Jaccard similarity for demonstration
                    # In practice, use embedding similarity
                    set1 = set(self.corpus[doc_id].split())
                    set2 = set(self.corpus[selected_id].split())
                    similarity = len(set1 & set2) / len(set1 | set2)
                    max_similarity = max(max_similarity, similarity)
                
                # MMR score
                mmr_score = lambda_param * relevance - (1 - lambda_param) * max_similarity
                
                if mmr_score > best_score:
                    best_score = mmr_score
                    best_idx = i
            
            if best_idx >= 0:
                selected.append(candidates.pop(best_idx))
        
        return selected
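
To make the flow concrete, here is a minimal usage sketch. It assumes a populated ChromaDB collection named `collection`, an EmbeddingManager instance named `embedder` from the previous section, and `chunked_docs` as a list of {'content': ..., 'metadata': ...} dicts:

pipeline = AdvancedRetrievalPipeline(vector_store=collection, embedding_manager=embedder)
pipeline.index_documents(chunked_docs)

for doc in pipeline.retrieve("how do I set up API key rotation?", top_k=5):
    print(f"{doc.score:.3f}  {doc.metadata.get('filename', '?')}  ({doc.retrieval_method})")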

Query Processing: The Hidden Multiplier

Raw user queries are often poorly formed for retrieval. Implement query processing:

class QueryProcessor:
    """Enhance queries before retrieval."""
    
    def __init__(self, llm_client):
        self.llm = llm_client  # an openai.OpenAI() client, as used elsewhere in this guide
    
    def process_query(self, query: str) -> Dict[str, str]:
        """Analyze and enhance the user query."""
        
        analysis_prompt = f"""Analyze this user query for a knowledge assistant:
        
        Query: {query}
        
        Provide JSON with:
        1. "intent": one of [factual, conceptual, procedural, comparative, exploratory]
        2. "entities": list of key entities mentioned
        3. "expanded_query": reformulated query that's better for retrieval
        4. "sub_queries": if complex, break into 2-3 simpler queries
        5. "time_sensitivity": one of [current, historical, timeless]
        
        Return ONLY valid JSON."""
        
        response = self.llm.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=[{"role": "user", "content": analysis_prompt}],
            temperature=0
        ).choices[0].message.content
        
        try:
            import json
            analysis = json.loads(response)
            
            # Generate expanded queries
            expanded_queries = [analysis['expanded_query']]
            if 'sub_queries' in analysis:
                expanded_queries.extend(analysis['sub_queries'])
            
            return {
                'original_query': query,
                'intent': analysis['intent'],
                'entities': analysis['entities'],
                'expanded_queries': expanded_queries,
                'time_sensitivity': analysis['time_sensitivity']
            }
        except (json.JSONDecodeError, KeyError):
            # Fallback to simple expansion
            return {
                'original_query': query,
                'intent': 'unknown',
                'entities': [],
                'expanded_queries': [query],
                'time_sensitivity': 'timeless'
            }
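
A quick look at what this produces in practice (assuming the same OpenAI client used elsewhere in this guide):

import openai

processor = QueryProcessor(openai.OpenAI())
processed = processor.process_query("compare chunking strategies for code versus prose")

print(processed['intent'])            # e.g. "comparative"
print(processed['expanded_queries'])  # reformulated query plus any sub-queries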

Building the Complete RAG Agent

Now let's assemble everything into a working personal knowledge assistant:

import os
from typing import List, Dict, Optional
from datetime import datetime

class PersonalKnowledgeAssistant:
    """A RAG-based personal knowledge assistant."""
    
    def __init__(self, knowledge_base_path: str):
        self.kb_path = knowledge_base_path
        
        # Initialize components
        self.ingestion = DocumentIngestionPipeline(
            chunk_size=800,
            chunk_overlap=150
        )
        self.embedding_manager = HybridEmbeddingManager()
        
        # Use ChromaDB for vector storage (simple, local)
        import chromadb
        self.chroma_client = chromadb.PersistentClient(path="./chroma_db")
        self.collection = self.chroma_client.get_or_create_collection(
            name="personal_kb",
            metadata={"hnsw:space": "cosine"}
        )
        
        # Initialize retrieval pipeline
        self.retrieval = AdvancedRetrievalPipeline(
            vector_store=self.collection,
            embedding_manager=self.embedding_manager
        )
        
        # LLM for generation
        import openai
        self.llm = openai.OpenAI()
        
        # Query processor
        self.query_processor = QueryProcessor(self.llm)
        
        # Load or build knowledge base
        self._initialize_knowledge_base()
    
    def _initialize_knowledge_base(self):
        """Load existing KB or build from documents."""
        if self.collection.count() == 0:
            print("Building knowledge base from documents...")
            self.ingest_knowledge_base()
        else:
            print(f"Loaded existing knowledge base with {self.collection.count()} documents")
    
    def ingest_knowledge_base(self):
        """Ingest all documents from the knowledge base path."""
        documents = self.ingestion.ingest_directory(self.kb_path)
        
        # Prepare for ChromaDB
        ids = [f"doc_{i}" for i in range(len(documents))]
        texts = [doc.content for doc in documents]
        metadatas = [doc.metadata for doc in documents]
        
        # Get embeddings
        embeddings = self.embedding_manager.embed_documents(texts)
        
        # Add to ChromaDB
        self.collection.add(
            ids=ids,
            embeddings=embeddings.tolist(),
            documents=texts,
            metadatas=metadatas
        )
        
        print(f"Ingested {len(documents)} document chunks")
        
        # Also prepare BM25 index
        self.retrieval.index_documents([
            {'content': doc.content, 'metadata': doc.metadata}
            for doc in documents
        ])
    
    def ask(self, question: str, include_sources: bool = True) -> Dict:
        """Ask a question and get an answer with sources."""
        
        # Process the query
        processed_query = self.query_processor.process_query(question)
        
        # Retrieve relevant documents
        all_results = []
        for expanded_query in processed_query['expanded_queries']:
            results = self.retrieval.retrieve(expanded_query, top_k=5)
            all_results.extend(results)
        
        # Deduplicate results
        seen_ids = set()
        unique_results = []
        for result in all_results:
            if result.metadata['chunk_id'] not in seen_ids:
                seen_ids.add(result.metadata['chunk_id'])
                unique_results.append(result)
        
        # Sort by score
        unique_results.sort(key=lambda x: x.score, reverse=True)
        top_results = unique_results[:5]
        
        # Generate answer
        context = "\n\n---\n\n".join([
            f"Source: {r.metadata.get('filename', 'Unknown')}\n{r.content}"
            for r in top_results
        ])
        
        answer_prompt = f"""Based on the following context, answer the user's question.
        
        Context:
        {context}
        
        Question: {question}
        
        Instructions:
        1. Answer based ONLY on the provided context
        2. If the context doesn't contain the answer, say so
        3. Include specific references to sources when possible
        4. Be concise but thorough
        
        Answer:"""
        
        response = self.llm.chat.completions.create(
            model="gpt-4-turbo-preview",
            messages=[
                {"role": "system", "content": "You are a helpful personal knowledge assistant."},
                {"role": "user", "content": answer_prompt}
            ],
            temperature=0.3
        )
        
        answer = response.choices[0].message.content
        
        # Prepare response
        result = {
            'question': question,
            'answer': answer,
            'sources': [],
            'metadata': {
                'intent': processed_query['intent'],
                'retrieval_count': len(unique_results),
                'top_score': top_results[0].score if top_results else 0
            }
        }
        
        if include_sources:
            result['sources'] = [
                {
                    'content': r.content[:200] + "..." if len(r.content) > 200 else r.content,
                    'source': r.metadata.get('source', 'Unknown'),
                    'score': r.score,
                    'chunk_id': r.metadata.get('chunk_id')
                }
                for r in top_results[:3]
            ]
        
        return result
    
    def add_document(self, file_path: str):
        """Add a single document to the knowledge base."""
        documents = self.ingestion.ingest_directory(os.path.dirname(file_path))
        
        # Filter to just the new document (normalize paths so the comparison is reliable)
        target = os.path.abspath(file_path)
        new_docs = [d for d in documents if os.path.abspath(d.metadata['source']) == target]
        
        if new_docs:
            # Add to vector store
            ids = [f"doc_{self.collection.count() + i}" for i in range(len(new_docs))]
            texts = [doc.content for doc in new_docs]
            metadatas = [doc.metadata for doc in new_docs]
            embeddings = self.embedding_manager.embed_documents(texts)
            
            self.collection.add(
                ids=ids,
                embeddings=embeddings.tolist(),
                documents=texts,
                metadatas=metadatas
            )
            
            # Update BM25 index
            self.retrieval.index_documents([
                {'content': doc.content, 'metadata': doc.metadata}
                for doc in new_docs
            ])
            
            return len(new_docs)
        
        return 0

Practical Applications and Use Cases

Here's how to use this assistant for real personal knowledge tasks:

1. Research Assistant

# Initialize with your research papers
assistant = PersonalKnowledgeAssistant("./research_papers")

# Ask complex research questions
response = assistant.ask(
    "What are the main limitations of transformer models for long documents, "
    "and what alternatives have been proposed in the last two years?"
)

print(response['answer'])
print("\nSources:")
for source in response['sources']:
    print(f"- {source['source']} (score: {source['score']:.2f})")

2. Code Knowledge Base

# Index your code repositories
assistant = PersonalKnowledgeAssistant("./code_repos")

# Ask coding questions
response = assistant.ask(
    "How do I implement rate limiting in our FastAPI backend? "
    "Show me examples from our existing codebase."
)

# The assistant will find relevant code snippets and explain patterns

3. Personal Wiki

# Index your notes, bookmarks, and documents
assistant = PersonalKnowledgeAssistant("./my_notes")

# Connect disparate knowledge
response = assistant.ask(
    "What connections exist between my notes on machine learning "
    "and my project management documents?"
)

Performance Optimization Tips

1. Caching Strategy

class CachedRAGAssistant:
    """Add caching to reduce latency and cost."""
    
    def __init__(self, base_assistant):
        self.assistant = base_assistant
        self.cache = {}  # In production, use Redis or similar
        
    def ask(self, question: str, **kwargs) -> Dict:
        # Create cache key from question
        cache_key = hashlib.md5(question.lower().strip().encode()).hexdigest()
        
        if cache_key in self.cache:
            return self.cache[cache_key]
        
        # Get fresh answer
        result = self.assistant.ask(question, **kwargs)
        
        # Cache the result with a timestamp (a TTL check is sketched below)
        result['cached_at'] = datetime.now().isoformat()
        self.cache[cache_key] = result
        
        return result
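
The dictionary above never evicts anything. A minimal expiry check under the same cache layout might look like this (hypothetical helper; in production a Redis TTL handles this for you):

from datetime import datetime, timedelta

def get_if_fresh(cache: dict, key: str, max_age: timedelta = timedelta(hours=1)):
    """Return a cached answer only if it is younger than max_age."""
    entry = cache.get(key)
    if entry is None:
        return None
    if datetime.now() - datetime.fromisoformat(entry['cached_at']) > max_age:
        cache.pop(key, None)  # stale: drop it and force a fresh retrieval
        return None
    return entry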

2. Incremental Updates

Don't rebuild your entire knowledge base when adding documents:

def incremental_update(self, new_documents_path: str):
    """Add new documents without rebuilding entire index."""
    new_docs = self.ingestion.ingest_directory(new_documents_path)
    
    # Check which documents are truly new
    existing_sources = set()
    for metadata in self.collection.get()['metadatas']:
        existing_sources.add(metadata['source'])
    
    truly_new = [d for d in new_docs if d.metadata['source'] not in existing_sources]
    
    if truly_new:
        # Add only the new chunks, reusing the same embed-and-add path as add_document
        ids = [f"doc_{self.collection.count() + i}" for i in range(len(truly_new))]
        texts = [d.content for d in truly_new]
        metadatas = [d.metadata for d in truly_new]
        embeddings = self.embedding_manager.embed_documents(texts)
        
        self.collection.add(
            ids=ids,
            embeddings=embeddings.tolist(),
            documents=texts,
            metadatas=metadatas
        )
        return len(truly_new)
    
    return 0
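
If you want this to run hands-off, a simple stdlib polling loop is enough for a personal setup (sketch; assumes incremental_update has been added to PersonalKnowledgeAssistant as above, and the interval is up to you):

import time

def watch_and_update(assistant, path: str, interval_seconds: int = 300):
    """Periodically fold new files from `path` into the knowledge base."""
    while True:
        added = assistant.incremental_update(path)
        if added:
            print(f"Indexed {added} new document chunks")
        time.sleep(interval_seconds)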

Limitations and Honest Assessment

After building several RAG systems, here's what I've learned works and what doesn't:

What works well:

  • Factual Q&A over structured documents
  • Code search and explanation
  • Connecting information across multiple documents
  • Maintaining context across conversations

What's challenging:

  • Complex reasoning requiring multiple inference steps
  • Temporal reasoning (what changed between versions)
  • Understanding implicit relationships not stated in text
  • Handling contradictory information in the knowledge base

When RAG is the wrong tool:

  • When you need real-time information (use web search instead)
  • For creative writing or open-ended generation
  • When documents are highly confidential and can't be processed by LLMs
  • For numerical calculations or precise data lookups (use databases)

Conclusion

Building a personal knowledge assistant with RAG is an iterative process. Start simple with basic vector search, then add complexity as you identify specific pain points. The most important investments are:

  1. Quality ingestion: Spend time getting document processing right
  2. Smart chunking: Experiment with different chunk sizes for your content
  3. Hybrid retrieval: Combine vector and keyword search for robustness
  4. Query understanding: Process user questions before retrieval
  5. Source attribution: Always show where answers come from

The complete code from this guide is available on GitHub. Start with a small knowledge base of 50-100 documents, iterate on the retrieval pipeline, and you'll build a genuinely useful personal knowledge assistant that grows with your needs.

Remember: the goal isn't to build a perfect system, but to build one that's better than searching through files manually. Even a 70% accurate retrieval system can save hours of manual searching when it surfaces the right information at the right time.

Keywords

AI agent, research agents