Building a Customer Support Agent That Resolves 70% of Tickets
Emma Liu
Tech journalist covering the AI agent ecosystem and startups.
Most customer support agents fail not because the LLM is bad, but because the architecture around it is naive. A chatbot that can only answer questions from a static FAQ page will hallucinate, frustrate users, and generate tickets that are harder to resolve than if the customer had never contacted support at all.
This tutorial walks through building a support agent that actually works in production — one that knows when it doesn't know, integrates with your existing systems, escalates intelligently, and gets better over time. We'll use real tools, real code, and real tradeoffs.
The Architecture That Actually Works
Before touching any code, let's establish the system design. A production support agent isn't a single prompt — it's a pipeline.
┌──────────────────────────────────────────────────────────────────┐
│                         Customer Message                         │
└─────────────────────────────────┬────────────────────────────────┘
                                  ▼
┌──────────────────────────────────────────────────────────────────┐
│                       Intent Classification                      │
│             (deterministic routing, not LLM guessing)            │
└─────────────┬───────────────────┬────────────────────┬───────────┘
              ▼                   ▼                    ▼
      ┌──────────────┐     ┌─────────────┐     ┌──────────────────┐
      │  Knowledge   │     │  Account /  │     │    Escalation    │
      │  Retrieval   │     │ Order Tool  │     │     to Human     │
      │    (RAG)     │     │    Calls    │     │                  │
      └──────┬───────┘     └──────┬──────┘     └──────────────────┘
             ▼                    ▼
         ┌─────────────────────────────┐
         │     Response Generation     │
         │   (constrained, grounded)   │
         └──────────────┬──────────────┘
                        ▼
         ┌─────────────────────────────┐
         │  Quality Gate / Guardrails  │
         └──────────────┬──────────────┘
                        ▼
                Response to User
Each layer is independently testable and replaceable. That's the key insight — don't build a monolith.
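One way to enforce that separation in code is to give every layer the same narrow interface. Here's a minimal sketch, assuming a shared context object that each stage reads and writes (the `Ctx` and `Stage` names are illustrative, not from any particular framework):

from dataclasses import dataclass, field
from typing import Protocol

@dataclass
class Ctx:
    """Everything a stage might read or write for one customer message."""
    message: str
    intent: str | None = None
    retrieved: list[str] = field(default_factory=list)
    response: str | None = None

class Stage(Protocol):
    async def run(self, ctx: Ctx) -> Ctx: ...

async def run_pipeline(stages: list[Stage], ctx: Ctx) -> Ctx:
    # Each stage is swappable and unit-testable with a hand-built Ctx
    for stage in stages:
        ctx = await stage.run(ctx)
    return ctx

With this shape, "replace the retriever" or "test the quality gate in isolation" is just swapping one element of the stages list.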
Part 1: Knowledge Base Setup
Choosing Your RAG Stack
The knowledge base is where most of your accuracy comes from. Here's what I've seen work and what doesn't.
What works:
- Chunking by semantic section (not arbitrary token counts)
- Metadata filtering (product version, customer tier, region)
- Hybrid search (vector + keyword) for technical content
What doesn't work:
- Dumping your entire docs site into one vector store
- Using a single embedding model for all content types
- Ignoring document freshness — outdated answers are worse than no answer
Let's build this properly with a concrete example. We'll use LangChain for orchestration, Qdrant for vector storage (it supports hybrid search natively), and Cohere's embed-v3 for embeddings (it handles multilingual content well).
Document Processing Pipeline
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_cohere import CohereEmbeddings
from langchain_community.document_loaders import DirectoryLoader
from langchain_qdrant import FastEmbedSparse, QdrantVectorStore, RetrievalMode
from qdrant_client import QdrantClient, models
import hashlib

def build_knowledge_base(docs_path: str, collection_name: str):
    """Build a knowledge base with proper chunking and metadata."""
    # Load documents - use appropriate loaders per file type
    loader = DirectoryLoader(
        docs_path,
        glob="**/*.md",
        show_progress=True,
        use_multithreading=True,
    )
    documents = loader.load()

    # Semantic-aware chunking
    # Critical: chunk by headers for docs, not by arbitrary token count
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,
        chunk_overlap=100,
        separators=["\n## ", "\n### ", "\n\n", "\n", " "],
    )
    chunks = splitter.split_documents(documents)

    # Enrich metadata - this is what enables filtering later
    for chunk in chunks:
        # Generate a stable ID for deduplication
        chunk.metadata["content_hash"] = hashlib.md5(
            chunk.page_content.encode()
        ).hexdigest()
        # Extract product area from the file path
        # e.g., docs/billing/refunds.md → "billing"
        path_parts = chunk.metadata.get("source", "").split("/")
        chunk.metadata["product_area"] = (
            path_parts[-2] if len(path_parts) > 1 else "general"
        )
        # Tag document type (classify_doc_type is your own helper,
        # e.g. a rules-based check for "faq" / "how-to" / "policy")
        chunk.metadata["doc_type"] = classify_doc_type(chunk.page_content)

    # Set up Qdrant with hybrid search
    client = QdrantClient(host="localhost", port=6333)

    # Create collection with both dense and sparse vectors
    client.create_collection(
        collection_name=collection_name,
        vectors_config={
            "dense": models.VectorParams(
                size=1024,  # Cohere embed-v3 dimension
                distance=models.Distance.COSINE,
            )
        },
        sparse_vectors_config={
            "sparse": models.SparseVectorParams(
                index=models.SparseIndexParams(on_disk=False)
            )
        },
    )

    return QdrantVectorStore.from_documents(
        documents=chunks,
        embedding=CohereEmbeddings(model="embed-english-v3.0"),
        sparse_embedding=FastEmbedSparse(model_name="Qdrant/bm25"),
        collection_name=collection_name,
        host="localhost",
        retrieval_mode=RetrievalMode.HYBRID,  # This is the key
        vector_name="dense",
        sparse_vector_name="sparse",
    )
Why Hybrid Search Matters
Pure vector search struggles with exact terminology. When a customer asks about "error code 4217," vector similarity will return semantically similar content — maybe error handling docs in general. But the customer needs the specific error code page.
Hybrid search combines dense vectors (semantic understanding) with sparse vectors (BM25-style keyword matching). In benchmarks on support content, I've seen hybrid search improve retrieval accuracy by 15-25% over pure vector search, especially for:
- Error codes and specific identifiers
- Product names and SKUs
- Policy references (e.g., "30-day return policy")
# Retrieval with metadata filtering and reranking
from langchain_cohere import CohereRerank
from qdrant_client import models

def retrieve_context(
    query: str,
    vector_store: QdrantVectorStore,
    product_area: str | None = None,
    top_k: int = 10,
    rerank_top_n: int = 3,
) -> list[str]:
    """Retrieve and rerank knowledge base context."""
    # Build filter if product area is known
    search_kwargs = {"k": top_k}
    if product_area:
        search_kwargs["filter"] = models.Filter(
            must=[
                models.FieldCondition(
                    key="metadata.product_area",
                    match=models.MatchValue(value=product_area),
                )
            ]
        )

    # Retrieve candidates
    retriever = vector_store.as_retriever(search_kwargs=search_kwargs)
    candidates = retriever.invoke(query)

    # Rerank with Cohere - cheap, and it dramatically improves precision
    reranker = CohereRerank(model="rerank-english-v3.0", top_n=rerank_top_n)
    reranked = reranker.rerank(
        query=query,
        documents=[c.page_content for c in candidates],
        top_n=rerank_top_n,
    )
    return [candidates[r["index"]].page_content for r in reranked]
The reranking step is non-negotiable. Retrieval is recall-oriented (find everything that might be relevant). Reranking is precision-oriented (rank the most relevant first). Skipping it means your LLM gets noisy context and produces worse answers.
Part 2: Tool Integration
A support agent that can only answer questions from documentation is useful but limited. The real value comes from integrating with your actual systems — checking order status, issuing refunds, updating accounts.
Tool Design Principles
- Tools should be atomic and predictable. One tool = one action. Don't build `check_order_and_update_shipping`; build `get_order` and `update_shipping_address` separately.
- Tools should return structured data, not prose. The LLM will generate the prose.
- Tools should fail gracefully. Every tool needs error handling that the agent can reason about.
Building the Tool Layer
from langchain_core.tools import tool
from pydantic import BaseModel, Field
from typing import Optional
import httpx

# Define strict input schemas - this prevents the LLM from
# hallucinating parameter names
class OrderLookupInput(BaseModel):
    order_id: str = Field(description="The order ID, e.g., ORD-12345")
    email: Optional[str] = Field(
        default=None,
        description="Customer email for verification",
    )

@tool("lookup_order", args_schema=OrderLookupInput)
async def lookup_order(order_id: str, email: Optional[str] = None) -> dict:
    """Look up order details including status, tracking, and items.

    Use this when a customer asks about their order, shipping status,
    or wants to make changes to an existing order."""
    async with httpx.AsyncClient() as client:
        try:
            response = await client.get(
                f"https://api.internal.example.com/orders/{order_id}",
                # get_service_token() is your own service-auth helper
                headers={"Authorization": f"Bearer {get_service_token()}"},
                timeout=5.0,
            )
            if response.status_code == 404:
                return {
                    "error": "order_not_found",
                    "message": f"No order found with ID {order_id}. "
                    "Please verify the order ID.",
                }
            response.raise_for_status()
            order = response.json()

            # Verify email if provided (prevents social engineering)
            if email and order["customer_email"].lower() != email.lower():
                return {
                    "error": "verification_failed",
                    "message": "The email provided doesn't match this order.",
                }

            # Return structured data, not a formatted string
            return {
                "order_id": order["id"],
                "status": order["status"],
                "items": order["items"],
                "shipping": {
                    "carrier": order["shipping"]["carrier"],
                    "tracking_number": order["shipping"]["tracking"],
                    "estimated_delivery": order["shipping"]["eta"],
                },
                "total": order["total"],
                "can_cancel": order["status"] in ["pending", "processing"],
                "can_return": order["status"] == "delivered",
            }
        except httpx.TimeoutException:
            return {
                "error": "service_unavailable",
                "message": "Order system is temporarily unavailable.",
            }
class RefundInput(BaseModel):
    order_id: str = Field(description="The order ID to refund")
    reason: str = Field(description="Reason for the refund")
    amount: Optional[float] = Field(
        default=None,
        description="Partial refund amount. None means full refund.",
    )
    line_items: Optional[list[str]] = Field(
        default=None,
        description="Specific line item IDs to refund. None means all.",
    )

@tool("process_refund", args_schema=RefundInput)
async def process_refund(
    order_id: str,
    reason: str,
    amount: Optional[float] = None,
    line_items: Optional[list[str]] = None,
) -> dict:
    """Process a refund for an order. Can handle full or partial refunds.

    IMPORTANT: Only use this tool AFTER confirming the refund details
    with the customer. Never process a refund without explicit confirmation."""
    async with httpx.AsyncClient() as client:
        try:
            response = await client.post(
                "https://api.internal.example.com/refunds",
                json={
                    "order_id": order_id,
                    "reason": reason,
                    "amount": amount,
                    "line_items": line_items,
                },
                headers={"Authorization": f"Bearer {get_service_token()}"},
                timeout=10.0,
            )
        except httpx.TimeoutException:
            # Fail gracefully, as promised above - return something
            # the agent can reason about instead of raising
            return {
                "error": "service_unavailable",
                "message": "Refund system is temporarily unavailable.",
            }
        if response.status_code == 422:
            return {
                "error": "refund_not_allowed",
                "message": response.json()["detail"],
            }
        response.raise_for_status()
        refund = response.json()
        return {
            "refund_id": refund["id"],
            "amount": refund["amount"],
            "status": refund["status"],
            "estimated_processing_days": refund["processing_days"],
        }
The System Prompt That Ties It Together
Tools alone aren't enough. The system prompt needs to establish clear behavioral boundaries:
SUPPORT_AGENT_SYSTEM_PROMPT = """You are a customer support agent for ExampleCo.
## Core Rules
1. NEVER fabricate information. If you don't know, say so.
2. NEVER process refunds or account changes without explicit customer confirmation.
3. ALWAYS verify customer identity before accessing account details.
4. If a customer is frustrated, acknowledge their frustration before solving the problem.
5. Keep responses concise. Customers want answers, not essays.
## Knowledge
You have access to our knowledge base covering products, policies, and procedures.
Only answer based on retrieved context. If the context doesn't contain the answer,
say: "I don't have that information in my system. Let me connect you with someone
who can help."
## Tools
You have access to tools for looking up orders, processing refunds, and managing
accounts. Always explain what you're doing before using a tool that modifies data.
## Escalation
Escalate to a human agent when:
- The customer explicitly requests a human
- The issue involves legal matters, fraud, or account security
- You've failed to resolve the issue after 2 attempts
- The customer has been waiting more than 5 minutes in this conversation
- The issue requires access to systems you don't have tools for
When escalating, summarize the issue and what you've already tried.
"""
Part 3: Escalation Logic
This is where most support agents fail. Bad escalation means either (a) the bot keeps the customer trapped in a loop when it can't help, or (b) it gives up and escalates too early, defeating the purpose of automation.
Multi-Signal Escalation Detection
Don't rely on a single signal. Combine multiple indicators:
from dataclasses import dataclass, field
from enum import Enum
from langchain_openai import ChatOpenAI

class EscalationSignal(Enum):
    EXPLICIT_REQUEST = "explicit_request"      # "let me talk to a human"
    NEGATIVE_SENTIMENT = "negative_sentiment"  # frustration detected
    REPETITION_LOOP = "repetition_loop"        # same issue, multiple attempts
    OUT_OF_SCOPE = "out_of_scope"              # no tools or knowledge available
    HIGH_STAKES = "high_stakes"                # fraud, legal, safety
    TIMEOUT = "timeout"                        # conversation too long

@dataclass
class EscalationState:
    attempts: int = 0
    sentiment_scores: list[float] = field(default_factory=list)
    tools_used: list[str] = field(default_factory=list)
    knowledge_retrieved: bool = False
    explicit_request: bool = False
    high_stakes_detected: bool = False

    def should_escalate(self) -> tuple[bool, str | None]:
        """Determine if we should escalate and why."""
        # Hard rules - always escalate
        if self.explicit_request:
            return True, EscalationSignal.EXPLICIT_REQUEST.value
        if self.high_stakes_detected:
            return True, EscalationSignal.HIGH_STAKES.value

        # Soft rules - escalate based on patterns
        if self.attempts >= 3:
            return True, EscalationSignal.REPETITION_LOOP.value
        if (
            len(self.sentiment_scores) >= 2
            and all(s < 0.3 for s in self.sentiment_scores[-2:])
        ):
            return True, EscalationSignal.NEGATIVE_SENTIMENT.value
        # Only treat "nothing retrieved, no tools used" as out of scope
        # after at least one answer attempt - otherwise the very first
        # message would escalate before retrieval has even run
        if self.attempts > 0 and not self.knowledge_retrieved and not self.tools_used:
            return True, EscalationSignal.OUT_OF_SCOPE.value
        return False, None
async def detect_escalation_signals(
    message: str,
    conversation_history: list[dict],
    state: EscalationState,
) -> EscalationState:
    """Update escalation state based on the latest message."""
    # Use a small, fast model for classification tasks
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

    # Check for explicit escalation request
    classification = await llm.ainvoke([
        {"role": "system", "content": """Classify this customer message.
Respond with ONLY one of:
- ESCALATE: customer explicitly wants a human agent
- CONTINUE: customer is asking a question or describing an issue
- HIGH_STAKES: involves fraud, legal action, safety concern, or data breach

Examples of ESCALATE: "speak to a manager", "I want a real person",
"transfer me to an agent"
"""},
        {"role": "user", "content": message},
    ])
    result = classification.content.strip().upper()
    if result == "ESCALATE":
        state.explicit_request = True
    elif result == "HIGH_STAKES":
        state.high_stakes_detected = True

    # Sentiment analysis - lightweight, runs on every message
    sentiment = await llm.ainvoke([
        {"role": "system", "content": """Rate the sentiment of this message
from 0.0 (extremely negative/frustrated) to 1.0 (very positive/happy).
Respond with ONLY a number."""},
        {"role": "user", "content": message},
    ])
    try:
        state.sentiment_scores.append(float(sentiment.content.strip()))
    except ValueError:
        pass  # Don't crash on malformed LLM output
    return state
The Escalation Handoff
When you escalate, don't just dump the customer into a queue. Provide context:
async def escalate_to_human(
    conversation_id: str,
    state: EscalationState,
    reason: str,
    conversation_history: list[dict],
) -> dict:
    """Escalate to a human agent with full context."""
    # Generate a summary for the human agent
    llm = ChatOpenAI(model="gpt-4o", temperature=0)
    summary = await llm.ainvoke([
        {"role": "system", "content": """Summarize this support conversation
for a human agent taking over. Include:
1. The customer's issue in one sentence
2. What the bot has already tried
3. Key details (order numbers, account info, etc.)
4. The customer's emotional state
Keep it under 150 words."""},
        {"role": "user", "content": str(conversation_history)},
    ])

    # Create handoff ticket in your helpdesk
    async with httpx.AsyncClient() as client:
        ticket = await client.post(
            "https://api.internal.example.com/tickets",
            json={
                "conversation_id": conversation_id,
                "priority": "high" if state.high_stakes_detected else "normal",
                "escalation_reason": reason,
                "bot_summary": summary.content,
                "full_history": conversation_history,
                "customer_sentiment_trend": state.sentiment_scores,
                "attempted_resolutions": state.attempts,
            },
            headers={"Authorization": f"Bearer {get_service_token()}"},
        )
        ticket.raise_for_status()

    return {
        "message": "I'm connecting you with a support specialist who can help "
        "with this. They'll have the full context of our conversation, "
        "so you won't need to repeat yourself.",
        "ticket_id": ticket.json()["id"],
        "estimated_wait": ticket.json()["estimated_wait_minutes"],
    }
Part 4: Quality Metrics and Monitoring
You can't improve what you don't measure. Here are the metrics that actually matter for support agents.
The Metrics That Matter
| Metric | What It Measures | Target | How to Measure |
|---|---|---|---|
| Resolution Rate | % of conversations resolved without human escalation | 60-75% | Track escalation outcomes |
| Accuracy Rate | % of answers that are factually correct | >95% | LLM-as-judge + human sampling |
| Customer Satisfaction (CSAT) | Post-interaction survey score | >4.0/5.0 | Automated survey after resolution |
| Escalation Quality | Did escalation include proper context? | >90% | Audit escalation handoffs |
| Hallucination Rate | % of responses containing fabricated info | <2% | RAG faithfulness checks |
| Average Handle Time | Time from first message to resolution | <3 min | Timestamp tracking |
| Tool Success Rate | % of tool calls that return valid data | >98% | Error logging |
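Most of these reduce to simple ratios over logged events. As a minimal sketch (the event shapes here are illustrative; in practice they'd come from the logging calls shown later in this section):

def compute_daily_metrics(events: list[dict]) -> dict:
    """Aggregate logged events into the headline ratios above."""
    closed = [e for e in events if e["type"] == "conversation_closed"]
    tool_calls = [e for e in events if e["type"] == "tool_call"]
    resolved = sum(1 for c in closed if not c["escalated"])
    tool_ok = sum(1 for t in tool_calls if "error" not in t["result"])
    return {
        "resolution_rate": resolved / len(closed) if closed else None,
        "tool_success_rate": tool_ok / len(tool_calls) if tool_calls else None,
    }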
Implementing a Quality Gate
Every response should pass through a quality check before reaching the customer:
import json

from pydantic import BaseModel

class QualityCheckResult(BaseModel):
    passed: bool
    faithfulness_score: float  # 0-1: how grounded in retrieved context
    relevance_score: float     # 0-1: how relevant to the question
    safety_score: float        # 0-1: no harmful content
    issues: list[str]          # specific problems found

async def quality_gate(
    question: str,
    response: str,
    retrieved_context: list[str],
    llm: ChatOpenAI,
) -> QualityCheckResult:
    """Check response quality before sending to customer."""
    context_text = "\n---\n".join(retrieved_context)
    evaluation = await llm.ainvoke([
        {"role": "system", "content": """You are a quality assurance evaluator
for customer support responses. Evaluate the response against these criteria:

1. FAITHFULNESS (0-1): Is every claim in the response supported by the
   provided context? Score 0 if any claim is fabricated.
2. RELEVANCE (0-1): Does the response directly address the customer's
   question? Score 0 if it's off-topic.
3. SAFETY (0-1): Is the response safe? Score 0 if it contains harmful
   advice, exposes internal systems, or shares other customers' data.

Respond in this exact JSON format:
{
  "faithfulness": 0.0-1.0,
  "relevance": 0.0-1.0,
  "safety": 0.0-1.0,
  "issues": ["list of specific problems"]
}"""},
        {"role": "user", "content": f"""Customer question: {question}

Retrieved context:
{context_text}

Agent response:
{response}"""},
    ])
    result = json.loads(evaluation.content)
    return QualityCheckResult(
        passed=(
            result["faithfulness"] >= 0.8
            and result["relevance"] >= 0.7
            and result["safety"] >= 0.95
        ),
        faithfulness_score=result["faithfulness"],
        relevance_score=result["relevance"],
        safety_score=result["safety"],
        issues=result["issues"],
    )
The Feedback Loop
Metrics are useless without a feedback loop. Here's a practical system:
async def handle_with_quality_loop(
    message: str,
    conversation_id: str,
    state: ConversationState,  # your own session object: history, vector store, llm, escalation state
) -> str:
    """Full support pipeline with quality checking."""
    # 1. Check escalation signals
    state.escalation = await detect_escalation_signals(
        message, state.history, state.escalation
    )
    should_esc, reason = state.escalation.should_escalate()
    if should_esc:
        handoff = await escalate_to_human(
            conversation_id, state.escalation, reason, state.history
        )
        return handoff["message"]

    # 2. Retrieve context
    context = retrieve_context(
        message, state.vector_store, product_area=state.detected_product_area
    )
    if context:
        state.escalation.knowledge_retrieved = True

    # 3. Generate response (generate_response is your grounded-generation step)
    response = await generate_response(message, context, state)

    # 4. Quality gate
    quality = await quality_gate(message, response, context, state.llm)
    if not quality.passed:
        # Log the failure for analysis (log_quality_failure is your own sink)
        await log_quality_failure(
            conversation_id=conversation_id,
            question=message,
            response=response,
            quality_result=quality,
        )
        # If faithfulness failed, we might be hallucinating
        if quality.faithfulness_score < 0.5:
            return ("I want to make sure I give you accurate information. "
                    "Let me connect you with a specialist who can help.")
        # If relevance failed, try rephrasing the query
        if quality.relevance_score < 0.5:
            state.escalation.attempts += 1
            return ("I may have misunderstood your question. "
                    "Could you tell me more about what you're looking for?")

    # 5. Log success for monitoring
    await log_interaction(
        conversation_id=conversation_id,
        message=message,
        response=response,
        context_used=context,
        quality=quality,
    )
    return response
Dashboard Metrics to Track
Set up a real-time dashboard (we use Grafana, but anything works):
┌─────────────────────────────────────────────────────────┐
│                 Support Agent Dashboard                 │
├──────────────┬──────────────┬───────────────────────────┤
│  Resolution  │  CSAT Score  │   Active Conversations    │
│    68.3%     │   4.2/5.0    │            142            │
│   ▲ +2.1%    │   ▲ +0.1     │                           │
├──────────────┴──────────────┴───────────────────────────┤
│             Escalation Reasons (last 24h)               │
│  ████████████████░░░░  Explicit request     42%         │
│  ██████████░░░░░░░░░░  Repetition loop      28%         │
│  █████░░░░░░░░░░░░░░░  Out of scope         15%         │
│  ████░░░░░░░░░░░░░░░░  Negative sentiment   10%         │
│  ██░░░░░░░░░░░░░░░░░░  High stakes           5%         │
├─────────────────────────────────────────────────────────┤
│  Hallucination Rate: 1.2%  │  Tool Success Rate: 98.7%  │
│  Avg Handle Time: 2m 34s   │  Knowledge Coverage: 84%   │
└─────────────────────────────────────────────────────────┘
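To feed a dashboard like this, the pipeline has to emit metrics somewhere a scraper can reach them. A minimal sketch using prometheus_client (the metric names are illustrative; Grafana would then read them through Prometheus):

from prometheus_client import Counter, Histogram, start_http_server

RESOLVED = Counter("support_resolved_total", "Conversations resolved without escalation")
ESCALATED = Counter("support_escalated_total", "Escalations by reason", ["reason"])
HANDLE_TIME = Histogram("support_handle_seconds", "Time from first message to resolution")

def record_outcome(escalated: bool, reason: str | None, duration_s: float) -> None:
    """Call once per closed conversation."""
    HANDLE_TIME.observe(duration_s)
    if escalated:
        ESCALATED.labels(reason=reason or "unknown").inc()
    else:
        RESOLVED.inc()

start_http_server(9100)  # exposes /metrics for Prometheus to scrape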
Putting It All Together
Here's the main orchestration loop:
from langchain_openai import ChatOpenAI
from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

def create_support_agent(vector_store, tools):
    """Create the full support agent."""
    llm = ChatOpenAI(
        model="gpt-4o",
        temperature=0.1,  # Low but not zero - some variation feels natural
        streaming=True,   # Important for UX - show responses as they generate
    )
    prompt = ChatPromptTemplate.from_messages([
        ("system", SUPPORT_AGENT_SYSTEM_PROMPT),
        MessagesPlaceholder(variable_name="chat_history"),
        ("human", "{input}"),
        MessagesPlaceholder(variable_name="agent_scratchpad"),
    ])
    agent = create_openai_tools_agent(llm, tools, prompt)
    return AgentExecutor(
        agent=agent,
        tools=tools,
        max_iterations=5,  # Prevent infinite tool loops
        handle_parsing_errors=True,
        return_intermediate_steps=True,  # For debugging and monitoring
    )
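Wiring it up looks something like this, run inside an async context and assuming the vector store and tools built earlier:

executor = create_support_agent(vector_store, [lookup_order, process_refund])

result = await executor.ainvoke({
    "input": "Where is my order ORD-12345?",
    "chat_history": [],
})
print(result["output"])              # the customer-facing reply
print(result["intermediate_steps"])  # tool calls, useful for monitoring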
What to Expect
With this architecture, here's realistic performance after tuning:
- Week 1-2: 40-50% resolution rate. Lots of edge cases. Focus on knowledge base gaps.
- Week 3-4: 55-65% resolution rate. Tool integrations stabilize. Escalation logic improves.
- Month 2+: 65-75% resolution rate. Feedback loop catches remaining issues.
The 75% ceiling is real. Beyond that, you're dealing with complex, multi-step issues that genuinely need human judgment. Don't chase 90% automation — it'll make the experience worse for everyone.
Honest Limitations
I'd be doing you a disservice if I didn't mention what's hard:
Latency. RAG retrieval + quality gate + tool calls can add 3-8 seconds per response. Streaming helps perception, but the pipeline is inherently slower than a simple chatbot. Budget for this.
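Note also that the quality gate needs the complete response before it can score it, so in practice you either gate first and then replay the text as a stream, or stream optimistically and retract on failure. The streaming itself is simple; a minimal sketch with ChatOpenAI:

from langchain_openai import ChatOpenAI

async def stream_reply(messages: list[dict]) -> str:
    llm = ChatOpenAI(model="gpt-4o")
    parts: list[str] = []
    async for chunk in llm.astream(messages):
        parts.append(chunk.content)  # push each delta to the client here
    return "".join(parts)            # full text, e.g. for post-hoc quality checks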
Cost. A quality-checked, tool-using agent costs roughly $0.02-0.08 per conversation with GPT-4o. At scale (100k conversations/month), that's $2,000-8,000. Use GPT-4o-mini for classification and sentiment, reserve GPT-4o for response generation.
Knowledge base maintenance. Your docs will go stale. Build automated pipelines to flag outdated content. The best RAG system in the world can't help if your return policy page still says "30 days" when you changed it to "14 days" last month.
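A crude but effective starting point is flagging source files that haven't been touched in a while; the threshold below is an assumption to tune per content type:

from pathlib import Path
import time

STALE_AFTER_DAYS = 180  # illustrative threshold - policies go stale faster than API docs

def flag_stale_docs(docs_path: str) -> list[str]:
    """Return markdown files not modified within the staleness window."""
    cutoff = time.time() - STALE_AFTER_DAYS * 86400
    return [str(p) for p in Path(docs_path).rglob("*.md")
            if p.stat().st_mtime < cutoff]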
Edge cases. Customers will say things like "my order is broken" (product issue? shipping damage? metaphorical complaint about your company?). Plan for ambiguity — don't assume the LLM will always classify correctly.
Evaluation is expensive. The LLM-as-judge quality gate adds cost and latency. For high-volume deployments, run it on a sample (10-20%) rather than every response.
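Sampling is a one-line wrapper around the quality_gate function from earlier; the rate here is illustrative:

import random

QUALITY_SAMPLE_RATE = 0.15  # evaluate roughly 15% of responses

async def maybe_quality_gate(question, response, context, llm):
    if random.random() > QUALITY_SAMPLE_RATE:
        return None  # skip the LLM-as-judge call for most traffic
    return await quality_gate(question, response, context, llm)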
Summary
A high-performing support agent is a system, not a prompt. The key components:
- Knowledge base with hybrid search, reranking, and metadata filtering
- Tool layer with atomic operations, strict schemas, and graceful error handling
- Escalation logic using multiple signals, not just keyword matching
- Quality gates that catch hallucinations and irrelevant responses before they reach customers
- Feedback loops that turn every conversation into training data
Build each layer independently. Test each layer independently. Ship incrementally — a bot that handles "where's my order?" perfectly is more valuable than one that handles everything poorly.