Generative AI without grounding is a confident liar. RAG (Retrieval Augmented Generation) is how you make Azure OpenAI answer based on your data, your policies, your knowledge — not the open internet circa 2023. Here's the production architecture that actually works.
The basic pattern
- User asks a question.
- We embed the question into a vector.
- We retrieve the top-K most similar chunks from our indexed knowledge base.
- We pass the question + retrieved chunks to an LLM with instructions to answer only from the provided context.
- We return the answer with citations to the original sources.
Simple to explain. Genuinely hard to do well at production grade.
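The five steps above can be sketched end to end. This is a toy, assuming stub vectors instead of a real embedding model and a hand-built in-memory index; in production the vectors come from Azure OpenAI embeddings and the index lives in Azure AI Search:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def retrieve(question_vec, index, top_k=3):
    """Return the top-K chunks most similar to the question vector."""
    ranked = sorted(index, key=lambda c: cosine(question_vec, c["vector"]), reverse=True)
    return ranked[:top_k]

def build_prompt(question, chunks):
    """Assemble the grounded prompt: question plus retrieved context with sources."""
    context = "\n\n".join(f"[{c['source']}]\n{c['text']}" for c in chunks)
    return (
        "Answer ONLY from the context below. Cite sources in [brackets].\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# Toy index with hypothetical sources; real vectors have hundreds of dimensions.
index = [
    {"source": "policy.pdf#p3", "text": "Refunds within 30 days.", "vector": [0.9, 0.1]},
    {"source": "specs.pdf#p1", "text": "Widget weighs 2 kg.", "vector": [0.1, 0.9]},
]
top = retrieve([0.8, 0.2], index, top_k=1)
prompt = build_prompt("What is the refund window?", top)
```

The prompt then goes to the chat completions API as the user message, under a system prompt that enforces grounding (more on that below).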
Choose your retrieval engine
For Azure-aligned shops:
- Azure AI Search (formerly Cognitive Search) — first-party, supports hybrid (keyword + vector + semantic ranking), scales, integrates with everything.
- Azure Cosmos DB for MongoDB / PostgreSQL with vector support — when your data already lives there.
- Pinecone / Weaviate / Qdrant — third-party, when you have a specific need.
For 90% of enterprise RAG, Azure AI Search with hybrid retrieval is the right default.
Chunking — the unglamorous but critical step
How you split documents determines retrieval quality more than any other choice:
- Fixed-size chunks (e.g., 800 tokens with 100 overlap) — fast, simple, often "good enough."
- Semantic chunking — split on heading / paragraph / topic boundaries. Better quality, more complex.
- Hierarchical chunking — store both small and large chunks, retrieve small for relevance, return large for context.
- Per-document type tuning — chunking that works for legal contracts won't work for product specs.
Test 3 chunking strategies against a labeled question set. Pick the winner. Iterate.
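The fixed-size strategy is a few lines. A minimal sketch, treating "tokens" as any sequence (in production you'd count real tokens with a tokenizer such as tiktoken):

```python
def chunk_fixed(tokens, size=800, overlap=100):
    """Split a token sequence into fixed-size chunks, each sharing
    `overlap` tokens with the previous chunk so context isn't cut mid-thought."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

chunks = chunk_fixed(list(range(2000)))  # 2000 stand-in "tokens" -> 3 chunks
```

The overlap is what keeps a sentence that straddles a boundary retrievable from at least one chunk; tune `size` and `overlap` per document type.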
Embeddings — pick consistently
- text-embedding-3-large for highest quality (English and multilingual).
- text-embedding-3-small for cost-sensitive workloads where quality is acceptable.
- Record the embedding model name and version with every chunk; re-embedding the whole corpus when you upgrade is non-trivial, so make it traceable.
Don't mix embedding models in the same index. Ever.
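One cheap way to enforce that rule is to stamp the model onto every index record and fail fast on a mismatch. A sketch with a hypothetical record shape:

```python
EMBED_MODEL = "text-embedding-3-large"  # pin exactly one model per index

def make_record(doc_id, chunk_no, text, vector):
    """Build an index record that carries its embedding model for traceability."""
    return {
        "id": f"{doc_id}-{chunk_no}",
        "text": text,
        "vector": vector,
        "embedding_model": EMBED_MODEL,
    }

def assert_single_model(records):
    """Fail fast if two embedding models were ever mixed into one index."""
    models = {r["embedding_model"] for r in records}
    if len(models) > 1:
        raise ValueError(f"Mixed embedding models in index: {sorted(models)}")
```

When you do upgrade models, the stored field lets you re-embed in stages and verify nothing stale is left behind.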
Hybrid retrieval beats pure vector
In benchmarks, hybrid retrieval (keyword BM25 + vector + semantic re-ranking) consistently beats pure vector search on enterprise corpora:
- Keyword catches exact-term matches (product codes, error IDs, names).
- Vector catches conceptual similarity.
- Semantic re-ranking applies a learned model on top to reorder the results.
Azure AI Search supports all three in one query. Use them.
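Under the hood, hybrid search merges the keyword and vector result lists with Reciprocal Rank Fusion (RRF): a document scores higher the nearer the top it appears in each list. A toy sketch of the fusion step (the doc IDs are made up):

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: merge ranked lists of doc IDs into one ordering.
    score(d) = sum over lists of 1 / (k + rank of d in that list), rank 1-based."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["ERR-42-doc", "faq-doc", "spec-doc"]   # BM25 order: exact term wins
vector_hits  = ["faq-doc", "guide-doc", "ERR-42-doc"]  # vector order: concept wins
fused = rrf_fuse([keyword_hits, vector_hits])
```

Documents appearing high in both lists float to the top; documents only one retriever found still survive, which is exactly why hybrid is robust. Azure AI Search does this for you in a single query; you rarely implement it yourself.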
Filtering and metadata
Every chunk should have rich metadata:
- Source document, section, page, last-updated.
- Document type, owner, language.
- Security identifiers (which roles / users can see this chunk).
- Effective date and expiry.
Filter retrieval by user permissions and recency. Without this, your RAG will quote outdated policy at a confused customer.
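The permission-and-recency trimming looks roughly like this. A sketch over a hypothetical chunk schema; in a real deployment you'd push the same conditions into the search engine as a query filter (e.g., an OData filter in Azure AI Search) so unauthorized chunks never leave the index:

```python
from datetime import date

def visible_chunks(chunks, user_roles, today):
    """Security trimming plus recency: drop chunks the user may not see,
    chunks not yet effective, and chunks past their expiry date."""
    roles = set(user_roles)
    return [
        c for c in chunks
        if roles & set(c["allowed_roles"])
        and c["effective"] <= today
        and (c["expiry"] is None or today < c["expiry"])
    ]

chunks = [
    {"id": "hr-1", "allowed_roles": ["hr"], "effective": date(2024, 1, 1), "expiry": None},
    {"id": "pol-old", "allowed_roles": ["support"], "effective": date(2023, 1, 1), "expiry": date(2024, 1, 1)},
    {"id": "pol-new", "allowed_roles": ["support"], "effective": date(2024, 1, 1), "expiry": None},
]
seen = visible_chunks(chunks, ["support"], today=date(2024, 6, 1))  # only pol-new
```

Post-filtering in application code, as above, is only acceptable as a second line of defense; the index-side filter is the real control.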
Prompt engineering for grounding
The system prompt for a RAG agent should:
- State the role and persona.
- Demand answers come only from the provided context.
- Specify what to do if the answer isn't in context ("I don't know — please contact support").
- Demand citations.
- Set tone (concise, formal, conversational — match your brand).
Then the user message includes the user's question + the top-K retrieved chunks formatted clearly.
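Putting those rules together, a minimal grounded prompt pair might look like this (the brand name and source IDs are placeholders):

```python
SYSTEM_PROMPT = (
    "You are the support assistant for Contoso (placeholder brand).\n"
    "Answer ONLY from the provided context.\n"
    "If the answer is not in the context, reply exactly: "
    "\"I don't know - please contact support.\"\n"
    "Cite every claim with its [source-id]. Be concise and professional."
)

def user_message(question, chunks):
    """Format the question plus top-K retrieved chunks for the chat API."""
    context = "\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"

msg = user_message(
    "What is the refund window?",
    [{"id": "policy-3", "text": "Refunds within 30 days."}],
)
```

Keeping the source IDs inline in the context is what makes the "demand citations" instruction actionable: the model can only cite what it can see.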
Evaluation — the part most teams skip
You need a labeled test set:
- 50–200 representative questions with ground-truth answers and expected sources.
- Run every model / prompt change against it.
- Score on retrieval recall (was the right source retrieved?), answer correctness, groundedness (did the answer use the source?), citation accuracy.
Tools: Azure AI Foundry's evaluation features, Promptflow, or your own Python harness.
Without evaluation, you don't know if a "fix" is actually a fix.
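The first metric, retrieval recall, needs no LLM judge at all and is worth wiring up on day one. A sketch with a stub retriever (swap in your real search call):

```python
def retrieval_recall(test_set, retrieve, top_k=5):
    """Share of questions where at least one expected source is in the top-K."""
    hits = sum(
        1 for case in test_set
        if {c["source"] for c in retrieve(case["question"], top_k)}
        & set(case["expected_sources"])
    )
    return hits / len(test_set)

# Stub retriever for illustration only; a real one queries the index.
def fake_retrieve(question, top_k):
    return [{"source": "refund-policy.pdf"}]

test_set = [
    {"question": "Refund window?", "expected_sources": ["refund-policy.pdf"]},
    {"question": "Widget weight?", "expected_sources": ["specs.pdf"]},
]
recall = retrieval_recall(test_set, fake_retrieve)  # 1 of 2 questions hit
```

Run this on every chunking, embedding, or query change. If recall drops, no amount of prompt engineering downstream will save the answer.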
Guardrails
- Azure AI Content Safety for prompt injection and jailbreak detection, plus hate, sexual and violent content filtering.
- PII detection and redaction before logging conversations.
- Output filters for sensitive topics specific to your domain.
- Rate limiting and per-user quotas to control cost.
- Audit logging of every prompt and response with user identity.
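Of these, rate limiting is the one teams most often hand-roll. A minimal sliding-window per-user quota, as a sketch (production systems usually back this with Redis or an API gateway rather than in-process state):

```python
import time
from collections import defaultdict

class UserQuota:
    """Sliding-window quota: at most `limit` calls per `window` seconds per user."""

    def __init__(self, limit=100, window=3600):
        self.limit, self.window = limit, window
        self._calls = defaultdict(list)  # user_id -> recent call timestamps

    def allow(self, user_id, now=None):
        """True if the call is within quota; records it if so."""
        now = time.monotonic() if now is None else now
        recent = [t for t in self._calls[user_id] if now - t < self.window]
        if len(recent) >= self.limit:
            self._calls[user_id] = recent
            return False
        recent.append(now)
        self._calls[user_id] = recent
        return True
```

The injectable `now` parameter keeps it testable; in production you'd call `allow(user_id)` and let it read the clock.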
Cost — the realistic numbers
Per question, with hybrid retrieval and GPT-4 class generation:
- ~$0.01–$0.05 per question for typical enterprise scenarios.
- Embedding cost is one-off per document; re-embedding on update is the recurring cost.
- The biggest cost lever is caching repeat questions. Cache hits (in-memory, e.g. Memcached, or Cosmos DB) often eliminate 30–50% of LLM calls.
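The simplest version is an exact-match cache keyed on a normalized question hash; a sketch (the production upgrade is semantic caching, matching on embedding similarity rather than exact text):

```python
import hashlib

class AnswerCache:
    """Exact-match answer cache keyed on a normalized question hash."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def key(question):
        # Lowercase and collapse whitespace so trivial variants hit the cache.
        normalized = " ".join(question.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, question):
        return self._store.get(self.key(question))

    def put(self, question, answer):
        self._store[self.key(question)] = answer
```

Remember to invalidate entries when the underlying documents change, and never cache answers across users with different permissions.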
Production rollout
- Pilot with one knowledge domain and a small user group.
- Capture feedback (thumbs up/down + free text) on every answer.
- Weekly review: top low-scoring answers, top sources missing from index.
- Scale knowledge sources after the first domain feels right, not before.
FAQs
Can we just use Microsoft 365 Copilot instead? For Microsoft 365 content (SharePoint, OneDrive, Exchange), yes. For mixed sources, line-of-business apps, Dataverse and external systems, you need a custom RAG.
How does this fit with Copilot Studio? Copilot Studio's "Generative answers" feature is essentially managed RAG. Use it when you want low-code; build custom RAG when you need control over retrieval, evaluation and guardrails.
What about agentic AI? RAG is one tool an agent uses. A good agent combines RAG retrieval with tool calls, planning and memory. Build the RAG layer well and the agent layers benefit.
Will fine-tuning help? Rarely for RAG. Fine-tuning helps with style, format and domain language — not factual recall. Almost always, better retrieval beats fine-tuning.
Building enterprise GenAI? We deliver RAG architectures from prototype to production, with proper evaluation harnesses and guardrails. Schedule a workshop.