AI & Data · 8 min read · 27 March 2026

RAG in Production: Architecture Decisions That Actually Matter

A practitioner's guide to retrieval-augmented generation for enterprise Azure deployments

Parveen KR

Microsoft Certified Trainer · Enterprise AI & Data Platform

Summary

Retrieval-Augmented Generation (RAG) has become the dominant pattern for grounding enterprise LLM applications in proprietary data. Yet most organisations underestimate the architecture decisions required to move from a working demo to a production system that is accurate, cost-controlled, and auditable.

The demo-to-production gap

A RAG demo is easy to build. You chunk a PDF, embed the chunks, store them in a vector index, and wire a GPT-4 completion call to the top-k retrieved passages. The demo works. Stakeholders are impressed.

Production RAG is a different system. It must handle multiple document formats with inconsistent structure, queries that mix explicit retrieval intent with implicit context from prior conversation turns, and latency requirements incompatible with naive top-k retrieval across a large corpus. It must also be debuggable — when the system returns an incorrect answer, an engineer needs to identify which step in the pipeline failed.
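
One way to make the pipeline debuggable is to record what each stage produced for a given query. The sketch below is a minimal illustration in plain Python, not part of any Azure SDK; `PipelineTrace`, `StageRecord`, and the stage names are hypothetical, and the two stand-in stages only mimic a query rewrite and a retrieval call.

```python
import time
from dataclasses import dataclass, field


@dataclass
class StageRecord:
    name: str        # hypothetical stage label, e.g. "rewrite" or "retrieve"
    output: object   # whatever the stage returned
    seconds: float   # wall-clock time spent in the stage


@dataclass
class PipelineTrace:
    query: str
    stages: list = field(default_factory=list)

    def run(self, name, fn, *args):
        """Call fn(*args), record its output and duration, and return the output."""
        start = time.perf_counter()
        output = fn(*args)
        self.stages.append(StageRecord(name, output, time.perf_counter() - start))
        return output


# Stand-in stages (assumptions for illustration only).
trace = PipelineTrace("Notice Period?")
rewritten = trace.run("rewrite", str.lower, trace.query)
hits = trace.run("retrieve", lambda q: ["doc-7"], rewritten)

for stage in trace.stages:
    print(stage.name, stage.output)
```

When an answer is wrong, the stored trace shows whether the rewritten query, the retrieved passages, or the generation step was at fault.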

Chunking is not trivial

The most underestimated decision in RAG pipeline design is chunking strategy. Fixed-size character chunks (the default in most tutorials) break semantic units at arbitrary boundaries. A policy document chunked at 512 characters will split a numbered list mid-item, creating fragments that embed poorly and retrieve inconsistently.

In enterprise deployments, document-aware chunking matters: paragraph-level splits for prose, row-level splits for tables, section-level splits for structured reports. For complex documents — insurance policies, technical specifications, legal contracts — hierarchical chunking (retaining parent-chunk context alongside the retrieved child chunk) significantly improves answer grounding without increasing retrieval cost proportionally.

Azure AI Search supports this via its integrated vectorisation and hierarchical indexing capabilities. The configuration is non-trivial but the retrieval quality improvement is measurable.
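
The difference between fixed-size and document-aware chunking can be sketched in a few lines. This is a simplified illustration in plain Python, not the Azure AI Search configuration: it assumes documents have already been parsed into (title, body) sections, and `Chunk`, `hierarchical_chunks`, and `fixed_size_chunks` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    parent_id: int     # index of the section the chunk came from
    parent_title: str  # section heading, retained as parent context
    text: str          # the child chunk that would be embedded


def hierarchical_chunks(sections):
    """Split each (title, body) section into paragraph-level child chunks."""
    chunks = []
    for parent_id, (title, body) in enumerate(sections):
        # Split on blank lines so a numbered list stays in one chunk.
        for para in body.split("\n\n"):
            para = para.strip()
            if para:
                chunks.append(Chunk(parent_id, title, para))
    return chunks


def fixed_size_chunks(text, size):
    """Naive baseline: cut every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


sections = [(
    "4.2 Exclusions",
    "Cover does not apply to:\n1. Flood damage.\n2. Wear and tear.\n\n"
    "Claims must be filed within 30 days.",
)]
chunks = hierarchical_chunks(sections)
naive = fixed_size_chunks(sections[0][1], 40)
```

Here the paragraph-level split keeps the numbered list together and tags each child with its section heading, while the 40-character baseline cuts the list mid-item.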

Hybrid retrieval over pure vector search

Pure semantic (vector) retrieval performs poorly on queries that contain exact keywords, product codes, regulatory references, or named entities. A query for "clause 4.2.1 of the Master Services Agreement" should match by keyword, not semantic similarity.

Hybrid retrieval combines BM25 keyword search with vector search, using a Reciprocal Rank Fusion (RRF) step to merge the ranked result lists. Azure AI Search implements this natively. In our experience, hybrid retrieval outperforms pure vector retrieval on approximately 30–40% of real enterprise queries — the long tail of precise lookups that semantic similarity handles poorly.
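
The fusion step itself is simple. The sketch below shows the standard RRF formula, in which each document scores the sum of 1 / (k + rank) over the lists it appears in. It is an illustration of the technique, not Azure AI Search's internal implementation; the constant k = 60 is a commonly used default and is an assumption here, as are the document IDs.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of document IDs into one list ordered by RRF score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result lists from a keyword query and a vector query.
bm25_hits = ["msa-4.2.1", "msa-4.2", "sow-1"]
vector_hits = ["msa-4.2", "nda-3", "msa-4.2.1"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, and because only ranks are used, the BM25 and vector scores never need to be normalised onto a common scale.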

Evaluation before optimisation

Most teams optimise their RAG pipeline (adjusting chunk size, top-k, reranker thresholds) before they have a systematic evaluation framework. This is backwards. Without a labelled evaluation set — a collection of representative queries with known correct answers — you cannot tell whether a parameter change improved retrieval quality or simply shifted the error distribution.

Building an evaluation set is not glamorous work. It requires domain experts to review outputs and label them. In our Azure AI Foundry workshops, we dedicate a full module to building RAG evaluation pipelines using the Azure AI Evaluation SDK and its Groundedness, Relevance, and Coherence metrics. Teams that invest in evaluation infrastructure shorten their optimisation cycles significantly, because every parameter change can be checked against the same labelled queries.
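
Even a small labelled set makes retrieval changes measurable. The following sketch computes hit rate and mean reciprocal rank (MRR) at k in plain Python. It does not use the Azure AI Evaluation SDK and does not cover answer-level metrics such as groundedness; `evaluate_retrieval`, the sample queries, and the stand-in retriever are hypothetical.

```python
def evaluate_retrieval(eval_set, retrieve, k=5):
    """Score a retriever against (query, relevant_ids) pairs.

    `retrieve` is any callable that returns a ranked list of document IDs.
    """
    if not eval_set:
        raise ValueError("eval_set must contain at least one labelled query")
    hits = 0
    reciprocal_ranks = []
    for query, relevant_ids in eval_set:
        results = retrieve(query)[:k]
        # Rank (1-based) of the first relevant document, if any.
        rank = next(
            (i for i, doc_id in enumerate(results, start=1) if doc_id in relevant_ids),
            None,
        )
        if rank is None:
            reciprocal_ranks.append(0.0)
        else:
            hits += 1
            reciprocal_ranks.append(1.0 / rank)
    n = len(eval_set)
    return {"hit_rate": hits / n, "mrr": sum(reciprocal_ranks) / n}


# Hypothetical labelled queries and a canned retriever standing in for the index.
eval_set = [("notice period", {"msa-9"}), ("flood cover", {"pol-4"})]
canned = {"notice period": ["sow-1", "msa-9"], "flood cover": ["pol-2", "pol-3"]}

metrics = evaluate_retrieval(eval_set, lambda q: canned[q], k=5)
print(metrics)
```

Running the same function before and after a change to chunk size or top-k shows whether the change helped overall or only moved the errors to different queries.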

Responsible AI guardrails

Enterprise RAG systems must implement content safety and grounding checks. Azure OpenAI content filters handle obvious harmful content, but enterprise-specific guardrails — preventing the system from speculating beyond retrieved context, flagging low-confidence answers, and enforcing citation requirements — must be built into the orchestration layer.
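
Two of these orchestration-layer checks, citation enforcement and low-confidence flagging, can be sketched as a post-generation gate. This is a minimal illustration under stated assumptions: the model is prompted to cite passages as [1], [2], and so on, the retriever exposes a top relevance score, and the 0.5 threshold is an arbitrary placeholder. `check_answer` and `GuardrailResult` are hypothetical names, not Azure APIs.

```python
import re
from dataclasses import dataclass


@dataclass
class GuardrailResult:
    allowed: bool
    reasons: list


# Assumed citation format: bracketed 1-based passage numbers such as [2].
CITATION = re.compile(r"\[(\d+)\]")


def check_answer(answer, num_passages, top_score, min_score=0.5):
    """Reject answers that cite nothing, cite missing passages, or rest on weak retrieval."""
    reasons = []
    cited = {int(n) for n in CITATION.findall(answer)}
    if not cited:
        reasons.append("no citations")
    elif any(c < 1 or c > num_passages for c in cited):
        reasons.append("citation out of range")
    if top_score < min_score:
        reasons.append("low retrieval confidence")
    return GuardrailResult(allowed=not reasons, reasons=reasons)


ok = check_answer("The notice period is 30 days [1].", num_passages=3, top_score=0.8)
blocked = check_answer("It is probably 30 days.", num_passages=3, top_score=0.2)
print(ok.allowed, blocked.reasons)
```

A blocked answer can be replaced with a refusal or routed to a fallback, and the recorded reasons give auditors a trail of why the system declined to answer.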

Prompt Shields (available in Azure AI Content Safety) detect jailbreak attempts and indirect prompt injection from retrieved documents. For systems that process untrusted external documents (e.g., customer-submitted forms, scraped web content), prompt injection from retrieved content is a real attack vector, not a theoretical one.

RAG · Azure OpenAI · Azure AI Search · LLM · Enterprise AI · Responsible AI
