The demo-to-production gap
A RAG demo is easy to build. You chunk a PDF, embed the chunks, store them in a vector index, and wire a GPT-4 completion call to the top-k retrieved passages. The demo works. Stakeholders are impressed.
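To make concrete how little the demo contains, here is a minimal sketch of that pipeline in Python. It rests on loud assumptions: `embed` is a toy term-count stand-in for a real embedding model, the "vector index" is a plain list scanned in full, and the completion call is left out, so `build_prompt` only returns the string that would be sent to the model. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 8) -> list[str]:
    """Split text into fixed-size word windows with no overlap."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(query: str, index: list[tuple[str, Counter]], k: int = 3) -> list[str]:
    """Naive retrieval: score every chunk against the query, keep the best k."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(query: str, passages: list[str]) -> str:
    """Assemble the text that would be passed to the completion call."""
    context = "\n\n".join(passages)
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {query}"


document = (
    "Invoices are due within thirty days of receipt. "
    "Refunds are processed by the finance team every Friday. "
    "Support tickets are answered within one business day."
)
index = [(c, embed(c)) for c in chunk(document)]
question = "When are refunds processed?"
passages = top_k(question, index, k=1)
prompt = build_prompt(question, passages)
```

Note that `top_k` scores every chunk on every query, which is fine for one PDF and is exactly the part that stops scaling.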
Production RAG is a different system. It must handle multiple document formats with inconsistent structure, queries that mix explicit retrieval intent with implicit context from prior conversation turns, and latency requirements that naive top-k retrieval across a large corpus cannot meet. It must also be debuggable: when the system returns an incorrect answer, an engineer needs to be able to identify which step in the pipeline failed.
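One way to approach the debuggability requirement, offered as a sketch and not a prescribed design, is to run every pipeline stage through a small wrapper that records its output, duration, and any exception. `Trace`, `StageRecord`, and the stage names below are hypothetical, and the two stages in the usage lines are trivial placeholders for real query rewriting and retrieval.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class StageRecord:
    name: str
    output: Any
    seconds: float
    error: Optional[str] = None


@dataclass
class Trace:
    records: list[StageRecord] = field(default_factory=list)

    def run(self, name: str, fn: Callable[..., Any], *args: Any) -> Any:
        """Call fn(*args), recording its result or exception under `name`."""
        start = time.perf_counter()
        try:
            result = fn(*args)
        except Exception as exc:
            elapsed = time.perf_counter() - start
            self.records.append(StageRecord(name, None, elapsed, repr(exc)))
            raise
        self.records.append(StageRecord(name, result, time.perf_counter() - start))
        return result

    def first_failure(self) -> Optional[str]:
        """Name of the first stage that raised, or None if none did."""
        for record in self.records:
            if record.error is not None:
                return record.name
        return None


# Placeholder stages: whitespace cleanup and substring matching.
corpus = ["refund policy: 30 days", "shipping times"]
trace = Trace()
query = trace.run("rewrite", str.strip, "  refund policy  ")
hits = trace.run("retrieve", lambda q: [p for p in corpus if q in p], query)
```

A wrong answer usually raises no exception, so the more useful part is the stored `output` of each stage: reading the records in order shows whether the rewritten query, the retrieved passages, or the generation step is where the answer went wrong.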
