What Leaders Must Understand Before Funding an LLM/RAG Project

11.26.2025 05:37 PM - By Rohit Vaidyula

In my previous article, I talked about the LLM-pocalypse — the point where hype around AI collided with reality and confidence in large language models started to slip. This follow-up is about something more concrete: how these systems fail in practice.


From the outside, AI systems often look mysterious. Things work beautifully in a demo, then start breaking in production, and leadership hears phrases like “model drift,” “RAG issues,” “pipeline failures,” and “LLMOps gaps.” It can feel like chaos. But these failures are not mysterious at all. They are mechanical, predictable, and recurring. The same patterns appear across companies, industries, and use cases.


1. Model-Level Failures: When the “Brain” Is Misunderstood

These failures relate to the model itself — how it was trained, how it behaves over time, and how updates ripple through your systems.


1.1 Model Drift

LLMs don’t learn continuously. Once a model is trained, its internal representation of the world is effectively frozen. It “remembers” the state of the world as it existed at training time, not as it exists today. (There are some interesting candidates for “self-learning” LLMs that update their weights periodically, but they remain the exception.)

Meanwhile, your environment moves on. Product catalogs get updated, support policies change, regulations are revised, and internal processes evolve. If you rely only on the model’s internal knowledge, it will keep giving answers that reflect an older reality. Even in RAG systems, if the underlying knowledge base or vector index is not refreshed, the model will keep pulling context from an outdated view of your organization.

Practically, this shows up as:

  • slightly wrong pricing information
  • outdated policy language
  • references to deprecated processes and tools

It rarely fails spectacularly at first. It just becomes “off” in subtle but important ways.

The key idea is this: model drift isn’t the model going bad — it’s the world moving on. Unless you have deliberate processes for updating knowledge (via RAG, fine-tuning, or re-indexing), drift is guaranteed.


1.2 Instruction Decay

Every LLM relies heavily on prompts (instructions) to guide its behavior. In early pilots, teams often spend weeks or months hand-tuning prompts until the model outputs exactly what they want.

But prompts are not stable contracts. Over time:

  • the model provider may update internal weights, safety filters, or sampling defaults
  • the model’s preferences around verbosity or style may shift
  • your own context windows or system messages may change

The net effect is what you can think of as instruction decay: prompts that were once reliable begin to behave differently. Instead of returning a clean JSON object, the model starts adding commentary. Instead of summarizing in three bullet points, it produces a page of text. Instead of following a field order, it changes the structure.

From the outside this looks random. From the inside it’s just the result of a system whose behavior is sensitive to small internal and contextual changes.
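
One practical defense is a set of “canary” prompts you re-run on a schedule, checking that the output still honors the contract you designed for. Below is a minimal sketch in Python; the call_model function is a hypothetical stand-in for whatever API you actually use.

```python
import json

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for your LLM API call."""
    raise NotImplementedError

# A fixed set of prompts whose outputs are expected to stay machine-readable.
CANARY_PROMPTS = [
    "Summarize this ticket as JSON with keys 'issue' and 'severity': ...",
    "Return the three key risks as a JSON list of strings: ...",
]

def check_instruction_decay() -> list[str]:
    """Flag prompts whose output no longer parses as clean JSON."""
    failures = []
    for prompt in CANARY_PROMPTS:
        raw = call_model(prompt)
        try:
            json.loads(raw)          # the contract: pure JSON, no commentary
        except json.JSONDecodeError:
            failures.append(prompt)  # decay detected: output drifted from the contract
    return failures
```

If this check starts failing without any change on your side, you are looking at instruction decay, not a bug in your code.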


1.3 Version Instability

Model upgrades are tempting. New versions promise better reasoning, better safety, better multimodal capabilities. But when you swap one model version for another:

  • the way it tokenizes text may differ
  • its internal representation of meaning may shift
  • it may follow instructions more strictly
  • it may change its default tone or verbosity

You’re effectively changing the “personality” and response patterns of a critical system component.

That means a prompt that was carefully tuned for Model A may behave very differently under Model B. JSON formats may change, summaries may lengthen or shorten, RAG answers may take a different angle, and agents may behave differently in planning steps.

If you don’t treat model versions like major infrastructure changes — with regression tests, evaluation sets, and staged rollouts — version instability can quietly break workflows that appeared “finished.”
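
In practice, treating a model version like an infrastructure change can be as simple as running the candidate version against your evaluation set before promoting it. A minimal sketch, assuming a hypothetical call_model(model_id, prompt) wrapper and an illustrative eval set:

```python
def call_model(model_id: str, prompt: str) -> str:
    """Hypothetical wrapper around your provider's API."""
    raise NotImplementedError

# Illustrative eval cases: real ones should come from your own workflows.
EVAL_SET = [
    {"prompt": "Extract the invoice total as JSON: ...", "must_contain": '"total"'},
    {"prompt": "Classify this ticket as billing, support, or sales: ...", "must_contain": "billing"},
]

def pass_rate(model_id: str) -> float:
    """Fraction of eval cases where the model's output contains what we require."""
    passed = sum(
        case["must_contain"] in call_model(model_id, case["prompt"])
        for case in EVAL_SET
    )
    return passed / len(EVAL_SET)

# Staged rollout gate: block the upgrade if the candidate regresses against the pinned version.
# if pass_rate("candidate-model") < pass_rate("pinned-model") - 0.02: keep the old version
```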


2. Retrieval & RAG Failures: When the Knowledge Layer Starts Lying

If the model is the “brain,” RAG is the “memory.” Most business-grade AI systems combine LLMs with retrieval from internal data sources. This layer is powerful, but extremely fragile.


2.1 Context Rot

A vector database (or any search index used in RAG) is not a living organism. It doesn’t automatically stay in sync with your source systems. It captures whatever was true when documents were embedded and indexed.

When:

  • product docs are updated
  • new policy versions are released
  • old knowledge is retired
  • new entities or fields are added

…your context base doesn’t magically update itself. Unless you explicitly re-ingest and re-embed the new content — and remove or supersede the old — your AI system will keep returning outdated material. The outward symptom is an assistant that “sounds right” but references policies or procedures you stopped using months ago.
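
A simple guardrail is a staleness audit: compare when each source document last changed against when it was last embedded. The sketch below assumes you track both timestamps; the field names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Illustrative records: in practice these come from your source systems and your index metadata.
documents = [
    {"id": "refund-policy", "source_updated": datetime(2025, 10, 1, tzinfo=timezone.utc),
     "last_embedded": datetime(2025, 3, 15, tzinfo=timezone.utc)},
    {"id": "pricing-sheet", "source_updated": datetime(2025, 11, 5, tzinfo=timezone.utc),
     "last_embedded": datetime(2025, 11, 6, tzinfo=timezone.utc)},
]

def stale_documents(docs, max_lag=timedelta(days=7)):
    """Return documents whose indexed copy lags the source by more than max_lag."""
    return [d["id"] for d in docs if d["source_updated"] - d["last_embedded"] > max_lag]

print(stale_documents(documents))  # ['refund-policy'] needs re-ingestion and re-embedding
```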


2.2 Chunking Errors

To store documents in a vector database, you rarely embed the whole document at once — it’s too long, and retrieval becomes imprecise. Instead, you split them into “chunks” and embed those. If that splitting is naive (by character count or arbitrary sentence boundaries), you often end up severing concepts that must stay together. Conditions are separated from rules, footnotes are separated from definitions, tables are broken into incoherent fragments.


Examples:

  • A refund policy split so that “30 days” is in one chunk and “only with receipt” is in another.
  • A table where quantities and units land in different chunks.
  • A compliance requirement split mid-sentence.


When the user asks a question, the system retrieves only one of those chunks — often the “catchier” one — and the model answers based on incomplete context. The model isn’t misbehaving; it’s being fed a broken slice of reality. Chunking is not a minor implementation detail; it’s core to whether your RAG system is trustworthy.
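
To make the difference concrete, here is a rough sketch of naive fixed-size chunking next to a paragraph-aware version that carries overlap, so conditions like “only with receipt” stay attached to the rule they qualify. A production pipeline would also respect headings and tables; this only shows the idea.

```python
def naive_chunks(text: str, size: int = 200) -> list[str]:
    """Fixed-size splitting: routinely cuts rules away from their conditions."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def paragraph_chunks(text: str, max_chars: int = 800, overlap: int = 1) -> list[str]:
    """Split on paragraph boundaries and carry the previous paragraph forward as overlap."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        if current and sum(len(p) for p in current) + len(para) > max_chars:
            chunks.append("\n\n".join(current))
            current = current[-overlap:]   # keep trailing context for the next chunk
        current.append(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```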


2.3 Retrieval Irrelevance

Semantic search uses embeddings to find text that is “close in meaning” to the user’s query. This is powerful — but it’s not the same as finding the correct answer. Embeddings capture fuzzy conceptual similarity, not business logic. If someone searches “refund exceptions for damaged items,” the system might retrieve:

  • “refund processes for defective items”
  • “shipping exception policies”
  • “product damage documentation rules”

These all sound related, but they may not contain the specific rule needed to answer the question accurately.


This is how you get AI responses that are on-topic but wrong. The retrieval system did its job in a linguistic sense; it failed in a business relevance sense. That distinction matters a lot in regulated, process-heavy environments.
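
One common mitigation is to stop trusting embedding similarity alone: filter or re-rank candidates using metadata that encodes business relevance, such as topic tags or document status. A minimal sketch, with illustrative field names:

```python
def rerank(candidates: list[dict], required_topic: str, top_k: int = 3) -> list[dict]:
    """Combine the semantic score with a hard metadata filter.
    Each candidate is assumed to carry {'text', 'score', 'topic', 'status'}."""
    eligible = [
        c for c in candidates
        if c["topic"] == required_topic and c["status"] == "current"   # business relevance
    ]
    return sorted(eligible, key=lambda c: c["score"], reverse=True)[:top_k]

# A question about refund exceptions should only ever see chunks tagged topic='refunds',
# no matter how semantically close a shipping-exception chunk happens to be.
```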


2.4 Grounding Drift

In a RAG system, you usually:

  1. retrieve relevant chunks
  2. feed them into the model
  3. generate an answer
  4. optionally attach citations to those chunks

Users then see an answer with links or references and assume: “this is supported by our documentation.”


Grounding drift happens when that assumption stops being true. Over time, because of retrieval noise, chunking issues, model behavior changes, or the model mixing in its own training knowledge, it begins to generate answers that:

  • contain details not present in the retrieved text
  • selectively use parts of the context while ignoring others
  • combine multiple chunks into a conclusion that isn’t documented anywhere


The citations still look correct — they point to legitimate internal docs — but those docs don’t truly back up the full answer. From a leadership standpoint, this is where compliance risk explodes: the system appears grounded but is improvising.
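
You can catch the worst of this with a cheap groundedness check: flag answer sentences that share almost no vocabulary with the retrieved chunks. This lexical-overlap heuristic is crude (mature systems use entailment models or an LLM-as-judge), but it is a reasonable first tripwire.

```python
import re

def ungrounded_sentences(answer: str, retrieved_chunks: list[str], min_overlap: float = 0.4):
    """Return answer sentences with little lexical overlap with any of the retrieved text."""
    context_words = set(re.findall(r"[a-z0-9]+", " ".join(retrieved_chunks).lower()))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer):
        words = set(re.findall(r"[a-z0-9]+", sentence.lower()))
        if not words:
            continue
        overlap = len(words & context_words) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence)   # likely improvised rather than grounded
    return flagged
```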


2.5 Vector Store Contamination

A vector store is like a semantic landfill: whatever you put into it stays until you deliberately clean it up.

Contamination usually enters when:

  • PDF extraction goes wrong and garbled text is embedded
  • older versions of documents coexist with newer ones
  • irrelevant content (like email text, markup, or UI junk) is included
  • partially scraped or truncated docs get indexed
  • mislabeled or unreviewed content is ingested at scale


Because embeddings are opaque, you cannot easily see which chunks are “dirty.” The only symptom is that retrieval begins returning confusing, contradictory, or obviously incorrect content. Cleaning this up later often requires full re-ingestion and re-indexing, a very lengthy process.
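
The practical answer is a hygiene gate in front of the ingestion pipeline: reject chunks that look garbled before they ever get embedded. The heuristics below are illustrative, not exhaustive.

```python
def looks_garbled(chunk: str) -> bool:
    """Crude signals of extraction damage: heavy non-text residue or broken words."""
    if not chunk.strip():
        return True
    clean_chars = sum(ch.isalnum() or ch.isspace() or ch in ".,;:!?()'\"-%/" for ch in chunk)
    if clean_chars / len(chunk) < 0.85:            # markup, binary, or encoding residue
        return True
    words = chunk.split()
    hyphen_breaks = sum(w.endswith("-") for w in words)
    return hyphen_breaks > len(words) * 0.05       # broken-word artifacts from bad extraction

def clean_batch(chunks: list[str]) -> list[str]:
    """Only embed what passes the gate; everything else goes to a review queue."""
    return [c for c in chunks if not looks_garbled(c)]
```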

 

3. Pipeline & Orchestration Failures: When Glue Code Becomes a Liability

Beyond the model and retrieval, most real systems have pipelines: ingestion, transformation, tool-calling, agents, validation, logging. This is where a good number of “mysterious” failures genuinely live.


3.1 Broken PDF Extraction

Most serious businesses run on PDFs: contracts, policies, proposals, manuals, reports. Unfortunately, PDFs aren’t structured text. They’re graphical layout containers. Text is placed at coordinates, often in multiple columns, with tables, footers, headers, and sidebars all rendered visually.

When you “extract” text from a PDF using naive tools, you often get:

  • sentences out of order
  • two columns merged into one linear stream
  • tables flattened in ways that separate labels from values
  • footers or page numbers interwoven with content
  • broken words and hyphenation artifacts


If you embed that extracted output as-is, you are teaching your AI a corrupted version of your knowledge. From the AI’s perspective, it’s faithfully working with what it was given. But the upstream extraction step is silently degrading data quality, making correct answers nearly impossible.
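
You usually cannot fix the PDFs themselves, but you can detect a bad extraction before it poisons the index. Below is a rough sanity check; the signals and thresholds are illustrative and worth tuning on your own documents.

```python
def extraction_quality(text: str) -> dict:
    """Rough indicators that a PDF extraction went wrong."""
    lines = [line for line in text.splitlines() if line.strip()]
    return {
        "avg_line_length": sum(len(line) for line in lines) / max(len(lines), 1),
        "hyphenated_line_ends": sum(line.rstrip().endswith("-") for line in lines),
        "repeated_lines": len(lines) - len(set(lines)),   # headers/footers leaking into content
        "digit_ratio": sum(ch.isdigit() for ch in text) / max(len(text), 1),
    }

# report = extraction_quality(extracted_text)
# Very short average lines, many hyphenated line ends, or lots of repeated lines are all
# signs that columns, tables, or footers got mangled. Re-extract before embedding.
```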


3.2 Malformed JSON & Structured Output Failures

Modern AI systems frequently ask LLMs to output structured data: JSON, key-value pairs, SQL snippets, etc. The problem is that LLMs generate text, not abstract syntax trees. Even when the instructions are clear, you will see:

  • missing or extra commas in JSON
  • unclosed brackets
  • incorrect or unexpected keys
  • explanatory prose mixed into structured output
  • incorrect data types (string instead of number, etc.)


To a human, this looks like a minor formatting issue. To a workflow expecting valid JSON, it’s catastrophic. The whole pipeline can fail on a single character.

Serious systems address this with:

  • schema enforcement
  • JSON mode or structured output capabilities
  • validation and retries
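
Here is what that validate-and-retry loop looks like in its simplest form, as a sketch. The call_model function and the required keys are placeholders for your own setup; many providers also offer a native JSON or structured-output mode that belongs in front of this.

```python
import json

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for your LLM call."""
    raise NotImplementedError

REQUIRED_KEYS = {"customer_id", "intent", "priority"}   # illustrative schema

def structured_call(prompt: str, max_retries: int = 2) -> dict:
    """Validate the model's JSON output; on failure, retry with the error fed back."""
    attempt_prompt = prompt
    for _ in range(max_retries + 1):
        raw = call_model(attempt_prompt)
        try:
            data = json.loads(raw)
            if isinstance(data, dict) and REQUIRED_KEYS <= data.keys():
                return data
            error = f"expected an object with keys {sorted(REQUIRED_KEYS)}"
        except json.JSONDecodeError as exc:
            error = f"invalid JSON: {exc}"
        # Retry with explicit feedback instead of letting one stray comma kill the pipeline.
        attempt_prompt = (
            f"{prompt}\n\nYour previous output was rejected ({error}). Return valid JSON only."
        )
    raise ValueError("model failed to produce valid structured output")
```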


3.3 Prompt Brittleness

Prompts are not “code” in the traditional sense. They don’t have strict guarantees. They are more like suggestions to a very talented but unpredictable assistant.

Brittleness emerges when:

  • a user’s phrasing is drastically different from training examples
  • the prompt plus user input plus retrieved context approaches the model’s context limit
  • minor changes in word order or emphasis shift the model’s interpretation
  • new requirements are bolted onto an already long prompt


In testing, prompts are usually evaluated on a small set of curated examples. In production, they face the full diversity of user behavior. That’s when brittleness shows itself through:

  • inconsistent tone
  • inconsistent structure
  • edge-case failures
  • unexpected gaps in reasoning


3.4 Schema Drift

Many AI systems depend on the model producing output in a specific format—a schema. That might be:

  • an object with specific fields
  • a list of items in a defined structure
  • a JSON object matching an API


Even with explicit instructions, models sometimes:

  • rename fields
  • reorder them
  • add or remove fields
  • alter nesting (e.g., wrapping a single item in a list)
  • mix types (e.g., returning "5" instead of 5)


To a human reader, the meaning may still be clear. To an API or database expecting an exact structure, the contract is broken. Schema drift is one of the primary reasons automated AI systems that depend on structured output are fragile — unless they include strict validation, correction, or external tooling to enforce the schema.
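
That enforcement is a strict contract at the boundary: validate every field, coerce the recoverable cases (like "5" arriving instead of 5), and reject the rest before it reaches an API or database. A minimal sketch with an illustrative contract:

```python
EXPECTED = {"order_id": str, "quantity": int, "express": bool}   # illustrative contract

def enforce_schema(payload: dict) -> dict:
    """Coerce recoverable type drift; reject anything else."""
    clean = {}
    for field, expected_type in EXPECTED.items():
        if field not in payload:
            raise ValueError(f"missing field: {field}")
        value = payload[field]
        if isinstance(value, expected_type):
            clean[field] = value
        elif expected_type is int and isinstance(value, str) and value.isdigit():
            clean[field] = int(value)   # "5" becomes 5: recoverable drift
        else:
            raise ValueError(
                f"{field}: expected {expected_type.__name__}, got {type(value).__name__}"
            )
    return clean

enforce_schema({"order_id": "A-1001", "quantity": "5", "express": False})
# {'order_id': 'A-1001', 'quantity': 5, 'express': False}
```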


3.5 Agentic Loops

Agent-based systems give LLMs the ability to:

  • plan steps
  • call tools (APIs, search, DB queries)
  • analyze results
  • decide what to do next


This is powerful, but without guardrails, agents can become cost and latency disasters. Loops occur when:

  • the model misinterprets a tool result as incomplete
  • the agent does not recognize that its goal is already satisfied
  • errors are hallucinated (the model thinks something went wrong that didn’t)
  • the prompt doesn’t clearly define stopping criteria


The agent then continues to search, call tools, and re-plan — even though the answer is already available or the task is impossible.

From a director’s vantage point, this manifests as:

  • unpredictable latency
  • skyrocketing token usage
  • logs full of repeated steps
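
The guardrails are mostly mundane: a hard cap on steps, an explicit stopping condition, and a budget. The sketch below shows the shape of a bounded loop; plan_next_action and run_tool are hypothetical placeholders for whatever agent framework you use.

```python
def plan_next_action(goal: str, history: list) -> dict:
    """Hypothetical: ask the model for the next step, e.g. {'tool': ..., 'args': ..., 'done': bool}."""
    raise NotImplementedError

def run_tool(action: dict) -> str:
    """Hypothetical tool executor (API call, search, DB query)."""
    raise NotImplementedError

def run_agent(goal: str, max_steps: int = 6, token_budget: int = 20_000) -> list:
    """Bounded agent loop: stop on success, on the step cap, or when the budget runs out."""
    history, tokens_used = [], 0
    for _ in range(max_steps):                     # hard cap: no endless re-planning
        action = plan_next_action(goal, history)
        if action.get("done"):                     # explicit stopping criterion
            break
        result = run_tool(action)
        history.append((action, result))
        tokens_used += len(result) // 4            # rough token estimate
        if tokens_used > token_budget:             # cost guardrail
            break
    return history
```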


3.6 Silent Failures

The most dangerous category of failure is the quiet one. The system still responds. The UI still looks good. Logs aren’t throwing obvious errors. But the underlying quality is degrading. Silent failures come from combinations of:

  • index rot
  • bad chunking
  • retrieval irrelevance
  • updated model behavior
  • changed prompts
  • evolving data that doesn’t match assumptions

Without:

  • evaluation sets
  • periodic testing
  • retrieval quality metrics
  • hallucination detection
  • schema validation logs

…no one notices the slow decline until a major incident forces a review.


4. Economic Failures: When Technical Design Turns Into Financial Waste

Technical decisions around models, tokens, latency, and routing have real cost implications. These are failure modes CFOs and directors care about deeply.


4.1 Token Explosion

Cost in LLM systems is largely a function of tokens processed — both inputs and outputs. As systems evolve, token counts tend to grow:

  • prompts get longer as new requirements are added
  • more context is included “just to be safe”
  • more documents are retrieved per query
  • agents chain multiple model calls together

Individually, each change seems harmless. Collectively, they multiply cost. A pipeline that started with 500 tokens per request can quietly grow to 5,000 or 10,000 tokens per request.

In financial terms, this is the difference between an experiment you can run at scale and a system that becomes too expensive to justify, even if it works correctly.
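
The arithmetic is worth doing explicitly, because each individual change really does look harmless. A back-of-the-envelope sketch with an illustrative price per 1,000 tokens (plug in your own volumes and your provider's actual rates):

```python
def monthly_cost(tokens_per_request: int, requests_per_day: int,
                 price_per_1k_tokens: float = 0.002) -> float:
    """Illustrative cost model: tokens x volume x price. Prices vary by provider and model."""
    return tokens_per_request * requests_per_day * 30 * price_per_1k_tokens / 1000

print(monthly_cost(500, 10_000))     # 300.0  -> roughly $300/month at launch
print(monthly_cost(10_000, 10_000))  # 6000.0 -> roughly $6,000/month after the pipeline grows 20x
```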


4.2 The Hallucination Tax

Hallucinations aren’t just a quality problem; they’re an operational one. Every incorrect answer requires:

  • time for someone to verify it
  • time to correct it
  • time to re-run workflows
  • time to manage customer frustration
  • effort from legal or compliance in sensitive domains

Even a small hallucination rate — say 2–5% — can create a significant hallucination tax in terms of labor, delays, and rework. That overhead often wipes out the productivity gains that AI was meant to create in the first place.


For leaders, the focus should not be “can we reduce hallucinations?” but “what is the total cost when hallucinations occur, and how often can we tolerate them?”
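
A rough way to frame that question is to put numbers on the rework. The sketch below is illustrative; the inputs should come from your own volumes and measurements.

```python
def hallucination_tax_hours(queries_per_month: int, hallucination_rate: float,
                            minutes_to_catch_and_fix: float) -> float:
    """Human rework created by incorrect answers, in hours per month."""
    return queries_per_month * hallucination_rate * minutes_to_catch_and_fix / 60

print(hallucination_tax_hours(20_000, 0.03, 15))   # 150.0 hours of verification and rework per month
```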


4.3 Latency Stacking

LLMs themselves may respond in under a second. But real systems rarely make a single call. They typically:

  • retrieve context
  • summarize or rank it
  • interpret the question
  • generate an answer
  • sometimes call tools or APIs
  • validate outputs
  • perhaps perform a follow-up reasoning step

Each stage adds latency. If they happen sequentially, total response time can easily climb into the 8–20 second range.
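
A quick sketch of why the stages stack, and how much running the independent ones concurrently can buy back. The numbers are illustrative.

```python
# Illustrative per-stage latencies (in seconds) for a single user query.
stages = {"retrieve": 0.8, "rerank": 0.6, "interpret": 1.2, "generate": 3.5,
          "tool_call": 2.0, "validate": 0.4}

sequential = sum(stages.values())

# If retrieval, reranking, and interpretation can run concurrently, only the slowest one counts.
parallel_front = max(stages["retrieve"], stages["rerank"], stages["interpret"])
restructured = parallel_front + stages["generate"] + stages["tool_call"] + stages["validate"]

print(f"sequential: {sequential:.1f}s, restructured: {restructured:.1f}s")
# sequential: 8.5s, restructured: 7.1s. And that is before retries or follow-up reasoning steps.
```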


4.4 Oversized Models

It’s tempting to use the biggest, most capable model for everything. But most enterprise tasks are not complex reasoning. They are:

  • classification
  • routing
  • entity extraction
  • keyword or tag assignment
  • simple summarization
  • FAQ answering

These tasks can often be handled by much smaller models — cheaper, faster, and easier to run on commodity hardware or optimized infrastructure.

If your architecture doesn’t include:

  • model routing (choosing the right model for the right task)
  • smaller specialized models for simple jobs

…then you are almost certainly overspending and suffering unnecessary latency.
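
Model routing can start very simply: classify the task, send routine work to a small, cheap model, and reserve the large model for what genuinely needs it. The sketch below uses placeholder model names.

```python
SIMPLE_TASKS = {"classification", "routing", "entity_extraction", "tagging", "faq"}

def pick_model(task_type: str) -> str:
    """Route by task type. The model names are placeholders for whatever you actually run."""
    return "small-fast-model" if task_type in SIMPLE_TASKS else "large-reasoning-model"

print(pick_model("classification"))      # small-fast-model
print(pick_model("contract_analysis"))   # large-reasoning-model
```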


5. Organizational Failures: When Structure and Governance Don’t Match the Tech

Now, some of the most serious “technical” problems aren’t purely technical. They’re organizational. Here’s how that plays out.


5.1 No Evaluation Sets

Many AI projects launch without a clearly defined evaluation set — a fixed collection of real-world questions, tasks, or scenarios used to measure performance over time.

Without this:

  • you can’t tell if a model change improved or worsened performance
  • you can’t compare vendors meaningfully
  • you can’t measure drift
  • you can’t justify investment with evidence


Teams then rely on anecdotes, internal demos, or individual impressions, which is a poor basis for decision-making.
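
An evaluation set does not need to be elaborate. A fixed file of real questions, each paired with a fact the answer must contain, scored the same way after every change, is already a huge step up from anecdotes. A minimal sketch; answer_question is a hypothetical stand-in for your full pipeline, and the scoring rule here is deliberately simple.

```python
import csv

def answer_question(question: str) -> str:
    """Hypothetical: your full pipeline (retrieval plus model) answering one question."""
    raise NotImplementedError

def run_eval(path: str = "eval_set.csv") -> float:
    """eval_set.csv columns: question, must_mention (a fact the answer has to contain)."""
    with open(path, newline="") as f:
        cases = list(csv.DictReader(f))
    passed = sum(
        case["must_mention"].lower() in answer_question(case["question"]).lower()
        for case in cases
    )
    return passed / len(cases)

# Run this after every model upgrade, prompt change, or re-indexing job and track the score over time.
```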


5.2 Missing Observability & LLMOps

Traditional systems have logging, monitoring, and alerting. Many AI systems don’t.

Without proper LLMOps and observability, you don’t know:

  • how often hallucinations occur
  • how retrieval quality is trending
  • how token usage is changing over time
  • which prompts or workflows are failing more often
  • how grounding quality varies by use case


You’re effectively flying an aircraft with no altimeter, no fuel gauge, and no radar. Leaders should be asking for:

  • dashboards for token usage and cost
  • retrieval quality metrics
  • structured error logs (e.g., schema validation errors, JSON parse failures)
  • model performance trends on the evaluation set
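
Even before adopting a full LLMOps platform, one structured log record per request gets you most of the raw material for those dashboards. A minimal sketch; the fields are illustrative.

```python
import json, logging, time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm_requests")

def log_request(workflow: str, prompt_tokens: int, completion_tokens: int,
                latency_s: float, schema_valid: bool, grounded: bool) -> None:
    """One structured record per request: enough to chart cost, latency, schema failures,
    and grounding quality over time."""
    log.info(json.dumps({
        "ts": time.time(),
        "workflow": workflow,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "latency_s": latency_s,
        "schema_valid": schema_valid,
        "grounded": grounded,
    }))

log_request("refund_assistant", 1800, 240, 4.2, schema_valid=True, grounded=False)
```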


5.3 Data Hygiene Debt

LLMs amplify the quality of your data — for better or worse. If your internal documents are:

  • inconsistent
  • outdated
  • contradictory
  • poorly written
  • scattered across silos


then your AI will reflect that. RAG does not magically fix messy knowledge bases. It exposes them.

This is data hygiene debt. You can’t build a trustworthy AI assistant on top of a knowledge base you wouldn’t trust if you read it yourself.


5.4 Unclear Ownership

Is AI owned by IT? Data? Engineering? Product? Operations? Compliance? If ownership is fuzzy:

  • no one is accountable for ongoing quality
  • evaluation, monitoring, and governance get deprioritized
  • decisions about model versions and architectures drift
  • pilots proliferate without consolidation or standards


Over time, you end up with a patchwork of half-maintained AI experiments rather than a cohesive capability.


The Point of All This: AI Doesn’t Fail Randomly

The story across all these failure modes is straightforward:

  • LLMs are probabilistic text engines, not magic.
  • RAG is powerful but brittle.
  • Pipelines and agents are easy to overcomplicate.
  • Costs grow faster than people expect.
  • Organizational maturity must catch up to technical possibility.


Once you understand how these systems fail, they stop being mysterious. You can ask sharper questions, fund the right infrastructure, insist on evaluation and observability, and push your teams to design for reliability — not just for demos.


Because the companies that succeed in the next wave of AI won’t be the ones with the flashiest prototypes. They’ll be the ones who understand the failure modes so well that they almost never let them happen.

Rohit Vaidyula


Founder, VyomTech
https://www.vyomtech.ai/

Rohit is an AI engineer and a technical founder with expertise in transforming vague, real-world challenges into actionable AI solutions.