Needles in a Haystack
Why LLMs struggle to find specific information in long contexts — and what you can do about it.
The "Needle in a Haystack" evaluation is a deceptively simple test: hide a specific piece of information somewhere in a long document, then ask the model to retrieve it. Despite the information being explicitly present, models fail more often than you'd expect.
This post summarizes what I found while exploring the problem.
Why retrieval breaks down
Four factors consistently degrade performance:
Query complexity. Simple lookups ("what is the API key?") succeed far more reliably than reasoning tasks that require connecting multiple pieces of information across the context.
Context length. Accuracy drops as documents grow. This isn't just a compute issue — models appear to lose track of earlier content as later tokens dominate attention.
Information position. Where you place the needle matters enormously. Information buried in the middle of a long context is retrieved far less reliably than information at the start or end. This is sometimes called the "lost in the middle" problem.
Model architecture. Different models make different tradeoffs. Some are optimized for retrieval; others prioritize reasoning. A model that handles RAG well may not handle multi-hop reasoning well, and vice versa.
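The factors above are exactly the knobs a needle-in-a-haystack harness sweeps. As a sketch (the `ask_model` callable is a hypothetical stand-in for whatever client you use), the harness hides a fact at a chosen fractional depth in filler text, asks the question, and checks the reply:

```python
FILLER = "The quick brown fox jumps over the lazy dog. " * 3

def build_haystack(needle: str, n_paragraphs: int, depth: float) -> str:
    """Insert `needle` at a fractional depth (0.0 = start, 1.0 = end)
    inside n_paragraphs of filler text."""
    paragraphs = [FILLER] * n_paragraphs
    idx = min(int(depth * n_paragraphs), n_paragraphs)
    paragraphs.insert(idx, needle)
    return "\n\n".join(paragraphs)

def run_trial(ask_model, needle: str, question: str, answer: str,
              n_paragraphs: int = 50, depth: float = 0.5) -> bool:
    """One needle-in-a-haystack trial: hide the fact, ask, grade the reply."""
    context = build_haystack(needle, n_paragraphs, depth)
    prompt = f"{context}\n\nQuestion: {question}"
    return answer.lower() in ask_model(prompt).lower()
```

Sweeping `depth` over 0.0, 0.25, 0.5, 0.75, 1.0 while growing `n_paragraphs` reproduces the length and position effects in one grid; swapping a simple lookup question for a multi-hop one probes query complexity.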
What actually helps
Pre-process to simplify. Don't ask the model to do retrieval and reasoning at the same time if you can avoid it. Extract the relevant chunk first, then reason over it.
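A minimal sketch of this split, assuming a hypothetical `llm` callable that takes a prompt and returns text:

```python
def extract_then_answer(llm, document: str, question: str) -> str:
    """Two calls instead of one: first pull out the relevant passage,
    then reason over only that passage."""
    # Step 1: pure retrieval -- a simple lookup the model handles reliably.
    passage = llm(
        f"{document}\n\nCopy the passage most relevant to this question, "
        f"verbatim and nothing else: {question}"
    )
    # Step 2: reasoning over a short, clean context.
    return llm(f"Passage: {passage}\n\nQuestion: {question}\nAnswer:")
```

The second call never sees the full document, so the reasoning step can't get lost in it.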
Compress aggressively. Fewer tokens mean less noise. Summarize, filter, or restructure your context before injecting it into the prompt.
Position critical information deliberately. If you have information the model must not miss, put it at the beginning or end of your prompt — not in the middle.
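One way to apply this mechanically is to assemble the prompt so the critical fact sits at both edges (the function and field names here are illustrative):

```python
def assemble_prompt(critical: str, background: list[str], question: str) -> str:
    """Place must-not-miss information at both edges of the prompt,
    where retrieval is most reliable; bulk context goes in the middle."""
    return "\n\n".join([
        f"Key fact: {critical}",              # start: high-recall position
        *background,                          # middle: lower-recall bulk
        f"Reminder, key fact: {critical}",    # end: high-recall position
        f"Question: {question}",
    ])
```

Repeating the fact costs a few tokens but hedges against the lost-in-the-middle effect.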
Use structured formats. JSON and XML give the model clearer anchors than unstructured prose. Structured context is easier to parse and less likely to get "lost."
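For example, serializing context records as JSON gives every fact an explicit label instead of leaving it to be fished out of prose (a small illustrative sketch):

```python
import json

def structured_context(records: list[dict], question: str) -> str:
    """Serialize context as JSON so each field carries an explicit key,
    giving the model clear anchors to retrieve against."""
    return (
        "Context (JSON):\n"
        + json.dumps(records, indent=2)
        + f"\n\nQuestion: {question}"
    )
```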
The broader point
Most production LLM failures aren't capability failures — the model could answer correctly if given the right input. They're retrieval failures. The information is there; the model just didn't find it, or didn't weight it properly.
Understanding where and why retrieval breaks down is the first step toward building systems that are actually reliable.