Retrieval research · 9 min read · 2026-06-07

The unit of retrieval is the passage, not the page — and five gates decide which survives

AI search never retrieves your page. It splits documents into chunks, embeds and ranks those units, then answers from the few passages that survive. Drawing on dense-retrieval research and Google passage-ranking docs and patents, this maps the five gates a passage must clear.

AI search never retrieves your page. Before anything answers a question, the engine cuts your document into chunks, embeds and ranks those chunks as the unit, selects a small top-k subset into a bounded context window, and answers from whichever passages land where the model still pays attention. The page is a delivery truck; the passage is the cargo. And the truck only rolls once the engine decides to retrieve — the prior Grounding Gate.

This is the gap most content strategy never closes. Teams optimize a page as one artifact — its title, its word count, its links — while the machine quietly disassembles it into fragments determined by a chunker the author never sees. A brilliant page that splits into half-thoughts is invisible. A plain page made of clean, self-contained units gets read. This piece traces the journey one passage takes through the retrieval machinery, using the primary sources that describe each step, and closes with a single model — the Passage Pipeline — plus an honest account of where the evidence stops.

FIG 01 · The Passage Pipeline: five gates stand between your page and a cited passage

01 Split. Your page is cut into bounded chunks before it is read as a whole. Lever: self-contained sections.

02 Contextualize. Cutting strips the context each chunk relied on to make sense. Lever: name your own subject.

03 Rank. Units are scored at token + heading granularity, not page-level. Lever: answer under the heading.

04 Select. Only a top-k subset enters a bounded context window. Lever: several strong units.

05 Position. Where a unit sits decides whether the model actually uses it. Lever: front-load the answer.

A passage has to survive all five gates to be the text an engine answers from. Most teams optimize the page as one artifact and never ask where the cut lines fall, so the work moves from “rank my page” to “make these passages individually survivable.”

Framework: Martech LLC · synthesis of dense-retrieval research + Google passage-ranking docs & patents

Why does your page disappear the moment it's retrieved?

The mental model of "get the page indexed" quietly broke years ago. Modern retrieval embeds and ranks passages, not documents. The canonical proof is peer-reviewed: the Dense Passage Retrieval paper showed that a dense retriever using a simple dual-encoder over passages "outperforms a strong Lucene-BM25 system greatly by 9%-19% absolute in terms of top-20 passage retrieval accuracy" across open-domain QA datasets.[3] The retrievable, embedded, ranked unit throughout is the passage. The document is just where passages happen to live.

And the cut is mechanical. Google's own Vertex AI RAG Engine documentation states that when documents are ingested they are split into chunks, with a default chunk size of 1,024 tokens and a 256-token overlap.[4] That is one vendor's default, not a universal constant — but it makes the point concrete: a number you never chose decides where your content is severed. A 2025 study on passage segmentation for extractive QA emphasizes "the critical role of chunking in improving the performance of both dense passage retrieval and the end-to-end RAG pipeline"[17] — segmentation quality is its own load-bearing stage, separate from what you say or who links to you.
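To make the mechanical cut concrete, here is a minimal sketch of a fixed-size chunker with overlap. It is not the Vertex AI RAG Engine implementation: the function name `chunk_tokens` is hypothetical, and whitespace-style placeholder words stand in for real tokenizer output. Only the 1,024 / 256 defaults come from the documentation cited above.

```python
def chunk_tokens(tokens, chunk_size=1024, overlap=256):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# Placeholder tokens (an assumption; a real pipeline would use a tokenizer).
tokens = [f"w{i}" for i in range(2000)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])  # [1024, 1024, 464]
```

Note that the cut points depend only on token counts. Nothing in the loop looks at headings or sentence ends, which is why a section that is complete on its own is the safer unit to write.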

FIG 02 · Stage 1 · Split: a chunker you don't control cuts your page before anything reads it whole. (Illustrative of the documented mechanism; chunk sizes are vendor defaults, not measured on your page.)

Cuts land on section boundaries (self-contained): one chunk holds the complete definition, the next holds the complete how-it-works section. Each chunk is a standalone answer.

Cuts fall mid-argument (severed): one chunk ends mid-sentence, the next starts with a dangling clause. These are half-thoughts the retriever can't use.

Production retrieval segments a document into fixed, bounded chunks first: Google’s Vertex AI RAG Engine defaults to 1,024-token chunks with 256-token overlap. Wherever those cuts land is the unit you’re judged on. Section design is retrieval design.

Source: Google Vertex AI RAG Engine docs (1,024-token chunks / 256 overlap) · Dense X Retrieval (arXiv 2312.06648)

The lever here is structural, not editorial: write sections that each read as a standalone answer, so that wherever the chunker cuts, the resulting unit is still a coherent, complete thought rather than a sentence severed mid-argument. Section design is retrieval design.

What happens to a chunk when you lift it off the page?

Splitting solves one problem and creates another. The moment a chunk is cut out, it loses the surrounding context it leaned on. A paragraph that opened with "it raised this by 67%" is meaningless once the antecedent is three chunks away. This is not a stylistic nitpick — it is measured, and it is large.

Anthropic's first-party Contextual Retrieval experiments reported that prepending self-contained context to each chunk before embedding cut the top-20 retrieval failure rate from 5.7% to 3.7%, and that adding contextual BM25 and reranking brought it down to 1.9%, roughly a 67% reduction overall.[6] No new authority, no new links: most of the gain comes from making each chunk explain itself. Jina AI's Late Chunking reaches the same conclusion from the model side, embedding all tokens of a long text first and chunking afterward so that "chunk embeddings capture the full contextual information, leading to superior results."[7] Both are techniques the retrieval system applies, but they point directly at the publisher's analogous lever.
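The percentages are easy to verify. The short calculation below takes the stage-by-stage failure rates reported for this experiment (5.7%, 3.7%, 2.9%, 1.9%) and computes each relative reduction against the plain-chunk baseline. The helper name `relative_reduction` is just an illustrative choice.

```python
BASELINE = 5.7  # top-20 retrieval failure rate (%), plain chunks

# Cumulative stages and their reported failure rates (%).
stages = {
    "contextual embeddings": 3.7,
    "+ contextual BM25": 2.9,
    "+ reranking": 1.9,
}


def relative_reduction(before, after):
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100


for name, rate in stages.items():
    print(f"{name}: {relative_reduction(BASELINE, rate):.0f}% fewer failures")
# contextual embeddings: 35% fewer failures
# + contextual BM25: 49% fewer failures
# + reranking: 67% fewer failures
```

Read this way, context added to the chunk accounts for the first 35 to 49 points of the reduction, and reranking supplies the rest.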

FIG 03 · Stage 2 · Contextualize: lift a chunk off the page. Does it still know what it's about?

Chunk lifted out, orphaned: “It cut this by 67% after the team shipped the change.” What is “it”? What is “this”? Unrecoverable.

Chunk lifted out, self-identifying: “Contextual Retrieval with reranking cut top-20 retrieval failures by 67%.” It names its own subject and stands alone.

Top-20 retrieval failure rate
Plain chunk (baseline): 5.7%
+ contextual embeddings: 3.7%
+ contextual BM25: 2.9%
+ reranking: 1.9%

Anthropic’s first-party experiments cut the top-20 retrieval failure rate from 5.7% to 3.7% by giving each chunk enough context to stand alone, and to 1.9% (a 67% reduction) once contextual BM25 and reranking were added. The publisher’s version of that lever is plain writing discipline: kill orphan pronouns and restate the subject so no block depends on the paragraphs above it.

Source: Anthropic, Introducing Contextual Retrieval (2024-09-19) · Late Chunking (arXiv 2409.04701)

There is a corollary about granularity worth internalizing. The Dense X Retrieval study found that the choice of retrieval unit significantly affects performance, and that "indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units in retrieval tasks."[5] Atomic, self-contained factoids are more retrievable than dense paragraphs that bury three claims in one sentence. The translation for a writer: name your subject in every block, kill orphan pronouns, and make one clear claim per unit so it stands alone the instant it is lifted out.
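The orphan-pronoun rule can be checked mechanically. The sketch below is a rough heuristic, not a method from any of the cited papers: the word list `ORPHAN_OPENERS` and the function `opens_with_orphan` are hypothetical, and the check only looks at the first word of a block.

```python
import re

# Hypothetical list of openers that usually point at an antecedent elsewhere.
ORPHAN_OPENERS = {"it", "this", "that", "these", "those", "they"}


def opens_with_orphan(block):
    """True if the block's first word is one of ORPHAN_OPENERS."""
    match = re.match(r"\W*(\w+)", block)
    return bool(match) and match.group(1).lower() in ORPHAN_OPENERS


blocks = [
    "It cut this by 67% after the team shipped the change.",
    "Contextual Retrieval prepends context to each chunk before embedding.",
]
flags = [opens_with_orphan(b) for b in blocks]
print(flags)  # [True, False]
```

A flagged block is not necessarily wrong, since "this" can open a sentence that names its subject a few words later, so the output is a list of blocks to reread rather than a verdict.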

How is a passage actually scored?

Once chunked, candidates are ranked at sub-page granularity — tokens and structure, not "the page." ColBERT introduced a "late interaction" architecture that independently encodes the query and document with BERT and scores relevance through a cheap token-level step, making it two orders of magnitude faster with four orders of magnitude fewer FLOPs per query than prior BERT rankers.[8] Its successor ColBERTv2 "produce[s] multi-vector representations at the granularity of each token and decompose[s] relevance modeling into scalable token-level computations."[9] Relevance is computed on your text's constituents.
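The token-level step is small enough to sketch. In late interaction, each query token keeps the similarity of its best-matching document token, and those maxima are summed. The toy vectors below are made up for illustration and are not model output; a real system would use learned, normalized embeddings and an approximate index.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Late interaction: sum, over query tokens, of the best document-token match."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy 2-d token embeddings (assumed values for illustration only).
query = [[1.0, 0.0], [0.0, 1.0]]
passage_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # has a strong match per query token
passage_b = [[0.5, 0.5], [0.5, 0.5]]              # only diffuse matches

score_a = maxsim_score(query, passage_a)
score_b = maxsim_score(query, passage_b)
print(score_a, score_b)  # 2.0 1.0
```

The passage whose individual tokens line up with individual query tokens wins, which is the sense in which relevance is computed on the text's constituents.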

Classic search operationalized the same idea in public. Google's ranking-systems documentation lists a "Passage ranking system" and defines it as an AI system used to "identify individual sections or 'passages' of a web page to better understand how relevant a page is to a search."[2] And two granted Google patents make a page's structure the literal input to the score. The "Context scoring adjustments for answer passages" patent (US9959315B1) describes adjusting a candidate passage's score by a context score derived from a "heading vector" — the path in the heading hierarchy from the root heading down to the passage's heading.[10] The "Scoring candidate answer passages" patent (US9940367B1) adds that "a candidate answer passage will be penalized if it includes text that passes formatting boundaries, such as paragraphs and section breaks."[11]

FIG 04 · Stage 3 · Rank: structure is the score. Heading depth lifts a passage, boundaries penalize it. (Scores are illustrative of the patented mechanisms; they are relative, not Google's live values.)

H1 › intro paragraph: 34 (shallow heading path, generic)
H2 › H3 › answer sentence: 88 (deep path + query-relevant heading)
H2 › answer spanning a section break: −penalty (crosses a formatting boundary)

Two granted Google patents make a page’s structure the input to the score: one raises a passage by its position in the heading hierarchy and how well that heading matches the query; the other penalizes a candidate that crosses paragraph or section breaks. The lever is mechanical: lead each section with the atomic claim it proves, directly under a question-shaped heading.

Source: Google patents US9959315B1 (heading-vector context score) & US9940367B1 (boundary penalty) · ColBERT (arXiv 2004.12832)

Read those two patents together and the lever is unambiguous: lead each section with the single atomic claim it proves, placed directly under a question-shaped heading, and keep the answer inside one bounded section. Headings are not cosmetic typography. They are scored retrieval structure.
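To see what a heading path looks like as data, the sketch below walks a flattened page outline and attaches to each paragraph the chain of headings above it. This only illustrates the idea of a root-to-passage heading path; it is not the patented scoring method, and the tuple format and the function name `heading_paths` are assumptions.

```python
def heading_paths(items):
    """Yield (heading_path, paragraph) for each paragraph in document order.

    `items` is a hypothetical flattened outline: ("h", level, text) for a
    heading, ("p", text) for a paragraph.
    """
    stack = []  # currently open headings as (level, text)
    for item in items:
        if item[0] == "h":
            _, level, text = item
            # Close headings at the same or deeper level before opening this one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, text))
        else:
            yield [text for _, text in stack], item[1]


page = [
    ("h", 1, "Retrieval-augmented generation"),
    ("p", "RAG pairs a retriever with a generator."),
    ("h", 2, "How it works"),
    ("h", 3, "What is chunking?"),
    ("p", "Chunking splits a document into bounded units."),
    ("h", 2, "History"),
    ("p", "RAG was introduced in 2020."),
]
paths = list(heading_paths(page))
for path, paragraph in paths:
    print(" › ".join(path), "|", paragraph)
```

Printing the paths for your own page is a quick way to spot answers that sit under a generic heading, or under no heading that resembles the question.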

Most teams write for the reader who scrolls and the crawler that indexes. The machine that actually answers reads neither — it reads a chunk, scored on its own headings and boundaries.

Which passages even make it into the answer?

Retrieval is a bottleneck by design. The engine answers from a retrieved subset, never the whole corpus or page. The original RAG paper conditions generation on retrieved passages and compares two formulations — "one which conditions on the same retrieved passages across the whole generated sequence, the other can use different passages per token."[12] Only the top-k slots are filled, and the rest of your page is never read.
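The truncation itself is a one-liner. The sketch below keeps the k highest-scoring candidates and drops the rest. The passage names echo the kinds of units a page might contain, and the scores are invented for illustration; real systems score with a retriever or reranker.

```python
import heapq


def select_top_k(scored_passages, k):
    """Keep the k highest-scoring (name, score) pairs; the rest are never read."""
    return heapq.nlargest(k, scored_passages, key=lambda item: item[1])


# Scores are made up for illustration.
candidates = [
    ("definition passage", 0.91),
    ("how-it-works passage", 0.84),
    ("history passage", 0.62),
    ("buried mid-page aside", 0.40),
    ("weak duplicate", 0.35),
    ("off-topic tangent", 0.08),
]
window = select_top_k(candidates, k=3)
print([name for name, _ in window])
# ['definition passage', 'how-it-works passage', 'history passage']
```

Everything below the cut contributes nothing to the answer, regardless of how good the page is as a whole.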

The slots are also scarcer than the spec sheet implies. The RULER benchmark found that although 17 evaluated models "all claim context sizes of 32K tokens or greater, only half of them can maintain satisfactory performance at the length of 32K,"[13] with almost all degrading as length grows. A giant nominal window does not rescue a weak or buried unit. A systematic comparison of RAG and long-context LLMs points the same way on economics: long-context can outperform RAG on average when sufficiently resourced, but RAG retains a significant cost advantage, which motivates a hybrid that routes queries between the two.[18]

FIG 05 · Stage 4 · Select: only a top-k subset reaches the window, and the window is smaller than advertised

Candidate passages → top-k selected: definition passage, how-it-works passage, history passage, buried mid-page aside, weak duplicate, off-topic tangent.

A single buried “best” paragraph often misses the cut; a page with several strong, independent units lands at least one in the window.

Nominal vs usable context (32K claim)
Advertised window: 32K tokens
Models holding up at 32K: ~half

Of 17 models claiming 32K+ windows, RULER found only half maintain satisfactory performance at 32K. Big windows don’t rescue weak or buried units.

Source: RAG (arXiv 2005.11401) · RULER long-context benchmark (arXiv 2404.06654) · BEIR (arXiv 2104.08663)

So the lever at the selection gate is breadth, not perfection: earn several strong, independently-citable units on the topic across the page — and across pages — so at least one survives the top-k truncation. A single buried "best" paragraph loses to a page that offers the reranker three good options. It pays off across engines, too: the BEIR benchmark found that across 18 datasets, "re-ranking and late-interaction-based models on average achieve the best zero-shot performances"[16] — the architectures that generalize best are the passage-level ones, so structuring for passage retrievability is an engine-portable bet, not a single-platform hack.

Why does a selected passage still get ignored?

Even after a unit is selected into the window, placement governs whether the model uses it. This is the most counterintuitive gate, and the best-measured.

"Performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts."

That U-shaped position bias is the central finding of "Lost in the Middle", and it holds "even for explicitly long-context models."[14] A correct answer sitting in the middle of the assembled context is quietly discounted. Google named the same failure for classic search years earlier.

"Very specific searches can be the hardest to get right, since sometimes the single sentence that answers your question might be buried deep in a web page."

That is from Google's Search On 2020 announcement, which introduced passage ranking and said the technology would "improve 7 percent of search queries across all languages" as it rolled out.[1] The principle predates LLMs: burial has always been the enemy.

FIG 06 · Stage 5 · Position: even a selected passage gets ignored if it lands in the middle. (Illustrative of the U-shaped finding; the engine orders its own window, and your lever is page-position front-loading.)

Positions shown, left to right: top of context, early, middle, late, end of context.

“Lost in the Middle” found a U-shaped position bias: models use information best at the beginning and end of a context and significantly worse in the middle, even long-context models. Google named the same failure for classic search in 2020 and shipped passage ranking to fix it. The operator lever is page-position, not window-position: don’t bury the decisive answer behind a long preamble.

Source: Lost in the Middle (arXiv 2307.03172, TACL 2024) · Google Search On 2020 ('buried deep in a web page')

One honest boundary: this gate is about your page's internal position, not the engine's window. You cannot place your passage at the favorable edge of a model's context — the engine's reranker orders that. What you control is page-position: front-load the decisive answer near the top of the page and the top of each section, so the answer-bearing unit is the one most likely to be selected and surfaced rather than lost behind a long preamble.

The Passage Pipeline, run on a real page

The five gates are a sequence, not a menu. A passage must survive Split (be a clean unit), Contextualize (stand alone), Rank (win on structure), Select (make the top-k), and Position (sit where it's used). The worked example below runs one real, public page — Wikipedia's Retrieval-augmented generation article — against one real query through all five gates. Each verdict is a structural observation, never a promise of citation.

Worked example · run the pipeline: one real page, one real query, walked through all five gates

The page + the query: Wikipedia’s Retrieval-augmented generation article against the query “what is retrieval-augmented generation?”

One passage through the pipeline

Stage 1 · Split
Observation: The lead section chunks into a self-contained definition; ~1,024-token cuts fall on section boundaries.
Tests: do the cut lines leave whole units?
What wins this gate: Bounded sections, each a complete thought, with no answer severed across a cut.

Stage 2 · Contextualize
Observation: The opening sentence names its subject ('Retrieval-augmented generation (RAG) is…'), with no orphan pronoun.
Tests: does the chunk self-identify when lifted out?
What wins this gate: Restate the entity in the block; never open with 'it' or 'this technique'.

Stage 3 · Rank
Observation: A question-shaped heading nests cleanly (H2 › H3) with the definition front-loaded directly beneath it.
Tests: is the heading path deep + relevant, answer un-split?
What wins this gate: Lead each section with the atomic claim, under the literal question.

Stage 4 · Select
Observation: The page offers several independently-citable units (definition, how-it-works, history), not one.
Tests: are there multiple strong units for top-k?
What wins this gate: Build redundancy: several strong passages, so one survives truncation.

Stage 5 · Position
Observation: The decisive answer sits at the very top of the page, in the high-salience zone, not mid-document.
Tests: is the payoff front-loaded, not buried?
What wins this gate: Put the answer first, at page-top and section-top.

Result: 5/5 steps covered.

Each verdict here is a structural observation (this passage is self-contained, front-loaded, well-bounded), not a claim that it therefore gets cited. The pipeline tells you what makes a passage survivable; the engine still decides. Your control ends at structure, which is exactly why structure is where the work is.

Illustrative walkthrough of the cited mechanisms on a public page: structural observations only, not a citation outcome.

There is one tactic the pipeline rules out. Google's featured-snippets documentation states that its systems "determine whether a page would make a good featured snippet… and if so, elevates it," and that you cannot mark your own page as a featured snippet.[15] Passage-level extraction is system-decided. There is no markup that forces selection — your only handle is to make a unit the easiest, best-bounded, best-positioned candidate. Structure is the whole game because structure is the only part you control.

What this means for the work

The brief changes shape. "Rank this page" becomes "make these N passages individually survivable." Concretely, on any page that matters: audit each H2/H3 section to confirm it reads as a standalone answer at roughly 200–1,000 tokens; restate the subject in every block so no unit depends on its neighbors; lead each section with the atomic claim under a literal-question heading; keep each answer inside one bounded section; build several strong units per topic rather than one; and move the decisive answer to the top. None of these are tricks. They are the physical properties of a retrievable unit.
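The size check in that audit is simple to automate. The sketch below flags sections whose rough token count falls outside the 200 to 1,000 range. It assumes the page has already been split into a heading-to-body mapping, and it counts whitespace-separated words as a stand-in for tokens, which typically undercounts what a real tokenizer would report. The name `audit_sections` is hypothetical.

```python
def audit_sections(sections, min_tokens=200, max_tokens=1000):
    """Map each heading to (verdict, rough_token_count).

    `sections` maps heading -> body text. Whitespace words approximate
    tokens here, so treat the counts as a lower bound.
    """
    report = {}
    for heading, body in sections.items():
        count = len(body.split())
        if count < min_tokens:
            verdict = "too short"
        elif count > max_tokens:
            verdict = "too long"
        else:
            verdict = "ok"
        report[heading] = (verdict, count)
    return report


# Synthetic bodies of known length, for illustration.
sections = {
    "What is chunking?": "word " * 50,
    "How is a passage scored?": "word " * 500,
    "Everything else": "word " * 1500,
}
report = audit_sections(sections)
for heading, (verdict, count) in report.items():
    print(f"{heading}: {verdict} ({count} words)")
```

The other items on the list, such as whether a section reads as a standalone answer, still need a human read; the count only tells you where to look first.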

Where the evidence runs out

This model is a synthesis, and intellectual honesty is part of the work. Several limits matter.

The academic benchmarks — DPR, ColBERT, BEIR, RULER, Dense X — run on QA corpora and research pipelines. They establish the mechanism class: that retrieval operates on embedded, ranked passages and that structure changes outcomes. They do not prove that any named 2026 engine — ChatGPT, Perplexity, Gemini, or Google's AI surfaces — chunks, embeds, or ranks live web pages in a specific way. The Vertex 1,024/256 default is one vendor's configuration, not a public-search-engine standard.

The position-bias gate is the subtlest. "Lost in the Middle" and RULER measure the engine's assembled window, ordered by its reranker — not your page layout. Front-loading your answer is a real, operator-controllable move, but it is a reasoned translation of the finding, not the same intervention the papers tested. Likewise, Contextual Retrieval and Late Chunking are techniques the retrieval system applies; the publisher's analogue — self-contained writing — is an inference, a strong one, but an inference.

And no source here measures "front-loading the answer leads to more AI citations" on a live public engine. The link from passage structure to citation outcomes is reasoned across primary sources, and it should be read that way: a well-grounded hypothesis about durable retrieval mechanics, not a guarantee about any one product. What the evidence does support is the spine of the argument — the unit of retrieval is the passage, not the page, and structure is the lever you actually hold.

Frequently asked questions

Does AI search retrieve whole pages or passages?
Passages. Production retrieval and RAG systems split a document into bounded chunks before anything reads it as a whole, then embed and rank those chunks as the unit. The peer-reviewed dense passage retrieval work operates entirely at passage level, and Google's Vertex AI RAG Engine documentation lists a default chunk size of 1,024 tokens. 'Getting the page indexed' is the wrong mental model; you get passages embedded.

What is the Passage Pipeline?
It is a five-gate model of how a page becomes retrievable text: Split (the engine cuts your page into chunks), Contextualize (cutting strips the context each chunk needs to stand alone), Rank (units are scored at token and heading granularity), Select (only a top-k subset enters a bounded window), and Position (where a unit sits decides whether the model uses it). Each gate has one operator lever.

How should I structure content so AI search can retrieve it?
Write self-contained sections that read as standalone answers, name the subject explicitly in each block so it survives being lifted off the page, lead each section with the atomic claim directly under a question-shaped heading, keep an answer inside one bounded section rather than letting it cross a section break, and front-load the decisive answer high on the page rather than burying it after a long preamble.

Why does chunk structure matter more than authority for retrieval?
Anthropic's first-party Contextual Retrieval experiments cut the top-20 retrieval failure rate from 5.7% to 3.7% by giving each chunk enough context to stand alone, and to 1.9% (a roughly 67% reduction) once contextual BM25 and reranking were added, with no change in authority. Segmentation quality is a load-bearing variable separate from links or domain strength: a brilliant page that chunks into half-thoughts is effectively invisible.

Do larger context windows make content structure irrelevant?
No. The RULER benchmark found that of 17 models claiming 32K-plus token windows, only half maintain satisfactory performance at 32K, and 'Lost in the Middle' shows performance degrades for information in the middle of a long context even for long-context models. The selection and position gates still bite, so tight, front-loaded units still win.

Can I tag a passage as 'the answer' for AI search?
No. Google's documentation states you cannot mark a page as a featured snippet; its systems decide whether a page makes a good answer and elevate it automatically. Passage-level extraction is algorithmic and out of publisher control. Your only handle is structure (making a unit the easiest, best-bounded candidate), not markup that forces selection.

How is this different from how AI search chooses what to cite?
Citation selection is about which already-retrieved sentence gets attributed; the Passage Pipeline is upstream of that, and concerns whether your text becomes a candidate unit at all. A sentence is never retrieved alone; it rides inside a chunk, and whether that chunk is well-bounded, self-contained, and well-positioned decides whether the sentence is ever embedded, ranked, and placed into the window.

Sources · 18

Every claim, dated and linked

[1] In Google's October 2020 Search On announcement, the company introduced passage ranking and stated that sometimes the single sentence that answers a question is buried deep in a web page, and that the technology would improve 7 percent of search queries across all languages as it rolled out globally.
Google. How AI is powering a more helpful Google (Search On 2020). 2020-10-15

[2] Google's ranking-systems documentation lists a 'Passage ranking system' and defines it as an AI system used to identify individual sections or 'passages' of a web page to better understand how relevant a page is to a search.
Google Search Central. A guide to Google Search ranking systems. 2025-12-10

[3] The Dense Passage Retrieval paper showed that a dense retriever using a simple dual-encoder over passages outperforms a strong Lucene-BM25 system by 9%-19% absolute in top-20 passage retrieval accuracy across open-domain QA datasets.
Karpukhin et al. Dense Passage Retrieval for Open-Domain Question Answering, EMNLP 2020. 2020-04-10

[4] Google's Vertex AI RAG Engine documentation states that when documents are ingested into an index they are split into chunks, with a default chunk size of 1,024 tokens and a default chunk overlap of 256 tokens.
Google Cloud. Fine-tune RAG transformations (Vertex AI RAG Engine). 2026-06-05

[5] The Dense X Retrieval study found that the choice of retrieval unit significantly affects retrieval and downstream performance, and that indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units in retrieval tasks.
Chen et al. Dense X Retrieval: What Retrieval Granularity Should We Use? 2023-12-11

[6] Anthropic's Contextual Retrieval experiments reported that combining contextual embeddings, contextual BM25, and reranking reduced the top-20 retrieval failure rate from 5.7% to 1.9%, roughly a 67% reduction, by prepending self-contained context to each chunk before embedding.
Anthropic. Introducing Contextual Retrieval. 2024-09-19

[7] The Late Chunking method embeds all tokens of a long text with a long-context model first and applies chunking afterward, so that chunk embeddings capture the full surrounding context, yielding superior results across retrieval tasks.
Günther et al. Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models. 2024-09-07

[8] ColBERT introduced a late-interaction architecture that independently encodes the query and the document with BERT and computes relevance through a cheap token-level interaction step, making it two orders of magnitude faster with four orders of magnitude fewer FLOPs per query than prior BERT rankers at competitive effectiveness.
Khattab & Zaharia. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT, SIGIR 2020. 2020-04-27

[9] ColBERTv2 produces multi-vector representations at the granularity of each token and decomposes relevance modeling into scalable token-level computations, reducing the space footprint of late-interaction models by six to ten times while reaching state-of-the-art quality.
Santhanam et al. ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction, NAACL 2022. 2021-12-02

[10] Google's 'Context scoring adjustments for answer passages' patent (US9959315B1) describes adjusting each candidate answer passage's score by a context score derived from a heading vector that describes the path in the document's heading hierarchy from the root heading to the passage's heading.
Google. Context scoring adjustments for answer passages (US9959315B1). 2018-05-01

[11] Google's 'Scoring candidate answer passages' patent (US9940367B1) describes scoring passages extracted from resources and states that a candidate answer passage will be penalized if it includes text that passes formatting boundaries such as paragraphs and section breaks.
Google. Scoring candidate answer passages (US9940367B1). 2018-04-10

[12] The original Retrieval-Augmented Generation paper conditions generation on retrieved passages and compares two formulations: RAG-Sequence, which conditions on the same retrieved passages across the whole generated sequence, and RAG-Token, which can use different passages per token.
Lewis et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, NeurIPS 2020. 2020-05-22

[13] The RULER long-context benchmark found that although 17 evaluated language models all claim context sizes of 32K tokens or greater, only half of them can maintain satisfactory performance at the length of 32K, with almost all degrading as context length grows.
Hsieh et al. RULER: What's the Real Context Size of Your Long-Context Language Models? 2024-04-09

[14] The 'Lost in the Middle' study found that language model performance is often highest when relevant information occurs at the beginning or end of the input context and significantly degrades when models must access relevant information in the middle of long contexts, even for explicitly long-context models.
Liu et al. Lost in the Middle: How Language Models Use Long Contexts, TACL 2024. 2023-07-06

[15] Google's featured-snippets documentation states that its systems automatically determine whether a page would make a good featured snippet for a search and, if so, elevate it, and that publishers cannot mark their own page as a featured snippet.
Google Search Central. Featured snippets and your website. 2025-12-10

[16] The BEIR benchmark evaluated 10 retrieval systems zero-shot across 18 datasets and found that BM25 is a robust baseline while re-ranking and late-interaction-based models on average achieve the best zero-shot performance, at higher computational cost.
Thakur et al. BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models, NeurIPS 2021. 2021-04-17

[17] A 2025 study on passage segmentation for extractive question answering emphasizes the critical role of chunking (how a long document is segmented into passages) in improving the performance of both dense passage retrieval and the end-to-end RAG pipeline.
Passage Segmentation of Documents for Extractive Question Answering. 2025-01-17

[18] A systematic comparison of retrieval-augmented generation and long-context LLMs found that when sufficiently resourced long-context can outperform RAG on average, while RAG retains a significant cost advantage, motivating a hybrid Self-Route method that routes queries between RAG and long-context.
Li et al. Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach. 2024-07-23
