---
title: "The unit of retrieval is the passage, not the page — and five gates decide which survives"
url: https://martech.llc/research/the-passage-not-the-page
publishedAt: 2026-06-07
updatedAt: 2026-06-07
author: sundar
category: research-note
summary: "AI search never retrieves your page. It splits documents into chunks, embeds and ranks those units, then answers from the few passages that survive. Drawing on dense-retrieval research and Google passage-ranking docs and patents, this maps the five gates a passage must clear."
soWhat: "Engines retrieve passages, not pages — so the work shifts from “rank my page” to making each passage self-contained, well-bounded, and front-loaded enough to survive retrieval."
tags: ["generative-engine-optimization","answer-engine-optimization","ai-search","passage-retrieval","content-structure"]
keywords: ["passage retrieval","chunking for ai search","how rag retrieves content","retrieval granularity","content structure for ai search","how to get cited by ai","dense passage retrieval","passage ranking"]
claims: [{"id":"claim-1","text":"In Google's October 2020 Search On announcement, the company introduced passage ranking and stated that sometimes the single sentence that answers a question is buried deep in a web page, and that the technology would improve 7 percent of search queries across all languages as it rolled out globally.","source":"https://blog.google/products/search/search-on/","sourceTitle":"Google — How AI is powering a more helpful Google (Search On 2020)","sourceDate":"2020-10-15"},{"id":"claim-2","text":"Google's ranking-systems documentation lists a 'Passage ranking system' and defines it as an AI system used to identify individual sections or 'passages' of a web page to better understand how relevant a page is to a search.","source":"https://developers.google.com/search/docs/appearance/ranking-systems-guide","sourceTitle":"Google Search Central — A guide to Google Search ranking systems","sourceDate":"2025-12-10"},{"id":"claim-3","text":"The Dense Passage Retrieval paper showed that a dense retriever using a simple dual-encoder over passages outperforms a strong Lucene-BM25 system by 9%-19% absolute in top-20 passage retrieval accuracy across open-domain QA datasets.","source":"https://arxiv.org/abs/2004.04906","sourceTitle":"Karpukhin et al. 
— Dense Passage Retrieval for Open-Domain Question Answering, EMNLP 2020","sourceDate":"2020-04-10"},{"id":"claim-4","text":"Google's Vertex AI RAG Engine documentation states that when documents are ingested into an index they are split into chunks, with a default chunk size of 1,024 tokens and a default chunk overlap of 256 tokens.","source":"https://docs.cloud.google.com/vertex-ai/generative-ai/docs/rag-engine/fine-tune-rag-transformations","sourceTitle":"Google Cloud — Fine-tune RAG transformations (Vertex AI RAG Engine)","sourceDate":"2026-06-05"},{"id":"claim-5","text":"The Dense X Retrieval study found that the choice of retrieval unit significantly affects retrieval and downstream performance, and that indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units in retrieval tasks.","source":"https://arxiv.org/abs/2312.06648","sourceTitle":"Chen et al. — Dense X Retrieval: What Retrieval Granularity Should We Use?","sourceDate":"2023-12-11"},{"id":"claim-6","text":"Anthropic's Contextual Retrieval experiments reported that combining contextual embeddings, contextual BM25, and reranking reduced the top-20 retrieval failure rate from 5.7% to 1.9% — roughly a 67% reduction — by prepending self-contained context to each chunk before embedding.","source":"https://www.anthropic.com/news/contextual-retrieval","sourceTitle":"Anthropic — Introducing Contextual Retrieval","sourceDate":"2024-09-19"},{"id":"claim-7","text":"The Late Chunking method embeds all tokens of a long text with a long-context model first and applies chunking afterward, so that chunk embeddings capture the full surrounding context, yielding superior results across retrieval tasks.","source":"https://arxiv.org/abs/2409.04701","sourceTitle":"Günther et al. 
— Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models","sourceDate":"2024-09-07"},{"id":"claim-8","text":"ColBERT introduced a late-interaction architecture that independently encodes the query and the document with BERT and computes relevance through a cheap token-level interaction step, making it two orders of magnitude faster with four orders of magnitude fewer FLOPs per query than prior BERT rankers at competitive effectiveness.","source":"https://arxiv.org/abs/2004.12832","sourceTitle":"Khattab & Zaharia — ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT, SIGIR 2020","sourceDate":"2020-04-27"},{"id":"claim-9","text":"ColBERTv2 produces multi-vector representations at the granularity of each token and decomposes relevance modeling into scalable token-level computations, reducing the space footprint of late-interaction models by six to ten times while reaching state-of-the-art quality.","source":"https://arxiv.org/abs/2112.01488","sourceTitle":"Santhanam et al. 
— ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction, NAACL 2022","sourceDate":"2021-12-02"},{"id":"claim-10","text":"Google's 'Context scoring adjustments for answer passages' patent (US9959315B1) describes adjusting each candidate answer passage's score by a context score derived from a heading vector that describes the path in the document's heading hierarchy from the root heading to the passage's heading.","source":"https://patents.google.com/patent/US9959315B1/en","sourceTitle":"Google — Context scoring adjustments for answer passages (US9959315B1)","sourceDate":"2018-05-01"},{"id":"claim-11","text":"Google's 'Scoring candidate answer passages' patent (US9940367B1) describes scoring passages extracted from resources and states that a candidate answer passage will be penalized if it includes text that passes formatting boundaries such as paragraphs and section breaks.","source":"https://patents.google.com/patent/US9940367B1/en","sourceTitle":"Google — Scoring candidate answer passages (US9940367B1)","sourceDate":"2018-04-10"},{"id":"claim-12","text":"The original Retrieval-Augmented Generation paper conditions generation on retrieved passages and compares two formulations — RAG-Sequence, which conditions on the same retrieved passages across the whole generated sequence, and RAG-Token, which can use different passages per token.","source":"https://arxiv.org/abs/2005.11401","sourceTitle":"Lewis et al. — Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, NeurIPS 2020","sourceDate":"2020-05-22"},{"id":"claim-13","text":"The RULER long-context benchmark found that although 17 evaluated language models all claim context sizes of 32K tokens or greater, only half of them can maintain satisfactory performance at the length of 32K, with almost all degrading as context length grows.","source":"https://arxiv.org/abs/2404.06654","sourceTitle":"Hsieh et al. 
— RULER: What's the Real Context Size of Your Long-Context Language Models?","sourceDate":"2024-04-09"},{"id":"claim-14","text":"The 'Lost in the Middle' study found that language model performance is often highest when relevant information occurs at the beginning or end of the input context and significantly degrades when models must access relevant information in the middle of long contexts, even for explicitly long-context models.","source":"https://arxiv.org/abs/2307.03172","sourceTitle":"Liu et al. — Lost in the Middle: How Language Models Use Long Contexts, TACL 2024","sourceDate":"2023-07-06"},{"id":"claim-15","text":"Google's featured-snippets documentation states that its systems automatically determine whether a page would make a good featured snippet for a search and, if so, elevate it, and that publishers cannot mark their own page as a featured snippet.","source":"https://developers.google.com/search/docs/appearance/featured-snippets","sourceTitle":"Google Search Central — Featured snippets and your website","sourceDate":"2025-12-10"},{"id":"claim-16","text":"The BEIR benchmark evaluated 10 retrieval systems zero-shot across 18 datasets and found that BM25 is a robust baseline while re-ranking and late-interaction-based models on average achieve the best zero-shot performance, at higher computational cost.","source":"https://arxiv.org/abs/2104.08663","sourceTitle":"Thakur et al. 
— BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models, NeurIPS 2021","sourceDate":"2021-04-17"},{"id":"claim-17","text":"A 2025 study on passage segmentation for extractive question answering emphasizes the critical role of chunking — how a long document is segmented into passages — in improving the performance of both dense passage retrieval and the end-to-end RAG pipeline.","source":"https://arxiv.org/abs/2501.09940","sourceTitle":"Passage Segmentation of Documents for Extractive Question Answering","sourceDate":"2025-01-17"},{"id":"claim-18","text":"A systematic comparison of retrieval-augmented generation and long-context LLMs found that when sufficiently resourced long-context can outperform RAG on average, while RAG retains a significant cost advantage, motivating a hybrid Self-Route method that routes queries between RAG and long-context.","source":"https://arxiv.org/abs/2407.16833","sourceTitle":"Li et al. — Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach","sourceDate":"2024-07-23"}]
---

# The unit of retrieval is the passage, not the page — and five gates decide which survives

AI search never retrieves your page. Before it answers a question, the engine cuts your document into chunks, embeds and ranks each chunk as its own unit, selects a small top-k subset into a bounded context window, and answers from whichever passages land where the model still pays attention. The page is a delivery truck; the passage is the cargo.

This is the gap most content strategy never closes. Teams optimize a page as one artifact — its title, its word count, its links — while the machine quietly disassembles it into fragments determined by a chunker the author never sees. A brilliant page that splits into half-thoughts is invisible. A plain page made of clean, self-contained units gets read. This piece traces the journey one passage takes through the retrieval machinery, using the primary sources that describe each step, and closes with a single model — the Passage Pipeline — plus an honest account of where the evidence stops.

<Aside kind="fact" title="The short version">
An engine splits your page into chunks, loses the context each chunk depended on, ranks the survivors at token and heading granularity, selects a top-k subset into a bounded window, and uses only the passages positioned where attention is highest. Five gates — Split, Contextualize, Rank, Select, Position — decide which passage survives, and each has exactly one operator lever.
</Aside>

<PassagePipeline />

## Why does your page disappear the moment it's retrieved?

The mental model of "get the page indexed" quietly broke years ago. Modern retrieval embeds and ranks passages, not documents. The canonical proof is peer-reviewed: <Claim id="claim-3">the [Dense Passage Retrieval paper](https://arxiv.org/abs/2004.04906) showed that a dense retriever using a simple dual-encoder over passages "outperforms a strong Lucene-BM25 system greatly by 9%-19% absolute in terms of top-20 passage retrieval accuracy" across open-domain QA datasets.</Claim> The retrievable, embedded, ranked unit throughout is the passage. The document is just where passages happen to live.

And the cut is mechanical. <Claim id="claim-4">Google's own [Vertex AI RAG Engine documentation](https://docs.cloud.google.com/vertex-ai/generative-ai/docs/rag-engine/fine-tune-rag-transformations) states that when documents are ingested they are split into chunks, with a default chunk size of 1,024 tokens and a 256-token overlap.</Claim> That is one vendor's default, not a universal constant — but it makes the point concrete: a number you never chose decides where your content is severed. <Claim id="claim-17">A 2025 study on [passage segmentation for extractive QA](https://arxiv.org/abs/2501.09940) emphasizes "the critical role of chunking in improving the performance of both dense passage retrieval and the end-to-end RAG pipeline"</Claim> — segmentation quality is its own load-bearing stage, separate from what you say or who links to you.
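
To make the cut concrete, here is a minimal sketch of a fixed-size chunker with overlap. It assumes whitespace-separated words as a stand-in for model tokens and borrows the 1,024/256 defaults quoted above. The function name `split_with_overlap` is illustrative and is not part of any vendor's API, and real chunkers may cut on different boundaries.

```python
def split_with_overlap(tokens, chunk_size=1024, overlap=256):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Assumption: whitespace words approximate tokens; a real tokenizer will differ.
words = ("word " * 2000).split()
chunks = split_with_overlap(words)
print([len(chunk) for chunk in chunks])
```

With these defaults a 2,000-word document yields three windows, and the last 256 words of one window are the first 256 of the next. Nothing in the loop looks at sentence or section boundaries, which is why a section written as a standalone answer survives the cut better than an argument that spans several sections.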

<PassageSplit />

The lever here is structural, not editorial: write sections that each read as a standalone answer, so that wherever the chunker cuts, the resulting unit is still a coherent, complete thought rather than a sentence severed mid-argument. Section design *is* retrieval design.

## What happens to a chunk when you lift it off the page?

Splitting solves one problem and creates another. The moment a chunk is cut out, it loses the surrounding context it leaned on. A paragraph that opened with "it raised this by 67%" is meaningless once the antecedent is three chunks away. This is not a stylistic nitpick — it is measured, and it is large.

<Claim id="claim-6">Anthropic's first-party [Contextual Retrieval experiments](https://www.anthropic.com/news/contextual-retrieval) reported that combining contextual embeddings, contextual BM25, and reranking reduced the top-20 retrieval failure rate from 5.7% to 1.9% (roughly a 67% reduction), with the contextual steps working by prepending self-contained context to each chunk before embedding.</Claim> No new authority, no new links: the gain comes from making each chunk explain itself and then reranking the candidates. <Claim id="claim-7">Jina AI's [Late Chunking](https://arxiv.org/abs/2409.04701) reaches the same conclusion from the model side, embedding all tokens of a long text first and chunking afterward so that "chunk embeddings capture the full contextual information, leading to superior results."</Claim> Both are techniques the *retrieval system* applies, but they point directly at the publisher's analogous lever: self-contained writing.
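
The idea of a chunk that explains itself can be sketched in a few lines. Anthropic's method generates the situating context with a language model; the sketch below swaps in a much simpler template built from the document title and heading trail, so it illustrates the shape of the technique rather than reproducing it. The `Chunk` class, its fields, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_title: str           # hypothetical field: title of the source document
    heading_path: list[str]  # hypothetical field: headings above the chunk
    text: str


def contextualize(chunk: Chunk) -> str:
    """Prepend a short situating line so the chunk can be read on its own."""
    trail = " > ".join(chunk.heading_path)
    context = f"From '{chunk.doc_title}'"
    if trail:
        context += f", section: {trail}"
    # The combined string, not the bare text, is what would be embedded.
    return f"{context}.\n{chunk.text}"


orphan = Chunk("Q3 report", ["Revenue", "Growth"], "It raised this by 67%.")
print(contextualize(orphan))
```

A publisher cannot run this step inside someone else's engine, but writing the subject into the block itself achieves the same effect without relying on the engine to add it.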

<PassageContextualize />

There is a corollary about granularity worth internalizing. <Claim id="claim-5">The [Dense X Retrieval study](https://arxiv.org/abs/2312.06648) found that the choice of retrieval unit significantly affects performance, and that "indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units in retrieval tasks."</Claim> Atomic, self-contained factoids are more retrievable than dense paragraphs that bury three claims in one sentence. The translation for a writer: name your subject in every block, kill orphan pronouns, and make one clear claim per unit so it stands alone the instant it is lifted out.
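
One rough way to catch orphan pronouns before publishing is a lint pass over each block. The check below is a deliberately crude heuristic: the word list is an assumption, it only inspects the first word, and it will flag some legitimate openers, so treat a hit as a prompt to reread the block and not as a verdict.

```python
import re

# Assumption: a small, hand-picked list of openers that usually need an antecedent.
ORPHAN_OPENERS = {"it", "this", "that", "these", "those", "they", "he", "she"}


def opens_with_orphan_pronoun(block: str) -> bool:
    """Return True when the first word of a block is a pronoun from the list."""
    first_word = re.match(r"\W*(\w+)", block)
    return bool(first_word) and first_word.group(1).lower() in ORPHAN_OPENERS


blocks = [
    "It raised this by 67%.",
    "Contextual Retrieval lowered the retrieval failure rate.",
]
for block in blocks:
    print(opens_with_orphan_pronoun(block), "|", block)
```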

## How is a passage actually scored?

Once chunked, candidates are ranked at sub-page granularity: tokens and structure, not "the page." <Claim id="claim-8">[ColBERT](https://arxiv.org/abs/2004.12832) introduced a "late interaction" architecture that independently encodes the query and document with BERT and scores relevance through a cheap token-level step, making it two orders of magnitude faster with four orders of magnitude fewer FLOPs per query than prior BERT rankers.</Claim> <Claim id="claim-9">Its successor [ColBERTv2](https://arxiv.org/abs/2112.01488) "produce[s] multi-vector representations at the granularity of each token and decompose[s] relevance modeling into scalable token-level computations."</Claim> Relevance is computed over the individual tokens of the passage, not over the page as a unit.
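
The token-level step in late interaction is usually described as a MaxSim sum: each query token takes its best-matching passage token, and those maxima are added up. The sketch below shows that scoring rule on toy two-dimensional vectors in plain Python. The vectors are made up for illustration, and a real system would use learned, normalized BERT embeddings and batched tensor operations.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, passage_vecs):
    """Sum, over query tokens, of the highest similarity to any passage token."""
    return sum(max(dot(q, p) for p in passage_vecs) for q in query_vecs)


# Toy embeddings (assumptions): one vector per token.
query = [[1.0, 0.0], [0.0, 1.0]]
passage_a = [[1.0, 0.0], [0.6, 0.8]]  # has a close match for both query tokens
passage_b = [[0.6, 0.8]]              # one token has to serve both query tokens

print(maxsim_score(query, passage_a) > maxsim_score(query, passage_b))
```

Because every query token looks for its own best match, a passage that names its subject and states its claim in its own words gives each query term something to land on.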

Classic search operationalized the same idea in public. <Claim id="claim-2">Google's [ranking-systems documentation](https://developers.google.com/search/docs/appearance/ranking-systems-guide) lists a "Passage ranking system" and defines it as an AI system used to "identify individual sections or 'passages' of a web page to better understand how relevant a page is to a search."</Claim> And two granted Google patents make a page's structure the literal input to the score. <Claim id="claim-10">The ["Context scoring adjustments for answer passages" patent (US9959315B1)](https://patents.google.com/patent/US9959315B1/en) describes adjusting a candidate passage's score by a context score derived from a "heading vector" — the path in the heading hierarchy from the root heading down to the passage's heading.</Claim> <Claim id="claim-11">The ["Scoring candidate answer passages" patent (US9940367B1)](https://patents.google.com/patent/US9940367B1/en) adds that "a candidate answer passage will be penalized if it includes text that passes formatting boundaries, such as paragraphs and section breaks."</Claim>
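
Neither patent publishes code, so the following is only a sketch of how a heading path of the kind the first patent describes could be derived for each passage in a Markdown file. It assumes ATX headings (`#` through `######`) and blocks separated by blank lines. The function name `heading_paths` is hypothetical, and the sketch computes the path only, not any score.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def heading_paths(markdown: str):
    """Return (heading_path, paragraph) pairs for each non-heading block."""
    stack = []    # open headings as (level, title), root first
    results = []
    for block in re.split(r"\n\s*\n", markdown.strip()):
        block = block.strip()
        if not block:
            continue
        match = HEADING.match(block)
        if match:
            level = len(match.group(1))
            # A new heading closes every open heading at the same or deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2)))
        else:
            results.append(([title for _, title in stack], block))
    return results


sample = "# Guide\n\nIntro text.\n\n## Setup\n\nInstall it.\n\n### Linux\n\nUse apt."
for path, paragraph in heading_paths(sample):
    print(" > ".join(path), "|", paragraph)
```

Reading the output for your own page is a quick check that every passage sits under a trail of headings that describes it.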

<PassageRank />

Read those two patents together and the lever is unambiguous: lead each section with the single atomic claim it proves, placed directly under a question-shaped heading, and keep the answer inside one bounded section. Headings are not cosmetic typography. They are scored retrieval structure.

<Pullquote>Most teams write for the reader who scrolls and the crawler that indexes. The machine that actually answers reads neither — it reads a chunk, scored on its own headings and boundaries.</Pullquote>

## Which passages even make it into the answer?

Retrieval is a bottleneck by design. The engine answers from a retrieved subset, never the whole corpus or page. <Claim id="claim-12">The [original RAG paper](https://arxiv.org/abs/2005.11401) conditions generation on retrieved passages and compares two formulations — "one which conditions on the same retrieved passages across the whole generated sequence, the other can use different passages per token."</Claim> Only the top-k slots are filled, and the rest of your page is never read.
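
The truncation itself is simple to sketch. The example below ranks passages by cosine similarity to a query embedding and keeps the top k. The vectors are toy values and the value of k is an assumption, and production systems typically add approximate nearest-neighbor search and a reranker on top of this step.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def top_k(query_vec, passages, k=3):
    """Return the texts of the k passages most similar to the query."""
    scored = [(cosine(query_vec, vec), text) for text, vec in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy (text, embedding) pairs; the embeddings are invented for illustration.
passages = [
    ("passage a", [1.0, 0.0]),
    ("passage b", [0.0, 1.0]),
    ("passage c", [1.0, 1.0]),
    ("passage d", [-1.0, 0.0]),
]
print(top_k([1.0, 0.1], passages, k=2))
```

Everything outside the returned list is invisible to the generator, whatever its quality.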

The slots are also scarcer than the spec sheet implies. <Claim id="claim-13">The [RULER benchmark](https://arxiv.org/abs/2404.06654) found that although 17 evaluated models "all claim context sizes of 32K tokens or greater, only half of them can maintain satisfactory performance at the length of 32K,"</Claim> with almost all degrading as length grows. A giant nominal window does not rescue a weak or buried unit. <Claim id="claim-18">A systematic [study of RAG versus long-context LLMs](https://arxiv.org/abs/2407.16833) shows why engines have reason to work from a chosen subset: long-context can win on average when sufficiently resourced, but RAG keeps "a distinct" cost advantage, which motivates a hybrid that routes queries between the two.</Claim>

<PassageSelect />

So the lever at the selection gate is breadth, not perfection: earn several strong, independently citable units on the topic, across the page and across pages, so that at least one survives the top-k truncation. A single buried "best" paragraph loses to a page that offers the reranker three good options. <Claim id="claim-16">It pays off across engines, too: the [BEIR benchmark](https://arxiv.org/abs/2104.08663) found that across 18 datasets, "re-ranking and late-interaction-based models on average achieve the best zero-shot performances."</Claim> The architectures that generalize best are the passage-level ones, so structuring for passage retrievability is an engine-portable bet, not a single-platform hack.

## Why does a selected passage still get ignored?

Even after a unit is selected into the window, placement governs whether the model uses it. This is the most counterintuitive gate, and the best-measured.

> Performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts.

<Claim id="claim-14">That U-shaped position bias is the central finding of ["Lost in the Middle"](https://arxiv.org/abs/2307.03172), and it holds "even for explicitly long-context models."</Claim> A correct answer sitting in the middle of the assembled context is quietly discounted. Google named the same failure for classic search years earlier.

> Very specific searches can be the hardest to get right, since sometimes the single sentence that answers your question might be buried deep in a web page.

<Claim id="claim-1">That is from Google's [Search On 2020 announcement](https://blog.google/products/search/search-on/), which introduced passage ranking and said the technology would "improve 7 percent of search queries across all languages" as it rolled out.</Claim> The principle predates LLMs: burial has always been the enemy.

<PassagePosition />

One honest boundary: this gate is about your page's *internal* position, not the engine's window. You cannot place your passage at the favorable edge of a model's context — the engine's reranker orders that. What you control is page-position: front-load the decisive answer near the top of the page and the top of each section, so the answer-bearing unit is the one most likely to be selected and surfaced rather than lost behind a long preamble.

## The Passage Pipeline, run on a real page

The five gates are a sequence, not a menu. A passage must survive Split (be a clean unit), Contextualize (stand alone), Rank (win on structure), Select (make the top-k), and Position (sit where it's used). The worked example below runs one real, public page — Wikipedia's *Retrieval-augmented generation* article — against one real query through all five gates. Each verdict is a structural observation, never a promise of citation.

<PassageWorkedExample />

There is one tactic the pipeline rules out. <Claim id="claim-15">Google's [featured-snippets documentation](https://developers.google.com/search/docs/appearance/featured-snippets) states that its systems "determine whether a page would make a good featured snippet… and if so, elevates it," and that you cannot mark your own page as a featured snippet.</Claim> Passage-level extraction is system-decided. There is no markup that forces selection — your only handle is to make a unit the easiest, best-bounded, best-positioned candidate. Structure is the whole game because structure is the only part you control.

## What this means for the work

The brief changes shape. "Rank this page" becomes "make these N passages individually survivable." Concretely, on any page that matters: audit each H2/H3 section to confirm it reads as a standalone answer at roughly 200–1,000 tokens; restate the subject in every block so no unit depends on its neighbors; lead each section with the atomic claim under a literal-question heading; keep each answer inside one bounded section; build several strong units per topic rather than one; and move the decisive answer to the top. None of these are tricks. They are the physical properties of a retrievable unit.
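
The first item in that audit can be roughed out in code. The sketch below splits a Markdown document at H2 and H3 headings and reports an approximate size for each section against the 200 to 1,000 range mentioned above. It counts whitespace-separated words as a stand-in for tokens, which is an assumption, and the function name `audit_sections` and its verdict labels are hypothetical.

```python
import re


def audit_sections(markdown: str, min_tokens=200, max_tokens=1000):
    """Return (heading, approx_tokens, verdict) for each H2/H3 section."""
    # Split just before every line that starts an H2 or H3 heading.
    parts = re.split(r"(?m)^(?=#{2,3}\s)", markdown)
    report = []
    for part in parts:
        lines = part.strip().splitlines()
        if not lines or not re.match(r"#{2,3}\s", lines[0]):
            continue  # skip the title and any preamble before the first H2/H3
        heading = lines[0].lstrip("#").strip()
        # Assumption: whitespace words approximate tokens; real counts differ.
        approx_tokens = len(" ".join(lines[1:]).split())
        if approx_tokens < min_tokens:
            verdict = "too short"
        elif approx_tokens > max_tokens:
            verdict = "too long"
        else:
            verdict = "ok"
        report.append((heading, approx_tokens, verdict))
    return report


page = "# Title\n\nIntro.\n\n## What is a chunk?\n\n" + "word " * 250 + "\n\n### Edge cases\n\nToo brief."
for row in audit_sections(page):
    print(row)
```

An H2 section is measured only up to its first H3 in this sketch, so a section made almost entirely of subsections will read as short. Size is also the only property checked; whether a section reads as a standalone answer still takes a human reader.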

## Where the evidence runs out

This model is a synthesis, and intellectual honesty is part of the work. Several limits matter.

The academic benchmarks — DPR, ColBERT, BEIR, RULER, Dense X — run on QA corpora and research pipelines. They establish the *mechanism class*: that retrieval operates on embedded, ranked passages and that structure changes outcomes. They do not prove that any named 2026 engine — ChatGPT, Perplexity, Gemini, or Google's AI surfaces — chunks, embeds, or ranks live web pages in a specific way. The Vertex 1,024/256 default is one vendor's configuration, not a public-search-engine standard.

The position-bias gate is the subtlest. "Lost in the Middle" and RULER measure the *engine's* assembled window, ordered by *its* reranker — not your page layout. Front-loading your answer is a real, operator-controllable move, but it is a reasoned translation of the finding, not the same intervention the papers tested. Likewise, Contextual Retrieval and Late Chunking are techniques the retrieval *system* applies; the publisher's analogue — self-contained writing — is an inference, a strong one, but an inference.

And no source here measures "front-loading the answer leads to more AI citations" on a live public engine. The link from passage structure to citation outcomes is reasoned across primary sources, and it should be read that way: a well-grounded hypothesis about durable retrieval mechanics, not a guarantee about any one product. What the evidence does support is the spine of the argument — the unit of retrieval is the passage, not the page, and structure is the lever you actually hold.

<div id="run-free" className="scroll-mt-24">
<InlineToolRunner defaultTab="citerra" />
</div>