The unit of retrieval is the passage, not the page — and five gates decide which survives
AI search never retrieves your page. It splits documents into chunks, embeds and ranks those units, then answers from the few passages that survive. Drawing on dense-retrieval research and Google passage-ranking docs and patents, this maps the five gates a passage must clear.
AI search never retrieves your page. Before anything answers a question, the engine cuts your document into chunks, embeds and ranks those chunks as the unit, selects a small top-k subset into a bounded context window, and answers from whichever passages land where the model still pays attention. The page is a delivery truck; the passage is the cargo.
This is the gap most content strategy never closes. Teams optimize a page as one artifact — its title, its word count, its links — while the machine quietly disassembles it into fragments determined by a chunker the author never sees. A brilliant page that splits into half-thoughts is invisible. A plain page made of clean, self-contained units gets read. This piece traces the journey one passage takes through the retrieval machinery, using the primary sources that describe each step, and closes with a single model — the Passage Pipeline — plus an honest account of where the evidence stops.
The five gates, each with its operator lever:
- Split: Your page is cut into bounded chunks before it is read as a whole. Lever: self-contained sections.
- Contextualize: Cutting strips the context each chunk relied on to make sense. Lever: name your own subject.
- Rank: Units are scored at token + heading granularity, not page-level. Lever: answer under the heading.
- Select: Only a top-k subset enters a bounded context window. Lever: several strong units.
- Position: Where a unit sits decides whether the model actually uses it. Lever: front-load the answer.
A passage has to survive all five gates to be the text an engine answers from. Most teams optimize the page as one artifact and never ask where the cut lines fall, so the work moves from “rank my page” to “make these passages individually survivable.”
Framework: Martech LLC · synthesis of dense-retrieval research + Google passage-ranking docs & patents
Why does your page disappear the moment it's retrieved?
The mental model of "get the page indexed" quietly broke years ago. Modern retrieval embeds and ranks passages, not documents. The canonical proof is peer-reviewed: the Dense Passage Retrieval paper showed that a dense retriever using a simple dual-encoder over passages "outperforms a strong Lucene-BM25 system greatly by 9%-19% absolute in terms of top-20 passage retrieval accuracy" across open-domain QA datasets.[3] The retrievable, embedded, ranked unit throughout is the passage. The document is just where passages happen to live.
And the cut is mechanical. Google's own Vertex AI RAG Engine documentation states that when documents are ingested they are split into chunks, with a default chunk size of 1,024 tokens and a 256-token overlap.[4] That is one vendor's default, not a universal constant — but it makes the point concrete: a number you never chose decides where your content is severed. A 2025 study on passage segmentation for extractive QA emphasizes "the critical role of chunking in improving the performance of both dense passage retrieval and the end-to-end RAG pipeline"[17] — segmentation quality is its own load-bearing stage, separate from what you say or who links to you.
[Figure: two contrasted panels, “Each chunk is a standalone answer” versus “Half-thoughts the retriever can’t use”.]
Production retrieval segments a document into fixed, bounded chunks first: Google’s Vertex AI RAG Engine defaults to 1,024-token chunks with 256-token overlap. Wherever those cuts land is the unit you’re judged on. Section design is retrieval design.
Source: Google Vertex AI RAG Engine docs (1,024-token chunks / 256 overlap) · Dense X Retrieval (arXiv 2312.06648)
The lever here is structural, not editorial: write sections that each read as a standalone answer, so that wherever the chunker cuts, the resulting unit is still a coherent, complete thought rather than a sentence severed mid-argument. Section design is retrieval design.
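The mechanical cut described above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: it assumes whitespace-separated words stand in for model tokens (a real system would use its own tokenizer), the function name `chunk_tokens` is hypothetical, and the 1,024/256 defaults simply mirror the Vertex AI figures quoted earlier.

```python
def chunk_tokens(tokens, chunk_size=1024, overlap=256):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


# Assumption: whitespace words approximate tokens, for illustration only.
page_tokens = ("word " * 2000).split()
chunks = chunk_tokens(page_tokens)
print([len(c) for c in chunks])  # prints [1024, 1024, 464]
```

Nothing in the loop looks at sentences or headings: the boundary lands wherever the count says it lands, which is why a section that reads as a complete answer on its own is the safer unit to write.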
What happens to a chunk when you lift it off the page?
Splitting solves one problem and creates another. The moment a chunk is cut out, it loses the surrounding context it leaned on. A paragraph that opened with "it raised this by 67%" is meaningless once the antecedent is three chunks away. This is not a stylistic nitpick — it is measured, and it is large.
Anthropic's first-party Contextual Retrieval experiments reported that prepending chunk-specific context before embedding and BM25 indexing (contextual embeddings plus contextual BM25) cut the top-20 retrieval failure rate from 5.7% to 2.9%, a 49% reduction, and that adding a reranking step brought it to 1.9%, roughly a 67% reduction.[6] No new authority, no new links: most of the gain comes from making each chunk explain itself. Jina AI's Late Chunking reaches the same conclusion from the model side, embedding all tokens of a long text first and chunking afterward so that "chunk embeddings capture the full contextual information, leading to superior results."[7] Both are techniques the retrieval system applies, but they point directly at the publisher's analogous lever.
Orphaned: “It cut this by 67% after the team shipped the change.”
Self-contained: “Contextual Retrieval cut top-20 retrieval failures by 67% after prepending context to each chunk.”
Anthropic’s first-party experiments cut the top-20 retrieval failure rate from 5.7% to 1.9%, a 67% reduction, by giving each chunk enough context to stand alone and then reranking the results. The publisher’s version of that lever is plain writing discipline: kill orphan pronouns and restate the subject so no block depends on the paragraphs above it.
Source: Anthropic — Introducing Contextual Retrieval (2024-09-19) · Late Chunking (arXiv 2409.04701)
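The system-side step, prepending context to a chunk before it is embedded, can be sketched roughly as follows. This is not Anthropic's implementation: in the published technique a language model writes the situating context for each chunk, whereas here a fixed header built from the document title and nearest heading stands in for that step. The `Chunk` type and the `contextualize` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_title: str  # title of the page the chunk came from
    heading: str    # nearest heading above the chunk
    text: str       # the raw chunk as cut by the splitter


def contextualize(chunk: Chunk) -> str:
    """Prepend situating context so the chunk can be embedded on its own.

    Assumption: a static header replaces the model-generated context
    described in the Contextual Retrieval write-up.
    """
    header = f"From '{chunk.doc_title}', section '{chunk.heading}':"
    return f"{header} {chunk.text}"


orphan = Chunk(
    doc_title="Contextual Retrieval results",
    heading="Top-20 retrieval failure rate",
    text="It cut this by 67% after the team shipped the change.",
)
print(contextualize(orphan))
```

The string that gets embedded now names its own subject, which is the same property a writer can supply directly by restating the entity inside the block.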
There is a corollary about granularity worth internalizing. The Dense X Retrieval study found that the choice of retrieval unit significantly affects performance, and that "indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units in retrieval tasks."[5] Atomic, self-contained factoids are more retrievable than dense paragraphs that bury three claims in one sentence. The translation for a writer: name your subject in every block, kill orphan pronouns, and make one clear claim per unit so it stands alone the instant it is lifted out.
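A crude way to spot blocks that lean on their neighbors is to flag any block whose first word is a pronoun or demonstrative. The sketch below is a heuristic under stated assumptions: the word list is hand-picked and hypothetical, it will flag some legitimate openers such as "This study found", and the function name `opens_with_orphan` is invented for the example.

```python
import re

# Assumption: a small, hand-picked list of openers that usually need an antecedent.
ORPHAN_OPENERS = {"it", "this", "that", "these", "those", "they", "he", "she"}


def opens_with_orphan(block: str) -> bool:
    """Return True when the first word of a block is a context-dependent pronoun."""
    match = re.match(r"\W*([A-Za-z']+)", block)  # skip leading quotes or bullets
    return bool(match) and match.group(1).lower() in ORPHAN_OPENERS


blocks = [
    "It cut this by 67% after the team shipped the change.",
    "Contextual Retrieval cut top-20 retrieval failures by 67%.",
]
for block in blocks:
    print(opens_with_orphan(block), "|", block)
```

A flagged block is a prompt to reread it in isolation, not proof that the block is broken.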
How is a passage actually scored?
Once chunked, candidates are ranked at sub-page granularity — tokens and structure, not "the page." ColBERT introduced a "late interaction" architecture that independently encodes the query and document with BERT and scores relevance through a cheap token-level step, making it two orders of magnitude faster with four orders of magnitude fewer FLOPs per query than prior BERT rankers.[8] Its successor ColBERTv2 "produce[s] multi-vector representations at the granularity of each token and decompose[s] relevance modeling into scalable token-level computations."[9] Relevance is computed on your text's constituents.
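The token-level scoring step in late interaction is small enough to write out. In this sketch each query token takes its best-matching passage token and those maxima are summed, the operation usually called MaxSim. The two-dimensional vectors are toy stand-ins for BERT token embeddings and the function names are hypothetical; this is plain Python for readability, not how a production ranker is built.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs, passage_vecs):
    """Sum, over query tokens, of the best similarity to any passage token."""
    return sum(max(dot(q, p) for p in passage_vecs) for q in query_vecs)


# Toy embeddings: one vector per token (assumption, for illustration only).
query = [[1.0, 0.0], [0.0, 1.0]]
passage_a = [[1.0, 0.0], [0.0, 1.0]]  # has a match for both query tokens
passage_b = [[1.0, 0.0], [1.0, 0.0]]  # matches only the first query token

print(maxsim_score(query, passage_a))  # prints 2.0
print(maxsim_score(query, passage_b))  # prints 1.0
```

Because every query token needs its own match somewhere in the passage, a passage that covers the whole question in one place outscores one that covers only part of it.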
Classic search operationalized the same idea in public. Google's ranking-systems documentation lists a "Passage ranking system" and defines it as an AI system used to "identify individual sections or 'passages' of a web page to better understand how relevant a page is to a search."[2] And two granted Google patents make a page's structure the literal input to the score. The "Context scoring adjustments for answer passages" patent (US9959315B1) describes adjusting a candidate passage's score by a context score derived from a "heading vector" — the path in the heading hierarchy from the root heading down to the passage's heading.[10] The "Scoring candidate answer passages" patent (US9940367B1) adds that "a candidate answer passage will be penalized if it includes text that passes formatting boundaries, such as paragraphs and section breaks."[11]
[Figure: three candidate passages compared: “shallow heading path, generic”; “deep path + query-relevant heading”; “crosses a formatting boundary”.]
Two granted Google patents make a page’s structure the input to the score: one raises a passage by its position in the heading hierarchy and how well that heading matches the query; the other penalizes a candidate that crosses paragraph or section breaks. The lever is mechanical — lead each section with the atomic claim it proves, directly under a question-shaped heading.
Source: Google patents US9959315B1 (heading-vector context score) & US9940367B1 (boundary penalty) · ColBERT (arXiv 2004.12832)
Read those two patents together and the lever is unambiguous: lead each section with the single atomic claim it proves, placed directly under a question-shaped heading, and keep the answer inside one bounded section. Headings are not cosmetic typography. They are scored retrieval structure.
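The two structural inputs can be made concrete with a toy model. The patents describe signals, not a formula that can be reproduced, so everything numeric below is an assumption: the word-overlap score, the 0.5 penalty, and the names `heading_path` and `toy_context_score` are all hypothetical. The sketch only shows how a heading path is derived for a passage and how a boundary crossing could lower a score.

```python
import re


def _words(text):
    """Lowercased alphanumeric words, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def heading_path(headings, passage_line):
    """Return heading texts from the root down to the passage's heading.

    `headings` is a list of (line_index, level, text) in document order.
    """
    path = []
    for line_index, level, text in headings:
        if line_index > passage_line:
            break
        # A new heading closes any open heading at the same or a deeper level.
        path = [(lvl, txt) for lvl, txt in path if lvl < level]
        path.append((level, text))
    return [txt for _, txt in path]


def toy_context_score(query, path, passage):
    """Hypothetical score: share of query words found in the heading path,
    halved when the passage spans a paragraph break."""
    query_words = _words(query)
    overlap = len(query_words & _words(" ".join(path))) / max(len(query_words), 1)
    penalty = 0.5 if "\n\n" in passage else 1.0
    return overlap * penalty


headings = [
    (0, 1, "Retrieval-augmented generation"),
    (5, 2, "History"),
    (10, 2, "How it works"),
    (12, 3, "Chunking"),
]
path = heading_path(headings, passage_line=13)
print(path)
print(toy_context_score("what is chunking?", path, "Chunking splits a document."))
```

Under these assumptions the same passage text scores lower as soon as it contains a blank line, which is the intuition behind keeping an answer inside one bounded section.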
Most teams write for the reader who scrolls and the crawler that indexes. The machine that actually answers reads neither — it reads a chunk, scored on its own headings and boundaries.
Which passages even make it into the answer?
Retrieval is a bottleneck by design. The engine answers from a retrieved subset, never the whole corpus or page. The original RAG paper conditions generation on retrieved passages and compares two formulations — "one which conditions on the same retrieved passages across the whole generated sequence, the other can use different passages per token."[12] Only the top-k slots are filled, and the rest of your page is never read.
The slots are also scarcer than the spec sheet implies. The RULER benchmark found that although 17 evaluated models "all claim context sizes of 32K tokens or greater, only half of them can maintain satisfactory performance at the length of 32K,"[13] with almost all degrading as length grows. A giant nominal window does not rescue a weak or buried unit. A systematic comparison of RAG and long-context LLMs points the same way: long-context can win on average when sufficiently resourced, but RAG's lower cost remains a distinct advantage, which motivates a hybrid that routes queries between the two.[18]
A single buried “best” paragraph often misses the cut; a page with several strong, independent units lands at least one in the window.
Of 17 models claiming 32K+ windows, RULER found only half maintain satisfactory performance at 32K — big windows don’t rescue weak or buried units.
Source: RAG (arXiv 2005.11401) · RULER long-context benchmark (arXiv 2404.06654) · BEIR (arXiv 2104.08663)
So the lever at the selection gate is breadth, not perfection: earn several strong, independently-citable units on the topic across the page — and across pages — so at least one survives the top-k truncation. A single buried "best" paragraph loses to a page that offers the reranker three good options. It pays off across engines, too: the BEIR benchmark found that across 18 datasets, "re-ranking and late-interaction-based models on average achieve the best zero-shot performances"[16] — the architectures that generalize best are the passage-level ones, so structuring for passage retrievability is an engine-portable bet, not a single-platform hack.
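The selection gate can be pictured as a greedy fill of a bounded window. This is a minimal sketch under stated assumptions: the values of `k` and `token_budget` are arbitrary, the function name `select_for_context` is hypothetical, and real engines may order or pack candidates differently.

```python
def select_for_context(scored_chunks, k=5, token_budget=4000):
    """Keep up to k of the best-scoring chunks that still fit in the budget.

    `scored_chunks` is a list of (score, token_count, text) tuples.
    """
    selected, used = [], 0
    ranked = sorted(scored_chunks, key=lambda c: c[0], reverse=True)
    for score, token_count, text in ranked:
        if len(selected) == k:
            break
        if used + token_count > token_budget:
            continue  # too large for the remaining window; try the next one
        selected.append(text)
        used += token_count
    return selected


candidates = [
    (0.9, 1000, "definition"),
    (0.8, 3500, "long background section"),
    (0.7, 1000, "how it works"),
    (0.2, 500, "footer note"),
]
print(select_for_context(candidates, k=2, token_budget=4000))
# prints ['definition', 'how it works']
```

In this toy run the second-best candidate is skipped because it does not fit, and a third, compact unit takes the slot. A page that offers several bounded units has more ways to land one in the window.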
Why does a selected passage still get ignored?
Even after a unit is selected into the window, placement governs whether the model uses it. This is the most counterintuitive gate, and the best-measured.
"Performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts."
That U-shaped position bias is the central finding of "Lost in the Middle", and it holds "even for explicitly long-context models."[14] A correct answer sitting in the middle of the assembled context is quietly discounted. Google named the same failure for classic search years earlier.
"Very specific searches can be the hardest to get right, since sometimes the single sentence that answers your question might be buried deep in a web page."
That is from Google's Search On 2020 announcement, which introduced passage ranking and said the technology would "improve 7 percent of search queries across all languages" as it rolled out.[1] The principle predates LLMs: burial has always been the enemy.
“Lost in the Middle” found a U-shaped position bias: models use information best at the beginning and end of a context and significantly worse in the middle — even long-context models. Google named the same failure for classic search in 2020 and shipped passage ranking to fix it. The operator lever is page-position, not window-position: don’t bury the decisive answer behind a long preamble.
Source: Lost in the Middle (arXiv 2307.03172, TACL 2024) · Google Search On 2020 ('buried deep in a web page')
One honest boundary: this gate is about your page's internal position, not the engine's window. You cannot place your passage at the favorable edge of a model's context — the engine's reranker orders that. What you control is page-position: front-load the decisive answer near the top of the page and the top of each section, so the answer-bearing unit is the one most likely to be selected and surfaced rather than lost behind a long preamble.
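Page-position is easy to measure. The sketch below reports how far into a page the answer-bearing sentence starts, as a fraction of the page length. It measures the publisher-controlled quantity only, not where a passage ends up in any model's context window, and the function name `answer_depth` is hypothetical.

```python
def answer_depth(page_text, answer_sentence):
    """Return where the answer starts, from 0.0 (top) toward 1.0 (bottom),
    or None when the sentence is not found verbatim."""
    index = page_text.find(answer_sentence)
    if index == -1 or not page_text:
        return None
    return index / len(page_text)


buried = "Preamble. " * 50 + "RAG retrieves passages before generating." + " Outro."
front_loaded = "RAG retrieves passages before generating. " + "Detail. " * 50

print(answer_depth(buried, "RAG retrieves passages before generating."))
print(answer_depth(front_loaded, "RAG retrieves passages before generating."))
```

A value near zero means the decisive sentence opens the page; a value well past the midpoint is the long-preamble pattern described above.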
The Passage Pipeline, run on a real page
The five gates are a sequence, not a menu. A passage must survive Split (be a clean unit), Contextualize (stand alone), Rank (win on structure), Select (make the top-k), and Position (sit where it's used). The worked example below runs one real, public page — Wikipedia's Retrieval-augmented generation article — against one real query through all five gates. Each verdict is a structural observation, never a promise of citation.
Wikipedia’s Retrieval-augmented generation article against the query “what is retrieval-augmented generation?”
- Stage 1 · Split · you cover
The lead section chunks into a self-contained definition; ~1,024-token cuts fall on section boundaries.
Tests: do the cut lines leave whole units?
What wins this gate: Bounded sections, each a complete thought, with no answer severed across a cut.
- Stage 2 · Contextualize · you cover
The opening sentence names its subject: 'Retrieval-augmented generation (RAG) is…', with no orphan pronoun.
Tests: does the chunk self-identify when lifted out?
What wins this gate: Restate the entity in the block; never open with 'it' or 'this technique'.
- Stage 3 · Rank · you cover
A question-shaped heading nests cleanly (H2 › H3) with the definition front-loaded directly beneath it.
Tests: is the heading path deep + relevant, answer un-split?
What wins this gate: Lead each section with the atomic claim, under the literal question.
- Stage 4 · Select · you cover
The page offers several independently-citable units (definition, how-it-works, history), not one.
Tests: are there multiple strong units for top-k?
What wins this gate: Build redundancy: several strong passages, so one survives truncation.
- Stage 5 · Position · you cover
The decisive answer sits at the very top of the page, in the high-salience zone, not mid-document.
Tests: is the payoff front-loaded, not buried?
What wins this gate: Put the answer first: page-top and section-top.
Each verdict here is a structural observation — this passage is self-contained, front-loaded, well-bounded — not a claim that it therefore gets cited. The pipeline tells you what makes a passage survivable; the engine still decides. Your control ends at structure, which is exactly why structure is where the work is.
Illustrative walkthrough of the cited mechanisms on a public page — structural observations only, not a citation outcome
There is one tactic the pipeline rules out. Google's featured-snippets documentation states that its systems "determine whether a page would make a good featured snippet… and if so, elevates it," and that you cannot mark your own page as a featured snippet.[15] Passage-level extraction is system-decided. There is no markup that forces selection — your only handle is to make a unit the easiest, best-bounded, best-positioned candidate. Structure is the whole game because structure is the only part you control.
What this means for the work
The brief changes shape. "Rank this page" becomes "make these N passages individually survivable." Concretely, on any page that matters: audit each H2/H3 section to confirm it reads as a standalone answer at roughly 200–1,000 tokens; restate the subject in every block so no unit depends on its neighbors; lead each section with the atomic claim under a literal-question heading; keep each answer inside one bounded section; build several strong units per topic rather than one; and move the decisive answer to the top. None of these are tricks. They are the physical properties of a retrievable unit.
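Two items on that checklist lend themselves to a quick automated pass: section length and question-shaped headings. The sketch below is one possible audit helper, assuming whitespace-separated words approximate tokens; the function name `audit_section` is hypothetical and the 200 to 1,000 range is the rough figure given above, not a threshold any engine publishes.

```python
def audit_section(heading, body, min_tokens=200, max_tokens=1000):
    """Return a list of structural warnings for one H2/H3 section.

    Assumption: whitespace-separated words approximate tokens.
    """
    warnings = []
    token_count = len(body.split())
    if not min_tokens <= token_count <= max_tokens:
        warnings.append(
            f"length {token_count} outside {min_tokens}-{max_tokens} tokens"
        )
    if not heading.strip().endswith("?"):
        warnings.append("heading is not phrased as a question")
    return warnings


print(audit_section("How is a passage actually scored?", "word " * 300))
print(audit_section("Overview", "A short body."))
```

An empty list means the section passed these two checks only. Whether the section restates its subject and leads with the claim still needs a human read.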
Where the evidence runs out
This model is a synthesis, and intellectual honesty is part of the work. Several limits matter.
The academic benchmarks — DPR, ColBERT, BEIR, RULER, Dense X — run on QA corpora and research pipelines. They establish the mechanism class: that retrieval operates on embedded, ranked passages and that structure changes outcomes. They do not prove that any named 2026 engine — ChatGPT, Perplexity, Gemini, or Google's AI surfaces — chunks, embeds, or ranks live web pages in a specific way. The Vertex 1,024/256 default is one vendor's configuration, not a public-search-engine standard.
The position-bias gate is the subtlest. "Lost in the Middle" and RULER measure the engine's assembled window, ordered by its reranker — not your page layout. Front-loading your answer is a real, operator-controllable move, but it is a reasoned translation of the finding, not the same intervention the papers tested. Likewise, Contextual Retrieval and Late Chunking are techniques the retrieval system applies; the publisher's analogue — self-contained writing — is an inference, a strong one, but an inference.
And no source here measures "front-loading the answer leads to more AI citations" on a live public engine. The link from passage structure to citation outcomes is reasoned across primary sources, and it should be read that way: a well-grounded hypothesis about durable retrieval mechanics, not a guarantee about any one product. What the evidence does support is the spine of the argument — the unit of retrieval is the passage, not the page, and structure is the lever you actually hold.
Don’t take our word for it — measure it.
Frequently asked questions
- Does AI search retrieve whole pages or passages?
Passages. Production retrieval and RAG systems split a document into bounded chunks before anything reads it as a whole, then embed and rank those chunks as the unit. The peer-reviewed dense passage retrieval work operates entirely at passage level, and Google's own RAG Engine documents a default chunk size of 1,024 tokens. 'Getting the page indexed' is the wrong mental model; you get passages embedded.
- What is the Passage Pipeline?
It is a five-gate model of how a page becomes retrievable text: Split (the engine cuts your page into chunks), Contextualize (cutting strips the context each chunk needs to stand alone), Rank (units are scored at token and heading granularity), Select (only a top-k subset enters a bounded window), and Position (where a unit sits decides whether the model uses it). Each gate has one operator lever.
- How should I structure content so AI search can retrieve it?
Write self-contained sections that read as standalone answers, name the subject explicitly in each block so it survives being lifted off the page, lead each section with the atomic claim directly under a question-shaped heading, keep an answer inside one bounded section rather than letting it cross a section break, and front-load the decisive answer high on the page rather than burying it after a long preamble.
- Why does chunk structure matter more than authority for retrieval?
Anthropic's first-party Contextual Retrieval experiments cut the top-20 retrieval failure rate from 5.7% to 1.9%, a roughly 67% reduction, by giving each chunk enough context to stand alone and reranking the results, with no change in authority. Segmentation quality is a load-bearing variable separate from links or domain strength: a brilliant page that chunks into half-thoughts is effectively invisible.
- Do larger context windows make content structure irrelevant?
No. The RULER benchmark found that of 17 models claiming 32K-plus token windows, only half maintain satisfactory performance at 32K, and 'Lost in the Middle' shows performance degrades for information in the middle of a long context even for long-context models. The selection and position gates still bite, so tight, front-loaded units still win.
- Can I tag a passage as 'the answer' for AI search?
No. Google's documentation states you cannot mark a page as a featured snippet; its systems decide whether a page makes a good answer and elevate it automatically. Passage-level extraction is algorithmic and out of publisher control. Your only handle is structure, meaning making a unit the easiest, best-bounded candidate, not markup that forces selection.
- How is this different from how AI search chooses what to cite?
Citation selection is about which already-retrieved sentence gets attributed. The Passage Pipeline sits upstream of that: it is about whether your text becomes a candidate unit at all. A sentence is never retrieved alone; it rides inside a chunk, and whether that chunk is well-bounded, self-contained, and well-positioned decides whether the sentence is ever embedded, ranked, and placed into the window.
Sources · 18
Every claim, dated and linked
- [1]
In Google's October 2020 Search On announcement, the company introduced passage ranking and stated that sometimes the single sentence that answers a question is buried deep in a web page, and that the technology would improve 7 percent of search queries across all languages as it rolled out globally.
Google, How AI is powering a more helpful Google (Search On 2020) · 2020-10-15
- [2]
Google's ranking-systems documentation lists a 'Passage ranking system' and defines it as an AI system used to identify individual sections or 'passages' of a web page to better understand how relevant a page is to a search.
Google Search Central, A guide to Google Search ranking systems · 2025-12-10
- [3]
The Dense Passage Retrieval paper showed that a dense retriever using a simple dual-encoder over passages outperforms a strong Lucene-BM25 system by 9%-19% absolute in top-20 passage retrieval accuracy across open-domain QA datasets.
Karpukhin et al., Dense Passage Retrieval for Open-Domain Question Answering, EMNLP 2020 · 2020-04-10
- [4]
Google's Vertex AI RAG Engine documentation states that when documents are ingested into an index they are split into chunks, with a default chunk size of 1,024 tokens and a default chunk overlap of 256 tokens.
Google Cloud, Fine-tune RAG transformations (Vertex AI RAG Engine) · 2026-06-05
- [5]
The Dense X Retrieval study found that the choice of retrieval unit significantly affects retrieval and downstream performance, and that indexing a corpus by fine-grained units such as propositions significantly outperforms passage-level units in retrieval tasks.
Chen et al., Dense X Retrieval: What Retrieval Granularity Should We Use? · 2023-12-11
- [6]
Anthropic's Contextual Retrieval experiments reported that combining contextual embeddings, contextual BM25, and reranking reduced the top-20 retrieval failure rate from 5.7% to 1.9% — roughly a 67% reduction — by prepending self-contained context to each chunk before embedding.
Anthropic, Introducing Contextual Retrieval · 2024-09-19
- [7]
The Late Chunking method embeds all tokens of a long text with a long-context model first and applies chunking afterward, so that chunk embeddings capture the full surrounding context, yielding superior results across retrieval tasks.
Günther et al., Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models · 2024-09-07
- [8]
ColBERT introduced a late-interaction architecture that independently encodes the query and the document with BERT and computes relevance through a cheap token-level interaction step, making it two orders of magnitude faster with four orders of magnitude fewer FLOPs per query than prior BERT rankers at competitive effectiveness.
Khattab and Zaharia, ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT (arXiv 2004.12832)
- [9]
ColBERTv2 produces multi-vector representations at the granularity of each token and decomposes relevance modeling into scalable token-level computations, reducing the space footprint of late-interaction models by six to ten times while reaching state-of-the-art quality.
- [10]
Google's 'Context scoring adjustments for answer passages' patent (US9959315B1) describes adjusting each candidate answer passage's score by a context score derived from a heading vector that describes the path in the document's heading hierarchy from the root heading to the passage's heading.
Google, Context scoring adjustments for answer passages (US9959315B1) · 2018-05-01
- [11]
Google's 'Scoring candidate answer passages' patent (US9940367B1) describes scoring passages extracted from resources and states that a candidate answer passage will be penalized if it includes text that passes formatting boundaries such as paragraphs and section breaks.
Google, Scoring candidate answer passages (US9940367B1) · 2018-04-10
- [12]
The original Retrieval-Augmented Generation paper conditions generation on retrieved passages and compares two formulations — RAG-Sequence, which conditions on the same retrieved passages across the whole generated sequence, and RAG-Token, which can use different passages per token.
Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, NeurIPS 2020 · 2020-05-22
- [13]
The RULER long-context benchmark found that although 17 evaluated language models all claim context sizes of 32K tokens or greater, only half of them can maintain satisfactory performance at the length of 32K, with almost all degrading as context length grows.
Hsieh et al., RULER: What's the Real Context Size of Your Long-Context Language Models? · 2024-04-09
- [14]
The 'Lost in the Middle' study found that language model performance is often highest when relevant information occurs at the beginning or end of the input context and significantly degrades when models must access relevant information in the middle of long contexts, even for explicitly long-context models.
Liu et al., Lost in the Middle: How Language Models Use Long Contexts, TACL 2024 · 2023-07-06
- [15]
Google's featured-snippets documentation states that its systems automatically determine whether a page would make a good featured snippet for a search and, if so, elevate it, and that publishers cannot mark their own page as a featured snippet.
Google Search Central, Featured snippets and your website · 2025-12-10
- [16]
The BEIR benchmark evaluated 10 retrieval systems zero-shot across 18 datasets and found that BM25 is a robust baseline while re-ranking and late-interaction-based models on average achieve the best zero-shot performance, at higher computational cost.
Thakur et al., BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models (arXiv 2104.08663)
- [17]
A 2025 study on passage segmentation for extractive question answering emphasizes the critical role of chunking — how a long document is segmented into passages — in improving the performance of both dense passage retrieval and the end-to-end RAG pipeline.
Passage Segmentation of Documents for Extractive Question Answering · 2025-01-17
- [18]
A systematic comparison of retrieval-augmented generation and long-context LLMs found that when sufficiently resourced long-context can outperform RAG on average, while RAG retains a significant cost advantage, motivating a hybrid Self-Route method that routes queries between RAG and long-context.