Dense vs Token-Level Retrieval

Comparing embedding approaches for Multi-Hop Question Answering

Generated: 2026-04-15 14:12

Introduction

This experiment compares two fundamentally different approaches to text retrieval for multi-hop question answering — questions that require finding and combining information from multiple documents to answer correctly.

📦 Dense Embeddings

Compress an entire text into a single fixed-size vector (e.g., 1536 numbers). Retrieval computes the cosine similarity between question and document vectors.

Pro: Fast (one dot product per document). Con: Loses fine-grained token-level information.
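The dense scoring described above can be sketched in a few lines of NumPy. This is a minimal illustration of cosine-similarity retrieval, not any specific provider's API; the toy 4-dimensional vectors stand in for real 1536-dimensional embeddings.

```python
import numpy as np

def cosine_scores(query_vec, doc_matrix):
    """Score every document against the query with cosine similarity.

    query_vec:  (d,) dense embedding of the question
    doc_matrix: (n_docs, d) dense embeddings of the corpus
    """
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    return d @ q  # one dot product per document

# Toy example: 3 documents, 4-dimensional embeddings
docs = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0, 0.0],
                 [0.7, 0.7, 0.0, 0.0]])
query = np.array([1.0, 0.1, 0.0, 0.0])
scores = cosine_scores(query, docs)
top_k = np.argsort(-scores)[:2]  # indices of the 2 best-scoring documents
```

Because the vectors are L2-normalized up front, scoring the whole corpus is a single matrix-vector product, which is what makes dense retrieval fast.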

🧩 Token-Level Embeddings

Produce one vector per token in the text. A 100-word document generates ~120 vectors of 128 dimensions each. Scored via MaxSim: for each query token, find the best-matching document token, then sum.

Pro: Preserves fine-grained matching. Con: More expensive to compute and store.
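The MaxSim operator described above reduces to one matrix multiply plus a row-wise max. A minimal sketch (assuming both sides are already L2-normalized token embeddings, as in ColBERT-style models):

```python
import numpy as np

def maxsim_score(query_tokens, doc_tokens):
    """Late-interaction MaxSim: for each query token vector, take the max
    similarity over all document token vectors, then sum over query tokens.

    query_tokens: (n_q, d) L2-normalized token embeddings of the query
    doc_tokens:   (n_d, d) L2-normalized token embeddings of the document
    """
    sim = query_tokens @ doc_tokens.T  # (n_q, n_d) token-token similarities
    return sim.max(axis=1).sum()       # best doc token per query token, summed
```

Note the asymmetry: every query token gets matched, but a document token that matches nothing contributes zero, which is why token-level models preserve fine-grained matches a pooled vector would average away.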

How do we evaluate?

All results use exact brute force search — every question is compared against every document in the corpus. No approximate nearest neighbor indexes, no shortcuts. This gives us the true retrieval quality of each embedding model.

Recall@K

What fraction of relevant documents appear in the top-K? E.g., need 3 docs, found 2 in top-100 → Recall = 0.67

Success@K

Did we find all relevant documents in top-K? Binary: 1 or 0. The strictest metric — missing even one means failure.

MRR First

1/rank of the first relevant document found. Higher = first relevant result appears earlier.

MRR Last

1/rank of the last relevant document. Zero if any are missing. Measures how deep you must look to have everything.
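The four metrics above can all be derived from the ranks at which the gold documents appear. A sketch for a single question (function name and signature are illustrative, not from the evaluation code):

```python
def retrieval_metrics(ranked_ids, relevant_ids, k):
    """Compute Recall@K, Success@K, MRR First, and MRR Last for one question.

    ranked_ids:   document ids sorted by score, best first
    relevant_ids: set of gold supporting-document ids
    """
    top_k = ranked_ids[:k]
    # 1-based ranks at which relevant documents were found
    found = [i for i, doc in enumerate(top_k, start=1) if doc in relevant_ids]
    recall = len(found) / len(relevant_ids)
    success = int(len(found) == len(relevant_ids))
    mrr_first = 1.0 / found[0] if found else 0.0
    # MRR Last is zero unless *all* relevant docs are inside the top-k
    mrr_last = 1.0 / found[-1] if success else 0.0
    return recall, success, mrr_first, mrr_last

# Example from the text: 3 relevant docs, 2 found (at ranks 1 and 4) in top-100
r, s, mf, ml = retrieval_metrics(["a", "x", "y", "b"] + ["z"] * 96,
                                 {"a", "b", "c"}, k=100)
# r ≈ 0.67, s = 0, mf = 1.0, ml = 0.0
```

The example makes the hierarchy concrete: recall gives partial credit, success does not, and MRR Last collapses to zero the moment a single supporting document is missing.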

All metrics are evaluated at K = 5, 10, 20, 50, and 100. Results can be filtered by number of hops (2, 3, or 4 supporting documents required).

Datasets

We evaluate on two multi-hop QA datasets with very different characteristics:

MuSiQue

101,958 chunks (deduplicated Wikipedia paragraphs)

22,355 questions (multi-hop, 2-4 supporting documents each)

Chunk length: min ~25, avg ~104, max ~766 tokens

All chunks fit within any model's context window (even 512-token models).

Source: StonyBrookNLP/musique (TACL 2022)

Current results: 1,000 questions sampled (seed=42), 3 token models.

MultiHop-RAG

609 chunks (full news articles from 2023)

2,255 questions (inference, comparison, temporal)

Chunk length: min ~1,092, avg ~2,585, max ~16,585 tokens (~25× longer than MuSiQue on average)

100% of chunks exceed 512 tokens. Models with a 512-token limit truncate every single document.

Source: yixuantt/MultiHop-RAG (COLM 2024)

All 2,255 questions evaluated, all 5 token models.

Embedding Models

Dense Embeddings (single vector per text)

OpenAI text-embedding-3 DENSE

Provider: OpenAI

Models: text-embedding-3-small (256 to 1536 dimensions) and text-embedding-3-large (256 to 3072 dimensions)

Context window: 8,191 tokens

How it works: Each text is compressed into a single vector. Retrieval uses cosine similarity (dot product, since vectors are L2-normalized). One API call per text.

9 variants tested: small at 256/512/1024/1536d, large at 256/512/1024/1536/3072d

Token-Level Embeddings (one vector per token, MaxSim scoring)

512-token context

ColBERTv2 TOKEN 512

Provider: Stanford NLP (Omar Khattab)

HuggingFace: colbert-ir/colbertv2.0

Paper: ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction (NAACL 2022)

Year: 2022

Dimensions: 128 per token

Context window: 512 tokens

The original late interaction model that pioneered multi-vector retrieval. Uses BERT-base (110M parameters) with a linear projection layer that reduces each token embedding from 768 to 128 dimensions. Applies a skiplist to filter punctuation tokens from documents before indexing. Queries are padded to a fixed length of 32 tokens with [MASK] tokens that act as learned query augmentation.
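The fixed-length query padding mentioned above can be sketched as follows. This is an illustrative simplification, not the actual ColBERT code (which also prepends a special [Q] marker token before padding); `mask_id` stands for the tokenizer's [MASK] token id.

```python
def pad_query_tokens(token_ids, mask_id, max_len=32):
    """Pad (or truncate) a tokenized query to a fixed length with [MASK] ids,
    as ColBERTv2 does. The [MASK] positions receive contextual embeddings
    from BERT and act as learned query augmentation during MaxSim scoring."""
    clipped = token_ids[:max_len]
    return clipped + [mask_id] * (max_len - len(clipped))
```

Since every [MASK] position still contributes a best-match term to the MaxSim sum, the padding effectively expands short queries with model-learned context.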

AnswerAI-ColBERT TOKEN 512

Provider: Answer.AI

HuggingFace: answerdotai/answerai-colbert-small-v1

Blog: Small but Mighty: Introducing answerai-colbert-small

Year: August 2024

Dimensions: 96 per token

Context window: 512 tokens

A lightweight proof-of-concept showing that smaller ColBERT models can be highly effective. Only 33 million parameters (vs 110M for ColBERTv2). Uses 96-dimensional token embeddings (vs 128 for others), reducing storage and compute by 25%. Despite its small size, it outperforms the original ColBERTv2 on standard benchmarks.

8192-token context

Jina-ColBERT-v2 TOKEN 8192

Provider: Jina AI

HuggingFace: jinaai/jina-colbert-v2

Paper: Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever

Year: August 2024

Dimensions: 128 per token

Context window: 8,192 tokens

Multilingual late interaction retriever based on XLM-RoBERTa, supporting 89 languages. Trained with Matryoshka Representation Learning, allowing flexible output dimensions (128, 96, or 64) at inference time without retraining. Reduces storage requirements by up to 50% compared to previous models while maintaining strong cross-lingual performance.
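The Matryoshka dimension reduction described above amounts to keeping the leading components of each token vector and re-normalizing. A minimal NumPy sketch (function name is illustrative; the model's supported output sizes are 128, 96, and 64):

```python
import numpy as np

def truncate_matryoshka(token_vecs, dim):
    """Shrink Matryoshka-trained token embeddings to a smaller dimension by
    keeping the first `dim` components and re-normalizing each vector.
    Works at inference time without retraining."""
    cut = token_vecs[:, :dim]
    return cut / np.linalg.norm(cut, axis=1, keepdims=True)

# e.g. 10 token vectors at 128d reduced to 64d
tokens_128 = np.random.default_rng(0).normal(size=(10, 128))
tokens_64 = truncate_matryoshka(tokens_128, 64)
```

Halving the per-token dimension halves MaxSim index storage, which is where the up-to-50% storage reduction comes from.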

GTE-ModernColBERT TOKEN 8192

Provider: LightOn AI

HuggingFace: lightonai/GTE-ModernColBERT-v1

Blog: LightOn release announcement

Year: May 2025

Dimensions: 128 per token

Context window: 8,192 tokens

State-of-the-art general-purpose retrieval model built on ModernBERT (Alibaba-NLP). Trained on MS MARCO using the PyLate library. Achieves top scores on the BEIR benchmark for general retrieval. The extended 8,192-token context window makes it suitable for long documents without truncation. Variable-length query output (no fixed padding like ColBERTv2).

Reason-ModernColBERT TOKEN 8192

Provider: LightOn AI

HuggingFace: lightonai/Reason-ModernColBERT

Blog: LightOn release announcement

Year: May 2025

Dimensions: 128 per token

Context window: 8,192 tokens

Finetuned from GTE-ModernColBERT on the reasonir-hq dataset, specifically optimized for reasoning-intensive retrieval tasks (multi-hop QA, complex queries). Achieves SOTA on the BRIGHT benchmark, outperforming all models up to 7 billion parameters, including ReasonIR-8B (45× its size). Uses a longer query context of 128 tokens to handle complex reasoning queries.

Cross-Dataset Comparison

How does each model perform across both datasets? Filter by hop count, metric, and K value.

Key Finding: Reason-ModernColBERT flips from worst on MuSiQue (Recall@100 = 0.67) to best on MultiHop-RAG (0.996). The explanation is context length: MuSiQue chunks average 104 tokens (all models see everything), but MultiHop-RAG chunks average 2,585 tokens. Models limited to 512 tokens (ColBERTv2, AnswerAI) lose >80% of each article, while 8192-token models process them nearly in full. Use the hop tabs above to see how this gap widens with more required supporting documents.

MuSiQue Results

102K chunks, 1,000 questions (seed=42). Short Wikipedia paragraphs (~104 tokens).

Note: Uses 3 of 5 token models (AnswerAI and Jina pending). Results may change when all models are included.

Full Results Table (click to expand)
Key Finding: Dense large-3072 leads in recall (0.83) and success (0.64), but GTE-ModernColBERT achieves the best MRR First (0.78) — it finds the first relevant chunk faster than any other model. For dense embeddings, more dimensions consistently help: large-3072 > large-1536 > large-1024. The large model also outperforms small at the same dimension. Switch to the 4-hop tab to see success drop to 0.18 — finding all 4 chunks among 102K is very hard.

MultiHop-RAG Results

609 chunks, 2,255 questions (all). Long news articles (~2,585 tokens avg).

Full Results Table (click to expand)
Key Finding: Reason-ModernColBERT achieves near-perfect retrieval (Recall@100 = 0.996, Success@100 = 0.989) and the best MRR First (0.77). Dense large-3072 is close (0.990 recall) — its 8191-token context handles most articles. The real losers are 512-token models: ColBERTv2 drops to 0.91 recall, AnswerAI to 0.92. With only 609 chunks the corpus is "easy" — even 4-hop Success@100 stays above 0.97. The more interesting comparison is on MuSiQue where the 102K corpus makes retrieval genuinely hard.