Comparing Embedding Approaches for Multi-Hop Question Answering
This experiment compares two fundamentally different approaches to text retrieval for multi-hop question answering — questions that require finding and combining information from multiple documents to answer correctly.
Single-vector (dense) embeddings compress an entire text into one fixed-size vector (e.g., 1,536 numbers). Retrieval computes the cosine similarity between the question vector and each document vector.
Pro: Fast (one dot product per document). Con: Loses fine-grained token-level information.
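The single-vector scoring above can be sketched in a few lines. This is illustrative only: the random vectors stand in for real embedding-model output.

```python
import numpy as np

# Toy corpus: 5 documents, one 1536-dim vector each (random stand-ins
# for real embedding-model output), plus one query vector.
rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(5, 1536))
query_vec = rng.normal(size=(1536,))

# L2-normalize so cosine similarity reduces to a plain dot product
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)
query_vec /= np.linalg.norm(query_vec)

scores = doc_vecs @ query_vec      # one dot product per document
top_k = np.argsort(-scores)[:3]    # indices of the top-3 documents
```

Note that scoring the whole corpus is a single matrix-vector product, which is why this approach is fast even with exact brute-force search.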
Late interaction (multi-vector) models produce one vector per token in the text. A 100-word document generates ~120 vectors of 128 dimensions each. Scoring uses MaxSim: for each query token, take the similarity of its best-matching document token, then sum over query tokens.
Pro: Preserves fine-grained matching. Con: More expensive to compute and store.
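A minimal MaxSim sketch, again using random stand-in token vectors rather than real ColBERT output:

```python
import numpy as np

def maxsim(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Late-interaction MaxSim score: for each query token, take the
    similarity of its best-matching document token, then sum over query
    tokens. Token vectors are assumed L2-normalized, so dot = cosine."""
    sim = query_tokens @ doc_tokens.T     # (n_query, n_doc) similarity matrix
    return float(sim.max(axis=1).sum())   # best doc token per query token, summed

# Toy stand-ins: a 4-token query vs. a 120-token document at 128 dims
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 128));   q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(120, 128)); d /= np.linalg.norm(d, axis=1, keepdims=True)
score = maxsim(q, d)
```

The full (n_query × n_doc) similarity matrix per document pair is what makes late interaction more expensive to compute than one dot product per document.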
All results use exact brute force search — every question is compared against every document in the corpus. No approximate nearest neighbor indexes, no shortcuts. This gives us the true retrieval quality of each embedding model.
Recall@K: what fraction of the relevant documents appear in the top K? E.g., 3 docs needed, 2 found in the top 100 → recall = 0.67.
Full recall@K: did we find all relevant documents in the top K? Binary (1 or 0) and the strictest metric: missing even one document counts as failure.
MRR: 1/rank of the first relevant document found. Higher means the first relevant result appears earlier.
Last-relevant reciprocal rank: 1/rank of the last relevant document, or zero if any are missing. Measures how deep you must look to find everything.
All metrics are evaluated at K = 5, 10, 20, 50, and 100. Results can be filtered by number of hops (2, 3, or 4 supporting documents required).
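The four metrics can be written as plain functions. The names below are our own labels, not the evaluation code's; `ranked` is a duplicate-free ranking of document IDs, `relevant` the gold supporting set.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of relevant documents that appear in the top-k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def full_recall_at_k(ranked, relevant, k):
    """1 if every relevant document is in the top-k, else 0."""
    return int(relevant <= set(ranked[:k]))

def mrr(ranked, relevant):
    """Reciprocal rank of the first relevant document (0 if none found)."""
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1 / i
    return 0.0

def last_relevant_rr(ranked, relevant):
    """Reciprocal rank of the last relevant document; 0 if any are missing."""
    positions = [i for i, doc in enumerate(ranked, start=1) if doc in relevant]
    if len(positions) < len(relevant):
        return 0.0
    return 1 / max(positions)
```

For the worked example above (3 relevant docs, 2 found), `recall_at_k` returns 2/3 while `full_recall_at_k` and `last_relevant_rr` both return 0.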
We evaluate on two multi-hop QA datasets with very different characteristics:
MuSiQue: 101,958 chunks (deduplicated Wikipedia paragraphs)
22,355 questions (multi-hop, 2-4 supporting documents each)
Chunk length: min ~25, avg ~104, max ~766 tokens
All chunks fit within any model's context window (even 512-token models).
Source: StonyBrookNLP/musique (TACL 2022)
Current results: 1,000 questions sampled (seed=42), 3 token models.
MultiHop-RAG: 609 chunks (full news articles from 2023)
2,255 questions (inference, comparison, temporal)
Chunk length: min ~1,092, avg ~2,585, max ~16,585 tokens (about 25× longer than MuSiQue on average)
100% of chunks exceed 512 tokens. Models with a 512-token limit truncate every single document.
Source: yixuantt/MultiHop-RAG (COLM 2024)
All 2,255 questions evaluated, all 5 token models.
Provider: OpenAI
Models: text-embedding-3-small (256 to 1536 dimensions) and text-embedding-3-large (256 to 3072 dimensions)
Context window: 8,191 tokens
How it works: Each text is compressed into a single vector. Retrieval uses cosine similarity (dot product, since vectors are L2-normalized). One API call per text.
9 variants tested: small at 256/512/1024/1536d, large at 256/512/1024/1536/3072d
Provider: Stanford NLP (Omar Khattab)
HuggingFace: colbert-ir/colbertv2.0
Paper: ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction (NAACL 2022)
Released: 2022
Dimensions: 128 per token
Context window: 512 tokens
The original late interaction model that pioneered multi-vector retrieval. Uses BERT-base (110M parameters) with a linear projection layer that reduces each token embedding from 768 to 128 dimensions. Applies a skiplist to filter punctuation tokens from documents before indexing. Queries are padded to a fixed length of 32 tokens with [MASK] tokens that act as learned query augmentation.
Provider: Answer.AI
HuggingFace: answerdotai/answerai-colbert-small-v1
Blog: Small but Mighty: Introducing answerai-colbert-small
Released: August 2024
Dimensions: 96 per token
Context window: 512 tokens
A lightweight proof-of-concept showing that smaller ColBERT models can be highly effective. Only 33 million parameters (vs 110M for ColBERTv2). Uses 96-dimensional token embeddings (vs 128 for others), reducing storage and compute by 25%. Despite its small size, it outperforms the original ColBERTv2 on standard benchmarks.
Provider: Jina AI
HuggingFace: jinaai/jina-colbert-v2
Paper: Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever
Released: August 2024
Dimensions: 128 per token
Context window: 8,192 tokens
Multilingual late interaction retriever based on XLM-RoBERTa, supporting 89 languages. Trained with Matryoshka Representation Learning, allowing flexible output dimensions (128, 96, or 64) at inference time without retraining. Reduces storage requirements by up to 50% compared to previous models while maintaining strong cross-lingual performance.
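The truncation that Matryoshka training enables is simple at inference time. A sketch (not Jina's actual API): keep a prefix of each token vector's dimensions and re-normalize.

```python
import numpy as np

def truncate_matryoshka(token_vecs: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` dimensions of each token vector and
    re-normalize, so dot products remain cosine similarities."""
    kept = token_vecs[:, :dim]
    return kept / np.linalg.norm(kept, axis=1, keepdims=True)

# 120 token vectors at the full 128 dims (random stand-ins), cut to 64
rng = np.random.default_rng(0)
full = rng.normal(size=(120, 128))
small = truncate_matryoshka(full, 64)   # 64 dims -> 50% of the storage
```

Because the model is trained so that leading dimensions carry most of the signal, the truncated vectors remain usable for MaxSim scoring without retraining.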
Provider: LightOn AI
HuggingFace: lightonai/GTE-ModernColBERT-v1
Blog: LightOn release announcement
Released: May 2025
Dimensions: 128 per token
Context window: 8,192 tokens
State-of-the-art general-purpose retrieval model built on ModernBERT (Alibaba-NLP). Trained on MS MARCO using the PyLate library. Achieves top scores on the BEIR benchmark for general retrieval. The extended 8,192-token context window makes it suitable for long documents without truncation. Variable-length query output (no fixed padding like ColBERTv2).
Provider: LightOn AI
HuggingFace: lightonai/Reason-ModernColBERT
Blog: LightOn release announcement
Released: May 2025
Dimensions: 128 per token
Context window: 8,192 tokens
Finetuned from GTE-ModernColBERT on the reasonir-hq dataset, specifically optimized for reasoning-intensive retrieval tasks (multi-hop QA, complex queries). Achieves SOTA on the BRIGHT benchmark, outperforming all models up to 7 billion parameters, including ReasonIR-8B (45× its size). Uses a longer query context of 128 tokens to handle complex reasoning queries.
How does each model perform across both datasets? Filter by hop count, metric, and K value.
102K chunks, 1,000 questions (seed=42). Short Wikipedia paragraphs (~104 tokens).
Note: Uses 3 of 5 token models (AnswerAI and Jina pending). Results may change when all models are included.
609 chunks, 2,255 questions (all). Long news articles (~2,585 tokens avg).