Interactive experiment reports and benchmarks
Retrieval Benchmark April 2026
Compares 14 embedding models (9 OpenAI dense + 5 ColBERT-style token-level) across two multi-hop QA datasets: MuSiQue (102K short Wikipedia paragraphs) and MultiHop-RAG (609 long news articles). Evaluates Recall, Success, and MRR at multiple K values using exact brute-force search. Results are filterable by hop count (2, 3, or 4).
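The three metrics above can be sketched as follows; this is a minimal illustration of the standard definitions (function names and the example IDs are made up for this sketch, not taken from the benchmark's code):

```python
# Hypothetical sketch of the reported metrics, computed per question from
# the ranked list of retrieved passage IDs and the gold supporting IDs.

def recall_at_k(retrieved, gold, k):
    """Fraction of gold (supporting) passages found in the top-k results."""
    return len(set(retrieved[:k]) & set(gold)) / len(gold)

def success_at_k(retrieved, gold, k):
    """1.0 if ALL gold passages appear in the top-k results, else 0.0."""
    return 1.0 if set(gold) <= set(retrieved[:k]) else 0.0

def mrr_at_k(retrieved, gold, k):
    """Reciprocal rank of the first gold passage within the top-k (0 if none)."""
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in gold:
            return 1.0 / rank
    return 0.0

# Example: a 2-hop question with two supporting paragraphs.
retrieved = ["p7", "p2", "p9", "p4", "p1"]
gold = ["p2", "p4"]
print(recall_at_k(retrieved, gold, 3))   # 0.5  (only p2 is in the top-3)
print(success_at_k(retrieved, gold, 5))  # 1.0  (both gold IDs in the top-5)
print(mrr_at_k(retrieved, gold, 5))      # 0.5  (first gold hit at rank 2)
```

For multi-hop questions the distinction matters: Recall credits partial evidence, while Success only counts when every supporting passage is retrieved.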
ArangoDB Retrieval April 2026
Quantifies the accuracy loss and timing overhead of ArangoDB's APPROX_NEAR_COSINE (IVF vector index) compared to exact brute-force cosine similarity. Tests 8 embedding variants on 6,655 multi-hop questions with nLists=3,192 and nProbe=159. Measures network RTT separately to isolate index computation time.
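The accuracy trade-off the report measures can be sketched with a toy IVF search in NumPy. This is an illustration only: the parameters are far smaller than the report's nLists=3,192 / nProbe=159, centroids are picked by random sampling rather than trained, and none of this reflects ArangoDB's actual index internals. Vectors are bucketed by nearest centroid (nLists lists); a query scans only the nProbe closest lists, so it can miss true neighbors that exact brute-force search would find.

```python
import numpy as np

# Illustrative parameters (assumed, much smaller than the report's setup).
rng = np.random.default_rng(0)
n, dim, n_lists, n_probe, k = 2000, 32, 64, 4, 10

vecs = rng.normal(size=(n, dim))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)      # unit vectors: dot = cosine

centroids = vecs[rng.choice(n, n_lists, replace=False)]  # crude stand-in for training
assign = np.argmax(vecs @ centroids.T, axis=1)           # nearest list per vector

def exact_topk(q):
    """Brute-force cosine top-k over all vectors (the report's ground truth)."""
    return np.argsort(vecs @ q)[::-1][:k]

def ivf_topk(q):
    """Approximate top-k: exact search restricted to the n_probe closest lists."""
    lists = np.argsort(centroids @ q)[::-1][:n_probe]
    cand = np.flatnonzero(np.isin(assign, lists))
    order = np.argsort(vecs[cand] @ q)[::-1][:k]
    return cand[order]

q = rng.normal(size=dim)
q /= np.linalg.norm(q)
overlap = len(set(exact_topk(q)) & set(ivf_topk(q))) / k
print(f"recall@{k} vs exact: {overlap:.2f}")             # < 1.0 when nProbe misses lists
```

Averaging this per-query overlap against exact results over the 6,655 questions is the kind of accuracy-loss figure the report tabulates.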