Interactive experiment reports and benchmarks
Retrieval Benchmark April 2026
Compares 14 embedding models (9 OpenAI dense + 5 ColBERT-style token-level) across two multi-hop QA datasets: MuSiQue (102K short Wikipedia paragraphs) and MultiHop-RAG (609 long news articles). Evaluates Recall, Success, and MRR at multiple K values using exact brute-force search. Results are filterable by hop count (2, 3, or 4).
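The three metrics above can be sketched as follows; this is a minimal illustration of the standard definitions (function names and the example IDs are made up for this sketch, not taken from the benchmark's code):

```python
# Hypothetical sketch of the reported metrics, computed per question from
# the ranked list of retrieved passage IDs and the gold supporting IDs.

def recall_at_k(retrieved, gold, k):
    """Fraction of gold (supporting) passages found in the top-k results."""
    return len(set(retrieved[:k]) & set(gold)) / len(gold)

def success_at_k(retrieved, gold, k):
    """1.0 if ALL gold passages appear in the top-k results, else 0.0."""
    return 1.0 if set(gold) <= set(retrieved[:k]) else 0.0

def mrr_at_k(retrieved, gold, k):
    """Reciprocal rank of the first gold passage within the top-k (0 if none)."""
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in gold:
            return 1.0 / rank
    return 0.0

# Example: a 2-hop question with two supporting paragraphs.
retrieved = ["p7", "p2", "p9", "p4", "p1"]
gold = ["p2", "p4"]
print(recall_at_k(retrieved, gold, 3))   # 0.5  (only p2 is in the top-3)
print(success_at_k(retrieved, gold, 5))  # 1.0  (both gold IDs in the top-5)
print(mrr_at_k(retrieved, gold, 5))      # 0.5  (first gold hit at rank 2)
```

For multi-hop questions the distinction matters: Recall credits partial evidence, while Success only counts when every supporting passage is retrieved.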
ArangoDB Retrieval April 2026
Quantifies the accuracy loss and timing overhead of ArangoDB's APPROX_NEAR_COSINE (IVF vector index) compared to exact brute-force cosine similarity. Tests 8 embedding variants on 6,655 multi-hop questions with nLists=3,192 and nProbe=159. Measures network RTT separately to isolate index computation time.
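The accuracy trade-off the report measures can be sketched with a toy IVF search in NumPy. This is an illustration only: the parameters are far smaller than the report's nLists=3,192 / nProbe=159, centroids are picked by random sampling rather than trained, and none of this reflects ArangoDB's actual index internals. Vectors are bucketed by nearest centroid (nLists lists); a query scans only the nProbe closest lists, so it can miss true neighbors that exact brute-force search would find.

```python
import numpy as np

# Illustrative parameters (assumed, much smaller than the report's setup).
rng = np.random.default_rng(0)
n, dim, n_lists, n_probe, k = 2000, 32, 64, 4, 10

vecs = rng.normal(size=(n, dim))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)      # unit vectors: dot = cosine

centroids = vecs[rng.choice(n, n_lists, replace=False)]  # crude stand-in for training
assign = np.argmax(vecs @ centroids.T, axis=1)           # nearest list per vector

def exact_topk(q):
    """Brute-force cosine top-k over all vectors (the report's ground truth)."""
    return np.argsort(vecs @ q)[::-1][:k]

def ivf_topk(q):
    """Approximate top-k: exact search restricted to the n_probe closest lists."""
    lists = np.argsort(centroids @ q)[::-1][:n_probe]
    cand = np.flatnonzero(np.isin(assign, lists))
    order = np.argsort(vecs[cand] @ q)[::-1][:k]
    return cand[order]

q = rng.normal(size=dim)
q /= np.linalg.norm(q)
overlap = len(set(exact_topk(q)) & set(ivf_topk(q))) / k
print(f"recall@{k} vs exact: {overlap:.2f}")             # < 1.0 when nProbe misses lists
```

Averaging this per-query overlap against exact results over the 6,655 questions is the kind of accuracy-loss figure the report tabulates.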