Roni Itkin, Noam Issachar, Yehonatan Keypur, Anpei Chen, Sagie Benaim
4/15/2026
DR^3-Eval: Towards Realistic and Reproducible Deep Research Evaluation
TL;DR
DR^3-Eval is a new benchmark for evaluating deep research agents on complex multi-step tasks using a static corpus that simulates web complexity while remaining reproducible. It introduces a five-dimensional evaluation framework (Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, Depth Quality) and reveals critical failure modes in retrieval robustness and hallucination control.
- New reproducible benchmark (DR^3-Eval) for evaluating deep research agents on multimodal report generation tasks
- Five-dimensional evaluation framework, validated for alignment with human judgment, scoring Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, and Depth Quality (a minimal scoring sketch follows this list)
- State-of-the-art language models show critical gaps: retrieval robustness failures and hallucination control issues
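To make the framework concrete, here is a minimal Python sketch of how per-report scores along the five dimensions might be recorded and aggregated. The field names mirror the paper's dimensions, but the class itself, the 0-1 score scale, and the equal-weight aggregation are illustrative assumptions, not DR^3-Eval's published scoring rule.

```python
from dataclasses import dataclass, fields

@dataclass
class DR3Score:
    """Hypothetical per-report score record for the five DR^3-Eval dimensions."""
    information_recall: float     # coverage of key facts from the corpus
    factual_accuracy: float       # correctness of the claims in the report
    citation_coverage: float      # fraction of claims backed by citations
    instruction_following: float  # adherence to the task instructions
    depth_quality: float          # analytical depth of the generated report

    def aggregate(self) -> float:
        """Unweighted mean over the five dimensions (assumed weighting)."""
        values = [getattr(self, f.name) for f in fields(self)]
        return sum(values) / len(values)

# Example: score one generated report (values are made up).
report_score = DR3Score(
    information_recall=0.82,
    factual_accuracy=0.91,
    citation_coverage=0.67,
    instruction_following=0.95,
    depth_quality=0.74,
)
print(f"aggregate score: {report_score.aggregate():.3f}")
```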