Roni Itkin, Noam Issachar, Yehonatan Keypur, Anpei Chen, Sagie Benaim
4/15/2026

DR^3-Eval: Towards Realistic and Reproducible Deep Research Evaluation

TL;DR

DR^3-Eval is a new benchmark for evaluating deep research agents on complex multi-step tasks using a static corpus that simulates web complexity while remaining reproducible. It introduces a five-dimensional evaluation framework (Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, Depth Quality) and reveals critical failure modes in retrieval robustness and hallucination control.

  • New reproducible benchmark (DR^3-Eval) for evaluating deep research agents on multimodal report generation tasks
  • A five-dimensional evaluation framework, validated against human judgment, scores reports on Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, and Depth Quality (a rough scoring sketch follows this list)
  • State-of-the-art language models show critical gaps: failures in retrieval robustness and weak hallucination control
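
The summary does not state how the five dimensions are combined into an overall score. As a rough illustration only, a report-level score could be a weighted mean over the five dimensions; the 0-to-1 scale, the equal default weights, and the `DimensionScores`/`aggregate` names below are assumptions, not the benchmark's actual implementation:

```python
from dataclasses import dataclass

# Hypothetical aggregation of DR^3-Eval's five dimensions into one score.
# The dimension names come from the benchmark; the 0-1 scale and equal
# default weighting are illustrative assumptions, not the paper's method.

@dataclass
class DimensionScores:
    information_recall: float      # e.g., fraction of required facts retrieved
    factual_accuracy: float        # e.g., fraction of claims verified against the corpus
    citation_coverage: float      # e.g., fraction of claims backed by a citation
    instruction_following: float
    depth_quality: float

def aggregate(scores: DimensionScores,
              weights: dict[str, float] | None = None) -> float:
    """Weighted mean over the five dimensions (equal weights by default)."""
    values = vars(scores)  # field name -> score, in field order
    if weights is None:
        weights = {name: 1.0 for name in values}
    total_weight = sum(weights[name] for name in values)
    return sum(weights[name] * v for name, v in values.items()) / total_weight

if __name__ == "__main__":
    report = DimensionScores(0.82, 0.91, 0.74, 0.88, 0.69)
    print(f"aggregate score: {aggregate(report):.3f}")
```

In practice each per-dimension score would come from its own grader (checked against human annotations, per the bullet above); only the aggregation step is sketched here.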
