OpenAI's New ROSALIND Is Now Performing At Human Level

ToolSimulator: scalable tool testing for AI agents

Epic Games launches AI conversations tool for Fortnite creators to build dynamic NPCs

How robots learn: A brief, contemporary history
RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework
RAD-2 combines diffusion-based trajectory generation with RL-optimized reranking for autonomous driving motion planning. The framework achieves a 56% reduction in collision rate through temporal consistency in policy optimization and structured feedback mechanisms. Real-world deployment validates improved safety and smoothness in urban traffic scenarios.
DR^{3}-Eval: Towards Realistic and Reproducible Deep Research Evaluation
DR^3-Eval is a new benchmark for evaluating deep research agents on complex multi-step tasks using a static corpus that simulates web complexity while remaining reproducible. It introduces a five-dimensional evaluation framework (Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, Depth Quality) and reveals critical failure modes in retrieval robustness and hallucination control.
Omnichannel ordering with Amazon Bedrock AgentCore and Amazon Nova 2 Sonic
Why having “humans in the loop” in an AI war is an illusion
MIT Technology Review examines the legal and operational tension between Anthropic and the Pentagon over AI deployment in warfare, arguing that "humans in the loop" safeguards may be illusory as AI systems increasingly operate autonomously in real combat scenarios, particularly in the Iran conflict.
Agentic AI costs more than you budgeted. Here's why.
Agentic AI deployments often exceed budgets because teams focus on development costs while overlooking operating expenses: token usage, governance, evaluation infrastructure, security, and scaling all compound rapidly. Most enterprises don't model these hidden costs until they're already absorbing them in production. Accurate ROI requires forecasting the full total cost of ownership, not just initial build.

OpenAI updates Codex with desktop control and memory; intensifies competition with Claude Code
Tesla brings its robotaxi service to Dallas and Houston
Tesla expanded its driverless robotaxi service to Dallas and Houston, which join Austin to bring its Texas coverage to three cities. The company began offering fully autonomous rides without safety drivers in January 2026. This geographic expansion signals Tesla's confidence in scaling autonomous ride-hailing infrastructure.

From hours to minutes: How Agentic AI gave marketers time back for what matters
Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents
Hugging Face publishes technical analysis of VAKRA, an agent framework examining reasoning patterns, tool-use capabilities, and failure modes. The post provides insights into how agents handle complex tasks and where they break down.
The next evolution of the Agents SDK
OpenAI released an updated Agents SDK with native sandbox execution and a model-native harness for building secure, long-running agents. The enhancement enables safer integration with files and external tools. This update targets developers building multi-step AI workflows with improved execution safety.
AI Weekly Issue #484: Legal risks, robotics, and developer tools—quick hits
AI Weekly Issue 484 covers regulatory risk, robotics adoption, and developer tools. Key stories: AI chat logs can be used as legal evidence in court, creating compliance concerns; Chery's humanoid robot at $42K signals automotive's robotics pivot, with pricing expected to halve within a year; and Anthropic's Claude Code Routines gained strong developer adoption (686 HN points) for automating repetitive workflows.
Hightouch reaches $100M ARR fueled by marketing tools powered by AI
Hightouch, a data integration platform for marketing teams, reached $100M in annual recurring revenue after growing ARR by $70M over just 20 months, fueled by its newly launched AI agent platform. The milestone reflects surging enterprise demand for AI-powered marketing automation. Hightouch's rapid growth demonstrates how AI-driven tools are becoming essential infrastructure for modern marketing operations.
Gitar, a startup that uses agents to secure code, emerges from stealth with $9 million
Gitar, a startup using AI agents to review code, including code generated by AI systems, emerged from stealth with $9 million in funding. The company tackles the growing challenge of securing AI-generated code alongside traditionally written code. As AI code generation becomes more prevalent in development workflows, automated agent-based review solutions offer timely security assessment.
How to Implement Tool Calling with Gemma 4 and Python
Tutorial on implementing tool calling with Gemma 4, an open-weights model, using Python. Covers practical setup and integration patterns for developers building with open-source LLMs. Relevant for teams evaluating alternatives to closed-model APIs.


