Benchmarking LLM Agents on Real-World Enterprise Tasks

Benchmark Report

Top of the CLASS: Benchmarking LLM Agents on Real-World Enterprise Tasks

Authors

Michael Wornow, PhD Student at Stanford University

Vaishnav Garodia, M.S. Computer Science at Stanford University

Vasilis Vassalos, Sr. Director of AI/ML | Aisera

Utkarsh Contractor, Field CTO | Aisera

Aisera’s AI Agents, built using domain-specific LLMs, set the standard, outperforming general-purpose models in real-world use cases!

Models compared: Aisera Agents, GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro

AI agents are reshaping enterprise work. As LLMs gain industry focus, how can organizations identify the best fit for their use case or domain? Traditional evaluation methods often rely on synthetic data or artificial scenarios, failing to address real-world use cases across domains such as IT, HR, and more.

Download this report to:

Learn about the CLASSic framework for agentic AI benchmarking
Get a comprehensive report with results and analysis
Understand the implications and tradeoffs for implementing agentic AI

Coming soon: algorithms and datasets to evaluate the AI agents of your choice.


Accepted as a conference paper at ICLR 2025