Agent Arena: Causal Evaluation of Agents in the Real World Agents are increasingly doing real work. The resulting task distribution has greatly expanded. We desire an agent evaluation that scales along with usage and capability.
New Categories for Web Development in Code Arena AI coding models are increasingly used to build web apps, but aggregated leaderboards obscure key performance differences. After analyzing 250k+ Code Arena prompts, we identified major front-end task categories and built new leaderboard views to compare model strengths and weaknesses.
Supporting Independent Research in AI Evaluation Arena’s Academic Partnerships Program provides funding and support for independent research advancing the scientific foundations of AI evaluation.