Flutterby™! : TheAgentCompany 2025-07-02 16:50:27.903117+02

TheAgentCompany

2025-07-02 16:50:27.903117+02 by Dan Lyke

The Register: AI agents get office tasks wrong around 70% of the time, and a lot of them aren't AI at all

According to Gartner, many agents are fiction without the science. "Many vendors are contributing to the hype by engaging in 'agent washing' – the rebranding of existing products, such as AI assistants, robotic process automation (RPA) and chatbots, without substantial agentic capabilities," the firm says. "Gartner estimates only about 130 of the thousands of agentic AI vendors are real."

Which, of course, duh, but mostly this is about TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

We build a self-contained environment with internal web sites and data that mimics a small software company environment, and create a variety of tasks that may be performed by workers in such a company. We test baseline agents powered by both closed API-based and open-weights language models (LMs), and find that the most competitive agent can complete 30% of tasks autonomously. This paints a nuanced picture on task automation with LM agents--in a setting simulating a real workplace, a good portion of simpler tasks could be solved autonomously, but more difficult long-horizon tasks are still beyond the reach of current systems. We release code, data, environment, and experiments on this https URL.

https://doi.org/10.48550/arXiv.2412.14161

Comment policy

We will not edit your comments. However, we may delete your comments, or cause them to be hidden behind another link, if we feel they detract from the conversation. Commercial plugs are fine, if they are relevant to the conversation, and if you don't try to pretend to be a consumer. Annoying endorsements will be deleted if you're lucky, if you're not a whole bunch of people smarter and more articulate than you will ridicule you, and we will leave such ridicule in place.

Flutterby™ is a trademark claimed by
