
Trase Tops GAIA Leaderboard

Outperforming AI agent systems by Meta, Microsoft, and others in worldwide competition

Written by: Joe Laws, Co-Founder & CEO, Trase

September 27, 2024

It’s official: Trase is Number 1!

For three weeks running, Trase has been at the top of the leaderboard for GAIA (General AI Assistants), a benchmark hosted on Hugging Face that measures the real-world abilities of AI agents, such as reasoning, multi-modality handling, general tool-use proficiency, and web browsing. In that time, Trase has outperformed agent systems developed by the likes of Meta, Microsoft, Hugging Face, and well-funded startups including Baichuan.

When we first entered the competition, we were confident in our technology and its capabilities – even as we prepared to go toe-to-toe against major players in the field of artificial intelligence. So, leapfrogging to the top spot of GAIA’s leaderboard with our first submission was not only gratifying, it confirmed what we already knew: We’re onto something big at Trase.

So, what is GAIA anyway?

GAIA’s evaluation consists of 450+ non-trivial questions, each with a single unambiguous answer, which lets entrants demonstrate the effectiveness and sophistication of their technologies in a straightforward, ranked manner. Hosted on the machine learning platform Hugging Face, GAIA gives AI agents questions that cannot be easily answered from the large web scrapes typically used to train large language models (LLMs). Solving each question requires a different level of tooling and autonomy. In addition, some questions come with attached files that reflect real-world use cases, such as images and audio, and that range in type from Word, Excel, and PowerPoint to PDF, JSON, XML, CSV, and Zip, among others.
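Because each question has a single unambiguous answer, scoring can be reduced to comparing a submitted answer with a reference answer and counting matches. The sketch below illustrates that idea only; it is not GAIA’s actual scorer. The `normalize` rule, the function names, and the sample data are all assumptions made for this example.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (an assumed, simplified rule)."""
    return " ".join(answer.strip().lower().split())


def success_rate(predictions: dict, references: dict) -> float:
    """Return the percentage of reference questions answered correctly."""
    if not references:
        return 0.0
    correct = sum(
        1
        for qid, ref in references.items()
        if normalize(predictions.get(qid, "")) == normalize(ref)
    )
    return 100.0 * correct / len(references)


# Hypothetical data, not taken from the benchmark.
references = {"q1": "Paris", "q2": "42", "q3": "blue whale"}
predictions = {"q1": " paris ", "q2": "41", "q3": "Blue  Whale"}
print(f"{success_rate(predictions, references):.2f}%")
```

A missing prediction counts as a wrong answer here, which keeps the denominator fixed at the number of reference questions.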

To avoid the common pitfalls of LLM evaluations, GAIA poses challenging, multi-step questions that are non-gameable and not easily brute-forced.

Sample Question:

“I thought we could try a fun word puzzle together. I’ve got a Boggle board here: ABRL EITE IONS FPEI (making words out of these letters). I’d like to know the longest word that can be generated from the board. Please find the longest English language word that can be generated from this board. If more than one word of the same length exists at the maximum word length, please report the longest word that comes first, alphabetically. Oh, and I know that there might be different wordlists available for Boggle, so let’s please just use the words_alpha dictionary found at https://github.com/dwyl/english-words as the dictionary for our game.”
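A question like this is easier to answer by writing and running code than by reasoning in text alone. The sketch below shows one way a solver could be structured: a depth-first search over adjacent cells, pruned with a set of word prefixes. It assumes standard Boggle rules (diagonal moves allowed, no cell reused within a word) and uses a tiny stand-in dictionary rather than the full words_alpha list, so it does not answer the sample question itself. All function names are illustrative.

```python
def build_prefixes(words):
    """Collect every prefix of every word so dead-end paths can be pruned."""
    return {word[:i] for word in words for i in range(1, len(word) + 1)}


def find_words(board, words):
    """Return all dictionary words traceable through adjacent, non-reused cells."""
    words = set(words)
    prefixes = build_prefixes(words)
    rows, cols = len(board), len(board[0])
    found = set()

    def dfs(r, c, path, visited):
        path += board[r][c]
        if path not in prefixes:
            return  # no dictionary word starts with this path
        if path in words:
            found.add(path)
        visited.add((r, c))
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                inside = 0 <= nr < rows and 0 <= nc < cols
                if (dr or dc) and inside and (nr, nc) not in visited:
                    dfs(nr, nc, path, visited)
        visited.remove((r, c))

    for r in range(rows):
        for c in range(cols):
            dfs(r, c, "", set())
    return found


def longest_word(board, words):
    """Longest word found; ties are broken alphabetically. None if nothing is found."""
    found = find_words(board, words)
    return min(found, key=lambda w: (-len(w), w)) if found else None


board = [row.lower() for row in "ABRL EITE IONS FPEI".split()]
# Stand-in dictionary for illustration; a real run would load the full word list.
sample_words = {"bite", "note", "notes", "open", "opens", "tone", "zebra"}
print(longest_word(board, sample_words))
```

Sorting on the pair `(-len(w), w)` covers both requirements in the question at once: maximum length first, then alphabetical order among ties.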

The GAIA Leaderboard, showing Trase Agent in the Number 1 spot.

How Trase Secured the Top Ranking

With an overall success rate of 35.55%, Trase ranks ahead of submissions from Meta, Microsoft, Hugging Face, AutoGPT, Princeton University, University of Hong Kong, the UK AI Safety Institute, and Baichuan (a Chinese startup that has raised ~$700 million to date).

To achieve its current standing, the Trase Agent deployed several innovations. First, its base LLM was fine-tuned through iterative self-improvement, using a modified version of a technique known as STaR (Self-Taught Reasoner). The modification adapts STaR for agents, training the model on action trajectories generated while solving similar problems. To teach the agent how humans solve problems, these trajectories were augmented to bring them in line with a human expert’s actions. In addition, the agent’s fine-tuned LLM specifies its actions as code. This gives it the richer expressiveness of control loops and other primitives, and allows it to compose multiple tool calls into a single code action.
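To make the idea of a code action concrete, here is a minimal sketch. It is not Trase’s implementation: the tools `web_search` and `read_page`, the `run_code_action` helper, and the fake data behind them are all hypothetical. The point is only that a single action written as code can loop over several tool calls and combine their results.

```python
# Hypothetical tools backed by fake data; real ones would hit a search engine or browser.
def web_search(query: str) -> list:
    fake_index = {"tallest mountains": ["page/everest", "page/k2"]}
    return fake_index.get(query, [])


def read_page(url: str) -> dict:
    fake_pages = {
        "page/everest": {"name": "Everest", "height_m": 8849},
        "page/k2": {"name": "K2", "height_m": 8611},
    }
    return fake_pages[url]


TOOLS = {"web_search": web_search, "read_page": read_page}

# One code action as a model might emit it: a loop composes three tool calls.
CODE_ACTION = """
heights = {}
for url in web_search("tallest mountains"):
    page = read_page(url)
    heights[page["name"]] = page["height_m"]
result = max(heights, key=heights.get)
"""


def run_code_action(code: str, tools: dict):
    """Execute one code action with the tools in scope and return its `result`."""
    namespace = dict(tools)
    exec(code, namespace)  # illustration only; a real system would sandbox this
    return namespace.get("result")


print(run_code_action(CODE_ACTION, TOOLS))
```

Expressed as separate single-tool actions, the same task would take one search step plus one step per page, with the model re-prompted between each.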

Another innovation the Trase Agent employs is self-critique: deciding whether enough information has been gathered to answer the question and whether that information makes sense. This is performed by a top-level agent (think of it as a team leader) that runs the self-critique analysis and then calls on other agents to solve more basic problems. Through this method, the Trase Agent can accurately determine whether an answer is correct, then use the result as a signal for when to switch tools, try alternate techniques, or even communicate with other specialized agents or fine-tuned LLMs to solve the problem.
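The control flow described above can be sketched as a loop in which a lead agent delegates to a sub-agent, critiques the result, and moves on to another approach if the critique fails. This is an illustration under stated assumptions, not Trase’s architecture: `lead_agent`, `Attempt`, the stub sub-agents, and the `critique` rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Attempt:
    strategy: str
    answer: Optional[str]


def lead_agent(question: str, sub_agents: dict, critique: Callable):
    """Try sub-agents in order until the critique accepts an answer."""
    history = []
    for name, solve in sub_agents.items():
        answer = solve(question)
        history.append(Attempt(name, answer))
        if critique(question, answer):
            return answer, history  # critique passed: stop here
        # critique failed: fall through and switch to the next approach
    return None, history


# Stub sub-agents (assumed for this example).
def web_agent(question: str) -> Optional[str]:
    return None  # pretend the search found nothing useful


def calculator_agent(question: str) -> Optional[str]:
    return "4" if question == "What is 2 + 2?" else None


def critique(question: str, answer: Optional[str]) -> bool:
    """Toy check: accept only a non-empty, purely numeric answer."""
    return answer is not None and answer.isdigit()


sub_agents = {"web_search": web_agent, "calculator": calculator_agent}
answer, history = lead_agent("What is 2 + 2?", sub_agents, critique)
print(answer, [attempt.strategy for attempt in history])
```

In a real system the critique would itself be a model call rather than a fixed rule, and the history of failed attempts could be fed back to guide the next choice of tool.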

Why GAIA Matters for AI Evolution

Achieving 90% proficiency on GAIA is considered the threshold at which agent-based systems could replace humans for complex tasks that involve combining multiple systems, media, data sources, and outputs. This threshold serves as a kind of “Turing Test” (in which it’s difficult for users to determine whether they’re talking to a human or a computer) for multi-agent systems.

Even when equipped with tools for web browsing and code execution, traditional LLMs struggle to achieve high success rates on GAIA because they lack specialized file handlers, long-term planning, and the ability to reason over many tool calls. For example, OpenAI’s GPT-4, one of the most advanced LLMs available, manages only a 30% success rate on GAIA’s easiest tasks and 0% on its hardest. In contrast, human respondents have an average success rate of 92%, underscoring the gap between current AI capabilities and human reasoning. This makes the GAIA benchmark a critical measure for evaluating the real-world effectiveness of AI agents.

What’s Next for Trase?

Since our inception a year ago, Trase has maintained its deep commitment to advancing AI technology. Although we’ve been at the top of the GAIA leaderboard for several weeks, we’re not ones to rest on our laurels. We’ve now set our sights on improving our success rate on GAIA, with the goal of reaching the holy grail in AI: 90% proficiency to drive maximum human productivity and efficiency.

As a company, we’re excited to continue to innovate, refine our AI agents, and, ultimately, push the boundaries of what’s possible within the world of artificial intelligence.

Learn more about us at www.trasesystems.com
