Winners and Leaderboards of the Enterprise RAG Challenge 3

Over the course of several weeks, teams from around the world tested, optimized, and benchmarked their AI agents under realistic and demanding conditions. The result: outstanding solutions, innovative architectures, and valuable insights for the next generation of enterprise AI.

We searched for the world’s best AI agents — and found them!

The ERC3 leaderboards provide an open and comparable view of the performance of all participating agents. Whether you took part in the challenge yourself or are evaluating AI solutions, you’ll find detailed results across the following areas:

 

  • 🥇 Prize Leaderboard – Ranking of the top-performing agents
  • Speed Leaderboard – Speed and efficiency
  • 📍 Locality Leaderboard – Performance with locally deployable (open-weight) models
  • 🎯 Accuracy Leaderboard – Precision and answer quality
  • 💰 Budget Leaderboard – Cost efficiency in competition

Watch the winner announcement livestream again

Rinat Abdullin, Head of Machine Learning & Innovation at TIMETOACT GROUP Austria, guides viewers through the winner announcement and shares exclusive insights from ERC3.

ERC3 Leaderboards

Prize Leaderboard

Rank | Session Name | Score | Cost | Fault | Submitted | Per task
1 @aostrikov claude sequential evolution 0.718 34.21 2025-12-09 11:30 6m 38s
Model(s): claude-opus-4.5 LLM Calls: 685 Prompt Tokens: 1.17M Completion Tokens: 149.48k Architecture: Anthropic SDK Agent PARALLEL (5w) with claude-opus-4-5-20251101 # ERC3 Agent Architecture ## The Basics Fairly simple architecture: the main agent is built on **Anthropic Python SDK** with native Tool Use. Model - Opus 4.5. All 20+ tools are described in a single file using Anthropic's JSON Schema format. Tool execution dynamically constructs HTTP requests to the benchmark API — no code generation, just endpoint mapping. The system prompt distills all key rules from the company wiki into a compact decision algorithm: check identity → verify permissions → gather data → respond with proper outcome. Also packed with tool usage tweaks and some examples. ## The Interesting Part: Self-Evolving Agent The really cool thing was a three-agent pipeline: 1. The first agent runs the benchmark, solves all tasks, and logs everything. 2. The second agent reviews logs of failed tasks and formulates hypotheses about what went wrong and why. 3. The third agent reads all suggestions, decides what to incorporate, and generates a new version of the system prompt. This creates a feedback loop: run benchmark → analyze failures → patch prompt → repeat. The system prompt automatically evolved from a basic starting point through dozens of iterations, each fixing specific failure patterns discovered by the analyzer. No manual prompt engineering. Just agents improving agents.
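A minimal sketch of the feedback loop described above, assuming three hypothetical callables for the run, analyze, and rewrite roles (in the actual submission each role is presumably a Claude call via the Anthropic SDK):

```python
from typing import Callable, Dict, List

def evolve_system_prompt(
    initial_prompt: str,
    run_benchmark: Callable[[str], List[Dict]],     # role 1: runs all tasks, logs everything
    analyze_failures: Callable[[List[Dict]], str],  # role 2: hypotheses about failed tasks
    rewrite_prompt: Callable[[str, str], str],      # role 3: patches the system prompt
    max_iterations: int = 20,
) -> str:
    """Feedback loop: run benchmark -> analyze failures -> patch prompt -> repeat."""
    prompt = initial_prompt
    for _ in range(max_iterations):
        results = run_benchmark(prompt)
        failures = [r for r in results if not r.get("passed")]
        if not failures:
            break
        hypotheses = analyze_failures(failures)
        prompt = rewrite_prompt(prompt, hypotheses)
    return prompt
```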
2 I.R. 0.621 0.56 2025-12-09 13:11 5m 43s
Model(s): openai/gpt-oss-120b LLM Calls: 864 Prompt Tokens: 1.16M Completion Tokens: 564.27k Architecture: Multiagent oss-120b LLM: gpt-oss-120b Used exclusively via the Cerebras provider for speed (up to ~3k tokens/s). The architecture was based on a modified SGR NextStep with a tricky context-management logic: it fed the whole plan field from the last turn, not just the first step of the plan. All turns except the immediately previous one were kept in the LLM context in a compressed form. Each turn of the main NextStep flow was checked by a StepValidator. If everything was OK, the tool was executed and the flow continued as usual (the validator's work was not reflected in the context at all). Otherwise, the last NextStep message was sent for rework with the validator's comments. System instructions were extracted from wiki files by an LLM during the ingestion phase. The system prompt was loaded dynamically depending on whoami (public vs authenticated). The system prompt contained minimal information about /respond formatting. Detailed instructions for /respond were loaded by calling a pseudo-tool. The /whoami call was triggered automatically at the start of a task. A dynamic user context enrichment feature was used. Before the main agent started, the system code automatically pulled the user's full profile, projects, clients, and time entries by user ID. A separate LLM pass then filtered this data, and only the task-relevant subset was fed into the main LLM flow. Tool wrappers: - Pagination was effectively removed from all tools. A separate auto-pagination function would paginate through all pages and return the full list. - Req_LogTimeEntry was rebuilt because it was the only tool in the SDK that was constructed with a different field order, where the tool field was not first, which confused the model. - Also, as mentioned above, an extra Req_LoadRespondInstructions pseudo-tool was added to load the detailed /respond instructions. All tools were invoked via Structured Output instead of native tool calling. Issues: I set the turn limit for the main NextStep flow too low, so 5 of 103 tasks were simply not completed. There was not enough time left before the competition ended to rerun with a higher limit. Running all 103 tasks took about 1,430 LLM requests, $6.8, 15 minutes (with parallel task execution), 17.7M input-context tokens, and 838K output-context tokens. The main contributor to output tokens was reasoning. LLM: gpt-oss-120b via Cerebras Core agent: modified SGR NextStep with Steps validation and custom context strategy System prompts: routed based on /whoami User context: enriched by auto-loading from API with subsequent LLM filtering Tools: auto-pagination wrapper
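The auto-pagination wrapper mentioned in this entry is a simple but effective pattern; a minimal sketch follows, assuming a hypothetical page format with `items` and `total` fields (the real ERC3 SDK responses may use different names):

```python
from typing import Any, Callable, Dict, List

def paginate_all(fetch_page: Callable[[int, int], Dict[str, Any]],
                 page_size: int = 50) -> List[Any]:
    """Call a paginated endpoint repeatedly and return the concatenated item list,
    so the agent never has to reason about offsets and limits itself."""
    items: List[Any] = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)   # one underlying API/tool call
        batch = page.get("items", [])
        items.extend(batch)
        offset += page_size
        if not batch or offset >= page.get("total", len(items)):
            break
    return items
```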
3 Function Calling Agent (gpt-4.1) v17 removed find_employee 0.612 5.46 2025-12-09 10:34 38s
Model(s): gpt-4.1 LLM Calls: 182 Prompt Tokens: 0.12M Completion Tokens: 21.85k Architecture: OpenAI Agent runtime + SGR The core of the agent is built on the OpenAI runtime using the GPT-4.1 model. Tool usage is implemented via Function Calling with structured outputs. A significant part of the work was focused on designing convenient and reliable agent tools, especially for search. For this purpose, text-embedding-3-large embeddings were used. Regarding context handling, the main principle was to keep the agent’s own instructions minimal and rely on distilled wiki-based knowledge, with special care taken to preserve the original rules and constraints without distortion.
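As a generic illustration of an embedding-backed search tool built on text-embedding-3-large (the team's actual tool design is not published; this only sketches the idea using the standard OpenAI Python SDK):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([d.embedding for d in resp.data])

def search(query: str, documents: list[str], top_k: int = 5) -> list[str]:
    """Return the documents most similar to the query by cosine similarity."""
    doc_vecs = embed(documents)
    q = embed([query])[0]
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return [documents[i] for i in np.argsort(-sims)[:top_k]]
```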
4 Simple Agent & deepseek-reasoner 0.602 0.63 2025-12-09 10:26 7m 47s
Model(s): deepseek-reasoner LLM Calls: 1,527 Prompt Tokens: 1.30M Completion Tokens: 277.21k Architecture: Simple Agent & deepseek-reasoner I implemented a single-agent architecture where tools are mapped 1:1 to the API endpoints without modification. I added only one custom tool, ask_wiki, which allows the agent to ask natural language questions about the wiki. The implementation of ask_wiki is straightforward: the entire wiki content is injected into the system prompt (which proves to be highly efficient due to context caching). The agent's main system prompt is concise (**only 320 tokens**) to avoid overfitting; it contains only wiki-independent facts. It defines a mandatory execution sequence: 1) Call who_am_i and get_employee... 2) Call ask_wiki to retrieve user permissions... 3) Validate security. If the user lacks permissions... 4) If authorized, fulfill the User task... (plus a few more instructions). Performance: The deepseek-reasoner model performed the best—it offered the optimal balance of accuracy, speed, and cost. * Cost: ~$0.60 per 100 tasks. * Efficiency: Average cache hit/miss ratio ≈ 30. Conclusion: I considered applying the approaches from your sgr-agent-erc3-test sample, but ultimately settled on a simpler (and, in my view, more universal) architecture.
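The ask_wiki tool is compact enough to show as a hedged sketch. It assumes the DeepSeek OpenAI-compatible endpoint named in the entry; the prompt wording and function signature are illustrative, not the author's code:

```python
from openai import OpenAI

# DeepSeek exposes an OpenAI-compatible API; deepseek-reasoner is the model from this entry.
client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_DEEPSEEK_KEY")

def ask_wiki(question: str, wiki_text: str) -> str:
    """Answer a natural-language question using the full wiki placed in the system prompt.
    Keeping the large, static wiki prefix identical across calls lets context caching kick in."""
    resp = client.chat.completions.create(
        model="deepseek-reasoner",
        messages=[
            {"role": "system", "content": "Answer strictly from this wiki:\n\n" + wiki_text},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content
```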
5 Langchain Tool Agent openai/gpt-4.1 0.544 16.29 2025-12-09 10:46 17s
Model(s): openai/gpt-4.1 LLM Calls: 543 Prompt Tokens: 0.20M Completion Tokens: 33.20k Architecture: Langchain Tool Call Agent w/ openai/gpt-4.1 IT Development Team d.o.o., Slovenia. There were two models: - Langchain Tool Agent Qwen/Qwen3-4B-Thinking-2507 - Langchain Tool Agent openai/gpt-4.1 The solution is built around a ReAct-style agent implemented with LangChain. The base model is Qwen3-4B with reasoning capabilities. In addition, a lightweight safety layer (SO) is integrated, following the same general pattern as in the provided example. The tool's layer was adapted specifically for LangChain. The tool schemas were simplified by removing unused fields, while input and output parameter descriptions were expanded and clarified. Tool descriptions were also refined to improve the model’s accuracy in tool selection. Token usage was optimized using toon, and several custom tools were added, including a tool for retrieving the user’s current projects. For context handling, a rule distillation approach was applied. The original Wiki and instruction set were passed through the model to parse, compress, and restructure the rules into a compact system context. The conversation history itself was not compressed, as experiments showed that history compression degrades quality significantly, while Qwen3 handled long contexts reliably without noticeable performance loss. One notable observation is that, compared to GPT-5, the smaller reasoning-focused Qwen3 model did not exhibit tool-calling loops and showed more stable completion of reasoning chains.
6 CC SDK ERC3 Agent 0.534 1.78 1 2025-12-09 12:58 4m 58s
Model(s): claude-sonnet-4.5, gpt-5.1 LLM Calls: 315 Prompt Tokens: 751.22k Completion Tokens: 30.66k Architecture: CC SDK with MCP Tools Claude Code SDK based agent with preflight validation, with dedicated post validation and recovery before submitting the result based on rules from wiki. - Improved tools schemas, I don't use SGR, but usual LLM function calling - For validation request I keep only rules, list of api tools called and the task. - For pre and post validation calls SGR is used Faults: missing_model 'none'
7 @Krestnikov (Giga team) 0.515 3.62 2025-12-09 11:45 32s
Model(s): gpt-5.1 LLM Calls: 727 Prompt Tokens: 1.10M Completion Tokens: 113.27k Architecture: React + think-tool + Structured reasoning I used gpt-5.1 with a vanilla ReAct agent on LangGraph. I implemented all ERC functions as tools, plus a few additional tools following agent-building best practices: > plan tool > think tool (for controlled reasoning) > critic tool (the critic tool uses structured output with dedicated reasoning fields). Context is a single continuous thread: at any moment the agent can see the full chain of its own reasoning and actions. Everything else was achieved through careful prompt engineering. I also plan to publish all source code in my Telegram channel: https://t.me/robofuture
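For illustration, a hedged sketch of what plan/think tools and a structured critic can look like as LangChain tools for a LangGraph agent (names and fields are assumptions, not the author's code):

```python
from langchain_core.tools import tool
from pydantic import BaseModel, Field

@tool
def think(thought: str) -> str:
    """Scratchpad for controlled reasoning; the thought is simply echoed back into the transcript."""
    return thought

@tool
def plan(steps: list[str]) -> str:
    """Record an explicit plan so later actions can be checked against it."""
    return "Plan recorded:\n" + "\n".join(f"{i + 1}. {s}" for i, s in enumerate(steps))

class Critique(BaseModel):
    """Structured output for the critic step, with a dedicated reasoning field."""
    reasoning: str = Field(description="Why the draft answer is or is not acceptable")
    issues: list[str] = Field(default_factory=list)
    approved: bool
```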
8 @andrey_aiweapps - ERC3 Challenge Agent 0.505 14.41 2025-12-09 10:35 1m 26s
Model(s): openai/gpt-4.1, openai/gpt-5.1-codex-max LLM Calls: 854 Prompt Tokens: 1.65M Completion Tokens: 240.10k Architecture: AtomicAgents + $openai/gpt-4.1 + Sonnet 4.5 # ERC3 Challenge Agent — Leaderboard Description **Multi-stage pipeline agent** built on `atomic-agents` framework with `instructor`-powered structured outputs. Uses a **6-step sequential workflow** that separates security validation, context extraction, and task execution. Based on gpt-5.1-codex-max and gpt4.1 LLM models. ## Agent Design - **Security Gate Agent**: Pre-execution LLM that validates permissions against wiki rules before the main loop runs. Blocks invalid requests early (spoofing detection, access control). - **Prompt Context Extraction Agent**: Surfaces critical rules from 500+ line system prompts so the execution agent doesn't miss important details. - **Execution Agent**: ReAct-style planning loop with chain-of-thought reasoning (5 phases: Identity → Threat Detection → Info Gathering → Access Validation → Execution). ## Tool Handling - **22 domain tools** covering identity, wiki, employees, customers, projects, and time tracking - **Auto-link generation**: Embedded `LinkGeneratorAgent` inside `RespondTool` automatically extracts entity links from response context, preventing missing-link failures - **Tool Provider pattern**: Centralized tool registry with typed Pydantic schemas for all inputs/outputs ## Context Strategy - **Aggressive preloading**: User context, projects, full customer details, and all company users loaded *before* execution starts - **API enrichment**: Project data enriched with complete customer info (location, deal phase, account manager) to minimize tool calls during execution - **SHA1-based caching**: Wiki content and extracted rules cached by content hash — instant reload when wiki unchanged, automatic invalidation on updates - **7-section wiki extraction**: Business rules parsed into structured sections (Fraud Prevention, Hierarchy, Nuances, Output Requirements, Error Handling, Workflow, Entity Linking) - **Memory accumulation**: Critical information from security gate and context extraction injected into execution agent's initial memory - **Runtime Context**: Accumulated memory from previous steps, full execution history (tool calls + results) ## Key Differentiators 1. **Pre-execution security gate** — invalid requests blocked before planning loop 2. **Context-rich prompts** — user projects with full team & customer data in system context 3. **Deterministic prompt assembly** — wiki sections + user context combined without LLM 4. **Automatic entity linking** — dedicated agent ensures correct links in every response 5. **Precision over helpfulness** — answers exactly what was asked, no extra suggestions
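The SHA1-based wiki cache described in this entry is a simple pattern worth spelling out; a minimal sketch, where the cache directory and the `extract_rules` callable are hypothetical stand-ins for the entry's rule-extraction pass:

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, Dict

CACHE_DIR = Path(".wiki_cache")  # hypothetical location

def cached_wiki_rules(wiki_text: str, extract_rules: Callable[[str], Dict]) -> Dict:
    """Re-run the expensive LLM rule extraction only when the wiki content changes."""
    key = hashlib.sha1(wiki_text.encode("utf-8")).hexdigest()
    cache_file = CACHE_DIR / f"{key}.json"
    if cache_file.exists():                   # unchanged wiki -> instant reload
        return json.loads(cache_file.read_text())
    rules = extract_rules(wiki_text)          # LLM pass over the wiki
    CACHE_DIR.mkdir(exist_ok=True)
    cache_file.write_text(json.dumps(rules))  # a new hash automatically invalidates old entries
    return rules
```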
9 NextStep SGR (google/gemini-2.5-flash) from ERC3 Samples +pipelined 0.505 2.80 2025-12-09 10:59 27s
Model(s): google/gemini-2.5-flash LLM Calls: 740 Prompt Tokens: 0.72M Completion Tokens: 476.38k Architecture: NextStep SGR Agent
10 @dimaprodev agent 0.495 1.41 2025-12-09 11:40 24s
Model(s): openai/gpt-5.1 LLM Calls: 102 Prompt Tokens: 993.66k Completion Tokens: 111.80k Architecture: Tools agent openai/gpt-5.1
11 DistillAgent(distiller_llm=gpt-5-medium, executor_llm=gpt-5-medium) 0.495 9.96 2025-12-09 12:50 3m 48s
Model(s): gpt-5 LLM Calls: 508 Prompt Tokens: 0.33M Completion Tokens: 910.68k Architecture: DistillAgent(distiller_llm=gpt-5-medium, executor_llm=gpt-5-medium)
12 ERC3 Prod Agent Run 0.475 2.57 2025-12-09 12:07 36s
Model(s): gpt-oss-120b, openai/gpt-5.1-codex-max LLM Calls: 830 Prompt Tokens: 0.98M Completion Tokens: 0.10M Architecture: AtomicAgents + $gpt-oss-120b
13 AECFoundry - Claudius Maximus 0.455 8.86 2025-12-09 11:37 46s
Model(s): anthropic/claude-sonnet-4.5 LLM Calls: 73 Prompt Tokens: 1.67M Completion Tokens: 70.34k Architecture:
14 Mini_1 Routed ReAct Multi-Agent gpt-4.1-mini 0.447 3.27 2025-12-09 10:22 20m 2s
Model(s): gpt-5.1 LLM Calls: 493 Prompt Tokens: 0.18M Completion Tokens: 216.59k Architecture: ReAct Multi-Agent
15 EPAMER GAME-CHANGER AGENTIC 0.447 15.30 2025-12-09 13:07 4m 18s
Model(s): openai/gpt-4.1 LLM Calls: 510 Prompt Tokens: 0.38M Completion Tokens: 123.36k Architecture: AvaTar arch intellect-3
16 @mishka ERC3-Test Agent (Parallel x20) 0.437 0.72 2025-12-09 12:07 53s
Model(s): qwen/qwen3-235b-a22b-2507 LLM Calls: 796 Prompt Tokens: 0.85M Completion Tokens: 269.28k Architecture: SGR Agent Parallel (OpenRouter qwen/qwen3-235b-a22b-2507)
17 HAIKU 0.427 2.98 2025-12-09 11:10 41s
Model(s): anthropic/claude-haiku-4.5 LLM Calls: 75 Prompt Tokens: 1.65M Completion Tokens: 76.47k Architecture:
18 SGR Bro (gpt-4.1) 0.417 10.32 2025-12-09 11:32 34s
Model(s): openai/gpt-4.1 LLM Calls: 344 Prompt Tokens: 0.17M Completion Tokens: 44.22k Architecture: Simple NextStep SGR with structured distillation
19 NextStep SGR (gpt-4.1) from ERC3 Samples + full text search for pick rules + additional PreflightCheck 0.408 15.25 2025-12-09 11:28 2m 3s
Model(s): gpt-4.1, gpt-5.1 LLM Calls: 571 Prompt Tokens: 0.42M Completion Tokens: 168.89k Architecture: NextStep SGR Agent with OpenAI
20 Codegen Agent gpt-5.1 by Armen Epremian 0.398 1.91 2025-12-09 10:27 14s
Model(s): gpt-5.1 LLM Calls: 102 Prompt Tokens: 738.57k Completion Tokens: 98.61k Architecture: Codegen SGR Agent with Google GenAI
21 NextStep SGR (qwen3-max) with integrated tools 0.398 2.98 2025-12-09 11:30 40s
Model(s): gpt-5.1, qwen3-max LLM Calls: 396 Prompt Tokens: 0.28M Completion Tokens: 51.51k Architecture: NextStep SGR Agent with integrated tools from tools.py
22 Simple SGR Agent (gpt-4.1) by tokyo_s 0.398 11.25 2025-12-09 11:58 1m 15s
Model(s): openai/gpt-4.1 LLM Calls: 375 Prompt Tokens: 0.18M Completion Tokens: 55.92k Architecture: NextStep SGR Agent with OpenAI and coding tools
23 Boring Agent 0.398 3.17 2025-12-09 12:40 2m 56s
Model(s): gpt-5-mini LLM Calls: 1,484 Prompt Tokens: 1.01M Completion Tokens: 0.10M Architecture: Plan/Act - OpenAI
24 @alexchaison DPCED-agent 0.387 11.78 2025-12-09 10:40 1m 53s
Model(s): openai/gpt-4o, openai/o3 LLM Calls: 572 Prompt Tokens: 0.30M Completion Tokens: 243.31k Architecture: Discovery-Planner-Executor-Decider Pipeline
25 NextStep SGR (gpt-4.1-mini) by @figaroserg1 0.379 10.58 2025-12-09 10:44 30s
Model(s): gpt-4.1-mini LLM Calls: 423 Prompt Tokens: 0.18M Completion Tokens: 144.73k Architecture: NextStep SGR Agent with OpenAI and Grok
26 ERCPlanReActAgent, Model=gemini-2.5-pro 0.379 21.46 2025-12-09 11:40 3m 7s
Model(s): gemini-2.5-pro LLM Calls: 1,631 Prompt Tokens: 1.35M Completion Tokens: 492.97k Architecture: ERCPlanReActAgent, Model=gemini-2.5-pro
27 ERC3 Agent Mercury Multi-Agent Distilled SGR (gpt-4.1) 0.379 20.07 2025-12-09 11:58 1m 6s
Model(s): gpt-4.1 LLM Calls: 669 Prompt Tokens: 0.20M Completion Tokens: 175.15k Architecture: Distilled Multi-Agent System combining pre-cached wiki rule distillation with multi-agent coordination (Orchestrator + specialized Workers)
28 AGES Agent v2 Parallel 0.359 3.61 2025-12-09 10:35 26s
Model(s): openai/gpt-4o LLM Calls: 103 Prompt Tokens: 0.51M Completion Tokens: 130.04k Architecture: AGES SGR Agent with gpt-4o (parallel)
29 ERC3 Agent - LLM-Driven (openai/gpt-4.1) 0.339 21.15 2025-12-09 11:33 1m 0s
Model(s): openai/gpt-4.1 LLM Calls: 705 Prompt Tokens: 0.39M Completion Tokens: 226.54k Architecture: LLM-driven with confidence loop, no hardcoded rules
30 NextStep SGR (openai/gpt-5.1) from ERC3 Samples +pipelined 0.311 2.75 2025-12-09 11:31 1m 34s
Model(s): openai/gpt-5.1 LLM Calls: 324 Prompt Tokens: 0.10M Completion Tokens: 250.70k Architecture: NextStep SGR Agent with OpenAI
31 IS-103 SGR Multiagent System 0.311 1.14 2025-12-09 11:36 19s
Model(s): google/gemini-2.5-flash LLM Calls: 756 Prompt Tokens: 0.31M Completion Tokens: 209.92k Architecture: Router -> Searcher -> Executor
32 TZaKUS (pro) 0.311 0.97 2025-12-09 12:37 29s
Model(s): google/gemini-2.5-pro LLM Calls: 251 Prompt Tokens: 452.41k Completion Tokens: 40.10k Architecture: NextStep SGR Agent with Gemini ADK
33 gooooo (gpt-4o) 0.252 14.60 2025-12-09 12:57 17s
Model(s): openai/gpt-4o LLM Calls: 417 Prompt Tokens: 0.27M Completion Tokens: 70.81k Architecture: Vladimir Penkov, Agentic workflow
34 ERC3 Agent v3.1 SGR (@vkovalskii sgr dev team) (gpt-4o) 0.242 3.57 2025-12-09 11:15 18s
Model(s): gpt-4o LLM Calls: 102 Prompt Tokens: 593.03k Completion Tokens: 5.55k Architecture: ERC3 Agent v3 with SGR framework integration + memory compression
35 @skifmax OODA Agent (qwen/qwen3-235b-a22b-2507) 0.223 0.10 2025-12-09 10:14 11s
Model(s): qwen/qwen3-235b-a22b-2507 LLM Calls: 553 Prompt Tokens: 725.54k Completion Tokens: 112.01k Architecture: LangGraph OODA Agent (ERC3)
36 Graph Agent 0.204 2.40 2025-12-09 11:17 29s
Model(s): openai/gpt-4.1, openai/gpt-5.1 LLM Calls: 150 Prompt Tokens: 594.23k Completion Tokens: 113.00k Architecture: Graph Agent with OpenAI
37 SGR Agent (gpt-4o) 0.184 11.52 2025-12-09 10:47 11s
Model(s): gpt-4o LLM Calls: 329 Prompt Tokens: 286.94k Completion Tokens: 32.38k Architecture: SGR-LangGraph
38 Optimized Agent Claude Sonnet 4.5 prod @nlp_daily v1.0 0.058 14.40 2025-12-09 12:30 43s
Model(s): anthropic/claude-sonnet-4.5 LLM Calls: 727 Prompt Tokens: 0.42M Completion Tokens: 121.93k Architecture: CASCADE pattern with complete API schema and optimized search strategies with OpenRouter/Claude

Total submissions: 38 • Cutoff: 2025-12-09 13:40 CET

Speed Leaderboard

Rank | Session Name | Score | Cost | Fault | Submitted | Per task
1 Langchain Tool Agent openai/gpt-4.1 0.544 16.29 2025-12-09 10:46 17s
Model(s): openai/gpt-4.1 LLM Calls: 543 Prompt Tokens: 0.20M Completion Tokens: 33.20k Architecture: Langchain Tool Call Agent w/ openai/gpt-4.1 IT Development Team d.o.o., Slovenia. There were two models: - Langchain Tool Agent Qwen/Qwen3-4B-Thinking-2507 - Langchain Tool Agent openai/gpt-4.1 The solution is built around a ReAct-style agent implemented with LangChain. The base model is Qwen3-4B with reasoning capabilities. In addition, a lightweight safety layer (SO) is integrated, following the same general pattern as in the provided example. The tool's layer was adapted specifically for LangChain. The tool schemas were simplified by removing unused fields, while input and output parameter descriptions were expanded and clarified. Tool descriptions were also refined to improve the model’s accuracy in tool selection. Token usage was optimized using toon, and several custom tools were added, including a tool for retrieving the user’s current projects. For context handling, a rule distillation approach was applied. The original Wiki and instruction set were passed through the model to parse, compress, and restructure the rules into a compact system context. The conversation history itself was not compressed, as experiments showed that history compression degrades quality significantly, while Qwen3 handled long contexts reliably without noticeable performance loss. One notable observation is that, compared to GPT-5, the smaller reasoning-focused Qwen3 model did not exhibit tool-calling loops and showed more stable completion of reasoning chains.
2 NextStep SGR (google/gemini-2.5-flash) from ERC3 Samples +pipelined 0.505 2.80 2025-12-09 10:59 27s
Model(s): google/gemini-2.5-flash LLM Calls: 740 Prompt Tokens: 0.72M Completion Tokens: 476.38k Architecture: NextStep SGR Agent
3 last days (gpt-4o) 0.447 11.09 2025-12-16 08:02 16s
Model(s): openai/gpt-4o, x-ai/grok-4-fast LLM Calls: 595 Prompt Tokens: 0.18M Completion Tokens: 44.71k Architecture: vladimir.v.penkov@gmail.com, I am looking for work. Agentic workflow
4 [dtbz] @skifmax OODA Agent (qwen/qwen3-235b-a22b-2507) [erc3-prod] 0.350 0.34 1 2025-12-16 05:06 10s
Model(s): qwen/qwen3-235b-a22b-2507, rule-based LLM Calls: 501 Prompt Tokens: 0.37M Completion Tokens: 174.80k Architecture: [dtbz] OODA Loop Agent (direct) Faults: Model rule-based is not found on OpenRouter
5 TZaKUS (pro) 0.330 1.17 2025-12-09 11:41 22s
Model(s): google/gemini-2.5-pro LLM Calls: 283 Prompt Tokens: 583.51k Completion Tokens: 43.97k Architecture: NextStep SGR Agent with Gemini ADK
6 ERC3 Agent v3.1 SGR (@vkovalskii sgr dev team) (gpt-4o) 0.242 3.57 2025-12-09 11:15 18s
Model(s): gpt-4o LLM Calls: 102 Prompt Tokens: 593.03k Completion Tokens: 5.55k Architecture: ERC3 Agent v3 with SGR framework integration + memory compression
7 NextStep SGR (gpt-5) with integrated tools 0.019 0.25 2025-12-16 07:58 15s
Model(s): gpt-5 LLM Calls: 16 Prompt Tokens: 316.35k Completion Tokens: 14.78k Architecture: NextStep SGR Agent with integrated tools from tools.py
8 @alexchaison DPCED-agent 0.010 0.07 2025-12-16 08:00 3s
Model(s): openai/o3, x-ai/grok-4-fast LLM Calls: 16 Prompt Tokens: 237.29k Completion Tokens: 10.95k Architecture: Discovery-Planner-Executor-Decider Pipeline

Total submissions: 8 • Filter: compete_speed flag AND duration < 4500s

Locality Leaderboard

Rank | Session Name | Score | Cost | Fault | Submitted | Per task
1 I.R. 0.621 0.56 2025-12-09 13:11 5m 43s
Model(s): openai/gpt-oss-120b LLM Calls: 864 Prompt Tokens: 1.16M Completion Tokens: 564.27k Architecture: Multiagent oss-120b LLM: gpt-oss-120b Used exclusively via the Cerebras provider for speed (up to ~3k tokens/s). The architecture was based on a modified SGR NextStep with a tricky context-management logic: it fed the whole plan field from the last turn, not just the first step of the plan. All turns except the immediately previous one were kept in the LLM context in a compressed form. Each turn of the main NextStep flow was checked by a StepValidator. If everything was OK, the tool was executed and the flow continued as usual (the validator's work was not reflected in the context at all). Otherwise, the last NextStep message was sent for rework with the validator's comments. System instructions were extracted from wiki files by an LLM during the ingestion phase. The system prompt was loaded dynamically depending on whoami (public vs authenticated). The system prompt contained minimal information about /respond formatting. Detailed instructions for /respond were loaded by calling a pseudo-tool. The /whoami call was triggered automatically at the start of a task. A dynamic user context enrichment feature was used. Before the main agent started, the system code automatically pulled the user's full profile, projects, clients, and time entries by user ID. A separate LLM pass then filtered this data, and only the task-relevant subset was fed into the main LLM flow. Tool wrappers: - Pagination was effectively removed from all tools. A separate auto-pagination function would paginate through all pages and return the full list. - Req_LogTimeEntry was rebuilt because it was the only tool in the SDK that was constructed with a different field order, where the tool field was not first, which confused the model. - Also, as mentioned above, an extra Req_LoadRespondInstructions pseudo-tool was added to load the detailed /respond instructions. All tools were invoked via Structured Output instead of native tool calling. Issues: I set the turn limit for the main NextStep flow too low, so 5 of 103 tasks were simply not completed. There was not enough time left before the competition ended to rerun with a higher limit. Running all 103 tasks took about 1,430 LLM requests, $6.8, 15 minutes (with parallel task execution), 17.7M input-context tokens, and 838K output-context tokens. The main contributor to output tokens was reasoning. LLM: gpt-oss-120b via Cerebras Core agent: modified SGR NextStep with Steps validation and custom context strategy System prompts: routed based on /whoami User context: enriched by auto-loading from API with subsequent LLM filtering Tools: auto-pagination wrapper
2 @mishka ERC3-Test Agent (Parallel x20) 0.563 0.31 2025-12-15 22:36 33s
Model(s): qwen/qwen3-235b-a22b-2507 LLM Calls: 597 Prompt Tokens: 0.34M Completion Tokens: 156.71k Architecture: SGR Agent Parallel (openrouter qwen/qwen3-235b-a22b-2507)
3 @neuraldeep sgr_agent_core_qwen/qwen3-235b-a22b-2507 0.466 1.95 2025-12-16 03:05 3m 33s
Model(s): qwen3-235b-a22b-2507 LLM Calls: 1,675 Prompt Tokens: 2.85M Completion Tokens: 190.95k Architecture: SGR Tool Calling Agent with Security Checks - OpenAI Function Calling
4 NextStep SGR (gpt-oss-120b) with integrated tools 0.369 0.17 2025-12-16 07:58 27s
Model(s): gpt-5.1, gpt-oss-120b LLM Calls: 256 Prompt Tokens: 0.51M Completion Tokens: 111.34k Architecture: NextStep SGR Agent with integrated tools from tools.py
5 [nfuz] @skifmax OODA Agent (qwen/qwen3-235b-a22b-2507) [erc3-prod] 0.320 0.36 1 2025-12-16 05:44 11s
Model(s): qwen/qwen3-235b-a22b-2507, rule-based LLM Calls: 539 Prompt Tokens: 0.40M Completion Tokens: 179.01k Architecture: [nfuz] OODA Loop Agent (direct) Faults: Model rule-based is not found on OpenRouter
6 Langchain Tool Agent Qwen/Qwen3-4B-Thinking-2507 0.311 0.03 2025-12-09 12:33 1m 34s
Model(s): qwen3-4b-thinking-2507 LLM Calls: 241 Prompt Tokens: 798.04k Completion Tokens: 465.34k Architecture: Langchain Tool Call Agent w/ Qwen/Qwen3-4B-Thinking-2507 IT Development Team d.o.o., Slovenia. There were two models: - Langchain Tool Agent Qwen/Qwen3-4B-Thinking-2507 - Langchain Tool Agent openai/gpt-4.1 The solution is built around a ReAct-style agent implemented with LangChain. The base model is Qwen3-4B with reasoning capabilities. In addition, a lightweight safety layer (SO) is integrated, following the same general pattern as in the provided example. The tool's layer was adapted specifically for LangChain. The tool schemas were simplified by removing unused fields, while input and output parameter descriptions were expanded and clarified. Tool descriptions were also refined to improve the model’s accuracy in tool selection. Token usage was optimized using toon, and several custom tools were added, including a tool for retrieving the user’s current projects. For context handling, a rule distillation approach was applied. The original Wiki and instruction set were passed through the model to parse, compress, and restructure the rules into a compact system context. The conversation history itself was not compressed, as experiments showed that history compression degrades quality significantly, while Qwen3 handled long contexts reliably without noticeable performance loss. One notable observation is that, compared to GPT-5, the smaller reasoning-focused Qwen3 model did not exhibit tool-calling loops and showed more stable completion of reasoning chains.
7 NextStep SGR (qwen/qwen3-32b:nitro) from ERC3 Samples +pipelined 0.184 0.26 2025-12-15 21:40 13s
Model(s): gpt-5.1, qwen/qwen3-32b LLM Calls: 428 Prompt Tokens: 0.25M Completion Tokens: 103.84k Architecture: NextStep SGR Agent with OpenAI
8 Local Routed ReAct Multi-Agents with search (qwen3-30b-a3b-instruct-2507-mlx@6bit) 0.039 0.00 1 2025-12-16 00:18 48s
Model(s): qwen/qwen3-30b-a3b-instruct-2507-mlx@6bit LLM Calls: 179 Prompt Tokens: 0 Completion Tokens: 0 Architecture: ReAct Multi-Agent Faults: Model qwen/qwen3-30b-a3b-instruct-2507-mlx@6bit is not found on OpenRouter

Total submissions: 8 • Filter: compete_local flag

Accuracy Leaderboard

Rank | Session Name | Score | Cost | Fault | Submitted | Per task
1 @aostrikov claude sequential evolution 0.718 34.21 2025-12-09 11:30 6m 38s
Model(s): claude-opus-4.5 LLM Calls: 685 Prompt Tokens: 1.17M Completion Tokens: 149.48k Architecture: Anthropic SDK Agent PARALLEL (5w) with claude-opus-4-5-20251101 # ERC3 Agent Architecture ## The Basics Fairly simple architecture: the main agent is built on **Anthropic Python SDK** with native Tool Use. Model - Opus 4.5. All 20+ tools are described in a single file using Anthropic's JSON Schema format. Tool execution dynamically constructs HTTP requests to the benchmark API — no code generation, just endpoint mapping. The system prompt distills all key rules from the company wiki into a compact decision algorithm: check identity → verify permissions → gather data → respond with proper outcome. Also packed with tool usage tweaks and some examples. ## The Interesting Part: Self-Evolving Agent The really cool thing was a three-agent pipeline: 1. The first agent runs the benchmark, solves all tasks, and logs everything. 2. The second agent reviews logs of failed tasks and formulates hypotheses about what went wrong and why. 3. The third agent reads all suggestions, decides what to incorporate, and generates a new version of the system prompt. This creates a feedback loop: run benchmark → analyze failures → patch prompt → repeat. The system prompt automatically evolved from a basic starting point through dozens of iterations, each fixing specific failure patterns discovered by the analyzer. No manual prompt engineering. Just agents improving agents.
2 I.R. 0.621 0.56 2025-12-09 13:11 5m 43s
Model(s): openai/gpt-oss-120b LLM Calls: 864 Prompt Tokens: 1.16M Completion Tokens: 564.27k Architecture: Multiagent oss-120b LLM: gpt-oss-120b Used exclusively via the Cerebras provider for speed (up to ~3k tokens/s). The architecture was based on a modified SGR NextStep with a tricky context-management logic: it fed the whole plan field from the last turn, not just the first step of the plan. All turns except the immediately previous one were kept in the LLM context in a compressed form. Each turn of the main NextStep flow was checked by a StepValidator. If everything was OK, the tool was executed and the flow continued as usual (the validator's work was not reflected in the context at all). Otherwise, the last NextStep message was sent for rework with the validator's comments. System instructions were extracted from wiki files by an LLM during the ingestion phase. The system prompt was loaded dynamically depending on whoami (public vs authenticated). The system prompt contained minimal information about /respond formatting. Detailed instructions for /respond were loaded by calling a pseudo-tool. The /whoami call was triggered automatically at the start of a task. A dynamic user context enrichment feature was used. Before the main agent started, the system code automatically pulled the user's full profile, projects, clients, and time entries by user ID. A separate LLM pass then filtered this data, and only the task-relevant subset was fed into the main LLM flow. Tool wrappers: - Pagination was effectively removed from all tools. A separate auto-pagination function would paginate through all pages and return the full list. - Req_LogTimeEntry was rebuilt because it was the only tool in the SDK that was constructed with a different field order, where the tool field was not first, which confused the model. - Also, as mentioned above, an extra Req_LoadRespondInstructions pseudo-tool was added to load the detailed /respond instructions. All tools were invoked via Structured Output instead of native tool calling. Issues: I set the turn limit for the main NextStep flow too low, so 5 of 103 tasks were simply not completed. There was not enough time left before the competition ended to rerun with a higher limit. Running all 103 tasks took about 1,430 LLM requests, $6.8, 15 minutes (with parallel task execution), 17.7M input-context tokens, and 838K output-context tokens. The main contributor to output tokens was reasoning. LLM: gpt-oss-120b via Cerebras Core agent: modified SGR NextStep with Steps validation and custom context strategy System prompts: routed based on /whoami User context: enriched by auto-loading from API with subsequent LLM filtering Tools: auto-pagination wrapper
3 Function Calling Agent (gpt-4.1) v17 removed find_employee 0.612 5.46 2025-12-09 10:34 38s
Model(s): gpt-4.1 LLM Calls: 182 Prompt Tokens: 0.12M Completion Tokens: 21.85k Architecture: OpenAI Agent runtime + SGR The core of the agent is built on the OpenAI runtime using the GPT-4.1 model. Tool usage is implemented via Function Calling with structured outputs. A significant part of the work was focused on designing convenient and reliable agent tools, especially for search. For this purpose, text-embedding-3-large embeddings were used. Regarding context handling, the main principle was to keep the agent’s own instructions minimal and rely on distilled wiki-based knowledge, with special care taken to preserve the original rules and constraints without distortion.
4 Simple Agent & deepseek-reasoner 0.602 0.63 2025-12-09 10:26 7m 47s
Model(s): deepseek-reasoner LLM Calls: 1,527 Prompt Tokens: 1.30M Completion Tokens: 277.21k Architecture: Simple Agent & deepseek-reasoner I implemented a single-agent architecture where tools are mapped 1:1 to the API endpoints without modification. I added only one custom tool, ask_wiki, which allows the agent to ask natural language questions about the wiki. The implementation of ask_wiki is straightforward: the entire wiki content is injected into the system prompt (which proves to be highly efficient due to context caching). The agent's main system prompt is concise (**only 320 tokens**) to avoid overfitting; it contains only wiki-independent facts. It defines a mandatory execution sequence: 1) Call who_am_i and get_employee... 2) Call ask_wiki to retrieve user permissions... 3) Validate security. If the user lacks permissions... 4) If authorized, fulfill the User task... (plus a few more instructions). Performance: The deepseek-reasoner model performed the best—it offered the optimal balance of accuracy, speed, and cost. * Cost: ~$0.60 per 100 tasks. * Efficiency: Average cache hit/miss ratio ≈ 30. Conclusion: I considered applying the approaches from your sgr-agent-erc3-test sample, but ultimately settled on a simpler (and, in my view, more universal) architecture.
5 Optimized Agent Claude Sonnet 4.5 prod @nlp_daily v1.0 0.583 16.32 2025-12-09 14:17 45s
Model(s): anthropic/claude-sonnet-4.5 LLM Calls: 795 Prompt Tokens: 0.48M Completion Tokens: 131.18k Architecture: CASCADE pattern with complete API schema and optimized search strategies with OpenRouter/Claude
6 AI-solutions (gpt-4.1) 0.573 11.52 2025-12-09 18:54 1m 8s
Model(s): gpt-4.1 LLM Calls: 384 Prompt Tokens: 0.30M Completion Tokens: 61.72k Architecture: Multistage agent
7 CC ERC3 Agent (TinyFish) @colriot 0.573 1.66 1 2025-12-09 22:26 1m 45s
Model(s): gpt-5.1 LLM Calls: 301 Prompt Tokens: 0.11M Completion Tokens: 29.78k Architecture: CC SDK with MCP Tools Claude Code SDK based agent with preflight validation, with dedicated post validation and recovery before submitting the result based on rules from wiki. - Improved tools schemas, I don't use SGR, but usual LLM function calling - For validation request I keep only rules, list of api tools called and the task. - For pre and post validation calls SGR is used Faults: missing_model 'none'
8 NextStep SGR Agent (gpt-4o) from ERC3 Samples 0.563 3.05 2025-12-16 02:41 30s
Model(s): gpt-4o LLM Calls: 87 Prompt Tokens: 87 Completion Tokens: 87 Architecture: NextStep SGR Agent with OpenAI
9 Langchain Tool Agent openai/gpt-4.1 0.544 16.29 2025-12-09 10:46 17s
Model(s): openai/gpt-4.1 LLM Calls: 543 Prompt Tokens: 0.20M Completion Tokens: 33.20k Architecture: Langchain Tool Call Agent w/ openai/gpt-4.1 IT Development Team d.o.o., Slovenia. There were two models: - Langchain Tool Agent Qwen/Qwen3-4B-Thinking-2507 - Langchain Tool Agent openai/gpt-4.1 The solution is built around a ReAct-style agent implemented with LangChain. The base model is Qwen3-4B with reasoning capabilities. In addition, a lightweight safety layer (SO) is integrated, following the same general pattern as in the provided example. The tool's layer was adapted specifically for LangChain. The tool schemas were simplified by removing unused fields, while input and output parameter descriptions were expanded and clarified. Tool descriptions were also refined to improve the model’s accuracy in tool selection. Token usage was optimized using toon, and several custom tools were added, including a tool for retrieving the user’s current projects. For context handling, a rule distillation approach was applied. The original Wiki and instruction set were passed through the model to parse, compress, and restructure the rules into a compact system context. The conversation history itself was not compressed, as experiments showed that history compression degrades quality significantly, while Qwen3 handled long contexts reliably without noticeable performance loss. One notable observation is that, compared to GPT-5, the smaller reasoning-focused Qwen3 model did not exhibit tool-calling loops and showed more stable completion of reasoning chains.
10 Routed ReAct Multi-Agents with search 0.534 16.35 2025-12-15 14:38 5m 39s
Model(s): gpt-4.1 LLM Calls: 545 Prompt Tokens: 0.33M Completion Tokens: 67.12k Architecture: ReAct Multi-Agent
11 @Krestnikov (Giga team) 0.515 3.62 2025-12-09 11:45 32s
Model(s): gpt-5.1 LLM Calls: 727 Prompt Tokens: 1.10M Completion Tokens: 113.27k Architecture: React + think-tool + Structured reasoning I used gpt-5.1 with a vanilla ReAct agent on LangGraph. I implemented all ERC functions as tools, plus a few additional tools following agent-building best practices: > plan tool > think tool (for controlled reasoning) > critic tool (the critic tool uses structured output with dedicated reasoning fields). Context is a single continuous thread: at any moment the agent can see the full chain of its own reasoning and actions. Everything else was achieved through careful prompt engineering. I also plan to publish all source code in my Telegram channel: https://t.me/robofuture
12 @andrey_aiweapps - ERC3 Challenge Agent 0.505 14.41 2025-12-09 10:35 1m 26s
Model(s): openai/gpt-4.1, openai/gpt-5.1-codex-max LLM Calls: 854 Prompt Tokens: 1.65M Completion Tokens: 240.10k Architecture: AtomicAgents + $openai/gpt-4.1 + Sonnet 4.5 # ERC3 Challenge Agent — Leaderboard Description **Multi-stage pipeline agent** built on `atomic-agents` framework with `instructor`-powered structured outputs. Uses a **6-step sequential workflow** that separates security validation, context extraction, and task execution. Based on gpt-5.1-codex-max and gpt4.1 LLM models. ## Agent Design - **Security Gate Agent**: Pre-execution LLM that validates permissions against wiki rules before the main loop runs. Blocks invalid requests early (spoofing detection, access control). - **Prompt Context Extraction Agent**: Surfaces critical rules from 500+ line system prompts so the execution agent doesn't miss important details. - **Execution Agent**: ReAct-style planning loop with chain-of-thought reasoning (5 phases: Identity → Threat Detection → Info Gathering → Access Validation → Execution). ## Tool Handling - **22 domain tools** covering identity, wiki, employees, customers, projects, and time tracking - **Auto-link generation**: Embedded `LinkGeneratorAgent` inside `RespondTool` automatically extracts entity links from response context, preventing missing-link failures - **Tool Provider pattern**: Centralized tool registry with typed Pydantic schemas for all inputs/outputs ## Context Strategy - **Aggressive preloading**: User context, projects, full customer details, and all company users loaded *before* execution starts - **API enrichment**: Project data enriched with complete customer info (location, deal phase, account manager) to minimize tool calls during execution - **SHA1-based caching**: Wiki content and extracted rules cached by content hash — instant reload when wiki unchanged, automatic invalidation on updates - **7-section wiki extraction**: Business rules parsed into structured sections (Fraud Prevention, Hierarchy, Nuances, Output Requirements, Error Handling, Workflow, Entity Linking) - **Memory accumulation**: Critical information from security gate and context extraction injected into execution agent's initial memory - **Runtime Context**: Accumulated memory from previous steps, full execution history (tool calls + results) ## Key Differentiators 1. **Pre-execution security gate** — invalid requests blocked before planning loop 2. **Context-rich prompts** — user projects with full team & customer data in system context 3. **Deterministic prompt assembly** — wiki sections + user context combined without LLM 4. **Automatic entity linking** — dedicated agent ensures correct links in every response 5. **Precision over helpfulness** — answers exactly what was asked, no extra suggestions
13 NextStep SGR (google/gemini-2.5-flash) from ERC3 Samples +pipelined 0.505 2.80 2025-12-09 10:59 27s
Model(s): google/gemini-2.5-flash LLM Calls: 740 Prompt Tokens: 0.72M Completion Tokens: 476.38k Architecture: NextStep SGR Agent
14 @dimaprodev agent 0.495 1.41 2025-12-09 11:40 24s
Model(s): openai/gpt-5.1 LLM Calls: 102 Prompt Tokens: 993.66k Completion Tokens: 111.80k Architecture: Tools agent openai/gpt-5.1
15 DistillAgent(distiller_llm=gpt-5-medium, executor_llm=gpt-5-medium) 0.495 9.96 2025-12-09 12:50 3m 48s
Model(s): gpt-5 LLM Calls: 508 Prompt Tokens: 0.33M Completion Tokens: 910.68k Architecture: DistillAgent(distiller_llm=gpt-5-medium, executor_llm=gpt-5-medium)
16 erdzhemadinov (openai/gpt-5.2) 0.494 3.88 2025-12-16 01:53 7m 59s
Model(s): openai/gpt-5.2 LLM Calls: 458 Prompt Tokens: 0.32M Completion Tokens: 163.71k Architecture: A NextStep SGR agent: the LLM produces a single schema-validated JSON step (state + brief plan + one typed tool call), then executes it and feeds the tool output back in a plan→act→observe→repair loop with retries. Tech stack: SGR (Schema-Guided Reasoning), Pydantic schemas, typed tool routing over the ERC3 API, and OpenAI as the planner/decider, plus preflight/policy guards.
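The schema-guided step described in this entry is easy to illustrate with Pydantic; the tool variants below are hypothetical simplifications of the ERC3 tool set, shown only to make the "state + brief plan + one typed tool call" idea concrete:

```python
from typing import List, Literal, Union
from pydantic import BaseModel, Field

class GetEmployee(BaseModel):
    tool: Literal["get_employee"] = "get_employee"
    employee_id: str

class Respond(BaseModel):
    tool: Literal["respond"] = "respond"
    outcome: str
    message: str

class NextStep(BaseModel):
    """One schema-validated step: state summary, brief plan, exactly one typed tool call."""
    current_state: str = Field(description="What is known so far")
    plan: List[str] = Field(description="Remaining steps; the first one is executed next")
    tool_call: Union[GetEmployee, Respond] = Field(discriminator="tool")

# The agent asks the LLM for structured output matching NextStep, executes
# tool_call, appends the observation, and repeats (plan -> act -> observe -> repair).
```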
17 refactor (gpt-4o) 0.476 10.25 2025-12-16 06:50 15s
Model(s): openai/gpt-4o, x-ai/grok-4-fast LLM Calls: 578 Prompt Tokens: 0.16M Completion Tokens: 42.44k Architecture: Vladimir Penkov, Agentic workflow
18 ERC3 Prod Agent Run 0.475 2.57 2025-12-09 12:07 36s
Model(s): gpt-oss-120b, openai/gpt-5.1-codex-max LLM Calls: 830 Prompt Tokens: 0.98M Completion Tokens: 0.10M Architecture: AtomicAgents + $gpt-oss-120b
19 @neuraldeep sgr_agent_core_qwen/qwen3-235b-a22b-2507 0.466 1.95 2025-12-16 03:05 3m 33s
Model(s): qwen3-235b-a22b-2507 LLM Calls: 1,675 Prompt Tokens: 2.85M Completion Tokens: 190.95k Architecture: SGR Tool Calling Agent with Security Checks - OpenAI Function Calling
20 AECFoundry - Claudius Maximus 0.455 8.86 2025-12-09 11:37 46s
Model(s): anthropic/claude-sonnet-4.5 LLM Calls: 73 Prompt Tokens: 1.67M Completion Tokens: 70.34k Architecture:
21 EPAMER GAME-CHANGER AGENTIC 0.447 15.30 2025-12-09 13:07 4m 18s
Model(s): openai/gpt-4.1 LLM Calls: 510 Prompt Tokens: 0.38M Completion Tokens: 123.36k Architecture: AvaTar arch intellect-3
22 Codegen Agent gpt-5.1 by Armen Epremian 0.447 2.24 2025-12-09 14:46 13s
Model(s): gpt-5.1 LLM Calls: 119 Prompt Tokens: 890.01k Completion Tokens: 125.74k Architecture: Codegen SGR Agent with Google GenAI
23 @mishka ERC3-Test Agent (Parallel x20) 0.437 0.72 2025-12-09 12:07 53s
Model(s): qwen/qwen3-235b-a22b-2507 LLM Calls: 796 Prompt Tokens: 0.85M Completion Tokens: 269.28k Architecture: SGR Agent Parallel (OpenRouter qwen/qwen3-235b-a22b-2507)
24 HAIKU 0.427 2.98 2025-12-09 11:10 41s
Model(s): anthropic/claude-haiku-4.5 LLM Calls: 75 Prompt Tokens: 1.65M Completion Tokens: 76.47k Architecture:
25 SGR Bro (gpt-4.1) 0.417 10.32 2025-12-09 11:32 34s
Model(s): openai/gpt-4.1 LLM Calls: 344 Prompt Tokens: 0.17M Completion Tokens: 44.22k Architecture: Simple NextStep SGR with structured distillation
26 NextStep SGR (gpt-4.1) from ERC3 Samples + full text search for pick rules + additional PreflightCheck 0.408 15.25 2025-12-09 11:28 2m 3s
Model(s): gpt-4.1, gpt-5.1 LLM Calls: 571 Prompt Tokens: 0.42M Completion Tokens: 168.89k Architecture: NextStep SGR Agent with OpenAI
27 NextStep SGR (qwen3-max) with integrated tools 0.398 2.98 2025-12-09 11:30 40s
Model(s): gpt-5.1, qwen3-max LLM Calls: 396 Prompt Tokens: 0.28M Completion Tokens: 51.51k Architecture: NextStep SGR Agent with integrated tools from tools.py
28 Simple SGR Agent (gpt-4.1) by tokyo_s 0.398 11.25 2025-12-09 11:58 1m 15s
Model(s): openai/gpt-4.1 LLM Calls: 375 Prompt Tokens: 0.18M Completion Tokens: 55.92k Architecture: NextStep SGR Agent with OpenAI and coding tools
29 Boring Agent 0.398 3.17 2025-12-09 12:40 2m 56s
Model(s): gpt-5-mini LLM Calls: 1,484 Prompt Tokens: 1.01M Completion Tokens: 0.10M Architecture: Plan/Act - OpenAI
30 SGR Agent @yangaev1 0.398 3.35 2025-12-12 08:51 31s
Model(s): google/gemini-2.5-flash, google/gemini-2.5-flash-preview-09-2025, openai/gpt-5.2 LLM Calls: 348 Prompt Tokens: 0.18M Completion Tokens: 180.42k Architecture: SGR: Classifier->Executor->Supervisor
31 @alexchaison DPCED-agent 0.387 11.78 2025-12-09 10:40 1m 53s
Model(s): openai/gpt-4o, openai/o3 LLM Calls: 572 Prompt Tokens: 0.30M Completion Tokens: 243.31k Architecture: Discovery-Planner-Executor-Decider Pipeline
32 NextStep SGR (gpt-4.1-mini) by @figaroserg1 0.379 10.58 2025-12-09 10:44 30s
Model(s): gpt-4.1-mini LLM Calls: 423 Prompt Tokens: 0.18M Completion Tokens: 144.73k Architecture: NextStep SGR Agent with OpenAI and Grok
33 ERCPlanReActAgent, Model=gemini-2.5-pro 0.379 21.46 2025-12-09 11:40 3m 7s
Model(s): gemini-2.5-pro LLM Calls: 1,631 Prompt Tokens: 1.35M Completion Tokens: 492.97k Architecture: ERCPlanReActAgent, Model=gemini-2.5-pro
34 ERC3 Agent Mercury Multi-Agent Distilled SGR (gpt-4.1) 0.379 20.07 2025-12-09 11:58 1m 6s
Model(s): gpt-4.1 LLM Calls: 669 Prompt Tokens: 0.20M Completion Tokens: 175.15k Architecture: Distilled Multi-Agent System combining pre-cached wiki rule distillation with multi-agent coordination (Orchestrator + specialized Workers)
35 AGES Agent v2 Parallel 0.359 3.61 2025-12-09 10:35 26s
Model(s): openai/gpt-4o LLM Calls: 103 Prompt Tokens: 0.51M Completion Tokens: 130.04k Architecture: AGES SGR Agent with gpt-4o (parallel)
36 TZaKUS (pro) 0.340 0.71 2025-12-09 15:37 26s
Model(s): google/gemini-2.5-pro LLM Calls: 207 Prompt Tokens: 334.91k Completion Tokens: 28.76k Architecture: NextStep SGR Agent with Gemini ADK
37 ERC3 Agent - LLM-Driven (openai/gpt-4.1) 0.339 21.15 2025-12-09 11:33 1m 0s
Model(s): openai/gpt-4.1 LLM Calls: 705 Prompt Tokens: 0.39M Completion Tokens: 226.54k Architecture: LLM-driven with confidence loop, no hardcoded rules
38 IS-103 SGR Multiagent System 0.311 1.14 2025-12-09 11:36 19s
Model(s): google/gemini-2.5-flash LLM Calls: 756 Prompt Tokens: 0.31M Completion Tokens: 209.92k Architecture: Router -> Searcher -> Executor
39 ERC3 Agent v3.1 SGR (@vkovalskii sgr dev team) (gpt-4o) 0.242 3.57 2025-12-09 11:15 18s
Model(s): gpt-4o LLM Calls: 102 Prompt Tokens: 593.03k Completion Tokens: 5.55k Architecture: ERC3 Agent v3 with SGR framework integration + memory compression
40 @skifmax OODA Agent (qwen/qwen3-235b-a22b-2507) 0.223 0.10 2025-12-09 10:14 11s
Model(s): qwen/qwen3-235b-a22b-2507 LLM Calls: 553 Prompt Tokens: 725.54k Completion Tokens: 112.01k Architecture: LangGraph OODA Agent (ERC3)
41 Graph Agent 0.204 2.40 2025-12-09 11:17 29s
Model(s): openai/gpt-4.1, openai/gpt-5.1 LLM Calls: 150 Prompt Tokens: 594.23k Completion Tokens: 113.00k Architecture: Graph Agent with OpenAI
42 SGR Agent (gpt-4o) 0.184 11.52 2025-12-09 10:47 11s
Model(s): gpt-4o LLM Calls: 329 Prompt Tokens: 286.94k Completion Tokens: 32.38k Architecture: SGR-LangGraph
43 NextStep SGR (qwen/qwen3-32b:nitro) from ERC3 Samples +pipelined 0.184 0.26 2025-12-15 21:40 13s
Model(s): gpt-5.1, qwen/qwen3-32b LLM Calls: 428 Prompt Tokens: 0.25M Completion Tokens: 103.84k Architecture: NextStep SGR Agent with OpenAI

Total submissions: 43 • Filter: compete_accuracy flag

Budget Leaderboard

Rank | Session Name | Score | Cost | Fault | Submitted | Per task
1 I.R. 0.621 0.56 2025-12-09 13:11 5m 43s
Model(s): openai/gpt-oss-120b LLM Calls: 864 Prompt Tokens: 1.16M Completion Tokens: 564.27k Architecture: Multiagent oss-120b LLM: gpt-oss-120b Used exclusively via the Cerebras provider for speed (up to ~3k tokens/s). The architecture was based on a modified SGR NextStep with a tricky context-management logic: it fed the whole plan field from the last turn, not just the first step of the plan. All turns except the immediately previous one were kept in the LLM context in a compressed form. Each turn of the main NextStep flow was checked by a StepValidator. If everything was OK, the tool was executed and the flow continued as usual (the validator's work was not reflected in the context at all). Otherwise, the last NextStep message was sent for rework with the validator's comments. System instructions were extracted from wiki files by an LLM during the ingestion phase. The system prompt was loaded dynamically depending on whoami (public vs authenticated). The system prompt contained minimal information about /respond formatting. Detailed instructions for /respond were loaded by calling a pseudo-tool. The /whoami call was triggered automatically at the start of a task. A dynamic user context enrichment feature was used. Before the main agent started, the system code automatically pulled the user's full profile, projects, clients, and time entries by user ID. A separate LLM pass then filtered this data, and only the task-relevant subset was fed into the main LLM flow. Tool wrappers: - Pagination was effectively removed from all tools. A separate auto-pagination function would paginate through all pages and return the full list. - Req_LogTimeEntry was rebuilt because it was the only tool in the SDK that was constructed with a different field order, where the tool field was not first, which confused the model. - Also, as mentioned above, an extra Req_LoadRespondInstructions pseudo-tool was added to load the detailed /respond instructions. All tools were invoked via Structured Output instead of native tool calling. Issues: I set the turn limit for the main NextStep flow too low, so 5 of 103 tasks were simply not completed. There was not enough time left before the competition ended to rerun with a higher limit. Running all 103 tasks took about 1,430 LLM requests, $6.8, 15 minutes (with parallel task execution), 17.7M input-context tokens, and 838K output-context tokens. The main contributor to output tokens was reasoning. LLM: gpt-oss-120b via Cerebras Core agent: modified SGR NextStep with Steps validation and custom context strategy System prompts: routed based on /whoami User context: enriched by auto-loading from API with subsequent LLM filtering Tools: auto-pagination wrapper
2 CC ERC3 Agent (TinyFish) @colriot 0.573 1.66 1 2025-12-09 22:26 1m 45s
Model(s): gpt-5.1 LLM Calls: 301 Prompt Tokens: 0.11M Completion Tokens: 29.78k Architecture: CC SDK with MCP Tools Claude Code SDK based agent with preflight validation, with dedicated post validation and recovery before submitting the result based on rules from wiki. - Improved tools schemas, I don't use SGR, but usual LLM function calling - For validation request I keep only rules, list of api tools called and the task. - For pre and post validation calls SGR is used Faults: missing_model 'none'
3 @mishka ERC3-Test Agent (Parallel x20) 0.563 0.31 2025-12-15 22:36 33s
Model(s): qwen/qwen3-235b-a22b-2507 LLM Calls: 597 Prompt Tokens: 0.34M Completion Tokens: 156.71k Architecture: SGR Agent Parallel (openrouter qwen/qwen3-235b-a22b-2507)
4 NextStep SGR (google/gemini-2.5-flash) from ERC3 Samples +pipelined 0.505 2.80 2025-12-09 10:59 27s
Model(s): google/gemini-2.5-flash LLM Calls: 740 Prompt Tokens: 0.72M Completion Tokens: 476.38k Architecture: NextStep SGR Agent
5 erdzhemadinov (openai/gpt-5.2) 0.494 3.88 2025-12-16 01:53 7m 59s
Model(s): openai/gpt-5.2 LLM Calls: 458 Prompt Tokens: 0.32M Completion Tokens: 163.71k Architecture: A NextStep SGR agent: the LLM produces a single schema-validated JSON step (state + brief plan + one typed tool call), then executes it and feeds the tool output back in a plan→act→observe→repair loop with retries. Tech stack: SGR (Schema-Guided Reasoning), Pydantic schemas, typed tool routing over the ERC3 API, and OpenAI as the planner/decider, plus preflight/policy guards.
6 @neuraldeep sgr_agent_core_qwen/qwen3-235b-a22b-2507 0.466 1.95 2025-12-16 03:05 3m 33s
Model(s): qwen3-235b-a22b-2507 LLM Calls: 1,675 Prompt Tokens: 2.85M Completion Tokens: 190.95k Architecture: SGR Tool Calling Agent with Security Checks - OpenAI Function Calling
7 AECFoundry - Claudius Maximus 0.455 8.86 2025-12-09 11:37 46s
Model(s): anthropic/claude-sonnet-4.5 LLM Calls: 73 Prompt Tokens: 1.67M Completion Tokens: 70.34k Architecture:
8 HAIKU 0.427 2.98 2025-12-09 11:10 41s
Model(s): anthropic/claude-haiku-4.5 LLM Calls: 75 Prompt Tokens: 1.65M Completion Tokens: 76.47k Architecture:
9 Langchain Tool Agent Qwen/Qwen3-4B-Thinking-2507 0.311 0.03 2025-12-09 12:33 1m 34s
Model(s): qwen3-4b-thinking-2507 LLM Calls: 241 Prompt Tokens: 798.04k Completion Tokens: 465.34k Architecture: Langchain Tool Call Agent w/ Qwen/Qwen3-4B-Thinking-2507 IT Development Team d.o.o., Slovenia. There were two models: - Langchain Tool Agent Qwen/Qwen3-4B-Thinking-2507 - Langchain Tool Agent openai/gpt-4.1 The solution is built around a ReAct-style agent implemented with LangChain. The base model is Qwen3-4B with reasoning capabilities. In addition, a lightweight safety layer (SO) is integrated, following the same general pattern as in the provided example. The tool's layer was adapted specifically for LangChain. The tool schemas were simplified by removing unused fields, while input and output parameter descriptions were expanded and clarified. Tool descriptions were also refined to improve the model’s accuracy in tool selection. Token usage was optimized using toon, and several custom tools were added, including a tool for retrieving the user’s current projects. For context handling, a rule distillation approach was applied. The original Wiki and instruction set were passed through the model to parse, compress, and restructure the rules into a compact system context. The conversation history itself was not compressed, as experiments showed that history compression degrades quality significantly, while Qwen3 handled long contexts reliably without noticeable performance loss. One notable observation is that, compared to GPT-5, the smaller reasoning-focused Qwen3 model did not exhibit tool-calling loops and showed more stable completion of reasoning chains.
10 Master SGR by @DenisKurov (qwen/qwen3-30b-a3b-instruct-2507) 0.252 1.39 2025-12-15 13:12 1m 20s
Model(s): qwen/qwen3-30b-a3b-instruct-2507 LLM Calls: 2,193 Prompt Tokens: 2.03M Completion Tokens: 299.95k Architecture: NextStep SGR Agent with profiles
11 ERC3 Agent v3.1 SGR (@vkovalskii sgr dev team) (gpt-4o) 0.242 3.57 2025-12-09 11:15 18s
Model(s): gpt-4o LLM Calls: 102 Prompt Tokens: 593.03k Completion Tokens: 5.55k Architecture: ERC3 Agent v3 with SGR framework integration + memory compression
12 NextStep SGR (qwen/qwen3-32b:nitro) from ERC3 Samples +pipelined 0.184 0.26 2025-12-15 21:40 13s
Model(s): gpt-5.1, qwen/qwen3-32b LLM Calls: 428 Prompt Tokens: 0.25M Completion Tokens: 103.84k Architecture: NextStep SGR Agent with OpenAI
13 NextStep SGR (qwen3-max) with integrated tools 0.175 3.22 2025-12-16 07:58 21s
Model(s): qwen3-max LLM Calls: 153 Prompt Tokens: 0.34M Completion Tokens: 18.73k Architecture: NextStep SGR Agent with integrated tools from tools.py
14 @alexchaison DPCED-agent 0.010 0.07 2025-12-16 08:00 3s
Model(s): openai/o3, x-ai/grok-4-fast LLM Calls: 16 Prompt Tokens: 237.29k Completion Tokens: 10.95k Architecture: Discovery-Planner-Executor-Decider Pipeline

Total submissions: 14 • Filter: compete_budget flag

Stay up to date with the TIMETOACT GROUP newsletter!