8th place in Enterprise RAG Challenge 2025:

Answering Business Questions with LLMs

Key Insights
  • PDF extraction is tough: Annual reports contain vital data, but extracting structured information automatically from PDFs is complex due to their human-centric formatting.
  • Retrieval quality is crucial: AI-generated answers strongly depend on correctly finding relevant pages within large documents.

  • Approach comparison:

    • Naive approach (Gemini 2.0 Flash): Quick and easy but limited with large files or multiple documents.

    • Dense retrieval (Vector Databases): Highly accurate, scalable, but requires extensive preprocessing.

    • Multi-agent systems: Improved accuracy by combining multiple AI models, adding complexity and cost.

  • Best-performing solution: Combining Gemini for retrieval and OpenAI for answer generation delivered the best balance of accuracy and ease of use, ranking 8th on the final leaderboard.

3/11/25

Final leaderboard with TAT employees' submissions marked. Hours: time it took the team to produce the results; R: Retrieval Score (max 100); G: Generation Score (max 100); Score: final score (R/3 + G, max 133); AI: team leveraged Rinat Abdullin's AI Research (through communities or TimeToAct); Lcl: a local model was used.

The Enterprise RAG Challenge, held on 27.02.2025, required answering company-specific questions based on annual reports using large language models (LLMs) and sophisticated retrieval-augmented generation (RAG) techniques.

Participants must extract precise information from extensive PDF documents, which often span 60 to 100 pages, contain complex formatting, and include tables and illustrations (as you can see below). Parsing data from PDFs is a challenge in itself. These files are optimized for visual presentation rather than structured data extraction, making it difficult to retrieve relevant text and numerical information accurately.

Example of a challenging PDF page to parse (Playtech plc Annual Report and Financial Statements 2022, p. 62)

Here are some example questions from the challenge to be answered with such PDFs:

A successful submission must not only provide an accurate answer but also reference the page numbers containing the relevant information. This enables two evaluation criteria (as can be seen in the leaderboard above):

  1. Generation Score (G): Measures the correctness of the extracted answer.
  2. Retrieval Score (R): Assesses whether the retrieval mechanism identified the correct locations in the document.

The generation score is inherently constrained by the retrieval score - if the wrong pages are extracted, even the best model cannot produce the correct answer. A higher retrieval score increases the likelihood of accurate answers.

A properly formatted submission looks like this:
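As a rough illustration only (the field names here are assumptions, not the official challenge schema), each answer pairs the extracted value with the page references that support it:

```python
# Hypothetical submission entry; field names are illustrative, not the official ERC schema.
example_answer = {
    "question": "What was the total revenue of Example Corp in 2022?",
    "value": 1234.5,                 # the extracted answer
    "references": [                  # pages containing the supporting information
        {"pdf": "example_corp_annual_report_2022.pdf", "page": 87},
    ],
}
```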

To generate high-quality answers with reliable references, I built several LLM-powered approaches. You can find my implementations and experiments in the source code repository.

Naive Gemini 2.0 Flash Approach

📌 Ranking: 57th Place (across all submissions)
📌 Code: GitHub Repository
📌 Keywords: Gemini 2.0 Flash, chain of thought, structured outputs

Overview:
For the first, naive approach, I took inspiration from a blog post by Philipp Schmid. The key idea was to leverage the one-million-token context window of Gemini 2.0 Flash, which is free within certain usage limits (details).

Google's genAI package enables direct PDF file uploads into the model's context. This meant that each query was passed to Gemini along with a custom system prompt and the full PDF file, allowing the model to process the document in a single step.

Applied LLM Techniques:
To ensure machine-readable responses, I enforced structured outputs (SO) with the following fields (a minimal sketch follows the list):

  • Chain of thought (CoT) reasoning: Encourages the model to explicitly think through the steps before generating an answer. This also helps with debugging and refining prompts.

  • Final answer: The extracted response.

  • Reference list: Page numbers where relevant information was found.
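A minimal sketch of this setup, assuming the google-genai Python SDK and Pydantic for the response schema; the prompt wording, field names, and file names are illustrative, not the exact competition code:

```python
from pydantic import BaseModel
from google import genai


class Answer(BaseModel):
    chain_of_thought: str    # explicit reasoning steps before the answer
    final_answer: str        # the extracted response
    references: list[int]    # page numbers containing the relevant information


client = genai.Client(api_key="YOUR_API_KEY")

# Upload the annual report; Gemini 2.0 Flash can hold the whole PDF in its 1M-token context.
report = client.files.upload(file="annual_report.pdf")

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=[report, "What was the total revenue in 2022? Answer using only the report."],
    config={
        "response_mime_type": "application/json",
        "response_schema": Answer,   # enforce structured output
    },
)
answer = Answer.model_validate_json(response.text)
```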

Challenges

One major limitation was API rate limits and file size constraints. Some requests, especially those involving multiple companies, failed because they required loading five separate PDFs into context - exceeding the model’s capacity.

As Philipp Schmid outlined, optimizing PDFs before upload or splitting large files could have helped mitigate these issues. However, due to time constraints, I was unable to iterate systematically on a test set or analyze CoT outputs for fine-tuned prompt engineering. This limited the ability to refine the model’s reasoning and response accuracy.

Pros & Cons

✅ Pros

  • Easy setup – No pre-processing required.

  • Cost-effective – Free within usage limits.

  • Fast deployment – Minimal development effort.

❌ Cons

- Rate limits & file size issues – Needs optimization for large or multi-document queries.

- Lack of iterative refinement – No systematic debugging or CoT analysis.

Multi-agent Approaches

While the naive approach provided an easy way to process PDFs, it struggled with multi-company queries and API rate limits. To improve retrieval accuracy and scalability, I experimented with more advanced multi-agent systems.

The process follows a structured pipeline:

  1. Routing: An OpenAI GPT-4o-based router agent first identifies the relevant companies in a given query and formulates extended subqueries for each (a routing sketch follows the list).

  2. Company-Specific Retrieval: Specialized agents fetch relevant information for each company.

  3. Merging: A final merger agent compiles all responses into a final structured answer.
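A minimal sketch of the routing step, assuming the OpenAI Python SDK's structured-output parsing; the schema and prompt are illustrative, not the exact implementation:

```python
from pydantic import BaseModel
from openai import OpenAI


class CompanySubquery(BaseModel):
    company: str     # company the subquery targets
    subquery: str    # extended, self-contained question for that company


class RoutingResult(BaseModel):
    subqueries: list[CompanySubquery]


client = OpenAI()


def route(question: str) -> RoutingResult:
    """Identify the companies in the question and build one extended subquery per company."""
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Split the question into one extended, self-contained subquery per company mentioned."},
            {"role": "user", "content": question},
        ],
        response_format=RoutingResult,
    )
    return completion.choices[0].message.parsed
```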

Dense Retrieval

Keywords: Custom chunking, Qdrant vector database, query extension, chain of thought, structured outputs

For this approach, each company-specialized agent used the Qdrant vector database for retrieval. PDFs were first converted to Markdown format using docling, which offers advanced PDF understanding and parsing, especially for tables.

Custom Chunking Strategy

The extracted markdown content was split into custom chunks (a simplified sketch follows the list):

  • Chapter-based segmentation to retain context.
  • Paragraph-based splitting, only when necessary.
  • Tables preserved as whole units with metadata (the paragraphs before the table).
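A simplified sketch of this chunking idea; heading levels, the size limit, and the table handling are assumptions, not the exact implementation:

```python
import re


def chunk_markdown(markdown: str, max_chars: int = 4000) -> list[str]:
    """Chapter-based segmentation, with paragraph-level splitting only when a chapter is too long.
    Markdown tables (blocks starting with '|') are never used as a split point, so each table
    stays in one chunk together with the paragraphs that precede it."""
    chapters = re.split(r"\n(?=#{1,3} )", markdown)   # split at markdown headings, keeping them
    chunks: list[str] = []
    for chapter in chapters:
        if len(chapter) <= max_chars:
            chunks.append(chapter)
            continue
        current = ""
        for block in chapter.split("\n\n"):
            is_table = block.lstrip().startswith("|")
            if current and not is_table and len(current) + len(block) > max_chars:
                chunks.append(current.strip())
                current = ""
            current += block + "\n\n"
        if current.strip():
            chunks.append(current.strip())
    return chunks
```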

The chunks were embedded using OpenAI's "text-embedding-3-small" model. Based on the company and refined query identified by the router, the vector database retrieved the top five relevant chunks per question.
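A minimal sketch of indexing and company-filtered retrieval with Qdrant and OpenAI embeddings; the collection name, payload fields, and in-memory client are illustrative choices, not the exact setup:

```python
import uuid

from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, FieldCondition, Filter, MatchValue, PointStruct, VectorParams

openai_client = OpenAI()
qdrant = QdrantClient(":memory:")   # local in-memory instance, just for the sketch

qdrant.create_collection(
    collection_name="annual_reports",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),  # text-embedding-3-small dimension
)


def embed(text: str) -> list[float]:
    return openai_client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding


def index_chunks(company: str, chunks: list[str]) -> None:
    points = [
        PointStruct(id=str(uuid.uuid4()), vector=embed(chunk),
                    payload={"company": company, "chunk_id": i, "text": chunk})
        for i, chunk in enumerate(chunks)
    ]
    qdrant.upsert(collection_name="annual_reports", points=points)


def retrieve(company: str, refined_query: str, top_k: int = 5) -> list[str]:
    """Return the top-k chunks of the given company for the router's refined query."""
    hits = qdrant.query_points(
        collection_name="annual_reports",
        query=embed(refined_query),
        query_filter=Filter(must=[FieldCondition(key="company", match=MatchValue(value=company))]),
        limit=top_k,
    ).points
    return [hit.payload["text"] for hit in hits]
```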

Challenges

- One PDF required manual reshaping due to an incompatible page format. 

- Two reports were too large (one had 1,043 pages); a quick fix was to process only the first 150 pages. 

- Parsing all PDFs took ~10 hours. 

- Some extracted chunks contained artefacts, requiring post-processing before embedding.

Future Improvements

A more structured iterative development cycle with a test set could have helped fine-tune the custom chunking strategy and retrieval parameters (e.g., the number of retrieved chunks). A hybrid retrieval approach using keyword search in the PDF could have further boosted results; for example, a human could easily find the value of total assets by searching for “total assets” in the PDF. Re-ranking the retrieved results with another agent could also be beneficial.

✅ Pros

  1. Structured chunking strategy – Custom chunking improves context retention, preserves tables, and offers full control.

  2. No need to load the full PDF into context – Only relevant chunks are retrieved, reducing token usage and avoiding API rate limits, enabling handling of lengthy reports.

❌ Cons

  1. Expensive pre-processing – Requires PDF parsing, markdown conversion, chunking, and embedding, increasing complexity and cost.

  2. Longer setup time – Initial document processing took ~10 hours.

  3. Chunk retrieval tuning required – The number of retrieved chunks, embedding quality, and query formulation need fine-tuning for optimal results.

  4. Not fully automated – Some manual fixes were needed for problematic PDFs (e.g., reshaping incompatible reports).

IBM Generation Approach

📌 Ranking: 94th (across all submissions)
📌 Code: GitHub Repository
📌 Keywords: multi-agent, dense retrieval, OpenAI GPT-4o, IBM

In this approach, the retrieved chunks from the dense retrieval pipeline were passed to IBM's "granite-20b-code-instruct" model, which served as the company-specialized agent.

  • This agent generated company-specific answers only as plain text; it was instructed to output JSON data, since structured outputs were not supported.

  • The output was then sent to a GPT-4o-based merger agent, which reformatted it into structured outputs suitable for final submission (a sketch of this step follows the list).

  • Chain of thought reasoning was applied at both stages.

  • References were not implemented due to lack of time, negatively impacting the retrieval score and hence the ranking of this approach.
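A minimal sketch of that reformatting step, assuming the IBM model's plain-text reply is already available as a string; the granite call itself is omitted, and the schema and prompt are illustrative:

```python
from pydantic import BaseModel
from openai import OpenAI


class FinalAnswer(BaseModel):
    chain_of_thought: str
    final_answer: str


client = OpenAI()


def reformat(plain_text_reply: str) -> FinalAnswer:
    """Turn the granite model's free-form, hopefully JSON-like reply into a validated structured answer."""
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Reformat the following model output into the requested structure."},
            {"role": "user", "content": plain_text_reply},
        ],
        response_format=FinalAnswer,
    )
    return completion.choices[0].message.parsed
```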

✅ Pros

  1. More sophisticated retrieval method.

❌ Cons

  1. Expensive pre-processing due to PDF conversion, chunking, and embedding.

  2. No structured outputs from the model, which makes data handling more challenging.

OpenAI Generation Approach

📌 Ranking: 28th Place (across all submissions)
📌 Code: GitHub Repository
📌 Keywords: multi-agent, dense retrieval, OpenAI GPT-4o

Similar to the IBM-based approach, but instead of IBM’s model, OpenAI’s GPT-4o was used for company-specific responses.

  • Retrieved chunks (along with the extended query) were sent to GPT-4o.

  • Generated responses included chunk IDs, which were then mapped back to PDF page numbers using fuzzy search (a sketch of this mapping follows the list).

  • A merger model then compiled the final structured answer.
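A rough sketch of this page mapping, assuming the rapidfuzz library and a dict of per-page texts; both are assumptions, not necessarily the original implementation:

```python
from rapidfuzz import fuzz


def map_chunk_to_page(chunk_text: str, pages: dict[int, str]) -> int:
    """Return the page number whose text best matches the chunk (fuzzy partial match)."""
    scores = {page_no: fuzz.partial_ratio(chunk_text, page_text) for page_no, page_text in pages.items()}
    return max(scores, key=scores.get)
```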

✅ Pros

More sophisticated retrieval method, offering higher accuracy.

❌ Cons

Expensive pre-processing due to PDF conversion, chunking, and embedding.

Gemini-based Retrieval + OpenAI-based Generation

📌 Ranking: 19th Place (across all submissions), 8th Place across teams
📌 Code: GitHub Repository
📌 Keywords: multi-agent, Gemini 2.0 Flash, OpenAI GPT-4o

Instead of relying on vector-based retrieval, this method directly queried Gemini 2.0 Flash with the full annual report of the relevant company in context.

Key Difference from the Naive Gemini Approach:

  • The router agent extended queries before passing them to Gemini.

  • Multi-company queries were split into separate subqueries, avoiding rate limits (a combined sketch follows below).
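Putting the pieces together, a minimal sketch of this hybrid pipeline: answer each router-generated subquery with Gemini holding the full report in context, then merge the partial answers with GPT-4o. Function names and prompts are illustrative, not the exact implementation:

```python
from google import genai
from openai import OpenAI

gemini = genai.Client(api_key="YOUR_API_KEY")
openai_client = OpenAI()


def answer_for_company(pdf_path: str, subquery: str) -> str:
    """Answer one extended subquery with the company's full annual report in Gemini's context."""
    report = gemini.files.upload(file=pdf_path)
    response = gemini.models.generate_content(
        model="gemini-2.0-flash",
        contents=[report, subquery],
    )
    return response.text


def merge(question: str, partial_answers: list[str]) -> str:
    """Compile the per-company answers into one final answer with a GPT-4o merger agent."""
    completion = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Merge the per-company answers into one final answer to the question."},
            {"role": "user", "content": question + "\n\n" + "\n\n".join(partial_answers)},
        ],
    )
    return completion.choices[0].message.content
```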

Challenges

  1. Rate limits still occurred for "Poste Italiane" due to its 1,000+ page report.

  2. A quick fix could have been to optimize or truncate the PDF, but this was not implemented due to time constraints.

✅ Pros

  1. No expensive pre-processing (faster & cheaper).

  2. Achieved the highest scores among tested methods.

❌ Cons

  • API rate limits still need to be handled for extremely large PDFs.

Final Overview & Thoughts

*) Rank on the leaderboard with all submissions; placed 8th on the final leaderboard, which considers only the best submission of each team.

Key Takeaways

  • For simple use cases, Gemini 2.0 Flash is quick and easy but struggles with large reports and multi-company queries.

  • Dense retrieval provides a scalable and modular solution, reducing API costs by not requiring the full PDF in context and ultimately leading to the highest retrieval score among the approaches.

  • Multi-agent systems (especially OpenAI-based) achieve higher accuracy but require more processing time and cost.

  • Gemini + OpenAI merging performed best in rankings but still requires rate limit handling for massive PDFs.

  • Better final answer generation could have boosted scores. While the retrieval mechanism performed well (as indicated by the high retrieval score), iterative prompt engineering against a test set could have further improved the accuracy of generated answers.

The multi-agent approaches significantly improved retrieval accuracy and final answer quality compared to simpler single-agent methods. However, they also introduced higher costs and complexity due to their multi-step nature. The dense retrieval approach, while structured and scalable, further amplified these issues with its pre-processing and embedding overhead.

The Gemini-based retrieval with OpenAI generation eliminated many of these challenges by removing the need for dense retrieval, leading to the best scores with minimal setup - making it a highly efficient hybrid solution.

Despite the challenges and room for optimization, the ERC 2025 offered a great opportunity to prototype different solutions, experiment with LLM-powered retrieval, and learn a lot under real-world constraints. Getting my “hands dirty” with RAG techniques, multi-agent collaboration, and retrieval models in such a short timeframe was an amazing experience! 🚀