We see the potential of AI in regulatory workflows. We also see the practical gap between promising answers and audit-ready reliability. We've been measuring it.
Even the best-performing search-enabled systems give factually incorrect answers or cite the wrong references roughly one out of three times when tested on the EU AI Act.
Generic agents are not enough for regulated work. Purpose-built systems are necessary. So are rigorous evaluations that ensure those systems are fit for purpose, and remain so over time.
The EU AI Act Q&A Challenge serves as the next step: advancing the evaluation of purpose-built regulatory AI systems.
For the broader regulatory context and how this translates into implementation work, see AI in Regulated Life Science and our AI Governance & Compliance service.
To reflect how regulatory AI is used in practice, where more than answer correctness matters, the competition is multi-dimensional. Every submission is scored against question-specific ground truth across five dimensions. The evaluation is then repeated in a simulated multi-turn conversation, so that single-turn and multi-turn behaviour can be compared.
Answer Correctness
Tested against question-specific ground-truth correctness criteria. Variants: strict and loose.
Reference Accuracy
Proposed references checked against expected ones. Variants: strict and loose.
Conciseness
Answer and reference-set lengths assessed against benchmark exemplars.
Tone
Assessment of clarity and appropriateness of the language for regulatory contexts.
Latency
Time from prompt submission to response measured per question.
Illustrative visualization of results for one multi-turn configuration (e.g. on/off). Contestant names are invented.
Participants receive an individual benchmark report and may be included in public summary materials after the opt-out period.
A best-effort, automated assessment grounded in question-specific correctness checks and reference verification.
A dedicated report for your system, with scoring across every benchmark dimension and comparisons against off-the-shelf reference methods.
Your system is featured in regenold's downstream publications: articles, web content, and social posts after the opt-out window.
Opt-out option. Not happy with your individual report? You can opt out in writing within 10 working days of the results being communicated. Your entry is then anonymized in our use of the results.
Participation is free and open to anyone willing to test their AI system on the EU AI Act.
Send your endpoint details and participant information to our technical contact. We confirm onboarding and reserve your benchmark slot.
We send conversation histories to your API and collect the JSON responses. Latency and multi-turn behaviour are measured automatically.
Your individual report shows performance across all dimensions, alongside reference benchmarks from popular off-the-shelf methods.
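The evaluation step above can be pictured with a small sketch. It is only an illustration under stated assumptions, not the harness used in the competition: `run_conversation`, `echo_system`, and the `latency_s` field are hypothetical names, the system under test is represented by a plain Python callable instead of a real HTTP endpoint, and feeding the system's own earlier answers back as assistant turns is an assumption about how the multi-turn history is built.

```python
import time
from typing import Callable

Message = dict[str, str]


def run_conversation(
    ask: Callable[[list[Message]], dict],
    questions: list[str],
) -> list[dict]:
    """Send a growing conversation history to `ask` and time each turn.

    `ask` stands in for a call to a contestant API (hypothetical here).
    """
    history: list[Message] = []
    results = []
    for question in questions:
        history.append({"role": "user", "content": question})
        start = time.perf_counter()
        response = ask(list(history))  # pass a copy of the history so far
        latency = time.perf_counter() - start
        # Assumption: the system's own answer becomes the next assistant turn.
        history.append({"role": "assistant", "content": response["answer"]})
        results.append(
            {"question": question, "response": response, "latency_s": latency}
        )
    return results


def echo_system(history: list[Message]) -> dict:
    """Stand-in system that only reports how many messages it received."""
    return {
        "reasoning": "",
        "answer": f"Seen {len(history)} message(s).",
        "references": [],
    }


results = run_conversation(echo_system, ["First question", "Follow-up question"])
```

In a real run the callable would wrap an HTTP request to the contestant endpoint, so the measured time would include network overhead as well as model latency.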
Your system needs to expose an API that accepts a conversation history and returns a single JSON response. The format follows the OpenAI/LiteLLM message convention.
# JSON with OpenAI/LiteLLM message format
[
  {"role": "user", "content": "..."},
  {"role": "assistant", "content": "..."},
  ...,
  {"role": "user", "content": "..."}
]

# JSON with three fields
{
  "reasoning": "Optional. Not scored.",
  "answer": "Short, professional answer.",
  "references": [
    "Annex IV.2",
    "Article 3.1"
  ]
}
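Before submitting an endpoint, it can help to check your own responses against the three-field shape shown above. The following is a minimal sketch, not an official validator: `validate_response` is a hypothetical helper, and the type rules it enforces (a non-empty string for `answer`, a list of strings for `references`, and `reasoning` being omittable) are assumptions inferred from the example, not confirmed requirements.

```python
import json


def validate_response(raw: str) -> dict:
    """Parse a response body and check it against the three-field shape."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")

    answer = data.get("answer")
    if not isinstance(answer, str) or not answer.strip():
        raise ValueError("'answer' must be a non-empty string")

    references = data.get("references")
    if not isinstance(references, list) or not all(
        isinstance(ref, str) for ref in references
    ):
        raise ValueError("'references' must be a list of strings")

    # Assumption: 'reasoning' may be left out because it is optional and unscored.
    reasoning = data.get("reasoning", "")
    if not isinstance(reasoning, str):
        raise ValueError("'reasoning', if present, must be a string")

    return {"reasoning": reasoning, "answer": answer, "references": references}


example = json.dumps(
    {
        "reasoning": "Optional. Not scored.",
        "answer": "Short, professional answer.",
        "references": ["Annex IV.2", "Article 3.1"],
    }
)
checked = validate_response(example)
```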
Anyone willing to test their AI system on EU AI Act questions. Participation is free of charge. Each contestant can submit one entry per edition.
No. Any conversational AI system that exposes an API and returns the required JSON format can participate. The benchmark itself is purpose-built for the EU AI Act, but contestants range from off-the-shelf systems to fully purpose-built regulatory tools.
After the 10-working-day opt-out window, we publish the results of this edition in articles, web content, and social posts. The published material includes the performance of all contestants (anonymized where opt-out applies). We also intend to use the results as a seed for a future live benchmark with rolling submissions.
Strict scoring requires all ground-truth elements to be present and correct. Loose scoring accepts answers and references that are partially correct or that include the right elements with some additional content. Reporting both variants gives a clearer picture of where a system holds up under tighter assessment criteria.
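As a rough illustration of the difference for references, the sketch below applies one possible reading of the two variants. It is not the scoring code used in the competition: `score_references` and `normalize` are hypothetical names, exact string matching after whitespace and case normalization is an assumption, and treating extra references as a strict failure is inferred from the description of loose scoring.

```python
def normalize(ref: str) -> str:
    """Collapse whitespace and ignore case so 'article 3.1' matches 'Article 3.1'."""
    return " ".join(ref.split()).casefold()


def score_references(proposed: list[str], expected: list[str]) -> dict[str, bool]:
    """Return strict and loose verdicts for one question (illustrative only)."""
    proposed_set = {normalize(ref) for ref in proposed}
    expected_set = {normalize(ref) for ref in expected}
    # Strict (assumed): every expected reference is present and nothing else.
    strict = bool(expected_set) and proposed_set == expected_set
    # Loose (assumed): at least one expected reference is present; extras allowed.
    loose = bool(proposed_set & expected_set)
    return {"strict": strict, "loose": loose}


expected = ["Annex IV.2", "Article 3.1"]
exact = score_references(["Article 3.1", "Annex IV.2"], expected)
with_extra = score_references(["Article 3.1", "Annex IV.2", "Article 6"], expected)
partial = score_references(["Article 3.1"], expected)
wrong = score_references(["Article 99"], expected)
```

Under these assumptions, only the exact match passes both variants, the partial and over-long reference sets pass loose scoring only, and the unrelated reference fails both.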
All questions pertain to Regulation (EU) 2024/1689 of 13 June 2024. Subsequent amendments and delegated acts are out of scope for this edition.
Yes, that's the plan. The competition results will seed a live benchmark with rolling submissions, similar to how state-of-the-art LLM and agent benchmarks operate. Contestants will be able to submit updated systems at any time, and we'd welcome ongoing participation.
The EU AI Act is the first benchmark we're publishing. The same approach (ground-truth questions, multi-dimensional scoring, independent evaluation) applies to any regulatory or quality framework that AI systems are expected to handle reliably.
We're considering future editions on the Clinical Trials Regulation (EU) No 536/2014, MDR, IVDR, ICH guidelines, GMP, GVP, and others. If there's a framework where you'd want to see your AI system independently benchmarked, tell us. Suggestions shape what we run next.
For companies deploying AI in regulated workflows, the relevant benchmark often isn't a public regulation. It's your internal standards, your SOPs, your validation criteria. We design custom evaluations that test AI systems against the specific requirements that matter to your organisation. This sits within our broader AI Governance & Compliance service, alongside risk classification, vendor qualification, validation, and inspection readiness.
The competition runs through May and June 2026. We walk you through onboarding and run the evaluation.