Research publication
This article presents research benchmark results only. Council 1 Pro is not a medical device and is not intended for clinical use, medical diagnosis, or treatment decisions. All benchmark results are from controlled research evaluations using synthetic test cases.
Council 1 Pro achieves state-of-the-art results on the MedAsk research benchmark.
We’re sharing research results for Council 1 Pro, our fine-tuned medical language model. In benchmark evaluations on MedAsk, Council 1 Pro achieved the highest scores among the frontier models we tested.
These benchmark results demonstrate advances in medical natural language understanding. While encouraging for research purposes, benchmark performance does not validate clinical utility—real-world applications require extensive additional validation.
What is MedAsk?
MedAsk is a research benchmark built around synthetic medical vignettes. Each vignette represents a hypothetical case with a target answer. A model must:
Engage in multi-turn Q&A to gather information
Produce a ranked list of possible answers
Be scored on Top-1, Top-3, and Top-5 accuracy, as sketched in code below
We evaluated on 400 synthetic vignettes spanning a wide range of medical topics. This benchmark measures language model capabilities, not clinical readiness.
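For concreteness, here is a minimal sketch of Top-k scoring over ranked answer lists. The data layout and the `top_k_accuracy` helper are illustrative assumptions for this post, not the actual MedAsk harness.

```python
# Hypothetical sketch of Top-k scoring; the data layout is an assumption,
# not the actual MedAsk evaluation code.

def top_k_accuracy(predictions, targets, k):
    """Fraction of cases where the target appears in the first k ranked answers."""
    hits = sum(target in ranked[:k] for ranked, target in zip(predictions, targets))
    return hits / len(targets)

# Each model returns a ranked list of candidate answers per vignette.
predictions = [
    ["answer_a", "answer_b", "answer_c"],  # vignette 1
    ["answer_d", "answer_e", "answer_a"],  # vignette 2
]
targets = ["answer_a", "answer_e"]

for k in (1, 3, 5):
    print(f"Top-{k} accuracy: {top_k_accuracy(predictions, targets, k):.0%}")
# Top-1 accuracy: 50%
# Top-3 accuracy: 100%
# Top-5 accuracy: 100%
```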
Benchmark methodology update
To improve benchmark consistency and reproducibility, we updated MedAsk so that GPT-5 powers both:

- Simulated dialogue: standardized responses during the multi-turn evaluation
- Automated grading: consistent scoring criteria applied uniformly across models
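As an illustration only, the loop below shows how a single fixed evaluator model can play both roles. The `ask_llm` helper, the prompts, and the `MAX_TURNS` limit are assumptions for this sketch, not the actual MedAsk implementation.

```python
# Hypothetical sketch of the dual-role evaluation loop. `ask_llm` is a
# stand-in for a call to the fixed evaluator model; wire up a real client
# to run it.

MAX_TURNS = 10

def ask_llm(system_prompt: str, user_message: str) -> str:
    """Placeholder for an API call to the evaluator model."""
    raise NotImplementedError

def evaluate_vignette(vignette: str, target: str, candidate_model) -> bool:
    transcript = []
    # Role 1: the evaluator simulates the dialogue, answering every
    # candidate model's questions from the same vignette text, so all
    # models face consistent test conditions.
    for _ in range(MAX_TURNS):
        question = candidate_model.next_question(transcript)
        if question is None:  # candidate has finished gathering information
            break
        reply = ask_llm(
            system_prompt=f"Answer strictly from this vignette:\n{vignette}",
            user_message=question,
        )
        transcript.append((question, reply))
    ranked = candidate_model.final_ranking(transcript)
    # Role 2: the evaluator grades the ranked list against the target,
    # applying one rubric uniformly across all models.
    verdict = ask_llm(
        system_prompt="Reply YES if the first-ranked answer matches the target, else NO.",
        user_message=f"Target: {target}\nRanking: {ranked}",
    )
    return verdict.strip().upper().startswith("YES")
```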
Why this improves the benchmark
- More standardized evaluation: every model interacts with consistent test conditions.
- Reduced variance: automated grading removes human scoring inconsistencies.
- Better reproducibility: results can be more reliably replicated by other researchers.
Benchmark results
Below are the results on 400 MedAsk vignettes (December 10, 2025), using the updated evaluation methodology.
| Model | Top-1 (of 400) | Top-3 (of 400) | Top-5 (of 400) | Top-1 Accuracy |
|---|---|---|---|---|
| Council 1 Pro | 347 | 382 | 394 | 86.8% |
| GPT-5.1 | 327 | 380 | 396 | 81.8% |
| Claude Sonnet 4.5 | 320 | 383 | 396 | 80.0% |
| Kimi K2 | 319 | 383 | 394 | 79.8% |
| DeepSeek V3.1 | 270 | 345 | 371 | 67.5% |
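For reference, the accuracy column can be reproduced directly from the raw counts (the table rounds to one decimal place):

```python
# Reproducing the Top-1 accuracy column from the raw counts (n = 400).
results = {
    "Council 1 Pro":     347,
    "GPT-5.1":           327,
    "Claude Sonnet 4.5": 320,
    "Kimi K2":           319,
    "DeepSeek V3.1":     270,
}
for model, correct in results.items():
    print(f"{model}: {correct}/400 = {correct / 400 * 100:.2f}%")
# Council 1 Pro: 347/400 = 86.75%  (86.8% in the table, rounded)
```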
[Figure: Top-1 accuracy comparison]
Key finding
Council 1 Pro achieves 86.8% Top-1 accuracy (347/400 correct first-choice answers), the highest Top-1 score among the models we tested.
[Figure: Multi-metric performance comparison]
Top-3 and Top-5 results
Four of the five models show strong Top-3 and Top-5 coverage (95% or higher; DeepSeek V3.1 trails at 86.3% and 92.8%), indicating that modern language models can usually place the relevant possibility within a short ranked list. The differentiation primarily occurs in Top-1 accuracy.
Technical approach
Council 1 Pro was developed through iterative fine-tuning focused on medical language understanding and information retrieval quality.
Key technical aspects
Information gathering optimization
The model is tuned to ask relevant follow-up questions that help narrow down possibilities, improving efficiency in multi-turn conversations.
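To make this concrete, here is a toy sketch of one classic formulation of the idea: choosing the follow-up question with the highest expected information gain over the remaining candidate answers. The probability tables and the greedy strategy are illustrative assumptions, not Council 1 Pro's actual training objective.

```python
import math

# Hypothetical sketch: pick the follow-up question expected to shrink the
# candidate-answer distribution the most. Probabilities are made up.

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def expected_info_gain(prior, answer_likelihoods):
    """Expected entropy reduction from asking one question.

    prior: P(candidate) over possible final answers.
    answer_likelihoods: {reply: {candidate: P(reply | candidate)}}.
    """
    gain = entropy(prior)
    for reply, likelihood in answer_likelihoods.items():
        # P(reply) and the posterior P(candidate | reply) via Bayes' rule.
        p_reply = sum(prior[c] * likelihood[c] for c in prior)
        if p_reply == 0:
            continue
        posterior = {c: prior[c] * likelihood[c] / p_reply for c in prior}
        gain -= p_reply * entropy(posterior)
    return gain

prior = {"A": 0.5, "B": 0.3, "C": 0.2}
questions = {
    "q1": {"yes": {"A": 0.9, "B": 0.1, "C": 0.5},
           "no":  {"A": 0.1, "B": 0.9, "C": 0.5}},
    "q2": {"yes": {"A": 0.5, "B": 0.5, "C": 0.5},
           "no":  {"A": 0.5, "B": 0.5, "C": 0.5}},  # uninformative
}
best = max(questions, key=lambda q: expected_info_gain(prior, questions[q]))
print(best)  # q1: it discriminates between candidates; q2 does not
```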
Ranking calibration
Training emphasized producing well-calibrated rankings where the most relevant answer appears first, rather than simply listing many possibilities.
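As a rough illustration of what such an objective can look like, the snippet below implements a listwise softmax cross-entropy (ListNet-style) loss that penalizes rankings where the correct answer does not score highest. This is an assumed stand-in for exposition, not the loss we actually used.

```python
import math

# Hypothetical listwise training signal: cross-entropy between the softmax
# of candidate scores and a one-hot target at the correct answer.

def listwise_loss(scores, correct_index):
    m = max(scores)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    return -math.log(exps[correct_index] / sum(exps))

# A ranking that puts the correct answer first is penalized far less:
print(listwise_loss([3.0, 1.0, 0.5], correct_index=0))  # ~0.20
print(listwise_loss([1.0, 3.0, 0.5], correct_index=0))  # ~2.20
```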
Limitations and important context
Critical limitations
- Benchmark ≠ Clinical Validation: These results are from a research benchmark using synthetic cases, not real patients.
- Not for Medical Use: Council 1 Pro is a research model and is not intended, designed, or validated for clinical diagnosis or treatment decisions.
- Controlled Conditions: Benchmark evaluations don't capture the complexity and variability of real-world medical scenarios.
- No Substitute for Medical Professionals: AI models cannot and should not replace the judgment of qualified healthcare providers.
What these results represent
This benchmark demonstrates progress in medical natural language understanding for research purposes. It represents a step forward in AI capabilities on standardized tests, but significant additional work—including prospective validation, safety studies, and regulatory review—would be required before any clinical application could be considered.
Future research directions
We’re continuing research in several areas:
- Harder benchmarks: developing more challenging evaluation scenarios
- Uncertainty estimation: improving model calibration and confidence assessment
- Documentation tools: exploring applications for clinical documentation support
Interested in learning more about our research or exploring documentation tools for your organization?
Contact us