Research publication

This article presents research benchmark results only. Council 1 Pro is not a medical device and is not intended for clinical use, medical diagnosis, or treatment decisions. All benchmark results are from controlled research evaluations using synthetic test cases.


Council 1 Pro achieves state-of-the-art results on the MedAsk research benchmark.

December 10, 2025
8 min read

We’re sharing research results for Council 1 Pro, our fine-tuned medical language model. In benchmark evaluations on MedAsk, Council 1 Pro achieved the highest scores among the frontier models we tested.

These benchmark results demonstrate advances in medical natural language understanding. While encouraging for research purposes, benchmark performance does not validate clinical utility—real-world applications require extensive additional validation.

What is MedAsk?

MedAsk is a research benchmark built around synthetic medical vignettes. Each vignette represents a hypothetical case with a target answer. A model must:

1. Engage in multi-turn Q&A to gather information
2. Produce a ranked list of possible answers
3. Be scored on Top-1, Top-3, and Top-5 accuracy

We evaluated on 400 synthetic vignettes spanning a wide range of medical topics. This benchmark measures language model capabilities, not clinical readiness.
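The Top-k scoring step described above can be sketched in a few lines. This is an illustrative reconstruction, not the actual MedAsk harness; the function names and data shapes are our own assumptions.

```python
def top_k_hit(ranked_answers, target, k):
    """True if the target answer appears among the model's first k ranked answers."""
    return target in ranked_answers[:k]

def top_k_accuracy(results, k):
    """Fraction of vignettes solved at rank k.

    results: list of (ranked_answers, target) pairs, one per vignette.
    """
    hits = sum(top_k_hit(ranked, target, k) for ranked, target in results)
    return hits / len(results)
```

Under this scheme a vignette that counts toward Top-1 automatically counts toward Top-3 and Top-5, which is why Top-k scores can only grow as k increases.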

Benchmark methodology update

To improve benchmark consistency and reproducibility, we updated MedAsk so that GPT-5 powers both:

  • Simulated dialogue: standardized responses during the multi-turn evaluation
  • Automated grading: consistent scoring criteria applied uniformly across models

Why this improves the benchmark

  • More standardized evaluation: every model interacts with consistent test conditions.
  • Reduced variance: automated grading removes human scoring inconsistencies.
  • Better reproducibility: results can be more reliably replicated by other researchers.
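The two judge roles can be pictured with a minimal per-vignette evaluation loop. Everything below is a hypothetical sketch: the `ask`, `simulate_reply`, and `rank_answers` callables stand in for model API calls that this article does not specify, and grading is reduced to exact string match.

```python
from dataclasses import dataclass

@dataclass
class Vignette:
    prompt: str    # what the candidate model sees initially
    case: dict     # hidden case details only the judge can reveal
    target: str    # the reference answer used for grading

def evaluate_vignette(ask, simulate_reply, rank_answers, vignette, max_turns=10):
    """Run one vignette: the judge simulates dialogue, then the ranked list is graded."""
    transcript = []
    for _ in range(max_turns):
        question = ask(vignette.prompt, transcript)
        if question is None:  # candidate is ready to commit to its answers
            break
        # Judge role 1: produce a standardized reply from the hidden case details.
        transcript.append((question, simulate_reply(vignette.case, question)))
    ranked = rank_answers(vignette.prompt, transcript)
    # Judge role 2: grade the ranked list against the target (here, exact match).
    return {k: vignette.target in ranked[:k] for k in (1, 3, 5)}
```

Averaging these per-vignette Top-k hits across all 400 vignettes yields the accuracies reported below.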

Benchmark results

Below are the results on 400 MedAsk vignettes (December 10, 2025), using the updated evaluation methodology.

Model               Top-1   Top-3   Top-5   Top-1 Accuracy
Council 1 Pro       347     382     394     86.8%
GPT-5.1             327     380     396     81.8%
Claude Sonnet 4.5   320     383     396     80.0%
Kimi K2             319     383     394     79.8%
DeepSeek V3.1       270     345     371     67.5%
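The accuracy column follows directly from the raw counts over 400 vignettes. A quick check (integer rounding to one decimal place is used to sidestep floating-point ties):

```python
TOTAL_VIGNETTES = 400

top1_correct = {
    "Council 1 Pro": 347,
    "GPT-5.1": 327,
    "Claude Sonnet 4.5": 320,
    "Kimi K2": 319,
    "DeepSeek V3.1": 270,
}

# Percentage to one decimal place, e.g. 347/400 -> 86.8
top1_accuracy = {
    model: round(correct * 1000 / TOTAL_VIGNETTES) / 10
    for model, correct in top1_correct.items()
}
```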

Top-1 accuracy comparison

Key finding: Top-1 performance

Council 1 Pro achieves 86.8% Top-1 accuracy (347/400 correct first-choice answers):

  • +5.0 pts compared to GPT-5.1 (81.8%)
  • +6.8 pts compared to Claude Sonnet 4.5 (80.0%)

Multi-metric performance comparison

Top-3 and Top-5 results

All of the models tested except DeepSeek V3.1 reach 95%+ Top-3 and Top-5 coverage (DeepSeek V3.1 trails at 86.3% and 92.8%), indicating that modern language models can usually surface the relevant possibilities. The differentiation primarily occurs in Top-1 accuracy.

Technical approach

Council 1 Pro was developed through iterative fine-tuning focused on medical language understanding and information retrieval quality.

Key technical aspects

Information gathering optimization

The model is tuned to ask relevant follow-up questions that help narrow down possibilities, improving efficiency in multi-turn conversations.

Ranking calibration

Training emphasized producing well-calibrated rankings where the most relevant answer appears first, rather than simply listing many possibilities.

Limitations and important context

Critical limitations

  • Benchmark ≠ Clinical Validation: These results are from a research benchmark using synthetic cases, not real patients.
  • Not for Medical Use: Council 1 Pro is a research model and is not intended, designed, or validated for clinical diagnosis or treatment decisions.
  • Controlled Conditions: Benchmark evaluations don't capture the complexity and variability of real-world medical scenarios.
  • No Substitute for Medical Professionals: AI models cannot and should not replace the judgment of qualified healthcare providers.

What these results represent

This benchmark demonstrates progress in medical natural language understanding for research purposes. It represents a step forward in AI capabilities on standardized tests, but significant additional work—including prospective validation, safety studies, and regulatory review—would be required before any clinical application could be considered.

Future research directions

We’re continuing research in several areas:

  • Harder benchmarks: developing more challenging evaluation scenarios
  • Uncertainty estimation: improving model calibration and confidence assessment
  • Documentation tools: exploring applications for clinical documentation support

Interested in learning more about our research or exploring documentation tools for your organization?

Contact us

Disclaimer: Council 1 Pro is a research model. The benchmark results presented here are for informational purposes only and do not constitute medical advice, diagnosis, or treatment recommendations. Council 1 Pro is not a medical device and has not been validated for clinical use. Healthcare decisions should only be made by qualified medical professionals.

Benchmark conducted December 10, 2025 using MedAsk research benchmark with standardized evaluation methodology.