Research publication
This article presents research benchmark results only. Council 1 Pro is not a medical device and is not intended for clinical use, medical diagnosis, or treatment decisions. All benchmark results are from controlled research evaluations using synthetic test cases.
Council 1 Pro achieves state-of-the-art results on the MedAsk research benchmark.
We’re sharing research results for Council 1 Pro, our fine-tuned medical language model. In benchmark evaluations on MedAsk, Council 1 Pro achieved the highest scores among the frontier models we tested.
These benchmark results demonstrate advances in medical natural language understanding. While encouraging for research purposes, benchmark performance does not validate clinical utility—real-world applications require extensive additional validation.
What is MedAsk?
MedAsk is a research benchmark built around synthetic medical vignettes. Each vignette represents a hypothetical case with a target answer. A model must:
Engage in multi-turn Q&A to gather information
Produce a ranked list of possible answers
Be scored on Top-1, Top-3, and Top-5 accuracy, as sketched in code below
We evaluated on 400 synthetic vignettes spanning a wide range of medical topics. This benchmark measures language model capabilities, not clinical readiness.
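For concreteness, here is a minimal sketch of Top-k scoring over ranked answer lists. The data layout and the `top_k_accuracy` helper are illustrative assumptions for this post, not the actual MedAsk harness.

```python
# Hypothetical sketch of Top-k scoring; the data layout is an assumption,
# not the actual MedAsk evaluation code.

def top_k_accuracy(predictions, targets, k):
    """Fraction of cases where the target appears in the first k ranked answers."""
    hits = sum(target in ranked[:k] for ranked, target in zip(predictions, targets))
    return hits / len(targets)

# Each model returns a ranked list of candidate answers per vignette.
predictions = [
    ["answer_a", "answer_b", "answer_c"],  # vignette 1
    ["answer_d", "answer_e", "answer_a"],  # vignette 2
]
targets = ["answer_a", "answer_e"]

for k in (1, 3, 5):
    print(f"Top-{k} accuracy: {top_k_accuracy(predictions, targets, k):.0%}")
# Top-1 accuracy: 50%
# Top-3 accuracy: 100%
# Top-5 accuracy: 100%
```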
Benchmark methodology update
To improve benchmark consistency and reproducibility, we updated MedAsk so that GPT-5 powers both:

- Simulated dialogue: standardized responses during the multi-turn evaluation
- Automated grading: consistent scoring criteria applied uniformly across models
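As an illustration only, the loop below shows how a single fixed evaluator model can play both roles. The `ask_llm` helper, the prompts, and the `MAX_TURNS` limit are assumptions for this sketch, not the actual MedAsk implementation.

```python
# Hypothetical sketch of the dual-role evaluation loop. `ask_llm` is a
# stand-in for a call to the fixed evaluator model; wire up a real client
# to run it.

MAX_TURNS = 10

def ask_llm(system_prompt: str, user_message: str) -> str:
    """Placeholder for an API call to the evaluator model."""
    raise NotImplementedError

def evaluate_vignette(vignette: str, target: str, candidate_model) -> bool:
    transcript = []
    # Role 1: the evaluator simulates the dialogue, answering every
    # candidate model's questions from the same vignette text, so all
    # models face consistent test conditions.
    for _ in range(MAX_TURNS):
        question = candidate_model.next_question(transcript)
        if question is None:  # candidate has finished gathering information
            break
        reply = ask_llm(
            system_prompt=f"Answer strictly from this vignette:\n{vignette}",
            user_message=question,
        )
        transcript.append((question, reply))
    ranked = candidate_model.final_ranking(transcript)
    # Role 2: the evaluator grades the ranked list against the target,
    # applying one rubric uniformly across all models.
    verdict = ask_llm(
        system_prompt="Reply YES if the first-ranked answer matches the target, else NO.",
        user_message=f"Target: {target}\nRanking: {ranked}",
    )
    return verdict.strip().upper().startswith("YES")
```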
Why this improves the benchmark
- More standardized evaluation: every model interacts with consistent test conditions.
- Reduced variance: automated grading removes human scoring inconsistencies.
- Better reproducibility: results can be more reliably replicated by other researchers.
Benchmark results
Below are the results on 400 MedAsk vignettes (December 10, 2025), using the updated evaluation methodology.
| Model | Top-1 (of 400) | Top-3 (of 400) | Top-5 (of 400) | Top-1 Accuracy |
|---|---|---|---|---|
| Council 1 Pro | 347 | 382 | 394 | 86.8% |
| GPT-5.1 | 327 | 380 | 396 | 81.8% |
| Claude Sonnet 4.5 | 320 | 383 | 396 | 80.0% |
| Kimi K2 | 319 | 383 | 394 | 79.8% |
| DeepSeek V3.1 | 270 | 345 | 371 | 67.5% |
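For reference, the accuracy column can be reproduced directly from the raw counts (the table rounds to one decimal place):

```python
# Reproducing the Top-1 accuracy column from the raw counts (n = 400).
results = {
    "Council 1 Pro":     347,
    "GPT-5.1":           327,
    "Claude Sonnet 4.5": 320,
    "Kimi K2":           319,
    "DeepSeek V3.1":     270,
}
for model, correct in results.items():
    print(f"{model}: {correct}/400 = {correct / 400 * 100:.2f}%")
# Council 1 Pro: 347/400 = 86.75%  (86.8% in the table, rounded)
```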
[Figure: Top-1 accuracy comparison]
Key finding
Council 1 Pro achieves 86.8% Top-1 accuracy (347/400 correct first-choice answers), the highest Top-1 score among the models we tested.
[Figure: Multi-metric performance comparison]
Top-3 and Top-5 results
Four of the five models show strong Top-3 and Top-5 coverage (95% or higher; DeepSeek V3.1 trails at 86.3% and 92.8%), indicating that modern language models can usually place the relevant possibility within a short ranked list. The differentiation primarily occurs in Top-1 accuracy.
Technical approach
Council 1 Pro was developed through iterative fine-tuning focused on medical language understanding and information retrieval quality.
Key technical aspects
Information gathering optimization
The model is tuned to ask relevant follow-up questions that help narrow down possibilities, improving efficiency in multi-turn conversations.
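To make this concrete, here is a toy sketch of one classic formulation of the idea: choosing the follow-up question with the highest expected information gain over the remaining candidate answers. The probability tables and the greedy strategy are illustrative assumptions, not Council 1 Pro's actual training objective.

```python
import math

# Hypothetical sketch: pick the follow-up question expected to shrink the
# candidate-answer distribution the most. Probabilities are made up.

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def expected_info_gain(prior, answer_likelihoods):
    """Expected entropy reduction from asking one question.

    prior: P(candidate) over possible final answers.
    answer_likelihoods: {reply: {candidate: P(reply | candidate)}}.
    """
    gain = entropy(prior)
    for reply, likelihood in answer_likelihoods.items():
        # P(reply) and the posterior P(candidate | reply) via Bayes' rule.
        p_reply = sum(prior[c] * likelihood[c] for c in prior)
        if p_reply == 0:
            continue
        posterior = {c: prior[c] * likelihood[c] / p_reply for c in prior}
        gain -= p_reply * entropy(posterior)
    return gain

prior = {"A": 0.5, "B": 0.3, "C": 0.2}
questions = {
    "q1": {"yes": {"A": 0.9, "B": 0.1, "C": 0.5},
           "no":  {"A": 0.1, "B": 0.9, "C": 0.5}},
    "q2": {"yes": {"A": 0.5, "B": 0.5, "C": 0.5},
           "no":  {"A": 0.5, "B": 0.5, "C": 0.5}},  # uninformative
}
best = max(questions, key=lambda q: expected_info_gain(prior, questions[q]))
print(best)  # q1: it discriminates between candidates; q2 does not
```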
Ranking calibration
Training emphasized producing well-calibrated rankings where the most relevant answer appears first, rather than simply listing many possibilities.
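As a rough illustration of what such an objective can look like, the snippet below implements a listwise softmax cross-entropy (ListNet-style) loss that penalizes rankings where the correct answer does not score highest. This is an assumed stand-in for exposition, not the loss we actually used.

```python
import math

# Hypothetical listwise training signal: cross-entropy between the softmax
# of candidate scores and a one-hot target at the correct answer.

def listwise_loss(scores, correct_index):
    m = max(scores)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    return -math.log(exps[correct_index] / sum(exps))

# A ranking that puts the correct answer first is penalized far less:
print(listwise_loss([3.0, 1.0, 0.5], correct_index=0))  # ~0.20
print(listwise_loss([1.0, 3.0, 0.5], correct_index=0))  # ~2.20
```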
Limitations and important context
Critical limitations
- Benchmark ≠ Clinical Validation: These results are from a research benchmark using synthetic cases, not real patients.
- Not for Medical Use: Council 1 Pro is a research model and is not intended, designed, or validated for clinical diagnosis or treatment decisions.
- Controlled Conditions: Benchmark evaluations don't capture the complexity and variability of real-world medical scenarios.
- No Substitute for Medical Professionals: AI models cannot and should not replace the judgment of qualified healthcare providers.
What these results represent
This benchmark demonstrates progress in medical natural language understanding for research purposes. It represents a step forward in AI capabilities on standardized tests, but significant additional work—including prospective validation, safety studies, and regulatory review—would be required before any clinical application could be considered.
Future research directions
We’re continuing research in several areas:
- Harder benchmarks: developing more challenging evaluation scenarios
- Uncertainty estimation: improving model calibration and confidence assessment
- Documentation tools: exploring applications for clinical documentation support
Interested in learning more about our research or exploring documentation tools for your organization?
Contact us