PARHAF CliBench
A benchmark for measuring how 7B-9B language models handle structured information extraction on real French clinical notes.
Why it matters
Hospitals need reliable extraction for de-identification, infection tracking, and structured reporting. This benchmark tests whether compact open models are actually ready for those workflows.
Technical challenge
Clinical IE requires more than medical knowledge: models must emit valid structured outputs, preserve exact spans, and stay robust across heterogeneous tasks and note styles.
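To make the conformity requirement concrete, here is a minimal sketch of the kind of check involved. It is illustrative only, not the benchmark's actual harness: `check_prediction`, the `spans`/`label` schema, and the sample note are all hypothetical, assuming models are prompted to return JSON with extracted spans copied verbatim from the note.

```python
import json

def check_prediction(note: str, raw_output: str, required_keys: set) -> dict:
    """Hypothetical conformity check: parse the model's JSON output, verify
    the required keys are present, and confirm each extracted span appears
    verbatim in the source note."""
    try:
        pred = json.loads(raw_output)
    except json.JSONDecodeError:
        # Unparseable output fails every downstream check.
        return {"valid_json": False, "schema_ok": False, "spans_ok": False}
    schema_ok = required_keys <= set(pred.keys())
    spans = pred.get("spans", [])
    # Exact-span preservation: the predicted string must occur in the note.
    spans_ok = all(isinstance(s, str) and s in note for s in spans)
    return {"valid_json": True, "schema_ok": schema_ok, "spans_ok": spans_ok}

# Toy French clinical sentence and a well-formed model output.
note = "Patient sous amoxicilline depuis 3 jours."
out = '{"spans": ["amoxicilline"], "label": "DRUG"}'
print(check_prediction(note, out, {"spans", "label"}))
```

A check like this separates "the model knows the answer" from "the model emitted it in a usable form", which is the distinction the benchmark is probing.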
Methodology / evaluation
We evaluated six instruction-tuned LLMs and GLiNER2 on four PARHAF tasks, covering 5,005 documents and 65,065 predictions, and report micro-F1, schema conformity, empty-output rates, latency, and bootstrap confidence intervals.
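The core metrics above can be sketched as follows. This is a generic illustration, not the benchmark's actual code: micro-F1 pools true positives, false positives, and false negatives across all documents before computing F1, and a percentile bootstrap resamples documents with replacement to get a confidence interval. The per-document counts, `n_boot`, and the function names are assumptions.

```python
import random

def micro_f1(tp: int, fp: int, fn: int) -> float:
    """Micro-F1 over pooled counts: 2*TP / (2*TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def bootstrap_ci(per_doc_counts, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap over documents: resample documents with
    replacement, recompute pooled micro-F1 each time, and take the
    alpha/2 and 1 - alpha/2 quantiles of the resampled scores."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_boot):
        sample = [rng.choice(per_doc_counts) for _ in per_doc_counts]
        tp = sum(c[0] for c in sample)
        fp = sum(c[1] for c in sample)
        fn = sum(c[2] for c in sample)
        scores.append(micro_f1(tp, fp, fn))
    scores.sort()
    lo = scores[int((alpha / 2) * n_boot)]
    hi = scores[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy per-document (TP, FP, FN) counts -- illustrative only.
counts = [(8, 1, 2), (5, 0, 1), (9, 3, 0), (4, 2, 2)]
lo, hi = bootstrap_ci(counts)
print(f"micro-F1 95% CI: [{lo:.3f}, {hi:.3f}]")
```

Bootstrapping at the document level (rather than the prediction level) respects the fact that predictions within one note are correlated.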