Data Scientist / Applied ML Engineer
Health Data Hub
Key Systems
- Large-scale clinical data integration: linking and cleaning 100M+ rows from SNDS and Paris Hospitals EDWs with scalable ETL workflows, producing ~150 derived analysis-ready variables.
- Synthetic data generation: designing ParaBios, a multi-table stochastic generator that synthesizes data under multiple constraints and formal schema specifications at 500 MB/min throughput.
- Clinical NLP anonymization: implementing ASR/NER pipelines (Whisper + GLiNER) for emergency call data, achieving ~92% anonymization recall.
Ownership & Impact
Full-stack ownership of a live healthcare data platform, from ingestion pipelines to deployed predictive models. Alongside the core modelling work, contributed to internal tooling initiatives, including documentation automation and structured data generation.