About
I trained and practiced as a physician in the NHS, where I saw first-hand the limits of clinical workflows and the potential of digital health and machine learning tools. That experience led me to focus full-time on AI research and engineering.
Since then, I've worked at MIT and Harvard on robustness evaluation, interpretability, and agentic systems, publishing in venues including NeurIPS, EMNLP, Nature Medicine, and The Lancet Digital Health. I co-authored the TRIPOD-LLM reporting guideline, developed benchmarks for clinical AI, and delivered international talks, including a TED talk on AI in healthcare.
Alongside research, I've built and deployed full-stack clinical AI systems end-to-end — from raw EMR extraction and data pipelines through to live, real-time applications used in frontline care.
I'm currently exploring the applications of agentic systems in healthcare and biotech, with a focus on agentic interaction environments, reinforcement learning tooling, and methods that can accelerate healthcare delivery and enable continual knowledge creation.
Research & Publications
A selection of research and projects of particular interest, focused on AI safety, robustness evaluation, and healthcare applications.
KScope: A Framework for Characterizing the Knowledge Status of Language Models
NeurIPS
A comprehensive framework for systematically characterizing and evaluating the knowledge status of large language models, providing insights into what models know, how they know it, and the boundaries of their knowledge.
MedBrowseComp: Benchmarking Medical Deep Research and Computer Use
arXiv Preprint
A comprehensive benchmark for evaluating medical deep research and computer use capabilities, featuring complex multi-hop questions that require agents to navigate medical databases, clinical trials, and regulatory information.
Sparse autoencoder features for classifications and transferability
EMNLP
An investigation into the use of sparse autoencoder features for classification tasks, examining their transferability across different domains and model architectures.
The TRIPOD-LLM reporting guideline for studies using large language models
Nature Medicine
TRIPOD-LLM provides a comprehensive framework for transparent reporting of large language models in healthcare applications. Developed through expert consensus, these guidelines introduce a modular format with 19 main items and 50 subitems, addressing the unique challenges of LLMs in biomedical research.
WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation
NAACL
A multilingual, multimodal medical examination dataset designed to evaluate the capabilities of multimodal language models across diverse medical domains and languages.
A closer look at AUROC and AUPRC under class imbalance
NeurIPS
This study disproves the popular belief that AUPRC is the superior metric in class-imbalanced settings. Using a novel theoretical framework, it shows that AUPRC is inherently discriminatory, favoring subgroups with a higher prevalence of positive labels.
Cross-Care: Assessing the Healthcare Implications of Pre-training Data on Language Model Bias
NeurIPS
This work examines the biases inherent in large language models, particularly those used in healthcare applications. Through systematic analysis of 'The Pile', Cross-Care exposes how pre-training data can skew model outputs, potentially leading to misinformed medical insights.
Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks
EMNLP
This study investigates the surprising fragility of large language models (LLMs) when faced with drug name variations in biomedical contexts. Through systematic analysis using the RABBITS framework, it exposes how LLMs struggle with brand and generic drug name substitutions, potentially impacting their reliability in healthcare applications.
Peer review of GPT-4 technical report and systems card
PLOS Digital Health
A peer review of OpenAI's GPT-4 technical documentation, evaluating the transparency, methodology, and safety considerations presented in their systems card.
The effect of using a large language model to respond to patient messages
Lancet Digital Health
Responding to patient messages consumes significant physician time, and LLMs could help reduce this documentation burden. This study evaluates the quality of LLM-drafted responses to real-world patient questions and measures the rate of potentially harmful responses.
Mapping and evaluating national data flows: transparency, privacy, and guiding infrastructural transformation
Lancet Digital Health
This study maps the UK NHS's data management landscape, uncovering a vast network of data flows across healthcare and research sectors. Key findings highlight transparency issues and trust concerns in data handling, alongside prevalent non-compliance with safe data-access practices.
An interactive dashboard to track themes, development maturity, and global equity in clinical artificial intelligence research
Lancet Digital Health
Continuous evaluation of AI models is essential to ensure safe deployment. Disparity dashboards systematically and continuously evaluate the impact of AI models on different subgroups of the population.
Artificial intelligence for mechanical ventilation: systematic review of design, reporting standards, and bias
British Journal of Anaesthesia
A systematic review examining the current state of AI applications in mechanical ventilation, focusing on design methodologies, reporting standards, and potential biases in existing research.