Hi, I'm Jack
AI researcher and engineer focused on robustness, interpretability, and agentic systems. Former NHS physician now building and evaluating AI for healthcare and beyond.

About

I trained and practiced as a physician in the NHS, where I saw first-hand the limits of clinical workflows and the potential of digital health and machine learning tools. That experience led me to focus full-time on AI research and engineering.

Since then, I've worked at MIT and Harvard on robustness evaluation, interpretability, and agentic systems, publishing in venues including NeurIPS, EMNLP, Nature Medicine, and The Lancet Digital Health. I co-authored the TRIPOD-LLM reporting guideline, developed benchmarks for clinical AI, and delivered international talks, including a TED Talk on AI in healthcare.

Alongside research, I've built and deployed full-stack clinical AI systems end-to-end — from raw EMR extraction and data pipelines through to live, real-time applications used in frontline care.

I'm currently exploring the applications of agentic systems in healthcare and biotech, with a focus on agentic interaction environments, reinforcement learning tooling, and methods that can accelerate healthcare delivery and enable continual knowledge creation.

Employment History

Harvard Medical School / Brigham & Women's Hospital

2024 - Present
Postdoctoral Research Associate
Developed the first end-to-end agentic adverse event reporting system, retrospectively validated and prospectively deployed for daily real-time identification of immunotherapy toxicities. Additionally, built comprehensive benchmarks for evaluating LLM robustness, including MedBrowseComp (a deep-research and computer-use benchmark in healthcare), RABBITS (robustness to drug name variation), and WorldMedQA-V (one of the first multilingual, multimodal medical evaluation datasets).
Massachusetts Institute of Technology

2023 - 2024
Postdoctoral Research Associate
Coordinated international research teams on AI governance, contributing to global standards for AI safety and evaluation such as the TRIPOD-LLM Statement. Concurrently, re-examined evaluation metrics for machine learning under class imbalance, demonstrating the limitations of AUPRC relative to AUROC through theoretical and empirical work. Delivered keynotes and panels on large language models and responsible AI across four continents, including a TED Talk ("Whose Life Will AI Save?") with 34,000+ views.
NHS - Guy's and St Thomas' Trust

2022 - 2024
Honorary Clinical Data Scientist
Quantified miscalibration harm in the National Early Warning Score due to pulse oximetry bias ("Hidden Hypoxaemia"), leading to roundtables with device manufacturers and the FDA.
NHS - Imperial College Healthcare Trust

2021 - 2023
Foundation Doctor
Provided frontline care in medical and acute services, with a focus on critical care and severe respiratory failure. Designed and implemented a surgical outcome dashboard that improved ICU bed allocation during the pandemic recovery. Also contributed to EHR deployment across North West London trusts as part of the Cerner Development Team.

Skills

Python
JavaScript
React
Next.js
TypeScript
Node.js
R
SQL
Flask
FastAPI
Docker
Azure
AWS
GCS
TailwindCSS
LangChain
LangGraph

Research & Publications

A selection of research and projects of particular interest, focused on AI safety, robustness evaluation, and healthcare applications.

  • KScope: A Framework for Characterizing the Knowledge Status of Language Models
    NeurIPS

    A comprehensive framework for systematically characterizing and evaluating the knowledge status of large language models, providing insights into what models know, how they know it, and the boundaries of their knowledge.
  • MedBrowseComp: Benchmarking Medical Deep Research and Computer Use
    arXiv Preprint

    A comprehensive benchmark for evaluating medical deep research and computer use capabilities, featuring complex multi-hop questions that require agents to navigate medical databases, clinical trials, and regulatory information.
  • Sparse autoencoder features for classifications and transferability
    EMNLP

    An investigation into the use of sparse autoencoder features for classification tasks, examining their transferability across different domains and model architectures.
  • The TRIPOD-LLM reporting guideline for studies using large language models
    Nature Medicine

    TRIPOD-LLM provides a comprehensive framework for transparent reporting of large language models in healthcare applications. Developed through expert consensus, these guidelines introduce a modular format with 19 main items and 50 subitems, addressing the unique challenges of LLMs in biomedical research.
  • WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation
    NAACL

    A comprehensive multilingual and multimodal medical examination dataset designed to evaluate the capabilities of multimodal language models across diverse medical domains and languages.
  • A closer look at AUROC and AUPRC under class imbalance
    NeurIPS

    This study challenges the popular belief that AUPRC is the best metric under class imbalance. Using a novel theoretical framework, we show that AUPRC is inherently discriminatory, favoring subgroups with a higher prevalence of positive labels.
  • Cross-Care: Assessing the Healthcare Implications of Pre-training Data on Language Model Bias
    NeurIPS

    This work examines the biases inherent in large language models, particularly those used in healthcare applications. Through systematic analysis of 'The Pile,' Cross-Care shows how pre-training data can skew model outputs, potentially leading to misinformed medical insights.
  • Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks
    EMNLP

    This study investigates the surprising fragility of large language models (LLMs) when faced with drug name variations in biomedical contexts. Through systematic analysis using the RABBITS framework, we expose how LLMs struggle with brand and generic drug name substitutions, potentially impacting their reliability in healthcare applications.
  • Peer review of GPT-4 technical report and systems card
    PLOS Digital Health

    A comprehensive peer review of OpenAI's GPT-4 technical documentation, evaluating the transparency, methodology, and safety considerations presented in their systems card.
  • The effect of using a large language model to respond to patient messages
    Lancet Digital Health

    Responding to patient messages consumes significant physician time, and LLMs could help reduce this documentation burden. This study evaluates the effectiveness of LLM-drafted responses to real-world patient questions and measures the rate of potentially harmful responses.
  • Mapping and evaluating national data flows: transparency, privacy, and guiding infrastructural transformation
    Lancet Digital Health

    The study explores the UK's NHS data management, uncovering a vast network of data flows across healthcare and research sectors. Key findings highlight transparency issues and trust concerns in data handling, alongside prevalent non-compliance with safe data access practices.
  • An interactive dashboard to track themes, development maturity, and global equity in clinical artificial intelligence research
    Lancet Digital Health

    Continuous evaluation of AI models is essential to ensure safe deployment. Disparity Dashboards systematically and continuously evaluate the impact of AI models on different subgroups of the population.
  • Artificial intelligence for mechanical ventilation: systematic review of design, reporting standards, and bias
    British Journal of Anaesthesia

    A systematic review examining the current state of AI applications in mechanical ventilation, focusing on design methodologies, reporting standards, and potential biases in existing research.