Hi, I'm Jack
AI researcher and engineer focused on robustness, interpretability, and agentic systems. Former NHS physician now building and evaluating AI for healthcare and beyond.

About

I trained and practiced as a physician in the NHS, where I saw first-hand the limits of clinical workflows and the potential of digital health and machine learning tools. That experience led me to focus full-time on AI research and engineering.

Since then, I've worked at MIT and Harvard on robustness evaluation, interpretability, and agentic systems, publishing in venues including NeurIPS, EMNLP, Nature Medicine, and The Lancet Digital Health. I co-authored the TRIPOD-LLM reporting guideline, developed benchmarks for clinical AI, and delivered international talks, including a TED Talk on AI in healthcare.

Alongside research, I've built and deployed full-stack clinical AI systems end-to-end — from raw EMR extraction and data pipelines through to live, real-time applications used in frontline care.

I'm currently exploring the applications of agentic systems in healthcare and biotech, with a focus on agentic interaction environments, reinforcement learning tooling, and methods that can accelerate healthcare delivery and enable continual knowledge creation.

Employment History

Harvard Medical School / Brigham & Women's Hospital

2024 - Present
Postdoctoral Research Associate
Developed the first end-to-end agentic adverse event reporting system, retrospectively validated and prospectively deployed for daily real-time identification of immunotherapy toxicities. Additionally, built comprehensive benchmarks for evaluating LLM robustness, including MedBrowseComp (a deep-research and computer-use benchmark in healthcare), RABBITS (robustness to drug name variation), and WorldMedQA-V (one of the first multilingual, multimodal medical evaluation datasets).
Massachusetts Institute of Technology

2023 - 2024
Postdoctoral Research Associate
Coordinated international research teams on AI governance, contributing to global standards for AI safety and evaluation such as the TRIPOD-LLM Statement. Concurrently, re-examined evaluation metrics for machine learning under class imbalance, demonstrating the limitations of AUPRC relative to AUROC through theoretical and empirical work. Delivered keynotes and panels on large language models and responsible AI across four continents, including a TED Talk ("Whose Life Will AI Save?") with 34,000+ views.
NHS - Guy's and St Thomas' Trust

2022 - 2024
Honorary Clinical Data Scientist
Quantified miscalibration harm in the National Early Warning Score due to pulse oximetry bias ("Hidden Hypoxaemia"), leading to roundtables with device manufacturers and the FDA.
NHS - Imperial College Healthcare Trust

2021 - 2023
Foundation Doctor
Provided frontline care in medical and acute services, with a focus on critical care and severe respiratory failure. Designed and implemented a surgical outcome dashboard that improved ICU bed allocation during the pandemic recovery. Also contributed to EHR deployment across North West London trusts as part of the Cerner Development Team.

Skills

Python
JavaScript
React
Next.js
TypeScript
Node.js
R
SQL
Flask
FastAPI
Docker
Azure
AWS
GCS
TailwindCSS
LangChain
LangGraph

Research & Publications

A selection of research and projects of particular interest, focused on AI safety, robustness evaluation, and healthcare applications.

  • KScope: A Framework for Characterizing the Knowledge Status of Language Models
    NeurIPS

    A comprehensive framework for systematically characterizing and evaluating the knowledge status of large language models, providing insights into what models know, how they know it, and the boundaries of their knowledge.
  • MedBrowseComp: Benchmarking Medical Deep Research and Computer Use
    arXiv Preprint

    A comprehensive benchmark for evaluating medical deep research and computer use capabilities, featuring complex multi-hop questions that require agents to navigate medical databases, clinical trials, and regulatory information.
  • Sparse autoencoder features for classifications and transferability
    EMNLP

    An investigation into the use of sparse autoencoder features for classification tasks, examining their transferability across different domains and model architectures.
  • The TRIPOD-LLM reporting guideline for studies using large language models
    Nature Medicine

    TRIPOD-LLM provides a comprehensive framework for transparent reporting of large language models in healthcare applications. Developed through expert consensus, these guidelines introduce a modular format with 19 main items and 50 subitems, addressing the unique challenges of LLMs in biomedical research.
  • WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation
    NAACL

    A comprehensive multilingual and multimodal medical examination dataset designed to evaluate the capabilities of multimodal language models across diverse medical domains and languages.
  • A closer look at AUROC and AUPRC under class imbalance
    NeurIPS

    This study challenges the popular belief that AUPRC is the best metric under class imbalance. Using a novel theoretical framework, we show that AUPRC is inherently discriminatory, favoring subgroups with a higher prevalence of positive labels.
  • Cross-Care: Assessing the Healthcare Implications of Pre-training Data on Language Model Bias
    NeurIPS

    This work examines the biases inherent in large language models, particularly those used in healthcare applications. Through systematic analysis of 'The Pile,' Cross-Care shows how pre-training data can skew model outputs, potentially leading to misinformed medical insights.
  • Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks
    EMNLP

    This study investigates the surprising fragility of large language models (LLMs) when faced with drug name variations in biomedical contexts. Through systematic analysis using the RABBITS framework, we expose how LLMs struggle with brand and generic drug name substitutions, potentially impacting their reliability in healthcare applications.
  • Peer review of GPT-4 technical report and systems card
    PLOS Digital Health

    A comprehensive peer review of OpenAI's GPT-4 technical documentation, evaluating the transparency, methodology, and safety considerations presented in their systems card.
  • The effect of using a large language model to respond to patient messages
    Lancet Digital Health

    Responding to patient messages consumes significant physician time, and LLMs could help reduce this documentation burden. This study evaluates the effectiveness of LLM-drafted responses to real-world patient questions and measures the rate of potentially harmful responses.
  • Mapping and evaluating national data flows: transparency, privacy, and guiding infrastructural transformation
    Lancet Digital Health

    The study explores the UK's NHS data management, uncovering a vast network of data flows across healthcare and research sectors. Key findings highlight transparency issues and trust concerns in data handling, alongside prevalent non-compliance with safe data access practices.
  • An interactive dashboard to track themes, development maturity, and global equity in clinical artificial intelligence research
    Lancet Digital Health

    Continuous evaluation of AI models is essential to ensure safe deployment. Disparity Dashboards systematically and continuously evaluate the impact of AI models on different subgroups of the population.
  • Artificial intelligence for mechanical ventilation: systematic review of design, reporting standards, and bias
    British Journal of Anaesthesia

    A systematic review examining the current state of AI applications in mechanical ventilation, focusing on design methodologies, reporting standards, and potential biases in existing research.