Publications.
Selected preprints, conference papers, and technical notes on the thermodynamics of intelligence and mechanistic interpretability.
Metanthropic Research
I publish the majority of my formal research under the Metanthropic charter; the full archives, including safety evaluations and interpretability logs, are available there.
The Fragility of Guardrails: Cognitive Jamming and Repetition Collapse in Safety-Steered LLMs
A mechanistic audit of LLM residual streams using Sparse Autoencoders (SAEs). We demonstrate that aggressive safety-steering vectors often interfere with latent world-modeling circuits, triggering 'Cognitive Jamming', a failure mode in which models spiral into repetition rather than producing a grounded refusal.
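The two mechanisms named in the abstract can be illustrated with a toy sketch (an assumption-laden illustration, not the paper's implementation): activation steering is commonly realized as adding a scaled direction vector to a residual-stream activation, and repetition collapse can be flagged with a crude duplicate n-gram rate. All function names and dimensions here are hypothetical.

```python
import numpy as np

def apply_steering(hidden, direction, alpha):
    """Add a normalized steering vector to a residual-stream activation.

    hidden: (d,) activation; direction: (d,) steering vector; alpha: strength.
    """
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

def repetition_rate(tokens, n=3):
    """Fraction of duplicated n-grams -- a simple repetition-collapse signal."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

# A degenerate loop scores near 1.0; varied text scores near 0.0.
print(repetition_rate(["I", "cannot", "I", "cannot", "I", "cannot", "I", "cannot"]))
print(repetition_rate("the quick brown fox jumps over the lazy dog".split()))
```

In practice a detector like this would run over sampled continuations while sweeping the steering strength `alpha`, looking for the threshold where refusals degrade into loops.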
Dataset Distillation for the Pre-Training Era
Introducing Linear Gradient Matching (LGM), a method for condensing massive datasets into a single synthetic image per class. We reveal shared 'Platonic' representations across foundation models (CLIP, DINOv2) and show how distilled data acts as a lens for diagnosing model robustness.
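The gradient-matching idea behind distillation methods like this can be sketched in a few lines (a hedged toy, not the LGM algorithm itself: the squared-error loss, the fixed random linear probe, the dimensions, and the finite-difference optimizer are all assumptions for illustration). The goal is to synthesize one point per class whose induced gradient on a linear model aligns with the gradient from the real data.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_grad(X, Y, W):
    """Gradient of mean squared error of the linear map X @ W w.r.t. W."""
    return X.T @ (X @ W - Y) / len(X)

def cosine_distance(a, b):
    a, b = a.ravel(), b.ravel()
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy "real" data: two Gaussian classes in 4-D with one-hot labels.
d, c, n = 4, 2, 100
means = rng.normal(size=(c, d))
X_real = np.vstack([means[k] + 0.1 * rng.normal(size=(n, d)) for k in range(c)])
Y_real = np.vstack([np.tile(np.eye(c)[k], (n, 1)) for k in range(c)])

# One synthetic point per class (the "single image per class" budget).
X_syn = rng.normal(size=(c, d))
Y_syn = np.eye(c)
W = rng.normal(size=(d, c))  # a fixed random linear probe
g_real = linear_grad(X_real, Y_real, W)

def match_loss(x_flat):
    """Cosine distance between the real and synthetic gradients."""
    return cosine_distance(g_real, linear_grad(x_flat.reshape(c, d), Y_syn, W))

x = X_syn.ravel().copy()
eps, lr = 1e-5, 0.1
for _ in range(300):
    # Finite-difference gradient of the matching loss (fine for 8 parameters).
    g = np.array([(match_loss(x + eps * e) - match_loss(x - eps * e)) / (2 * eps)
                  for e in np.eye(x.size)])
    x -= lr * g

print(f"matching loss: {match_loss(X_syn.ravel()):.3f} -> {match_loss(x):.3f}")
```

Real distillation pipelines differentiate through the network with autograd rather than finite differences; the sketch only shows why matching gradients, rather than raw pixels, is a sensible compression objective.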