Publications.
Selected preprints, conference papers, and technical notes on the thermodynamics of intelligence and mechanistic interpretability.
Metanthropic Research
I publish the majority of my formal research under the Metanthropic charter; the full archives, including safety evaluations and interpretability logs, are available there.
The Fragility of Guardrails: Cognitive Jamming and Repetition Collapse in Safety-Steered LLMs
A mechanistic audit of LLM residual streams using Sparse Autoencoders (SAEs). We demonstrate that aggressive safety-steering vectors often interfere with latent world-modeling circuits, triggering 'Cognitive Jamming', a failure mode in which models spiral into repetition rather than producing a grounded refusal.
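The two mechanisms named in the abstract can be illustrated with a toy sketch (an assumption-laden illustration, not the paper's implementation): activation steering is commonly realized as adding a scaled direction vector to a residual-stream activation, and repetition collapse can be flagged with a crude duplicate n-gram rate. All function names and dimensions here are hypothetical.

```python
import numpy as np

def apply_steering(hidden, direction, alpha):
    """Add a normalized steering vector to a residual-stream activation.

    hidden: (d,) activation; direction: (d,) steering vector; alpha: strength.
    """
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

def repetition_rate(tokens, n=3):
    """Fraction of duplicated n-grams -- a simple repetition-collapse signal."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

# A degenerate loop scores near 1.0; varied text scores near 0.0.
print(repetition_rate(["I", "cannot", "I", "cannot", "I", "cannot", "I", "cannot"]))
print(repetition_rate("the quick brown fox jumps over the lazy dog".split()))
```

In practice a detector like this would run over sampled continuations while sweeping the steering strength `alpha`, looking for the threshold where refusals degrade into loops.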
Dataset Distillation for the Pre-Training Era
Introducing Linear Gradient Matching (LGM), a method for condensing massive datasets into a single synthetic image per class. We reveal shared 'Platonic' representations across foundation models (CLIP, DINOv2) and show how distilled data acts as a lens for diagnosing model robustness.
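The gradient-matching idea behind distillation methods like this can be sketched in a few lines (a hedged toy, not the LGM algorithm itself: the squared-error loss, the fixed random linear probe, the dimensions, and the finite-difference optimizer are all assumptions for illustration). The goal is to synthesize one point per class whose induced gradient on a linear model aligns with the gradient from the real data.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_grad(X, Y, W):
    """Gradient of mean squared error of the linear map X @ W w.r.t. W."""
    return X.T @ (X @ W - Y) / len(X)

def cosine_distance(a, b):
    a, b = a.ravel(), b.ravel()
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy "real" data: two Gaussian classes in 4-D with one-hot labels.
d, c, n = 4, 2, 100
means = rng.normal(size=(c, d))
X_real = np.vstack([means[k] + 0.1 * rng.normal(size=(n, d)) for k in range(c)])
Y_real = np.vstack([np.tile(np.eye(c)[k], (n, 1)) for k in range(c)])

# One synthetic point per class (the "single image per class" budget).
X_syn = rng.normal(size=(c, d))
Y_syn = np.eye(c)
W = rng.normal(size=(d, c))  # a fixed random linear probe
g_real = linear_grad(X_real, Y_real, W)

def match_loss(x_flat):
    """Cosine distance between the real and synthetic gradients."""
    return cosine_distance(g_real, linear_grad(x_flat.reshape(c, d), Y_syn, W))

x = X_syn.ravel().copy()
eps, lr = 1e-5, 0.1
for _ in range(300):
    # Finite-difference gradient of the matching loss (fine for 8 parameters).
    g = np.array([(match_loss(x + eps * e) - match_loss(x - eps * e)) / (2 * eps)
                  for e in np.eye(x.size)])
    x -= lr * g

print(f"matching loss: {match_loss(X_syn.ravel()):.3f} -> {match_loss(x):.3f}")
```

Real distillation pipelines differentiate through the network with autograd rather than finite differences; the sketch only shows why matching gradients, rather than raw pixels, is a sensible compression objective.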