The prevailing narrative treats AI models as objective. In reality, large language models inherit deep moral, cultural, and political biases from their training data, and when these models are deployed in high-stakes settings such as healthcare, legal sentencing, or automated hiring, those latent biases can cause systemic harm.
Mechanistic Interpretability
Rather than treating the model as an opaque oracle, our research focuses on mechanistic interpretability. We aim to identify the specific neural circuits and attention heads responsible for moral reasoning. By reverse-engineering these circuits, we can isolate where a model stores "bias" versus where it stores factual knowledge.
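As a concrete illustration of one common localization method, here is a minimal linear-probe sketch in PyTorch. Everything in it is a placeholder: the cached activations are random stand-ins for real residual-stream activations per layer, and the binary labels stand in for human annotations of each prompt's moral framing.

```python
# Minimal linear-probe sketch for locating layers that carry a "moral
# valence" signal. acts[l] would normally hold cached residual-stream
# activations at layer l for a set of prompts; labels would come from
# human annotation. Both are random stand-ins here.
import torch

n_layers, n_prompts, d_model = 12, 256, 768
acts = [torch.randn(n_prompts, d_model) for _ in range(n_layers)]  # stand-in
labels = torch.randint(0, 2, (n_prompts,)).float()                 # stand-in

def probe_accuracy(x, y, steps=200, lr=0.05):
    """Fit a logistic-regression probe on 80% of the data, score the rest."""
    n_train = int(0.8 * len(y))
    w = torch.zeros(x.shape[1], requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([w, b], lr=lr)
    for _ in range(steps):
        loss = torch.nn.functional.binary_cross_entropy_with_logits(
            x[:n_train] @ w + b, y[:n_train])
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        preds = ((x[n_train:] @ w + b) > 0).float()
        return (preds == y[n_train:]).float().mean().item()

# Layers whose held-out accuracy rises well above chance (~0.5) are
# candidate circuit sites; the random stand-ins should stay near chance.
for layer, x in enumerate(acts):
    print(f"layer {layer:2d}: held-out accuracy = {probe_accuracy(x, labels):.2f}")
```

Probing only shows where a signal is linearly readable. To claim that a layer or head is causally responsible, the candidate sites would still need an intervention test such as activation patching.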
Safety-Steered Training
Once these moral features are mapped, we can actively steer them. Instead of chasing a completely "unbiased" model, which is arguably philosophically impossible, we propose "Safety-Steered Training." This allows deployers to explicitly define targeted ethical axioms and inject them directly into the model's latent space, making its behavior transparent, predictable, and aligned with the values they declare.
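To make "injecting into the latent space" concrete, here is a minimal sketch of activation steering, the inference-time analogue of the training-time injection the proposal's name suggests. It assumes GPT-2 loaded through Hugging Face transformers; the layer index, the scale, and the random steering vector are all placeholders, where in practice the vector would be derived from contrastive prompts expressing the target axiom.

```python
# Minimal inference-time steering sketch using a PyTorch forward hook on
# GPT-2. LAYER, ALPHA, and steering_vec are placeholders.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER, ALPHA = 6, 4.0                              # placeholder site and scale
steering_vec = torch.randn(model.config.n_embd)    # stand-in for a learned direction
steering_vec = steering_vec / steering_vec.norm()  # unit-norm

def add_steering(module, inputs, output):
    """Shift every token's residual stream along the steering direction."""
    hidden = output[0] + ALPHA * steering_vec  # broadcasts over batch and seq
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
try:
    ids = tok("The defendant should be", return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=20, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later calls run unsteered
```

A runtime hook like this is easy to audit and to remove; applying the same additive shift during fine-tuning instead would bake the injected axiom into the weights, which is the direction the training-time proposal points toward.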