Data Efficiency

Dataset Distillation via LGM

Compressing petabytes of noisy internet data into pure, high-signal training manifolds using Latent Gradient Matching.

The current paradigm of LLM pre-training relies on the "data scaling law"—the assumption that feeding models exponentially larger datasets inevitably leads to better reasoning. However, internet-scale datasets are saturated with noise, redundancy, and low-quality heuristics. This brute-force approach wastes enormous compute on tokens that contribute little usable signal.

Latent Gradient Matching (LGM)

Our Dataset Distillation framework operates on the principle of LGM. Instead of sampling data at random, LGM first computes the gradient trajectory a model follows when learning a concept from a small set of high-quality proprietary data. It then searches the raw, noisy dataset for token sequences whose induced gradients, once compressed, closely match that idealized trajectory.
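The core selection step can be sketched in miniature. The snippet below is an illustrative toy, not the production LGM pipeline: it uses a linear model with squared loss, and the names `per_example_gradient` and `lgm_select`, along with the cosine-similarity scoring rule, are assumptions made for the example. It shows the general shape of gradient matching: compute a reference gradient from a small clean set, then rank noisy candidates by how well their per-example gradients align with it.

```python
import numpy as np

def per_example_gradient(w, x, y):
    # Gradient of the squared loss 0.5 * (w @ x - y)**2 with respect to w,
    # for a single (x, y) example. A stand-in for a real model's gradient.
    return (w @ x - y) * x

def lgm_select(w, clean, noisy, k):
    # Reference direction: mean gradient induced by the high-quality set.
    ref = np.mean([per_example_gradient(w, x, y) for x, y in clean], axis=0)
    ref = ref / np.linalg.norm(ref)
    # Score each noisy candidate by cosine similarity to the reference.
    scores = []
    for x, y in noisy:
        g = per_example_gradient(w, x, y)
        norm = np.linalg.norm(g)
        scores.append(float(ref @ g / norm) if norm > 0 else -1.0)
    # Keep the k candidates whose gradients best match the reference.
    top = np.argsort(scores)[::-1][:k]
    return [noisy[i] for i in top]
```

In this toy setting, candidates consistent with the clean data's underlying relation score near 1.0, while mislabeled examples score negatively and are filtered out. A full-scale system would replace the linear gradient with the model's actual per-example gradients (or a compressed projection of them) and match against a trajectory of gradients rather than a single mean.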

The result is a distilled dataset that is orders of magnitude smaller yet retains the cognitive "signal" required for reasoning. By training on LGM-distilled data, we achieve GPT-4-class reasoning with models a fraction of the size of today's frontier systems, significantly lowering the barrier to entry for training advanced foundation models.