Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision
Best AI papers explained - A podcast by Enoch H. Kang

This paper introduces **Compute as Teacher (CaT)**, a method that converts a large language model's (LLM) inference-time exploration into **reference-free supervision**: the anchor model generates multiple parallel rollouts and then synthesizes a single, improved reference answer from them. This synthesized reference serves as a teacher signal, either for reinforcement-learning training (CaT-RL) or for an immediate inference-time gain (CaT). On **verifiable tasks** such as math, programmatic checks compare each rollout to the synthesized answer; on **non-verifiable tasks**, the anchor model proposes specific, auditable rubrics that an independent LLM judge scores to yield a fine-grained reward.

The study shows that CaT-RL significantly improves performance across multiple LLM families on both **mathematical reasoning (MATH-500)** and **non-verifiable dialogue (HealthBench)**, outperforming various selection and single-sample baselines and even achieving results competitive with human-annotated feedback. The core mechanism is the anchor policy reconciling contradictions and filling omissions across rollouts to construct an answer better than any single rollout, suggesting that compute can effectively substitute for missing human-labeled supervision.
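To make the pipeline concrete, here is a minimal Python sketch of the two reward paths described above. It assumes a hypothetical `anchor_llm.generate(prompt)` / `judge_llm.generate(prompt)` text-completion interface; the prompts and the boxed-answer check are illustrative stand-ins, not the paper's exact implementation.

```python
import re


def synthesize_reference(anchor_llm, question, rollouts):
    """Anchor policy reconciles its own parallel rollouts into one
    improved answer -- no gold reference is ever consulted."""
    candidates = "\n\n".join(
        f"Candidate {i + 1}:\n{r}" for i, r in enumerate(rollouts)
    )
    prompt = (
        f"Question:\n{question}\n\n{candidates}\n\n"
        "Reconcile the contradictions and omissions in the candidates "
        "above and write one best final answer."
    )
    return anchor_llm.generate(prompt)


def final_answer(text):
    """Pull the final boxed answer out of a math solution, if present."""
    m = re.search(r"\\boxed\{([^}]*)\}", text)
    return m.group(1).strip() if m else None


def verifiable_reward(rollout, reference):
    """Verifiable tasks (e.g. MATH-500): a programmatic check of the
    rollout's final answer against the synthesized reference."""
    return 1.0 if final_answer(rollout) == final_answer(reference) else 0.0


def rubric_reward(anchor_llm, judge_llm, question, reference, rollout):
    """Non-verifiable tasks (e.g. HealthBench): the anchor turns its
    synthesized answer into auditable yes/no rubric items, and an
    independent judge scores the rollout item by item."""
    rubric = [
        line.strip()
        for line in anchor_llm.generate(
            "From the reference answer below, list one binary criterion "
            f"per line that a good response to '{question}' must satisfy.\n"
            f"Reference:\n{reference}"
        ).splitlines()
        if line.strip()
    ]
    hits = sum(
        judge_llm.generate(
            f"Response:\n{rollout}\n\nCriterion: {item}\n"
            "Does the response satisfy the criterion? Answer yes or no."
        ).strip().lower().startswith("yes")
        for item in rubric
    )
    return hits / max(1, len(rubric))  # fraction of rubric items satisfied
```

In the CaT-RL setting, these rewards would score each of the original rollouts to provide the policy's training signal; in the inference-time variant, the synthesized reference itself is returned as the improved answer.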