Our method leverages three networks (a), which are trained in three alternating phases: the roll-out phase (b), the policy update phase (c), and the alignment phase (d). Grey boxes denote networks that are frozen during the respective phase, and dashed arrows indicate the gradient flow. (b) In the roll-out phase, the KL divergence between the proxy student and the teacher is used as a penalty term. (c) In addition to the policy gradient, the teacher encoder is updated by backpropagating through the KL divergence between the action distributions of the teacher and the proxy student. (d) Using student observations, the proxy student is aligned to the student, while the student is aligned to the teacher network, both via consistency losses.
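For concreteness, below is a minimal PyTorch sketch of the losses in the three phases, assuming Gaussian action distributions over continuous actions; the network architectures, observation dimensions, placeholder rollout data, and coefficients (e.g. `kl_coef`) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the three alternating training phases (b)-(d).
# All sizes, data, and coefficients below are hypothetical placeholders.
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence

def make_policy(obs_dim, act_dim):
    # Hypothetical policy head: MLP predicting the mean of a unit-variance Gaussian.
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))

teacher = make_policy(obs_dim=8, act_dim=2)  # acts on privileged teacher observations
proxy   = make_policy(obs_dim=8, act_dim=2)  # proxy student, also fed teacher observations
student = make_policy(obs_dim=4, act_dim=2)  # acts on student observations

def dist(net, obs, frozen=False):
    mean = net(obs)
    if frozen:
        mean = mean.detach()  # grey box: no gradient flows into this network
    return Normal(mean, torch.ones_like(mean))

# Placeholder rollout data (batch of 32 transitions).
teacher_obs, student_obs = torch.randn(32, 8), torch.randn(32, 4)
actions, rewards = torch.randn(32, 2), torch.randn(32)
kl_coef = 0.1  # illustrative penalty weight

# (b) Roll-out phase: the teacher-vs-proxy KL divergence acts as a penalty term.
with torch.no_grad():
    kl = kl_divergence(dist(teacher, teacher_obs), dist(proxy, teacher_obs)).sum(-1)
shaped_rewards = rewards - kl_coef * kl

# (c) Policy-update phase: policy gradient plus a differentiable KL term;
# the proxy student is frozen, so the KL gradient updates the teacher.
pg_loss = -(shaped_rewards * dist(teacher, teacher_obs).log_prob(actions).sum(-1)).mean()
kl_loss = kl_divergence(dist(teacher, teacher_obs),
                        dist(proxy, teacher_obs, frozen=True)).sum(-1).mean()
(pg_loss + kl_coef * kl_loss).backward()

# (d) Alignment phase: with rendered student observations, the proxy student is
# pulled toward the (frozen) student and the student toward the (frozen) teacher.
proxy_loss = kl_divergence(dist(student, student_obs, frozen=True),
                           dist(proxy, teacher_obs)).sum(-1).mean()
student_loss = kl_divergence(dist(teacher, teacher_obs, frozen=True),
                             dist(student, student_obs)).sum(-1).mean()
(proxy_loss + student_loss).backward()
```

The `frozen` flag mirrors the grey boxes in the figure: detaching a network's output blocks the gradient flow into it for that phase, while the other network in the KL term still receives gradients.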
@inproceedings{messikommer2025student,
  title     = {Student-Informed Teacher Training},
  author    = {Messikommer, Nico and Xing, Jiaxu and Aljalbout, Elie and Scaramuzza, Davide},
  booktitle = {International Conference on Learning Representations},
  year      = {2025},
}