InstructGPT-RLHF

Training language models to follow instructions with human feedback

The motivation is that the outputs of the GPT series are not aligned with user needs: the pre-training task of GPT is simply next-word prediction, whereas a model like ChatGPT is expected to be useful and helpful to users. Thus, the authors propose a 3-stage framework to fine-tune GPT-3.

First, they collect prompts (questions) from users and labelers together with labeler-written answers, and use these demonstrations to fine-tune GPT-3 with supervised learning.
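A minimal sketch of this supervised fine-tuning step, assuming a GPT-2 checkpoint from the `transformers` library as a stand-in for GPT-3 (whose weights are not public) and a tiny hypothetical set of labeler demonstrations:

```python
# Sketch of step 1 (supervised fine-tuning) on labeler demonstrations.
# GPT-2 is a stand-in for GPT-3; the demos below are hypothetical examples.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

demos = [
    {"prompt": "Explain gravity to a child.",
     "answer": "Gravity is what pulls things down toward the ground."},
    {"prompt": "Translate 'bonjour' to English.",
     "answer": "Hello."},
]

def collate(batch):
    texts = [d["prompt"] + "\n" + d["answer"] + tokenizer.eos_token for d in batch]
    enc = tokenizer(texts, return_tensors="pt", padding=True)
    # Standard causal-LM labels; padding positions are ignored in the loss.
    enc["labels"] = enc["input_ids"].masked_fill(enc["attention_mask"] == 0, -100)
    return enc

loader = DataLoader(demos, batch_size=2, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

model.train()
for batch in loader:
    out = model(**batch)   # next-token cross-entropy on the demonstrations
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The objective is the usual next-token prediction loss, only computed on human-written demonstrations instead of web text.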

Second, they ask labelers to rank multiple outputs of the step-1 model for the same prompt, and then train a reward model to match these rankings.
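The reward model can be sketched as a language-model backbone with a scalar head, trained with the pairwise ranking loss from the paper, -log σ(r(x, y_preferred) - r(x, y_rejected)); a ranking of K outputs yields K(K-1)/2 such pairs. A rough sketch, again with GPT-2 as a stand-in and a single hypothetical comparison:

```python
# Sketch of step 2 (reward model) with a scalar head on a GPT-2 backbone,
# trained on one hypothetical labeler comparison.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

class RewardModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = AutoModel.from_pretrained("gpt2")
        self.value_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Score each sequence by the hidden state of its last non-pad token.
        last = attention_mask.sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0)), last]
        return self.value_head(pooled).squeeze(-1)

def score(rm, texts):
    enc = tokenizer(texts, return_tensors="pt", padding=True)
    return rm(enc["input_ids"], enc["attention_mask"])

rm = RewardModel()
optimizer = torch.optim.AdamW(rm.parameters(), lr=1e-5)

# Hypothetical comparison: the labeler preferred the first answer.
prompt = "Explain gravity to a child."
chosen = "Gravity pulls things toward the ground."
rejected = "Gravity is a 17th-century tax."

r_chosen = score(rm, [prompt + "\n" + chosen])
r_rejected = score(rm, [prompt + "\n" + rejected])
loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
optimizer.step()
```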

Third, they further fine-tune the step-1 model with reinforcement learning so that its outputs score highly under the reward model from step 2.
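The paper optimizes this objective with PPO and adds a KL penalty that keeps the tuned policy close to the step-1 model. The sketch below replaces PPO with a bare policy-gradient (REINFORCE-style) update so the core idea stays visible, and uses a placeholder reward function standing in for the model trained in step 2:

```python
# Simplified sketch of step 3: sample a response, score it with the reward
# model minus a KL penalty to the step-1 (SFT) model, and do one policy-
# gradient update. The paper uses PPO; this is only the core idea.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2")   # model being tuned
ref = AutoModelForCausalLM.from_pretrained("gpt2")      # frozen SFT model
ref.eval()

def reward_model(prompt, response):
    return torch.tensor(1.0)  # placeholder for the step-2 reward model

prompt = "Explain gravity to a child."
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Sample a response from the current policy.
gen = policy.generate(prompt_ids, do_sample=True, max_new_tokens=32,
                      pad_token_id=tokenizer.eos_token_id)

def token_logprobs(model, full_ids, n_prompt):
    logits = model(full_ids).logits[:, :-1]              # predict next token
    logps = torch.log_softmax(logits, dim=-1)
    picked = logps.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return picked[:, n_prompt - 1:]                      # response tokens only

logp = token_logprobs(policy, gen, prompt_ids.size(1))
with torch.no_grad():
    logp_ref = token_logprobs(ref, gen, prompt_ids.size(1))

beta = 0.1                                               # KL penalty weight
response_text = tokenizer.decode(gen[0, prompt_ids.size(1):])
r = reward_model(prompt, response_text)

# Total reward = RM score minus KL(policy || SFT model) on the response.
total_reward = r - beta * (logp.detach() - logp_ref).sum()
loss = -(total_reward * logp.sum())                      # REINFORCE loss

optimizer = torch.optim.AdamW(policy.parameters(), lr=1e-6)
loss.backward()
optimizer.step()
```

The KL term is what prevents the policy from drifting into degenerate outputs that merely exploit the reward model.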

The overall procedure is called reinforcement learning from human feedback, or RLHF.