RLHF: Reinforcement Learning from Human Feedback

Like everyone else, I am experiencing closed research for the first time. Since I was in college, all frontier research has been openly published and peer-reviewed, until recently. And I believe openness ultimately advances science more than closedness.

If we aim to match the performance of ChatGPT through open source, I believe we need to start taking data more seriously. A substantial part of ChatGPT’s effectiveness might not come from, say, a specific ML architecture or training technique. More likely, it comes from the breadth and quality of the instruction data.

To put it bluntly, fine-tuning on mediocre instruction data is a waste of compute. Let’s take a look at what has changed in the training data and paradigm—how we are now formatting the training data differently and therefore learning differently than in past large-scale pre-training.
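To make that contrast concrete, here is a minimal, hypothetical sketch of how the data itself looks different. The field names follow the common Alpaca-style instruction format; they are an illustrative assumption, not the actual schema behind ChatGPT.

```python
# Hypothetical illustration of the data-format shift.
# Field names follow the common Alpaca-style convention; this is an
# assumption for illustration, not ChatGPT's actual schema.

# Classic large-scale pre-training: raw text, next-token prediction.
pretraining_document = (
    "The mitochondria is the powerhouse of the cell. It generates most "
    "of the chemical energy needed to power the cell's reactions."
)

# Instruction tuning: explicit (instruction, input, output) records,
# so the model learns to follow requests rather than just continue text.
instruction_example = {
    "instruction": "Summarize the following passage in one sentence.",
    "input": pretraining_document,
    "output": "Mitochondria produce most of a cell's chemical energy.",
}

def format_prompt(example: dict) -> str:
    """Render an instruction record into a single training string."""
    return (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Input:\n{example['input']}\n\n"
        f"### Response:\n{example['output']}"
    )

if __name__ == "__main__":
    print(format_prompt(instruction_example))
```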

RLHF stands for Reinforcement Learning from Human Feedback. It has two main components (a toy code sketch of both follows the list):

  1. Reinforcement Learning (RL)
  2. Human Feedback (HF)
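
To ground these two components, here is a toy sketch in PyTorch: human feedback enters as pairwise preferences that train a reward model, and reinforcement learning (here a bare-bones REINFORCE-style update rather than PPO) then nudges a policy toward responses the reward model scores highly. The tiny model sizes and the vector-valued "responses" are illustrative assumptions, not the actual ChatGPT pipeline.

```python
import torch
import torch.nn as nn

# Toy setup: "responses" are fixed-size feature vectors so the example stays
# self-contained. Real RLHF operates on token sequences from a language model.
FEATURE_DIM = 16

class RewardModel(nn.Module):
    """Scores a response; trained from human pairwise preferences (the HF part)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(FEATURE_DIM, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

reward_model = RewardModel()
rm_optim = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# 1) Human Feedback: labelers indicate which of two responses they prefer.
#    Bradley-Terry style loss: -log sigmoid(r_chosen - r_rejected).
chosen = torch.randn(64, FEATURE_DIM)    # features of preferred responses
rejected = torch.randn(64, FEATURE_DIM)  # features of dispreferred responses
for _ in range(100):
    rm_loss = -torch.nn.functional.logsigmoid(
        reward_model(chosen) - reward_model(rejected)
    ).mean()
    rm_optim.zero_grad()
    rm_loss.backward()
    rm_optim.step()

# 2) Reinforcement Learning: a toy "policy" samples a response and is updated
#    with a REINFORCE-style gradient to increase the reward model's score.
policy_mean = torch.zeros(FEATURE_DIM, requires_grad=True)
policy_optim = torch.optim.Adam([policy_mean], lr=1e-2)
for _ in range(100):
    dist = torch.distributions.Normal(policy_mean, torch.ones(FEATURE_DIM))
    response = dist.sample()                  # sample a "response"
    reward = reward_model(response).detach()  # score it; no gradient through the RM
    rl_loss = -(dist.log_prob(response).sum() * reward)
    policy_optim.zero_grad()
    rl_loss.backward()
    policy_optim.step()
```

In practice, the policy is the language model itself and the RL step typically uses PPO with a KL penalty against the original model, but the shape of the loop is the same: human preferences train a reward model, and RL optimizes the model against that reward.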
