Another interesting insight from the paper:
Reinforcement learning proved highly effective, particularly given its cost and time effectiveness. Our findings underscore that the crucial determinant of RLHF’s success lies in the synergy it fosters between humans and LLMs throughout the annotation process. Even with proficient annotators, each individual writes with significant variation. A model fine-tuned on SFT annotation learns this diversity, including, unfortunately, the tail-end of poorly executed annotation. Furthermore, the model’s performance is capped by the writing abilities of the most skilled annotators.
And one more:
Surprisingly, we found that the outputs sampled from the resulting SFT model were often competitive with SFT data handwritten by human annotators, suggesting that we could reprioritize and devote more annotation effort to preference-based annotation for RLHF.
In other words, the SFT model (fine-tuned on pre-prepared, carefully cleaned data) already performs quite well on its own, so instead of spending money on humans hand-writing "ideal" model responses, you can switch entirely to preference annotation (those "A is better than B" comparisons).
As the man said, High-Quality Data Is All We Need
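A minimal sketch (my own illustration, not code from the paper) of how those "A is better than B" annotations are actually used: a reward model is trained with a pairwise (Bradley-Terry) loss that pushes the score of the chosen answer above the score of the rejected one, and that ranking signal is what RLHF then optimizes the LLM against. The model, embeddings, and data below are toy placeholders.

import torch
import torch.nn as nn

class TinyRewardModel(nn.Module):
    """Maps a response embedding to a single scalar reward."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scorer(x).squeeze(-1)  # (batch,) scalar rewards

def pairwise_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: maximize sigmoid(r_chosen - r_rejected)
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# Toy "annotations": each pair is (embedding of answer A, embedding of answer B),
# where the annotator marked A as the better response.
torch.manual_seed(0)
chosen = torch.randn(64, 16) + 0.5   # pretend better answers live in a shifted region
rejected = torch.randn(64, 16)

model = TinyRewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

for step in range(200):
    loss = pairwise_loss(model(chosen), model(rejected))
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, the reward model should rank "chosen" above "rejected" most of the time.
acc = (model(chosen) > model(rejected)).float().mean().item()
print(f"pairwise ranking accuracy: {acc:.2f}")

The point of the sketch: collecting such comparisons is much cheaper per example than having annotators write full "ideal" responses, which is exactly the reallocation of annotation effort the paper describes.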
Source: Сиолошная
2023-07-18 20:29:40