Dataset Mixture
They trained on 13T tokens.
CommonCrawl & RefinedWeb are both 5T.
Remove the duplication of tokens from multiple epochs and we get to a much more reasonable number of "unaccounted for" tokens: the "secret" data.
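As a rough back-of-the-envelope check (the 13T total, the ~5T web corpus size, and the epoch count here are all rumored or assumed, not confirmed):

```python
# Back-of-the-envelope token accounting. All figures are rumored or assumed:
# ~13T tokens seen in training, ~5T unique web tokens (CommonCrawl/RefinedWeb
# scale), and an assumed ~2 epochs over the web data.
total_tokens_seen = 13e12       # rumored total tokens seen during training
web_unique_tokens = 5e12        # rumored unique web tokens
assumed_web_epochs = 2          # assumption: web data repeated ~2 epochs

web_tokens_seen = web_unique_tokens * assumed_web_epochs   # ~10T
unaccounted = total_tokens_seen - web_tokens_seen          # ~3T of "secret" data

print(f"Unaccounted-for tokens: ~{unaccounted / 1e12:.0f}T")
```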
By this point there are already rumors that parts of it came from Twitter, Reddit & YouTube.
[Rumors that are starting to become lawsuits]
Some speculations are:
- LibGen (4M+ books)
- Sci-Hub (80M+ papers)
- All of GitHub
My own opinion:
The missing dataset is a custom dataset of college textbooks, collected by hand for as many courses as possible.
These are very easy to convert to txt files and then, with self-instruct, into instruction form (a rough sketch of that pipeline follows below).
This creates the "illusion" that GPT-4 "is smart" no matter who uses it.
Computer scientist? Sure! It can help you with your questions about P != NP.
Philosophy major? It can totally talk to you about epistemology.
Don't you see?
It was trained on the textbooks. It is so obvious.
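To make the claim concrete, here is a minimal sketch of the kind of pipeline described above: textbook PDFs converted to plain text, then a self-instruct-style prompt that turns each chunk into Q/A pairs. The "textbooks/" folder, the prompt wording, and call_llm are all my own placeholders, not anything OpenAI has described.

```python
# Minimal sketch of "textbooks -> txt -> self-instruct-style instruction data".
# Everything here (folder name, prompt, call_llm) is a placeholder, not a
# description of OpenAI's actual pipeline.
from pathlib import Path
from pypdf import PdfReader

PROMPT = (
    "Below is an excerpt from a college textbook.\n\n{excerpt}\n\n"
    "Write three question/answer pairs a student could be asked about it."
)

def call_llm(prompt: str) -> str:
    # Placeholder: swap in a real call to an instruction-following model.
    return "Q: ...\nA: ..."

def pdf_to_text(path: Path) -> str:
    # Step 1: textbook PDF -> plain text.
    return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)

def to_instruction_data(text: str, chunk_chars: int = 4000) -> list[str]:
    # Step 2: chunk the text and turn each chunk into instruction-style Q/A.
    chunks = [text[i:i + chunk_chars] for i in range(0, len(text), chunk_chars)]
    return [call_llm(PROMPT.format(excerpt=chunk)) for chunk in chunks]

for pdf_path in Path("textbooks").glob("*.pdf"):
    examples = to_instruction_data(pdf_to_text(pdf_path))
    print(pdf_path.name, len(examples), "instruction examples")
```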
There are also papers that try to forcibly extract memorized parts of books from GPT-4 to understand what it was trained on.
There are some books it knows so well that it has definitely seen them.
Moreover, if I remember correctly, it even knows the unique IDs of Project Euler exercises.
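For a sense of how such extraction probes work, here is a minimal sketch of the basic idea: prompt the model with the start of a known passage and measure how much of the true continuation it reproduces verbatim. GPT-2 is used only as a locally runnable stand-in and the Dickens passage is just an example; the actual papers probe GPT-4 through its API with their own book corpora and more careful metrics.

```python
# Minimal sketch of a memorization probe: prompt the model with the start of a
# known passage and see how much of the real continuation it reproduces.
# GPT-2 stands in for GPT-4 here so the sketch runs locally.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prefix = "It was the best of times, it was the worst of times, it was the age of "
true_continuation = "wisdom, it was the age of foolishness"

output = generator(prefix, max_new_tokens=40, do_sample=False)[0]["generated_text"]
completion = output[len(prefix):]

# Count how many leading characters of the completion match the real text;
# consistently long matches across a book suggest it was in the training data.
match_len = 0
for got, expected in zip(completion, true_continuation):
    if got != expected:
        break
    match_len += 1

print(f"Model reproduced {match_len}/{len(true_continuation)} characters verbatim")
```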
Source: gonzo-обзоры ML статей
2023-07-11 15:14:08