From a research standpoint, GPT2 is groundbreaking in two ways. One is its size, says Dario Amodei, OpenAI’s research director. The models “were 12 times bigger, and the dataset was 15 times bigger and much broader” than the previous state-of-the-art AI model. It was trained on a dataset containing about 10m articles, selected by trawling the social news site Reddit for links with more than three votes. The vast collection of text weighed in at 40 GB, enough to store about 35,000 copies of Moby Dick.
The amount of data GPT2 was trained on directly affected its quality, giving it more knowledge of how to understand written text. It also led to the second breakthrough. GPT2 is far more general purpose than previous text models. By structuring the text that is input, it can perform tasks including translation and summarisation, and pass simple reading comprehension tests, often performing as well or better than other AIs that have been built specifically for those tasks.
Tip of the hat to Chris Aldrich.