On Sun, Mar 5, 2023 at 8:39 PM Luis (lu.is) <l...@lu.is> wrote:

> On Feb 22, 2023 at 9:28 AM -0800, Sage Ross <ragesoss+wikipe...@gmail.com>,
> wrote:
>
> Luis,
>
> OpenAI researchers have released some info about data sources that
> trained GPT-3 (and hence ChatGPT): https://arxiv.org/abs/2005.14165
>
> See section 2.2, starting on page 8 of the PDF.
>
> The full text of English Wikipedia is one of five sources, the others
> being CommonCrawl, a smaller subset of scraped websites based on
> upvoted reddit links, and two unrevealed datasets of scanned books.
> (I've read speculation that one of these datasets is basically the
> Library Genesis archive.) Wikipedia is much smaller than the other
> datasets, although they did weight it somewhat more heavily than any
> other dataset. With the extra weighting, they say Wikipedia accounts
> for 3% of the total training.
>
>
> Thanks, Sage. Facebook's recently-released LLaMa also shares some of their
> training sources, it turns out, with similar weighting for Wikipedia - only
> 4.5% of training text, but more heavily weighted than most other sources:
>
> https://twitter.com/GuillaumeLample/status/1629151234597740550
>
Those stats undercount Wikipedia's share, since the top source (CommonCrawl) itself includes Wikipedia as its third-largest source:

https://commoncrawl.github.io/cc-crawl-statistics/plots/domains
_______________________________________________
Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at:
https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and
https://meta.wikimedia.org/wiki/Wikimedia-l
Public archives at
https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/message/6UKCJWOUR2KVTS7QZYKPMKQGONXZ72QR/
To unsubscribe send an email to wikimedia-l-le...@lists.wikimedia.org