Anders, do you have a citation for “use Wikipedia content considerably”?

Lots of early-ish ML work was heavily dependent on Wikipedia, but 
state-of-the-art Large Language Models are trained on vast quantities of text, 
of which Wikipedia is only a small part. ChatGPT does not share their data 
sources (as far as I know) but the project released their Pile a 
few years back, and that already had Wikipedia as < 5% of the text data; I 
think it is safe to assume that the percentage is smaller for newer models.

Techniques to improve reliability of LLM output may rely more heavily on 
Wikipedia. For example, Facebook uses Wikipedia rather heavily in this 
*research paper*. But I have seen no evidence 
that techniques like that are in use by OpenAI, or that they’re specifically 
trained on Wikipedia. If you’ve seen discussion of that, or evidence from 
output suggesting it, that’d be interesting and important!

On Feb 20, 2023 at 1:52 AM -0800, Anders Wennersten wrote:
> Bing with ChatGPT has now been released by Microsoft.
> And from what I understand they use Wikipedia content considerably. If
> you ask "Who is A B" and A B is not widely known, the result is more or
> less identical to the content of the Wikipedia article (but worse, as
> it "makes up" facts that are incorrect).
> In a way I am glad to see Wikipedia is fully relevant even in this
> emerging AI-driven search world. But Google search has been careful to
> always have a link to Wikipedia beside their made-up summary of facts,
> which is missing here (yet?). And as for licences, they are all ignored.
> So if this is the future, the number of accesses from users to Wikipedia
> will collapse, and also their willingness to donate... (but our content
> is still a cornerstone of knowledge)
> Anders
> (I got a lot of these facts from an article in a major Swedish newspaper
> by its tech editor. He started by asking for facts about himself, and when
> he received facts from his Wikipedia article plus being credited with a
> book he had nothing to do with, he tried to tell ChatGPT about this error.
> ChatGPT only got angry, accusing the tech editor of lying, and in the end
> cut off the conversation, as ChatGPT continued to treat the tech editor as
> a liar and vandal.)
> _______________________________________________
> Wikimedia-l mailing list --, guidelines at: 
> and 
> Public archives at 
> To unsubscribe send an email to