This is an important development for editors to be aware of - we're going to have to be increasingly on the lookout for sources using ML-generated bullshit. Here are two instances I'm aware of this week:
https://www.thenation.com/article/culture/internet-archive-publishers-lawsuit-chatbot/

> In late February, Tyler Cowen, a libertarian economics professor at George Mason University, published a blog post titled “Who was the most important critic of the printing press in the 17th century?” <https://web.archive.org/web/20230305055906/https:/marginalrevolution.com/marginalrevolution/2023/02/who-was-the-most-important-critic-of-the-printing-press-in-the-17th-century.html>. Cowen’s post contended that the polymath and statesman Francis Bacon was an “important” critic of the printing press; unfortunately, the post contains long, fake quotes attributed to Bacon’s *The Advancement of Learning* (1605), complete with false chapter and section numbers.
>
> Tech writer Mathew Ingram drew attention to the fabrications a few days later <https://newsletter.mathewingram.com/tyler-cowen-francis-bacon-and-the-chatgpt-engine/>, noting that Cowen has been writing approvingly about the AI chatbot ChatGPT <https://marginalrevolution.com/marginalrevolution/2023/02/how-should-you-talk-to-chatgpt-a-users-guide.html> for some time now; several commenters on Cowen’s post assumed the fake quotes must be the handiwork of ChatGPT. (Cowen did not reply to e-mailed questions regarding the post by press time, and later removed the post entirely, with no explanation whatsoever. However, a copy remains at the Internet Archive’s Wayback Machine).

https://www.vice.com/en/article/3akz8y/ai-injected-misinformation-into-article-claiming-misinformation-in-navalny-doc

> An article claiming to identify misinformation in an Oscar-winning documentary about imprisoned Russian dissident Alexei Navalny is itself full of misinformation, thanks to the author using AI.
>
> Investigative news outlet *The Grayzone* recently published an article <https://thegrayzone.com/2023/03/13/oscar-navalny-documentary-misinformation/> that included AI-generated text as a source for its information. The piece <http://web.archive.org/web/20230314131551/https://thegrayzone.com/2023/03/13/oscar-navalny-documentary-misinformation/>, “Oscar-winning ‘Navalny’ documentary is packed with misinformation” by Lucy Komisar, included hyperlinks to PDFs <http://web.archive.org/web/20230314121144/https://www.thekomisarscoop.com/wp-content/uploads/2023/02/Many-contributors-have-backgrounds-that-suggest-they-are-biased-in-favor-of-western-governments-and-against-its-enemies.pdf> uploaded to the author’s personal website that appear to be screenshots of conversations she had with ChatSonic, a free generative AI chatbot that advertises itself as a ChatGPT alternative that can “write factual trending content” using Google search results.

That said, I don't think this is anything to be too stressed about; the Grayzone is already a deprecated source, and blogs like Marginal Revolution are treated with caution, though Cowen has sufficient credentials to be treated as a reliable expert.

On Fri, Mar 17, 2023 at 11:23 AM Kimmo Virtanen <kimmo.virta...@wikimedia.fi> wrote:

> Hi,
>
> The development of open-source large language models is going forward. GPT-4 was released, and it seems that it passed the bar exam and tried to hire humans to solve CAPTCHAs that were too complex for it. However, development on the open-source and hacking side has been fast, and it seems that all the pieces are in place for running LLM models on personal hardware (and in web browsers).
> The biggest missing piece is fine-tuning of open-source models such as NeoX for English. For multilingual and multimodal use (for example, images + text), a suitable model is also needed.
>
> So this is a kind of link dump of things relevant to creating an open-source LLM model and service, and a recap of where the hacker community is now.
>
> 1.) Creation of an initial unaligned model
>
>    - Possible models:
>       - 20B Neo(X) <https://github.com/EleutherAI/gpt-neox> by EleutherAI (Apache 2.0)
>       - Fairseq Dense <https://huggingface.co/KoboldAI/fairseq-dense-13B> by Facebook (MIT licence)
>       - LLaMA <https://ai.facebook.com/blog/large-language-model-llama-meta-ai/> by Facebook (custom license, leaked, research use only)
>       - Bloom <https://huggingface.co/bigscience/bloom> by BigScience (custom license <https://huggingface.co/spaces/bigscience/license>, open, non-commercial)
>
> 2.) Fine-tuning or alignment
>
>    - Example: Stanford Alpaca is a ChatGPT-style instruction fine-tune of LLaMA
>    - Alpaca: A Strong, Replicable Instruction-Following Model <https://crfm.stanford.edu/2023/03/13/alpaca.html>
>    - Train and run Stanford Alpaca on your own machine <https://replicate.com/blog/replicate-alpaca>
>    - GitHub: Alpaca-LoRA: Low-Rank LLaMA Instruct-Tuning <https://github.com/tloen/alpaca-lora>
>
> 3.) 8-, 4-, and 3-bit quantization of models for reduced hardware requirements
>
>    - Running LLaMA 7B and 13B on a 64GB M2 MacBook Pro with llama.cpp <https://til.simonwillison.net/llms/llama-7b-m2>
>    - GitHub: bloomz.cpp <https://github.com/NouamaneTazi/bloomz.cpp> & llama.cpp <https://github.com/ggerganov/llama.cpp> (C++-only versions)
>    - Int-4 LLaMa is not enough - Int-3 and beyond <https://nolanoorg.substack.com/p/int-4-llama-is-not-enough-int-3-and>
>    - How is LLaMa.cpp possible? <https://finbarrtimbers.substack.com/p/how-is-llamacpp-possible>
>
> 4.) Easy-to-use interfaces
>
>    - Transformer.js <https://xenova.github.io/transformers.js/> (WebAssembly libraries to run LLM models in the browser)
>    - Dalai <https://github.com/cocktailpeanut/dalai> (run LLaMA and Alpaca on your own computer as a Node.js web service)
>    - web-stable-diffusion <https://github.com/mlc-ai/web-stable-diffusion> (Stable Diffusion image generation in the browser)
>
> Br,
> -- Kimmo Virtanen
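To make step 3 in Kimmo's list concrete: quantization stores each weight with only a few bits plus a per-block scale factor, trading a little precision for a much smaller memory footprint. Below is a minimal NumPy sketch of block-wise symmetric 4-bit ("absmax") quantization; it only illustrates the general idea and is not the exact scheme llama.cpp or bloomz.cpp implement.

    import numpy as np

    def quantize_int4(w, block_size=32):
        # Block-wise symmetric "absmax" quantization to the int4 range [-7, 7].
        # Real implementations pack two 4-bit codes per byte; int8 is used here
        # only because NumPy has no 4-bit dtype.
        blocks = w.reshape(-1, block_size)
        scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0 + 1e-12
        codes = np.clip(np.round(blocks / scales), -7, 7).astype(np.int8)
        return codes, scales

    def dequantize_int4(codes, scales, shape):
        # Reconstruct approximate float weights from codes and per-block scales.
        return (codes.astype(np.float32) * scales).reshape(shape)

    w = np.random.randn(4096, 4096).astype(np.float32)  # stand-in for one weight matrix
    codes, scales = quantize_int4(w)
    w_hat = dequantize_int4(codes, scales, w.shape)
    print("mean absolute rounding error:", float(np.abs(w - w_hat).mean()))
    # Stored as 4-bit codes plus one scale per 32 weights (fp16 in practice),
    # this is about 4.5 bits per weight instead of 32, roughly a 7x reduction.

Quantizing the weights ahead of time along these general lines is what lets the 7B and 13B models in the links above fit into laptop RAM.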
> On Fri, Mar 17, 2023 at 1:53 PM Kimmo Virtanen <kimmo.virta...@gmail.com> wrote:
>
>> On Mon, Mar 6, 2023 at 6:50 AM Steven Walling <steven.wall...@gmail.com> wrote:
>>
>>> On Sun, Mar 5, 2023 at 8:39 PM Luis (lu.is) <l...@lu.is> wrote:
>>>
>>>> On Feb 22, 2023 at 9:28 AM -0800, Sage Ross <ragesoss+wikipe...@gmail.com>, wrote:
>>>>
>>>> Luis,
>>>>
>>>> OpenAI researchers have released some info about data sources that trained GPT-3 (and hence ChatGPT): https://arxiv.org/abs/2005.14165
>>>>
>>>> See section 2.2, starting on page 8 of the PDF.
>>>>
>>>> The full text of English Wikipedia is one of five sources, the others being CommonCrawl, a smaller subset of scraped websites based on upvoted reddit links, and two unrevealed datasets of scanned books. (I've read speculation that one of these datasets is basically the Library Genesis archive.) Wikipedia is much smaller than the other datasets, although they did weight it somewhat more heavily than any other dataset. With the extra weighting, they say Wikipedia accounts for 3% of the total training.
>>>>
>>>> Thanks, Sage. Facebook's recently released LLaMA also shares some of its training sources, it turns out, with similar weighting for Wikipedia - only 4.5% of training text, but more heavily weighted than most other sources:
>>>>
>>>> https://twitter.com/GuillaumeLample/status/1629151234597740550
>>>
>>> Those stats are undercounting, since the top source (CommonCrawl) also itself includes Wikipedia as its third largest source.
>>>
>>> https://commoncrawl.github.io/cc-crawl-statistics/plots/domains
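As a back-of-the-envelope illustration of what "weighted more heavily" means in practice: if a source's share of the training mix is larger than its share of the raw token pool, the model effectively sees it more than once. The figures below are rough outside assumptions used only for the arithmetic; the 3% mix share is the one quoted in the thread above.

    total_training_tokens = 300e9  # approximate GPT-3 training budget in tokens (assumption)
    wikipedia_tokens = 3e9         # approximate size of English Wikipedia in tokens (assumption)
    wikipedia_mix_share = 0.03     # "3% of the total training", per Sage's summary above

    effective_passes = wikipedia_mix_share * total_training_tokens / wikipedia_tokens
    print(f"Wikipedia is seen roughly {effective_passes:.1f} times during training")
    # Roughly 3 passes over Wikipedia, while the far larger CommonCrawl portion
    # is sampled for less than one full pass under the same kind of arithmetic.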
_______________________________________________
Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l
Public archives at https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/message/WZAKGYUJGUOAXQLO5OUGB5EL67OXQ5C6/
To unsubscribe send an email to wikimedia-l-le...@lists.wikimedia.org