I really feel like we're getting into pretty aggressive corporate abuse of the Wikipedia copyleft.
On Fri, Mar 17, 2023, 4:45 PM Adam Sobieski <adamsobie...@hotmail.com> wrote:

Hello,

I would like to point to "Copilot" in the Edge browser as being potentially relevant to Wikipedia [1][2].

It is foreseeable that end-users will be able to open sidebars in their Web browsers and chat with large language models about the contents of specific Web documents, e.g., encyclopedia articles. In a Web browser there is task context available, including the document or article in the user's current tab, potentially including the user's scroll position, and potentially including any content the user has selected or highlighted.

I, for one, am thinking about how Web standards, e.g., Web schema, can be used to amplify these features and capabilities for end-users.

Best regards,
Adam Sobieski

[1] https://learn.microsoft.com/en-us/deployedge/microsoft-edge-relnote-stable-channel?ranMID=24542#version-1110166141-march-13-2023
[2] https://www.engadget.com/microsoft-edge-ai-copilot-184033427.html
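As a rough illustration of the kind of task context such a sidebar could collect, here is a minimal TypeScript sketch. The /chat endpoint and the PageContext shape are hypothetical, not any real browser API; only standard DOM and Fetch APIs are assumed, plus an optional read of any schema.org JSON-LD markup the page provides.

    // Minimal sketch: gather task context from the current tab for a sidebar chat.
    // The /chat endpoint and PageContext shape are hypothetical, not any real API.

    interface PageContext {
      url: string;
      title: string;
      selection: string;        // text the user has highlighted, if any
      scrollFraction: number;   // how far down the page the user is (0..1)
      schemaOrg: string | null; // schema.org JSON-LD markup, if the page provides it
      bodyText: string;         // page text, truncated to respect the model's context window
    }

    function collectPageContext(maxChars = 8000): PageContext {
      const scrollable = document.documentElement.scrollHeight - window.innerHeight;
      return {
        url: location.href,
        title: document.title,
        selection: window.getSelection()?.toString() ?? "",
        scrollFraction: scrollable > 0 ? window.scrollY / scrollable : 0,
        schemaOrg: document.querySelector('script[type="application/ld+json"]')?.textContent ?? null,
        bodyText: document.body.innerText.slice(0, maxChars),
      };
    }

    async function askAboutPage(question: string): Promise<string> {
      // Hypothetical sidebar backend; a real sidebar would call whatever chat
      // service the browser or extension exposes.
      const response = await fetch("/chat", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ question, context: collectPageContext() }),
      });
      const { answer } = await response.json();
      return answer;
    }

A sidebar UI would then call askAboutPage() with whatever the user types into the chat box.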
------------------------------
From: Kimmo Virtanen <kimmo.virta...@wikimedia.fi>
Sent: Friday, March 17, 2023 8:17 AM
To: Wikimedia Mailing List <wikimedia-l@lists.wikimedia.org>
Subject: [Wikimedia-l] Re: Bing-ChatGPT

Hi,

The development of open-source large language models is moving forward. GPT-4 was released, and it reportedly passed the bar exam and hired a human to solve CAPTCHAs that were too complex for it. Meanwhile, development on the open-source and hacking side has been fast, and it seems that all the pieces are in place for running LLM models on personal hardware (and in web browsers). The biggest missing piece is fine-tuning of an open-source model such as NeoX for the English language. For multilingual and multimodal use (for example, images + text), a suitable model is also still needed.

So this is a link dump of things relevant to creating an open-source LLM model and service, and a recap of where the hacker community is now.

1.) Creation of an initial unaligned model

- Possible models:
  - 20B Neo(X) <https://github.com/EleutherAI/gpt-neox> by EleutherAI (Apache 2.0)
  - Fairseq Dense <https://huggingface.co/KoboldAI/fairseq-dense-13B> by Facebook (MIT license)
  - LLaMA <https://ai.facebook.com/blog/large-language-model-llama-meta-ai/> by Facebook (custom license; leaked, research use only)
  - Bloom <https://huggingface.co/bigscience/bloom> by BigScience (custom license <https://huggingface.co/spaces/bigscience/license>: open, non-commercial)

2.) Fine-tuning or alignment

- Example: Stanford Alpaca is ChatGPT-fine-tuned LLaMA
- Alpaca: A Strong, Replicable Instruction-Following Model <https://crfm.stanford.edu/2023/03/13/alpaca.html>
- Train and run Stanford Alpaca on your own machine <https://replicate.com/blog/replicate-alpaca>
- GitHub: Alpaca-LoRA: Low-Rank LLaMA Instruct-Tuning <https://github.com/tloen/alpaca-lora>

3.) 8-, 4-, and 3-bit quantization of models for reduced hardware requirements (a small sketch of the basic idea follows after this message)

- Running LLaMA 7B and 13B on a 64GB M2 MacBook Pro with llama.cpp <https://til.simonwillison.net/llms/llama-7b-m2>
- GitHub: bloomz.cpp <https://github.com/NouamaneTazi/bloomz.cpp> & llama.cpp <https://github.com/ggerganov/llama.cpp> (C++-only versions)
- Int-4 LLaMa is not enough - Int-3 and beyond <https://nolanoorg.substack.com/p/int-4-llama-is-not-enough-int-3-and>
- How is LLaMa.cpp possible? <https://finbarrtimbers.substack.com/p/how-is-llamacpp-possible>

4.) Easy-to-use interfaces (a sketch of in-browser use also follows after this message)

- Transformers.js <https://xenova.github.io/transformers.js/> (WebAssembly libraries to run LLM models in the browser)
- Dalai <https://github.com/cocktailpeanut/dalai> (run LLaMA and Alpaca on your own computer as a Node.js web service)
- web-stable-diffusion <https://github.com/mlc-ai/web-stable-diffusion> (Stable Diffusion image generation in the browser)

Br,
-- Kimmo Virtanen
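On point 3 above: the basic idea behind 8/4/3-bit quantization is simply storing each weight as a small integer plus a shared scale factor. Here is a minimal TypeScript sketch of symmetric round-to-nearest quantization; real implementations such as llama.cpp work on small blocks of weights with a per-block scale and pack several low-bit values per byte, which is omitted here.

    // Sketch: symmetric round-to-nearest quantization of a block of weights.
    // Real implementations (e.g. llama.cpp) quantize in small blocks, each with
    // its own scale, and pack several low-bit values per byte; omitted here.

    function quantize(weights: Float32Array, bits: number) {
      const qmax = (1 << (bits - 1)) - 1;  // e.g. 7 for 4-bit, 3 for 3-bit
      const absMax = weights.reduce((m, w) => Math.max(m, Math.abs(w)), 0);
      const scale = absMax / qmax || 1;
      const q = Int8Array.from(weights, (w) =>
        Math.max(-qmax, Math.min(qmax, Math.round(w / scale)))
      );
      return { q, scale };
    }

    function dequantize(q: Int8Array, scale: number): Float32Array {
      return Float32Array.from(q, (v) => v * scale);
    }

    // A 7B-parameter model stored this way drops from ~14 GB in fp16 to ~3.5 GB
    // at 4 bits (plus a little overhead for the scales), which is why it fits
    // on a laptop.
    const { q, scale } = quantize(new Float32Array([0.12, -0.5, 0.31, 0.02]), 4);
    console.log(dequantize(q, scale));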
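And on point 4: from the developer side, running a model in the browser with Transformers.js looks roughly like this. The model name and generation options are only examples of the small converted checkpoints the project hosts; the first call downloads the ONNX weights and runs them via WebAssembly in the browser.

    // Sketch: summarizing the current page entirely in the browser with
    // Transformers.js. Larger models quickly hit browser memory limits.

    import { pipeline } from "@xenova/transformers";

    async function summarizeInBrowser(articleText: string): Promise<string> {
      // Downloads and caches the converted model on first use.
      const summarize = await pipeline("summarization", "Xenova/distilbart-cnn-6-6");
      const output = await summarize(articleText, { max_new_tokens: 120 });
      const first = Array.isArray(output) ? output[0] : output;
      return (first as { summary_text: string }).summary_text;
    }

    summarizeInBrowser(document.body.innerText.slice(0, 4000)).then(console.log);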
On Mon, Mar 6, 2023 at 6:50 AM Steven Walling <steven.wall...@gmail.com> wrote:

> On Sun, Mar 5, 2023 at 8:39 PM Luis (lu.is) <l...@lu.is> wrote:
>
> > On Feb 22, 2023 at 9:28 AM -0800, Sage Ross <ragesoss+wikipe...@gmail.com> wrote:
> >
> > Luis,
> >
> > OpenAI researchers have released some information about the data sources that trained GPT-3 (and hence ChatGPT): https://arxiv.org/abs/2005.14165
> >
> > See section 2.2, starting on page 8 of the PDF.
> >
> > The full text of English Wikipedia is one of five sources, the others being Common Crawl, a smaller subset of scraped websites based on upvoted Reddit links, and two unrevealed datasets of scanned books. (I've read speculation that one of these datasets is basically the Library Genesis archive.) Wikipedia is much smaller than the other datasets, although they did weight it somewhat more heavily than any other dataset. With the extra weighting, they say Wikipedia accounts for 3% of the total training.
>
> Thanks, Sage. Facebook's recently released LLaMA also shares some of its training sources, it turns out, with similar weighting for Wikipedia: only 4.5% of the training text, but more heavily weighted than most other sources:
> https://twitter.com/GuillaumeLample/status/1629151234597740550

Those stats are undercounting, since the top source (Common Crawl) itself also includes Wikipedia as its third-largest source:
https://commoncrawl.github.io/cc-crawl-statistics/plots/domains
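To make the weighting point concrete, a quick back-of-the-envelope in TypeScript using the figures reported in the GPT-3 paper's Table 2.2 (token counts and weights are approximate): a small dataset with an outsized sampling weight is simply seen more than once during training, while the much larger Common Crawl portion is only partially consumed.

    // Back-of-the-envelope from the GPT-3 paper (Table 2.2, figures approximate):
    // sampling weight vs. dataset size determines how many passes ("epochs")
    // training makes over each source when ~300B tokens are drawn in total.

    const totalTrainingTokens = 300e9;

    const sources = [
      { name: "Common Crawl (filtered)", tokens: 410e9, weight: 0.60 },
      { name: "WebText2",                tokens: 19e9,  weight: 0.22 },
      { name: "Books1",                  tokens: 12e9,  weight: 0.08 },
      { name: "Books2",                  tokens: 55e9,  weight: 0.08 },
      { name: "English Wikipedia",       tokens: 3e9,   weight: 0.03 },
    ];

    for (const s of sources) {
      const epochs = (totalTrainingTokens * s.weight) / s.tokens;
      console.log(`${s.name}: ${(s.weight * 100).toFixed(0)}% of training, ` +
                  `~${epochs.toFixed(1)} passes over the data`);
    }
    // Wikipedia ends up at ~3% of training but roughly three passes over its
    // text, while Common Crawl is sampled for well under one full pass.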
_______________________________________________
Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l
Public archives at https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/message/MJFBIBW7WPMOEKNCUN4OEDX3FU2KE3ED/
To unsubscribe send an email to wikimedia-l-le...@lists.wikimedia.org