[Wikimedia-l] Re: Bing-ChatGPT

2023-03-29 Thread Lauren Worden
On Wed, Mar 29, 2023 at 1:04 PM Felipe Schenone  wrote:
>
> FYI, there's an open letter requesting a 6-month pause on AI development, [ 
> https://futureoflife.org/open-letter/pause-giant-ai-experiments/ ] with 
> reasonable arguments (in my opinion) and signed by several big names too.

First, I want to point out that a "pause for at least 6 months the
training of AI systems more powerful than GPT-4" doesn't involve
halting research on how to prevent existing models from hallucinating,
how to cause them to summarize and cite reliable sources verifiably
and neutrally, how to allow them to be easily and inexpensively edited
for updates and corrections, or how to benchmark the performance of
competing approaches to using them for editing tasks, as I've proposed
the Foundation should do.

Second, I doubt such a pause on training larger models will do
anything to address the largest risks of LLMs, including any of
the risks which have been articulated as a threat to the projects, as
far as I know. Existing models a couple of generations behind the
bleeding edge are more than good enough to, for example, run an
organic-appearing campaign to bias Wikipedia articles in pernicious
ways for pay, or even as a dedicated individual's personal project
with a budget no larger than that of many common hobbies.

I suggest that the nominally non-free restrictions on use of the BLOOM
RAIL license are a superior approach to addressing the immediate risks
compared to a mere six month moratorium on larger models, especially
if those restrictions were codified into law. The following is from
https://openfuture.pubpub.org/pub/notes-on-open-ai

"The authors of the RAIL license acknowledge that the license does not
meet the Open Source Initiative definition of open code licenses (and
it does not meet the Open Definition either). In related news, the
newly launched Can’t Be Evil licenses also challenged established open
licensing models, while seeking to uphold the spirit of open
sharing. Traditionally, debates over what constitutes an open
license were related to normative debates about ensuring user
freedoms. Authors of the RAIL license rightly point out that these
need to be balanced today with care for responsible uses."

-LW
___
Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: 
https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and 
https://meta.wikimedia.org/wiki/Wikimedia-l
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/message/AH5IRDTBVLHDMWRBKKIFGB2GR7FQ6SXL/
To unsubscribe send an email to wikimedia-l-le...@lists.wikimedia.org

[Wikimedia-l] Re: Bing-ChatGPT

2023-03-29 Thread Felipe Schenone
FYI, there's an open letter requesting a 6-month pause on AI development
[ https://futureoflife.org/open-letter/pause-giant-ai-experiments/ ], with
reasonable arguments (in my opinion) and signed by several big names too.
The basic rationale, as I understand it, is that, as with human
cloning, human germline modification, gain-of-function research, and other
world-changing and potentially dangerous technologies, there should be some
kind of procedure to ensure that safety keeps pace with development, which
the current AI race is not allowing.

On Sun, Mar 19, 2023 at 5:20 AM Kimmo Virtanen 
wrote:

> Or, maybe just require an open disclosure of where the bot pulled from and
>> how much, instead of having it be a black box? "Text in this response
>> derived from: 17% Wikipedia article 'Example', 12% Wikipedia article
>> 'SomeOtherThing', 10%...".
>
>
> Current (i.e. ChatGPT) systems don't work that way, as the source of
> information is lost in the process when the information is encoded into the
> model. The model is just a network of probabilities, and it is highly
> compressed compared to the original data. We are missing the point if we
> believe it is a copy of the source data rather than a tool to interact with
> information using natural language.
>
> Soon, tools will be able to retrieve data from external sources and write
> answers based on them [1]. For example, in the Wikipedia context, this would
> be to use a search engine to find information automatically, summarize
> findings, and generate references for the results. Or, vice versa, to
> retrieve information from Wikipedia or Wikidata. Then we will get source
> data, too, but the LLM's internal reasoning will still be fuzzy.
>
> [1] https://interconnected.org/home/2023/03/16/singularity
>
> Br,
> -- Kimmo Virtanen
>
>
> On Sun, Mar 19, 2023 at 8:24 AM Todd Allen  wrote:
>
>> Or, maybe just require an open disclosure of where the bot pulled from
>> and how much, instead of having it be a black box? "Text in this response
>> derived from: 17% Wikipedia article 'Example', 12% Wikipedia article
>> 'SomeOtherThing', 10%...".
>>
>> On Sat, Mar 18, 2023 at 10:17 PM Steven Walling 
>> wrote:
>>
>>>
>>>
>>> On Sat, Mar 18, 2023 at 3:49 PM Erik Moeller 
>>> wrote:
>>>
 On Fri, Mar 17, 2023 at 7:05 PM Steven Walling <
 steven.wall...@gmail.com> wrote:

 > IANAL of course, but to me this implies that the *egregious* lack
 > of attribution in models that rely substantially on Wikipedia is
 > violating the Attribution requirements of CC licenses.

 Morally, I agree that companies like OpenAI would do well to recognize
 and nurture the sources they rely upon in training their models.
 Especially as the web becomes polluted with low quality AI-generated
 content, it would seem in everybody's best interest to sustain the
 communities and services that make and keep high quality information
 available. Not just Wikimedia, but also the Internet Archive, open
 access journals and preprint servers, etc.

 Legally, it seems a lot murkier. OpenAI in particular does not
 distribute any of its GPT models. You can feed them prompts by various
 means, and get responses back. Do those responses plagiarize
 Wikipedia?

 With image-generating models like Stable Diffusion, it's been found
 that the models sometimes generate output nearly indistinguishable
 from source material [1]. I don't know if similar studies have been
 undertaken for text-generating models yet. You can certainly ask GPT-4
 to generate something that looks like a Wikipedia article -- here are
 example results for generating a random Wikipedia article:

 Article: https://en.wikipedia.org/wiki/The_Talented_Mr._Ripley_(film)
 GPT-4 run 1: https://en.wikipedia.org/wiki/User:Eloquence/GPT4_Example/1
 (cut off at the ChatGPT generation limit)
 GPT-4 run 2: https://en.wikipedia.org/wiki/User:Eloquence/GPT4_Example/2
 GPT-4 run 3: https://en.wikipedia.org/wiki/User:Eloquence/GPT4_Example/3

 It imitates the form of a Wikipedia article & mixes up / makes up
 assertions, but I don't know that any of its generations would meet
 the standard of infringing on the Wikipedia article's copyright. IANAL
 either, and as you say, the legal landscape is evolving rapidly.

 Warmly,
 Erik
>>>
>>>
>>> The whole thing is definitely a hot mess. If the remixing/transformation
>>> by the model is a derivative work, it means OpenAI is potentially violating
>>> the ShareAlike requirement by not distributing the text output as CC. But
>>> on the other hand, the nature of the model means they're combining CC and
>>> non-free works freely / at random, unless a court would interpret whatever
>>> % of training data 

[Wikimedia-l] Re: Bing-ChatGPT

2023-03-25 Thread Amir Sarabadani
Repeating exactly what was in the training data is not overfitting.
Overfitting is when the model fails to recognize the underlying pattern in
the training data, leading to inaccurate or false results when used on new
data. Getting the exact same prediction on the training data set is
actually how the models train (by minimizing the difference). Supervised ML
at its core is a regression function.

Andrew Ng gives the example of a model that predicts the price of a house
based on the number of rooms and similar features. Overfitting is when a
completely new house results in a wildly inaccurate price range. Producing
the exact price of an existing house in the training set is neither
overfitting nor a bug.
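The house-price point above can be sketched with a toy example (the numbers, and the choice of a polynomial fit, are my own illustration, not from Andrew Ng's course): a degree-4 polynomial through five training houses reproduces every training price exactly, which is expected behaviour, while its prediction for an unseen 8-room house flies wildly off the roughly linear trend, which is the actual overfitting failure.

```python
import numpy as np

# Toy training set: number of rooms vs. price (hypothetical numbers).
rooms = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
prices = np.array([100.0, 150.0, 210.0, 260.0, 330.0])

# A degree-4 polynomial through 5 points interpolates them exactly.
coeffs = np.polyfit(rooms, prices, deg=4)

# Reproducing the training prices exactly: expected behaviour, not a bug.
print(np.allclose(np.polyval(coeffs, rooms), prices))  # True

# Overfitting shows up on *new* data: the prediction for an 8-room house
# is far above the roughly linear trend of the training data.
print(np.polyval(coeffs, 8.0))
```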

Speaking as someone who has built many ML models for Wikimedia.

On Thu, Mar 23, 2023 at 11:23 AM Kimmo Virtanen <
kimmo.virta...@wikimedia.fi> wrote:

> Hi,
>
> I just noticed that OpenAI has fixed the Wikidata property and item
> mappings so now it can generate working SPARQL. Example:
>
>
> Prompt : Search finnish female journaists using SPARQL from Wikidata?
>
> *GPT-3.5 (default)*
>
>> To search for Finnish female journalists using SPARQL from Wikidata, you
>> can use the following query:
>> SELECT ?journalistLabel
>> WHERE
>> {
>>   ?journalist wdt:P106 wd:Q1930187 .
>>   ?journalist wdt:P27 wd:Q33 .
>>   ?journalist wdt:P21 wd:Q6581072 .
>>   ?journalist wdt:P569 ?birthdate .
>>   ?journalist wdt:P21 ?gender .
>>   FILTER(YEAR(?birthdate) < 2005) .
>>   FILTER(?gender = wd:Q6581072) .
>>   SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } .
>> }
>> This query will retrieve the labels (names) of all female journalists who
>> were born before the year 2005, are Finnish citizens, and have a profession
>> (P106) of journalist (Q1930187). You can execute this query on the Wikidata
>> Query Service (https://query.wikidata.org/) by pasting the query into
>> the query editor and clicking the "Run" button. The results will be
>> displayed in a table format.
>
>
> *GPT-4*
>
> To search for Finnish female journalists using SPARQL on Wikidata, you can
>> use the following query:
>>
>
>
> SELECT ?journalist ?journalistLabel WHERE {
>>   ?journalist wdt:P31 wd:Q5; # instance of human
>>   wdt:P106 wd:Q1930187;  # occupation: journalist
>>   wdt:P21 wd:Q6581072;   # gender: female
>>   wdt:P27 wd:Q33.# country of citizenship: Finland
>>   SERVICE wikibase:label { bd:serviceParam wikibase:language "fi,en". }
>> }
>> This query searches for entities that are instances of human (Q5), have
>> an occupation as a journalist (Q1930187), are female (Q6581072), and have
>> Finnish citizenship (Q33). The query results will include the journalist's
>> Wikidata ID and their label (name) in Finnish or English. You can run this
>> query using the Wikidata Query Service (https://query.wikidata.org/).
>
>
> Br,
> -- Kimmo Virtanen, Zache
>
> On Thu, Mar 23, 2023 at 8:07 AM Erik Moeller  wrote:
>
>> On Wed, Mar 22, 2023 at 11:53 AM Lauren Worden 
>> wrote:
>> > BARD also produces lengthy passages from its training data verbatim
>> > without elicitation:
>> >
>> https://old.reddit.com/r/Bard/comments/11xxaxj/bard_copied_user_text_from_a_forum_word_for_word/jd58764/
>>
>> Very true. I tested the "Mr. Ripley" example with Bard, and it
>> instantly reproduced the exact language of the Wikipedia article's
>> lede section; only the remainder of the generation was substantially
>> different. [1]
>>
>> Warmly,
>> Erik
>>
>> [1] https://social.coop/@eloquence/110071078132245483



-- 
Amir (he/him)

[Wikimedia-l] Re: Bing-ChatGPT

2023-03-23 Thread Kimmo Virtanen
Hi,

I just noticed that OpenAI has fixed the Wikidata property and item
mappings so now it can generate working SPARQL. Example:


Prompt : Search finnish female journaists using SPARQL from Wikidata?

*GPT-3.5 (default)*

> To search for Finnish female journalists using SPARQL from Wikidata, you
> can use the following query:
> SELECT ?journalistLabel
> WHERE
> {
>   ?journalist wdt:P106 wd:Q1930187 .
>   ?journalist wdt:P27 wd:Q33 .
>   ?journalist wdt:P21 wd:Q6581072 .
>   ?journalist wdt:P569 ?birthdate .
>   ?journalist wdt:P21 ?gender .
>   FILTER(YEAR(?birthdate) < 2005) .
>   FILTER(?gender = wd:Q6581072) .
>   SERVICE wikibase:label { bd:serviceParam wikibase:language "en" } .
> }
> This query will retrieve the labels (names) of all female journalists who
> were born before the year 2005, are Finnish citizens, and have a profession
> (P106) of journalist (Q1930187). You can execute this query on the Wikidata
> Query Service (https://query.wikidata.org/) by pasting the query into the
> query editor and clicking the "Run" button. The results will be displayed
> in a table format.


*GPT-4*

To search for Finnish female journalists using SPARQL on Wikidata, you can
> use the following query:
>


SELECT ?journalist ?journalistLabel WHERE {
>   ?journalist wdt:P31 wd:Q5; # instance of human
>   wdt:P106 wd:Q1930187;  # occupation: journalist
>   wdt:P21 wd:Q6581072;   # gender: female
>   wdt:P27 wd:Q33.# country of citizenship: Finland
>   SERVICE wikibase:label { bd:serviceParam wikibase:language "fi,en". }
> }
> This query searches for entities that are instances of human (Q5), have an
> occupation as a journalist (Q1930187), are female (Q6581072), and have
> Finnish citizenship (Q33). The query results will include the journalist's
> Wikidata ID and their label (name) in Finnish or English. You can run this
> query using the Wikidata Query Service (https://query.wikidata.org/).
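For readers who want to run the query above programmatically rather than through the web editor, a minimal sketch against the public SPARQL endpoint follows. The endpoint URL and parameters are the standard Wikidata Query Service ones; the User-Agent string is a placeholder you should replace per the service's user-agent policy.

```python
import json
import urllib.parse
import urllib.request

# The GPT-4 query quoted above, with a LIMIT added for a quick test.
QUERY = """
SELECT ?journalist ?journalistLabel WHERE {
  ?journalist wdt:P31 wd:Q5;        # instance of: human
              wdt:P106 wd:Q1930187; # occupation: journalist
              wdt:P21 wd:Q6581072;  # gender: female
              wdt:P27 wd:Q33.       # citizenship: Finland
  SERVICE wikibase:label { bd:serviceParam wikibase:language "fi,en". }
}
LIMIT 5
"""

def run_query(query: str) -> list:
    """Send a SPARQL query to the Wikidata Query Service; return result bindings."""
    url = "https://query.wikidata.org/sparql?" + urllib.parse.urlencode(
        {"query": query, "format": "json"})
    req = urllib.request.Request(
        url, headers={"User-Agent": "example-bot/0.1 (replace with contact info)"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["results"]["bindings"]

# Usage (requires network access):
#   for row in run_query(QUERY):
#       print(row["journalistLabel"]["value"])
```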


Br,
-- Kimmo Virtanen, Zache

On Thu, Mar 23, 2023 at 8:07 AM Erik Moeller  wrote:

> On Wed, Mar 22, 2023 at 11:53 AM Lauren Worden 
> wrote:
> > BARD also produces lengthy passages from its training data verbatim
> > without elicitation:
> >
> https://old.reddit.com/r/Bard/comments/11xxaxj/bard_copied_user_text_from_a_forum_word_for_word/jd58764/
>
> Very true. I tested the "Mr. Ripley" example with Bard, and it
> instantly reproduced the exact language of the Wikipedia article's
> lede section; only the remainder of the generation was substantially
> different. [1]
>
> Warmly,
> Erik
>
> [1] https://social.coop/@eloquence/110071078132245483

[Wikimedia-l] Re: Bing-ChatGPT

2023-03-23 Thread Erik Moeller
On Wed, Mar 22, 2023 at 11:53 AM Lauren Worden  wrote:
> BARD also produces lengthy passages from its training data verbatim
> without elicitation:
> https://old.reddit.com/r/Bard/comments/11xxaxj/bard_copied_user_text_from_a_forum_word_for_word/jd58764/

Very true. I tested the "Mr. Ripley" example with Bard, and it
instantly reproduced the exact language of the Wikipedia article's
lede section; only the remainder of the generation was substantially
different. [1]

Warmly,
Erik

[1] https://social.coop/@eloquence/110071078132245483

[Wikimedia-l] Re: Bing-ChatGPT

2023-03-22 Thread Lauren Worden
Google BARD, announced this week, also tries and fails to perform
attribution and verification:
https://old.reddit.com/r/Bard/comments/11yeegu/google_bard_claims_bard_has_already_been_shut/jd77wpo/

BARD also produces lengthy passages from its training data verbatim
without elicitation:
https://old.reddit.com/r/Bard/comments/11xxaxj/bard_copied_user_text_from_a_forum_word_for_word/jd58764/

>  Another thing the Foundation could do without editors
> getting involved (a class action suit by editors would probably at
> best be counterproductive at this point, for a number of reasons, and
> could backfire) is to highlight and encourage the ongoing but
> relatively obscure work on attribution and verification by LLMs. There
> are two projects in particular, SPARROW [
> https://arxiv.org/abs/2209.14375 ] and RARR
> [https://arxiv.org/abs/2210.08726 ] that deserve wider recognition,
> support, and work on replication by third parties. These research
> directions are the most robust way to avoid the hallucination problems
> which are at the root of most everything that can go wrong when LLMs
> are used to produce Wikipedia content, so it would be extremely
> helpful if the Foundation uses its clout to shine a light and point
> out that they do what we expect of Wikipedia editors: provide sources
> in support of summary text cited in a way that third parties can
> independently verify.
>
> The Bing LLM already includes some attempt at doing this, with a
> dual-process search system which I believe is modeled after the SPARROW
> approach; but without the explicit rigor of something like RARR, it can
> fail spectacularly, producing the same confidently wrong output everyone
> has recently become familiar with, compounded by the problem of appearing
> to cite sources in support which in fact are not. For example, see this
> thread:
> https://twitter.com/dileeplearning/status/1634699315582226434
>
> -LW


[Wikimedia-l] Re: Bing-ChatGPT

2023-03-21 Thread Lauren Worden
On Mon, Mar 20, 2023 at 9:28 PM Kim Bruning via Wikimedia-l
 wrote:
> On Sun, Mar 19, 2023 at 02:48:12AM -0700, Lauren Worden wrote:
> >
> >  LLMs absolutely do encode a verbatim copy of their
> > training data, which can be produced intact with little effort.
>
> > https://arxiv.org/pdf/2205.10770.pdf
> > https://bair.berkeley.edu/blog/2020/12/20/lmmem/
>
> My understanding so far is that encoding a verbatim copy is typically due to 
> 'Overfitting'.
>
> This is considered a type of bug. It is undesirable for many reasons
> (technical, ethical, legal).

I believe the authors mainly use "overfitting" to describe the
condition when the model produces verbatim copies of its training data
instead of a reasonably distinct paraphrase or summary when the
verbatim source is not specifically elicited. But it's not clear to me
that the term isn't used in both ways.

This brings up an important point. ChatGPT seems to almost always
avoid the kind of infringing paraphrases described in
https://en.wikipedia.org/wiki/Wikipedia:Close_paraphrasing when asked
to paraphrase or summarize input text, which makes it very useful for
easily avoiding such issues. I get the feeling that Wikipedia editors
are already using it for this purpose on a relatively large scale. But
I'm hesitant to encourage such use until copyright experts familiar
with legal precedents involving "substantial similarity" as described
in that essay have had the opportunity to evaluate whether such LLM
output is a problem over a wide range of example cases. Ordinary
Wikipedia editors have no way to know how likely this is as a problem,
how to evaluate specific cases, or how to address such issues when
they arise. Professional guidance would be very helpful on this topic.

On Mon, Mar 20, 2023 at 8:01 PM Erik Moeller  wrote:
>
> ... I agree that it
> would be good for WMF to engage with LLM providers on these questions
> of attribution sooner rather than later, if that is not already
> underway. WMF is, as I understand it, still not in any privileged
> position of asserting or enforcing copyright (because it requires no
> copyright assignment from authors) -- but it can certainly make legal
> requirements clear, and also develop best practices that go beyond the
> legal minimum.

Thank you. Another thing the Foundation could do without editors
getting involved (a class action suit by editors would probably at
best be counterproductive at this point, for a number of reasons, and
could backfire) is to highlight and encourage the ongoing but
relatively obscure work on attribution and verification by LLMs. There
are two projects in particular, SPARROW [
https://arxiv.org/abs/2209.14375 ] and RARR
[https://arxiv.org/abs/2210.08726 ] that deserve wider recognition,
support, and work on replication by third parties. These research
directions are the most robust way to avoid the hallucination problems
which are at the root of most everything that can go wrong when LLMs
are used to produce Wikipedia content, so it would be extremely
helpful if the Foundation uses its clout to shine a light and point
out that they do what we expect of Wikipedia editors: provide sources
in support of summary text cited in a way that third parties can
independently verify.

The Bing LLM already includes some attempt at doing this, with a
dual-process search system which I believe is modeled after the SPARROW
approach; but without the explicit rigor of something like RARR, it can
fail spectacularly, producing the same confidently wrong output everyone
has recently become familiar with, compounded by the problem of appearing
to cite sources in support which in fact are not. For example, see this
thread:
https://twitter.com/dileeplearning/status/1634699315582226434

-LW

[Wikimedia-l] Re: Bing-ChatGPT

2023-03-21 Thread Adam Sobieski
Beyond the all-or-nothing choice between all content being available
verbatim and none of it, developers might want some content to be available
verbatim while other content is available only indirectly.

While AI systems can automatically determine which content is useful to store
verbatim, if we want content authors to be able to provide hints, we
could consider new HTML markup elements, clever uses of existing
elements and attributes, or schema.org Web schemas.

In these regards, consider the following example, where an HTML document
author wants to hint that the topic sentence of a paragraph may be quoted
verbatim while the remainder of the paragraph should be available only
indirectly. Perhaps the markup could resemble something like the following
rough-draft sketch:
This is some text, a topic 
sentence. This is a secondary sentence in the paragraph.

This sketch shows some overlapping markup approaches. Perhaps all elements
with IDs (URL-addressable content) should be considered verbatim-quotable.
Or, perhaps some HTML attribute, e.g. role, could be of use. Again,
schema.org Web schemas could also be of use.
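Since the list archiver stripped the tags from the sketch above, here is a plausible reconstruction of the kind of markup being described; the element choice, the id, and the role value are all hypothetical illustrations, not a proposed standard:

```html
<p>
  <!-- Hypothetical hint: this sentence may be quoted verbatim. -->
  <span id="topic-sentence-1" role="verbatim-quotable">This is some text,
  a topic sentence.</span>
  <!-- No hint: available only indirectly (summary or paraphrase). -->
  This is a secondary sentence in the paragraph.
</p>
```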


Also, I hope you find the following discussion thread interesting:
https://github.com/microsoft/semantic-kernel/discussions/108 about Educational
Applications of AI in Web Browsers. There, I ask some questions about modern
LLMs and APIs, about referring to documents by URLs in prompts, about
prioritizing some documents for utilization over others when answering
questions, and so forth. A “Web browser Copilot” would have educational 
applications. It could allow students to ask questions pertinent to the 
specific HTML, PDF, and EPUB documents that they are browsing and, perhaps, AI 
components could navigate to pages, scroll to content, and highlight document 
content for end-users while responding.


Best regards,
Adam Sobieski


From: Kim Bruning via Wikimedia-l 
Sent: Sunday, March 19, 2023 10:36 PM
To: Wikimedia Mailing List 
Cc: Kim Bruning 
Subject: [Wikimedia-l] Re: Bing-ChatGPT

On Sun, Mar 19, 2023 at 02:48:12AM -0700, Lauren Worden wrote:
>
> They have, and LLMs absolutely do encode a verbatim copy of their
> training data, which can be produced intact with little effort.

> https://arxiv.org/pdf/2205.10770.pdf
> https://bair.berkeley.edu/blog/2020/12/20/lmmem/

My understanding so far is that encoding a verbatim copy is typically due to 
'Overfitting'.

This is considered a type of bug. It is undesirable for many reasons
(technical, ethical, legal).

Models are (supposed to be) trained to prevent this as much as possible.

Clearly there was still work to be done as of December 2020, at least.

sincerely,
Kim Bruning

[Wikimedia-l] Re: Bing-ChatGPT

2023-03-20 Thread Kim Bruning via Wikimedia-l
On Sun, Mar 19, 2023 at 02:48:12AM -0700, Lauren Worden wrote:
> 
> They have, and LLMs absolutely do encode a verbatim copy of their
> training data, which can be produced intact with little effort. 

> https://arxiv.org/pdf/2205.10770.pdf
> https://bair.berkeley.edu/blog/2020/12/20/lmmem/

My understanding so far is that encoding a verbatim copy is typically due to 
'Overfitting'.

This is considered a type of bug. It is undesirable for many reasons
(technical, ethical, legal).

Models are (supposed to be) trained to prevent this as much as possible.

Clearly there was still work to be done as of December 2020, at least.

sincerely,
Kim Bruning


[Wikimedia-l] Re: Bing-ChatGPT

2023-03-20 Thread Erik Moeller
On Sun, Mar 19, 2023 at 12:12 PM Lauren Worden  wrote:

> They have, and LLMs absolutely do encode a verbatim copy of their
> training data, which can be produced intact with little effort. See
> https://arxiv.org/pdf/2205.10770.pdf -- in particular the first
> paragraph of the Background and Related Work section on page 2, where
> document extraction is considered an "attack" against such systems,
> which to me implies that the researchers fully realize they are
> involved with copyright issues on an enormous scale. Please see also
> https://bair.berkeley.edu/blog/2020/12/20/lmmem/

Thanks for these links, Lauren. I think it could be a very interesting
research project (for WMF, affiliates or Wikimedia research community
members) to attempt to recall Wikimedia project content such as
Wikipedia articles via the GPT-3.5 or GPT-4 API, to begin quantifying
the degree to which the models produce exact copies (or legally
covered derivative works--as opposed to novel expressions).
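One hedged sketch of how such a quantification might begin (the metric, n-gram length, and test strings are my illustrative assumptions, not a WMF methodology): compare a model generation against the source article by measuring what fraction of the generation's word n-grams occur verbatim in the source.

```python
def ngram_overlap(generated: str, source: str, n: int = 8) -> float:
    """Fraction of word n-grams in `generated` that appear verbatim in `source`.

    A high fraction at a large n (e.g. 8-grams) suggests verbatim copying
    rather than independent phrasing; thresholds would need calibration.
    """
    def ngrams(text: str) -> set:
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    gen = ngrams(generated)
    if not gen:
        return 0.0
    return len(gen & ngrams(source)) / len(gen)

# Toy check: an exact copy scores 1.0, unrelated text scores 0.0.
article = ("the talented mr ripley is a 1999 american psychological thriller "
           "film written and directed by anthony minghella")
print(ngram_overlap(article, article))  # 1.0
print(ngram_overlap("completely different words about gardening tips for "
                    "growing tomatoes in small urban spaces today", article))  # 0.0
```

In a real study the `generated` string would come from the GPT-3.5/GPT-4 API and scores would be aggregated over many articles and prompts.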

> With luck we will all have the chance to discuss these issues in
> detail on the March 23 Zoom discussion of large language models for
> Wikimedia projects:
> https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2023-2024/Draft/External_Trends#Open_call:_Artificial_Intelligence_in_Wikimedia

I won't be able to join but am glad this is happening. I agree that it
would be good for WMF to engage with LLM providers on these questions
of attribution sooner rather than later, if that is not already
underway. WMF is, as I understand it, still not in any privileged
position of asserting or enforcing copyright (because it requires no
copyright assignment from authors) -- but it can certainly make legal
requirements clear, and also develop best practices that go beyond the
legal minimum.

Warmly,
Erik

[Wikimedia-l] Re: Bing-ChatGPT

2023-03-19 Thread Lauren Worden
On Sat, Mar 18, 2023 at 3:49 PM Erik Moeller  wrote:
>
> ...With image-generating models like Stable Diffusion, it's been found
> that the models sometimes generate output nearly indistinguishable
> from source material [1]. I don't know if similar studies have been
> undertaken for text-generating models yet.

They have, and LLMs absolutely do encode a verbatim copy of their
training data, which can be produced intact with little effort. See
https://arxiv.org/pdf/2205.10770.pdf -- in particular the first
paragraph of the Background and Related Work section on page 2, where
document extraction is considered an "attack" against such systems,
which to me implies that the researchers fully realize they are
involved with copyright issues on an enormous scale. Please see also
https://bair.berkeley.edu/blog/2020/12/20/lmmem/

On Sat, Mar 18, 2023 at 9:17 PM Steven Walling  wrote:
>
> The whole thing is definitely a hot mess. If the remixing/transformation by 
> the model is a derivative work, it means OpenAI is potentially violating the 
> ShareAlike requirement by not distributing the text output as CC

The Foundation needs to get on top of this, by making a public request
to all of the LLM providers which use Wikipedia as training data,
asking that they acknowledge attribution of any output which may have
depended on CC-BY-SA content, license model outputs as CC-BY-SA,
and most importantly, disclaim any notion of accuracy or fidelity to
the training data. This needs to be done soon. So many people are
preparing to turn the reins of their editorial control over to these
new LLMs which they don't understand; CNET
[https://gizmodo.com/cnet-ai-chatgpt-news-robot-1849996151], let
alone Tyler Cowen's blog, has already felt the pain but sadly decided
to hastily try to cover it up. The overarching risk here is akin to
"citogenesis" but much more pernicious.

On Sun, Mar 19, 2023 at 1:20 AM Kimmo Virtanen
 wrote:
>
>> Or, maybe just require an open disclosure of where the bot pulled from and 
>> how much, instead of having it be a black box? "Text in this response 
>> derived from: 17% Wikipedia article 'Example', 12% Wikipedia article 
>> 'SomeOtherThing', 10%...".
>
> Current (i.e. ChatGPT) systems don't work that way, as the source of 
> information is lost in the process when the information is encoded into the 
> model

In fact, they do work that way, but it takes some effort to elucidate
the source of any given output. Anyone discussing these issues needs
to become familiar with ROME:
https://twitter.com/mengk20/status/1588581237345595394 Please see also
https://www.youtube.com/watch?v=_NMQyOu2HTo
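As a rough intuition for what ROME does, here is a deliberately simplified sketch; it is not the actual algorithm, which locates facts via causal tracing and weights the update by an estimated covariance of keys. Treat one MLP weight matrix as an associative memory mapping a subject "key" vector to a fact "value" vector, and rewrite a single fact with a minimal-norm rank-one update. All dimensions and vectors below are arbitrary toys:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))   # stand-in for one MLP weight matrix
k = rng.normal(size=8)        # "key": a vector encoding the subject
v = rng.normal(size=8)        # "value": the new fact we want stored

# Minimal-norm rank-one edit so that W_edited @ k equals v exactly,
# while directions orthogonal to k are left untouched:
W_edited = W + np.outer(v - W @ k, k) / (k @ k)

print(np.allclose(W_edited @ k, v))  # the edited "fact" is now stored
```

Because directions orthogonal to k are unchanged, such edits can be surgical, which is why this line of work matters for tracing and correcting what a model has stored.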

With luck we will all have the chance to discuss these issues in
detail on the March 23 Zoom discussion of large language models for
Wikimedia projects:
https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2023-2024/Draft/External_Trends#Open_call:_Artificial_Intelligence_in_Wikimedia

--LW

[Wikimedia-l] Re: Bing-ChatGPT

2023-03-19 Thread Kimmo Virtanen
>
> Or, maybe just require an open disclosure of where the bot pulled from and
> how much, instead of having it be a black box? "Text in this response
> derived from: 17% Wikipedia article 'Example', 12% Wikipedia article
> 'SomeOtherThing', 10%...".


Current (i.e. ChatGPT) systems don't work that way, as the source of
information is lost in the process when the information is encoded into the
model. The model is just a network of probabilities, and it is highly
compressed compared to the original data. We are missing the point if we
believe it is a copy of the source data rather than a tool for interacting
with information using natural language.

Soon, tools will be able to retrieve data from external sources and write
answers based on them[1]. In the Wikipedia context, for example, this would
mean using a search engine to find information automatically, summarizing
the findings, and generating references for the results. Or vice versa:
retrieving information from Wikipedia or Wikidata. Then we will get source
data too, but the LLM's internal reasoning will still be fuzzy.

[1] https://interconnected.org/home/2023/03/16/singularity
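A minimal sketch of that retrieve-then-answer pattern: search a corpus first, then build the answer from the retrieved passages so every response carries its sources. The corpus, the overlap scoring, and the quote-instead-of-generate shortcut are toy stand-ins (a real system would rank with a proper search engine and pass the passages to an LLM):

```python
import re
from collections import Counter

CORPUS = {
    "Example": "An example is a thing characteristic of its kind.",
    "Wikidata": "Wikidata is a collaboratively edited knowledge base "
                "hosted by the Wikimedia Foundation.",
}

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def retrieve(query, corpus, k=1):
    """Rank documents by simple term overlap with the query."""
    q = Counter(tokenize(query))
    scored = sorted(
        corpus.items(),
        key=lambda kv: sum(q[t] for t in tokenize(kv[1])),
        reverse=True,
    )
    return scored[:k]

def answer(query, corpus):
    """Assemble an answer from retrieved text, keeping the source titles."""
    passages = retrieve(query, corpus)
    refs = [title for title, _ in passages]
    text = " ".join(body for _, body in passages)
    # A real system would hand `text` to an LLM here; we just quote it.
    return {"answer": text, "references": refs}

result = answer("What is Wikidata and who hosts it?", CORPUS)
print(result["references"])
```

The point of the design is that provenance is attached at retrieval time rather than recovered from the model's weights afterwards.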

Br,
-- Kimmo Virtanen


On Sun, Mar 19, 2023 at 8:24 AM Todd Allen  wrote:

> Or, maybe just require an open disclosure of where the bot pulled from and
> how much, instead of having it be a black box? "Text in this response
> derived from: 17% Wikipedia article 'Example', 12% Wikipedia article
> 'SomeOtherThing', 10%...".
>
> On Sat, Mar 18, 2023 at 10:17 PM Steven Walling 
> wrote:
>
>>
>>
>> On Sat, Mar 18, 2023 at 3:49 PM Erik Moeller  wrote:
>>
>>> On Fri, Mar 17, 2023 at 7:05 PM Steven Walling 
>>> wrote:
>>>
>>> > IANAL of course, but to me this implies that responsibility for the
>>> *egregious* lack
>>> > of attribution in models that rely substantially on Wikipedia is
>>> violating the Attribution
>>> > requirements of CC licenses.
>>>
>>> Morally, I agree that companies like OpenAI would do well to recognize
>>> and nurture the sources they rely upon in training their models.
>>> Especially as the web becomes polluted with low quality AI-generated
>>> content, it would seem in everybody's best interest to sustain the
>>> communities and services that make and keep high quality information
>>> available. Not just Wikimedia, but also the Internet Archive, open
>>> access journals and preprint servers, etc.
>>>
>>> Legally, it seems a lot murkier. OpenAI in particular does not
>>> distribute any of its GPT models. You can feed them prompts by various
>>> means, and get responses back. Do those responses plagiarize
>>> Wikipedia?
>>>
>>> With image-generating models like Stable Diffusion, it's been found
>>> that the models sometimes generate output nearly indistinguishable
>>> from source material [1]. I don't know if similar studies have been
>>> undertaken for text-generating models yet. You can certainly ask GPT-4
>>> to generate something that looks like a Wikipedia article -- here are
>>> example results for generating a random Wikipedia article:
>>>
>>> Article: https://en.wikipedia.org/wiki/The_Talented_Mr._Ripley_(film)
>>> GPT-4
>>>  run
>>> 1: https://en.wikipedia.org/wiki/User:Eloquence/GPT4_Example/1
>>> (cut off at the ChatGPT generation limit)
>>> GPT-4 run 2: https://en.wikipedia.org/wiki/User:Eloquence/GPT4_Example/2
>>> GPT-4 
>>> run 3: https://en.wikipedia.org/wiki/User:Eloquence/GPT4_Example/3
>>>
>>> It imitates the form of a Wikipedia article & mixes up / makes up
>>> assertions, but I don't know that any of its generations would meet
>>> the standard of infringing on the Wikipedia article's copyright. IANAL
>>> either, and as you say, the legal landscape is evolving rapidly.
>>>
>>> Warmly,
>>> Erik
>>
>>
>> The whole thing is definitely a hot mess. If the remixing/transformation
>> by the model is a derivative work, it means OpenAI is potentially violating
>> the ShareAlike requirement by not distributing the text output as CC. But
>> on other hand the nature of the model means they’re combining CC and non
>> free works freely / at random, unless a court would interpret whatever % of
>> training data comes from us as the direct degree to which the model output
>> is derived from Wikipedia. Either way it’s going to be up to some legal
>> representation of copyright holders to test the boundaries here.
>>
>>
>>> [1]
>>> https://arstechnica.com/information-technology/2023/02/researchers-extract-training-images-from-stable-diffusion-but-its-difficult/

[Wikimedia-l] Re: Bing-ChatGPT

2023-03-19 Thread Kimmo Virtanen
In the European Union, the "Regulation Laying Down Harmonised Rules on
Artificial Intelligence" (the AI Act) and the "AI Liability Directive"
(AILD) are in the pipeline. The AI Act text is, as far as I know,
currently in the finalizing phase.
*
https://www.insideprivacy.com/artificial-intelligence/eu-ai-policy-and-regulation-what-to-look-out-for-in-2023/
*
https://www.dentons.com/en/insights/articles/2023/february/1/regulating-ai-in-eu-three-things-that-you-need-to-know

An interesting note: there are no Wikipedia articles about these, nor even
Wikidata items.

Br,
-- Kimmo Virtanen, Zache

On Sat, Mar 18, 2023 at 4:06 AM Steven Walling 
wrote:

> On Fri, Mar 17, 2023 at 6:03 PM The Cunctator  wrote:
>
>> I really feel like we're getting into pretty aggressive corporate abuse
>> of the Wikipedia copyleft.
>>
>
> I completely agree. It makes me pretty angry that Wikipedians have spent
> millions of volunteer hours creating content to educate and inform people
> as accurately as we can, and it's being used to generate convincing but
> often wildly misleading bullshit.
>
> The ground truth on what generated AI content is (from a copyright
> position) and where authorship/ownership lies seems to be rapidly evolving.
> The U.S. Copyright Office recently refused to issue copyrights for some
> AI-generated works, seemingly on the principle that they lack human
> authorship / are essential to contracting work for hire from an artist or
> writer.
>
> IANAL of course, but to me this implies that responsibility for the
> *egregious* lack of attribution in models that rely substantially on
> Wikipedia is violating the Attribution requirements of CC licenses. Just
> like the Foundation took a principled position in testing the legality of
> warrantless mass surveillance, I would love to see us push back on the
> notion that it's legal or moral for OpenAI or any of these other companies
> to take our content and use it to flood the Internet with machine-generated
> word diarrhea.
>
>
>> On Fri, Mar 17, 2023, 4:45 PM Adam Sobieski 
>> wrote:
>>
>>> Hello,
>>>
>>> I would like to indicate "Copilot" in the Edge browser as being
>>> potentially relevant to Wikipedia [1][2].
>>>
>>> It is foreseeable that end-users will be able to open sidebars in their
>>> Web browsers and subsequently chat with large language models about the
>>> contents of specific Web documents, e.g., encyclopedia articles. Using Web
>>> browsers, there can be task contexts available, including the documents or
>>> articles in users' current tabs, potentially including users' scroll
>>> positions, potentially including users' selections or highlightings of
>>> content.
>>>
>>> I, for one, am thinking about how Web standards, e.g., Web schema, can
>>> be of use for amplifying these features and capabilities for end-users.
>>>
>>>
>>> Best regards,
>>> Adam Sobieski
>>>
>>> [1]
>>> https://learn.microsoft.com/en-us/deployedge/microsoft-edge-relnote-stable-channel?ranMID=24542#version-1110166141-march-13-2023
>>> [2] https://www.engadget.com/microsoft-edge-ai-copilot-184033427.html
>>>
>>> --
>>> *From:* Kimmo Virtanen 
>>> *Sent:* Friday, March 17, 2023 8:17 AM
>>> *To:* Wikimedia Mailing List 
>>> *Subject:* [Wikimedia-l] Re: Bing-ChatGPT
>>>
>>> Hi,
>>>
>>> The development of open-source large language models is going forward.
>>> GPT-4 was released, and it seems that it passed the bar exam and tried
>>> to hire humans to solve captchas that were too complex for it. However,
>>> development on the open-source and hacking side has been pretty fast,
>>> and it seems that all the pieces are in place for running LLM models on
>>> personal hardware (and in web browsers). The biggest missing piece is
>>> fine-tuning of open-source models such as NeoX for the English language.
>>> For multilingual and multimodal (for example, images+text) use, a
>>> suitable model is also needed.
>>>
>>>
>>> So this is kind of a link dump for relevant things for creation of open
>>> source LLM model and service and also recap where the hacker community is
>>> now.
>>>
>>>
>>> 1.) Creation of an initial unaligned model.
>>>
>>>- Possible models
>>>   - 20b Neo(X) <https://github.com/EleutherAI/gpt-neox> by
>>>   EleutherAI (Apache 2.0)
>>>   - Fairseq Dense
>>>   <https://huggingface.co/KoboldAI/fairseq-dense-13B> by Facebook
>>>   (MIT-licence)

[Wikimedia-l] Re: Bing-ChatGPT

2023-03-19 Thread Galder Gonzalez Larrañaga
Dear all,

Your discussion and points are really interesting. I just wanted to point
out that, as far as I know, the idea of "Text in this response derived
from: 17% Wikipedia article 'Example', 12% Wikipedia article
'SomeOtherThing', 10%..." is impossible, as generative AIs derive from all
articles and texts at the same time.

Best,
Galder

[Wikimedia-l] Re: Bing-ChatGPT

2023-03-19 Thread Todd Allen
Or, maybe just require an open disclosure of where the bot pulled from and
how much, instead of having it be a black box? "Text in this response
derived from: 17% Wikipedia article 'Example', 12% Wikipedia article
'SomeOtherThing', 10%...".
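For the narrow case where output text overlaps sources verbatim, such a disclosure could in principle be approximated by overlap scoring. The sketch below (the articles and the response are invented) reports per-source percentages of shared words; genuine training-data attribution for an LLM is far harder, since influence flows through paraphrase and weights rather than copied words:

```python
import re
from collections import Counter

def words(text):
    return Counter(re.findall(r"[a-z']+", text.lower()))

def attribute(response, sources):
    """Percent of response words also found in each source."""
    resp = words(response)
    total = sum(resp.values())
    report = {}
    for name, text in sources.items():
        src = words(text)
        shared = sum(min(c, src[w]) for w, c in resp.items())
        report[name] = round(100 * shared / total)
    return report

sources = {
    "Example": "an example is a representative thing",
    "SomeOtherThing": "some other thing entirely different",
}
print(attribute("some other thing and more", sources))
```

Percentages from word overlap can sum to more than 100 and say nothing about which source the model actually drew on, which is roughly why the thread's sceptics call the "17% from article X" label hard to produce honestly.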

On Sat, Mar 18, 2023 at 10:17 PM Steven Walling 
wrote:

>
>
> On Sat, Mar 18, 2023 at 3:49 PM Erik Moeller  wrote:
>
>> On Fri, Mar 17, 2023 at 7:05 PM Steven Walling 
>> wrote:
>>
>> > IANAL of course, but to me this implies that responsibility for the
>> *egregious* lack
>> > of attribution in models that rely substantially on Wikipedia is
>> violating the Attribution
>> > requirements of CC licenses.
>>
>> Morally, I agree that companies like OpenAI would do well to recognize
>> and nurture the sources they rely upon in training their models.
>> Especially as the web becomes polluted with low quality AI-generated
>> content, it would seem in everybody's best interest to sustain the
>> communities and services that make and keep high quality information
>> available. Not just Wikimedia, but also the Internet Archive, open
>> access journals and preprint servers, etc.
>>
>> Legally, it seems a lot murkier. OpenAI in particular does not
>> distribute any of its GPT models. You can feed them prompts by various
>> means, and get responses back. Do those responses plagiarize
>> Wikipedia?
>>
>> With image-generating models like Stable Diffusion, it's been found
>> that the models sometimes generate output nearly indistinguishable
>> from source material [1]. I don't know if similar studies have been
>> undertaken for text-generating models yet. You can certainly ask GPT-4
>> to generate something that looks like a Wikipedia article -- here are
>> example results for generating a random Wikipedia article:
>>
>> Article: https://en.wikipedia.org/wiki/The_Talented_Mr._Ripley_(film)
>> GPT-4 
>> run 1: https://en.wikipedia.org/wiki/User:Eloquence/GPT4_Example/1
>> (cut off at the ChatGPT generation limit)
>> GPT-4 run 2: https://en.wikipedia.org/wiki/User:Eloquence/GPT4_Example/2
>> GPT-4 
>> run 3: https://en.wikipedia.org/wiki/User:Eloquence/GPT4_Example/3
>>
>> It imitates the form of a Wikipedia article & mixes up / makes up
>> assertions, but I don't know that any of its generations would meet
>> the standard of infringing on the Wikipedia article's copyright. IANAL
>> either, and as you say, the legal landscape is evolving rapidly.
>>
>> Warmly,
>> Erik
>
>
> The whole thing is definitely a hot mess. If the remixing/transformation
> by the model is a derivative work, it means OpenAI is potentially violating
> the ShareAlike requirement by not distributing the text output as CC. But
> on other hand the nature of the model means they’re combining CC and non
> free works freely / at random, unless a court would interpret whatever % of
> training data comes from us as the direct degree to which the model output
> is derived from Wikipedia. Either way it’s going to be up to some legal
> representation of copyright holders to test the boundaries here.
>
>
>> [1]
>> https://arstechnica.com/information-technology/2023/02/researchers-extract-training-images-from-stable-diffusion-but-its-difficult/

[Wikimedia-l] Re: Bing-ChatGPT

2023-03-18 Thread Steven Walling
On Sat, Mar 18, 2023 at 3:49 PM Erik Moeller  wrote:

> On Fri, Mar 17, 2023 at 7:05 PM Steven Walling 
> wrote:
>
> > IANAL of course, but to me this implies that responsibility for the
> *egregious* lack
> > of attribution in models that rely substantially on Wikipedia is
> violating the Attribution
> > requirements of CC licenses.
>
> Morally, I agree that companies like OpenAI would do well to recognize
> and nurture the sources they rely upon in training their models.
> Especially as the web becomes polluted with low quality AI-generated
> content, it would seem in everybody's best interest to sustain the
> communities and services that make and keep high quality information
> available. Not just Wikimedia, but also the Internet Archive, open
> access journals and preprint servers, etc.
>
> Legally, it seems a lot murkier. OpenAI in particular does not
> distribute any of its GPT models. You can feed them prompts by various
> means, and get responses back. Do those responses plagiarize
> Wikipedia?
>
> With image-generating models like Stable Diffusion, it's been found
> that the models sometimes generate output nearly indistinguishable
> from source material [1]. I don't know if similar studies have been
> undertaken for text-generating models yet. You can certainly ask GPT-4
> to generate something that looks like a Wikipedia article -- here are
> example results for generating a random Wikipedia article:
>
> Article: https://en.wikipedia.org/wiki/The_Talented_Mr._Ripley_(film)
> GPT-4 
> run 1: https://en.wikipedia.org/wiki/User:Eloquence/GPT4_Example/1
> (cut off at the ChatGPT generation limit)
> GPT-4 run 2: https://en.wikipedia.org/wiki/User:Eloquence/GPT4_Example/2
> GPT-4 
> run 3: https://en.wikipedia.org/wiki/User:Eloquence/GPT4_Example/3
>
> It imitates the form of a Wikipedia article & mixes up / makes up
> assertions, but I don't know that any of its generations would meet
> the standard of infringing on the Wikipedia article's copyright. IANAL
> either, and as you say, the legal landscape is evolving rapidly.
>
> Warmly,
> Erik


The whole thing is definitely a hot mess. If the remixing/transformation by
the model is a derivative work, it means OpenAI is potentially violating
the ShareAlike requirement by not distributing the text output as CC. But
on other hand the nature of the model means they’re combining CC and non
free works freely / at random, unless a court would interpret whatever % of
training data comes from us as the direct degree to which the model output
is derived from Wikipedia. Either way it’s going to be up to some legal
representation of copyright holders to test the boundaries here.


> [1]
> https://arstechnica.com/information-technology/2023/02/researchers-extract-training-images-from-stable-diffusion-but-its-difficult/

[Wikimedia-l] Re: Bing-ChatGPT

2023-03-18 Thread Erik Moeller
On Fri, Mar 17, 2023 at 7:05 PM Steven Walling  wrote:

> IANAL of course, but to me this implies that responsibility for the 
> *egregious* lack
> of attribution in models that rely substantially on Wikipedia is violating 
> the Attribution
> requirements of CC licenses.

Morally, I agree that companies like OpenAI would do well to recognize
and nurture the sources they rely upon in training their models.
Especially as the web becomes polluted with low quality AI-generated
content, it would seem in everybody's best interest to sustain the
communities and services that make and keep high quality information
available. Not just Wikimedia, but also the Internet Archive, open
access journals and preprint servers, etc.

Legally, it seems a lot murkier. OpenAI in particular does not
distribute any of its GPT models. You can feed them prompts by various
means, and get responses back. Do those responses plagiarize
Wikipedia?

With image-generating models like Stable Diffusion, it's been found
that the models sometimes generate output nearly indistinguishable
from source material [1]. I don't know if similar studies have been
undertaken for text-generating models yet. You can certainly ask GPT-4
to generate something that looks like a Wikipedia article -- here are
example results for generating a random Wikipedia article:

Article: https://en.wikipedia.org/wiki/The_Talented_Mr._Ripley_(film)
GPT-4 run 1: https://en.wikipedia.org/wiki/User:Eloquence/GPT4_Example/1
(cut off at the ChatGPT generation limit)
GPT-4 run 2: https://en.wikipedia.org/wiki/User:Eloquence/GPT4_Example/2
GPT-4 run 3: https://en.wikipedia.org/wiki/User:Eloquence/GPT4_Example/3

It imitates the form of a Wikipedia article & mixes up / makes up
assertions, but I don't know that any of its generations would meet
the standard of infringing on the Wikipedia article's copyright. IANAL
either, and as you say, the legal landscape is evolving rapidly.

Warmly,
Erik

[1] 
https://arstechnica.com/information-technology/2023/02/researchers-extract-training-images-from-stable-diffusion-but-its-difficult/

[Wikimedia-l] Re: Bing-ChatGPT

2023-03-18 Thread Matej Grochal
Hi all

I agree that the AI creators should attribute Wikipedia as their source.
But on the other hand, when the result is incorrect, we might actually
be glad that they do not attribute it.
The issue is how to convince readers to come to the source (our projects)
rather than going through intermediaries (AI or otherwise).

Matej

On Sat, Mar 18, 2023 at 3:22 AM Steven Walling 
wrote:

> On Fri, Mar 17, 2023 at 6:03 PM The Cunctator  wrote:
>
>> I really feel like we're getting into pretty aggressive corporate abuse
>> of the Wikipedia copyleft.
>>
>
> I completely agree. It makes me pretty angry that Wikipedians have spent
> millions of volunteer hours creating content to educate and inform people
> as accurately as we can, and it's being used to generate convincing but
> often wildly misleading bullshit.
>
> The ground truth on what generated AI content is (from a copyright
> position) and where authorship/ownership lies seems to be rapidly evolving.
> The U.S. Copyright Office recently refused to issue copyrights for some
> AI-generated works, seemingly on the principle that they lack human
> authorship / are essential to contracting work for hire from an artist or
> writer.
>
> IANAL of course, but to me this implies that responsibility for the
> *egregious* lack of attribution in models that rely substantially on
> Wikipedia is violating the Attribution requirements of CC licenses. Just
> like the Foundation took a principled position in testing the legality of
> warrantless mass surveillance, I would love to see us push back on the
> notion that it's legal or moral for OpenAI or any of these other companies
> to take our content and use it to flood the Internet with machine-generated
> word diarrhea.
>
>
>> On Fri, Mar 17, 2023, 4:45 PM Adam Sobieski 
>> wrote:
>>
>>> Hello,
>>>
>>> I would like to indicate "Copilot" in the Edge browser as being
>>> potentially relevant to Wikipedia [1][2].
>>>
>>> It is foreseeable that end-users will be able to open sidebars in their
>>> Web browsers and subsequently chat with large language models about the
>>> contents of specific Web documents, e.g., encyclopedia articles. Using Web
>>> browsers, there can be task contexts available, including the documents or
>>> articles in users' current tabs, potentially including users' scroll
>>> positions, potentially including users' selections or highlightings of
>>> content.
>>>
>>> I, for one, am thinking about how Web standards, e.g., Web schema, can
>>> be of use for amplifying these features and capabilities for end-users.
>>>
>>>
>>> Best regards,
>>> Adam Sobieski
>>>
>>> [1]
>>> https://learn.microsoft.com/en-us/deployedge/microsoft-edge-relnote-stable-channel?ranMID=24542#version-1110166141-march-13-2023
>>> [2] https://www.engadget.com/microsoft-edge-ai-copilot-184033427.html
>>>
>>> --
>>> *From:* Kimmo Virtanen 
>>> *Sent:* Friday, March 17, 2023 8:17 AM
>>> *To:* Wikimedia Mailing List 
>>> *Subject:* [Wikimedia-l] Re: Bing-ChatGPT
>>>
>>> Hi,
>>>
>>> The development of open-source large language models is going forward.
>>> GPT-4 was released, and it seems that it passed the bar exam and tried
>>> to hire humans to solve captchas that were too complex for it. However,
>>> development on the open-source and hacking side has been pretty fast,
>>> and it seems that all the pieces are in place for running LLM models on
>>> personal hardware (and in web browsers). The biggest missing piece is
>>> fine-tuning of open-source models such as NeoX for the English language.
>>> For multilingual and multimodal (for example, images+text) use, a
>>> suitable model is also needed.
>>>
>>>
>>> So this is kind of a link dump for relevant things for creation of open
>>> source LLM model and service and also recap where the hacker community is
>>> now.
>>>
>>>
>>> 1.) Creation of an initial unaligned model.
>>>
>>>- Possible models
>>>   - 20b Neo(X) <https://github.com/EleutherAI/gpt-neox> by
>>>   EleutherAI (Apache 2.0)
>>>   - Fairseq Dense
>>>   <https://huggingface.co/KoboldAI/fairseq-dense-13B> by Facebook
>>>   (MIT-licence)
>>>   - LLaMa
>>>   <https://ai.facebook.com/blog/large-language-model-llama-meta-ai/> by
>>>   Facebook (custom license, leaked research use only)
>>>   - Bloom <https://huggingface.co/bigscie

[Wikimedia-l] Re: Bing-ChatGPT

2023-03-18 Thread Peter Southwood
“Cowen has sufficient credentials to be treated as a reliable expert”

Maybe not for much longer.

Cheers, P.

 

From: The Cunctator [mailto:cuncta...@gmail.com] 
Sent: 17 March 2023 17:49
To: Wikimedia Mailing List
Subject: [Wikimedia-l] Re: Bing-ChatGPT

 

This is an important development for editors to be aware of - we're going to 
have to be increasingly on the lookout for sources using ML-generated bullshit. 
Here are two instances I'm aware of this week:

 

https://www.thenation.com/article/culture/internet-archive-publishers-lawsuit-chatbot/
In late February, Tyler Cowen, a libertarian economics professor at George 
Mason University, published a blog post  
<https://web.archive.org/web/20230305055906/https:/marginalrevolution.com/marginalrevolution/2023/02/who-was-the-most-important-critic-of-the-printing-press-in-the-17th-century.html>
 titled, “Who was the most important critic of the printing press in the 17th 
century?” Cowen’s post contended that the polymath and statesman Francis Bacon 
was an “important” critic of the printing press; unfortunately, the post 
contains long, fake quotes attributed to Bacon’s The Advancement of Learning 
(1605), complete with false chapter and section numbers.
Tech writer Mathew Ingram drew attention to the fabrications  
<https://newsletter.mathewingram.com/tyler-cowen-francis-bacon-and-the-chatgpt-engine/>
 a few days later, noting that Cowen has been  
<https://marginalrevolution.com/marginalrevolution/2023/02/how-should-you-talk-to-chatgpt-a-users-guide.html>
 writing approvingly about the AI chatbot ChatGPT for some time now; several 
commenters on Cowen’s post assumed the fake quotes must be the handiwork of 
ChatGPT. (Cowen did not reply to e-mailed questions regarding the post by press 
time, and later removed the post entirely, with no explanation whatsoever. 
However, a copy remains at the Internet Archive’s Wayback Machine).

 

 
https://www.vice.com/en/article/3akz8y/ai-injected-misinformation-into-article-claiming-misinformation-in-navalny-doc
An article claiming to identify misinformation in an Oscar-winning documentary 
about imprisoned Russian dissident Alexei Navalny is itself full of 
misinformation, thanks to the author using AI. 
Investigative news outlet The Grayzone recently  
<https://thegrayzone.com/2023/03/13/oscar-navalny-documentary-misinformation/> 
published an article that included AI-generated text as a source for its 
information. The  
<http://web.archive.org/web/20230314131551/https:/thegrayzone.com/2023/03/13/oscar-navalny-documentary-misinformation/>
 piece, “Oscar-winning ‘Navalny’ documentary is packed with misinformation” by 
Lucy Komisar, included hyperlinks to  
<http://web.archive.org/web/20230314121144/https:/www.thekomisarscoop.com/wp-content/uploads/2023/02/Many-contributors-have-backgrounds-that-suggest-they-are-biased-in-favor-of-western-governments-and-against-its-enemies.pdf>
 PDFs uploaded to the author’s personal website that appear to be screenshots 
of conversations she had with ChatSonic, a free generative AI chatbot that 
advertises itself as a ChatGPT alternative that can “write factual trending 
content” using Google search results.

That said, I don't think this is anything to be too stressed about; the 
Grayzone is already a deprecated source and blogs like Marginal Revolution are 
treated with caution, though Cowen has sufficient credentials to be treated as 
a reliable expert.

 

On Fri, Mar 17, 2023 at 11:23 AM Kimmo Virtanen  
wrote:

Hi,

 

The development of open-source large language models is going forward. GPT-4
was released, and it seems that it passed the bar exam and tried to hire
humans to solve captchas that were too complex for it. However, development
on the open-source and hacking side has been pretty fast, and it seems that
all the pieces are in place for running LLM models on personal hardware (and
in web browsers). The biggest missing piece is fine-tuning of open-source
models such as NeoX for the English language. For multilingual and
multimodal (for example, images+text) use, a suitable model is also needed.

 

So this is kind of a link dump for relevant things for creation of open source 
LLM model and service and also recap where the hacker community is now.

 

1.) Creation of an initial unaligned model. 

· Possible models

·  <https://github.com/EleutherAI/gpt-neox> 20b Neo(X) by EleutherAI 
(Apache 2.0)

·  <https://huggingface.co/KoboldAI/fairseq-dense-13B> Fairseq Dense by 
Facebook (MIT-licence)

·  <https://ai.facebook.com/blog/large-language-model-llama-meta-ai/> 
LLaMa by Facebook (custom license, leaked research use only)

·  <https://huggingface.co/bigscience/bloom> Bloom by Bigscience ( 
<https://huggingface.co/spaces/bigscience/license

[Wikimedia-l] Re: Bing-ChatGPT

2023-03-17 Thread Steven Walling
On Fri, Mar 17, 2023 at 6:03 PM The Cunctator  wrote:

> I really feel like we're getting into pretty aggressive corporate abuse of
> the Wikipedia copyleft.
>

I completely agree. It makes me pretty angry that Wikipedians have spent
millions of volunteer hours creating content to educate and inform people
as accurately as we can, and it's being used to generate convincing but
often wildly misleading bullshit.

The ground truth on what AI-generated content is (from a copyright
position) and where authorship/ownership lies seems to be rapidly evolving.
The U.S. Copyright Office recently refused to register copyright for some
AI-generated works, seemingly on the principle that they lack human
authorship, unlike work contracted for hire from an artist or
writer.

IANAL of course, but to me this implies that the *egregious* lack of
attribution in models that rely substantially on Wikipedia violates the
Attribution requirement of CC licenses. Just
like the Foundation took a principled position in testing the legality of
warrantless mass surveillance, I would love to see us push back on the
notion that it's legal or moral for OpenAI or any of these other companies
to take our content and use it to flood the Internet with machine-generated
word diarrhea.



[Wikimedia-l] Re: Bing-ChatGPT

2023-03-17 Thread The Cunctator
I really feel like we're getting into pretty aggressive corporate abuse of
the Wikipedia copyleft.


[Wikimedia-l] Re: Bing-ChatGPT

2023-03-17 Thread Adam Sobieski
Hello,

I would like to indicate "Copilot" in the Edge browser as being potentially 
relevant to Wikipedia [1][2].

It is foreseeable that end-users will be able to open sidebars in their Web 
browsers and chat with large language models about the contents of specific 
Web documents, e.g., encyclopedia articles. Through Web browsers, task context 
can be made available, including the documents or articles in users' current 
tabs, and potentially users' scroll positions and their selections or 
highlights of content.

I, for one, am thinking about how Web standards, e.g., Web schema, can be of 
use for amplifying these features and capabilities for end-users.


Best regards,
Adam Sobieski

[1] 
https://learn.microsoft.com/en-us/deployedge/microsoft-edge-relnote-stable-channel?ranMID=24542#version-1110166141-march-13-2023
[2] https://www.engadget.com/microsoft-edge-ai-copilot-184033427.html


From: Kimmo Virtanen 
Sent: Friday, March 17, 2023 8:17 AM
To: Wikimedia Mailing List 
Subject: [Wikimedia-l] Re: Bing-ChatGPT


[Wikimedia-l] Re: Bing-ChatGPT

2023-03-17 Thread The Cunctator
This is an important development for editors to be aware of - we're going
to have to be increasingly on the lookout for sources using ML-generated
bullshit. Here are two instances I'm aware of this week:

https://www.thenation.com/article/culture/internet-archive-publishers-lawsuit-chatbot/
> In late February, Tyler Cowen, a libertarian economics professor at George
> Mason University, published a blog post titled
> “Who was the most important critic of the printing press in the 17th
> century?” Cowen’s post contended that the polymath and statesman Francis
> Bacon was an “important” critic of the printing press; unfortunately, the
> post contains long, fake quotes attributed to Bacon’s *The Advancement of
> Learning* (1605), complete with false chapter and section numbers.
> Tech writer Mathew Ingram drew attention to the fabrications a few days
> later, noting that Cowen has been writing approvingly about the AI chatbot
> ChatGPT for some time now; several commenters on Cowen’s post assumed the
> fake quotes must be the handiwork of ChatGPT. (Cowen did not reply to
> e-mailed questions regarding the post by press time, and later removed the
> post entirely, with no explanation whatsoever. However, a copy remains at
> the Internet Archive’s Wayback Machine.)
>


>
> https://www.vice.com/en/article/3akz8y/ai-injected-misinformation-into-article-claiming-misinformation-in-navalny-doc
> An article claiming to identify misinformation in an Oscar-winning
> documentary about imprisoned Russian dissident Alexei Navalny is itself
> full of misinformation, thanks to the author using AI.
> Investigative news outlet *The Grayzone* recently published an article
> that included AI-generated text as a source for its information. The piece,
> “Oscar-winning ‘Navalny’ documentary is packed with misinformation” by Lucy
> Komisar, included hyperlinks to PDFs uploaded to the author’s personal
> website that appear to be screenshots of conversations she had with
> ChatSonic, a free generative AI chatbot that advertises itself as a ChatGPT
> alternative that can “write factual trending content” using Google search
> results.

That said, I don't think this is anything to be too stressed about; the
Grayzone is already a deprecated source and blogs like Marginal Revolution
are treated with caution, though Cowen has sufficient credentials to be
treated as a reliable expert.

On Fri, Mar 17, 2023 at 11:23 AM Kimmo Virtanen 
wrote:


[Wikimedia-l] Re: Bing-ChatGPT

2023-03-17 Thread Kimmo Virtanen
Hi,

The development of open-source large language models is going forward.
GPT-4 was released, and it reportedly passed the bar exam and tried to
hire humans to solve CAPTCHAs that were too complex for it. However,
development on the open-source and hacking side has been quite fast, and it
seems that all the pieces are in place for running LLMs on personal
hardware (and in web browsers). The biggest missing piece is fine-tuning of
open-source models such as NeoX for the English language. For multilingual
and multimodal use (for example, images+text) a model is also needed.


So this is a link dump of things relevant to the creation of an open-source
LLM model and service, and also a recap of where the hacker community is
now.


1.) Creation of an initial unaligned model.

   - Possible models
  - 20b Neo(X)  by EleutherAI
  (Apache 2.0)
  - Fairseq Dense  by
  Facebook (MIT license)
  - LLaMa
   by
  Facebook (custom license, leaked research use only)
   - Bloom  by BigScience (custom
   license: open,
   non-commercial)


2.) Fine-tuning or align

   - Example: Stanford Alpaca is LLaMA fine-tuned into a ChatGPT-style assistant
  - Alpaca: A Strong, Replicable Instruction-Following Model
  
  - Train and run Stanford Alpaca on your own machine
  
  - Github: Alpaca-LoRA: Low-Rank LLaMA Instruct-Tuning
  
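
As a rough sketch of why the LoRA approach linked above makes fine-tuning
cheap: instead of updating a full weight matrix, one trains only a small
low-rank pair of matrices. The dimensions below are made up for illustration;
this is not the actual Alpaca-LoRA code.

```python
import numpy as np

# LoRA idea: freeze the pretrained weight W (d_out x d_in) and train only
# a low-rank pair B (d_out x r) and A (r x d_in), with r << d_out, d_in.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 512, 512, 8, 16

W = rng.standard_normal((d_out, d_in))   # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))                 # B starts at zero, so W_eff == W at first

W_eff = W + (alpha / r) * (B @ A)        # effective weight after fine-tuning

full_params = W.size
lora_params = A.size + B.size
print(f"trainable params: {lora_params} vs {full_params} "
      f"({lora_params / full_params:.1%} of full fine-tuning)")
```

With these (made-up) sizes, the trainable parameter count drops by about
30x, which is what makes instruct-tuning a leaked LLaMA checkpoint feasible
on consumer hardware.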


3.) 8,4,3 bit-quantization of model for reduced hardware requirements

   - Running LLaMA 7B and 13B on a 64GB M2 MacBook Pro with llama.cpp
   
   - Github: bloomz.cpp  &
   llama.cpp  (C++-only versions)
   - Int-4 LLaMa is not enough - Int-3 and beyond
   
   - How is LLaMa.cpp possible?
   
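
A minimal sketch of the idea behind the bit-quantization links above (not
the actual llama.cpp/bloomz.cpp scheme, which quantizes per block of
weights): store weights as small integers plus a scale factor and
dequantize on the fly, trading memory for rounding error.

```python
import numpy as np

def quantize(w: np.ndarray, bits: int):
    """Symmetric quantization: map floats to signed integers of width `bits`."""
    qmax = 2 ** (bits - 1) - 1                 # e.g. 7 for 4-bit
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# Fewer bits -> smaller model in memory, but larger reconstruction error.
w = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
for bits in (8, 4, 3):
    q, s = quantize(w, bits)
    err = np.abs(dequantize(q, s) - w).mean()
    print(f"{bits}-bit: mean abs error {err:.4f}")
```

A 7B-parameter model at 4 bits per weight fits in roughly 3.5 GB, which is
why the MacBook experiments linked above work at all.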


4.) Easy-to-use interfaces

   - Transformer.js  (WebAssembly
   libraries to run LLM models in the browser)
   - Dalai  (run LLaMA and
   Alpaca on your own computer as a Node.js web service)
   - web-stable-diffusion
 (stable
   diffusion image generation in browser)


Br,
-- Kimmo Virtanen

On Fri, Mar 17, 2023 at 1:53 PM Kimmo Virtanen 
wrote:

[Wikimedia-l] Re: Bing-ChatGPT

2023-03-05 Thread Steven Walling
On Sun, Mar 5, 2023 at 8:39 PM Luis (lu.is)  wrote:

> On Feb 22, 2023 at 9:28 AM -0800, Sage Ross ,
> wrote:
>
> Luis,
>
> OpenAI researchers have released some info about data sources that
> trained GPT-3 (and hence ChatGPT): https://arxiv.org/abs/2005.14165
>
> See section 2.2, starting on page 8 of the PDF.
>
> The full text of English Wikipedia is one of five sources, the others
> being CommonCrawl, a smaller subset of scraped websites based on
> upvoted reddit links, and two unrevealed datasets of scanned books.
> (I've read speculation that one of these datasets is basically the
> Library Genesis archive.) Wikipedia is much smaller than the other
> datasets, although they did weight it somewhat more heavily than any
> other dataset. With the extra weighting, they say Wikipedia accounts
> for 3% of the total training.
>
>
> Thanks, Sage. Facebook’s recently-released LLaMa also shares some of their
> training sources, it turns out, with similar weighting for Wikipedia - only
> 4.5% of training text, but more heavily weighted than most other sources:
>
> https://twitter.com/GuillaumeLample/status/1629151234597740550
>

Those stats are undercounts, since the top source (CommonCrawl) itself
includes Wikipedia as its third-largest domain.

https://commoncrawl.github.io/cc-crawl-statistics/plots/domains


___
Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: 
https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and 
https://meta.wikimedia.org/wiki/Wikimedia-l
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/message/6UKCJWOUR2KVTS7QZYKPMKQGONXZ72QR/
To unsubscribe send an email to wikimedia-l-le...@lists.wikimedia.org

[Wikimedia-l] Re: Bing-ChatGPT

2023-03-05 Thread Luis (lu.is)
On Feb 22, 2023 at 9:28 AM -0800, Sage Ross , 
wrote:
> Luis,
>
> OpenAI researchers have released some info about data sources that
> trained GPT-3 (and hence ChatGPT): https://arxiv.org/abs/2005.14165
>
> See section 2.2, starting on page 8 of the PDF.
>
> The full text of English Wikipedia is one of five sources, the others
> being CommonCrawl, a smaller subset of scraped websites based on
> upvoted reddit links, and two unrevealed datasets of scanned books.
> (I've read speculation that one of these datasets is basically the
> Library Genesis archive.) Wikipedia is much smaller than the other
> datasets, although they did weight it somewhat more heavily than any
> other dataset. With the extra weighting, they say Wikipedia accounts
> for 3% of the total training.

Thanks, Sage. Facebook’s recently-released LLaMa also shares some of their 
training sources, it turns out, with similar weighting for Wikipedia - only 
4.5% of training text, but more heavily weighted than most other sources:

https://twitter.com/GuillaumeLample/status/1629151234597740550

[Wikimedia-l] Re: Bing-ChatGPT

2023-02-23 Thread Erik Moeller
On Mon, Feb 20, 2023 at 12:33 PM Jimmy Wales  wrote:
> Speaking only for myself, out of curiosity, some real world examples might be 
> helpful here.   I don't have access to Bing's
> version yet, but I do have access to chat.openai.com which is very impressive 
> but deeply flawed.

I've found ChatGPT most useful for small coding tasks (with a lot of
scrutiny). Most of the other practical applications I've heard of have
been of the creative variety, or in writing mundane letters, emails,
proposals, summaries, etc. As an example, please find a
ChatGPT-generated summary of this email at the end.

I think it's best to view ChatGPT (and its like) at this stage as, at
its best, a useful assistive technology and, at its worst, a
distributed denial-of-service attack on our collective ability to
understand our world.

The attempts to quickly commercially exploit these technologies tend
to push their impact more towards the latter, at least until those
deep flaws you mention are addressed.

It's a technology that requires a high degree of literacy in its
responsible use, while suggesting to the user that it requires none: a
dangerous combination.

The grand vision is to create human-level artificial intelligence.
"AGI" (Artificial General Intelligence) is now an explicit stated goal
of major players in the field. Of course, if AGI is in fact realized,
it _will_ change everything: a dream as big as SETI or limitless
energy generation. But for now we just have sparkling autocomplete.

It's easy to enumerate potential positive applications (assisted
editing, Wikidata query generation via natural language, automatic
summaries of open access citations, ...). For any one of them, I think
the challenge is to figure out a way towards _responsible_
integrations that add value without proliferating misinformation.

I do think that it is strategically vital for Wikimedia to understand
and explore this space, to look for low-risk/high-reward applications,
and to be dispassionate and objective in the face of both AI hype and
anti-AI backlash.

Erik

---

ChatGPT summary of this email:

The email discusses the practical applications of ChatGPT and warns
about the negative consequences of quickly commercializing AI
technology. The writer suggests responsible integration of AI to avoid
misinformation and add value, and recommends that Wikimedia explore
low-risk/high-reward AI applications while remaining objective in the
face of AI hype and backlash.


[Wikimedia-l] Re: Bing-ChatGPT

2023-02-22 Thread Anders Wennersten
I got the impression from the tech editor whose article I read that there 
is a big difference in how ChatGPT behaves when used together with Bing. 
Jimmy Wales here describes my own experience using only ChatGPT: if you ask 
"who is NN", you get unusable rubbish back.


But when the tech editor asked Bing-ChatGPT "who is Linus Larsson" (his 
name), he got a very good result that only exists in the article about him 
on Swedish Wikipedia (no article about him exists on English Wikipedia). I 
cannot interpret that in any other way than that this version looked up 
Wikipedia when asked.


But I am not a tech wizard, so I may be wrong.

Anders

https://www.dn.se/kultur/linus-larsson-microsofts-ai-gjorde-slut-med-mig-pa-alla-hjartans-dag/ 



(the article is in Swedish; the heading says "Microsoft's AI broke up with 
me on Valentine's Day")


I also note that the AI is insulting, answering "are you a fool or only 
stupid?". It seems it needs to be trained on our UCoC.



On 2023-02-22 at 17:32, Sage Ross wrote:

Luis,

OpenAI researchers have released some info about data sources that
trained GPT-3 (and hence ChatGPT): https://arxiv.org/abs/2005.14165

See section 2.2, starting on page 8 of the PDF.

The full text of English Wikipedia is one of five sources, the others
being CommonCrawl, a smaller subset of scraped websites based on
upvoted reddit links, and two unrevealed datasets of scanned books.
(I've read speculation that one of these datasets is basically the
Library Genesis archive.) Wikipedia is much smaller than the other
datasets, although they did weight it somewhat more heavily than any
other dataset. With the extra weighting, they say Wikipedia accounts
for 3% of the total training.

-Sage

On Wed, Feb 22, 2023 at 8:19 AM Luis (lu.is)  wrote:

Anders, do you have a citation for “use Wikipedia content considerably”?

Lots of early-ish ML work was heavily dependent on Wikipedia, but state-of-the-art 
Large Language Models are trained on vast quantities of text, of which Wikipedia 
is only a small part. ChatGPT does not share their data sources (as far as I know) 
but the Eleuther.ai project released their Pile a few years back, and that already 
had Wikipedia as < 5% of the text data; I think it is safe to assume that the 
percentage is smaller for newer models:  https://arxiv.org/abs/2101.00027

Techniques to improve reliability of LLM output may rely more heavily on 
Wikipedia. For example, Facebook uses Wikipedia rather heavily in this 
*research paper*: https://arxiv.org/abs/2208.03299 But I have seen no evidence 
that techniques like that are in use by OpenAI, or that they’re specifically 
trained on Wikipedia. If you’ve seen discussion of that, or evidence from 
output suggesting it, that’d be interesting and important!

Social: @luis_in_br...@social.coop
ML news: openml.fyi
On Feb 20, 2023 at 1:52 AM -0800, Anders Wennersten , 
wrote:

Bing with ChatGPT has now been released by Microsoft.

And from what I understand, they use Wikipedia content considerably. If
you ask "Who is A B" and A B is not widely known, the result is more or
less identical to the content of the Wikipedia article (but worse, as
it "makes up" facts that are incorrect).

In a way I am glad to see Wikipedia is fully relevant even in this
emerging AI-driven search world. But Google search has been careful to
always have a link to Wikipedia beside its made-up summary of facts,
which is missing here (yet?). And as for licenses, they are all ignored.

So if this is the future, the number of accesses from users to Wikipedia
will collapse, and also their willingness to donate... (but our content
is still a cornerstone of knowledge)

Anders

(I got a lot of facts from an article in Sweden's main newspaper by their
tech editor. He started by asking for facts about himself, and when he
received facts from his Wikipedia article plus a credit for a book he had
nothing to do with, he tried to tell/teach ChatGPT about this error.
ChatGPT only got angry, accusing the tech editor of lying, and in the end
cut off the conversation, as it continued to treat the tech editor as a
liar and vandal.)

___
Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: 
https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and 
https://meta.wikimedia.org/wiki/Wikimedia-l
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/message/FPPZSZUIG4SKGIWGX57O4K7MGJO4CSI3/
To unsubscribe send an email to wikimedia-l-le...@lists.wikimedia.org


[Wikimedia-l] Re: Bing-ChatGPT

2023-02-22 Thread Sage Ross
Luis,

OpenAI researchers have released some info about data sources that
trained GPT-3 (and hence ChatGPT): https://arxiv.org/abs/2005.14165

See section 2.2, starting on page 8 of the PDF.

The full text of English Wikipedia is one of five sources, the others
being CommonCrawl, a smaller subset of scraped websites based on
upvoted reddit links, and two unrevealed datasets of scanned books.
(I've read speculation that one of these datasets is basically the
Library Genesis archive.) Wikipedia is much smaller than the other
datasets, although they did weight it somewhat more heavily than any
other dataset. With the extra weighting, they say Wikipedia accounts
for 3% of the total training.
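
As a rough illustration of how a sampling weight interacts with corpus size: only the 3% Wikipedia weight below comes from the GPT-3 paper; the corpus sizes and training budget are hypothetical placeholders, not the paper's figures.

```python
# Sketch of weighted sampling in a training mix. Only the 3% Wikipedia
# weight is from the GPT-3 paper (sec. 2.2); every other number here is
# a HYPOTHETICAL placeholder for illustration.
datasets = {
    # name: (tokens_in_corpus, sampling_weight)
    "CommonCrawl": (400e9, 0.60),
    "Books":       (60e9,  0.22),
    "WebText":     (20e9,  0.15),
    "Wikipedia":   (3e9,   0.03),
}

training_budget = 300e9  # total tokens seen during training (assumed)

for name, (size, weight) in datasets.items():
    seen = training_budget * weight  # tokens drawn from this corpus
    epochs = seen / size             # times the corpus is (re)seen
    print(f"{name:12s} weight={weight:.0%}  epochs={epochs:.2f}")
```

A small corpus with a modest weight can still be repeated several times while a huge corpus is seen less than once, which is what "weighted more heavily than any other dataset" amounts to in practice.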

-Sage

On Wed, Feb 22, 2023 at 8:19 AM Luis (lu.is)  wrote:
>
> Anders, do you have a citation for “use Wikipedia content considerably”?
>
> Lots of early-ish ML work was heavily dependent on Wikipedia, but 
> state-of-the-art Large Language Models are trained on vast quantities of 
> text, of which Wikipedia is only a small part. ChatGPT does not share their 
> data sources (as far as I know) but the Eleuther.ai project released their 
> Pile a few years back, and that already had Wikipedia as < 5% of the text 
> data; I think it is safe to assume that the percentage is smaller for newer 
> models:  https://arxiv.org/abs/2101.00027
>
> Techniques to improve reliability of LLM output may rely more heavily on 
> Wikipedia. For example, Facebook uses Wikipedia rather heavily in this 
> *research paper*: https://arxiv.org/abs/2208.03299 But I have seen no 
> evidence that techniques like that are in use by OpenAI, or that they’re 
> specifically trained on Wikipedia. If you’ve seen discussion of that, or 
> evidence from output suggesting it, that’d be interesting and important!
>
> Social: @luis_in_br...@social.coop
> ML news: openml.fyi
> On Feb 20, 2023 at 1:52 AM -0800, Anders Wennersten 
> , wrote:
>
___
Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: 
https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and 
https://meta.wikimedia.org/wiki/Wikimedia-l
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/message/ET3W3Q2IXZAAJHLDXCMDIU3FELLLMWXX/
To unsubscribe send an email to wikimedia-l-le...@lists.wikimedia.org

[Wikimedia-l] Re: Bing-ChatGPT

2023-02-22 Thread Luis (lu.is)
Anders, do you have a citation for “use Wikipedia content considerably”?

Lots of early-ish ML work was heavily dependent on Wikipedia, but 
state-of-the-art Large Language Models are trained on vast quantities of text, 
of which Wikipedia is only a small part. OpenAI does not share ChatGPT's data 
sources (as far as I know), but the Eleuther.ai project released their Pile a 
few years back, and that already had Wikipedia at < 5% of the text data; I 
think it is safe to assume that the percentage is smaller for newer models:  
https://arxiv.org/abs/2101.00027

Techniques to improve reliability of LLM output may rely more heavily on 
Wikipedia. For example, Facebook uses Wikipedia rather heavily in this 
*research paper*: https://arxiv.org/abs/2208.03299 But I have seen no evidence 
that techniques like that are in use by OpenAI, or that they’re specifically 
trained on Wikipedia. If you’ve seen discussion of that, or evidence from 
output suggesting it, that’d be interesting and important!

Social: @luis_in_br...@social.coop
ML news: openml.fyi
On Feb 20, 2023 at 1:52 AM -0800, Anders Wennersten , 
wrote:
> Bing with ChatGPT has now been released by Microsoft.
>
> And from what I understand it uses Wikipedia content considerably. If
> you ask "Who is A B" and A B is not widely known, the result is more or
> less identical to the content of the Wikipedia article (but worse, as
> it "makes up" facts that are incorrect).
>
> In a way I am glad to see Wikipedia is fully relevant even in this
> emerging AI-driven search world. But Google search has been careful to
> always show a link to Wikipedia beside its generated summary of facts,
> which here is missing (yet?). And the licences are all ignored.
>
> So if this is the future, the number of accesses from users to Wikipedia
> will collapse, and so will their willingness to donate... (though our
> content is still a cornerstone of knowledge)
>
> Anders
>
> (I got many of these facts from an article in Sweden's main newspaper by
> its tech editor. He started by asking for facts about himself, and when
> he received facts from his Wikipedia article plus a credit for a book he
> had nothing to do with, he tried to tell/teach ChatGPT about this error.
> ChatGPT only got angry, accused the tech editor of lying, and in the end
> cut off the conversation, continuing to treat the tech editor as a liar
> and a vandal.)
___
Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: 
https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and 
https://meta.wikimedia.org/wiki/Wikimedia-l
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/message/FPPZSZUIG4SKGIWGX57O4K7MGJO4CSI3/
To unsubscribe send an email to wikimedia-l-le...@lists.wikimedia.org

[Wikimedia-l] Re: Bing-ChatGPT

2023-02-21 Thread Ali Kia
Hi.
Thanks a lot.

On Tue, Feb 21, 2023 at 1:30, Eduardo Testart wrote:

___
Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: 
https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and 
https://meta.wikimedia.org/wiki/Wikimedia-l
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/message/HG4A6REYCMK6KEYU2OV2NWDEWRRFWTHA/
To unsubscribe send an email to wikimedia-l-le...@lists.wikimedia.org

[Wikimedia-l] Re: Bing-ChatGPT

2023-02-20 Thread Eduardo Testart
Hi again,

Another potentially interesting podcast touching on this matter (more
or less):
https://www.nytimes.com/2023/02/17/podcasts/hard-fork-bing-ai-elon.html

Linked to the ones I sent before on the other thread.

If this is the new Napster revolution equivalent, yeah I know... back in
the day, buckle up!


Cheers,

On Mon, Feb 20, 2023, 17:33 Jimmy Wales  wrote:

___
Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: 
https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and 
https://meta.wikimedia.org/wiki/Wikimedia-l
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/message/ZT3JW7YF6PCBHP6SOOD3RGITFSSMSMDO/
To unsubscribe send an email to wikimedia-l-le...@lists.wikimedia.org

[Wikimedia-l] Re: Bing-ChatGPT

2023-02-20 Thread Jimmy Wales
Speaking only for myself, out of curiosity, some real world examples 
might be helpful here.   I don't have access to Bing's
version yet, but I do have access to chat.openai.com which is very 
impressive but deeply flawed.


I asked "Who is Kate Garvey?" (my wife, known a bit to the media, but 
not famous) and the answer is weird and laughably bad with
more sentences false than true.   Among other silly things, it says that 
she worked for Theresa May and was involved
with Brexit negotiations, which if you knew my wife's politics borders 
on libel.  It also says she co-founded an organization
(which as far as I can tell, it just made up out of thin air) with Nick 
Clegg's wife.  It's completely mad.





On 2023-02-20 09:34, Anders Wennersten wrote:

___
Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: 
https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and 
https://meta.wikimedia.org/wiki/Wikimedia-l
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/message/MRF2C3YMIF64RNWOYD3PB2DZ24VVFKMG/
To unsubscribe send an email to wikimedia-l-le...@lists.wikimedia.org

[Wikimedia-l] Re: Bing-ChatGPT

2023-02-20 Thread Kim Bruning via Wikimedia-l
FWIW YMMV,

Executive Summary:
==

* I looked into Stable Diffusion recently. BEWARE: the actual technical and
  legal situation on the ground with these systems is VERY different from
  what, say, Twitter will lead you to believe. Also: everything you know will
  be wrong and out of date within 1-2 months at the current pace.

* In general: Times are changing. For better or for worse;
  if we seize the initiative here, we may be able to advance our cause
  considerably.


Stable Diffusion:
==

I recently got into a kerfuffle elsewhere over Stable Diffusion,
which is a similar technology, forcing me to research it in more detail.

Initially I was inclined to take the claims of people opposed to SD at face
value (people claimed with absolute certainty that SD was art theft,
unethical, out to destroy artists, and all-around Bad Guys (tm)) ...

... but on researching I was surprised to find:

* SD was FLOSS and scrupulously annotated. (may or may not be
relevant here)

and/or, when I looked at the (C) situation, one or more of the following
applied:
* There was no copyright whatsoever due to significant non-human input [1].
* Or there was a very strong case for transformative fair use and significant
  non-infringing uses as per [2].
* And even IF any actual copying/derivation could be argued, it was de minimis
  [3] (on average 2 bits of data per 50 byte image).
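
The de minimis arithmetic above can be sketched like this; the model and dataset sizes are rough, assumed ballpark figures used only to show the shape of the calculation, not exact values.

```python
# Back-of-envelope: how much model capacity exists per training image.
# Both numbers are rough, ASSUMED ballpark figures for a Stable
# Diffusion-scale model and a LAION-scale training set, used only to
# illustrate the argument, not as exact values.
model_weights_bytes = 2e9    # ~2 GB of model weights (assumed)
training_images = 2.3e9      # ~2.3 billion training images (assumed)

bits_per_image = model_weights_bytes * 8 / training_images
print(f"~{bits_per_image:.0f} bits of model capacity per training image")
```

Even a handful of bits per image is nowhere near enough to store a copy of any individual work, which is the core of the de minimis point.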


Finally:
* The current rate of innovation in this sphere is dizzying: from ugly muddy
  blobs ~12 months ago to near-photorealistic output today.


This situation surprised me somewhat. I would be very interested to see what the
ChatGPT defense will look like. 


In general:
=

In the short term, precedents or reactive legislation _might_ hurt Wikipedia
somewhat, but in the mid-term I have hope that the (C) system will be found
to be in need of an overhaul anyway. This would then be an opportunity for
CC/FLOSS to engage, advance our goals, and advocate for our ethics.


sincerely,
Kim




[1] https://en.wikipedia.org/wiki/Monkey_selfie_copyright_dispute
[2] 
https://en.wikipedia.org/wiki/Sony_Corp._of_America_v._Universal_City_Studios,_Inc.
[3] https://en.wikipedia.org/wiki/De_minimis



On Mon, Feb 20, 2023 at 10:34:16AM +0100, Anders Wennersten wrote:
___
Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: 
https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and 
https://meta.wikimedia.org/wiki/Wikimedia-l
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/message/N2YUVV2R4L45ZWQQRX2F3V4XGVZFBNJF/
To unsubscribe send an email to wikimedia-l-le...@lists.wikimedia.org