On Sat, Mar 18, 2023 at 3:49 PM Erik Moeller <eloque...@gmail.com> wrote:
>
> ...With image-generating models like Stable Diffusion, it's been found
> that the models sometimes generate output nearly indistinguishable
> from source material [1]. I don't know if similar studies have been
> undertaken for text-generating models yet.

They have, and LLMs absolutely do encode verbatim copies of portions of
their training data, which can be reproduced intact with little effort. See
https://arxiv.org/pdf/2205.10770.pdf -- in particular the first
paragraph of the Background and Related Work section on page 2, where
document extraction is described as an "attack" against such systems,
which to me implies that the researchers fully realize they are
dealing with copyright issues on an enormous scale. Please see also
https://bair.berkeley.edu/blog/2020/12/20/lmmem/
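For anyone who wants a concrete sense of what "verbatim copy" means in this
context: memorization studies typically quantify it as the longest contiguous
run of tokens in a model's output that also appears in the training corpus.
Here is a toy, self-contained sketch of that measurement (not the cited
paper's actual method or code; the example strings are invented):

```python
# Hypothetical sketch: measure how much of a model's output is a
# verbatim copy of known training text. The metric is the longest
# contiguous token run shared between output and corpus.

def longest_verbatim_run(output_tokens, corpus_tokens):
    """Length of the longest contiguous run in output_tokens that
    also appears contiguously in corpus_tokens (classic longest
    common substring over tokens, O(n*m) dynamic programming)."""
    best = 0
    m = len(corpus_tokens)
    prev = [0] * (m + 1)
    for tok in output_tokens:
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            if tok == corpus_tokens[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

# Invented example: five tokens of the "training" text survive intact
# in the "generated" text.
corpus = "the quick brown fox jumps over the lazy dog".split()
output = "a model may emit the quick brown fox jumps verbatim".split()
print(longest_verbatim_run(output, corpus))  # 5
```

Real extraction attacks work at a much larger scale (prompting the model and
searching huge corpora with suffix arrays rather than quadratic DP), but the
underlying question being asked is exactly this one.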

On Sat, Mar 18, 2023 at 9:17 PM Steven Walling <steven.wall...@gmail.com> wrote:
>
> The whole thing is definitely a hot mess. If the remixing/transformation by 
> the model is a derivative work, it means OpenAI is potentially violating the 
> ShareAlike requirement by not distributing the text output as CC....

The Foundation needs to get on top of this by making a public request
to all of the LLM providers that use Wikipedia as training data,
asking that they acknowledge attribution of any output which may have
depended on CC-BY-SA content, license model output as CC-BY-SA,
and, most importantly, disclaim any notion of accuracy or fidelity to
the training data. This needs to happen soon. So many people are
preparing to turn the reins of their editorial control over to these
new LLMs which they don't understand. CNET
[https://gizmodo.com/cnet-ai-chatgpt-news-robot-1849996151], not to
mention Tyler Cowen's blog, has already felt the pain, and sadly
decided to hastily cover it up. The overarching risk here is akin to
"citogenesis" but much more pernicious.

On Sun, Mar 19, 2023 at 1:20 AM Kimmo Virtanen
<kimmo.virta...@wikimedia.fi> wrote:
>
>> Or, maybe just require an open disclosure of where the bot pulled from and 
>> how much, instead of having it be a black box? "Text in this response 
>> derived from: 17% Wikipedia article 'Example', 12% Wikipedia article 
>> 'SomeOtherThing', 10%...".
>
> Current (ie. ChatGPT) systems doesn't work that way, as the source of 
> information is lost in the process when the information is encoded into the 
> model....

In fact, they do work that way, but it takes some effort to elucidate
the source of any given output. Anyone discussing these issues needs
to become familiar with ROME:
https://twitter.com/mengk20/status/1588581237345595394 Please see also
https://www.youtube.com/watch?v=_NMQyOu2HTo
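Even without ROME-style model surgery, the per-source percentage breakdown
proposed above can be approximated post hoc by comparing an output's n-grams
against candidate source articles. A toy sketch of that idea (article names,
texts, and the 3-gram choice are all invented for illustration; this is a
rough heuristic, not true provenance):

```python
from collections import Counter

def ngrams(tokens, n=3):
    """Multiset of all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def attribution(output_text, sources, n=3):
    """Rough per-source share: the fraction of the output's n-grams
    that also occur in each candidate source article. Shares can
    overlap and need not sum to 1 -- this is a heuristic."""
    out = ngrams(output_text.split(), n)
    total = sum(out.values()) or 1
    shares = {}
    for name, text in sources.items():
        src = ngrams(text.split(), n)
        hit = sum(count for gram, count in out.items() if gram in src)
        shares[name] = hit / total
    return shares

# Invented mini-corpus standing in for two Wikipedia articles.
sources = {
    "Example": "alpha beta gamma delta epsilon",
    "SomeOtherThing": "one two three four five",
}
out = "alpha beta gamma one two three unrelated words here"
print(attribution(out, sources))
```

In this toy case each article accounts for one of the output's seven
trigrams, so each gets a share of about 14%. The hard part in practice is
that memorized content is often paraphrased, so surface n-gram overlap
undercounts the true dependence -- which is exactly why interpretability
work like ROME matters.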

With luck we will all have the chance to discuss these issues in
detail on the March 23 Zoom discussion of large language models for
Wikimedia projects:
https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2023-2024/Draft/External_Trends#Open_call:_Artificial_Intelligence_in_Wikimedia

--LW
_______________________________________________
Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: 
https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and 
https://meta.wikimedia.org/wiki/Wikimedia-l
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/message/JUIBHOLPLLE2VH4PSCUH4I5WV5OCX2MS/
To unsubscribe send an email to wikimedia-l-le...@lists.wikimedia.org