On Sun, Mar 19, 2023 at 12:12 PM Lauren Worden <laurenworde...@gmail.com> wrote:

> They have, and LLMs absolutely do encode a verbatim copy of their
> training data, which can be produced intact with little effort. See
> https://arxiv.org/pdf/2205.10770.pdf -- in particular the first
> paragraph of the Background and Related Work section on page 2, where
> document extraction is considered an "attack" against such systems,
> which to me implies that the researchers fully realize they are
> involved with copyright issues on an enormous scale. Please see also
> https://bair.berkeley.edu/blog/2020/12/20/lmmem/

Thanks for these links, Lauren. I think it could be a very interesting
research project (for WMF, affiliates or Wikimedia research community
members) to attempt to recall Wikimedia project content such as
Wikipedia articles via the GPT-3.5 or GPT-4 API, to begin quantifying
the degree to which the models produce exact copies (or legally
covered derivative works--as opposed to novel expressions).
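
As a rough illustration of what "quantifying" could mean here, below is a
minimal sketch in Python. It assumes you already have a reference text (say,
a Wikipedia article) and a model output as plain strings; how the output is
obtained from an API is deliberately left out. The function names
longest_shared_run and ngram_overlap are hypothetical, and the word-level
comparison is just one simple choice. It measures verbatim overlap only and
says nothing about what counts as a derivative work in the legal sense.

```python
from difflib import SequenceMatcher


def longest_shared_run(reference: str, generated: str) -> int:
    """Length, in words, of the longest word sequence found verbatim in both texts."""
    ref_words = reference.split()
    gen_words = generated.split()
    # autojunk=False so frequent words are not silently ignored on long texts.
    matcher = SequenceMatcher(None, ref_words, gen_words, autojunk=False)
    match = matcher.find_longest_match(0, len(ref_words), 0, len(gen_words))
    return match.size


def ngram_overlap(reference: str, generated: str, n: int = 5) -> float:
    """Fraction of the generated text's word n-grams that also occur in the reference."""
    ref_words = reference.split()
    gen_words = generated.split()
    ref_ngrams = {tuple(ref_words[i:i + n]) for i in range(len(ref_words) - n + 1)}
    gen_ngrams = [tuple(gen_words[i:i + n]) for i in range(len(gen_words) - n + 1)]
    if not gen_ngrams:
        # Generated text is shorter than n words: nothing to compare.
        return 0.0
    return sum(g in ref_ngrams for g in gen_ngrams) / len(gen_ngrams)
```

Run over many article/output pairs, the two numbers would give a first,
crude picture of how often long passages come back word for word; the
n-gram size of 5 is an arbitrary starting point, not a recommendation.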

> With luck we will all have the chance to discuss these issues in
> detail on the March 23 Zoom discussion of large language models for
> Wikimedia projects:
> https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2023-2024/Draft/External_Trends#Open_call:_Artificial_Intelligence_in_Wikimedia

I won't be able to join but am glad this is happening. I agree that it
would be good for WMF to engage with LLM providers on these questions
of attribution sooner rather than later, if that is not already
underway. WMF is, as I understand it, still not in any privileged
position to assert or enforce copyright (because it requires no
copyright assignment from authors) -- but it can certainly make legal
requirements clear, and also develop best practices that go beyond the
legal minimum.

Warmly,
Erik
_______________________________________________
Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: 
https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and 
https://meta.wikimedia.org/wiki/Wikimedia-l
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/message/53LPHDEFJIY646GRJS5SCZYNWMWDZG4Q/
To unsubscribe send an email to wikimedia-l-le...@lists.wikimedia.org
