On Mon, Mar 20, 2023 at 9:28 PM Kim Bruning via Wikimedia-l
<wikimedia-l@lists.wikimedia.org> wrote:
> On Sun, Mar 19, 2023 at 02:48:12AM -0700, Lauren Worden wrote:
> >
> > .... LLMs absolutely do encode a verbatim copy of their
> > training data, which can be produced intact with little effort.
>
> > https://arxiv.org/pdf/2205.10770.pdf
> > https://bair.berkeley.edu/blog/2020/12/20/lmmem/
>
> My understanding so far is that encoding a verbatim copy is typically due to 
> 'Overfitting'.
>
> This is considered a type of bug. It is undesirable for many reasons
> (technical, ethical, legal).

I believe the authors mainly use "overfitting" for the condition in
which the model reproduces its training data verbatim, rather than
producing a reasonably distinct paraphrase or summary, even when the
verbatim source is not specifically elicited. But I'm not certain the
term isn't used in both senses.
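
For what it's worth, the extraction papers linked above probe for this
by feeding the model a prefix from a candidate training document and
measuring how much of the continuation comes back verbatim. Below is a
minimal sketch of such a probe in Python; the model name, prefix
length, and generation settings are illustrative placeholders of mine,
not values from those papers.

# Minimal sketch of a verbatim-memorization probe, in the spirit of
# the extraction experiments linked above. All parameters here are
# illustrative placeholders, not values from the papers.
from transformers import AutoModelForCausalLM, AutoTokenizer

def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

def probe_memorization(model_name: str, document: str,
                       prefix_chars: int = 200) -> float:
    """Prompt the model with a prefix of `document` and report what
    fraction of the greedy continuation appears verbatim in it."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    inputs = tokenizer(document[:prefix_chars], return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    continuation = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    if not continuation:
        return 0.0
    return longest_common_substring(continuation, document) / len(continuation)

# A score near 1.0 means the continuation was reproduced verbatim,
# suggesting memorization rather than paraphrase, e.g.:
# print(probe_memorization("gpt2", some_known_training_text))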

This brings up an important point. When asked to paraphrase or
summarize input text, ChatGPT seems almost always to avoid the kind of
infringing paraphrases described in
https://en.wikipedia.org/wiki/Wikipedia:Close_paraphrasing, which
makes it a convenient way to avoid such issues. I get the feeling that
Wikipedia editors are already using it for this purpose on a
relatively large scale. But I'm hesitant to encourage such use until
copyright experts familiar with the legal precedents on "substantial
similarity" described in that essay have had the opportunity to
evaluate LLM output across a wide range of example cases. Ordinary
Wikipedia editors have no way to know how likely such problems are,
how to evaluate specific cases, or how to address them when they
arise. Professional guidance on this topic would be very helpful.
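
In the meantime, a crude mechanical check is at least possible. The
sketch below flags word n-gram overlap between a source passage and a
candidate rewrite. The n-gram size and cutoff are arbitrary choices of
mine, and passing such a check says nothing about the legal question
of substantial similarity.

# Crude heuristic sketch for flagging close paraphrasing: word n-gram
# overlap between a source and a candidate rewrite. The n-gram size
# and threshold are arbitrary illustrations, not a legal test.
import re

def ngrams(text: str, n: int = 5) -> set:
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(source: str, rewrite: str, n: int = 5) -> float:
    """Fraction of the rewrite's n-grams that also appear in the source."""
    rewrite_grams = ngrams(rewrite, n)
    if not rewrite_grams:
        return 0.0
    return len(rewrite_grams & ngrams(source, n)) / len(rewrite_grams)

source_text = "The quick brown fox jumps over the lazy dog near the river."
candidate = "The quick brown fox jumps over the lazy dog by the water."
if overlap_ratio(source_text, candidate) > 0.2:  # arbitrary cutoff
    print("Possible close paraphrase; compare against the source by hand.")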

On Mon, Mar 20, 2023 at 8:01 PM Erik Moeller <eloque...@gmail.com> wrote:
>
> ... I agree that it
> would be good for WMF to engage with LLM providers on these questions
> of attribution sooner rather than later, if that is not already
> underway. WMF is, as I understand it, still not in any privileged
> position of asserting or enforcing copyright (because it requires no
> copyright assignment from authors) -- but it can certainly make legal
> requirements clear, and also develop best practices that go beyond the
> legal minimum.

Thank you. Another thing the Foundation could do without involving
editors (a class action suit by editors would probably be
counterproductive at best at this point, for a number of reasons, and
could backfire) is to highlight and encourage the ongoing but
relatively obscure work on attribution and verification by LLMs. Two
projects in particular, Sparrow [https://arxiv.org/abs/2209.14375] and
RARR [https://arxiv.org/abs/2210.08726], deserve wider recognition,
support, and third-party replication. These research directions are
the most robust way to avoid the hallucination problems at the root of
almost everything that can go wrong when LLMs are used to produce
Wikipedia content. It would be extremely helpful if the Foundation
used its clout to highlight them and point out that they do what we
expect of Wikipedia editors: provide sources in support of summary
text, cited in a way that third parties can independently verify.
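
To make the shape of that work concrete, here is a rough sketch of a
RARR-style research-and-revise loop, as I understand it from the
paper. Every component is a placeholder stub meant to show the data
flow; none of this is the authors' actual implementation.

# Rough sketch of a RARR-style "research and revise" loop, as I read
# https://arxiv.org/abs/2210.08726. All components are placeholder
# stubs, not the authors' implementation.
from dataclasses import dataclass

@dataclass
class AttributedSentence:
    text: str
    sources: list  # URLs of evidence supporting the sentence

def generate_queries(sentence: str) -> list:
    """Stage 1: ask an LLM for search queries that would verify `sentence`."""
    raise NotImplementedError  # placeholder: call your LLM of choice

def retrieve_evidence(query: str) -> list:
    """Stage 2: run the query against a search engine; return (url, snippet) pairs."""
    raise NotImplementedError  # placeholder: call your retriever of choice

def is_supported(sentence: str, snippet: str) -> bool:
    """Stage 3: agreement check -- does the snippet entail the sentence?"""
    raise NotImplementedError  # placeholder: e.g. an NLI model

def revise(sentence: str, snippet: str) -> str:
    """Stage 4: minimally edit the sentence so the evidence supports it."""
    raise NotImplementedError  # placeholder: another LLM call

def research_and_revise(sentence: str) -> AttributedSentence:
    """Check each claim against retrieved evidence, revise it when
    unsupported, and return the sentence with the citations that back
    it. (A real implementation would re-check earlier evidence after
    each revision.)"""
    sources = []
    for query in generate_queries(sentence):
        for url, snippet in retrieve_evidence(query):
            if is_supported(sentence, snippet):
                sources.append(url)
            else:
                sentence = revise(sentence, snippet)
    return AttributedSentence(sentence, sources)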

The Bing LLM already attempts something like this with a dual-process
search system, which I believe is modeled on the Sparrow approach. But
without the explicit rigor of something like RARR, it can fail
spectacularly, producing the same confidently wrong output everyone
has recently become familiar with, compounded by citations that appear
to support the claims but do not. For example, see this thread:
https://twitter.com/dileeplearning/status/1634699315582226434

-LW