Beyond the two extremes of making all content available verbatim or making 
none of it available verbatim, developers might want some content to be 
available verbatim while other content is available only indirectly.

While AI systems could automatically determine which content is useful to 
store verbatim, if we want content authors to be able to provide hints, we 
could consider new HTML elements, clever uses of existing elements and 
attributes, or schema.org schemas.

For example, suppose that an HTML document author wants to hint that the topic 
sentence of a paragraph may be quoted verbatim while the remainder of that 
paragraph should be available only indirectly. The markup could resemble the 
following rough sketch:

<p><span id="anchor123" role="quoteable">This is some text, a topic 
sentence.</span> This is a secondary sentence in the paragraph.</p>

This sketch shows two overlapping markup approaches. Perhaps all elements with 
IDs, being URL-addressable content, should be considered verbatim quotable. Or 
perhaps some HTML attribute, e.g., role, could carry the hint. Again, 
schema.org schemas could also be of use.
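
To make the idea concrete, here is a minimal sketch of how a consumer of such 
hints might collect the quotable spans from a page. It assumes the hypothetical 
role="quoteable" convention from the sketch above (this is not a standard ARIA 
role), and the names QuotableExtractor, extract_quotable, and require_id are 
illustrative only. It uses Python's standard html.parser module.

```python
from html.parser import HTMLParser


class QuotableExtractor(HTMLParser):
    """Collects the text of elements hinted as verbatim quotable.

    Assumption: the hint is role="quoteable", as in the rough sketch;
    this is a hypothetical convention, not an existing standard.
    """

    # Void elements never have an end tag, so they are ignored when
    # tracking nesting depth.
    VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
            "link", "meta", "source", "track", "wbr"}

    def __init__(self, require_id=False):
        super().__init__()
        self.require_id = require_id  # only accept URL-addressable elements
        self.depth = 0                # nesting depth inside a hinted element
        self.current_id = None
        self.buffer = []
        self.quotes = []              # list of (id, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        attrs = dict(attrs)
        if self.depth:
            # Already inside a hinted element: just track nesting.
            self.depth += 1
        elif attrs.get("role") == "quoteable" and (
                attrs.get("id") or not self.require_id):
            self.depth = 1
            self.current_id = attrs.get("id")
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in self.VOID or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            # Collapse line breaks and runs of whitespace into single spaces.
            text = " ".join("".join(self.buffer).split())
            self.quotes.append((self.current_id, text))

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def extract_quotable(html, require_id=False):
    """Return (id, text) pairs for every hinted element in the HTML."""
    parser = QuotableExtractor(require_id)
    parser.feed(html)
    parser.close()
    return parser.quotes
```

Calling extract_quotable on the paragraph from the sketch would yield the topic 
sentence paired with its ID, so that a quotation could link back to the 
anchor; the secondary sentence would be skipped. Passing require_id=True 
combines the two approaches by accepting a hint only on elements that are also 
URL-addressable.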


Also, I hope that you find interesting the following discussion thread about 
Educational Applications of AI in Web Browsers: 
https://github.com/microsoft/semantic-kernel/discussions/108 . There, I ask 
some questions about modern LLMs and APIs, about referring to documents by URL 
in prompts, about prioritizing some documents over others when answering 
questions, and so forth. A “Web browser Copilot” would have educational 
applications. It could allow students to ask questions about the specific 
HTML, PDF, and EPUB documents that they are browsing and, perhaps, AI 
components could navigate to pages, scroll to content, and highlight document 
content for end-users while responding.


Best regards,
Adam Sobieski

________________________________
From: Kim Bruning via Wikimedia-l <wikimedia-l@lists.wikimedia.org>
Sent: Sunday, March 19, 2023 10:36 PM
To: Wikimedia Mailing List <wikimedia-l@lists.wikimedia.org>
Cc: Kim Bruning <k...@kimbruning.nl>
Subject: [Wikimedia-l] Re: Bing-ChatGPT

On Sun, Mar 19, 2023 at 02:48:12AM -0700, Lauren Worden wrote:
>
> They have, and LLMs absolutely do encode a verbatim copy of their
> training data, which can be produced intact with little effort.

> https://arxiv.org/pdf/2205.10770.pdf
> https://bair.berkeley.edu/blog/2020/12/20/lmmem/

My understanding so far is that encoding a verbatim copy is typically due to 
'Overfitting'.

This is considered a type of bug. It is undesirable for many reasons
(technical, ethical, legal).

Models are (supposed to be) trained to prevent this as much as possible.

Clearly there was still work to be done as of December 2020, at the least.

sincerely,
        Kim Bruning
_______________________________________________
Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: 
https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and 
https://meta.wikimedia.org/wiki/Wikimedia-l
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/message/5PNCR3KVBCEEKYT6I3J6VZKFE7NFIGB2/
To unsubscribe send an email to wikimedia-l-le...@lists.wikimedia.org
