(Full disclosure: I presently work for a non-FAANG cloud company
whose primary business is providing GPU access for AI & other
workloads; I don't feel that is a conflict of interest, but I
understand that others might not feel the same way.)

Yes, we need to formally address the concerns.
However, I don't come to the same conclusion about an outright ban.

I think we need to:
1. Short-term: clearly point out why much of the present output
   would violate existing policies, esp. the low-grade garbage output.
2. Short & medium-term: a time-limited policy saying "no AI-backed
   works temporarily, while waiting for legal precedent", with clear
   guidelines about what the blocking issues are.
3. Longer-term, produce a policy that shows how AI generation can be
   used for good, in a safe way.
4. Keep the human in the loop; no garbage reinforcing garbage.

Further points inline.

On Tue, Feb 27, 2024 at 03:45:17PM +0100, Michał Górny wrote:
> Hello,
> 
> Given the recent spread of the "AI" bubble, I think we really need to
> look into formally addressing the related concerns.  In my opinion,
> at this point the only reasonable course of action would be to safely
> ban "AI"-backed contribution entirely.  In other words, explicitly
> forbid people from using ChatGPT, Bard, GitHub Copilot, and so on, to
> create ebuilds, code, documentation, messages, bug reports and so on for
> use in Gentoo.
Are there areas where you would find AI tooling acceptable today?
AI summarization of inputs, if correct & free of hallucinations, is
likely to be of immediate value. I see this coming up in analyzing
code backtraces as well as in better license-analysis tooling.
The best tools here include citations explaining why the system thinks
the outcome is correct: buyer beware if you don't verify those
citations.

> Just to be clear, I'm talking about our "original" content.  We can't do
> much about upstream projects using it.
> 
> Rationale:
> 
> 1. Copyright concerns.  At this point, the copyright situation around
> generated content is still unclear.  What's pretty clear is that pretty
> much all LLMs are trained on huge corpora of copyrighted material, and
> all fancy "AI" companies don't give shit about copyright violations.
> In particular, there's a good risk that these tools would yield stuff we
> can't legally use.
The Gentoo Foundation (and SPI) are both US legal entities. That means
at least abiding by US copyright law...
As of writing this, the US Copyright Office says AI-generated
works are NOT eligible for their *own* copyright registration. The
outputs are either un-copyrightable, or, if they are sufficiently
similar to existing works, the original copyright stands (with
license and authorship markings required).

That's going to be a problem if the EU, UK & other major WIPO members
come to a different conclusion, but for now, as a US-based organization,
Gentoo has the rules it must follow.

The fact that output *might* be uncopyrightable, yet NOT tagged as
such, concerns me as much as the missing attribution & license
statements. Enough untagged uncopyrightable material present MAY
invalidate larger copyrights.

Clearer definitions of the distinction between public domain vs.
uncopyrightable are also required in our Gentoo documentation (at a
high level: ineligible vs. not copyrighted vs. expired vs.
laws/acts-of-government vs. works-of-government, but there is nuance).

> 
> 2. Quality concerns.  LLMs are really great at generating plausibly
> looking bullshit.  I suppose they can provide good assistance if you are
> careful enough, but we can't really rely on all our contributors being
> aware of the risks.
100% agree; the quality of output is the largest concern *right now*.
The consistency of output is strongly related: given similar inputs
(including best practices not changing over time), a tool should give
similar outputs.

How good must the output be to negate this concern?
Current state-of-the-art tools can probably write ebuilds with fewer QA
violations than most contributors, esp. given automated QA-checking
tools for a positive-reinforcement loop.

Besides the actual output being low-quality, the larger problem is that
users submitting it don't realize that it's low-quality (or in a few
cases don't care).

Gentoo's existing policies may only need tweaks & reiteration here.
- GLEP 76 does not set out clear guidelines for uncopyrightable works.
- GLEP 76 should clarify that asserting the GCO/DCO over AI-generated
  works is not acceptable at this time.

> 3. Ethical concerns.  As pointed out above, the "AI" corporations don't
> give shit about copyright, and don't give shit about people.  The AI
> bubble is causing huge energy waste.  It is giving a great excuse for
> layoffs and increasing exploitation of IT workers.  It is driving
> enshittification of the Internet, it is empowering all kinds of spam
> and scam.
Is an ethical AI entity possible? Your argument here is really an
extension of a much older maxim: "There is no ethical consumption under
capitalism". That can encompass most tech corporations, AI or not.
It's just much more readily exposed with AI than with other "big tech"
movements, because AI, and the name of AI, is being used to do immoral
& unethical things far more frequently than before.

A truly ethical AI entity should also not be the outcome of
rent-seeking behaviors (maybe profit-seeking, but that returns to the
perils of capitalism).

The energy-waste argument is also one that needs to be made carefully:
the training & fine-tuning phases today look wasteful only when
compared to the lifetime energy a human uses to learn the same things.
When that gets more efficient, the human may be the energy waste ;-) [1].

The generation/inference phases may be able to produce correct output
MUCH more efficiently than a human. If I think of how many times I run
"ebuild ... test" and "pkgcheck scan" on some packaging, trying to get
it correct: the AI will be able to do a better job than most developers
in a reasonable amount of time...

Gentoo's purpose as an organization is not to be an arbiter of ethics,
but we can stand against unethical actions. Where is that middle ground?

At the top, I noted that it will be possible in the future for AI
generation to be used in a good, safe way, and we should provide some
signals to the researchers behind the AI industry on this matter.

What should it have?
- The output has correct license & copyright attributions for portions
  that are copyrightable.
- The output explicitly disclaims copyright for uncopyrightable portions
  (yes, this is a higher bar than we set for humans today).
- The output is provably correct (QA checks, actually running tests, etc).
- The output is free of non-functional/nonsense garbage.
- The output is free of hallucinations (aka doesn't invent dependencies
  that don't exist).

Can you please contribute other requirements that you feel "good" AI
output should have?

[1]
Citation needed; Best estimate I have says:
https://www.eia.gov/tools/faqs/faq.php?id=85&t=1 76 MMBtu/person/year
https://www.wolframalpha.com/input?i=+76+MMBtu+to+MWh => 22.27 MWh/person/year
vs
Facebook claims the entire model-development energy consumption for all
4 sizes of LLaMA was 2,638 MWh
https://kaspergroesludvigsen.medium.com/facebook-disclose-the-carbon-footprint-of-their-new-llama-models-9629a3c5c28b

2638 / 22.27 => 118.45 people
So development energy was the same as 118 average people doing average
things for a year
(not CompSci students compiling their code many times).
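
For the curious, the arithmetic above can be sanity-checked in a few
lines of Python; the figures are the EIA and LLaMA numbers cited above,
and the MMBtu-to-MWh factor is the standard unit conversion:

```python
# Sanity check of the footnote arithmetic, using the EIA and
# LLaMA figures cited above.
MMBTU_TO_MWH = 0.293071  # standard conversion: 1 MMBtu = 0.293071 MWh

person_mwh_per_year = 76 * MMBTU_TO_MWH  # EIA: 76 MMBtu/person/year
llama_training_mwh = 2638                # reported total, all 4 LLaMA sizes

people_equivalent = llama_training_mwh / person_mwh_per_year
print(f"{person_mwh_per_year:.2f} MWh/person/year")  # ~22.27
print(f"{people_equivalent:.1f} person-years")       # ~118.4
```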

The outcome here: don't use AI where a human would be much more
efficient, unless you have strong reasons why the AI would be better
than the human. We haven't crossed that threshold YET, but the day is
coming, esp. since training is a rare event whose cost is amortized
over many inferences.
    

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation President & Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
