Re: [gentoo-dev] Re: EGO_SUM

2023-06-09 Thread Florian Schmaus

On 02/06/2023 10.31, Michał Górny wrote:
> On Fri, 2023-06-02 at 10:17 +0200, Florian Schmaus wrote:
> > On 30/05/2023 18.35, Arthur Zamarin wrote:
> > > On 30/05/2023 18.52, Florian Schmaus wrote:
> > > > To prevent harm from Gentoo, we should reach an agreement that everyone
> > > > can live with. To achieve a consensus, and since I can not rule out that
> > > > I missed a post that includes specific numbers, please share your ideas
> > > > on how EGO_SUM could be reinstated in ::gentoo by replying to this mail.
> > >
> > > I still want to ask why in ::gentoo should it be enabled? I'm trying to
> > > understand why?
> >
> > In short: Auditability
> > […]
> > A Gentoo developer, Gentoo user, or, anyone can look at the ebuild and
> > immediately tell that it will likely not inject malicious code into the
> > resulting binary image. Furthermore, the only input is from upstream,
> > and while you may not look at every line of source code, you assign a
> > certain trust level to upstream and probably assume that the input is
> > also likely non-malicious.
>
> This reasoning is seriously flawed.  A "typical" EGO_SUM ebuilds
> contains dozens to hundreds of different packages from dozens of
> different authors.  You can't seriously expect anyone to be able to
> reasonably establish trust to all of them.


I am sorry. I was unable to get my point across.

The security impact is unrelated to what you describe. You always have a
certain degree of trust in upstream, regardless of whether upstream is
consumed by 100 Gentoo packages or there are 100 entries in EGO_SUM.

The point was and is about *non-upstream input* in the ebuild. While
EGO_SUM fetches its artifacts from upstream, a dependency tarball
typically does not originate from upstream.

Even if we did not trust EGO_SUM upstream, consuming inputs via
EGO_SUM would still be better from a security perspective, because
EGO_SUM upstream is consumed by Gentoo and all of Go's ecosystem. Hence,
if something gets compromised, it will likely be detected quickly.
Dependency tarballs, by contrast, are usually consumed only by Gentoo.



> In the end, gentoo.git security model is entirely reliant
> on the developer verifying the final product and signing on it.
> Everything else is untrustworthy noise.

How do you verify the output, that is, the final product? This is hard:
reproducible builds, for example, are far from trivial to achieve.

On the other hand, ensuring that the input matches what upstream
provides and expects is far more manageable.


- Flow


OpenPGP_0x8CAC2A9678548E35.asc
Description: OpenPGP public key


OpenPGP_signature
Description: OpenPGP digital signature


Re: [gentoo-dev] Re: EGO_SUM

2023-06-02 Thread Michał Górny
On Fri, 2023-06-02 at 10:17 +0200, Florian Schmaus wrote:
> On 30/05/2023 18.35, Arthur Zamarin wrote:
> > On 30/05/2023 18.52, Florian Schmaus wrote:
> > > To prevent harm from Gentoo, we should reach an agreement that everyone
> > > can live with. To achieve a consensus, and since I can not rule out that
> > > I missed a post that includes specific numbers, please share your ideas
> > > on how EGO_SUM could be reinstated in ::gentoo by replying to this mail.
> > 
> > I still want to ask why in ::gentoo should it be enabled? I'm trying to
> > understand why? 
> 
> In short: Auditability
> 
> Let me try to explain with a simplified example.
> 
> Gentoo's ebuilds contain the instructions to transform source code 
> (input) via a compilation process (transformation) into a binary image 
> (output).
> 
> A pseudo-example ebuild may contain the following
> 
> foo-1.0.ebuild:
> ```
> # Input
> SRC_URI="https://foo-soft.org/foo/1.0/foo-1.0.tar.gz"
> 
> # Transformation
> src_compile() {
>  emake foo
> }
> 
> # Output into imagedir $D
> src_install() {
>  emake DESTDIR="${D}" foo-install
> }
> ```
> 
> A Gentoo developer, Gentoo user, or, anyone can look at the ebuild and 
> immediately tell that it will likely not inject malicious code into the 
> resulting binary image. Furthermore, the only input is from upstream, 
> and while you may not look at every line of source code, you assign a 
> certain trust level to upstream and probably assume that the input is 
> also likely non-malicious.
> 

This reasoning is seriously flawed.  A "typical" EGO_SUM ebuild
contains dozens to hundreds of different packages from dozens of
different authors.  You can't seriously expect anyone to be able to
reasonably establish trust in all of them.

In the end, the gentoo.git security model is entirely reliant
on the developer verifying the final product and signing off on it.
Everything else is untrustworthy noise.

-- 
Best regards,
Michał Górny




Re: [gentoo-dev] Re: EGO_SUM

2023-06-02 Thread Florian Schmaus

Hi Arthur,

thanks for your mail.

On 30/05/2023 18.35, Arthur Zamarin wrote:
> On 30/05/2023 18.52, Florian Schmaus wrote:
> > To prevent harm from Gentoo, we should reach an agreement that everyone
> > can live with. To achieve a consensus, and since I can not rule out that
> > I missed a post that includes specific numbers, please share your ideas
> > on how EGO_SUM could be reinstated in ::gentoo by replying to this mail.
>
> I still want to ask why in ::gentoo should it be enabled? I'm trying to
> understand why?


In short: Auditability

Let me try to explain with a simplified example.

Gentoo's ebuilds contain the instructions to transform source code 
(input) via a compilation process (transformation) into a binary image 
(output).


A pseudo-example ebuild may contain the following

foo-1.0.ebuild:
```
# Input
SRC_URI="https://foo-soft.org/foo/1.0/foo-1.0.tar.gz"

# Transformation
src_compile() {
emake foo
}

# Output into imagedir $D
src_install() {
emake DESTDIR="${D}" foo-install
}
```

A Gentoo developer, Gentoo user, or anyone else can look at the ebuild and 
immediately tell that it will likely not inject malicious code into the 
resulting binary image. Furthermore, the only input is from upstream, 
and while you may not look at every line of source code, you assign a 
certain trust level to upstream and probably assume that the input is 
also likely non-malicious.


That changes fundamentally with dependency tarballs. Now you have

foo-1.0.ebuild:
```
# Input
SRC_URI="
https://foo-soft.org/foo/1.0/foo-1.0.tar.gz
https://some-random.dude/on/the/internet/foo-1.0-deps.tar.gz
"

# Transformation
src_compile() {
emake foo
}

# Output into imagedir $D
src_install() {
emake DESTDIR="${D}" foo-install
}
```

Now you need to look into foo-1.0-deps.tar.gz if you want to keep the 
level of trust as before. And here, "look into foo-1.0-deps.tar.gz" 
means to ideally apply the same steps the creator of the tarball 
supposedly did and compare your foo-1.0-deps.tar.gz tarball with the one 
from the ebuild. To make matters worse, you can not simply compare the 
two tarballs bytewise, but you have to compare the archives for 
structural identity.
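
A minimal sketch of such a structural comparison, assuming GNU tar, find, sort, xargs and sha256sum are available. The function names are hypothetical, and the check only covers the contents of regular files (not modes, symlinks or empty directories). It also unpacks the archives, so it should only be run somewhere you are comfortable extracting untrusted tarballs.

```bash
# Print one "<sha256>  <path>" line per regular file in the archive, sorted
# by path, so that member order, timestamps and ownership do not matter.
tarball_content_listing() {
	local tarball="$1" tmpdir
	tmpdir=$(mktemp -d) || return 1
	tar -xf "${tarball}" -C "${tmpdir}" || { rm -rf "${tmpdir}"; return 1; }
	(cd "${tmpdir}" && find . -type f -print0 | LC_ALL=C sort -z | xargs -0 -r sha256sum)
	rm -rf "${tmpdir}"
}

# Succeed if both archives contain the same files with the same contents.
# Returns 2 if either archive could not be unpacked.
tarballs_structurally_equal() {
	local a b
	a=$(tarball_content_listing "$1") || return 2
	b=$(tarball_content_listing "$2") || return 2
	[ "${a}" = "${b}" ]
}
```

For example, `tarballs_structurally_equal foo-1.0-deps.tar.gz my-rebuild-deps.tar.gz` would succeed when a locally regenerated tarball has the same file contents as the one referenced by the ebuild, even if the two archives differ bytewise.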


In the case of ::gentoo, this is especially problematic for 
proxy-maintained packages. See 
https://github.com/gentoo/gentoo/pull/27050 for an actual example.


Assuming that every developer will accurately audit the non-upstream 
inputs that a proxied maintainer provides creates considerable wiggle 
room in a highly security-sensitive matter. And even if we established a 
firm policy, we would still need the tools to verify the non-upstream 
inputs (which we currently do not have). Furthermore, Gentoo lacks 
manpower, not only in the proxy-maint project, and verifying 
non-upstream inputs adds to the effort of maintaining ::gentoo.


Last but not least, this also affects non-proxied packages in ::gentoo.

Even if every one of my fellow Gentoo developers is trustworthy, the 
fact that most ebuilds are easily auditable by simply looking at them is 
a huge advantage. Of course, some ebuilds pull in a lot of third-party 
patches (Xen, for example), which makes it hard to verify those. But not 
having EGO_SUM means that *all Go packages* are immediately more 
challenging to verify because of the non-upstream input that the 
dependency tarball represents, regardless of whether a Gentoo developer 
created the tarball or not.
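
For comparison, a rough sketch of how the pseudo-ebuild could look with EGO_SUM, assuming the `EGO_SUM` array, `go-module_set_globals` and `EGO_SUM_SRC_URI` interface of go-module.eclass. The module path and version are made up for illustration; a real ebuild would carry one pair of entries for every module listed in upstream's go.sum.

```bash
# foo-1.0.ebuild (hypothetical)
EAPI=8
inherit go-module

# Each entry names an upstream module and version, as found in go.sum.
EGO_SUM=(
	"github.com/example/bar v1.2.3"
	"github.com/example/bar v1.2.3/go.mod"
)
go-module_set_globals

# Input: upstream's release tarball plus the module files derived from EGO_SUM.
SRC_URI="https://foo-soft.org/foo/1.0/foo-1.0.tar.gz
	${EGO_SUM_SRC_URI}"
```

Every input is named in the ebuild itself and, as far as I understand the eclass, is fetched via the Go module proxy rather than from a Gentoo-specific tarball.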




> Also please remember the issue of scale. Look at the amount of packages
> under dev-python. There are a lot of tools written in Go.


We currently have around 250 Go packages in ::gentoo, while dev-python/* 
alone contains 1600 packages. So the package counts of the two 
programming languages are not yet comparable. But note that in my 
previous mail I suggested reviewing the EGO_SUM policy once the number 
of Go packages has doubled or in two years (whichever comes first).


- Flow





Re: [gentoo-dev] Re: EGO_SUM

2023-05-31 Thread William Hubbs
On Wed, May 31, 2023 at 08:30:58AM +0200, pascal.jaeger leimstift.de wrote:
> 
> > Arthur Zamarin wrote on 30.05.2023 at 18:35 CEST:
> > 
> > 
> > Currently the best solution *per package* is to speak with upstream, to
> > add a CI workflow which create a source tarball which includes `vendor`
> > dir. This is the best way, and I'm doing that for multiple upstream of
> > some random Go packages in ::gentoo. But I know the disadvantage -
> > requirement to speak with upstream, explain why, and add it to the
> > system. This is best long-run solution, but more hardships.
> > 
> 
> I would like to add to this, that even if upstream is not willing to do this, 
> devs could automate the creation of vendor tarballs using GitHub actions. I 
> only did this for an upstream repositories that are also on GitHub and for 
> projects written in Rust. Initially I did this for complicated Rust projects 
> with several git submodules and submodules of submodules. But with a little 
> tweaking of the GitHub actions I think it would be possible to use it for Go 
> as well.  
> https://wiki.gentoo.org/wiki/User:Schievel/autocreate_rust_sources
> 
> This is additional initial work, but once you set it up, you don't even have 
> the extra work of creating a new EGO_SUM for every package release. Ideally 
> you just have to change the version in the file name of the ebuild to bump a 
> package.
> 
> Security wise I do not see a difference between this and creating the vendor 
> tarball manually and uploading it to GitHub, as many proxy maintainers 
> without devspace do it. 

Can we please avoid vendor tarballs? There are situations, say when a
dependency includes non-Go code, when vendor tarballs do not work.
That is why I went with the dependency tarballs.

I haven't written github actions, but here is the script I use to create
them, partly thanks to Sam for this.

This is stored in my ~/bin directory and I run it from the top level of
a go project which does not have a "vendor" directory.

William
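#!/bin/bash
# Create a dependency tarball from the Go module cache of the current project.

if [[ -z $1 ]]; then
	printf "no tarball name specified\n" >&2
	exit 1
fi

# Download all modules into a project-local, writable module cache.
GOMODCACHE="${PWD}/go-mod" go mod download -modcacherw
# Pack the cache with root ownership, using all cores at maximum compression.
XZ_OPT='-T0 -9' \
	tar --owner 0 --group 0 --posix -acf "${1}-deps.tar.xz" go-mod
rm -fr go-mod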


signature.asc
Description: PGP signature


Re: [gentoo-dev] Re: EGO_SUM

2023-05-31 Thread Arsen Arsenović

Andrew Ammerlaan  writes:

> On 30/05/2023 18:35, Arthur Zamarin wrote:
>> My solution is as such:
>> 1. Undeprecate EGO_SUM in eclass
>> 2. Forbid it's usage in ::gentoo (done by pkgcheck, error level, will
>> fail CI and as such we can see the misuse). Overlays are allowed.
>> 3. Maintainer starts talks with upstreams to add release workflow to
>> create vendored source tarball, in hopes of it succeeding. "Start early,
>> to future profit". I see this flow similar to the "always try to
>> upstream patches".
>> 4. Until upstream adds it, in ::gentoo use vendor tarballs.
>> I also think many devs agree with this solution, but I can't talk for
>> them, so I'll be happy agreeing devs can at least reply shortly their
>> agreement or disagreement.
>
> I fully agree with Arthur

+1

> With regards to proxy-maintained packages: The proxy can generate and upload
> the vendor tarball for the proxied, this is not that much extra work.

This expands the required trust in proxy maintainers, in a way which is
unusually easy to double check.

We can automate generating vendor tarballs (or more).  If implemented
such that tarballs are reproducible, it should be easy to verify by
running the same procedure from a different host and verifying.
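
A hedged sketch of what "reproducible" could mean in practice, assuming GNU tar 1.28 or later (for `--sort`). The function name and the fixed timestamp are illustrative choices, not an existing Gentoo tool. Compression is left out on purpose, since the compressor and its settings would also have to produce identical output on both hosts.

```bash
# Create an uncompressed tarball whose bytes depend only on file names,
# modes and contents: members are sorted, timestamps and ownership are
# normalised, and the pax headers omit access/change times.
make_reproducible_tarball() {
	local output="$1" input_dir="$2"
	tar --sort=name \
		--mtime='@0' \
		--owner=0 --group=0 --numeric-owner \
		--format=posix \
		--pax-option='exthdr.name=%d/PaxHeaders/%f,delete=atime,delete=ctime' \
		-C "$(dirname "${input_dir}")" \
		-cf "${output}" "$(basename "${input_dir}")"
}
```

Two hosts that run `make_reproducible_tarball foo-1.0-deps.tar go-mod` on the same module cache contents should then be able to compare the results with a plain checksum.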

There would still be a slight cost to an initial 'whitelist package'
step or such, but IMO, that's not a very large cost.  (and, also,
possibly some other mechanism could be implemented)

> Best regards,
> Andrew


-- 
Arsen Arsenović




Re: [gentoo-dev] Re: EGO_SUM

2023-05-31 Thread Ryan Qian
Just FYI, here is a working GitHub action that generates vendor tarballs in the 
same repo but on different branches:
https://github.com/bekcpear/gopkg-vendors/blob/main/.github/workflows/make-vendor.yaml
It has been working for a long time.

Sincerely.
Ryan

> On 31 May 2023, at 14:20, Andrew Ammerlaan wrote:
> 
> On 30/05/2023 18:35, Arthur Zamarin wrote:
>> My solution is as such:
>> 1. Undeprecate EGO_SUM in eclass
>> 2. Forbid it's usage in ::gentoo (done by pkgcheck, error level, will
>> fail CI and as such we can see the misuse). Overlays are allowed.
>> 3. Maintainer starts talks with upstreams to add release workflow to
>> create vendored source tarball, in hopes of it succeeding. "Start early,
>> to future profit". I see this flow similar to the "always try to
>> upstream patches".
>> 4. Until upstream adds it, in ::gentoo use vendor tarballs.
>> I also think many devs agree with this solution, but I can't talk for
>> them, so I'll be happy agreeing devs can at least reply shortly their
>> agreement or disagreement.
> 
> I fully agree with Arthur
> 
> With regards to proxy-maintained packages: The proxy can generate and upload 
> the vendor tarball for the proxied, this is not that much extra work.
> 
> Best regards,
> Andrew
> 
> 
> 


Re: [gentoo-dev] Re: EGO_SUM

2023-05-31 Thread pascal.jaeger leimstift.de


> Arthur Zamarin wrote on 30.05.2023 at 18:35 CEST:
> 
> 
> Currently the best solution *per package* is to speak with upstream, to
> add a CI workflow which create a source tarball which includes `vendor`
> dir. This is the best way, and I'm doing that for multiple upstream of
> some random Go packages in ::gentoo. But I know the disadvantage -
> requirement to speak with upstream, explain why, and add it to the
> system. This is best long-run solution, but more hardships.
> 

I would like to add that even if upstream is not willing to do this, 
devs could automate the creation of vendor tarballs using GitHub Actions. So 
far I have only done this for upstream repositories that are also on GitHub 
and for projects written in Rust. Initially I did this for complicated Rust 
projects with several git submodules and submodules of submodules. But with a 
little tweaking of the GitHub Actions workflow I think it would be possible to 
use it for Go as well.
https://wiki.gentoo.org/wiki/User:Schievel/autocreate_rust_sources

This is additional initial work, but once you set it up, you don't even have 
the extra work of creating a new EGO_SUM for every package release. Ideally you 
just have to change the version in the file name of the ebuild to bump a 
package.

Security-wise I do not see a difference between this and creating the vendor 
tarball manually and uploading it to GitHub, as many proxy maintainers without 
devspace do.

Regards
Pascal



Re: [gentoo-dev] Re: EGO_SUM

2023-05-31 Thread Andrew Ammerlaan

On 30/05/2023 18:35, Arthur Zamarin wrote:
> My solution is as such:
> 1. Undeprecate EGO_SUM in eclass
> 2. Forbid it's usage in ::gentoo (done by pkgcheck, error level, will
> fail CI and as such we can see the misuse). Overlays are allowed.
> 3. Maintainer starts talks with upstreams to add release workflow to
> create vendored source tarball, in hopes of it succeeding. "Start early,
> to future profit". I see this flow similar to the "always try to
> upstream patches".
> 4. Until upstream adds it, in ::gentoo use vendor tarballs.
> I also think many devs agree with this solution, but I can't talk for
> them, so I'll be happy agreeing devs can at least reply shortly their
> agreement or disagreement.


I fully agree with Arthur

With regards to proxy-maintained packages: The proxy can generate and 
upload the vendor tarball for the proxied, this is not that much extra 
work.


Best regards,
Andrew




Re: [gentoo-dev] Re: EGO_SUM

2023-05-30 Thread Oskari Pirhonen
On Tue, May 30, 2023 at 21:30:49 +0500, Anna (cybertailor) Vyalkova wrote:
> On 2023-05-30 17:52, Florian Schmaus wrote:
> > To prevent harm from Gentoo, we should reach an agreement that everyone 
> > can live with. To achieve a consensus, and since I can not rule out that 
> > I missed a post that includes specific numbers, please share your ideas 
> > on how EGO_SUM could be reinstated in ::gentoo by replying to this mail.
> 
> Instate a policy to allow EGO_SUM in the gentoo tree:
> 
> 1) from proxied maintainers

I agree that allowing EGO_SUM in ::gentoo at least for proxy maintained
packages would be a good idea. I don't have any Go packages, but I can
see how it could be cumbersome to get a tarball hosted somewhere.

- Oskari




Re: [gentoo-dev] Re: EGO_SUM

2023-05-30 Thread Arthur Zamarin
On 30/05/2023 18.52, Florian Schmaus wrote:
> 
> I am thankful that the council considered my request to vote on the
> topic. However, the council decided not to vote on this in its last
> session and to return the issue to the mailing lists.
> 
> Some see limitations as a necessity when it comes to
> reinstating EGO_SUM. Unfortunately, I could not find specific numbers
> mentioned since June 2022 in the three EGO_SUM threads [1, 2, 3] I am
> aware of.
> 
> To prevent harm from Gentoo, we should reach an agreement that everyone
> can live with. To achieve a consensus, and since I can not rule out that
> I missed a post that includes specific numbers, please share your ideas
> on how EGO_SUM could be reinstated in ::gentoo by replying to this mail.

I still want to ask: why should it be enabled in ::gentoo? I'm trying to
understand. If you are talking about overlays, then I agree that it should
be allowed there, but I don't see any benefit to its existence in
::gentoo. My reason for that difference: the existence of gentoo-devs
with access to ~devspace.

Currently the best solution *per package* is to speak with upstream and
have them add a CI workflow which creates a source tarball that includes the
`vendor` dir. This is the best way, and I'm doing that with the upstreams of
several Go packages in ::gentoo. But I know the disadvantage: you have to
speak with upstream, explain why, and get it added to their release
process. This is the best long-run solution, but it involves more hardship.

> Having EGO_SUM would significantly increase the security of Gentoo's
> users (amongst other benefits).

While technically correct, we return to the same "confidence" issue with
the dev (a dev can add malicious code to an ebuild). Yes, hiding malicious
code inside a vendor tarball is easier, and robbat2 demonstrated that it
works.

How can we solve it? One weird idea I have is to use a vendor tarball
consisting of multiple tarballs, one per package, and to include a hash for
each of them inside the vendor tarball. I think you can compare the hashes
stored in the `go.sum` file in the source code with the ones from the tarball
(this claim needs verification). As a result, I think we can verify it
offline.

> Personally, I do not see that we currently need any form of limitation
> to reinstate EGO_SUM. I substantiated this with data based on a two-year
> history analysis of gentoo.git. The summary is that the
> - size increase of ::gentoo is unproblematic for users
> - additional sync delta of ::gentoo is unproblematic for users
> - higher rate of gentoo.git's increase is unproblematic for developers
> when we reinstate EGO_SUM in ::gentoo.

Why "unproblematic"? Where I live I have quite a high RTT, meaning each
download takes a long time to start before it fetches at good speed. Fetching
a lot of small files is really bad for me (even from a mirror in the same
country, sigh). Big deltas hit the git packs hard and put higher load on
a lot of places.

Thinking of the infra side, I remember stories of the issues when go.pkg was
doing a full `git clone` (not a shallow copy) of the whole gentoo.git
repository. Now imagine we allow the huge and frequent deltas of Go
modules; imagine how fast we would get to a huge full repository. Yes, we now
blacklist this stupid failure of go.pkg, but it might happen with
another service. Full git clones aren't that rare.

Also note that Go packages tend to update frequently (because of all the
bundling and security issues). The fact that you don't see a lot of updates
in ::gentoo is because many of them are maintained by less active developers
(not to offend anyone; it is fine to skip bumps where appropriate, and it is
not my place to criticize!).

Also please remember the issue of scale. Look at the number of packages
under dev-python. There are a lot of tools written in Go.

> Therefore, we could (and IMHO should) simply un-deprecate EGO_SUM.
> However, I would review this decision once the number of Go packages has
> doubled or in two years (whatever comes first).
> 
> Many share the concerns of an EGO_SUM-less world. I know that some seek
> a compromise by reinstating EGO_SUM with some limitations. The ::gentoo
> repository is able to handle packages (at least) up to the range of 2 to
> 1.5 MiB total package-directory size. Therefore I propose a limit in
> that range.

My solution is as follows:

1. Undeprecate EGO_SUM in the eclass.
2. Forbid its usage in ::gentoo (done by pkgcheck, error level, will
fail CI and as such we can see the misuse). Overlays are allowed.
3. The maintainer starts talks with upstream to add a release workflow that
creates a vendored source tarball, in hopes of it succeeding. "Start early,
to future profit". I see this flow as similar to "always try to
upstream patches".
4. Until upstream adds it, use vendor tarballs in ::gentoo.

I also think many devs agree with this solution, but I can't speak for
them, so I'd be happy if devs could at least reply briefly to state their
agreement or disagreement.

> - Flow
> 
> 
> 1: 

Re: [gentoo-dev] Re: EGO_SUM

2023-05-22 Thread Florian Schmaus

On 08/05/2023 14.03, Michał Górny wrote:
> On Mon, 2023-05-08 at 09:53 +0200, Florian Schmaus wrote:
> > Furthermore, both numbers, 256 MiB and 410 MiB, are based on the
> > over-approximation that every EGO_SUM package uses 1.6 MiB, which is
> > almost certainly not the case. The mean package-directory size of a
> > EGO_SUM using package at 2022-02-16 was 280 KiB.
>
> Please extend this analysis to Manifest changes over time, and how they
> are going to impact total gentoo.git size.


Gladly.

The average daily change caused by Manifests of EGO_SUM packages from 
2020-02-16 to 2022-02-16 was at most 80 KiB. (See below for the 
methodology used to obtain this number.)


In other words, a daily syncing user had at most 80 KiB traffic on 
average per day to sync the Manifests of all EGO_SUM that existed on 
2022-02-16.


Even in less developed regions of the world, 80 KiB a day is 
manageable. And this would still be the case if we doubled, quadrupled or 
octupled this number.


I note that this number does not include ebuilds and metadata. However, 
one can easily over-approximate that the additional ebuilds and metadata 
delta, that comes with the observed Manifest changes, is smaller than 
the Manifest changes themselves. Therefore, a pessimistic approximation 
is twice 80 KiB.


But then again, the 80 KiB do not take transport compression into account. 
And, as we have learned, Manifests compress to roughly 50% of their 
original size. So the average EGO_SUM-generated network traffic, 
assuming that it is compressed, remains in the region of a hundred 
kilobytes per day.


We can also use this number to over-approximate the growth rate of 
gentoo.git due to EGO_SUM.


Assume that 120 EGO_SUM packages cause a daily growth rate of 160 KiB, 
that is 2x 80 KiB and the number we have used above. Doubling this 
number would yield the estimated rate of the current number of Go 
packages in ::gentoo. This rate amounts to 320 KiB daily, increasing 
gentoo.git by 114 MiB per year. Please double this number for a bit of 
future safety.
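
The arithmetic behind these figures can be retraced with a few lines of shell. The numbers are taken from the paragraph above; the variable names are only illustrative, and integer division rounds the yearly figure down.

```bash
# Upper bound for the daily Manifest delta of the ~120 EGO_SUM packages.
manifest_kib_per_day=80
# Pessimistic allowance for ebuilds and metadata: 2 x 80 KiB.
with_ebuilds_kib=$(( manifest_kib_per_day * 2 ))
# Roughly twice as many Go packages exist in ::gentoo today.
all_go_packages_kib=$(( with_ebuilds_kib * 2 ))
# Yearly growth in MiB (365 days, 1024 KiB per MiB).
per_year_mib=$(( all_go_packages_kib * 365 / 1024 ))
echo "${all_go_packages_kib} KiB per day, about ${per_year_mib} MiB per year"
```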


In summary, this and the previous analysis find no data-size-based 
arguments against EGO_SUM's usage.


Using EGO_SUM is fine for users and developers. The ::gentoo increase, 
even if it would quadruple the current size, does not entail any issues. 
The expected average daily delta that EGO_SUM would cause today is also 
no threat, even for users with low-bandwidth connections. The size 
increase which EGO_SUM causes to gentoo.git is also within manageable 
bounds. If an ebuild developer has 1-2 gigabytes free on their disk, 
they will not need to buy a larger disk in the coming years if we start 
using EGO_SUM again in ::gentoo.


- Flow


# Appendix: Methodology

We took gentoo.git at 2022-02-16 at the commit 60dc7a03ff2f. From there, 
we created the numstat log (git log --numstat) of each Manifest of every 
EGO_SUM package. We configured the numstat log to go back at most two 
years in time, that is, till 2020-02-16. The numstat log contains the 
changed lines (added/removed) of the Manifest in the target period. An 
awk script calculated the total sum of added and removed lines. Note 
that this treats removed lines equal to added lines, even though the 
removed lines should cause significantly less network traffic. We also 
extracted the date of the oldest commit in the observed period. This 
date was used to calculate the total number of days in the period, which 
accounts for packages that came to life after 2020-02-16 and would 
otherwise skew the analysis towards smaller results.


Dividing the total number of changed lines by the number of days yields 
the average number of lines changed per day per package.


We further determined the worst observed line length in the Manifests of 
EGO_SUM packages, which was 404 bytes.

Summing the average number of lines changed per day over all packages 
yielded 195.58093724672614. Multiplying this number by the maximal 
observed line length of 404 bytes gives 79014.69 bytes per day or, in 
other words, roughly 80 KiB per day.
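
As an illustration of the per-package step only, here is a minimal sketch that is not the analysis code linked below. It assumes numstat lines of the form `<added> TAB <removed> TAB <path>` on standard input; the function name and the example package path are hypothetical.

```bash
# Read `git log --numstat` output on stdin and print the estimated number
# of bytes changed per day. Arguments: days in the period, bytes per line.
# Added and removed lines are counted equally, as in the methodology.
numstat_bytes_per_day() {
	awk -v days="$1" -v line_bytes="$2" '
		NF >= 3 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { lines += $1 + $2 }
		END { printf "%.2f\n", lines / days * line_bytes }
	'
}
```

A possible invocation for a single package would be `git log --since=2020-02-16 --numstat --format= -- dev-go/foo/Manifest | numstat_bytes_per_day 731 404`, where 731 stands for the number of days covered by that package's history.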


The raw and post-processed results of this analysis are available at

https://dev.gentoo.org/~flow/gentoo-tree-analysis-results/2023-05-17T100838-gentoo-at-2022-02-16-60dc7a03ff2f/

The code used to carry out this analysis is available at

https://gitlab.gentoo.org/flow/gentoo-tree-analysis

for everyone to study the code, reproduce the results, and check for 
issues and bugs.


As always, I appreciate any feedback.




Re: [gentoo-dev] Re: EGO_SUM

2023-05-08 Thread Michał Górny
On Mon, 2023-05-08 at 09:53 +0200, Florian Schmaus wrote:
> On 02.05.23 21:45, Sam James wrote:
> > Florian Schmaus  writes:
> > > On 27/04/2023 23.16, Sam James wrote:
> > > > Florian Schmaus  writes:
> > > > > On 26/04/2023 18.12, Matt Turner wrote:
> > > > > > On Wed, Apr 26, 2023 at 11:31 AM Florian Schmaus  
> > > > > > wrote:
> > > > > > > The discussion would be more productive if someone who is 
> > > > > > > supporting the
> > > > > > > EGO_SUM deprecation could rationally summarize the main arguments 
> > > > > > > why we
> > > > > > > deprecated EGO_SUM.
> > > > > > You're requesting the changes. It's on you to read the previous
> > > > > > threads and try to understand. It's not others' responsibilities to
> > > > > > justify the status quo to you, but tl;dr is Manifest files grew to
> > > > > > insane sizes for golang packages with many dependencies, and the
> > > > > > Manifest size is a cost all Gentoo users pay regardless of whether
> > > > > > they use the package.
> > > > > 
> > > > > I am sorry. I did try to understand the reasoning in the previous
> > > > > threads. However, I do not conclude that the "cost" users must pay for
> > > > > EGO_SUM justifies EGO_SUM's deprecation. It is the other way around:
> > > > > EGO_SUM's advantages do not explain its deprecation, even if users
> > > > > have to pay a cost.
> > > > > 
> > > > > You write that the "Manifest sizes grew to insane sizes"?
> > > > > 
> > > > > At which boundary does a package size, the total size of the package's
> > > > > directory, become insane?
> > > > > 
> > > > > Disk space is cheap. Currently, ::gentoo, without metadata, is around
> > > > > 470 MiB. If you add 10 Go packages with a whopping 200 KiB each, then
> > > > > this adds 2 MiB to that. I need someone to explain how this
> > > > > constitutes an issue with disk space. Even if we add 100 Go packages,
> > > > > probably roughly the number of Go packages we have in ::gentoo, then
> > > > > those 20 MiB are not significant. Needless to say that the average
> > > > > size of a Go package is less than the 200 KiB uses in this
> > > > > calculation.
> > > > The numbers you've used here suggest you've missed some of the
> > > > big problematic cases from the past:
> > > > - https://bugs.gentoo.org/833478 (1.1MB manifest)
> > > > - https://bugs.gentoo.org/833477 (1.6MB manifest)
> > > 
> > > Thanks for pointing those bugs out.
> > > 
> > > But please allow me to clarify that I did not miss those "problematic"
> > > cases from the past.
> > 
> > This kind of phrasing is the sort of thing which makes it seem like you
> > don't appreciate/acknowledge others' concerns.
> 
> I am genuinely sorry if my usage of "problematic" made it appear that I 
> do not appreciate the other's concerns. Like most people on this mailing 
> list, I appreciate everyone who cares about Gentoo and raises concerns.
> 
> I do, however, not share the concerns regarding EGO_SUM.
> 
> It is hard to share concerns based on rather abstract reasons—for 
> example, the portrayal of EGO_SUM as unfair.
> 
> It would be easier to share concerns if somebody gave concrete reasons 
> against EGO_SUM. For example, use cases that are no longer possible. Or 
> developers or users who are restricted in their work by EGO_SUM in a 
> relevant way.
> 
> But actual problems that currently speak against the use of EGO_SUM have 
> not surfaced.
> 
> 
> > I said problematic because it was clearly beyond what your worst-case
> > estimates were, i.e. far more than what you were saying would be a
> > large amount for the purposes of calculations.
> 
> Using the term "worst-case", even if I put it in quotes, probably got 
> people on the wrong track. I am sorry for that; my bad. It is, in 
> general, impossible even to approximate the worst-case size-increase of 
> ::gentoo.
> 
> Our best chance is to use historical data to extrapolate into the future.
> 
> My back-of-the-envelope calculation was 256 Go packages, with each 
> having 1 MiB. An analysis of the tree on 2022-02-16, at the commit 
> right before Minikube and k3s were cleaned, showed that only five 
> packages out of 120 had larger package-directory sizes than one MiB.
> 
> 256 Go-packages is roughly the number of Go-packages we have right now. 
> Assuming they all have a package-directory size of 1.6 MiB, the most 
> extensive EGO_SUM package the analysis yielded so far, we end up with 
> 410 MiB.
> 
> The point you criticize was that a system able to handle the current 
> size of ::gentoo would also be able to manage an additional 256 MiB. The 
> point still stands if we exchange the 256 MiB with 410 MiB.
> 
> Furthermore, both numbers, 256 MiB and 410 MiB, are based on the 
> over-approximation that every EGO_SUM package uses 1.6 MiB, which is 
> almost certainly not the case. The mean package-directory size of a 
> EGO_SUM using package at 2022-02-16 was 280 KiB.
> 

Please extend this analysis to Manifest changes over time, and how they
are going to impact 

Re: [gentoo-dev] Re: EGO_SUM

2023-05-08 Thread Florian Schmaus

On 02.05.23 21:45, Sam James wrote:

Florian Schmaus  writes:

On 27/04/2023 23.16, Sam James wrote:

Florian Schmaus  writes:

On 26/04/2023 18.12, Matt Turner wrote:

On Wed, Apr 26, 2023 at 11:31 AM Florian Schmaus  wrote:

The discussion would be more productive if someone who is supporting the
EGO_SUM deprecation could rationally summarize the main arguments why we
deprecated EGO_SUM.

You're requesting the changes. It's on you to read the previous
threads and try to understand. It's not others' responsibilities to
justify the status quo to you, but tl;dr is Manifest files grew to
insane sizes for golang packages with many dependencies, and the
Manifest size is a cost all Gentoo users pay regardless of whether
they use the package.


I am sorry. I did try to understand the reasoning in the previous
threads. However, I do not conclude that the "cost" users must pay for
EGO_SUM justifies EGO_SUM's deprecation. It is the other way around:
EGO_SUM's advantages do not explain its deprecation, even if users
have to pay a cost.

You write that the "Manifest sizes grew to insane sizes"?

At which boundary does a package size, the total size of the package's
directory, become insane?

Disk space is cheap. Currently, ::gentoo, without metadata, is around
470 MiB. If you add 10 Go packages with a whopping 200 KiB each, then
this adds 2 MiB to that. I need someone to explain how this
constitutes an issue with disk space. Even if we add 100 Go packages,
probably roughly the number of Go packages we have in ::gentoo, then
those 20 MiB are not significant. Needless to say that the average
size of a Go package is less than the 200 KiB used in this
calculation.

The numbers you've used here suggest you've missed some of the
big problematic cases from the past:
- https://bugs.gentoo.org/833478 (1.1MB manifest)
- https://bugs.gentoo.org/833477 (1.6MB manifest)


Thanks for pointing those bugs out.

But please allow me to clarify that I did not miss those "problematic"
cases from the past.


This kind of phrasing is the sort of thing which makes it seem like you
don't appreciate/acknowledge others' concerns.


I am genuinely sorry if my usage of "problematic" made it appear that I
do not appreciate others' concerns. Like most people on this mailing
list, I appreciate everyone who cares about Gentoo and raises concerns.


I do, however, not share the concerns regarding EGO_SUM.

It is hard to share concerns based on rather abstract reasons—for 
example, the portrayal of EGO_SUM as unfair.


It would be easier to share concerns if somebody gave concrete reasons 
against EGO_SUM. For example, use cases that are no longer possible. Or 
developers or users who are restricted in their work by EGO_SUM in a 
relevant way.


But actual problems that currently speak against the use of EGO_SUM have 
not surfaced.




I said problematic because it was clearly beyond what your worst-case
estimates were, i.e. far more than what you were saying would be a
large amount for the purposes of calculations.


Using the term "worst-case", even if I put it in quotes, probably got 
people on the wrong track. I am sorry for that; my bad. It is, in 
general, impossible even to approximate the worst-case size-increase of 
::gentoo.


Our best chance is to use historical data to extrapolate into the future.

My back-of-the-envelope calculation was 256 Go-packages, with each
having 1 MiB. An analysis of the tree on 2022-02-16, at the commit
right before Minikube and k3s were cleaned, showed that only five
packages out of 120 had larger package-directory sizes than one MiB.


256 Go-packages is roughly the number of Go-packages we have right now. 
Assuming they all have a package-directory size of 1.6 MiB, the most 
extensive EGO_SUM package the analysis yielded so far, we end up with 
410 MiB.


The point you criticize was that a system able to handle the current 
size of ::gentoo would also be able to manage an additional 256 MiB. The 
point still stands if we exchange the 256 MiB with 410 MiB.


Furthermore, both numbers, 256 MiB and 410 MiB, are based on the
over-approximation that every EGO_SUM package uses 1 MiB or 1.6 MiB,
respectively, which is almost certainly not the case. The mean
package-directory size of an EGO_SUM-using package on 2022-02-16 was
280 KiB.
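
The estimates in this message can be re-derived with a few lines of Python. This is only a sketch of the arithmetic: the package counts and per-package sizes are the ones quoted in this thread, and the helper name `added_mib` is made up for illustration.

```python
def added_mib(package_count, per_package_kib):
    """Total added size in MiB for package_count packages of per_package_kib KiB each."""
    return package_count * per_package_kib / 1024


if __name__ == "__main__":
    # (label, number of packages, size per package in KiB); figures taken from the thread
    scenarios = [
        ("10 packages at 200 KiB", 10, 200),
        ("100 packages at 200 KiB", 100, 200),
        ("256 packages at 1 MiB", 256, 1024),
        ("256 packages at 1.6 MiB", 256, 1.6 * 1024),
        ("256 packages at the 280 KiB mean", 256, 280),
    ]
    for label, count, kib in scenarios:
        print(f"{label}: {added_mib(count, kib):.1f} MiB")
```

Note that this only models the size of a checkout at one point in time. It says nothing about how Manifest revisions accumulate in Git history, which is a separate concern raised elsewhere in the thread.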


- Flow




Re: [gentoo-dev] Re: EGO_SUM

2023-05-08 Thread Florian Schmaus

On 02.05.23 22:04, Matt Turner wrote:

On Tue, May 2, 2023 at 3:33 PM Florian Schmaus  wrote:

I performed a tree-wide analysis regarding EGO_SUM and IIRC published
the results in my previous post about EGO_SUM last year.
https://dev.gentoo.org/~flow/ego_sum-2022-01-01.txt shows the analysis
results for ::gentoo as of 2022-01-01 (I've recently updated the file to
contain the Manifest-size too).

Minikube (#833478) and k3s (#833477) appear there, too, with
package-directory sizes over one MiB. However, those packages are among
the top five of packages using EGO_SUM by package-directory size.

They do not represent the average Go package.

The mean size of a Manifest of a package using EGO_SUM was 186 KiB, and
the median was even lower at 84 KiB. Only a tiny percentage of packages,
below 5%, had a Manifest-size above one MiB.


It sounds like you've identified a compelling rationale for a Manifest
size limit.


Please feel free and encouraged to elaborate on your thoughts about 
Manifest size limitation.


- Flow



Re: [gentoo-dev] Re: EGO_SUM

2023-05-02 Thread Matt Turner
On Tue, May 2, 2023 at 3:33 PM Florian Schmaus  wrote:
> I performed a tree-wide analysis regarding EGO_SUM and IIRC published
> the results in my previous post about EGO_SUM last year.
> https://dev.gentoo.org/~flow/ego_sum-2022-01-01.txt shows the analysis
> results for ::gentoo as of 2022-01-01 (I've recently updated the file to
> contain the Manifest-size too).
>
> Minikube (#833478) and k3s (#833477) appear there, too, with
> package-directory sizes over one MiB. However, those packages are among
> the top five of packages using EGO_SUM by package-directory size.
>
> They do not represent the average Go package.
>
> The mean size of a Manifest of a package using EGO_SUM was 186 KiB, and
> the median was even lower at 84 KiB. Only a tiny percentage of packages,
> below 5%, had a Manifest-size above one MiB.

It sounds like you've identified a compelling rationale for a Manifest
size limit.



Re: [gentoo-dev] Re: EGO_SUM

2023-05-02 Thread Sam James

Florian Schmaus  writes:

> On 27/04/2023 23.16, Sam James wrote:
>> Florian Schmaus  writes:
>> 
>>> On 26/04/2023 18.12, Matt Turner wrote:
>>>> On Wed, Apr 26, 2023 at 11:31 AM Florian Schmaus  wrote:
>>>>> The discussion would be more productive if someone who is supporting the
>>>>> EGO_SUM deprecation could rationally summarize the main arguments why we
>>>>> deprecated EGO_SUM.
>>>> You're requesting the changes. It's on you to read the previous
>>>> threads and try to understand. It's not others' responsibilities to
>>>> justify the status quo to you, but tl;dr is Manifest files grew to
>>>> insane sizes for golang packages with many dependencies, and the
>>>> Manifest size is a cost all Gentoo users pay regardless of whether
>>>> they use the package.
>>>
>>> I am sorry. I did try to understand the reasoning in the previous
>>> threads. However, I do not conclude that the "cost" users must pay for
>>> EGO_SUM justifies EGO_SUM's deprecation. It is the other way around:
>>> EGO_SUM's advantages do not explain its deprecation, even if users
>>> have to pay a cost.
>>>
>>> You write that the "Manifest sizes grew to insane sizes"?
>>>
>>> At which boundary does a package size, the total size of the package's
>>> directory, become insane?
>>>
>>> Disk space is cheap. Currently, ::gentoo, without metadata, is around
>>> 470 MiB. If you add 10 Go packages with a whopping 200 KiB each, then
>>> this adds 2 MiB to that. I need someone to explain how this
>>> constitutes an issue with disk space. Even if we add 100 Go packages,
>>> probably roughly the number of Go packages we have in ::gentoo, then
>>> those 20 MiB are not significant. Needless to say that the average
>>> size of a Go package is less than the 200 KiB used in this
>>> calculation.
>> The numbers you've used here suggest you've missed some of the
>> big problematic cases from the past:
>> - https://bugs.gentoo.org/833478 (1.1MB manifest)
>> - https://bugs.gentoo.org/833477 (1.6MB manifest)
>
> Thanks for pointing those bugs out.
>
> But please allow me to clarify that I did not miss those "problematic"
> cases from the past.

This kind of phrasing is the sort of thing which makes it seem like you
don't appreciate/acknowledge others' concerns.

I said problematic because it was clearly beyond what your worst-case
estimates were, i.e. far more than what you were saying would be a
large amount for the purposes of calculations.





Re: [gentoo-dev] Re: EGO_SUM

2023-05-02 Thread Sam James

Florian Schmaus  writes:

> On 28/04/2023 16.34, Michał Górny wrote:
>> On Fri, 2023-04-28 at 08:59 +0200, Florian Schmaus wrote:
>>> And I never said that I believe in representing the majority's opinion.
>>> That said, I prefer to have this voted on by an all-developer vote rather
>>> than a council vote. Then we would know what the majority voted for. Is that
>>> possible?
>> There's the General Resolution but it's supposed to be used only to
>> override Council decisions, so you should go with a Council vote first.
>
> Could we temporarily re-purpose Gentoo's election infrastructure to
> hold an all-developer opinion poll?
>
> I imagine a poll asking for opinions, nothing binding. Furthermore,
> since Gentoo's voting infrastructure uses the Condorcet method, we
> could have multiple options.
>

You still haven't addressed all concerns on this ML, including my
last email, so I'd say this is a bit premature.





Re: [gentoo-dev] Re: EGO_SUM

2023-05-02 Thread Florian Schmaus

On 28/04/2023 16.34, Michał Górny wrote:

On Fri, 2023-04-28 at 08:59 +0200, Florian Schmaus wrote:

And I never said that I believe in representing the majority's opinion.
That said, I prefer to have this voted on by an all-developer vote rather
than a council vote. Then we would know what the majority voted for. Is that
possible?


There's the General Resolution but it's supposed to be used only to
override Council decisions, so you should go with a Council vote first.


Could we temporarily re-purpose Gentoo's election infrastructure to hold 
an all-developer opinion poll?


I imagine a poll asking for opinions, nothing binding. Furthermore, 
since Gentoo's voting infrastructure uses the Condorcet method, we could 
have multiple options.


A poll-preceding phase where voters can submit options for the poll 
would help to take everyone's position into account.


And then, performing a poll where everyone can rank the available 
options should allow us to get a pretty good idea about what the 
community of Gentoo developers thinks about this topic.


@gentoo-elections: would you be willing to assist in such a venture?

- Flow




Re: [gentoo-dev] Re: EGO_SUM

2023-05-02 Thread Florian Schmaus

On 27/04/2023 23.16, Sam James wrote:

Florian Schmaus  writes:


On 26/04/2023 18.12, Matt Turner wrote:

On Wed, Apr 26, 2023 at 11:31 AM Florian Schmaus  wrote:

The discussion would be more productive if someone who is supporting the
EGO_SUM deprecation could rationally summarize the main arguments why we
deprecated EGO_SUM.

You're requesting the changes. It's on you to read the previous
threads and try to understand. It's not others' responsibilities to
justify the status quo to you, but tl;dr is Manifest files grew to
insane sizes for golang packages with many dependencies, and the
Manifest size is a cost all Gentoo users pay regardless of whether
they use the package.


I am sorry. I did try to understand the reasoning in the previous
threads. However, I do not conclude that the "cost" users must pay for
EGO_SUM justifies EGO_SUM's deprecation. It is the other way around:
EGO_SUM's advantages do not explain its deprecation, even if users
have to pay a cost.

You write that the "Manifest sizes grew to insane sizes"?

At which boundary does a package size, the total size of the package's
directory, become insane?

Disk space is cheap. Currently, ::gentoo, without metadata, is around
470 MiB. If you add 10 Go packages with a whopping 200 KiB each, then
this adds 2 MiB to that. I need someone to explain how this
constitutes an issue with disk space. Even if we add 100 Go packages,
probably roughly the number of Go packages we have in ::gentoo, then
those 20 MiB are not significant. Needless to say that the average
size of a Go package is less than the 200 KiB used in this
calculation.


The numbers you've used here suggest you've missed some of the
big problematic cases from the past:
- https://bugs.gentoo.org/833478 (1.1MB manifest)
- https://bugs.gentoo.org/833477 (1.6MB manifest)


Thanks for pointing those bugs out.

But please allow me to clarify that I did not miss those "problematic" 
cases from the past.


I performed a tree-wide analysis regarding EGO_SUM and IIRC published 
the results in my previous post about EGO_SUM last year.
https://dev.gentoo.org/~flow/ego_sum-2022-01-01.txt shows the analysis 
results for ::gentoo as of 2022-01-01 (I've recently updated the file to 
contain the Manifest-size too).


Minikube (#833478) and k3s (#833477) appear there, too, with 
package-directory sizes over one MiB. However, those packages are among
the top five of packages using EGO_SUM by package-directory size.


They do not represent the average Go package.

The mean size of a Manifest of a package using EGO_SUM was 186 KiB, and 
the median was even lower at 84 KiB. Only a tiny percentage of packages, 
below 5%, had a Manifest-size above one MiB.
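
For readers who want to reproduce this kind of summary on their own checkout, the following is a minimal sketch and not the script behind the published analysis. It assumes a `category/package/Manifest` layout and treats a package as "using EGO_SUM" if any of its ebuilds contains the string `EGO_SUM`, which is a rough heuristic. The function names are hypothetical.

```python
from pathlib import Path
from statistics import mean, median

ONE_MIB = 1024 * 1024


def ego_sum_manifest_sizes(repo_root):
    """Return Manifest sizes (bytes) of packages whose ebuilds mention EGO_SUM."""
    sizes = []
    for manifest in sorted(Path(repo_root).glob("*/*/Manifest")):
        ebuilds = manifest.parent.glob("*.ebuild")
        # Heuristic: a plain substring match, which also hits comments.
        if any("EGO_SUM" in e.read_text(errors="ignore") for e in ebuilds):
            sizes.append(manifest.stat().st_size)
    return sizes


def summarize(sizes, threshold=ONE_MIB):
    """Mean, median and the share of Manifests larger than threshold bytes."""
    if not sizes:
        raise ValueError("no Manifest sizes given")
    over = sum(1 for s in sizes if s > threshold)
    return {
        "count": len(sizes),
        "mean": mean(sizes),
        "median": median(sizes),
        "share_over_threshold": over / len(sizes),
    }
```

Calling `summarize(ego_sum_manifest_sizes("/var/db/repos/gentoo"))` would print nothing by itself but returns the figures as a dictionary; the repository path is an assumption and differs between systems.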


It appears that some feel like the EGO_SUM size consumption is wasteful.

I am always sympathetic toward optimization efforts that save resources. 
Be it bytes-at-rest, transferred bytes, or CPU cycles. Often those can 
make a difference, or at least, they are evidence of engineering skills.


But even if all Go-packages using EGO_SUM had one-MiB-sized Manifests, 
it is unclear what the actual issue is.


Both bugs ask for action without describing the negative impact of those 
larger than 1 MiB Manifests. For example, there is no mention of someone 
being negatively affected by those bugs nor any observed reduction in 
functionality.


- Flow




Re: [gentoo-dev] Re: EGO_SUM

2023-04-29 Thread Robin H. Johnson
On Fri, Apr 28, 2023 at 08:59:29AM +0200, Florian Schmaus wrote:
> On 27/04/2023 14.54, Michał Górny wrote:
> > On Thu, 2023-04-27 at 09:58 +0200, Florian Schmaus wrote:
> >> Disk space is cheap.
> > 
> > No, it's not.  Gentoo supports more hardware than your average PC with
> > beefy hard drive and/or possibility of installing one.  Let's not forget
> > that you need a ::gentoo checkout even on a system running purely
> > on binary packages.
> 
> You are right. Gentoo supports a broad range of hardware in many 
> dimensions, e.g., architecture, release date, and composition.
> 
> You seem to suggest that there are Gentoo systems that can not handle the
> additional disk space consumption of EGO_SUM Go-packages?
> 
> I can not imagine systems that are able to deal with the ~500 MiB 
> ::gentoo repository, but would break if the same repository would 
> contain 100 additional Go-packages with 200 KiB each.
> 
> Even under a "worst-case" assumption, where we would have 256 
> Go-packages with each having a 1 MiB package-directory size, any system 
> that can handle the current state of ::gentoo should be able to take the 
> additional 256 MiB (+ metadata).
This email ended up more rambling than I intended, but I wanted to get the data
out there, and enable us to look deeper at the problems and potential impacts
of the solutions.

Before the ideas and data I wanted to note the semi-conceptual ways to package
new things that have many dependency artifacts (package or distfile).

Distfile-heavy packages:
------------------------
A package declares many distfile dependencies, but very few package
dependencies. The Manifest files in this case suffer a lot of
duplication - but the growth is mostly limited to ::gentoo (or
overlays).

Any change to a package (dropping a version, adding a version, adding a
remotely-fetched patch) leads to a slightly different Manifest file, and
while delta compression will reduce the growth factor, it's still large.

Dependency-heavy packages:
--------------------------
A package declares many package dependencies, with the distfile growth
distributed over MANY packages. Major downside here is that
build-depends consume a lot more space & inodes to install all the
depends that are used for the ebuild, esp. when a given distfile might
be used for only one package. Debian/Ubuntu use this approach, and it
shows: want to build a complex Go-based package? You might have to
explicitly package 70+ dependencies to get something you want packaged:
https://salsa.debian.org/go-team/packages/consul/-/blob/debian/sid/debian/control#L10-89
A quick back-of-napkin calculation over the Debian golang dep packages,
as of 22.04 LTS, shows: ~30% are a dep for only one package; a further
30% are a dep for only 2 packages.


With the above in mind, we see that it's not just the size of the Manifest, but
the combinatorial problem of Manifest revisions, with the saving role of Git's
delta compression.

I pulled a Git listing of every Manifest blob that was larger than 64KiB
in Git history (excluding the historical conversion), and then worked from
those: 2718 blobs in total, taking up ~516MiB, 1600056 DIST entries,
for 166726 distinct distfiles.
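
A listing like this can be approximated as follows. This is a sketch, not the commands actually used for the numbers above: it pipes `git rev-list --objects --all` into `git cat-file --batch-check` and keeps blobs whose path ends in `Manifest` and whose size exceeds 64 KiB. The function names are hypothetical, and the exclusion of the historical conversion is not modelled.

```python
import subprocess

THRESHOLD = 64 * 1024  # 64 KiB, as in the text above


def list_objects(repo_dir):
    """Return lines of '<sha> <type> <size> <path>' for all reachable objects."""
    rev_list = subprocess.run(
        ["git", "-C", repo_dir, "rev-list", "--objects", "--all"],
        check=True, capture_output=True, text=True,
    ).stdout
    batch = subprocess.run(
        ["git", "-C", repo_dir, "cat-file",
         "--batch-check=%(objectname) %(objecttype) %(objectsize) %(rest)"],
        input=rev_list, check=True, capture_output=True, text=True,
    ).stdout
    return batch.splitlines()


def large_manifest_blobs(lines, threshold=THRESHOLD):
    """Map blob sha -> (size, path) for Manifest blobs larger than threshold."""
    found = {}
    for line in lines:
        parts = line.rstrip("\n").split(" ", 3)
        if len(parts) < 4:
            continue
        sha, objtype, size, path = parts
        if objtype != "blob" or path.rsplit("/", 1)[-1] != "Manifest":
            continue
        if int(size) > threshold:
            found[sha] = (int(size), path)  # keyed by sha, so each blob counts once
    return found
```

On a full gentoo.git clone the intermediate output is large, since it is held in memory here for simplicity; a streaming variant would be kinder to small machines.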

I tried to break those distfiles down, based on filename patterns, or where
they occurred (sorted by number of distfiles here):
  76075 dist-tex (all in the tex category)
  33949 dist-mozilla (firefox*, thunderbird*)
  19314 dist-office 
  17802 dist-golang (*%2F@v%2F* files; 10160 .mod, 7642 .zip)
  10478 dist-rust (*.crate files)
   3630 dist-other
   1325 dist-jar-pom (*.jar, *.pom)
   1020 dist-tablebase-syzygy (distfiles for a specific package)
    981 dist-kde (kde manifests that met the threshold)
    980 dist-kernel-and-genpatches
    749 dist-tessdata (again specific packages)
    424 dist-bash (specific packages)
 166727 == total
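
The filename-based part of this breakdown can be sketched as below. Only the patterns spelled out in the list are used (`*%2F@v%2F*`, `*.crate`, `*.jar`, `*.pom`); classes that depend on the category or on specific packages are not modelled and fall into `dist-other` here. The names `PATTERNS`, `classify` and `count_classes` are made up for illustration.

```python
from collections import Counter
from fnmatch import fnmatchcase

# (class name, filename patterns); first match wins
PATTERNS = [
    ("dist-golang", ["*%2F@v%2F*"]),
    ("dist-rust", ["*.crate"]),
    ("dist-jar-pom", ["*.jar", "*.pom"]),
]


def classify(distfile):
    """Return the class of a distfile name, or 'dist-other' if nothing matches."""
    for name, patterns in PATTERNS:
        if any(fnmatchcase(distfile, p) for p in patterns):
            return name
    return "dist-other"


def count_classes(distfiles):
    """Count distinct distfile names per class."""
    return Counter(classify(d) for d in set(distfiles))
```

As the text notes for Rust and Golang, such pattern matching only gives lower bounds: a GitHub snapshot tarball of a crate does not end in `.crate` and ends up in `dist-other`.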

The Rust & Golang counts *are* lower bounds, because it's not trivial to
take into account changes in packaging. However, the upper bound 
E.g. this distfile isn't immediately classifiable as Rust:
d3d12-rs-a990c93ec64eeab78f2292763d0715da9dba1d59.gh.tar.gz
To assume a worst case, assign the dist-other to the category of your choice.

Ecosystems that are distfile-heavy, in order of Manifest sizes: TeX, Golang, Rust
Packages that are distfile-heavy: LibreOffice/OpenOffice, Firefox, Thunderbird

TeX has only a few packages, but the MOST distfiles.
dev-texlive/texlive-latexextra/Manifest peaked over 6MB with 15480 entries. For
all of Gentoo git history however, there have only been 19 revisions of that
Manifest. For all TeX packages, 286 revisions of Manifests over 37 packages.
Those 286 Manifest revisions clock in at ~94MB together before compression.

The Mozilla packages have the next most distfiles:
4 packages, 768 manifest revisions, but the largest single Manifest was only
285519 bytes. ~88MB for all the manifest revision bytes together.

The office packages (app-office/libreoffice-l10n & app-office/openoffice-bin)
are similar to 

Re: [gentoo-dev] Re: EGO_SUM

2023-04-28 Thread Michał Górny
On Fri, 2023-04-28 at 08:59 +0200, Florian Schmaus wrote:
> On 27/04/2023 14.54, Michał Górny wrote:
> > On Thu, 2023-04-27 at 09:58 +0200, Florian Schmaus wrote:
> > > Disk space is cheap.
> > 
> > No, it's not.  Gentoo supports more hardware than your average PC with
> > beefy hard drive and/or possibility of installing one.  Let's not forget
> > that you need a ::gentoo checkout even on a system running purely
> > on binary packages.
> 
> You are right. Gentoo supports a broad range of hardware in many 
> dimensions, e.g., architecture, release date, and composition.
> 
> You seem to suggest that there are Gentoo systems that can not handle the
> additional disk space consumption of EGO_SUM Go-packages?
> 
> I can not imagine systems that are able to deal with the ~500 MiB 
> ::gentoo repository, but would break if the same repository would 
> contain 100 additional Go-packages with 200 KiB each.
> 
> Even under a "worst-case" assumption, where we would have 256 
> Go-packages with each having a 1 MiB package-directory size, any system 
> that can handle the current state of ::gentoo should be able to take the 
> additional 256 MiB (+ metadata).

That's the slippery slope of exponential growth.  If every developer
thought "oh, worst case it'll grow only 10%"...

There's roughly 19k packages in Gentoo.  Go packages constitute only
a small number of them, yet maintainers of these packages seem to assume
it's fine if they take up a significant portion of disk space.  That's
not fair at all.

In fact, I'm pretty sure I ground some numbers in the previous thread.

> > 
> I am only pursuing the modest request to legitimize any decision 
> regarding EGO_SUM by a democratic vote.
> 
> As far as I can tell, there was never a democratic vote regarding 
> EGO_SUM. But please correct me if I am wrong.

Since when are eclass design issues "legitimized" by "a democratic
vote"?  In the best case, they are handled via rough consensus.
In the worst, a single person can't stand a decision and bothers
everyone until they let them have their way.

Open source is not a democracy, it's volunteer effort.  People dedicate
their free time and do their best.  If you want something done, you have
to either do it yourself (and do it right!) or convince someone to do
it.  You don't overturn maintainers by "democratic votes", that's
actually how you shatter open source community and make volunteers stop
contributing.

Believe me, I've made enough bad decisions to know that now.

> And I never said that I believe in representing the majority's opinion. 
> That said, I prefer to have this voted on by an all-developer vote rather
> than a council vote. Then we would know what the majority voted for. Is that
> possible?

There's the General Resolution but it's supposed to be used only to
override Council decisions, so you should go with a Council vote first.

I don't believe this is a hill worth dying on but if you insist...
*shrug*.  I just wish you'd actually listen to people and put some real
effort to reach a compromise/consensus rather than pushing your narrow
solution through with no regard for consequences.

-- 
Best regards,
Michał Górny




Re: [gentoo-dev] Re: EGO_SUM

2023-04-28 Thread Florian Schmaus

On 27/04/2023 14.54, Michał Górny wrote:

On Thu, 2023-04-27 at 09:58 +0200, Florian Schmaus wrote:

Disk space is cheap.


No, it's not.  Gentoo supports more hardware than your average PC with
beefy hard drive and/or possibility of installing one.  Let's not forget
that you need a ::gentoo checkout even on a system running purely
on binary packages.


You are right. Gentoo supports a broad range of hardware in many 
dimensions, e.g., architecture, release date, and composition.


You seem to suggest that there are Gentoo systems that can not handle the
additional disk space consumption of EGO_SUM Go-packages?


I can not imagine systems that are able to deal with the ~500 MiB 
::gentoo repository, but would break if the same repository would 
contain 100 additional Go-packages with 200 KiB each.


Even under a "worst-case" assumption, where we would have 256 
Go-packages with each having a 1 MiB package-directory size, any system 
that can handle the current state of ::gentoo should be able to take the 
additional 256 MiB (+ metadata).




Network traffic, while also being cheap, may be more of an issue.


Again, you're making an assumption based on living in a well-developed area
and discriminating against users who have shoddy Internet connectivity.

That said, this all was discussed in the past.  I really wish you would
humble down and try to find a solution that would work for everyone
instead of showing arrogance and lack of concern for users outside your
"majority" view of Gentoo.


I am sorry. I will work on my humbleness.

I am only pursuing the modest request to legitimize any decision 
regarding EGO_SUM by a democratic vote.


As far as I can tell, there was never a democratic vote regarding 
EGO_SUM. But please correct me if I am wrong.


And I never said that I believe in representing the majority's opinion. 
That said, I prefer to have this voted on by an all-developer vote rather
than a council vote. Then we would know what the majority voted for. Is that
possible?


- Flow




Re: [gentoo-dev] Re: EGO_SUM

2023-04-27 Thread Sam James

Michał Górny  writes:

> On Fri, 2023-04-28 at 01:38 +0100, Sam James wrote:
>> Pascal Jäger  writes:
>> 
>> > Maybe I’m getting this wrong, but didn’t  we switch to shallow
>> > checkouts for the systems repository? I remember it was a major
>> > outcry on the mailing list. So at least for end users git keeps no
>> > history and our repository history should not impact clone size of a
>> > shallow copy, should it? 
>> > 
>> 
>> (Try to avoid top-posting if you can, reply after the message you're
>> replying to.)
>> 
>> rsync copies of the tree aren't affected by this, nor are full
>> git clones for development.
>> 
>
> Err, but full gentoo.git clones are definitely affected!  After all,
> that's where huge ebuilds and their Manifests land first.

I meant they're not affected by any changes to Portage's new default
of shallow clones, i.e. it doesn't help the problem for them.





Re: [gentoo-dev] Re: EGO_SUM

2023-04-27 Thread Michał Górny
On Fri, 2023-04-28 at 01:38 +0100, Sam James wrote:
> Pascal Jäger  writes:
> 
> > Maybe I’m getting this wrong, but didn’t  we switch to shallow
> > checkouts for the systems repository? I remember it was a major
> > outcry on the mailing list. So at least for end users git keeps no
> > history and our repository history should not impact clone size of a
> > shallow copy, should it? 
> > 
> 
> (Try to avoid top-posting if you can, reply after the message you're
> replying to.)
> 
> rsync copies of the tree aren't affected by this, nor are full
> git clones for development.
> 

Err, but full gentoo.git clones are definitely affected!  After all,
that's where huge ebuilds and their Manifests land first.

-- 
Best regards,
Michał Górny




Re: [gentoo-dev] Re: EGO_SUM

2023-04-27 Thread Sam James

Pascal Jäger  writes:

> Maybe I’m getting this wrong, but didn’t  we switch to shallow
> checkouts for the systems repository? I remember it was a major
> outcry on the mailing list. So at least for end users git keeps no
> history and our repository history should not impact clone size of a
> shallow copy, should it? 
>

(Try to avoid top-posting if you can, reply after the message you're
replying to.)

rsync copies of the tree aren't affected by this, nor are full
git clones for development.

>
>
> On Thursday, Apr. 27, 2023 at 14:54, Michał Górny <mgo...@gentoo.org> wrote:
> On Thu, 2023-04-27 at 09:58 +0200, Florian Schmaus wrote:
> > Disk space is cheap.
>
> No, it's not. Gentoo supports more hardware than your average PC with
> beefy hard drive and/or possibility of installing one. Let's not forget
> that you need a ::gentoo checkout even on a system running purely
> on binary packages.
>
> Let's not forget that git keeps all history, so every bump of a Go
> package with large Manifest has a permanent negative impact on clone
> size. A few version bumps of Go packages can easily outweigh complete
> history of hundreds of other packages.
>
> > Network traffic, while also being cheap, may be more of an issue.
>
> Again, you're making an assumption based on living in a well-developed area
> and discriminating against users who have shoddy Internet connectivity.
>
> That said, this all was discussed in the past. I really wish you would
> humble down and try to find a solution that would work for everyone
> instead of showing arrogance and lack of concern for users outside your
> "majority" view of Gentoo.
>
> --
> Best regards,
> Michał Górny
>
>





Re: [gentoo-dev] Re: EGO_SUM

2023-04-27 Thread Pascal Jäger
Maybe I’m getting this wrong, but didn’t we switch to shallow checkouts for the 
systems repository? I remember it was a major outcry on the mailing list. So at 
least for end users git keeps no history and our repository history should not 
impact clone size of a shallow copy, should it?

> On Thursday, Apr. 27, 2023 at 14:54, Michał Górny <mgo...@gentoo.org> wrote:
> On Thu, 2023-04-27 at 09:58 +0200, Florian Schmaus wrote:
> > Disk space is cheap.
>
> No, it's not. Gentoo supports more hardware than your average PC with
> beefy hard drive and/or possibility of installing one. Let's not forget
> that you need a ::gentoo checkout even on a system running purely
> on binary packages.
>
> Let's not forget that git keeps all history, so every bump of a Go
> package with large Manifest has a permanent negative impact on clone
> size. A few version bumps of Go packages can easily outweigh complete
> history of hundreds of other packages.
>
> > Network traffic, while also being cheap, may be more of an issue.
>
> Again, you're making an assumption based on living in a well-developed area
> and discriminating against users who have shoddy Internet connectivity.
>
> That said, this all was discussed in the past. I really wish you would
> humble down and try to find a solution that would work for everyone
> instead of showing arrogance and lack of concern for users outside your
> "majority" view of Gentoo.
>
> --
> Best regards,
> Michał Górny
>
>


Re: [gentoo-dev] Re: EGO_SUM

2023-04-27 Thread Sam James

Florian Schmaus  writes:

> On 26/04/2023 18.12, Matt Turner wrote:
>> On Wed, Apr 26, 2023 at 11:31 AM Florian Schmaus  wrote:
>>> The discussion would be more productive if someone who is supporting the
>>> EGO_SUM deprecation could rationally summarize the main arguments why we
>>> deprecated EGO_SUM.
>> You're requesting the changes. It's on you to read the previous
>> threads and try to understand. It's not others' responsibilities to
>> justify the status quo to you, but tl;dr is Manifest files grew to
>> insane sizes for golang packages with many dependencies, and the
>> Manifest size is a cost all Gentoo users pay regardless of whether
>> they use the package.
>
> I am sorry. I did try to understand the reasoning in the previous
> threads. However, I do not conclude that the "cost" users must pay for
> EGO_SUM justifies EGO_SUM's deprecation. It is the other way around:
> EGO_SUM's advantages do not explain its deprecation, even if users
> have to pay a cost.
>
> You write that the "Manifest sizes grew to insane sizes"?
>
> At which boundary does a package size, the total size of the package's
> directory, become insane?
>
> Disk space is cheap. Currently, ::gentoo, without metadata, is around
> 470 MiB. If you add 10 Go packages with a whopping 200 KiB each, then
> this adds 2 MiB to that. I need someone to explain how this
> constitutes an issue with disk space. Even if we add 100 Go packages,
> probably roughly the number of Go packages we have in ::gentoo, then
> those 20 MiB are not significant. Needless to say that the average
> size of a Go package is less than the 200 KiB used in this
> calculation.

The numbers you've used here suggest you've missed some of the
big problematic cases from the past:
- https://bugs.gentoo.org/833478 (1.1MB manifest)
- https://bugs.gentoo.org/833477 (1.6MB manifest)
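
As a rough illustration of that gap, here is a back-of-the-envelope sketch in Python. It uses only the figures quoted in this thread (200 KiB per package, 1.1 MB and 1.6 MB for the two bugs). Reading "MB" as 10**6 bytes is an assumption, and the helper name total_mib is made up for this example.

```python
KIB = 1024
MIB = 1024 * KIB


def total_mib(package_count, per_package_bytes):
    """Total size in MiB for a number of equally sized package directories."""
    return package_count * per_package_bytes / MIB


# Estimate from the quoted mail: 200 KiB per Go package.
assumed = 200 * KIB
print(f"10 packages:  {total_mib(10, assumed):.2f} MiB")
print(f"100 packages: {total_mib(100, assumed):.2f} MiB")

# Manifest sizes from the two bugs above (assumption: MB means 10**6 bytes).
reported = {"bug 833478": 1_100_000, "bug 833477": 1_600_000}
for bug, size in reported.items():
    print(f"{bug}: {size / assumed:.1f}x the 200 KiB assumption")
```

Under these assumptions, each of the two reported Manifests alone is roughly five to eight times larger than the 200 KiB figure used for a whole package directory.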

sam




Re: [gentoo-dev] Re: EGO_SUM

2023-04-27 Thread David Seifert
On Thu, 2023-04-27 at 13:00 -0500, William Hubbs wrote:
>  That, however, doesn't remove the concern about big ebuilds and
>  manifests. I will look at the remainder of the thread to figure out
>  what is going on with that.

You do know that the main reason it was deprecated in ::gentoo was the
ballooning of manifests, not some SRC_URI-generating implementation
details of the eclass itself?



Re: [gentoo-dev] Re: EGO_SUM

2023-04-27 Thread William Hubbs
On Mon, Apr 17, 2023 at 02:28:22PM +0500, Anna (cybertailor) Vyalkova wrote:
> On 2023-04-17 09:37, Florian Schmaus wrote:
> > The EGO_SUM alternatives
> > - do not have the same level of trust and therefore have a negative 
> > impact on security (a dubious tarball someone put somewhere, especially 
> > when proxy-maint)

I haven't read all of this thread yet, but I did speak with Sam last
night, and I have another idea about this.

- I still want to deprecate EGO_SUM, but I'm working in the background
  on reworking get-ego-vendor to generate the data that goes into
  src_uri directly. This would eliminate most of the processing in the
  eclass.

 
 That, however, doesn't remove the concern about big ebuilds and
 manifests. I will look at the remainder of the thread to figure out
 what is going on with that.

 William




Re: [gentoo-dev] Re: EGO_SUM

2023-04-27 Thread Michał Górny
On Thu, 2023-04-27 at 09:58 +0200, Florian Schmaus wrote:
> Disk space is cheap.

No, it's not.  Gentoo supports more hardware than your average PC with
beefy hard drive and/or possibility of installing one.  Let's not forget
that you need a ::gentoo checkout even on a system running purely
on binary packages.

Let's not forget that git keeps all history, so every bump of a Go
package with large Manifest has a permanent negative impact on clone
size.  A few version bumps of Go packages can easily outweigh complete
history of hundreds of other packages.

> Network traffic, while also being cheap, may be more of an issue. 

Again, you're making assumption based on living in a well-developed area
and discriminating against users who have shoddy Internet connectivity.

That said, this all was discussed in the past.  I really wish you would
humble down and try to find a solution that would work for everyone
instead of showing arrogance and lack of concern for users outside your
"majority" view of Gentoo.

-- 
Best regards,
Michał Górny




Re: [gentoo-dev] Re: EGO_SUM

2023-04-26 Thread Sam James

Florian Schmaus  writes:

> Hi Sam,
>
> thanks for your feedback. I am glad for everyone who engages in this
> discussion and shares their views and new information.
>
> On 24/04/2023 22.28, Sam James wrote:
>> Florian Schmaus  writes:
>> [CCing williamh@ as go-module.eclass & dev-lang/go maintainer.]
>> 
>>> I like to ask the Gentoo council to vote on whether EGO_SUM should be
>>> reinstated ("un-deprecated") or not.
>> In the various previous discussions, the need
>> for _some_ limit to be implemented (derived from EGO_SUM) was clear from
>> the QA team and others.
>
> Asking to impose an artificial limit is based on the same unfounded
> belief under which EGO_SUM was deprecated in the first place. I am
> worried that if we follow this, then a potential next step is to argue
> about adding packages to ::gentoo.
>
>
>> Voting on the matter now would be reopening the issue which led EGO_SUM
>> to be deprecated in the first place, with only a partial mitigation
>> (the Portage warning).
>
> I am sorry, but I do not follow. I think this is partly because it is
> not clear "what" (else) to mitigate.
>
> The discussion would be more productive if someone who is supporting
> the EGO_SUM deprecation could rationally summarize the main arguments
> why we deprecated EGO_SUM.

I think Matt handled this in his reply.

>
>
>> Any such limit should be supported by pkgcheck, allow using EGO_SUM
>> for most packages, but exclude the pathological cases which we're
>> unlikely to want in ::gentoo.
>> (Limit-per-ebuild rather than per-package is one option of many,
>> too.)
>
> As you probably noticed, I am not aware why we should impose such a
> limit. Especially a per-package limit confines the ability to provide
> the user with multiple versions of a package, which sometimes comes in
> handy [1].

You added a check to Portage (thank you!) to warn when the environment
size is too big. This is a runtime/dynamic check which we can't
determine purely from the repository, so pkgcheck can't notice it.

I would like pkgcheck to have an approximation of a too-large A
in an ebuild (can use Manifest as a proxy if required) derived from
the maximum environment size.
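
A minimal sketch of what such a static approximation could look like, in Python. This is not pkgcheck code: the function names are hypothetical, and the 128 KiB threshold is an assumed placeholder, since no concrete limit is given in this thread. It relies only on the Manifest convention that distfile entries are lines starting with DIST followed by the file name. Because a Manifest covers all versions of a package, this overestimates A for any single ebuild.

```python
# Hypothetical threshold: not a value agreed on in this thread.
MAX_A_BYTES = 128 * 1024


def dist_names(manifest_text):
    """Return the distfile names listed in the DIST lines of a Manifest."""
    names = []
    for line in manifest_text.splitlines():
        fields = line.split()
        # DIST lines look like: DIST <filename> <size> <hash name> <hash> ...
        if len(fields) >= 3 and fields[0] == "DIST":
            names.append(fields[1])
    return names


def approx_a_size(manifest_text):
    """Approximate the byte length of A: distfile names joined by spaces."""
    return len(" ".join(dist_names(manifest_text)).encode("utf-8"))


def a_too_large(manifest_text, limit=MAX_A_BYTES):
    """True if the approximated A exceeds the (assumed) limit."""
    return approx_a_size(manifest_text) > limit


# Made-up Manifest content, for illustration only.
sample = """\
DIST foo-1.0.tar.gz 12345 BLAKE2B abc SHA512 def
DIST github.com%2Fexample%2Fmod%2F@v%2Fv1.2.3.mod 321 BLAKE2B abc SHA512 def
EBUILD foo-1.0.ebuild 999 BLAKE2B abc SHA512 def
"""
print(approx_a_size(sample), a_too_large(sample))
```

A per-ebuild variant would need the ebuild's actual SRC_URI rather than the whole Manifest, which is why the Manifest is only a proxy here.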

I thought I'd communicated that need for the counterpart before.

thanks,
sam




Re: [gentoo-dev] Re: EGO_SUM

2023-04-26 Thread Matt Turner
On Wed, Apr 26, 2023 at 3:31 PM Andrew Ammerlaan
 wrote:
>
> On 26/04/2023 18:12, Matt Turner wrote:
> > On Wed, Apr 26, 2023 at 11:31 AM Florian Schmaus  wrote:
> >> The discussion would be more productive if someone who is supporting the
> >> EGO_SUM deprecation could rationally summarize the main arguments why we
> >> deprecated EGO_SUM.
> >
> > You're requesting the changes. It's on you to read the previous
> > threads and try to understand. It's not others' responsibilities to
> > justify the status quo to you, but tl;dr is Manifest files grew to
> > insane sizes for golang packages with many dependencies, and the
> > Manifest size is a cost all Gentoo users pay regardless of whether
> > they use the package.
> >
>
> This is a valid point and I think it is clear. What is not clear however
> is why the EGO_SUM method should be dropped entirely instead of keeping
> it as an option for overlays (with an appropriate warning). As I
> remember this is where the discussion got 'stuck' last time.
>
> There are other cases where things are possible but prohibited in
> ::gentoo by policy. E.g. the acct-user eclass allows setting
> ACCT_USER_ID to -1 for dynamic assignment, but we do not allow this in
> ::gentoo. I don't see why we could not do the same for EGO_SUM, keep it
> as an option, while disallowing it in ::gentoo.

I suspect allowing it unrestricted in overlays is fine—which seems to
be the major practical issue that spurred this thread.

Sam suggested a requirement for a maximum Manifest size (presumably
thinking about ::gentoo), and Florian replied:

> Asking to impose an artificial limit is based on the same unfounded
> belief under which EGO_SUM was deprecated in the first place. I am
> worried that if we follow this, then a potential next step is to argue
> about adding packages to ::gentoo.

So I think that's where the disagreement is.



RE: [gentoo-dev] Re: EGO_SUM

2023-04-26 Thread Chris Pritchard
> This way ridiculously large manifests are gone out of ::gentoo. But overlays
> can still use the EGO_SUM method for their go packages if a tarball is too much of
> a hassle. And everyone is happy. It is then the responsibility of the overlay
> maintainers to ensure that their manifests don't grow out of hand. A warning
> from the eclass and/or pkgcheck should ensure that they are aware of the
> potential problem.
> 
> What am I missing? I truly do not understand why this matter is not resolved
> already and why we continue to have this discussion again and again. The
> solution just seems so simple.

I agree that this is a viable solution. Hosting vendor tarballs on Gentoo
infrastructure is possible, though there would need to be a way to support
proxy maintainers in uploading and hosting them. But deprecating EGO_SUM and
then removing it as an option for overlays is, in my view, a poor move. It
adds a significant burden to overlay maintainers, who may have to resort to
paying for hosting of the vendor tarballs, forking repositories, or even not
contributing at all.

Chris


Re: [gentoo-dev] Re: EGO_SUM

2023-04-26 Thread Andrew Ammerlaan

On 26/04/2023 18:12, Matt Turner wrote:

> On Wed, Apr 26, 2023 at 11:31 AM Florian Schmaus  wrote:
>> The discussion would be more productive if someone who is supporting the
>> EGO_SUM deprecation could rationally summarize the main arguments why we
>> deprecated EGO_SUM.
>
> You're requesting the changes. It's on you to read the previous
> threads and try to understand. It's not others' responsibilities to
> justify the status quo to you, but tl;dr is Manifest files grew to
> insane sizes for golang packages with many dependencies, and the
> Manifest size is a cost all Gentoo users pay regardless of whether
> they use the package.



This is a valid point and I think it is clear. What is not clear however 
is why the EGO_SUM method should be dropped entirely instead of keeping 
it as an option for overlays (with an appropriate warning). As I 
remember this is where the discussion got 'stuck' last time.


There are other cases where things are possible but prohibited in 
::gentoo by policy. E.g. the acct-user eclass allows setting 
ACCT_USER_ID to -1 for dynamic assignment, but we do not allow this in 
::gentoo. I don't see why we could not do the same for EGO_SUM, keep it 
as an option, while disallowing it in ::gentoo.


This way ridiculously large manifests are gone out of ::gentoo. But 
overlays can still use the EGO_SUM method for their go packages if a 
tarball is too much of a hassle. And everyone is happy. It is then the 
responsibility of the overlay maintainers to ensure that their manifests 
don't grow out of hand. A warning from the eclass and/or pkgcheck should 
ensure that they are aware of the potential problem.
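
The split policy described above can be sketched in a few lines of Python. This is only an illustration, not an existing pkgcheck or eclass check: the function name, the regular expression, and the verdict strings are all assumptions made for this example.

```python
import re

# Matches an EGO_SUM array assignment at the start of a line.
EGO_SUM_RE = re.compile(r"^\s*EGO_SUM=\(", re.MULTILINE)


def ego_sum_verdict(repo_name, ebuild_text):
    """Return 'error', 'warning', or None for a single ebuild."""
    if not EGO_SUM_RE.search(ebuild_text):
        return None
    # Policy as proposed: disallowed in ::gentoo, only flagged in overlays.
    return "error" if repo_name == "gentoo" else "warning"


# Made-up ebuild content, for illustration only.
ebuild = (
    "EAPI=8\n"
    "inherit go-module\n"
    "EGO_SUM=(\n"
    '\t"github.com/example/mod v1.0.0"\n'
    ")\n"
)
print(ego_sum_verdict("gentoo", ebuild))
print(ego_sum_verdict("guru", ebuild))
```

The point of the sketch is only that the same construct can be treated differently depending on the repository, as is already done by policy for ACCT_USER_ID=-1.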


What am I missing? I truly do not understand why this matter is not 
resolved already and why we continue to have this discussion again and 
again. The solution just seems so simple.


Best regards,
Andrew



Re: [gentoo-dev] Re: EGO_SUM

2023-04-24 Thread Alexey Zapparov
My 2 cents. As somebody who contributes to ::guru, I would like to
second that the burden of hosting dependency tarballs feels
like an obstacle. Persuading upstream projects to adopt dependency
bundling is often difficult (it's hard to convince developers to
change their workflows to make the life of ebuild packagers easier).
This leads to forking the project on GitHub/GitLab with the
sole goal of cutting a release of a dependency tarball.

On Mon, Apr 24, 2023 at 10:33 PM Sam James  wrote:
>
>
> Florian Schmaus  writes:
>
> [CCing williamh@ as go-module.eclass & dev-lang/go maintainer.]
>
> > I like to ask the Gentoo council to vote on whether EGO_SUM should be
> > reinstated ("un-deprecated") or not.
> >
> > EGO_SUM is a project-comprehensive matter, as it affects not only
> > Go-lang packaging but also the proxy-maint and GURU
> > projects. Furthermore, as I have mentioned in my previous emails, the
> > deprecation of EGO_SUM has a significant negative impact on our users
> > and is, therefore, a global Gentoo issue.
> >
> > Asking for council involvement should be a last resort and only be
> > done in essential conflicts. But, unfortunately, I was unable to
> > convince the relevant maintainer with arguments that the deprecation
> > of EGO_SUM is harmful. And this matter is significant enough to
> > proceed with this.
>
> My feeling on this is that this proposal isn't yet complete enough
> for the council to assess. In the various previous discussions, the need
> for _some_ limit to be implemented (derived from EGO_SUM) was clear from
> the QA team and others.
>
> Voting on the matter now would be reopening the issue which led EGO_SUM
> to be deprecated in the first place, with only a partial mitigation
> (the Portage warning).
>
> Any such limit should be supported by pkgcheck, allow using EGO_SUM
> for most packages, but exclude the pathological cases which we're
> unlikely to want in ::gentoo.
>
> (Limit-per-ebuild rather than per-package is one option of many,
> too.)
>
> >
> > Most voices on the related mailing-list threads expressed support for
> > reinstating EGO_SUM. At least, that is my impression. While the
> > arguments used to deprecate EGO_SUM were mostly of esthetic nature.
> >
> > I want to state what should be common sense. Namely, asking for a
> > democratic vote is not a personal attack against any involved
> > person.
> > [...]
>
> I agree this is an important issue that affects the practicality
> of using Gentoo for some, and for contributing to Gentoo to others.
>
> >
> > On 17/04/2023 09.37, Florian Schmaus wrote:
> >> [original msg snipped]