Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation

2020-05-24 Thread Robin H. Johnson
On Mon, May 04, 2020 at 11:57:03PM +0100, Andrey Utkin wrote:
> Since it is going to be opt-in and optional anyway, we seem to be fine with
> having just partial data.
> 
> I assume we have logs of distfiles downloads from Gentoo infrastructure, and
> can negotiate access to relevant logs of our mirrors. That constitutes partial
> data correlated with users' installation activity, as good as it gets.
This assumption is wrong at the root.

> If we do have some such data, are we using it in any way for the discussed
> purposes?
> 
> If we don't, but could get it, would we be able to use that data for these
> purposes? If no, why?
> 
> If we can't get the data, why?
Simply put: Gentoo does not run the last-mile edge of distfile distribution.

$ dig @ns1.gentoo.org +noall +answer distfiles.gentoo.org IN A
distfiles.gentoo.org.   7200    IN      A       64.50.233.100
distfiles.gentoo.org.   7200    IN      A       140.211.166.134
distfiles.gentoo.org.   7200    IN      A       64.50.236.52
$ echo 140.211.166.134 64.50.233.100 64.50.236.52 | fmt -1 | xargs -n1 dig +short -x
ftp-osl.osuosl.org.
ftp-nyc.osuosl.org.
ftp-chi.osuosl.org.

And historically also TDS & another provider.

Plus all of the regional mirrors that don't even have .gentoo.org
hostnames.

I would like to replace the legacy http://distfiles.gentoo.org/
functionality with a redirection service, at which point you could have
partial data, but it answers a very different question than Goose.

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail   : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136


signature.asc
Description: PGP signature


Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation

2020-05-05 Thread Toralf Förster
On 5/5/20 10:26 PM, Daniel Pielmeier wrote:
> Actually the maintainer decided to continue the project.
> The code is now hosted at Github [1].
> The site moved to a new server and the upload is working again.
> 
> [1] https://github.com/portagefilelist
> 
> -- 
> Best regards
> Daniel

Indeed - I'm reactivating the pfl logic in the tinderbox script.

-- 
Toralf
PGP 23217DA7 9B888F45





Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation

2020-05-05 Thread Daniel Pielmeier
On May 5, 2020 7:31:34 PM UTC, "Toralf Förster" wrote:
>On 4/26/20 10:08 AM, Michał Górny wrote:
>> I don't think we really want to try to investigate
>> which files are actually used but focus on what's installed.
>Hi,
>
>I do wonder if the http://www.portagefilelist.de/site/start (package
>app-portage/pfl) would be part of that or not?
>The maintainer of the pfl stopped the import of new data last year due
>to lack of time to maintain that project and is looking for a
>successor.

Actually the maintainer decided to continue the project.
The code is now hosted at Github [1].
The site moved to a new server and the upload is working again.

[1] https://github.com/portagefilelist

-- 
Best regards
Daniel

Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation

2020-05-05 Thread Kent Fredric
On Tue, 5 May 2020 02:47:48 +0200
Thomas Deutschmann  wrote:

> Yes, it would be a signal, but a useless one, no?

"There are no users reported using this dist, so we can nuke it" is
still far far superior to "there are no reverse dependencies, so we can
nuke it"

*Even* when the former is false information.

At present, the "no reverse dependencies, therefore nuke" approach
essentially asserts there *are* no users to consider.

So the *worst* case scenario for decisions made with these statistics
is our *current* case.

Even if *nobody* uses the service and *all* results indicate "nobody
uses anything", then we'll just be reverting to what we currently do:
Remove things entirely on conjecture that they're not useful.





Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation

2020-05-05 Thread Toralf Förster
On 4/26/20 10:08 AM, Michał Górny wrote:
> I don't think we really want to try to investigate
> which files are actually used but focus on what's installed.
Hi,

I do wonder if the http://www.portagefilelist.de/site/start (package
app-portage/pfl) would be part of that or not?
The maintainer of the pfl stopped the import of new data last year due to
lack of time to maintain that project and is looking for a successor.

-- 
Toralf
PGP 23217DA7 9B888F45





Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation

2020-05-05 Thread Jaco Kroon
Hi Michał, and the rest of the Gentoo devs,

I've been patiently sitting and watching this discussion.

I raised some ideas with another developer (Not Michał) just days before
he raised this thread to the ML.

I believe all points raised so far are valid. I'll try to summarise:

1.  This must be completely *opt in*.
2.  Anonymity was discussed by various parties (privacy).
3.  "spam" protection (ie, preventing bogus data from entering).
4.  Trustworthiness of data.
5.  Acceptance of some form of privacy policy.

In my opinion, points 2 and 3 work against each other: if registration
is compulsory for anyone who would like to submit stats, then we can
control the spam more easily (though not foolproof), but requiring
registration also raises the entry barrier.  I'd be completely willing
to provide at least an email address as part of a submission.

All of the replies seem to have focused purely on yes/no, do it or
don't.  Not many have addressed the benefits to end users/system
administrators.  The focus seems to be on what we as developers can get
out of this.

Regarding the above points:

1.  I fully agree.  This should not be forced on anyone.
2.  Happy to concede that some people may wish to submit anonymously. 
Let them.
3.  I'll address this below.
4.  A lot of the discussion has been around the usefulness of the data,
and I concede to Thomas that this may (or may not) generate "decision
blind spots" or "artificially increase decision certainty".  I
don't see how this is worse than what we've got now.
5.  We have the infrastructure for this already by way of licenses.  So
we ship with "GPLv2/3/whatever + GentooPrivacy", and users have to first
take explicit action to accept GentooPrivacy.

I have some other ideas around this, which will tread even further on
privacy, but again, all of this should be a kind of opt-in, and building
on the ideas by Kent where he suggested a form of submission proxy
(STATS_SERVER), we could potentially give the full benefit of the code
to such entities, but then still allow them to submit "upstream" in a
more filtered manner.

Bottom line, in my opinion:  Any data is better than no data!

Whilst we can't say "no one is using xyz", we will at least be able to
say "hey, some people are using xyz", and whilst this may generate some
blind spots, it at least enables us to test known use cases during
test-builds, eg, we know for a fact a thousand users are using package X
with USE flags "-* a b c", so we should definitely run that as a compile
test.  Your build breaks frequently?  Would you mind submitting stats?
Great, thank you.  You're not willing to do that?  Then my stance becomes
one of "ok, I'll help where I can, but really, please consider letting us
help you: if you submit stats we can pre-emptively at least include build
tests for your specific USE flags." - and again, this means we can
actually have our tooling use these stats to generate build tests for
the "known popular" configs.
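
To make the tooling idea concrete, here is a minimal Python sketch that
picks the most frequently reported USE flag combinations for a package as
candidates for build tests. The report format (pairs of package name and
enabled flags) and the name popular_use_configs are assumptions made up
for illustration; no submission format has been agreed on in this thread.

```python
from collections import Counter


def popular_use_configs(reports, package, top_n=3):
    """Return the top_n most frequently reported USE flag sets for package.

    reports: iterable of (package, enabled_flags) pairs. This shape is
    hypothetical; no submission format has been agreed on.
    """
    counts = Counter(
        frozenset(flags) for pkg, flags in reports if pkg == package
    )
    # most_common() orders by count, highest first.
    return [(sorted(flags), n) for flags, n in counts.most_common(top_n)]


# Made-up sample data, purely for illustration.
sample_reports = [
    ("app-misc/foo", ["a", "b", "c"]),
    ("app-misc/foo", ["c", "b", "a"]),  # same set, different order
    ("app-misc/foo", ["a"]),
    ("app-misc/bar", ["x"]),
]

for flags, count in popular_use_configs(sample_reports, "app-misc/foo"):
    print(count, " ".join(flags))
```

Each returned flag set could then be handed to a test build. It only
covers configurations that somebody chose to report, so it adds test
cases rather than proving that other configurations are unused.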

I point you to RHEL - why are people willing to pay for RHEL?  What
do they get for that buck?  Because I promise you, the support I get
from fellow Gentoo'ers FAR outweighs the support I have ever gotten from
(paid for) RHEL.  Most of the time.

I myself used to run 500+ Gentoo hosts more than 15 years back.  It was
fun.  I was also a student back then so had much more time on my hands
than I do now.  It was challenging, and fun to try and get things to
work exactly the way we envisioned it should.  I promise you, if what
Michał proposes had been available to me back then, firstly to keep track
of my own internal assets, and secondly to submit stats upstream to help
improve Gentoo, I would not have hesitated for 10 seconds.

And there I touch on a point I'm trying to make - this should be
something that not only helps devs, but brings benefit to users.  I'll
say more on this at the end of the email (possibly force users to run
some of their own infra for this at least, but these stats form the
framework for a multi-system management system too, potentially).  First
I'd like to pay more attention to the individual points raised by Michał.

On 2020/04/26 10:08, Michał Górny wrote:

> Hi,
>
> The topic of rebooting gentoostats comes here from time to time.  Unless
> I'm mistaken, all the efforts so far were superficial, lacking a clear
> plan and unwilling to research the problems.  I'd like to start
> a serious discussion focused on the issues we need to solve, and propose
> some ideas how we could solve them.
>
> I can't promise I'll find time to implement it.  However, I'd like to
> get a clear plan on how it should be done if someone actually does it.

My time is also limited, but I would love to be involved in some way or
another.

> The big questions
> =================
> The way I see it, the primary goal of the project would be to gather
> statistics on popularity of packages, in order to help us prioritize our
> attention and make decisions on what to keep and what to remove.  Unlike
> Debian's 

Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation

2020-05-05 Thread Nils Freydank
Hi all,

I find the idea of having data great, but agree that it can lead to a false
sense of having a correct data basis. Therefore, two thoughts:

First, I'd like to propose that you introduce gentoostats as a
*strictly timed experiment*, evaluate whether it actually changed anything in
your decisions, and then either drop it or let it run permanently.

I have no proper solution for the parameters though, maybe something like
"I choose to keep X use flags based on g.s.", but this would ask every dev to
log plenty of decisions manually (read: I don't think this will happen).

Second, I'm a bit frightened of Whissi's thought of dropping anything
security related based on non-input via g.s. -- I'd like to ask you *not* to
use the information from g.s. for security related decisions, but rather for
"harmless" ones like the one Matt mentioned: Should I really support feature X
while literally every one of 200 users uses feature Y instead and I have no
real testing ground for feature X? (Matt, yell at me if I got you wrong!)

Kind regards,
Nils (holgersson on Freenode)




Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation

2020-05-05 Thread Michał Górny
On Tue, 2020-05-05 at 02:47 +0200, Thomas Deutschmann wrote:
> Yes, it would be a signal, but a useless one, no?
> 

You seem to aim for arbitrarily blocking developers from making
decisions by preventing them from having data.  This won't work.
Firstly, because *we have* to make decisions, and the worse the data we
have, the more arbitrary the decisions will be.  Secondly, because we will
always have some data; it will just probably be worse than what's being
proposed here.

Generally, having more data means making better informed decisions.
Of course, there's always the potential of having too much data (though
I honestly don't think we're anywhere near that).  There's also
the potential of being lazy and just taking the easiest available data. 
There's no way around that but then, you can also be lazy and make
decisions ignoring any data.


For example, one kind of data we have right now are bugs.  So a package
fails for me in an obvious way yet there's no bug open.  Does that mean
that the package has zero users?  Otherwise someone would have reported
the problem, right?  So here go last rites.

Gentoostats could tell me 'hey, this package has a bunch of users still'.
This calls my first assessment into question -- 'oh, they probably haven't
had to rebuild it since ...'


If we have no data, we have to rely on 'gut feelings'.  I have a gut
feeling that this package looks useless, why bother.  Is that more
worthwhile than having *some* number to look at?  Even if the data is
biased towards a specific kind of user, it would probably work better
than guessing.  And if it looks unreasonable, nobody stops you from
guessing.  I guess that an informed guess is better than a random guess.

-- 
Best regards,
Michał Górny





Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation

2020-05-05 Thread Alec Warner
On Mon, May 4, 2020 at 10:14 PM Matt Turner  wrote:

> On Mon, May 4, 2020 at 5:48 PM Thomas Deutschmann wrote:
> >
> > On 2020-04-26 15:46, Kent Fredric wrote:
> > > On Sun, 26 Apr 2020 14:38:54 +0200
> > > Thomas Deutschmann  wrote:
> > >
> > >> Let's assume we will get reports that app-misc/foo is only installed
> > >> 20 times. If you are going to judge based on this data, "Obviously,
> > >> nobody is using that package, it's stuck on ... safe to remove" your
> > >> view is biased:
> > >
> > > I see this as more like what bloom filters get you, but in reverse:
> > >
> > > [...]
> > >
> > > - But now, instead of having "we don't know if anybody uses this", you
> > >   *can* have a "we know for sure somebody uses this".
> >
> > But how does that information really help us to decide anything in the
> > end?
> >
> > Case A, stats are showing 0 users:
> >
> > As said, we can't know if this is true or if this package is only used
> > in setups where people don't report stats.
> >
> >
> > Case B, stats are showing x users:
> >
> > Now what? A package from case A could have a similar number of users --
> > we just don't know. Assume firefox has 1,000 users, chromium has 500
> > users and vivaldi doesn't show up in stats. How does that help us? Would
> > this allow us to skip publishing GLSAs for vivaldi because we assume
> > nobody in Gentoo is using vivaldi? Does it allow the Python project to go
> > forward pushing a mask for removal in case vivaldi depends on a Python
> > version that the Python project wants to get rid of? Would this allow
> > Gentoo PR to make a public statement like "Firefox is the most popular
> > browser in Gentoo, with twice as many users as chromium"?
>
> I hate the saying "the perfect is the enemy of the good" but I think
> it applies here.
>
> You're of course correct that we would not have perfect information.
> But the thing about statistics is that you can still know some things
> based on a sampling of that perfect information.
>
> I would personally like to have data on whether users of my packages
> have certain USE flags enabled. Knowing that would allow me to decide
> whether it's worth the maintenance burden of supporting features that I
> *think* are very rarely used. If instead the data showed me that 50%
> of users had IUSE=xyz enabled, I probably wouldn't consider removing
> it.
>
> I think your example of potential misuse of data is a bit overdramatic.
>

Let me present the same point another way.

Today we have no data, so we make an arbitrary decision. It might be right
or wrong; and we may not know until after we decide.
This is traditionally things like "break them and they will come" type of
process. "Mask it, if they complain, I'll unmask it."

In the future, we could have this package data. It may influence decision
making. However I'm not sure from a decision-making standpoint that it is
strictly worse than no data.
The danger (which is what I think Whissi's concern is) is that it could
artificially increase decision certainty.

For example, if I have to decide whether to keep a package, or a flag, or
whatever. I might make an arbitrary decision. I'm aware it's arbitrary, it
might be wrong, and so I'm not super attached to such a decision. I'm not
*certain* about it; but I have to decide one way or the other[0]. Then I
move to a world with package data. Now I'm no longer making an arbitrary
decision; I'm making a decision based on *data*. The *data* tells me my
decision is correct, resulting in a more *certain* decision outcome. I
think this is the fallacy we want to avoid. The data can be informative but
there are significant biases in it that should result in very *little*
certainty added to decision making.

Making decisions based on incomplete data is just life though, so I'm
fairly skeptical of a "we shouldn't collect any data" type of mindset. I'd
be curious to see if we can instill a *culture* component around the use of
data in our development workflows.

-A

[0] There are a bunch of other cultural components here, like different
decision types (1 vs 2) and the ability to make a mistake in public and not
feel bad about it; so I'm aware reality does not reflect this trivial
example. But those are hallmarks of cultural markets I'd like to aim for in
Gentoo, so I would prefer to discuss a world where they exist ;)


Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation

2020-05-04 Thread Matt Turner
On Mon, May 4, 2020 at 5:48 PM Thomas Deutschmann  wrote:
>
> On 2020-04-26 15:46, Kent Fredric wrote:
> > On Sun, 26 Apr 2020 14:38:54 +0200
> > Thomas Deutschmann  wrote:
> >
> >> Let's assume we will get reports that app-misc/foo is only installed 20
> >> times. If you are going to judge based on this data, "Obviously, nobody
> >> is using that package, it's stuck on ... safe to remove" your
> >> view is biased:
> >
> > I see this as more like what bloom filters get you, but in reverse:
> >
> > [...]
> >
> > - But now, instead of having "we don't know if anybody uses this", you
> >   *can* have a "we know for sure somebody uses this".
>
> But how does that information really help us to decide anything in the end?
>
> Case A, stats are showing 0 users:
>
> As said, we can't know if this is true or if this package is only used
> in setups where people don't report stats.
>
>
> Case B, stats are showing x users:
>
> Now what? A package from case A could have a similar number of users --
> we just don't know. Assume firefox has 1,000 users, chromium has 500
> users and vivaldi doesn't show up in stats. How does that help us? Would
> this allow us to skip publishing GLSAs for vivaldi because we assume
> nobody in Gentoo is using vivaldi? Does it allow the Python project to go
> forward pushing a mask for removal in case vivaldi depends on a Python
> version that the Python project wants to get rid of? Would this allow
> Gentoo PR to make a public statement like "Firefox is the most popular
> browser in Gentoo, with twice as many users as chromium"?

I hate the saying "the perfect is the enemy of the good" but I think
it applies here.

You're of course correct that we would not have perfect information.
But the thing about statistics is that you can still know some things
based on a sampling of that perfect information.

I would personally like to have data on whether users of my packages
have certain USE flags enabled. Knowing that would allow me to decide
whether it's worth the maintenance burden of supporting features that I
*think* are very rarely used. If instead the data showed me that 50%
of users had IUSE=xyz enabled, I probably wouldn't consider removing
it.
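
As a small illustration of the kind of number meant here, the following
Python sketch computes the share of reporting installs that have a given
USE flag enabled. The report format and the function name flag_share are
hypothetical, and the result only describes users who submit stats, not
all users.

```python
def flag_share(reports, package, flag):
    """Fraction of reported installs of package that enable flag.

    reports: iterable of (package, enabled_flags) pairs (an assumed
    format). Returns None when there are no reports for the package,
    which is not the same as "nobody uses it".
    """
    flag_sets = [set(flags) for pkg, flags in reports if pkg == package]
    if not flag_sets:
        return None
    return sum(flag in flags for flags in flag_sets) / len(flag_sets)


# Made-up sample data, purely for illustration.
sample_reports = [
    ("x11-libs/foo", ["xyz", "doc"]),
    ("x11-libs/foo", ["xyz"]),
    ("x11-libs/foo", ["doc"]),
    ("x11-libs/foo", []),
]

print(flag_share(sample_reports, "x11-libs/foo", "xyz"))
```

Returning None for "no reports" keeps the missing-data case visibly
separate from a measured share of zero.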

I think your example of potential misuse of data is a bit overdramatic.



Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation

2020-05-04 Thread Thomas Deutschmann
On 2020-04-26 15:46, Kent Fredric wrote:
> On Sun, 26 Apr 2020 14:38:54 +0200
> Thomas Deutschmann  wrote:
> 
>> Let's assume we will get reports that app-misc/foo is only installed 20
>> times. If you are going to judge based on this data, "Obviously, nobody
>> is using that package, it's stuck on ... safe to remove" your
>> view is biased:
> 
> I see this as more like what bloom filters get you, but in reverse:
> 
> [...]
>
> - But now, instead of having "we don't know if anybody uses this", you
>   *can* have a "we know for sure somebody uses this".

But how does that information really help us to decide anything in the end?

Case A, stats are showing 0 users:

As said, we can't know if this is true or if this package is only used
in setups where people don't report stats.


Case B, stats are showing x users:

Now what? A package from case A could have a similar number of users --
we just don't know. Assume firefox has 1,000 users, chromium has 500
users and vivaldi doesn't show up in stats. How does that help us? Would
this allow us to skip publishing GLSAs for vivaldi because we assume
nobody in Gentoo is using vivaldi? Does it allow the Python project to go
forward pushing a mask for removal in case vivaldi depends on a Python
version that the Python project wants to get rid of? Would this allow
Gentoo PR to make a public statement like "Firefox is the most popular
browser in Gentoo, with twice as many users as chromium"?

Yes, it would be a signal, but a useless one, no?


-- 
Regards,
Thomas Deutschmann / Gentoo Linux Developer
fpr: C4DD 695F A713 8F24 2AA1 5638 5849 7EE5 1D5D 74A5





Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation

2020-05-04 Thread Thomas Deutschmann
On 2020-05-05 00:57, Andrey Utkin wrote:
> I assume we have logs of distfiles downloads from Gentoo infrastructure, and
> can negotiate access to relevant logs of our mirrors. That constitutes partial
> data correlated with users' installation activity, as good as it gets.

Even if we had data for distfiles.gentoo.org, it wouldn't help us.
Consider how Gentoo works: if you follow the handbook, you will pick a
local/regional mirror. Now all these users are suddenly 'disconnected'
from the download stats...


-- 
Regards,
Thomas Deutschmann / Gentoo Linux Developer
fpr: C4DD 695F A713 8F24 2AA1 5638 5849 7EE5 1D5D 74A5





Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation

2020-05-04 Thread Andrey Utkin
Since it is going to be opt-in and optional anyway, we seem to be fine with
having just partial data.

I assume we have logs of distfiles downloads from Gentoo infrastructure, and
can negotiate access to relevant logs of our mirrors. That constitutes partial
data correlated with users' installation activity, as good as it gets.

If we do have some such data, are we using it in any way for the discussed
purposes?

If we don't, but could get it, would we be able to use that data for these
purposes? If no, why?

If we can't get the data, why?


As an aside, I think the best known way to ensure the availability of
important things, from a user's perspective, is to pay for these important
things. Of course I see how this won't fit culturally very well here and
that we're not going to switch to a commercial model just for this reason.




Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation

2020-04-26 Thread Andreas K . Hüttel
On Sunday, 26 April 2020, 12:09:59 EEST, Ulrich Mueller wrote:
> > On Sun, 26 Apr 2020, Michał Górny wrote:
> > The other major problem is spam protection.  The best semi-anonymous way
> > I see is to use submitter's IPv4 addresses (can we support IPv6 then?).
> > We could set a limit of, say, 10 submissions per IPv4 address per week.
> > If some address would exceed that limit, we could require CAPTCHA
> > authorization.
> 
> Instead of using the IP address, you could generate a UUID when
> installing the tool. This would also take care of clusters with machines
> that are clones of each other.
> 

TBH, for clusters I would insert a sentence like
"If you are administering a cluster of many identical Gentoo machines, please
see $WIKIPAGE before enabling submission"

and then have a few more instructions there (like how to enable only for one
machine, and additionally provide us with the cluster size). I guess in this
case we can add this further step, since whoever is doing that will be both
invested in Gentoo and able to read docs.
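
A minimal Python sketch of the two mechanisms quoted above, under stated
assumptions: the client generates a random UUID once at install time, and
the server allows at most 10 submissions per key per week. The key can be
an IPv4 address or a UUID. The class name SubmissionLimiter and the
in-memory storage are made up for illustration and are not part of any
existing gentoostats code.

```python
import uuid
from collections import defaultdict, deque

WEEK_SECONDS = 7 * 24 * 3600


def new_client_id():
    """Random (version 4) UUID, generated once when the tool is installed."""
    return str(uuid.uuid4())


class SubmissionLimiter:
    """Sliding-window limit: at most `limit` submissions per key per window."""

    def __init__(self, limit=10, window=WEEK_SECONDS):
        self.limit = limit
        self.window = window
        self._seen = defaultdict(deque)  # key -> timestamps of accepted submissions

    def allow(self, key, now):
        """Return True and record the submission if key is under its limit.

        now is a timestamp in seconds, passed in explicitly so the logic
        can be tested without waiting for real time to pass.
        """
        times = self._seen[key]
        # Forget submissions that have fallen out of the window.
        while times and now - times[0] >= self.window:
            times.popleft()
        if len(times) >= self.limit:
            return False  # over the limit; e.g. fall back to a CAPTCHA
        times.append(now)
        return True
```

A cluster that reports from one machine and states its size, as suggested
above, would need only one UUID and would stay well under such a limit.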

-- 
Andreas K. Hüttel
dilfri...@gentoo.org
Gentoo Linux developer 
(council, qa, toolchain, base-system, perl, libreoffice)



Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation

2020-04-26 Thread Samuel Bernardo
Hi everyone,

gentoostats is new to me and I'm not aware of previous
discussions or implementations. But from what I could understand from the
comments and Michał Górny's explanation, I would like to draw your
attention to the Octoverse [1] initiative.

Maybe the statistics could be collected through a platform that also
gathers additional metadata from user contributions. What I
mean is a broker that collects all statistics from an
organization internally and then publishes them. Such a
solution would add value for enterprise statistics and also
contribute to Gentoo in the end.

Each broker could use git authentication to publish the
results with a merge request that would run the necessary hooks on the
Gentoo side. All we need here is a specification document for parsing
the data.

Sorry if my comment is completely out of context, but such an Octoverse
for Gentoo would be very interesting from my perspective.

Best,

Samuel

[1] https://octoverse.github.com/

On 4/26/20 9:08 AM, Michał Górny wrote:
> Hi,
>
> The topic of rebooting gentoostats comes here from time to time.  Unless
> I'm mistaken, all the efforts so far were superficial, lacking a clear
> plan and unwilling to research the problems.  I'd like to start
> a serious discussion focused on the issues we need to solve, and propose
> some ideas how we could solve them.
>
> I can't promise I'll find time to implement it.  However, I'd like to
> get a clear plan on how it should be done if someone actually does it.
>
>
> The big questions
> =================
> The way I see it, the primary goal of the project would be to gather
> statistics on popularity of packages, in order to help us prioritize our
> attention and make decisions on what to keep and what to remove.  Unlike
> Debian's popcon, I don't think we really want to try to investigate
> which files are actually used but focus on what's installed.
>
> There are a few important questions that need to be answered first:
>
> 1. Which data do we need to collect?
>
>    a. list of installed packages?
>    b. versions (or slots?) of installed packages?
>    c. USE flags on installed packages?
>    d. world and world_sets files
>    e. system profile?
>    f. enabled repositories? (possibly filtered to official list)
>    g. distribution?
>
> I think d. is most important as it gives us information on what users
> really want.  a. alone is kinda redundant if we have d.  c. might have
> some value when deciding whether to mask a particular flag (and implies
> a.).
>
> e. would be valuable if we wanted to determine the future of particular
> profiles, as well as e.g. estimate the transition to new versions.
>
> f. would be valuable to determine which repositories are used but we
> need to filter private repos from the output for privacy reasons.
>
> g. could be valuable in correlation with other data but not sure if
> there's much direct value alone.
>
>
> 2. How to handle Gentoo derivatives?  Some of them could provide
> meaningful data but some could provide false data (e.g. when derivatives
> override Gentoo packages).  One possible option would be to filter a.-e. 
> to stuff coming from ::gentoo.
>
>
> 3. How to keep the data up-to-date?  After all, if we just stack a lot
> of old data, we will soon stop getting meaningful results.  I suppose
> we'll need to timestamp all data and remove old entries.
>
>
> 4. How to avoid duplication?  If some users submit their results more
> often than others, they would bias the results.  3. might be related.
>
>
> 5. How to handle clusters?  Things are simple if we can assume that
> people will submit data for a few distinct systems.  But what about
> companies that run 50 Gentoo machines with the same or similar setup? 
> What about clusters of 1000 almost identical containers?  Big entities
> could easily bias the results but we should also make it possible for
> them to participate somehow.
>
>
> 6. Security.  We don't want to expose information that could be
> correlated to specific systems, as it could disclose their
> vulnerabilities.
>
>
> 7. Privacy.  Besides the above, our sysadmins would appreciate if
> the data they submitted couldn't be easily correlated to them.  If we
> don't respect privacy of our users, we won't get them to submit data.
>
>
> 8. Spam protection.  Finally, the service needs to be resilient to being
> spammed with fake data.  Both to users who want to make their packages
> look more important, and to script kiddies that want to prove a point.
>
>
> My (partial) implementation idea
> 
> I think our approach should be oriented on privacy/security first,
> and attempt to make the best of the data we can get while respecting
> this principle.  This means no correlation and no tracking.
>
> Once the tool is installed, the user needs to opt-in to using it.  This
> involves accepting a privacy policy and setting up a cronjob.  The tool
> would 

Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation

2020-04-26 Thread Kent Fredric
On Sun, 26 Apr 2020 10:52:27 +0200
Michał Górny  wrote:

> Do you have any other idea for spam protection then?

What is the realistic risk here for spamming?

If the record is well formed, and pertains to known packages, the worst
I currently imagine is astroturfing: A single individual attempting to
make a package seem more popular than it is.

Just generally IME, spamming aims to make a buck somehow, but if
there are no fields in the data set that can be used for this, and abuse
of existing fields to fill them with spam prose gets filtered because it
doesn't correlate to any known possible values, then the entire record is
simply invalid, and can be removed on that basis.

Conceptually, you could have a report with
"dev-foo/plz-sir-halp-me-I-have-money-and-an-a-nigerian-prince::nigeria-prince",
but for anybody to see that they'd have to be querying data about the
::nigeria-prince overlay, and that's assuming we even show data about
overlays we can't locate.

Trolling ::gentoo with packages that don't exist seems easy to eliminate.
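
A rough Python sketch of that filtering, assuming a record maps package
names to enabled USE flags and that the server holds a table of known
::gentoo packages and their flags. Both shapes, and the name
is_valid_record, are assumptions for illustration only.

```python
def is_valid_record(record, known):
    """Accept a record only if every field matches a known value.

    record: dict mapping package name -> list of enabled USE flags
            (hypothetical submission format).
    known:  dict mapping package name -> set of USE flags that package
            has (e.g. built from the ::gentoo tree).
    One unknown package or flag invalidates the whole record.
    """
    if not record:
        return False
    return all(
        pkg in known and set(flags) <= known[pkg]
        for pkg, flags in record.items()
    )


# Made-up reference data, purely for illustration.
known_packages = {
    "dev-lang/perl": {"doc", "ithreads"},
    "app-misc/foo": {"a", "b", "c"},
}

print(is_valid_record({"dev-lang/perl": ["doc"]}, known_packages))
print(is_valid_record({"dev-foo/buy-cheap-stuff": []}, known_packages))
```

Since free-form text never matches a known package or flag, it cannot
survive this check; what remains is astroturfing with well-formed data.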

I don't like that astroturfing could be a thing ... but like, I also
don't really care about that happening.

For instance, crates.io has per-crate and per-crate-version download
statistics.

That's super easy to rig, you get lots of spiky noise in infrequently
used packages simply due to various automated services fetching things.

But at scale, the data still turns out to be quasi-useful, as it allows
you to chart adoption and migration... because as soon as a new version
gets shipped, if people are using it, then you'll start to see an
uptick in reports from the new version.

The "change" and "change response" information is very useful, and a
very odd target for astroturfing.

I for one would be greatly interested in "new perl version shipped,
explosion of results due to people upgrading", because then I can gauge
roughly how many people managed to upgrade perl without having to join
#gentoo and cry about it being broken.

(We could also designate a certain UUID flag for use by Gentoo infra,
possibly even a UUID-per-host, the results of which were invisible in
the public data, but still visible to people with approved perms,
because we really do value the ability to know which packages we have
to be careful about causing problems in, and where infra is at with
upgrading various things before we remove the versions infra is using,
whereas currently, working out what infra are currently running
requires lots of direct communication)




Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation

2020-04-26 Thread Kent Fredric
On Sun, 26 Apr 2020 03:39:24 -0700
Brian Dolbec  wrote:

> We would need that
> person/team to only enable their test system for gentoostats/disabled
> for deployments. Repeated failure to do that could result in that uuid
> being blacklisted.   Part of the initial profile details for that
> vm/image would be some details about approx numbers of deployments
> (yes, subject to change. But useful to know whether it is 10-15 or
> 100-500.  type of deployment  ie: vm/docker/kubernetes/desktop/server...

If the UUID generation worked how I proposed in my other reply (on a
voluntary basis, with the ability for UUIDs to carry metadata about what
the information associated with them may be used for), one could also
have a metadata field indicating what /kind/ of user the UUID is
associated with.

Then people simply installing things for testing, and reporting results
from their test rig could have a "tester" flag associated with a UUID
used only for testing, and then we can exclude that data from the main
reports, while still using it as evidence that a thing may work for
some audience.

The submission rate for UUIDs with the "tester" flag could be allowed
to be higher, because that data no longer contributes to the overall
statistics.
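
As a rough sketch of the "tester" flag idea, assuming submissions arrive as
(uuid, package list) pairs and that per-UUID flags are stored in a plain
dict. The function name split_counts and the data shapes are hypothetical:

```python
from collections import Counter


def split_counts(submissions, uuid_flags):
    """Count packages separately for regular and "tester" UUIDs.

    `submissions` is an iterable of (uuid, packages) pairs and `uuid_flags`
    maps a UUID to a set of flag strings. Returns (main, testers) Counters.
    """
    main, testers = Counter(), Counter()
    for uuid, packages in submissions:
        flags = uuid_flags.get(uuid, set())
        target = testers if "tester" in flags else main
        # Count each package once per submission.
        target.update(set(packages))
    return main, testers
```

The tester counts stay available as evidence that something works for some
audience, without feeding into the main report.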





Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation

2020-04-26 Thread Kent Fredric
On Sun, 26 Apr 2020 14:38:54 +0200
Thomas Deutschmann  wrote:

> Let's assume we will get reports that app-misc/foo is only installed 20
> times. If you are going to judge based on this data, "Obviously, nobody
> is using that package, it's stuck on ... safe to remove" your
> view is biased:

I see this as more like what bloom filters get you, but in reverse:

- You still have to factor for "what you don't know"

- But now, instead of having "we don't know if anybody uses this", you
  *can* have a "we know for sure somebody uses this".

The anonymization and uncorrelatable aspects are of course very useful
for encouraging people who would otherwise be averse to participating,
but it's certainly not a sure thing.

It would certainly be an improvement over what currently happens: "No
reverse dependencies, thus, nobody is using it".

Bad things will still happen, but the absence of this tool won't stop
the bad things happening, because presently, the existence of users is
entirely conjecture.







Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation

2020-04-26 Thread Kent Fredric
On Sun, 26 Apr 2020 10:08:32 +0200
Michał Górny  wrote:

> A proper solution to cluster problem would probably involve some way to
> internally collect and combine data before submission.  If you have
> large clusters of similar systems, I think you'd want to have all
> packages used on different systems reported as one entry.

For this, I'd suggest the ability to have an overrideable
"STATS_SERVER" (or something) ENV var URI that tells the submission
clients where to send their reports to.

Then ship a server in Gentoo that people can deploy, which submits the
aggregated data as a cron job, or lets them hand-review the aggregated
data before submission, possibly with tools to whittle out data you
don't want to share at the org level.
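
A minimal sketch of both halves, assuming the client reads STATS_SERVER from
the environment and the relay merges per-node package lists into a single
entry. The default URL is a placeholder and the function names are
hypothetical:

```python
import os

# Placeholder only; not a real Gentoo endpoint.
DEFAULT_STATS_SERVER = "https://stats.example.org/submit"


def stats_server_url(environ=os.environ):
    """Return the submission URL, honouring a STATS_SERVER override."""
    return environ.get("STATS_SERVER") or DEFAULT_STATS_SERVER


def aggregate_reports(node_reports, withheld=frozenset()):
    """Merge per-node package lists into one org-level list.

    `node_reports` maps a node name to its package list. Node names are
    dropped and anything in `withheld` is removed before submission upstream.
    """
    packages = set()
    for node_packages in node_reports.values():
        packages.update(node_packages)
    return sorted(packages - set(withheld))
```

Nodes inside the organisation would point STATS_SERVER at the relay, and only
the output of aggregate_reports would ever leave the org.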

Such a tool is potentially useful to an organisation even without its
"submit to gentoo" capacity, as being able to internally analyse what
your organisation is using seems to be useful.

(eg: provide an admin a single point of information showing what
packages they need to audit, if all the nodes in the org are not
entirely controlled at the top level)

Though I think the overall design of anonymity by design is useful, I
can see use cases, especially in the organisation model, where being
able to voluntarily self-identify a node could be useful without
inherently being a privacy concern.

And you'd configure your relay to suppress these node identities in the
submitted data, or map them to a different org-wide identity. 

Example:
  I need to find somebody who is using  so I can ask them if 
  works, or if  is important to this package.

Example:
  Data indicates somebody within my org is using , and I need to ask
  them not to use , as its licensing terms are not compatible with
  our org.

Though for cases of voluntary identification, you'd need an interface
on the server node somewhere that allows you to generate unique ident
tokens, and associate data with them, possibly with a list of flags
dictating what records associated with this identity may be used for
(eg: Contact [y/n] )




Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation

2020-04-26 Thread Thomas Deutschmann
On 2020-04-26 10:08, Michał Górny wrote:
> What do you think?  Do you foresee other problems?  Do you have
> other needs?  Can you think of better solutions?

While I would really like to have data, I think it's impossible to get
correct data and therefore we shouldn't collect any data at all because
the invalid data we would collect would be misused/misinterpreted.

Let's start with your first example already,

> the primary goal of the project would be to gather statistics on
> popularity of packages, in order to help us prioritize our attention
> and make decisions on what to keep and what to remove

Let's assume we will get reports that app-misc/foo is only installed 20
times. If you are going to judge based on this data, "Obviously, nobody
is using that package, it's stuck on ... safe to remove" your
view is biased:

Because reporting will never be mandatory, we don't know if app-misc/foo
is just unlucky because most of its users haven't opted in to reporting
(you can assume something like this for people with tor-related
programs, for example).

Now think about large installations which are probably not allowed to
"phone home", using their private local mirror and are even using build
hosts. I am aware of *multiple* large Gentoo deployments -- for servers.
You will never get data from these installations. Instead, stats will be
drowned by several home users which are more likely to submit data.
Not to mention the new containerized world...

It's the problem you all should know from Mozilla, Google, Microsoft
*duck*: They all do 'data-driven development'. The problem: *We* are
power users. We are using several features most normal users don't even
know. However, most of us are also aware of privacy and are disabling
stats. The result: These companies are killing popular power user
features just because their data indicates that nobody is using that
feature.

Please don't create pressure on users to opt-in to gentoostats to
prevent something like this for Gentoo.

My point is: I'll strongly object to *any* decision based on this
project because the data will be *always wrong*. Therefore the data is
useless and I wouldn't even consider collecting it in the first place.
Where there is a trough the pigs gather... and at one point people will
start to ignore that the data is useless just to underline *their* point
in their current situation. :/


-- 
Regards,
Thomas Deutschmann / Gentoo Linux Developer
C4DD 695F A713 8F24 2AA1 5638 5849 7EE5 1D5D 74A5





Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation

2020-04-26 Thread Toralf Förster
On 4/26/20 12:25 PM, Michał Górny wrote:
> On Sun, 2020-04-26 at 12:15 +0200, Toralf Förster wrote:
>> On 4/26/20 10:52 AM, Michał Górny wrote:
>>> Do you have any other idea for spam protection then?
>>
>> IMO there're 2 types of spam:
>>
>> 1. made by accident (eg. "* * * * *" instead "@weekly" in crontab)
>> 2. made intentionlly
>>
>> The 1st can be handled by UUID - just drop any old related dataset from 
>> inbox when a new one arrived
>> For the 2nd: what about accepting only datasets from "valid" UUIDs, meaning 
>> where just 1 dataset/week/IPv4 (maybe /16 block) in the mean did arrived in 
>> the last few weeks/months ?
>>
> 
> I'm sorry but could you rephrase that in more sentences?  I don't
> understand what you mean.
> 

Well, inspired by what Tor people do with Tor bridge stats:

- Create a UUID (never published, known only to the client and to the Gentoo
stats server).
- Calculate a hash of it. The hash is allowed to be published. The hash may be
associated with contact information. The contact data may or may not be
published. The hash is used for contacting people in case of questions.

The stats sent by the client contain the UUID.
Stats are sent to a stats server, into an area where they live for a while
(days).
When a new stats file arrives, the stats server deletes all older stats files
of that UUID in the stats area.

Stats are trusted if they meet the conditions already mentioned by Brian Dolbec.

IMO we should not care about detecting spam, just try to detect valid UUIDs.
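
A small sketch of the scheme, assuming a random version-4 UUID, SHA-256 as
the published hash, and an in-memory inbox. The hash function and the
StatsInbox class are illustrative choices, not something fixed by the
proposal:

```python
import hashlib
import uuid


def new_client_uuid():
    """Generate the private UUID shared only by client and stats server."""
    return str(uuid.uuid4())


def public_hash(client_uuid):
    """Derive the publishable identifier from the private UUID."""
    return hashlib.sha256(client_uuid.encode("ascii")).hexdigest()


class StatsInbox:
    """Holding area that keeps only the newest stats file per UUID."""

    def __init__(self):
        self._latest = {}

    def submit(self, client_uuid, stats):
        # A new submission replaces any older one from the same UUID.
        self._latest[client_uuid] = stats

    def __len__(self):
        return len(self._latest)
```

With this shape, an accidental "* * * * *" crontab entry only ever occupies
one slot in the inbox, no matter how often it fires.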

-- 
Toralf
PGP 23217DA7 9B888F45





Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation

2020-04-26 Thread Brian Dolbec
On Sun, 26 Apr 2020 11:32:06 +0200
Toralf Förster  wrote:

> On 4/26/20 11:09 AM, Ulrich Mueller wrote:
> > Instead of using the IP address, you could generate a UUID when
> > installing the tool.   
> 
> like the pfl tool did ?
> 

Like the last gentoostats gsoc project did.

As for enterprise/school/multiple clone deployments:  those are
generated by one person/team, then deployed.  We would need that
person/team to enable gentoostats only on their test system and disable
it for deployments. Repeated failure to do that could result in that UUID
being blacklisted.   Part of the initial profile details for that
vm/image would be some details about the approximate number of deployments
(yes, subject to change, but useful to know whether it is 10-15 or
100-500) and the type of deployment, e.g. vm/docker/kubernetes/desktop/server...



Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation

2020-04-26 Thread Michał Górny
On Sun, 2020-04-26 at 12:15 +0200, Toralf Förster wrote:
> On 4/26/20 10:52 AM, Michał Górny wrote:
> > Do you have any other idea for spam protection then?
> 
> IMO there're 2 types of spam:
> 
> 1. made by accident (eg. "* * * * *" instead "@weekly" in crontab)
> 2. made intentionlly
> 
> The 1st can be handled by UUID - just drop any old related dataset from inbox 
> when a new one arrived
> For the 2nd: what about accepting only datasets from "valid" UUIDs, meaning 
> where just 1 dataset/week/IPv4 (maybe /16 block) in the mean did arrived in 
> the last few weeks/months ?
> 

I'm sorry but could you rephrase that in more sentences?  I don't
understand what you mean.

-- 
Best regards,
Michał Górny





Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation

2020-04-26 Thread Toralf Förster
On 4/26/20 10:52 AM, Michał Górny wrote:
> Do you have any other idea for spam protection then?

IMO there are 2 types of spam:

1. made by accident (e.g. "* * * * *" instead of "@weekly" in crontab)
2. made intentionally

The 1st can be handled by UUID: just drop any older related dataset from the
inbox when a new one arrives.
For the 2nd: what about accepting only datasets from "valid" UUIDs, meaning
those for which on average just 1 dataset/week/IPv4 (maybe /16 block) arrived
in the last few weeks/months?

Well, other than that, maybe the spamassassin or Tor people have more theory and
generic approaches?
:-)

-- 
Toralf
PGP 23217DA7 9B888F45





Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation

2020-04-26 Thread Michał Górny
On Sun, 2020-04-26 at 11:09 +0200, Ulrich Mueller wrote:
> > > > > > On Sun, 26 Apr 2020, Michał Górny wrote:
> > The other major problem is spam protection.  The best semi-anonymous way
> > I see is to use submitter's IPv4 addresses (can we support IPv6 then?). 
> > We could set a limit of, say, 10 submissions per IPv4 address per week. 
> > If some address would exceed that limit, we could require CAPTCHA
> > authorization.
> 
> Instead of using the IP address, you could generate a UUID when
> installing the tool. This would also take care of clusters with machines
> that are clones of each other.
> 

That wouldn't help with abuse at all.

-- 
Best regards,
Michał Górny





Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation

2020-04-26 Thread Toralf Förster
On 4/26/20 11:09 AM, Ulrich Mueller wrote:
> Instead of using the IP address, you could generate a UUID when
> installing the tool. 

Like the pfl tool did?

-- 
Toralf
PGP 23217DA7 9B888F45





Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation

2020-04-26 Thread Ulrich Mueller
> On Sun, 26 Apr 2020, Michał Górny wrote:

> The other major problem is spam protection.  The best semi-anonymous way
> I see is to use submitter's IPv4 addresses (can we support IPv6 then?). 
> We could set a limit of, say, 10 submissions per IPv4 address per week. 
> If some address would exceed that limit, we could require CAPTCHA
> authorization.

Instead of using the IP address, you could generate a UUID when
installing the tool. This would also take care of clusters with machines
that are clones of each other.

Ulrich




Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation

2020-04-26 Thread Michał Górny
On Sun, 2020-04-26 at 10:43 +0200, Toralf Förster wrote:
> On 4/26/20 10:08 AM, Michał Górny wrote:
> > .  This
> > involves accepting a privacy policy and setting up a cronjob.  The tool
> > would suggest a (random?) time for submission to take place periodically
> > (say, every week).
> 
> Well, something like "@weekly" should be preferred over eg "42 23 * * *" b/c 
> the later might be too late for desktop users.
> 
> 
> > We could set a limit of, say, 10 submissions per IPv4 address per week.
> 
> If the output do not differ (too much) then the limit isn't needed, or?

Do you have any other idea for spam protection then?

-- 
Best regards,
Michał Górny





Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation

2020-04-26 Thread Toralf Förster
On 4/26/20 10:08 AM, Michał Górny wrote:
> .  This
> involves accepting a privacy policy and setting up a cronjob.  The tool
> would suggest a (random?) time for submission to take place periodically
> (say, every week).

Well, something like "@weekly" should be preferred over e.g. "42 23 * * *" b/c
the latter might be too late for desktop users.


> We could set a limit of, say, 10 submissions per IPv4 address per week.

If the output does not differ (too much), then the limit isn't needed, is it?

-- 
Toralf
PGP 23217DA7 9B888F45





Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation

2020-04-26 Thread Alarig Le Lay
Hi,

On Sun 26 Apr 2020 10:08:32 GMT, Michał Górny wrote:
> The other major problem is spam protection.  The best semi-anonymous way
> I see is to use submitter's IPv4 addresses (can we support IPv6 then?). 
> We could set a limit of, say, 10 submissions per IPv4 address per week. 
> If some address would exceed that limit, we could require CAPTCHA
> authorization.
> 
> I think this would make spamming a bit harder while keeping submissions
> easy for the most, and a little harder but possible for those of us
> behind ISP NATs.

I think that IPv6 support shouldn’t be a question. I have several
points in its favour:

1. All of the Gentoo infrastructure is IPv6-capable (at least the
   public-facing part, as far as I’m aware), so an IPv4-only service
   would be a special case.
2. As you mention ISP NATs, most of those ISPs provide IPv6 as well
   (because NAT isn’t cost-free). Also, applying the IPv4 rate limit to
   each IPv6 /64 will reduce the need for a CAPTCHA.
3. Users don’t necessarily have IPv4 access.
4. About a third of Internet traffic is IPv6, so supporting it is not
   optional in my humble opinion.
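
A sketch of how the same limit could cover both address families, assuming
one bucket per IPv4 address and one per IPv6 /64, with the limit of 10 per
week taken from the original proposal. The names rate_limit_key and
WeeklyLimiter are hypothetical, and resetting the counters each week is left
out:

```python
import ipaddress
from collections import Counter

WEEKLY_LIMIT = 10  # submissions per bucket per week, as proposed


def rate_limit_key(address):
    """Map a client address to its rate-limit bucket.

    IPv4 addresses are their own bucket; IPv6 addresses share a bucket
    with every other address in the same /64.
    """
    ip = ipaddress.ip_address(address)
    if ip.version == 4:
        return str(ip)
    return str(ipaddress.ip_network(f"{ip}/64", strict=False))


class WeeklyLimiter:
    """Counts submissions per bucket; reset it once per week."""

    def __init__(self, limit=WEEKLY_LIMIT):
        self.limit = limit
        self.counts = Counter()

    def allow(self, address):
        key = rate_limit_key(address)
        self.counts[key] += 1
        return self.counts[key] <= self.limit
```

Submissions for which allow() returns False would be the ones sent to the
CAPTCHA step.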

Regards,
-- 
Alarig



[gentoo-dev] [RFC] Ideas for gentoostats implementation

2020-04-26 Thread Michał Górny
Hi,

The topic of rebooting gentoostats comes here from time to time.  Unless
I'm mistaken, all the efforts so far were superficial, lacking a clear
plan and unwilling to research the problems.  I'd like to start
a serious discussion focused on the issues we need to solve, and propose
some ideas how we could solve them.

I can't promise I'll find time to implement it.  However, I'd like to
get a clear plan on how it should be done if someone actually does it.


The big questions
=================
The way I see it, the primary goal of the project would be to gather
statistics on popularity of packages, in order to help us prioritize our
attention and make decisions on what to keep and what to remove.  Unlike
Debian's popcon, I don't think we really want to try to investigate
which files are actually used but focus on what's installed.

There are a few important questions that need to be answered first:

1. Which data do we need to collect?

   a. list of installed packages?
   b. versions (or slots?) of installed packages?
   c. USE flags on installed packages?
   d. world and world_sets files
   e. system profile?
   f. enabled repositories? (possibly filtered to official list)
   g. distribution?

I think d. is most important as it gives us information on what users
really want.  a. alone is kinda redundant if we have d.  c. might have
some value when deciding whether to mask a particular flag (and implies
a.).

e. would be valuable if we wanted to determine the future of particular
profiles, as well as e.g. estimate the transition to new versions.

f. would be valuable to determine which repositories are used but we
need to filter private repos from the output for privacy reasons.

g. could be valuable in correlation with other data but not sure if
there's much direct value alone.


2. How to handle Gentoo derivatives?  Some of them could provide
meaningful data but some could provide false data (e.g. when derivatives
override Gentoo packages).  One possible option would be to filter a.-e. 
to stuff coming from ::gentoo.


3. How to keep the data up-to-date?  After all, if we just stack a lot
of old data, we will soon stop getting meaningful results.  I suppose
we'll need to timestamp all data and remove old entries.


4. How to avoid duplication?  If some users submit their results more
often than others, they would bias the results.  3. might be related.


5. How to handle clusters?  Things are simple if we can assume that
people will submit data for a few distinct systems.  But what about
companies that run 50 Gentoo machines with the same or similar setup? 
What about clusters of 1000 almost identical containers?  Big entities
could easily bias the results but we should also make it possible for
them to participate somehow.


6. Security.  We don't want to expose information that could be
correlated to specific systems, as it could disclose their
vulnerabilities.


7. Privacy.  Besides the above, our sysadmins would appreciate if
the data they submitted couldn't be easily correlated to them.  If we
don't respect privacy of our users, we won't get them to submit data.


8. Spam protection.  Finally, the service needs to be resilient to being
spammed with fake data.  Both to users who want to make their packages
look more important, and to script kiddies that want to prove a point.


My (partial) implementation idea
================================
I think our approach should be oriented on privacy/security first,
and attempt to make the best of the data we can get while respecting
this principle.  This means no correlation and no tracking.

Once the tool is installed, the user needs to opt-in to using it.  This
involves accepting a privacy policy and setting up a cronjob.  The tool
would suggest a (random?) time for submission to take place periodically
(say, every week).
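
A minimal sketch of how the tool might suggest a random weekly slot,
expressed as a standard five-field crontab schedule. The function name
suggest_cron_schedule is hypothetical:

```python
import random


def suggest_cron_schedule(rng=random):
    """Return a random weekly schedule as 'minute hour * * weekday'."""
    minute = rng.randrange(60)   # 0-59
    hour = rng.randrange(24)     # 0-23
    weekday = rng.randrange(7)   # 0-6, where 0 is Sunday
    return f"{minute} {hour} * * {weekday}"
```

Spreading users across random slots would also keep the server from
receiving every submission at the same moment.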

The submission would contain only raw data, without any identification
information.  It would be encrypted using our public key.  Once
uploaded, it would be put into our input queue as-is.

Periodically the input queue would be processed in bulk.  The individual
statistics would be updated and the input would be discarded.  This
should prevent people trying to correlate changes in statistics with
individual uploads.

Each counted item would have a timestamp associated, and we'd discard
old items per resubmission period.  This should ensure that we keep
fresh data and people can update their earlier submissions without
storing identification data.

For example, N users submit their data containing a list of packages
every week.  This data is used in bulk to update counts of individual
packages (technically, to append timestamps to list corresponding to
these packages).  Data older than one week is discarded, so we have
rough counts of package use during the last week.

I think this addresses problems 3./6./7.
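
The counting scheme above can be sketched as follows, assuming timestamps in
seconds and an in-memory store. The class name PackageCounts and its methods
are hypothetical:

```python
from collections import defaultdict

WEEK = 7 * 24 * 3600  # resubmission period in seconds


class PackageCounts:
    """Per-package timestamp lists with no link back to a submitter."""

    def __init__(self, period=WEEK):
        self.period = period
        self.timestamps = defaultdict(list)

    def process_queue(self, submissions, now):
        """Process a batch of package lists, then forget the raw input."""
        for packages in submissions:
            for package in set(packages):
                self.timestamps[package].append(now)

    def expire(self, now):
        """Discard timestamps older than one resubmission period."""
        cutoff = now - self.period
        for package in list(self.timestamps):
            fresh = [t for t in self.timestamps[package] if t > cutoff]
            if fresh:
                self.timestamps[package] = fresh
            else:
                del self.timestamps[package]

    def count(self, package):
        return len(self.timestamps.get(package, ()))
```

Because only a timestamp is stored per counted item, a resubmission simply
adds fresh entries while the old ones age out on their own.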


The other major problem is spam protection.  The best semi-anonymous way
I see is to use submitter's IPv4 addresses (can we