Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
On Mon, May 04, 2020 at 11:57:03PM +0100, Andrey Utkin wrote:
> Since it is going to be opt-in and optional anyway, we seem to be fine with
> having just partial data.
>
> I assume we have logs of distfiles downloads from Gentoo infrastructure, and
> can negotiate access to relevant logs of our mirrors. That constitutes partial
> data correlated with users' installation activity, as good as it gets.

This assumption is wrong at the root.

> If we do have some such data, are we using it in any way for the discussed
> purposes?
>
> If we don't, but could get it, would we be able to use that data for these
> purposes? If no, why?
>
> If we can't get the data, why?

Simply put: Gentoo does not run the last-mile edge of distfile distribution.

$ dig @ns1.gentoo.org +noall +answer distfiles.gentoo.org IN A
distfiles.gentoo.org. 7200 IN A 64.50.233.100
distfiles.gentoo.org. 7200 IN A 140.211.166.134
distfiles.gentoo.org. 7200 IN A 64.50.236.52
$ echo 140.211.166.134 64.50.233.100 64.50.236.52 |fmt -1 |xargs -n1 dig +short -x
ftp-osl.osuosl.org.
ftp-nyc.osuosl.org.
ftp-chi.osuosl.org.

And historically also TDS & another provider. Plus all of the regional mirrors that don't even have .gentoo.org hostnames.

I would like to replace the legacy http://distfiles.gentoo.org/ functionality with a redirection service, at which point you could have partial data, but it answers a very different question than Goose.

--
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail : robb...@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
On 5/5/20 10:26 PM, Daniel Pielmeier wrote:
> Actually the maintainer decided to continue the project.
> The code is now hosted at Github [1].
> The site moved to a new server and the upload is working again.
>
> [1] https://github.com/portagefilelist
>
> --
> Best regards
> Daniel

Indeed - I'm reactivating the pfl logic in the tinderbox script.

--
Toralf
PGP 23217DA7 9B888F45
Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
On May 5, 2020 7:31:34 PM UTC, "Toralf Förster" wrote:
>On 4/26/20 10:08 AM, Michał Górny wrote:
>> I don't think we really want to try to investigate
>> which files are actually used but focus on what's installed.
>Hi,
>
>I do wonder if the http://www.portagefilelist.de/site/start (package
>app-portage/pfl) would be part of that or not?
>The maintainer of the pfl stopped the import of new data last year due
>to lack of time to maintain that project and is looking for a
>successor.

Actually the maintainer decided to continue the project.
The code is now hosted at GitHub [1].
The site moved to a new server and the upload is working again.

[1] https://github.com/portagefilelist

--
Best regards
Daniel
Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
On Tue, 5 May 2020 02:47:48 +0200 Thomas Deutschmann wrote:
> Yes it would be a signal but a useless signal, not?

"There are no users reported using this dist, so we can nuke it" is still far, far superior to "there are no reverse dependencies, so we can nuke it".

*Even* when the former is false information.

At present, the "no reverse dependencies, therefore nuke" approach essentially asserts there *are* no users to consider.

So the *worst* case scenario for decisions made with these statistics is our *current* case.

Even if *nobody* uses the service and *all* results indicate "nobody uses anything", then we'll just be reverting to what we currently do: removing things entirely on the conjecture that they're not useful.
Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
On 4/26/20 10:08 AM, Michał Górny wrote:
> I don't think we really want to try to investigate
> which files are actually used but focus on what's installed.

Hi,

I do wonder if the http://www.portagefilelist.de/site/start (package app-portage/pfl) would be part of that or not?
The maintainer of the pfl stopped the import of new data last year due to lack of time to maintain that project and is looking for a successor.

--
Toralf
PGP 23217DA7 9B888F45
Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
Hi Michał, and the rest of the Gentoo devs,

I've been patiently sitting and watching this discussion. I raised some ideas with another developer (not Michał) just days before he raised this thread to the ML. I believe all points raised so far are valid. I'll try to summarise:

1. This must be completely *opt in*.
2. Anonymity was discussed by various parties (privacy).
3. "Spam" protection (i.e., preventing bogus data from entering).
4. Trustworthiness of data.
5. Acceptance of some form of privacy policy.

In my opinion, points 2 and 3 work against each other: if registration is compulsory for submitting stats, then we can control the spam more easily (not foolproof), but requiring registration also raises the entry barrier. I'd be completely willing to provide at least an email address as part of a submission.

All of the replies seem to have focused purely on yes/no, do it or don't. Not many have addressed the benefits to end users/system administrators. The focus seems to be on what we as developers can get out of this.

Regarding the above points:

1. I fully agree. This should not be forced on anyone.
2. Happy to concede that some people may wish to submit anonymously. Let them.
3. I'll address this below.
4. A lot of the discussion has been around the usefulness of the data, and I concede to Thomas that this may (or may not) generate "decision blind spots" or as per "artificially increase decision certainty". I don't see how this is worse than what we've got now.
5. We have the infrastructure for this already by way of licenses. So we ship with "GPLv2/3/whatever + GentooPrivacy", and users have to first take explicit action to accept GentooPrivacy.
I have some other ideas around this, which will tread even further on privacy, but again, all of this should be a kind of opt-in. Building on the ideas by Kent, where he suggested a form of submission proxy (STATS_SERVER), we could potentially give the full benefit of the code to such entities, but then still allow them to submit "upstream" in a more filtered manner.

Bottom line, in my opinion: Any data is better than no data! Whilst we can't say "no one is using xyz", we will at least be able to say "hey, some people are using xyz", and whilst this may generate some blind spots, it at least enables us to test known use cases during test-builds. E.g., we know for a fact a thousand users are using package X with USE flags "-* a b c", so we should definitely run that as a compile test.

Your build breaks frequently? Would you mind submitting stats? Great, thank you. If you're not willing to do that, then my stance becomes one of "ok, I'll help where I can, but really, please consider helping us to help you: if you submit stats we can pre-emptively at least include build tests for your specific USE flags." And again, this means we can actually have our tooling use these stats to generate build tests for the "known popular" configs.

I point you to RHEL - why are people willing to pay for RHEL? What do they get for that buck? Because I promise you, the support I get from fellow Gentoo'ers FAR outweighs the support I have ever gotten from (paid for) RHEL. Most of the time.

I myself used to run 500+ Gentoo hosts more than 15 years back. It was fun. I was also a student back then so had much more time on my hands than I do now. It was challenging, and fun to try and get things to work exactly the way we envisioned they should. I promise you, if what Michał proposes had been available to me back then, firstly to keep track of my own internal assets, and to submit stats upstream to help improve Gentoo, I would not have hesitated for 10 seconds.
And there I touch on a point I'm trying to make - this should be something that not only helps devs, but brings benefit to users. I'll say more on this at the end of the email (possibly force users to run some of their own infra for this at least, but these stats form the framework for a multi-system management system too, potentially). First I'd like to pay more attention to the individual points raised by Michał. On 2020/04/26 10:08, Michał Górny wrote: > Hi, > > The topic of rebooting gentoostats comes here from time to time. Unless > I'm mistaken, all the efforts so far were superficial, lacking a clear > plan and unwilling to research the problems. I'd like to start > a serious discussion focused on the issues we need to solve, and propose > some ideas how we could solve them. > > I can't promise I'll find time to implement it. However, I'd like to > get a clear plan on how it should be done if someone actually does it. My time is also limited, but I would love to be involved in some way or another. > The big questions > = > The way I see it, the primary goal of the project would be to gather > statistics on popularity of packages, in order to help us prioritize our > attention and make decisions on what to keep and what to remove. Unlike > Debian's
Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
Hi all,

I find the idea of having data great, but agree that it can lead to a false sense of having a correct data basis. Therefore, two thoughts:

First, I'd like to propose that you introduce gentoostats as a *strictly timed experiment*, evaluate whether it actually changed anything in your decisions, and then either drop it or let it run permanently afterwards. I have no proper solution for the parameters though, maybe something like "I choose to keep X use flags based on g.s.", but this would ask every dev to log plenty of decisions manually (read: I don't think this will happen).

Second, I'm a bit frightened of Whissi's thought of dropping anything security related based on non-input via g.s. -- I'd like to ask you to use the information based on g.s. *not* for security related decisions, but rather for "harmless" ones like the one Matt mentioned: Should I really support feature X while literally every one of 200 users uses feature Y instead and I have no real testing ground for feature X (Matt, yell at me if I got you wrong!).

Kind regards,
Nils (holgersson on Freenode)
Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
On Tue, 2020-05-05 at 02:47 +0200, Thomas Deutschmann wrote:
> Yes it would be a signal but a useless signal, not?
>

You seem to aim for arbitrarily blocking developers from making decisions by preventing them from having data. This won't work. Firstly, because *we have* to make decisions, and the worse the data we have, the more arbitrary the decisions will be. Secondly, because we will always have some data; it will just probably be worse than what's being proposed here.

Generally, having more data means making better informed decisions. Of course, there's always the potential of having too much data (though I honestly don't think we're anywhere near that). There's also the potential of being lazy and just taking the easiest available data. There's no way around that, but then, you can also be lazy and make decisions ignoring any data.

For example, one kind of data we have right now is bugs. So a package fails for me in an obvious way, yet there's no bug open. Does that mean that the package has zero users? Otherwise someone would have reported the problem, right? So here go last rites.

Gentoostats could tell me 'hey, this package has a bunch of users still'. This questions my first assessment -- 'oh, they probably haven't had to rebuild it since ...'

If we have no data, we have to rely on 'gut feelings'. I have a gut feeling that this package looks useless, why bother. Is that more worthwhile than having *some* number to look at? Even if the data is biased towards a specific kind of users, it would probably work better than guessing. And if it looks unreasonable, nobody stops you from guessing. I guess that an informed guess is better than a random guess.

--
Best regards,
Michał Górny
Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
On Mon, May 4, 2020 at 10:14 PM Matt Turner wrote: > On Mon, May 4, 2020 at 5:48 PM Thomas Deutschmann > wrote: > > > > On 2020-04-26 15:46, Kent Fredric wrote: > > > On Sun, 26 Apr 2020 14:38:54 +0200 > > > Thomas Deutschmann wrote: > > > > > >> Let's assume we will get reports that app-misc/foo is only installed > 20 > > >> times. If you are going to judge based on this data, "Obviously, > nobody > > >> is using that package, it's stuck on ... safe to remove" > your > > >> view is biased: > > > > > > I see this as more like what bloom filters get you, but in reverse: > > > > > > [...] > > > > > > - But now, instead of having "we don't know if anybody uses this", you > > > *can* have a "we know for sure somebody uses this". > > > > But how does that information really help us to decide anything in the > end? > > > > Case A, stats are showing 0 users: > > > > Like said, we can't know if this is true or if this package is only used > > in setups where people don't report stats. > > > > > > Case B, stats are showing x users: > > > > Now what? Package from case A could have similar users -- we just don't > > know. Assume firefox has 1.000 users, chromium has 500 users and vivaldi > > doesn't show up in stats. How does that help us? Would this allow us to > > skip publishing GLSAs for vivalid because we assume nobody in Gentoo is > > using vivaldi? Does it allow Python project to go forward pushing a mask > > for removal in case vivaldi would depend on Python version, Python > > project want to get rid of? Would this allow Gentoo PR to make a public > > statement like "Firefox is the most popular browser in Gentoo, twice as > > users as chromium"? > > I hate the saying "the perfect is the enemy of the good" but I think > it applies here. > > You're of course correct that we would not have perfect information. > But the thing about statistics is that you can still know some things > based on a sampling of that perfect information. 
> > I would personally like to have data on whether users of my packages > have certain USE flags enabled. Knowing that would allow me to decide > whether its worth the maintenance burden of supporting features that I > *think* are very rarely used. If instead the data showed me that 50% > of users had IUSE=xyz enabled, I probably wouldn't consider removing > it. > > I think your example of potential misuse of data is a bit over dramatic. > Let me present the same point another way. Today we have no data, so we make an arbitrary decision. It might be right or wrong; and we may not know until after we decide. This is traditionally things like "break them and they will come" type of process. "Mask it, if they complain, I'll unmask it." In the future, we could have this package data. It may influence decision making. However I'm not sure from a decision-making standpoint that it is strictly worse than no data. The danger (which is what I think Whissi's concern is) is that it could artificially increase decision certainty. For example, if I have to decide whether to keep a package, or a flag, or whatever. I might make an arbitrary decision. I'm aware it's arbitrary, it might be wrong, and so I'm not super attached to such a decision. I'm not *certain* about it; but I have to decide one way or the other[0]. Then I move to a world with package data. Now I'm no longer making an arbitrary decision; I'm making a decision based on *data*. The *data* tells me my decision is correct, resulting in a more *certain* decision outcome. I think this is the fallacy we want to avoid. The data can be informative but there are significant biases in it that should result in very *little* certainty added to decision making. Making decisions based on incomplete data is just life though, so I'm fairly skeptical of a "we shouldn't collect any data" type of mindset. I'd be curious to see if we can instill a *culture* component around the use of data in our development workflows. 
-A [0] There are a bunch of other cultural components here, like different decision types (1 vs 2) and the ability to make a mistake in public and not feel bad about it; so I'm aware reality does not reflect this trivial example. But those are hallmarks of cultural markets I'd like to aim for in Gentoo, so I would prefer to discuss a world where they exist ;)
Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
On Mon, May 4, 2020 at 5:48 PM Thomas Deutschmann wrote:
>
> On 2020-04-26 15:46, Kent Fredric wrote:
> > On Sun, 26 Apr 2020 14:38:54 +0200
> > Thomas Deutschmann wrote:
> >
> >> Let's assume we will get reports that app-misc/foo is only installed 20
> >> times. If you are going to judge based on this data, "Obviously, nobody
> >> is using that package, it's stuck on ... safe to remove" your
> >> view is biased:
> >
> > I see this as more like what bloom filters get you, but in reverse:
> >
> > [...]
> >
> > - But now, instead of having "we don't know if anybody uses this", you
> > *can* have a "we know for sure somebody uses this".
>
> But how does that information really help us to decide anything in the end?
>
> Case A, stats are showing 0 users:
>
> Like said, we can't know if this is true or if this package is only used
> in setups where people don't report stats.
>
>
> Case B, stats are showing x users:
>
> Now what? Package from case A could have similar users -- we just don't
> know. Assume firefox has 1.000 users, chromium has 500 users and vivaldi
> doesn't show up in stats. How does that help us? Would this allow us to
> skip publishing GLSAs for vivaldi because we assume nobody in Gentoo is
> using vivaldi? Does it allow Python project to go forward pushing a mask
> for removal in case vivaldi would depend on Python version, Python
> project want to get rid of? Would this allow Gentoo PR to make a public
> statement like "Firefox is the most popular browser in Gentoo, twice as
> users as chromium"?

I hate the saying "the perfect is the enemy of the good" but I think it applies here.

You're of course correct that we would not have perfect information. But the thing about statistics is that you can still know some things based on a sampling of that perfect information.

I would personally like to have data on whether users of my packages have certain USE flags enabled. Knowing that would allow me to decide whether it's worth the maintenance burden of supporting features that I *think* are very rarely used. If instead the data showed me that 50% of users had IUSE=xyz enabled, I probably wouldn't consider removing it.

I think your example of potential misuse of data is a bit overdramatic.
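The USE flag question above boils down to a simple share computation. The following is a minimal sketch only, assuming each report is the set of USE flags one submitter enabled for a package; the function name `use_flag_share` and the sample data are hypothetical and not part of any gentoostats design discussed in this thread.

```python
from typing import Iterable, Set


def use_flag_share(reports: Iterable[Set[str]], flag: str) -> float:
    """Return the fraction of reports that have `flag` enabled.

    Each report is the set of USE flags one submitter enabled for a package.
    Returns 0.0 when there are no reports at all.
    """
    reports = list(reports)
    if not reports:
        return 0.0
    enabled = sum(1 for flags in reports if flag in flags)
    return enabled / len(reports)


# Hypothetical sample: four submitters reporting flags for one package.
sample = [{"xyz", "ssl"}, {"ssl"}, {"xyz"}, set()]
share = use_flag_share(sample, "xyz")
```

The result only describes the submitters. It says nothing about users who never report, which is the bias discussed elsewhere in the thread.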
Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
On 2020-04-26 15:46, Kent Fredric wrote:
> On Sun, 26 Apr 2020 14:38:54 +0200
> Thomas Deutschmann wrote:
>
>> Let's assume we will get reports that app-misc/foo is only installed 20
>> times. If you are going to judge based on this data, "Obviously, nobody
>> is using that package, it's stuck on ... safe to remove" your
>> view is biased:
>
> I see this as more like what bloom filters get you, but in reverse:
>
> [...]
>
> - But now, instead of having "we don't know if anybody uses this", you
> *can* have a "we know for sure somebody uses this".

But how does that information really help us to decide anything in the end?

Case A, stats are showing 0 users:

As said, we can't know if this is true or if this package is only used in setups where people don't report stats.

Case B, stats are showing x users:

Now what? The package from case A could have a similar number of users -- we just don't know. Assume firefox has 1.000 users, chromium has 500 users and vivaldi doesn't show up in stats. How does that help us? Would this allow us to skip publishing GLSAs for vivaldi because we assume nobody in Gentoo is using vivaldi? Does it allow the Python project to go forward pushing a mask for removal in case vivaldi depends on a Python version the Python project wants to get rid of? Would this allow Gentoo PR to make a public statement like "Firefox is the most popular browser in Gentoo, with twice as many users as chromium"?

Yes it would be a signal but a useless signal, not?

--
Regards,
Thomas Deutschmann / Gentoo Linux Developer
fpr: C4DD 695F A713 8F24 2AA1 5638 5849 7EE5 1D5D 74A5
Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
On 2020-05-05 00:57, Andrey Utkin wrote:
> I assume we have logs of distfiles downloads from Gentoo infrastructure, and
> can negotiate access to relevant logs of our mirrors. That constitutes partial
> data correlated with users' installation activity, as good as it gets.

Even if we had data for distfiles.gentoo.org, this wouldn't help us. See how Gentoo works: If you follow the handbook you will pick a local/regional mirror. Now all these users are suddenly 'disconnected' from the download stats...

--
Regards,
Thomas Deutschmann / Gentoo Linux Developer
fpr: C4DD 695F A713 8F24 2AA1 5638 5849 7EE5 1D5D 74A5
Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
Since it is going to be opt-in and optional anyway, we seem to be fine with having just partial data.

I assume we have logs of distfiles downloads from Gentoo infrastructure, and can negotiate access to relevant logs of our mirrors. That constitutes partial data correlated with users' installation activity, as good as it gets.

If we do have some such data, are we using it in any way for the discussed purposes?

If we don't, but could get it, would we be able to use that data for these purposes? If no, why?

If we can't get the data, why?

As an aside, I think the best known way to ensure the availability of important things, from a user's perspective, is to pay for these important things. Of course I see how this won't fit culturally very well here and that we're not going to switch to a commercial model just for this reason.
Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
On Sunday, 26 April 2020, 12:09:59 EEST, Ulrich Mueller wrote:
> > On Sun, 26 Apr 2020, Michał Górny wrote:
> > The other major problem is spam protection. The best semi-anonymous way
> > I see is to use submitter's IPv4 addresses (can we support IPv6 then?).
> > We could set a limit of, say, 10 submissions per IPv4 address per week.
> > If some address would exceed that limit, we could require CAPTCHA
> > authorization.
>
> Instead of using the IP address, you could generate a UUID when
> installing the tool. This would also take care of clusters with machines
> that are clones of each other.
>

TBH, for clusters I would insert a sentence like "If you are administering a cluster of many identical Gentoo machines, please see $WIKIPAGE before enabling submission" and then have a few more instructions there (like how to enable only for one machine, and additionally provide us with the cluster size).

I guess in this case we can add this further step, since whoever is doing that will be both invested in Gentoo and able to read docs.

--
Andreas K. Hüttel
dilfri...@gentoo.org
Gentoo Linux developer
(council, qa, toolchain, base-system, perl, libreoffice)
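The "10 submissions per week" limit and the install-time UUID from the quoted messages can be combined in a small sliding-window limiter. This is a sketch under stated assumptions: the class name `SubmissionLimiter`, the in-memory storage, and keying on a UUID string rather than an IPv4 address are all illustrative choices, not an agreed design.

```python
import time
import uuid
from collections import defaultdict, deque

WEEK_SECONDS = 7 * 24 * 3600


class SubmissionLimiter:
    """Sliding-window limiter: at most `limit` submissions per key per window."""

    def __init__(self, limit=10, window=WEEK_SECONDS):
        self.limit = limit
        self.window = window
        self._seen = defaultdict(deque)  # key -> timestamps of accepted submissions

    def allow(self, key, now=None):
        """Return True and record the submission if `key` is under its limit."""
        now = time.time() if now is None else now
        stamps = self._seen[key]
        # Drop timestamps that have fallen out of the window.
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()
        if len(stamps) >= self.limit:
            return False  # a caller could fall back to a CAPTCHA here
        stamps.append(now)
        return True


# A client would generate this once at install time and store it (assumption).
client_id = str(uuid.uuid4())
limiter = SubmissionLimiter(limit=10)
results = [limiter.allow(client_id, now=1000.0 + i) for i in range(11)]
```

In this sketch the eleventh submission inside the same week is refused, and the same key is accepted again once the earlier timestamps have aged out of the window.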
Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
Hi everyone,

gentoostats is a novelty for me and I'm not aware of previous discussions or implementations. But from what I could understand from the comments and Michał Górny's explanation, I would like to draw your attention to the octoverse[1] initiative.

Maybe the collected statistics could come from a platform that gathers the additional metadata for the stats from user contributions. What I mean is a way to have a broker that collects all statistics from an organization internally and then publishes them in the end.

Such a solution would add value for enterprise statistics and also contribute to Gentoo in the end. Each broker could use git authentication to publish the results with a merge request that would run the necessary hooks on the Gentoo side. We would only need a document specification for data parsing.

Sorry if my comment is completely out of context, but such an octoverse for Gentoo would be very interesting from my perspective.

Best,
Samuel

[1] https://octoverse.github.com/

On 4/26/20 9:08 AM, Michał Górny wrote:
> Hi,
>
> The topic of rebooting gentoostats comes here from time to time. Unless
> I'm mistaken, all the efforts so far were superficial, lacking a clear
> plan and unwilling to research the problems. I'd like to start
> a serious discussion focused on the issues we need to solve, and propose
> some ideas how we could solve them.
>
> I can't promise I'll find time to implement it. However, I'd like to
> get a clear plan on how it should be done if someone actually does it.
>
>
> The big questions
> =================
> The way I see it, the primary goal of the project would be to gather
> statistics on popularity of packages, in order to help us prioritize our
> attention and make decisions on what to keep and what to remove. Unlike
> Debian's popcon, I don't think we really want to try to investigate
> which files are actually used but focus on what's installed.
>
> There are a few important questions that need to be answered first:
>
> 1. Which data do we need to collect?
>
>    a. list of installed packages?
>    b. versions (or slots?) of installed packages?
>    c. USE flags on installed packages?
>    d. world and world_sets files
>    e. system profile?
>    f. enabled repositories? (possibly filtered to official list)
>    g. distribution?
>
> I think d. is most important as it gives us information on what users
> really want. a. alone is kinda redundant if we have d. c. might have
> some value when deciding whether to mask a particular flag (and implies
> a.).
>
> e. would be valuable if we wanted to determine the future of particular
> profiles, as well as e.g. estimate the transition to new versions.
>
> f. would be valuable to determine which repositories are used but we
> need to filter private repos from the output for privacy reasons.
>
> g. could be valuable in correlation with other data but not sure if
> there's much direct value alone.
>
>
> 2. How to handle Gentoo derivatives? Some of them could provide
> meaningful data but some could provide false data (e.g. when derivatives
> override Gentoo packages). One possible option would be to filter a.-e.
> to stuff coming from ::gentoo.
>
>
> 3. How to keep the data up-to-date? After all, if we just stack a lot
> of old data, we will soon stop getting meaningful results. I suppose
> we'll need to timestamp all data and remove old entries.
>
>
> 4. How to avoid duplication? If some users submit their results more
> often than others, they would bias the results. 3. might be related.
>
>
> 5. How to handle clusters? Things are simple if we can assume that
> people will submit data for a few distinct systems. But what about
> companies that run 50 Gentoo machines with the same or similar setup?
> What about clusters of 1000 almost identical containers? Big entities
> could easily bias the results but we should also make it possible for
> them to participate somehow.
>
>
> 6. Security. We don't want to expose information that could be
> correlated to specific systems, as it could disclose their
> vulnerabilities.
>
>
> 7. Privacy. Besides the above, our sysadmins would appreciate if
> the data they submitted couldn't be easily correlated to them. If we
> don't respect privacy of our users, we won't get them to submit data.
>
>
> 8. Spam protection. Finally, the service needs to be resilient to being
> spammed with fake data. Both to users who want to make their packages
> look more important, and to script kiddies that want to prove a point.
>
>
> My (partial) implementation idea
>
> I think our approach should be oriented on privacy/security first,
> and attempt to make the best of the data we can get while respecting
> this principle. This means no correlation and no tracking.
>
> Once the tool is installed, the user needs to opt-in to using it. This
> involves accepting a privacy policy and setting up a cronjob. The tool
> would
Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
On Sun, 26 Apr 2020 10:52:27 +0200 Michał Górny wrote:
> Do you have any other idea for spam protection then?

What is the realistic risk here for spamming? If the record is well formed, and pertains to known packages, the worst I currently imagine is astroturfing: a single individual attempting to make a package seem more popular than it is.

Just generally IME, spamming aims to make a buck somehow, but if there are no fields in the data set that can be used for this, and abuse of existing fields to fill them with spam prose gets filtered because it doesn't correlate to any known possible values, then the entire record is simply invalid, and can be removed on that basis.

Conceptually, you could have a report with "dev-foo/plz-sir-halp-me-I-have-money-and-an-a-nigerian-prince::nigeria-prince", but for anybody to see that they'd have to be querying data about the ::nigeria-prince overlay, and that's assuming we even show data about overlays we can't locate. Trolling ::gentoo with packages that don't exist seems easy to eliminate.

I don't like that astroturfing could be a thing ... but like, I also don't really care about that happening.

For instance, crates.io has per-crate and per-crate-version download statistics. That's super easy to rig, and you get lots of spiky noise in infrequently used packages simply due to various automated services fetching things.

But at scale, the data still turns out to be quasi-useful, as it allows you to chart adoption and migration... because as soon as a new version gets shipped, if people are using it, then you'll start to see an uptick in reports from the new version. The "change" and "change response" information is very useful, and a very odd target for astroturfing.

I for one would be greatly interested in "new perl version shipped, explosion of results due to people upgrading", because then I can gauge roughly how many people managed to upgrade perl without having to join #gentoo and cry about it being broken.

(We could also designate a certain UUID flag for use by Gentoo infra, possibly even a UUID-per-host, the results of which would be invisible in the public data, but still visible to people with approved perms. We really do value the ability to know which packages we have to be careful about causing problems in, and where infra is at with upgrading various things before we remove the versions infra is using, whereas currently, working out what infra is running requires lots of direct communication.)
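The filtering idea above (reject any record whose fields don't correspond to known values) might look roughly like this. It is only a sketch: `KNOWN_PACKAGES` is a hypothetical stand-in for a list derived from ::gentoo, and the regular expression is a loose shape check, not the full package-name grammar.

```python
import re

# Hypothetical stand-in for the list of packages known to exist in ::gentoo.
KNOWN_PACKAGES = {"dev-lang/perl", "dev-lang/python", "app-portage/pfl"}

# Loose shape check for "category/name"; not the full PMS name grammar.
ATOM_RE = re.compile(r"^[A-Za-z0-9+_.-]+/[A-Za-z0-9+_-]+$")


def validate_report(packages, known=KNOWN_PACKAGES):
    """Return True only if every entry is well formed and names a known package."""
    return bool(packages) and all(
        isinstance(p, str) and ATOM_RE.match(p) and p in known for p in packages
    )


good = validate_report(["dev-lang/perl", "app-portage/pfl"])
bad = validate_report(["dev-lang/perl", "dev-foo/made-up-spam-package"])
```

A single unknown entry invalidates the whole record here, matching the "entire record is simply invalid" stance. A real service might instead drop only the unknown entries.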
Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
On Sun, 26 Apr 2020 03:39:24 -0700 Brian Dolbec wrote:
> We would need that
> person/team to only enable their test system for gentoostats/disabled
> for deployments. Repeated failure to do that could result in that uuid
> being blacklisted. Part of the initial profile details for that
> vm/image would be some details about approx numbers of deployments
> (yes, subject to change. But useful to know whether it is 10-15 or
> 100-500. type of deployment ie: vm/docker/kubernetes/desktop/server...

If the UUID generation was how I proposed in my other reply (on a voluntary basis, with the ability for UUIDs to have metadata about what the information associated with them may be used for), one could also have a metadata field indicating what /kind/ of user the UUID was associated with.

Then people simply installing things for testing, and reporting results from their test rig, could have a "tester" flag associated with a UUID used only for testing, and then we can exclude that data from the main reports, while still using it as evidence that a thing may work for some audience.

The submission rate for UUIDs with the "tester" flag could be allowed to be higher, because it no longer contributes to the overall statistics.
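As a rough illustration of the "tester" flag, the sketch below keeps tester submissions out of the main counts while still tallying them separately. The data shapes (a list of `(uuid, package)` pairs and a dict of per-UUID flags) and the function name `tally` are assumptions made for the example.

```python
from collections import Counter


def tally(reports, uuid_meta):
    """Count installs per package, keeping tester-flagged UUIDs separate.

    reports:   iterable of (uuid, package) pairs
    uuid_meta: dict mapping uuid -> set of flags, e.g. {"tester"}
    """
    main, testers = Counter(), Counter()
    for uid, package in reports:
        if "tester" in uuid_meta.get(uid, set()):
            testers[package] += 1
        else:
            main[package] += 1
    return main, testers


meta = {"uuid-a": set(), "uuid-b": {"tester"}}
reports = [("uuid-a", "dev-lang/perl"), ("uuid-b", "dev-lang/perl"),
           ("uuid-b", "app-misc/foo")]
main, testers = tally(reports, meta)
```

The tester counter can still answer "has anybody at all built this?" without inflating the popularity figures.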
Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
On Sun, 26 Apr 2020 14:38:54 +0200 Thomas Deutschmann wrote:
> Let's assume we will get reports that app-misc/foo is only installed 20
> times. If you are going to judge based on this data, "Obviously, nobody
> is using that package, it's stuck on ... safe to remove" your
> view is biased:

I see this as more like what bloom filters get you, but in reverse:

- You still have to factor for "what you don't know"
- But now, instead of having "we don't know if anybody uses this", you
  *can* have a "we know for sure somebody uses this".

The anonymization and uncorrelatable aspects are of course very useful to encourage people who would otherwise be averse to participating, but it's certainly not a sure thing.

It would certainly be an improvement over what currently happens: "No reverse dependencies, thus, nobody is using it".

Bad things will still happen, but the absence of this tool won't stop the bad things happening, because presently, the existence of users is entirely conjecture.

pgpgPoUjIyApz.pgp Description: OpenPGP digital signature
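The "bloom filter in reverse" reading can be written down as a tiny sketch: a positive count is proof of at least one user, while a zero or missing count supports no conclusion at all. The function name and the two labels are made up for illustration.

```python
def usage_status(package, install_counts):
    """Report only what opt-in data can support.

    A positive count proves at least one user exists; a zero or missing
    count proves nothing, because non-reporting users are invisible.
    """
    if install_counts.get(package, 0) > 0:
        return "known-used"
    return "unknown"  # deliberately never "unused"
```

The point of the design is that "unused" is not a possible answer, so the data can only argue for keeping a package, never by itself for removing one.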
Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
On Sun, 26 Apr 2020 10:08:32 +0200 Michał Górny wrote: > A proper solution to cluster problem would probably involve some way to > internally collect and combine data data before submission. If you have > large clusters of similar systems, I think you'd want to have all > packages used on different systems reported as one entry. For this, I'd suggest the ability to have an overrideable "STATS_SERVER" (or something) ENV var URI that tells the submission clients where to send their reports to. Then have some server shipped in gentoo people can deploy, and submit aggregated as a cron job, or potentially hand review the aggregated submission data before submission, and potentially have tools to whittle data out you don't want to share at the org level. Such a tool is potentially useful to an organisation even without its "submit to gentoo" capacity, as being able to internally analyse what your organisation is using seems to be useful. (eg: provide an admin a single point of information showing what packages they need to audit, if all the nodes in the org are not entirely controlled at the top level) Though I think the overall design of anonymity by design is useful, I can see usecases, especially in the organisation model, where being able to voluntarily self-identify a node could be useful without inherently being a privacy concern. And you'd configure your relay to suppress these node identities in the submitted data, or map them to a different org-wide identity. Example: I need to find somebody who is using so I can ask them if works, or if is important to this package. Example: Data indicates somebody within my org is using , and I need to ask them not to use , as its licensing terms are not compatible with our org. 
Though for cases of voluntary identification, you'd need an interface on the server node somewhere that allows you to generate unique ident tokens, and associate data with them, possibly with a list of flags dictating what records associated with this identity may be used for (eg: Contact [y/n] ) pgpZ0SD8p685S.pgp Description: OpenPGP digital signature
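A minimal sketch of the relay idea, assuming the client reads an overridable `STATS_SERVER` environment variable and the org-level relay merges per-node reports into one entry before forwarding. The default URL is a placeholder, and `aggregate` with its `withheld` parameter is a hypothetical helper.

```python
import os

# Placeholder default; the real submission endpoint is not specified in this thread.
DEFAULT_SERVER = "https://stats.example.org/submit"


def submission_url(env=os.environ):
    """Return the server to submit to, honouring a STATS_SERVER override."""
    return env.get("STATS_SERVER", DEFAULT_SERVER)


def aggregate(node_reports, withheld=frozenset()):
    """Merge per-node package lists into one org-wide entry.

    Node identities (the mapping keys) are dropped, and any package listed
    in `withheld` is removed before the data leaves the organisation.
    """
    merged = set()
    for packages in node_reports.values():
        merged.update(packages)
    return sorted(merged - set(withheld))
```

Nodes inside the organisation would point `STATS_SERVER` at the relay, and only the relay's cron job (or a human reviewer) would send the aggregated list upstream.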
Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
On 2020-04-26 10:08, Michał Górny wrote:
> What do you think? Do you foresee other problems? Do you have
> other needs? Can you think of better solutions?

While I would really like to have data, I think it's impossible to get correct data and therefore we shouldn't collect any data at all because the invalid data we would collect would be misused/misinterpreted.

Let's start with your first example already,

> the primary goal of the project would be to gather statistics on
> popularity of packages, in order to help us prioritize our attention
> and make decisions on what to keep and what to remove

Let's assume we will get reports that app-misc/foo is only installed 20 times. If you are going to judge based on this data, "Obviously, nobody is using that package, it's stuck on ... safe to remove", your view is biased: because reporting will never be mandatory, we don't know if app-misc/foo is just unlucky because most of its users haven't opted in to reporting (you can assume something like this for people with tor-related programs, for example).

Now think about large installations which are probably not allowed to "phone home", use their private local mirror and even use build hosts. I am aware of *multiple* large Gentoo deployments -- for servers. You will never get data from these installations. Instead, stats will be drowned out by home users, who are more likely to submit data. Not to mention the new containerized world...

It's the problem you all should know from Mozilla, Google, Microsoft *duck*: They all do 'data-driven development'. The problem: *We* are power users. We are using several features most normal users don't even know about. However, most of us are also privacy-aware and disable stats. The result: These companies are killing popular power user features just because their data indicates that nobody is using that feature. Please don't create pressure on users to opt in to gentoostats to prevent something like this for Gentoo.

My point is: I'll strongly object to *any* decision based on this project because the data will be *always wrong*. Therefore the data is useless and I wouldn't even consider collecting it in the first place. Where there is a trough the pigs gather... and at some point people will start to ignore that the data is useless just to underline *their* point in their current situation. :/

-- 
Regards,
Thomas Deutschmann / Gentoo Linux Developer
C4DD 695F A713 8F24 2AA1 5638 5849 7EE5 1D5D 74A5
signature.asc Description: OpenPGP digital signature
Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
On 4/26/20 12:25 PM, Michał Górny wrote:
> On Sun, 2020-04-26 at 12:15 +0200, Toralf Förster wrote:
>> On 4/26/20 10:52 AM, Michał Górny wrote:
>>> Do you have any other idea for spam protection then?
>>
>> IMO there are 2 types of spam:
>>
>> 1. made by accident (e.g. "* * * * *" instead of "@weekly" in crontab)
>> 2. made intentionally
>>
>> The 1st can be handled by UUID - just drop any older related dataset from
>> the inbox when a new one arrives.
>> For the 2nd: what about accepting only datasets from "valid" UUIDs, meaning
>> those where on average just 1 dataset/week/IPv4 (maybe /16 block) arrived
>> in the last few weeks/months?
>>
>
> I'm sorry but could you rephrase that in more sentences? I don't
> understand what you mean.
>

Well, inspired by what the Tor people do with Tor bridge stats:

- Create a UUID (never published, known only to the client and to the Gentoo stats server).
- Calculate a hash of it. The hash may be published. The hash may be associated with contact information. The contact data may or may not be published. The hash is used for contacting people in case of questions.

The stats sent by the client contain the UUID. Stats are sent to a staging area on the stats server, where they live for a while (days). When a new stats file arrives, the stats server deletes all older stats files for that UUID in the staging area.

Stats are trusted if they meet the conditions already mentioned by Brian Dolbec.

IMO we should not care about detecting spam, just try to detect valid UUIDs.

-- 
Toralf
PGP 23217DA7 9B888F45
signature.asc Description: OpenPGP digital signature
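The scheme described above could be sketched roughly as follows, assuming SHA-256 as the hash (the message does not name one) and an in-memory list standing in for the server's staging area. `new_identity` and `keep_latest` are illustrative names only.

```python
import hashlib
import uuid


def new_identity():
    """Create a secret UUID and the publishable hash derived from it."""
    secret = uuid.uuid4()
    public = hashlib.sha256(secret.bytes).hexdigest()  # hash choice is an assumption
    return str(secret), public


def keep_latest(inbox):
    """Keep only the newest dataset per UUID.

    `inbox` is an iterable of (uuid, timestamp, payload) tuples; older
    datasets for the same UUID are discarded, which also neutralises an
    accidental "* * * * *" crontab entry.
    """
    latest = {}
    for uid, timestamp, payload in inbox:
        if uid not in latest or timestamp > latest[uid][0]:
            latest[uid] = (timestamp, payload)
    return latest
```

Only the hash would ever appear in published data or contact lists; the UUID itself stays between the client and the stats server.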
Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
On Sun, 26 Apr 2020 11:32:06 +0200 Toralf Förster wrote:
> On 4/26/20 11:09 AM, Ulrich Mueller wrote:
> > Instead of using the IP address, you could generate a UUID when
> > installing the tool.
>
> like the pfl tool did ?
>

Like the last gentoostats GSoC project did.

As for enterprise/school/multiple-clone deployments: those are generated by one person/team, then deployed. We would need that person/team to enable gentoostats only on their test system and disable it for deployments. Repeated failure to do that could result in that UUID being blacklisted. Part of the initial profile details for that VM/image would be some details about the approximate number of deployments (yes, subject to change, but useful to know whether it is 10-15 or 100-500) and the type of deployment, i.e. vm/docker/kubernetes/desktop/server...
Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
On Sun, 2020-04-26 at 12:15 +0200, Toralf Förster wrote:
> On 4/26/20 10:52 AM, Michał Górny wrote:
> > Do you have any other idea for spam protection then?
>
> IMO there are 2 types of spam:
>
> 1. made by accident (e.g. "* * * * *" instead of "@weekly" in crontab)
> 2. made intentionally
>
> The 1st can be handled by UUID - just drop any older related dataset from
> the inbox when a new one arrives.
> For the 2nd: what about accepting only datasets from "valid" UUIDs, meaning
> those where on average just 1 dataset/week/IPv4 (maybe /16 block) arrived
> in the last few weeks/months?
>

I'm sorry but could you rephrase that in more sentences? I don't understand what you mean.

-- 
Best regards,
Michał Górny
signature.asc Description: This is a digitally signed message part
Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
On 4/26/20 10:52 AM, Michał Górny wrote:
> Do you have any other idea for spam protection then?

IMO there are 2 types of spam:

1. made by accident (e.g. "* * * * *" instead of "@weekly" in crontab)
2. made intentionally

The 1st can be handled by UUID - just drop any older related dataset from the inbox when a new one arrives.
For the 2nd: what about accepting only datasets from "valid" UUIDs, meaning those where on average just 1 dataset/week/IPv4 (maybe /16 block) arrived in the last few weeks/months?

Well, other than that maybe the spamassassin or Tor people have more theory and generic approaches? :-)

-- 
Toralf
PGP 23217DA7 9B888F45
signature.asc Description: OpenPGP digital signature
Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
On Sun, 2020-04-26 at 11:09 +0200, Ulrich Mueller wrote: > > > > > > On Sun, 26 Apr 2020, Michał Górny wrote: > > The other major problem is spam protection. The best semi-anonymous way > > I see is to use submitter's IPv4 addresses (can we support IPv6 then?). > > We could set a limit of, say, 10 submissions per IPv4 address per week. > > If some address would exceed that limit, we could require CAPTCHA > > authorization. > > Instead of using the IP address, you could generate a UUID when > installing the tool. This would also take care of clusters with machines > that are clones of each other. > That wouldn't help with abuse at all. -- Best regards, Michał Górny signature.asc Description: This is a digitally signed message part
Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
On 4/26/20 11:09 AM, Ulrich Mueller wrote: > Instead of using the IP address, you could generate a UUID when > installing the tool. like the pfl tool did ? -- Toralf PGP 23217DA7 9B888F45 signature.asc Description: OpenPGP digital signature
Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
> On Sun, 26 Apr 2020, Michał Górny wrote: > The other major problem is spam protection. The best semi-anonymous way > I see is to use submitter's IPv4 addresses (can we support IPv6 then?). > We could set a limit of, say, 10 submissions per IPv4 address per week. > If some address would exceed that limit, we could require CAPTCHA > authorization. Instead of using the IP address, you could generate a UUID when installing the tool. This would also take care of clusters with machines that are clones of each other. Ulrich signature.asc Description: PGP signature
Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
On Sun, 2020-04-26 at 10:43 +0200, Toralf Förster wrote: > On 4/26/20 10:08 AM, Michał Górny wrote: > > . This > > involves accepting a privacy policy and setting up a cronjob. The tool > > would suggest a (random?) time for submission to take place periodically > > (say, every week). > > Well, something like "@weekly" should be preferred over eg "42 23 * * *" b/c > the later might be too late for desktop users. > > > > We could set a limit of, say, 10 submissions per IPv4 address per week. > > If the output do not differ (too much) then the limit isn't needed, or? Do you have any other idea for spam protection then? -- Best regards, Michał Górny signature.asc Description: This is a digitally signed message part
Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
On 4/26/20 10:08 AM, Michał Górny wrote:
> . This
> involves accepting a privacy policy and setting up a cronjob. The tool
> would suggest a (random?) time for submission to take place periodically
> (say, every week).

Well, something like "@weekly" should be preferred over e.g. "42 23 * * *" b/c the latter might be too late for desktop users.

> We could set a limit of, say, 10 submissions per IPv4 address per week.

If the output does not differ (too much) then the limit isn't needed, or?

-- 
Toralf
PGP 23217DA7 9B888F45
signature.asc Description: OpenPGP digital signature
Re: [gentoo-dev] [RFC] Ideas for gentoostats implementation
Hi,

On Sun 26 Apr 2020 10:08:32 GMT, Michał Górny wrote:
> The other major problem is spam protection. The best semi-anonymous way
> I see is to use submitter's IPv4 addresses (can we support IPv6 then?).
> We could set a limit of, say, 10 submissions per IPv4 address per week.
> If some address would exceed that limit, we could require CAPTCHA
> authorization.
>
> I think this would make spamming a bit harder while keeping submissions
> easy for the most, and a little harder but possible for those of us
> behind ISP NATs.

I think that IPv6 support shouldn't be a question. I have several points for it:

1. All of the Gentoo infrastructure is IPv6-capable (at least the public-facing part, as far as I'm aware), so an IPv4-only service would be a special case.
2. As you mention ISP NATs, most of those ISPs provide IPv6 as well (because NAT isn't free). Also, applying the IPv4 rate limit to each IPv6 /64 will reduce the need for a CAPTCHA.
3. Users don't necessarily have IPv4 access.
4. About a third of Internet traffic is IPv6, so in my humble opinion supporting it is not optional.

Regards,
-- 
Alarig
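One way the per-address limit could cover both protocols, sketched with Python's standard `ipaddress` module: an IPv4 submitter is keyed by its address and an IPv6 submitter by its /64 prefix. The limit of 10 per week comes from the proposal quoted above; the class and function names and the in-memory storage are assumptions.

```python
import ipaddress
from collections import defaultdict

WEEK = 7 * 24 * 3600  # window length in seconds
LIMIT = 10            # submissions per key per week, as proposed in the thread


def rate_key(addr):
    """Map an address to its rate-limit bucket: IPv4 address or IPv6 /64."""
    ip = ipaddress.ip_address(addr)
    if ip.version == 4:
        return str(ip)
    return str(ipaddress.ip_network(f"{ip}/64", strict=False))


class WeeklyLimiter:
    """Sliding-window counter; a False result would trigger the CAPTCHA path."""

    def __init__(self, limit=LIMIT, window=WEEK):
        self.limit, self.window = limit, window
        self.seen = defaultdict(list)

    def allow(self, addr, now):
        key = rate_key(addr)
        # Forget submissions that have fallen out of the window.
        recent = [t for t in self.seen[key] if now - t < self.window]
        allowed = len(recent) < self.limit
        if allowed:
            recent.append(now)
        self.seen[key] = recent
        return allowed
```

Keying IPv6 on the /64 treats a typical home or site prefix like one IPv4 address, so rotating privacy addresses within the prefix neither evade the limit nor multiply it.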