Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-24 Thread Kent Fredric
On Sun, 24 May 2020 13:05:35 +
Peter Stuge  wrote:

> The bar only needs to be raised high enough.

Sure. A lot of this is just "think about what could happen in the worst
case imaginable".

It's very unlikely our worst cases will happen.

But we should at least have the ability to easily add mitigations in
future if things do get worse.




Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-24 Thread Peter Stuge
Kent Fredric wrote:
> > While services such as reCAPTCHA are (as said) massively intrusive, there
> > are other, much less intrusive and even terminal-compatible ways to construct
> > a CAPTCHA. Hello game developers, you have 80x23 "pixels" to render a puzzle
> > for a human above the response input line - that's not so bad.
> 
> Well, they kinda have to be,

I disagree with that, especially for this service; that was the point I
wanted to make. :)


> the state of AI is increasing so much that current captcha systems
> undoubtedly also develop their own adversarial AI to try beat their
> own captcha.
> 
> I don't think we have the sort of power to develop this.

In any case I don't think that's required.


> And the inherently low entropy of only having 80x23 with so few
> (compared to full RGB) bits per pixel,

A character doesn't compare too badly to RGB. See aalib or, if you are
willing to risk excluding color-vision-impaired humans, libcaca.


> this gives any would-be AI a substantial leg up.
> 
> Using text distortion is amateur hour these days.
> 
> (and there's always mechanical-turk anyway)

Except this isn't for some web-scale disruptive startup, it's a
statistics/reputation system for an advanced, super-nerdy Linux distribution.

Please think more about the threat model, and remember the rate limit knob.

The bar only needs to be raised high enough.


//Peter



Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-23 Thread Kent Fredric
On Fri, 22 May 2020 21:58:54 +0200
Michał Górny  wrote:

> Let's put it like this.  This thing starts working.  Package X is
> broken, and we see that almost nobody is using it.  We remove that
> package.  Now somebody is angry.  He submits a lot of fake data to
> render the service useless so that we don't make any future decisions
> based on it.

Sure, and I agree that's a risk. But it's not the "random users from the
internet fill your inbox with shallow promises of free money" sort of
risk that's typically implied by "spam" ;).

The set of potential attackers seems much smaller in our case, and they
are especially likely to be actual consumers of Gentoo.

This type of attacker seems to be the sort we can mitigate well if we:

- Make it so that end users can't forge custom IDs, which can only be
  handed out by the server (but the ID doesn't actually add any
  tracking, it's just a chunk of randomness with a signature that
  verifies its legitimacy)

- Make ID generation expensive.

- Limit submissions per ID the same way we do now.

That way it doesn't harm typical users beyond their --setup, but hurts
would-be attackers.

If we come under attack, we can just suspend ID generation services, or
rate limit ID generation.

(And we can encode data in the ID about when it was generated, and the
strength of the challenge of the generation, and then block submissions
based on criteria when problems occur)

This means we don't need to keep track of which IDs are "valid" server
side; the crypto bits do all the legwork.

Even if our private key doing the signing gets compromised, we can
change it, which triggers all users to need to re-id, and flush old
data.
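
A minimal sketch of the server-issued ID idea above, under stated
assumptions: it uses an HMAC from the Python standard library as a
stand-in for the signature (a real deployment might prefer an asymmetric
signature so that verifiers don't hold the secret), and the names
issue_id, verify_id and SERVER_KEY as well as the payload fields are
hypothetical.

```python
import base64
import hashlib
import hmac
import json
import secrets
import time

# Hypothetical server-side secret; rotating it invalidates every issued ID.
SERVER_KEY = secrets.token_bytes(32)


def issue_id(difficulty: int, key: bytes = SERVER_KEY) -> str:
    """Return an opaque ID: randomness plus issue time and challenge strength, with a MAC."""
    payload = json.dumps({
        "nonce": secrets.token_hex(16),   # pure randomness, carries no tracking data
        "issued": int(time.time()),       # when the ID was generated
        "difficulty": difficulty,         # strength of the challenge that was solved
    }, sort_keys=True).encode()
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode()
            + "." + base64.urlsafe_b64encode(tag).decode())


def verify_id(token: str, key: bytes = SERVER_KEY,
              min_difficulty: int = 0, not_before: int = 0):
    """Return the payload dict if the ID is authentic and passes the policy, else None."""
    try:
        payload_b64, tag_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        tag = base64.urlsafe_b64decode(tag_b64)
    except ValueError:
        return None
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None
    data = json.loads(payload)
    # Policy knobs: block IDs that are too cheap or too old when problems occur.
    if data["difficulty"] < min_difficulty or data["issued"] < not_before:
        return None
    return data
```

The server keeps no table of valid IDs here: tightening min_difficulty or
not_before blocks a class of IDs, and changing the key forces everyone to
re-id.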





Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-23 Thread Kent Fredric
On Fri, 22 May 2020 22:13:11 +
Peter Stuge  wrote:

> While services such as reCAPTCHA are (as said) massively intrusive, there
> are other, much less intrusive and even terminal-compatible ways to construct
> a CAPTCHA. Hello game developers, you have 80x23 "pixels" to render a puzzle
> for a human above the response input line - that's not so bad.

Well, they kinda have to be: the state of AI is advancing so fast that
the developers of current captcha systems undoubtedly also develop their
own adversarial AI to try to beat their own captcha.

I don't think we have the sort of power to develop this.

And the inherently low entropy of having only 80x23 cells, with so few
bits per pixel compared to full RGB, gives any would-be AI a
substantial leg up.

Using text distortion is amateur hour these days.

(and there's always mechanical-turk anyway)




Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-23 Thread Kent Fredric
On Fri, 22 May 2020 12:53:03 -0700
Brian Dolbec  wrote:

> We cannot exclude overlays which will have cat/pkg not in the main
> gentoo repo.  So, we should not exclude a submission that includes a few
> of these.  They would just become irrelevant outliers to our
> processing of the data.  In fact some of these outlier pkgs could be
> relevant to our including that pkg into the main repo.

We *can* still validate them against entries in known overlays.

And even if we *can't* validate everything, we can de-weight and hide
from *default* reports items that can't be found in known overlays.

This would move the difficulty goal from "submit a spam record" to:

- write an overlay
- get it published somewhere
- get it included in the database of known overlays
- then publish a spam record relating to it

Which sounds like a slow and painful process if the risk of being
blacklisted burns down that whole stack.
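
A rough sketch of that validation and de-weighting step, assuming the
server has some index of packages per known repository. KNOWN_PACKAGES,
weigh_report and the "example-overlay" entries are hypothetical and only
serve the illustration.

```python
# Hypothetical package index: repository name -> set of category/package names.
KNOWN_PACKAGES = {
    "gentoo": {"dev-lang/python", "sys-apps/portage"},
    "example-overlay": {"app-misc/example-tool"},
}


def weigh_report(packages, known=KNOWN_PACKAGES):
    """Split a report into verified and unverified entries.

    `packages` is an iterable of (repository, "category/name") pairs.
    Returns (verified, unverified, weight), where weight is the share of
    verified entries and can be used to de-weight the whole report.
    """
    verified, unverified = [], []
    for repo, cat_pn in packages:
        if cat_pn in known.get(repo, set()):
            verified.append((repo, cat_pn))
        else:
            unverified.append((repo, cat_pn))   # hidden from default reports
    total = len(verified) + len(unverified)
    weight = len(verified) / total if total else 0.0
    return verified, unverified, weight
```

Unverified entries are kept rather than dropped, so they can still show
up in non-default reports.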




Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-23 Thread Michał Górny
On Sat, 2020-05-23 at 09:54 +0200, Fabian Groffen wrote:
> On 22-05-2020 21:58:54 +0200, Michał Górny wrote:
> > Let's put it like this.  This thing starts working.  Package X is
> > broken, and we see that almost nobody is using it.  We remove that
> > package.  Now somebody is angry.  He submits a lot of fake data to
> > render the service useless so that we don't make any future decisions
> > based on it.
> 
> I'm afraid that has a heroic flair to me.  The service should never be
> used for decisions like that, because it's a biased sample at most.
> Doing stuff like this simply destroys the soul of the distribution.
> 
> I hope this isn't one of your genuine objectives with the service.  If
> it is, I can see why you fear spam so much.
> 

What it is is one thing, what an angry user perceives it to be is
another.

-- 
Best regards,
Michał Górny





Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-23 Thread Fabian Groffen
On 22-05-2020 21:58:54 +0200, Michał Górny wrote:
> Let's put it like this.  This thing starts working.  Package X is
> broken, and we see that almost nobody is using it.  We remove that
> package.  Now somebody is angry.  He submits a lot of fake data to
> render the service useless so that we don't make any future decisions
> based on it.

I'm afraid that has a heroic flair to me.  The service should never be
used for decisions like that, because it's a biased sample at most.
Doing stuff like this simply destroys the soul of the distribution.

I hope this isn't one of your genuine objectives with the service.  If
it is, I can see why you fear spam so much.

Fabian

-- 
Fabian Groffen
Gentoo on a different level




Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-22 Thread Peter Stuge
Stop motivated attackers or keep low barrier to entry; pick any one. :)

Michał Górny wrote:
> CAPTCHA
> ==
> A traditional way of dealing with spam -- require every new system
> identifier to be confirmed by solving a CAPTCHA (or a few identifiers
> for one CAPTCHA).
> 
> The advantage of this method is that it requires a real human work to be
> performed, effectively limiting the ability to submit spam.
> The disadvantage is that it is cumbersome to users, so many of them will
> just resign from participating.

While services such as reCAPTCHA are (as said) massively intrusive, there
are other, much less intrusive and even terminal-compatible ways to construct
a CAPTCHA. Hello game developers, you have 80x23 "pixels" to render a puzzle
for a human above the response input line - that's not so bad.

Attacking something like a server-generated maths challenge rendered in a
randomly chosen and maybe distorted font would require OCR and/or ML, which
is fairly annoying. The only real problem then would be with OCR packages. ;)

Combine with a rate limit that is increased manually as the service grows
more popular. It can be a soft limit which doesn't report failure but results
in queueing+maybe vetting of reports, to allow some elasticity for peaks.
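
One way the soft limit above could look, as a sketch only: reports over
the limit are queued for later processing or vetting instead of being
rejected. The class name SoftRateLimiter and its fields are hypothetical.

```python
from collections import deque


class SoftRateLimiter:
    """Accept up to `limit` reports per window; queue the rest instead of rejecting them."""

    def __init__(self, limit: int, window: float):
        self.limit = limit          # raised manually as the service grows
        self.window = window        # window length in seconds
        self.accepted = deque()     # timestamps of reports accepted in the current window
        self.pending = deque()      # overflow reports awaiting later processing or vetting

    def submit(self, report, now: float) -> str:
        # Forget accepted reports that have fallen out of the window.
        while self.accepted and now - self.accepted[0] >= self.window:
            self.accepted.popleft()
        if len(self.accepted) < self.limit:
            self.accepted.append(now)
            return "accepted"
        self.pending.append(report)
        return "queued"   # the client is not told about a failure
```

The pending queue provides the elasticity for peaks: it can be drained
when load drops, or inspected by hand if it grows suspiciously.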


//Peter



Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-22 Thread John Helmert III
On Fri, May 22, 2020 at 12:53:03PM -0700, Brian Dolbec wrote:
> We cannot exclude overlays which will have cat/pkg not in the main
> gentoo repo.  So, we should not exclude a submission that includes a few
> of these.

To avoid this problem, even if imperfectly, it should be possible to
track what repository a given package is installed from and then check
its validity based on a list of valid packages for a given overlay.




Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-22 Thread Michał Górny
On Sat, 2020-05-23 at 07:20 +1200, Kent Fredric wrote:
> On Thu, 21 May 2020 10:47:07 +0200
> Michał Górny  wrote:
> 
> > Other ideas
> > ===
> > Do you have any other ideas on how we could resolve this?
> 
> And a question I'd like to revisit, because nobody responded to it:
> 
> - What are the incentives a would-be spammer has to spam this service.
> 
> Services that see spam *typically* have a definable objective.
> 
> *Typically* it revolves around the ability to submit /arbitrary text/,
> which allows them to hawk something, and this becomes a profit motive.
> 
> If we implement data validation so that there's no way for them to
> profit off what they spam, seems likely they'll be less motivated to
> develop the necessary circumvention tools. ( as in, we shouldn't accept
> arbitrary CAT/PN pairs as being valid until something can confirm those
> pairs exist in reality )
> 
> There may be people trying to jack the data up, but ... it seems a less
> worthy target.
> 
> So it seems the largest risk isn't so much "spam", but "denial of
> service", or "data pollution".

I meant 'spam' as 'undesired submissions'.  You seem to have used
a very narrow definition of 'spam' only to arrive at the same
problem under a different name.

Let's put it like this.  This thing starts working.  Package X is
broken, and we see that almost nobody is using it.  We remove that
package.  Now somebody is angry.  He submits a lot of fake data to
render the service useless so that we don't make any future decisions
based on it.

-- 
Best regards,
Michał Górny





Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-22 Thread Brian Dolbec
On Sat, 23 May 2020 07:20:22 +1200
Kent Fredric  wrote:

> On Thu, 21 May 2020 10:47:07 +0200
> Michał Górny  wrote:
> 
> > Other ideas
> > ===
> > Do you have any other ideas on how we could resolve this?  
> 
> And a question I'd like to revisit, because nobody responded to it:
> 
> - What are the incentives a would-be spammer has to spam this service.
> 
> Services that see spam *typically* have a definable objective.
> 
> *Typically* it revolves around the ability to submit /arbitrary text/,
> which allows them to hawk something, and this becomes a profit motive.
> 
> If we implement data validation so that there's no way for them to
> profit off what they spam, seems likely they'll be less motivated to
> develop the necessary circumvention tools. ( as in, we shouldn't
> accept arbitrary CAT/PN pairs as being valid until something can
> confirm those pairs exist in reality )
> 
> There may be people trying to jack the data up, but ... it seems a
> less worthy target.
> 
> So it seems the largest risk isn't so much "spam", but "denial of
> service", or "data pollution".
> 
> Of course, we should still mitigate, but /how/ we mitigate seems to
> pivot around this somewhat.

We cannot exclude overlays which will have cat/pkg not in the main
gentoo repo.  So, we should not exclude a submission that includes a few
of these.  They would just become irrelevant outliers to our
processing of the data.  In fact some of these outlier pkgs could be
relevant to our including that pkg into the main repo.

But, like you, I think that purely spam submissions would be few, if any.



Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-22 Thread Kent Fredric
On Thu, 21 May 2020 10:47:07 +0200
Michał Górny  wrote:

> Other ideas
> ===
> Do you have any other ideas on how we could resolve this?

And a question I'd like to revisit, because nobody responded to it:

- What are the incentives a would-be spammer has to spam this service?

Services that see spam *typically* have a definable objective.

*Typically* it revolves around the ability to submit /arbitrary text/,
which allows them to hawk something, and this becomes a profit motive.

If we implement data validation so that there's no way for them to
profit off what they spam, seems likely they'll be less motivated to
develop the necessary circumvention tools. ( as in, we shouldn't accept
arbitrary CAT/PN pairs as being valid until something can confirm those
pairs exist in reality )

There may be people trying to jack the data up, but ... it seems a less
worthy target.

So it seems the largest risk isn't so much "spam", but "denial of
service", or "data pollution".

Of course, we should still mitigate, but /how/ we mitigate seems to
pivot around this somewhat.




Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-22 Thread waebbl
On Fri, 22 May 2020 at 15:40, Gordon Pettey <
petteyg...@gmail.com> wrote:

> On Fri, May 22, 2020 at 1:18 AM waebbl  wrote:
>
>> On Thu, 21 May 2020 at 22:14, Viktar Patotski <
>> xp.vit@gmail.com> wrote:
>>
>>> I believe that we have all forgotten about Donald Knuth: Premature
>>> optimisation is the root of all evil.
>>>
>> I wouldn't consider spam protection to be an optimisation. Instead, the
>> occurrence of spam is IMO a proper use case from a developer's PoV. Therefore
>> thinking about how to handle it is a necessary task.
>>
> Abusing Knuth's words as an excuse to avoid any and all good practice is
> the root of all evil.
>

Would you consider not even thinking about it a good practice?


Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-22 Thread Gordon Pettey
On Fri, May 22, 2020 at 1:18 AM waebbl  wrote:

> On Thu, 21 May 2020 at 22:14, Viktar Patotski <
> xp.vit@gmail.com> wrote:
>
>> I believe that we have all forgotten about Donald Knuth: Premature
>> optimisation is the root of all evil.
>>
> I wouldn't consider spam protection to be an optimisation. Instead, the
> occurrence of spam is IMO a proper use case from a developer's PoV. Therefore
> thinking about how to handle it is a necessary task.
>
Abusing Knuth's words as an excuse to avoid any and all good practice is
the root of all evil.


Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-22 Thread waebbl
On Thu, 21 May 2020 at 22:14, Viktar Patotski <
xp.vit@gmail.com> wrote:

> I believe that we have all forgotten about Donald Knuth: Premature
> optimisation is the root of all evil.
>

I wouldn't consider spam protection to be an optimisation. Instead, the
occurrence of spam is IMO a proper use case from a developer's PoV. Therefore
thinking about how to handle it is a necessary task.

--
With Regards
Bernd 


Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-22 Thread Michał Górny
On Fri, 2020-05-22 at 06:42 +0200, Michał Górny wrote:
> On Thu, 2020-05-21 at 22:13 +0200, Viktar Patotski wrote:
> > We don't have "spam" yet, but we are already trying to protect. There might
> > be cases when some systems will be posting stats more often than we want,
> > but probably that will not harm us. Or this will be done by our main users
> > who runs 1kk of gentoo installations and this "spam"  will be actually
> > valuable. Moreover, nobody forces us to treat info from 'goose' as first
> > priority, so we are still able to select on which packages to work. In
> > short: this topic is not so important yet, I think.
> > 
> 
> Tell that to SKS keyserver admins.  Well, on the plus side if it
> happens, it probably won't affect user systems in the process.

Well, I didn't make my point very clear, so please let me explain.

Right now the project is in an experimental phase.  If we make major changes
right now, the harm is minimal.

If spamming happens one year from now, two years from now... we'd have
many users submitting data.  Suddenly, we would have to invent something
new, and it will probably be impossible within the framework used right
now.  This would most likely mean we'd have to literally kick all users
from the system and start over.

-- 
Best regards,
Michał Górny





Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-21 Thread Michał Górny
On Thu, 2020-05-21 at 22:13 +0200, Viktar Patotski wrote:
> We don't have "spam" yet, but we are already trying to protect. There might
> be cases when some systems will be posting stats more often than we want,
> but probably that will not harm us. Or this will be done by our main users
> who runs 1kk of gentoo installations and this "spam"  will be actually
> valuable. Moreover, nobody forces us to treat info from 'goose' as first
> priority, so we are still able to select on which packages to work. In
> short: this topic is not so important yet, I think.
> 

Tell that to SKS keyserver admins.  Well, on the plus side if it
happens, it probably won't affect user systems in the process.

-- 
Best regards,
Michał Górny





Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-21 Thread Michał Górny
On Thu, 2020-05-21 at 22:07 +0200, Toralf Förster wrote:
> On 5/21/20 11:43 AM, Michał Górny wrote:
> > On Thu, 2020-05-21 at 11:17 +0200, Toralf Förster wrote:
> > > On 5/21/20 10:47 AM, Michał Górny wrote:
> > > > TL;DR: I'm looking for opinions on how to protect goose from spam,
> > > > i.e. mass fake submissions.
> > > > 
> > > 
> > > I'd combine IP-limits with proof-of-work.
> > > CAPTCHA should be the very last option IMO.
> > > 
> > 
> > To be honest, I don't see the point for proof-of-work if we have IP
> > limits.
> > 
> 
> The PoW has to be done for every submission and should (somehow) include the
> IP address.
> So you have 2 barriers. Neither of them is perfect, but their combination is
> expensive.

No, one of them is expensive while the other is completely covered by
it.  I can't imagine requiring PoW so expensive that it would limit
requests more than reasonable IP limiting.

-- 
Best regards,
Michał Górny





Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-21 Thread Alec Warner
On Thu, May 21, 2020 at 1:13 PM Viktar Patotski 
wrote:

> Hi all,
>
> I believe that we have all forgotten about Donald Knuth: Premature
> optimisation is the root of all evil.
>
> We don't have "spam" yet, but we are already trying to protect. There
> might be cases when some systems will be posting stats more often than we
> want, but probably that will not harm us. Or this will be done by our main
> users who runs 1kk of gentoo installations and this "spam"  will be
> actually valuable. Moreover, nobody forces us to treat info from 'goose' as
> first priority, so we are still able to select on which packages to work.
> In short: this topic is not so important yet, I think.
>

I raised a similar question on irc and the conclusion was that 'it is good
to have ideas' and I don't necessarily disagree there[0]. We cannot build a
foolproof system, but some protections are feasible in some scenarios[1].

[0] Gentoo offers numerous no-login-required services; most of these are
read-only but they typically don't suffer from attacks; or at least, not
attacks that we need to respond to. The most obvious one of these is our
gentoo.org mail service which accepts unauthenticated email to gentoo.org.
Our anti-email-spam countermeasures are what I would call complex, but we
still employ broad measures when needed and the tradeoffs are similar to
the options for goose; e.g. if we are too broad we can block email from
large swaths of the internet.
[1] Bugzilla *has* recently been the target of spam attacks, it *has*
logins required (e.g. to create / modify bugs) and it has not stopped the
spammers from creating accounts. We have discussed different protections
for bugzilla, as it has different parameters. A basic bugzilla account
can't do all that much (you can't modify the bugs of others easily) and
spam posts are easily identified. This differs from goose, where
the powers of each token are the same (submit report) and it may be
difficult to tell an abusive report from a real report.


> Viktar
>
>
> On Thu, May 21, 2020, 16:28 Jaco Kroon  wrote:
>
>> Hi Michał,
>>
>> On 2020/05/21 13:02, Michał Górny wrote:
>> > On Thu, 2020-05-21 at 12:45 +0200, Jaco Kroon wrote:
>> >> Even for v4, as an attacker ... well, as I'm sitting here right now I've
>> >> got direct access to almost a /20 (4096 addresses).  I know a number of
>> >> people with larger scopes than that.  Use bot-nets and the scope goes up
>> >> even more.
>> > See how unfair the world is!  You are filling your bathtub with IP
>> > addresses, and my ISP has taken mine only recently.
>> I must admit, I work for an ISP :$
>> >>> Option 3: explicit CAPTCHA
>> >>> ==
>> >>> A traditional way of dealing with spam -- require every new system
>> >>> identifier to be confirmed by solving a CAPTCHA (or a few identifiers
>> >>> for one CAPTCHA).
>> >>>
>> >>> The advantage of this method is that it requires a real human work
>> >>> to be
>> >>> performed, effectively limiting the ability to submit spam.
>> >>>
>> >> Yea.  One would think.  CAPTCHAs are massively intrusive and in my
>> >> opinion more effort than they're worth.
>> >>
>> >> This may be beneficial to *generate* a token.  In other words - when
>> >> generating a token, that token needs to be registered by way of captcha.
>> >>
>> >>> Other ideas
>> >>> ===
>> >>> Do you have any other ideas on how we could resolve this?
>> >>>
>> >> Generated token + hardware based hash.
>> > How are you going to verify that the hardware-based hash is real,
>> > and not just a random value created to circumvent the protection?
>>
>> So the generation of the hash is more to validate that it's still on the
>> same installation (ie, not a cloned token).  Sorry if that wasn't clear,
>> so trying to solve two possible problems in one go.
>>
>> >
>> >>   Rate limit the combination to 1/day.
>> >>
>> >> Don't use included results until it's been kept up to date for a minimum
>> >> period.  Say updated at least 20 times in 30 days.
>> > For privacy reasons, we don't correlate the results.  So this is
>> > impossible to implement.
>>
>> Ok, but a token cannot (unless we issue it based on an email based
>> account) be linked back to a specific user, so does it matter if we
>> associate uploads with a token?
>>
>> >> The downside here is that many machines are not powered up at least once
>> >> a day to be able to perform that initial submission sequence.  So
>> >> perhaps it's a bit stringent.
>> > Exactly.  Even once a week is a bit risky but once a day is too narrow
>> > a period.
>> >
>> > To some degree, we could decide we don't care about exact numbers
>> > as much as some degree of weighed proportions.  This would mean that,
>> > say, people who submit daily get the count of 7, at the loss of people
>> > who don't run their machines that much.  It would effectively put more
>> > emphasis on more active users.  It's debatable whether 

Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-21 Thread Viktar Patotski
Hi all,

I believe that we have all forgotten about Donald Knuth: Premature
optimisation is the root of all evil.

We don't have "spam" yet, but we are already trying to protect against it.
There might be cases when some systems will be posting stats more often than
we want, but that probably will not harm us. Or this will be done by our main
users who run 1kk Gentoo installations, and this "spam" will actually be
valuable. Moreover, nobody forces us to treat info from 'goose' as first
priority, so we are still able to select which packages to work on. In
short: this topic is not so important yet, I think.

Viktar


On Thu, May 21, 2020, 16:28 Jaco Kroon  wrote:

> Hi Michał,
>
> On 2020/05/21 13:02, Michał Górny wrote:
> > On Thu, 2020-05-21 at 12:45 +0200, Jaco Kroon wrote:
> >> Even for v4, as an attacker ... well, as I'm sitting here right now I've
> >> got direct access to almost a /20 (4096 addresses).  I know a number of
> >> people with larger scopes than that.  Use bot-nets and the scope goes up
> >> even more.
> > See how unfair the world is!  You are filling your bathtub with IP
> > addresses, and my ISP has taken mine only recently.
> I must admit, I work for an ISP :$
> >>> Option 3: explicit CAPTCHA
> >>> ==
> >>> A traditional way of dealing with spam -- require every new system
> >>> identifier to be confirmed by solving a CAPTCHA (or a few identifiers
> >>> for one CAPTCHA).
> >>>
> >>> The advantage of this method is that it requires a real human work
> >>> to be
> >>> performed, effectively limiting the ability to submit spam.
> >>>
> >> Yea.  One would think.  CAPTCHAs are massively intrusive and in my
> >> opinion more effort than they're worth.
> >>
> >> This may be beneficial to *generate* a token.  In other words - when
> >> generating a token, that token needs to be registered by way of captcha.
> >>
> >>> Other ideas
> >>> ===
> >>> Do you have any other ideas on how we could resolve this?
> >>>
> >> Generated token + hardware based hash.
> > How are you going to verify that the hardware-based hash is real,
> > and not just a random value created to circumvent the protection?
>
> So the generation of the hash is more to validate that it's still on the
> same installation (ie, not a cloned token).  Sorry if that wasn't clear,
> so trying to solve two possible problems in one go.
>
> >
> >>   Rate limit the combination to 1/day.
> >>
> >> Don't use included results until it's been kept up to date for a minimum
> >> period.  Say updated at least 20 times in 30 days.
> > For privacy reasons, we don't correlate the results.  So this is
> > impossible to implement.
>
> Ok, but a token cannot (unless we issue it based on an email based
> account) be linked back to a specific user, so does it matter if we
> associate uploads with a token?
>
> >> The downside here is that many machines are not powered up at least once
> >> a day to be able to perform that initial submission sequence.  So
> >> perhaps it's a bit stringent.
> > Exactly.  Even once a week is a bit risky but once a day is too narrow
> > a period.
> >
> > To some degree, we could decide we don't care about exact numbers
> > as much as some degree of weighed proportions.  This would mean that,
> > say, people who submit daily get the count of 7, at the loss of people
> > who don't run their machines that much.  It would effectively put more
> > emphasis on more active users.  It's debatable whether this is desirable
> > or not.
> Decaying averages.  Simple to implement, don't need all historic data.
> >
> > Both the token and hardware hash can of course be tainted and is under
> >> "attacker control".
> > Exactly.  So it really looks like exercise for the sake of exercise.
>
> Unless tokens are *issued* as per the rest of my email you snipped
> away.  Wherein I proposed an issuing of both anonymous and non-anonymous
> tokens.
>
> Kind Regards,
> Jaco
>
>
>


Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-21 Thread Toralf Förster
On 5/21/20 11:43 AM, Michał Górny wrote:
> On Thu, 2020-05-21 at 11:17 +0200, Toralf Förster wrote:
>> On 5/21/20 10:47 AM, Michał Górny wrote:
>>> TL;DR: I'm looking for opinions on how to protect goose from spam,
>>> i.e. mass fake submissions.
>>>
>>
>> I'd combine IP-limits with proof-of-work.
>> CAPTCHA should be the very last option IMO.
>>
> 
> To be honest, I don't see the point for proof-of-work if we have IP
> limits.
> 

The PoW has to be done for every submission and should (somehow) include the
IP address.
So you have 2 barriers. Neither of them is perfect, but their combination is
expensive.

-- 
Toralf
PGP 23217DA7 9B888F45





Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-21 Thread Jaco Kroon
Hi Michał,

On 2020/05/21 13:02, Michał Górny wrote:
> On Thu, 2020-05-21 at 12:45 +0200, Jaco Kroon wrote:
>> Even for v4, as an attacker ... well, as I'm sitting here right now I've
>> got direct access to almost a /20 (4096 addresses).  I know a number of
>> people with larger scopes than that.  Use bot-nets and the scope goes up
>> even more.
> See how unfair the world is!  You are filling your bathtub with IP
> addresses, and my ISP has taken mine only recently.
I must admit, I work for an ISP :$
>>> Option 3: explicit CAPTCHA
>>> ==
>>> A traditional way of dealing with spam -- require every new system
>>> identifier to be confirmed by solving a CAPTCHA (or a few identifiers
>>> for one CAPTCHA).
>>>
>>> The advantage of this method is that it requires a real human work
>>> to be
>>> performed, effectively limiting the ability to submit spam.
>>>
>> Yea.  One would think.  CAPTCHAs are massively intrusive and in my
>> opinion more effort than they're worth.
>>
>> This may be beneficial to *generate* a token.  In other words - when
>> generating a token, that token needs to be registered by way of captcha.
>>
>>> Other ideas
>>> ===
>>> Do you have any other ideas on how we could resolve this?
>>>
>> Generated token + hardware based hash.
> How are you going to verify that the hardware-based hash is real,
> and not just a random value created to circumvent the protection?

So the generation of the hash is more to validate that it's still on the
same installation (i.e., not a cloned token).  Sorry if that wasn't clear;
I'm trying to solve two possible problems in one go.

>
>>   Rate limit the combination to 1/day.
>>
>> Don't use included results until it's been kept up to date for a minimum
>> period.  Say updated at least 20 times in 30 days.
> For privacy reasons, we don't correlate the results.  So this is
> impossible to implement.

Ok, but a token cannot (unless we issue it based on an email based
account) be linked back to a specific user, so does it matter if we
associate uploads with a token?

>> The downside here is that many machines are not powered up at least once
>> a day to be able to perform that initial submission sequence.  So
>> perhaps it's a bit stringent.
> Exactly.  Even once a week is a bit risky but once a day is too narrow
> a period.
>
> To some degree, we could decide we don't care about exact numbers
> as much as some degree of weighed proportions.  This would mean that,
> say, people who submit daily get the count of 7, at the loss of people
> who don't run their machines that much.  It would effectively put more
> emphasis on more active users.  It's debatable whether this is desirable
> or not.
Decaying averages.  Simple to implement, don't need all historic data.
>
>> Both the token and hardware hash can of course be tainted and is under
>> "attacker control".
> Exactly.  So it really looks like exercise for the sake of exercise.

Unless tokens are *issued* as per the rest of my email you snipped
away.  Wherein I proposed an issuing of both anonymous and non-anonymous
tokens.

Kind Regards,
Jaco




Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-21 Thread Michał Górny
On Fri, 2020-05-22 at 01:09 +1200, Kent Fredric wrote:
> On Thu, 21 May 2020 14:25:00 +0200
> Ulrich Mueller  wrote:
> 
> > That's why I said salted hash.
> 
> Even a salted hash becomes a trivial joke when the input data you're
> hashing has a *total* entropy of only 32bits.
> 

If anyone cares about the numbers, I've been able to crack my own IP
address (85.*) in 10 minutes using john with a trivial IP address
wordlist generator and a plain SHA-512 hash.  I suppose you could assume
that having salted hashes would mean up to 30 minutes per IP address but
that's still not much.  I suppose you could use Argon2 or some other
crazy hash but... where is this going, really?
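
To illustrate why this is cheap, here is a rough sketch of the same
idea in Python rather than john.  It assumes the hash is SHA-512 over
the salt followed by the dotted-quad string; the function names are
invented for this example.

```python
import hashlib
import ipaddress


def hash_ip(ip, salt=b""):
    """Hash an IP address the way a naive server might store it."""
    return hashlib.sha512(salt + ip.encode("ascii")).hexdigest()


def crack(target_hash, network, salt=b""):
    """Try every address in the network until the hash matches."""
    for addr in ipaddress.ip_network(network):
        candidate = str(addr)
        if hash_ip(candidate, salt) == target_hash:
            return candidate
    return None
```

A known salt only changes the input prefix, so the search space stays
at the size of the address range.  Only a deliberately slow hash
raises the cost per guess.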

-- 
Best regards,
Michał Górny





Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-21 Thread Kent Fredric
On Fri, 22 May 2020 01:38:02 +1200
Kent Fredric  wrote:

> So instead of the ID being generated locally, you'd send a request
> asking for an ID, it would send you the challenge math, you'd send the
> answer, and then you'd get your ID.

Additionally, you could even allow the client to pass a number, that
stipulates a desired level of trust, in exchange for a more expensive
computation.

If there was an ID generation option that allowed me to, once, request
a challenge that takes an hour to complete, in exchange for getting a
higher "trust" vector, I'd do that.

Then you could present reports and whittle the results down by minimum
trust level.

( And then after the fact, one can adjust the minimum trust level of a
UID key to submit, so if UID keys below a certain trust level become
problematic, you can easily start rejecting them, and demand they
re-key with a higher trust level )




Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-21 Thread Kent Fredric
On Thu, 21 May 2020 15:16:12 +0200
Michał Górny  wrote:

> Isn't the whole point of salted hash to use unique salts?

You'd think so, but I've seen too many pieces of code where the salt
was a hardcoded string right there in the hash generation.

md5sum( "SeKrIt\0" + pass  )

So I've learned to never assume that salts were unique per entry.
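
For contrast, a small sketch of what a unique salt per entry looks
like, here using BLAKE2b from Python's hashlib (chosen only for
illustration; the helper names are made up).

```python
import hashlib
import hmac
import secrets


def hash_entry(value):
    """Hash a value with a fresh random salt; store both results."""
    salt = secrets.token_bytes(16)  # BLAKE2b accepts up to 16 salt bytes
    digest = hashlib.blake2b(value, salt=salt).hexdigest()
    return salt, digest


def check_entry(value, salt, digest):
    """Recompute the hash with the stored salt and compare."""
    candidate = hashlib.blake2b(value, salt=salt).hexdigest()
    return hmac.compare_digest(candidate, digest)
```

This stops one precomputed table from covering the whole dataset, but
it does nothing about an input space of only 32 bits: each entry can
still be brute-forced on its own.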





Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-21 Thread Kent Fredric
On Thu, 21 May 2020 10:47:07 +0200
Michał Górny  wrote:

> An alternative of using a proof-of-work algorithm was suggested to me
> yesterday.  The idea is that every submission has to be accompanied with
> the result of some cumbersome calculation that can't be trivially run
> in parallel or optimized out to dedicated hardware.

If the proof-of-work mechanism was restricted to ID generation, then
the amortized cost would be acceptable.

So instead of the ID being generated locally, you'd send a request
asking for an ID, it would send you the challenge math, you'd send the
answer, and then you'd get your ID.

And the ID would be an encoded copy of the input vectors and
responses, a random chunk, and a chunk representing the signature of
IV/RESPONSE/RAND.

Or something like that.

But the gist is it would be impossible to use IDs not generated by the
server.
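
One way this could look, as a sketch only: the server signs
IV/RESPONSE/RAND with a secret key using HMAC-SHA256 and appends the
tag.  The field layout, separator and function names are assumptions
made up for the example.

```python
import hashlib
import hmac
import secrets


def _tag(server_key, body):
    return hmac.new(server_key, body.encode(), hashlib.sha256).hexdigest()


def issue_id(server_key, challenge, response):
    """Build an ID from challenge, response, random bytes and a tag."""
    rand = secrets.token_bytes(16)
    body = ".".join([challenge.hex(), response.hex(), rand.hex()])
    return body + "." + _tag(server_key, body)


def is_valid_id(server_key, ident):
    """Accept only IDs whose tag was produced with the server's key."""
    body, sep, tag = ident.rpartition(".")
    if not sep:
        return False
    expected = _tag(server_key, body)
    return hmac.compare_digest(expected.encode(), tag.encode())
```

With this shape the server does not need to store issued IDs to
recognise them; it only needs to keep its key secret.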

Then the spam factor to monitor wouldn't be submission rates, it would
be "New ID request" rates, as IDs should never need to be generated in
large volumes.

_And_ taking 5 minutes for ID generation wouldn't be a terrible thing.

( We could possibly collect anonymous stats on ID generation rates, and
average times to generate a response to a challenge, and use that to
determine what our challenge complexity should be )





Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-21 Thread Gordon Pettey
Require browser-based interaction to use the service. Do something funky
with AJAX so the page can't be properly used with curl or anything so that
manual effort is required to get the UUID to submit as. Only allow
registered UUIDs, and only allow one submission per day per UUID.
Sure, somebody can go to Mechanical Turk and pay a few cents to generate
fake submission IDs, but at least you have that tiny deterrent of "I've got
to pay 3 cents per spam account :(".

Maybe also add some minor tracking to the database if it isn't already
there to count submissions over time per UUID, and make the default cron
script weekly. If you see some UUID that is submitting at the maximum rate
of daily, you may lean towards accusations of spam.

On Thu, May 21, 2020 at 3:47 AM Michał Górny  wrote:

> Hi,
>
> TL;DR: I'm looking for opinions on how to protect goose from spam,
> i.e. mass fake submissions.
>
>
> Problem
> ===
> Goose currently lacks proper limiting of submitted data.  The only
> limiter currently in place is based on unique submitter id that is
> randomly generated at setup time and in full control of the submitter.
> This only protects against accidental duplicates but it can't protect
> against deliberate action.
>
> An attacker could easily submit thousands (millions?) of fake entries by
> issuing a lot of requests with different ids.  Creating them is
> as trivial as using successive numbers.  The potential damage includes:
>
> - distorting the metrics to the point of it being useless (even though
> some people consider it useless by design).
>
> - submitting lots of arbitrary data to cause DoS via growing
> the database until no disk space is left.
>
> - blocking large range of valid user ids, causing collisions with
> legitimate users more likely.
>
> I don't think it worthwhile to discuss the motivation for doing so:
> whether it would be someone wishing harm to Gentoo, disagreeing with
> the project or merely wanting to try and see if it would work.  The case
> of SKS keyservers teaches us a lesson that you can't leave holes like
> this open a long time because someone eventually will abuse them.
>
>
> Option 1: IP-based limiting
> ===
> The original idea was to set a hard limit of submissions per week based
> on IP address of the submitter.  This has (at least as far as IPv4 is
> concerned) the advantages that:
>
> - submitted has limited control of his IP address (i.e. he can't just
> submit stuff using arbitrary data)
>
> - IP address range is naturally limited
>
> - IP addresses have non-zero cost
>
> This method could strongly reduce the number of fake submissions one
> attacker could devise.  However, it has a few problems too:
>
> - a low limit would harm legitimate submitters sharing IP address
> (i.e. behind NAT)
>
> - it actively favors people with access to large number of IP addresses
>
> - it doesn't map cleanly to IPv6 (where some people may have just one IP
> address, and others may have whole /64 or /48 ranges)
>
> - it may cause problems for anonymizing network users (and we want to
> encourage Tor usage for privacy)
>
> All this considered, IP address limiting can't be used the primary
> method of preventing fake submissions.  However, I suppose it could work
> as an additional DoS prevention, limiting the number of submissions from
> a single address over short periods of time.
>
> Example: if we limit to 10 requests an hour, then a single IP can be
> used ot manufacture at most 240 submissions a day.  This might be
> sufficient to render them unusable but should keep the database
> reasonably safe.
>
>
> Option 2: proof-of-work
> ===
> An alternative of using a proof-of-work algorithm was suggested to me
> yesterday.  The idea is that every submission has to be accompanied with
> the result of some cumbersome calculation that can't be trivially run
> in parallel or optimized out to dedicated hardware.
>
> On the plus side, it would rely more on actual physical hardware than IP
> addresses provided by ISPs.  While it would be a waste of CPU time
> and memory, doing it just once a week wouldn't be that much harm.
>
> On the minus side, it would penalize people with weak hardware.
>
> For example, 'time hashcash -m -b 28 -r test' gives:
>
> - 34 s (-s estimated 38 s) on Ryzen 5 3600
>
> - 3 minutes (estimated 92 s) on some old 32-bit Celeron M
>
> At the same time, it would still permit a lot of fake submissions.  For
> example, randomx [1] claims to require 2G of memory in fast mode.  This
> would still allow me to use 7 threads.  If we adjusted the algorithm to
> take ~30 seconds, that means 7 submissions every 30 s, i.e. 20k
> submissions a day.
>
> So in the end, while this is interesting, it doesn't seem like
> a workable anti-spam measure.
>
>
> Option 3: explicit CAPTCHA
> ==
> A traditional way of dealing with spam -- require every new system
> identifier to be confirmed by solving a CAPTCHA (or a few 

Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-21 Thread Michał Górny
On Fri, 2020-05-22 at 01:09 +1200, Kent Fredric wrote:
> On Thu, 21 May 2020 14:25:00 +0200
> Ulrich Mueller  wrote:
> 
> > That's why I said salted hash.
> 
> Even a salted hash becomes a trivial joke when the input data you're
> hashing has a *total* entropy of only 32bits.
> 
> You at very least need a unique salt per hash, or you only have to
> expose the salt to create a rainbow table for the whole dataset.

Isn't the whole point of salted hash to use unique salts?

Nevertheless, it's still a near-trivial task.

-- 
Best regards,
Michał Górny





Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-21 Thread Kent Fredric
On Thu, 21 May 2020 14:25:00 +0200
Ulrich Mueller  wrote:

> That's why I said salted hash.

Even a salted hash becomes a trivial joke when the input data you're
hashing has a *total* entropy of only 32bits.

You at very least need a unique salt per hash, or you only have to
expose the salt to create a rainbow table for the whole dataset.




Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-21 Thread Ulrich Mueller
> On Thu, 21 May 2020, Robert Bridge wrote:

> There are only 4 billion to reverse, not that hard really with a
> rainbow table...

That's why I said salted hash.




Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-21 Thread Robert Bridge
There are only 4 billion to reverse, not that hard really with a rainbow
table...

On Thu, 21 May 2020 at 13:08, Michał Górny  wrote:

> On Thu, 2020-05-21 at 13:57 +0200, Ulrich Mueller wrote:
> > > > > > > On Thu, 21 May 2020, Robert Bridge wrote:
> > > On Thu, 21 May 2020 at 09:47, Michał Górny  wrote:
> > > > Option 1: IP-based limiting
> > > > ===
> > > >
> > > Preface this with IANAL, check with your own legal counsel...
> > > While IP address based methods might be attractive  technically, do
> > > remember that an IP address is considered Personally Identifiable in
> > > European Data Protection law.
> > > The fact submissions require an action by the user will probably be
> > > sufficient to be explicit consent, any system storing these details
> should
> > > allow for the use to revoke their consent: If you collect anything
> > > personally identifiable, you will need to provide a mechanism for
> users to
> > > request the removal of all their submissions.
> > > Tread carefully with this project. :)
> >
> > You don't have to store any IP addresses, you can store a cryptographic
> > hash like their b2sum (salted if necessary).
> >
>
> Yes, this is as great as storing hashes of phone numbers ;-).
>
> --
> Best regards,
> Michał Górny
>
>


Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-21 Thread Michał Górny
On Thu, 2020-05-21 at 13:57 +0200, Ulrich Mueller wrote:
> > > > > > On Thu, 21 May 2020, Robert Bridge wrote:
> > On Thu, 21 May 2020 at 09:47, Michał Górny  wrote:
> > > Option 1: IP-based limiting
> > > ===
> > > 
> > Preface this with IANAL, check with your own legal counsel...
> > While IP address based methods might be attractive  technically, do
> > remember that an IP address is considered Personally Identifiable in
> > European Data Protection law.
> > The fact submissions require an action by the user will probably be
> > sufficient to be explicit consent, any system storing these details should
> > allow for the use to revoke their consent: If you collect anything
> > personally identifiable, you will need to provide a mechanism for users to
> > request the removal of all their submissions.
> > Tread carefully with this project. :)
> 
> You don't have to store any IP addresses, you can store a cryptographic
> hash like their b2sum (salted if necessary).
> 

Yes, this is as great as storing hashes of phone numbers ;-).

-- 
Best regards,
Michał Górny





Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-21 Thread Ulrich Mueller
> On Thu, 21 May 2020, Robert Bridge wrote:

> On Thu, 21 May 2020 at 09:47, Michał Górny  wrote:
>> 
>> Option 1: IP-based limiting
>> ===
>> 

> Preface this with IANAL, check with your own legal counsel...

> While IP address based methods might be attractive  technically, do
> remember that an IP address is considered Personally Identifiable in
> European Data Protection law.

> The fact submissions require an action by the user will probably be
> sufficient to be explicit consent, any system storing these details should
> allow for the use to revoke their consent: If you collect anything
> personally identifiable, you will need to provide a mechanism for users to
> request the removal of all their submissions.

> Tread carefully with this project. :)

You don't have to store any IP addresses, you can store a cryptographic
hash like their b2sum (salted if necessary).

Ulrich




Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-21 Thread Michał Górny
On Thu, 2020-05-21 at 12:33 +0100, Robert Bridge wrote:
> On Thu, 21 May 2020 at 09:47, Michał Górny  wrote:
> 
> > Option 1: IP-based limiting
> > ===
> > 
> 
> Preface this with IANAL, check with your own legal counsel...
> 
> While IP address based methods might be attractive  technically, do
> remember that an IP address is considered Personally Identifiable in
> European Data Protection law.
> 
> The fact submissions require an action by the user will probably be
> sufficient to be explicit consent, any system storing these details should
> allow for the use to revoke their consent: If you collect anything
> personally identifiable, you will need to provide a mechanism for users to
> request the removal of all their submissions.
> 
> Tread carefully with this project. :)

All the data collected is set to expire in 7 days.  The 'privacy-first'
statement in the project description is there for a reason ;-).

-- 
Best regards,
Michał Górny





Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-21 Thread Robert Bridge
On Thu, 21 May 2020 at 09:47, Michał Górny  wrote:

>
> Option 1: IP-based limiting
> ===
>

Preface this with IANAL, check with your own legal counsel...

While IP address based methods might be attractive  technically, do
remember that an IP address is considered Personally Identifiable in
European Data Protection law.

The fact submissions require an action by the user will probably be
sufficient to be explicit consent, any system storing these details should
allow for the use to revoke their consent: If you collect anything
personally identifiable, you will need to provide a mechanism for users to
request the removal of all their submissions.

Tread carefully with this project. :)


Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-21 Thread Fabian Groffen
Hi,

On 21-05-2020 10:47:07 +0200, Michał Górny wrote:
> Hi,
> 
> TL;DR: I'm looking for opinions on how to protect goose from spam,
> i.e. mass fake submissions.
> 
> 
> Problem
> ===
> Goose currently lacks proper limiting of submitted data.  The only
> limiter currently in place is based on unique submitter id that is
> randomly generated at setup time and in full control of the submitter. 
> This only protects against accidental duplicates but it can't protect
> against deliberate action.
> 
> An attacker could easily submit thousands (millions?) of fake entries by
> issuing a lot of requests with different ids.  Creating them is
> as trivial as using successive numbers.  The potential damage includes:

Perhaps you could consider something like a reputation system.  I'm
thinking of things like only publishing results after X hours when an id
is new (graylisting?), and gradually building up "trust" from there.
During those X hours you could then flag something as potential fraud if
you see a new user id, loads of submissions from the same IP, etc.,
which is what you describe below, I think.
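
As a very small sketch of the graylisting part, assuming the server
keeps a first-seen timestamp per id (which may or may not fit the
privacy goals); the function name and the 48 hour default are invented.

```python
from datetime import datetime, timedelta


def publishable_ids(first_seen, now, quarantine=timedelta(hours=48)):
    """Return the ids whose first submission is older than the quarantine.

    first_seen maps an id to the datetime of its first submission.
    """
    return {uid for uid, seen in first_seen.items()
            if now - seen >= quarantine}
```

Results from ids outside the returned set would be held back, leaving
time to look for suspicious patterns before anything is published.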

The reputation logic could further build on whether a submission appears
to follow a norm, e.g. compilation times that fall within the average
for the given cpu/arch configuration.
Another way would be to look for submissions for packages that are
actually bumped/stabilised in the tree, to score an id as more likely to
be genuine.

I think it will be a tad complicated, but static limiting might be as
easy to circumvent as you block it, as has been pointed out already.

Perhaps it is fruitful to think of the reverse: when is something
obviously bad?  When a single (obscure?) package is suddenly reported
many times by new ids?  When a single id generates hundreds or thousands
of package submissions (is it a misconfigured cluster, many identical
packages, or what seems to be an a to z scan)?
Thing is, would a single "fake" submission (that IMO is unlikely ever
to be noticed) screw up the overall state of things?  I think the
fuzziness of the system as a whole should cover for these.  It is pure
poisoning that should be possible to mitigate, and I agree with you that
preferably most of it is blocked by default.  Fact probably is that it
will happen nevertheless.

That brings me to the thought: are there things that can be done to make
sure a fraudulent action can be easily undone or negated somehow?  E.g.
should a log be kept, or some action to rollback and replay.  Sorry to
have no concrete examples here.

Fabian

> 
> - distorting the metrics to the point of it being useless (even though
> some people consider it useless by design).
> 
> - submitting lots of arbitrary data to cause DoS via growing
> the database until no disk space is left.
> 
> - blocking large range of valid user ids, causing collisions with
> legitimate users more likely.
> 
> I don't think it worthwhile to discuss the motivation for doing so:
> whether it would be someone wishing harm to Gentoo, disagreeing with
> the project or merely wanting to try and see if it would work.  The case
> of SKS keyservers teaches us a lesson that you can't leave holes like
> this open a long time because someone eventually will abuse them.
> 
> 
> Option 1: IP-based limiting
> ===
> The original idea was to set a hard limit of submissions per week based
> on IP address of the submitter.  This has (at least as far as IPv4 is
> concerned) the advantages that:
> 
> - submitted has limited control of his IP address (i.e. he can't just
> submit stuff using arbitrary data)
> 
> - IP address range is naturally limited
> 
> - IP addresses have non-zero cost
> 
> This method could strongly reduce the number of fake submissions one
> attacker could devise.  However, it has a few problems too:
> 
> - a low limit would harm legitimate submitters sharing IP address
> (i.e. behind NAT)
> 
> - it actively favors people with access to large number of IP addresses
> 
> - it doesn't map cleanly to IPv6 (where some people may have just one IP
> address, and others may have whole /64 or /48 ranges)
> 
> - it may cause problems for anonymizing network users (and we want to
> encourage Tor usage for privacy)
> 
> All this considered, IP address limiting can't be used the primary
> method of preventing fake submissions.  However, I suppose it could work
> as an additional DoS prevention, limiting the number of submissions from
> a single address over short periods of time.
> 
> Example: if we limit to 10 requests an hour, then a single IP can be
> used ot manufacture at most 240 submissions a day.  This might be
> sufficient to render them unusable but should keep the database
> reasonably safe.
> 
> 
> Option 2: proof-of-work
> ===
> An alternative of using a proof-of-work algorithm was suggested to me
> yesterday.  The idea is that every submission has to be accompanied with
> the result of some cumbersome calculation that can't be trivially run
> in parallel or optimized out to 

Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-21 Thread Michał Górny
On Thu, 2020-05-21 at 12:45 +0200, Jaco Kroon wrote:
> Even for v4, as an attacker ... well, as I'm sitting here right now I've
> got direct access to almost a /20 (4096 addresses).  I know a number of
> people with larger scopes than that.  Use bot-nets and the scope goes up
> even more.

See how unfair the world is!  You are filling your bathtub with IP
addresses, and my ISP has taken mine only recently.

> > Option 3: explicit CAPTCHA
> > ==
> > A traditional way of dealing with spam -- require every new system
> > identifier to be confirmed by solving a CAPTCHA (or a few identifiers
> > for one CAPTCHA).
> > 
> > The advantage of this method is that it requires a real human work
> > to be
> > performed, effectively limiting the ability to submit spam.
> > 
> Yea.  One would think.  CAPTCHAs are massively intrusive and in my
> opinion more effort than they're worth.
> 
> This may be beneficial to *generate* a token.  In other words - when
> generating a token, that token needs to be registered by way of capthca.
> 
> > 
> > Other ideas
> > ===
> > Do you have any other ideas on how we could resolve this?
> > 
> Generated token + hardware based hash.

How are you going to verify that the hardware-based hash is real,
and not just a random value created to circumvent the protection?

>   Rate limit the combination to 1/day.
> 
> Don't use included results until it's been kept up to date for a minimum
> period.  Say updated at least 20 times 30 days.

For privacy reasons, we don't correlate the results.  So this is
impossible to implement.

> The downside here is that many machines are not powered up at least once
> a day to be able to perform that initial submission sequence.  So
> perhaps it's a bit stringent.

Exactly.  Even once a week is a bit risky but once a day is too narrow
a period.

To some degree, we could decide we don't care about exact numbers
as much as some degree of weighted proportions.  This would mean that,
say, people who submit daily get the count of 7, at the loss of people
who don't run their machines that much.  It would effectively put more
emphasis on more active users.  It's debatable whether this is desirable
or not.

> 
> Both the token and hardware hash can of course be tainted and is under
> "attacker control".

Exactly.  So it really looks like exercise for the sake of exercise.


-- 
Best regards,
Michał Górny





Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-21 Thread Jaco Kroon
Hi,

On 2020/05/21 11:48, Tomas Mozes wrote:
>
>
> On Thu, May 21, 2020 at 10:47 AM Michał Górny wrote:
>
> Hi,
>
> TL;DR: I'm looking for opinions on how to protect goose from spam,
> i.e. mass fake submissions.
> Option 1: IP-based limiting
> ===
> The original idea was to set a hard limit of submissions per week
> based
> on IP address of the submitter.  This has (at least as far as IPv4 is
> concerned) the advantages that:
>
> - submitted has limited control of his IP address (i.e. he can't just
> submit stuff using arbitrary data)
>
> - IP address range is naturally limited
>
> - IP addresses have non-zero cost
>
> This method could strongly reduce the number of fake submissions one
> attacker could devise.  However, it has a few problems too:
>
> - a low limit would harm legitimate submitters sharing IP address
> (i.e. behind NAT)
>
> - it actively favors people with access to large number of IP
> addresses
>
> - it doesn't map cleanly to IPv6 (where some people may have just
> one IP
> address, and others may have whole /64 or /48 ranges)
>
So this gets tricky.  A single host could as you say either have a /128
or possibly a whole /64.  ISPs are "encouraged" to use a single /64 per
connecting user on the access layer (can be link-local technically, but
it seems to be frowned upon).  Generally then you're encouraged to
delegate a /56 to the router, but at the very least a /60.  Some
recommendations even state to delegate a /48 at this point.  That's
outright crazy seeing that a /48 essentially boils down to 65536
individual LANs behind the router, /56 is 256 LANs which frankly I
reckon is adequate.  The only advantage of /48 is cleaner boundary
mapping onto : separators.  This is OPINION.  I also use "encouraged"
since these are

Short version:  If you're willing to rate limit on larger blocks it
could work.  /64s are probably OK, but most hosts will typically have a
/128, so you'll be limiting LANs, and switching IPs is trivial as you'd
have access to at least a /64 (or ~18.45 * 10^18).

You could have multiple layers ... ie:

each /128 gets 1 or 2 submissions per day
each /64 gets 200/day
each /56 gets 400/day
each /48 gets 600/day

But now you need to keep bucket loads of data ... so DOS on the rate
limiting mechanism itself becomes possible unless you're happy to limit
the size of the tables and discard "low risk of exceeding entries" somehow.
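
A rough sketch of the layered limits above, using Python's ipaddress
module.  The class name is made up, counters are kept in memory and
never reset, and nothing here bounds the table size, so it only shows
the bookkeeping and not a deployable limiter.

```python
import ipaddress
from collections import Counter

# Prefix length -> submissions allowed per day (numbers from the list above).
LIMITS = {128: 2, 64: 200, 56: 400, 48: 600}


class PrefixLimiter:
    def __init__(self, limits=LIMITS):
        self.limits = dict(limits)
        self.counts = Counter()

    def _networks(self, address):
        """Yield the enclosing network for every configured prefix length."""
        addr = ipaddress.IPv6Address(address)
        for prefix_len in self.limits:
            yield ipaddress.IPv6Network((addr, prefix_len), strict=False)

    def allow(self, address):
        """Count the submission unless any enclosing prefix is exhausted."""
        networks = list(self._networks(address))
        if any(self.counts[n] >= self.limits[n.prefixlen] for n in networks):
            return False
        for n in networks:
            self.counts[n] += 1
        return True
```

Every accepted submission touches one counter per layer, which is
exactly where the table growth comes from.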

Even for v4, as an attacker ... well, as I'm sitting here right now I've
got direct access to almost a /20 (4096 addresses).  I know a number of
people with larger scopes than that.  Use bot-nets and the scope goes up
even more.

>
>
> Option 2: proof-of-work
> ===
> An alternative of using a proof-of-work algorithm was suggested to me
> yesterday.  The idea is that every submission has to be
> accompanied with
> the result of some cumbersome calculation that can't be trivially run
> in parallel or optimized out to dedicated hardware.
>
> On the plus side, it would rely more on actual physical hardware
> than IP
> addresses provided by ISPs.  While it would be a waste of CPU time
> and memory, doing it just once a week wouldn't be that much harm.
>
> On the minus side, it would penalize people with weak hardware.
>
> For example, 'time hashcash -m -b 28 -r test' gives:
>
> - 34 s (-s estimated 38 s) on Ryzen 5 3600
>
> - 3 minutes (estimated 92 s) on some old 32-bit Celeron M
>
> At the same time, it would still permit a lot of fake
> submissions.  For
> example, randomx [1] claims to require 2G of memory in fast mode. 
> This
> would still allow me to use 7 threads.  If we adjusted the
> algorithm to
> take ~30 seconds, that means 7 submissions every 30 s, i.e. 20k
> submissions a day.
>
> So in the end, while this is interesting, it doesn't seem like
> a workable anti-spam measure.
>
Indeed.  This was considered for email SPAM protection as well about two
decades back.  Amongst other proposals.

Perhaps some crazy proof-of-work for registration of a token, but given
how cheap it is to lease CPU cycles you'd need to balance the effects. 
And given bot nets ... using other people's hardware for proof-of-work
doesn't seem inconceivable (bitcoin miners embedded on web pages being
an example of the stuff that people pull).

>
>
> Option 3: explicit CAPTCHA
> ==
> A traditional way of dealing with spam -- require every new system
> identifier to be confirmed by solving a CAPTCHA (or a few identifiers
> for one CAPTCHA).
>
> The advantage of this method is that it requires a real human work
> to be
> performed, effectively limiting the ability to submit spam.
>
Yea.  One would think.  CAPTCHAs are massively intrusive and in my
opinion more effort 

Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-21 Thread Tomas Mozes
On Thu, May 21, 2020 at 12:10 PM Michał Górny  wrote:

> On Thu, 2020-05-21 at 11:48 +0200, Tomas Mozes wrote:
> > On Thu, May 21, 2020 at 10:47 AM Michał Górny  wrote:
> >
> > > Hi,
> > >
> > > TL;DR: I'm looking for opinions on how to protect goose from spam,
> > > i.e. mass fake submissions.
> > >
> > >
> > > Problem
> > > ===
> > > Goose currently lacks proper limiting of submitted data.  The only
> > > limiter currently in place is based on unique submitter id that is
> > > randomly generated at setup time and in full control of the submitter.
> > > This only protects against accidental duplicates but it can't protect
> > > against deliberate action.
> > >
> > > An attacker could easily submit thousands (millions?) of fake entries
> by
> > > issuing a lot of requests with different ids.  Creating them is
> > > as trivial as using successive numbers.  The potential damage includes:
> > >
> > > - distorting the metrics to the point of it being useless (even though
> > > some people consider it useless by design).
> > >
> > > - submitting lots of arbitrary data to cause DoS via growing
> > > the database until no disk space is left.
> > >
> > > - blocking large range of valid user ids, causing collisions with
> > > legitimate users more likely.
> > >
> > > I don't think it worthwhile to discuss the motivation for doing so:
> > > whether it would be someone wishing harm to Gentoo, disagreeing with
> > > the project or merely wanting to try and see if it would work.  The
> case
> > > of SKS keyservers teaches us a lesson that you can't leave holes like
> > > this open a long time because someone eventually will abuse them.
> > >
> > >
> > > Option 1: IP-based limiting
> > > ===
> > > The original idea was to set a hard limit of submissions per week based
> > > on IP address of the submitter.  This has (at least as far as IPv4 is
> > > concerned) the advantages that:
> > >
> > > - submitted has limited control of his IP address (i.e. he can't just
> > > submit stuff using arbitrary data)
> > >
> > > - IP address range is naturally limited
> > >
> > > - IP addresses have non-zero cost
> > >
> > > This method could strongly reduce the number of fake submissions one
> > > attacker could devise.  However, it has a few problems too:
> > >
> > > - a low limit would harm legitimate submitters sharing IP address
> > > (i.e. behind NAT)
> > >
> > > - it actively favors people with access to large number of IP addresses
> > >
> > > - it doesn't map cleanly to IPv6 (where some people may have just one
> IP
> > > address, and others may have whole /64 or /48 ranges)
> > >
> > > - it may cause problems for anonymizing network users (and we want to
> > > encourage Tor usage for privacy)
> > >
> > > All this considered, IP address limiting can't be used the primary
> > > method of preventing fake submissions.  However, I suppose it could
> work
> > > as an additional DoS prevention, limiting the number of submissions
> from
> > > a single address over short periods of time.
> > >
> > > Example: if we limit to 10 requests an hour, then a single IP can be
> > > used ot manufacture at most 240 submissions a day.  This might be
> > > sufficient to render them unusable but should keep the database
> > > reasonably safe.
> > >
> > >
> > > Option 2: proof-of-work
> > > ===
> > > An alternative of using a proof-of-work algorithm was suggested to me
> > > yesterday.  The idea is that every submission has to be accompanied
> with
> > > the result of some cumbersome calculation that can't be trivially run
> > > in parallel or optimized out to dedicated hardware.
> > >
> > > On the plus side, it would rely more on actual physical hardware than IP
> > > addresses provided by ISPs.  While it would be a waste of CPU time
> > > and memory, doing it just once a week wouldn't be that much harm.
> > >
> > > On the minus side, it would penalize people with weak hardware.
> > >
> > > For example, 'time hashcash -m -b 28 -r test' gives:
> > >
> > > - 34 s (-s estimated 38 s) on Ryzen 5 3600
> > >
> > > - 3 minutes (estimated 92 s) on some old 32-bit Celeron M
> > >
> > > At the same time, it would still permit a lot of fake submissions.  For
> > > example, randomx [1] claims to require 2G of memory in fast mode.  This
> > > would still allow me to use 7 threads.  If we adjusted the algorithm to
> > > take ~30 seconds, that means 7 submissions every 30 s, i.e. 20k
> > > submissions a day.
> > >
> > > So in the end, while this is interesting, it doesn't seem like
> > > a workable anti-spam measure.
> > >
> > >
> > > Option 3: explicit CAPTCHA
> > > ==
> > > A traditional way of dealing with spam -- require every new system
> > > identifier to be confirmed by solving a CAPTCHA (or a few identifiers
> > > for one CAPTCHA).
> > >
> > > The advantage of this method is that it requires real human work to be
> > > performed, effectively limiting the ability to submit spam.

Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-21 Thread Michał Górny
On Thu, 2020-05-21 at 11:48 +0200, Tomas Mozes wrote:
> On Thu, May 21, 2020 at 10:47 AM Michał Górny  wrote:
> 
> > Hi,
> > 
> > TL;DR: I'm looking for opinions on how to protect goose from spam,
> > i.e. mass fake submissions.
> > 
> > 
> > Problem
> > ===
> > Goose currently lacks proper limiting of submitted data.  The only
> > limiter currently in place is based on unique submitter id that is
> > randomly generated at setup time and in full control of the submitter.
> > This only protects against accidental duplicates but it can't protect
> > against deliberate action.
> > 
> > An attacker could easily submit thousands (millions?) of fake entries by
> > issuing a lot of requests with different ids.  Creating them is
> > as trivial as using successive numbers.  The potential damage includes:
> > 
> > - distorting the metrics to the point of it being useless (even though
> > some people consider it useless by design).
> > 
> > - submitting lots of arbitrary data to cause DoS via growing
> > the database until no disk space is left.
> > 
> > - blocking a large range of valid user ids, making collisions with
> > legitimate users more likely.
> > 
> > I don't think it worthwhile to discuss the motivation for doing so:
> > whether it would be someone wishing harm to Gentoo, disagreeing with
> > the project or merely wanting to try and see if it would work.  The case
> > of SKS keyservers teaches us a lesson that you can't leave holes like
> > this open a long time because someone eventually will abuse them.
> > 
> > 
> > Option 1: IP-based limiting
> > ===
> > The original idea was to set a hard limit of submissions per week based
> > on IP address of the submitter.  This has (at least as far as IPv4 is
> > concerned) the advantages that:
> > 
> > - submitter has limited control of his IP address (i.e. he can't just
> > submit stuff using arbitrary data)
> > 
> > - IP address range is naturally limited
> > 
> > - IP addresses have non-zero cost
> > 
> > This method could strongly reduce the number of fake submissions one
> > attacker could devise.  However, it has a few problems too:
> > 
> > - a low limit would harm legitimate submitters sharing IP address
> > (e.g. behind NAT)
> > 
> > - it actively favors people with access to large number of IP addresses
> > 
> > - it doesn't map cleanly to IPv6 (where some people may have just one IP
> > address, and others may have whole /64 or /48 ranges)
> > 
> > - it may cause problems for anonymizing network users (and we want to
> > encourage Tor usage for privacy)
> > 
> > All this considered, IP address limiting can't be used as the primary
> > method of preventing fake submissions.  However, I suppose it could work
> > as an additional DoS prevention, limiting the number of submissions from
> > a single address over short periods of time.
> > 
> > Example: if we limit to 10 requests an hour, then a single IP can be
> > used to manufacture at most 240 submissions a day.  This might be
> > sufficient to render them unusable but should keep the database
> > reasonably safe.
> > 
> > 
> > Option 2: proof-of-work
> > ===
> > An alternative of using a proof-of-work algorithm was suggested to me
> > yesterday.  The idea is that every submission has to be accompanied with
> > the result of some cumbersome calculation that can't be trivially run
> > in parallel or optimized out to dedicated hardware.
> > 
> > On the plus side, it would rely more on actual physical hardware than IP
> > addresses provided by ISPs.  While it would be a waste of CPU time
> > and memory, doing it just once a week wouldn't be that much harm.
> > 
> > On the minus side, it would penalize people with weak hardware.
> > 
> > For example, 'time hashcash -m -b 28 -r test' gives:
> > 
> > - 34 s (-s estimated 38 s) on Ryzen 5 3600
> > 
> > - 3 minutes (estimated 92 s) on some old 32-bit Celeron M
> > 
> > At the same time, it would still permit a lot of fake submissions.  For
> > example, randomx [1] claims to require 2G of memory in fast mode.  This
> > would still allow me to use 7 threads.  If we adjusted the algorithm to
> > take ~30 seconds, that means 7 submissions every 30 s, i.e. 20k
> > submissions a day.
> > 
> > So in the end, while this is interesting, it doesn't seem like
> > a workable anti-spam measure.
> > 
> > 
> > Option 3: explicit CAPTCHA
> > ==
> > A traditional way of dealing with spam -- require every new system
> > identifier to be confirmed by solving a CAPTCHA (or a few identifiers
> > for one CAPTCHA).
> > 
> > The advantage of this method is that it requires real human work to be
> > performed, effectively limiting the ability to submit spam.
> > The disadvantage is that it is cumbersome to users, so many of them will
> > just give up on participating.
> > 
> > 
> > Other ideas
> > ===
> > Do you have any other ideas on how we could resolve this?
> > 
> > 
> > [1] 

Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-21 Thread Tomas Mozes
On Thu, May 21, 2020 at 10:47 AM Michał Górny  wrote:

> Hi,
>
> TL;DR: I'm looking for opinions on how to protect goose from spam,
> i.e. mass fake submissions.
>
>
> Problem
> ===
> Goose currently lacks proper limiting of submitted data.  The only
> limiter currently in place is based on unique submitter id that is
> randomly generated at setup time and in full control of the submitter.
> This only protects against accidental duplicates but it can't protect
> against deliberate action.
>
> An attacker could easily submit thousands (millions?) of fake entries by
> issuing a lot of requests with different ids.  Creating them is
> as trivial as using successive numbers.  The potential damage includes:
>
> - distorting the metrics to the point of it being useless (even though
> some people consider it useless by design).
>
> - submitting lots of arbitrary data to cause DoS via growing
> the database until no disk space is left.
>
> - blocking a large range of valid user ids, making collisions with
> legitimate users more likely.
>
> I don't think it worthwhile to discuss the motivation for doing so:
> whether it would be someone wishing harm to Gentoo, disagreeing with
> the project or merely wanting to try and see if it would work.  The case
> of SKS keyservers teaches us a lesson that you can't leave holes like
> this open a long time because someone eventually will abuse them.
>
>
> Option 1: IP-based limiting
> ===
> The original idea was to set a hard limit of submissions per week based
> on IP address of the submitter.  This has (at least as far as IPv4 is
> concerned) the advantages that:
>
> - submitter has limited control of his IP address (i.e. he can't just
> submit stuff using arbitrary data)
>
> - IP address range is naturally limited
>
> - IP addresses have non-zero cost
>
> This method could strongly reduce the number of fake submissions one
> attacker could devise.  However, it has a few problems too:
>
> - a low limit would harm legitimate submitters sharing IP address
> (e.g. behind NAT)
>
> - it actively favors people with access to large number of IP addresses
>
> - it doesn't map cleanly to IPv6 (where some people may have just one IP
> address, and others may have whole /64 or /48 ranges)
>
> - it may cause problems for anonymizing network users (and we want to
> encourage Tor usage for privacy)
>
> All this considered, IP address limiting can't be used as the primary
> method of preventing fake submissions.  However, I suppose it could work
> as an additional DoS prevention, limiting the number of submissions from
> a single address over short periods of time.
>
> Example: if we limit to 10 requests an hour, then a single IP can be
> used to manufacture at most 240 submissions a day.  This might be
> sufficient to render them unusable but should keep the database
> reasonably safe.
>
>
> Option 2: proof-of-work
> ===
> An alternative of using a proof-of-work algorithm was suggested to me
> yesterday.  The idea is that every submission has to be accompanied with
> the result of some cumbersome calculation that can't be trivially run
> in parallel or optimized out to dedicated hardware.
>
> On the plus side, it would rely more on actual physical hardware than IP
> addresses provided by ISPs.  While it would be a waste of CPU time
> and memory, doing it just once a week wouldn't be that much harm.
>
> On the minus side, it would penalize people with weak hardware.
>
> For example, 'time hashcash -m -b 28 -r test' gives:
>
> - 34 s (-s estimated 38 s) on Ryzen 5 3600
>
> - 3 minutes (estimated 92 s) on some old 32-bit Celeron M
>
> At the same time, it would still permit a lot of fake submissions.  For
> example, randomx [1] claims to require 2G of memory in fast mode.  This
> would still allow me to use 7 threads.  If we adjusted the algorithm to
> take ~30 seconds, that means 7 submissions every 30 s, i.e. 20k
> submissions a day.
>
> So in the end, while this is interesting, it doesn't seem like
> a workable anti-spam measure.
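
To make the proof-of-work idea and the throughput estimate concrete, here is
a minimal hashcash-style sketch. It is only an illustration under stated
assumptions: it uses SHA-256 and an ad-hoc "challenge:nonce" format rather
than the real hashcash stamp format (which is SHA-1 based), it is not
memory-hard like RandomX, and the function names are hypothetical.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def _digest(challenge: str, nonce: int) -> bytes:
    return hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()


def solve(challenge: str, bits: int) -> int:
    """Brute-force a nonce; expected cost is about 2**bits hash calls."""
    for nonce in count():
        if leading_zero_bits(_digest(challenge, nonce)) >= bits:
            return nonce


def verify(challenge: str, nonce: int, bits: int) -> bool:
    """Checking a proof costs a single hash call."""
    return leading_zero_bits(_digest(challenge, nonce)) >= bits


def submissions_per_day(threads: int, seconds_per_proof: float) -> int:
    """Throughput of an attacker running `threads` solvers in parallel."""
    return int(threads * 86400 / seconds_per_proof)
```

For instance, verify("test", solve("test", 12), 12) holds after a few
thousand hashes, and submissions_per_day(7, 30) evaluates to 20160, which is
the "20k submissions a day" figure above. A plain hash puzzle like this one
parallelizes trivially, which is why a memory-hard function was considered
in the first place.
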
>
>
> Option 3: explicit CAPTCHA
> ==
> A traditional way of dealing with spam -- require every new system
> identifier to be confirmed by solving a CAPTCHA (or a few identifiers
> for one CAPTCHA).
>
> The advantage of this method is that it requires real human work to be
> performed, effectively limiting the ability to submit spam.
> The disadvantage is that it is cumbersome to users, so many of them will
> just give up on participating.
>
>
> Other ideas
> ===
> Do you have any other ideas on how we could resolve this?
>
>
> [1] https://github.com/tevador/RandomX
>
>
> --
> Best regards,
> Michał Górny
>



Sadly, the problem with IP addresses (in this case) is that they are
anonymous. One can easily start an attack with thousands of IPs (all around
the world).

One solution would be to introduce user accounts:
- one needs to register with an email
- you can rate limit based on 

Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-21 Thread Michał Górny
On Thu, 2020-05-21 at 11:17 +0200, Toralf Förster wrote:
> On 5/21/20 10:47 AM, Michał Górny wrote:
> > TL;DR: I'm looking for opinions on how to protect goose from spam,
> > i.e. mass fake submissions.
> > 
> 
> I'd combine IP-limits with proof-of-work.
> CAPTCHA should be the very last option IMO.
> 

To be honest, I don't see the point for proof-of-work if we have IP
limits.

-- 
Best regards,
Michał Górny



signature.asc
Description: This is a digitally signed message part


Re: [gentoo-dev] [RFC] Anti-spam for goose

2020-05-21 Thread Toralf Förster
On 5/21/20 10:47 AM, Michał Górny wrote:
> TL;DR: I'm looking for opinions on how to protect goose from spam,
> i.e. mass fake submissions.
> 

I'd combine IP-limits with proof-of-work.
CAPTCHA should be the very last option IMO.

-- 
Toralf
PGP 23217DA7 9B888F45



signature.asc
Description: OpenPGP digital signature