I agree with rate-limiting those without some sort of ID (login or API key).

As Oliver said, big (ab)users can massively skew our stats, often by
themselves. But hordes of upper middle volume bots (way too high for a
human, nowhere near the max for a superstar bot) can have a large
cumulative effect, too. We can't track them down individually, or even
detect that they are there because they are "only" involved in a fraction
of a percent of traffic—but a hundred such bots add up to a significant
skew, and reasonable rate limits could knock them down to manageable levels.
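
To make "reasonable rate limits" a bit more concrete, here is a minimal
token-bucket sketch in Python. Everything in it is an assumption for
illustration: the class names, the client key format, and the specific
limits are made up, and this is not how any existing Wikimedia service
throttles. The only idea it shows is that clients without an ID get a small
bucket while identified clients get a larger one.

```python
import time


class TokenBucket:
    """Allows `rate` requests/second on average, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so a new client can burst once
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Hypothetical limits as (requests per second, burst size).
ANONYMOUS_LIMIT = (1.0, 5.0)
IDENTIFIED_LIMIT = (10.0, 50.0)


class RateLimiter:
    """Keeps one bucket per client key (for example an IP or an API key)."""

    def __init__(self):
        self.buckets = {}

    def allow(self, client_key, identified, now=None):
        bucket = self.buckets.get(client_key)
        if bucket is None:
            rate, capacity = IDENTIFIED_LIMIT if identified else ANONYMOUS_LIMIT
            bucket = TokenBucket(rate, capacity, now)
            self.buckets[client_key] = bucket
        return bucket.allow(now)
```

With these made-up numbers, an anonymous client that fires seven requests
at once gets five through and two refused, then earns one more request per
second. A real deployment would also need to expire idle buckets and share
state across servers, which this sketch ignores.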

While enforcing UA requirements is inherently reasonable, anyone who
doesn't know to set up a valid UA string may also not know better than to
copy one from a browser, which makes things worse. (I've done that myself
in the past when using curl with an uncooperative site. The shame.) Maybe rate limiting
will be the 80 in the 80/20 solution, and enforcing UA reqs won't be
necessary to control traffic, leaving them as a silly but effective way of
identifying certain kinds of traffic. The flip-side case would be
bajillions of very low volume bots—mimicking roughly human levels of
traffic and so sailing under rate limits—all with blank UAs. But we could
note that after rate limiting slows down the ridiculously heavy hitters and
take action as needed.
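
As a rough illustration of using UAs to identify kinds of traffic rather
than to block it, here is a small Python sketch that buckets UA strings
into the categories mentioned in this thread. The function name, the
category labels, and the list of library defaults are all assumptions, and
the list is far from complete. Note that a UA copied from a browser is
indistinguishable from a real browser here, which is exactly the problem.

```python
import re

# Illustrative, incomplete list of default UAs sent by common HTTP tools.
LIBRARY_DEFAULT = re.compile(
    r"^(?:php|python[-/]urllib|python-requests|curl|java|libwww-perl)\b",
    re.IGNORECASE,
)

# Very loose check for an email address or URL somewhere in the UA.
CONTACT = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+|https?://\S+")


def classify_user_agent(ua):
    """Return a coarse label for a raw User-Agent header value."""
    ua = (ua or "").strip()
    if ua in ("", "-"):
        return "blank"
    if CONTACT.search(ua):
        return "has-contact"
    if ua.startswith("Mozilla/"):
        return "browser-like"
    if LIBRARY_DEFAULT.match(ua):
        return "library-default"
    return "other"
```

Counting requests per label after rate limiting is in place would show
whether the blank-UA, low-volume case is large enough to worry about.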


Trey Jones
Software Engineer, Discovery
Wikimedia Foundation

On Tue, Sep 1, 2015 at 1:44 PM, Oliver Keyes <oke...@wikimedia.org> wrote:

> If people aren't capable of following UA guidelines I doubt they're
> going to follow voluntary login.
>
> For what it's worth I absolutely support both rate-limiting and login
> to get around this. In fact, I would argue that from an analytics
> point of view rate limiting is probably the most high-profile problem
> we have with incoming data at the moment. It's far, far too common for
> random pieces of automata to set themselves up and massively skew our
> datasets; identifying this in advance is impossible (we don't always
> have IP data) and identifying them post-hoc on an individual basis is
> massively time consuming.
>
> Why don't we have rate limiting + login? Who would work on this? Why
> /should/ we not have rate limiting?
>
> On 1 September 2015 at 13:37, Brion Vibber <bvib...@wikimedia.org> wrote:
> > I'm not 100% convinced that the UA requirement is helpful, for two reasons:
> >
> > 1) Lots of requests will have a default like "PHP" or "Python/urllib" or
> > whatever from the tool they used to build their bot. These aren't helpful
> > either, as they contain no information on how to get in touch.
> >
> > 2) It's trivial to work around the requirement for a non-blank UA by
> > setting one of the above, or worse -- cut-n-pasting the UA string from a
> > browser. If someone hacks this up real quick while testing, they may never
> > bother putting in contact information when their bot moves from a handful
> > of requests to gazillions.
> >
> > Auto-throttling super-high-rate API clients (by IP/IP group) and giving
> > them an explicit "You really should contact us and, better yet, make it
> > possible for us to contact you" message might be nice.
> >
> >
> > We may want to seriously think about some sort of API key system... not
> > necessarily as mandatory for access (we love freedom and convenience!) but
> > perhaps as the way you get around being throttled for too many accesses.
> > This would give us a structured way of storing their contact information,
> > which might be better than unstructured names or addresses in the UA.
> >
> > Does it make sense to tell people "log in to your bot's account with OAuth"
> > or is that too much of a pain in the ass versus "add this one parameter to
> > your requests with your key"? :)
> >
> > -- brion
> >
> >
> > On Tue, Sep 1, 2015 at 10:23 AM, Oliver Keyes <oke...@wikimedia.org> wrote:
> >
> >> Awesome; thanks for the analysis, Krinkle.
> >>
> >> Do we want to change this behaviour? From my point of view the answer
> >> is 'yes, not setting any kind of user agent is a violation of our API
> >> etiquette and we should be taking steps to alert people that it is'
> >> but if other people have different perspectives on this I'd love to
> >> hear them.
> >>
> >> On 1 September 2015 at 13:18, Krinkle <krinklem...@gmail.com> wrote:
> >> > I've confirmed just now that whatever requirement there was, it doesn't
> >> > seem to be in effect.
> >> >
> >> > Omitting the header entirely, sending it as an empty string, and sending
> >> > it as "-" all result in a response from the MediaWiki API.
> >> >
> >> > $ curl -A '' --include -v 'https://en.wikipedia.org/w/api.php?action=query&format=json'
> >> >> GET /w/api.php?action=query&format=json HTTP/1.1
> >> >> Host: en.wikipedia.org
> >> >> Accept: */*
> >> > < HTTP/1.1 200 OK
> >> > ..
> >> > {"batchcomplete":""}
> >> >
> >> >
> >> > $ curl -A '-' --include -v 'https://en.wikipedia.org/w/api.php?action=query&format=json'
> >> >> GET /w/api.php?action=query&format=json HTTP/1.1
> >> >> User-Agent: -
> >> >> Host: en.wikipedia.org
> >> >> Accept: */*
> >> > < HTTP/1.1 200 OK
> >> > ..
> >> > {"batchcomplete":""}
> >> >
> >> > In the past (2012?) these were definitely being blocked. (Ran into it
> >> > from time to time on Toolserver.)
> >> > It seems PHP file_get_contents('http://...api..') is also working fine now,
> >> > without having to ini_set a user_agent value first.
> >> >
> >> > -- Krinkle
> >> > _______________________________________________
> >> > Wikitech-l mailing list
> >> > Wikitech-l@lists.wikimedia.org
> >> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> >>
> >>
> >>
> >> --
> >> Oliver Keyes
> >> Count Logula
> >> Wikimedia Foundation
> >>
>
>
>
> --
> Oliver Keyes
> Count Logula
> Wikimedia Foundation
>
