> > Adding authentication to the service, and allowing higher quotas to bots that authenticate.
Awesome and expected.

> > Creating an asynchronous queue, which could allow running more expensive queries, but with longer deadlines.

Even more awesome! Will this be approachable: will my 2-hour query actually, finally, return results into my 1 GB csv.zip file?

Thad
https://www.linkedin.com/in/thadguidry/

On Tue, Jul 23, 2019 at 5:47 AM Amir Sarabadani <amir.sarabad...@wikimedia.de> wrote:

> Hey,
> Forgive my ignorance; I don't know much about the infrastructure of WDQS and how it works. I just want to mention how the application servers do it. On the appservers there are dedicated nodes for both Apache and the replica database, so if a bot overdoes things on Wikipedia (which happens quite a lot), users won't feel anything; only the other bots take the hit. Routing based on user agent seems hard, though, while it's easy in MediaWiki (if you hit api.php, we assume it's a bot).
>
> Have you considered this as a more long-term solution?
> Best
>
> On Tue, 23 Jul 2019 at 09:43, Stas Malyshev <smalys...@wikimedia.org> wrote:
>
>> Hello all!
>>
>> Here is (at last!) an update on what we are doing to protect the stability of Wikidata Query Service.
>>
>> For 4 years we have been offering Wikidata users the Query Service, a powerful tool that allows anyone to query the content of Wikidata without any identification needed. This means that anyone can use the service from a script and make heavy or very frequent requests. However, this freedom has led to the service being overloaded by too large a volume of queries, causing the issues or lag that you may have noticed.
>>
>> A reminder about the context:
>>
>> We have had a number of incidents where the public WDQS endpoint was overloaded by bot traffic. We don't think that any of that activity was intentionally malicious, but rather that the bot authors most probably don't understand the cost of their queries and the impact they have on our infrastructure. We've recently seen more distributed bots, coming from multiple IPs at cloud providers. This kind of pattern makes it harder and harder to filter or throttle an individual bot. The impact has ranged from increased update lag to full service interruption.
>>
>> What we have been doing:
>>
>> While we would love to allow anyone to run any query they want at any time, we're not able to sustain that load, and we need to be more aggressive in how we throttle clients. We want to be fair to our users and allow everyone to use the service productively. We also want the service to be available to the casual user and to provide up-to-date access to the live Wikidata data. And while we would love to throttle only abusive bots, to be able to do that we need to be able to identify them.
>>
>> We have two main means of identifying bots:
>>
>> 1) their user agent and IP address
>> 2) the pattern of their queries
>>
>> Identifying patterns in queries is done manually, by a person inspecting the logs. It takes time and can only be done after the fact. We can only start our identification process once the service is already overloaded. This is not going to scale.
>>
>> IP addresses are starting to be problematic. We see bots running on cloud providers and spreading their workloads across multiple instances, with multiple IP addresses.
>>
>> We are left with user agents. But here, we have a problem again.
>> To block only abusive bots, we would need those bots to use a clearly identifiable user agent, so that we can throttle or block them and contact the author to work together on a solution. It is unlikely that an intentionally abusive bot will voluntarily provide a way to be blocked. So we need to be more aggressive about bots which use a generic user agent. We are not blocking those, but we are limiting the number of requests coming from generic user agents. This is a large bucket, containing a lot of bots that all fall into the same "generic user agent" category. Sadly, it is also the bucket that contains many small bots that generate only a very reasonable load. And so we are also impacting the bots that play fair.
>>
>> At the moment, if your bot is affected by our restrictions, configure a custom user agent that identifies you; this should be sufficient to give you enough bandwidth. If you are still running into issues, please contact us; we'll find a solution together. (A minimal sketch of a request with a custom user agent appears after this quoted thread.)
>>
>> What's coming next:
>>
>> First, it is unlikely that we will be able to remove the current restrictions in the short term. We're sorry for that, but the alternative - the service being unresponsive or severely lagged for everyone - is worse.
>>
>> We are exploring a number of alternatives: adding authentication to the service and allowing higher quotas to bots that authenticate; and creating an asynchronous queue, which could allow running more expensive queries with longer deadlines. And we are in the process of hiring another engineer to work on these ideas.
>>
>> Thanks for your patience!
>>
>> WDQS Team
>
> --
> Amir Sarabadani (he/him)
> Software engineer
>
> Wikimedia Deutschland e. V. | Tempelhofer Ufer 23-24 | 10963 Berlin
> Tel. (030) 219 158 26-0
> https://wikimedia.de
>
> Our vision is a world in which everyone can share in the sum of human knowledge, use it, and add to it. Help us achieve this! https://spenden.wikimedia.de
>
> Wikimedia Deutschland — Gesellschaft zur Förderung Freien Wissens e. V. Registered in the register of associations of the Amtsgericht Berlin-Charlottenburg under number 23855 B. Recognized as a charitable organization by the Finanzamt für Körperschaften I Berlin, tax number 27/029/42207.
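Following the advice in the quoted message about configuring a custom user agent, here is a minimal sketch of what that can look like for a script querying WDQS. It assumes Python with the `requests` library; the tool name, project URL, and contact address in the User-Agent string are placeholders to replace with your own, and the query itself is just an arbitrary example.

```python
# Minimal sketch: querying the WDQS SPARQL endpoint with a descriptive,
# identifiable User-Agent so operators can tell your traffic apart from
# the throttled "generic user agent" bucket (and contact you if needed).
import requests

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

HEADERS = {
    # Placeholder identity -- substitute your own tool name, URL, and contact address.
    "User-Agent": "ExampleWikidataBot/0.1 (https://example.org/bot; bot-owner@example.org)",
    "Accept": "application/sparql-results+json",
}

# An arbitrary example query: ten items that are instances of "house cat" (Q146).
QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P31 wd:Q146 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 10
"""

def run_query(query: str) -> dict:
    """Send one SPARQL query and return the parsed JSON results."""
    response = requests.get(
        WDQS_ENDPOINT,
        params={"query": query},
        headers=HEADERS,
        timeout=60,  # client-side timeout; the server enforces its own limits too
    )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    for row in run_query(QUERY)["results"]["bindings"]:
        print(row["itemLabel"]["value"])
```

Including a project URL or contact email in the User-Agent string follows the general recommendation of the Wikimedia User-Agent policy, and gives the WDQS operators a way to reach you before they have to resort to throttling or blocking.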