Hello!

Thanks for trying to not overload the service!

There is some minimal documentation on the throttling done by Wikidata
Query Service [1], but it clearly needs to be improved.

High level overview:

Throttling is done per "client", where a client is identified by its
user-agent and IP address (yes, it is a flawed definition of client, but it
mostly works for throttling purposes). Limits are set on the query execution
time and on the number of errors raised by the client. When a limit is
reached, an HTTP 429 response is sent to the client, with a "Retry-After"
HTTP header. This header contains an estimate, in seconds, of how long the
client should wait before retrying a request. If we see a client that seems
to ignore HTTP 429 responses for long enough, that client is banned for
24 hours.
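
To make the above concrete, here is a rough Python sketch of how a client
could decide what to do after each response. The function name
next_action, the 60 second fallback and the use of 403 as the "banned"
signal (taken from the error message quoted further down in this thread)
are assumptions for illustration, not a description of how the service is
implemented.

```python
def next_action(status, retry_after=None):
    """Decide what a client should do after a response.

    Returns ("proceed", 0), ("wait", seconds) or ("stop", 0).
    `retry_after` is the raw value of the Retry-After header, if any.
    """
    if status == 429:
        # Throttled: Retry-After is an estimate, in seconds, of how long to wait.
        try:
            return ("wait", max(int(retry_after), 0))
        except (TypeError, ValueError):
            # Assumed fallback: wait 60 seconds if the header is missing
            # or is not a plain number of seconds.
            return ("wait", 60)
    if status == 403:
        # Assumption: 403 signals a ban, as in the error quoted in this thread.
        # Retrying during a ban will not help, so stop.
        return ("stop", 0)
    return ("proceed", 0)
```

For example, next_action(429, "30") asks the caller to wait 30 seconds
(the value "30" is made up, not taken from a real response).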

What you can do:

* don't execute more than one request in parallel
* set a user-agent specific to your application (see [2] for some
documentation on the user-agent policy)
* when receiving an HTTP 429 response, stop for the duration of the
Retry-After header or for 1 minute
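
Since example Python code was asked for, the three points above could be
sketched roughly as follows. This is a minimal, untested-against-production
sketch using only the standard library: the names send_query and
run_with_backoff, the user-agent string and the max_attempts limit are
hypothetical, and I am reading "or for 1 minute" as the fallback when the
Retry-After header is missing or unparsable.

```python
import time
import urllib.error
import urllib.request
import urllib.parse

ENDPOINT = "https://query.wikidata.org/sparql"
# Hypothetical user-agent; replace it with one that identifies your application.
USER_AGENT = "ExampleBot/0.1 (https://example.org/examplebot; examplebot@example.org)"


def send_query(sparql):
    """Send one SPARQL query and return (status, headers, body)."""
    url = ENDPOINT + "?" + urllib.parse.urlencode({"query": sparql})
    request = urllib.request.Request(url, headers={
        "User-Agent": USER_AGENT,
        "Accept": "application/sparql-results+json",
    })
    try:
        with urllib.request.urlopen(request) as response:
            return response.status, response.headers, response.read()
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses arrive as exceptions; keep their status and headers.
        return error.code, error.headers, error.read()


def run_with_backoff(sparql, send=send_query, sleep=time.sleep, max_attempts=5):
    """Run one query at a time, pausing on HTTP 429 as the server asks."""
    for _ in range(max_attempts):
        status, headers, body = send(sparql)
        if status != 429:
            # Not throttled: hand the result (or other error) back to the caller.
            return status, body
        try:
            # Retry-After is expected to be a number of seconds.
            wait = max(int(headers.get("Retry-After")), 0)
        except (TypeError, ValueError):
            # Header missing or unparsable: assumed fallback of 1 minute.
            wait = 60
        sleep(wait)
    raise RuntimeError("still throttled after %d attempts" % max_attempts)
```

Calling run_with_backoff for each query in turn, from a single thread,
keeps requests sequential. The send and sleep parameters are only there so
that the loop can be exercised with fake responses instead of provoking a
real HTTP 429.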

If you follow all that, you should be good. If you still see throttling or
bans, let us know. If you give me the User-Agent of your script and the time
at which you received the throttling / ban response, I can have a look
into the logs.

Note that we might have some degenerate behaviour when the service is
already overloaded (I don't think so, but who knows).

Good luck!

   Guillaume

[1]
https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual#Query_limits
[2] https://meta.wikimedia.org/wiki/User-Agent_policy


On Sat, Nov 2, 2019 at 11:37 AM Andra Waagmeester <an...@micel.io> wrote:

> Hi,
>
>     I hope this is the right mailing list to discuss this issue.
> Some time ago I ran into a series of temporary bans, I thought I managed
> to tackle this basically by doing a full stop once it gets any response
> header code other than 200.
>
> However, this seems not to have fixed it, since I received the following
> message:
>
> "requests.exceptions.HTTPError: 403 Client Error: You have been banned
> until 2019-10-18T10:21:36.495Z, please respect throttling and retry-after
> headers. for url: https://query.wikidata.org/sparql"
>
> I am looking into this from scratch and see if I can implement a better
> solution and certainly one that really respects the retry-after time
> instead of going full stop.
>
> Whatever I try now, I keep getting 200 headers and I don't want to start
> an excessive bot run to get into a ban state to see the exact header that
> the bot needs to respect.
>
> Is there an example of such a header which I can use to make my own test
> script?
>
> Or is there example Python code that successfully deals with a
> retry-after header?
>
> Regards,
>
> Andra
>
>
> _______________________________________________
> Wikidata-tech mailing list
> Wikidata-tech@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata-tech
>


-- 
Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+2 / CEST
