Re: [Wikidata] Wikidata Query Service partial outage

2017-10-24 Thread Guillaume Lederrey
Hello all!

Following up on my previous message, the change to our throttling
policy was deployed yesterday (2017-10-23 17:00 UTC). Reviewing
the logs so far, I don't see any change in the pattern of throttled
requests. This means that almost no one should be affected, or at
least not affected more than you already were.

Feel free to reach out to me if that's not the case.

  Have fun!

Guillaume

On Thu, Oct 19, 2017 at 10:14 AM, Guillaume Lederrey
 wrote:
> Hello all!
>
> As you might have seen / endured, we had a Wikidata Query Service
> partial outage yesterday morning (Central European Time). The full
> incident report is available [1] if you are interested in the details.
> The short version:
>
> * a single client started to run an unusually high number of queries on WDQS
> * the overload was not prevented by our current throttling
> * the failure was not detected and isolated automatically
>
> To prevent this from happening again, we will review our throttling
> rules. Those rules were previously tuned to prevent a single client
> from overloading the service with a small number of expensive
> requests: we started to log a client's activity only when the duration
> of a request exceeded 10 seconds, which means that a client sending
> tons of short requests would never be throttled.
>
> We will correct that by lowering the threshold, probably to 25 ms. The
> throttling rules are still the same:
>
> * 60 seconds of processing time per minute (peaking at 120 seconds)
> * 30 errors per minute (peaking at 60)
>
> If you are using WDQS to make lots of small requests, and you are over
> the throttling rates above, there is a chance that you will start
> seeing throttling errors. We are not doing this to bother you, we're
> just trying to keep another crash from happening...
>
> If you are throttled, you will receive an HTTP 429 error code [2]. This
> response includes the "Retry-After" HTTP header, which specifies the
> number of seconds you should wait before retrying.
>
> Thanks for your patience!
>
> And contact me if you want any clarification.
>
>   Guillaume
>
> [1] https://wikitech.wikimedia.org/wiki/Incident_documentation/20171018-wdqs
> [2] https://en.wikipedia.org/wiki/List_of_HTTP_status_codes#429
>
> --
> Guillaume Lederrey
> Operations Engineer, Discovery
> Wikimedia Foundation
> UTC+2 / CEST



-- 
Guillaume Lederrey
Operations Engineer, Discovery
Wikimedia Foundation
UTC+2 / CEST

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata Query Service partial outage

2017-10-19 Thread Yaroslav Blanter
Thanks, Guillaume, for the clarification.

Cheers
Yaroslav

On Thu, Oct 19, 2017 at 3:06 PM, Guillaume Lederrey  wrote:

> Hello!
>
> As far as I understand, the dispatch lag is an issue between Wikidata
> and the different Wikipedias. There is no involvement of Wikidata
> Query Service in this. Sjoerd probably understands that much better
> than I do...
>
> Note that this issue also caused some replication lag on one of the
> Wikidata Query Service servers [1]. In that case, this was mitigated
> by taking that specific server out of rotation and waiting for it to
> recover before sending traffic to it again. Also note that the
> Wikidata Query Service replication lag is a very different kind of lag
> from the dispatch lag you were talking about. (Yes, all this is
> complicated.)
>
> Thanks for your interest!
>
> [1] https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?refresh=1m&orgId=1&from=now-7d&to=now

Re: [Wikidata] Wikidata Query Service partial outage

2017-10-19 Thread Guillaume Lederrey
Hello!

As far as I understand, the dispatch lag is an issue between Wikidata
and the different Wikipedias. There is no involvement of Wikidata
Query Service in this. Sjoerd probably understands that much better
than I do...

Note that this issue also caused some replication lag on one of the
Wikidata Query Service servers [1]. In that case, this was mitigated
by taking that specific server out of rotation and waiting for it to
recover before sending traffic to it again. Also note that the
Wikidata Query Service replication lag is a very different kind of lag
from the dispatch lag you were talking about. (Yes, all this is
complicated.)

Thanks for your interest!

[1] https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?refresh=1m&orgId=1&from=now-7d&to=now

On Thu, Oct 19, 2017 at 2:29 PM, Yaroslav Blanter  wrote:
> Thanks Sjoerd. Some en-wiki users consider the delay as a (one more)
> argument that Wikidata is junk and should be thrown down the toilet, so I
> was curious whether the delay was handled as a part of the problem.
>
> Cheers
> Yaroslav



-- 
Guillaume Lederrey
Operations Engineer, Discovery
Wikimedia Foundation
UTC+2 / CEST



Re: [Wikidata] Wikidata Query Service partial outage

2017-10-19 Thread Yaroslav Blanter
Thanks Sjoerd. Some en-wiki users consider the delay as a (one more)
argument that Wikidata is junk and should be thrown down the toilet, so I
was curious whether the delay was handled as a part of the problem.

Cheers
Yaroslav

On Thu, Oct 19, 2017 at 12:09 PM, Sjoerd de Bruin 
wrote:

> Hi Yaroslav,
>
> No, but there have been some dispatch issues in the last few days. The
> current lag for enwiki is 3 hours, for example. You can see a graph of the
> dispatch lag here:
> https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch?refresh=1m&orgId=1&from=now-7d&to=now
>
> Greetings,
>
> Sjoerd de Bruin
> sjoerddebr...@me.com


Re: [Wikidata] Wikidata Query Service partial outage

2017-10-19 Thread Sjoerd de Bruin
Hi Yaroslav,

No, but there have been some dispatch issues in the last few days. The current
lag for enwiki is 3 hours, for example. You can see a graph of the dispatch lag
here:
https://grafana.wikimedia.org/dashboard/db/wikidata-dispatch?refresh=1m&orgId=1&from=now-7d&to=now


Greetings,

Sjoerd de Bruin
sjoerddebr...@me.com

> On 19 Oct 2017, at 11:30, Yaroslav Blanter wrote:
>
> Thanks Guillaume,
>
> is this the same incident which caused an hour's delay for Wikidata items on
> Wikipedia watchlists?
> 
> Cheers
> Yaroslav


Re: [Wikidata] Wikidata Query Service partial outage

2017-10-19 Thread Yaroslav Blanter
Thanks Guillaume,

is this the same incident which caused an hour's delay for Wikidata items on
Wikipedia watchlists?

Cheers
Yaroslav



[Wikidata] Wikidata Query Service partial outage

2017-10-19 Thread Guillaume Lederrey
Hello all!

As you might have seen / endured, we had a Wikidata Query Service
partial outage yesterday morning (Central European Time). The full
incident report is available [1] if you are interested in the details.
The short version:

* a single client started to run an unusually high number of queries on WDQS
* the overload was not prevented by our current throttling
* the failure was not detected and isolated automatically

To prevent this from happening again, we will review our throttling
rules. Those rules were previously tuned to prevent a single client
from overloading the service with a small number of expensive
requests: we started to log a client's activity only when the duration
of a request exceeded 10 seconds, which means that a client sending
tons of short requests would never be throttled.

We will correct that by lowering the threshold, probably to 25 ms. The
throttling rules are still the same:

* 60 seconds of processing time per minute (peaking at 120 seconds)
* 30 errors per minute (peaking at 60)
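
To make the effect of the threshold concrete, here is a minimal sketch
of how such a rule could behave. It is an illustration only, not the
actual WDQS code: it assumes that "60 seconds per minute, peaking at
120" works like a token bucket (1 second of processing time refilled
per wall-clock second, capacity 120), and that a client is only tracked
once one of its requests runs longer than the threshold. The names
ProcessingTimeBucket, Throttler and count_throttled are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProcessingTimeBucket:
    """Token bucket where one token stands for one second of processing time."""

    refill_per_second: float = 1.0  # assumed: 60 s of processing time per minute
    capacity: float = 120.0         # assumed: "peaking at 120 seconds"
    tokens: float = 120.0
    last_seen: float = 0.0

    def _refill(self, now: float) -> None:
        elapsed = max(0.0, now - self.last_seen)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_seen = now

    def allow(self, now: float) -> bool:
        """Return True if a new request may start at time `now`."""
        self._refill(now)
        return self.tokens > 0.0

    def charge(self, now: float, processing_seconds: float) -> None:
        """Deduct the processing time used by a finished request."""
        self._refill(now)
        self.tokens -= processing_seconds


class Throttler:
    """Tracks a client only after one of its requests exceeds the threshold."""

    def __init__(self, tracking_threshold: float) -> None:
        self.tracking_threshold = tracking_threshold
        self.buckets = {}  # client id -> ProcessingTimeBucket

    def allow(self, client: str, now: float) -> bool:
        bucket = self.buckets.get(client)
        return bucket is None or bucket.allow(now)

    def record(self, client: str, now: float, processing_seconds: float) -> None:
        bucket = self.buckets.get(client)
        if bucket is None:
            if processing_seconds <= self.tracking_threshold:
                return  # short request from an untracked client: never counted
            bucket = ProcessingTimeBucket(last_seen=now)
            self.buckets[client] = bucket
        bucket.charge(now, processing_seconds)


def count_throttled(tracking_threshold: float) -> int:
    """Simulate one client sending 1000 requests of 0.2 s each, 100 per second."""
    throttler = Throttler(tracking_threshold)
    throttled = 0
    for i in range(1000):
        now = i * 0.01
        if throttler.allow("client-a", now):
            throttler.record("client-a", now, 0.2)
        else:
            throttled += 1
    return throttled
```

Under these assumptions, count_throttled(10.0) never throttles the
client because none of its requests ever crosses the 10-second
threshold, while count_throttled(0.025) starts rejecting requests once
the bucket is empty.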

If you are using WDQS to make lots of small requests, and you are over
the throttling rates above, there is a chance that you will start
seeing throttling errors. We are not doing this to bother you, we're
just trying to keep another crash from happening...

If you are throttled, you will receive an HTTP 429 error code [2]. This
response includes the "Retry-After" HTTP header, which specifies the
number of seconds you should wait before retrying.
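
For client authors, a minimal sketch of honouring that header could
look like the following. It uses only the Python standard library. The
function names, the User-Agent string, the 5-second fallback and the
limit of 5 attempts are assumptions made for this example, not
recommendations from the service.

```python
import time
import urllib.error
import urllib.request


def parse_retry_after(headers, default=5.0):
    """Return the Retry-After delay in seconds, or `default` if it is absent or not a number."""
    value = headers.get("Retry-After") or ""
    if value.strip().isdigit():
        return float(value)
    return default


def http_send(url):
    """Send one GET request and return (status, headers, body), including for HTTP errors."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "example-client/0.1 (hypothetical)"}
    )
    try:
        with urllib.request.urlopen(request, timeout=60) as reply:
            return reply.status, reply.headers, reply.read()
    except urllib.error.HTTPError as error:
        return error.code, error.headers, error.read()


def fetch_with_retry(url, send=http_send, sleep=time.sleep, max_attempts=5):
    """Fetch `url`, waiting as instructed by Retry-After whenever the answer is 429."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, headers, body = send(url)
        if status != 429:
            return status, body
        if attempt + 1 < max_attempts:
            sleep(parse_retry_after(headers))
    return status, body  # still throttled after the last attempt
```

The send and sleep parameters exist so that the retry logic can be
exercised without network access. In normal use you would call
fetch_with_retry with the query URL you already use and leave the
defaults in place.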

Thanks for your patience!

And contact me if you want any clarification.

  Guillaume

[1] https://wikitech.wikimedia.org/wiki/Incident_documentation/20171018-wdqs
[2] https://en.wikipedia.org/wiki/List_of_HTTP_status_codes#429

-- 
Guillaume Lederrey
Operations Engineer, Discovery
Wikimedia Foundation
UTC+2 / CEST
