We identified a user who was responsible for by far the most requests (albeit
still not a large percentage of total requests) and banned them. That
immediately restored full service availability, following another quick
round of blazegraph restarts to get the deadlocked blazegraph processes
back up and running properly.

This problem is resolved (for now at least). I'll be e-mailing the user we
banned to inform them of the user agent ban.


On Wed, Sep 8, 2021 at 8:03 PM Ryan Kemper <[email protected]> wrote:

> Our WDQS backend servers (in CODFW only) have incredibly patchy
> availability currently.
>
> As a result a sizeable portion of queries made to query.wikidata.org are
> failing or taking unusually long.
>
> We're doing our best to isolate a cause (basically one or more users
> submitting particularly expensive or error-generating queries). Until we
> succeed, service availability is likely to be quite poor.
>
> Note that we have a mitigation in place where we're restarting
> blazegraph across the affected hosts (codfw) hourly, but that mitigation is
> currently insufficient.
>
> You can see the current status of wdqs backend server availability here:
> https://grafana.wikimedia.org/d/000000489/wikidata-query-service?viewPanel=7&from=now-1h&to=now&refresh=1m
>
> ^ This is a graph of our total triple count (i.e. not explicitly a graph
> of service availability), but servers affected by the blazegraph deadlock
> issue that we're experiencing fail to report metrics while they're
> affected. So the presence or absence of RDF triple counts for a given host
> corresponds to its uptime.
>
_______________________________________________
Wikidata mailing list -- [email protected]
To unsubscribe send an email to [email protected]