Hello all,

This is Asher's writeup of the jobqueue disruption that happened yesterday afternoon Pacific time.
He's not on this list, so please keep him in the cc: if you want him to see your message.

Greg

----- Forwarded message from Asher Feldman <[email protected]> -----

> Date: Fri, 29 Mar 2013 11:27:13 -0700
> From: Asher Feldman <[email protected]>
> To: Operations Engineers <[email protected]>
> Subject: [Ops] site issues yesterday - jobqueue and wikidata
>
> We had two brief site disruptions yesterday: one in the afternoon that was
> fairly major but brief (12:40-12:43pm PST), and another that was less severe,
> around 11pm. Both were jobqueue related; the first incident was suspected
> to have been triggered by the wikidata change publisher, and the second
> incident points more strongly in that direction.
>
> As far as what happened - the current mysql jobqueue implementation is way
> too costly. In the last 24 hours, 75% of all queries that took over 450ms
> to run on the enwiki master were related to the jobqueue, and all major
> jobqueue actions result in replicated writes. Looking at all queries, not
> just those over the slow threshold, the jobqueue accounts for 58% of total
> query execution time. If 1 million refreshLinks jobs are queued as quickly
> as possible without paying attention to replication lag, say hello to
> replication lag. Mediawiki depends on reading from slaves to scale and
> avoids lagged ones. If all slaves are lagged, the master is used for
> everything, and if this happens to enwiki, the site falls over.
>
> The wikidata change propagator inserts ChangeNotification jobs into local
> wiki queues in batches of 1000. The execution of one change job can result
> in many additional refreshLinks jobs being enqueued. Just prior to the
> meltdown, the wikidata propagator inserted around 7000 jobs into enwiki.
> That resulted in around 200k refreshLinks jobs getting inserted in a single
> minute, and around 1.2 million over a slightly longer time.
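[Editor's note: the lag-avoidance behavior Asher describes - read from unlagged slaves, and fall back to the master only when every slave is lagged - can be sketched as follows. This is a minimal Python illustration, not MediaWiki's actual PHP load-balancer code; the function names, lag threshold, and server labels are all assumptions.]

```python
# Sketch of lag-aware read routing: prefer an unlagged slave, fall back
# to the master only when all slaves exceed the lag threshold. When that
# fallback happens wiki-wide, the master absorbs every read - the overload
# mode described in the message.

MAX_LAG_SECONDS = 5  # assumed threshold; the real value is configurable


def pick_read_server(master, slaves, get_lag):
    """Return the server a read should go to.

    `get_lag` maps a slave name to its replication lag in seconds.
    """
    usable = [s for s in slaves if get_lag(s) <= MAX_LAG_SECONDS]
    if usable:
        # Simplified: route to the least-lagged slave.
        return min(usable, key=get_lag)
    # Every slave is lagged: all reads hit the master.
    return master


# Example: one healthy slave, one badly lagged slave.
lag = {"db1": 2, "db2": 120}
print(pick_read_server("master", ["db1", "db2"], lag.get))  # -> db1
print(pick_read_server("master", ["db2"], lag.get))         # -> master
```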
> It turns out
> that trying to reparse 1/4 of enwiki as quickly as possible is a problem :)
>
> Aaron deployed a change last night (
> https://gerrit.wikimedia.org/r/#/c/56572/1) that should throttle the
> insertion of new refreshLinks jobs if the queue is large, but we're not yet
> sure if that's enough. We may also turn down the wikidata dispatcher batch
> size, shut down one of its two dispatchers, or again limit how many
> wikiadmin users can connect to the database to force a concurrency limit on
> all things jobqueue related.
>
> The good thing is - the mysql jobqueue was identified as a scaling
> bottleneck a while ago, and we will be switching to redis very soon. It's
> currently targeted for the wmf13 release, but we may be able to
> backport it to wmf12 and get this done sooner.
>
> In the interim, please do not release anything that will place new demands
> on the jobqueue, such as echo, or any ramping up of wikidata.
>
> -Asher
>
> _______________________________________________
> Ops mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/ops

----- End forwarded message -----

--
| Greg Grossmeier            GPG: B2FA 27B1 F7EB D327 6B8E |
| identi.ca: @greg           A18D 1138 8E47 FAC8 1C7D      |

_______________________________________________
Wikidata-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata-l
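[Editor's note: the queue-size throttle Asher mentions - refuse or defer new refreshLinks insertions once the queue is already large - could look roughly like the sketch below. This is an illustrative Python fragment, not the actual gerrit change; the soft-cap value, function name, and return contract are all assumptions.]

```python
# Sketch of backlog-aware job insertion: accept new jobs only while the
# queue is under a soft cap, and report how many were actually enqueued
# so the caller can retry the remainder after the backlog drains.

QUEUE_SOFT_CAP = 100_000  # assumed cap, not the deployed value


def enqueue_refresh_links(queue, jobs, soft_cap=QUEUE_SOFT_CAP):
    """Append jobs only up to the soft cap; return how many were accepted."""
    room = max(0, soft_cap - len(queue))
    accepted = jobs[:room]
    queue.extend(accepted)
    return len(accepted)


# Example: a queue two jobs short of the cap accepts only two of three jobs.
queue = list(range(99_998))
print(enqueue_refresh_links(queue, ["job-a", "job-b", "job-c"]))  # -> 2
```

The point of the cap is back-pressure: instead of dumping 200k jobs per minute onto a lagging master, insertion slows to whatever rate the job runners can drain.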
