Hello all,

This is Asher's writeup of the jobqueue disruption that happened yesterday afternoon Pacific time.
He's not on this list, so please keep him in the cc: if you want him to see your message.

Greg

----- Forwarded message from Asher Feldman <[email protected]> -----

> Date: Fri, 29 Mar 2013 11:27:13 -0700
> From: Asher Feldman <[email protected]>
> To: Operations Engineers <[email protected]>
> Subject: [Ops] site issues yesterday - jobqueue and wikidata
>
> We had two brief site disruptions yesterday: one in the afternoon that was
> fairly major but brief (12:40-12:43pm PST), and another that was less severe,
> around 11pm. Both were jobqueue related; the first incident was suspected
> to have been triggered by the wikidata change publisher, and the second
> incident points more strongly in that direction.
>
> As far as what happened - the current mysql jobqueue implementation is way
> too costly. In the last 24 hours, 75% of all queries that took over 450ms
> to run on the enwiki master were related to the jobqueue, and all major
> jobqueue actions result in replicated writes. Looking at all queries, not
> just those over the slow threshold, the jobqueue accounts for 58% of total
> query execution time. If 1 million refreshLinks jobs are queued as quickly
> as possible without paying attention to replication lag, say hello to
> replication lag. Mediawiki depends on reading from slaves to scale and
> avoids lagged ones. If all slaves are lagged, the master is used for
> everything, and if this happens to enwiki, the site falls over.
>
> The wikidata change propagator inserts ChangeNotification jobs into local
> wiki queues in batches of 1000. The execution of one change job can result
> in many additional refreshLinks jobs being enqueued. Just prior to the
> meltdown, the wikidata propagator inserted around 7000 jobs into enwiki.
> That resulted in around 200k refreshLinks jobs getting inserted in a single
> minute, and around 1.2 million over a slightly longer time.
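[Editor's note: the lag-avoidance behavior Asher describes - read from unlagged slaves, and fall back to the master only when every slave is lagged - can be sketched as follows. This is a minimal Python illustration, not MediaWiki's actual PHP load-balancer code; the function names, lag threshold, and server labels are all assumptions.]

```python
# Sketch of lag-aware read routing: prefer an unlagged slave, fall back
# to the master only when all slaves exceed the lag threshold. When that
# fallback happens wiki-wide, the master absorbs every read - the overload
# mode described in the message.

MAX_LAG_SECONDS = 5  # assumed threshold; the real value is configurable


def pick_read_server(master, slaves, get_lag):
    """Return the server a read should go to.

    `get_lag` maps a slave name to its replication lag in seconds.
    """
    usable = [s for s in slaves if get_lag(s) <= MAX_LAG_SECONDS]
    if usable:
        # Simplified: route to the least-lagged slave.
        return min(usable, key=get_lag)
    # Every slave is lagged: all reads hit the master.
    return master


# Example: one healthy slave, one badly lagged slave.
lag = {"db1": 2, "db2": 120}
print(pick_read_server("master", ["db1", "db2"], lag.get))  # -> db1
print(pick_read_server("master", ["db2"], lag.get))         # -> master
```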
> It turns out
> that trying to reparse 1/4 of enwiki as quickly as possible is a problem :)
>
> Aaron deployed a change last night (
> https://gerrit.wikimedia.org/r/#/c/56572/1) that should throttle the
> insertion of new refreshLinks jobs if the queue is large, but we're not yet
> sure if that's enough. We may also turn down the wikidata dispatcher batch
> size, shut down one of its two dispatchers, or again limit how many
> wikiadmin users can connect to the database to force a concurrency limit on
> all things jobqueue related.
>
> The good thing is - the mysql jobqueue was identified as a scaling
> bottleneck a while ago, and we will be switching to redis very soon. It's
> currently targeted for the wmf13 release, but we may be able to
> backport it to wmf12 and get this done sooner.
>
> In the interim, please do not release anything that will place new demands
> on the jobqueue, such as echo, or any ramping up of wikidata.
>
> -Asher
>
> _______________________________________________
> Ops mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/ops

----- End forwarded message -----

--
| Greg Grossmeier            GPG: B2FA 27B1 F7EB D327 6B8E |
| identi.ca: @greg           A18D 1138 8E47 FAC8 1C7D      |

_______________________________________________
Wikidata-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata-l
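[Editor's note: the queue-size throttle Asher mentions - refuse or defer new refreshLinks insertions once the queue is already large - could look roughly like the sketch below. This is an illustrative Python fragment, not the actual gerrit change; the soft-cap value, function name, and return contract are all assumptions.]

```python
# Sketch of backlog-aware job insertion: accept new jobs only while the
# queue is under a soft cap, and report how many were actually enqueued
# so the caller can retry the remainder after the backlog drains.

QUEUE_SOFT_CAP = 100_000  # assumed cap, not the deployed value


def enqueue_refresh_links(queue, jobs, soft_cap=QUEUE_SOFT_CAP):
    """Append jobs only up to the soft cap; return how many were accepted."""
    room = max(0, soft_cap - len(queue))
    accepted = jobs[:room]
    queue.extend(accepted)
    return len(accepted)


# Example: a queue two jobs short of the cap accepts only two of three jobs.
queue = list(range(99_998))
print(enqueue_refresh_links(queue, ["job-a", "job-b", "job-c"]))  # -> 2
```

The point of the cap is back-pressure: instead of dumping 200k jobs per minute onto a lagging master, insertion slows to whatever rate the job runners can drain.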
