Hmm... what I'd expect is that if one ES save target database is in read-only mode, the system should cycle through to the next available one that is working -- the save should then succeed transparently.
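
To make that expectation concrete, here is a minimal sketch in Python of trying each write target in turn. This is not MediaWiki's actual code: `FakeStore`, `ReadOnlyError`, and `save_with_failover` are hypothetical names made up for illustration, and the real external store interface may look quite different.

```python
class ReadOnlyError(Exception):
    """Raised by a (hypothetical) store that is currently read-only."""


class FakeStore:
    """Stand-in for one external store cluster; not a real MediaWiki class."""

    def __init__(self, name, read_only=False):
        self.name = name
        self.read_only = read_only
        self.blobs = []

    def store(self, data):
        # Refuse the write while the cluster is in read-only mode.
        if self.read_only:
            raise ReadOnlyError(self.name)
        self.blobs.append(data)
        # Return an address made of the cluster name and the blob index.
        return "{}/{}".format(self.name, len(self.blobs) - 1)


def save_with_failover(stores, data):
    """Try each write target in order; return the address from the first one that accepts."""
    failed = []
    for store in stores:
        try:
            return store.store(data)
        except ReadOnlyError:
            failed.append(store.name)
    # Only fail the save when every target refused the write.
    raise ReadOnlyError("all write targets are read-only: " + ", ".join(failed))


stores = [FakeStore("cluster1", read_only=True), FakeStore("cluster2")]
print(save_with_failover(stores, "revision text"))  # prints "cluster2/0"
```

With logic along these lines, a save would only fail if every configured write target were read-only at the same time.
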
Do we not have that sort of write failover logic, or are *all* ES clusters getting locked somehow?

-- brion

On Nov 18, 2011 9:51 AM, "Ben Hartshorne" <[email protected]> wrote:

> Hi everyone,
>
> I just posted a note
> <http://blog.wikimedia.org/2011/11/18/nobody-notices-when-its-not-broken-new-database-servers-deployed/>
> on the blog about our new external store but wanted to add a few details
> here. The deploy went smoothly, and I'm very happy with how the project
> progressed overall. There are plenty more details on the project itself on
> the project wiki page
> <http://wikitech.wikimedia.org/view/External_storage/Update_2011-08>
> and hiding in RT. There were a few followup things to come out of it, and
> I want to talk through those in hopes that someone either picks them up or
> has suggestions on what to do.
>
> The project originally included recompressing all of the object types in
> the external store databases, continuing the work that was started in
> 2010. I spent some time doing verification that things were behaving as
> expected and it turns out they weren't. Upon examining the count of
> different data types in the external store content, I found that some types
> that are no longer supposed to be used were still getting created. I've
> filed https://bugzilla.wikimedia.org/show_bug.cgi?id=32478 to track the
> investigation and resolution of those differences.
>
> During the deploy there was a brief (about 10 minute) period during which
> article saves failed due to the external store databases being in read-only
> mode. As expected, some folks showed up in IRC telling us of the
> 'problem'. After the migration was complete we brainstormed a bit in IRC
> about good ways of informing editors of planned maintenance such as this
> migration. The regular databases (s3, etc.) have a read-only mode flag so
> that the affected wikis show a reasonable error, but the external store
> databases are a little different. Because of the way they're spread out,
> the outage of a specific database cluster does not affect specific language
> projects, but instead affects a specific time range for all wikis.
> Additionally, the currently writable external store database affects
> article edits on all wikis.
>
> There were a few suggestions thrown around:
>
> 1) Use central notice. This would certainly have the effect of alerting
> all wikis that there was some maintenance, but it has the disadvantage of
> telling all *readers* about the outage, rather than only the people that
> would actually be interested (those editing pages).
>
> 2) Make MediaWiki cache the change to conceal the outage from editors. The
> idea here is that MediaWiki would notice that the backend database is
> currently in read-only mode and would cache the change and write it to the
> DB when it returns to read-write mode. There are a number of technical
> challenges here, as well as the introduction of another system (the change
> cache), but it's an interesting way around the problem, since rather than
> addressing how to inform editors of impending maintenance it simply
> eliminates the necessity for that communication.
>
> 3) Throw up a banner on the edit page itself. The time when we want to
> inform someone that there is going to be maintenance that will impede
> editing is when the user begins an edit. (At the moment we inform them
> when they try to save the edit, in the form of an error message.) If there
> was a banner on all edit pages that informed the user not to save their
> document during a specific time period, they could choose to postpone the
> edit or finish quickly. The text would be something like "There will be
> planned maintenance starting in 23 minutes and lasting for 30 minutes. You
> will be unable to save edits during the maintenance period. Please save
> your work before maintenance begins." During the maintenance, we could
> change the message to be more visible, or we could take more drastic action
> such as disabling the edit or save buttons.
>
> 4) Don't make any change from what we do now. The external store databases
> rarely fail or undergo maintenance. Increasing the complexity of the
> system to protect against their outage will be more likely to cause harm
> than the outages themselves. Instead, just announce it on the blog before
> and apologize to anybody affected afterwards.
>
> I'm sure there are some more ideas on what we should do, as well as
> opinions about these various options out there. Discuss! :) I haven't
> filed a bug yet, but will do so if this conversation comes to some
> consensus about a specific thing that should be done.
>
> Thanks,
>
> -ben
> _______________________________________________
> Wikitech-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
