Hmm... what I'd expect is that if one ES save target database is in
read-only mode, the system should cycle through to the next available one
that is working -- the save should then succeed transparently.

Do we not have that sort of write failover logic, or are *all* ES clusters
getting locked somehow?
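
To make the behavior I'm describing concrete, here is a minimal sketch in Python. It is not MediaWiki's actual code: `Store`, `ReadOnlyError`, `save_with_failover`, and the cluster names are all hypothetical, and it assumes a read-only cluster signals that state by raising an error on save.

```python
class ReadOnlyError(Exception):
    """Raised by a (hypothetical) store that is in read-only mode."""


class Store:
    """Stand-in for one external store cluster; not MediaWiki's real API."""

    def __init__(self, name, read_only=False):
        self.name = name
        self.read_only = read_only
        self.blobs = []

    def save(self, text):
        if self.read_only:
            raise ReadOnlyError(self.name)
        self.blobs.append(text)
        # Return an address recording which cluster holds the blob.
        return f"{self.name}/{len(self.blobs) - 1}"


def save_with_failover(stores, text):
    """Try each store in order; return the address from the first success."""
    failed = []
    for store in stores:
        try:
            return store.save(text)
        except ReadOnlyError:
            # This cluster is locked; move on to the next candidate.
            failed.append(store.name)
    # Only give up when every candidate refused the write.
    raise ReadOnlyError("all stores read-only: " + ", ".join(failed))


stores = [Store("cluster-a", read_only=True), Store("cluster-b")]
address = save_with_failover(stores, "revision text")
print(address)
```

With this shape the save only fails when every write target is read-only, which is the case I'm asking about.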

-- brion
On Nov 18, 2011 9:51 AM, "Ben Hartshorne" <[email protected]> wrote:

> Hi everyone,
>
> I just posted a note
> <http://blog.wikimedia.org/2011/11/18/nobody-notices-when-its-not-broken-new-database-servers-deployed/>
> on the blog about our new external store but wanted to add a few details
> here.  The deploy went smoothly, and I'm very happy with how the project
> progressed overall.  There are plenty more details on the project itself on
> the project wiki page
> <http://wikitech.wikimedia.org/view/External_storage/Update_2011-08>
> and hiding in RT.  There were a few follow-up items to come out of it, and
> I want to talk through those in hopes that someone either picks them up or
> has suggestions on what to do.
>
> The project originally included recompressing all of the object types in
> the external store databases, continuing the work that was started in
> 2010.  I spent some time doing verification that things were behaving as
> expected and it turns out they weren't.  Upon examining the count of
> different data types in the external store content, I found that some types
> that are no longer supposed to be used were still getting created.  I've
> filed https://bugzilla.wikimedia.org/show_bug.cgi?id=32478 to track the
> investigation and resolution of those differences.
>
> During the deploy there was a brief (about 10 minute) period during which
> article saves failed due to the external store databases being in read-only
> mode.  As expected, some folks showed up in IRC telling us of the
> 'problem'.  After the migration was complete we brainstormed a bit in IRC
> about good ways of informing editors of planned maintenance such as this
> migration.  The regular databases (s3, etc.) have a read-only mode flag so
> that the affected wikis show a reasonable error, but the external store
> databases are a little different.  Because of the way they're spread out,
> the outage of a specific database cluster does not affect specific language
> projects, but instead affects a specific time range for all wikis.
> Additionally, an outage of the currently writable external store database
> blocks article edits on all wikis.
>
> There were a few suggestions thrown around:
> 1) use central notice.  This would certainly have the effect of alerting
> all wikis that there was some maintenance, but it has the disadvantage of
> telling all *readers* about the outage, rather than only the people that
> would actually be interested (those editing pages).
> 2) make mediawiki cache the change to conceal the outage from editors.  The
> idea here is that mediawiki would notice that the backend database is
> currently in read-only mode and would cache the change and write it to the
> DB when it returns to read-write mode.  There are a number of technical
> challenges here, as well as the introduction of another system (the change
> cache), but it's an interesting way around the problem, since rather than
> addressing how to inform editors of impending maintenance it simply
> eliminates the necessity for that communication.
> 3) throw up a banner on the edit page itself.  The time when we want to
> inform someone that there is going to be maintenance that will impede
> editing is when the user begins an edit.  (At the moment we inform them
> only when they try to save, in the form of an error message.)  If there
> were a banner on all edit pages that informed the user not to save their
> document during a specific time period, they could choose to postpone the
> edit or finish quickly.  The text would be something like "There will be
> planned maintenance starting in 23 minutes and lasting for 30 minutes.  You
> will be unable to save edits during the maintenance period.  Please save
> your work before maintenance begins."  During the maintenance, we could
> change the message to be more visible, or we could take more drastic action
> such as disabling the edit or save buttons.
> 4) don't make any change from what we do now.  The external store databases
> rarely fail or undergo maintenance.  Increasing the complexity of the
> system to protect against their outage is more likely to cause harm than
> the outages themselves.  Instead, just announce it on the blog beforehand and
> apologize to anybody affected afterwards.
>
> I'm sure there are some more ideas on what we should do, as well as
> opinions about these various options out there.  Discuss!  :)  I haven't
> filed a bug yet, but will do so if this conversation comes to some
> consensus about a specific thing that should be done.
>
> Thanks,
>
> -ben
> _______________________________________________
> Wikitech-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>