[libreoffice-website] Postmortem March 26 incident report

Guilhem Moulin Wed, 17 Apr 2019 16:52:03 -0700

This is a follow-up for https://status.documentfoundation.org/incident/236 .


On March 26 at 10:15 UTC one of our hypervisors crashed, and brought some
VMs down with it.  When the hypervisor rebooted, a few VMs — most
notably the ones serving Bugzilla, AskBot, and the download archive —
refused to boot due to File System corruption issues.  After some time
spent trying to repair these, we discovered that GlusterFS' (the
distributed File System we're currently using) self-healing daemon
didn't trigger and some disk images were stuck in split-brain state,
despite not being reported as such, let alone auto-repaired.

AskBot was brought up again at 18:45 UTC, to the best of our knowledge
without data loss.

Unfortunately for Bugzilla we weren't able to get the FS back into a
consistent enough state, and had to restore a snapshot from backups.
Our backups are not continuous (they have a ~24h granularity), and
changes since March 25 ~23:00 UTC (about 80 changes) were not included
when the service was brought up again at 22:00 UTC.  The missing changes
were later replayed from the notification mails sent to the
libreoffice-bugs mailing list.
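
As an illustration only, picking out the notification mails that fall
into such a gap could look like the sketch below.  It is not the
procedure that was actually used: the function name is hypothetical, and
it assumes the mails are available as parsed messages with a usable Date
header.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def messages_in_window(messages, start, end):
    """Return messages whose Date header falls in [start, end), oldest first."""
    selected = []
    for message in messages:
        raw_date = message["Date"]
        if raw_date is None:
            continue  # no Date header, cannot be placed in time
        try:
            sent = parsedate_to_datetime(raw_date)
        except (TypeError, ValueError):
            continue  # unparseable Date header
        if sent.tzinfo is None:
            # Assumption: treat dates without a UTC offset as UTC.
            sent = sent.replace(tzinfo=timezone.utc)
        if start <= sent < end:
            selected.append((sent, message))
    selected.sort(key=lambda pair: pair[0])
    return [message for _, message in selected]


# Window taken from this report: last backup (~23:00 UTC) to the crash.
GAP_START = datetime(2019, 3, 25, 23, 0, tzinfo=timezone.utc)
GAP_END = datetime(2019, 3, 26, 10, 15, tzinfo=timezone.utc)
```

The standard library's mailbox.mbox yields message objects that can be
passed in directly, e.g. messages_in_window(mailbox.mbox(path),
GAP_START, GAP_END), where path points at a local copy of the list
traffic.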

We lowered the priority of restoring the download archive due to the low
number of requests to that service, and also because verifying file
integrity of the ~500k files is a slow process.  (Like for the other
services we want to make sure the data we serve isn't corrupted, but the
download archive is much larger than our other data stores, hence takes
much longer.)  A partial archive with ≥5.4 releases was restored on April 03;
then we moved on to older releases, and the entire archive was available
again on April 04 evening.
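
To give an idea of why such a check is slow, here is a minimal sketch of
a per-file verification.  It assumes a manifest mapping relative file
names to known-good SHA-256 digests exists; this report does not say
which checksums or tools were actually used, and both function names are
made up for the example.  Every byte of every file has to be read back,
which adds up over ~500k files.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_corrupted(root, manifest):
    """Return the manifest entries that are missing or whose digest differs.

    `manifest` maps a path relative to `root` to its expected hex digest
    (an assumed format, not one taken from this report).
    """
    root = Path(root)
    bad = []
    for relative_name, expected in sorted(manifest.items()):
        candidate = root / relative_name
        if not candidate.is_file() or sha256_of(candidate) != expected:
            bad.append(relative_name)
    return bad
```

Only files for which find_corrupted() returns nothing would be safe to
serve again; anything it lists would need to be restored from another
copy.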

We apologize for the inconvenience.  There are a few things we can do to
ensure this won't happen again, and we'll discuss some of these during
our next infra call:

 * Storage: rebalance gluster volumes; evaluate alternative backend
   solutions, incl. on- and off-site (aka “cloud”).

 * Backup: for performance reasons writes are not reflected to physical
   disks immediately.  We could reduce the interval between journal
   commits but that won't fully eliminate the uncertainty window unless
   we also install a battery-backed cache.  Similarly we can't achieve
   fully continuous backups, but we can improve granularity there.  A
   recurring topic in our infra calls is to replace dump-based database
   backups with continuous archiving and Point-in-Time Recovery (PITR).
   Unfortunately this solution has not been implemented yet; it would
   have solved the data loss in the Bugzilla database (or at least
   reduced the 24h granularity to a sub-minute one), while at the same
   time providing referential integrity guarantees.

 * Communication (notifying the community): while infra team members are
   busy trying to put the pieces back, we're not always in a position to
   respond to questions from users & community.  Sophie, Italo, Mike and
   Florian were discussing how to best support infra with communicating
   the status quo on the different channels (IRC, Telegram, email,
   Planet, Twitter, etc.), so progress and resolution become more
   visible to all.

   Note that our status page https://status.documentfoundation.org has
   RSS and Atom feeds, as well as email subscription for status changes
   and incidents.
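
Regarding the Backup item above, the effect of backup granularity on
data loss can be worked out with a few lines.  This is only a sketch:
the function name is hypothetical, the 23:00 UTC backup times are the
approximate ones quoted in this report, and the March 24 entry is an
assumed earlier nightly run.

```python
from datetime import datetime, timedelta, timezone


def data_loss_window(recovery_points, failure_time):
    """Time between the newest usable recovery point and the failure."""
    usable = [point for point in recovery_points if point <= failure_time]
    if not usable:
        raise ValueError("no recovery point precedes the failure")
    return failure_time - max(usable)


crash = datetime(2019, 3, 26, 10, 15, tzinfo=timezone.utc)
nightly_dumps = [
    datetime(2019, 3, 24, 23, 0, tzinfo=timezone.utc),  # assumed earlier run
    datetime(2019, 3, 25, 23, 0, tzinfo=timezone.utc),  # approximate, per report
]
print(data_loss_window(nightly_dumps, crash))  # 11:15:00
```

With ~24h granularity the window can grow to a full day in the worst
case.  With continuous archiving the recovery points are as frequent as
the archiving interval, so the same calculation would yield at most that
interval.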

-- 
Guilhem.
