This is a follow-up for https://status.documentfoundation.org/incident/236 .
On March 26 at 10:15 UTC one of our hypervisors crashed and brought some VMs down with it. When the hypervisor rebooted, a few VMs (most notably the ones serving Bugzilla, AskBot, and the download archive) refused to boot due to file system corruption. After some time spent trying to repair these, we discovered that the self-healing daemon of GlusterFS (the distributed file system we're currently using) had not triggered, and that some disk images were stuck in a split-brain state without being reported as such, let alone auto-repaired.

AskBot was brought up again at 18:45 UTC, to the best of our knowledge without data loss. Unfortunately, for Bugzilla we weren't able to get the file system back into a consistent enough state, and had to restore a snapshot from backups. Our backups are not continuous (they have a ~24h granularity), and changes made since March 25 ~23:00 UTC (about 80 changes) were not included when the service was brought up again at 22:00 UTC. The missing changes were later replayed from the notification mails sent to the libreoffice-bugs mailing list.

We lowered the priority of restoring the download archive due to the low number of requests to that service, and also because verifying the integrity of its ~500k files is a slow process. (As with the other services, we want to make sure the data we serve isn't corrupted, but the archive is much larger than our other data stores and hence takes much longer to verify.) A partial archive with ≥5.4 releases was restored on April 03; we then moved on to older releases, and the entire archive was available again on the evening of April 04.

We apologize for the inconvenience. There are a few things we can do to ensure this won't happen again, and we'll discuss some of these during our next infra call:

 * Storage: rebalance gluster volumes; evaluate alternative backend solutions, incl. on- and off-site (aka “cloud”).

 * Backup: for performance reasons writes are not reflected to physical disks immediately.
   We could reduce the interval between journal commits, but that won't fully eliminate the uncertainty window unless we also install a battery-backed cache. Similarly, we can't achieve fully continuous backups, but we can improve their granularity. A recurring topic in our infra calls is replacing dump-based database backups with continuous archiving and Point-in-Time Recovery (PITR). Unfortunately this has not been implemented yet; it would have avoided the data loss in the Bugzilla database (or at least reduced the 24h granularity to a sub-minute one), while also providing referential integrity guarantees.

 * Communication (notifying the community): while infra team members are busy putting the pieces back together, we're not always in a position to respond to questions from users and the community. Sophie, Italo, Mike and Florian were discussing how best to support infra by communicating the current status on the different channels (IRC, Telegram, email, Planet, Twitter, etc.), so that progress and resolution become more visible to all.

Note that our status page https://status.documentfoundation.org has RSS and Atom feeds, as well as email subscriptions for status changes and incidents.

-- Guilhem.
