[This is a follow-up for https://status.documentfoundation.org/incidents/265 .]
Starting from the early hours of April 25, some services, notably the wiki, blog, pad and extensions sites, experienced slower response times than usual. Unfortunately the situation got worse, and by 10AM most requests timed out, making the aforementioned sites mostly unreachable. In addition, outgoing emails were not delivered, and emails to our public mailing lists not accepted, in a timely manner, thereby causing delays.

We identified that some volumes in our distributed file system had lost consistency. That happens from time to time, and discrepancies are normally resolved transparently by the self-heal services. Occasionally something gets stuck and manual intervention is required, which was apparently the case here: we therefore triggered a manual heal and asked for patience while it was underway (the postscript below sketches what these steps look like).

A manual heal is typically an I/O-intensive operation, so we didn't think much of high loads or of processes racing for I/O on the backend. But a heal normally completes in under 30 minutes, while this one seemed to be much slower… We paused some non-mission-critical VMs to free up I/O and give the healing process some slack, but that didn't improve things significantly.

Then it dawned on us that the crux of the problem was perhaps elsewhere after all, even though no hardware alert had gone off. Inspecting per-device I/O statistics, we noticed that one specific disk in a RAID array had far more queued reads than its peers. Its S.M.A.R.T. attributes suggested a healthy disk, but it obviously wasn't: once it was marked as faulty, the load almost immediately stabilized at acceptable levels. (In theory the kernel could have fetched the data from one of the redundant peers instead of insisting on the slow disk, but it apparently didn't.) It was shortly before 2PM, and from that point it didn't take long for the heal to finally complete; it would probably have finished much sooner had we kicked the faulty disk out earlier. That slow disk is also what likely triggered the issue (the consistency loss) in the first place: I/O-hungry processes racing against each other is not a good thing when I/O is scarce…

Unfortunately, while we had detailed I/O metrics in the monitoring system, no alert threshold was defined on them, and S.M.A.R.T. failed to properly identify the faulty device. Once the issue was mitigated, the faulty drive was replaced later that afternoon. Later during the week the array was rebuilt and the VMs were moved back to better balance the load.

Apologies for the inconvenience.

-- Guilhem.
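PS, for the technically curious: the sketches below illustrate roughly what the steps above look like; they are examples under stated assumptions, not transcripts of the exact commands we ran. First, triggering and monitoring a manual heal. This assumes a GlusterFS-style distributed file system (where "volume" and "self-heal" are the native terminology) and a hypothetical volume name:

    import subprocess

    VOLUME = "vol0"  # hypothetical volume name

    # Ask the self-heal daemon to start healing the volume, then list the
    # entries still pending heal.  "gluster volume heal" and its "info"
    # subcommand are standard GlusterFS CLI calls.
    subprocess.run(["gluster", "volume", "heal", VOLUME], check=True)
    pending = subprocess.run(["gluster", "volume", "heal", VOLUME, "info"],
                             capture_output=True, text=True, check=True)
    print(pending.stdout)  # per-brick list of entries still needing heal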
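Next, the kind of per-device check that would have caught the slow disk earlier: compare the number of in-flight I/Os across the RAID members and flag outliers. This assumes a Linux host, where the layout of /proc/diskstats is documented in the kernel's admin-guide/iostats.rst; the member names and the outlier factor are made up:

    from statistics import median

    MEMBERS = ("sda", "sdb", "sdc", "sdd")  # hypothetical RAID members
    OUTLIER_FACTOR = 5                      # hypothetical alert threshold

    def inflight_ios():
        """Map each RAID member to its 'I/Os currently in progress' count."""
        stats = {}
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                # fields[0:3] are major, minor and device name; fields[11]
                # is stat field 9, "number of I/Os currently in progress".
                if fields[2] in MEMBERS:
                    stats[fields[2]] = int(fields[11])
        return stats

    stats = inflight_ios()
    baseline = max(median(stats.values()), 1)  # don't flag an idle array
    for dev, inflight in sorted(stats.items()):
        if inflight > OUTLIER_FACTOR * baseline:
            print(f"ALERT: {dev} has {inflight} in-flight I/Os "
                  f"(median is {baseline}), suspect disk")

Hooked into the monitoring system with a sensible threshold, a check along these lines would have pointed at the sick disk hours earlier.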
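Finally, evicting and replacing the suspect member. The "marked as faulty" step suggests Linux software RAID, where mdadm's --fail, --remove and --add flags do exactly that; the array and device names are again hypothetical:

    import subprocess

    ARRAY, DISK = "/dev/md0", "/dev/sdc"  # hypothetical names

    # Mark the slow member as faulty so the kernel stops issuing reads to
    # it, then detach it from the array.
    subprocess.run(["mdadm", ARRAY, "--fail", DISK], check=True)
    subprocess.run(["mdadm", ARRAY, "--remove", DISK], check=True)
    # Once the drive has been physically swapped, re-add it and let the
    # array rebuild in the background:
    subprocess.run(["mdadm", ARRAY, "--add", DISK], check=True)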