JMeybohm added a comment.
In T301147#7689837 <https://phabricator.wikimedia.org/T301147#7689837>, @dcausse wrote:

> @JMeybohm we're still investigating why the application did not properly recover while kubernetes1014 went down but if you have ideas on the two questions in the ticket description this would be very helpful, thanks!

Unfortunately I'm not exactly sure what happened to the node. What I know is that the system load surged (potentially due to high iowait), leaving running processes practically starving, but the system was still responding to ICMP and the Kubernetes status heartbeats still (mostly) worked. This left the node flipping between Ready and NotReady state.

That means the node was not actually down from the k8s point of view, which is why no new Pods were created until I drained the node, or rather until I powercycled it (evicting the pods was hanging as well, as k8s tries to be nice and the node was still in its overloaded state).

TASK DETAIL
https://phabricator.wikimedia.org/T301147

To: JMeybohm
Cc: Addshore, JMeybohm, Michael, Aklapper, dcausse, Invadibot, MPhamWMF, maantietaja, CBogen, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
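
To illustrate the Ready/NotReady flapping described in the comment above, here is a rough sketch of why occasional heartbeats can keep Pods from being rescheduled. It is a simplified model, not how the Kubernetes controllers are actually implemented. The two timing constants are assumptions for illustration (they resemble commonly cited upstream defaults, but are not taken from this cluster's configuration), and the function name `would_evict` is hypothetical.

```python
from typing import List

# Assumed values for illustration only, not this cluster's settings.
GRACE_PERIOD_S = 40     # seconds without a heartbeat before the node counts as NotReady
EVICTION_AFTER_S = 300  # seconds of continuous NotReady before Pods get evicted


def would_evict(heartbeats: List[float], observe_until: float) -> bool:
    """Return True if the node stays NotReady long enough to trigger eviction.

    `heartbeats` are timestamps (seconds) of successful status updates,
    all assumed to be <= observe_until. A heartbeat at t=0 is assumed.
    """
    last = 0.0
    for t in sorted(heartbeats) + [observe_until]:
        # The node turns NotReady GRACE_PERIOD_S after the last heartbeat
        # and stays that way until the next heartbeat (or end of observation).
        not_ready_for = t - (last + GRACE_PERIOD_S)
        if not_ready_for >= EVICTION_AFTER_S:
            return True
        last = t
    return False


# An overloaded node that still manages a heartbeat every two minutes:
flapping = [float(t) for t in range(120, 3600, 120)]
print(would_evict(flapping, 3600.0))  # node keeps flipping, never evicted
# A node that is really gone sends nothing at all:
print(would_evict([], 3600.0))
```

In this model every late heartbeat resets the NotReady timer, so a node that is unusable in practice can still avoid eviction for as long as it keeps answering now and then.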