Hi,

We had an incident today in which 17 KVM VMs on a hypervisor running a custom SmartOS build, 20130702T112237Z (script changes only, no source changes), became unresponsive. One additional OS KVM seemed largely unaffected and continued to respond; it runs OpenVPN, which managed to maintain its connections throughout the incident.
For a period we were unable to log into the hypervisor via the console, and SSH connections timed out or were closed. Eventually SSH connections succeeded and we were able to log in, at which point we found the load average was around 33 but apparently dropping.

We can't see a great deal in the hypervisor logs to suggest the cause. However, on all of the VMs ntpd stopped running because the clock had too much skew:

ntpd[18019]: time correction of 2019 seconds exceeds sanity limit (1000); set clock manually to the correct UTC time.

We also spotted this in dmesg, right before the kernel reported a bunch of hung tasks:

hrtimer: interrupt too slow, forcing clock min delta to 540258 ns

It looks like time instability or a clock jump may have caused a few of the VMs to generate higher than average load.

While looking at the issue we spotted this commit:

https://github.com/joyent/illumos-kvm/commit/35adb214e6fd51779967667be6e02e08791e40ad

This sounds like it *could* be the cause of the issue, as we are on Sandy Bridge. Is there any background to this issue, or anything we can look for that would confirm or rule out that this is the bug we hit?

Regards
Steve.
