Hi,

We had an incident today where 17 KVM VMs on a hypervisor running a custom
SmartOS build, 20130702T112237Z (just scripts, no source changes), became
unresponsive. One additional KVM VM seemed largely unaffected and continued
to respond (it runs OpenVPN, which managed to maintain connections
throughout the incident).

For a period we were unable to log into the hypervisor via the console, and
SSH connections timed out or were closed. Eventually SSH connections
succeeded and we were able to log in. We found the load average was around
33 but apparently dropping.


We can't see a great deal in the hypervisor logs to suggest the cause of the
issue. However, on all of the VMs ntpd stopped running because the clock
had too much skew:

ntpd[18019]: time correction of 2019 seconds exceeds sanity limit (1000);
set clock manually to the correct UTC time.

Also spotted in dmesg:

hrtimer: interrupt too slow, forcing clock min delta to 540258 ns

right before the kernel reported a bunch of hung tasks.
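
For anyone wanting to check their own guests for the same two signatures, a
minimal Python sketch follows. It assumes each message sits on a single line
of the log text, worded exactly as quoted above. The helper name and the
labels it returns are made up for illustration and are not part of any
existing tool.

```python
import re

# Patterns based on the two messages quoted above.
NTPD_RE = re.compile(
    r"time correction of (-?\d+) seconds exceeds sanity limit \((\d+)\)"
)
HRTIMER_RE = re.compile(
    r"hrtimer: interrupt too slow, forcing clock min delta to (\d+) ns"
)


def find_clock_symptoms(log_text):
    """Return (kind, value) tuples for each matching line, in log order.

    Hypothetical helper: kind is "ntpd_skew_s" or "hrtimer_min_delta_ns".
    """
    hits = []
    for line in log_text.splitlines():
        m = NTPD_RE.search(line)
        if m:
            hits.append(("ntpd_skew_s", int(m.group(1))))
            continue
        m = HRTIMER_RE.search(line)
        if m:
            hits.append(("hrtimer_min_delta_ns", int(m.group(1))))
    return hits


if __name__ == "__main__":
    sample = (
        "ntpd[18019]: time correction of 2019 seconds exceeds sanity limit "
        "(1000); set clock manually to the correct UTC time.\n"
        "hrtimer: interrupt too slow, forcing clock min delta to 540258 ns\n"
    )
    for kind, value in find_clock_symptoms(sample):
        print(kind, value)
```

Feeding it the contents of a guest's syslog and dmesg output would show
whether both symptoms appear on a given VM and in what order.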

It looks like time instability or a clock jump may have caused a few of the
VMs to generate higher than average load.

While looking at the issue we spotted this commit:
https://github.com/joyent/illumos-kvm/commit/35adb214e6fd51779967667be6e02e08791e40ad

This sounds like it *could* be the cause of the issue. Is there any
background to this issue, or anything we can look for that would confirm or
rule out that this is the bug we hit? We are on Sandy Bridge.


Regards
Steve.


