Hi,
We are having some strange host ‘responsiveness’ issues when executing commands whilst logged into a SmartOS host via SSH.

We have some fairly large compute nodes in our environment; one of these nodes has the following resources:

  4x 4850 v4 processors
  1.5TB RAM
  RAIDZ2 zones pool made up of 3x vdevs – each consisting of 1.92TB SAMSUNG SM860 SSDs

This specific host has just under 200 VM instances running on it – a mix of KVM, LX and OS zones. It is running the following image:

  joyent_20170803T064301Z

Currently less than 700GB RAM has been allocated to VM instances on the host, and ARC – at any one time – is consuming about 500GB RAM.

The host seems to be running fine (CPU utilisation is consistently <50%), except every so often, e.g. when running ‘w’ or ‘prstat -Z’, we can get a significantly delayed response when we are logged into the host via SSH. So entering ‘prstat -Z’ on the command line can result in us getting a ‘Please wait…’ response for a variable amount of time – up to 60 seconds on occasion.

Similarly:

  [root@c10 ~]# w
   15:55:41 up 12 day(s), 17:57, 1 user, load average: 45.33, 45.15, 48.73
  User tty login@ idle JCPU PCPU

.. can also result in extremely delayed feedback on the command line (like now 😊 )

As I said, this happens intermittently, and does not appear to be related to host load average. In situations when this occurs, it also seems that our Zabbix agent on the host is impacted and only partially responsive for a period.

We recently did an image update and reboot (12 days ago) – as this was happening with an older image as well; after the reboot, this host did not show ANY of these symptoms for 4-5 days, but after that, it has started to exhibit exactly the same symptoms as were happening with the previous image.

Also, what has happened in the last 2 days is that on at least one occasion that we are aware of, the host became COMPLETELY unresponsive when logged in via SSH – for 3-4 minutes.
This has happened once since we have rebooted this host into a newer image, but was in fact the main reason why we decided to upgrade the image 12 days ago (because exactly the same thing happened that day). After being unresponsive on the command line for 3-4 minutes, the host became responsive again with seemingly no negative impact.

Logs show nothing, and there’s nothing in dmesg to indicate anything untoward.

It seems to us that this may either be some sort of network issue (although when the host has been completely unresponsive to commands, it has been responding to ICMPs), or (possibly) some sort of memory allocation issue related to the global zone.

Any ideas?

Kind regards,
Angelo.
