This sounds like OS-5123. https://smartos.org/bugview/OS-5213
When ps hangs, the process that prints immediately after the hang, does it have a very large memory allocation? (10s of GBs?)

--
Brian Bennett
Systems Engineer, Cloud Operations
Joyent, Inc. | www.joyent.com

> On Oct 12, 2017, at 9:10 AM, Dr. Angelo Roussos <[email protected]>
> wrote:
>
> Hi,
>
> We are having some strange host ‘responsiveness’ issues when executing
> commands whilst logged into a SmartOS host via SSH.
>
> We have some fairly large compute nodes in our environment; one of these
> nodes has the following resources:
>
> 4x 4850 v4 processors
> 1.5TB RAM
> RAIDZ2 zones pool made up of 3x vdevs – each consisting of 1.92TB SAMSUNG
> SM860 SSDs
>
> This specific host has just under 200 VM instances running on it – and a mix
> of KVM, LX, OS zones.
>
> It is running the following image: joyent_20170803T064301Z
>
> Currently less than 700GB RAM has been allocated to VM instances on the host,
> and ARC – at any one time – is consuming about 500GB RAM.
>
> The host seems to be running fine (cpu utilisation is consistently <50%),
> except every so often e.g. when running ‘w’ or ‘prstat -Z’, we can get a
> significantly delayed response when we are logged into the host via SSH.
>
> So entering ‘prstat -Z’ on the command line can result in us getting a
> ‘Please wait…’ response for a variable amount of time – up to 60 seconds on
> occasion.
>
> Similarly:
>
> ‘
> [root@c10 ~]# w
> 15:55:41 up 12 day(s), 17:57, 1 user, load average: 45.33, 45.15, 48.73
> User tty login@ idle JCPU PCPU
>
> ‘
>
> .. can also result in extremely delayed feedback on the command line (like
> now 😊 )
>
> As I said, this happens intermittently, and does not appear to be related to
> host load average.
>
> In situations when this occurs, it also seems that our Zabbix agent on the
> host is impacted and only partially responsive for a period.
>
> We recently did an image update and reboot (12 days ago) – as this was
> happening with an older image as well; after the reboot, this host did not
> show ANY of these symptoms for 4-5 days, but after that, it has started to
> exhibit exactly the same symptoms as were happening with the previous image.
>
> ALSO, what has happened in the last 2 days is that on at least one occasion
> that we are aware of, the host became COMPLETELY unresponsive when logged in
> via SSH – for 3-4 minutes. This has happened once since we have rebooted this
> host into a newer image, but was in fact the main reason why we decided to
> upgrade the image 12 days ago (because exactly the same thing happened that
> day). After being unresponsive on the command line for 3-4 minutes, the host
> became responsive again with seemingly no negative impact.
>
> Logs show nothing, and there’s nothing in dmesg to indicate anything untoward.
>
> It seems to us that this may either be some sort of network issue (although
> when the host has been completely unresponsive to commands, it has been
> responding to ICMPs), or (possibly) some sort of memory allocation issue
> related to the global zone.
>
> Any ideas?
>
> Kind regards,
>
> Angelo.
