This sounds like OS-5213.

https://smartos.org/bugview/OS-5213

When ps hangs, does the process that prints immediately after the hang have a
very large memory allocation (tens of GB)?
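
If it is hard to catch by eye, one rough way to narrow it down is to query
each process separately and time the lookup. The sketch below is only an
illustration, not something taken from the ticket: it assumes Python 3.7+ is
available on the host, the function names (list_pids, ps_probe, find_slow) are
made up for this example, and the 1 second threshold is arbitrary.

```python
import os
import subprocess
import time


def list_pids(proc_dir="/proc"):
    """Return the numeric entries of /proc as a sorted list of PIDs."""
    return sorted(int(name) for name in os.listdir(proc_dir) if name.isdigit())


def ps_probe(pid):
    """Ask ps about a single process; returns '' if the process has exited."""
    result = subprocess.run(
        ["ps", "-o", "pid=,rss=,vsz=,args=", "-p", str(pid)],
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def find_slow(pids, probe=ps_probe, threshold=1.0):
    """Time probe(pid) for each PID and return the slow ones.

    Each entry is (pid, elapsed_seconds, probe_output). The threshold is an
    arbitrary cut-off chosen for this sketch.
    """
    slow = []
    for pid in pids:
        start = time.monotonic()
        output = probe(pid)
        elapsed = time.monotonic() - start
        if elapsed >= threshold:
            slow.append((pid, elapsed, output))
    return slow
```

Calling find_slow(list_pids()) while the host is misbehaving and printing the
returned tuples should show which PIDs stall, along with their RSS and VSZ
columns, so you can see whether the slow ones are the very large processes.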

-- 
Brian Bennett
Systems Engineer, Cloud Operations
Joyent, Inc. | www.joyent.com

> On Oct 12, 2017, at 9:10 AM, Dr. Angelo Roussos <[email protected]> 
> wrote:
> 
> Hi,
>  
> We are having some strange host ‘responsiveness’ issues when executing 
> commands whilst logged into a SmartOS host via SSH.
>  
> We have some fairly large compute nodes in our environment; one of these 
> nodes has the following resources:
>  
> 4x 4850 v4 processors
> 1.5TB RAM
> RAIDZ2 zones pool made up of 3x vdevs – each consisting of 1.92TB SAMSUNG 
> SM860 SSDs
>  
> This specific host has just under 200 VM instances running on it – and a mix 
> of KVM, LX, OS zones.
>  
> It is running the following image: joyent_20170803T064301Z
>  
> Currently less than 700GB RAM has been allocated to VM instances on the host, 
> and ARC – at any one time – is consuming about 500GB RAM.
>  
> The host seems to be running fine (cpu utilisation is consistently <50%), 
> except every so often e.g. when running ‘w’ or ‘prstat -Z’, we can get a 
> significantly delayed response when we are logged into the host via SSH.
>  
> So entering ‘prstat -Z’ on the command line can result in us getting a 
> ‘Please wait…’ response for a variable amount of time – up to 60 seconds on 
> occasion.
>  
> Similarly:
>  
> ‘
> [root@c10 ~]# w
> 15:55:41    up 12 day(s), 17:57,  1 user,  load average: 45.33, 45.15, 48.73
> User     tty      login@         idle    JCPU    PCPU
>  
> ‘
>  
> .. can also result in extremely delayed feedback on the command line (like 
> now 😊 )
>  
> As I said, this happens intermittently, and does not appear to be related to 
> host load average.
>  
> In situations when this occurs, it also seems that our Zabbix agent on the 
> host is impacted and only partially responsive for a period.
>  
> We recently did an image update and reboot (12 days ago) – as this was 
> happening with an older image as well; after the reboot, this host did not 
> show ANY of these symptoms for 4-5 days, but after that, it has started to 
> exhibit exactly the same symptoms as were happening with the previous image.
>  
> ALSO, what has happened in the last 2 days is that on at least one occasion 
> that we are aware of, the host became COMPLETELY unresponsive when logged in 
> via SSH – for 3-4 minutes. This has happened once since we have rebooted this 
> host into a newer image, but was in fact the main reason why we decided to 
> upgrade the image 12 days ago (because exactly the same thing happened that 
> day). After being unresponsive on the command line for 3-4 minutes, the host 
> became responsive again with seemingly no negative impact.
>  
> Logs show nothing, and there’s nothing in dmesg to indicate anything untoward.
>  
> It seems to us that this may either be some sort of network issue (although 
> when the host has been completely unresponsive to commands, it has been 
> responding to ICMPs), or (possibly) some sort of memory allocation issue 
> related to the global zone.
>  
> Any ideas?
>  
> Kind regards,
>  
> Angelo.
>  
