[smartos-discuss] Possible SmartOS Compute Node Global Zone Memory issue?

Dr. Angelo Roussos Thu, 12 Oct 2017 09:11:34 -0700

Hi,


We are having some strange host ‘responsiveness’ issues when executing commands 
whilst logged into a SmartOS host via SSH.

 

We have some fairly large compute nodes in our environment; one of these nodes 
has the following resources:

 

4x Intel Xeon E7-4850 v4 processors
1.5TB RAM
RAIDZ2 zones pool made up of 3x vdevs, each consisting of 1.92TB SAMSUNG SM860 
SSDs

 

This specific host has just under 200 VM instances running on it, a mix of 
KVM, LX, and OS zones.

 

It is running the following image: joyent_20170803T064301Z

 

Currently less than 700GB RAM has been allocated to VM instances on the host, 
and ARC – at any one time – is consuming about 500GB RAM.
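
As a rough sanity check on those figures, the memory budget can be sketched as 
below. This is only back-of-the-envelope arithmetic: it reads 1.5TB as 1536 GiB, 
treats the VM allocation as its upper bound, takes ARC at the approximate value 
quoted above, and ignores kernel and other global zone consumers. The variable 
names are illustrative.

```python
# Rough memory budget from the figures in this post (all values approximate).
GIB_PER_TIB = 1024

total_gib = int(1.5 * GIB_PER_TIB)  # 1.5TB installed, assumed to mean 1536 GiB
vm_alloc_gib = 700                  # "less than 700GB", so an upper bound
arc_gib = 500                       # "about 500GB" at any one time

# Whatever is left after VM allocations and ARC; kernel and other
# global zone usage is not accounted for here.
headroom_gib = total_gib - vm_alloc_gib - arc_gib
headroom_share = headroom_gib / total_gib

print(f"remaining after VMs and ARC: ~{headroom_gib} GiB "
      f"({headroom_share:.0%} of installed RAM)")
```

On these numbers a few hundred GiB should still be unaccounted for, which is 
part of why the behaviour is puzzling.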

 

The host seems to be running fine (CPU utilisation is consistently <50%), 
except that every so often, e.g. when running ‘w’ or ‘prstat -Z’, we get a 
significantly delayed response when logged into the host via SSH.

 

So entering ‘prstat -Z’ on the command line can result in us getting a ‘Please 
wait…’ response for a variable amount of time – up to 60 seconds on occasion.
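
To put timestamps on these stalls, a small probe along the following lines 
could be left running on the host. It is a minimal sketch, not something we 
have deployed: the function names, the sampling interval and the threshold are 
all hypothetical, and it only measures the wall-clock time of whatever command 
it is given.

```python
import subprocess
import time


def time_command(argv, timeout=120.0):
    """Run argv once, discard its output, and return wall-clock seconds.

    If the command has not finished after `timeout` seconds it is killed
    and `timeout` is returned instead.
    """
    start = time.perf_counter()
    try:
        subprocess.run(argv, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL,
                       timeout=timeout, check=False)
    except subprocess.TimeoutExpired:
        return timeout
    return time.perf_counter() - start


def sample(argv, count, interval, threshold):
    """Time argv `count` times, sleeping `interval` seconds between runs.

    Returns a list of (timestamp, seconds) for runs that took at least
    `threshold` seconds.
    """
    slow = []
    for _ in range(count):
        stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
        elapsed = time_command(argv)
        if elapsed >= threshold:
            slow.append((stamp, elapsed))
        time.sleep(interval)
    return slow


# Example (hypothetical settings): probe `w` once a minute for an hour
# and report any run that took 5 seconds or longer.
#
#   for stamp, elapsed in sample(["w"], count=60, interval=60.0, threshold=5.0):
#       print(f"{stamp} w took {elapsed:.1f}s")
```

The timestamps of slow runs could then be lined up against ARC size, load 
average and the Zabbix gaps to see whether anything correlates.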

 

Similarly:

 

[root@c10 ~]# w
15:55:41    up 12 day(s), 17:57,  1 user,  load average: 45.33, 45.15, 48.73
User     tty      login@         idle    JCPU    PCPU

 

... can also result in extremely delayed output on the command line (as is 
happening right now 😊).

 

As I said, this happens intermittently, and does not appear to be related to 
host load average.

 

In situations when this occurs, it also seems that our Zabbix agent on the host 
is impacted and only partially responsive for a period.

 

We did an image update and reboot 12 days ago, because this was also 
happening with an older image. After the reboot, the host showed none of these 
symptoms for 4-5 days, but it has since started to exhibit exactly the same 
symptoms as it did under the previous image.

 

Also, in the last 2 days there has been at least one occasion that we are 
aware of on which the host became COMPLETELY unresponsive over SSH for 3-4 
minutes. This has happened once since we rebooted the host into the newer 
image; exactly the same thing happening 12 days ago was the main reason we 
decided to upgrade the image that day. After being unresponsive on the command 
line for 3-4 minutes, the host became responsive again with seemingly no 
negative impact.

 

Logs show nothing, and there’s nothing in dmesg to indicate anything untoward.

 

It seems to us that this may be either some sort of network issue (although 
while the host was completely unresponsive to commands, it was still 
responding to ICMP pings), or possibly some sort of memory allocation issue 
related to the global zone.

 

Any ideas?

 

Kind regards,

 

Angelo.

