I don’t know if this helps or not, but a better way to think of it is that for 
every page of memory, the OS wants to make sure there’s somewhere it can go on 
disk.  For application code this is easy: it can just re-read the code from 
the binary (assuming it hasn’t changed) if necessary.  For any memory allocated 
by a program, however, the OS must set aside enough swap to be able to save 
those pages if necessary (it doesn’t allocate any specific area, it just 
reserves an amount so it knows there’s enough room should it need it).   As long 
as it can make that reservation, an allocation will succeed.  Once that 
is no longer the case, allocations will fail.  Also remember that the kernel 
itself consumes a certain amount of memory, which reduces the amount of 
memory available to programs.
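
As an aside, and purely as an illustrative sketch (this is not part of the 
platform or of the test harness discussed below): on a system that does this 
kind of strict reservation, the failure shows up in the application as a failed 
allocation, which in Python surfaces as a MemoryError:

# Illustrative sketch only: allocate 256MB chunks until the OS refuses the
# reservation.  With strict reservation (no overcommit) the failure appears
# here as a MemoryError rather than as the process being killed later.
chunks = []
chunk_mb = 256
try:
    while True:
        chunks.append(bytearray(chunk_mb * 1024 * 1024))
        print("allocated %d MB" % (len(chunks) * chunk_mb))
except MemoryError:
    print("allocation refused after %d MB" % (len(chunks) * chunk_mb))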

However, long before the system runs out of swap, if the amount of physical 
memory available starts getting low, it will start to write out pages to swap 
(least recently used pages go first) to try to keep a small amount of free RAM 
on hand.  The faster the demand for memory grows, the faster it will attempt 
to write pages to swap.  The system will continue to make progress, though that 
progress is probably best measured in astronomical timescales.  
Since even SSDs are so much slower than RAM, the system will become rather 
unresponsive once this starts happening.  There is no OOM killer, so it will 
stay that way as long as the memory pressure remains (or until the system is 
restarted), and even after the pressure is removed, it might take a while before 
everything gets back on track.

For zone resource caps, I believe it’s similar in concept, but applied to the 
RAM being used by everything inside a particular zone.  It looks like in your 
examples you’re trying to allocate almost all the RAM in the system to zones 
(which are then trying to use it all), leaving almost nothing for the kernel.  
That would almost certainly cause it to start paging (in fact you can see that 
in the huge value in the ‘po’ column of the vmstat output).

If the processes consuming all the memory are killed (obviously easier if you 
already have an open shell and can issue the kill command), the system should 
recover.  If it does not, that could be an indication of a problem with the 
resource management code, and it would be useful to force a crash dump so the 
state of the system at the time can be examined.  It might help to have a bit 
more information about the system: is there a remote management device (DRAC, 
iLO, etc.), are you using the serial port as the console, or do you have a 
monitor and keyboard attached?  If there isn’t a remote management device, it 
might be possible to boot with kmdb and force a dump using that.



On June 24, 2017 at 1:08:13 AM, David Preece (da...@polymath.tech) wrote:

Hi,
On 23 June 2017 at 12:54:36 AM, Jerry Jelinek (jerry.jeli...@joyent.com) wrote:

1) In your zone you are trying to use a lot more physical memory than the limit 
you have set for the zone. The overall thrashing behavior you have described 
sounds like what would be expected in this case.
So, there's a lot I don't understand about SmartOS memory. If I set up a zone 
with a 4GB physical cap and an 8GB swap cap and allocate 4GB inside it, it 
shows (under zonememstat) as being 50% full. I take this to mean that the 
maximum that can be allocated inside the zone is 8GB, and that the paging 
mechanism is responsible for deciding which bit is physical and which bit is 
on disk.

Here's an example on a 32GB machine (with 64GB of NVMe swap):

[root@tiny ~]# zonememstat -t
                                 ZONE  RSS(MB)  CAP(MB)  NOVER  POUT(MB) SWAP%
                               global        0        -      -         -     -
           ctr-vKUWfLdjACa6aUt3fRTt5P        2     8192      0         0 87.57
           ctr-quLfQMoeND3yrnZj5aWgeL        2     8192      0         0 87.57
           ctr-5NtmqXi98tsFTqYBRNH3ZU        2     8192      0         0 87.57
           ctr-X33PVA7jEXVsVx3ZhT2sJ5        2     8192      0         0 87.57
                                total        8    32768      0         0     -

Each zone is running Debian (lx) and has 7GB allocated in it (by python3). 
Edited highlights of ps from the global zone:

root      4564  0.0 22.0 7369636 7347664 ?        S 05:15:02  0:03 python3 /incremental_alloc.py 7
root      4579  0.0 22.0 7369636 7347668 ?        S 05:15:09  0:03 python3 /incremental_alloc.py 7
root      4594  0.0 22.0 7369636 7347660 ?        S 05:15:15  0:04 python3 /incremental_alloc.py 7
root      4609  0.0 22.0 7369636 7347664 ?        S 05:15:19  0:04 python3 /incremental_alloc.py 7

...which disagrees with zonememstat on the RSS of each zone (7347664 KB), but 
at least we can see the allocation. Swap agrees with ps:

[root@tiny ~]# swap -lh
swapfile             dev    swaplo   blocks     free
/dev/zvol/dsk/zones/swap 90,1        4K      64G      64G

As does vmstat:

[root@tiny ~]# vmstat -S 1
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  si  so pi po fr de sr bk bk lf rm   in   sy   cs us sy id
 0 0 0 78725148 11829532 0 0 182 0 0  0 2373 -0 20 37 -11 3039 5894 2034 0 1 98
 0 0 0 68523768 1342112 0 0  0  0  0  0  0  0  0  0  0 2735 1364  471  0  1 99
 0 0 0 68523688 1342032 0 0  0  0  0  0  0  0  0  0  0 2544  649  370  0  1 99
 0 0 0 68523688 1342032 0 0  0  0  0  0  0  0  6  0  0 2875 2008 1501  0  1 98

If I now allocate 1GB more in one of the zones...

[root@tiny ~]# vmstat -S 1
 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  si  so pi po fr de sr bk bk lf rm   in   sy   cs us sy id
 0 0 0 77583364 10655732 0 0 161 0 0  0 2107 -0 19 37 -11 2988 5306 1861 0 1 98
 0 0 0 68523744 1342084 0 0  0  0  0  0  0  0  0  0  0 2670  691  457  0  1 99
 0 0 0 68523664 1342004 0 0  0  0  0  0  0  0  8  0  0 2720  660 1230  0  1 99
 0 0 0 68523664 1342004 0 0  0  0  0  0  0  0  0  0  0 2633 2000  541  0  1 99
 0 0 0 68523664 1342004 0 0  0  0  0  0  0  0  0  0  0 2584  646  410  0  1 99
 0 0 0 68523664 1342004 0 0  0  0  0  0  0  0  0  0  0 2504  652  361  0  1 99
 0 0 0 68523664 1342004 0 0  0  0  0  0  0  0  0  0  0 2544  646  406  0  1 99
 0 0 0 68523664 1342004 0 0  0  0  0  0  0  0  0  0  0 2608 1311  498  0  1 99
 0 0 0 68523664 1342004 0 0  0  0  0  0  0  0  0  0  0 2661  646  431  0  1 99
 0 0 0 68523664 1342004 0 0  0  0  0  0  0  0  0  0  0 2482  653  339  0  1 99
 0 0 0 68523664 1342004 0 0  0  0  0  0  0  0  0  0  0 2556  646  409  0  1 99
 0 0 0 68523664 1342004 0 0  0  0  0  0  0  0  0  0  0 2660 1300  530  0  1 99
 0 0 0 68523664 1342004 0 0  0  0  0  0  0  0  0  0  0 2898 1211  593  2  4 94
 0 0 0 67857516 675848 0  0  0  0  0 522324 1219476 0 0 0 0 6026 2471 1707 1 12 86
 0 0 0 67472216 290548 0  0 80  0  0 470092 1026886 0 0 0 15 652865 650 2424 0 26 74
 0 0 0 67471628 289852 0  0  0  0 2652 423084 1109202 0 0 0 0 684690 649 919 0 29 71
 0 0 0 67471576 292452 0  0  0 74256 75940 380776 1476922 0 98 0 0 533368 646 431081 0 29 71
 0 0 0 67147676 42084 0   0  0  0 2712 342700 1429250 0 0 0 0 2906 659 1940 0 11 89
 0 0 0 67147440 44564 0   0  0  0 45422 308432 1426855 0 0 0 0 2615 630 23296 0 11 89

All hell breaks loose and 25 seconds later the box locks entirely (full trace 
at https://gist.github.com/RantyDave/218f8f3bab74fa623677b450f372286e).

 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  si  so pi po fr de sr bk bk lf rm   in   sy   cs us sy id
 0 0 6 67131172 119868 0  0  0  0  0 37512 804536 0 0 0 0 2488 596 722 0  9 91
 0 0 6 67131172 119868 0  0  0  0  0 33764 732708 0 0 0 0 2483 607 1307 0 10 90
 0 0 6 67131172 119868 0  0  0  0  0 30388 717851 0 0 0 0 2469 596 902 0  9 91

The machine still responds to ping, and you can type at the console but not log 
in. This went a lot less well than I expected.

After power cycling the machine there's nothing in /cores or /var/cores. Edited 
highlights of the zone config (from zonecfg info):

zonename: ctr-quLfQMoeND3yrnZj5aWgeL
zonepath: /zones/ctr-quLfQMoeND3yrnZj5aWgeL
brand: lx
limitpriv: default
scheduling-class: 
ip-type: exclusive
hostid: 
fs-allowed: 
uuid: ac9b7109-536f-c632-dd37-8dddaab0cc4b
[max-lwps: 2000]
[max-shm-memory: 8G]
[max-shm-ids: 4096]
[max-msg-ids: 4096]
[max-sem-ids: 4096]
[cpu-shares: 100]
net:
(snipped)

capped-memory:
[physical: 8G]
[swap: 8G]
[locked: 8G]

(more snip)
attr:
name: docker
type: string
value: true
attr:
name: init-name
type: string
value: /native/usr/vm/sbin/dockerinit
attr:
name: kernel-version
type: string
value: 3.16
rctl:
name: zone.max-lwps
value: (priv=privileged,limit=2000,action=deny)
rctl:
name: zone.max-shm-memory
value: (priv=privileged,limit=8589934592,action=deny)
rctl:
name: zone.max-shm-ids
value: (priv=privileged,limit=4096,action=deny)
rctl:
name: zone.max-sem-ids
value: (priv=privileged,limit=4096,action=deny)
rctl:
name: zone.max-msg-ids
value: (priv=privileged,limit=4096,action=deny)
rctl:
name: zone.cpu-shares
value: (priv=privileged,limit=100,action=none)
rctl:
name: zone.max-physical-memory
value: (priv=privileged,limit=8589934592,action=deny)
rctl:
name: zone.max-swap
value: (priv=privileged,limit=8589934592,action=deny)
rctl:
name: zone.max-locked-memory
value: (priv=privileged,limit=8589934592,action=deny)

Setting physical, swap and locked to the same value is what happens if you 
create a textbook lx zone using vmadm.

Oh, and I set the ARC to be tiny...

[root@tiny ~]# arcstat
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c  
05:51:43     0     0      0     0    0     0    0     0    0    67M  512M  

And rcapd wasn't running. 

Running the test again with rcapd, we get a better-looking zonememstat:

[root@tiny ~]# zonememstat
                                 ZONE  RSS(MB)  CAP(MB)  NOVER  POUT(MB) SWAP%
                               global      185        -      -         -     -
           ctr-wBHaBqWZvWFAp9huch5ioX     7185     8192      0         0 87.57
           ctr-FHTL4BQ9Wh63kFShD6M8L6     7181     8192      0         0 87.57
           ctr-AsFSSvYzomUFXJgTNSxkVH     7181     8192      0         0 87.57
           ctr-RZXUVpsgAzKKmakdPD3PZN     7181     8192      0         0 87.57

The output from ps is the same. The effect of allocating the last gig is the 
same, and because I left vfsstat running we get another smoking gun:

  r/s   w/s  kr/s  kw/s ractv wactv read_t writ_t  %r  %w   d/s  del_t zone
 34.6   1.3  15.3   0.3   0.0   0.8    2.0 643107.4   0  81   0.0    0.0 global (0)
  0.0 161.8   0.0 18877.8   0.0   0.1    0.0  581.8   0   9   0.0    0.0 ctr-wBHa (1)
  0.0   0.0   0.0   0.0   0.0   0.0    0.0    0.0   0   0   0.0    0.0 ctr-FHTL (2)
  0.0   0.0   0.0   0.0   0.0   0.0    0.0    0.0   0   0   0.0    0.0 ctr-AsFS (3)
  0.0   0.0   0.0   0.0   0.0   0.0    0.0    0.0   0   0   0.0    0.0 ctr-RZXU (4)

2) The process eventually terminates with a SIGBUS.
The first time through I used Alpine Linux; this time it was Debian.

What I was *expecting* was that either the allocation would return NULL 
(possibly causing the SIGBUS), a signal would be sent to the application (same 
result again), or a Linux-esque OOM killer would shoot it through the head. 
Either way, I would have thought that preventing a non-global zone from taking 
down the global zone would be job #1 :(

 I don't know if this is an issue with your application code or with our 
platform.
The application code is a test harness *just* for incrementally allocating 
memory: https://gist.github.com/RantyDave/c2322891f86f26f4696b3a8b3a478b62
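
The gist is the definitive version; a rough sketch of the idea (the Enter-driven 
growth here is just an assumption, based on the "allocate 1GB more" step above) 
looks something like this:

# Rough sketch only; see the gist for the actual harness.  Allocate the number
# of gigabytes given on the command line, then allocate one more GB each time
# Enter is pressed, so memory pressure can be raised in steps.
import sys

GB = 1024 * 1024 * 1024
blocks = [bytearray(GB) for _ in range(int(sys.argv[1]))]

while True:
    input("holding %d GB; press Enter to allocate one more... " % len(blocks))
    blocks.append(bytearray(GB))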

3) The box eventually locks up. That is clearly our issue and is something we 
would want to investigate. Can you force a system dump and provide that to us? 
If you can't NMI your box when it is in this state, then you might be able to 
force a dump using DTrace.
Sorry, I really can't help you there.

-Dave


