I don't know if this helps or not, but a better way to think of it is that for every page of memory, the OS wants to make sure there's somewhere it can go on disk. For application code this is easy: it can just re-read the code from the binary (assuming it hasn't changed) if necessary. For any memory allocated by a program, however, the OS must set aside enough swap to be able to save those pages if necessary (it doesn't allocate any specific area, it just reserves an amount so it knows there's enough room if it needs it). As long as it is able to make that reservation, an allocation will succeed. Once that is no longer the case, allocations will fail. Also remember that the kernel itself consumes a certain amount of memory, which reduces the amount of memory available for programs.
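For example, something like this (a hypothetical sketch, not anything from this thread) will keep succeeding until the reservation can no longer be made, then fail cleanly, with no OOM killer involved:

#!/usr/bin/env python3
# Hypothetical sketch: allocate anonymous memory until the kernel can no
# longer reserve swap to back it. The failing allocation simply returns
# an error (malloc gives NULL, which Python raises as MemoryError).
chunks = []
gb = 0
try:
    while True:
        chunks.append(bytearray(2 ** 30))  # 1GB, zero-filled, so pages are touched
        gb += 1
        print("allocated", gb, "GB so far")
except MemoryError:
    print("allocation failed after", gb, "GB; no more swap can be reserved")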
However, long before the system runs out of swap, if the amount of physical memory available starts getting low, it will start to write out pages to swap (least recently used pages are written out first) to try to keep a small amount of free RAM on hand. The faster the demand for memory grows, the faster it will attempt to write pages to swap. The system will continue to make progress, though that progress is probably best measured on astronomical timescales. Since even SSDs are so much slower than RAM, the system will become rather unresponsive once this starts happening. There is no OOM killer, so it will stay that way as long as the memory pressure remains (or until the system is restarted), and even after the pressure is removed it might take a while before everything gets back on track.

For zone resource caps, I believe it's similar in concept, but applied to the RAM being used by the processes inside a particular zone.

It looks like in your examples you're trying to allocate almost all the RAM in the system to zones (which then try to use it all), leaving almost nothing for the kernel. That would almost certainly cause it to start paging (in fact you can see that in the huge value in the 'po' column of the vmstat output). If the processes consuming all the memory are killed (obviously easier if you already have an open shell and can issue the kill command), the system should recover. If it does not, that could indicate a problem with the resource management code, and it would be useful to force a crash dump to allow examination of the system state at the time.

It might help to have a bit more information on the system: is there a remote management device (DRAC, iLO, etc.), are you using the serial port as the console, or do you have a monitor and keyboard installed? If there isn't a remote management device, it might be possible to boot with kmdb and force a dump using that.

On June 24, 2017 at 1:08:13 AM, David Preece (da...@polymath.tech) wrote:

Hi,

On 23 June 2017 at 12:54:36 AM, Jerry Jelinek (jerry.jeli...@joyent.com) wrote:

> 1) In your zone you are trying to use a lot more physical memory than
> the limit you have set for the zone. The overall thrashing behavior you
> have described sounds like what would be expected in this case.

So, there's a lot I don't understand about SmartOS memory. If I set up 4-physical/8-swap and allocate 4GB inside the zone, it shows (under zonememstat) as being 50% full. I take this to mean that the maximum that can be allocated inside the zone is 8GB, and that the paging mechanism is responsible for deciding which part is physical and which part is on disk. Here's an example on a 32GB machine (with 64GB of NVMe swap):

[root@tiny ~]# zonememstat -t
                          ZONE  RSS(MB)  CAP(MB)  NOVER  POUT(MB)  SWAP%
                        global        0        -      -         -      -
    ctr-vKUWfLdjACa6aUt3fRTt5P        2     8192      0         0  87.57
    ctr-quLfQMoeND3yrnZj5aWgeL        2     8192      0         0  87.57
    ctr-5NtmqXi98tsFTqYBRNH3ZU        2     8192      0         0  87.57
    ctr-X33PVA7jEXVsVx3ZhT2sJ5        2     8192      0         0  87.57
                         total        8    32768      0         0      -

Each zone is running Debian (lx) and has 7GB allocated in it (by python3). The edited highlights of ps from the global zone:

root  4564  0.0 22.0 7369636 7347664 ?  S 05:15:02  0:03 python3 /incremental_alloc.py 7
root  4579  0.0 22.0 7369636 7347668 ?  S 05:15:09  0:03 python3 /incremental_alloc.py 7
root  4594  0.0 22.0 7369636 7347660 ?  S 05:15:15  0:04 python3 /incremental_alloc.py 7
root  4609  0.0 22.0 7369636 7347664 ?  S 05:15:19  0:04 python3 /incremental_alloc.py 7

...which disagrees with zonememstat on the RSS of each zone (7347664), but at least we can see the allocation.
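A minimal harness of this shape (a hypothetical sketch, not necessarily the script in the gist linked further down) produces exactly this kind of ps picture:

#!/usr/bin/env python3
# Hypothetical reconstruction of an incremental allocator; run as:
#   python3 incremental_alloc.py 7
import sys
import time

target_gb = int(sys.argv[1])
held = []
for _ in range(target_gb * 4):
    held.append(bytearray(256 * 1024 * 1024))  # 256MB, zero-filled
    time.sleep(0.1)
print("holding", target_gb, "GB")
time.sleep(3600)  # keep the RSS up so it shows in ps/zonememstat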
Swap agrees with ps:

[root@tiny ~]# swap -lh
swapfile                  dev   swaplo  blocks  free
/dev/zvol/dsk/zones/swap  90,1      4K     64G   64G

As does vmstat:

[root@tiny ~]# vmstat -S 1
 kthr      memory            page            disk          faults      cpu
 r b w   swap      free     si so pi  po fr de  sr   bk bk lf  rm   in   sy   cs us sy id
 0 0 0 78725148 11829532    0  0 182  0  0  0  2373  -0 20 37 -11 3039 5894 2034  0  1 98
 0 0 0 68523768  1342112    0  0   0  0  0  0     0   0  0  0   0 2735 1364  471  0  1 99
 0 0 0 68523688  1342032    0  0   0  0  0  0     0   0  0  0   0 2544  649  370  0  1 99
 0 0 0 68523688  1342032    0  0   0  0  0  0     0   0  6  0   0 2875 2008 1501  0  1 98

If I now allocate 1GB more in one of the zones...

[root@tiny ~]# vmstat -S 1
 kthr      memory            page            disk          faults      cpu
 r b w   swap      free     si so pi  po    fr     de      sr    bk bk lf  rm     in   sy     cs us sy id
 0 0 0 77583364 10655732    0  0 161   0     0      0    2107    -0 19 37 -11   2988 5306   1861  0  1 98
 0 0 0 68523744  1342084    0  0   0   0     0      0       0     0  0  0   0   2670  691    457  0  1 99
 0 0 0 68523664  1342004    0  0   0   0     0      0       0     0  8  0   0   2720  660   1230  0  1 99
 0 0 0 68523664  1342004    0  0   0   0     0      0       0     0  0  0   0   2633 2000    541  0  1 99
 0 0 0 68523664  1342004    0  0   0   0     0      0       0     0  0  0   0   2584  646    410  0  1 99
 0 0 0 68523664  1342004    0  0   0   0     0      0       0     0  0  0   0   2504  652    361  0  1 99
 0 0 0 68523664  1342004    0  0   0   0     0      0       0     0  0  0   0   2544  646    406  0  1 99
 0 0 0 68523664  1342004    0  0   0   0     0      0       0     0  0  0   0   2608 1311    498  0  1 99
 0 0 0 68523664  1342004    0  0   0   0     0      0       0     0  0  0   0   2661  646    431  0  1 99
 0 0 0 68523664  1342004    0  0   0   0     0      0       0     0  0  0   0   2482  653    339  0  1 99
 0 0 0 68523664  1342004    0  0   0   0     0      0       0     0  0  0   0   2556  646    409  0  1 99
 0 0 0 68523664  1342004    0  0   0   0     0      0       0     0  0  0   0   2660 1300    530  0  1 99
 0 0 0 68523664  1342004    0  0   0   0     0      0       0     0  0  0   0   2898 1211    593  2  4 94
 0 0 0 67857516   675848    0  0   0   0     0 522324 1219476     0  0  0   0   6026 2471   1707  1 12 86
 0 0 0 67472216   290548    0  0  80   0     0 470092 1026886     0  0  0  15 652865  650   2424  0 26 74
 0 0 0 67471628   289852    0  0   0   0  2652 423084 1109202     0  0  0   0 684690  649    919  0 29 71
 0 0 0 67471576   292452    0  0   0 74256 75940 380776 1476922   0 98  0   0 533368  646 431081  0 29 71
 0 0 0 67147676    42084    0  0   0   0  2712 342700 1429250     0  0  0   0   2906  659   1940  0 11 89
 0 0 0 67147440    44564    0  0   0   0 45422 308432 1426855     0  0  0   0   2615  630  23296  0 11 89

All hell breaks loose, and 25 seconds later the box locks up entirely (full trace at https://gist.github.com/RantyDave/218f8f3bab74fa623677b450f372286e):

 kthr      memory            page            disk          faults      cpu
 r b w   swap      free     si so pi  po fr    de      sr    bk bk lf rm   in  sy   cs us sy id
 0 0 6 67131172   119868    0  0   0   0  0 37512  804536     0  0  0  0 2488 596  722  0  9 91
 0 0 6 67131172   119868    0  0   0   0  0 33764  732708     0  0  0  0 2483 607 1307  0 10 90
 0 0 6 67131172   119868    0  0   0   0  0 30388  717851     0  0  0  0 2469 596  902  0  9 91

The machine will ping, and you can type into the console, but you can't log in. This actually went a lot less well than I thought it would. After power-cycling the machine there's nothing in /cores or /var/cores.
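Since the box only gives you about 25 seconds between the onset of paging and the lockup, watching vmstat from a remote shell is about the only warning available. A hypothetical sketch (assuming the illumos `vmstat -S` column layout shown above) that flags the onset:

#!/usr/bin/env python3
# Hypothetical sketch: run `vmstat -S 1` and flag samples where the page
# scan rate (sr) or the anticipated short-term memory deficit (de) goes
# nonzero, as seen in the output above just before the lockup.
import subprocess

proc = subprocess.Popen(["vmstat", "-S", "1"],
                        stdout=subprocess.PIPE, text=True)
for line in proc.stdout:
    fields = line.split()
    if len(fields) < 22 or not fields[0].lstrip("-").isdigit():
        continue  # skip the two header lines
    de, sr = int(fields[10]), int(fields[11])
    if de > 0 or sr > 0:
        print("paging pressure: de=%d sr=%d free=%sKB" % (de, sr, fields[4]))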
Edited highlights of the zone conf (from zonecfg info):

zonename: ctr-quLfQMoeND3yrnZj5aWgeL
zonepath: /zones/ctr-quLfQMoeND3yrnZj5aWgeL
brand: lx
limitpriv: default
scheduling-class:
ip-type: exclusive
hostid:
fs-allowed:
uuid: ac9b7109-536f-c632-dd37-8dddaab0cc4b
[max-lwps: 2000]
[max-shm-memory: 8G]
[max-shm-ids: 4096]
[max-msg-ids: 4096]
[max-sem-ids: 4096]
[cpu-shares: 100]
net:
        (snipped)
capped-memory:
        [physical: 8G]
        [swap: 8G]
        [locked: 8G]
(more snip)
attr:
        name: docker
        type: string
        value: true
attr:
        name: init-name
        type: string
        value: /native/usr/vm/sbin/dockerinit
attr:
        name: kernel-version
        type: string
        value: 3.16
rctl:
        name: zone.max-lwps
        value: (priv=privileged,limit=2000,action=deny)
rctl:
        name: zone.max-shm-memory
        value: (priv=privileged,limit=8589934592,action=deny)
rctl:
        name: zone.max-shm-ids
        value: (priv=privileged,limit=4096,action=deny)
rctl:
        name: zone.max-sem-ids
        value: (priv=privileged,limit=4096,action=deny)
rctl:
        name: zone.max-msg-ids
        value: (priv=privileged,limit=4096,action=deny)
rctl:
        name: zone.cpu-shares
        value: (priv=privileged,limit=100,action=none)
rctl:
        name: zone.max-physical-memory
        value: (priv=privileged,limit=8589934592,action=deny)
rctl:
        name: zone.max-swap
        value: (priv=privileged,limit=8589934592,action=deny)
rctl:
        name: zone.max-locked-memory
        value: (priv=privileged,limit=8589934592,action=deny)

Setting physical, swap and locked to the same value is what happens if you make a textbook lx zone using vmadm.

Oh, and I set the ARC to be tiny...

[root@tiny ~]# arcstat
    time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz      c
05:51:43     0     0      0     0    0     0    0     0    0    67M   512M

And rcapd wasn't running. Running the test again with rcapd, we get a better-looking zonememstat:

[root@tiny ~]# zonememstat
                          ZONE  RSS(MB)  CAP(MB)  NOVER  POUT(MB)  SWAP%
                        global      185        -      -         -      -
    ctr-wBHaBqWZvWFAp9huch5ioX     7185     8192      0         0  87.57
    ctr-FHTL4BQ9Wh63kFShD6M8L6     7181     8192      0         0  87.57
    ctr-AsFSSvYzomUFXJgTNSxkVH     7181     8192      0         0  87.57
    ctr-RZXUVpsgAzKKmakdPD3PZN     7181     8192      0         0  87.57

The output from ps is the same. The effect of allocating the last gig is the same, and because I left vfsstat running we get another smoking gun:

 r/s    w/s  kr/s     kw/s  ractv  wactv  read_t    writ_t  %r  %w  d/s  del_t  zone
34.6    1.3  15.3      0.3    0.0    0.8     2.0  643107.4   0  81  0.0    0.0  global (0)
 0.0  161.8   0.0  18877.8    0.0    0.1     0.0     581.8   0   9  0.0    0.0  ctr-wBHa (1)
 0.0    0.0   0.0      0.0    0.0    0.0     0.0       0.0   0   0  0.0    0.0  ctr-FHTL (2)
 0.0    0.0   0.0      0.0    0.0    0.0     0.0       0.0   0   0  0.0    0.0  ctr-AsFS (3)
 0.0    0.0   0.0      0.0    0.0    0.0     0.0       0.0   0   0  0.0    0.0  ctr-RZXU (4)

> 2) The process eventually terminates with a SIGBUS.

The first time through I used Alpine Linux; this one used Debian. What I was *expecting* was that either the allocation would return NULL (possibly causing the SIGBUS), a signal would be sent to the application (same result again), or a Linux-esque OOM killer would shoot it through the head. Either way, I would've thought that preventing an NGZ from taking down global would be job #1 :(

> I don't know if this is an issue with your application code or with our
> platform.

The application code is a test harness *just* for incrementally allocating memory: https://gist.github.com/RantyDave/c2322891f86f26f4696b3a8b3a478b62 (a sketch of how such a harness could also report which failure mode it hit follows at the end of this message).

> 3) The box eventually locks up. That is clearly our issue and is
> something we would want to investigate. Can you force a system dump and
> provide that to us? If you can't NMI your box when it is in this state,
> then you might be able to force a dump using DTrace.

Sorry, I really can't help you there.

-Dave
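As an aside on the failure modes above: in a Python harness like this, a failed allocation surfaces as MemoryError, while a SIGBUS kills the process with no chance to recover, though faulthandler can at least dump a traceback showing where it landed. A hypothetical sketch (not the code in the gist):

#!/usr/bin/env python3
# Hypothetical sketch: make both failure modes discussed above visible.
import faulthandler

# On SIGBUS/SIGSEGV, dump a Python traceback before the process dies,
# showing which allocation was being touched at the time.
faulthandler.enable()

held = []
gb = 0
try:
    while True:
        held.append(bytearray(2 ** 30))  # 1GB, zero-filled, so pages are touched
        gb += 1
except MemoryError:
    # The clean path: the allocation itself failed because no more swap
    # could be reserved.
    print("allocation failed cleanly after", gb, "GB")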