[Devel] Re: memrlimit controller merge to mainline
Hugh Dickins wrote:
>> but I do have an initial hypothesis
>>
>> CPU0 CPU1
>> try_to_unuse
>> task 1 starts exiting
>> look at mm = task1->mm
>> ..
>> increment mm_users
>> task 1 exits
>> mm->owner needs to be updated, but no new owner is found
>> (mm_users > 1, but no other task has task->mm = task1->mm)
>> mm_update_next_owner() leaves
>> grace period
>> user count drops, call mmput(mm)
>> task 1 freed
>> dereferencing mm->owner fails
>
> Yes, that looks right to me: seems obvious now. I don't think your
> careful alternation of CPU0/1 events at the end matters: the swapoff
> CPU simply dereferences mm->owner after that task has gone.
>
> (That's a shame, I'd always hoped that mm->owner->comm was going to
> be good for use in mm messages, even when tearing down the mm.)

Hi, Hugh,

I do have fixes for the problem above, but I've run into something
strange. When I create a new cgroup, set 500M as its limit and run
kernbench under it, I see a strange problem:

1. memrlimit determines that the limit is exceeded and fails the fork
   of the new process
2. The process that failed to fork encounters a page fault and faults
   in find_vma

I tried chasing the problem, but I am lost wondering how a page fault
(do_page_fault) can occur in a process that has not yet been created
and is going to fail with -ENOMEM. The interesting thing is that the
OOPS occurs in find_vma.

My trace so far:

limit exceeded
Pid: 3695, comm: sh Not tainted 2.6.27-rc1-mm1 #12
Call Trace:
 [802b0473] memrlimit_cgroup_charge_as+0x3a/0x3c
 [8023a82f] dup_mm+0xea/0x410
 [8023b648] copy_process+0xabe/0x12ef
 [8023c0df] do_fork+0x114/0x2d2
 [8025b42c] ? trace_hardirqs_on_caller+0xf9/0x124
 [8025b464] ? trace_hardirqs_on+0xd/0xf
 [805bda1f] ? _spin_unlock_irq+0x2b/0x30
 [805bd24e] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [8020bf4b] ? system_call_fastpath+0x16/0x1b
 [8020a44a] sys_clone+0x23/0x25
 [8020c2c7] ptregscall_common+0x67/0xb0
putting mm 88003d931400 3695 sh
copy_mm, retval -12
copy_process returning -12
copy_process returned fff4 -12
fork failed -12
general protection fault: [1] copy_process returned 880037a11600 -13194 0462029312 SMP
last sysfs file: /sys/block/sda/size
CPU 2
Modules linked in: coretemp hwmon kvm_intel kvm rtc_cmos rtc_core rtc_lib mptsas mptscsih mptbase scsi_transport_sas uhci_hcd ohci_hcd ehci_hcd
Pid: 3695, comm: sh Not tainted 2.6.27-rc1-mm1 #12
RIP: 0010:[802954f8] [802954f8] find_vma+0x2f/0x62
RSP: :88003544bee8 EFLAGS: 00010202
RAX: 6b6b6b6b6b6b6b6b RBX: RCX: 8800399e34d8
RDX: 8800399e34d8 RSI: 003a2729ad22 RDI: 88003e5c8500
RBP: 88003544bee8 R08: R09:
R10: 88003e5c8568 R11: 0246 R12: 003a2729ad22
R13: 0014 R14: 88003544bf58 R15: 88003e8bac00
FS: 2b3b978f3f50() GS:8800bfd954b0() knlGS:
CS: 0010 DS: ES: CR0: 8005003b
CR2: 003a2729ad22 CR3: 3549f000 CR4: 26e0
DR0: DR1: DR2:
DR3: DR6: 0ff0 DR7: 0400
Process sh (pid: 3695, threadinfo 88003544a000, task 88003e8bac00)
Stack: 88003544bf48 805bfec0 008cae50 88003e5c8560 88003e5c8500 00030001 7fff131e72c0 008cae50
Call Trace:
 [805bfec0] do_page_fault+0x36f/0x7ad
 [805bdd4d] error_exit+0x0/0xa9
Code: 85 ff 48 89 e5 74 55 eb 05 48 89 ca eb 47 48 8b 47 10 48 85 c0 74 0c 48 39 70 10 76 06 48 39 70 08 76 39 48 8b 47 08 31 d2 eb 1d 48 39 70 e0 48 8d 48 d0 76 0f 48 39 70 d8 76 ce 48 8b 40 10 48
RIP [802954f8] find_vma+0x2f/0x62
 RSP 88003544bee8
---[ end trace 89156336afdfaec3 ]---

I hope that I'll be able to think more clearly on Monday, but it's
hard to say :)

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers
___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel
[Devel] Re: memrlimit controller merge to mainline
Hugh Dickins wrote:
[snip]
> BUG: unable to handle kernel paging request at 6b6b6b8b
> IP: [7817078f] memrlimit_cgroup_uncharge_as+0x18/0x29
> *pde =
> Oops: [#1] PREEMPT SMP
> last sysfs file: /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map
> Modules linked in: acpi_cpufreq snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device thermal ac battery button
>
> Pid: 22500, comm: swapoff Not tainted (2.6.26-rc8-mm1 #7)
> EIP: 0060:[7817078f] EFLAGS: 00010206 CPU: 0
> EIP is at memrlimit_cgroup_uncharge_as+0x18/0x29
> EAX: 6b6b6b6b EBX: 7963215c ECX: 7c032000 EDX: 0025e000
> ESI: 96902518 EDI: 9fbb1aa0 EBP: 7c033e9c ESP: 7c033e9c
>  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> Process swapoff (pid: 22500, ti=7c032000 task=907e2b70 task.ti=7c032000)
> Stack: 7c033edc 78161323 9fbb1aa0 025e ff77 7c033ecc 96902518 7c033ec8 0089 7963215c 9fbb1aa0 9fbb1b28 a272f040 7c033ef4 781226b1 9fbb1aa0 9fbb1aa0 790fa884 a272f0c8 7c033f80 78165ce3
> Call Trace:
>  [78161323] ? exit_mmap+0xaf/0x133
>  [781226b1] ? mmput+0x4c/0xba
>  [78165ce3] ? try_to_unuse+0x20b/0x3f5
>  [78371534] ? _spin_unlock+0x22/0x3c
>  [7816636a] ? sys_swapoff+0x17b/0x37c
>  [78102d95] ? sysenter_past_esp+0x6a/0xa5
>  ===
> Code: 24 0c 00 00 8b 40 20 52 83 c0 0c 50 e8 ad a6 fd ff c9 c3 55 89 e5 8b 45 08 8b 55 0c 8b 80 30 02 00 00 c1 e2 0c 8b 80 24 0c 00 00 8b 40 20 52 83 c0 0c 50 e8 e6 a6 fd ff 58 5a c9 c3 55 89 e5 8b
> EIP: [7817078f] memrlimit_cgroup_uncharge_as+0x18/0x29 SS:ESP 0068:7c033e9c

Hi, Hugh,

I am unable to reproduce the problem, but I do have an initial
hypothesis

CPU0 CPU1
try_to_unuse
task 1 starts exiting
look at mm = task1->mm
..
increment mm_users
task 1 exits
mm->owner needs to be updated, but no new owner is found
(mm_users > 1, but no other task has task->mm = task1->mm)
mm_update_next_owner() leaves
grace period
user count drops, call mmput(mm)
task 1 freed
dereferencing mm->owner fails

I do have a potential solution in mind, but I want to make sure my
hypothesis is correct.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
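
A note for readers decoding the oops in this message: 0x6b is the POISON_FREE byte that slab debugging writes over freed objects, so a register full of 6b bytes is the usual sign of a pointer loaded from freed memory, which fits the hypothesis that the owner task had already been freed. The short Python sketch below only checks that pattern and the arithmetic relating EAX to the faulting address. The helper names are illustrative, not kernel APIs, and reading the instruction at +0x18 in the Code: bytes as `8b 40 20` (a load from offset 0x20 of EAX) is my interpretation of the dump rather than something stated in the thread.

```python
POISON_FREE = 0x6B  # byte value slab debugging uses to fill freed objects


def is_poisoned(value: int, width_bytes: int) -> bool:
    """Return True if every byte of a register value is the free-poison byte."""
    return value.to_bytes(width_bytes, "little") == bytes([POISON_FREE]) * width_bytes


def fault_offset(fault_addr: int, base_reg: int) -> int:
    """Offset of the faulting access relative to the base register."""
    return fault_addr - base_reg


# Values copied from the 32-bit swapoff oops quoted above.
eax = 0x6B6B6B6B
fault_addr = 0x6B6B6B8B

print(is_poisoned(eax, 4))                 # EAX is entirely poison bytes
print(hex(fault_offset(fault_addr, eax)))  # displacement of the failing load
```

The same check applies to RAX (6b6b6b6b6b6b6b6b, width 8) in the 64-bit find_vma trace earlier in the thread.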
[Devel] Re: memrlimit controller merge to mainline
On Tue, 5 Aug 2008, Balbir Singh wrote:
> Hugh Dickins wrote:
> [snip]
>> BUG: unable to handle kernel paging request at 6b6b6b8b
>> IP: [7817078f] memrlimit_cgroup_uncharge_as+0x18/0x29
>> Pid: 22500, comm: swapoff Not tainted (2.6.26-rc8-mm1 #7)
>>  [78161323] ? exit_mmap+0xaf/0x133
>>  [781226b1] ? mmput+0x4c/0xba
>>  [78165ce3] ? try_to_unuse+0x20b/0x3f5
>>  [78371534] ? _spin_unlock+0x22/0x3c
>>  [7816636a] ? sys_swapoff+0x17b/0x37c
>>  [78102d95] ? sysenter_past_esp+0x6a/0xa5
>
> I am unable to reproduce the problem,

Me neither, I've spent many hours trying 2.6.27-rc1-mm1 and then back
to 2.6.26-rc8-mm1. But I've been SO stupid: saw it originally on one
machine with SLAB_DEBUG=y, have been trying since mostly on another
with SLUB_DEBUG=y, but never thought to boot with
slub_debug=P,task_struct until now.

> but I do have an initial hypothesis
>
> CPU0 CPU1
> try_to_unuse
> task 1 starts exiting
> look at mm = task1->mm
> ..
> increment mm_users
> task 1 exits
> mm->owner needs to be updated, but no new owner is found
> (mm_users > 1, but no other task has task->mm = task1->mm)
> mm_update_next_owner() leaves
> grace period
> user count drops, call mmput(mm)
> task 1 freed
> dereferencing mm->owner fails

Yes, that looks right to me: seems obvious now. I don't think your
careful alternation of CPU0/1 events at the end matters: the swapoff
CPU simply dereferences mm->owner after that task has gone.

(That's a shame, I'd always hoped that mm->owner->comm was going to
be good for use in mm messages, even when tearing down the mm.)

> I do have a potential solution in mind, but I want to make sure my
> hypothesis is correct.

It seems wrong that memrlimit_cgroup_uncharge_as should be called
after mm->owner may have been changed, even if it's to something safe.
But I forget the mm/task exit details, surely they're tricky.

By the way, is the ordering in mm_update_next_owner the best? Would
there be less movement if it searched amongst siblings before it
searched amongst children? Ought it to make a first pass trying to
stay within the same cgroup?

Hugh
[Devel] Re: memrlimit controller merge to mainline
Hugh Dickins wrote:
> On Tue, 5 Aug 2008, Balbir Singh wrote:
>> Hugh Dickins wrote:
>> [snip]
>>> BUG: unable to handle kernel paging request at 6b6b6b8b
>>> IP: [7817078f] memrlimit_cgroup_uncharge_as+0x18/0x29
>>> Pid: 22500, comm: swapoff Not tainted (2.6.26-rc8-mm1 #7)
>>>  [78161323] ? exit_mmap+0xaf/0x133
>>>  [781226b1] ? mmput+0x4c/0xba
>>>  [78165ce3] ? try_to_unuse+0x20b/0x3f5
>>>  [78371534] ? _spin_unlock+0x22/0x3c
>>>  [7816636a] ? sys_swapoff+0x17b/0x37c
>>>  [78102d95] ? sysenter_past_esp+0x6a/0xa5
>>
>> I am unable to reproduce the problem,
>
> Me neither, I've spent many hours trying 2.6.27-rc1-mm1 and then back
> to 2.6.26-rc8-mm1. But I've been SO stupid: saw it originally on one
> machine with SLAB_DEBUG=y, have been trying since mostly on another
> with SLUB_DEBUG=y, but never thought to boot with
> slub_debug=P,task_struct until now.

Unfortunately, I've not tried on 32 bit and not at all with
SLAB_DEBUG=y. I'll give the latter a trial run and see what I get.

>> but I do have an initial hypothesis
>>
>> CPU0 CPU1
>> try_to_unuse
>> task 1 starts exiting
>> look at mm = task1->mm
>> ..
>> increment mm_users
>> task 1 exits
>> mm->owner needs to be updated, but no new owner is found
>> (mm_users > 1, but no other task has task->mm = task1->mm)
>> mm_update_next_owner() leaves
>> grace period
>> user count drops, call mmput(mm)
>> task 1 freed
>> dereferencing mm->owner fails
>
> Yes, that looks right to me: seems obvious now. I don't think your
> careful alternation of CPU0/1 events at the end matters: the swapoff
> CPU simply dereferences mm->owner after that task has gone.
>
> (That's a shame, I'd always hoped that mm->owner->comm was going to
> be good for use in mm messages, even when tearing down the mm.)

The problem we have is that tasks are independent of mm_struct's (in
some ways) and are associated almost like a database associates two
entities through keys.

>> I do have a potential solution in mind, but I want to make sure my
>> hypothesis is correct.
>
> It seems wrong that memrlimit_cgroup_uncharge_as should be called
> after mm->owner may have been changed, even if it's to something
> safe. But I forget the mm/task exit details, surely they're tricky.

The fix would be to uncharge when a new owner can no longer be found
(I am yet to code/test it though).

> By the way, is the ordering in mm_update_next_owner the best? Would
> there be less movement if it searched amongst siblings before it
> searched amongst children? Ought it to make a first pass trying to
> stay within the same cgroup?

Yes, we need to make a first pass at keeping it in the same cgroup.
You might be right about the sibling optimization.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
[Devel] Re: memrlimit controller merge to mainline
2008/7/25 Balbir Singh [EMAIL PROTECTED]:
> There are applications that can/need to handle overcommit, just that
> we are not aware of them fully. Immediately after our meeting, I was
> pointed to
> http://www.linuxfoundation.org/en/Carrier_Grade_Linux/Requirements_Alpha1#AVL.4.1_VM_Strict_Over-Commit

I need to get caught up on this thread, but I did promise Balbir at
the mini-summit that I would appear soon-ish with actual use-cases on
this from some of the CGL folks.

Specifically the case I was thinking of, other than the CGL
requirement for VM Strict Overcommit, was finer grained rlimit
accounting. It started out in the Collaboration Summit meeting in
Austin as a discussion about the SCOPE gaps document and CGOS-4.5
(curiously called Coarse Resource Enforcement, when it's really trying
to address per-thread limits). The full document is here in PDF form:
http://www.scope-alliance.org/pr/SCOPE_CGOS_GAPS_PROFILE_v2.pdf

I'm suspecting now, though, that after re-reading the requirement from
SCOPE and the memrlimit discussion, they may in fact be disjoint sets
of functionality.

-J.
[Devel] Re: memrlimit controller merge to mainline
On Fri, 25 Jul 2008 17:46:45 +0100 (BST) Hugh Dickins [EMAIL PROTECTED] wrote:
> IIRC Rik expressed the same by pointing out that a cgroup at its swap
> limit would then be forced to grow in mem (until it hits its mem
> limit): so controlling the less precious resource would increase
> pressure on the more precious resource. (Actually, that probably
> bears little relation to what he said - sorry, Rik!) I don't recall
> what answer he got, perhaps I'd be persuaded if I heard it again.

Added Nishimura to CC.

IMHO, from user point of view, both of
 - having 2 controls as mem controller + swap controller
 - mem + swap controller
doesn't have much difference. The users will use as they like.

From memory controller's point of view, treating mem+swap by the same
controller makes sense. Because memory controller can check whether we
can use more swap or not, we can avoid hopeless-scanning of Anon at
swap-shortage. (By split-lru, I think we can do this avoidance.)

Another-Topic?

In recent servers, memory is big, swap is (relatively) small. And
under memory resource controller, the whole swap is easily occupied by
a group. I want to avoid it.

For users, swap is not precious because it's not fast. But for memory
reclaiming, swap is precious resource to page out
anonymous/shmem/tmpfs memory.

I think usual system-admin considers swap as some emergency spare of
memory. I'd like to allow this emergency spare to each cgroup.
(For example, swap is used even if vm.swappiness==0. This is for
avoiding OOM-Killer under some situation, this behavior is added by
Rik.)

== following is another use case I explained to Rik at 23/May/08 ==

IIRC, a man showed his motivation to control swap in OLS2007/BOF as
following.

Consider following system. (and there is no swap controller.)
Memory 4G. Swap 1G. with 2 cgroups A, B.

state 1) swap is not used.
  A...memory limit to be 1G, no swap usage, memory_usage=0M
  B...memory limit to be 1G, no swap usage, memory_usage=0M

state 2) Run a big program on A.
  A...memory limit to be 1G and try to use 1.7G. uses 700MBytes of swap.
      memory_usage=1G swap_usage=700M
  B...memory_usage=0M

state 3) Some of the programs in 'A' end.
  A...memory_usage=500M swap_usage=700M
  B...memory_usage=0M.

state 4) Run a big program on B.
  A...memory_usage=500M swap_usage=700M.
  B...memory_usage=1G swap_usage=300M

Group B can only use 1.3G because of unfair swap use of group A.
But users think why A uses 700M of swap with 500M of free memory
==

Thanks,
-Kame
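
The four states in the use case above can be replayed with a few lines of arithmetic. This is a minimal Python sketch under stated assumptions, not a model of the kernel: `Group` and `run_program` are hypothetical names, swap is handed out first come, first served from one shared 1G pool, 1G is taken as 1000M as in the mail's own numbers, and memory freed when programs exit is assumed not to pull swapped pages back in.

```python
from dataclasses import dataclass

SWAP_TOTAL_MB = 1000  # "Swap 1G", with 1G = 1000M as in the mail's arithmetic


@dataclass
class Group:
    name: str
    mem_limit_mb: int
    mem_mb: int = 0
    swap_mb: int = 0

    @property
    def footprint_mb(self) -> int:
        return self.mem_mb + self.swap_mb


def run_program(group: Group, demand_mb: int, swap_free_mb: int) -> int:
    """Grow a group by demand_mb: fill RAM up to its limit, spill the rest
    into whatever shared swap is still free. Returns the remaining free swap."""
    to_mem = min(demand_mb, group.mem_limit_mb - group.mem_mb)
    to_swap = min(demand_mb - to_mem, swap_free_mb)
    group.mem_mb += to_mem
    group.swap_mb += to_swap
    return swap_free_mb - to_swap


a = Group("A", mem_limit_mb=1000)
b = Group("B", mem_limit_mb=1000)
swap_free = SWAP_TOTAL_MB                    # state 1: swap is not used

swap_free = run_program(a, 1700, swap_free)  # state 2: A tries to use 1.7G
a.mem_mb = 500                               # state 3: some programs in A end
swap_free = run_program(b, 1700, swap_free)  # state 4: B tries the same

print(a)
print(b, "footprint:", b.footprint_mb)
```

With these assumptions B ends up with 1000M in memory and only the 300M of swap that A left behind, which is the 1.3G figure in the mail.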
[Devel] Re: memrlimit controller merge to mainline
On Tue, 29 Jul 2008, KAMEZAWA Hiroyuki wrote:
> On Fri, 25 Jul 2008 17:46:45 +0100 (BST) Hugh Dickins [EMAIL PROTECTED] wrote:
>> IIRC Rik expressed the same by pointing out that a cgroup at its
>> swap limit would then be forced to grow in mem (until it hits its
>> mem limit): so controlling the less precious resource would increase
>> pressure on the more precious resource. (Actually, that probably
>> bears little relation to what he said - sorry, Rik!) I don't recall
>> what answer he got, perhaps I'd be persuaded if I heard it again.
>
> Added Nishimura to CC.
>
> IMHO, from user point of view, both of
>  - having 2 controls as mem controller + swap controller
>  - mem + swap controller
> doesn't have much difference. The users will use as they like.

I'm not suggesting either one of those alternatives. I'm suggesting we
have a mem controller (the thing we already have) and a mem+swap
controller (which we don't yet have: a controller for the total
mem+swap of a cgroup); the mem+swap controller likely making use of
much that is in the mem controller, as Paul has said. (Unfortunately I
don't have a good name for this mem+swap.)

I happen to believe that the mem+swap controller would actually be a
lot more useful than the current mem controller, and would expect many
to run with mem+swap controller enabled but mem controller disabled or
unlimited. How much is mem and how much is swap being left to global
reclaim to decide, not imposed by any cgroup policy.

What I don't like the sound of at all is a swap controller. Do you
think that a mem controller (limit 1G) and a mem+swap controller
(limit 2G) is equivalent to a mem controller (limit 1G) and a swap
controller (limit 1G)? No: imagine memory pressure from outside the
cgroup - with the mem+swap controller it can push as much as suits of
the 2G out to swap; whereas with the swap controller, once 1G is out,
it has to stop pushing any more of that cgroup out. I think that's
absurd - but perhaps I just haven't looked, and I've totally
misinterpreted the talk of a swap controller.

> From memory controller's point of view, treating mem+swap by the same
> controller makes sense. Because memory controller can check whether
> we can use more swap or not, we can avoid hopeless-scanning of Anon
> at swap-shortage. (By split-lru, I think we can do this avoidance.)

That's a detail I'm not concerned with on this level.

> Another-Topic?
>
> In recent servers, memory is big, swap is (relatively) small.

You'll know much more about those common proportions than I do. I'd
wonder why such big memory servers have any swap at all: to cope with
VM management defects we should be fixing?

> And under memory resource controller, the whole swap is easily
> occupied by a group. I want to avoid it.

Why? I presume because you're thinking it a precious resource. I
don't think its relative smallness makes it more precious.

> For users, swap is not precious because it's not fast.

Yes, and that's my view.

> But for memory reclaiming, swap is precious resource to page out
> anonymous/shmem/tmpfs memory.

I see that makes swap a useful resource, I don't see that it makes it
a precious resource. We page out to it precisely because it's less
precious than the memory; both users and kernel would much prefer to
keep all the data in memory, but sometimes there isn't enough memory
so we go to swap.

There is just one way in which I see swap as precious, and that is to
get around some VM management stupidity. If, for example, on i386
there's a shortage of lowmem and lots of anonymous in lowmem that we
should shift to highmem, then I think it's still the case that we have
to do that balancing via writing out to and reading in from swap,
because nobody has actually hooked up page migration to do that when
appropriate? But that's an argument for extending page migration, not
for needing a swap controller.

> I think usual system-admin considers swap as some emergency spare of
> memory.

Yes, I do too.

> I'd like to allow this emergency spare to each cgroup.

We do allow that emergency spare to each cgroup. Perhaps you're
saying you want to divide it up in advance between the cgroups? But
why? Sounds like a nice idea (reminds me of what Paul said about
using temporary files), but a solution to what problem?

> (For example, swap is used even if vm.swappiness==0. This is for
> avoiding OOM-Killer under some situation, this behavior is added by
> Rik.)

Sorry, I don't know what you're referring to there, but again, suspect
it's a detail we don't need to be concerned with here.

> == following is another use case I explained to Rik at 23/May/08 ==
>
> IIRC, a man showed his motivation to control swap in OLS2007/BOF as
> following.
>
> Consider following system. (and there is no swap controller.)
> Memory 4G. Swap 1G. with 2 cgroups A, B.
>
> state 1) swap is not used.
>   A...memory limit to be 1G, no swap usage, memory_usage=0M
>   B...memory limit to be 1G, no swap usage, memory_usage=0M
>
> state 2) Run a
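
The comparison Hugh draws in this message (mem limit 1G plus mem+swap limit 2G, against mem limit 1G plus swap limit 1G) comes down to how much of a cgroup global reclaim is allowed to push out when pressure comes from outside the cgroup. The Python sketch below is only an illustration of that argument: the function names are hypothetical, values are in MB, and it assumes a cgroup whose total footprint is the full 2G.

```python
def min_resident_with_memsw_limit(footprint_mb: int, memsw_limit_mb: int) -> int:
    """mem+swap controller: moving a page from RAM to swap leaves the
    mem+swap total unchanged, so outside pressure may push out as much
    of the cgroup as suits."""
    if footprint_mb > memsw_limit_mb:
        raise ValueError("footprint exceeds the mem+swap limit")
    return 0


def min_resident_with_swap_limit(footprint_mb: int, swap_limit_mb: int) -> int:
    """Separate swap controller: once swap_limit_mb of the cgroup is out,
    nothing more may be swapped, so the remainder must stay in RAM."""
    return max(0, footprint_mb - swap_limit_mb)


footprint = 2000  # hypothetical cgroup using 2G in total

print(min_resident_with_memsw_limit(footprint, memsw_limit_mb=2000))
print(min_resident_with_swap_limit(footprint, swap_limit_mb=1000))
```

Under the swap-limit model 1G of the cgroup stays pinned in memory however idle it is, which is the behaviour Hugh objects to.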
[Devel] Re: memrlimit controller merge to mainline
On Tue, Jul 29, 2008 at 5:31 PM, Hugh Dickins [EMAIL PROTECTED] wrote:
> I don't see that I'm denying you a way to guarantee that (though I've
> been thinking more of the limits than the guarantees): I'm not saying
> that you cannot have a mem controller, I'm saying that you can also
> have a mem+swap controller; but that a swap-by-itself controller
> makes no sense to me.

OK, fair enough.

> I think that works until you get to fork: shared files and
> private/anonymous/swap behave differently from then on.

Good point. It works as long as you never do a plain fork() without
immediate execve() though.

Paul
[Devel] Re: memrlimit controller merge to mainline
On Fri, 25 Jul 2008, Paul Menage wrote:
> On Fri, Jul 25, 2008 at 12:46 PM, Hugh Dickins [EMAIL PROTECTED] wrote:
>> No, I'm trying to say something stronger than that. I'm saying, as
>> I've said before, that I cannot imagine why anyone would want to
>> control swap itself - what they want to control is the total of
>> mem+swap. Swap is a second-class citizen, nobody wants swap if they
>> can have mem, so why control it separately?
>
> Scheduling jobs on to machines is much more straightforward when they
> request xGB of memory and yGB of swap rather than just (x+y)GB of
> (memory+swap). We want to be able to guarantee to jobs that they will
> be able to use xGB of real memory.

I don't see that I'm denying you a way to guarantee that (though I've
been thinking more of the limits than the guarantees): I'm not saying
that you cannot have a mem controller, I'm saying that you can also
have a mem+swap controller; but that a swap-by-itself controller makes
no sense to me.

> Actually my preferred approach to swap controlling would be something
> like:
>
> - allow malloc to support mmapping pages from a temporary file rather
>   than mmapping anonymous memory

I think that works until you get to fork: shared files and
private/anonymous/swap behave differently from then on.

Hugh
[Devel] Re: memrlimit controller merge to mainline
On Fri, 25 Jul 2008, Balbir Singh wrote:
> I see what you're saying. When you look at Linux right now, we
> control swap independent of memory, so I am not totally opposed to
> setting swap, instead of swap+mem. I might not want to swap from a
> particular cgroup, in which case, I set swap to 0 and risk OOMing,
> which might be an acceptable trade-off depending on my setup. I could
> easily change this policy on demand and add swap if OOMing was no
> longer OK.

It's taken me a while to understand your point. I think you're saying
that with a swap controller, you can set the swap limit to 0 on a
cgroup if you want to keep it entirely in memory, without setting any
mem limit upon it; whereas with my mem+swap controller, you'd have to
set a mem limit then an equal mem+swap limit to achieve the same
"never go to swap" effect, and maybe you don't want to set a mem
limit.

Hmm, but an unreachably high mem limit, and equal mem+swap limit,
would achieve that effect. Sorry, I don't think I have understood
(and even if the unreachably high limit didn't work, this seems more
about setting a don't-swap flag than imposing a swap limit).

Hugh
[Devel] Re: memrlimit controller merge to mainline
On Wed, 30 Jul 2008 01:16:17 +0100 (BST) Hugh Dickins [EMAIL PROTECTED] wrote:
> On Tue, 29 Jul 2008, KAMEZAWA Hiroyuki wrote:
>> On Fri, 25 Jul 2008 17:46:45 +0100 (BST) Hugh Dickins [EMAIL PROTECTED] wrote:
>>> IIRC Rik expressed the same by pointing out that a cgroup at its
>>> swap limit would then be forced to grow in mem (until it hits its
>>> mem limit): so controlling the less precious resource would
>>> increase pressure on the more precious resource. (Actually, that
>>> probably bears little relation to what he said - sorry, Rik!) I
>>> don't recall what answer he got, perhaps I'd be persuaded if I
>>> heard it again.
>>
>> Added Nishimura to CC.
>>
>> IMHO, from user point of view, both of
>>  - having 2 controls as mem controller + swap controller
>>  - mem + swap controller
>> doesn't have much difference. The users will use as they like.
>
> I'm not suggesting either one of those alternatives. I'm suggesting
> we have a mem controller (the thing we already have) and a mem+swap
> controller (which we don't yet have: a controller for the total
> mem+swap of a cgroup); the mem+swap controller likely making use of
> much that is in the mem controller, as Paul has said.

Ah, what mem+swap controller means is limiting mem+swap by 'a' limit ?
It's a choice for me. From view of global LRU management, it's better.
If we can avoid an accident that the swap is fully used by some silly
program, anything is ok to me. How about you, Nishimura-san ?

A story I talked is based on the assumption that there may be not
enough swap space against memory. We can ask customers to equip tons
of swap when memory is huge. BTW, what is the maximum swap size now ?
Can we extend it if it's small ?

[snip]

>> state 4) Run a big program on B.
>>   A...memory_usage=500M swap_usage=700M.
>>   B...memory_usage=1G swap_usage=300M
>
> If you believe a swap controller would make that better, what limits
> do you suggest? If you assign A a swap limit of 700M or above, it
> changes nothing; if you assign A a swap limit below 700M, it cannot
> do all the work that it could do in the example.

Of course, set A's swap_limit of 300M and get swap pages into memory
and free swap entries and make A on memory. (before B starts.)

>> But users think why A uses 700M of swap with 500M of free memory
>
> Because at this time A isn't actively using any of that 700M.

That's a weakness of "do all by automatic detection and ideal
algorithm". It's just a result of LRU algorithm, which is not always
what the users think ideal.

Thanks,
-Kame
[Devel] Re: memrlimit controller merge to mainline
On Wed, 30 Jul 2008 10:17:19 +0900 KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
> On Wed, 30 Jul 2008 01:16:17 +0100 (BST) Hugh Dickins [EMAIL PROTECTED] wrote:
>> On Tue, 29 Jul 2008, KAMEZAWA Hiroyuki wrote:
>>> On Fri, 25 Jul 2008 17:46:45 +0100 (BST) Hugh Dickins [EMAIL PROTECTED] wrote:
>>>> IIRC Rik expressed the same by pointing out that a cgroup at its
>>>> swap limit would then be forced to grow in mem (until it hits its
>>>> mem limit): so controlling the less precious resource would
>>>> increase pressure on the more precious resource. (Actually, that
>>>> probably bears little relation to what he said - sorry, Rik!) I
>>>> don't recall what answer he got, perhaps I'd be persuaded if I
>>>> heard it again.
>>>
>>> Added Nishimura to CC.
>>>
>>> IMHO, from user point of view, both of
>>>  - having 2 controls as mem controller + swap controller
>>>  - mem + swap controller
>>> doesn't have much difference. The users will use as they like.
>>
>> I'm not suggesting either one of those alternatives. I'm suggesting
>> we have a mem controller (the thing we already have) and a mem+swap
>> controller (which we don't yet have: a controller for the total
>> mem+swap of a cgroup); the mem+swap controller likely making use of
>> much that is in the mem controller, as Paul has said.
>
> Ah, what mem+swap controller means is limiting mem+swap by 'a' limit ?
> It's a choice for me. From view of global LRU management, it's
> better. If we can avoid an accident that the swap is fully used by
> some silly program, anything is ok to me.

Hmm.

mem+swap controller means a shrink to memory resource controller
(try_to_free_mem_cgroup_pages()) should drop only file caches.
(Because kick-out-to-swap will never change the usage.) right ?

only global-lru can make a swap. maybe I can add optimization to do
this. Hmm. I should see how OOM works under some situation.

Thanks,
-Kame
[Devel] Re: memrlimit controller merge to mainline
On Wed, 30 Jul 2008 11:52:26 +0900 KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
> mem+swap controller means a shrink to memory resource controller
> (try_to_free_mem_cgroup_pages()) should drop only file caches.
> (Because kick-out-to-swap will never change the usage.) right ?
>
> only global-lru can make a swap. maybe I can add optimization to do
> this. Hmm. I should see how OOM works under some situation.

(I'm sorry that I'm not a good writer of e-mail.)

A brief summary about changes to mem controller.

 - mem+swap controller which limits the # sum of pages and
   swap_entries.
 - mem+swap controller just drops file caches when it reaches limit.
 - under mem+swap controller, reclaiming Anon pages makes no sense.
   Then,
   - LRU for Anon is not necessary.
   - LRU for tmpfs/shmem is not necessary.
   just showing account is better.
 - we should see try_to_free_mem_cgroup() again to avoid too much OOM.
   Maybe Retries=5 is too small because we never do swap under us.
   a problem like stuck-into-ext3-journal can easily make file-cache
   reclaim difficult.
 - need some changes to documentation.
 - Should we have on/off switch of taking swap into account ? or
   should we implement mem+swap controller in different name than
   memory controller ? If swap is not accounted, we need to do
   swap-out in memory reclaiming path, again.

Thanks,
-Kame
[Devel] Re: memrlimit controller merge to mainline
On Wed, 30 Jul 2008 12:11:15 +0900 KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
> On Wed, 30 Jul 2008 11:52:26 +0900 KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
>> mem+swap controller means a shrink to memory resource controller
>> (try_to_free_mem_cgroup_pages()) should drop only file caches.
>> (Because kick-out-to-swap will never change the usage.) right ?
>>
>> only global-lru can make a swap. maybe I can add optimization to do
>> this. Hmm. I should see how OOM works under some situation.
>
> (I'm sorry that I'm not a good writer of e-mail.)
>
> A brief summary about changes to mem controller.
>
>  - mem+swap controller which limits the # sum of pages and
>    swap_entries.
>  - mem+swap controller just drops file caches when it reaches limit.
>  - under mem+swap controller, reclaiming Anon pages makes no sense.
>    Then,
>    - LRU for Anon is not necessary.
>    - LRU for tmpfs/shmem is not necessary.
>    just showing account is better.
>  - we should see try_to_free_mem_cgroup() again to avoid too much OOM.
>    Maybe Retries=5 is too small because we never do swap under us.
>    a problem like stuck-into-ext3-journal can easily make file-cache
>    reclaim difficult.
>  - need some changes to documentation.
>  - Should we have on/off switch of taking swap into account ? or
>    should we implement mem+swap controller in different name than
>    memory controller ? If swap is not accounted, we need to do
>    swap-out in memory reclaiming path, again.

Then, mem+swap controller finally means
 - under mem+swap controller, program works with no swap. Only global
   LRU may make pages swapped-out.
 - If swap-accounting-mode is off, swap can be used unlimitedly.

Hmm, sounds a bit different from what I want. How about others ?

Thanks,
-Kame

> Thanks,
> -Kame
[Devel] Re: memrlimit controller merge to mainline
On Wed, 30 Jul 2008 13:14:07 +0900, KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote: On Wed, 30 Jul 2008 12:11:15 +0900 KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote: On Wed, 30 Jul 2008 11:52:26 +0900 KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote: mem+swap controller means a shrink to memory resource controller (try_to_free_mem_cgroup_pages()) should drop only file caches. (Because kick-out-to-swap will never changes the usage.) right ? only global-lru can make a swap. maybe I can add optimization to do this. Hmm. I should see how OOM works under some situation. I'm thinking mem+swap controller in a different way: an add-on to mem controller, just as current swap controller. I mean adding memory.(mem+swap)_limit. (I'm sorry that I'm not a good writer of e-mail.) A brief summary about changes to mem controller. - mem+swap controller which limits the # sum of pages and swap_entries. - mem+swap controller just drops file caches when it reaches limit. - under mem+swap controller, recaliming Anon pages make no sense. Then, - LRU for Anon is not necessary. - LRU for tmpfs/shmem is not necessary. just showing account is better. - we should see try_to_free_mem_cgroup() again to avoid too much OOM. Maybe Retries=5 is too small because we never do swap under us. a problem like struck-into-ext3-journal can easily make file-cache reclaim difficult. - need some changes to documentation. - Should we have on/off switch of taking swap into account ? or should we implement mem+swap contoller in different name than memory controller ? If swap is not accounted, we need to do swap-out in memory reclaiming path, again. Then, mem+swap controller finally means - under mem+swap controller, program works with no swap. Only global LRU may make pages swapped-out. - If swap-accounting-mode is off, swap can be used unlimitedly. Hmm, sounds a bit differenct from what I want. How about others ? Thanks, Daisuke Nishimura. 
[Devel] Re: memrlimit controller merge to mainline
On Wed, 30 Jul 2008 13:58:03 +0900 Daisuke Nishimura [EMAIL PROTECTED] wrote:
On Wed, 30 Jul 2008 13:14:07 +0900, KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
On Wed, 30 Jul 2008 12:11:15 +0900 KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
On Wed, 30 Jul 2008 11:52:26 +0900 KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:

A mem+swap controller means that shrinking the memory resource controller (try_to_free_mem_cgroup_pages()) should drop only file caches, because kicking pages out to swap never changes the usage. Right? Only the global LRU can cause swap-out. Maybe I can add an optimization to do this.

Hmm. I should see how OOM works in some situations.

I'm thinking of the mem+swap controller in a different way: as an add-on to the mem controller, just like the current swap controller. I mean adding memory.(mem+swap)_limit.

Hmm? You mean adding a control file other than memory.limit_in_bytes?

Thanks,
-Kame
[Devel] Re: memrlimit controller merge to mainline
Sorry for the many mails ;(

I think I misunderstood something... Is the following right?

A brief summary of the changes to the memory controller:
- memory.limit_in_bytes works as it does now.
- A new parameter, memory.limit_in_bytes_includes_swap, will be added.
  + memory.limit_in_bytes_includes_swap controls the total amount of RAM + swap.
  + memory.limit_in_bytes <= memory.limit_in_bytes_includes_swap

As a result:
- The memory controller works as it does now but doesn't use too much swap.
- The global LRU cannot be affected by the controller's parameters.

Hmm, seems reasonable. A minor problem is how to handle the two counts/limits.

BTW, does anyone have good names? For example:
memory.memory_limits_in_bytes (for accounting memory)
memory.total_limits_in_bytes (for accounting memory+swap)

Thanks,
-Kame

On Wed, 30 Jul 2008 12:11:15 +0900 KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:

A brief summary of the changes to the mem controller:
- The mem+swap controller limits the sum of pages and swap entries.
- The mem+swap controller just drops file caches when it reaches its limit.
- Under the mem+swap controller, reclaiming anon pages makes no sense. Then,
  - an LRU for anon is not necessary.
  - an LRU for tmpfs/shmem is not necessary; just showing the accounting is better.
- We should look at try_to_free_mem_cgroup() again to avoid too many OOMs. Maybe Retries=5 is too small, because we never swap out ourselves. A problem like getting stuck in the ext3 journal can easily make file-cache reclaim difficult.
- The documentation needs some changes.
- Should we have an on/off switch for taking swap into account? Or should we implement the mem+swap controller under a different name from the memory controller? If swap is not accounted, we need to do swap-out in the memory reclaim path again.

Thanks,
-Kame
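The two-limit proposal can be sketched as a small model, assuming one limit on RAM and one on RAM + swap. The class below is hypothetical and deliberately ignores file cache and reclaim ordering; it only shows how the two counters would interact, and why the RAM limit must not exceed the RAM + swap limit.

```python
class TwoLimitGroup:
    """Toy model of a cgroup with a RAM limit and a RAM+swap limit.

    Names are illustrative only; this is not the kernel interface.
    """

    def __init__(self, mem_limit: int, total_limit: int):
        if mem_limit > total_limit:
            raise ValueError("RAM limit must not exceed the RAM+swap limit")
        self.mem_limit = mem_limit
        self.total_limit = total_limit
        self.mem = 0    # pages resident in RAM
        self.swap = 0   # swap entries in use

    def charge(self, pages: int) -> bool:
        """Charge new pages. Returns False where the group would OOM."""
        if self.mem + self.swap + pages > self.total_limit:
            return False              # swapping out cannot lower RAM+swap
        self.mem += pages
        overflow = self.mem - self.mem_limit
        if overflow > 0:              # over the RAM limit: push excess to swap
            self.mem -= overflow
            self.swap += overflow
        return True


g = TwoLimitGroup(mem_limit=100, total_limit=150)
first = g.charge(80)     # fits in RAM
second = g.charge(40)    # 20 pages spill over to swap
third = g.charge(40)     # would exceed RAM+swap, so it is refused
```

With these numbers the group ends at 100 pages in RAM and 20 swap entries, and the third charge fails because no amount of swap-out can bring RAM + swap back under 150.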
[Devel] Re: memrlimit controller merge to mainline
On Fri, 25 Jul 2008 04:14:55 -0400 Paul Menage [EMAIL PROTECTED] wrote:

Hi Balbir, Andrew included the memrlimit controller in his latest set of patches to Linus for mainline.

I've asked Linus to drop all 238 patches. I'll be resending them minus the offending memrlimit patches.

Did I mention that conferences suck?
[Devel] Re: memrlimit controller merge to mainline
On Fri, 25 Jul 2008, Paul Menage wrote: So I think we'd be complicating some of the vm paths in mainline with a feature that isn't likely to get a lot of real use. What do you (and others on the containers list) think? Should we ask Andrew/Linus to hold off on this for now? My preference would be to do that until we have someone who can stand up with a concrete scenario where they want to use this in a real environment. I see Andrew has already acted, so it's now moot. But I'd like to say that I do agree with you and the conclusion to hold off for now. I was a bit alarmed earlier to see those patches sailing on through; but realized that I'd done very little to substantiate my hatred of the whole thing, and decided that I didn't feel strongly enough to stand in the way now. But I am glad you've stepped in, thank you. (Different topic, but one day I ought to get around to saying again how absurd I think a swap controller; whereas a mem+swap controller makes plenty of sense. I think Rik and others said the same.) By the way, here's a BUG I got from CONFIG_CGROUP_MEMRLIMIT_CTLR=y but no use of it, when doing swapoff a week ago. Not investigated at all, I'm afraid, but at a guess it might come from memrlimit work placing too much faith in the mm_users count - swapoff is only one of several places which have to inc/dec mm_users for some reason. 
BUG: unable to handle kernel paging request at 6b6b6b8b
IP: [7817078f] memrlimit_cgroup_uncharge_as+0x18/0x29
*pde =
Oops: [#1] PREEMPT SMP
last sysfs file: /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map
Modules linked in: acpi_cpufreq snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device thermal ac battery button
Pid: 22500, comm: swapoff Not tainted (2.6.26-rc8-mm1 #7)
EIP: 0060:[7817078f] EFLAGS: 00010206 CPU: 0
EIP is at memrlimit_cgroup_uncharge_as+0x18/0x29
EAX: 6b6b6b6b EBX: 7963215c ECX: 7c032000 EDX: 0025e000
ESI: 96902518 EDI: 9fbb1aa0 EBP: 7c033e9c ESP: 7c033e9c
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process swapoff (pid: 22500, ti=7c032000 task=907e2b70 task.ti=7c032000)
Stack: 7c033edc 78161323 9fbb1aa0 025e ff77 7c033ecc 96902518 7c033ec8 0089 7963215c 9fbb1aa0 9fbb1b28 a272f040 7c033ef4 781226b1 9fbb1aa0 9fbb1aa0 790fa884 a272f0c8 7c033f80 78165ce3
Call Trace:
[78161323] ? exit_mmap+0xaf/0x133
[781226b1] ? mmput+0x4c/0xba
[78165ce3] ? try_to_unuse+0x20b/0x3f5
[78371534] ? _spin_unlock+0x22/0x3c
[7816636a] ? sys_swapoff+0x17b/0x37c
[78102d95] ? sysenter_past_esp+0x6a/0xa5
===
Code: 24 0c 00 00 8b 40 20 52 83 c0 0c 50 e8 ad a6 fd ff c9 c3 55 89 e5 8b 45 08 8b 55 0c 8b 80 30 02 00 00 c1 e2 0c 8b 80 24 0c 00 00 8b 40 20 52 83 c0 0c 50 e8 e6 a6 fd ff 58 5a c9 c3 55 89 e5 8b
EIP: [7817078f] memrlimit_cgroup_uncharge_as+0x18/0x29 SS:ESP 0068:7c033e9c

Hugh
[Devel] Re: memrlimit controller merge to mainline
Paul Menage wrote: Hi Balbir, Andrew included the memrlimit controller in his latest set of patches to Linus for mainline. Although the memrlimit controller basically works as intended, my impression from the mini-summit on Tuesday is that our consensus is that this still doesn't have concrete practical use-cases yet: - avoiding swap over-use is better handled by the forthcoming swap controller - applications that can usefully handle mmap() returning NULL don't really exist yet (and since the system as a whole allows address space overcommit limits, if it was practical/useful to write such apps then presumably they would already exist) There are applications that can/need to handle overcommit, just that we are not aware of them fully. Immediately after our meeting, I was pointed to http://www.linuxfoundation.org/en/Carrier_Grade_Linux/Requirements_Alpha1#AVL.4.1_VM_Strict_Over-Commit So I think we'd be complicating some of the vm paths in mainline with a feature that isn't likely to get a lot of real use. I did disagree in the meeting and there is also the use case of the feature forming the infrastructure for other rlimit controllers. What do you (and others on the containers list) think? Should we ask Andrew/Linus to hold off on this for now? My preference would be to do that until we have someone who can stand up with a concrete scenario where they want to use this in a real environment. While we can argue about use cases, the feature needs more testing and I am OK holding off/reverting the merge to make it more stable and that would give us more time to argue on its usefulness. To say that overcommit handling is not useful is wrong. Meanwhile, I'll go back and look at the bug report that Hugh has posted and also look at building an mlock controller on top of memrlimits. 
-- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL
[Devel] Re: memrlimit controller merge to mainline
Andrew Morton wrote:
On Fri, 25 Jul 2008 04:14:55 -0400 Paul Menage [EMAIL PROTECTED] wrote:

Hi Balbir, Andrew included the memrlimit controller in his latest set of patches to Linus for mainline.

I've asked Linus to drop all 238 patches. I'll be resending them minus the offending memrlimit patches.

Sorry for making your work harder.

Did I mention that conferences suck?

Not yet, but we know now :)

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
[Devel] Re: memrlimit controller merge to mainline
Hugh Dickins wrote: On Fri, 25 Jul 2008, Paul Menage wrote: So I think we'd be complicating some of the vm paths in mainline with a feature that isn't likely to get a lot of real use. What do you (and others on the containers list) think? Should we ask Andrew/Linus to hold off on this for now? My preference would be to do that until we have someone who can stand up with a concrete scenario where they want to use this in a real environment. I see Andrew has already acted, so it's now moot. But I'd like to say that I do agree with you and the conclusion to hold off for now. I was a bit alarmed earlier to see those patches sailing on through; but realized that I'd done very little to substantiate my hatred of the whole thing, and decided that I didn't feel strongly enough to stand in the way now. But I am glad you've stepped in, thank you. (Different topic, but one day I ought to get around to saying again how absurd I think a swap controller; whereas a mem+swap controller makes plenty of sense. I think Rik and others said the same.) We will have a memory+swap controller working together. By the way, here's a BUG I got from CONFIG_CGROUP_MEMRLIMIT_CTLR=y but no use of it, when doing swapoff a week ago. Not investigated at all, I'm afraid, but at a guess it might come from memrlimit work placing too much faith in the mm_users count - swapoff is only one of several places which have to inc/dec mm_users for some reason. I'll try and reproduce the problem right away. I've been running some kernbench on top of memrlimit (but not with a lot of stress or trying to swapoff the swap device). 
BUG: unable to handle kernel paging request at 6b6b6b8b
IP: [7817078f] memrlimit_cgroup_uncharge_as+0x18/0x29
*pde =
Oops: [#1] PREEMPT SMP
last sysfs file: /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map
Modules linked in: acpi_cpufreq snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device thermal ac battery button
Pid: 22500, comm: swapoff Not tainted (2.6.26-rc8-mm1 #7)
EIP: 0060:[7817078f] EFLAGS: 00010206 CPU: 0
EIP is at memrlimit_cgroup_uncharge_as+0x18/0x29
EAX: 6b6b6b6b EBX: 7963215c ECX: 7c032000 EDX: 0025e000
ESI: 96902518 EDI: 9fbb1aa0 EBP: 7c033e9c ESP: 7c033e9c
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process swapoff (pid: 22500, ti=7c032000 task=907e2b70 task.ti=7c032000)
Stack: 7c033edc 78161323 9fbb1aa0 025e ff77 7c033ecc 96902518 7c033ec8 0089 7963215c 9fbb1aa0 9fbb1b28 a272f040 7c033ef4 781226b1 9fbb1aa0 9fbb1aa0 790fa884 a272f0c8 7c033f80 78165ce3
Call Trace:
[78161323] ? exit_mmap+0xaf/0x133
[781226b1] ? mmput+0x4c/0xba
[78165ce3] ? try_to_unuse+0x20b/0x3f5
[78371534] ? _spin_unlock+0x22/0x3c
[7816636a] ? sys_swapoff+0x17b/0x37c
[78102d95] ? sysenter_past_esp+0x6a/0xa5
===
Code: 24 0c 00 00 8b 40 20 52 83 c0 0c 50 e8 ad a6 fd ff c9 c3 55 89 e5 8b 45 08 8b 55 0c 8b 80 30 02 00 00 c1 e2 0c 8b 80 24 0c 00 00 8b 40 20 52 83 c0 0c 50 e8 e6 a6 fd ff 58 5a c9 c3 55 89 e5 8b
EIP: [7817078f] memrlimit_cgroup_uncharge_as+0x18/0x29 SS:ESP 0068:7c033e9c

Hugh

I'll try and recreate the problem and fix it. If memrlimit_cgroup_uncharge_as() created the problem, it's most likely related to mm->owner not being correct and we are dereferencing the wrong memory. Thanks for the bug report, I'll look further.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
[Devel] Re: memrlimit controller merge to mainline
On Fri, Jul 25, 2008 at 5:06 AM, Hugh Dickins [EMAIL PROTECTED] wrote:

(Different topic, but one day I ought to get around to saying again how absurd I think a swap controller; whereas a mem+swap controller makes plenty of sense. I think Rik and others said the same.)

Agreed that a swap controller without a memory controller doesn't make much sense, but a memory controller without a swap controller can make sense on machines that don't intend to use swap.

So if they were separate controllers, we'd use the proposed cgroup dependency features to make the swap controller depend on the memory controller - in which case you'd only be able to mount the swap controller on a hierarchy that also had the memory controller, and the swap controller would be able to make use of the page ownership information.

It's more of a modularity issue than a functionality issue, I think - the swap controller and memory controller are tracking fundamentally different things (space on disk versus pages in memory), and the only dependency between them is the memory controller tracking the ownership of a page and providing it to the swap controller.

By the way, here's a BUG I got from CONFIG_CGROUP_MEMRLIMIT_CTLR=y but no use of it, when doing swapoff a week ago. Not investigated at all, I'm afraid, but at a guess it might come from memrlimit work placing too much faith in the mm_users count - swapoff is only one of several places which have to inc/dec mm_users for some reason.

BUG: unable to handle kernel paging request at 6b6b6b8b

Possibly the mm->owner tracking breaks in that case, if the last user exits while swapoff is occurring without relinquishing ownership? That looks as though mm->owner points to a task that had been poisoned after being freed. That could be awkward to fix :-(

Paul
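The "poisoned after being freed" reading can be checked with a little arithmetic on the values in the oops. The sketch below assumes slab poisoning was enabled, in which case freed objects are filled with the byte 0x6b (POISON_FREE in include/linux/poison.h), and it reads the instruction bytes "8b 40 20" at the faulting offset as mov 0x20(%eax),%eax. The helper names are made up for illustration.

```python
POISON_FREE = 0x6B  # byte written over freed slab objects when poisoning is on


def poisoned_pointer(width_bytes: int = 4) -> int:
    """Value seen when a pointer-sized field is loaded from poisoned memory."""
    return int.from_bytes(bytes([POISON_FREE] * width_bytes), "little")


def fault_address(base: int, displacement: int) -> int:
    """Address touched by a load of the form mov disp(%reg),%reg."""
    return base + displacement


eax = poisoned_pointer()           # 32-bit kernel, so four poison bytes
addr = fault_address(eax, 0x20)    # displacement from the bytes "8b 40 20"
print(f"EAX={eax:08x} fault={addr:08x}")   # EAX=6b6b6b6b fault=6b6b6b8b
```

Both numbers match the report (EAX: 6b6b6b6b, paging request at 6b6b6b8b), which is consistent with a pointer having been loaded from an already-freed object and then dereferenced. Which structure that object was is not something this arithmetic can show.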
[Devel] Re: memrlimit controller merge to mainline
On Fri, Jul 25, 2008 at 8:30 AM, Balbir Singh [EMAIL PROTECTED] wrote: There are applications that can/need to handle overcommit, just that we are not aware of them fully. Immediately after our meeting, I was pointed to http://www.linuxfoundation.org/en/Carrier_Grade_Linux/Requirements_Alpha1#AVL.4.1_VM_Strict_Over-Commit Thanks, that'll be interesting to take a look at. So I think we'd be complicating some of the vm paths in mainline with a feature that isn't likely to get a lot of real use. I did disagree in the meeting Yes, but (my impression of) the overall feeling in the meeting was that it wasn't yet the right time to push it to mainline. and there is also the use case of the feature forming the infrastructure for other rlimit controllers. Agreed, but that's something for the future. Paul ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: memrlimit controller merge to mainline
Paul Menage wrote: On Fri, Jul 25, 2008 at 8:30 AM, Balbir Singh [EMAIL PROTECTED] wrote: There are applications that can/need to handle overcommit, just that we are not aware of them fully. Immediately after our meeting, I was pointed to http://www.linuxfoundation.org/en/Carrier_Grade_Linux/Requirements_Alpha1#AVL.4.1_VM_Strict_Over-Commit Thanks, that'll be interesting to take a look at. So I think we'd be complicating some of the vm paths in mainline with a feature that isn't likely to get a lot of real use. I did disagree in the meeting Yes, but (my impression of) the overall feeling in the meeting was that it wasn't yet the right time to push it to mainline. Yes! I need to test it more and I'll focus more on that front. and there is also the use case of the feature forming the infrastructure for other rlimit controllers. Agreed, but that's something for the future. I'll work on the mlock controller and post that as well. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: memrlimit controller merge to mainline
On Fri, 25 Jul 2008, Paul Menage wrote: On Fri, Jul 25, 2008 at 5:06 AM, Hugh Dickins [EMAIL PROTECTED] wrote: (Different topic, but one day I ought to get around to saying again how absurd I think a swap controller; whereas a mem+swap controller makes plenty of sense. I think Rik and others said the same.) Agreed that a swap controller without a memory controller doesn't make much sense, but a memory controller without a swap controller can make sense on machines that don't intend to use swap. I agree that a memory controller without a swap controller can make sense: I hope so, anyway, since that's what's in mainline. Even if swap is used, memory is a more precious resource than swap, and you were right to go about controlling memory first. So if they were separate controllers, we'd use the proposed cgroup dependency features to make the swap controller depend on the memory controller - in which case you'd only be able to mount the swap controller on a hierarchy that also had the memory controller, and the swap controller would be able to make use of the page ownership information. It's more of a modularity issue than a functionality issue, I think - the swap controller and memory controller are tracking fundamentally different things (space on disk versus pages in memory), and the only dependency between them is the memory controller tracking the ownership of a page and providing it to the swap controller. It sounds as if you're interpreting my mem+swap controller as a mem controller and a swap controller and the swap controller makes use of some of the mem controller infrastructure. No, I'm trying to say something stronger than that. I'm saying, as I've said before, that I cannot imagine why anyone would want to control swap itself - what they want to control is the total of mem+swap. Swap is a second-class citizen, nobody wants swap if they can have mem, so why control it separately? 
IIRC Rik expressed the same by pointing out that a cgroup at its swap limit would then be forced to grow in mem (until it hits its mem limit): so controlling the less precious resource would increase pressure on the more precious resource. (Actually, that probably bears little relation to what he said - sorry, Rik!) I don't recall what answer he got, perhaps I'd be persuaded if I heard it again. Hugh ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
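The point attributed to Rik can be illustrated with a toy function, assuming a cgroup that has a limit on swap alone. The function name and parameters are hypothetical; it only models how many pages a swap-out attempt can actually move.

```python
def swap_out_with_swap_limit(mem: int, swap: int, swap_limit: int, want: int):
    """Try to move `want` pages from RAM to swap under a swap-only limit.

    Returns the new (mem, swap) pair. Toy model, not kernel behaviour.
    """
    room = max(swap_limit - swap, 0)   # swap entries the group may still take
    moved = min(want, mem, room)
    return mem - moved, swap + moved


# Group already at its swap limit: nothing can leave RAM.
at_limit = swap_out_with_swap_limit(mem=100, swap=50, swap_limit=50, want=20)

# Group with room left in swap: RAM pressure is relieved.
with_room = swap_out_with_swap_limit(mem=100, swap=30, swap_limit=50, want=20)
```

In the first case the pages stay resident, so limiting the cheaper resource ends up holding on to the more precious one, which is the objection raised above.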
[Devel] Re: memrlimit controller merge to mainline
On Fri, 25 Jul 2008, Balbir Singh wrote:

I'll try and recreate the problem and fix it. If memrlimit_cgroup_uncharge_as() created the problem, it's most likely related to mm->owner not being correct and we are dereferencing the wrong memory. Thanks for the bug report, I'll look further.

Good luck! I have only seen it once, on a dual-core laptop; though I don't remember to try swapoff while busy as often as I should (be sure to alternate between two or more swap areas, so you can swap a new one on just before swapping an old one off, to be pretty sure of success).

It may be easier to find in the source: my suspicion is that a bad mm_users assumption will come into it. But I realize now that it could be entirely unrelated to memrlimit, just that uncharge_as was the one to get hit by bad refcounting elsewhere.

Oh, that reminds me, I never reported back on my res_counter warnings at shutdown: never saw them again, once I added in the set of changes you came up with shortly after that - thanks.

Hugh
[Devel] Re: memrlimit controller merge to mainline
Hugh Dickins wrote:
On Fri, 25 Jul 2008, Balbir Singh wrote:

I'll try and recreate the problem and fix it. If memrlimit_cgroup_uncharge_as() created the problem, it's most likely related to mm->owner not being correct and we are dereferencing the wrong memory. Thanks for the bug report, I'll look further.

Good luck! I have only seen it once, on a dual-core laptop; though I don't remember to try swapoff while busy as often as I should (be sure to alternate between two or more swap areas, so you can swap a new one on just before swapping an old one off, to be pretty sure of success).

Thanks, that's very useful information. I would never have tried juggling swap devices otherwise.

It may be easier to find in the source: my suspicion is that a bad mm_users assumption will come into it. But I realize now that it could be entirely unrelated to memrlimit, just that uncharge_as was the one to get hit by bad refcounting elsewhere.

Oh, that reminds me, I never reported back on my res_counter warnings at shutdown: never saw them again, once I added in the set of changes you came up with shortly after that - thanks.

I am glad those messages are gone, thanks for the bug report. I find bug fixing more exciting than kernel development on most occasions.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
[Devel] Re: memrlimit controller merge to mainline
Hugh Dickins wrote: On Fri, 25 Jul 2008, Paul Menage wrote: On Fri, Jul 25, 2008 at 5:06 AM, Hugh Dickins [EMAIL PROTECTED] wrote: (Different topic, but one day I ought to get around to saying again how absurd I think a swap controller; whereas a mem+swap controller makes plenty of sense. I think Rik and others said the same.) Agreed that a swap controller without a memory controller doesn't make much sense, but a memory controller without a swap controller can make sense on machines that don't intend to use swap. I agree that a memory controller without a swap controller can make sense: I hope so, anyway, since that's what's in mainline. Even if swap is used, memory is a more precious resource than swap, and you were right to go about controlling memory first. Yes, I agree. So if they were separate controllers, we'd use the proposed cgroup dependency features to make the swap controller depend on the memory controller - in which case you'd only be able to mount the swap controller on a hierarchy that also had the memory controller, and the swap controller would be able to make use of the page ownership information. It's more of a modularity issue than a functionality issue, I think - the swap controller and memory controller are tracking fundamentally different things (space on disk versus pages in memory), and the only dependency between them is the memory controller tracking the ownership of a page and providing it to the swap controller. It sounds as if you're interpreting my mem+swap controller as a mem controller and a swap controller and the swap controller makes use of some of the mem controller infrastructure. No, I'm trying to say something stronger than that. I'm saying, as I've said before, that I cannot imagine why anyone would want to control swap itself - what they want to control is the total of mem+swap. Swap is a second-class citizen, nobody wants swap if they can have mem, so why control it separately? 
IIRC Rik expressed the same by pointing out that a cgroup at its swap limit would then be forced to grow in mem (until it hits its mem limit): so controlling the less precious resource would increase pressure on the more precious resource. (Actually, that probably bears little relation to what he said - sorry, Rik!) I don't recall what answer he got, perhaps I'd be persuaded if I heard it again.

I see what you're saying. When you look at Linux right now, we control swap independently of memory, so I am not totally opposed to setting a limit on swap instead of on swap+mem. I might not want a particular cgroup to swap, in which case I set swap to 0 and risk OOMing, which might be an acceptable trade-off depending on my setup. I could easily change this policy on demand and add swap if OOMing was no longer OK.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL