[Devel] Re: memrlimit controller merge to mainline

2008-08-10 Thread Balbir Singh
Hugh Dickins wrote:
 but I do have an initial hypothesis

 CPU0                                  CPU1
                                       try_to_unuse
 task 1 starts exiting                 look at mm = task1->mm
 ..                                    increment mm_users
 task 1 exits
 mm->owner needs to be updated, but
 no new owner is found
 (mm_users > 1, but no other task
 has task->mm = task1->mm)
 mm_update_next_owner() leaves

 grace period
                                       user count drops, call mmput(mm)
 task 1 freed
                                       dereferencing mm->owner fails
 
 Yes, that looks right to me: seems obvious now.  I don't think your
 careful alternation of CPU0/1 events at the end matters: the swapoff
 CPU simply dereferences mm->owner after that task has gone.
 
 (That's a shame, I'd always hoped that mm->owner->comm was going to
 be good for use in mm messages, even when tearing down the mm.)
 

Hi, Hugh,

I do have fixes for the problem above, but I've run into something strange.
When I create a new cgroup, set its limit to 500M, and run kernbench under it,
I see the following:

1. memrlimit determines that the limit is exceeded and fails the fork of the
   new process
2. The process that failed to fork encounters a page fault and faults in
   find_vma

I tried chasing the problem, but I am lost wondering how a page fault
(do_page_fault) can occur in a process that has not yet been created and is
going to fail with -ENOMEM. The interesting thing is that the OOPS occurs in
find_vma

My trace so far


limit exceeded
Pid: 3695, comm: sh Not tainted 2.6.27-rc1-mm1 #12

Call Trace:
 [802b0473] memrlimit_cgroup_charge_as+0x3a/0x3c
 [8023a82f] dup_mm+0xea/0x410
 [8023b648] copy_process+0xabe/0x12ef
 [8023c0df] do_fork+0x114/0x2d2
 [8025b42c] ? trace_hardirqs_on_caller+0xf9/0x124
 [8025b464] ? trace_hardirqs_on+0xd/0xf
 [805bda1f] ? _spin_unlock_irq+0x2b/0x30
 [805bd24e] ? trace_hardirqs_on_thunk+0x3a/0x3f
 [8020bf4b] ? system_call_fastpath+0x16/0x1b
 [8020a44a] sys_clone+0x23/0x25
 [8020c2c7] ptregscall_common+0x67/0xb0

putting mm 88003d931400 3695 sh
copy_mm, retval -12
copy_process returning -12
copy_process returned fff4 -12
fork failed -12
general protection fault:  [1] copy_process returned 880037a11600 -131940462029312
SMP
last sysfs file: /sys/block/sda/size
CPU 2
Modules linked in: coretemp hwmon kvm_intel kvm rtc_cmos rtc_core rtc_lib mptsas
 mptscsih mptbase scsi_transport_sas uhci_hcd ohci_hcd ehci_hcd
Pid: 3695, comm: sh Not tainted 2.6.27-rc1-mm1 #12
RIP: 0010:[802954f8]  [802954f8] find_vma+0x2f/0x62
RSP: :88003544bee8  EFLAGS: 00010202
RAX: 6b6b6b6b6b6b6b6b RBX:  RCX: 8800399e34d8
RDX: 8800399e34d8 RSI: 003a2729ad22 RDI: 88003e5c8500
RBP: 88003544bee8 R08:  R09: 
R10: 88003e5c8568 R11: 0246 R12: 003a2729ad22
R13: 0014 R14: 88003544bf58 R15: 88003e8bac00
FS:  2b3b978f3f50() GS:8800bfd954b0() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 003a2729ad22 CR3: 3549f000 CR4: 26e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process sh (pid: 3695, threadinfo 88003544a000, task 88003e8bac00)
Stack:  88003544bf48 805bfec0  008cae50
 88003e5c8560 88003e5c8500 00030001 
 7fff131e72c0  008cae50 
Call Trace:
 [805bfec0] do_page_fault+0x36f/0x7ad
 [805bdd4d] error_exit+0x0/0xa9


Code: 85 ff 48 89 e5 74 55 eb 05 48 89 ca eb 47 48 8b 47 10 48 85 c0 74 0c 48 39
 70 10 76 06 48 39 70 08 76 39 48 8b 47 08 31 d2 eb 1d 48 39 70 e0 48 8d 48 d0
 76 0f 48 39 70 d8 76 ce 48 8b 40 10 48
RIP  [802954f8] find_vma+0x2f/0x62
 RSP 88003544bee8

---[ end trace 89156336afdfaec3 ]---

I hope that I'll be able to think more clearly on Monday, but it's hard to say :)

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel


[Devel] Re: memrlimit controller merge to mainline

2008-08-04 Thread Balbir Singh
Hugh Dickins wrote:
[snip]
 
 BUG: unable to handle kernel paging request at 6b6b6b8b
 IP: [7817078f] memrlimit_cgroup_uncharge_as+0x18/0x29
 *pde =  
 Oops:  [#1] PREEMPT SMP 
 last sysfs file: /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map
 Modules linked in: acpi_cpufreq snd_pcm_oss snd_mixer_oss snd_seq 
 snd_seq_device thermal ac battery button
 
 Pid: 22500, comm: swapoff Not tainted (2.6.26-rc8-mm1 #7)
 EIP: 0060:[7817078f] EFLAGS: 00010206 CPU: 0
 EIP is at memrlimit_cgroup_uncharge_as+0x18/0x29
 EAX: 6b6b6b6b EBX: 7963215c ECX: 7c032000 EDX: 0025e000
 ESI: 96902518 EDI: 9fbb1aa0 EBP: 7c033e9c ESP: 7c033e9c
  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
 Process swapoff (pid: 22500, ti=7c032000 task=907e2b70 task.ti=7c032000)
 Stack: 7c033edc 78161323 9fbb1aa0 025e ff77 7c033ecc 96902518 
  
 7c033ec8  0089 7963215c 9fbb1aa0 9fbb1b28 
 a272f040 
7c033ef4 781226b1 9fbb1aa0 9fbb1aa0 790fa884 a272f0c8 7c033f80 
 78165ce3 
 Call Trace:
  [78161323] ? exit_mmap+0xaf/0x133
  [781226b1] ? mmput+0x4c/0xba
  [78165ce3] ? try_to_unuse+0x20b/0x3f5
  [78371534] ? _spin_unlock+0x22/0x3c
  [7816636a] ? sys_swapoff+0x17b/0x37c
  [78102d95] ? sysenter_past_esp+0x6a/0xa5
  ===
 Code: 24 0c 00 00 8b 40 20 52 83 c0 0c 50 e8 ad a6 fd ff c9 c3 55 89 e5 8b 45 
 08 8b 55 0c 8b 80 30 02 00 00 c1 e2 0c 8b 80 24 0c 00 00 8b 40 20 52 83 c0 
 0c 50 e8 e6 a6 fd ff 58 5a c9 c3 55 89 e5 8b 
 EIP: [7817078f] memrlimit_cgroup_uncharge_as+0x18/0x29 SS:ESP 0068:7c033e9c

Hi, Hugh,

I am unable to reproduce the problem, but I do have an initial hypothesis

CPU0                                  CPU1
                                      try_to_unuse
task 1 starts exiting                 look at mm = task1->mm
..                                    increment mm_users
task 1 exits
mm->owner needs to be updated, but
no new owner is found
(mm_users > 1, but no other task
has task->mm = task1->mm)
mm_update_next_owner() leaves

grace period
                                      user count drops, call mmput(mm)
task 1 freed
                                      dereferencing mm->owner fails



I do have a potential solution in mind, but I want to make sure my hypothesis is
correct.
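
The interleaving above can be replayed as a toy model. This is only a sketch in Python, not kernel code: Task, Mm, update_next_owner, uncharge_as and replay are hypothetical stand-ins, and a freed task is modelled by a flag rather than by slab poisoning.

```python
class UseAfterFree(Exception):
    """Raised when the model touches a task that has already been freed."""


class Task:
    def __init__(self, name, cgroup, mm=None):
        self.name, self.cgroup, self.mm = name, cgroup, mm
        self.freed = False  # stands in for the poisoned task_struct


class Mm:
    def __init__(self):
        self.owner = None   # models mm->owner
        self.mm_users = 0


def update_next_owner(mm, exiting, all_tasks):
    """Hand ownership to another task that still uses this mm, if any."""
    for t in all_tasks:
        if t is not exiting and t.mm is mm:
            mm.owner = t
            return True
    return False  # no candidate: mm.owner keeps pointing at the exiting task


def uncharge_as(mm):
    """Model of the uncharge path: it needs the owner's cgroup."""
    if mm.owner.freed:
        raise UseAfterFree(mm.owner.name)
    return mm.owner.cgroup


def replay():
    mm = Mm()
    task1 = Task("task1", "A", mm)
    mm.owner, mm.mm_users = task1, 1
    mm.mm_users += 1                               # CPU1: try_to_unuse takes a reference
    found = update_next_owner(mm, task1, [task1])  # CPU0: task 1 exits
    task1.mm = None
    mm.mm_users -= 1
    task1.freed = True                             # grace period passes, task 1 freed
    mm.mm_users -= 1                               # CPU1: mmput(mm)
    if mm.mm_users == 0:
        try:
            uncharge_as(mm)                        # exit_mmap -> uncharge
        except UseAfterFree:
            return found, "use-after-free"
    return found, "ok"
```

In this model replay() finds no new owner and the final mmput() trips over the freed task, which matches the hypothesis; it says nothing about whether the real kernel paths interleave this way.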



-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

[Devel] Re: memrlimit controller merge to mainline

2008-08-04 Thread Hugh Dickins
On Tue, 5 Aug 2008, Balbir Singh wrote:
 Hugh Dickins wrote:
 [snip]
  
  BUG: unable to handle kernel paging request at 6b6b6b8b
  IP: [7817078f] memrlimit_cgroup_uncharge_as+0x18/0x29
  Pid: 22500, comm: swapoff Not tainted (2.6.26-rc8-mm1 #7)
   [78161323] ? exit_mmap+0xaf/0x133
   [781226b1] ? mmput+0x4c/0xba
   [78165ce3] ? try_to_unuse+0x20b/0x3f5
   [78371534] ? _spin_unlock+0x22/0x3c
   [7816636a] ? sys_swapoff+0x17b/0x37c
   [78102d95] ? sysenter_past_esp+0x6a/0xa5
 
 I am unable to reproduce the problem,

Me neither, I've spent many hours trying 2.6.27-rc1-mm1 and then
back to 2.6.26-rc8-mm1.  But I've been SO stupid: saw it originally
on one machine with SLAB_DEBUG=y, have been trying since mostly on
another with SLUB_DEBUG=y, but never thought to boot with
slub_debug=P,task_struct until now.

 but I do have an initial hypothesis
 
 CPU0                                  CPU1
                                       try_to_unuse
 task 1 starts exiting                 look at mm = task1->mm
 ..                                    increment mm_users
 task 1 exits
 mm->owner needs to be updated, but
 no new owner is found
 (mm_users > 1, but no other task
 has task->mm = task1->mm)
 mm_update_next_owner() leaves
 
 grace period
                                       user count drops, call mmput(mm)
 task 1 freed
                                       dereferencing mm->owner fails

Yes, that looks right to me: seems obvious now.  I don't think your
careful alternation of CPU0/1 events at the end matters: the swapoff
CPU simply dereferences mm->owner after that task has gone.

(That's a shame, I'd always hoped that mm->owner->comm was going to
be good for use in mm messages, even when tearing down the mm.)

 I do have a potential solution in mind, but I want to make sure my
 hypothesis is correct.

It seems wrong that memrlimit_cgroup_uncharge_as should be called
after mm->owner may have been changed, even if it's to something safe.
But I forget the mm/task exit details, surely they're tricky.

By the way, is the ordering in mm_update_next_owner the best?
Would there be less movement if it searched amongst siblings before
it searched amongst children?  Ought it to make a first pass trying
to stay within the same cgroup?

Hugh

[Devel] Re: memrlimit controller merge to mainline

2008-08-04 Thread Balbir Singh
Hugh Dickins wrote:
 On Tue, 5 Aug 2008, Balbir Singh wrote:
 Hugh Dickins wrote:
 [snip]
 BUG: unable to handle kernel paging request at 6b6b6b8b
 IP: [7817078f] memrlimit_cgroup_uncharge_as+0x18/0x29
 Pid: 22500, comm: swapoff Not tainted (2.6.26-rc8-mm1 #7)
  [78161323] ? exit_mmap+0xaf/0x133
  [781226b1] ? mmput+0x4c/0xba
  [78165ce3] ? try_to_unuse+0x20b/0x3f5
  [78371534] ? _spin_unlock+0x22/0x3c
  [7816636a] ? sys_swapoff+0x17b/0x37c
  [78102d95] ? sysenter_past_esp+0x6a/0xa5
 I am unable to reproduce the problem,
 
 Me neither, I've spent many hours trying 2.6.27-rc1-mm1 and then
 back to 2.6.26-rc8-mm1.  But I've been SO stupid: saw it originally
 on one machine with SLAB_DEBUG=y, have been trying since mostly on
 another with SLUB_DEBUG=y, but never thought to boot with
 slub_debug=P,task_struct until now.
 

Unfortunately, I've not tried on 32 bit and not at all with SLAB_DEBUG=y. I'll
give the latter a trial run and see what I get.

 but I do have an initial hypothesis

 CPU0                                  CPU1
                                       try_to_unuse
 task 1 starts exiting                 look at mm = task1->mm
 ..                                    increment mm_users
 task 1 exits
 mm->owner needs to be updated, but
 no new owner is found
 (mm_users > 1, but no other task
 has task->mm = task1->mm)
 mm_update_next_owner() leaves

 grace period
                                       user count drops, call mmput(mm)
 task 1 freed
                                       dereferencing mm->owner fails
 
 Yes, that looks right to me: seems obvious now.  I don't think your
 careful alternation of CPU0/1 events at the end matters: the swapoff
 CPU simply dereferences mm->owner after that task has gone.
 
 (That's a shame, I'd always hoped that mm->owner->comm was going to
 be good for use in mm messages, even when tearing down the mm.)
 

The problem we have is that tasks are independent of mm_structs (in some ways)
and are associated almost like a database associates two entities through keys.

 I do have a potential solution in mind, but I want to make sure my
 hypothesis is correct.
 
 It seems wrong that memrlimit_cgroup_uncharge_as should be called
 after mm->owner may have been changed, even if it's to something safe.
 But I forget the mm/task exit details, surely they're tricky.
 

The fix would be to uncharge when a new owner can no longer be found (I have
yet to code or test it, though).

 By the way, is the ordering in mm_update_next_owner the best?
 Would there be less movement if it searched amongst siblings before
 it searched amongst children?  Ought it to make a first pass trying
 to stay within the same cgroup?

Yes, we need to make a first pass at keeping it in the same cgroup. You might be
right about the sibling optimization.
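
One way to read the two suggestions together is a two-pass search: first look only at tasks in the same cgroup, siblings before children, and fall back to any task sharing the mm only if that fails. The sketch below is purely illustrative Python; pick_next_owner and the dict fields name, mm and cgroup are hypothetical and are not the actual mm_update_next_owner() code.

```python
def pick_next_owner(exiting, children, siblings, others=()):
    """Pick a new mm owner, preferring tasks in the exiting task's cgroup.

    Each task is a dict with hypothetical keys "name", "mm" and "cgroup".
    Siblings are scanned before children, as suggested in the thread.
    """
    ordered = list(siblings) + list(children) + list(others)
    users = [t for t in ordered
             if t is not exiting and t["mm"] == exiting["mm"]]

    # Pass 1: stay within the same cgroup, so no charge has to move.
    for t in users:
        if t["cgroup"] == exiting["cgroup"]:
            return t

    # Pass 2: any remaining user of the mm, or None if there is none.
    return users[0] if users else None
```

With a sibling in another cgroup and a child in the same cgroup, the first pass picks the child, so the cgroup preference outranks the sibling-first ordering in this sketch.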

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL

[Devel] Re: memrlimit controller merge to mainline

2008-07-31 Thread Joe MacDonald
2008/7/25 Balbir Singh [EMAIL PROTECTED]:

 There are applications that can/need to handle overcommit, just that we are not
 aware of them fully. Immediately after our meeting, I was pointed to
 http://www.linuxfoundation.org/en/Carrier_Grade_Linux/Requirements_Alpha1#AVL.4.1_VM_Strict_Over-Commit

I need to get caught up on this thread, but I did promise Balbir at
the mini-summit that I would appear soon-ish with actual use-cases on
this from some of the CGL folks.  Specifically the case I was thinking
of, other than the CGL requirement for VM Strict Overcommit, was finer
grained rlimit accounting.  It started out in the Collaboration Summit
meeting in Austin as a discussion about the SCOPE gaps document and
CGOS-4.5 (curiously called Coarse Resource Enforcement, when it's
really trying to address per-thread limits).

The full document is here in PDF form:

http://www.scope-alliance.org/pr/SCOPE_CGOS_GAPS_PROFILE_v2.pdf

I'm suspecting now, though, that after re-reading the requirement from
SCOPE and the memrlimit discussion, they may in fact be disjoint sets
of functionality.

-J.

[Devel] Re: memrlimit controller merge to mainline

2008-07-29 Thread KAMEZAWA Hiroyuki
On Fri, 25 Jul 2008 17:46:45 +0100 (BST)
Hugh Dickins [EMAIL PROTECTED] wrote:

 IIRC Rik expressed the same by pointing out that a cgroup at its
 swap limit would then be forced to grow in mem (until it hits its
 mem limit): so controlling the less precious resource would increase
 pressure on the more precious resource.  (Actually, that probably
 bears little relation to what he said - sorry, Rik!)  I don't recall
 what answer he got, perhaps I'd be persuaded if I heard it again.
 
Added Nishimura to CC.

IMHO, from the user's point of view, there is not much difference between
 - having two controls: a mem controller + a swap controller
 - a mem+swap controller
Users will use them as they like.

From the memory controller's point of view, handling mem+swap in the same
controller makes sense. Because the memory controller can check whether we can
use more swap or not, we can avoid hopeless scanning of anon pages when swap is
short. (With split-lru, I think we can do this avoidance.)
 
Another-Topic?

On recent servers, memory is big and swap is (relatively) small.
And under the memory resource controller, the whole swap is easily occupied
by one group. I want to avoid that.

For users, swap is not precious because it's not fast.
But for memory reclaim, swap is a precious resource for paging out
anonymous/shmem/tmpfs memory. I think the usual system admin considers swap
an emergency spare of memory. I'd like to allow this emergency spare to each
cgroup.
(For example, swap is used even if vm.swappiness==0. This is to avoid the
OOM killer in some situations; this behavior was added by Rik.)


== the following is another use case I explained to Rik on 23/May/08 ==

IIRC, a man explained his motivation for controlling swap at the OLS2007 BOF
as follows. Consider the following system (with no swap controller):
Memory 4G. Swap 1G. With 2 cgroups A, B.

state 1) swap is not used.
  A: memory limit 1G, no swap usage, memory_usage=0M
  B: memory limit 1G, no swap usage, memory_usage=0M

state 2) Run a big program on A.
  A: memory limit 1G, tries to use 1.7G, uses 700MBytes of swap.
     memory_usage=1G swap_usage=700M
  B: memory_usage=0M

state 3) Some of the programs in A end.
  A: memory_usage=500M swap_usage=700M
  B: memory_usage=0M

state 4) Run a big program on B.
  A: memory_usage=500M swap_usage=700M
  B: memory_usage=1G   swap_usage=300M

Group B can only use 1.3G because of the unfair swap use of group A.
But users wonder why A uses 700M of swap with 500M of free memory.
==
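
The arithmetic in the four states can be checked with a small sketch. It assumes the thread's round numbers (1G treated as 1000M) and a single shared swap area with no per-cgroup swap limit; swap_needed and max_footprint are hypothetical helper names.

```python
def swap_needed(demand_mb, mem_limit_mb):
    """Swap a cgroup needs when its demand exceeds its memory limit."""
    return max(demand_mb - mem_limit_mb, 0)


def max_footprint(mem_limit_mb, swap_total_mb, swap_used_by_others_mb):
    """Largest mem+swap footprint a cgroup can reach with shared swap."""
    free_swap = max(swap_total_mb - swap_used_by_others_mb, 0)
    return mem_limit_mb + free_swap


SWAP_TOTAL = 1000  # "Swap 1G", rounded as in the example above
MEM_LIMIT = 1000   # each cgroup's memory limit

# state 2: A tries to use 1.7G under a 1G memory limit
a_swap = swap_needed(1700, MEM_LIMIT)

# state 4: B can only grow into whatever swap A has left behind
b_max = max_footprint(MEM_LIMIT, SWAP_TOTAL, a_swap)
```

Here a_swap comes out as 700 and b_max as 1300, the 1.3G quoted for group B.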



Thanks,
-Kame


[Devel] Re: memrlimit controller merge to mainline

2008-07-29 Thread Hugh Dickins
On Tue, 29 Jul 2008, KAMEZAWA Hiroyuki wrote:
 On Fri, 25 Jul 2008 17:46:45 +0100 (BST)
 Hugh Dickins [EMAIL PROTECTED] wrote:
 
  IIRC Rik expressed the same by pointing out that a cgroup at its
  swap limit would then be forced to grow in mem (until it hits its
  mem limit): so controlling the less precious resource would increase
  pressure on the more precious resource.  (Actually, that probably
  bears little relation to what he said - sorry, Rik!)  I don't recall
  what answer he got, perhaps I'd be persuaded if I heard it again.
  
 Added Nishimura to CC.
 
 IMHO, from user point of view, both of
  - having 2 controls as mem controller + swap controller
  - mem + swap controller
 doesn't have much difference. The users will use as they like.

I'm not suggesting either one of those alternatives.

I'm suggesting we have a mem controller (the thing we already have)
and a mem+swap controller (which we don't yet have: a controller
for the total mem+swap of a cgroup); the mem+swap controller likely
making use of much that is in the mem controller, as Paul has said.

(Unfortunately I don't have a good name for this mem+swap.)

I happen to believe that the mem+swap controller would actually be
a lot more useful than the current mem controller, and would expect
many to run with mem+swap controller enabled but mem controller
disabled or unlimited.  How much is mem and how much is swap being
left to global reclaim to decide, not imposed by any cgroup policy.

What I don't like the sound of at all is a swap controller.  Do you
think that a mem controller (limit 1G) and a mem+swap controller
(limit 2G) is equivalent to a mem controller (limit 1G) and a
swap controller (limit 1G)?  No: imagine memory pressure from
outside the cgroup - with the mem+swap controller it can push as
much as suits of the 2G out to swap; whereas with the swap controller,
once 1G is out, it has to stop pushing any more of that cgroup out.
I think that's absurd - but perhaps I just haven't looked, and
I've totally misinterpreted the talk of a swap controller.
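
The difference between the two pairs of limits can be put as a small calculation. This is a sketch under simplifying assumptions: the cgroup already holds 2G in total, pressure from outside the cgroup wants to push out as much as it is allowed to, and max_swapped_out and min_resident are hypothetical helper names.

```python
def max_swapped_out(usage_mb, swap_limit_mb=None):
    """Upper bound on how much of the cgroup outside pressure may swap out.

    With a mem+swap limit, only the total is bounded, so all of the usage
    may go to swap. With a separate swap limit, swap-out stops at the limit.
    """
    if swap_limit_mb is None:
        return usage_mb
    return min(usage_mb, swap_limit_mb)


def min_resident(usage_mb, swap_limit_mb=None):
    """Memory the cgroup keeps occupying even under maximal outside pressure."""
    return usage_mb - max_swapped_out(usage_mb, swap_limit_mb)


usage = 2048  # the cgroup's total footprint in MB

# mem limit 1G + mem+swap limit 2G: nothing caps the swap share
pinned_with_memsw = min_resident(usage)

# mem limit 1G + swap limit 1G: swap-out must stop after 1G
pinned_with_swap_limit = min_resident(usage, swap_limit_mb=1024)
```

Under these assumptions the swap-limited cgroup keeps 1G resident however hard the rest of the system is pressed, while the mem+swap cgroup can be pushed out entirely.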

 
 From memory controller's point of view, treating mem+swap by the same
 controller makes sense. Because memory controller can check wheter we can use
 more swap or not, we can avoid hopeless-scanning of Anon at swap-shortage.
 (By split-lru, I think we can do this avoidance.)

That's a detail I'm not concerned with on this level.

  
 Another-Topic?
 
 In recent servers, memory is big, swap is (relatively) small.

You'll know much more about those common proportions than I do.
I'd wonder why such big memory servers have any swap at all:
to cope with VM management defects we should be fixing?

 And under memory resource controller, the whole swap is easily occupied
 by a group. I want to avoid it.

Why?  I presume because you're thinking it a precious resource.
I don't think its relative smallness makes it more precious.

 
 For users, swap is not precious because it's not fast. 

Yes, and that's my view.

 But for memory reclaiming, swap is precious resource to page out
 anonymous/shmem/tmpfs memory.

I see that makes swap a useful resource, I don't see that it makes
it a precious resource.  We page out to it precisely because it's
less precious than the memory; both users and kernel would much
prefer to keep all the data in memory, but sometimes there isn't
enough memory so we go to swap.

There is just one way in which I see swap as precious, and that
is to get around some VM management stupidity.  If, for example,
on i386 there's a shortage of lowmem and lots of anonymous in lowmem
that we should shift to highmem, then I think it's still the case
that we have to do that balancing via writing out to and reading
in from swap, because nobody has actually hooked up page migration
to do that when appropriate?  But that's an argument for extending
page migration, not for needing a swap controller.

 I think usual system-admin considers swap as some emergency spare of memory.

Yes, I do too.

 I'd like to allow this emergency spare to each cgroup.

We do allow that emergency spare to each cgroup.  Perhaps you're
saying you want to divide it up in advance between the cgroups?
But why?  Sounds like a nice idea (reminds me of what Paul said
about using temporary files), but a solution to what problem?

 (For example, swap is used even if vm.swappiness==0. This is for avoiding
 OOM-Killer under some situation, this behavior is added by Rik.)

Sorry, I don't know what you're referring to there, but again,
suspect it's a detail we don't need to be concerned with here.

 
 == following is another use case I explained to Rik at 23/May/08 ==
 
 IIRC, a man shown his motivation to controll swap in OLS2007/BOF as following.
 Consider following system. (and there is no swap controller.) 
 Memory 4G. Swap 1G. with 2 cgroups A, B.
 
 state 1) swap is not used.
   Amemory limit to be 1G  no swap usage memory_usage=0M
   Bmemory limit to be 1G  no swap usage memory_usage=0M
 
 state 2) Run a 

[Devel] Re: memrlimit controller merge to mainline

2008-07-29 Thread Paul Menage
On Tue, Jul 29, 2008 at 5:31 PM, Hugh Dickins [EMAIL PROTECTED] wrote:

 I don't see that I'm denying you a way to guarantee that (though I've
 been thinking more of the limits than the guarantees): I'm not saying
 that you cannot have a mem controller, I'm saying that you can also
 have a mem+swap controller; but that a swap-by-itself controller
 makes no sense to me.

OK, fair enough.


 I think that works until you get to fork: shared files and
 private/anonymous/swap behave differently from then on.


Good point. It works as long as you never do a plain fork() without
immediate execve() though.

Paul

[Devel] Re: memrlimit controller merge to mainline

2008-07-29 Thread Hugh Dickins
On Fri, 25 Jul 2008, Paul Menage wrote:
 On Fri, Jul 25, 2008 at 12:46 PM, Hugh Dickins [EMAIL PROTECTED] wrote:
  No, I'm trying to say something stronger than that.  I'm saying,
  as I've said before, that I cannot imagine why anyone would want
  to control swap itself - what they want to control is the total
  of mem+swap.  Swap is a second-class citizen, nobody wants swap
  if they can have mem, so why control it separately?
 
 Scheduling jobs on to machines is much more straightforward when they
 request xGB of memory and yGB of swap rather than just (x+y)GB of
 (memory+swap). We want to be able to guarantee to jobs that they will
 be able to use xGB of real memory.

I don't see that I'm denying you a way to guarantee that (though I've
been thinking more of the limits than the guarantees): I'm not saying
that you cannot have a mem controller, I'm saying that you can also
have a mem+swap controller; but that a swap-by-itself controller
makes no sense to me.

 Actually my preferred approach to swap controlling would be something like:
 
 - allow malloc to support mmaping pages from a temporary file rather
 than mmapping anonymous memory

I think that works until you get to fork: shared files and
private/anonymous/swap behave differently from then on.

Hugh

[Devel] Re: memrlimit controller merge to mainline

2008-07-29 Thread Hugh Dickins
On Fri, 25 Jul 2008, Balbir Singh wrote:
 
 I see what you're saying. When you look at Linux right now, we control swap
 independent of memory, so I am not totally opposed to setting swap, instead of
 swap+mem. I might not want to swap from a particular cgroup, in which case, I
 set swap to 0 and risk OOMing, which might be an acceptable trade-off depending
 on my setup. I could easily change this policy on demand and add swap if OOMing
 was no longer OK.

It's taken me a while to understand your point.  I think you're
saying that with a swap controller, you can set the swap limit to 0
on a cgroup if you want to keep it entirely in memory, without setting
any mem limit upon it; whereas with my mem+swap controller, you'd have
to set a mem limit then an equal mem+swap limit to achieve the same
"never go to swap" effect, and maybe you don't want to set a mem limit.

Hmm, but an unreachably high mem limit, and equal mem+swap limit,
would achieve that effect.  Sorry, I don't think I have understood
(and even if the unreachably high limit didn't work, this seems more
about setting a don't-swap flag than imposing a swap limit).

Hugh

[Devel] Re: memrlimit controller merge to mainline

2008-07-29 Thread KAMEZAWA Hiroyuki
On Wed, 30 Jul 2008 01:16:17 +0100 (BST)
Hugh Dickins [EMAIL PROTECTED] wrote:

 On Tue, 29 Jul 2008, KAMEZAWA Hiroyuki wrote:
  On Fri, 25 Jul 2008 17:46:45 +0100 (BST)
  Hugh Dickins [EMAIL PROTECTED] wrote:
  
   IIRC Rik expressed the same by pointing out that a cgroup at its
   swap limit would then be forced to grow in mem (until it hits its
   mem limit): so controlling the less precious resource would increase
   pressure on the more precious resource.  (Actually, that probably
   bears little relation to what he said - sorry, Rik!)  I don't recall
   what answer he got, perhaps I'd be persuaded if I heard it again.
   
  Added Nishimura to CC.
  
  IMHO, from user point of view, both of
   - having 2 controls as mem controller + swap controller
   - mem + swap controller
  doesn't have much difference. The users will use as they like.
 
 I'm not suggesting either one of those alternatives.
 
 I'm suggesting we have a mem controller (the thing we already have)
 and a mem+swap controller (which we don't yet have: a controller
 for the total mem+swap of a cgroup); the mem+swap controller likely
 making use of much that is in the mem controller, as Paul has said.
 
Ah, so a mem+swap controller means limiting mem+swap by 'a' single limit?
That's an option for me. From the view of global LRU management, it's better.
As long as we can avoid an accident where the swap is fully used by some silly
program, anything is OK with me.

How about you, Nishimura-san?

The story I told is based on the assumption that there may not be enough swap
space relative to memory. We can ask customers to provision tons of swap when
memory is huge. BTW, what is the maximum swap size now?
Can we extend it if it's small?


snip
  state 4) Run a big program on B.
A...memory_usage=500M swap_usage=700M.
B...memory_usage=1G   swap_usage=300M
 If you believe a swap controller would make that better, what limits
 do you suggest?  If you assign A a swap limit of 700M or above, it
 changes nothing; if you assign A a swap limit below 700M, it cannot
 do all the work that it could do in the example.

Of course: set A's swap_limit to 300M, bring the swap pages back into memory,
free the swap entries, and keep A in memory (before B starts).

  But users think why A uses 700M of swap with 500M of free memory
 
 Because at this time A isn't actively using any of that 700M.

That's a weakness of doing everything by automatic detection and an ideal
algorithm. It's just a result of the LRU algorithm, which is not always what
users think is ideal.


Thanks,
-Kame


[Devel] Re: memrlimit controller merge to mainline

2008-07-29 Thread KAMEZAWA Hiroyuki
On Wed, 30 Jul 2008 10:17:19 +0900
KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:

 On Wed, 30 Jul 2008 01:16:17 +0100 (BST)
 Hugh Dickins [EMAIL PROTECTED] wrote:
 
  On Tue, 29 Jul 2008, KAMEZAWA Hiroyuki wrote:
   On Fri, 25 Jul 2008 17:46:45 +0100 (BST)
   Hugh Dickins [EMAIL PROTECTED] wrote:
   
IIRC Rik expressed the same by pointing out that a cgroup at its
swap limit would then be forced to grow in mem (until it hits its
mem limit): so controlling the less precious resource would increase
pressure on the more precious resource.  (Actually, that probably
bears little relation to what he said - sorry, Rik!)  I don't recall
what answer he got, perhaps I'd be persuaded if I heard it again.

   Added Nishimura to CC.
   
   IMHO, from user point of view, both of
- having 2 controls as mem controller + swap controller
- mem + swap controller
   doesn't have much difference. The users will use as they like.
  
  I'm not suggesting either one of those alternatives.
  
  I'm suggesting we have a mem controller (the thing we already have)
  and a mem+swap controller (which we don't yet have: a controller
  for the total mem+swap of a cgroup); the mem+swap controller likely
  making use of much that is in the mem controller, as Paul has said.
  
 Ah, what mem+swap controller means is limitiing mem+swap by 'a' limit ?
 It's a choice for me. From view of global LRU management, it's better.
 If we can avoid an accident that the swap is fully used by some silly program,
 anything is ok to me.
 
Hmm.

A mem+swap controller means that a shrink of the memory resource controller
(try_to_free_mem_cgroup_pages()) should drop only file caches
(because kicking pages out to swap will never change the usage).

Right? Only the global LRU can swap pages out.
Maybe I can add an optimization to do this. Hmm. I should see how OOM works
in some situations.

Thanks,
-Kame






[Devel] Re: memrlimit controller merge to mainline

2008-07-29 Thread KAMEZAWA Hiroyuki
On Wed, 30 Jul 2008 11:52:26 +0900
KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
 mem+swap controller means a shrink to memory resource controller 
 (try_to_free_mem_cgroup_pages()) should drop only file caches.
 (Because kick-out-to-swap will never changes the usage.)
 
 right ? only global-lru can make a swap.
 maybe I can add optimization to do this. Hmm. I should see how OOM works
 under some situation.
 
(I'm sorry that I'm not a good writer of e-mail.)

A brief summary of the changes to the mem controller.

 - The mem+swap controller limits the sum of pages and swap entries.
 - The mem+swap controller just drops file caches when it reaches its limit.
 - Under the mem+swap controller, reclaiming anon pages makes no sense.
   Then,
  - an LRU for anon is not necessary.
  - an LRU for tmpfs/shmem is not necessary.
  Just showing the accounting is better.
 - We should look at try_to_free_mem_cgroup() again to avoid too much OOM.
   Maybe Retries=5 is too small because we never do swap-out ourselves.
   A problem like getting stuck in the ext3 journal can easily make file-cache
   reclaim difficult.
 - We need some changes to the documentation.
 - Should we have an on/off switch for taking swap into account,
   or should we implement the mem+swap controller under a different name than
   the memory controller?
   If swap is not accounted, we need to do swap-out in the memory reclaim path
   again.
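
Why swap-out cannot help a mem+swap limit can be seen in a toy accounting model. This is a Python sketch with hypothetical names (MemSwapCgroup, swap_out, drop_file_cache), counting in abstract page units; it is not the memory controller's real interface.

```python
from dataclasses import dataclass


@dataclass
class MemSwapCgroup:
    """Charges the sum of resident pages and swap entries against one limit."""
    limit: int
    anon_pages: int = 0
    file_pages: int = 0
    swap_entries: int = 0

    @property
    def usage(self) -> int:
        return self.anon_pages + self.file_pages + self.swap_entries

    def swap_out(self, n: int) -> int:
        """Move up to n anon pages to swap; the charged total is unchanged."""
        n = min(n, self.anon_pages)
        self.anon_pages -= n
        self.swap_entries += n
        return n

    def drop_file_cache(self, n: int) -> int:
        """Drop up to n file cache pages; this is what lowers the usage."""
        n = min(n, self.file_pages)
        self.file_pages -= n
        return n


cg = MemSwapCgroup(limit=100, anon_pages=60, file_pages=40)
before = cg.usage
cg.swap_out(30)
after_swap = cg.usage
cg.drop_file_cache(10)
after_drop = cg.usage
```

In this model swapping out 30 pages leaves the usage where it was, and only dropping file cache brings it below the limit, which is the reason reclaim at the limit would look only at file caches.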
   

Thanks,
-Kame


[Devel] Re: memrlimit controller merge to mainline

2008-07-29 Thread KAMEZAWA Hiroyuki
On Wed, 30 Jul 2008 12:11:15 +0900
KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:

 On Wed, 30 Jul 2008 11:52:26 +0900
 KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
  mem+swap controller means that shrinking the memory resource controller
  (try_to_free_mem_cgroup_pages()) should drop only file caches.
  (Because kicking pages out to swap never changes the usage.)
  
  Right? Only the global LRU can swap pages out.
  Maybe I can add an optimization to do this. Hmm. I should see how OOM works
  in some situations.
  
 (I'm sorry that I'm not a good writer of e-mail.)
 
 A brief summary of the changes to the mem controller.
 
  - mem+swap controller limits the sum of pages and swap entries.
  - mem+swap controller just drops file caches when it reaches the limit.
  - under the mem+swap controller, reclaiming Anon pages makes no sense.
    Then,
    - LRU for Anon is not necessary.
    - LRU for tmpfs/shmem is not necessary.
    Just showing the accounting is better.
  - we should look at try_to_free_mem_cgroup() again to avoid too many OOMs.
    Maybe Retries=5 is too small because we never swap out ourselves.
    A problem like stuck-in-ext3-journal can easily make file-cache reclaim
    difficult.
  - need some changes to documentation.
  - Should we have an on/off switch for taking swap into account?
    Or should we implement the mem+swap controller under a different name
    than the memory controller?
    If swap is not accounted, we need to do swap-out in the memory reclaim
    path again.

Then, the mem+swap controller finally means
 - under the mem+swap controller, programs run with no swap. Only the global
   LRU may swap pages out.
 - If swap-accounting mode is off, swap can be used without limit.

Hmm, this sounds a bit different from what I want. What do others think?
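
The point that swap-out cannot relieve a mem+swap limit can be illustrated
with a toy accounting model. This is a minimal sketch, not kernel code: the
class MemSwapGroup and its fields are hypothetical names, and usage is
assumed to be simply anon pages + file-cache pages + swap entries.

```python
from dataclasses import dataclass


@dataclass
class MemSwapGroup:
    """Toy accounting for one cgroup (hypothetical, for illustration only)."""
    limit: int      # limit on anon + file + swap, in pages
    anon: int = 0   # anonymous pages resident in RAM
    file: int = 0   # file-cache pages resident in RAM
    swap: int = 0   # swap entries owned by the group

    @property
    def usage(self) -> int:
        # mem+swap accounting: RAM pages and swap entries are summed
        return self.anon + self.file + self.swap

    def swap_out(self, pages: int) -> int:
        """Move anon pages to swap; return how much the usage dropped."""
        before = self.usage
        moved = min(pages, self.anon)
        self.anon -= moved
        self.swap += moved
        return before - self.usage

    def drop_file_cache(self, pages: int) -> int:
        """Discard file-cache pages; return how much the usage dropped."""
        before = self.usage
        dropped = min(pages, self.file)
        self.file -= dropped
        return before - self.usage


group = MemSwapGroup(limit=100, anon=60, file=40)
freed_by_swap = group.swap_out(30)         # pages only change category
freed_by_drop = group.drop_file_cache(30)  # pages really leave the group
```

In this model swap_out() frees nothing against the limit, while
drop_file_cache() does, which is why only file caches are worth reclaiming
under such a controller.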

Thanks,
-Kame



 
 Thanks,
 -Kame
 
 



[Devel] Re: memrlimit controller merge to mainline

2008-07-29 Thread Daisuke Nishimura
On Wed, 30 Jul 2008 13:14:07 +0900, KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
 On Wed, 30 Jul 2008 12:11:15 +0900
 KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
 
  On Wed, 30 Jul 2008 11:52:26 +0900
  KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
   mem+swap controller means that shrinking the memory resource controller
   (try_to_free_mem_cgroup_pages()) should drop only file caches.
   (Because kicking pages out to swap never changes the usage.)
   
   Right? Only the global LRU can swap pages out.
   Maybe I can add an optimization to do this. Hmm. I should see how OOM works
   in some situations.
   
I'm thinking of the mem+swap controller in a different way: as an add-on to
the mem controller, just like the current swap controller.
I mean adding memory.(mem+swap)_limit.

  (I'm sorry that I'm not a good writer of e-mail.)
  
  A brief summary of the changes to the mem controller.
  
   - mem+swap controller limits the sum of pages and swap entries.
   - mem+swap controller just drops file caches when it reaches the limit.
   - under the mem+swap controller, reclaiming Anon pages makes no sense.
     Then,
     - LRU for Anon is not necessary.
     - LRU for tmpfs/shmem is not necessary.
     Just showing the accounting is better.
   - we should look at try_to_free_mem_cgroup() again to avoid too many OOMs.
     Maybe Retries=5 is too small because we never swap out ourselves.
     A problem like stuck-in-ext3-journal can easily make file-cache reclaim
     difficult.
   - need some changes to documentation.
   - Should we have an on/off switch for taking swap into account?
     Or should we implement the mem+swap controller under a different name
     than the memory controller?
     If swap is not accounted, we need to do swap-out in the memory reclaim
     path again.
 
 Then, the mem+swap controller finally means
  - under the mem+swap controller, programs run with no swap. Only the global
    LRU may swap pages out.
  - If swap-accounting mode is off, swap can be used without limit.
 
 Hmm, this sounds a bit different from what I want. What do others think?
 

Thanks,
Daisuke Nishimura.


[Devel] Re: memrlimit controller merge to mainline

2008-07-29 Thread KAMEZAWA Hiroyuki
On Wed, 30 Jul 2008 13:58:03 +0900
Daisuke Nishimura [EMAIL PROTECTED] wrote:

 On Wed, 30 Jul 2008 13:14:07 +0900, KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
  On Wed, 30 Jul 2008 12:11:15 +0900
  KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
  
   On Wed, 30 Jul 2008 11:52:26 +0900
   KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
    mem+swap controller means that shrinking the memory resource controller
    (try_to_free_mem_cgroup_pages()) should drop only file caches.
    (Because kicking pages out to swap never changes the usage.)
    
    Right? Only the global LRU can swap pages out.
    Maybe I can add an optimization to do this. Hmm. I should see how OOM works
    in some situations.
    
 I'm thinking of the mem+swap controller in a different way: as an add-on to
 the mem controller, just like the current swap controller.
 I mean adding memory.(mem+swap)_limit.
 
Hmm? You mean adding a control file in addition to
 - memory.limit_in_bytes
?

Thanks,
-Kame



[Devel] Re: memrlimit controller merge to mainline

2008-07-29 Thread KAMEZAWA Hiroyuki
Sorry for many mails ;(

I think I misunderstood something...

Is the following right?

A brief summary of the changes to the memory controller.
 - memory.limit_in_bytes works as it does now.
 - a new parameter, memory.limit_in_bytes_includes_swap, will be added.
   + memory.limit_in_bytes_includes_swap controls the total amount of
     RAM + SWAP,
   + memory.limit_in_bytes <= memory.limit_in_bytes_includes_swap

As a result,
 - the memory controller works as it does now but doesn't use too much swap.
 - the global LRU cannot be affected by the controller's parameters.


Hmm, seems reasonable. A minor problem is how to handle 2 counts/limits.

BTW, does anyone have good names?
  (example) memory.memory_limits_in_bytes  (for accounting memory)
            memory.total_limits_in_bytes   (for accounting memory+swap)
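
One way to picture "2 counts/limits" is a group that carries a RAM counter
and a swap counter and checks both limits on every charge. This is a minimal
sketch under stated assumptions, not the proposed kernel interface: the class
TwoLimitGroup and its method names are hypothetical, the RAM limit is assumed
never to exceed the RAM + swap limit, and reclaim is reduced to pushing one
page to swap.

```python
class TwoLimitGroup:
    """Toy model of a group with a RAM limit and a RAM+swap limit."""

    def __init__(self, mem_limit: int, total_limit: int):
        # Assumption: the RAM limit is at least one page and is never
        # larger than the RAM + swap limit.
        if mem_limit < 1 or mem_limit > total_limit:
            raise ValueError("need 1 <= mem_limit <= total_limit")
        self.mem_limit = mem_limit
        self.total_limit = total_limit
        self.mem = 0   # pages charged in RAM
        self.swap = 0  # swap entries charged

    def charge_page(self) -> str:
        """Try to charge one new page; return 'ok', 'swapped' or 'fail'."""
        if self.mem + self.swap >= self.total_limit:
            # Swapping out cannot help: the total would stay the same.
            return "fail"
        if self.mem >= self.mem_limit:
            # RAM limit reached but the total has room: move one resident
            # page to swap, then charge the new page in RAM.
            self.mem -= 1
            self.swap += 1
            self.mem += 1
            return "swapped"
        self.mem += 1
        return "ok"


group = TwoLimitGroup(mem_limit=2, total_limit=3)
results = [group.charge_page() for _ in range(4)]
```

With a RAM limit of 2 pages and a total of 3, the third charge succeeds only
by swapping, and the fourth fails because the total is exhausted.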

Thanks,
-Kame


On Wed, 30 Jul 2008 12:11:15 +0900
KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:
 A brief summary of the changes to the mem controller.
 
  - mem+swap controller limits the sum of pages and swap entries.
  - mem+swap controller just drops file caches when it reaches the limit.
  - under the mem+swap controller, reclaiming Anon pages makes no sense.
    Then,
    - LRU for Anon is not necessary.
    - LRU for tmpfs/shmem is not necessary.
    Just showing the accounting is better.
  - we should look at try_to_free_mem_cgroup() again to avoid too many OOMs.
    Maybe Retries=5 is too small because we never swap out ourselves.
    A problem like stuck-in-ext3-journal can easily make file-cache reclaim
    difficult.
  - need some changes to documentation.
  - Should we have an on/off switch for taking swap into account?
    Or should we implement the mem+swap controller under a different name
    than the memory controller?
    If swap is not accounted, we need to do swap-out in the memory reclaim
    path again.

 
 Thanks,
 -Kame
 
 



[Devel] Re: memrlimit controller merge to mainline

2008-07-25 Thread Andrew Morton
On Fri, 25 Jul 2008 04:14:55 -0400 Paul Menage [EMAIL PROTECTED] wrote:

 Hi Balbir,
 
 Andrew included the memrlimit controller in his latest set of patches
 to Linus for mainline.

I've asked Linus to drop all 238 patches.  I'll be resending them minus
the offending memrlimit patches.

Did I mention that conferences suck?



[Devel] Re: memrlimit controller merge to mainline

2008-07-25 Thread Hugh Dickins
On Fri, 25 Jul 2008, Paul Menage wrote:
 
 So I think we'd be complicating some of the vm paths in mainline with
 a feature that isn't likely to get a lot of real use.
 
 What do you (and others on the containers list) think? Should we ask
 Andrew/Linus to hold off on this for now? My preference would be to do
 that until we have someone who can stand up with a concrete scenario
 where they want to use this in a real environment.

I see Andrew has already acted, so it's now moot.  But I'd like to
say that I do agree with you and the conclusion to hold off for now.

I was a bit alarmed earlier to see those patches sailing on through;
but realized that I'd done very little to substantiate my hatred of
the whole thing, and decided that I didn't feel strongly enough to
stand in the way now.  But I am glad you've stepped in, thank you.

(Different topic, but one day I ought to get around to saying again
how absurd I think a swap controller is; whereas a mem+swap controller
makes plenty of sense.  I think Rik and others said the same.)

By the way, here's a BUG I got from CONFIG_CGROUP_MEMRLIMIT_CTLR=y
but no use of it, when doing swapoff a week ago.  Not investigated
at all, I'm afraid, but at a guess it might come from memrlimit work
placing too much faith in the mm_users count - swapoff is only one
of several places which have to inc/dec mm_users for some reason.

BUG: unable to handle kernel paging request at 6b6b6b8b
IP: [7817078f] memrlimit_cgroup_uncharge_as+0x18/0x29
*pde =  
Oops:  [#1] PREEMPT SMP 
last sysfs file: /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map
Modules linked in: acpi_cpufreq snd_pcm_oss snd_mixer_oss snd_seq 
snd_seq_device thermal ac battery button

Pid: 22500, comm: swapoff Not tainted (2.6.26-rc8-mm1 #7)
EIP: 0060:[7817078f] EFLAGS: 00010206 CPU: 0
EIP is at memrlimit_cgroup_uncharge_as+0x18/0x29
EAX: 6b6b6b6b EBX: 7963215c ECX: 7c032000 EDX: 0025e000
ESI: 96902518 EDI: 9fbb1aa0 EBP: 7c033e9c ESP: 7c033e9c
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process swapoff (pid: 22500, ti=7c032000 task=907e2b70 task.ti=7c032000)
Stack: 7c033edc 78161323 9fbb1aa0 025e ff77 7c033ecc 96902518  
    7c033ec8  0089 7963215c 9fbb1aa0 9fbb1b28 a272f040 
   7c033ef4 781226b1 9fbb1aa0 9fbb1aa0 790fa884 a272f0c8 7c033f80 78165ce3 
Call Trace:
 [78161323] ? exit_mmap+0xaf/0x133
 [781226b1] ? mmput+0x4c/0xba
 [78165ce3] ? try_to_unuse+0x20b/0x3f5
 [78371534] ? _spin_unlock+0x22/0x3c
 [7816636a] ? sys_swapoff+0x17b/0x37c
 [78102d95] ? sysenter_past_esp+0x6a/0xa5
 ===
Code: 24 0c 00 00 8b 40 20 52 83 c0 0c 50 e8 ad a6 fd ff c9 c3 55 89 e5 8b 45 
08 8b 55 0c 8b 80 30 02 00 00 c1 e2 0c 8b 80 24 0c 00 00 8b 40 20 52 83 c0 0c 
50 e8 e6 a6 fd ff 58 5a c9 c3 55 89 e5 8b 
EIP: [7817078f] memrlimit_cgroup_uncharge_as+0x18/0x29 SS:ESP 0068:7c033e9c

Hugh


[Devel] Re: memrlimit controller merge to mainline

2008-07-25 Thread Balbir Singh
Paul Menage wrote:
 Hi Balbir,
 
 Andrew included the memrlimit controller in his latest set of patches
 to Linus for mainline.
 
 Although the memrlimit controller basically works as intended, my
 impression from the mini-summit on Tuesday is that our consensus is
 that this still doesn't have concrete practical use-cases yet:
 
 - avoiding swap over-use is better handled by the forthcoming swap controller
 
 - applications that can usefully handle mmap() returning NULL don't
 really exist yet (and since the system as a whole allows address space
 overcommit limits, if it was practical/useful to write such apps then
 presumably they would already exist)
 

There are applications that can/need to handle overcommit, just that we are not
aware of them fully. Immediately after our meeting, I was pointed to
http://www.linuxfoundation.org/en/Carrier_Grade_Linux/Requirements_Alpha1#AVL.4.1_VM_Strict_Over-Commit

 So I think we'd be complicating some of the vm paths in mainline with
 a feature that isn't likely to get a lot of real use.
 

I did disagree in the meeting and there is also the use case of the feature
forming the infrastructure for other rlimit controllers.

 What do you (and others on the containers list) think? Should we ask
 Andrew/Linus to hold off on this for now? My preference would be to do
 that until we have someone who can stand up with a concrete scenario
 where they want to use this in a real environment.

While we can argue about use cases, the feature needs more testing and I am OK
holding off/reverting the merge to make it more stable and that would give us
more time to argue on its usefulness. To say that overcommit handling is not
useful is wrong. Meanwhile, I'll go back and look at the bug report that Hugh
has posted and also look at building an mlock controller on top of memrlimits.

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


[Devel] Re: memrlimit controller merge to mainline

2008-07-25 Thread Balbir Singh
Andrew Morton wrote:
 On Fri, 25 Jul 2008 04:14:55 -0400 Paul Menage [EMAIL PROTECTED] wrote:
 
 Hi Balbir,

 Andrew included the memrlimit controller in his latest set of patches
 to Linus for mainline.
 
 I've asked Linus to drop all 238 patches.  I'll be resending them minus
 the offending memrlimit patches.
 

Sorry for making your work harder.

 Did I mention that conferences suck?

Not yet, but we know now :)

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


[Devel] Re: memrlimit controller merge to mainline

2008-07-25 Thread Balbir Singh
Hugh Dickins wrote:
 On Fri, 25 Jul 2008, Paul Menage wrote:
 So I think we'd be complicating some of the vm paths in mainline with
 a feature that isn't likely to get a lot of real use.

 What do you (and others on the containers list) think? Should we ask
 Andrew/Linus to hold off on this for now? My preference would be to do
 that until we have someone who can stand up with a concrete scenario
 where they want to use this in a real environment.
 
 I see Andrew has already acted, so it's now moot.  But I'd like to
 say that I do agree with you and the conclusion to hold off for now.
 
 I was a bit alarmed earlier to see those patches sailing on through;
 but realized that I'd done very little to substantiate my hatred of
 the whole thing, and decided that I didn't feel strongly enough to
 stand in the way now.  But I am glad you've stepped in, thank you.
 
 (Different topic, but one day I ought to get around to saying again
 how absurd I think a swap controller is; whereas a mem+swap controller
 makes plenty of sense.  I think Rik and others said the same.)
 

We will have a memory+swap controller working together.

 By the way, here's a BUG I got from CONFIG_CGROUP_MEMRLIMIT_CTLR=y
 but no use of it, when doing swapoff a week ago.  Not investigated
 at all, I'm afraid, but at a guess it might come from memrlimit work
 placing too much faith in the mm_users count - swapoff is only one
 of several places which have to inc/dec mm_users for some reason.
 

I'll try and reproduce the problem right away. I've been running some kernbench
on top of memrlimit (but not with a lot of stress or trying to swapoff the swap
device).

 BUG: unable to handle kernel paging request at 6b6b6b8b
 IP: [7817078f] memrlimit_cgroup_uncharge_as+0x18/0x29
 *pde =  
 Oops:  [#1] PREEMPT SMP 
 last sysfs file: /sys/devices/system/cpu/cpu1/cache/index2/shared_cpu_map
 Modules linked in: acpi_cpufreq snd_pcm_oss snd_mixer_oss snd_seq 
 snd_seq_device thermal ac battery button
 
 Pid: 22500, comm: swapoff Not tainted (2.6.26-rc8-mm1 #7)
 EIP: 0060:[7817078f] EFLAGS: 00010206 CPU: 0
 EIP is at memrlimit_cgroup_uncharge_as+0x18/0x29
 EAX: 6b6b6b6b EBX: 7963215c ECX: 7c032000 EDX: 0025e000
 ESI: 96902518 EDI: 9fbb1aa0 EBP: 7c033e9c ESP: 7c033e9c
  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
 Process swapoff (pid: 22500, ti=7c032000 task=907e2b70 task.ti=7c032000)
 Stack: 7c033edc 78161323 9fbb1aa0 025e ff77 7c033ecc 96902518
     7c033ec8  0089 7963215c 9fbb1aa0 9fbb1b28 a272f040
    7c033ef4 781226b1 9fbb1aa0 9fbb1aa0 790fa884 a272f0c8 7c033f80 78165ce3
 Call Trace:
  [78161323] ? exit_mmap+0xaf/0x133
  [781226b1] ? mmput+0x4c/0xba
  [78165ce3] ? try_to_unuse+0x20b/0x3f5
  [78371534] ? _spin_unlock+0x22/0x3c
  [7816636a] ? sys_swapoff+0x17b/0x37c
  [78102d95] ? sysenter_past_esp+0x6a/0xa5
  ===
 Code: 24 0c 00 00 8b 40 20 52 83 c0 0c 50 e8 ad a6 fd ff c9 c3 55 89 e5 8b 45
 08 8b 55 0c 8b 80 30 02 00 00 c1 e2 0c 8b 80 24 0c 00 00 8b 40 20 52 83 c0 0c
 50 e8 e6 a6 fd ff 58 5a c9 c3 55 89 e5 8b
 EIP: [7817078f] memrlimit_cgroup_uncharge_as+0x18/0x29 SS:ESP 0068:7c033e9c
 
 Hugh

I'll try and recreate the problem and fix it. If memrlimit_cgroup_uncharge_as()
created the problem, it's most likely related to mm->owner not being correct and
we are dereferencing the wrong memory.

Thanks for the bug report, I'll look further.

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


[Devel] Re: memrlimit controller merge to mainline

2008-07-25 Thread Paul Menage
On Fri, Jul 25, 2008 at 5:06 AM, Hugh Dickins [EMAIL PROTECTED] wrote:

 (Different topic, but one day I ought to get around to saying again
 how absurd I think a swap controller is; whereas a mem+swap controller
 makes plenty of sense.  I think Rik and others said the same.)

Agreed that a swap controller without a memory controller doesn't make
much sense, but a memory controller without a swap controller can make
sense on machines that don't intend to use swap.

So if they were separate controllers, we'd use the proposed cgroup
dependency features to make the swap controller depend on the memory
controller - in which case you'd only be able to mount the swap
controller on a hierarchy that also had the memory controller, and the
swap controller would be able to make use of the page ownership
information.

It's more of a modularity issue than a functionality issue, I think -
the swap controller and memory controller are tracking fundamentally
different things (space on disk versus pages in memory), and the only
dependency between them is the memory controller tracking the
ownership of a page and providing it to the swap controller.
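
The mount-time rule described above can be sketched as a simple dependency
check. This is only an illustration of the idea: the table DEPENDS_ON and the
function can_mount are hypothetical names, and the real dependency feature
was only a proposal at this point.

```python
# Hypothetical dependency table: the swap controller needs the memory
# controller in the same hierarchy so that it can use page ownership.
DEPENDS_ON = {"swap": {"memory"}}


def can_mount(controllers) -> bool:
    """True if every requested controller has its dependencies requested too."""
    requested = set(controllers)
    return all(DEPENDS_ON.get(name, set()) <= requested for name in requested)


memory_only = can_mount(["memory"])
memory_and_swap = can_mount(["memory", "swap"])
swap_only = can_mount(["swap"])
```

Under this rule a hierarchy with only the memory controller is fine, while
one with only the swap controller is refused.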


 By the way, here's a BUG I got from CONFIG_CGROUP_MEMRLIMIT_CTLR=y
 but no use of it, when doing swapoff a week ago.  Not investigated
 at all, I'm afraid, but at a guess it might come from memrlimit work
 placing too much faith in the mm_users count - swapoff is only one
 of several places which have to inc/dec mm_users for some reason.

 BUG: unable to handle kernel paging request at 6b6b6b8b

Possibly the mm->owner tracking breaks in that case, if the last user
exits while swapoff is occurring without relinquishing ownership?

That looks as though mm->owner points to a task that had been poisoned
after being freed. That could be awkward to fix :-(
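
The register dump in the oops fits this reading: EAX holds 6b6b6b6b and the
faulting address 6b6b6b8b lies 0x20 bytes beyond it. 0x6b is the byte
(POISON_FREE) that the slab allocator writes over freed objects when slab
poisoning is enabled. A minimal sketch of that check follows; the helper name
poison_offset and the one-page cutoff are assumptions made for illustration.

```python
POISON_FREE = 0x6B  # byte written over freed slab objects when poisoning is on


def poison_offset(fault_addr: int, width_bytes: int = 4, max_offset: int = 4096):
    """Return the offset of fault_addr from an all-poison pointer, or None.

    A small non-negative offset suggests a field access through a pointer
    that was read out of freed, poisoned memory.
    """
    poisoned_ptr = int.from_bytes(bytes([POISON_FREE]) * width_bytes, "little")
    offset = fault_addr - poisoned_ptr
    return offset if 0 <= offset < max_offset else None


offset = poison_offset(0x6B6B6B8B)          # fault address from the oops
not_poison = poison_offset(0x7817078F)      # the EIP value, for contrast
```

Here offset comes out as 0x20, matching the `8b 40 20` (load from EAX + 0x20)
bytes in the Code: line of the oops.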

Paul


[Devel] Re: memrlimit controller merge to mainline

2008-07-25 Thread Paul Menage
On Fri, Jul 25, 2008 at 8:30 AM, Balbir Singh [EMAIL PROTECTED] wrote:

 There are applications that can/need to handle overcommit, just that we are
 not aware of them fully. Immediately after our meeting, I was pointed to
 http://www.linuxfoundation.org/en/Carrier_Grade_Linux/Requirements_Alpha1#AVL.4.1_VM_Strict_Over-Commit

Thanks, that'll be interesting to take a look at.


 So I think we'd be complicating some of the vm paths in mainline with
 a feature that isn't likely to get a lot of real use.


 I did disagree in the meeting

Yes, but (my impression of) the overall feeling in the meeting was
that it wasn't yet the right time to push it to mainline.

 and there is also the use case of the feature
 forming the infrastructure for other rlimit controllers.

Agreed, but that's something for the future.

Paul


[Devel] Re: memrlimit controller merge to mainline

2008-07-25 Thread Balbir Singh
Paul Menage wrote:
 On Fri, Jul 25, 2008 at 8:30 AM, Balbir Singh [EMAIL PROTECTED] wrote:
 There are applications that can/need to handle overcommit, just that we are
 not aware of them fully. Immediately after our meeting, I was pointed to
 http://www.linuxfoundation.org/en/Carrier_Grade_Linux/Requirements_Alpha1#AVL.4.1_VM_Strict_Over-Commit
 
 Thanks, that'll be interesting to take a look at.
 
 So I think we'd be complicating some of the vm paths in mainline with
 a feature that isn't likely to get a lot of real use.

 I did disagree in the meeting
 
 Yes, but (my impression of) the overall feeling in the meeting was
 that it wasn't yet the right time to push it to mainline.
 

Yes! I need to test it more and I'll focus more on that front.

 and there is also the use case of the feature
 forming the infrastructure for other rlimit controllers.
 
 Agreed, but that's something for the future.

I'll work on the mlock controller and post that as well.

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


[Devel] Re: memrlimit controller merge to mainline

2008-07-25 Thread Hugh Dickins
On Fri, 25 Jul 2008, Paul Menage wrote:
 On Fri, Jul 25, 2008 at 5:06 AM, Hugh Dickins [EMAIL PROTECTED] wrote:
 
  (Different topic, but one day I ought to get around to saying again
  how absurd I think a swap controller is; whereas a mem+swap controller
  makes plenty of sense.  I think Rik and others said the same.)
 
 Agreed that a swap controller without a memory controller doesn't make
 much sense, but a memory controller without a swap controller can make
 sense on machines that don't intend to use swap.

I agree that a memory controller without a swap controller can
make sense: I hope so, anyway, since that's what's in mainline.
Even if swap is used, memory is a more precious resource than swap,
and you were right to go about controlling memory first.

 
 So if they were separate controllers, we'd use the proposed cgroup
 dependency features to make the swap controller depend on the memory
 controller - in which case you'd only be able to mount the swap
 controller on a hierarchy that also had the memory controller, and the
 swap controller would be able to make use of the page ownership
 information.
 
 It's more of a modularity issue than a functionality issue, I think -
 the swap controller and memory controller are tracking fundamentally
 different things (space on disk versus pages in memory), and the only
 dependency between them is the memory controller tracking the
 ownership of a page and providing it to the swap controller.

It sounds as if you're interpreting my mem+swap controller as a
mem controller and a swap controller and the swap controller makes
use of some of the mem controller infrastructure.

No, I'm trying to say something stronger than that.  I'm saying,
as I've said before, that I cannot imagine why anyone would want
to control swap itself - what they want to control is the total
of mem+swap.  Swap is a second-class citizen, nobody wants swap
if they can have mem, so why control it separately?

IIRC Rik expressed the same by pointing out that a cgroup at its
swap limit would then be forced to grow in mem (until it hits its
mem limit): so controlling the less precious resource would increase
pressure on the more precious resource.  (Actually, that probably
bears little relation to what he said - sorry, Rik!)  I don't recall
what answer he got, perhaps I'd be persuaded if I heard it again.
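
The argument as recalled here can be put into numbers with a toy model. This
is a sketch under loud assumptions, not a description of real reclaim: the
function place_pages and the parameter mem_target (the RAM the group would
settle at if swap were unlimited) are hypothetical.

```python
def place_pages(demand: int, mem_limit: int, swap_limit: int, mem_target: int):
    """Split `demand` anon pages between RAM and swap under separate limits.

    Returns (pages_in_ram, pages_in_swap, oom). Pages beyond mem_target are
    pushed to swap until the swap limit is hit; the rest must stay in RAM.
    """
    in_swap = min(max(demand - mem_target, 0), swap_limit)
    in_ram = demand - in_swap
    oom = in_ram > mem_limit
    return min(in_ram, mem_limit), in_swap, oom


# Same workload, same mem limit, only the swap limit differs.
loose_swap = place_pages(demand=70, mem_limit=80, swap_limit=50, mem_target=50)
tight_swap = place_pages(demand=70, mem_limit=80, swap_limit=10, mem_target=50)
```

With the tight swap limit the group keeps 60 pages in RAM instead of 50:
limiting the less precious resource raises use of the more precious one.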

Hugh


[Devel] Re: memrlimit controller merge to mainline

2008-07-25 Thread Hugh Dickins
On Fri, 25 Jul 2008, Balbir Singh wrote:
 
 I'll try and recreate the problem and fix it. If memrlimit_cgroup_uncharge_as()
 created the problem, it's most likely related to mm->owner not being correct
 and we are dereferencing the wrong memory.
 
 Thanks for the bug report, I'll look further.

Good luck!  I have only seen it once, on a dual-core laptop; though
I don't remember to try swapoff while busy as often as I should (be
sure to alternate between a couple or more of swapareas, so you can
swap a new one on just before swapping an old one off, to be pretty
sure of success).

May be easier to find in the source: my suspicion is that a bad
mm_users assumption will come into it.  But I realize now that it
could be entirely unrelated to memrlimit, just that uncharge_as
was the one to get hit by bad refcounting elsewhere.

Oh, that reminds me, I never reported back on my res_counter warnings
at shutdown: never saw them again, once I added in the set of changes
you came up with shortly after that - thanks.

Hugh


[Devel] Re: memrlimit controller merge to mainline

2008-07-25 Thread Balbir Singh
Hugh Dickins wrote:
 On Fri, 25 Jul 2008, Balbir Singh wrote:
 I'll try and recreate the problem and fix it. If memrlimit_cgroup_uncharge_as()
 created the problem, it's most likely related to mm->owner not being correct
 and we are dereferencing the wrong memory.

 Thanks for the bug report, I'll look further.
 
 Good luck!  I have only seen it once, on a dual-core laptop; though
 I don't remember to try swapoff while busy as often as I should (be
 sure to alternate between a couple or more of swapareas, so you can
 swap a new one on just before swapping an old one off, to be pretty
 sure of success).
 

Thanks, that's very useful information. I would have never tried juggling swap
devices otherwise.

 May be easier to find in the source: my suspicion is that a bad
 mm_users assumption will come into it.  But I realize now that it
 could be entirely unrelated to memrlimit, just that uncharge_as
 was the one to get hit by bad refcounting elsewhere.
 
 Oh, that reminds me, I never reported back on my res_counter warnings
 at shutdown: never saw them again, once I added in the set of changes
 you came up with shortly after that - thanks.
 

I am glad those messages are gone, thanks for the bug report. I find bug fixing
more exciting than kernel development on most occasions.


-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL


[Devel] Re: memrlimit controller merge to mainline

2008-07-25 Thread Balbir Singh
Hugh Dickins wrote:
 On Fri, 25 Jul 2008, Paul Menage wrote:
 On Fri, Jul 25, 2008 at 5:06 AM, Hugh Dickins [EMAIL PROTECTED] wrote:
 (Different topic, but one day I ought to get around to saying again
 how absurd I think a swap controller is; whereas a mem+swap controller
 makes plenty of sense.  I think Rik and others said the same.)
 Agreed that a swap controller without a memory controller doesn't make
 much sense, but a memory controller without a swap controller can make
 sense on machines that don't intend to use swap.
 
 I agree that a memory controller without a swap controller can
 make sense: I hope so, anyway, since that's what's in mainline.
 Even if swap is used, memory is a more precious resource than swap,
 and you were right to go about controlling memory first.
 

Yes, I agree.

 So if they were separate controllers, we'd use the proposed cgroup
 dependency features to make the swap controller depend on the memory
 controller - in which case you'd only be able to mount the swap
 controller on a hierarchy that also had the memory controller, and the
 swap controller would be able to make use of the page ownership
 information.

 It's more of a modularity issue than a functionality issue, I think -
 the swap controller and memory controller are tracking fundamentally
 different things (space on disk versus pages in memory), and the only
 dependency between them is the memory controller tracking the
 ownership of a page and providing it to the swap controller.
 
 It sounds as if you're interpreting my mem+swap controller as a
 mem controller and a swap controller and the swap controller makes
 use of some of the mem controller infrastructure.
 
 No, I'm trying to say something stronger than that.  I'm saying,
 as I've said before, that I cannot imagine why anyone would want
 to control swap itself - what they want to control is the total
 of mem+swap.  Swap is a second-class citizen, nobody wants swap
 if they can have mem, so why control it separately?
 
 IIRC Rik expressed the same by pointing out that a cgroup at its
 swap limit would then be forced to grow in mem (until it hits its
 mem limit): so controlling the less precious resource would increase
 pressure on the more precious resource.  (Actually, that probably
 bears little relation to what he said - sorry, Rik!)  I don't recall
 what answer he got, perhaps I'd be persuaded if I heard it again.
 

I see what you're saying. When you look at Linux right now, we control swap
independent of memory, so I am not totally opposed to setting swap, instead of
swap+mem. I might not want to swap from a particular cgroup, in which case, I
set swap to 0 and risk OOMing, which might be an acceptable trade-off depending
on my setup. I could easily change this policy on demand and add swap if OOMing
was no longer OK.

-- 
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL