Re: the new VM
Rik van Riel <[EMAIL PROTECTED]> writes:

> Hmmm, could you help me with drawing up a selection algorithm
> on how to choose which SHM segment to destroy when we run OOM?
>
> The criteria would be about the same as with normal programs:
>
> 1) minimise the amount of work lost
> 2) try to protect 'innocent' stuff
> 3) try to kill only one thing
> 4) don't surprise the user, but choose something that
>    the user will expect to be killed/destroyed

First we only kill segments with no attachees. There are
circumstances under normal load where you have these. (SAP R/3 will do
this all the time on Linux 2.4)

So perhaps we could signal shm that we killed a process and let it try
to find a segment where this process was the last attachee. This would
be a good candidate.

If this does not help either, we could do two different things:

1) kill the biggest nonattached segment
2) kill the segment which was longest detached

Greetings
		Christoph
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
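The two fallback policies Christoph proposes can be sketched in userspace C. The structure and field names below are invented for illustration; the kernel's real shm bookkeeping (struct shmid_kernel) looks different:

```c
#include <assert.h>

/* Hypothetical view of a SysV shm segment: only the fields the
 * selection policy needs, not the real kernel structure. */
struct shm_seg {
	int	nattch;		/* current number of attachees */
	long	size;		/* segment size in bytes */
	long	detach_time;	/* when the last attachee went away */
};

/* Policy 1 from the mail: among segments with no attachees,
 * destroy the biggest one.  Returns an index, or -1 if every
 * segment is still attached. */
int pick_biggest_detached(const struct shm_seg *seg, int n)
{
	int i, victim = -1;

	for (i = 0; i < n; i++) {
		if (seg[i].nattch != 0)
			continue;	/* never touch attached segments */
		if (victim < 0 || seg[i].size > seg[victim].size)
			victim = i;
	}
	return victim;
}

/* Policy 2: destroy the segment that has been detached longest. */
int pick_longest_detached(const struct shm_seg *seg, int n, long now)
{
	int i, victim = -1;

	for (i = 0; i < n; i++) {
		if (seg[i].nattch != 0)
			continue;
		if (victim < 0 ||
		    now - seg[i].detach_time > now - seg[victim].detach_time)
			victim = i;
	}
	return victim;
}
```

Either way, segments with attachees are never candidates, which matches the first rule in the mail: only detached segments are fair game.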
Re: the new VM
[replying to a really old email now that I've started work
on integrating the OOM handler]

On 25 Sep 2000, Christoph Rohland wrote:
> Rik van Riel <[EMAIL PROTECTED]> writes:
> > > Because as you said the machine can lockup when you run out of memory.
> >
> > The fix for this is to kill a user process when you're OOM
> > (you need to do this anyway).
> >
> > The last few allocations of the "condemned" process can come
> > from the reserved pages and the process we killed will exit just
> > fine.
>
> It's slightly offtopic, but you should think about detached shm
> segments in your OOM killer. As many of the high end
> applications like databases and e.g. SAP have most of the memory
> in shm segments, you easily end up killing a lot of processes
> without freeing a lot of memory. I see this often in my shm
> tests.

Hmmm, could you help me with drawing up a selection algorithm
on how to choose which SHM segment to destroy when we run OOM?

The criteria would be about the same as with normal programs:

1) minimise the amount of work lost
2) try to protect 'innocent' stuff
3) try to kill only one thing
4) don't surprise the user, but choose something that
   the user will expect to be killed/destroyed

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
  -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/
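Rik's four criteria can be turned into a toy scoring function. This is only an illustrative sketch, not the selection code that went into the kernel; the structure fields and the weighting are invented:

```c
#include <assert.h>

/* Hypothetical per-process snapshot; field names are made up for
 * this sketch and do not come from struct task_struct. */
struct proc_info {
	long	rss;		/* resident pages: criteria 1 and 3 */
	long	cpu_seconds;	/* accumulated work: criterion 1 */
	int	is_root;	/* criterion 2: protect 'innocent' stuff */
};

/* Higher score == better OOM victim.  Big, short-lived processes
 * score high (little work lost, lots of memory freed by one kill);
 * root-owned processes get a discount (least surprise). */
long oom_score(const struct proc_info *p)
{
	long score = p->rss;

	/* long-running processes carry more lost work */
	score /= 1 + p->cpu_seconds;
	if (p->is_root)
		score /= 4;
	return score;
}

/* Criterion 3: pick exactly one victim, the highest scorer. */
int pick_victim(const struct proc_info *p, int n)
{
	int i, victim = 0;

	for (i = 1; i < n; i++)
		if (oom_score(&p[i]) > oom_score(&p[victim]))
			victim = i;
	return victim;
}
```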
Re: the new VM
On Wed, Sep 27, 2000 at 09:42:45AM +0200, Ingo Molnar wrote:
> such screwups by checking for NULL and trying to handle it. I suggest to
> rather fix those screwups.

How do you know which is the minimal amount of RAM that allows you not
to be in the screwed-up state?

We certainly need a kind of counter for the special dynamic structures,
but I'm not sure if that should account for the static stuff as well.

Andrea
Re: the new VM
On Wed, Sep 27, 2000 at 09:42:45AM +0200, Ingo Molnar wrote:
> On Tue, 26 Sep 2000, Pavel Machek wrote:
> of the VM allocation issues. Returning NULL in kmalloc() is just a way to
> say: 'oops, we screwed up somewhere'. And i'd suggest to not work around

That is not at all how it is currently used in the kernel.

> such screwups by checking for NULL and trying to handle it. I suggest to
> rather fix those screwups.

kmalloc returns NULL when there is not enough memory to satisfy the
request. What's wrong with that?

--
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com
Re: the new VM
On Tue, 26 Sep 2000, Pavel Machek wrote:
> Okay, I'm a user on a small machine and I'm doing a stupid thing: I've
> got 6MB ram, and I keep inserting modules. I insert module_1mb.o. Then I
> insert module_1mb.o. Repeat. How does it end? I think that
> kmalloc(GFP_KERNEL) *has* to return NULL at some point.

if a stupid root user keeps inserting bogus modules :-) then that's a
problem, no matter what. I can DoS your system if given the right to
insert arbitrary-size modules, even if kmalloc returns NULL. For such
things explicit high-level protection is needed - completely independently
of the VM allocation issues. Returning NULL in kmalloc() is just a way to
say: 'oops, we screwed up somewhere'. And i'd suggest to not work around
such screwups by checking for NULL and trying to handle it. I suggest to
rather fix those screwups.

the __GFP_SOFT suggestion handles these things nicely.

	Ingo
Re: the new VM
On Tue, Sep 26, 2000 at 09:10:16PM +0200, Pavel Machek wrote:
> Hi!
>
> > > i talked about GFP_KERNEL, not GFP_USER. Even in the case of GFP_USER i
> >
> > My bad, you're right, I was talking about GFP_USER indeed.
> >
> > But even for GFP_KERNEL allocations like the init of a module or any
> > other thing that is static sized during production, just checking the
> > retval looks ok.
>
> Okay, I'm a user on a small machine and I'm doing a stupid thing: I've
> got 6MB ram, and I keep inserting modules. I insert module_1mb.o. Then I
> insert module_1mb.o. Repeat. How does it end? I think that
> kmalloc(GFP_KERNEL) *has* to return NULL at some point.

I agree, and that's what I have said from the first place. GFP_KERNEL
must return NULL when the system is truly out of memory, or the kernel
will deadlock at that point.

In the sentence you quoted I meant that both GFP_USER and most
GFP_KERNEL users could simply keep checking the retval, even in the long
term, to be correct (checking for NULL, which in turn means GFP_KERNEL
_will_ return NULL eventually).

There's no need for special resource accounting for the many static-sized
data structures in the kernel (this accounting is necessary only for some
of the dynamic things that grow and shrink during production and that
can't be reclaimed synchronously when memory goes low by blocking in the
allocator, like pagetables, skbs on gbit ethernet and other things). Not
sure if at the end we'll need to account also the static parts to get the
dynamic part right.

Andrea
Re: the new VM
Hi!

> > i talked about GFP_KERNEL, not GFP_USER. Even in the case of GFP_USER i
>
> My bad, you're right, I was talking about GFP_USER indeed.
>
> But even for GFP_KERNEL allocations like the init of a module or any
> other thing that is static sized during production, just checking the
> retval looks ok.

Okay, I'm a user on a small machine and I'm doing a stupid thing: I've
got 6MB ram, and I keep inserting modules. I insert module_1mb.o. Then I
insert module_1mb.o. Repeat. How does it end? I think that
kmalloc(GFP_KERNEL) *has* to return NULL at some point.

Killing apps is not a solution: if my insmodder is smaller than the
module I'm trying to insert, and it happens to be the only process, you
just will not be able to kmalloc(sizeof(module), GFP_KERNEL). Will you
panic at the end?

								Pavel
--
I'm [EMAIL PROTECTED] "In my country we have almost anarchy and I don't care."
Panos Katsaloulis describing me w.r.t. patents at [EMAIL PROTECTED]
Re: the new VM
Hi Rik,

Rik van Riel <[EMAIL PROTECTED]> writes:
> > Because as you said the machine can lockup when you run out of memory.
>
> The fix for this is to kill a user process when you're OOM
> (you need to do this anyway).
>
> The last few allocations of the "condemned" process can come
> from the reserved pages and the process we killed will exit just
> fine.

It's slightly offtopic, but you should think about detached shm
segments in your OOM killer. As many of the high end
applications like databases and e.g. SAP have most of the memory
in shm segments, you easily end up killing a lot of processes
without freeing a lot of memory. I see this often in my shm
tests.

Greetings
		Christoph
Re: the new VM
On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> On Mon, Sep 25, 2000 at 04:27:24PM +0200, Ingo Molnar wrote:
> > i think an application should not fail due to other applications
> > allocating too much RAM. OOM behavior should be a central thing and based
>
> At least Linus's point is that doing perfect accounting (at
> least on the userspace allocation side) may cause you to waste
> resources, failing even if you could still run, and I tend to
> agree with him. We're lazy on that side and that's a global win
> in most cases.

OK, so do you guys want my OOM-killer selection code in 2.4? ;)

(that will fix the OOM case in the rare situations where it occurs
and do the expected thing most of the time)

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
  -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/
Re: the new VM
On Mon, Sep 25, 2000 at 04:40:44PM +0100, Stephen C. Tweedie wrote:
> Allowing GFP_ATOMIC to eat PF_MEMALLOC's last-chance pages is the
> wrong thing to do if we want to guarantee swapper progress under
> extreme load.

You're definitely right. We at least need the guarantee of the memory to
allocate the bhs on top of the swap cache while we attempt to swap out
one page (that path can't fail at the moment).

Andrea
Re: the new VM
On Mon, Sep 25, 2000 at 05:16:06PM +0200, Ingo Molnar wrote:
> situation is just 1% RAM away from the 'root cannot log in' situation.

The "root cannot log in" case is a little different. Just consider that
in the "root cannot log in" case you only need to press SYSRQ+E (or, at
worst, +I). If all tasks in the system are hanging in the GFP loop,
SYSRQ+I won't solve the deadlock. OK, you can add a signal check in the
memory balancing code, but that looks like an ugly hack which shows the
difference between the two cases (the one Alan pointed out is a real
deadlock; the current one is a kind of livelock that can go away at any
time, while the deadlock can reach the point where it can't be recovered
without a hack from an irq somewhere).

Andrea
Re: the new VM
On Mon, Sep 25, 2000 at 05:26:59PM +0200, Ingo Molnar wrote:
> On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
>
> > > i think the GFP_USER case should do the oom logic within __alloc_pages(),
> >
> > What's the difference of implementing the logic outside alloc_pages?
> > Putting the logic inside looks not clean design to me.
>
> it gives consistency and simplicity. The allocators themselves do not have
> to care about oom.

There are many cases where it is simple to do:

	if (alloc(r1) == FAIL)
		goto freeall;
	if (alloc(r2) == FAIL)
		goto freeall;
	if (alloc(r3) == FAIL)
		goto freeall;

And the alloc functions don't know how to "freeall". Perhaps it would be
good to do a vector allocation in these cases:

	alloc_vec[0].size = n;
	...
	alloc_vec[n].size = 0;
	if (kmalloc_all(alloc_vec) == FAIL)
		return -ENOMEM;
	/* else alloc_vec[i].ptr is the pointer */

--
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com
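kmalloc_all() is hypothetical; no such function exists in the kernel. A userspace model with malloc() shows the all-or-nothing semantics Victor is describing, where the vector allocator itself knows how to roll back:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical allocation vector, terminated by size == 0 as in the
 * mail.  Userspace model: malloc() stands in for kmalloc(..., GFP_KERNEL). */
struct alloc_req {
	size_t	size;
	void	*ptr;
};

/* All-or-nothing: either every request gets memory, or everything
 * already allocated is rolled back and -1 is returned -- so the
 * caller never needs its own "freeall". */
int alloc_all(struct alloc_req *vec)
{
	int i, j;

	for (i = 0; vec[i].size != 0; i++) {
		vec[i].ptr = malloc(vec[i].size);
		if (vec[i].ptr == NULL) {
			for (j = 0; j < i; j++) {	/* roll back */
				free(vec[j].ptr);
				vec[j].ptr = NULL;
			}
			return -1;
		}
	}
	return 0;
}

void free_all(struct alloc_req *vec)
{
	int i;

	for (i = 0; vec[i].size != 0; i++) {
		free(vec[i].ptr);
		vec[i].ptr = NULL;
	}
}
```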
Re: the new VM
On Mon, 25 Sep 2000, Alan Cox wrote:
> Unless I'm missing something here, think about this case:
>
> 2 active processes, no swap
>
>	#1		#2
>	kmalloc 32K	kmalloc 16K
>	OK		OK
>	kmalloc 16K	kmalloc 32K
>	block		block
>
> so GFP_KERNEL has to be able to fail - it can wait for I/O in some
> cases with care, but when we have no pages left something has to give

you are right, i agree that synchronous OOM for higher-order allocations
must be preserved (just like ATOMIC allocations). But the overwhelming
majority of allocations is done at page granularity.

with multi-page allocations and the need for physically contiguous
buffers, the problem cannot be solved.

	Ingo
Re: the new VM
On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> > i think the GFP_USER case should do the oom logic within __alloc_pages(),
>
> What's the difference of implementing the logic outside alloc_pages?
> Putting the logic inside does not look like clean design to me.

it gives consistency and simplicity. The allocators themselves do not
have to care about oom.

	Ingo
Re: Swap on RAID; was: Re: the new VM
On Mon, 25 Sep 2000, [EMAIL PROTECTED] wrote:
> > this is fixed in 2.4. The 2.2 RAID code is frozen, and has known
> > limitations (ie. due to the above RAID1 cannot be used as a swap-device).
>
> as commonly patched in by RedHat? Should I instead use a swap file
> for a machine that should be fault-tolerant against a drive failure?

the answer is yes. RAID5 will not deadlock due to VM problems, but RAID5
might have other problems if the device is being reconstructed *and*
used for swap.

	Ingo
Swap on RAID; was: Re: the new VM
Ingo Molnar wrote:
> this is fixed in 2.4. The 2.2 RAID code is frozen, and has known
> limitations (ie. due to the above RAID1 cannot be used as a swap-device).

Eh, just to be clear about this: does this apply to the RAID 0.90 code
as commonly patched in by RedHat? Should I instead use a swap file
for a machine that should be fault-tolerant against a drive failure?

regards,
	David
--
David L. Parsley
Network Administrator
Roanoke College
Re: the new VM
On Mon, Sep 25, 2000 at 04:43:44PM +0200, Ingo Molnar wrote:
> i talked about GFP_KERNEL, not GFP_USER. Even in the case of GFP_USER i

My bad, you're right, I was talking about GFP_USER indeed.

But even for GFP_KERNEL allocations like the init of a module, or any
other thing that is static sized during production, just checking the
retval looks ok.

> believe the right place to oom is via a signal, not in the gfp() case.

A signal can be trapped and ignored by a malicious task. We had that
security problem until 2.2.14 IIRC.

> (because oom situation in the gfp() case is a completely random and
> statistical event, which might have no connection at all to the behavior
> of that given process.)

I agree we should have more information about the behaviour of the
system, and I think a per-task page fault rate should work in practice.

But my question isn't what you do when you're OOM; it is _how_ do you
notice that you're OOM? In the GFP_USER case simply checking when GFP
fails looks right to me.

Andrea
Re: the new VM
> > Because as you said the machine can lockup when you run out of memory.
>
> well, i think all kernel-space allocations have to be limited carefully,
> denying succeeding allocations is not a solution against over-allocation,
> especially in a multi-user environment.

GFP_KERNEL has to be able to fail for 2.4. Otherwise you can get
everything jammed in kernel space waiting on GFP_KERNEL, and if the
swapper cannot make space, you die.

The alternative approach, where it cannot fail, has to be at higher
levels, so you can release other resources that might need freeing for
deadlock avoidance before you retry.

Alan
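Alan's "release other resources before you retry" has to live above the allocator, which cannot know what is safe to free. A userspace sketch of the pattern, with invented names (shrink() stands in for whatever cache the subsystem can shed; malloc() stands in for kmalloc(size, GFP_KERNEL)):

```c
#include <assert.h>
#include <stdlib.h>

/* Allocate with a higher-level fallback: if the allocator fails, ask
 * the subsystem to release something it can live without, then retry.
 * After a bounded number of tries we give up with NULL rather than
 * loop forever -- exactly the "something has to give" case. */
void *alloc_with_fallback(size_t size, int (*shrink)(void *), void *cookie,
			  int max_retries)
{
	void *p;
	int tries;

	for (tries = 0; tries <= max_retries; tries++) {
		p = malloc(size);
		if (p != NULL)
			return p;
		/* nothing left to shed: fail instead of deadlocking */
		if (shrink == NULL || !shrink(cookie))
			break;
	}
	return NULL;
}

/* Demo "cache": a counter of releasable objects, so the fallback has
 * something to call. */
static int demo_cache = 2;

static int demo_shrink(void *cookie)
{
	(void)cookie;
	if (demo_cache == 0)
		return 0;	/* cache is empty, cannot help */
	demo_cache--;
	return 1;
}
```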
Re: the new VM
On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> On Mon, Sep 25, 2000 at 03:02:58PM +0200, Ingo Molnar wrote:
> > On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> >
> > > Sorry, I totally disagree. If GFP_KERNEL allocations are guaranteed
> > > to succeed, that is a showstopper bug. [...]
> >
> > why?
>
> Because as you said the machine can lockup when you run out of memory.

The fix for this is to kill a user process when you're OOM
(you need to do this anyway).

The last few allocations of the "condemned" process can come
from the reserved pages and the process we killed will exit just
fine.

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
  -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/
Re: the new VM
On Mon, Sep 25, 2000 at 11:26:48AM -0300, Marcelo Tosatti wrote:
> This thread keeps freeing pages from the inactive clean list when needed
> (when zone->free_pages < zone->pages_low), making them available for
> atomic allocations.

This is flawed. It's the irq itself that has to shrink the memory. It
certainly can't reschedule kreclaimd and wait for it to do the work.

Increasing the free_pages_min limit is the _only_ alternative to having
irqs that are able to shrink clean cache (and hopefully that "feature"
will be resurrected soon, since it's the only way to go right now).

Andrea
Re: the new VM
On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> At least Linus's point is that doing perfect accounting (at least on
> the userspace allocation side) may cause you to waste resources,
> failing even if you could still run and I tend to agree with him.
> We're lazy on that side and that's global win in most cases.

well, as i said, i agree that being lazy on the user-space side (which
is by far the biggest RAM allocator in a typical system) makes sense -
and we can handle it cleanly. Being lazy on the kernel-space side is the
default behavior for us kernel hackers :-) but i dont think it's the
right thing in the long term.

> We are finegrained with page granularity, not with the mmap
> granularity. The point is that not all the mmapped regions are going
> to be pagedin. Think a program that only after 1 hour did all the
> calculations that allocated all the memory it requested with malloc.
> Before the hour passes the unused memory can still be used for other
> things and that's what the user also expects when he runs `free`.

i think you've completely missed the fact that i made exactly this point
in my previous mail.

	'user-space laziness': correct
	'kernel-space laziness': dangerous

i talked about GFP_KERNEL, not GFP_USER. Even in the case of GFP_USER i
believe the right place to oom is via a signal, not in the gfp() case.
(because the oom situation in the gfp() case is a completely random and
statistical event, which might have no connection at all to the behavior
of that given process.)

	Ingo
Re: the new VM
On Mon, Sep 25, 2000 at 04:27:24PM +0200, Ingo Molnar wrote:
> i think an application should not fail due to other applications
> allocating too much RAM. OOM behavior should be a central thing and based

At least Linus's point is that doing perfect accounting (at least on the
userspace allocation side) may cause you to waste resources, failing
even if you could still run, and I tend to agree with him. We're lazy on
that side and that's a global win in most cases.

We are fine-grained with page granularity, not with mmap granularity.
The point is that not all the mmapped regions are going to be paged in.
Think of a program that only after 1 hour did all the calculations that
allocated all the memory it requested with malloc. Before the hour
passes, the unused memory can still be used for other things, and that's
what the user also expects when he runs `free`.

Andrea
Re: the new VM
On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> I talked with Alexey about this and it seems the best way is to have a
> per-socket reservation of clean cache as a function of the receive
> window. So we don't need a huge atomic pool but we can have a special
> lru with an irq spinlock that is able to shrink cache from irq as well.

In the current 2.4 VM code, there is a kernel thread called "kreclaimd".

This thread keeps freeing pages from the inactive clean list when needed
(when zone->free_pages < zone->pages_low), making them available for
atomic allocations.

Do you consider pages_low pages to be a "huge atomic pool"?
Re: the new VM
On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> I'm not sure if we should restrict the limiting only to the cases that
> need it. For example do_anonymous_page looks like a place that could
> rely on the GFP retval.

i think an application should not fail due to other applications
allocating too much RAM. OOM behavior should be a central thing and
based on allocation patterns, not pure luck or unluck. I always found it
rude to SIGBUS when some other application is abusing RAM but the oom
detector has not yet killed it off.

	Ingo
Re: the new VM
On Mon, Sep 25, 2000 at 04:04:14PM +0200, Ingo Molnar wrote:
> exactly, and this is why if a higher level lets through a GFP_KERNEL, then
> it *must* succeed. Otherwise either the higher level code is buggy, or the
> VM balance is buggy, but we want to have clear signs of it.

I'm not sure if we should restrict the limiting only to the cases that
need it. For example do_anonymous_page looks like a place that could
rely on the GFP retval.

Andrea
Re: the new VM
On Mon, Sep 25, 2000 at 03:39:51PM +0200, Ingo Molnar wrote:
> Andrea, if you really mean this then you should not be let near the VM
> balancing code :-)

What I mean is that the VM balancing is in the lower layer, which
doesn't know anything about the per-socket gigabit ethernet skb limits;
the limit should live at the higher layer. For most code just checking
for NULL from GFP is fine (for example do_anonymous_page). It's the
caller (not the VM balancing developer) who shouldn't be let near this
code if he allows his code to fill all of physical ram with his stuff,
causing the machine to run OOM.

> > Most dynamic big caches and kernel data can be shrunk dynamically
> > during memory pressure (perhaps except skbs, and I agree that for skbs
> > on gigabit ethernet the thing is a little different).
>
> a big 'except'. You dont need gigabit for that, to the contrary, if the

I talked with Alexey about this and it seems the best way is to have a
per-socket reservation of clean cache as a function of the receive
window. So we don't need a huge atomic pool but we can have a special
lru with an irq spinlock that is able to shrink cache from irq as well.

> about how many D.O.S. attacks there are possible without implicit or
> explicit bean counting.

Again: the bean counting and all the limiting happens at the higher
layer. I shouldn't need to know anything about it when I play with the
lower-layer GFP memory balancing code.

Andrea
Re: the new VM
On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> Again: the bean counting and all the limiting happens at the higher
> layer. I shouldn't know anything about it when I play with the lower
> layer GFP memory balancing code.

exactly, and this is why if a higher level lets through a GFP_KERNEL, then
it *must* succeed. Otherwise either the higher level code is buggy, or the
VM balance is buggy, but we want to have clear signs of it.

	Ingo
Re: the new VM
On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> > yes. every RAID1-bh has a bound lifetime. (bound by worst-case IO
> > latencies)
>
> Very good! Many thanks Ingo.

this was actually coded/fixed by Neil Brown - so the kudos go to him!

	Ingo
Re: the new VM
On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> And if the careful limit avoids the deadlock in the layer above
> alloc_pages, then it will also avoid alloc_pages returning NULL, and
> you won't need an infinite loop in the first place (unless the memory
> balancing is buggy).

yes, i like this property very much, because it unearths VM balancing
bugs, which plagued us for so long and are so hard to detect. But
statistically it's also possible that try_to_free_pages() frees a page and
an alloc_pages() done on another CPU (or in IRQ context) 'steals' the
page. This can happen because the VM right now guarantees no straight path
from deallocator to allocator. (And it's not necessary to guarantee it,
given the varying nature of allocation requests.)

> GFP should return NULL only if the machine is out of memory. The
> kernel can be written in a way that never deadlocks when the machine
> is out of memory just by checking the GFP retval. I don't think any
> in-kernel resource limit is necessary to have things reliable and
> fast. [...]

Andrea, if you really mean this then you should not be let near the VM
balancing code :-)

> Most dynamic big caches and kernel data can be shrunk dynamically
> during memory pressure (perhaps except skbs, and I agree that for skbs
> on gigabit ethernet the thing is a little different).

a big 'except'. You don't need gigabit for that; to the contrary, if the
network is slow it's easier to overallocate within the kernel. Ask Alan
about how many D.O.S. attacks are possible without implicit or explicit
bean counting.

	Ingo
Re: the new VM
On Mon, Sep 25, 2000 at 03:21:01PM +0200, Ingo Molnar wrote:
> yes. every RAID1-bh has a bound lifetime. (bound by worst-case IO
> latencies)

Very good! Many thanks Ingo.

Andrea
Re: the new VM
On Mon, Sep 25, 2000 at 03:12:58PM +0200, Ingo Molnar wrote:
> well, i think all kernel-space allocations have to be limited carefully,

When a machine without a gigabit ethernet runs oom, it's userspace that
allocated the memory via page faults, not the kernel.

And if the careful limit avoids the deadlock in the layer above
alloc_pages, then it will also avoid alloc_pages returning NULL, and you
won't need an infinite loop in the first place (unless the memory
balancing is buggy).

GFP should return NULL only if the machine is out of memory. The kernel
can be written in a way that never deadlocks when the machine is out of
memory just by checking the GFP retval. I don't think any in-kernel
resource limit is necessary to have things reliable and fast.

Most dynamic big caches and kernel data can be shrunk dynamically during
memory pressure (perhaps except skbs, and I agree that for skbs on gigabit
ethernet the thing is a little different).

Andrea
Re: the new VM
On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> Is it safe to sleep on the waitqueue in the kmalloc fail path in
> raid1?

yes. every RAID1-bh has a bound lifetime. (bound by worst-case IO
latencies)

	Ingo
Re: the new VM
On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> > huh, what do you mean?
>
> I mean this:
>
>	while (!( /* FIXME: now we are rather fault tolerant than nice */

this is fixed in 2.4. The 2.2 RAID code is frozen, and has known
limitations (ie. due to the above, RAID1 cannot be used as a swap-device).

	Ingo
Re: the new VM
On Mon, Sep 25, 2000 at 03:04:10PM +0200, Ingo Molnar wrote:
> On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> > Please fix raid1 instead of making things worse.
>
> huh, what do you mean?

I mean this:

	while (!( /* FIXME: now we are rather fault tolerant than nice */
		mirror_bh[i] = kmalloc (sizeof (struct buffer_head),
					GFP_KERNEL)
		) )

I've seen that in the 2.4.0-test9-pre6 raid1 code the above is gone (and
this looks very promising :) - it is at least proof that some care about
the deadlock has been taken) and you instead sleep on a waitqueue now.
While it's not obvious at all that sleeping on the waitqueue is not
deadlock prone (for example getblk sleeps on a waitqueue but it's deadlock
prone too), at least it's not an infinite loop anymore, and that's still
better.

Is it safe to sleep on the waitqueue in the kmalloc fail path in raid1?

Andrea
Re: the new VM
On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> > > Sorry I totally disagree. If GFP_KERNEL allocations are guaranteed
> > > to succeed, that is a showstopper bug. [...]
> >
> > why?
>
> Because as you said the machine can lockup when you run out of memory.

well, i think all kernel-space allocations have to be limited carefully;
denying succeeding allocations is not a solution against over-allocation,
especially in a multi-user environment.

	Ingo
Re: the new VM
On Mon, Sep 25, 2000 at 03:02:58PM +0200, Ingo Molnar wrote:
> On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> > Sorry I totally disagree. If GFP_KERNEL allocations are guaranteed
> > to succeed, that is a showstopper bug. [...]
>
> why?

Because as you said the machine can lockup when you run out of memory.

> FYI, I haven't put it there.

Ok.

Andrea
Re: the new VM
On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> Please fix raid1 instead of making things worse.

huh, what do you mean?

	Ingo
Re: the new VM
On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> Sorry I totally disagree. If GFP_KERNEL allocations are guaranteed
> to succeed, that is a showstopper bug. [...]

why?

> machine power for simulations runs out of memory all the time. If you
> put this kind of obvious deadlock into the main kernel allocator

FYI, I haven't put it there.

	Ingo
Re: the new VM
On Mon, Sep 25, 2000 at 12:42:09PM +0200, Ingo Molnar wrote:
> believe could simplify unrelated kernel code significantly. Eg. no need to
> check for NULL pointers on most allocations, a GFP_KERNEL allocation
> always succeeds, end of story. This behavior also has the 'nice'

Sorry, I totally disagree. If GFP_KERNEL allocations are guaranteed to
succeed, that is a showstopper bug. We also have another showstopper bug
in getblk that will be hard to fix, because people have come to rely on it
and they wrote deadlock prone code.

You should know that people not running benchmarks, but using the machine
power for simulations, run out of memory all the time. If you put this
kind of obvious deadlock into the main kernel allocator, you'll screw up
the hard work to fix all the other deadlock problems during OOM that has
been done so far.

Please fix raid1 instead of making things worse.

Andrea
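The retval-checking discipline Andrea argues for can be sketched in
user-space C. All names here are hypothetical stand-ins (alloc_page_like
plays the role of a fallible kmalloc(GFP_KERNEL)); this is only an
illustration of the pattern, not kernel code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for kmalloc(GFP_KERNEL): may return NULL under memory
 * pressure.  The fail_next flag lets us simulate an OOM condition. */
static int fail_next;

static void *alloc_page_like(size_t size)
{
	if (fail_next) {
		fail_next = 0;
		return NULL;	/* simulated allocation failure */
	}
	return malloc(size);
}

/* The caller checks the retval and unwinds instead of looping forever:
 * on failure every partial allocation is released and an error code
 * (-1 here, -ENOMEM in the kernel) is propagated upward. */
static int setup_buffers(void **bufs, int n, size_t size)
{
	int i;

	for (i = 0; i < n; i++) {
		bufs[i] = alloc_page_like(size);
		if (bufs[i] == NULL) {
			while (--i >= 0)
				free(bufs[i]);
			return -1;	/* out of memory: caller decides */
		}
	}
	return 0;
}
```

The point of the sketch is that the error path is local and bounded: no
retry loop inside the allocator, so an OOM condition surfaces as an error
return rather than a hang.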
Re: the new VM
On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> I'm not sure whether we should restrict the limiting to only the cases
> that need it. For example, do_anonymous_page looks like a place that
> could rely on the GFP retval.

i think an application should not fail due to other applications
allocating too much RAM. OOM behavior should be a central thing and based
on allocation patterns, not pure luck or unluck. I always found it rude to
SIGBUS when some other application is abusing RAM but the oom detector has
not yet killed it off.

	Ingo
Re: the new VM
On Mon, 25 Sep 2000, Andrea Arcangeli wrote:

[snip]

> I talked with Alexey about this and it seems the best way is to have a
> per-socket reservation of clean cache as a function of the receive
> window. So we don't need a huge atomic pool, but we can have a special
> lru with an irq spinlock that is able to shrink cache from irq as well.

In the current 2.4 VM code, there is a kernel thread called "kreclaimd".

This thread keeps freeing pages from the inactive clean list when needed
(when zone->free_pages < zone->pages_low), making them available for
atomic allocations.

Do you consider pages_low pages as a "huge atomic pool"?
Re: the new VM
On Mon, Sep 25, 2000 at 04:27:24PM +0200, Ingo Molnar wrote:
> i think an application should not fail due to other applications
> allocating too much RAM. OOM behavior should be a central thing and based

At least Linus's point is that doing perfect accounting (at least on the
userspace allocation side) may cause you to waste resources, failing even
if you could still run, and I tend to agree with him. We're lazy on that
side, and that's a global win in most cases. We are fine-grained with page
granularity, not with mmap granularity.

The point is that not all the mmapped regions are going to be paged in.
Think of a program that only after 1 hour did all the calculations that
allocated all the memory it requested with malloc. Before the hour passes,
the unused memory can still be used for other things, and that's what the
user also expects when he runs `free`.

Andrea
Re: the new VM
On Mon, Sep 25, 2000 at 11:26:48AM -0300, Marcelo Tosatti wrote:
> This thread keeps freeing pages from the inactive clean list when needed
> (when zone->free_pages < zone->pages_low), making them available for
> atomic allocations.

This is flawed. It's the irq that has to shrink the memory itself. It
certainly can't reschedule kreclaimd and wait for it to do the work.

Increasing the free_pages_min limit is the _only_ alternative to having
irqs that are able to shrink clean cache (and hopefully that "feature"
will be resurrected soon, since it's the only way to go right now).

Andrea
Re: the new VM
On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> On Mon, Sep 25, 2000 at 03:02:58PM +0200, Ingo Molnar wrote:
> > On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> > > Sorry I totally disagree. If GFP_KERNEL allocations are guaranteed
> > > to succeed, that is a showstopper bug. [...]
> >
> > why?
>
> Because as you said the machine can lockup when you run out of memory.

The fix for this is to kill a user process when you're OOM (you need to
do this anyway). The last few allocations of the "condemned" process can
come from the reserved pages, and the process we killed will exit just
fine.

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
  -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/
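The victim-selection idea Rik describes (minimise work lost, protect
'innocent' tasks) can be sketched as a scoring pass over candidate
processes. This is a hypothetical heuristic in user-space C for
illustration only, not the actual kernel patch; the struct fields and the
root-bonus factor are invented:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical OOM victim selection: prefer the task with the largest
 * resident size, but halve the score of root-owned tasks so that
 * "innocent" system daemons are less likely to be chosen. */
struct task {
	const char *name;
	long rss_pages;		/* resident memory, in pages */
	int uid;		/* 0 == root */
};

static const struct task *pick_oom_victim(const struct task *tasks, int n)
{
	const struct task *victim = NULL;
	long best = -1;
	int i;

	for (i = 0; i < n; i++) {
		long score = tasks[i].rss_pages;
		if (tasks[i].uid == 0)
			score /= 2;	/* protect privileged tasks */
		if (score > best) {
			best = score;
			victim = &tasks[i];
		}
	}
	return victim;		/* NULL only if n == 0 */
}
```

Killing the single highest-scoring task satisfies criterion 3 ("try to
kill only one thing") while the memory-size term makes the choice match
user expectations most of the time.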
Re: the new VM
> > Because as you said the machine can lockup when you run out of memory.
>
> well, i think all kernel-space allocations have to be limited carefully;
> denying succeeding allocations is not a solution against over-allocation,
> especially in a multi-user environment.

GFP_KERNEL has to be able to fail for 2.4. Otherwise you can get
everything jammed in kernel space waiting on GFP_KERNEL, and if the
swapper cannot make space you die.

The alternative approach, where it cannot fail, has to be at higher
levels, so you can release other resources that might need freeing for
deadlock avoidance before you retry.

Alan
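Alan's higher-level release-and-retry policy can be sketched in user-space
C. Everything here is a stand-in (try_alloc simulates a fallible
allocator, release_our_caches represents whatever reclaimable resources
the caller owns); it is an illustration of the idea, not kernel code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

static int allocs_until_success;	/* simulate pressure easing off  */
static int caches_dropped;		/* times we gave resources back  */

/* Stand-in for an allocator that is allowed to fail. */
static void *try_alloc(size_t size)
{
	if (allocs_until_success > 0) {
		allocs_until_success--;
		return NULL;		/* allocator failed this time */
	}
	return malloc(size);
}

/* The caller knows which of its own resources can safely be released. */
static void release_our_caches(void)
{
	caches_dropped++;		/* would free private caches here */
}

/* The retry policy lives above the allocator: on failure, release what
 * we can, then retry a bounded number of times instead of blocking
 * inside the allocator forever. */
static void *alloc_with_backoff(size_t size, int max_tries)
{
	int try;

	for (try = 0; try < max_tries; try++) {
		void *p = try_alloc(size);
		if (p)
			return p;
		release_our_caches();
	}
	return NULL;			/* genuine OOM: report upward */
}
```

Because the loop is bounded and the caller frees its own resources between
attempts, the deadlock Alan describes (everyone blocked inside the
allocator holding memory the swapper needs) cannot occur in this scheme.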
Re: the new VM
On Mon, Sep 25, 2000 at 04:43:44PM +0200, Ingo Molnar wrote:
> i talked about GFP_KERNEL, not GFP_USER. Even in the case of GFP_USER i

My bad, you're right, I was talking about GFP_USER indeed. But even for
GFP_KERNEL allocations, like the init of a module or any other thing that
is static sized during production, just checking the retval looks ok.

> believe the right place to oom is via a signal, not in the gfp() case.

A signal can be trapped and ignored by a malicious task. We had that
security problem until 2.2.14 IIRC.

> (because oom situation in the gfp() case is a completely random and
> statistical event, which might have no connection at all to the
> behavior of that given process.)

I agree we should have more information about the behaviour of the system,
and I think a per-task page fault rate should work in practice.

But my question isn't what you do when you're OOM, but _how_ do you notice
that you're OOM? In the GFP_USER case simply checking when GFP fails looks
right to me.

Andrea
Swap on RAID; was: Re: the new VM
Ingo Molnar wrote:
> this is fixed in 2.4. The 2.2 RAID code is frozen, and has known
> limitations (ie. due to the above, RAID1 cannot be used as a
> swap-device).

Eh, just to be clear about this: does this apply to the RAID 0.90 code as
commonly patched in by RedHat? Should I instead use a swap file for a
machine that should be fault-tolerant against a drive failure?

regards,

David
--
David L. Parsley
Network Administrator
Roanoke College
Re: Swap on RAID; was: Re: the new VM
On Mon, 25 Sep 2000, [EMAIL PROTECTED] wrote:
> > this is fixed in 2.4. The 2.2 RAID code is frozen, and has known
> > limitations (ie. due to the above, RAID1 cannot be used as a
> > swap-device).
>
> as commonly patched in by RedHat? Should I instead use a swap file for
> a machine that should be fault-tolerant against a drive failure?

the answer is yes. RAID5 will not deadlock due to VM problems, but RAID5
might have other problems if the device is being reconstructed *and* used
for swap.

	Ingo
Re: the new VM
On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> > i think the GFP_USER case should do the oom logic within
> > __alloc_pages(),
>
> What's the difference of implementing the logic outside alloc_pages?
> Putting the logic inside doesn't look like a clean design to me.

it gives consistency and simplicity. The allocators themselves do not
have to care about oom.

	Ingo
Re: the new VM
On Mon, 25 Sep 2000, Alan Cox wrote:
> Unless I'm missing something here, think about this case:
>
> 2 active processes, no swap
>
>	#1		#2
>	kmalloc 32K	kmalloc 16K
>	OK		OK
>	kmalloc 16K	kmalloc 32K
>	block		block
>
> so GFP_KERNEL has to be able to fail - it can wait for I/O in some
> cases with care, but when we have no pages left something has to give

you are right, i agree that synchronous OOM for higher-order allocations
must be preserved (just like ATOMIC allocations). But the overwhelming
majority of allocations is done at page granularity. With multi-page
allocations and the need for physically contiguous buffers, the problem
cannot be solved.

	Ingo
Re: the new VM
On Mon, Sep 25, 2000 at 05:16:06PM +0200, Ingo Molnar wrote:
> situation is just 1% RAM away from the 'root cannot log in' situation.

The "root cannot log in" case is a little different. Just think that in
the "root cannot log in" case you only need to press SYSRQ+E (or at worst
+I). If all tasks in the system are hanging in the GFP loop, SYSRQ+I won't
solve the deadlock.

Ok, you can add a signal check in the memory balancing code, but that
looks like an ugly hack that shows the difference between the two cases
(the one Alan pointed out is a real deadlock; the current one is a kind of
livelock that can go away any time, while the deadlock can reach the point
where it can't be recovered without a hack from an irq somewhere).

Andrea
Re: the new VM
On Mon, Sep 25, 2000 at 04:40:44PM +0100, Stephen C. Tweedie wrote:
> Allowing GFP_ATOMIC to eat PF_MEMALLOC's last-chance pages is the wrong
> thing to do if we want to guarantee swapper progress under extreme load.

You're definitely right. We at least need the guarantee of the memory to
allocate the bhs on top of the swap cache while we attempt to swap out
one page (that path can't fail at the moment).

Andrea
Re: the new VM
On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> On Mon, Sep 25, 2000 at 04:27:24PM +0200, Ingo Molnar wrote:
> > i think an application should not fail due to other applications
> > allocating too much RAM. OOM behavior should be a central thing and
> > based
>
> At least Linus's point is that doing perfect accounting (at least on
> the userspace allocation side) may cause you to waste resources,
> failing even if you could still run, and I tend to agree with him.
> We're lazy on that side and that's a global win in most cases.

OK, so do you guys want my OOM-killer selection code in 2.4? ;)

(that will fix the OOM case in the rare situations where it occurs and do
the expected thing most of the time)

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
  -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/
Re: the new VM
On Mon, Sep 25, 2000 at 05:26:59PM +0200, Ingo Molnar wrote:
> On Mon, 25 Sep 2000, Andrea Arcangeli wrote:
> > > i think the GFP_USER case should do the oom logic within
> > > __alloc_pages(),
> >
> > What's the difference of implementing the logic outside alloc_pages?
> > Putting the logic inside doesn't look like a clean design to me.
>
> it gives consistency and simplicity. The allocators themselves do not
> have to care about oom.

There are many cases where it is simple to do:

	if (alloc(r1) == fail)
		goto freeall;
	if (alloc(r2) == fail)
		goto freeall;
	if (alloc(r3) == fail)
		goto freeall;

And the alloc functions don't know how to "freeall". Perhaps it would be
good to do an alloc_vec allocation in these cases:

	alloc_vec[0].size = n;
	...
	alloc_vec[n].size = 0;
	if (kmalloc_all(alloc_vec) == FAIL)
		return -ENOMEM;
	else
		/* alloc_vec[i].ptr is the pointer */

--
---
Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
 www.fsmlabs.com  www.rtlinux.com
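Victor's alloc_vec idea can be sketched in user-space C. The names are
hypothetical (alloc_all stands in for the proposed kmalloc_all, which does
not exist in the kernel), and malloc plays the role of the fallible
allocator; this only illustrates the all-or-nothing semantics:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* The vector is terminated by a zero size.  alloc_all() either fills
 * in every ptr, or frees everything it got (the "freeall" path the
 * individual alloc calls cannot perform) and reports failure. */
struct alloc_vec {
	size_t size;
	void *ptr;
};

static int alloc_all(struct alloc_vec *vec)
{
	int i;

	for (i = 0; vec[i].size != 0; i++) {
		vec[i].ptr = malloc(vec[i].size);
		if (vec[i].ptr == NULL) {
			while (--i >= 0) {	/* undo partial work */
				free(vec[i].ptr);
				vec[i].ptr = NULL;
			}
			return -1;	/* -ENOMEM in the kernel */
		}
	}
	return 0;
}
```

The appeal of the design is exactly what Victor notes: the caller writes
one failure check instead of one goto per allocation, and the rollback
logic lives in a single place.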
Re: [PATCH *] new VM patch for 2.4.0-test8
On Fri, 15 Sep 2000, James Lewis Nance wrote:
> On Fri, Sep 15, 2000 at 10:09:57PM -0300, Rik van Riel wrote:
> > Hi,
> >
> > today I released a new VM patch with 4 small improvements:
>
> Are these 4 improvements in the test9-pre1 patch that Linus
> just released?

No. But I have a patch (that I will mail to the list once I've tested it
a bit more).

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
  -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/
Re: [PATCH *] new VM patch for 2.4.0-test8
On Fri, 15 Sep 2000, James Lewis Nance wrote:
> On Fri, Sep 15, 2000 at 10:09:57PM -0300, Rik van Riel wrote:
> > Hi,
> >
> > today I released a new VM patch with 4 small improvements:
>
> Are these 4 improvements in the code test9-pre1 patch that Linus
> just released?

Oh well, I may as well give it now ;)

The patch below upgrades 2.4.0-test9-pre1 VM to a VM with the 4
changes... They /should/ be stable, but I'd really appreciate a bit
more testing before I give the patch to Linus.

(I know the VM patch included in 2.4.0-test9-pre1 is stable, that one
got a heavier testing than any VM patch I ever made. I was testing the
system so heavily that I had to upgrade my 8139too driver and other
things to keep the system from crashing ;))

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
 -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/

--- linux-2.4.8-test9-pre1/fs/buffer.c.orig	Fri Sep 15 23:23:09 2000
+++ linux-2.4.8-test9-pre1/fs/buffer.c	Fri Sep 15 23:26:24 2000
@@ -705,7 +705,6 @@
 static void refill_freelist(int size)
 {
 	if (!grow_buffers(size)) {
-		//wakeup_bdflush(1);
 		balance_dirty(NODEV);
 		wakeup_kswapd(1);
 	}
@@ -863,15 +862,14 @@
 	dirty = size_buffers_type[BUF_DIRTY] >> PAGE_SHIFT;
 	tot = nr_free_buffer_pages();
-//	tot -= size_buffers_type[BUF_PROTECTED] >> PAGE_SHIFT;
 	dirty *= 200;
 	soft_dirty_limit = tot * bdf_prm.b_un.nfract;
 	hard_dirty_limit = soft_dirty_limit * 2;

 	/* First, check for the "real" dirty limit. */
-	if (dirty > soft_dirty_limit || inactive_shortage()) {
-		if (dirty > hard_dirty_limit)
+	if (dirty > soft_dirty_limit) {
+		if (dirty > hard_dirty_limit || inactive_shortage())
 			return 1;
 		return 0;
 	}
@@ -2279,7 +2277,9 @@
 {
 	struct buffer_head * tmp, * bh = page->buffers;
 	int index = BUFSIZE_INDEX(bh->b_size);
+	int loop = 0;

+cleaned_buffers_try_again:
 	spin_lock(&lru_list_lock);
 	write_lock(&hash_table_lock);
 	spin_lock(&free_list[index].lock);
@@ -2325,8 +2325,14 @@
 	spin_unlock(&free_list[index].lock);
 	write_unlock(&hash_table_lock);
 	spin_unlock(&lru_list_lock);
-	if (wait)
+	if (wait) {
 		sync_page_buffers(bh, wait);
+		/* We waited synchronously, so we can free the buffers. */
+		if (wait > 1 && !loop) {
+			loop = 1;
+			goto cleaned_buffers_try_again;
+		}
+	}
 	return 0;
 }
--- linux-2.4.8-test9-pre1/mm/swap.c.orig	Fri Sep 15 23:23:11 2000
+++ linux-2.4.8-test9-pre1/mm/swap.c	Fri Sep 15 23:24:23 2000
@@ -161,14 +161,19 @@
 	 * Don't touch it if it's not on the active list.
 	 * (some pages aren't on any list at all)
 	 */
-	if (PageActive(page) && (page_count(page) == 1 || page->buffers) &&
+	if (PageActive(page) && (page_count(page) <= 2 || page->buffers) &&
 			!page_ramdisk(page)) {
 		/*
 		 * We can move the page to the inactive_dirty list
 		 * if we know there is backing store available.
+		 *
+		 * We also move pages here that we cannot free yet,
+		 * but may be able to free later - because most likely
+		 * we're holding an extra reference on the page which
+		 * will be dropped right after deactivate_page().
 		 */
-		if (page->buffers) {
+		if (page->buffers || page_count(page) == 2) {
 			del_page_from_active_list(page);
 			add_page_to_inactive_dirty_list(page);
 		/*
@@ -181,8 +186,7 @@
 			add_page_to_inactive_clean_list(page);
 		}
 		/*
-		 * ELSE: no backing store available, leave it on
-		 * the active list.
+		 * OK, we cannot free the page. Leave it alone.
 		 */
 	}
 }
--- linux-2.4.8-test9-pre1/mm/vmscan.c.orig	Fri Sep 15 23:23:11 2000
+++ linux-2.4.8-test9-pre1/mm/vmscan.c	Fri Sep 15 23:32:10 2000
@@ -103,8 +103,8 @@
 	UnlockPage(page);
 	vma->vm_mm->rss--;
 	flush_tlb_page(vma, address);
-	page_cache_release(page);
 	deactivate_page(page);
+	page_cache_release(page);
 	goto out_failed;
 }
@@ -681,19 +681,26 @@
 	if (freed_page && !free_shortage())
 		break;
 	continue;
+	} else if (page->mapping && !PageDirty(page)) {
+		/*
+		 * If a page had an extra reference in
+		 * deactivate_page(), we will find it here.
Re: [PATCH *] new VM patch for 2.4.0-test8
On Fri, Sep 15, 2000 at 10:09:57PM -0300, Rik van Riel wrote:
> Hi,
>
> today I released a new VM patch with 4 small improvements:

Are these 4 improvements in the code test9-pre1 patch that Linus
just released?

Jim