Re: [RFC] memory reserve for userspace oom-killer
Hi Folks,

On Tue, Apr 20, 2021 at 12:18 PM Roman Gushchin wrote:
>
> On Mon, Apr 19, 2021 at 06:44:02PM -0700, Shakeel Butt wrote:
> > Proposal: Provide memory guarantees to userspace oom-killer.
> >
> > Background:
> >
> > Issues with kernel oom-killer:
> > 1. Very conservative and prefer to reclaim. Applications can suffer
> > for a long time.
> > 2. Borrows the context of the allocator which can be resource limited
> > (low sched priority or limited CPU quota).
> > 3. Serialized by global lock.
> > 4. Very simplistic oom victim selection policy.
> >
> > These issues are resolved through userspace oom-killer by:
> > 1. Ability to monitor arbitrary metrics (PSI, vmstat, memcg stats) to
> > early detect suffering.
> > 2. Independent process context which can be given dedicated CPU quota
> > and high scheduling priority.
> > 3. Can be more aggressive as required.
> > 4. Can implement sophisticated business logic/policies.
> >
> > Android's LMKD and Facebook's oomd are the prime examples of userspace
> > oom-killers. One of the biggest challenges for userspace oom-killers
> > is to potentially function under intense memory pressure and are prone
> > to getting stuck in memory reclaim themselves. Current userspace
> > oom-killers aim to avoid this situation by preallocating user memory
> > and protecting themselves from global reclaim by either mlocking or
> > memory.min. However a new allocation from userspace oom-killer can
> > still get stuck in the reclaim and policy rich oom-killer do trigger
> > new allocations through syscalls or even heap.
> >
> > Our attempt of userspace oom-killer faces similar challenges.
> > Particularly at the tail on the very highly utilized machines we have
> > observed userspace oom-killer spectacularly failing in many possible
> > ways in the direct reclaim. We have seen oom-killer stuck in direct
> > reclaim throttling, stuck in reclaim and allocations from interrupts
> > keep stealing reclaimed memory. We have even observed systems where
> > all the processes were stuck in throttle_direct_reclaim() and only
> > kswapd was running and the interrupts kept stealing the memory
> > reclaimed by kswapd.
> >
> > To reliably solve this problem, we need to give guaranteed memory to
> > the userspace oom-killer. At the moment we are contemplating between
> > the following options and I would like to get some feedback.
> >
> > 1. prctl(PF_MEMALLOC)
> >
> > The idea is to give userspace oom-killer (just one thread which is
> > finding the appropriate victims and will be sending SIGKILLs) access
> > to MEMALLOC reserves. Most of the time the preallocation, mlock and
> > memory.min will be good enough but for rare occasions, when the
> > userspace oom-killer needs to allocate, the PF_MEMALLOC flag will
> > protect it from reclaim and let the allocation dip into the memory
> > reserves.
> >
> > The misuse of this feature would be risky but it can be limited to
> > privileged applications. Userspace oom-killer is the only appropriate
> > user of this feature. This option is simple to implement.
>
> Hello Shakeel!
>
> If ordinary PAGE_SIZE and smaller kernel allocations start to fail,
> the system is already in a relatively bad shape. Arguably the userspace
> OOM killer should kick in earlier, it's already a bit too late.

I tend to agree here. This is how we are trying to avoid issues with
such severe memory shortages - by tuning the killer a bit more
aggressively. But a more reliable mechanism would definitely be an
improvement.

> Allowing to use reserves just pushes this even further, so we're risking
> the kernel stability for no good reason.
>
> But I agree that throttling the oom daemon in direct reclaim makes no sense.
> I wonder if we can introduce a per-task flag which will exclude the task from
> throttling, but instead all (large) allocations will just fail under a
> significant memory pressure more easily. In this case if there is a
> significant memory shortage the oom daemon will not be fully functional
> (will get -ENOMEM for an attempt to read some stats, for example), but
> still will be able to kill some processes and make the forward progress.

This sounds like a good idea to me.

> But maybe it can be done in userspace too: by splitting the daemon into
> a core- and extended part and avoid doing anything behind bare minimum
> in the core part.
>
> >
> > 2. Mempool
> >
> > The idea is to preallocate mempool with a given amount of memory for
> > userspace oom-killer. Preferably this will be per-thread and
> > oom-killer can preallocate mempool for its specific threads. The core
> > page allocator can check before going to the reclaim path if the task
> > has private access to the mempool and return page from it if yes.
> >
> > This option would be more complicated than the previous option as the
> > lifecycle of the page from the mempool would be more sophisticated.
> > Additionally the current mempool does not handle higher order pages
> > and we might need to
Re: [RFC] memory reserve for userspace oom-killer
On Mon, Apr 19, 2021 at 06:44:02PM -0700, Shakeel Butt wrote:
> Proposal: Provide memory guarantees to userspace oom-killer.
>
> Background:
>
> Issues with kernel oom-killer:
> 1. Very conservative and prefer to reclaim. Applications can suffer
> for a long time.
> 2. Borrows the context of the allocator which can be resource limited
> (low sched priority or limited CPU quota).
> 3. Serialized by global lock.
> 4. Very simplistic oom victim selection policy.
>
> These issues are resolved through userspace oom-killer by:
> 1. Ability to monitor arbitrary metrics (PSI, vmstat, memcg stats) to
> early detect suffering.
> 2. Independent process context which can be given dedicated CPU quota
> and high scheduling priority.
> 3. Can be more aggressive as required.
> 4. Can implement sophisticated business logic/policies.
>
> Android's LMKD and Facebook's oomd are the prime examples of userspace
> oom-killers. One of the biggest challenges for userspace oom-killers
> is to potentially function under intense memory pressure and are prone
> to getting stuck in memory reclaim themselves. Current userspace
> oom-killers aim to avoid this situation by preallocating user memory
> and protecting themselves from global reclaim by either mlocking or
> memory.min. However a new allocation from userspace oom-killer can
> still get stuck in the reclaim and policy rich oom-killer do trigger
> new allocations through syscalls or even heap.
>
> Our attempt of userspace oom-killer faces similar challenges.
> Particularly at the tail on the very highly utilized machines we have
> observed userspace oom-killer spectacularly failing in many possible
> ways in the direct reclaim. We have seen oom-killer stuck in direct
> reclaim throttling, stuck in reclaim and allocations from interrupts
> keep stealing reclaimed memory. We have even observed systems where
> all the processes were stuck in throttle_direct_reclaim() and only
> kswapd was running and the interrupts kept stealing the memory
> reclaimed by kswapd.
>
> To reliably solve this problem, we need to give guaranteed memory to
> the userspace oom-killer. At the moment we are contemplating between
> the following options and I would like to get some feedback.
>
> 1. prctl(PF_MEMALLOC)
>
> The idea is to give userspace oom-killer (just one thread which is
> finding the appropriate victims and will be sending SIGKILLs) access
> to MEMALLOC reserves. Most of the time the preallocation, mlock and
> memory.min will be good enough but for rare occasions, when the
> userspace oom-killer needs to allocate, the PF_MEMALLOC flag will
> protect it from reclaim and let the allocation dip into the memory
> reserves.
>
> The misuse of this feature would be risky but it can be limited to
> privileged applications. Userspace oom-killer is the only appropriate
> user of this feature. This option is simple to implement.

Hello Shakeel!

If ordinary PAGE_SIZE and smaller kernel allocations start to fail,
the system is already in a relatively bad shape. Arguably the userspace
OOM killer should kick in earlier, it's already a bit too late.
Allowing to use reserves just pushes this even further, so we're risking
the kernel stability for no good reason.

But I agree that throttling the oom daemon in direct reclaim makes no sense.
I wonder if we can introduce a per-task flag which will exclude the task from
throttling, but instead all (large) allocations will just fail under a
significant memory pressure more easily. In this case if there is a significant
memory shortage the oom daemon will not be fully functional (will get -ENOMEM
for an attempt to read some stats, for example), but still will be able to kill
some processes and make the forward progress.
But maybe it can be done in userspace too: by splitting the daemon into
a core- and extended part and avoid doing anything behind bare minimum
in the core part.

>
> 2. Mempool
>
> The idea is to preallocate mempool with a given amount of memory for
> userspace oom-killer. Preferably this will be per-thread and
> oom-killer can preallocate mempool for its specific threads. The core
> page allocator can check before going to the reclaim path if the task
> has private access to the mempool and return page from it if yes.
>
> This option would be more complicated than the previous option as the
> lifecycle of the page from the mempool would be more sophisticated.
> Additionally the current mempool does not handle higher order pages
> and we might need to extend it to allow such allocations. Though this
> feature might have more use-cases and it would be less risky than the
> previous option.

It looks like an over-kill for the oom daemon protection, but if there
are other good use cases, maybe it's a good feature to have.

>
> Another idea I had was to use kthread based oom-killer and provide the
> policies through eBPF program. Though I am not sure how to make it
> monitor arbitrary metrics and if that can be done without any
Re: [RFC] memory reserve for userspace oom-killer
On Mon, Apr 19, 2021 at 11:46 PM Michal Hocko wrote:
>
> On Mon 19-04-21 18:44:02, Shakeel Butt wrote:
[...]
> > memory.min. However a new allocation from userspace oom-killer can
> > still get stuck in the reclaim and policy rich oom-killer do trigger
> > new allocations through syscalls or even heap.
>
> Can you be more specific please?
>

To decide when to kill, the oom-killer has to read a lot of metrics.
It has to open a lot of files to read them and there will definitely
be new allocations involved in those operations. For example reading
memory.stat does a page size allocation. Similarly, to perform action
the oom-killer may have to read cgroup.procs file which again has
allocation inside it.

Regarding sophisticated oom policy, I can give one example of our
cluster level policy. For robustness, many user facing jobs run a lot
of instances in a cluster to handle failures. Such jobs are tolerant
to some amount of failures but they still have requirements to not let
the number of running instances below some threshold. Normally killing
such jobs is fine but we do want to make sure that we do not violate
their cluster level agreement. So, the userspace oom-killer may
dynamically need to confirm if such a job can be killed.

[...]
> > To reliably solve this problem, we need to give guaranteed memory to
> > the userspace oom-killer.
>
> There is nothing like that. Even memory reserves are a finite resource
> which can be consumed as it is sharing those reserves with other users
> who are not necessarily coordinated. So before we start discussing
> making this even more muddy by handing over memory reserves to the
> userspace we should really examine whether pre-allocation is something
> that will not work.
>

We actually explored if we can restrict the syscalls for the
oom-killer which does not do memory allocations. We concluded that is
not practical and not maintainable. Whatever the list we can come up
with will be outdated soon. In addition, converting all the must-have
syscalls to not do allocations is not possible/practical.

> > At the moment we are contemplating between
> > the following options and I would like to get some feedback.
> >
> > 1. prctl(PF_MEMALLOC)
> >
> > The idea is to give userspace oom-killer (just one thread which is
> > finding the appropriate victims and will be sending SIGKILLs) access
> > to MEMALLOC reserves. Most of the time the preallocation, mlock and
> > memory.min will be good enough but for rare occasions, when the
> > userspace oom-killer needs to allocate, the PF_MEMALLOC flag will
> > protect it from reclaim and let the allocation dip into the memory
> > reserves.
>
> I do not think that handing over an unlimited ticket to the memory
> reserves to userspace is a good idea. Even the in kernel oom killer is
> bound to a partial access to reserves. So if we really want this then
> it should be in sync with and bound by the ALLOC_OOM.
>

Makes sense.

> > The misuse of this feature would be risky but it can be limited to
> > privileged applications. Userspace oom-killer is the only appropriate
> > user of this feature. This option is simple to implement.
> >
> > 2. Mempool
> >
> > The idea is to preallocate mempool with a given amount of memory for
> > userspace oom-killer. Preferably this will be per-thread and
> > oom-killer can preallocate mempool for its specific threads. The core
> > page allocator can check before going to the reclaim path if the task
> > has private access to the mempool and return page from it if yes.
>
> Could you elaborate some more on how this would be controlled from the
> userspace? A dedicated syscall? A driver?
>

I was thinking of simply prctl(SET_MEMPOOL, bytes) to assign mempool
to a thread (not shared between threads) and prctl(RESET_MEMPOOL) to
free the mempool.

> > This option would be more complicated than the previous option as the
> > lifecycle of the page from the mempool would be more sophisticated.
> > Additionally the current mempool does not handle higher order pages
> > and we might need to extend it to allow such allocations. Though this
> > feature might have more use-cases and it would be less risky than the
> > previous option.
>
> I would tend to agree.
>
> > Another idea I had was to use kthread based oom-killer and provide the
> > policies through eBPF program. Though I am not sure how to make it
> > monitor arbitrary metrics and if that can be done without any
> > allocations.
>
> A kernel module or eBPF to implement oom decisions has already been
> discussed few years back. But I am afraid this would be hard to wire in
> for anything except for the victim selection. I am not sure it is
> maintainable to also control when the OOM handling should trigger.
>

I think you are referring to [1]. That patch was only looking at PSI
and I think we are on the same page that we need more information to
decide when to kill. Also I agree with you that it is hard to
implement "when to kill" with eBPF but I wanted the idea
Re: [RFC] memory reserve for userspace oom-killer
On Mon 19-04-21 18:44:02, Shakeel Butt wrote:
> Proposal: Provide memory guarantees to userspace oom-killer.
>
> Background:
>
> Issues with kernel oom-killer:
> 1. Very conservative and prefer to reclaim. Applications can suffer
> for a long time.
> 2. Borrows the context of the allocator which can be resource limited
> (low sched priority or limited CPU quota).
> 3. Serialized by global lock.
> 4. Very simplistic oom victim selection policy.
>
> These issues are resolved through userspace oom-killer by:
> 1. Ability to monitor arbitrary metrics (PSI, vmstat, memcg stats) to
> early detect suffering.
> 2. Independent process context which can be given dedicated CPU quota
> and high scheduling priority.
> 3. Can be more aggressive as required.
> 4. Can implement sophisticated business logic/policies.
>
> Android's LMKD and Facebook's oomd are the prime examples of userspace
> oom-killers. One of the biggest challenges for userspace oom-killers
> is to potentially function under intense memory pressure and are prone
> to getting stuck in memory reclaim themselves. Current userspace
> oom-killers aim to avoid this situation by preallocating user memory
> and protecting themselves from global reclaim by either mlocking or
> memory.min. However a new allocation from userspace oom-killer can
> still get stuck in the reclaim and policy rich oom-killer do trigger
> new allocations through syscalls or even heap.

Can you be more specific please?

> Our attempt of userspace oom-killer faces similar challenges.
> Particularly at the tail on the very highly utilized machines we have
> observed userspace oom-killer spectacularly failing in many possible
> ways in the direct reclaim. We have seen oom-killer stuck in direct
> reclaim throttling, stuck in reclaim and allocations from interrupts
> keep stealing reclaimed memory. We have even observed systems where
> all the processes were stuck in throttle_direct_reclaim() and only
> kswapd was running and the interrupts kept stealing the memory
> reclaimed by kswapd.
>
> To reliably solve this problem, we need to give guaranteed memory to
> the userspace oom-killer.

There is nothing like that. Even memory reserves are a finite resource
which can be consumed as it is sharing those reserves with other users
who are not necessarily coordinated. So before we start discussing
making this even more muddy by handing over memory reserves to the
userspace we should really examine whether pre-allocation is something
that will not work.

> At the moment we are contemplating between
> the following options and I would like to get some feedback.
>
> 1. prctl(PF_MEMALLOC)
>
> The idea is to give userspace oom-killer (just one thread which is
> finding the appropriate victims and will be sending SIGKILLs) access
> to MEMALLOC reserves. Most of the time the preallocation, mlock and
> memory.min will be good enough but for rare occasions, when the
> userspace oom-killer needs to allocate, the PF_MEMALLOC flag will
> protect it from reclaim and let the allocation dip into the memory
> reserves.

I do not think that handing over an unlimited ticket to the memory
reserves to userspace is a good idea. Even the in kernel oom killer is
bound to a partial access to reserves. So if we really want this then
it should be in sync with and bound by the ALLOC_OOM.

> The misuse of this feature would be risky but it can be limited to
> privileged applications. Userspace oom-killer is the only appropriate
> user of this feature. This option is simple to implement.
>
> 2. Mempool
>
> The idea is to preallocate mempool with a given amount of memory for
> userspace oom-killer. Preferably this will be per-thread and
> oom-killer can preallocate mempool for its specific threads. The core
> page allocator can check before going to the reclaim path if the task
> has private access to the mempool and return page from it if yes.

Could you elaborate some more on how this would be controlled from the
userspace? A dedicated syscall? A driver?

> This option would be more complicated than the previous option as the
> lifecycle of the page from the mempool would be more sophisticated.
> Additionally the current mempool does not handle higher order pages
> and we might need to extend it to allow such allocations. Though this
> feature might have more use-cases and it would be less risky than the
> previous option.

I would tend to agree.

> Another idea I had was to use kthread based oom-killer and provide the
> policies through eBPF program. Though I am not sure how to make it
> monitor arbitrary metrics and if that can be done without any
> allocations.

A kernel module or eBPF to implement oom decisions has already been
discussed few years back. But I am afraid this would be hard to wire in
for anything except for the victim selection. I am not sure it is
maintainable to also control when the OOM handling should trigger.

-- 
Michal Hocko
SUSE Labs
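
For readers following the victim-selection point: the cluster-level constraint Shakeel describes in this thread (a job may lose instances, but its running count must not drop below a threshold) could be sketched roughly as follows. This is only an illustration in Python and not something proposed by anyone in the thread. The Candidate fields, can_kill() and pick_victim() are hypothetical names, and preferring the largest RSS among eligible candidates is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """One killable process plus the cluster-level facts about its job.

    All fields are hypothetical; a real daemon would fill them from its
    own bookkeeping and from a cluster manager.
    """
    job: str
    pid: int
    rss_bytes: int
    running_instances: int  # instances of this job currently running
    min_instances: int      # threshold the job must not drop below


def can_kill(c: Candidate) -> bool:
    # Killing one instance must leave at least min_instances running.
    return c.running_instances - 1 >= c.min_instances


def pick_victim(candidates: Sequence[Candidate]) -> Optional[Candidate]:
    # Assumption of this sketch: among eligible candidates, prefer the
    # one holding the most memory. Returns None if nothing is eligible.
    eligible = [c for c in candidates if can_kill(c)]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c.rss_bytes)
```

Note that keeping running_instances current may require a lookup outside the host, which is the kind of operation that can allocate under memory pressure.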
[RFC] memory reserve for userspace oom-killer
Proposal: Provide memory guarantees to userspace oom-killer.

Background:

Issues with kernel oom-killer:
1. Very conservative and prefer to reclaim. Applications can suffer
for a long time.
2. Borrows the context of the allocator which can be resource limited
(low sched priority or limited CPU quota).
3. Serialized by global lock.
4. Very simplistic oom victim selection policy.

These issues are resolved through userspace oom-killer by:
1. Ability to monitor arbitrary metrics (PSI, vmstat, memcg stats) to
early detect suffering.
2. Independent process context which can be given dedicated CPU quota
and high scheduling priority.
3. Can be more aggressive as required.
4. Can implement sophisticated business logic/policies.

Android's LMKD and Facebook's oomd are the prime examples of userspace
oom-killers. One of the biggest challenges for userspace oom-killers
is to potentially function under intense memory pressure and are prone
to getting stuck in memory reclaim themselves. Current userspace
oom-killers aim to avoid this situation by preallocating user memory
and protecting themselves from global reclaim by either mlocking or
memory.min. However a new allocation from userspace oom-killer can
still get stuck in the reclaim and policy rich oom-killer do trigger
new allocations through syscalls or even heap.

Our attempt of userspace oom-killer faces similar challenges.
Particularly at the tail on the very highly utilized machines we have
observed userspace oom-killer spectacularly failing in many possible
ways in the direct reclaim. We have seen oom-killer stuck in direct
reclaim throttling, stuck in reclaim and allocations from interrupts
keep stealing reclaimed memory. We have even observed systems where
all the processes were stuck in throttle_direct_reclaim() and only
kswapd was running and the interrupts kept stealing the memory
reclaimed by kswapd.

To reliably solve this problem, we need to give guaranteed memory to
the userspace oom-killer. At the moment we are contemplating between
the following options and I would like to get some feedback.

1. prctl(PF_MEMALLOC)

The idea is to give userspace oom-killer (just one thread which is
finding the appropriate victims and will be sending SIGKILLs) access
to MEMALLOC reserves. Most of the time the preallocation, mlock and
memory.min will be good enough but for rare occasions, when the
userspace oom-killer needs to allocate, the PF_MEMALLOC flag will
protect it from reclaim and let the allocation dip into the memory
reserves.

The misuse of this feature would be risky but it can be limited to
privileged applications. Userspace oom-killer is the only appropriate
user of this feature. This option is simple to implement.

2. Mempool

The idea is to preallocate mempool with a given amount of memory for
userspace oom-killer. Preferably this will be per-thread and
oom-killer can preallocate mempool for its specific threads. The core
page allocator can check before going to the reclaim path if the task
has private access to the mempool and return page from it if yes.

This option would be more complicated than the previous option as the
lifecycle of the page from the mempool would be more sophisticated.
Additionally the current mempool does not handle higher order pages
and we might need to extend it to allow such allocations. Though this
feature might have more use-cases and it would be less risky than the
previous option.

Another idea I had was to use kthread based oom-killer and provide the
policies through eBPF program. Though I am not sure how to make it
monitor arbitrary metrics and if that can be done without any
allocations.

Please do provide feedback on these approaches.

thanks,
Shakeel
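
To make the "monitor arbitrary metrics" point a little more concrete, here is a minimal sketch of how a userspace killer might turn the PSI memory file into a kill trigger. It is illustrative only and is not taken from LMKD, oomd or any daemon mentioned above. parse_psi(), read_psi() and under_pressure() are hypothetical helper names, and the 10.0 threshold is an arbitrary assumption. The file /proc/pressure/memory and its "some"/"full" lines with avg10, avg60, avg300 and total fields are the kernel's PSI interface.

```python
from typing import Dict

PSI_MEMORY_PATH = "/proc/pressure/memory"


def parse_psi(text: str) -> Dict[str, Dict[str, float]]:
    """Parse PSI text such as
    'some avg10=0.00 avg60=0.00 avg300=0.00 total=0'
    into {'some': {'avg10': 0.0, ...}, 'full': {...}}.
    """
    result: Dict[str, Dict[str, float]] = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]
        result[kind] = {
            key: float(value)
            for key, value in (pair.split("=", 1) for pair in pairs)
        }
    return result


def read_psi(path: str = PSI_MEMORY_PATH) -> Dict[str, Dict[str, float]]:
    # Opening and reading the file may itself allocate in the kernel,
    # which is the failure mode this thread is about.
    with open(path, "r") as f:
        return parse_psi(f.read())


def under_pressure(psi: Dict[str, Dict[str, float]],
                   threshold: float = 10.0) -> bool:
    # Assumption: treat the 10-second 'full' average at or above the
    # threshold (in percent) as the signal to start selecting victims.
    return psi.get("full", {}).get("avg10", 0.0) >= threshold
```

A daemon would call read_psi() periodically and start victim selection once under_pressure() returns True. The right signal and threshold are workload specific.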