Re: [PATCH 00/10] OOM Debug print selection and additional information
On Thu, Aug 29, 2019 at 11:44 AM Qian Cai wrote:
> On Thu, 2019-08-29 at 09:09 -0700, Edward Chron wrote:
> > > Feel like you are going in circles to "sell" without any new
> > > information. If you need to deal with OOM that often, it might also be
> > > worth working with FB on oomd.
> > >
> > > https://github.com/facebookincubator/oomd
> > >
> > > It is well-known that kernel OOM could be slow and painful to deal
> > > with, so I don't buy the argument that kernel OOM recovery is
> > > better/faster than a kdump reboot.
> > >
> > > It is not unusual that when the system is triggering a kernel OOM, it
> > > is almost trashed/dead. Although developers are working hard to
> > > improve the recovery after OOM, there are still many error paths that
> > > are not going to survive, which would leak memory, introduce undefined
> > > behaviors, corrupt memory, etc.
> >
> > But as you have pointed out, many people are happy with current OOM
> > processing, which is the report and recovery, so for those people a
> > kdump reboot is overkill. Making the OOM report at least optionally a
> > bit more informative has value. Also, making sure it doesn't produce
> > excessive output is desirable.
> >
> > I do agree that for developers, having all the system state a kdump
> > provides is valuable, and as long as you can reproduce the OOM event
> > that works well. But that is not the common case, as has already been
> > discussed.
> >
> > Also, OOM events that are due to kernel bugs could leak memory over time
> > and cause a crash, true. But that is not what we typically see. In fact
> > we've had customers come back and report issues on systems that have
> > been in continuous operation for years. No point in crashing their
> > system. Linux, if properly maintained, is thankfully quite stable. But
> > OOMs do happen, and root causing them to prevent future occurrences is
> > desired.
>
> This is not what I meant. After an OOM event happens, many kernel memory
> allocations could fail. Since very few people test those error paths due
> to allocation failures, it is considered one of the most buggy areas in
> the kernel. Developers have mostly been focused on making sure the kernel
> OOM does not happen in the first place.
>
> I still think the time is better spent on improving things like eBPF,
> oomd and kdump etc. to solve your problem, but leave the kernel OOM
> report code alone.

Sure, I would rather spend my time doing other things. No argument about
that. No one likes OOMs. If I never see another OOM I'd be quite happy. But
OOM events still happen and an OOM report gets generated. When that happens
it is useful to get information that can help find the cause of the OOM so
it can be fixed and won't happen again. We get tasked to root cause OOMs
even though we'd rather do other things.

We've added a bit of output to the OOM report and it has been helpful. We
also reduce our total output by only printing larger entries with helpful
summaries. We've been using and supporting this code for quite a few
releases. We haven't had problems, and we have a lot of systems in use.
Contributing to an open source project like Linux is good. If the code is
not accepted, it's not the end of the world. I was told to offer our code
upstream and to try to be helpful.

I understand that processing an OOM event can be flaky. We add a few lines
of OOM output, but in fact we reduce our total output because we skip
printing smaller entries and print summaries instead. So if the volume of
the output increases the likelihood of system failure during an OOM event,
then we've actually increased our reliability. Maybe that is why we haven't
had any problems.

As far as switching from generating an OOM report to taking a dump and
restarting the system, the choice is not mine to make. Way above my pay
grade. When asked, I am happy to look at a dump, but dumps plus restarts
for the systems we work on take too long, so I typically don't get a dump
to look at. I have to make do with OOM output and logs.

Also, depending on what you work on, you may take satisfaction that OOM
events are far less traumatic with newer versions of Linux, at least on our
systems. The folks upstream do really good work; give credit where credit
is due. Maybe tools like KASAN really help, which we also use. Sure, people
fix bugs all the time, and Linux is huge and super complicated, but many of
the bugs are not very common, and we spend an amazing (to me anyway) amount
of time testing. So when we take OOM events, even multiple OOM events back
to back, the system almost always recovers and we don't seem to bleed
memory. That is why we keep systems up for months and even years.
Occasionally we see a watchdog timeout failure, and that can be due to a
low memory situation, but just FYI a fair number of those do not involve
OOM events, so it's not because of issues with OOM code, reporting or
otherwise.
Re: [PATCH 00/10] OOM Debug print selection and additional information
On Thu, 2019-08-29 at 09:09 -0700, Edward Chron wrote:
> > Feel like you are going in circles to "sell" without any new
> > information. If you need to deal with OOM that often, it might also be
> > worth working with FB on oomd.
> >
> > https://github.com/facebookincubator/oomd
> >
> > It is well-known that kernel OOM could be slow and painful to deal with,
> > so I don't buy the argument that kernel OOM recovery is better/faster
> > than a kdump reboot.
> >
> > It is not unusual that when the system is triggering a kernel OOM, it is
> > almost trashed/dead. Although developers are working hard to improve the
> > recovery after OOM, there are still many error paths that are not going
> > to survive, which would leak memory, introduce undefined behaviors,
> > corrupt memory, etc.
>
> But as you have pointed out, many people are happy with current OOM
> processing, which is the report and recovery, so for those people a kdump
> reboot is overkill. Making the OOM report at least optionally a bit more
> informative has value. Also, making sure it doesn't produce excessive
> output is desirable.
>
> I do agree that for developers, having all the system state a kdump
> provides is valuable, and as long as you can reproduce the OOM event that
> works well. But that is not the common case, as has already been
> discussed.
>
> Also, OOM events that are due to kernel bugs could leak memory over time
> and cause a crash, true. But that is not what we typically see. In fact
> we've had customers come back and report issues on systems that have been
> in continuous operation for years. No point in crashing their system.
> Linux, if properly maintained, is thankfully quite stable. But OOMs do
> happen, and root causing them to prevent future occurrences is desired.

This is not what I meant. After an OOM event happens, many kernel memory
allocations could fail. Since very few people test those error paths due to
allocation failures, it is considered one of the most buggy areas in the
kernel. Developers have mostly been focused on making sure the kernel OOM
does not happen in the first place.

I still think the time is better spent on improving things like eBPF, oomd
and kdump etc. to solve your problem, but leave the kernel OOM report code
alone.
Re: [PATCH 00/10] OOM Debug print selection and additional information
On Thu, Aug 29, 2019 at 9:18 AM Michal Hocko wrote:
> On Thu 29-08-19 08:03:19, Edward Chron wrote:
> > On Thu, Aug 29, 2019 at 4:56 AM Michal Hocko wrote:
> [...]
> > > Or simply provide a hook with the oom_control to be called to report
> > > without replacing the whole oom killer behavior. That is not
> > > necessary.
> >
> > For a very simple addition, to add a line of output, this works.
>
> Why would a hook be limited to small stuff?

It could be larger, but the few items we added were just a line or two of
output. The vmalloc, slabs and processes can print many entries, so we
added a control for those.

> > It would still be nice to address the fact that the existing OOM report
> > prints all of the user processes or none. It would be nice to add some
> > control for that. That's what we did.
>
> TBH, I am not really convinced a partial task list is desirable nor easy
> to configure. What is the criterion? oom_score (with potentially unstable
> metric)? Rss? Something else?

We used an estimate of the memory footprint of the process: rss, swap pages
and page table pages.

> --
> Michal Hocko
> SUSE Labs
Re: [PATCH 00/10] OOM Debug print selection and additional information
On Thu 29-08-19 08:03:19, Edward Chron wrote:
> On Thu, Aug 29, 2019 at 4:56 AM Michal Hocko wrote:
[...]
> > Or simply provide a hook with the oom_control to be called to report
> > without replacing the whole oom killer behavior. That is not necessary.
>
> For a very simple addition, to add a line of output, this works.

Why would a hook be limited to small stuff?

> It would still be nice to address the fact that the existing OOM report
> prints all of the user processes or none. It would be nice to add some
> control for that. That's what we did.

TBH, I am not really convinced a partial task list is desirable nor easy
to configure. What is the criterion? oom_score (with potentially unstable
metric)? Rss? Something else?
--
Michal Hocko
SUSE Labs
Re: [PATCH 00/10] OOM Debug print selection and additional information
On Thu, Aug 29, 2019 at 8:42 AM Qian Cai wrote:
> On Thu, 2019-08-29 at 08:03 -0700, Edward Chron wrote:
> > On Thu, Aug 29, 2019 at 4:56 AM Michal Hocko wrote:
> > > On Thu 29-08-19 19:14:46, Tetsuo Handa wrote:
> > > > On 2019/08/29 16:11, Michal Hocko wrote:
> > > > > On Wed 28-08-19 12:46:20, Edward Chron wrote:
> > > > > > Our belief is if you really think eBPF is the preferred
> > > > > > mechanism then move OOM reporting to an eBPF.
> > > > >
> > > > > I've said that all this additional information has to be
> > > > > dynamically extensible rather than a part of the core kernel.
> > > > > Whether eBPF is the suitable tool, I do not know. I haven't
> > > > > explored that. There are other ways to inject code to the kernel:
> > > > > systemtap/kprobes, kernel modules and probably others.
> > > >
> > > > As for SystemTap, guru mode (an expert mode which disables the
> > > > protection provided by SystemTap, allowing the kernel to crash when
> > > > something goes wrong) could be used for holding a spinlock. However,
> > > > as far as I know, holding a mutex (or doing any operation that might
> > > > sleep) from such dynamic hooks is not allowed. Also, we will need to
> > > > export various symbols in order to allow access from such dynamic
> > > > hooks.
> > >
> > > This is the oom path and it had better not use any sleeping locks in
> > > the first place.
> > >
> > > > I'm not familiar with eBPF, but I guess that eBPF is similar.
> > > >
> > > > But please be aware that, I REPEAT AGAIN, I don't think either eBPF
> > > > or SystemTap will be suitable for dumping OOM information. An OOM
> > > > situation means that even a single page fault event cannot complete,
> > > > and temporary memory allocation for reading from the kernel or
> > > > writing to files cannot complete.
> > >
> > > And I repeat that no such reporting is going to write to files. This
> > > is an OOM path after all.
> > >
> > > > Therefore, we will need to hold all information in kernel memory
> > > > (without allocating any memory when the OOM event happens). Dynamic
> > > > hooks could hold a few lines of output, but not all the lines we
> > > > want. The only possible buffer which is preallocated and large
> > > > enough would be printk()'s buffer. Thus, I believe that we will have
> > > > to use printk() in order to dump OOM information. At that point,
> > >
> > > Yes, this is what I've had in mind.
> >
> > +1: It makes sense to keep the report going to the dmesg so it persists.
> > That is where it has always gone and there is no reason to change. You
> > can have several OOMs back to back and you'd like to retain the output.
> > All the information should be kept together in the OOM report.
> >
> > > > 	static bool (*oom_handler)(struct oom_control *oc) =
> > > > 		default_oom_killer;
> > > >
> > > > 	bool out_of_memory(struct oom_control *oc)
> > > > 	{
> > > > 		return oom_handler(oc);
> > > > 	}
> > > >
> > > > and letting in-tree kernel modules override the current OOM killer
> > > > would be the only practical choice (if we refuse adding many knobs).
> > >
> > > Or simply provide a hook with the oom_control to be called to report
> > > without replacing the whole oom killer behavior. That is not
> > > necessary.
> >
> > For a very simple addition, to add a line of output, this works. It
> > would still be nice to address the fact that the existing OOM report
> > prints all of the user processes or none. It would be nice to add some
> > control for that. That's what we did.
>
> Feel like you are going in circles to "sell" without any new information.
> If you need to deal with OOM that often, it might also be worth working
> with FB on oomd.
>
> https://github.com/facebookincubator/oomd
>
> It is well-known that kernel OOM could be slow and painful to deal with,
> so I don't buy the argument that kernel OOM recovery is better/faster
> than a kdump reboot.
>
> It is not unusual that when the system is triggering a kernel OOM, it is
> almost trashed/dead. Although developers are working hard to improve the
> recovery after OOM, there are still many error paths that are not going
> to survive, which would leak memory, introduce undefined behaviors,
> corrupt memory, etc.

But as you have pointed out, many people are happy with current OOM
processing, which is the report and recovery, so for those people a kdump
reboot is overkill. Making the OOM report at least optionally a bit more
informative has value. Also, making sure it doesn't produce excessive
output is desirable.

I do agree that for developers, having all the system state a kdump
provides is valuable, and as long as you can reproduce the OOM event that
works well. But that is not the common case, as has already been discussed.

Also, OOM events that are due to kernel bugs could leak memory over time
and cause a crash, true. But that is not what we typically see. In fact
we've had customers come back and report issues on systems that have been
in continuous operation for years. No point in crashing their system.
Linux, if properly maintained, is thankfully quite stable. But OOMs do
happen, and root causing them to prevent future occurrences is desired.
Re: [PATCH 00/10] OOM Debug print selection and additional information
On Thu, Aug 29, 2019 at 7:09 AM Tetsuo Handa wrote:
> On 2019/08/29 20:56, Michal Hocko wrote:
> > > But please be aware that, I REPEAT AGAIN, I don't think either eBPF or
> > > SystemTap will be suitable for dumping OOM information. An OOM
> > > situation means that even a single page fault event cannot complete,
> > > and temporary memory allocation for reading from the kernel or writing
> > > to files cannot complete.
> >
> > And I repeat that no such reporting is going to write to files. This is
> > an OOM path after all.
>
> The process that fetches from e.g. an eBPF event cannot involve a page
> fault. The front end for iovisor/bcc is a Python userspace process, and I
> think that such a process can't run under an OOM situation.
>
> > > Therefore, we will need to hold all information in kernel memory
> > > (without allocating any memory when the OOM event happens). Dynamic
> > > hooks could hold a few lines of output, but not all the lines we want.
> > > The only possible buffer which is preallocated and large enough would
> > > be printk()'s buffer. Thus, I believe that we will have to use
> > > printk() in order to dump OOM information. At that point,
> >
> > Yes, this is what I've had in mind.
>
> Probably I incorrectly took a shortcut. Dynamic hooks could hold a few
> lines of output, but dynamic hooks cannot hold all the lines when
> dump_tasks() reports 32000+ processes. We have to buffer all output in
> kernel memory because we can't complete even a page fault event triggered
> by the Python process monitoring the eBPF event (and writing the result to
> some log file or something) while out_of_memory() is in flight.
>
> And "set /proc/sys/vm/oom_dump_tasks to 0" is not the right reaction. What
> I'm saying is "we won't be able to hold output from dump_tasks() if output
> from dump_tasks() goes to a buffer preallocated for dynamic hooks". We
> have to find a way that can handle the worst case.

With the patch series we sent, the addition of the vmalloc entries print
required us to add a small piece of code to vmalloc.c, but we thought this
should be a core OOM reporting function. However, you want to limit which
vmalloc entries you print, probably to only very large memory users. For us
this generates just a few entries and has proven useful.

The changes to limit how many processes get printed, so you don't have the
all or nothing, would be nice to have. It would be easiest if there were a
standard mechanism to specify which entries to print, probably by a minimum
size, which is what we did. We used debugfs to set the controls, but sysctl
or some other mechanism could be used. The rest of what we did might be
implemented with hooks, as they only output a line or two, and I've already
gotten rid of information we had that was redundant.
Re: [PATCH 00/10] OOM Debug print selection and additional information
On Thu, 2019-08-29 at 08:03 -0700, Edward Chron wrote:
> On Thu, Aug 29, 2019 at 4:56 AM Michal Hocko wrote:
> > On Thu 29-08-19 19:14:46, Tetsuo Handa wrote:
> > > On 2019/08/29 16:11, Michal Hocko wrote:
> > > > On Wed 28-08-19 12:46:20, Edward Chron wrote:
> > > > > Our belief is if you really think eBPF is the preferred mechanism
> > > > > then move OOM reporting to an eBPF.
> > > >
> > > > I've said that all this additional information has to be dynamically
> > > > extensible rather than a part of the core kernel. Whether eBPF is
> > > > the suitable tool, I do not know. I haven't explored that. There are
> > > > other ways to inject code to the kernel: systemtap/kprobes, kernel
> > > > modules and probably others.
> > >
> > > As for SystemTap, guru mode (an expert mode which disables the
> > > protection provided by SystemTap, allowing the kernel to crash when
> > > something goes wrong) could be used for holding a spinlock. However,
> > > as far as I know, holding a mutex (or doing any operation that might
> > > sleep) from such dynamic hooks is not allowed. Also, we will need to
> > > export various symbols in order to allow access from such dynamic
> > > hooks.
> >
> > This is the oom path and it had better not use any sleeping locks in
> > the first place.
> >
> > > I'm not familiar with eBPF, but I guess that eBPF is similar.
> > >
> > > But please be aware that, I REPEAT AGAIN, I don't think either eBPF or
> > > SystemTap will be suitable for dumping OOM information. An OOM
> > > situation means that even a single page fault event cannot complete,
> > > and temporary memory allocation for reading from the kernel or writing
> > > to files cannot complete.
> >
> > And I repeat that no such reporting is going to write to files. This is
> > an OOM path after all.
> >
> > > Therefore, we will need to hold all information in kernel memory
> > > (without allocating any memory when the OOM event happens). Dynamic
> > > hooks could hold a few lines of output, but not all the lines we want.
> > > The only possible buffer which is preallocated and large enough would
> > > be printk()'s buffer. Thus, I believe that we will have to use
> > > printk() in order to dump OOM information. At that point,
> >
> > Yes, this is what I've had in mind.
>
> +1: It makes sense to keep the report going to the dmesg so it persists.
> That is where it has always gone and there is no reason to change. You can
> have several OOMs back to back and you'd like to retain the output. All
> the information should be kept together in the OOM report.
>
> > > 	static bool (*oom_handler)(struct oom_control *oc) =
> > > 		default_oom_killer;
> > >
> > > 	bool out_of_memory(struct oom_control *oc)
> > > 	{
> > > 		return oom_handler(oc);
> > > 	}
> > >
> > > and letting in-tree kernel modules override the current OOM killer
> > > would be the only practical choice (if we refuse adding many knobs).
> >
> > Or simply provide a hook with the oom_control to be called to report
> > without replacing the whole oom killer behavior. That is not necessary.
>
> For a very simple addition, to add a line of output, this works. It would
> still be nice to address the fact that the existing OOM report prints all
> of the user processes or none. It would be nice to add some control for
> that. That's what we did.

Feel like you are going in circles to "sell" without any new information.
If you need to deal with OOM that often, it might also be worth working
with FB on oomd.

https://github.com/facebookincubator/oomd

It is well-known that kernel OOM could be slow and painful to deal with, so
I don't buy the argument that kernel OOM recovery is better/faster than a
kdump reboot.

It is not unusual that when the system is triggering a kernel OOM, it is
almost trashed/dead. Although developers are working hard to improve the
recovery after OOM, there are still many error paths that are not going to
survive, which would leak memory, introduce undefined behaviors, corrupt
memory, etc.
Re: [PATCH 00/10] OOM Debug print selection and additional information
On Thu, Aug 29, 2019 at 12:11 AM Michal Hocko wrote:
> On Wed 28-08-19 12:46:20, Edward Chron wrote:
> [...]
> > Our belief is if you really think eBPF is the preferred mechanism
> > then move OOM reporting to an eBPF.
>
> I've said that all this additional information has to be dynamically
> extensible rather than a part of the core kernel. Whether eBPF is the
> suitable tool, I do not know. I haven't explored that. There are other
> ways to inject code to the kernel: systemtap/kprobes, kernel modules and
> probably others.

For simple code injections eBPF or a kprobe works, and a tracepoint would
help with that. For example, we could add our one line of task information,
which we find very useful, this way. For adding controls to limit output
for processes, slabs and vmalloc entries it would be harder to inject code.
Our solution was to use debugfs. An alternative could be to add a simple
sysctl if using debugfs is not appropriate. As our code illustrated, this
can be added without changing the existing report in any substantive way. I
think there is value in this, and it is core to what the OOM report should
provide. Additional items may be add-ons that are environment specific, but
these are OOM reporting essentials IMHO.

> > I mentioned this before but I will reiterate it here.
> >
> > So how do we get there? Let's look at the existing report, which we know
> > has issues.
> >
> > Other than a few essential OOM messages the OOM code should produce,
> > such as the "Killed process" message sequence being included, you could
> > have the entire OOM report moved to an eBPF script and therefore make it
> > customizable, configurable or, if you prefer, programmable.
>
> I believe we should keep the current reporting in place and allow
> additional information via a dynamic mechanism. Be it a registration
> mechanism that modules can hook into or another more dynamic way.
> The current reporting has proven to be useful in many typical oom
> situations in my past years of experience. It gives the rough state of
> the failing allocation, the MM subsystem, the tasks that are eligible and
> the task that is killed, so that you can understand why the event
> happened.
>
> I would argue that the eligible tasks should be printed on an opt-in
> basis, because this is more of a relict from the past when the victim
> selection was less deterministic. But that is another story.
>
> All the rest of dump_header should stay, IMHO, as a reasonable default
> and bare minimum.
>
> > Why? Because as we all agree, you'll never have a perfect OOM report.
> > So if you believe this, then, if you will, put your money where your
> > mouth is (so to speak) and make the entire OOM report an eBPF script.
> > We'd be willing to help with this.
> >
> > I'll give specific reasons why you want to do this.
> >
> > - Don't want to maintain a lot of code in the kernel (eBPF code doesn't
> >   count).
> > - Can't produce an ideal OOM report.
> > - Don't like configuring things but favor programmatic solutions.
> > - Agree the existing OOM report doesn't work for all environments.
> > - Want to allow flexibility but can't support everything people might
> >   want.
> > - Then installing an eBPF script for OOM reporting isn't an option, it's
> >   required.
>
> This is going into an extreme. We cannot serve all cases, but that is
> true for any other heuristics/reporting in the kernel. We do care about
> most.

Unfortunately my argument for this is moot; this can't be done with eBPF,
at least not now.

> > The last reason is huge for people who live in a world with large data
> > centers. Data center managers are very conservative. They don't want to
> > deviate from standard operating procedure unless absolutely necessary.
> > If loading an OOM report eBPF script is standard to get OOM reporting
> > output, then they'll accept that.
>
> I have already responded to this kind of argumentation elsewhere. This
> is not a relevant argument for any kernel implementation. This is a data
> center management process.
>
> --
> Michal Hocko
> SUSE Labs
Re: [PATCH 00/10] OOM Debug print selection and additional information
On Thu, Aug 29, 2019 at 4:56 AM Michal Hocko wrote:
> On Thu 29-08-19 19:14:46, Tetsuo Handa wrote:
> > On 2019/08/29 16:11, Michal Hocko wrote:
> > > On Wed 28-08-19 12:46:20, Edward Chron wrote:
> > > > Our belief is if you really think eBPF is the preferred mechanism
> > > > then move OOM reporting to an eBPF.
> > >
> > > I've said that all this additional information has to be dynamically
> > > extensible rather than a part of the core kernel. Whether eBPF is the
> > > suitable tool, I do not know. I haven't explored that. There are other
> > > ways to inject code to the kernel: systemtap/kprobes, kernel modules
> > > and probably others.
> >
> > As for SystemTap, guru mode (an expert mode which disables the
> > protection provided by SystemTap, allowing the kernel to crash when
> > something goes wrong) could be used for holding a spinlock. However, as
> > far as I know, holding a mutex (or doing any operation that might sleep)
> > from such dynamic hooks is not allowed. Also, we will need to export
> > various symbols in order to allow access from such dynamic hooks.
>
> This is the oom path and it had better not use any sleeping locks in
> the first place.
>
> > I'm not familiar with eBPF, but I guess that eBPF is similar.
> >
> > But please be aware that, I REPEAT AGAIN, I don't think either eBPF or
> > SystemTap will be suitable for dumping OOM information. An OOM situation
> > means that even a single page fault event cannot complete, and temporary
> > memory allocation for reading from the kernel or writing to files cannot
> > complete.
>
> And I repeat that no such reporting is going to write to files. This is
> an OOM path after all.
>
> > Therefore, we will need to hold all information in kernel memory
> > (without allocating any memory when the OOM event happens). Dynamic
> > hooks could hold a few lines of output, but not all the lines we want.
> > The only possible buffer which is preallocated and large enough would be
> > printk()'s buffer. Thus, I believe that we will have to use printk() in
> > order to dump OOM information. At that point,
>
> Yes, this is what I've had in mind.

+1: It makes sense to keep the report going to the dmesg so it persists.
That is where it has always gone and there is no reason to change. You can
have several OOMs back to back and you'd like to retain the output. All the
information should be kept together in the OOM report.

> > 	static bool (*oom_handler)(struct oom_control *oc) =
> > 		default_oom_killer;
> >
> > 	bool out_of_memory(struct oom_control *oc)
> > 	{
> > 		return oom_handler(oc);
> > 	}
> >
> > and letting in-tree kernel modules override the current OOM killer
> > would be the only practical choice (if we refuse adding many knobs).
>
> Or simply provide a hook with the oom_control to be called to report
> without replacing the whole oom killer behavior. That is not necessary.

For a very simple addition, to add a line of output, this works. It would
still be nice to address the fact that the existing OOM report prints all
of the user processes or none. It would be nice to add some control for
that. That's what we did.

> --
> Michal Hocko
> SUSE Labs
Re: [PATCH 00/10] OOM Debug print selection and additional information
On 2019/08/29 20:56, Michal Hocko wrote:
> > But please be aware that, I REPEAT AGAIN, I don't think either eBPF or
> > SystemTap will be suitable for dumping OOM information. An OOM situation
> > means that even a single page fault event cannot complete, and temporary
> > memory allocation for reading from the kernel or writing to files cannot
> > complete.
>
> And I repeat that no such reporting is going to write to files. This is
> an OOM path after all.

The process that fetches from e.g. an eBPF event cannot involve a page
fault. The front end for iovisor/bcc is a Python userspace process, and I
think that such a process can't run under an OOM situation.

> > Therefore, we will need to hold all information in kernel memory
> > (without allocating any memory when the OOM event happens). Dynamic
> > hooks could hold a few lines of output, but not all the lines we want.
> > The only possible buffer which is preallocated and large enough would be
> > printk()'s buffer. Thus, I believe that we will have to use printk() in
> > order to dump OOM information. At that point,
>
> Yes, this is what I've had in mind.

Probably I incorrectly took a shortcut. Dynamic hooks could hold a few
lines of output, but dynamic hooks cannot hold all the lines when
dump_tasks() reports 32000+ processes. We have to buffer all output in
kernel memory because we can't complete even a page fault event triggered
by the Python process monitoring the eBPF event (and writing the result to
some log file or something) while out_of_memory() is in flight.

And "set /proc/sys/vm/oom_dump_tasks to 0" is not the right reaction. What
I'm saying is "we won't be able to hold output from dump_tasks() if output
from dump_tasks() goes to a buffer preallocated for dynamic hooks". We have
to find a way that can handle the worst case.
Re: [PATCH 00/10] OOM Debug print selection and additional information
On Thu 29-08-19 19:14:46, Tetsuo Handa wrote: > On 2019/08/29 16:11, Michal Hocko wrote: > > On Wed 28-08-19 12:46:20, Edward Chron wrote: > >> Our belief is if you really think eBPF is the preferred mechanism > >> then move OOM reporting to an eBPF. > > > > I've said that all this additional information has to be dynamically > > extensible rather than a part of the core kernel. Whether eBPF is the > > suitable tool, I do not know. I haven't explored that. There are other > > ways to inject code to the kernel. systemtap/kprobes, kernel modules and > > probably others. > > As for SystemTap, guru mode (an expert mode which disables protection provided > by SystemTap; allowing kernel to crash when something went wrong) could be > used > for holding spinlock. However, as far as I know, holding mutex (or doing any > operation that might sleep) from such dynamic hooks is not allowed. Also we > will > need to export various symbols in order to allow access from such dynamic > hooks. This is the oom path and it should better not use any sleeping locks in the first place. > I'm not familiar with eBPF, but I guess that eBPF is similar. > > But please be aware that, I REPEAT AGAIN, I don't think neither eBPF nor > SystemTap will be suitable for dumping OOM information. OOM situation means > that even single page fault event cannot complete, and temporary memory > allocation for reading from kernel or writing to files cannot complete. And I repeat that no such reporting is going to write to files. This is an OOM path afterall. > Therefore, we will need to hold all information in kernel memory (without > allocating any memory when OOM event happened). Dynamic hooks could hold > a few lines of output, but not all lines we want. The only possible buffer > which is preallocated and large enough would be printk()'s buffer. Thus, > I believe that we will have to use printk() in order to dump OOM information. > At that point, Yes, this is what I've had in mind. 
> > static bool (*oom_handler)(struct oom_control *oc) = default_oom_killer; > > bool out_of_memory(struct oom_control *oc) > { > return oom_handler(oc); > } > > and let in-tree kernel modules override current OOM killer would be > the only practical choice (if we refuse adding many knobs). Or simply provide a hook with the oom_control to be called to report without replacing the whole oom killer behavior. That is not necessary. -- Michal Hocko SUSE Labs
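For illustration, the two variants in play here — Tetsuo's replaceable oom_handler and Michal's report-only hook — can be sketched side by side. This is a user-space model, not kernel code: struct oom_control is reduced to stand-in fields, and oom_report_hook is a hypothetical name for the hook Michal describes, not an existing kernel symbol.

```c
#include <stdbool.h>
#include <stdio.h>

/* Stand-in for the kernel's struct oom_control (illustrative fields only). */
struct oom_control {
	int order;                /* allocation order that failed */
	unsigned long totalpages; /* memory visible to this OOM context */
};

/* Michal's variant: an optional report-only hook, called in addition to
 * the default killer rather than replacing it. */
static void (*oom_report_hook)(const struct oom_control *oc);

static bool default_oom_killer(struct oom_control *oc)
{
	printf("oom: order=%d totalpages=%lu\n", oc->order, oc->totalpages);
	return true;
}

/* Tetsuo's variant: a replaceable handler that in-tree modules could
 * override wholesale. */
static bool (*oom_handler)(struct oom_control *oc) = default_oom_killer;

bool out_of_memory(struct oom_control *oc)
{
	if (oom_report_hook)
		oom_report_hook(oc); /* extra reporting, no behavior change */
	return oom_handler(oc);      /* normal victim selection and kill */
}
```

The report-only hook keeps the default selection logic intact, which is the distinction Michal draws: extensibility for the report, not for the killing policy.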
Re: [PATCH 00/10] OOM Debug print selection and additional information
On 2019/08/29 16:11, Michal Hocko wrote: > On Wed 28-08-19 12:46:20, Edward Chron wrote: >> Our belief is if you really think eBPF is the preferred mechanism >> then move OOM reporting to an eBPF. > > I've said that all this additional information has to be dynamically > extensible rather than a part of the core kernel. Whether eBPF is the > suitable tool, I do not know. I haven't explored that. There are other > ways to inject code to the kernel. systemtap/kprobes, kernel modules and > probably others. As for SystemTap, guru mode (an expert mode which disables protection provided by SystemTap, allowing the kernel to crash when something goes wrong) could be used for holding a spinlock. However, as far as I know, holding a mutex (or doing any operation that might sleep) from such dynamic hooks is not allowed. Also, we will need to export various symbols in order to allow access from such dynamic hooks. I'm not familiar with eBPF, but I guess that eBPF is similar. But please be aware that, I REPEAT AGAIN, I don't think either eBPF or SystemTap will be suitable for dumping OOM information. An OOM situation means that even a single page fault event cannot complete, and temporary memory allocation for reading from the kernel or writing to files cannot complete. Therefore, we will need to hold all information in kernel memory (without allocating any memory when the OOM event happens). Dynamic hooks could hold a few lines of output, but not all the lines we want. The only possible buffer which is preallocated and large enough would be printk()'s buffer. Thus, I believe that we will have to use printk() in order to dump OOM information. At that point, static bool (*oom_handler)(struct oom_control *oc) = default_oom_killer; bool out_of_memory(struct oom_control *oc) { return oom_handler(oc); } and letting in-tree kernel modules override the current OOM killer would be the only practical choice (if we refuse to add many knobs).
Re: [PATCH 00/10] OOM Debug print selection and additional information
On Wed 28-08-19 12:46:20, Edward Chron wrote: [...] > Our belief is if you really think eBPF is the preferred mechanism > then move OOM reporting to an eBPF. I've said that all this additional information has to be dynamically extensible rather than a part of the core kernel. Whether eBPF is the suitable tool, I do not know. I haven't explored that. There are other ways to inject code to the kernel. systemtap/kprobes, kernel modules and probably others. > I mentioned this before but I will reiterate this here. > > So how do we get there? Let's look at the existing report which we know > has issues. > > Other than a few essential OOM messages the OOM code should produce, > such as the Killed process message sequence being included, > you could have the entire OOM report moved to an eBPF script and > therefore make it customizable, configurable or if you prefer programmable. I believe we should keep the current reporting in place and allow additional information via a dynamic mechanism. Be it a registration mechanism that modules can hook into or some other more dynamic way. The current reporting has proven to be useful in many typical oom situations in my past years of experience. It gives the rough state of the failing allocation, MM subsystem, tasks that are eligible and the task that is killed so that you can understand why the event happened. I would argue that the eligible tasks should be printed on an opt-in basis because this is more of a relic from the past when the victim selection was less deterministic. But that is another story. All the rest of dump_header should stay IMHO as a reasonable default and bare minimum. > Why? Because as we all agree, you'll never have a perfect OOM Report. > So if you believe this, then if you will, put your money where your mouth > is (so to speak) and make the entire OOM Report an eBPF script. > We'd be willing to help with this. > > I'll give specific reasons why you want to do this. 
> >- Don't want to maintain a lot of code in the kernel (eBPF code doesn't >count). >- Can't produce an ideal OOM report. >- Don't like configuring things but favor programmatic solutions. >- Agree the existing OOM report doesn't work for all environments. >- Want to allow flexibility but can't support everything people might >want. >- Then installing an eBPF for OOM Reporting isn't an option, it's >required. This is going into an extreme. We cannot serve all cases but that is true for any other heuristics/reporting in the kernel. We do care about most of them. > The last reason is huge for people who live in a world with large data > centers. Data center managers are very conservative. They don't want to > deviate from standard operating procedure unless absolutely necessary. > If loading an OOM Report eBPF is standard to get OOM Reporting output, > then they'll accept that. I have already responded to this kind of argumentation elsewhere. This is not a relevant argument for any kernel implementation. This is a data center process management problem. -- Michal Hocko SUSE Labs
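A registration mechanism that modules can hook into, as Michal suggests, could be as small as a callback chain, loosely modeled on kernel notifier chains. The sketch below is user-space code and every name in it (oom_reporter, register_oom_reporter, run_oom_reporters) is illustrative, not existing kernel API; callbacks would have to be registered in advance and must not allocate memory on the OOM path.

```c
#include <stddef.h>
#include <stdio.h>

/* One registered reporter; a module would embed this in its own state. */
struct oom_reporter {
	void (*report)(void);      /* dump extra state; must not allocate */
	struct oom_reporter *next;
};

static struct oom_reporter *oom_reporters;

/* Called at module init time, well before any OOM event occurs. */
void register_oom_reporter(struct oom_reporter *r)
{
	r->next = oom_reporters;
	oom_reporters = r;
}

/* Called from the OOM path after the base report has been printed;
 * returns the number of reporters that ran. */
int run_oom_reporters(void)
{
	int n = 0;

	for (struct oom_reporter *r = oom_reporters; r; r = r->next) {
		r->report();
		n++;
	}
	return n;
}

/* Example reporter a module might register. */
static void demo_report(void)
{
	printf("demo: extra OOM state\n");
}
```

Because registration happens up front and traversal only walks a preallocated list, this fits the constraint discussed above that nothing on the OOM path may depend on memory allocation or userspace.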
Re: [PATCH 00/10] OOM Debug print selection and additional information
On Wed, Aug 28, 2019 at 1:04 PM Edward Chron wrote: > > On Wed, Aug 28, 2019 at 3:12 AM Tetsuo Handa > wrote: > > > > On 2019/08/28 16:08, Michal Hocko wrote: > > > On Tue 27-08-19 19:47:22, Edward Chron wrote: > > >> For production systems installing and updating EBPF scripts may someday > > >> be very common, but I wonder how data center managers feel about it now? > > >> Developers are very excited about it and it is a very powerful tool but > > >> can I > > >> get permission to add or replace an existing EBPF on production systems? > > > > > > I am not sure I understand. There must be somebody trusted to take care > > > of systems, right? > > > > > > > Speak of my cases, those who take care of their systems are not developers. > > And they afraid changing code that runs in kernel mode. They unlikely give > > permission to install SystemTap/eBPF scripts. As a result, in many cases, > > the root cause cannot be identified. > > +1. Exactly. The only thing we could think of Tetsuo is if Linux OOM Reporting > uses a an eBPF script then systems have to load them to get any kind of > meaningful report. Frankly, if using eBPF is the route to go than essentially > the whole OOM reporting should go there. We can adjust as we need and > have precedent for wanting to load the script. That's the best we could come > up with. > > > > > Moreover, we are talking about OOM situations, where we can't expect > > userspace > > processes to work properly. We need to dump information we want, without > > counting on userspace processes, before sending SIGKILL. > > +1. We've tried and as you point out and for best results the kernel > has to provide > the state. > > Again a full system dump would be wonderful, but taking a full dump for > every OOM event on production systems? I am not nearly a good enough salesman > to sell that one. So we need an alternate mechanism. 
> > If we can't agree on some sort of extensible, configurable approach then put > the standard OOM Report in eBPF and make it mandatory to load it so we can > justify having to do that. Linux should load it automatically. > We'll just make a few changes and additions as needed. > > Sounds like a plan that we could live with. > Would be interested if this works for others as well. One further comment. In talking with my colleagues here who know eBPF much better than I do, it may not be possible to implement something this complicated with eBPF. If that is in fact the case, then we'd have to try and hook the OOM Reporting code with tracepoints, similar to kprobes, only we want to do more than add counters: we want to change the flow to skip small output entries that aren't worth printing. If this isn't feasible with eBPF, then some derivative of our approach, or enhancing the OOM output code directly, seems like the best option. Will have to investigate this further.
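The flow change described here — skipping small output entries that aren't worth printing — is essentially a threshold filter over candidate report lines. A minimal user-space sketch, assuming a per-entry size and a configurable minimum (which the patch series would expose as a debugfs knob); the struct and function names are made up for illustration:

```c
#include <stdio.h>

/* One candidate line of OOM-report output (e.g. a slab cache or process). */
struct report_entry {
	const char *name;
	unsigned long kbytes;   /* memory attributed to this entry */
};

/* Print only entries using at least min_kbytes; return how many printed.
 * In the proposed design, min_kbytes would be read from debugfs. */
int print_large_entries(const struct report_entry *e, int n,
			unsigned long min_kbytes)
{
	int printed = 0;

	for (int i = 0; i < n; i++) {
		if (e[i].kbytes < min_kbytes)
			continue;   /* skip entries not worth printing */
		printf("%-16s %8lu kB\n", e[i].name, e[i].kbytes);
		printed++;
	}
	return printed;
}
```

Setting the threshold to zero reproduces today's print-everything behavior, which is why a knob like this can stay invisible to users who are happy with the current report.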
Re: [PATCH 00/10] OOM Debug print selection and additional information
On Wed, 2019-08-28 at 14:17 -0700, Edward Chron wrote: > On Wed, Aug 28, 2019 at 1:18 PM Qian Cai wrote: > > > > On Wed, 2019-08-28 at 12:46 -0700, Edward Chron wrote: > > > But with the caveat that running a eBPF script that it isn't standard > > > Linux > > > operating procedure, at this point in time any way will not be well > > > received in the data center. > > > > Can't you get your eBPF scripts into the BCC project? As far I can tell, the > > BCC > > has been included in several distros already, and then it will become a part > > of > > standard linux toolkits. > > > > > > > > Our belief is if you really think eBPF is the preferred mechanism > > > then move OOM reporting to an eBPF. > > > I mentioned this before but I will reiterate this here. > > > > On the other hand, it seems many people are happy with the simple kernel OOM > > report we have here. Not saying the current situation is perfect. On the top > > of > > that, some people are using kdump, and some people have resource monitoring > > to > > warn about potential memory overcommits before OOM kicks in etc. > > Assuming you can implement your existing report in eBPF then those who like > the > current output would still get the current output. Same with the patches we > sent > upstream, nothing in the report changes by default. So no problems for those > who > are happy, they'll still be happy. I don't think it makes any sense to rewrite the existing code to depend on eBPF though.
Re: [PATCH 00/10] OOM Debug print selection and additional information
On Wed, Aug 28, 2019 at 1:18 PM Qian Cai wrote: > > On Wed, 2019-08-28 at 12:46 -0700, Edward Chron wrote: > > But with the caveat that running a eBPF script that it isn't standard Linux > > operating procedure, at this point in time any way will not be well > > received in the data center. > > Can't you get your eBPF scripts into the BCC project? As far I can tell, the > BCC > has been included in several distros already, and then it will become a part > of > standard linux toolkits. > > > > > Our belief is if you really think eBPF is the preferred mechanism > > then move OOM reporting to an eBPF. > > I mentioned this before but I will reiterate this here. > > On the other hand, it seems many people are happy with the simple kernel OOM > report we have here. Not saying the current situation is perfect. On the top > of > that, some people are using kdump, and some people have resource monitoring to > warn about potential memory overcommits before OOM kicks in etc. Assuming you can implement your existing report in eBPF then those who like the current output would still get the current output. Same with the patches we sent upstream, nothing in the report changes by default. So no problems for those who are happy, they'll still be happy.
Re: [PATCH 00/10] OOM Debug print selection and additional information
On Wed, 2019-08-28 at 12:46 -0700, Edward Chron wrote: > But with the caveat that running an eBPF script isn't standard Linux > operating procedure at this point in time anyway, and will not be well > received in the data center. Can't you get your eBPF scripts into the BCC project? As far as I can tell, the BCC has been included in several distros already, and then it will become a part of standard Linux toolkits. > > Our belief is if you really think eBPF is the preferred mechanism > then move OOM reporting to an eBPF. > I mentioned this before but I will reiterate this here. On the other hand, it seems many people are happy with the simple kernel OOM report we have here. Not saying the current situation is perfect. On top of that, some people are using kdump, and some people have resource monitoring to warn about potential memory overcommits before OOM kicks in etc.
Re: [PATCH 00/10] OOM Debug print selection and additional information
On Wed, Aug 28, 2019 at 3:12 AM Tetsuo Handa wrote: > > On 2019/08/28 16:08, Michal Hocko wrote: > > On Tue 27-08-19 19:47:22, Edward Chron wrote: > >> For production systems installing and updating EBPF scripts may someday > >> be very common, but I wonder how data center managers feel about it now? > >> Developers are very excited about it and it is a very powerful tool but > >> can I > >> get permission to add or replace an existing EBPF on production systems? > > > > I am not sure I understand. There must be somebody trusted to take care > > of systems, right? > > > > Speak of my cases, those who take care of their systems are not developers. > And they afraid changing code that runs in kernel mode. They unlikely give > permission to install SystemTap/eBPF scripts. As a result, in many cases, > the root cause cannot be identified. +1. Exactly. The only thing we could think of, Tetsuo, is if Linux OOM Reporting uses an eBPF script, then systems have to load it to get any kind of meaningful report. Frankly, if using eBPF is the route to go, then essentially the whole OOM reporting should go there. We can adjust as we need and have precedent for wanting to load the script. That's the best we could come up with. > > Moreover, we are talking about OOM situations, where we can't expect userspace > processes to work properly. We need to dump information we want, without > counting on userspace processes, before sending SIGKILL. +1. We've tried, and as you point out, for best results the kernel has to provide the state. Again, a full system dump would be wonderful, but taking a full dump for every OOM event on production systems? I am not nearly a good enough salesman to sell that one. So we need an alternate mechanism. If we can't agree on some sort of extensible, configurable approach then put the standard OOM Report in eBPF and make it mandatory to load it so we can justify having to do that. Linux should load it automatically. 
We'll just make a few changes and additions as needed. Sounds like a plan that we could live with. Would be interested if this works for others as well.
Re: [PATCH 00/10] OOM Debug print selection and additional information
On Wed 28-08-19 19:56:58, Tetsuo Handa wrote: > On 2019/08/28 19:32, Michal Hocko wrote: > >> Speak of my cases, those who take care of their systems are not developers. > >> And they afraid changing code that runs in kernel mode. They unlikely give > >> permission to install SystemTap/eBPF scripts. As a result, in many cases, > >> the root cause cannot be identified. > > > > Which is something I would call a process problem more than a kernel > > one. Really if you need to debug a problem you really have to trust > > those who can debug that for you. We are not going to take tons of code > > to the kernel just because somebody is afraid to run a diagnostic. > > > > This is a problem of kernel development process. I disagree. Expecting that any larger project can come with (close to) _full_, ready-to-use introspection built in is just insane. We are trying to help with generally useful information but you simply cannot cover most existing failure paths. > >> Moreover, we are talking about OOM situations, where we can't expect > >> userspace > >> processes to work properly. We need to dump information we want, without > >> counting on userspace processes, before sending SIGKILL. > > > > Yes, this is an inherent assumption I was making and that means that > > whatever dynamic hooks would have to be registered in advance. > > > > No. I'm saying that neither static hooks nor dynamic hooks can work as > expected if they count on userspace processes. Registering in advance is > irrelevant. Whether it can work without userspace processes is relevant. I am not saying otherwise. I do not expect any userspace process to dump any information or read it from elsewhere than from the kernel log. > Also, out-of-tree codes tend to become defunctional. We are trying to debug > problems caused by in-tree code. 
Breaking out-of-tree debugging code just > because in-tree code developers don't want to pay the burden of maintaining > code for debugging problems caused by in-tree code is a very bad idea. This is a simple math of cost/benefit. The maintenance cost is not free and paying it for odd cases most people do not care about is simply not sustainable, we simply do not have that much of a man power. -- Michal Hocko SUSE Labs
Re: [PATCH 00/10] OOM Debug print selection and additional information
On 2019/08/28 19:32, Michal Hocko wrote: >> Speak of my cases, those who take care of their systems are not developers. >> And they afraid changing code that runs in kernel mode. They unlikely give >> permission to install SystemTap/eBPF scripts. As a result, in many cases, >> the root cause cannot be identified. > > Which is something I would call a process problem more than a kernel > one. Really if you need to debug a problem you really have to trust > those who can debug that for you. We are not going to take tons of code > to the kernel just because somebody is afraid to run a diagnostic. > This is a problem of kernel development process. >> Moreover, we are talking about OOM situations, where we can't expect >> userspace >> processes to work properly. We need to dump information we want, without >> counting on userspace processes, before sending SIGKILL. > > Yes, this is an inherent assumption I was making and that means that > whatever dynamic hooks would have to be registered in advance. > No. I'm saying that neither static hooks nor dynamic hooks can work as expected if they count on userspace processes. Registering in advance is irrelevant. Whether it can work without userspace processes is relevant. Also, out-of-tree codes tend to become defunctional. We are trying to debug problems caused by in-tree code. Breaking out-of-tree debugging code just because in-tree code developers don't want to pay the burden of maintaining code for debugging problems caused by in-tree code is a very bad idea.
Re: [PATCH 00/10] OOM Debug print selection and additional information
On Wed 28-08-19 19:12:41, Tetsuo Handa wrote: > On 2019/08/28 16:08, Michal Hocko wrote: > > On Tue 27-08-19 19:47:22, Edward Chron wrote: > >> For production systems installing and updating EBPF scripts may someday > >> be very common, but I wonder how data center managers feel about it now? > >> Developers are very excited about it and it is a very powerful tool but > >> can I > >> get permission to add or replace an existing EBPF on production systems? > > > > I am not sure I understand. There must be somebody trusted to take care > > of systems, right? > > > > Speak of my cases, those who take care of their systems are not developers. > And they afraid changing code that runs in kernel mode. They unlikely give > permission to install SystemTap/eBPF scripts. As a result, in many cases, > the root cause cannot be identified. Which is something I would call a process problem more than a kernel one. Really if you need to debug a problem you really have to trust those who can debug that for you. We are not going to take tons of code to the kernel just because somebody is afraid to run a diagnostic. > Moreover, we are talking about OOM situations, where we can't expect userspace > processes to work properly. We need to dump information we want, without > counting on userspace processes, before sending SIGKILL. Yes, this is an inherent assumption I was making and that means that whatever dynamic hooks would have to be registered in advance. -- Michal Hocko SUSE Labs
Re: [PATCH 00/10] OOM Debug print selection and additional information
On 2019/08/28 16:08, Michal Hocko wrote: > On Tue 27-08-19 19:47:22, Edward Chron wrote: >> For production systems installing and updating EBPF scripts may someday >> be very common, but I wonder how data center managers feel about it now? >> Developers are very excited about it and it is a very powerful tool but can I >> get permission to add or replace an existing EBPF on production systems? > > I am not sure I understand. There must be somebody trusted to take care > of systems, right? > Speaking of my cases, those who take care of their systems are not developers. And they are afraid of changing code that runs in kernel mode. They are unlikely to give permission to install SystemTap/eBPF scripts. As a result, in many cases, the root cause cannot be identified. Moreover, we are talking about OOM situations, where we can't expect userspace processes to work properly. We need to dump the information we want, without counting on userspace processes, before sending SIGKILL.
Re: [PATCH 00/10] OOM Debug print selection and additional information
On Tue 27-08-19 19:47:22, Edward Chron wrote: > On Tue, Aug 27, 2019 at 6:32 PM Qian Cai wrote: > > > > > > > > > On Aug 27, 2019, at 9:13 PM, Edward Chron wrote: > > > > > > On Tue, Aug 27, 2019 at 5:50 PM Qian Cai wrote: > > >> > > >> > > >> > > >>> On Aug 27, 2019, at 8:23 PM, Edward Chron wrote: > > >>> > > >>> > > >>> > > >>> On Tue, Aug 27, 2019 at 5:40 AM Qian Cai wrote: > > >>> On Mon, 2019-08-26 at 12:36 -0700, Edward Chron wrote: > > This patch series provides code that works as a debug option through > > debugfs to provide additional controls to limit how much information > > gets printed when an OOM event occurs and or optionally print > > additional > > information about slab usage, vmalloc allocations, user process memory > > usage, the number of processes / tasks and some summary information > > about these tasks (number runable, i/o wait), system information > > (#CPUs, Kernel Version and other useful state of the system), > > ARP and ND Cache entry information. > > > > Linux OOM can optionally provide a lot of information, what's missing? > > -- > > Linux provides a variety of detailed information when an OOM event > > occurs > > but has limited options to control how much output is produced. The > > system related information is produced unconditionally and limited per > > user process information is produced as a default enabled option. The > > per user process information may be disabled. > > > > Slab usage information was recently added and is output only if slab > > usage exceeds user memory usage. > > > > Many OOM events are due to user application memory usage sometimes in > > combination with the use of kernel resource usage that exceeds what is > > expected memory usage. Detailed information about how memory was being > > used when the event occurred may be required to identify the root cause > > of the OOM event. 
> > > > However, some environments are very large and printing all of the > > information about processes, slabs and or vmalloc allocations may > > not be feasible. For other environments printing as much information > > about these as possible may be needed to root cause OOM events. > > > > >>> > > >>> For more in-depth analysis of OOM events, people could use kdump to > > >>> save a > > >>> vmcore by setting "panic_on_oom", and then use the crash utility to > > >>> analysis the > > >>> vmcore which contains pretty much all the information you need. > > >>> > > >>> Certainly, this is the ideal. A full system dump would give you the > > >>> maximum amount of > > >>> information. > > >>> > > >>> Unfortunately some environments may lack space to store the dump, > > >> > > >> Kdump usually also support dumping to a remote target via NFS, SSH etc > > >> > > >>> let alone the time to dump the storage contents and restart the system. > > >>> Some > > >> > > >> There is also “makedumpfile” that could compress and filter unwanted > > >> memory to reduce > > >> the vmcore size and speed up the dumping process by utilizing > > >> multi-threads. > > >> > > >>> systems can take many minutes to fully boot up, to reset and > > >>> reinitialize all the > > >>> devices. So unfortunately this is not always an option, and we need an > > >>> OOM Report. > > >> > > >> I am not sure how the system needs some minutes to reboot would be > > >> relevant for the > > >> discussion here. The idea is to save a vmcore and it can be analyzed > > >> offline even on > > >> another system as long as it having a matching “vmlinux.". > > >> > > >> > > > > > > If selecting a dump on an OOM event doesn't reboot the system and if > > > it runs fast enough such > > > that it doesn't slow processing enough to appreciably effect the > > > system's responsiveness then > > > then it would be ideal solution. 
For some it would be over kill but > > > since it is an option it is a > > > choice to consider or not. > > > > It sounds like you are looking for more of this, > > If you want to supplement the OOM Report and keep the information > together than you could use EBPF to do that. If that really is the > preference it might make sense to put the entire report as an EBPF > script than you can modify the script however you choose. That would > be very flexible. You can change your configuration on the fly. As > long as it has access to everything you need it should work. > > Michal would know what direction OOM is headed and if he thinks that fits with > where things are headed. It seems we have landed in the similar thinking here. As mentioned in my earlier email in this thread I can see the extensibility to be achieved by eBPF. Essentially we would have a base form of the oom report like now and scripts would then hook in there to provide whatever a specific
Re: [PATCH 00/10] OOM Debug print selection and additional information
On Tue 27-08-19 18:07:54, Edward Chron wrote: > On Tue, Aug 27, 2019 at 12:15 AM Michal Hocko wrote: > > > > On Mon 26-08-19 12:36:28, Edward Chron wrote: > > [...] > > > Extensibility using OOM debug options > > > - > > > What is needed is an extensible system to optionally configure > > > debug options as needed and to then dynamically enable and disable > > > them. Also for options that produce multiple lines of entry based > > > output, to configure which entries to print based on how much > > > memory they use (or optionally all the entries). > > > > With a patch this large and adding a lot of new stuff we need a more > > detailed usecases described I believe. > > I guess it would make sense to explain the motivation for each OOM Debug > option I've sent separately. > I see there are comments on the patches; I will try and add more information there. > > An overview would be that we've been collecting information on OOMs > over the last 12 years or so. > These are from switches, other embedded devices, servers both large and small. > We ask for feedback on what information was helpful or could be helpful. > We try and add it to make root causing issues easier. > > These OOM debug options are some of the options we've created. > I didn't port all of them to 5.3 but these are representative. > Our latest kernel is a bit behind 5.3. > > > > > > > [...] > > > > > Use of debugfs to allow dynamic controls > > > > > > By providing a debugfs interface that allows options to be configured, > > > enabled and where appropriate to set a minimum size for selecting > > > entries to print, the output produced when an OOM event occurs can be > > > dynamically adjusted to produce as little or as much detail as needed > > > for a given system. > > > > Who is going to consume this information and why would that consumer be > > unreasonable to demand further maintenance of that information in future > > releases? 
In other words debugfs is not considered a stable API which is > > OK here but the side effect of any change to these files results in user > > visible behavior and we consider that more or less stable as long as > > there are consumers. > > > > > OOM debug options can be added to the base code as needed. > > > > > > Currently we have the following OOM debug options defined: > > > > > > * System State Summary > > > > > > One line of output that includes: > > > - Uptime (days, hour, minutes, seconds) > > > > We do have timestamps in the log so why is this needed? > > > Here is how an OOM report looks when we get it to look at: > > Aug 26 09:06:34 coronado kernel: oomprocs invoked oom-killer: > gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, > oom_score_adj=1000 > Aug 26 09:06:34 coronado kernel: CPU: 1 PID: 2795 Comm: oomprocs Not > tainted 5.3.0-rc6+ #33 > Aug 26 09:06:34 coronado kernel: Hardware name: Compulab Ltd. > IPC3/IPC3, BIOS 5.12_IPC3K.PRD.0.25.7 08/09/2018 > > This shows the date and time, not the time of the last boot. The > /var/log/messages output is what we often have to look at, not raw > dmesgs. This looks more like a configuration of the logging than a kernel problem. Kernel does provide timestamps for logs. E.g. $ tail -n1 /var/log/kern.log Aug 28 08:27:46 tiehlicka kernel: <1054>[336340.954345] systemd-udevd[7971]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable. [...] 
> > > Example output when configured and enabled: > > > > > > Jul 22 15:20:57 yoursystem kernel: Threads:530 Processes:279 > > > forks_since_boot:2786 procs_runable:2 procs_iowait:0 > > > > > > * ARP Table and/or Neighbour Discovery Table Summary > > > -- > > > One line of output each for ARP and ND that includes: > > > - Table name > > > - Table size (max # entries) > > > - Key Length > > > - Entry Size > > > - Number of Entries > > > - Last Flush (in seconds) > > > - hash grows > > > - entry allocations > > > - entry destroys > > > - Number lookups > > > - Number of lookup hits > > > - Resolution failures > > > - Garbage Collection Forced Runs > > > - Table Full > > > - Proxy Queue Length > > > > > > Example output when configured and enabled (for both): > > > > > > ... kernel: neighbour: Table: arp_tbl size: 256 keyLen: 4 entrySize: > > > 360 entries: 9 lastFlush: 1721s hGrows: 1 allocs: 9 > > > destroys: 0 lookups: 204 hits: 199 resFailed:38 > > > gcRuns/Forced: 111 / 0 tblFull: 0 proxyQlen: 0 > > > > > > ... kernel: neighbour: Table: nd_tbl size: 128 keyLen: 16 entrySize: > > > 368 entries: 6 lastFlush: 1720s hGrows: 0 allocs: 7 > > > destroys: 1 lookups: 0 hits: 0 resFailed: 0 > > > gcRuns/Forced: 110 / 0 tblFull: 0 proxyQlen: 0 > > > > Again, why is this needed particularly for the OOM event? I do > >
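As a rough model of how the quoted one-line table summaries could be produced, the sketch below formats a snapshot of counters with a single snprintf into a caller-supplied buffer (no allocation, matching the OOM-path constraint discussed earlier in the thread). The neigh_summary struct and its field set are illustrative stand-ins, not the kernel's struct neigh_table.

```c
#include <stdio.h>
#include <stddef.h>

/* Illustrative snapshot of neighbour-table counters (not a kernel struct). */
struct neigh_summary {
	const char *name;   /* e.g. "arp_tbl" or "nd_tbl" */
	int size;           /* max # entries */
	int key_len;
	int entry_size;
	int entries;
	long last_flush_s;  /* seconds since last flush */
	unsigned long lookups, hits;
};

/* Format one summary line into buf; returns snprintf's result. */
int format_neigh_summary(char *buf, size_t len, const struct neigh_summary *s)
{
	return snprintf(buf, len,
		"neighbour: Table: %s size: %d keyLen: %d entrySize: %d "
		"entries: %d lastFlush: %lds lookups: %lu hits: %lu",
		s->name, s->size, s->key_len, s->entry_size,
		s->entries, s->last_flush_s, s->lookups, s->hits);
}
```

Keeping each table to one preformatted line is what makes this kind of option cheap enough to leave enabled on production systems.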
Re: [PATCH 00/10] OOM Debug print selection and additional information
On Tue, Aug 27, 2019 at 6:32 PM Qian Cai wrote: > > > > > On Aug 27, 2019, at 9:13 PM, Edward Chron wrote: > > > > On Tue, Aug 27, 2019 at 5:50 PM Qian Cai wrote: > >> > >> > >> > >>> On Aug 27, 2019, at 8:23 PM, Edward Chron wrote: > >>> > >>> > >>> > >>> On Tue, Aug 27, 2019 at 5:40 AM Qian Cai wrote: > >>> On Mon, 2019-08-26 at 12:36 -0700, Edward Chron wrote: > This patch series provides code that works as a debug option through > debugfs to provide additional controls to limit how much information > gets printed when an OOM event occurs and or optionally print additional > information about slab usage, vmalloc allocations, user process memory > usage, the number of processes / tasks and some summary information > about these tasks (number runable, i/o wait), system information > (#CPUs, Kernel Version and other useful state of the system), > ARP and ND Cache entry information. > > Linux OOM can optionally provide a lot of information, what's missing? > -- > Linux provides a variety of detailed information when an OOM event occurs > but has limited options to control how much output is produced. The > system related information is produced unconditionally and limited per > user process information is produced as a default enabled option. The > per user process information may be disabled. > > Slab usage information was recently added and is output only if slab > usage exceeds user memory usage. > > Many OOM events are due to user application memory usage sometimes in > combination with the use of kernel resource usage that exceeds what is > expected memory usage. Detailed information about how memory was being > used when the event occurred may be required to identify the root cause > of the OOM event. > > However, some environments are very large and printing all of the > information about processes, slabs and or vmalloc allocations may > not be feasible. 
For other environments printing as much information > about these as possible may be needed to root cause OOM events. > > >>> > >>> For more in-depth analysis of OOM events, people could use kdump to save a > >>> vmcore by setting "panic_on_oom", and then use the crash utility to > >>> analysis the > >>> vmcore which contains pretty much all the information you need. > >>> > >>> Certainly, this is the ideal. A full system dump would give you the > >>> maximum amount of > >>> information. > >>> > >>> Unfortunately some environments may lack space to store the dump, > >> > >> Kdump usually also support dumping to a remote target via NFS, SSH etc > >> > >>> let alone the time to dump the storage contents and restart the system. > >>> Some > >> > >> There is also “makedumpfile” that could compress and filter unwanted > >> memory to reduce > >> the vmcore size and speed up the dumping process by utilizing > >> multi-threads. > >> > >>> systems can take many minutes to fully boot up, to reset and reinitialize > >>> all the > >>> devices. So unfortunately this is not always an option, and we need an > >>> OOM Report. > >> > >> I am not sure how the system needs some minutes to reboot would be > >> relevant for the > >> discussion here. The idea is to save a vmcore and it can be analyzed > >> offline even on > >> another system as long as it having a matching “vmlinux.". > >> > >> > > > > If selecting a dump on an OOM event doesn't reboot the system and if > > it runs fast enough such > > that it doesn't slow processing enough to appreciably effect the > > system's responsiveness then > > then it would be ideal solution. For some it would be over kill but > > since it is an option it is a > > choice to consider or not. > > It sounds like you are looking for more of this, If you want to supplement the OOM Report and keep the information together than you could use EBPF to do that. 
If that really is the preference it might make sense to implement the entire report as an eBPF script; then you can modify the script however you choose. That would be very flexible. You can change your configuration on the fly. As long as it has access to everything you need it should work. Michal would know what direction OOM is headed and whether he thinks that fits with where things are headed. I'm flexible in the sense that I could change our submission to make specific updates to the existing OOM code. We kept it as separate as possible for ease of porting. But if we can build an acceptable case for making updates to the existing OOM Report code, that works too. Our current implementation has some knobs to allow some limited scaling, which has advantages over print rate limiting, and it may allow environments that didn't want to enable printing of processes, slabs or vmalloc entry allocations to do so without generating a lot of output. But the
Re: [PATCH 00/10] OOM Debug print selection and additional information
> On Aug 27, 2019, at 9:13 PM, Edward Chron wrote: > > On Tue, Aug 27, 2019 at 5:50 PM Qian Cai wrote: >> >> >> >>> On Aug 27, 2019, at 8:23 PM, Edward Chron wrote: >>> >>> >>> >>> On Tue, Aug 27, 2019 at 5:40 AM Qian Cai wrote: >>> On Mon, 2019-08-26 at 12:36 -0700, Edward Chron wrote: This patch series provides code that works as a debug option through debugfs to provide additional controls to limit how much information gets printed when an OOM event occurs and or optionally print additional information about slab usage, vmalloc allocations, user process memory usage, the number of processes / tasks and some summary information about these tasks (number runable, i/o wait), system information (#CPUs, Kernel Version and other useful state of the system), ARP and ND Cache entry information. Linux OOM can optionally provide a lot of information, what's missing? -- Linux provides a variety of detailed information when an OOM event occurs but has limited options to control how much output is produced. The system related information is produced unconditionally and limited per user process information is produced as a default enabled option. The per user process information may be disabled. Slab usage information was recently added and is output only if slab usage exceeds user memory usage. Many OOM events are due to user application memory usage sometimes in combination with the use of kernel resource usage that exceeds what is expected memory usage. Detailed information about how memory was being used when the event occurred may be required to identify the root cause of the OOM event. However, some environments are very large and printing all of the information about processes, slabs and or vmalloc allocations may not be feasible. For other environments printing as much information about these as possible may be needed to root cause OOM events. 
>>> >>> For more in-depth analysis of OOM events, people could use kdump to save a >>> vmcore by setting "panic_on_oom", and then use the crash utility to >>> analysis the >>> vmcore which contains pretty much all the information you need. >>> >>> Certainly, this is the ideal. A full system dump would give you the maximum >>> amount of >>> information. >>> >>> Unfortunately some environments may lack space to store the dump, >> >> Kdump usually also support dumping to a remote target via NFS, SSH etc >> >>> let alone the time to dump the storage contents and restart the system. Some >> >> There is also “makedumpfile” that could compress and filter unwanted memory >> to reduce >> the vmcore size and speed up the dumping process by utilizing multi-threads. >> >>> systems can take many minutes to fully boot up, to reset and reinitialize >>> all the >>> devices. So unfortunately this is not always an option, and we need an OOM >>> Report. >> >> I am not sure how the system needs some minutes to reboot would be relevant >> for the >> discussion here. The idea is to save a vmcore and it can be analyzed offline >> even on >> another system as long as it having a matching “vmlinux.". >> >> > > If selecting a dump on an OOM event doesn't reboot the system and if > it runs fast enough such > that it doesn't slow processing enough to appreciably effect the > system's responsiveness then > then it would be ideal solution. For some it would be over kill but > since it is an option it is a > choice to consider or not. It sounds like you are looking for more of this, https://github.com/iovisor/bcc/blob/master/tools/oomkill.py
Re: [PATCH 00/10] OOM Debug print selection and additional information
On Tue, Aug 27, 2019 at 5:50 PM Qian Cai wrote: > > > > > On Aug 27, 2019, at 8:23 PM, Edward Chron wrote: > > > > > > > > On Tue, Aug 27, 2019 at 5:40 AM Qian Cai wrote: > > On Mon, 2019-08-26 at 12:36 -0700, Edward Chron wrote: > > > This patch series provides code that works as a debug option through > > > debugfs to provide additional controls to limit how much information > > > gets printed when an OOM event occurs and or optionally print additional > > > information about slab usage, vmalloc allocations, user process memory > > > usage, the number of processes / tasks and some summary information > > > about these tasks (number runable, i/o wait), system information > > > (#CPUs, Kernel Version and other useful state of the system), > > > ARP and ND Cache entry information. > > > > > > Linux OOM can optionally provide a lot of information, what's missing? > > > -- > > > Linux provides a variety of detailed information when an OOM event occurs > > > but has limited options to control how much output is produced. The > > > system related information is produced unconditionally and limited per > > > user process information is produced as a default enabled option. The > > > per user process information may be disabled. > > > > > > Slab usage information was recently added and is output only if slab > > > usage exceeds user memory usage. > > > > > > Many OOM events are due to user application memory usage sometimes in > > > combination with the use of kernel resource usage that exceeds what is > > > expected memory usage. Detailed information about how memory was being > > > used when the event occurred may be required to identify the root cause > > > of the OOM event. > > > > > > However, some environments are very large and printing all of the > > > information about processes, slabs and or vmalloc allocations may > > > not be feasible. For other environments printing as much information > > > about these as possible may be needed to root cause OOM events. 
> > > > > > > For more in-depth analysis of OOM events, people could use kdump to save a > > vmcore by setting "panic_on_oom", and then use the crash utility to > > analysis the > > vmcore which contains pretty much all the information you need. > > > > Certainly, this is the ideal. A full system dump would give you the maximum > > amount of > > information. > > > > Unfortunately some environments may lack space to store the dump, > > Kdump usually also support dumping to a remote target via NFS, SSH etc > > > let alone the time to dump the storage contents and restart the system. Some > > There is also “makedumpfile” that could compress and filter unwanted memory > to reduce > the vmcore size and speed up the dumping process by utilizing multi-threads. > > > systems can take many minutes to fully boot up, to reset and reinitialize > > all the > > devices. So unfortunately this is not always an option, and we need an OOM > > Report. > > I am not sure how the system needs some minutes to reboot would be relevant > for the > discussion here. The idea is to save a vmcore and it can be analyzed offline > even on > another system as long as it having a matching “vmlinux.". > > If selecting a dump on an OOM event doesn't reboot the system, and if it runs fast enough that it doesn't slow processing enough to appreciably affect the system's responsiveness, then it would be an ideal solution. For some it would be overkill, but since it is an option it is a choice to consider or not.
Re: [PATCH 00/10] OOM Debug print selection and additional information
On Tue, Aug 27, 2019 at 12:15 AM Michal Hocko wrote: > > On Mon 26-08-19 12:36:28, Edward Chron wrote: > [...] > > Extensibility using OOM debug options > > - > > What is needed is an extensible system to optionally configure > > debug options as needed and to then dynamically enable and disable > > them. Also for options that produce multiple lines of entry based > > output, to configure which entries to print based on how much > > memory they use (or optionally all the entries). > > With a patch this large and adding a lot of new stuff we need a more > detailed usecases described I believe. I guess it would make sense to explain the motivation for each OOM debug option I've sent separately. I see there are comments on the patches; I will try to add more information there. An overview would be that we've been collecting information on OOMs over the last 12 years or so. These are from switches, other embedded devices, and servers both large and small. We ask for feedback on what information was helpful or could be helpful. We try to add it to make root causing issues easier. These OOM debug options are some of the options we've created. I didn't port all of them to 5.3 but these are representative. Our latest kernel is a bit behind 5.3. > > > [...] > > > Use of debugfs to allow dynamic controls > > > > By providing a debugfs interface that allows options to be configured, > > enabled and where appropriate to set a minimum size for selecting > > entries to print, the output produced when an OOM event occurs can be > > dynamically adjusted to produce as little or as much detail as needed > > for a given system. > > Who is going to consume this information and why would that consumer be > unreasonable to demand further maintenance of that information in future > releases?
In other words debugfs is not considered a stable API which is > OK here but the side effect of any change to these files results in user > visible behavior and we consider that more or less stable as long as > there are consumers. > > > OOM debug options can be added to the base code as needed. > > > > Currently we have the following OOM debug options defined: > > > > * System State Summary > > > > One line of output that includes: > > - Uptime (days, hour, minutes, seconds) > > We do have timestamps in the log so why is this needed? Here is how an OOM report looks when we get one to look at: Aug 26 09:06:34 coronado kernel: oomprocs invoked oom-killer: gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=1000 Aug 26 09:06:34 coronado kernel: CPU: 1 PID: 2795 Comm: oomprocs Not tainted 5.3.0-rc6+ #33 Aug 26 09:06:34 coronado kernel: Hardware name: Compulab Ltd. IPC3/IPC3, BIOS 5.12_IPC3K.PRD.0.25.7 08/09/2018 This shows the date and time, not the time since the last boot. The /var/log/messages output is what we often have to look at, not raw dmesg output. > > > > - Number CPUs > > - Machine Type > > - Node name > > - Domain name > > why are these needed? That is static information that doesn't really > influence the OOM situation. Sorry if a few of the items overlap what OOM prints. We've been printing a lot of this information since 2.6.38 and OOM reporting has since been updated. We're updating our 4.19 system to have the latest OOM Report format. This was the 5.0 patch "Reorg the OOM report in the dump header". We are also back porting Shakeel's 5.3 patch to refactor dump tasks for memcg OOMs. We're testing those back ports right now in fact. We can probably get rid of some of the information we have but I haven't had a chance yet. Hopefully we can do it as part of sending some code upstream.
> > > > - Kernel Release > > - Kernel Version > > part of the oom report > > > > > Example output when configured and enabled: > > > > Jul 27 10:56:46 yoursystem kernel: System Uptime:0 days 00:17:27 CPUs:4 > > Machine:x86_64 Node:yoursystem Domain:localdomain Kernel Release:5.3.0-rc2+ > > Version: #49 SMP Mon Jul 27 10:35:32 PDT 2019 > > > > * Tasks Summary > > - > > One line of output that includes: > > - Number of Threads > > - Number of processes > > - Forks since boot > > - Processes that are runnable > > - Processes that are in iowait > > We do have sysrq+t for this kind of information. Why do we need to > duplicate it? Unfortunately, we can't login into every customer system or even system of our own and do a sysrq+t after each OOM. You could scan for OOMs and have a script do it, but doing a sysrq+t after an OOM event, you'll get different results. I'd rather have the runnable and iowait counts during the OOM event not after. Computers are so darn fast, free up some memory and things can look a lot different. We've seen crond fork and hang and gradually create thousands of processes and sorts of other unintended fork bombs. On some systems we can't print all of the process information as we've
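As noted above, one option is to scan the logs for OOM events with a script after the fact. A minimal sketch that pulls the key fields out of the oom-killer header line quoted earlier in this thread (the regex follows the 5.x report format shown above and is illustrative, not exhaustive):

```python
import re

# The first line of an OOM report as it lands in /var/log/messages
# (taken from the example earlier in this thread).
LINE = ("Aug 26 09:06:34 coronado kernel: oomprocs invoked oom-killer: "
        "gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, "
        "oom_score_adj=1000")

OOM_RE = re.compile(
    r"kernel: (?P<comm>\S+) invoked oom-killer: "
    r"gfp_mask=(?P<gfp>0x[0-9a-fA-F]+)\((?P<flags>[^)]*)\), "
    r"order=(?P<order>-?\d+), oom_score_adj=(?P<adj>-?\d+)")

def parse_oom_header(line):
    """Extract task name, gfp mask/flags, order and score adj, or None."""
    m = OOM_RE.search(line)
    if not m:
        return None
    return {
        "comm": m.group("comm"),
        "gfp_mask": int(m.group("gfp"), 16),
        "gfp_flags": m.group("flags").split("|"),
        "order": int(m.group("order")),
        "oom_score_adj": int(m.group("adj")),
    }

event = parse_oom_header(LINE)
```

This only post-processes what the kernel already prints; it does not recover anything (like runnable/iowait counts at event time) that the report itself omits.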
Re: [PATCH 00/10] OOM Debug print selection and additional information
> On Aug 27, 2019, at 8:23 PM, Edward Chron wrote: > > > > On Tue, Aug 27, 2019 at 5:40 AM Qian Cai wrote: > On Mon, 2019-08-26 at 12:36 -0700, Edward Chron wrote: > > This patch series provides code that works as a debug option through > > debugfs to provide additional controls to limit how much information > > gets printed when an OOM event occurs and or optionally print additional > > information about slab usage, vmalloc allocations, user process memory > > usage, the number of processes / tasks and some summary information > > about these tasks (number runable, i/o wait), system information > > (#CPUs, Kernel Version and other useful state of the system), > > ARP and ND Cache entry information. > > > > Linux OOM can optionally provide a lot of information, what's missing? > > -- > > Linux provides a variety of detailed information when an OOM event occurs > > but has limited options to control how much output is produced. The > > system related information is produced unconditionally and limited per > > user process information is produced as a default enabled option. The > > per user process information may be disabled. > > > > Slab usage information was recently added and is output only if slab > > usage exceeds user memory usage. > > > > Many OOM events are due to user application memory usage sometimes in > > combination with the use of kernel resource usage that exceeds what is > > expected memory usage. Detailed information about how memory was being > > used when the event occurred may be required to identify the root cause > > of the OOM event. > > > > However, some environments are very large and printing all of the > > information about processes, slabs and or vmalloc allocations may > > not be feasible. For other environments printing as much information > > about these as possible may be needed to root cause OOM events. 
> > > > For more in-depth analysis of OOM events, people could use kdump to save a > vmcore by setting "panic_on_oom", and then use the crash utility to analysis > the > vmcore which contains pretty much all the information you need. > > Certainly, this is the ideal. A full system dump would give you the maximum > amount of > information. > > Unfortunately some environments may lack space to store the dump, Kdump usually also support dumping to a remote target via NFS, SSH etc > let alone the time to dump the storage contents and restart the system. Some There is also “makedumpfile” that could compress and filter unwanted memory to reduce the vmcore size and speed up the dumping process by utilizing multi-threads. > systems can take many minutes to fully boot up, to reset and reinitialize all > the > devices. So unfortunately this is not always an option, and we need an OOM > Report. I am not sure how the system needs some minutes to reboot would be relevant for the discussion here. The idea is to save a vmcore and it can be analyzed offline even on another system as long as it having a matching “vmlinux.".
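The kdump path argued for above boils down to a little configuration. A rough sketch follows; crashkernel sizing, package names and dump targets vary by distro, so treat the specific values as placeholders:

```shell
# Boot the production kernel with memory reserved for the crash kernel,
# e.g. append to the kernel command line:
#   crashkernel=256M
# Panic on OOM so kdump captures a vmcore instead of killing a task:
sysctl vm.panic_on_oom=1
# Reboot automatically a few seconds after the dump is taken:
sysctl kernel.panic=10
# makedumpfile (invoked by the kdump service) can shrink the vmcore:
#   -d 31  filters out zero, free, cache and user pages
#   -l     compresses pages with LZO to speed up the dump
```

This is the trade-off discussed in the thread: a filtered, compressed vmcore captures far more state than any OOM report, at the cost of a reboot.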
Re: [PATCH 00/10] OOM Debug print selection and additional information
On Mon, 2019-08-26 at 12:36 -0700, Edward Chron wrote: > This patch series provides code that works as a debug option through > debugfs to provide additional controls to limit how much information > gets printed when an OOM event occurs and or optionally print additional > information about slab usage, vmalloc allocations, user process memory > usage, the number of processes / tasks and some summary information > about these tasks (number runable, i/o wait), system information > (#CPUs, Kernel Version and other useful state of the system), > ARP and ND Cache entry information. > > Linux OOM can optionally provide a lot of information, what's missing? > -- > Linux provides a variety of detailed information when an OOM event occurs > but has limited options to control how much output is produced. The > system related information is produced unconditionally and limited per > user process information is produced as a default enabled option. The > per user process information may be disabled. > > Slab usage information was recently added and is output only if slab > usage exceeds user memory usage. > > Many OOM events are due to user application memory usage sometimes in > combination with the use of kernel resource usage that exceeds what is > expected memory usage. Detailed information about how memory was being > used when the event occurred may be required to identify the root cause > of the OOM event. > > However, some environments are very large and printing all of the > information about processes, slabs and or vmalloc allocations may > not be feasible. For other environments printing as much information > about these as possible may be needed to root cause OOM events. > For more in-depth analysis of OOM events, people could use kdump to save a vmcore by setting "panic_on_oom", and then use the crash utility to analysis the vmcore which contains pretty much all the information you need. 
The downside of that approach is that it probably only works well for enterprise use cases, where kdump/crash is tested properly on enterprise-level distros. The combo is more fragile for developers on consumer distros, because kdump/crash can be affected by many kernel subsystems and tends to break fairly quickly when the community testing is light.
Re: [PATCH 00/10] OOM Debug print selection and additional information
On Tue 27-08-19 19:10:18, Tetsuo Handa wrote: > On 2019/08/27 16:15, Michal Hocko wrote: > > All that being said, I do not think this is something we want to merge > > without a really _strong_ usecase to back it. > > Like the sender's domain "arista.com" suggests, some of information is > geared towards networking devices, and ability to report OOM information > in a way suitable for automatic recording/analyzing (e.g. without using > shell prompt, let alone manually typing SysRq commands) would be convenient > for unattended devices. Why can't the remote end of the logging connection identify the host? It has to connect somewhere anyway, right? I also assume that a log collector already stores each log with a host ID of some form. -- Michal Hocko SUSE Labs
Re: [PATCH 00/10] OOM Debug print selection and additional information
On 2019/08/27 16:15, Michal Hocko wrote: > All that being said, I do not think this is something we want to merge > without a really _strong_ usecase to back it. Like the sender's domain "arista.com" suggests, some of the information is geared towards networking devices, and the ability to report OOM information in a way suitable for automatic recording/analyzing (e.g. without using a shell prompt, let alone manually typing SysRq commands) would be convenient for unattended devices. We have only one OOM killer implementation and its format/data are hard-coded. If we could make the OOM killer modular, Edward would be able to use it.
Re: [PATCH 00/10] OOM Debug print selection and additional information
On Mon 26-08-19 12:36:28, Edward Chron wrote: [...] > Extensibility using OOM debug options > - > What is needed is an extensible system to optionally configure > debug options as needed and to then dynamically enable and disable > them. Also for options that produce multiple lines of entry based > output, to configure which entries to print based on how much > memory they use (or optionally all the entries). With a patch this large and adding a lot of new stuff we need more detailed use cases described, I believe. [...] > Use of debugfs to allow dynamic controls > > By providing a debugfs interface that allows options to be configured, > enabled and where appropriate to set a minimum size for selecting > entries to print, the output produced when an OOM event occurs can be > dynamically adjusted to produce as little or as much detail as needed > for a given system. Who is going to consume this information and why would that consumer be unreasonable to demand further maintenance of that information in future releases? In other words debugfs is not considered a stable API, which is OK here, but the side effect of any change to these files results in user visible behavior and we consider that more or less stable as long as there are consumers. > OOM debug options can be added to the base code as needed. > > Currently we have the following OOM debug options defined: > > * System State Summary > > One line of output that includes: > - Uptime (days, hour, minutes, seconds) We do have timestamps in the log so why is this needed? > - Number CPUs > - Machine Type > - Node name > - Domain name why are these needed? That is static information that doesn't really influence the OOM situation.
> - Kernel Release > - Kernel Version part of the oom report > > Example output when configured and enabled: > > Jul 27 10:56:46 yoursystem kernel: System Uptime:0 days 00:17:27 CPUs:4 > Machine:x86_64 Node:yoursystem Domain:localdomain Kernel Release:5.3.0-rc2+ > Version: #49 SMP Mon Jul 27 10:35:32 PDT 2019 > > * Tasks Summary > - > One line of output that includes: > - Number of Threads > - Number of processes > - Forks since boot > - Processes that are runnable > - Processes that are in iowait We do have sysrq+t for this kind of information. Why do we need to duplicate it? > Example output when configured and enabled: > > Jul 22 15:20:57 yoursystem kernel: Threads:530 Processes:279 > forks_since_boot:2786 procs_runable:2 procs_iowait:0 > > * ARP Table and/or Neighbour Discovery Table Summary > -- > One line of output each for ARP and ND that includes: > - Table name > - Table size (max # entries) > - Key Length > - Entry Size > - Number of Entries > - Last Flush (in seconds) > - hash grows > - entry allocations > - entry destroys > - Number lookups > - Number of lookup hits > - Resolution failures > - Garbage Collection Forced Runs > - Table Full > - Proxy Queue Length > > Example output when configured and enabled (for both): > > ... kernel: neighbour: Table: arp_tbl size: 256 keyLen: 4 entrySize: 360 > entries: 9 lastFlush: 1721s hGrows: 1 allocs: 9 destroys: 0 > lookups: 204 hits: 199 resFailed:38 gcRuns/Forced: 111 / 0 tblFull: > 0 proxyQlen: 0 > > ... kernel: neighbour: Table: nd_tbl size: 128 keyLen: 16 entrySize: 368 > entries: 6 lastFlush: 1720s hGrows: 0 allocs: 7 destroys: 1 > lookups: 0 hits: 0 resFailed: 0 gcRuns/Forced: 110 / 0 tblFull: > 0 proxyQlen: 0 Again, why is this needed particularly for the OOM event? I do understand this might be useful system health diagnostic information but how does this contribute to the OOM? > * Add Select Slabs Print > -- > Allow select slab entries (based on a minimum size) to be printed. 
> Minimum size is specified as a percentage of the total RAM memory > in tenths of a percent, consistent with existing OOM process scoring. > Valid values are specified from 0 to 1000 where 0 prints all slab > entries (all slabs that have at least one slab object in use) up > to 1000 which would require a slab to use 100% of memory which can't > happen so in that case only summary information is printed. > > The first line of output is the standard Linux output header for > OOM printed Slab entries. This header looks like this: > > Aug 6 09:37:21 egc103 yourserver: Unreclaimable slab info: > > The output is existing slab entry memory usage limited such that only > entries equal to or larger than the minimum size are printed. > Empty slabs (no slab entries in slabs in use) are never printed. > > Additional output consists of summary information that is printed > at the end of the output. This summary information
[PATCH 00/10] OOM Debug print selection and additional information
This patch series provides code that works as a debug option through debugfs to provide additional controls to limit how much information gets printed when an OOM event occurs and/or optionally print additional information about slab usage, vmalloc allocations, user process memory usage, the number of processes / tasks and some summary information about these tasks (number runnable, i/o wait), system information (#CPUs, Kernel Version and other useful state of the system), ARP and ND Cache entry information.

Linux OOM can optionally provide a lot of information, what's missing?
--
Linux provides a variety of detailed information when an OOM event occurs but has limited options to control how much output is produced. The system related information is produced unconditionally and limited per user process information is produced as a default enabled option. The per user process information may be disabled.

Slab usage information was recently added and is output only if slab usage exceeds user memory usage.

Many OOM events are due to user application memory usage, sometimes in combination with kernel resource usage, that exceeds what is expected. Detailed information about how memory was being used when the event occurred may be required to identify the root cause of the OOM event.

However, some environments are very large and printing all of the information about processes, slabs and/or vmalloc allocations may not be feasible. For other environments printing as much information about these as possible may be needed to root cause OOM events.

Extensibility using OOM debug options
-
What is needed is an extensible system to optionally configure debug options as needed and to then dynamically enable and disable them. Also, for options that produce multiple lines of entry based output, to configure which entries to print based on how much memory they use (or optionally all the entries).
Limiting print entry output based on object size

To limit output, a fixed object size could be used, such as: vmallocs that use more than 1MB, slabs that use more than 512KB, processes using 16MB or more of memory. Such an approach is quite reasonable.

Using OOM's memory metrics to limit printing based on entry size

However, the current implementation of OOM, which has been in use for almost a decade, scores based on 1/10 % of memory. This methodology scales well as memory sizes increase. If you limit the objects you examine to those using 0.1% of memory you may still get a large number of objects but avoid printing those using a relatively small amount of memory. Further, options that limit output based on object size can have the minimum size set to zero. In this case objects that use even a small amount of memory will be printed.

Use of debugfs to allow dynamic controls

By providing a debugfs interface that allows options to be configured, enabled and, where appropriate, to set a minimum size for selecting entries to print, the output produced when an OOM event occurs can be dynamically adjusted to produce as little or as much detail as needed for a given system. OOM debug options can be added to the base code as needed.
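To make the 1/10 % cutoff described above concrete, here is a small sketch of the selection arithmetic; the function and variable names are illustrative, not taken from the patches:

```python
def min_bytes(totalram_bytes, tenths_percent):
    """Smallest entry printed for a setting expressed in tenths of a
    percent of total RAM: 1 -> 0.1%, 1000 -> 100% (summary only),
    0 -> print every entry."""
    if not 0 <= tenths_percent <= 1000:
        raise ValueError("setting must be in 0..1000")
    return totalram_bytes * tenths_percent // 1000

# On a 32 GiB machine a setting of 1 (0.1% of RAM) keeps only entries
# of roughly 34 MB and larger, while 0 degenerates to printing all.
threshold = min_bytes(32 * 1024**3, 1)
```

The same arithmetic scales automatically with RAM size, which is the stated advantage over fixed byte cutoffs like "vmallocs over 1MB".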
Currently we have the following OOM debug options defined: * System State Summary One line of output that includes: - Uptime (days, hour, minutes, seconds) - Number CPUs - Machine Type - Node name - Domain name - Kernel Release - Kernel Version Example output when configured and enabled: Jul 27 10:56:46 yoursystem kernel: System Uptime:0 days 00:17:27 CPUs:4 Machine:x86_64 Node:yoursystem Domain:localdomain Kernel Release:5.3.0-rc2+ Version: #49 SMP Mon Jul 27 10:35:32 PDT 2019 * Tasks Summary - One line of output that includes: - Number of Threads - Number of processes - Forks since boot - Processes that are runnable - Processes that are in iowait Example output when configured and enabled: Jul 22 15:20:57 yoursystem kernel: Threads:530 Processes:279 forks_since_boot:2786 procs_runable:2 procs_iowait:0 * ARP Table and/or Neighbour Discovery Table Summary -- One line of output each for ARP and ND that includes: - Table name - Table size (max # entries) - Key Length - Entry Size - Number of Entries - Last Flush (in seconds) - hash grows - entry allocations - entry destroys - Number lookups - Number of lookup hits - Resolution failures - Garbage Collection Forced Runs - Table Full - Proxy Queue Length Example output when configured and enabled (for both): ... kernel: neighbour: Table: arp_tbl size: 256 keyLen: 4 entrySize: 360 entries: 9