Re: [PATCH 00/10] OOM Debug print selection and additional information

2019-08-29 Thread Edward Chron
On Thu, Aug 29, 2019 at 11:44 AM Qian Cai  wrote:
>
> On Thu, 2019-08-29 at 09:09 -0700, Edward Chron wrote:
>
> > > Feel like you are going in circles to "sell" without any new information.
> > > If you need to deal with OOM that often, it might also be worth working
> > > with FB on oomd.
> > >
> > > https://github.com/facebookincubator/oomd
> > >
> > > It is well-known that kernel OOM could be slow and painful to deal with,
> > > so I don't buy the argument that kernel OOM recovery is better/faster
> > > than a kdump reboot.
> > >
> > > It is not unusual that when the system is triggering a kernel OOM, it is
> > > almost trashed/dead. Although developers are working hard to improve the
> > > recovery after OOM, there are still many error-paths that are not going
> > > to survive, which would leak memory, introduce undefined behavior,
> > > corrupt memory, etc.
> >
> > But as you have pointed out, many people are happy with the current OOM
> > processing, which is the report and recovery, so for those people a kdump
> > reboot is overkill. Making the OOM report at least optionally a bit more
> > informative has value. Also making sure it doesn't produce excessive
> > output is desirable.
> >
> > I do agree that for developers, having all the system state a kdump
> > provides is valuable, and as long as you can reproduce the OOM event that
> > works well. But that is not the common case, as has already been
> > discussed.
> >
> > Also, OOM events that are due to kernel bugs could leak memory and, over
> > time, cause a crash, true. But that is not what we typically see. In fact
> > we've had customers come back and report issues on systems that have been
> > in continuous operation for years. No point in crashing their system.
> > Linux, if properly maintained, is thankfully quite stable. But OOMs do
> > happen, and root causing them to prevent future occurrences is desired.
>
> This is not what I meant. After an OOM event happens, many kernel memory
> allocations could fail. Since very few people are testing those error-paths
> due to allocation failures, it is considered one of the most buggy areas in
> the kernel. Developers have mostly been focused on making sure the kernel
> OOM does not happen in the first place.
>
> I still think the time is better spent on improving things like eBPF, oomd
> and kdump etc. to solve your problem, but leave the kernel OOM report code
> alone.
>

Sure, I would rather spend my time doing other things.
No argument about that. No one likes OOMs.
If I never saw another OOM I'd be quite happy.

But OOM events still happen and an OOM report gets generated.
When it happens it is useful to get information that can help
find the cause of the OOM so it can be fixed and won't happen again.
We get tasked to root cause OOMs even though we'd rather do
other things.

We've added a bit of output to the OOM Report and it has been helpful.
We also reduce our total output by only printing larger entries
with helpful summaries.
We've been using and supporting this code for quite a few releases.
We haven't had problems and we have a lot of systems in use.

Contributing to an open source project like Linux is good.
If the code is not accepted it's not the end of the world.
I was told to offer our code upstream and to try to be helpful.

I understand that processing an OOM event can be flaky.
We add a few lines of OOM output but in fact we reduce our total
output because we skip printing smaller entries and print
summaries instead.
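
To make that concrete, the kind of filtering we mean looks roughly like the
sketch below. This is only an illustration of the idea, not the code from our
patches; the entry structure, function name and threshold are made up.

  #include <linux/kernel.h>
  #include <linux/printk.h>

  struct report_entry {
          const char *name;
          unsigned long kb;       /* size of this entry in kB */
  };

  /* Print only entries of at least min_kb; summarize everything smaller. */
  static void print_filtered(const struct report_entry *e, size_t n,
                             unsigned long min_kb)
  {
          size_t i, skipped = 0;
          unsigned long skipped_kb = 0;

          for (i = 0; i < n; i++) {
                  if (e[i].kb >= min_kb) {
                          pr_info("%-24s %8lu kB\n", e[i].name, e[i].kb);
                  } else {
                          skipped++;
                          skipped_kb += e[i].kb;
                  }
          }
          pr_info("%zu smaller entries totalling %lu kB not shown\n",
                  skipped, skipped_kb);
  }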

So if the volume of output increases the likelihood of system failure during
an OOM event, then by reducing our output we've actually increased our
reliability. Maybe that is why we haven't had any problems.

As far as switching from generating an OOM report to taking
a dump and restarting the system, the choice is not mine to
make. Way above my pay grade. When asked, I am
happy to look at a dump, but dumps plus restarts for
the systems we work on take too long, so I typically don't get
a dump to look at. I have to make do with the OOM output and
logs.

Also, depending on what you work on, you may take
satisfaction that OOM events are far less traumatic with
newer versions of Linux, at least on our systems. The folks upstream
do really good work; give credit where credit is due.
Maybe tools like KASAN, which we also use, really help.

Sure, people fix bugs all the time; Linux is huge and super
complicated. But many of the bugs are not very common,
and we spend an amazing (to me anyway) amount of time
testing, so when we take OOM events, even multiple
OOM events back to back, the system almost always
recovers and we don't seem to bleed memory. That is
why we keep systems up for months and even years.

Occasionally we see a watchdog timeout failure, and that
can be due to a low memory situation, but just FYI a fair
number of those do not involve OOM events, so it's not
because of issues with OOM code, reporting or otherwise.


Re: [PATCH 00/10] OOM Debug print selection and additional information

2019-08-29 Thread Qian Cai
On Thu, 2019-08-29 at 09:09 -0700, Edward Chron wrote:

> > Feel like you are going in circles to "sell" without any new information.
> > If you need to deal with OOM that often, it might also be worth working
> > with FB on oomd.
> > 
> > https://github.com/facebookincubator/oomd
> > 
> > It is well-known that kernel OOM could be slow and painful to deal with,
> > so I don't buy the argument that kernel OOM recovery is better/faster than
> > a kdump reboot.
> > 
> > It is not unusual that when the system is triggering a kernel OOM, it is
> > almost trashed/dead. Although developers are working hard to improve the
> > recovery after OOM, there are still many error-paths that are not going to
> > survive, which would leak memory, introduce undefined behavior, corrupt
> > memory, etc.
> 
> But as you have pointed out, many people are happy with the current OOM
> processing, which is the report and recovery, so for those people a kdump
> reboot is overkill. Making the OOM report at least optionally a bit more
> informative has value. Also making sure it doesn't produce excessive output
> is desirable.
> 
> I do agree that for developers, having all the system state a kdump provides
> is valuable, and as long as you can reproduce the OOM event that works well.
> But that is not the common case, as has already been discussed.
> 
> Also, OOM events that are due to kernel bugs could leak memory and, over
> time, cause a crash, true. But that is not what we typically see. In fact
> we've had customers come back and report issues on systems that have been in
> continuous operation for years. No point in crashing their system. Linux, if
> properly maintained, is thankfully quite stable. But OOMs do happen, and root
> causing them to prevent future occurrences is desired.

This is not what I meant. After an OOM event happens, many kernel memory
allocations could fail. Since very few people are testing those error-paths due
to allocation failures, it is considered one of the most buggy areas in the
kernel. Developers have mostly been focused on making sure the kernel OOM does
not happen in the first place.

I still think the time is better spent on improving things like eBPF, oomd
and kdump etc. to solve your problem, but leave the kernel OOM report code alone.



Re: [PATCH 00/10] OOM Debug print selection and additional information

2019-08-29 Thread Edward Chron
On Thu, Aug 29, 2019 at 9:18 AM Michal Hocko  wrote:
>
> On Thu 29-08-19 08:03:19, Edward Chron wrote:
> > On Thu, Aug 29, 2019 at 4:56 AM Michal Hocko  wrote:
> [...]
> > > Or simply provide a hook with the oom_control to be called to report
> > > without replacing the whole oom killer behavior. That is not necessary.
> >
> > For a very simple addition, such as adding a line of output, this works.
>
> Why would a hook be limited to small stuff?

It could be larger but the few items we added were just a line or
two of output.

The vmalloc, slabs and processes can print many entries so we
added a control for those.

>
> > It would still be nice to address the fact the existing OOM Report prints
> > all of the user processes or none. It would be nice to add some control
> > for that. That's what we did.
>
> TBH, I am not really convinced a partial task list is desirable or easy
> to configure. What is the criterion? oom_score (with a potentially unstable
> metric)? RSS? Something else?

We used an estimate of the memory footprint of the process:
rss, swap pages and page table pages.
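
For reference, a minimal sketch of that estimate (per task, in pages), modeled
on the metric oom_badness() already uses, is below. The function name is
illustrative, and locking (e.g. find_lock_task_mm()) is omitted for brevity.

  #include <linux/mm.h>
  #include <linux/sched.h>

  static unsigned long task_footprint_pages(struct task_struct *p)
  {
          struct mm_struct *mm = p->mm;

          if (!mm)
                  return 0;

          /* resident pages + swapped-out pages + page-table pages */
          return get_mm_rss(mm) +
                 get_mm_counter(mm, MM_SWAPENTS) +
                 mm_pgtables_bytes(mm) / PAGE_SIZE;
  }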

> --
> Michal Hocko
> SUSE Labs


Re: [PATCH 00/10] OOM Debug print selection and additional information

2019-08-29 Thread Michal Hocko
On Thu 29-08-19 08:03:19, Edward Chron wrote:
> On Thu, Aug 29, 2019 at 4:56 AM Michal Hocko  wrote:
[...]
> > Or simply provide a hook with the oom_control to be called to report
> > without replacing the whole oom killer behavior. That is not necessary.
> 
> For a very simple addition, such as adding a line of output, this works.

Why would a hook be limited to small stuff?

> It would still be nice to address the fact the existing OOM Report prints
> all of the user processes or none. It would be nice to add some control
> for that. That's what we did.

TBH, I am not really convinced a partial task list is desirable or easy
to configure. What is the criterion? oom_score (with a potentially unstable
metric)? RSS? Something else?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 00/10] OOM Debug print selection and additional information

2019-08-29 Thread Edward Chron
On Thu, Aug 29, 2019 at 8:42 AM Qian Cai  wrote:
>
> On Thu, 2019-08-29 at 08:03 -0700, Edward Chron wrote:
> > On Thu, Aug 29, 2019 at 4:56 AM Michal Hocko  wrote:
> > >
> > > On Thu 29-08-19 19:14:46, Tetsuo Handa wrote:
> > > > On 2019/08/29 16:11, Michal Hocko wrote:
> > > > > On Wed 28-08-19 12:46:20, Edward Chron wrote:
> > > > > > Our belief is if you really think eBPF is the preferred mechanism
> > > > > > then move OOM reporting to an eBPF.
> > > > >
> > > > > I've said that all this additional information has to be dynamically
> > > > > extensible rather than a part of the core kernel. Whether eBPF is the
> > > > > suitable tool, I do not know. I haven't explored that. There are other
> > > > > ways to inject code to the kernel. systemtap/kprobes, kernel modules and
> > > > > probably others.
> > > >
> > > > As for SystemTap, guru mode (an expert mode which disables the protection
> > > > provided by SystemTap, allowing the kernel to crash when something goes
> > > > wrong) could be used for holding a spinlock. However, as far as I know,
> > > > holding a mutex (or doing any operation that might sleep) from such
> > > > dynamic hooks is not allowed. Also, we will need to export various symbols
> > > > in order to allow access from such dynamic hooks.
> > >
> > > This is the oom path and it should better not use any sleeping locks in
> > > the first place.
> > >
> > > > I'm not familiar with eBPF, but I guess that eBPF is similar.
> > > >
> > > > But please be aware that, I REPEAT AGAIN, I don't think either eBPF or
> > > > SystemTap will be suitable for dumping OOM information. An OOM situation
> > > > means that even a single page fault event cannot complete, and temporary
> > > > memory allocation for reading from the kernel or writing to files cannot
> > > > complete.
> > >
> > > And I repeat that no such reporting is going to write to files. This is
> > > an OOM path afterall.
> > >
> > > > Therefore, we will need to hold all information in kernel memory (without
> > > > allocating any memory when the OOM event happens). Dynamic hooks could
> > > > hold a few lines of output, but not all the lines we want. The only
> > > > possible buffer which is preallocated and large enough would be printk()'s
> > > > buffer. Thus, I believe that we will have to use printk() in order to dump
> > > > OOM information. At that point,
> > >
> > > Yes, this is what I've had in mind.
> > >
> >
> > +1: It makes sense to keep the report going to the dmesg to persist.
> > That is where it has always gone and there is no reason to change.
> > You can have several OOMs back to back and you'd like to retain the output.
> > All the information should be kept together in the OOM report.
> >
> > > >
> > > >   static bool (*oom_handler)(struct oom_control *oc) = default_oom_killer;
> > > >
> > > >   bool out_of_memory(struct oom_control *oc)
> > > >   {
> > > >   return oom_handler(oc);
> > > >   }
> > > >
> > > > and let in-tree kernel modules override current OOM killer would be
> > > > the only practical choice (if we refuse adding many knobs).
> > >
> > > Or simply provide a hook with the oom_control to be called to report
> > > without replacing the whole oom killer behavior. That is not necessary.
> >
> > For a very simple addition, such as adding a line of output, this works.
> > It would still be nice to address the fact that the existing OOM Report prints
> > all of the user processes or none. It would be nice to add some control
> > for that. That's what we did.
>
> Feel like you are going in circles to "sell" without any new information.
> If you need to deal with OOM that often, it might also be worth working with
> FB on oomd.
>
> https://github.com/facebookincubator/oomd
>
> It is well-known that kernel OOM could be slow and painful to deal with, so I
> don't buy the argument that kernel OOM recovery is better/faster than a kdump
> reboot.
>
> It is not unusual that when the system is triggering a kernel OOM, it is
> almost trashed/dead. Although developers are working hard to improve the
> recovery after OOM, there are still many error-paths that are not going to
> survive, which would leak memory, introduce undefined behavior, corrupt
> memory, etc.

But as you have pointed out, many people are happy with the current OOM
processing, which is the report and recovery, so for those people a kdump
reboot is overkill. Making the OOM report at least optionally a bit more
informative has value. Also making sure it doesn't produce excessive output
is desirable.

I do agree that for developers, having all the system state a kdump provides
is valuable, and as long as you can reproduce the OOM event that works well.
But that is not the common case, as has already been discussed.

Also, OOM events that are due to kernel bugs could leak memory and, over time,
cause a crash, true. But that is not what we typically see. In fact we've had
customers come back and 

Re: [PATCH 00/10] OOM Debug print selection and additional information

2019-08-29 Thread Edward Chron
On Thu, Aug 29, 2019 at 7:09 AM Tetsuo Handa
 wrote:
>
> On 2019/08/29 20:56, Michal Hocko wrote:
> >> But please be aware that, I REPEAT AGAIN, I don't think neither eBPF nor
> >> SystemTap will be suitable for dumping OOM information. OOM situation means
> >> that even single page fault event cannot complete, and temporary memory
> >> allocation for reading from kernel or writing to files cannot complete.
> >
> > And I repeat that no such reporting is going to write to files. This is
> > an OOM path afterall.
>
> The process that fetches from e.g. an eBPF event cannot involve a page fault.
> The front-end for iovisor/bcc is a python userspace process. But I think
> that such a process can't run in an OOM situation.
>
> >
> >> Therefore, we will need to hold all information in kernel memory (without
> >> allocating any memory when OOM event happened). Dynamic hooks could hold
> >> a few lines of output, but not all lines we want. The only possible buffer
> >> which is preallocated and large enough would be printk()'s buffer. Thus,
> >> I believe that we will have to use printk() in order to dump OOM 
> >> information.
> >> At that point,
> >
> > Yes, this is what I've had in mind.
>
> Probably I took an incorrect shortcut.
>
> Dynamic hooks could hold a few lines of output, but dynamic hooks cannot hold
> all the lines when dump_tasks() reports 32000+ processes. We have to buffer
> all output in kernel memory because we can't complete even a page fault event
> triggered by the python process monitoring the eBPF event (and writing the
> result to some log file or something) while out_of_memory() is in flight.
>
> And "set /proc/sys/vm/oom_dump_tasks to 0" is not the right reaction. What I'm
> saying is "we won't be able to hold output from dump_tasks() if output from
> dump_tasks() goes to buffer preallocated for dynamic hooks". We have to find
> a way that can handle the worst case.

With the patch series we sent, the addition of the vmalloc entries print
required us to add a small piece of code to vmalloc.c, but we thought this
should be a core OOM reporting function. However, you want to limit which
vmalloc entries you print, probably to only very large memory users. For us
this generates just a few entries and has proven useful.

The changes to limit how many processes get printed, so you don't have the
all-or-nothing behavior, would be nice to have. It would be easiest if there
were a standard mechanism to specify which entries to print, probably by a
minimum size, which is what we did. We used debugfs to set the controls, but
sysctl or some other mechanism could be used.
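
A minimal sketch of what such a debugfs control could look like is below. The
directory, file name and threshold variable are hypothetical (not the actual
interface from our patches); the report path would consult the threshold
before printing an entry.

  #include <linux/debugfs.h>
  #include <linux/module.h>

  /* Only report entries of at least this many kB (hypothetical knob). */
  static unsigned long oom_report_min_kb = 1024;

  static struct dentry *oom_dbg_dir;

  static int __init oom_report_debugfs_init(void)
  {
          oom_dbg_dir = debugfs_create_dir("oom_report", NULL);
          debugfs_create_ulong("min_entry_kb", 0644, oom_dbg_dir,
                               &oom_report_min_kb);
          return 0;
  }

  static void __exit oom_report_debugfs_exit(void)
  {
          debugfs_remove_recursive(oom_dbg_dir);
  }

  module_init(oom_report_debugfs_init);
  module_exit(oom_report_debugfs_exit);
  MODULE_LICENSE("GPL");

With something like this, an administrator could adjust the threshold at run
time, e.g. "echo 4096 > /sys/kernel/debug/oom_report/min_entry_kb".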

The rest of what we did might be implemented with hooks, as those items only
output a line or two, and I've already gotten rid of information we had that
was redundant.


Re: [PATCH 00/10] OOM Debug print selection and additional information

2019-08-29 Thread Qian Cai
On Thu, 2019-08-29 at 08:03 -0700, Edward Chron wrote:
> On Thu, Aug 29, 2019 at 4:56 AM Michal Hocko  wrote:
> > 
> > On Thu 29-08-19 19:14:46, Tetsuo Handa wrote:
> > > On 2019/08/29 16:11, Michal Hocko wrote:
> > > > On Wed 28-08-19 12:46:20, Edward Chron wrote:
> > > > > Our belief is if you really think eBPF is the preferred mechanism
> > > > > then move OOM reporting to an eBPF.
> > > > 
> > > > I've said that all this additional information has to be dynamically
> > > > extensible rather than a part of the core kernel. Whether eBPF is the
> > > > suitable tool, I do not know. I haven't explored that. There are other
> > > > ways to inject code to the kernel. systemtap/kprobes, kernel modules and
> > > > probably others.
> > > 
> > > As for SystemTap, guru mode (an expert mode which disables the protection
> > > provided by SystemTap, allowing the kernel to crash when something goes
> > > wrong) could be used for holding a spinlock. However, as far as I know,
> > > holding a mutex (or doing any operation that might sleep) from such
> > > dynamic hooks is not allowed. Also, we will need to export various symbols
> > > in order to allow access from such dynamic hooks.
> > 
> > This is the oom path and it should better not use any sleeping locks in
> > the first place.
> > 
> > > I'm not familiar with eBPF, but I guess that eBPF is similar.
> > > 
> > > But please be aware that, I REPEAT AGAIN, I don't think either eBPF or
> > > SystemTap will be suitable for dumping OOM information. An OOM situation
> > > means that even a single page fault event cannot complete, and temporary
> > > memory allocation for reading from the kernel or writing to files cannot
> > > complete.
> > 
> > And I repeat that no such reporting is going to write to files. This is
> > an OOM path afterall.
> > 
> > > Therefore, we will need to hold all information in kernel memory (without
> > > allocating any memory when OOM event happened). Dynamic hooks could hold
> > > a few lines of output, but not all lines we want. The only possible buffer
> > > which is preallocated and large enough would be printk()'s buffer. Thus,
> > > I believe that we will have to use printk() in order to dump OOM
> > > information.
> > > At that point,
> > 
> > Yes, this is what I've had in mind.
> > 
> 
> +1: It makes sense to keep the report going to the dmesg to persist.
> That is where it has always gone and there is no reason to change.
> You can have several OOMs back to back and you'd like to retain the output.
> All the information should be kept together in the OOM report.
> 
> > > 
> > >   static bool (*oom_handler)(struct oom_control *oc) = default_oom_killer;
> > > 
> > >   bool out_of_memory(struct oom_control *oc)
> > >   {
> > >   return oom_handler(oc);
> > >   }
> > > 
> > > and let in-tree kernel modules override current OOM killer would be
> > > the only practical choice (if we refuse adding many knobs).
> > 
> > Or simply provide a hook with the oom_control to be called to report
> > without replacing the whole oom killer behavior. That is not necessary.
> 
> For a very simple addition, such as adding a line of output, this works.
> It would still be nice to address the fact that the existing OOM Report prints
> all of the user processes or none. It would be nice to add some control
> for that. That's what we did.

Feel like you are going in circles to "sell" without any new information. If you
need to deal with OOM that often, it might also be worth working with FB on oomd.

https://github.com/facebookincubator/oomd

It is well-known that kernel OOM could be slow and painful to deal with, so I
don't buy the argument that kernel OOM recovery is better/faster than a kdump
reboot.

It is not unusual that when the system is triggering a kernel OOM, it is almost
trashed/dead. Although developers are working hard to improve the recovery after
OOM, there are still many error-paths that are not going to survive, which would
leak memory, introduce undefined behavior, corrupt memory, etc.


Re: [PATCH 00/10] OOM Debug print selection and additional information

2019-08-29 Thread Edward Chron
On Thu, Aug 29, 2019 at 12:11 AM Michal Hocko  wrote:
>
> On Wed 28-08-19 12:46:20, Edward Chron wrote:
> [...]
> > Our belief is if you really think eBPF is the preferred mechanism
> > then move OOM reporting to an eBPF.
>
> I've said that all this additional information has to be dynamically
> extensible rather than a part of the core kernel. Whether eBPF is the
> suitable tool, I do not know. I haven't explored that. There are other
> ways to inject code to the kernel. systemtap/kprobes, kernel modules and
> probably others.

For simple code injections eBPF or a kprobe works, and a tracepoint would
help with that. For example, we could add our one line of task information,
which we find very useful, this way.
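
As an illustration only (not our patch), a one-line addition of that sort can
be injected today with a small kprobes module; the probe fires when
out_of_memory() is entered and prints the allocating task:

  #include <linux/kprobes.h>
  #include <linux/module.h>
  #include <linux/sched.h>

  static int oom_pre(struct kprobe *p, struct pt_regs *regs)
  {
          /* current is the task whose allocation entered the OOM path */
          pr_info("oom: triggering task %s (pid %d)\n",
                  current->comm, current->pid);
          return 0;
  }

  static struct kprobe oom_kp = {
          .symbol_name = "out_of_memory",
          .pre_handler = oom_pre,
  };

  static int __init oom_kp_init(void)
  {
          return register_kprobe(&oom_kp);
  }

  static void __exit oom_kp_exit(void)
  {
          unregister_kprobe(&oom_kp);
  }

  module_init(oom_kp_init);
  module_exit(oom_kp_exit);
  MODULE_LICENSE("GPL");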

For adding controls to limit output for processes, slabs and vmalloc entries,
it would be harder to inject code. Our solution was to use debugfs.
An alternative would be to add a simple sysctl if using debugfs is not
appropriate. As our code illustrated, this can be added without changing the
existing report in any substantive way. I think there is value in this, and
this is core to what the OOM report should provide. Additional items may be
add-ons that are environment specific, but these are OOM reporting essentials
IMHO.

>
> > I mentioned this before but I will reiterate this here.
> >
> > So how do we get there? Let's look at the existing report which we know
> > has issues.
> >
> > Other than a few essential OOM messages the OOM code should produce,
> > such as the Killed process message sequence being included,
> > you could have the entire OOM report moved to an eBPF script and
> > therefore make it customizable, configurable or, if you prefer, programmable.
>
> I believe we should keep the current reporting in place and allow
> additional information via dynamic mechanism. Be it a registration
> mechanism that modules can hook into or other more dynamic way.
> The current reporting has proven to be useful in many typical oom
> situations in my past years of experience. It gives the rough state of
> the failing allocation, MM subsystem, tasks that are eligible and task
> that is killed so that you can understand why the event happened.
>
> I would argue that the eligible tasks should be printed on an opt-in
> basis because this is more of a relict from the past when the victim
> selection was less deterministic. But that is another story.
>
> All the rest of dump_header should stay IMHO as a reasonable default and
> bare minimum.
>
> > Why? Because as we all agree, you'll never have a perfect OOM Report.
> > So if you believe this, then, if you will, put your money where your mouth
> > is (so to speak) and make the entire OOM Report an eBPF script.
> > We'd be willing to help with this.
> >
> > I'll give specific reasons why you want to do this.
> >
> >- Don't want to maintain a lot of code in the kernel (eBPF code doesn't
> >count).
> >- Can't produce an ideal OOM report.
> >- Don't like configuring things but favor programmatic solutions.
> >- Agree the existing OOM report doesn't work for all environments.
> >- Want to allow flexibility but can't support everything people might
> >want.
> >- Then installing an eBPF for OOM Reporting isn't an option, it's
> >required.
>
> This is going into an extreme. We cannot serve all cases but that is
> true for any other heuristics/reporting in the kernel. We do care about
> most.

Unfortunately my argument for this is moot; this can't be done with
eBPF, at least not now.

>
> > The last reason is huge for people who live in a world with large data
> > centers. Data center managers are very conservative. They don't want to
> > deviate from standard operating procedure unless absolutely necessary.
> > If loading an OOM Report eBPF is standard to get OOM Reporting output,
> > then they'll accept that.
>
> I have already responded to this kind of argumentation elsewhere. This
> is not a relevant argument for any kernel implementation. This is a data
> process management process.
>
> --
> Michal Hocko
> SUSE Labs


Re: [PATCH 00/10] OOM Debug print selection and additional information

2019-08-29 Thread Edward Chron
On Thu, Aug 29, 2019 at 4:56 AM Michal Hocko  wrote:
>
> On Thu 29-08-19 19:14:46, Tetsuo Handa wrote:
> > On 2019/08/29 16:11, Michal Hocko wrote:
> > > On Wed 28-08-19 12:46:20, Edward Chron wrote:
> > >> Our belief is if you really think eBPF is the preferred mechanism
> > >> then move OOM reporting to an eBPF.
> > >
> > > I've said that all this additional information has to be dynamically
> > > extensible rather than a part of the core kernel. Whether eBPF is the
> > > suitable tool, I do not know. I haven't explored that. There are other
> > > ways to inject code to the kernel. systemtap/kprobes, kernel modules and
> > > probably others.
> >
> > As for SystemTap, guru mode (an expert mode which disables the protection
> > provided by SystemTap, allowing the kernel to crash when something goes
> > wrong) could be used for holding a spinlock. However, as far as I know,
> > holding a mutex (or doing any operation that might sleep) from such dynamic
> > hooks is not allowed. Also, we will need to export various symbols in order
> > to allow access from such dynamic hooks.
>
> This is the oom path and it should better not use any sleeping locks in
> the first place.
>
> > I'm not familiar with eBPF, but I guess that eBPF is similar.
> >
> > But please be aware that, I REPEAT AGAIN, I don't think neither eBPF nor
> > SystemTap will be suitable for dumping OOM information. OOM situation means
> > that even single page fault event cannot complete, and temporary memory
> > allocation for reading from kernel or writing to files cannot complete.
>
> And I repeat that no such reporting is going to write to files. This is
> an OOM path afterall.
>
> > Therefore, we will need to hold all information in kernel memory (without
> > allocating any memory when the OOM event happens). Dynamic hooks could hold
> > a few lines of output, but not all the lines we want. The only possible
> > buffer which is preallocated and large enough would be printk()'s buffer.
> > Thus, I believe that we will have to use printk() in order to dump OOM
> > information. At that point,
>
> Yes, this is what I've had in mind.
>

+1: It makes sense to keep the report going to dmesg so it persists.
That is where it has always gone and there is no reason to change.
You can have several OOMs back to back and you'd like to retain the output.
All the information should be kept together in the OOM report.

> >
> >   static bool (*oom_handler)(struct oom_control *oc) = default_oom_killer;
> >
> >   bool out_of_memory(struct oom_control *oc)
> >   {
> >   return oom_handler(oc);
> >   }
> >
> > and let in-tree kernel modules override current OOM killer would be
> > the only practical choice (if we refuse adding many knobs).
>
> Or simply provide a hook with the oom_control to be called to report
> without replacing the whole oom killer behavior. That is not necessary.

For a very simple addition, such as adding a line of output, this works.
It would still be nice to address the fact that the existing OOM Report prints
all of the user processes or none. It would be nice to add some control
for that. That's what we did.

> --
> Michal Hocko
> SUSE Labs


Re: [PATCH 00/10] OOM Debug print selection and additional information

2019-08-29 Thread Tetsuo Handa
On 2019/08/29 20:56, Michal Hocko wrote:
>> But please be aware that, I REPEAT AGAIN, I don't think neither eBPF nor
>> SystemTap will be suitable for dumping OOM information. OOM situation means
>> that even single page fault event cannot complete, and temporary memory
>> allocation for reading from kernel or writing to files cannot complete.
> 
> And I repeat that no such reporting is going to write to files. This is
> an OOM path afterall.

The process that fetches from e.g. an eBPF event cannot involve a page fault.
The front-end for iovisor/bcc is a python userspace process. But I think
that such a process can't run in an OOM situation.
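
For concreteness, the kind of eBPF attachment being discussed would be the C
half of a bcc program along the lines of the sketch below (compiled, loaded
and drained by bcc's python front-end). It only shows the mechanism; it does
not address the problem described above, since the userspace reader may be
unable to make progress while out_of_memory() is running.

  /* bcc program text: auto-attached as a kprobe on out_of_memory() */
  #include <uapi/linux/ptrace.h>

  int kprobe__out_of_memory(struct pt_regs *ctx)
  {
          char comm[16];

          bpf_get_current_comm(&comm, sizeof(comm));
          /* one line into the kernel trace buffer, no memory allocation */
          bpf_trace_printk("out_of_memory() entered by %s\n", comm);
          return 0;
  }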

> 
>> Therefore, we will need to hold all information in kernel memory (without
>> allocating any memory when OOM event happened). Dynamic hooks could hold
>> a few lines of output, but not all lines we want. The only possible buffer
>> which is preallocated and large enough would be printk()'s buffer. Thus,
>> I believe that we will have to use printk() in order to dump OOM information.
>> At that point,
> 
> Yes, this is what I've had in mind.

Probably I took an incorrect shortcut.

Dynamic hooks could hold a few lines of output, but dynamic hooks cannot hold
all the lines when dump_tasks() reports 32000+ processes. We have to buffer all
output in kernel memory because we can't complete even a page fault event
triggered by the python process monitoring the eBPF event (and writing the
result to some log file or something) while out_of_memory() is in flight.

And "set /proc/sys/vm/oom_dump_tasks to 0" is not the right reaction. What I'm
saying is "we won't be able to hold output from dump_tasks() if output from
dump_tasks() goes to a buffer preallocated for dynamic hooks". We have to find
a way that can handle the worst case.


Re: [PATCH 00/10] OOM Debug print selection and additional information

2019-08-29 Thread Michal Hocko
On Thu 29-08-19 19:14:46, Tetsuo Handa wrote:
> On 2019/08/29 16:11, Michal Hocko wrote:
> > On Wed 28-08-19 12:46:20, Edward Chron wrote:
> >> Our belief is if you really think eBPF is the preferred mechanism
> >> then move OOM reporting to an eBPF.
> > 
> > I've said that all this additional information has to be dynamically
> > extensible rather than a part of the core kernel. Whether eBPF is the
> > suitable tool, I do not know. I haven't explored that. There are other
> > ways to inject code to the kernel. systemtap/kprobes, kernel modules and
> > probably others.
> 
> As for SystemTap, guru mode (an expert mode which disables the protection
> provided by SystemTap, allowing the kernel to crash when something goes
> wrong) could be used for holding a spinlock. However, as far as I know,
> holding a mutex (or doing any operation that might sleep) from such dynamic
> hooks is not allowed. Also, we will need to export various symbols in order
> to allow access from such dynamic hooks.

This is the oom path and it should better not use any sleeping locks in
the first place.

> I'm not familiar with eBPF, but I guess that eBPF is similar.
> 
> But please be aware that, I REPEAT AGAIN, I don't think neither eBPF nor
> SystemTap will be suitable for dumping OOM information. OOM situation means
> that even single page fault event cannot complete, and temporary memory
> allocation for reading from kernel or writing to files cannot complete.

And I repeat that no such reporting is going to write to files. This is
an OOM path afterall.

> Therefore, we will need to hold all information in kernel memory (without
> allocating any memory when OOM event happened). Dynamic hooks could hold
> a few lines of output, but not all lines we want. The only possible buffer
> which is preallocated and large enough would be printk()'s buffer. Thus,
> I believe that we will have to use printk() in order to dump OOM information.
> At that point,

Yes, this is what I've had in mind.

> 
>   static bool (*oom_handler)(struct oom_control *oc) = default_oom_killer;
> 
>   bool out_of_memory(struct oom_control *oc)
>   {
>   return oom_handler(oc);
>   }
> 
> and let in-tree kernel modules override current OOM killer would be
> the only practical choice (if we refuse adding many knobs).

Or simply provide a hook with the oom_control to be called to report
without replacing the whole oom killer behavior. That is not necessary.
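
To make the shape of that suggestion concrete, a report-only hook could look
roughly like the sketch below. The registration API shown
(register_oom_report_hook) is hypothetical and does not exist today; it is
only meant to illustrate a callback that prints extra state from the OOM path
without replacing the killer.

  #include <linux/module.h>
  #include <linux/oom.h>
  #include <linux/printk.h>

  typedef void (*oom_report_fn)(struct oom_control *oc);

  /* hypothetical: would live in mm/oom_kill.c and be called from dump_header() */
  int register_oom_report_hook(oom_report_fn fn);

  static void extra_oom_report(struct oom_control *oc)
  {
          /* OOM path: printk only, no allocations, no sleeping locks */
          pr_info("extra oom info: order=%d\n", oc->order);
  }

  static int __init extra_oom_report_init(void)
  {
          return register_oom_report_hook(extra_oom_report);
  }

  module_init(extra_oom_report_init);
  MODULE_LICENSE("GPL");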
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 00/10] OOM Debug print selection and additional information

2019-08-29 Thread Tetsuo Handa
On 2019/08/29 16:11, Michal Hocko wrote:
> On Wed 28-08-19 12:46:20, Edward Chron wrote:
>> Our belief is if you really think eBPF is the preferred mechanism
>> then move OOM reporting to an eBPF.
> 
> I've said that all this additional information has to be dynamically
> extensible rather than a part of the core kernel. Whether eBPF is the
> suitable tool, I do not know. I haven't explored that. There are other
> ways to inject code to the kernel. systemtap/kprobes, kernel modules and
> probably others.

As for SystemTap, guru mode (an expert mode which disables the protection
provided by SystemTap, allowing the kernel to crash when something goes wrong)
could be used for holding a spinlock. However, as far as I know, holding a
mutex (or doing any operation that might sleep) from such dynamic hooks is not
allowed. Also, we will need to export various symbols in order to allow access
from such dynamic hooks.

I'm not familiar with eBPF, but I guess that eBPF is similar.

But please be aware that, I REPEAT AGAIN, I don't think either eBPF or
SystemTap will be suitable for dumping OOM information. An OOM situation means
that even a single page fault event cannot complete, and temporary memory
allocation for reading from the kernel or writing to files cannot complete.

Therefore, we will need to hold all information in kernel memory (without
allocating any memory when the OOM event happens). Dynamic hooks could hold
a few lines of output, but not all the lines we want. The only possible buffer
which is preallocated and large enough would be printk()'s buffer. Thus,
I believe that we will have to use printk() in order to dump OOM information.
At that point,

  static bool (*oom_handler)(struct oom_control *oc) = default_oom_killer;

  bool out_of_memory(struct oom_control *oc)
  {
  return oom_handler(oc);
  }

and letting in-tree kernel modules override the current OOM killer would be
the only practical choice (if we refuse to add many knobs).
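
Under that scheme, an in-tree module supplying its own handler would be
something like the sketch below. Both set_oom_handler() and
default_oom_killer() are hypothetical names taken from the snippet above;
nothing like them exists in the current kernel.

  #include <linux/module.h>
  #include <linux/oom.h>
  #include <linux/printk.h>

  extern bool default_oom_killer(struct oom_control *oc);       /* hypothetical */
  extern void set_oom_handler(bool (*h)(struct oom_control *)); /* hypothetical */

  static bool verbose_oom_handler(struct oom_control *oc)
  {
          pr_info("oom: order=%d, dumping extra state\n", oc->order);
          /* ...print slabs/vmalloc/processes above a size threshold here... */
          return default_oom_killer(oc);  /* then fall back to stock behaviour */
  }

  static int __init verbose_oom_init(void)
  {
          set_oom_handler(verbose_oom_handler);
          return 0;
  }

  module_init(verbose_oom_init);
  MODULE_LICENSE("GPL");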



Re: [PATCH 00/10] OOM Debug print selection and additional information

2019-08-29 Thread Michal Hocko
On Wed 28-08-19 12:46:20, Edward Chron wrote:
[...]
> Our belief is if you really think eBPF is the preferred mechanism
> then move OOM reporting to an eBPF.

I've said that all this additional information has to be dynamically
extensible rather than a part of the core kernel. Whether eBPF is the
suitable tool, I do not know. I haven't explored that. There are other
ways to inject code to the kernel. systemtap/kprobes, kernel modules and
probably others.

> I mentioned this before but I will reiterate this here.
> 
> So how do we get there? Let's look at the existing report which we know
> has issues.
> 
> Other than a few essential OOM messages the OOM code should produce,
> such as the Killed process message sequence being included,
> you could have the entire OOM report moved to an eBPF script and
> therefore make it customizable, configurable or, if you prefer, programmable.

I believe we should keep the current reporting in place and allow
additional information via a dynamic mechanism. Be it a registration
mechanism that modules can hook into or some other more dynamic way.
The current reporting has proven to be useful in many typical oom
situations in my past years of experience. It gives the rough state of
the failing allocation, the MM subsystem, the tasks that are eligible and
the task that is killed so that you can understand why the event happened.

I would argue that the eligible tasks should be printed on an opt-in
basis because this is more of a relict from the past when the victim
selection was less deterministic. But that is another story.

All the rest of dump_header should stay IMHO as a reasonable default and
bare minimum.

> Why? Because as we all agree, you'll never have a perfect OOM Report.
> So if you believe this, then, if you will, put your money where your mouth
> is (so to speak) and make the entire OOM Report an eBPF script.
> We'd be willing to help with this.
> 
> I'll give specific reasons why you want to do this.
> 
>- Don't want to maintain a lot of code in the kernel (eBPF code doesn't
>count).
>- Can't produce an ideal OOM report.
>- Don't like configuring things but favor programmatic solutions.
>- Agree the existing OOM report doesn't work for all environments.
>- Want to allow flexibility but can't support everything people might
>want.
>- Then installing an eBPF for OOM Reporting isn't an option, it's
>required.

This is going into an extreme. We cannot serve all cases but that is
true for any other heuristics/reporting in the kernel. We do care about
most.

> The last reason is huge for people who live in a world with large data
> centers. Data center managers are very conservative. They don't want to
> deviate from standard operating procedure unless absolutely necessary.
> If loading an OOM Report eBPF is standard to get OOM Reporting output,
> then they'll accept that.

I have already responded to this kind of argumentation elsewhere. This
is not a relevant argument for any kernel implementation. This is a data
process management process.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 00/10] OOM Debug print selection and additional information

2019-08-28 Thread Edward Chron
On Wed, Aug 28, 2019 at 1:04 PM Edward Chron  wrote:
>
> On Wed, Aug 28, 2019 at 3:12 AM Tetsuo Handa
>  wrote:
> >
> > On 2019/08/28 16:08, Michal Hocko wrote:
> > > On Tue 27-08-19 19:47:22, Edward Chron wrote:
> > >> For production systems installing and updating EBPF scripts may someday
> > >> be very common, but I wonder how data center managers feel about it now?
> > >> Developers are very excited about it and it is a very powerful tool but 
> > >> can I
> > >> get permission to add or replace an existing EBPF on production systems?
> > >
> > > I am not sure I understand. There must be somebody trusted to take care
> > > of systems, right?
> > >
> >
> > Speak of my cases, those who take care of their systems are not developers.
> > And they afraid changing code that runs in kernel mode. They unlikely give
> > permission to install SystemTap/eBPF scripts. As a result, in many cases,
> > the root cause cannot be identified.
>
> +1. Exactly. The only thing we could think of, Tetsuo, is that if Linux OOM
> reporting uses an eBPF script then systems have to load it to get any kind of
> meaningful report. Frankly, if using eBPF is the route to go then essentially
> the whole OOM reporting should go there. We can adjust as we need and
> have precedent for wanting to load the script. That's the best we could come
> up with.
>
> >
> > Moreover, we are talking about OOM situations, where we can't expect 
> > userspace
> > processes to work properly. We need to dump information we want, without
> > counting on userspace processes, before sending SIGKILL.
>
> +1. We've tried, and as you point out, for best results the kernel has to
> provide the state.
>
> Again a full system dump would be wonderful, but taking a full dump for
> every OOM event on production systems? I am not nearly a good enough salesman
> to sell that one. So we need an alternate mechanism.
>
> If we can't agree on some sort of extensible, configurable approach then put
> the standard OOM Report in eBPF and make it mandatory to load it so we can
> justify having to do that. Linux should load it automatically.
> We'll just make a few changes and additions as needed.
>
> Sounds like a plan that we could live with.
> Would be interested if this works for others as well.

One further comment. In talking with my colleagues here, who know eBPF much
better than I do, it may not be possible to implement something this
complicated with eBPF.

If that is in fact the case, then we'd have to try to hook the OOM reporting
code with tracepoints, similar to kprobes, only we want to do more than add
counters; we want to change the flow to skip small output entries that aren't
worth printing. If this isn't feasible with eBPF, then some derivative of our
approach, or enhancing the OOM output code directly, seems like the best
option. We will have to investigate this further.


Re: [PATCH 00/10] OOM Debug print selection and additional information

2019-08-28 Thread Qian Cai
On Wed, 2019-08-28 at 14:17 -0700, Edward Chron wrote:
> On Wed, Aug 28, 2019 at 1:18 PM Qian Cai  wrote:
> > 
> > On Wed, 2019-08-28 at 12:46 -0700, Edward Chron wrote:
> > > But with the caveat that running an eBPF script isn't standard Linux
> > > operating procedure; at this point in time, anyway, it will not be well
> > > received in the data center.
> > 
> > Can't you get your eBPF scripts into the BCC project? As far as I can tell,
> > BCC has been included in several distros already, and then it will become a
> > part of standard linux toolkits.
> > 
> > > 
> > > Our belief is if you really think eBPF is the preferred mechanism
> > > then move OOM reporting to an eBPF.
> > > I mentioned this before but I will reiterate this here.
> > 
> > On the other hand, it seems many people are happy with the simple kernel OOM
> > report we have here. Not saying the current situation is perfect. On top of
> > that, some people are using kdump, and some people have resource monitoring
> > to warn about potential memory overcommits before OOM kicks in, etc.
> 
> Assuming you can implement your existing report in eBPF, then those who like
> the current output would still get the current output. Same with the patches
> we sent upstream: nothing in the report changes by default. So no problems
> for those who are happy, they'll still be happy.

I don't think it makes any sense to rewrite the existing code to depend on eBPF,
though.



Re: [PATCH 00/10] OOM Debug print selection and additional information

2019-08-28 Thread Edward Chron
On Wed, Aug 28, 2019 at 1:18 PM Qian Cai  wrote:
>
> On Wed, 2019-08-28 at 12:46 -0700, Edward Chron wrote:
> > But with the caveat that running an eBPF script isn't standard Linux
> > operating procedure; at this point in time, anyway, it will not be well
> > received in the data center.
>
> Can't you get your eBPF scripts into the BCC project? As far as I can tell,
> BCC has been included in several distros already, and then it will become a
> part of standard linux toolkits.
>
> >
> > Our belief is if you really think eBPF is the preferred mechanism
> > then move OOM reporting to an eBPF.
> > I mentioned this before but I will reiterate this here.
>
> On the other hand, it seems many people are happy with the simple kernel OOM
> report we have here. Not saying the current situation is perfect. On top of
> that, some people are using kdump, and some people have resource monitoring to
> warn about potential memory overcommits before OOM kicks in, etc.

Assuming you can implement your existing report in eBPF, then those who like the
current output would still get the current output. Same with the patches we sent
upstream: nothing in the report changes by default. So no problems for those who
are happy; they'll still be happy.


Re: [PATCH 00/10] OOM Debug print selection and additional information

2019-08-28 Thread Qian Cai
On Wed, 2019-08-28 at 12:46 -0700, Edward Chron wrote:
> But with the caveat that running an eBPF script isn't standard Linux
> operating procedure; at this point in time, anyway, it will not be well
> received in the data center.

Can't you get your eBPF scripts into the BCC project? As far as I can tell, BCC
has been included in several distros already, and then it will become a part of
standard linux toolkits.

> 
> Our belief is if you really think eBPF is the preferred mechanism
> then move OOM reporting to an eBPF. 
> I mentioned this before but I will reiterate this here.

On the other hand, it seems many people are happy with the simple kernel OOM
report we have here. Not saying the current situation is perfect. On top of
that, some people are using kdump, and some people have resource monitoring to
warn about potential memory overcommits before OOM kicks in, etc.


Re: [PATCH 00/10] OOM Debug print selection and additional information

2019-08-28 Thread Edward Chron
On Wed, Aug 28, 2019 at 3:12 AM Tetsuo Handa
 wrote:
>
> On 2019/08/28 16:08, Michal Hocko wrote:
> > On Tue 27-08-19 19:47:22, Edward Chron wrote:
> >> For production systems installing and updating EBPF scripts may someday
> >> be very common, but I wonder how data center managers feel about it now?
> >> Developers are very excited about it and it is a very powerful tool but 
> >> can I
> >> get permission to add or replace an existing EBPF on production systems?
> >
> > I am not sure I understand. There must be somebody trusted to take care
> > of systems, right?
> >
>
> Speaking of my cases, those who take care of their systems are not developers.
> And they are afraid of changing code that runs in kernel mode. They are
> unlikely to give permission to install SystemTap/eBPF scripts. As a result,
> in many cases, the root cause cannot be identified.

+1. Exactly. The only thing we could think of, Tetsuo, is that if Linux OOM
reporting uses an eBPF script then systems have to load it to get any kind of
meaningful report. Frankly, if using eBPF is the route to go then essentially
the whole OOM reporting should go there. We can adjust as we need and
have precedent for wanting to load the script. That's the best we could come
up with.

>
> Moreover, we are talking about OOM situations, where we can't expect userspace
> processes to work properly. We need to dump information we want, without
> counting on userspace processes, before sending SIGKILL.

+1. We've tried, and as you point out, for best results the kernel has to
provide the state.

Again a full system dump would be wonderful, but taking a full dump for
every OOM event on production systems? I am not nearly a good enough salesman
to sell that one. So we need an alternate mechanism.

If we can't agree on some sort of extensible, configurable approach then put
the standard OOM Report in eBPF and make it mandatory to load it so we can
justify having to do that. Linux should load it automatically.
We'll just make a few changes and additions as needed.

Sounds like a plan that we could live with.
Would be interested if this works for others as well.


Re: [PATCH 00/10] OOM Debug print selection and additional information

2019-08-28 Thread Michal Hocko
On Wed 28-08-19 19:56:58, Tetsuo Handa wrote:
> On 2019/08/28 19:32, Michal Hocko wrote:
> >> Speak of my cases, those who take care of their systems are not developers.
> >> And they afraid changing code that runs in kernel mode. They unlikely give
> >> permission to install SystemTap/eBPF scripts. As a result, in many cases,
> >> the root cause cannot be identified.
> > 
> > Which is something I would call a process problem more than a kernel
> > one. Really if you need to debug a problem you really have to trust
> > those who can debug that for you. We are not going to take tons of code
> > to the kernel just because somebody is afraid to run a diagnostic.
> > 
> 
> This is a problem of kernel development process.

I disagree. Expecting that any larger project can be filled with the
(close to) _full_ and ready to use introspection built in is just
insane. We are trying to help with generally useful information, but
you simply cannot cover most existing failure paths.

> >> Moreover, we are talking about OOM situations, where we can't expect
> >> userspace processes to work properly. We need to dump information we want,
> >> without counting on userspace processes, before sending SIGKILL.
> > 
> > Yes, this is an inherent assumption I was making and that means that
> > whatever dynamic hooks would have to be registered in advance.
> > 
> 
> No. I'm saying that neither static hooks nor dynamic hooks can work as
> expected if they count on userspace processes. Registering in advance is
> irrelevant. Whether it can work without userspace processes is relevant.

I am not saying otherwise. I do not expect any userspace process to dump
any information or read it from anywhere other than the kernel log.

> Also, out-of-tree codes tend to become defunctional. We are trying to debug
> problems caused by in-tree code. Breaking out-of-tree debugging code just
> because in-tree code developers don't want to pay the burden of maintaining
> code for debugging problems caused by in-tree code is a very bad idea.

This is simple cost/benefit math. The maintenance cost is not free,
and paying it for odd cases most people do not care about is simply not
sustainable; we simply do not have that much manpower.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH 00/10] OOM Debug print selection and additional information

2019-08-28 Thread Tetsuo Handa
On 2019/08/28 19:32, Michal Hocko wrote:
>> Speak of my cases, those who take care of their systems are not developers.
>> And they afraid changing code that runs in kernel mode. They unlikely give
>> permission to install SystemTap/eBPF scripts. As a result, in many cases,
>> the root cause cannot be identified.
> 
> Which is something I would call a process problem more than a kernel
> one. Really if you need to debug a problem you really have to trust
> those who can debug that for you. We are not going to take tons of code
> to the kernel just because somebody is afraid to run a diagnostic.
> 

This is a problem of the kernel development process.

>> Moreover, we are talking about OOM situations, where we can't expect
>> userspace processes to work properly. We need to dump information we want,
>> without counting on userspace processes, before sending SIGKILL.
> 
> Yes, this is an inherent assumption I was making and that means that
> whatever dynamic hooks would have to be registered in advance.
> 

No. I'm saying that neither static hooks nor dynamic hooks can work as
expected if they count on userspace processes. Registering in advance is
irrelevant. Whether it can work without userspace processes is relevant.

Also, out-of-tree code tends to become defunct. We are trying to debug
problems caused by in-tree code. Breaking out-of-tree debugging code just
because in-tree code developers don't want to bear the burden of maintaining
code for debugging problems caused by in-tree code is a very bad idea.



Re: [PATCH 00/10] OOM Debug print selection and additional information

2019-08-28 Thread Michal Hocko
On Wed 28-08-19 19:12:41, Tetsuo Handa wrote:
> On 2019/08/28 16:08, Michal Hocko wrote:
> > On Tue 27-08-19 19:47:22, Edward Chron wrote:
> >> For production systems installing and updating EBPF scripts may someday
> >> be very common, but I wonder how data center managers feel about it now?
> >> Developers are very excited about it and it is a very powerful tool but 
> >> can I
> >> get permission to add or replace an existing EBPF on production systems?
> > 
> > I am not sure I understand. There must be somebody trusted to take care
> > of systems, right?
> > 
> 
> Speak of my cases, those who take care of their systems are not developers.
> And they afraid changing code that runs in kernel mode. They unlikely give
> permission to install SystemTap/eBPF scripts. As a result, in many cases,
> the root cause cannot be identified.

Which is something I would call a process problem more than a kernel
one. Really, if you need to debug a problem you have to trust
those who can debug it for you. We are not going to take tons of code
into the kernel just because somebody is afraid to run a diagnostic.

> Moreover, we are talking about OOM situations, where we can't expect userspace
> processes to work properly. We need to dump information we want, without
> counting on userspace processes, before sending SIGKILL.

Yes, this is an inherent assumption I was making, and that means that
whatever dynamic hooks are used would have to be registered in advance.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 00/10] OOM Debug print selection and additional information

2019-08-28 Thread Tetsuo Handa
On 2019/08/28 16:08, Michal Hocko wrote:
> On Tue 27-08-19 19:47:22, Edward Chron wrote:
>> For production systems installing and updating EBPF scripts may someday
>> be very common, but I wonder how data center managers feel about it now?
>> Developers are very excited about it and it is a very powerful tool but can I
>> get permission to add or replace an existing EBPF on production systems?
> 
> I am not sure I understand. There must be somebody trusted to take care
> of systems, right?
> 

Speaking of my cases, those who take care of their systems are not developers.
And they are afraid of changing code that runs in kernel mode. They are
unlikely to give permission to install SystemTap/eBPF scripts. As a result,
in many cases, the root cause cannot be identified.

Moreover, we are talking about OOM situations, where we can't expect userspace
processes to work properly. We need to dump information we want, without
counting on userspace processes, before sending SIGKILL.


Re: [PATCH 00/10] OOM Debug print selection and additional information

2019-08-28 Thread Michal Hocko
On Tue 27-08-19 19:47:22, Edward Chron wrote:
> On Tue, Aug 27, 2019 at 6:32 PM Qian Cai  wrote:
> >
> >
> >
> > > On Aug 27, 2019, at 9:13 PM, Edward Chron  wrote:
> > >
> > > On Tue, Aug 27, 2019 at 5:50 PM Qian Cai  wrote:
> > >>
> > >>
> > >>
> > >>> On Aug 27, 2019, at 8:23 PM, Edward Chron  wrote:
> > >>>
> > >>>
> > >>>
> > >>> On Tue, Aug 27, 2019 at 5:40 AM Qian Cai  wrote:
> > >>> On Mon, 2019-08-26 at 12:36 -0700, Edward Chron wrote:
> >  This patch series provides code that works as a debug option through
> >  debugfs to provide additional controls to limit how much information
> >  gets printed when an OOM event occurs and/or optionally print additional
> >  information about slab usage, vmalloc allocations, user process memory
> >  usage, the number of processes / tasks and some summary information
> >  about these tasks (number runnable, i/o wait), system information
> >  (#CPUs, Kernel Version and other useful state of the system),
> >  ARP and ND Cache entry information.
> > 
> >  Linux OOM can optionally provide a lot of information, what's missing?
> >  --
> >  Linux provides a variety of detailed information when an OOM event occurs
> >  but has limited options to control how much output is produced. The
> >  system related information is produced unconditionally and limited per
> >  user process information is produced as a default enabled option. The
> >  per user process information may be disabled.
> > 
> >  Slab usage information was recently added and is output only if slab
> >  usage exceeds user memory usage.
> > 
> >  Many OOM events are due to user application memory usage, sometimes in
> >  combination with kernel resource usage that exceeds expected memory usage.
> >  Detailed information about how memory was being used when the event
> >  occurred may be required to identify the root cause of the OOM event.
> > 
> >  However, some environments are very large and printing all of the
> >  information about processes, slabs and/or vmalloc allocations may
> >  not be feasible. For other environments printing as much information
> >  about these as possible may be needed to root cause OOM events.
> > 
> > >>>
> > >>> For more in-depth analysis of OOM events, people could use kdump to
> > >>> save a vmcore by setting "panic_on_oom", and then use the crash utility
> > >>> to analyze the vmcore, which contains pretty much all the information
> > >>> you need.
> > >>>
> > >>> Certainly, this is the ideal. A full system dump would give you the 
> > >>> maximum amount of
> > >>> information.
> > >>>
> > >>> Unfortunately some environments may lack space to store the dump,
> > >>
> > >> Kdump usually also support dumping to a remote target via NFS, SSH etc
> > >>
> > >>> let alone the time to dump the storage contents and restart the system. 
> > >>> Some
> > >>
> > >> There is also “makedumpfile” that could compress and filter unwanted 
> > >> memory to reduce
> > >> the vmcore size and speed up the dumping process by utilizing 
> > >> multi-threads.
> > >>
> > >>> systems can take many minutes to fully boot up, to reset and 
> > >>> reinitialize all the
> > >>> devices. So unfortunately this is not always an option, and we need an 
> > >>> OOM Report.
> > >>
> > >> I am not sure how the system needs some minutes to reboot would be 
> > >> relevant  for the
> > >> discussion here. The idea is to save a vmcore and it can be analyzed 
> > >> offline even on
> > >> another system as long as it having a matching “vmlinux.".
> > >>
> > >>
> > >
> > > If selecting a dump on an OOM event doesn't reboot the system and if
> > > it runs fast enough such
> > > that it doesn't slow processing enough to appreciably effect the
> > > system's responsiveness then
> > > then it would be ideal solution. For some it would be over kill but
> > > since it is an option it is a
> > > choice to consider or not.
> >
> > It sounds like you are looking for more of this,
> 
> If you want to supplement the OOM Report and keep the information
> together than you could use EBPF to do that. If that really is the
> preference it might make sense to put the entire report as an EBPF
> script than you can modify the script however you choose. That would
> be very flexible. You can change your configuration on the fly. As
> long as it has access to everything you need it should work.
> 
> Michal would know what direction OOM is headed and if he thinks that fits with
> where things are headed.

It seems we have landed on similar thinking here. As mentioned in my
earlier email in this thread I can see the extensibility to be achieved
by eBPF. Essentially we would have a base form of the oom report like
now and scripts would then hook in there to provide whatever a specific

Re: [PATCH 00/10] OOM Debug print selection and additional information

2019-08-28 Thread Michal Hocko
On Tue 27-08-19 18:07:54, Edward Chron wrote:
> On Tue, Aug 27, 2019 at 12:15 AM Michal Hocko  wrote:
> >
> > On Mon 26-08-19 12:36:28, Edward Chron wrote:
> > [...]
> > > Extensibility using OOM debug options
> > > -
> > > What is needed is an extensible system to optionally configure
> > > debug options as needed and to then dynamically enable and disable
> > > them. Also for options that produce multiple lines of entry based
> > > output, to configure which entries to print based on how much
> > > memory they use (or optionally all the entries).
> >
> > With a patch this large and adding a lot of new stuff we need a more
> > detailed usecases described I believe.
> 
> I guess it would make sense to explain motivation for each OOM Debug
> option I've sent separately.
> I see there comments on the patches I will try and add more information there.
> 
> An overview would be that we've been collecting information on OOM's
> over the last 12 years or so.
> These are from switches, other embedded devices, servers both large and small.
> We ask for feedback on what information was helpful or could be helpful.
> We try and add it to make root causing issues easier.
> 
> These OOM debug options are some of the options we've created.
> I didn't port all of them to 5.3 but these are representative.
> Our latest is kernel is a bit behind 5.3.
> 
> >
> >
> > [...]
> >
> > > Use of debugfs to allow dynamic controls
> > > 
> > > By providing a debugfs interface that allows options to be configured,
> > > enabled and where appropriate to set a minimum size for selecting
> > > entries to print, the output produced when an OOM event occurs can be
> > > dynamically adjusted to produce as little or as much detail as needed
> > > for a given system.
> >
> > Who is going to consume this information and why would that consumer be
> > unreasonable to demand further maintenance of that information in future
> > releases? In other words debugfs is not considered a stableAPI which is
> > OK here but the side effect of any change to these files results in user
> > visible behavior and we consider that more or less a stable as long as
> > there are consumers.
> >
> > > OOM debug options can be added to the base code as needed.
> > >
> > > Currently we have the following OOM debug options defined:
> > >
> > > * System State Summary
> > >   
> > >   One line of output that includes:
> > >   - Uptime (days, hour, minutes, seconds)
> >
> > We do have timestamps in the log so why is this needed?
> 
> 
> Here is how an OOM report looks when we get it to look at:
> 
> Aug 26 09:06:34 coronado kernel: oomprocs invoked oom-killer:
> gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0,
> oom_score_adj=1000
> Aug 26 09:06:34 coronado kernel: CPU: 1 PID: 2795 Comm: oomprocs Not
> tainted 5.3.0-rc6+ #33
> Aug 26 09:06:34 coronado kernel: Hardware name: Compulab Ltd.
> IPC3/IPC3, BIOS 5.12_IPC3K.PRD.0.25.7 08/09/2018
> 
> This shows the date and time, not time of the last boot. The
> /var/log/messages output is what we often have to look at not raw
> dmesgs.

This looks more like a configuration of the logging than a kernel
problem. Kernel does provide timestamps for logs. E.g.
$ tail -n1 /var/log/kern.log
Aug 28 08:27:46 tiehlicka kernel: <1054>[336340.954345] systemd-udevd[7971]: 
link_config: autonegotiation is unset or enabled, the speed and duplex are not 
writable.

[...]
> > >   Example output when configured and enabled:
> > >
> > > Jul 22 15:20:57 yoursystem kernel: Threads:530 Processes:279 
> > > forks_since_boot:2786 procs_runable:2 procs_iowait:0
> > >
> > > * ARP Table and/or Neighbour Discovery Table Summary
> > >   --
> > >   One line of output each for ARP and ND that includes:
> > >   - Table name
> > >   - Table size (max # entries)
> > >   - Key Length
> > >   - Entry Size
> > >   - Number of Entries
> > >   - Last Flush (in seconds)
> > >   - hash grows
> > >   - entry allocations
> > >   - entry destroys
> > >   - Number lookups
> > >   - Number of lookup hits
> > >   - Resolution failures
> > >   - Garbage Collection Forced Runs
> > >   - Table Full
> > >   - Proxy Queue Length
> > >
> > >   Example output when configured and enabled (for both):
> > >
> > > ... kernel: neighbour: Table: arp_tbl size:   256 keyLen:  4 entrySize: 
> > > 360 entries: 9 lastFlush:  1721s hGrows: 1 allocs: 9 
> > > destroys: 0 lookups:   204 hits:   199 resFailed:38 
> > > gcRuns/Forced: 111 /  0 tblFull:  0 proxyQlen:  0
> > >
> > > ... kernel: neighbour: Table:  nd_tbl size:   128 keyLen: 16 entrySize: 
> > > 368 entries: 6 lastFlush:  1720s hGrows: 0 allocs: 7 
> > > destroys: 1 lookups: 0 hits: 0 resFailed: 0 
> > > gcRuns/Forced: 110 /  0 tblFull:  0 proxyQlen:  0
> >
> > Again, why is this needed particularly for the OOM event? I do
> > 

Re: [PATCH 00/10] OOM Debug print selection and additional information

2019-08-27 Thread Edward Chron
On Tue, Aug 27, 2019 at 6:32 PM Qian Cai  wrote:
>
>
>
> > On Aug 27, 2019, at 9:13 PM, Edward Chron  wrote:
> >
> > On Tue, Aug 27, 2019 at 5:50 PM Qian Cai  wrote:
> >>
> >>
> >>
> >>> On Aug 27, 2019, at 8:23 PM, Edward Chron  wrote:
> >>>
> >>>
> >>>
> >>> On Tue, Aug 27, 2019 at 5:40 AM Qian Cai  wrote:
> >>> On Mon, 2019-08-26 at 12:36 -0700, Edward Chron wrote:
>  This patch series provides code that works as a debug option through
>  debugfs to provide additional controls to limit how much information
>  gets printed when an OOM event occurs and or optionally print additional
>  information about slab usage, vmalloc allocations, user process memory
>  usage, the number of processes / tasks and some summary information
>  about these tasks (number runable, i/o wait), system information
>  (#CPUs, Kernel Version and other useful state of the system),
>  ARP and ND Cache entry information.
> 
>  Linux OOM can optionally provide a lot of information, what's missing?
>  --
>  Linux provides a variety of detailed information when an OOM event occurs
>  but has limited options to control how much output is produced. The
>  system related information is produced unconditionally and limited per
>  user process information is produced as a default enabled option. The
>  per user process information may be disabled.
> 
>  Slab usage information was recently added and is output only if slab
>  usage exceeds user memory usage.
> 
>  Many OOM events are due to user application memory usage sometimes in
>  combination with the use of kernel resource usage that exceeds what is
>  expected memory usage. Detailed information about how memory was being
>  used when the event occurred may be required to identify the root cause
>  of the OOM event.
> 
>  However, some environments are very large and printing all of the
>  information about processes, slabs and or vmalloc allocations may
>  not be feasible. For other environments printing as much information
>  about these as possible may be needed to root cause OOM events.
> 
> >>>
> >>> For more in-depth analysis of OOM events, people could use kdump to save a
> >>> vmcore by setting "panic_on_oom", and then use the crash utility to 
> >>> analysis the
> >>> vmcore which contains pretty much all the information you need.
> >>>
> >>> Certainly, this is the ideal. A full system dump would give you the 
> >>> maximum amount of
> >>> information.
> >>>
> >>> Unfortunately some environments may lack space to store the dump,
> >>
> >> Kdump usually also support dumping to a remote target via NFS, SSH etc
> >>
> >>> let alone the time to dump the storage contents and restart the system. 
> >>> Some
> >>
> >> There is also “makedumpfile” that could compress and filter unwanted 
> >> memory to reduce
> >> the vmcore size and speed up the dumping process by utilizing 
> >> multi-threads.
> >>
> >>> systems can take many minutes to fully boot up, to reset and reinitialize 
> >>> all the
> >>> devices. So unfortunately this is not always an option, and we need an 
> >>> OOM Report.
> >>
> >> I am not sure how the system needs some minutes to reboot would be 
> >> relevant  for the
> >> discussion here. The idea is to save a vmcore and it can be analyzed 
> >> offline even on
> >> another system as long as it having a matching “vmlinux.".
> >>
> >>
> >
> > If selecting a dump on an OOM event doesn't reboot the system and if
> > it runs fast enough such
> > that it doesn't slow processing enough to appreciably effect the
> > system's responsiveness then
> > then it would be ideal solution. For some it would be over kill but
> > since it is an option it is a
> > choice to consider or not.
>
> It sounds like you are looking for more of this,

If you want to supplement the OOM Report and keep the information together,
then you could use eBPF to do that. If that really is the preference, it
might make sense to implement the entire report as an eBPF script; then you
can modify the script however you choose. That would be very flexible. You
can change your configuration on the fly. As long as it has access to
everything you need it should work.
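
To make the idea concrete, something along these lines could be done today
with a small BPF program attached to a kprobe. This is purely an illustrative
sketch, not part of this patch series; it assumes a libbpf-style build with
clang -target bpf, and the program and function names are made up:

// SPDX-License-Identifier: GPL-2.0
/* Illustrative sketch only: fire on the kernel's OOM kill path so a
 * userspace loader can record whatever extra state it wants. */
#include <linux/ptrace.h>
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

SEC("kprobe/oom_kill_process")
int trace_oom_kill(struct pt_regs *ctx)
{
	char comm[16];
	__u64 id = bpf_get_current_pid_tgid();

	bpf_get_current_comm(&comm, sizeof(comm));
	/* Shows up in trace_pipe; a real tool would push an event to
	 * userspace via a perf buffer instead. */
	bpf_printk("oom kill path entered, triggered by %s (pid %d)\n",
		   comm, (int)(id >> 32));
	return 0;
}

The oomkill.py bcc tool you pointed to takes a similar approach with a
kprobe on oom_kill_process.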

Michal would know what direction OOM is headed and if he thinks that fits with
where things are headed.

I'm flexible in the sense that I could change our submission to make
specific updates to the existing OOM code. We kept it as separate as
possible for ease of porting. But if we can build an acceptable case for
making updates to the existing OOM Report code, that works.

Our current implementation has some knobs that allow limited scaling,
which has advantages over print rate limiting. It may allow environments
that didn't want to enable printing of processes, slab entries or vmalloc
allocations to do so without generating a lot of output.

But the 

Re: [PATCH 00/10] OOM Debug print selection and additional information

2019-08-27 Thread Qian Cai



> On Aug 27, 2019, at 9:13 PM, Edward Chron  wrote:
> 
> On Tue, Aug 27, 2019 at 5:50 PM Qian Cai  wrote:
>> 
>> 
>> 
>>> On Aug 27, 2019, at 8:23 PM, Edward Chron  wrote:
>>> 
>>> 
>>> 
>>> On Tue, Aug 27, 2019 at 5:40 AM Qian Cai  wrote:
>>> On Mon, 2019-08-26 at 12:36 -0700, Edward Chron wrote:
 This patch series provides code that works as a debug option through
 debugfs to provide additional controls to limit how much information
 gets printed when an OOM event occurs and or optionally print additional
 information about slab usage, vmalloc allocations, user process memory
 usage, the number of processes / tasks and some summary information
 about these tasks (number runable, i/o wait), system information
 (#CPUs, Kernel Version and other useful state of the system),
 ARP and ND Cache entry information.
 
 Linux OOM can optionally provide a lot of information, what's missing?
 --
 Linux provides a variety of detailed information when an OOM event occurs
 but has limited options to control how much output is produced. The
 system related information is produced unconditionally and limited per
 user process information is produced as a default enabled option. The
 per user process information may be disabled.
 
 Slab usage information was recently added and is output only if slab
 usage exceeds user memory usage.
 
 Many OOM events are due to user application memory usage sometimes in
 combination with the use of kernel resource usage that exceeds what is
 expected memory usage. Detailed information about how memory was being
 used when the event occurred may be required to identify the root cause
 of the OOM event.
 
 However, some environments are very large and printing all of the
 information about processes, slabs and or vmalloc allocations may
 not be feasible. For other environments printing as much information
 about these as possible may be needed to root cause OOM events.
 
>>> 
>>> For more in-depth analysis of OOM events, people could use kdump to save a
>>> vmcore by setting "panic_on_oom", and then use the crash utility to 
>>> analysis the
>>> vmcore which contains pretty much all the information you need.
>>> 
>>> Certainly, this is the ideal. A full system dump would give you the maximum 
>>> amount of
>>> information.
>>> 
>>> Unfortunately some environments may lack space to store the dump,
>> 
>> Kdump usually also support dumping to a remote target via NFS, SSH etc
>> 
>>> let alone the time to dump the storage contents and restart the system. Some
>> 
>> There is also “makedumpfile” that could compress and filter unwanted memory 
>> to reduce
>> the vmcore size and speed up the dumping process by utilizing multi-threads.
>> 
>>> systems can take many minutes to fully boot up, to reset and reinitialize 
>>> all the
>>> devices. So unfortunately this is not always an option, and we need an OOM 
>>> Report.
>> 
>> I am not sure how the system needs some minutes to reboot would be relevant  
>> for the
>> discussion here. The idea is to save a vmcore and it can be analyzed offline 
>> even on
>> another system as long as it having a matching “vmlinux.".
>> 
>> 
> 
> If selecting a dump on an OOM event doesn't reboot the system and if
> it runs fast enough such
> that it doesn't slow processing enough to appreciably effect the
> system's responsiveness then
> then it would be ideal solution. For some it would be over kill but
> since it is an option it is a
> choice to consider or not.

It sounds like you are looking for more of this,

https://github.com/iovisor/bcc/blob/master/tools/oomkill.py



Re: [PATCH 00/10] OOM Debug print selection and additional information

2019-08-27 Thread Edward Chron
On Tue, Aug 27, 2019 at 5:50 PM Qian Cai  wrote:
>
>
>
> > On Aug 27, 2019, at 8:23 PM, Edward Chron  wrote:
> >
> >
> >
> > On Tue, Aug 27, 2019 at 5:40 AM Qian Cai  wrote:
> > On Mon, 2019-08-26 at 12:36 -0700, Edward Chron wrote:
> > > This patch series provides code that works as a debug option through
> > > debugfs to provide additional controls to limit how much information
> > > gets printed when an OOM event occurs and or optionally print additional
> > > information about slab usage, vmalloc allocations, user process memory
> > > usage, the number of processes / tasks and some summary information
> > > about these tasks (number runable, i/o wait), system information
> > > (#CPUs, Kernel Version and other useful state of the system),
> > > ARP and ND Cache entry information.
> > >
> > > Linux OOM can optionally provide a lot of information, what's missing?
> > > --
> > > Linux provides a variety of detailed information when an OOM event occurs
> > > but has limited options to control how much output is produced. The
> > > system related information is produced unconditionally and limited per
> > > user process information is produced as a default enabled option. The
> > > per user process information may be disabled.
> > >
> > > Slab usage information was recently added and is output only if slab
> > > usage exceeds user memory usage.
> > >
> > > Many OOM events are due to user application memory usage sometimes in
> > > combination with the use of kernel resource usage that exceeds what is
> > > expected memory usage. Detailed information about how memory was being
> > > used when the event occurred may be required to identify the root cause
> > > of the OOM event.
> > >
> > > However, some environments are very large and printing all of the
> > > information about processes, slabs and or vmalloc allocations may
> > > not be feasible. For other environments printing as much information
> > > about these as possible may be needed to root cause OOM events.
> > >
> >
> > For more in-depth analysis of OOM events, people could use kdump to save a
> > vmcore by setting "panic_on_oom", and then use the crash utility to 
> > analysis the
> >  vmcore which contains pretty much all the information you need.
> >
> > Certainly, this is the ideal. A full system dump would give you the maximum 
> > amount of
> > information.
> >
> > Unfortunately some environments may lack space to store the dump,
>
> Kdump usually also support dumping to a remote target via NFS, SSH etc
>
> > let alone the time to dump the storage contents and restart the system. Some
>
> There is also “makedumpfile” that could compress and filter unwanted memory 
> to reduce
> the vmcore size and speed up the dumping process by utilizing multi-threads.
>
> > systems can take many minutes to fully boot up, to reset and reinitialize 
> > all the
> > devices. So unfortunately this is not always an option, and we need an OOM 
> > Report.
>
> I am not sure how the system needs some minutes to reboot would be relevant  
> for the
> discussion here. The idea is to save a vmcore and it can be analyzed offline 
> even on
> another system as long as it having a matching “vmlinux.".
>
>

If selecting a dump on an OOM event doesn't reboot the system, and if it
runs fast enough that it doesn't slow processing enough to appreciably
affect the system's responsiveness, then it would be an ideal solution.
For some it would be overkill, but since it is an option it is a choice
to consider or not.


Re: [PATCH 00/10] OOM Debug print selection and additional information

2019-08-27 Thread Edward Chron
On Tue, Aug 27, 2019 at 12:15 AM Michal Hocko  wrote:
>
> On Mon 26-08-19 12:36:28, Edward Chron wrote:
> [...]
> > Extensibility using OOM debug options
> > -
> > What is needed is an extensible system to optionally configure
> > debug options as needed and to then dynamically enable and disable
> > them. Also for options that produce multiple lines of entry based
> > output, to configure which entries to print based on how much
> > memory they use (or optionally all the entries).
>
> With a patch this large and adding a lot of new stuff we need a more
> detailed usecases described I believe.

I guess it would make sense to explain the motivation for each OOM Debug
option I've sent separately. I see there are comments on the patches; I
will try to add more information there.

An overview would be that we've been collecting information on OOMs over
the last 12 years or so. These are from switches, other embedded devices,
and servers both large and small. We ask for feedback on what information
was helpful or could be helpful, and we try to add it to make root causing
issues easier.

These OOM debug options are some of the options we've created. I didn't
port all of them to 5.3 but these are representative. Our latest kernel
is a bit behind 5.3.

>
>
> [...]
>
> > Use of debugfs to allow dynamic controls
> > 
> > By providing a debugfs interface that allows options to be configured,
> > enabled and where appropriate to set a minimum size for selecting
> > entries to print, the output produced when an OOM event occurs can be
> > dynamically adjusted to produce as little or as much detail as needed
> > for a given system.
>
> Who is going to consume this information and why would that consumer be
> unreasonable to demand further maintenance of that information in future
> releases? In other words debugfs is not considered a stableAPI which is
> OK here but the side effect of any change to these files results in user
> visible behavior and we consider that more or less a stable as long as
> there are consumers.
>
> > OOM debug options can be added to the base code as needed.
> >
> > Currently we have the following OOM debug options defined:
> >
> > * System State Summary
> >   
> >   One line of output that includes:
> >   - Uptime (days, hour, minutes, seconds)
>
> We do have timestamps in the log so why is this needed?


Here is how an OOM report looks when we get it to look at:

Aug 26 09:06:34 coronado kernel: oomprocs invoked oom-killer:
gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0,
oom_score_adj=1000
Aug 26 09:06:34 coronado kernel: CPU: 1 PID: 2795 Comm: oomprocs Not
tainted 5.3.0-rc6+ #33
Aug 26 09:06:34 coronado kernel: Hardware name: Compulab Ltd.
IPC3/IPC3, BIOS 5.12_IPC3K.PRD.0.25.7 08/09/2018

This shows the date and time, not the time of the last boot. The
/var/log/messages output is what we often have to look at, not raw
dmesg output.

>
>
> >   - Number CPUs
> >   - Machine Type
> >   - Node name
> >   - Domain name
>
> why are these needed? That is a static information that doesn't really
> influence the OOM situation.


Sorry if a few of the items overlap what OOM already prints.
We've been printing a lot of this information since 2.6.38, and OOM
reporting has been updated since then.

We're updating our 4.19 system to have the latest OOM Report format.
This was the 5.0 patch "Reorg the OOM report in the dump header".
We are also backporting Shakeel's 5.3 patch to refactor dump tasks for
memcg OOMs. We're testing those backports right now, in fact.

We can probably get rid of some of the information we have, but I
haven't had a chance yet. Hopefully we can do it as part of sending
some code upstream.

>
>
> >   - Kernel Release
> >   - Kernel Version
>
> part of the oom report
>
> >
> >   Example output when configured and enabled:
> >
> > Jul 27 10:56:46 yoursystem kernel: System Uptime:0 days 00:17:27 CPUs:4 
> > Machine:x86_64 Node:yoursystem Domain:localdomain Kernel Release:5.3.0-rc2+ 
> > Version: #49 SMP Mon Jul 27 10:35:32 PDT 2019
> >
> > * Tasks Summary
> >   -
> >   One line of output that includes:
> >   - Number of Threads
> >   - Number of processes
> >   - Forks since boot
> >   - Processes that are runnable
> >   - Processes that are in iowait
>
> We do have sysrq+t for this kind of information. Why do we need to
> duplicate it?

Unfortunately, we can't log in to every customer system, or even every
system of our own, and do a sysrq+t after each OOM.
You could scan for OOMs and have a script do it, but if you do a sysrq+t
after an OOM event you'll get different results.
I'd rather have the runnable and iowait counts during the OOM event, not after.
Computers are so darn fast; free up some memory and things can look a
lot different.
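
For what it's worth, the counters in our Tasks Summary line are ones the
kernel already maintains, so capturing them at OOM time is cheap. A rough
sketch only (the helper name is made up, not the code in the patches):

/* Illustrative only: emit the one-line task summary from counters the
 * scheduler and fork path already keep up to date. */
#include <linux/sched/stat.h>
#include <linux/printk.h>

static void oom_dbg_print_tasks_summary(void)
{
	pr_info("Threads:%d Processes:%d forks_since_boot:%lu procs_runable:%lu procs_iowait:%lu\n",
		nr_threads, nr_processes(), total_forks,
		nr_running(), nr_iowait());
}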

We've seen crond fork and hang and gradually create thousands of
processes, and all sorts of other unintended fork bombs.
On some systems we can't print all of the process information as we've

Re: [PATCH 00/10] OOM Debug print selection and additional information

2019-08-27 Thread Qian Cai



> On Aug 27, 2019, at 8:23 PM, Edward Chron  wrote:
> 
> 
> 
> On Tue, Aug 27, 2019 at 5:40 AM Qian Cai  wrote:
> On Mon, 2019-08-26 at 12:36 -0700, Edward Chron wrote:
> > This patch series provides code that works as a debug option through
> > debugfs to provide additional controls to limit how much information
> > gets printed when an OOM event occurs and or optionally print additional
> > information about slab usage, vmalloc allocations, user process memory
> > usage, the number of processes / tasks and some summary information
> > about these tasks (number runable, i/o wait), system information
> > (#CPUs, Kernel Version and other useful state of the system),
> > ARP and ND Cache entry information.
> > 
> > Linux OOM can optionally provide a lot of information, what's missing?
> > --
> > Linux provides a variety of detailed information when an OOM event occurs
> > but has limited options to control how much output is produced. The
> > system related information is produced unconditionally and limited per
> > user process information is produced as a default enabled option. The
> > per user process information may be disabled.
> > 
> > Slab usage information was recently added and is output only if slab
> > usage exceeds user memory usage.
> > 
> > Many OOM events are due to user application memory usage sometimes in
> > combination with the use of kernel resource usage that exceeds what is
> > expected memory usage. Detailed information about how memory was being
> > used when the event occurred may be required to identify the root cause
> > of the OOM event.
> > 
> > However, some environments are very large and printing all of the
> > information about processes, slabs and or vmalloc allocations may
> > not be feasible. For other environments printing as much information
> > about these as possible may be needed to root cause OOM events.
> > 
> 
> For more in-depth analysis of OOM events, people could use kdump to save a
> vmcore by setting "panic_on_oom", and then use the crash utility to analysis 
> the
>  vmcore which contains pretty much all the information you need.
> 
> Certainly, this is the ideal. A full system dump would give you the maximum 
> amount of
> information. 
> 
> Unfortunately some environments may lack space to store the dump,

Kdump usually also supports dumping to a remote target via NFS, SSH, etc.

> let alone the time to dump the storage contents and restart the system. Some

There is also “makedumpfile”, which can compress and filter unwanted memory to
reduce the vmcore size and speed up the dumping process by utilizing multiple
threads.

> systems can take many minutes to fully boot up, to reset and reinitialize all 
> the
> devices. So unfortunately this is not always an option, and we need an OOM 
> Report.

I am not sure how the system needing some minutes to reboot would be relevant
to the discussion here. The idea is to save a vmcore, and it can be analyzed
offline, even on another system, as long as it has a matching “vmlinux”.




Re: [PATCH 00/10] OOM Debug print selection and additional information

2019-08-27 Thread Qian Cai
On Mon, 2019-08-26 at 12:36 -0700, Edward Chron wrote:
> This patch series provides code that works as a debug option through
> debugfs to provide additional controls to limit how much information
> gets printed when an OOM event occurs and or optionally print additional
> information about slab usage, vmalloc allocations, user process memory
> usage, the number of processes / tasks and some summary information
> about these tasks (number runable, i/o wait), system information
> (#CPUs, Kernel Version and other useful state of the system),
> ARP and ND Cache entry information.
> 
> Linux OOM can optionally provide a lot of information, what's missing?
> --
> Linux provides a variety of detailed information when an OOM event occurs
> but has limited options to control how much output is produced. The
> system related information is produced unconditionally and limited per
> user process information is produced as a default enabled option. The
> per user process information may be disabled.
> 
> Slab usage information was recently added and is output only if slab
> usage exceeds user memory usage.
> 
> Many OOM events are due to user application memory usage sometimes in
> combination with the use of kernel resource usage that exceeds what is
> expected memory usage. Detailed information about how memory was being
> used when the event occurred may be required to identify the root cause
> of the OOM event.
> 
> However, some environments are very large and printing all of the
> information about processes, slabs and or vmalloc allocations may
> not be feasible. For other environments printing as much information
> about these as possible may be needed to root cause OOM events.
> 

For more in-depth analysis of OOM events, people could use kdump to save a
vmcore by setting "panic_on_oom", and then use the crash utility to analyze
the vmcore, which contains pretty much all the information you need.

The downside of that approach is that it is probably only practical for
enterprise use cases, where kdump/crash is tested properly on
enterprise-level distros. The combo is more often broken for developers on
consumer distros, because kdump/crash can be affected by many kernel
subsystems and tends to break fairly quickly where community testing is
pretty much light.


Re: [PATCH 00/10] OOM Debug print selection and additional information

2019-08-27 Thread Michal Hocko
On Tue 27-08-19 19:10:18, Tetsuo Handa wrote:
> On 2019/08/27 16:15, Michal Hocko wrote:
> > All that being said, I do not think this is something we want to merge
> > without a really _strong_ usecase to back it.
> 
> Like the sender's domain "arista.com" suggests, some of the information is
> geared towards networking devices, and the ability to report OOM information
> in a way suitable for automatic recording/analyzing (e.g. without using a
> shell prompt, let alone manually typing SysRq commands) would be convenient
> for unattended devices.

Why can't the remote end of the logging identify the host? It has to
connect somewhere anyway, right? I also assume that a log collector
already stores each log with a host id of some form.

-- 
Michal Hocko
SUSE Labs


Re: [PATCH 00/10] OOM Debug print selection and additional information

2019-08-27 Thread Tetsuo Handa
On 2019/08/27 16:15, Michal Hocko wrote:
> All that being said, I do not think this is something we want to merge
> without a really _strong_ usecase to back it.

Like the sender's domain "arista.com" suggests, some of the information is
geared towards networking devices, and the ability to report OOM information
in a way suitable for automatic recording/analyzing (e.g. without using a
shell prompt, let alone manually typing SysRq commands) would be convenient
for unattended devices. We have only one OOM killer implementation, and the
format/data are hard-coded. If we can make the OOM killer modular, Edward
would be able to use it.



Re: [PATCH 00/10] OOM Debug print selection and additional information

2019-08-27 Thread Michal Hocko
On Mon 26-08-19 12:36:28, Edward Chron wrote:
[...]
> Extensibility using OOM debug options
> -
> What is needed is an extensible system to optionally configure
> debug options as needed and to then dynamically enable and disable
> them. Also for options that produce multiple lines of entry based
> output, to configure which entries to print based on how much
> memory they use (or optionally all the entries).

With a patch this large, adding a lot of new stuff, we need more detailed
usecases described, I believe.

[...]

> Use of debugfs to allow dynamic controls
> 
> By providing a debugfs interface that allows options to be configured,
> enabled and where appropriate to set a minimum size for selecting
> entries to print, the output produced when an OOM event occurs can be
> dynamically adjusted to produce as little or as much detail as needed
> for a given system.

Who is going to consume this information, and why would it be unreasonable
for that consumer to demand further maintenance of that information in
future releases? In other words, debugfs is not considered a stable API,
which is OK here, but the side effect is that any change to these files
results in user-visible behavior, and we consider that more or less stable
as long as there are consumers.

> OOM debug options can be added to the base code as needed.
> 
> Currently we have the following OOM debug options defined:
> 
> * System State Summary
>   
>   One line of output that includes:
>   - Uptime (days, hour, minutes, seconds)

We do have timestamps in the log so why is this needed?

>   - Number CPUs
>   - Machine Type
>   - Node name
>   - Domain name

Why are these needed? That is static information that doesn't really
influence the OOM situation.

>   - Kernel Release
>   - Kernel Version

part of the oom report
> 
>   Example output when configured and enabled:
> 
> Jul 27 10:56:46 yoursystem kernel: System Uptime:0 days 00:17:27 CPUs:4 
> Machine:x86_64 Node:yoursystem Domain:localdomain Kernel Release:5.3.0-rc2+ 
> Version: #49 SMP Mon Jul 27 10:35:32 PDT 2019
> 
> * Tasks Summary
>   -
>   One line of output that includes:
>   - Number of Threads
>   - Number of processes
>   - Forks since boot
>   - Processes that are runnable
>   - Processes that are in iowait

We do have sysrq+t for this kind of information. Why do we need to
duplicate it?

>   Example output when configured and enabled:
> 
> Jul 22 15:20:57 yoursystem kernel: Threads:530 Processes:279 
> forks_since_boot:2786 procs_runable:2 procs_iowait:0
> 
> * ARP Table and/or Neighbour Discovery Table Summary
>   --
>   One line of output each for ARP and ND that includes:
>   - Table name
>   - Table size (max # entries)
>   - Key Length
>   - Entry Size
>   - Number of Entries
>   - Last Flush (in seconds)
>   - hash grows
>   - entry allocations
>   - entry destroys
>   - Number lookups
>   - Number of lookup hits
>   - Resolution failures
>   - Garbage Collection Forced Runs
>   - Table Full
>   - Proxy Queue Length
> 
>   Example output when configured and enabled (for both):
> 
> ... kernel: neighbour: Table: arp_tbl size:   256 keyLen:  4 entrySize: 360 
> entries: 9 lastFlush:  1721s hGrows: 1 allocs: 9 destroys: 0 
> lookups:   204 hits:   199 resFailed:38 gcRuns/Forced: 111 /  0 tblFull:  
> 0 proxyQlen:  0
> 
> ... kernel: neighbour: Table:  nd_tbl size:   128 keyLen: 16 entrySize: 368 
> entries: 6 lastFlush:  1720s hGrows: 0 allocs: 7 destroys: 1 
> lookups: 0 hits: 0 resFailed: 0 gcRuns/Forced: 110 /  0 tblFull:  
> 0 proxyQlen:  0

Again, why is this needed particularly for the OOM event? I do
understand this might be useful system health diagnostic information,
but how does this contribute to the OOM?

> * Add Select Slabs Print
>   --
>   Allow select slab entries (based on a minimum size) to be printed.
>   Minimum size is specified as a percentage of the total RAM memory
>   in tenths of a percent, consistent with existing OOM process scoring.
>   Valid values are specified from 0 to 1000 where 0 prints all slab
>   entries (all slabs that have at least one slab object in use) up
>   to 1000 which would require a slab to use 100% of memory which can't
>   happen so in that case only summary information is printed.
> 
>   The first line of output is the standard Linux output header for
>   OOM printed Slab entries. This header looks like this:
> 
> Aug  6 09:37:21 egc103 yourserver: Unreclaimable slab info:
> 
>   The output is existing slab entry memory usage limited such that only
>   entries equal to or larger than the minimum size are printed.
>   Empty slabs (no slab entries in slabs in use) are never printed.
> 
>   Additional output consists of summary information that is printed
>   at the end of the output. This summary information 

[PATCH 00/10] OOM Debug print selection and additional information

2019-08-26 Thread Edward Chron
This patch series provides code that works as a debug option through
debugfs, adding controls to limit how much information gets printed when
an OOM event occurs and/or to optionally print additional information
about slab usage, vmalloc allocations, user process memory usage, the
number of processes / tasks and some summary information about these
tasks (number runnable, i/o wait), system information (#CPUs, Kernel
Version and other useful state of the system), and ARP and ND Cache
entry information.

Linux OOM can optionally provide a lot of information, what's missing?
--
Linux provides a variety of detailed information when an OOM event occurs
but has limited options to control how much output is produced. The
system-related information is produced unconditionally, and limited per
user process information is produced as a default-enabled option. The
per user process information may be disabled.

Slab usage information was recently added and is output only if slab
usage exceeds user memory usage.

Many OOM events are due to user application memory usage, sometimes in
combination with kernel resource usage, that exceeds expected memory
usage. Detailed information about how memory was being used when the
event occurred may be required to identify the root cause of the OOM
event.

However, some environments are very large and printing all of the
information about processes, slabs and/or vmalloc allocations may
not be feasible. For other environments printing as much information
about these as possible may be needed to root cause OOM events.

Extensibility using OOM debug options
-
What is needed is an extensible system to optionally configure
debug options as needed and to then dynamically enable and disable
them. Also, for options that produce multiple lines of entry-based
output, it should be possible to configure which entries to print
based on how much memory they use (or optionally to print all the
entries).

Limiting print entry output based on object size

To limit output, a fixed object size could be used, such as:
vmallocs that use more than 1MB, slabs that are using more than
512KB, or processes using 16MB or more of memory. Such an approach
is quite reasonable.

Using OOM's memory metrics to limit printing based on entry size

However, the current OOM implementation, which has been in use for
almost a decade, scores in units of 1/10 of a percent of memory. This
methodology scales well as memory sizes increase. If you limit the
objects you examine to those using at least 0.1% of memory you may
still get a large number of objects, but you avoid printing those
using a relatively small amount of memory.

Further options that allow limiting output based on object size
can have the minimum size set to zero. In this case objects
that use even a small amount of memory will be printed.
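
As a sketch of how such a size filter might look (illustrative only; the
helper name and calling convention here are hypothetical, not necessarily
what the patches use):

/* Illustrative only: select an entry for printing when it uses at least
 * "min_size" tenths of a percent of total RAM, the same 0..1000 scale
 * used for OOM badness scoring. A min_size of 0 selects every in-use
 * entry. */
#include <linux/mm.h>

static bool oom_dbg_entry_selected(unsigned long entry_pages,
				   unsigned int min_size)
{
	return entry_pages * 1000 / totalram_pages() >= min_size;
}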

Use of debugfs to allow dynamic controls

By providing a debugfs interface that allows options to be configured and
enabled, and where appropriate to set a minimum size for selecting
entries to print, the output produced when an OOM event occurs can be
dynamically adjusted to produce as little or as much detail as needed
for a given system.
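
For instance, a pair of knobs for one option could be wired up roughly like
this (illustrative only; the directory, file and variable names are
hypothetical, not necessarily those used by the patches):

/* Illustrative only: an enable flag plus a minimum-size knob for one
 * debug option, exposed under /sys/kernel/debug/oom_debug/. */
#include <linux/debugfs.h>
#include <linux/init.h>
#include <linux/types.h>

static bool oom_dbg_slab_print_enabled;
static u32 oom_dbg_slab_print_min_size;	/* tenths of a percent of RAM */

static int __init oom_dbg_debugfs_init(void)
{
	struct dentry *dir = debugfs_create_dir("oom_debug", NULL);

	debugfs_create_bool("slab_print_enabled", 0644, dir,
			    &oom_dbg_slab_print_enabled);
	debugfs_create_u32("slab_print_min_size", 0644, dir,
			   &oom_dbg_slab_print_min_size);
	return 0;
}
late_initcall(oom_dbg_debugfs_init);

Writing to these files from a shell or a management agent then changes what
the next OOM event prints, with no rebuild or reboot needed.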

OOM debug options can be added to the base code as needed.

Currently we have the following OOM debug options defined:

* System State Summary
  
  One line of output that includes:
  - Uptime (days, hour, minutes, seconds)
  - Number CPUs
  - Machine Type
  - Node name
  - Domain name
  - Kernel Release
  - Kernel Version

  Example output when configured and enabled:

Jul 27 10:56:46 yoursystem kernel: System Uptime:0 days 00:17:27 CPUs:4 
Machine:x86_64 Node:yoursystem Domain:localdomain Kernel Release:5.3.0-rc2+ 
Version: #49 SMP Mon Jul 27 10:35:32 PDT 2019
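
Most of this line can be assembled from state the kernel already tracks; a
rough sketch only (the helper name is hypothetical, not the code in the
patches):

/* Illustrative only: build the one-line system state summary from
 * init_utsname(), the online CPU count and the boottime clock. */
#include <linux/utsname.h>
#include <linux/timekeeping.h>
#include <linux/cpumask.h>
#include <linux/printk.h>

static void oom_dbg_print_system_summary(void)
{
	struct new_utsname *uts = init_utsname();
	struct timespec64 up;
	u64 secs;

	ktime_get_boottime_ts64(&up);
	secs = up.tv_sec;
	pr_info("System Uptime:%llu days %02llu:%02llu:%02llu CPUs:%u Machine:%s Node:%s Domain:%s Kernel Release:%s Version:%s\n",
		secs / 86400, (secs % 86400) / 3600, (secs % 3600) / 60,
		secs % 60, num_online_cpus(), uts->machine, uts->nodename,
		uts->domainname, uts->release, uts->version);
}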

* Tasks Summary
  -
  One line of output that includes:
  - Number of Threads
  - Number of processes
  - Forks since boot
  - Processes that are runnable
  - Processes that are in iowait

  Example output when configured and enabled:

Jul 22 15:20:57 yoursystem kernel: Threads:530 Processes:279 
forks_since_boot:2786 procs_runable:2 procs_iowait:0

* ARP Table and/or Neighbour Discovery Table Summary
  --
  One line of output each for ARP and ND that includes:
  - Table name
  - Table size (max # entries)
  - Key Length
  - Entry Size
  - Number of Entries
  - Last Flush (in seconds)
  - hash grows
  - entry allocations
  - entry destroys
  - Number lookups
  - Number of lookup hits
  - Resolution failures
  - Garbage Collection Forced Runs
  - Table Full
  - Proxy Queue Length

  Example output when configured and enabled (for both):

... kernel: neighbour: Table: arp_tbl size:   256 keyLen:  4 entrySize: 360 
entries: 9