Re: How to make warn_alloc() reliable?

2016-10-20 Thread Michal Hocko
On Thu 20-10-16 21:07:49, Tetsuo Handa wrote:
[...]
> By the way, regarding the "making the direct reclaim path more deterministic"
> part, I wish that we can
> 
>   (1) introduce phased watermarks which vary based on the stage of the
>   reclaim operation (e.g. watermark_lower()/watermark_higher(), which
>   resemble preempt_disable()/preempt_enable() but are propagated to
>   other threads when delegating operations needed for reclaim to them).
> 
>   (2) introduce dedicated kernel threads which perform only a specific
>   reclaim operation, using watermarks propagated from other threads
>   which perform different reclaim operations.
> 
>   (3) remove direct reclaim, which bothers callers with managing the
>   correct GFP_NOIO / GFP_NOFS / GFP_KERNEL distinction. Then, normal
>   ___GFP_DIRECT_RECLAIM callers can simply do
>   wait_event(get_page_from_freelist() succeeds) rather than polling
>   with a complicated short sleep. This will significantly save CPU
>   resources (especially when oom_lock is held) which are currently
>   wasted by multiple concurrent direct reclaimers.

As always, you are free to come up with patches with the proper
justification and convince people that those steps will help both the
regular case as well as the cases you are bothered by.
-- 
Michal Hocko
SUSE Labs


Re: How to make warn_alloc() reliable?

2016-10-20 Thread Tetsuo Handa
Michal Hocko wrote:
> On Wed 19-10-16 20:27:53, Tetsuo Handa wrote:
> [...]
> > What I'm talking about is "why don't you stop playing whack-a-mole games
> > with missing warn_alloc() calls". I don't blame you for not having a good
> > idea, but I blame you for not having a reliable warn_alloc() mechanism.
> 
> Look, it seems pretty clear that our priorities and views are quite
> different. While I believe that we should solve real issues in a
> reliable and robust way, you seem to want as much reporting as
> possible. I do agree that reporting is an important part of debugging
> problems, but as your previous attempts at the allocation watchdog show,
> proper and bulletproof reporting requires state tracking and is in
> general too complex for something that doesn't happen on most properly
> configured systems. Maybe there are other ways, but my time is better
> spent on something more useful - like making the direct reclaim path
> more deterministic without any unbound loops.

Properly configured systems should not be bothered by low memory situations,
but there are systems which are. It is pointless to cite "properly
configured systems" as a reason not to add a watchdog. It is administrators
who decide whether to use a watchdog.

> 
> So let's agree to disagree about the importance of warn_alloc
> reliability. I see it as an improvement which doesn't really have to be
> perfect.

I don't expect kmallocwd alone to be perfect. I expect kmallocwd to serve
as a hook. For example, it will be possible to turn on collecting perf data
when kmallocwd finds a stalling thread and turn it off when kmallocwd finds
none. Since the necessary information is stored in the task struct, it will
be easy to include it in the perf data. Likewise, it will be easy to
extract it using a script for /usr/bin/crash when an administrator has
captured a vmcore image of a stalling KVM guest. Sending vmcore images
to support centers is difficult for file size and security reasons.
It is nice if we can get a clue by reading the task list.

But warn_alloc() can't serve as a hook. I see kmallocwd as an improvement
which doesn't really have to be perfect.



By the way, regarding the "making the direct reclaim path more deterministic"
part, I wish that we can

  (1) introduce phased watermarks which vary based on the stage of the
  reclaim operation (e.g. watermark_lower()/watermark_higher(), which
  resemble preempt_disable()/preempt_enable() but are propagated to
  other threads when delegating operations needed for reclaim to them).

  (2) introduce dedicated kernel threads which perform only a specific
  reclaim operation, using watermarks propagated from other threads
  which perform different reclaim operations.

  (3) remove direct reclaim, which bothers callers with managing the
  correct GFP_NOIO / GFP_NOFS / GFP_KERNEL distinction. Then, normal
  ___GFP_DIRECT_RECLAIM callers can simply do
  wait_event(get_page_from_freelist() succeeds) rather than polling
  with a complicated short sleep. This will significantly save CPU
  resources (especially when oom_lock is held) which are currently
  wasted by multiple concurrent direct reclaimers.


Re: How to make warn_alloc() reliable?

2016-10-19 Thread Michal Hocko
On Wed 19-10-16 20:27:53, Tetsuo Handa wrote:
[...]
> What I'm talking about is "why don't you stop playing whack-a-mole games
> with missing warn_alloc() calls". I don't blame you for not having a good
> idea, but I blame you for not having a reliable warn_alloc() mechanism.

Look, it seems pretty clear that our priorities and views are quite
different. While I believe that we should solve real issues in a
reliable and robust way, you seem to want as much reporting as
possible. I do agree that reporting is an important part of debugging
problems, but as your previous attempts at the allocation watchdog show,
proper and bulletproof reporting requires state tracking and is in
general too complex for something that doesn't happen on most properly
configured systems. Maybe there are other ways, but my time is better
spent on something more useful - like making the direct reclaim path
more deterministic without any unbound loops.

So let's agree to disagree about the importance of warn_alloc
reliability. I see it as an improvement which doesn't really have to be
perfect.
-- 
Michal Hocko
SUSE Labs


Re: How to make warn_alloc() reliable?

2016-10-19 Thread Tetsuo Handa
Michal Hocko wrote:
> This is not about warn_alloc reliability but more about
> too_many_isolated waiting for an unbounded amount of time. And that
> should be fixed. I do not have a good idea how right now.

I'm not talking about only the too_many_isolated() case. If I were talking
about this specific case, I would have proposed leaving this loop using a
timeout. For example, where is the guarantee that the current thread never
gets stuck at shrink_inactive_list() after leaving this too_many_isolated()
loop?

I think the ordinary Linux user's perception of memory management is
"Linux reclaims memory when needed; thus, it is normal that the MemFree:
field of /proc/meminfo is small" and "Linux invokes the OOM killer if a
memory allocation request can't make forward progress". However, we know
that "Linux may not be able to invoke the OOM killer even if a memory
allocation request can't make forward progress". You suddenly bring up (or
admit to) implications/limitations/problems most Linux users do not know
about. That's painful for someone like me who went to a lot of trouble to
get a clue at a support center.

When we were talking off-list about CVE-2016-2847, your response had been
"Your machine is DoSed already" until we noticed the "too small to fail"
memory-allocation rule. If I had not kept examining the problem until I
made you angry, we would not have arrived at the correct answer. I don't
like your optimistic "Fix it if you can trigger it" approach, which will
never give users (and troubleshooting staff at support centers) proof. I
want an "expose what Michal Hocko is not aware of or does not care about"
mechanism.

What I'm talking about is "why don't you stop playing whack-a-mole games
with missing warn_alloc() calls". I don't blame you for not having a good
idea, but I blame you for not having a reliable warn_alloc() mechanism.


Re: How to make warn_alloc() reliable?

2016-10-18 Thread Michal Hocko
On Tue 18-10-16 20:04:20, Tetsuo Handa wrote:
[...]
> @@ -1697,11 +1697,25 @@ static bool inactive_reclaimable_pages(struct lruvec *lruvec,
>   int file = is_file_lru(lru);
>   struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>   struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> + unsigned long wait_start = jiffies;
> + unsigned int wait_timeout = 10 * HZ;
> + long last_diff = 0;
> + long diff;
>  
>   if (!inactive_reclaimable_pages(lruvec, sc, lru))
>   return 0;
>  
> - while (unlikely(too_many_isolated(pgdat, file, sc))) {
> + while (unlikely((diff = too_many_isolated(pgdat, file, sc)) > 0)) {
> + if (diff < last_diff) {
> + wait_start = jiffies;
> + wait_timeout = 10 * HZ;
> + } else if (time_after(jiffies, wait_start + wait_timeout)) {
> + warn_alloc(sc->gfp_mask,
> +"shrink_inactive_list() stalls for %ums",
> +jiffies_to_msecs(jiffies - wait_start));
> + wait_timeout += 10 * HZ;
> + }
> + last_diff = diff;
>   congestion_wait(BLK_RW_ASYNC, HZ/10);
>  
>   /* We are about to die and free our memory. Return now. */
> --
[...]
> So, how can we make warn_alloc() reliable?

This is not about warn_alloc reliability but more about
too_many_isolated waiting for an unbounded amount of time. And that
should be fixed. I do not have a good idea how right now.
-- 
Michal Hocko
SUSE Labs