Re: How to make warn_alloc() reliable?
On Thu 20-10-16 21:07:49, Tetsuo Handa wrote:
[...]
> By the way, regarding the "making the direct reclaim path more
> deterministic" part, I wish that we can
>
> (1) introduce phased watermarks which vary based on the stage of the
>     reclaim operation (e.g. watermark_lower()/watermark_higher(), which
>     resemble preempt_disable()/preempt_enable() but are propagated to
>     other threads when delegating operations needed for reclaim to other
>     threads).
>
> (2) introduce dedicated kernel threads which perform only a specific
>     reclaim operation, using the watermark propagated from other threads
>     which perform a different reclaim operation.
>
> (3) remove direct reclaim, which bothers callers with managing the
>     correct GFP_NOIO / GFP_NOFS / GFP_KERNEL distinction. Then, normal
>     ___GFP_DIRECT_RECLAIM callers can simply wait for
>     wait_event(get_pages_from_freelist() succeeds) rather than polling
>     with a complicated short sleep. This will significantly save CPU
>     resources (especially while oom_lock is held) which are currently
>     wasted by multiple concurrent direct reclaimers.

As always, you are free to come up with patches with the proper
justification and convince people that those steps will help both the
regular case as well as the cases you are bothered by.
--
Michal Hocko
SUSE Labs
Re: How to make warn_alloc() reliable?
Michal Hocko wrote:
> On Wed 19-10-16 20:27:53, Tetsuo Handa wrote:
> [...]
> > What I'm talking about is "why don't you stop playing whack-a-mole games
> > with missing warn_alloc() calls". I don't blame you for not having a good
> > idea, but I blame you for not having a reliable warn_alloc() mechanism.
>
> Look, it seems pretty clear that our priorities and views are quite
> different. While I believe that we should solve real issues in a
> reliable and robust way, you seem to want as much reporting as
> possible. I do agree that reporting is an important part of debugging
> problems, but as your previous attempts at the allocation watchdog
> show, proper and bulletproof reporting requires state tracking and is
> in general too complex for something that doesn't happen on most
> properly configured systems. Maybe there are other ways, but my time
> is better spent on something more useful - like making the direct
> reclaim path more deterministic without any unbounded loops.

Properly configured systems should not be bothered by low memory
situations. But there are systems which are bothered by low memory
situations, and it is pointless to cite "properly configured systems" as
a reason not to add a watchdog. It is administrators who decide whether
to use a watchdog.

> So let's agree to disagree about the importance of a reliable
> warn_alloc. I see it as an improvement which doesn't really have to be
> perfect.

I don't expect kmallocwd alone to be perfect. I expect kmallocwd to
serve as a hook. For example, it will be possible to turn on collecting
perf data when kmallocwd finds a stalling thread and to turn it off when
kmallocwd finds none. Since the necessary information is stored in the
task struct, it will be easy to include it in the perf data. Likewise,
it will be easy to extract it using a script for /usr/bin/crash when an
administrator has captured a vmcore image of a stalling KVM guest.
Sending vmcore images to support centers is difficult for file size and
security reasons. It would be nice if we could get a clue just by
reading the task list. But warn_alloc() can't serve as a hook. I see
kmallocwd as an improvement which doesn't really have to be perfect.

By the way, regarding the "making the direct reclaim path more
deterministic" part, I wish that we can

(1) introduce phased watermarks which vary based on the stage of the
    reclaim operation (e.g. watermark_lower()/watermark_higher(), which
    resemble preempt_disable()/preempt_enable() but are propagated to
    other threads when delegating operations needed for reclaim to other
    threads).

(2) introduce dedicated kernel threads which perform only a specific
    reclaim operation, using the watermark propagated from other threads
    which perform a different reclaim operation.

(3) remove direct reclaim, which bothers callers with managing the
    correct GFP_NOIO / GFP_NOFS / GFP_KERNEL distinction. Then, normal
    ___GFP_DIRECT_RECLAIM callers can simply wait for
    wait_event(get_pages_from_freelist() succeeds) rather than polling
    with a complicated short sleep. This will significantly save CPU
    resources (especially while oom_lock is held) which are currently
    wasted by multiple concurrent direct reclaimers.
Re: How to make warn_alloc() reliable?
On Wed 19-10-16 20:27:53, Tetsuo Handa wrote:
[...]
> What I'm talking about is "why don't you stop playing whack-a-mole games
> with missing warn_alloc() calls". I don't blame you for not having a good
> idea, but I blame you for not having a reliable warn_alloc() mechanism.

Look, it seems pretty clear that our priorities and views are quite
different. While I believe that we should solve real issues in a
reliable and robust way, you seem to want as much reporting as
possible. I do agree that reporting is an important part of debugging
problems, but as your previous attempts at the allocation watchdog show,
proper and bulletproof reporting requires state tracking and is in
general too complex for something that doesn't happen on most properly
configured systems. Maybe there are other ways, but my time is better
spent on something more useful - like making the direct reclaim path
more deterministic without any unbounded loops.

So let's agree to disagree about the importance of a reliable
warn_alloc. I see it as an improvement which doesn't really have to be
perfect.
--
Michal Hocko
SUSE Labs
Re: How to make warn_alloc() reliable?
Michal Hocko wrote:
> This is not about warn_alloc reliability but more about
> too_many_isolated waiting for an unbounded amount of time. And that
> should be fixed. I do not have a good idea how right now.

I'm not talking about only the too_many_isolated() case. If I were
talking about that specific case, I would have proposed leaving the loop
via a timeout. For example, where is the guarantee that the current
thread never gets stuck in shrink_inactive_list() after leaving the
too_many_isolated() loop?

I think an ordinary Linux user's perception of memory management is
"Linux reclaims memory when needed; thus it is normal that the MemFree:
field of /proc/meminfo is small" and "Linux invokes the OOM killer if a
memory allocation request can't make forward progress". However, we know
that "Linux may not be able to invoke the OOM killer even if a memory
allocation request can't make forward progress". You suddenly bring up
(or admit to) implications/limitations/problems most Linux users do not
know about. That's painful for me, who went to a lot of trouble to get
some clues at a support center.

When we were talking off-list about CVE-2016-2847, your response had
been "Your machine is DoSed already" until we noticed the "too small to
fail" memory-allocation rule. If I had not kept examining the problem
until I made you angry, we would not have arrived at the correct answer.
I don't like your optimistic "Fix it if you can trigger it" approach,
which will never give users (and troubleshooting staff at support
centers) a proof. I want a mechanism that exposes what Michal Hocko is
not aware of or does not care about.

What I'm talking about is "why don't you stop playing whack-a-mole games
with missing warn_alloc() calls". I don't blame you for not having a
good idea, but I blame you for not having a reliable warn_alloc()
mechanism.
Re: How to make warn_alloc() reliable?
On Tue 18-10-16 20:04:20, Tetsuo Handa wrote:
[...]
> @@ -1697,11 +1697,25 @@ static bool inactive_reclaimable_pages(struct lruvec *lruvec,
>  	int file = is_file_lru(lru);
>  	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>  	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> +	unsigned long wait_start = jiffies;
> +	unsigned int wait_timeout = 10 * HZ;
> +	long last_diff = 0;
> +	long diff;
>
>  	if (!inactive_reclaimable_pages(lruvec, sc, lru))
>  		return 0;
>
> -	while (unlikely(too_many_isolated(pgdat, file, sc))) {
> +	while (unlikely((diff = too_many_isolated(pgdat, file, sc)) > 0)) {
> +		if (diff < last_diff) {
> +			wait_start = jiffies;
> +			wait_timeout = 10 * HZ;
> +		} else if (time_after(jiffies, wait_start + wait_timeout)) {
> +			warn_alloc(sc->gfp_mask,
> +				   "shrink_inactive_list() stalls for %ums",
> +				   jiffies_to_msecs(jiffies - wait_start));
> +			wait_timeout += 10 * HZ;
> +		}
> +		last_diff = diff;
>  		congestion_wait(BLK_RW_ASYNC, HZ/10);
>
>  		/* We are about to die and free our memory. Return now. */
[...]
> So, how can we make warn_alloc() reliable?

This is not about warn_alloc reliability but more about
too_many_isolated waiting for an unbounded amount of time. And that
should be fixed. I do not have a good idea how right now.
--
Michal Hocko
SUSE Labs