Re: [PATCH 4/8] xfs: use memalloc_nofs_{save,restore} instead of memalloc_noio*

2017-01-09 Thread Vlastimil Babka
On 01/06/2017 03:11 PM, Michal Hocko wrote:
> From: Michal Hocko <mho...@suse.com>
> 
> kmem_zalloc_large and _xfs_buf_map_pages use memalloc_noio_{save,restore}
> API to prevent from reclaim recursion into the fs because vmalloc can
> invoke unconditional GFP_KERNEL allocations and these functions might be
> called from the NOFS contexts. The memalloc_noio_save will enforce
> GFP_NOIO context which is even weaker than GFP_NOFS and that seems to be
> unnecessary. Let's use memalloc_nofs_{save,restore} instead as it should
> provide exactly what we need here - implicit GFP_NOFS context.
> 
> Changes since v1
> - s@memalloc_noio_restore@memalloc_nofs_restore@ in _xfs_buf_map_pages
>   as per Brian Foster
> 
> Signed-off-by: Michal Hocko <mho...@suse.com>

Not an xfs expert, but this seems correct.

Acked-by: Vlastimil Babka <vba...@suse.cz>

Nit below:

> ---
>  fs/xfs/kmem.c    | 10 +++++-----
>  fs/xfs/xfs_buf.c |  8 ++++----
>  2 files changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/xfs/kmem.c b/fs/xfs/kmem.c
> index a76a05dae96b..d69ed5e76621 100644
> --- a/fs/xfs/kmem.c
> +++ b/fs/xfs/kmem.c
> @@ -65,7 +65,7 @@ kmem_alloc(size_t size, xfs_km_flags_t flags)
>  void *
>  kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
>  {
> - unsigned noio_flag = 0;
> + unsigned nofs_flag = 0;
>   void*ptr;
>   gfp_t   lflags;
>  
> @@ -80,14 +80,14 @@ kmem_zalloc_large(size_t size, xfs_km_flags_t flags)
>* context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
>* the filesystem here and potentially deadlocking.

The comment above is now largely obsolete, or at minimum should be
changed to say PF_MEMALLOC_NOFS?

>*/
> - if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
> - noio_flag = memalloc_noio_save();
> + if (flags & KM_NOFS)
> + nofs_flag = memalloc_nofs_save();
>  
>   lflags = kmem_flags_convert(flags);
>   ptr = __vmalloc(size, lflags | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL);
>  
> - if ((current->flags & PF_MEMALLOC_NOFS) || (flags & KM_NOFS))
> - memalloc_noio_restore(noio_flag);
> + if (flags & KM_NOFS)
> + memalloc_nofs_restore(nofs_flag);
>  
>   return ptr;
>  }
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 7f0a01f7b592..8cb8dd4cdfd8 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -441,17 +441,17 @@ _xfs_buf_map_pages(
>   bp->b_addr = NULL;
>   } else {
>   int retried = 0;
> - unsigned noio_flag;
> + unsigned nofs_flag;
>  
>   /*
>* vm_map_ram() will allocate auxillary structures (e.g.
>* pagetables) with GFP_KERNEL, yet we are likely to be under
>* GFP_NOFS context here. Hence we need to tell memory reclaim
> -  * that we are in such a context via PF_MEMALLOC_NOIO to prevent
> +  * that we are in such a context via PF_MEMALLOC_NOFS to prevent
>* memory reclaim re-entering the filesystem here and
>* potentially deadlocking.
>*/
> - noio_flag = memalloc_noio_save();
> + nofs_flag = memalloc_nofs_save();
>   do {
>   bp->b_addr = vm_map_ram(bp->b_pages, bp->b_page_count,
>   -1, PAGE_KERNEL);
> @@ -459,7 +459,7 @@ _xfs_buf_map_pages(
>   break;
>   vm_unmap_aliases();
>   } while (retried++ <= 1);
> - memalloc_noio_restore(noio_flag);
> + memalloc_nofs_restore(nofs_flag);
>  
>   if (!bp->b_addr)
>   return -ENOMEM;
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 3/8] mm: introduce memalloc_nofs_{save,restore} API

2017-01-09 Thread Vlastimil Babka
On 01/09/2017 02:42 PM, Michal Hocko wrote:
> On Mon 09-01-17 14:04:21, Vlastimil Babka wrote:
> [...]
>>> +static inline unsigned int memalloc_nofs_save(void)
>>> +{
>>> +   unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
>>> +   current->flags |= PF_MEMALLOC_NOFS;
>>
>> So this is not new, as same goes for memalloc_noio_save, but I've
>> noticed that e.g. exit_signal() does tsk->flags |= PF_EXITING;
>> So is it possible that there's a r-m-w hazard here?
> 
> exit_signals operates on current and all task_struct::flags should be
> used only on the current.
> [...]

Ah, good to know.

> 
>>> @@ -3029,7 +3029,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct 
>>> mem_cgroup *memcg,
>>> int nid;
>>> struct scan_control sc = {
>>> .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
>>> -   .gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
>>> +   .gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |
>>
>> So this function didn't do memalloc_noio_flags() before? Is it a bug
>> that should be fixed separately or at least mentioned? Because that
>> looks like a functional change...
> 
> We didn't need it. Kmem charges are opt-in and currently all of them
> support GFP_IO. The LRU pages are not charged in NOIO context either.
> We need it now because there will be callers to charge GFP_KERNEL while
> being inside the NOFS scope.

I see.

> Now that you have opened this I have noticed that the code is wrong
> here because GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK would overwrite
> the removed GFP_FS. I guess it would be better and less error prone
> to move the current_gfp_context part into the direct reclaim entry -
> do_try_to_free_pages - and put the comment like this

Agreed; let's just scratch this in favor of your follow-up fix in the
next e-mail.

So, for the unchanged patch:

Acked-by: Vlastimil Babka <vba...@suse.cz>



Re: [PATCH 3/8] mm: introduce memalloc_nofs_{save,restore} API

2017-01-09 Thread Vlastimil Babka
On 01/06/2017 03:11 PM, Michal Hocko wrote:
> From: Michal Hocko 
> 
> GFP_NOFS context is used for the following 5 reasons currently
>   - to prevent from deadlocks when the lock held by the allocation
> context would be needed during the memory reclaim
>   - to prevent from stack overflows during the reclaim because
> the allocation is performed from a deep context already
>   - to prevent lockups when the allocation context depends on
> other reclaimers to make a forward progress indirectly
>   - just in case because this would be safe from the fs POV
>   - silence lockdep false positives
> 
> Unfortunately overuse of this allocation context brings some problems
> to the MM. Memory reclaim is much weaker (especially during heavy FS
> metadata workloads), OOM killer cannot be invoked because the MM layer
> doesn't have enough information about how much memory is freeable by the
> FS layer.
> 
> In many cases it is far from clear why the weaker context is even used
> and so it might be used unnecessarily. We would like to get rid of
> those as much as possible. One way to do that is to use the flag in
> scopes rather than isolated cases. Such a scope is declared when really
> necessary, tracked per task and all the allocation requests from within
> the context will simply inherit the GFP_NOFS semantic.
> 
> Not only this is easier to understand and maintain because there are
> much less problematic contexts than specific allocation requests, this
> also helps code paths where FS layer interacts with other layers (e.g.
> crypto, security modules, MM etc...) and there is no easy way to convey
> the allocation context between the layers.
> 
> Introduce memalloc_nofs_{save,restore} API to control the scope
> of GFP_NOFS allocation context. This is basically copying
> memalloc_noio_{save,restore} API we have for other restricted allocation
> context GFP_NOIO. The PF_MEMALLOC_NOFS flag already exists and it is
> just an alias for PF_FSTRANS which has been xfs specific until recently.
> There are no more PF_FSTRANS users anymore so let's just drop it.
> 
> PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
> implicitly same as PF_MEMALLOC_NOIO drops __GFP_IO. memalloc_noio_flags
> is renamed to current_gfp_context because it now cares about both
> PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts. Xfs code paths preserve
> their semantic. kmem_flags_convert() doesn't need to evaluate the flag
> anymore.
> 
> This patch shouldn't introduce any functional changes.
> 
> Let's hope that filesystems will drop direct GFP_NOFS (resp. ~__GFP_FS)
> usage as much as possible and only use a properly documented
> memalloc_nofs_{save,restore} checkpoints where they are appropriate.
> 
> Signed-off-by: Michal Hocko 


[...]

> +static inline unsigned int memalloc_nofs_save(void)
> +{
> + unsigned int flags = current->flags & PF_MEMALLOC_NOFS;
> + current->flags |= PF_MEMALLOC_NOFS;

So this is not new, as same goes for memalloc_noio_save, but I've
noticed that e.g. exit_signal() does tsk->flags |= PF_EXITING;
So is it possible that there's a r-m-w hazard here?

> + return flags;
> +}
> +
> +static inline void memalloc_nofs_restore(unsigned int flags)
> +{
> + current->flags = (current->flags & ~PF_MEMALLOC_NOFS) | flags;
> +}
> +
>  /* Per-process atomic flags. */
>  #define PFA_NO_NEW_PRIVS 0   /* May not gain new privileges. */
>  #define PFA_SPREAD_PAGE  1  /* Spread page cache over cpuset */

[...]

> @@ -3029,7 +3029,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct 
> mem_cgroup *memcg,
>   int nid;
>   struct scan_control sc = {
>   .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
> - .gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
> + .gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |

So this function didn't do memalloc_noio_flags() before? Is it a bug
that should be fixed separately or at least mentioned? Because that
looks like a functional change...

Thanks!

>   (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
>   .reclaim_idx = MAX_NR_ZONES - 1,
>   .target_mem_cgroup = memcg,
> @@ -3723,7 +3723,7 @@ static int __node_reclaim(struct pglist_data *pgdat, 
> gfp_t gfp_mask, unsigned in
>   int classzone_idx = gfp_zone(gfp_mask);
>   struct scan_control sc = {
>   .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
> - .gfp_mask = (gfp_mask = memalloc_noio_flags(gfp_mask)),
> + .gfp_mask = (gfp_mask = current_gfp_context(gfp_mask)),
>   .order = order,
>   .priority = NODE_RECLAIM_PRIORITY,
>   .may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
> 


Re: [PATCH 1/8] lockdep: allow to disable reclaim lockup detection

2017-01-09 Thread Vlastimil Babka
On 01/06/2017 03:11 PM, Michal Hocko wrote:
> From: Michal Hocko <mho...@suse.com>
> 
> The current implementation of the reclaim lockup detection can lead to
> false positives and those even happen and usually lead to tweak the
> code to silence the lockdep by using GFP_NOFS even though the context
> can use __GFP_FS just fine. See
> http://lkml.kernel.org/r/20160512080321.GA18496@dastard as an example.
> 
> =
> [ INFO: inconsistent lock state ]
> 4.5.0-rc2+ #4 Tainted: G   O
> -
> inconsistent {RECLAIM_FS-ON-R} -> {IN-RECLAIM_FS-W} usage.
> kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:
> 
> (_nondir_ilock_class){-+}, at: [] 
> xfs_ilock+0x177/0x200 [xfs]
> 
> {RECLAIM_FS-ON-R} state was registered at:
>   [] mark_held_locks+0x79/0xa0
>   [] lockdep_trace_alloc+0xb3/0x100
>   [] kmem_cache_alloc+0x33/0x230
>   [] kmem_zone_alloc+0x81/0x120 [xfs]
>   [] xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
>   [] __xfs_refcount_find_shared+0x75/0x580 [xfs]
>   [] xfs_refcount_find_shared+0x84/0xb0 [xfs]
>   [] xfs_getbmap+0x608/0x8c0 [xfs]
>   [] xfs_vn_fiemap+0xab/0xc0 [xfs]
>   [] do_vfs_ioctl+0x498/0x670
>   [] SyS_ioctl+0x79/0x90
>   [] entry_SYSCALL_64_fastpath+0x12/0x6f
> 
>CPU0
>
>   lock(_nondir_ilock_class);
>   
> lock(_nondir_ilock_class);
> 
>  *** DEADLOCK ***
> 
> 3 locks held by kswapd0/543:
> 
> stack backtrace:
> CPU: 0 PID: 543 Comm: kswapd0 Tainted: G   O4.5.0-rc2+ #4
> 
> Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
> 
>  82a34f10 88003aa078d0 813a14f9 88003d8551c0
>  88003aa07920 8110ec65  0001
>  8801 000b 0008 88003d855aa0
> Call Trace:
>  [] dump_stack+0x4b/0x72
>  [] print_usage_bug+0x215/0x240
>  [] mark_lock+0x1f5/0x660
>  [] ? print_shortest_lock_dependencies+0x1a0/0x1a0
>  [] __lock_acquire+0xa80/0x1e50
>  [] ? kmem_cache_alloc+0x15e/0x230
>  [] ? kmem_zone_alloc+0x81/0x120 [xfs]
>  [] lock_acquire+0xd8/0x1e0
>  [] ? xfs_ilock+0x177/0x200 [xfs]
>  [] ? xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
>  [] down_write_nested+0x5e/0xc0
>  [] ? xfs_ilock+0x177/0x200 [xfs]
>  [] xfs_ilock+0x177/0x200 [xfs]
>  [] xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
>  [] xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
>  [] evict+0xc5/0x190
>  [] dispose_list+0x39/0x60
>  [] prune_icache_sb+0x4b/0x60
>  [] super_cache_scan+0x14f/0x1a0
>  [] shrink_slab.part.63.constprop.79+0x1e9/0x4e0
>  [] shrink_zone+0x15e/0x170
>  [] kswapd+0x4f1/0xa80
>  [] ? zone_reclaim+0x230/0x230
>  [] kthread+0xf2/0x110
>  [] ? kthread_create_on_node+0x220/0x220
>  [] ret_from_fork+0x3f/0x70
>  [] ? kthread_create_on_node+0x220/0x220
> 
> To quote Dave:
> "
> Ignoring whether reflink should be doing anything or not, that's a
> "xfs_refcountbt_init_cursor() gets called both outside and inside
> transactions" lockdep false positive case. The problem here is
> lockdep has seen this allocation from within a transaction, hence a
> GFP_NOFS allocation, and now it's seeing it in a GFP_KERNEL context.
> Also note that we have an active reference to this inode.
> 
> So, because the reclaim annotations overload the interrupt level
> detections and it's seen the inode ilock been taken in reclaim
> ("interrupt") context, this triggers a reclaim context warning where
> it thinks it is unsafe to do this allocation in GFP_KERNEL context
> holding the inode ilock...
> "
> 
> This sounds like a fundamental problem of the reclaim lock detection.
> It is really impossible to annotate such a special usecase IMHO unless
> the reclaim lockup detection is reworked completely. Until then it
> is much better to provide a way to add "I know what I am doing flag"
> and mark problematic places. This would prevent from abusing GFP_NOFS
> flag which has a runtime effect even on configurations which have
> lockdep disabled.
> 
> Introduce __GFP_NOLOCKDEP flag which tells the lockdep gfp tracking to
> skip the current allocation request.
> 
> While we are at it also make sure that the radix tree doesn't
> accidentally override tags stored in the upper part of the gfp_mask.
> 
> Suggested-by: Peter Zijlstra <pet...@infradead.org>
> Acked-by: Peter Zijlstra (Intel) <pet...@infradead.org>
> Signed-off-by: Michal Hocko <mho...@suse.com>

Acked-by: Vlastimil Babka <vba...@suse.cz>



Re: [PATCH 2/8] xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS

2017-01-09 Thread Vlastimil Babka
On 01/06/2017 03:11 PM, Michal Hocko wrote:
> From: Michal Hocko <mho...@suse.com>
> 
> xfs has defined PF_FSTRANS to declare a scope GFP_NOFS semantic quite
> some time ago. We would like to make this concept more generic and use
> it for other filesystems as well. Let's start by giving the flag a
> more generic name PF_MEMALLOC_NOFS which is in line with an existing
> PF_MEMALLOC_NOIO already used for the same purpose for GFP_NOIO
> contexts. Replace all PF_FSTRANS usage from the xfs code in the first
> step before we introduce a full API for it as xfs uses the flag directly
> anyway.
> 
> This patch doesn't introduce any functional change.
> 
> Signed-off-by: Michal Hocko <mho...@suse.com>
> Reviewed-by: Brian Foster <bfos...@redhat.com>

Acked-by: Vlastimil Babka <vba...@suse.cz>

A nit:

> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -2320,6 +2320,8 @@ extern void thread_group_cputime_adjusted(struct 
> task_struct *p, cputime_t *ut,
>  #define PF_FREEZER_SKIP  0x4000  /* Freezer should not count it 
> as freezable */
>  #define PF_SUSPEND_TASK 0x8000  /* this thread called 
> freeze_processes and should not be frozen */
>  
> +#define PF_MEMALLOC_NOFS PF_FSTRANS  /* Transition to a more generic 
> GFP_NOFS scope semantic */

I don't see why this transition is needed, as there are already no users
of PF_FSTRANS after this patch. The next patch doesn't remove any more,
so this is just extra churn IMHO. But not a strong objection.

> +
>  /*
>   * Only the _current_ task can read/write to tsk->flags, but other
>   * tasks can access tsk->flags in readonly mode for example
> 



Re: [Bug 186671] New: OOM on system with just rsync running 32GB of ram 30GB of pagecache

2016-11-22 Thread Vlastimil Babka
On 11/22/2016 02:58 PM, E V wrote:
> System OOM'd several times last night with 4.8.10, I attached the
> page_owner output from a morning cat ~8 hours after OOM's to the
> bugzilla case, split and compressed to fit under the 5M attachment
> limit. Let me know if you need anything else.

Looks like for some reason, the stack saving produces garbage stacks
that only repeat save_stack_trace and save_stack functions :/

But judging from gfp flags and page flags, most pages seem to be
allocated with:

mask 0x2400840(GFP_NOFS|__GFP_NOFAIL)

and page flags:

0x26c(referenced|uptodate|lru|active)
or
0x200016c(referenced|uptodate|lru|active|owner_priv_1)
or
0x200086c(referenced|uptodate|lru|active|private)

Meanwhile, GFP_HIGHUSER_MOVABLE allocations (which I would expect on the lru) are less frequent.

Example:
> grep GFP_NOFS page_owner_after_af | wc -l
973596
> grep GFP_HIGHUSER_MOVABLE page_owner_after_af | wc -l
158879
> grep GFP_NOFAIL page_owner_after_af | wc -l
971442

grepping for btrfs shows that at least some stacks for NOFS/NOFAIL pages
imply it:
clear_state_bit+0x135/0x1c0 [btrfs]
or
add_delayed_tree_ref+0xbf/0x170 [btrfs]
or
__btrfs_map_block+0x6a8/0x1200 [btrfs]
or
btrfs_buffer_uptodate+0x48/0x70 [btrfs]
or
btrfs_set_path_blocking+0x34/0x60 [btrfs]

and some more variants.

So looks like the pages contain btrfs metadata, are on file lru and from
previous checks of /proc/kpagecount we know that they most likely have
page_count() == 0 but are not freed. Could btrfs guys provide some
insight here?

> On Fri, Nov 18, 2016 at 10:02 AM, E V <eliven...@gmail.com> wrote:
>> Yes, the short window between the stalls and the panic makes it
>> difficult to manually check much. I could setup a cron every 5 minutes
>> or so if you want. Also, I see the OOM's in 4.8, but it has yet to
>> panic on me. Where as 4.9rc has panic'd both times I've booted it, so
>> depending on what you want to look at it might be easier to
>> investigate on 4.8. Let me know, I can turn on a couple of the DEBUG
>> config's and build a new 4.8.8. Never looked into a netconsole or
>> serial console. I think just getting the system to use a higher res
>> console would be an improvement, but the OOM's seemed to be the root
>> cause of the panic so I haven't spent any time looking into that as of
>> yet,
>>
>> Thanks,
>> -Eli
>>
>> On Fri, Nov 18, 2016 at 6:54 AM, Tetsuo Handa
>> <penguin-ker...@i-love.sakura.ne.jp> wrote:
>>> On 2016/11/18 6:49, Vlastimil Babka wrote:
>>>> On 11/16/2016 02:39 PM, E V wrote:
>>>>> System panic'd overnight running 4.9rc5 & rsync. Attached a photo of
>>>>> the stack trace, and the 38 call traces in a 2 minute window shortly
>>>>> before, to the bugzilla case for those not on it's e-mail list:
>>>>>
>>>>> https://bugzilla.kernel.org/show_bug.cgi?id=186671
>>>>
>>>> The panic screenshot has only the last part, but the end marker says
>>>> it's OOM with no killable processes. The DEBUG_VM config thus didn't
>>>> trigger anything, and still there's tons of pagecache, mostly clean,
>>>> that's not being reclaimed.
>>>>
>>>> Could you now try this?
>>>> - enable CONFIG_PAGE_OWNER
>>>> - boot with kernel option: page_owner=on
>>>> - after the first oom, "cat /sys/kernel/debug/page_owner > file"
>>>> - provide the file (compressed, it will be quite large)
>>>
>>> Excuse me for a noise, but do we really need to do
>>> "cat /sys/kernel/debug/page_owner > file" after the first OOM killer
>>> invocation? I worry that it might be too difficult to do.
>>> Shouldn't we rather do "cat /sys/kernel/debug/page_owner > file"
>>> hourly and compare tendency between the latest one and previous one?
>>>
>>> This system has swap, and /var/log/messages before panic
>>> reports that swapin was stalling at memory allocation.
>>>
>>> 
>>> [130346.262510] dsm_sa_datamgrd: page allocation stalls for 52400ms, 
>>> order:0, mode:0x24200ca(GFP_HIGHUSER_MOVABLE)
>>> [130346.262572] CPU: 1 PID: 3622 Comm: dsm_sa_datamgrd Tainted: GW 
>>> I 4.9.0-rc5 #2
>>> [130346.262662]   8129ba69 8170e4c8 
>>> c90003ccb8d8
>>> [130346.262714]  8113449f 024200ca1ca11b40 8170e4c8 
>>> c90003ccb880
>>> [130346.262765]  0010 c90003ccb8e8 c90003ccb898 
>>> 88041f226e80
>>> [130346.262817] Call Trace:
>>> [130346.2

Re: [Bug 186671] New: OOM on system with just rsync running 32GB of ram 30GB of pagecache

2016-11-17 Thread Vlastimil Babka
On 11/16/2016 02:39 PM, E V wrote:
> System panic'd overnight running 4.9rc5 & rsync. Attached a photo of
> the stack trace, and the 38 call traces in a 2 minute window shortly
> before, to the bugzilla case for those not on it's e-mail list:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=186671

The panic screenshot has only the last part, but the end marker says
it's OOM with no killable processes. The DEBUG_VM config thus didn't
trigger anything, and still there's tons of pagecache, mostly clean,
that's not being reclaimed.

Could you now try this?
- enable CONFIG_PAGE_OWNER
- boot with kernel option: page_owner=on
- after the first oom, "cat /sys/kernel/debug/page_owner > file"
- provide the file (compressed, it will be quite large)

Vlastimil



Re: [Bug 186671] New: OOM on system with just rsync running 32GB of ram 30GB of pagecache

2016-11-14 Thread Vlastimil Babka
0]  1000  4060 8973  501  23   3
> 184 0 systemd
> [737778.730973] [ 4061]  1000  4061157020  34   4
> 612 0 (sd-pam)
> [737778.731020] [ 4063]  1000  406326472  158  54   3
> 260 0 sshd
> [737778.731067] [ 4064]  1000  4064 6041  739  16   3
> 686 0 bash
> [737778.731113] [ 4083]  1000  408316853  493  37   3
> 128 0 su
> [737778.731160] [ 4084] 0  4084 5501  756  15   3
> 160 0 bash
> [737778.731207] [15150] 0 15150 3309  678  10   3
>  57 0 run_mirror.sh
> [737778.731256] [24296] 0 24296 1450  139   8   3
>  23 0 flock
> [737778.731302] [24297] 0 24297 9576  622  22   3
>3990 0 rsync
> [737778.731349] [24298] 0 24298 7552  541  18   3
>1073 0 rsync
> [737778.731395] [24299] 0 24299 9522  401  22   3
>2416 0 rsync
> [737778.731445] [25910] 0 2591010257  522  23   3
>  81 0 systemd-journal
> [737778.731494] [25940] 0 2594016365  617  37   3
> 126 0 cron
> [737778.731540] Out of memory: Kill process 3718 (dsm_om_connsvcd)
> score 1 or sacrifice child
> [737778.731644] Killed process 3718 (dsm_om_connsvcd)
> total-vm:2960936kB, anon-rss:0kB, file-rss:6976kB, shmem-rss:0kB
> [737778.768375] oom_reaper: reaped process 3718 (dsm_om_connsvcd), now
> anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> 
> On Fri, Nov 4, 2016 at 5:00 PM, Vlastimil Babka <vba...@suse.cz> wrote:
>> On 11/04/2016 03:13 PM, E V wrote:
>>> After the system panic'd yesterday I booted back into 4.8.4 and
>>> restarted the rsync's. I'm away on vacation next week, so when I get
>>> back I'll get rc4 or rc5 and try again. In the mean time here's data
>>> from the system running 4.8.4 without problems for about a day. I'm
>>> not familiar with xxd and didn't see a -e option, so used -E:
>>> xxd -E -g8 -c8 /proc/kpagecount | cut -d" " -f2 | sort | uniq -c
>>> 8258633 
>>>  216440 0100
>>
>> The lack of -e means it's big endian, which is not a big issue. So here
>> most of memory is free, some pages have just one pin, and only
>> relatively few have more. The vmstats also doesn't show anything bad, so
>> we'll have to wait if something appears within the week, or after you
>> try 4.9 again. Thanks.



Re: [Bug 186671] New: OOM on system with just rsync running 32GB of ram 30GB of pagecache

2016-11-04 Thread Vlastimil Babka
On 11/04/2016 03:13 PM, E V wrote:
> After the system panic'd yesterday I booted back into 4.8.4 and
> restarted the rsync's. I'm away on vacation next week, so when I get
> back I'll get rc4 or rc5 and try again. In the mean time here's data
> from the system running 4.8.4 without problems for about a day. I'm
> not familiar with xxd and didn't see a -e option, so used -E:
> xxd -E -g8 -c8 /proc/kpagecount | cut -d" " -f2 | sort | uniq -c
> 8258633 
>  216440 0100

The lack of -e means it's big endian, which is not a big issue. So here
most of memory is free, some pages have just one pin, and only
relatively few have more. The vmstats also doesn't show anything bad, so
we'll have to wait if something appears within the week, or after you
try 4.9 again. Thanks.


Re: [Bug 186671] New: OOM on system with just rsync running 32GB of ram 30GB of pagecache

2016-11-03 Thread Vlastimil Babka
On 11/03/2016 07:53 PM, Andrew Morton wrote:
> 
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).

+CC also btrfs just in case it's a problem in page reclaim there

> On Wed, 02 Nov 2016 13:02:39 + bugzilla-dae...@bugzilla.kernel.org wrote:
> 
>> https://bugzilla.kernel.org/show_bug.cgi?id=186671
>>
>> Bug ID: 186671
>>Summary: OOM on system with just rsync running 32GB of ram 30GB
>> of pagecache
>>Product: Memory Management
>>Version: 2.5
>> Kernel Version: 4.9-rc3
>>   Hardware: x86-64
>> OS: Linux
>>   Tree: Mainline
>> Status: NEW
>>   Severity: high
>>   Priority: P1
>>  Component: Page Allocator
>>   Assignee: a...@linux-foundation.org
>>   Reporter: eliven...@gmail.com
>> Regression: No
>>
>> Running rsync on a debian jessie system with 32GB of RAM and a big
>> 250TB btrfs filesystem. 30 GB of ram show up as cached, not much else
>> running on the system. Lots of page alloction stalls in dmesg before
>> hand, and several OOM's after this one as well until it finally killed
>> the rsync. So more traces available if desired. Started with the 4.7
>> series kernels, thought it was going to be fixed in 4.9:
> 
> OK, this looks bad.  Please let's work it via email so do remember the
> reply-to-alls.

It's bad but note the "started with 4.7" so it's not a 4.9 regression.
Also not a high-order OOM (phew!).

>> [93428.029768] irqbalance invoked oom-killer:
>> gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0-1, order=0,
>> oom_score_adj=0
>> [93428.029824] irqbalance cpuset=/ mems_allowed=0-1
>> [93428.029857] CPU: 11 PID: 2992 Comm: irqbalance Tainted: G  W   
>> 4.9.0-rc3
>> #1
>> [93428.029945]   812946c9 c90003d8bb10
>> c90003d8bb10
>> [93428.029997]  81190dd5  
>> 88081db051c0
>> [93428.030049]  c90003d8bb10 81711866 0002
>> 0213
>> [93428.030101] Call Trace:
>> [93428.030127]  [] ? dump_stack+0x46/0x5d
>> [93428.030157]  [] ? dump_header.isra.20+0x75/0x1a6
>> [93428.030189]  [] ? oom_kill_process+0x219/0x3d0
>> [93428.030218]  [] ? out_of_memory+0xd9/0x570
>> [93428.030246]  [] ? __alloc_pages_slowpath+0xa4b/0xa80
>> [93428.030276]  [] ? __alloc_pages_nodemask+0x288/0x2c0
>> [93428.030306]  [] ? alloc_pages_vma+0xc1/0x240
>> [93428.030337]  [] ? handle_mm_fault+0xccb/0xe60
>> [93428.030367]  [] ? __do_page_fault+0x1c5/0x490
>> [93428.030397]  [] ? page_fault+0x22/0x30
>> [93428.030425]  [] ? copy_user_generic_string+0x2c/0x40
>> [93428.030455]  [] ? seq_read+0x305/0x370
>> [93428.030483]  [] ? proc_reg_read+0x3e/0x60
>> [93428.030511]  [] ? __vfs_read+0x1e/0x110
>> [93428.030538]  [] ? vfs_read+0x89/0x130
>> [93428.030564]  [] ? SyS_read+0x3d/0x90
>> [93428.030591]  [] ? entry_SYSCALL_64_fastpath+0x13/0x94
>> [93428.030620] Mem-Info:
>> [93428.030647] active_anon:9283 inactive_anon:9905 isolated_anon:0
>> [93428.030647]  active_file:6752598 inactive_file:999166 isolated_file:288
>> [93428.030647]  unevictable:0 dirty:997857 writeback:1665 unstable:0
>> [93428.030647]  slab_reclaimable:203122 slab_unreclaimable:202102
>> [93428.030647]  mapped:7933 shmem:3170 pagetables:1752 bounce:0
>> [93428.030647]  free:39250 free_pcp:954 free_cma:0
>> [93428.030800] Node 0 active_anon:24984kB inactive_anon:26704kB
>> active_file:14365920kB inactive_file:1341120kB unevictable:0kB
>> isolated(anon):0kB isolated(file):0kB mapped:15852kB dirty:1338044kB
>> writeback:3072kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB
>> anon_thp: 9484kB writeback_tmp:0kB unstable:0kB pages_scanned:23811175
>> all_unreclaimable? yes
>> [93428.030933] Node 1 active_anon:12148kB inactive_anon:12916kB
>> active_file:12644472kB inactive_file:2655544kB unevictable:0kB
>> isolated(anon):0kB isolated(file):1152kB mapped:15880kB
>> dirty:2653384kB writeback:3588kB shmem:0kB shmem_thp: 0kB
>> shmem_pmdmapped: 0kB anon_thp: 3196kB writeback_tmp:0kB unstable:0kB
>> pages_scanned:23178917 all_unreclaimable? yes

Note the high pages_scanned and all_unreclaimable. I suspect something
is pinning the memory. Can you post /proc/vmstat from the system with an
uptime after it experiences the OOM?

There's a /proc/kpagecount file that could confirm that. Could you provide
it too? Try running something like this and provide the output, please.

xxd -e -g8 -c8 /proc/kpagecount | cut -d" " -f2 | sort | uniq -c

>> [93428.031059] Node 0 Normal free:44968kB min:45192kB low:61736kB
>> high:78280kB active_anon:24984kB inactive_anon:26704kB
>> active_file:14365920kB inactive_file:1341120kB unevictable:0kB
>> writepending:1341116kB present:16777216kB managed:16546296kB
>> mlocked:0kB slab_reclaimable:413824kB slab_unreclaimable:253144kB
>> kernel_stack:3496kB pagetables:4104kB bounce:0kB free_pcp:1388kB
>> 

Re: [PATCH 0/2] btrfs: fortification for GFP_NOFS allocations

2015-09-09 Thread Vlastimil Babka

On 08/19/2015 08:17 PM, Chris Mason wrote:
> On Wed, Aug 19, 2015 at 02:17:39PM +0200, mho...@kernel.org wrote:
>> Hi,
>> these two patches were sent as a part of a larger RFC which aims at
>> allowing GFP_NOFS allocations to fail to help sort out memory reclaim
>> issues bound to the current behavior
>> (http://marc.info/?l=linux-mm=143876830616538=2).
>>
>> It is clear that the move to the GFP_NOFS behavior change is a long term
>> plan but these patches should be good enough even with that change in
>> place. It also seems that Chris wasn't opposed and would be willing to
>> take them http://marc.info/?l=linux-mm=143991792427165=2 so here we
>> come. I have rephrased the changelogs to not refer to the patch which
>> changes the NOFS behavior.
>>
>> Just to clarify: these two patches allowed my particular testcase
>> (mentioned in the cover referenced above) to survive; it doesn't mean
>> that failing GFP_NOFS allocations are OK now. I have seen some other
>> places where a GFP_NOFS allocation is followed by BUG_ON(ALLOC_FAILED).
>> I have not encountered them though.
>>
>> Let me know if you would prefer other changes.
>
> My plan is to start with these two and take more as required.


I've previously noticed in __set_extent_bit() things like:

if (!prealloc && (mask & __GFP_WAIT)) {
prealloc = alloc_extent_state(mask);
BUG_ON(!prealloc);
}

and later:

prealloc = alloc_extent_state_atomic(prealloc);
BUG_ON(!prealloc);

which internally does:

if (!prealloc)
prealloc = alloc_extent_state(GFP_ATOMIC);

The first one could be fixed by adding __GFP_NOFAIL; in fact we've got
an internal bug report for that one already. Even without GFP_NOFS being
allowed to fail, the allocation can already fail when the thread is marked
for oom kill, which is likely what happened in that case.


The second case is problematic though, because GFP_ATOMIC | __GFP_NOFAIL 
is not allowed. GFP_ATOMIC will give you access to memory reserves, 
which reduces the chance of hitting the BUG_ON(), but it's not a 
bulletproof solution.





