Re: [PATCH] ext4: Fix deadlock on page reclaim

2019-07-30 Thread Damien Le Moal
Dave,

On 2019/07/31 8:48, Dave Chinner wrote:
> On Tue, Jul 30, 2019 at 02:06:33AM +, Damien Le Moal wrote:
>> If we had a pread_nofs()/pwrite_nofs(), that would work. Or we could define a
>> RWF_NORECLAIM flag for pwritev2()/preadv2(). This last one could actually
>> be the cleanest approach.
> 
> Clean, yes, but I'm not sure we want to expose kernel memory reclaim
> capabilities to userspace... It would be misleading, too, because we
> still want to allow reclaim to occur, just not have reclaim recurse
> into other filesystems

When I wrote RWF_NORECLAIM, I was really thinking of RWF_NOFSRECLAIM, that is,
suppressing direct reclaim from recursing into another FS rather than
completely disabling reclaim. Sorry for the confusion.

Would this be better? This is still application controlled, so it is debatable
whether such control should be given. Most likely this would need to be limited
to CAP_SYS_ADMIN-capable user processes (e.g. root processes).
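
As an illustration only, here is a minimal sketch of what such a call could
look like from tcmu-runner's ZBC handler, assuming a hypothetical
RWF_NOFSRECLAIM flag (the flag name and value are invented for this example
and are not part of the kernel ABI):

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>

#ifndef RWF_NOFSRECLAIM
#define RWF_NOFSRECLAIM	0x00000080	/* hypothetical, not a real kernel flag */
#endif

static ssize_t zbc_write(int fd, void *buf, size_t len, off_t ofs)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };

	/* Ask the kernel to avoid recursing into another filesystem if this
	 * write's page cache allocations enter direct reclaim. */
	return pwritev2(fd, &iov, 1, ofs, RWF_NOFSRECLAIM);
}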

I still need to check whether anything at all along these lines exists for
FUSE. I will dig.

Best regards.

-- 
Damien Le Moal
Western Digital Research


Re: [PATCH] ext4: Fix deadlock on page reclaim

2019-07-30 Thread Dave Chinner
On Tue, Jul 30, 2019 at 02:06:33AM +, Damien Le Moal wrote:
> If we had a pread_nofs()/pwrite_nofs(), that would work. Or we could define a
> RWF_NORECLAIM flag for pwritev2()/preadv2(). This last one could actually
> be the cleanest approach.

Clean, yes, but I'm not sure we want to expose kernel memory reclaim
capabilities to userspace... It would be misleading, too, because we
still want to allow reclaim to occur, just not have reclaim recurse
into other filesystems

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] ext4: Fix deadlock on page reclaim

2019-07-29 Thread Damien Le Moal
Andreas,

On 2019/07/30 3:40, Andreas Dilger wrote:
> On Jul 26, 2019, at 8:59 PM, Damien Le Moal  wrote:
>>
>> On 2019/07/27 7:55, Theodore Y. Ts'o wrote:
>>> On Sat, Jul 27, 2019 at 08:44:23AM +1000, Dave Chinner wrote:
>>>>>
>>>>> This looks like something that could hit every file system, so
>>>>> shouldn't we fix this in common code?  We could also look into
>>>>> just using memalloc_nofs_save for the page cache allocation path
>>>>> instead of the per-mapping gfp_mask.
>>>>
>>>> I think it has to be the entire IO path - any allocation from the
>>>> underlying filesystem could recurse into the top level filesystem
>>>> and then deadlock if the memory reclaim submits IO or blocks on
>>>> IO completion from the upper filesystem. That's a bloody big hammer
>>>> for something that is only necessary when there are stacked
>>>> filesystems like this
>>>
>>> Yeah that's why using memalloc_nofs_save() probably makes the most
>>> sense, and dm_zoned should use that before it calls into ext4.
>>
>> Unfortunately, with this particular setup, that will not solve the problem.
>> dm-zoned submits BIOs to its backend drive in response to XFS activity. The
>> requests for these BIOs are passed along to the kernel tcmu HBA and end up in
>> that HBA command ring. The commands themselves are read from the ring and
>> executed by the tcmu-runner user process, which services them with
>> pread()/pwrite() calls to the ext4 file. Since the tcmu-runner process is a
>> different context than the dm-zoned worker thread issuing the BIO,
>> memalloc_nofs_save/restore() calls in dm-zoned will have no effect.
> 
> One way to handle this is to pass on PF_MEMALLOC/memalloc_nofs_save state
> in the BIO so that the worker thread will also get the correct behaviour
> when it is processing this IO submission.

But there is no BIO to work with here: dm-zoned BIOs are transformed into
pread()/pwrite() calls to an ext4 file, issued by an application (tcmu-runner
running the ZBC file handler). Processing of these calls triggers the direct
reclaim and FS recursion that cause the deadlock. So BIOs are not the proper
vehicle for transporting the NOFS allocation information.

If we had a pread_nofs()/pwrite_nofs(), that would work. Or we could define a
RWF_NORECLAIM flag for pwritev2()/preadv2(). This last one could actually be the
cleanest approach.

Best regards.

-- 
Damien Le Moal
Western Digital Research


Re: [PATCH] ext4: Fix deadlock on page reclaim

2019-07-29 Thread Andreas Dilger
On Jul 26, 2019, at 8:59 PM, Damien Le Moal  wrote:
> 
> On 2019/07/27 7:55, Theodore Y. Ts'o wrote:
>> On Sat, Jul 27, 2019 at 08:44:23AM +1000, Dave Chinner wrote:
>>>> 
>>>> This looks like something that could hit every file system, so
>>>> shouldn't we fix this in common code?  We could also look into
>>>> just using memalloc_nofs_save for the page cache allocation path
>>>> instead of the per-mapping gfp_mask.
>>> 
>>> I think it has to be the entire IO path - any allocation from the
>>> underlying filesystem could recurse into the top level filesystem
>>> and then deadlock if the memory reclaim submits IO or blocks on
>>> IO completion from the upper filesystem. That's a bloody big hammer
>>> for something that is only necessary when there are stacked
>>> filesystems like this
>> 
>> Yeah that's why using memalloc_nofs_save() probably makes the most
>> sense, and dm_zoned should use that before it calls into ext4.
> 
> Unfortunately, with this particular setup, that will not solve the problem.
> dm-zoned submits BIOs to its backend drive in response to XFS activity. The
> requests for these BIOs are passed along to the kernel tcmu HBA and end up in
> that HBA command ring. The commands themselves are read from the ring and
> executed by the tcmu-runner user process, which services them with
> pread()/pwrite() calls to the ext4 file. Since the tcmu-runner process is a
> different context than the dm-zoned worker thread issuing the BIO,
> memalloc_nofs_save/restore() calls in dm-zoned will have no effect.

One way to handle this is to pass on PF_MEMALLOC/memalloc_nofs_save state
in the BIO so that the worker thread will also get the correct behaviour
when it is processing this IO submission.
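
Purely as a sketch of that idea (the flag and helpers below are invented for
illustration; no such per-bio flag exists), the submitter's NOFS state could
be snapshotted at submission time and re-applied by the worker:

#include <linux/bio.h>
#include <linux/sched.h>
#include <linux/sched/mm.h>

#define DMZ_BIO_SUBMITTER_NOFS	(1u << 0)	/* hypothetical flag */

/* Submission side: record whether the submitter runs in NOFS context. */
static void dmz_record_nofs_state(struct bio *bio, unsigned int *flags)
{
	if (current->flags & PF_MEMALLOC_NOFS)
		*flags |= DMZ_BIO_SUBMITTER_NOFS;
}

/* Worker side: re-apply the submitter's NOFS scope while processing. */
static void dmz_process_bio(struct bio *bio, unsigned int flags)
{
	unsigned int nofs = 0;

	if (flags & DMZ_BIO_SUBMITTER_NOFS)
		nofs = memalloc_nofs_save();
	/* ... process the bio, possibly allocating memory ... */
	if (flags & DMZ_BIO_SUBMITTER_NOFS)
		memalloc_nofs_restore(nofs);
}

As Damien's reply points out, though, this cannot reach tcmu-runner, which
consumes the commands from userspace rather than from a kernel worker.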

> We tried a simpler setup using a loopback mount (XFS used directly in an ext4
> file) and running the same workload. We failed to recreate a similar deadlock
> in this case, but I strongly suspect that it can happen too. It is simply
> much harder to hit because the IO path from XFS to ext4 is all in-kernel and
> asynchronous, whereas the tcmu-runner ZBC handler is a synchronous QD=1 path
> for IOs, which makes it relatively easy to get inter-dependent writes or
> read+write queued back-to-back and create the deadlock.
> 
> So back to Dave's point, we may need the big-hammer solution in the case
> of stacked file systems, while non-stacked setups do not necessarily need it
> (that is for the FS to decide). But I do not see how to implement this big
> hammer conditionally. How can a file system tell whether it is at the top of
> the stack (big hammer not needed) or below the top level (big hammer needed)?
> 
> One simple hack would be an fcntl() or mount option to tell the FS to use
> GFP_NOFS unconditionally, but avoiding the bug would then depend on the
> applications or system setup being correct. So not so safe.

Cheers, Andreas



Re: [PATCH] ext4: Fix deadlock on page reclaim

2019-07-28 Thread Damien Le Moal
On 2019/07/29 8:42, Dave Chinner wrote:
> On Sat, Jul 27, 2019 at 02:59:59AM +, Damien Le Moal wrote:
>> On 2019/07/27 7:55, Theodore Y. Ts'o wrote:
>>> On Sat, Jul 27, 2019 at 08:44:23AM +1000, Dave Chinner wrote:
>>>>>
>>>>> This looks like something that could hit every file system, so
>>>>> shouldn't we fix this in common code?  We could also look into
>>>>> just using memalloc_nofs_save for the page cache allocation path
>>>>> instead of the per-mapping gfp_mask.
>>>>
>>>> I think it has to be the entire IO path - any allocation from the
>>>> underlying filesystem could recurse into the top level filesystem
>>>> and then deadlock if the memory reclaim submits IO or blocks on
>>>> IO completion from the upper filesystem. That's a bloody big hammer
>>>> for something that is only necessary when there are stacked
>>>> filesystems like this
>>>
>>> Yeah that's why using memalloc_nofs_save() probably makes the most
>>> sense, and dm_zoned should use that before it calls into ext4.
>>
>> Unfortunately, with this particular setup, that will not solve the problem.
>> dm-zoned submits BIOs to its backend drive in response to XFS activity. The
>> requests for these BIOs are passed along to the kernel tcmu HBA and end up in
>> that HBA command ring. The commands themselves are read from the ring and
>> executed by the tcmu-runner user process, which services them with
>> pread()/pwrite() calls to the ext4 file. Since the tcmu-runner process is a
>> different context than the dm-zoned worker thread issuing the BIO,
>> memalloc_nofs_save/restore() calls in dm-zoned will have no effect.
> 
> Right, I'm talking about using memalloc_nofs_save() as a huge hammer
> in the pread/pwrite() calling context, not the bio submission
> context (which is typically GFP_NOFS above submit_bio() and GFP_NOIO
> below).

Yes, I understood your point. And I agree that it would indeed be a big hammer.
We should be able to do better than that :)

>> One simple hack would be an fcntl() or mount option to tell the FS to use
>> GFP_NOFS unconditionally, but avoiding the bug would then depend on the
>> applications or system setup being correct. So not so safe.
> 
> Wasn't there discussion at some point in the past about an interface
> for special processes to be able to mark themselves as PF_MEMALLOC
> (some kind of prctl, I think) for things like FUSE daemons? That
> would prevent direct reclaim recursion for these userspace daemons
> that are in the kernel memory reclaim IO path. It's the same
> situation there, isn't it? How does fuse deal with this problem?

I do not recall such a discussion. But FUSE may indeed give some hints. Good
idea. Thanks, I will check.
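
For reference, a sketch of what such a prctl-based interface can look like.
PR_SET_IO_FLUSHER is one real interface of this kind: it marks the calling
task PF_MEMALLOC_NOIO (plus related flags) and requires CAP_SYS_RESOURCE; it
is used here purely to illustrate the pattern:

#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_SET_IO_FLUSHER
#define PR_SET_IO_FLUSHER 57	/* may be absent from older headers */
#endif

int main(void)
{
	/* Mark this process as part of the storage IO path so that its
	 * allocations do not trigger IO-submitting direct reclaim. */
	if (prctl(PR_SET_IO_FLUSHER, 1, 0, 0, 0) < 0) {
		perror("prctl(PR_SET_IO_FLUSHER)");
		return 1;
	}
	/* ... run the IO handling loop (e.g. the ZBC handler) ... */
	return 0;
}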


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH] ext4: Fix deadlock on page reclaim

2019-07-28 Thread Dave Chinner
On Sat, Jul 27, 2019 at 02:59:59AM +, Damien Le Moal wrote:
> On 2019/07/27 7:55, Theodore Y. Ts'o wrote:
> > On Sat, Jul 27, 2019 at 08:44:23AM +1000, Dave Chinner wrote:
> >>>
> >>> This looks like something that could hit every file system, so
> >>> shouldn't we fix this in common code?  We could also look into
> >>> just using memalloc_nofs_save for the page cache allocation path
> >>> instead of the per-mapping gfp_mask.
> >>
> >> I think it has to be the entire IO path - any allocation from the
> >> underlying filesystem could recurse into the top level filesystem
> >> and then deadlock if the memory reclaim submits IO or blocks on
> >> IO completion from the upper filesystem. That's a bloody big hammer
> >> for something that is only necessary when there are stacked
> >> filesystems like this
> > 
> > Yeah that's why using memalloc_nofs_save() probably makes the most
> > sense, and dm_zoned should use that before it calls into ext4.
> 
> Unfortunately, with this particular setup, that will not solve the problem.
> dm-zoned submits BIOs to its backend drive in response to XFS activity. The
> requests for these BIOs are passed along to the kernel tcmu HBA and end up in
> that HBA command ring. The commands themselves are read from the ring and
> executed by the tcmu-runner user process, which services them with
> pread()/pwrite() calls to the ext4 file. Since the tcmu-runner process is a
> different context than the dm-zoned worker thread issuing the BIO,
> memalloc_nofs_save/restore() calls in dm-zoned will have no effect.

Right, I'm talking about using memalloc_nofs_save() as a huge hammer
in the pread/pwrite() calling context, not the bio submission
context (which is typically GFP_NOFS above submit_bio() and GFP_NOIO
below).

> So back to Dave's point, we may need the big-hammer solution in the case
> of stacked file systems, while non-stacked setups do not necessarily need it
> (that is for the FS to decide). But I do not see how to implement this big
> hammer conditionally. How can a file system tell whether it is at the top of
> the stack (big hammer not needed) or below the top level (big hammer needed)?

Age-old question - tracking arbitrary nesting through mount/fs/bdev
layers is ... difficult. There is no easy way to tell.

> One simple hack would be an fcntl() or mount option to tell the FS to use
> GFP_NOFS unconditionally, but avoiding the bug would then depend on the
> applications or system setup being correct. So not so safe.

Wasn't there discussion at some point in the past about an interface
for special processes to be able to mark themselves as PF_MEMALLOC
(some kind of prctl, I think) for things like FUSE daemons? That
would prevent direct reclaim recursion for these userspace daemons
that are in the kernel memory reclaim IO path. It's the same
situation there, isn't it? How does fuse deal with this problem?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] ext4: Fix deadlock on page reclaim

2019-07-26 Thread Damien Le Moal
On 2019/07/27 7:55, Theodore Y. Ts'o wrote:
> On Sat, Jul 27, 2019 at 08:44:23AM +1000, Dave Chinner wrote:
>>>
>>> This looks like something that could hit every file system, so
>>> shouldn't we fix this in common code?  We could also look into
>>> just using memalloc_nofs_save for the page cache allocation path
>>> instead of the per-mapping gfp_mask.
>>
>> I think it has to be the entire IO path - any allocation from the
>> underlying filesystem could recurse into the top level filesystem
>> and then deadlock if the memory reclaim submits IO or blocks on
>> IO completion from the upper filesystem. That's a bloody big hammer
>> for something that is only necessary when there are stacked
>> filesystems like this
> 
> Yeah that's why using memalloc_nofs_save() probably makes the most
> sense, and dm_zoned should use that before it calls into ext4.

Unfortunately, with this particular setup, that will not solve the problem.
dm-zoned submits BIOs to its backend drive in response to XFS activity. The
requests for these BIOs are passed along to the kernel tcmu HBA and end up in
that HBA command ring. The commands themselves are read from the ring and
executed by the tcmu-runner user process, which services them with
pread()/pwrite() calls to the ext4 file. Since the tcmu-runner process is a
different context than the dm-zoned worker thread issuing the BIO,
memalloc_nofs_save/restore() calls in dm-zoned will have no effect.
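
For clarity, the IO stack described above is roughly:

  XFS (upper filesystem)
    -> dm-zoned (submits BIOs to its backend)
      -> kernel tcmu HBA (command ring)
        -> tcmu-runner ZBC handler (userspace, synchronous QD=1)
          -> pread()/pwrite() on a file in ext4 (lower filesystem)

A page cache allocation for the ext4 pread()/pwrite() can enter direct
reclaim, and direct reclaim may submit or wait on XFS IO that can only
complete back through this same path - hence the deadlock.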

We tried a simpler setup using a loopback mount (XFS used directly in an ext4
file) and running the same workload. We failed to recreate a similar deadlock in
this case, but I strongly suspect that it can happen too. It is simply much
harder to hit because the IO path from XFS to ext4 is all in-kernel and
asynchronous, whereas the tcmu-runner ZBC handler is a synchronous QD=1 path for
IOs, which makes it relatively easy to get inter-dependent writes or read+write
queued back-to-back and create the deadlock.

So back to Dave's point, we may need the big-hammer solution in the case
of stacked file systems, while non-stacked setups do not necessarily need it
(that is for the FS to decide). But I do not see how to implement this big
hammer conditionally. How can a file system tell whether it is at the top of the
stack (big hammer not needed) or below the top level (big hammer needed)?

One simple hack would be an fcntl() or mount option to tell the FS to use
GFP_NOFS unconditionally, but avoiding the bug would then depend on the
applications or system setup being correct. So not so safe.

-- 
Damien Le Moal
Western Digital Research


Re: [PATCH] ext4: Fix deadlock on page reclaim

2019-07-26 Thread Theodore Y. Ts'o
On Sat, Jul 27, 2019 at 08:44:23AM +1000, Dave Chinner wrote:
> > 
> > This looks like something that could hit every file system, so
> > shouldn't we fix this in common code?  We could also look into
> > just using memalloc_nofs_save for the page cache allocation path
> > instead of the per-mapping gfp_mask.
> 
> I think it has to be the entire IO path - any allocation from the
> underlying filesystem could recurse into the top level filesystem
> and then deadlock if the memory reclaim submits IO or blocks on
> IO completion from the upper filesystem. That's a bloody big hammer
> for something that is only necessary when there are stacked
> filesystems like this

Yeah that's why using memalloc_nofs_save() probably makes the most
sense, and dm_zoned should use that before it calls into ext4.
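
A minimal sketch of that bracketing pattern, with a made-up placeholder for
the dm-zoned call site (as the replies note, this does not help here because
tcmu-runner runs in a separate process context):

#include <linux/sched/mm.h>

static void dmz_handle_backing_io(void)	/* hypothetical call site */
{
	unsigned int flags;

	/* Allocations in this scope implicitly drop __GFP_FS, so direct
	 * reclaim cannot recurse into filesystem code. */
	flags = memalloc_nofs_save();
	/* ... do the work that allocates memory / calls into the FS ... */
	memalloc_nofs_restore(flags);
}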

   - Ted


Re: [PATCH] ext4: Fix deadlock on page reclaim

2019-07-26 Thread Dave Chinner
On Thu, Jul 25, 2019 at 04:54:42AM -0700, Christoph Hellwig wrote:
> On Thu, Jul 25, 2019 at 06:33:58PM +0900, Damien Le Moal wrote:
> > +	gfp_t gfp_mask;
> > +
> >  	switch (ext4_inode_journal_mode(inode)) {
> >  	case EXT4_INODE_ORDERED_DATA_MODE:
> >  	case EXT4_INODE_WRITEBACK_DATA_MODE:
> > @@ -4019,6 +4019,14 @@ void ext4_set_aops(struct inode *inode)
> >  		inode->i_mapping->a_ops = &ext4_da_aops;
> >  	else
> >  		inode->i_mapping->a_ops = &ext4_aops;
> > +
> > +	/*
> > +	 * Ensure all page cache allocations are done from GFP_NOFS context to
> > +	 * prevent direct reclaim recursion back into the filesystem and blowing
> > +	 * stacks or deadlocking.
> > +	 */
> > +	gfp_mask = mapping_gfp_mask(inode->i_mapping);
> > +	mapping_set_gfp_mask(inode->i_mapping, (gfp_mask & ~(__GFP_FS)));
> 
> This looks like something that could hit every file system, so
> shouldn't we fix this in common code?  We could also look into
> just using memalloc_nofs_save for the page cache allocation path
> instead of the per-mapping gfp_mask.

I think it has to be the entire IO path - any allocation from the
underlying filesystem could recurse into the top level filesystem
and then deadlock if the memory reclaim submits IO or blocks on
IO completion from the upper filesystem. That's a bloody big hammer
for something that is only necessary when there are stacked
filesystems like this

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] ext4: Fix deadlock on page reclaim

2019-07-25 Thread Damien Le Moal
On 2019/07/25 20:54, Christoph Hellwig wrote:
> On Thu, Jul 25, 2019 at 06:33:58PM +0900, Damien Le Moal wrote:
>> +	gfp_t gfp_mask;
>> +
>>  	switch (ext4_inode_journal_mode(inode)) {
>>  	case EXT4_INODE_ORDERED_DATA_MODE:
>>  	case EXT4_INODE_WRITEBACK_DATA_MODE:
>> @@ -4019,6 +4019,14 @@ void ext4_set_aops(struct inode *inode)
>>  		inode->i_mapping->a_ops = &ext4_da_aops;
>>  	else
>>  		inode->i_mapping->a_ops = &ext4_aops;
>> +
>> +	/*
>> +	 * Ensure all page cache allocations are done from GFP_NOFS context to
>> +	 * prevent direct reclaim recursion back into the filesystem and blowing
>> +	 * stacks or deadlocking.
>> +	 */
>> +	gfp_mask = mapping_gfp_mask(inode->i_mapping);
>> +	mapping_set_gfp_mask(inode->i_mapping, (gfp_mask & ~(__GFP_FS)));
> 
> This looks like something that could hit every file system, so
> shouldn't we fix this in common code?  We could also look into
> just using memalloc_nofs_save for the page cache allocation path
> instead of the per-mapping gfp_mask.

I agree. I did a quick scan and it looks to me like not all file systems are
using GFP_NOFS when grabbing pages for read or write. XFS, btrfs, f2fs, jfs,
gfs2 and ramfs seem OK, but I did not dig very deep into the others, where the
use of GFP_NOFS is less obvious.

I am not sure, though, how to approach a global fix. At the very least, it may
be good to have GFP_NOFS set by default in the inode mapping flags in
__address_space_init_once(), which is called from inode_init_once(). But I am
not sure that is enough, nor whether all file systems use this function.

The other method, as you suggest, would be to add calls to
memalloc_nofs_save/restore() in functions like grab_cache_page_write_begin(),
but since that would cover writes only, we may want to do it at a higher level
in the various generic_xxx() and mpage_xxx() helper functions to cover more
ground. Even that would still not be enough for the file systems not using
these helpers (there are plenty of examples of that on the write path).
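
For instance, a hypothetical VFS-level helper (the name is made up for
illustration) could bracket the generic write path like this:

#include <linux/fs.h>
#include <linux/sched/mm.h>

static ssize_t nofs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
	unsigned int nofs_flags;
	ssize_t ret;

	/* Page cache allocations below implicitly lose __GFP_FS. */
	nofs_flags = memalloc_nofs_save();
	ret = generic_file_write_iter(iocb, from);
	memalloc_nofs_restore(nofs_flags);
	return ret;
}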

Or do we go as high as the VFS layer to add memalloc_nofs_save/restore() calls?
That would be a big-hammer fix... I personally think this would be OK, but I
may be missing some points here.

Thoughts on the best approach?

Best regards.

-- 
Damien Le Moal
Western Digital Research


Re: [PATCH] ext4: Fix deadlock on page reclaim

2019-07-25 Thread Andreas Dilger
On Jul 25, 2019, at 5:54 AM, Christoph Hellwig  wrote:
> 
> On Thu, Jul 25, 2019 at 06:33:58PM +0900, Damien Le Moal wrote:
>> +	gfp_t gfp_mask;
>> +
>>  	switch (ext4_inode_journal_mode(inode)) {
>>  	case EXT4_INODE_ORDERED_DATA_MODE:
>>  	case EXT4_INODE_WRITEBACK_DATA_MODE:
>> @@ -4019,6 +4019,14 @@ void ext4_set_aops(struct inode *inode)
>>  		inode->i_mapping->a_ops = &ext4_da_aops;
>>  	else
>>  		inode->i_mapping->a_ops = &ext4_aops;
>> +
>> +	/*
>> +	 * Ensure all page cache allocations are done from GFP_NOFS context to
>> +	 * prevent direct reclaim recursion back into the filesystem and blowing
>> +	 * stacks or deadlocking.
>> +	 */
>> +	gfp_mask = mapping_gfp_mask(inode->i_mapping);
>> +	mapping_set_gfp_mask(inode->i_mapping, (gfp_mask & ~(__GFP_FS)));
> 
> This looks like something that could hit every file system, so
> shouldn't we fix this in common code?

It also has the drawback that it prevents __GFP_FS reclaim when ext4
is *not* at the bottom of the IO stack.

> We could also look into just using memalloc_nofs_save for the page
> cache allocation path instead of the per-mapping gfp_mask.

That makes more sense.

Cheers, Andreas



Re: [PATCH] ext4: Fix deadlock on page reclaim

2019-07-25 Thread Christoph Hellwig
On Thu, Jul 25, 2019 at 06:33:58PM +0900, Damien Le Moal wrote:
> +	gfp_t gfp_mask;
> +
>  	switch (ext4_inode_journal_mode(inode)) {
>  	case EXT4_INODE_ORDERED_DATA_MODE:
>  	case EXT4_INODE_WRITEBACK_DATA_MODE:
> @@ -4019,6 +4019,14 @@ void ext4_set_aops(struct inode *inode)
>  		inode->i_mapping->a_ops = &ext4_da_aops;
>  	else
>  		inode->i_mapping->a_ops = &ext4_aops;
> +
> +	/*
> +	 * Ensure all page cache allocations are done from GFP_NOFS context to
> +	 * prevent direct reclaim recursion back into the filesystem and blowing
> +	 * stacks or deadlocking.
> +	 */
> +	gfp_mask = mapping_gfp_mask(inode->i_mapping);
> +	mapping_set_gfp_mask(inode->i_mapping, (gfp_mask & ~(__GFP_FS)));

This looks like something that could hit every file system, so
shouldn't we fix this in common code?  We could also look into
just using memalloc_nofs_save for the page cache allocation path
instead of the per-mapping gfp_mask.

>  }
>  
>  static int __ext4_block_zero_page_range(handle_t *handle,
> -- 
> 2.21.0
> 
---end quoted text---