subject:"\[RFC PATCH\] block\: xfs\: dm thin\: train XFS to give up on retrying IO if thinp is out of space"

Re: [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-23 Thread Vivek Goyal

On Fri, Jul 24, 2015 at 09:00:54AM +1000, Dave Chinner wrote:
> On Thu, Jul 23, 2015 at 12:43:58PM -0400, Vivek Goyal wrote:
> > On Thu, Jul 23, 2015 at 03:10:43PM +1000, Dave Chinner wrote:
> > 
> > [..]
> > > I don't think knowing the bdev timeout is necessary because the
> > > default is most likely to be "fail fast" in this case. i.e. no
> > > retries, just shut down.  IOWs, if we describe the configs and
> > > actions in neutral terms, then the default configurations easy for
> > > users to understand. i.e:
> > > 
> > > bdev enospc   XFS default
> > > ---   ---
> > > Fail slow Fail fast
> > > Fail fast Fail slow
> > > Fail neverFail never, Record in log
> > > EOPNOTSUPPFail never
> > > 
> > > With that in mind, I'm thinking I should drop the
> > > "permanent/transient" error classifications, and change it "failure
> > > behaviour" with the options "fast slow [never]" and only the slow
> > > option has retry/timeout configuration options.  I think the "never"
> > > option still needs to "fail at unmount" config variable, but we
> > > enable it by default rather than hanging unmount and requiring a
> > > manual shutdown like we do now
> > 
> > I am wondering instead of 4 knobs (fast,slow,never,retry-timeout) can
> > we just do with one knob per error type and that is retry-timout.
> 
> "retry-timeout" == "fail slow". i.e. a 5 minute retry timeout is
> configured as:
> 
> # echo slow > fail_method
> # echo 0 > max_retries
> # echo 300 > retry_timeout

Hi Dave,

I am sure I am missing something but I will anyway ask. Why do we need this
knob "fail_method". Isn't it sort of implied in other two knobs based
on their values.

max_retries=0 retry_timeout=0 implies fail_method=fast.

A non-zero value of max_retries or retry_timeout implies fail_method=slow

A very high value (-1) of either max_retries or retry_timeout implies
fail_method="almost never".

> > retry-timeout=0 (Fail fast)
> > retry-timeout=X (Fail slow)
> > retry-timeout=-1 (Never Give up).
> 
> What do we do when we want to add a different failure type
> with different configuration requirements?

Ok, got it. So we are targettting something very generic so that other
cases can be handled too.

> 
> > Also do we really need this timeout per error type.
> 
> I don't follow your logic here.  What do need a timeout for with
> either the "never" or "fast" failure configurations?

Ignore this. I had misunderstood it.

> 
> > Also would be nice if this timeout was configurable using a mount
> > option. Then we can just specify it during mount time and be done
> > with it.
> 
> That way lies madness.  The error configuration iinfrastructure we
> need is not just for ENOSPC errors on metadata buffers.  We need
> configurable error behaviour for multiple different errors in
> multiple different subsystems (e.g. data IO failure vs metadata
> buffer IO failure vs memory allocation failure vs inode corruption
> vs freespace corruption vs ).
> 
> And we still would need the sysfs interface for querying and
> configuring at runtime, so mount options are just a bad idea.  And
> with sysfs, the potential future route for automatic configuration
> at mount time is via udev events and configuration files, similar to
> block devices.

Agreed that sysfs provides lots of flexibility here. I guess I was
just thinking in terms of solving this particular issue  we are facing.

> 
> > Idea of auto tuning based on what block device is doing sounds reasonable
> > but that should not be a requirement for this patch and can go in even
> > later. It is one of those nice to have features.
> 
> "this patch"? Just the core infrastructure so far:

I was referring to Mike's patch where we add additional method to block
device operations.

> 
> 11 files changed, 290 insertions(+), 60 deletions(-)
> 
> and that will need to be split into 4-5 patches for review. There's
> a bunch of cleanup that preceeds this, and then there's a patch per
> error type we are going to handle in metadata buffer IO completion.
> IOWs, the dm-thinp autotuning is just a simple, small patch at the
> end of a much larger series - it's maybe 10 lines of code in XFS...

Ok. I will wait for the final patches. 

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [dm-devel] [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-23 Thread Dave Chinner

On Thu, Jul 23, 2015 at 01:08:36PM -0400, Mikulas Patocka wrote:
> On Wed, 22 Jul 2015, Dave Chinner wrote:
> > On Wed, Jul 22, 2015 at 10:09:23AM +1000, Dave Chinner wrote:
> > > On Tue, Jul 21, 2015 at 01:47:53PM -0400, Mike Snitzer wrote:
> > > | $ cat
> > > | /sys/fs/xfs/vda/meta_write_errors/enospc/transient_fail_at_umount
> > > | 1

[...]

> You can just stop retrying the I/Os when the user attempts to unmount the 
> filesystem - then, you don't need any configuration option.

See above - the default will do that, but there are users who do not
want that unmount behaviour

-Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-23 Thread Dave Chinner

On Thu, Jul 23, 2015 at 12:43:58PM -0400, Vivek Goyal wrote:
> On Thu, Jul 23, 2015 at 03:10:43PM +1000, Dave Chinner wrote:
> 
> [..]
> > I don't think knowing the bdev timeout is necessary because the
> > default is most likely to be "fail fast" in this case. i.e. no
> > retries, just shut down.  IOWs, if we describe the configs and
> > actions in neutral terms, then the default configurations easy for
> > users to understand. i.e:
> > 
> > bdev enospc XFS default
> > --- ---
> > Fail slow   Fail fast
> > Fail fast   Fail slow
> > Fail never  Fail never, Record in log
> > EOPNOTSUPP  Fail never
> > 
> > With that in mind, I'm thinking I should drop the
> > "permanent/transient" error classifications, and change it "failure
> > behaviour" with the options "fast slow [never]" and only the slow
> > option has retry/timeout configuration options.  I think the "never"
> > option still needs to "fail at unmount" config variable, but we
> > enable it by default rather than hanging unmount and requiring a
> > manual shutdown like we do now
> 
> I am wondering instead of 4 knobs (fast,slow,never,retry-timeout) can
> we just do with one knob per error type and that is retry-timout.

"retry-timeout" == "fail slow". i.e. a 5 minute retry timeout is
configured as:

# echo slow > fail_method
# echo 0 > max_retries
# echo 300 > retry_timeout

> retry-timeout=0 (Fail fast)
> retry-timeout=X (Fail slow)
> retry-timeout=-1 (Never Give up).

What do we do when we want to add a different failure type
with different configuration requirements?

> Also do we really need this timeout per error type.

I don't follow your logic here.  What do need a timeout for with
either the "never" or "fast" failure configurations?

> Also would be nice if this timeout was configurable using a mount
> option. Then we can just specify it during mount time and be done
> with it.

That way lies madness.  The error configuration iinfrastructure we
need is not just for ENOSPC errors on metadata buffers.  We need
configurable error behaviour for multiple different errors in
multiple different subsystems (e.g. data IO failure vs metadata
buffer IO failure vs memory allocation failure vs inode corruption
vs freespace corruption vs ).

And we still would need the sysfs interface for querying and
configuring at runtime, so mount options are just a bad idea.  And
with sysfs, the potential future route for automatic configuration
at mount time is via udev events and configuration files, similar to
block devices.

> Idea of auto tuning based on what block device is doing sounds reasonable
> but that should not be a requirement for this patch and can go in even
> later. It is one of those nice to have features.

"this patch"? Just the core infrastructure so far:

11 files changed, 290 insertions(+), 60 deletions(-)

and that will need to be split into 4-5 patches for review. There's
a bunch of cleanup that preceeds this, and then there's a patch per
error type we are going to handle in metadata buffer IO completion.
IOWs, the dm-thinp autotuning is just a simple, small patch at the
end of a much larger series - it's maybe 10 lines of code in XFS...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [dm-devel] [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-23 Thread Mikulas Patocka



On Wed, 22 Jul 2015, Dave Chinner wrote:

> On Wed, Jul 22, 2015 at 10:09:23AM +1000, Dave Chinner wrote:
> > On Tue, Jul 21, 2015 at 01:47:53PM -0400, Mike Snitzer wrote:
> > > On Tue, Jul 21 2015 at 11:34am -0400, Eric Sandeen  
> > > wrote:
> > > > On 7/20/15 5:36 PM, Dave Chinner wrote:
> > > > The issue we had discussed previously is that there is no agreement
> > > > across block devices about whether ENOSPC is a permanent or temporary
> > > > condition.  Asking the admin to tune  the fs to each block device's
> > > > behavior sucks, IMHO.
> > > 
> > > It does suck, but it beats the alternative of XFS continuing to do
> > > nothing about the problem.
> > 
> > Just a comment on that: doing nothing is better than doing the wrong
> > thing and being stuck with it forever. :)
> > 
> > > Disucssing more with Vivek, might be that XFS would be best served to
> > > model what dm-thinp has provided with its 'no_space_timeout'.  It
> > > defaults to queueing IO for 60 seconds, once the timeout expires the
> > > queued IOs getted errored.  If set to 0 dm-thinp will queue IO
> > > indefinitely.
> > 
> > Yes, that's exactly what I proposed in the thread I referenced in
> > my previous email, and what got stuck on the bikeshed wall because
> > of these concerns about knob twiddling:
> > 
> > http://oss.sgi.com/archives/xfs/2015-02/msg00346.html
> > 
> > | e.g. if we need configurable error handling, it needs to be
> > | configurable for different error types, and it needs to be
> > | configurable on a per-mount basis. And it needs to be configurable
> > | at runtime, not just at mount time. That kind of leads to using
> > | sysfs for this. e.g. for each error type we ned to handle different
> > | behaviour for:
> > | 
> > | $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/type
> > | [transient] permanent
> > | $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/perm_timeout_seconds
> > | 300
> > | $ cat
> > | /sys/fs/xfs/vda/meta_write_errors/enospc/perm_max_retry_attempts
> > | 50
> > | $ cat
> > | /sys/fs/xfs/vda/meta_write_errors/enospc/transient_fail_at_umount
> > | 1
> > 
> > I've rebased this patchset, and I'm cleaning it up now, so in a few
> > days I'll have something for review, likely for the 4.3 merge
> > window
> 
> Just thinking a bit more on how to make this simpler to configure,
> is there a simple way for the filesystem to determine the current
> config of the dm thinp volume?

You can just stop retrying the I/Os when the user attempts to unmount the 
filesystem - then, you don't need any configuration option.

Mikulas

> i.e. if the dm-thinp volume is
> configured to error out immediately on enospc, then XFS should
> default to doing the same thing. having XFS be able to grab this
> status at mount time and change the default ENOSPC error config from
> transient to permanent on such dm-thinp volumes would go a long way
> to making these configs Just Do The Right Thing on block dev enospc
> errors...
> 
> e.g. if dm-thinp is configured to queue for 60s and then fail on
> ENOSPC, we want XFS to fail immediately on ENOSPC in metadata IO. If
> dm-thinp is configured to ENOSPC instantly (i.e. no queueing) then
> we want XFS to retry and use it's default retry maximums before
> failing permanently.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com
> 
> --
> dm-devel mailing list
> dm-de...@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-23 Thread Vivek Goyal

On Thu, Jul 23, 2015 at 03:10:43PM +1000, Dave Chinner wrote:

[..]
> I don't think knowing the bdev timeout is necessary because the
> default is most likely to be "fail fast" in this case. i.e. no
> retries, just shut down.  IOWs, if we describe the configs and
> actions in neutral terms, then the default configurations easy for
> users to understand. i.e:
> 
> bdev enospc   XFS default
> ---   ---
> Fail slow Fail fast
> Fail fast Fail slow
> Fail neverFail never, Record in log
> EOPNOTSUPPFail never
> 
> With that in mind, I'm thinking I should drop the
> "permanent/transient" error classifications, and change it "failure
> behaviour" with the options "fast slow [never]" and only the slow
> option has retry/timeout configuration options.  I think the "never"
> option still needs to "fail at unmount" config variable, but we
> enable it by default rather than hanging unmount and requiring a
> manual shutdown like we do now

I am wondering instead of 4 knobs (fast,slow,never,retry-timeout) can
we just do with one knob per error type and that is retry-timout.

retry-timeout=0 (Fail fast)
retry-timeout=X (Fail slow)
retry-timeout=-1 (Never Give up).

Also do we really need this timeout per error type.

Also would be nice if this timeout was configurable using a mount
option. Then we can just specify it during mount time and be done
with it.

Idea of auto tuning based on what block device is doing sounds reasonable
but that should not be a requirement for this patch and can go in even
later. It is one of those nice to have features.

Thanks
Vivek
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-23 Thread Mike Snitzer

On Thu, Jul 23 2015 at  1:10am -0400,
Dave Chinner  wrote:

> On Wed, Jul 22, 2015 at 11:28:06AM -0500, Eric Sandeen wrote:
> > On 7/22/15 8:34 AM, Mike Snitzer wrote:
> > > On Tue, Jul 21 2015 at 10:37pm -0400,
> > > Dave Chinner  wrote:
> > >> On Tue, Jul 21, 2015 at 09:40:29PM -0400, Mike Snitzer wrote:
> > >>> I'm open to considering alternative interfaces for getting you the info
> > >>> you need.  I just don't have a great sense for what mechanism you'd like
> > >>> to use.  Do we invent a new block device operations table method that
> > >>> sets values in a 'struct no_space_strategy' passed in to the
> > >>> blockdevice?
> > >>
> > >> It's long been frowned on having the filesystems dig into block
> > >> device structures. We have lots of wrapper functions for getting
> > >> information from or performing operations on block devices. (e.g.
> > >> bdev_read_only(), bdev_get_queue(), blkdev_issue_flush(),
> > >> blkdev_issue_zeroout(), etc) and so I think this is the pattern we'd
> > >> need to follow. If we do that - bdev_get_nospace_strategy() - then
> > >> how that information gets to the filesystem is completely opaque
> > >> at the fs level, and the block layer can implement it in whatever
> > >> way is considered sane...
> > >>
> > >> And, realistically, all we really need returned is a enum to tell us
> > >> how the bdev behaves on enospc:
> > >>  - bdev fails fast, (i.e. immediate ENOSPC)
> > >>  - bdev fails slow, (i.e. queue for some time, then ENOSPC)
> > >>  - bdev never fails (i.e. queue forever)
> > >>  - bdev doesn't support this (i.e. EOPNOTSUPP)
> > 
> > I'm not sure how this is more useful than the bdev simply responding to
> > a query of "should we keep trying IOs?"
> 
>   - bdev fails fast, (i.e. immediate ENOSPC)
> 
> XFS should use a bound retry behaviour for to allow the possiblity of
> the admin adding more space before we shut down the fs. i.e.
> XFS fails slow.
> 
>   - bdev fails slow, (i.e. queue for some time, then ENOSPC)
> 
> We know that IOs are going to be delayed before they are failed, so
> there's no point in retrying as the admin has already had a chance
> to resolve the ENOSPC condition before failure was reported. i.e.
> XFS fails fast.
> 
>   - bdev never fails (i.e. queue forever)
> 
> Block device will appear to hang when it runs out of space. Nothing
> XFS can do here because IOs never fail, but we need to note this in
> the log at mount time so that filesystem hangs are easily explained
> when reported to us.
> 
>   - bdev doesn't support this (i.e. EOPNOTSUPP)
> 
> XFS uses default "retry forever" behaviour.
> 
> > > This 'struct no_space_strategy' would be invented purely for
> > > informational purposes for upper layers' benefit -- I don't consider it
> > > a "block device structure" it the traditional sense.
> > > 
> > > I was thinking upper layers would like to know the actual timeout value
> > > for the "fails slow" case.  As such the 'struct no_space_strategy' would
> > > have the enum and the timeout.  And would be returned with a call:
> > >  bdev_get_nospace_strategy(bdev, _space_strategy)
> > 
> > Asking for the timeout value seems to add complexity.  It could change after
> > we ask, and knowing it now requires another layer to be handling timeouts...
> 
> I don't think knowing the bdev timeout is necessary because the
> default is most likely to be "fail fast" in this case. i.e. no
> retries, just shut down.  IOWs, if we describe the configs and
> actions in neutral terms, then the default configurations easy for
> users to understand. i.e:
> 
> bdev enospc   XFS default
> ---   ---
> Fail slow Fail fast
> Fail fast Fail slow
> Fail neverFail never, Record in log
> EOPNOTSUPPFail never
> 
> With that in mind, I'm thinking I should drop the
> "permanent/transient" error classifications, and change it "failure
> behaviour" with the options "fast slow [never]" and only the slow
> option has retry/timeout configuration options.  I think the "never"
> option still needs to "fail at unmount" config variable, but we
> enable it by default rather than hanging unmount and requiring a
> manual shutdown like we do now

This all sounds good to me.  The simpler XFS configuration looks like a
nice improvement.

If you just want to stub out the call to bdev_get_nospace_strategy() I
can crank through implementing it once I get a few minutes.

Btw, not sure what I was thinking when suggesting XFS would benefit from
knowing the duration of the thinp no_space_timeout.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-23 Thread Dave Chinner

On Thu, Jul 23, 2015 at 12:43:58PM -0400, Vivek Goyal wrote:
 On Thu, Jul 23, 2015 at 03:10:43PM +1000, Dave Chinner wrote:
 
 [..]
  I don't think knowing the bdev timeout is necessary because the
  default is most likely to be fail fast in this case. i.e. no
  retries, just shut down.  IOWs, if we describe the configs and
  actions in neutral terms, then the default configurations easy for
  users to understand. i.e:
  
  bdev enospc XFS default
  --- ---
  Fail slow   Fail fast
  Fail fast   Fail slow
  Fail never  Fail never, Record in log
  EOPNOTSUPP  Fail never
  
  With that in mind, I'm thinking I should drop the
  permanent/transient error classifications, and change it failure
  behaviour with the options fast slow [never] and only the slow
  option has retry/timeout configuration options.  I think the never
  option still needs to fail at unmount config variable, but we
  enable it by default rather than hanging unmount and requiring a
  manual shutdown like we do now
 
 I am wondering instead of 4 knobs (fast,slow,never,retry-timeout) can
 we just do with one knob per error type and that is retry-timout.

retry-timeout == fail slow. i.e. a 5 minute retry timeout is
configured as:

# echo slow  fail_method
# echo 0  max_retries
# echo 300  retry_timeout

 retry-timeout=0 (Fail fast)
 retry-timeout=X (Fail slow)
 retry-timeout=-1 (Never Give up).

What do we do when we want to add a different failure type
with different configuration requirements?

 Also do we really need this timeout per error type.

I don't follow your logic here.  What do need a timeout for with
either the never or fast failure configurations?

 Also would be nice if this timeout was configurable using a mount
 option. Then we can just specify it during mount time and be done
 with it.

That way lies madness.  The error configuration iinfrastructure we
need is not just for ENOSPC errors on metadata buffers.  We need
configurable error behaviour for multiple different errors in
multiple different subsystems (e.g. data IO failure vs metadata
buffer IO failure vs memory allocation failure vs inode corruption
vs freespace corruption vs ).

And we still would need the sysfs interface for querying and
configuring at runtime, so mount options are just a bad idea.  And
with sysfs, the potential future route for automatic configuration
at mount time is via udev events and configuration files, similar to
block devices.

 Idea of auto tuning based on what block device is doing sounds reasonable
 but that should not be a requirement for this patch and can go in even
 later. It is one of those nice to have features.

this patch? Just the core infrastructure so far:

11 files changed, 290 insertions(+), 60 deletions(-)

and that will need to be split into 4-5 patches for review. There's
a bunch of cleanup that preceeds this, and then there's a patch per
error type we are going to handle in metadata buffer IO completion.
IOWs, the dm-thinp autotuning is just a simple, small patch at the
end of a much larger series - it's maybe 10 lines of code in XFS...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-23 Thread Vivek Goyal

On Fri, Jul 24, 2015 at 09:00:54AM +1000, Dave Chinner wrote:
 On Thu, Jul 23, 2015 at 12:43:58PM -0400, Vivek Goyal wrote:
  On Thu, Jul 23, 2015 at 03:10:43PM +1000, Dave Chinner wrote:
  
  [..]
   I don't think knowing the bdev timeout is necessary because the
   default is most likely to be fail fast in this case. i.e. no
   retries, just shut down.  IOWs, if we describe the configs and
   actions in neutral terms, then the default configurations easy for
   users to understand. i.e:
   
   bdev enospc   XFS default
   ---   ---
   Fail slow Fail fast
   Fail fast Fail slow
   Fail neverFail never, Record in log
   EOPNOTSUPPFail never
   
   With that in mind, I'm thinking I should drop the
   permanent/transient error classifications, and change it failure
   behaviour with the options fast slow [never] and only the slow
   option has retry/timeout configuration options.  I think the never
   option still needs to fail at unmount config variable, but we
   enable it by default rather than hanging unmount and requiring a
   manual shutdown like we do now
  
  I am wondering instead of 4 knobs (fast,slow,never,retry-timeout) can
  we just do with one knob per error type and that is retry-timout.
 
 retry-timeout == fail slow. i.e. a 5 minute retry timeout is
 configured as:
 
 # echo slow  fail_method
 # echo 0  max_retries
 # echo 300  retry_timeout

Hi Dave,

I am sure I am missing something but I will anyway ask. Why do we need this
knob fail_method. Isn't it sort of implied in other two knobs based
on their values.

max_retries=0 retry_timeout=0 implies fail_method=fast.

A non-zero value of max_retries or retry_timeout implies fail_method=slow

A very high value (-1) of either max_retries or retry_timeout implies
fail_method=almost never.

  retry-timeout=0 (Fail fast)
  retry-timeout=X (Fail slow)
  retry-timeout=-1 (Never Give up).
 
 What do we do when we want to add a different failure type
 with different configuration requirements?

Ok, got it. So we are targettting something very generic so that other
cases can be handled too.

 
  Also do we really need this timeout per error type.
 
 I don't follow your logic here.  What do need a timeout for with
 either the never or fast failure configurations?

Ignore this. I had misunderstood it.

 
  Also would be nice if this timeout was configurable using a mount
  option. Then we can just specify it during mount time and be done
  with it.
 
 That way lies madness.  The error configuration iinfrastructure we
 need is not just for ENOSPC errors on metadata buffers.  We need
 configurable error behaviour for multiple different errors in
 multiple different subsystems (e.g. data IO failure vs metadata
 buffer IO failure vs memory allocation failure vs inode corruption
 vs freespace corruption vs ).
 
 And we still would need the sysfs interface for querying and
 configuring at runtime, so mount options are just a bad idea.  And
 with sysfs, the potential future route for automatic configuration
 at mount time is via udev events and configuration files, similar to
 block devices.

Agreed that sysfs provides lots of flexibility here. I guess I was
just thinking in terms of solving this particular issue  we are facing.

 
  Idea of auto tuning based on what block device is doing sounds reasonable
  but that should not be a requirement for this patch and can go in even
  later. It is one of those nice to have features.
 
 this patch? Just the core infrastructure so far:

I was referring to Mike's patch where we add additional method to block
device operations.

 
 11 files changed, 290 insertions(+), 60 deletions(-)
 
 and that will need to be split into 4-5 patches for review. There's
 a bunch of cleanup that preceeds this, and then there's a patch per
 error type we are going to handle in metadata buffer IO completion.
 IOWs, the dm-thinp autotuning is just a simple, small patch at the
 end of a much larger series - it's maybe 10 lines of code in XFS...

Ok. I will wait for the final patches. 

Thanks
Vivek
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [dm-devel] [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-23 Thread Dave Chinner

On Thu, Jul 23, 2015 at 01:08:36PM -0400, Mikulas Patocka wrote:
 On Wed, 22 Jul 2015, Dave Chinner wrote:
  On Wed, Jul 22, 2015 at 10:09:23AM +1000, Dave Chinner wrote:
   On Tue, Jul 21, 2015 at 01:47:53PM -0400, Mike Snitzer wrote:
   | $ cat
   | /sys/fs/xfs/vda/meta_write_errors/enospc/transient_fail_at_umount
   | 1

[...]

 You can just stop retrying the I/Os when the user attempts to unmount the 
 filesystem - then, you don't need any configuration option.

See above - the default will do that, but there are users who do not
want that unmount behaviour

-Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-23 Thread Mike Snitzer

On Thu, Jul 23 2015 at  1:10am -0400,
Dave Chinner da...@fromorbit.com wrote:

 On Wed, Jul 22, 2015 at 11:28:06AM -0500, Eric Sandeen wrote:
  On 7/22/15 8:34 AM, Mike Snitzer wrote:
   On Tue, Jul 21 2015 at 10:37pm -0400,
   Dave Chinner da...@fromorbit.com wrote:
   On Tue, Jul 21, 2015 at 09:40:29PM -0400, Mike Snitzer wrote:
   I'm open to considering alternative interfaces for getting you the info
   you need.  I just don't have a great sense for what mechanism you'd like
   to use.  Do we invent a new block device operations table method that
   sets values in a 'struct no_space_strategy' passed in to the
   blockdevice?
  
   It's long been frowned on having the filesystems dig into block
   device structures. We have lots of wrapper functions for getting
   information from or performing operations on block devices. (e.g.
   bdev_read_only(), bdev_get_queue(), blkdev_issue_flush(),
   blkdev_issue_zeroout(), etc) and so I think this is the pattern we'd
   need to follow. If we do that - bdev_get_nospace_strategy() - then
   how that information gets to the filesystem is completely opaque
   at the fs level, and the block layer can implement it in whatever
   way is considered sane...
  
   And, realistically, all we really need returned is a enum to tell us
   how the bdev behaves on enospc:
- bdev fails fast, (i.e. immediate ENOSPC)
- bdev fails slow, (i.e. queue for some time, then ENOSPC)
- bdev never fails (i.e. queue forever)
- bdev doesn't support this (i.e. EOPNOTSUPP)
  
  I'm not sure how this is more useful than the bdev simply responding to
  a query of should we keep trying IOs?
 
   - bdev fails fast, (i.e. immediate ENOSPC)
 
 XFS should use a bound retry behaviour for to allow the possiblity of
 the admin adding more space before we shut down the fs. i.e.
 XFS fails slow.
 
   - bdev fails slow, (i.e. queue for some time, then ENOSPC)
 
 We know that IOs are going to be delayed before they are failed, so
 there's no point in retrying as the admin has already had a chance
 to resolve the ENOSPC condition before failure was reported. i.e.
 XFS fails fast.
 
   - bdev never fails (i.e. queue forever)
 
 Block device will appear to hang when it runs out of space. Nothing
 XFS can do here because IOs never fail, but we need to note this in
 the log at mount time so that filesystem hangs are easily explained
 when reported to us.
 
   - bdev doesn't support this (i.e. EOPNOTSUPP)
 
 XFS uses default retry forever behaviour.
 
   This 'struct no_space_strategy' would be invented purely for
   informational purposes for upper layers' benefit -- I don't consider it
   a block device structure it the traditional sense.
   
   I was thinking upper layers would like to know the actual timeout value
   for the fails slow case.  As such the 'struct no_space_strategy' would
   have the enum and the timeout.  And would be returned with a call:
bdev_get_nospace_strategy(bdev, no_space_strategy)
  
  Asking for the timeout value seems to add complexity.  It could change after
  we ask, and knowing it now requires another layer to be handling timeouts...
 
 I don't think knowing the bdev timeout is necessary because the
 default is most likely to be fail fast in this case. i.e. no
 retries, just shut down.  IOWs, if we describe the configs and
 actions in neutral terms, then the default configurations easy for
 users to understand. i.e:
 
 bdev enospc   XFS default
 ---   ---
 Fail slow Fail fast
 Fail fast Fail slow
 Fail neverFail never, Record in log
 EOPNOTSUPPFail never
 
 With that in mind, I'm thinking I should drop the
 permanent/transient error classifications, and change it failure
 behaviour with the options fast slow [never] and only the slow
 option has retry/timeout configuration options.  I think the never
 option still needs to fail at unmount config variable, but we
 enable it by default rather than hanging unmount and requiring a
 manual shutdown like we do now

This all sounds good to me.  The simpler XFS configuration looks like a
nice improvement.

If you just want to stub out the call to bdev_get_nospace_strategy() I
can crank through implementing it once I get a few minutes.

Btw, not sure what I was thinking when suggesting XFS would benefit from
knowing the duration of the thinp no_space_timeout.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-23 Thread Vivek Goyal

On Thu, Jul 23, 2015 at 03:10:43PM +1000, Dave Chinner wrote:

[..]
 I don't think knowing the bdev timeout is necessary because the
 default is most likely to be fail fast in this case. i.e. no
 retries, just shut down.  IOWs, if we describe the configs and
 actions in neutral terms, then the default configurations easy for
 users to understand. i.e:
 
 bdev enospc   XFS default
 ---   ---
 Fail slow Fail fast
 Fail fast Fail slow
 Fail neverFail never, Record in log
 EOPNOTSUPPFail never
 
 With that in mind, I'm thinking I should drop the
 permanent/transient error classifications, and change it failure
 behaviour with the options fast slow [never] and only the slow
 option has retry/timeout configuration options.  I think the never
 option still needs to fail at unmount config variable, but we
 enable it by default rather than hanging unmount and requiring a
 manual shutdown like we do now

I am wondering instead of 4 knobs (fast,slow,never,retry-timeout) can
we just do with one knob per error type and that is retry-timout.

retry-timeout=0 (Fail fast)
retry-timeout=X (Fail slow)
retry-timeout=-1 (Never Give up).

Also do we really need this timeout per error type.

Also would be nice if this timeout was configurable using a mount
option. Then we can just specify it during mount time and be done
with it.

Idea of auto tuning based on what block device is doing sounds reasonable
but that should not be a requirement for this patch and can go in even
later. It is one of those nice to have features.

Thanks
Vivek
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [dm-devel] [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-23 Thread Mikulas Patocka



On Wed, 22 Jul 2015, Dave Chinner wrote:

 On Wed, Jul 22, 2015 at 10:09:23AM +1000, Dave Chinner wrote:
  On Tue, Jul 21, 2015 at 01:47:53PM -0400, Mike Snitzer wrote:
   On Tue, Jul 21 2015 at 11:34am -0400, Eric Sandeen sand...@redhat.com 
   wrote:
On 7/20/15 5:36 PM, Dave Chinner wrote:
The issue we had discussed previously is that there is no agreement
across block devices about whether ENOSPC is a permanent or temporary
condition.  Asking the admin to tune  the fs to each block device's
behavior sucks, IMHO.
   
   It does suck, but it beats the alternative of XFS continuing to do
   nothing about the problem.
  
  Just a comment on that: doing nothing is better than doing the wrong
  thing and being stuck with it forever. :)
  
   Disucssing more with Vivek, might be that XFS would be best served to
   model what dm-thinp has provided with its 'no_space_timeout'.  It
   defaults to queueing IO for 60 seconds, once the timeout expires the
   queued IOs getted errored.  If set to 0 dm-thinp will queue IO
   indefinitely.
  
  Yes, that's exactly what I proposed in the thread I referenced in
  my previous email, and what got stuck on the bikeshed wall because
  of these concerns about knob twiddling:
  
  http://oss.sgi.com/archives/xfs/2015-02/msg00346.html
  
  | e.g. if we need configurable error handling, it needs to be
  | configurable for different error types, and it needs to be
  | configurable on a per-mount basis. And it needs to be configurable
  | at runtime, not just at mount time. That kind of leads to using
  | sysfs for this. e.g. for each error type we ned to handle different
  | behaviour for:
  | 
  | $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/type
  | [transient] permanent
  | $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/perm_timeout_seconds
  | 300
  | $ cat
  | /sys/fs/xfs/vda/meta_write_errors/enospc/perm_max_retry_attempts
  | 50
  | $ cat
  | /sys/fs/xfs/vda/meta_write_errors/enospc/transient_fail_at_umount
  | 1
  
  I've rebased this patchset, and I'm cleaning it up now, so in a few
  days I'll have something for review, likely for the 4.3 merge
  window
 
 Just thinking a bit more on how to make this simpler to configure,
 is there a simple way for the filesystem to determine the current
 config of the dm thinp volume?

You can just stop retrying the I/Os when the user attempts to unmount the 
filesystem - then, you don't need any configuration option.

Mikulas

 i.e. if the dm-thinp volume is
 configured to error out immediately on enospc, then XFS should
 default to doing the same thing. having XFS be able to grab this
 status at mount time and change the default ENOSPC error config from
 transient to permanent on such dm-thinp volumes would go a long way
 to making these configs Just Do The Right Thing on block dev enospc
 errors...
 
 e.g. if dm-thinp is configured to queue for 60s and then fail on
 ENOSPC, we want XFS to fail immediately on ENOSPC in metadata IO. If
 dm-thinp is configured to ENOSPC instantly (i.e. no queueing) then
 we want XFS to retry and use it's default retry maximums before
 failing permanently.
 
 Cheers,
 
 Dave.
 -- 
 Dave Chinner
 da...@fromorbit.com
 
 --
 dm-devel mailing list
 dm-de...@redhat.com
 https://www.redhat.com/mailman/listinfo/dm-devel
 
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-22 Thread Dave Chinner

On Wed, Jul 22, 2015 at 11:28:06AM -0500, Eric Sandeen wrote:
> On 7/22/15 8:34 AM, Mike Snitzer wrote:
> > On Tue, Jul 21 2015 at 10:37pm -0400,
> > Dave Chinner  wrote:
> >> On Tue, Jul 21, 2015 at 09:40:29PM -0400, Mike Snitzer wrote:
> >>> I'm open to considering alternative interfaces for getting you the info
> >>> you need.  I just don't have a great sense for what mechanism you'd like
> >>> to use.  Do we invent a new block device operations table method that
> >>> sets values in a 'struct no_space_strategy' passed in to the
> >>> blockdevice?
> >>
> >> It's long been frowned on having the filesystems dig into block
> >> device structures. We have lots of wrapper functions for getting
> >> information from or performing operations on block devices. (e.g.
> >> bdev_read_only(), bdev_get_queue(), blkdev_issue_flush(),
> >> blkdev_issue_zeroout(), etc) and so I think this is the pattern we'd
> >> need to follow. If we do that - bdev_get_nospace_strategy() - then
> >> how that information gets to the filesystem is completely opaque
> >> at the fs level, and the block layer can implement it in whatever
> >> way is considered sane...
> >>
> >> And, realistically, all we really need returned is a enum to tell us
> >> how the bdev behaves on enospc:
> >>- bdev fails fast, (i.e. immediate ENOSPC)
> >>- bdev fails slow, (i.e. queue for some time, then ENOSPC)
> >>- bdev never fails (i.e. queue forever)
> >>- bdev doesn't support this (i.e. EOPNOTSUPP)
> 
> I'm not sure how this is more useful than the bdev simply responding to
> a query of "should we keep trying IOs?"

- bdev fails fast, (i.e. immediate ENOSPC)

XFS should use a bound retry behaviour for to allow the possiblity of
the admin adding more space before we shut down the fs. i.e.
XFS fails slow.

- bdev fails slow, (i.e. queue for some time, then ENOSPC)

We know that IOs are going to be delayed before they are failed, so
there's no point in retrying as the admin has already had a chance
to resolve the ENOSPC condition before failure was reported. i.e.
XFS fails fast.

- bdev never fails (i.e. queue forever)

Block device will appear to hang when it runs out of space. Nothing
XFS can do here because IOs never fail, but we need to note this in
the log at mount time so that filesystem hangs are easily explained
when reported to us.

- bdev doesn't support this (i.e. EOPNOTSUPP)

XFS uses default "retry forever" behaviour.

> > This 'struct no_space_strategy' would be invented purely for
> > informational purposes for upper layers' benefit -- I don't consider it
> > a "block device structure" it the traditional sense.
> > 
> > I was thinking upper layers would like to know the actual timeout value
> > for the "fails slow" case.  As such the 'struct no_space_strategy' would
> > have the enum and the timeout.  And would be returned with a call:
> >  bdev_get_nospace_strategy(bdev, _space_strategy)
> 
> Asking for the timeout value seems to add complexity.  It could change after
> we ask, and knowing it now requires another layer to be handling timeouts...

I don't think knowing the bdev timeout is necessary because the
default is most likely to be "fail fast" in this case. i.e. no
retries, just shut down.  IOWs, if we describe the configs and
actions in neutral terms, then the default configurations easy for
users to understand. i.e:

bdev enospc XFS default
--- ---
Fail slow   Fail fast
Fail fast   Fail slow
Fail never  Fail never, Record in log
EOPNOTSUPP  Fail never

With that in mind, I'm thinking I should drop the
"permanent/transient" error classifications, and change it "failure
behaviour" with the options "fast slow [never]" and only the slow
option has retry/timeout configuration options.  I think the "never"
option still needs to "fail at unmount" config variable, but we
enable it by default rather than hanging unmount and requiring a
manual shutdown like we do now

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-22 Thread Mike Snitzer

On Wed, Jul 22 2015 at 12:28pm -0400,
Eric Sandeen  wrote:

> On 7/22/15 8:34 AM, Mike Snitzer wrote:
> > On Tue, Jul 21 2015 at 10:37pm -0400,
> > Dave Chinner  wrote:
> > 
> >> On Tue, Jul 21, 2015 at 09:40:29PM -0400, Mike Snitzer wrote:
> >>
> >>> I'm open to considering alternative interfaces for getting you the info
> >>> you need.  I just don't have a great sense for what mechanism you'd like
> >>> to use.  Do we invent a new block device operations table method that
> >>> sets values in a 'struct no_space_strategy' passed in to the
> >>> blockdevice?
> >>
> >> It's long been frowned on having the filesystems dig into block
> >> device structures. We have lots of wrapper functions for getting
> >> information from or performing operations on block devices. (e.g.
> >> bdev_read_only(), bdev_get_queue(), blkdev_issue_flush(),
> >> blkdev_issue_zeroout(), etc) and so I think this is the pattern we'd
> >> need to follow. If we do that - bdev_get_nospace_strategy() - then
> >> how that information gets to the filesystem is completely opaque
> >> at the fs level, and the block layer can implement it in whatever
> >> way is considered sane...
> >>
> >> And, realistically, all we really need returned is a enum to tell us
> >> how the bdev behaves on enospc:
> >>- bdev fails fast, (i.e. immediate ENOSPC)
> >>- bdev fails slow, (i.e. queue for some time, then ENOSPC)
> >>- bdev never fails (i.e. queue forever)
> >>- bdev doesn't support this (i.e. EOPNOTSUPP)
> 
> I'm not sure how this is more useful than the bdev simply responding to
> a query of "should we keep trying IOs?"
> 
> IOWS do we really care if it's failing fast or slow, vs. simply knowing
> whether it has now permanently failed?
> 
> So rather than "bdev_get_nospace_strategy" it seems like all we need
> to know is "bdev_has_failed" - do we really care about the details?

My bdev_has_space() proposal is no different then bdev_has_failed().  If
you prefer the more generic name then fine.  But bdev_has_failed() is of
limited utlity outside of devices that provide support.  So I can see
why Dave is resisting it.

Anyway, the benefit of XFS tailoring its independent config based on
dm-thinp's comparable config makes sense to me.  The reason for XFS's
independent config is it could be deployed on any storage (e.g. not
dm-thinp).

Affords XFS to defer to DM thinp but still have comparable functionality
for HW thinp or some other storage.

> > This 'struct no_space_strategy' would be invented purely for
> > informational purposes for upper layers' benefit -- I don't consider it
> > a "block device structure" it the traditional sense.
> > 
> > I was thinking upper layers would like to know the actual timeout value
> > for the "fails slow" case.  As such the 'struct no_space_strategy' would
> > have the enum and the timeout.  And would be returned with a call:
> >  bdev_get_nospace_strategy(bdev, _space_strategy)
> 
> Asking for the timeout value seems to add complexity.  It could change after
> we ask, and knowing it now requires another layer to be handling timeouts...

Dave is already saying XFS will have a timeout it'll be managing.
Stands to reason that XFS would base its timeout on DM thinp's timeout.
But yeah it does allow the stacked timeout that XFS uses to be out of
sync if the lower timeout changes (no different than blk_stack_limits).

Please fix this however you see fit.  I'll assist anywhere that makes
sense.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-22 Thread Eric Sandeen

On 7/22/15 8:34 AM, Mike Snitzer wrote:
> On Tue, Jul 21 2015 at 10:37pm -0400,
> Dave Chinner  wrote:
> 
>> On Tue, Jul 21, 2015 at 09:40:29PM -0400, Mike Snitzer wrote:
>>
>>> I'm open to considering alternative interfaces for getting you the info
>>> you need.  I just don't have a great sense for what mechanism you'd like
>>> to use.  Do we invent a new block device operations table method that
>>> sets values in a 'struct no_space_strategy' passed in to the
>>> blockdevice?
>>
>> It's long been frowned on having the filesystems dig into block
>> device structures. We have lots of wrapper functions for getting
>> information from or performing operations on block devices. (e.g.
>> bdev_read_only(), bdev_get_queue(), blkdev_issue_flush(),
>> blkdev_issue_zeroout(), etc) and so I think this is the pattern we'd
>> need to follow. If we do that - bdev_get_nospace_strategy() - then
>> how that information gets to the filesystem is completely opaque
>> at the fs level, and the block layer can implement it in whatever
>> way is considered sane...
>>
>> And, realistically, all we really need returned is a enum to tell us
>> how the bdev behaves on enospc:
>>  - bdev fails fast, (i.e. immediate ENOSPC)
>>  - bdev fails slow, (i.e. queue for some time, then ENOSPC)
>>  - bdev never fails (i.e. queue forever)
>>  - bdev doesn't support this (i.e. EOPNOTSUPP)

I'm not sure how this is more useful than the bdev simply responding to
a query of "should we keep trying IOs?"

IOWS do we really care if it's failing fast or slow, vs. simply knowing
whether it has now permanently failed?

So rather than "bdev_get_nospace_strategy" it seems like all we need
to know is "bdev_has_failed" - do we really care about the details?

> This 'struct no_space_strategy' would be invented purely for
> informational purposes for upper layers' benefit -- I don't consider it
> a "block device structure" it the traditional sense.
> 
> I was thinking upper layers would like to know the actual timeout value
> for the "fails slow" case.  As such the 'struct no_space_strategy' would
> have the enum and the timeout.  And would be returned with a call:
>  bdev_get_nospace_strategy(bdev, _space_strategy)

Asking for the timeout value seems to add complexity.  It could change after
we ask, and knowing it now requires another layer to be handling timeouts...

Thanks,
-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-22 Thread Mike Snitzer

On Tue, Jul 21 2015 at 10:37pm -0400,
Dave Chinner  wrote:

> On Tue, Jul 21, 2015 at 09:40:29PM -0400, Mike Snitzer wrote:
> 
> > I'm open to considering alternative interfaces for getting you the info
> > you need.  I just don't have a great sense for what mechanism you'd like
> > to use.  Do we invent a new block device operations table method that
> > sets values in a 'struct no_space_strategy' passed in to the
> > blockdevice?
> 
> It's long been frowned on having the filesystems dig into block
> device structures. We have lots of wrapper functions for getting
> information from or performing operations on block devices. (e.g.
> bdev_read_only(), bdev_get_queue(), blkdev_issue_flush(),
> blkdev_issue_zeroout(), etc) and so I think this is the pattern we'd
> need to follow. If we do that - bdev_get_nospace_strategy() - then
> how that information gets to the filesystem is completely opaque
> at the fs level, and the block layer can implement it in whatever
> way is considered sane...
> 
> And, realistically, all we really need returned is a enum to tell us
> how the bdev behaves on enospc:
>   - bdev fails fast, (i.e. immediate ENOSPC)
>   - bdev fails slow, (i.e. queue for some time, then ENOSPC)
>   - bdev never fails (i.e. queue forever)
>   - bdev doesn't support this (i.e. EOPNOTSUPP)

This 'struct no_space_strategy' would be invented purely for
informational purposes for upper layers' benefit -- I don't consider it
a "block device structure" it the traditional sense.

I was thinking upper layers would like to know the actual timeout value
for the "fails slow" case.  As such the 'struct no_space_strategy' would
have the enum and the timeout.  And would be returned with a call:
 bdev_get_nospace_strategy(bdev, _space_strategy)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-22 Thread Eric Sandeen

On 7/22/15 8:34 AM, Mike Snitzer wrote:
 On Tue, Jul 21 2015 at 10:37pm -0400,
 Dave Chinner da...@fromorbit.com wrote:
 
 On Tue, Jul 21, 2015 at 09:40:29PM -0400, Mike Snitzer wrote:

 I'm open to considering alternative interfaces for getting you the info
 you need.  I just don't have a great sense for what mechanism you'd like
 to use.  Do we invent a new block device operations table method that
 sets values in a 'struct no_space_strategy' passed in to the
 blockdevice?

 It's long been frowned on having the filesystems dig into block
 device structures. We have lots of wrapper functions for getting
 information from or performing operations on block devices. (e.g.
 bdev_read_only(), bdev_get_queue(), blkdev_issue_flush(),
 blkdev_issue_zeroout(), etc) and so I think this is the pattern we'd
 need to follow. If we do that - bdev_get_nospace_strategy() - then
 how that information gets to the filesystem is completely opaque
 at the fs level, and the block layer can implement it in whatever
 way is considered sane...

 And, realistically, all we really need returned is a enum to tell us
 how the bdev behaves on enospc:
  - bdev fails fast, (i.e. immediate ENOSPC)
  - bdev fails slow, (i.e. queue for some time, then ENOSPC)
  - bdev never fails (i.e. queue forever)
  - bdev doesn't support this (i.e. EOPNOTSUPP)

I'm not sure how this is more useful than the bdev simply responding to
a query of should we keep trying IOs?

IOWS do we really care if it's failing fast or slow, vs. simply knowing
whether it has now permanently failed?

So rather than bdev_get_nospace_strategy it seems like all we need
to know is bdev_has_failed - do we really care about the details?

 This 'struct no_space_strategy' would be invented purely for
 informational purposes for upper layers' benefit -- I don't consider it
 a block device structure it the traditional sense.
 
 I was thinking upper layers would like to know the actual timeout value
 for the fails slow case.  As such the 'struct no_space_strategy' would
 have the enum and the timeout.  And would be returned with a call:
  bdev_get_nospace_strategy(bdev, no_space_strategy)

Asking for the timeout value seems to add complexity.  It could change after
we ask, and knowing it now requires another layer to be handling timeouts...

Thanks,
-Eric
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-22 Thread Mike Snitzer

On Wed, Jul 22 2015 at 12:28pm -0400,
Eric Sandeen sand...@redhat.com wrote:

 On 7/22/15 8:34 AM, Mike Snitzer wrote:
  On Tue, Jul 21 2015 at 10:37pm -0400,
  Dave Chinner da...@fromorbit.com wrote:
  
  On Tue, Jul 21, 2015 at 09:40:29PM -0400, Mike Snitzer wrote:
 
  I'm open to considering alternative interfaces for getting you the info
  you need.  I just don't have a great sense for what mechanism you'd like
  to use.  Do we invent a new block device operations table method that
  sets values in a 'struct no_space_strategy' passed in to the
  blockdevice?
 
  It's long been frowned on having the filesystems dig into block
  device structures. We have lots of wrapper functions for getting
  information from or performing operations on block devices. (e.g.
  bdev_read_only(), bdev_get_queue(), blkdev_issue_flush(),
  blkdev_issue_zeroout(), etc) and so I think this is the pattern we'd
  need to follow. If we do that - bdev_get_nospace_strategy() - then
  how that information gets to the filesystem is completely opaque
  at the fs level, and the block layer can implement it in whatever
  way is considered sane...
 
  And, realistically, all we really need returned is a enum to tell us
  how the bdev behaves on enospc:
 - bdev fails fast, (i.e. immediate ENOSPC)
 - bdev fails slow, (i.e. queue for some time, then ENOSPC)
 - bdev never fails (i.e. queue forever)
 - bdev doesn't support this (i.e. EOPNOTSUPP)
 
 I'm not sure how this is more useful than the bdev simply responding to
 a query of should we keep trying IOs?
 
 IOWS do we really care if it's failing fast or slow, vs. simply knowing
 whether it has now permanently failed?
 
 So rather than bdev_get_nospace_strategy it seems like all we need
 to know is bdev_has_failed - do we really care about the details?

My bdev_has_space() proposal is no different then bdev_has_failed().  If
you prefer the more generic name then fine.  But bdev_has_failed() is of
limited utlity outside of devices that provide support.  So I can see
why Dave is resisting it.

Anyway, the benefit of XFS tailoring its independent config based on
dm-thinp's comparable config makes sense to me.  The reason for XFS's
independent config is it could be deployed on any storage (e.g. not
dm-thinp).

Affords XFS to defer to DM thinp but still have comparable functionality
for HW thinp or some other storage.

  This 'struct no_space_strategy' would be invented purely for
  informational purposes for upper layers' benefit -- I don't consider it
  a block device structure it the traditional sense.
  
  I was thinking upper layers would like to know the actual timeout value
  for the fails slow case.  As such the 'struct no_space_strategy' would
  have the enum and the timeout.  And would be returned with a call:
   bdev_get_nospace_strategy(bdev, no_space_strategy)
 
 Asking for the timeout value seems to add complexity.  It could change after
 we ask, and knowing it now requires another layer to be handling timeouts...

Dave is already saying XFS will have a timeout it'll be managing.
Stands to reason that XFS would base its timeout on DM thinp's timeout.
But yeah it does allow the stacked timeout that XFS uses to be out of
sync if the lower timeout changes (no different than blk_stack_limits).

Please fix this however you see fit.  I'll assist anywhere that makes
sense.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-22 Thread Mike Snitzer

On Tue, Jul 21 2015 at 10:37pm -0400,
Dave Chinner da...@fromorbit.com wrote:

 On Tue, Jul 21, 2015 at 09:40:29PM -0400, Mike Snitzer wrote:
 
  I'm open to considering alternative interfaces for getting you the info
  you need.  I just don't have a great sense for what mechanism you'd like
  to use.  Do we invent a new block device operations table method that
  sets values in a 'struct no_space_strategy' passed in to the
  blockdevice?
 
 It's long been frowned on having the filesystems dig into block
 device structures. We have lots of wrapper functions for getting
 information from or performing operations on block devices. (e.g.
 bdev_read_only(), bdev_get_queue(), blkdev_issue_flush(),
 blkdev_issue_zeroout(), etc) and so I think this is the pattern we'd
 need to follow. If we do that - bdev_get_nospace_strategy() - then
 how that information gets to the filesystem is completely opaque
 at the fs level, and the block layer can implement it in whatever
 way is considered sane...
 
 And, realistically, all we really need returned is a enum to tell us
 how the bdev behaves on enospc:
   - bdev fails fast, (i.e. immediate ENOSPC)
   - bdev fails slow, (i.e. queue for some time, then ENOSPC)
   - bdev never fails (i.e. queue forever)
   - bdev doesn't support this (i.e. EOPNOTSUPP)

This 'struct no_space_strategy' would be invented purely for
informational purposes for upper layers' benefit -- I don't consider it
a block device structure it the traditional sense.

I was thinking upper layers would like to know the actual timeout value
for the fails slow case.  As such the 'struct no_space_strategy' would
have the enum and the timeout.  And would be returned with a call:
 bdev_get_nospace_strategy(bdev, no_space_strategy)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-22 Thread Dave Chinner

On Wed, Jul 22, 2015 at 11:28:06AM -0500, Eric Sandeen wrote:
 On 7/22/15 8:34 AM, Mike Snitzer wrote:
  On Tue, Jul 21 2015 at 10:37pm -0400,
  Dave Chinner da...@fromorbit.com wrote:
  On Tue, Jul 21, 2015 at 09:40:29PM -0400, Mike Snitzer wrote:
  I'm open to considering alternative interfaces for getting you the info
  you need.  I just don't have a great sense for what mechanism you'd like
  to use.  Do we invent a new block device operations table method that
  sets values in a 'struct no_space_strategy' passed in to the
  blockdevice?
 
  It's long been frowned on having the filesystems dig into block
  device structures. We have lots of wrapper functions for getting
  information from or performing operations on block devices. (e.g.
  bdev_read_only(), bdev_get_queue(), blkdev_issue_flush(),
  blkdev_issue_zeroout(), etc) and so I think this is the pattern we'd
  need to follow. If we do that - bdev_get_nospace_strategy() - then
  how that information gets to the filesystem is completely opaque
  at the fs level, and the block layer can implement it in whatever
  way is considered sane...
 
  And, realistically, all we really need returned is a enum to tell us
  how the bdev behaves on enospc:
 - bdev fails fast, (i.e. immediate ENOSPC)
 - bdev fails slow, (i.e. queue for some time, then ENOSPC)
 - bdev never fails (i.e. queue forever)
 - bdev doesn't support this (i.e. EOPNOTSUPP)
 
 I'm not sure how this is more useful than the bdev simply responding to
 a query of should we keep trying IOs?

- bdev fails fast, (i.e. immediate ENOSPC)

XFS should use a bound retry behaviour for to allow the possiblity of
the admin adding more space before we shut down the fs. i.e.
XFS fails slow.

- bdev fails slow, (i.e. queue for some time, then ENOSPC)

We know that IOs are going to be delayed before they are failed, so
there's no point in retrying as the admin has already had a chance
to resolve the ENOSPC condition before failure was reported. i.e.
XFS fails fast.

- bdev never fails (i.e. queue forever)

Block device will appear to hang when it runs out of space. Nothing
XFS can do here because IOs never fail, but we need to note this in
the log at mount time so that filesystem hangs are easily explained
when reported to us.

- bdev doesn't support this (i.e. EOPNOTSUPP)

XFS uses default retry forever behaviour.

  This 'struct no_space_strategy' would be invented purely for
  informational purposes for upper layers' benefit -- I don't consider it
  a block device structure it the traditional sense.
  
  I was thinking upper layers would like to know the actual timeout value
  for the fails slow case.  As such the 'struct no_space_strategy' would
  have the enum and the timeout.  And would be returned with a call:
   bdev_get_nospace_strategy(bdev, no_space_strategy)
 
 Asking for the timeout value seems to add complexity.  It could change after
 we ask, and knowing it now requires another layer to be handling timeouts...

I don't think knowing the bdev timeout is necessary because the
default is most likely to be fail fast in this case. i.e. no
retries, just shut down.  IOWs, if we describe the configs and
actions in neutral terms, then the default configurations easy for
users to understand. i.e:

bdev enospc XFS default
--- ---
Fail slow   Fail fast
Fail fast   Fail slow
Fail never  Fail never, Record in log
EOPNOTSUPP  Fail never

With that in mind, I'm thinking I should drop the
permanent/transient error classifications, and change it failure
behaviour with the options fast slow [never] and only the slow
option has retry/timeout configuration options.  I think the never
option still needs to fail at unmount config variable, but we
enable it by default rather than hanging unmount and requiring a
manual shutdown like we do now

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-21 Thread Dave Chinner

On Tue, Jul 21, 2015 at 09:40:29PM -0400, Mike Snitzer wrote:
> On Tue, Jul 21 2015 at  9:00pm -0400,
> Dave Chinner  wrote:
> 
> > On Wed, Jul 22, 2015 at 10:09:23AM +1000, Dave Chinner wrote:
> > > On Tue, Jul 21, 2015 at 01:47:53PM -0400, Mike Snitzer wrote:
> > > > On Tue, Jul 21 2015 at 11:34am -0400, Eric Sandeen  
> > > > wrote:
> > > > > On 7/20/15 5:36 PM, Dave Chinner wrote:
> > > > > The issue we had discussed previously is that there is no agreement
> > > > > across block devices about whether ENOSPC is a permanent or temporary
> > > > > condition.  Asking the admin to tune  the fs to each block device's
> > > > > behavior sucks, IMHO.
> > > > 
> > > > It does suck, but it beats the alternative of XFS continuing to do
> > > > nothing about the problem.
> > > 
> > > Just a comment on that: doing nothing is better than doing the wrong
> > > thing and being stuck with it forever. :)
> > > 
> > > > Disucssing more with Vivek, might be that XFS would be best served to
> > > > model what dm-thinp has provided with its 'no_space_timeout'.  It
> > > > defaults to queueing IO for 60 seconds, once the timeout expires the
> > > > queued IOs getted errored.  If set to 0 dm-thinp will queue IO
> > > > indefinitely.
> > > 
> > > Yes, that's exactly what I proposed in the thread I referenced in
> > > my previous email, and what got stuck on the bikeshed wall because
> > > of these concerns about knob twiddling:
> > > 
> > > http://oss.sgi.com/archives/xfs/2015-02/msg00346.html
> > > 
> > > | e.g. if we need configurable error handling, it needs to be
> > > | configurable for different error types, and it needs to be
> > > | configurable on a per-mount basis. And it needs to be configurable
> > > | at runtime, not just at mount time. That kind of leads to using
> > > | sysfs for this. e.g. for each error type we ned to handle different
> > > | behaviour for:
> > > | 
> > > | $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/type
> > > | [transient] permanent
> > > | $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/perm_timeout_seconds
> > > | 300
> > > | $ cat
> > > | /sys/fs/xfs/vda/meta_write_errors/enospc/perm_max_retry_attempts
> > > | 50
> > > | $ cat
> > > | /sys/fs/xfs/vda/meta_write_errors/enospc/transient_fail_at_umount
> > > | 1
> > > 
> > > I've rebased this patchset, and I'm cleaning it up now, so in a few
> > > days I'll have something for review, likely for the 4.3 merge
> > > window
> > 
> > Just thinking a bit more on how to make this simpler to configure,
> > is there a simple way for the filesystem to determine the current
> > config of the dm thinp volume? i.e. if the dm-thinp volume is
> > configured to error out immediately on enospc, then XFS should
> > default to doing the same thing. having XFS be able to grab this
> > status at mount time and change the default ENOSPC error config from
> > transient to permanent on such dm-thinp volumes would go a long way
> > to making these configs Just Do The Right Thing on block dev enospc
> > errors...
> > 
> > e.g. if dm-thinp is configured to queue for 60s and then fail on
> > ENOSPC, we want XFS to fail immediately on ENOSPC in metadata IO. If
> > dm-thinp is configured to ENOSPC instantly (i.e. no queueing) then
> > we want XFS to retry and use it's default retry maximums before
> > failing permanently.
> 
> Yes, that'd be nice.  But there isn't a way to easily get the DM thinp
> device's config from within the kernel (unless XFS wants to get into the
> business of issuing ioctls to DM devices.. unlikely). 

Not really.

> I could be
> persuaded to expose a per-device sysfs file to get the status (would
> avoid need for ioctl), e.g.:
>  # cat /sys/block/dm-5/dm/status
> (but that doesn't _really_ help in-kernel access, awkward for filesystem
> code to be opening sysfs files!)

No, not going that way.  We have direct access through the bdev we
opened, so that's the communications channel we'd need to use.

> SO userspace (mkfs.xfs) could easily check the thinp device's setup
> using 'dmsetup status ' (output will either contain
> 'queue_if_no_space' or 'error_if_no_space').  The DM thinp
> 'no_space_timeout' (applicable if queue_if_no_space) is a thinp global
> accessed using a module param:
>  # cat /sys/module/dm_thin_pool/parameters/no_space_timeout
>  60

Mkfs is not the right interface - users can change dm-thinp
behaviour long after the filesystem was created and so the XFS
config needs to be configurable, too. Further, I really don't want
to have to add anything to the on-disk format to support error
configuration because, well, that drives the level of complexity up
a couple or orders of magnitude (mkfs, repair, metadump, db, etc all
need to support it), especially when it can be driven easily from
userspace after mount with far less constraints and support burden.

> I'm open to considering alternative interfaces for getting you the info
> you need.  I just don't have a great sense for what mechanism you'd like
> to use.  Do we invent

Re: [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-21 Thread Mike Snitzer

On Tue, Jul 21 2015 at  9:00pm -0400,
Dave Chinner  wrote:

> On Wed, Jul 22, 2015 at 10:09:23AM +1000, Dave Chinner wrote:
> > On Tue, Jul 21, 2015 at 01:47:53PM -0400, Mike Snitzer wrote:
> > > On Tue, Jul 21 2015 at 11:34am -0400, Eric Sandeen  
> > > wrote:
> > > > On 7/20/15 5:36 PM, Dave Chinner wrote:
> > > > The issue we had discussed previously is that there is no agreement
> > > > across block devices about whether ENOSPC is a permanent or temporary
> > > > condition.  Asking the admin to tune  the fs to each block device's
> > > > behavior sucks, IMHO.
> > > 
> > > It does suck, but it beats the alternative of XFS continuing to do
> > > nothing about the problem.
> > 
> > Just a comment on that: doing nothing is better than doing the wrong
> > thing and being stuck with it forever. :)
> > 
> > > Disucssing more with Vivek, might be that XFS would be best served to
> > > model what dm-thinp has provided with its 'no_space_timeout'.  It
> > > defaults to queueing IO for 60 seconds, once the timeout expires the
> > > queued IOs getted errored.  If set to 0 dm-thinp will queue IO
> > > indefinitely.
> > 
> > Yes, that's exactly what I proposed in the thread I referenced in
> > my previous email, and what got stuck on the bikeshed wall because
> > of these concerns about knob twiddling:
> > 
> > http://oss.sgi.com/archives/xfs/2015-02/msg00346.html
> > 
> > | e.g. if we need configurable error handling, it needs to be
> > | configurable for different error types, and it needs to be
> > | configurable on a per-mount basis. And it needs to be configurable
> > | at runtime, not just at mount time. That kind of leads to using
> > | sysfs for this. e.g. for each error type we ned to handle different
> > | behaviour for:
> > | 
> > | $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/type
> > | [transient] permanent
> > | $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/perm_timeout_seconds
> > | 300
> > | $ cat
> > | /sys/fs/xfs/vda/meta_write_errors/enospc/perm_max_retry_attempts
> > | 50
> > | $ cat
> > | /sys/fs/xfs/vda/meta_write_errors/enospc/transient_fail_at_umount
> > | 1
> > 
> > I've rebased this patchset, and I'm cleaning it up now, so in a few
> > days I'll have something for review, likely for the 4.3 merge
> > window
> 
> Just thinking a bit more on how to make this simpler to configure,
> is there a simple way for the filesystem to determine the current
> config of the dm thinp volume? i.e. if the dm-thinp volume is
> configured to error out immediately on enospc, then XFS should
> default to doing the same thing. having XFS be able to grab this
> status at mount time and change the default ENOSPC error config from
> transient to permanent on such dm-thinp volumes would go a long way
> to making these configs Just Do The Right Thing on block dev enospc
> errors...
> 
> e.g. if dm-thinp is configured to queue for 60s and then fail on
> ENOSPC, we want XFS to fail immediately on ENOSPC in metadata IO. If
> dm-thinp is configured to ENOSPC instantly (i.e. no queueing) then
> we want XFS to retry and use it's default retry maximums before
> failing permanently.

Yes, that'd be nice.  But there isn't a way to easily get the DM thinp
device's config from within the kernel (unless XFS wants to get into the
business of issuing ioctls to DM devices.. unlikely).  I could be
persuaded to expose a per-device sysfs file to get the status (would
avoid need for ioctl), e.g.:
 # cat /sys/block/dm-5/dm/status
(but that doesn't _really_ help in-kernel access, awkward for filesystem
code to be opening sysfs files!)

SO userspace (mkfs.xfs) could easily check the thinp device's setup
using 'dmsetup status ' (output will either contain
'queue_if_no_space' or 'error_if_no_space').  The DM thinp
'no_space_timeout' (applicable if queue_if_no_space) is a thinp global
accessed using a module param:
 # cat /sys/module/dm_thin_pool/parameters/no_space_timeout
 60

I'm open to considering alternative interfaces for getting you the info
you need.  I just don't have a great sense for what mechanism you'd like
to use.  Do we invent a new block device operations table method that
sets values in a 'struct no_space_strategy' passed in to the
blockdevice?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-21 Thread Dave Chinner

On Wed, Jul 22, 2015 at 10:09:23AM +1000, Dave Chinner wrote:
> On Tue, Jul 21, 2015 at 01:47:53PM -0400, Mike Snitzer wrote:
> > On Tue, Jul 21 2015 at 11:34am -0400, Eric Sandeen  
> > wrote:
> > > On 7/20/15 5:36 PM, Dave Chinner wrote:
> > > The issue we had discussed previously is that there is no agreement
> > > across block devices about whether ENOSPC is a permanent or temporary
> > > condition.  Asking the admin to tune  the fs to each block device's
> > > behavior sucks, IMHO.
> > 
> > It does suck, but it beats the alternative of XFS continuing to do
> > nothing about the problem.
> 
> Just a comment on that: doing nothing is better than doing the wrong
> thing and being stuck with it forever. :)
> 
> > Disucssing more with Vivek, might be that XFS would be best served to
> > model what dm-thinp has provided with its 'no_space_timeout'.  It
> > defaults to queueing IO for 60 seconds, once the timeout expires the
> > queued IOs getted errored.  If set to 0 dm-thinp will queue IO
> > indefinitely.
> 
> Yes, that's exactly what I proposed in the thread I referenced in
> my previous email, and what got stuck on the bikeshed wall because
> of these concerns about knob twiddling:
> 
> http://oss.sgi.com/archives/xfs/2015-02/msg00346.html
> 
> | e.g. if we need configurable error handling, it needs to be
> | configurable for different error types, and it needs to be
> | configurable on a per-mount basis. And it needs to be configurable
> | at runtime, not just at mount time. That kind of leads to using
> | sysfs for this. e.g. for each error type we ned to handle different
> | behaviour for:
> | 
> | $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/type
> | [transient] permanent
> | $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/perm_timeout_seconds
> | 300
> | $ cat
> | /sys/fs/xfs/vda/meta_write_errors/enospc/perm_max_retry_attempts
> | 50
> | $ cat
> | /sys/fs/xfs/vda/meta_write_errors/enospc/transient_fail_at_umount
> | 1
> 
> I've rebased this patchset, and I'm cleaning it up now, so in a few
> days I'll have something for review, likely for the 4.3 merge
> window

Just thinking a bit more on how to make this simpler to configure,
is there a simple way for the filesystem to determine the current
config of the dm thinp volume? i.e. if the dm-thinp volume is
configured to error out immediately on enospc, then XFS should
default to doing the same thing. having XFS be able to grab this
status at mount time and change the default ENOSPC error config from
transient to permanent on such dm-thinp volumes would go a long way
to making these configs Just Do The Right Thing on block dev enospc
errors...

e.g. if dm-thinp is configured to queue for 60s and then fail on
ENOSPC, we want XFS to fail immediately on ENOSPC in metadata IO. If
dm-thinp is configured to ENOSPC instantly (i.e. no queueing) then
we want XFS to retry and use it's default retry maximums before
failing permanently.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-21 Thread Dave Chinner

On Tue, Jul 21, 2015 at 01:47:53PM -0400, Mike Snitzer wrote:
> On Tue, Jul 21 2015 at 11:34am -0400, Eric Sandeen  wrote:
> > On 7/20/15 5:36 PM, Dave Chinner wrote:
> > The issue we had discussed previously is that there is no agreement
> > across block devices about whether ENOSPC is a permanent or temporary
> > condition.  Asking the admin to tune  the fs to each block device's
> > behavior sucks, IMHO.
> 
> It does suck, but it beats the alternative of XFS continuing to do
> nothing about the problem.

Just a comment on that: doing nothing is better than doing the wrong
thing and being stuck with it forever. :)

> Disucssing more with Vivek, might be that XFS would be best served to
> model what dm-thinp has provided with its 'no_space_timeout'.  It
> defaults to queueing IO for 60 seconds, once the timeout expires the
> queued IOs getted errored.  If set to 0 dm-thinp will queue IO
> indefinitely.

Yes, that's exactly what I proposed in the thread I referenced in
my previous email, and what got stuck on the bikeshed wall because
of these concerns about knob twiddling:

http://oss.sgi.com/archives/xfs/2015-02/msg00346.html

| e.g. if we need configurable error handling, it needs to be
| configurable for different error types, and it needs to be
| configurable on a per-mount basis. And it needs to be configurable
| at runtime, not just at mount time. That kind of leads to using
| sysfs for this. e.g. for each error type we ned to handle different
| behaviour for:
| 
| $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/type
| [transient] permanent
| $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/perm_timeout_seconds
| 300
| $ cat
| /sys/fs/xfs/vda/meta_write_errors/enospc/perm_max_retry_attempts
| 50
| $ cat
| /sys/fs/xfs/vda/meta_write_errors/enospc/transient_fail_at_umount
| 1

I've rebased this patchset, and I'm cleaning it up now, so in a few
days I'll have something for review, likely for the 4.3 merge
window

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-21 Thread Mike Snitzer

On Tue, Jul 21 2015 at 11:34am -0400,
Eric Sandeen  wrote:

> On 7/20/15 5:36 PM, Dave Chinner wrote:
> > On Mon, Jul 20, 2015 at 11:18:49AM -0400, Mike Snitzer wrote:
> >> If XFS fails to write metadata it will retry the write indefinitely
> >> (with the hope that the write will succeed at some point in the future).
> >>
> >> Others can possibly speak to historic reason(s) why this is a sane
> >> default for XFS.  But when XFS is deployed ontop of DM thin provisioning
> >> this infinite retry is very unwelcome -- especially if DM thinp was
> >> configured to be automatically extended with free space but the admin
> >> hasn't provided (or restored) adequate free space.
> >>
> >> To fix this infinite retry a new bdev_has_space () hook is added to XFS
> >> to break out of its metadata retry loop if the underlying block device
> >> reports it no longer has free space.  DM thin provisioning is now
> >> trained to respond accordingly, which enables XFS to not cause a cascade
> >> of tasks blocked on IO waiting for XFS's infinite retry.
> >>
> >> All other block devices, which don't implement a .has_space method in
> >> block_device_operations, will always return true for bdev_has_space().
> >>
> >> With this change XFS will fail the metadata IO, force shutdown, and the
> >> XFS filesystem may be unmounted.  This enables an admin to recover from
> >> their oversight, of not having provided enough free space, without
> >> having to force a hard reset of the system to get XFS to unwedge.
> >>
> >> Signed-off-by: Mike Snitzer 
> > 
> > Shouldn't dm-thinp just return the bio with ENOSPC as it's error?
> > The scsi layers already do this for hardware thinp ENOSPC failures,
> > so dm-thinp should behave exactly the same (i.e. via
> > __scsi_error_from_host_byte()). The behaviour of the filesystem
> > should be the same in all cases - making it conditional on whether
> > the thinp implementation can be polled for available space is wrong
> > as most hardware thinp can't be polled by the kernel forthis info..
> > 
> > 
> > If dm-thinp just returns ENOSPC from on the BIO like other hardware
> > thinp devices, then it is up to the filesystem to handle that
> > appropriately.  i.e. whether an ENOSPC IO error is fatal to the
> > filesystem is determined by filesystem configuration and context of
> > the IO error, not whether the block device has no space (which we
> > should already know from the ENOSPC error delivered by IO
> > completion).
> 
> The issue we had discussed previously is that there is no agreement
> across block devices about whether ENOSPC is a permanent or temporary
> condition.  Asking the admin to tune  the fs to each block device's
> behavior sucks, IMHO.

It does suck, but it beats the alternative of XFS continuing to do
nothing about the problem.

Disucssing more with Vivek, might be that XFS would be best served to
model what dm-thinp has provided with its 'no_space_timeout'.  It
defaults to queueing IO for 60 seconds, once the timeout expires the
queued IOs getted errored.  If set to 0 dm-thinp will queue IO
indefinitely.

So for XFS's use-case: s/queue/retry/

> This interface could at least be defined to reflect a permanent and
> unambiguous state...

The proposed bdev_has_space() interface enabled XFS to defer to the
block device.  But it obviously doesn't help at all if the blockdevice
isn't providing a .has_space method -- so I can see value in XFS
having something like a 'no_space_timeout' knob.

But something needs to happen.  No more bike-shedding allowed on this
one.. PLEASE DO SOMETHING! :)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-21 Thread Eric Sandeen

On 7/20/15 5:36 PM, Dave Chinner wrote:
> On Mon, Jul 20, 2015 at 11:18:49AM -0400, Mike Snitzer wrote:
>> If XFS fails to write metadata it will retry the write indefinitely
>> (with the hope that the write will succeed at some point in the future).
>>
>> Others can possibly speak to historic reason(s) why this is a sane
>> default for XFS.  But when XFS is deployed ontop of DM thin provisioning
>> this infinite retry is very unwelcome -- especially if DM thinp was
>> configured to be automatically extended with free space but the admin
>> hasn't provided (or restored) adequate free space.
>>
>> To fix this infinite retry a new bdev_has_space () hook is added to XFS
>> to break out of its metadata retry loop if the underlying block device
>> reports it no longer has free space.  DM thin provisioning is now
>> trained to respond accordingly, which enables XFS to not cause a cascade
>> of tasks blocked on IO waiting for XFS's infinite retry.
>>
>> All other block devices, which don't implement a .has_space method in
>> block_device_operations, will always return true for bdev_has_space().
>>
>> With this change XFS will fail the metadata IO, force shutdown, and the
>> XFS filesystem may be unmounted.  This enables an admin to recover from
>> their oversight, of not having provided enough free space, without
>> having to force a hard reset of the system to get XFS to unwedge.
>>
>> Signed-off-by: Mike Snitzer 
> 
> Shouldn't dm-thinp just return the bio with ENOSPC as it's error?
> The scsi layers already do this for hardware thinp ENOSPC failures,
> so dm-thinp should behave exactly the same (i.e. via
> __scsi_error_from_host_byte()). The behaviour of the filesystem
> should be the same in all cases - making it conditional on whether
> the thinp implementation can be polled for available space is wrong
> as most hardware thinp can't be polled by the kernel forthis info..
> 
> 
> If dm-thinp just returns ENOSPC from on the BIO like other hardware
> thinp devices, then it is up to the filesystem to handle that
> appropriately.  i.e. whether an ENOSPC IO error is fatal to the
> filesystem is determined by filesystem configuration and context of
> the IO error, not whether the block device has no space (which we
> should already know from the ENOSPC error delivered by IO
> completion).

The issue we had discussed previously is that there is no agreement
across block devices about whether ENOSPC is a permanent or temporary
condition.  Asking the admin to tune  the fs to each block device's
behavior sucks, IMHO.

This interface could at least be defined to reflect a permanent and
unambiguous state...

-Eric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-21 Thread Eric Sandeen

On 7/20/15 5:36 PM, Dave Chinner wrote:
 On Mon, Jul 20, 2015 at 11:18:49AM -0400, Mike Snitzer wrote:
 If XFS fails to write metadata it will retry the write indefinitely
 (with the hope that the write will succeed at some point in the future).

 Others can possibly speak to historic reason(s) why this is a sane
 default for XFS.  But when XFS is deployed ontop of DM thin provisioning
 this infinite retry is very unwelcome -- especially if DM thinp was
 configured to be automatically extended with free space but the admin
 hasn't provided (or restored) adequate free space.

 To fix this infinite retry a new bdev_has_space () hook is added to XFS
 to break out of its metadata retry loop if the underlying block device
 reports it no longer has free space.  DM thin provisioning is now
 trained to respond accordingly, which enables XFS to not cause a cascade
 of tasks blocked on IO waiting for XFS's infinite retry.

 All other block devices, which don't implement a .has_space method in
 block_device_operations, will always return true for bdev_has_space().

 With this change XFS will fail the metadata IO, force shutdown, and the
 XFS filesystem may be unmounted.  This enables an admin to recover from
 their oversight, of not having provided enough free space, without
 having to force a hard reset of the system to get XFS to unwedge.

 Signed-off-by: Mike Snitzer snit...@redhat.com
 
 Shouldn't dm-thinp just return the bio with ENOSPC as it's error?
 The scsi layers already do this for hardware thinp ENOSPC failures,
 so dm-thinp should behave exactly the same (i.e. via
 __scsi_error_from_host_byte()). The behaviour of the filesystem
 should be the same in all cases - making it conditional on whether
 the thinp implementation can be polled for available space is wrong
 as most hardware thinp can't be polled by the kernel forthis info..
 
 
 If dm-thinp just returns ENOSPC from on the BIO like other hardware
 thinp devices, then it is up to the filesystem to handle that
 appropriately.  i.e. whether an ENOSPC IO error is fatal to the
 filesystem is determined by filesystem configuration and context of
 the IO error, not whether the block device has no space (which we
 should already know from the ENOSPC error delivered by IO
 completion).

The issue we had discussed previously is that there is no agreement
across block devices about whether ENOSPC is a permanent or temporary
condition.  Asking the admin to tune  the fs to each block device's
behavior sucks, IMHO.

This interface could at least be defined to reflect a permanent and
unambiguous state...

-Eric

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-21 Thread Dave Chinner

On Tue, Jul 21, 2015 at 09:40:29PM -0400, Mike Snitzer wrote:
 On Tue, Jul 21 2015 at  9:00pm -0400,
 Dave Chinner da...@fromorbit.com wrote:
 
  On Wed, Jul 22, 2015 at 10:09:23AM +1000, Dave Chinner wrote:
   On Tue, Jul 21, 2015 at 01:47:53PM -0400, Mike Snitzer wrote:
On Tue, Jul 21 2015 at 11:34am -0400, Eric Sandeen sand...@redhat.com 
wrote:
 On 7/20/15 5:36 PM, Dave Chinner wrote:
 The issue we had discussed previously is that there is no agreement
 across block devices about whether ENOSPC is a permanent or temporary
 condition.  Asking the admin to tune  the fs to each block device's
 behavior sucks, IMHO.

It does suck, but it beats the alternative of XFS continuing to do
nothing about the problem.
   
   Just a comment on that: doing nothing is better than doing the wrong
   thing and being stuck with it forever. :)
   
Disucssing more with Vivek, might be that XFS would be best served to
model what dm-thinp has provided with its 'no_space_timeout'.  It
defaults to queueing IO for 60 seconds, once the timeout expires the
queued IOs getted errored.  If set to 0 dm-thinp will queue IO
indefinitely.
   
   Yes, that's exactly what I proposed in the thread I referenced in
   my previous email, and what got stuck on the bikeshed wall because
   of these concerns about knob twiddling:
   
   http://oss.sgi.com/archives/xfs/2015-02/msg00346.html
   
   | e.g. if we need configurable error handling, it needs to be
   | configurable for different error types, and it needs to be
   | configurable on a per-mount basis. And it needs to be configurable
   | at runtime, not just at mount time. That kind of leads to using
   | sysfs for this. e.g. for each error type we ned to handle different
   | behaviour for:
   | 
   | $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/type
   | [transient] permanent
   | $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/perm_timeout_seconds
   | 300
   | $ cat
   | /sys/fs/xfs/vda/meta_write_errors/enospc/perm_max_retry_attempts
   | 50
   | $ cat
   | /sys/fs/xfs/vda/meta_write_errors/enospc/transient_fail_at_umount
   | 1
   
   I've rebased this patchset, and I'm cleaning it up now, so in a few
   days I'll have something for review, likely for the 4.3 merge
   window
  
  Just thinking a bit more on how to make this simpler to configure,
  is there a simple way for the filesystem to determine the current
  config of the dm thinp volume? i.e. if the dm-thinp volume is
  configured to error out immediately on enospc, then XFS should
  default to doing the same thing. having XFS be able to grab this
  status at mount time and change the default ENOSPC error config from
  transient to permanent on such dm-thinp volumes would go a long way
  to making these configs Just Do The Right Thing on block dev enospc
  errors...
  
  e.g. if dm-thinp is configured to queue for 60s and then fail on
  ENOSPC, we want XFS to fail immediately on ENOSPC in metadata IO. If
  dm-thinp is configured to ENOSPC instantly (i.e. no queueing) then
  we want XFS to retry and use it's default retry maximums before
  failing permanently.
 
 Yes, that'd be nice.  But there isn't a way to easily get the DM thinp
 device's config from within the kernel (unless XFS wants to get into the
 business of issuing ioctls to DM devices.. unlikely). 

Not really.

 I could be
 persuaded to expose a per-device sysfs file to get the status (would
 avoid need for ioctl), e.g.:
  # cat /sys/block/dm-5/dm/status
 (but that doesn't _really_ help in-kernel access, awkward for filesystem
 code to be opening sysfs files!)

No, not going that way.  We have direct access through the bdev we
opened, so that's the communications channel we'd need to use.

 SO userspace (mkfs.xfs) could easily check the thinp device's setup
 using 'dmsetup status device' (output will either contain
 'queue_if_no_space' or 'error_if_no_space').  The DM thinp
 'no_space_timeout' (applicable if queue_if_no_space) is a thinp global
 accessed using a module param:
  # cat /sys/module/dm_thin_pool/parameters/no_space_timeout
  60

Mkfs is not the right interface - users can change dm-thinp
behaviour long after the filesystem was created and so the XFS
config needs to be configurable, too. Further, I really don't want
to have to add anything to the on-disk format to support error
configuration because, well, that drives the level of complexity up
a couple or orders of magnitude (mkfs, repair, metadump, db, etc all
need to support it), especially when it can be driven easily from
userspace after mount with far less constraints and support burden.

 I'm open to considering alternative interfaces for getting you the info
 you need.  I just don't have a great sense for what mechanism you'd like
 to use.  Do we invent a new block device operations table method that
 sets values in a 'struct no_space_strategy' passed in to the
 blockdevice?

It's long been frowned on having the filesystems dig

Re: [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-21 Thread Mike Snitzer

On Tue, Jul 21 2015 at 11:34am -0400,
Eric Sandeen sand...@redhat.com wrote:

 On 7/20/15 5:36 PM, Dave Chinner wrote:
  On Mon, Jul 20, 2015 at 11:18:49AM -0400, Mike Snitzer wrote:
  If XFS fails to write metadata it will retry the write indefinitely
  (with the hope that the write will succeed at some point in the future).
 
  Others can possibly speak to historic reason(s) why this is a sane
  default for XFS.  But when XFS is deployed ontop of DM thin provisioning
  this infinite retry is very unwelcome -- especially if DM thinp was
  configured to be automatically extended with free space but the admin
  hasn't provided (or restored) adequate free space.
 
  To fix this infinite retry a new bdev_has_space () hook is added to XFS
  to break out of its metadata retry loop if the underlying block device
  reports it no longer has free space.  DM thin provisioning is now
  trained to respond accordingly, which enables XFS to not cause a cascade
  of tasks blocked on IO waiting for XFS's infinite retry.
 
  All other block devices, which don't implement a .has_space method in
  block_device_operations, will always return true for bdev_has_space().
 
  With this change XFS will fail the metadata IO, force shutdown, and the
  XFS filesystem may be unmounted.  This enables an admin to recover from
  their oversight, of not having provided enough free space, without
  having to force a hard reset of the system to get XFS to unwedge.
 
  Signed-off-by: Mike Snitzer snit...@redhat.com
  
  Shouldn't dm-thinp just return the bio with ENOSPC as it's error?
  The scsi layers already do this for hardware thinp ENOSPC failures,
  so dm-thinp should behave exactly the same (i.e. via
  __scsi_error_from_host_byte()). The behaviour of the filesystem
  should be the same in all cases - making it conditional on whether
  the thinp implementation can be polled for available space is wrong
  as most hardware thinp can't be polled by the kernel forthis info..
  
  
  If dm-thinp just returns ENOSPC from on the BIO like other hardware
  thinp devices, then it is up to the filesystem to handle that
  appropriately.  i.e. whether an ENOSPC IO error is fatal to the
  filesystem is determined by filesystem configuration and context of
  the IO error, not whether the block device has no space (which we
  should already know from the ENOSPC error delivered by IO
  completion).
 
 The issue we had discussed previously is that there is no agreement
 across block devices about whether ENOSPC is a permanent or temporary
 condition.  Asking the admin to tune  the fs to each block device's
 behavior sucks, IMHO.

It does suck, but it beats the alternative of XFS continuing to do
nothing about the problem.

Disucssing more with Vivek, might be that XFS would be best served to
model what dm-thinp has provided with its 'no_space_timeout'.  It
defaults to queueing IO for 60 seconds, once the timeout expires the
queued IOs getted errored.  If set to 0 dm-thinp will queue IO
indefinitely.

So for XFS's use-case: s/queue/retry/

 This interface could at least be defined to reflect a permanent and
 unambiguous state...

The proposed bdev_has_space() interface enabled XFS to defer to the
block device.  But it obviously doesn't help at all if the blockdevice
isn't providing a .has_space method -- so I can see value in XFS
having something like a 'no_space_timeout' knob.

But something needs to happen.  No more bike-shedding allowed on this
one.. PLEASE DO SOMETHING! :)
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-21 Thread Mike Snitzer

On Tue, Jul 21 2015 at  9:00pm -0400,
Dave Chinner da...@fromorbit.com wrote:

 On Wed, Jul 22, 2015 at 10:09:23AM +1000, Dave Chinner wrote:
  On Tue, Jul 21, 2015 at 01:47:53PM -0400, Mike Snitzer wrote:
   On Tue, Jul 21 2015 at 11:34am -0400, Eric Sandeen sand...@redhat.com 
   wrote:
On 7/20/15 5:36 PM, Dave Chinner wrote:
The issue we had discussed previously is that there is no agreement
across block devices about whether ENOSPC is a permanent or temporary
condition.  Asking the admin to tune  the fs to each block device's
behavior sucks, IMHO.
   
   It does suck, but it beats the alternative of XFS continuing to do
   nothing about the problem.
  
  Just a comment on that: doing nothing is better than doing the wrong
  thing and being stuck with it forever. :)
  
   Disucssing more with Vivek, might be that XFS would be best served to
   model what dm-thinp has provided with its 'no_space_timeout'.  It
   defaults to queueing IO for 60 seconds, once the timeout expires the
   queued IOs getted errored.  If set to 0 dm-thinp will queue IO
   indefinitely.
  
  Yes, that's exactly what I proposed in the thread I referenced in
  my previous email, and what got stuck on the bikeshed wall because
  of these concerns about knob twiddling:
  
  http://oss.sgi.com/archives/xfs/2015-02/msg00346.html
  
  | e.g. if we need configurable error handling, it needs to be
  | configurable for different error types, and it needs to be
  | configurable on a per-mount basis. And it needs to be configurable
  | at runtime, not just at mount time. That kind of leads to using
  | sysfs for this. e.g. for each error type we ned to handle different
  | behaviour for:
  | 
  | $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/type
  | [transient] permanent
  | $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/perm_timeout_seconds
  | 300
  | $ cat
  | /sys/fs/xfs/vda/meta_write_errors/enospc/perm_max_retry_attempts
  | 50
  | $ cat
  | /sys/fs/xfs/vda/meta_write_errors/enospc/transient_fail_at_umount
  | 1
  
  I've rebased this patchset, and I'm cleaning it up now, so in a few
  days I'll have something for review, likely for the 4.3 merge
  window
 
 Just thinking a bit more on how to make this simpler to configure,
 is there a simple way for the filesystem to determine the current
 config of the dm thinp volume? i.e. if the dm-thinp volume is
 configured to error out immediately on enospc, then XFS should
 default to doing the same thing. having XFS be able to grab this
 status at mount time and change the default ENOSPC error config from
 transient to permanent on such dm-thinp volumes would go a long way
 to making these configs Just Do The Right Thing on block dev enospc
 errors...
 
 e.g. if dm-thinp is configured to queue for 60s and then fail on
 ENOSPC, we want XFS to fail immediately on ENOSPC in metadata IO. If
 dm-thinp is configured to ENOSPC instantly (i.e. no queueing) then
 we want XFS to retry and use it's default retry maximums before
 failing permanently.

Yes, that'd be nice.  But there isn't a way to easily get the DM thinp
device's config from within the kernel (unless XFS wants to get into the
business of issuing ioctls to DM devices.. unlikely).  I could be
persuaded to expose a per-device sysfs file to get the status (would
avoid need for ioctl), e.g.:
 # cat /sys/block/dm-5/dm/status
(but that doesn't _really_ help in-kernel access, awkward for filesystem
code to be opening sysfs files!)

SO userspace (mkfs.xfs) could easily check the thinp device's setup
using 'dmsetup status device' (output will either contain
'queue_if_no_space' or 'error_if_no_space').  The DM thinp
'no_space_timeout' (applicable if queue_if_no_space) is a thinp global
accessed using a module param:
 # cat /sys/module/dm_thin_pool/parameters/no_space_timeout
 60

I'm open to considering alternative interfaces for getting you the info
you need.  I just don't have a great sense for what mechanism you'd like
to use.  Do we invent a new block device operations table method that
sets values in a 'struct no_space_strategy' passed in to the
blockdevice?
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-21 Thread Dave Chinner

On Wed, Jul 22, 2015 at 10:09:23AM +1000, Dave Chinner wrote:
 On Tue, Jul 21, 2015 at 01:47:53PM -0400, Mike Snitzer wrote:
  On Tue, Jul 21 2015 at 11:34am -0400, Eric Sandeen sand...@redhat.com 
  wrote:
   On 7/20/15 5:36 PM, Dave Chinner wrote:
   The issue we had discussed previously is that there is no agreement
   across block devices about whether ENOSPC is a permanent or temporary
   condition.  Asking the admin to tune  the fs to each block device's
   behavior sucks, IMHO.
  
  It does suck, but it beats the alternative of XFS continuing to do
  nothing about the problem.
 
 Just a comment on that: doing nothing is better than doing the wrong
 thing and being stuck with it forever. :)
 
  Disucssing more with Vivek, might be that XFS would be best served to
  model what dm-thinp has provided with its 'no_space_timeout'.  It
  defaults to queueing IO for 60 seconds, once the timeout expires the
  queued IOs getted errored.  If set to 0 dm-thinp will queue IO
  indefinitely.
 
 Yes, that's exactly what I proposed in the thread I referenced in
 my previous email, and what got stuck on the bikeshed wall because
 of these concerns about knob twiddling:
 
 http://oss.sgi.com/archives/xfs/2015-02/msg00346.html
 
 | e.g. if we need configurable error handling, it needs to be
 | configurable for different error types, and it needs to be
 | configurable on a per-mount basis. And it needs to be configurable
 | at runtime, not just at mount time. That kind of leads to using
 | sysfs for this. e.g. for each error type we ned to handle different
 | behaviour for:
 | 
 | $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/type
 | [transient] permanent
 | $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/perm_timeout_seconds
 | 300
 | $ cat
 | /sys/fs/xfs/vda/meta_write_errors/enospc/perm_max_retry_attempts
 | 50
 | $ cat
 | /sys/fs/xfs/vda/meta_write_errors/enospc/transient_fail_at_umount
 | 1
 
 I've rebased this patchset, and I'm cleaning it up now, so in a few
 days I'll have something for review, likely for the 4.3 merge
 window

Just thinking a bit more on how to make this simpler to configure,
is there a simple way for the filesystem to determine the current
config of the dm thinp volume? i.e. if the dm-thinp volume is
configured to error out immediately on enospc, then XFS should
default to doing the same thing. having XFS be able to grab this
status at mount time and change the default ENOSPC error config from
transient to permanent on such dm-thinp volumes would go a long way
to making these configs Just Do The Right Thing on block dev enospc
errors...

e.g. if dm-thinp is configured to queue for 60s and then fail on
ENOSPC, we want XFS to fail immediately on ENOSPC in metadata IO. If
dm-thinp is configured to ENOSPC instantly (i.e. no queueing) then
we want XFS to retry and use it's default retry maximums before
failing permanently.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-21 Thread Dave Chinner

On Tue, Jul 21, 2015 at 01:47:53PM -0400, Mike Snitzer wrote:
 On Tue, Jul 21 2015 at 11:34am -0400, Eric Sandeen sand...@redhat.com wrote:
  On 7/20/15 5:36 PM, Dave Chinner wrote:
  The issue we had discussed previously is that there is no agreement
  across block devices about whether ENOSPC is a permanent or temporary
  condition.  Asking the admin to tune  the fs to each block device's
  behavior sucks, IMHO.
 
 It does suck, but it beats the alternative of XFS continuing to do
 nothing about the problem.

Just a comment on that: doing nothing is better than doing the wrong
thing and being stuck with it forever. :)

 Disucssing more with Vivek, might be that XFS would be best served to
 model what dm-thinp has provided with its 'no_space_timeout'.  It
 defaults to queueing IO for 60 seconds, once the timeout expires the
 queued IOs getted errored.  If set to 0 dm-thinp will queue IO
 indefinitely.

Yes, that's exactly what I proposed in the thread I referenced in
my previous email, and what got stuck on the bikeshed wall because
of these concerns about knob twiddling:

http://oss.sgi.com/archives/xfs/2015-02/msg00346.html

| e.g. if we need configurable error handling, it needs to be
| configurable for different error types, and it needs to be
| configurable on a per-mount basis. And it needs to be configurable
| at runtime, not just at mount time. That kind of leads to using
| sysfs for this. e.g. for each error type we ned to handle different
| behaviour for:
| 
| $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/type
| [transient] permanent
| $ cat /sys/fs/xfs/vda/meta_write_errors/enospc/perm_timeout_seconds
| 300
| $ cat
| /sys/fs/xfs/vda/meta_write_errors/enospc/perm_max_retry_attempts
| 50
| $ cat
| /sys/fs/xfs/vda/meta_write_errors/enospc/transient_fail_at_umount
| 1

I've rebased this patchset, and I'm cleaning it up now, so in a few
days I'll have something for review, likely for the 4.3 merge
window

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-20 Thread Dave Chinner

On Mon, Jul 20, 2015 at 07:20:58PM -0400, Mike Snitzer wrote:
> On Mon, Jul 20 2015 at  6:36pm -0400,
> Dave Chinner  wrote:
> 
> > On Mon, Jul 20, 2015 at 11:18:49AM -0400, Mike Snitzer wrote:
> > > If XFS fails to write metadata it will retry the write indefinitely
> > > (with the hope that the write will succeed at some point in the future).
> > > 
> > > Others can possibly speak to historic reason(s) why this is a sane
> > > default for XFS.  But when XFS is deployed ontop of DM thin provisioning
> > > this infinite retry is very unwelcome -- especially if DM thinp was
> > > configured to be automatically extended with free space but the admin
> > > hasn't provided (or restored) adequate free space.
> > > 
> > > To fix this infinite retry a new bdev_has_space () hook is added to XFS
> > > to break out of its metadata retry loop if the underlying block device
> > > reports it no longer has free space.  DM thin provisioning is now
> > > trained to respond accordingly, which enables XFS to not cause a cascade
> > > of tasks blocked on IO waiting for XFS's infinite retry.
> > > 
> > > All other block devices, which don't implement a .has_space method in
> > > block_device_operations, will always return true for bdev_has_space().
> > > 
> > > With this change XFS will fail the metadata IO, force shutdown, and the
> > > XFS filesystem may be unmounted.  This enables an admin to recover from
> > > their oversight, of not having provided enough free space, without
> > > having to force a hard reset of the system to get XFS to unwedge.
> > > 
> > > Signed-off-by: Mike Snitzer 
> > 
> > Shouldn't dm-thinp just return the bio with ENOSPC as it's error?
> > The scsi layers already do this for hardware thinp ENOSPC failures,
> > so dm-thinp should behave exactly the same (i.e. via
> > __scsi_error_from_host_byte())
> 
> DM thinp does return -ENOSPC (uniformly so, as of 4.2-rc3 via commit
> bcc696fac1).  Before it was more nuanced, but it would either return
> -ENOSPC or -EIO (transition from -ENOSPC to -EIO was when thinp's
> no_space_timoeut expired).

Right, which meant we couldn't get a reliable ENOSPC detection, so
we just had to retry and hope

> Anyway, XFS doesn't care what error is returned it'll just keep
> retrying metdata IO indefinitely.

Sure, that's historical behaviour and it's a good default for most
systems. But see here for how we plan to handle this particular
issue of error catergorisation:

http://oss.sgi.com/archives/xfs/2015-02/msg00343.html

BTW, it's not just ENOSPC we need this for - ENOMEM needs the same
configurable behaviour now that the mm subsystem has started
treating GFP_NOFAIL differently to just looping outside the
allocator

> > The behaviour of the filesystem
> > should be the same in all cases - making it conditional on whether
> > the thinp implementation can be polled for available space is wrong
> > as most hardware thinp can't be polled by the kernel forthis info..
> 
> There isn't an immediate corollary for SCSI no.  I've discussed flagging
> the SCSI block device as out of space when the equivalent HW thinp
> ASC/ASQ is returned from the SCSI target but there isn't an existing way
> to know space was added to reset this state (like is possible with DM
> thinp).
> 
> But in the past you (or others) have claimed that ENOSPC isn't a rich
> enough return to _know_ that it is a hard error.

Yes, I've said that in the past because we haven't had any agreement
on how block devices should report ENOSPC. However, this is slowly
being standardised and I've seen enough people say
"give us a knob to decide that ourselves as we know our hardware"
that means we can rely on ENOSPC on the bio to detect this issue and
handle it appropriately.

> Now you seem to be oscillating away from a richer interface between
> filesystems and the underlying block device to: a simple error code is
> enough.

Once we decided to give users the option to configure block device
error handling, the need for more information from the block device
goes away...

> > If dm-thinp just returns ENOSPC from on the BIO like other hardware
> > thinp devices, then it is up to the filesystem to handle that
> > appropriately.  i.e. whether an ENOSPC IO error is fatal to the
> > filesystem is determined by filesystem configuration and context of
> > the IO error, not whether the block device has no space (which we
> > should already know from the ENOSPC error delivered by IO
> > completion).
> 
> OK, _please_ add that configurability to XFS.  As I think you know, this
> is a long-standing XFS on thinp issue (both DM and HW thinp).

It got stuck on the side of the bikeshed, unfortunately, and then
other stuff got painted on top of it. I'll go uncover it now that
dm-thinp is returning reliable ENOSPC in ENOSPC conditions ;)

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More

Re: [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-20 Thread Mike Snitzer

On Mon, Jul 20 2015 at  6:36pm -0400,
Dave Chinner  wrote:

> On Mon, Jul 20, 2015 at 11:18:49AM -0400, Mike Snitzer wrote:
> > If XFS fails to write metadata it will retry the write indefinitely
> > (with the hope that the write will succeed at some point in the future).
> > 
> > Others can possibly speak to historic reason(s) why this is a sane
> > default for XFS.  But when XFS is deployed ontop of DM thin provisioning
> > this infinite retry is very unwelcome -- especially if DM thinp was
> > configured to be automatically extended with free space but the admin
> > hasn't provided (or restored) adequate free space.
> > 
> > To fix this infinite retry a new bdev_has_space () hook is added to XFS
> > to break out of its metadata retry loop if the underlying block device
> > reports it no longer has free space.  DM thin provisioning is now
> > trained to respond accordingly, which enables XFS to not cause a cascade
> > of tasks blocked on IO waiting for XFS's infinite retry.
> > 
> > All other block devices, which don't implement a .has_space method in
> > block_device_operations, will always return true for bdev_has_space().
> > 
> > With this change XFS will fail the metadata IO, force shutdown, and the
> > XFS filesystem may be unmounted.  This enables an admin to recover from
> > their oversight, of not having provided enough free space, without
> > having to force a hard reset of the system to get XFS to unwedge.
> > 
> > Signed-off-by: Mike Snitzer 
> 
> Shouldn't dm-thinp just return the bio with ENOSPC as it's error?
> The scsi layers already do this for hardware thinp ENOSPC failures,
> so dm-thinp should behave exactly the same (i.e. via
> __scsi_error_from_host_byte())

DM thinp does return -ENOSPC (uniformly so, as of 4.2-rc3 via commit
bcc696fac1).  Before it was more nuanced, but it would either return
-ENOSPC or -EIO (transition from -ENOSPC to -EIO was when thinp's
no_space_timoeut expired).

Anyway, XFS doesn't care what error is returned it'll just keep
retrying metdata IO indefinitely.

> The behaviour of the filesystem
> should be the same in all cases - making it conditional on whether
> the thinp implementation can be polled for available space is wrong
> as most hardware thinp can't be polled by the kernel forthis info..

There isn't an immediate corollary for SCSI no.  I've discussed flagging
the SCSI block device as out of space when the equivalent HW thinp
ASC/ASQ is returned from the SCSI target but there isn't an existing way
to know space was added to reset this state (like is possible with DM
thinp).

But in the past you (or others) have claimed that ENOSPC isn't a rich
enough return to _know_ that it is a hard error.

Now you seem to be oscillating away from a richer interface between
filesystems and the underlying block device to: a simple error code is
enough.

> If dm-thinp just returns ENOSPC from on the BIO like other hardware
> thinp devices, then it is up to the filesystem to handle that
> appropriately.  i.e. whether an ENOSPC IO error is fatal to the
> filesystem is determined by filesystem configuration and context of
> the IO error, not whether the block device has no space (which we
> should already know from the ENOSPC error delivered by IO
> completion).

OK, _please_ add that configurability to XFS.  As I think you know, this
is a long-standing XFS on thinp issue (both DM and HW thinp).

Mike
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-20 Thread Dave Chinner

On Mon, Jul 20, 2015 at 11:18:49AM -0400, Mike Snitzer wrote:
> If XFS fails to write metadata it will retry the write indefinitely
> (with the hope that the write will succeed at some point in the future).
> 
> Others can possibly speak to historic reason(s) why this is a sane
> default for XFS.  But when XFS is deployed ontop of DM thin provisioning
> this infinite retry is very unwelcome -- especially if DM thinp was
> configured to be automatically extended with free space but the admin
> hasn't provided (or restored) adequate free space.
> 
> To fix this infinite retry a new bdev_has_space () hook is added to XFS
> to break out of its metadata retry loop if the underlying block device
> reports it no longer has free space.  DM thin provisioning is now
> trained to respond accordingly, which enables XFS to not cause a cascade
> of tasks blocked on IO waiting for XFS's infinite retry.
> 
> All other block devices, which don't implement a .has_space method in
> block_device_operations, will always return true for bdev_has_space().
> 
> With this change XFS will fail the metadata IO, force shutdown, and the
> XFS filesystem may be unmounted.  This enables an admin to recover from
> their oversight, of not having provided enough free space, without
> having to force a hard reset of the system to get XFS to unwedge.
> 
> Signed-off-by: Mike Snitzer 

Shouldn't dm-thinp just return the bio with ENOSPC as it's error?
The scsi layers already do this for hardware thinp ENOSPC failures,
so dm-thinp should behave exactly the same (i.e. via
__scsi_error_from_host_byte()). The behaviour of the filesystem
should be the same in all cases - making it conditional on whether
the thinp implementation can be polled for available space is wrong
as most hardware thinp can't be polled by the kernel forthis info..

If dm-thinp just returns ENOSPC from on the BIO like other hardware
thinp devices, then it is up to the filesystem to handle that
appropriately.  i.e. whether an ENOSPC IO error is fatal to the
filesystem is determined by filesystem configuration and context of
the IO error, not whether the block device has no space (which we
should already know from the ENOSPC error delivered by IO
completion).

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-20 Thread Mike Snitzer

If XFS fails to write metadata it will retry the write indefinitely
(with the hope that the write will succeed at some point in the future).

Others can possibly speak to historic reason(s) why this is a sane
default for XFS.  But when XFS is deployed ontop of DM thin provisioning
this infinite retry is very unwelcome -- especially if DM thinp was
configured to be automatically extended with free space but the admin
hasn't provided (or restored) adequate free space.

To fix this infinite retry a new bdev_has_space () hook is added to XFS
to break out of its metadata retry loop if the underlying block device
reports it no longer has free space.  DM thin provisioning is now
trained to respond accordingly, which enables XFS to not cause a cascade
of tasks blocked on IO waiting for XFS's infinite retry.

All other block devices, which don't implement a .has_space method in
block_device_operations, will always return true for bdev_has_space().

With this change XFS will fail the metadata IO, force shutdown, and the
XFS filesystem may be unmounted.  This enables an admin to recover from
their oversight, of not having provided enough free space, without
having to force a hard reset of the system to get XFS to unwedge.

Signed-off-by: Mike Snitzer 
---
 drivers/md/dm-thin.c  | 27 +--
 drivers/md/dm.c   | 33 +
 fs/block_dev.c| 10 ++
 fs/xfs/xfs_buf_item.c |  3 +++
 include/linux/blkdev.h|  3 +++
 include/linux/device-mapper.h |  8 
 6 files changed, 82 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 1c50c58..55ee3cf 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -3948,7 +3948,7 @@ static struct target_type pool_target = {
.name = "thin-pool",
.features = DM_TARGET_SINGLETON | DM_TARGET_ALWAYS_WRITEABLE |
DM_TARGET_IMMUTABLE,
-   .version = {1, 16, 0},
+   .version = {1, 17, 0},
.module = THIS_MODULE,
.ctr = pool_ctr,
.dtr = pool_dtr,
@@ -4333,9 +4333,31 @@ static void thin_io_hints(struct dm_target *ti, struct 
queue_limits *limits)
limits->max_discard_sectors = 2048 * 1024 * 16; /* 16G */
 }
 
+static bool thin_has_space(struct dm_target *ti)
+{
+   struct thin_c *tc = ti->private;
+   struct pool *pool = tc->pool;
+   enum pool_mode m = get_pool_mode(pool);
+
+   /*
+* The thin-pool has space if it is either in write mode _or_
+* it is still waiting for space to be added.
+*
+* If 'error_if_no_space' was configured the pool will not queue
+* IO at all, even though the pool will stay in OODS mode, so
+* there is no point having upper layers (e.g. XFS) retry IO
+* given 'error_if_no_space' is meant to _not_ queue IO.
+*/
+   if (m == PM_WRITE ||
+   (m == PM_OUT_OF_DATA_SPACE && !pool->pf.error_if_no_space))
+   return true;
+
+   return false;
+}
+
 static struct target_type thin_target = {
.name = "thin",
-   .version = {1, 16, 0},
+   .version = {1, 17, 0},
.module = THIS_MODULE,
.ctr = thin_ctr,
.dtr = thin_dtr,
@@ -4348,6 +4370,7 @@ static struct target_type thin_target = {
.merge = thin_merge,
.iterate_devices = thin_iterate_devices,
.io_hints = thin_io_hints,
+   .has_space = thin_has_space,
 };
 
 /**/
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index ab37ae1..14bf9df 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -597,6 +597,38 @@ out:
return r;
 }
 
+static bool dm_blk_has_space(struct block_device *bdev)
+{
+   struct mapped_device *md = bdev->bd_disk->private_data;
+   int srcu_idx;
+   struct dm_table *map;
+   struct dm_target *tgt;
+   bool r = true;
+
+   map = dm_get_live_table(md, _idx);
+
+   if (!map || !dm_table_get_size(map))
+   goto out;
+
+   /* We only support devices that have a single target */
+   if (dm_table_get_num_targets(map) != 1)
+   goto out;
+
+   tgt = dm_table_get_target(map, 0);
+   if (!tgt->type->has_space)
+   goto out;
+
+   if (dm_suspended_md(md))
+   goto out;
+
+   r = tgt->type->has_space(tgt);
+
+out:
+   dm_put_live_table(md, srcu_idx);
+
+   return r;
+}
+
 static struct dm_io *alloc_io(struct mapped_device *md)
 {
return mempool_alloc(md->io_pool, GFP_NOIO);
@@ -3647,6 +3679,7 @@ static const struct block_device_operations dm_blk_dops = 
{
.release = dm_blk_close,
.ioctl = dm_blk_ioctl,
.getgeo = dm_blk_getgeo,
+   .has_space = dm_blk_has_space,
.owner = THIS_MODULE
 };
 
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 1982437..5034361 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -469,6

[RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-20 Thread Mike Snitzer

If XFS fails to write metadata it will retry the write indefinitely
(with the hope that the write will succeed at some point in the future).

Others can possibly speak to historic reason(s) why this is a sane
default for XFS.  But when XFS is deployed ontop of DM thin provisioning
this infinite retry is very unwelcome -- especially if DM thinp was
configured to be automatically extended with free space but the admin
hasn't provided (or restored) adequate free space.

To fix this infinite retry a new bdev_has_space () hook is added to XFS
to break out of its metadata retry loop if the underlying block device
reports it no longer has free space.  DM thin provisioning is now
trained to respond accordingly, which enables XFS to not cause a cascade
of tasks blocked on IO waiting for XFS's infinite retry.

All other block devices, which don't implement a .has_space method in
block_device_operations, will always return true for bdev_has_space().

With this change XFS will fail the metadata IO, force shutdown, and the
XFS filesystem may be unmounted.  This enables an admin to recover from
their oversight, of not having provided enough free space, without
having to force a hard reset of the system to get XFS to unwedge.

Signed-off-by: Mike Snitzer snit...@redhat.com
---
 drivers/md/dm-thin.c  | 27 +--
 drivers/md/dm.c   | 33 +
 fs/block_dev.c| 10 ++
 fs/xfs/xfs_buf_item.c |  3 +++
 include/linux/blkdev.h|  3 +++
 include/linux/device-mapper.h |  8 
 6 files changed, 82 insertions(+), 2 deletions(-)

diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index 1c50c58..55ee3cf 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -3948,7 +3948,7 @@ static struct target_type pool_target = {
.name = thin-pool,
.features = DM_TARGET_SINGLETON | DM_TARGET_ALWAYS_WRITEABLE |
DM_TARGET_IMMUTABLE,
-   .version = {1, 16, 0},
+   .version = {1, 17, 0},
.module = THIS_MODULE,
.ctr = pool_ctr,
.dtr = pool_dtr,
@@ -4333,9 +4333,31 @@ static void thin_io_hints(struct dm_target *ti, struct 
queue_limits *limits)
limits-max_discard_sectors = 2048 * 1024 * 16; /* 16G */
 }
 
+static bool thin_has_space(struct dm_target *ti)
+{
+   struct thin_c *tc = ti-private;
+   struct pool *pool = tc-pool;
+   enum pool_mode m = get_pool_mode(pool);
+
+   /*
+* The thin-pool has space if it is either in write mode _or_
+* it is still waiting for space to be added.
+*
+* If 'error_if_no_space' was configured the pool will not queue
+* IO at all, even though the pool will stay in OODS mode, so
+* there is no point having upper layers (e.g. XFS) retry IO
+* given 'error_if_no_space' is meant to _not_ queue IO.
+*/
+   if (m == PM_WRITE ||
+   (m == PM_OUT_OF_DATA_SPACE  !pool-pf.error_if_no_space))
+   return true;
+
+   return false;
+}
+
 static struct target_type thin_target = {
.name = thin,
-   .version = {1, 16, 0},
+   .version = {1, 17, 0},
.module = THIS_MODULE,
.ctr = thin_ctr,
.dtr = thin_dtr,
@@ -4348,6 +4370,7 @@ static struct target_type thin_target = {
.merge = thin_merge,
.iterate_devices = thin_iterate_devices,
.io_hints = thin_io_hints,
+   .has_space = thin_has_space,
 };
 
 /**/
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index ab37ae1..14bf9df 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -597,6 +597,38 @@ out:
return r;
 }
 
+static bool dm_blk_has_space(struct block_device *bdev)
+{
+   struct mapped_device *md = bdev-bd_disk-private_data;
+   int srcu_idx;
+   struct dm_table *map;
+   struct dm_target *tgt;
+   bool r = true;
+
+   map = dm_get_live_table(md, srcu_idx);
+
+   if (!map || !dm_table_get_size(map))
+   goto out;
+
+   /* We only support devices that have a single target */
+   if (dm_table_get_num_targets(map) != 1)
+   goto out;
+
+   tgt = dm_table_get_target(map, 0);
+   if (!tgt-type-has_space)
+   goto out;
+
+   if (dm_suspended_md(md))
+   goto out;
+
+   r = tgt-type-has_space(tgt);
+
+out:
+   dm_put_live_table(md, srcu_idx);
+
+   return r;
+}
+
 static struct dm_io *alloc_io(struct mapped_device *md)
 {
return mempool_alloc(md-io_pool, GFP_NOIO);
@@ -3647,6 +3679,7 @@ static const struct block_device_operations dm_blk_dops = 
{
.release = dm_blk_close,
.ioctl = dm_blk_ioctl,
.getgeo = dm_blk_getgeo,
+   .has_space = dm_blk_has_space,
.owner = THIS_MODULE
 };
 
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 1982437..5034361 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -469,6

Re: [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-20 Thread Dave Chinner

On Mon, Jul 20, 2015 at 07:20:58PM -0400, Mike Snitzer wrote:
 On Mon, Jul 20 2015 at  6:36pm -0400,
 Dave Chinner da...@fromorbit.com wrote:
 
  On Mon, Jul 20, 2015 at 11:18:49AM -0400, Mike Snitzer wrote:
   If XFS fails to write metadata it will retry the write indefinitely
   (with the hope that the write will succeed at some point in the future).
   
   Others can possibly speak to historic reason(s) why this is a sane
   default for XFS.  But when XFS is deployed ontop of DM thin provisioning
   this infinite retry is very unwelcome -- especially if DM thinp was
   configured to be automatically extended with free space but the admin
   hasn't provided (or restored) adequate free space.
   
   To fix this infinite retry a new bdev_has_space () hook is added to XFS
   to break out of its metadata retry loop if the underlying block device
   reports it no longer has free space.  DM thin provisioning is now
   trained to respond accordingly, which enables XFS to not cause a cascade
   of tasks blocked on IO waiting for XFS's infinite retry.
   
   All other block devices, which don't implement a .has_space method in
   block_device_operations, will always return true for bdev_has_space().
   
   With this change XFS will fail the metadata IO, force shutdown, and the
   XFS filesystem may be unmounted.  This enables an admin to recover from
   their oversight, of not having provided enough free space, without
   having to force a hard reset of the system to get XFS to unwedge.
   
   Signed-off-by: Mike Snitzer snit...@redhat.com
  
  Shouldn't dm-thinp just return the bio with ENOSPC as it's error?
  The scsi layers already do this for hardware thinp ENOSPC failures,
  so dm-thinp should behave exactly the same (i.e. via
  __scsi_error_from_host_byte())
 
 DM thinp does return -ENOSPC (uniformly so, as of 4.2-rc3 via commit
 bcc696fac1).  Before it was more nuanced, but it would either return
 -ENOSPC or -EIO (transition from -ENOSPC to -EIO was when thinp's
 no_space_timoeut expired).

Right, which meant we couldn't get a reliable ENOSPC detection, so
we just had to retry and hope

 Anyway, XFS doesn't care what error is returned it'll just keep
 retrying metdata IO indefinitely.

Sure, that's historical behaviour and it's a good default for most
systems. But see here for how we plan to handle this particular
issue of error catergorisation:

http://oss.sgi.com/archives/xfs/2015-02/msg00343.html

BTW, it's not just ENOSPC we need this for - ENOMEM needs the same
configurable behaviour now that the mm subsystem has started
treating GFP_NOFAIL differently to just looping outside the
allocator

  The behaviour of the filesystem
  should be the same in all cases - making it conditional on whether
  the thinp implementation can be polled for available space is wrong
  as most hardware thinp can't be polled by the kernel forthis info..
 
 There isn't an immediate corollary for SCSI no.  I've discussed flagging
 the SCSI block device as out of space when the equivalent HW thinp
 ASC/ASQ is returned from the SCSI target but there isn't an existing way
 to know space was added to reset this state (like is possible with DM
 thinp).
 
 But in the past you (or others) have claimed that ENOSPC isn't a rich
 enough return to _know_ that it is a hard error.

Yes, I've said that in the past because we haven't had any agreement
on how block devices should report ENOSPC. However, this is slowly
being standardised and I've seen enough people say
give us a knob to decide that ourselves as we know our hardware
that means we can rely on ENOSPC on the bio to detect this issue and
handle it appropriately.

 Now you seem to be oscillating away from a richer interface between
 filesystems and the underlying block device to: a simple error code is
 enough.

Once we decided to give users the option to configure block device
error handling, the need for more information from the block device
goes away...

  If dm-thinp just returns ENOSPC from on the BIO like other hardware
  thinp devices, then it is up to the filesystem to handle that
  appropriately.  i.e. whether an ENOSPC IO error is fatal to the
  filesystem is determined by filesystem configuration and context of
  the IO error, not whether the block device has no space (which we
  should already know from the ENOSPC error delivered by IO
  completion).
 
 OK, _please_ add that configurability to XFS.  As I think you know, this
 is a long-standing XFS on thinp issue (both DM and HW thinp).

It got stuck on the side of the bikeshed, unfortunately, and then
other stuff got painted on top of it. I'll go uncover it now that
dm-thinp is returning reliable ENOSPC in ENOSPC conditions ;)

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at

Re: [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-20 Thread Dave Chinner

On Mon, Jul 20, 2015 at 11:18:49AM -0400, Mike Snitzer wrote:
 If XFS fails to write metadata it will retry the write indefinitely
 (with the hope that the write will succeed at some point in the future).
 
 Others can possibly speak to historic reason(s) why this is a sane
 default for XFS.  But when XFS is deployed ontop of DM thin provisioning
 this infinite retry is very unwelcome -- especially if DM thinp was
 configured to be automatically extended with free space but the admin
 hasn't provided (or restored) adequate free space.
 
 To fix this infinite retry a new bdev_has_space () hook is added to XFS
 to break out of its metadata retry loop if the underlying block device
 reports it no longer has free space.  DM thin provisioning is now
 trained to respond accordingly, which enables XFS to not cause a cascade
 of tasks blocked on IO waiting for XFS's infinite retry.
 
 All other block devices, which don't implement a .has_space method in
 block_device_operations, will always return true for bdev_has_space().
 
 With this change XFS will fail the metadata IO, force shutdown, and the
 XFS filesystem may be unmounted.  This enables an admin to recover from
 their oversight, of not having provided enough free space, without
 having to force a hard reset of the system to get XFS to unwedge.
 
 Signed-off-by: Mike Snitzer snit...@redhat.com

Shouldn't dm-thinp just return the bio with ENOSPC as it's error?
The scsi layers already do this for hardware thinp ENOSPC failures,
so dm-thinp should behave exactly the same (i.e. via
__scsi_error_from_host_byte()). The behaviour of the filesystem
should be the same in all cases - making it conditional on whether
the thinp implementation can be polled for available space is wrong
as most hardware thinp can't be polled by the kernel forthis info..


If dm-thinp just returns ENOSPC from on the BIO like other hardware
thinp devices, then it is up to the filesystem to handle that
appropriately.  i.e. whether an ENOSPC IO error is fatal to the
filesystem is determined by filesystem configuration and context of
the IO error, not whether the block device has no space (which we
should already know from the ENOSPC error delivered by IO
completion).

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC PATCH] block: xfs: dm thin: train XFS to give up on retrying IO if thinp is out of space

2015-07-20 Thread Mike Snitzer

On Mon, Jul 20 2015 at  6:36pm -0400,
Dave Chinner da...@fromorbit.com wrote:

 On Mon, Jul 20, 2015 at 11:18:49AM -0400, Mike Snitzer wrote:
  If XFS fails to write metadata it will retry the write indefinitely
  (with the hope that the write will succeed at some point in the future).
  
  Others can possibly speak to historic reason(s) why this is a sane
  default for XFS.  But when XFS is deployed ontop of DM thin provisioning
  this infinite retry is very unwelcome -- especially if DM thinp was
  configured to be automatically extended with free space but the admin
  hasn't provided (or restored) adequate free space.
  
  To fix this infinite retry a new bdev_has_space () hook is added to XFS
  to break out of its metadata retry loop if the underlying block device
  reports it no longer has free space.  DM thin provisioning is now
  trained to respond accordingly, which enables XFS to not cause a cascade
  of tasks blocked on IO waiting for XFS's infinite retry.
  
  All other block devices, which don't implement a .has_space method in
  block_device_operations, will always return true for bdev_has_space().
  
  With this change XFS will fail the metadata IO, force shutdown, and the
  XFS filesystem may be unmounted.  This enables an admin to recover from
  their oversight, of not having provided enough free space, without
  having to force a hard reset of the system to get XFS to unwedge.
  
  Signed-off-by: Mike Snitzer snit...@redhat.com
 
 Shouldn't dm-thinp just return the bio with ENOSPC as it's error?
 The scsi layers already do this for hardware thinp ENOSPC failures,
 so dm-thinp should behave exactly the same (i.e. via
 __scsi_error_from_host_byte())

DM thinp does return -ENOSPC (uniformly so, as of 4.2-rc3 via commit
bcc696fac1).  Before it was more nuanced, but it would either return
-ENOSPC or -EIO (transition from -ENOSPC to -EIO was when thinp's
no_space_timoeut expired).

Anyway, XFS doesn't care what error is returned it'll just keep
retrying metdata IO indefinitely.

 The behaviour of the filesystem
 should be the same in all cases - making it conditional on whether
 the thinp implementation can be polled for available space is wrong
 as most hardware thinp can't be polled by the kernel forthis info..

There isn't an immediate corollary for SCSI no.  I've discussed flagging
the SCSI block device as out of space when the equivalent HW thinp
ASC/ASQ is returned from the SCSI target but there isn't an existing way
to know space was added to reset this state (like is possible with DM
thinp).

But in the past you (or others) have claimed that ENOSPC isn't a rich
enough return to _know_ that it is a hard error.

Now you seem to be oscillating away from a richer interface between
filesystems and the underlying block device to: a simple error code is
enough.

 If dm-thinp just returns ENOSPC from on the BIO like other hardware
 thinp devices, then it is up to the filesystem to handle that
 appropriately.  i.e. whether an ENOSPC IO error is fatal to the
 filesystem is determined by filesystem configuration and context of
 the IO error, not whether the block device has no space (which we
 should already know from the ENOSPC error delivered by IO
 completion).

OK, _please_ add that configurability to XFS.  As I think you know, this
is a long-standing XFS on thinp issue (both DM and HW thinp).

Mike
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

40 matches

Mail list logo