Re: [PATCH 0/4][V4] drop the mmap_sem when doing IO in the fault path

2018-12-06 Thread Dave Chinner
On Tue, Dec 04, 2018 at 02:49:31PM -0800, Andrew Morton wrote:
> On Fri, 30 Nov 2018 14:58:08 -0500 Josef Bacik  wrote:
> 
> > Now that we have proper isolation in place with cgroups2 we have
> > started going through and fixing the various priority inversions.
> > Most are all gone now, but this one is sort of weird since it's not
> > necessarily a priority inversion that happens within the kernel, but
> > rather because of something userspace does.
> > 
> > We have giant applications that we want to protect, and parts of these
> > giant applications do things like watch the system state to determine
> > how healthy the box is for load balancing and such.  This involves
> > running 'ps' or other such utilities.  These utilities will often walk
> > /proc/<pid>/whatever, and these files can sometimes need to
> > down_read() the task's mmap_sem.  Not usually a big deal, but we
> > noticed when we are stress testing that sometimes our protected
> > application has latency spikes trying to get the mmap_sem for tasks
> > that are in lower priority cgroups.
> > 
> > This is because any down_write() on a semaphore essentially turns it
> > into a mutex, so even if we currently have it held for reading, any
> > new readers will not be allowed on to keep from starving the writer.
> > This is fine, except a lower priority task could be stuck doing IO
> > because it has been throttled to the point that its IO is taking much
> > longer than normal.  But because a higher priority group depends on
> > this completing, it is now stuck behind lower priority work.
> > 
> > In order to avoid this particular priority inversion we want to use
> > the existing retry mechanism to stop from holding the mmap_sem at all
> > if we are going to do IO.  This already exists in the read case, sort
> > of, but needed to be extended for more than just grabbing the page
> > lock.  With io.latency we throttle at submit_bio() time, so the
> > readahead stuff can block and even page_cache_read can block, so all
> > these paths need to have the mmap_sem dropped.
> > 
> > The other big thing is ->page_mkwrite.  btrfs is particularly shitty
> > here because we have to reserve space for the dirty page, which can be
> > a very expensive operation.  We use the same retry method as the read
> > path, and simply cache the page and verify the page is still set up
> > properly the next pass through ->page_mkwrite().
> 
> Seems reasonable.  I have a few minorish changeloggish comments.
> 
> We're at v4 and no acks have been gathered?

I looked at previous versions and had a bunch of questions and
change requests. I haven't had time to look at this version yet,
but seeing as the page_mkwrite() stuff has been dropped from this
version it isn't useful anymore for solving the problem I had in
mind when reviewing it originally...

What I really want is unconditionally retriable page faults so the
filesystem can cause the page fault to be restarted from scratch. We
have a requirement for DAX and shared data extents (reflink) to
work, and that requires changing the faulted page location during
page_mkwrite. i.e. we get a fault on a read mapped shared page, then
we have to do a copy-on-write operation to break physical data
sharing and so the page with the file data in it physically changes
during ->page_mkwrite (because DAX). Hence we need to restart the
page fault to map the new page correctly because the file no longer
points at the page that was originally faulted.

With this stashed-page-and-retry mechanism implemented for
->page_mkwrite, we could stash the new page in the vmf and tell the
fault to retry, and everything would just work. Without
->page_mkwrite support, it's just not that interesting and I have
higher priority things to deal with right now
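
The retry mechanism being discussed boils down to a fault handler that,
instead of doing IO while holding the mmap_sem, drops the lock, kicks off
the IO, and asks the core fault code to restart the fault from scratch. A
minimal sketch against the 4.19-era fault API - not the actual patches,
and my_fault() is a made-up handler name:

#include <linux/mm.h>
#include <linux/fs.h>
#include <linux/pagemap.h>

/*
 * Sketch of the "drop mmap_sem and retry" pattern: if the page isn't
 * cached and we're allowed to retry, release mmap_sem, start the read
 * IO without the lock held, and return VM_FAULT_RETRY so the fault is
 * restarted once the page is (hopefully) in cache.
 */
static vm_fault_t my_fault(struct vm_fault *vmf)
{
	struct file *file = vmf->vma->vm_file;
	struct mm_struct *mm = vmf->vma->vm_mm;
	struct page *page;

	page = find_get_page(file->f_mapping, vmf->pgoff);
	if (!page && (vmf->flags & FAULT_FLAG_ALLOW_RETRY) &&
	    !(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
		/* Don't hold mmap_sem across potentially throttled IO. */
		up_read(&mm->mmap_sem);
		page_cache_sync_readahead(file->f_mapping, &file->f_ra,
					  file, vmf->pgoff, 1);
		return VM_FAULT_RETRY;
	}

	/* Page already cached (or retry not allowed): normal path. */
	if (page)
		put_page(page);
	return filemap_fault(vmf);
}

The interesting part for the DAX/reflink case above is the restart: once
a fault can always be retried, ->page_mkwrite is free to hand back a
different page and let the fault map it correctly on the second pass.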

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH AUTOSEL 4.14 25/35] iomap: sub-block dio needs to zeroout beyond EOF

2018-12-02 Thread Dave Chinner
On Sat, Dec 01, 2018 at 02:49:09AM -0500, Sasha Levin wrote:
> On Sat, Dec 01, 2018 at 08:50:05AM +1100, Dave Chinner wrote:
> >On Fri, Nov 30, 2018 at 05:14:41AM -0500, Sasha Levin wrote:
> >>On Fri, Nov 30, 2018 at 09:22:03AM +0100, Greg KH wrote:
> >>>On Fri, Nov 30, 2018 at 09:40:19AM +1100, Dave Chinner wrote:
> >>>>I stopped my tests at 5 billion ops yesterday (i.e. 20 billion ops
> >>>>aggregate) to focus on testing the copy_file_range() changes, but
> >>>>Darrick's tests are still ongoing and have passed 40 billion ops in
> >>>>aggregate over the past few days.
> >>>>
> >>>>The reason we are running these so long is that we've seen fsx data
> >>>>corruption failures after 12+ hours of runtime and hundreds of
> >>>>millions of ops. Hence the testing for backported fixes will need to
> >>>>replicate these test runs across multiple configurations for
> >>>>multiple days before we have any confidence that we've actually
> >>>>fixed the data corruptions and not introduced any new ones.
> >>>>
> >>>>If you pull only a small subset of the fixes, the fsx will still
> >>>>fail and we have no real way of actually verifying that there have
> >>>>been no regression introduced by the backport.  IOWs, there's a
> >>>>/massive/ amount of QA needed for ensuring that these backports work
> >>>>correctly.
> >>>>
> >>>>Right now the XFS developers don't have the time or resources
> >>>>available to validate stable backports are correct and regression
> >>>>free because we are focussed on ensuring the upstream fixes we've
> >>>>already made (and are still writing) are solid and reliable.
> >>>
> >>>Ok, that's fine, so users of XFS should wait until the 4.20 release
> >>>before relying on it?  :)
> >>
> >>It's getting to the point that with the amount of known issues with XFS
> >>on LTS kernels it makes sense to mark it as CONFIG_BROKEN.
> >
> >Really? Where are the bug reports?
> 
> In 'git log'! You report these every time you fix something in upstream
> xfs but don't backport it to stable trees:

That is so wrong on so many levels I don't really know where to
begin. I guess doing a *basic risk analysis* demonstrating that none
of those fixes are backport candidates is a good start:

> $ git log --oneline v4.18-rc1..v4.18 fs/xfs
> d4a34e165557 xfs: properly handle free inodes in extent hint validators

Found by QA with generic/229 on a non-standard config. Not user
reported, unlikely to ever be seen by users.

> 9991274fddb9 xfs: Initialize variables in xfs_alloc_get_rec before using them

Cleaning up coverity-reported issues to do with corruption log
messages. No visible symptoms, not user reported.

> d8cb5e423789 xfs: fix fdblocks accounting w/ RMAPBT per-AG reservation

Minor free space accounting issue, not user reported, doesn't affect
normal operation.

> e53c4b598372 xfs: ensure post-EOF zeroing happens after zeroing part of a file

Found with fsx via generic/127. Not user reported, doesn't affect
userspace operation at all.

> a3a374bf1889 xfs: fix off-by-one error in xfs_rtalloc_query_range

Regression fix for code introduced in 4.18-rc1. Not user reported
because the code has never been released.

> 232d0a24b0fc xfs: fix uninitialized field in rtbitmap fsmap backend

Coverity warning fix, not user reported, no user impact.

> 5bd88d153998 xfs: recheck reflink state after grabbing ILOCK_SHARED for a 
> write

Fixes warning from generic/166, not user reported. Could affect
users mixing direct IO with reflink, but we expect people using
new functionality like reflink to be tracking TOT fairly closely
anyway.

> f62cb48e4319 xfs: don't allow insert-range to shift extents past the maximum 
> offset

Found by QA w/ generic/465. Not user reported, only affects files in
the exabyte range, so not a real world problem.

> aafe12cee0b1 xfs: don't trip over negative free space in xfs_reserve_blocks

Found during ENOSPC stress tests that depleted the reserve pool.
Not user reported, unlikely to ever be hit by users.

> 10ee25268e1f xfs: allow empty transactions while frozen

Removes a spurious warning when running the GETFSMAP ioctl on a frozen
filesystem. Not user reported, highly unlikely any user will ever
hit this as nothing but XFS utilities use GETFSMAP at the moment.

> e53946dbd31a xfs: xfs_iflush_abort() can be called twice on cluster writeback 
> failure

Bug in corrupted filesystem handling, been there for ~15 years IIRC.
Not user reported - found by one of our shutdown stress tests
on a debug kernel (generic/388, IIRC). Highly 


Re: [PATCH AUTOSEL 4.14 25/35] iomap: sub-block dio needs to zeroout beyond EOF

2018-11-30 Thread Dave Chinner
On Fri, Nov 30, 2018 at 05:14:41AM -0500, Sasha Levin wrote:
> On Fri, Nov 30, 2018 at 09:22:03AM +0100, Greg KH wrote:
> >On Fri, Nov 30, 2018 at 09:40:19AM +1100, Dave Chinner wrote:
> >>I stopped my tests at 5 billion ops yesterday (i.e. 20 billion ops
> >>aggregate) to focus on testing the copy_file_range() changes, but
> >>Darrick's tests are still ongoing and have passed 40 billion ops in
> >>aggregate over the past few days.
> >>
> >>The reason we are running these so long is that we've seen fsx data
> >>corruption failures after 12+ hours of runtime and hundreds of
> >>millions of ops. Hence the testing for backported fixes will need to
> >>replicate these test runs across multiple configurations for
> >>multiple days before we have any confidence that we've actually
> >>fixed the data corruptions and not introduced any new ones.
> >>
> >>If you pull only a small subset of the fixes, the fsx will still
> >>fail and we have no real way of actually verifying that there have
> >>been no regression introduced by the backport.  IOWs, there's a
> >>/massive/ amount of QA needed for ensuring that these backports work
> >>correctly.
> >>
> >>Right now the XFS developers don't have the time or resources
> >>available to validate stable backports are correct and regression
> >>free because we are focussed on ensuring the upstream fixes we've
> >>already made (and are still writing) are solid and reliable.
> >
> >Ok, that's fine, so users of XFS should wait until the 4.20 release
> >before relying on it?  :)
> 
> It's getting to the point that with the amount of known issues with XFS
> on LTS kernels it makes sense to mark it as CONFIG_BROKEN.

Really? Where are the bug reports?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH AUTOSEL 4.14 25/35] iomap: sub-block dio needs to zeroout beyond EOF

2018-11-30 Thread Dave Chinner
On Fri, Nov 30, 2018 at 09:22:03AM +0100, Greg KH wrote:
> On Fri, Nov 30, 2018 at 09:40:19AM +1100, Dave Chinner wrote:
> > On Thu, Nov 29, 2018 at 01:47:56PM +0100, Greg KH wrote:
> > > On Thu, Nov 29, 2018 at 11:14:59PM +1100, Dave Chinner wrote:
> > > > 
> > > > Cherry picking only one of the 50-odd patches we've committed into
> > > > late 4.19 and 4.20 kernels to fix the problems we've found really
> > > > seems like asking for trouble. If you're going to back port random
> > > > data corruption fixes, then you need to spend a *lot* of time
> > > > validating that it doesn't make things worse than they already
> > > > are...
> > > 
> > > Any reason why we can't take the 50-odd patches in their entirety?  It
> > > sounds like 4.19 isn't fully fixed, but 4.20-rc1 is?  If so, what do you
> > > recommend we do to make 4.19 working properly?
> > 
> > You could pull all the fixes, but then you have a QA problem.
> > Basically, we have multiple badly broken syscalls (FICLONERANGE,
> > FIDEDUPERANGE and copy_file_range), and even 4.20-rc4 isn't fully
> > fixed.
> > 
> > There were ~5 critical dedupe/clone data corruption fixes for XFS
> > that went into 4.19-rc8.
> 
> Have any of those been tagged for stable?

None, because I have no confidence that the stable process will do
the necessary QA to validate that such a significant backport is
regression and data corruption free.  The backport needs to be done
as a complete series once we've finished the upstream work: we can't
test isolated patches adequately, because fsx will fall over on all
the unfixed problems before it ever exercises the fixes that were
backported.

Further, we just had a regression reported in one of the commits that
the autosel bot has selected for automatic backports. It was
uncovered by overlay, which appears to do some unique things with
the piece of crap that is do_splice_direct(). And Darrick just
commented on #xfs that he's noticed more bugs with FICLONERANGE
and overlay.

IOWs, we're still finding broken stuff in this code and we are
fixing it as fast as we can - we're still putting out fires. We most
certainly don't need the added pressure of having you guys create
more spot fires by breaking stable kernels with largely untested
partial backports and having users exposed to whacky new data
corruption issues.

So, no, it isn't tagged for stable kernels because "commit into
mainline" != "this should be backported immediately". Backports of
these fixes are going to be done largely as a function of
time and resources, of which we have zero available right now. Doing
backports right now is premature and ill-advised because we haven't
finished finding and fixing all the bugs and regressions in this
code.

> > Right now the XFS developers don't have the time or resources
> > available to validate stable backports are correct and regression
> > free because we are focussed on ensuring the upstream fixes we've
> > already made (and are still writing) are solid and reliable.
> 
> Ok, that's fine, so users of XFS should wait until the 4.20 release
> before relying on it?  :)

Ok, Greg, that's *out of line*.

I should throw the CoC at you because I find that comment offensive,
condescending, belittling, denigrating and insulting.  Your smug and
superior "I know what is right for you" attitude is completely
inappropriate, and a little smiley face does not make it acceptable.

If you think your comment is funny, you've badly misjudged how much
effort I've put into this (100-hour weeks for over a month now), how
close I'm flying to burn out (again!), and how pissed off I am about
this whole scenario.

We ended up here because we *trusted* that other people had
implemented and tested their APIs and code properly before it got
merged. We've been severely burnt, and we've been left to clean up
the mess made by other people by ourselves.

Instead of thanks, what we get instead is "we know better" attitude
and jokes implying our work is crap and we don't care about our
users. That's just plain *insulting*.  If anyone is looking for a
demonstration of everything that is wrong with the Linux kernel
development culture, then they don't need to look any further.

> I understand your reluctance to want to backport anything, but it really
> feels like you are not even allowing for fixes that are "obviously
> right" to be backported either, even after they pass testing.  Which
> isn't ok for your users.

It's worse for our users if we introduce regressions into stable
kernels, which is exactly what this "obviously right" auto-backport
would have done.

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH AUTOSEL 4.14 25/35] iomap: sub-block dio needs to zeroout beyond EOF

2018-11-29 Thread Dave Chinner
On Thu, Nov 29, 2018 at 01:47:56PM +0100, Greg KH wrote:
> On Thu, Nov 29, 2018 at 11:14:59PM +1100, Dave Chinner wrote:
> > 
> > Cherry picking only one of the 50-odd patches we've committed into
> > late 4.19 and 4.20 kernels to fix the problems we've found really
> > seems like asking for trouble. If you're going to back port random
> > data corruption fixes, then you need to spend a *lot* of time
> > validating that it doesn't make things worse than they already
> > are...
> 
> Any reason why we can't take the 50-odd patches in their entirety?  It
> sounds like 4.19 isn't fully fixed, but 4.20-rc1 is?  If so, what do you
> recommend we do to make 4.19 working properly?

You could pull all the fixes, but then you have a QA problem.
Basically, we have multiple badly broken syscalls (FICLONERANGE,
FIDEDUPERANGE and copy_file_range), and even 4.20-rc4 isn't fully
fixed.

There were ~5 critical dedupe/clone data corruption fixes for XFS
that went into 4.19-rc8.

There were ~30 patches that went into 4.20-rc1 that fixed the
FICLONERANGE/FIDEDUPERANGE ioctls. That completely reworks the
entire VFS infrastructure for those calls, and touches several
filesystems as well. It fixes problems with setuid files, swap
files, modifying immutable files, failure to enforce rlimit and
max file size constraints, behaviour that didn't match man page
descriptions, etc.

There were another ~10 patches that went into 4.20-rc4 that fixed
yet more data corruption and API problems that we found when we
enhanced fsx to use the above syscalls.

And I have another ~10 patches that I'm working on right now to fix
the copy_file_range() implementation - it has all the same problems
I listed above for FICLONERANGE/FIDEDUPERANGE and some other unique
ones. I'm currently writing error condition tests for fstests so
that we at least have some coverage of the conditions
copy_file_range() is supposed to catch and fail. This might all make
a late 4.20-rcX, but it's looking more like 4.21 at this point.
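
For anyone who hasn't poked at these interfaces: FICLONERANGE shares
(reflinks) a byte range between two files, FIDEDUPERANGE does the same
only after verifying the contents match, and copy_file_range() copies a
range. A minimal userspace sketch of the first and last of those - error
handling trimmed, paths hypothetical, and copy_file_range() assumes a
glibc new enough (2.27+) to provide the wrapper:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FICLONERANGE, struct file_clone_range */

int main(void)
{
	int src = open("/mnt/test/src", O_RDONLY);
	int dst = open("/mnt/test/dst", O_RDWR | O_CREAT, 0644);
	if (src < 0 || dst < 0)
		return 1;

	/* Share (reflink) the first 1MB of src into dst at offset 0. */
	struct file_clone_range fcr = {
		.src_fd = src,
		.src_offset = 0,
		.src_length = 1024 * 1024,
		.dest_offset = 0,
	};
	if (ioctl(dst, FICLONERANGE, &fcr) < 0)
		perror("FICLONERANGE");

	/* Copy (rather than share) the next 1MB with copy_file_range(). */
	loff_t off_in = 1024 * 1024, off_out = 1024 * 1024;
	if (copy_file_range(src, &off_in, dst, &off_out, 1024 * 1024, 0) < 0)
		perror("copy_file_range");

	close(src);
	close(dst);
	return 0;
}

The point of the fixes above is that these calls have to respect the same
setuid, immutable, swapfile, rlimit and max file size rules as a plain
write(), and the original implementations missed a lot of those corners.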

As to testing this stuff, I've spent several weeks now on this and
so has Darrick. Between us we've done a huge amount of QA needed to
verify that the problems are fixed and it is still ongoing. From
#xfs a couple of days ago:

[28/11/18 16:59] * djwong hits 6 billion fsxops...
[28/11/18 17:07]  djwong: I've got about 3.75 billion ops running on a machine here
[28/11/18 17:20]  note that's 1 billion fsxops x 6 machines
[28/11/18 17:21]  [xfsv4, xfsv5, xfsv5 w/ 1k blocks] * [directio fsx, buffered fsx]
[28/11/18 17:21]  Oh, I've got 3.75B x 4 instances on one filesystem :P
[28/11/18 17:22]  [direct io, buffered] x [small op lengths, large op lengths]

And this morning:

[30/11/18 08:53]  7 billion fsxops...

I stopped my tests at 5 billion ops yesterday (i.e. 20 billion ops
aggregate) to focus on testing the copy_file_range() changes, but
Darrick's tests are still ongoing and have passed 40 billion ops in
aggregate over the past few days.

The reason we are running these so long is that we've seen fsx data
corruption failures after 12+ hours of runtime and hundreds of
millions of ops. Hence the testing for backported fixes will need to
replicate these test runs across multiple configurations for
multiple days before we have any confidence that we've actually
fixed the data corruptions and not introduced any new ones.

If you pull only a small subset of the fixes, the fsx will still
fail and we have no real way of actually verifying that there have
been no regression introduced by the backport.  IOWs, there's a
/massive/ amount of QA needed for ensuring that these backports work
correctly.

Right now the XFS developers don't have the time or resources
available to validate stable backports are correct and regression
free because we are focussed on ensuring the upstream fixes we've
already made (and are still writing) are solid and reliable.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH AUTOSEL 4.19 52/68] iomap: sub-block dio needs to zeroout beyond EOF

2018-11-29 Thread Dave Chinner
On Thu, Nov 29, 2018 at 02:36:50PM +0200, Amir Goldstein wrote:
> > Same again - what's the test plan for these cherry-picked data
> > corruption fixes?
> >
> 
> Dave,
> 
> Just to check if we are on the same page, if this was the test plan:
> https://www.spinics.net/lists/linux-xfs/msg20639.html
> 
> Would "no regressions from baseline" have been sufficient to validate
> those specific patches are solid for stable?

No, not in this case, because fsx in fstests only does 100k ops at
most - it's a smoke test. This isn't sufficient to regression test
fixes that, in some cases, took hundreds of millions of fsx ops to
expose.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH AUTOSEL 4.19 52/68] iomap: sub-block dio needs to zeroout beyond EOF

2018-11-29 Thread Dave Chinner
On Thu, Nov 29, 2018 at 12:55:43AM -0500, Sasha Levin wrote:
> From: Dave Chinner 
> 
> [ Upstream commit b450672fb66b4a991a5b55ee24209ac7ae7690ce ]
> 
> If we are doing sub-block dio that extends EOF, we need to zero
> the unused tail of the block to initialise the data in it. If we
> do not zero the tail of the block, then an immediate mmap read of
> the EOF block will expose stale data beyond EOF to userspace. Found
> with fsx running sub-block DIO sizes vs MAPREAD/MAPWRITE operations.
> 
> Fix this by detecting if the end of the DIO write is beyond EOF
> and zeroing the tail if necessary.
> 
> Signed-off-by: Dave Chinner 
> Reviewed-by: Christoph Hellwig 
> Reviewed-by: Darrick J. Wong 
> Signed-off-by: Darrick J. Wong 
> Signed-off-by: Sasha Levin 
> ---
>  fs/iomap.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index fa46e3ed8f53..82e35265679d 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -1678,7 +1678,14 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>   dio->submit.cookie = submit_bio(bio);
>   } while (nr_pages);
>  
> - if (need_zeroout) {
> + /*
> +  * We need to zeroout the tail of a sub-block write if the extent type
> +  * requires zeroing or the write extends beyond EOF. If we don't zero
> +  * the block tail in the latter case, we can expose stale data via mmap
> +  * reads of the EOF block.
> +  */
> + if (need_zeroout ||
> + ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
>   /* zero out from the end of the write to the end of the block */
>   pad = pos & (fs_block_size - 1);
>   if (pad)

Same again - what's the test plan for these cherry-picked data
corruption fixes?
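
For context, the failure mode that commit message describes can be
pictured from userspace: a sub-block O_DIRECT write that extends EOF
leaves the rest of that filesystem block uninitialised, and an immediate
mmap() read of the EOF block can then see whatever stale bytes were on
disk. A rough illustration - hypothetical path, assumes a 4k-block
filesystem with 512-byte logical sectors, and is not a guaranteed
reproducer:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* 512-byte O_DIRECT write into a new file on a 4k-block filesystem. */
	int fd = open("/mnt/test/file", O_RDWR | O_CREAT | O_TRUNC | O_DIRECT, 0644);
	if (fd < 0)
		return 1;

	void *buf;
	if (posix_memalign(&buf, 512, 512))
		return 1;
	memset(buf, 'A', 512);

	/* EOF is now 512; the remaining 3584 bytes of the block are "tail". */
	if (pwrite(fd, buf, 512, 0) != 512)
		return 1;

	/*
	 * mmap read of the EOF page. Bytes beyond EOF within the page are
	 * defined to read back as zeroes; if the DIO path didn't zero the
	 * block tail, stale on-disk data could show up here instead.
	 */
	unsigned char *map = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;
	for (int i = 512; i < 4096; i++)
		if (map[i] != 0)
			printf("stale byte at offset %d: 0x%02x\n", i, map[i]);

	munmap(map, 4096);
	close(fd);
	free(buf);
	return 0;
}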

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH AUTOSEL 4.14 25/35] iomap: sub-block dio needs to zeroout beyond EOF

2018-11-29 Thread Dave Chinner
On Thu, Nov 29, 2018 at 01:00:59AM -0500, Sasha Levin wrote:
> From: Dave Chinner 
> 
> [ Upstream commit b450672fb66b4a991a5b55ee24209ac7ae7690ce ]
> 
> If we are doing sub-block dio that extends EOF, we need to zero
> the unused tail of the block to initialise the data in it. If we
> do not zero the tail of the block, then an immediate mmap read of
> the EOF block will expose stale data beyond EOF to userspace. Found
> with fsx running sub-block DIO sizes vs MAPREAD/MAPWRITE operations.
> 
> Fix this by detecting if the end of the DIO write is beyond EOF
> and zeroing the tail if necessary.
> 
> Signed-off-by: Dave Chinner 
> Reviewed-by: Christoph Hellwig 
> Reviewed-by: Darrick J. Wong 
> Signed-off-by: Darrick J. Wong 
> Signed-off-by: Sasha Levin 
> ---
>  fs/iomap.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 8f7673a69273..407efdae3978 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -940,7 +940,14 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
>   dio->submit.cookie = submit_bio(bio);
>   } while (nr_pages);
>  
> - if (need_zeroout) {
> + /*
> +  * We need to zeroout the tail of a sub-block write if the extent type
> +  * requires zeroing or the write extends beyond EOF. If we don't zero
> +  * the block tail in the latter case, we can expose stale data via mmap
> +  * reads of the EOF block.
> +  */
> + if (need_zeroout ||
> + ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
>   /* zero out from the end of the write to the end of the block */
>   pad = pos & (fs_block_size - 1);
>   if (pad)

How do you propose to validate that this doesn't introduce new data
corruptions in isolation? I've spent the last 4 weeks of my life and
about 15 billion fsx ops chasing and validating the data corruption
fixes we've pushed recently into the 4.19 and 4.20 codebase.

Cherry picking only one of the 50-odd patches we've committed into
late 4.19 and 4.20 kernels to fix the problems we've found really
seems like asking for trouble. If you're going to back port random
data corruption fixes, then you need to spend a *lot* of time
validating that it doesn't make things worse than they already
are...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] xfs: Remove noinline from #define STATIC

2018-11-12 Thread Dave Chinner
On Mon, Nov 12, 2018 at 08:23:42PM -0800, Joe Perches wrote:
> On Tue, 2018-11-13 at 14:09 +1100, Dave Chinner wrote:
> > On Mon, Nov 12, 2018 at 08:54:10PM -0500, Theodore Y. Ts'o wrote:
> > > On Tue, Nov 13, 2018 at 12:18:05PM +1100, Dave Chinner wrote:
> > > > I'm not interested in making code fast if distro support engineers
> > > > can't debug problems on user systems easily. Optimising for
> > > > performance over debuggability is a horrible trade off for us to
> > > > make because it means users and distros end up much more reliant on
> > > > single points of expertise for debugging problems. And that means
> > > > the majority of the load of problem triage falls directly on very
> > > > limited resources - the core XFS development team. A little bit of
> > > > thought about how to make code easier to triage and debug goes a
> > > > long, long way
> > > 
> > > So at least in my experience, if the kernels are compiled with
> > > CONFIG_DEBUG_INFO and/or CONFIG_DEBUG_INFO_REDUCED,
> > > scripts/decode_stracktrace.sh seems to do a very nice job with inlined
> > 
> > That doesn't help with kernel profiling and other such things that
> > are based on callgraphs...
> 
> If that's really the case:
> 
> I rather suspect the xfs static v STATIC function marking is not
> particularly curated and the marking is somewhat arbitrary.

That's a common opinion for an outsider to form when they come
across something unfamiliar they don't really understand. "I don't
understand this, so I must rewrite it" is an unfortunate habit that
programmers have.

> So perhaps given the large number of static, but not STATIC
> functions, perhaps a sed of s/static/STATIC/ should be done
> when it's not inline for all xfs functions.

That's just as bad as removing them all, if not worse.

If you are writing new code or reworking existing code, then we'll
consider the usage of STATIC/static in the context of that work.
Otherwise, we leave it alone.

If it ain't broke, don't fix it. And it sure as hell isn't broken
right now. We've got more than enough bugs to fix without having to
deal with drive-by bikeshed painting...

-Dave.

-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] xfs: Remove noinline from #define STATIC

2018-11-12 Thread Dave Chinner
On Mon, Nov 12, 2018 at 08:54:10PM -0500, Theodore Y. Ts'o wrote:
> On Tue, Nov 13, 2018 at 12:18:05PM +1100, Dave Chinner wrote:
> > I'm not interested in making code fast if distro support engineers
> > can't debug problems on user systems easily. Optimising for
> > performance over debuggability is a horrible trade off for us to
> > make because it means users and distros end up much more reliant on
> > single points of expertise for debugging problems. And that means
> > the majority of the load of problem triage falls directly on very
> > limited resources - the core XFS development team. A little bit of
> > thought about how to make code easier to triage and debug goes a
> > long, long way
> 
> So at least in my experience, if the kernels are compiled with
> CONFIG_DEBUG_INFO and/or CONFIG_DEBUG_INFO_REDUCED,
> > > scripts/decode_stacktrace.sh seems to do a very nice job with inlined

That doesn't help with kernel profiling and other such things that
are based on callgraphs...

> functions.  Now, ext4 generally only has about 3 or 4 nested inlines,
> and so I don't know how it works with 20 or 30 nested inlined
> functions, so perhaps this is not applicable for XFS.
> 
> But it perhaps toolchain technology has advanced since the Irix days
> such that it's no longer as necessary to force the non-inlining of
> functions for easing debugging?

Not that I've noticed.

Indeed, modern toolchains are moving the opposite direction - have
you ever tried to debug a binary with gdb that was compiled with LTO
enabled? Or maybe even just tried to profile it with perf?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] xfs: Remove noinline from #define STATIC

2018-11-12 Thread Dave Chinner
On Mon, Nov 12, 2018 at 02:30:01PM -0800, Joe Perches wrote:
> On Tue, 2018-11-13 at 08:45 +1100, Dave Chinner wrote:
> > On Mon, Nov 12, 2018 at 02:12:08PM -0600, Eric Sandeen wrote:
> > > On 11/10/18 7:21 PM, Joe Perches wrote:
> > > > Reduce total object size quite a bit (~32KB) and presumably
> > > > improve performance at the same time.
> > > > 
> > > > Total object size old vs new (x86-64 defconfig with xfs)
> > > > 
> > > > text   data bss dec hex filename
> > > > - 959351 165573 632 1125556  112cb4 (TOTALS) (old)
> > > > + 924683 165669 632 1090984  10a5a8 (TOTALS) (new)
> > > 
> > > And what does it do to maximum stack excursions?
> > 
> > Better yet: what does it do to corruption stack traces and debugging
> > tools like profiling traces?
> > 
> > i.e. this noinline directive isn't about stack usage, this is about
> > being able to debug production code. Basically the compiler inliner
> > is so aggressive on static functions that it makes it impossible to
> > decipher the stack traces. It flattens them way too much to
> > be able to tell how we got to a specific location in the code.
> > 
> > In reality, being able to find problems quickly and efficiently is
> > far more important to us than being able to run everything at
> > ludicrous speed
> 
> Is that really a compelling argument given the ~50:50
> split of static/STATIC uses in xfs?

Historically, yes. We're talking about code with call chains that
can go 50-60 functions deep here. If that gets flattened to 10-20
functions by compiler inlining (which is the sort of thing that
happens) then we lose a huge amount of visibility into the workings
of the code. This affects profiling, stack traces on corruption,
dynamic debug probes, kernel crash dump analysis, etc.

I'm not interested in making code fast if distro support engineers
can't debug problems on user systems easily. Optimising for
performance over debuggability is a horrible trade off for us to
make because it means users and distros end up much more reliant on
single points of expertise for debugging problems. And that means
the majority of the load of problem triage falls directly on very
limited resources - the core XFS development team. A little bit of
thought about how to make code easier to triage and debug goes a
long, long way

Indeed, this is not a new problem - we've been using techniques like
STATIC in one form or another to stop compiler inlining and/or
function hiding since XFS was first ported to linux 20 years ago.
In fact, STATIC was inherited from Irix because it helped with
debugging via the userspace simulator that the initial XFS code was
developed on. i.e.  STATIC was present in the initial XFS commit
made way back in 1993, and we've been using it ever since...
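
For reference, the macro at issue is XFS's debug-friendly alternative to
plain static. As the patch title says, it is static plus noinline, so
the sketch below captures the idea; xfs_example_fixup() and
xfs_example_fast_path() are made-up names purely to show the usage:

#include <linux/compiler.h>

/*
 * Keep internal linkage, but tell the compiler not to inline the
 * function, so it still shows up as a distinct frame in stack traces,
 * crash dumps and profiles.
 */
#define STATIC	static noinline

STATIC int
xfs_example_fixup(int value)
{
	/* Survives as its own symbol instead of being folded into callers. */
	return value * 2;
}

/* A plain static helper, by contrast, is fair game for the inliner. */
static int
xfs_example_fast_path(int value)
{
	return xfs_example_fixup(value) + 1;
}

With -O2 the plain static helper above is typically inlined into its
caller and vanishes from backtraces; that lost visibility is exactly
what the noinline in STATIC is there to prevent.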

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] xfs: Remove noinline from #define STATIC

2018-11-12 Thread Dave Chinner
On Mon, Nov 12, 2018 at 02:12:08PM -0600, Eric Sandeen wrote:
> On 11/10/18 7:21 PM, Joe Perches wrote:
> > Reduce total object size quite a bit (~32KB) and presumably
> > improve performance at the same time.
> > 
> > Total object size old vs new (x86-64 defconfig with xfs)
> > 
> > text   data bss dec hex filename
> > - 959351 165573 632 1125556  112cb4 (TOTALS) (old)
> > + 924683 165669 632 1090984  10a5a8 (TOTALS) (new)
> 
> And what does it do to maximum stack excursions?

Better yet: what does it do to corruption stack traces and debugging
tools like profiling traces?

i.e. this noinline directive isn't about stack usage, this is about
being able to debug production code. Basically the compiler inliner
is so aggressive on static functions that it makes it impossible to
decipher the stack traces. It flattens them way too much to
be able to tell how we got to a specific location in the code.

In reality, being able to find problems quickly and efficiently is
far more important to us than being able to run everything at
ludicrous speed...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] Fix coding style issue in xfs_acl.c and xfs_aops.c

2018-11-11 Thread Dave Chinner
On Sun, Nov 11, 2018 at 08:36:03AM +0800, hmsjwzb wrote:
> Possible unwrapped commit description (prefer a maximum 75 chars per line)

NACK. Our preference is (and always has been) for comments to fill
the entire 80 columns, just like the rest of the kernel. I have no
idea who told you "75 columns is preferred" but they are wrong.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] fs/proc: introduce /proc/stat2 file

2018-11-07 Thread Dave Chinner
On Wed, Nov 07, 2018 at 11:03:06AM +0100, Miklos Szeredi wrote:
> On Wed, Nov 7, 2018 at 12:48 AM, Andrew Morton
>  wrote:
> > On Mon, 29 Oct 2018 23:04:45 + Daniel Colascione  
> > wrote:
> >
> >> On Mon, Oct 29, 2018 at 7:25 PM, Davidlohr Bueso  wrote:
> >> > This patch introduces a new /proc/stat2 file that is identical to the
> >> > regular 'stat' except that it zeroes all hard irq statistics. The new
> >> > file is a drop in replacement to stat for users that need performance.
> >>
> >> For a while now, I've been thinking over ways to improve the
> >> performance of collecting various bits of kernel information. I don't
> >> think that a proliferation of special-purpose named bag-of-fields file
> >> variants is the right answer, because even if you add a few info-file
> >> variants, you're still left with a situation where a given file
> >> provides a particular caller with too little or too much information.
> >> I'd much rather move to a model in which userspace *explicitly* tells
> >> the kernel which fields it wants, with the kernel replying with just
> >> those particular fields, maybe in their raw binary representations.
> >> The ASCII-text bag-of-everything files would remain available for
> >> ad-hoc and non-performance critical use, but programs that cared about
> >> performance would have an efficient bypass. One concrete approach is
> >> to let users open up today's proc files and, instead of read(2)ing a
> >> text blob, use an ioctl to retrieve specified and targeted information
> >> of the sort that would normally be encoded in the text blob. Because
> >> callers would open the same file when using either the text or binary
> >> interfaces, little would have to change, and it'd be easy to implement
> >> fallbacks when a particular system doesn't support a particular
> >> fast-path ioctl.
> 
> Please.   Sysfs, with the one value per file rule, was created exactly
> for the purpose of eliminating these sort of problems with procfs.  So
> instead of inventing special purpose interfaces for proc, just make
> the info available in sysfs, if not already available.

This doesn't solve the problem.

The problem is that this specific implementation of per-cpu
counters needs to be summed on every read. Hence, when you have a
huge number of CPUs, each per-cpu iteration takes a substantial
amount of time.

If only we had percpu counters with a fixed, extremely low read
overhead that doesn't care about the number of CPUs in the
machine...

Oh, wait, we do: percpu_counters.[ch].

This all seems like a counter implementation deficiency to me, not
an interface problem...
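
To illustrate the point, a minimal sketch of that kind of counter
using the stock percpu_counter API (illustrative only; error
handling and cleanup omitted, names are made up):

#include <linux/init.h>
#include <linux/gfp.h>
#include <linux/percpu_counter.h>

/*
 * Writers bump a per-cpu delta; percpu_counter_read() returns the
 * cheap, slightly stale global sum without walking every CPU.
 */
static struct percpu_counter demo_events;

static int __init demo_init(void)
{
	return percpu_counter_init(&demo_events, 0, GFP_KERNEL);
}

static void demo_account_event(void)
{
	percpu_counter_add(&demo_events, 1);	/* per-cpu, no global lock */
}

static s64 demo_read_events(void)
{
	return percpu_counter_read(&demo_events);	/* O(1), approximate */
}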

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count

2018-11-06 Thread Dave Chinner
On Tue, Nov 06, 2018 at 12:00:06PM +0100, Jan Kara wrote:
> On Tue 06-11-18 13:47:15, Dave Chinner wrote:
> > On Mon, Nov 05, 2018 at 04:26:04PM -0800, John Hubbard wrote:
> > > On 11/5/18 1:54 AM, Jan Kara wrote:
> > > > Hmm, have you tried larger buffer sizes? Because synchronous 8k IO isn't
> > > > going to max-out NVME iops by far. Can I suggest you install fio [1] (it
> > > > has the advantage that it is pretty much standard for a test like this 
> > > > so
> > > > everyone knows what the test does from a glimpse) and run with it 
> > > > something
> > > > like the following workfile:
> > > > 
> > > > [reader]
> > > > direct=1
> > > > ioengine=libaio
> > > > blocksize=4096
> > > > size=1g
> > > > numjobs=1
> > > > rw=read
> > > > iodepth=64
> > > > 
> > > > And see how the numbers with and without your patches compare?
> > > > 
> > > > Honza
> > > > 
> > > > [1] https://github.com/axboe/fio
> > > 
> > > That program is *very* good to have. Whew. Anyway, it looks like read 
> > > bandwidth 
> > > is approximately 74 MiB/s with my patch (it varies a bit, run to run),
> > > as compared to around 85 without the patch, so still showing about a 20%
> > > performance degradation, assuming I'm reading this correctly.
> > > 
> > > Raw data follows, using the fio options you listed above:
> > > 
> > > Baseline (without my patch):
> > >  
> > 
> > >  lat (usec): min=179, max=14003, avg=2913.65, stdev=1241.75
> > > clat percentiles (usec):
> > >  |  1.00th=[ 2311],  5.00th=[ 2343], 10.00th=[ 2343], 20.00th=[ 2343],
> > >  | 30.00th=[ 2343], 40.00th=[ 2376], 50.00th=[ 2376], 60.00th=[ 2376],
> > >  | 70.00th=[ 2409], 80.00th=[ 2933], 90.00th=[ 4359], 95.00th=[ 5276],
> > >  | 99.00th=[ 8291], 99.50th=[ 9110], 99.90th=[10945], 99.95th=[11469],
> > >  | 99.99th=[12256]
> > .
> > > Modified (with my patch):
> > >  
> > .
> > >  lat (usec): min=81, max=15766, avg=3496.57, stdev=1450.21
> > > clat percentiles (usec):
> > >  |  1.00th=[ 2835],  5.00th=[ 2835], 10.00th=[ 2835], 20.00th=[ 2868],
> > >  | 30.00th=[ 2868], 40.00th=[ 2868], 50.00th=[ 2868], 60.00th=[ 2900],
> > >  | 70.00th=[ 2933], 80.00th=[ 3425], 90.00th=[ 5080], 95.00th=[ 6259],
> > >  | 99.00th=[10159], 99.50th=[11076], 99.90th=[12649], 99.95th=[13435],
> > >  | 99.99th=[14484]
> > 
> > So it's adding at least 500us of completion latency to every IO?
> > I'd argue that the IO latency impact is far worse than the 20%
> > throughput drop.
> 
> Hum, right. So for each IO we have to remove the page from LRU on submit

Which cost us less than 10us on average:

slat (usec): min=13, max=3855, avg=44.17, stdev=61.18
vs
slat (usec): min=18, max=4378, avg=52.59, stdev=63.66

> and then put it back on IO completion (which is going to race with new
> submits so LRU lock contention might be an issue).

Removal has to take the same LRU lock, so I don't think contention
is the problem here. More likely the overhead is in selecting the
LRU to put it on, e.g. list_lru_from_kmem(), which may well be doing
a memcg lookup.
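
For clarity, the pattern being costed here is roughly the following
(an illustrative sketch using the generic LRU helpers, not the code
from the patch itself):

#include <linux/mm.h>
#include <linux/swap.h>		/* isolate_lru_page(), putback_lru_page() */

/*
 * One LRU-lock round trip at submit to pull the page off the LRU,
 * and another at completion to put it back (which is where the
 * lruvec selection happens).
 */
static int sketch_pin_page(struct page *page)
{
	return isolate_lru_page(page);	/* 0 on success */
}

static void sketch_unpin_page(struct page *page)
{
	putback_lru_page(page);
}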

> Spending 500 us on that
> is not unthinkable when the lock is contended but it is more expensive than
> I'd have thought. John, could you perhaps profile where the time is spent?

That'll tell us for sure :)

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count

2018-11-05 Thread Dave Chinner
On Mon, Nov 05, 2018 at 04:26:04PM -0800, John Hubbard wrote:
> On 11/5/18 1:54 AM, Jan Kara wrote:
> > Hmm, have you tried larger buffer sizes? Because synchronous 8k IO isn't
> > going to max-out NVME iops by far. Can I suggest you install fio [1] (it
> > has the advantage that it is pretty much standard for a test like this so
> > everyone knows what the test does from a glimpse) and run with it something
> > like the following workfile:
> > 
> > [reader]
> > direct=1
> > ioengine=libaio
> > blocksize=4096
> > size=1g
> > numjobs=1
> > rw=read
> > iodepth=64
> > 
> > And see how the numbers with and without your patches compare?
> > 
> > Honza
> > 
> > [1] https://github.com/axboe/fio
> 
> That program is *very* good to have. Whew. Anyway, it looks like read 
> bandwidth 
> is approximately 74 MiB/s with my patch (it varies a bit, run to run),
> as compared to around 85 without the patch, so still showing about a 20%
> performance degradation, assuming I'm reading this correctly.
> 
> Raw data follows, using the fio options you listed above:
> 
> Baseline (without my patch):
>  

>  lat (usec): min=179, max=14003, avg=2913.65, stdev=1241.75
> clat percentiles (usec):
>  |  1.00th=[ 2311],  5.00th=[ 2343], 10.00th=[ 2343], 20.00th=[ 2343],
>  | 30.00th=[ 2343], 40.00th=[ 2376], 50.00th=[ 2376], 60.00th=[ 2376],
>  | 70.00th=[ 2409], 80.00th=[ 2933], 90.00th=[ 4359], 95.00th=[ 5276],
>  | 99.00th=[ 8291], 99.50th=[ 9110], 99.90th=[10945], 99.95th=[11469],
>  | 99.99th=[12256]
.
> Modified (with my patch):
>  
.
>  lat (usec): min=81, max=15766, avg=3496.57, stdev=1450.21
> clat percentiles (usec):
>  |  1.00th=[ 2835],  5.00th=[ 2835], 10.00th=[ 2835], 20.00th=[ 2868],
>  | 30.00th=[ 2868], 40.00th=[ 2868], 50.00th=[ 2868], 60.00th=[ 2900],
>  | 70.00th=[ 2933], 80.00th=[ 3425], 90.00th=[ 5080], 95.00th=[ 6259],
>  | 99.00th=[10159], 99.50th=[11076], 99.90th=[12649], 99.95th=[13435],
>  | 99.99th=[14484]

So it's adding at least 500us of completion latency to every IO?
I'd argue that the IO latency impact is far worse than the 20%
throughput drop.

i.e. You can make up for throughput drops by running a deeper
queue/more dispatch threads, but you can't reduce IO latency at
all...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [GIT PULL] vfs: fix many problems in vfs clone/dedupe implementation

2018-11-04 Thread Dave Chinner
On Sat, Nov 03, 2018 at 10:13:37AM -0700, Linus Torvalds wrote:
> On Fri, Nov 2, 2018 at 4:36 PM Dave Chinner  wrote:
> >
> > On Fri, Nov 02, 2018 at 09:35:23AM -0700, Linus Torvalds wrote:
> > >
> > > I don't love the timing of this at the end of the merge window, but 
> > > pulled,
> >
> > When would have been a better time? It's too big for a late -rc, and
> > it contains stuff that needs fixing pretty urgently. So if the merge
> > window is not appropriate, what should we have done here?
> 
> No, I think that with the timing of the problem finding, it was
> probably the only thing to do, but generally I like these kinds of
> somewhat scary and disruptive patches (just because it touches
> multiple filesystems) _early_ in the merge window if at all possible,
> and showing that the development was done before and not some rushed
> thing..

Fair enough. That was partly my fault - I forgot to add sob's to the
commits when I pulled the final version in for testing in the middle
of the first week of the merge window. I didn't realise that until
after I'd soak and stress tested it and was getting ready to push it
out to linux-next a week later. So I had to rewrite all the commits,
which made them look an awful lot more recent than they
really were. I should have mentioned this in the pull request...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [GIT PULL] vfs: fix many problems in vfs clone/dedupe implementation

2018-11-02 Thread Dave Chinner
On Fri, Nov 02, 2018 at 09:35:23AM -0700, Linus Torvalds wrote:
> On Thu, Nov 1, 2018 at 10:15 PM Dave Chinner  wrote:
> >
> > Can you please pull update containing a rework of the VFS clone and
> > dedupe file range infrastructure from the tag listed below?
> 
> I don't love the timing of this at the end of the merge window, but pulled,

When would have been a better time? It's too big for a late -rc, and
it contains stuff that needs fixing pretty urgently. So if the merge
window is not appropriate, what should we have done here?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


[GIT PULL] vfs: fix many problems in vfs clone/dedupe implementation

2018-11-01 Thread Dave Chinner
 fs/ocfs2/refcounttree.c   | 148 --
 fs/ocfs2/refcounttree.h   |  24 ++-
 fs/overlayfs/copy_up.c|   6 +-
 fs/overlayfs/file.c   |  43 ++--
 fs/read_write.c   | 403 +-
 fs/xfs/xfs_file.c |  82 +---
 fs/xfs/xfs_reflink.c  | 173 
 fs/xfs/xfs_reflink.h  |  15 +-
 include/linux/fs.h|  55 --
 mm/filemap.c  | 146 +++---
 20 files changed, 734 insertions(+), 596 deletions(-)
-- 
Dave Chinner
da...@fromorbit.com


Re: linux-next: manual merge of the vfs tree with the xfs tree

2018-10-30 Thread Dave Chinner
On Wed, Oct 31, 2018 at 11:52:47AM +1100, Stephen Rothwell wrote:
> Hi all,
> 
> [I don't understand why all this new work turned up in the xfs tree
> during the merge window ...]
> 
> Today's linux-next merge of the vfs tree got a conflict in:
> 
>   fs/read_write.c
> 
> between commits:
> 
>   42ec3d4c0218 ("vfs: make remap_file_range functions take and return bytes 
> completed")
>   eca3654e3cc7 ("vfs: enable remap callers that can handle short operations")
> 
> from the xfs tree and commit:
> 
>   5de4480ae7f8 ("vfs: allow dedupe of user owned read-only files")
> 
> from the vfs tree.
> 
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.

Looks ok. I didn't expect this conflict, but looks simple enough
to resolve. Thanks!

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: linux-next: manual merge of the xfs tree with Linus' tree

2018-10-30 Thread Dave Chinner
On Wed, Oct 31, 2018 at 11:22:44AM +1100, Stephen Rothwell wrote:
> Hi all,
> 
> Today's linux-next merge of the xfs tree got a conflict in:
> 
>   Documentation/filesystems/porting
> 
> between commit:
> 
>   1a16dbaf798c ("Document d_splice_alias() calling conventions for ->lookup() 
> users.")
> 
> from Linus' tree and commit:
> 
>   2e5dfc99f2e6 ("vfs: combine the clone and dedupe into a single 
> remap_file_range")
> 
> from the xfs tree.
> 
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.
> 
> -- 
> Cheers,
> Stephen Rothwell
> 
> diff --cc Documentation/filesystems/porting
> index 321d74b73937,e6d4466268dd..
> --- a/Documentation/filesystems/porting
> +++ b/Documentation/filesystems/porting
> @@@ -623,13 -623,7 +623,18 @@@ in your dentry operations instead
>   On success you get a new struct file sharing the mount/dentry with the
>   original, on failure - ERR_PTR().
>   --
>  +[recommended]
>  +->lookup() instances doing an equivalent of
>  +if (IS_ERR(inode))
>  +return ERR_CAST(inode);
>  +return d_splice_alias(inode, dentry);
>  +don't need to bother with the check - d_splice_alias() will do the
>  +right thing when given ERR_PTR(...) as inode.  Moreover, passing NULL
>  +inode to d_splice_alias() will also do the right thing (equivalent of
>  +d_add(dentry, NULL); return NULL;), so that kind of special cases
>  +also doesn't need a separate treatment.
> ++--
> + [mandatory]
> + ->clone_file_range() and ->dedupe_file_range have been replaced with
> + ->remap_file_range().  See Documentation/filesystems/vfs.txt for more
> + information.

Looks good - I knew about this one from merging back into a recent
Linus kernel.
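
For reference, the simplification the quoted porting note describes
looks roughly like this in a filesystem's ->lookup() (a sketch;
foo_iget() is a made-up helper):

#include <linux/fs.h>
#include <linux/dcache.h>

/* Hypothetical helper: returns an inode, an ERR_PTR(), or NULL for a
 * negative lookup. */
static struct inode *foo_iget(struct inode *dir, const struct qstr *name);

/*
 * No IS_ERR()/NULL special-casing is needed before d_splice_alias();
 * it does the right thing for both cases itself.
 */
static struct dentry *foo_lookup(struct inode *dir, struct dentry *dentry,
				 unsigned int flags)
{
	struct inode *inode = foo_iget(dir, &dentry->d_name);

	return d_splice_alias(inode, dentry);
}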

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: XFS: Hang and dmesg flood on mounting invalid FS image

2018-10-29 Thread Dave Chinner
On Mon, Oct 29, 2018 at 09:57:20AM +0300, Anatoly Trosinenko wrote:
> > How did the corruption occur?
> 
> It is a fuzzed image. Most probably, it was artificially "patched" by
> fuzzer. Or do you mean "what particular bytes were changed"?

I wondered how this specific corruption occurred in the real world.
If it was a real-world problem, it would have been indicative of a
code bug (i.e. whatever wrote the log record would
have been broken), but seeing as it's a fuzzer problem, I don't need
to look for any code bugs other than "we didn't validate the input
properly".

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: XFS: Hang and dmesg flood on mounting invalid FS image

2018-10-28 Thread Dave Chinner
On Sun, Oct 28, 2018 at 08:50:46PM +0300, Anatoly Trosinenko wrote:
> Hello,
> 
> When mounting a broken XFS image, the kernel hangs and floods dmesg
> with stack traces.

How did the corruption occur?

$ sudo xfs_logprint -d /dev/vdc
xfs_logprint:
data device: 0xfd20
log device: 0xfd20 daddr: 131112 length: 6840

 0 HEADER Cycle 1 tail 1:00 len512 ops 1
[0 - 0] Cycle 0x New Cycle 0x0001
 2 HEADER Cycle 1 tail 1:02 len512 ops 5
 4 HEADER Cycle 1 tail -2147483647:02 len512 ops 1
   
 6 HEADER Cycle 0 tail 1:00 len  0 ops 0
[0 - 6] Cycle 0x0001 New Cycle 0x
 7 HEADER Cycle 0 tail 1:00 len  0 ops 0

Ok, so from this the head of the log is block 4, and it has a
corrupt tail pointer. Dumping the head block:


$ sudo xfs_logprint -D -s 4 /dev/vdc |head -10
xfs_logprint:
data device: 0xfd20
log device: 0xfd20 daddr: 131112 length: 6840

BLKNO: 4
 0 bebaedfe  100  2002  100  361  180  200 
 ^^^   ^   ^
 wrong   wrong wrong

 8 2f27bae6  200  100 dabdbab00000 
1000000000 
1800000000 
2000000000 

They decode as:

cycle: 1    version: 2    lsn: 1,24835    tail_lsn: 2147483649,2

So the tail LSN points to an invalid log cycle and the previous
block. IOWs, the block number in the tail indicates the whole log is
valid and needs to be scanned, but the cycle is not valid.

And that's the problem. Neither the head nor the tail blocks are
validated before they are used. CRC checking of the head and tail
blocks comes later...
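
The kind of sanity check that's missing is, in essence, something
like this (a hypothetical sketch, not the actual fix that went
upstream):

#include <stdint.h>
#include <stdbool.h>

/*
 * The top 32 bits of an LSN are the cycle number, so a tail whose
 * cycle is neither the head's cycle nor the one immediately before
 * it cannot be valid and the record should be rejected up front.
 */
static bool tail_lsn_plausible(uint64_t head_lsn, uint64_t tail_lsn)
{
	uint32_t head_cycle = head_lsn >> 32;
	uint32_t tail_cycle = tail_lsn >> 32;

	return tail_cycle == head_cycle || tail_cycle + 1 == head_cycle;
}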

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


[GIT PULL] xfs: 4.20 merge

2018-10-23 Thread Dave Chinner
Hi Linus,

Can you please pull the XFS 4.20 update from the tag listed below?
There's not a huge amount of change in this cycle - Darrick has been
out of action for a couple of months (hence me sending the last few
pull requests), so we decided a quiet cycle mainly focussed on bug
fixes was a good idea. Darrick will take the helm again at the end of
this merge window.

The branch has applied cleanly to your current tree, so you
shouldn't see any problems with it.

FYI, I may be sending another update later in the cycle - there's a
pending rework of the clone/dedupe_file_range code that fixes
numerous bugs that is spread amongst the VFS, XFS and ocfs2 code. It
has been reviewed and tested, Al and I just need to work out the
details of the merge, so it may come from him rather than me.

Thanks!

-Dave.


The following changes since commit b39989009bdb84992915c9869f58094ed5becf10:

  xfs: fix data corruption w/ unaligned reflink ranges (2018-10-06 11:44:39 
+1000)

are available in the git repository at:

  git://git.kernel.org/pub/scm/fs/xfs/xfs-linux tags/xfs-4.20-merge-1

for you to fetch changes up to 96987eea537d6ccd98704a71958f9ba02da80843:

  xfs: cancel COW blocks before swapext (2018-10-18 17:21:55 +1100)


xfs: Changes for 4.20

- only support filesystems with unwritten extents
- add definition for statfs XFS magic number
- remove unused parameters around reflink code
- more debug for dangling delalloc extents
- cancel COW extents on extent swap targets
- fix quota stats output and clean up the code
- refactor some of the attribute code in preparation for parent pointers
- fix several buffer handling bugs


Adam Borowski (1):
  xfs: add a define for statfs magic to uapi

Allison Henderson (4):
  xfs: Move fs/xfs/xfs_attr.h to fs/xfs/libxfs/xfs_attr.h
  xfs: Add helper function xfs_attr_try_sf_addname
  xfs: Add attibute set and helper functions
  xfs: Add attibute remove and helper functions

Brian Foster (1):
  xfs: clear ail delwri queued bufs on unmount of shutdown fs

Carlos Maiolino (2):
  xfs: Fix xqmstats offsets in /proc/fs/xfs/xqmstat
  xfs: use offsetof() in place of offset macros for __xfsstats

Christoph Hellwig (8):
  xfs: remove XFS_IO_INVALID
  xfs: remove suport for filesystems without unwritten extent flag
  xfs: handle zeroing in xfs_file_iomap_begin_delay
  xfs: remove the unused shared argument to xfs_reflink_reserve_cow
  xfs: remove the unused trimmed argument from 
xfs_reflink_trim_around_shared
  xfs: fix fork selection in xfs_find_trim_cow_extent
  xfs: print dangling delalloc extents
  xfs: cancel COW blocks before swapext

Darrick J. Wong (3):
  xfs: xrep_findroot_block should reject root blocks with siblings
  xfs: always assign buffer verifiers when one is provided
  xfs: fix buffer state management in xrep_findroot_block

Dave Chinner (2):
  xfs: issue log message on user force shutdown
  xfs: fix use-after-free race in xfs_buf_rele

 fs/xfs/libxfs/xfs_attr.c   | 236 -
 fs/xfs/{ => libxfs}/xfs_attr.h |   2 +
 fs/xfs/libxfs/xfs_bmap.c   |  70 ++--
 fs/xfs/libxfs/xfs_bmap.h   |   1 +
 fs/xfs/libxfs/xfs_format.h |   8 +-
 fs/xfs/libxfs/xfs_sb.c |   5 +-
 fs/xfs/scrub/repair.c  | 128 +-
 fs/xfs/scrub/scrub.c   |  13 ---
 fs/xfs/xfs_aops.c  |   4 +-
 fs/xfs/xfs_aops.h  |  14 ++-
 fs/xfs/xfs_bmap_util.c |  61 ++-
 fs/xfs/xfs_buf.c   | 109 +++
 fs/xfs/xfs_buf.h   |   2 +
 fs/xfs/xfs_fsops.c |  50 -
 fs/xfs/xfs_ioctl.c |   8 --
 fs/xfs/xfs_iomap.c |  53 ++---
 fs/xfs/xfs_reflink.c   |  33 +++---
 fs/xfs/xfs_reflink.h   |   4 +-
 fs/xfs/xfs_stats.c |  52 -
 fs/xfs/xfs_stats.h |  28 +
 fs/xfs/xfs_super.c |  38 ++-
 fs/xfs/xfs_trans.h |   1 +
 fs/xfs/xfs_trans_ail.c |  28 +++--
 fs/xfs/xfs_trans_buf.c |  42 
 include/uapi/linux/magic.h |   1 +
 25 files changed, 608 insertions(+), 383 deletions(-)
 rename fs/xfs/{ => libxfs}/xfs_attr.h (97%)

-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count

2018-10-13 Thread Dave Chinner
On Sat, Oct 13, 2018 at 12:34:12AM -0700, John Hubbard wrote:
> On 10/12/18 8:55 PM, Dave Chinner wrote:
> > On Thu, Oct 11, 2018 at 11:00:12PM -0700, john.hubb...@gmail.com wrote:
> >> From: John Hubbard 
> [...]
> >> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> >> index 5ed8f6292a53..017ab82e36ca 100644
> >> --- a/include/linux/mm_types.h
> >> +++ b/include/linux/mm_types.h
> >> @@ -78,12 +78,22 @@ struct page {
> >> */
> >>union {
> >>struct {/* Page cache and anonymous pages */
> >> -  /**
> >> -   * @lru: Pageout list, eg. active_list protected by
> >> -   * zone_lru_lock.  Sometimes used as a generic list
> >> -   * by the page owner.
> >> -   */
> >> -  struct list_head lru;
> >> +  union {
> >> +  /**
> >> +   * @lru: Pageout list, eg. active_list protected
> >> +   * by zone_lru_lock.  Sometimes used as a
> >> +   * generic list by the page owner.
> >> +   */
> >> +  struct list_head lru;
> >> +  /* Used by get_user_pages*(). Pages may not be
> >> +   * on an LRU while these dma_pinned_* fields
> >> +   * are in use.
> >> +   */
> >> +  struct {
> >> +  unsigned long dma_pinned_flags;
> >> +  atomic_t  dma_pinned_count;
> >> +  };
> >> +  };
> > 
> > Isn't this broken for mapped file-backed pages? i.e. they may be
> > passed as the user buffer to read/write direct IO and so the pages
> > passed to gup will be on the active/inactive LRUs. hence I can't see
> > how you can have dual use of the LRU list head like this
> > 
> > What am I missing here?
> 
> Hi Dave,
> 
> In patch 6/6, pin_page_for_dma(), which is called at the end of 
> get_user_pages(),
> unceremoniously rips the pages out of the LRU, as a prerequisite to using
> either of the page->dma_pinned_* fields. 

How is that safe? If you've ripped the page out of the LRU, it's no
longer being tracked by the page cache aging and reclaim algorithms.
Patch 6 doesn't appear to put these pages back in the LRU, either,
so it looks to me like this just dumps them on the ground after the
gup reference is dropped.  How do we reclaim these page cache pages
when there is memory pressure if they aren't in the LRU?

> The idea is that LRU is not especially useful for this situation anyway,
> so we'll just make it one or the other: either a page is dma-pinned, and
> just hanging out doing RDMA most likely (and LRU is less meaningful during 
> that
> time), or it's possibly on an LRU list.

gup isn't just used for RDMA. It's used by direct IO in far, far
more situations and machines than RDMA is. Please explain why
ripping pages out of the LRU and not putting them back is safe, has
no side effects, doesn't adversely impact page cache reclaim, etc.
Indeed, I'd love to see a description of all the page references and
where they come and go so we know the changes aren't just leaking
these pages until the filesystem invalidates them at unmount.

Maybe I'm not seeing why this is safe yet, but seeing as you haven't
explained why it is safe then, at minimum, the patch descriptions
are incomplete.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count

2018-10-12 Thread Dave Chinner
On Thu, Oct 11, 2018 at 11:00:12PM -0700, john.hubb...@gmail.com wrote:
> From: John Hubbard 
> 
> Add two struct page fields that, combined, are unioned with
> struct page->lru. There is no change in the size of
> struct page. These new fields are for type safety and clarity.
> 
> Also add page flag accessors to test, set and clear the new
> page->dma_pinned_flags field.
> 
> The page->dma_pinned_count field will be used in upcoming
> patches
> 
> Signed-off-by: John Hubbard 
> ---
>  include/linux/mm_types.h   | 22 +-
>  include/linux/page-flags.h | 47 ++
>  2 files changed, 63 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 5ed8f6292a53..017ab82e36ca 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -78,12 +78,22 @@ struct page {
>*/
>   union {
>   struct {/* Page cache and anonymous pages */
> - /**
> -  * @lru: Pageout list, eg. active_list protected by
> -  * zone_lru_lock.  Sometimes used as a generic list
> -  * by the page owner.
> -  */
> - struct list_head lru;
> + union {
> + /**
> +  * @lru: Pageout list, eg. active_list protected
> +  * by zone_lru_lock.  Sometimes used as a
> +  * generic list by the page owner.
> +  */
> + struct list_head lru;
> + /* Used by get_user_pages*(). Pages may not be
> +  * on an LRU while these dma_pinned_* fields
> +  * are in use.
> +  */
> + struct {
> + unsigned long dma_pinned_flags;
> + atomic_t  dma_pinned_count;
> + };
> + };

Isn't this broken for mapped file-backed pages? i.e. they may be
passed as the user buffer to read/write direct IO and so the pages
passed to gup will be on the active/inactive LRUs. hence I can't see
how you can have dual use of the LRU list head like this

What am I missing here?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


[GIT PULL] xfs: fixes for v4.19-rc7

2018-10-10 Thread Dave Chinner
Hi Greg,

Can you please pull the XFS update from the tag listed below? This
contains the fix for the clone_file_range data corruption issue I
mentioned in my -rc6 pull request (zero post-eof blocks), as well as
fixes for several other equally serious problems we found while
auditing the clone/dedupe ioctls for other issues. The rest of the
problems we found (there were a *lot*) will be addressed in the 4.20
cycle.

Cheers,

Dave.

The following changes since commit e55ec4ddbef9897199c307dfb23167e3801fdaf5:

  xfs: fix error handling in xfs_bmap_extents_to_btree (2018-10-01 08:11:07 
+1000)

are available in the git repository at:

  git://git.kernel.org/pub/scm/fs/xfs/xfs-linux tags/xfs-fixes-for-4.19-rc7

for you to fetch changes up to b39989009bdb84992915c9869f58094ed5becf10:

  xfs: fix data corruption w/ unaligned reflink ranges (2018-10-06 11:44:39 
+1000)


xfs: fixes for 4.19-rc7

Update for 4.19-rc7 to fix numerous file clone and deduplication issues.


Darrick J. Wong (3):
  xfs: refactor clonerange preparation into a separate helper
  xfs: zero posteof blocks when cloning above eof
  xfs: update ctime and remove suid before cloning files

Dave Chinner (2):
  xfs: fix data corruption w/ unaligned dedupe ranges
  xfs: fix data corruption w/ unaligned reflink ranges

 fs/xfs/xfs_reflink.c | 200 ++-
 1 file changed, 165 insertions(+), 35 deletions(-)

-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH] writeback: fix range_cyclic writeback vs writepages deadlock

2018-10-05 Thread Dave Chinner
On Fri, Oct 05, 2018 at 12:46:40PM -0700, Andrew Morton wrote:
> On Fri,  5 Oct 2018 15:45:26 +1000 Dave Chinner  wrote:
> 
> > From: Dave Chinner 
> > 
> > We've recently seen a workload on XFS filesystems with a repeatable
> > deadlock between background writeback and a multi-process
> > application doing concurrent writes and fsyncs to a small range of a
> > file.
> > 
> > ...
> > 
> > Signed-off-by: Dave Chinner 
> > Reviewed-by: Jan Kara 
> 
> Not a serious enough problem for a -stable backport?

Don't have enough evidence to say one way or another. The reported
incident was from a RHEL 7 kernel, so the bug has been there for
years in one form or another, but it's only ever been triggered by
this one-off custom workload.

I haven't done any analysis on older kernels, nor have I looked to see
if there's any gotchas that a stable backport might encounter. And I
tend not to change stuff in a path that is critical to data integrity
without at least doing enough due diligence to suggest a stable
backport would be fine.

You can mark it for stable backports if you want, but I'm not
prepared to because I haven't done the work necessary to ensure it's
safe to do so.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


[PATCH] writeback: fix range_cyclic writeback vs writepages deadlock

2018-10-04 Thread Dave Chinner
From: Dave Chinner 

We've recently seen a workload on XFS filesystems with a repeatable
deadlock between background writeback and a multi-process
application doing concurrent writes and fsyncs to a small range of a
file.

range_cyclic
writeback           Process 1               Process 2

xfs_vm_writepages
  write_cache_pages
    writeback_index = 2
    cycled = 0

    find page 2 dirty
    lock Page 2
    ->writepage
      page 2 writeback
      page 2 clean
      page 2 added to bio
    no more pages
                    write()
                      locks page 1
                      dirties page 1
                      locks page 2
                      dirties page 1
                      fsync()

                    xfs_vm_writepages
                    write_cache_pages
                      start index 0
                      find page 1 towrite
                      lock Page 1
                      ->writepage
                        page 1 writeback
                        page 1 clean
                        page 1 added to bio
                      find page 2 towrite
                      lock Page 2
                      page 2 is writeback
                      <blocks>
                                            write()
                                              locks page 1
                                              dirties page 1
                                              fsync()

                                            xfs_vm_writepages
                                            write_cache_pages
                                              start index 0

    !done && !cycled
      sets index to 0, restarts lookup
    find page 1 dirty
                                              find page 1 towrite
                                              lock Page 1
                                              page 1 is writeback
                                              <blocks>

    lock Page 1
    <blocks>
DEADLOCK because:

- process 1 needs page 2 writeback to complete to make
  enough progress to issue IO pending for page 1
- writeback needs page 1 writeback to complete so process 2
  can progress and unlock the page it is blocked on, then it
  can issue the IO pending for page 2
- process 2 can't make progress until process 1 issues IO
  for page 1

The underlying cause of the problem here is that range_cyclic
writeback is processing pages in descending index order as we
hold higher index pages in a structure controlled from above
write_cache_pages(). The write_cache_pages() caller needs to
be able to submit these pages for IO before write_cache_pages
restarts writeback at mapping index 0 to avoid wcp inverting the
page lock/writeback wait order.

generic_writepages() is not susceptible to this bug as it has no
private context held across write_cache_pages() - filesystems using
this infrastructure always submit pages in ->writepage immediately
and so there is no problem with range_cyclic going back to mapping
index 0.

However:
mpage_writepages() has a private bio context,
exofs_writepages() has page_collect
fuse_writepages() has fuse_fill_wb_data
nfs_writepages() has nfs_pageio_descriptor
xfs_vm_writepages() has xfs_writepage_ctx

All of these ->writepages implementations can hold pages under
writeback in their private structures until write_cache_pages()
returns, and hence they are all susceptible to this deadlock.

Also worth noting is that ext4 has its own bastardised version of
write_cache_pages() and so it /may/ have an equivalent deadlock. I
looked at the code long enough to understand that it has a similar
retry loop for range_cyclic writeback reaching the end of the file
and then promptly ran away before my eyes bled too much.  I'll leave
it for the ext4 developers to determine whether their code actually
has this deadlock and how to fix it if it does.

There are a few ways I can see to avoid this deadlock. There are
probably more, but these are the first I've thought of:

1. get rid of range_cyclic altogether

2. range_cyclic always stops at EOF, and we start again from
writeback index 0 on the next call into write_cache_pages()

2a. wcp also returns EAGAIN to ->writepages implementations to
indicate range cyclic has hit EOF. writepages implementations can
then flush the current context and call wcp again to continue. i.e.
lift the retry into the ->writepages implementation (see the sketch
after this list)

3. range_cyclic uses trylock_page() rather than lock_page(), and it
skips pages it can't lock without blocking. It will already do this
for pages under writeback, so this seems like a no-brainer

3a. all non-WB_SYNC_ALL writeback uses trylock_page() to avoid
blocking as per pages under writeback.
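
As a rough illustration of option 2a (the sketch referenced above) -
purely an assumption with made-up names, not a patch - the retry moves
out of wcp and into the ->writepages implementation, which flushes its
private context before restarting the lookup from index 0:

#include <linux/fs.h>
#include <linux/writeback.h>

/* Hypothetical private context standing in for xfs_writepage_ctx etc. */
struct example_wpc {
        struct bio *ioend_bio;  /* holds pages already marked under writeback */
};

/* Hypothetical callback: adds the page to wpc->ioend_bio (body elided). */
static int example_writepage(struct page *page, struct writeback_control *wbc,
                             void *data);

/* Hypothetical: submits and clears wpc->ioend_bio (body elided). */
static void example_submit_ioend(struct example_wpc *wpc);

/*
 * Assumes write_cache_pages() is changed to return -EAGAIN when
 * range_cyclic writeback reaches EOF rather than looping back to
 * index 0 itself, so the retry lives in ->writepages.
 */
static int example_writepages(struct address_space *mapping,
                              struct writeback_control *wbc)
{
        struct example_wpc wpc = { };
        int error;

        while ((error = write_cache_pages(mapping, wbc, example_writepage,
                                          &wpc)) == -EAGAIN) {
                /*
                 * wcp hit EOF: submit the IO we are holding so the pages
                 * we have under writeback can complete before the lookup
                 * restarts from index 0.
                 */
                example_submit_ioend(&wpc);
        }
        example_submit_ioend(&wpc);
        return error;
}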

I d

[GIT PULL] xfs: fixes for 4.19-rc6

2018-10-03 Thread Dave Chinner
Hi Greg,

Can you please pull the XFS update from the tag listed below. It's a
bit bigger than I'd like this late in the cycle, but we've had a
few challenges getting ourselves sorted out this cycle. Details
of the contents are in the pull-req output below. This has been
run through xfstests over the past week, and merges against 4.19-rc6
cleanly.

FYI, there is likely to be one more XFS fix for this cycle - we've
just tracked down the source of a clone_file_range() data corruption
so I'll send that as a separate pullreq once Darrick's fix is
reviewed and tested.

Cheers,

Dave.

The following changes since commit 5b394b2ddf0347bef56e50c69a58773c94343ff3:

  Linux 4.19-rc1 (2018-08-26 14:11:59 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/fs/xfs/xfs-linux tags/xfs-fixes-for-4.19-rc6

for you to fetch changes up to e55ec4ddbef9897199c307dfb23167e3801fdaf5:

  xfs: fix error handling in xfs_bmap_extents_to_btree (2018-10-01 08:11:07 
+1000)


XFS fixes for 4.19-rc6

Accumulated regression and bug fixes for 4.19-rc6, including:

o make iomap correctly mark dirty pages for sub-page block sizes
o fix regression in handling extent-to-btree format conversion errors
o fix torn log wrap detection for new logs
o various corrupt inode detection fixes
o various delalloc state fixes
o clean up all the transaction cancel cases missed from changes merged
  in 4.19-rc1
o fix lockdep false positive on transaction allocation
o fix locking and reference counting on buffer log items


Brian Foster (6):
  xfs: remove last of unnecessary xfs_defer_cancel() callers
  xfs: don't unlock invalidated buf on aborted tx commit
  xfs: clean up xfs_trans_brelse()
  xfs: refactor xfs_buf_log_item reference count handling
  xfs: remove invalid log recovery first/last cycle check
  iomap: set page dirty after partial delalloc on mkwrite

Christoph Hellwig (2):
  xfs: don't bring in extents in xfs_bmap_punch_delalloc_range
  xfs: skip delalloc COW blocks in xfs_reflink_end_cow

Darrick J. Wong (1):
  xfs: don't crash the vfs on a garbage inline symlink

Dave Chinner (3):
  xfs: avoid lockdep false positives in xfs_trans_alloc
  xfs: fix transaction leak in xfs_reflink_allocate_cow()
  xfs: fix error handling in xfs_bmap_extents_to_btree

Eric Sandeen (2):
  xfs: don't treat unknown di_flags2 as corruption in scrub
  xfs: validate inode di_forkoff

YueHaibing (1):
  xfs: remove duplicated include from alloc.c

 fs/iomap.c  |   2 +-
 fs/xfs/libxfs/xfs_attr.c|  28 +++-
 fs/xfs/libxfs/xfs_attr_remote.c |  10 +--
 fs/xfs/libxfs/xfs_bmap.c|  24 +++
 fs/xfs/libxfs/xfs_format.h  |   2 +
 fs/xfs/libxfs/xfs_inode_buf.c   |  30 +
 fs/xfs/scrub/alloc.c|   1 -
 fs/xfs/scrub/inode.c|   4 +-
 fs/xfs/xfs_bmap_util.c  |  20 ++
 fs/xfs/xfs_buf_item.c   | 119 ++
 fs/xfs/xfs_buf_item.h   |   1 +
 fs/xfs/xfs_inode.c  |  10 +--
 fs/xfs/xfs_iops.c   |  12 +++-
 fs/xfs/xfs_log_recover.c|  10 ---
 fs/xfs/xfs_reflink.c| 137 
 fs/xfs/xfs_trace.h  |   1 -
 fs/xfs/xfs_trans.c  |  10 ++-
 fs/xfs/xfs_trans_buf.c  |  99 +++--
 18 files changed, 256 insertions(+), 264 deletions(-)
-- 
Dave Chinner
da...@fromorbit.com


Re: Leaking Path in XFS's ioctl interface(missing LSM check)

2018-10-02 Thread Dave Chinner
On Wed, Oct 03, 2018 at 05:20:31AM +1000, James Morris wrote:
> On Tue, 2 Oct 2018, Dave Chinner wrote:
> 
> > On Tue, Oct 02, 2018 at 06:08:16AM +1000, James Morris wrote:
> > > On Mon, 1 Oct 2018, Darrick J. Wong wrote:
> > > 
> > > > If we /did/ replace CAP_SYS_ADMIN checking with a pile of LSM hooks,
> > > 
> > > Not sure we'd need a pile of hooks, what about just "read" and "write" 
> > > storage admin?
> > > 
> > > Or even two new capabilities along these lines, which we convert existing 
> > > CAP_SYS_ADMIN etc. to?
> > 
> > So instead of having hundreds of management ioctls under
> > CAP_SYS_ADMIN, we'd now have hundreds of non-storage ioctls under
> > CAP_SYS_ADMIN and hundreds of storage ioctls under
> > CAP_SYS_STORAGE_ADMIN?
> > 
> > Maybe I'm missing something, but I don't see how that improves the
> > situation w.r.t. locked down LSM configurations?
> 
> I'm not sure about capabilities, but having two specific LSM hooks for 
> storage admin would allow SELinux et al to explicitly control privileged 
> access to these interfaces.  Storage admin seems to be a special case of 
> its own which we want to be able to mediate as such.

Perhaps so - as a stepping stone it would allow isolation in
specific cases where no management should be allowed, but there are
cases with modern filesystems where users need access to storage
APIs.

e.g. It's entirely plausible that /home is set up as a subvolume per
user, and that subvols in a filesystem can be independently
snapshotted. Hence it would be completely acceptable to allow users
access to the snapshot management APIs so they can snapshot
their home directories for backup/rollback purposes.

Hence I'm not sure that black/white storage admin LSM hooks are a
solution that will end up being particularly useful to the wider
population...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 0/4] get_user_pages*() and RDMA: first steps

2018-10-01 Thread Dave Chinner
On Mon, Oct 01, 2018 at 05:47:57AM -0700, Christoph Hellwig wrote:
> On Mon, Oct 01, 2018 at 04:11:27PM +1000, Dave Chinner wrote:
> > This reminds me so much of Linux mmap() in the mid-2000s - mmap()
> > worked for ext3 without being aware of page faults,
> 
> And "worked" still is a bit of a stretch, as soon as you'd get
> ENOSPC it would still blow up badly.  Which probably makes it an
> even better analogy to the current case.
> 
> > RDMA does not call ->page_mkwrite on clean file backed pages before it
> > writes to them and calls set_page_dirty(), and hence RDMA to file
> > backed pages is completely unreliable. I'm not sure this can be
> > solved without having page fault capable RDMA hardware
> 
> We can always software prefault at gup time.

I'm not sure that's sufficient - we've got a series of panics from
machines running ext4+RDMA where there are no bufferheads on dirty
pages at writeback time. This was also reproducible on versions of
XFS that used bufferheads.

We suspect that memory reclaim has tripped the bufferhead stripping
threshold (yeah, that old ext3 hack to avoid writeback deadlocks
under memory pressure), hence removed the bufferheads from clean
mapped pages while RDMA has them pinned. And then some time later
after set_page_dirty() was called on them the filesystem's page
writeback code crashes and burns

i.e. just because the page was in a known state when it was
pinned, it doesn't mean it will remain in that state until it is
unpinned

> And also remember that
> while RDMA might be the case at least some people care about here it
> really isn't different from any of the other gup + I/O cases, including
> doing direct I/O to a mmap area.  The only difference in the various
> cases is how long the area should be pinned down - some users like RDMA
> want a long term mapping, while others like direct I/O just need a short
> transient one.

Yup, now that I'm aware of all those little intricacies with gup I
always try to consider what impact they have...

> > We could address these use-after-free situations via forcing RDMA to
> > use file layout leases and revoke the lease when we need to modify
> > the backing store on leased files. However, this doesn't solve the
> > need for filesystems to receive write fault notifications via
> > ->page_mkwrite.
> 
> Exactly.   We need three things here:
> 
>  - notification to the filesystem that a page is (possibly) being
>written to
>  - a way to block fs operations while the pages are pinned
>  - a way to distinguish between short and long term mappings,
>and only allow long term mappings if they can be broken
>using something like leases
> 
> I'm also pretty sure we already explained this a long time ago when the
> issue came up last year, so I'm not sure why this is even still
> contentious.

I suspect that it's simply because these discussions have been
spread across different groups and not everyone is aware of what the
other groups are discussing...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Leaking Path in XFS's ioctl interface(missing LSM check)

2018-10-01 Thread Dave Chinner
On Mon, Oct 01, 2018 at 11:25:29AM -0400, Theodore Y. Ts'o wrote:
> On Mon, Oct 01, 2018 at 04:04:42PM +0100, Alan Cox wrote:
> > > Systems restricted by LSMs to the point where CAP_SYS_ADMIN is not
> > > trusted have exactly the same issues. i.e. there's nobody trusted by
> > > the kernel to administer the storage stack, and nobody has defined a
> > > workable security model that can prevent untrusted users from
> > > violating the existing storage trust model
> > 
> > With a proper set of LSM checks you can lock the filesystem management
> > and enforcement to a particular set of objects. You can build that model
> > where for example only an administrative login from a trusted console may
> > launch processes to do that management.
> > 
> > Or you could - if things were not going around the LSM hooks.
> 
> It would be useful if anyone actually *wants* to do this thing to
> define a formal security model, and detail *everything* that would
> need to be changed in order to accomplish it.  Just as we don't
> speculatively add code "just in case" someone might want to use it
> someday, I don't think we should be adding random LSM hooks just
> because someone *might* want to do something.

Yeah, that's what I was implying we needed to do - taking the
current model and slapping LSM hooks around randomly will only make
things break and cause admins to curse us

> Let's see the use case, and let's see how horrible the changes would
> need to be, and how credible we think it is that someone will actually
> want to *use* it.  I suspect the changes will be a really huge number
> of places, and not just in XFS

So do I - the "in root we trust" model is pretty deeply ingrained up
and down the storage stack. I also suspect that most of our hardware
admin (not just storage) has similar assumptions about the security
model they operate in.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Leaking Path in XFS's ioctl interface(missing LSM check)

2018-10-01 Thread Dave Chinner
On Tue, Oct 02, 2018 at 06:08:16AM +1000, James Morris wrote:
> On Mon, 1 Oct 2018, Darrick J. Wong wrote:
> 
> > If we /did/ replace CAP_SYS_ADMIN checking with a pile of LSM hooks,
> 
> Not sure we'd need a pile of hooks, what about just "read" and "write" 
> storage admin?
> 
> Or even two new capabilities along these lines, which we convert existing 
> CAP_SYS_ADMIN etc. to?

So instead of having hundreds of management ioctls under
CAP_SYS_ADMIN, we'd now have hundreds of non-storage ioctls under
CAP_SYS_ADMIN and hundreds of storage ioctls under
CAP_SYS_STORAGE_ADMIN?

Maybe I'm missing something, but I don't see how that improves the
situation w.r.t. locked down LSM configurations?

Cheers,

Dave.


-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 0/3] namei: implement various scoping AT_* flags

2018-10-01 Thread Dave Chinner
On Mon, Oct 01, 2018 at 03:47:23PM +1000, Aleksa Sarai wrote:
> On 2018-10-01, Dave Chinner  wrote:
> > > I've added some selftests for this, but it's not clear to me whether
> > > they should live here or in xfstests (as far as I can tell there are no
> > > other VFS tests in selftests, while there are some tests that look like
> > > generic VFS tests in xfstests). If you'd prefer them to be included in
> > > xfstests, let me know.
> > 
> > xfstests, please. That way the new functionality will get immediate
> > coverage by all the main filesystem development and distro QA
> > teams
> 
> Sure, will do. Do you want me to submit them in parallel --

That's usually the way we do things, but we don't tend to commit the
fstests changes until the thing it is testing has landed upstream.

> and what is
> the correct ML to submit changes to xfstests?

fste...@vger.kernel.org

> Sorry for the silly questions. :P

You're going to have many more of them when you start moving the
tests across to fstests :P

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 0/4] get_user_pages*() and RDMA: first steps

2018-10-01 Thread Dave Chinner
On Sat, Sep 29, 2018 at 04:46:09AM -0400, Jerome Glisse wrote:
> On Fri, Sep 28, 2018 at 07:28:16PM -0700, John Hubbard wrote:
> > On 9/28/18 2:49 PM, Jerome Glisse wrote:
> > > On Fri, Sep 28, 2018 at 12:06:12PM -0700, John Hubbard wrote:
> > >> use a non-CPU device to read and write to "pinned" memory, especially 
> > >> when
> > >> that memory is backed by a file system.

"backed by a filesystem" is the biggest problem here.

> > >> I recall there were objections to just narrowly fixing the 
> > >> set_page_dirty()
> > >> bug, because the underlying problem is large and serious. So here we are.
> > > 
> > > Except that you can not solve that issue without proper hardware. GPU are
> > > fine. RDMA are broken except the mellanox5 hardware which can invalidate
> > > at anytime its page table thus allowing to write protect the page at any
> > > time.
> > 
> > Today, people are out there using RDMA without page-fault-capable hardware.
> > And they are hitting problems, as we've seen. From the discussions so far,
> > I don't think it's impossible to solve the problems, even for "lesser", 
> > non-fault-capable hardware.

This reminds me so much of Linux mmap() in the mid-2000s - mmap()
worked for ext3 without being aware of page faults, so most mm/
developers at the time were of the opinion that all the other
filesystems should work just fine without being aware of page
faults.

But some loud-mouthed idiot at SGI kept complaining that mmap()
could never be fixed for XFS without write fault notification
because delayed allocation, unwritten extents and ENOSPC had to
be handled before mapped writes could be posted.  Eventually
Christoph Lameter got ->page_mkwrite into the page fault path and
the loud mouthed idiot finally got mmap() to work correctly on XFS:

commit 4f57dbc6b5bae5a3978d429f45ac597ca7a3b8c6
Author: David Chinner 
Date:   Thu Jul 19 16:28:17 2007 +1000

[XFS] Implement ->page_mkwrite in XFS.

Hook XFS up to ->page_mkwrite to ensure that we know about mmap pages
being written to. This allows use to do correct delayed allocation and
ENOSPC checking as well as remap unwritten extents so that they get
converted correctly during writeback. This is done via the generic
block_page_mkwrite code.

SGI-PV: 940392
SGI-Modid: xfs-linux-melb:xfs-kern:29149a

Signed-off-by: David Chinner 
Signed-off-by: Christoph Hellwig 
Signed-off-by: Tim Shimmin 

Nowadays, ->page_mkwrite is fundamental filesystem functionality -
copy-on-write filesystems like btrfs simply don't work if they can't
trigger COW on mapped write accesses. These days all the main linux
filesystems depend on write fault notifications in some way or
another for correct operation.

The way RDMA uses GUP to take references to file backed pages to
'stop them going away' is reminiscent of mmap() back before
->page_mkwrite(). i.e. it assumes that an initial read of the page
will populate the page state correctly for all future operations,
including set_page_dirty() after write accesses.

This is not a valid assumption - filesystems can have different
private clean vs dirty page state, and may need to perform
operations to take a page from clean to dirty.  Hence calling
set_page_dirty() on a file backed mapped page without first having
called ->page_mkwrite() is a bug.

RDMA does not call ->page_mkwrite on clean file backed pages before it
writes to them and calls set_page_dirty(), and hence RDMA to file
backed pages is completely unreliable. I'm not sure this can be
solved without having page fault capable RDMA hardware
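
For illustration only - a hedged sketch, not RDMA code - the minimal
way a gup user can at least get the clean->dirty transition notified
at pin time is to force the write fault path with FOLL_WRITE, so
->page_mkwrite() runs on clean file-backed pages before the device
gets its references:

#include <linux/mm.h>

/*
 * Hypothetical pin helper: FOLL_WRITE makes gup take the write fault
 * path, so ->page_mkwrite() is called on clean file-backed pages and
 * the filesystem sees the clean->dirty transition before the pages
 * are handed to the device.
 */
static long example_pin_user_buffer(unsigned long start,
                                    unsigned long nr_pages,
                                    struct page **pages)
{
        return get_user_pages(start, nr_pages, FOLL_WRITE, pages, NULL);
}

That only describes the page state at the time of the pin, though; it
does nothing to stop the filesystem changing or invalidating the
mapping while the pages remain pinned.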

> > > With the solution put forward here you can potentially wait _forever_ for
> > > the driver that holds a pin to drop it. This was the point I was trying to
> > > get across during LSF/MM.

Right, but pinning via GUP is not an option for file backed pages
because the filesystem is completely unaware of these references.
i.e. waiting forever isn't an issue here because the filesystem
never waits on them. Instead, they are a filesystem corruption
vector because the filesystem can invalidate those mappings and free
the backing store while they are still in use by RDMA.

Hence for DAX filesystems, this leaves the RDMA app with direct
access to the physical storage even though the filesystem has freed
the space it is accessing. This is a use after free of the physical
storage that the filesystem cannot control, and why DAX+RDMA is
disabled right now.

We could address these use-after-free situations via forcing RDMA to
use file layout leases and revoke the lease when we need to modify
the backing store on leased files. However, this doesn't solve the
need for filesystems to receive write fault notifications via
->page_mkwrite.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 0/3] namei: implement various scoping AT_* flags

2018-09-30 Thread Dave Chinner
On Sat, Sep 29, 2018 at 08:34:50PM +1000, Aleksa Sarai wrote:
> I've added some selftests for this, but it's not clear to me whether
> they should live here or in xfstests (as far as I can tell there are no
> other VFS tests in selftests, while there are some tests that look like
> generic VFS tests in xfstests). If you'd prefer them to be included in
> xfstests, let me know.

xfstests, please. That way the new functionality will get immediate
coverage by all the main filesystem development and distro QA
teams

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Leaking Path in XFS's ioctl interface(missing LSM check)

2018-09-30 Thread Dave Chinner
On Sun, Sep 30, 2018 at 03:16:52PM +0100, Alan Cox wrote:
> > > CAP_SYS_ADMIN is also a bit weird because low level access usually
> > > implies you can bypass access controls so you should also check
> > > CAP_SYS_DAC ?  
> > 
> > Do you mean CAP_DAC_READ_SEARCH as per the newer handle syscalls?
> > But that only allows bypassing directory search operations, so maybe
> > you mean CAP_DAC_OVERRIDE?
> 
> It depends what the ioctl allows you to do. If it allows me to bypass
> DAC and manipulate the file system to move objects around then it's a
> serious issue.

These interfaces have always been allowed to do that. You can't do
transparent online background defragmentation without bypassing DAC
and moving objects around. You can't scrub metadata and data without
bypassing DAC. You can't do dedupe without bypassing /some level/ of
DAC to get access to the filesystem used space map and the raw block
device to hash the data. But the really important access control for
dedupe - avoiding deduping data across files at different security
levels - isn't controlled at all.

> The underlying problem is if CAP_SYS_ADMIN is able to move objects around
> then I can move modules around.

Yup, anything with direct access to block devices can do that. Many
filesystem and storage utilities are given direct access to the
block device, because that's what they need to work.

e.g. in DM land, the control ioctls (ctl_ioctl()) are protected by:

/* only root can play with this */
if (!capable(CAP_SYS_ADMIN))
return -EACCES;

Think about it - if DM control ioctls only require CAP_SYS_ADMIN,
then if you have that cap you can use DM to remap any block in a block
device to any other block. You don't need the filesystem to move
stuff around, it can be moved around without the filesystem knowing
anything about it.

> We already have a problem with
> CAP_DAC_OVERRIDE giving you CAP_SYS_RAWIO (ie totally owning the machine)
> unless the modules are signed, if xfs allows ADMIN as well then
> CAP_SYS_ADMIN is much easier to obtain and you'd get total system
> ownership from it.

Always been the case, and it's not isolated to XFS.

$ git grep CAP_SYS_ADMIN fs/ |wc -l
139
$ git grep CAP_SYS_ADMIN block/ |wc -l
16
$ git grep CAP_SYS_ADMIN drivers/block/ drivers/scsi |wc -l
88

The "CAP_SYS_ADMIN for ioctls" trust model in the storage stack
extends both above and below the filesystem. If you don't trust
CAP_SYS_ADMIN, then you are basically saying that you cannot trust
your storage management and maintenance utilities at any level.


> Not good.
> 
> > Regardless, this horse bolted long before those syscalls were
> > introduced.  The time to address this issue was when XFS was merged
> > into linux all those years ago, back when the apps that run in
> > highly secure restricted environments that use these interfaces were
> > being ported to linux. We can't change this now without breaking
> > userspace
> 
> That's what people said about setuid shell scripts.

Completely different. setuid shell scripts got abused as a hack for
the lazy to avoid setting up permissions properly and hence were
easily exploited.

The storage stack is completely dependent on a simplistic layered
trust model in which root (CAP_SYS_ADMIN) is god. The storage trust
model falls completely apart if we don't have a trusted root user to
administer all layers of the storage stack.

This isn't the first time I've raised this issue - I raised it
back when the user namespace stuff was ram-roaded into the kernel,
and was essentially ignored by the userns people. As a result, we
end up with all the storage management ioctls restricted to the
initns where we have trusted CAP_SYS_ADMIN users.

I've also raised it more recently in the unprivileged mount
discussions (so untrusted root in containers can mount filesystems)
- no solution to the underlying trust model deficiencies was found
in those discussions, either. Instead, filesystems that can
be mounted by untrusted users (i.e. FUSE) have a special flag in
their fstype definition to say this is allowed.

Systems restricted by LSMs to the point where CAP_SYS_ADMIN is not
trusted have exactly the same issues. i.e. there's nobody trusted by
the kernel to administer the storage stack, and nobody has defined a
workable security model that can prevent untrusted users from
violating the existing storage trust model

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Leaking Path in XFS's ioctl interface(missing LSM check)

2018-09-30 Thread Dave Chinner
On Sun, Sep 30, 2018 at 03:16:52PM +0100, Alan Cox wrote:
> > > CAP_SYS_ADMIN is also a bit weird because low level access usually
> > > implies you can bypass access controls so you should also check
> > > CAP_SYS_DAC ?  
> > 
> > Do you mean CAP_DAC_READ_SEARCH as per the newer handle syscalls?
> > But that only allows bypassing directory search operations, so maybe
> > you mean CAP_DAC_OVERRIDE?
> 
> It depends what the ioctl allows you to do. If it allows me to bypass
> DAC and manipulate the file system to move objects around then it's a
> serious issue.

These interfaces have always been allowed to do that. You can't do
transparent online background defragmentation without bypassing DAC
and moving objects around. You can't scrub metadata and data without
bypassing DAC. You can't do dedupe without bypassing /some level/ of
DAC to get access to the filesystem used space map and the raw block
device to hash the data. But the really important access control for
dedupe - avoiding deduping data across files at different security
levels - isn't controlled at all.

> The underlying problem is if CAP_SYS_ADMIN is able to move objects around
> then I can move modules around.

Yup, anything with direct access to block devices can do that. Many
filesystem and storage utilities are given direct access to the
block device, because that's what they need to work.

e.g. in DM land, the control ioctls (ctl_ioctl()) are protected by:

/* only root can play with this */
if (!capable(CAP_SYS_ADMIN))
return -EACCES;

Think about it - if DM control ioctls only require CAP_SYS_ADMIN,
then if have that cap you can use DM to remap any block in a block
device to any other block. You don't need to the filesystem to move
stuff around, it can be moved around without the filesystem knowing
anything about it.

> We already have a problem with
> CAP_DAC_OVERRIDE giving you CAP_SYS_RAWIO (ie totally owning the machine)
> unless the modules are signed, if xfs allows ADMIN as well then
> CAP_SYS_ADMIN is much easier to obtain and you'd get total system
> ownership from it.

Always been the case, and it's not isolated to XFS.

$ git grep CAP_SYS_ADMIN fs/ |wc -l
139
$ git grep CAP_SYS_ADMIN block/ |wc -l
16
$ git grep CAP_SYS_ADMIN drivers/block/ drivers/scsi |wc -l
88

The "CAP_SYS_ADMIN for ioctls" trust model in the storage stack
extends both above and below the filesystem. If you don't trust
CAP_SYS_ADMIN, then you are basically saying that you cannot trust
your storage management and maintenance utilities at any level.


> Not good.
> 
> > Regardless, this horse bolted long before those syscalls were
> > introduced.  The time to address this issue was when XFS was merged
> > into linux all those years ago, back when the apps that run in
> > highly secure restricted environments that use these interfaces were
> > being ported to linux. We can't change this now without breaking
> > userspace
> 
> That's what people said about setuid shell scripts.

Completely different. setuid shell scripts got abused as a hack for
the lazy to avoid setting up permissions properly and hence were
easily exploited.

The storage stack is completely dependent on a simplisitic layered
trust model and that root (CAP_SYS_ADMIN) is god. The storage trust
model falls completely apart if we don't have a trusted root user to
administer all layers of the storage stack.

This isn't the first time I've raised this issue - I raised it
back when the user namespace stuff was ram-roaded into the kernel,
and was essentially ignored by the userns people. As a result, we
end up with all the storage management ioctls restricted to the
initns where we have trusted CAP_SYS_ADMIN users.

I've also raised it more recently in the unprivileged mount
discussions (so untrusted root in containers can mount filesystems)
- no solution to the underlying trust model deficiencies was found
in those discussions, either. Instead, filesystems that can
be mounted by untrusted users (i.e. FUSE) have a special flag in
their fstype definition to say this is allowed.

Systems restricted by LSMs to the point where CAP_SYS_ADMIN is not
trusted have exactly the same issues. i.e. there's nobody trusted by
the kernel to administer the storage stack, and nobody has defined a
workable security model that can prevent untrusted users from
violating the existing storage trust model.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Leaking Path in XFS's ioctl interface(missing LSM check)

2018-09-27 Thread Dave Chinner
On Fri, Sep 28, 2018 at 07:23:42AM +1000, James Morris wrote:
> On Thu, 27 Sep 2018, Dave Chinner wrote:
> 
> > Sure, but there are so many CAP_SYS_ADMIN-only ioctls in the kernel
> > that have no LSM coverage that this is not an isolated problem that
> > people setting up such systems have to deal with. 
> 
> I could be missing something here, but all ioctls are mediated by LSM at a 
> high level (security_file_ioctl). Some problematic ones are singled out at 
> that point by LSMs for special handling.

selinux_file_ioctl() checks whether a specific /file/ has
permissions to execute ioctls, but it doesn't know anything about
what that ioctl actually does. It has special cases for a handful of
ioctls, but misses a large number of ioctls that perform similar
functionality (e.g. file getattr/setattr type ioctls).

smack just checks for read/write permissions depending on the
IOW/IOR/IORW definition of the ioctl.  Tomoyo does a path check
based on the path associated with the file passed to it to see if an
ioctl can be run.

IOWs, none of these really know anything about what the ioctl does,
and they sure as hell don't check any of the arguments for other
file descriptors that the ioctl might be accessing. It's basically a
"do we allow ioctls to this inode/path from this user?" check and
nothing more.
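
To make that limitation concrete, here is a minimal sketch
(hypothetical module code, not any real LSM) of the information a
security_file_ioctl-style hook actually gets to work with:

#include <linux/fs.h>		/* struct file */
#include <linux/ioctl.h>	/* _IOC_DIR, _IOC_READ, _IOC_WRITE */
#include <linux/printk.h>	/* pr_debug */

/* Hypothetical hook, same shape as security_file_ioctl(): */
static int example_file_ioctl(struct file *file, unsigned int cmd,
			      unsigned long arg)
{
	/*
	 * All the hook sees: the fd the ioctl was issued on, the ioctl
	 * number, and an opaque user pointer. A smack-style policy can
	 * key off the encoded direction bits of the command...
	 */
	if (_IOC_DIR(cmd) & (_IOC_READ | _IOC_WRITE))
		pr_debug("ioctl 0x%x on %pD\n", cmd, file);

	/*
	 * ...but nothing here identifies the file an XFS *_BY_HANDLE
	 * ioctl will actually open and operate on - that target is
	 * encoded inside the structure at 'arg' and is
	 * filesystem-specific.
	 */
	return 0;
}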

That just doesn't work for ioctls structured like the XFS filehandle
operations. The ioctl is performed on a "fshandle" file descriptor,
which generally points at the root directory of the filesystem the
file handle belongs to. This gives the ioctl the superblock context
that it is to operate on, and nothing more. It then opens a new file
based on the filehandle that was in the ioctl argument, and performs
the required operation on that file it opened itself.
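
From userspace the pattern looks roughly like the sketch below. It is
written against the libhandle API shipped with xfsprogs
(<xfs/handle.h>, link with -lhandle); the exact prototypes here are
an assumption from memory, so treat it as illustrative rather than a
reference:

#include <fcntl.h>		/* O_RDONLY */
#include <stdio.h>
#include <xfs/handle.h>		/* libhandle: path_to_handle() and friends */

int main(int argc, char **argv)
{
	void *hanp;
	size_t hlen;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <path-on-xfs>\n", argv[0]);
		return 1;
	}

	/* Encode an opaque handle for the path... */
	if (path_to_handle(argv[1], &hanp, &hlen) < 0) {
		perror("path_to_handle");
		return 1;
	}

	/*
	 * ...and reopen the object by handle. Under the covers this is
	 * an ioctl issued on an "fshandle" fd for the filesystem root,
	 * with the real target buried in the ioctl argument - which is
	 * why a per-fd LSM ioctl check never sees it.
	 */
	fd = open_by_handle(hanp, hlen, O_RDONLY);
	if (fd < 0)
		perror("open_by_handle");

	free_handle(hanp, hlen);
	return fd < 0;
}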

IOWs, the security_file_ioctl() hook is almost completely useless in
cases like this - you can't isolate the ioctl based on the file
argument, because it can point to any file or directory in the
filesystem. And without actually parsing, decoding and instantiating
the ioctl arguments, you can't tell the ioctl it can't act on
specific targets. And because filehandle to dentry resolution
results in disconnected dentries, the paths are not complete and
hence path based security checks (e.g. tomoyo) are likely to be
broken and unfixable...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Leaking Path in XFS's ioctl interface(missing LSM check)

2018-09-26 Thread Dave Chinner
On Wed, Sep 26, 2018 at 09:23:03AM -0400, Stephen Smalley wrote:
> On 09/25/2018 09:33 PM, Dave Chinner wrote:
> >On Tue, Sep 25, 2018 at 08:51:50PM -0400, TongZhang wrote:
> >>Hi,
> >>
> >>I'm bringing up this issue again to let of LSM developers know the 
> >>situation, and would like to know your thoughts.
> >>Several weeks ago I sent an email to the security list to discuss the issue 
> >>where
> >>XFS's ioctl interface can do things like vfs_readlink without asking LSM's
> >>permission, which we think is kind of weird and this kind of operation 
> >>should be
> >>audited by LSM.
> >
> >These aren't user interfaces. They are filesystem maintenance and
> >extension interfaces.  They are intended for low level filesystem
> >utilities that require complete, unrestricted access to the
> >underlying filesystem via holding CAP_SYSADMIN in the initns.
> >
> >i.e.  they are used to perform filesystem maintenance and extension
> >operations that need to be completely invisible to users from
> >userspace. e.g.  online file defragmentation (xfs_fsr), data
> >migration (e.g. HSM products), efficient backup of data (xfsdump),
> >metadata and data scrubbing, online repair, etc.
> >
> >IOWs, I really don't think these interfaces are something the LSMs
> >should be trying to intercept or audit, because they are essentially
> >internal filesystem interfaces used by trusted code and not general
> >user application facing APIs.
> 
> If they are interfaces exposed to userspace, then they should be
> mediated via LSM.  We only omit the LSM hook when the usage is
> purely kernel-internal.

/me points to the mass of diverse management ioctls across the
kernel that aren't mediated by LSM hooks.

> Security modules are often used to limit
> even "trusted" applications to least privilege and protect them from
> untrustworthy inputs, moving from binary trust notions to only
> trusting them for what they must be trusted to do.  CAP_SYS_ADMIN
> doesn't necessarily indicate that they are trusted to override any
> given MAC policy restrictions.

Applications that are tightly integrated into the filesystem to
extend its functionality effectively operate in the same trust
space as the kernel filesystem implementation itself.  i.e. they
interact with the filesystem at levels below the DAC/MAC checks that
are performed on user filesystem accesses, and perform manipulation
of the on-disk filesystem structure that is invisible to users
accessing the filesystem.

As such, there are trusted applications that have "massive data
loss" as a potential failure mode if an inappropriately configured
LSM is loaded into the kernel. Breaking an HSM application's access
to the filesystem unexpectedly because someone didn't set up a new
security policy correctly brings a whole new level of risk to
administering sites that mix non-trivial storage solutions with
LSM-based security.

> Wondering why we don't perform the security_inode_readlink() call
> inside of vfs_readlink() currently.  The general pattern is that we
> do perform security_inode_*() calls inside the other vfs_*()
> helpers, so vfs_readlink() is an exception currently.

Pretty sure that was done to mitigate the risk of breaking existing
userspace applications using the handle interfaces to read links.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Leaking Path in XFS's ioctl interface(missing LSM check)

2018-09-26 Thread Dave Chinner
On Wed, Sep 26, 2018 at 07:24:26PM +0100, Alan Cox wrote:
> On Wed, 26 Sep 2018 11:33:29 +1000
> Dave Chinner  wrote:
> 
> > On Tue, Sep 25, 2018 at 08:51:50PM -0400, TongZhang wrote:
> > > Hi,
> > > 
> > > I'm bringing up this issue again to let of LSM developers know the 
> > > situation, and would like to know your thoughts.
> > > Several weeks ago I sent an email to the security list to discuss the 
> > > issue where
> > > XFS's ioctl interface can do things like vfs_readlink without asking LSM's
> > > permission, which we think is kind of weird and this kind of operation 
> > > should be
> > > audited by LSM.  
> > 
> > These aren't user interfaces. They are filesystem maintenance and
> > extension interfaces.  They are intended for low level filesystem
> > utilities that require complete, unrestricted access to the
> > underlying filesystem via holding CAP_SYSADMIN in the initns.
> 
> CAP_SYS_ADMIN is meaningless in an environment using a strong LSM set up.

Sure, but there are so many CAP_SYS_ADMIN-only ioctls in the kernel
that have no LSM coverage that this is not an isolated problem that
people setting up such systems have to deal with. The LSM hooks are
already complex enough without adding hundreds more hooks to control
individual ioctl behaviour for every filesystem, device, etc.

Unless you are going to add LSM hooks to all ioctls, I don't see
much point in singling out one set of ioctls in a way that will
break existing configurations. It's just a knee-jerk reaction (like
airport security) because it doesn't meaningfully address the
problem of all the other management ioctls in the kernel being
completely unconstrained by LSMs.

> CAP_SYS_ADMIN is also a bit weird because low level access usually
> implies you can bypass access controls so you should also check
> CAP_SYS_DAC ?

Do you mean CAP_DAC_READ_SEARCH as per the newer handle syscalls?
But that only allows bypassing directory search operations, so maybe
you mean CAP_DAC_OVERRIDE?
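
For readers unfamiliar with them, the "newer handle syscalls" are
name_to_handle_at() and open_by_handle_at(). A minimal sketch of
their split-privilege model (the second step is the one that wants
CAP_DAC_READ_SEARCH; the MAX_HANDLE_SZ constant is assumed to come
from the glibc fcntl headers):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	struct file_handle *fh;
	int mount_id, mount_fd, fd;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <path> <dir-on-same-mount>\n",
			argv[0]);
		return 1;
	}

	fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
	if (!fh)
		return 1;
	fh->handle_bytes = MAX_HANDLE_SZ;

	/* Encoding a handle needs only normal search permission. */
	if (name_to_handle_at(AT_FDCWD, argv[1], fh, &mount_id, 0) < 0) {
		perror("name_to_handle_at");
		return 1;
	}

	/* Reopening by handle is the privileged step (CAP_DAC_READ_SEARCH). */
	mount_fd = open(argv[2], O_RDONLY | O_DIRECTORY);
	if (mount_fd < 0) {
		perror("open mount dir");
		return 1;
	}
	fd = open_by_handle_at(mount_fd, fh, O_RDONLY);
	if (fd < 0)
		perror("open_by_handle_at");
	return fd < 0;
}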

Regardless, this horse bolted long before those syscalls were
introduced.  The time to address this issue was when XFS was merged
into linux all those years ago, back when the apps that run in
highly secure restricted environments that use these interfaces were
being ported to linux. We can't change this now without breaking
userspace.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: Leaking Path in XFS's ioctl interface(missing LSM check)

2018-09-25 Thread Dave Chinner
On Tue, Sep 25, 2018 at 08:51:50PM -0400, TongZhang wrote:
> Hi,
> 
> I'm bringing up this issue again to let of LSM developers know the situation, 
> and would like to know your thoughts.
> Several weeks ago I sent an email to the security list to discuss the issue 
> where
> XFS's ioctl interface can do things like vfs_readlink without asking LSM's
> permission, which we think is kind of weird and this kind of operation should 
> be
> audited by LSM.

These aren't user interfaces. They are filesystem maintenance and
extension interfaces.  They are intended for low level filesystem
utilities that require complete, unrestricted access to the
underlying filesystem via holding CAP_SYS_ADMIN in the initns.

i.e.  they are used to perform filesystem maintenance and extension
operations that need to be completely invisible to users from
userspace. e.g.  online file defragmentation (xfs_fsr), data
migration (e.g. HSM products), efficient backup of data (xfsdump),
metadata and data scrubbing, online repair, etc.

IOWs, I really don't think these interfaces are something the LSMs
should be trying to intercept or audit, because they are essentially
internal filesystem interfaces used by trusted code and not general
user application facing APIs.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 3.16 52/63] xfs: validate cached inodes are free when allocated

2018-09-21 Thread Dave Chinner
On Sat, Sep 22, 2018 at 01:15:42AM +0100, Ben Hutchings wrote:
> 3.16.58-rc1 review patch.  If anyone has any objections, please let me know.
> 
> --
> 
> From: Dave Chinner 
> 
> commit afca6c5b2595fc44383919fba740c194b0b76aff upstream.
> 
> A recent fuzzed filesystem image cached random dcache corruption
> when the reproducer was run. This often showed up as panics in
> lookup_slow() on a null inode->i_ops pointer when doing pathwalks.
.
> [bwh: Backported to 3.16:
>  - Look up mode in XFS inode, not VFS inode
>  - Use positive error codes, and EIO instead of EFSCORRUPTED]

Again, why EIO?

And 
> Signed-off-by: Ben Hutchings 
> ---
>  fs/xfs/xfs_icache.c | 73 +
>  1 file changed, 48 insertions(+), 25 deletions(-)
> 
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -133,6 +133,46 @@ xfs_inode_free(
>  }
>  
>  /*
> + * If we are allocating a new inode, then check what was returned is
> + * actually a free, empty inode. If we are not allocating an inode,
> + * then check we didn't find a free inode.
> + *
> + * Returns:
> + *   0   if the inode free state matches the lookup context
> + *   ENOENT  if the inode is free and we are not allocating
> + *   EFSCORRUPTED    if there is any state mismatch at all

You changed the code but not the comment.

Cheers,

Dave.
-- 
Dave Chinner
dchin...@redhat.com


Re: [PATCH 3.16 51/63] xfs: catch inode allocation state mismatch corruption

2018-09-21 Thread Dave Chinner
On Sat, Sep 22, 2018 at 01:15:42AM +0100, Ben Hutchings wrote:
> 3.16.58-rc1 review patch.  If anyone has any objections, please let me know.
> 
> --
> 
> From: Dave Chinner 
> 
> commit ee457001ed6c6f31ddad69c24c1da8f377d8472d upstream.
> 
> We recently came across a V4 filesystem causing memory corruption
> due to a newly allocated inode being setup twice and being added to
> the superblock inode list twice. From code inspection, the only way
> this could happen is if a newly allocated inode was not marked as
> free on disk (i.e. di_mode wasn't zero).

> Signed-Off-By: Dave Chinner 
> Reviewed-by: Carlos Maiolino 
> Tested-by: Carlos Maiolino 
> Reviewed-by: Darrick J. Wong 
> Signed-off-by: Darrick J. Wong 
> [bwh: Backported to 3.16:
>  - Look up mode in XFS inode, not VFS inode
>  - Use positive error codes, and EIO instead of EFSCORRUPTED]

Why EIO?

Cheers,

Dave.
-- 
Dave Chinner
dchin...@redhat.com


Re: metadata operation reordering regards to crash

2018-09-14 Thread Dave Chinner
On Fri, Sep 14, 2018 at 05:06:44PM +0800, 焦晓冬 wrote:
> Hi, all,
> 
> A probably bit of complex question:
> Does nowadays practical filesystems, eg., extX, btfs, preserve metadata
> operation order through a crash/power failure?

Yes.

Behaviour is filesystem dependent, but we have tests in fstests that
specifically exercise order preservation across filesystem failures.

> What I know is modern filesystems ensure metadata consistency
> after crash/power failure. Journal filesystems like extX do that by
> write-ahead logging of metadata operations into transactions. Other
> filesystems do that in various ways as btfs do that by COW.
> 
> What I'm not so far clear is whether these filesystems preserve
> metadata operation order after a crash.
> 
> For example,
> op 1.  rename(A, B)
> op 2.  rename(C, D)
> 
> As mentioned above,  metadata consistency is ensured after a crash.
> Thus, B is either the original B(or not exists) or has been replaced by A.
> The same to D.
> 
> Is it possible that, after a crash, D has been replaced by C but B is still
> the original file(or not exists)?

Not for XFS, ext4, btrfs or f2fs. Other filesystems might be
different.
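
As a concrete illustration of the guarantee being claimed for those
filesystems, here is the quoted example written out as a minimal
sketch (error handling reduced to perror/exit):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	/* op 1 */
	if (rename("A", "B") != 0) {
		perror("rename A -> B");
		exit(1);
	}
	/* op 2 */
	if (rename("C", "D") != 0) {
		perror("rename C -> D");
		exit(1);
	}
	/*
	 * Order preservation means the states observable after a crash
	 * are: neither rename, op 1 only, or both - never op 2 without
	 * op 1 (i.e. never D replaced by C while B is still the old B).
	 */
	return 0;
}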

Cheers,

Dave,
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 2/2] fs: list init in the critical section

2018-09-10 Thread Dave Chinner
On Mon, Sep 10, 2018 at 05:09:52AM -0700, swkhack wrote:
> the i_state init in the critical section,so as the list init should in it.

Why? What bug does this fix?

-Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: POSIX violation by writeback error

2018-09-05 Thread Dave Chinner
rite-flush-close operation by
using AIO and not waiting for the fsync to complete before closing
the file. I have this in fsmark, because lots of people with IO
intensive apps have been asking for it over the past 10 years.

-A 

.
FSUse%        Count         Size    Files/sec     App Overhead
     0            8         4096      28770.5          1569090
     0           16         4096      31340.7          1356595
     0           24         4096      30803.0          1423583
     0           32         4096      30404.5          1510099
     0           40         4096      30961.2          1500736

Yup, it's pretty much the same throughput as background async
writeback.  It's a little slower - about 160MB/s and 2,500 IOPS -
due to the increase in overall journal writes caused by the fsync
calls.

What's clear, however, is that we're retiring 10 userspace fsync
operations for every physical disk IO here, as opposed to 2 IOs per
fsync in the above case.  Put simply, the assumption that
applications can't do more flush/fsync operations than disk IOs is
not valid, and that performance of open-write-flush-close workloads
on modern filesystems isn't anywhere near as bad as you think it is.

To mangle a common saying into storage speak:

"Caches are for show, IO is for go"

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: POSIX violation by writeback error

2018-09-04 Thread Dave Chinner
On Tue, Sep 04, 2018 at 01:35:34PM -0700, Vito Caputo wrote:
> On Tue, Sep 04, 2018 at 04:18:18PM -0400, Jeff Layton wrote:
> > On Tue, 2018-09-04 at 14:54 -0400, J. Bruce Fields wrote:
> > > On Tue, Sep 04, 2018 at 06:23:48PM +0200, Rogier Wolff wrote:
> > > > On Tue, Sep 04, 2018 at 12:12:03PM -0400, J. Bruce Fields wrote:
> > > > > Well, I think the point was that in the above examples you'd prefer 
> > > > > that
> > > > > the read just fail--no need to keep the data.  A bit marking the file
> > > > > (or even the entire filesystem) unreadable would satisfy posix, I 
> > > > > guess.
> > > > > Whether that's practical, I don't know.
> > > > 
> > > > When you would do it like that (mark the whole filesystem as "in
> > > > error") things go from bad to worse even faster. The Linux kernel 
> > > > tries to keep the system up even in the face of errors. 
> > > > 
> > > > With that suggestion, having one application run into a writeback
> > > > error would effectively crash the whole system because the filesystem
> > > > may be the root filesystem and stuff like "sshd" that you need to
> > > > diagnose the problem needs to be read from the disk 
> > > 
> > > Well, the absolutist position on posix compliance here would be that a
> > > crash is still preferable to returning the wrong data.  And for the
> > > cases 焦晓冬 gives, that sounds right?  Maybe it's the wrong balance in
> > > general, I don't know.  And we do already have filesystems with
> > > panic-on-error options, so if they aren't used maybe then maybe users
> > > have already voted against that level of strictness.
> > > 
> > 
> > Yeah, idk. The problem here is that this is squarely in the domain of
> > implementation defined behavior. I do think that the current "policy"
> > (if you call it that) of what to do after a wb error is weird and wrong.
> > What we probably ought to do is start considering how we'd like it to
> > behave.
> > 
> > How about something like this?
> > 
> > Mark the pages as "uncleanable" after a writeback error. We'll satisfy
> > reads from the cached data until someone calls fsync, at which point
> > we'd return the error and invalidate the uncleanable pages.
> > 
> > If no one calls fsync and scrapes the error, we'll hold on to it for as
> > long as we can (or up to some predefined limit) and then after that
> > we'll invalidate the uncleanable pages and start returning errors on
> > reads. If someone eventually calls fsync afterward, we can return to
> > normal operation.
> > 
> > As always though...what about mmap? Would we need to SIGBUS at the point
> > where we'd start returning errors on read()?
> > 
> > Would that approximate the current behavior enough and make sense?
> > Implementing it all sounds non-trivial though...
> > 
> 
> Here's a crazy and potentially stupid idea:
> 
> Implement a new class of swap space for backing dirty pages which fail
> to write back.  Pages in this space survive reboots, essentially backing
> the implicit commitment POSIX establishes in the face of asynchronous
> writeback errors.  Rather than evicting these pages as clean, they are
> swapped out to the persistent swap.

And when that "swap" area gets write errors, too? What then? We're
straight back to the same "what the hell do we do with the error"
problem.

Adding more turtles doesn't help solve this issue.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2 2/3] xfs: Prevent multiple wakeups of the same log space waiter

2018-08-27 Thread Dave Chinner
On Mon, Aug 27, 2018 at 11:34:13AM -0400, Waiman Long wrote:
> On 08/26/2018 08:21 PM, Dave Chinner wrote:
> > On Sun, Aug 26, 2018 at 04:53:14PM -0400, Waiman Long wrote:
> >> The current log space reservation code allows multiple wakeups of the
> >> same sleeping waiter to happen. This is a just a waste of cpu time as
> >> well as increasing spin lock hold time. So a new XLOG_TIC_WAKING flag is
> >> added to track if a task is being waken up and skip the wake_up_process()
> >> call if the flag is set.
> >>
> >> Running the AIM7 fserver workload on a 2-socket 24-core 48-thread
> >> Broadwell system with a small xfs filesystem on ramfs, the performance
> >> increased from 91,486 jobs/min to 192,666 jobs/min with this change.
> > Oh, I just noticed you are using a ramfs for this benchmark,
> >
> > tl; dr: Once you pass a certain point, ramdisks can be *much* slower
> > than SSDs on journal intensive workloads like AIM7. Hence it would be
> > useful to see if you have the same problems on, say, high
> > performance nvme SSDs.
> 
> Oh sorry, I made a mistake.
> 
> There were some problems with my test configuration. I was actually
> running the test on a regular enterprise-class disk device mount on /.
> 
> Filesystem  1K-blocks Used Available
> Use% Mounted on
> /dev/mapper/rhel_hp--xl420gen9--01-root  52403200 11284408  41118792  22% /
> 
> It was not an SSD, nor ramdisk. I reran the test on ramdisk, the
> performance of the patched kernel was 679,880 jobs/min which was a bit
> more than double the 285,221 score that I got on a regular disk.

Can you please re-run and report the results for each patch on the
ramdisk setup? And, please, include the mkfs.xfs or xfs_info output
for the ramdisk filesystem so I can see /exactly/ how much
concurrency the filesystems are providing to the benchmark you are
running.

> So the filesystem used wasn't tiny, though it is still not very large.

50GB is tiny for XFS. Personally, I've been using ~1PB
filesystems(*) for the performance testing I've been doing
recently...

Cheers,

Dave.

(*) Yes, petabytes. Sparse image files on really fast SSDs are a
wonderful thing.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2 2/3] xfs: Prevent multiple wakeups of the same log space waiter

2018-08-27 Thread Dave Chinner
On Mon, Aug 27, 2018 at 12:39:06AM -0700, Christoph Hellwig wrote:
> On Mon, Aug 27, 2018 at 10:21:34AM +1000, Dave Chinner wrote:
> > tl; dr: Once you pass a certain point, ramdisks can be *much* slower
> > than SSDs on journal intensive workloads like AIM7. Hence it would be
> > useful to see if you have the same problems on, say, high
> > performance nvme SSDs.
> 
> Note that all these ramdisk issues you mentioned below will also apply
> to using the pmem driver on nvdimms, which might be a more realistic
> version.  Even worse at least for cases where the nvdimms aren't
> actually powerfail dram of some sort with write through caching and
> ADR the latency is going to be much higher than the ramdisk as well.

Yes, I realise that.

I am expecting that when it comes to optimising for pmem, we'll
actually rewrite the journal to map pmem and memcpy() directly
rather than go through the buffering and IO layers we currently do
so we can minimise write latency and control concurrency ourselves.
Hence I'm not really concerned by performance issues with pmem at
this point - most of our users still have traditional storage and
will for a long time to come.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH v2 2/3] xfs: Prevent multiple wakeups of the same log space waiter

2018-08-26 Thread Dave Chinner
On Sun, Aug 26, 2018 at 04:53:14PM -0400, Waiman Long wrote:
> The current log space reservation code allows multiple wakeups of the
> same sleeping waiter to happen. This is a just a waste of cpu time as
> well as increasing spin lock hold time. So a new XLOG_TIC_WAKING flag is
> added to track if a task is being waken up and skip the wake_up_process()
> call if the flag is set.
> 
> Running the AIM7 fserver workload on a 2-socket 24-core 48-thread
> Broadwell system with a small xfs filesystem on ramfs, the performance
> increased from 91,486 jobs/min to 192,666 jobs/min with this change.

Oh, I just noticed you are using a ramfs for this benchmark,

tl; dr: Once you pass a certain point, ramdisks can be *much* slower
than SSDs on journal intensive workloads like AIM7. Hence it would be
useful to see if you have the same problems on, say, high
performance nvme SSDs.

-

Ramdisks have substantially different log IO completion and
wakeup behaviour compared to real storage on real production
systems. Basically, ramdisks are synchronous and real storage is
asynchronous.

That is, on a ramdisk the IO completion is run synchronously in the
same task as the IO submission because the IO is just a memcpy().
Hence a single dispatch thread can only drive an IO queue depth of 1
IO - there is no concurrency possible. This serialises large parts
of the XFS journal - the journal is really an asynchronous IO engine
that gets its performance from driving deep IO queues and batching
commits while IO is in flight.

Ramdisks also have very low IO latency, which means there's only a
very small window for "IO in flight" batching optimisations to be
made effectively. It effectively stops such algorithms from working
completely. This means the XFS journal behaves very differently on
ramdisks when compared to normal storage.

The submission batching technique reduces log IOs by a factor of
10-20 under heavy synchronous transaction loads when there is any
noticeable journal IO delay - a few tens of microseconds is enough
for it to function effectively, but a ramdisk doesn't even have this
delay on journal IO.  The submission batching also has the
effect of reducing log space wakeups by the same factor, as there are
fewer IO completions signalling that space has been made available.

Further, when we get async IO completions from real hardware, they
get processed in batches by a completion workqueue - this leads to
there typically only being a single reservation space update from
all batched IO completions. This tends to reduce log space wakeups
due to log IO completion by a factor of 6-8 as the log can have up
to 8 concurrent IOs in flight at a time.

And when we throw in the lack of batching, merging and IO completion
aggregation of metadata writeback because ramdisks are synchronous
and don't queue or merge adjacent IOs, we end up with lots more
contention on the AIL lock and much more frequent log space wakeups
(i.e. from log tail movement updates). This further exacerbates the
problems the log already has with synchronous IO.

IOWs, log space wakeups on real storage are likely to be 50-100x
lower than on a ramdisk for the same metadata and journal intensive
workload, and as such those workloads often run faster on real
storage than they do on ramdisks.

This can be trivially seen with dbench, a simple IO benchmark that
hammers the journal. On a ramdisk, I can only get 2-2.5GB/s
throughput from the benchmark before the log bottlenecks at about
20,000 tiny log IOs per second. In comparison, on an old, badly
abused Samsung 850EVO SSD, I see 5-6GB/s in 2,000 log IOs per second
because of the pipelining and IO batching in the XFS journal async
IO engine and the massive reduction in metadata IO due to merging of
adjacent IOs in the block layer. i.e. the journal and metadata
writeback design allows the filesystem to operate at a much higher
synchronous transaction rate than would otherwise be possible by
taking advantage of the IO concurrency that storage provides us
with.

So if you use proper storage hardware (e.g. nvme SSD) and/or an
appropriately sized log, does the slowpath wakeup contention go
away? Can you please test both of these things and report the
results so we can properly evaluate the impact of these changes?
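
For reference, the mechanism the quoted changelog describes reduces
to a pattern like the sketch below (illustrative structure and names
only, not the actual XFS patch):

#include <linux/bitops.h>
#include <linux/sched.h>

/* Hypothetical waiter - stands in for the log ticket in the patch. */
struct space_waiter {
	unsigned long		flags;	/* WAITER_WAKING bit lives here */
	struct task_struct	*task;	/* the sleeping task */
};
#define WAITER_WAKING	0		/* bit number */

/*
 * Only the first caller that frees space issues the wakeup; later
 * callers see the bit already set and skip the redundant
 * wake_up_process() while the task is still getting onto the CPU.
 */
static void wake_space_waiter(struct space_waiter *w)
{
	if (!test_and_set_bit(WAITER_WAKING, &w->flags))
		wake_up_process(w->task);
}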

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

