Re: [PATCH 0/4][V4] drop the mmap_sem when doing IO in the fault path
On Tue, Dec 04, 2018 at 02:49:31PM -0800, Andrew Morton wrote:
> On Fri, 30 Nov 2018 14:58:08 -0500 Josef Bacik wrote:
>
> > Now that we have proper isolation in place with cgroups2 we have started going through and fixing the various priority inversions. Most are all gone now, but this one is sort of weird since it's not necessarily a priority inversion that happens within the kernel, but rather because of something userspace does.
> >
> > We have giant applications that we want to protect, and parts of these giant applications do things like watch the system state to determine how healthy the box is for load balancing and such. This involves running 'ps' or other such utilities. These utilities will often walk /proc/<pid>/whatever, and these files can sometimes need to down_read(&mm->mmap_sem). Not usually a big deal, but we noticed when we are stress testing that sometimes our protected application has latency spikes trying to get the mmap_sem for tasks that are in lower priority cgroups.
> >
> > This is because any down_write() on a semaphore essentially turns it into a mutex, so even if we currently have it held for reading, any new readers will not be allowed in, to keep from starving the writer. This is fine, except a lower priority task could be stuck doing IO because it has been throttled to the point that its IO is taking much longer than normal. But because a higher priority group depends on this completing it is now stuck behind lower priority work.
> >
> > In order to avoid this particular priority inversion we want to use the existing retry mechanism to stop from holding the mmap_sem at all if we are going to do IO. This already exists in the read case sort of, but needed to be extended for more than just grabbing the page lock. With io.latency we throttle at submit_bio() time, so the readahead stuff can block and even page_cache_read can block, so all these paths need to have the mmap_sem dropped.
> >
> > The other big thing is ->page_mkwrite. btrfs is particularly shitty here because we have to reserve space for the dirty page, which can be a very expensive operation. We use the same retry method as the read path, and simply cache the page and verify the page is still setup properly the next pass through ->page_mkwrite().
>
> Seems reasonable. I have a few minorish changeloggish comments.
>
> We're at v4 and no acks have been gathered?

I looked at previous versions and had a bunch of questions and change requests. I haven't had time to look at this version yet, but seeing as the page_mkwrite() stuff has been dropped from this version it isn't useful anymore for solving the problem I had in mind when reviewing it originally...

What I really want is unconditionally retriable page faults so the filesystem can cause the page fault to be restarted from scratch. We have a requirement for DAX and shared data extents (reflink) to work, and that requires changing the faulted page location during page_mkwrite. i.e. we get a fault on a read-mapped shared page, then we have to do a copy-on-write operation to break physical data sharing, and so the page with the file data in it physically changes during ->page_mkwrite (because DAX). Hence we need to restart the page fault to map the new page correctly, because the file no longer points at the page that was originally faulted.

With this stashed-page-and-retry mechanism implemented for ->page_mkwrite, we could stash the new page in the vmf and tell the fault to retry, and everything would just work. Without ->page_mkwrite support, it's just not that interesting, and I have higher priority things to deal with right now.

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
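[For readers unfamiliar with the retry mechanism under discussion: below is a minimal, illustrative sketch - not code from Josef's series - of the drop-mmap_sem-and-retry shape of a ->fault handler, written against the 4.20-era fault API. fetch_page_doing_io() is a hypothetical stand-in for the readahead/page_cache_read paths that can block in submit_bio().]

#include <linux/mm.h>
#include <linux/pagemap.h>

/*
 * Illustrative sketch only: if we would have to do IO to satisfy the
 * fault, drop mmap_sem first so other readers (e.g. /proc walkers) and
 * queued writers are not stuck behind our throttled IO, then ask the
 * fault core to restart the fault from scratch.
 */
static vm_fault_t example_fault(struct vm_fault *vmf)
{
	struct file *file = vmf->vma->vm_file;
	struct page *page;

	page = find_get_page(file->f_mapping, vmf->pgoff);
	if (page) {
		vmf->page = page;
		return 0;
	}

	if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
		up_read(&vmf->vma->vm_mm->mmap_sem);
		/* hypothetical helper: does readahead/IO without mmap_sem */
		page = fetch_page_doing_io(file, vmf->pgoff);
		/*
		 * Stash the page so the retry pass can verify and reuse
		 * it - the approach the cover letter describes for the
		 * ->page_mkwrite() side of the series.
		 */
		vmf->page = page;
		return VM_FAULT_RETRY;
	}

	/* caller can't tolerate a retry; do the IO holding mmap_sem */
	vmf->page = fetch_page_doing_io(file, vmf->pgoff);
	return 0;
}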
Re: [PATCH AUTOSEL 4.14 25/35] iomap: sub-block dio needs to zeroout beyond EOF
On Sat, Dec 01, 2018 at 02:49:09AM -0500, Sasha Levin wrote:
> On Sat, Dec 01, 2018 at 08:50:05AM +1100, Dave Chinner wrote:
> > On Fri, Nov 30, 2018 at 05:14:41AM -0500, Sasha Levin wrote:
> > > On Fri, Nov 30, 2018 at 09:22:03AM +0100, Greg KH wrote:
> > > > On Fri, Nov 30, 2018 at 09:40:19AM +1100, Dave Chinner wrote:
> > > > > I stopped my tests at 5 billion ops yesterday (i.e. 20 billion ops aggregate) to focus on testing the copy_file_range() changes, but Darrick's tests are still ongoing and have passed 40 billion ops in aggregate over the past few days.
> > > > >
> > > > > The reason we are running these so long is that we've seen fsx data corruption failures after 12+ hours of runtime and hundreds of millions of ops. Hence the testing for backported fixes will need to replicate these test runs across multiple configurations for multiple days before we have any confidence that we've actually fixed the data corruptions and not introduced any new ones.
> > > > >
> > > > > If you pull only a small subset of the fixes, the fsx will still fail and we have no real way of actually verifying that there have been no regressions introduced by the backport. IOWs, there's a /massive/ amount of QA needed for ensuring that these backports work correctly.
> > > > >
> > > > > Right now the XFS developers don't have the time or resources available to validate stable backports are correct and regression free because we are focussed on ensuring the upstream fixes we've already made (and are still writing) are solid and reliable.
> > > >
> > > > Ok, that's fine, so users of XFS should wait until the 4.20 release before relying on it? :)
> > >
> > > It's getting to the point that with the amount of known issues with XFS on LTS kernels it makes sense to mark it as CONFIG_BROKEN.
> >
> > Really? Where are the bug reports?
>
> In 'git log'! You report these every time you fix something in upstream xfs but don't backport it to stable trees:

That is so wrong on so many levels I don't really know where to begin. I guess doing a *basic risk analysis* demonstrating that none of those fixes are backport candidates is a good start:

> $ git log --oneline v4.18-rc1..v4.18 fs/xfs
> d4a34e165557 xfs: properly handle free inodes in extent hint validators

Found by QA with generic/229 on a non-standard config. Not user reported, unlikely to ever be seen by users.

> 9991274fddb9 xfs: Initialize variables in xfs_alloc_get_rec before using them

Cleaning up coverity-reported issues to do with corruption log messages. No visible symptoms, not user reported.

> d8cb5e423789 xfs: fix fdblocks accounting w/ RMAPBT per-AG reservation

Minor free space accounting issue, not user reported, doesn't affect normal operation.

> e53c4b598372 xfs: ensure post-EOF zeroing happens after zeroing part of a file

Found with fsx via generic/127. Not user reported, doesn't affect userspace operation at all.

> a3a374bf1889 xfs: fix off-by-one error in xfs_rtalloc_query_range

Regression fix for code introduced in 4.18-rc1. Not user reported because the code has never been released.

> 232d0a24b0fc xfs: fix uninitialized field in rtbitmap fsmap backend

Coverity warning fix, not user reported, no user impact.

> 5bd88d153998 xfs: recheck reflink state after grabbing ILOCK_SHARED for a write

Fixes warning from generic/166, not user reported. Could affect users mixing direct IO with reflink, but we expect people using new functionality like reflink to be tracking TOT fairly closely anyway.

> f62cb48e4319 xfs: don't allow insert-range to shift extents past the maximum offset

Found by QA w/ generic/465. Not user reported, only affects files in the exabyte range, so not a real world problem.

> aafe12cee0b1 xfs: don't trip over negative free space in xfs_reserve_blocks

Found during ENOSPC stress tests that depleted the reserve pool. Not user reported, unlikely to ever be hit by users.

> 10ee25268e1f xfs: allow empty transactions while frozen

Removes a spurious warning when running the GETFSMAP ioctl on a frozen filesystem. Not user reported, highly unlikely any user will ever hit this as nothing but XFS utilities use GETFSMAP at the moment.

> e53946dbd31a xfs: xfs_iflush_abort() can be called twice on cluster writeback failure

Bug in corrupted filesystem handling, been there for ~15 years IIRC. Not user reported - found by one of our shutdown stress tests on a debug kernel (generic/388, IIRC). Highly unlikely to ever be seen by users.
Re: [PATCH AUTOSEL 4.14 25/35] iomap: sub-block dio needs to zeroout beyond EOF
On Fri, Nov 30, 2018 at 05:14:41AM -0500, Sasha Levin wrote:
> On Fri, Nov 30, 2018 at 09:22:03AM +0100, Greg KH wrote:
> > On Fri, Nov 30, 2018 at 09:40:19AM +1100, Dave Chinner wrote:
> > > I stopped my tests at 5 billion ops yesterday (i.e. 20 billion ops aggregate) to focus on testing the copy_file_range() changes, but Darrick's tests are still ongoing and have passed 40 billion ops in aggregate over the past few days.
> > >
> > > The reason we are running these so long is that we've seen fsx data corruption failures after 12+ hours of runtime and hundreds of millions of ops. Hence the testing for backported fixes will need to replicate these test runs across multiple configurations for multiple days before we have any confidence that we've actually fixed the data corruptions and not introduced any new ones.
> > >
> > > If you pull only a small subset of the fixes, the fsx will still fail and we have no real way of actually verifying that there have been no regressions introduced by the backport. IOWs, there's a /massive/ amount of QA needed for ensuring that these backports work correctly.
> > >
> > > Right now the XFS developers don't have the time or resources available to validate stable backports are correct and regression free because we are focussed on ensuring the upstream fixes we've already made (and are still writing) are solid and reliable.
> >
> > Ok, that's fine, so users of XFS should wait until the 4.20 release before relying on it? :)
>
> It's getting to the point that with the amount of known issues with XFS on LTS kernels it makes sense to mark it as CONFIG_BROKEN.

Really? Where are the bug reports?

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH AUTOSEL 4.14 25/35] iomap: sub-block dio needs to zeroout beyond EOF
On Fri, Nov 30, 2018 at 09:22:03AM +0100, Greg KH wrote:
> On Fri, Nov 30, 2018 at 09:40:19AM +1100, Dave Chinner wrote:
> > On Thu, Nov 29, 2018 at 01:47:56PM +0100, Greg KH wrote:
> > > On Thu, Nov 29, 2018 at 11:14:59PM +1100, Dave Chinner wrote:
> > > >
> > > > Cherry picking only one of the 50-odd patches we've committed into late 4.19 and 4.20 kernels to fix the problems we've found really seems like asking for trouble. If you're going to back port random data corruption fixes, then you need to spend a *lot* of time validating that it doesn't make things worse than they already are...
> > >
> > > Any reason why we can't take the 50-odd patches in their entirety? It sounds like 4.19 isn't fully fixed, but 4.20-rc1 is? If so, what do you recommend we do to make 4.19 work properly?
> >
> > You could pull all the fixes, but then you have a QA problem. Basically, we have multiple badly broken syscalls (FICLONERANGE, FIDEDUPERANGE and copy_file_range), and even 4.20-rc4 isn't fully fixed.
> >
> > There were ~5 critical dedupe/clone data corruption fixes for XFS that went into 4.19-rc8.
>
> Have any of those been tagged for stable?

None, because I have no confidence that the stable process will do the necessary QA to validate that such a significant backport is regression and data corruption free. The backport needs to be done as a complete series when we've finished the upstream work, because we can't test isolated patches adequately: fsx will fall over due to all the unfixed problems and not exercise the fixes that were backported.

Further, we just had a regression reported in one of the commits that the autosel bot has selected for automatic backports. It has been uncovered by overlay, which appears to do some unique things with the piece of crap that is do_splice_direct(). And Darrick just commented on #xfs that he's just noticed more bugs with FICLONERANGE and overlay.

IOWs, we're still finding broken stuff in this code and we are fixing it as fast as we can - we're still putting out fires. We most certainly don't need the added pressure of having you guys create more spot fires by breaking stable kernels with largely untested partial backports and having users exposed to whacky new data corruption issues.

So, no, it isn't tagged for stable kernels, because "commit into mainline" != "this should be backported immediately". Backports of these fixes are going to be done largely as a function of time and resources, of which we have zero available right now. Doing backports right now is premature and ill-advised because we haven't finished finding and fixing all the bugs and regressions in this code.

> > Right now the XFS developers don't have the time or resources available to validate stable backports are correct and regression free because we are focussed on ensuring the upstream fixes we've already made (and are still writing) are solid and reliable.
>
> Ok, that's fine, so users of XFS should wait until the 4.20 release before relying on it? :)

Ok, Greg, that's *out of line*. I should throw the CoC at you, because I find that comment offensive, condescending, belittling, denigrating and insulting. Your smug and superior "I know what is right for you" attitude is completely inappropriate, and a little smiley face does not make it acceptable.

If you think your comment is funny, you've badly misjudged how much effort I've put into this (100-hour weeks for over a month now), how close I'm flying to burn out (again!), and how pissed off I am about this whole scenario.

We ended up here because we *trusted* that other people had implemented and tested their APIs and code properly before it got merged. We've been severely burnt, and we've been left to clean up the mess made by other people by ourselves. Instead of thanks, what we get is a "we know better" attitude and jokes implying our work is crap and we don't care about our users. That's just plain *insulting*. If anyone is looking for a demonstration of everything that is wrong with the Linux kernel development culture, then they don't need to look any further.

> I understand your reluctance to want to backport anything, but it really feels like you are not even allowing for fixes that are "obviously right" to be backported either, even after they pass testing. Which isn't ok for your users.

It's worse for our users if we introduce regressions into stable kernels, which is exactly what this "obviously right" auto-backport would have done.

-Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH AUTOSEL 4.14 25/35] iomap: sub-block dio needs to zeroout beyond EOF
On Thu, Nov 29, 2018 at 01:47:56PM +0100, Greg KH wrote:
> On Thu, Nov 29, 2018 at 11:14:59PM +1100, Dave Chinner wrote:
> >
> > Cherry picking only one of the 50-odd patches we've committed into late 4.19 and 4.20 kernels to fix the problems we've found really seems like asking for trouble. If you're going to back port random data corruption fixes, then you need to spend a *lot* of time validating that it doesn't make things worse than they already are...
>
> Any reason why we can't take the 50-odd patches in their entirety? It sounds like 4.19 isn't fully fixed, but 4.20-rc1 is? If so, what do you recommend we do to make 4.19 work properly?

You could pull all the fixes, but then you have a QA problem. Basically, we have multiple badly broken syscalls (FICLONERANGE, FIDEDUPERANGE and copy_file_range), and even 4.20-rc4 isn't fully fixed.

There were ~5 critical dedupe/clone data corruption fixes for XFS that went into 4.19-rc8.

There were ~30 patches that went into 4.20-rc1 that fixed the FICLONERANGE/FIDEDUPERANGE ioctls. That completely reworks the entire VFS infrastructure for those calls, and touches several filesystems as well. It fixes problems with setuid files, swap files, modifying immutable files, failure to enforce rlimit and max file size constraints, behaviour that didn't match man page descriptions, etc.

There were another ~10 patches that went into 4.20-rc4 that fixed yet more data corruption and API problems that we found when we enhanced fsx to use the above syscalls.

And I have another ~10 patches that I'm working on right now to fix the copy_file_range() implementation - it has all the same problems I listed above for FICLONERANGE/FIDEDUPERANGE and some other unique ones. I'm currently writing error condition tests for fstests so that we at least have some coverage of the conditions copy_file_range() is supposed to catch and fail. This might all make a late 4.20-rcX, but it's looking more like 4.21 at this point.

As to testing this stuff, I've spent several weeks now on this and so has Darrick. Between us we've done the huge amount of QA needed to verify that the problems are fixed, and it is still ongoing. From #xfs a couple of days ago:

[28/11/18 16:59] * djwong hits 6 billion fsxops...
[28/11/18 17:07] djwong: I've got about 3.75 billion ops running on a machine here
[28/11/18 17:20] note that's 1 billion fsxops x 6 machines
[28/11/18 17:21] [xfsv4, xfsv5, xfsv5 w/ 1k blocks] * [directio fsx, buffered fsx]
[28/11/18 17:21] Oh, I've got 3.75B x 4 instances on one filesystem :P
[28/11/18 17:22] [direct io, buffered] x [small op lengths, large op lengths]

And this morning:

[30/11/18 08:53] 7 billion fsxops...

I stopped my tests at 5 billion ops yesterday (i.e. 20 billion ops aggregate) to focus on testing the copy_file_range() changes, but Darrick's tests are still ongoing and have passed 40 billion ops in aggregate over the past few days.

The reason we are running these so long is that we've seen fsx data corruption failures after 12+ hours of runtime and hundreds of millions of ops. Hence the testing for backported fixes will need to replicate these test runs across multiple configurations for multiple days before we have any confidence that we've actually fixed the data corruptions and not introduced any new ones.

If you pull only a small subset of the fixes, the fsx will still fail and we have no real way of actually verifying that there have been no regressions introduced by the backport. IOWs, there's a /massive/ amount of QA needed for ensuring that these backports work correctly.

Right now the XFS developers don't have the time or resources available to validate stable backports are correct and regression free because we are focussed on ensuring the upstream fixes we've already made (and are still writing) are solid and reliable.

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
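[As context for the error-condition tests mentioned above, here is a hedged userspace sketch of the kind of copy_file_range() argument validation such tests cover. The expected errnos come from the copy_file_range(2) man page, not from the patch set itself; the glibc >= 2.27 wrapper is assumed.]

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* report whether a call failed the way the man page says it should */
static void expect_fail(const char *what, ssize_t ret, int want_errno)
{
	if (ret >= 0)
		printf("%s: unexpectedly succeeded (%zd bytes)\n", what, ret);
	else if (errno != want_errno)
		printf("%s: got %s, expected %s\n", what,
		       strerror(errno), strerror(want_errno));
	else
		printf("%s: ok (%s)\n", what, strerror(errno));
}

int main(void)
{
	int fd = open("testfile", O_RDWR | O_CREAT | O_TRUNC, 0644);
	int rdonly = open("testfile", O_RDONLY);

	if (fd < 0 || rdonly < 0)
		return 1;
	if (pwrite(fd, "some test data", 14, 0) != 14)
		return 1;

	/* the flags argument must be zero */
	expect_fail("nonzero flags",
		    copy_file_range(fd, NULL, fd, NULL, 14, ~0U), EINVAL);
	/* destination fd opened read-only */
	expect_fail("read-only dest",
		    copy_file_range(fd, NULL, rdonly, NULL, 14, 0), EBADF);
	return 0;
}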
Re: [PATCH AUTOSEL 4.19 52/68] iomap: sub-block dio needs to zeroout beyond EOF
On Thu, Nov 29, 2018 at 02:36:50PM +0200, Amir Goldstein wrote:
> > Same again - what's the test plan for these cherry-picked data corruption fixes?
>
> Dave,
>
> Just to check if we are on the same page, if this was the test plan:
> https://www.spinics.net/lists/linux-xfs/msg20639.html
>
> Would "no regressions from baseline" have been sufficient to validate those specific patches are solid for stable?

No, not in this case, because fsx in fstests only does 100k ops at most - it's a smoke test. This isn't sufficient to regression test fixes that, in some cases, took hundreds of millions of fsx ops to expose.

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH AUTOSEL 4.19 52/68] iomap: sub-block dio needs to zeroout beyond EOF
On Thu, Nov 29, 2018 at 12:55:43AM -0500, Sasha Levin wrote:
> From: Dave Chinner
>
> [ Upstream commit b450672fb66b4a991a5b55ee24209ac7ae7690ce ]
>
> If we are doing sub-block dio that extends EOF, we need to zero the unused tail of the block to initialise the data in it. If we do not zero the tail of the block, then an immediate mmap read of the EOF block will expose stale data beyond EOF to userspace. Found with fsx running sub-block DIO sizes vs MAPREAD/MAPWRITE operations.
>
> Fix this by detecting if the end of the DIO write is beyond EOF and zeroing the tail if necessary.
>
> Signed-off-by: Dave Chinner
> Reviewed-by: Christoph Hellwig
> Reviewed-by: Darrick J. Wong
> Signed-off-by: Darrick J. Wong
> Signed-off-by: Sasha Levin
> ---
>  fs/iomap.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/fs/iomap.c b/fs/iomap.c
> index fa46e3ed8f53..82e35265679d 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -1678,7 +1678,14 @@ iomap_dio_bio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		dio->submit.cookie = submit_bio(bio);
>  	} while (nr_pages);
>
> -	if (need_zeroout) {
> +	/*
> +	 * We need to zeroout the tail of a sub-block write if the extent type
> +	 * requires zeroing or the write extends beyond EOF. If we don't zero
> +	 * the block tail in the latter case, we can expose stale data via mmap
> +	 * reads of the EOF block.
> +	 */
> +	if (need_zeroout ||
> +	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
>  		/* zero out from the end of the write to the end of the block */
>  		pad = pos & (fs_block_size - 1);
>  		if (pad)

Same again - what's the test plan for these cherry-picked data corruption fixes?

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
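[The tail-zeroing arithmetic in the hunk above is compact; the following stand-alone snippet just works the numbers for one assumed case - 4096-byte filesystem blocks and a 512-byte DIO write ending at offset 8704.]

#include <stdio.h>

int main(void)
{
	/* fs_block_size must be a power of two, so pos & (size - 1)
	 * is pos % size */
	unsigned int fs_block_size = 4096;
	long long pos = 8704;	/* end of a 512-byte sub-block DIO write */
	unsigned int pad = pos & (fs_block_size - 1);

	if (pad)
		/* zero 3584 bytes: from 8704 up to the block end, 12288 */
		printf("zero %u bytes: from %lld to block end %lld\n",
		       fs_block_size - pad, pos,
		       pos + (fs_block_size - pad));
	else
		printf("write ended on a block boundary, nothing to zero\n");
	return 0;
}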
Re: [PATCH AUTOSEL 4.14 25/35] iomap: sub-block dio needs to zeroout beyond EOF
On Thu, Nov 29, 2018 at 01:00:59AM -0500, Sasha Levin wrote:
> From: Dave Chinner
>
> [ Upstream commit b450672fb66b4a991a5b55ee24209ac7ae7690ce ]
>
> If we are doing sub-block dio that extends EOF, we need to zero the unused tail of the block to initialise the data in it. If we do not zero the tail of the block, then an immediate mmap read of the EOF block will expose stale data beyond EOF to userspace. Found with fsx running sub-block DIO sizes vs MAPREAD/MAPWRITE operations.
>
> Fix this by detecting if the end of the DIO write is beyond EOF and zeroing the tail if necessary.
>
> Signed-off-by: Dave Chinner
> Reviewed-by: Christoph Hellwig
> Reviewed-by: Darrick J. Wong
> Signed-off-by: Darrick J. Wong
> Signed-off-by: Sasha Levin
> ---
>  fs/iomap.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/fs/iomap.c b/fs/iomap.c
> index 8f7673a69273..407efdae3978 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -940,7 +940,14 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
>  		dio->submit.cookie = submit_bio(bio);
>  	} while (nr_pages);
>
> -	if (need_zeroout) {
> +	/*
> +	 * We need to zeroout the tail of a sub-block write if the extent type
> +	 * requires zeroing or the write extends beyond EOF. If we don't zero
> +	 * the block tail in the latter case, we can expose stale data via mmap
> +	 * reads of the EOF block.
> +	 */
> +	if (need_zeroout ||
> +	    ((dio->flags & IOMAP_DIO_WRITE) && pos >= i_size_read(inode))) {
>  		/* zero out from the end of the write to the end of the block */
>  		pad = pos & (fs_block_size - 1);
>  		if (pad)

How do you propose to validate that this doesn't introduce new data corruptions in isolation? I've spent the last 4 weeks of my life and about 15 billion fsx ops chasing and validating the data corruption fixes we've pushed recently into the 4.19 and 4.20 codebase.

Cherry picking only one of the 50-odd patches we've committed into late 4.19 and 4.20 kernels to fix the problems we've found really seems like asking for trouble. If you're going to back port random data corruption fixes, then you need to spend a *lot* of time validating that it doesn't make things worse than they already are...

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
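[For illustration, here is a rough, assumption-laden userspace sketch of the failure mode the commit message describes - not a confirmed test case from the thread. It assumes a filesystem with 4096-byte blocks and a device that accepts 512-byte-aligned O_DIRECT writes; on a correctly zeroing kernel the loop prints nothing.]

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("dio_test",
		      O_RDWR | O_CREAT | O_TRUNC | O_DIRECT, 0644);
	void *buf;
	unsigned char *map;
	size_t i;

	if (fd < 0 || posix_memalign(&buf, 512, 512))
		return 1;
	memset(buf, 0xab, 512);

	/* sub-block O_DIRECT write that extends EOF to 512 */
	if (pwrite(fd, buf, 512, 0) != 512)
		return 1;

	/* mmap reads are page-granular, so the bytes between EOF and
	 * the end of the EOF block are readable without SIGBUS; they
	 * must read back as zeros, never as stale disk contents */
	map = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;
	for (i = 512; i < 4096; i++)
		if (map[i] != 0)
			printf("stale byte at offset %zu: 0x%02x\n",
			       i, map[i]);
	return 0;
}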
Re: [PATCH] xfs: Remove noinline from #define STATIC
On Mon, Nov 12, 2018 at 08:23:42PM -0800, Joe Perches wrote:
> On Tue, 2018-11-13 at 14:09 +1100, Dave Chinner wrote:
> > On Mon, Nov 12, 2018 at 08:54:10PM -0500, Theodore Y. Ts'o wrote:
> > > On Tue, Nov 13, 2018 at 12:18:05PM +1100, Dave Chinner wrote:
> > > > I'm not interested in making code fast if distro support engineers can't debug problems on user systems easily. Optimising for performance over debuggability is a horrible trade off for us to make because it means users and distros end up much more reliant on single points of expertise for debugging problems. And that means the majority of the load of problem triage falls directly on very limited resources - the core XFS development team. A little bit of thought about how to make code easier to triage and debug goes a long, long way.
> > >
> > > So at least in my experience, if the kernels are compiled with CONFIG_DEBUG_INFO and/or CONFIG_DEBUG_INFO_REDUCED, scripts/decode_stacktrace.sh seems to do a very nice job with inlined
> >
> > That doesn't help with kernel profiling and other such things that are based on callgraphs...
>
> If that's really the case:
>
> I rather suspect the xfs static v STATIC function marking is not particularly curated and the marking is somewhat arbitrary.

That's a common opinion for an outsider to form when they come across something unfamiliar they don't really understand. "I don't understand this, so I must rewrite it" is an unfortunate habit that programmers have.

> So perhaps given the large number of static, but not STATIC functions, perhaps a sed of s/static/STATIC/ should be done when it's not inline for all xfs functions.

That's just as bad as removing them all, if not worse. If you are writing new code or reworking existing code, then we'll consider the usage of STATIC/static in the context of that work. Otherwise, we leave it alone. If it ain't broke, don't fix it. And it sure as hell isn't broken right now. We've got more than enough bugs to fix without having to deal with drive-by bikeshed painting...

-Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH] xfs: Remove noinline from #define STATIC
On Mon, Nov 12, 2018 at 08:54:10PM -0500, Theodore Y. Ts'o wrote:
> On Tue, Nov 13, 2018 at 12:18:05PM +1100, Dave Chinner wrote:
> > I'm not interested in making code fast if distro support engineers can't debug problems on user systems easily. Optimising for performance over debuggability is a horrible trade off for us to make because it means users and distros end up much more reliant on single points of expertise for debugging problems. And that means the majority of the load of problem triage falls directly on very limited resources - the core XFS development team. A little bit of thought about how to make code easier to triage and debug goes a long, long way.
>
> So at least in my experience, if the kernels are compiled with CONFIG_DEBUG_INFO and/or CONFIG_DEBUG_INFO_REDUCED, scripts/decode_stacktrace.sh seems to do a very nice job with inlined

That doesn't help with kernel profiling and other such things that are based on callgraphs...

> functions. Now, ext4 generally only has about 3 or 4 nested inlines, and so I don't know how it works with 20 or 30 nested inlined functions, so perhaps this is not applicable for XFS.
>
> But perhaps toolchain technology has advanced since the Irix days such that it's no longer as necessary to force the non-inlining of functions for easing debugging?

Not that I've noticed. Indeed, modern toolchains are moving in the opposite direction - have you ever tried to debug a binary with gdb that was compiled with LTO enabled? Or maybe even just tried to profile it with perf?

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH] xfs: Remove noinline from #define STATIC
On Mon, Nov 12, 2018 at 02:30:01PM -0800, Joe Perches wrote:
> On Tue, 2018-11-13 at 08:45 +1100, Dave Chinner wrote:
> > On Mon, Nov 12, 2018 at 02:12:08PM -0600, Eric Sandeen wrote:
> > > On 11/10/18 7:21 PM, Joe Perches wrote:
> > > > Reduce total object size quite a bit (~32KB) and presumably improve performance at the same time.
> > > >
> > > > Total object size old vs new (x86-64 defconfig with xfs)
> > > >
> > > >     text    data  bss     dec    hex filename
> > > > - 959351  165573  632 1125556 112cb4 (TOTALS) (old)
> > > > + 924683  165669  632 1090984 10a5a8 (TOTALS) (new)
> > >
> > > And what does it do to maximum stack excursions?
> >
> > Better yet: what does it do to corruption stack traces and debugging tools like profiling traces?
> >
> > i.e. this noinline directive isn't about stack usage, this is about being able to debug production code. Basically the compiler inliner is so aggressive on static functions that it makes it impossible to decipher the stack traces. It flattens them way too much to be able to tell how we got to a specific location in the code.
> >
> > In reality, being able to find problems quickly and efficiently is far more important to us than being able to run everything at ludicrous speed.
>
> Is that really a compelling argument given the ~50:50 split of static/STATIC uses in xfs?

Historically, yes.

We're talking about code with call chains that can go 50-60 functions deep here. If that gets flattened to 10-20 functions by compiler inlining (which is the sort of thing that happens) then we lose a huge amount of visibility into the workings of the code. This affects profiling, stack traces on corruption, dynamic debug probes, kernel crash dump analysis, etc.

I'm not interested in making code fast if distro support engineers can't debug problems on user systems easily. Optimising for performance over debuggability is a horrible trade off for us to make because it means users and distros end up much more reliant on single points of expertise for debugging problems. And that means the majority of the load of problem triage falls directly on very limited resources - the core XFS development team. A little bit of thought about how to make code easier to triage and debug goes a long, long way.

Indeed, this is not a new problem - we've been using techniques like STATIC in one form or another to stop compiler inlining and/or function hiding since XFS was first ported to Linux 20 years ago. In fact, STATIC was inherited from Irix because it helped with debugging via the userspace simulator that the initial XFS code was developed on. i.e. STATIC was present in the initial XFS commit made way back in 1993, and we've been using it ever since...

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH] xfs: Remove noinline from #define STATIC
On Mon, Nov 12, 2018 at 02:12:08PM -0600, Eric Sandeen wrote: > On 11/10/18 7:21 PM, Joe Perches wrote: > > Reduce total object size quite a bit (~32KB) and presumably > > improve performance at the same time. > > > > Total object size old vs new (x86-64 defconfig with xfs) > > > > text data bss dec hex filename > > - 959351 165573 632 1125556 112cb4 (TOTALS) (old) > > + 924683 165669 632 1090984 10a5a8 (TOTALS) (new) > > And what does it do to maximum stack excursions? Better yet: what does it do to corruption stack traces and debugging tools like profiling traces? i.e. this noinline directive isn't about stack usage, this is about being able to debug production code. Basically the compiler inliner is so aggressive on static functions that it makes it impossible to decipher the stack traces. It flattens them way too much to be able to tell how we got to a specific location in the code. In reality, being able to find problems quickly and efficiently is far more important to us than being able to run everything at ludicrous speed. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH] Fix coding style issue in xfs_acl.c and xfs_aops.c
On Sun, Nov 11, 2018 at 08:36:03AM +0800, hmsjwzb wrote: > Possible unwrapped commit description (prefer a maximum 75 chars per line) NACK. Our preference is (and always has been) for comments to fill the entire 80 columns, just like the rest of the kernel. I have no idea who told you "75 columns is preferred" but they are wrong. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH] fs/proc: introduce /proc/stat2 file
On Wed, Nov 07, 2018 at 11:03:06AM +0100, Miklos Szeredi wrote: > On Wed, Nov 7, 2018 at 12:48 AM, Andrew Morton > wrote: > > On Mon, 29 Oct 2018 23:04:45 + Daniel Colascione > > wrote: > > > >> On Mon, Oct 29, 2018 at 7:25 PM, Davidlohr Bueso wrote: > >> > This patch introduces a new /proc/stat2 file that is identical to the > >> > regular 'stat' except that it zeroes all hard irq statistics. The new > >> > file is a drop-in replacement to stat for users that need performance. > >> > >> For a while now, I've been thinking over ways to improve the > >> performance of collecting various bits of kernel information. I don't > >> think that a proliferation of special-purpose named bag-of-fields file > >> variants is the right answer, because even if you add a few info-file > >> variants, you're still left with a situation where a given file > >> provides a particular caller with too little or too much information. > >> I'd much rather move to a model in which userspace *explicitly* tells > >> the kernel which fields it wants, with the kernel replying with just > >> those particular fields, maybe in their raw binary representations. > >> The ASCII-text bag-of-everything files would remain available for > >> ad-hoc and non-performance critical use, but programs that cared about > >> performance would have an efficient bypass. One concrete approach is > >> to let users open up today's proc files and, instead of read(2)ing a > >> text blob, use an ioctl to retrieve specified and targeted information > >> of the sort that would normally be encoded in the text blob. Because > >> callers would open the same file when using either the text or binary > >> interfaces, little would have to change, and it'd be easy to implement > >> fallbacks when a particular system doesn't support a particular > >> fast-path ioctl. > > Please. Sysfs, with the one value per file rule, was created exactly > for the purpose of eliminating these sorts of problems with procfs. So > instead of inventing special purpose interfaces for proc, just make > the info available in sysfs, if not already available. This doesn't solve the problem. The problem is that this specific implementation of per-cpu counters needs to be summed on every read. Hence when you have a huge number of CPUs, each read iterates over every CPU and takes a substantial amount of time. If only we had percpu counters with a fixed, extremely low read overhead that doesn't care about the number of CPUs in the machine... Oh, wait, we do: percpu_counters.[ch]. This all seems like a counter implementation deficiency to me, not an interface problem... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
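For reference, a minimal sketch of the percpu_counter API being pointed at here. Only the percpu_counter_* calls are the real <linux/percpu_counter.h> interface; the counter name and the functions wrapping it are hypothetical:

#include <linux/percpu_counter.h>

/* Hypothetical stat counter, used only to illustrate the API. */
static struct percpu_counter example_irq_count;

static int __init example_counter_init(void)
{
	return percpu_counter_init(&example_irq_count, 0, GFP_KERNEL);
}

/* Hot path: cheap per-cpu add, no shared cacheline bouncing. */
static void example_account_irq(void)
{
	percpu_counter_add(&example_irq_count, 1);
}

/* Read path: O(1) approximate read - no walk across every CPU. */
static s64 example_read_irqs(void)
{
	return percpu_counter_read(&example_irq_count);
}

The trade-off is that percpu_counter_read() is approximate (per-cpu deltas are only folded into the central count when they exceed a batch threshold); percpu_counter_sum() still exists for the rare reader that needs an exact value and is willing to pay for the per-cpu walk.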
Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
On Tue, Nov 06, 2018 at 12:00:06PM +0100, Jan Kara wrote: > On Tue 06-11-18 13:47:15, Dave Chinner wrote: > > On Mon, Nov 05, 2018 at 04:26:04PM -0800, John Hubbard wrote: > > > On 11/5/18 1:54 AM, Jan Kara wrote: > > > > Hmm, have you tried larger buffer sizes? Because synchronous 8k IO isn't > > > > going to max-out NVME iops by far. Can I suggest you install fio [1] (it > > > > has the advantage that it is pretty much standard for a test like this > > > > so > > > > everyone knows what the test does from a glimpse) and run with it > > > > something > > > > like the following workfile: > > > > > > > > [reader] > > > > direct=1 > > > > ioengine=libaio > > > > blocksize=4096 > > > > size=1g > > > > numjobs=1 > > > > rw=read > > > > iodepth=64 > > > > > > > > And see how the numbers with and without your patches compare? > > > > > > > > Honza > > > > > > > > [1] https://github.com/axboe/fio > > > > > > That program is *very* good to have. Whew. Anyway, it looks like read > > > bandwidth > > > is approximately 74 MiB/s with my patch (it varies a bit, run to run), > > > as compared to around 85 without the patch, so still showing about a 20% > > > performance degradation, assuming I'm reading this correctly. > > > > > > Raw data follows, using the fio options you listed above: > > > > > > Baseline (without my patch): > > > > > > > > lat (usec): min=179, max=14003, avg=2913.65, stdev=1241.75 > > > clat percentiles (usec): > > > | 1.00th=[ 2311], 5.00th=[ 2343], 10.00th=[ 2343], 20.00th=[ 2343], > > > | 30.00th=[ 2343], 40.00th=[ 2376], 50.00th=[ 2376], 60.00th=[ 2376], > > > | 70.00th=[ 2409], 80.00th=[ 2933], 90.00th=[ 4359], 95.00th=[ 5276], > > > | 99.00th=[ 8291], 99.50th=[ 9110], 99.90th=[10945], 99.95th=[11469], > > > | 99.99th=[12256] > > . > > > Modified (with my patch): > > > > > . > > > lat (usec): min=81, max=15766, avg=3496.57, stdev=1450.21 > > > clat percentiles (usec): > > > | 1.00th=[ 2835], 5.00th=[ 2835], 10.00th=[ 2835], 20.00th=[ 2868], > > > | 30.00th=[ 2868], 40.00th=[ 2868], 50.00th=[ 2868], 60.00th=[ 2900], > > > | 70.00th=[ 2933], 80.00th=[ 3425], 90.00th=[ 5080], 95.00th=[ 6259], > > > | 99.00th=[10159], 99.50th=[11076], 99.90th=[12649], 99.95th=[13435], > > > | 99.99th=[14484] > > > > So it's adding at least 500us of completion latency to every IO? > > I'd argue that the IO latency impact is far worse than the 20% > > throughput drop. > > Hum, right. So for each IO we have to remove the page from LRU on submit Which cost us less than 10us on average: slat (usec): min=13, max=3855, avg=44.17, stdev=61.18 vs slat (usec): min=18, max=4378, avg=52.59, stdev=63.66 > and then put it back on IO completion (which is going to race with new > submits so LRU lock contention might be an issue). Removal has to take the same LRU lock, so I don't think contention is the problem here. More likely the overhead is in selecting the LRU to put it on. e.g. list_lru_from_kmem() which may well be doing a memcg lookup. > Spending 500 us on that > is not unthinkable when the lock is contended but it is more expensive than > I'd have thought. John, could you perhaps profile where the time is spent? That'll tell us for sure :) Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
On Mon, Nov 05, 2018 at 04:26:04PM -0800, John Hubbard wrote: > On 11/5/18 1:54 AM, Jan Kara wrote: > > Hmm, have you tried larger buffer sizes? Because synchronous 8k IO isn't > > going to max-out NVME iops by far. Can I suggest you install fio [1] (it > > has the advantage that it is pretty much standard for a test like this so > > everyone knows what the test does from a glimpse) and run with it something > > like the following workfile: > > > > [reader] > > direct=1 > > ioengine=libaio > > blocksize=4096 > > size=1g > > numjobs=1 > > rw=read > > iodepth=64 > > > > And see how the numbers with and without your patches compare? > > > > Honza > > > > [1] https://github.com/axboe/fio > > That program is *very* good to have. Whew. Anyway, it looks like read > bandwidth > is approximately 74 MiB/s with my patch (it varies a bit, run to run), > as compared to around 85 without the patch, so still showing about a 20% > performance degradation, assuming I'm reading this correctly. > > Raw data follows, using the fio options you listed above: > > Baseline (without my patch): > > lat (usec): min=179, max=14003, avg=2913.65, stdev=1241.75 > clat percentiles (usec): > | 1.00th=[ 2311], 5.00th=[ 2343], 10.00th=[ 2343], 20.00th=[ 2343], > | 30.00th=[ 2343], 40.00th=[ 2376], 50.00th=[ 2376], 60.00th=[ 2376], > | 70.00th=[ 2409], 80.00th=[ 2933], 90.00th=[ 4359], 95.00th=[ 5276], > | 99.00th=[ 8291], 99.50th=[ 9110], 99.90th=[10945], 99.95th=[11469], > | 99.99th=[12256] . > Modified (with my patch): > . > lat (usec): min=81, max=15766, avg=3496.57, stdev=1450.21 > clat percentiles (usec): > | 1.00th=[ 2835], 5.00th=[ 2835], 10.00th=[ 2835], 20.00th=[ 2868], > | 30.00th=[ 2868], 40.00th=[ 2868], 50.00th=[ 2868], 60.00th=[ 2900], > | 70.00th=[ 2933], 80.00th=[ 3425], 90.00th=[ 5080], 95.00th=[ 6259], > | 99.00th=[10159], 99.50th=[11076], 99.90th=[12649], 99.95th=[13435], > | 99.99th=[14484] So it's adding at least 500us of completion latency to every IO? I'd argue that the IO latency impact is far worse than the 20% throughput drop. i.e. You can make up for throughput drops by running a deeper queue/more dispatch threads, but you can't reduce IO latency at all... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
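Little's law makes the same point with the numbers above: at a fixed queue depth, IOPS = depth / average latency, so the ~580us of added completion latency fully accounts for the reported bandwidth drop. A small standalone check, with the values taken from the fio output quoted above:

#include <stdio.h>

int main(void)
{
	const double depth = 64, bs = 4096;		   /* from the fio job file */
	const double lat_s[] = { 2913.65e-6, 3496.57e-6 }; /* avg completion latency */
	const char *tag[] = { "baseline", "patched" };

	for (int i = 0; i < 2; i++) {
		double iops = depth / lat_s[i];		   /* Little's law */
		printf("%-8s: %6.0f IOPS = %.1f MiB/s\n",
		       tag[i], iops, iops * bs / (1 << 20));
	}
	return 0;
}

This prints ~85.8 MiB/s for the baseline and ~71.5 MiB/s for the patched run - right on top of the ~85 and ~74 MiB/s John measured, which is why the latency numbers and the throughput numbers here are really one observation, not two.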
Re: [GIT PULL] vfs: fix many problems in vfs clone/dedupe implementation
On Sat, Nov 03, 2018 at 10:13:37AM -0700, Linus Torvalds wrote: > On Fri, Nov 2, 2018 at 4:36 PM Dave Chinner wrote: > > > > On Fri, Nov 02, 2018 at 09:35:23AM -0700, Linus Torvalds wrote: > > > > > > I don't love the timing of this at the end of the merge window, but > > > pulled, > > > > When would have been a better time? It's too big for a late -rc, and > > it contains stuff that needs fixing pretty urgently. So if the merge > > window is not appropriate, what should we have done here? > > No, I think that with the timing of the problem finding, it was > probably the only thing to do, but generally I like these kinds of > somewhat scary and disruptive patches (just because it touches > multiple filesystems) _early_ in the merge window if at all possible, > and showing that the development was done before and not some rushed > thing.. Fair enough. That was partly my fault - I forgot to add sob's to the commits when I pulled the final version in for testing in the middle of the first week of the merge window. I didn't realise that until after I'd soak and stress tested it and was getting ready to push it out to linux-next a week later. So I had to rewrite all the commits and so they would have looked an awful lot more recent than they really were. I should have mentioned this in the pull request... Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [GIT PULL] vfs: fix many problems in vfs clone/dedupe implementation
On Fri, Nov 02, 2018 at 09:35:23AM -0700, Linus Torvalds wrote: > On Thu, Nov 1, 2018 at 10:15 PM Dave Chinner wrote: > > > > Can you please pull update containing a rework of the VFS clone and > > dedupe file range infrastructure from the tag listed below? > > I don't love the timing of this at the end of the merge window, but pulled, When would have been a better time? It's too big for a late -rc, and it contains stuff that needs fixing pretty urgently. So if the merge window is not appropriate, what should we have done here? Cheers, Dave. -- Dave Chinner da...@fromorbit.com
[GIT PULL] vfs: fix many problems in vfs clone/dedupe implementation
 fs2/refcounttree.c       | 148 --
 fs/ocfs2/refcounttree.h  |  24 ++-
 fs/overlayfs/copy_up.c   |   6 +-
 fs/overlayfs/file.c      |  43 ++--
 fs/read_write.c          | 403 +-
 fs/xfs/xfs_file.c        |  82 +---
 fs/xfs/xfs_reflink.c     | 173
 fs/xfs/xfs_reflink.h     |  15 +-
 include/linux/fs.h       |  55 --
 mm/filemap.c             | 146 +++---
 20 files changed, 734 insertions(+), 596 deletions(-)

--
Dave Chinner
da...@fromorbit.com
Re: linux-next: manual merge of the vfs tree with the xfs tree
On Wed, Oct 31, 2018 at 11:52:47AM +1100, Stephen Rothwell wrote: > Hi all, > > [I don't understand why all this new work turned up in the xfs tree > during the merge window ...] > > Today's linux-next merge of the vfs tree got a conflict in: > > fs/read_write.c > > between commits: > > 42ec3d4c0218 ("vfs: make remap_file_range functions take and return bytes > completed") > eca3654e3cc7 ("vfs: enable remap callers that can handle short operations") > > from the xfs tree and commit: > > 5de4480ae7f8 ("vfs: allow dedupe of user owned read-only files") > > from the vfs tree. > > I fixed it up (see below) and can carry the fix as necessary. This > is now fixed as far as linux-next is concerned, but any non trivial > conflicts should be mentioned to your upstream maintainer when your tree > is submitted for merging. You may also want to consider cooperating > with the maintainer of the conflicting tree to minimise any particularly > complex conflicts. Looks ok. I didn't expect this conflict, but looks simple enough to resolve. Thanks! Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: linux-next: manual merge of the xfs tree with Linus' tree
On Wed, Oct 31, 2018 at 11:22:44AM +1100, Stephen Rothwell wrote: > Hi all, > > Today's linux-next merge of the xfs tree got a conflict in: > > Documentation/filesystems/porting > > between commit: > > 1a16dbaf798c ("Document d_splice_alias() calling conventions for ->lookup() > users.") > > from Linus' tree and commit: > > 2e5dfc99f2e6 ("vfs: combine the clone and dedupe into a single > remap_file_range") > > from the xfs tree. > > I fixed it up (see below) and can carry the fix as necessary. This > is now fixed as far as linux-next is concerned, but any non trivial > conflicts should be mentioned to your upstream maintainer when your tree > is submitted for merging. You may also want to consider cooperating > with the maintainer of the conflicting tree to minimise any particularly > complex conflicts. > > -- > Cheers, > Stephen Rothwell > > diff --cc Documentation/filesystems/porting > index 321d74b73937,e6d4466268dd.. > --- a/Documentation/filesystems/porting > +++ b/Documentation/filesystems/porting > @@@ -623,13 -623,7 +623,18 @@@ in your dentry operations instead > On success you get a new struct file sharing the mount/dentry with the > original, on failure - ERR_PTR(). > -- > +[recommended] > +->lookup() instances doing an equivalent of > +if (IS_ERR(inode)) > +return ERR_CAST(inode); > +return d_splice_alias(inode, dentry); > +don't need to bother with the check - d_splice_alias() will do the > +right thing when given ERR_PTR(...) as inode. Moreover, passing NULL > +inode to d_splice_alias() will also do the right thing (equivalent of > +d_add(dentry, NULL); return NULL;), so that kind of special cases > +also doesn't need a separate treatment. > ++-- > + [mandatory] > + ->clone_file_range() and ->dedupe_file_range have been replaced with > + ->remap_file_range(). See Documentation/filesystems/vfs.txt for more > + information. Looks good - I knew about this one from merging back into a recent Linus kernel. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: XFS: Hang and dmesg flood on mounting invalid FS image
On Mon, Oct 29, 2018 at 09:57:20AM +0300, Anatoly Trosinenko wrote: > > How did the corruption occur? > > It is a fuzzed image. Most probably, it was artificially "patched" by > fuzzer. Or do you mean "what particular bytes were changed"? I wondered how this specific corruption occurred in the real world. If it was a real world problem, it would have been indicative of a code bug (i.e. whatever wrote the log record would have been broken), but seeing as it's a fuzzer problem, I don't need to look for any code bugs other than "we didn't validate the input properly". Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: XFS: Hang and dmesg flood on mounting invalid FS image
On Sun, Oct 28, 2018 at 08:50:46PM +0300, Anatoly Trosinenko wrote: > Hello, > > When mounting a broken XFS image, the kernel hangs and floods dmesg > with stack traces. How did the corruption occur?

$ sudo xfs_logprint -d /dev/vdc
xfs_logprint: data device: 0xfd20 log device: 0xfd20 daddr: 131112 length: 6840
    0 HEADER Cycle 1 tail 1:00 len 512 ops 1
[0 - 0] Cycle 0x New Cycle 0x0001
    2 HEADER Cycle 1 tail 1:02 len 512 ops 5
    4 HEADER Cycle 1 tail -2147483647:02 len 512 ops 1
    6 HEADER Cycle 0 tail 1:00 len 0 ops 0
[0 - 6] Cycle 0x0001 New Cycle 0x
    7 HEADER Cycle 0 tail 1:00 len 0 ops 0

Ok, so from this the head of the log is block 4, and it has a corrupt tail pointer. It points to:

$ sudo xfs_logprint -D -s 4 /dev/vdc | head -10
xfs_logprint: data device: 0xfd20 log device: 0xfd20 daddr: 131112 length: 6840
BLKNO: 4
0 bebaedfe 100 2002 100 361 180 200
           ^^^ ^ ^
           wrong wrong wrong
8 2f27bae6 200 100 dabdbab00000 1000000000 1800000000 2000000000

They decode as:

    cycle: 1        version: 2
    lsn: 1,24835    tail_lsn: 2147483649,2

So the tail LSN points to an invalid log cycle and the previous block. IOWs, the block number in the tail indicates the whole log is valid and needs to be scanned, but the cycle is not valid. And that's the problem: neither the head or tail blocks are validated before they are used. CRC checking of the head and tail blocks comes later...

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
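The validation gap described here is mechanical to close. A sketch of the kind of check that was missing - the header fields and the CYCLE_LSN/BLOCK_LSN macros match fs/xfs/libxfs/xfs_log_format.h, but the helper itself is hypothetical and is not the eventual upstream fix:

/*
 * Sanity check the tail LSN in a log record header before using it
 * to bound recovery. In the corrupt image above the tail cycle
 * decodes to 2147483649 (0x80000001), which this would reject.
 */
static bool
example_tail_lsn_valid(struct xlog *log, struct xlog_rec_header *rhead)
{
	xfs_lsn_t	tail_lsn = be64_to_cpu(rhead->h_tail_lsn);
	uint		tail_cycle = CYCLE_LSN(tail_lsn);
	uint		tail_block = BLOCK_LSN(tail_lsn);
	uint		head_cycle = be32_to_cpu(rhead->h_cycle);

	/* The tail can only be in the current or the previous cycle... */
	if (tail_cycle != head_cycle && tail_cycle != head_cycle - 1)
		return false;

	/* ...and it must point inside the physical log. */
	return tail_block < (uint)log->l_logBBsize;
}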
[GIT PULL] xfs: 4.20 merge
Hi Linus,

Can you please pull the XFS 4.20 update from the tag listed below? There's not a huge amount of change in this cycle - Darrick has been out of action for a couple of months (hence me sending the last few pull requests), so we decided a quiet cycle mainly focussed on bug fixes was a good idea. Darrick will take the helm again at the end of this merge window.

The branch has applied cleanly to your current tree, so you shouldn't see any problems with it. FYI, I may be sending another update later in the cycle - there's a pending rework of the clone/dedupe_file_range code that fixes numerous bugs that is spread amongst the VFS, XFS and ocfs2 code. It has been reviewed and tested, Al and I just need to work out the details of the merge, so it may come from him rather than me.

Thanks!

-Dave.

The following changes since commit b39989009bdb84992915c9869f58094ed5becf10:

  xfs: fix data corruption w/ unaligned reflink ranges (2018-10-06 11:44:39 +1000)

are available in the git repository at:

  git://git.kernel.org/pub/scm/fs/xfs/xfs-linux tags/xfs-4.20-merge-1

for you to fetch changes up to 96987eea537d6ccd98704a71958f9ba02da80843:

  xfs: cancel COW blocks before swapext (2018-10-18 17:21:55 +1100)

xfs: Changes for 4.20

- only support filesystems with unwritten extents
- add definition for statfs XFS magic number
- remove unused parameters around reflink code
- more debug for dangling delalloc extents
- cancel COW extents on extent swap targets
- fix quota stats output and clean up the code
- refactor some of the attribute code in preparation for parent pointers
- fix several buffer handling bugs

Adam Borowski (1):
      xfs: add a define for statfs magic to uapi

Allison Henderson (4):
      xfs: Move fs/xfs/xfs_attr.h to fs/xfs/libxfs/xfs_attr.h
      xfs: Add helper function xfs_attr_try_sf_addname
      xfs: Add attibute set and helper functions
      xfs: Add attibute remove and helper functions

Brian Foster (1):
      xfs: clear ail delwri queued bufs on unmount of shutdown fs

Carlos Maiolino (2):
      xfs: Fix xqmstats offsets in /proc/fs/xfs/xqmstat
      xfs: use offsetof() in place of offset macros for __xfsstats

Christoph Hellwig (8):
      xfs: remove XFS_IO_INVALID
      xfs: remove suport for filesystems without unwritten extent flag
      xfs: handle zeroing in xfs_file_iomap_begin_delay
      xfs: remove the unused shared argument to xfs_reflink_reserve_cow
      xfs: remove the unused trimmed argument from xfs_reflink_trim_around_shared
      xfs: fix fork selection in xfs_find_trim_cow_extent
      xfs: print dangling delalloc extents
      xfs: cancel COW blocks before swapext

Darrick J. Wong (3):
      xfs: xrep_findroot_block should reject root blocks with siblings
      xfs: always assign buffer verifiers when one is provided
      xfs: fix buffer state management in xrep_findroot_block

Dave Chinner (2):
      xfs: issue log message on user force shutdown
      xfs: fix use-after-free race in xfs_buf_rele

 fs/xfs/libxfs/xfs_attr.c       | 236 -
 fs/xfs/{ => libxfs}/xfs_attr.h |   2 +
 fs/xfs/libxfs/xfs_bmap.c       |  70 ++--
 fs/xfs/libxfs/xfs_bmap.h       |   1 +
 fs/xfs/libxfs/xfs_format.h     |   8 +-
 fs/xfs/libxfs/xfs_sb.c         |   5 +-
 fs/xfs/scrub/repair.c          | 128 +-
 fs/xfs/scrub/scrub.c           |  13 ---
 fs/xfs/xfs_aops.c              |   4 +-
 fs/xfs/xfs_aops.h              |  14 ++-
 fs/xfs/xfs_bmap_util.c         |  61 ++-
 fs/xfs/xfs_buf.c               | 109 +++
 fs/xfs/xfs_buf.h               |   2 +
 fs/xfs/xfs_fsops.c             |  50 -
 fs/xfs/xfs_ioctl.c             |   8 --
 fs/xfs/xfs_iomap.c             |  53 ++---
 fs/xfs/xfs_reflink.c           |  33 +++---
 fs/xfs/xfs_reflink.h           |   4 +-
 fs/xfs/xfs_stats.c             |  52 -
 fs/xfs/xfs_stats.h             |  28 +
 fs/xfs/xfs_super.c             |  38 ++-
 fs/xfs/xfs_trans.h             |   1 +
 fs/xfs/xfs_trans_ail.c         |  28 +++--
 fs/xfs/xfs_trans_buf.c         |  42
 include/uapi/linux/magic.h     |   1 +
 25 files changed, 608 insertions(+), 383 deletions(-)
 rename fs/xfs/{ => libxfs}/xfs_attr.h (97%)

--
Dave Chinner
da...@fromorbit.com
Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
On Sat, Oct 13, 2018 at 12:34:12AM -0700, John Hubbard wrote: > On 10/12/18 8:55 PM, Dave Chinner wrote: > > On Thu, Oct 11, 2018 at 11:00:12PM -0700, john.hubb...@gmail.com wrote: > >> From: John Hubbard > [...] > >> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > >> index 5ed8f6292a53..017ab82e36ca 100644 > >> --- a/include/linux/mm_types.h > >> +++ b/include/linux/mm_types.h > >> @@ -78,12 +78,22 @@ struct page { > >> */ > >>union { > >>struct {/* Page cache and anonymous pages */ > >> - /** > >> - * @lru: Pageout list, eg. active_list protected by > >> - * zone_lru_lock. Sometimes used as a generic list > >> - * by the page owner. > >> - */ > >> - struct list_head lru; > >> + union { > >> + /** > >> + * @lru: Pageout list, eg. active_list protected > >> + * by zone_lru_lock. Sometimes used as a > >> + * generic list by the page owner. > >> + */ > >> + struct list_head lru; > >> + /* Used by get_user_pages*(). Pages may not be > >> + * on an LRU while these dma_pinned_* fields > >> + * are in use. > >> + */ > >> + struct { > >> + unsigned long dma_pinned_flags; > >> + atomic_t dma_pinned_count; > >> + }; > >> + }; > > > > Isn't this broken for mapped file-backed pages? i.e. they may be > > passed as the user buffer to read/write direct IO and so the pages > > passed to gup will be on the active/inactive LRUs. hence I can't see > > how you can have dual use of the LRU list head like this > > > > What am I missing here? > > Hi Dave, > > In patch 6/6, pin_page_for_dma(), which is called at the end of > get_user_pages(), > unceremoniously rips the pages out of the LRU, as a prerequisite to using > either of the page->dma_pinned_* fields. How is that safe? If you've ripped the page out of the LRU, it's no longer being tracked by the page cache aging and reclaim algorithms. Patch 6 doesn't appear to put these pages back in the LRU, either, so it looks to me like this just dumps them on the ground after the gup reference is dropped. How do we reclaim these page cache pages when there is memory pressure if they aren't in the LRU? > The idea is that LRU is not especially useful for this situation anyway, > so we'll just make it one or the other: either a page is dma-pinned, and > just hanging out doing RDMA most likely (and LRU is less meaningful during > that > time), or it's possibly on an LRU list. gup isn't just used for RDMA. It's used by direct IO in far, far more situations and machines than RDMA is. Please explain why ripping pages out of the LRU and not putting them back is safe, has no side effects, doesn't adversely impact page cache reclaim, etc. Indeed, I'd love to see a description of all the page references and where they come and go so we know the changes aren't just leaking these pages until the filesystem invalidates them at unmount. Maybe I'm not seeing why this is safe yet, but seeing as you haven't explained why it is safe then, at minimum, the patch descriptions are incomplete. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH 4/6] mm: introduce page->dma_pinned_flags, _count
On Thu, Oct 11, 2018 at 11:00:12PM -0700, john.hubb...@gmail.com wrote: > From: John Hubbard > > Add two struct page fields that, combined, are unioned with > struct page->lru. There is no change in the size of > struct page. These new fields are for type safety and clarity. > > Also add page flag accessors to test, set and clear the new > page->dma_pinned_flags field. > > The page->dma_pinned_count field will be used in upcoming > patches > > Signed-off-by: John Hubbard > --- > include/linux/mm_types.h | 22 +- > include/linux/page-flags.h | 47 ++ > 2 files changed, 63 insertions(+), 6 deletions(-) > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index 5ed8f6292a53..017ab82e36ca 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -78,12 +78,22 @@ struct page { >*/ > union { > struct {/* Page cache and anonymous pages */ > - /** > - * @lru: Pageout list, eg. active_list protected by > - * zone_lru_lock. Sometimes used as a generic list > - * by the page owner. > - */ > - struct list_head lru; > + union { > + /** > + * @lru: Pageout list, eg. active_list protected > + * by zone_lru_lock. Sometimes used as a > + * generic list by the page owner. > + */ > + struct list_head lru; > + /* Used by get_user_pages*(). Pages may not be > + * on an LRU while these dma_pinned_* fields > + * are in use. > + */ > + struct { > + unsigned long dma_pinned_flags; > + atomic_t dma_pinned_count; > + }; > + }; Isn't this broken for mapped file-backed pages? i.e. they may be passed as the user buffer to read/write direct IO and so the pages passed to gup will be on the active/inactive LRUs. hence I can't see how you can have dual use of the LRU list head like this What am I missing here? Cheers, Dave. -- Dave Chinner da...@fromorbit.com
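For context, the accessors the changelog describes would look roughly like the following - the bit and function names here are illustrative, not quoted from the patch. The key constraint is visible in the union above: dma_pinned_flags aliases page->lru, so these bits are only meaningful once the page has been taken off the LRU.

/* Hypothetical bit within page->dma_pinned_flags. */
#define PG_dma_pinned	0

static inline bool PageDmaPinned(struct page *page)
{
	return test_bit(PG_dma_pinned, &page->dma_pinned_flags);
}

static inline void SetPageDmaPinned(struct page *page)
{
	set_bit(PG_dma_pinned, &page->dma_pinned_flags);
}

static inline void ClearPageDmaPinned(struct page *page)
{
	clear_bit(PG_dma_pinned, &page->dma_pinned_flags);
}

Dave's question in this message follows directly from that aliasing: a page still on an LRU list has live prev/next pointers in the same storage, so gup cannot use these fields without first isolating the page.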
[GIT PULL] xfs: fixes for v4.19-rc7
Hi Greg,

Can you please pull the XFS update from the tag listed below? This contains the fix for the clone_file_range data corruption issue I mentioned in my -rc6 pull request (zero post-eof blocks), as well as fixes for several other equally serious problems we found while auditing the clone/dedupe ioctls for other issues. The rest of the problems we found (there were a *lot*) will be addressed in the 4.20 cycle.

Cheers,

Dave.

The following changes since commit e55ec4ddbef9897199c307dfb23167e3801fdaf5:

  xfs: fix error handling in xfs_bmap_extents_to_btree (2018-10-01 08:11:07 +1000)

are available in the git repository at:

  git://git.kernel.org/pub/scm/fs/xfs/xfs-linux tags/xfs-fixes-for-4.19-rc7

for you to fetch changes up to b39989009bdb84992915c9869f58094ed5becf10:

  xfs: fix data corruption w/ unaligned reflink ranges (2018-10-06 11:44:39 +1000)

xfs: fixes for 4.19-rc7

Update for 4.19-rc7 to fix numerous file clone and deduplication issues.

Darrick J. Wong (3):
      xfs: refactor clonerange preparation into a separate helper
      xfs: zero posteof blocks when cloning above eof
      xfs: update ctime and remove suid before cloning files

Dave Chinner (2):
      xfs: fix data corruption w/ unaligned dedupe ranges
      xfs: fix data corruption w/ unaligned reflink ranges

 fs/xfs/xfs_reflink.c | 200 ++-
 1 file changed, 165 insertions(+), 35 deletions(-)

--
Dave Chinner
da...@fromorbit.com
Re: [PATCH] writeback: fix range_cyclic writeback vs writepages deadlock
On Fri, Oct 05, 2018 at 12:46:40PM -0700, Andrew Morton wrote: > On Fri, 5 Oct 2018 15:45:26 +1000 Dave Chinner wrote: > > > From: Dave Chinner > > > > We've recently seen a workload on XFS filesystems with a repeatable > > deadlock between background writeback and a multi-process > > application doing concurrent writes and fsyncs to a small range of a > > file. > > > > ... > > > > Signed-off-by: Dave Chinner > > Reviewed-by: Jan Kara > > Not a serious enough problem for a -stable backport? Don't have enough evidence to say one way or another. The reported incident was from a RHEL 7 kernel, so the bug has been there for years in one form or another, but it's only ever been triggered by this one-off custom workload. I haven't done any analysis on older kernels, nor have I looked to see if there's any gotchas that a stable backport might encounter. And I tend not to change stuff in a path that is critical to data integrity without at least doing enough due diligence to suggest a stable backport would be fine. You can mark it for stable backports if you want, but I'm not prepared to because I haven't done the work necessary to ensure it's safe to do so. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
[PATCH] writeback: fix range_cyclic writeback vs writepages deadlock
From: Dave Chinner

We've recently seen a workload on XFS filesystems with a repeatable deadlock between background writeback and a multi-process application doing concurrent writes and fsyncs to a small range of a file.

range_cyclic
writeback               Process 1               Process 2

xfs_vm_writepages
 write_cache_pages
  writeback_index = 2
  cycled = 0
  find page 2 dirty
  lock Page 2
  ->writepage
    page 2 writeback
    page 2 clean
    page 2 added to bio
  no more pages
                        write()
                         locks page 1
                         dirties page 1
                         locks page 2
                         dirties page 1
                         fsync()
                        xfs_vm_writepages
                        write_cache_pages
                         start index 0
                         find page 1 towrite
                         lock Page 1
                         ->writepage
                           page 1 writeback
                           page 1 clean
                           page 1 added to bio
                         find page 2 towrite
                         lock Page 2
                         page 2 is writeback
                         <blocks>
                                                write()
                                                 locks page 1
                                                 dirties page 1
                                                 fsync()
                                                xfs_vm_writepages
                                                write_cache_pages
                                                 start index 0
  !done && !cycled
  sets index to 0,
  restarts lookup
  find page 1 dirty
                                                 find page 1 towrite
                                                 lock Page 1
                                                 page 1 is writeback
                                                 <blocks>
  lock Page 1
  <blocks>

DEADLOCK because:

 - process 1 needs page 2 writeback to complete to make enough progress to issue IO pending for page 1
 - writeback needs page 1 writeback to complete so process 2 can progress and unlock the page it is blocked on, then it can issue the IO pending for page 2
 - process 2 can't make progress until process 1 issues IO for page 1

The underlying cause of the problem here is that range_cyclic writeback is processing pages in descending index order as we hold higher index pages in a structure controlled from above write_cache_pages(). The write_cache_pages() caller needs to be able to submit these pages for IO before write_cache_pages() restarts writeback at mapping index 0 to avoid wcp inverting the page lock/writeback wait order.

generic_writepages() is not susceptible to this bug as it has no private context held across write_cache_pages() - filesystems using this infrastructure always submit pages in ->writepage immediately and so there is no problem with range_cyclic going back to mapping index 0. However:

	mpage_writepages() has a private bio context,
	exofs_writepages() has page_collect
	fuse_writepages() has fuse_fill_wb_data
	nfs_writepages() has nfs_pageio_descriptor
	xfs_vm_writepages() has xfs_writepage_ctx

All of these ->writepages implementations can hold pages under writeback in their private structures until write_cache_pages() returns, and hence they are all susceptible to this deadlock.

Also worth noting is that ext4 has its own bastardised version of write_cache_pages() and so it /may/ have an equivalent deadlock. I looked at the code long enough to understand that it has a similar retry loop for range_cyclic writeback reaching the end of the file and then promptly ran away before my eyes bled too much. I'll leave it to the ext4 developers to determine whether their code actually has this deadlock and how to fix it if it does.

There's a few ways I can see to avoid this deadlock. There are probably more, but these are the first I've thought of:

1. get rid of range_cyclic altogether

2. range_cyclic always stops at EOF, and we start again from writeback index 0 on the next call into write_cache_pages()

2a. wcp also returns EAGAIN to ->writepages implementations to indicate range_cyclic has hit EOF. writepages implementations can then flush the current context and call wcp again to continue. i.e. lift the retry into the ->writepages implementation

3. range_cyclic uses trylock_page() rather than lock_page(), and it skips pages it can't lock without blocking. It will already do this for pages under writeback, so this seems like a no-brainer

3a. all non-WB_SYNC_ALL writeback uses trylock_page() to avoid blocking as per pages under writeback. I d
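To make option 2 concrete, here is a minimal sketch of how the mid-call wraparound could be lifted out of write_cache_pages() - hypothetical code, not an actual patch; the field names follow mm/page-writeback.c but the lookup loop is elided:

	#include <linux/writeback.h>
	#include <linux/pagemap.h>

	/*
	 * Sketch of option 2: a range_cyclic sweep stops at EOF instead of
	 * wrapping back to index 0 inside the same call. The next call into
	 * write_cache_pages() resumes from index 0, after the ->writepages
	 * caller has had a chance to submit the pages held in its private
	 * context. Illustrative only.
	 */
	int write_cache_pages(struct address_space *mapping,
			      struct writeback_control *wbc,
			      writepage_t writepage, void *data)
	{
		pgoff_t index, end, done_index;
		bool hit_eof = false;
		int error = 0;

		if (wbc->range_cyclic) {
			index = mapping->writeback_index;	/* resume point */
			end = -1;				/* sweep to EOF only */
		} else {
			index = wbc->range_start >> PAGE_SHIFT;
			end = wbc->range_end >> PAGE_SHIFT;
		}
		done_index = index;

		/*
		 * ... the usual pagevec lookup and ->writepage loop over
		 * [index, end] goes here, updating done_index and setting
		 * hit_eof when the lookup finds no more dirty pages. The
		 * "!done && !cycled" restart from index 0 is gone, so wcp
		 * never goes back for lower-index pages while the caller
		 * still holds higher-index pages under writeback.
		 */

		if (wbc->range_cyclic)
			mapping->writeback_index = hit_eof ? 0 : done_index;
		return error;
	}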
[GIT PULL] xfs: fixes for 4.19-rc6
Hi Greg,

Can you please pull the XFS update from the tag listed below. It's a bit bigger than I'd like this late in the cycle, but we've had a few challenges getting ourselves sorted out this cycle. Details of the contents are in the pull-req output below. This has been run through xfstests over the past week, and merges against 4.19-rc6 cleanly.

FYI, there is likely to be one more XFS fix for this cycle - we've just tracked down the source of a clone_file_range() data corruption so I'll send that as a separate pullreq once Darrick's fix is reviewed and tested.

Cheers,

Dave.

The following changes since commit 5b394b2ddf0347bef56e50c69a58773c94343ff3:

  Linux 4.19-rc1 (2018-08-26 14:11:59 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/fs/xfs/xfs-linux tags/xfs-fixes-for-4.19-rc6

for you to fetch changes up to e55ec4ddbef9897199c307dfb23167e3801fdaf5:

  xfs: fix error handling in xfs_bmap_extents_to_btree (2018-10-01 08:11:07 +1000)

XFS fixes for 4.19-rc6

Accumulated regression and bug fixes for 4.19-rc6, including:

 o make iomap correctly mark dirty pages for sub-page block sizes
 o fix regression in handling extent-to-btree format conversion errors
 o fix torn log wrap detection for new logs
 o various corrupt inode detection fixes
 o various delalloc state fixes
 o clean up all the transaction cancel cases missed from changes merged in 4.19-rc1
 o fix lockdep false positive on transaction allocation
 o fix locking and reference counting on buffer log items

Brian Foster (6):
      xfs: remove last of unnecessary xfs_defer_cancel() callers
      xfs: don't unlock invalidated buf on aborted tx commit
      xfs: clean up xfs_trans_brelse()
      xfs: refactor xfs_buf_log_item reference count handling
      xfs: remove invalid log recovery first/last cycle check
      iomap: set page dirty after partial delalloc on mkwrite

Christoph Hellwig (2):
      xfs: don't bring in extents in xfs_bmap_punch_delalloc_range
      xfs: skip delalloc COW blocks in xfs_reflink_end_cow

Darrick J. Wong (1):
      xfs: don't crash the vfs on a garbage inline symlink

Dave Chinner (3):
      xfs: avoid lockdep false positives in xfs_trans_alloc
      xfs: fix transaction leak in xfs_reflink_allocate_cow()
      xfs: fix error handling in xfs_bmap_extents_to_btree

Eric Sandeen (2):
      xfs: don't treat unknown di_flags2 as corruption in scrub
      xfs: validate inode di_forkoff

YueHaibing (1):
      xfs: remove duplicated include from alloc.c

 fs/iomap.c                      |   2 +-
 fs/xfs/libxfs/xfs_attr.c        |  28 +++-
 fs/xfs/libxfs/xfs_attr_remote.c |  10 +--
 fs/xfs/libxfs/xfs_bmap.c        |  24 +++
 fs/xfs/libxfs/xfs_format.h      |   2 +
 fs/xfs/libxfs/xfs_inode_buf.c   |  30 +
 fs/xfs/scrub/alloc.c            |   1 -
 fs/xfs/scrub/inode.c            |   4 +-
 fs/xfs/xfs_bmap_util.c          |  20 ++
 fs/xfs/xfs_buf_item.c           | 119 ++
 fs/xfs/xfs_buf_item.h           |   1 +
 fs/xfs/xfs_inode.c              |  10 +--
 fs/xfs/xfs_iops.c               |  12 +++-
 fs/xfs/xfs_log_recover.c        |  10 ---
 fs/xfs/xfs_reflink.c            | 137
 fs/xfs/xfs_trace.h              |   1 -
 fs/xfs/xfs_trans.c              |  10 ++-
 fs/xfs/xfs_trans_buf.c          |  99 +++--
 18 files changed, 256 insertions(+), 264 deletions(-)

--
Dave Chinner
da...@fromorbit.com
Re: Leaking Path in XFS's ioctl interface(missing LSM check)
On Wed, Oct 03, 2018 at 05:20:31AM +1000, James Morris wrote:
> On Tue, 2 Oct 2018, Dave Chinner wrote:
> > On Tue, Oct 02, 2018 at 06:08:16AM +1000, James Morris wrote:
> > > On Mon, 1 Oct 2018, Darrick J. Wong wrote:
> > > > If we /did/ replace CAP_SYS_ADMIN checking with a pile of LSM hooks,
> > >
> > > Not sure we'd need a pile of hooks, what about just "read" and "write" storage admin?
> > >
> > > Or even two new capabilities along these lines, which we convert existing CAP_SYS_ADMIN etc. to?
> >
> > So instead of having hundreds of management ioctls under CAP_SYS_ADMIN, we'd now have hundreds of non-storage ioctls under CAP_SYS_ADMIN and hundreds of storage ioctls under CAP_SYS_STORAGE_ADMIN?
> >
> > Maybe I'm missing something, but I don't see how that improves the situation w.r.t. locked down LSM configurations?
>
> I'm not sure about capabilities, but having two specific LSM hooks for storage admin would allow SELinux et al to explicitly control privileged access to these interfaces. Storage admin seems to be a special case of its own which we want to be able to mediate as such.

Perhaps so - as a stepping stone it would allow isolation in specific cases where no management should be allowed, but there are cases with modern filesystems where users need access to storage APIs.

e.g. It's entirely plausible that /home is set up as a subvolume per user, and that subvols in a filesystem can be independently snapshotted. Hence it would be completely acceptable to allow users to have access to snapshot management APIs to be able to snapshot their home directories for backup/rollback purposes.

Hence I'm not sure that black/white storage admin LSM hooks are a solution that will end up being particularly useful to the wider population...

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH 0/4] get_user_pages*() and RDMA: first steps
On Mon, Oct 01, 2018 at 05:47:57AM -0700, Christoph Hellwig wrote:
> On Mon, Oct 01, 2018 at 04:11:27PM +1000, Dave Chinner wrote:
> > This reminds me so much of Linux mmap() in the mid-2000s - mmap() worked for ext3 without being aware of page faults,
>
> And "worked" still is a bit of a stretch, as soon as you'd get ENOSPC it would still blow up badly. Which probably makes it an even better analogy to the current case.
>
> > RDMA does not call ->page_mkwrite on clean file backed pages before it writes to them and calls set_page_dirty(), and hence RDMA to file backed pages is completely unreliable. I'm not sure this can be solved without having page fault capable RDMA hardware
>
> We can always software prefault at gup time.

I'm not sure that's sufficient - we've got a series of panics from machines running ext4+RDMA where there are no bufferheads on dirty pages at writeback time. This was also reproducible on versions of XFS that used bufferheads.

We suspect that memory reclaim has tripped the bufferhead stripping threshold (yeah, that old ext3 hack to avoid writeback deadlocks under memory pressure), hence removed the bufferheads from clean mapped pages while RDMA has them pinned. And then some time later, after set_page_dirty() was called on them, the filesystem's page writeback code crashes and burns. i.e. just because the page was in a known state when it was pinned, it doesn't mean it will remain in that state until it is unpinned.

> And also remember that while RDMA might be the case at least some people care about here it really isn't different from any of the other gup + I/O cases, including doing direct I/O to a mmap area. The only difference in the various cases is how long the area should be pinned down - some users like RDMA want a long term mapping, while others like direct I/O just need a short transient one.

Yup, now that I'm aware of all those little intricacies with gup I always try to consider what impact they have...

> > We could address these use-after-free situations via forcing RDMA to use file layout leases and revoke the lease when we need to modify the backing store on leased files. However, this doesn't solve the need for filesystems to receive write fault notifications via ->page_mkwrite.
>
> Exactly. We need three things here:
>
>  - notification to the filesystem that a page is (possibly) being written to
>  - a way to block fs operations while the pages are pinned
>  - a way to distinguish between short and long term mappings, and only allow long term mappings if they can be broken using something like leases
>
> I'm also pretty sure we already explained this a long time ago when the issue came up last year, so I'm not sure why this is even still contentious.

I suspect that it's simply because these discussions have been spread across different groups and not everyone is aware of what the other groups are discussing...

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
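In code, the failure mode being described reduces to a pattern roughly like the following - a simplified sketch, not any particular driver's code, using the 4.19-era get_user_pages() API:

	#include <linux/mm.h>
	#include <linux/slab.h>
	#include <linux/sched.h>

	/* Sketch of the long-term pin pattern described above. */
	static int pin_and_dma(unsigned long uaddr, unsigned long nr)
	{
		struct page **pages;
		long i, n;

		pages = kcalloc(nr, sizeof(*pages), GFP_KERNEL);
		if (!pages)
			return -ENOMEM;

		down_read(&current->mm->mmap_sem);
		n = get_user_pages(uaddr, nr, FOLL_WRITE, pages, NULL);
		up_read(&current->mm->mmap_sem);

		/*
		 * Hardware DMAs into the pages for an arbitrarily long time;
		 * meanwhile memory reclaim may strip the bufferheads from the
		 * (still clean) file-backed pages.
		 */

		for (i = 0; i < n; i++) {
			/*
			 * The filesystem never saw ->page_mkwrite() for this
			 * write, so its private clean->dirty transition never
			 * happened - this is the set_page_dirty() that later
			 * blows up in writeback.
			 */
			set_page_dirty(pages[i]);
			put_page(pages[i]);
		}
		kfree(pages);
		return n < 0 ? n : 0;
	}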
Re: Leaking Path in XFS's ioctl interface(missing LSM check)
On Mon, Oct 01, 2018 at 11:25:29AM -0400, Theodore Y. Ts'o wrote:
> On Mon, Oct 01, 2018 at 04:04:42PM +0100, Alan Cox wrote:
> > > Systems restricted by LSMs to the point where CAP_SYS_ADMIN is not trusted have exactly the same issues. i.e. there's nobody trusted by the kernel to administer the storage stack, and nobody has defined a workable security model that can prevent untrusted users from violating the existing storage trust model
> >
> > With a proper set of LSM checks you can lock the filesystem management and enforcement to a particular set of objects. You can build that model where for example only an administrative login from a trusted console may launch processes to do that management.
> >
> > Or you could - if things were not going around the LSM hooks.
>
> It would be useful if anyone actually *wants* to do this thing to define a formal security model, and detail *everything* that would need to be changed in order to accomplish it. Just as we don't speculatively add code "just in case" someone might want to use it someday, I don't think we should be adding random LSM hooks just because someone *might* want to do something.

Yeah, that's what I was implying we needed to do - taking the current model and slapping LSM hooks around randomly will only make things break and cause admins to curse us.

> Let's see the use case, and let's see how horrible the changes would need to be, and how credible we think it is that someone will actually want to *use* it. I suspect the changes will be a really huge number of places, and not just in XFS.

So do I - the "in root we trust" model is pretty deeply ingrained up and down the storage stack. I also suspect that most of our hardware admin (not just storage) has similar assumptions about the security model they operate in.

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
Re: Leaking Path in XFS's ioctl interface(missing LSM check)
On Tue, Oct 02, 2018 at 06:08:16AM +1000, James Morris wrote:
> On Mon, 1 Oct 2018, Darrick J. Wong wrote:
> > If we /did/ replace CAP_SYS_ADMIN checking with a pile of LSM hooks,
>
> Not sure we'd need a pile of hooks, what about just "read" and "write" storage admin?
>
> Or even two new capabilities along these lines, which we convert existing CAP_SYS_ADMIN etc. to?

So instead of having hundreds of management ioctls under CAP_SYS_ADMIN, we'd now have hundreds of non-storage ioctls under CAP_SYS_ADMIN and hundreds of storage ioctls under CAP_SYS_STORAGE_ADMIN?

Maybe I'm missing something, but I don't see how that improves the situation w.r.t. locked down LSM configurations?

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH 0/3] namei: implement various scoping AT_* flags
On Mon, Oct 01, 2018 at 03:47:23PM +1000, Aleksa Sarai wrote:
> On 2018-10-01, Dave Chinner wrote:
> > > I've added some selftests for this, but it's not clear to me whether they should live here or in xfstests (as far as I can tell there are no other VFS tests in selftests, while there are some tests that look like generic VFS tests in xfstests). If you'd prefer them to be included in xfstests, let me know.
> >
> > xfstests, please. That way the new functionality will get immediate coverage by all the main filesystem development and distro QA teams
>
> Sure, will do. Do you want me to submit them in parallel --

That's usually the way we do things, but we don't tend to commit the fstests changes until the thing it is testing has landed upstream.

> and what is the correct ML to submit changes to xfstests?

fste...@vger.kernel.org

> Sorry for the silly questions. :P

You're going to have many more of them when you start moving the tests across to fstests :P

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH 0/4] get_user_pages*() and RDMA: first steps
On Sat, Sep 29, 2018 at 04:46:09AM -0400, Jerome Glisse wrote:
> On Fri, Sep 28, 2018 at 07:28:16PM -0700, John Hubbard wrote:
> > On 9/28/18 2:49 PM, Jerome Glisse wrote:
> > > On Fri, Sep 28, 2018 at 12:06:12PM -0700, John Hubbard wrote:
> > > > use a non-CPU device to read and write to "pinned" memory, especially when that memory is backed by a file system.

"backed by a filesystem" is the biggest problem here.

> > > > I recall there were objections to just narrowly fixing the set_page_dirty() bug, because the underlying problem is large and serious. So here we are.
> > >
> > > Except that you can not solve that issue without proper hardware. GPU are fine. RDMA are broken except the mellanox5 hardware which can invalidate at any time its page table thus allowing to write protect the page at any time.
> >
> > Today, people are out there using RDMA without page-fault-capable hardware. And they are hitting problems, as we've seen. From the discussions so far, I don't think it's impossible to solve the problems, even for "lesser", non-fault-capable hardware.

This reminds me so much of Linux mmap() in the mid-2000s - mmap() worked for ext3 without being aware of page faults, so most mm/ developers at the time were of the opinion that all the other filesystems should work just fine without being aware of page faults. But some loud-mouthed idiot at SGI kept complaining that mmap() could never be fixed for XFS without write fault notification, because delayed allocation, unwritten extents and ENOSPC had to be handled before mapped writes could be posted.

Eventually Christoph Lameter got ->page_mkwrite into the page fault path and the loud mouthed idiot finally got mmap() to work correctly on XFS:

commit 4f57dbc6b5bae5a3978d429f45ac597ca7a3b8c6
Author: David Chinner
Date:   Thu Jul 19 16:28:17 2007 +1000

    [XFS] Implement ->page_mkwrite in XFS.

    Hook XFS up to ->page_mkwrite to ensure that we know about mmap pages
    being written to. This allows us to do correct delayed allocation and
    ENOSPC checking as well as remap unwritten extents so that they get
    converted correctly during writeback. This is done via the generic
    block_page_mkwrite code.

    SGI-PV: 940392
    SGI-Modid: xfs-linux-melb:xfs-kern:29149a

    Signed-off-by: David Chinner
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Tim Shimmin

Nowadays, ->page_mkwrite is fundamental filesystem functionality - copy-on-write filesystems like btrfs simply don't work if they can't trigger COW on mapped write accesses. These days all the main linux filesystems depend on write fault notifications in some way or another for correct operation.

The way RDMA uses GUP to take references to file backed pages to 'stop them going away' is reminiscent of mmap() back before ->page_mkwrite(). i.e. it assumes that an initial read of the page will populate the page state correctly for all future operations, including set_page_dirty() after write accesses. This is not a valid assumption - filesystems can have different private clean vs dirty page state, and may need to perform operations to take a page from clean to dirty. Hence calling set_page_dirty() on a file backed mapped page without first having called ->page_mkwrite() is a bug.

RDMA does not call ->page_mkwrite on clean file backed pages before it writes to them and calls set_page_dirty(), and hence RDMA to file backed pages is completely unreliable. I'm not sure this can be solved without having page fault capable RDMA hardware.

> > > With the solution put forward here you can potentially wait _forever_ for the driver that holds a pin to drop it. This was the point i was trying to get across during LSF/MM.

Right, but pinning via GUP is not an option for file backed pages because the filesystem is completely unaware of these references. i.e. waiting forever isn't an issue here because the filesystem never waits on them. Instead, they are a filesystem corruption vector because the filesystem can invalidate those mappings and free the backing store while they are still in use by RDMA.

Hence for DAX filesystems, this leaves the RDMA app with direct access to the physical storage even though the filesystem has freed the space it is accessing. This is a use after free of the physical storage that the filesystem cannot control, and why DAX+RDMA is disabled right now.

We could address these use-after-free situations via forcing RDMA to use file layout leases and revoke the lease when we need to modify the backing store on leased files. However, this doesn't solve the need for filesystems to receive write fault notifications via ->page_mkwrite.

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
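For reference, the modern shape of that hook - a minimal, hypothetical ->page_mkwrite implementation. The "myfs" names are made up; the signatures and helpers (vm_fault, sb_start_pagefault(), wait_for_stable_page(), VM_FAULT_LOCKED) are the current kernel API:

	#include <linux/mm.h>
	#include <linux/fs.h>
	#include <linux/pagemap.h>

	static vm_fault_t myfs_page_mkwrite(struct vm_fault *vmf)
	{
		struct page *page = vmf->page;
		struct inode *inode = file_inode(vmf->vma->vm_file);

		sb_start_pagefault(inode->i_sb);
		lock_page(page);

		/* page was truncated/invalidated under us - retry the fault */
		if (page->mapping != inode->i_mapping) {
			unlock_page(page);
			sb_end_pagefault(inode->i_sb);
			return VM_FAULT_NOPAGE;
		}

		/*
		 * This is the filesystem's one chance to do the clean->dirty
		 * work before the mapped write proceeds: reserve space,
		 * convert delalloc/unwritten extents, break COW sharing.
		 * On failure the fault is refused (VM_FAULT_SIGBUS) instead
		 * of silently corrupting filesystem state.
		 */
		wait_for_stable_page(page);
		sb_end_pagefault(inode->i_sb);
		return VM_FAULT_LOCKED;	/* page returned locked and dirtyable */
	}

	static const struct vm_operations_struct myfs_vm_ops = {
		.fault		= filemap_fault,
		.map_pages	= filemap_map_pages,
		.page_mkwrite	= myfs_page_mkwrite,
	};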
Re: [PATCH 0/3] namei: implement various scoping AT_* flags
On Sat, Sep 29, 2018 at 08:34:50PM +1000, Aleksa Sarai wrote:
> I've added some selftests for this, but it's not clear to me whether they should live here or in xfstests (as far as I can tell there are no other VFS tests in selftests, while there are some tests that look like generic VFS tests in xfstests). If you'd prefer them to be included in xfstests, let me know.

xfstests, please. That way the new functionality will get immediate coverage by all the main filesystem development and distro QA teams.

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
Re: Leaking Path in XFS's ioctl interface(missing LSM check)
On Sun, Sep 30, 2018 at 03:16:52PM +0100, Alan Cox wrote:
> > > CAP_SYS_ADMIN is also a bit weird because low level access usually implies you can bypass access controls so you should also check CAP_SYS_DAC ?
> >
> > Do you mean CAP_DAC_READ_SEARCH as per the newer handle syscalls? But that only allows bypassing directory search operations, so maybe you mean CAP_DAC_OVERRIDE?
>
> It depends what the ioctl allows you to do. If it allows me to bypass DAC and manipulate the file system to move objects around then it's a serious issue.

These interfaces have always been allowed to do that. You can't do transparent online background defragmentation without bypassing DAC and moving objects around. You can't scrub metadata and data without bypassing DAC. You can't do dedupe without bypassing /some level/ of DAC to get access to the filesystem used space map and the raw block device to hash the data. But the really important access control for dedupe - avoiding deduping data across files at different security levels - isn't controlled at all.

> The underlying problem is if CAP_SYS_ADMIN is able to move objects around then I can move modules around.

Yup, anything with direct access to block devices can do that. Many filesystem and storage utilities are given direct access to the block device, because that's what they need to work. e.g. in DM land, the control ioctls (ctl_ioctl()) are protected by:

	/* only root can play with this */
	if (!capable(CAP_SYS_ADMIN))
		return -EACCES;

Think about it - if DM control ioctls only require CAP_SYS_ADMIN, then if you have that cap you can use DM to remap any block in a block device to any other block. You don't need the filesystem to move stuff around, it can be moved around without the filesystem knowing anything about it.

> We already have a problem with CAP_DAC_OVERRIDE giving you CAP_SYS_RAWIO (ie totally owning the machine) unless the modules are signed, if xfs allows ADMIN as well then CAP_SYS_ADMIN is much easier to obtain and you'd get total system ownership from it.

Always been the case, and it's not isolated to XFS.

	$ git grep CAP_SYS_ADMIN fs/ | wc -l
	139
	$ git grep CAP_SYS_ADMIN block/ | wc -l
	16
	$ git grep CAP_SYS_ADMIN drivers/block/ drivers/scsi | wc -l
	88

The "CAP_SYS_ADMIN for ioctls" trust model in the storage stack extends both above and below the filesystem. If you don't trust CAP_SYS_ADMIN, then you are basically saying that you cannot trust your storage management and maintenance utilities at any level.

> Not good.
>
> > Regardless, this horse bolted long before those syscalls were introduced. The time to address this issue was when XFS was merged into linux all those years ago, back when the apps that run in highly secure restricted environments that use these interfaces were being ported to linux. We can't change this now without breaking userspace
>
> That's what people said about setuid shell scripts.

Completely different. setuid shell scripts got abused as a hack for the lazy to avoid setting up permissions properly and hence were easily exploited.

The storage stack is completely dependent on a simplistic layered trust model in which root (CAP_SYS_ADMIN) is god. The storage trust model falls completely apart if we don't have a trusted root user to administer all layers of the storage stack. This isn't the first time I've raised this issue - I raised it back when the user namespace stuff was ram-roaded into the kernel, and was essentially ignored by the userns people.

As a result, we end up with all the storage management ioctls restricted to the initns where we have trusted CAP_SYS_ADMIN users. I've also raised it more recently in the unprivileged mount discussions (so untrusted root in containers can mount filesystems) - no solution to the underlying trust model deficiencies was found in those discussions, either. Instead, filesystems that can be mounted by untrusted users (i.e. FUSE) have a special flag in their fstype definition to say this is allowed.

Systems restricted by LSMs to the point where CAP_SYS_ADMIN is not trusted have exactly the same issues. i.e. there's nobody trusted by the kernel to administer the storage stack, and nobody has defined a workable security model that can prevent untrusted users from violating the existing storage trust model.

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
Re: Leaking Path in XFS's ioctl interface(missing LSM check)
On Fri, Sep 28, 2018 at 07:23:42AM +1000, James Morris wrote:
> On Thu, 27 Sep 2018, Dave Chinner wrote:
> > Sure, but there are so many CAP_SYS_ADMIN-only ioctls in the kernel that have no LSM coverage that this is not an isolated problem that people setting up such systems have to deal with.
>
> I could be missing something here, but all ioctls are mediated by LSM at a high level (security_file_ioctl). Some problematic ones are singled out at that point by LSMs for special handling.

selinux_file_ioctl() checks whether a specific /file/ has permissions to execute ioctls, but it doesn't know anything about what that ioctl actually does. It has special cases for a handful of ioctls, but misses a large number of ioctls that perform similar functionality (e.g. file getattr/setattr type ioctls). smack just checks for read/write permissions depending on the IOW/IOR/IOWR definition of the ioctl. Tomoyo does a path check based on the path associated with the file passed to it to see if an ioctl can be run.

IOWs, none of them really know anything about what the ioctl does, and they sure as hell don't check any of the arguments for other file descriptors that the ioctl might be accessing. It's basically a "do we allow ioctls to this inode/path from this user?" check and nothing more.

That just doesn't work for ioctls structured like the XFS filehandle operations. The ioctl is performed on a "fshandle" file descriptor, which generally points at the root directory of the filesystem the file handle belongs to. This gives the ioctl the superblock context that it is to operate on, and nothing more. It then opens a new file based on the filehandle that was in the ioctl argument, and performs the required operation on the file it opened itself.

IOWs, the security_file_ioctl() hook is almost completely useless in cases like this - you can't isolate the ioctl based on the file argument, because it can point to any file or directory in the filesystem. And without actually parsing, decoding and instantiating the ioctl arguments, you can't tell the ioctl it can't act on specific targets.

And because filehandle to dentry resolution results in disconnected dentries, the paths are not complete and hence path based security checks (e.g. tomoyo) are likely to be broken and unfixable...

Cheers, Dave.
--
Dave Chinner
da...@fromorbit.com
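The same access pattern is visible from userspace via the generic handle syscalls - a minimal sketch (error handling trimmed) showing that the fd passed in only identifies the filesystem, not the object actually being opened, so a permission check against that fd says nothing about the real target:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>

	int main(int argc, char *argv[])
	{
		struct file_handle *fh;
		int mount_id, mount_fd, fd;

		if (argc != 3) {
			fprintf(stderr, "usage: %s <file> <mountpoint>\n", argv[0]);
			return 1;
		}

		fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
		fh->handle_bytes = MAX_HANDLE_SZ;

		/* encode a handle for an arbitrary file on the filesystem */
		if (name_to_handle_at(AT_FDCWD, argv[1], fh, &mount_id, 0) < 0) {
			perror("name_to_handle_at");
			return 1;
		}

		/*
		 * Reopen it through *any* fd on that filesystem - no path
		 * walk is performed, so path-based checks never see the real
		 * target. Requires CAP_DAC_READ_SEARCH.
		 */
		mount_fd = open(argv[2], O_RDONLY);
		fd = open_by_handle_at(mount_fd, fh, O_RDONLY);
		if (fd < 0) {
			perror("open_by_handle_at");
			return 1;
		}
		printf("opened fd %d via handle\n", fd);
		return 0;
	}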
Re: Leaking Path in XFS's ioctl interface(missing LSM check)
On Wed, Sep 26, 2018 at 09:23:03AM -0400, Stephen Smalley wrote: > On 09/25/2018 09:33 PM, Dave Chinner wrote: > >On Tue, Sep 25, 2018 at 08:51:50PM -0400, TongZhang wrote: > >>Hi, > >> > >>I'm bringing up this issue again to let the LSM developers know the > >>situation, and would like to know your thoughts. > >>Several weeks ago I sent an email to the security list to discuss the issue > >>where > >>XFS's ioctl interface can do things like vfs_readlink without asking LSM's > >>permission, which we think is kind of weird and this kind of operation > >>should be > >>audited by LSM. > > > >These aren't user interfaces. They are filesystem maintenance and > >extension interfaces. They are intended for low level filesystem > >utilities that require complete, unrestricted access to the > >underlying filesystem via holding CAP_SYS_ADMIN in the initns. > > > >i.e. they are used to perform filesystem maintenance and extension > >operations that need to be completely invisible to users from > >userspace. e.g. online file defragmentation (xfs_fsr), data > >migration (e.g. HSM products), efficient backup of data (xfsdump), > >metadata and data scrubbing, online repair, etc. > > > >IOWs, I really don't think these interfaces are something the LSMs > >should be trying to intercept or audit, because they are essentially > >internal filesystem interfaces used by trusted code and not general > >user application facing APIs. > > If they are interfaces exposed to userspace, then they should be > mediated via LSM. We only omit the LSM hook when the usage is > purely kernel-internal. /me points to the mass of diverse management ioctls across the kernel that aren't mediated by LSM hooks. > Security modules are often used to limit > even "trusted" applications to least privilege and protect them from > untrustworthy inputs, moving from binary trust notions to only > trusting them for what they must be trusted to do. CAP_SYS_ADMIN > doesn't necessarily indicate that they are trusted to override any > given MAC policy restrictions. Applications that are tightly integrated into the filesystem to extend its functionality effectively operate in the same trust space as the kernel filesystem implementation itself. i.e. they interact with the filesystem at levels below the DAC/MAC checks that are performed on user filesystem accesses, and perform manipulation of the on-disk filesystem structure that is invisible to users accessing the filesystem. As such, these trusted applications have "massive data loss" as a potential failure mode if an inappropriately configured LSM is loaded into the kernel. Breaking a HSM application's access to the filesystem unexpectedly because someone didn't set up a new security policy correctly brings a whole new level of risk to administering sites that mix non-trivial storage solutions with LSM-based security. > Wondering why we don't perform the security_inode_readlink() call > inside of vfs_readlink() currently. The general pattern is that we > do perform security_inode_*() calls inside the other vfs_*() > helpers, so vfs_readlink() is an exception currently. Pretty sure that was done to mitigate the risk of breaking existing userspace applications using the handle interfaces to read links. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
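To make the final point concrete, this is roughly what Smalley is describing - an illustrative sketch (not current kernel source) of where the hook would sit if vfs_readlink() followed the same pattern as the other vfs_*() helpers:

    int vfs_readlink(struct dentry *dentry, char __user *buffer, int buflen)
    {
            int error;

            /*
             * Hypothetical placement, mirroring how other vfs_*()
             * helpers call their security_inode_*() hooks internally.
             * Today callers such as readlinkat() make this check
             * themselves, which is why the handle-based ioctl path
             * can reach vfs_readlink() without it.
             */
            error = security_inode_readlink(dentry);
            if (error)
                    return error;

            /* ... existing readlink/get_link logic unchanged ... */
    }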
Re: Leaking Path in XFS's ioctl interface(missing LSM check)
On Wed, Sep 26, 2018 at 07:24:26PM +0100, Alan Cox wrote: > On Wed, 26 Sep 2018 11:33:29 +1000 > Dave Chinner wrote: > > > On Tue, Sep 25, 2018 at 08:51:50PM -0400, TongZhang wrote: > > > Hi, > > > > > > I'm bringing up this issue again to let the LSM developers know the > > > situation, and would like to know your thoughts. > > > Several weeks ago I sent an email to the security list to discuss the > > > issue where > > > XFS's ioctl interface can do things like vfs_readlink without asking LSM's > > > permission, which we think is kind of weird and this kind of operation > > > should be > > > audited by LSM. > > > > These aren't user interfaces. They are filesystem maintenance and > > extension interfaces. They are intended for low level filesystem > > utilities that require complete, unrestricted access to the > > underlying filesystem via holding CAP_SYS_ADMIN in the initns. > > CAP_SYS_ADMIN is meaningless in an environment using a strong LSM set up. Sure, but there are so many CAP_SYS_ADMIN-only ioctls in the kernel that have no LSM coverage that this is not an isolated problem that people setting up such systems have to deal with. The LSM hooks are already complex enough without adding hundreds more hooks to control individual ioctl behaviour for every filesystem, device, etc. Unless you are going to add LSM hooks to all ioctls, I don't see much point in singling out one set of ioctls in a way that will break existing configurations. It's just a knee jerk reaction (like airport security) because it doesn't meaningfully address the problem of all the other management ioctls in the kernel being completely unconstrained by LSMs. > CAP_SYS_ADMIN is also a bit weird because low level access usually > implies you can bypass access controls so you should also check > CAP_SYS_DAC ? Do you mean CAP_DAC_READ_SEARCH as per the newer handle syscalls? But that only allows bypassing directory search operations, so maybe you mean CAP_DAC_OVERRIDE? Regardless, this horse bolted long before those syscalls were introduced. The time to address this issue was when XFS was merged into linux all those years ago, back when the apps that run in highly secure restricted environments that use these interfaces were being ported to linux. We can't change this now without breaking userspace. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
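For context, the gate on these maintenance ioctls is typically nothing more than a capability check inside the ioctl switch - a sketch of the common kernel pattern, where the case label and helper are illustrative rather than a quote of xfs_ioctl.c:

    case XFS_IOC_SOME_MAINT_OP:             /* hypothetical ioctl number */
            if (!capable(CAP_SYS_ADMIN))
                    return -EPERM;
            /*
             * No LSM hook beyond the generic security_file_ioctl()
             * runs here; the operation proceeds on whatever target
             * the handle in the argument resolves to.
             */
            return xfs_do_maint_op(filp, arg);  /* hypothetical helper */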
Re: Leaking Path in XFS's ioctl interface(missing LSM check)
On Tue, Sep 25, 2018 at 08:51:50PM -0400, TongZhang wrote: > Hi, > > I'm bringing up this issue again to let the LSM developers know the situation, > and would like to know your thoughts. > Several weeks ago I sent an email to the security list to discuss the issue > where > XFS's ioctl interface can do things like vfs_readlink without asking LSM's > permission, which we think is kind of weird and this kind of operation should > be > audited by LSM. These aren't user interfaces. They are filesystem maintenance and extension interfaces. They are intended for low level filesystem utilities that require complete, unrestricted access to the underlying filesystem via holding CAP_SYS_ADMIN in the initns. i.e. they are used to perform filesystem maintenance and extension operations that need to be completely invisible to users from userspace. e.g. online file defragmentation (xfs_fsr), data migration (e.g. HSM products), efficient backup of data (xfsdump), metadata and data scrubbing, online repair, etc. IOWs, I really don't think these interfaces are something the LSMs should be trying to intercept or audit, because they are essentially internal filesystem interfaces used by trusted code and not general user application facing APIs. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH 3.16 52/63] xfs: validate cached inodes are free when allocated
On Sat, Sep 22, 2018 at 01:15:42AM +0100, Ben Hutchings wrote: > 3.16.58-rc1 review patch. If anyone has any objections, please let me know. > > -- > > From: Dave Chinner > > commit afca6c5b2595fc44383919fba740c194b0b76aff upstream. > > A recent fuzzed filesystem image caused random dcache corruption > when the reproducer was run. This often showed up as panics in > lookup_slow() on a null inode->i_ops pointer when doing pathwalks. ... > [bwh: Backported to 3.16: > - Look up mode in XFS inode, not VFS inode > - Use positive error codes, and EIO instead of EFSCORRUPTED] Again, why EIO? And ... > Signed-off-by: Ben Hutchings > --- > fs/xfs/xfs_icache.c | 73 + > 1 file changed, 48 insertions(+), 25 deletions(-) > > --- a/fs/xfs/xfs_icache.c > +++ b/fs/xfs/xfs_icache.c > @@ -133,6 +133,46 @@ xfs_inode_free( > } > > /* > + * If we are allocating a new inode, then check what was returned is > + * actually a free, empty inode. If we are not allocating an inode, > + * then check we didn't find a free inode. > + * > + * Returns: > + * 0 if the inode free state matches the lookup context > + * ENOENT if the inode is free and we are not allocating > + * EFSCORRUPTED if there is any state mismatch at all You changed the code but not the comment. Cheers, Dave. -- Dave Chinner dchin...@redhat.com
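What the review is asking for, spelled out - a sketch of the backported comment brought into line with the backported code (positive error codes, EIO standing in for EFSCORRUPTED):

    /*
     * Returns:
     *      0       if the inode free state matches the lookup context
     *      ENOENT  if the inode is free and we are not allocating
     *      EIO     if there is any state mismatch at all
     *              (positive code; upstream returns -EFSCORRUPTED here)
     */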
Re: [PATCH 3.16 51/63] xfs: catch inode allocation state mismatch corruption
On Sat, Sep 22, 2018 at 01:15:42AM +0100, Ben Hutchings wrote: > 3.16.58-rc1 review patch. If anyone has any objections, please let me know. > > -- > > From: Dave Chinner > > commit ee457001ed6c6f31ddad69c24c1da8f377d8472d upstream. > > We recently came across a V4 filesystem causing memory corruption > due to a newly allocated inode being setup twice and being added to > the superblock inode list twice. From code inspection, the only way > this could happen is if a newly allocated inode was not marked as > free on disk (i.e. di_mode wasn't zero). > Signed-Off-By: Dave Chinner > Reviewed-by: Carlos Maiolino > Tested-by: Carlos Maiolino > Reviewed-by: Darrick J. Wong > Signed-off-by: Darrick J. Wong > [bwh: Backported to 3.16: > - Look up mode in XFS inode, not VFS inode > - Use positive error codes, and EIO instead of EFSCORRUPTED] Why EIO? Cheers, Dave. -- Dave Chinner dchin...@redhat.com
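Paraphrasing what the upstream commit adds, based on its description here rather than the patch body: on an inode cache hit during allocation, verify the cached inode really was free on disk. Roughly (the 3.16 backport reads di_mode from the XFS inode instead, and returns positive EIO):

    /* sketch of the check added to xfs_iget_cache_hit(), upstream form */
    if (flags & XFS_IGET_CREATE) {
            /* a newly allocated inode must have been free on disk */
            if (VFS_I(ip)->i_mode != 0) {
                    xfs_warn(mp,
    "Corruption detected! Free inode 0x%llx not marked free on disk",
                             ino);
                    error = -EFSCORRUPTED;
                    goto out_error;
            }
    }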
Re: metadata operation reordering regards to crash
On Fri, Sep 14, 2018 at 05:06:44PM +0800, 焦晓冬 wrote: > Hi, all, > > A probably somewhat complex question: > Do today's practical filesystems, e.g., extX, btrfs, preserve metadata > operation order through a crash/power failure? Yes. Behaviour is filesystem dependent, but we have tests in fstests that specifically exercise order preservation across filesystem failures. > What I know is modern filesystems ensure metadata consistency > after crash/power failure. Journal filesystems like extX do that by > write-ahead logging of metadata operations into transactions. Other > filesystems do that in various ways, as btrfs does by COW. > > What I'm not so far clear on is whether these filesystems preserve > metadata operation order after a crash. > > For example, > op 1. rename(A, B) > op 2. rename(C, D) > > As mentioned above, metadata consistency is ensured after a crash. > Thus, B is either the original B (or does not exist) or has been replaced by A. > The same goes for D. > > Is it possible that, after a crash, D has been replaced by C but B is still > the original file (or does not exist)? Not for XFS, ext4, btrfs or f2fs. Other filesystems might be different. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
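Written out as code, the ordering guarantee being asked about is a worked example of the answer above:

    /* the two metadata operations from the example */
    rename("A", "B");       /* op 1 */
    rename("C", "D");       /* op 2 */

    /*
     * On xfs, ext4, btrfs and f2fs, crash recovery preserves the
     * order of the two renames, so the only observable post-crash
     * states are:
     *   - neither rename took effect
     *   - op 1 took effect, op 2 did not
     *   - both took effect
     * The state the question asks about - D replaced by C while B is
     * still the original - is op 2 without op 1, and cannot happen.
     */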
Re: [PATCH 2/2] fs: list init in the critical section
On Mon, Sep 10, 2018 at 05:09:52AM -0700, swkhack wrote: > the i_state init is in the critical section, so the list init should be in it as well. Why? What bug does this fix? -Dave. -- Dave Chinner da...@fromorbit.com
Re: POSIX violation by writeback error
...write-flush-close operation by using AIO and not waiting for the fsync to complete before closing the file. I have this in fsmark, because lots of people with IO intensive apps have been asking for it over the past 10 years. -A ...

FSUse%  Count   Size   Files/sec   App Overhead
     0      8   4096     28770.5        1569090
     0     16   4096     31340.7        1356595
     0     24   4096     30803.0        1423583
     0     32   4096     30404.5        1510099
     0     40   4096     30961.2        1500736

Yup, it's pretty much the same throughput as background async writeback. It's a little slower - about 160MB/s and 2,500 IOPS - due to the increase in overall journal writes caused by the fsync calls. What's clear, however, is that we're retiring 10 userspace fsync operations for every physical disk IO here, as opposed to 2 IOs per fsync in the above case. Put simply, the assumption that applications can't do more flush/fsync operations than disk IOs is not valid, and the performance of open-write-flush-close workloads on modern filesystems isn't anywhere near as bad as you think it is. To mangle a common saying into storage speak: "Caches are for show, IO is for go" Cheers, Dave. -- Dave Chinner da...@fromorbit.com
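A minimal sketch of the open-write-flush-close pattern being measured, with the flush issued through AIO so close() does not wait on it. This assumes libaio and a kernel where IOCB_CMD_FSYNC is wired up (generic support landed around 4.18); the path is hypothetical and error handling is trimmed.

    #include <fcntl.h>
    #include <libaio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
            io_context_t ctx = 0;
            struct iocb cb, *cbp = &cb;
            struct io_event ev;
            char buf[4096];
            int fd;

            memset(buf, 'x', sizeof(buf));
            io_setup(8, &ctx);

            fd = open("/mnt/xfs/testfile", O_CREAT | O_WRONLY | O_TRUNC, 0644);
            write(fd, buf, sizeof(buf));

            io_prep_fsync(&cb, fd);         /* queue the flush ... */
            io_submit(ctx, 1, &cbp);        /* ... without waiting for it */
            close(fd);                      /* in-flight iocb holds a file ref */

            io_getevents(ctx, 1, 1, &ev, NULL);  /* reap the fsync later */
            io_destroy(ctx);
            return 0;
    }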
Re: POSIX violation by writeback error
On Tue, Sep 04, 2018 at 01:35:34PM -0700, Vito Caputo wrote: > On Tue, Sep 04, 2018 at 04:18:18PM -0400, Jeff Layton wrote: > > On Tue, 2018-09-04 at 14:54 -0400, J. Bruce Fields wrote: > > > On Tue, Sep 04, 2018 at 06:23:48PM +0200, Rogier Wolff wrote: > > > > On Tue, Sep 04, 2018 at 12:12:03PM -0400, J. Bruce Fields wrote: > > > > > Well, I think the point was that in the above examples you'd prefer > > > > > that > > > > > the read just fail--no need to keep the data. A bit marking the file > > > > > (or even the entire filesystem) unreadable would satisfy posix, I > > > > > guess. > > > > > Whether that's practical, I don't know. > > > > > > > > When you would do it like that (mark the whole filesystem as "in > > > > error") things go from bad to worse even faster. The Linux kernel > > > > tries to keep the system up even in the face of errors. > > > > > > > > With that suggestion, having one application run into a writeback > > > > error would effectively crash the whole system because the filesystem > > > > may be the root filesystem and stuff like "sshd" that you need to > > > > diagnose the problem needs to be read from the disk > > > > > > Well, the absolutist position on posix compliance here would be that a > > > crash is still preferable to returning the wrong data. And for the > > > cases 焦晓冬 gives, that sounds right? Maybe it's the wrong balance in > > > general, I don't know. And we do already have filesystems with > > > panic-on-error options, so if they aren't used then maybe users > > > have already voted against that level of strictness. > > > > > > > Yeah, idk. The problem here is that this is squarely in the domain of > > implementation defined behavior. I do think that the current "policy" > > (if you call it that) of what to do after a wb error is weird and wrong. > > What we probably ought to do is start considering how we'd like it to > > behave. > > > > How about something like this? > > > > Mark the pages as "uncleanable" after a writeback error. We'll satisfy > > reads from the cached data until someone calls fsync, at which point > > we'd return the error and invalidate the uncleanable pages. > > > > If no one calls fsync and scrapes the error, we'll hold on to it for as > > long as we can (or up to some predefined limit) and then after that > > we'll invalidate the uncleanable pages and start returning errors on > > reads. If someone eventually calls fsync afterward, we can return to > > normal operation. > > > > As always though...what about mmap? Would we need to SIGBUS at the point > > where we'd start returning errors on read()? > > > > Would that approximate the current behavior enough and make sense? > > Implementing it all sounds non-trivial though... > > > > Here's a crazy and potentially stupid idea: > > Implement a new class of swap space for backing dirty pages which fail > to write back. Pages in this space survive reboots, essentially backing > the implicit commitment POSIX establishes in the face of asynchronous > writeback errors. Rather than evicting these pages as clean, they are > swapped out to the persistent swap. And when that "swap" area gets write errors, too? What then? We're straight back to the same "what the hell do we do with the error" problem. Adding more turtles doesn't help solve this issue. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
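For contrast with the designs being floated, the behaviour applications can actually rely on today is just this (since the errseq_t work in 4.13, each open fd sees a post-open writeback error at least once through fsync):

    /* the portable way to observe a writeback error today */
    if (fsync(fd) < 0) {
            /*
             * EIO/ENOSPC here means some previously written data may
             * never reach stable storage; the page cache may already
             * have cleaned or dropped the affected pages.
             */
            handle_data_loss();     /* application policy - hypothetical helper */
    }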
Re: [PATCH v2 2/3] xfs: Prevent multiple wakeups of the same log space waiter
On Mon, Aug 27, 2018 at 11:34:13AM -0400, Waiman Long wrote: > On 08/26/2018 08:21 PM, Dave Chinner wrote: > > On Sun, Aug 26, 2018 at 04:53:14PM -0400, Waiman Long wrote: > >> The current log space reservation code allows multiple wakeups of the > >> same sleeping waiter to happen. This is just a waste of cpu time as > >> well as increasing spin lock hold time. So a new XLOG_TIC_WAKING flag is > >> added to track if a task is being woken up and skip the wake_up_process() > >> call if the flag is set. > >> > >> Running the AIM7 fserver workload on a 2-socket 24-core 48-thread > >> Broadwell system with a small xfs filesystem on ramfs, the performance > >> increased from 91,486 jobs/min to 192,666 jobs/min with this change. > > Oh, I just noticed you are using a ramfs for this benchmark, > > > > tl; dr: Once you pass a certain point, ramdisks can be *much* slower > > than SSDs on journal intensive workloads like AIM7. Hence it would be > > useful to see if you have the same problems on, say, high > > performance nvme SSDs. > > Oh sorry, I made a mistake. > > There were some problems with my test configuration. I was actually > running the test on a regular enterprise-class disk device mounted on /.
> Filesystem                              1K-blocks     Used Available Use% Mounted on
> /dev/mapper/rhel_hp--xl420gen9--01-root  52403200 11284408  41118792  22% /
> It was not an SSD, nor ramdisk. I reran the test on ramdisk, the > performance of the patched kernel was 679,880 jobs/min which was a bit > more than double the 285,221 score that I got on a regular disk. Can you please re-run and report the results for each patch on the ramdisk setup? And, please, include the mkfs.xfs or xfs_info output for the ramdisk filesystem so I can see /exactly/ how much concurrency the filesystems are providing to the benchmark you are running. > So the filesystem used wasn't tiny, though it is still not very large. 50GB is tiny for XFS. Personally, I've been using ~1PB filesystems(*) for the performance testing I've been doing recently... Cheers, Dave. (*) Yes, petabytes. Sparse image files on really fast SSDs are a wonderful thing. -- Dave Chinner da...@fromorbit.com
Re: [PATCH v2 2/3] xfs: Prevent multiple wakeups of the same log space waiter
On Mon, Aug 27, 2018 at 12:39:06AM -0700, Christoph Hellwig wrote: > On Mon, Aug 27, 2018 at 10:21:34AM +1000, Dave Chinner wrote: > > tl; dr: Once you pass a certain point, ramdisks can be *much* slower > > than SSDs on journal intensive workloads like AIM7. Hence it would be > > useful to see if you have the same problems on, say, high > > performance nvme SSDs. > > Note that all these ramdisk issues you mentioned below will also apply > to using the pmem driver on nvdimms, which might be a more realistic > version. Even worse at least for cases where the nvdimms aren't > actually powerfail dram of some sort with write through caching and > ADR the latency is going to be much higher than the ramdisk as well. Yes, I realise that. I am expecting that when it comes to optimising for pmem, we'll actually rewrite the journal to map pmem and memcpy() directly rather than go through the buffering and IO layers we currently do so we can minimise write latency and control concurrency ourselves. Hence I'm not really concerned by performance issues with pmem at this point - most of our users still have traditional storage and will for a long time to come. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH v2 2/3] xfs: Prevent multiple wakeups of the same log space waiter
On Sun, Aug 26, 2018 at 04:53:14PM -0400, Waiman Long wrote: > The current log space reservation code allows multiple wakeups of the > same sleeping waiter to happen. This is just a waste of cpu time as > well as increasing spin lock hold time. So a new XLOG_TIC_WAKING flag is > added to track if a task is being woken up and skip the wake_up_process() > call if the flag is set. > > Running the AIM7 fserver workload on a 2-socket 24-core 48-thread > Broadwell system with a small xfs filesystem on ramfs, the performance > increased from 91,486 jobs/min to 192,666 jobs/min with this change. Oh, I just noticed you are using a ramfs for this benchmark. tl;dr: Once you pass a certain point, ramdisks can be *much* slower than SSDs on journal intensive workloads like AIM7. Hence it would be useful to see if you have the same problems on, say, high performance nvme SSDs. - Ramdisks have substantially different log IO completion and wakeup behaviour compared to real storage on real production systems. Basically, ramdisks are synchronous and real storage is asynchronous. That is, on a ramdisk the IO completion is run synchronously in the same task as the IO submission because the IO is just a memcpy(). Hence a single dispatch thread can only drive an IO queue depth of 1 IO - there is no concurrency possible. This serialises large parts of the XFS journal - the journal is really an asynchronous IO engine that gets its performance from driving deep IO queues and batching commits while IO is in flight. Ramdisks also have very low IO latency, which means there's only a very small window for "IO in flight" batching optimisations to be made effectively. It effectively stops such algorithms from working at all. This means the XFS journal behaves very differently on ramdisks when compared to normal storage. The submission batching technique reduces log IOs by a factor of 10-20 under heavy synchronous transaction loads when there is any noticeable journal IO delay - a few tens of microseconds is enough for it to function effectively, but a ramdisk doesn't even have this delay on journal IO. The submission batching also has the effect of reducing log space wakeups by the same factor - there are fewer IO completions signalling that space has been made available. Further, when we get async IO completions from real hardware, they get processed in batches by a completion workqueue - this leads to there typically only being a single reservation space update from all batched IO completions. This tends to reduce log space wakeups due to log IO completion by a factor of 6-8 as the log can have up to 8 concurrent IOs in flight at a time. And when we throw in the lack of batching, merging and IO completion aggregation of metadata writeback because ramdisks are synchronous and don't queue or merge adjacent IOs, we end up with lots more contention on the AIL lock and much more frequent log space wakeups (i.e. from log tail movement updates). This further exacerbates the problems the log already has with synchronous IO. IOWs, log space wakeups on real storage are likely to be 50-100x lower than on a ramdisk for the same metadata and journal intensive workload, and as such those workloads often run faster on real storage than they do on ramdisks. This can be trivially seen with dbench, a simple IO benchmark that hammers the journal. On a ramdisk, I can only get 2-2.5GB/s throughput from the benchmark before the log bottlenecks at about 20,000 tiny log IOs per second.
In comparison, on an old, badly abused Samsung 850EVO SSD, I see 5-6GB/s at 2,000 log IOs per second because of the pipelining and IO batching in the XFS journal async IO engine and the massive reduction in metadata IO due to merging of adjacent IOs in the block layer. i.e. the journal and metadata writeback design allows the filesystem to operate at a much higher synchronous transaction rate than would otherwise be possible by taking advantage of the IO concurrency that storage provides us with. So if you use proper storage hardware (e.g. nvme SSD) and/or an appropriately sized log, does the slowpath wakeup contention go away? Can you please test both of these things and report the results so we can properly evaluate the impact of these changes? Cheers, Dave. -- Dave Chinner da...@fromorbit.com
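For reference, the wakeup deduplication the patch changelog describes amounts to this pattern - a sketch reconstructed from the description above, not the patch body itself:

    /* waker side, under the log space lock: skip tasks already being woken */
    if (!(tic->t_flags & XLOG_TIC_WAKING)) {
            tic->t_flags |= XLOG_TIC_WAKING;
            wake_up_process(tic->t_task);
    }

    /* the woken waiter clears the flag before it can queue to sleep again */
    tic->t_flags &= ~XLOG_TIC_WAKING;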