Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)
On Thu, Nov 29, 2012 at 9:51 AM, Chris Mason wrote:
>
> The bigger question is do we have users that expect to be able to set
> the blocksize after mmaping the block device (no writes required)? I
> actually feel a little bad for taking up internet bandwidth asking, but
> it is a change in behaviour.

Yeah, it is. That said, I don't think people will really notice. Nobody
mmap's block devices outside of some databases, afaik, and nobody sane
mounts a partition at the same time a DB is using it.

So I think the new EBUSY check is *ugly*, but I don't realistically
believe that it is a problem. The ugliness of the locking is why I'm not
a huge fan of it, but if it works I can live with it.

But yes, the mmap tests are new with the locking, and could in theory be
problematic if somebody reports that it breaks anything. And like the
locking, they'd just go away if we just do the fs/buffer.c approach
instead, because doing things in fs/buffer.c simply means that we don't
care (and serialization is provided by the page lock on a per-page
basis, which is what mmap relies on anyway).

So doing the per-page fs/buffer.c approach (along with the
"ACCESS_ONCE()" on inode->i_blkbits to make sure we get *one* consistent
value, even if we don't care *which* value it is) would basically revert
to all the old semantics. The only thing it would change is that we
wouldn't see oopses.
(And in theory, it would allow us to actively mix-and-match different
block sizes for a block device, but realistically I don't think there
are any actual users of that - although I could imagine that a
filesystem would use a smaller block size for file tail-blocks etc, and
still want to use the fs/buffer.c code. So it's *possible* that it would
be useful, but filesystems have been able to do things like that by just
doing their buffers by hand anyway, so it's not really fundamentally
new, just a possible generalization of code.)

Linus

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)
On Thu, Nov 29, 2012 at 10:26:56AM -0700, Linus Torvalds wrote:
> On Thu, Nov 29, 2012 at 6:12 AM, Chris Mason wrote:
> >
> > Jumping in based on Linus original patch, which is doing something like
> > this:
> >
> > set_blocksize() {
> >     block new calls to writepage, prepare/commit_write
> >     set the block size
> >     unblock
> >
> >     < --- can race in here and find bad buffers --->
> >
> >     sync_blockdev()
> >     kill_bdev()
> >
> >     < --- now we're safe --- >
> > }
> >
> > We could add a second semaphore and a page_mkwrite call:
>
> Yeah, we could be fancy, but the more I think about it, the less I can
> say I care.
>
> After all, the only things that do the whole set_blocksize() thing
> should be:
>
>  - filesystems at mount-time
>
>  - things like loop/md at block device init time.
>
> and quite frankly, if there are any *concurrent* writes with either of
> the above, I really *really* don't think we should care. I mean,
> seriously.
>
> So the _only_ real reason for the locking in the first place is to
> make sure of internal kernel consistency. We do not want to oops or
> corrupt memory if people do odd things. But we really *really* don't
> care if somebody writes to a partition at the same time as somebody
> else mounts it. Not enough to do extra work to please insane people.
>
> It's also worth noting that NONE OF THIS HAS EVER WORKED IN THE PAST.
> The whole sequence always used to be unlocked. The locking is entirely
> new. There are certainly no legacy users that can possibly rely on "I
> did writes at the same time as the mount with no serialization, and it
> worked". It never has worked.
>
> So I think this is a case of "perfect is the enemy of good". Especially
> since I think that with the fs/buffer.c approach, we don't actually
> need any locking at all at higher levels.

The bigger question is do we have users that expect to be able to set
the blocksize after mmaping the block device (no writes required)? I
actually feel a little bad for taking up internet bandwidth asking, but
it is a change in behaviour.

Regardless, changing mmap for a race in the page cache is just
backwards, and with the current 3.7 code we can still trigger the race
with fadvise -> readpage in the middle of set_blocksize().

Obviously nobody does any of this, otherwise we'd have tons of reports
from those handy WARN_ONs in fs/buffer.c. So it's definitely hard to be
worried one way or another.

-chris
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)
On Thu, Nov 29, 2012 at 6:12 AM, Chris Mason wrote:
>
> Jumping in based on Linus original patch, which is doing something like
> this:
>
> set_blocksize() {
>     block new calls to writepage, prepare/commit_write
>     set the block size
>     unblock
>
>     < --- can race in here and find bad buffers --->
>
>     sync_blockdev()
>     kill_bdev()
>
>     < --- now we're safe --- >
> }
>
> We could add a second semaphore and a page_mkwrite call:

Yeah, we could be fancy, but the more I think about it, the less I can
say I care.

After all, the only things that do the whole set_blocksize() thing
should be:

 - filesystems at mount-time

 - things like loop/md at block device init time.

and quite frankly, if there are any *concurrent* writes with either of
the above, I really *really* don't think we should care. I mean,
seriously.

So the _only_ real reason for the locking in the first place is to make
sure of internal kernel consistency. We do not want to oops or corrupt
memory if people do odd things. But we really *really* don't care if
somebody writes to a partition at the same time as somebody else mounts
it. Not enough to do extra work to please insane people.

It's also worth noting that NONE OF THIS HAS EVER WORKED IN THE PAST.
The whole sequence always used to be unlocked. The locking is entirely
new. There are certainly no legacy users that can possibly rely on "I
did writes at the same time as the mount with no serialization, and it
worked". It never has worked.

So I think this is a case of "perfect is the enemy of good". Especially
since I think that with the fs/buffer.c approach, we don't actually need
any locking at all at higher levels.

Linus
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)
On Thu, Nov 29, 2012 at 07:12:49AM -0700, Chris Mason wrote:
> On Wed, Nov 28, 2012 at 11:16:21PM -0700, Linus Torvalds wrote:
> > On Wed, Nov 28, 2012 at 6:58 PM, Linus Torvalds wrote:
> > >
> > > But the fact that the code wants to do things like
> > >
> > >     block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
> > >
> > > seriously seems to be the main thing that keeps us using
> > > 'inode->i_blkbits'. Calculating bbits from bh->b_size is just costly
> > > enough to hurt (not everywhere, but on some machines).
> > >
> > > Very annoying.
> >
> > Hmm. Here's a patch that does that anyway. I'm not 100% happy with the
> > whole ilog2 thing, but at the same time, in other cases it actually
> > seems to improve code generation (ie gets rid of the whole unnecessary
> > two dereferences through page->mapping->host just to get the block
> > size, when we have it in the buffer-head that we have to touch
> > *anyway*).
> >
> > Comments? Again, untested.
>
> Jumping in based on Linus original patch, which is doing something like
> this:
>
> set_blocksize() {
>     block new calls to writepage, prepare/commit_write
>     set the block size
>     unblock
>
>     < --- can race in here and find bad buffers --->
>
>     sync_blockdev()
>     kill_bdev()
>
>     < --- now we're safe --- >
> }
>
> We could add a second semaphore and a page_mkwrite call:
>
> set_blocksize() {
>
>     block new calls to prepare/commit_write and page_mkwrite(), but
>     leave writepage unblocked.
>
>     sync_blockdev()
>
>     <--- now we're safe. There are no dirty pages and no ways to
>     make new ones --->
>
>     block new calls to readpage (writepage too for good luck?)
>
>     kill_bdev()

Whoops, kill_bdev needs the page lock, which sends us into ABBA when
readpage does the down_read. So, slight modification: unblock
readpage/writepage before the kill_bdev. The risk is that readpage can
then find buffers with the wrong size, and it would need to be changed
to discard them.

The patch below is based on Linus' original and doesn't deal with the
readpage race, but it does get the rest of the idea across. It boots and
survives banging on blockdev --setbsz with mkfs, but I definitely
wouldn't trust it.

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 1a1e5e3..1377171 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -116,8 +116,6 @@ EXPORT_SYMBOL(invalidate_bdev);
 
 int set_blocksize(struct block_device *bdev, int size)
 {
-	struct address_space *mapping;
-
 	/* Size must be a power of two, and between 512 and PAGE_SIZE */
 	if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
 		return -EINVAL;
@@ -126,28 +124,40 @@ int set_blocksize(struct block_device *bdev, int size)
 	if (size < bdev_logical_block_size(bdev))
 		return -EINVAL;
 
-	/* Prevent starting I/O or mapping the device */
-	percpu_down_write(&bdev->bd_block_size_semaphore);
-
-	/* Check that the block device is not memory mapped */
-	mapping = bdev->bd_inode->i_mapping;
-	mutex_lock(&mapping->i_mmap_mutex);
-	if (mapping_mapped(mapping)) {
-		mutex_unlock(&mapping->i_mmap_mutex);
-		percpu_up_write(&bdev->bd_block_size_semaphore);
-		return -EBUSY;
-	}
-	mutex_unlock(&mapping->i_mmap_mutex);
-
 	/* Don't change the size if it is same as current */
 	if (bdev->bd_block_size != size) {
+		/* block all modifications via writing and page_mkwrite */
+		percpu_down_write(&bdev->bd_block_size_semaphore);
+
+		/* write everything that was dirty */
 		sync_blockdev(bdev);
+
+		/* block readpage and writepage */
+		percpu_down_write(&bdev->bd_page_semaphore);
+
 		bdev->bd_block_size = size;
 		bdev->bd_inode->i_blkbits = blksize_bits(size);
+
+		/*
+		 * we can't call kill_bdev with the page_semaphore down
+		 * because we'll deadlock against readpage.
+		 * The block_size_semaphore should prevent any new
+		 * pages from being dirty, but readpage can jump
+		 * in once we up the bd_page_sem and find a
+		 * page with buffers from the old size.
+		 *
+		 * The kill_bdev call below is going to get rid
+		 * of those buffers, but we do have a race here.
+		 * readpage needs to deal with it and verify
+		 * any buffers on the page are the right size
+		 */
+		percpu_up_write(&bdev->bd_page_semaphore);
+
+		/* drop all the pages and all the buffers */
 		kill_bdev(bdev);
-	}
-	percpu_up_write(&bdev->bd_block_size_semaphore);
+
+		/* open the gates and let everyone back in */
+		percpu_up_write(&bdev->bd_block_size_semaphore);
+	}
 
 	return 0;
 }
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)
On Wed, Nov 28, 2012 at 11:16:21PM -0700, Linus Torvalds wrote:
> On Wed, Nov 28, 2012 at 6:58 PM, Linus Torvalds wrote:
> >
> > But the fact that the code wants to do things like
> >
> >     block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
> >
> > seriously seems to be the main thing that keeps us using
> > 'inode->i_blkbits'. Calculating bbits from bh->b_size is just costly
> > enough to hurt (not everywhere, but on some machines).
> >
> > Very annoying.
>
> Hmm. Here's a patch that does that anyway. I'm not 100% happy with the
> whole ilog2 thing, but at the same time, in other cases it actually
> seems to improve code generation (ie gets rid of the whole unnecessary
> two dereferences through page->mapping->host just to get the block
> size, when we have it in the buffer-head that we have to touch
> *anyway*).
>
> Comments? Again, untested.

Jumping in based on Linus original patch, which is doing something like
this:

set_blocksize() {
    block new calls to writepage, prepare/commit_write
    set the block size
    unblock

    < --- can race in here and find bad buffers --->

    sync_blockdev()
    kill_bdev()

    < --- now we're safe --- >
}

We could add a second semaphore and a page_mkwrite call:

set_blocksize() {

    block new calls to prepare/commit_write and page_mkwrite(), but
    leave writepage unblocked.

    sync_blockdev()

    <--- now we're safe. There are no dirty pages and no ways to
    make new ones --->

    block new calls to readpage (writepage too for good luck?)

    kill_bdev()

    set the block size

    unblock readpage/writepage

    unblock prepare/commit_write and page_mkwrite
}

Another way to look at things: as Linus said in a different email, we
don't need to drop the pages, just the buffers. Once we've blocked
prepare/commit_write, there is no way to make a partially up to date
page with dirty data. We may make fully uptodate dirty pages, but for
those we can just create dirty buffers for the whole page.

As long as we had prepare/commit_write blocked while we ran
sync_blockdev, we can blindly detach any buffers that are the wrong size
and just make new ones. This may or may not apply to loop.c, I'd have to
read that more carefully.

-chris
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)
On Thu, Nov 29, 2012 at 2:45 PM, Al Viro wrote:
> On Wed, Nov 28, 2012 at 10:37:27PM -0800, Linus Torvalds wrote:
>> On Wed, Nov 28, 2012 at 10:30 PM, Al Viro wrote:
>> >
>> > Note that sync_blockdev() a few lines prior to that is good only if we
>> > have no other processes doing write(2) (or dirtying the mmapped pages,
>> > for that matter). The window isn't too wide, but...
>>
>> So with Mikulas' patches, the write actually would block (at write
>> level) due to the locking. The mmap'ed pages may be around and
>> flushed, but the logic to not allow currently *active* mmaps (with the
>> rather nasty random -EBUSY return value) should mean that there is no
>> race.
>>
>> Or rather, there's a race, but it results in that EBUSY thing.
>
> Same as with fs mounted on it, or the sucker having been claimed for
> RAID array, etc. Frankly, I'm more than slightly tempted to make
> bdev mmap() just claim the sodding device exclusive for as long as
> it's mmapped...
>
> In principle, I agree, but... I still have nightmares from mmap/truncate
> races way back. You are stepping into what used to be a really nasty
> minefield. I'll look into that, but it's *definitely* not -rc8 fodder.

Just let me know which relevant patch(es) you want me to test or break.

Thanks,

Jeff
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)
On Wed, Nov 28, 2012 at 10:37:27PM -0800, Linus Torvalds wrote:
> On Wed, Nov 28, 2012 at 10:30 PM, Al Viro wrote:
> >
> > Note that sync_blockdev() a few lines prior to that is good only if we
> > have no other processes doing write(2) (or dirtying the mmapped pages,
> > for that matter). The window isn't too wide, but...
>
> So with Mikulas' patches, the write actually would block (at write
> level) due to the locking. The mmap'ed pages may be around and
> flushed, but the logic to not allow currently *active* mmaps (with the
> rather nasty random -EBUSY return value) should mean that there is no
> race.
>
> Or rather, there's a race, but it results in that EBUSY thing.

Same as with fs mounted on it, or the sucker having been claimed for
RAID array, etc. Frankly, I'm more than slightly tempted to make bdev
mmap() just claim the sodding device exclusive for as long as it's
mmapped...

In principle, I agree, but... I still have nightmares from mmap/truncate
races way back. You are stepping into what used to be a really nasty
minefield. I'll look into that, but it's *definitely* not -rc8 fodder.
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)
On Wed, Nov 28, 2012 at 10:30 PM, Al Viro wrote: > > Note that sync_blockdev() a few lines prior to that is good only if we > have no other processes doing write(2) (or dirtying the mmapped pages, > for that matter). The window isn't too wide, but... So with Mikulas' patches, the write actually would block (at write level) due to the locking. The mmap'ed pages may be around and flushed, but the logic to not allow currently *active* mmaps (with the rather nasty random -EBUSY return value) should mean that there is no race. Or rather, there's a race, but it results in that EBUSY thing. With my simplified locking, the sync_blockdev() is right before (not a few lines prior to) the kill_bdev(), and in a perfect world they'd actually be one single operation ("write back and invalidate pages with the wrong block-size"). But they aren't. Linus
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)
On Wed, Nov 28, 2012 at 10:25 PM, Al Viro wrote: > > Umm... set_blocksize() is calling kill_bdev(), which does > truncate_inode_pages(mapping, 0). What's going to happen to data in > the dirty pages? IO in progress is not the only thing to worry about... Hmm. Yes. I think it works by virtue of "if you change the blocksize while there is active IO, you're insane and you deserve whatever you get". It shouldn't even be fundamentally hard to make it work, although I suspect it would be more code than it would be worth. The sane model would be to not use truncate_inode_pages(), but instead just walk the pages and get rid of the buffer heads with the wrong size. Preferably *combining* that with the sync_blockdev(). We have no real reason to even invalidate the page cache, it's just the buffers we want to get rid of. But I suspect it's true that none of that is really *worth* it, considering that nobody likely wants to do any concurrent IO. We don't want to crash, or corrupt the data structures, but I suspect "you get what you deserve" might actually be the right model ;) So the current "sync_blockdev()+kill_bdev()" takes care of the *sane* case (we flush any data that happened *before* the block size change), and any concurrent writes with block-size changes are "good luck with that". Linus
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)
On Thu, Nov 29, 2012 at 06:25:19AM +, Al Viro wrote: > Umm... set_blocksize() is calling kill_bdev(), which does > truncate_inode_pages(mapping, 0). What's going to happen to data in > the dirty pages? IO in progress is not the only thing to worry about... Note that sync_blockdev() a few lines prior to that is good only if we have no other processes doing write(2) (or dirtying the mmapped pages, for that matter). The window isn't too wide, but...
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)
On Wed, Nov 28, 2012 at 10:16:21PM -0800, Linus Torvalds wrote: > On Wed, Nov 28, 2012 at 6:58 PM, Linus Torvalds > wrote: > > > > But the fact that the code wants to do things like > > > > block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits); > > > > seriously seems to be the main thing that keeps us using > > 'inode->i_blkbits'. Calculating bbits from bh->b_size is just costly > > enough to hurt (not everywhere, but on some machines). > > > > Very annoying. > > Hmm. Here's a patch that does that anyway. I'm not 100% happy with the > whole ilog2 thing, but at the same time, in other cases it actually > seems to improve code generation (ie gets rid of the whole unnecessary > two dereferences through page->mapping->host just to get the block > size, when we have it in the buffer-head that we have to touch > *anyway*). > > Comments? Again, untested. > > And I notice that Al Viro hasn't been cc'd, which is sad, since he's > been involved in much of fs/block_dev.c. > > Al - this is an independent patch to fs/buffer.c to make > fs/block_dev.c able to change the block size of a block device while > there is IO in progress that may still use the old block size. The > discussion has been on fsdevel and lkml, but you may have missed it... Umm... set_blocksize() is calling kill_bdev(), which does truncate_inode_pages(mapping, 0). What's going to happen to data in the dirty pages? IO in progress is not the only thing to worry about...
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)
On Wed, Nov 28, 2012 at 6:58 PM, Linus Torvalds wrote: > > But the fact that the code wants to do things like > > block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits); > > seriously seems to be the main thing that keeps us using > 'inode->i_blkbits'. Calculating bbits from bh->b_size is just costly > enough to hurt (not everywhere, but on some machines). > > Very annoying. Hmm. Here's a patch that does that anyway. I'm not 100% happy with the whole ilog2 thing, but at the same time, in other cases it actually seems to improve code generation (ie gets rid of the whole unnecessary two dereferences through page->mapping->host just to get the block size, when we have it in the buffer-head that we have to touch *anyway*). Comments? Again, untested. And I notice that Al Viro hasn't been cc'd, which is sad, since he's been involved in much of fs/block_dev.c. Al - this is an independent patch to fs/buffer.c to make fs/block_dev.c able to change the block size of a block device while there is IO in progress that may still use the old block size. The discussion has been on fsdevel and lkml, but you may have missed it... Linus patch.diff Description: Binary data
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)
On Wed, Nov 28, 2012 at 6:04 PM, Linus Torvalds wrote: > > Interesting. The code *has* the block size (it's in "bh->b_size"), but > it actually then uses the inode blocksize instead, and verifies the > two against each other. It could just have used the block size > directly (and then used the inode i_blkbits only when no buffers > existed), avoiding that dependency entirely.. Looking more at this code, that really would be the nicest solution. There's two cases for the whole get_block() thing: - filesystems. The block size will not change randomly, and "get_block()" seriously depends on the block size. - the raw device. The block size *will* change, but to simplify the problem, "get_block()" is a 1:1 mapping, so it doesn't even care about the block size because it will always return "bh->b_blocknr = nr". So we *could* just say that all the fs/buffer.c code should use "inode->i_blkbits" for creating buffers (because that's the size new buffers should always use), but use "bh->b_size" for any *existing* buffer use. And looking at it, it's even simple. Except for one *very* annoying thing: several users really don't want the size of the buffer, they really do want the *shift* of the buffer size. In fact, that single issue seems to be the reason why "inode->i_blkbits" is really used in fs/buffer.c. Otherwise it would be fairly trivial to just make the pattern be just a simple if (!page_has_buffers(page)) create_empty_buffers(page, 1 << inode->i_blkbits, 0); head = page_buffers(page); blocksize = head->b_size; and just use the blocksize that way, without any other games. All done, no silly WARN_ON() to verify against some global block-size, and the fs/buffer.c code would be perfectly simple, and would have no problem at all with multiple different blocksizes in different pages (the page lock serializes the buffers and thus the blocksize at the per-page level). 
But the fact that the code wants to do things like block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits); seriously seems to be the main thing that keeps us using 'inode->i_blkbits'. Calculating bbits from bh->b_size is just costly enough to hurt (not everywhere, but on some machines). Very annoying. Linus
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)
On Wed, 28 Nov 2012, Linus Torvalds wrote: > A bigger issue is for things that emulate what blkdev.c does, and > doesn't do the locking. I see code in md/bitmap.c that seems a bit > suspicious, for example. That said, it's not *new* breakage, and the > "lock at mmap/read/write() time" approach doesn't fix it either (since > the mapping will be different for the underlying MD device). So I do > think that we should take a look at all the users of > "alloc_page_buffers()" and "create_empty_buffers()" to see what *they* > do to protect the block-size, but I think that's an independent issue > from the raw device access case in fs/block_dev.c.. > > I guess I have to actually test my patch. I don't have very > interesting test-cases, though. > > Linus Yes, md looks suspicious. It disables write access in deny_bitmap_write_access (that function looks buggy on its own - what happens if i_writecount == 0 or if it is already negative on entry?). So we could disallow changing block size if i_writecount < 0, that should do the trick. Mikulas
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)
[ Sorry, I was offline for a while driving kids around ] On Wed, Nov 28, 2012 at 4:38 PM, Mikulas Patocka wrote: > > It can happen. Take your patch (the one that moves bd_block_size_semaphore > into blkdev_readpage, blkdev_writepage and blkdev_write_begin). Interesting. The code *has* the block size (it's in "bh->b_size"), but it actually then uses the inode blocksize instead, and verifies the two against each other. It could just have used the block size directly (and then used the inode i_blkbits only when no buffers existed), avoiding that dependency entirely.. It actually does the same thing (with the same verification) in __block_write_full_page() and (_without_ the verification) in __block_commit_write(). Ho humm. All of those places actually do hold the rwsem for reading, it's just that I don't want to hold it for writing over the sync.. Need to think on this, Linus
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)
On Wed, 28 Nov 2012, Linus Torvalds wrote: > > For example, __block_write_full_page and __block_write_begin do > > if (!page_has_buffers(page)) { create_empty_buffers... } > > and then they do > > WARN_ON(bh->b_size != blocksize) > > err = get_block(inode, block, bh, 1) > > Right. And none of this is new. > > > ... so if the buffers were left over from some previous call to > > create_empty_buffers with a different blocksize, that WARN_ON is triggered. > > None of this can happen. It can happen. Take your patch (the one that moves bd_block_size_semaphore into blkdev_readpage, blkdev_writepage and blkdev_write_begin). Insert msleep(1000) into set_blocksize, just before sync_blockdev. Run this program:

#define _XOPEN_SOURCE 500
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>

static char array[4096];

int main(void)
{
	int h;
	system("dmsetup remove test 2>/dev/null");
	if (system("dmsetup create test --table '0 1 zero'"))
		exit(1);
	h = open("/dev/mapper/test", O_RDWR);
	if (h < 0)
		perror("open"), exit(1);
	if (pread(h, array, 512, 0) != 512)
		perror("pread"), exit(1);
	if (system("dmsetup load test --table '0 8 zero'"))
		exit(1);
	if (system("dmsetup suspend test"))
		exit(1);
	if (system("dmsetup resume test"))
		exit(1);
	if (system("blockdev --setbsz 2048 /dev/mapper/test &"))
		exit(1);
	usleep(50);
	if (pwrite(h, array, 4096, 0) != 4096)
		perror("pwrite"), exit(1);
	return 0;
}

--- it triggers

WARNING: at fs/buffer.c:1830 in __block_write_begin
[ 1243.30] Backtrace:
[ 1243.33] [<40230ba8>] block_write_begin+0x70/0xd0
[ 1243.40] [<402350cc>] blkdev_write_begin+0xb4/0x208
[ 1243.48] [<401a9f10>] generic_file_buffered_write+0x248/0x348
[ 1243.57] [<401ac8c4>] __generic_file_aio_write+0x1fc/0x388
[ 1243.66] [<40235e74>] blkdev_aio_write+0x64/0xf0
[ 1243.74] [<401f2108>] do_sync_write+0xd0/0x128
[ 1243.81] [<401f2930>] vfs_write+0xa0/0x180
[ 1243.88] [<401f2ecc>] sys_pwrite64+0xb4/0xd8
[ 1243.95] [<40122104>] parisc_pwrite64+0x1c/0x28
[ 1244.02] [<40106060>] syscall_exit+0x0/0x14

I'm not saying
that your approach is wrong, you just have to carefully review all memory management code for assumptions that block size doesn't change. Mikulas
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)
On Wed, Nov 28, 2012 at 2:52 PM, Linus Torvalds wrote: > >> For example, __block_write_full_page and __block_write_begin do >> if (!page_has_buffers(page)) { create_empty_buffers... } >> and then they do >> WARN_ON(bh->b_size != blocksize) >> err = get_block(inode, block, bh, 1) > > Right. And none of this is new. .. which, btw, is not to say that *other* things aren't new. They are. The change to actually change the block device buffer size before then calling "sync_bdev()" is definitely a real change, and as mentioned, I have not tested the patch in any way. If any block device driver were to actually compare the IO size they get against the bdev->block_size thing, they'd see very different behavior (ie they'd see the new block size as they are asked to write out the old blocks with the old block size). So it does change semantics, no question about that. I don't think any block device does it, though. A bigger issue is for things that emulate what blkdev.c does, and doesn't do the locking. I see code in md/bitmap.c that seems a bit suspicious, for example. That said, it's not *new* breakage, and the "lock at mmap/read/write() time" approach doesn't fix it either (since the mapping will be different for the underlying MD device). So I do think that we should take a look at all the users of "alloc_page_buffers()" and "create_empty_buffers()" to see what *they* do to protect the block-size, but I think that's an independent issue from the raw device access case in fs/block_dev.c.. I guess I have to actually test my patch. I don't have very interesting test-cases, though. Linus
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)
On Wed, Nov 28, 2012 at 1:29 PM, Mikulas Patocka wrote: > > The problem with this approach is that it is very easy to miss points > where it is assumed that the block size doesn't change - and if you miss a > point, it results in a hidden bug that has a little possibility of being > found. Umm. Mikulas, *your* approach has resulted in bugs. So let's not throw stones in glass houses, shall we? The whole reason for this long thread (and several threads before it) is that your model isn't working and is causing problems. I already pointed out how bogus your arguments about mmap() locking were, and then you have the gall to talk about potential bugs, when I have pointed you to *actual* bugs, and actual mistakes. > For example, __block_write_full_page and __block_write_begin do > if (!page_has_buffers(page)) { create_empty_buffers... } > and then they do > WARN_ON(bh->b_size != blocksize) > err = get_block(inode, block, bh, 1) Right. And none of this is new. > ... so if the buffers were left over from some previous call to > create_empty_buffers with a different blocksize, that WARN_ON is trigged. None of this can happen. > Locking the whole read/write/mmap operations is crude, but at least it can > be done without thorough review of all the memory management code. Umm. Which you clearly didn't do, and raised totally crap arguments for. In contrast, I have a very simple argument for the correctness of my patch: every single user of the "get_block[s]()" interface now takes the lock for as long as get_block[s]() is passed off to somebody else. And since get_block[s]() is the only way to create those empty buffers, I think I pretty much proved exactly what you ask for. And THAT is the whole point and advantage of making locking sane. Sane locking you can actually *think* about! In contrast, locking around "mmap()" is absolutely *guaranteed* to be insane, because mmap() doesn't actually do any of the IO that the lock is supposed to protect against! 
So Mikulas, quite frankly, your arguments argue against you. When you say "Locking the whole read/write/mmap operations is crude, but at least it can be done without thorough", you are doubly correct: it *is* crude, and it clearly *was* done without thought, since it's a f*cking idiotic AND INCORRECT thing to do. Seriously. Locking around "mmap()" is insane. It leads to insane semantics (the whole EBUSY thing is purely because of that problem) and it leads to bad code (your "let's add a new "mmap_region" hook is just disgusting, and while Al's idea of doing it in the existing "->open" method is at least not nasty, it's definitely extra code and complexity). There are serious *CORRECTNESS* advantages to simplicity and directness. And locking at the right point is definitely very much part of that. Anyway, as far as block size goes, we have exactly two cases: - random IO that does not care about the block size, and will just do whatever the current block size is (ie normal anonymous accesses to the block device). This is the case that needs the locking - but it only needs it around the individual page operations, ie exactly where I put it. In fact, they can happily deal with different block sizes for different pages, they don't really care. - mounted filesystems etc that require a particular block size and set it at mount time, and they have exclusivity rules The second case is the case that actually calls set_blocksize(), and if "kill_bdev()" doesn't get rid of the old blocksizes, then they have always been in trouble, and would always _continue_ to be in trouble, regardless of locking. Linus
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)
On Wed, 28 Nov 2012, Linus Torvalds wrote: > On Wed, Nov 28, 2012 at 12:32 PM, Linus Torvalds > wrote: > > > > Here is a *COMPLETELY* untested patch. Caveat emptor. It will probably > > do unspeakable things to your family and pets. > > Btw, *if* this approach works, I suspect we could just switch the > bd_block_size_semaphore semaphore to be a regular rw-sem. > > Why? Because now it's no longer ever gotten in the cached IO paths, we > only get it when we're doing much more expensive things (ie actual IO, > and buffer head allocations etc etc). As long as we just work with the > page cache, we never get to the whole lock at all. > > Which means that the whole percpu-optimized thing is likely no longer > all that relevant. Using a normal semaphore degrades direct-IO performance on raw devices, I measured that (45.1s with a normal rw-semaphore vs. 42.8s with the percpu rw-semaphore). It can be measured on ramdisk or high-performance SSDs. Mikulas
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)
On Wed, 28 Nov 2012, Linus Torvalds wrote: > On Wed, Nov 28, 2012 at 12:03 PM, Linus Torvalds > wrote: > > > > mmap() is in *no* way special. The exact same thing happens for > > regular read/write. Yet somehow the mmap code is special-cased, while > > the normal read-write code is not. > > I just double-checked, because it's been a long time since I actually > looked at the code. > > But yeah, block device read/write uses the pure page cache functions. > IOW, it has the *exact* same IO engine as mmap() would have. > > So here's my suggestion: > > - get rid of *all* the locking in aio_read/write and the splice paths > - get rid of all the stupid mmap games > > - instead, add them to the functions that actually use > "blkdev_get_block()" and "blkdev_get_blocks()" and nowhere else. > > That's a fairly limited number of functions: > blkdev_{read,write}page(), blkdev_direct_IO() and > blkdev_write_{begin,end}() > > Doesn't that sound simpler? And more logical: it protects the actual > places that use the block size of the device. > > I dunno. Maybe there is some fundamental reason why the above is > broken, but it seems to be a much simpler approach. Sure, you need to > guarantee that the people who get the write-lock cannot possibly cause > IO while holding it, but since the only reason to get the write lock > would be to change the block size, that should be pretty simple, no? > > Yeah, yeah, I'm probably missing something fundamental, but the above > sounds like the simple approach to fixing things. Aiming for having > the block size read-lock be taken by the things that pass in the > block-size itself. > > It would be nice for things to be logical and straightforward. > > Linus The problem with this approach is that it is very easy to miss points where it is assumed that the block size doesn't change - and if you miss a point, it results in a hidden bug that has little chance of being found.
For example, __block_write_full_page and __block_write_begin do if (!page_has_buffers(page)) { create_empty_buffers... } and then they do WARN_ON(bh->b_size != blocksize) err = get_block(inode, block, bh, 1) ... so if the buffers were left over from some previous call to create_empty_buffers with a different blocksize, that WARN_ON is triggered. And it's not only a harmless warning - now bh->b_size is left set to the old block size, but bh->b_blocknr is set to a number that was calculated according to the new block size - and when you submit that buffer with submit_bh, it is written to the wrong place! Now, prove that there are no more bugs like this. Locking the whole read/write/mmap operations is crude, but at least it can be done without thorough review of all the memory management code. Mikulas
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)
On Wed, Nov 28, 2012 at 12:32 PM, Linus Torvalds wrote: > > Here is a *COMPLETELY* untested patch. Caveat emptor. It will probably > do unspeakable things to your family and pets. Btw, *if* this approach works, I suspect we could just switch the bd_block_size_semaphore semaphore to be a regular rw-sem. Why? Because now it's no longer ever gotten in the cached IO paths, we only get it when we're doing much more expensive things (ie actual IO, and buffer head allocations etc etc). As long as we just work with the page cache, we never get to the whole lock at all. Which means that the whole percpu-optimized thing is likely no longer all that relevant. But that's an independent thing, and it's only true *if* my patch works. It looks fine on paper, but maybe there's something fundamentally broken about it. One big change my patch does is to move the sync_bdev/kill_bdev to *after* changing the block size. It does that so that it can guarantee that any old data (which didn't see the new block size) will be sync'ed even if there is new IO coming in as we change the block size. The old code locked the whole sync() region, which doesn't work with my approach, since the sync will do IO and would thus cause potential deadlocks while holding the rwsem for writing. So with this patch, as the block size changes, you can actually have some old pages with the old block size *and* some different new pages with the new block size all at the same time. It should all be perfectly fine, but it's worth pointing out. (It probably won't trigger in practice, though, since doing IO while somebody else is changing the blocksize is fundamentally an odd thing to do, but whatever. 
I also suspect that we *should* perhaps use the inode->i_sem thing to serialize concurrent block size changes, but that's again an independent issue) Linus
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)
On Wed, Nov 28, 2012 at 12:13 PM, Linus Torvalds wrote: > > I dunno. Maybe there is some fundamental reason why the above is > broken, but it seems to be a much simpler approach. Sure, you need to > guarantee that the people who get the write-lock cannot possibly cause > IO while holding it, but since the only reason to get the write lock > would be to change the block size, that should be pretty simple, no? Here is a *COMPLETELY* untested patch. Caveat emptor. It will probably do unspeakable things to your family and pets. Linus patch.diff Description: Binary data
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)
On Wed, Nov 28, 2012 at 12:03 PM, Linus Torvalds wrote: > > mmap() is in *no* way special. The exact same thing happens for > regular read/write. Yet somehow the mmap code is special-cased, while > the normal read-write code is not. I just double-checked, because it's been a long time since I actually looked at the code. But yeah, block device read/write uses the pure page cache functions. IOW, it has the *exact* same IO engine as mmap() would have. So here's my suggestion: - get rid of *all* the locking in aio_read/write and the splice paths - get rid of all the stupid mmap games - instead, add them to the functions that actually use "blkdev_get_block()" and "blkdev_get_blocks()" and nowhere else. That's a fairly limited number of functions: blkdev_{read,write}page(), blkdev_direct_IO() and blkdev_write_{begin,end}() Doesn't that sound simpler? And more logical: it protects the actual places that use the block size of the device. I dunno. Maybe there is some fundamental reason why the above is broken, but it seems to be a much simpler approach. Sure, you need to guarantee that the people who get the write-lock cannot possibly cause IO while holding it, but since the only reason to get the write lock would be to change the block size, that should be pretty simple, no? Yeah, yeah, I'm probably missing something fundamental, but the above sounds like the simple approach to fixing things. Aiming for having the block size read-lock be taken by the things that pass in the block-size itself. It would be nice for things to be logical and straightforward. Linus
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)
On Wed, Nov 28, 2012 at 11:50 AM, Mikulas Patocka wrote: > > mmap_region() doesn't care about the block size. But a lot of > page-in/page-out code does. That seems a bogus argument. mmap() is in *no* way special. The exact same thing happens for regular read/write. Yet somehow the mmap code is special-cased, while the normal read-write code is not. I suspect it might be *easier* to trigger some issues with mmap, but that still isn't a good enough reason to special-case it. We don't add locking to one place just because that one place shows some race condition more easily. We fix the locking. So for example, maybe the code that *actually* cares about the buffer size (the stuff that allocates buffers in fs/buffer.c) needs to take that new percpu read lock. Basically, any caller of "alloc_page_buffers()/create_empty_buffers()" or whatever. I also wonder whether we need it *at*all*. I suspect that we could easily have multiple block-sizes these days for the same block device. It *used* to be (millions of years ago, when dinosaurs roamed the earth) that the block buffers were global and shared with all users of a partition. But that hasn't been true since we started using the page cache, and I suspect that some of the block size changing issues are simply entirely stale. Yeah, yeah, there could be some coherency issues if people write to the block device through different block sizes, but I think we have those coherency issues anyway. The page-cache is not coherent across different mapping inodes anyway. So I really suspect that some of this is "legacy logic". Or at least perhaps _should_ be. Linus
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)
On Wed, Nov 28, 2012 at 11:43 AM, Al Viro wrote: > Have a > private vm_operations - a copy of generic_file_vm_ops with ->open()/->close() > added to it. That sounds more reasonable. However, I suspect the *most* reasonable thing to do is to just remove the whole damn thing. We really shouldn't care about mmap. If somebody does a mmap on a block device, and somebody else then changes the block size, why-ever should we bother to go through any contortions at *all* to make that kind of insane behavior do anything sane at all. Just let people mmap things. Then just let the normal page cache invalidation work right. In fact, it is entirely possible that we could/should just not even invalidate the page cache at all, just make sure that the buffer heads attached to any pages get disconnected. No? Linus
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)
On Wed, 28 Nov 2012, Linus Torvalds wrote:

> No, this is crap.
>
> We don't introduce random hooks like this just because the block layer
> has shit-for-brains and cannot be bothered to do things right.
>
> The fact is, the whole locking in the block layer open routine is
> total and utter crap. It doesn't lock the right thing, even with your
> change *anyway* (or with the change Jens had). Absolutely nothing in
> "mmap_region()" cares at all about the block-size anywhere - it's
> generic, after all - so locking around it is f*cking pointless. There
> is no way in hell that the caller of ->mmap can *ever* care about the
> block size, since it never even looks at it.
>
> Don't do random crap like this.
>
> Why does the code think that mmap matters so much anyway? As you say,
> the mmap itself does *nothing*. It has no impact for the block size.
>
> Linus

mmap_region() doesn't care about the block size. But a lot of page-in/page-out code does. The problem is that once the block device is mapped, page faults or page writeback can happen anytime - so the simplest solution is not to allow the block device to be mapped while we change the block size.

The function set_blocksize takes bd_block_size_semaphore for write (that blocks read/write/mmap), then it calls sync_blockdev (now we are sure that there is no more writeback), then it changes the block size, then it calls kill_bdev (now we are sure that there are no longer any pages with buffers with the old blocksize).

If you want to allow changing the block size while a block device is mapped, you'd have to add explicit locks around all mm callbacks (so that the block size can't change while the callback is in progress) - and even then, there are some unsolvable cases - i.e. what are you going to do if the user mlocks a mapped block device and you change the block size of that device? - you can't drop the pages (that would violate mlock semantics) and you can't leave them there (because they have buffers with the wrong size).

If you don't like what I sent, propose a different solution.

Mikulas
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)
On Wed, Nov 28, 2012 at 11:15:12AM -0800, Linus Torvalds wrote:
> No, this is crap.
>
> We don't introduce random hooks like this just because the block layer
> has shit-for-brains and cannot be bothered to do things right.
>
> The fact is, the whole locking in the block layer open routine is
> total and utter crap. It doesn't lock the right thing, even with your
> change *anyway* (or with the change Jens had). Absolutely nothing in
> "mmap_region()" cares at all about the block-size anywhere - it's
> generic, after all - so locking around it is f*cking pointless. There
> is no way in hell that the caller of ->mmap can *ever* care about the
> block size, since it never even looks at it.
>
> Don't do random crap like this.
>
> Why does the code think that mmap matters so much anyway? As you say,
> the mmap itself does *nothing*. It has no impact for the block size.

... and here I was, trying to massage a reply into form that would be at least borderline printable...

Anyway, this is certainly the wrong way to go. We want to catch if the damn thing is mapped and mapping_mapped() leaves a race? Fine, so let's not use mapping_mapped() at all. Have a private vm_operations - a copy of generic_file_vm_ops with ->open()/->close() added to it. Incrementing/decrementing a per-block_device atomic counter, with the increment side done under your rwsem, to make sure that the 0->positive transition doesn't change in the critical section of set_blocksize(). And don't use generic_file_mmap(), just do file_accessed() and set ->vm_ops to that local copy.
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)
No, this is crap.

We don't introduce random hooks like this just because the block layer has shit-for-brains and cannot be bothered to do things right.

The fact is, the whole locking in the block layer open routine is total and utter crap. It doesn't lock the right thing, even with your change *anyway* (or with the change Jens had). Absolutely nothing in "mmap_region()" cares at all about the block-size anywhere - it's generic, after all - so locking around it is f*cking pointless. There is no way in hell that the caller of ->mmap can *ever* care about the block size, since it never even looks at it.

Don't do random crap like this.

Why does the code think that mmap matters so much anyway? As you say, the mmap itself does *nothing*. It has no impact for the block size.

Linus

On Wed, Nov 28, 2012 at 9:25 AM, Mikulas Patocka wrote:
>
>
> On Wed, 28 Nov 2012, Jens Axboe wrote:
>
>> On 2012-11-28 04:57, Mikulas Patocka wrote:
>> >
>> > This patch is wrong because you must check if the device is mapped while
>> > holding bdev->bd_block_size_semaphore (because
>> > bdev->bd_block_size_semaphore prevents new mappings from being created)
>>
>> No it doesn't. If you read the patch, that was moved to i_mmap_mutex.
>
> Hmm, it was wrong before the patch and it is wrong after the patch too.
>
> The problem is that ->mmap method doesn't do the actual mapping, the
> caller of ->mmap (mmap_region) does it. So we must actually catch
> mmap_region and protect it with the lock, not ->mmap.
[PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)
On Wed, 28 Nov 2012, Jens Axboe wrote:

> On 2012-11-28 04:57, Mikulas Patocka wrote:
> >
> > This patch is wrong because you must check if the device is mapped while
> > holding bdev->bd_block_size_semaphore (because
> > bdev->bd_block_size_semaphore prevents new mappings from being created)
>
> No it doesn't. If you read the patch, that was moved to i_mmap_mutex.

Hmm, it was wrong before the patch and it is wrong after the patch too.

The problem is that the ->mmap method doesn't do the actual mapping, the caller of ->mmap (mmap_region) does it. So we must actually catch mmap_region and protect it with the lock, not ->mmap.

Mikulas

---

Introduce a method to catch mmap_region

For block devices, we must prevent the device from being mapped while the block size is being changed. This was implemented by catching the mmap method and taking bd_block_size_semaphore in it.

The problem is that the generic_file_mmap method does almost nothing, it only sets vma->vm_ops = &generic_file_vm_ops; all the hard work is done by the caller, mmap_region. Consequently, locking the mmap method is insufficient to prevent the race condition.

The race condition can happen for example this way:

process 1: Starts mapping a block device, it goes to mmap_region, calls file->f_op->mmap (blkdev_mmap - that acquires and releases bd_block_size_semaphore), and a reschedule happens just after blkdev_mmap returns.

process 2: Changes the block device size, goes into set_blocksize, acquires percpu_down_write, calls mapping_mapped to test if the device is mapped (it still isn't). Then, it reschedules.

process 1: Continues in mmap_region, successfully finishes the mapping.

process 2: Continues in set_blocksize, changes the block size while the block device is mapped (which is incorrect and may result in a crash or misbehavior).

To fix this possible race condition, we need to catch the function that does the real mapping - mmap_region. This patch adds a new method, file_operations->mmap_region.
mmap_region calls the method file_operations->mmap_region if it is non-NULL; otherwise, it calls generic_mmap_region.

Signed-off-by: Mikulas Patocka

---
 fs/block_dev.c     | 12
 include/linux/fs.h |  4
 include/linux/mm.h |  3 +++
 mm/mmap.c          | 10 ++
 4 files changed, 25 insertions(+), 4 deletions(-)

Index: linux-3.7-rc7/include/linux/fs.h
===
--- linux-3.7-rc7.orig/include/linux/fs.h	2012-11-28 17:45:55.0 +0100
+++ linux-3.7-rc7/include/linux/fs.h	2012-11-28 18:02:45.0 +0100
@@ -27,6 +27,7 @@
 #include <linux/lockdep.h>
 #include <linux/percpu-rwsem.h>
 #include <linux/blk_types.h>
+#include <linux/mm_types.h>
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
@@ -1528,6 +1529,9 @@ struct file_operations {
 	unsigned int (*poll) (struct file *, struct poll_table_struct *);
 	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
 	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
+	unsigned long (*mmap_region)(struct file *, unsigned long,
+				     unsigned long, unsigned long, vm_flags_t,
+				     unsigned long);
 	int (*mmap) (struct file *, struct vm_area_struct *);
 	int (*open) (struct inode *, struct file *);
 	int (*flush) (struct file *, fl_owner_t id);
Index: linux-3.7-rc7/include/linux/mm.h
===
--- linux-3.7-rc7.orig/include/linux/mm.h	2012-11-28 17:45:55.0 +0100
+++ linux-3.7-rc7/include/linux/mm.h	2012-11-28 17:47:12.0 +0100
@@ -1444,6 +1444,9 @@ extern unsigned long get_unmapped_area(s
 extern unsigned long mmap_region(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long flags,
 	vm_flags_t vm_flags, unsigned long pgoff);
+extern unsigned long generic_mmap_region(struct file *file, unsigned long addr,
+	unsigned long len, unsigned long flags,
+	vm_flags_t vm_flags, unsigned long pgoff);
 extern unsigned long do_mmap_pgoff(struct file *, unsigned long, unsigned long,
 	unsigned long, unsigned long, unsigned long);
Index: linux-3.7-rc7/mm/mmap.c
===
--- linux-3.7-rc7.orig/mm/mmap.c	2012-11-28 17:45:55.0 +0100
+++ linux-3.7-rc7/mm/mmap.c	2012-11-28 17:57:40.0 +0100
@@ -1244,6 +1244,16 @@ unsigned long mmap_region(struct file *f
 	unsigned long len, unsigned long flags,
 	vm_flags_t vm_flags, unsigned long pgoff)
 {
+	if (file && file->f_op->mmap_region)
+		return file->f_op->mmap_region(file, addr, len, flags, vm_flags,
+					       pgoff);
+	return generic_mmap_region(file, addr, len, flags, vm_flags, pgoff);
+}
+
+unsigned long generic_mmap_region(struct file *file, unsigned long addr,
+	unsigned long len,
Re: Recent kernel "mount" slow
On Wed, Nov 28, 2012 at 4:33 PM, Jens Axboe wrote:
> On 2012-11-28 04:57, Mikulas Patocka wrote:
>> On Tue, 27 Nov 2012, Jens Axboe wrote:
>>> On 2012-11-27 11:06, Jeff Chua wrote:
>>>> On Tue, Nov 27, 2012 at 3:38 PM, Jens Axboe wrote:
>>>>> On 2012-11-27 06:57, Jeff Chua wrote:
>>>>>> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua wrote:
>>>>>>> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka wrote:
>>>>>>>> So it's better to slow down mount.
>>>>>>>
>>>>>>> I am quite proud of the linux boot time pitting against other OS. Even
>>>>>>> with 10 partitions. Linux can boot up in just a few seconds, but now
>>>>>>> you're saying that we need to do this semaphore check at boot up. By
>>>>>>> doing so, it's inducing additional 4 seconds during boot up.
>>>>>>
>>>>>> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
>>>>>> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
>>>>>> kind of degradation would this cause or just the same?
>>>>>
>>>>> It'd likely be the same slow down time wise, but as a percentage it
>>>>> would appear smaller on a slower disk.
>>>>>
>>>>> Could you please test Mikulas' suggestion of changing
>>>>> synchronize_sched() in include/linux/percpu-rwsem.h to
>>>>> synchronize_sched_expedited()?
>>>>
>>>> Tested. It seems as fast as before, but may be a "tick" slower. Just
>>>> perception. I was getting pretty much 0.012s with everything reverted.
>>>> With synchronize_sched_expedited(), it seems to be 0.012s ~ 0.013s.
>>>> So, it's good.
>>>
>>> Excellent
>>>
>>>>> linux-next also has a re-write of the per-cpu rw sems, out of Andrews
>>>>> tree. It would be a good data point it you could test that, too.
>>>>
>>>> Tested. It's slower. 0.350s. But still faster than 0.500s without the patch.
>>>
>>> Makes sense, it's 2 synchronize_sched() instead of 3. So it doesn't fix
>>> the real issue, which is having to do synchronize_sched() in the first
>>> place.
>>>
>>>> # time mount /dev/sda1 /mnt; sync; sync; umount /mnt
>>>>
>>>> So, here's the comparison ...
>>>>
>>>> 0.500s 3.7.0-rc7
>>>> 0.168s 3.7.0-rc2
>>>> 0.012s 3.6.0
>>>> 0.013s 3.7.0-rc7 + synchronize_sched_expedited()
>>>> 0.350s 3.7.0-rc7 + Oleg's patch.
>>>
>>> I wonder how many of them are due to changing to the same block size.
>>> Does the below patch make a difference?
>>
>> This patch is wrong because you must check if the device is mapped while
>> holding bdev->bd_block_size_semaphore (because
>> bdev->bd_block_size_semaphore prevents new mappings from being created)
>
> No it doesn't. If you read the patch, that was moved to i_mmap_mutex.
>
>> I'm sending another patch that has the same effect.
>>
>> Note that ext[234] filesystems set blocksize to 1024 temporarily during
>> mount, so it doesn't help much (it only helps for other filesystems, such
>> as jfs). For ext[234], you have a device with default block size 4096, the
>> filesystem sets block size to 1024 during mount, reads the super block and
>> sets it back to 4096.
>
> That is true, hence I was hesitant to think it'll actually help. In any
> case, basically any block device will have at least one blocksize
> transitioned when being mounted for the first time. I wonder if we just
> shouldn't default to having a 4kb soft block size to avoid that one,
> though it is working around the issue to some degree.

I tested on reiserfs. It helped. 0.012s as in 3.6.0, but as Mikulas mentioned, it didn't really improve much for ext2.

Jeff.
Re: Recent kernel "mount" slow
On 2012-11-28 04:57, Mikulas Patocka wrote:
> On Tue, 27 Nov 2012, Jens Axboe wrote:
>> On 2012-11-27 11:06, Jeff Chua wrote:
>>> On Tue, Nov 27, 2012 at 3:38 PM, Jens Axboe wrote:
>>>> On 2012-11-27 06:57, Jeff Chua wrote:
>>>>> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua wrote:
>>>>>> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka wrote:
>>>>>>> So it's better to slow down mount.
>>>>>>
>>>>>> I am quite proud of the linux boot time pitting against other OS. Even
>>>>>> with 10 partitions. Linux can boot up in just a few seconds, but now
>>>>>> you're saying that we need to do this semaphore check at boot up. By
>>>>>> doing so, it's inducing additional 4 seconds during boot up.
>>>>>
>>>>> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
>>>>> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
>>>>> kind of degradation would this cause or just the same?
>>>>
>>>> It'd likely be the same slow down time wise, but as a percentage it
>>>> would appear smaller on a slower disk.
>>>>
>>>> Could you please test Mikulas' suggestion of changing
>>>> synchronize_sched() in include/linux/percpu-rwsem.h to
>>>> synchronize_sched_expedited()?
>>>
>>> Tested. It seems as fast as before, but may be a "tick" slower. Just
>>> perception. I was getting pretty much 0.012s with everything reverted.
>>> With synchronize_sched_expedited(), it seems to be 0.012s ~ 0.013s.
>>> So, it's good.
>>
>> Excellent
>>
>>>> linux-next also has a re-write of the per-cpu rw sems, out of Andrews
>>>> tree. It would be a good data point it you could test that, too.
>>>
>>> Tested. It's slower. 0.350s. But still faster than 0.500s without the patch.
>>
>> Makes sense, it's 2 synchronize_sched() instead of 3. So it doesn't fix
>> the real issue, which is having to do synchronize_sched() in the first
>> place.
>>
>>> # time mount /dev/sda1 /mnt; sync; sync; umount /mnt
>>>
>>> So, here's the comparison ...
>>>
>>> 0.500s 3.7.0-rc7
>>> 0.168s 3.7.0-rc2
>>> 0.012s 3.6.0
>>> 0.013s 3.7.0-rc7 + synchronize_sched_expedited()
>>> 0.350s 3.7.0-rc7 + Oleg's patch.
>>
>> I wonder how many of them are due to changing to the same block size.
>> Does the below patch make a difference?
>
> This patch is wrong because you must check if the device is mapped while
> holding bdev->bd_block_size_semaphore (because
> bdev->bd_block_size_semaphore prevents new mappings from being created)

No it doesn't. If you read the patch, that was moved to i_mmap_mutex.

> I'm sending another patch that has the same effect.
>
> Note that ext[234] filesystems set blocksize to 1024 temporarily during
> mount, so it doesn't help much (it only helps for other filesystems, such
> as jfs). For ext[234], you have a device with default block size 4096, the
> filesystem sets block size to 1024 during mount, reads the super block and
> sets it back to 4096.

That is true, hence I was hesitant to think it'll actually help. In any case, basically any block device will have at least one blocksize transitioned when being mounted for the first time. I wonder if we just shouldn't default to having a 4kb soft block size to avoid that one, though it is working around the issue to some degree.

--
Jens Axboe
Re: Recent kernel mount slow
On 2012-11-28 04:57, Mikulas Patocka wrote: On Tue, 27 Nov 2012, Jens Axboe wrote: On 2012-11-27 11:06, Jeff Chua wrote: On Tue, Nov 27, 2012 at 3:38 PM, Jens Axboe ax...@kernel.dk wrote: On 2012-11-27 06:57, Jeff Chua wrote: On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua jeff.chua.li...@gmail.com wrote: On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka mpato...@redhat.com wrote: So it's better to slow down mount. I am quite proud of the linux boot time pitting against other OS. Even with 10 partitions. Linux can boot up in just a few seconds, but now you're saying that we need to do this semaphore check at boot up. By doing so, it's inducing additional 4 seconds during boot up. By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU (2.8GHz). I wonder if those on slower hard disk or slower CPU, what kind of degradation would this cause or just the same? It'd likely be the same slow down time wise, but as a percentage it would appear smaller on a slower disk. Could you please test Mikulas' suggestion of changing synchronize_sched() in include/linux/percpu-rwsem.h to synchronize_sched_expedited()? Tested. It seems as fast as before, but may be a tick slower. Just perception. I was getting pretty much 0.012s with everything reverted. With synchronize_sched_expedited(), it seems to be 0.012s ~ 0.013s. So, it's good. Excellent linux-next also has a re-write of the per-cpu rw sems, out of Andrews tree. It would be a good data point it you could test that, too. Tested. It's slower. 0.350s. But still faster than 0.500s without the patch. Makes sense, it's 2 synchronize_sched() instead of 3. So it doesn't fix the real issue, which is having to do synchronize_sched() in the first place. # time mount /dev/sda1 /mnt; sync; sync; umount /mnt So, here's the comparison ... 0.500s 3.7.0-rc7 0.168s 3.7.0-rc2 0.012s 3.6.0 0.013s 3.7.0-rc7 + synchronize_sched_expedited() 0.350s 3.7.0-rc7 + Oleg's patch. I wonder how many of them are due to changing to the same block size. 
Does the below patch make a difference? This patch is wrong because you must check if the device is mapped while holding bdev-bd_block_size_semaphore (because bdev-bd_block_size_semaphore prevents new mappings from being created) No it doesn't. If you read the patch, that was moved to i_mmap_mutex. I'm sending another patch that has the same effect. Note that ext[234] filesystems set blocksize to 1024 temporarily during mount, so it doesn't help much (it only helps for other filesystems, such as jfs). For ext[234], you have a device with default block size 4096, the filesystem sets block size to 1024 during mount, reads the super block and sets it back to 4096. That is true, hence I was hesitant to think it'll actually help. In any case, basically any block device will have at least one blocksize transitioned when being mounted for the first time. I wonder if we just shouldn't default to having a 4kb soft block size to avoid that one, though it is working around the issue to some degree. -- Jens Axboe -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Recent kernel mount slow
On Wed, Nov 28, 2012 at 4:33 PM, Jens Axboe ax...@kernel.dk wrote: On 2012-11-28 04:57, Mikulas Patocka wrote: On Tue, 27 Nov 2012, Jens Axboe wrote: On 2012-11-27 11:06, Jeff Chua wrote: On Tue, Nov 27, 2012 at 3:38 PM, Jens Axboe ax...@kernel.dk wrote: On 2012-11-27 06:57, Jeff Chua wrote: On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua jeff.chua.li...@gmail.com wrote: On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka mpato...@redhat.com wrote: So it's better to slow down mount. I am quite proud of the linux boot time pitting against other OS. Even with 10 partitions. Linux can boot up in just a few seconds, but now you're saying that we need to do this semaphore check at boot up. By doing so, it's inducing additional 4 seconds during boot up. By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU (2.8GHz). I wonder if those on slower hard disk or slower CPU, what kind of degradation would this cause or just the same? It'd likely be the same slow down time wise, but as a percentage it would appear smaller on a slower disk. Could you please test Mikulas' suggestion of changing synchronize_sched() in include/linux/percpu-rwsem.h to synchronize_sched_expedited()? Tested. It seems as fast as before, but may be a tick slower. Just perception. I was getting pretty much 0.012s with everything reverted. With synchronize_sched_expedited(), it seems to be 0.012s ~ 0.013s. So, it's good. Excellent linux-next also has a re-write of the per-cpu rw sems, out of Andrews tree. It would be a good data point it you could test that, too. Tested. It's slower. 0.350s. But still faster than 0.500s without the patch. Makes sense, it's 2 synchronize_sched() instead of 3. So it doesn't fix the real issue, which is having to do synchronize_sched() in the first place. # time mount /dev/sda1 /mnt; sync; sync; umount /mnt So, here's the comparison ... 0.500s 3.7.0-rc7 0.168s 3.7.0-rc2 0.012s 3.6.0 0.013s 3.7.0-rc7 + synchronize_sched_expedited() 0.350s 3.7.0-rc7 + Oleg's patch. 
I wonder how many of them are due to changing to the same block size. Does the below patch make a difference? This patch is wrong because you must check if the device is mapped while holding bdev-bd_block_size_semaphore (because bdev-bd_block_size_semaphore prevents new mappings from being created) No it doesn't. If you read the patch, that was moved to i_mmap_mutex. I'm sending another patch that has the same effect. Note that ext[234] filesystems set blocksize to 1024 temporarily during mount, so it doesn't help much (it only helps for other filesystems, such as jfs). For ext[234], you have a device with default block size 4096, the filesystem sets block size to 1024 during mount, reads the super block and sets it back to 4096. That is true, hence I was hesitant to think it'll actually help. In any case, basically any block device will have at least one blocksize transitioned when being mounted for the first time. I wonder if we just shouldn't default to having a 4kb soft block size to avoid that one, though it is working around the issue to some degree. I tested on reiserfs. It helped. 0.012s as in 3.6.0, but as Mikulas mentioned, it didn't really improve much for ext2. Jeff. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)
On Wed, 28 Nov 2012, Jens Axboe wrote: On 2012-11-28 04:57, Mikulas Patocka wrote: This patch is wrong because you must check if the device is mapped while holding bdev-bd_block_size_semaphore (because bdev-bd_block_size_semaphore prevents new mappings from being created) No it doesn't. If you read the patch, that was moved to i_mmap_mutex. Hmm, it was wrong before the patch and it is wrong after the patch too. The problem is that -mmap method doesn't do the actual mapping, the caller of -mmap (mmap_region) does it. So we must actually catch mmap_region and protect it with the lock, not -mmap. Mikulas --- Introduce a method to catch mmap_region For block devices, we must prevent the device from being mapped while block size is being changed. This was implemented by catching the mmap method and taking bd_block_size_semaphore in it. The problem is that the generic_file_mmap method does almost nothing, it only sets vma-vm_ops = generic_file_vm_ops, all the hard work is done by the caller, mmap_region. Consequently, locking the mmap method is insufficient, to prevent the race condition. The race condition can happen for example this way: process 1: Starts mapping a block device, it goes to mmap_region, calls file-f_op-mmap (blkdev_mmap - that acquires and releases bd_block_size_semaphore), and reschedule happens just after blkdev_mmap returns. process 2: Changes block device size, goes into set_blocksize, acquires percpu_down_write, calls mapping_mapped to test if the device is mapped (it still isn't). Then, it reschedules. process 1: Continues in mmap_region, successfully finishes the mapping. process 2: Continues in set_blocksize, changes the block size while the block device is mapped. (which is incorrect and may result in a crash or misbehavior). To fix this possible race condition, we need to catch the function that does real mapping - mmap_region. This patch adds a new method, file_operations-mmap_region. 
mmap_region calls the method file_operations-mmap_region if it is non-NULL, otherwise, it calls generic_mmap_region. Signed-off-by: Mikulas Patocka mpato...@redhat.com --- fs/block_dev.c | 12 include/linux/fs.h |4 include/linux/mm.h |3 +++ mm/mmap.c | 10 ++ 4 files changed, 25 insertions(+), 4 deletions(-) Index: linux-3.7-rc7/include/linux/fs.h === --- linux-3.7-rc7.orig/include/linux/fs.h 2012-11-28 17:45:55.0 +0100 +++ linux-3.7-rc7/include/linux/fs.h2012-11-28 18:02:45.0 +0100 @@ -27,6 +27,7 @@ #include linux/lockdep.h #include linux/percpu-rwsem.h #include linux/blk_types.h +#include linux/mm_types.h #include asm/byteorder.h #include uapi/linux/fs.h @@ -1528,6 +1529,9 @@ struct file_operations { unsigned int (*poll) (struct file *, struct poll_table_struct *); long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); long (*compat_ioctl) (struct file *, unsigned int, unsigned long); + unsigned long (*mmap_region)(struct file *, unsigned long, +unsigned long, unsigned long, vm_flags_t, +unsigned long); int (*mmap) (struct file *, struct vm_area_struct *); int (*open) (struct inode *, struct file *); int (*flush) (struct file *, fl_owner_t id); Index: linux-3.7-rc7/include/linux/mm.h === --- linux-3.7-rc7.orig/include/linux/mm.h 2012-11-28 17:45:55.0 +0100 +++ linux-3.7-rc7/include/linux/mm.h2012-11-28 17:47:12.0 +0100 @@ -1444,6 +1444,9 @@ extern unsigned long get_unmapped_area(s extern unsigned long mmap_region(struct file *file, unsigned long addr, unsigned long len, unsigned long flags, vm_flags_t vm_flags, unsigned long pgoff); +extern unsigned long generic_mmap_region(struct file *file, unsigned long addr, + unsigned long len, unsigned long flags, + vm_flags_t vm_flags, unsigned long pgoff); extern unsigned long do_mmap_pgoff(struct file *, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long); Index: linux-3.7-rc7/mm/mmap.c === --- linux-3.7-rc7.orig/mm/mmap.c2012-11-28 17:45:55.0 +0100 +++ linux-3.7-rc7/mm/mmap.c 
2012-11-28 17:57:40.0 +0100 @@ -1244,6 +1244,16 @@ unsigned long mmap_region(struct file *f unsigned long len, unsigned long flags, vm_flags_t vm_flags, unsigned long pgoff) { + if (file file-f_op-mmap_region) + return file-f_op-mmap_region(file, addr, len, flags, vm_flags, + pgoff); + return generic_mmap_region(file, addr, len, flags, vm_flags, pgoff); +} + +unsigned long
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)
No, this is crap. We don't introduce random hooks like this just because the block layer has shit-for-brains and cannot be bothered to do things right. The fact is, the whole locking in the block layer open routine is total and utter crap. It doesn't lock the right thing, even with your change *anyway* (or with the change Jens had). Absolutely nothing in mmap_region() cares at all about the block-size anywhere - it's generic, after all - so locking around it is f*cking pointless. There is no way in hell that the caller of -mmap can *ever* care about the block size, since it never even looks at it. Don't do random crap like this. Why does the code think that mmap matters so much anyway? As you say, the mmap itself does *nothing*. It has no impact for the block size. Linus On Wed, Nov 28, 2012 at 9:25 AM, Mikulas Patocka mpato...@redhat.com wrote: On Wed, 28 Nov 2012, Jens Axboe wrote: On 2012-11-28 04:57, Mikulas Patocka wrote: This patch is wrong because you must check if the device is mapped while holding bdev-bd_block_size_semaphore (because bdev-bd_block_size_semaphore prevents new mappings from being created) No it doesn't. If you read the patch, that was moved to i_mmap_mutex. Hmm, it was wrong before the patch and it is wrong after the patch too. The problem is that -mmap method doesn't do the actual mapping, the caller of -mmap (mmap_region) does it. So we must actually catch mmap_region and protect it with the lock, not -mmap. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)
On Wed, Nov 28, 2012 at 11:15:12AM -0800, Linus Torvalds wrote: No, this is crap. We don't introduce random hooks like this just because the block layer has shit-for-brains and cannot be bothered to do things right. The fact is, the whole locking in the block layer open routine is total and utter crap. It doesn't lock the right thing, even with your change *anyway* (or with the change Jens had). Absolutely nothing in mmap_region() cares at all about the block-size anywhere - it's generic, after all - so locking around it is f*cking pointless. There is no way in hell that the caller of -mmap can *ever* care about the block size, since it never even looks at it. Don't do random crap like this. Why does the code think that mmap matters so much anyway? As you say, the mmap itself does *nothing*. It has no impact for the block size. ... and here I was, trying to massage a reply into form that would be at least borderline printable... Anyway, this is certainly the wrong way to go. We want to catch if the damn thing is mapped and mapping_mapped() leaves a race? Fine, so let's not use mapping_mapped() at all. Have a private vm_operations - a copy of generic_file_vm_ops with -open()/-close() added to it. Incrementing/decrementing a per-block_device atomic counter, with increment side done under your rwsem, to make sure that 0-positive transition doesn't change in critical section of set_blocksize(). And don't use generic_file_mmap(), just do file_accessed() and set -vm_ops to that local copy. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)
On Wed, 28 Nov 2012, Linus Torvalds wrote:
> No, this is crap.
>
> We don't introduce random hooks like this just because the block layer has shit-for-brains and cannot be bothered to do things right.
>
> The fact is, the whole locking in the block layer open routine is total and utter crap. It doesn't lock the right thing, even with your change *anyway* (or with the change Jens had). Absolutely nothing in mmap_region() cares at all about the block-size anywhere - it's generic, after all - so locking around it is f*cking pointless.
>
> There is no way in hell that the caller of ->mmap can *ever* care about the block size, since it never even looks at it.
>
> Don't do random crap like this. Why does the code think that mmap matters so much anyway? As you say, the mmap itself does *nothing*. It has no impact for the block size.
>
> Linus

mmap_region() doesn't care about the block size. But a lot of page-in/page-out code does.

The problem is that once the block device is mapped, page faults or page writeback can happen anytime - so the simplest solution is to not allow the block device being mapped while we change the block size.

The function set_blocksize takes bd_block_size_semaphore for write (that blocks read/write/mmap), then it calls sync_blockdev (now we are sure that there is no more writeback), then it changes the block size, then it calls kill_bdev (now we are sure that there are no more pages with buffers with the old blocksize).

If you want to allow changing the block size while a block device is mapped, you'd have to add explicit locks around all mm callbacks (so that the block size can't change while the callback is in progress) - and even then, there are some unsolvable cases - i.e. what are you going to do if the user mlocks a mapped block device and you change the block size of that device? - you can't drop the pages (that would violate mlock semantics) and you can't leave them there (because they have buffers with the wrong size).
If you don't like what I sent, propose a different solution.

Mikulas
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)
On Wed, Nov 28, 2012 at 11:43 AM, Al Viro <v...@zeniv.linux.org.uk> wrote:
> Have a private vm_operations - a copy of generic_file_vm_ops with ->open()/->close() added to it.

That sounds more reasonable. However, I suspect the *most* reasonable thing to do is to just remove the whole damn thing. We really shouldn't care about mmap. If somebody does a mmap on a block device, and somebody else then changes the block size, why-ever should we bother to go through any contortions at *all* to make that kind of insane behavior do anything sane at all.

Just let people mmap things. Then just let the normal page cache invalidation work right.

In fact, it is entirely possible that we could/should just not even invalidate the page cache at all, just make sure that the buffer heads attached to any pages get disconnected. No?

Linus
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)
On Wed, Nov 28, 2012 at 11:50 AM, Mikulas Patocka <mpato...@redhat.com> wrote:
> mmap_region() doesn't care about the block size. But a lot of page-in/page-out code does.

That seems a bogus argument. mmap() is in *no* way special. The exact same thing happens for regular read/write. Yet somehow the mmap code is special-cased, while the normal read-write code is not.

I suspect it might be *easier* to trigger some issues with mmap, but that still isn't a good enough reason to special-case it. We don't add locking to one place just because that one place shows some race condition more easily. We fix the locking.

So for example, maybe the code that *actually* cares about the buffer size (the stuff that allocates buffers in fs/buffer.c) needs to take that new percpu read lock. Basically, any caller of alloc_page_buffers()/create_empty_buffers() or whatever.

I also wonder whether we need it *at*all*. I suspect that we could easily have multiple block-sizes these days for the same block device. It *used* to be (millions of years ago, when dinosaurs roamed the earth) that the block buffers were global and shared with all users of a partition. But that hasn't been true since we started using the page cache, and I suspect that some of the block size changing issues are simply entirely stale.

Yeah, yeah, there could be some coherency issues if people write to the block device through different block sizes, but I think we have those coherency issues anyway. The page-cache is not coherent across different mapping inodes anyway.

So I really suspect that some of this is legacy logic. Or at least perhaps _should_ be.

Linus
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)
On Wed, Nov 28, 2012 at 12:03 PM, Linus Torvalds <torva...@linux-foundation.org> wrote:
> mmap() is in *no* way special. The exact same thing happens for regular read/write. Yet somehow the mmap code is special-cased, while the normal read-write code is not.

I just double-checked, because it's been a long time since I actually looked at the code. But yeah, block device read/write uses the pure page cache functions. IOW, it has the *exact* same IO engine as mmap() would have.

So here's my suggestion:

 - get rid of *all* the locking in aio_read/write and the splice paths

 - get rid of all the stupid mmap games

 - instead, add them to the functions that actually use blkdev_get_block() and blkdev_get_blocks() and nowhere else. That's a fairly limited number of functions: blkdev_{read,write}page(), blkdev_direct_IO() and blkdev_write_{begin,end}()

Doesn't that sound simpler? And more logical: it protects the actual places that use the block size of the device.

I dunno. Maybe there is some fundamental reason why the above is broken, but it seems to be a much simpler approach. Sure, you need to guarantee that the people who get the write-lock cannot possibly cause IO while holding it, but since the only reason to get the write lock would be to change the block size, that should be pretty simple, no?

Yeah, yeah, I'm probably missing something fundamental, but the above sounds like the simple approach to fixing things. Aiming for having the block size read-lock be taken by the things that pass in the block-size itself. It would be nice for things to be logical and straightforward.

Linus
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)
On Wed, Nov 28, 2012 at 12:13 PM, Linus Torvalds <torva...@linux-foundation.org> wrote:
> I dunno. Maybe there is some fundamental reason why the above is broken, but it seems to be a much simpler approach. Sure, you need to guarantee that the people who get the write-lock cannot possibly cause IO while holding it, but since the only reason to get the write lock would be to change the block size, that should be pretty simple, no?

Here is a *COMPLETELY* untested patch. Caveat emptor. It will probably do unspeakable things to your family and pets.

Linus

patch.diff
Description: Binary data
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)
On Wed, Nov 28, 2012 at 12:32 PM, Linus Torvalds <torva...@linux-foundation.org> wrote:
> Here is a *COMPLETELY* untested patch. Caveat emptor. It will probably do unspeakable things to your family and pets.

Btw, *if* this approach works, I suspect we could just switch the bd_block_size_semaphore semaphore to be a regular rw-sem. Why? Because now it's no longer ever gotten in the cached IO paths, we only get it when we're doing much more expensive things (ie actual IO, and buffer head allocations etc etc). As long as we just work with the page cache, we never get to the whole lock at all. Which means that the whole percpu-optimized thing is likely no longer all that relevant.

But that's an independent thing, and it's only true *if* my patch works. It looks fine on paper, but maybe there's something fundamentally broken about it.

One big change my patch does is to move the sync_bdev/kill_bdev to *after* changing the block size. It does that so that it can guarantee that any old data (which didn't see the new block size) will be sync'ed even if there is new IO coming in as we change the block size. The old code locked the whole sync() region, which doesn't work with my approach, since the sync will do IO and would thus cause potential deadlocks while holding the rwsem for writing.

So with this patch, as the block size changes, you can actually have some old pages with the old block size *and* some different new pages with the new block size all at the same time. It should all be perfectly fine, but it's worth pointing out.

(It probably won't trigger in practice, though, since doing IO while somebody else is changing the blocksize is fundamentally an odd thing to do, but whatever.
I also suspect that we *should* perhaps use the inode->i_sem thing to serialize concurrent block size changes, but that's again an independent issue)

Linus
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)
On Wed, 28 Nov 2012, Linus Torvalds wrote:
> On Wed, Nov 28, 2012 at 12:03 PM, Linus Torvalds <torva...@linux-foundation.org> wrote:
>> mmap() is in *no* way special. The exact same thing happens for regular read/write. Yet somehow the mmap code is special-cased, while the normal read-write code is not.
>
> I just double-checked, because it's been a long time since I actually looked at the code. But yeah, block device read/write uses the pure page cache functions. IOW, it has the *exact* same IO engine as mmap() would have.
>
> So here's my suggestion:
>
>  - get rid of *all* the locking in aio_read/write and the splice paths
>
>  - get rid of all the stupid mmap games
>
>  - instead, add them to the functions that actually use blkdev_get_block() and blkdev_get_blocks() and nowhere else. That's a fairly limited number of functions: blkdev_{read,write}page(), blkdev_direct_IO() and blkdev_write_{begin,end}()
>
> Doesn't that sound simpler? And more logical: it protects the actual places that use the block size of the device.
>
> I dunno. Maybe there is some fundamental reason why the above is broken, but it seems to be a much simpler approach. Sure, you need to guarantee that the people who get the write-lock cannot possibly cause IO while holding it, but since the only reason to get the write lock would be to change the block size, that should be pretty simple, no?
>
> Yeah, yeah, I'm probably missing something fundamental, but the above sounds like the simple approach to fixing things. Aiming for having the block size read-lock be taken by the things that pass in the block-size itself. It would be nice for things to be logical and straightforward.
>
> Linus

The problem with this approach is that it is very easy to miss points where it is assumed that the block size doesn't change - and if you miss a point, it results in a hidden bug that has little possibility of being found.

For example, __block_write_full_page and __block_write_begin do

	if (!page_has_buffers(page)) {
		create_empty_buffers...
	}

and then they do

	WARN_ON(bh->b_size != blocksize)
	err = get_block(inode, block, bh, 1)

... so if the buffers were left over from some previous call to create_empty_buffers with a different blocksize, that WARN_ON is triggered. And it's not only a harmless warning - now bh->b_size is left set to the old block size, but bh->b_blocknr is set to a number that was calculated according to the new block size - and when you submit that buffer with submit_bh, it is written to the wrong place!

Now, prove that there are no more bugs like this.

Locking the whole read/write/mmap operations is crude, but at least it can be done without a thorough review of all the memory management code.

Mikulas
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)
On Wed, 28 Nov 2012, Linus Torvalds wrote:
> On Wed, Nov 28, 2012 at 12:32 PM, Linus Torvalds <torva...@linux-foundation.org> wrote:
>> Here is a *COMPLETELY* untested patch. Caveat emptor. It will probably do unspeakable things to your family and pets.
>
> Btw, *if* this approach works, I suspect we could just switch the bd_block_size_semaphore semaphore to be a regular rw-sem. Why? Because now it's no longer ever gotten in the cached IO paths, we only get it when we're doing much more expensive things (ie actual IO, and buffer head allocations etc etc). As long as we just work with the page cache, we never get to the whole lock at all. Which means that the whole percpu-optimized thing is likely no longer all that relevant.

Using a normal semaphore degrades direct-IO performance on raw devices, I measured that (45.1s with a normal rw-semaphore vs. 42.8s with the percpu-rw-semaphore). It could be measured on ramdisk or high-performance SSDs.

Mikulas
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)
On Wed, Nov 28, 2012 at 1:29 PM, Mikulas Patocka <mpato...@redhat.com> wrote:
> The problem with this approach is that it is very easy to miss points where it is assumed that the block size doesn't change - and if you miss a point, it results in a hidden bug that has little possibility of being found.

Umm. Mikulas, *your* approach has resulted in bugs. So let's not throw stones in glass houses, shall we?

The whole reason for this long thread (and several threads before it) is that your model isn't working and is causing problems. I already pointed out how bogus your arguments about mmap() locking were, and then you have the gall to talk about potential bugs, when I have pointed you to *actual* bugs, and actual mistakes.

> For example, __block_write_full_page and __block_write_begin do
>
>	if (!page_has_buffers(page)) {
>		create_empty_buffers...
>	}
>
> and then they do
>
>	WARN_ON(bh->b_size != blocksize)
>	err = get_block(inode, block, bh, 1)

Right. And none of this is new.

> ... so if the buffers were left over from some previous call to create_empty_buffers with a different blocksize, that WARN_ON is trigged.

None of this can happen.

> Locking the whole read/write/mmap operations is crude, but at least it can be done without thorough review of all the memory management code.

Umm. Which you clearly didn't do, and raised totally crap arguments for.

In contrast, I have a very simple argument for the correctness of my patch: every single user of the get_block[s]() interface now takes the lock for as long as get_block[s]() is passed off to somebody else. And since get_block[s]() is the only way to create those empty buffers, I think I pretty much proved exactly what you ask for.

And THAT is the whole point and advantage of making locking sane. Sane locking you can actually *think* about!

In contrast, locking around mmap() is absolutely *guaranteed* to be insane, because mmap() doesn't actually do any of the IO that the lock is supposed to protect against!
So Mikulas, quite frankly, your arguments argue against you. When you say "Locking the whole read/write/mmap operations is crude, but at least it can be done without thorough [review]", you are doubly correct: it *is* crude, and it clearly *was* done without thought, since it's a f*cking idiotic AND INCORRECT thing to do.

Seriously. Locking around mmap() is insane. It leads to insane semantics (the whole EBUSY thing is purely because of that problem) and it leads to bad code (your "let's add a new mmap_region hook" is just disgusting, and while Al's idea of doing it in the existing ->open method is at least not nasty, it's definitely extra code and complexity).

There are serious *CORRECTNESS* advantages to simplicity and directness. And locking at the right point is definitely very much part of that.

Anyway, as far as block size goes, we have exactly two cases:

 - random IO that does not care about the block size, and will just do whatever the current block size is (ie normal anonymous accesses to the block device). This is the case that needs the locking - but it only needs it around the individual page operations, ie exactly where I put it. In fact, they can happily deal with different block sizes for different pages, they don't really care.

 - mounted filesystems etc that require a particular block size and set it at mount time, and they have exclusivity rules

The second case is the case that actually calls set_blocksize(), and if kill_bdev() doesn't get rid of the old blocksizes, then they have always been in trouble, and would always _continue_ to be in trouble, regardless of locking.

Linus
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)
On Wed, Nov 28, 2012 at 2:52 PM, Linus Torvalds <torva...@linux-foundation.org> wrote:
>> For example, __block_write_full_page and __block_write_begin do
>>
>>	if (!page_has_buffers(page)) {
>>		create_empty_buffers...
>>	}
>>
>> and then they do
>>
>>	WARN_ON(bh->b_size != blocksize)
>>	err = get_block(inode, block, bh, 1)
>
> Right. And none of this is new.

.. which, btw, is not to say that *other* things aren't new. They are. The change to actually change the block device buffer size before then calling sync_bdev() is definitely a real change, and as mentioned, I have not tested the patch in any way.

If any block device driver were to actually compare the IO size they get against the bdev->bd_block_size thing, they'd see very different behavior (ie they'd see the new block size as they are asked to write out the old blocks with the old block size). So it does change semantics, no question about that. I don't think any block device does it, though.

A bigger issue is for things that emulate what blkdev.c does, and don't do the locking. I see code in md/bitmap.c that seems a bit suspicious, for example. That said, it's not *new* breakage, and the "lock at mmap/read/write() time" approach doesn't fix it either (since the mapping will be different for the underlying MD device). So I do think that we should take a look at all the users of alloc_page_buffers() and create_empty_buffers() to see what *they* do to protect the block-size, but I think that's an independent issue from the raw device access case in fs/block_dev.c..

I guess I have to actually test my patch. I don't have very interesting test-cases, though.

Linus
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)
On Wed, 28 Nov 2012, Linus Torvalds wrote:
>> For example, __block_write_full_page and __block_write_begin do
>>
>>	if (!page_has_buffers(page)) {
>>		create_empty_buffers...
>>	}
>>
>> and then they do
>>
>>	WARN_ON(bh->b_size != blocksize)
>>	err = get_block(inode, block, bh, 1)
>
> Right. And none of this is new.
>
>> ... so if the buffers were left over from some previous call to create_empty_buffers with a different blocksize, that WARN_ON is trigged.
>
> None of this can happen.

It can happen. Take your patch (the one that moves bd_block_size_semaphore into blkdev_readpage, blkdev_writepage and blkdev_write_begin). Insert msleep(1000) into set_blocksize, just before sync_blockdev. Run this program:

#define _XOPEN_SOURCE 500
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>

static char array[4096];

int main(void)
{
	int h;
	system("dmsetup remove test 2>/dev/null");
	if (system("dmsetup create test --table '0 1 zero'")) exit(1);
	h = open("/dev/mapper/test", O_RDWR);
	if (h < 0) perror("open"), exit(1);
	if (pread(h, array, 512, 0) != 512) perror("pread"), exit(1);
	if (system("dmsetup load test --table '0 8 zero'")) exit(1);
	if (system("dmsetup suspend test")) exit(1);
	if (system("dmsetup resume test")) exit(1);
	if (system("blockdev --setbsz 2048 /dev/mapper/test &")) exit(1);
	usleep(50);
	if (pwrite(h, array, 4096, 0) != 4096) perror("pwrite"), exit(1);
	return 0;
}

--- it triggers:

WARNING: at fs/buffer.c:1830 in __block_write_begin
[ 1243.30] Backtrace:
[ 1243.33]  [40230ba8] block_write_begin+0x70/0xd0
[ 1243.40]  [402350cc] blkdev_write_begin+0xb4/0x208
[ 1243.48]  [401a9f10] generic_file_buffered_write+0x248/0x348
[ 1243.57]  [401ac8c4] __generic_file_aio_write+0x1fc/0x388
[ 1243.66]  [40235e74] blkdev_aio_write+0x64/0xf0
[ 1243.74]  [401f2108] do_sync_write+0xd0/0x128
[ 1243.81]  [401f2930] vfs_write+0xa0/0x180
[ 1243.88]  [401f2ecc] sys_pwrite64+0xb4/0xd8
[ 1243.95]  [40122104] parisc_pwrite64+0x1c/0x28
[ 1244.02]  [40106060] syscall_exit+0x0/0x14

I'm not saying that your approach is wrong, you
just have to carefully review all memory management code for assumptions that the block size doesn't change.

Mikulas
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)
[ Sorry, I was offline for a while driving kids around ]

On Wed, Nov 28, 2012 at 4:38 PM, Mikulas Patocka <mpato...@redhat.com> wrote:
> It can happen. Take your patch (the one that moves bd_block_size_semaphore into blkdev_readpage, blkdev_writepage and blkdev_write_begin).

Interesting. The code *has* the block size (it's in bh->b_size), but it actually then uses the inode blocksize instead, and verifies the two against each other. It could just have used the block size directly (and then used the inode i_blkbits only when no buffers existed), avoiding that dependency entirely..

It actually does the same thing (with the same verification) in __block_write_full_page() and (_without_ the verification) in __block_commit_write().

Ho humm. All of those places actually do hold the rwsem for reading, it's just that I don't want to hold it for writing over the sync..

Need to think on this,

Linus
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)
On Wed, 28 Nov 2012, Linus Torvalds wrote:
> A bigger issue is for things that emulate what blkdev.c does, and don't do the locking. I see code in md/bitmap.c that seems a bit suspicious, for example. That said, it's not *new* breakage, and the "lock at mmap/read/write() time" approach doesn't fix it either (since the mapping will be different for the underlying MD device). So I do think that we should take a look at all the users of alloc_page_buffers() and create_empty_buffers() to see what *they* do to protect the block-size, but I think that's an independent issue from the raw device access case in fs/block_dev.c..
>
> I guess I have to actually test my patch. I don't have very interesting test-cases, though.
>
> Linus

Yes, md looks suspicious. It disables write access in deny_bitmap_write_access (that function looks buggy on its own - what happens if i_writecount == 0 or if it is already negative on entry?). So we could disallow changing the block size if i_writecount > 0; that should do the trick.

Mikulas
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)
On Wed, Nov 28, 2012 at 6:04 PM, Linus Torvalds <torva...@linux-foundation.org> wrote:
> Interesting. The code *has* the block size (it's in bh->b_size), but it actually then uses the inode blocksize instead, and verifies the two against each other. It could just have used the block size directly (and then used the inode i_blkbits only when no buffers existed), avoiding that dependency entirely..

Looking more at this code, that really would be the nicest solution. There's two cases for the whole get_block() thing:

 - filesystems. The block size will not change randomly, and get_block() seriously depends on the block size.

 - the raw device. The block size *will* change, but to simplify the problem, get_block() is a 1:1 mapping, so it doesn't even care about the block size because it will always return bh->b_blocknr = nr.

So we *could* just say that all the fs/buffer.c code should use inode->i_blkbits for creating buffers (because that's the size new buffers should always use), but use bh->b_size for any *existing* buffer use.

And looking at it, it's even simple. Except for one *very* annoying thing: several users really don't want the size of the buffer, they really do want the *shift* of the buffer size. In fact, that single issue seems to be the reason why inode->i_blkbits is really used in fs/buffer.c.

Otherwise it would be fairly trivial to just make the pattern be just a simple

	if (!page_has_buffers(page))
		create_empty_buffers(page, 1 << inode->i_blkbits, 0);
	head = page_buffers(page);
	blocksize = head->b_size;

and just use the blocksize that way, without any other games. All done, no silly WARN_ON() to verify against some global block-size, and the fs/buffer.c code would be perfectly simple, and would have no problem at all with multiple different blocksizes in different pages (the page lock serializes the buffers and thus the blocksize at the per-page level).
But the fact that the code wants to do things like

	block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);

seriously seems to be the main thing that keeps us using 'inode->i_blkbits'. Calculating bbits from bh->b_size is just costly enough to hurt (not everywhere, but on some machines).

Very annoying.

Linus
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)
On Wed, Nov 28, 2012 at 6:58 PM, Linus Torvalds <torva...@linux-foundation.org> wrote:
> But the fact that the code wants to do things like
>
>	block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
>
> seriously seems to be the main thing that keeps us using 'inode->i_blkbits'. Calculating bbits from bh->b_size is just costly enough to hurt (not everywhere, but on some machines). Very annoying.

Hmm. Here's a patch that does that anyway. I'm not 100% happy with the whole ilog2 thing, but at the same time, in other cases it actually seems to improve code generation (ie gets rid of the whole unnecessary two dereferences through page->mapping->host just to get the block size, when we have it in the buffer-head that we have to touch *anyway*).

Comments? Again, untested.

And I notice that Al Viro hasn't been cc'd, which is sad, since he's been involved in much of fs/block_dev.c.

Al - this is an independent patch to fs/buffer.c to make fs/block_dev.c able to change the block size of a block device while there is IO in progress that may still use the old block size. The discussion has been on fsdevel and lkml, but you may have missed it...

Linus

patch.diff
Description: Binary data
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)
On Wed, Nov 28, 2012 at 10:16:21PM -0800, Linus Torvalds wrote:
> On Wed, Nov 28, 2012 at 6:58 PM, Linus Torvalds <torva...@linux-foundation.org> wrote:
>> But the fact that the code wants to do things like
>>
>>	block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
>>
>> seriously seems to be the main thing that keeps us using 'inode->i_blkbits'. Calculating bbits from bh->b_size is just costly enough to hurt (not everywhere, but on some machines). Very annoying.
>
> Hmm. Here's a patch that does that anyway. I'm not 100% happy with the whole ilog2 thing, but at the same time, in other cases it actually seems to improve code generation (ie gets rid of the whole unnecessary two dereferences through page->mapping->host just to get the block size, when we have it in the buffer-head that we have to touch *anyway*).
>
> Comments? Again, untested. And I notice that Al Viro hasn't been cc'd, which is sad, since he's been involved in much of fs/block_dev.c. Al - this is an independent patch to fs/buffer.c to make fs/block_dev.c able to change the block size of a block device while there is IO in progress that may still use the old block size. The discussion has been on fsdevel and lkml, but you may have missed it...

Umm... set_blocksize() is calling kill_bdev(), which does truncate_inode_pages(mapping, 0). What's going to happen to data in the dirty pages? IO in progress is not the only thing to worry about...
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)
On Thu, Nov 29, 2012 at 06:25:19AM +0000, Al Viro wrote:
> Umm... set_blocksize() is calling kill_bdev(), which does truncate_inode_pages(mapping, 0). What's going to happen to data in the dirty pages? IO in progress is not the only thing to worry about...

Note that sync_blockdev() a few lines prior to that is good only if we have no other processes doing write(2) (or dirtying the mmapped pages, for that matter). The window isn't too wide, but...
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)
On Wed, Nov 28, 2012 at 10:25 PM, Al Viro <v...@zeniv.linux.org.uk> wrote:
> Umm... set_blocksize() is calling kill_bdev(), which does truncate_inode_pages(mapping, 0). What's going to happen to data in the dirty pages? IO in progress is not the only thing to worry about...

Hmm. Yes. I think it works by virtue of "if you change the blocksize while there is active IO, you're insane and you deserve whatever you get".

It shouldn't even be fundamentally hard to make it work, although I suspect it would be more code than it would be worth. The sane model would be to not use truncate_inode_pages(), but instead just walk the pages and get rid of the buffer heads with the wrong size. Preferably *combining* that with the sync_blockdev(). We have no real reason to even invalidate the page cache, it's just the buffers we want to get rid of.

But I suspect it's true that none of that is really *worth* it, considering that nobody likely wants to do any concurrent IO. We don't want to crash, or corrupt the data structures, but I suspect "you get what you deserve" might actually be the right model ;)

So the current sync_blockdev()+kill_bdev() takes care of the *sane* case (we flush any data that happened *before* the block size change), and any concurrent writes with block-size changes are "good luck with that".

Linus
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)
On Wed, Nov 28, 2012 at 10:30 PM, Al Viro <v...@zeniv.linux.org.uk> wrote:
> Note that sync_blockdev() a few lines prior to that is good only if we
> have no other processes doing write(2) (or dirtying the mmapped pages,
> for that matter). The window isn't too wide, but...

So with Mikulas' patches, the write actually would block (at write level) due to the locking. The mmap'ed pages may be around and flushed, but the logic to not allow currently *active* mmaps (with the rather nasty random -EBUSY return value) should mean that there is no race. Or rather, there's a race, but it results in that EBUSY thing.

With my simplified locking, the sync_blockdev() is right before (not "a few lines prior to") the kill_bdev(), and in a perfect world they'd actually be one single operation (write back and invalidate pages with the wrong block-size). But they aren't.

Linus
Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)
On Wed, Nov 28, 2012 at 10:37:27PM -0800, Linus Torvalds wrote:
> On Wed, Nov 28, 2012 at 10:30 PM, Al Viro <v...@zeniv.linux.org.uk> wrote:
>> Note that sync_blockdev() a few lines prior to that is good only if we
>> have no other processes doing write(2) (or dirtying the mmapped pages,
>> for that matter). The window isn't too wide, but...
>
> So with Mikulas' patches, the write actually would block (at write
> level) due to the locking. The mmap'ed pages may be around and flushed,
> but the logic to not allow currently *active* mmaps (with the rather
> nasty random -EBUSY return value) should mean that there is no race. Or
> rather, there's a race, but it results in that EBUSY thing.

Same as with fs mounted on it, or the sucker having been claimed for RAID array, etc. Frankly, I'm more than slightly tempted to make bdev mmap() just claim the sodding device exclusive for as long as it's mmapped...

In principle, I agree, but... I still have nightmares from mmap/truncate races way back. You are stepping into what used to be a really nasty minefield. I'll look into that, but it's *definitely* not -rc8 fodder.
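Al's idea of claiming the device exclusive for as long as it is mmapped can be reduced to a toy refcount model. The names below are hypothetical, not the real kernel bd_claim machinery; the point is only that the first map takes the claim and the last unmap releases it:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model: the first mmap of the block device takes an exclusive
 * claim (as a mount or RAID assembly would), and the claim is dropped
 * only when the last mapping goes away.
 */
struct toy_bdev {
	int  mmap_count;	/* number of live mappings */
	bool claimed;		/* exclusive claim held? */
};

static int toy_bdev_mmap(struct toy_bdev *b)
{
	if (b->mmap_count == 0) {
		if (b->claimed)
			return -1;	/* already claimed: -EBUSY in the kernel */
		b->claimed = true;	/* claim for as long as it's mapped */
	}
	b->mmap_count++;
	return 0;
}

static void toy_bdev_munmap(struct toy_bdev *b)
{
	if (--b->mmap_count == 0)
		b->claimed = false;	/* last unmap releases the claim */
}
```

This is exactly the shape that worries Al: the release side has to be tied to teardown of the last mapping, which is where the old mmap/truncate race territory begins.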
Re: Recent kernel "mount" slow
On Tue, 27 Nov 2012, Jens Axboe wrote:

> On 2012-11-27 11:06, Jeff Chua wrote:
> > On Tue, Nov 27, 2012 at 3:38 PM, Jens Axboe wrote:
> >> On 2012-11-27 06:57, Jeff Chua wrote:
> >>> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua wrote:
> >>>> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka wrote:
> >>>>> So it's better to slow down mount.
> >>>>
> >>>> I am quite proud of the linux boot time pitting against other OS. Even
> >>>> with 10 partitions. Linux can boot up in just a few seconds, but now
> >>>> you're saying that we need to do this semaphore check at boot up. By
> >>>> doing so, it's inducing additional 4 seconds during boot up.
> >>>
> >>> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
> >>> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
> >>> kind of degradation would this cause or just the same?
> >>
> >> It'd likely be the same slow down time wise, but as a percentage it
> >> would appear smaller on a slower disk.
> >>
> >> Could you please test Mikulas' suggestion of changing
> >> synchronize_sched() in include/linux/percpu-rwsem.h to
> >> synchronize_sched_expedited()?
> >
> > Tested. It seems as fast as before, but may be a "tick" slower. Just
> > perception. I was getting pretty much 0.012s with everything reverted.
> > With synchronize_sched_expedited(), it seems to be 0.012s ~ 0.013s.
> > So, it's good.
>
> Excellent
>
> >> linux-next also has a re-write of the per-cpu rw sems, out of Andrews
> >> tree. It would be a good data point if you could test that, too.
> >
> > Tested. It's slower. 0.350s. But still faster than 0.500s without the patch.
>
> Makes sense, it's 2 synchronize_sched() instead of 3. So it doesn't fix
> the real issue, which is having to do synchronize_sched() in the first
> place.
>
> > # time mount /dev/sda1 /mnt; sync; sync; umount /mnt
> >
> > So, here's the comparison ...
> >
> > 0.500s 3.7.0-rc7
> > 0.168s 3.7.0-rc2
> > 0.012s 3.6.0
> > 0.013s 3.7.0-rc7 + synchronize_sched_expedited()
> > 0.350s 3.7.0-rc7 + Oleg's patch.
>
> I wonder how many of them are due to changing to the same block size.
> Does the below patch make a difference?

This patch is wrong because you must check if the device is mapped while holding bdev->bd_block_size_semaphore (because bdev->bd_block_size_semaphore prevents new mappings from being created). I'm sending another patch that has the same effect.

Note that ext[234] filesystems set blocksize to 1024 temporarily during mount, so it doesn't help much (it only helps for other filesystems, such as jfs). For ext[234], you have a device with default block size 4096; the filesystem sets the block size to 1024 during mount, reads the super block and sets it back to 4096. If you want, you can fix ext[234] to avoid this behavior.

Mikulas

> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 1a1e5e3..f041c56 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -126,29 +126,28 @@ int set_blocksize(struct block_device *bdev, int size)
>  	if (size < bdev_logical_block_size(bdev))
>  		return -EINVAL;
>
> -	/* Prevent starting I/O or mapping the device */
> -	percpu_down_write(&bdev->bd_block_size_semaphore);
> -
>  	/* Check that the block device is not memory mapped */
>  	mapping = bdev->bd_inode->i_mapping;
>  	mutex_lock(&mapping->i_mmap_mutex);
>  	if (mapping_mapped(mapping)) {
>  		mutex_unlock(&mapping->i_mmap_mutex);
> -		percpu_up_write(&bdev->bd_block_size_semaphore);
>  		return -EBUSY;
>  	}
>  	mutex_unlock(&mapping->i_mmap_mutex);
>
>  	/* Don't change the size if it is same as current */
>  	if (bdev->bd_block_size != size) {
> -		sync_blockdev(bdev);
> -		bdev->bd_block_size = size;
> -		bdev->bd_inode->i_blkbits = blksize_bits(size);
> -		kill_bdev(bdev);
> +		/* Prevent starting I/O */
> +		percpu_down_write(&bdev->bd_block_size_semaphore);
> +		if (bdev->bd_block_size != size) {
> +			sync_blockdev(bdev);
> +			bdev->bd_block_size = size;
> +			bdev->bd_inode->i_blkbits = blksize_bits(size);
> +			kill_bdev(bdev);
> +		}
> +		percpu_up_write(&bdev->bd_block_size_semaphore);
>  	}
>
> -	percpu_up_write(&bdev->bd_block_size_semaphore);
> -
>  	return 0;
>  }
>
> @@ -1649,14 +1648,12 @@ EXPORT_SYMBOL_GPL(blkdev_aio_write);
>
>  static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
>  {
> +	struct address_space *mapping = file->f_mapping;
>  	int ret;
> -	struct block_device *bdev = I_BDEV(file->f_mapping->host);
> -
> -	percpu_down_read(&bdev->bd_block_size_semaphore);
>
> +	mutex_lock(&mapping->i_mmap_mutex);
>  	ret = generic_file_mmap(file, vma);
> -
> -	percpu_up_read(&bdev->bd_block_size_semaphore);
> +	mutex_unlock(&mapping->i_mmap_mutex);
>
>  	return ret;
>  }
>
> --
Re: Recent kernel "mount" slow
On 2012-11-27 11:06, Jeff Chua wrote:
> On Tue, Nov 27, 2012 at 3:38 PM, Jens Axboe wrote:
>> On 2012-11-27 06:57, Jeff Chua wrote:
>>> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua wrote:
>>>> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka wrote:
>>>>> So it's better to slow down mount.
>>>>
>>>> I am quite proud of the linux boot time pitting against other OS. Even
>>>> with 10 partitions. Linux can boot up in just a few seconds, but now
>>>> you're saying that we need to do this semaphore check at boot up. By
>>>> doing so, it's inducing additional 4 seconds during boot up.
>>>
>>> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
>>> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
>>> kind of degradation would this cause or just the same?
>>
>> It'd likely be the same slow down time wise, but as a percentage it
>> would appear smaller on a slower disk.
>>
>> Could you please test Mikulas' suggestion of changing
>> synchronize_sched() in include/linux/percpu-rwsem.h to
>> synchronize_sched_expedited()?
>
> Tested. It seems as fast as before, but may be a "tick" slower. Just
> perception. I was getting pretty much 0.012s with everything reverted.
> With synchronize_sched_expedited(), it seems to be 0.012s ~ 0.013s.
> So, it's good.

Excellent

>> linux-next also has a re-write of the per-cpu rw sems, out of Andrews
>> tree. It would be a good data point if you could test that, too.
>
> Tested. It's slower. 0.350s. But still faster than 0.500s without the patch.

Makes sense, it's 2 synchronize_sched() instead of 3. So it doesn't fix the real issue, which is having to do synchronize_sched() in the first place.

> # time mount /dev/sda1 /mnt; sync; sync; umount /mnt
>
> So, here's the comparison ...
>
> 0.500s 3.7.0-rc7
> 0.168s 3.7.0-rc2
> 0.012s 3.6.0
> 0.013s 3.7.0-rc7 + synchronize_sched_expedited()
> 0.350s 3.7.0-rc7 + Oleg's patch.

I wonder how many of them are due to changing to the same block size. Does the below patch make a difference?

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 1a1e5e3..f041c56 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -126,29 +126,28 @@ int set_blocksize(struct block_device *bdev, int size)
 	if (size < bdev_logical_block_size(bdev))
 		return -EINVAL;
 
-	/* Prevent starting I/O or mapping the device */
-	percpu_down_write(&bdev->bd_block_size_semaphore);
-
 	/* Check that the block device is not memory mapped */
 	mapping = bdev->bd_inode->i_mapping;
 	mutex_lock(&mapping->i_mmap_mutex);
 	if (mapping_mapped(mapping)) {
 		mutex_unlock(&mapping->i_mmap_mutex);
 		return -EBUSY;
 	}
 	mutex_unlock(&mapping->i_mmap_mutex);
 
 	/* Don't change the size if it is same as current */
 	if (bdev->bd_block_size != size) {
-		sync_blockdev(bdev);
-		bdev->bd_block_size = size;
-		bdev->bd_inode->i_blkbits = blksize_bits(size);
-		kill_bdev(bdev);
+		/* Prevent starting I/O */
+		percpu_down_write(&bdev->bd_block_size_semaphore);
+		if (bdev->bd_block_size != size) {
+			sync_blockdev(bdev);
+			bdev->bd_block_size = size;
+			bdev->bd_inode->i_blkbits = blksize_bits(size);
+			kill_bdev(bdev);
+		}
+		percpu_up_write(&bdev->bd_block_size_semaphore);
 	}
 
-	percpu_up_write(&bdev->bd_block_size_semaphore);
-
 	return 0;
 }
 
@@ -1649,14 +1648,12 @@ EXPORT_SYMBOL_GPL(blkdev_aio_write);
 
 static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
 {
+	struct address_space *mapping = file->f_mapping;
 	int ret;
-	struct block_device *bdev = I_BDEV(file->f_mapping->host);
-
-	percpu_down_read(&bdev->bd_block_size_semaphore);
 
+	mutex_lock(&mapping->i_mmap_mutex);
 	ret = generic_file_mmap(file, vma);
-
-	percpu_up_read(&bdev->bd_block_size_semaphore);
+	mutex_unlock(&mapping->i_mmap_mutex);
 
 	return ret;
 }

--
Jens Axboe
Re: Recent kernel "mount" slow
On Tue, Nov 27, 2012 at 3:38 PM, Jens Axboe wrote:
> On 2012-11-27 06:57, Jeff Chua wrote:
>> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua wrote:
>>> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka wrote:
>>>> So it's better to slow down mount.
>>>
>>> I am quite proud of the linux boot time pitting against other OS. Even
>>> with 10 partitions. Linux can boot up in just a few seconds, but now
>>> you're saying that we need to do this semaphore check at boot up. By
>>> doing so, it's inducing additional 4 seconds during boot up.
>>
>> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
>> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
>> kind of degradation would this cause or just the same?
>
> It'd likely be the same slow down time wise, but as a percentage it
> would appear smaller on a slower disk.
>
> Could you please test Mikulas' suggestion of changing
> synchronize_sched() in include/linux/percpu-rwsem.h to
> synchronize_sched_expedited()?

Tested. It seems as fast as before, but may be a "tick" slower. Just perception. I was getting pretty much 0.012s with everything reverted. With synchronize_sched_expedited(), it seems to be 0.012s ~ 0.013s. So, it's good.

> linux-next also has a re-write of the per-cpu rw sems, out of Andrews
> tree. It would be a good data point if you could test that, too.

Tested. It's slower. 0.350s. But still faster than 0.500s without the patch.

# time mount /dev/sda1 /mnt; sync; sync; umount /mnt

So, here's the comparison ...

0.500s 3.7.0-rc7
0.168s 3.7.0-rc2
0.012s 3.6.0
0.013s 3.7.0-rc7 + synchronize_sched_expedited()
0.350s 3.7.0-rc7 + Oleg's patch.

Thanks,
Jeff.
Re: Recent kernel "mount" slow
Jens,

Limited access now at Incheon Airport. Will try the patch out when I arrive.

Thanks,
Jeff

On 11/27/12, Jens Axboe <ax...@kernel.dk> wrote:
> On 2012-11-27 08:38, Jens Axboe wrote:
>> On 2012-11-27 06:57, Jeff Chua wrote:
>>> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua wrote:
>>>> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka wrote:
>>>>> So it's better to slow down mount.
>>>>
>>>> I am quite proud of the linux boot time pitting against other OS. Even
>>>> with 10 partitions. Linux can boot up in just a few seconds, but now
>>>> you're saying that we need to do this semaphore check at boot up. By
>>>> doing so, it's inducing additional 4 seconds during boot up.
>>>
>>> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
>>> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
>>> kind of degradation would this cause or just the same?
>>
>> It'd likely be the same slow down time wise, but as a percentage it
>> would appear smaller on a slower disk.
>>
>> Could you please test Mikulas' suggestion of changing
>> synchronize_sched() in include/linux/percpu-rwsem.h to
>> synchronize_sched_expedited()?
>>
>> linux-next also has a re-write of the per-cpu rw sems, out of Andrews
>> tree. It would be a good data point if you could test that, too.
>>
>> In any case, the slow down definitely isn't acceptable. Fixing an
>> obscure issue like block sizes changing while O_DIRECT is in flight
>> definitely does NOT warrant a mount slow down.
>
> Here's Oleg's patch, might be easier for you than switching to
> linux-next. Please try that.
>
> From: Oleg Nesterov <o...@redhat.com>
> Subject: percpu_rw_semaphore: reimplement to not block the readers
> unnecessarily
>
> Currently the writer does msleep() plus synchronize_sched() 3 times to
> acquire/release the semaphore, and during this time the readers are
> blocked completely. Even if the "write" section was not actually started
> or if it was already finished.
>
> With this patch down_write/up_write does synchronize_sched() twice and
> down_read/up_read are still possible during this time, just they use the
> slow path.
>
> percpu_down_write() first forces the readers to use rw_semaphore and
> increment the "slow" counter to take the lock for reading, then it
> takes that rw_semaphore for writing and blocks the readers.
>
> Also. With this patch the code relies on the documented behaviour of
> synchronize_sched(), it doesn't try to pair synchronize_sched() with
> barrier.
>
> Signed-off-by: Oleg Nesterov <o...@redhat.com>
> Reviewed-by: Paul E. McKenney <paul...@linux.vnet.ibm.com>
> Cc: Linus Torvalds <torva...@linux-foundation.org>
> Cc: Mikulas Patocka <mpato...@redhat.com>
> Cc: Peter Zijlstra <pet...@infradead.org>
> Cc: Ingo Molnar <mi...@elte.hu>
> Cc: Srikar Dronamraju <sri...@linux.vnet.ibm.com>
> Cc: Ananth N Mavinakayanahalli <ana...@in.ibm.com>
> Cc: Anton Arapov <an...@redhat.com>
> Cc: Jens Axboe <ax...@kernel.dk>
> Signed-off-by: Andrew Morton <a...@linux-foundation.org>
> ---
>
>  include/linux/percpu-rwsem.h |  85 +++---
>  lib/Makefile                 |   2
>  lib/percpu-rwsem.c           | 123 +
>  3 files changed, 138 insertions(+), 72 deletions(-)
>
> diff -puN include/linux/percpu-rwsem.h~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily include/linux/percpu-rwsem.h
> --- a/include/linux/percpu-rwsem.h~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily
> +++ a/include/linux/percpu-rwsem.h
> @@ -2,82 +2,25 @@
>  #define _LINUX_PERCPU_RWSEM_H
>
>  #include <linux/mutex.h>
> +#include <linux/rwsem.h>
>  #include <linux/percpu.h>
> -#include <linux/rcupdate.h>
> -#include <linux/delay.h>
> +#include <linux/wait.h>
>
>  struct percpu_rw_semaphore {
> -	unsigned __percpu *counters;
> -	bool locked;
> -	struct mutex mtx;
> +	unsigned int __percpu *fast_read_ctr;
> +	struct mutex writer_mutex;
> +	struct rw_semaphore rw_sem;
> +	atomic_t slow_read_ctr;
> +	wait_queue_head_t write_waitq;
>  };
>
> -#define light_mb()	barrier()
> -#define heavy_mb()	synchronize_sched()
> +extern void percpu_down_read(struct percpu_rw_semaphore *);
> +extern void percpu_up_read(struct percpu_rw_semaphore *);
>
> -static inline void percpu_down_read(struct percpu_rw_semaphore *p)
> -{
> -	rcu_read_lock_sched();
> -	if (unlikely(p->locked)) {
> -		rcu_read_unlock_sched();
> -		mutex_lock(&p->mtx);
> -		this_cpu_inc(*p->counters);
> -		mutex_unlock(&p->mtx);
> -		return;
> -	}
> -	this_cpu_inc(*p->counters);
> -	rcu_read_unlock_sched();
> -	light_mb(); /* A, between read of p->locked and read of data, paired with D */
> -}
> -
> -static inline void percpu_up_read(struct percpu_rw_semaphore *p)
> -{
> -	light_mb(); /* B, between read of the data and write to p->counter, paired with C */
> -	this_cpu_dec(*p->counters);
> -}
> -
> -static inline unsigned __percpu_count(unsigned __percpu *counters)
> -{
> -	unsigned total = 0;
> -	int cpu;
> -
> -	for_each_possible_cpu(cpu)
> -		total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu));
> -
> -	return total;
> -}
> -
> -static inline void percpu_down_write(struct
Re: Recent kernel "mount" slow
On 2012-11-27 08:38, Jens Axboe wrote: > On 2012-11-27 06:57, Jeff Chua wrote: >> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua wrote: >>> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka >>> wrote: So it's better to slow down mount. >>> >>> I am quite proud of the linux boot time pitting against other OS. Even >>> with 10 partitions. Linux can boot up in just a few seconds, but now >>> you're saying that we need to do this semaphore check at boot up. By >>> doing so, it's inducing additional 4 seconds during boot up. >> >> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU >> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what >> kind of degradation would this cause or just the same? > > It'd likely be the same slow down time wise, but as a percentage it > would appear smaller on a slower disk. > > Could you please test Mikulas' suggestion of changing > synchronize_sched() in include/linux/percpu-rwsem.h to > synchronize_sched_expedited()? > > linux-next also has a re-write of the per-cpu rw sems, out of Andrew's > tree. It would be a good data point if you could test that, too. > > In any case, the slow down definitely isn't acceptable. Fixing an > obscure issue like block sizes changing while O_DIRECT is in flight > definitely does NOT warrant a mount slow down.

Here's Oleg's patch; it might be easier for you than switching to linux-next. Please try it.

From: Oleg Nesterov
Subject: percpu_rw_semaphore: reimplement to not block the readers unnecessarily

Currently the writer does msleep() plus synchronize_sched() 3 times to acquire/release the semaphore, and during this time the readers are blocked completely. Even if the "write" section was not actually started or if it was already finished.

With this patch down_write/up_write does synchronize_sched() twice and down_read/up_read are still possible during this time, just they use the slow path.
percpu_down_write() first forces the readers to use rw_semaphore and increment the "slow" counter to take the lock for reading, then it takes that rw_semaphore for writing and blocks the readers.

Also. With this patch the code relies on the documented behaviour of synchronize_sched(), it doesn't try to pair synchronize_sched() with barrier.

Signed-off-by: Oleg Nesterov
Reviewed-by: Paul E. McKenney
Cc: Linus Torvalds
Cc: Mikulas Patocka
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Srikar Dronamraju
Cc: Ananth N Mavinakayanahalli
Cc: Anton Arapov
Cc: Jens Axboe
Signed-off-by: Andrew Morton
---

 include/linux/percpu-rwsem.h |   85 +++---
 lib/Makefile                 |    2
 lib/percpu-rwsem.c           |  123 +
 3 files changed, 138 insertions(+), 72 deletions(-)

diff -puN include/linux/percpu-rwsem.h~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily include/linux/percpu-rwsem.h
--- a/include/linux/percpu-rwsem.h~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily
+++ a/include/linux/percpu-rwsem.h
@@ -2,82 +2,25 @@
 #define _LINUX_PERCPU_RWSEM_H
 
 #include <linux/mutex.h>
+#include <linux/rwsem.h>
 #include <linux/percpu.h>
-#include <linux/rcupdate.h>
-#include <linux/delay.h>
+#include <linux/wait.h>
 
 struct percpu_rw_semaphore {
-	unsigned __percpu *counters;
-	bool locked;
-	struct mutex mtx;
+	unsigned int __percpu	*fast_read_ctr;
+	struct mutex		writer_mutex;
+	struct rw_semaphore	rw_sem;
+	atomic_t		slow_read_ctr;
+	wait_queue_head_t	write_waitq;
 };
 
-#define light_mb()	barrier()
-#define heavy_mb()	synchronize_sched()
+extern void percpu_down_read(struct percpu_rw_semaphore *);
+extern void percpu_up_read(struct percpu_rw_semaphore *);
 
-static inline void percpu_down_read(struct percpu_rw_semaphore *p)
-{
-	rcu_read_lock_sched();
-	if (unlikely(p->locked)) {
-		rcu_read_unlock_sched();
-		mutex_lock(&p->mtx);
-		this_cpu_inc(*p->counters);
-		mutex_unlock(&p->mtx);
-		return;
-	}
-	this_cpu_inc(*p->counters);
-	rcu_read_unlock_sched();
-	light_mb(); /* A, between read of p->locked and read of data, paired with D */
-}
-
-static inline void percpu_up_read(struct percpu_rw_semaphore *p)
-{
-	light_mb(); /* B, between read of the data and write to p->counter, paired with C */
-	this_cpu_dec(*p->counters);
-}
-
-static inline unsigned __percpu_count(unsigned __percpu *counters)
-{
-	unsigned total = 0;
-	int cpu;
-
-	for_each_possible_cpu(cpu)
-		total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu));
-
-	return total;
-}
-
-static inline void percpu_down_write(struct percpu_rw_semaphore *p)
-{
-	mutex_lock(&p->mtx);
-	p->locked = true;
-	synchronize_sched(); /* make sure that all readers exit the rcu_read_lock_sched region */
-	while (__percpu_count(p->counters))
-		msleep(1);
-	heavy_mb(); /* C, between read of p->counter and
Re: Recent kernel "mount" slow
On 2012-11-27 06:57, Jeff Chua wrote: > On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua wrote: >> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka wrote: >>> So it's better to slow down mount. >> >> I am quite proud of the linux boot time pitting against other OS. Even >> with 10 partitions. Linux can boot up in just a few seconds, but now >> you're saying that we need to do this semaphore check at boot up. By >> doing so, it's inducing additional 4 seconds during boot up. > > By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU > (2.8GHz). I wonder if those on slower hard disk or slower CPU, what > kind of degradation would this cause or just the same? It'd likely be the same slow down time wise, but as a percentage it would appear smaller on a slower disk. Could you please test Mikulas' suggestion of changing synchronize_sched() in include/linux/percpu-rwsem.h to synchronize_sched_expedited()? linux-next also has a re-write of the per-cpu rw sems, out of Andrew's tree. It would be a good data point if you could test that, too. In any case, the slow down definitely isn't acceptable. Fixing an obscure issue like block sizes changing while O_DIRECT is in flight definitely does NOT warrant a mount slow down. -- Jens Axboe
Re: Recent kernel "mount" slow
On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua wrote: > On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka wrote: >> So it's better to slow down mount. > > I am quite proud of the linux boot time pitting against other OS. Even > with 10 partitions. Linux can boot up in just a few seconds, but now > you're saying that we need to do this semaphore check at boot up. By > doing so, it's inducing additional 4 seconds during boot up. By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU (2.8GHz). I wonder if those on slower hard disk or slower CPU, what kind of degradation would this cause or just the same? Thanks, Jeff
Re: Recent kernel "mount" slow
On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka wrote: > So it's better to slow down mount. I am quite proud of the linux boot time pitting against other OS. Even with 10 partitions. Linux can boot up in just a few seconds, but now you're saying that we need to do this semaphore check at boot up. By doing so, it's inducing an additional 4 seconds during boot up. What about moving the locking mechanism to the "mount" program itself? Won't that be more feasible? As for the cases of simultaneous mounts, it's usually the administrator that's doing something bad. I would say this is not a kernel issue. Jeff
Re: Recent kernel "mount" slow
On Sat, 24 Nov 2012, Jeff Chua wrote: > On Fri, Nov 23, 2012 at 9:24 PM, Jens Axboe wrote: > > On 2012-11-22 20:21, Linus Torvalds wrote: > >> Doesn't sound like a fsdevel issue since it seems to be independent of > >> filesystems. More like some generic block layer thing. Adding Jens > >> (and quoting the whole thing) > >> > >> Jens, any ideas? Most of your stuff came in after -rc2, which would > >> fit with the fact that most of the slowdown seems to be after -rc2 > >> according to Jeff. > > > > No ideas. Looking at what went in from my side, only the rq plug sorting > > is a core change, and that should not cause any change in behaviour for > > a single device. That's commit 975927b9. > > > >> Jeff, more bisecting would be good, though. > > > > Probably required, yes... > > > This one slows mount from 0.012s to 0.168s. > > commit 62ac665ff9fc07497ca524bd20d6a96893d11071 > Author: Mikulas Patocka > Date: Wed Sep 26 07:46:43 2012 +0200 > > blockdev: turn a rw semaphore into a percpu rw semaphore > > > There were couple of more changes to percpu-rw-semaphores after > 3.7.0-rc2 and those slows mount further from 0.168s to 0.500s. I don't > really know, but I'm suspecting these. Still bisecting. The problem there is that you either use normal semaphores and slow down I/O, or you use percpu-semaphores, which don't slow down I/O but do slow down mount. So it's better to slow down mount. (if you don't use any semaphore at all, as it was in the 3.6 kernel and before, there is a race condition that can crash the kernel if someone does mount and direct I/O read on the same device at the same time) You can improve mount time if you change all occurrences of synchronize_sched() in include/linux/percpu-rwsem.h to synchronize_sched_expedited(). But some people say that synchronize_sched_expedited() is bad for real time latency. (can there be something like: if (realtime) synchronize_sched(); else synchronize_sched_expedited(); ?)

Mikulas
Re: Recent kernel "mount" slow
On Sat, Nov 24, 2012 at 7:31 AM, Jeff Chua wrote: > On Sat, Nov 24, 2012 at 6:21 AM, Jeff Chua wrote: >> On Fri, Nov 23, 2012 at 9:24 PM, Jens Axboe wrote: >>> On 2012-11-22 20:21, Linus Torvalds wrote: Doesn't sound like a fsdevel issue since it seems to be independent of filesystems. More like some generic block layer thing. Adding Jens (and quoting the whole thing) Jens, any ideas? Most of your stuff came in after -rc2, which would fit with the fact that most of the slowdown seems to be after -rc2 according to Jeff. >>> >>> No ideas. Looking at what went in from my side, only the rq plug sorting >>> is a core change, and that should not cause any change in behaviour for >>> a single device. That's commit 975927b9. >>> Jeff, more bisecting would be good, though. >>> >>> Probably required, yes... >> >> >> This one slows mount from 0.012s to 0.168s. >> >> commit 62ac665ff9fc07497ca524bd20d6a96893d11071 >> Author: Mikulas Patocka >> Date: Wed Sep 26 07:46:43 2012 +0200 >> >> blockdev: turn a rw semaphore into a percpu rw semaphore >> >> >> There were couple of more changes to percpu-rw-semaphores after >> 3.7.0-rc2 and those slows mount further from 0.168s to 0.500s. I don't >> really know, but I'm suspecting these. Still bisecting. >> >> commit 5c1eabe68501d1e1b1586c7f4c46cc531828c4ab >> Author: Mikulas Patocka >> Date: Mon Oct 22 19:37:47 2012 -0400 >> >> percpu-rw-semaphores: use light/heavy barriers >> >> >> commit 1bf11c53535ab87e3bf14ecdf6747bf46f601c5d >> Author: Mikulas Patocka >> Date: Mon Oct 22 19:39:16 2012 -0400 >> >> percpu-rw-semaphores: use rcu_read_lock_sched > > > I reverted these 3 patches and mount time is now 0.012s. > > # time mount /dev/sda1 /mnt; sync; sync; umount /mnt > > >> commit 1a25b1c4ce189e3926f2981f3302352a930086db >> Author: Mikulas Patocka >> Date: Mon Oct 15 17:20:17 2012 -0400 >> >> Lock splice_read and splice_write functions > > Looks like this one is not the problem. 
But I reverted it anyway > because it's part of the same chunk. > > Happy Thanksgiving! I'm on latest git 3.7.0-rc6 (5e351cdc998db82935d1248a053a1be37d1160fd) with the above patches reverted, and mount time is now 0.012s. Jeff
Re: Recent kernel "mount" slow
On Sat, Nov 24, 2012 at 6:21 AM, Jeff Chua wrote: > On Fri, Nov 23, 2012 at 9:24 PM, Jens Axboe wrote: >> On 2012-11-22 20:21, Linus Torvalds wrote: >>> Doesn't sound like a fsdevel issue since it seems to be independent of >>> filesystems. More like some generic block layer thing. Adding Jens >>> (and quoting the whole thing) >>> >>> Jens, any ideas? Most of your stuff came in after -rc2, which would >>> fit with the fact that most of the slowdown seems to be after -rc2 >>> according to Jeff. >> >> No ideas. Looking at what went in from my side, only the rq plug sorting >> is a core change, and that should not cause any change in behaviour for >> a single device. That's commit 975927b9. >> >>> Jeff, more bisecting would be good, though. >> >> Probably required, yes... > > > This one slows mount from 0.012s to 0.168s. > > commit 62ac665ff9fc07497ca524bd20d6a96893d11071 > Author: Mikulas Patocka > Date: Wed Sep 26 07:46:43 2012 +0200 > > blockdev: turn a rw semaphore into a percpu rw semaphore > > > There were couple of more changes to percpu-rw-semaphores after > 3.7.0-rc2 and those slows mount further from 0.168s to 0.500s. I don't > really know, but I'm suspecting these. Still bisecting. > > commit 5c1eabe68501d1e1b1586c7f4c46cc531828c4ab > Author: Mikulas Patocka > Date: Mon Oct 22 19:37:47 2012 -0400 > > percpu-rw-semaphores: use light/heavy barriers > > > commit 1bf11c53535ab87e3bf14ecdf6747bf46f601c5d > Author: Mikulas Patocka > Date: Mon Oct 22 19:39:16 2012 -0400 > > percpu-rw-semaphores: use rcu_read_lock_sched I reverted these 3 patches and mount time is now 0.012s. # time mount /dev/sda1 /mnt; sync; sync; umount /mnt > commit 1a25b1c4ce189e3926f2981f3302352a930086db > Author: Mikulas Patocka > Date: Mon Oct 15 17:20:17 2012 -0400 > > Lock splice_read and splice_write functions Looks like this one is not the problem. But I reverted it anyway because it's part of the same chunk. Happy Thanksgiving! 
Jeff
Re: Recent kernel "mount" slow
On Fri, Nov 23, 2012 at 9:24 PM, Jens Axboe wrote: > On 2012-11-22 20:21, Linus Torvalds wrote: >> Doesn't sound like a fsdevel issue since it seems to be independent of >> filesystems. More like some generic block layer thing. Adding Jens >> (and quoting the whole thing) >> >> Jens, any ideas? Most of your stuff came in after -rc2, which would >> fit with the fact that most of the slowdown seems to be after -rc2 >> according to Jeff. > > No ideas. Looking at what went in from my side, only the rq plug sorting > is a core change, and that should not cause any change in behaviour for > a single device. That's commit 975927b9. > >> Jeff, more bisecting would be good, though. > > Probably required, yes... This one slows mount from 0.012s to 0.168s. commit 62ac665ff9fc07497ca524bd20d6a96893d11071 Author: Mikulas Patocka Date: Wed Sep 26 07:46:43 2012 +0200 blockdev: turn a rw semaphore into a percpu rw semaphore There were couple of more changes to percpu-rw-semaphores after 3.7.0-rc2 and those slows mount further from 0.168s to 0.500s. I don't really know, but I'm suspecting these. Still bisecting. commit 5c1eabe68501d1e1b1586c7f4c46cc531828c4ab Author: Mikulas Patocka Date: Mon Oct 22 19:37:47 2012 -0400 percpu-rw-semaphores: use light/heavy barriers commit 1bf11c53535ab87e3bf14ecdf6747bf46f601c5d Author: Mikulas Patocka Date: Mon Oct 22 19:39:16 2012 -0400 percpu-rw-semaphores: use rcu_read_lock_sched commit 1a25b1c4ce189e3926f2981f3302352a930086db Author: Mikulas Patocka Date: Mon Oct 15 17:20:17 2012 -0400 Lock splice_read and splice_write functions I couldn't unpatch with the latest kernel, but still trying. Jeff
Re: Recent kernel "mount" slow
On 2012-11-22 20:21, Linus Torvalds wrote: > Doesn't sound like a fsdevel issue since it seems to be independent of > filesystems. More like some generic block layer thing. Adding Jens > (and quoting the whole thing) > > Jens, any ideas? Most of your stuff came in after -rc2, which would > fit with the fact that most of the slowdown seems to be after -rc2 > according to Jeff. No ideas. Looking at what went in from my side, only the rq plug sorting is a core change, and that should not cause any change in behaviour for a single device. That's commit 975927b9. > Jeff, more bisecting would be good, though. Probably required, yes... -- Jens Axboe
Happy Thanksgiving! I'm on latest git 3.7.0-rc6 5e351cdc998db82935d1248a053a1be37d1160fd and with the above patches reverted and mount time is now 0.012s. Jeff -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Recent kernel "mount" slow
On Wed, Nov 21, 2012 at 11:46 PM, Jeff Chua wrote: > On Wed, Nov 21, 2012 at 2:09 AM, Jan Kara wrote: >> I haven't heard about such problem so far. What filesystem are you using? > > I've tried ext2/ext3/ext4/reiserfs/btrfs ... all seems to be slower > than before. Seems to be fs independent. > >> Can you quantify 'is slower'? Bisecting would be welcome of course. > > Haven't measure, but it seems to be 1 sec instead of .3 sec to mount. > I'll start bisecting:( > So, since I've 6 mounts on start up, now it takes 4 seconds longer to boot up. I started bisecting, and it seems to have multiple points of slowing down. The latest kernel is really slow mounting /dev/sda1 (fs independent) ... 0.529sec. Kernel 3.6.0 took 0.012sec, Kernel 3.7.0-rc2 took 0.168sec. Again, these are very early results. I'm verifying them again. Jeff -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Recent kernel "mount" slow
Doesn't sound like a fsdevel issue since it seems to be independent of filesystems. More like some generic block layer thing. Adding Jens (and quoting the whole thing) Jens, any ideas? Most of your stuff came in after -rc2, which would fit with the fact that most of the slowdown seems to be after -rc2 according to Jeff. Jeff, more bisecting would be good, though. Linus On Thu, Nov 22, 2012 at 4:30 AM, Jeff Chua wrote: > On Wed, Nov 21, 2012 at 11:46 PM, Jeff Chua wrote: >> On Wed, Nov 21, 2012 at 2:09 AM, Jan Kara wrote: >>> I haven't heard about such problem so far. What filesystem are you using? >> >> I've tried ext2/ext3/ext4/reiserfs/btrfs ... all seems to be slower >> than before. Seems to be fs independent. >> >>> Can you quantify 'is slower'? Bisecting would be welcome of course. >> >> Haven't measure, but it seems to be 1 sec instead of .3 sec to mount. >> I'll start bisecting:( >> So, since I've 6 mounts on start up, now it takes 4 seconds longer to boot >> up. > > I started bisecting, and it seems to have multiple points of slowing > down. The latest kernel is really slow mounting /dev/sda1 (fs > independent) ... 0.529sec. Kernel 3.6.0 took 0.012sec, Kernel > 3.7.0-rc2 took 0.168sec. Again, these are very early results. I'm > verifying them again. > > > Jeff -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Recent kernel mount slow
Doesn't sound like a fsdevel issue since it seems to be independent of filesystems. More like some generic block layer thing. Adding Jens (and quoting the whole thing) Jens, any ideas? Most of your stuff came in after -rc2, which would fit with the fact that most of the slowdown seems to be after -rc2 according to Jeff. Jeff, more bisecting would be good, though. Linus On Thu, Nov 22, 2012 at 4:30 AM, Jeff Chua jeff.chua.li...@gmail.com wrote: On Wed, Nov 21, 2012 at 11:46 PM, Jeff Chua jeff.chua.li...@gmail.com wrote: On Wed, Nov 21, 2012 at 2:09 AM, Jan Kara j...@suse.cz wrote: I haven't heard about such problem so far. What filesystem are you using? I've tried ext2/ext3/ext4/reiserfs/btrfs ... all seems to be slower than before. Seems to be fs independent. Can you quantify 'is slower'? Bisecting would be welcome of course. Haven't measure, but it seems to be 1 sec instead of .3 sec to mount. I'll start bisecting:( So, since I've 6 mounts on start up, now it takes 4 seconds longer to boot up. I started bisecting, and it seems to have multiple points of slowing down. The latest kernel is really slow mounting /dev/sda1 (fs independent) ... 0.529sec. Kernel 3.6.0 took 0.012sec, Kernel 3.7.0-rc2 took 0.168sec. Again, these are very early results. I'm verifying them again. Jeff -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Recent kernel mount slow
On Wed, Nov 21, 2012 at 11:46 PM, Jeff Chua jeff.chua.li...@gmail.com wrote: On Wed, Nov 21, 2012 at 2:09 AM, Jan Kara j...@suse.cz wrote: I haven't heard about such problem so far. What filesystem are you using? I've tried ext2/ext3/ext4/reiserfs/btrfs ... all seems to be slower than before. Seems to be fs independent. Can you quantify 'is slower'? Bisecting would be welcome of course. Haven't measure, but it seems to be 1 sec instead of .3 sec to mount. I'll start bisecting:( So, since I've 6 mounts on start up, now it takes 4 seconds longer to boot up. I started bisecting, and it seems to have multiple points of slowing down. The latest kernel is really slow mounting /dev/sda1 (fs independent) ... 0.529sec. Kernel 3.6.0 took 0.012sec, Kernel 3.7.0-rc2 took 0.168sec. Again, these are very early results. I'm verifying them again. Jeff -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Recent kernel "mount" slow
On Wed, Nov 21, 2012 at 2:09 AM, Jan Kara wrote: > I haven't heard about such problem so far. What filesystem are you using? I've tried ext2/ext3/ext4/reiserfs/btrfs ... all seems to be slower than before. Seems to be fs independent. > Can you quantify 'is slower'? Bisecting would be welcome of course. Haven't measure, but it seems to be 1 sec instead of .3 sec to mount. I'll start bisecting:( So, since I've 6 mounts on start up, now it takes 4 seconds longer to boot up. Thanks, Jeff -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Recent kernel mount slow
On Wed, Nov 21, 2012 at 2:09 AM, Jan Kara j...@suse.cz wrote: I haven't heard about such problem so far. What filesystem are you using? I've tried ext2/ext3/ext4/reiserfs/btrfs ... all seems to be slower than before. Seems to be fs independent. Can you quantify 'is slower'? Bisecting would be welcome of course. Haven't measure, but it seems to be 1 sec instead of .3 sec to mount. I'll start bisecting:( So, since I've 6 mounts on start up, now it takes 4 seconds longer to boot up. Thanks, Jeff -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Recent kernel "mount" slow
On Mon 19-11-12 08:33:25, Jeff Chua wrote: > It seems the recent kernel is slower mounting hard disk than older > kernels. I've not bisect down to exact when this happen as it might > already been reported or solved. I'm on the latest commit, but it > doesn't seems to be fixed yet. > > commit 3587b1b097d70c2eb9fee95ea7995d13c05f66e5 > Author: Al Viro > Date: Sun Nov 18 19:19:00 2012 + > > fanotify: fix FAN_Q_OVERFLOW case of fanotify_read() I haven't heard about such problem so far. What filesystem are you using? Can you quantify 'is slower'? Bisecting would be welcome of course. Honza -- Jan Kara SUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Recent kernel mount slow
On Mon 19-11-12 08:33:25, Jeff Chua wrote: It seems the recent kernel is slower mounting hard disk than older kernels. I've not bisect down to exact when this happen as it might already been reported or solved. I'm on the latest commit, but it doesn't seems to be fixed yet. commit 3587b1b097d70c2eb9fee95ea7995d13c05f66e5 Author: Al Viro v...@zeniv.linux.org.uk Date: Sun Nov 18 19:19:00 2012 + fanotify: fix FAN_Q_OVERFLOW case of fanotify_read() I haven't heard about such problem so far. What filesystem are you using? Can you quantify 'is slower'? Bisecting would be welcome of course. Honza -- Jan Kara j...@suse.cz SUSE Labs, CR -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Recent kernel "mount" slow
It seems the recent kernel is slower mounting hard disk than older kernels. I've not bisect down to exact when this happen as it might already been reported or solved. I'm on the latest commit, but it doesn't seems to be fixed yet. commit 3587b1b097d70c2eb9fee95ea7995d13c05f66e5 Author: Al Viro Date: Sun Nov 18 19:19:00 2012 + fanotify: fix FAN_Q_OVERFLOW case of fanotify_read() Thanks, Jeff -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Recent kernel mount slow
It seems the recent kernel is slower mounting hard disk than older kernels. I've not bisect down to exact when this happen as it might already been reported or solved. I'm on the latest commit, but it doesn't seems to be fixed yet. commit 3587b1b097d70c2eb9fee95ea7995d13c05f66e5 Author: Al Viro v...@zeniv.linux.org.uk Date: Sun Nov 18 19:19:00 2012 + fanotify: fix FAN_Q_OVERFLOW case of fanotify_read() Thanks, Jeff -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/