Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2012 at 9:51 AM, Chris Mason  wrote:
>
> The bigger question is do we have users that expect to be able to set
> the blocksize after mmaping the block device (no writes required)?  I
> actually feel a little bad for taking up internet bandwidth asking, but
> it is a change in behaviour.

Yeah, it is. That said, I don't think people will really notice.
Nobody mmap's block devices outside of some databases, afaik, and
nobody sane mounts a partition at the same time a DB is using it. So I
think the new EBUSY check is *ugly*, but I don't realistically believe
that it is a problem.  The ugliness of the locking is why I'm not a
huge fan of it, but if it works I can live with it.

But yes, the mmap tests are new with the locking, and could in theory
be problematic if somebody reports that it breaks anything.

And like the locking, they'd just go away if we just do the
fs/buffer.c approach instead. Because doing things in fs/buffer.c
simply means that we don't care (and serialization is provided by the
page lock on a per-page basis, which is what mmap relies on anyway).

So doing the per-page fs/buffer.c approach (along with the
"ACCESS_ONCE()" on inode->i_blkbits to make sure we get *one*
consistent value, even if we don't care *which* value it is) would
basically revert to all the old semantics. The only thing it would
change is that we wouldn't see oopses.

(And in theory, it would allow us to actively mix-and-match different
block sizes for a block device, but realistically I don't think there
are any actual users of that - although I could imagine that a
filesystem would use a smaller block size for file tail-blocks etc,
and still want to use the fs/buffer.c code, so it's *possible* that it
would be useful, but filesystems have been able to do things like that
by just doing their buffers by hand anyway, so it's not really
fundamentally new, just a possible generalization of code)

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-29 Thread Chris Mason
On Thu, Nov 29, 2012 at 10:26:56AM -0700, Linus Torvalds wrote:
> On Thu, Nov 29, 2012 at 6:12 AM, Chris Mason  wrote:
> >
> > Jumping in based on Linus original patch, which is doing something like
> > this:
> >
> > set_blocksize() {
> > block new calls to writepage, prepare/commit_write
> > set the block size
> > unblock
> >
> > < --- can race in here and find bad buffers --->
> >
> > sync_blockdev()
> > kill_bdev()
> >
> > < --- now we're safe --- >
> > }
> >
> > We could add a second semaphore and a page_mkwrite call:
> 
> Yeah, we could be fancy, but the more I think about it, the less I can
> say I care.
> 
> After all, the only things that do the whole set_blocksize() thing should be:
> 
>  - filesystems at mount-time
> 
>  - things like loop/md at block device init time.
> 
> and quite frankly, if there are any *concurrent* writes with either of
> the above, I really *really* don't think we should care. I mean,
> seriously.
> 
> So the _only_ real reason for the locking in the first place is to
> make sure of internal kernel consistency. We do not want to oops or
> corrupt memory if people do odd things. But we really *really* don't
> care if somebody writes to a partition at the same time as somebody
> else mounts it. Not enough to do extra work to please insane people.
> 
> It's also worth noting that NONE OF THIS HAS EVER WORKED IN THE PAST.
> The whole sequence always used to be unlocked. The locking is entirely
> new. There is certainly not any legacy users that can possibly rely on
> "I did writes at the same time as the mount with no serialization, and
> it worked". It never has worked.
> 
> So I think this is a case of "perfect is the enemy of good".
> Especially since I think that with the fs/buffer.c approach, we don't
> actually need any locking at all at higher levels.

The bigger question is do we have users that expect to be able to set
the blocksize after mmaping the block device (no writes required)?  I
actually feel a little bad for taking up internet bandwidth asking, but
it is a change in behaviour.

Regardless, changing mmap for a race in the page cache is just backwards, and
with the current 3.7 code, we can still trigger the race with fadvise ->
readpage in the middle of set_blocksize().

Obviously nobody does any of this, otherwise we'd have tons of reports
from those handy WARN_ONs in fs/buffer.c.  So it's definitely hard to be
worried one way or another.

-chris


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2012 at 6:12 AM, Chris Mason  wrote:
>
> Jumping in based on Linus original patch, which is doing something like
> this:
>
> set_blocksize() {
> block new calls to writepage, prepare/commit_write
> set the block size
> unblock
>
> < --- can race in here and find bad buffers --->
>
> sync_blockdev()
> kill_bdev()
>
> < --- now we're safe --- >
> }
>
> We could add a second semaphore and a page_mkwrite call:

Yeah, we could be fancy, but the more I think about it, the less I can
say I care.

After all, the only things that do the whole set_blocksize() thing should be:

 - filesystems at mount-time

 - things like loop/md at block device init time.

and quite frankly, if there are any *concurrent* writes with either of
the above, I really *really* don't think we should care. I mean,
seriously.

So the _only_ real reason for the locking in the first place is to
make sure of internal kernel consistency. We do not want to oops or
corrupt memory if people do odd things. But we really *really* don't
care if somebody writes to a partition at the same time as somebody
else mounts it. Not enough to do extra work to please insane people.

It's also worth noting that NONE OF THIS HAS EVER WORKED IN THE PAST.
The whole sequence always used to be unlocked. The locking is entirely
new. There is certainly not any legacy users that can possibly rely on
"I did writes at the same time as the mount with no serialization, and
it worked". It never has worked.

So I think this is a case of "perfect is the enemy of good".
Especially since I think that with the fs/buffer.c approach, we don't
actually need any locking at all at higher levels.

 Linus


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-29 Thread Chris Mason
On Thu, Nov 29, 2012 at 07:12:49AM -0700, Chris Mason wrote:
> On Wed, Nov 28, 2012 at 11:16:21PM -0700, Linus Torvalds wrote:
> > On Wed, Nov 28, 2012 at 6:58 PM, Linus Torvalds
> >  wrote:
> > >
> > > But the fact that the code wants to do things like
> > >
> > > block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
> > >
> > > seriously seems to be the main thing that keeps us using
> > > 'inode->i_blkbits'. Calculating bbits from bh->b_size is just costly
> > > enough to hurt (not everywhere, but on some machines).
> > >
> > > Very annoying.
> > 
> > Hmm. Here's a patch that does that anyway. I'm not 100% happy with the
> > whole ilog2 thing, but at the same time, in other cases it actually
> > seems to improve code generation (ie gets rid of the whole unnecessary
> > two dereferences through page->mapping->host just to get the block
> > size, when we have it in the buffer-head that we have to touch
> > *anyway*).
> > 
> > Comments? Again, untested.
> 
> Jumping in based on Linus original patch, which is doing something like
> this:
> 
> set_blocksize() {
>   block new calls to writepage, prepare/commit_write
>   set the block size
>   unblock
> 
>   < --- can race in here and find bad buffers --->
> 
>   sync_blockdev()
>   kill_bdev() 
>   
>   < --- now we're safe --- >
> }
> 
> We could add a second semaphore and a page_mkwrite call:
> 
> set_blocksize() {
> 
>   block new calls to prepare/commit_write and page_mkwrite(), but
>   leave writepage unblocked.
> 
>   sync_blockdev()
> 
>   <--- now we're safe.  There are no dirty pages and no ways to
>   make new ones ---> 
> 
>   block new calls to readpage (writepage too for good luck?)
> 
>   kill_bdev()

Whoops, kill_bdev needs the page lock, which sends us into ABBA when
readpage does the down_read.  So, slight modification: unblock
readpage/writepage before the kill_bdev.  The risk is that readpage can
then find buffers with the wrong size, so it would need to be changed to
discard them.

The patch below is based on Linus' original and doesn't deal with the
readpage race.  But it does get the rest of the idea across.  It boots
and survives banging on blockdev --setbsz with mkfs, but I definitely
wouldn't trust it.

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 1a1e5e3..1377171 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -116,8 +116,6 @@ EXPORT_SYMBOL(invalidate_bdev);
 
 int set_blocksize(struct block_device *bdev, int size)
 {
-	struct address_space *mapping;
-
 	/* Size must be a power of two, and between 512 and PAGE_SIZE */
 	if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
 		return -EINVAL;
@@ -126,28 +124,40 @@ int set_blocksize(struct block_device *bdev, int size)
 	if (size < bdev_logical_block_size(bdev))
 		return -EINVAL;
 
-	/* Prevent starting I/O or mapping the device */
-	percpu_down_write(&bdev->bd_block_size_semaphore);
-
-	/* Check that the block device is not memory mapped */
-	mapping = bdev->bd_inode->i_mapping;
-	mutex_lock(&mapping->i_mmap_mutex);
-	if (mapping_mapped(mapping)) {
-		mutex_unlock(&mapping->i_mmap_mutex);
-		percpu_up_write(&bdev->bd_block_size_semaphore);
-		return -EBUSY;
-	}
-	mutex_unlock(&mapping->i_mmap_mutex);
-
 	/* Don't change the size if it is same as current */
 	if (bdev->bd_block_size != size) {
+		/* block all modifications via writing and page_mkwrite */
+		percpu_down_write(&bdev->bd_block_size_semaphore);
+
+		/* write everything that was dirty */
 		sync_blockdev(bdev);
+
+		/* block readpage and writepage */
+		percpu_down_write(&bdev->bd_page_semaphore);
+
 		bdev->bd_block_size = size;
 		bdev->bd_inode->i_blkbits = blksize_bits(size);
+
+		/* we can't call kill_bdev with the page_semaphore down
+		 * because we'll deadlock against readpage.
+		 * The block_size_semaphore should prevent any new
+		 * pages from being dirty, but readpage can jump
+		 * in once we up the bd_page_sem and find a
+		 * page with buffers from the old size.
+		 *
+		 * The kill_bdev call below is going to get rid
+		 * of those buffers, but we do have a race here.
+		 * readpage needs to deal with it and verify
+		 * any buffers on the page are the right size
+		 */
+		percpu_up_write(&bdev->bd_page_semaphore);
+
+		/* drop all the pages and all the buffers */
 		kill_bdev(bdev);
-	}
 
-	percpu_up_write(&bdev->bd_block_size_semaphore);
+		/* open the gates and let everyone back in */
+		percpu_up_write(&bdev->bd_block_size_semaphore);
+	}
 
 	return 0;
 }

Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-29 Thread Chris Mason
On Wed, Nov 28, 2012 at 11:16:21PM -0700, Linus Torvalds wrote:
> On Wed, Nov 28, 2012 at 6:58 PM, Linus Torvalds
>  wrote:
> >
> > But the fact that the code wants to do things like
> >
> > block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
> >
> > seriously seems to be the main thing that keeps us using
> > 'inode->i_blkbits'. Calculating bbits from bh->b_size is just costly
> > enough to hurt (not everywhere, but on some machines).
> >
> > Very annoying.
> 
> Hmm. Here's a patch that does that anyway. I'm not 100% happy with the
> whole ilog2 thing, but at the same time, in other cases it actually
> seems to improve code generation (ie gets rid of the whole unnecessary
> two dereferences through page->mapping->host just to get the block
> size, when we have it in the buffer-head that we have to touch
> *anyway*).
> 
> Comments? Again, untested.

Jumping in based on Linus original patch, which is doing something like
this:

set_blocksize() {
block new calls to writepage, prepare/commit_write
set the block size
unblock

< --- can race in here and find bad buffers --->

sync_blockdev()
kill_bdev() 

< --- now we're safe --- >
}

We could add a second semaphore and a page_mkwrite call:

set_blocksize() {

block new calls to prepare/commit_write and page_mkwrite(), but
leave writepage unblocked.

sync_blockdev()

<--- now we're safe.  There are no dirty pages and no ways to
make new ones ---> 

block new calls to readpage (writepage too for good luck?)

kill_bdev()
set the block size

unblock readpage/writepage
unblock prepare/commit_write and page_mkwrite

}

Another way to look at things:

As Linus said in a different email, we don't need to drop the pages, just
the buffers.  Once we've blocked prepare/commit_write,
there is no way to make a partially up to date page with dirty data.
We may make fully uptodate dirty pages, but for those we can
just create dirty buffers for the whole page.

As long as we had prepare/commit write blocked while we ran
sync_blockdev, we can blindly detach any buffers that are the wrong size
and just make new ones.

This may or may not apply to loop.c, I'd have to read that more
carefully.

-chris



Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-29 Thread Jeff Chua
On Thu, Nov 29, 2012 at 2:45 PM, Al Viro  wrote:
> On Wed, Nov 28, 2012 at 10:37:27PM -0800, Linus Torvalds wrote:
>> On Wed, Nov 28, 2012 at 10:30 PM, Al Viro  wrote:
>> >
>> > Note that sync_blockdev() a few lines prior to that is good only if we
>> > have no other processes doing write(2) (or dirtying the mmapped pages,
>> > for that matter).  The window isn't too wide, but...
>>
>> So with Mikulas' patches, the write actually would block (at write
>> level) due to the locking. The mmap'ed patches may be around and
>> flushed, but the logic to not allow currently *active* mmaps (with the
>> rather nasty random -EBUSY return value) should mean that there is no
>> race.
>>
>> Or rather, there's a race, but it results in that EBUSY thing.
>
> Same as with fs mounted on it, or the sucker having been claimed for
> RAID array, etc.  Frankly, I'm more than slightly tempted to make
> bdev mmap() just claim the sodding device exclusive for as long as
> it's mmapped...
>
> In principle, I agree, but...  I still have nightmares from mmap/truncate
> races way back.  You are stepping into what used to be a really nasty
> minefield.  I'll look into that, but it's *definitely* not -rc8 fodder.

Just let me know which relevant patch(es) you want me to test or break.

Thanks,
Jeff


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-28 Thread Al Viro
On Wed, Nov 28, 2012 at 10:37:27PM -0800, Linus Torvalds wrote:
> On Wed, Nov 28, 2012 at 10:30 PM, Al Viro  wrote:
> >
> > Note that sync_blockdev() a few lines prior to that is good only if we
> > have no other processes doing write(2) (or dirtying the mmapped pages,
> > for that matter).  The window isn't too wide, but...
> 
> So with Mikulas' patches, the write actually would block (at write
> level) due to the locking. The mmap'ed pages may be around and
> flushed, but the logic to not allow currently *active* mmaps (with the
> rather nasty random -EBUSY return value) should mean that there is no
> race.
> 
> Or rather, there's a race, but it results in that EBUSY thing.

Same as with fs mounted on it, or the sucker having been claimed for
RAID array, etc.  Frankly, I'm more than slightly tempted to make
bdev mmap() just claim the sodding device exclusive for as long as
it's mmapped...

In principle, I agree, but...  I still have nightmares from mmap/truncate
races way back.  You are stepping into what used to be a really nasty
minefield.  I'll look into that, but it's *definitely* not -rc8 fodder.


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-28 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 10:30 PM, Al Viro  wrote:
>
> Note that sync_blockdev() a few lines prior to that is good only if we
> have no other processes doing write(2) (or dirtying the mmapped pages,
> for that matter).  The window isn't too wide, but...

So with Mikulas' patches, the write actually would block (at write
level) due to the locking. The mmap'ed pages may be around and
flushed, but the logic to not allow currently *active* mmaps (with the
rather nasty random -EBUSY return value) should mean that there is no
race.

Or rather, there's a race, but it results in that EBUSY thing.

With my simplified locking, the sync_blockdev() is right before the
kill_bdev() (not a few lines prior), and in a perfect world they'd
actually be one single operation ("write back and invalidate pages
with the wrong block-size"). But they aren't.

Linus


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-28 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 10:25 PM, Al Viro  wrote:
>
> Umm...  set_blocksize() is calling kill_bdev(), which does
> truncate_inode_pages(mapping, 0).  What's going to happen to data in
> the dirty pages?  IO in progress is not the only thing to worry about...

Hmm. Yes. I think it works by virtue of "if you change the blocksize
while there is active IO, you're insane and you deserve whatever you
get".

It shouldn't even be fundamentally hard to make it work, although I
suspect it would be more code than it would be worth. The sane model
would be to not use truncate_inode_pages(), but instead just walk the
pages and get rid of the buffer heads with the wrong size. Preferably
*combining* that with the sync_blockdev().

We have no real reason to even invalidate the page cache, it's just
the buffers we want to get rid of.

But I suspect it's true that none of that is really *worth* it,
considering that nobody likely wants to do any concurrent IO. We don't
want to crash, or corrupt the data structures, but I suspect "you get
what you deserve" might actually be the right model ;)

So the current "sync_blockdev()+kill_bdev()" takes care of the *sane*
case (we flush any data that happened *before* the block size change),
and any concurrent writes with block-size changes are "good luck with
that".

  Linus


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-28 Thread Al Viro
On Thu, Nov 29, 2012 at 06:25:19AM +, Al Viro wrote:

>   Umm...  set_blocksize() is calling kill_bdev(), which does
> truncate_inode_pages(mapping, 0).  What's going to happen to data in
> the dirty pages?  IO in progress is not the only thing to worry about...

Note that sync_blockdev() a few lines prior to that is good only if we
have no other processes doing write(2) (or dirtying the mmapped pages,
for that matter).  The window isn't too wide, but...


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-28 Thread Al Viro
On Wed, Nov 28, 2012 at 10:16:21PM -0800, Linus Torvalds wrote:
> On Wed, Nov 28, 2012 at 6:58 PM, Linus Torvalds
>  wrote:
> >
> > But the fact that the code wants to do things like
> >
> > block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
> >
> > seriously seems to be the main thing that keeps us using
> > 'inode->i_blkbits'. Calculating bbits from bh->b_size is just costly
> > enough to hurt (not everywhere, but on some machines).
> >
> > Very annoying.
> 
> Hmm. Here's a patch that does that anyway. I'm not 100% happy with the
> whole ilog2 thing, but at the same time, in other cases it actually
> seems to improve code generation (ie gets rid of the whole unnecessary
> two dereferences through page->mapping->host just to get the block
> size, when we have it in the buffer-head that we have to touch
> *anyway*).
> 
> Comments? Again, untested.
> 
> And I notice that Al Viro hasn't been cc'd, which is sad, since he's
> been involved in much of fs/block_dev.c.
> 
> Al - this is an independent patch to fs/buffer.c to make
> fs/block_dev.c able to change the block size of a block device while
> there is IO in progress that may still use the old block size. The
> discussion has been on fsdevel and lkml, but you may have missed it...

Umm...  set_blocksize() is calling kill_bdev(), which does
truncate_inode_pages(mapping, 0).  What's going to happen to data in
the dirty pages?  IO in progress is not the only thing to worry about...


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-28 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 6:58 PM, Linus Torvalds
 wrote:
>
> But the fact that the code wants to do things like
>
> block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
>
> seriously seems to be the main thing that keeps us using
> 'inode->i_blkbits'. Calculating bbits from bh->b_size is just costly
> enough to hurt (not everywhere, but on some machines).
>
> Very annoying.

Hmm. Here's a patch that does that anyway. I'm not 100% happy with the
whole ilog2 thing, but at the same time, in other cases it actually
seems to improve code generation (ie gets rid of the whole unnecessary
two dereferences through page->mapping->host just to get the block
size, when we have it in the buffer-head that we have to touch
*anyway*).

Comments? Again, untested.

And I notice that Al Viro hasn't been cc'd, which is sad, since he's
been involved in much of fs/block_dev.c.

Al - this is an independent patch to fs/buffer.c to make
fs/block_dev.c able to change the block size of a block device while
there is IO in progress that may still use the old block size. The
discussion has been on fsdevel and lkml, but you may have missed it...

Linus


patch.diff
Description: Binary data


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-28 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 6:04 PM, Linus Torvalds
 wrote:
>
> Interesting. The code *has* the block size (it's in "bh->b_size"), but
> it actually then uses the inode blocksize instead, and verifies the
> two against each other. It could just have used the block size
> directly (and then used the inode i_blkbits only when no buffers
> existed), avoiding that dependency entirely..

Looking more at this code, that really would be the nicest solution.

There's two cases for the whole get_block() thing:

 - filesystems. The block size will not change randomly, and
"get_block()" seriously depends on the block size.

 - the raw device. The block size *will* change, but to simplify the
problem, "get_block()" is a 1:1 mapping, so it doesn't even care about
the block size because it will always return "bh->b_blocknr = nr".

So we *could* just say that all the fs/buffer.c code should use
"inode->i_blkbits" for creating buffers (because that's the size new
buffers should always use), but use "bh->b_size" for any *existing*
buffer use.

And looking at it, it's even simple. Except for one *very* annoying
thing: several users really don't want the size of the buffer, they
really do want the *shift* of the buffer size.

In fact, that single issue seems to be the reason why
"inode->i_blkbits" is really used in fs/buffer.c.

Otherwise it would be fairly trivial to just make the pattern be just a simple

if (!page_has_buffers(page))
create_empty_buffers(page, 1 << inode->i_blkbits, 0);
head = page_buffers(page);
blocksize = head->b_size;

and just use the blocksize that way, without any other games. All
done, no silly WARN_ON() to verify against some global block-size, and
the fs/buffer.c code would be perfectly simple, and would have no
problem at all with multiple different blocksizes in different pages
(the page lock serializes the buffers and thus the blocksize at the
per-page level).

But the fact that the code wants to do things like

block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);

seriously seems to be the main thing that keeps us using
'inode->i_blkbits'. Calculating bbits from bh->b_size is just costly
enough to hurt (not everywhere, but on some machines).

Very annoying.

   Linus


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-28 Thread Mikulas Patocka


On Wed, 28 Nov 2012, Linus Torvalds wrote:

> A bigger issue is for things that emulate what blkdev.c does, and
> doesn't do the locking. I see code in md/bitmap.c that seems a bit
> suspicious, for example. That said, it's not *new* breakage, and the
> "lock at mmap/read/write() time" approach doesn't fix it either (since
> the mapping will be different for the underlying MD device). So I do
> think that we should take a look at all the users of
> "alloc_page_buffers()" and "create_empty_buffers()" to see what *they*
> do to protect the block-size, but I think that's an independent issue
> from the raw device access case in fs/block_dev.c..
> 
> I guess I have to actually test my patch. I don't have very
> interesting test-cases, though.
> 
>Linus

Yes, the md code looks suspicious. It disables write access in 
deny_bitmap_write_access() (that function looks buggy on its own - what 
happens if i_writecount == 0 or if it is already negative on entry?).

So we could disallow changing block size if i_writecount < 0, that should 
do the trick.

Mikulas


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-28 Thread Linus Torvalds
[ Sorry, I was offline for a while driving kids around ]

On Wed, Nov 28, 2012 at 4:38 PM, Mikulas Patocka  wrote:
>
> It can happen. Take your patch (the one that moves bd_block_size_semaphore
> into blkdev_readpage, blkdev_writepage and blkdev_write_begin).

Interesting. The code *has* the block size (it's in "bh->b_size"), but
it actually then uses the inode blocksize instead, and verifies the
two against each other. It could just have used the block size
directly (and then used the inode i_blkbits only when no buffers
existed), avoiding that dependency entirely..

It actually does the same thing (with the same verification) in
__block_write_full_page() and (_without_ the verification) in
__block_commit_write().

Ho humm. All of those places actually do hold the rwsem for reading,
it's just that I don't want to hold it for writing over the sync..

Need to think on this,

Linus


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-28 Thread Mikulas Patocka


On Wed, 28 Nov 2012, Linus Torvalds wrote:

> > For example, __block_write_full_page and __block_write_begin do
> > if (!page_has_buffers(page)) { create_empty_buffers... }
> > and then they do
> > WARN_ON(bh->b_size != blocksize)
> > err = get_block(inode, block, bh, 1)
> 
> Right. And none of this is new.
> 
> > ... so if the buffers were left over from some previous call to
> > create_empty_buffers with a different blocksize, that WARN_ON is trigged.
> 
> None of this can happen.

It can happen. Take your patch (the one that moves bd_block_size_semaphore 
into blkdev_readpage, blkdev_writepage and blkdev_write_begin).

Insert msleep(1000) into set_blocksize, just before sync_blockdev.

Run this program:
#define _XOPEN_SOURCE 500
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>

static char array[4096];

int main(void)
{
int h;
system("dmsetup remove test 2>/dev/null");
if (system("dmsetup create test --table '0 1 zero'")) exit(1);
h = open("/dev/mapper/test", O_RDWR);
if (h < 0) perror("open"), exit(1);
if (pread(h, array, 512, 0) != 512) perror("pread"), exit(1);
if (system("dmsetup load test --table '0 8 zero'")) exit(1);
if (system("dmsetup suspend test")) exit(1);
if (system("dmsetup resume test")) exit(1);
if (system("blockdev --setbsz 2048 /dev/mapper/test &")) exit(1);
usleep(50);
if (pwrite(h, array, 4096, 0) != 4096) perror("pwrite"), exit(1);
return 0;
}

--- it triggers WARNING: at fs/buffer.c:1830 in __block_write_begin
[ 1243.30] Backtrace:
[ 1243.33]  [<40230ba8>] block_write_begin+0x70/0xd0
[ 1243.40]  [<402350cc>] blkdev_write_begin+0xb4/0x208
[ 1243.48]  [<401a9f10>] generic_file_buffered_write+0x248/0x348
[ 1243.57]  [<401ac8c4>] __generic_file_aio_write+0x1fc/0x388
[ 1243.66]  [<40235e74>] blkdev_aio_write+0x64/0xf0
[ 1243.74]  [<401f2108>] do_sync_write+0xd0/0x128
[ 1243.81]  [<401f2930>] vfs_write+0xa0/0x180
[ 1243.88]  [<401f2ecc>] sys_pwrite64+0xb4/0xd8
[ 1243.95]  [<40122104>] parisc_pwrite64+0x1c/0x28
[ 1244.02]  [<40106060>] syscall_exit+0x0/0x14


I'm not saying that your approach is wrong, you just have to carefuly
review all memory management code for assumptions that block size
doesn't change.

Mikulas


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-28 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 2:52 PM, Linus Torvalds
 wrote:
>
>> For example, __block_write_full_page and __block_write_begin do
>> if (!page_has_buffers(page)) { create_empty_buffers... }
>> and then they do
>> WARN_ON(bh->b_size != blocksize)
>> err = get_block(inode, block, bh, 1)
>
> Right. And none of this is new.

.. which, btw, is not to say that *other* things aren't new. They are.

The change to actually change the block device buffer size before then
calling "sync_bdev()" is definitely a real change, and as mentioned, I
have not tested the patch in any way. If any block device driver were
to actually compare the IO size they get against the bdev->block_size
thing, they'd see very different behavior (ie they'd see the new block
size as they are asked to write out the old blocks with the old block
size).

So it does change semantics, no question about that. I don't think any
block device does it, though.

A bigger issue is for things that emulate what blkdev.c does, and
doesn't do the locking. I see code in md/bitmap.c that seems a bit
suspicious, for example. That said, it's not *new* breakage, and the
"lock at mmap/read/write() time" approach doesn't fix it either (since
the mapping will be different for the underlying MD device). So I do
think that we should take a look at all the users of
"alloc_page_buffers()" and "create_empty_buffers()" to see what *they*
do to protect the block-size, but I think that's an independent issue
from the raw device access case in fs/block_dev.c..

I guess I have to actually test my patch. I don't have very
interesting test-cases, though.

   Linus


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-28 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 1:29 PM, Mikulas Patocka  wrote:
>
> The problem with this approach is that it is very easy to miss points
> where it is assumed that the block size doesn't change - and if you miss a
> point, it results in a hidden bug that has a little possibility of being
> found.

Umm. Mikulas, *your* approach has resulted in bugs. So let's not throw
stones in glass houses, shall we?

The whole reason for this long thread (and several threads before it)
is that your model isn't working and is causing problems. I already
pointed out how bogus your arguments about mmap() locking were, and
then you have the gall to talk about potential bugs, when I have
pointed you to *actual* bugs, and actual mistakes.

> For example, __block_write_full_page and __block_write_begin do
> if (!page_has_buffers(page)) { create_empty_buffers... }
> and then they do
> WARN_ON(bh->b_size != blocksize)
> err = get_block(inode, block, bh, 1)

Right. And none of this is new.

> ... so if the buffers were left over from some previous call to
> create_empty_buffers with a different blocksize, that WARN_ON is trigged.

None of this can happen.

> Locking the whole read/write/mmap operations is crude, but at least it can
> be done without thorough review of all the memory management code.

Umm. Which you clearly didn't do, and raised totally crap arguments for.

In contrast, I have a very simple argument for the correctness of my
patch: every single user of the "get_block[s]()" interface now takes
the lock for as long as get_block[s]() is passed off to somebody else.
And since get_block[s]() is the only way to create those empty
buffers, I think I pretty much proved exactly what you ask for.

And THAT is the whole point and advantage of making locking sane. Sane
locking you can actually *think* about!

In contrast, locking around "mmap()" is absolutely *guaranteed* to be
insane, because mmap() doesn't actually do any of the IO that the lock
is supposed to protect against!

So Mikulas, quite frankly, your arguments argue against you. When you
say "Locking the whole read/write/mmap operations is crude, but at
least it can
be done without thorough", you are doubly correct: it *is* crude, and
it clearly *was* done without thought, since it's a f*cking idiotic
AND INCORRECT thing to do.

Seriously. Locking around "mmap()" is insane. It leads to insane
semantics (the whole EBUSY thing is purely because of that problem)
and it leads to bad code (your "let's add a new "mmap_region" hook is
just disgusting, and while Al's idea of doing it in the existing
"->open" method is at least not nasty, it's definitely extra code and
complexity).

There are serious *CORRECTNESS* advantages to simplicity and
directness. And locking at the right point is definitely very much
part of that.

Anyway, as far as block size goes, we have exactly two cases:

 - random IO that does not care about the block size, and will just do
whatever the current block size is (ie normal anonymous accesses to
the block device).  This is the case that needs the locking - but it
only needs it around the individual page operations, ie exactly where
I put it. In fact, they can happily deal with different block sizes
for different pages, they don't really care.

 - mounted filesystems etc that require a particular block size and
set it at mount time, and they have exclusivity rules

The second case is the case that actually calls set_blocksize(), and
if "kill_bdev()" doesn't get rid of the old blocksizes, then they have
always been in trouble, and would always _continue_ to be in trouble,
regardless of locking.

Linus


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-28 Thread Mikulas Patocka


On Wed, 28 Nov 2012, Linus Torvalds wrote:

> On Wed, Nov 28, 2012 at 12:32 PM, Linus Torvalds
>  wrote:
> >
> > Here is a *COMPLETELY* untested patch. Caveat emptor. It will probably
> > do unspeakable things to your family and pets.
> 
> Btw, *if* this approach works, I suspect we could just switch the
> bd_block_size_semaphore semaphore to be a regular rw-sem.
> 
> Why? Because now it's no longer ever gotten in the cached IO paths, we
> only get it when we're doing much more expensive things (ie actual IO,
> and buffer head allocations etc etc). As long as we just work with the
> page cache, we never get to the whole lock at all.
> 
> Which means that the whole percpu-optimized thing is likely no longer
> all that relevant.

Using a normal rw-semaphore degrades direct-IO performance on raw devices; I 
measured it (45.1s with a normal rw-semaphore vs. 42.8s with the 
percpu rw-semaphore). The difference can also be measured on a ramdisk or 
on high-performance SSDs.

Mikulas


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-28 Thread Mikulas Patocka


On Wed, 28 Nov 2012, Linus Torvalds wrote:

> On Wed, Nov 28, 2012 at 12:03 PM, Linus Torvalds
>  wrote:
> >
> > mmap() is in *no* way special. The exact same thing happens for
> > regular read/write. Yet somehow the mmap code is special-cased, while
> > the normal read-write code is not.
> 
> I just double-checked, because it's been a long time since I actually
> looked at the code.
> 
> But yeah, block device read/write uses the pure page cache functions.
> IOW, it has the *exact* same IO engine as mmap() would have.
> 
> So here's my suggestion:
> 
>  - get rid of *all* the locking in aio_read/write and the splice paths
>  - get rid of all the stupid mmap games
> 
>  - instead, add them to the functions that actually use
> "blkdev_get_block()" and "blkdev_get_blocks()" and nowhere else.
> 
>That's a fairly limited number of functions:
> blkdev_{read,write}page(), blkdev_direct_IO() and
> blkdev_write_{begin,end}()
> 
> Doesn't that sounds simpler? And more logical: it protects the actual
> places that use the block size of the device.
> 
> I dunno. Maybe there is some fundamental reason why the above is
> broken, but it seems to be a much simpler approach. Sure, you need to
> guarantee that the people who get the write-lock cannot possibly cause
> IO while holding it, but since the only reason to get the write lock
> would be to change the block size, that should be pretty simple, no?
> 
> Yeah, yeah, I'm probably missing something fundamental, but the above
> sounds like the simple approach to fixing things. Aiming for having
> the block size read-lock be taken by the things that pass in the
> block-size itself.
> 
> It would be nice for things to be logical and straightforward.
> 
>Linus


The problem with this approach is that it is very easy to miss points 
where it is assumed that the block size doesn't change - and if you miss a 
point, it results in a hidden bug that has a little possibility of being 
found.

For example, __block_write_full_page and __block_write_begin do
if (!page_has_buffers(page)) { create_empty_buffers... }
and then they do
WARN_ON(bh->b_size != blocksize)
err = get_block(inode, block, bh, 1)

... so if the buffers were left over from some previous call to 
create_empty_buffers with a different blocksize, that WARN_ON is trigged. 
And it's not only a harmless warning - now bh->b_size is left set to the 
old block size, but bh->b_blocknr is set to a number, that was calculated 
according to the new block size - and when you submit that buffer with 
submit_bh, it is written to the wrong place!

Now, prove that there are no more bugs like this.


Locking the whole read/write/mmap operations is crude, but at least it can 
be done without thorough review of all the memory management code.

Mikulas


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-28 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 12:32 PM, Linus Torvalds
 wrote:
>
> Here is a *COMPLETELY* untested patch. Caveat emptor. It will probably
> do unspeakable things to your family and pets.

Btw, *if* this approach works, I suspect we could just switch the
bd_block_size_semaphore semaphore to be a regular rw-sem.

Why? Because now it's no longer ever gotten in the cached IO paths, we
only get it when we're doing much more expensive things (ie actual IO,
and buffer head allocations etc etc). As long as we just work with the
page cache, we never get to the whole lock at all.

Which means that the whole percpu-optimized thing is likely no longer
all that relevant.

But that's an independent thing, and it's only true *if* my patch
works. It looks fine on paper, but maybe there's something
fundamentally broken about it.

One big change my patch does is to move the sync_bdev/kill_bdev to
*after* changing the block size. It does that so that it can guarantee
that any old data (which didn't see the new block size) will be
sync'ed even if there is new IO coming in as we change the block size.

The old code locked the whole sync() region, which doesn't work with
my approach, since the sync will do IO and would thus cause potential
deadlocks while holding the rwsem for writing.

So with this patch, as the block size changes, you can actually have
some old pages with the old block size *and* some different new pages
with the new  block size all at the same time. It should all be
perfectly fine, but it's worth pointing out.

(It probably won't trigger in practice, though, since doing IO while
somebody else is changing the blocksize is fundamentally an odd thing
to do, but whatever. I also suspect that we *should* perhaps use the
inode->i_sem thing to serialize concurrent block size changes, but
that's again an independent issue)

   Linus


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-28 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 12:13 PM, Linus Torvalds
 wrote:
>
> I dunno. Maybe there is some fundamental reason why the above is
> broken, but it seems to be a much simpler approach. Sure, you need to
> guarantee that the people who get the write-lock cannot possibly cause
> IO while holding it, but since the only reason to get the write lock
> would be to change the block size, that should be pretty simple, no?

Here is a *COMPLETELY* untested patch. Caveat emptor. It will probably
do unspeakable things to your family and pets.

  Linus


patch.diff
Description: Binary data


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-28 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 12:03 PM, Linus Torvalds
 wrote:
>
> mmap() is in *no* way special. The exact same thing happens for
> regular read/write. Yet somehow the mmap code is special-cased, while
> the normal read-write code is not.

I just double-checked, because it's been a long time since I actually
looked at the code.

But yeah, block device read/write uses the pure page cache functions.
IOW, it has the *exact* same IO engine as mmap() would have.

So here's my suggestion:

 - get rid of *all* the locking in aio_read/write and the splice paths
 - get rid of all the stupid mmap games

 - instead, add them to the functions that actually use
"blkdev_get_block()" and "blkdev_get_blocks()" and nowhere else.

   That's a fairly limited number of functions:
blkdev_{read,write}page(), blkdev_direct_IO() and
blkdev_write_{begin,end}()

Doesn't that sounds simpler? And more logical: it protects the actual
places that use the block size of the device.

I dunno. Maybe there is some fundamental reason why the above is
broken, but it seems to be a much simpler approach. Sure, you need to
guarantee that the people who get the write-lock cannot possibly cause
IO while holding it, but since the only reason to get the write lock
would be to change the block size, that should be pretty simple, no?

Yeah, yeah, I'm probably missing something fundamental, but the above
sounds like the simple approach to fixing things. Aiming for having
the block size read-lock be taken by the things that pass in the
block-size itself.

It would be nice for things to be logical and straightforward.

   Linus


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-28 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 11:50 AM, Mikulas Patocka  wrote:
>
> mmap_region() doesn't care about the block size. But a lot of
> page-in/page-out code does.

That seems a bogus argument.

mmap() is in *no* way special. The exact same thing happens for
regular read/write. Yet somehow the mmap code is special-cased, while
the normal read-write code is not.

I suspect it might be *easier* to trigger some issues with mmap, but
that still isn't a good enough reason to special-case it. We don't add
locking to one place just because that one place shows some race
condition more easily. We fix the locking.

So for example, maybe the code that *actually* cares about the buffer
size (the stuff that allocates buffers in fs/buffer.c) needs to take
that new percpu read lock. Basically, any caller of
"alloc_page_buffers()/create_empty_buffers()" or whatever.

I also wonder whether we need it at *all*. I suspect that we could
easily have multiple block-sizes these days for the same block device.
It *used* to be (millions of years ago, when dinosaurs roamed the
earth) that the block buffers were global and shared with all users of
a partition. But that hasn't been true since we started using the page
cache, and I suspect that some of the block size changing issues are
simply entirely stale.

Yeah, yeah, there could be some coherency issues if people write to
the block device through different block sizes, but I think we have
those coherency issues anyway. The page-cache is not coherent across
different mapping inodes anyway.

So I really suspect that some of this is "legacy logic". Or at least
perhaps _should_ be.

Linus


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-28 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 11:43 AM, Al Viro  wrote:
> Have a
> private vm_operations - a copy of generic_file_vm_ops with ->open()/->close()
> added to it.

That sounds more reasonable.

However, I suspect the *most* reasonable thing to do is to just remove
the whole damn thing. We really shouldn't care about mmap. If somebody
does a mmap on a block device, and somebody else then changes the
block size, why-ever should we bother to go through any contortions at
*all* to make that kind of insane behavior do anything sane at all.

Just let people mmap things. Then just let the normal page cache
invalidation work right. In fact, it is entirely possible that we
could/should just not even invalidate the page cache at all, just make
sure that the buffer heads attached to any pages get disconnected. No?

  Linus


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-28 Thread Mikulas Patocka


On Wed, 28 Nov 2012, Linus Torvalds wrote:

> No, this is crap.
> 
> We don't introduce random hooks like this just because the block layer
> has shit-for-brains and cannot be bothered to do things right.
> 
> The fact is, the whole locking in the block layer open routine is
> total and utter crap. It doesn't lock the right thing, even with your
> change *anyway* (or with the change Jens had). Absolutely nothing in
> "mmap_region()" cares at all about the block-size anywhere - it's
> generic, after all - so locking around it is f*cking pointless. There
> is no way in hell that the caller of ->mmap can *ever* care about the
> block size, since it never even looks at it.
> 
> Don't do random crap like this.
> 
> Why does the code think that mmap matters so much anyway? As you say,
> the mmap itself does *nothing*. It has no impact for the block size.
> 
>  Linus

mmap_region() doesn't care about the block size. But a lot of 
page-in/page-out code does.

The problem is that once the block device is mapped, page faults or page 
writeback can happen at any time - so the simplest solution is to not allow 
the block device to be mapped while we change the block size.

The function set_blocksize takes bd_block_size_semaphore for write (that 
blocks read/write/mmap), then it calls sync_blockdev (now we are sure that 
there is no more writeback), then it changes the block size, then it calls 
kill_bdev (now we are sure that there are no pages left with buffers of 
the old blocksize).

If you want to allow the block size to be changed while a block device is 
mapped, you'd have to add explicit locks around all mm callbacks (so that 
the block size can't change while a callback is in progress) - and even 
then, there are some unsolvable cases - i.e. what are you going to do if 
the user mlocks a mapped block device and you change the block size of that 
device? You can't drop the pages (that would violate mlock semantics) 
and you can't leave them there (because they have buffers of the wrong size).

If you don't like what I sent, propose a different solution.

Mikulas


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-28 Thread Al Viro
On Wed, Nov 28, 2012 at 11:15:12AM -0800, Linus Torvalds wrote:
> No, this is crap.
> 
> We don't introduce random hooks like this just because the block layer
> has shit-for-brains and cannot be bothered to do things right.
> 
> The fact is, the whole locking in the block layer open routine is
> total and utter crap. It doesn't lock the right thing, even with your
> change *anyway* (or with the change Jens had). Absolutely nothing in
> "mmap_region()" cares at all about the block-size anywhere - it's
> generic, after all - so locking around it is f*cking pointless. There
> is no way in hell that the caller of ->mmap can *ever* care about the
> block size, since it never even looks at it.
> 
> Don't do random crap like this.
> 
> Why does the code think that mmap matters so much anyway? As you say,
> the mmap itself does *nothing*. It has no impact for the block size.

... and here I was, trying to massage a reply into form that would be
at least borderline printable...  Anyway, this is certainly the wrong
way to go.  We want to catch if the damn thing is mapped and mapping_mapped()
leaves a race?  Fine, so let's not use mapping_mapped() at all.  Have a
private vm_operations - a copy of generic_file_vm_ops with ->open()/->close()
added to it.  Incrementing/decrementing a per-block_device atomic counter,
with increment side done under your rwsem, to make sure that 0->positive
transition doesn't change in critical section of set_blocksize().  And don't
use generic_file_mmap(), just do file_accessed() and set ->vm_ops to that
local copy.


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-28 Thread Linus Torvalds
No, this is crap.

We don't introduce random hooks like this just because the block layer
has shit-for-brains and cannot be bothered to do things right.

The fact is, the whole locking in the block layer open routine is
total and utter crap. It doesn't lock the right thing, even with your
change *anyway* (or with the change Jens had). Absolutely nothing in
"mmap_region()" cares at all about the block-size anywhere - it's
generic, after all - so locking around it is f*cking pointless. There
is no way in hell that the caller of ->mmap can *ever* care about the
block size, since it never even looks at it.

Don't do random crap like this.

Why does the code think that mmap matters so much anyway? As you say,
the mmap itself does *nothing*. It has no impact for the block size.

 Linus

On Wed, Nov 28, 2012 at 9:25 AM, Mikulas Patocka  wrote:
>
>
> On Wed, 28 Nov 2012, Jens Axboe wrote:
>
>> On 2012-11-28 04:57, Mikulas Patocka wrote:
>> >
>> > This patch is wrong because you must check if the device is mapped while
>> > holding bdev->bd_block_size_semaphore (because
>> > bdev->bd_block_size_semaphore prevents new mappings from being created)
>>
>> No it doesn't. If you read the patch, that was moved to i_mmap_mutex.
>
> Hmm, it was wrong before the patch and it is wrong after the patch too.
>
> The problem is that ->mmap method doesn't do the actual mapping, the
> caller of ->mmap (mmap_region) does it. So we must actually catch
> mmap_region and protect it with the lock, not ->mmap.


[PATCH] Introduce a method to catch mmap_region (was: Recent kernel "mount" slow)

2012-11-28 Thread Mikulas Patocka


On Wed, 28 Nov 2012, Jens Axboe wrote:

> On 2012-11-28 04:57, Mikulas Patocka wrote:
> > 
> > This patch is wrong because you must check if the device is mapped while 
> > holding bdev->bd_block_size_semaphore (because 
> > bdev->bd_block_size_semaphore prevents new mappings from being created)
> 
> No it doesn't. If you read the patch, that was moved to i_mmap_mutex.

Hmm, it was wrong before the patch and it is wrong after the patch too.

The problem is that ->mmap method doesn't do the actual mapping, the 
caller of ->mmap (mmap_region) does it. So we must actually catch 
mmap_region and protect it with the lock, not ->mmap.

Mikulas

---

Introduce a method to catch mmap_region

For block devices, we must prevent the device from being mapped while
the block size is being changed.

This was implemented by catching the mmap method and taking
bd_block_size_semaphore in it.

The problem is that the generic_file_mmap method does almost nothing, it
only sets vma->vm_ops = &generic_file_vm_ops, all the hard work is done
by the caller, mmap_region. Consequently, locking the mmap method is
insufficient to prevent the race condition. The race condition can
happen for example this way:

process 1:
Starts mapping a block device, it goes to mmap_region, calls
file->f_op->mmap (blkdev_mmap - that acquires and releases
bd_block_size_semaphore), and reschedule happens just after blkdev_mmap
returns.

process 2:
Changes block device size, goes into set_blocksize, acquires
percpu_down_write, calls mapping_mapped to test if the device is mapped
(it still isn't). Then, it reschedules.

process 1:
Continues in mmap_region, successfully finishes the mapping.

process 2:
Continues in set_blocksize and changes the block size while the block
device is mapped (which is incorrect and may result in a crash or
misbehavior).

To fix this possible race condition, we need to catch the function that
does the real mapping - mmap_region. This patch adds a new method,
file_operations->mmap_region. mmap_region calls the method
file_operations->mmap_region if it is non-NULL; otherwise, it calls
generic_mmap_region.

Signed-off-by: Mikulas Patocka 

---
 fs/block_dev.c |   12 
 include/linux/fs.h |4 
 include/linux/mm.h |3 +++
 mm/mmap.c  |   10 ++
 4 files changed, 25 insertions(+), 4 deletions(-)

Index: linux-3.7-rc7/include/linux/fs.h
===
--- linux-3.7-rc7.orig/include/linux/fs.h   2012-11-28 17:45:55.0 
+0100
+++ linux-3.7-rc7/include/linux/fs.h2012-11-28 18:02:45.0 +0100
@@ -27,6 +27,7 @@
 #include <linux/lockdep.h>
 #include <linux/percpu-rwsem.h>
 #include <linux/blk_types.h>
+#include <linux/mm_types.h>
 
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
@@ -1528,6 +1529,9 @@ struct file_operations {
unsigned int (*poll) (struct file *, struct poll_table_struct *);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
+   unsigned long (*mmap_region)(struct file *, unsigned long,
+unsigned long, unsigned long, vm_flags_t,
+unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
int (*open) (struct inode *, struct file *);
int (*flush) (struct file *, fl_owner_t id);
Index: linux-3.7-rc7/include/linux/mm.h
===
--- linux-3.7-rc7.orig/include/linux/mm.h   2012-11-28 17:45:55.0 
+0100
+++ linux-3.7-rc7/include/linux/mm.h2012-11-28 17:47:12.0 +0100
@@ -1444,6 +1444,9 @@ extern unsigned long get_unmapped_area(s
 extern unsigned long mmap_region(struct file *file, unsigned long addr,
unsigned long len, unsigned long flags,
vm_flags_t vm_flags, unsigned long pgoff);
+extern unsigned long generic_mmap_region(struct file *file, unsigned long addr,
+   unsigned long len, unsigned long flags,
+   vm_flags_t vm_flags, unsigned long pgoff);
 extern unsigned long do_mmap_pgoff(struct file *, unsigned long,
 unsigned long, unsigned long,
 unsigned long, unsigned long);
Index: linux-3.7-rc7/mm/mmap.c
===
--- linux-3.7-rc7.orig/mm/mmap.c2012-11-28 17:45:55.0 +0100
+++ linux-3.7-rc7/mm/mmap.c 2012-11-28 17:57:40.0 +0100
@@ -1244,6 +1244,16 @@ unsigned long mmap_region(struct file *f
  unsigned long len, unsigned long flags,
  vm_flags_t vm_flags, unsigned long pgoff)
 {
+   if (file && file->f_op->mmap_region)
+   return file->f_op->mmap_region(file, addr, len, flags, vm_flags,
+  pgoff);
+   return generic_mmap_region(file, addr, len, flags, vm_flags, pgoff);
+}
+
+unsigned long generic_mmap_region(struct file *file, unsigned long addr,
+ unsigned long len, 

Re: Recent kernel "mount" slow

2012-11-28 Thread Jeff Chua
On Wed, Nov 28, 2012 at 4:33 PM, Jens Axboe  wrote:
> On 2012-11-28 04:57, Mikulas Patocka wrote:
>>
>>
>> On Tue, 27 Nov 2012, Jens Axboe wrote:
>>
>>> On 2012-11-27 11:06, Jeff Chua wrote:
 On Tue, Nov 27, 2012 at 3:38 PM, Jens Axboe  wrote:
> On 2012-11-27 06:57, Jeff Chua wrote:
>> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua  
>> wrote:
>>> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka  
>>> wrote:
 So it's better to slow down mount.
>>>
>>> I am quite proud of the linux boot time pitting against other OS. Even
>>> with 10 partitions. Linux can boot up in just a few seconds, but now
>>> you're saying that we need to do this semaphore check at boot up. By
>>> doing so, it's inducing additional 4 seconds during boot up.
>>
>> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
>> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
>> kind of degradation would this cause or just the same?
>
> It'd likely be the same slow down time wise, but as a percentage it
> would appear smaller on a slower disk.
>
> Could you please test Mikulas' suggestion of changing
> synchronize_sched() in include/linux/percpu-rwsem.h to
> synchronize_sched_expedited()?

 Tested. It seems as fast as before, but may be a "tick" slower. Just
 perception. I was getting pretty much 0.012s with everything reverted.
 With synchronize_sched_expedited(), it seems to be 0.012s ~ 0.013s.
 So, it's good.
>>>
>>> Excellent
>>>
> linux-next also has a re-write of the per-cpu rw sems, out of Andrew's
> tree. It would be a good data point if you could test that, too.

 Tested. It's slower. 0.350s. But still faster than 0.500s without the 
 patch.
>>>
>>> Makes sense, it's 2 synchronize_sched() instead of 3. So it doesn't fix
>>> the real issue, which is having to do synchronize_sched() in the first
>>> place.
>>>
 # time mount /dev/sda1 /mnt; sync; sync; umount /mnt


 So, here's the comparison ...

 0.500s 3.7.0-rc7
 0.168s 3.7.0-rc2
 0.012s 3.6.0
 0.013s 3.7.0-rc7 + synchronize_sched_expedited()
 0.350s 3.7.0-rc7 + Oleg's patch.
>>>
>>> I wonder how many of them are due to changing to the same block size.
>>> Does the below patch make a difference?
>>
>> This patch is wrong because you must check if the device is mapped while
>> holding bdev->bd_block_size_semaphore (because
>> bdev->bd_block_size_semaphore prevents new mappings from being created)
>
> No it doesn't. If you read the patch, that was moved to i_mmap_mutex.
>
>> I'm sending another patch that has the same effect.
>>
>>
>> Note that ext[234] filesystems set blocksize to 1024 temporarily during
>> mount, so it doesn't help much (it only helps for other filesystems, such
>> as jfs). For ext[234], you have a device with default block size 4096, the
>> filesystem sets block size to 1024 during mount, reads the super block and
>> sets it back to 4096.
>
> That is true, hence I was hesitant to think it'll actually help. In any
> case, basically any block device will have at least one blocksize
> transitioned when being mounted for the first time. I wonder if we just
> shouldn't default to having a 4kb soft block size to avoid that one,
> though it is working around the issue to some degree.

I tested on reiserfs. It helped. 0.012s as in 3.6.0, but as Mikulas
mentioned, it didn't really improve much for ext2.

Jeff.


Re: Recent kernel "mount" slow

2012-11-28 Thread Jens Axboe
On 2012-11-28 04:57, Mikulas Patocka wrote:
> 
> 
> On Tue, 27 Nov 2012, Jens Axboe wrote:
> 
>> On 2012-11-27 11:06, Jeff Chua wrote:
>>> On Tue, Nov 27, 2012 at 3:38 PM, Jens Axboe  wrote:
 On 2012-11-27 06:57, Jeff Chua wrote:
> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua  
> wrote:
>> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka  
>> wrote:
>>> So it's better to slow down mount.
>>
>> I am quite proud of the linux boot time pitting against other OS. Even
>> with 10 partitions. Linux can boot up in just a few seconds, but now
>> you're saying that we need to do this semaphore check at boot up. By
>> doing so, it's inducing additional 4 seconds during boot up.
>
> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
> kind of degradation would this cause or just the same?

 It'd likely be the same slow down time wise, but as a percentage it
 would appear smaller on a slower disk.

 Could you please test Mikulas' suggestion of changing
 synchronize_sched() in include/linux/percpu-rwsem.h to
 synchronize_sched_expedited()?
>>>
>>> Tested. It seems as fast as before, but may be a "tick" slower. Just
>>> perception. I was getting pretty much 0.012s with everything reverted.
>>> With synchronize_sched_expedited(), it seems to be 0.012s ~ 0.013s.
>>> So, it's good.
>>
>> Excellent
>>
 linux-next also has a re-write of the per-cpu rw sems, out of Andrew's
 tree. It would be a good data point if you could test that, too.
>>>
>>> Tested. It's slower. 0.350s. But still faster than 0.500s without the patch.
>>
>> Makes sense, it's 2 synchronize_sched() instead of 3. So it doesn't fix
>> the real issue, which is having to do synchronize_sched() in the first
>> place.
>>
>>> # time mount /dev/sda1 /mnt; sync; sync; umount /mnt
>>>
>>>
>>> So, here's the comparison ...
>>>
>>> 0.500s 3.7.0-rc7
>>> 0.168s 3.7.0-rc2
>>> 0.012s 3.6.0
>>> 0.013s 3.7.0-rc7 + synchronize_sched_expedited()
>>> 0.350s 3.7.0-rc7 + Oleg's patch.
>>
>> I wonder how many of them are due to changing to the same block size.
>> Does the below patch make a difference?
> 
> This patch is wrong because you must check if the device is mapped while 
> holding bdev->bd_block_size_semaphore (because 
> bdev->bd_block_size_semaphore prevents new mappings from being created)

No it doesn't. If you read the patch, that was moved to i_mmap_mutex.

> I'm sending another patch that has the same effect.
> 
> 
> Note that ext[234] filesystems set blocksize to 1024 temporarily during 
> mount, so it doesn't help much (it only helps for other filesystems, such 
> as jfs). For ext[234], you have a device with default block size 4096, the 
> filesystem sets block size to 1024 during mount, reads the super block and 
> sets it back to 4096.

That is true, hence I was hesitant to think it'll actually help. In any
case, basically any block device will have at least one blocksize
transitioned when being mounted for the first time. I wonder if we just
shouldn't default to having a 4kb soft block size to avoid that one,
though it is working around the issue to some degree.

-- 
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Recent kernel mount slow

2012-11-28 Thread Jens Axboe
On 2012-11-28 04:57, Mikulas Patocka wrote:
 
 
 On Tue, 27 Nov 2012, Jens Axboe wrote:
 
 On 2012-11-27 11:06, Jeff Chua wrote:
 On Tue, Nov 27, 2012 at 3:38 PM, Jens Axboe ax...@kernel.dk wrote:
 On 2012-11-27 06:57, Jeff Chua wrote:
 On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua jeff.chua.li...@gmail.com 
 wrote:
 On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka mpato...@redhat.com 
 wrote:
 So it's better to slow down mount.

 I am quite proud of the linux boot time pitting against other OS. Even
 with 10 partitions. Linux can boot up in just a few seconds, but now
 you're saying that we need to do this semaphore check at boot up. By
 doing so, it's inducing additional 4 seconds during boot up.

 By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
 (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
 kind of degradation would this cause or just the same?

 It'd likely be the same slow down time wise, but as a percentage it
 would appear smaller on a slower disk.

 Could you please test Mikulas' suggestion of changing
 synchronize_sched() in include/linux/percpu-rwsem.h to
 synchronize_sched_expedited()?

 Tested. It seems as fast as before, but may be a tick slower. Just
 perception. I was getting pretty much 0.012s with everything reverted.
 With synchronize_sched_expedited(), it seems to be 0.012s ~ 0.013s.
 So, it's good.

 Excellent

 linux-next also has a re-write of the per-cpu rw sems, out of Andrews
 tree. It would be a good data point it you could test that, too.

 Tested. It's slower. 0.350s. But still faster than 0.500s without the patch.

 Makes sense, it's 2 synchronize_sched() instead of 3. So it doesn't fix
 the real issue, which is having to do synchronize_sched() in the first
 place.

 # time mount /dev/sda1 /mnt; sync; sync; umount /mnt


 So, here's the comparison ...

 0.500s 3.7.0-rc7
 0.168s 3.7.0-rc2
 0.012s 3.6.0
 0.013s 3.7.0-rc7 + synchronize_sched_expedited()
 0.350s 3.7.0-rc7 + Oleg's patch.

 I wonder how many of them are due to changing to the same block size.
 Does the below patch make a difference?
 
 This patch is wrong because you must check if the device is mapped while 
 holding bdev-bd_block_size_semaphore (because 
 bdev-bd_block_size_semaphore prevents new mappings from being created)

No it doesn't. If you read the patch, that was moved to i_mmap_mutex.

 I'm sending another patch that has the same effect.
 
 
 Note that ext[234] filesystems set blocksize to 1024 temporarily during 
 mount, so it doesn't help much (it only helps for other filesystems, such 
 as jfs). For ext[234], you have a device with default block size 4096, the 
 filesystem sets block size to 1024 during mount, reads the super block and 
 sets it back to 4096.

That is true, hence I was hesitant to think it'll actually help. In any
case, basically any block device will have at least one blocksize
transitioned when being mounted for the first time. I wonder if we just
shouldn't default to having a 4kb soft block size to avoid that one,
though it is working around the issue to some degree.

-- 
Jens Axboe

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Recent kernel mount slow

2012-11-28 Thread Jeff Chua
On Wed, Nov 28, 2012 at 4:33 PM, Jens Axboe ax...@kernel.dk wrote:
 On 2012-11-28 04:57, Mikulas Patocka wrote:


 On Tue, 27 Nov 2012, Jens Axboe wrote:

 On 2012-11-27 11:06, Jeff Chua wrote:
 On Tue, Nov 27, 2012 at 3:38 PM, Jens Axboe ax...@kernel.dk wrote:
 On 2012-11-27 06:57, Jeff Chua wrote:
 On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua jeff.chua.li...@gmail.com 
 wrote:
 On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka mpato...@redhat.com 
 wrote:
 So it's better to slow down mount.

 I am quite proud of the linux boot time pitting against other OS. Even
 with 10 partitions. Linux can boot up in just a few seconds, but now
 you're saying that we need to do this semaphore check at boot up. By
 doing so, it's inducing additional 4 seconds during boot up.

 By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
 (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
 kind of degradation would this cause or just the same?

 It'd likely be the same slow down time wise, but as a percentage it
 would appear smaller on a slower disk.

 Could you please test Mikulas' suggestion of changing
 synchronize_sched() in include/linux/percpu-rwsem.h to
 synchronize_sched_expedited()?

 Tested. It seems as fast as before, but may be a tick slower. Just
 perception. I was getting pretty much 0.012s with everything reverted.
 With synchronize_sched_expedited(), it seems to be 0.012s ~ 0.013s.
 So, it's good.

 Excellent

 linux-next also has a re-write of the per-cpu rw sems, out of Andrews
 tree. It would be a good data point it you could test that, too.

 Tested. It's slower. 0.350s. But still faster than 0.500s without the 
 patch.

 Makes sense, it's 2 synchronize_sched() instead of 3. So it doesn't fix
 the real issue, which is having to do synchronize_sched() in the first
 place.

 # time mount /dev/sda1 /mnt; sync; sync; umount /mnt


 So, here's the comparison ...

 0.500s 3.7.0-rc7
 0.168s 3.7.0-rc2
 0.012s 3.6.0
 0.013s 3.7.0-rc7 + synchronize_sched_expedited()
 0.350s 3.7.0-rc7 + Oleg's patch.

 I wonder how many of them are due to changing to the same block size.
 Does the below patch make a difference?

 This patch is wrong because you must check if the device is mapped while
 holding bdev-bd_block_size_semaphore (because
 bdev-bd_block_size_semaphore prevents new mappings from being created)

 No it doesn't. If you read the patch, that was moved to i_mmap_mutex.

 I'm sending another patch that has the same effect.


 Note that ext[234] filesystems set blocksize to 1024 temporarily during
 mount, so it doesn't help much (it only helps for other filesystems, such
 as jfs). For ext[234], you have a device with default block size 4096, the
 filesystem sets block size to 1024 during mount, reads the super block and
 sets it back to 4096.

 That is true, hence I was hesitant to think it'll actually help. In any
 case, basically any block device will have at least one blocksize
 transitioned when being mounted for the first time. I wonder if we just
 shouldn't default to having a 4kb soft block size to avoid that one,
 though it is working around the issue to some degree.

I tested on reiserfs. It helped. 0.012s as in 3.6.0, but as Mikulas
mentioned, it didn't really improve much for ext2.

Jeff.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)

2012-11-28 Thread Mikulas Patocka


On Wed, 28 Nov 2012, Jens Axboe wrote:

 On 2012-11-28 04:57, Mikulas Patocka wrote:
  
  This patch is wrong because you must check if the device is mapped while 
  holding bdev-bd_block_size_semaphore (because 
  bdev-bd_block_size_semaphore prevents new mappings from being created)
 
 No it doesn't. If you read the patch, that was moved to i_mmap_mutex.

Hmm, it was wrong before the patch and it is wrong after the patch too.

The problem is that -mmap method doesn't do the actual mapping, the 
caller of -mmap (mmap_region) does it. So we must actually catch 
mmap_region and protect it with the lock, not -mmap.

Mikulas

---

Introduce a method to catch mmap_region

For block devices, we must prevent the device from being mapped while
block size is being changed.

This was implemented by catching the mmap method and taking
bd_block_size_semaphore in it.

The problem is that the generic_file_mmap method does almost nothing, it
only sets vma-vm_ops = generic_file_vm_ops, all the hard work is done
by the caller, mmap_region. Consequently, locking the mmap method is
insufficient, to prevent the race condition. The race condition can
happen for example this way:

process 1:
Starts mapping a block device, it goes to mmap_region, calls
file-f_op-mmap (blkdev_mmap - that acquires and releases
bd_block_size_semaphore), and reschedule happens just after blkdev_mmap
returns.

process 2:
Changes block device size, goes into set_blocksize, acquires
percpu_down_write, calls mapping_mapped to test if the device is mapped
(it still isn't). Then, it reschedules.

process 1:
Continues in mmap_region, successfully finishes the mapping.

process 2:
Continues in set_blocksize, changes the block size while the block
device is mapped. (which is incorrect and may result in a crash or
misbehavior).

To fix this possible race condition, we need to catch the function that
does real mapping - mmap_region. This patch adds a new method,
file_operations-mmap_region. mmap_region calls the method
file_operations-mmap_region if it is non-NULL, otherwise, it calls
generic_mmap_region.

Signed-off-by: Mikulas Patocka mpato...@redhat.com

---
 fs/block_dev.c |   12 
 include/linux/fs.h |4 
 include/linux/mm.h |3 +++
 mm/mmap.c  |   10 ++
 4 files changed, 25 insertions(+), 4 deletions(-)

Index: linux-3.7-rc7/include/linux/fs.h
===
--- linux-3.7-rc7.orig/include/linux/fs.h   2012-11-28 17:45:55.0 
+0100
+++ linux-3.7-rc7/include/linux/fs.h2012-11-28 18:02:45.0 +0100
@@ -27,6 +27,7 @@
 #include linux/lockdep.h
 #include linux/percpu-rwsem.h
 #include linux/blk_types.h
+#include linux/mm_types.h
 
 #include asm/byteorder.h
 #include uapi/linux/fs.h
@@ -1528,6 +1529,9 @@ struct file_operations {
unsigned int (*poll) (struct file *, struct poll_table_struct *);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
+   unsigned long (*mmap_region)(struct file *, unsigned long,
+unsigned long, unsigned long, vm_flags_t,
+unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
int (*open) (struct inode *, struct file *);
int (*flush) (struct file *, fl_owner_t id);
Index: linux-3.7-rc7/include/linux/mm.h
===
--- linux-3.7-rc7.orig/include/linux/mm.h   2012-11-28 17:45:55.0 
+0100
+++ linux-3.7-rc7/include/linux/mm.h2012-11-28 17:47:12.0 +0100
@@ -1444,6 +1444,9 @@ extern unsigned long get_unmapped_area(s
 extern unsigned long mmap_region(struct file *file, unsigned long addr,
unsigned long len, unsigned long flags,
vm_flags_t vm_flags, unsigned long pgoff);
+extern unsigned long generic_mmap_region(struct file *file, unsigned long addr,
+   unsigned long len, unsigned long flags,
+   vm_flags_t vm_flags, unsigned long pgoff);
 extern unsigned long do_mmap_pgoff(struct file *, unsigned long,
 unsigned long, unsigned long,
 unsigned long, unsigned long);
Index: linux-3.7-rc7/mm/mmap.c
===
--- linux-3.7-rc7.orig/mm/mmap.c2012-11-28 17:45:55.0 +0100
+++ linux-3.7-rc7/mm/mmap.c 2012-11-28 17:57:40.0 +0100
@@ -1244,6 +1244,16 @@ unsigned long mmap_region(struct file *f
  unsigned long len, unsigned long flags,
  vm_flags_t vm_flags, unsigned long pgoff)
 {
+   if (file  file-f_op-mmap_region)
+   return file-f_op-mmap_region(file, addr, len, flags, vm_flags,
+  pgoff);
+   return generic_mmap_region(file, addr, len, flags, vm_flags, pgoff);
+}
+
+unsigned long 

Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)

2012-11-28 Thread Linus Torvalds
No, this is crap.

We don't introduce random hooks like this just because the block layer
has shit-for-brains and cannot be bothered to do things right.

The fact is, the whole locking in the block layer open routine is
total and utter crap. It doesn't lock the right thing, even with your
change *anyway* (or with the change Jens had). Absolutely nothing in
mmap_region() cares at all about the block-size anywhere - it's
generic, after all - so locking around it is f*cking pointless. There
is no way in hell that the caller of ->mmap can *ever* care about the
block size, since it never even looks at it.

Don't do random crap like this.

Why does the code think that mmap matters so much anyway? As you say,
the mmap itself does *nothing*. It has no impact for the block size.

 Linus

On Wed, Nov 28, 2012 at 9:25 AM, Mikulas Patocka mpato...@redhat.com wrote:


 On Wed, 28 Nov 2012, Jens Axboe wrote:

 On 2012-11-28 04:57, Mikulas Patocka wrote:
 
  This patch is wrong because you must check if the device is mapped while
  holding bdev->bd_block_size_semaphore (because
  bdev->bd_block_size_semaphore prevents new mappings from being created)

 No it doesn't. If you read the patch, that was moved to i_mmap_mutex.

 Hmm, it was wrong before the patch and it is wrong after the patch too.

 The problem is that the ->mmap method doesn't do the actual mapping, the
 caller of ->mmap (mmap_region) does it. So we must actually catch
 mmap_region and protect it with the lock, not ->mmap.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)

2012-11-28 Thread Al Viro
On Wed, Nov 28, 2012 at 11:15:12AM -0800, Linus Torvalds wrote:
 No, this is crap.
 
 We don't introduce random hooks like this just because the block layer
 has shit-for-brains and cannot be bothered to do things right.
 
 The fact is, the whole locking in the block layer open routine is
 total and utter crap. It doesn't lock the right thing, even with your
 change *anyway* (or with the change Jens had). Absolutely nothing in
 mmap_region() cares at all about the block-size anywhere - it's
 generic, after all - so locking around it is f*cking pointless. There
 is no way in hell that the caller of ->mmap can *ever* care about the
 block size, since it never even looks at it.
 
 Don't do random crap like this.
 
 Why does the code think that mmap matters so much anyway? As you say,
 the mmap itself does *nothing*. It has no impact for the block size.

... and here I was, trying to massage a reply into form that would be
at least borderline printable...  Anyway, this is certainly the wrong
way to go.  We want to catch if the damn thing is mapped and mapping_mapped()
leaves a race?  Fine, so let's not use mapping_mapped() at all.  Have a
private vm_operations - a copy of generic_file_vm_ops with ->open()/->close()
added to it.  Incrementing/decrementing a per-block_device atomic counter,
with increment side done under your rwsem, to make sure that 0->positive
transition doesn't change in critical section of set_blocksize().  And don't
use generic_file_mmap(), just do file_accessed() and set ->vm_ops to that
local copy.


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)

2012-11-28 Thread Mikulas Patocka


On Wed, 28 Nov 2012, Linus Torvalds wrote:

 No, this is crap.
 
 We don't introduce random hooks like this just because the block layer
 has shit-for-brains and cannot be bothered to do things right.
 
 The fact is, the whole locking in the block layer open routine is
 total and utter crap. It doesn't lock the right thing, even with your
 change *anyway* (or with the change Jens had). Absolutely nothing in
 mmap_region() cares at all about the block-size anywhere - it's
 generic, after all - so locking around it is f*cking pointless. There
 is no way in hell that the caller of ->mmap can *ever* care about the
 block size, since it never even looks at it.
 
 Don't do random crap like this.
 
 Why does the code think that mmap matters so much anyway? As you say,
 the mmap itself does *nothing*. It has no impact for the block size.
 
  Linus

mmap_region() doesn't care about the block size. But a lot of 
page-in/page-out code does.

The problem is that once the block device is mapped, page faults or page 
writeback can happen anytime - so the simplest solution is to not allow 
the block device being mapped while we change block size.

The function set_blocksize takes bd_block_size_semaphore for write (that 
blocks read/write/mmap), then it calls sync_blockdev (now we are sure that 
there is no more writeback), then it changes the block size, then it calls 
kill_bdev (now we are sure that there are no more any pages with buffers 
with the old blocksize).

If you want to allow to change block size while a block device is mapped, 
you'd have to add explicit locks around all mm callbacks (so that the 
block size can't change while the callback is in progress) - and even 
then, there are some unsolvable cases - i.e. what are you going to do if 
the user mlocks a mapped block device and you change block size of that 
device? - you can't drop the pages (that would violate mlock semantics) 
and you can leave them there (because they have buffers with wrong size).

If you don't like what I sent, propose a different solution.

Mikulas


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)

2012-11-28 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 11:43 AM, Al Viro v...@zeniv.linux.org.uk wrote:
 Have a
 private vm_operations - a copy of generic_file_vm_ops with ->open()/->close()
 added to it.

That sounds more reasonable.

However, I suspect the *most* reasonable thing to do is to just remove
the whole damn thing. We really shouldn't care about mmap. If somebody
does a mmap on a block device, and somebody else then changes the
block size, why-ever should we bother to go through any contortions at
*all* to make that kind of insane behavior do anything sane at all.

Just let people mmap things. Then just let the normal page cache
invalidation work right. In fact, it is entirely possible that we
could/should just not even invalidate the page cache at all, just make
sure that the buffer heads attached to any pages get disconnected. No?

  Linus


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)

2012-11-28 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 11:50 AM, Mikulas Patocka mpato...@redhat.com wrote:

 mmap_region() doesn't care about the block size. But a lot of
 page-in/page-out code does.

That seems a bogus argument.

mmap() is in *no* way special. The exact same thing happens for
regular read/write. Yet somehow the mmap code is special-cased, while
the normal read-write code is not.

I suspect it might be *easier* to trigger some issues with mmap, but
that still isn't a good enough reason to special-case it. We don't add
locking to one place just because that one place shows some race
condition more easily. We fix the locking.

So for example, maybe the code that *actually* cares about the buffer
size (the stuff that allocates buffers in fs/buffer.c) needs to take
that new percpu read lock. Basically, any caller of
alloc_page_buffers()/create_empty_buffers() or whatever.

I also wonder whether we need it *at*all*. I suspect that we could
easily have multiple block-sizes these days for the same block device.
It *used* to be (millions of years ago, when dinosaurs roamed the
earth) that the block buffers were global and shared with all users of
a partition. But that hasn't been true since we started using the page
cache, and I suspect that some of the block size changing issues are
simply entirely stale.

Yeah, yeah, there could be some coherency issues if people write to
the block device through different block sizes, but I think we have
those coherency issues anyway. The page-cache is not coherent across
different mapping inodes anyway.

So I really suspect that some of this is legacy logic. Or at least
perhaps _should_ be.

Linus


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)

2012-11-28 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 12:03 PM, Linus Torvalds
torva...@linux-foundation.org wrote:

 mmap() is in *no* way special. The exact same thing happens for
 regular read/write. Yet somehow the mmap code is special-cased, while
 the normal read-write code is not.

I just double-checked, because it's been a long time since I actually
looked at the code.

But yeah, block device read/write uses the pure page cache functions.
IOW, it has the *exact* same IO engine as mmap() would have.

So here's my suggestion:

 - get rid of *all* the locking in aio_read/write and the splice paths
 - get rid of all the stupid mmap games

 - instead, add them to the functions that actually use
blkdev_get_block() and blkdev_get_blocks() and nowhere else.

   That's a fairly limited number of functions:
blkdev_{read,write}page(), blkdev_direct_IO() and
blkdev_write_{begin,end}()

Doesn't that sound simpler? And more logical: it protects the actual
places that use the block size of the device.

I dunno. Maybe there is some fundamental reason why the above is
broken, but it seems to be a much simpler approach. Sure, you need to
guarantee that the people who get the write-lock cannot possibly cause
IO while holding it, but since the only reason to get the write lock
would be to change the block size, that should be pretty simple, no?

Yeah, yeah, I'm probably missing something fundamental, but the above
sounds like the simple approach to fixing things. Aiming for having
the block size read-lock be taken by the things that pass in the
block-size itself.

It would be nice for things to be logical and straightforward.

   Linus


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)

2012-11-28 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 12:13 PM, Linus Torvalds
torva...@linux-foundation.org wrote:

 I dunno. Maybe there is some fundamental reason why the above is
 broken, but it seems to be a much simpler approach. Sure, you need to
 guarantee that the people who get the write-lock cannot possibly cause
 IO while holding it, but since the only reason to get the write lock
 would be to change the block size, that should be pretty simple, no?

Here is a *COMPLETELY* untested patch. Caveat emptor. It will probably
do unspeakable things to your family and pets.

  Linus


patch.diff
Description: Binary data


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)

2012-11-28 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 12:32 PM, Linus Torvalds
torva...@linux-foundation.org wrote:

 Here is a *COMPLETELY* untested patch. Caveat emptor. It will probably
 do unspeakable things to your family and pets.

Btw, *if* this approach works, I suspect we could just switch the
bd_block_size_semaphore semaphore to be a regular rw-sem.

Why? Because now it's no longer ever gotten in the cached IO paths, we
only get it when we're doing much more expensive things (ie actual IO,
and buffer head allocations etc etc). As long as we just work with the
page cache, we never get to the whole lock at all.

Which means that the whole percpu-optimized thing is likely no longer
all that relevant.

But that's an independent thing, and it's only true *if* my patch
works. It looks fine on paper, but maybe there's something
fundamentally broken about it.

One big change my patch does is to move the sync_bdev/kill_bdev to
*after* changing the block size. It does that so that it can guarantee
that any old data (which didn't see the new block size) will be
sync'ed even if there is new IO coming in as we change the block size.

The old code locked the whole sync() region, which doesn't work with
my approach, since the sync will do IO and would thus cause potential
deadlocks while holding the rwsem for writing.

So with this patch, as the block size changes, you can actually have
some old pages with the old block size *and* some different new pages
with the new  block size all at the same time. It should all be
perfectly fine, but it's worth pointing out.

(It probably won't trigger in practice, though, since doing IO while
somebody else is changing the blocksize is fundamentally an odd thing
to do, but whatever. I also suspect that we *should* perhaps use the
inode->i_sem thing to serialize concurrent block size changes, but
that's again an independent issue)

   Linus


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)

2012-11-28 Thread Mikulas Patocka


On Wed, 28 Nov 2012, Linus Torvalds wrote:

 On Wed, Nov 28, 2012 at 12:03 PM, Linus Torvalds
 torva...@linux-foundation.org wrote:
 
  mmap() is in *no* way special. The exact same thing happens for
  regular read/write. Yet somehow the mmap code is special-cased, while
  the normal read-write code is not.
 
 I just double-checked, because it's been a long time since I actually
 looked at the code.
 
 But yeah, block device read/write uses the pure page cache functions.
 IOW, it has the *exact* same IO engine as mmap() would have.
 
 So here's my suggestion:
 
  - get rid of *all* the locking in aio_read/write and the splice paths
  - get rid of all the stupid mmap games
 
  - instead, add them to the functions that actually use
 blkdev_get_block() and blkdev_get_blocks() and nowhere else.
 
That's a fairly limited number of functions:
 blkdev_{read,write}page(), blkdev_direct_IO() and
 blkdev_write_{begin,end}()
 
 Doesn't that sound simpler? And more logical: it protects the actual
 places that use the block size of the device.
 
 I dunno. Maybe there is some fundamental reason why the above is
 broken, but it seems to be a much simpler approach. Sure, you need to
 guarantee that the people who get the write-lock cannot possibly cause
 IO while holding it, but since the only reason to get the write lock
 would be to change the block size, that should be pretty simple, no?
 
 Yeah, yeah, I'm probably missing something fundamental, but the above
 sounds like the simple approach to fixing things. Aiming for having
 the block size read-lock be taken by the things that pass in the
 block-size itself.
 
 It would be nice for things to be logical and straightforward.
 
Linus


The problem with this approach is that it is very easy to miss points 
where it is assumed that the block size doesn't change - and if you miss a 
point, it results in a hidden bug that has a little possibility of being 
found.

For example, __block_write_full_page and __block_write_begin do
if (!page_has_buffers(page)) { create_empty_buffers... }
and then they do
WARN_ON(bh->b_size != blocksize)
err = get_block(inode, block, bh, 1)

... so if the buffers were left over from some previous call to 
create_empty_buffers with a different blocksize, that WARN_ON is triggered. 
And it's not only a harmless warning - now bh->b_size is left set to the 
old block size, but bh->b_blocknr is set to a number that was calculated 
according to the new block size - and when you submit that buffer with 
submit_bh, it is written to the wrong place!

Now, prove that there are no more bugs like this.


Locking the whole read/write/mmap operations is crude, but at least it can 
be done without thorough review of all the memory management code.

Mikulas


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)

2012-11-28 Thread Mikulas Patocka


On Wed, 28 Nov 2012, Linus Torvalds wrote:

 On Wed, Nov 28, 2012 at 12:32 PM, Linus Torvalds
 torva...@linux-foundation.org wrote:
 
  Here is a *COMPLETELY* untested patch. Caveat emptor. It will probably
  do unspeakable things to your family and pets.
 
 Btw, *if* this approach works, I suspect we could just switch the
 bd_block_size_semaphore semaphore to be a regular rw-sem.
 
 Why? Because now it's no longer ever gotten in the cached IO paths, we
 only get it when we're doing much more expensive things (ie actual IO,
 and buffer head allocations etc etc). As long as we just work with the
 page cache, we never get to the whole lock at all.
 
 Which means that the whole percpu-optimized thing is likely no longer
 all that relevant.

Using a normal semaphore degrades direct-IO performance on raw devices; I 
measured that (45.1s with a normal rw-semaphore vs. 42.8s with the 
percpu-rw-semaphore). The difference can be measured on a ramdisk or on 
high-performance SSDs.

Mikulas


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)

2012-11-28 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 1:29 PM, Mikulas Patocka mpato...@redhat.com wrote:

 The problem with this approach is that it is very easy to miss points
 where it is assumed that the block size doesn't change - and if you miss a
 point, it results in a hidden bug that has a little possibility of being
 found.

Umm. Mikulas, *your* approach has resulted in bugs. So let's not throw
stones in glass houses, shall we?

The whole reason for this long thread (and several threads before it)
is that your model isn't working and is causing problems. I already
pointed out how bogus your arguments about mmap() locking were, and
then you have the gall to talk about potential bugs, when I have
pointed you to *actual* bugs, and actual mistakes.

 For example, __block_write_full_page and __block_write_begin do
 if (!page_has_buffers(page)) { create_empty_buffers... }
 and then they do
 WARN_ON(bh->b_size != blocksize)
 err = get_block(inode, block, bh, 1)

Right. And none of this is new.

 ... so if the buffers were left over from some previous call to
 create_empty_buffers with a different blocksize, that WARN_ON is triggered.

None of this can happen.

 Locking the whole read/write/mmap operations is crude, but at least it can
 be done without thorough review of all the memory management code.

Umm. Which you clearly didn't do, and raised totally crap arguments for.

In contrast, I have a very simple argument for the correctness of my
patch: every single user of the get_block[s]() interface now takes
the lock for as long as get_block[s]() is passed off to somebody else.
And since get_block[s]() is the only way to create those empty
buffers, I think I pretty much proved exactly what you ask for.

And THAT is the whole point and advantage of making locking sane. Sane
locking you can actually *think* about!

In contrast, locking around mmap() is absolutely *guaranteed* to be
insane, because mmap() doesn't actually do any of the IO that the lock
is supposed to protect against!

So Mikulas, quite frankly, your arguments argue against you. When you
say "Locking the whole read/write/mmap operations is crude, but at
least it can be done without thorough review", you are doubly correct:
it *is* crude, and it clearly *was* done without thought, since it's a
f*cking idiotic AND INCORRECT thing to do.

Seriously. Locking around mmap() is insane. It leads to insane
semantics (the whole EBUSY thing is purely because of that problem)
and it leads to bad code (your "let's add a new mmap_region hook" is
just disgusting, and while Al's idea of doing it in the existing
->open method is at least not nasty, it's definitely extra code and
complexity).

There are serious *CORRECTNESS* advantages to simplicity and
directness. And locking at the right point is definitely very much
part of that.

Anyway, as far as block size goes, we have exactly two cases:

 - random IO that does not care about the block size, and will just do
whatever the current block size is (ie normal anonymous accesses to
the block device).  This is the case that needs the locking - but it
only needs it around the individual page operations, ie exactly where
I put it. In fact, they can happily deal with different block sizes
for different pages, they don't really care.

 - mounted filesystems etc that require a particular block size and
set it at mount time, and they have exclusivity rules

The second case is the case that actually calls set_blocksize(), and
if kill_bdev() doesn't get rid of the old blocksizes, then they have
always been in trouble, and would always _continue_ to be in trouble,
regardless of locking.

Linus


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)

2012-11-28 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 2:52 PM, Linus Torvalds
torva...@linux-foundation.org wrote:

 For example, __block_write_full_page and __block_write_begin do
 if (!page_has_buffers(page)) { create_empty_buffers... }
 and then they do
 WARN_ON(bh->b_size != blocksize)
 err = get_block(inode, block, bh, 1)

 Right. And none of this is new.

.. which, btw, is not to say that *other* things aren't new. They are.

The change to actually change the block device buffer size before then
calling sync_bdev() is definitely a real change, and as mentioned, I
have not tested the patch in any way. If any block device driver were
to actually compare the IO size they get against the bdev->block_size
thing, they'd see very different behavior (ie they'd see the new block
size as they are asked to write out the old blocks with the old block
size).

So it does change semantics, no question about that. I don't think any
block device does it, though.

A bigger issue is for things that emulate what blkdev.c does, and
doesn't do the locking. I see code in md/bitmap.c that seems a bit
suspicious, for example. That said, it's not *new* breakage, and the
lock at mmap/read/write() time approach doesn't fix it either (since
the mapping will be different for the underlying MD device). So I do
think that we should take a look at all the users of
alloc_page_buffers() and create_empty_buffers() to see what *they*
do to protect the block-size, but I think that's an independent issue
from the raw device access case in fs/block_dev.c..

I guess I have to actually test my patch. I don't have very
interesting test-cases, though.

   Linus


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)

2012-11-28 Thread Mikulas Patocka


On Wed, 28 Nov 2012, Linus Torvalds wrote:

  For example, __block_write_full_page and __block_write_begin do
  if (!page_has_buffers(page)) { create_empty_buffers... }
  and then they do
  WARN_ON(bh->b_size != blocksize)
  err = get_block(inode, block, bh, 1)
 
 Right. And none of this is new.
 
  ... so if the buffers were left over from some previous call to
  create_empty_buffers with a different blocksize, that WARN_ON is triggered.
 
 None of this can happen.

It can happen. Take your patch (the one that moves bd_block_size_semaphore 
into blkdev_readpage, blkdev_writepage and blkdev_write_begin).

Insert msleep(1000) into set_blocksize, just before sync_blockdev.

Run this program:
#define _XOPEN_SOURCE 500
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>

static char array[4096];

int main(void)
{
int h;
system("dmsetup remove test 2>/dev/null");
if (system("dmsetup create test --table '0 1 zero'")) exit(1);
h = open("/dev/mapper/test", O_RDWR);
if (h < 0) perror("open"), exit(1);
if (pread(h, array, 512, 0) != 512) perror("pread"), exit(1);
if (system("dmsetup load test --table '0 8 zero'")) exit(1);
if (system("dmsetup suspend test")) exit(1);
if (system("dmsetup resume test")) exit(1);
if (system("blockdev --setbsz 2048 /dev/mapper/test")) exit(1);
usleep(500000);
if (pwrite(h, array, 4096, 0) != 4096) perror("pwrite"), exit(1);
return 0;
}

--- it triggers WARNING: at fs/buffer.c:1830 in __block_write_begin
[ 1243.30] Backtrace:
[ 1243.33]  [<40230ba8>] block_write_begin+0x70/0xd0
[ 1243.40]  [<402350cc>] blkdev_write_begin+0xb4/0x208
[ 1243.48]  [<401a9f10>] generic_file_buffered_write+0x248/0x348
[ 1243.57]  [<401ac8c4>] __generic_file_aio_write+0x1fc/0x388
[ 1243.66]  [<40235e74>] blkdev_aio_write+0x64/0xf0
[ 1243.74]  [<401f2108>] do_sync_write+0xd0/0x128
[ 1243.81]  [<401f2930>] vfs_write+0xa0/0x180
[ 1243.88]  [<401f2ecc>] sys_pwrite64+0xb4/0xd8
[ 1243.95]  [<40122104>] parisc_pwrite64+0x1c/0x28
[ 1244.02]  [<40106060>] syscall_exit+0x0/0x14


I'm not saying that your approach is wrong, you just have to carefully
review all memory management code for assumptions that the block size
doesn't change.

Mikulas


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)

2012-11-28 Thread Linus Torvalds
[ Sorry, I was offline for a while driving kids around ]

On Wed, Nov 28, 2012 at 4:38 PM, Mikulas Patocka mpato...@redhat.com wrote:

 It can happen. Take your patch (the one that moves bd_block_size_semaphore
 into blkdev_readpage, blkdev_writepage and blkdev_write_begin).

Interesting. The code *has* the block size (it's in bh->b_size), but
it actually then uses the inode blocksize instead, and verifies the
two against each other. It could just have used the block size
directly (and then used the inode i_blkbits only when no buffers
existed), avoiding that dependency entirely..

It actually does the same thing (with the same verification) in
__block_write_full_page() and (_without_ the verification) in
__block_commit_write().

Ho humm. All of those places actually do hold the rwsem for reading,
it's just that I don't want to hold it for writing over the sync..

Need to think on this,

Linus


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)

2012-11-28 Thread Mikulas Patocka


On Wed, 28 Nov 2012, Linus Torvalds wrote:

 A bigger issue is for things that emulate what blkdev.c does, and
 doesn't do the locking. I see code in md/bitmap.c that seems a bit
 suspicious, for example. That said, it's not *new* breakage, and the
 lock at mmap/read/write() time approach doesn't fix it either (since
 the mapping will be different for the underlying MD device). So I do
 think that we should take a look at all the users of
 alloc_page_buffers() and create_empty_buffers() to see what *they*
 do to protect the block-size, but I think that's an independent issue
 from the raw device access case in fs/block_dev.c..
 
 I guess I have to actually test my patch. I don't have very
 interesting test-cases, though.
 
Linus

Yes, the md code looks suspicious. It disables write access in 
deny_bitmap_write_access (that function looks buggy on its own - what 
happens if i_writecount == 0 or if it is already negative on entry?).

So we could disallow changing block size if i_writecount > 0, that should 
do the trick.

Mikulas


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)

2012-11-28 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 6:04 PM, Linus Torvalds
torva...@linux-foundation.org wrote:

 Interesting. The code *has* the block size (it's in bh->b_size), but
 it actually then uses the inode blocksize instead, and verifies the
 two against each other. It could just have used the block size
 directly (and then used the inode i_blkbits only when no buffers
 existed), avoiding that dependency entirely..

Looking more at this code, that really would be the nicest solution.

There's two cases for the whole get_block() thing:

 - filesystems. The block size will not change randomly, and
get_block() seriously depends on the block size.

 - the raw device. The block size *will* change, but to simplify the
problem, get_block() is a 1:1 mapping, so it doesn't even care about
the block size because it will always return bh->b_blocknr = nr.

So we *could* just say that all the fs/buffer.c code should use
inode->i_blkbits for creating buffers (because that's the size new
buffers should always use), but use bh->b_size for any *existing*
buffer use.

And looking at it, it's even simple. Except for one *very* annoying
thing: several users really don't want the size of the buffer, they
really do want the *shift* of the buffer size.

In fact, that single issue seems to be the reason why
inode->i_blkbits is really used in fs/buffer.c.

Otherwise it would be fairly trivial to just make the pattern be just a simple

	if (!page_has_buffers(page))
		create_empty_buffers(page, 1 << inode->i_blkbits, 0);
	head = page_buffers(page);
	blocksize = head->b_size;

and just use the blocksize that way, without any other games. All
done, no silly WARN_ON() to verify against some global block-size, and
the fs/buffer.c code would be perfectly simple, and would have no
problem at all with multiple different blocksizes in different pages
(the page lock serializes the buffers and thus the blocksize at the
per-page level).

But the fact that the code wants to do things like

	block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);

seriously seems to be the main thing that keeps us using
'inode->i_blkbits'. Calculating bbits from bh->b_size is just costly
enough to hurt (not everywhere, but on some machines).

Very annoying.

   Linus


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)

2012-11-28 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 6:58 PM, Linus Torvalds
torva...@linux-foundation.org wrote:

> But the fact that the code wants to do things like
>
> 	block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
>
> seriously seems to be the main thing that keeps us using
> 'inode->i_blkbits'. Calculating bbits from bh->b_size is just costly
> enough to hurt (not everywhere, but on some machines).
>
> Very annoying.

Hmm. Here's a patch that does that anyway. I'm not 100% happy with the
whole ilog2 thing, but at the same time, in other cases it actually
seems to improve code generation (ie gets rid of the whole unnecessary
two dereferences through page->mapping->host just to get the block
size, when we have it in the buffer-head that we have to touch
*anyway*).

Comments? Again, untested.

And I notice that Al Viro hasn't been cc'd, which is sad, since he's
been involved in much of fs/block_dev.c.

Al - this is an independent patch to fs/buffer.c to make
fs/block_dev.c able to change the block size of a block device while
there is IO in progress that may still use the old block size. The
discussion has been on fsdevel and lkml, but you may have missed it...

Linus


patch.diff
Description: Binary data


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)

2012-11-28 Thread Al Viro
On Wed, Nov 28, 2012 at 10:16:21PM -0800, Linus Torvalds wrote:
> On Wed, Nov 28, 2012 at 6:58 PM, Linus Torvalds
> torva...@linux-foundation.org wrote:
> >
> > But the fact that the code wants to do things like
> >
> > 	block = (sector_t)page->index << (PAGE_CACHE_SHIFT - bbits);
> >
> > seriously seems to be the main thing that keeps us using
> > 'inode->i_blkbits'. Calculating bbits from bh->b_size is just costly
> > enough to hurt (not everywhere, but on some machines).
> >
> > Very annoying.
>
> Hmm. Here's a patch that does that anyway. I'm not 100% happy with the
> whole ilog2 thing, but at the same time, in other cases it actually
> seems to improve code generation (ie gets rid of the whole unnecessary
> two dereferences through page->mapping->host just to get the block
> size, when we have it in the buffer-head that we have to touch
> *anyway*).
>
> Comments? Again, untested.
>
> And I notice that Al Viro hasn't been cc'd, which is sad, since he's
> been involved in much of fs/block_dev.c.
>
> Al - this is an independent patch to fs/buffer.c to make
> fs/block_dev.c able to change the block size of a block device while
> there is IO in progress that may still use the old block size. The
> discussion has been on fsdevel and lkml, but you may have missed it...

Umm...  set_blocksize() is calling kill_bdev(), which does
truncate_inode_pages(mapping, 0).  What's going to happen to data in
the dirty pages?  IO in progress is not the only thing to worry about...


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)

2012-11-28 Thread Al Viro
On Thu, Nov 29, 2012 at 06:25:19AM +, Al Viro wrote:

> 	Umm...  set_blocksize() is calling kill_bdev(), which does
> truncate_inode_pages(mapping, 0).  What's going to happen to data in
> the dirty pages?  IO in progress is not the only thing to worry about...

Note that sync_blockdev() a few lines prior to that is good only if we
have no other processes doing write(2) (or dirtying the mmapped pages,
for that matter).  The window isn't too wide, but...


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)

2012-11-28 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 10:25 PM, Al Viro v...@zeniv.linux.org.uk wrote:

> Umm...  set_blocksize() is calling kill_bdev(), which does
> truncate_inode_pages(mapping, 0).  What's going to happen to data in
> the dirty pages?  IO in progress is not the only thing to worry about...

Hmm. Yes. I think it works by virtue of "if you change the blocksize
while there is active IO, you're insane and you deserve whatever you
get".

It shouldn't even be fundamentally hard to make it work, although I
suspect it would be more code than it would be worth. The sane model
would be to not use truncate_inode_pages(), but instead just walk the
pages and get rid of the buffer heads with the wrong size. Preferably
*combining* that with the sync_blockdev().

We have no real reason to even invalidate the page cache, it's just
the buffers we want to get rid of.

But I suspect it's true that none of that is really *worth* it,
considering that nobody likely wants to do any concurrent IO. We don't
want to crash, or corrupt the data structures, but I suspect you get
what you deserve might actually be the right model ;)

So the current sync_blockdev()+kill_bdev() takes care of the *sane*
case (we flush any data that happened *before* the block size change),
and any concurrent writes with block-size changes are good luck with
that.

  Linus


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)

2012-11-28 Thread Linus Torvalds
On Wed, Nov 28, 2012 at 10:30 PM, Al Viro v...@zeniv.linux.org.uk wrote:

> Note that sync_blockdev() a few lines prior to that is good only if we
> have no other processes doing write(2) (or dirtying the mmapped pages,
> for that matter).  The window isn't too wide, but...

So with Mikulas' patches, the write actually would block (at write
level) due to the locking. The mmap'ed pages may be around and
flushed, but the logic to not allow currently *active* mmaps (with the
rather nasty random -EBUSY return value) should mean that there is no
race.

Or rather, there's a race, but it results in that EBUSY thing.

With my simplified locking, the sync_blockdev() is right before the
kill_bdev() (not a few lines prior), and in a perfect world they'd
actually be one single operation ("write back and invalidate pages
with the wrong block-size"). But they aren't.

Linus


Re: [PATCH] Introduce a method to catch mmap_region (was: Recent kernel mount slow)

2012-11-28 Thread Al Viro
On Wed, Nov 28, 2012 at 10:37:27PM -0800, Linus Torvalds wrote:
> On Wed, Nov 28, 2012 at 10:30 PM, Al Viro v...@zeniv.linux.org.uk wrote:
> >
> > Note that sync_blockdev() a few lines prior to that is good only if we
> > have no other processes doing write(2) (or dirtying the mmapped pages,
> > for that matter).  The window isn't too wide, but...
>
> So with Mikulas' patches, the write actually would block (at write
> level) due to the locking. The mmap'ed pages may be around and
> flushed, but the logic to not allow currently *active* mmaps (with the
> rather nasty random -EBUSY return value) should mean that there is no
> race.
>
> Or rather, there's a race, but it results in that EBUSY thing.

Same as with fs mounted on it, or the sucker having been claimed for
RAID array, etc.  Frankly, I'm more than slightly tempted to make
bdev mmap() just claim the sodding device exclusive for as long as
it's mmapped...

In principle, I agree, but...  I still have nightmares from mmap/truncate
races way back.  You are stepping into what used to be a really nasty
minefield.  I'll look into that, but it's *definitely* not -rc8 fodder.


Re: Recent kernel "mount" slow

2012-11-27 Thread Mikulas Patocka


On Tue, 27 Nov 2012, Jens Axboe wrote:

> On 2012-11-27 11:06, Jeff Chua wrote:
> > On Tue, Nov 27, 2012 at 3:38 PM, Jens Axboe  wrote:
> >> On 2012-11-27 06:57, Jeff Chua wrote:
> >>> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua  
> >>> wrote:
>  On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka  
>  wrote:
> > So it's better to slow down mount.
> 
>  I am quite proud of the linux boot time pitting against other OS. Even
>  with 10 partitions. Linux can boot up in just a few seconds, but now
>  you're saying that we need to do this semaphore check at boot up. By
>  doing so, it's inducing additional 4 seconds during boot up.
> >>>
> >>> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
> >>> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
> >>> kind of degradation would this cause or just the same?
> >>
> >> It'd likely be the same slow down time wise, but as a percentage it
> >> would appear smaller on a slower disk.
> >>
> >> Could you please test Mikulas' suggestion of changing
> >> synchronize_sched() in include/linux/percpu-rwsem.h to
> >> synchronize_sched_expedited()?
> > 
> > Tested. It seems as fast as before, but may be a "tick" slower. Just
> > perception. I was getting pretty much 0.012s with everything reverted.
> > With synchronize_sched_expedited(), it seems to be 0.012s ~ 0.013s.
> > So, it's good.
> 
> Excellent
> 
> >> linux-next also has a re-write of the per-cpu rw sems, out of Andrew's
> >> tree. It would be a good data point if you could test that, too.
> > 
> > Tested. It's slower. 0.350s. But still faster than 0.500s without the patch.
> 
> Makes sense, it's 2 synchronize_sched() instead of 3. So it doesn't fix
> the real issue, which is having to do synchronize_sched() in the first
> place.
> 
> > # time mount /dev/sda1 /mnt; sync; sync; umount /mnt
> > 
> > 
> > So, here's the comparison ...
> > 
> > 0.500s 3.7.0-rc7
> > 0.168s 3.7.0-rc2
> > 0.012s 3.6.0
> > 0.013s 3.7.0-rc7 + synchronize_sched_expedited()
> > 0.350s 3.7.0-rc7 + Oleg's patch.
> 
> I wonder how many of them are due to changing to the same block size.
> Does the below patch make a difference?

This patch is wrong because you must check if the device is mapped while 
holding bdev->bd_block_size_semaphore (because 
bdev->bd_block_size_semaphore prevents new mappings from being created)

I'm sending another patch that has the same effect.


Note that ext[234] filesystems set blocksize to 1024 temporarily during 
mount, so it doesn't help much (it only helps for other filesystems, such 
as jfs). For ext[234], you have a device with default block size 4096, the 
filesystem sets block size to 1024 during mount, reads the super block and 
sets it back to 4096.

If you want, you can fix ext[234] to avoid this behavior.

Mikulas


> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index 1a1e5e3..f041c56 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -126,29 +126,28 @@ int set_blocksize(struct block_device *bdev, int size)
>   if (size < bdev_logical_block_size(bdev))
>   return -EINVAL;
>  
> - /* Prevent starting I/O or mapping the device */
> - percpu_down_write(&bdev->bd_block_size_semaphore);
> -
>   /* Check that the block device is not memory mapped */
>   mapping = bdev->bd_inode->i_mapping;
>   mutex_lock(&mapping->i_mmap_mutex);
>   if (mapping_mapped(mapping)) {
>   mutex_unlock(&mapping->i_mmap_mutex);
> - percpu_up_write(&bdev->bd_block_size_semaphore);
>   return -EBUSY;
>   }
>   mutex_unlock(&mapping->i_mmap_mutex);
>  
>   /* Don't change the size if it is same as current */
>   if (bdev->bd_block_size != size) {
> - sync_blockdev(bdev);
> - bdev->bd_block_size = size;
> - bdev->bd_inode->i_blkbits = blksize_bits(size);
> - kill_bdev(bdev);
> + /* Prevent starting I/O */
> + percpu_down_write(&bdev->bd_block_size_semaphore);
> + if (bdev->bd_block_size != size) {
> + sync_blockdev(bdev);
> + bdev->bd_block_size = size;
> + bdev->bd_inode->i_blkbits = blksize_bits(size);
> + kill_bdev(bdev);
> + }
> + percpu_up_write(&bdev->bd_block_size_semaphore);
>   }
>  
> - percpu_up_write(&bdev->bd_block_size_semaphore);
> -
>   return 0;
>  }
>  
> @@ -1649,14 +1648,12 @@ EXPORT_SYMBOL_GPL(blkdev_aio_write);
>  
>  static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
>  {
> + struct address_space *mapping = file->f_mapping;
>   int ret;
> - struct block_device *bdev = I_BDEV(file->f_mapping->host);
> -
> - percpu_down_read(&bdev->bd_block_size_semaphore);
>  
> + mutex_lock(&mapping->i_mmap_mutex);
>   ret = generic_file_mmap(file, vma);
> -
> - percpu_up_read(&bdev->bd_block_size_semaphore);
> + mutex_unlock(&mapping->i_mmap_mutex);
>  
>   return ret;
>  }
> 
> -- 

Re: Recent kernel "mount" slow

2012-11-27 Thread Jens Axboe
On 2012-11-27 11:06, Jeff Chua wrote:
> On Tue, Nov 27, 2012 at 3:38 PM, Jens Axboe  wrote:
>> On 2012-11-27 06:57, Jeff Chua wrote:
>>> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua  
>>> wrote:
 On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka  
 wrote:
> So it's better to slow down mount.

 I am quite proud of the linux boot time pitting against other OS. Even
 with 10 partitions. Linux can boot up in just a few seconds, but now
 you're saying that we need to do this semaphore check at boot up. By
 doing so, it's inducing additional 4 seconds during boot up.
>>>
>>> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
>>> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
>>> kind of degradation would this cause or just the same?
>>
>> It'd likely be the same slow down time wise, but as a percentage it
>> would appear smaller on a slower disk.
>>
>> Could you please test Mikulas' suggestion of changing
>> synchronize_sched() in include/linux/percpu-rwsem.h to
>> synchronize_sched_expedited()?
> 
> Tested. It seems as fast as before, but may be a "tick" slower. Just
> perception. I was getting pretty much 0.012s with everything reverted.
> With synchronize_sched_expedited(), it seems to be 0.012s ~ 0.013s.
> So, it's good.

Excellent

>> linux-next also has a re-write of the per-cpu rw sems, out of Andrew's
>> tree. It would be a good data point if you could test that, too.
> 
> Tested. It's slower. 0.350s. But still faster than 0.500s without the patch.

Makes sense, it's 2 synchronize_sched() instead of 3. So it doesn't fix
the real issue, which is having to do synchronize_sched() in the first
place.

> # time mount /dev/sda1 /mnt; sync; sync; umount /mnt
> 
> 
> So, here's the comparison ...
> 
> 0.500s 3.7.0-rc7
> 0.168s 3.7.0-rc2
> 0.012s 3.6.0
> 0.013s 3.7.0-rc7 + synchronize_sched_expedited()
> 0.350s 3.7.0-rc7 + Oleg's patch.

I wonder how many of them are due to changing to the same block size.
Does the below patch make a difference?

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 1a1e5e3..f041c56 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -126,29 +126,28 @@ int set_blocksize(struct block_device *bdev, int size)
if (size < bdev_logical_block_size(bdev))
return -EINVAL;
 
-   /* Prevent starting I/O or mapping the device */
-   percpu_down_write(&bdev->bd_block_size_semaphore);
-
/* Check that the block device is not memory mapped */
mapping = bdev->bd_inode->i_mapping;
	mutex_lock(&mapping->i_mmap_mutex);
if (mapping_mapped(mapping)) {
	mutex_unlock(&mapping->i_mmap_mutex);
-   percpu_up_write(&bdev->bd_block_size_semaphore);
return -EBUSY;
}
	mutex_unlock(&mapping->i_mmap_mutex);
 
/* Don't change the size if it is same as current */
if (bdev->bd_block_size != size) {
-   sync_blockdev(bdev);
-   bdev->bd_block_size = size;
-   bdev->bd_inode->i_blkbits = blksize_bits(size);
-   kill_bdev(bdev);
+   /* Prevent starting I/O */
+   percpu_down_write(&bdev->bd_block_size_semaphore);
+   if (bdev->bd_block_size != size) {
+   sync_blockdev(bdev);
+   bdev->bd_block_size = size;
+   bdev->bd_inode->i_blkbits = blksize_bits(size);
+   kill_bdev(bdev);
+   }
+   percpu_up_write(&bdev->bd_block_size_semaphore);
}
 
-   percpu_up_write(&bdev->bd_block_size_semaphore);
-
return 0;
 }
 
@@ -1649,14 +1648,12 @@ EXPORT_SYMBOL_GPL(blkdev_aio_write);
 
 static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
 {
+   struct address_space *mapping = file->f_mapping;
int ret;
-   struct block_device *bdev = I_BDEV(file->f_mapping->host);
-
-   percpu_down_read(&bdev->bd_block_size_semaphore);
 
+   mutex_lock(&mapping->i_mmap_mutex);
ret = generic_file_mmap(file, vma);
-
-   percpu_up_read(&bdev->bd_block_size_semaphore);
+   mutex_unlock(&mapping->i_mmap_mutex);
 
return ret;
 }

-- 
Jens Axboe



Re: Recent kernel "mount" slow

2012-11-27 Thread Jeff Chua
On Tue, Nov 27, 2012 at 3:38 PM, Jens Axboe  wrote:
> On 2012-11-27 06:57, Jeff Chua wrote:
>> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua  wrote:
>>> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka  
>>> wrote:
 So it's better to slow down mount.
>>>
>>> I am quite proud of the linux boot time pitting against other OS. Even
>>> with 10 partitions. Linux can boot up in just a few seconds, but now
>>> you're saying that we need to do this semaphore check at boot up. By
>>> doing so, it's inducing additional 4 seconds during boot up.
>>
>> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
>> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
>> kind of degradation would this cause or just the same?
>
> It'd likely be the same slow down time wise, but as a percentage it
> would appear smaller on a slower disk.
>
> Could you please test Mikulas' suggestion of changing
> synchronize_sched() in include/linux/percpu-rwsem.h to
> synchronize_sched_expedited()?

Tested. It seems as fast as before, but may be a "tick" slower. Just
perception. I was getting pretty much 0.012s with everything reverted.
With synchronize_sched_expedited(), it seems to be 0.012s ~ 0.013s.
So, it's good.


> linux-next also has a re-write of the per-cpu rw sems, out of Andrew's
> tree. It would be a good data point if you could test that, too.

Tested. It's slower. 0.350s. But still faster than 0.500s without the patch.

# time mount /dev/sda1 /mnt; sync; sync; umount /mnt


So, here's the comparison ...

0.500s 3.7.0-rc7
0.168s 3.7.0-rc2
0.012s 3.6.0
0.013s 3.7.0-rc7 + synchronize_sched_expedited()
0.350s 3.7.0-rc7 + Oleg's patch.


Thanks,
Jeff.


Re: Recent kernel "mount" slow

2012-11-27 Thread Jeff Chua
Jens,

Limited access now at Incheon Airport. Will try the patch out when I arrived.

Thanks,
Jeff

On 11/27/12, Jens Axboe  wrote:
> On 2012-11-27 08:38, Jens Axboe wrote:
>> On 2012-11-27 06:57, Jeff Chua wrote:
>>> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua 
>>> wrote:
 On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka 
 wrote:
> So it's better to slow down mount.

 I am quite proud of the linux boot time pitting against other OS. Even
 with 10 partitions. Linux can boot up in just a few seconds, but now
 you're saying that we need to do this semaphore check at boot up. By
 doing so, it's inducing additional 4 seconds during boot up.
>>>
>>> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
>>> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
>>> kind of degradation would this cause or just the same?
>>
>> It'd likely be the same slow down time wise, but as a percentage it
>> would appear smaller on a slower disk.
>>
>> Could you please test Mikulas' suggestion of changing
>> synchronize_sched() in include/linux/percpu-rwsem.h to
>> synchronize_sched_expedited()?
>>
>> linux-next also has a re-write of the per-cpu rw sems, out of Andrew's
>> tree. It would be a good data point if you could test that, too.
>>
>> In any case, the slow down definitely isn't acceptable. Fixing an
>> obscure issue like block sizes changing while O_DIRECT is in flight
>> definitely does NOT warrant a mount slow down.
>
> Here's Oleg's patch, which might be easier for you than switching to
> linux-next. Please try that.
>
> From: Oleg Nesterov 
> Subject: percpu_rw_semaphore: reimplement to not block the readers
> unnecessarily
>
> Currently the writer does msleep() plus synchronize_sched() 3 times to
> acquire/release the semaphore, and during this time the readers are
> blocked completely.  Even if the "write" section was not actually started
> or if it was already finished.
>
> With this patch down_write/up_write does synchronize_sched() twice and
> down_read/up_read are still possible during this time, just they use the
> slow path.
>
> percpu_down_write() first forces the readers to use rw_semaphore and
> increment the "slow" counter to take the lock for reading, then it
> takes that rw_semaphore for writing and blocks the readers.
>
> Also.  With this patch the code relies on the documented behaviour of
> synchronize_sched(), it doesn't try to pair synchronize_sched() with
> barrier.
>
> Signed-off-by: Oleg Nesterov 
> Reviewed-by: Paul E. McKenney 
> Cc: Linus Torvalds 
> Cc: Mikulas Patocka 
> Cc: Peter Zijlstra 
> Cc: Ingo Molnar 
> Cc: Srikar Dronamraju 
> Cc: Ananth N Mavinakayanahalli 
> Cc: Anton Arapov 
> Cc: Jens Axboe 
> Signed-off-by: Andrew Morton 
> ---
>
>  include/linux/percpu-rwsem.h |   85 +++---
>  lib/Makefile |2
>  lib/percpu-rwsem.c   |  123 +
>  3 files changed, 138 insertions(+), 72 deletions(-)
>
> diff -puN
> include/linux/percpu-rwsem.h~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily
> include/linux/percpu-rwsem.h
> ---
> a/include/linux/percpu-rwsem.h~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily
> +++ a/include/linux/percpu-rwsem.h
> @@ -2,82 +2,25 @@
>  #define _LINUX_PERCPU_RWSEM_H
>
>  #include 
> +#include 
>  #include 
> -#include 
> -#include 
> +#include 
>
>  struct percpu_rw_semaphore {
> - unsigned __percpu *counters;
> - bool locked;
> - struct mutex mtx;
> + unsigned int __percpu   *fast_read_ctr;
> + struct mutexwriter_mutex;
> + struct rw_semaphore rw_sem;
> + atomic_tslow_read_ctr;
> + wait_queue_head_t   write_waitq;
>  };
>
> -#define light_mb()   barrier()
> -#define heavy_mb()   synchronize_sched()
> +extern void percpu_down_read(struct percpu_rw_semaphore *);
> +extern void percpu_up_read(struct percpu_rw_semaphore *);
>
> -static inline void percpu_down_read(struct percpu_rw_semaphore *p)
> -{
> - rcu_read_lock_sched();
> - if (unlikely(p->locked)) {
> - rcu_read_unlock_sched();
> - mutex_lock(&p->mtx);
> - this_cpu_inc(*p->counters);
> - mutex_unlock(&p->mtx);
> - return;
> - }
> - this_cpu_inc(*p->counters);
> - rcu_read_unlock_sched();
> - light_mb(); /* A, between read of p->locked and read of data, paired 
> with
> D */
> -}
> -
> -static inline void percpu_up_read(struct percpu_rw_semaphore *p)
> -{
> - light_mb(); /* B, between read of the data and write to p->counter, 
> paired
> with C */
> - this_cpu_dec(*p->counters);
> -}
> -
> -static inline unsigned __percpu_count(unsigned __percpu *counters)
> -{
> - unsigned total = 0;
> - int cpu;
> -
> - for_each_possible_cpu(cpu)
> - total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu));
> -
> - return total;
> -}
> -
> -static inline void percpu_down_write(struct 

Re: Recent kernel mount slow

2012-11-27 Thread Jeff Chua
Jens,

Limited access now at Incheon Airport. Will try the patch out when I arrived.

Thanks,
Jeff

On 11/27/12, Jens Axboe ax...@kernel.dk wrote:
 On 2012-11-27 08:38, Jens Axboe wrote:
 On 2012-11-27 06:57, Jeff Chua wrote:
 On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua jeff.chua.li...@gmail.com
 wrote:
 On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka mpato...@redhat.com
 wrote:
 So it's better to slow down mount.

 I am quite proud of the linux boot time pitting against other OS. Even
 with 10 partitions. Linux can boot up in just a few seconds, but now
 you're saying that we need to do this semaphore check at boot up. By
 doing so, it's inducing additional 4 seconds during boot up.

 By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
 (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
 kind of degradation would this cause or just the same?

 It'd likely be the same slow down time wise, but as a percentage it
 would appear smaller on a slower disk.

 Could you please test Mikulas' suggestion of changing
 synchronize_sched() in include/linux/percpu-rwsem.h to
 synchronize_sched_expedited()?

 linux-next also has a re-write of the per-cpu rw sems, out of Andrews
 tree. It would be a good data point it you could test that, too.

 In any case, the slow down definitely isn't acceptable. Fixing an
 obscure issue like block sizes changing while O_DIRECT is in flight
 definitely does NOT warrant a mount slow down.

 Here's Olegs patch, might be easier for you than switching to
 linux-next. Please try that.

 From: Oleg Nesterov o...@redhat.com
 Subject: percpu_rw_semaphore: reimplement to not block the readers
 unnecessarily

 Currently the writer does msleep() plus synchronize_sched() 3 times to
 acquire/release the semaphore, and during this time the readers are
 blocked completely.  Even if the write section was not actually started
 or if it was already finished.

 With this patch down_write/up_write does synchronize_sched() twice and
 down_read/up_read are still possible during this time, just they use the
 slow path.

 percpu_down_write() first forces the readers to use rw_semaphore and
 increment the slow counter to take the lock for reading, then it
 takes that rw_semaphore for writing and blocks the readers.

 Also.  With this patch the code relies on the documented behaviour of
 synchronize_sched(), it doesn't try to pair synchronize_sched() with
 barrier.

 Signed-off-by: Oleg Nesterov o...@redhat.com
 Reviewed-by: Paul E. McKenney paul...@linux.vnet.ibm.com
 Cc: Linus Torvalds torva...@linux-foundation.org
 Cc: Mikulas Patocka mpato...@redhat.com
 Cc: Peter Zijlstra pet...@infradead.org
 Cc: Ingo Molnar mi...@elte.hu
 Cc: Srikar Dronamraju sri...@linux.vnet.ibm.com
 Cc: Ananth N Mavinakayanahalli ana...@in.ibm.com
 Cc: Anton Arapov an...@redhat.com
 Cc: Jens Axboe ax...@kernel.dk
 Signed-off-by: Andrew Morton a...@linux-foundation.org
 ---

  include/linux/percpu-rwsem.h |   85 +++---
  lib/Makefile |2
  lib/percpu-rwsem.c   |  123 +
  3 files changed, 138 insertions(+), 72 deletions(-)

 diff -puN
 include/linux/percpu-rwsem.h~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily
 include/linux/percpu-rwsem.h
 ---
 a/include/linux/percpu-rwsem.h~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily
 +++ a/include/linux/percpu-rwsem.h
 @@ -2,82 +2,25 @@
  #define _LINUX_PERCPU_RWSEM_H

  #include linux/mutex.h
 +#include linux/rwsem.h
  #include linux/percpu.h
 -#include linux/rcupdate.h
 -#include linux/delay.h
 +#include linux/wait.h

  struct percpu_rw_semaphore {
 - unsigned __percpu *counters;
 - bool locked;
 - struct mutex mtx;
 + unsigned int __percpu   *fast_read_ctr;
 + struct mutexwriter_mutex;
 + struct rw_semaphore rw_sem;
 + atomic_tslow_read_ctr;
 + wait_queue_head_t   write_waitq;
  };

 -#define light_mb()   barrier()
 -#define heavy_mb()   synchronize_sched()
 +extern void percpu_down_read(struct percpu_rw_semaphore *);
 +extern void percpu_up_read(struct percpu_rw_semaphore *);

 -static inline void percpu_down_read(struct percpu_rw_semaphore *p)
 -{
 - rcu_read_lock_sched();
 - if (unlikely(p-locked)) {
 - rcu_read_unlock_sched();
 - mutex_lock(p-mtx);
 - this_cpu_inc(*p-counters);
 - mutex_unlock(p-mtx);
 - return;
 - }
 - this_cpu_inc(*p-counters);
 - rcu_read_unlock_sched();
 - light_mb(); /* A, between read of p-locked and read of data, paired 
 with
 D */
 -}
 -
 -static inline void percpu_up_read(struct percpu_rw_semaphore *p)
 -{
 - light_mb(); /* B, between read of the data and write to p-counter, 
 paired
 with C */
 - this_cpu_dec(*p-counters);
 -}
 -
 -static inline unsigned __percpu_count(unsigned __percpu *counters)
 -{
 - unsigned total = 0;
 - int cpu;
 -
 -  

Re: Recent kernel mount slow

2012-11-27 Thread Jeff Chua
On Tue, Nov 27, 2012 at 3:38 PM, Jens Axboe ax...@kernel.dk wrote:
 On 2012-11-27 06:57, Jeff Chua wrote:
 On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua jeff.chua.li...@gmail.com wrote:
 On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka mpato...@redhat.com 
 wrote:
 So it's better to slow down mount.

 I am quite proud of the linux boot time pitting against other OS. Even
 with 10 partitions. Linux can boot up in just a few seconds, but now
 you're saying that we need to do this semaphore check at boot up. By
 doing so, it's inducing additional 4 seconds during boot up.

 By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
 (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
 kind of degradation would this cause or just the same?

 It'd likely be the same slow down time wise, but as a percentage it
 would appear smaller on a slower disk.

 Could you please test Mikulas' suggestion of changing
 synchronize_sched() in include/linux/percpu-rwsem.h to
 synchronize_sched_expedited()?

Tested. It seems as fast as before, but may be a tick slower. Just
perception. I was getting pretty much 0.012s with everything reverted.
With synchronize_sched_expedited(), it seems to be 0.012s ~ 0.013s.
So, it's good.


 linux-next also has a re-write of the per-cpu rw sems, out of Andrews
 tree. It would be a good data point it you could test that, too.

Tested. It's slower. 0.350s. But still faster than 0.500s without the patch.

# time mount /dev/sda1 /mnt; sync; sync; umount /mnt


So, here's the comparison ...

0.500s 3.7.0-rc7
0.168s 3.7.0-rc2
0.012s 3.6.0
0.013s 3.7.0-rc7 + synchronize_sched_expedited()
0.350s 3.7.0-rc7 + Oleg's patch.


Thanks,
Jeff.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Recent kernel mount slow

2012-11-27 Thread Jens Axboe
On 2012-11-27 11:06, Jeff Chua wrote:
 On Tue, Nov 27, 2012 at 3:38 PM, Jens Axboe ax...@kernel.dk wrote:
 On 2012-11-27 06:57, Jeff Chua wrote:
 On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua jeff.chua.li...@gmail.com 
 wrote:
 On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka mpato...@redhat.com 
 wrote:
 So it's better to slow down mount.

 I am quite proud of the linux boot time pitting against other OS. Even
 with 10 partitions. Linux can boot up in just a few seconds, but now
 you're saying that we need to do this semaphore check at boot up. By
 doing so, it's inducing additional 4 seconds during boot up.

 By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
 (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
 kind of degradation would this cause or just the same?

 It'd likely be the same slow down time wise, but as a percentage it
 would appear smaller on a slower disk.

 Could you please test Mikulas' suggestion of changing
 synchronize_sched() in include/linux/percpu-rwsem.h to
 synchronize_sched_expedited()?
 
 Tested. It seems as fast as before, but may be a tick slower. Just
 perception. I was getting pretty much 0.012s with everything reverted.
 With synchronize_sched_expedited(), it seems to be 0.012s ~ 0.013s.
 So, it's good.

Excellent

 linux-next also has a re-write of the per-cpu rw sems, out of Andrew's
 tree. It would be a good data point if you could test that, too.
 
 Tested. It's slower. 0.350s. But still faster than 0.500s without the patch.

Makes sense, it's 2 synchronize_sched() instead of 3. So it doesn't fix
the real issue, which is having to do synchronize_sched() in the first
place.

 # time mount /dev/sda1 /mnt; sync; sync; umount /mnt
 
 
 So, here's the comparison ...
 
 0.500s 3.7.0-rc7
 0.168s 3.7.0-rc2
 0.012s 3.6.0
 0.013s 3.7.0-rc7 + synchronize_sched_expedited()
 0.350s 3.7.0-rc7 + Oleg's patch.

I wonder how many of them are due to changing to the same block size.
Does the below patch make a difference?

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 1a1e5e3..f041c56 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -126,29 +126,28 @@ int set_blocksize(struct block_device *bdev, int size)
 	if (size < bdev_logical_block_size(bdev))
 		return -EINVAL;
 
-	/* Prevent starting I/O or mapping the device */
-	percpu_down_write(&bdev->bd_block_size_semaphore);
-
 	/* Check that the block device is not memory mapped */
 	mapping = bdev->bd_inode->i_mapping;
 	mutex_lock(&mapping->i_mmap_mutex);
 	if (mapping_mapped(mapping)) {
 		mutex_unlock(&mapping->i_mmap_mutex);
-		percpu_up_write(&bdev->bd_block_size_semaphore);
 		return -EBUSY;
 	}
 	mutex_unlock(&mapping->i_mmap_mutex);
 
 	/* Don't change the size if it is same as current */
 	if (bdev->bd_block_size != size) {
-		sync_blockdev(bdev);
-		bdev->bd_block_size = size;
-		bdev->bd_inode->i_blkbits = blksize_bits(size);
-		kill_bdev(bdev);
+		/* Prevent starting I/O */
+		percpu_down_write(&bdev->bd_block_size_semaphore);
+		if (bdev->bd_block_size != size) {
+			sync_blockdev(bdev);
+			bdev->bd_block_size = size;
+			bdev->bd_inode->i_blkbits = blksize_bits(size);
+			kill_bdev(bdev);
+		}
+		percpu_up_write(&bdev->bd_block_size_semaphore);
 	}
 
-	percpu_up_write(&bdev->bd_block_size_semaphore);
-
 	return 0;
 }
 
@@ -1649,14 +1648,12 @@ EXPORT_SYMBOL_GPL(blkdev_aio_write);
 
 static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
 {
+	struct address_space *mapping = file->f_mapping;
 	int ret;
-	struct block_device *bdev = I_BDEV(file->f_mapping->host);
-
-	percpu_down_read(&bdev->bd_block_size_semaphore);
 
+	mutex_lock(&mapping->i_mmap_mutex);
 	ret = generic_file_mmap(file, vma);
-
-	percpu_up_read(&bdev->bd_block_size_semaphore);
+	mutex_unlock(&mapping->i_mmap_mutex);
 
 	return ret;
 }

-- 
Jens Axboe

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Recent kernel mount slow

2012-11-27 Thread Mikulas Patocka


On Tue, 27 Nov 2012, Jens Axboe wrote:

 On 2012-11-27 11:06, Jeff Chua wrote:
  On Tue, Nov 27, 2012 at 3:38 PM, Jens Axboe ax...@kernel.dk wrote:
  On 2012-11-27 06:57, Jeff Chua wrote:
  On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua jeff.chua.li...@gmail.com 
  wrote:
  On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka mpato...@redhat.com 
  wrote:
  So it's better to slow down mount.
 
  I am quite proud of the linux boot time pitting against other OS. Even
  with 10 partitions. Linux can boot up in just a few seconds, but now
  you're saying that we need to do this semaphore check at boot up. By
  doing so, it's inducing additional 4 seconds during boot up.
 
  By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
  (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
  kind of degradation would this cause or just the same?
 
  It'd likely be the same slow down time wise, but as a percentage it
  would appear smaller on a slower disk.
 
  Could you please test Mikulas' suggestion of changing
  synchronize_sched() in include/linux/percpu-rwsem.h to
  synchronize_sched_expedited()?
  
  Tested. It seems as fast as before, but may be a tick slower. Just
  perception. I was getting pretty much 0.012s with everything reverted.
  With synchronize_sched_expedited(), it seems to be 0.012s ~ 0.013s.
  So, it's good.
 
 Excellent
 
  linux-next also has a re-write of the per-cpu rw sems, out of Andrew's
  tree. It would be a good data point if you could test that, too.
  
  Tested. It's slower. 0.350s. But still faster than 0.500s without the patch.
 
 Makes sense, it's 2 synchronize_sched() instead of 3. So it doesn't fix
 the real issue, which is having to do synchronize_sched() in the first
 place.
 
  # time mount /dev/sda1 /mnt; sync; sync; umount /mnt
  
  
  So, here's the comparison ...
  
  0.500s 3.7.0-rc7
  0.168s 3.7.0-rc2
  0.012s 3.6.0
  0.013s 3.7.0-rc7 + synchronize_sched_expedited()
  0.350s 3.7.0-rc7 + Oleg's patch.
 
 I wonder how many of them are due to changing to the same block size.
 Does the below patch make a difference?

This patch is wrong because you must check if the device is mapped while 
holding bdev->bd_block_size_semaphore (because 
bdev->bd_block_size_semaphore prevents new mappings from being created).

I'm sending another patch that has the same effect.


Note that ext[234] filesystems set blocksize to 1024 temporarily during 
mount, so it doesn't help much (it only helps for other filesystems, such 
as jfs). For ext[234], you have a device with default block size 4096, the 
filesystem sets block size to 1024 during mount, reads the super block and 
sets it back to 4096.

If you want, you can fix ext[234] to avoid this behavior.

Mikulas


 diff --git a/fs/block_dev.c b/fs/block_dev.c
 index 1a1e5e3..f041c56 100644
 --- a/fs/block_dev.c
 +++ b/fs/block_dev.c
 @@ -126,29 +126,28 @@ int set_blocksize(struct block_device *bdev, int size)
 	if (size < bdev_logical_block_size(bdev))
 		return -EINVAL;
  
 -	/* Prevent starting I/O or mapping the device */
 -	percpu_down_write(&bdev->bd_block_size_semaphore);
 -
 	/* Check that the block device is not memory mapped */
 	mapping = bdev->bd_inode->i_mapping;
 	mutex_lock(&mapping->i_mmap_mutex);
 	if (mapping_mapped(mapping)) {
 		mutex_unlock(&mapping->i_mmap_mutex);
 -		percpu_up_write(&bdev->bd_block_size_semaphore);
 		return -EBUSY;
 	}
 	mutex_unlock(&mapping->i_mmap_mutex);
  
 	/* Don't change the size if it is same as current */
 	if (bdev->bd_block_size != size) {
 -		sync_blockdev(bdev);
 -		bdev->bd_block_size = size;
 -		bdev->bd_inode->i_blkbits = blksize_bits(size);
 -		kill_bdev(bdev);
 +		/* Prevent starting I/O */
 +		percpu_down_write(&bdev->bd_block_size_semaphore);
 +		if (bdev->bd_block_size != size) {
 +			sync_blockdev(bdev);
 +			bdev->bd_block_size = size;
 +			bdev->bd_inode->i_blkbits = blksize_bits(size);
 +			kill_bdev(bdev);
 +		}
 +		percpu_up_write(&bdev->bd_block_size_semaphore);
 	}
  
 -	percpu_up_write(&bdev->bd_block_size_semaphore);
 -
 	return 0;
  }
  
 @@ -1649,14 +1648,12 @@ EXPORT_SYMBOL_GPL(blkdev_aio_write);
  
  static int blkdev_mmap(struct file *file, struct vm_area_struct *vma)
  {
 +	struct address_space *mapping = file->f_mapping;
 	int ret;
 -	struct block_device *bdev = I_BDEV(file->f_mapping->host);
 -
 -	percpu_down_read(&bdev->bd_block_size_semaphore);
  
 +	mutex_lock(&mapping->i_mmap_mutex);
 	ret = generic_file_mmap(file, vma);
 -
 -	percpu_up_read(&bdev->bd_block_size_semaphore);
 +	mutex_unlock(&mapping->i_mmap_mutex);
  
 	return ret;
  }
 
 -- 
 Jens Axboe
 
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of 

Re: Recent kernel "mount" slow

2012-11-26 Thread Jens Axboe
On 2012-11-27 08:38, Jens Axboe wrote:
> On 2012-11-27 06:57, Jeff Chua wrote:
>> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua  wrote:
>>> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka  
>>> wrote:
 So it's better to slow down mount.
>>>
>>> I am quite proud of the linux boot time pitting against other OS. Even
>>> with 10 partitions. Linux can boot up in just a few seconds, but now
>>> you're saying that we need to do this semaphore check at boot up. By
>>> doing so, it's inducing additional 4 seconds during boot up.
>>
>> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
>> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
>> kind of degradation would this cause or just the same?
> 
> It'd likely be the same slow down time wise, but as a percentage it
> would appear smaller on a slower disk.
> 
> Could you please test Mikulas' suggestion of changing
> synchronize_sched() in include/linux/percpu-rwsem.h to
> synchronize_sched_expedited()?
> 
> linux-next also has a re-write of the per-cpu rw sems, out of Andrew's
> tree. It would be a good data point if you could test that, too.
> 
> In any case, the slow down definitely isn't acceptable. Fixing an
> obscure issue like block sizes changing while O_DIRECT is in flight
> definitely does NOT warrant a mount slow down.

Here's Olegs patch, might be easier for you than switching to
linux-next. Please try that.

From: Oleg Nesterov 
Subject: percpu_rw_semaphore: reimplement to not block the readers unnecessarily

Currently the writer does msleep() plus synchronize_sched() 3 times to
acquire/release the semaphore, and during this time the readers are
blocked completely.  Even if the "write" section was not actually started
or if it was already finished.

With this patch down_write/up_write does synchronize_sched() twice and
down_read/up_read are still possible during this time, just they use the
slow path.

percpu_down_write() first forces the readers to use rw_semaphore and
increment the "slow" counter to take the lock for reading, then it
takes that rw_semaphore for writing and blocks the readers.

Also.  With this patch the code relies on the documented behaviour of
synchronize_sched(), it doesn't try to pair synchronize_sched() with
barrier.

Signed-off-by: Oleg Nesterov 
Reviewed-by: Paul E. McKenney 
Cc: Linus Torvalds 
Cc: Mikulas Patocka 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Srikar Dronamraju 
Cc: Ananth N Mavinakayanahalli 
Cc: Anton Arapov 
Cc: Jens Axboe 
Signed-off-by: Andrew Morton 
---

 include/linux/percpu-rwsem.h |   85 +++---
 lib/Makefile |2 
 lib/percpu-rwsem.c   |  123 +
 3 files changed, 138 insertions(+), 72 deletions(-)

diff -puN 
include/linux/percpu-rwsem.h~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily
 include/linux/percpu-rwsem.h
--- 
a/include/linux/percpu-rwsem.h~percpu_rw_semaphore-reimplement-to-not-block-the-readers-unnecessarily
+++ a/include/linux/percpu-rwsem.h
@@ -2,82 +2,25 @@
 #define _LINUX_PERCPU_RWSEM_H
 
 #include <linux/mutex.h>
+#include <linux/rwsem.h>
 #include <linux/percpu.h>
-#include <linux/rcupdate.h>
-#include <linux/delay.h>
+#include <linux/wait.h>
 
 struct percpu_rw_semaphore {
-   unsigned __percpu *counters;
-   bool locked;
-   struct mutex mtx;
+   unsigned int __percpu   *fast_read_ctr;
+   struct mutexwriter_mutex;
+   struct rw_semaphore rw_sem;
+   atomic_tslow_read_ctr;
+   wait_queue_head_t   write_waitq;
 };
 
-#define light_mb() barrier()
-#define heavy_mb() synchronize_sched()
+extern void percpu_down_read(struct percpu_rw_semaphore *);
+extern void percpu_up_read(struct percpu_rw_semaphore *);
 
-static inline void percpu_down_read(struct percpu_rw_semaphore *p)
-{
-   rcu_read_lock_sched();
-   if (unlikely(p->locked)) {
-   rcu_read_unlock_sched();
-   mutex_lock(&p->mtx);
-   this_cpu_inc(*p->counters);
-   mutex_unlock(&p->mtx);
-   return;
-   }
-   this_cpu_inc(*p->counters);
-   rcu_read_unlock_sched();
-   light_mb(); /* A, between read of p->locked and read of data, paired 
with D */
-}
-
-static inline void percpu_up_read(struct percpu_rw_semaphore *p)
-{
-   light_mb(); /* B, between read of the data and write to p->counter, 
paired with C */
-   this_cpu_dec(*p->counters);
-}
-
-static inline unsigned __percpu_count(unsigned __percpu *counters)
-{
-   unsigned total = 0;
-   int cpu;
-
-   for_each_possible_cpu(cpu)
-   total += ACCESS_ONCE(*per_cpu_ptr(counters, cpu));
-
-   return total;
-}
-
-static inline void percpu_down_write(struct percpu_rw_semaphore *p)
-{
-   mutex_lock(&p->mtx);
-   p->locked = true;
-   synchronize_sched(); /* make sure that all readers exit the 
rcu_read_lock_sched region */
-   while (__percpu_count(p->counters))
-   msleep(1);
-   heavy_mb(); /* C, between read of p->counter and 

Re: Recent kernel "mount" slow

2012-11-26 Thread Jens Axboe
On 2012-11-27 06:57, Jeff Chua wrote:
> On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua  wrote:
>> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka  wrote:
>>> So it's better to slow down mount.
>>
>> I am quite proud of the linux boot time pitting against other OS. Even
>> with 10 partitions. Linux can boot up in just a few seconds, but now
>> you're saying that we need to do this semaphore check at boot up. By
>> doing so, it's inducing additional 4 seconds during boot up.
> 
> By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
> (2.8GHz). I wonder if those on slower hard disk or slower CPU, what
> kind of degradation would this cause or just the same?

It'd likely be the same slow down time wise, but as a percentage it
would appear smaller on a slower disk.

Could you please test Mikulas' suggestion of changing
synchronize_sched() in include/linux/percpu-rwsem.h to
synchronize_sched_expedited()?

linux-next also has a re-write of the per-cpu rw sems, out of Andrew's
tree. It would be a good data point if you could test that, too.

In any case, the slow down definitely isn't acceptable. Fixing an
obscure issue like block sizes changing while O_DIRECT is in flight
definitely does NOT warrant a mount slow down.

-- 
Jens Axboe

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Recent kernel "mount" slow

2012-11-26 Thread Jeff Chua
On Sun, Nov 25, 2012 at 7:23 AM, Jeff Chua  wrote:
> On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka  wrote:
>> So it's better to slow down mount.
>
> I am quite proud of the linux boot time pitting against other OS. Even
> with 10 partitions. Linux can boot up in just a few seconds, but now
> you're saying that we need to do this semaphore check at boot up. By
> doing so, it's inducing additional 4 seconds during boot up.

By the way, I'm using a pretty fast SSD (Samsung PM830) and fast CPU
(2.8GHz). I wonder if those on slower hard disk or slower CPU, what
kind of degradation would this cause or just the same?

Thanks,
Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Recent kernel "mount" slow

2012-11-24 Thread Jeff Chua
On Sun, Nov 25, 2012 at 5:09 AM, Mikulas Patocka  wrote:
> So it's better to slow down mount.

I am quite proud of the Linux boot time pitting it against other OSes.
Even with 10 partitions, Linux can boot up in just a few seconds, but now
you're saying that we need to do this semaphore check at boot up. By
doing so, it's inducing an additional 4 seconds during boot up.

What about moving the locking mechanism to the "mount" program itself?
Won't that be more feasible?

As for the cases of simultaneous mounts, it's usually administrator
that's doing something bad. I would say this is not a kernel issue.

Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Recent kernel "mount" slow

2012-11-24 Thread Mikulas Patocka


On Sat, 24 Nov 2012, Jeff Chua wrote:

> On Fri, Nov 23, 2012 at 9:24 PM, Jens Axboe  wrote:
> > On 2012-11-22 20:21, Linus Torvalds wrote:
> >> Doesn't sound like a fsdevel issue since it seems to be independent of
> >> filesystems. More like some generic block layer thing. Adding Jens
> >> (and quoting the whole thing)
> >>
> >> Jens, any ideas? Most of your stuff came in after -rc2, which would
> >> fit with the fact that most of the slowdown seems to be after -rc2
> >> according to Jeff.
> >
> > No ideas. Looking at what went in from my side, only the rq plug sorting
> > is a core change, and that should not cause any change in behaviour for
> > a single device. That's commit 975927b9.
> >
> >> Jeff, more bisecting would be good, though.
> >
> > Probably required, yes...
> 
> 
> This one slows mount from 0.012s to 0.168s.
> 
> commit 62ac665ff9fc07497ca524bd20d6a96893d11071
> Author: Mikulas Patocka 
> Date:   Wed Sep 26 07:46:43 2012 +0200
> 
> blockdev: turn a rw semaphore into a percpu rw semaphore
> 
> 
> There were couple of more changes to percpu-rw-semaphores after
> 3.7.0-rc2 and those slows mount further from 0.168s to 0.500s. I don't
> really know, but I'm suspecting these. Still bisecting.

The problem there is that you either use normal semaphores, which slow 
down I/O, or you use percpu semaphores, which don't slow down I/O but do 
slow down mount.

So it's better to slow down mount.

(if you don't use any semaphore at all, as it was in 3.6 kernel and 
before, there is a race condition that can crash the kernel if someone 
does mount and direct I/O read on the same device at the same time)

You can improve mount time if you change all occurrences of 
synchronize_sched() in include/linux/percpu-rwsem.h to 
synchronize_sched_expedited().

But some people say that synchronize_sched_expedited() is bad for 
real-time latency. (Can there be something like: if (realtime) 
synchronize_sched(); else synchronize_sched_expedited(); ?)

Mikulas
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Recent kernel "mount" slow

2012-11-23 Thread Jeff Chua
On Sat, Nov 24, 2012 at 7:31 AM, Jeff Chua  wrote:
> On Sat, Nov 24, 2012 at 6:21 AM, Jeff Chua  wrote:
>> On Fri, Nov 23, 2012 at 9:24 PM, Jens Axboe  wrote:
>>> On 2012-11-22 20:21, Linus Torvalds wrote:
 Doesn't sound like a fsdevel issue since it seems to be independent of
 filesystems. More like some generic block layer thing. Adding Jens
 (and quoting the whole thing)

 Jens, any ideas? Most of your stuff came in after -rc2, which would
 fit with the fact that most of the slowdown seems to be after -rc2
 according to Jeff.
>>>
>>> No ideas. Looking at what went in from my side, only the rq plug sorting
>>> is a core change, and that should not cause any change in behaviour for
>>> a single device. That's commit 975927b9.
>>>
 Jeff, more bisecting would be good, though.
>>>
>>> Probably required, yes...
>>
>>
>> This one slows mount from 0.012s to 0.168s.
>>
>> commit 62ac665ff9fc07497ca524bd20d6a96893d11071
>> Author: Mikulas Patocka 
>> Date:   Wed Sep 26 07:46:43 2012 +0200
>>
>> blockdev: turn a rw semaphore into a percpu rw semaphore
>>
>>
>> There were couple of more changes to percpu-rw-semaphores after
>> 3.7.0-rc2 and those slows mount further from 0.168s to 0.500s. I don't
>> really know, but I'm suspecting these. Still bisecting.
>>
>> commit 5c1eabe68501d1e1b1586c7f4c46cc531828c4ab
>> Author: Mikulas Patocka 
>> Date:   Mon Oct 22 19:37:47 2012 -0400
>>
>> percpu-rw-semaphores: use light/heavy barriers
>>
>>
>> commit 1bf11c53535ab87e3bf14ecdf6747bf46f601c5d
>> Author: Mikulas Patocka 
>> Date:   Mon Oct 22 19:39:16 2012 -0400
>>
>> percpu-rw-semaphores: use rcu_read_lock_sched
>
>
> I reverted these 3 patches and mount time is now 0.012s.
>
> # time mount /dev/sda1 /mnt; sync; sync; umount /mnt
>
>
>> commit 1a25b1c4ce189e3926f2981f3302352a930086db
>> Author: Mikulas Patocka 
>> Date:   Mon Oct 15 17:20:17 2012 -0400
>>
>> Lock splice_read and splice_write functions
>
> Looks like this one is not the problem. But I reverted it anyway
> because it's part of the same chunk.
>
>
> Happy Thanksgiving!

I'm on latest git 3.7.0-rc6 5e351cdc998db82935d1248a053a1be37d1160fd
and with the above patches reverted and mount time is now 0.012s.

Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Recent kernel "mount" slow

2012-11-23 Thread Jeff Chua
On Sat, Nov 24, 2012 at 6:21 AM, Jeff Chua  wrote:
> On Fri, Nov 23, 2012 at 9:24 PM, Jens Axboe  wrote:
>> On 2012-11-22 20:21, Linus Torvalds wrote:
>>> Doesn't sound like a fsdevel issue since it seems to be independent of
>>> filesystems. More like some generic block layer thing. Adding Jens
>>> (and quoting the whole thing)
>>>
>>> Jens, any ideas? Most of your stuff came in after -rc2, which would
>>> fit with the fact that most of the slowdown seems to be after -rc2
>>> according to Jeff.
>>
>> No ideas. Looking at what went in from my side, only the rq plug sorting
>> is a core change, and that should not cause any change in behaviour for
>> a single device. That's commit 975927b9.
>>
>>> Jeff, more bisecting would be good, though.
>>
>> Probably required, yes...
>
>
> This one slows mount from 0.012s to 0.168s.
>
> commit 62ac665ff9fc07497ca524bd20d6a96893d11071
> Author: Mikulas Patocka 
> Date:   Wed Sep 26 07:46:43 2012 +0200
>
> blockdev: turn a rw semaphore into a percpu rw semaphore
>
>
> There were couple of more changes to percpu-rw-semaphores after
> 3.7.0-rc2 and those slows mount further from 0.168s to 0.500s. I don't
> really know, but I'm suspecting these. Still bisecting.
>
> commit 5c1eabe68501d1e1b1586c7f4c46cc531828c4ab
> Author: Mikulas Patocka 
> Date:   Mon Oct 22 19:37:47 2012 -0400
>
> percpu-rw-semaphores: use light/heavy barriers
>
>
> commit 1bf11c53535ab87e3bf14ecdf6747bf46f601c5d
> Author: Mikulas Patocka 
> Date:   Mon Oct 22 19:39:16 2012 -0400
>
> percpu-rw-semaphores: use rcu_read_lock_sched


I reverted these 3 patches and mount time is now 0.012s.

# time mount /dev/sda1 /mnt; sync; sync; umount /mnt


> commit 1a25b1c4ce189e3926f2981f3302352a930086db
> Author: Mikulas Patocka 
> Date:   Mon Oct 15 17:20:17 2012 -0400
>
> Lock splice_read and splice_write functions

Looks like this one is not the problem. But I reverted it anyway
because it's part of the same chunk.


Happy Thanksgiving!

Jeff


Re: Recent kernel "mount" slow

2012-11-23 Thread Jeff Chua
On Fri, Nov 23, 2012 at 9:24 PM, Jens Axboe  wrote:
> On 2012-11-22 20:21, Linus Torvalds wrote:
>> Doesn't sound like a fsdevel issue since it seems to be independent of
>> filesystems. More like some generic block layer thing. Adding Jens
>> (and quoting the whole thing)
>>
>> Jens, any ideas? Most of your stuff came in after -rc2, which would
>> fit with the fact that most of the slowdown seems to be after -rc2
>> according to Jeff.
>
> No ideas. Looking at what went in from my side, only the rq plug sorting
> is a core change, and that should not cause any change in behaviour for
> a single device. That's commit 975927b9.
>
>> Jeff, more bisecting would be good, though.
>
> Probably required, yes...


This one slows mount from 0.012s to 0.168s.

commit 62ac665ff9fc07497ca524bd20d6a96893d11071
Author: Mikulas Patocka 
Date:   Wed Sep 26 07:46:43 2012 +0200

blockdev: turn a rw semaphore into a percpu rw semaphore


There were a couple more changes to percpu-rw-semaphores after
3.7.0-rc2, and those slow mount further, from 0.168s to 0.500s. I don't
know for sure yet, but I'm suspecting these. Still bisecting.

commit 5c1eabe68501d1e1b1586c7f4c46cc531828c4ab
Author: Mikulas Patocka 
Date:   Mon Oct 22 19:37:47 2012 -0400

percpu-rw-semaphores: use light/heavy barriers


commit 1bf11c53535ab87e3bf14ecdf6747bf46f601c5d
Author: Mikulas Patocka 
Date:   Mon Oct 22 19:39:16 2012 -0400

percpu-rw-semaphores: use rcu_read_lock_sched


commit 1a25b1c4ce189e3926f2981f3302352a930086db
Author: Mikulas Patocka 
Date:   Mon Oct 15 17:20:17 2012 -0400

Lock splice_read and splice_write functions



I couldn't revert these patches cleanly on the latest kernel, but I'm still trying.

Jeff


Re: Recent kernel "mount" slow

2012-11-23 Thread Jens Axboe
On 2012-11-22 20:21, Linus Torvalds wrote:
> Doesn't sound like a fsdevel issue since it seems to be independent of
> filesystems. More like some generic block layer thing. Adding Jens
> (and quoting the whole thing)
> 
> Jens, any ideas? Most of your stuff came in after -rc2, which would
> fit with the fact that most of the slowdown seems to be after -rc2
> according to Jeff.

No ideas. Looking at what went in from my side, only the rq plug sorting
is a core change, and that should not cause any change in behaviour for
a single device. That's commit 975927b9.

> Jeff, more bisecting would be good, though.

Probably required, yes...

-- 
Jens Axboe



Re: Recent kernel "mount" slow

2012-11-22 Thread Jeff Chua
On Wed, Nov 21, 2012 at 11:46 PM, Jeff Chua  wrote:
> On Wed, Nov 21, 2012 at 2:09 AM, Jan Kara  wrote:
>>   I haven't heard about such problem so far. What filesystem are you using?
>
> I've tried ext2/ext3/ext4/reiserfs/btrfs ... all seems to be slower
> than before. Seems to be fs independent.
>
>> Can you quantify 'is slower'? Bisecting would be welcome of course.
>
> Haven't measure, but it seems to be 1 sec instead of .3 sec to mount.
> I'll start bisecting:(
> So, since I've 6 mounts on start up, now it takes 4 seconds longer to boot up.

I started bisecting, and there seem to be multiple points where it
slows down. The latest kernel is really slow mounting /dev/sda1 (fs
independent) ... 0.529 sec. Kernel 3.6.0 took 0.012 sec, and kernel
3.7.0-rc2 took 0.168 sec. Again, these are very early results; I'm
still verifying them.


Jeff


Re: Recent kernel "mount" slow

2012-11-22 Thread Linus Torvalds
Doesn't sound like a fsdevel issue since it seems to be independent of
filesystems. More like some generic block layer thing. Adding Jens
(and quoting the whole thing)

Jens, any ideas? Most of your stuff came in after -rc2, which would
fit with the fact that most of the slowdown seems to be after -rc2
according to Jeff.

Jeff, more bisecting would be good, though.

 Linus

On Thu, Nov 22, 2012 at 4:30 AM, Jeff Chua  wrote:
> On Wed, Nov 21, 2012 at 11:46 PM, Jeff Chua  wrote:
>> On Wed, Nov 21, 2012 at 2:09 AM, Jan Kara  wrote:
>>>   I haven't heard about such problem so far. What filesystem are you using?
>>
>> I've tried ext2/ext3/ext4/reiserfs/btrfs ... all seems to be slower
>> than before. Seems to be fs independent.
>>
>>> Can you quantify 'is slower'? Bisecting would be welcome of course.
>>
>> Haven't measure, but it seems to be 1 sec instead of .3 sec to mount.
>> I'll start bisecting:(
>> So, since I've 6 mounts on start up, now it takes 4 seconds longer to boot 
>> up.
>
> I started bisecting, and it seems to have multiple points of slowing
> down. The latest kernel is really slow mounting /dev/sda1 (fs
> independent) ... 0.529sec. Kernel 3.6.0 took 0.012sec, Kernel
> 3.7.0-rc2 took 0.168sec. Again, these are very early results. I'm
> verifying them again.
>
>
> Jeff


Re: Recent kernel "mount" slow

2012-11-21 Thread Jeff Chua
On Wed, Nov 21, 2012 at 2:09 AM, Jan Kara  wrote:
>   I haven't heard about such problem so far. What filesystem are you using?

I've tried ext2/ext3/ext4/reiserfs/btrfs ... all seem to be slower
than before. Seems to be fs independent.

> Can you quantify 'is slower'? Bisecting would be welcome of course.

Haven't measured, but it seems to be 1 sec instead of .3 sec to mount.
I'll start bisecting :(
Since I have 6 mounts on startup, it now takes about 4 seconds longer to boot up.

Thanks,
Jeff


Re: Recent kernel "mount" slow

2012-11-20 Thread Jan Kara
On Mon 19-11-12 08:33:25, Jeff Chua wrote:
> It seems the recent kernel is slower mounting hard disk than older
> kernels. I've not bisect down to exact when this happen as it might
> already been reported or solved. I'm on the latest commit, but it
> doesn't seems to be fixed yet.
> 
> commit 3587b1b097d70c2eb9fee95ea7995d13c05f66e5
> Author: Al Viro 
> Date:   Sun Nov 18 19:19:00 2012 +
> 
> fanotify: fix FAN_Q_OVERFLOW case of fanotify_read()

  I haven't heard about such a problem so far. What filesystem are you using?
Can you quantify 'is slower'? Bisecting would be welcome, of course.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Recent kernel "mount" slow

2012-11-18 Thread Jeff Chua
It seems recent kernels are slower at mounting a hard disk than older
ones. I haven't bisected down to exactly when this happened, as it might
already have been reported or solved. I'm on the latest commit, but it
doesn't seem to be fixed yet.

commit 3587b1b097d70c2eb9fee95ea7995d13c05f66e5
Author: Al Viro 
Date:   Sun Nov 18 19:19:00 2012 +

fanotify: fix FAN_Q_OVERFLOW case of fanotify_read()


Thanks,
Jeff

