Re: [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage

2012-09-25 Thread Dave Chinner
On Tue, Sep 25, 2012 at 04:59:55PM +0800, Guo Chao wrote:
> On Mon, Sep 24, 2012 at 06:26:54PM +1000, Dave Chinner wrote:
> > @@ -783,14 +783,19 @@ static void __wait_on_freeing_inode(struct inode *inode);
> >  static struct inode *find_inode(struct super_block *sb,
> >  				struct hlist_head *head,
> >  				int (*test)(struct inode *, void *),
> > -				void *data)
> > +				void *data, bool locked)
> >  {
> >  	struct hlist_node *node;
> >  	struct inode *inode = NULL;
> > 
> >  repeat:
> > -	hlist_for_each_entry(inode, node, head, i_hash) {
> > +	rcu_read_lock();
> > +	hlist_for_each_entry_rcu(inode, node, head, i_hash) {
> >  		spin_lock(&inode->i_lock);
> > +		if (inode_unhashed(inode)) {
> > +			spin_unlock(&inode->i_lock);
> > +			continue;
> > +		}
> 
> Is this check too early? If the unhashed inode happens to be the target
> inode, we waste time continuing the traversal and never wait
> on it.

If the inode is unhashed, then it is already passing through evict()
or has already passed through. If it has already passed through
evict() then it is too late to call __wait_on_freeing_inode() as the
wakeup occurs in evict() immediately after the inode is removed
from the hash. i.e:

	remove_inode_hash(inode);

	spin_lock(&inode->i_lock);
	wake_up_bit(&inode->i_state, __I_NEW);
	BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
	spin_unlock(&inode->i_lock);

i.e. if we get the case:

Thread 1, RCU hash traversal		Thread 2, evicting foo

rcu_read_lock()
found inode foo
					remove_inode_hash(inode);
					spin_lock(&foo->i_lock);
					wake_up(I_NEW)
					spin_unlock(&foo->i_lock);
					destroy_inode()
					.....
spin_lock(foo->i_lock)
match sb, ino
I_FREEING
  rcu_read_unlock()
					<rcu grace period can expire at any
					 time now, so use after free is
					 guaranteed at some point>
  wait_on_freeing_inode
    wait_on_bit(I_NEW)
    <will never get woken>
Hence if the inode is unhashed, it doesn't matter what inode it is:
it is never valid to use it any further because it may have already
been freed, and the only reason we can safely access it here is that
the RCU grace period will not expire until we call
rcu_read_unlock().
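
For reference, the waiter side is essentially __wait_on_freeing_inode() in
fs/inode.c, which sleeps on the __I_NEW bit; the snippet below is a
simplified sketch of that pattern, so once evict() has already issued its
wake_up_bit() there is nothing left to wake the sleeper:

	wait_queue_head_t *wq;
	DEFINE_WAIT_BIT(wait, &inode->i_state, __I_NEW);

	wq = bit_waitqueue(&inode->i_state, __I_NEW);
	prepare_to_wait(wq, &wait.wait, TASK_UNINTERRUPTIBLE);
	spin_unlock(&inode->i_lock);
	schedule();			/* never woken if the wakeup already fired */
	finish_wait(wq, &wait.wait);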

> > @@ -1078,8 +1098,7 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
> >  	struct inode *old;
> > 
> >  	spin_lock(&inode_hash_lock);
> > -	/* We released the lock, so.. */
> > -	old = find_inode_fast(sb, head, ino);
> > +	old = find_inode_fast(sb, head, ino, true);
> >  	if (!old) {
> >  		inode->i_ino = ino;
> >  		spin_lock(&inode->i_lock);
> 
> E ... couldn't we use memory barrier API instead of irrelevant spin
> lock on newly allocated inode to publish I_NEW?

Yes, we could.

However, having multiple synchronisation methods for a single
variable that should only be used in certain circumstances is
something that is easy to misunderstand and get wrong. Memory
barriers are much more subtle and harder to understand than spin
locks, and every memory barrier needs to be commented to explain
what the barrier is actually protecting against.

In the case where a spin lock is guaranteed to be uncontended and
the cache line hot in the CPU cache, it makes no sense to replace
the spin lock with a memory barrier, especially when every other
place we modify the i_state/i_hash fields we have to wrap them
with i_lock.

Simple code is good code - save the complexity for something that
needs it.
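
As a rough sketch (illustrative only, not from the patch set), the two
alternatives being compared look like this; the smp_wmb() variant is
hypothetical and shows the kind of pairing rules a bare barrier would impose:

	/* what the code does: i_lock brackets the state change */
	spin_lock(&inode->i_lock);
	inode->i_state = I_NEW;
	hlist_add_head_rcu(&inode->i_hash, head);
	spin_unlock(&inode->i_lock);	/* a reader's later lock of i_lock
					 * orders the I_NEW store for it */

	/* hypothetical barrier-based variant */
	inode->i_state = I_NEW;
	smp_wmb();			/* order the i_state store before the
					 * hash insert; every reader now needs
					 * a paired smp_rmb()/acquire before
					 * it may trust i_state */
	hlist_add_head_rcu(&inode->i_hash, head);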

> I went through many mails from the last round of VFS scaling work. Many patches
> that seem quite natural, say RCU inode lookup

Sure, but the implementation in those RCU lookup patches sucked.

> or per-bucket inode hash lock or 

It was a bad idea. At minimum, you can't use lockdep on it. Worse
for the realtime guys is the fact it can't be converted to a
sleeping lock. Worst was the refusal to change it in any way to
address concerns.

And realistically, the fundamental problem is not with the
inode_hash_lock, it's with the fact that the cache is based on a
hash table rather than a more scalable structure like a radix tree
or btree. This is a primary reason for XFS having its own inode
cache - hashes can only hold a certain number of entries before
performance collapses catastrophically and so don't scale well to
tens or hundreds of millions of entries.
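
For illustration only (a hypothetical sketch, not the XFS code), an
RCU-protected radix tree lookup keyed by inode number would look roughly
like this, with the tree ("sb_inode_tree" here) standing in for the hash:

	rcu_read_lock();
	inode = radix_tree_lookup(&sb_inode_tree, ino);
	if (inode) {
		spin_lock(&inode->i_lock);
		if (inode->i_state & (I_FREEING | I_WILL_FREE))
			inode = NULL;		/* being torn down: miss */
		else
			__iget(inode);
		spin_unlock(&inode->i_lock);
	}
	rcu_read_unlock();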

> per-superblock inode list lock,

Because it isn't a particularly hot lock, and given that
most workloads hit on a single filesystem, scalability is not
improved by making this change. As such, as long as there is a
single linked list used to iterate all inodes in the superblock,
a single lock is as good as scalability will get.

> did 

Re: [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage

2012-09-25 Thread Guo Chao
On Mon, Sep 24, 2012 at 06:26:54PM +1000, Dave Chinner wrote:
> @@ -783,14 +783,19 @@ static void __wait_on_freeing_inode(struct inode *inode);
>  static struct inode *find_inode(struct super_block *sb,
>  				struct hlist_head *head,
>  				int (*test)(struct inode *, void *),
> -				void *data)
> +				void *data, bool locked)
>  {
>  	struct hlist_node *node;
>  	struct inode *inode = NULL;
> 
>  repeat:
> -	hlist_for_each_entry(inode, node, head, i_hash) {
> +	rcu_read_lock();
> +	hlist_for_each_entry_rcu(inode, node, head, i_hash) {
>  		spin_lock(&inode->i_lock);
> +		if (inode_unhashed(inode)) {
> +			spin_unlock(&inode->i_lock);
> +			continue;
> +		}

Is this check too early? If the unhashed inode happens to be the target
inode, we waste time continuing the traversal and never wait
on it.

> @@ -1078,8 +1098,7 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
>  	struct inode *old;
> 
>  	spin_lock(&inode_hash_lock);
> -	/* We released the lock, so.. */
> -	old = find_inode_fast(sb, head, ino);
> +	old = find_inode_fast(sb, head, ino, true);
>  	if (!old) {
>  		inode->i_ino = ino;
>  		spin_lock(&inode->i_lock);

E ... couldn't we use memory barrier API instead of irrelevant spin
lock on newly allocated inode to publish I_NEW?

I went through many mails from the last round of VFS scaling work. Many patches
that seem quite natural, say RCU inode lookup or per-bucket inode hash lock or
per-superblock inode list lock, did not get merged. I wonder what
stopped them back then and what has changed so that (part of) them can be
considered again.

Regards,
Guo Chao

Re: [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage

2012-09-24 Thread Dave Chinner
On Mon, Sep 24, 2012 at 03:08:52PM +0800, Guo Chao wrote:
> On Mon, Sep 24, 2012 at 04:28:12PM +1000, Dave Chinner wrote:
> > > Ah, this is intended to be a code cleanup patchset actually. I thought these
> > > locks are redundant in an obvious and trivial manner. If, on the
> > > contrary, they are that tricky, then never mind :) Thanks for your patience.
> > 
> > The RCU conversion is actually trivial - everything is already set
> > up for it to be done, and is simpler than this patch set. It pretty
> > much is simply replacing all the read side inode_hash_lock pairs
> > with rcu_read_lock()/rcu_read_unlock() pairs. Like I said, if you
> > want to clean up this code, then RCU traversals are the conversion
> > to make.
> >
> 
> Thanks for your suggestion. Though I doubt it's that trivial, I will try this
> after a little investigation.

Probably best to start with the patch below - it's run under heavy
concurrent load on ext4 with a working set of inodes about 20x
larger than can fit in memory for the past hour

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

fs: Use RCU lookups for inode cache

From: Dave Chinner 

Convert inode cache lookups to be protected by RCU locking rather
than the global inode_hash_lock. This will improve scalability of
inode lookup intensive workloads.

Smoke tested w/ ext4 on concurrent fsmark/lookup/unlink workloads
over 50 million or so inodes...

Signed-off-by: Dave Chinner 
---
 fs/inode.c |   74 ++--
 1 file changed, 42 insertions(+), 32 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index ac8d904..2e92674 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -464,7 +464,7 @@ void __insert_inode_hash(struct inode *inode, unsigned long hashval)
 
 	spin_lock(&inode_hash_lock);
 	spin_lock(&inode->i_lock);
-	hlist_add_head(&inode->i_hash, b);
+	hlist_add_head_rcu(&inode->i_hash, b);
 	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_hash_lock);
 }
@@ -480,7 +480,7 @@ void __remove_inode_hash(struct inode *inode)
 {
 	spin_lock(&inode_hash_lock);
 	spin_lock(&inode->i_lock);
-	hlist_del_init(&inode->i_hash);
+	hlist_del_init_rcu(&inode->i_hash);
 	spin_unlock(&inode->i_lock);
 	spin_unlock(&inode_hash_lock);
 }
@@ -783,14 +783,19 @@ static void __wait_on_freeing_inode(struct inode *inode);
 static struct inode *find_inode(struct super_block *sb,
struct hlist_head *head,
int (*test)(struct inode *, void *),
-   void *data)
+   void *data, bool locked)
 {
struct hlist_node *node;
struct inode *inode = NULL;
 
 repeat:
-   hlist_for_each_entry(inode, node, head, i_hash) {
+   rcu_read_lock();
+   hlist_for_each_entry_rcu(inode, node, head, i_hash) {
 		spin_lock(&inode->i_lock);
+		if (inode_unhashed(inode)) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
 		if (inode->i_sb != sb) {
 			spin_unlock(&inode->i_lock);
 			continue;
@@ -800,13 +805,20 @@ repeat:
 			continue;
 		}
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
+			rcu_read_unlock();
+			if (locked)
+				spin_unlock(&inode_hash_lock);
 			__wait_on_freeing_inode(inode);
+			if (locked)
+				spin_lock(&inode_hash_lock);
 			goto repeat;
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
+		rcu_read_unlock();
return inode;
}
+   rcu_read_unlock();
return NULL;
 }
 
@@ -815,14 +827,20 @@ repeat:
  * iget_locked for details.
  */
 static struct inode *find_inode_fast(struct super_block *sb,
-   struct hlist_head *head, unsigned long ino)
+struct hlist_head *head,
+unsigned long ino, bool locked)
 {
struct hlist_node *node;
struct inode *inode = NULL;
 
 repeat:
-	hlist_for_each_entry(inode, node, head, i_hash) {
+	rcu_read_lock();
+	hlist_for_each_entry_rcu(inode, node, head, i_hash) {
 		spin_lock(&inode->i_lock);
+		if (inode_unhashed(inode)) {
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
 		if (inode->i_ino != ino) {
 			spin_unlock(&inode->i_lock);
 			continue;
@@ -832,13 +850,20 @@ repeat:
 			continue;
 		}
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
+			rcu_read_unlock();
+			if (locked)
+				spin_unlock(&inode_hash_lock);

Re: [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage

2012-09-24 Thread Guo Chao
On Mon, Sep 24, 2012 at 04:28:12PM +1000, Dave Chinner wrote:
> On Mon, Sep 24, 2012 at 02:12:05PM +0800, Guo Chao wrote:
> > On Mon, Sep 24, 2012 at 02:23:43PM +1000, Dave Chinner wrote:
> > > On Mon, Sep 24, 2012 at 10:42:21AM +0800, Guo Chao wrote:
> > > > On Sat, Sep 22, 2012 at 08:49:12AM +1000, Dave Chinner wrote:
> > > > 
> > > > > On Fri, Sep 21, 2012 at 05:31:02PM +0800, Guo Chao wrote:
> > > > > > This patchset optimizes several places which take the per inode 
> > > > > > spin lock.
> > > > > > They have not been fully tested yet, thus they are marked as RFC. 
> > > > > 
> > > > > Inodes are RCU freed. The i_lock spinlock on the i_state field forms
> > > > > part of the memory barrier that allows the RCU read side to
> > > > > correctly detect a freed inode during a RCU protected cache lookup
> > > > > (hash list traversal for the VFS, or a radix tree traversal for XFS).
> > > > > The i_lock usage around the hash list operations ensures the hash
> > > > > list operations are atomic with state changes so that such changes
> > > > > are correctly detected during RCU-protected traversals...
> > > > > 
> > > > > IOWs, removing the i_lock from around the i_state transitions and
> > > > > inode hash insert/remove/traversal operations will cause races in
> > > > > the RCU lookups and result in incorrectly using freed inodes instead
> > > > > of failing the lookup and creating a new one.
> > > > > 
> > > > > So I don't think this is a good idea at all...
> .
> > > The inode hash lookup needs to check i_state atomically during the
> > > traversal so inodes being freed are skipped (e.g. I_FREEING,
> > > I_WILL_FREE). those i_state flags are set only with the i_lock held,
> > > and so inode hash lookups need to take the i_lock to guarantee the
> > > i_state field is correct. This inode data field synchronisation is
> > > separate to the cache hash list traversal protection.
> > > 
> > > The only way to do this is to have an inner lock (inode->i_lock)
> > > that protects both the inode->i_hash_list and inode->i_state fields,
> > > and a lock order that provides outer list traversal protections
> > > (inode_hash_lock). Whether the outer lock is the inode_hash_lock or
> > > rcu_read_lock(), the lock order and the data fields the locks are
> > > protecting are the same
> > > 
> > > > Of course, maybe they are there for something. Could you speak
> > > > more about the race this change (patch 1,2?) brings up? Thank you.
> > > 
> > > When you drop the lock from the i_state initialisation, you end up
> > > dropping the implicit unlock->lock memory barrier that the
> > > inode->i_lock provides. i.e. you get this in iget_locked():
> > > 
> > > 
> > > 	thread 1			thread 2
> > > 
> > > 	lock(inode_hash_lock)
> > > 	for_each_hash_item()
> > > 
> > > 					inode->i_state = I_NEW
> > > 					hash_list_insert
> > > 
> > > 	<finds newly inserted inode>
> > > 					lock(inode->i_lock)
> > > 					unlock(inode->i_lock)
> > > 					unlock(inode_hash_lock)
> > > 
> > > 	wait_on_inode()
> > > 	<sees inode->i_state = 0>
> > > 	<uses inode before initialisation
> > > 	 is complete>
> > > 
> > > IOWs, there is no unlock->lock transition occurring on any lock, so
> > > there are no implicit memory barriers in this code, and so other
> > > CPUs are not guaranteed to see the "inode->i_state = I_NEW" write
> > > that thread 2 did. The lock/unlock pair around this I_NEW assignment
> > > guarantees that thread 1 will see the change to i_state correctly.
> > > 
> > > So even without RCU, dropping the i_lock from these
> > > i_state/hash insert/remove operations will result in races
> > > occurring...
> >
> > This interleave can never happen because of inode_hash_lock.
> 
> Ah, sorry, I'm context switching too much right now.
> 
> s/lock(inode_hash_lock)/rcu_read_lock.
> 
> And that's the race condition that the locking order is *intended* to
> avoid. It's just that we haven't done the last piece of the work,
> which is replacing the read side inode_hash_lock usage with
> rcu_read_lock.
> 
> > > Seriously, if you want to improve the locking of this code, go back
> > > and resurrect the basic RCU hash traversal patches (i.e. Nick's
> > > original patch rather than my SLAB_DESTROY_BY_RCU based ones). That
> > > has much more benefit to many more workloads than just removing
> > > non-global, uncontended locks like this patch set does.
> > > 
> > 
> > Ah, this is intended to be a code cleanup patchset actually. I thought these
> > locks are redundant in an obvious and trivial manner. If, on the contrary,
> > they are that tricky, then never mind :) Thanks for your patience.
> 
> The RCU conversion is actually trivial - everything is already set
> up for it to be done, and is simpler than this patch set. It pretty
> much is simply replacing all the read side inode_hash_lock pairs
> with rcu_read_lock()/rcu_read_unlock() pairs. Like I said, if you
> want to clean up this code, then RCU traversals are the conversion
> to make.

Thanks for your suggestion. Though I doubt it's that trivial, I will try this
after a little investigation.

Regards,
Guo Chao

Re: [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage

2012-09-24 Thread Dave Chinner
On Mon, Sep 24, 2012 at 02:12:05PM +0800, Guo Chao wrote:
> On Mon, Sep 24, 2012 at 02:23:43PM +1000, Dave Chinner wrote:
> > On Mon, Sep 24, 2012 at 10:42:21AM +0800, Guo Chao wrote:
> > > On Sat, Sep 22, 2012 at 08:49:12AM +1000, Dave Chinner wrote:
> > > 
> > > > On Fri, Sep 21, 2012 at 05:31:02PM +0800, Guo Chao wrote:
> > > > > This patchset optimizes several places which take the per inode spin 
> > > > > lock.
> > > > > They have not been fully tested yet, thus they are marked as RFC. 
> > > > 
> > > > Inodes are RCU freed. The i_lock spinlock on the i_state field forms
> > > > part of the memory barrier that allows the RCU read side to
> > > > correctly detect a freed inode during a RCU protected cache lookup
> > > > (hash list traversal for the VFS, or a radix tree traversal for XFS).
> > > > The i_lock usage around the hash list operations ensures the hash
> > > > list operations are atomic with state changes so that such changes
> > > > are correctly detected during RCU-protected traversals...
> > > > 
> > > > IOWs, removing the i_lock from around the i_state transitions and
> > > > inode hash insert/remove/traversal operations will cause races in
> > > > the RCU lookups and result in incorrectly using freed inodes instead
> > > > of failing the lookup and creating a new one.
> > > > 
> > > > So I don't think this is a good idea at all...
.
> > The inode hash lookup needs to check i_state atomically during the
> > traversal so inodes being freed are skipped (e.g. I_FREEING,
> > I_WILL_FREE). those i_state flags are set only with the i_lock held,
> > and so inode hash lookups need to take the i_lock to guarantee the
> > i_state field is correct. This inode data field synchronisation is
> > separate to the cache hash list traversal protection.
> > 
> > The only way to do this is to have an inner lock (inode->i_lock)
> > that protects both the inode->i_hash_list and inode->i_state fields,
> > and a lock order that provides outer list traversal protections
> > (inode_hash_lock). Whether the outer lock is the inode_hash_lock or
> > rcu_read_lock(), the lock order and the data fields the locks are
> > protecting are the same
> > 
> > > Of course, maybe they are there for something. Could you speak
> > > more about the race this change (patch 1,2?) brings up? Thank you.
> > 
> > When you drop the lock from the i_state initialisation, you end up
> > dropping the implicit unlock->lock memory barrier that the
> > inode->i_lock provides. i.e. you get this in iget_locked():
> > 
> > 
> > 	thread 1			thread 2
> > 
> > 	lock(inode_hash_lock)
> > 	for_each_hash_item()
> > 
> > 					inode->i_state = I_NEW
> > 					hash_list_insert
> > 
> > 	<finds newly inserted inode>
> > 					lock(inode->i_lock)
> > 					unlock(inode->i_lock)
> > 					unlock(inode_hash_lock)
> > 
> > 	wait_on_inode()
> > 	<sees inode->i_state = 0>
> > 	<uses inode before initialisation
> > 	 is complete>
> > 
> > IOWs, there is no unlock->lock transition occurring on any lock, so
> > there are no implicit memory barriers in this code, and so other
> > CPUs are not guaranteed to see the "inode->i_state = I_NEW" write
> > that thread 2 did. The lock/unlock pair around this I_NEW assignment
> > guarantees that thread 1 will see the change to i_state correctly.
> > 
> > So even without RCU, dropping the i_lock from these
> > i_state/hash insert/remove operations will result in races
> > occurring...
>
> This interleave can never happen because of inode_hash_lock.

Ah, sorry, I'm context switching too much right now.

s/lock(inode_hash_lock)/rcu_read_lock.

And that's the race condition that the locking order is *intended* to
avoid. It's just that we haven't done the last piece of the work,
which is replacing the read side inode_hash_lock usage with
rcu_read_lock.

> > Seriously, if you want to improve the locking of this code, go back
> > and resurrect the basic RCU hash traversal patches (i.e. Nick's
> > original patch rather than my SLAB_DESTROY_BY_RCU based ones). That
> > has much more benefit to many more workloads than just removing
> > non-global, uncontended locks like this patch set does.
> > 
> 
> Ah, this is intended to be a code cleanup patchset actually. I thought these
> locks are redundant in an obvious and trivial manner. If, on the contrary,
> they are that tricky, then never mind :) Thanks for your patience.

The RCU conversion is actually trivial - everything is already set
up for it to be done, and is simpler than this patch set. It pretty
much is simply replacing all the read side inode_hash_lock pairs
with rcu_read_lock()/rcu_read_unlock() pairs. Like I said, if you
want to clean up this code, then RCU traversals are the conversion
to make.
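
In other words, the read side substitution being described is roughly
this (sketch only, not an actual hunk):

	before:					after:

	spin_lock(&inode_hash_lock);		rcu_read_lock();
	hlist_for_each_entry(...) {		hlist_for_each_entry_rcu(...) {
		...					...
	}					}
	spin_unlock(&inode_hash_lock);		rcu_read_unlock();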

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com

Re: [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage

2012-09-24 Thread Guo Chao
On Mon, Sep 24, 2012 at 02:23:43PM +1000, Dave Chinner wrote:
> On Mon, Sep 24, 2012 at 10:42:21AM +0800, Guo Chao wrote:
> > On Sat, Sep 22, 2012 at 08:49:12AM +1000, Dave Chinner wrote:
> > 
> > > On Fri, Sep 21, 2012 at 05:31:02PM +0800, Guo Chao wrote:
> > > > This patchset optimizes several places which take the per inode spin 
> > > > lock.
> > > > They have not been fully tested yet, thus they are marked as RFC. 
> > > 
> > > Inodes are RCU freed. The i_lock spinlock on the i_state field forms
> > > part of the memory barrier that allows the RCU read side to
> > > correctly detect a freed inode during a RCU protected cache lookup
> > > (hash list traversal for the VFS, or a radix tree traversal for XFS).
> > > The i_lock usage around the hash list operations ensures the hash
> > > list operations are atomic with state changes so that such changes
> > > are correctly detected during RCU-protected traversals...
> > > 
> > > IOWs, removing the i_lock from around the i_state transitions and
> > > inode hash insert/remove/traversal operations will cause races in
> > > the RCU lookups and result in incorrectly using freed inodes instead
> > > of failing the lookup and creating a new one.
> > > 
> > > So I don't think this is a good idea at all...
> > >
> > 
> > Hello, Dave:
> > 
> >   Thanks for your explanation.
> >  
> >   Though I can't fully understand it, your concern seems to be that
> > RCU inode lookup will be bothered by this change. But we do not have 
> > RCU inode lookup in VFS: inode lookup is done in a rather traditional
> > way.
> 
> Ah, I'd forgotten that neither of these RCU-based lookups ever got
> merged:
> 
> https://lkml.org/lkml/2010/6/23/397
> http://thread.gmane.org/gmane.linux.kernel/1056494
> 
> That, however, is the approach that the inode caches should be
> moving towards - RCU lookups to reduce locking, not changing
> i_lock/i_state atomicity that has been designed to facilitate RCU
> safe lookups...
> 
> >   XFS gives me the impression that it implements its own inode cache.
> > There may be such thing there. I have little knowledge on XFS, but I
> > guess it's unlikely impacted by the change of code implementing VFS 
> > inode cache.
> 
> Yeah, I dropped the generic inode hash RCU conversion - the
> SLAB_DESTROY_BY_RCU was proving to be rather complex, and I didn't
> have any motivation to see it through because I'd already converted
> XFS to avoid the global inode_hash_lock and use RCU lookups on its
> internal inode cache...
> 
> >   As far as I can see, RCU inode free is for RCU dentry lookup, which
> > seems to have nothing to do with 'detect a freed inode'.
> 
> If you know nothing of the history of this, then it might seem that
> way
> 
> > Taking i_lock in these
> > places looks to me like a result of following the old lock scheme blindly when
> > breaking the big global inode lock.
> 
> The i_state/i_hash_list/i_lock relationship was created specifically
> during the inode_lock breakup to allow us to guarantee that certain
> fields of the inode are unchanging without needing to take multiple
> nested locks:
> 
> $ gl -n 1 250df6e
> commit 250df6ed274d767da844a5d9f05720b804240197
> Author: Dave Chinner 
> Date:   Tue Mar 22 22:23:36 2011 +1100
> 
> fs: protect inode->i_state with inode->i_lock
> 
> Protect inode state transitions and validity checks with the
> inode->i_lock. This enables us to make inode state transitions
> independently of the inode_lock and is the first step to peeling
> away the inode_lock from the code.
> 
> This requires that __iget() is done atomically with i_state checks
> during list traversals so that we don't race with another thread
> marking the inode I_FREEING between the state check and grabbing the
> reference.
> 
> Also remove the unlock_new_inode() memory barrier optimisation
> required to avoid taking the inode_lock when clearing I_NEW.
> Simplify the code by simply taking the inode->i_lock around the
> state change and wakeup. Because the wakeup is no longer tricky,
> remove the wake_up_inode() function and open code the wakeup where
> necessary.
> 
> Signed-off-by: Dave Chinner 
> Signed-off-by: Al Viro 
> 
> The inode hash lookup needs to check i_state atomically during the
> traversal so inodes being freed are skipped (e.g. I_FREEING,
> I_WILL_FREE). those i_state flags are set only with the i_lock held,
> and so inode hash lookups need to take the i_lock to guarantee the
> i_state field is correct. This inode data field synchronisation is
> separate to the cache hash list traversal protection.
> 
> The only way to do this is to have an inner lock (inode->i_lock)
> that protects both the inode->i_hash_list and inode->i_state fields,
> and a lock order that provides outer list traversal protections
> (inode_hash_lock). Whether the outer lock is the inode_hash_lock or
> rcu_read_lock(), the lock order and the data fields the locks are
> protecting are the same

Re: [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage

2012-09-23 Thread Dave Chinner
On Mon, Sep 24, 2012 at 10:42:21AM +0800, Guo Chao wrote:
> On Sat, Sep 22, 2012 at 08:49:12AM +1000, Dave Chinner wrote:
> 
> > On Fri, Sep 21, 2012 at 05:31:02PM +0800, Guo Chao wrote:
> > > This patchset optimizes several places which take the per inode spin lock.
> > > They have not been fully tested yet, thus they are marked as RFC. 
> > 
> > Inodes are RCU freed. The i_lock spinlock on the i_state field forms
> > part of the memory barrier that allows the RCU read side to
> > correctly detect a freed inode during a RCU protected cache lookup
> > (hash list traversal for the VFS, or a radix tree traversal for XFS).
> > The i_lock usage around the hash list operations ensures the hash
> > list operations are atomic with state changes so that such changes
> > are correctly detected during RCU-protected traversals...
> > 
> > IOWs, removing the i_lock from around the i_state transitions and
> > inode hash insert/remove/traversal operations will cause races in
> > the RCU lookups and result in incorrectly using freed inodes instead
> > of failing the lookup and creating a new one.
> > 
> > So I don't think this is a good idea at all...
> >
> 
> Hello, Dave:
> 
>   Thanks for your explanation.
>  
>   Though I can't fully understand it, your concern seems to be that
> RCU inode lookup will be bothered by this change. But we do not have 
> RCU inode lookup in VFS: inode lookup is done in a rather traditional
> way.

Ah, I'd forgotten that neither of these RCU-based lookups ever got
merged:

https://lkml.org/lkml/2010/6/23/397
http://thread.gmane.org/gmane.linux.kernel/1056494

That, however, is the approach that the inode caches should be
moving towards - RCU lookups to reduce locking, not changing
i_lock/i_state atomicity that has been designed to facilitate RCU
safe lookups...

>   XFS gives me the impression that it implements its own inode cache.
> There may be such thing there. I have little knowledge on XFS, but I
> guess it's unlikely impacted by the change of code implementing VFS 
> inode cache.

Yeah, I dropped the generic inode hash RCU conversion - the
SLAB_DESTROY_BY_RCU was proving to be rather complex, and I didn't
have any motivation to see it through because I'd already converted
XFS to avoid the global inode_hash_lock and use RCU lookups on its
internal inode cache...

>   As far as I can see, RCU inode free is for RCU dentry lookup, which
> seems to have nothing to do with 'detect a freed inode'.

If you know nothing of the history of this, then it might seem that
way

> Taking i_lock in these
> places looks to me like a result of following the old lock scheme blindly when
> breaking the big global inode lock.

The i_state/i_hash_list/i_lock relationship was created specifically
during the inode_lock breakup to allow us to guarantee that certain
fields of the inode are unchanging without needing to take multiple
nested locks:

$ gl -n 1 250df6e
commit 250df6ed274d767da844a5d9f05720b804240197
Author: Dave Chinner 
Date:   Tue Mar 22 22:23:36 2011 +1100

fs: protect inode->i_state with inode->i_lock

Protect inode state transitions and validity checks with the
inode->i_lock. This enables us to make inode state transitions
independently of the inode_lock and is the first step to peeling
away the inode_lock from the code.

This requires that __iget() is done atomically with i_state checks
during list traversals so that we don't race with another thread
marking the inode I_FREEING between the state check and grabbing the
reference.

Also remove the unlock_new_inode() memory barrier optimisation
required to avoid taking the inode_lock when clearing I_NEW.
Simplify the code by simply taking the inode->i_lock around the
state change and wakeup. Because the wakeup is no longer tricky,
remove the wake_up_inode() function and open code the wakeup where
necessary.

Signed-off-by: Dave Chinner 
Signed-off-by: Al Viro 

The inode hash lookup needs to check i_state atomically during the
traversal so inodes being freed are skipped (e.g. I_FREEING,
I_WILL_FREE). those i_state flags are set only with the i_lock held,
and so inode hash lookups need to take the i_lock to guarantee the
i_state field is correct. This inode data field synchronisation is
separate to the cache hash list traversal protection.

The only way to do this is to have an inner lock (inode->i_lock)
that protects both the inode->i_hash_list and inode->i_state fields,
and a lock order that provides outer list traversal protections
(inode_hash_lock). Whether the outer lock is the inode_hash_lock or
rcu_read_lock(), the lock order and the data fields the locks are
protecting are the same
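
A minimal sketch of the check-and-grab pattern this describes (illustrative
only, modelled on the find_inode() loop): the I_FREEING/I_WILL_FREE test and
the reference grab sit under one i_lock hold, so the inode cannot be marked
for freeing in between:

	spin_lock(&inode->i_lock);
	if (inode->i_state & (I_FREEING | I_WILL_FREE)) {
		spin_unlock(&inode->i_lock);
		/* being torn down: skip it or wait, never take a reference */
	} else {
		__iget(inode);	/* I_FREEING cannot be set while we hold i_lock */
		spin_unlock(&inode->i_lock);
	}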

> Of course, maybe they are there for something. Could you speak
> more about the race this change (patch 1,2?) brings up? Thank you.

When you drop the lock from the i_state initialisation, you end up
dropping the implicit unlock->lock memory barrier that the inode->i_lock provides.

Re: [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage

2012-09-23 Thread Guo Chao
On Sat, Sep 22, 2012 at 08:49:12AM +1000, Dave Chinner wrote:

> On Fri, Sep 21, 2012 at 05:31:02PM +0800, Guo Chao wrote:
> > This patchset optimizes several places which take the per inode spin lock.
> > They have not been fully tested yet, thus they are marked as RFC. 
> 
> Inodes are RCU freed. The i_lock spinlock on the i_state field forms
> part of the memory barrier that allows the RCU read side to
> correctly detect a freed inode during a RCU protected cache lookup
> (hash list traversal for the VFS, or a radix tree traversal for XFS).
> The i_lock usage around the hash list operations ensures the hash
> list operations are atomic with state changes so that such changes
> are correctly detected during RCU-protected traversals...
> 
> IOWs, removing the i_lock from around the i_state transitions and
> inode hash insert/remove/traversal operations will cause races in
> the RCU lookups and result in incorrectly using freed inodes instead
> of failing the lookup and creating a new one.
> 
> So I don't think this is a good idea at all...
>

Hello, Dave:

  Thanks for your explanation.
 
  Though I can't fully understand it, your concern seems to be that
RCU inode lookup will be bothered by this change. But we do not have 
RCU inode lookup in VFS: inode lookup is done in a rather traditional
way.

  XFS gives me the impression that it implements its own inode cache.
There may be such thing there. I have little knowledge on XFS, but I
guess it's unlikely impacted by the change of code implementing VFS 
inode cache.

  As far as I can see, RCU inode free is for RCU dentry lookup, which
seems to have nothing to do with 'detect a freed inode'. Taking i_lock in these
places looks to me like a result of following the old lock scheme blindly when
breaking the big global inode lock. Of course, maybe they are there for
something. Could you speak more about the race this change (patch 1,2?) brings 
up? Thank you.

Regards,
Guo Chao

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com



Re: [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage

2012-09-23 Thread Dave Chinner
On Mon, Sep 24, 2012 at 10:42:21AM +0800, Guo Chao wrote:
> On Sat, Sep 22, 2012 at 08:49:12AM +1000, Dave Chinner wrote:
> 
> > On Fri, Sep 21, 2012 at 05:31:02PM +0800, Guo Chao wrote:
> > > This patchset optimizes several places which take the per inode spin lock.
> > > They have not been fully tested yet, thus they are marked as RFC. 
> > 
> > Inodes are RCU freed. The i_lock spinlock on the i_state field forms
> > part of the memory barrier that allows the RCU read side to
> > correctly detect a freed inode during a RCU protected cache lookup
> > (hash list traversal for the VFS, or a radix tree traversal for XFS).
> > The i_lock usage around the hash list operations ensures the hash
> > list operations are atomic with state changes so that such changes
> > are correctly detected during RCU-protected traversals...
> > 
> > IOWs, removing the i_lock from around the i_state transitions and
> > inode hash insert/remove/traversal operations will cause races in
> > the RCU lookups and result in incorrectly using freed inodes instead
> > of failing the lookup and creating a new one.
> > 
> > So I don't think this is a good idea at all...
> 
> Hello, Dave:
> 
>   Thanks for your explanation.
> 
>   Though I can't fully understand it, your concern seems to be that
> RCU inode lookup will be affected by this change. But we do not have
> RCU inode lookup in the VFS: inode lookup is done in a rather traditional
> way.

Ah, I'd forgotten that neither of these RCU-based lookups ever got
merged:

https://lkml.org/lkml/2010/6/23/397
http://thread.gmane.org/gmane.linux.kernel/1056494

That, however, is the approach that the inode caches should be
moving towards - RCU lookups to reduce locking, not changing
i_lock/i_state atomicity that has been designed to facilitate RCU
safe lookups...

>   XFS gives me the impression that it implements its own inode cache.
> There may be such a thing there. I have little knowledge of XFS, but I
> guess it's unlikely to be impacted by changes to the code implementing the
> VFS inode cache.

Yeah, I dropped the generic inode hash RCU conversion - the
SLAB_DESTROY_BY_RCU was proving to be rather complex, and I didn't
have any motivation to see it through because I'd already converted
XFS to avoid the global inode_hash_lock and use RCU lookups on its
internal inode cache...
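
For what it's worth, the thing that made SLAB_DESTROY_BY_RCU awkward is
that the slab only keeps the memory valid as an inode across the grace
period; the object itself can be freed and reused for a different inode
while a lockless walker still holds a pointer to it. So every lookup has
to lock the candidate and revalidate its identity as well as its state.
Something like this sketch (my illustration of the general pattern, with
sb/ino/head as in find_inode_fast(), not the dropped patches):

	struct hlist_node *node;
	struct inode *inode;

	rcu_read_lock();
	hlist_for_each_entry_rcu(inode, node, head, i_hash) {
		spin_lock(&inode->i_lock);
		/* the slab object may have been reused for another inode */
		if (inode->i_sb != sb || inode->i_ino != ino ||
		    (inode->i_state & (I_FREEING | I_WILL_FREE))) {
			spin_unlock(&inode->i_lock);
			continue;
		}
		__iget(inode);
		spin_unlock(&inode->i_lock);
		break;
	}
	rcu_read_unlock();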

>   As far as I can see, RCU inode freeing is for RCU dentry lookup, which
> seems to have nothing to do with 'detect a freed inode'.

If you know nothing of the history of this, then it might seem that
way...

> Taking i_lock in these
> places looks to me like a result of following the old lock scheme blindly
> when breaking up the big global inode lock.

The i_state/i_hash_list/i_lock relationship was created specifically
during the inode_lock breakup to allow us to guarantee that certain
fields of the inode are unchanging without needing to take multiple
nested locks:

$ gl -n 1 250df6e
commit 250df6ed274d767da844a5d9f05720b804240197
Author: Dave Chinner <dchin...@redhat.com>
Date:   Tue Mar 22 22:23:36 2011 +1100

fs: protect inode->i_state with inode->i_lock

Protect inode state transitions and validity checks with the
inode->i_lock. This enables us to make inode state transitions
independently of the inode_lock and is the first step to peeling
away the inode_lock from the code.

This requires that __iget() is done atomically with i_state checks
during list traversals so that we don't race with another thread
marking the inode I_FREEING between the state check and grabbing the
reference.

Also remove the unlock_new_inode() memory barrier optimisation
required to avoid taking the inode_lock when clearing I_NEW.
Simplify the code by simply taking the inode->i_lock around the
state change and wakeup. Because the wakeup is no longer tricky,
remove the wake_up_inode() function and open code the wakeup where
necessary.

Signed-off-by: Dave Chinner <dchin...@redhat.com>
Signed-off-by: Al Viro <v...@zeniv.linux.org.uk>

The inode hash lookup needs to check i_state atomically during the
traversal so inodes being freed are skipped (e.g. I_FREEING,
I_WILL_FREE). Those i_state flags are set only with the i_lock held,
and so inode hash lookups need to take the i_lock to guarantee the
i_state field is correct. This inode data field synchronisation is
separate to the cache hash list traversal protection.

The only way to do this is to have an inner lock (inode->i_lock)
that protects both the inode->i_hash_list and inode->i_state fields,
and a lock order that provides outer list traversal protections
(inode_hash_lock). Whether the outer lock is the inode_hash_lock or
rcu_read_lock(), the lock order and the data fields the locks are
protecting are the same
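
As a concrete picture of that ordering, a locked hash lookup has this
shape (a condensed sketch with a made-up name, not the actual
find_inode()/find_inode_fast() code; the real functions wait on
I_FREEING inodes and retry rather than skipping them):

static struct inode *find_inode_sketch(struct super_block *sb,
				       struct hlist_head *head,
				       unsigned long ino)
{
	struct hlist_node *node;
	struct inode *inode;

	spin_lock(&inode_hash_lock);		/* outer: hash list traversal */
	hlist_for_each_entry(inode, node, head, i_hash) {
		if (inode->i_ino != ino || inode->i_sb != sb)
			continue;
		spin_lock(&inode->i_lock);	/* inner: i_state and i_hash */
		if (inode->i_state & (I_FREEING | I_WILL_FREE)) {
			/* being evicted, not safe to use */
			spin_unlock(&inode->i_lock);
			continue;
		}
		/* the reference is grabbed atomically with the i_state check */
		__iget(inode);
		spin_unlock(&inode->i_lock);
		spin_unlock(&inode_hash_lock);
		return inode;
	}
	spin_unlock(&inode_hash_lock);
	return NULL;
}

Swap the outer spin_lock(&inode_hash_lock) for rcu_read_lock() and the
inner i_lock rules stay exactly the same; that is the invariant that
dropping the i_lock from these paths would violate.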

> Of course, maybe they are there for something. Could you speak
> more about the race this change (patch 1,2?) brings up? Thank you.

When you drop the lock from the i_state initialisation, you end up
dropping the implicit unlock->lock memory barrier that the
inode->i_lock 

Re: [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage

2012-09-21 Thread Dave Chinner
On Fri, Sep 21, 2012 at 05:31:02PM +0800, Guo Chao wrote:
> This patchset optimizes several places which take the per inode spin lock.
> They have not been fully tested yet, thus they are marked as RFC. 

Inodes are RCU freed. The i_lock spinlock on the i_state field forms
part of the memory barrier that allows the RCU read side to
correctly detect a freed inode during a RCU protected cache lookup
(hash list traversal for the VFS, or a radix tree traversal for XFS).
The i_lock usage around the hash list operations ensures the hash
list operations are atomic with state changes so that such changes
are correctly detected during RCU-protected traversals...

IOWs, removing the i_lock from around the i_state transitions and
inode hash insert/remove/traversal operations will cause races in
the RCU lookups and result in incorrectly using freed inodes instead
of failing the lookup and creating a new one.

So I don't think this is a good idea at all...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage

2012-09-21 Thread Matthew Wilcox
On Fri, Sep 21, 2012 at 05:31:02PM +0800, Guo Chao wrote:
> This patchset optimizes several places which take the per inode spin lock.
> They have not been fully tested yet, thus they are marked as RFC. 
> 
> I did limited testing after all patches were applied: two 'find' runs traverse
> the filesystems and touch all files in parallel. This ran for several days in a
> virtual machine; no suspicious log appeared.

Have you done any performance testing?  Taking and releasing a lock which
isn't contended is not particularly expensive.

-- 
Matthew Wilcox  Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."


[RFC v4 Patch 0/4] fs/inode.c: optimization for inode lock usage

2012-09-21 Thread Guo Chao
This patchset optimizes several places which take the per inode spin lock.
They have not been fully tested yet, thus they are marked as RFC. 

I did limited testing after all patches were applied: two 'find' runs traverse
the filesystems and touch all files in parallel. This ran for several days in a
virtual machine; no suspicious log appeared.

Guo Chao (4):
  fs/inode.c: do not take i_lock on newly allocated inode
  fs/inode.c: do not take i_lock in __(insert|remove)_inode_hash
  fs/inode.c: do not take i_lock when identify an inode
  fs/inode.c: always take i_lock before calling filesystem's test()
method

 fs/inode.c |   32 +---
 1 file changed, 9 insertions(+), 23 deletions(-)

-- 
1.7.9.5


