[PATCH] Optimize wait_sb_inodes()

2013-06-26 Thread OGAWA Hirofumi
Hi,

Under the following stress test, sync(2) became the top consumer of cpu time.

fsstress -s 1 -n 5 -p 3 -d `pwd`

Profiling with perf showed that the cpu load was lock contention on inode_sb_list_lock.

sync(2) is a data integrity operation, so it has to make sure that all
dirty data is written out before the sync(2) point. The bdi flusher
flushes the current dirty data and waits for it. But if there is I/O
already in flight when sync_inodes_sb() runs, sync_inodes_sb() can
complete before that prior in-flight I/O does.

So, wait_sb_inodes() walks all inodes on the sb to make sure the
in-flight I/O has completed too. While it walks the inodes,
inode_sb_list_lock contends with other operations like
create(2). This is the cause of the cpu load.
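
For reference, this is roughly what wait_sb_inodes() does (a simplified
paraphrase, not the exact kernel source; the real function also skips
I_NEW/I_FREEING inodes under i_lock and calls cond_resched()):

static void wait_sb_inodes(struct super_block *sb)
{
	struct inode *inode, *old_inode = NULL;

	spin_lock(&inode_sb_list_lock);
	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
		struct address_space *mapping = inode->i_mapping;

		if (mapping->nrpages == 0)
			continue;
		__iget(inode);
		/* The global lock is dropped and re-taken for every
		 * inode with pages; create(2)/unlink(2) need the same
		 * lock to add/remove inodes, hence the contention. */
		spin_unlock(&inode_sb_list_lock);

		iput(old_inode);
		old_inode = inode;

		filemap_fdatawait(mapping);	/* wait for in-flight I/O */

		spin_lock(&inode_sb_list_lock);
	}
	spin_unlock(&inode_sb_list_lock);
	iput(old_inode);
}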

From another point of view, wait_sb_inodes() is (arguably) necessary for
legacy FSes. But, for example, with data=journal on ext*,
wait_sb_inodes() is worse than useless, because ext* can do the same job
via its own transaction list (and in a more efficient way).

Likewise, on tux3, the situation is the same as with data=journal.

Also, even with data=ordered, ext* might be able to check in-flight I/O
via its ordered data list (with some new additional check; I'm not sure).


So, this patch adds an sb->s_op->wait_inodes() handler to replace
wait_sb_inodes(). With this, FSes can optimize the wait by using their
own internal knowledge.
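
As an illustration (hypothetical code, not part of the patch), a
filesystem whose ->sync_fs() already waits on its own transaction list
could wire the hook up as a noop; the tux3 numbers below were measured
with a noop handler of exactly this shape:

static void example_wait_inodes(struct super_block *sb)
{
	/* Nothing to do: ->sync_fs() already waited on this FS's own
	 * transaction list, which covers all in-flight data. */
}

static const struct super_operations example_super_ops = {
	/* ... the usual handlers ... */
	.wait_inodes	= example_wait_inodes,
};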

[An alternative idea to optimize this: push wait_sb_inodes() down into
the ->sync_fs() handler of all FSes. Or if someone fixes this with
another idea, I'm happy enough.]


The following is the profile of the result on tux3 (the ->wait_inodes()
handler is a noop function).

Thanks.

-  13.11%  fsstress  [kernel.kallsyms]  [k] sync_inodes_sb
   - sync_inodes_sb
      - 99.97% sync_inodes_one_sb
           iterate_supers
           sys_sync
           system_call
           sync
-   9.39%  fsstress  [kernel.kallsyms]  [k] _raw_spin_lock
   - _raw_spin_lock
      + 62.72% sync_inodes_sb
      +  7.46% _atomic_dec_and_lock
      +  6.15% inode_add_lru
      +  4.43% map_region
      +  3.07% __find_get_buffer
      +  2.71% sync_inodes_one_sb
      +  1.77% tux3_set_buffer_dirty_list
      +  1.43% find_inode
      +  1.02% iput
      +  0.69% dput
      +  0.57% __tux3_get_block
      +  0.53% shrink_dcache_parent
      +  0.53% __d_instantiate

Before patch:

real 2m40.994s
user 0m1.344s
sys  0m15.832s

After patch:

real 2m34.748s
user 0m1.360s
sys  0m5.356s

Signed-off-by: OGAWA Hirofumi <hirof...@mail.parknet.co.jp>
---

 fs/fs-writeback.c  |    5 ++++-
 include/linux/fs.h |    1 +
 2 files changed, 5 insertions(+), 1 deletion(-)

diff -puN include/linux/fs.h~tux3-fix-wait_sb_inodes include/linux/fs.h
--- tux3fs/include/linux/fs.h~tux3-fix-wait_sb_inodes	2013-06-24 19:03:10.0 +0900
+++ tux3fs-hirofumi/include/linux/fs.h	2013-06-24 19:03:10.0 +0900
@@ -1595,6 +1595,7 @@ struct super_operations {
 	int (*write_inode) (struct inode *, struct writeback_control *wbc);
 	int (*drop_inode) (struct inode *);
 	void (*evict_inode) (struct inode *);
+	void (*wait_inodes)(struct super_block *);
 	void (*put_super) (struct super_block *);
 	int (*sync_fs)(struct super_block *sb, int wait);
 	int (*freeze_fs) (struct super_block *);
diff -puN fs/fs-writeback.c~tux3-fix-wait_sb_inodes fs/fs-writeback.c
--- tux3fs/fs/fs-writeback.c~tux3-fix-wait_sb_inodes	2013-06-24 19:03:10.0 +0900
+++ tux3fs-hirofumi/fs/fs-writeback.c	2013-06-24 19:03:10.0 +0900
@@ -1401,7 +1401,10 @@ void sync_inodes_sb(struct super_block *
 	bdi_queue_work(sb->s_bdi, &work);
 	wait_for_completion(&done);
 
-	wait_sb_inodes(sb);
+	if (sb->s_op->wait_inodes)
+		sb->s_op->wait_inodes(sb);
+	else
+		wait_sb_inodes(sb);
 }
 EXPORT_SYMBOL(sync_inodes_sb);
 
_

-- 
OGAWA Hirofumi <hirof...@mail.parknet.co.jp>



Re: [PATCH] Optimize wait_sb_inodes()

2013-06-26 Thread Jörn Engel
On Wed, 26 June 2013 17:45:23 +0900, OGAWA Hirofumi wrote:

 
>  fs/fs-writeback.c  |    5 ++++-
>  include/linux/fs.h |    1 +
>  2 files changed, 5 insertions(+), 1 deletion(-)
>
> diff -puN include/linux/fs.h~tux3-fix-wait_sb_inodes include/linux/fs.h
> --- tux3fs/include/linux/fs.h~tux3-fix-wait_sb_inodes	2013-06-24 19:03:10.0 +0900
> +++ tux3fs-hirofumi/include/linux/fs.h	2013-06-24 19:03:10.0 +0900
> @@ -1595,6 +1595,7 @@ struct super_operations {
>  	int (*write_inode) (struct inode *, struct writeback_control *wbc);
>  	int (*drop_inode) (struct inode *);
>  	void (*evict_inode) (struct inode *);
> +	void (*wait_inodes)(struct super_block *);
>  	void (*put_super) (struct super_block *);
>  	int (*sync_fs)(struct super_block *sb, int wait);
>  	int (*freeze_fs) (struct super_block *);
> diff -puN fs/fs-writeback.c~tux3-fix-wait_sb_inodes fs/fs-writeback.c
> --- tux3fs/fs/fs-writeback.c~tux3-fix-wait_sb_inodes	2013-06-24 19:03:10.0 +0900
> +++ tux3fs-hirofumi/fs/fs-writeback.c	2013-06-24 19:03:10.0 +0900
> @@ -1401,7 +1401,10 @@ void sync_inodes_sb(struct super_block *
>  	bdi_queue_work(sb->s_bdi, &work);
>  	wait_for_completion(&done);
>  
> -	wait_sb_inodes(sb);
> +	if (sb->s_op->wait_inodes)
> +		sb->s_op->wait_inodes(sb);
> +	else
> +		wait_sb_inodes(sb);
>  }
>  EXPORT_SYMBOL(sync_inodes_sb);

Two things.  Until there are actual implementations of
s_op->wait_inodes, this is pure obfuscation.  You already know this,
of course.

More interestingly, I personally hate methods with a default option if
they are not implemented.  Not too bad in this particular case, but
the same pattern has burned me a number of times and wasted weeks of
my life.  So I would prefer to unconditionally call
sb->s_op->wait_inodes(sb) and set it to wait_sb_inodes for all
filesystems that don't have a smarter way to do things.
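
A hypothetical sketch of that shape (illustrative only; note that
wait_sb_inodes() is static in fs/fs-writeback.c today, so it would have
to be exported for filesystems to reference it):

/* Caller side: no NULL check, no fallback branch. */
void sync_inodes_sb(struct super_block *sb)
{
	/* ... dispatch and wait for writeback as before ... */
	sb->s_op->wait_inodes(sb);	/* always set, by contract */
}

/* A filesystem without a smarter method points the op at the
 * generic code explicitly: */
static const struct super_operations somefs_super_ops = {
	/* ... */
	.wait_inodes	= wait_sb_inodes,
};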

Jörn

--
Linux is more the core point of a concept that surrounds open source
which, in turn, is based on a false concept. This concept is that
people actually want to look at source code.
-- Rob Enderle



Re: [PATCH] Optimize wait_sb_inodes()

2013-06-26 Thread OGAWA Hirofumi
Dave Chinner <da...@fromorbit.com> writes:

>> Optimizing wait_sb_inodes() might help lock contention, but it
>> doesn't help the unnecessary wait/check.
>
> You have your own wait code; that doesn't make what the VFS does
> unnecessary. Quite frankly, I don't trust individual filesystems to
> get it right - there's a long history of filesystem-specific data
> sync problems (including in XFS), and the best way to avoid that is
> to ensure the VFS gets it right for you.
>
> Indeed, we've gone from having sooper special secret sauce data sync
> code in XFS to using the VFS over the past few years, and the result
> is that it is now more reliable and faster than when we were trying
> to be smart and do it all ourselves. We got to where we are by
> fixing the problems in the VFS rather than continuing to try to work
> around them.

I guess you are assuming an FS that uses data=writeback or the like.

>> Since some FSes already know about current in-flight I/O internally,
>> I think those FSes can do the wait here, or are already doing it in
>> ->sync_fs().
>
> Sure, do your internal checks in ->sync_fs(), but if
> wait_sb_inodes() does not have any lock contention and very little
> overhead, then why do you need to avoid it? This wait has to be done
> somewhere between sync_inodes_sb() dispatching all the IO and
> ->sync_fs completing, so what's the problem with having the VFS do
> that *for everyone* efficiently?

Are you saying the VFS should track all in-flight I/O in some sort of
transactional way?

Otherwise, the VFS can't know whether data belongs before or after the
sync point, and hence whether it has to wait for it. An FS with
behavior like data=journal already tracks this, and can reuse that
tracking.
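
As a hypothetical sketch (all names invented) of what reusing that
tracking could look like: a journaling FS ties every data write to a
transaction sequence number, so waiting for everything before the sync
point is just waiting for commits to catch up, with no walk of
sb->s_inodes at all (locking and memory-ordering details elided):

struct somefs_sb_info {
	u64			current_seq;	/* transaction accepting writes */
	u64			committed_seq;	/* last transaction fully on disk */
	wait_queue_head_t	commit_wait;	/* woken on each commit */
};

static void somefs_wait_inodes(struct super_block *sb)
{
	struct somefs_sb_info *sbi = sb->s_fs_info;
	u64 target = sbi->current_seq;

	/* All I/O issued before this sync belongs to a transaction
	 * numbered <= target, so wait only for those commits. */
	wait_event(sbi->commit_wait, sbi->committed_seq >= target);
}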

> Fix the root cause of the problem - the sub-optimal VFS code.
> Hacking around it specifically for out-of-tree code is not the way
> things get done around here...

I'm thinking the root cause is that the VFS can't have knowledge of FS
internals, e.g. whether the FS handles data in a transactional way or
not.

Thanks.
-- 
OGAWA Hirofumi <hirof...@mail.parknet.co.jp>



Re: [PATCH] Optimize wait_sb_inodes()

2013-06-26 Thread Daniel Phillips
Hi Dave,

On Wed, Jun 26, 2013 at 9:47 PM, Dave Chinner <da...@fromorbit.com> wrote:
> You have your own wait code; that doesn't make what the VFS does
> unnecessary. Quite frankly, I don't trust individual filesystems to
> get it right - there's a long history of filesystem-specific data
> sync problems (including in XFS), and the best way to avoid that is
> to ensure the VFS gets it right for you.

I agree that some of the methods Tux3 uses to implement data integrity,
sync and friends may be worth lifting up to core, or better, to a
library, but we will all be better served if such methods are given
time to mature first. After all, that basically describes the entire
evolution of the VFS: new concepts start in a filesystem, prove
themselves useful, then may be lifted up to be shared.

It is important to get the order right: prove first, then lift.

Regards,

Daniel
