Re: [reiserfs-list] fsync() Performance Issue

2002-05-06 Thread Chris Mason

On Sat, 2002-05-04 at 10:59, Hans Reiser wrote:

 So how about if you revise fsync so that it always sends data blocks to 
 the journal not to the main disk?

This gets a little sticky.

Once you log a block, it might be replayed after a crash.  So, you have
to protect against corner cases like this:

write(file)
fsync(file) ; /* logs modified data blocks */
write(file) ; /* write the same blocks without fsync */
sync ;  /* user expects new version of the blocks on disk */
crash

During replay, the logged data blocks overwrite the blocks sent to disk
via sync().

This isn't hard to correct for: every time a buffer is marked dirty, you
check the journal hash tables to see if it is replayable, and if so, you
log it instead (the 2.2.x code did this due to tails).  This translates
to increased CPU usage for every write.

I'd rather not put it back in because it adds yet another corner case to
maintain for all time.  Most of the fsync/O_SYNC bound applications are
just given their own partition anyway, so most users that need data
logging need it for every write.

-chris







RE: [reiserfs-list] fsync() Performance Issue

2002-05-06 Thread berthiaume_wayne

I'll add the write caching into the test just for info. Until there
is a way to guarantee the data is safe, I'll have to go with no write
caching, though. I should have all this testing done by the end of the week.

-----Original Message-----
From: Chris Mason [mailto:[EMAIL PROTECTED]]
Sent: Friday, May 03, 2002 6:00 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: RE: [reiserfs-list] fsync() Performance Issue


On Fri, 2002-05-03 at 16:35, [EMAIL PROTECTED] wrote:
   Chris, I have some quick preliminary results for you. I have
 additional testing to perform and haven't run debugreiserfs() yet. If you
 have a preference for which tests to run debugreiserfs() let me know.
   Base testing was done against 2.4.13 built on RH 7.1 using the
 test_writes.c code I forwarded to you. The system is a Tyan with single
 PIII, IDE Promise 20269, Maxtor 160GB drive - write cache disabled. All
 numbers are with fsync() and 1KB files. As I said, more testing (e.g. of
 file sizes) needs to be performed.

 2.4.19-pre7 speedup, data logging, write barrier / no options
   = 47.1ms/file

Hi Wayne, thanks for sending these along.

I expected a slight improvement over the 2.4.13 code even with the data
logging turned off.  I'm curious to see how it does with the IDE cache
turned on.  With SCSI, I see 10-15% better performance without any options
than with an unpatched kernel.

 2.4.19-pre7 speedup, data logging, write barrier / data=journal
   = 25.2ms/file
 2.4.19-pre7 speedup, data logging, write barrier /
data=journal,barrier=none
   = 27.8ms/file

The barrier option doesn't make much difference because the write cache
is off.  With write cache on, the barrier code should allow you to be
faster than with the caching off, but without risking the data (Jens and
I are working on final fsync safety issues though).

Hans, data=journal turns on the data journaling.  The data journaling
patches also include optimizations to write metadata back to disk in
bigger chunks for tiny transactions (the current method writes one
transaction's worth back at a time; when a transaction has 3 blocks,
this is pretty slow).

I've put these patches up on:

ftp.suse.com/pub/people/mason/patches/data-logging

   One question: will these patches be going into the 2.4 tree, and
 when?

The data logging patches are a huge change, but the good news is they
are based on the nesting patches that have been stable for a long time
in the quota code.  I'll probably want a month or more of heavy testing
before I think about submitting them.

-chris




Re: [reiserfs-list] fsync() Performance Issue

2002-05-06 Thread Hans Reiser

Chris Mason wrote:

On Sat, 2002-05-04 at 10:59, Hans Reiser wrote:
  

[...]

 This isn't hard to correct for: every time a buffer is marked dirty, you
 check the journal hash tables to see if it is replayable, and if so, you
 log it instead (the 2.2.x code did this due to tails).  This translates
 to increased CPU usage for every write.

Significantly increased CPU usage?

 I'd rather not put it back in because it adds yet another corner case to
 maintain for all time.  Most of the fsync/O_SYNC bound applications are
 just given their own partition anyway, so most users that need data
 logging need it for every write.

Most users don't know enough to turn it on. ;-)

Re: [reiserfs-list] fsync() Performance Issue

2002-05-06 Thread Chris Mason

On Mon, 2002-05-06 at 17:21, Hans Reiser wrote:

 I'd rather not put it back in because it adds yet another corner case to
 maintain for all time.  Most of the fsync/O_SYNC bound applications are
 just given their own partition anyway, so most users that need data
 logging need it for every write.
 
 Does mozilla's mail user agent use fsync?  Should I give it its own 
 partition?  I bet it is fsync bound;-)

[ I took Wayne off the cc list, he's probably not horribly interested ]

Perhaps, but I'll also bet the fsync performance hit doesn't affect the
performance of the system as a whole.  Remember that data=journal
doesn't make the fsyncs fast, it just makes them faster.

 
 Most persons using small fsyncs are using it because the person who 
 wrote their application wrote it wrong.  What's more, many of the 
 persons who wrote those applications cannot understand that they did it 
 wrong even if you tell them (e.g. qmail author reportedly cannot 
 understand, sendmail guys now understand but had Kirk McKusick on their 
 staff and attending the meeting when I explained it to them so they are 
 not very typical).  
 
 In other words, handling stupidity is an important life skill, and we 
 all need to excel at it. ;-)

A real strength of Linux is that application designers can talk directly
to their own personal bottlenecks.  Hopefully we reward those who hunt
us down and spend the time convincing us their applications are worth
tuning for.  They then proceed to beat the pants off their competition.

 
 Tell me what your thoughts are on the following:
 
 If you ask randomly selected ReiserFS users (not the reiserfs-list, but 
 the ones who would never send you an email)  the following 
 questions, what percentage will answer which choice?
 
 The filesystem you are using is named:
 
 a) the Performance Optimized SuSE FS
 
 b) NTFS
 
 c) FAT
 
 d) ext2
 
 e) ReiserFS

I believe the ones that know what a filesystem is will answer ReiserFS.
You might get a lot of ext2 answers, just because that's what a lot of
people think the Linux filesystem is.

 
 If you want to change reiserfs to use data journaling you must do which:
 
 a) reinstall the reiserfs package using rpm
 
 b) modify /etc/fs.conf
 
 c) reinstall the operating system from scratch, and select different 
 options during the install this time
 
 d) reformat your reiserfs partition using mkreiserfs
 
 e) none of the above
 
 f) all of the above except e)

These people won't be admins of systems big enough for the difference to
matter.  Data journaling is targeted at people with so much load they
would have to buy more hardware to make up for it.  The new option
lowers the price to performance ratio, which is exactly what we want to
do for sendmails, egeneras, lycos, etc.  If it takes my laptop 20ms to
deliver a mail message, cutting the time down to 10ms just won't matter.

 
 
 What do you think the chances are that you can convince Hubert that 
 every SuSE Enterprise Edition user should be asked at install time if 
 they are going to use fsync a lot on each partition, and to use a 
 different fstab setting if yes?

Very little.  I might tell them to buy the SuSE email server instead,
since that would have the settings done right.  data=journal is just a
small part of mail server tuning.

 
 I know that you are an experienced sysadmin who was good at it.  Your 
 intuition tells you that most sysadmins are like the ones you were 
 willing to hire into your group at the university.  They aren't.
 
 Linux needs to be like a telephone.  You plug it in, push buttons, and 
 talk.  It works well, but most folks don't know why.
 

Exactly.  I think there are 3 classes of users at play here.

1) Those who don't understand and don't have enough load to notice.
2) Those who don't understand and do have enough load to notice.
3) Those who do understand and do have enough load to notice.

#2 will buy support from someone, and they should be able to configure
the thing right.

#3 will find the docs and do it right themselves.

 A moderate number of programs are small fsync bound for the simple 
 reason that it is simpler to write them that way. We need to cover 
 over their simplistic designs.
 
 So, you have my sympathies Chris, because I believe you that it makes 
 the code uglier and it won't be a joy to code and test.  I hope you also 
 see that it should be done.

Mostly, I feel this kind of tuning is a mistake right now.  The patch is
young and there are so many places left to tweak...I'm still at the
stage where much larger improvements are possible and would be a better
use of coding time.  Plus, it's Monday and it's always more fun to debate
than give in on Mondays.

-chris





Re: [reiserfs-list] fsync() Performance Issue

2002-05-06 Thread Manuel Krause

On 05/07/2002 12:57 AM, Chris Mason wrote:

[...]

Hi, Chris & Hans!

I don't think this kind of destructive discussion will lead to
anything useful for now. Can you post a diff for
2.4.19-pre7+latest-related-pending +compound-patch-from-ftp?

I'll try it and report if that leads to more security and/or less
performance in my everyday use with NS6 and so on, if there is any.

Re: [reiserfs-list] fsync() Performance Issue

2002-05-06 Thread Chris Mason

On Mon, 2002-05-06 at 21:17, Manuel Krause wrote:
 On 05/07/2002 12:57 AM, Chris Mason wrote:

 
 Hi, Chris & Hans!
 
 I don't think this kind of destructive discussion will lead to 
 anything useful for now. Can you post a diff for 
 2.4.19-pre7+latest-related-pending +compound-patch-from-ftp?
 
 I'll try it and report if that leads to more security and/or less 
 performance in my everyday use with NS6 and so on, if there is any.

The current data logging patches are at:

ftp.suse.com/pub/people/mason/patches/data-logging

They are against 2.4.19-pre7, and contain versions of the major (stable)
speedups.  The patch is pretty big, so I'm not likely to merge with the
namesys pending directories.  The namesys guys add things frequently,
and I think it would get confusing for people trying to figure out which
patches to apply.

The data logging stuff is beta code.  If you have a good test bed where
it's OK if things go wrong, I can make you a special patch with the
pending stuff merged.

-chris







Re: [reiserfs-list] fsync() Performance Issue

2002-05-02 Thread Oleg Drokin

Hello!

On Thu, May 02, 2002 at 07:07:18AM +0200, Christian Stuke wrote:
 Could we have this for 2.4.18+ pending also please?

This patch would apply to 2.4.18 + pending patches, I believe.
As for including these patches in the pending queue for 2.4.18, that is
impossible now; it is too big a change, unfortunately. We hope to get
something like this into 2.4.19-pre1+

Bye,
Oleg



Re: [reiserfs-list] fsync() Performance Issue

2002-05-01 Thread Christian Stuke

Could we have this for 2.4.18+ pending also please?

Chris
----- Original Message -----
From: Oleg Drokin [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Tuesday, April 30, 2002 4:20 PM
Subject: Re: [reiserfs-list] fsync() Performance Issue


[...]





Re: [reiserfs-list] fsync() Performance Issue

2002-04-30 Thread Oleg Drokin

Hello!

On Fri, Apr 26, 2002 at 04:28:26PM -0400, [EMAIL PROTECTED] wrote:
   I'm wondering if anyone out there may have some suggestions on how
 to improve the performance of a system employing fsync(). I have to be able
 to guarantee that every write to my fileserver is on disk when the client has
 passed it to the server. Therefore, I have disabled write cache on the disk
 and issue an fsync() per file. I'm running 2.4.19-pre7, reiserfs 3.6.25,
 without additional patches. I have seen some discussions out here about
 various other speed-up patches and am wondering if I need to add these to
 2.4.19-pre7, what they are, and where I can obtain said patches. Also,
 I'm wondering if there is another solution to syncing the data that is
 faster than fsync(). Testing thus far has shown a large disparity between
 running with and without sync. Another idea is to explore another filesystem,
 but I'm not exactly excited by the other journaling filesystems out there at
 this time. All ideas will be greatly appreciated.

Attached is a speedup patch for 2.4.19-pre7 that should help your fsync
operations a little. (From Chris Mason).
The filesystem cannot do very much at this point, unfortunately; it ends up
waiting for the disk to finish write operations.

Also, we are working on other speedup patches that would cover different areas
of write performance itself.

Bye,
Oleg


diff -uNr linux-2.4.19-pre6.o/fs/buffer.c linux-2.4.19-pre6.speedup/fs/buffer.c
--- linux-2.4.19-pre6.o/fs/buffer.c Mon Apr  8 14:53:24 2002
+++ linux-2.4.19-pre6.speedup/fs/buffer.c   Wed Apr 10 10:43:46 2002
@@ -325,6 +325,8 @@
lock_super(sb);
if (sb->s_dirt && sb->s_op && sb->s_op->write_super)
sb->s_op->write_super(sb);
+   if (sb->s_op && sb->s_op->commit_super)
+   sb->s_op->commit_super(sb);
unlock_super(sb);
unlock_kernel();
 
@@ -344,7 +346,7 @@
lock_kernel();
sync_inodes(dev);
DQUOT_SYNC(dev);
-   sync_supers(dev);
+   commit_supers(dev);
unlock_kernel();
 
return sync_buffers(dev, 1);
Binary files linux-2.4.19-pre6.o/fs/reiserfs/.journal.c.rej.swp and linux-2.4.19-pre6.speedup/fs/reiserfs/.journal.c.rej.swp differ
diff -uNr linux-2.4.19-pre6.o/fs/reiserfs/bitmap.c linux-2.4.19-pre6.speedup/fs/reiserfs/bitmap.c
--- linux-2.4.19-pre6.o/fs/reiserfs/bitmap.c	Mon Apr  8 14:53:24 2002
+++ linux-2.4.19-pre6.speedup/fs/reiserfs/bitmap.c  Wed Apr 10 10:43:46 2002
@@ -122,7 +122,6 @@
   set_sb_free_blocks( rs, sb_free_blocks(rs) + 1 );
 
   journal_mark_dirty (th, s, sbh);
-  s->s_dirt = 1;
 }
 
 void reiserfs_free_block (struct reiserfs_transaction_handle *th, 
@@ -433,7 +432,6 @@
   /* update free block count in super block */
   PUT_SB_FREE_BLOCKS( s, SB_FREE_BLOCKS(s) - init_amount_needed );
   journal_mark_dirty (th, s, SB_BUFFER_WITH_SB (s));
-  s->s_dirt = 1;
 
   return CARRY_ON;
 }
diff -uNr linux-2.4.19-pre6.o/fs/reiserfs/ibalance.c linux-2.4.19-pre6.speedup/fs/reiserfs/ibalance.c
--- linux-2.4.19-pre6.o/fs/reiserfs/ibalance.c  Sat Nov 10 01:18:25 2001
+++ linux-2.4.19-pre6.speedup/fs/reiserfs/ibalance.c	Wed Apr 10 10:43:46 2002
@@ -632,7 +632,6 @@
/* use check_internal if new root is an internal node */
check_internal (new_root);
/**/
-   tb->tb_sb->s_dirt = 1;
 
/* do what is needed for buffer thrown from tree */
reiserfs_invalidate_buffer(tb, tbSh);
@@ -950,7 +949,6 @@
 PUT_SB_ROOT_BLOCK( tb->tb_sb, tbSh->b_blocknr );
 PUT_SB_TREE_HEIGHT( tb->tb_sb, SB_TREE_HEIGHT(tb->tb_sb) + 1 );
do_balance_mark_sb_dirty (tb, tb->tb_sb->u.reiserfs_sb.s_sbh, 1);
-   tb->tb_sb->s_dirt = 1;
 }

 if ( tb->blknum[h] == 2 ) {
diff -uNr linux-2.4.19-pre6.o/fs/reiserfs/journal.c linux-2.4.19-pre6.speedup/fs/reiserfs/journal.c
--- linux-2.4.19-pre6.o/fs/reiserfs/journal.c   Mon Apr  8 14:53:24 2002
+++ linux-2.4.19-pre6.speedup/fs/reiserfs/journal.c Wed Apr 10 10:44:32 2002
@@ -64,12 +64,15 @@
 */
 static int reiserfs_mounted_fs_count = 0 ;
 
+static struct list_head kreiserfsd_supers = LIST_HEAD_INIT(kreiserfsd_supers);
+
 /* wake this up when you add something to the commit thread task queue */
 DECLARE_WAIT_QUEUE_HEAD(reiserfs_commit_thread_wait) ;
 
 /* wait on this if you need to be sure you task queue entries have been run */
 static DECLARE_WAIT_QUEUE_HEAD(reiserfs_commit_thread_done) ;
 DECLARE_TASK_QUEUE(reiserfs_commit_thread_tq) ;
+DECLARE_MUTEX(kreiserfsd_sem) ;
 
 #define JOURNAL_TRANS_HALF 1018   /* must be correct to keep the desc and commit
 structs at 4k */
@@ -576,17 +579,12 @@
 /* lock the current transaction */
 inline static void lock_journal(struct super_block *p_s_sb) {
   PROC_INFO_INC( p_s_sb, journal.lock_journal );
-  while(atomic_read(&(SB_JOURNAL(p_s_sb)->j_wlock)) > 0) {
-PROC_INFO_INC( p_s_sb, journal.lock_journal_wait );
-

Re: [reiserfs-list] fsync() Performance Issue

2002-04-30 Thread Chris Mason

On Tue, 2002-04-30 at 10:20, Oleg Drokin wrote:

 Attached is a speedup patch for 2.4.19-pre7 that should help your fsync
 operations a little. (From Chris Mason).
 The filesystem cannot do very much at this point, unfortunately; it ends up
 waiting for the disk to finish write operations.
 
 Also, we are working on other speedup patches that would cover different areas
 of write performance itself.

A newer one (against 2.4.19-pre7) is below.  It has not been through as
much testing on the namesys side, which is why Oleg sent the older one.

Wayne and I have been talking in private mail; he's getting a bunch of
beta patches later today (this speedup, data logging, updated barrier
code), along with instructions for testing.

-chris

# Veritas (Hugh Dickins supplied the patch) sent the bits in
# fs/super.c that allow the FS to leave super->s_dirt set after a
# write_super call.
#
diff -urN --exclude *.orig parent/fs/buffer.c comp/fs/buffer.c
--- parent/fs/buffer.c  Mon Apr 29 10:20:24 2002
+++ comp/fs/buffer.c	Mon Apr 29 10:20:22 2002
@@ -325,6 +325,8 @@
lock_super(sb);
if (sb->s_dirt && sb->s_op && sb->s_op->write_super)
sb->s_op->write_super(sb);
+   if (sb->s_op && sb->s_op->commit_super)
+   sb->s_op->commit_super(sb);
unlock_super(sb);
unlock_kernel();
 
@@ -344,7 +346,7 @@
lock_kernel();
sync_inodes(dev);
DQUOT_SYNC(dev);
-   sync_supers(dev);
+   commit_supers(dev);
unlock_kernel();
 
return sync_buffers(dev, 1);
diff -urN --exclude *.orig parent/fs/reiserfs/bitmap.c comp/fs/reiserfs/bitmap.c
--- parent/fs/reiserfs/bitmap.c Mon Apr 29 10:20:24 2002
+++ comp/fs/reiserfs/bitmap.c   Mon Apr 29 10:20:19 2002
@@ -122,7 +122,6 @@
   set_sb_free_blocks( rs, sb_free_blocks(rs) + 1 );
 
   journal_mark_dirty (th, s, sbh);
-  s->s_dirt = 1;
 }
 
 void reiserfs_free_block (struct reiserfs_transaction_handle *th, 
@@ -433,7 +432,6 @@
   /* update free block count in super block */
   PUT_SB_FREE_BLOCKS( s, SB_FREE_BLOCKS(s) - init_amount_needed );
   journal_mark_dirty (th, s, SB_BUFFER_WITH_SB (s));
-  s->s_dirt = 1;
 
   return CARRY_ON;
 }
diff -urN --exclude *.orig parent/fs/reiserfs/ibalance.c comp/fs/reiserfs/ibalance.c
--- parent/fs/reiserfs/ibalance.c   Mon Apr 29 10:20:24 2002
+++ comp/fs/reiserfs/ibalance.c Mon Apr 29 10:20:19 2002
@@ -632,7 +632,6 @@
/* use check_internal if new root is an internal node */
check_internal (new_root);
/**/
-   tb->tb_sb->s_dirt = 1;
 
/* do what is needed for buffer thrown from tree */
reiserfs_invalidate_buffer(tb, tbSh);
@@ -950,7 +949,6 @@
 PUT_SB_ROOT_BLOCK( tb->tb_sb, tbSh->b_blocknr );
 PUT_SB_TREE_HEIGHT( tb->tb_sb, SB_TREE_HEIGHT(tb->tb_sb) + 1 );
do_balance_mark_sb_dirty (tb, tb->tb_sb->u.reiserfs_sb.s_sbh, 1);
-   tb->tb_sb->s_dirt = 1;
 }

 if ( tb->blknum[h] == 2 ) {
diff -urN --exclude *.orig parent/fs/reiserfs/journal.c comp/fs/reiserfs/journal.c
--- parent/fs/reiserfs/journal.c	Mon Apr 29 10:20:24 2002
+++ comp/fs/reiserfs/journal.c  Mon Apr 29 10:20:21 2002
@@ -64,12 +64,15 @@
 */
 static int reiserfs_mounted_fs_count = 0 ;
 
+static struct list_head kreiserfsd_supers = LIST_HEAD_INIT(kreiserfsd_supers);
+
 /* wake this up when you add something to the commit thread task queue */
 DECLARE_WAIT_QUEUE_HEAD(reiserfs_commit_thread_wait) ;
 
 /* wait on this if you need to be sure you task queue entries have been run */
 static DECLARE_WAIT_QUEUE_HEAD(reiserfs_commit_thread_done) ;
 DECLARE_TASK_QUEUE(reiserfs_commit_thread_tq) ;
+DECLARE_MUTEX(kreiserfsd_sem) ;
 
 #define JOURNAL_TRANS_HALF 1018   /* must be correct to keep the desc and commit
 structs at 4k */
@@ -576,17 +579,12 @@
 /* lock the current transaction */
 inline static void lock_journal(struct super_block *p_s_sb) {
   PROC_INFO_INC( p_s_sb, journal.lock_journal );
-  while(atomic_read(&(SB_JOURNAL(p_s_sb)->j_wlock)) > 0) {
-PROC_INFO_INC( p_s_sb, journal.lock_journal_wait );
-sleep_on(&(SB_JOURNAL(p_s_sb)->j_wait)) ;
-  }
-  atomic_set(&(SB_JOURNAL(p_s_sb)->j_wlock), 1) ;
+  down(&SB_JOURNAL(p_s_sb)->j_lock);
 }
 
 /* unlock the current transaction */
 inline static void unlock_journal(struct super_block *p_s_sb) {
-  atomic_dec(&(SB_JOURNAL(p_s_sb)->j_wlock)) ;
-  wake_up(&(SB_JOURNAL(p_s_sb)->j_wait)) ;
+  up(&SB_JOURNAL(p_s_sb)->j_lock);
 }
 
 /*
@@ -756,7 +754,6 @@
  atomic_set(&(jl->j_commit_flushing), 0) ;
  wake_up(&(jl->j_commit_wait)) ;
 
-  s->s_dirt = 1 ;
   return 0 ;
 }
 
@@ -1220,7 +1217,6 @@
 if (run++ == 0) {
 goto loop_start ;
 }
-
 atomic_set(&(jl->j_flushing), 0) ;
 wake_up(&(jl->j_flush_wait)) ;
 return ret ;
@@ -1250,7 +1246,7 @@
 while(i != start) {
 jl = SB_JOURNAL_LIST(s) + i  ;
 age = CURRENT_TIME - jl->j_timestamp ;
-if (jl->j_len > 0 && // age >= (JOURNAL_MAX_COMMIT_AGE * 2) && 
+if (jl->j_len > 0 && age >= JOURNAL_MAX_COMMIT_AGE 

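The journal.c hunk replaces the hand-rolled wait loop in lock_journal() with a single down() on a semaphore. The old loop checked an atomic counter, called sleep_on() on the wait queue, and then set the counter. Correspondingly, unlock_journal() now calls up(). Below is a user-space analogue that uses POSIX semaphores as a stand-in for the kernel's down()/up(); the struct, names, and thread counts are hypothetical. Four threads serialize their updates through the journal lock, so no increments are lost. On older C libraries, build with -pthread.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define NTHREADS 4
#define ITERS 100000

/* Hypothetical stand-in for the per-journal state; only the lock matters here. */
struct journal {
    sem_t j_lock;        /* binary semaphore: 1 = unlocked, 0 = held */
    long j_updates;      /* state protected by the journal lock */
};

static struct journal jnl;

/* Analogue of lock_journal(): one blocking call instead of a wait loop. */
static void lock_journal(struct journal *j)
{
    while (sem_wait(&j->j_lock) == -1 && errno == EINTR)
        ;                /* retry if a signal interrupted the wait */
}

/* Analogue of unlock_journal(): release the lock and wake one waiter. */
static void unlock_journal(struct journal *j)
{
    sem_post(&j->j_lock);
}

static void *writer(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        lock_journal(&jnl);
        jnl.j_updates++;  /* only one thread at a time gets here */
        unlock_journal(&jnl);
    }
    return NULL;
}

/* Runs NTHREADS writers and returns the final update count. */
static long run_writers(void)
{
    pthread_t t[NTHREADS];

    sem_init(&jnl.j_lock, 0, 1);  /* shared between threads, starts unlocked */
    jnl.j_updates = 0;
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, writer, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    sem_destroy(&jnl.j_lock);
    return jnl.j_updates;
}

int main(void)
{
    long n = run_writers();
    printf("%ld updates (expected %ld)\n", n, (long)NTHREADS * ITERS);
    assert(n == (long)NTHREADS * ITERS);
    return 0;
}
```
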
RE: [reiserfs-list] fsync() Performance Issue

2002-04-30 Thread berthiaume_wayne

Thanks. I'll start putting this one into test.
Wayne.

-----Original Message-----
From: Chris Mason [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, April 30, 2002 10:28 AM
To: Oleg Drokin
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: [reiserfs-list] fsync() Performance Issue


On Tue, 2002-04-30 at 10:20, Oleg Drokin wrote:

 Attached is a speedup patch for 2.4.19-pre7 that should help your fsync
 operations a little. (From Chris Mason).
 Filesystem cannot do very much at this point unfortunatelly, it is ending
up
 waiting for disk to finish write operations.
 
 Also we are working on other speedup patches that would cover different
areas
 of write perfomance itself.

A newer one (against 2.4.19-pre7) is below.  It has not been through as
much testing on the namesys side, which is why Oleg sent the older one.

Wayne and I have been talking in private mail, he's getting a bunch of
beta patches later today (this speedup, data logging, updated barrier
code).  Along with instructions for testing.

-chris

# Veritas (Hugh Dickins supplied the patch) sent the bits in
# fs/super.c that allow the FS to leave super-s_dirt set after a
# write_super call.
#
diff -urN --exclude *.orig parent/fs/buffer.c comp/fs/buffer.c
--- parent/fs/buffer.c  Mon Apr 29 10:20:24 2002
+++ comp/fs/buffer.cMon Apr 29 10:20:22 2002
@@ -325,6 +325,8 @@
lock_super(sb);
if (sb-s_dirt  sb-s_op  sb-s_op-write_super)
sb-s_op-write_super(sb);
+   if (sb-s_op  sb-s_op-commit_super)
+   sb-s_op-commit_super(sb);
unlock_super(sb);
unlock_kernel();
 
@@ -344,7 +346,7 @@
lock_kernel();
sync_inodes(dev);
DQUOT_SYNC(dev);
-   sync_supers(dev);
+   commit_supers(dev);
unlock_kernel();
 
return sync_buffers(dev, 1);
diff -urN --exclude *.orig parent/fs/reiserfs/bitmap.c
comp/fs/reiserfs/bitmap.c
--- parent/fs/reiserfs/bitmap.c Mon Apr 29 10:20:24 2002
+++ comp/fs/reiserfs/bitmap.c   Mon Apr 29 10:20:19 2002
@@ -122,7 +122,6 @@
   set_sb_free_blocks( rs, sb_free_blocks(rs) + 1 );
 
   journal_mark_dirty (th, s, sbh);
-  s-s_dirt = 1;
 }
 
 void reiserfs_free_block (struct reiserfs_transaction_handle *th, 
@@ -433,7 +432,6 @@
   /* update free block count in super block */
   PUT_SB_FREE_BLOCKS( s, SB_FREE_BLOCKS(s) - init_amount_needed );
   journal_mark_dirty (th, s, SB_BUFFER_WITH_SB (s));
-  s-s_dirt = 1;
 
   return CARRY_ON;
 }
diff -urN --exclude *.orig parent/fs/reiserfs/ibalance.c
comp/fs/reiserfs/ibalance.c
--- parent/fs/reiserfs/ibalance.c   Mon Apr 29 10:20:24 2002
+++ comp/fs/reiserfs/ibalance.c Mon Apr 29 10:20:19 2002
@@ -632,7 +632,6 @@
/* use check_internal if new root is an internal node */
check_internal (new_root);
/**/
-   tb-tb_sb-s_dirt = 1;
 
/* do what is needed for buffer thrown from tree */
reiserfs_invalidate_buffer(tb, tbSh);
@@ -950,7 +949,6 @@
 PUT_SB_ROOT_BLOCK( tb-tb_sb, tbSh-b_blocknr );
 PUT_SB_TREE_HEIGHT( tb-tb_sb, SB_TREE_HEIGHT(tb-tb_sb) + 1 );
do_balance_mark_sb_dirty (tb, tb-tb_sb-u.reiserfs_sb.s_sbh, 1);
-   tb-tb_sb-s_dirt = 1;
 }

 if ( tb-blknum[h] == 2 ) {
diff -urN --exclude *.orig parent/fs/reiserfs/journal.c
comp/fs/reiserfs/journal.c
--- parent/fs/reiserfs/journal.cMon Apr 29 10:20:24 2002
+++ comp/fs/reiserfs/journal.c  Mon Apr 29 10:20:21 2002
@@ -64,12 +64,15 @@
 */
 static int reiserfs_mounted_fs_count = 0 ;
 
+static struct list_head kreiserfsd_supers =
LIST_HEAD_INIT(kreiserfsd_supers);
+
 /* wake this up when you add something to the commit thread task queue */
 DECLARE_WAIT_QUEUE_HEAD(reiserfs_commit_thread_wait) ;
 
 /* wait on this if you need to be sure you task queue entries have been run
*/
 static DECLARE_WAIT_QUEUE_HEAD(reiserfs_commit_thread_done) ;
 DECLARE_TASK_QUEUE(reiserfs_commit_thread_tq) ;
+DECLARE_MUTEX(kreiserfsd_sem) ;
 
 #define JOURNAL_TRANS_HALF 1018   /* must be correct to keep the desc and
commit
 structs at 4k */
@@ -576,17 +579,12 @@
 /* lock the current transaction */
 inline static void lock_journal(struct super_block *p_s_sb) {
   PROC_INFO_INC( p_s_sb, journal.lock_journal );
-  while(atomic_read((SB_JOURNAL(p_s_sb)-j_wlock))  0) {
-PROC_INFO_INC( p_s_sb, journal.lock_journal_wait );
-sleep_on((SB_JOURNAL(p_s_sb)-j_wait)) ;
-  }
-  atomic_set((SB_JOURNAL(p_s_sb)-j_wlock), 1) ;
+  down(SB_JOURNAL(p_s_sb)-j_lock);
 }
 
 /* unlock the current transaction */
 inline static void unlock_journal(struct super_block *p_s_sb) {
-  atomic_dec(&(SB_JOURNAL(p_s_sb)->j_wlock)) ;
-  wake_up(&(SB_JOURNAL(p_s_sb)->j_wait)) ;
+  up(&SB_JOURNAL(p_s_sb)->j_lock);
 }
 
 /*
@@ -756,7 +754,6 @@
  atomic_set(&(jl->j_commit_flushing), 0) ;
  wake_up(&(jl->j_commit_wait)) ;
 
-  s->s_dirt = 1 ;
   return 0 ;
 }
 
@@ -1220,7 +1217,6 @@
 if (run++ == 0) {
 goto loop_start

Re: [reiserfs-list] fsync() Performance Issue

2002-04-29 Thread Russell Coker

On Fri, 26 Apr 2002 22:28, [EMAIL PROTECTED] wrote:

It's interesting to note your email address and what it implies...

   I'm wondering if anyone out there may have some suggestions on how
 to improve the performance of a system employing fsync(). I have to be able
 to guarantee that every write to my fileserver is on disk when the client
 has passed it to the server. Therefore, I have disabled write cache on the
 disk and issue an fsync() per file. I'm running 2.4.19-pre7, reiserfs
 3.6.25, without additional patches. I have seen some discussions out here
 about various other speed-up patches and am wondering if I need to add
 these to 2.4.19-pre7? And what they are and where can I obtain said
 patches? Also, I'm wondering if there is another solution to syncing the
 data that is faster than fsync(). Testing, thus far, has shown a large
 disparity between running with and without sync. Another idea is to explore
 another filesystem, but I'm not exactly excited by the other journaling
 filesystems out there at this time. All ideas will be greatly appreciated.

These issues have been discussed a few times, but not with any results as 
exciting as you might hope for.  One which was mentioned was using 
fdatasync() instead of fsync().
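A minimal sketch of the fdatasync() suggestion, assuming plain POSIX I/O; the helper name write_durable() is hypothetical and not part of any patch discussed here. fdatasync() flushes the file's data plus only the metadata needed to read it back (such as the file size), so it can skip inode updates like the modification time that fsync() would also write. On the 2.4 kernels in this thread, neither call by itself flushes the drive's write cache, which is why that cache was disabled in this setup.

```c
#define _POSIX_C_SOURCE 200809L  /* declares fdatasync() in <unistd.h> */
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: write all of buf to path, then wait for the data to
 * reach the disk with fdatasync() before reporting success.
 * Returns 0 on success, -1 on error with errno set. */
int write_durable(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;              /* interrupted before any data: retry */
            goto fail;
        }
        p += n;                        /* handle short writes */
        len -= (size_t)n;
    }

    /* Flush the data and the metadata needed to read it back (e.g. size);
     * fsync() would additionally flush things like the mtime. */
    if (fdatasync(fd) < 0)
        goto fail;

    return close(fd);

fail:
    {
        int saved = errno;             /* close() may overwrite errno */
        close(fd);
        errno = saved;
    }
    return -1;
}
```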

One thing that has occurred to me (which has not been previously discussed as 
far as I recall) is the possibility for using sync() instead of fsync() if 
you can accumulate a number of files (and therefore replace many fsync()'s 
with one sync() ).
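The batching idea can be sketched as follows, assuming the caller can hold back acknowledgements until the whole batch has been flushed; write_batch_then_sync() is a hypothetical helper. Note that sync() returns void, so unlike fsync() it cannot report write-back errors, and POSIX allows it to return before the writes have finished.

```c
#define _XOPEN_SOURCE 700  /* declares sync() in <unistd.h> on glibc */
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: write each buffer to its own file without any
 * per-file flush, then issue one sync() for the whole batch.
 * Returns count on success, -1 if any open(), write() or close() failed.
 * A short write is treated as a failure to keep the sketch small. */
int write_batch_then_sync(const char *const paths[], const char *const bufs[],
                          const size_t lens[], int count)
{
    for (int i = 0; i < count; i++) {
        int fd = open(paths[i], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        ssize_t n = write(fd, bufs[i], lens[i]);
        if (close(fd) < 0 || n != (ssize_t)lens[i])
            return -1;
    }

    /* One call queues every dirty buffer on the system (not just these
     * files) for write-out.  sync() returns void, so write-back errors
     * cannot be seen here, and POSIX lets it return before I/O completes. */
    sync();
    return count;
}
```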

-- 
If you send email to me or to a mailing list that I use which has 4 lines
of legalistic junk at the end then you are specifically authorizing me to do
whatever I wish with the message and all other messages from your domain, by
posting the message you agree that your long legalistic sig is void.



Re: [reiserfs-list] fsync() Performance Issue

2002-04-29 Thread Toby Dickenson

On Mon, 29 Apr 2002 18:20:18 +0200, Russell Coker
[EMAIL PROTECTED] wrote:

On Fri, 26 Apr 2002 22:28, [EMAIL PROTECTED] wrote:

It's interesting to note your email address and what it implies...

  I'm wondering if anyone out there may have some suggestions on how
 to improve the performance of a system employing fsync(). I have to be able
 to guarantee that every write to my fileserver is on disk when the client
 has passed it to the server. Therefore, I have disabled write cache on the
 disk and issue an fsync() per file. I'm running 2.4.19-pre7, reiserfs
 3.6.25, without additional patches. I have seen some discussions out here
 about various other speed-up patches and am wondering if I need to add
 these to 2.4.19-pre7? And what they are and where can I obtain said
 patches? Also, I'm wondering if there is another solution to syncing the
 data that is faster than fsync(). Testing, thus far, has shown a large
 disparity between running with and without sync. Another idea is to explore
 another filesystem, but I'm not exactly excited by the other journaling
 filesystems out there at this time. All ideas will be greatly appreciated.

These issues have been discussed a few times, but not with any results as 
exciting as you might hope for.  One which was mentioned was using 
fdatasync() instead of fsync().

One thing that has occurred to me (which has not been previously discussed as 
far as I recall) is the possibility for using sync() instead of fsync() if 
you can accumulate a number of files (and therefore replace many fsync()'s 
with one sync() ).

I can see

write to file A
write to file B
write to file C
sync

might be faster than

write to file A
fsync A
write to file B
fsync B
write to file C
fsync C

but is it possible for it to be faster than

write to file A
write to file B
write to file C
fsync A
fsync B
fsync C

?



Toby Dickenson
[EMAIL PROTECTED]



Re: [reiserfs-list] fsync() Performance Issue

2002-04-29 Thread Chris Mason

On Mon, 2002-04-29 at 12:20, Russell Coker wrote:
 On Fri, 26 Apr 2002 22:28, [EMAIL PROTECTED] wrote:
 
 It's interesting to note your email address and what it implies...
 
  I'm wondering if anyone out there may have some suggestions on how
  to improve the performance of a system employing fsync(). I have to be able
  to guarantee that every write to my fileserver is on disk when the client
  has passed it to the server. Therefore, I have disabled write cache on the
  disk and issue an fsync() per file. I'm running 2.4.19-pre7, reiserfs
  3.6.25, without additional patches. I have seen some discussions out here
  about various other speed-up patches and am wondering if I need to add
  these to 2.4.19-pre7? And what they are and where can I obtain said
  patches? Also, I'm wondering if there is another solution to syncing the
  data that is faster than fsync(). Testing, thus far, has shown a large
  disparity between running with and without sync. Another idea is to explore
  another filesystem, but I'm not exactly excited by the other journaling
  filesystems out there at this time. All ideas will be greatly appreciated.
 
 These issues have been discussed a few times, but not with any results as 
 exciting as you might hope for.  One which was mentioned was using 
 fdatasync() instead of fsync().

The speedup patches should help fsync some, since they make it much more
likely a commit will be done without the journal lock held.

If all the writes on the FS end up being done through fsync, the data
logging patches might help a lot.  These should be ready for broader
testing this week.

If you are using IDE drives, the write barrier patches are almost enough
to allow you to turn on write caching safely.  They make sure metadata
writes trigger proper drive cache flushes; I can try to rig up something
that will also trigger a cache flush on data syncs.

-chris





Re: [reiserfs-list] fsync() Performance Issue

2002-04-29 Thread Chris Mason

On Mon, 2002-04-29 at 12:32, Toby Dickenson wrote:

 One thing that has occurred to me (which has not been previously discussed as 
 far as I recall) is the possibility for using sync() instead of fsync() if 
 you can accumulate a number of files (and therefore replace many fsync()'s 
 with one sync() ).
 
 I can see
 
 write to file A
 write to file B
 write to file C
 sync
 
 might be faster than
 
 write to file A
 fsync A
 write to file B
 fsync B
 write to file C
 fsync C

Correct.

 
 but is it possible for it to be faster than
 
 write to file A
 write to file B
 write to file C
 fsync A
 fsync B
 fsync C

It depends on the rest of the system.  sync() goes through the big LRU
list for the whole box, while fsync() goes through the private list for
just that inode.  If you've got other devices or files with dirty data,
case C that you presented will always be the fastest.  For general use,
I like this one the best; it is what the journal code is optimized for.

If files A, B, and C are the only dirty things on the whole box, a
single sync() will be slightly better, mostly due to reduced cpu time.
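Case C (write A, B and C, then fsync each) might look like the sketch below; write_all_then_fsync() and the MAX_FILES limit are hypothetical. Every file is written before any flush starts, each fsync() only has to flush that one file's dirty data rather than every dirty buffer on the box, and each call still returns its own error status.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

#define MAX_FILES 64   /* hypothetical cap so the fd table can live on the stack */

/* Hypothetical helper for "write A, B, C; then fsync A, B, C".
 * Phase 1 writes every file without flushing; phase 2 calls fsync() on each
 * one.  Returns 0 if everything succeeded, -1 otherwise.  A short write is
 * treated as a failure to keep the sketch small. */
int write_all_then_fsync(const char *const paths[], const char *const bufs[],
                         const size_t lens[], int count)
{
    int fds[MAX_FILES];
    int opened = 0;
    int rc = 0;

    if (count < 0 || count > MAX_FILES)
        return -1;

    /* Phase 1: open and write each file, no flushing yet. */
    for (; opened < count; opened++) {
        fds[opened] = open(paths[opened], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fds[opened] < 0) {
            rc = -1;
            break;
        }
        if (write(fds[opened], bufs[opened], lens[opened]) != (ssize_t)lens[opened]) {
            rc = -1;
            opened++;          /* this fd is valid, so include it in cleanup */
            break;
        }
    }

    /* Phase 2: fsync() each file that was opened, then close it.
     * Unlike sync(), every fsync() reports its own error status. */
    for (int i = 0; i < opened; i++) {
        if (fsync(fds[i]) < 0)
            rc = -1;
        if (close(fds[i]) < 0)
            rc = -1;
    }
    return rc;
}
```

Keeping every descriptor open until phase 2 is a design choice of this sketch; reopening each file by name before fsync() would also work but costs extra lookups.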

-chris





Re: [reiserfs-list] fsync() Performance Issue

2002-04-29 Thread Matthias Andree

Toby Dickenson [EMAIL PROTECTED] writes:

 write to file A
 write to file B
 write to file C
 sync

Be careful with this approach. Apart from syncing other processes' dirty
data, sync() does not make the same guarantees as fsync() does.

Barring write cache effects, fsync() only returns after all blocks are
on disk. I'm not sure whether any Linux file systems are affected (and
if so, which ones), but for portable applications, be aware that sync()
may return prematurely (and is allowed to!).

-- 
Matthias Andree



RE: [reiserfs-list] fsync() Performance Issue

2002-04-29 Thread berthiaume_wayne

Agreed, it would be better to sync to disk after multiple files
rather than after each one; however, so that a power outage during the
process is not a concern (one of the reasons the disk cache is
disabled), the choice was to fsync() each write.

-Original Message-
From: Chris Mason [mailto:[EMAIL PROTECTED]]
Sent: Monday, April 29, 2002 12:46 PM
To: [EMAIL PROTECTED]
Cc: Russell Coker; [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: [reiserfs-list] fsync() Performance Issue


On Mon, 2002-04-29 at 12:32, Toby Dickenson wrote:

 One thing that has occurred to me (which has not been previously discussed as 
 far as I recall) is the possibility for using sync() instead of fsync() if 
 you can accumulate a number of files (and therefore replace many fsync()'s 
 with one sync() ).
 
 I can see
 
 write to file A
 write to file B
 write to file C
 sync
 
 might be faster than
 
 write to file A
 fsync A
 write to file B
 fsync B
 write to file C
 fsync C

Correct.

 
 but is it possible for it to be faster than
 
 write to file A
 write to file B
 write to file C
 fsync A
 fsync B
 fsync C

It depends on the rest of the system.  sync() goes through the big LRU
list for the whole box, while fsync() goes through the private list for
just that inode.  If you've got other devices or files with dirty data,
case C that you presented will always be the fastest.  For general use,
I like this one the best; it is what the journal code is optimized for.

If files A, B, and C are the only dirty things on the whole box, a
single sync() will be slightly better, mostly due to reduced cpu time.

-chris




Re: [reiserfs-list] fsync() Performance Issue

2002-04-29 Thread Valdis . Kletnieks

On Mon, 29 Apr 2002 19:56:59 +0200, Matthias Andree [EMAIL PROTECTED]  
said:

 Barring write cache effects, fsync() only returns after all blocks are
 on disk. I'm not sure whether any Linux file systems are affected (and
 if so, which ones), but for portable applications, be aware that sync()
 may return prematurely (and is allowed to!).

And in fact is the reason for the old recipe:
  # sync
  # sync
  # sync
  # reboot

On the older VAX 750-class machines, sync could return LONG before the blocks
were all flushed; the second two syncs were there so you were busy typing for
several seconds while the disks whirred.  Failure to understand the typing
speed issue has led at least one otherwise-clued author to recommend:
  # sync;sync;sync
  # reboot

(the distinction being obvious if you think about when the shell reads the
commands, and when it does the fork/exec for each case)

-- 
Valdis Kletnieks
Computer Systems Senior Engineer
Virginia Tech







Re: [reiserfs-list] fsync() Performance Issue

2002-04-29 Thread Hans Reiser

[EMAIL PROTECTED] wrote:

On Mon, 29 Apr 2002 19:56:59 +0200, Matthias Andree [EMAIL PROTECTED] 
 said:

  

Barring write cache effects, fsync() only returns after all blocks are
on disk. I'm not sure whether any Linux file systems are affected (and
if so, which ones), but for portable applications, be aware that sync()
may return prematurely (and is allowed to!).



And in fact is the reason for the old recipe:
  # sync
  # sync
  # sync
  # reboot

On the older VAX 750-class machines, sync could return LONG before the blocks
were all flushed; the second two syncs were there so you were busy typing for
several seconds while the disks whirred.  Failure to understand the typing
speed issue has led at least one otherwise-clued author to recommend:
  # sync;sync;sync
  # reboot

(the distinction being obvious if you think about when the shell reads the
commands, and when it does the fork/exec for each case)

  

Finally I understand this.  Doing more than one sync always seemed 
mysterious to me. ;-)

Thanks Matthias.

Hans