Re: batching support for transactions

2007-10-03 Thread Ric Wheeler

Andreas Dilger wrote:

On Oct 02, 2007  08:57 -0400, Ric Wheeler wrote:
One thing that jumps out is that the way we currently batch synchronous 
workloads into transactions does really horrible things to performance 
for storage devices which have very low latency.


For example, on a mid-range CLARiiON box, we can use a single thread to 
write around 750 (10240 byte) files/sec to a single directory in ext3. 
That gives us an average time of around 1.3ms per file.


With 2 threads writing to the same directory, we instantly drop down to 
234 files/sec.


Is this with HZ=250?


Yes - I assume that with HZ=1000 the batching would start to work again 
since the penalty for batching would only be 1ms which would add a 0.3ms 
overhead while waiting for some other thread to join.





The culprit seems to be the assumptions in journal_stop() which throw in 
a call to schedule_timeout_uninterruptible(1):


        pid = current->pid;
        if (handle->h_sync && journal->j_last_sync_writer != pid) {
                journal->j_last_sync_writer = pid;
                do {
                        old_handle_count = transaction->t_handle_count;
                        schedule_timeout_uninterruptible(1);
                } while (old_handle_count != transaction->t_handle_count);
        }


It would seem one of the problems is that we shouldn't really be
scheduling for a fixed 1-jiffy timeout, but rather only until the
other threads have a chance to run and join the existing transaction.


This is really very similar to the domain of the IO schedulers - when do 
you hold off an IO and/or try to combine it.


It is hard to predict the future need of threads that will be wanting to 
do IO, but you can dynamically measure the average time it takes a 
transaction to commit.


Would it work to batch only when the timeout is less than, say, 80% of 
the average commit time?  Using the HZ=1000 example, that is a 1ms wait 
against an average commit time of 1.2 or 1.3ms.
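
Roughly, I am imagining a guard along these lines in journal_stop() 
(untested sketch; j_avg_commit_time is a hypothetical new field, in 
microseconds, updated as commits complete):

        pid = current->pid;
        if (handle->h_sync && journal->j_last_sync_writer != pid &&
            jiffies_to_usecs(1) < (journal->j_avg_commit_time * 8) / 10) {
                journal->j_last_sync_writer = pid;
                do {
                        old_handle_count = transaction->t_handle_count;
                        schedule_timeout_uninterruptible(1);
                } while (old_handle_count != transaction->t_handle_count);
        }

That way the 1-jiffy sleep is only paid when it stays under 80% of what 
the commit itself costs.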




What seems to be needed here is either a static per-file-system/storage-device 
tunable to allow us to change this timeout (maybe with 0 
defaulting back to the old reiserfs trick of simply doing a yield()?)


Tunables are to be avoided if possible, since they will usually not be
set except by the 0.1% of people who actually understand them.  Using
yield() seems like the right thing, but Andrew Morton added this code and
my guess would be that yield() doesn't block the first thread long enough
for the second one to get into the transaction (e.g. on a 2-CPU system
with 2 threads, yield() will likely do nothing).


I agree that tunables are a bad thing.  It might be nice to dream about 
having mkfs do some test timings (issue and time synchronous IOs to 
measure the average IOs/sec) and set this in the superblock.


Andy tried playing with yield() and it did not do well. Note that this 
server is a dual-CPU box, so your intuition is most likely correct.


The balance is that the batching does work well for normal slow disks, 
especially when using write barriers (giving us an average commit 
time closer to 20ms).


or a more dynamic, per-device way to keep track of the average time it 
takes to commit a transaction to disk. Based on that rate, we could 
dynamically adjust our logic to account for lower latency devices.


It makes sense to track not only the time to commit a single synchronous
transaction, but also the time between sync transactions to decide if
the initial transaction should be held to allow later ones.


Yes, that is what I was trying to suggest with the rate. Even if the 
device is relatively slow, if the IOs are being synced at a low rate, we 
are effectively adding a potentially nasty latency to each IO.


That would give us two measurements to track per IO device - average 
commit time and the average IOs/sec rate. That seems very doable.
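
For the tracking itself, a simple weighted average updated at commit 
completion would probably do (sketch only; the two fields are made-up 
names, both in microseconds, and the sync interval stands in for the 
IOs/sec rate):

        /* Sketch: called each time a transaction commit finishes. */
        static void update_commit_stats(journal_t *journal, u64 commit_usecs,
                                        u64 usecs_since_last_sync)
        {
                /* 7/8 exponentially-weighted moving averages */
                journal->j_avg_commit_time =
                        (journal->j_avg_commit_time * 7 + commit_usecs) / 8;
                journal->j_avg_sync_interval =
                        (journal->j_avg_sync_interval * 7 +
                         usecs_since_last_sync) / 8;
        }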



Alternately, it might be possible to check if a new thread is trying to
start a sync handle when the previous one was also synchronous and had
only a single handle in it, then automatically enable the delay in that case.


I am not sure that this avoids the problem with the current default of 
HZ=250, where each wait is long enough to do 3 fully independent 
transactions ;-)


ric



Re: batching support for transactions

2007-10-03 Thread Andreas Dilger
On Oct 03, 2007  06:42 -0400, Ric Wheeler wrote:
 With 2 threads writing to the same directory, we instantly drop down to 
 234 files/sec.
 
 Is this with HZ=250?
 
 Yes - I assume that with HZ=1000 the batching would start to work again 
 since the penalty for batching would only be 1ms which would add a 0.3ms 
 overhead while waiting for some other thread to join.

This is probably the easiest solution, but at the same time using HZ=1000
adds overhead to the server because of extra interrupts, etc.

 It would seem one of the problems is that we shouldn't really be
 scheduling for a fixed 1-jiffy timeout, but rather only until the
 other threads have a chance to run and join the existing transaction.
 
 This is really very similar to the domain of the IO schedulers - when do 
 you hold off an IO and/or try to combine it.

I was thinking the same.

 my guess would be that yield() doesn't block the first thread long enough
 for the second one to get into the transaction (e.g. on a 2-CPU system
 with 2 threads, yield() will likely do nothing).
 
 Andy tried playing with yield() and it did not do well. Note that this 
 server is a dual-CPU box, so your intuition is most likely correct.

How many threads did you try?

 It makes sense to track not only the time to commit a single synchronous
 transaction, but also the time between sync transactions to decide if
 the initial transaction should be held to allow later ones.
 
 Yes, that is what I was trying to suggest with the rate. Even if the 
 device is relatively slow, if the IOs are being synced at a low rate, we 
 are effectively adding a potentially nasty latency to each IO.
 
 That would give us two measurements to track per IO device - average 
 commit time and the average IOs/sec rate. That seems very doable.

Agreed.

 Alternately, it might be possible to check if a new thread is trying to
 start a sync handle when the previous one was also synchronous and had
 only a single handle in it, then automatically enable the delay in that 
 case.
 
 I am not sure that this avoids the problem with the current default of 
 HZ=250, where each wait is long enough to do 3 fully independent 
 transactions ;-)

I was trying to think if there was some way to do a non-busy wait that is
shorter than 1 jiffy.
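
Something hrtimer-based might work, along these lines - sleeping for,
say, half the measured average commit time instead of a full jiffy
(sketch only: this assumes a sleep primitive like schedule_hrtimeout(),
and j_avg_commit_time is a made-up field in microseconds):

        ktime_t expires = ktime_add_ns(ktime_get(),
                        journal->j_avg_commit_time * NSEC_PER_USEC / 2);

        set_current_state(TASK_UNINTERRUPTIBLE);
        schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);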

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.



Re: batching support for transactions

2007-10-03 Thread Ric Wheeler

Andreas Dilger wrote:

On Oct 03, 2007  06:42 -0400, Ric Wheeler wrote:
With 2 threads writing to the same directory, we instantly drop down to 
234 files/sec.

Is this with HZ=250?
Yes - I assume that with HZ=1000 the batching would start to work again 
since the penalty for batching would only be 1ms which would add a 0.3ms 
overhead while waiting for some other thread to join.


This is probably the easiest solution, but at the same time using HZ=1000
adds overhead to the server because of extra interrupts, etc.


We will do some testing with this in the next day or so.


It would seem one of the problems is that we shouldn't really be
scheduling for a fixed 1-jiffy timeout, but rather only until the
other threads have a chance to run and join the existing transaction.
This is really very similar to the domain of the IO schedulers - when do 
you hold off an IO and/or try to combine it.


I was thinking the same.

my guess would be that yield() doesn't block the first thread long enough
for the second one to get into the transaction (e.g. on a 2-CPU system
with 2 threads, yield() will likely do nothing).
Andy tried playing with yield() and it did not do well. Note that this 
server is a dual-CPU box, so your intuition is most likely correct.


How many threads did you try?


Andy tested 1, 2, 4, 8, 20 and 40 threads.  Once we review the test
and his patch, we can post the summary data.


It makes sense to track not only the time to commit a single synchronous
transaction, but also the time between sync transactions to decide if
the initial transaction should be held to allow later ones.
Yes, that is what I was trying to suggest with the rate. Even if the 
device is relatively slow, if the IOs are being synced at a low rate, we 
are effectively adding a potentially nasty latency to each IO.


That would give us two measurements to track per IO device - average 
commit time and the average IOs/sec rate. That seems very doable.


Agreed.


This also seems like code that would be good to share among all of the
file systems for their transaction batching.


Alternately, it might be possible to check if a new thread is trying to
start a sync handle when the previous one was also synchronous and had
only a single handle in it, then automatically enable the delay in that 
case.
I am not sure that this avoids the problem with the current default of 
HZ=250, where each wait is long enough to do 3 fully independent 
transactions ;-)


I was trying to think if there was some way to do a non-busy wait that is
shorter than 1 jiffy.


One other technique would be to use async IO, which could push the 
batching of the fsyncs up to application space.  For example, send down 
a sequence of async fsync requests for a series of files and then poll 
for completion once you have launched them.
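
With libaio, for example, it might look something like this (sketch only 
- note that kernel support for async fsync has historically been spotty, 
so io_submit() may simply fail with EINVAL on many setups):

        #include <libaio.h>

        /* Launch fsyncs for a set of fds, then poll once for the batch. */
        static int batch_fsync(int *fds, int nr)
        {
                io_context_t ctx = 0;
                struct iocb iocbs[nr], *iocbps[nr];
                struct io_event events[nr];
                int i, ret;

                if (io_setup(nr, &ctx) < 0)
                        return -1;
                for (i = 0; i < nr; i++) {
                        io_prep_fsync(&iocbs[i], fds[i]);
                        iocbps[i] = &iocbs[i];
                }
                ret = io_submit(ctx, nr, iocbps);
                if (ret == nr)
                        ret = io_getevents(ctx, nr, nr, events, NULL);
                io_destroy(ctx);
                return ret == nr ? 0 : -1;
        }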


ric




batching support for transactions

2007-10-02 Thread Ric Wheeler


After several years of helping tune file systems for normal (ATA/S-ATA) 
drives, we have been doing some performance work on ext3 and reiserfs on 
disk arrays.


One thing that jumps out is that the way we currently batch synchronous 
workloads into transactions does really horrible things to performance 
for storage devices which have very low latency.


For example, on a mid-range CLARiiON box, we can use a single thread to 
write around 750 (10240 byte) files/sec to a single directory in ext3. 
That gives us an average time of around 1.3ms per file.


With 2 threads writing to the same directory, we instantly drop down to 
234 files/sec.


The culprit seems to be the assumptions in journal_stop() which throw in 
a call to schedule_timeout_uninterruptible(1):


        /*
         * Implement synchronous transaction batching.  If the handle
         * was synchronous, don't force a commit immediately.  Let's
         * yield and let another thread piggyback onto this transaction.
         * Keep doing that while new threads continue to arrive.
         * It doesn't cost much - we're about to run a commit and sleep
         * on IO anyway.  Speeds up many-threaded, many-dir operations
         * by 30x or more...
         *
         * But don't do this if this process was the most recent one to
         * perform a synchronous write.  We do this to detect the case
         * where a single process is doing a stream of sync writes.  No
         * point in waiting for joiners in that case.
         */
        pid = current->pid;
        if (handle->h_sync && journal->j_last_sync_writer != pid) {
                journal->j_last_sync_writer = pid;
                do {
                        old_handle_count = transaction->t_handle_count;
                        schedule_timeout_uninterruptible(1);
                } while (old_handle_count != transaction->t_handle_count);
        }


reiserfs and ext4 have similar if not exactly the same logic.

What seems to be needed here is either a static per-file-system/storage-device 
tunable to allow us to change this timeout (maybe with 0 
defaulting back to the old reiserfs trick of simply doing a yield()?) or 
a more dynamic, per-device way to keep track of the average time it 
takes to commit a transaction to disk. Based on that rate, we could 
dynamically adjust our logic to account for lower latency devices.
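
To make the static flavor concrete, it could look something like the 
following (untested sketch; j_sync_batch_delay would be a new per-journal 
field set from a mount option, in jiffies, with 0 meaning fall back to 
yield()):

        pid = current->pid;
        if (handle->h_sync && journal->j_last_sync_writer != pid) {
                journal->j_last_sync_writer = pid;
                do {
                        old_handle_count = transaction->t_handle_count;
                        /* 0 means the old reiserfs yield() behaviour */
                        if (journal->j_sync_batch_delay)
                                schedule_timeout_uninterruptible(
                                        journal->j_sync_batch_delay);
                        else
                                yield();
                } while (old_handle_count != transaction->t_handle_count);
        }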


A couple of last thoughts. One, if for some reason you don't have a low 
latency storage array handy and want to test this for yourselves, you 
can test the worst case by using a ram disk.


The test we used was fs_mark with 10240-byte files, writing to one 
shared directory while varying the number of threads from 1 up to 40. In 
the ext3 case, it takes 8 concurrent threads to catch back up to the 
single-thread rate.
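
For anyone who wants to reproduce this, an invocation along these lines 
should do it (illustrative only, not the exact command line we used):

        # -d target dir, -s file size in bytes,
        # -n files per thread, -t number of threads
        fs_mark -d /mnt/test -s 10240 -n 1024 -t 2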


We are continuing to play with the code and try out some ideas, but I 
wanted to bounce this off the broader list to see if this makes sense...


ric

