The starvation occurs when some process sends large requests to the same SCSI
controller as our journal replay, which sends one-block requests, and the
one-block requests starve. RAID resync is one known instance where this
happens. Edward's patch cures that instance.
Hans
Edward Shushkin wrote:
Philippe Gramoulle wrote:
Hi,
We've set up a test system: a Linux box with a PERC3/QC AMI RAID card
(MegaRAID driver) and 3 disk shelves of 12x36GB (RAID5, 2 spare
disks on each shelf, 1 terabyte total).
First of all, there is an odd message at boot:
megaraid: v1.15d (Release Date: Wed May 30 17:30:41 EDT 2001)
megaraid: found 0x101e:0x1960:idx 0:bus 2:slot 0:func 0
scsi2 : Found a MegaRAID controller at 0xf8902000, IRQ: 20
megaraid: [1.57:3.13] detected 2 logical drives
scsi2 : AMI MegaRAID 1.57 254 commands 16 targs 4 chans 40 luns
Attached scsi disk sdb at scsi2, channel 4, id 0, lun 0
Attached scsi disk sdc at scsi2, channel 4, id 0, lun 1
SCSI device sdb: 318468096 512-byte hdwr sectors (163056 MB)
sdb: sdb1
SCSI device sdc: 2059595776 512-byte hdwr sectors (-44998 MB)
                                                   ^^^
Why is sdc reporting -44998 MB?
Nevertheless, fdisk'ing /dev/sdc runs fine.
mkreiserfs runs fine as well.
Mounting the partition for the first time took 32 minutes.
Unmounting and remounting the partition a second time took 1 minute 32 s.
Unmounting and remounting it a third time took 32 minutes again.
There were absolutely no operations done in between.
What do you think takes so much time during the mount?
It looks like you hit the worst case, where the system has to search for
a valid transaction (this happens when your fs has just been created, or
was not cleanly unmounted) and reads all journal blocks while the RAID5
resync is generating a large number of IO requests. If so, there cannot
be more than one journal request in the queue at a time because of
wait_on_buffer, and this request cannot be merged with the other journal
requests.
You probably want the attached patch against 2.4.7, which uses read-ahead
of 32 journal blocks instead of bread(). We have tested it a bit, and
mount time seems to be reduced.
Please report your results.
Thanks,
Edward.
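
The bookkeeping behind the read-ahead idea can be sketched in user space.
This is only an illustrative model, not the kernel code: struct ra_window,
ra_slot and ra_fills_for_scan are hypothetical names, no I/O is performed,
and the scan is assumed to visit consecutive blocks (the real replay loop
skips ahead by the transaction length). NBUF mirrors the patch's window of
32 blocks. The model only counts how many batched reads, and therefore how
many waits behind other traffic, a scan would need.

```c
#include <assert.h>
#include <stddef.h>

#define NBUF 32 /* window size, same value as NBUF in the patch */

/* Hypothetical model of a read-ahead window over block numbers. */
struct ra_window {
    unsigned long first; /* first block number held in the window */
    int valid;           /* nonzero once the window has been filled */
    unsigned long fills; /* number of batched reads issued so far */
};

/* Return the slot index for block blk, refilling the window if blk
 * falls outside it. A real implementation would submit NBUF reads
 * at the point where the window is refilled. */
static size_t ra_slot(struct ra_window *w, unsigned long blk)
{
    if (!w->valid || blk < w->first || blk >= w->first + NBUF) {
        w->first = blk;
        w->valid = 1;
        w->fills++;
    }
    return (size_t)(blk - w->first);
}

/* Count the batched reads needed to scan count consecutive blocks. */
static unsigned long ra_fills_for_scan(unsigned long start,
                                       unsigned long count)
{
    struct ra_window w = {0, 0, 0};
    unsigned long b;

    for (b = start; b < start + count; b++) {
        size_t slot = ra_slot(&w, b);
        assert(slot < NBUF); /* the index must stay inside the window */
    }
    return w.fills;
}
```

For a scan of 8192 consecutive blocks (a count chosen here purely for
illustration), ra_fills_for_scan(0, 8192) returns 256, that is, 256 batched
waits instead of 8192 single-block waits.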
Aren't we hitting a 32-bit issue here? Replacing the 36GB disks with 73GB
disks would give me an "unable to open /dev/sdc" error when trying to run
fdisk.
Has anyone already created volumes above 1 terabyte?
We're currently trying the same tests with ext2 but mke2fs takes a
*long* time compared to mkreiserfs :o). I'll give you the results soon.
Thanks,
Philippe.
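
The negative size in the boot log is consistent with a signed 32-bit
wraparound, which can be checked with a few lines of C. The formula below
is an assumption, not something taken from this thread: it supposes the
driver converts the sector count to 256-byte units in a 32-bit int and then
computes (sz/2 - sz/1250 + 974)/1950. The function names reported_mb and
actual_mb are hypothetical. Under that assumption the arithmetic reproduces
both figures printed for sdb and sdc.

```c
#include <stdint.h>

/*
 * Size in MB as it would come out of 32-bit signed arithmetic.
 * ASSUMPTION: the sector count is first scaled to 256-byte units in a
 * 32-bit int, then (sz/2 - sz/1250 + 974)/1950 is applied.
 */
static int32_t reported_mb(uint32_t sectors, uint32_t sector_size)
{
    /* The unsigned multiply wraps modulo 2^32; the cast reinterprets
     * the result as signed (two's complement on common platforms). */
    int32_t sz = (int32_t)(sectors * (sector_size / 256u));
    return (sz / 2 - sz / 1250 + 974) / 1950;
}

/* The same computation carried out in 64 bits, so nothing wraps. */
static int64_t actual_mb(uint32_t sectors, uint32_t sector_size)
{
    int64_t sz = (int64_t)sectors * (sector_size / 256u);
    return (sz / 2 - sz / 1250 + 974) / 1950;
}
```

With the values from the log, reported_mb(318468096, 512) gives 163056 and
reported_mb(2059595776, 512) gives -44998, while actual_mb(2059595776, 512)
gives 1054513, which is roughly the 1 terabyte expected for sdc.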
--- linux-2.4.7/fs/reiserfs/journal.c Mon Aug 6 15:29:31 2001
+++ linux-2.4.7-new/fs/reiserfs/journal.c Tue Aug 14 22:57:06 2001
@@ -81,6 +81,7 @@
DECLARE_TASK_QUEUE(reiserfs_commit_thread_tq) ;
#define JOURNAL_TRANS_HALF 1018 /* must be correct to keep the desc and commit
structs at 4k */
+#define NBUF 32 /* read ahead */
/* cnode stat bits. Move these into reiserfs_fs.h */
@@ -1597,7 +1598,9 @@
int replay_count = 0 ;
int continue_replay = 1 ;
int ret ;
-
+ int need_read_ahead = 1;
+ int first_read_ahead = 0;
+ struct buffer_head * log_blocks[NBUF];
cur_dblock = reiserfs_get_journal_block(p_s_sb) ;
printk("reiserfs: checking transaction log (device %s) ...\n",
kdevname(p_s_sb->s_dev)) ;
@@ -1653,7 +1656,29 @@
** all the valid transactions, and pick out the oldest.
*/
while(continue_replay && cur_dblock < (reiserfs_get_journal_block(p_s_sb) +
JOURNAL_BLOCK_COUNT)) {
-d_bh = bread(p_s_sb->s_dev, cur_dblock, p_s_sb->s_blocksize) ;
+if (need_read_ahead) {
+ /* read ahead NBUF buffers */
+ int i;
+ first_read_ahead = cur_dblock;
+ for (i = 0; i < NBUF; i++) {
+ log_blocks [i] = getblk (p_s_sb->s_dev, first_read_ahead + i,
+p_s_sb->s_blocksize);
+ if (!log_blocks [i]) {
+ brelse_array (log_blocks, i);
+ return -1;
+ }
+ }
+ ll_rw_block (READ, NBUF, log_blocks);
+ for (i = 0; i < NBUF; i++) {
+ wait_on_buffer (log_blocks [i]);
+ if (!buffer_uptodate (log_blocks [i])) {
+ brelse_array (log_blocks, NBUF);
+ return -1;
+ }
+ }
+ need_read_ahead = 0;
+}
+d_bh = log_blocks[cur_dblock - first_read_ahead];
ret = journal_transaction_is_valid(p_s_sb, d_bh, oldest_invalid_trans_id,
newest_mount_id) ;
if (ret == 1) {
desc = (struct reiserfs_journal_desc *)d_bh->b_data ;
@@ -1680,12 +1705,17 @@
newest_mount_id to %d\n", le32_to_cpu(desc->j_mount_id));
}
cur_dblock += le32_to_cpu(desc->j_len) + 2 ;
-}
-else {
+} else
+