Re: mdadm 2.5.2 - Static built , Interesting warnings when
On Tuesday June 27, [EMAIL PROTECTED] wrote:
> Hello All,
>
> What change in Glibc makes this necessary? Is there a method available
> to include the getpwnam & getgrnam structures so that a full static
> build will work?
>
> Tia, JimL
>
> gcc -Wall -Werror -Wstrict-prototypes -ggdb -DSendmail=\""/usr/sbin/sendmail
> -t"\" -DCONFFILE=\"/etc/mdadm.conf\" -DCONFFILE2=\"/etc/mdadm/mdadm.conf\"
> -DHAVE_STDINT_H -o sha1.o -c sha1.c
> gcc -static -o mdadm mdadm.o config.o mdstat.o ReadMe.o util.o Manage.o
> Assemble.o Build.o Create.o Detail.o Examine.o Grow.o Monitor.o dlink.o
> Kill.o Query.o mdopen.o super0.o super1.o bitmap.o restripe.o sysfs.o sha1.o
> config.o(.text+0x8c4): In function `createline':
> /home/archive/mdadm-2.5.2/config.c:341: warning: Using 'getgrnam' in
> statically linked applications requires at runtime the shared libraries from
> the glibc version used for linking
> config.o(.text+0x80b):/home/archive/mdadm-2.5.2/config.c:326: warning: Using
> 'getpwnam' in statically linked applications requires at runtime the shared
> libraries from the glibc version used for linking
> nroff -man mdadm.8 > mdadm.man

Are you running "make LDFLAGS=-static mdadm" or something like that?
No, that won't work any more.  Use "make mdadm.static"; that will get
you a good static binary by including 'pwgr.o'.

NeilBrown
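[So, roughly, the two invocations look like this -- a sketch based on
Neil's note above; check the Makefile of your mdadm release for the
exact target name:

    # No longer produces a self-contained binary: getpwnam/getgrnam
    # still want glibc's NSS shared objects at run time, hence the
    # linker warnings above.
    make LDFLAGS=-static mdadm

    # Recommended instead: builds the static binary and pulls in
    # pwgr.o so the passwd/group lookups avoid that dependency.
    make mdadm.static
]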
Re: Cutting power without breaking RAID
On Wednesday June 28, [EMAIL PROTECTED] wrote:
> Hello,
>
> I'm facing this problem:
>
> when my Linux box detects a POWER FAIL event from the UPS, it
> starts a normal shutdown. Just before the normal kernel poweroff,
> it sends to the UPS a signal on the serial line which says
> "cut off the power to the server and switch off the UPS".
>
> This is required to reboot the server as soon as the power is
> restored.
>
> The problem is that the root partition is on top of a RAID-1
> filesystem which is still mounted when the program that kills the
> power is run, so the system goes down with a non-clean RAID
> volume.
>
> What can be the proper action to do before killing the power to
> ensure that RAID will remain clean? It seems that remounting
> the partition read-only is not sufficient.

Are you running a 2.4 kernel or a 2.6 kernel?

With 2.4, you cannot do what you want to do.

With 2.6,

    killall -9 md0_raid1

should do the trick (assuming root is on /dev/md0; if it is elsewhere,
choose a different process name).

After you "kill -9" the raid thread, the array will be marked clean as
soon as all outstanding writes complete, and marked dirty again before
another write is allowed.

NeilBrown
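[Combined with the remount-read-only step from the original mail, the
tail of a 2.6-era shutdown script could look roughly like this -- a
sketch assuming root on /dev/md0; "ups-kill-power" is a stand-in for
whatever program actually signals the UPS over the serial line:

    #!/bin/sh
    # flush everything and make the root filesystem read-only
    sync
    mount -n -o remount,ro /

    # stop the raid1 thread for /dev/md0 so the array is marked clean
    # once the outstanding writes complete (2.6 kernels only)
    killall -9 md0_raid1
    sync

    # only now tell the UPS to cut the power
    ups-kill-power
]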
Re: Drive issues in RAID vs. not-RAID ..
On Wednesday June 28, [EMAIL PROTECTED] wrote:
>
> I've seen a few comments to the effect that some disks have problems
> when used in a RAID setup and I'm a bit perplexed as to why this
> might be..
>
> What's the difference between a drive in a RAID set (either s/w or
> h/w) and a drive on its own, assuming the load, etc. is roughly the
> same in each setup?
>
> Is it just "bad feeling" or are there any scientific reasons for it?

I don't think that 'disks' have problems being in a raid, but I believe
some controllers do (though I don't know whether it is the controller
or the driver that is at fault).  RAID makes concurrent requests much
more likely, and so is likely to push hard on any locking issues.

NeilBrown
Re: Cutting power without breaking RAID
Your UPS won't accept a timer value to wait before actually cutting
power?  That would probably be ideal: issue the power-off command with
something like a 30-second timeout, which would give the system time
to power off cleanly first.

-Tim

Niccolo Rigacci wrote:
> Hello,
>
> I'm facing this problem:
>
> when my Linux box detects a POWER FAIL event from the UPS, it
> starts a normal shutdown. Just before the normal kernel poweroff,
> it sends to the UPS a signal on the serial line which says
> "cut off the power to the server and switch off the UPS".
>
> This is required to reboot the server as soon as the power is
> restored.
>
> The problem is that the root partition is on top of a RAID-1
> filesystem which is still mounted when the program that kills the
> power is run, so the system goes down with a non-clean RAID
> volume.
>
> What can be the proper action to do before killing the power to
> ensure that RAID will remain clean? It seems that remounting
> the partition read-only is not sufficient.
Cutting power without breaking RAID
Hello,

I'm facing this problem:

when my Linux box detects a POWER FAIL event from the UPS, it
starts a normal shutdown. Just before the normal kernel poweroff,
it sends to the UPS a signal on the serial line which says
"cut off the power to the server and switch off the UPS".

This is required to reboot the server as soon as the power is
restored.

The problem is that the root partition is on top of a RAID-1
filesystem which is still mounted when the program that kills the
power is run, so the system goes down with a non-clean RAID
volume.

What can be the proper action to do before killing the power to
ensure that RAID will remain clean? It seems that remounting
the partition read-only is not sufficient.

--
Niccolo Rigacci
Firenze - Italy

Iraq, peace mission: 38475 dead - www.iraqbodycount.net
Re: I need a PCI V2.1 4 port SATA card
On Wed, 28 Jun 2006, Christian Pernegger wrote:

>> My current 15 drive RAID-6 server is built around a KT600 board with
>> an AMD Sempron processor and 4 SATA150TX4 cards. It does the job but
>> it's not the fastest thing around (takes about 10 hours to do a check
>> of the array or about 15 to do a rebuild).
>
> What kind of enclosure do you have this in?
>
> I also subscribe to the "almost commodity hardware" philosophy,
> however I've not been able to find a case that comfortably takes even
> 8 drives. (The Stacker is an absolute nightmare ...) Even most
> rackable cases stop at 6 3.5" drive bays -- either that or they are
> dedicated storage racks with integrated hw RAID and fiber SCSI
> interconnect --> definitely not commodity.
>
> Thanks,
>
> C.

For the case, there are a number of cases (Lian Li, for example) that
fit 20 drives with ease; check here:

http://www.newegg.com/Product/Product.asp?item=N82E1682062
[PATCH 001 of 006] raid5: Move write operations to a work queue
This patch moves write (reconstruct and read-modify) operations to a work queue. Note the next patch in this series fixes some incorrect assumptions around having multiple operations in flight (i.e. ignore this version of 'queue_raid_work'). Signed-off-by: Dan Williams <[EMAIL PROTECTED]> drivers/md/raid5.c | 314 + include/linux/raid/raid5.h | 67 + 2 files changed, 357 insertions(+), 24 deletions(-) === Index: linux-2.6-raid/drivers/md/raid5.c === --- linux-2.6-raid.orig/drivers/md/raid5.c 2006-06-28 08:44:11.0 -0700 +++ linux-2.6-raid/drivers/md/raid5.c 2006-06-28 09:52:07.0 -0700 @@ -305,6 +305,7 @@ memset(sh, 0, sizeof(*sh) + (conf->raid_disks-1)*sizeof(struct r5dev)); sh->raid_conf = conf; spin_lock_init(&sh->lock); + INIT_WORK(&sh->ops.work, conf->do_block_ops, sh); if (grow_buffers(sh, conf->raid_disks)) { shrink_buffers(sh, conf->raid_disks); @@ -1224,6 +1225,80 @@ } } +static int handle_write_operations5(struct stripe_head *sh, int rcw, int locked) +{ + int i, pd_idx = sh->pd_idx, disks = sh->disks; + int complete=0; + + if (test_bit(STRIPE_OP_RCW, &sh->state) && + test_bit(STRIPE_OP_RCW_Done, &sh->ops.state)) { + clear_bit(STRIPE_OP_RCW, &sh->state); + clear_bit(STRIPE_OP_RCW_Done, &sh->ops.state); + complete++; + } + + if (test_bit(STRIPE_OP_RMW, &sh->state) && + test_bit(STRIPE_OP_RMW_Done, &sh->ops.state)) { + clear_bit(STRIPE_OP_RMW, &sh->state); + clear_bit(STRIPE_OP_RMW_Done, &sh->ops.state); + BUG_ON(++complete == 2); + } + + + /* If no operation is currently in process then use the rcw flag to +* select an operation +*/ + if (locked == 0) { + if (rcw == 0) { + /* enter stage 1 of reconstruct write operation */ + set_bit(STRIPE_OP_RCW, &sh->state); + set_bit(STRIPE_OP_RCW_Drain, &sh->ops.state); + for (i=disks ; i-- ;) { + struct r5dev *dev = &sh->dev[i]; + + if (i!=pd_idx && dev->towrite) { + set_bit(R5_LOCKED, &dev->flags); + locked++; + } + } + } else { + /* enter stage 1 of read modify write operation */ + BUG_ON(!test_bit(R5_UPTODATE, &sh->dev[pd_idx].flags)); + set_bit(STRIPE_OP_RMW, &sh->state); + set_bit(STRIPE_OP_RMW_ParityPre, &sh->ops.state); + for (i=disks ; i-- ;) { + struct r5dev *dev = &sh->dev[i]; + if (i==pd_idx) + continue; + + if (dev->towrite && + test_bit(R5_UPTODATE, &dev->flags)) { + set_bit(R5_LOCKED, &dev->flags); + locked++; + } + } + } + } else if (locked && complete == 0) /* the queue has an operation in flight */ + locked = -EBUSY; + else if (complete) + locked = 0; + + /* keep the parity disk locked while asynchronous operations +* are in flight +*/ + if (locked > 0) { + set_bit(R5_LOCKED, &sh->dev[pd_idx].flags); + clear_bit(R5_UPTODATE, &sh->dev[pd_idx].flags); + sh->ops.queue_count++; + } else if (locked == 0) + set_bit(R5_UPTODATE, &sh->dev[pd_idx].flags); + + PRINTK("%s: stripe %llu locked: %d complete: %d op_state: %lx\n", + __FUNCTION__, (unsigned long long)sh->sector, + locked, complete, sh->ops.state); + + return locked; +} /* @@ -1320,6 +1395,174 @@ return pd_idx; } +static inline void drain_bio(struct bio *wbi, sector_t sector, struct page *page) +{ + while (wbi && wbi->bi_sector < sector + STRIPE_SECTORS) { + copy_data(1, wbi, page, sector); + wbi = r5_next_bio(wbi, sector); + } +} + +/* must be called under the stripe lock */ +static void queue_raid_work(struct stripe_head *sh) +{ + if (--sh->ops.queue_count == 0) { + atomic_inc(&sh->count); + queue_work(sh->raid_conf->block_ops_queue, &sh->ops.work); + } else if (
[PATCH 000 of 006] raid5: Offload RAID operations to a workqueue
This patch set is a step towards enabling hardware offload in the
md-raid5 driver.  These patches are considered experimental and are
not yet suitable for production environments.

As mentioned, this patch set is the first step: it moves work from
handle_stripe5 to a work queue.  The next step is to enable the work
queue to offload the operations to hardware copy/xor engines using the
dmaengine API (include/linux/dmaengine.h).

Initial testing shows that about 60% of the array maintenance work
previously performed by raid5d has moved to the work queue.

These patches apply to the version of md as of commit
266bee88699ddbde42ab303bbc426a105cc49809 in Linus' tree.

Regards,

Dan Williams

[PATCH 001 of 006] raid5: Move write operations to a work queue
[PATCH 002 of 006] raid5: Move check parity operations to a work queue
[PATCH 003 of 006] raid5: Move compute block operations to a work queue
[PATCH 004 of 006] raid5: Move read completion copies to a work queue
[PATCH 005 of 006] raid5: Move expansion operations to a work queue
[PATCH 006 of 006] raid5: Remove compute_block and compute_parity
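[The mechanism the series leans on is the standard kernel workqueue:
handle_stripe5 only records which operations a stripe needs and queues
a per-stripe work item, and the heavy XOR/copy work runs later in
process context.  A minimal sketch of that pattern -- not the actual md
code; it uses the pre-2.6.20 three-argument INIT_WORK() as the patches
do, and all "demo_*" names are illustrative:

    #include <linux/workqueue.h>
    #include <linux/bitops.h>
    #include <linux/errno.h>

    #define DEMO_OP_PENDING 0
    #define DEMO_OP_DONE    1

    /* stand-in for the per-stripe operation state the patches add */
    struct demo_stripe {
        unsigned long ops_state;   /* which block operations are pending */
        struct work_struct work;   /* queued onto the raid5 ops workqueue */
    };

    static struct workqueue_struct *demo_wq;

    /* runs later in process context, off the raid5d fast path */
    static void demo_do_block_ops(void *arg)
    {
        struct demo_stripe *sh = arg;

        /* ...do the XOR/copy work described by sh->ops_state here... */
        set_bit(DEMO_OP_DONE, &sh->ops_state);
    }

    static int demo_setup(struct demo_stripe *sh)
    {
        demo_wq = create_workqueue("demo_raid5_ops");
        if (!demo_wq)
            return -ENOMEM;
        /* pre-2.6.20 API: INIT_WORK(work, handler, data) */
        INIT_WORK(&sh->work, demo_do_block_ops, sh);
        return 0;
    }

    /* the handle_stripe5 analogue: mark what is needed and defer it */
    static void demo_request_op(struct demo_stripe *sh)
    {
        set_bit(DEMO_OP_PENDING, &sh->ops_state);
        queue_work(demo_wq, &sh->work);
    }
]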
[PATCH 004 of 006] raid5: Move read completion copies to a work queue
This patch moves the data copying portion of satisfying read requests into the work queue. It adds a 'read' (past tense) pointer to the r5dev structure to to track reads that have been offloaded to the work queue. When the copy operation is complete the 'read' pointer is reused as the return_bi for the bi_end_io() call. Signed-off-by: Dan Williams <[EMAIL PROTECTED]> drivers/md/raid5.c | 94 - include/linux/raid/raid5.h |6 +- 2 files changed, 71 insertions(+), 29 deletions(-) === Index: linux-2.6-raid/drivers/md/raid5.c === --- linux-2.6-raid.orig/drivers/md/raid5.c 2006-06-28 10:35:31.0 -0700 +++ linux-2.6-raid/drivers/md/raid5.c 2006-06-28 10:35:40.0 -0700 @@ -213,11 +213,11 @@ for (i = sh->disks; i--; ) { struct r5dev *dev = &sh->dev[i]; - if (dev->toread || dev->towrite || dev->written || + if (dev->toread || dev->read || dev->towrite || dev->written || test_bit(R5_LOCKED, &dev->flags)) { - printk("sector=%llx i=%d %p %p %p %d\n", + printk("sector=%llx i=%d %p %p %p %p %d\n", (unsigned long long)sh->sector, i, dev->toread, - dev->towrite, dev->written, + dev->read, dev->towrite, dev->written, test_bit(R5_LOCKED, &dev->flags)); BUG(); } @@ -1490,6 +1490,35 @@ ops_state_orig = ops_state = sh->ops.state; spin_unlock(&sh->lock); + if (test_bit(STRIPE_OP_BIOFILL, &state)) { + raid5_conf_t *conf = sh->raid_conf; + struct bio *return_bi=NULL; + PRINTK("%s: stripe %llu STRIPE_OP_BIOFILL op_state: %lx\n", + __FUNCTION__, (unsigned long long)sh->sector, + ops_state); + + for (i=disks ; i-- ;) { + struct r5dev *dev = &sh->dev[i]; + struct bio *rbi, *rbi2; + rbi = dev->read; + while (rbi && rbi->bi_sector < dev->sector + STRIPE_SECTORS) { + copy_data(0, rbi, dev->page, dev->sector); + rbi2 = r5_next_bio(rbi, dev->sector); + spin_lock_irq(&conf->device_lock); + if (--rbi->bi_phys_segments == 0) { + rbi->bi_next = return_bi; + return_bi = rbi; + } + spin_unlock_irq(&conf->device_lock); + rbi = rbi2; + dev->read = return_bi; + } + } + + work++; + set_bit(STRIPE_OP_BIOFILL_Done, &ops_state); + } + if (test_bit(STRIPE_OP_COMPUTE, &state)) { for (i=disks ; i-- ;) { struct r5dev *dev = &sh->dev[i]; @@ -1725,6 +1754,7 @@ int i; int syncing, expanding, expanded; int locked=0, uptodate=0, to_read=0, to_write=0, failed=0, written=0; + int fill_complete=0, to_fill=0; int non_overwrite = 0; int failed_num=0; struct r5dev *dev; @@ -1740,45 +1770,49 @@ syncing = test_bit(STRIPE_SYNCING, &sh->state); expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state); expanded = test_bit(STRIPE_EXPAND_READY, &sh->state); - /* Now to look around and see what can be done */ + if (test_bit(STRIPE_OP_BIOFILL, &sh->state) && + test_bit(STRIPE_OP_BIOFILL_Done, &sh->ops.state)) { + clear_bit(STRIPE_OP_BIOFILL, &sh->state); + clear_bit(STRIPE_OP_BIOFILL_Done, &sh->ops.state); + fill_complete++; + } + /* Now to look around and see what can be done */ rcu_read_lock(); for (i=disks; i--; ) { mdk_rdev_t *rdev; dev = &sh->dev[i]; clear_bit(R5_Insync, &dev->flags); - PRINTK("check %d: state 0x%lx read %p write %p written %p\n", - i, dev->flags, dev->toread, dev->towrite, dev->written); + PRINTK("check %d: state 0x%lx toread %p read %p write %p written %p\n", + i, dev->flags, dev->toread, dev->read, dev->towrite, dev->written); /* maybe we can reply to a read */ - if (test_bit(R5_UPTODATE, &dev->flags) && dev->toread) { - struct bio *rbi, *rbi2; - PRINTK("Return read for disc %d\n", i); - spin_lock_irq(&conf->device_lock); -
[PATCH 003 of 006] raid5: Move compute block operations to a work queue
This patch adds 'compute block' capabilities to the work queue. Here are a few notes about the new flags R5_ComputeReq and STRIPE_OP_COMPUTE_Recover: Previously, when handle_stripe5 found a block that needed to be computed it updated it in the same step. Now that these operations are separated (across multiple calls to handle_stripe5), a R5_ComputeReq flag is needed to tell other parts of handle_stripe5 to treat the block under computation as if it were up to date. The order of events in the work queue ensures that the block is indeed up to date before performing further operations. STRIPE_OP_COMPUTE_Recover was added to track when the parity block is being computed due to a failed parity check. This allows the code in handle_stripe5 that produces requests for check_parity and compute_block operations to be separate from the code that consumes the result. Signed-off-by: Dan Williams <[EMAIL PROTECTED]> drivers/md/raid5.c | 147 + include/linux/raid/raid5.h |7 +- 2 files changed, 129 insertions(+), 25 deletions(-) === Index: linux-2.6-raid/drivers/md/raid5.c === --- linux-2.6-raid.orig/drivers/md/raid5.c 2006-06-28 10:47:43.0 -0700 +++ linux-2.6-raid/drivers/md/raid5.c 2006-06-28 11:06:06.0 -0700 @@ -1263,7 +1263,9 @@ } } else { /* enter stage 1 of read modify write operation */ - BUG_ON(!test_bit(R5_UPTODATE, &sh->dev[pd_idx].flags)); + BUG_ON(!(test_bit(R5_UPTODATE, &sh->dev[pd_idx].flags) || + test_bit(R5_ComputeReq, &sh->dev[pd_idx].flags))); + set_bit(STRIPE_OP_RMW, &sh->state); set_bit(STRIPE_OP_RMW_ParityPre, &sh->ops.state); for (i=disks ; i-- ;) { @@ -1272,7 +1274,8 @@ continue; if (dev->towrite && - test_bit(R5_UPTODATE, &dev->flags)) { + (test_bit(R5_UPTODATE, &dev->flags) || + test_bit(R5_ComputeReq, &dev->flags))) { set_bit(R5_LOCKED, &dev->flags); locked++; } @@ -1331,6 +1334,30 @@ return work_queued; } +static int handle_compute_operations5(struct stripe_head *sh, int dd_idx) +{ + int work_queued = -EBUSY; + + if (test_bit(STRIPE_OP_COMPUTE, &sh->state) && + test_bit(STRIPE_OP_COMPUTE_Done, &sh->ops.state)) { + clear_bit(STRIPE_OP_COMPUTE, &sh->state); + clear_bit(STRIPE_OP_COMPUTE_Done, &sh->ops.state); + clear_bit(R5_ComputeReq, &sh->dev[dd_idx].flags); + work_queued = 0; + } else if (!test_bit(STRIPE_OP_COMPUTE, &sh->state)) { + set_bit(STRIPE_OP_COMPUTE, &sh->state); + set_bit(STRIPE_OP_COMPUTE_Prep, &sh->ops.state); + set_bit(R5_ComputeReq, &sh->dev[dd_idx].flags); + work_queued = 1; + sh->ops.pending++; + } + + PRINTK("%s: stripe %llu work_queued: %d op_state: %lx dev[%d].flags: %lx\n", + __FUNCTION__, (unsigned long long)sh->sector, + work_queued, sh->ops.state, dd_idx, sh->dev[dd_idx].flags); + + return work_queued; +} /* * Each stripe/dev can have one or more bion attached. 
@@ -1454,7 +1481,7 @@ int i, pd_idx = sh->pd_idx, disks = sh->disks, count = 1; void *ptr[MAX_XOR_BLOCKS]; struct bio *chosen; - int overlap=0, work=0, written=0; + int overlap=0, work=0, written=0, compute=0, dd_idx=0; unsigned long state, ops_state, ops_state_orig; /* take a snapshot of what needs to be done at this point in time */ @@ -1463,6 +1490,51 @@ ops_state_orig = ops_state = sh->ops.state; spin_unlock(&sh->lock); + if (test_bit(STRIPE_OP_COMPUTE, &state)) { + for (i=disks ; i-- ;) { + struct r5dev *dev = &sh->dev[i]; + if (test_bit(R5_ComputeReq, &dev->flags)) { + dd_idx = i; + i = -1; + break; + } + } + BUG_ON(i >= 0); + PRINTK("%s: stripe %llu STRIPE_OP_COMPUTE op_state: %lx block: %d\n", + __FUNCTION__, (unsigned long long)sh->sector, + ops_state, dd_idx); + ptr[0] = page_address(sh->dev[dd_idx].page); + + if (test_and_clear_bit(STRIPE_OP_COMPUTE_Prep, &ops_state)) { + memset(ptr[0], 0, STRIPE_SIZE);
[PATCH 006 of 006] raid5: Remove compute_block and compute_parity
compute_block and compute_parity5 are replaced by the work queue and the handle_*_operations5 routines. Signed-off-by: Dan Williams <[EMAIL PROTECTED]> raid5.c | 123 1 files changed, 123 deletions(-) === --- linux-2.6-raid.orig/drivers/md/raid5.c 2006-06-27 16:16:31.0 -0700 +++ linux-2.6-raid/drivers/md/raid5.c 2006-06-27 16:19:13.0 -0700 @@ -918,129 +918,6 @@ } while(0) -static void compute_block(struct stripe_head *sh, int dd_idx) -{ - int i, count, disks = sh->disks; - void *ptr[MAX_XOR_BLOCKS], *p; - - PRINTK("compute_block, stripe %llu, idx %d\n", - (unsigned long long)sh->sector, dd_idx); - - ptr[0] = page_address(sh->dev[dd_idx].page); - memset(ptr[0], 0, STRIPE_SIZE); - count = 1; - for (i = disks ; i--; ) { - if (i == dd_idx) - continue; - p = page_address(sh->dev[i].page); - if (test_bit(R5_UPTODATE, &sh->dev[i].flags)) - ptr[count++] = p; - else - printk(KERN_ERR "compute_block() %d, stripe %llu, %d" - " not present\n", dd_idx, - (unsigned long long)sh->sector, i); - - check_xor(); - } - if (count != 1) - xor_block(count, STRIPE_SIZE, ptr); - set_bit(R5_UPTODATE, &sh->dev[dd_idx].flags); -} - -static void compute_parity5(struct stripe_head *sh, int method) -{ - raid5_conf_t *conf = sh->raid_conf; - int i, pd_idx = sh->pd_idx, disks = sh->disks, count; - void *ptr[MAX_XOR_BLOCKS]; - struct bio *chosen; - - PRINTK("compute_parity5, stripe %llu, method %d\n", - (unsigned long long)sh->sector, method); - - count = 1; - ptr[0] = page_address(sh->dev[pd_idx].page); - switch(method) { - case READ_MODIFY_WRITE: - BUG_ON(!test_bit(R5_UPTODATE, &sh->dev[pd_idx].flags)); - for (i=disks ; i-- ;) { - if (i==pd_idx) - continue; - if (sh->dev[i].towrite && - test_bit(R5_UPTODATE, &sh->dev[i].flags)) { - ptr[count++] = page_address(sh->dev[i].page); - chosen = sh->dev[i].towrite; - sh->dev[i].towrite = NULL; - - if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags)) - wake_up(&conf->wait_for_overlap); - - BUG_ON(sh->dev[i].written); - sh->dev[i].written = chosen; - check_xor(); - } - } - break; - case RECONSTRUCT_WRITE: - memset(ptr[0], 0, STRIPE_SIZE); - for (i= disks; i-- ;) - if (i!=pd_idx && sh->dev[i].towrite) { - chosen = sh->dev[i].towrite; - sh->dev[i].towrite = NULL; - - if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags)) - wake_up(&conf->wait_for_overlap); - - BUG_ON(sh->dev[i].written); - sh->dev[i].written = chosen; - } - break; - case CHECK_PARITY: - break; - } - if (count>1) { - xor_block(count, STRIPE_SIZE, ptr); - count = 1; - } - - for (i = disks; i--;) - if (sh->dev[i].written) { - sector_t sector = sh->dev[i].sector; - struct bio *wbi = sh->dev[i].written; - while (wbi && wbi->bi_sector < sector + STRIPE_SECTORS) { - copy_data(1, wbi, sh->dev[i].page, sector); - wbi = r5_next_bio(wbi, sector); - } - - set_bit(R5_LOCKED, &sh->dev[i].flags); - set_bit(R5_UPTODATE, &sh->dev[i].flags); - } - - switch(method) { - case RECONSTRUCT_WRITE: - case CHECK_PARITY: - for (i=disks; i--;) - if (i != pd_idx) { - ptr[count++] = page_address(sh->dev[i].page); - check_xor(); - } - break; - case READ_MODIFY_WRITE: - for (i = disks; i--;) - if (sh->dev[i].written) { - ptr[count++] = page_address(sh->dev[i].page); - check_xor(); -
[PATCH 002 of 006] raid5: Move check parity operations to a work queue
This patch adds 'check parity' capabilities to the work queue and fixes 'queue_raid_work'. Also, raid5_do_soft_block_ops now accesses the stripe state under the lock to ensure that it is never out of sync with handle_stripe5. Signed-off-by: Dan Williams <[EMAIL PROTECTED]> drivers/md/raid5.c | 123 ++--- include/linux/raid/raid5.h | 25 ++--- 2 files changed, 113 insertions(+), 35 deletions(-) === Index: linux-2.6-raid/drivers/md/raid5.c === --- linux-2.6-raid.orig/drivers/md/raid5.c 2006-06-28 09:52:07.0 -0700 +++ linux-2.6-raid/drivers/md/raid5.c 2006-06-28 10:35:23.0 -0700 @@ -1289,7 +1289,7 @@ if (locked > 0) { set_bit(R5_LOCKED, &sh->dev[pd_idx].flags); clear_bit(R5_UPTODATE, &sh->dev[pd_idx].flags); - sh->ops.queue_count++; + sh->ops.pending++; } else if (locked == 0) set_bit(R5_UPTODATE, &sh->dev[pd_idx].flags); @@ -1300,6 +1300,37 @@ return locked; } +static int handle_check_operations5(struct stripe_head *sh, int start_n) +{ + int complete=0, work_queued = -EBUSY; + + if (test_bit(STRIPE_OP_CHECK, &sh->state) && + test_bit(STRIPE_OP_CHECK_Done, &sh->ops.state)) { + clear_bit(STRIPE_OP_CHECK, &sh->state); + clear_bit(STRIPE_OP_CHECK_Done, &sh->ops.state); + complete = 1; + } + + if (start_n == 0) { + /* enter stage 1 of parity check operation */ + set_bit(STRIPE_OP_CHECK, &sh->state); + set_bit(STRIPE_OP_CHECK_Gen, &sh->ops.state); + work_queued = 1; + } else if (complete) + work_queued = 0; + + if (work_queued > 0) { + clear_bit(R5_UPTODATE, &sh->dev[sh->pd_idx].flags); + sh->ops.pending++; + } + + PRINTK("%s: stripe %llu start: %d complete: %d op_state: %lx\n", + __FUNCTION__, (unsigned long long)sh->sector, + start_n == 0, complete, sh->ops.state); + + return work_queued; +} + /* * Each stripe/dev can have one or more bion attached. @@ -1406,11 +1437,11 @@ /* must be called under the stripe lock */ static void queue_raid_work(struct stripe_head *sh) { - if (--sh->ops.queue_count == 0) { + if (!test_bit(STRIPE_OP_QUEUED, &sh->state) && sh->ops.pending) { + set_bit(STRIPE_OP_QUEUED, &sh->state); atomic_inc(&sh->count); queue_work(sh->raid_conf->block_ops_queue, &sh->ops.work); - } else if (sh->ops.queue_count < 0) - sh->ops.queue_count = 0; + } } /* @@ -1423,16 +1454,17 @@ int i, pd_idx = sh->pd_idx, disks = sh->disks, count = 1; void *ptr[MAX_XOR_BLOCKS]; struct bio *chosen; - int overlap=0, new_work=0, written=0; - unsigned long state, ops_state; + int overlap=0, work=0, written=0; + unsigned long state, ops_state, ops_state_orig; /* take a snapshot of what needs to be done at this point in time */ spin_lock(&sh->lock); state = sh->state; - ops_state = sh->ops.state; + ops_state_orig = ops_state = sh->ops.state; spin_unlock(&sh->lock); if (test_bit(STRIPE_OP_RMW, &state)) { + BUG_ON(test_bit(STRIPE_OP_RCW, &state)); PRINTK("%s: stripe %llu STRIPE_OP_RMW op_state: %lx\n", __FUNCTION__, (unsigned long long)sh->sector, ops_state); @@ -1483,14 +1515,14 @@ if (count != 1) xor_block(count, STRIPE_SIZE, ptr); - /* signal completion and acknowledge the last state seen -* by sh->ops.state -*/ + work++; set_bit(STRIPE_OP_RMW_Done, &ops_state); - set_bit(STRIPE_OP_RMW_ParityPre, &ops_state); } - } else if (test_bit(STRIPE_OP_RCW, &state)) { + } + + if (test_bit(STRIPE_OP_RCW, &state)) { + BUG_ON(test_bit(STRIPE_OP_RMW, &state)); PRINTK("%s: stripe %llu STRIPE_OP_RCW op_state: %lx\n", __FUNCTION__, (unsigned long long)sh->sector, ops_state); @@ -1527,20 +1559,47 @@ if (count != 1) xor_block(count, STRIPE_SIZE, ptr); - /* signal completion and acknowledge the last state seen -* by 
sh->ops.state -*/ + work++; set_bit(STRIPE_OP_RCW_Done, &ops_state); -
[PATCH 005 of 006] raid5: Move expansion operations to a work queue
This patch modifies handle_write_operations5() to handle the parity calculation request made by the reshape code. However this patch does not move the copy operation associated with an expand to the work queue. First, it was difficult to find a clean way to pass the parameters of this operation to the queue. Second, this section of code is a good candidate for performing the copies with inline calls to the dma routines. This patch also cleans up the *_End flags which as of this version of the patch set are not needed. Signed-off-by: Dan Williams <[EMAIL PROTECTED]> drivers/md/raid5.c | 51 - include/linux/raid/raid5.h | 36 +++ 2 files changed, 54 insertions(+), 33 deletions(-) === Index: linux-2.6-raid/drivers/md/raid5.c === --- linux-2.6-raid.orig/drivers/md/raid5.c 2006-06-28 10:35:40.0 -0700 +++ linux-2.6-raid/drivers/md/raid5.c 2006-06-28 10:35:50.0 -0700 @@ -1250,16 +1250,25 @@ */ if (locked == 0) { if (rcw == 0) { - /* enter stage 1 of reconstruct write operation */ - set_bit(STRIPE_OP_RCW, &sh->state); - set_bit(STRIPE_OP_RCW_Drain, &sh->ops.state); - for (i=disks ; i-- ;) { - struct r5dev *dev = &sh->dev[i]; - - if (i!=pd_idx && dev->towrite) { - set_bit(R5_LOCKED, &dev->flags); + /* skip the drain operation on an expand */ + if (test_bit(STRIPE_OP_RCW_Expand, &sh->ops.state)) { + set_bit(STRIPE_OP_RCW, &sh->state); + set_bit(STRIPE_OP_RCW_Parity, &sh->ops.state); + for (i=disks ; i-- ;) { + set_bit(R5_LOCKED, &sh->dev[i].flags); locked++; } + } else { /* enter stage 1 of reconstruct write operation */ + set_bit(STRIPE_OP_RCW, &sh->state); + set_bit(STRIPE_OP_RCW_Drain, &sh->ops.state); + for (i=disks ; i-- ;) { + struct r5dev *dev = &sh->dev[i]; + + if (i!=pd_idx && dev->towrite) { + set_bit(R5_LOCKED, &dev->flags); + locked++; + } + } } } else { /* enter stage 1 of read modify write operation */ @@ -2213,16 +,24 @@ } if (expanded && test_bit(STRIPE_EXPANDING, &sh->state)) { + int work_queued, start_n=1; /* Need to write out all blocks after computing parity */ sh->disks = conf->raid_disks; sh->pd_idx = stripe_to_pdidx(sh->sector, conf, conf->raid_disks); - compute_parity5(sh, RECONSTRUCT_WRITE); - for (i= conf->raid_disks; i--;) { - set_bit(R5_LOCKED, &sh->dev[i].flags); - locked++; - set_bit(R5_Wantwrite, &sh->dev[i].flags); + if (!(test_bit(STRIPE_OP_RCW, &sh->state) || + test_bit(STRIPE_OP_RCW_Expand, &sh->ops.state))) { + start_n = 0; + set_bit(STRIPE_OP_RCW_Expand, &sh->ops.state); + } + work_queued = handle_write_operations5(sh, 0, start_n); + if (work_queued == 0) { + for (i= conf->raid_disks; i--;) + set_bit(R5_Wantwrite, &sh->dev[i].flags); + clear_bit(STRIPE_EXPANDING, &sh->state); + clear_bit(STRIPE_OP_RCW_Expand, &sh->ops.state); + } else if (work_queued > 0) { + locked += work_queued; } - clear_bit(STRIPE_EXPANDING, &sh->state); } else if (expanded) { clear_bit(STRIPE_EXPAND_READY, &sh->state); atomic_dec(&conf->reshape_stripes); @@ -2257,9 +2274,15 @@ release_stripe(sh2); continue; } + /* to do: perform these operations with a dma engine +* inline (rather than pushing to the workqueue) +*/ + /*#ifdef CONFIG_DMA_ENGINE*/ + /*#else
Re: Ok to go ahead with this setup?
On Wed, 28 Jun 2006, Justin Piszcz wrote:

> How do you have the 15 drives attached?  Did you buy a SATA raid card?
> Do you have multiple (cheap JBOD SATA cards)?  If so, which did you
> use?  I cannot seem to find any PCI-e cards with >= 4-8 slots that
> support JBOD under $700-$900.

We're using a 3ware 9550SX-16ML, which is a 133 MHz PCI-X card.  They
also have a 9590SE that does the same with PCI-E, though I don't know
if the stock 2.6.x kernel supports these yet.

Mike
Re: I need a PCI V2.1 4 port SATA card
Brad Campbell wrote:
> I'd love to do something similar with PCI-E or PCI-X and make it go
> faster (the PCI bus bandwidth is the killer), however I've not seen
> many affordable PCI-E multi-port cards that are supported yet and
> PCI-X seems to mean moving to "server" class mainboards and the other
> expenses that come along with that.

Recently I was looking for a budget solution to exactly this problem,
and the best I found was to use a 2-port SiI 3132-based PCI-E x1 card
combined with a 1:5 SATA splitter based on the SiI 3726 (e.g.
http://fwdepot.com/thestore/product_info.php/products_id/1245).
Unfortunately I didn't find anyone selling the splitter here in
Czechia, so I went with a 4-port SiI PCI card, which is performing
well and stable, but is of course quite slow.

Some tests I googled up at that time suggested that this combo can get
about 220 MB/s of bandwidth through in real life (the test was on
Win32, though), so at today's drive speeds you can connect ~4-5 drives
to one PCI-E lane without the bus bandwidth becoming the limiting
factor.

Anyway, for really budget machines I can recommend the PCI SiI
3124-based cards; the driver in the kernel has been working rock-stable
for me.  My only gripe is that the driver doesn't notice if you
disconnect a drive from its SATA connector, i.e. when you do that, the
computer will freeze trying to write to the disconnected drive.  After
~3 minutes it times out and md kicks the drive out of the array,
though.

If someone has any experience to share about the SiI 3132+3726 under
Linux, I'll be happy to hear about it.  According to
http://linux-ata.org/software-status.html#pmp it should work; the
question is how stable it is, since it is recent development.

Petr
Re: I need a PCI V2.1 4 port SATA card
On Wed, 28 Jun 2006, Christian Pernegger wrote:

> I also subscribe to the "almost commodity hardware" philosophy,
> however I've not been able to find a case that comfortably takes even
> 8 drives. (The Stacker is an absolute nightmare ...) Even most
> rackable cases stop at 6 3.5" drive bays -- either that or they are
> dedicated storage racks with integrated hw RAID and fiber SCSI
> interconnect --> definitely not commodity.

I've used these:

  http://www.acme-technology.co.uk/acm338.htm (8 drives in a 3U case)

and their variants, eg:

  http://www.acme-technology.co.uk/acm312.htm (12 disks in a 3U case)

for several years with good results.  Not the cheapest on the block,
though, but I've never had any real issues with them.

Gordon
Re: Ok to go ahead with this setup?
> Here's a tentative setup:
>
> Intel SE7230NH1-E mainboard
> Pentium D 930
> 2x1GB Crucial 533 DDR2 ECC
> Intel SC5295-E enclosure

The above components have finally arrived ... and I was shocked to see
that the case's drive bays do not have their own fan, nor can I think
of anywhere to put one.  Apparently Intel thinks the case airflow is
enough to cool 6 rather densely packed drives.

I could get one of their hotswap backplanes (which include a fan) but
I don't need hotswap, really.

Does anyone have enough experience with this or similar Intel cases to
comment?

Sorry for being OT,

C.
Re: I need a PCI V2.1 4 port SATA card
> My current 15 drive RAID-6 server is built around a KT600 board with
> an AMD Sempron processor and 4 SATA150TX4 cards. It does the job but
> it's not the fastest thing around (takes about 10 hours to do a check
> of the array or about 15 to do a rebuild).

What kind of enclosure do you have this in?

I also subscribe to the "almost commodity hardware" philosophy,
however I've not been able to find a case that comfortably takes even
8 drives. (The Stacker is an absolute nightmare ...) Even most
rackable cases stop at 6 3.5" drive bays -- either that or they are
dedicated storage racks with integrated hw RAID and fiber SCSI
interconnect --> definitely not commodity.

Thanks,

C.
Re: I need a PCI V2.1 4 port SATA card
On Wed, 28 Jun 2006, Brad Campbell wrote:

> Guy wrote:
>> Hello group,
>> I am upgrading my disks from old 18 Gig SCSI disks to 300 Gig SATA
>> disks.  I need a good SATA controller.  My system is old and has
>> PCI V 2.1.  I need a 4 port card, or two 2 port cards.  My system
>> has multiple PCI buses, so 2 cards may give me better performance,
>> but I don't need it.  I will be using software RAID.  Can anyone
>> recommend a card that is supported by the current kernel?
>
> I'm using Promise SATA150TX4 cards here in old PCI based systems.
> They work great and have been rock solid for well in excess of a year
> of 24/7 hard use.  I have 3 in one box and 4 in another.
>
> I'm actually looking at building another 15 disk server now and was
> hoping to move to something quicker using _almost_ commodity hardware.
> My current 15 drive RAID-6 server is built around a KT600 board with
> an AMD Sempron processor and 4 SATA150TX4 cards.  It does the job but
> it's not the fastest thing around (takes about 10 hours to do a check
> of the array or about 15 to do a rebuild).
>
> I'd love to do something similar with PCI-E or PCI-X and make it go
> faster (the PCI bus bandwidth is the killer), however I've not seen
> many affordable PCI-E multi-port cards that are supported yet and
> PCI-X seems to mean moving to "server" class mainboards and the other
> expenses that come along with that.
>
> Brad
> --
> "Human beings, who are almost unique in having the ability to learn
> from the experience of others, are also remarkable for their apparent
> disinclination to do so." -- Douglas Adams

That is the problem: the only 4-port cards are PCI, not PCI-e, and
thus limit your speed and bandwidth.  The only alternative I see is an
Areca card if you want speed.
Re: Ok to go ahead with this setup?
On Wed, 28 Jun 2006, [EMAIL PROTECTED] wrote:

> Mike Dresser wrote:
>> On Fri, 23 Jun 2006, Molle Bestefich wrote:
>>> Christian Pernegger wrote:
>>>> Anything specific wrong with the Maxtors?
>>>
>>> I'd watch out regarding the Western Digital disks, apparently they
>>> have a bad habit of turning themselves off when used in RAID mode,
>>> for some reason:
>>> http://thread.gmane.org/gmane.linux.kernel.device-mapper.devel/1980/
>>
>> The MaxLine III's (7V300F0) with VA111630/670 firmware currently
>> timeout on a weekly or less basis..  I'm still testing VA111680 on a
>> 15x300 gig array
>
> We also see a similar problem on Maxtor 6V250F0 drives: they 'crash'
> randomly on a timescale of weeks.  The only way to get them back is by
> power cycling.  We tried both a SuperMicro SATA card (Marvell chip)
> and a Promise FastTrak; firmware updates from Maxtor have not fixed it
> yet.
>
> We were already forced to exchange all the drives at a customer
> because he does not want to use Maxtors anymore.  Neither do we :(
>
> Bart

> The MaxLine III's (7V300F0) with VA111630/670 firmware currently
> timeout on a weekly or less basis..  I'm still testing VA111680 on a
> 15x300 gig array

How do you have the 15 drives attached?  Did you buy a SATA raid card?
Do you have multiple (cheap JBOD SATA cards)?  If so, which did you
use?  I cannot seem to find any PCI-e cards with >= 4-8 slots that
support JBOD under $700-$900.
Drive issues in RAID vs. not-RAID ..
I've seen a few comments to the effect that some disks have problems
when used in a RAID setup and I'm a bit perplexed as to why this might
be..

What's the difference between a drive in a RAID set (either s/w or
h/w) and a drive on its own, assuming the load, etc. is roughly the
same in each setup?

Is it just "bad feeling" or are there any scientific reasons for it?

Cheers,

Gordon
Re: Ok to go ahead with this setup?
Christian Pernegger wrote:
> > The MaxLine III's (7V300F0) with VA111630/670 firmware currently
> > timeout on a weekly or less basis..
>
> I have just one 7V300F0, so no idea how it behaves in a RAID.  It's
> been fine apart from the fact that my VIA southbridge SATA controller
> doesn't even detect it ... :(

You'll need a drive firmware update for this.

> (Anyone else notice compatibility problems are through the roof
> lately?)

Yes, the manufacturers are busy with SATA II while SATA I is still not
stable.

> The 8x 6B300R0 (PATA) have been excellent on my PATA 3ware.
> Untypically for Maxtor, not one has died yet (over a year) :)
>
> The 8x 6Y120L0 (PATA) died at a rate of about two a year.
>
> I mainly use Maxtor due to the fact that the RMA process is automated,
> hassle-free and fast.  They will exchange a drive when the first bad
> sector errors start to show up and not insist on a low level format
> to "fix" the problem.

They did not offer to exchange our 30+ drives that are having timeouts.
Re: Ok to go ahead with this setup?
> The MaxLine III's (7V300F0) with VA111630/670 firmware currently
> timeout on a weekly or less basis..

I have just one 7V300F0, so no idea how it behaves in a RAID.  It's
been fine apart from the fact that my VIA southbridge SATA controller
doesn't even detect it ... :(

(Anyone else notice compatibility problems are through the roof
lately?)

The 8x 6B300R0 (PATA) have been excellent on my PATA 3ware.
Untypically for Maxtor, not one has died yet (over a year) :)

The 8x 6Y120L0 (PATA) died at a rate of about two a year.

I mainly use Maxtor due to the fact that the RMA process is automated,
hassle-free and fast.  They will exchange a drive when the first bad
sector errors start to show up and not insist on a low level format to
"fix" the problem.

The Atlas SCSI line they inherited from Quantum isn't half-bad either.
What happens now that Seagate has bought them, nobody knows.

C.