RE: How many drives are bad?
> The box presents 48 drives, split across 6 SATA controllers. So disks
> sda-sdh are on one controller, etc. In our configuration, I run a RAID5
> MD array for each controller, then run LVM on top of these to form one
> large VolGroup.

I might be missing something here, and I realise you'd lose 8 drives to redundancy rather than 6, but wouldn't it have been better to have 8 arrays of 6 drives, each array using a single drive from each controller? That way a single controller failure (assuming no other HD failures) wouldn't actually take any array down. I do realise that 2 controller failures at the same time would lose everything.

Steve.
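(For illustration, a minimal sketch of that cross-controller layout, assuming the 48 drives are lettered consecutively per controller, i.e. controller 0 holds sda-sdh, controller 1 holds sdi-sdp, and so on through sdav. The device names are assumptions, not details from the box described above:)

    # Each array takes one drive from each of the 6 controllers, so a
    # single controller failure degrades every array but kills none.
    mdadm --create /dev/md0 --level=5 --raid-devices=6 \
        /dev/sda /dev/sdi /dev/sdq /dev/sdy /dev/sdag /dev/sdao
    mdadm --create /dev/md1 --level=5 --raid-devices=6 \
        /dev/sdb /dev/sdj /dev/sdr /dev/sdz /dev/sdah /dev/sdap
    # ... and so on through /dev/md7, then pool them with LVM:
    pvcreate /dev/md[0-7]
    vgcreate VolGroup /dev/md[0-7]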
HDD errors in dmesg, but don't know why...
Hi All,

I've got a degraded RAID5 which I'm trying to add the replacement disk into. Trouble is, every time the recovery starts, it flies along at 70MB/s or so. Then after doing about 1%, it starts dropping rapidly, until eventually a device is marked failed. When I look in dmesg, I get the following pair of errors, repeated over and over:

SCSI device sdd: 976773168 512-byte hdwr sectors (500108 MB)
sdd: Write Protect is off
sdd: Mode Sense: 00 3a 00 00
SCSI device sdd: drive cache: write back
ata5.00: exception Emask 0x0 SAct 0x7ff SErr 0x0 action 0x0
ata5.00: (irq_stat 0x00060002, device error via SDB FIS)
ata5.00: cmd 60/00:10:3f:0e:f9/01:00:00:00:00/40 tag 2 cdb 0x0 data 131072 in
         res 41/40:00:50:0e:f9/9c:00:00:00:00/40 Emask 0x9 (media error)
ata5.00: configured for UDMA/100
ata5: EH complete
SCSI device sdd: 976773168 512-byte hdwr sectors (500108 MB)
sdd: Write Protect is off
sdd: Mode Sense: 00 3a 00 00
SCSI device sdd: drive cache: write back
ata5.00: exception Emask 0x0 SAct 0x7ff SErr 0x0 action 0x0
ata5.00: (irq_stat 0x00060002, device error via SDB FIS)
ata5.00: cmd 60/00:18:3f:02:f9/01:00:00:00:00/40 tag 3 cdb 0x0 data 131072 in
         res 41/40:00:c3:02:f9/9c:00:00:00:00/40 Emask 0x9 (media error)
ata5.00: configured for UDMA/100
ata5: EH complete

I've no idea what to make of these errors. As far as I can work out, the HDs themselves are fine; they are all less than 2 months old. The box is CentOS 5.1:

Linux space.homenet.com 2.6.18-53.1.13.el5 #1 SMP Tue Feb 12 13:02:30 EST 2008 x86_64 x86_64 x86_64 GNU/Linux

Any suggestions on what I can do to stop this issue?

Steve.
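(For anyone hitting the same "media error" lines: that status comes from the drive itself failing to read a sector during the rebuild. A minimal, non-destructive way to see what the disk is reporting about itself; this assumes smartmontools is installed, which the post above doesn't mention:)

    # Look at Reallocated_Sector_Ct and Current_Pending_Sector in
    # particular; non-zero pending sectors match the symptoms above.
    smartctl -a /dev/sdd

    # Run the drive's own long surface scan, then read the result.
    smartctl -t long /dev/sdd
    smartctl -l selftest /dev/sdd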
Any inexpensive hardware recommendations for PCI interface cards?
Hi All,

I currently have a couple of IT8212 PCI ATA RAID (1, 0 or 1+0) cards which Linux RAID doesn't seem to like too well. Initially I tried creating an array out of 4 disks on the 4 primaries over the 2 cards. Although this seemed to work, the access performance was impossibly low and I never actually got to the point of leaving it for a week to build the array.

Since then, I have striped 2 of the drives (just primaries again, and known to be good drives) on a single card using the card's own striping capability. I then added this striped device to my RAID 5 array as a single 500GB disk. The grow into this disk worked fine, as did the resize2fs. But as soon as I tried to copy data onto the array, the system marked the disk as faulty.

Currently I'm running a degraded array anyway, as I'm waiting for the replacement of the drive that failed with bad sectors when I initially started the grow. So I had to use assemble --force to get the md device back online. I've not mounted it since. (I know it's wise not to use the drive while it's degraded, but if I can get the data off a 320GB HD, then I can stripe that with another disk on the other ITE card, and add in a spare to the array.)

Can anyone see any issues with what I'm trying to do?
Are there any known issues with IT8212 cards? (They worked fine as straight disks on Linux.)
Is anyone using an array with disks on PCI interface cards?
Is there an issue with mixing motherboard interfaces and PCI card based ones?
Does anyone recommend any inexpensive (probably SATA-II) PCI interface cards? The motherboard has run out of sensible interfaces (I'm not using both primary and secondary in an array on IDE), but I'd still like the capacity to grow my array further.

Thanks again for the help.

Steve.
Disk failure during grow, what is the current state.
Hi All,

I was wondering if someone might be willing to confirm what the current state of my RAID array is, given the following sequence of events (sorry, it's pretty long).

I had a clean, running /dev/md0 using 5 disks in RAID 5 (sda1, sdb1, sdc1, sdd1, hdd1). It had been clean like that for a while, so last night I decided it was safe to grow the array into a sixth disk.

[EMAIL PROTECTED] ~]# mdadm /dev/md0 --add /dev/hdi1
mdadm: added /dev/hdi1
[EMAIL PROTECTED] ~]# mdadm -D /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Wed Jan  9 18:57:53 2008
     Raid Level : raid5
     Array Size : 1953535744 (1863.04 GiB 2000.42 GB)
  Used Dev Size : 488383936 (465.76 GiB 500.11 GB)
   Raid Devices : 5
  Total Devices : 6
Preferred Minor : 0
    Persistence : Superblock is persistent
    Update Time : Tue Feb  5 23:55:59 2008
          State : clean
 Active Devices : 5
Working Devices : 6
 Failed Devices : 0
  Spare Devices : 1
         Layout : left-symmetric
     Chunk Size : 64K
           UUID : 382c157a:405e0640:c30f9e9e:888a5e63
         Events : 0.429616

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       8       33        2      active sync   /dev/sdc1
       3      22       65        3      active sync   /dev/hdd1
       4       8       49        4      active sync   /dev/sdd1
       5      56        1        -      spare         /dev/hdi1
[EMAIL PROTECTED] ~]# mdadm --grow /dev/md0 --raid-devices=6
mdadm: Need to backup 1280K of critical section..
mdadm: ... critical section passed.
[EMAIL PROTECTED] ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 hdi1[5] sdd1[4] sdc1[2] sdb1[1] sda1[0] hdd1[3]
      1953535744 blocks super 0.91 level 5, 64k chunk, algorithm 2 [6/6] [UUUUUU]
      [>....................]  reshape =  0.0% (29184/488383936) finish=2787.4min speed=2918K/sec

unused devices: <none>
[EMAIL PROTECTED] ~]#

OK, so that would take nearly 2 days to complete, so I went to bed happy about 10 hours ago. I come to the machine this morning, and I have the following:

[EMAIL PROTECTED] ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 hdi1[5] sdd1[6](F) sdc1[2] sdb1[1] sda1[0] hdd1[3]
      1953535744 blocks super 0.91 level 5, 64k chunk, algorithm 2 [6/5] [UUUU_U]

unused devices: <none>
You have new mail in /var/spool/mail/root
[EMAIL PROTECTED] ~]# mdadm -D /dev/md0
/dev/md0:
        Version : 00.91.03
  Creation Time : Wed Jan  9 18:57:53 2008
     Raid Level : raid5
     Array Size : 1953535744 (1863.04 GiB 2000.42 GB)
  Used Dev Size : 488383936 (465.76 GiB 500.11 GB)
   Raid Devices : 6
  Total Devices : 6
Preferred Minor : 0
    Persistence : Superblock is persistent
    Update Time : Wed Feb  6 05:28:09 2008
          State : clean, degraded
 Active Devices : 5
Working Devices : 5
 Failed Devices : 1
  Spare Devices : 0
         Layout : left-symmetric
     Chunk Size : 64K
  Delta Devices : 1, (5->6)
           UUID : 382c157a:405e0640:c30f9e9e:888a5e63
         Events : 0.470964

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       8       33        2      active sync   /dev/sdc1
       3      22       65        3      active sync   /dev/hdd1
       4       0        0        4      removed
       5      56        1        5      active sync   /dev/hdi1

       6       8       49        -      faulty spare
[EMAIL PROTECTED] ~]# df -k
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/mapper/VolGroup00-LogVol00
                      56086828  11219432  41972344  22% /
/dev/hda1               101086     18281     77586  20% /boot
/dev/md0            1922882096 1775670344  69070324  97% /Downloads
tmpfs                   513556         0    513556   0% /dev/shm
[EMAIL PROTECTED] ~]# mdadm /dev/md0 --remove /dev/sdd1
mdadm: cannot find /dev/sdd1: No such file or directory
[EMAIL PROTECTED] ~]#

As you can see, one of the original 5 devices (sdd1) has failed and been automatically removed. The reshape has stopped, but the new disk seems to be in and clean, which is the bit I don't understand.
The new disk hasn't been added to the size, so it would seem that md has switched it to being used as a spare instead (possibly because the grow hadn't completed?). How come it seems to have recovered so nicely? Is there something I can do to check its integrity? Was it just so much quicker than 2 days because it switched to only having to sort out the 1 disk? Would it be safe to run an fsck to check the integrity of the fs? I don't want to inadvertently blat the raid array by 'using' it when it's in a dodgy state. I have unmounted the drive for the time being, so that it doesn't get any writes until I know what state it's in.
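(For reference, two non-destructive checks that answer the integrity question above, assuming the array is otherwise idle; both are standard md/e2fsprogs interfaces on a CentOS 5-era kernel:)

    # Ask md to read every stripe and verify parity without rewriting
    # anything, then see how many inconsistent stripes were found.
    echo check > /sys/block/md0/md/sync_action
    cat /proc/mdstat                     # shows the check's progress
    cat /sys/block/md0/md/mismatch_cnt   # non-zero = inconsistent stripes

    # A read-only filesystem check; -n answers 'no' to every repair
    # prompt, so nothing is written to the device.
    fsck.ext3 -n /dev/md0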
FW: Disk failure during grow, what is the current state.
I'm having a nightmare with emails today; I can't get a single one right first time. Apologies to Alex for sending it directly to him and not to the list on first attempt.

Steve

-----Original Message-----
From: Steve Fairbairn [mailto:[EMAIL PROTECTED]]
Sent: 06 February 2008 15:02
To: 'Nagilum'
Subject: RE: Disk failure during grow, what is the current state.

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Nagilum
> Sent: 06 February 2008 14:34
> To: Steve Fairbairn
> Cc: linux-raid@vger.kernel.org
> Subject: Re: Disk failure during grow, what is the current state.
>
> If a drive fails during reshape, the reshape will just continue. The
> blocks which were on the failed drive are calculated from the other
> disks, and writes to the failed disk are simply omitted. The result is
> a raid5 with a failed drive. You should get a new drive asap to restore
> the redundancy. Also it's kinda important that you don't run 2.6.23,
> because it has a nasty bug which would be triggered in this scenario.
> The reshape probably increased in speed after the system was no longer
> actively used and IO bandwidth freed up.
> Kind regards,
> Alex.

Thanks for the response Alex, but:

  Array Size : 1953535744 (1863.04 GiB 2000.42 GB)
  Used Dev Size : 488383936 (465.76 GiB 500.11 GB)

Surely the added disk should by now have been added to the Array Size? 5 * 500GB is 2500GB, not 2000GB. This is why I don't think the reshape has continued.

As for speeding up because of no IO bandwidth, this doesn't actually hold very true either, because the system was at a point of not being used anyway before I added the disk, and I didn't unmount the drive until this morning, after it claimed it had finished doing anything. It's because the size doesn't match up to all 5 disks being used that I still wonder at the state of the array.

Steve.
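(A quick way to tell whether a reshape is actually still running, using only the interfaces already shown in this thread:)

    # A running reshape shows a 'reshape = x%' progress line here;
    # if that line is gone but the array still reports a transitional
    # state, the reshape has stopped rather than completed.
    cat /proc/mdstat

    # 'Delta Devices' and the transitional 0.91 superblock version
    # both disappear once a reshape has genuinely finished.
    mdadm -D /dev/md0 | grep -E 'Version|State|Delta'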
RE: Disk failure during grow, what is the current state.
-----Original Message-----
From: Steve Fairbairn [mailto:[EMAIL PROTECTED]]
Sent: 06 February 2008 15:02
To: 'Nagilum'
Subject: RE: Disk failure during grow, what is the current state.

>   Array Size : 1953535744 (1863.04 GiB 2000.42 GB)
>   Used Dev Size : 488383936 (465.76 GiB 500.11 GB)
>
> Surely the added disk should by now have been added to the Array Size?
> 5 * 500GB is 2500GB, not 2000GB. This is why I don't think the reshape
> has continued.
>
> As for speeding up because of no IO bandwidth, this doesn't actually
> hold very true either, because the system was at a point of not being
> used anyway before I added the disk, and I didn't unmount the drive
> until this morning, after it claimed it had finished doing anything.

Thanks again to Alex for his comments. I've just rebooted the box, the reshape has continued on the degraded array, and an RMA has been raised for the faulty disk.

Thanks,

Steve.
RE: mdadm error when trying to replace a failed drive in RAID5 array
Thanks for the response Bill. Neil has responded to me a few times, but I'm more than happy to try and keep it on this list instead, as it feels like I'm badgering Neil, which really isn't fair...

Since my initial email, I got to the point of believing it was down to the superblock, and that --zero-superblock wasn't working, so a good few hours and a dd if=/dev/zero of=/dev/hdc later, I tried adding it again, with the same result. As it happens, I did the --zero-superblock, then tried to insert it again, and then examined (mdadm -E) again, and the superblock was 'still there'. What really happened was that the act of trying to add it wrote the superblock. So --zero-superblock is working fine for me, but md is still refusing to add the device.

The only other thing I've tried is moving the replacement drive to /dev/hdd instead (secondary slave), with a small old HD I had lying around as hdc.

[EMAIL PROTECTED] ~]# mdadm -E /dev/hdd1
mdadm: No md superblock detected on /dev/hdd1.
[EMAIL PROTECTED] ~]# mdadm /dev/md0 --add /dev/hdd1
mdadm: add new device failed for /dev/hdd1 as 5: Invalid argument
[EMAIL PROTECTED] ~]# dmesg | tail
...
md: hdd1 has invalid sb, not importing!
md: md_import_device returned -22
[EMAIL PROTECTED] ~]# mdadm -E /dev/hdd1
/dev/hdd1:
          Magic : a92b4efc
        Version : 00.90.00
           UUID : 382c157a:405e0640:c30f9e9e:888a5e63
  Creation Time : Wed Jan  9 18:57:53 2008
     Raid Level : raid5
  Used Dev Size : 488383936 (465.76 GiB 500.11 GB)
     Array Size : 1953535744 (1863.04 GiB 2000.42 GB)
   Raid Devices : 5
  Total Devices : 4
Preferred Minor : 0

    Update Time : Sun Jan 20 13:02:00 2008
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 1
  Spare Devices : 0
       Checksum : 198f8fb4 - correct
         Events : 0.348270

         Layout : left-symmetric
     Chunk Size : 64K

      Number   Major   Minor   RaidDevice State
this     5      22       65       -1      spare   /dev/hdd1

   0     0       8        1        0      active sync   /dev/sda1
   1     1       8       17        1      active sync   /dev/sdb1
   2     2       8       33        2      active sync   /dev/sdc1
   3     3       0        0        3      faulty removed
   4     4       8       49        4      active sync   /dev/sdd1

I have mentioned it to Neil, but didn't mention it here before: I am a C developer by trade, so I can easily delve into the mdadm source for extra debug if anyone thinks it could help. I could also delve into md in the kernel if really wanted, but my knowledge of building kernels on Linux is some 4+ years out of date and forgotten, so if that's a yes, then some pointers on how to get the CentOS kernel config, and a choice of kernel from www.kernel.org or from the CentOS distro, would be invaluable.

I'm away for a few days from tomorrow and probably won't be able to do much, if anything, until I'm back on Thursday, so please be patient if I don't respond before then.

Many Thanks,

Steve.
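(Side note on the full-disk dd above: with a 0.90 superblock, like the one in the mdadm -E output, the superblock lives in the last 64KB-aligned 64KB block of the device, so only that region ever needs zeroing. A sketch of the offset arithmetic, with the device name assumed; mdadm --zero-superblock does exactly this when it can identify the superblock:)

    # 0.90 superblock offset: round the device size down to a 64KB
    # boundary (128 sectors), then back off one more 64KB block.
    dev=/dev/hdc1
    sectors=$(blockdev --getsz $dev)   # device size in 512-byte sectors
    sb=$(( (sectors & ~127) - 128 ))
    dd if=/dev/zero of=$dev bs=512 seek=$sb count=128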
RE: mdadm error when trying to replace a failed drive in RAID5 array
-----Original Message-----
From: Neil Brown [mailto:[EMAIL PROTECTED]]
Sent: 20 January 2008 20:37

>> md: hdd1 has invalid sb, not importing!
>> md: md_import_device returned -22
>
> In 2.6.18, the only things that can return this message without other,
> more explanatory messages are:
>  2/ If the device appears to be too small.
> Maybe it is the latter, though that seems unlikely.

[EMAIL PROTECTED] ~]# mdadm /dev/md0 --verbose --add /dev/hdd1
mdadm: added /dev/hdd1

HUGE thanks to Neil, and one white gold plated donkey award to me.

OK. When I created /dev/md1 after creating /dev/md0, I was using a mishmash of disks I had lying around. As this selection of disks used differing block sizes, I chose to create the raid partitions from the first block to a set size (+250G). When I reinstalled the disk for going into /dev/md0, I partitioned it the same way (+500G), which it turns out isn't how I created the partitions when I created that array. So the device I was trying to add was about 22 blocks too small. Taking Neil's suggestion and looking at /proc/partitions showed this up incredibly quickly.

My sincere apologies for wasting all your time on a stupid error, and again many, many thanks for the solution...

md0 : active raid5 hdd1[5] sdd1[4] sdc1[2] sdb1[1] sda1[0]
      1953535744 blocks level 5, 64k chunk, algorithm 2 [5/4] [UUU_U]
      [>....................]  recovery =  0.9% (4430220/488383936) finish=1110.8min speed=7259K/sec

Steve.
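(The /proc/partitions check Neil suggested boils down to this: a replacement member must be at least as large as the array's Used Dev Size. Partition names follow the listings above.)

    # /proc/partitions reports sizes in 1K blocks. The candidate
    # partition must be >= Used Dev Size from mdadm -D (488383936
    # above); if it is even a few blocks smaller, the kernel rejects
    # it with "invalid sb, not importing".
    awk '$4 ~ /^(sd[abcd]1|hdd1)$/ { print $4, $3 }' /proc/partitions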
mdadm error when trying to replace a failed drive in RAID5 array
Hi All,

Firstly, I must express my thanks to Neil Brown for being willing to respond to the direct email I sent him, as I couldn't for the life of me find any forums on mdadm, or this list...

I have a software RAID 5 device configured, but one of the drives failed. I removed the drive with the following command:

mdadm /dev/md0 --remove /dev/hdc1

[EMAIL PROTECTED] ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md1 : active raid5 hdk1[5] hdi1[3] hdh1[2] hdg1[1] hde1[0]
      976590848 blocks level 5, 64k chunk, algorithm 2 [5/4] [UUUU_]
      [====>................]  recovery = 22.1% (54175872/244147712) finish=3615.3min speed=872K/sec

md0 : active raid5 sdd1[4] sdc1[2] sdb1[1] sda1[0]
      1953535744 blocks level 5, 64k chunk, algorithm 2 [5/4] [UUU_U]

unused devices: <none>

Please ignore /dev/md1 for now at least. Now my array (/dev/md0) shows the following:

[EMAIL PROTECTED] ~]# mdadm -QD /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Wed Jan  9 18:57:53 2008
     Raid Level : raid5
     Array Size : 1953535744 (1863.04 GiB 2000.42 GB)
  Used Dev Size : 488383936 (465.76 GiB 500.11 GB)
   Raid Devices : 5
  Total Devices : 4
Preferred Minor : 0
    Persistence : Superblock is persistent
    Update Time : Tue Jan  4 04:28:03 2005
          State : clean, degraded
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0
         Layout : left-symmetric
     Chunk Size : 64K
           UUID : 382c157a:405e0640:c30f9e9e:888a5e63
         Events : 0.337650

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       8       33        2      active sync   /dev/sdc1
       3       0        0        3      removed
       4       8       49        4      active sync   /dev/sdd1

Now, when I try to insert the replacement drive back in, I get the following:

[EMAIL PROTECTED] ~]# mdadm /dev/md0 --add /dev/hdc1
mdadm: add new device failed for /dev/hdc1 as 5: Invalid argument

It seems that mdadm is trying to add the device as number 5 instead of replacing number 3, but I have no idea why, or how to make it replace number 3. Neil has explained to me already that the drive should be added as 5, and then switched to 3 after a rebuild is complete. Neil also asked me if dmesg showed up anything when I tried adding the drive:

[EMAIL PROTECTED] mdadm-2.6.4]# dmesg | tail
...
md: hdc1 has invalid sb, not importing!
md: md_import_device returned -22
md: hdc1 has invalid sb, not importing!
md: md_import_device returned -22

I have updated mdadm to the latest version I can find:

[EMAIL PROTECTED] ~]# mdadm --version
mdadm - v2.6.4 - 19th October 2007

I still get the same error. I'm hoping someone will have some suggestion as to how to sort this out. Backing up nearly 2TB of data isn't really a viable option for me, so I'm quite desperate to get the redundancy back.

My Linux distribution is a relatively new installation from CentOS 5.1 ISOs. The kernel version is:

[EMAIL PROTECTED] ~]# uname -a
Linux space.homenet.com 2.6.18-53.1.4.el5 #1 SMP Fri Nov 30 00:45:55 EST 2007 x86_64 x86_64 x86_64 GNU/Linux

Many Thanks,

Steve.