RAID5 rebuild with a bad source drive fails
Hello linux-raid,

I have a home fileserver with a 6-disk RAID5 array built from old disks on cheap IDE controllers (all disks are IDE masters). As expected, sooner or later the old hardware (and/or cabling) began failing. The array falls apart; right now it has 5 working disks and one marked as a spare (which was working before). The rebuild never completes, because half-way through, one of the working disks hits a set of bad blocks (about 30 of them). When the rebuild process (or the mount process) touches these blocks, I end up with a non-running array: 4 working drives, one failed and one spare. While I can force-assemble the failing drive back into the array, it doesn't help - the rebuild fails again and again.

Question 1: is there a superblock-edit function, or maybe an equivalent manual procedure, which would let me mark the spare drive as a working member of the array? It [mostly] has all the data in correct stripes; at least the event counters are all the same, and it may be in better shape than the drive with bad blocks. Even if I succeeded in editing all the superblocks to believe that the spare disk is okay now, would it help in my data recovery? :)

Question 2: the disk's hardware apparently fails to relocate the bad blocks. Is it possible for the metadevice layer to do the same - remap and/or ignore the bad blocks? In particular, is it possible for Linux md to consider a single block of data as the failed quantum, not the whole partition or disk, and try to use all 6 drives I have to deliver the usable data (at least in some sort of recovery mode)?

--
Best regards,
Jim Klimov mailto:[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-raid in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re[4]: RAID1 submirror failure causes reboot?
] hdc: drive not ready for command
[115518.991827] hdc: lost interrupt
[115518.995201] hdc: task_out_intr: status=0x58 { DriveReady SeekComplete DataRequest }
[115519.003493] ide: failed opcode was: unknown
[115524.009004] hdc: status timeout: status=0xd0 { Busy }
[115524.014441] ide: failed opcode was: unknown
[115524.018988] hdc: drive not ready for command
[115524.114548] ide1: reset: success
[115524.198619] raid1: Disk failure on hdc6, disabling device.
[115524.198624] Operation continuing on 2 devices
[115524.209260] hdc: task_out_intr: status=0x50 { DriveReady SeekComplete }
[115544.209530] hdc: dma_timer_expiry: dma status == 0x20
[115544.215027] hdc: DMA timeout retry
[115544.218669] hdc: timeout waiting for DMA
[115544.461490] hdc: status error: status=0x58 { DriveReady SeekComplete DataRequest }
[115544.469875] ide: failed opcode was: unknown
[115544.474388] hdc: drive not ready for command
[115544.503556] RAID1 conf printout:
[115544.507091] --- wd:1 rd:2
[115544.510120] disk 0, wo:0, o:1, dev:hdd1
[115544.712573] [ cut here ]
[115544.717459] kernel BUG at mm/filemap.c:541!
[115544.721902] invalid opcode: [#1]
[115544.725806] SMP
[115544.728015] Modules linked in: w83781d hwmon_vid i2c_isa i2c_core w83627hf_wdt
[115544.735981] CPU: 0
[115544.735983] EIP: 0060:[c013c436] Not tainted VLI
[115544.735986] EFLAGS: 00010046 (2.6.18.2server #4)
[115544.749248] EIP is at unlock_page+0xf/0x28
[115544.753589] eax: ebx: c1f5d8c0 ecx: eab25680 edx: c1f5d8c0
[115544.760627] esi: f6a78f00 edi: 0001 ebp: esp: c0481e7c
[115544.767703] ds: 007b es: 007b ss: 0068
[115544.772033] Process swapper (pid: 0, ti=c048 task=c040e460 task.ti=c048)
[115544.779512] Stack: eab25674 c017bed6 f6a78f00 c017be7c c016069e 0003 c2e2aa20
[115544.788824] c0115cf1 0020 f6a78f00 3400 f6a78f00 c024f8ac f7f37a00
[115544.798171] f7f37a00 0001 c04fcd94 0001 2c00
[115544.807683] Call Trace:
[115544.810797] [c017bed6] mpage_end_io_read+0x5a/0x6e
[115544.816407] [c017be7c] mpage_end_io_read+0x0/0x6e
[115544.821865] [c016069e] bio_endio+0x5f/0x84
[115544.826749] [c0115cf1] find_busiest_group+0x153/0x4df
[115544.832582] [c024f8ac] __end_that_request_first+0x1ec/0x31c
[115544.838889] [c02b508a] __ide_end_request+0x56/0xe4
[115544.844462] [c02b5159] ide_end_request+0x41/0x65
[115544.849671] [c02bad69] task_end_request+0x37/0x7e
[115544.854950] [c02baed9] task_out_intr+0x84/0xb9
[115544.860128] [c01243bd] del_timer+0x56/0x58
[115544.864839] [c02bae55] task_out_intr+0x0/0xb9
[115544.869760] [c02b6a4b] ide_intr+0xb7/0x132
[115544.874467] [c013a2f4] handle_IRQ_event+0x26/0x59
[115544.879740] [c013a3ae] __do_IRQ+0x87/0xed
[115544.884368] [c0105199] do_IRQ+0x31/0x69
[115544.43] [c0103746] common_interrupt+0x1a/0x20
[115544.894124] [c0100c4f] default_idle+0x35/0x5b
[115544.899233] [c0100d03] cpu_idle+0x7a/0x83
[115544.903844] [c0486810] start_kernel+0x16a/0x194
[115544.908950] [c0486286] unknown_bootoption+0x0/0x1b6
[115544.914384] Code: e8 5e ff ff ff b9 __wake_up+0x38/0x4e
[115545.083930] [c01243bd]

--
Best regards,
Jim Klimov mailto:[EMAIL PROTECTED]
RAID1 submirror failure causes reboot?
: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315718
[87387.701287] ide: failed opcode was: unknown
[87392.564004] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
[87392.572790] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315718
[87392.582454] ide: failed opcode was: unknown
[87392.635961] ide1: reset: success
[87397.528687] hdc: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
[87397.537607] hdc: task_in_intr: error=0x01 { AddrMarkNotFound }, LBAsect=176315718, sector=176315718
[87397.547335] ide: failed opcode was: unknown
[87397.551897] end_request: I/O error, dev hdc, sector 176315718
[87398.520820] raid1: Disk failure on hdc11, disabling device.
[87398.520826] Operation continuing on 1 devices
[87398.531579] blk: request botched
[87398.535098] hdc: task_out_intr: status=0x50 { DriveReady SeekComplete }
[87398.542129] ide: failed opcode was: unknown
[87403.582775] [ cut here ]
[87403.587748] kernel BUG at mm/filemap.c:541!
[87403.592082] invalid opcode: [#1]
[87403.596063] SMP
[87403.598217] Modules linked in: w83781d hwmon_vid i2c_isa i2c_core w83627hf_wdt
[87403.606114] CPU: 0
[87403.606117] EIP: 0060:[c01406a7] Not tainted VLI
[87403.606120] EFLAGS: 00010046 (2.6.18.2debug #1)
[87403.619728] EIP is at unlock_page+0x12/0x2d
[87403.624170] eax: ebx: c2d5caa8 ecx: e8148680 edx: c2d5caa8
[87403.631543] esi: da71c600 edi: 0001 ebp: c04cfe28 esp: c04cfe24
[87403.638924] ds: 007b es: 007b ss: 0068
[87403.643419] Process swapper (pid: 0, ti=c04ce000 task=c041e500 task.ti=c04ce000)
[87403.650774] Stack: e81487e8 c04cfe3c c0180e0a da71c600 c0180dac c04cfe64 c0164af9
[87403.659985] f7d49000 c04cfe84 f2dea5a0 f2dea5a0 da71c600 da71c600
[87403.669288] c04cfea8 c0256778 c041e500 c04cbd90 0046
[87403.678603] Call Trace:
[87403.681462] [c0103bba] show_stack_log_lvl+0x8d/0xaa
[87403.686911] [c0103ddc] show_registers+0x1b0/0x221
[87403.692306] [c0103ffc] die+0x124/0x1ee
[87403.696558] [c0104165] do_trap+0x9f/0xa1
[87403.700988] [c0104427] do_invalid_op+0xa7/0xb1
[87403.706012] [c0103871] error_code+0x39/0x40
[87403.710794] [c0180e0a] mpage_end_io_read+0x5e/0x72
[87403.716154] [c0164af9] bio_endio+0x56/0x7b
[87403.720798] [c0256778] __end_that_request_first+0x1e0/0x301
[87403.726985] [c02568a4] end_that_request_first+0xb/0xd
[87403.732699] [c02bd73c] __ide_end_request+0x54/0xe1
[87403.738214] [c02bd807] ide_end_request+0x3e/0x5c
[87403.743382] [c02c35df] task_error+0x5b/0x97
[87403.748113] [c02c36fa] task_in_intr+0x6e/0xa2
[87403.753120] [c02bf19e] ide_intr+0xaf/0x12c
[87403.757815] [c013e5a7] handle_IRQ_event+0x23/0x57
[87403.763135] [c013e66f] __do_IRQ+0x94/0xfd
[87403.767802] [c0105192] do_IRQ+0x32/0x68
[87403.772278] [c010372e] common_interrupt+0x1a/0x20
[87403.777586] [c0100cfe] cpu_idle+0x7d/0x86
[87403.782184] [c01002b7] rest_init+0x23/0x25
[87403.786869] [c04d4889] start_kernel+0x175/0x19d
[87403.791963] [] 0x0
[87403.795270] Code: ff ff ff b9 0b 00 14 c0 8d 55 dc c7 04 24 02 00 00 00 e8 21 26 25 00 eb dc 55 89 e5 53 89 c3 31 c0 f0 0f b3 03 19 c0 85 c0 75 08 0f 0b 1d 02 6c bf 3b c0 89 d8 e8 34 ff ff ff 89 da 31 c9 e8 24
[87403.819040] EIP: [c01406a7] unlock_page+0x12/0x2d SS:ESP 0068:c04cfe24
[87403.826101] 0Kernel panic - not syncing: Fatal exception in interrupt

--
Best regards,
Jim Klimov mailto:[EMAIL PROTECTED]
Re[2]: RAID1 submirror failure causes reboot?
Hello Neil,

[87398.531579] blk: request botched

NB That looks bad. Possibly some bug in the IDE controller or elsewhere
NB in the block layer. Jens: What might cause that?

NB --snip--

NB That doesn't look like raid was involved. If it was you would expect
NB to see raid1_end_write_request or raid1_end_read_request in that
NB trace.

So that might be the hard or soft part of the IDE layer failing the system, or a PCI problem, for example? Just in case: the motherboard is a Supermicro X5DPE-G2 (E7501 chipset, dual Xeon-533). Rather old now, though way cool back when it was bought ;) But, like anything once new, it could have been manufactured badly. Maybe by now there are some well-known problems with this hardware?

NB Do you have any other partitions of hdc in use but not on raid?
NB Which partition is sector 176315718 in ??

All partitions are mirrored, and this sector is in hdc11:

# fdisk /dev/hdc
...
Command (m for help): u
Changing display/entry units to sectors

Command (m for help): p

Disk /dev/hdc: 102.9 GB, 102935347200 bytes
16 heads, 63 sectors/track, 199450 cylinders, total 201045600 sectors
Units = sectors of 1 * 512 = 512 bytes

   Device Boot      Start        End    Blocks  Id System
/dev/hdc1   *          63    1000943    500440+ fd Linux raid autodetect
/dev/hdc2         1000944    5001695   2000376  fd Linux raid autodetect
/dev/hdc3         5001696   16721711   5860008  83 Linux
/dev/hdc4        16721712  201045599  92161944   5 Extended
/dev/hdc5        16721775   24534719   3906472+ 83 Linux
/dev/hdc6        24534783   28441727   1953472+ 83 Linux
/dev/hdc7        28441791   32348735   1953472+ 83 Linux
/dev/hdc8        32348799   44068751   5859976+ 83 Linux
/dev/hdc9        44068815   51881759   3906472+ 83 Linux
/dev/hdc10       51881823   59694767   3906472+ 83 Linux
/dev/hdc11       59694831  190555343  65430256+ 83 Linux
/dev/hdc12      190555407  201045599   5245096+ 83 Linux

--
Best regards,
Jim Klimov mailto:[EMAIL PROTECTED]
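To double-check the mapping by hand: a sector belongs to the partition whose start/end range contains it. A quick sketch using the sector ranges from the fdisk listing above (the function name is just illustrative; the extended container hdc4 is skipped since its logical children hold the data):

```python
# (start, end) sectors, taken from the `fdisk -u` listing above
PARTITIONS = {
    "hdc1": (63, 1000943),
    "hdc2": (1000944, 5001695),
    "hdc3": (5001696, 16721711),
    "hdc5": (16721775, 24534719),
    "hdc6": (24534783, 28441727),
    "hdc7": (28441791, 32348735),
    "hdc8": (32348799, 44068751),
    "hdc9": (44068815, 51881759),
    "hdc10": (51881823, 59694767),
    "hdc11": (59694831, 190555343),
    "hdc12": (190555407, 201045599),
}

def partition_of(sector):
    """Return the partition whose inclusive sector range contains `sector`."""
    for name, (start, end) in PARTITIONS.items():
        if start <= sector <= end:
            return name
    return None  # gap between partitions, or outside the disk
```

For the failing sector reported by the kernel: `partition_of(176315718)` returns `'hdc11'`, matching the raid1 failure message for hdc11.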
Re: Problem with 3xRAID1 to RAID 0
Hello Vladimir,

Tuesday, July 11, 2006, 11:41:31 AM, you wrote:

VS Hi,
VS I created 3 x RAID1 (/dev/md1 to /dev/md3) which consist of six
VS identical 200GB hdd
VS my mdadm --detail --scan looks like
VS
VS Proteus:/home/vladoportos# mdadm --detail --scan
VS ARRAY /dev/md1 level=raid1 num-devices=2 UUID=d1fadb29:cc004047:aabf2f31:3f044905
VS    devices=/dev/sdb,/dev/sda
VS ARRAY /dev/md2 level=raid1 num-devices=2 UUID=38babb4d:92129d4a:94d659f1:3b238c53
VS    devices=/dev/sdc,/dev/sdd
VS ARRAY /dev/md3 level=raid1 num-devices=2 UUID=a0406e29:c1f586be:6b3381cf:086be0c2
VS    devices=/dev/sde,/dev/sdf
VS ARRAY /dev/md0 level=raid1 num-devices=2 UUID=c04441d4:e15d900e:57903584:9eb5fea6
VS    devices=/dev/hdc1,/dev/hdd1
VS
VS and mdadm.conf
VS
VS DEVICE partitions
VS ARRAY /dev/md4 level=raid0 num-devices=3 UUID=1c8291ba:2d83cf54:2698ce30:e49b1e6c
VS    devices=/dev/md1,/dev/md2,/dev/md3
VS ARRAY /dev/md3 level=raid1 num-devices=2 UUID=a0406e29:c1f586be:6b3381cf:086be0c2
VS    devices=/dev/sde,/dev/sdf
VS ARRAY /dev/md2 level=raid1 num-devices=2 UUID=38babb4d:92129d4a:94d659f1:3b238c53
VS    devices=/dev/sdc,/dev/sdd
VS ARRAY /dev/md1 level=raid1 num-devices=2 UUID=d1fadb29:cc004047:aabf2f31:3f044905
VS    devices=/dev/sda,/dev/sdb
VS ARRAY /dev/md0 level=raid1 num-devices=2 UUID=c04441d4:e15d900e:57903584:9eb5fea6
VS    devices=/dev/hdc1,/dev/hdd1
VS
VS as you can see, I created a RAID0 (md4) from md1-3, and it works fine...
VS but I can't get it back after a reboot - I need to create it again...
VS I don't get why it won't start at boot... has anybody had a similar problem?

I haven't had a problem like this, but taking a wild guess - did you try putting the definitions in mdadm.conf in a different order? In particular, you define md4 before the system knows anything about the devices md[1-3]...
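Here is a sketch of the reordering I mean - the same ARRAY lines (UUIDs copied from your scan output), just with the raid1 components defined before the raid0 that is stacked on top of them, so anything parsing the file top-down sees md1-3 first:

```
DEVICE partitions
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=c04441d4:e15d900e:57903584:9eb5fea6
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=d1fadb29:cc004047:aabf2f31:3f044905
ARRAY /dev/md2 level=raid1 num-devices=2 UUID=38babb4d:92129d4a:94d659f1:3b238c53
ARRAY /dev/md3 level=raid1 num-devices=2 UUID=a0406e29:c1f586be:6b3381cf:086be0c2
ARRAY /dev/md4 level=raid0 num-devices=3 UUID=1c8291ba:2d83cf54:2698ce30:e49b1e6c
```

This is just a guess at the fix, of course - test it with `mdadm -Asc` against a copy of the file before committing it.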
You can speed up the checks (I think) by using something like this instead of rebooting full-scale, except for the last check to see if it all actually works :)

mdadm --stop /dev/md4
mdadm --stop /dev/md3
mdadm --stop /dev/md2
mdadm --stop /dev/md1
mdadm -As
   (or: mdadm -Asc /etc/mdadm.conf.test)

Also, you seem to make the md[1-3] devices from whole disks. Had you made them from partitions, you could:
1) set the partition type to 0xfd, so that a proper kernel could assemble your raid1 sets at boot time and then make md4 correctly even with the current config file;
2) move the submirrors to another disk (e.g. a new, larger one) if you needed to rebuild, upgrade, recover, etc., by just making a new partition of the same size.

Also keep in mind that 200GB (and any other) disks of different models and makers can vary in size by several tens of megabytes... This bit me once with certain 36GB SCSI disks which were somewhat larger than any of the competition, so we had to hunt for the same model to rebuild our array.

A question to the general public: am I wrong? :) Are there any actual bonuses to making RAIDs on whole raw disks?

--
Best regards,
Jim Klimov mailto:[EMAIL PROTECTED]
Re: Real Time Mirroring of a NAS
Hello Andy,

al Can I export NAS B as a SAN or iSCSI target, connect the two machines

Am I right in the assumption that your NASes are Linux boxes? :) Did you take a look at Linux network block devices (nbd/enbd)? They might be what you need: you'd get a raw device on one of the servers to use in a mirror along with a local device. The NBD page mentions some setups for high-availability services where an active server clones itself to a backup server and vice versa, whichever was active most recently. I'm not sure about performance, though...

al with, say, Myrinet cards or 10 GbE TOE cards, mount the NAS B volume on
al NAS A, and create a RAID-1 mirror of the two volumes? Is this kind of
al thing done?

Are you sure you need 10GbE? My experience with a 10-drive 3Ware 8506 array in RAID5 shows that reads from it usually fit within 500-700 Mbit/s. And it's a very busy, popular fileserver, so I guess that's close to the hardware limits of our array.

--
Best regards,
Jim Klimov mailto:[EMAIL PROTECTED]
Re[2]: Recommendations for supported 4-port SATA PCI card ?
Hello Joshua,

JBL That's exactly why I recommended them. 3w- has been in the kernel a
JBL *long* time and is extremely stable. Sure, it's expensive for just a
JBL SATA controller, but not for a solid one that doesn't fall over at random
JBL times.

Just my 2 cents: we have a number of different generations of 3Ware cards (IDE and SATA) in our campus network, mostly with Western Digital drives. It seems quite beneficial to use their RAID Edition (RE) series of drives instead of desktop ones. Besides the longer warranty and, perhaps, better manufacturing and better resistance to vibration, according to the WD site these drives also report errors (if any) to the host adapter more quickly. As a result, errors are handled gracefully by the RAID hardware instead of the whole drive being considered timed out.

In practice, we had some fun months with a RAID5 set made of desktop Caviars rebuilding every once in a while for no apparent reason, with each disk working well for a long time. We have had no such problem with RE disks for over a year now.

Hope this helps some list readers make their choice, and please don't consider this an advertisement for certain brands ;) If any other manufacturer offers capabilities similar to WD RE (especially the error timeouts), please take a closer look if you are considering a hardware RAID controller.

--
Best regards,
Jim Klimov mailto:[EMAIL PROTECTED]
Re[2]: ANNOUNCE: mdadm 2.4 - A tool for managing Soft RAID under Linux
Hello Farkas,

FL that's really a long-awaited feature. but at the same time wouldn't it
FL be finally possible to convert a non-raid partition to a raid1? it's
FL a very common thing and they used to say it even works on Windows :-(

That would be cooler than making a metadevice and copying tons of files :) However, AFAIK, this would require some support on the FS side as well. The FS-addressable space in a RAID metadevice (i.e. a submirror) only extends up to the last-but-one 64KB block. If the partition was used up completely, the last [64..128]KB may be occupied by its data, which would need to be remapped to a free location. And this is quite FS-dependent (like grow/shrink operations) and should be addressed by the respective FS toolkits. tune2fs and e2fsck do a similar remapping job when we enable/disable sparse superblocks on a used filesystem...

--
Best regards,
Jim Klimov mailto:[EMAIL PROTECTED]
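For reference, the [64..128]KB figure comes from where the old v0.90 md superblock lives: at the start of the last 64KB-aligned 64KB block of the device, so the unusable tail at the end of the partition is between 64KB and just under 128KB depending on the partition size. A sketch of the calculation (my reading of the v0.90 layout; treat the helper names as illustrative):

```python
MD_RESERVED_SECTORS = 128  # 64 KiB expressed in 512-byte sectors

def sb_offset_sectors(dev_sectors):
    """v0.90 superblock offset: start of the last 64 KiB-aligned
    64 KiB block of the device."""
    return (dev_sectors & ~(MD_RESERVED_SECTORS - 1)) - MD_RESERVED_SECTORS

def unusable_tail_sectors(dev_sectors):
    """Sectors at the end of the partition that the filesystem inside
    the mirror must not use (superblock plus alignment slack)."""
    return dev_sectors - sb_offset_sectors(dev_sectors)
```

The tail always works out to between 128 and 255 sectors, i.e. between 64KB and just under 128KB, which is exactly the region that an in-place conversion tool would have to vacate before slapping a raid1 superblock onto a full partition.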