RE: Fixing a RAID
Ryan Coleman wrote:
> Jun 4 23:02:28 testserver kernel: ar0: 715425MB HighPoint v3 RocketRAID RAID5 (stripe 64 KB) status: READY
> Jun 4 23:02:28 testserver kernel: ar0: disk0 READY using ad13 at ata6-slave
> Jun 4 23:02:28 testserver kernel: ar0: disk1 READY using ad16 at ata8-master
> Jun 4 23:02:28 testserver kernel: ar0: disk2 READY using ad15 at ata7-slave
> Jun 4 23:02:28 testserver kernel: ar0: disk3 READY using ad17 at ata8-slave
> Jun 4 23:05:35 testserver kernel: g_vfs_done():ar0s1c[READ(offset=501963358208, length=16384)]error = 5
> ...

My guess is that the rebuild failure is due to unreadable sectors on one (or more) of the original three drives.

I recently had this happen to me on an 8 x 1 TB RAID-5 array on a Highpoint RocketRAID 2340 controller. For some unknown reason two drives developed unreadable sectors within hours of each other. To make a long story short, the way I fixed this was to:

1. Used a tool I got from Highpoint tech support to re-init the array information (so the array was no longer marked as broken).
2. Unplugged both drives and hooked them up to another computer using a regular SATA controller.
3. One of the drives was put through a complete recondition cycle (a).
4. The other drive was put through a partial recondition cycle (b).
5. Hooked up both drives to the 2340 controller again. The BIOS immediately marked the array as degraded (because it didn't recognize the wiped drive as part of the array), and I could re-add the wiped drive so a rebuild of the array could start.
6. Finally ran a zpool scrub on the tank, and restored the few files that had checksum errors.

(a) I tried to run a SMART long selftest, but it failed. I then completely wiped the drive by writing zeroes to the entire surface, allowing the firmware to remap the bad sectors. After this procedure the long selftest succeeded. I finally used a diagnostic program from the drive vendor (Western Digital) to again verify that the drive was working properly.
(b) The SMART long selftest failed the first time, but after running a surface scan using the diagnostic program from Western Digital the selftest passed. I'm pretty sure the diagnostic program remapped the bad sector, replacing it with a blank one; at least the program warned me to back up all data before starting the surface scan. Alternatively, I could have used dd (with an offset) to write to just the failed sector (available in the SMART selftest log).

If I were you I would run all three drives through a SMART long selftest. I'm sure you'll find that at least one of them will fail the selftest. Use something like SpinRite 6 to recover the drive, or use dd / dd_rescue to copy the data to a fresh drive. Once all three of the original drives pass a long selftest, the array should be able to finish a rebuild using a fourth (blank) drive.

By the way, don't try to use SpinRite 6 on 1 TB drives; it will fail halfway through with a division-by-zero error. I haven't tried it on any 500 GB drives yet.

/Daniel Eriksson

___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to [EMAIL PROTECTED]
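Daniel's selftest-then-rewrite procedure can be sketched with smartmontools and dd. This is only a sketch: the device name and LBA below are hypothetical placeholders, not values from this thread, and on real hardware the dd step destroys whatever was in that sector. The runnable part demonstrates the dd arithmetic against a file-backed stand-in "drive" so it is safe to execute.

```shell
# On real hardware the procedure would be roughly (placeholders, not thread values):
#   smartctl -t long /dev/ad13        # start the drive's long selftest
#   smartctl -l selftest /dev/ad13    # a failed test reports the first bad LBA
#   dd if=/dev/zero of=/dev/ad13 bs=512 seek=<bad LBA> count=1
# Below, the same dd step is demonstrated against a file-backed "drive".
disk=$(mktemp)
bad_lba=2                                                  # hypothetical LBA

dd if=/dev/urandom of="$disk" bs=512 count=4 2>/dev/null   # fake 4-sector drive

# seek= counts in units of bs (the sector size), and conv=notrunc leaves the
# rest of the "drive" untouched, so exactly one sector is zeroed.
dd if=/dev/zero of="$disk" bs=512 seek="$bad_lba" count=1 conv=notrunc 2>/dev/null

# Verify: the target sector now contains only NUL bytes.
n=$(dd if="$disk" bs=512 skip="$bad_lba" count=1 2>/dev/null | tr -d '\0' | wc -c)
[ $n -eq 0 ] && echo "sector zeroed"
rm -f "$disk"
```

Getting bs wrong is the classic mistake here: seek is measured in bs-sized units, so bs must match the drive's real sector size or the write lands on the wrong sector.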
Re: Fixing a RAID
> I recently had this happen to me on an 8 x 1 TB RAID-5 array on a
> Highpoint RocketRAID 2340 controller. For some unknown reason two
> drives developed unreadable sectors within hours of each other. To make
> a long story short, the way I fixed this was to:

Not FreeBSD related, so you can delete now if not interested...

We had a 1.5TB NetApp filer at my previous place. It was originally backed up by another 1.5TB filer taking snapshots every few hours. After a few years, the customer decided it was too safe, so they used the 2nd filer for something else. A month later, we had a double disk failure in the same volume.

The NetApp freaked out and rebooted, but when it did it marked one disk dead and the other as fine. Since there was a hot spare, it started to attempt a rebuild. It took 9 hours for a 72G disk, and the half-failed drive sounded like it was putting the head through the media with lead shot in it. The filer performed at about half speed during that time.

The SECOND that it finished and the software claimed the array was in optimal mode, we immediately pulled the bad disk out and replaced it with a fresh disk. That rebuild went fine. Pulled the failed disk, and put another disk in for hot spare.

Not sure if it's a testimony to NetApp, or to our and the customer's luck. They had specifically not wanted backups, and rebuilding the data would have taken months, many man-hours, and loss of revenue to the site.

Ever since then, I try to get disks made at different times and from different batches. You figure that if they were MADE around the same time, they will most likely DIE around the same time. :)

Tuc
Re: Fixing a RAID
Ryan Coleman wrote:
> Is there a way to figure out what order drives were supposed to go in
> for a RAID 5? Using a hex tool?

Do you mean that you physically unplugged them, and they were not labeled? What kind of disk controller is it?

Technically, AFAIK, the order should not matter. The stripe on the disk should know what is where and simply run with it. In practice, however...

> I have time to figure all this out.

What happens when you try it? Is FreeBSD in use in any form or fashion at all on these drives, or is this a generalized hardware question?

Steve
Re: Fixing a RAID
Ryan Coleman wrote:
> Is there a way to figure out what order drives were supposed to go in
> for a RAID 5? Using a hex tool?

> Do you mean that you physically unplugged them, and they were not
> labeled? What kind of disk controller is it?
>
> Technically, AFAIK, the order should not matter. The stripe on the disk
> should know what is where and simply run with it. In practice
> however... What happens when you try it? Is FreeBSD in use in any form
> or fashion at all on these drives, or is this a generalized hardware
> question?
>
> Steve

It's a HighPoint pATA controller. One drive went kaput, so I replaced it with another 250G drive and went to rebuild, and it wouldn't go. The drive itself wasn't actually dead; I ran some tests on it and it spun up OK in an enclosure and then in another machine. So I tried to put the drive back on the array and it doesn't believe in having data anymore.

This is a 4x250G R5 (so ~750G logical) that does have data on it that I would very much like to recover somehow. I know this is very likely a fruitless endeavor, I just need to try. OnTrack and other recovery places are just too expensive for this.

I can dig up the old logs (I think) from when she was firing errors two weeks ago. The drive was formatted UFS2 as one large logical drive in sysinstall. Hope that's helpful.

Thanks for the response.

--
Ryan
Re: Fixing a RAID
Ryan Coleman wrote:
> It's a HighPoint pATA controller, one drive went kaput so I replaced it
> with another 250G drive and went to rebuild and it wouldn't go. The
> drive itself wasn't actually dead, I did some running tests on it and
> it spun up OK in an enclosure and then in another machine. So I tried
> to put the drive back on the array and it doesn't believe in having
> data anymore.

Ok. The errors you were witnessing after attempting to re-insert it into the controller: were they generated at BIOS level within the controller bootup, or in FreeBSD? I'm completely assuming that your running OS was ON these disks, so the former is true.

> This is a 4x250G R5 (so ~750G logical) that does have data on it that I
> would very much like to recover somehow. I know this is very likely a
> fruitless endeavor,

Ah, ah ah, never say never, ever.

> I just need to try. OnTrack and other recovery places are just too
> expensive for this.

Recover from backup ;) I'm kidding. It's too late for that, isn't it. Read on...

> I can dig up the old logs (I think) from when she was firing errors two
> weeks ago.

Yes. Post the logs. If they are extensive, perhaps you could email them off-list, with a notice to the list that you have them in the event others would like to review them as well.

> The drive was formatted UFS2 as one large logical drive in sysinstall.

...so if I understand correctly, you had a RAID-5 with three operational physical disks, and one hot spare?

Steve
Re: Fixing a RAID
Ryan Coleman wrote:

Oh, I completely forgot to ask... Does the RAID still operate even though one disk is bad? After all, that is the purpose of RAID-5: stripe, with parity. One fails, the other two (or N) keep right on going...

Or, is it a RAID-5 card that you put into operation as a RAID-0 span? If the latter is the case, good luck ;)

Steve
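Steve's "stripe, with parity" point is just XOR arithmetic. As a minimal sketch (toy byte values, nothing from the actual array in question): RAID-5 stores, per stripe, the XOR of the data blocks as parity, so any one lost block can be rebuilt by XORing the survivors with the parity.

```shell
# Toy demo with three single-byte "blocks" (values are arbitrary):
d0=53; d1=172; d2=9
parity=$(( d0 ^ d1 ^ d2 ))

# "Lose" d1, then rebuild it by XORing the surviving blocks with parity:
rebuilt=$(( d0 ^ d2 ^ parity ))
echo "lost=$d1 rebuilt=$rebuilt"   # prints: lost=172 rebuilt=172
```

This is also why losing a second drive before the rebuild finishes is fatal: with two unknowns per stripe, one XOR equation is no longer enough to solve for either.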
Re: Fixing a RAID
Ryan Coleman wrote:
> ...so if I understand correctly, you had a RAID-5 with three
> operational physical disks, and one hot spare?
>
> Steve

Actually, this was temporary data storage before I got my massive 7TB RAID purchased and built. But it crashed out 2 days before the new array arrived. You'll see the errors below. I couldn't even run a find(1) on it.

It was 4 disks that made a 714G functional drive, no hot spare; I didn't have the disks for it at the time, but I do now. The g_vfs_done() errors threw me a bad thought, and my tech said "that's a bad sign, you're toast" and left me hanging. I know more than enough about BSD to get around and tech, but RAIDs are not something I have a lot of experience with.

[EMAIL PROTECTED] /var/log]# more messages.0 | grep 'ar0'
May 31 17:25:18 testserver kernel: ar0: 715425MB HighPoint v3 RocketRAID RAID5 (stripe 64 KB) status: READY
May 31 17:25:18 testserver kernel: ar0: disk0 READY using ad13 at ata6-slave
May 31 17:25:18 testserver kernel: ar0: disk1 READY using ad16 at ata8-master
May 31 17:25:18 testserver kernel: ar0: disk2 READY using ad15 at ata7-slave
May 31 17:25:18 testserver kernel: ar0: disk3 READY using ad17 at ata8-slave
Jun 4 22:35:45 testserver kernel: ar0: 715425MB HighPoint v3 RocketRAID RAID5 (stripe 64 KB) status: READY
Jun 4 22:35:45 testserver kernel: ar0: disk0 READY using ad13 at ata6-slave
Jun 4 22:35:45 testserver kernel: ar0: disk1 READY using ad16 at ata8-master
Jun 4 22:35:45 testserver kernel: ar0: disk2 READY using ad15 at ata7-slave
Jun 4 22:35:45 testserver kernel: ar0: disk3 READY using ad17 at ata8-slave
Jun 4 22:58:09 testserver kernel: ar0: 715425MB HighPoint v3 RocketRAID RAID5 (stripe 64 KB) status: READY
Jun 4 22:58:09 testserver kernel: ar0: disk0 READY using ad13 at ata6-slave
Jun 4 22:58:09 testserver kernel: ar0: disk1 READY using ad16 at ata8-master
Jun 4 22:58:09 testserver kernel: ar0: disk2 READY using ad15 at ata7-slave
Jun 4 22:58:09 testserver kernel: ar0: disk3 READY using ad17 at ata8-slave
Jun 4 23:02:28 testserver kernel: ar0: 715425MB HighPoint v3 RocketRAID RAID5 (stripe 64 KB) status: READY
Jun 4 23:02:28 testserver kernel: ar0: disk0 READY using ad13 at ata6-slave
Jun 4 23:02:28 testserver kernel: ar0: disk1 READY using ad16 at ata8-master
Jun 4 23:02:28 testserver kernel: ar0: disk2 READY using ad15 at ata7-slave
Jun 4 23:02:28 testserver kernel: ar0: disk3 READY using ad17 at ata8-slave
Jun 4 23:05:35 testserver kernel: g_vfs_done():ar0s1c[READ(offset=501963358208, length=16384)]error = 5
Jun 4 23:05:35 testserver kernel: g_vfs_done():ar0s1c[READ(offset=397138788352, length=16384)]error = 5
Jun 4 23:05:35 testserver kernel: g_vfs_done():ar0s1c[READ(offset=585206398976, length=16384)]error = 5
Jun 4 23:05:35 testserver kernel: g_vfs_done():ar0s1c[READ(offset=360527265792, length=16384)]error = 5
Jun 4 23:05:35 testserver kernel: g_vfs_done():ar0s1c[READ(offset=279018455040, length=16384)]error = 5
Jun 4 23:05:35 testserver kernel: g_vfs_done():ar0s1c[READ(offset=674808283136, length=16384)]error = 5
Jun 4 23:10:06 testserver kernel: g_vfs_done():ar0s1c[READ(offset=501963358208, length=16384)]error = 5
Jun 4 23:10:06 testserver kernel: g_vfs_done():ar0s1c[READ(offset=397138788352, length=16384)]error = 5
Jun 4 23:10:06 testserver kernel: g_vfs_done():ar0s1c[READ(offset=585206398976, length=16384)]error = 5
Jun 4 23:10:06
Re: Fixing a RAID
Ryan Coleman wrote:
> Or, is it a RAID-5 card that you put into operation as a RAID-0 span?
> If the latter is the case, good luck ;)

No, I'm not that stupid. :) At my old job, we had the big LaCie drives, and when one of the 4 250Gs in one failed, they were f*ed. So I went to replace the drive right away so I wouldn't be in that situation. When I went to rebuild in the BIOS, it failed at 2%, no matter what 250G drive I put in to fill the spot.
Re: Fixing a RAID
Ryan Coleman wrote:
> and my tech said "that's a bad sign, you're toast" and left me hanging.

Knowing you spanned the drives without parity or backup, there is no need for me to review the errors. I agree with your tech. Unless there is a miracle (or you outsource the entire array to a recovery location), good luck. Sorry I couldn't be more help.

FYI... when you span drives, every drive you add multiplies the chance that a single failure takes out the whole set. I have done low-level disk data recovery before, but describing it is beyond what I can do via email. Even then, that disk recovery still relied on the heads being able to read off the platter.

If I were you, I'd consider your backup strategy now for that 7TB array you are building. That's a lot of data. You need to be able to go back more than one day. If nobody else has a suggestion to retrieve the info, you will send it away.

Steve
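Steve's point about spanning can be made concrete. As a rough sketch (the 3% per-drive failure probability is an assumed illustrative number, not anything from this thread), a parity-less span survives only if every member survives, so P(loss) = 1 - (1 - p)^N:

```shell
# P(loss) = 1 - (1 - p)^N for an N-drive span with no parity,
# where p is the per-drive failure probability (3% assumed here).
awk 'BEGIN {
    p = 0.03
    for (n = 1; n <= 8; n++)
        printf "%d drive(s): %.1f%% chance of losing the span\n", n, (1 - (1 - p)^n) * 100
}'
```

With these numbers the risk climbs from 3.0% for one drive to 21.6% for eight, which is the arithmetic behind buying drives from different production batches: correlated failures make the effective p much worse.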
Re: Fixing a RAID
Ryan Coleman wrote:
>> Or, is it a RAID-5 card that you put into operation as a RAID-0 span?
>> If the latter is the case, good luck ;)
>
> No, I'm not that stupid. :) My old job, we had the big LaCie drives and
> one of the 4 250Gs in it would fail and they were f*ed. I went to
> replace the drive right away so I wouldn't be in that situation. When I
> went to rebuild in the BIOS it failed at 2%, no matter what 250G drive
> I put in to fill the spot.

Hrm... I didn't mean to call you stupid. I was asking a question, and laying out info for others who may not know as they follow the thread...

Besides... if you are seriously considering a 7TB storage facility, then you already know that building a proper RAID solution should include controllers that are hot-swappable, and will rebuild the array either as soon as you pop a new drive in, or with a hot spare, without having to reboot and waste three hours rebuilding via BIOS software.

Steve
Re: Fixing a RAID
Ryan Coleman wrote:
> Hrm... I didn't implicitly attempt to call you stupid. I was asking a
> question, and laying out info for others that may not know as they
> follow the thread...
>
> Besides... if you are seriously considering a 7TB storage facility,
> then you already know that building a proper RAID solution should
> include controllers that are hot-swappable, and will rebuild the array
> either as soon as you pop a new drive in, or with a hot-spare, without
> having to reboot and waste three hours rebuilding via a BIOS software.

I didn't mean to make it seem like you did, I just wanted to say I'm no fool :)

I can rebuild with HighPoint's web interface when necessary, and I hope to upgrade the controller from the 8-port I have to a 12-port in the next year, thus putting a spare in the case. I have two extra drives still in their bags, so if something does happen I won't have to wait days to get a replacement drive in.

I'm sorry I implied that you called me stupid. It's just been a struggle with this one machine for the last few weeks.
Re: [freebsd-questions] Re: Fixing a RAID
Ryan Coleman wrote:
> No, I'm not that stupid. :) My old job, we had the big LaCie drives and
> one of the 4 250Gs in it would fail and they were f*ed. I went to
> replace the drive right away so I wouldn't be in that situation. When I
> went to rebuild in the BIOS it failed at 2%, no matter what 250G drive
> I put in to fill the spot.

I had that happen on a 4-disk (36G each) RAID-5 (I forget the controller). No matter what disk I put in to replace a failed one, it wouldn't take. 3 drives, exact model, different production dates... none took. I futzed and futzed and finally decided to declare the cage bad and think of backout procedures. About 2 hours after I had set another machine up to take its place, it started giving spurious errors and fell over.

I pulled the machine out of the datacenter, cleared out the RAID config, and went to rebuild with just the 3 drives. It wouldn't build a fresh RAID-5 from just the 3 disks. After a round of "which one of these things is not like the other", I found that one of the disks apparently still worked, but caused heck if I put another disk in the slot next to it.

A year later, and I finally decided to buy a few more disks off eBay to see if my final theory is right. I win (hopefully) the auction in 5 days... If the cage really is bad, I previously sourced a new case/cage, and decided that even though it's a 4G dual-Xeon system, I could probably get a faster new system for less.

Tuc
Re: [freebsd-questions] Re: Fixing a RAID
Tuc at T-B-O-H.NET wrote:
> A year later, and I finally decided to buy a few more disks off ebay to
> see if my final theory is right. I win (hopefully) the auction in 5
> days... If the cage really is bad, I previously sourced a new
> case/cage, and decided even though it's a 4G dual-Xeon system I
> probably could get a new system cheaper that's faster.

I would be extremely interested to know whether your diligence in testing your theory pays off in this case. Please post your results ;)

Steve