RE: Fixing a RAID

2008-06-19 Thread Daniel Eriksson
Ryan Coleman wrote:

 Jun  4 23:02:28 testserver kernel: ar0: 715425MB HighPoint v3 RocketRAID RAID5 (stripe 64 KB) status: READY
 Jun  4 23:02:28 testserver kernel: ar0: disk0 READY using ad13 at ata6-slave
 Jun  4 23:02:28 testserver kernel: ar0: disk1 READY using ad16 at ata8-master
 Jun  4 23:02:28 testserver kernel: ar0: disk2 READY using ad15 at ata7-slave
 Jun  4 23:02:28 testserver kernel: ar0: disk3 READY using ad17 at ata8-slave
 Jun  4 23:05:35 testserver kernel: g_vfs_done():ar0s1c[READ(offset=501963358208, length=16384)]error = 5
 ...

My guess is that the rebuild failure is due to unreadable sectors on one
(or more) of the original three drives.
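
To get a rough idea of where such a failing offset lands on the member disks, something like the sketch below may help. It is only an approximation under stated assumptions: a plain RAID-5 layout with one 64 KB chunk per disk per row and one chunk per row used for parity, no controller metadata area, and no allowance for where the ar0s1c partition starts inside ar0. The real HighPoint on-disk layout may differ, and the function name is made up for illustration.

```python
STRIPE = 64 * 1024   # chunk size, from the "stripe 64 KB" log line
DISKS = 4            # member disks reported for ar0

def member_offset(array_offset, stripe=STRIPE, disks=DISKS):
    """Return (row, approximate byte offset on each member disk).

    Assumes each row carries (disks - 1) data chunks plus one parity chunk.
    Which physical disk holds the chunk depends on the parity rotation,
    which is not known here, so only the per-disk offset is computed.
    """
    data_per_row = stripe * (disks - 1)
    row = array_offset // data_per_row
    within_chunk = array_offset % stripe
    return row, row * stripe + within_chunk

row, offset = member_offset(501963358208)
print("row", row, "member byte offset", offset, "member sector", offset // 512)
```

The resulting sector number is only a hint about which region of each drive to compare against the LBAs reported in the SMART selftest logs.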

I recently had this happen to me on an 8 x 1 TB RAID-5 array on a
Highpoint RocketRAID 2340 controller. For some unknown reason two drives
developed unreadable sectors within hours of each other. To make a long
story short, this is how I fixed it:

1. I used a tool I got from Highpoint tech-support to re-init the array
information (so the array was no longer marked as broken).
2. I unplugged both drives and hooked them up to another computer using a
regular SATA controller.
3. I put one of the drives through a complete recondition cycle (a).
4. I put the other drive through a partial recondition cycle (b).
5. I hooked up both drives to the 2340 controller again. The BIOS
immediately marked the array as degraded (because it didn't recognize
the wiped drive as part of the array), and I could re-add the wiped
drive so a rebuild of the array could start.
6. I finally ran a zpool scrub on the tank, and restored the few files
that had checksum errors.

(a) I tried to run a SMART long selftest, but it failed. I then
completely wiped the drive by writing zeroes to the entire surface,
allowing the firmware to remap the bad sectors. After this procedure the
long selftest succeeded. I finally used a diagnostic program from the
drive vendor (Western Digital) to again verify that the drive was
working properly.

(b) The SMART long selftest failed the first time, but after running a
surface scan using the diagnostic program from Western Digital the
selftest passed. I'm pretty sure the diagnostic program remapped the bad
sector, replacing it with a blank one. At least the program warned me to
back up all data before starting the surface scan. Alternatively I could
have used dd (with offset) to write to just the failed sector (available
in the SMART selftest log).
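
The dd alternative mentioned in (b) could look roughly like the sketch below. It only builds and prints the command instead of running it, because the write destroys whatever that sector held. The device name and the LBA are placeholders, 512-byte sectors are assumed, and the helper name is made up; the real LBA would be taken from the LBA_of_first_error column of the SMART selftest log.

```python
import shlex

def dd_overwrite_sector(device, lba, sector_size=512):
    """Build a dd command that writes zeroes over exactly one sector."""
    return [
        "dd",
        "if=/dev/zero",
        f"of={device}",
        f"bs={sector_size}",
        f"seek={lba}",     # seek= counts output blocks of size bs
        "count=1",
    ]

# Placeholder values, for illustration only.
cmd = dd_overwrite_sector("/dev/adX", 123456789)
print(shlex.join(cmd))
```

Double-check the device name before running anything like this, since a wrong `of=` target overwrites a sector on the wrong disk.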


If I were you I would run all three drives through a SMART long
selftest. I'm sure you'll find that at least one of them will fail the
selftest. Use something like SpinRite 6 to recover the drive, or use dd
/ dd_rescue to copy the data to a fresh drive. Once all three of the
original drives pass a long selftest the array should be able to finish
a rebuild using a fourth (blank) drive.
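
For anyone curious what the dd / dd_rescue style copy amounts to, here is a minimal sketch of the idea: read the source block by block and write zeroes in place of any block that cannot be read, so everything after it stays at the right offset. This is an illustration, not a replacement for dd_rescue (which retries, shrinks the block size around errors, and so on). The function name and block size are arbitrary choices, and the size detection shown here is only known to work on regular files and image files; how it behaves on raw device nodes varies by OS.

```python
import os

def rescue_copy(src_path, dst_path, block_size=64 * 1024):
    """Copy src to dst block by block, zero-filling blocks that fail to read.

    Returns the list of byte offsets whose blocks could not be read.
    """
    bad_blocks = []
    with open(src_path, "rb", buffering=0) as src, open(dst_path, "wb") as dst:
        size = src.seek(0, os.SEEK_END)
        for offset in range(0, size, block_size):
            want = min(block_size, size - offset)
            try:
                src.seek(offset)
                data = src.read(want) or b""
            except OSError:
                data = b""
                bad_blocks.append(offset)
            # Pad short or failed reads so later data keeps its offset.
            dst.write(data.ljust(want, b"\0"))
    return bad_blocks
```

The returned offsets show which regions were zero-filled, which is roughly the set of files worth checking after the array has been rebuilt.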

By the way, don't try to use SpinRite 6 on 1 TB drives; in my experience
it fails halfway through with a division-by-zero error. I haven't tried it
on any 500 GB drives yet.

/Daniel Eriksson
___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Fixing a RAID

2008-06-19 Thread Tuc at T-B-O-H
 
 I recently had this happen to me on an 8 x 1 TB RAID-5 array on a
 Highpoint RocketRAID 2340 controller. For some unknown reason two drives
 developed unreadable sectors within hours of each other. To make a long
 story short, the way I fixed this was to:
 
Not FreeBSD related, so you can delete now if not interested...

We had a 1.5TB NetApp filer at my previous place. It was originally
backed up by another 1.5TB filer taking snapshots every few hours. After
a few years, the customer decided it was too safe so they used the 2nd
filer for something else. A month later, we had a double disk failure in the
same volume.

The NetApp freaked out and rebooted, but when it did it marked one
disk dead, and the other as fine. Since there was a hot spare, it started
to attempt a rebuild. It took 9 hours for a 72G disk, and the 1/2 failed
drive sounded like it was putting the head through the media with lead shot
in it. The filer performed at about 1/2 speed during that time. The SECOND
that it finished, and the software claimed that the array was in optimal 
mode, we immediately pulled the bad disk out and replaced it with a fresh
disk. That rebuild went fine. Pulled the failed disk, and put another disk
in for hot spare.

Not sure if it's a testament to NetApp, or to our and the customer's
luck. They had specifically not wanted backups, and rebuilding the data
would have taken months, many man hours, and loss of revenue to the site. 

Ever since then, I try to get disks made at different times and
different batches. You figure that if they were MADE around the same time,
they will most likely DIE around the same time. :)

Tuc


Re: Fixing a RAID

2008-06-18 Thread Steve Bertrand

Ryan Coleman wrote:

> Is there a way to figure out what order drives were supposed to go in for
> a RAID 5? Using a hex tool?

Do you mean that you physically unplugged them, and they were not labeled?

What kind of disk controller is it?

Technically, AFAIK, the order should not matter. The stripe on the disk
should know what is where and simply run with it. In practice however...

> I have time to figure all this out.


What happens when you try it?

Is FreeBSD in use in any form or fashion at all on these drives, or is 
this a generalized hardware question?


Steve


Re: Fixing a RAID

2008-06-18 Thread Ryan Coleman

 Ryan Coleman wrote:
 Is there a way to figure out what order drives were supposed to go in
 for
 a RAID 5? Using a hex tool?

 Do you mean that you physically unplugged them, and they were not labeled?

 What kind of disk controller is it?

 Technically, AFAIK, the order should not matter. The stripe on the disk
 should know what is where and simply run with it. In practice however...

 I have time to figure all this out.

 What happens when you try it?

 Is FreeBSD in use in any form or fashion at all on these drives, or is
 this a generalized hardware question?

 Steve


It's a HighPoint pATA controller. One drive went kaput, so I replaced it
with another 250G drive and went to rebuild, and it wouldn't go. The drive
itself wasn't actually dead: I ran some tests on it and it spun up OK in an
enclosure and then in another machine. So I tried to put the drive back on
the array, and the controller doesn't believe it has data on it anymore.

This is a 4x250G R5 (so ~750G logical) that does have data on it that I
would very much like to recover somehow. I know this is very likely a
fruitless endeavor, I just need to try. OnTrack and other recovery places
are just too expensive for this. I can dig up the old logs (I think) from
when she was firing errors two weeks ago. The drive was formatted UFS2 as
one large logical drive in sysinstall.
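
For reference, the "~750G logical" figure follows from the usual RAID-5 capacity rule: one disk's worth of space goes to parity. A small sketch of the arithmetic, assuming each drive holds exactly 250 * 10^9 bytes (real drives are usually a little larger than nominal, which presumably explains why the kernel reports 715425MB for ar0 rather than the value computed here):

```python
def raid5_usable_bytes(disks, disk_bytes):
    """Usable capacity of RAID-5: (disks - 1) disks' worth of data."""
    return (disks - 1) * disk_bytes

usable = raid5_usable_bytes(4, 250 * 10**9)   # nominal drive size, an assumption
print(usable // 10**9, "GB (decimal)")
print(usable // 2**20, "MB (binary, as the kernel counts)")
```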

Hope that's helpful.
Thanks for the response.
--
Ryan


Re: Fixing a RAID

2008-06-18 Thread Steve Bertrand

Ryan Coleman wrote:

>> Ryan Coleman wrote:
>>
>>> Is there a way to figure out what order drives were supposed to go in
>>> for a RAID 5? Using a hex tool?
>>
>> Do you mean that you physically unplugged them, and they were not labeled?
>>
>> What kind of disk controller is it?
>
> It's a HighPoint pATA controller, one drive went kaput so I replaced it
> with another 250G drive and went to rebuild and it wouldn't go. The drive
> itself wasn't actually dead, I did some running tests on it and it spun up
> OK in an enclosure and then in another machine. So I tried to put the
> drive back on the array and it doesn't believe in having data anymore.

Ok. The errors you were witnessing after attempting to re-insert it into
the controller: were they generated at BIOS level within the controller
bootup, or in FreeBSD? I'm completely assuming that your running OS was
ON these disks, so the former is true.

> This is a 4x250G R5 (so ~750G logical) that does have data on it that I
> would very much like to recover somehow. I know this is very likely a
> fruitless endeavor,

ah, ah ah, never say never, ever.

> I just need to try. OnTrack and other recovery places
> are just too expensive for this.

Recover from backup ;)

I'm kidding. It's too late for that, isn't it? Read on...

> I can dig up the old logs (I think) from
> when she was firing errors two weeks ago.

Yes. Post the logs. If they are extensive, perhaps you could email them
off-list, with a notice to the list that you have them in the event
others would like to review them as well.

> The drive was formatted UFS2 as
> one large logical drive in sysinstall.

..so if I understand correctly, you had a RAID-5 with three operational
physical disks, and one hot spare?


Steve


Re: Fixing a RAID

2008-06-18 Thread Steve Bertrand

Ryan Coleman wrote:

Ryan Coleman wrote:


Oh, I completely forgot to ask...

Does the RAID still operate even though one disk is bad?

After all, that is the purpose of RAID-5: striping with parity. If one
disk fails, the others keep right on going...
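
For readers following along, the parity idea can be shown in a few lines. This is a toy sketch with made-up 4-byte chunks, not how any particular controller lays out data: the parity chunk is the XOR of the data chunks in a row, so any one missing chunk can be recomputed from the rest.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

data = [b"AAAA", b"BBBB", b"CCCC"]   # three data chunks in one stripe row
parity = xor_blocks(data)            # what the fourth disk would hold

# Lose the second chunk, then rebuild it from the survivors plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```

Note that the rebuild has to read every surviving chunk in the row, which is why an unreadable sector on one of the remaining disks can stop a rebuild.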


Or, is it a RAID-5 card that you put into operation as a RAID-0 span?

If the latter is the case, good luck ;)

Steve


Re: Fixing a RAID

2008-06-18 Thread Ryan Coleman

 Ryan Coleman wrote:
 Ryan Coleman wrote:
 Is there a way to figure out what order drives were supposed to go in
 for
 a RAID 5? Using a hex tool?
 Do you mean that you physically unplugged them, and they were not
 labeled?

 What kind of disk controller is it?

 It's a HighPoint pATA controller, one drive went kaput so I replaced it
 with another 250G drive and went to rebuild and it wouldn't go. The
 drive
 itself wasn't actually dead, I did some running tests on it and it spun
 up
 OK in an enclosure and then in another machine. So I tried to put the
 drive back on the array and it doesn't believe in having data anymore.

 Ok. The errors you were witnessing after attempting to re-insert it into
 the controller, were they generated at BIOS level within the controller
 bootup, or in FreeBSD. I'm completely assuming that your running OS was
 ON these disks, so the former is true.

 This is a 4x250G R5 (so ~750G logical) that does have data on it that I
 would very much like to recover somehow. I know this is very likely a
 fruitless endeavor,

 ah, ah ah, never say never, ever.

 I just need to try. OnTrack and other recovery places
 are just too expensive for this.

 Recover from backup ;)

 I'm kidding. It's too late for that, isn't it. read on...

 I can dig up the old logs (I think) from
 when she was firing errors two weeks ago.

 Yes. Post the logs. If they are extensive, perhaps you could email them
 off-list, with a notice to the list that you have them in the event
 others would like to review them as well.

 The drive was formatted UFS2 as
 one large logical drive in sysinstall.

 ..so if I understand correctly, you had a RAID-5 with three operational
 physical disks, and one hot spare?

 Steve


Actually, this was temporary data storage until I got my massive 7TB
RAID purchased and built, but it crashed 2 days before the new array
arrived. The errors are below. I couldn't even run a find(1) on it.

It was 4 disks that made a 714G functional drive with no hot spare; I
didn't have the disks for one at the time, but I do now. The g_vfs_done()
errors worried me, and my tech said "that's a bad sign, you're toast" and
left me hanging. I know more than enough about BSD and tech in general to
get around, but RAIDs are not something I have a lot of experience with.


[EMAIL PROTECTED] /var/log]# more messages.0 | grep 'ar0'
May 31 17:25:18 testserver kernel: ar0: 715425MB HighPoint v3 RocketRAID
RAID5 (stripe 64 KB) status: READY
May 31 17:25:18 testserver kernel: ar0: disk0 READY using ad13 at ata6-slave
May 31 17:25:18 testserver kernel: ar0: disk1 READY using ad16 at ata8-master
May 31 17:25:18 testserver kernel: ar0: disk2 READY using ad15 at ata7-slave
May 31 17:25:18 testserver kernel: ar0: disk3 READY using ad17 at ata8-slave
Jun  4 22:35:45 testserver kernel: ar0: 715425MB HighPoint v3 RocketRAID
RAID5 (stripe 64 KB) status: READY
Jun  4 22:35:45 testserver kernel: ar0: disk0 READY using ad13 at ata6-slave
Jun  4 22:35:45 testserver kernel: ar0: disk1 READY using ad16 at ata8-master
Jun  4 22:35:45 testserver kernel: ar0: disk2 READY using ad15 at ata7-slave
Jun  4 22:35:45 testserver kernel: ar0: disk3 READY using ad17 at ata8-slave
Jun  4 22:58:09 testserver kernel: ar0: 715425MB HighPoint v3 RocketRAID
RAID5 (stripe 64 KB) status: READY
Jun  4 22:58:09 testserver kernel: ar0: disk0 READY using ad13 at ata6-slave
Jun  4 22:58:09 testserver kernel: ar0: disk1 READY using ad16 at ata8-master
Jun  4 22:58:09 testserver kernel: ar0: disk2 READY using ad15 at ata7-slave
Jun  4 22:58:09 testserver kernel: ar0: disk3 READY using ad17 at ata8-slave
Jun  4 23:02:28 testserver kernel: ar0: 715425MB HighPoint v3 RocketRAID
RAID5 (stripe 64 KB) status: READY
Jun  4 23:02:28 testserver kernel: ar0: disk0 READY using ad13 at ata6-slave
Jun  4 23:02:28 testserver kernel: ar0: disk1 READY using ad16 at ata8-master
Jun  4 23:02:28 testserver kernel: ar0: disk2 READY using ad15 at ata7-slave
Jun  4 23:02:28 testserver kernel: ar0: disk3 READY using ad17 at ata8-slave
Jun  4 23:05:35 testserver kernel:
g_vfs_done():ar0s1c[READ(offset=501963358208, length=16384)]error = 5
Jun  4 23:05:35 testserver kernel:
g_vfs_done():ar0s1c[READ(offset=397138788352, length=16384)]error = 5
Jun  4 23:05:35 testserver kernel:
g_vfs_done():ar0s1c[READ(offset=585206398976, length=16384)]error = 5
Jun  4 23:05:35 testserver kernel:
g_vfs_done():ar0s1c[READ(offset=360527265792, length=16384)]error = 5
Jun  4 23:05:35 testserver kernel:
g_vfs_done():ar0s1c[READ(offset=279018455040, length=16384)]error = 5
Jun  4 23:05:35 testserver kernel:
g_vfs_done():ar0s1c[READ(offset=674808283136, length=16384)]error = 5
Jun  4 23:10:06 testserver kernel:
g_vfs_done():ar0s1c[READ(offset=501963358208, length=16384)]error = 5
Jun  4 23:10:06 testserver kernel:
g_vfs_done():ar0s1c[READ(offset=397138788352, length=16384)]error = 5
Jun  4 23:10:06 testserver kernel:
g_vfs_done():ar0s1c[READ(offset=585206398976, length=16384)]error = 5
Jun  4 23:10:06 
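
Since the same handful of offsets repeats in the log, it can help to boil it down to the distinct failing reads. The sketch below does that with a regular expression; the sample text is a few lines copied from the log above, and the function name is made up. (The "error = 5" in these lines is errno 5, EIO, an I/O error.)

```python
import re

# A few lines copied from the log above; wrapped lines are fine for the regex.
LOG = """\
Jun  4 23:05:35 testserver kernel:
g_vfs_done():ar0s1c[READ(offset=501963358208, length=16384)]error = 5
Jun  4 23:05:35 testserver kernel:
g_vfs_done():ar0s1c[READ(offset=397138788352, length=16384)]error = 5
Jun  4 23:10:06 testserver kernel:
g_vfs_done():ar0s1c[READ(offset=501963358208, length=16384)]error = 5
"""

PATTERN = re.compile(
    r"g_vfs_done\(\):(\S+?)\[READ\(offset=(\d+), length=(\d+)\)\]error = (\d+)"
)

def failing_reads(text):
    """Return the distinct (device, offset, length) tuples in first-seen order."""
    seen = {}
    for dev, offset, length, _errno in PATTERN.findall(text):
        seen.setdefault((dev, int(offset), int(length)), None)
    return list(seen)

for dev, offset, length in failing_reads(LOG):
    print(dev, offset, length)
```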

Re: Fixing a RAID

2008-06-18 Thread Ryan Coleman

 Ryan Coleman wrote:
 Ryan Coleman wrote:

 Oh, I completely forgot to ask...

 Does the RAID still operate even though one disk is bad?

 After all, that is the purpose of RAID-5. stripe, with parity. One
 fails, the other two (or N) keep right on going...

 Or, is it a RAID-5 card that you put into operation as a RAID-0 span?

 If the latter is the case, good luck ;)

No, I'm not that stupid. :) At my old job we had the big LaCie drives;
one of the four 250Gs in it would fail and they were f*ed. I went to replace
the drive right away so I wouldn't be in that situation.

When I went to rebuild in the BIOS it failed at 2%, no matter what 250G
drive I put in to fill the spot.


Re: Fixing a RAID

2008-06-18 Thread Steve Bertrand

Ryan Coleman wrote:

Ryan Coleman wrote:



> and my tech said that's a bad sign, you're toast
> and left me hanging.


Knowing you spanned the drives without parity or backup, there is no 
need for me to review the errors.


I agree with your tech. Unless there is a miracle (or you outsource the 
entire array to a recovery location), good luck.


Sorry I couldn't be more help.

FYI...when you span drives without parity, every drive is a single point
of failure, so the odds of losing the whole array climb with each drive
you add.
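
To put a rough number on that: if each drive fails independently with probability p over some period, a span with no redundancy survives only if every drive survives. The sketch below uses a made-up p of 5% purely for illustration; real drives do not fail independently (same batch, same enclosure), so treat it as a lower bound on the worry, not a prediction.

```python
def p_span_loss(p_disk, disks):
    """Chance that at least one of `disks` independent drives fails."""
    return 1 - (1 - p_disk) ** disks

for n in (1, 2, 4, 8):
    # 0.05 is a hypothetical per-drive failure probability.
    print(n, "drives:", round(p_span_loss(0.05, n), 4))
```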


I have done low level disk data recovery before, but describing it is 
beyond what I can do via email. Even still, said disk recovery still 
relied on the ability for the heads to read off the platter.


If I were you, I'd consider your backup strategy now for that 7TB array 
you are building.


That's a lot of data. You need to be able to go back more than one day.

If nobody else has a suggestion to retrieve the info, you will have to
send it away.


Steve


Re: Fixing a RAID

2008-06-18 Thread Steve Bertrand

Ryan Coleman wrote:

Ryan Coleman wrote:

Ryan Coleman wrote:

>> Oh, I completely forgot to ask...
>>
>> Does the RAID still operate even though one disk is bad?
>>
>> After all, that is the purpose of RAID-5. stripe, with parity. One
>> fails, the other two (or N) keep right on going...
>>
>> Or, is it a RAID-5 card that you put into operation as a RAID-0 span?
>>
>> If the latter is the case, good luck ;)

> No, I'm not that stupid. :) My old job, we had the big LaCie drives and
> one of the 4 250Gs in it would fail and they were f*ed. I went to replace
> the drive right away so I wouldn't be in that situation.
>
> When I went to rebuild in the BIOS it failed at 2%, no matter what 250G
> drive I put in to fill the spot.


Hrm... I didn't implicitly attempt to call you stupid. I was asking a 
question, and laying out info for others that may not know as they 
follow the thread...


Besides...if you are seriously considering a 7TB storage facility, then 
you already know that building a proper RAID solution should include 
controllers that are hot-swappable, and will rebuild the array either as 
soon as you pop a new drive in, or with a hot-spare, without having to 
reboot and waste three hours rebuilding via BIOS software.


Steve


Re: Fixing a RAID

2008-06-18 Thread Ryan Coleman

 Ryan Coleman wrote:
 Ryan Coleman wrote:
 Ryan Coleman wrote:
 Oh, I completely forgot to ask...

 Does the RAID still operate even though one disk is bad?

 After all, that is the purpose of RAID-5. stripe, with parity. One
 fails, the other two (or N) keep right on going...

 Or, is it a RAID-5 card that you put into operation as a RAID-0 span?

 If the latter is the case, good luck ;)

 No, I'm not that stupid. :) My old job, we had the big LaCie drives and
 one of the 4 250Gs in it would fail and they were f*ed. I went to
 replace
 the drive right away so I wouldn't be in that situation.

 When I went to rebuild in the BIOS it failed at 2%, no matter what 250G
 drive I put in to fill the spot.

 Hrm... I didn't implicitly attempt to call you stupid. I was asking a
 question, and laying out info for others that may not know as they
 follow the thread...

 Besides...if you are seriously considering a 7TB storage facility, then
 you already know that building a proper RAID solution should include
 controllers that are hot-swappable, and will rebuild the array either as
 soon as you pop a new drive in, or with a hot-spare, without having to
 reboot and waste three hours rebuilding via a BIOS software.



I didn't mean to make it seem like you did, I just wanted to say I'm no
fool :)

I can rebuild with HighPoint's web interface when necessary, and I hope to
upgrade the controller from the 8-port I have to a 12-port in the next
year, thus putting a spare in the case. I have two extra drives still in
their bags, so if something does happen I don't have to wait days to get a
replacement drive in.

I'm sorry I implied that you called me stupid. It's just been a struggle
with this one machine for the last few weeks.


Re: [freebsd-questions] Re: Fixing a RAID

2008-06-18 Thread Tuc at T-B-O-H.NET
 
 
  Ryan Coleman wrote:
  Ryan Coleman wrote:
 
  Oh, I completely forgot to ask...
 
  Does the RAID still operate even though one disk is bad?
 
  After all, that is the purpose of RAID-5. stripe, with parity. One
  fails, the other two (or N) keep right on going...
 
  Or, is it a RAID-5 card that you put into operation as a RAID-0 span?
 
  If the latter is the case, good luck ;)
 
 No, I'm not that stupid. :) My old job, we had the big LaCie drives and
 one of the 4 250Gs in it would fail and they were f*ed. I went to replace
 the drive right away so I wouldn't be in that situation.
 
 When I went to rebuild in the BIOS it failed at 2%, no matter what 250G
 drive I put in to fill the spot.

I had that happen on a 4 disk (36G each) raid-5 (I forget the
controller). No matter what disk I put in to replace a failed one, it
wouldn't take. 3 drives, exact model, different production dates...
None took. 

I futzed and futzed and finally decided to declare the cage
bad and think of backout procedures. About 2 hours after I had set
another machine up to take its place, it started giving spurious
errors and fell over.

I pulled the machine out of the datacenter, cleared out
the raid config, and went to rebuild with just the 3 drives. Wouldn't
build a fresh raid-5 from just the 3 disks. After a round of "Which one
of these things is not like the other", I found that apparently one of
the disks was still working, but causing heck if I put another disk
in the slot next to it.

A year later, and I finally decided to buy a few more disks
off ebay to see if my final theory is right. I win (hopefully) the
auction in 5 days... If the cage really is bad, I previously sourced
a new case/cage, and decided even though it's a 4G dual Xeon system
I probably could get a new system cheaper that's faster.

Tuc


Re: [freebsd-questions] Re: Fixing a RAID

2008-06-18 Thread Steve Bertrand

Tuc at T-B-O-H.NET wrote:



>>> Ryan Coleman wrote:
>>>
>>> Oh, I completely forgot to ask...
>>>
>>> Does the RAID still operate even though one disk is bad?

> A year later, and I finally decided to buy a few more disks
> off ebay to see if my final theory is right. I win (hopefully) the
> auction in 5 days... If the cage really is bad, I previously sourced
> a new case/cage, and decided even though it's a 4G dual Xeon system
> I probably could get a new system cheaper that's faster.


I would be extremely interested to know if your diligence in testing 
your theory pays off in this case.


Please post your results ;)

Steve