RE: 3Ware Escalade Issues

2006-02-23 Thread Ted Mittelstaedt


-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] Behalf Of Nathan Vidican
Sent: Wednesday, February 22, 2006 10:24 AM
To: Charles Swiger
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: 3Ware Escalade Issues


Charles Swiger wrote:
 On Feb 22, 2006, at 12:31 PM, Don O'Neil wrote:

 3) Is there some way I can do a faster FSCK, or perhaps 'fool' the
 system into thinking the file system is clean?

 If you update to 5.x or later, you can use background FSCK rather
 than having to wait for the FSCK to complete the way it does under
 4.x.

 I wasn't aware 5.x could do this. My next question is how are my
 existing apps going to be affected by upgrading to 5.x?

 If you install the 4.x compatibility libraries, your old 4.x binaries
 should continue to work just fine.  However, you will want to rebuild
 as much of your existing software under 5.x as possible.

 Also, if you update to 5.x, you can run the smartmon tools, which
 will let you do a drive self-test using SMART, this will give much
 better information about what is going on with the drive, and also
 give an estimate of its remaining lifespan.

 Yes, this would help a lot!!!

 Well, once you're running 5.x, install smartmon and run:
 smartctl -t long /dev/ad0, or whatever the right device is.

 How old are the drives, if you know?

 They're less than 2 years old, and still under warranty. This is the
 second drive to fail and it's driving me nuts.

 They're Maxtor DiamondMax Plus 9 6Y250P0 250 GB PATA drives... Never
 had a problem with that particular drive until this batch.

 Can anyone suggest some good 250GB PATA drives for me to use? I might
 as well swap them all out since I'm starting over. The 6000 series
 Escalade card I'm using doesn't support anything more than 250 GB.

 I've had somewhat better luck with the so-called special edition
 variants of the drives, such as the WD1200JB, which have more cache
 RAM and a longer warranty period than the generic versions

According to Western Digital, ONLY their 'SD' or (RAID-Edition) drives
should be attempted in an array; WDC utilizes proprietary error
correction mechanisms which mangle the error-handling done by an array
controller. In short, while the drive is doing its internal
error-correction, the RAID controller sees it as a drive failure and a
whole new mess develops.


Whoah, there chicken little!

The article is in Western Digital's knowledgebase, article# 1397

Here are the relevant bits:

...If you install and use a desktop edition hard drive connected to a
RAID controller, the drive may not work correctly. This is caused by the
normal error recovery procedure that a desktop edition hard drive uses

...When an error is found on a desktop edition hard drive, the drive
will enter into a deep recovery cycle to attempt to repair the error,
recover the data from the problematic area, and then reallocate a
dedicated area to replace the problematic area. This process can take up
to 2 minutes depending on the severity of the issue...

... Most RAID controllers allow a very short amount of time for a hard
drive to recover from an error. If a hard drive takes too long to
complete this process, the drive will be dropped from the RAID array...
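
One rough way to picture the mechanism the article describes: the controller waits a fixed time for the drive to answer, and a drive still busy in deep recovery past that point is treated as failed. The sketch below is only a toy model. The 8-second controller timeout and the function name are assumptions made up for illustration, not figures from WD or 3ware; only the 2-minute worst case comes from the article.

```python
# Toy model of a RAID controller dropping a drive that takes too long
# to recover from an error. Nothing here talks to real hardware.

def drive_dropped(recovery_seconds: float, controller_timeout_seconds: float) -> bool:
    """Return True if the controller gives up before the drive finishes recovery."""
    return recovery_seconds > controller_timeout_seconds

ASSUMED_CONTROLLER_TIMEOUT = 8.0   # hypothetical value, not a documented figure
DEEP_RECOVERY_WORST_CASE = 120.0   # "up to 2 minutes", per the article

for recovery in (0.5, 7.0, DEEP_RECOVERY_WORST_CASE):
    if drive_dropped(recovery, ASSUMED_CONTROLLER_TIMEOUT):
        status = "dropped from array"
    else:
        status = "stays in array"
    print(f"recovery {recovery:6.1f}s -> {status}")
```

Under these assumed numbers, short recoveries pass unnoticed while a full deep recovery cycle gets the drive kicked out, which is the behaviour both sides of this thread are arguing about.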

So let me explain.  If you have a WD IDE disk that is NOT in an array,
has a major error, and goes away and hides for TWO MINUTES, this is
supposed to BE OK in a desktop?

How many users do you know who are going to sit twiddling their thumbs
waiting 2 minutes for their computer to unfreeze?  I thought so.

You must have an extremely elastic idea of what acceptable error
handling is on an IDE drive. Yes, IDE, you know, Intelligent Drive
Electronics?!?  As in, intelligent enough to know that if the problem is
so severe it's going to take 2 minutes of scrubbing to fix, it's a sign
of imminent disk failure and the disk ought to be thrown out anyway?

I think what we have here is a bit of creative justification by WD for
why you should pay more money for their RAID edition drives.

I'll tell you what.  I will keep an eye on my ATA RAID setups that use
WD drives in them.  If one disk dies for 2 minutes and the array dumps
it, I'll RMA the drive back to WD for a new one. You, by contrast, can
keep your failing drives in your array until they croak permanently.

Ted

___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to [EMAIL PROTECTED]


RE: 3Ware Escalade Issues

2006-02-22 Thread Ted Mittelstaedt


-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] Behalf Of Chuck Swiger
Sent: Sunday, February 19, 2006 6:47 AM
To: Don O'Neil
Cc: freebsd-questions@freebsd.org
Subject: Re: 3Ware Escalade Issues


Don O'Neil wrote:
 There appears to be a bad sector on one of the drives according to
 smartctl, but nothing serious.

What that may mean is that there have been many bad sectors, which have
been corrected using the spares, until no more spare sectors are left
for replacements.

That drive may well fail catastrophically, soon.


_Will_ fail soon!

 However, every time the system tried to write to that sector in the
 array, the system would freeze, and then reboot, and of course it
 would say the file system isn't clean, etc...

 Since the file system is 1 TB in size, it would take 8+ hours to FSCK
 it. The array is only striped, and not mirrored or built with
 redundancy. I'm basically using the card/driver to make one large
 volume for a web server.

OK.  Well, if this data is important to you, you should give
consideration to using a RAID-1, RAID-10, or RAID-5 configuration to
gain redundancy.


RAID0-1 is the way to go - disks are cheap now.  Fry's was selling 300GB
UDMA Seagates yesterday for $69.00 with rebates.  You can find Promise
and Highpoint UDMA RAID 100 cards on Ebay for $15 or so.

 I have a few questions:

 1) Is this a known bug? I'm running FreeBSD 4.11 (for software
 compatibility

No, it is not a bug.

 issues at the moment, I will upgrade at some point in the future)

Normally, the OS will only kill the affected processes using that
sector,

No, Chuck, the OS has no knowledge of bad sectors on the disk.

All UNIXes out there assume perfect storage media, and perfect RAM.
It is the hardware's job to handle error correction or containment.
All the OS knows in a disk error is that it is pulling data off
the disk and doing something with it.  If the data that's pulled off
is corrupted and happens to be a device driver or some such, or
other part of the kernel (perhaps it was swapped out) then the system
will crash.  Otherwise the system won't know the difference if the
area is user data.  If it's a program then the results will be
the same as if the program had a bug in it, it will unexpectedly
terminate.

People have lost databases due to corruption by not knowing about
disk failures like this.


but without knowing where it is, perhaps it's affecting some important
file like the kernel itself, /bin/sh...?

 2) How can I trap the errors and eliminate the re-boot issue?

Shut down the system.  Replace the failing hard drive.  Use dd to make
an exact copy onto the new drive on some other system, and put the new
drive back into the array.  Note that the replacement drive must be an
exact match for this to work, otherwise you will have to back up your
data and rebuild the array.

Speaking of which, do you have known-good backups available?

 3) Is there some way I can do a faster FSCK, or perhaps 'fool' the
 system into thinking the file system is clean?

If you update to 5.x or later, you can use background FSCK rather than
having to wait for the FSCK to complete the way it does under 4.x.

 4) Any suggestions on how to fix this?

Also, if you update to 5.x, you can run the smartmon tools, which will
let you do a drive self-test using SMART, this will give much better
information about what is going on with the drive, and also give an
estimate of its remaining lifespan.

How old are the drives, if you know?


A lot of the drive manufacturers these days are offering plenty
generous warranties; it is likely his disk is still under warranty.

Ted

--
-Chuck



Re: 3Ware Escalade Issues

2006-02-22 Thread Don O'Neil
Chuck,
  Thanks for the response, this helps me a lot... My answers are inline:

Don O'Neil wrote:
 There appears to be a bad sector on one of the drives according to
 smartctl, but nothing serious.

What that may mean is that there have been many bad sectors, which have
been corrected using the spares, until no more spare sectors are left
for replacements.

That drive may well fail catastrophically, soon.

I figured as much, which is why I'm going to re-build the whole array with a
new drive, etc... Fortunately I got all my data off ok without any issues.

 However, every time the system tried to write to that sector in the
 array, the system would freeze, and then reboot, and of course it
 would say the file system isn't clean, etc...

 Since the file system is 1 TB in size, it would take 8+ hours to FSCK
 it. The array is only striped, and not mirrored or built with
 redundancy. I'm basically using the card/driver to make one large
 volume for a web server.

OK.  Well, if this data is important to you, you should give
consideration to using a RAID-1, RAID-10, or RAID-5 configuration to
gain redundancy.

Yes, and when I re-build it, it will be RAID-5 rather than just RAID-0.

 I have a few questions:
 
 1) Is this a known bug? I'm running FreeBSD 4.11 (for software
 compatibility issues at the moment, I will upgrade at some point in
 the future)

Normally, the OS will only kill the affected processes using that
sector, but without knowing where it is, perhaps it's affecting some
important file like the kernel itself, /bin/sh...?

Actually the only thing that was on the array was a DB, so I think the
failure may have been causing MySQL to go nuts, and cascading up. 

 2) How can I trap the errors and eliminate the re-boot issue?

Shut down the system.  Replace the failing hard drive.  Use dd to make
an exact copy onto the new drive on some other system, and put the new
drive back into the array.  Note that the replacement drive must be an
exact match for this to work, otherwise you will have to back up your
data and rebuild the array.

Speaking of which, do you have known-good backups available?

Of course I have backups!! Never work without them. I'm going to re-build
with RAID-5 this time.

 3) Is there some way I can do a faster FSCK, or perhaps 'fool' the system
 into thinking the file system is clean?

If you update to 5.x or later, you can use background FSCK rather than
having to wait for the FSCK to complete the way it does under 4.x.

I wasn't aware 5.x could do this. My next question is how are my existing
apps going to be affected by upgrading to 5.x? I have some builds of
packages that were done by a company that is no longer in operation. I
haven't fully figured out how they built the software yet so I can't
re-build under 5.x yet. If I try to put the ELF binaries and the other
builds from 4.x on 5.x, are they going to run ok or do I just need to give it
a try? Would you suggest going all the way to 6.x or sticking with the 5.x
chain?

 4) Any suggestions on how to fix this?

Also, if you update to 5.x, you can run the smartmon tools, which will
let you do a drive self-test using SMART, this will give much better
information about what is going on with the drive, and also give an
estimate of its remaining lifespan.

Yes, this would help a lot!!!

How old are the drives, if you know?

They're less than 2 years old, and still under warranty. This is the second
drive to fail and it's driving me nuts.

They're Maxtor DiamondMax Plus 9 6Y250P0 250 GB PATA drives... Never had a
problem with that particular drive until this batch. 

Can anyone suggest some good 250GB PATA drives for me to use? I might as
well swap them all out since I'm starting over. The 6000 series Escalade
card I'm using doesn't support anything more than 250 GB.

Thanks all again!!!
Don



Re: 3Ware Escalade Issues

2006-02-22 Thread Nathan Vidican

Charles Swiger wrote:

On Feb 22, 2006, at 12:31 PM, Don O'Neil wrote:

3) Is there some way I can do a faster FSCK, or perhaps 'fool' the
system into thinking the file system is clean?

If you update to 5.x or later, you can use background FSCK rather than
having to wait for the FSCK to complete the way it does under 4.x.

I wasn't aware 5.x could do this. My next question is how are my
existing apps going to be affected by upgrading to 5.x?



If you install the 4.x compatibility libraries, your old 4.x binaries  
should continue to work just fine.  However, you will want to rebuild  
as much of your existing software under 5.x as possible.


Also, if you update to 5.x, you can run the smartmon tools, which will
let you do a drive self-test using SMART, this will give much better
information about what is going on with the drive, and also give an
estimate of its remaining lifespan.



Yes, this would help a lot!!!



Well, once you're running 5.x, install smartmon and run: smartctl -t  
long /dev/ad0, or whatever the right device is.
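
As a sketch of the sequence suggested here: start the long self-test, then read back the self-test log and the overall health verdict once it has finished. The device name /dev/ad0 is the one from the message. The helper name is made up for this example, and a drive sitting behind a 3ware controller may need smartctl's -d option to be addressed at all, so check the smartctl man page for your setup before relying on this.

```python
from typing import List

def smart_selftest_commands(device: str) -> List[List[str]]:
    """Build the smartctl invocations for a long self-test on one drive."""
    return [
        ["smartctl", "-t", "long", device],      # start the extended self-test
        ["smartctl", "-l", "selftest", device],  # later: read the self-test log
        ["smartctl", "-H", device],              # later: overall health verdict
    ]

# Print the commands instead of running them, since they need root
# and a real drive. /dev/ad0 is the example device from the thread.
for command in smart_selftest_commands("/dev/ad0"):
    print(" ".join(command))
```

The long test runs inside the drive and can take a while on a 250 GB disk, so the second and third commands are meant to be run afterwards, not immediately.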



How old are the drives, if you know?



They're less than 2 years old, and still under warranty. This is the
second drive to fail and it's driving me nuts.

They're Maxtor DiamondMax Plus 9 6Y250P0 250 GB PATA drives... Never
had a problem with that particular drive until this batch.

Can anyone suggest some good 250GB PATA drives for me to use? I might as
well swap them all out since I'm starting over. The 6000 series Escalade
card I'm using doesn't support anything more than 250 GB.

I've had somewhat better luck with the so-called special edition
variants of the drives, such as the WD1200JB, which have more cache RAM
and a longer warranty period than the generic versions




According to Western Digital, ONLY their 'SD' or (RAID-Edition) drives
should be attempted in an array; WDC utilizes proprietary error
correction mechanisms which mangle the error-handling done by an array
controller. In short, while the drive is doing its internal
error-correction, the RAID controller sees it as a drive failure and a
whole new mess develops.


We've run into this several times now, both with our own in-house
systems, and with those we've procured for others... trust me on this,
if going with Western Digital drives... DO NOT use anything other than
their 'SD' or RAID-Edition drives. Maxtor drives have no such issue
AFAIK, nor Seagate... but I'm only speaking from experience here, not
factual data. WDC has a good explanation in the knowledge base on their
website/support section.


--
Nathan Vidican
[EMAIL PROTECTED]
Windsor Match Plate & Tool Ltd.
http://www.wmptl.com/


Re: 3Ware Escalade Issues

2006-02-22 Thread Robin Vley

Don O'Neil wrote:

Don,


They're Maxtor DiamondMax Plus 9 6Y250P0 250 GB PATA drives... Never had a
problem with that particular drive until this batch. 


I used the 80, 120 and 160GB versions of that series in some of my
servers built 2 years ago. Out of the 18 disks originally put in, I have
4 broken ones on my desk now. I'm not very happy with the overall
performance when it comes to reliability of that Maxtor series.



Can anyone suggest some good 250GB PATA drives for me to use? I might as
well swap them all out since I'm starting over. The 6000 series Escalade
card I'm using doesn't support anything more than 250 GB.


I swapped the broken Maxtors with Seagate disks. I'm using 3ware 7500-4 
for the PATA. Forgot what the SATA 3ware controller was, but so far no 
problem with that one (also on Maxtor disks).


/Robin


Re: 3Ware Escalade Issues

2006-02-22 Thread Peter Giessel
On Wednesday, February 22, 2006, at 10:04AM, Robin Vley [EMAIL PROTECTED] wrote:

I swapped the broken Maxtors with Seagate disks.

I too am a big fan of Seagate disks.  So is Seagate, it seems.

Maxtor and Western Digital give 1 year on the low end and 3 year
warranty on their Special Edition drives, whereas Seagate does
a 5 year warranty on all their drives.  This says something to me.


3Ware Escalade Issues

2006-02-19 Thread Don O'Neil
Hi all,
  I've been experiencing some problems with my 3ware Escalade 6000 array
lately that have been causing spontaneous reboots of the system.

There appears to be a bad sector on one of the drives according to smartctl,
but nothing serious. 

However, every time the system tried to write to that sector in the array,
the system would freeze, and then reboot, and of course it would say the
file system isn't clean, etc...

Since the file system is 1 TB in size, it would take 8+ hours to FSCK it.
The array is only striped, and not mirrored or built with redundancy. I'm
basically using the card/driver to make one large volume for a web server.

I have a few questions:

1) Is this a known bug? I'm running FreeBSD 4.11 (for software compatibility
issues at the moment, I will upgrade at some point in the future)

2) How can I trap the errors and eliminate the re-boot issue?

3) Is there some way I can do a faster FSCK, or perhaps 'fool' the system
into thinking the file system is clean?

4) Any suggestions on how to fix this?

Thanks all!!!



Re: 3Ware Escalade Issues

2006-02-19 Thread Chuck Swiger
Don O'Neil wrote:
 There appears to be a bad sector on one of the drives according to smartctl,
 but nothing serious. 

What that may mean is that there have been many bad sectors, which have been
corrected using the spares, until no more spare sectors are left for 
replacements.

That drive may well fail catastrophically, soon.

 However, every time the system tried to write to that sector in the array,
 the system would freeze, and then reboot, and of course it would say the
 file system isn't clean, etc...
 
 Since the file system is 1 TB in size, it would take 8+ hours to FSCK it.
 The array is only striped, and not mirrored or built with redundancy. I'm
 basically using the card/driver to make one large volume for a web server.

OK.  Well, if this data is important to you, you should give consideration to
using a RAID-1, RAID-10, or RAID-5 configuration to gain redundancy.
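
To put rough numbers on that trade-off, here is a small sketch of usable capacity per RAID level. The 4 x 250 GB layout is an assumption inferred from the 1 TB stripe and the 250 GB drives mentioned in this thread, and the function name is made up for the example.

```python
def usable_gb(level: str, drives: int, drive_gb: int) -> int:
    """Usable capacity in GB for a few common RAID levels."""
    if level == "RAID-0":            # striping only, no redundancy
        return drives * drive_gb
    if level == "RAID-1":            # every drive holds the same data
        if drives < 2:
            raise ValueError("RAID-1 needs at least 2 drives")
        return drive_gb
    if level == "RAID-10":           # striped mirrors, half the raw space
        if drives < 4 or drives % 2:
            raise ValueError("RAID-10 needs an even number of drives, at least 4")
        return (drives // 2) * drive_gb
    if level == "RAID-5":            # one drive's worth of space goes to parity
        if drives < 3:
            raise ValueError("RAID-5 needs at least 3 drives")
        return (drives - 1) * drive_gb
    raise ValueError(f"unknown level: {level}")

# Assumed layout: 4 x 250 GB, inferred from the 1 TB stripe in the thread.
for level in ("RAID-0", "RAID-10", "RAID-5"):
    print(f"{level:8s} {usable_gb(level, 4, 250)} GB usable")
```

With those assumed drives, RAID-5 gives up one drive of space and RAID-10 gives up half, in exchange for surviving a single drive failure that would take the whole RAID-0 stripe with it.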

 I have a few questions:
 
 1) Is this a known bug? I'm running FreeBSD 4.11 (for software compatibility
 issues at the moment, I will upgrade at some point in the future)

Normally, the OS will only kill the affected processes using that sector, but
without knowing where it is, perhaps it's affecting some important file like the
kernel itself, /bin/sh...?

 2) How can I trap the errors and eliminate the re-boot issue?

Shut down the system.  Replace the failing hard drive.  Use dd to make an exact
copy onto the new drive on some other system, and put the new drive back into
the array.  Note that the replacement drive must be an exact match for this to
work, otherwise you will have to backup your data and rebuild the array.
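
A minimal sketch of what the dd step could look like, assuming the failing drive shows up as /dev/ad1 and the replacement as /dev/ad2 on the other system. Both device names, the block size, and the helper name are assumptions for illustration. The conv=noerror,sync options tell dd to keep going past read errors and pad the unreadable blocks, which matters when the source disk has bad sectors.

```python
from typing import List

def dd_clone_command(source: str, target: str, block_size: str = "64k") -> List[str]:
    """Build a dd command that copies one whole drive onto another."""
    return [
        "dd",
        f"if={source}",          # failing drive to read from
        f"of={target}",          # replacement drive to write to
        f"bs={block_size}",      # larger blocks than dd's default, for speed
        "conv=noerror,sync",     # continue past read errors, pad bad blocks
    ]

# Hypothetical device names; double-check them before running anything,
# because dd will overwrite the target without asking.
print(" ".join(dd_clone_command("/dev/ad1", "/dev/ad2")))
```

Blocks that could not be read come out zero-filled on the copy, so the file system on the array still needs a full fsck afterwards.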

Speaking of which, do you have known-good backups available?

 3) Is there some way I can do a faster FSCK, or perhaps 'fool' the system
 into thinking the file system is clean?

If you update to 5.x or later, you can use background FSCK rather than having to
wait for the FSCK to complete the way it does under 4.x.

 4) Any suggestions on how to fix this?

Also, if you update to 5.x, you can run the smartmon tools, which will let you
do a drive self-test using SMART, this will give much better information about
what is going on with the drive, and also give an estimate of its remaining
lifespan.

How old are the drives, if you know?

-- 
-Chuck
