Re: question about raidframe getting stuck

2008-08-13 Thread Marcus Andree
snip

 Almost every RAID system out there handles the sudden removal
 of a disk from the system pretty well.  Why?  Because it's EASY
 to create that failure mode.  Problem is, in 25 years in this
 business, I don't recall having seen a hard disk fall out of a
 computer as a mode of actual failure (I did see a SCSI HBA fall
 out of a machine once, but that's a different story).

snip

I have seen that disk-suddenly-out-of-computer failure once. Coincidentally
enough, it was an OpenBSD system configured only for NAT, about 6 years ago.

The IDE hard disk failed sometime at night. When we arrived at the
office the next day, everything was working flawlessly until someone
ssh'ed to that machine. My guess is that something went awry when
syslog tried to log that new connection and the OS suddenly
discovered there was no hard disk present.

Surprisingly enough, the onboard IDE controller survived, but after
installing the new disk we found that the parallel IDE cable was
faulty and had to be replaced as well.

It was not a RAID system though...

snip



Re: question about raidframe getting stuck

2008-08-12 Thread Harald Dunkel

Hi Nick,

I highly appreciate your detailed report about your experiences
with RAID systems. That was cool. Surely I don't expect any
miracles from RAID anymore.

The current plan is to move to a ramdisk-based system to get rid
of disk access as far as possible, and to use CARP to set up a
fallback host. Logging is done (non-blocking, hopefully) over the
network.
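
A rough sketch of what that could look like (the loghost name and
the mfs size below are only placeholders, not from a real setup):
mount the write-heavy directories as memory file systems and forward
syslog to a central host:

  # /etc/fstab -- memory-backed mfs for the write-heavy mount points
  swap /tmp mfs rw,nodev,nosuid,-s=32768 0 0

  # /etc/syslog.conf -- forward everything to a central loghost
  # (use a tab between selector and action)
  *.*     @loghost.example.com

syslogd forwards to the @loghost over UDP, so a slow or unreachable
loghost shouldn't block the box the way a dying local disk can.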


Many thanx

Harri



Re: question about raidframe getting stuck

2008-08-11 Thread Harald Dunkel

Stuart Henderson wrote:


With IDE (Integrated Drive Electronics), the controller is *on the
drive*.  A failing drive/controller can do all sorts of nasty things
to the host system.



So you mean I should not use IDE disks (PATA or SATA), because
Raidframe cannot support a failsafe operation with these disks?


Regards

Harri



Re: question about raidframe getting stuck

2008-08-11 Thread Nick Holland
Harald Dunkel wrote:
 Stuart Henderson wrote:
 
 With IDE (Integrated Drive Electronics), the controller is *on the
 drive*.  A failing drive/controller can do all sorts of nasty things
 to the host system.
 
 
 So you mean I should not use IDE disks (PATA or SATA), because
 Raidframe cannot support a failsafe operation with these disks?

rant
Your basic assumption that RAID=minimal down time is flawed.

The most horrible down times and events you ever see will
often involve RAID.  That goes for hardware RAID, software
RAID, whatever.

Almost every RAID system out there handles the sudden removal
of a disk from the system pretty well.  Why?  Because it's EASY
to create that failure mode.  Problem is, in 25 years in this
business, I don't recall having seen a hard disk fall out of a
computer as a mode of actual failure (I did see a SCSI HBA fall
out of a machine once, but that's a different story).

My preferred way to test RAID systems involves a powder-actuated
nail gun driving a nail through the platter.  Not overly
realistic either, but arguably more so than having the drive
suddenly being pulled out.  The disks get expensive, though.

Back to your situation...
The drive reports a failure, but not one so horrible that the
OS doesn't attempt a retry.  So, at what point does the OS just
shut down the drive and say, not worth the trouble?  If you
are running a single drive, you generally want to keep trying
as long as there is the slightest hope (another digression: back
in the MSDOS v2 days, I had a machine blow a disk such that if I
kept hitting Retry enough times, each sector would ultimately
be read successfully.  Wedged a pen in between the 'R' key and
the monitor, went to dinner, and when I came back, I had all my
data successfully copied to another drive).  In your case,
however, you have a drive saying, "I'm getting better," when you
are saying, "It'll be stone dead in a moment."  You want the
OS to whack the drive and toss it on the cart..er..remove it
from the RAID set at the first sign of trouble, but that's not
a universal answer.

Curiously, I've had servers that caused problems BOTH ways.  One
kept a drive on-line even though it was having serious problems
and should have been declared dead.  In several other cases,
the drives reported minor errors and were popped off-line and
out of the array when there was really nothing significant wrong
with the disks, but the local staff didn't recognize that...and
if the right two popped off-line, down went the array.

oh, btw: those were both HW RAID.  You can run into these
problems no matter what you are using.  The "try too long" ones
were SATA, the "give up too early" ones were SCSI.  We had 20
servers with the SATA HW mirroring, not a single one lost data,
though one got Really Slow until we figured out it was a drive
problem.  We had 15 SCSI systems which cost about four times as
much as the SATA systems..three of those lost data.  Complexity
kills.

Proper operation can be neat.  Failures rarely are.
There are usually more ways a system can fail than there are
ways it can work.  It is also really hard to have the drives
fail in realistic ways when the designers are watching, and
it is really hard to fail something the same exact way again
to work out every bug.

In your case, you have firewalls, which can be made completely
redundant, rather than just making the disk system more complex.
Why run RAID in the firewalls, when you can just run CARP and
have much more redundancy?
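
For reference, the CARP side of such a pair is just a couple of
hostname.if files; the addresses, interface names and password
below are made up for illustration:

  # /etc/hostname.carp0 -- shared gateway address on both firewalls
  inet 192.0.2.1 255.255.255.0 192.0.2.255 vhid 1 pass sekrit carpdev fxp0

  # /etc/hostname.pfsync0 -- keep pf state synced over a crossover cable
  up syncdev fxp1

  # /etc/sysctl.conf -- let the backup take over when the master fails
  net.inet.carp.preempt=1

The backup runs the same carp0 with a higher advskew and picks up
the shared address as soon as the master stops advertising.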

Of course, you can have similar problems with CARP, too..we
managed to install a proxy package in a CARP'd pair of FWs and
didn't notice how fast it filled the disk.  One box quit passing
packets when the proxy couldn't log any longer, but CARP didn't
see the box or interfaces or links as actually failing, so it
didn't switch over to the standby system.  Happened when both
our administrators were out that morning (of course), so when
they called me at home, I asked a few questions, and had them
hit the power switch on the primary firewall, which instantly
got data moving again through the secondary.

Consider RAID to be a rapid repair tool; don't expect it
to never let you go down (and that's assuming you know how to
recover when it actually does fail...and most people I've seen
just assume magic happens or they hope they got a job elsewhere
by that point).  And don't expect to get less down time out of
a very complex system compared to a simple system.

In particular, when an IDE disk fails, it often seems to the
computer as though an entire controller fell out of the system,
so don't expect an IDE system to stay up after a drive failure.
On the other hand, if you haven't seen a SCSI disk take out an
entire SCSI bus, just wait; do this enough and you will.  Don't expect
them to stay up, either.  SATA?  Ask me again in about ten
years, but so far, I've seen a SATA drive toss a dead
short across the power supply, killing the RAID box, the PS in
the computer and the PS on a second 

Re: question about raidframe getting stuck

2008-08-08 Thread Olivier Cherrier
On Thu, Aug 07, 2008 at 12:33:40PM +0200, [EMAIL PROTECTED] wrote:
 Seems we have some misunderstanding here. I am talking about
 future events. Of course I don't know in advance which disk
 fails when. If a disk dies, then it's the job of raidframe to
 detect this event, to mark the disk as bad, and to provide the
 basic service with the remaining disks, as far as possible.
 And yet the machine became unresponsive for 30 minutes.
 This took much too long.

Couldn't it be related to the IDE bus? Who knows what kind of noise a
defective disk on an IDE controller generates while it is failing.
I would suggest using SCSI controllers and disks with hot-swap
capability.

-- 
Olivier Cherrier - Symacx.com
mailto:[EMAIL PROTECTED]



Re: question about raidframe getting stuck

2008-08-08 Thread Stuart Henderson
On 2008-08-08, Olivier Cherrier [EMAIL PROTECTED] wrote:
 Couldn't it be related to the IDE bus? Who knows what kind of noise a
 defective disk on an IDE controller generates while it is failing.

With IDE (Integrated Drive Electronics), the controller is *on the
drive*.  A failing drive/controller can do all sorts of nasty things
to the host system.



Re: question about raidframe getting stuck

2008-08-07 Thread Ariane van der Steldt
On Thu, Aug 07, 2008 at 09:27:24AM +0200, Harald Dunkel wrote:
 I've got a configuration issue with Raidframe: Our
 gateway/firewall runs a raid1 for the system disk.
 No swap partition.
 
 Recently one of the raid disks (wd0) showed some
 problem:
 
 Aug  2 17:22:35 fw01 /bsd: wd0(pciide0:0:0): timeout
 Aug  2 17:53:52 fw01 /bsd:  type: ata
 Aug  2 17:53:52 fw01 /bsd:  c_bcount: 16384
 Aug  2 17:53:52 fw01 /bsd:  c_skip: 0
 Aug  2 17:53:52 fw01 /bsd: pciide0:0:0: bus-master DMA error: missing 
 interrupt, status=0x21
 Aug  2 17:53:52 fw01 /bsd: pciide0 channel 0: reset failed for drive 0
 Aug  2 17:53:52 fw01 /bsd: wd0d: device timeout writing fsbn 46172704 of 
 46172704-46172735 (wd0 bn 50368000; cn 49968 tn 4 sn 4), retrying
 :
 :
 Aug  2 17:53:52 fw01 /bsd: wd0d: device timeout writing fsbn 46172704 of 
 46172704-46172735 (wd0 bn 50368000; cn 49968 tn 4 sn 4)
 Aug  2 17:53:52 fw01 /bsd: raid0: IO Error.  Marking /dev/wd0d as failed.
 Aug  2 17:53:52 fw01 /bsd: raid0: node (Wpd) returned fail, rolling forward
 Aug  2 17:53:52 fw01 /bsd: pciide0:0:0: not ready, st=0xd0BSY,DRDY,DSC, 
 err=0x00
 Aug  2 17:53:52 fw01 /bsd: pciide0 channel 0: reset failed for drive 0
 Aug  2 17:53:52 fw01 /bsd: wd0d: device timeout writing fsbn 46137472 of 
 46137472-46137503 (wd0 bn 50332768; cn 49933 tn 4 sn 52), retrying
 :
 :
 Aug  2 17:53:53 fw01 /bsd: pciide0:0:0: not ready, st=0xd0BSY,DRDY,DSC, 
 err=0x00
 Aug  2 17:53:53 fw01 /bsd: pciide0 channel 0: reset failed for drive 0
 Aug  2 17:53:53 fw01 /bsd: wd0d: device timeout writing fsbn 46152320 of 
 46152320-46152343 (wd0 bn 50347616; cn 49948 tn 0 sn 32)
 Aug  2 17:53:53 fw01 /bsd: raid0: node (Wpd) returned fail, rolling forward
 
 
 Surely wd0 is defective. That can happen. But my problem is that the
 machine became unresponsive for 30 minutes. Even a ping did
 not work. This is not what I would expect from a raid system.
 
 What would you suggest to reduce the waiting time? 2 minutes
 would be OK, but 30 minutes of downtime is a _huge_ problem.
 
 Do I have to expect the same for a raid5 built from 9 disks, but
 with a higher probability, because there are more disks in the
 loop?

Your best bet is to replace the disk. 30 minutes wait time seems a bit
odd though. I have a similar situation where one disk is having
problems, requiring the disk to restart, but that only takes approx. a
minute. You can mark the disk as bad and replace it before the other
disk fails I guess (after all, there's not much point in relying on a
faulty disk).
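
In case it helps, the manual sequence would be roughly as follows
(device names are just examples; check raidctl(8) for the details):

  # tell RAIDframe to stop using the ailing component
  raidctl -f /dev/wd0d raid0

  # after swapping the disk and giving it the same disklabel,
  # reconstruct in place onto the new component
  raidctl -R /dev/wd0d raid0

  # watch component status / reconstruction progress
  raidctl -s raid0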

Ciao,
Ariane



Re: question about raidframe getting stuck

2008-08-07 Thread Harald Dunkel

Ariane van der Steldt wrote:


Your best bet is to replace the disk. 30 minutes wait time seems a bit
odd though. I have a similar situation where one disk is having
problems, requiring the disk to restart, but that only takes approx. a
minute. You can mark the disk as bad and replace it before the other
disk fails I guess (after all, there's not much point in relying on a
faulty disk).



The problem is not replacing the disk, but how to avoid
30 minutes downtime due to some low level kernel routine
getting stuck.


Regards

Harri



Re: question about raidframe getting stuck

2008-08-07 Thread Ariane van der Steldt
On Thu, Aug 07, 2008 at 11:41:59AM +0200, Harald Dunkel wrote:
 Ariane van der Steldt wrote:
  
  Your best bet is to replace the disk. 30 minutes wait time seems a bit
  odd though. I have a similar situation where one disk is having
  problems, requiring the disk to restart, but that only takes approx. a
  minute. You can mark the disk as bad and replace it before the other
  disk fails I guess (after all, there's not much point in relying on a
  faulty disk).
  
 
 The problem is not replacing the disk, but how to avoid
 30 minutes downtime due to some low level kernel routine
 getting stuck.

Mark it as a bad disk? If you do that, the raid code should do no more
requests to the disk.

Ariane



Re: question about raidframe getting stuck

2008-08-07 Thread nothingness

Harald Dunkel wrote:

Ariane van der Steldt wrote:


Your best bet is to replace the disk. 30 minutes wait time seems a bit
odd though. I have a similar situation where one disk is having
problems, requiring the disk to restart, but that only takes approx. a
minute. You can mark the disk as bad and replace it before the other
disk fails I guess (after all, there's not much point in relying on a
faulty disk).



The problem is not replacing the disk, but how to avoid
30 minutes downtime due to some low level kernel routine
getting stuck.


Regards

Harri

Presumably this was after a reboot? If so, the trick is to move the
'raidctl -P all' line from /etc/rc to /etc/rc.local and add a '&' so it
runs as a background process.
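
That is, something along these lines at the end of /etc/rc.local
(untested sketch):

  # rewrite RAIDframe parity in the background so boot doesn't block on it
  raidctl -P all &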


Regards

Noth



Re: question about raidframe getting stuck

2008-08-07 Thread Harald Dunkel

nothingness wrote:
Presumably this was after a reboot? If so, the trick is to move the
'raidctl -P all' line from /etc/rc to /etc/rc.local and add a '&' so it
runs as a background process.




There was no reboot involved. Before this event the machine was
running for weeks, and it is still running.


Regards

Harri



Re: question about raidframe getting stuck

2008-08-07 Thread Harald Dunkel

Ariane van der Steldt wrote:

On Thu, Aug 07, 2008 at 11:41:59AM +0200, Harald Dunkel wrote:

Ariane van der Steldt wrote:

Your best bet is to replace the disk. 30 minutes wait time seems a bit
odd though. I have a similar situation where one disk is having
problems, requiring the disk to restart, but that only takes approx. a
minute. You can mark the disk as bad and replace it before the other
disk fails I guess (after all, there's not much point in relying on a
faulty disk).


The problem is not replacing the disk, but how to avoid
30 minutes downtime due to some low level kernel routine
getting stuck.


Mark it as a bad disk? If you do that, the raid code should do no more
requests to the disk.



Seems we have some misunderstanding here. I am talking about
future events. Of course I don't know in advance which disk
fails when. If a disk dies, then it's the job of raidframe to
detect this event, to mark the disk as bad, and to provide the
basic service with the remaining disks, as far as possible.

Looking at the log file it seems that raidframe _did_ mark
the disk as bad:

:
Aug  2 17:53:52 fw01 /bsd: raid0: IO Error.  Marking /dev/wd0d as failed.
Aug  2 17:53:52 fw01 /bsd: raid0: node (Wpd) returned fail, rolling forward
:

And yet the machine became unresponsive for 30 minutes.
This took much too long.


Regards

Harri