Re: System Hangs -- Which Is Most Stable Kernel?

2000-03-28 Thread Peter Pregler

Hi,

I had (or have) similar hangs (everything frozen, no syslog entry, kernel still
running since ping works, but all user-level stuff hangs). The hardware in
my case is an onboard AIC-7890, an additional AHA-394X, and a 3Com 3c905B, all
on a dual PIII-450 on an Asus motherboard running raidutils 0.9 on an
SMP 2.2.14 kernel. I am not completely sure yet whether I have solved the
problem, since it hung at irregular intervals. What I did was make sure
that no interrupt is shared, by swapping cards between PCI slots. The silly
BIOS had allocated the same interrupt to the Ethernet card and to the onboard
SCSI controller. Since I swapped the cards around to get a better
interrupt assignment I have had no more freezes. But since that was only 10
days ago I am not yet convinced that it was the actual cause.

-Peter


---
Email: [EMAIL PROTECTED]
WWW:   http://www.risc.uni-linz.ac.at/people/ppregler



Re: System Hangs -- Which Is Most Stable Kernel?

2000-03-28 Thread Mike Bilow

The aic7xxx driver is definitely unstable in my experience on SMP when
shared interrupts are used.  It will usually hang on boot in this case. 

This is an especially annoying problem because many SMP motherboards will
insist on assigning interrupts automatically; Intel and Supermicro are
particularly notorious for this.  However, if you have IO-APIC support
enabled in the CMOS setup and in the Linux kernel, there should be no need
for assignment of shared interrupts.  If you get this right, you should
see something like this:

colossus:~$ cat /proc/interrupts
           CPU0       CPU1
  0:   65033341          0          XT-PIC  timer
  1:          8          0    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  8:          1          1    IO-APIC-edge  rtc
 13:          1          0          XT-PIC  fpu
 14:          4          3    IO-APIC-edge  ide0
 17:    1027421    1027589   IO-APIC-level  aic7xxx
 18:    1307813    1308648   IO-APIC-level  Intel EtherExpress Pro 10/100 Ethernet
NMI:          0
ERR:          0

On this particular system, an Intel PR440FX "Providence" motherboard, both
SCSI and Ethernet are built-in.

It is also very strongly advised to have MTRR support enabled in the
kernel if you are playing around with systems of this caliber.

--Mike
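
A quick way to check for the interrupt sharing described in this thread: on
kernels of this era, an IRQ shared by several drivers normally shows up in
/proc/interrupts as a single line with the device names separated by commas.
The Python below is only a minimal sketch under that assumption. The function
name shared_irqs and the SAMPLE text (with a made-up IRQ 11 shared by aic7xxx
and eth0) are hypothetical and not taken from any system in this thread.

```python
def shared_irqs(text):
    """Return {irq: [device, ...]} for IRQ lines that name more than one device."""
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())          # header row: CPU0 CPU1 ...
    shared = {}
    for line in lines[1:]:
        irq, _, rest = line.partition(":")
        fields = rest.split()
        # Skip summary rows such as NMI/ERR, which carry no device names.
        if not irq.strip().isdigit() or len(fields) <= ncpus + 1:
            continue
        # fields = per-CPU counts, controller type, then the device name(s).
        devices = " ".join(fields[ncpus + 1:]).split(", ")
        if len(devices) > 1:
            shared[int(irq)] = devices
    return shared


# Hypothetical sample in the 2.2-style layout; IRQ 11 is deliberately shared.
SAMPLE = """\
           CPU0       CPU1
  0:   65033341          0          XT-PIC  timer
 11:    1027421    1027589          XT-PIC  aic7xxx, eth0
 14:          4          3    IO-APIC-edge  ide0
NMI:          0
ERR:          0
"""

print(shared_irqs(SAMPLE))
```

On a live machine one would pass the contents of /proc/interrupts instead of
SAMPLE. Later kernels add extra columns, so the field offsets may need
adjusting there.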







Re: System Hangs -- Which Is Most Stable Kernel?

2000-03-27 Thread Jeff Hill

Thanks to everyone for the assistance.

I did recompile the kernel with Translucent disabled (I don't know why
it is enabled by default?). Unfortunately, this has not affected the
problem.

As for the Adaptec, I had checked on a hardware discussion list and
understood that, while some Adaptecs were problematic, the unit I
purchased (the 2940U2W with matching factory cables) was working well
for several Linux users. 

However, it seems to me from my limited experience that the Adaptec may
be the problem as it would fit the type of hanging that seems to occur
(no error messages, everything just freezes -- possibly waiting for the
Adaptec to send the data through).

I am still unable to find any log or any way of tracing the system hangs.
I may try debugging on the SCSI (haven't a clue how) for a few days
before trying to turn off RAID. Before buying another card (I have no
others), I'll hope some reconfiguration of the Adaptec will do the
trick. I hate to dump all that money down the drain.

Thanks again to everyone for the assistance.

Jeff Hill

"m. allan noah" wrote:
 
> jeff- i am using 2.2.14 with mingo patch, and it is great. i have a dozen or
> so boxes, 512meg, SMP pIII 450, ncr scsi, etc in this config. all are fine.
>
> it would be interesting to see if raid is the issue, or your adaptec (i am
> inclined to think the latter).
--snip--



Re: System Hangs -- Which Is Most Stable Kernel?

2000-03-27 Thread David Cooley

Is the PC overclocked in any way?
I had troubles with my 2940U2W in both Windows and Linux when I overclocked 
the Front Side Bus from 100MHz to 103MHz.
Seems the Adaptec cards can't handle *ANYTHING* over 33.3 MHz on the PCI bus.




===
David Cooley N5XMT Internet: [EMAIL PROTECTED]
Packet: N5XMT@KQ4LO.#INT.NC.USA.NA T.A.P.R. Member #7068
We are Borg... Prepare to be assimilated!
===




Re: System Hangs -- Which Is Most Stable Kernel?

2000-03-27 Thread Stephen Waters

have you tried the folks on the [EMAIL PROTECTED] list? this is the list
recommended in LINUX/drivers/scsi/README.aic7xxx ...
-s




Re: System Hangs -- Which Is Most Stable Kernel?

2000-03-26 Thread m . allan noah

jeff- i am using 2.2.14 with mingo patch, and it is great. i have a dozen or
so boxes, 512meg, SMP pIII 450, ncr scsi, etc in this config. all are fine.

it would be interesting to see if raid is the issue, or your adaptec (i am
inclined to think the latter).

1. swap scsi cards. i like symbios/ncr 53c875 and 876 controllers. they are
cheaper than adaptecs, and are much better with borderline cabling, and
recover from disk hangs. i have found the adaptec does not.

2. turn off raid temporarily. to do this, try something like this- but BE
CAREFUL, i have not tried this with the hacked raid1 lilo...

a. use the raidhotremove command to get the sdbX partitions out of their
respective raids. this will ensure that when you return the system to its raid
state later, that /dev/sda will be considered the most recent copy of your
data.
b. change the partition types of ALL the raid constituents to 83 instead of fd.
c. change your /etc/fstab to mount /dev/sdaX instead of /dev/md0 (where X is
the partition corresponding to the one in your raid set; check your raidtab
for that info)
d. change your lilo config to use /dev/sdaX as root instead of /dev/md0
(again- where X is the partition on sda that is part of your root raid device)
e. run lilo.

now, when you reboot, make sure you choose the kernel that you changed the
root= for, and you will be running a simple, non-raid system.

see if you can get the lockups.

reverse the procedure to get the system back into raid:
a. change lilo.conf, setting root=/dev/md0
b. run lilo
c. edit fstab, moving back to md's
d. change partition types to fd instead of 83
e. reboot. system will bitch about /dev/sdbX being old, and all the arrays will
be degraded, but syncing.
f. monitor /proc/mdstat to make sure the raid is reconstructing. if not, you
may need some help from the raidhotadd command.

allan
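
For step f above, the check of /proc/mdstat can be scripted. The Python below
is a rough sketch, assuming the status map keeps the "[n/m] [UU]" form shown
elsewhere in this thread, with "_" marking a missing member. The function name
degraded_arrays and the DEGRADED sample are hypothetical; only the healthy
sample is copied from Jeff's post.

```python
import re


def degraded_arrays(mdstat_text):
    """Return names of md arrays whose status map shows a missing member."""
    degraded = []
    # An array entry starts at "mdN :" and runs to the next entry or end of text.
    entry = re.compile(r"^(md\d+) :(.*?)(?=^md\d+ :|\Z)", re.M | re.S)
    for m in entry.finditer(mdstat_text):
        name, body = m.group(1), m.group(2)
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", body)
        if status and "_" in status.group(3):
            degraded.append(name)
    return degraded


# Healthy mirror, as posted in this thread.
HEALTHY = """\
Personalities : [linear] [raid0] [raid1] [translucent]
read_ahead 1024 sectors
md0 : active raid1 sdb2[1] sda2[0] 8739264 blocks [2/2] [UU]
unused devices: none
"""

# Hypothetical degraded mirror (sdb2 missing), modelled on the format above.
DEGRADED = """\
Personalities : [linear] [raid0] [raid1]
read_ahead 1024 sectors
md0 : active raid1 sda2[0] 8739264 blocks [2/1] [U_]
unused devices: none
"""

print(degraded_arrays(HEALTHY), degraded_arrays(DEGRADED))
```

Running this in a loop against the real /proc/mdstat after the reboot would
show when md0 drops off the degraded list. It does not try to parse any resync
progress line, since that format is not shown in the thread.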








System Hangs -- Which Is Most Stable Kernel?

2000-03-25 Thread Jeff Hill

Which is currently the most stable kernel that supports new-style RAID? 

My system hangs for 30 seconds to 5 minutes several times a day using a
vanilla kernel 2.2.14 from ftp.kernel.org with a 2.2.14 RAID patch from
Redhat on my Debian (Potato version) server. When the system hangs, it
doesn't crash and it responds to a few basic requests ('ls' for example)
but nothing else (not even 'ls -l'). Then, it proceeds to fill the
request. I've looked through the dmesg, system and
kernel logs and have found nothing out of the ordinary. I've tried
kernel 2.2.10 with similar results.

My mdstat reads:

Personalities : [linear] [raid0] [raid1] [translucent] 
read_ahead 1024 sectors
md0 : active raid1 sdb2[1] sda2[0] 8739264 blocks [2/2] [UU]
unused devices: none


At the same time I updated the system to Debian Potato, the new kernel
and RAID-1, I also added an Adaptec 2940U2W SCSI controller and a
matching pair of Cheetah drives, but nothing in the new hardware seems
problematic. I've got the Redhat lilo installed so that I boot the RAID
from root, and /boot directories on both disks are on separate, non-raid
partitions.

I've been using Linux for several years, but I'm unable to determine
where the problem is -- I just suspect the kernel. Any suggestions on
the most stable kernel or on how to troubleshoot this appreciated.

Also, is there any searchable archive of the linux-raid list?  Wish I
would have found this list _before_ I built my raid.

Thanks,

Jeff Hill


P.S.: Just in case it helps:

raidtab:
raiddev /dev/md0
    raid-level              1
    nr-raid-disks           2
    nr-spare-disks          0
    chunk-size              4
    persistent-superblock   1
    device                  /dev/sda2
    raid-disk               0
    device                  /dev/sdb2
    raid-disk               1


fdisk -l /dev/sda:
/dev/sda1   *         1         2     16064+  83  Linux
/dev/sda2             3      1090   8739360   fd  Linux raid autodetect
/dev/sda3          1091      1106    128520   82  Linux swap

fdisk -l /dev/sdb:
/dev/sdb1   *         1         2     16064+  83  Linux
/dev/sdb2             3      1090   8739360   fd  Linux raid autodetect
/dev/sdb3          1091      1106    128520   82  Linux swap
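
The partitions with Id fd in listings like the ones above are the ones the
kernel autostarts as RAID members, and the ones that replies in this thread
suggest switching to 83 to run without RAID for a while. As a small aid, the
Python sketch below picks them out of saved fdisk output. It assumes the
column layout shown here (device, optional boot flag, start, end, blocks, Id,
system); the function name raid_autodetect_partitions is hypothetical.

```python
def raid_autodetect_partitions(fdisk_text):
    """Return the devices whose partition Id is fd (Linux raid autodetect)."""
    found = []
    for line in fdisk_text.splitlines():
        fields = line.split()
        if not fields or not fields[0].startswith("/dev/"):
            continue
        # Drop the boot flag so the remaining columns line up.
        if len(fields) > 1 and fields[1] == "*":
            fields.pop(1)
        # fields: device, start, end, blocks, id, system...
        if len(fields) >= 5 and fields[4].lower() == "fd":
            found.append(fields[0])
    return found


# Partition lines for /dev/sda as posted in this thread.
SAMPLE = """\
/dev/sda1   *         1         2     16064+  83  Linux
/dev/sda2             3      1090   8739360   fd  Linux raid autodetect
/dev/sda3          1091      1106    128520   82  Linux swap
"""

print(raid_autodetect_partitions(SAMPLE))
```

This only reads a listing. Changing the partition type itself is still done by
hand in fdisk.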



Re: System Hangs -- Which Is Most Stable Kernel?

2000-03-25 Thread Jakob Østergaard

On Sat, 25 Mar 2000, Jeff Hill wrote:

> Which is currently the most stable kernel that supports new-style RAID?
>
> My system hangs for 30 seconds to 5 minutes several times a day using a
> vanilla kernel 2.2.14 from ftp.kernel.org with a 2.2.14 RAID patch from
> Redhat on my Debian (Potato version) server. When the system hangs, it
> doesn't crash and it responds to a few basic requests ('ls' for example)
> but nothing else (not even 'ls -l'). Then, it proceeds to fill the
> request. I've looked through the dmesg, system and
> kernel logs and have found nothing out of the ordinary. I've tried
> kernel 2.2.10 with similar results.

Is it a SMP system ?

> My mdstat reads:
>
>   Personalities : [linear] [raid0] [raid1] [translucent]
>   read_ahead 1024 sectors
>   md0 : active raid1 sdb2[1] sda2[0] 8739264 blocks [2/2] [UU]
>   unused devices: none

Disable translucent mode !   It's not intended to be used yet.

I wouldn't know if this can cause you any trouble, but translucent
mode is definitely not something you want to have turned on.

> At the same time I updated the system to Debian Potato, the new kernel
> and RAID-1, I also added an Adaptec 2940U2W SCSI controller and a
> matching pair of Cheetah drives, but nothing in the new hardware seems
> problematic. I've got the Redhat lilo installed so that I boot the RAID
> from root, and /boot directories on both disks are on separate, non-raid
> partitions.
>
> I've been using Linux for several years, but I'm unable to determine
> where the problem is -- I just suspect the kernel. Any suggestions on
> the most stable kernel or on how to troubleshoot this appreciated.
>
> Also, is there any searchable archive of the linux-raid list?  Wish I
> would have found this list _before_ I built my raid.

What you're seeing sounds pretty strange.  It would be very interesting
to see if you could reproduce your problems without RAID.  You're running
RAID-1, so you should be able to simply not start the RAID devices, and
then mount one of the mirrors directly, bypassing the /dev/mdX devices.

-- 

: [EMAIL PROTECTED]  : And I see the elder races, :
:.: putrid forms of man:
:   Jakob Østergaard  : See him rise and claim the earth,  :
:OZ9ABN   : his downfall is at hand.   :
:.:{Konkhra}...:



Re: System Hangs -- Which Is Most Stable Kernel?

2000-03-25 Thread Jeff Hill

Jakob Østergaard wrote:
 
> On Sat, 25 Mar 2000, Jeff Hill wrote:
--snip--
> > My system hangs for 30 seconds to 5 minutes several times a day using a
> > vanilla kernel 2.2.14 from ftp.kernel.org with a 2.2.14 RAID patch from
> > Redhat on my Debian (Potato version) server. When the system hangs, it
--snip--
> Is it a SMP system ?

Nope. ASUS P3B-F motherboard w/Intel 440BX AGPset, 512MB PC100 SDRAM,
Pentium III 450MHz.
I removed SMP support when I compiled the kernel.
 
> > My mdstat reads:
> > Personalities : [linear] [raid0] [raid1] [translucent]
> > read_ahead 1024 sectors
> > md0 : active raid1 sdb2[1] sda2[0] 8739264 blocks [2/2] [UU]
> > unused devices: none
> Disable translucent mode !   It's not intended to be used yet.

I assume the only way to disable it is to recompile the kernel?

--snip--
> It sounds pretty strange what you're seeing.  It would be very interesting
> to see if you could reproduce your problems without RAID.  You're running
> RAID-1, so you should be able to just don't start the RAID devices, and
> then mount one of the mirrors disregarding the /dev/mdX devices.

Sorry for my ignorance of RAID, but I'm not certain I'm following. How
do you not start RAID when it is compiled into the kernel to automount
/dev/md0 at root (I use the RedHat lilo version that allows this)? By doing
a "raidstop /dev/md0"?  Or by rebooting to a 2.2.14 kernel without RAID
compiled in?

I would just shoot in the dark at this, but I'm a little paranoid as it
is my main webserver (yes, I should have done more testing before making
it a production machine).

Thanks for the assistance.

Jeff Hill

-- 

--  HR On-Line:  The Network for Workplace Issues --
http://www.hronline.com - Ph:416-604-7251 - Fax:416-604-4708




RE: System Hangs -- Which Is Most Stable Kernel?

2000-03-25 Thread Gregory Leblanc

> -----Original Message-----
> From: Jeff Hill [mailto:[EMAIL PROTECTED]]
> Sent: Saturday, March 25, 2000 9:18 AM
> To: Jakob Østergaard
> Cc: [EMAIL PROTECTED]
> Subject: Re: System Hangs -- Which Is Most Stable Kernel?
>
[snip]
> > > My mdstat reads:
> > > Personalities : [linear] [raid0] [raid1] [translucent]
> > > read_ahead 1024 sectors
> > > md0 : active raid1 sdb2[1] sda2[0] 8739264 blocks [2/2] [UU]
> > > unused devices: none
> > Disable translucent mode !   It's not intended to be used yet.
>
> I assume the only way to disable it is to recompile the kernel?

Yep, unless you compiled it as a module, in which case you could probably
just remove the module without too many adverse effects.

 
> --snip--
> > It sounds pretty strange what you're seeing.  It would be very interesting
> > to see if you could reproduce your problems without RAID.  You're running
> > RAID-1, so you should be able to just don't start the RAID devices, and
> > then mount one of the mirrors disregarding the /dev/mdX devices.
>
> Sorry for my ignorance of RAID, but I'm not certain I'm following. How
> do you not start RAID when it is compiled into the kernel to automount
> /dev/md0 at root (I use the RedHat lilo version that allows this). Doing
> a "raidstop /dev/md0"?  Or reboot to a 2.2.14 kernel without RAID
> compiled in?

I think you could just change the type of those partitions to 0x83 (Linux
native), and RAID won't autostart them.  Then just point lilo to one disk,
and run like that for a while.  Not sure how you would go back to RAID after
that.

 
> I would just shoot in the dark at this, but I'm a little paranoid as it
> is my main webserver (yes, I should have done more testing before making
> it a production machine).

Oops.  :-)  I'd grab another machine, and start breaking things on that.
You might try to reproduce your software setup on a second machine, and try
anything that you're going to try on the production machine there first, so
that when you hose something, you've done it on a test machine first.
Greg