Re: System Hangs -- Which Is Most Stable Kernel?
Hi,

I had (or still have) similar hangs: everything is frozen, there is no syslog entry, and the kernel is still running (ping works), but all user-level stuff hangs. The hardware in my case is an onboard AIC-7890, an additional AHA-394X, and a 3Com 3c905B, all in a dual PIII-450 on an Asus motherboard running raidutils 0.9 on an SMP 2.2.14 kernel.

I am not completely sure yet whether I have solved the problem, since it hung at irregular intervals. What I did was make sure that no interrupt is shared, by swapping cards between PCI slots. The silly BIOS had allocated the same interrupt to the Ethernet card and to the onboard SCSI controller. Since I swapped the cards around to get a better interrupt assignment, I have had no more freezes. But since that was only 10 days ago, I am not yet convinced that it was the actual cause.

-Peter

On 27-Mar-2000 David Cooley wrote:
> Is the PC overclocked in any way? I had troubles with my 2940U2W in both Windows and Linux when I overclocked the Front Side Bus from 100MHz to 103MHz. Seems the Adaptec cards can't handle *ANYTHING* over 33.3 MHz on the PCI bus.
>
> At 02:53 PM 3/27/2000, Jeff Hill wrote:
>> Thanks to everyone for the assistance.
>>
>> I did recompile the kernel with Translucent disabled (I don't know why it is enabled by default). Unfortunately, this has not affected the problem.
>>
>> As for the Adaptec, I had checked on a hardware discussion list and understood that, while some Adaptecs were problematic, the unit I purchased (the 2940U2W with matching factory cables) was working well for several Linux users. However, it seems to me from my limited experience that the Adaptec may be the problem, as it would fit the type of hanging that seems to occur (no error messages, everything just freezes -- possibly waiting for the Adaptec to send the data through).
>>
>> I am still unable to find any log or any way of tracing the system hangs. I may try debugging on the SCSI (haven't a clue how) for a few days before trying to turn off RAID. Before buying another card (I have no others), I'll hope some reconfiguration of the Adaptec will do the trick. I hate to dump all that money down the drain.
>>
>> Thanks again to everyone for the assistance.
>>
>> Jeff Hill
>>
>> "m. allan noah" wrote:
>>> jeff- i am using 2.2.14 with mingo patch, and it is great. i have a dozen or so boxes, 512meg, SMP pIII 450, ncr scsi, etc in this config. all are fine. it would be interesting to see if raid is the issue, or your adaptec (i am inclined to think the latter).
>>> --snip--
>
> ===
> David Cooley N5XMT
> Internet: [EMAIL PROTECTED]
> Packet: N5XMT@KQ4LO.#INT.NC.USA.NA
> T.A.P.R. Member #7068
> We are Borg... Prepare to be assimilated!
> ===

---
Email: [EMAIL PROTECTED]
WWW: http://www.risc.uni-linz.ac.at/people/ppregler
Re: System Hangs -- Which Is Most Stable Kernel?
The aic7xxx driver is definitely unstable in my experience on SMP when shared interrupts are used. It will usually hang on boot in this case. This is an especially annoying problem because many SMP motherboards insist on assigning interrupts automatically; Intel and Supermicro are particularly notorious for this.

However, if you have IO-APIC support enabled in the CMOS setup and in the Linux kernel, there should be no need for shared interrupts to be assigned. If you get this right, you should see something like this:

    colossus:~$ cat /proc/interrupts
               CPU0       CPU1
      0:   65033341          0    XT-PIC         timer
      1:          8          0    IO-APIC-edge   keyboard
      2:          0          0    XT-PIC         cascade
      8:          1          1    IO-APIC-edge   rtc
     13:          1          0    XT-PIC         fpu
     14:          4          3    IO-APIC-edge   ide0
     17:    1027421    1027589    IO-APIC-level  aic7xxx
     18:    1307813    1308648    IO-APIC-level  Intel EtherExpress Pro 10/100 Ethernet
    NMI:          0
    ERR:          0

On this particular system, an Intel PR440FX "Providence" motherboard, both SCSI and Ethernet are built in.

It is also very strongly advised to have MTRR support enabled in the kernel if you are playing around with systems of this caliber.

--Mike

On Tue, 28 Mar 2000, Peter Pregler wrote:
> I had (or still have) similar hangs: everything is frozen, there is no syslog entry, and the kernel is still running (ping works), but all user-level stuff hangs. The hardware in my case is an onboard AIC-7890, an additional AHA-394X, and a 3Com 3c905B, all in a dual PIII-450 on an Asus motherboard running raidutils 0.9 on an SMP 2.2.14 kernel.
>
> I am not completely sure yet whether I have solved the problem, since it hung at irregular intervals. What I did was make sure that no interrupt is shared, by swapping cards between PCI slots. The silly BIOS had allocated the same interrupt to the Ethernet card and to the onboard SCSI controller. Since I swapped the cards around to get a better interrupt assignment, I have had no more freezes. But since that was only 10 days ago, I am not yet convinced that it was the actual cause.
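For anyone checking their own machine for the shared-interrupt situation discussed in this message, here is a minimal Python sketch. It assumes that /proc/interrupts lists several device names, separated by commas, on one line when they share an IRQ. The first sample is taken from the listing in this message; the second sample (IRQ 11 shared by aic7xxx and eth0) is invented purely for illustration, and the function name shared_irqs is hypothetical.

```python
import re

def shared_irqs(text):
    """Return {irq: [devices]} for every IRQ line that lists more than one device."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())          # header row: CPU0 CPU1 ...
    shared = {}
    for line in lines[1:]:
        m = re.match(r"\s*(\d+):\s+(.*)", line)
        if not m:
            continue                      # skip the NMI:/ERR: summary rows
        # Split into per-CPU counts, controller type, and the device string.
        fields = m.group(2).split(None, ncpu + 1)
        if len(fields) < ncpu + 2:
            continue
        devices = [d.strip() for d in fields[ncpu + 1].split(",")]
        if len(devices) > 1:
            shared[int(m.group(1))] = devices
    return shared

# Excerpt of the listing quoted in this message: no IRQ is shared.
SAMPLE = """\
           CPU0       CPU1
  0:   65033341          0    XT-PIC         timer
 17:    1027421    1027589    IO-APIC-level  aic7xxx
 18:    1307813    1308648    IO-APIC-level  Intel EtherExpress Pro 10/100 Ethernet
NMI:          0
ERR:          0
"""

# Invented example of a shared IRQ (not from the thread).
HYPOTHETICAL_SHARED = """\
           CPU0       CPU1
 11:      52340      51877    XT-PIC         aic7xxx, eth0
"""

print(shared_irqs(SAMPLE))
print(shared_irqs(HYPOTHETICAL_SHARED))
```

On a live system the same function could be fed the contents of /proc/interrupts read from disk; an empty result would suggest that no two drivers are registered on the same IRQ line.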
Re: System Hangs -- Which Is Most Stable Kernel?
Thanks to everyone for the assistance.

I did recompile the kernel with Translucent disabled (I don't know why it is enabled by default). Unfortunately, this has not affected the problem.

As for the Adaptec, I had checked on a hardware discussion list and understood that, while some Adaptecs were problematic, the unit I purchased (the 2940U2W with matching factory cables) was working well for several Linux users. However, it seems to me from my limited experience that the Adaptec may be the problem, as it would fit the type of hanging that seems to occur (no error messages, everything just freezes -- possibly waiting for the Adaptec to send the data through).

I am still unable to find any log or any way of tracing the system hangs. I may try debugging on the SCSI (haven't a clue how) for a few days before trying to turn off RAID. Before buying another card (I have no others), I'll hope some reconfiguration of the Adaptec will do the trick. I hate to dump all that money down the drain.

Thanks again to everyone for the assistance.

Jeff Hill

"m. allan noah" wrote:
> jeff- i am using 2.2.14 with mingo patch, and it is great. i have a dozen or so boxes, 512meg, SMP pIII 450, ncr scsi, etc in this config. all are fine. it would be interesting to see if raid is the issue, or your adaptec (i am inclined to think the latter).
> --snip--
Re: System Hangs -- Which Is Most Stable Kernel?
Is the PC overclocked in any way? I had troubles with my 2940U2W in both Windows and Linux when I overclocked the Front Side Bus from 100MHz to 103MHz. Seems the Adaptec cards can't handle *ANYTHING* over 33.3 MHz on the PCI bus.

At 02:53 PM 3/27/2000, Jeff Hill wrote:
> Thanks to everyone for the assistance.
>
> I did recompile the kernel with Translucent disabled (I don't know why it is enabled by default). Unfortunately, this has not affected the problem.
>
> As for the Adaptec, I had checked on a hardware discussion list and understood that, while some Adaptecs were problematic, the unit I purchased (the 2940U2W with matching factory cables) was working well for several Linux users. However, it seems to me from my limited experience that the Adaptec may be the problem, as it would fit the type of hanging that seems to occur (no error messages, everything just freezes -- possibly waiting for the Adaptec to send the data through).
>
> I am still unable to find any log or any way of tracing the system hangs. I may try debugging on the SCSI (haven't a clue how) for a few days before trying to turn off RAID. Before buying another card (I have no others), I'll hope some reconfiguration of the Adaptec will do the trick. I hate to dump all that money down the drain.
>
> Thanks again to everyone for the assistance.
>
> Jeff Hill
>
> "m. allan noah" wrote:
>> jeff- i am using 2.2.14 with mingo patch, and it is great. i have a dozen or so boxes, 512meg, SMP pIII 450, ncr scsi, etc in this config. all are fine. it would be interesting to see if raid is the issue, or your adaptec (i am inclined to think the latter).
>> --snip--

===
David Cooley N5XMT
Internet: [EMAIL PROTECTED]
Packet: N5XMT@KQ4LO.#INT.NC.USA.NA
T.A.P.R. Member #7068
We are Borg... Prepare to be assimilated!
===
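The arithmetic behind the overclocking remark at the top of this message can be sketched as follows. This assumes the PCI clock is derived from the front side bus through a fixed divide-by-3, which was the usual arrangement on boards running a 100 MHz bus at the time; the message does not say which motherboard or divider was in use, so the divider value and the function name pci_clock_mhz are assumptions.

```python
def pci_clock_mhz(fsb_mhz, divider=3):
    """PCI clock derived from the front side bus.

    divider=3 is an assumption (typical for a 100 MHz FSB); the thread
    does not state the actual divider on the board in question.
    """
    return fsb_mhz / divider

for fsb in (100, 103):
    print(f"FSB {fsb} MHz -> PCI {pci_clock_mhz(fsb):.1f} MHz")
```

Under that assumption, a 3 MHz bump on the front side bus pushes the PCI bus about 1 MHz above its nominal 33.3 MHz, which would be consistent with the card misbehaving only when overclocked.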
Re: System Hangs -- Which Is Most Stable Kernel?
have you tried the folks on the [EMAIL PROTECTED] list? this is the list recommended in LINUX/drivers/scsi/README.aic7xxx ...

-s

Jeff Hill wrote:
> Thanks to everyone for the assistance.
>
> I did recompile the kernel with Translucent disabled (I don't know why it is enabled by default). Unfortunately, this has not affected the problem.
>
> As for the Adaptec, I had checked on a hardware discussion list and understood that, while some Adaptecs were problematic, the unit I purchased (the 2940U2W with matching factory cables) was working well for several Linux users. However, it seems to me from my limited experience that the Adaptec may be the problem, as it would fit the type of hanging that seems to occur (no error messages, everything just freezes -- possibly waiting for the Adaptec to send the data through).
>
> I am still unable to find any log or any way of tracing the system hangs. I may try debugging on the SCSI (haven't a clue how) for a few days before trying to turn off RAID. Before buying another card (I have no others), I'll hope some reconfiguration of the Adaptec will do the trick. I hate to dump all that money down the drain.
>
> Thanks again to everyone for the assistance.
>
> Jeff Hill
>
> "m. allan noah" wrote:
>> jeff- i am using 2.2.14 with mingo patch, and it is great. i have a dozen or so boxes, 512meg, SMP pIII 450, ncr scsi, etc in this config. all are fine. it would be interesting to see if raid is the issue, or your adaptec (i am inclined to think the latter).
>> --snip--
Re: System Hangs -- Which Is Most Stable Kernel?
jeff- i am using 2.2.14 with mingo patch, and it is great. i have a dozen or so boxes, 512meg, SMP pIII 450, ncr scsi, etc in this config. all are fine. it would be interesting to see if raid is the issue, or your adaptec (i am inclined to think the latter).

1. swap scsi cards. i like symbios/ncr 53c875 and 876 controllers. they are cheaper than adaptecs, are much better with borderline cabling, and recover from disk hangs. i have found the adaptec does not.

2. turn off raid temporarily. to do this, try something like this- but BE CAREFUL, i have not tried this with the hacked raid1 lilo...

   a. use the raidhotremove command to get the sdbX partitions out of their respective raids. this will ensure that when you return the system to its raid state later, /dev/sda will be considered the most recent copy of your data.
   b. change the partition types of ALL the raid constituents to 83 instead of fd.
   c. change your /etc/fstab to mount /dev/sdaX instead of /dev/md0 (where X is the partition corresponding to the one in your raid set; check your raidtab for that info)
   d. change your lilo config to use /dev/sdaX as root instead of /dev/md0 (again- where X is the partition on sda that is part of your root raid device)
   e. run lilo.

now, when you reboot, make sure you choose the kernel that you changed the root= for, and you will be running a simple, non-raid system. see if you can get the lockups.

reverse the procedure to get the system back into raid:

   a. change lilo.conf, setting root=/dev/md0
   b. run lilo
   c. edit fstab, moving back to md's
   d. change partition types to fd instead of 83
   e. reboot. system will bitch about /dev/sdbX being old, and all the arrays will be degraded, but syncing.
   f. monitor /proc/mdstat to make sure the raid is reconstructing. if not, you may need some help from the raidhotadd command.

allan

Jeff Hill [EMAIL PROTECTED] said:
> Jakob Østergaard wrote:
>> On Sat, 25 Mar 2000, Jeff Hill wrote:
>>> --snip--
>>> My system hangs for 30 seconds to 5 minutes several times a day using a vanilla kernel 2.2.14 from ftp.kernel.org with a 2.2.14 RAID patch from Red Hat on my Debian (Potato version) server. When the system hangs, it
>>> --snip--
>>
>> Is it an SMP system?
>
> Nope. ASUS P3B-F motherboard w/Intel 440BX AGPset, 512MB PC100 SDRAM, Pentium III 450MHz. I removed SMP support when I compiled the kernel.
>
>>> My mdstat reads:
>>>     Personalities : [linear] [raid0] [raid1] [translucent]
>>>     read_ahead 1024 sectors
>>>     md0 : active raid1 sdb2[1] sda2[0] 8739264 blocks [2/2] [UU]
>>>     unused devices: none
>>
>> Disable translucent mode! It's not intended to be used yet.
>
> I assume the only way to disable it is to recompile the kernel?
>
> --snip--
>
>> What you're seeing sounds pretty strange. It would be very interesting to see if you could reproduce your problems without RAID. You're running RAID-1, so you should be able to simply not start the RAID devices, and then mount one of the mirrors directly, bypassing the /dev/mdX devices.
>
> Sorry for my ignorance of RAID, but I'm not certain I'm following. How do you not start RAID when it is compiled into the kernel to automount /dev/md0 at root (I use the Red Hat lilo version that allows this)? By doing a "raidstop /dev/md0"? Or by rebooting to a 2.2.14 kernel without RAID compiled in?
>
> I would just shoot in the dark at this, but I'm a little paranoid as it is my main webserver (yes, I should have done more testing before making it a production machine).
>
> Thanks for the assistance.
>
> Jeff Hill
>
> --
> -- HR On-Line: The Network for Workplace Issues --
> -- http://www.hronline.com - Ph:416-604-7251 - Fax:416-604-4708 --
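The last step of the procedure in this message, watching /proc/mdstat while the mirror rebuilds, could be scripted along these lines. This is only a sketch: it assumes each array reports a "[total/active]" disk counter in the form shown in the mdstat output quoted in this thread, the degraded sample is invented for illustration, and array_status is a hypothetical helper name.

```python
import re

def array_status(mdstat_text):
    """Map each md device to (total_disks, active_disks, degraded)."""
    status = {}
    current = None
    for line in mdstat_text.splitlines():
        head = re.match(r"(md\d+)\s*:", line)
        if head:
            current = head.group(1)       # start of a new array entry
        # The "[n/m]" counter may sit on the mdN line or on a following line.
        counts = re.search(r"\[(\d+)/(\d+)\]", line)
        if current and counts:
            total, active = int(counts.group(1)), int(counts.group(2))
            status[current] = (total, active, active < total)
    return status

# mdstat as posted in the thread: both mirror halves are present.
HEALTHY = """\
Personalities : [linear] [raid0] [raid1] [translucent]
read_ahead 1024 sectors
md0 : active raid1 sdb2[1] sda2[0] 8739264 blocks [2/2] [UU]
unused devices: none
"""

# Invented example of the same array running on one disk only.
DEGRADED = """\
Personalities : [linear] [raid0] [raid1]
read_ahead 1024 sectors
md0 : active raid1 sda2[0] 8739264 blocks [2/1] [U_]
unused devices: none
"""

print(array_status(HEALTHY))
print(array_status(DEGRADED))
```

An array that stays flagged as degraded after the reboot would be the case where, as the message says, raidhotadd is needed to bring the second disk back in.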
System Hangs -- Which Is Most Stable Kernel?
Which is currently the most stable kernel that supports new-style RAID?

My system hangs for 30 seconds to 5 minutes several times a day using a vanilla kernel 2.2.14 from ftp.kernel.org with a 2.2.14 RAID patch from Red Hat on my Debian (Potato version) server. When the system hangs, it doesn't crash, and it responds to a few basic requests ('ls' for example) but nothing else (not even 'ls -l'). Then it proceeds to fill the request. I've looked through the dmesg, system and kernel logs and have found nothing out of the ordinary. I've tried kernel 2.2.10 with similar results.

My mdstat reads:

    Personalities : [linear] [raid0] [raid1] [translucent]
    read_ahead 1024 sectors
    md0 : active raid1 sdb2[1] sda2[0] 8739264 blocks [2/2] [UU]
    unused devices: none

At the same time as I updated the system to Debian Potato, the new kernel and RAID-1, I also added an Adaptec 2940U2W SCSI controller and a matching pair of Cheetah drives, but nothing in the new hardware seems problematic. I've got the Red Hat lilo installed so that I can boot with the root filesystem on the RAID, and the /boot directories on both disks are on separate, non-RAID partitions.

I've been using Linux for several years, but I'm unable to determine where the problem is -- I just suspect the kernel. Any suggestions on the most stable kernel or on how to troubleshoot this would be appreciated.

Also, is there any searchable archive of the linux-raid list? Wish I had found this list _before_ I built my raid.

Thanks,

Jeff Hill

P.S.: Just in case it helps:

raidtab:

    raiddev /dev/md0
        raid-level              1
        nr-raid-disks           2
        nr-spare-disks          0
        chunk-size              4
        persistent-superblock   1
        device                  /dev/sda2
        raid-disk               0
        device                  /dev/sdb2
        raid-disk               1

fdisk -l /dev/sda:

    /dev/sda1   *        1         2     16064+  83  Linux
    /dev/sda2            3      1090   8739360   fd  Linux raid autodetect
    /dev/sda3         1091      1106    128520   82  Linux swap

fdisk -l /dev/sdb:

    /dev/sdb1   *        1         2     16064+  83  Linux
    /dev/sdb2            3      1090   8739360   fd  Linux raid autodetect
    /dev/sdb3         1091      1106    128520   82  Linux swap
Re: System Hangs -- Which Is Most Stable Kernel?
On Sat, 25 Mar 2000, Jeff Hill wrote:
> Which is currently the most stable kernel that supports new-style RAID?
>
> My system hangs for 30 seconds to 5 minutes several times a day using a vanilla kernel 2.2.14 from ftp.kernel.org with a 2.2.14 RAID patch from Red Hat on my Debian (Potato version) server. When the system hangs, it doesn't crash, and it responds to a few basic requests ('ls' for example) but nothing else (not even 'ls -l'). Then it proceeds to fill the request. I've looked through the dmesg, system and kernel logs and have found nothing out of the ordinary. I've tried kernel 2.2.10 with similar results.

Is it an SMP system?

> My mdstat reads:
>     Personalities : [linear] [raid0] [raid1] [translucent]
>     read_ahead 1024 sectors
>     md0 : active raid1 sdb2[1] sda2[0] 8739264 blocks [2/2] [UU]
>     unused devices: none

Disable translucent mode! It's not intended to be used yet. I don't know whether it can cause you any trouble, but translucent mode is definitely not something you want to have turned on.

> At the same time as I updated the system to Debian Potato, the new kernel and RAID-1, I also added an Adaptec 2940U2W SCSI controller and a matching pair of Cheetah drives, but nothing in the new hardware seems problematic. I've got the Red Hat lilo installed so that I can boot with the root filesystem on the RAID, and the /boot directories on both disks are on separate, non-RAID partitions.
>
> I've been using Linux for several years, but I'm unable to determine where the problem is -- I just suspect the kernel. Any suggestions on the most stable kernel or on how to troubleshoot this would be appreciated.
>
> Also, is there any searchable archive of the linux-raid list? Wish I had found this list _before_ I built my raid.

What you're seeing sounds pretty strange. It would be very interesting to see if you could reproduce your problems without RAID. You're running RAID-1, so you should be able to simply not start the RAID devices, and then mount one of the mirrors directly, bypassing the /dev/mdX devices.
--
Jakob Østergaard, OZ9ABN : [EMAIL PROTECTED]
"And I see the elder races, putrid forms of man:
 See him rise and claim the earth, his downfall is at hand." {Konkhra}
Re: System Hangs -- Which Is Most Stable Kernel?
Jakob Østergaard wrote:
> On Sat, 25 Mar 2000, Jeff Hill wrote:
>> --snip--
>> My system hangs for 30 seconds to 5 minutes several times a day using a vanilla kernel 2.2.14 from ftp.kernel.org with a 2.2.14 RAID patch from Red Hat on my Debian (Potato version) server. When the system hangs, it
>> --snip--
>
> Is it an SMP system?

Nope. ASUS P3B-F motherboard w/Intel 440BX AGPset, 512MB PC100 SDRAM, Pentium III 450MHz. I removed SMP support when I compiled the kernel.

>> My mdstat reads:
>>     Personalities : [linear] [raid0] [raid1] [translucent]
>>     read_ahead 1024 sectors
>>     md0 : active raid1 sdb2[1] sda2[0] 8739264 blocks [2/2] [UU]
>>     unused devices: none
>
> Disable translucent mode! It's not intended to be used yet.

I assume the only way to disable it is to recompile the kernel?

--snip--

> What you're seeing sounds pretty strange. It would be very interesting to see if you could reproduce your problems without RAID. You're running RAID-1, so you should be able to simply not start the RAID devices, and then mount one of the mirrors directly, bypassing the /dev/mdX devices.

Sorry for my ignorance of RAID, but I'm not certain I'm following. How do you not start RAID when it is compiled into the kernel to automount /dev/md0 at root (I use the Red Hat lilo version that allows this)? By doing a "raidstop /dev/md0"? Or by rebooting to a 2.2.14 kernel without RAID compiled in?

I would just shoot in the dark at this, but I'm a little paranoid as it is my main webserver (yes, I should have done more testing before making it a production machine).

Thanks for the assistance.

Jeff Hill

--
-- HR On-Line: The Network for Workplace Issues --
-- http://www.hronline.com - Ph:416-604-7251 - Fax:416-604-4708 --
RE: System Hangs -- Which Is Most Stable Kernel?
-----Original Message-----
From: Jeff Hill [mailto:[EMAIL PROTECTED]]
Sent: Saturday, March 25, 2000 9:18 AM
To: Jakob Østergaard
Cc: [EMAIL PROTECTED]
Subject: Re: System Hangs -- Which Is Most Stable Kernel?

[snip]

>>> My mdstat reads:
>>>     Personalities : [linear] [raid0] [raid1] [translucent]
>>>     read_ahead 1024 sectors
>>>     md0 : active raid1 sdb2[1] sda2[0] 8739264 blocks [2/2] [UU]
>>>     unused devices: none
>>
>> Disable translucent mode! It's not intended to be used yet.
>
> I assume the only way to disable it is to recompile the kernel?

Yep, unless you compiled it as a module; then you could probably just remove the module without too many adverse effects.

> --snip--
>
>> What you're seeing sounds pretty strange. It would be very interesting to see if you could reproduce your problems without RAID. You're running RAID-1, so you should be able to simply not start the RAID devices, and then mount one of the mirrors directly, bypassing the /dev/mdX devices.
>
> Sorry for my ignorance of RAID, but I'm not certain I'm following. How do you not start RAID when it is compiled into the kernel to automount /dev/md0 at root (I use the Red Hat lilo version that allows this)? By doing a "raidstop /dev/md0"? Or by rebooting to a 2.2.14 kernel without RAID compiled in?

I think you could just change the type of those partitions to 0x83 (Linux native), and RAID won't autostart them. Then just point lilo to one disk, and run like that for a while. Not sure how you would go back to RAID after that.

> I would just shoot in the dark at this, but I'm a little paranoid as it is my main webserver (yes, I should have done more testing before making it a production machine).

Oops. :-) I'd grab another machine, and start breaking things on that. You might try to reproduce your software setup on a second machine, and try anything that you're going to try on the production machine there first, so that when you hose something, you've done it on a test machine first.

Greg
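Before changing partition types as suggested in this message, it may help to list which partitions are currently marked for RAID autodetection. The sketch below reads text in the layout of the "fdisk -l" listing posted at the start of the thread (device, optional boot flag "*", start, end, blocks, Id, system). The function name raid_autodetect_partitions is hypothetical, and the column layout is an assumption based on that posted output only.

```python
def raid_autodetect_partitions(fdisk_text):
    """Return the devices whose partition Id is 'fd' (Linux raid autodetect)."""
    found = []
    for line in fdisk_text.splitlines():
        fields = line.split()
        if not fields or not fields[0].startswith("/dev/"):
            continue                      # ignore headers and blank lines
        if len(fields) > 1 and fields[1] == "*":
            fields.pop(1)                 # drop the boot flag column when present
        # Remaining columns: device, start, end, blocks, id, system...
        if len(fields) >= 5 and fields[4].lower() == "fd":
            found.append(fields[0])
    return found

# Partition listing for /dev/sda as posted in the original message.
FDISK_SDA = """\
/dev/sda1   *        1         2     16064+  83  Linux
/dev/sda2            3      1090   8739360   fd  Linux raid autodetect
/dev/sda3         1091      1106    128520   82  Linux swap
"""

print(raid_autodetect_partitions(FDISK_SDA))
```

For the posted layout this reports only /dev/sda2, which is the partition whose type would be switched to 83 for the non-RAID test and back to fd afterwards.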