Re: 2.6.20: reproducible hard lockup with RAID-5 resync
Neil Brown wrote:
>> Ok, so the difference is CONFIG_SYSFS_DEPRECATED. If that is not
>> defined, the kernel locks up. There's not a lot of code under
>> #ifdef/#ifndef CONFIG_SYSFS_DEPRECATED, but since I'm not familiar
>> with any of it I don't expect trying to locate the bug on my own
>> would be very productive.
>>
>> Neil, do you have CONFIG_SYSFS_DEPRECATED enabled? If so, does
>> disabling it reproduce my problem? If you can't reproduce it, should
>> I take the problem over to linux-kernel?
>
> # CONFIG_SYSFS_DEPRECATED is not set
>
> No, it is not set, and yet it all still works for me.

Dang, again. :)

> It is very hard to see how this CONFIG option can make a difference.
> Have you double checked that setting it removes the problem and
> clearing it causes the problem?

Yes, it seems odd to me too, but I have double-checked. If I build a
kernel with CONFIG_SYSFS_DEPRECATED enabled, it works; if I disable that
option and rebuild the kernel, it locks up.

I just tried running 'make defconfig' and then enabling only RAID,
RAID-0, RAID-1, and RAID-4/5/6. If I then disable
CONFIG_SYSFS_DEPRECATED, there aren't any problems.

...so, I'll try to isolate the problem some more later.

Thanks,
Corey
-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
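[Editorial note: the toggle-and-rebuild cycle Corey describes can be sketched in shell. The kernel source path is hypothetical, and the `sed` edit of `.config` is shown on a one-line stand-in so the transformation itself is visible; after editing the real file you would run `make oldconfig` to re-resolve dependent options.]

```shell
# Minimal sketch, assuming the kernel tree lives in /usr/src/linux
# (hypothetical path). flip_off rewrites a CONFIG_SYSFS_DEPRECATED=y
# line into the "is not set" comment form that Kconfig expects.
flip_off() {
  sed 's/^CONFIG_SYSFS_DEPRECATED=y$/# CONFIG_SYSFS_DEPRECATED is not set/'
}

# Demonstrated on a stand-in line rather than a real .config:
printf 'CONFIG_SYSFS_DEPRECATED=y\n' | flip_off
# In the real cycle, roughly:
#   cd /usr/src/linux
#   flip_off < .config > .config.new && mv .config.new .config
#   make oldconfig && make
```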
Re: 2.6.20: reproducible hard lockup with RAID-5 resync
Corey Hickey wrote:
> When I get home (late) tonight I'll try running dd and badblocks on
> the corresponding drives and partitions.

Well, I haven't been able to reproduce the problem that way. I tried the
following:

$ dd if=/dev/hda of=/dev/null
$ badblocks /dev/hda
$ badblocks -n /dev/hda

...and the same for sda6, sdb, sdc6, sdd, and md2. In each case I killed
the test after several seconds, on the assumption that if the problem
was reproducible within less than a second by triggering a resync, it
wouldn't take long to show up any other way.

If anyone has any suggestions for further tests I can do, I'll be happy
to try them out.

Thanks,
Corey
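[Editorial note: the per-device tests above can be collected into one hedged loop. Device names are from the report; `badblocks` without `-n`/`-w` is read-only, and each test is bounded with `timeout` so a run can be abandoned automatically rather than killed by hand. The guard makes the loop a no-op on machines without these devices.]

```shell
# Sketch: run the same read-only tests over every RAID member device.
check_dev() {
  dev=$1
  # Skip cleanly if the block device is not present on this machine.
  [ -b "$dev" ] || { echo "skip $dev (not present)"; return 0; }
  # Read a bounded amount, then a bounded badblocks pass (read-only mode).
  timeout 10 dd if="$dev" of=/dev/null bs=1M count=256 2>/dev/null
  timeout 10 badblocks "$dev"
}

for d in /dev/hda /dev/sda6 /dev/sdb /dev/sdc6 /dev/sdd /dev/md2; do
  check_dev "$d"
done
```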
Re: 2.6.20: reproducible hard lockup with RAID-5 resync
On Thursday February 15, [EMAIL PROTECTED] wrote:
> I think I have found an easily-reproducible bug in Linux 2.6.20. I
> have already applied the "Fix various bugs with aligned reads in
> RAID5" patch, and that had no effect. It appears to be related to the
> resync process, and makes the system lock up, hard.

I'm guessing that the problem is at a lower level than raid. What
IDE/SATA controllers do you have? Google to see if anyone else has had
problems with them in 2.6.20.

> During the lock up, nothing is printed to the console, and the magic
> SysRq key has no effect; I have to poke the reset button.

Sounds like interrupts are disabled, but x86_64 always enables the NMI
watchdog, which should trigger if interrupts are off for too long.

Do you have CONFIG_DETECT_SOFTLOCKUP=y in your .config (it is in the
kernel debugging options menu, I think)? If not, setting that would be
worth a try.

A raid5 resync across 5 sata drives on a couple of different
silicon-image controllers doesn't lock up for me.

NeilBrown
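[Editorial note: the NMI watchdog Neil refers to is visible as the "NMI" line in `/proc/interrupts`; if its per-CPU counts keep increasing over time, the watchdog is actually ticking. A quick hedged check, using only the standard procfs path:]

```shell
# Sketch: confirm the NMI watchdog is delivering interrupts.
# Grabs the NMI line from /proc/interrupts; sampling it twice a few
# seconds apart and comparing counts would show whether it is ticking.
nmi_line=$(grep '^ *NMI:' /proc/interrupts 2>/dev/null)
if [ -n "$nmi_line" ]; then
  msg="$nmi_line"
else
  msg="no NMI line in /proc/interrupts"
fi
echo "$msg"
```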
Re: 2.6.20: reproducible hard lockup with RAID-5 resync
Neil Brown wrote:
> On Thursday February 15, [EMAIL PROTECTED] wrote:
>> I think I have found an easily-reproducible bug in Linux 2.6.20. I
>> have already applied the "Fix various bugs with aligned reads in
>> RAID5" patch, and that had no effect. It appears to be related to
>> the resync process, and makes the system lock up, hard.
>
> I'm guessing that the problem is at a lower level than raid. What
> IDE/SATA controllers do you have? Google to see if anyone else has
> had problems with them in 2.6.20.
>
>> During the lock up, nothing is printed to the console, and the magic
>> SysRq key has no effect; I have to poke the reset button.
>
> Sounds like interrupts are disabled, but x86_64 always enables the
> NMI watchdog, which should trigger if interrupts are off for too
> long.
>
> Do you have CONFIG_DETECT_SOFTLOCKUP=y in your .config (it is in the
> kernel debugging options menu, I think)? If not, setting that would
> be worth a try.
>
> A raid5 resync across 5 sata drives on a couple of different
> silicon-image controllers doesn't lock up for me.

Wow, thanks for the quick response. I have to go to bed now, but I'll
try to get you that information tomorrow.

Thanks,
Corey
Re: 2.6.20: reproducible hard lockup with RAID-5 resync
Neil Brown wrote:
> On Thursday February 15, [EMAIL PROTECTED] wrote:
>> I think I have found an easily-reproducible bug in Linux 2.6.20. I
>> have already applied the "Fix various bugs with aligned reads in
>> RAID5" patch, and that had no effect. It appears to be related to
>> the resync process, and makes the system lock up, hard.
>
> I'm guessing that the problem is at a lower level than raid. What
> IDE/SATA controllers do you have? Google to see if anyone else has
> had problems with them in 2.6.20.

I have an nForce3 motherboard. lspci calls my IDE:
  nVidia Corporation CK8S Parallel ATA Controller (v2.5) (rev a2)
...and my SATA:
  nVidia Corporation CK8S Serial ATA Controller (v2.5) (rev a2)

I'm using libata for my SATA drives and the old IDE driver for my IDE
drive. For reference, I have uploaded my kernel configuration and the
output of lspci:

http://fatooh.org/files/tmp/config-2.6.20
http://fatooh.org/files/tmp/lspci-v

Anyway, I googled a bit, and I also looked through the recent threads in
the linux-kernel archives, but I haven't found anything. I don't follow
kernel development closely, though, so it's quite possible I missed
something.

When I get home (late) tonight I'll try running dd and badblocks on the
corresponding drives and partitions.

>> During the lock up, nothing is printed to the console, and the magic
>> SysRq key has no effect; I have to poke the reset button.
>
> Sounds like interrupts are disabled, but x86_64 always enables the
> NMI watchdog, which should trigger if interrupts are off for too
> long.

How long is too long? I waited a few minutes, at least, on the first few
tries.

> Do you have CONFIG_DETECT_SOFTLOCKUP=y in your .config (it is in the
> kernel debugging options menu, I think)? If not, setting that would
> be worth a try.

I do indeed have CONFIG_DETECT_SOFTLOCKUP enabled. The Kconfig
description says it should detect lockups after 10 seconds; I've waited
longer than that many times.
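[Editorial note: rather than trusting the build tree's `.config`, the option can be double-checked against the running kernel. A hedged sketch: `/proc/config.gz` exists only with CONFIG_IKCONFIG_PROC, and the `/boot/config-$(uname -r)` path is the usual Debian location and may differ elsewhere.]

```shell
# Sketch: confirm a config option in the *running* kernel, with fallbacks.
check_opt() {
  opt=$1
  if zcat /proc/config.gz 2>/dev/null | grep -q "^$opt=y"; then
    echo "$opt=y (from /proc/config.gz)"
  elif grep -qs "^$opt=y" "/boot/config-$(uname -r)"; then
    echo "$opt=y (from /boot)"
  else
    echo "$opt not confirmed on this machine"
  fi
}

check_opt CONFIG_DETECT_SOFTLOCKUP
```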
> A raid5 resync across 5 sata drives on a couple of different
> silicon-image controllers doesn't lock up for me.

Heck. ;) Would it by any chance make a difference that I'm running
RAID-5 across a mixture of whole drives and partitions?

Thanks again,
Corey
2.6.20: reproducible hard lockup with RAID-5 resync
I think I have found an easily-reproducible bug in Linux 2.6.20. I have
already applied the "Fix various bugs with aligned reads in RAID5"
patch, and that had no effect. It appears to be related to the resync
process, and makes the system lock up, hard.

The steps to reproduce are:

1. Be running Linux 2.6.20 and do whatever is necessary to prepare for
   a crash (close open files, sync, unmount filesystems, or whatever).
   Alternatively, just boot with 'init=/bin/bash'.
2. Run 'mdadm -S /dev/md2', where /dev/md2 is a RAID-5 array.
3. Run 'mdadm -A /dev/md2 -U resync'.
4. Wait about 1 second. The system will lock up.

During the lock up, nothing is printed to the console, and the magic
SysRq key has no effect; I have to poke the reset button.

Normally, I wouldn't rule out a hardware problem, but I have reasonable
faith in my computer: neither memtest86+ nor cpuburn nor normal
operation has flushed out any instability.

Upon reboot, 2.6.20 will lock up almost immediately when it tries to
resync the array. This appears to occur regardless of how far along the
resync is; if I run 2.6.19 for a while until the resync is, say, 50%
done and then reboot to 2.6.20, the lockup still happens.

I have provided what I hope is enough information below.
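[Editorial note: the steps above can be collected into one hedged sketch. It assumes `/dev/md2` is the RAID-5 array, needs root to do anything real, and is guarded so that on a machine without that device it only prints a message.]

```shell
# Sketch of the reproduction sequence (steps 2-4 above).
force_resync() {
  md=$1
  if [ -b "$md" ]; then
    sync                       # step 1: flush anything you care about
    mdadm -S "$md"             # step 2: stop the array
    mdadm -A "$md" -U resync   # step 3: reassemble, forcing a resync
    # step 4: on the affected kernel, the lockup follows within ~1 second
  else
    echo "no $md on this machine; skipping"
  fi
}

force_resync /dev/md2
```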
System information:

Athlon64 3400+
64-bit Linux 2.6.20 compiled with GCC 4.1.2
64-bit Debian Sid

RAID-5 of 5 devices:
/dev/hda  (IDE hard drive)
/dev/sda6 (partition on SATA hard drive)
/dev/sdb  (SATA hard drive)
/dev/sdc6 (partition on SATA hard drive)
/dev/sdd  (SATA hard drive)

bugfood:~# mdadm -D /dev/md2
/dev/md2:
        Version : 00.90.03
  Creation Time : Mon May 29 22:13:47 2006
     Raid Level : raid5
     Array Size : 781433344 (745.23 GiB 800.19 GB)
    Device Size : 195358336 (186.31 GiB 200.05 GB)
   Raid Devices : 5
  Total Devices : 5
Preferred Minor : 2
    Persistence : Superblock is persistent

    Update Time : Thu Feb 15 22:07:26 2007
          State : active, resyncing
 Active Devices : 5
Working Devices : 5
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

 Rebuild Status : 26% complete

           UUID : d016a205:bd3106ef:b19cb15b:b6d70494
         Events : 0.3971003

    Number   Major   Minor   RaidDevice State
       0       8        6        0      active sync   /dev/sda6
       1       8       38        1      active sync   /dev/sdc6
       2       3        0        2      active sync   /dev/hda
       3       8       16        3      active sync   /dev/sdb
       4       8       48        4      active sync   /dev/sdd

Thank you,
Corey