[OOPS] pegasus + MediaGX: Oops in khubd, the continuing story?
Well,

I got fed up with all those Oops'es, so I started scribbling one on a piece of paper. This is what ksymoops makes of it:

ksymoops 2.4.1 on i586 2.4.4. Options used
     -V (default)
     -k /var/log/ksymoops/20010504223943.ksyms (specified)
     -l /var/log/ksymoops/20010504223943.modules (specified)
     -o /lib/modules/2.4.3 (specified)
     -m /boot/System.map-2.4.3 (specified)

Warning (compare_maps): snd symbol pm_register not found in /usr/lib/alsa-modules/2.4.3/0.5/snd.o. Ignoring /usr/lib/alsa-modules/2.4.3/0.5/snd.o entry
Warning (compare_maps): snd symbol pm_send not found in /usr/lib/alsa-modules/2.4.3/0.5/snd.o. Ignoring /usr/lib/alsa-modules/2.4.3/0.5/snd.o entry
Warning (compare_maps): snd symbol pm_unregister not found in /usr/lib/alsa-modules/2.4.3/0.5/snd.o. Ignoring /usr/lib/alsa-modules/2.4.3/0.5/snd.o entry
eip: c010f6f3
Oops:
CPU: 0
EIP: 0010:[<c010f6f3>]
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00010007
eax: c2667000 ebx: ecx: c2686000 edx:
esi: 0046 edi: fff8 ebp: c26c7ce8 esp: c26c7ccc
ds: 0018 es: 0018 ss: 0018
Process khubd (pid: 428, stackpage=c26c7000)
Stack: c2686000 c2686074 c283ee40 c26861d0 0001 0286 0001 c283ee40
       c4c840e5 c2686074 c4c7d222 c2686074 2f10 c2686074 0002 c4c7eccd
       c2686074 c4c88010 c4c88010 c2a6c000 0006 c2666000
Call Trace: c4c840e5 c4c7d222 c4c7cccd c4c88010 c4c88010 c4c7fe9b c4c88014 c4c80857
       c4c8000c c4c88000 c01077df c010813e c0106e60 c0115054 c0108171 c0106c60
       c011196c c4c84213 c4c859c0 c4c84601 0006 c4c851a2 5f5f c4c85564
       c4c86334 c4c8639c c4c86380 c4c8639c c4c70ad2 c4c86334 c4c7b2e0 c4c70d5b
       c4c72988 c4c73dba c4c7b334 c4c73fa2 c4c7b36c c4c7b36c c4c74135 c010542c
Code: 8b 4f 04 8b 1b 8b 01 85 45 fc 74 51 31 c0 9c 5e fa c7 01 00

>>EIP; c010f6f3 <__wake_up+33/a8>   <=====
Trace; c4c840e5 <[pegasus]__module_parm_desc_loopback+25/28>
Trace; c4c7d222 <[usb-ohci]sohci_return_urb+10e/118>
Trace; c4c7cccd <[usbcore]__kstrtab_usb_devfs_handle+1291/15c4>
Trace; c4c88010 <.data.end+1c51/>
Trace; c4c88010 <.data.end+1c51/>
Trace; c4c7fe9b <[usb-ohci]hc_release_ohci+4b/b0>
Trace; c4c88014 <.data.end+1c55/>
Code;  c010f6f3 <__wake_up+33/a8>
<_EIP>:
Code;  c010f6f3 <__wake_up+33/a8>   <=====
   0:   8b 4f 04                  mov    0x4(%edi),%ecx   <=====
Code;  c010f6f6 <__wake_up+36/a8>
   3:   8b 1b                     mov    (%ebx),%ebx
Code;  c010f6f8 <__wake_up+38/a8>
   5:   8b 01                     mov    (%ecx),%eax
Code;  c010f6fa <__wake_up+3a/a8>
   7:   85 45 fc                  test   %eax,0xfffc(%ebp)
Code;  c010f6fd <__wake_up+3d/a8>
   a:   74 51                     je     5d <_EIP+0x5d> c010f750 <__wake_up+90/a8>
Code;  c010f6ff <__wake_up+3f/a8>
   c:   31 c0                     xor    %eax,%eax
Code;  c010f701 <__wake_up+41/a8>
   e:   9c                        pushf
Code;  c010f702 <__wake_up+42/a8>
   f:   5e                        pop    %esi
Code;  c010f703 <__wake_up+43/a8>
  10:   fa                        cli
Code;  c010f704 <__wake_up+44/a8>
  11:   c7 01 00 00 00 00         movl   $0x0,(%ecx)

3 warnings issued. Results may not be reliable.

I may have made some transcription errors, but the main stuff is there. This Oops (and others just like it) appears when the pegasus module is reloaded into the system.

Some info on the system and the circumstances:

 MediaGXLV (200 MHz) + 5530 'kahlua' companion chip (so this is ohci usb)
 60 MB RAM (+4MB for video)
 SMC 2202 (pegasus chip) 10/100tx USB NIC on a 10baseT LAN
 Oops also appears on 2.4.4

Cheers//Frank
-- 
Frank de Lange
+31-320-252965
[EMAIL PROTECTED]
[ "Omnis enim res, quae dando non deficit, dum habetur et non datur, nondum habetur, quomodo habenda est." ]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
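For readers unfamiliar with the ksymoops annotations above, here is a minimal sketch of how the `<[module]symbol+offset/size>` notation can be unpacked. It assumes the usual reading of that notation (hex offset into the symbol, then the symbol's hex size); `parse_symbol` is a hypothetical helper written for this illustration and is not part of ksymoops.

```python
import re

# "<[module]symbol+offset/size>"; the "[module]" part is absent for symbols
# in the core kernel, and the size may be missing (as in ".data.end+1c51/").
SYMBOL_RE = re.compile(
    r"<(?:\[(?P<module>[^\]]+)\])?(?P<symbol>[^+>]+)"
    r"\+(?P<offset>[0-9a-f]+)/(?P<size>[0-9a-f]*)>"
)


def parse_symbol(address_hex, annotation):
    """Split one ksymoops annotation into its parts (hypothetical helper)."""
    match = SYMBOL_RE.search(annotation)
    if match is None:
        raise ValueError("not a ksymoops symbol annotation: %r" % annotation)
    address = int(address_hex, 16)
    offset = int(match.group("offset"), 16)
    size_text = match.group("size")
    return {
        "module": match.group("module"),        # None for core kernel symbols
        "symbol": match.group("symbol"),
        "offset": offset,
        "size": int(size_text, 16) if size_text else None,
        "start": address - offset,              # address where the symbol begins
    }


eip = parse_symbol("c010f6f3", "<__wake_up+33/a8>")
print(eip["symbol"], hex(eip["start"]))

trace = parse_symbol("c4c7d222", "<[usb-ohci]sohci_return_urb+10e/118>")
print(trace["module"], trace["symbol"], hex(trace["start"]))
```

Read this way, the faulting instruction sits 0x33 bytes into `__wake_up`, and the `je` target in the disassembly (`_EIP+0x5d`, i.e. c010f750) matches the `<__wake_up+90/a8>` annotation printed next to it.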
pegasus + MediaGX: Oops in khubd, the continuing story?
Hi'all,

I'm experiencing loads of intermittent Oops'es when loading the pegasus driver (for an SMC 2202) on my MediaGX-equipped (Webplayer) systems. A scan of the lists turned up more problems with the MediaGX (which contains an OHCI implementation in the 5530 companion chip) in combination with the pegasus driver, so it seems I'm not the only one...

The Oops'es are mostly in the khubd process, but they sometimes appear in other programs (insmod, ifconfig). They always lead to an immediate panic, and nothing is ever written to any log. When I tried to copy the Oops by hand on a notebook, the hard disk in that thing chose that specific moment to drop dead (I was nearly finished typing in the last call trace address...). And there was no rejoicing, and no call trace... Sorry...

Is this a known problem (MediaGX + pegasus == intermittent Oops on load/reload), or am I telling something new? If I am, I'll create that call trace and run it through ksymoops; if it is known, I'd rather spare myself the chore of typing in loads and loads of hex code. I've done enough of that in my Commodore-64 days...

Cheers//Frank
Re: * Re: Severe trashing in 2.4.4
On Tue, May 01, 2001 at 04:00:53PM -0700, David S. Miller wrote:
>
> Frank, thanks for doing all the legwork to resolve the networking
> side of this problem.

No problem...

I just diff'd the 'old' and 'new' kernel trees. The one which produced the ravenous skb_hungry kernels was for all intents and purposes identical to the one which produced the (working, bug_free(tm)) kernel I'm currently running...

Must be the weather...

Cheers//Frank
* Re: Severe trashing in 2.4.4
Well,

When a puzzled Alexey wondered whether the problems I was seeing with 2.4.4 might be related to a failure to execute 'make clean' before compiling the kernel, I replied in the negative as I *always* clean up before compiling anything. Yet, for the sake of science and such, I moved the kernel tree and started from scratch.

The problems I was seeing are no more; 2.4.4 behaves like a good kernel should.

Was it me? Was it reiserfs? Was it divine intervention? I will probably never find out, but for now this thread, and the accompanying scare, can Requiescat In Pace.

Cheers//Frank
Re: Severe trashing in 2.4.4
On Sun, Apr 29, 2001 at 04:45:00PM -0700, David S. Miller wrote:
>
> Frank de Lange writes:
> > What do you want me to check for? /proc/net/netstat is a rather busy place...
>
> Just show us the contents after you reproduce the problem.
> We just want to see if a certain event is being triggered.

Hm, 'twould be nice to know WHAT to look for (if only for educational purposes), but ok:

 http://www.unternet.org/~frank/projects/linux2404/2404-meminfo/

It contains an extra set of files, named p_n_netstat.*. Same as before, the .diff contains one-second interval diffs.

Cheers//Frank
Re: Severe trashing in 2.4.4
On Mon, Apr 30, 2001 at 12:06:52AM +0200, Manfred Spraul wrote:
> You could enable STATS in mm/slab.c, then the number of alloc and free
> calls would be printed in /proc/slabinfo.
>
> > Yeah, those as well. I kinda guessed they were related...
>
> Could you check /proc/sys/net/core/hot_list_length and skb_head_pool
> (not available in /proc, use gdb --core /proc/kcore)? I doubt that this
> causes your problems, but the skb_head code uses a special per-cpu
> linked list for even faster allocations.
>
> Which network card do you use? Perhaps a bug in the zero-copy code of
> the driver?

I'll give it a go once I reboot into 2.4.4 again (now in 2.4.3 to get some 'work' done). Using the dreaded ne2k cards (two of them), which have caused me more than one headache already...

I'll have a look at the driver for these cards.

Cheers//Frank
Re: Severe trashing in 2.4.4
On Sun, Apr 29, 2001 at 01:58:52PM -0400, Alexander Viro wrote:
> Hmm... I'd say that you also have a leak in kmalloc()'ed stuff - something
> in 1K--2K range. From your logs it looks like the thing never shrinks and
> grows pretty fast...

Same goes for buffer_head:

buffer_head        44236  48520     96 1188 1213    1 :  252  126

quite high I think. 2.4.3 shows this, after about the same time and activity:

buffer_head          891   2880     96   72   72    1 :  252  126

Cheers//Frank
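To put the two buffer_head lines side by side in kilobytes, here is a rough sketch. It assumes the 2.4-era /proc/slabinfo column order (cache name, active objects, total objects, object size, active slabs, total slabs, pages per slab, then the tunables after the colon), reads the last two numbers before the colon as 1213 slabs of 1 page and 72 slabs of 1 page, and assumes 4 KiB pages. `slab_footprint_kb` is a made-up helper name.

```python
PAGE_SIZE_KB = 4  # assumption: 4 KiB pages, as on i386


def slab_footprint_kb(line):
    """Return (cache name, KiB held by its slabs) for one slabinfo line.

    Assumes the column order described above; the tunables after the
    colon are ignored.
    """
    name, rest = line.split(None, 1)
    fields = rest.split(":")[0].split()
    (active_objs, num_objs, obj_size,
     active_slabs, num_slabs, pages_per_slab) = map(int, fields)
    return name, num_slabs * pages_per_slab * PAGE_SIZE_KB


line_244 = "buffer_head 44236 48520 96 1188 1213 1 : 252 126"
line_243 = "buffer_head 891 2880 96 72 72 1 : 252 126"

for label, line in (("2.4.4", line_244), ("2.4.3", line_243)):
    name, kb = slab_footprint_kb(line)
    print(label, name, kb, "KiB")
```

Under those assumptions the cache holds 1213 pages (4852 KiB) on 2.4.4 against 72 pages (288 KiB) on 2.4.3, roughly a factor of 17.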
Re: Severe trashing in 2.4.4
On Sun, Apr 29, 2001 at 01:58:52PM -0400, Alexander Viro wrote:
> Hmm... I'd say that you also have a leak in kmalloc()'ed stuff - something
> in 1K--2K range. From your logs it looks like the thing never shrinks and
> grows pretty fast...

Yeah, those as well. I kinda guessed they were related...

Cheers//Frank
Re: Severe trashing in 2.4.4
499928 kB
SwapFree: 461132 kB

And the top-10 memory hogs:

 892 54696  2279 /usr/bin/X11/XFree86 -depth 16 -gamma 1.6 -auth /var/lib/gdm/:0
 632  2932 11363 ps -ax -o rss,vsz,pid,command
 600  8988  2785 gnome-terminal -t [EMAIL PROTECTED]
 368  7660  2685 multiload_applet --activate-goad-server multiload_applet --goad
 312  2100  4731 top
 308  7528  2675 gnomexmms --activate-goad-server gnomexmms --goad-fd 10
 244  7660  2701 multiload_applet --activate-goad-server multiload_applet --goad
 240  7436  2682 asclock_applet --activate-goad-server asclock_applet --goad-fd
   4 11740  1110 /usr/sbin/mysqld --basedir=/ --datadir=/var/lib/mysql --user=my
   4 11740  1109 /usr/sbin/mysqld --basedir=/ --datadir=/var/lib/mysql --user=my

I've got a ton of logging from /proc/slabinfo, one entry a second. If someone wants to peruse it, you can find it here:

 http://www.unternet.org/~frank/projects/linux2404/2404-meminfo/

The .diff files are diffs between 'current' and 'previous' (one second interval) snapshots. slabinfo and meminfo are self-explanatory I guess. The 'memhogs' entry is the top-10 memory users list for each second of logging.

Cheers//Frank
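The once-a-second snapshot-and-diff logging described above could be sketched roughly as follows. This is not the script that produced the files at that URL (the message does not show it); the function names are invented for the illustration, and the sample slabinfo values at the bottom are hypothetical.

```python
import difflib
import time


def snapshot_diff(previous, current, label="slabinfo"):
    """Unified diff between two snapshots; empty string if nothing changed."""
    return "".join(difflib.unified_diff(
        previous.splitlines(keepends=True),
        current.splitlines(keepends=True),
        fromfile=label + ".previous",
        tofile=label + ".current",
    ))


def log_snapshots(path="/proc/slabinfo", count=10, interval=1.0):
    """Read `path` `count` times, `interval` seconds apart, yielding diffs.

    Reading /proc/slabinfo may require root on current kernels.
    """
    previous = None
    for _ in range(count):
        with open(path) as handle:
            current = handle.read()
        if previous is not None:
            yield snapshot_diff(previous, current)
        previous = current
        time.sleep(interval)


# Hypothetical one-second-apart samples, only to show the output shape.
before = "buffer_head 891 2880 96 72 72 1 : 252 126\n"
after = "buffer_head 900 2880 96 72 72 1 : 252 126\n"
print(snapshot_diff(before, after))
```

The same `log_snapshots` loop works for /proc/meminfo or /proc/net/netstat by passing a different `path`.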
Re: Severe trashing in 2.4.4
kB
SwapFree: 464420 kB

[frank@behemoth mp3]$ ps -xao rss,vsz,pid,command|sort -rn|head
2244 55304  1310 /usr/bin/X11/XFree86 -depth 16 -gamma 1.6 -auth /var/lib/gdm/:0
1644  5484  1401 sawfish --sm-client-id 11c0a801059849521860010240115 --
1252  9008  1438 gnome-terminal -t [EMAIL PROTECTED]
1172  2924  1796 ps -xao rss,vsz,pid,command
 956  7656  1413 tasklist_applet --activate-goad-server tasklist_applet --goad-f
 944  8388  1696 gnome-terminal --tclass=Remote -x ssh -v ostrogoth.localnet
 776  7588  1411 deskguide_applet --activate-goad-server deskguide_applet --goad
 556  3012  1797 sort -rn
 504  7436  1419 asclock_applet --activate-goad-server asclock_applet --goad-fd
 464  8356  1405 panel --sm-config-prefix /panel.d/default-ZTNCVS/ --sm-client-i

[ system just started thrashing again, had to sysrq-reboot ]

So, there's something wrong here... Wish I knew what... 2.4.3 runs fine on the same box with the same apps.

Any clues?

Cheers//Frank
Re: Network error persists in 2.4.4
> (on problems with ne2k-pci on SMP-systems)

Seems you're experiencing the effects of the infamous IO-APIC problem ('erratum' in Intel-lingo). There's a patch for these problems by Maciej W. Rozycki, which should (IMnsHO) really be accepted into the main kernel tree since many people are experiencing these problems and the patch fixes them quite well.

The patch has been submitted to the list several times now, but I'll do it again.

(attached to this message...)

Cheers//Frank

diff -up --recursive --new-file linux-2.4.1.macro/arch/i386/kernel/apic.c linux-2.4.1/arch/i386/kernel/apic.c
--- linux-2.4.1.macro/arch/i386/kernel/apic.c	Wed Dec 13 23:54:27 2000
+++ linux-2.4.1/arch/i386/kernel/apic.c	Mon Feb 12 16:11:15 2001
@@ -23,6 +23,7 @@
 #include <linux/mc146818rtc.h>
 #include <linux/kernel_stat.h>
+#include <asm/atomic.h>
 #include <asm/smp.h>
 #include <asm/mtrr.h>
 #include <asm/mpspec.h>
@@ -270,7 +271,13 @@ void __init setup_local_APIC (void)
 	 * PCI Ne2000 networking cards and PII/PIII processors, dual
 	 * BX chipset. ]
 	 */
-#if 0
+	/*
+	 * Actually disabling the focus CPU check just makes the hang less
+	 * frequent as it makes the interrupt distributon model be more
+	 * like LRU than MRU (the short-term load is more even across CPUs).
+	 * See also the comment in end_level_ioapic_irq().  --macro
+	 */
+#if 1
 	/* Enable focus processor (bit==0) */
 	value &= ~(1<<9);
 #else
@@ -764,7 +771,7 @@ asmlinkage void smp_error_interrupt(void
 	apic_write(APIC_ESR, 0);
 	v1 = apic_read(APIC_ESR);
 	ack_APIC_irq();
-	irq_err_count++;
+	atomic_inc(&irq_err_count);
 	/* Here is what the APIC error bits mean:
 	   0: Send CS error
diff -up --recursive --new-file linux-2.4.1.macro/arch/i386/kernel/i8259.c linux-2.4.1/arch/i386/kernel/i8259.c
--- linux-2.4.1.macro/arch/i386/kernel/i8259.c	Mon Nov 20 18:01:58 2000
+++ linux-2.4.1/arch/i386/kernel/i8259.c	Sun Feb 11 19:54:33 2001
@@ -12,6 +12,7 @@
 #include
 #include
+#include
 #include
 #include
 #include
@@ -321,7 +322,7 @@ spurious_8259A_irq:
 		printk("spurious 8259A interrupt: IRQ%d.\n", irq);
 		spurious_irq_mask |= irqmask;
 	}
-	irq_err_count++;
+	atomic_inc(&irq_err_count);
 	/*
 	 * Theoretically we do not have to handle this IRQ,
 	 * but in Linux this does not cause problems and is
diff -up --recursive --new-file linux-2.4.1.macro/arch/i386/kernel/io_apic.c linux-2.4.1/arch/i386/kernel/io_apic.c
--- linux-2.4.1.macro/arch/i386/kernel/io_apic.c	Sat Feb 3 12:05:49 2001
+++ linux-2.4.1/arch/i386/kernel/io_apic.c	Tue Feb 13 19:59:55 2001
@@ -33,6 +33,8 @@
 #include
 #include
+#define APIC_LOCKUP_DEBUG
+
 static spinlock_t ioapic_lock = SPIN_LOCK_UNLOCKED;
 /*
@@ -122,8 +124,14 @@ static void add_pin_to_irq(unsigned int
 	static void name##_IO_APIC_irq (unsigned int irq) \
 	__DO_ACTION(R, ACTION, FINAL)
-DO_ACTION( __mask, 0, |= 0x0001, io_apic_sync(entry->apic))/* mask = 1 */
-DO_ACTION( __unmask, 0, &= 0xfffe, ) /* mask = 0 */
+DO_ACTION( __mask, 0, |= 0x0001, io_apic_sync(entry->apic) )
+	/* mask = 1 */
+DO_ACTION( __unmask, 0, &= 0xfffe, )
+	/* mask = 0 */
+DO_ACTION( __mask_and_edge, 0, = (reg & 0x7fff) | 0x0001, )
+	/* mask = 1, trigger = 0 */
+DO_ACTION( __unmask_and_level, 0, = (reg & 0xfffe) | 0x8000, )
+	/* mask = 0, trigger = 1 */
 static void mask_IO_APIC_irq (unsigned int irq)
 {
@@ -847,6 +855,8 @@ void /*__init*/ print_local_APIC(void *
 	v = apic_read(APIC_EOI);
 	printk(KERN_DEBUG "... APIC EOI: %08x\n", v);
+	v = apic_read(APIC_RRR);
+	printk(KERN_DEBUG "... APIC RRR: %08x\n", v);
 	v = apic_read(APIC_LDR);
 	printk(KERN_DEBUG "... APIC LDR: %08x\n", v);
 	v = apic_read(APIC_DFR);
@@ -1191,12 +1201,61 @@ static unsigned int startup_level_ioapic
 #define enable_level_ioapic_irq		unmask_IO_APIC_irq
 #define disable_level_ioapic_irq	mask_IO_APIC_irq
-static void end_level_ioapic_irq (unsigned int i)
+static void end_level_ioapic_irq (unsigned int irq)
 {
+	unsigned long v;
+
+/*
+ * It appears there is an erratum which a
Re: 2.4.3: still experiencing APIC-related hangs
On Fri, Mar 30, 2001 at 08:32:39AM -0800, [EMAIL PROTECTED] wrote:
> On Fri, Mar 30, 2001 at 02:32:24PM +0200, Frank de Lange wrote:
> >
> > Maciej, did you submit the patch to Linus? It really seems to solve the
> > (occurrence of the) problems with these boards...
>
> Where is this patch found? I am not seeing it so far on kernel.org.

It is almost ancient history, from days long gone when men were men, women were women and Linux had only reached 2.4.1...

I can send you a copy, if you need it...

Cheers//Frank

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
2.4.3: still experiencing APIC-related hangs
Hi'all,

Subject says it all: 2.4.3 (unpatched) is still causing the dreaded APIC-related hangs on SMP BX systems (Abit BP-6, maybe Gigabyte). I still need to apply one of Maciej's patches to get rid of these hangs.

The source comments in arch/i386/kernel/apic.c ("If focus CPU is disabled then the hang goes away") are incorrect, as the hang does not go away by simply disabling focus CPU. The only way for me to get rid of the hangs is to apply patch-2.4.1-io_apic-46 (which does the LEVEL->EDGE->LEVEL triggered trick to 'free' the IO_APIC).

I've been running with this patch for quite some time now, and have not experienced any problems with it. Maybe it is time to include it in the main kernel, perhaps as a configurable option ("BROKEN_IO_APIC")?

Maciej, did you submit the patch to Linus? It really seems to solve the (occurrence of the) problems with these boards...

Cheers//Frank
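For readers who want to see what the LEVEL->EDGE->LEVEL trick amounts to at the register level, here is a minimal sketch in plain C. It assumes the usual IO-APIC redirection-entry layout (bit 16 = mask, bit 15 = trigger mode, 1 meaning level) and only models the bit arithmetic. The macro and function names are made up for illustration; real kernel code would write each intermediate value to the hardware under ioapic_lock.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit positions in the low 32 bits of a redirection entry. */
#define REDIR_TRIGGER_LEVEL (UINT32_C(1) << 15) /* 1 = level, 0 = edge */
#define REDIR_MASKED        (UINT32_C(1) << 16) /* 1 = interrupt masked */

/* Step 1: mask the pin and switch it to edge-triggered mode. */
static uint32_t mask_and_edge(uint32_t reg)
{
    return (reg & ~REDIR_TRIGGER_LEVEL) | REDIR_MASKED;
}

/* Step 2: unmask the pin and switch it back to level-triggered mode. */
static uint32_t unmask_and_level(uint32_t reg)
{
    return (reg & ~REDIR_MASKED) | REDIR_TRIGGER_LEVEL;
}

/*
 * Hypothetical helper: run both steps on a register value. The trick is
 * only meant for pins that are level-triggered to begin with.
 */
static uint32_t level_edge_level(uint32_t reg)
{
    assert(reg & REDIR_TRIGGER_LEVEL);
    return unmask_and_level(mask_and_edge(reg));
}
```

For an unmasked, level-triggered entry the round trip leaves the register value unchanged; the point of the exercise is the intermediate edge-triggered state, which is what is supposed to 'free' the stuck IO_APIC.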
Re: Linux 2.4.2-ac21
Oops...

Linux 2.4.2-ac21 does not like my box, or the other way around:

 loading the agpgart module (MGA G400 AGP) -> system hangs
 loading the SCSI module (53c875) -> system hangs

In both cases, the magic SysRq sequence does not work, but it is still possible to ping the box from the outside. Connecting to it (ssh) does not work, however.

I backed out both the SCSI driver patches as well as the agpgart patches, but this did not fix the symptoms. Looks more like a module-loading related issue, but I have not found it yet.

All this on an SMP (Abit BP6) box by the way...

The changes which introduced these symptoms have occurred somewhere between -ac7 and -ac21, since -ac7 DID run on the same hardware.

Cheers//Frank
Re: reiserfs on 2.4.1,2.4.2-pre (with null bytes patch) breaks mozilla compile
On Sat, Feb 17, 2001 at 06:18:46PM -0800, David wrote:
> > Well, I run glibc-2.2.1 as well, so that might be one of the factors
> > contributing to this. Then again, glibc-2.2.1 with ext2 does not cause any
> > problems whatsoever with mozilla. So it could be that reiserfs + glibc-2.2.1 is
> > a bad combination, question remains which of these two is the culprit (if not
> > both). Since glibc-2.2.2 is out, I will give that a try as well. Not tonight
> > though...

FYI

I'm running glibc-2.2.2 now, and alas... Mozilla still refuses to be compiled, no change...

Cheers//Frank
Re: reiserfs on 2.4.1,2.4.2-pre (with null bytes patch) breaks mozilla compile
> Minor nit, but I'd rather clear it up now. Which distribution you run
> doesn't matter for debugging. What does matter is that we've got known
> problems with a given compiler, and that compiler goes by a few different
> flavors with the same version number. Since there are known problems, if
> you don't provide the compiler version, I'll ask. If your bug is *really*
> odd, I might ask a few different ways, just to make sure you give the same
> answer every time ;-)

Well, a nit to a nit...

In my experience it surely matters which distribution somebody runs, since that tells a lot about the basic system (libc, probable compiler, binutils, etc). RH7 is broken in many respects. Since it uses glibc-2.2 as well, I usually add the notice that I do NOT run RH7 to messages like these where I mention I use glibc-2.2.x, if only to ward off the usual 'are you running RH7 if yes please upgrade so and so' cycle. Bits and electrons are much too precious to waste on useless banter like that...

Cheers//Frank
Re: reiserfs on 2.4.1,2.4.2-pre (with null bytes patch) breaks mozilla compile
On Sat, Feb 17, 2001 at 05:47:49PM -0800, David wrote:
> I can say "me too" for this. I thought it was perhaps glibc or binutils
> tho. I only have reiserfs systems now so I don't have a basis for
> comparison.
>
> However I -can- say that I didn't experience this until I put glibc
> 2.2.1 on my systems. I do use an "approved" gcc, stock 2.95.2.
>
> I wouldn't be so quick to pin it on reiserfs.

Well, I run glibc-2.2.1 as well, so that might be one of the factors contributing to this. Then again, glibc-2.2.1 with ext2 does not cause any problems whatsoever with mozilla. So it could be that reiserfs + glibc-2.2.1 is a bad combination, question remains which of these two is the culprit (if not both). Since glibc-2.2.2 is out, I will give that a try as well. Not tonight though...

And no, I'm not running RedHat 7.x for those who might think so (and automatically blame everything on it).

When did you switch to glibc-2.2.1? Were you running reiserfs before that?

Cheers//Frank
Re: reiserfs on 2.4.1,2.4.2-pre (with null bytes patch) breaks mozilla compile
On Sun, Feb 18, 2001 at 01:57:15AM +0100, Frank de Lange wrote:
> I will retry this with 'all warnings and bells and whistles' turned on in
> reiserfs (on 2.4.1-ac18), and see if anything out of the ordinary is logged. I
> somehow doubt it, since repeated forced reiserfsck's have turned up nothing at
> all...

I just ran the compile again on the described build, same results, no warnings of any kind, nothing in the debug log facility, nothing on the console...

Reiserfs seems to believe it did the right thing. I'm here to tell you that it didn't...

Cheers//Frank
Re: reiserfs on 2.4.1,2.4.2-pre (with null bytes patch) breaks mozilla compile
> At least the patch didn't make it worse. Would anyone care to comment on
> how the elf-dynstr-gc option changes the file access patterns for the
> compile?

It does not change the file access patterns, it adds an extra step. A separate binary (dist/bin/elf-dynstr-gc, a convoluted version of strip) is run over the final (linked) library/executable to remove some symbol info. The elf-dynstr-gc program is compiled as part of the mozilla build.

There's nothing wrong with elf-dynstr-gc on the reiserfs filesystem, it is identical to the one on the ext2 partition. Running the 'reiserfs' version on the ext2 tree works as it should, running the ext2 version on the reiserfs tree crashes (seems the program is not very robust, as it does not detect garbled input files).

As said, running objdump on the corrupted (reiserfs compiled) library also produces errors.

Cheers//Frank
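The missing robustness is the kind of thing a string-table bounds check would catch. The following is only a sketch, assuming the tool ends up dereferencing string-table offsets read from the damaged file; it is not taken from elf-dynstr-gc, and the function name is hypothetical. An offset is accepted only if it points inside the table and a terminating NUL is found before the table ends.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if `offset` names a NUL-terminated string that lies entirely
 * inside the string table, 0 otherwise. Hypothetical helper for
 * illustration only.
 */
static int strtab_offset_ok(const char *strtab, size_t strtab_size,
                            size_t offset)
{
    assert(strtab != NULL);
    if (offset >= strtab_size)
        return 0;
    /* Look for the terminator without reading past the end of the table. */
    return memchr(strtab + offset, '\0', strtab_size - offset) != NULL;
}
```

A tool that runs a check like this before following each offset can report a corrupt input file instead of segfaulting on it.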
Re: reiserfs on 2.4.1,2.4.2-pre (with null bytes patch) breaks mozilla compile
> That's not good. Which compiler did you use to compile the kernel? This
> sounds lame, but reiserfs exercises the cpu/mem more than ext2, so we hit
> bad ram more often. If we run out of other things to try, please run a
> memory tester.

I use 'good old' gcc 2.95.2:

 gcc -v:
 gcc version 2.95.2 19991024 (release)

I just tried 2.4.1-ac18, which also gave me the same segfault.

When I compare the corrupted binary (the one compiled on reiserfs) to the working one (compiled on ext2), I notice that at position 0x1000 in the file, a block of data from position 0x0e60 is duplicated. It seems to be inserted into the data stream, as it is followed by data which (in the working version of libsample.so) starts at 0x1000:

(bsdiff (binary sdiff) between both files)
(actually the differences between both files start much earlier, but that seems to be just all kinds of changed relocation information as a result of the error)
(hope my careful ASCII-formatting makes it through the list and the archives)

        THE BAD                            THE GOOD

<deletia, a lot of uninteresting data...>
e60 c4 20 83 c4 f4 8b 06                   e60 c4 20 83 c4 f4 8b 06
e68 8b 40 10 ff d0 eb 06                   e68 8b 40 10 ff d0 eb 06
e70 bf 0e 00 07 80 89 f8                   e70 bf 0e 00 07 80 89 f8
e78 65 e8 5b 5e 5f 89 ec                   e78 65 e8 5b 5e 5f 89 ec
e80 c3 8d 76 00 55 89 e5                   e80 c3 8d 76 00 55 89 e5
e88 c0 89 ec 5d c3 8d 76                   e88 c0 89 ec 5d c3 8d 76
e90 55 89 e5 31 c0 89 ec                   e90 55 89 e5 31 c0 89 ec
<deletia, a lot of uninteresting data...>
fd8 00 00 00 00 c0 00 00                   fd8 00 00 00 00 c0 00 00
fe0 00 00 00 46 80 a0 c0                   fe0 00 00 00 46 80 a0 c0
fe8 68 08 d3 11 91 5f d9                   fe8 68 08 d3 11 91 5f d9
ff0 89 d4 8e 3c 40 92 89                   ff0 89 d4 8e 3c 40 92 89
ff8 d2 f9 d2 11 bd d6 00                   ff8 d2 f9 d2 11 bd d6 00

LOOK HERE: IDENTICAL TO THE                AND THIS IS WHAT IT SHOULD
DATA AT e60                                LOOK LIKE...

0001000 c4 20 83 c4 f4 8b 06             | 0001000 64 65 73 74 86 52 38
0001008 8b 40 10 ff d0 eb 06             | 0001008 c4 cb d2 11 8c ca 00
0001010 bf 0e 00 07 80 89 f8             | 0001010 b0 fc 14 a3 a0 58 f1
0001018 65 e8 5b 5e 5f 89 ec             | 0001018 dd ca d2 11 8c ca 00
<deletia, a lot of uninteresting data...>
0001190 89 d4 8e 3c 40 92 89             <
0001198 d2 f9 d2 11 bd d6 00             <

AND HERE THE 'GOOD' DATA STARTS AGAIN, THIS BLOCK IS
IDENTICAL TO THE ONE AT 0x1000 IN THE 'GOOD' FILE

00011a0 64 65 73 74 86 52 38             <
00011a8 c4 cb d2 11 8c ca 00             <
00011b0 b0 fc 14 a3 a0 58 f1             <
00011b8 dd ca d2 11 8c ca 00             <
00011c0 b0 fc 14 a3 40 a7 58             <
00011c8 dc d5 d2 11 92 fb 00             <
<deletia, a lot of uninteresting data...>

So, it seems a wrong block of data was inserted into the stream at position 0x1000, wreaking havoc on the file structure. Now 0x1000 is kind of a magic number, isn't it? Almost too good to be true...

I will retry this with 'all warnings and bells and whistles' turned on in reiserfs (on 2.4.1-ac18), and see if anything out of the ordinary is logged. I somehow doubt it, since repeated forced reiserfsck's have turned up nothing at all...

Oh, and both my own and my computer's memory is OK, so this is not a hardware fault... :-)

By the way, /tmp (where most action is taking place when compiling) is hosted on a good ext2 filesystem. Just in case you wondered...

And, also of interest, I'm using an SMP box (BP6, 2 non overclocked Celeron 466s)

Cheers//Frank
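The observation in the dump can be checked mechanically. In the listing the good data resumes at 0x11a0, so the inserted run is 0x11a0 - 0x1000 = 0x1a0 bytes long, which is exactly the distance from 0xe60 to 0x1000. Below is a small sketch, assuming both files have already been read into memory; the function names are hypothetical. The first function finds how many bytes were inserted at a given offset, the second checks whether that run is a copy of earlier data in the same file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Smallest len > 0 such that bad[off+len ..] continues with good[off ..]
 * for at least `check` bytes, i.e. the length of a run inserted into
 * `bad` at `off`. Returns 0 if no such run is found.
 */
static size_t inserted_run_length(const unsigned char *bad, size_t badlen,
                                  const unsigned char *good, size_t goodlen,
                                  size_t off, size_t check)
{
    size_t len;

    assert(bad != NULL && good != NULL);
    if (check == 0 || off > goodlen || goodlen - off < check)
        return 0;
    for (len = 1; off + len + check <= badlen; len++) {
        if (memcmp(bad + off + len, good + off, check) == 0)
            return len;
    }
    return 0;
}

/* Return 1 if bad[off .. off+len) is a byte-for-byte copy of bad[src .. src+len). */
static int run_is_copy(const unsigned char *bad, size_t badlen,
                       size_t off, size_t len, size_t src)
{
    assert(bad != NULL);
    if (off > badlen || len > badlen - off)
        return 0;
    if (src > badlen || len > badlen - src)
        return 0;
    return memcmp(bad + off, bad + src, len) == 0;
}
```

With the two libsample.so images loaded, one would expect `inserted_run_length(bad, badlen, good, goodlen, 0x1000, 16)` to come out as 0x1a0 and `run_is_copy(bad, badlen, 0x1000, 0x1a0, 0xe60)` to hold, if the dump above is representative of the whole run.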
reiserfs on 2.4.1,2.4.2-pre (with null bytes patch) breaks mozilla compile
Hi'all,

Well, subject says it all...

When I try to compile mozilla (CVS version) with the '--enable-elf-dynstr-gc' option, the compile fails with a segfault:

 ../../dist/bin/elf-dynstr-gc ../../dist/lib/components/libsample.so
 make[2]: *** [install] Segmentation fault (core dumped)

Compiling the same codebase on an ext2 filesystem does not produce this segfault. When I compare the produced library (libsample.so), there is a consistent difference between the one compiled on the reiserfs filesystem and the one compiled on ext2. Running objdump on the reiserfs-compiled library also produces errors (some assertion failures, a lot of 'invalid string offset' errors, and finally a 'Memory exhausted' error), while objdump happily disassembles the ext2-produced binary.

These problems occur on:

 2.4.1
 2.4.2-pre4
 2.4.2-pre4 with Chris Mason's 'reiserfs fix for null bytes in small files'

So, there's something quite wrong here. If anyone wants me to try something, do tell...

Cheers//Frank
reiserfs on 2.4.1,2.4.2-pre (with null bytes patch) breaks mozilla compile
Hi'all, Well, subject says it all... When I try to compile mozilla (CVS version) with the '--enable-elf-dynstr-gc' option, the compile fails with a segfault: ../../dist/bin/elf-dynstr-gc ../../dist/lib/components/libsample.so make[2]: *** [install] Segmentation fault (core dumped) compiling the same codebase on an ext2 filesystem does not produce this segfault. When I compare the produced library (libsample.so), there is a consistent difference between the one compile on the reiserfs and the ext2 filesystem. Running objdump on the reiserfs-compiled library also produces errors (some assertion failures, a lot of 'invalid string offset' errors, and finally a 'Memory exhausted' error), while objdump happily disassebles the ext-produced binary. These problems occur on: 2.4.1 2.4.2-pre4 2.4.2-pre4 with Chris Mason's 'reiserfs fix for null bytes in small files' So, there's something quite wrong here. If anyone wants me to try something, do tell... Cheers//Frank -- W ___ ## o o\/ Frank de Lange \ }# \| / \ ##---# _/ Hacker for Hire \ \ +31-320-252965/ \[EMAIL PROTECTED]/ - [ "Omnis enim res, quae dando non deficit, dum habetur et non datur, nondum habetur, quomodo habenda est." ] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: reiserfs on 2.4.1,2.4.2-pre (with null bytes patch) breaks mozilla compile
That's not good. Which compiler did you use to compile the kernel? This sounds lame, but reiserfs exercises the cpu/mem more than ext2, so we hit bad ram more often. If we run out of other things to try, please run a memory tester. I use 'good old' gcc 2.95.2: gcc -v: gcc version 2.95.2 19991024 (release) I just tried 2.4.1-ac18, which also gave me the same segfault. When I compare the corrupted binary (the one compile on reiserfs) to the working one (compiled on ext2), I notice that at position 0x1000 in the file, a block of data from position 0x0e60 is duplicated. It seems to be inserted into the data stream, as it is followed by data which (in the working version of libsample.so) starts at 0x1000: (bsdiff (binary sdiff) between both files) (actually the differences between both files start much earlier, but that seems to be just all kinds of changed relocation information as a result of the error) (hope my careful ASCII-formatting makes it through the list and the archives) THE BAD THE GOOD deletia, a lot of uninteresting data... e60 c4 20 83 c4 f4 8b 06 e60 c4 20 83 c4 f4 8b 06 e68 8b 40 10 ff d0 eb 06 e68 8b 40 10 ff d0 eb 06 e70 bf 0e 00 07 80 89 f8 e70 bf 0e 00 07 80 89 f8 e78 65 e8 5b 5e 5f 89 ec e78 65 e8 5b 5e 5f 89 ec e80 c3 8d 76 00 55 89 e5 e80 c3 8d 76 00 55 89 e5 e88 c0 89 ec 5d c3 8d 76 e88 c0 89 ec 5d c3 8d 76 e90 55 89 e5 31 c0 89 ec e90 55 89 e5 31 c0 89 ec deletia, a lot of uninteresting data... fd8 00 00 00 00 c0 00 00 fd8 00 00 00 00 c0 00 00 fe0 00 00 00 46 80 a0 c0 fe0 00 00 00 46 80 a0 c0 fe8 68 08 d3 11 91 5f d9 fe8 68 08 d3 11 91 5f d9 ff0 89 d4 8e 3c 40 92 89 ff0 89 d4 8e 3c 40 92 89 ff8 d2 f9 d2 11 bd d6 00 ff8 d2 f9 d2 11 bd d6 00 LOOK HERE: IDENTICAL TO THE AND THIS IS WHAT IT SHOULD DATA AT e60 LOOK LIKE... 
0001000 c4 20 83 c4 f4 8b 06 | 0001000 64 65 73 74 86 52 38 0001008 8b 40 10 ff d0 eb 06 | 0001008 c4 cb d2 11 8c ca 00 0001010 bf 0e 00 07 80 89 f8 | 0001010 b0 fc 14 a3 a0 58 f1 0001018 65 e8 5b 5e 5f 89 ec | 0001018 dd ca d2 11 8c ca 00 deletia, a lot of uninteresting data... 0001190 89 d4 8e 3c 40 92 89 0001198 d2 f9 d2 11 bd d6 00 AND HERE THE 'GOOD' DATA STARTS AGAIN, THIS BLOCK IS IDENTICAL TO THE ONE AT 0x1000 IN THE 'GOOD' FILE 00011a0 64 65 73 74 86 52 38 00011a8 c4 cb d2 11 8c ca 00 00011b0 b0 fc 14 a3 a0 58 f1 00011b8 dd ca d2 11 8c ca 00 00011c0 b0 fc 14 a3 40 a7 58 00011c8 dc d5 d2 11 92 fb 00 deletia, a lot of uninteresting data... So, it seems a wrong block of data was inserted into the stream at position 0x1000, wreaking havoc on the file structure. Now 0x1000 is kind of a magic number, isn't it? Alsmost to good to be true... I will retry this with 'all warnings and bells and whistles' turned on in reiserfs (on 2.4.1-ac18), and see if anything out of the ordinary is logged. I somehow doubt it, since repeated forced reiserfsck's have turned up nothing at all... Oh, and both my own and my computer's memory is OK, so this is not a hardware fault... :-) By the way, /tmp (where most action is taking place when compiling) is hosted on a good ext2 filesystem. Just in case you wondered... And, also of interest, I'm using an SMP box (BP6, 2 non overclocked Celeron 466s) Cheers//Frank -- W ___ ## o o\/ Frank de Lange \ }# \| / \ ##---# _/ Hacker for Hire \ \ +31-320-252965/ \[EMAIL PROTECTED]/ - [ "Omnis enim res, quae dando non deficit, dum habetur et non datur, nondum habetur, quomodo habenda est." ] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: reiserfs on 2.4.1,2.4.2-pre (with null bytes patch) breaks mozilla compile
At least the patch didn't make it worse. Would anyone care to comment on how the elf-dynstr-gc option changes the file access patterns for the compile? It does not change the file access patterns, it adds an extra step. A separate binary (dist/bin/elf-dynstr-gc, a convoluted version of strip) is run over the final (linked) library/executable to remove some symbol info. The elf-dynstr-gc program is compiled as part of the mozilla build. There's nothing wrong with elf-dynstr-gc on the reiserfs filesystem, it is identical to the one on the ext2 partition. Running the 'reiserfs' version on the ext2 tree works as it should, running the ext2 version on the reiserfs tree crashes (seems the program is not very robust, as it does not detect garbled input files). As said, running objdump on the corrupted (reiserfs compiled) library also produces errors. Cheers//Frank -- W ___ ## o o\/ Frank de Lange \ }# \| / \ ##---# _/ Hacker for Hire \ \ +31-320-252965/ \[EMAIL PROTECTED]/ - [ "Omnis enim res, quae dando non deficit, dum habetur et non datur, nondum habetur, quomodo habenda est." ] - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: reiserfs on 2.4.1,2.4.2-pre (with null bytes patch) breaks mozilla compile
On Sun, Feb 18, 2001 at 01:57:15AM +0100, Frank de Lange wrote:
> I will retry this with 'all warnings and bells and whistles' turned on in
> reiserfs (on 2.4.1-ac18), and see if anything out of the ordinary is
> logged. I somehow doubt it, since repeated forced reiserfsck's have turned
> up nothing at all...

I just ran the compile again on the described build, same results, no warnings
of any kind, nothing in the debug log facility, nothing on the console...

Reiserfs seems to believe it did the right thing. I'm here to tell you that it
didn't...

Cheers//Frank
Re: reiserfs on 2.4.1,2.4.2-pre (with null bytes patch) breaks mozilla compile
On Sat, Feb 17, 2001 at 05:47:49PM -0800, David wrote:
> I can say "me too" for this. I thought it was perhaps glibc or binutils
> tho. I only have reiserfs systems now so I don't have a basis for
> comparison. However I -can- say that I didn't experience this until I put
> glibc 2.2.1 on my systems. I do use an "approved" gcc, stock 2.95.2.
>
> I wouldn't be so quick to pin it on reiserfs.

Well, I run glibc-2.2.1 as well, so that might be one of the factors
contributing to this. Then again, glibc-2.2.1 with ext2 does not cause any
problems whatsoever with mozilla. So it could be that reiserfs + glibc-2.2.1
is a bad combination; the question remains which of these two is the culprit
(if not both).

Since glibc-2.2.2 is out, I will give that a try as well. Not tonight
though...

And no, I'm not running RedHat 7.x for those who might think so (and
automatically blame everything on it).

When did you switch to glibc-2.2.1? Were you running reiserfs before that?

Cheers//Frank
Re: [patch] 2.4.1, 2.4.2-pre3: APIC lockups
On Tue, Feb 13, 2001 at 09:13:10PM +0100, Maciej W. Rozycki wrote:
> There is also an additional debugging/statistics counter provided in
> /proc/cpuinfo that counts interrupts which got delivered with its trigger
> mode mismatched. Check it out to find if you get any misdelivered
> interrupts at all.

I guess you mean the MIS: counter in /proc/interrupts? This is what it says on
my box after running some 330000 interrupts (at a rate of approx. 900/second)
through the network/usb IRQ:

cat /proc/interrupts
           CPU0       CPU1
  0:      31693      32749    IO-APIC-edge  timer
  1:       1208       1174    IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  3:        113         26    IO-APIC-edge  serial
  4:       4689       4567    IO-APIC-edge  serial
 14:       4440       4545    IO-APIC-edge  ide0
 15:       1911       2132    IO-APIC-edge  ide1
 16:      85021      84227   IO-APIC-level  es1371, mga@PCI:1:0:0
 17:         26         26   IO-APIC-level  sym53c8xx
 18:          0          0   IO-APIC-level  btaudio, bttv
 19:     165467     166254   IO-APIC-level  eth0, eth1, usb-uhci
NMI:      64376      64376
LOC:      64364      64362
ERR:          0
MIS:        647

So, that's about 650 misdelivered interrupts for 330000 deliveries (the other
interrupts never gave me any trouble, so I guess the misdelivered ones are all
from IRQ 19), or about .2%

When I load the network and stream some audio over it, the sound becomes a bit
choppy. The MIS: counter only increases when the network (read: IRQ 19) is
loaded; a single audio stream (approx. 220 int/sec) causes no MISses to occur.

In general, I'd say the stability WITH the patch is good, and timeouts are
within tolerable levels. If I need something better, I'll probably get myself
a better set of network cards...

So, quick conclusion, this seems a reasonable fix...

Cheers//Frank
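The percentage quoted above can be rechecked from the listing itself. The following Python sketch parses a /proc/interrupts style dump and divides the MIS: count by the total for one IRQ. The function names are hypothetical, the sample is trimmed to the two relevant rows, and attributing every MIS event to IRQ 19 is the message's own guess rather than something the counter proves.

```python
SAMPLE = """\
           CPU0       CPU1
 19:     165467     166254   IO-APIC-level  eth0, eth1, usb-uhci
MIS:        647
"""


def parse_interrupts(text: str) -> dict:
    """Map each row label ('19', 'MIS', ...) to its list of per-CPU counts."""
    counts = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        if not sep:
            continue  # the "CPU0 CPU1" header row has no colon
        values = []
        for token in rest.split():
            if not token.isdigit():
                break  # reached the controller type / device names
            values.append(int(token))
        counts[label.strip()] = values
    return counts


def mis_percentage(text: str, irq: str) -> float:
    """Misdelivered interrupts as a percentage of deliveries on one IRQ."""
    counts = parse_interrupts(text)
    delivered = sum(counts[irq])
    misdelivered = sum(counts["MIS"])
    return 100.0 * misdelivered / delivered


if __name__ == "__main__":
    print(f"{mis_percentage(SAMPLE, '19'):.2f}% of IRQ 19 deliveries")
```

On a live system the same functions could be fed the contents of /proc/interrupts, provided the kernel in use exposes a MIS: row.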
Re: hard crashes 2.4.0/1 with NE2K stuff
On Mon, Feb 05, 2001 at 07:41:11PM, Roeland Th. Jansen wrote:
> On Mon, Feb 05, 2001 at 06:26:52PM, Roeland Th. Jansen wrote:
> >
> > I'll report further. an Maciej -- thanks for your work !
>
> with the extra patch in arch/i386/kernel/apic.c:
>
> #else
>         /* Disable focus processor (bit==1) */
>         value |= (1<<9);
> #endif
>
> used, eth0 (ne2k) doesn't die anymore; no choppy sound either. we're
> currently having over 2.100.000 interrupts without a problem.

Same here (although I just changed #if 1 to #if 0 to disable focus processor
support), the net stays up and the chops are gone.

Cheers//Frank
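For readers unfamiliar with the quoted snippet: bit 9 of the register value is a "focus processor disabled" flag, so setting it (`value |= (1<<9)`) turns the feature off and clearing it turns the feature on. The Python sketch below only models that bit arithmetic. The helper names and the starting value 0x1ff are assumptions for illustration, and it does not touch any hardware.

```python
FOCUS_DISABLED_BIT = 1 << 9  # bit == 1 means focus processor checking is off


def disable_focus_processor(value: int) -> int:
    """Mirror of the quoted 'value |= (1<<9);' line."""
    return value | FOCUS_DISABLED_BIT


def enable_focus_processor(value: int) -> int:
    """The opposite branch: clear bit 9 so the feature stays enabled."""
    return value & ~FOCUS_DISABLED_BIT


def focus_processor_enabled(value: int) -> bool:
    return value & FOCUS_DISABLED_BIT == 0


if __name__ == "__main__":
    value = 0x1ff  # hypothetical register contents, bit 9 clear
    patched = disable_focus_processor(value)
    print(f"{value:#x} -> {patched:#x}, enabled={focus_processor_enabled(patched)}")
```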
Re: hard crashes 2.4.0/1 with NE2K stuff
> 2.4.1. rebuilt here and with a floodping towards my machine causes a
> hard crash where nothing works anymore.

I'm currently running 2.4.1 with Maciej's patch-2.4.0-io_apic-4. Additionally,
I disabled focus_processor in apic.c to get rid of some network delays. Flood
pings both from and to this system do not cause any problems, other than
making the streaming audio sound a bit choppy...

Box is a dual-celeron (466, non-overclocked) BP-6 with two ne2k (Winbond
W89C940 based) cards sharing an interrupt.

Maybe that works for you as well?

Cheers//Frank
Re: Gigabyte 6VXDC7: APIC error on CPU1: 08(08)
Heikki,

Those are the same problems I had with my Abit BP-6 SMP board. There are a
couple of patches which seem to make the problem disappear. The jury is still
out on whether they really solve the problem or merely hide it, but I haven't
had a crash since I patched my box.

The most recent patch is the one from Maciej; you can find it on the list, or
in the archives (like this one:
http://boudicca.tux.org/hypermail/linux-kernel/this-week/0469.html - this link
is only valid until Sunday!)

Unfortunately, the archives often mangle patches, so it is better to get them
directly from the list (or mail Maciej for it...)

Cheers//Frank
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
On Sun, Jan 14, 2001 at 12:13:58AM, Roeland Th. Jansen wrote:
> On Fri, Jan 12, 2001 at 09:03:49PM +0100, Ingo Molnar wrote:
> > well, some time ago i had an ne2k card in an SMP system as well, and found
> > this very problem. Disabling/enabling focus-cpu appeared to make a
> > difference, but later on i made experiments that show that in both cases
> > the hang happens. I spent a good deal of time trying to fix this problem,
> > but failed - so any fresh ideas are more than welcome.
>
> for the record. my BP6, non OC, apic smp system with ne2k fails within
> 24 hours here too. if I can be of any help. (2.4.0. kernel. no
> vmware or opensound)

You can help yourself by applying Manfred's patch to 8390.c (in preference to
my own patch to the same file). This will solve the hanging-network problem.
If your entire box hangs, that's another story which will probably not be
fixed by that patch.

You can find the patch in Manfred's posting to the list from Fri Jan 12 2001 -
14:04:24 EST. I've been running a patched driver for more than a day now,
under heavy network load, without problems.

Frank
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware
On Sat, Jan 13, 2001 at 02:51:54AM +0100, Manfred Spraul wrote:
> Frank de Lange wrote:
> >
> > It could be that people using those cards are not the ones who tend
> > to go for the (somewhat tricky) BP6 board...
> >
>
> I doubt that it's BP6 specific: I have the problem with a Gigabyte BXD
> board and I doubt that Ingo used an BP6. Perhaps 82093AA specific (the
> IO APIC chip used for SMP 440BX board)

It isn't. But I just meant to indicate that the mere fact that I could not
find any problem report for that combination does not indicate that there ARE
no problems...

> I can't find any spec updates for that chip: either it's the first
> perfect chip Intel ever produced, or ... :-)

Well, the BX chipset is one of their better attempts I think...

Frank
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware
On Fri, Jan 12, 2001 at 04:56:24PM -0800, Linus Torvalds wrote:
> IDE is not my favourite example of a "known stable driver". Also, in many
> cases IDE is for historical reasons connected to an EDGE io-apic pin (ie
> it's still considered an ISA interrupt). Which probably wouldn't show this
> problem anyway.

They (ide interrupts) are indeed EDGE-triggered on my box. I have not enabled
the HPT366 (ATA66) controller on this board, so I cannot tell if that
controller is EDGE-triggered as well.

> Also, IDE doesn't generate all that many interrupts. You can make a
> network driver do a _lot_ more interrupts than just about any disk driver
> by simply sending/receiving a lot of packets. With disks it is very hard
> to get the same kind of irq load - Linux will merge the requests and do at
> least 1kB worth of transfer per interrupt etc. On a ne2k 100Mbps PCI card,
> you can probably _easily_ generate a much higher stream of interrupts.

There's sound... The msnd.c (Turtle Beach MultiSound) driver (and its
derivatives, like msnd_pinnacle) uses disable_irq. Running esd (esound
daemon), sound can easily generate > 1000 interrupts/second, since esd uses
small dma transfers. This can be seen quite clearly from /proc/interrupts on
my soundserver:

           CPU0
  0:  276867328          XT-PIC  timer
  1:          2          XT-PIC  keyboard
  2:          0          XT-PIC  cascade
  3:    7631519          XT-PIC  eth1
  4:    2751419          XT-PIC  serial
  5: 1907346678          XT-PIC  soundblaster
  8:          1          XT-PIC  rtc
  9:   45022986          XT-PIC  eth0
 13:          1          XT-PIC  fpu
 14:    4320643          XT-PIC  ide0
 15:    4409193          XT-PIC  ide1
NMI:          0

OK, this is an ageing P166, and it uses a different driver, etc.

I have not found any problems with hanging sound drivers in a Google query for
'linux msnd bp6' or 'linux multisound bp6'. Of course, this is no conclusive
evidence, far from it... It could be that people using those cards are not the
ones who tend to go for the (somewhat tricky) BP6 board...

Cheers//Frank
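As a rough cross-check of the interrupt load in that listing, the timer row can serve as a clock. The Python sketch below assumes the timer ticks at HZ = 100 (the usual i386 value for kernels of that era) and that no counter has wrapped around since boot; both are assumptions, and the helper names are invented for this example. The result is an average over the whole uptime, so bursts while esd is actually playing would be higher.

```python
HZ = 100  # assumption: timer interrupts per second on this kernel


def uptime_seconds(timer_count: int, hz: int = HZ) -> float:
    """Estimate uptime from the IRQ 0 (timer) counter."""
    return timer_count / hz


def average_rate(irq_count: int, timer_count: int, hz: int = HZ) -> float:
    """Average interrupts per second for one IRQ since boot."""
    return irq_count / uptime_seconds(timer_count, hz)


if __name__ == "__main__":
    timer = 276867328          # IRQ 0 from the listing above
    soundblaster = 1907346678  # IRQ 5
    eth0 = 45022986            # IRQ 9
    print(f"uptime  : {uptime_seconds(timer) / 86400:.1f} days")
    print(f"sound   : {average_rate(soundblaster, timer):.0f} int/s on average")
    print(f"network : {average_rate(eth0, timer):.0f} int/s on average")
```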
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware
On Fri, Jan 12, 2001 at 04:36:33PM -0800, Linus Torvalds wrote:
> It may well not be disable_irq() that is buggy. In fact, there's good
> reason to believe that it's a hardware problem.

I am inclined to believe it IS a hardware problem... If disable_irq were
buggy, wouldn't the problem occur more frequently in other irq-heavy areas? A
quick count shows that disable_irq* is used in 84 source files in the
drivers/* directory. This includes drivers which generate many interrupts in a
short timeframe (like ide).

Frank
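A count like the one above could be reproduced with a short script. This is a sketch of one possible approach, not the command used in the message: the regular expression, the restriction to .c and .h files, and the example path are all assumptions.

```python
import os
import re

# Matches calls such as disable_irq(...) and disable_irq_nosync(...).
PATTERN = re.compile(rb"\bdisable_irq\w*\s*\(")


def files_using(root: str, pattern: "re.Pattern[bytes]" = PATTERN) -> list:
    """Return the sorted list of .c/.h files under root that match pattern."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith((".c", ".h")):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                if pattern.search(handle.read()):
                    hits.append(path)
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical location of an unpacked kernel tree.
    root = "/usr/src/linux/drivers"
    if os.path.isdir(root):
        print(len(files_using(root)), "files reference disable_irq*")
```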
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
On Fri, Jan 12, 2001 at 04:15:37PM -0800, Linus Torvalds wrote:
> On Fri, 12 Jan 2001, Frank de Lange wrote:
> >
> > Gentleman, this (the patch to 8390.c) seems to fix the problem.
>
> The problem with this patch is that anybody with a slow ISA ne2000 clone
> will basically have absolutely _horrible_ interrupt latency because we
> hold the irq lock over some quite expensive operations.
>
> The spin_lock_irqsave() is absolutely my preferred fix, and if I remember
> correctly this is in fact how some early 2.1.x code fixed the ne2000
> driver when the original irq scalability stuff happened (for some time
> during development we did not have a working "disable_irq()" AT ALL
> because the irq-disabling counters etc logic hadn't been done).

And that's the patch I meant... Manfred's
spin_lock_irqsave/spin_unlock_irqrestore based one, not my
spin_lock_irq/spin_unlock_irq based patch. That is also the one I'm running
now.

Frank
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
> Remind me: what polarity are your io-apic irq's? Level, edge, sideways?
> Anything else that might be relevant?

Well, sideways of course! :-)

Here's a cat /proc/interrupts from the (BP6) box:

           CPU0       CPU1
  0:     104936     105433    IO-APIC-edge  timer
  1:       4384               IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  3:         79         59    IO-APIC-edge  serial
  4:      12743      12850    IO-APIC-edge  serial
 14:       7855       7885    IO-APIC-edge  ide0
 15:       1990       1703    IO-APIC-edge  ide1
 16:          0          0   IO-APIC-level  es1371, mga@PCI:1:0:0
 17:         24         28   IO-APIC-level  sym53c8xx
 18:          0          0   IO-APIC-level  bttv
 19:     460435     460402   IO-APIC-level  eth0, eth1, usb-uhci
NMI:     210303     210303
LOC:     210285     210284
ERR:          0

The interrupt which caused problems was 19 (with both network cards and USB on
it). It shows a high number of interrupts because I've been load-testing the
network. The mere fact that it shows this high number of interrupts shows the
fix works...

As this is a BP6, I'm now supposed to go on about the dead chickens, dedicated
air conditioners, nuclear power supplies and other magic you're supposed to
buy to get these boards running. Well, nothing of that sort: it is running on
a simple (but high quality) 235W PSU with heatgreased coolers on the CPUs and
the BX chipset. Nothing is overclocked. CPU and chipset temperatures are 24°C
and 32°C, respectively. In short, nothing remarkable.

All PCI slots are used, as you can see from my first posting in this thread
(which contains more info on the hardware).

//Frank
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
On Fri, Jan 12, 2001 at 09:54:31PM +0100, Manfred Spraul wrote:
> I have found one combination that doesn't hang with the unpatched
> 8390.c, but network throughput is down to 1/2. I hope that's due to the
> debugging changes.

Hm, could it be that the fact that network throughput is halved causes the
problem not to appear? Remember, it only appears under HEAVY network load. A
single nfs cp -rd was not enough to hang my network, I needed to add at least
another cp -rd or some streaming audio or something else...

Cheers//Frank
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
On Fri, Jan 12, 2001 at 09:51:36PM +0100, Ingo Molnar wrote:
> great. Back when i had the same problem, flood pinging another host (on
> the local network) was the quickest way to reproduce the hang:
>
>         ping -f -s 10 otherhost
>
> this produced an IOAPIC-hang within seconds.

Apart from killing streaming audio and interactive network use, nothing hangs.
As soon as the ping flood is stopped, audio streams on and ssh sessions are
usable again.

So, it seems to fix it...

Frank
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
On Fri, Jan 12, 2001 at 09:37:24PM +0100, Ingo Molnar wrote:
> okay - i just wanted to hear a definitive word from you that this fixes
> your problem, because this is what we'll have to do as a final solution.
> (barring any other solution.)

Now running with this config:

 PATCHED 8390.c (using irq_safe spinlocks instead of disable_irq)
 PATCHED apic.c (focus cpu ENABLED)
 STOCK io_apic.c

No problems under heavy network load.

Gentlemen, this (the patch to 8390.c) seems to fix the problem.

Cheers//Frank
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
On Fri, Jan 12, 2001 at 09:34:03PM +0100, Ingo Molnar wrote:
> ? this is x86-only code. There is no hot-pluggable CPU support for Linux
> AFAIK. (But in any case, the code is basically ready for hot-pluggable
> CPUs, just take a few precautions and change cpu_online_mask and a couple
> of other things.)

OK, maybe the Sun example was not the best to give for this code... But if
there are no hot-pluggable x86's around now (I think there are, but cannot
recollect who made 'm...) and nobody is complaining, then it is fine with
me...

I won't hot-unplug my BP6's CPUs anyway...

Cheers//Frank
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
On Fri, Jan 12, 2001 at 09:31:15PM +0100, Ingo Molnar wrote:
>
> On Fri, 12 Jan 2001, Frank de Lange wrote:
>
> > WITH or WITHOUT the changed 8390 driver? I can already give you the
> > results for running WITH the changed driver: it works. I have not yet
> > tried it WITHOUT the changed 8390 driver (so that would be stock 8390,
> > patched apic.c, stock io_apic.c). Please let me know which you want...
>
> WITH. patched 8390.c, patched apic.c, sock io_apic.c. My very strong
> feeling is that this will be a stable combination, and that this is what
> we want as a final solution.

It is. As I already mentioned in other messages, I already tested with JUST
the patched 8390.c driver, no other patches. It was stable. I then patched
apic.c AND io_apic.c, which did not introduce new instabilities. Unless you
think that reverting back to a stock io_apic.c would cause instabilities
(which would be weird, since I had no instabilities running only a patched
8390.c), I think the patch to 8390.c DOES remove the symptoms all by itself.
No other patches seem necessary to get a stable box.

But I'll patch the mess again just for kicks :-)

Cheers//Frank
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
On Fri, Jan 12, 2001 at 09:19:53PM +0100, Ingo Molnar wrote:
> > In addition, I patched apic.c (focus cpu enabled)
> > In addition, I patched io_apic ((TARGET_CPUS 0xff)
>
> please try it with the focus CPU enabling change (we want to enable that
> feature, i only disabled it due to the stuck-ne2k bug), but with
> TARGET_CPUS set to cpu_online_mask. (this later is needed for certain
> crappy BIOSes.)

WITH or WITHOUT the changed 8390 driver? I can already give you the results
for running WITH the changed driver: it works. I have not yet tried it WITHOUT
the changed 8390 driver (so that would be stock 8390, patched apic.c, stock
io_apic.c). Please let me know which you want...

Frank
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
On Fri, Jan 12, 2001 at 09:11:29PM +0100, Manfred Spraul wrote:
> Frank, please clarify:
> you still run without disable_irq_nosync() in 8390.c?

I am running with your patched version of 8390.c (so WITHOUT
disable_irq_nosync()).
In addition, I patched apic.c (focus cpu enabled)
In addition, I patched io_apic.c (TARGET_CPUS 0xff)

> I have a first idea: we send an EOI to an interrupt that is masked on
> the IO apic, perhaps that causes the problems.

Sounds plausible...

> I'm right now typing a patch.

I'll await yours instead of making my own patch this time... :-)

Cheers//Frank
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
On Fri, Jan 12, 2001 at 11:59:25AM -0800, Linus Torvalds wrote:
> > Could this really be the solution?
>
> I'd like to know _which_ of the two makes a difference (or does it only
> trigger with both of them enabled)? And even then I'm not sure that it is
> "the" solution - both changes to io-apic handling had some reason for
> them. Ingo, what was the focus-cpu thing?

Well, with 'this' (in 'could THIS be') I really meant the move from
disable_irq to the irq_safe spinlocks. I'm currently running with the patched
8390.c driver, patched io_apic.c (TARGET_CPUS 0xff) and patched apic.c (focus
cpu enabled), and have had no problems yet... even though I'm running several
simultaneous nfs cp -rd, streaming network audio, scanning with a USB scanner,
etc.

So far, it seems that the patch to 8390.c removed the symptoms. The changes to
apic.c and io_apic.c did not make the network hang come back.

Cheers//Frank
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
On Fri, Jan 12, 2001 at 08:33:15PM +0100, Manfred Spraul wrote:
> Frank, the 2.4.0 contains 2 band aids that were added for ne2k smp:
>
> * From Ingo: focus cpu disabled, in arch/i386/kernel/apic.c
> * From myself: TARGET_CPU = cpu_online_mask, was 0xFF.
>
> Could you disable both bandaids?

I disabled them, no problems so far. I disabled both (I guess you meant the
'define TARGET_CPUS cpu_online' in io_apic.c?), and reverted my own patch,
added your patch...

Now running with the usual heavy network load, no problems so far... Also made
USB produce interrupts (shares irq with network), no problems...

Could this really be the solution?

Cheers//Frank
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
On Fri, Jan 12, 2001 at 08:04:24PM +0100, Manfred Spraul wrote:
> I removed the disable_irq lines from 8390.c, and that fixed the problem:
> no hang within 2 minutes - the test is still running.
>
> Frank, could you double check it?

I'm currently running my own patched version, which uses
spin_lock_irq/spin_unlock_irq instead of
spin_lock_irqsave/spin_unlock_irqrestore like your patch uses. Looking at
spinlock.h, spin_lock_irq does a local irq disable, which seems to be closer
to the original intent (disable_irq) than spin_lock_irqsave. Anyone want to
comment on this?

Anyway, still running under load, also got USB (which uses the same irq) to
produce some interrupts by scanning some stuff. No problems so far...

Cheers//Frank
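On the question raised above: both spin_lock_irq and spin_lock_irqsave disable interrupts on the local CPU while the lock is held. The difference is on release, where spin_unlock_irq unconditionally re-enables local interrupts and spin_unlock_irqrestore puts back whatever state was saved. Neither is equivalent to disable_irq, which masks one interrupt line rather than all local interrupts. The toy Python model below illustrates only the save/restore difference; the ToyCPU class is invented for this sketch and is not kernel code.

```python
class ToyCPU:
    """Minimal model of one CPU's interrupt-enable flag and one spinlock."""

    def __init__(self, irqs_enabled: bool = True):
        self.irqs_enabled = irqs_enabled
        self.lock_held = False

    # spin_lock_irq / spin_unlock_irq: disable, then unconditionally enable.
    def spin_lock_irq(self) -> None:
        self.irqs_enabled = False
        self.lock_held = True

    def spin_unlock_irq(self) -> None:
        self.lock_held = False
        self.irqs_enabled = True

    # spin_lock_irqsave / spin_unlock_irqrestore: remember the old state.
    def spin_lock_irqsave(self) -> bool:
        flags = self.irqs_enabled
        self.irqs_enabled = False
        self.lock_held = True
        return flags

    def spin_unlock_irqrestore(self, flags: bool) -> None:
        self.lock_held = False
        self.irqs_enabled = flags


if __name__ == "__main__":
    # Caller already runs with interrupts off, for example inside a handler.
    cpu = ToyCPU(irqs_enabled=False)
    cpu.spin_lock_irq()
    cpu.spin_unlock_irq()
    print("after _irq pair:     irqs_enabled =", cpu.irqs_enabled)

    cpu = ToyCPU(irqs_enabled=False)
    flags = cpu.spin_lock_irqsave()
    cpu.spin_unlock_irqrestore(flags)
    print("after _irqsave pair: irqs_enabled =", cpu.irqs_enabled)
```

The model suggests why the irqsave variant is the safer choice when a function may be entered with interrupts already disabled.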
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
On Fri, Jan 12, 2001 at 08:04:24PM +0100, Manfred Spraul wrote:
> Linus wrote:
> > Does this seem to happen mainly with drivers that use "disable_irq()"
> > and "enable_irq()"? I know the ne drivers do (through the 8390 module),
> > and some others do too (3c59x).
>
> I removed the disable_irq lines from 8390.c, and that fixed the problem:
> no hang within 2 minutes - the test is still running.
>
> Frank, could you double check it?

Hm, I also sent in a (somewhat different) patch on my own... :-)

Anyway, still running under heavy load...

Cheers//Frank
--
Frank de Lange
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware
As per Linus' suggestion, I removed the disable_irq/enable_irq statements from the 8390 core driver, and replaced the spinlocks with irq-safe versions. This seems to solve the network hangs, as I am currently running a heavy network load (which would have killed a non-patched driver within seconds). Network latency seems a bit higher, and there are some hiccups in the streaming audio (part of the network load, easy indicator of performance...), but no hangs.

Here's the patch:

--- linux/drivers/net/8390.c.org	Fri Jan 12 19:52:38 2001
+++ linux/drivers/net/8390.c	Fri Jan 12 19:54:50 2001
@@ -242,15 +242,15 @@

 	/* Ugly but a reset can be slow, yet must be protected */

-	disable_irq_nosync(dev->irq);
-	spin_lock(&ei_local->page_lock);
+	/* disable_irq_nosync(dev->irq); */
+	spin_lock_irq(&ei_local->page_lock);

 	/* Try to restart the card. Perhaps the user has fixed something. */
 	ei_reset_8390(dev);
 	NS8390_init(dev, 1);

-	spin_unlock(&ei_local->page_lock);
-	enable_irq(dev->irq);
+	spin_unlock_irq(&ei_local->page_lock);
+	/* enable_irq(dev->irq); */
 	netif_wake_queue(dev);
 }

@@ -285,9 +285,9 @@
 	 * Slow phase with lock held.
 	 */

-	disable_irq_nosync(dev->irq);
+	/* disable_irq_nosync(dev->irq); */

-	spin_lock(&ei_local->page_lock);
+	spin_lock_irq(&ei_local->page_lock);

 	ei_local->irqlock = 1;

@@ -383,8 +383,8 @@
 	ei_local->irqlock = 0;
 	outb_p(ENISR_ALL, e8390_base + EN0_IMR);

-	spin_unlock(&ei_local->page_lock);
-	enable_irq(dev->irq);
+	spin_unlock_irq(&ei_local->page_lock);
+	/* enable_irq(dev->irq); */

 	dev_kfree_skb (skb);
 	ei_local->stat.tx_bytes += send_length;

--
Frank de Lange
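The patch swaps one kind of interrupt blocking for another. As a rough illustration of the difference, here is a toy model in Python. Everything in it is an assumption made for illustration: `ToyMachine` is a made-up class, the two-CPU layout mirrors a dual-CPU board, and the IRQ line numbers are hypothetical (the thread only says that the network card and USB share a line). Masking a line at the interrupt controller stops that line for every CPU and every device on it, while a local interrupt disable stops every line, but only on the CPU that asked.

```python
class ToyMachine:
    """Toy two-level interrupt gate: per-line masks plus per-CPU flags."""

    def __init__(self, n_cpus=2):
        self.masked_lines = set()                # lines masked at the controller
        self.local_irqs_off = [False] * n_cpus   # per-CPU "interrupts disabled"

    def disable_irq(self, line):
        self.masked_lines.add(line)

    def enable_irq(self, line):
        self.masked_lines.discard(line)

    def local_irq_disable(self, cpu):
        self.local_irqs_off[cpu] = True

    def local_irq_enable(self, cpu):
        self.local_irqs_off[cpu] = False

    def can_deliver(self, line, cpu):
        """True if an interrupt on `line` could currently reach `cpu`."""
        return line not in self.masked_lines and not self.local_irqs_off[cpu]


SHARED_LINE = 19  # hypothetical: line shared by the network card and USB
OTHER_LINE = 14   # hypothetical: some unrelated device


def delivery_map(machine):
    """Which (line, cpu) pairs can currently take an interrupt."""
    return {(line, cpu): machine.can_deliver(line, cpu)
            for line in (SHARED_LINE, OTHER_LINE)
            for cpu in (0, 1)}
```

In the model, a local disable on CPU 0 still lets CPU 1 take the shared line's interrupt, which is why the patched driver relies on the spinlock to keep the handler on the other CPU out of the critical section.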
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
On Fri, Jan 12, 2001 at 06:51:36PM +0100, Manfred Spraul wrote:
> Frank, I've attached a proposed kick_IOAPIC pin. Could you try it?
> I'm rebooting with that patch right now.

I added the patch, and tried it out. When the network hangs, I am able to revive it with ALT-SYSRQ-Q. The debug log shows these entries:

Jan 12 19:22:57 behemoth kernel: SysRq: <0> NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect:
Jan 12 19:22:57 behemoth kernel: Before:
Jan 12 19:22:57 behemoth kernel: 00 003 03 011 1 11199
Jan 12 19:22:57 behemoth kernel: After switching to edge:
Jan 12 19:22:57 behemoth kernel: 00 003 03 001 1 11199
Jan 12 19:22:57 behemoth kernel: After switch back:
Jan 12 19:22:57 behemoth kernel: 00 003 03 011 1 11199

--
Frank de Lange
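The column header in that log (Mask, Trig, IRR, Pol, Stat, Dest, Deli, Vect) corresponds to fields in the low 32 bits of an I/O APIC redirection table entry. The sketch below decodes such a word using the bit positions from the 82093AA-style layout. The helper names are made up, and the sample word is an assumption: only the vector 0x99 is read off the end of the archived row, the other field values are illustrative because the archived columns are damaged.

```python
TRIGGER_BIT = 1 << 15  # 0 = edge, 1 = level


def decode_redirection_low(word):
    """Split the low dword of an I/O APIC redirection entry into fields."""
    return {
        "vector": word & 0xFF,                # Vect
        "delivery_mode": (word >> 8) & 0x7,   # Deli
        "dest_mode": (word >> 11) & 1,        # Dest (0 physical, 1 logical)
        "delivery_status": (word >> 12) & 1,  # Stat
        "polarity": (word >> 13) & 1,         # Pol
        "remote_irr": (word >> 14) & 1,       # IRR
        "trigger": (word >> 15) & 1,          # Trig
        "mask": (word >> 16) & 1,             # Mask
    }


def set_trigger(word, level):
    """Return the word with the trigger-mode bit set (level) or cleared (edge)."""
    return word | TRIGGER_BIT if level else word & ~TRIGGER_BIT


# Illustrative entry: unmasked, level-triggered, remote IRR set, vector 0x99.
SAMPLE = 0xE999
```

Flipping the bit to edge and back, as the "After switching to edge" and "After switch back" rows suggest, looks like this in the model: `set_trigger(SAMPLE, False)` gives `0x6999`, and `set_trigger(0x6999, True)` gives `0xE999` again. The helper only edits a register image; what the hardware does with Remote IRR when the trigger mode changes is not modeled.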
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
hemoth kernel: ... APIC ICR2: 0100
Jan 12 18:26:21 behemoth kernel: ... APIC LVTT: 000200ef
Jan 12 18:26:21 behemoth kernel: ... APIC LVTPC: 0001
Jan 12 18:26:21 behemoth kernel: ... APIC LVT0: 00010700
Jan 12 18:26:21 behemoth kernel: ... APIC LVT1: 00010400
Jan 12 18:26:21 behemoth kernel: ... APIC LVTERR: 00fe
Jan 12 18:26:21 behemoth kernel: ... APIC TMICT: a322
Jan 12 18:26:21 behemoth kernel: ... APIC TMCCT: 1803
Jan 12 18:26:21 behemoth kernel: ... APIC TDCR: 0003
Jan 12 18:26:21 behemoth kernel:
Jan 12 18:26:21 behemoth kernel:
Jan 12 18:26:21 behemoth kernel: printing local APIC contents on CPU#0/0:
Jan 12 18:26:21 behemoth kernel: ... APIC ID: (0)
Jan 12 18:26:21 behemoth kernel: ... APIC VERSION: 00040011
Jan 12 18:26:21 behemoth kernel: ... APIC TASKPRI: (00)
Jan 12 18:26:21 behemoth kernel: ... APIC ARBPRI: 00e0 (e0)
Jan 12 18:26:21 behemoth kernel: ... APIC PROCPRI:
Jan 12 18:26:21 behemoth kernel: ... APIC EOI:
Jan 12 18:26:21 behemoth kernel: ... APIC LDR: 0100
Jan 12 18:26:21 behemoth kernel: ... APIC DFR:
Jan 12 18:26:21 behemoth kernel: ... APIC SPIV: 03ff
Jan 12 18:26:21 behemoth kernel: ... APIC ISR field:
Jan 12 18:26:21 behemoth kernel: 0123456789abcdef0123456789abcdef
Jan 12 18:26:21 behemoth kernel:
Jan 12 18:26:21 behemoth last message repeated 7 times
Jan 12 18:26:21 behemoth kernel: ... APIC TMR field:
Jan 12 18:26:21 behemoth kernel: 0123456789abcdef0123456789abcdef
Jan 12 18:26:21 behemoth kernel:
Jan 12 18:26:21 behemoth last message repeated 3 times
Jan 12 18:26:21 behemoth kernel: 01000100
Jan 12 18:26:21 behemoth kernel:
Jan 12 18:26:21 behemoth last message repeated 2 times
Jan 12 18:26:21 behemoth kernel: ... APIC IRR field:
Jan 12 18:26:21 behemoth kernel: 0123456789abcdef0123456789abcdef
Jan 12 18:26:21 behemoth kernel:
Jan 12 18:26:21 behemoth last message repeated 6 times
Jan 12 18:26:21 behemoth kernel: 0001
Jan 12 18:26:21 behemoth kernel: ... APIC ESR:
Jan 12 18:26:21 behemoth kernel: ... APIC ICR: 000c08fb
Jan 12 18:26:21 behemoth kernel: ... APIC ICR2: 0200
Jan 12 18:26:21 behemoth kernel: ... APIC LVTT: 000200ef
Jan 12 18:26:21 behemoth kernel: ... APIC LVTPC: 0001
Jan 12 18:26:21 behemoth kernel: ... APIC LVT0: 00010700
Jan 12 18:26:21 behemoth kernel: ... APIC LVT1: 0400
Jan 12 18:26:21 behemoth kernel: ... APIC LVTERR: 00fe
Jan 12 18:26:21 behemoth kernel: ... APIC TMICT: a322
Jan 12 18:26:21 behemoth kernel: ... APIC TMCCT: 4e26
Jan 12 18:26:21 behemoth kernel: ... APIC TDCR: 0003

--
Frank de Lange
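The LVT registers in the dump above (LVTT, LVT0, LVT1) pack several fields into one word. As a reading aid, here is a small decoder sketch. `decode_lvt` is a made-up helper, it covers only the vector, delivery mode, mask and timer-mode bits, and it takes the archived hex values at face value even though the archive appears to have dropped some leading zeros.

```python
# Delivery mode encodings for local APIC LVT entries (bits 8-10).
DELIVERY_MODES = {0: "Fixed", 2: "SMI", 4: "NMI", 5: "INIT", 7: "ExtINT"}


def decode_lvt(value):
    """Decode selected fields of a local APIC LVT register value."""
    return {
        "vector": value & 0xFF,
        "delivery_mode": DELIVERY_MODES.get((value >> 8) & 0x7, "reserved"),
        "masked": bool(value & (1 << 16)),
        # Bit 17 is only meaningful for the timer entry (LVTT).
        "periodic": bool(value & (1 << 17)),
    }


# Values copied from the dump above.
LVTT = 0x000200EF
LVT0 = 0x00010700
LVT1_FIRST_CPU = 0x00010400
LVT1_SECOND_CPU = 0x0400
```

Read this way, LVTT is a periodic timer on vector 0xef, LVT0 is a masked ExtINT entry in both blocks, and LVT1 is an NMI entry that is masked in the first block and unmasked in the second.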
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
el: ... APIC ICR: 08fc
Jan 12 16:29:32 behemoth kernel: ... APIC ICR2: 0100
Jan 12 16:29:32 behemoth kernel: ... APIC LVTT: 000200ef
Jan 12 16:29:32 behemoth kernel: ... APIC LVTPC: 0001
Jan 12 16:29:32 behemoth kernel: ... APIC LVT0: 0400
Jan 12 16:29:32 behemoth kernel: ... APIC LVT1: 00010400
Jan 12 16:29:32 behemoth kernel: ... APIC LVTERR: 00fe
Jan 12 16:29:32 behemoth kernel: ... APIC TMICT: a322
Jan 12 16:29:32 behemoth kernel: ... APIC TMCCT: 1686
Jan 12 16:29:32 behemoth kernel: ... APIC TDCR: 0003
Jan 12 16:29:32 behemoth kernel:
Jan 12 16:29:32 behemoth kernel:
Jan 12 16:29:32 behemoth kernel: printing local APIC contents on CPU#0/0:
Jan 12 16:29:32 behemoth kernel: ... APIC ID: (0)
Jan 12 16:29:32 behemoth kernel: ... APIC VERSION: 00040011
Jan 12 16:29:32 behemoth kernel: ... APIC TASKPRI: (00)
Jan 12 16:29:32 behemoth kernel: ... APIC ARBPRI: 00f0 (f0)
Jan 12 16:29:32 behemoth kernel: ... APIC PROCPRI:
Jan 12 16:29:32 behemoth kernel: ... APIC EOI:
Jan 12 16:29:32 behemoth kernel: ... APIC LDR: 0100
Jan 12 16:29:32 behemoth kernel: ... APIC DFR:
Jan 12 16:29:32 behemoth kernel: ... APIC SPIV: 03ff
Jan 12 16:29:32 behemoth kernel: ... APIC ISR field:
Jan 12 16:29:32 behemoth kernel: 0123456789abcdef0123456789abcdef
Jan 12 16:29:32 behemoth kernel:
Jan 12 16:29:32 behemoth last message repeated 7 times
Jan 12 16:29:32 behemoth kernel: ... APIC TMR field:
Jan 12 16:29:32 behemoth kernel: 0123456789abcdef0123456789abcdef
Jan 12 16:29:32 behemoth kernel:
Jan 12 16:29:32 behemoth last message repeated 3 times
Jan 12 16:29:32 behemoth kernel: 0100
Jan 12 16:29:32 behemoth kernel:
Jan 12 16:29:32 behemoth last message repeated 2 times
Jan 12 16:29:32 behemoth kernel: ... APIC IRR field:
Jan 12 16:29:32 behemoth kernel: 0123456789abcdef0123456789abcdef
Jan 12 16:29:32 behemoth kernel:
Jan 12 16:29:32 behemoth last message repeated 6 times
Jan 12 16:29:32 behemoth kernel: 00011000
Jan 12 16:29:32 behemoth kernel: ... APIC ESR:
Jan 12 16:29:32 behemoth kernel: ... APIC ICR: 000c08fb
Jan 12 16:29:32 behemoth kernel: ... APIC ICR2: 0200
Jan 12 16:29:32 behemoth kernel: ... APIC LVTT: 000200ef
Jan 12 16:29:32 behemoth kernel: ... APIC LVTPC: 0001
Jan 12 16:29:32 behemoth kernel: ... APIC LVT0: 0400
Jan 12 16:29:32 behemoth kernel: ... APIC LVT1: 0400
Jan 12 16:29:32 behemoth kernel: ... APIC LVTERR: 00fe
Jan 12 16:29:32 behemoth kernel: ... APIC TMICT: a322
Jan 12 16:29:32 behemoth kernel: ... APIC TMCCT: 47d7
Jan 12 16:29:32 behemoth kernel: ... APIC TDCR: 0003

--
Frank de Lange
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
On Fri, Jan 12, 2001 at 10:40:04PM +1100, Andrew Morton wrote:
> Here is a debugging patch. Could you please apply this,
> rebuild and:
>
> 1: Type ALT-SYSRQ-A when everything is good
> 2: Type ALT-SYSRQ-A when everything is bad
> 3: send the resulting logs.

OK, here are the results I get...

Before network hang
===

print_PIC()
printing PIC contents
print_IO_APIC()
testing the IO APIC...
done.
print_all_local_APICs()
... APIC ID: 0100 (1)
... APIC VERSION: 00040011
0100
0001
... APIC ID: (0)
... APIC VERSION: 00040011
01000100
1000

NOTICE: results differ every time I hit ALT-SYSRQ-A. The '1' bit at 'row 11, col. 26' stays '1' no matter how many times I use the magic keys. The other '1' bits jump around a bit, or disappear altogether. Also, the sequence in which the APICs appear in the dump sometimes differs (this example shows 1 first, then 0, other times you'd see 0 first, then 1).

After network hang
==

print_PIC()
printing PIC contents
print_IO_APIC()
testing the IO APIC...
done.
print_all_local_APICs()
... APIC ID: (0)
... APIC VERSION: 00040011
0100
0001
... APIC ID: 0100 (1)
... APIC VERSION: 00040011
0100
0001

NOTICE: hmmm... see, now that '1' bit at row 11, col. 26 for APIC 0 which was '1' before has turned to '0'. It will stay '0' no matter how many times I hit the magic keys... It seems to have been replaced by the '1' bit at row 11, col. 10, since that bit stays '1' no matter how much magic I throw at it...

Hope this helps... If you need more, let me know...

Cheers//Frank
--
Frank de Lange
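Comparing two such bit-field dumps by eye is error-prone, so here is a hedged sketch of how it could be done mechanically. It assumes each ISR/TMR/IRR field is printed as eight rows of 32 characters with bit 0 first, so that column j of row i stands for vector 32*i + j, matching the `0123456789abcdef...` column header in the logs. The helper names are made up, and the sample vectors 0x99 and 0x89 are chosen only so that the two set bits sit 16 columns apart in the same row, like the two columns mentioned above; nothing in the logs confirms them.

```python
def set_vectors(rows):
    """Turn eight 32-character rows of '0'/'1' into a set of vector numbers."""
    vectors = set()
    for i, row in enumerate(rows):
        for j, ch in enumerate(row):
            if ch == "1":
                vectors.add(32 * i + j)
    return vectors


def diff_dumps(before, after):
    """Return (vectors cleared, vectors newly set) between two dumps."""
    b, a = set_vectors(before), set_vectors(after)
    return sorted(b - a), sorted(a - b)


def make_rows(vectors):
    """Build a dump in the assumed layout from a set of vector numbers."""
    rows = [["0"] * 32 for _ in range(8)]
    for v in vectors:
        rows[v // 32][v % 32] = "1"
    return ["".join(r) for r in rows]


# Hypothetical data: one bit set before the hang, a different one after.
BEFORE = make_rows({0x99})
AFTER = make_rows({0x89})
```

With these made-up inputs, `diff_dumps(BEFORE, AFTER)` reports vector 153 as cleared and vector 137 as newly set. Running the same comparison on real, undamaged dumps would show which vectors stay pending across repeated ALT-SYSRQ-A presses.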
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
On Fri, Jan 12, 2001 at 10:40:04PM +1100, Andrew Morton wrote:
> Frank de Lange wrote:
> >
> > Quick and dirty conclusion: as soon as the apic comes in to play, things get
> > messy...
>
> Here is a debugging patch. Could you please apply this,
> rebuild and:
>
> 1: Type ALT-SYSRQ-A when everything is good
> 2: Type ALT-SYSRQ-A when everything is bad
> 3: send the resulting logs.

WillCo... Now rebuilding...

Cheers//Frank
--
Frank de Lange
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
On Fri, Jan 12, 2001 at 11:59:25AM -0800, Linus Torvalds wrote:
> > Could this really be the solution?
>
> I'd like to know _which_ of the two makes a difference (or does it only
> trigger with both of them enabled)? And even then I'm not sure that it is
> "the" solution - both changes to io-apic handling had some reason for
> them. Ingo, what was the focus-cpu thing?

Well, with 'this' (in 'could THIS be') I really meant the move from disable_irq to the irq_safe spinlocks. I'm currently running with the patched 8390.c driver, patched io_apic (TARGET_CPUS 0xff) and patched apic.c (focus cpu enabled), and have had no problems yet... even though I'm running several simultaneous nfs cp -rd big_dir, streaming network audio, scanning with a USB scanner, etc.

So far, it seems that the patch to 8390.c removed the symptoms. The changes to apic.c and io_apic.c did not make the network hang come back.

Cheers//Frank
--
Frank de Lange
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
On Fri, Jan 12, 2001 at 09:11:29PM +0100, Manfred Spraul wrote:
> Frank, please clarify: you still run without disable_irq_nosync() in 8390.c?

I am running with your patched version of 8390.c (so WITHOUT disable_irq_nosync()).

In addition, I patched apic.c (focus cpu enabled)
In addition, I patched io_apic (TARGET_CPUS 0xff)

> I have a first idea: we send an EOI to an interrupt that is masked on the
> IO apic, perhaps that causes the problems.

Sounds plausible...

> I'm right now typing a patch.

I'll await yours instead of making my own patch this time... :-)

Cheers//Frank
--
Frank de Lange
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
On Fri, Jan 12, 2001 at 09:19:53PM +0100, Ingo Molnar wrote:
> > In addition, I patched apic.c (focus cpu enabled)
> > In addition, I patched io_apic ((TARGET_CPUS 0xff)
>
> please try it with the focus CPU enabling change (we want to enable that
> feature, i only disabled it due to the stuck-ne2k bug), but with
> TARGET_CPUS set to cpu_online_mask. (this later is needed for certain
> crappy BIOSes.)

WITH or WITHOUT the changed 8390 driver? I can already give you the results for running WITH the changed driver: it works. I have not yet tried it WITHOUT the changed 8390 driver (so that would be stock 8390, patched apic.c, stock io_apic.c).

Please let me know which you want...

Frank
--
Frank de Lange
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
On Fri, Jan 12, 2001 at 09:31:15PM +0100, Ingo Molnar wrote:
> On Fri, 12 Jan 2001, Frank de Lange wrote:
> > WITH or WITHOUT the changed 8390 driver? I can already give you the
> > results for running WITH the changed driver: it works. I have not yet
> > tried it WITHOUT the changed 8390 driver (so that would be stock 8390,
> > patched apic.c, stock io_apic.c). Please let me know which you want...
>
> WITH. patched 8390.c, patched apic.c, stock io_apic.c. My very strong
> feeling is that this will be a stable combination, and that this is what
> we want as a final solution.

It is. As I already mentioned in other messages, I already tested with JUST the patched 8390.c driver, no other patches. It was stable. I then patched apic.c AND io_apic.c, which did not introduce new instabilities.

Unless you think that reverting back to a stock io_apic.c would cause instabilities (which would be weird, since I had no instabilities running only a patched 8390.c), I think the patch to 8390.c DOES remove the symptoms all by itself. No other patches seem necessary to get a stable box.

But I'll patch the mess again just for kicks :-)

Cheers//Frank
--
Frank de Lange
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
On Fri, Jan 12, 2001 at 09:34:03PM +0100, Ingo Molnar wrote:
> ? this is x86-only code. There is no hot-pluggable CPU support for Linux AFAIK. (But in any case, the code is basically ready for hot-pluggable CPUs, just take a few precautions and change cpu_online_mask and a couple of other things.)

OK, maybe the Sun example was not the best to give for this code... But if there are no hot-pluggable x86's around now (I think there are, but can not recollect who made 'em...) and nobody is complaining, then it is fine with me...

I won't hot-unplug my BP6's CPUs anyway...

Cheers//Frank
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
On Fri, Jan 12, 2001 at 09:37:24PM +0100, Ingo Molnar wrote:
> okay - i just wanted to hear a definitive word from you that this fixes your problem, because this is what we'll have to do as a final solution. (barring any other solution.)

Now running with this config:

  PATCHED 8390.c (using irq_safe spinlocks instead of disable_irq)
  PATCHED apic.c (focus cpu ENABLED)
  STOCK io_apic.c

No problems under heavy network load.

Gentlemen, this (the patch to 8390.c) seems to fix the problem.

Cheers//Frank
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
On Fri, Jan 12, 2001 at 09:51:36PM +0100, Ingo Molnar wrote:
> great. Back when i had the same problem, flood pinging another host (on the local network) was the quickest way to reproduce the hang:
>
>   ping -f -s 10 otherhost
>
> this produced an IOAPIC-hang within seconds.

Apart from killing streaming audio and interactive network use, nothing hangs. As soon as the ping flood is stopped, audio streams on and ssh sessions are usable again.

So, it seems to fix it...

Frank
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
On Fri, Jan 12, 2001 at 09:54:31PM +0100, Manfred Spraul wrote:
> I have found one combination that doesn't hang with the unpatched 8390.c, but network throughput is down to 1/2. I hope that's due to the debugging changes.

Hm, could it be that the fact that network throughput is halved causes the problem not to appear? Remember, it only appears under HEAVY network load. A single nfs cp -rd big_dir was not enough to hang my network, I needed to add at least another cp -rd or some streaming audio or something else...

Cheers//Frank
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
> Remind me: what polarity are your io-apic irq's? Level, edge, sideways? Anything else that might be relevant?

Well, sideways of course! :-)

here's a cat /proc/interrupts from the (BP6) box:

           CPU0       CPU1
  0:     104936     105433    IO-APIC-edge  timer
  1:       4384               IO-APIC-edge  keyboard
  2:          0          0          XT-PIC  cascade
  3:         79         59    IO-APIC-edge  serial
  4:      12743      12850    IO-APIC-edge  serial
 14:       7855       7885    IO-APIC-edge  ide0
 15:       1990       1703    IO-APIC-edge  ide1
 16:          0          0   IO-APIC-level  es1371, mga@PCI:1:0:0
 17:         24         28   IO-APIC-level  sym53c8xx
 18:          0          0   IO-APIC-level  bttv
 19:     460435     460402   IO-APIC-level  eth0, eth1, usb-uhci
NMI:     210303     210303
LOC:     210285     210284
ERR:          0

The interrupt which caused problems was 19 (with both network cards and USB on it). It shows a high number of interrupts because I've been load-testing the network. The mere fact that it shows this high number of interrupts shows the fix works...

As this is a BP6, I'm now supposed to go on about the dead chickens, dedicated air conditioners, nuclear power supplies and other magic you're supposed to buy to get these boards running. Well, nothing of that sort, it is running on a simple (but high quality) 235W PSU with heat-greased coolers on the CPUs and the BX chipset. Nothing is overclocked. CPU and chipset temperatures are 24°C and 32°C, respectively. In short, nothing remarkable. All PCI slots are used, as you can see from my first posting in this thread (which contains more info on the hardware).

//Frank
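A note for readers who want to check their own box for the same situation (several devices sharing one level-triggered line, as on IRQ 19 here): the following is a minimal sketch, assuming the 2.4-era /proc/interrupts layout quoted in this message (one count column per CPU, then the controller type, then a comma-separated device list). The function name shared_irqs and the trimmed SAMPLE text are illustrative only and not part of the thread; newer kernels add columns, so the field positions would need adjusting there.

```python
# Sketch: list IRQ lines that are shared by more than one device.
# Assumes the 2.4-style layout: "<irq>: <count per CPU>... <type> <dev>, <dev>".

SAMPLE = """\
           CPU0       CPU1
  0:     104936     105433    IO-APIC-edge  timer
 16:          0          0   IO-APIC-level  es1371, mga@PCI:1:0:0
 17:         24         28   IO-APIC-level  sym53c8xx
 19:     460435     460402   IO-APIC-level  eth0, eth1, usb-uhci
NMI:     210303     210303
"""


def shared_irqs(text):
    """Return {irq_number: [devices]} for lines listing two or more devices."""
    lines = text.splitlines()
    ncpus = len(lines[0].split())            # header row has one column per CPU
    shared = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        if not label.strip().isdigit():      # skip NMI/LOC/ERR summary rows
            continue
        # Split into per-CPU counts, controller type, and the device list.
        fields = rest.split(None, ncpus + 1)
        if len(fields) < ncpus + 2:          # row is missing a column; skip it
            continue
        devices = [d.strip() for d in fields[ncpus + 1].split(",")]
        if len(devices) > 1:
            shared[int(label)] = devices
    return shared


if __name__ == "__main__":
    for irq, devs in sorted(shared_irqs(SAMPLE).items()):
        print(irq, devs)
```

On a live system one would pass in open("/proc/interrupts").read() instead of SAMPLE. Rows with a missing column (like the keyboard row above) are skipped rather than guessed at.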
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware
On Fri, Jan 12, 2001 at 04:36:33PM -0800, Linus Torvalds wrote:
> It may well not be disable_irq() that is buggy. In fact, there's good reason to believe that it's a hardware problem.

I am inclined to believe it IS a hardware problem... If disable_irq were buggy, wouldn't the problem occur more frequently in other irq-heavy areas? A quick count shows that disable_irq* is used in 84 source files in the drivers/* directory. This includes drivers which generate many interrupts in a short timeframe (like ide).

Frank
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware
On Fri, Jan 12, 2001 at 04:56:24PM -0800, Linus Torvalds wrote:
> IDE is not my favourite example of a "known stable driver". Also, in many cases IDE is for historical reasons connected to an EDGE io-apic pin (ie it's still considered an ISA interrupt). Which probably wouldn't show this problem anyway.

They (ide interrupts) are indeed EDGE-triggered on my box. I have not enabled the HPT366 (ATA66) controller on this board, so I can not tell if that controller is EDGE-triggered as well.

> Also, IDE doesn't generate all that many interrupts. You can make a network driver do a _lot_ more interrupts than just about any disk driver by simply sending/receiving a lot of packets. With disks it is very hard to get the same kind of irq load - Linux will merge the requests and do at least 1kB worth of transfer per interrupt etc. On a ne2k 100Mbps PCI card, you can probably _easily_ generate a much higher stream of interrupts.

There's sound... The msnd.c (Turtle Beach MultiSound) driver (and its derivatives, like msnd_pinnacle) uses disable_irq. Running esd (esound daemon), sound can easily generate 1000 interrupts/second, since esd uses small dma transfers. This can be seen quite clearly from /proc/interrupts on my soundserver:

            CPU0
  0:   276867328   XT-PIC  timer
  1:           2   XT-PIC  keyboard
  2:           0   XT-PIC  cascade
  3:     7631519   XT-PIC  eth1
  4:     2751419   XT-PIC  serial
  5:  1907346678   XT-PIC  soundblaster
  8:           1   XT-PIC  rtc
  9:    45022986   XT-PIC  eth0
 13:           1   XT-PIC  fpu
 14:     4320643   XT-PIC  ide0
 15:     4409193   XT-PIC  ide1
NMI:           0

OK, this is an ageing P166, and it uses a different driver, etc. I have not found any problems with hanging sound drivers in a Google query for 'linux msnd bp6' or 'linux multisound bp6'. Of course, this is no conclusive evidence, far from it... It could be that people using those cards are not the ones who tend to go for the (somewhat tricky) BP6 board...

Cheers//Frank
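As a rough cross-check of the interrupt-rate claim, the counters in the soundserver dump can be turned into average rates since boot. This is only a sketch under one loud assumption: that the timer line (IRQ 0) ticks HZ = 100 times per second, as was usual for x86 kernels of this era; the thread itself does not state HZ. The name average_rates is illustrative. Note that this gives a long-run average, not the peak rate while esd is actually streaming.

```python
# Sketch: average interrupts per second since boot, derived from the
# /proc/interrupts counters quoted above.
# ASSUMPTION (not stated in the thread): the timer interrupt fires HZ=100
# times per second, so timer_count / HZ approximates uptime in seconds.

HZ = 100

COUNTS = {
    "timer": 276867328,
    "eth1": 7631519,
    "soundblaster": 1907346678,
    "eth0": 45022986,
}


def average_rates(counts, hz=HZ):
    """Return {device: interrupts per second}, averaged over the whole uptime."""
    uptime_seconds = counts["timer"] / hz
    return {
        name: n / uptime_seconds
        for name, n in counts.items()
        if name != "timer"
    }


if __name__ == "__main__":
    for name, rate in sorted(average_rates(COUNTS).items()):
        print(f"{name:14s} {rate:8.1f} irq/s")
```

Under that assumption the sound card averages several hundred interrupts per second over the whole uptime, well above either network card in the same dump, which fits the point that sound can be an interrupt-heavy user of disable_irq.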
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware
On Sat, Jan 13, 2001 at 02:51:54AM +0100, Manfred Spraul wrote:
> Frank de Lange wrote:
> > It could be that people using those cards are not the ones who tend to go for the (somewhat tricky) BP6 board...
>
> I doubt that it's BP6 specific: I have the problem with a Gigabyte BXD board and I doubt that Ingo used a BP6. Perhaps 82093AA specific (the IO APIC chip used for SMP 440BX board)

It isn't. But I just meant to indicate that the mere fact that I could not find any problem report for that combination does not indicate that there ARE no problems...

> I can't find any spec updates for that chip: either it's the first perfect chip Intel ever produced, or ... :-)

Well, the BX chipset is one of their better attempts I think...

Frank
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
On Thu, Jan 11, 2001 at 02:23:53PM -0500, Jeff Garzik wrote:
> Just out of curiosity, if you boot a Linux 2.4.0 kernel with the
> "noapic" command line option, does behavior improve?

For the curious, here's a summary of some tests I did:

  apic,   2 CPUs,    no smp affinity                     -> network hangs under load
  apic,   maxcpus=1, no smp affinity                     -> network hangs under load
  apic,   2 CPUs,    smp affinity for all irq's on CPU1  -> network hangs under load
  noapic, 2 CPUs,    no smp affinity                     -> NO HANG, WORKSFORME

Quick and dirty conclusion: as soon as the apic comes into play, things get messy...

ps. load == 2 simultaneous nfs cp -rd sessions and streaming esd audio over the network

Cheers//Frank
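The reasoning behind that "quick and dirty conclusion" can be written down as a tiny check over the four test runs: which single setting, on its own, lines up with hang versus no hang? This is a minimal sketch; the dictionary keys (apic, cpus, affinity, hang) and the function name predictive_factors are labels chosen here for illustration, and four runs are of course far too few to prove anything.

```python
# Sketch: encode the four test runs from the message above and find which
# single factor separates "hang" from "no hang" without contradiction.
# Field names are illustrative labels, not anything from the kernel.

TESTS = [
    {"apic": True,  "cpus": 2, "affinity": "none", "hang": True},
    {"apic": True,  "cpus": 1, "affinity": "none", "hang": True},
    {"apic": True,  "cpus": 2, "affinity": "CPU1", "hang": True},
    {"apic": False, "cpus": 2, "affinity": "none", "hang": False},
]


def predictive_factors(rows, outcome="hang"):
    """Return factors whose value always maps to the same outcome,
    and for which the outcome actually differs between values."""
    result = []
    for factor in rows[0]:
        if factor == outcome:
            continue
        seen = {}                # factor value -> first outcome observed
        consistent = True
        for row in rows:
            first = seen.setdefault(row[factor], row[outcome])
            if first != row[outcome]:
                consistent = False   # same value, different outcome
        if consistent and len(set(seen.values())) > 1:
            result.append(factor)
    return result


if __name__ == "__main__":
    print(predictive_factors(TESTS))
```

Only the apic setting survives: the CPU count and the affinity setting each appear with both outcomes, so neither can explain the hang by itself.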
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
On Thu, Jan 11, 2001 at 04:47:00PM -0500, Jeff Garzik wrote:
> Are you judging based on the error message? The 'netdev watchdog ...'
> message is a generic error message that could have any number of
> causes. It's just saying, well, what it says :) The kernel was unable
> to transmit a packet in a certain amount of time. You might get these
> messages if you unplug a cable suddenly, or if your hardware isn't
> delivering interrupts, or many other things...

No, I'm judging based on the fact that I found reports from people using NE2K-PCI with several cards as well as tulip-based cards (different driver) on Abit BP6 as well as Gigabyte motherboards, mostly on 2.3.x/2.4.x kernels. I found some postings with these problems on 2.2.x kernels.

Cheers//Frank
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
OK, just one last addition to what has nearly become my own thread...

I now am fairly certain that the problem (network stalls on multiprocessor systems) is not BP6 or NE2K-PCI specific. I found several postings which relate to similar problems on dissimilar hardware. Another interesting one is:

Re: PROBLEM : Networking stops working with kernel 2.4.0-test11
(http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg18722.html)

"...I have an almost identical system as you, 2x200MMX motherboard (Gigabyte 586DX) also Voodoo3 (2000 pci) the same nic Realtek 8029AS, also a bt848 tv card, also SCSI (Aic-7880 onboard, but not used). I have reported it some time ago, and now all I get with 2.4.0-test11-pre4 and I think an additional patch is NETDEV WATCHDOG: eth0: transmit timed out, and something in the console about lost irq? I can't reproduce it with a uniprocessor kernel, and I have a 3c503 card which uses the 8390 module, so I suppose that the problem is not in the 8390, and it seems to be smp related"

ne2k-pci freezes with APIC error on 2.4.0-testX SMP
(http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg14468.html)

"... When doing massive NFS transfers (2.4 machine as the client) on my SMP box (Abit BP6 2x celeronA 533mhz (non-overclocked) 64Mb ram, latest apt-get-ed debian woody) my ne2k-pci card (Realtek Semiconductor Co., Ltd. RTL-8029(AS) (rev 0)) suddenly stops working. test5 spits that in syslog:..."

More to be found when searching the archives. This problem has been around for a long, long time (probably since the current level of apic-support was added, somewhere around 2.3.1x?). It has been reported by several people, several times.

I feel like rigging every apic-related piece of code with a zillion bells and printk's but that would surely only create more mayhem as this whole thing seems to be timing-related...

Anyone got any ideas on how to tackle this? Anyone who is 'intimate with' the apic-related code? It'll take me some time to dive into that part, so if there is anyone who already has taken the plunge, do tell...

Cheers//Frank [ who is still running apic-less, without problems ]
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
Hm, the noapic option seems to help, as I'm currently beating the network to death but it won't die... As the problem is elusive, it is hard to tell, and it would not surprise me if the net dropped dead the moment this mail went through, but current indication is that noapic makes the sudden net-death disappear.

So we're still left with the question 'is this hardware-related, or is it a software/configuration problem'? Other people seem to have similar problems with dissimilar hardware (tulip cards instead of Winbond, etc), on 2.2.x as well as 2.3/4.x.

As I do not run Windows (NT or 2K), I can not tell if this problem also occurs there. And my FreeBSD-box is uniprocessor... So... has anyone seen anything like this on other 'true' (SMP) OS's? If so, that would indicate a hardware problem...

Cheers//Frank
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
Another observation wrt. behaviour with 'noapic'...

When streaming time-critical data over the network (running esound to another server, etc), sometimes there are hiccups in the stream. These hiccups seem to be much less frequent, if at all present, when running with 'noapic'. I'm currently running sound over a heavily loaded ethernet, no hiccups at all... Weird, since the apic ought to spread the load of handling the interrupts over all available CPUs.

Whatever is causing this, there seems to be something fishy in the way interrupts are handled when the apic(s) is/are enabled...

Cheers//Frank
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
Here's another posting to the list which mentions problems with NE2K and BP6:

http://web.gnu.walfield.org/mail-archive/linux-kernel/2000-August/0132.html

"...In another machine, a dual celeron abit-bp6, recent 2.3.x kernels seem to dislike my realtek 8029 NIC. (I know, it's garbage plugged in to garbage...) The network card will die randomly, usually when I'm sending large amounts of data. When it dies, there are no kernel messages, and the interrupt count in /proc/interrupts for the card stops changing. Minor (painful) experimentation has shown that if the card is sharing the interrupt with anything else (say, ide2), it takes that with it. This only happens in "newer" kernels, it's fine in 2.2.16, and in some earlier 2.3.x kernels. It goes away if I boot with the noapic=1 kernel parameter, and seems to be replaced with harmless "spurious 8259A interrupt: IRQ7." messages. (I haven't configured any hardware at all to be on IRQ7 - though I'm led to believe IRQ7 has some sort of special purpose) ..."

So I'm not the only one...

Cheers//Frank
Re: QUESTION: Network hangs with BP6 and 2.4.x kernels, hardware related?
> Do you get any transmit timeout messages in the logs? If
> so, send them.

In addition to my previous message, here's what I get from the debug log facility:

Jan 10 22:56:51 behemoth kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jan 10 22:56:51 behemoth kernel: eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=33.
Jan 10 22:56:52 behemoth kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jan 10 22:56:52 behemoth kernel: eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=26.
Jan 10 22:56:53 behemoth kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jan 10 22:56:53 behemoth kernel: eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=30.
Jan 10 22:56:56 behemoth kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jan 10 22:56:56 behemoth kernel: eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=78.
Jan 10 22:56:56 behemoth kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jan 10 22:56:56 behemoth kernel: eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=32.
Jan 10 22:56:58 behemoth kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jan 10 22:56:58 behemoth kernel: eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=89.
Jan 10 22:57:00 behemoth kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jan 10 22:57:00 behemoth kernel: eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=77.
Jan 10 22:57:03 behemoth kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jan 10 22:57:03 behemoth kernel: eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=171.

So yeah, I get timeouts all right...

Currently running NOAPIC, pity to see CPU1 receiving no interrupts at all...
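For anyone collecting the same evidence from their own syslog, here is a minimal sketch that pulls the device name and the reported TSR, ISR and t values out of "Tx timed out" lines like the ones above. The regular expression is written against the message format exactly as it appears in this log; the name tx_timeouts and the two-entry LOG sample are illustrative, and no interpretation of the register bits is attempted.

```python
import re

# Sketch: extract the fields from "Tx timed out, lost interrupt?" log lines.
# The pattern matches the message text as quoted in this thread.

LOG = """\
Jan 10 22:56:51 behemoth kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jan 10 22:56:51 behemoth kernel: eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=33.
Jan 10 22:56:52 behemoth kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jan 10 22:56:52 behemoth kernel: eth0: Tx timed out, lost interrupt? TSR=0x3, ISR=0x3, t=26.
"""

PATTERN = re.compile(
    r"(\w+): Tx timed out, lost interrupt\? "
    r"TSR=(0x[0-9a-fA-F]+), ISR=(0x[0-9a-fA-F]+), t=(\d+)\."
)


def tx_timeouts(text):
    """Return one dict per timeout line: device, TSR, ISR (as ints) and t."""
    events = []
    for match in PATTERN.finditer(text):
        dev, tsr, isr, t = match.groups()
        events.append(
            {"dev": dev, "tsr": int(tsr, 16), "isr": int(isr, 16), "t": int(t)}
        )
    return events


if __name__ == "__main__":
    for event in tx_timeouts(LOG):
        print(event)
```

Running it over a full log makes it easy to see whether the register values are identical on every timeout, as they are in the excerpt above.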
In the same debug log I now just saw this:

Jan 11 17:37:05 behemoth kernel: spurious 8259A interrupt: IRQ7

That's weird, since there's nothing there...:

cat /proc/interrupts

           CPU0       CPU1
  0:     232967          0   XT-PIC  timer
  1:       6424          0   XT-PIC  keyboard
  2:          0          0   XT-PIC  cascade
  3:        138          0   XT-PIC  serial
  4:      46201          0   XT-PIC  serial
  9:         52          0   XT-PIC  sym53c8xx
 10:     744329          0   XT-PIC  eth0, eth1, usb-uhci
 11:          0          0   XT-PIC  bttv
 12:          0          0   XT-PIC  es1371, mga@PCI:1:0:0
 14:      19778          0   XT-PIC  ide0
 15:       4520          0   XT-PIC  ide1
NMI:          0          0
LOC:     232916     232914
ERR:          1

See? Nothing on 7... This is with NOAPIC (as you can see from the XT-PIC's in the above dump). BP6 again?

Cheers//Frank