Re: em driver testing
On Tue, November 7, 2006 6:18 pm, Scott Long wrote:
> It's just unclear to me how you're associating bce problems with checksum offloading and IP fragmentation to em problems with design issues in the watchdog code.

You are correct, the bce watchdog timeouts seem to be related to hw checksums, apologies for the mistake.

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: em driver testing
On Monday, 6 November 2006 at 22:42:18 -0800, Jack Vogel wrote:
> On 11/6/06, Adrian Chadd [EMAIL PROTECTED] wrote:
>> Just out of curiosity - why weren't the offending MPSAFE-related changes to em just reverted after discovering the em instability? The driver -was- stable a couple of months ago, no?
>
> Actually it was not. Some reports have cited problems back to 6.0 or before.

Well, I have a 5.5 box with very similar symptoms :) I do not see watchdog timeouts on it, but a lot of UP/DOWN events.

> The watchdog design was fundamentally flawed from an SMP point of view and needed to be changed. We also didn't want to go backwards if possible. My Intel driver had support for new hardware that was good to pick up. There's lots of new stuff coming too, so stay tuned :)
>
> Jack

After 48 hours of production running there are no watchdog timeouts on my 6.2 SMP server with your patch. Thanks to all who are working on this.

--
Best regards, Nikolay Pavlov.
Re: em driver testing
On Wed, Nov 08, 2006 at 04:40:03PM +0200, Nikolay Pavlov wrote:
> Well, I have a 5.5 box with very similar symptoms :) I do not see watchdog timeouts on it, but a lot of UP/DOWN events.

Are you sure this is the same problem as what's being discussed here? If you revert to a previous kernel or em driver, does the problem (link up/down) go away? Are you sure you don't actually have a flaky cable or RJ45 connector? What does the switch your NIC is connected to say? (Does it show link going up and down?)

I feel horrible for both Scott and Jack -- I think there's tons of people coming out of the woodwork with ME TOO comments who may in fact be suffering from other problems, and are looking for a scapegoat thread.

--
| Jeremy Chadwick                                 jdc at parodius.com |
| Parodius Networking                        http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, USA |
| Making life hard for others since 1977.               PGP: 4BD6C0CB |
Re: em driver testing
Jeremy Chadwick wrote:
> On Wed, Nov 08, 2006 at 04:40:03PM +0200, Nikolay Pavlov wrote:
>> Well, I have a 5.5 box with very similar symptoms :) I do not see watchdog timeouts on it, but a lot of UP/DOWN events.
>
> Are you sure this is the same problem as what's being discussed here? If you revert to a previous kernel or em driver, does the problem (link up/down) go away? Are you sure you don't actually have a flaky cable or RJ45 connector? What does the switch your NIC is connected to say? (Does it show link going up and down?)
>
> I feel horrible for both Scott and Jack -- I think there's tons of people coming out of the woodwork with ME TOO comments who may in fact be suffering from other problems, and are looking for a scapegoat thread.

The timeout/watchdog mechanism in the interface layer has been a problem ever since the MPSAFE work was done on the network stack. It's prone to races, and as the OS has improved and gotten faster over the past 2 years, those races have gotten bigger. In a way, it's actually a positive indication of progress and improvement =-)

I don't doubt that there are users with other problems. We spent some time collecting as much user data as we could in order to find patterns and weed out the uncommon cases. But this timer/watchdog thing looks to be a strong candidate for being the root cause of many of the problems. We'll continue to investigate these problems and address other drivers.

Scott
Re: em driver testing
On Wednesday, 8 November 2006 at 7:41:02 -0800, Jeremy Chadwick wrote:
> On Wed, Nov 08, 2006 at 04:40:03PM +0200, Nikolay Pavlov wrote:
>> Well, I have a 5.5 box with very similar symptoms :) I do not see watchdog timeouts on it, but a lot of UP/DOWN events.
>
> Are you sure this is the same problem as what's being discussed here? If you revert to a previous kernel or em driver, does the problem (link up/down) go away? Are you sure you don't actually have a flaky cable or RJ45 connector? What does the switch your NIC is connected to say? (Does it show link going up and down?)

I am pretty sure. All my servers use the same em chip, and on all my 6.1 boxes, either UP or SMP, I see watchdog timeouts. The average load on these adapters is 5000 - 6000 interrupts per second. I have only one box with 5.5 (same task and same platform), but I am not claiming that this is exactly the watchdog problem; it just looks very similar in the context of this discussion. In any case the new 6.2 em patch works for me; at least I do not see watchdog timeouts after 48 hours of uptime. By the way, the box is connected to a 2950 switch, and I can't find any problems with the cabling.

Here is how it looks on 5.5:

    Oct 18 05:38:45 ms6 kernel: em0: Link is up 1000 Mbps Full Duplex
    Oct 18 05:38:50 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: not responding
    Oct 18 05:39:21 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: not responding
    Oct 18 05:39:32 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: is alive again
    Oct 18 05:52:22 ms6 kernel: em0: Link is Down
    Oct 18 05:55:13 ms6 kernel: em0: Link is up 1000 Mbps Full Duplex
    Oct 18 05:55:13 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: not responding
    Oct 18 05:55:44 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: not responding
    Oct 18 05:55:46 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: is alive again
    Oct 18 06:01:52 ms6 kernel: em0: Link is Down
    Oct 18 06:03:54 ms6 kernel: em0: Link is up 1000 Mbps Full Duplex
    Oct 18 06:03:54 ms6 kernel: em0: Link is Down
    Oct 18 06:04:01 ms6 kernel: em0: Link is up 1000 Mbps Full Duplex
    Oct 18 06:16:07 ms6 kernel: em0: Link is Down
    Oct 18 06:18:16 ms6 kernel: em0: Link is up 1000 Mbps Full Duplex
    Oct 18 06:21:55 ms6 kernel: em0: Link is Down
    Oct 18 06:25:12 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: not responding
    Oct 18 06:25:25 ms6 kernel: em0: Link is up 1000 Mbps Full Duplex
    Oct 18 06:25:27 ms6 kernel: em0: Link is Down
    Oct 18 06:25:33 ms6 kernel: em0: Link is up 1000 Mbps Full Duplex
    Oct 18 06:25:43 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: not responding
    Oct 18 06:26:10 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: is alive again
    Oct 18 06:43:12 ms6 kernel: em0: Link is Down
    Oct 18 06:45:13 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: not responding
    Oct 18 06:45:44 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: not responding
    Oct 18 06:46:15 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: not responding
    Oct 18 06:46:27 ms6 kernel: em0: Link is up 1000 Mbps Full Duplex
    Oct 18 06:46:28 ms6 kernel: em0: Link is Down
    Oct 18 06:46:34 ms6 kernel: em0: Link is up 1000 Mbps Full Duplex
    Oct 18 06:46:46 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: not responding
    Oct 18 06:47:17 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: not responding
    Oct 18 06:47:26 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: is alive again
    Oct 18 07:02:51 ms6 kernel: em0: Link is Down
    Oct 18 07:04:42 ms6 kernel: em0: Link is up 1000 Mbps Full Duplex
    Oct 18 07:04:44 ms6 kernel: em0: Link is Down
    Oct 18 07:04:50 ms6 kernel: em0: Link is up 1000 Mbps Full Duplex
    Oct 18 07:05:13 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: not responding
    Oct 18 07:05:25 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: is alive again
    Oct 18 16:40:05 ms6 kernel: receive error 60 from nfs server 206.53.x.x:/usr/home/shared
    Oct 19 03:55:13 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: not responding
    Oct 19 03:55:15 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: is alive again

After that date it was rebooted at least three times and I don't see such symptoms any more.

> I feel horrible for both Scott and Jack -- I think there's tons of people coming out of the woodwork with ME TOO comments who may in fact be suffering from other problems, and are looking for a scapegoat thread.

Just ignore me. The patch works for me and that is the end of it.

--
Best regards, Nikolay Pavlov.
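A log like the one above is easier to compare between kernels when the link flaps are counted rather than read by eye. The following is only a sketch, assuming syslog lines of exactly the form shown in this message (`kernel: em0: Link is Down` / `kernel: em0: Link is up ...`); `count_link_events` is a hypothetical helper name, not an existing FreeBSD tool.

```shell
#!/bin/sh
# count_link_events FILE IFACE
# Hypothetical helper: count "Link is Down" and "Link is up" kernel
# messages for one interface in a syslog-style file and print the
# totals as "down=N up=M".
count_link_events() {
    awk -v ifc="$2" '
        # index() does a plain substring match, so no regex escaping is needed
        index($0, "kernel: " ifc ": Link is Down") { down++ }
        index($0, "kernel: " ifc ": Link is up")   { up++ }
        END { printf "down=%d up=%d\n", down, up }
    ' "$1"
}
```

On the box in question this would be called as `count_link_events /var/log/messages em0`. It says nothing about the cause of the flaps; it only makes before/after comparisons (old kernel vs. patched kernel, old cable vs. new cable) less subjective.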
Re: em driver testing
On Wednesday, 8 November 2006 at 9:26:26 -0700, Scott Long wrote:
> Jeremy Chadwick wrote:
>> On Wed, Nov 08, 2006 at 04:40:03PM +0200, Nikolay Pavlov wrote:
>>> Well, I have a 5.5 box with very similar symptoms :) I do not see watchdog timeouts on it, but a lot of UP/DOWN events.
>>
>> Are you sure this is the same problem as what's being discussed here? If you revert to a previous kernel or em driver, does the problem (link up/down) go away? Are you sure you don't actually have a flaky cable or RJ45 connector? What does the switch your NIC is connected to say? (Does it show link going up and down?)
>>
>> I feel horrible for both Scott and Jack -- I think there's tons of people coming out of the woodwork with ME TOO comments who may in fact be suffering from other problems, and are looking for a scapegoat thread.
>
> The timeout/watchdog mechanism in the interface layer has been a problem ever since the MPSAFE work was done on the network stack. It's prone to races, and as the OS has improved and gotten faster over the past 2 years, those races have gotten bigger. In a way, it's actually a positive indication of progress and improvement =-)
>
> I don't doubt that there are users with other problems. We spent some time collecting as much user data as we could in order to find patterns and weed out the uncommon cases. But this timer/watchdog thing looks to be a strong candidate for being the root cause of many of the problems. We'll continue to investigate these problems and address other drivers.
>
> Scott

Thanks Scott. From my side, I'd like to say that I'm always ready to help you in testing.

--
Best regards, Nikolay Pavlov.
Re: em driver testing
Scott Long wrote:
> I don't doubt that there are users with other problems. We spent some time collecting as much user data as we could in order to find patterns and weed out the uncommon cases. But this timer/watchdog thing looks to be a strong candidate for being the root cause of many of the problems. We'll continue to investigate these problems and address other drivers.

Out of interest, are there any priorities for drivers that will get fixed before the 6.2 release? Obviously em will, but what others: bge, xl?

Steve
Re: em driver testing
Steven Hartland wrote:
> Scott Long wrote:
>> I don't doubt that there are users with other problems. We spent some time collecting as much user data as we could in order to find patterns and weed out the uncommon cases. But this timer/watchdog thing looks to be a strong candidate for being the root cause of many of the problems. We'll continue to investigate these problems and address other drivers.
>
> Out of interest, are there any priorities for drivers that will get fixed before the 6.2 release? Obviously em will, but what others: bge, xl?
>
> Steve

My personal opinion is that the changes needed are too risky to rush into 6.2 for all of the different drivers. For the vast majority of people, what is in 6.2 works quite well, and there is no need to introduce new bugs. We are pushing forward with if_em because the problems there are pretty widespread, and Intel is providing direct engineering and QA input. Thus, risk is reduced.

Scott
Re: em driver testing
Scott Long wrote:
> My personal opinion is that the changes needed are too risky to rush into 6.2 for all of the different drivers. For the vast majority of people, what is in 6.2 works quite well, and there is no need to introduce new bugs. We are pushing forward with if_em because the problems there are pretty widespread, and Intel is providing direct engineering and QA input. Thus, risk is reduced.

From a purely user's point of view, it would be nice if the problems were clearly mentioned in the errata, together with links to the newest patches or some other kind of pointer to the ongoing work, so people know where to look if these bugs hit them.
Re: em driver testing
On Mon, Nov 06, 2006 at 04:14:40PM -0800, Jack Vogel wrote:
> Well, so run 6.2 BETA3 plus the patch I posted as Patrick mentioned and then report on that. You've got a lot of potential problem areas here, I have no experience with samba on FreeBSD. And that motherboard only has PCI as I recall, yes? Still, it should get rid of the watchdogs unless you have real hardware issues.

As a point of information, I don't think that samba specifically has anything to do with the problem. I am running samba on FreeBSD, and have two servers that are rather heavily used (one is the filestore for a CFD cluster, and the other for a Maya/Muster rendering cluster), each having two em interfaces and SMP -- and have not seen any watchdog issues (they are currently running FreeBSD 6.2-PRERELEASE as of Oct 7 -- but no problems with any earlier 6.1-STABLE versions either).

--
greg byshenk - [EMAIL PROTECTED] - Leiden, NL
Re: em driver testing
Hi Jack

I patched the driver and recompiled the kernel and userland. All appears well with the em driver now. No more errors on it. I am getting watchdog timeouts on the xl driver now, though. It was happening before at the same time as the em ones. Now I've passed a lot of traffic on the em interface, but the xl interface gets watchdog errors. The em interface still works fine, but the xl one is not usable after this.

The motherboard has 2 onboard xl's; I am using one for a live IP and the other one is doing nothing. It is a server motherboard with an AMD762 north bridge. It has 64-bit PCI 66MHz slots, and the em card is in one of them. The em card is a 32-bit PCI 33MHz card, though.

Here's what the xl card is with pciconf -lhv:

    [EMAIL PROTECTED]:15:0: class=0x02 card=0x246210f1 chip=0x980010b7 rev=0x78 hdr=0x00
        vendor = '3COM Corp, Networking Division'
        device = '3C980-TX Fast EtherLink XL Server Adapter2'

The em card is such:

    [EMAIL PROTECTED]:9:0: class=0x02 card=0x13768086 chip=0x107c8086 rev=0x05 hdr=0x00
        vendor = 'Intel Corporation'
        device = 'PRO/1000 GT'

Any help would be greatly appreciated.

Clay

- Original Message -
From: Jack Vogel [EMAIL PROTECTED]
To: Clayton Milos [EMAIL PROTECTED]
Cc: freebsd-stable@freebsd.org; ke han [EMAIL PROTECTED]
Sent: Tuesday, November 07, 2006 2:14 AM
Subject: Re: em driver testing

> Well, so run 6.2 BETA3 plus the patch I posted as Patrick mentioned and then report on that. You've got a lot of potential problem areas here, I have no experience with samba on FreeBSD. And that motherboard only has PCI as I recall, yes? Still, it should get rid of the watchdogs unless you have real hardware issues.
>
> Good luck,
> Jack
>
> On 11/6/06, Clayton Milos [EMAIL PROTECTED] wrote:
>> Hi there
>>
>> I am having similar issues. Running 6.1-RELEASE. I'm using the box as a samba server with pure-ftpd on it too, with 2.5T of RAID storage in it. The box is running the generic MP kernel on a Tyan Thunder K7 with the latest BIOS v2.14 and dual AthlonMP's, with ECC Reg RAM that passed all tests with memtest.
>>
>> When I pull a few concurrent files over samba, or if I pull a big file (say 2-3G) over ftp to my laptop, it runs at 30MB/sec but usually locks up the box with a watchdog timeout on the em interface. Usually it pops up with timeouts on the xl interface at the same time and, after a few seconds, on the ahc (onboard Adaptec SCSI) interface, and I have to hard boot the box to get it back to life.
>>
>> I've tried the same box with a 3com 3C996B-T NIC which has a Broadcom BCM5701TKHB chipset on it. It crashes within minutes with no traffic on the interface. In fact the interface will accept an IP address but times out pinging anything on the LAN.
>>
>> If a kernel developer would like access to the box to check it out, please mail me.
>>
>> Regards
>> Clay
Re: em driver testing
On 11/7/06, Clayton Milos [EMAIL PROTECTED] wrote:
> Hi Jack
>
> I patched the driver and recompiled the kernel and userland. All appears well with the em driver now. No more errors on it. I am getting watchdog timeouts on the xl driver now, though. It was happening before at the same time as the em ones. Now I've passed a lot of traffic on the em interface, but the xl interface gets watchdog errors. The em interface still works fine, but the xl one is not usable after this.

I'm not sure what it is, but the fact that a variety of NIC drivers are having this same problem indicates that something changed in the if_timer and its caller; someone who knows that subsystem would be better qualified to say what.

The other drivers should do the same thing that em did, and stop using the net layer timer :)

Jack
Re: em driver testing
We've basically identified problems with the way that watchdogs are handled. It is very fragile and sensitive to timing, so it's not surprising that adjusting the timing in one driver will affect another driver. The solution is to push the timeout/watchdog logic entirely into the NIC drivers, like we did for if_em. That will take some time, and I doubt that xl specifically will get fixed for 6.2.

Scott

Clayton Milos wrote:
> Hi Jack
>
> I patched the driver and recompiled the kernel and userland. All appears well with the em driver now. No more errors on it. I am getting watchdog timeouts on the xl driver now, though. It was happening before at the same time as the em ones. Now I've passed a lot of traffic on the em interface, but the xl interface gets watchdog errors. The em interface still works fine, but the xl one is not usable after this.
Re: em driver testing
Clayton Milos wrote:
> Hi Jack
>
> I patched the driver and recompiled the kernel and userland. All appears well with the em driver now. No more errors on it. I am getting watchdog timeouts on the xl driver now, though. It was happening before at the same time as the em ones. Now I've passed a lot of traffic on the em interface, but the xl interface gets watchdog errors. The em interface still works fine, but the xl one is not usable after this.

Has it not been established by someone that the problem is in FreeBSD (scheduler, iirc) and not the drivers themselves? This along with the bge/bce watchdog timeouts seems to confirm that.
Re: em driver testing
On 11/7/06, Mike Jakubik [EMAIL PROTECTED] wrote:
> Clayton Milos wrote:
>> Hi Jack
>>
>> I patched the driver and recompiled the kernel and userland. All appears well with the em driver now. No more errors on it. I am getting watchdog timeouts on the xl driver now, though. It was happening before at the same time as the em ones. Now I've passed a lot of traffic on the em interface, but the xl interface gets watchdog errors. The em interface still works fine, but the xl one is not usable after this.
>
> Has it not been established by someone that the problem is in FreeBSD (scheduler, iirc) and not the drivers themselves? This along with the bge/bce watchdog timeouts seems to confirm that.

Yes, I think it's pretty likely to be in the timer/clock code; something must have changed. However, I like the design change we made to em better anyway: the net/if timer is a UP design and has ALWAYS been vulnerable to races, so it's best to do what we did (it's still only a patch and not checked in yet, btw).

Jack
Re: em driver testing
Jack Vogel wrote:
> On 11/7/06, Clayton Milos [EMAIL PROTECTED] wrote:
>> Hi Jack
>>
>> I patched the driver and recompiled the kernel and userland. All appears well with the em driver now. No more errors on it. I am getting watchdog timeouts on the xl driver now, though. It was happening before at the same time as the em ones. Now I've passed a lot of traffic on the em interface, but the xl interface gets watchdog errors. The em interface still works fine, but the xl one is not usable after this.
>
> I'm not sure what it is, but the fact that a variety of NIC drivers are having this same problem indicates that something changed in the if_timer and its caller; someone who knows that subsystem would be better qualified to say what.
>
> The other drivers should do the same thing that em did, and stop using the net layer timer :)
>
> Jack

I think it's more that the if_em driver watchdog was insulating the if_xl driver. Once the if_em component was removed, the if_xl driver was the next in line to be a victim. So yes, like you say, all of the drivers need to be fixed.

Scott
Re: em driver testing
Mike Jakubik wrote:
> Clayton Milos wrote:
>> Hi Jack
>>
>> I patched the driver and recompiled the kernel and userland. All appears well with the em driver now. No more errors on it. I am getting watchdog timeouts on the xl driver now, though. It was happening before at the same time as the em ones. Now I've passed a lot of traffic on the em interface, but the xl interface gets watchdog errors. The em interface still works fine, but the xl one is not usable after this.
>
> Has it not been established by someone that the problem is in FreeBSD (scheduler, iirc) and not the drivers themselves? This along with the bge/bce watchdog timeouts seems to confirm that.

Mike,

If you have insight into the bce driver, I would highly appreciate it if you would share it.

Scott (the guy who fixed bce)
Re: em driver testing
Scott Long wrote:
> Mike,
>
> If you have insight into the bce driver, I would highly appreciate it if you would share it.
>
> Scott (the guy who fixed bce)

I don't have any bce hardware myself; I'm just using the information from the list. I have some em and fxp hardware, however, that I can use to do tests.
Re: em driver testing
Mike Jakubik wrote:
> Scott Long wrote:
>> Mike,
>>
>> If you have insight into the bce driver, I would highly appreciate it if you would share it.
>>
>> Scott (the guy who fixed bce)
>
> I don't have any bce hardware myself; I'm just using the information from the list. I have some em and fxp hardware, however, that I can use to do tests.

It's just unclear to me how you're associating bce problems with checksum offloading and IP fragmentation to em problems with design issues in the watchdog code.

Scott
Re: em driver testing
At 04:52 PM 11/7/2006, Scott Long wrote:
> I think it's more that the if_em driver watchdog was insulating the if_xl driver. Once the if_em component was removed, the if_xl driver was the next in line to be a victim. So yes, like you say, all of the drivers need to be fixed.

I wonder if that's the issue I was seeing on the rl interface. While trying to stress test an em based machine, I saw

    Nov 6 17:33:05 releng6-865 kernel: rl0: watchdog timeout

while blasting out UDP traffic via netrate from the rl based box to the em based box.

---Mike
Re: em driver testing
Jack,

I have done some tests; here are my results. On 6.2-BETA3 I was able to get a timeout while compiling the kernel and ftping a large file from another server with the same card. On 6.2-STABLE, cvsup'ed today, I was not able to produce a timeout. I then applied your patch and the results were the same: still no timeout.

    [EMAIL PROTECTED]:10:0: class=0x02 card=0x11768086 chip=0x10768086 rev=0x00 hdr=0x00
        vendor = 'Intel Corporation'
        device = '82547EI Gigabit Ethernet Controller'
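When comparing reports like the ones in this thread, it helps to read the `chip=` field of the pciconf output directly rather than rely on the vendor/device strings. A minimal sketch, assuming the 8-hex-digit `chip=0xDDDDVVVV` layout seen in the pciconf lines quoted here (device ID in the upper 16 bits, vendor ID in the lower 16 bits); `split_pci_chip` is a hypothetical helper name, not part of pciconf.

```shell
#!/bin/sh
# split_pci_chip CHIP
# Hypothetical helper: split a pciconf "chip=" value such as 0x10768086
# into its vendor and device IDs.
# Assumes exactly 8 hex digits after the "0x" prefix.
split_pci_chip() {
    hex=${1#0x}                                  # drop the leading "0x"
    device=$(printf '%s' "$hex" | cut -c1-4)     # upper 16 bits: device ID
    vendor=$(printf '%s' "$hex" | cut -c5-8)     # lower 16 bits: vendor ID
    printf 'vendor=0x%s device=0x%s\n' "$vendor" "$device"
}
```

For the card in this message, `split_pci_chip 0x10768086` prints `vendor=0x8086 device=0x1076`; for the 3Com card mentioned earlier in the thread, `split_pci_chip 0x980010b7` prints `vendor=0x10b7 device=0x9800`.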
em driver testing
According to the 6.2-BETA3 announcement, there are improvements to the em driver. The most important of the things that have been worked on is the driver for em(4). I have a Sun X4100 which uses Intel ethernet. I would like to install amd64 6.2-BETA3 on this server and put it through some tests. But I have no idea what tests to run or how to run them. Can someone provide some pointers? I am happy to post my findings. thanks, ke han
Re: em driver testing
If memory serves me right, ke han wrote: According to the 6.2-BETA3 announcement, there are improvements to the em driver. The most important of the things that have been worked on is the driver for em(4). I have a Sun X4100 which uses Intel ethernet. I would like to install amd64 6.2-BETA3 on this server and put it through some tests. But I have no idea what tests to run or how to run them. Can someone provide some pointers? I am happy to post my findings. There were a number of problems in em(4) that appeared post-6.1. A new version of the em(4) driver was merged to the RELENG_6 branch just prior to 6.2-BETA3. That version is better, but still has some unresolved issues (problems have been reported with jumbo frames and with watchdog timeouts). There's a new patch by jfv@ in an email from a few days ago (subject "New em patch for 6.2 BETA 3") that might fix these problems. It might be worthwhile to try this out. Basically, just look through the stable@ archives for discussions of the em driver over the past month or so. I believe that a number of the earlier problems were uncovered when people simply tried to push a lot of traffic through the interface. Bruce.
Re: em driver testing
Hello! On Tue, Nov 07, 2006 at 04:55:50AM +0800, ke han wrote: I have a Sun X4100 which uses Intel ethernet. I would like to install amd64 6.2beta3 on this server and put it through some tests. But I have no idea what tests to run or how to run them. Can someone provide some pointers? I am happy to post my findings. Put some CPU load on the machine, e.g. by running

    cd /usr/src
    # start a Bourne shell; the loop below is sh syntax
    sh
    # rebuild world forever to keep the CPUs busy; output goes to mk.log
    while true
    do
        make -j4 buildworld
    done > mk.log

on one terminal, and then transfer some data to the system, e.g. by fetch(1)ing via FTP from another box connected to the same LAN. On all the systems I have, there is no need to saturate the Gbit link; a 100 Mbit/s local connection will trigger the problem, too. If the problem exists on your system, you will see "emN - watchdog timeout" messages on the console and in /var/log/messages, followed by a reset of the interface and a short, recoverable, but complete loss of connectivity. A couple of seconds, maybe. This is enough to frustrate people who, e.g., run large backup jobs over a single TCP connection that takes a couple of hours to complete: the interface reset aborts the backup :-/ I must say that it seems to me these guys are putting a hell of a lot of effort into this problem, and we are making progress. Things look quite good to me for 6.2-RELEASE. HTH, Patrick -- punkt.de GmbH Internet - Dienstleistungen - Beratung Vorholzstr. 25 Tel. 0721 9109 -0 Fax: -100 76137 Karlsruhe http://punkt.de
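To make the result of such a run easier to report, the watchdog messages can be tallied per interface after the test. This is a minimal sketch, not anything posted by the driver authors: count_watchdog_timeouts is a hypothetical name, and the pattern assumes syslog lines shaped like the rl0 example quoted earlier in this thread ("kernel: rl0: watchdog timeout"); adjust it if your messages are worded differently.

```shell
#!/bin/sh
# count_watchdog_timeouts: hypothetical helper for summarising a test run.
# Assumes log lines of the form seen in this thread, e.g.
#   Nov 6 17:33:05 releng6-865 kernel: rl0: watchdog timeout
# Prints one "<interface> <count>" line per interface that logged a timeout.
count_watchdog_timeouts() {
    # Extract the interface name that directly precedes the message,
    # then count how often each name occurs.
    sed -n 's/.*kernel: \([a-z][a-z]*[0-9][0-9]*\): watchdog timeout.*/\1/p' "$1" |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

After the transfer finishes, `count_watchdog_timeouts /var/log/messages` lists each affected interface with its number of timeouts; no output means no matching lines were found in that file.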
Re: em driver testing
Hi there, I am having similar issues. Running 6.1-RELEASE. I'm using the box as a Samba server, with pure-ftpd on it too, with 2.5T of RAID storage in it. The box is running the generic MP kernel on a Tyan Thunder K7 with the latest BIOS (v2.14) and dual Athlon MPs, with ECC registered RAM that passed all tests with memtest. When I pull a few concurrent files over Samba, or if I pull a big file (say 2-3G) over FTP to my laptop, it runs at 30MB/sec but usually locks up the box with a watchdog timeout on the em interface. Usually timeouts pop up on the xl interface at the same time and, after a few seconds, on the ahc (onboard Adaptec SCSI) interface, and I have to hard boot the box to get it back to life. I've tried the same box with a 3Com 3C996B-T NIC, which has a Broadcom BCM5701TKHB chipset on it. It crashes within minutes with no traffic on the interface. In fact the interface will accept an IP address but times out pinging anything on the LAN. If a kernel developer would like access to the box to check it out, please mail me. Regards, Clay
Re: em driver testing
Well, so run 6.2-BETA3 plus the patch I posted, as Patrick mentioned, and then report on that. You've got a lot of potential problem areas here, and I have no experience with Samba on FreeBSD. And that motherboard only has PCI as I recall, yes? Still, the patch should get rid of the watchdogs unless you have real hardware issues. Good luck, Jack
Re: em driver testing
Just out of curiosity - why weren't the offending MPSAFE-related changes to em just reverted after discovering the em instability? The driver -was- stable a couple of months ago, no? Adrian -- Adrian Chadd - [EMAIL PROTECTED]
Re: em driver testing
On 11/6/06, Adrian Chadd [EMAIL PROTECTED] wrote: Just out of curiosity - why weren't the offending MPSAFE-related changes to em just reverted after discovering the em instability? The driver -was- stable a couple of months ago, no? Actually it was not. Some reports have cited problems back to 6.0 or before. The watchdog design was fundamentally flawed from an SMP point of view and needed to be changed. We also didn't want to go backwards if possible. My Intel driver had support for new hardware that was good to pick up. There's lots of new stuff coming too, so stay tuned :) Jack