Re: em driver testing

2006-11-08 Thread Mike Jakubik

On Tue, November 7, 2006 6:18 pm, Scott Long wrote:


It's just unclear to me how you're associating bce problems with
checksum offloading and IP fragmentation to em problems with design
issues in the watchdog code.


You are correct: the bce watchdog timeouts seem to be related to hardware
checksums. Apologies for the mistake.

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: em driver testing

2006-11-08 Thread Nikolay Pavlov
On Monday,  6 November 2006 at 22:42:18 -0800, Jack Vogel wrote:
 On 11/6/06, Adrian Chadd [EMAIL PROTECTED] wrote:
 Just out of curiosity - why weren't the offending MPSAFE-related
 changes to em just reverted after discovering the em instability? The
 driver -was- stable a couple of months ago, no?
 
 Actually it was not. Some reports have cited problems back
 to 6.0 or before.

Well, I have a 5.5 box with very similar symptoms :)
I do not see watchdog timeouts on it, but a lot of UP/DOWN events.

 
 The watchdog design was fundamentally flawed from an SMP
 point of view and needed to be changed.
 
 We also didn't want to go backwards if possible. My Intel driver
 had support for new hardware that was good to pick up.
 
 There's lots of new stuff coming too, so stay tuned :)

After 48 hours in production there are no watchdog timeouts on my
6.2 SMP server with your patch. Thanks to all who are working on this.

 
 Jack

-- 
Best regards, Nikolay Pavlov.



Re: em driver testing

2006-11-08 Thread Jeremy Chadwick
On Wed, Nov 08, 2006 at 04:40:03PM +0200, Nikolay Pavlov wrote:
 Well, I have a 5.5 box with very similar symptoms :)
 I do not see watchdog timeouts on it, but a lot of UP/DOWN events.

Are you sure this is the same problem as what's being discussed
here?  If you revert to a previous kernel or em driver, does the
problem (link up/down) go away?  Are you sure you don't actually
have a flaky cable or RJ45 connector?  What does the switch your
NIC is connected to say? (does it show link going up and down)

I feel horrible for both Scott and Jack -- I think there's tons
of people coming out of the woodwork with ME TOO comments who
may in fact be suffering from other problems, and are looking for
a scapegoat thread.

-- 
| Jeremy Chadwick jdc at parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator   Mountain View, CA, USA |
| Making life hard for others since 1977.   PGP: 4BD6C0CB |



Re: em driver testing

2006-11-08 Thread Scott Long

Jeremy Chadwick wrote:

On Wed, Nov 08, 2006 at 04:40:03PM +0200, Nikolay Pavlov wrote:

Well, I have a 5.5 box with very similar symptoms :)
I do not see watchdog timeouts on it, but a lot of UP/DOWN events.


Are you sure this is the same problem as what's being discussed
here?  If you revert to a previous kernel or em driver, does the
problem (link up/down) go away?  Are you sure you don't actually
have a flaky cable or RJ45 connector?  What does the switch your
NIC is connected to say? (does it show link going up and down)

I feel horrible for both Scott and Jack -- I think there's tons
of people coming out of the woodwork with ME TOO comments who
may in fact be suffering from other problems, and are looking for
a scapegoat thread.



The timeout/watchdog mechanism in the interface layer has been a problem
ever since the MPSAFE work was done on the network stack.  It's prone to
races, and as the OS has improved and gotten faster over the past 2
years, those races have gotten bigger.  In a way, it's actually a
positive indication of progress and improvement =-)

I don't doubt that there are users with other problems.  We spent some
time collecting as much user data as we could in order to find patterns
and weed out the uncommon cases.  But this timer/watchdog thing looks to
be a strong candidate for being the root cause of many of the problems.
We'll continue to investigate these problems and address other drivers.

Scott


Re: em driver testing

2006-11-08 Thread Nikolay Pavlov
On Wednesday,  8 November 2006 at  7:41:02 -0800, Jeremy Chadwick wrote:
 On Wed, Nov 08, 2006 at 04:40:03PM +0200, Nikolay Pavlov wrote:
  Well, I have a 5.5 box with very similar symptoms :)
  I do not see watchdog timeouts on it, but a lot of UP/DOWN events.
 
 Are you sure this is the same problem as what's being discussed
 here?  If you revert to a previous kernel or em driver, does the
 problem (link up/down) go away?  Are you sure you don't actually
 have a flaky cable or RJ45 connector?  What does the switch your
 NIC is connected to say? (does it show link going up and down)

I am pretty sure. All my servers use the same em chip. On all my 6.1
boxes, whether UP or SMP, I see watchdog timeouts; the average load on
these adapters is 5000 - 6000 interrupts per second. I have only one box
with 5.5 (same task and same platform), and I am not claiming that this is
exactly the watchdog problem; it just looks very similar in the context of
this discussion. In any case, the new 6.2 em patch works for me: I have
seen no watchdog timeouts after 48 hours of uptime.

By the way, the box is connected to a 2950 switch, and I can't find any
problems with the cabling.

Here is what it looks like on 5.5:

Oct 18 05:38:45 ms6 kernel: em0: Link is up 1000 Mbps Full Duplex
Oct 18 05:38:50 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: not responding
Oct 18 05:39:21 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: not responding
Oct 18 05:39:32 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: is alive again
Oct 18 05:52:22 ms6 kernel: em0: Link is Down
Oct 18 05:55:13 ms6 kernel: em0: Link is up 1000 Mbps Full Duplex
Oct 18 05:55:13 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: not responding
Oct 18 05:55:44 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: not responding
Oct 18 05:55:46 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: is alive again
Oct 18 06:01:52 ms6 kernel: em0: Link is Down
Oct 18 06:03:54 ms6 kernel: em0: Link is up 1000 Mbps Full Duplex
Oct 18 06:03:54 ms6 kernel: em0: Link is Down
Oct 18 06:04:01 ms6 kernel: em0: Link is up 1000 Mbps Full Duplex
Oct 18 06:16:07 ms6 kernel: em0: Link is Down
Oct 18 06:18:16 ms6 kernel: em0: Link is up 1000 Mbps Full Duplex
Oct 18 06:21:55 ms6 kernel: em0: Link is Down
Oct 18 06:25:12 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: not responding
Oct 18 06:25:25 ms6 kernel: em0: Link is up 1000 Mbps Full Duplex
Oct 18 06:25:27 ms6 kernel: em0: Link is Down
Oct 18 06:25:33 ms6 kernel: em0: Link is up 1000 Mbps Full Duplex
Oct 18 06:25:43 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: not responding
Oct 18 06:26:10 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: is alive again
Oct 18 06:43:12 ms6 kernel: em0: Link is Down
Oct 18 06:45:13 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: not responding
Oct 18 06:45:44 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: not responding
Oct 18 06:46:15 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: not responding
Oct 18 06:46:27 ms6 kernel: em0: Link is up 1000 Mbps Full Duplex
Oct 18 06:46:28 ms6 kernel: em0: Link is Down
Oct 18 06:46:34 ms6 kernel: em0: Link is up 1000 Mbps Full Duplex
Oct 18 06:46:46 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: not responding
Oct 18 06:47:17 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: not responding
Oct 18 06:47:26 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: is alive again
Oct 18 07:02:51 ms6 kernel: em0: Link is Down
Oct 18 07:04:42 ms6 kernel: em0: Link is up 1000 Mbps Full Duplex
Oct 18 07:04:44 ms6 kernel: em0: Link is Down
Oct 18 07:04:50 ms6 kernel: em0: Link is up 1000 Mbps Full Duplex
Oct 18 07:05:13 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: not responding
Oct 18 07:05:25 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: is alive again
Oct 18 16:40:05 ms6 kernel: receive error 60 from nfs server 206.53.x.x:/usr/home/shared
Oct 19 03:55:13 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: not responding
Oct 19 03:55:15 ms6 kernel: nfs server 206.53.x.x:/usr/home/shared: is alive again

After that date it was rebooted at least three times, and I have not
seen such symptoms since.
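
For anyone who wants to quantify link flaps like these rather than eyeball
them, here is a minimal Python sketch. It assumes syslog lines in the same
"emN: Link is up/Down" format as the log above. The function name
link_outages and the year argument are made up for illustration (syslog
does not record the year, so the caller has to supply one).

```python
import re
from datetime import datetime

# Matches lines such as "Oct 18 05:52:22 ms6 kernel: em0: Link is Down".
LINK_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"(?P<ifname>[a-z]+\d+): Link is (?P<state>up|Down)"
)

def link_outages(lines, year=2006):
    """Return (ifname, down_time, up_time, seconds) for each Down -> up pair."""
    down_since = {}
    outages = []
    for line in lines:
        m = LINK_RE.match(line)
        if not m:
            continue  # NFS messages and other kernel lines are ignored
        # The year is an assumption passed in by the caller.
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        ifname = m["ifname"]
        if m["state"] == "Down":
            down_since[ifname] = ts
        elif ifname in down_since:
            start = down_since.pop(ifname)
            outages.append((ifname, start, ts, (ts - start).total_seconds()))
    return outages

# The first five link events from the log above.
sample = [
    "Oct 18 05:38:45 ms6 kernel: em0: Link is up 1000 Mbps Full Duplex",
    "Oct 18 05:52:22 ms6 kernel: em0: Link is Down",
    "Oct 18 05:55:13 ms6 kernel: em0: Link is up 1000 Mbps Full Duplex",
    "Oct 18 06:01:52 ms6 kernel: em0: Link is Down",
    "Oct 18 06:03:54 ms6 kernel: em0: Link is up 1000 Mbps Full Duplex",
]
outages = link_outages(sample)
for ifname, start, end, secs in outages:
    print(f"{ifname}: down {start:%H:%M:%S} -> up {end:%H:%M:%S} ({secs:.0f}s)")
```

Outages of two to three minutes, as in these lines, are much longer than
the couple of seconds reported elsewhere in this thread for a watchdog
reset, which may be a hint that this box has a different problem.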

 
 I feel horrible for both Scott and Jack -- I think there's tons
 of people coming out of the woodwork with ME TOO comments who
 may in fact be suffering from other problems, and are looking for
 a scapegoat thread.

Just ignore me. The patch works for me, and that is the end of it.

-- 
Best regards, Nikolay Pavlov.



Re: em driver testing

2006-11-08 Thread Nikolay Pavlov
On Wednesday,  8 November 2006 at  9:26:26 -0700, Scott Long wrote:
 Jeremy Chadwick wrote:
 On Wed, Nov 08, 2006 at 04:40:03PM +0200, Nikolay Pavlov wrote:
 Well, I have a 5.5 box with very similar symptoms :)
 I do not see watchdog timeouts on it, but a lot of UP/DOWN events.
 
 Are you sure this is the same problem as what's being discussed
 here?  If you revert to a previous kernel or em driver, does the
 problem (link up/down) go away?  Are you sure you don't actually
 have a flaky cable or RJ45 connector?  What does the switch your
 NIC is connected to say? (does it show link going up and down)
 
 I feel horrible for both Scott and Jack -- I think there's tons
 of people coming out of the woodwork with ME TOO comments who
 may in fact be suffering from other problems, and are looking for
 a scapegoat thread.
 
 
 The timeout/watchdog mechanism in the interface layer has been a problem
 ever since the MPSAFE work was done on the network stack.  It's prone to
 races, and as the OS has improved and gotten faster over the past 2
 years, those races have gotten bigger.  In a way, it's actually a
 positive indication of progress and improvement =-)
 
 I don't doubt that there are users with other problems.  We spent some
 time collecting as much user data as we could in order to find patterns
 and weed out the uncommon cases.  But this timer/watchdog thing looks to
 be a strong candidate for being the root cause of many of the problems.
 We'll continue to investigate these problems and address other drivers.
 
 Scott

Thanks Scott. From my side, I'd like to say that I'm always ready to 
help you in testing.

-- 
Best regards, Nikolay Pavlov.



Re: em driver testing

2006-11-08 Thread Steven Hartland

Scott Long wrote:

I don't doubt that there are users with other problems.  We spent some
time collecting as much user data as we could in order to find
patterns and weed out the uncommon cases.  But this timer/watchdog
thing looks to be a strong candidate for being the root cause of many
of the problems. We'll continue to investigate these problems and
address other drivers. 


Out of interest are there any priorities for drivers that will get
fixed before 6.2 release? Obviously em will but what others: bge, xl?
   
   Steve






Re: em driver testing

2006-11-08 Thread Scott Long

Steven Hartland wrote:

Scott Long wrote:


I don't doubt that there are users with other problems.  We spent some
time collecting as much user data as we could in order to find
patterns and weed out the uncommon cases.  But this timer/watchdog
thing looks to be a strong candidate for being the root cause of many
of the problems. We'll continue to investigate these problems and
address other drivers. 



Out of interest are there any priorities for drivers that will get
fixed before 6.2 release? Obviously em will but what others: bge, xl?
  Steve



My personal opinion is that the changes needed are too risky to rush
into 6.2 for all of the different drivers.  For the vast majority of
people, what is in 6.2 works quite well, and there is no need to
introduce new bugs.  We are pushing forward with if_em because the
problems there are pretty widespread, and Intel is providing direct
engineering and QA input.  Thus, risk is reduced.

Scott



Re: em driver testing

2006-11-08 Thread Ivan Voras
Scott Long wrote:

 My personal opinion is that the changes needed are too risky to rush
 into 6.2 for all of the different drivers.  For the vast majority of
 people, what is in 6.2 works quite well, and there is no need to
 introduce new bugs.  We are pushing forward with if_em because the
 problems there are pretty widespread, and Intel is providing direct
 engineering and QA input.  Thus, risk is reduced.

From a purely user's point of view, it would be nice if the problems
were clearly mentioned in errata together with links to the newest
patches or some other kind of pointer to the ongoing work so people know
where to look if these bugs hit them.



Re: em driver testing

2006-11-07 Thread Greg Byshenk
On Mon, Nov 06, 2006 at 04:14:40PM -0800, Jack Vogel wrote:
 Well, so run 6.2 BETA3 plus the patch I posted as Patrick
 mentioned and then report on that. You've got a lot of
 potential problem areas here, I have no experience with
 samba on FreeBSD. And that motherboard only has PCI
 as I recall, yes? Still, it should get rid of the watchdogs
 unless you have real hardware issues.


As a point of information, I don't think that samba specifically has
anything to do with the problem.

I am running samba on FreeBSD, and have two servers that are rather
heavily used (one is the filestore for a CFD cluster, and the other
for a Maya/Muster rendering cluster), each having two em interfaces
and SMP -- and have not seen any watchdog issues (they are currently
running FreeBSD 6.2-PRERELEASE as of Oct  7 -- but no problems with
any earlier 6.1-STABLE versions either).
 

-- 
greg byshenk  -  [EMAIL PROTECTED]  -  Leiden, NL


Re: em driver testing

2006-11-07 Thread Clayton Milos

Hi Jack


I patched the driver and re-compiled the kernel and userland.

All appears well with the em driver now. No more errors on it.
I am getting watchdog timeouts on the xl driver now though. It was
happening before at the same time as the em ones. Now I've passed a lot of
traffic on the em interface but the xl interface gets watchdog errors. The
em interface still works fine but the xl one is not usable after this.


The motherboard has 2 onboard xl's; I am using one for a live IP and
the other one is doing nothing. It is a server motherboard with an AMD762
north bridge. It has 64-bit 66MHz PCI slots, one of which holds the em
card. The em card is a 32-bit 33MHz PCI card though.


Here's what the xl card is with pciconf -lhv
[EMAIL PROTECTED]:15:0:  class=0x02 card=0x246210f1 chip=0x980010b7 rev=0x78 hdr=0x00
   vendor   = '3COM Corp, Networking Division'
   device   = '3C980-TX Fast EtherLink XL Server Adapter2'

The em card is such:
[EMAIL PROTECTED]:9:0:   class=0x02 card=0x13768086 chip=0x107c8086 rev=0x05 hdr=0x00
   vendor   = 'Intel Corporation'
   device   = 'PRO/1000 GT'


Any help would be greatly appreciated.


Clay



- Original Message - 
From: Jack Vogel [EMAIL PROTECTED]

To: Clayton Milos [EMAIL PROTECTED]
Cc: freebsd-stable@freebsd.org; ke han [EMAIL PROTECTED]
Sent: Tuesday, November 07, 2006 2:14 AM
Subject: Re: em driver testing



Well, so run 6.2 BETA3 plus the patch I posted as Patrick
mentioned and then report on that. You've got a lot of
potential problem areas here, I have no experience with
samba on FreeBSD. And that motherboard only has PCI
as I recall, yes? Still, it should get rid of the watchdogs
unless you have real hardware issues.

Good luck,

Jack


On 11/6/06, Clayton Milos [EMAIL PROTECTED] wrote:

Hi there

I am having similar issues. Running 6.1-RELEASE.

I'm using the box as a samba server with pure-ftpd on it too, with 2.5T of
raid storage in it. The box is running the generic MP kernel on a Tyan
Thunder K7 with the latest BIOS v2.14 and dual AthlonMPs, and ECC Reg RAM
that passed all tests with memtest.

When I pull a few concurrent files over samba, or if I pull a big file (say
2-3G) over ftp to my laptop, it runs at 30MB/sec but usually locks up the
box with a watchdog timeout on the em interface. Usually it pops up with
timeouts on the xl interface at the same time, and after a few seconds on
the ahc (onboard Adaptec SCSI) interface, and I have to hard boot the box
to get it back to life.

I've tried the same box with a 3Com 3C996B-T NIC, which has a Broadcom
BCM5701TKHB chipset on it. It crashes within minutes with no traffic on the
interface. In fact, the interface will accept an IP address but times out
pinging anything on the LAN.

If a kernel developer would like access to the box to check it out, please
mail me.

Regards

Clay


 Hello!

 On Tue, Nov 07, 2006 at 04:55:50AM +0800, ke han wrote:

 I have a Sun X4100 which uses Intel ethernet.  I would like to
 install amd64 6.2beta3 on this server and put it through some tests.
 But I have no idea what tests to run or how to run them.
 Can someone provide some pointers?  I am happy to post my findings.

 Put some CPU load on the machine, e.g. by running

 cd /usr/src
 sh
 while true
 do
     make -j4 buildworld
 done > mk.log

 on one terminal and then transfer some data to the system, e.g.
 by fetch(1)ing via FTP from another box connected to the same
 LAN. On all systems I have, there is no need to saturate the
 Gbit-Link. 100 Mbit/s local connection will trigger the problem, too.

 If the problem exists on your system, you will see emN - watchdog timeout
 messages on the console and in /var/log/messages, followed by a
 reset of the interface and a short and recoverable, but complete,
 loss of connectivity. A couple of seconds, maybe. This is enough
 to frustrate people, who e.g. run large backup jobs over a single
 TCP connection that takes a couple of hours to complete - the interface
 reset aborts the backup :-/

 I must say that it seems to me, these guys are putting a hell of
 a lot of effort into this problem and we are making progress.
 Things look quite good to me for 6.2-RELEASE.

 HTH,
 Patrick
 --
 punkt.de GmbH Internet - Dienstleistungen - Beratung
 Vorholzstr. 25   Tel. 0721 9109 -0 Fax: -100
 76137 Karlsruhe   http://punkt.de

Re: em driver testing

2006-11-07 Thread Jack Vogel

On 11/7/06, Clayton Milos [EMAIL PROTECTED] wrote:

Hi Jack


I patched the driver and re-compiled the kernel and userland.

All appears well with the em driver now. No more errors on it.
I am getting watchdog timeouts on the xl driver now though. It was
happening before at the same time as the em ones. Now I've passed a lot of
traffic on the em interface but the xl interface gets watchdog errors. The
em interface still works fine but the xl one is not usable after this.


I'm not sure what it is, but the fact that a variety of NIC drivers
are having this same problem indicates that something changed in the
if_timer and its caller. Someone who knows that subsystem would be
better qualified to say what.

The other drivers should do the same thing that em did, and stop using
the net layer timer :)

Jack


Re: em driver testing

2006-11-07 Thread Scott Long
We've basically identified problems with the way that watchdogs are
handled.  It is very fragile and sensitive to timing, so it's not
surprising that adjusting the timing in one driver will affect
another driver.  The solution is to push the timeout/watchdog
logic entirely into the NIC drivers, like we did for if_em.  That
will take some time, and I doubt that xl specifically will get fixed
for 6.2.

Scott
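
The idea of keeping the watchdog inside the driver can be modeled roughly
as below. This is a toy Python sketch, not the if_em code: the names
DriverWatchdog, start_transmit, transmit_done and tick are made up for
illustration, and the threading.Lock only stands in for a driver's own
mutex. The assumption being illustrated is that arming, disarming and
expiring the timer all happen under one lock owned by the driver, instead
of a counter that a separate layer counts down on the driver's behalf.

```python
import threading

class DriverWatchdog:
    """Toy model of a per-driver watchdog guarded by the driver's own lock."""

    def __init__(self, timeout_ticks=5):
        self.lock = threading.Lock()   # stands in for the driver mutex
        self.timeout_ticks = timeout_ticks
        self.timer = 0                 # 0 means disarmed
        self.resets = 0

    def start_transmit(self):
        with self.lock:
            self.timer = self.timeout_ticks   # (re)arm on every queued packet

    def transmit_done(self, queue_empty=True):
        with self.lock:
            if queue_empty:
                self.timer = 0                # disarm once nothing is pending

    def tick(self):
        """Called once per interval by the driver's own periodic timer."""
        with self.lock:
            if self.timer == 0:
                return False
            self.timer -= 1
            if self.timer == 0:
                self.resets += 1              # a real driver would reset the NIC here
                return True
            return False

wd = DriverWatchdog(timeout_ticks=3)
wd.start_transmit()
wd.tick()                  # one interval passes, transmit still pending
wd.transmit_done()         # completion disarms the timer under the same lock
quiet = [wd.tick() for _ in range(5)]
wd.start_transmit()        # this transmit never completes
stuck = [wd.tick() for _ in range(3)]
print(quiet, stuck, wd.resets)
```

Because every path takes the same lock, a completion can never slip in
between the countdown reaching zero and the reset decision, which is the
kind of race described above.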



Re: em driver testing

2006-11-07 Thread Mike Jakubik

Clayton Milos wrote:

Hi Jack


I patched the driver and re-compiled the kernel and userland.

All appears well with the em driver now. No more errors on it.
I am getting watchdog timeouts on the xl driver now though. It was
happening before at the same time as the em ones. Now I've passed a
lot of traffic on the em interface but the xl interface gets watchdog
errors. The em interface still works fine but the xl one is not usable
after this.


Has it not been established by someone that the problem is in FreeBSD
(scheduler iirc) and not the drivers themselves? This along with the
bge/bce watchdog timeouts seems to confirm that.




Re: em driver testing

2006-11-07 Thread Jack Vogel

On 11/7/06, Mike Jakubik [EMAIL PROTECTED] wrote:

Clayton Milos wrote:
 Hi Jack


 I patched the driver and re-compiled the kernel and userland.

 All appears well with the em driver now. No more errors on it.
 I am getting watchdog timeouts on the xl driver now though. It was
 happening before at the same time as the em ones. Now I've passed a
 lot of traffic on the em interface but the xl interface gets watchdog
 errors. The em interface still works fine but the xl one is not usable
 after this.

Has it not been established by someone that the problem is in FreeBSD
(scheduler iirc) and not the drivers themselves? This along with the
bge/bce watchdog timeouts seems to confirm that.


Yes, I think it's pretty likely to be in the timer/clock code; something
must have changed.

However, I like the design change we made to em better anyway. The
net/if timer is a UP design and has ALWAYS been vulnerable to races, so
it's best to do what we did (it's still only a patch and not checked
in, btw).

Jack


Re: em driver testing

2006-11-07 Thread Scott Long

Jack Vogel wrote:

On 11/7/06, Clayton Milos [EMAIL PROTECTED] wrote:


Hi Jack


I patched the driver and re-compiled the kernel and userland.

All appears well with the em driver now. No more errors on it.
I am getting watchdog timeouts on the xl driver now though. It was
happening before at the same time as the em ones. Now I've passed a lot of
traffic on the em interface but the xl interface gets watchdog errors. The
em interface still works fine but the xl one is not usable after this.



I'm not sure what it is, but the fact that a variety of NIC drivers
are having this same problem indicates that something changed in the
if_timer and its caller. Someone who knows that subsystem would be
better qualified to say what.

The other drivers should do the same thing that em did, and stop using
the net layer timer :)

Jack


I think it's more that the if_em driver watchdog was insulating the
if_xl driver.  Once the if_em component was removed, the if_xl driver
was the next in line to be a victim.  So yes, like you say, all of the
drivers need to be fixed.

Scott


Re: em driver testing

2006-11-07 Thread Scott Long

Mike Jakubik wrote:

Clayton Milos wrote:


Hi Jack


I patched the driver and re-compiled the kernel and userland.

All appears well with the em driver now. No more errors on it.
I am getting watchdog timeouts on the xl driver now though. It was
happening before at the same time as the em ones. Now I've passed a
lot of traffic on the em interface but the xl interface gets watchdog
errors. The em interface still works fine but the xl one is not usable
after this.



Has it not been established by someone that the problem is in FreeBSD
(scheduler iirc) and not the drivers themselves? This along with the
bge/bce watchdog timeouts seems to confirm that.




Mike,

If you have insight into the bce driver, I would highly appreciate it if
you would share it.

Scott (the guy who fixed bce)



Re: em driver testing

2006-11-07 Thread Mike Jakubik

Scott Long wrote:

Mike,

If you have insight into the bce driver, I would highly appreciate it if
you would share it.

Scott (the guy who fixed bce)



I don't have any bce hardware myself; I'm just using the information
from the list. However, I have some em and fxp hardware that I can use
to do tests.




Re: em driver testing

2006-11-07 Thread Scott Long

Mike Jakubik wrote:

Scott Long wrote:


Mike,

If you have insight into the bce driver, I would highly appreciate it if
you would share it.

Scott (the guy who fixed bce)



I don't have any bce hardware myself; I'm just using the information
from the list. However, I have some em and fxp hardware that I can use
to do tests.




It's just unclear to me how you're associating bce problems with 
checksum offloading and IP fragmentation to em problems with design 
issues in the watchdog code.


Scott



Re: em driver testing

2006-11-07 Thread Mike Tancsa

At 04:52 PM 11/7/2006, Scott Long wrote:

I think it's more that the if_em driver watchdog was insulating the
if_xl driver.  Once the if_em component was removed, the if_xl driver
was the next in line to be a victim.  So yes, like you say, all of the
drivers need to be fixed.


I wonder if that's the issue I was seeing on the rl interface.
While trying to stress test an em based machine, I saw


Nov  6 17:33:05 releng6-865 kernel: rl0: watchdog timeout

while blasting out UDP traffic via netrate from the rl based box to 
the em based box.


---Mike 




Re: em driver testing

2006-11-07 Thread Mike Jakubik

Jack,

I have done some tests; here are my results. On 6.2-BETA3 I was able to
get a timeout while compiling the kernel and FTPing a large file from
another server with the same card. On 6.2-STABLE cvsupped today I was not
able to produce a timeout; I then applied your patch and the results
were the same.


[EMAIL PROTECTED]:10:0:  class=0x02 card=0x11768086 chip=0x10768086 rev=0x00 hdr=0x00
   vendor   = 'Intel Corporation'
   device   = '82547EI Gigabit Ethernet Controller'



em driver testing

2006-11-06 Thread ke han
According to the 6.2-beta3 announcement, there are improvements to
the em driver.


The most important of the things that have been worked on is the
driver for em(4). 

I have a Sun X4100 which uses Intel ethernet.  I would like to  
install amd64 6.2beta3 on this server and put it through some tests.

But I have no idea what tests to run or how to run them.
Can someone provide some pointers?  I am happy to post my findings.

thanks, ke han


Re: em driver testing

2006-11-06 Thread Bruce A. Mah
If memory serves me right, ke han wrote:
 According to the 6.2-beta3 announcement, there are improvements to
 the em driver.
 
 The most important of the things that have been worked on is the
 driver for em(4). 
 
 I have a Sun X4100 which uses Intel ethernet.  I would like to  
 install amd64 6.2beta3 on this server and put it through some tests.
 But I have no idea what tests to run or how to run them.
 Can someone provide some pointers?  I am happy to post my findings.

There were a number of problems in em(4) that appeared post-6.1.  A new
version of the em(4) driver was merged to the RELENG_6 branch just prior
to 6.2-BETA3.  That version is better, but still has some unresolved
issues (problems have been reported with jumbo frames and with watchdog
timeouts).  There's a new patch by jfv@ in an email a few days ago (with
subject "New em patch for 6.2 BETA 3") that might fix these problems.
It might be worthwhile to try this out.

Basically, just look around the archives of stable@ to see discussions
of the em driver over the past month or so.  I believe that a number of
prior problems were uncovered when people just tried to push a lot of
traffic through the interface.

Bruce.






Re: em driver testing

2006-11-06 Thread Patrick M. Hausen
Hello!

On Tue, Nov 07, 2006 at 04:55:50AM +0800, ke han wrote:

 I have a Sun X4100 which uses Intel ethernet.  I would like to  
 install amd64 6.2beta3 on this server and put it through some tests.
 But I have no idea what tests to run or how to run them.
 Can someone provide some pointers?  I am happy to post my findings.

Put some CPU load on the machine, e.g. by running

cd /usr/src
sh
while true
do
make -j4 buildworld
done > mk.log

on one terminal and then transfer some data to the system, e.g.
by fetch(1)ing via FTP from another box connected to the same
LAN. On all systems I have, there is no need to saturate the
Gbit-Link. 100 Mbit/s local connection will trigger the problem, too.

If the problem exists on your system, you will see emN - watchdog timeout
messages on the console and in /var/log/messages, followed by a
reset of the interface and a short and recoverable, but complete,
loss of connectivity. A couple of seconds, maybe. This is enough
to frustrate people, who e.g. run large backup jobs over a single
TCP connection that takes a couple of hours to complete - the interface
reset aborts the backup :-/
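
To check afterwards whether a test run triggered the problem, the log can be summarized per interface. The following is only a rough sketch, not part of any driver or patch: it assumes kernel log lines of the form "kernel: rl0: watchdog timeout" as quoted elsewhere in this thread, and the function name count_watchdog and the default log path are illustrative assumptions.

```sh
#!/bin/sh
# Sketch: count "watchdog timeout" kernel messages per interface.
# Assumed line format (taken from a log line quoted in this thread):
#   Nov  6 17:33:05 host kernel: rl0: watchdog timeout
# count_watchdog is a hypothetical helper name, not an existing tool.

count_watchdog() {
    # Log file to scan; defaults to the usual syslog location.
    logfile="${1:-/var/log/messages}"
    # Extract the interface name that precedes ": watchdog timeout",
    # then count how often each interface appears.
    sed -n 's/.*kernel: \([a-z][a-z]*[0-9][0-9]*\): watchdog timeout.*/\1/p' "$logfile" |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

Calling count_watchdog with no argument scans /var/log/messages and prints one line per interface, name first and count second. No output means no matching timeouts were logged.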

I must say that it seems to me, these guys are putting a hell of
a lot of effort into this problem and we are making progress.
Things look quite good to me for 6.2-RELEASE.

HTH,
Patrick
-- 
punkt.de GmbH Internet - Dienstleistungen - Beratung
Vorholzstr. 25Tel. 0721 9109 -0 Fax: -100
76137 Karlsruhe   http://punkt.de


Re: em driver testing

2006-11-06 Thread Clayton Milos

Hi there

I am having similar issues. Running 6.1-RELEASE.

I'm using the box as a Samba server, with pure-ftpd on it too, with 2.5T of
RAID storage in it. The box is running the generic MP kernel on a Tyan
Thunder K7 with the latest BIOS v2.14 and dual AthlonMPs. ECC Reg RAM that
passed all tests with memtest.


When I pull a few concurrent files over Samba or if I pull a big file (say
2-3G) over FTP to my laptop it runs at 30MB/sec but usually locks up the box
with a watchdog timeout on the em interface. Usually it pops up with timeouts
on the xl interface at the same time and after a few seconds on the ahc
(onboard Adaptec SCSI) interface and I have to hard boot the box to get it
back to life.


I've tried the same box with a 3com 3C996B-T NIC which has a Broadcom
BCM5701TKHB chipset on it. It crashes within minutes with no traffic on the 
interface. In fact the interface will accept an IP address but times out 
pinging anything on the LAN.


If a kernel developer would like access to the box to check it out please
mail me.


Regards

Clay



Hello!

On Tue, Nov 07, 2006 at 04:55:50AM +0800, ke han wrote:


I have a Sun X4100 which uses Intel ethernet.  I would like to
install amd64 6.2beta3 on this server and put it through some tests.
But I have no idea what tests to run or how to run them.
Can someone provide some pointers?  I am happy to post my findings.


Put some CPU load on the machine, e.g. by running

cd /usr/src
sh
while true
do
make -j4 buildworld
done > mk.log

on one terminal and then transfer some data to the system, e.g.
by fetch(1)ing via FTP from another box connected to the same
LAN. On all systems I have, there is no need to saturate the
Gbit-Link. 100 Mbit/s local connection will trigger the problem, too.

If the problem exists on your system, you will see emN - watchdog timeout
messages on the console and in /var/log/messages, followed by a
reset of the interface and a short and recoverable, but complete,
loss of connectivity. A couple of seconds, maybe. This is enough
to frustrate people, who e.g. run large backup jobs over a single
TCP connection that takes a couple of hours to complete - the interface
reset aborts the backup :-/

I must say that it seems to me, these guys are putting a hell of
a lot of effort into this problem and we are making progress.
Things look quite good to me for 6.2-RELEASE.

HTH,
Patrick
--
punkt.de GmbH Internet - Dienstleistungen - Beratung
Vorholzstr. 25Tel. 0721 9109 -0 Fax: -100
76137 Karlsruhe   http://punkt.de


Re: em driver testing

2006-11-06 Thread Jack Vogel

Well, so run 6.2 BETA3 plus the patch I posted as Patrick
mentioned and then report on that. You've got a lot of
potential problem areas here, I have no experience with
samba on FreeBSD. And that motherboard only has PCI
as I recall, yes? Still, it should get rid of the watchdogs
unless you have real hardware issues.

Good luck,

Jack


On 11/6/06, Clayton Milos [EMAIL PROTECTED] wrote:

Hi there

I am having similar issues. Running 6.1-RELEASE.

I'm using the box as a samba server with pure-ftpd on it too with 2.5T of
raid storage in it. the box is running the generic MP kernel on a Tyan
Thunder K7 with the latest bios v2.14 and dual AthlonMP's. ECC Reg ram that
passed all tests with memtest.

When I pull a few concurrent files over Samba or if I pull a big file (say
2-3G) over FTP to my laptop it runs at 30MB/sec but usually locks up the box
with a watchdog timeout on the em interface. Usually it pops up with timeouts
on the xl interface at the same time and after a few seconds on the ahc
(onboard Adaptec SCSI) interface and I have to hard boot the box to get it
back to life.

I've tried the same box with a 3com 3C996B-T NIC which has a Broadcom
BCM5701TKHB chipset on it. It crashes within minutes with no traffic on the
interface. In fact the interface will accept an IP address but times out
pinging anything on the LAN.

If a kernel developer would like access to the box to check it out please
mail me.

Regards

Clay


 Hello!

 On Tue, Nov 07, 2006 at 04:55:50AM +0800, ke han wrote:

 I have a Sun X4100 which uses Intel ethernet.  I would like to
 install amd64 6.2beta3 on this server and put it through some tests.
 But I have no idea what tests to run or how to run them.
 Can someone provide some pointers?  I am happy to post my findings.

 Put some CPU load on the machine, e.g. by running

 cd /usr/src
 sh
 while true
 do
 make -j4 buildworld
 done > mk.log

 on one terminal and then transfer some data to the system, e.g.
 by fetch(1)ing via FTP from another box connected to the same
 LAN. On all systems I have, there is no need to saturate the
 Gbit-Link. 100 Mbit/s local connection will trigger the problem, too.

 If the problem exists on your system, you will see emN - watchdog timeout
 messages on the console and in /var/log/messages, followed by a
 reset of the interface and a short and recoverable, but complete,
 loss of connectivity. A couple of seconds, maybe. This is enough
 to frustrate people, who e.g. run large backup jobs over a single
 TCP connection that takes a couple of hours to complete - the interface
 reset aborts the backup :-/

 I must say that it seems to me, these guys are putting a hell of
 a lot of effort into this problem and we are making progress.
 Things look quite good to me for 6.2-RELEASE.

 HTH,
 Patrick
 --
 punkt.de GmbH Internet - Dienstleistungen - Beratung
 Vorholzstr. 25Tel. 0721 9109 -0 Fax: -100
 76137 Karlsruhe   http://punkt.de


Re: em driver testing

2006-11-06 Thread Adrian Chadd

Just out of curiosity - why weren't the offending MPSAFE-related
changes to em just reverted after discovering the em instability? The
driver -was- stable a couple of months ago, no?



Adrian

--
Adrian Chadd - [EMAIL PROTECTED]


Re: em driver testing

2006-11-06 Thread Jack Vogel

On 11/6/06, Adrian Chadd [EMAIL PROTECTED] wrote:

Just out of curiosity - why weren't the offending MPSAFE-related
changes to em just reverted after discovering the em instability? The
driver -was- stable a couple of months ago, no?


Actually it was not. Some reports have cited problems back
to 6.0 or before.

The watchdog design was fundamentally flawed from an SMP
point of view and needed to be changed.

We also didn't want to go backwards if possible. My Intel driver
had support for new hardware that was good to pick up.

There's lots of new stuff coming too, so stay tuned :)

Jack