Re: Odd TCP glitches in new currents
We aren't doing mcast at this time. If there's anyone from Nortel lurking behind this list, UCLA CS is pretty close to throwing out the Accelars due to a lack of tech support response.

No, UCLA CS is not capable of doing department-wide mcast because of a set of peculiar bugs in the Accelar's code. It will only do DVMRP snooping on a limited number of mcast groups (~400 or so). What we actually see is 3x that number. And so we're waiting for some upgraded code that Nortel/Bay has claimed is coming for the better part of a year now.

-scooter

On Fri, 24 Dec 1999, Glendon Gross wrote:

> Are you sure that this is a problem with the local interface dropping packets, or could it just be a multicast router that is suppressing packets? I have noticed with my new FreeBSD box running mrouted, exceptionally good routing performance. But my linux boxes are more consistent in their response. So I concluded that my upstream neighbors are suppressing the broadcasts as a feature of the multicast routing protocol. I don't think it's a problem with my local interface, just a feature of the DVMRP protocol.
>
> Can anyone recommend a good reference on this? I've been reading RFC-1075 and don't really understand it.
>
> --Glen Gross
>
> On Tue, 30 Nov 1999, B. Scott Michel wrote:
>
>> On Wed, 22 Dec 1999, Jonathan Lemon wrote:
>>
>>> On Dec 12, 1999 at 11:37:42AM -0800, Matthew Dillon wrote:
>>>
>>> I had a Netgear FS509 switch here that would eat packets transmitted through the GigE port under certain conditions. Netgear shipped me a new one, and I've been happy with it, until the same problem started happening again this morning.
>>
>> There's some oddities in the 3.3 and 3.4 kernels as well -- I've actually nailed down the plexicity and speed on both the Accelar and my humble PC, and yet, I'm looking at weird TCP lockups from time to time.
>>
>> Mostly seems to be related to NFSv3, but will also happen when doing cvsup. There's no magic number of how many bytes are queued waiting to go out the interface. And it seems to be limited to specific connections, i.e. an NFS TCP connection can be jammed and yet I can be happily talking to cvsup3 doing an update.
>>
>> The interface in question is a NetGear:
>>
>>     pn0: 82c169 PNIC 10/100BaseTX rev 0x20 int a irq 11 on pci0.9.0
>>
>> What is odd is that the output error metric from netstat -in monotonically increases.
>>
>> Yes, I could post my configuration, etc., and I could go back to running -current, but I have a PhD to make progress on. And I'm willing to wait to try out the consolidated 2x040/PNIC driver when 4.0 finally rolls out.
>>
>> -scooter

Scott Michel          | No research ideal ever survives
UCLA Computer Science | contact with implementation.
PhD Graduate Student  |

To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-current" in the body of the message
Re: Woa! May have found something - 'rl' driver and small packets (was Re: Odd TCP glitches in new currents)
:...
: I'm pretty sure that the box was getting receive interrupts because
: every time I sent a packet to it from the outside systat -vm showed
: a PCI interrupt for the network device. However 'netstat -in 1' did
: not show the statistics for the received packets until 64 had
: accumulated. It could be that the statistics are not being accumulated
: on a per-reception basis and that the receive packets are actually
: getting through, and that it's the transmit side which is broken. I don't
: know the code well enough yet to make the determination.
:
:If things are done in these drivers as they are in the if_de driver then
:what you are seeing is the fact that if_opackets and are only
:updated when the tx ring is reclaimed by an interrupt, not

Next time this bug rears its ugly head I'll get a tcpdump going to try to figure out what is actually going on.

Ooh, and I just had a thought -- a profiled kernel might help track down the problem as well, by showing which routines get hit (and which don't).

I don't see anything specific in the code so far, other than there being a lot of memory-mapped objects (apparently shared with the device) that haven't been declared volatile. So far I can't tie that into anything though.

-Matt
Re: Woa! May have found something - 'rl' driver and small packets (was Re: Odd TCP glitches in new currents)
On Wed, Dec 22, 1999 at 10:18:56PM -0800, Matthew Dillon wrote:

> I'm adding Bill Paul to the list specifically.
>
> Hmm. Now this is odd! I think I may have found something! All of my 'rl' driver cards fail this test:
>
>     apollo# linktest -m 0.1:0.2 -s 16 -f16 lander
>     lander# linktest -m 0.1:0.2 -s 16 -f16 apollo
>
> They get about 1% packet loss with the test. Always. 100BaseTX full or half duplex, or 10BaseT -- I still get failures.

I can't repeat this with a RealTek 8029 (that's an 'ed'-NIC) and a RealTek 8139 (that's the 'rl'-one) running 10BaseT. Note that I am _NOT_ running -CURRENT on any of these machines, they both run 2.2-STABLE (rev. 1.17 of rl.c).

The packet loss when using small packets is exactly 0 - that is, no packet loss occurred during the minute or so which I was running linktest. I just started it again and will leave it running for a couple of hours, but I doubt that this will make a change.

Whoops, I in fact experienced packet loss now:

    overdose(194.94.249.94)-foobar.franken.de lost 1/1606
    overdose(194.94.249.94)-foobar.franken.de lost 2/2702
    foobar.franken.de-overdose(194.94.249.94) lost 1/3412
    overdose(194.94.249.94)-foobar.franken.de lost 3/3829

Note that I was playing PCM-files via NFS at this time, so there was additional network traffic of ~180 KByte/s. These here now occurred although there was no additional network traffic:

    overdose(194.94.249.94)-foobar.franken.de lost 4/5491
    overdose(194.94.249.94)-foobar.franken.de lost 5/5692
    overdose(194.94.249.94)-foobar.franken.de lost 6/7277
    overdose(194.94.249.94)-foobar.franken.de lost 7/8661
    overdose(194.94.249.94)-foobar.franken.de lost 8/9412
    overdose(194.94.249.94)-foobar.franken.de lost 9/11393
    overdose(194.94.249.94)-foobar.franken.de lost 10/13699
    foobar.franken.de-overdose(194.94.249.94) lost 2/13728
    overdose(194.94.249.94)-foobar.franken.de lost 11/16426

It seems as if this was roughly the same amount of packet loss as you experienced.

>     rl0: RealTek 8139 10/100BaseTX irq 11 at device 3.0 on pci0
>     rl0: Ethernet address: 00:50:ba:d1:89:05
>     miibus0: MII bus on rl0
>
> All of my 'fxp' driver cards succeed with the above test perfectly. If I test an fxp machine versus an 'rl' machine, linktest shows that the 'rl' cards can transmit small packets just fine but they lose out trying to receive them!

Nope, it's the other way round for me. overdose has the 'rl'-NIC, foobar has the 'ed'-NIC. I hope to be able to do a few additional tests soon.

> Methinks there is something going on with the 'rl' driver and/or the RealTek cards!

My experience with those cards isn't the best, so I'd place my bets on the cards.

bye,
Harold

--
Shabby Sleep is an abstinence syndrome wich occurs due to lack of caffein.
Wed Mar 4 04:53:33 CET 1998 #unix, ircnet
Re: Woa! May have found something - 'rl' driver and small packets (was Re: Odd TCP glitches in new currents)
> ...
> I'm pretty sure that the box was getting receive interrupts because every time I sent a packet to it from the outside systat -vm showed a PCI interrupt for the network device. However 'netstat -in 1' did not show the statistics for the received packets until 64 had accumulated. It could be that the statistics are not being accumulated on a per-reception basis and that the receive packets are actually getting through, and that it's the transmit side which is broken. I don't know the code well enough yet to make the determination.

If things are done in these drivers as they are in the if_de driver then what you are seeing is the fact that if_opackets and are only updated when the tx ring is reclaimed by an interrupt, not when we actually queue the packet to the card. This has been a source of confusion for a long time, and IMNSO we should move the if_ipackets+= in the code.

Here is an idle box, with a dc21143 in it, showing probably what you are seeing (the only network traffic to this box is the output of this running netstat -I de0 1 command):

            input          (de0)           output
   packets  errs      bytes    packets  errs      bytes colls
         1     0         60          0     0        138     0
         2     0        182          0     0        250     0
         2     0        158          0     0        138     0
   ... 100+ lines of output deleted ...
         3     0        256          0     0        138     0
         1     0         60        122     0        138     0
         3     0        256          0     0        138     0
         1     0         60          0     0        138     0

Search for lines like this:

    sc->tulip_if.if_opackets += xmits;

in the driver to see when we update the counter, then look at how interrupt-per-packet drivers do it and propose a nice clean solution :-)

--
Rod Grimes - KD7CAX @ CN85sl - (RWG25) [EMAIL PROTECTED]
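A minimal sketch of the effect Rod describes, assuming a driver that only accounts for transmitted packets when the TX ring is reclaimed. The struct, the function names, and the reclaim interval are hypothetical and are not taken from if_de or if_dc; the point is only that a counter updated at reclaim time moves in bursts even though frames are queued one at a time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified TX state; not the real if_de or if_dc softc. */
struct tx_sim {
    unsigned long if_opackets;  /* counter that netstat would report */
    unsigned pending;           /* frames queued but not yet reclaimed */
    unsigned reclaim_every;     /* simulated "TX done" interrupt every N frames */
};

/* Queue one frame. The visible counter is untouched until reclaim time. */
static void tx_queue(struct tx_sim *s)
{
    assert(s != NULL && s->reclaim_every > 0);
    s->pending++;
    if (s->pending == s->reclaim_every) {
        /* Simulated "TX done" interrupt: reclaim and account in one burst. */
        s->if_opackets += s->pending;
        s->pending = 0;
    }
}

/* Queue n frames and return how far the visible counter moved. */
static unsigned long tx_queue_many(struct tx_sim *s, unsigned n)
{
    unsigned long before = s->if_opackets;
    for (unsigned i = 0; i < n; i++)
        tx_queue(s);
    return s->if_opackets - before;
}
```

With reclaim_every set to 64, queuing 63 frames leaves the counter unchanged and the 64th frame makes it jump by 64, which is the kind of step seen in the output column above.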
Re: Woa! May have found something - 'rl' driver and small packets (was Re: Odd TCP glitches in new currents)
Ok, here's the current status:

The RealTek boards ('rl' driver, D-Link brand, RealTek chip vendor) appear to have serious packet loss problems with small packets. The cause is currently unknown. I had two different machines (an older PPro 200 and a somewhat newer K6-2/233) with the boards in and both exhibited the problem. The problem is fairly trivial to reproduce using linktest:

    http://www.backplane.com/FreeSrc/linktest-1.1.c

    host1# linktest -s 16 -f8 host2
    host2# linktest -s 16 -f8 host1

These boards were the cause of my TCP problems. The D-Link boards came with the D-Link switch I had purchased.

I removed the boards and replaced them with the two LinkSys boards that came with the LinkSys switch I had purchased. The LinkSys boards ('dc' driver, LNE100TX+ fame, LC82C115 PNIC II vendor) do not appear to have the packet loss problem. I have not had a recurrence of my TCP glitches and my linktest tests have all come out roses.

I'm hoping Bill will be able to find the problem with the D-Link boards, just so everyone else using them doesn't hit the same hangup, but my problem at least appears to be solved after replacing the boards. I've stuck my D-Link board into another diskless test machine and it's available for testing potential fixes, debugging, etc.

In regards to the switches themselves: Both the LinkSys and the D-Link 5-port switches appear to work well. I've interchanged them with each other and tested them pretty significantly with four machines attached. The LinkSys seems to be limited to around 25 MBytes/sec in aggregate throughput. The D-Link maxed out my machines (35 MBytes/sec) so I do not know what its ultimate limitation is. The small-packet test maxed out my machines at 35,000 packets per second. So while I couldn't find the limitations of the switches, they're plenty good enough for me!

The only problem I've come up against is that when I change the duplex with ifconfig the ethernet port is not reset and the switches do not recognize that the duplex has changed. If I 'ifconfig XXX media auto', however, the ports are reset and the switches negotiate full-duplex properly. If I ifconfig between 10 and 100BaseT the ports are reset and the switches appear to figure out the mode properly as well.

So that's where I am. There was never anything wrong with the switches or the cabling - the entire problem was due to the D-Link ethernet cards.

-Matt
Re: Odd TCP glitches in new currents
In message [EMAIL PROTECTED], Matthew Dillon writes:

> :
> :make sure you test odd packet lengths. (as in "not even")
> :there are occasional bugs that turn up with that sort of thing.
>
> Yup. Way ahead of you. Hmm. usleep() seems to have a coarse granularity - only about 150 Hz. How annoying!

Increase your HZ. I'm using 1000 as default these days.

--
Poul-Henning Kamp    FreeBSD coreteam member
[EMAIL PROTECTED]    "Real hackers run -current on their laptop."
FreeBSD -- It will take a long time before progress goes too far!
Re: Woa! May have found something - 'rl' driver and small packets (was Re: Odd TCP glitches in new currents)
Just a quick note, not entirely on-topic:

Bill Paul wrote:

> [...] Yes, I know there's a minimum frame length of 60 bytes. And the rl_encap() routine has the following code:
>
>     /* Pad frames to at least 60 bytes. */
>     if (m_head->m_pkthdr.len < RL_MIN_FRAMELEN) {
>         m_head->m_pkthdr.len +=
>             (RL_MIN_FRAMELEN - m_head->m_pkthdr.len);
>         m_head->m_len = m_head->m_pkthdr.len;
>     }
>
> [...] 60 bytes, I just bump up m_pkthdr.len and m_len. This adjusted length gets used later in rl_start() when transmission is triggered.

I haven't read through the code yet, so I don't know where the extra memory in that buffer originated from, or rather if it has been zeroed before reaching this point. Otherwise you are leaking data from the kernel out to the network.

Other OSes have done this before. It can be used for "data fishing" by just pinging the machine. Eventually it turns up all sorts of interesting information ([partial] passwords, for example).

How many other NICs are unable to auto-pad, and how many of the drivers just add "random" data that happened to be lying around inside the kernel...?

Just curious,
/Mikko

(Off to make sure that if_ed in my home firewall isn't doing anything like this...)
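A minimal sketch of the padding concern raised above, assuming a plain byte buffer instead of an mbuf. The function pad_frame() and the constant MIN_FRAMELEN are hypothetical names for illustration, not part of the rl driver; the sketch only shows that zeroing the pad region before bumping the length keeps stale buffer contents off the wire.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MIN_FRAMELEN 60 /* minimum frame length without CRC, as quoted in the thread */

/*
 * Pad a frame held in buf (capacity cap bytes) up to MIN_FRAMELEN.
 * Returns the new length. The pad bytes are explicitly zeroed so that
 * whatever was previously in the buffer is not transmitted.
 */
static size_t pad_frame(unsigned char *buf, size_t len, size_t cap)
{
    assert(buf != NULL && len <= cap && cap >= MIN_FRAMELEN);
    if (len < MIN_FRAMELEN) {
        memset(buf + len, 0, MIN_FRAMELEN - len);
        len = MIN_FRAMELEN;
    }
    return len;
}
```

Only adjusting the length, as in the quoted rl_encap() fragment, would send whatever bytes already follow the payload in the buffer; whether those bytes are zeroed elsewhere in the real driver is not established in this thread.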
Re: Woa! May have found something - 'rl' driver and small packets (was Re: Odd TCP glitches in new currents)
:Okay, I patched if_rl.c in -current to fix the problem demonstrated by
:Matt's linktest program. The bug was actually on the receive side of the
:rl driver, not the transmit side. A packet can wrap from the end of the
:RX buffer back to the beginning, and in some cases these packets would
:get lost due to botched use of m_pullup(). I can run the linktest
:program now without losing any frames.
:
:There's another way around this which is to allocate a whole mbuf
:cluster when you know the packet is wrapped and bcopy the data manually
:instead of using m_devget(), but I'm not sure I want to waste a whole
:cluster just for that case.
:
:-Bill
:
:--
:-Bill Paul            (212) 854-6020 | System Manager, Master of Unix-Fu

Great! Thanks for your help, Bill!

-Matt
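A minimal sketch of the wrap-around case Bill mentions, assuming a flat circular receive buffer and a linear destination buffer. ring_copy_out() is a hypothetical helper, not code from if_rl.c, and it uses memcpy() rather than mbufs; it only illustrates copying a packet in two pieces when it runs past the end of the ring.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy len bytes starting at offset off out of a circular buffer of
 * ring_size bytes into the linear buffer dst. A packet that runs past
 * the end of the ring continues at offset 0.
 */
static void ring_copy_out(const unsigned char *ring, size_t ring_size,
                          size_t off, unsigned char *dst, size_t len)
{
    assert(ring != NULL && dst != NULL);
    assert(off < ring_size && len <= ring_size);

    size_t first = ring_size - off; /* bytes available before the wrap */
    if (first > len)
        first = len;
    memcpy(dst, ring + off, first);
    memcpy(dst + first, ring, len - first); /* copies nothing if no wrap */
}
```

The second memcpy() is the part that is easy to get wrong: if the tail of a wrapped packet is not fetched from the start of the ring, the frame is truncated or dropped.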
Re: Woa! May have found something - 'rl' driver and small packets (was Re: Odd TCP glitches in new currents)
Of all the gin joints in all the towns in all the world, Matthew Dillon had to walk into mine and say:

> Heh heh. Sorry about this, I believe I have further information on another older problem.
>
> Bill, remember those ethernet lockups I was having with the 'xl' driver all those months ago that we could never track down?

And remember how I kept telling you that I could never duplicate the problem here?

> Well, they happen with the 'dc' driver too. But this time I'm not getting a complete lockup. The network actually continues to work well enough, well, just barely well enough, that I can still use it. slowly.
>
> It appears that the 'dc' driver continues to take receive interrupts (see the systat -vm snapshot at the end), but winds up not processing any of the packets. Except when 64 packets accumulate then suddenly all 64 get processed all at once! Then nothing again until the next 64 accumulate.

Uh. That's... strange. First of all, you haven't said if this is the same machine that experienced the problems with the xl driver. Second, the number 64 sticks out in this case. If you look at if_dc.c (uh... you did actually look at the code, right?), you'll see that dc_encap() will only ask for a "TX done" interrupt every 64 packets. Why? Well, reclaiming transmit buffers is a fairly unimportant task and I wanted to cut down on the number of interrupts that were generated, and when the tulip reaches the last descriptor in a transmit chain, it's supposed to generate a "no more buffers in TX ring" interrupt, which will also trigger a TX buffer reclamation (i.e. dc_txeof() will be called for either interrupt).

This behavior is controlled by the DC_TX_USE_TX_INTR flag, which is set for the PNIC II chip. I also use the DC_TX_POLL flag, which means that the chip is programmed to poll the TX ring and start transmission itself rather than having the driver write to the TX DMA start register. This means no register accesses on transmit, which is always nice. You can ask for a "TX done" interrupt to be scheduled for each transmitted packet by using the DC_TX_INTR_ALWAYS flag, which is currently only used for the PNIC I (82c168/82c169) because it blows goats.

Anyway. I *never* see this behavior on any of my test machines. I have a LinkSys LNE100TX V2.0 card with the 82c115 chip, as well as a couple of Macronix cards, a Davicom card, several Intel/DEC 21143 cards, ASIX cards and ADMtek cards, and PNIC I-based LinkSys cards. None of them exhibit this behavior when I test them.

> This netstat is on the machine with the 'dc' driver that locked up, when I ping it from another machine. The 'dc' driver still works--- barely. It doesn't process any packets until 64 have been received, then it processes them all at once. The transmit side appears to work fine and the receive side appears to get interrupts but does not appear to process incoming packets. Yet, obviously, the packets are being accumulated somewhere because I don't have any packet loss, just incredibly long and odd ping times.

No no no. You can't say "the receive side appears to get interrupts." That's speculation. You can stare at the machine and theorize about what appears to be happening all you want: it won't do a damn bit of good until you actually test your theory. You know that an "RX done" interrupt has been delivered if dc_rxeof() is called. So do something to verify that it's being called: stick a printf() in dc_rxeof() that tells you when it trips. Then duplicate the behavior and watch what happens.

> This occurs when I am running netscape on the same box over a remote X connection (read: Lots of packets going over the network plus lots of local PCI activity talking to the graphics card). Same problem occurs with different graphics adapters but I believe this same problem also occurred with the 'xl' driver on the card I had in before I put this card in.

Yes, but the one vital fact you keep leaving out is: does this always happen with the same machine. If so, then describe this machine. What PCI chipset does it have? And more to the point, what cards have you used in this machine that *didn't* exhibit this problem.

No wait, let me guess: Intel fxp. Right? <G>

I'm very puzzled by the fact that nobody else has *ever* reported any problem even remotely like this. Of course, with the level of feedback I get, it's possible that 50 people are having the same problem and simply never bothered to tell me.

> And watch what happens after I managed to 'ifconfig dc0 media auto', it goes back to normal... suddenly everything is working properly again.

And what happens if instead of auto, you use "ifconfig dc0 media 100baseTX mediaopt full-duplex" to lock the media setting down? Or what happens if you shut down and restart the X server?

-Bill
Re: Woa! May have found something - 'rl' driver and small packets (was Re: Odd TCP glitches in new currents)
: It appears that the 'dc' driver continues to take receive interrupts
: (see the systat -vm snapshot at the end), but winds up not processing
: any of the packets. Except when 64 packets accumulate then suddenly all
: 64 get processed all at once! Then nothing again until the next 64
: accumulate.
:
:Uh. That's... strange. First of all, you haven't said if this is the
:same machine that experienced the problems with the xl driver. Second,
:the number 64 sticks out in this case. If you look at if_dc.c (uh...
:you did actually look at the code, right?), you'll see that dc_encap()
:will only ask for a "TX done" interrupt every 64 packets. Why? Well,
:reclaiming transmit buffers is a fairly unimportant task and I wanted to

I'm trying to narrow down the area enough that I can mess with the driver myself and hopefully locate the problem, since it can't be reproduced easily. I was hoping the magic number 64 could be related to something - and you have apparently been able to do that, which gives me a place to start anyway.

netstat shows the trigger to be the reception of 64 packets rather than the transmission, though. Is there anything at all about the number 64 that could be related to the receiver?

I'm pretty sure that the box was getting receive interrupts because every time I sent a packet to it from the outside systat -vm showed a PCI interrupt for the network device. However 'netstat -in 1' did not show the statistics for the received packets until 64 had accumulated. It could be that the statistics are not being accumulated on a per-reception basis and that the receive packets are actually getting through, and that it's the transmit side which is broken. I don't know the code well enough yet to make the determination.

Previously it was not possible to add debugging code due to the amount of network traffic involved. With the new card, though, it should be possible to add conditional debugging code that could then be turned on with sysctl, because the network does not lock up completely (so I can still run 'sysctl' even if it takes 5 minutes to load over NFS).

:Yes, but the one vital fact you keep leaving out is: does this always
:happen with the same machine. If so, then describe this machine. What
:PCI chipset does it have? And more to the point, what cards have you
:used in this machine that *didn't* exhibit this problem.
:
:No wait, let me guess: Intel fxp. Right? <G>

I only have one machine with this configuration (diskless workstation, everything running over NFS, plus X Display), so yes. The problem only occurs on one machine. It started occurring mid-year, after I threw the card in that used the xl driver. The previous ethernet card used a 'de' driver I believe and didn't have the problem.

The only 'fxp' ethernets I have are in two of my test boxes - built into the motherboard. I don't think I have any PCI cards that use that driver.

The LinkSys card in my server has never locked up, and the card using the 'xl' driver in my other diskless test machine (which doesn't have an X display) has never locked up either.

: And watch what happens after I managed to 'ifconfig dc0 media auto',
: it goes back to normal... suddenly everything is working properly
: again.
:
:And what happens if instead of auto, you use "ifconfig dc0 media 100baseTX
:mediaopt full-duplex" to lock the media setting down? Or what happens if
:you shut down and restart the X server?
:
:-Bill

I'll try that next time the problem occurs but I doubt it will have any effect. Changing the duplex mode does not appear to reset the port whereas forcing the media to 'auto' does appear to reset the port. This is actually another problem (switches don't appear to pick up the duplex change if the port isn't reset), but not one I'm concerned with.

-Matt
Matthew Dillon
[EMAIL PROTECTED]
Re: Woa! May have found something - 'rl' driver and small packets (was Re: Odd TCP glitches in new currents)
Of all the gin joints in all the towns in all the world, Matthew Dillon had to walk into mine and say:

> I'm trying to narrow down the area enough that I can mess with the driver myself and hopefully locate the problem, since it can't be reproduced easily. I was hoping the magic number 64 could be related to something - and you have apparently been able to do that, which gives me a place to start anyway.
>
> netstat shows the trigger to be the reception of 64 packets rather than the transmission, though. Is there anything at all about the number 64 that could be related to the receiver?

64 is also the number of descriptors/buffers in the RX ring. When you fill up the RX ring, the chip is supposed to generate a 'no RX buffer available' interrupt. The driver will check the RX ring for packets when either an 'RX OK' or 'no RX buffers available' interrupt is delivered, but you should be getting an 'RX OK' interrupt on every received packet.

The datasheet for the PNIC II is at:

    http://www.freebsd.org/~wpaul/Macronix/PNIC_II.PDF

This is the datasheet LinkSys gave me when they first came out with the LNE100TX v2.0 board. It's very similar to the Macronix 98715A datasheet.

> I'm pretty sure that the box was getting receive interrupts because every time I sent a packet to it from the outside systat -vm showed a PCI interrupt for the network device. However 'netstat -in 1' did not show the statistics for the received packets until 64 had accumulated. It could be that the statistics are not being accumulated on a per-reception basis and that the receive packets are actually getting through, and that it's the transmit side which is broken. I don't know the code well enough yet to make the determination.

The dc_rxeof() routine is what increments ifp->if_ipackets, so if netstat -in doesn't show any change until after 64 packets have arrived, then it isn't getting the 'RX OK' interrupts. But I promise you that I have never seen a condition where 'RX OK' interrupts failed to arrive even though 'no RX buffer available' interrupts did. The interrupt handler re-enables interrupts just before it exits, so there should never be a case where interrupts are turned off and never turned back on again.

-Bill

> I'll try that next time the problem occurs but I doubt it will have any effect. Changing the duplex mode does not appear to reset the port whereas forcing the media to 'auto' does appear to reset the port. This is actually another problem (switches don't appear to pick up the duplex change if the port isn't reset), but not one I'm concerned with.

In general what you want to do is a) switch modes and b) reset the link so that the guy on the other side re-senses the media. However both sides can only agree on the duplex setting as the result of an NWAY autoneg session: if you manually select 100baseTX full duplex, the link partner can only sense the link speed (100Mbps as opposed to 10) but not the duplex mode. The rule is that if you don't have NWAY but can sense the link speed, you default to half duplex and let the operator manually fix things if necessary (that's what operators are for). Of course this only works if the switch has a management interface that allows you to configure things like that. Some don't, which can make your life tough.

I'm pretty sure the speed and duplex setting don't really have anything to do with this particular problem though. I was just wondering why renegotiating the media would have any effect. It's possible that dc_init() may be called in there somewhere, which could be resetting all of the driver state.

-Bill

--
=============================================================================
-Bill Paul            (212) 854-6020 | System Manager, Master of Unix-Fu
Work: [EMAIL PROTECTED]              | Center for Telecommunications Research
Home: [EMAIL PROTECTED]              | Columbia University, New York City
=============================================================================
 "It is not I who am crazy; it is I who am mad!" - Ren Hoek, "Space Madness"
=============================================================================
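A minimal sketch of the fallback rule Bill states above, assuming each end summarizes what it learned about its link partner in a small struct. The type and function names are hypothetical and do not correspond to any FreeBSD interface; the sketch only encodes "without NWAY, sense the speed and default to half duplex".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum duplex { HALF_DUPLEX, FULL_DUPLEX };

/* Hypothetical summary of what one end learned about its link partner. */
struct link_sense {
    bool nway_completed;      /* did an NWAY autonegotiation session finish? */
    bool partner_full_duplex; /* only meaningful if nway_completed is true */
    int  speed_mbps;          /* sensed link speed: 10 or 100 */
};

static enum duplex choose_duplex(const struct link_sense *ls)
{
    assert(ls != NULL && (ls->speed_mbps == 10 || ls->speed_mbps == 100));
    if (ls->nway_completed)
        return ls->partner_full_duplex ? FULL_DUPLEX : HALF_DUPLEX;
    /* No NWAY: speed can be sensed, duplex cannot, so assume half duplex. */
    return HALF_DUPLEX;
}
```

This is why forcing one end to 100baseTX full duplex while the other end autonegotiates tends to leave the two ends disagreeing: the forced end does not take part in NWAY, so the other end falls back to half duplex until an operator fixes it.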
Re: Odd TCP glitches in new currents
At 8:00 PM +1300 1999/12/22, Joe Abley wrote:

> Sorry if this is stating the obvious, but I've seen more than one clueful person bitten by this:
>
> hard-wire your duplex setting on your machine and also on the switch

If you check http://www.backplane.com/diablo/hard.html and scroll down to the "Network:" section (from the looks of things, written sometime back in 1997 or perhaps 1998), you'll see that Matt has been well aware of this problem for some time.

That's not to say that he might have forgotten his own advice on this issue, just that he is (or should be ;-) well aware of it.

That said, I have to admit that I have yet to be bitten by this problem, and I have not (yet) configured machines and switches at this site to forcibly select a particular media speed and plexicity.

--
Brad Knowles [EMAIL PROTECTED]
http://www.shub-internet.org/brad/
http://wwwkeys.pgp.net:11371/pks/lookup?op=get&search=0xE38CCEF1

Your mouse has moved. Windows NT must be restarted for the change to take effect. Reboot now? [ OK ]
Re: Odd TCP glitches in new currents
: clueful person bitten by this:
:
:hard-wire your duplex setting on your machine and also on the switch
:
: If you check http://www.backplane.com/diablo/hard.html and
:scroll down to the "Network:" section (from the looks of things,
:written sometime back in 1997 or perhaps 1998), you'll see that Matt
:has been well aware of this problem for some time.
:
: That's not to say that he might have forgotten his own advice on
:this issue, just that he is (or should be ;-) well aware of it.
:
: That said, I have to admit that I have yet to be bitten by this
:problem, and I have not (yet) configured machines and switches at this
:site to forcibly select a particular media speed and plexicity.
:
:--
:Brad Knowles [EMAIL PROTECTED] http://www.shub-internet.org/brad/
: http://wwwkeys.pgp.net:11371/pks/lookup?op=get&search=0xE38CCEF1

That's less of a problem as boards have started to conform better, but I definitely checked it - full-duplex worked fine (I could push over 15 MBits aggregate). My packet loss occurred with both half and full duplex.

I finally tracked it down. The loss is occurring in the link between two of my switches. The link goes across my apartment - about 60 feet of Cat-5 cable. That should be well within spec (you are supposed to be able to do 100 meters) but it causes packet loss. The switches autonegotiate full-duplex for the link (and I verified that it's actually running at full duplex), but that's where the packet loss occurs.

Very weird. I was finally able to fix it by dropping in a 10BaseT hub to force the switches to negotiate 10BaseT across the link. Maybe my cable is damaged or something. I'll run a second cable to see if that's the problem or whether.

The second switch is a LinkSys. I have a D-Link near my servers and a LinkSys near my workstation.

-Matt
Matthew Dillon
[EMAIL PROTECTED]
Re: Odd TCP glitches in new currents
On Dec 12, 1999 at 11:37:42AM -0800, Matthew Dillon wrote:

> I finally tracked it down. The loss is occurring in the link between two of my switches. The link goes across my apartment - about 60 feet of Cat-5 cable. That should be well within spec (you are supposed to be able to do 100 meters) but it causes packet loss. The switches autonegotiate full-duplex for the link (and I verified that it's actually running at full duplex), but that's where the packet loss occurs.
>
> Very weird. I was finally able to fix it by dropping in a 10BaseT hub to force the switches to negotiate 10BaseT across the link. Maybe my cable is damaged or something. I'll run a second cable to see if that's the problem or whether.
>
> The second switch is a LinkSys. I have a D-Link near my servers and a LinkSys near my workstation.

Another thing to keep in mind is that sometimes the switch is bad. I had a Netgear FS509 switch here that would eat packets transmitted through the GigE port under certain conditions. Netgear shipped me a new one, and I've been happy with it, until the same problem started happening again this morning. Perhaps in this case it's a bad fiber cable; I'll have to do some more testing to track it down.

--
Jonathan
Re: Odd TCP glitches in new currents
On Wed, 22 Dec 1999, Jonathan Lemon wrote: On Dec 12, 1999 at 11:37:42AM -0800, Matthew Dillon wrote: I had a Netgear FS509 switch here that would eat packets transmitted through the GigE port under certain conditions. Netgear shipped me a new one, and I've been happy with it, until the same problem started happening again this morning. There's some oddities in the 3.3 and 3.4 kernels as well -- I've actually nailed down the plexicity and speed on both the Accelar and my humble PC, and yet, I'm looking at weird TCP lockups from time to time. Mostly seems to be related to NFSv3, but will also happen when doing cvsup. There's no magic number of how many bytes are queued waiting to go out the interface. And it seems to be limited to specific connections, i.e. an NFS TCP connection can be jammed and yet I can be happily talking to cvsup3 doing an update. The interface in question is a NetGear: pn0: 82c169 PNIC 10/100BaseTX rev 0x20 int a irq 11 on pci0.9.0 What is odd is that the output error metric from netstat -in monotonically increases. Yes, I could post my configuration, etc., and I could go back to running -current, but I have a PhD to make progress on. And I'm willing to wait to try out the consolidated 2x040/PNIC driver when 4.0 finally rolls out. -scooter
Re: Odd TCP glitches in new currents
: :There's some oddities in the 3.3 and 3.4 kernels as well -- I've actually :nailed down the plexicity and speed on both the Accelar and my humble PC, :and yet, I'm looking at weird TCP lockups from time to time. : :Mostly seems to be related to NFSv3, but will also happen when doing :cvsup. There's no magic number of how many bytes are queued waiting to go :out the interface. And it seems to be limited to specific connections, :i.e. an NFS TCP connection can be jammed and yet I can be happily talking :to cvsup3 doing an update. If an NFS TCP connection is jammed you can easily determine whether the problem is NFS or the TCP stack by looking at the netstat -tn output. Run 'netstat -tn | fgrep tcp' on both the client and server and locate the NFS tcp connection in question, then see if there is traffic built up on it. If there is input traffic built up on either the client or server then NFS isn't reading the socket. But if there is output traffic built up (and no input traffic built up by the receiving end) then the problem is somewhere in the TCP stack. --- Well, my problem still persists -- it wasn't the link between my two switches. I am having the same problem across just about every tcp connection I make, whether it's over a local switch or a hub, and it doesn't seem to matter what kind of ethernet cards I have either. I am clueless as to what is going on. It seems to only happen with TCP connections. I wrote a UDP-based packet loss test program that sends UDP packets at varying rates and sizes in both directions and figures out where the loss is occurring, and I get nada. In fact, while it's running in the background I am *still* getting TCP stutters and tcpdump still shows one machine sending a packet that the other machine never gets! I have no friggin clue as to why TCP packets fail when UDP packets don't. I am beginning to seriously suspect a software problem.
-Matt
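The netstat check described in this message can be written down as a small decision rule. The following is only a sketch of that reasoning, assuming the Recv-Q and Send-Q byte counts have already been read off the `netstat -tn` output for the stuck connection on each host; the type and function names (`tcp_queues`, `classify_stall`, `stall_cause`) are made up for illustration and are not part of any FreeBSD interface.

```c
#include <assert.h>

/* Hypothetical labels for the outcomes described in the message. */
enum stall_cause { STALL_NONE, STALL_APPLICATION, STALL_TCP_STACK };

/* One endpoint's view of the connection, read by hand from the
 * Recv-Q and Send-Q columns of netstat (bytes queued). */
struct tcp_queues {
    long recv_q;   /* received but not yet read by the application */
    long send_q;   /* written but not yet acknowledged by the peer */
};

enum stall_cause classify_stall(struct tcp_queues client,
                                struct tcp_queues server)
{
    assert(client.recv_q >= 0 && client.send_q >= 0);
    assert(server.recv_q >= 0 && server.send_q >= 0);

    /* Input piling up on either side: the application (NFS here)
     * is not draining its socket. */
    if (client.recv_q > 0 || server.recv_q > 0)
        return STALL_APPLICATION;

    /* Output piling up while no receiver has unread input:
     * suspect the TCP stack, the driver, or the network. */
    if (client.send_q > 0 || server.send_q > 0)
        return STALL_TCP_STACK;

    return STALL_NONE;
}
```

For example, a client with 8192 bytes in Send-Q and nothing in Recv-Q on either host would come back as `STALL_TCP_STACK`, which is the case the message says points away from NFS itself.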
Re: Odd TCP glitches in new currents
make sure you test odd packet lengths. (as in "not even") there are occasional bugs that turn up with that sort of thing. On Wed, 22 Dec 1999, Matthew Dillon wrote: I am clueless as to what is going on. It seems to only happen with TCP connections. I wrote a UDP-based packet loss test program that sends UDP packets at varying rates and sizes in both directions and figures out where the loss is occurring, and I get nada. In fact, while it's running in the background I am *still* getting TCP stutters and tcpdump still shows one machine sending a packet that the other machine never gets! I have no friggin clue as to why TCP packets fail when UDP packets don't.
Re: Odd TCP glitches in new currents
: :make sure you test odd packet lengths. (as in "not even") :there are occasional bugs that turn up with that sort of thing. Yup. Way ahead of you. Hmm. usleep() seems to have a coarse granularity - only about 150 Hz. How annoying! I've put the linktest program up on my web site. This one adds a '-f' option that allows you to run the test as quickly as possible with up to N packets in transit to any given host at any given moment, default 1 (i.e. -f == -f1; try -f2, -f3...). http://apollo.backplane.com/FreeSrc/linktest-1.0.c -Matt
Re: Odd TCP glitches in new currents
On 1999-Dec-23 15:12:53 +1100, Matthew Dillon [EMAIL PROTECTED] wrote: In fact, while it's running in the background I am *still* getting TCP stutters and tcpdump still shows one machine sending a packet that the other machine never gets! I have no friggin clue as to why TCP packets fail when UDP packets don't. If the problem shows up at 10baseX speeds, you could try setting up a 10base2 network comprising the two test machines and a third machine as a sniffer. The thinwire will allow an independent sniffer without introducing any other hardware (like hubs) that might affect the results. If you suspect a s/w problem, have the sniffer run different s/w (a commercial LAN analyser if you have one available, otherwise maybe something non-FreeBSD). This should allow you to identify whether it's a transmit or receive problem. Peter
new linktest program avail (was Re: Odd TCP glitches in new currents)
A new version of linktest is up, much enhanced:

* fixes cpu use problems due to calling random() too much
* fixes usleep (we now use a pipe and select())

This version can really stuff the network.

http://www.backplane.com/FreeSrc/linktest-1.1.c

Running the following tests with my LinkSys 5-port switch (I haven't done this with the D-Link yet):

test3-test4 only

    test3# linktest -s 1200 -f8 test4
    test4# linktest -s 1200 -f8 test3

    11.9 MBytes/sec in both directions (23 MB/sec across the switch)

lander-apollo only

    lander# linktest -s 1200 -f8 apollo
    apollo# linktest -s 1200 -f8 lander

    4.5 MBytes/sec in both directions (10 MBytes/sec across the switch)

Now both tests running in parallel.

    test3# linktest -s 1200 -f8 test4
    test4# linktest -s 1200 -f8 test3
    lander# linktest -s 1200 -f8 apollo
    apollo# linktest -s 1200 -f8 lander

    7.9 MBytes/sec in both directions for test3-test4 (15 MB/sec)
    (6300 pps in both directions - 12000 pps across the switch)

    4.3 MBytes/sec in both directions for apollo-lander (8.6 MB/sec)
    (3500 pps in both directions - 7000 pps across the switch)

Interesting, eh? The test3-test4 test slowed down when I ran the apollo-lander test. Still, the switch performs plenty good enough for a tiny little 5-port item.

And... no packet loss in the test - except my TCP connection when I type is still getting packet loss. Weeeird.

-Matt
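As a rough cross-check of the figures in this message, packet rate and payload size can be multiplied out. This is a sketch only: the helper name is invented, it counts payload bytes alone, and it assumes decimal megabytes. With the 1200-byte payload from `-s 1200`, 6300 pps works out to about 7.56 MBytes/sec and 3500 pps to about 4.2 MBytes/sec, somewhat under the reported 7.9 and 4.3; the message does not say whether its figures include protocol headers or how the pps values were rounded.

```c
/* Payload throughput in decimal megabytes per second.
 * Headers and framing overhead are deliberately not counted. */
double payload_mbytes_per_sec(double packets_per_sec, double payload_bytes)
{
    return packets_per_sec * payload_bytes / 1e6;
}
```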
Woa! May have found something - 'rl' driver and small packets (was Re: Odd TCP glitches in new currents)
I'm adding Bill Paul to the list specifically.

Hmm. Now this is odd! I think I may have found something! All of my 'rl' driver cards fail this test:

    apollo# linktest -m 0.1:0.2 -s 16 -f16 lander
    lander# linktest -m 0.1:0.2 -s 16 -f16 apollo

They get about 1% packet loss with the test. Always. 100BaseTX full or half duplex, or 10BaseT -- I still get failures.

    rl0: RealTek 8139 10/100BaseTX irq 11 at device 3.0 on pci0
    rl0: Ethernet address: 00:50:ba:d1:89:05
    miibus0: MII bus on rl0

All of my 'fxp' driver cards succeed with the above test perfectly.

If I test an 'fxp' machine versus an 'rl' machine, linktest shows that the 'rl' cards can transmit small packets just fine but they lose out trying to receive them! (test3 has an 'fxp' driver, apollo has an 'rl' driver. Both are on the same switch!)

    test3(216.240.41.13)-apollo.backplane.com lost 79/89027
    test3(216.240.41.13)-apollo.backplane.com lost 80/89990
    test3(216.240.41.13)-apollo.backplane.com lost 81/90953
    test3(216.240.41.13)-apollo.backplane.com lost 82/92879
    test3(216.240.41.13)-apollo.backplane.com lost 83/93842
    test3(216.240.41.13)-apollo.backplane.com lost 84/94805
    test3(216.240.41.13)-apollo.backplane.com lost 85/96730

Methinks there is something going on with the 'rl' driver and/or the RealTek cards!

-Matt
Matthew Dillon [EMAIL PROTECTED]
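To turn the "lost N/M" lines above into a loss rate, something like the following could be used. It is a sketch, not part of linktest: the function names are invented, and the only assumption about the report format is the trailing "lost N/M" field visible in the sample lines. The last sample line (85 of 96730) comes out at roughly 0.09%.

```c
#include <stdio.h>
#include <string.h>

/* Parse the "lost N/M" field of a linktest report line.
 * Returns 1 and stores both counts on success, 0 if the field is absent. */
int parse_loss(const char *line, unsigned long *lost, unsigned long *total)
{
    const char *p = strstr(line, "lost ");
    if (p == NULL)
        return 0;
    return sscanf(p, "lost %lu/%lu", lost, total) == 2;
}

/* Loss as a percentage of packets counted; 0 if nothing was counted. */
double loss_percent(unsigned long lost, unsigned long total)
{
    return total == 0 ? 0.0 : 100.0 * (double)lost / (double)total;
}
```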
Re: Woa! May have found something - 'rl' driver and small packets (was Re: Odd TCP glitches in new currents)
Of all the gin joints in all the towns in all the world, Matthew Dillon had to walk into mine and say: I'm adding Bill Paul to the list specifically. Hmm. Now this is odd! I think I may have found something! All of my 'rl' driver cards fail this test: Oh sure. Bet the farm on the absolute worst NIC on the whole damn planet, why don't you. Why spend a few bucks on some nice 3c905B or 3c905C cards and beat up on them when you can buy ten RealTek cards for a dollar. About as reliable as a pair of tin cans and a piece of string, but gosh they sure are cheap. You'll have to wait until at least tomorrow before I can look into this, since I won't be able to do any debugging until I throw my one and only RealTek 8139 sample adapter into a machine and run some tests with it. rl0: RealTek 8139 10/100BaseTX irq 11 at device 3.0 on pci0 rl0: Ethernet address: 00:50:ba:d1:89:05 miibus0: MII bus on rl0 pciconf -l would be nice here too (to see the PCI revision code). Methinks there is something going on with the 'rl' driver and/or the RealTek cards! Gee, y'think? I don't suppose you ran any similar tests with, say, one of those LinkSys cards you had the other day. Or maybe a 3Com card. I mean, it's just a little anti-climactic, you know? I put all that blood, sweat and tears into if_xl and if_dc, but do people do stress tests with them to help me identify weaknesses? No, they pound on the house of cards that is if_rl. *sigh* -Bill -- = -Bill Paul(212) 854-6020 | System Manager, Master of Unix-Fu Work: [EMAIL PROTECTED] | Center for Telecommunications Research Home: [EMAIL PROTECTED] | Columbia University, New York City = "It is not I who am crazy; it is I who am mad!" - Ren Hoek, "Space Madness" = To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-current" in the body of the message
Re: Woa! May have found something - 'rl' driver and small packets (was Re: Odd TCP glitches in new currents)
On Dec 12, 1999 at 01:41:04AM -0500, Bill Paul wrote: Of all the gin joints in all the towns in all the world, Matthew Dillon had to walk into mine and say: I'm adding Bill Paul to the list specifically. Hmm. Now this is odd! I think I may have found something! All of my 'rl' driver cards fail this test: Oh sure. Bet the farm on the absolute worst NIC on the whole damn planet, why don't you. Sorry, but I can't resist quoting this: /* * The RealTek 8139 PCI NIC redefines the meaning of 'low end.' This is * probably the worst PCI ethernet controller ever made, with the possible * exception of the FEAST chip made by SMC. */ -- Jonathan
Re: Woa! May have found something - 'rl' driver and small packets (was Re: Odd TCP glitches in new currents)
Of all the gin joints in all the towns in all the world, Matthew Dillon had to walk into mine and say:

> (taking this off -current)
>
> apollo# linktest -s 51 -f1 lander       1-51 byte payload - errors
> lander# linktest -s 51 -f1 apollo
>
> apollo# linktest -s 52 -f1 lander       52+ byte payload - no errors
> lander# linktest -s 52 -f1 apollo
>
> You know, this kinda sounds like a jabber lockup. Bill, are you following the *MINIMUM* ethernet frame size specification for ethernet?

*sigh* No, I've been living on Mars since 1975 and we don't get IEEE spec documents up here.

Yes, I know there's a minimum frame length of 60 bytes. And the rl_encap() routine has the following code:

    /* Pad frames to at least 60 bytes. */
    if (m_head->m_pkthdr.len < RL_MIN_FRAMELEN) {
            m_head->m_pkthdr.len +=
                (RL_MIN_FRAMELEN - m_head->m_pkthdr.len);
            m_head->m_len = m_head->m_pkthdr.len;
    }

The RealTek doesn't autopad, so you have to handle it manually. You're only allowed one DMA buffer per transmission, so outbound packets are coalesced into a single mbuf cluster buffer in rl_encap(). A cluster buffer is always 2K, and frames can never be larger than 1514 bytes, so we know there'll always be plenty of room. In the case of frames less than 60 bytes, I just bump up m_pkthdr.len and m_len. This adjusted length gets used later in rl_start() when transmission is triggered.

Incidentally, you should be using tcpdump -n -e -i rl0 to measure the actual frame length of failing and succeeding transmissions: that's usually a much better indicator of what might be going wrong. You could calculate it from the data buffer length, but I suck at math; I find it's easier just to monitor the offending frames.

-Bill

-Bill Paul (212) 854-6020 | System Manager, Master of Unix-Fu
Work: [EMAIL PROTECTED] | Center for Telecommunications Research
Home: [EMAIL PROTECTED] | Columbia University, New York City
"It is not I who am crazy; it is I who am mad!" - Ren Hoek, "Space Madness"
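The padding step described in this message can be illustrated outside the kernel. What follows is a userland sketch, not the if_rl code: `pad_frame`, `MIN_FRAMELEN` and `MAX_FRAMELEN` are names invented here (the 60 and 1514 byte figures are the ones given in the message), and zero-filling the pad bytes is an added assumption; the quoted driver snippet only adjusts the lengths.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MIN_FRAMELEN 60    /* minimum frame length cited in the message */
#define MAX_FRAMELEN 1514  /* maximum frame length cited in the message */

/* Pad a frame held in buf (capacity cap) up to MIN_FRAMELEN, zero-filling
 * the added bytes. Returns the length to hand to the transmitter. */
size_t pad_frame(unsigned char *buf, size_t len, size_t cap)
{
    assert(buf != NULL);
    assert(len <= MAX_FRAMELEN && cap >= MIN_FRAMELEN && len <= cap);

    if (len < MIN_FRAMELEN) {
        /* Assumption: clear the pad so stale buffer bytes are not sent. */
        memset(buf + len, 0, MIN_FRAMELEN - len);
        len = MIN_FRAMELEN;
    }
    return len;
}
```

With a 2K buffer, as the message notes for mbuf clusters, a 42-byte frame comes back as 60 bytes and anything already at or above 60 bytes is left alone.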
Re: Odd TCP glitches in new currents
In message [EMAIL PROTECTED], Garrett Wollman writes: On Tue, 21 Dec 1999 12:50:50 -0800 (PST), Matthew Dillon [EMAIL PROTECTED] said: I have NOT tested this fix yet, so I don't know if it works, but I believe the problem is that on high speed networks the millisecond round trip delay is short enough that you can get 1-tick timeouts. Hmmm. I thought we agreed that 200 msec was the minimum reasonable RTO. That code doesn't seem to have made it in. I assume you mean 20 msec (= 2 tick @ 100 Hz ) ? 200 msec is enough to get halfway around the globe... -- Poul-Henning Kamp FreeBSD coreteam member [EMAIL PROTECTED] "Real hackers run -current on their laptop." FreeBSD -- It will take a long time before progress goes too far!
Re: Odd TCP glitches in new currents
On Tue, 21 Dec 1999 22:13:51 +0100, Poul-Henning Kamp [EMAIL PROTECTED] said: Hmmm. I thought we agreed that 200 msec was the minimum reasonable RTO. That code doesn't seem to have made it in. I assume you mean 20 msec (= 2 tick @ 100 Hz ) ? 200 msec is enough to get halfway around the globe... No, I mean 200 msec. If you make the RTO be any shorter than that, you'll slow-start every packet you send to a machine which is running delayed-ACK (i.e., almost everyone). The official standard RTO is I think 500 msec, but this might be too high. We have ``bad retransmit recovery'' which is supposed to detect some instances of this and disable slow-start in that case. -GAWollman -- Garrett A. Wollman | O Siem / We are all family / O Siem / We're all the same [EMAIL PROTECTED] | O Siem / The fires of freedom Opinions not those of| Dance in the burning flame MIT, LCS, CRS, or NSA| - Susan Aglukark and Chad Irschick To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-current" in the body of the message
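The tick arithmetic in this exchange (20 msec is 2 ticks at 100 Hz, 200 msec is 20 ticks) and the idea of a floor on the retransmit timeout can be sketched as follows. This is not the FreeBSD TCP timer code; the function names are hypothetical, and the floor value is simply whatever the caller passes in, such as the 200 msec argued for above.

```c
#include <assert.h>

/* Convert milliseconds to clock ticks, rounding up.
 * hz is the number of ticks per second. */
int ms_to_ticks(int ms, int hz)
{
    assert(ms >= 0 && hz > 0);
    return (int)(((long)ms * hz + 999) / 1000);
}

/* Clamp a computed retransmit timeout (in ticks) to a floor given in
 * milliseconds, so a very short measured RTT cannot produce a
 * one-tick timeout. */
int clamp_rto_ticks(int rto_ticks, int min_rto_ms, int hz)
{
    int floor_ticks = ms_to_ticks(min_rto_ms, hz);
    return rto_ticks < floor_ticks ? floor_ticks : rto_ticks;
}
```

At 100 Hz, a computed timeout of 1 tick with a 200 msec floor becomes 20 ticks, while a computed timeout of 50 ticks passes through unchanged.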
Re: Odd TCP glitches in new currents
:
:Hmmm. I thought we agreed that 200 msec was the minimum reasonable
:RTO. That code doesn't seem to have made it in.
:
:I assume you mean 20 msec (= 2 tick @ 100 Hz ) ? 200 msec is enough
:to get halfway around the globe...
:
:--
:Poul-Henning Kamp FreeBSD coreteam member

I just rebooted both machines and it didn't fix the problem. I did a packet trace on both boxes and there does indeed appear to be packet loss. I may have thrown out a red herring, sorry about that folks!

Something odd is going on, that's for sure. My packet trace shows that there was packet loss and that the retry did *NOT* occur immediately, so my premise goes out the window. I don't understand why my tcp connection has this sort of packet loss when all of my ping tests succeed 100%. I am totally baffled.

-Matt

(make window wide to view. Note: my xntpd's aren't synchronized well enough this soon after reboot so the two machines' times do not match very well).

machine #1 (did not receive packet sequence 20400)

13:12:28.730938 216.240.41.6.4006 > 216.240.41.2.22: P 20360:20380(20) ack 36645 win 17520 (DF) [tos 0x10]
13:12:28.756646 216.240.41.2.22 > 216.240.41.6.4006: P 36645:36665(20) ack 20380 win 17520 (DF)
13:12:28.794196 216.240.41.6.4006 > 216.240.41.2.22: P 20380:20400(20) ack 36665 win 17520 (DF) [tos 0x10]
13:12:28.816622 216.240.41.2.22 > 216.240.41.6.4006: P 36665:36685(20) ack 20400 win 17520 (DF)
13:12:28.962999 216.240.41.6.4006 > 216.240.41.2.22: P 20420:20440(20) ack 36685 win 17520 (DF) [tos 0x10]
13:12:28.963059 216.240.41.2.22 > 216.240.41.6.4006: . ack 20400 win 17520 (DF)
13:12:29.027297 216.240.41.6.4006 > 216.240.41.2.22: P 20440:20460(20) ack 36685 win 17520 (DF) [tos 0x10]

machine #2 (sent packet sequence 20400, then timed out later and resent)

13:12:27.743652 216.240.41.6.4006 > 216.240.41.2.22: . ack 36645 win 17520 (DF) [tos 0x10]
13:12:28.176252 216.240.41.6.4006 > 216.240.41.2.22: P 20360:20380(20) ack 36645 win 17520 (DF) [tos 0x10]
13:12:28.202078 216.240.41.2.22 > 216.240.41.6.4006: P 36645:36665(20) ack 20380 win 17520 (DF)
13:12:28.239533 216.240.41.6.4006 > 216.240.41.2.22: P 20380:20400(20) ack 36665 win 17520 (DF) [tos 0x10]
13:12:28.262069 216.240.41.2.22 > 216.240.41.6.4006: P 36665:36685(20) ack 20400 win 17520 (DF)
13:12:28.336525 216.240.41.6.4006 > 216.240.41.2.22: P 20400:20420(20) ack 36685 win 17520 (DF) [tos 0x10]
13:12:28.408355 216.240.41.6.4006 > 216.240.41.2.22: P 20420:20440(20) ack 36685 win 17520 (DF) [tos 0x10]
13:12:28.408512 216.240.41.2.22 > 216.240.41.6.4006: . ack 20400 win 17520 (DF)
13:12:28.472656 216.240.41.6.4006 > 216.240.41.2.22: P 20440:20460(20) ack 36685 win 17520 (DF) [tos 0x10]
13:12:28.472805 216.240.41.2.22 > 216.240.41.6.4006: . ack 20400 win 17520 (DF)
13:12:28.545556 216.240.41.6.4006 > 216.240.41.2.22: P 20460:20480(20) ack 36685 win 17520 (DF) [tos 0x10]
13:12:28.545703 216.240.41.2.22 > 216.240.41.6.4006: . ack 20400 win 17520 (DF)
13:12:28.545770 216.240.41.6.4006 > 216.240.41.2.22: P 20400:20480(80) ack 36685 win 17520 (DF) [tos 0x10]
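The comparison done by eye on the two traces above, spotting that machine #1 went from 20380:20400 straight to 20420:20440, can be expressed as a small gap check over the start:end pairs that tcpdump prints. This is an illustrative sketch with invented names (`segment`, `first_gap`); it looks at one direction of one connection and assumes the sequence numbers do not wrap within the trace.

```c
#include <stddef.h>

/* One data segment as printed by tcpdump: "start:end(len)". */
struct segment { unsigned long start, end; };

/* Walk segments in arrival order. Returns 1 and fills *gap with the
 * first missing byte range, or returns 0 if the data is contiguous.
 * next_expected is the sequence number the trace should begin at. */
int first_gap(const struct segment *segs, size_t n,
              unsigned long next_expected, struct segment *gap)
{
    for (size_t i = 0; i < n; i++) {
        if (segs[i].start > next_expected) {
            gap->start = next_expected;
            gap->end = segs[i].start;
            return 1;
        }
        if (segs[i].end > next_expected)
            next_expected = segs[i].end;
    }
    return 0;
}
```

Fed the data segments seen on machine #1, this reports the range 20400:20420 as missing; fed the segments sent by machine #2, it reports no gap, which matches the conclusion that the packet left one host and never arrived at the other.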
Re: Odd TCP glitches in new currents
On Tue, Dec 21, 1999 at 01:23:05PM -0800, Matthew Dillon wrote: I just rebooted both machines and it didn't fix the problem. I did a packet trace on both boxes and there does indeed appear to be packet loss. Sorry if this is stating the obvious, but I've seen more than one clueful person bitten by this: hard-wire your duplex setting on your machine and also on the switch Even if the switch and NIC appear to auto-negotiate a sensible duplex setting, I have seen many cases where they will forget for no apparent reason, usually in the middle of the night just after you have stepped onto a plane to fly to a different country. If one end thinks it is full-duplex and the other end thinks it is half, then late collisions can occur which will not result in MAC-layer retransmissions from the full-duplex-thinking station -- hence packet loss. Joe (possibly #2 in a series of red herrings :) -- Ua lawa küpono ka hakahaka pä o këia pä malule To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-current" in the body of the message