excuse my tardiness; i was on a conference bridge yesterday for 11 hours helping resolve a sev 1 issue caused by said TCP 0 window messages.
there is way too much context for me to type here but let me relate the highlights as a "lessons learned" missive.

our application gets a modest amount of data streamed at it (50 MB/s) from a set of servers on the other side of a firewall. the visible symptom is that just a little while after opening a socket and sending data, the sending side would stop sending data. around 4am thursday morning, the above symptom kicked in for all 80 sockets connecting 5 source servers to 8 destination servers roughly simultaneously (within 2-4 minutes). we had changed nothing in the servers' software.

throughout the day, as the sev 1 callout got escalated and different shifts of people got called, i had to re-explain our theory: although the networking folks could not find anything at all, save one link that occasionally had some dropped packets (around 0.1-1%), it had to be an external thing (because all 8 servers got it at the same time) and was most likely the network. we ended up rebooting all the servers a few times and reloading all the app software, to no avail.

because there was literally nothing else to do, the networking folks worked on the link that was dropping packets and found it was a 4-port channel with a couple of ports down and high utilisation. so given they actually had a guy on site, they had him go and check it physically. long story short, he was able to replace a couple of broken GBICs and fix a miscabling and voila! the link was now running at 4x the original speed. and no packet loss!!

and wouldn't you know it, approximately 10s later, no more TCP 0 window messages and our data streaming started working. and has worked flawlessly since. so despite the fact that "this minorly defective link couldn't possibly" cause the problem, it apparently did. (although no-one could explain the mechanism.)
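
for anyone who has not seen the symptom in isolation, here is a minimal sketch (python on loopback, not our application) of the textbook way a zero window arises: the receiver stops draining its socket, its buffer fills, it advertises a window of 0, and the sender stalls. the names `fill_until_stalled` and `quiet_secs` are made up for this illustration, and treating "unwritable for `quiet_secs`" as "stalled" is an assumption of the sketch. it does not explain what happened to us, where the trigger turned out to be the network.

```python
import select
import socket


def fill_until_stalled(chunk_size=65536, quiet_secs=1.0, max_bytes=1 << 30):
    """Send to a local peer that never reads.

    Returns (bytes_accepted, stalled). `stalled` is True once the sending
    socket has stayed unwritable for `quiet_secs` (an assumed threshold).
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    sender = socket.create_connection(listener.getsockname())
    receiver, _ = listener.accept()  # connection accepted, recv() never called
    sender.setblocking(False)

    chunk = b"x" * chunk_size
    sent = 0
    stalled = False
    try:
        while sent < max_bytes:
            # Wait until the kernel will take more data, or give up.
            _, writable, _ = select.select([], [sender], [], quiet_secs)
            if not writable:
                stalled = True  # both buffers full, nothing is draining
                break
            try:
                sent += sender.send(chunk)
            except BlockingIOError:
                continue  # buffer filled between select() and send()
    finally:
        sender.close()
        receiver.close()
        listener.close()
    return sent, stalled


if __name__ == "__main__":
    sent, stalled = fill_until_stalled()
    print(f"kernel accepted {sent} bytes, stalled={stalled}")
```

the byte count depends on the kernel's socket buffer settings, so it will differ from machine to machine. running a sniffer on the loopback interface while this executes should show the receiver advertising a zero window once its buffer is full.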

On Sep 13, 2012, at 1:41 PM, John Stoffel wrote:

>>>>>> "Andrew" == Andrew Hume <and...@research.att.com> writes:
>
> Andrew> to my surprise, i am getting networking weirdness on my machines.
>
> Like what?
>
> Andrew> a sniffer revealed the dreaded "TCP 0 window" message, which
> Andrew> apparently means that the receiving system has run out of
> Andrew> something (TCP buffers?).
>
> What do you see in 'dmesg' output?
>
> Andrew> now, i never expect much out of linux and i am always satisfied.
>
> So just reboot, won't that fix the problem? *snark*
>
> Andrew> nevertheless, i would expect some indication that something
> Andrew> has run low or needs to be increased. what should i look for?
> Andrew> are there magic variables to set? this is on Red Hat
> Andrew> enterprise 6.
>
> So what are these system(s) doing and how is their network setup, etc?
> Give us more details on the hardware and what errors you're seeing.
> Otherwise you really don't give us enough information to help.
>
> John

------------------
Andrew Hume (best -> Telework) +1 623-551-2845
and...@research.att.com (Work) +1 973-236-2014
AT&T Labs - Research; member of USENIX and LOPSA
_______________________________________________
Tech mailing list
Tech@lists.lopsa.org
https://lists.lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
http://lopsa.org/