excuse my tardiness; i was on a conference bridge yesterday for 11 hours
helping resolve a sev 1 issue caused by said TCP 0 window messages.

there is way too much context for me to type here but let me relate the
highlights as a "lessons learned" missive.

our application gets a modest amount of data streamed at it (50MB/s)
from a set of servers on the other side of a firewall. the visible symptom
is that just a little while after opening a socket and sending data, the
sending side would stop sending data.
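
for anyone who has not seen a 0 window from the sending side, here is a rough
python sketch of the general behaviour. it only reproduces the symptom on
loopback with a receiver that deliberately never reads; it says nothing about
what the bad link did in our case. the function names (fill_until_stalled,
demo) are made up for illustration.

```python
import select
import socket


def fill_until_stalled(sender, chunk=b"x" * 65536):
    """Send on a non-blocking socket until the kernel refuses more data.

    Returns the number of bytes the kernel accepted before refusing.
    """
    sender.setblocking(False)
    total = 0
    while True:
        try:
            total += sender.send(chunk)
        except BlockingIOError:
            # Local send buffer is full; with a receiver that never reads,
            # this is what a closed receive window looks like to the sender.
            return total


def drain(receiver, expected, timeout=5.0):
    """Read up to `expected` bytes from the receiver; return bytes read."""
    receiver.settimeout(timeout)
    drained = 0
    try:
        while drained < expected:
            data = receiver.recv(1 << 16)
            if not data:
                break
            drained += len(data)
    except socket.timeout:
        pass
    return drained


def demo():
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    sender = socket.create_connection(listener.getsockname())
    receiver, _ = listener.accept()
    try:
        # The receiver deliberately does not call recv() yet.
        stalled_after = fill_until_stalled(sender)
        # Once the receiver reads, buffer space is freed and the flow resumes.
        drained = drain(receiver, stalled_after)
        # Wait (up to 5s) for the sender socket to become writable again.
        _, writable, _ = select.select([], [sender], [], 5.0)
        return stalled_after, drained, bool(writable)
    finally:
        sender.close()
        receiver.close()
        listener.close()


if __name__ == "__main__":
    stalled_after, drained, writable = demo()
    print("sender stalled after", stalled_after, "bytes")
    print("receiver drained", drained, "bytes; writable again:", writable)
```

the number of bytes accepted before the stall depends on the socket buffer
sizes on the machine, so it will differ from host to host.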

around 4am thursday morning, the above symptom kicked in
for all 80 sockets connecting 5 source servers to 8 destination servers
roughly simultaneously (within 2-4 minutes). we had changed nothing
in the servers' software.

throughout the day, as the sev 1 callout got escalated and different shifts
of people got called, i had to re-explain our theory: although the networking
folks could not find anything at all, save one link that occasionally had
some dropped packets (around 0.1-1%), it had to be an external thing (because
all 8 servers got it at the same time) and was most likely the network. we
ended up rebooting all the servers a few times and reloading all the app
software, to no avail.

because there was literally nothing else to do, the networking folks worked
on the link that was dropping packets and found it was a 4 port channel with
a couple of ports down and high utilisation. given they actually had a guy
on site, they had him go and check it physically. long story short, he was
able to replace a couple of broken GBICs and fix a miscabling and voila! the
link was now running at 4x the original speed. and no packet loss!!

and wouldn't you know it, approximately 10s later, no more TCP 0 window
messages and our data streaming started working. and has worked flawlessly
since. so despite the fact that "this minorly defective link couldn't
possibly" cause the problem, it apparently did. (although no-one could
explain the mechanism.)

On Sep 13, 2012, at 1:41 PM, John Stoffel wrote:

>>>>>> "Andrew" == Andrew Hume <and...@research.att.com> writes:
> 
> Andrew> to my surprise, i am getting networking weirdness on my machines.
> 
> Like what?  
> 
> Andrew> a sniffer revealed the dreaded "TCP 0 window" message, which
> Andrew> apparently means that the receiving system has run out of
> Andrew> something (TCP buffers?).
> 
> What do you see in 'dmesg' output?  
> 
> Andrew> now, i never expect much out of linux and i am always satisfied.
> 
> So just reboot, won't that fix the problem?  *snark*
> 
> Andrew> nevertheless, i would expect some indication that something
> Andrew> has run low or needs to be increased. what should i look for?
> Andrew> are there magic variables to set?  this is on Red Hat
> Andrew> enterprise 6.
> 
> So what are these system(s) doing and how is their network setup, etc?
> Give us more details on the hardware and what errors you're seeing.
> Otherwise you really don't give us enough information to help.
> 
> John


------------------
Andrew Hume  (best -> Telework) +1 623-551-2845
and...@research.att.com  (Work) +1 973-236-2014
AT&T Labs - Research; member of USENIX and LOPSA




_______________________________________________
Tech mailing list
Tech@lists.lopsa.org
https://lists.lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
 http://lopsa.org/