UNCLASSIFIED

This sounds like the TCP window scaling problem of days gone by, but I can't be sure.
Has an upstream router been upgraded, downgraded, or improperly configured lately?

Not sure what to suggest other than:

    sysctl net.ipv4.tcp_window_scaling=0

which disables window scaling.

Greg.

From: [email protected] [mailto:[email protected]] On Behalf Of Andrew Hume
Sent: Thursday, 18 October 2012 3:10 PM
To: Nathan Hruby
Cc: [email protected]
Subject: Re: [lopsa-tech] linux and networking

i apologise for not being very clear. (it's been a long day.)

we have an 8-node cluster. each node is a modest dell 910 or somesuch; 128GB mem and 32 cores. each node also has 8 1Gbps NICs; most are rarely used but 2 are used a lot. technically, these occupy 4 interfaces as they are two bonded pairs (active-passive). each bonded pair is on one of two VPNs. one VPN goes out to the general intranet; the other is a VPN local to the cluster.

the local VPN is pounded on hard. i estimate 800 Mbps during peak hours. i have noticed no performance issues with the traffic on this VPN; it's all zeromq message traffic but i carefully monitor it for latency. the messages i send are typically 100+ bytes, and zeromq normally bundles several together for transmission.

on the external VPN, we have 80 inbound feeds (10 per node), typically around 23 Mbps each. what we notice is that these socket connections occasionally go dry, that is, data stops coming. using tcpdump and sniffers, we determined this was because the server starts sending window size 0 messages back to the source systems. in fact, a sniffer revealed the window size starting around 26K and then quite quickly dropping all the way down to 10, 8, 1 and then zero. at this point, the process receiving data over the socket exits and gets restarted a couple of minutes later. by then, the condition clears, the window size goes back up to 26K and all is well for 6-10 minutes, and then some other group of sockets fails. strangely, not every socket on a node fails; sometimes most, sometimes just a few, rarely all.

i take a window size of zero as definitive of tcp/ip stack congestion. but i freely admit to not knowing (nor wanting to know) about networking. which is why i ask for advice on this group.

thanks
andrew

On Oct 17, 2012, at 8:16 PM, Nathan Hruby wrote:

On Wed, Oct 17, 2012 at 5:22 PM, Andrew Hume <[email protected]> wrote:

screwed by linux again. sigh.

so apparently i am overloading my pathetic linux system with too much tcp/ip traffic. is there any way to detect this while (or before or after) it is happening? of course no error messages are emitted. but might there be some other thing buried away somewhere, like /proc?

If there are no messages emitted, how do you know it's overloaded with [network] traffic?

-n

--
-------------------------------------------
nathan hruby <[email protected]>
metaphysically wrinkle-free
-------------------------------------------

-----------------------
Andrew Hume
623-551-2845 (VO and best)
973-236-2014 (NJ)
[email protected]

IMPORTANT: This email remains the property of the Department of Defence and is subject to the jurisdiction of section 70 of the Crimes Act 1914. If you have received this email in error, you are requested to contact the sender and delete the email.
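
On the question in the thread of whether something in /proc can show this happening: the window a receiver advertises tracks the free space left in that socket's receive buffer, so a window shrinking to zero usually means the receiving process is not reading as fast as data arrives. One place this is visible on Linux is the rx_queue column of /proc/net/tcp. The following is only a rough sketch, not a diagnosis of the cluster described here. The function names and the 16384-byte threshold are made up for illustration, it covers IPv4 only (/proc/net/tcp6 is a separate file), and the address decoding assumes a little-endian host such as x86.

```python
import os
import socket
import struct

ESTABLISHED = "01"  # state code /proc/net/tcp uses for established connections


def decode_endpoint(hex_endpoint):
    """Turn '0A00000A:1F90' into ('10.0.0.10', 8080).

    Assumption: little-endian host, where the kernel prints the IPv4
    address bytes in reversed order. The port is plain hexadecimal.
    """
    hex_addr, hex_port = hex_endpoint.split(":")
    addr = socket.inet_ntoa(struct.pack("<I", int(hex_addr, 16)))
    return addr, int(hex_port, 16)


def parse_proc_net_tcp(text):
    """Parse the contents of /proc/net/tcp into a list of dicts."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a slot number such as '12:'; this skips
        # the column header and any blank lines.
        if len(fields) < 5 or not fields[0].endswith(":"):
            continue
        tx_hex, rx_hex = fields[4].split(":")
        entries.append({
            "local": decode_endpoint(fields[1]),
            "remote": decode_endpoint(fields[2]),
            "state": fields[3],
            "tx_queue": int(tx_hex, 16),  # bytes queued for sending
            "rx_queue": int(rx_hex, 16),  # bytes received but not yet read
        })
    return entries


def backlogged(entries, min_rx_bytes):
    """Established connections with at least min_rx_bytes waiting unread."""
    return [
        e for e in entries
        if e["state"] == ESTABLISHED and e["rx_queue"] >= min_rx_bytes
    ]


if __name__ == "__main__" and os.path.exists("/proc/net/tcp"):
    with open("/proc/net/tcp") as f:
        table = parse_proc_net_tcp(f.read())
    # 16384 is an arbitrary illustrative threshold, not a recommended value.
    for e in backlogged(table, 16384):
        print(e["local"], "<-", e["remote"], "unread bytes:", e["rx_queue"])
```

Run periodically (from cron or a loop), output like this could show whether the unread byte count on the feed sockets climbs in the minutes before a feed goes dry. If it does, that would point at the reader process rather than at the network path.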
_______________________________________________
Tech mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
http://lopsa.org/
