Re: [lopsa-tech] linux and networking [SEC=UNCLASSIFIED]

Robinson, Greg Wed, 17 Oct 2012 22:20:01 -0700

UNCLASSIFIED

This sounds like the TCP window scaling problem of days gone by, but I
can't be sure.


 

Has an upstream router been upgraded, downgraded, or improperly configured
lately?

 

Not sure what to suggest other than:

 

sysctl net.ipv4.tcp_window_scaling=0

 

That disables TCP window scaling.
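
For anyone following along, here is a rough Python sketch of what window scaling changes. The 16-bit window field in each TCP header is shifted left by a scale factor negotiated in the SYN packets (RFC 1323, later updated by RFC 7323, caps the shift at 14), so a sniffer that missed the handshake may show the raw, unscaled number. The function name and the sample values are made up for illustration.

```python
MAX_SHIFT = 14  # largest window scale shift allowed by RFC 1323 / RFC 7323


def effective_window(raw_window: int, shift: int) -> int:
    """Return the receive window in bytes for a raw 16-bit field and a scale shift."""
    if not 0 <= raw_window <= 0xFFFF:
        raise ValueError("raw window must fit in 16 bits")
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("shift must be between 0 and 14")
    return raw_window << shift


if __name__ == "__main__":
    # Hypothetical values: a raw field of 26 with a shift of 10 is about 26K.
    for raw in (26, 10, 8, 1, 0):
        print(raw, effective_window(raw, 10))
```

With scaling disabled the shift is effectively 0, so the advertised window can never exceed 65535 bytes.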

 

Greg.

 

From: [email protected] [mailto:[email protected]]
On Behalf Of Andrew Hume
Sent: Thursday, 18 October 2012 3:10 PM
To: Nathan Hruby
Cc: [email protected]
Subject: Re: [lopsa-tech] linux and networking

 

i apologise for not being very clear.

(it's been a long day.)

 

we have an 8-node cluster.

each node is a modest dell 910 or somesuch; 128GB mem and 32 cores.

each node also has 8 1Gbps NICs; most are rarely used but 2 are used a
lot.

technically, these occupy 4 interfaces as they are two bonded pairs
(active-passive).

each bonded pair is on one of two VPNs.

one VPN goes out to the general intranet; the other is a VPN local to
the cluster.

 

the local VPN is pounded on hard. i estimate 800 Mbps during peak hours.

i have noticed no performance issues with the traffic on this VPN; it's
all zeromq message traffic, but i carefully monitor it for latency. the
messages i send are typically 100+ bytes, and zeromq normally bundles
several together for transmission.

 

on the external VPN, we have 80 inbound feeds (10 per node), typically
around 23 Mbps each.

what we notice is that these socket connections occasionally go dry;
that is, data stops coming.

using tcpdump and sniffers, we determined that this was because the server
starts sending window size 0 messages back to the source systems. in fact,
a sniffer revealed the window size starting around 26K and then quite
quickly dropping all the way down to 10, 8, 1 and then zero.
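
A receiver advertises a zero window when its socket receive buffer is full, which usually means the application is not reading as fast as data arrives. One place to look for that from the receiving host is the rx_queue column of /proc/net/tcp, which for established sockets is the number of bytes received but not yet read. The sketch below is only an illustration: the function names and the 16384-byte threshold are my own choices, and the address decoding assumes a little-endian host such as x86.

```python
import socket
import struct

ESTABLISHED = "01"  # state code for established sockets in /proc/net/tcp


def _decode(addr):
    """Decode 'HEXIP:HEXPORT' as written on a little-endian host (assumption)."""
    ip_hex, port_hex = addr.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return f"{ip}:{int(port_hex, 16)}"


def parse_proc_net_tcp(text):
    """Yield (local, remote, rx_queue_bytes) for each established IPv4 socket."""
    for line in text.splitlines()[1:]:  # the first line is the column header
        fields = line.split()
        if len(fields) < 5 or fields[3] != ESTABLISHED:
            continue
        _tx_hex, rx_hex = fields[4].split(":")
        yield _decode(fields[1]), _decode(fields[2]), int(rx_hex, 16)


def backed_up(text, threshold=16384):
    """Return sockets whose unread receive queue is at least threshold bytes."""
    return [row for row in parse_proc_net_tcp(text) if row[2] >= threshold]


if __name__ == "__main__":
    with open("/proc/net/tcp") as f:
        for local, remote, rx in backed_up(f.read()):
            print(f"{local} <- {remote}: {rx} bytes unread")
```

Run from cron or a loop every few seconds, something like this might show which sockets fill up before the window collapses.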

 

at this point, the process receiving data over the socket exits, and
gets restarted a couple of minutes later. by then, the condition has
cleared, the window size goes back up to 26K and all is well for 6-10
minutes, and then some other group of sockets fails. strangely, not every
socket on a node fails; sometimes most, sometimes just a few, rarely all.

 

i take a window size of zero as definitive of tcp/ip stack congestion.
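
One way to test that reading might be the cumulative TcpExt counters in /proc/net/netstat, which record events such as the kernel pruning or collapsing a socket's receive queue when the buffer runs short. The sketch below parses that file's paired header/value lines. Which counters exist varies by kernel version, so the WATCH list is an assumption, and the function names are hypothetical.

```python
# Counters related to receive-buffer pressure and listen-queue overflow.
# Availability depends on the kernel version (assumption).
WATCH = ("PruneCalled", "RcvPruned", "TCPRcvCollapsed",
         "ListenOverflows", "ListenDrops")


def parse_netstat(text):
    """Parse /proc/net/netstat text into {section: {counter: value}}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    stats = {}
    # The file alternates a line of counter names with a line of values.
    for names_line, values_line in zip(lines[0::2], lines[1::2]):
        section, names = names_line.split(":", 1)
        _, values = values_line.split(":", 1)
        stats[section] = dict(zip(names.split(), map(int, values.split())))
    return stats


def pressure_counters(text):
    """Return the watched TcpExt counters that are present in the text."""
    tcpext = parse_netstat(text).get("TcpExt", {})
    return {name: tcpext[name] for name in WATCH if name in tcpext}


if __name__ == "__main__":
    with open("/proc/net/netstat") as f:
        for name, value in pressure_counters(f.read()).items():
            print(f"{name:16} {value}")
```

The values count events since boot, so the useful signal is the difference between two readings taken either side of a stall.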

 

but i freely admit to not knowing (nor wanting to know) about
networking.

which is why i ask for advice on this group.

 

            thanks

                        andrew

 

On Oct 17, 2012, at 8:16 PM, Nathan Hruby wrote:

 

On Wed, Oct 17, 2012 at 5:22 PM, Andrew Hume <[email protected]>
wrote:

screwed by linux again. sigh.

         

        so apparently i am overloading my pathetic linux system with too
        much tcp/ip traffic.

        is there any way to detect this while (or before or after) it is
        happening?

        of course no error messages are emitted.

        but might there be some other thing buried away somewhere, like
        /proc?


If there are no messages emitted, how do you know it's overloaded with
[network] traffic?

-n
-- 
-------------------------------------------
nathan hruby <[email protected]>
metaphysically wrinkle-free
-------------------------------------------

 


-----------------------
Andrew Hume
623-551-2845 (VO and best)
973-236-2014 (NJ)
[email protected]



 

IMPORTANT: This email remains the property of the Department of Defence
and is subject to the jurisdiction of section 70 of the Crimes Act 1914.
If you have received this email in error, you are requested to contact
the sender and delete the email.

_______________________________________________
Tech mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/tech
This list provided by the League of Professional System Administrators
 http://lopsa.org/
