Since ping is ICMP, not TCP, you probably want to investigate a mix of TCP and 
CPU stats to see what is behind the slow pings. I’d guess you are getting 
network impacts beyond what the ping times are hinting at.  ICMP isn’t subject 
to retransmission, so your TCP situation could be far worse than ping latencies 
may suggest.
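
One way to look at the TCP side on Linux is to read the kernel's segment counters instead of ping times. This is a minimal sketch, assuming a Linux host that exposes /proc/net/snmp. The function names are made up for illustration, and what counts as a worrying ratio has to be judged against your own baseline.

```python
def parse_tcp_counters(snmp_text):
    """Return the 'Tcp:' counters from /proc/net/snmp-style text as a dict."""
    tcp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Tcp:")]
    if len(tcp_lines) < 2:
        raise ValueError("no Tcp: header/value pair found")
    header, values = tcp_lines[0], tcp_lines[1]
    # The first token on both lines is the literal "Tcp:", so skip it.
    return {name: int(value) for name, value in zip(header[1:], values[1:])}


def retransmit_ratio(counters):
    """Retransmitted segments relative to segments sent (cumulative since boot)."""
    out_segs = counters["OutSegs"]
    return counters["RetransSegs"] / out_segs if out_segs else 0.0


def read_counters(path="/proc/net/snmp"):
    """Convenience wrapper for a live Linux node (path is the usual location)."""
    with open(path) as handle:
        return parse_tcp_counters(handle.read())
```

Because the counters are cumulative, sampling read_counters() twice around an error window and comparing the differences says more than a single reading.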

From: "Hanauer, Arnulf, Vodacom South Africa (External)" 
<arnulf.hana...@vcontractor.co.za>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Thursday, February 13, 2020 at 2:06 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: RE: Connection reset by peer

Thanks to both Erick and Shaun for your responses.

Both of your explanations are plausible in my scenario. This is what I have 
done since, which seems to have improved the situation:



  1.  The cluster was very busy running repairs to sync the new replicas 
(about 350GB) in the new DC (Gossip was temporarily marking the source 
nodes down at different points in time)

  *   Disabled Reaper, stopped all validation/repairs

  2.  I removed the new replicas to stop any potential read_repair across the 
WAN

  *   I will recreate the replicas over the weekend during quiet time & run 
the repair to sync

  3.  The network ping response time was quite high, around 10-15 ms, at the 
times of the errors

  *   This dropped to under 1 ms later in the day when some jobs were rerun 
successfully

  4.  I will apply some of the recommended TCP_KEEPALIVE settings Shaun pointed 
me to
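
The specific keepalive values Shaun recommended are not quoted in this thread, so the tightened numbers below are placeholders only. As a rough sketch of what the three Linux sysctls (net.ipv4.tcp_keepalive_time, net.ipv4.tcp_keepalive_intvl, net.ipv4.tcp_keepalive_probes) mean in practice, the time for the kernel to give up on a silent, idle peer is approximately the idle time plus the probe interval times the probe count:

```python
def keepalive_detection_seconds(idle, interval, probes):
    """Approximate seconds before an idle, unresponsive peer is declared dead.

    idle     -> net.ipv4.tcp_keepalive_time
    interval -> net.ipv4.tcp_keepalive_intvl
    probes   -> net.ipv4.tcp_keepalive_probes
    """
    return idle + interval * probes


# Stock Linux defaults: 7200 s idle, 75 s between probes, 9 probes.
linux_defaults = keepalive_detection_seconds(7200, 75, 9)

# Hypothetical tighter values, NOT taken from this thread.
tightened = keepalive_detection_seconds(60, 10, 6)

print(linux_defaults)  # 7875 seconds, a little over two hours
print(tightened)       # 120 seconds
```

With the defaults, a connection silently dropped by a firewall or WAN device can sit unnoticed for over two hours, which is why these settings are often lowered.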



Last question: In your experience, how high can the latency (simple ping 
response times) go before it becomes a problem? (Obviously the lower the 
better, but is there some sort of cut-off or formula beyond which intermittent 
problems like the connection resets can be expected?)




Kind regards

Arnulf Hanauer



From: Erick Ramirez <erick.rami...@datastax.com>
Sent: Thursday, 13 February 2020 03:10
To: user@cassandra.apache.org
Subject: Re: Connection reset by peer

I generally see these exceptions when the cluster is overloaded. I think what's 
happening is that when the app/driver sends a read request, the coordinator 
takes a long time to respond because the nodes are busy serving other requests. 
The driver gives up (client-side timeout reached) and the socket is closed. 
Meanwhile, the coordinator eventually gets results from replicas and tries to 
send the response back to the app/driver but can't because the connection is no 
longer there. Does this scenario sound plausible for your cluster?
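
To illustrate how an abrupt close on one end surfaces as "Connection reset by peer" on the other, here is a small loopback sketch. It assumes a Unix-like host and uses SO_LINGER with a zero timeout to force an RST. That is a stand-in for a client giving up, not a claim about how the driver actually closes its sockets, and the function name is made up.

```python
import errno
import socket
import struct
import time


def provoke_reset():
    """Return the errno the server side sees after the client aborts."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    client = socket.create_connection(listener.getsockname())
    server_side, _ = listener.accept()

    # SO_LINGER with a zero timeout makes close() abort the connection with
    # an RST instead of the normal FIN handshake (assumption: this mimics a
    # client that abandons the request abruptly).
    client.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                      struct.pack("ii", 1, 0))
    client.close()
    time.sleep(0.2)  # give the RST time to arrive over loopback

    try:
        server_side.recv(1024)
        return None  # clean EOF or data: no reset observed
    except OSError as exc:
        return exc.errno
    finally:
        server_side.close()
        listener.close()
```

On such a host, provoke_reset() should return errno.ECONNRESET, the same error that appears in the Netty read path in the log below.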


Erick Ramirez  |  Developer Relations

erick.rami...@datastax.com | datastax.com


On Wed, 12 Feb 2020 at 21:13, Hanauer, Arnulf, Vodacom South Africa (External) 
<arnulf.hana...@vcontractor.co.za> wrote:
Hi Cassandra folks,

We are getting a lot of these errors and transactions are timing out, and I was 
wondering whether this can be caused by Cassandra itself or whether it is 
purely a Linux network issue. The client job reports the Cassandra node as down 
after this occurs, but I suspect that is due to the connection failure. I need 
some clarification on where to look for a solution.


INFO  [epollEventLoopGroup-2-10] 2020-02-12 11:53:42,748 Message.java:623 - 
Unexpected exception during request; channel = [id: 0x8a3e6831, 
L:/10.132.65.152:9042 - R:/10.132.11.15:48020]
io.netty.channel.unix.Errors$NativeIoException: syscall:read(...)() failed: 
Connection reset by peer
        at io.netty.channel.unix.FileDescriptor.readAddress(...)(Unknown 
Source) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]

INFO  [epollEventLoopGroup-2-15] 2020-02-12 11:42:46,871 Message.java:623 - 
Unexpected exception during request; channel = [id: 0xa071f1c8, 
L:/10.132.65.152:9042 - R:/10.132.11.15:45134]
io.netty.channel.unix.Errors$NativeIoException: syscall:read(...)() failed: 
Connection reset by peer
        at io.netty.channel.unix.FileDescriptor.readAddress(...)(Unknown 
Source) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]


Source and Destination IP addresses are in the same DC (LAN).

I did recycle all the Cassandra services on all the nodes in both clusters but 
the problem remains.

The only recent change was adding replicas in the second DC for the keyspace 
that is being written to when these messages occur (I have not yet had a 
chance to run a full repair to sync the replicas).


FYI:
Cassandra 3.11.2
2 DCs with 5 nodes each


Kind regards
Arnulf Hanauer









"This e-mail is sent on the Terms and Conditions that can be accessed by 
Clicking on this link 
https://webmail.vodacom.co.za/tc/default.html"
