Re: Repair Hangs while requesting Merkle Trees

2015-11-29 Thread Anuj Wadehra
Hi All,

I am summarizing the setup, problem & key observations till now:

Setup: Cassandra 2.0.14. 2 DCs with 3 nodes each, connected via a 10 Gbps VPN. We
run repair with the -par and -pr options.
Problem: Repair hangs. Merkle tree responses are not received from one or more
nodes in the remote DC.

Observations till now:
1. Repair hangs intermittently on one node of DC2. Only on one occasion did
repair hang on another node in DC2 too.
2. Mostly, the node from which the Merkle tree was not received does NOT have any
message "Sending completed merkle tree .." in its logs.
3. Often hinted handoffs get triggered across DCs, and hint replays time out.
4. Many times, when repair is run after a long time, it FAILS initially. But if
we restart Cassandra and re-run repair, it SUCCEEDS.

Logs: DEBUG logs Attached.

Observations from Log:
1. When we started repair on 10.X.15.115, we got error
messages "error writing to /X.X.X.X
java.io.IOException: Connection timed out" for 2 nodes in the remote DC:
10.X.14.113 and 10.X.14.111. Merkle trees were received from these 2 nodes.

2. The Merkle tree response was not received from the 3rd node in the remote DC:
10.X.14.115 (for which no error occurred).

3. Hinted handoff started for the 3rd node (10.X.14.115), but hint replay timed out.

If it's a network issue, then why does the issue occur only in DC2, and mostly
on one node?

Thanks
Anuj
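
Observation 2 above can be checked mechanically across nodes. The following is a minimal sketch in Python, assuming each node's log has already been read into a list of lines. Only the substring "Sending completed merkle tree" comes from this thread; the function names and the shape of the input are hypothetical, and the full log line format is not shown here.

```python
from typing import Dict, Iterable, List

# Substring quoted in this thread; the rest of the log line is not known here.
MARKER = "Sending completed merkle tree"


def count_merkle_responses(logs_by_node: Dict[str, Iterable[str]]) -> Dict[str, int]:
    """Count, per node, the log lines that contain MARKER."""
    return {
        node: sum(1 for line in lines if MARKER in line)
        for node, lines in logs_by_node.items()
    }


def silent_nodes(logs_by_node: Dict[str, Iterable[str]]) -> List[str]:
    """Return the nodes whose logs never mention MARKER, sorted by name."""
    counts = count_merkle_responses(logs_by_node)
    return sorted(node for node, n in counts.items() if n == 0)
```

Calling silent_nodes() with one entry per remote-DC node gives the nodes that never logged a completed tree during the repair window, which is the set worth comparing against the nodes the coordinator is still waiting on.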



Re: Repair Hangs while requesting Merkle Trees

2015-11-29 Thread Anuj Wadehra
Yes. I think you are correct: the problem might have been resolved by the
Cassandra restart rather than by increasing the request timeout.

We are NOT on EC2. We have 2 interfaces on each node: one private and one
public.
We have a strange configuration, and we need to correct it as per the
recommendation at
https://docs.datastax.com/en/cassandra/2.0/cassandra/configuration/configMultiNetworks.html

AS-IS config:
We use broadcast_address = listen_address = PUBLIC IP address.
In seeds, we put the PUBLIC IP of the other nodes but the private IP for the local node.
There were some issues when we tried to access the local node via its public IP.


Thanks
Anuj
 


Re: Repair Hangs while requesting Merkle Trees

2015-11-29 Thread Anuj Wadehra
Please find attached the netstat -t -as output for the node on which repair hung
and for the node which never got the Merkle tree request.
Thanks
Anuj
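
The periodic capture suggested earlier in this thread (every 60 seconds for about 15 minutes) could be scripted roughly as follows. This is only a sketch: the function names are hypothetical, and it assumes a Linux host where the netstat command from net-tools is installed. The runner and sleep arguments exist so the loop can be exercised without actually waiting.

```python
import subprocess
import time
from typing import Callable, List


def run_netstat() -> str:
    """Run `netstat -tas` (the flags mentioned in this thread) and return its output."""
    result = subprocess.run(
        ["netstat", "-tas"], capture_output=True, text=True, check=False
    )
    return result.stdout


def sample(
    runner: Callable[[], str] = run_netstat,
    samples: int = 15,
    interval_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Collect `samples` snapshots, pausing `interval_s` seconds between them."""
    snapshots = []
    for i in range(samples):
        snapshots.append(runner())
        if i < samples - 1:  # no need to wait after the last snapshot
            sleep(interval_s)
    return snapshots
```

Running sample() with the defaults on both the hung node and the silent node while repair is in progress gives two series of snapshots that can be compared side by side.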
 



Re: Repair Hangs while requesting Merkle Trees

2015-11-23 Thread Anuj Wadehra
Any comments on the ESTABLISHED connections that exist at only one end?

Moreover, inter_dc_tcp_nodelay is false. Could this be why latency between the
two DCs is higher and repair messages are getting dropped?

Can increasing request_timeout_in_ms deal with the latency issue?

I see some hinted handoffs being triggered for cross-DC nodes, and hint replays
are timing out. Is that an indication of a network issue?

I am getting in touch with the network team to capture netstat and tcpdump
output too.

Thanks
Anuj




Re: Repair Hangs while requesting Merkle Trees

2015-11-23 Thread Paulo Motta
The issue might be related to the ESTABLISHED connections that exist at only
one end. I don't think it is related to the inter_dc_tcp_nodelay or
request_timeout_in_ms options. Did you restart the process when you changed
the request_timeout_in_ms option? The restart, and not the option change,
might be why the problem got fixed.

This seems like a network issue or a misconfiguration of this specific node.
Are you using EC2? Is listen_address == broadcast_address? Are all nodes
using the same configuration? Which Java version are you using?

You may want to enable TRACE logging on OutboundTcpConnection and
IncomingTcpConnection and compare the output of healthy nodes with that of the
faulty node.


Re: Repair Hangs while requesting Merkle Trees

2015-11-17 Thread Anuj Wadehra
Thanks Bryan !!


The connection is in ESTABLISHED state on one end and completely missing at the
other end (in the other DC).


Yes, we can revisit TCP tuning. But the problem is node-specific, so I am not
sure whether tuning is the culprit.

Thanks

Anuj

From: "Bryan Cheng" <br...@blockcypher.com>
Date: Wed, 18 Nov, 2015 at 2:04 am
Subject: Re: Repair Hangs while requesting Merkle Trees

Ah OK, I might have misunderstood you. Streaming sockets should not be in play
during merkle tree generation (validation compaction). They may come into play
during merkle tree exchange; that I'm not sure about. You can read a bit more
here: https://issues.apache.org/jira/browse/CASSANDRA-8611.


Regardless, you should have streaming_socket_timeout_in_ms set. 1 hr is usually
a good conservative estimate, but you can go much lower safely.


What state is the connection in that only shows on one side? Is it ESTABLISHED,
or something like CLOSE_WAIT?


Here's a good place to start for tuning, though it doesn't have as much about
network tuning:
https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html

More generally, TCP tuning usually revolves around a balance between latency and
bandwidth. Over long connections (we're talking 10s of ms, instead of the sub-1ms
you usually see in a good DC network), your expectations will shift greatly.
Stuff like NODELAY on TCP is very nice for cutting your latencies when you're
inside a DC, but it will generate lots of small packets that will hurt your
bandwidth over longer connections due to the need to wait for acks.
otc_coalescing_strategy is in a similar vein, bundling together nearby messages
to trade latency for throughput. You'll also probably want to tune your TCP
buffers and window sizes, since they determine how much data can be in flight
between acknowledgements, and the default size is pitiful for any decent network
size. Google around for TCP tuning/buffer tuning and you should find some good
resources.
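
To put a rough number on "how much data can be in flight", the usual rule of thumb is the bandwidth-delay product (link bandwidth multiplied by round-trip time). A small sketch, assuming the 10 Gbps link mentioned in this thread and an illustrative round trip of 30 ms; the actual round-trip time between the two DCs is not stated anywhere in the thread, and the function name is hypothetical:

```python
def bdp_bytes(bandwidth_bits_per_s: int, rtt_ms: int) -> int:
    """Bandwidth-delay product in bytes: data in flight during one round trip."""
    # bits/s * ms -> divide by 1000 for seconds and by 8 for bytes.
    return bandwidth_bits_per_s * rtt_ms // 8000


LINK_BITS_PER_S = 10_000_000_000  # 10 Gbps, as described in this thread

cross_dc = bdp_bytes(LINK_BITS_PER_S, 30)  # assumed 30 ms round trip
intra_dc = bdp_bytes(LINK_BITS_PER_S, 1)   # 1 ms as an upper bound for "sub 1ms"
```

Under these assumptions a single connection would need about 37.5 MB of buffer to keep the cross-DC link full, against about 1.25 MB at 1 ms, which is why buffer and window defaults that are adequate inside one DC can throttle cross-DC traffic.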


On Mon, Nov 16, 2015 at 5:23 PM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:

Hi Bryan,


Thanks for the reply !!

I didn't mean streaming_socket_timeout_in_ms. I meant that when you run netstat
(Linux command) on node A in DC1, you will notice that there is a connection in
ESTABLISHED state with node B in DC2. But when you run netstat on node B, you
won't find any connection with node A. Are such connections expected across DCs?
Is it a problem?
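
The one-sided connections described above can be found by comparing the two netstat captures. Below is a minimal sketch, assuming both captures were taken with numeric addresses (for example `netstat -tan`) and that no address translation sits between the nodes; with separate public and private interfaces, the addresses each side reports may not match and the comparison would need adjusting. The function names are hypothetical.

```python
from typing import Set, Tuple

Conn = Tuple[str, str]  # (local "ip:port", foreign "ip:port")


def established(netstat_output: str) -> Set[Conn]:
    """Extract ESTABLISHED TCP connections from `netstat -tan` style text."""
    conns = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: Proto Recv-Q Send-Q Local Foreign State
        if (
            len(fields) >= 6
            and fields[0].startswith("tcp")
            and fields[5] == "ESTABLISHED"
        ):
            conns.add((fields[3], fields[4]))
    return conns


def one_sided(output_a: str, output_b: str, ip_b: str) -> Set[Conn]:
    """Connections node A holds to node B that node B does not report back."""
    seen_on_b = established(output_b)
    return {
        (local, foreign)
        for local, foreign in established(output_a)
        if foreign.rsplit(":", 1)[0] == ip_b and (foreign, local) not in seen_on_b
    }
```

Any pair returned by one_sided() is a connection that node A still believes is open while node B has no record of it, which is the situation described in this message.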


We haven't set streaming_socket_timeout_in_ms, which I know must be set. But I am
not sure whether setting this property has any impact on merkle tree requests.
I thought it only applies to data streaming, when some mismatch is found and data
needs to be streamed. Please confirm. What value do you use for the streaming
socket timeout?


Moreover, if the socket timeout were the issue, it should happen on other nodes
too. Repair fails on just one node, as the merkle tree request is getting lost
and is not transmitted to one or more nodes in the remote DC.


I am not sure about the exact distance. But they are connected with a very
high-speed 10 Gbps link.


When you say different TCP stack tuning, do you have any document/blog/link
describing recommendations for a multi-DC Cassandra setup? Can you elaborate on
which settings need to be different?



Thanks

Anuj










Re: Repair Hangs while requesting Merkle Trees

2015-11-16 Thread Bryan Cheng
Hi Anuj,

Did you mean streaming_socket_timeout_in_ms? If not, then you definitely
want that set. Even the best network connections will break occasionally,
and in Cassandra < 2.1.10 (I believe) this would leave those connections
hanging indefinitely on one end.

How far away are your two DCs from a network perspective, out of
curiosity? You'll almost certainly be doing different TCP stack tuning for
cross-DC, notably your buffer sizes, window params, and Cassandra-specific
stuff like otc_coalescing_strategy, inter_dc_tcp_nodelay, etc.

On Sat, Nov 14, 2015 at 10:35 AM, Anuj Wadehra <anujw_2...@yahoo.co.in>
wrote:

> One more observation. We observed that there are a few TCP connections which
> one node shows as ESTABLISHED, but when we go to the node at the other end, the
> connection is not there. They are called "phantom" connections, I guess. Can
> this be a possible cause?
>
> Thanks
> Anuj
>
> *From*:"Anuj Wadehra" <anujw_2...@yahoo.co.in>
> *Date*:Sat, 14 Nov, 2015 at 11:59 pm
>
> *Subject*:Re: Repair Hangs while requesting Merkle Trees
>
> Thanks Daemeon !!
>
> I will capture the output of netstat and share it in the next few days. We
> were also thinking of taking tcpdumps. If it's a network issue and increasing
> the request timeout worked, I am not sure how Cassandra is dropping messages
> based on timeout. Repair messages are non-droppable and are not supposed to
> time out.
>
> 2 of the 3 nodes in the DC are able to complete repair without any issue.
> Just one node is problematic.
>
> I also observed frequent messages in the logs of other nodes which say that
> hint replay timed out, and the node where hints were being replayed is
> always a remote DC node. Is it related somehow?
>
> Thanks
> Anuj
>
> *From*:"daemeon reiydelle" <daeme...@gmail.com>
> *Date*:Thu, 12 Nov, 2015 at 10:34 am
> *Subject*:Re: Repair Hangs while requesting Merkle Trees
>
>
> Have you checked the network statistics on that machine (netstat -tas)
> while attempting to repair? If netstat shows ANY issues, you have a
> problem. Could you put the command in a loop running every 60 seconds for
> maybe 15 minutes and post back?
>
> Out of curiosity, how many remote DC nodes are getting successfully
> repaired?
>
>
>
>
> On Wed, Nov 11, 2015 at 1:06 PM, Anuj Wadehra <anujw_2...@yahoo.co.in>
> wrote:
>
>> Hi,
>>
>> we are using 2.0.14. We have 2 DCs at remote locations with 10GBps
>> connectivity.We are able to complete repair (-par -pr) on 5 nodes. On only
>> one node in DC2, we are unable to complete repair as it always hangs. Node
>> sends Merkle Tree requests, but one or more nodes in DC1 (remote) never
>> show that they sent the merkle tree reply to requesting node.
>> Repair hangs infinitely.
>>
>> After increasing request_timeout_in_ms on affected node, we were able to
>> successfully run repair on one of the two occassions.
>>
>> Any comments, why this is happening on just one node? In
>> OutboundTcpConnection.java,  when isTimeOut method always returns false for
>> non-droppable verb such as Merkle Tree Request(verb=REPAIR_MESSAGE),why
>> increasing request timeout solved problem on one occasion ?
>>
>>
>> Thanks
>> Anuj Wadehra
>>
>>
>>
>> On Thursday, 12 November 2015 2:35 AM, Anuj Wadehra <
>> anujw_2...@yahoo.co.in> wrote:
>>
>>
>> Hi,
>>
>> We have 2 DCs at remote locations with 10GBps connectivity.We are able to
>> complete repair (-par -pr) on 5 nodes. On only one node in DC2, we are
>> unable to complete repair as it always hangs. Node sends Merkle Tree
>> requests, but one or more nodes in DC1 (remote) never show that they sent
>> the merkle tree reply to requesting node.
>> Repair hangs infinitely.
>>
>> After increasing request_timeout_in_ms on affected node, we were able to
>> successfully run repair on one of the two occassions.
>>
>> Any comments, why this is happening on just one node? In
>> OutboundTcpConnection.java,  when isTimeOut method always returns false for
>> non-droppable verb such as Merkle Tree Request(verb=REPAIR_MESSAGE),why
>> increasing request timeout solved problem on one occasion ?
>>
>>
>> Thanks
>> Anuj Wadehra
>>
>>
>>
>


Re: Repair Hangs while requesting Merkle Trees

2015-11-16 Thread Anuj Wadehra
Hi Bryan,


Thanks for the reply !!

I didn't mean streaming_socket_timeout_in_ms. I meant that when you run netstat 
(the Linux command) on node A in DC1, you see a connection in the ESTABLISHED 
state with node B in DC2, but when you run netstat on node B, you won't find 
any connection with node A. We see such connections across DCs. Is that a 
problem?
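
The comparison described above could be automated roughly as follows. This is 
a minimal Python sketch, assuming the `netstat -tn` output of both nodes has 
been saved as text and uses the usual column layout (proto, recv-q, send-q, 
local address, foreign address, state). The function names `established` and 
`one_sided` are made up for illustration and are not part of any Cassandra 
tooling.

```python
def established(netstat_output):
    """Return the set of (local, remote) endpoint pairs in ESTABLISHED state."""
    pairs = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed columns: proto recv-q send-q local remote state
        if (len(fields) >= 6 and fields[0].startswith("tcp")
                and fields[5] == "ESTABLISHED"):
            pairs.add((fields[3], fields[4]))
    return pairs


def one_sided(output_a, output_b):
    """Connections node A reports as ESTABLISHED that node B does not mirror."""
    seen_on_b = established(output_b)
    return sorted(
        (local, remote)
        for local, remote in established(output_a)
        if (remote, local) not in seen_on_b
    )
```

Note that if the two nodes see each other under different addresses (NAT, or 
separate public and private interfaces), the endpoint pairs will not mirror 
exactly and would need to be mapped first.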


We haven't set streaming_socket_timeout_in_ms, which I know must be set. But I 
am not sure whether setting this property has any impact on Merkle tree 
requests. I thought it applies only to data streaming, when a mismatch is found 
and data needs to be streamed. Please confirm. What value do you use for the 
streaming socket timeout?


Moreover, if a socket timeout were the issue, it should happen on other nodes 
too. Repair fails on just one node, where the Merkle tree request is getting 
lost and is not transmitted to one or more nodes in the remote DC.


I am not sure about the exact distance, but they are connected by a very 
high-speed 10 Gbps link.


When you say different TCP stack tuning, do you have any document, blog or link 
describing recommendations for a multi-DC Cassandra setup? Can you elaborate 
on which settings need to be different?



Thanks

Anuj










From:"Bryan Cheng" <br...@blockcypher.com>
Date:Tue, 17 Nov, 2015 at 5:54 am
Subject:Re: Repair Hangs while requesting Merkle Trees

Hi Anuj,


Did you mean streaming_socket_timeout_in_ms? If not, then you definitely want 
that set. Even the best network connections will break occasionally, and in 
Cassandra < 2.1.10 (I believe) this would leave those connections hanging 
indefinitely on one end.


How far away are your two DC's from a network perspective, out of curiosity? 
You'll almost certainly be doing different TCP stack tuning for cross-DC, 
notably your buffer sizes, window params, cassandra-specific stuff like 
otc_coalescing_strategy, inter_dc_tcp_nodelay, etc.


On Sat, Nov 14, 2015 at 10:35 AM, Anuj Wadehra <anujw_2...@yahoo.co.in> wrote:

One more observation.We observed that there are few TCP connections which node 
shows as Established but when we go to node at other end,connection is not 
there. They are called "phantom" connections I guess. Can this be a possible 
cause?


Thanks

Anuj



Re: Repair Hangs while requesting Merkle Trees

2015-11-14 Thread Anuj Wadehra
Thanks Daemeon !!


I will capture the output of netstat and share it in the next few days. We were 
also thinking of taking TCP dumps. If it's a network issue and increasing the 
request timeout worked, I am not sure how Cassandra is dropping messages based 
on a timeout. Repair messages are non-droppable and not supposed to time out.


Two of the three nodes in the DC are able to complete repair without any issue. 
Just one node is problematic.


I also observed frequent messages in the logs of other nodes saying that hint 
replay timed out, and the node to which hints were being replayed is always a 
remote-DC node. Is it related somehow?


Thanks

Anuj







Re: Repair Hangs while requesting Merkle Trees

2015-11-14 Thread Anuj Wadehra
One more observation: there are a few TCP connections that one node shows as 
ESTABLISHED, but when we check the node at the other end, the connection is not 
there. I guess these are called "phantom" connections. Can this be a possible 
cause?


Thanks

Anuj








Re: Repair Hangs while requesting Merkle Trees

2015-11-11 Thread daemeon reiydelle
Have you checked the network statistics on that machine (netstat -tas)
while attempting to repair? If netstat shows ANY issues, you have a
problem. Could you put the command in a loop running every 60 seconds for
maybe 15 minutes and post back?
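
One possible way to script the sampling loop suggested above is sketched 
below in Python. It is only an illustration: the helper name `sample`, its 
parameters and the output format are assumptions, and a plain shell loop 
would do the same job.

```python
import subprocess
import time


def sample(command, interval_s=60, duration_s=15 * 60, sleep=time.sleep):
    """Run `command` once per interval for the given duration.

    Yields (timestamp, stdout) tuples so the caller can print or save them.
    """
    for i in range(duration_s // interval_s):
        if i:
            sleep(interval_s)  # wait between samples, not before the first
        result = subprocess.run(command, capture_output=True, text=True)
        yield time.strftime("%Y-%m-%d %H:%M:%S"), result.stdout
```

Iterating over `sample(["netstat", "-tas"])` while the repair is running and 
printing each timestamp and output would give 15 snapshots, one per minute, 
that can be compared for growing error or retransmission counters.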

Out of curiosity, how many remote DC nodes are getting successfully
repaired?



...
“Life should not be a journey to the grave with the intention of arriving
safely in a pretty and well preserved body, but rather to skid in broadside
in a cloud of smoke, thoroughly used up, totally worn out, and loudly
proclaiming “Wow! What a Ride!” - Hunter Thompson

Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872



Re: Repair Hangs while requesting Merkle Trees

2015-11-11 Thread Anuj Wadehra
Hi,
We are using 2.0.14. We have 2 DCs at remote locations with 10 Gbps 
connectivity. We are able to complete repair (-par -pr) on 5 nodes. On only one 
node in DC2, we are unable to complete repair, as it always hangs. The node 
sends Merkle tree requests, but one or more nodes in DC1 (remote) never show 
that they sent the Merkle tree reply to the requesting node.
Repair hangs indefinitely. 

After increasing request_timeout_in_ms on the affected node, we were able to 
run repair successfully on one of two occasions.

Any comments on why this is happening on just one node? In 
OutboundTcpConnection.java, the isTimeOut method always returns false for a 
non-droppable verb such as a Merkle tree request (verb=REPAIR_MESSAGE), so why 
did increasing the request timeout solve the problem on one occasion?

Thanks
Anuj Wadehra 




  

Repair Hangs while requesting Merkle Trees

2015-11-11 Thread Anuj Wadehra
Hi,
We have 2 DCs at remote locations with 10 Gbps connectivity. We are able to 
complete repair (-par -pr) on 5 nodes. On only one node in DC2, we are unable 
to complete repair, as it always hangs. The node sends Merkle tree requests, 
but one or more nodes in DC1 (remote) never show that they sent the Merkle tree 
reply to the requesting node.
Repair hangs indefinitely. 

After increasing request_timeout_in_ms on the affected node, we were able to 
run repair successfully on one of two occasions.

Any comments on why this is happening on just one node? In 
OutboundTcpConnection.java, the isTimeOut method always returns false for a 
non-droppable verb such as a Merkle tree request (verb=REPAIR_MESSAGE), so why 
did increasing the request timeout solve the problem on one occasion?
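
To make the question concrete, the behaviour described above can be modelled 
with the small Python sketch below. It is not the actual Cassandra Java code: 
the function name, its parameters and the contents of `DROPPABLE_VERBS` are 
illustrative assumptions. It only captures the claim that a non-droppable verb 
is never expired from the outbound queue, whatever the timeout value.

```python
# Illustrative subset only; the authoritative list is in Cassandra's source.
DROPPABLE_VERBS = {"READ", "MUTATION", "RANGE_SLICE"}


def is_timed_out(verb, enqueued_ms, now_ms, timeout_ms):
    """A queued message expires only if its verb is droppable and it has
    waited longer than the timeout."""
    if verb not in DROPPABLE_VERBS:
        return False  # e.g. REPAIR_MESSAGE is never expired from the queue
    return now_ms - enqueued_ms > timeout_ms
```

Under this model, changing the timeout cannot affect whether a REPAIR_MESSAGE 
is dropped, which is why the one successful run after raising 
request_timeout_in_ms is puzzling.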

Thanks
Anuj Wadehra