[jira] [Comment Edited] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints

2024-04-11 Thread Cameron Zemek (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17836370#comment-17836370
 ] 

Cameron Zemek edited comment on CASSANDRA-18845 at 4/11/24 10:18 PM:
-

I have reworked the [patch| [^CASSANDRA-18845-4_0_12.patch]] further so that it 
adds a new method instead of modifying the existing waitToSettle, which keeps 
the change to existing behavior minimal. The new method is called directly in 
MigrationCoordinator::awaitSchemaRequests to handle a bootstrapping node 
(nodes must be in the UP state to get the schema and stream sstables from 
them), and again just before enabling native transport.


was (Author: cam1982):
I have reworked the patch more so it a new method instead of modifying the 
existing waitToSettle. So it has the least change to any existing behavior. It 
directly called in MigrationCoordinator::awaitSchemaRequests to handle if node 
bootstrapping (since need nodes in UP state in order to get schema and stream 
sstables from). And just before enabling native transport. 
https://issues.apache.org/jira/secure/attachment/13068153/CASSANDRA-18845-4_0_12.patch

> Waiting for gossip to settle on live endpoints
> --
>
> Key: CASSANDRA-18845
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18845
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Cameron Zemek
>Priority: Normal
> Attachments: 18845-seperate.patch, CASSANDRA-18845-4_0_12.patch, 
> delay.log, example.log, image-2023-09-14-11-16-23-020.png, stream.log, 
> test1.log, test2.log, test3.log
>
>
> This is a follow-up to CASSANDRA-18543.
> Although that ticket added the ability to set 
> cassandra.gossip_settle_min_wait_ms, tuning it is tedious and error-prone. On 
> one node I observed a 79-second gap between waiting for gossip and the first 
> echo response indicating that a node is UP.
> The problem is that we do not want to start Native Transport until gossip 
> settles; otherwise queries can fail at consistency levels such as 
> LOCAL_QUORUM because the node thinks the replicas are still in the DOWN state.
> Instead of having to set gossip_settle_min_wait_ms, I am proposing that 
> (outside a single-node cluster) we wait for an UP message from another node 
> before considering gossip settled, e.g.:
> {code:java}
>             if (currentSize == epSize && currentLive == liveSize && liveSize > 1)
>             {
>                 logger.debug("Gossip looks settled.");
>                 numOkay++;
>             } {code}
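
The effect of the proposed liveSize > 1 condition can be sketched as a small standalone simulation. This is only a sketch under stated assumptions, not the code in Gossiper.waitToSettle or in any attached patch: the class SettleSketch, the method pollsUntilSettled, the requireLivePeer flag and the poll snapshots are hypothetical, and each snapshot is simply {known endpoints, live endpoints including this node}.

```java
final class SettleSketch {
    /**
     * Returns the 1-based poll at which gossip would be considered settled,
     * or -1 if it never settles within the given polls.
     * Each poll is {knownEndpoints, liveEndpoints} (hypothetical snapshot format).
     */
    static int pollsUntilSettled(int[][] polls, int successesRequired, boolean requireLivePeer) {
        int epSize = -1;
        int liveSize = -1;
        int numOkay = 0;
        for (int i = 0; i < polls.length; i++) {
            int currentSize = polls[i][0];
            int currentLive = polls[i][1];
            boolean stable = currentSize == epSize && currentLive == liveSize;
            // With requireLivePeer, a stable view only counts once another node is seen UP.
            if (stable && (!requireLivePeer || liveSize > 1))
                numOkay++;
            else
                numOkay = 0;
            epSize = currentSize;
            liveSize = currentLive;
            if (numOkay >= successesRequired)
                return i + 1;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Three known endpoints; the two peers are only seen UP from the fifth poll on.
        int[][] polls = { {3, 1}, {3, 1}, {3, 1}, {3, 1}, {3, 3}, {3, 3}, {3, 3}, {3, 3} };
        System.out.println(pollsUntilSettled(polls, 3, false));
        System.out.println(pollsUntilSettled(polls, 3, true));
    }
}
```

With these made-up snapshots, the check without the extra condition settles on poll 4, while every peer is still marked DOWN. With liveSize > 1 required, it settles on poll 8, after the peers have been seen UP for three stable polls.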



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints

2023-09-20 Thread Cameron Zemek (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17767358#comment-17767358
 ] 

Cameron Zemek edited comment on CASSANDRA-18845 at 9/21/23 2:59 AM:


 
{noformat}
Sep 19 08:09:45 ip-10-1-57-23 cassandra[131402]: INFO  
org.apache.cassandra.gms.Gossiper Waiting for gossip to settle...
Sep 19 08:10:56 ip-10-1-57-23 cassandra[131402]: DEBUG 
org.apache.cassandra.gms.Gossiper Sending a EchoMessage to 
/35.83.14.80{noformat}
I am struggling to reproduce this ^. I have seen it twice, but after enabling 
more logging I haven't been able to reproduce it again.

What I do sometimes see, though, is it taking over 30 seconds to get the first 
ECHO response. Since there are dtests that rely on having CQL up while nodes 
are down, I have attached a patch [^18845-seperate.patch] (against the 5.0 
branch) that is opt-in. Having settle check only for currentLive == liveSize 
still allows NTR to start while nodes are marked down. Yes, you can increase 
cassandra.gossip_settle_poll_success_required (and/or the other properties) to 
mitigate it, but these increase the minimum startup time, whereas 
[^18845-seperate.patch] doesn't add to it when the cluster is healthy.

A more elaborate solution would be to specify the required consistency level 
and, for every token range owned by the node, check whether enough endpoints 
are live to satisfy that consistency level.
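
A rough sketch of that per-range check, assuming QUORUM as the required level. Everything here is hypothetical and simplified: RangeQuorumSketch, the range names and the endpoint names are made up, the replica lists are passed in directly rather than taken from Cassandra's replication strategy, and quorum is computed over the whole replica list (a LOCAL_QUORUM check would count only replicas in the local datacenter).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class RangeQuorumSketch {
    /** Quorum for a given replication factor: a strict majority of replicas. */
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    /** True only if every range has at least a quorum of its replicas marked live. */
    static boolean allRangesSatisfyQuorum(Map<String, List<String>> replicasByRange,
                                          Set<String> liveEndpoints) {
        for (List<String> replicas : replicasByRange.values()) {
            int live = 0;
            for (String replica : replicas) {
                if (liveEndpoints.contains(replica))
                    live++;
            }
            if (live < quorum(replicas.size()))
                return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical ranges owned by this node, each with RF = 3.
        Map<String, List<String>> ranges = new HashMap<>();
        ranges.put("range-1", Arrays.asList("node-a", "node-b", "node-c"));
        ranges.put("range-2", Arrays.asList("node-b", "node-c", "node-d"));

        Set<String> live = new HashSet<>(Arrays.asList("node-a", "node-b"));
        System.out.println(allRangesSatisfyQuorum(ranges, live)); // range-2 has one live replica

        live.add("node-c");
        System.out.println(allRangesSatisfyQuorum(ranges, live));
    }
}
```

Startup would keep waiting while this returns false, which ties the wait to what queries actually need rather than to a raw count of live endpoints.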


was (Author: cam1982):
 
{noformat}
Sep 19 08:09:45 ip-10-1-57-23 cassandra[131402]: INFO  
org.apache.cassandra.gms.Gossiper Waiting for gossip to settle...
Sep 19 08:10:56 ip-10-1-57-23 cassandra[131402]: DEBUG 
org.apache.cassandra.gms.Gossiper Sending a EchoMessage to 
/35.83.14.80{noformat}
I am struggling to reproduce this ^ I seen it twice, and after enabling more 
logging haven't been able to reproduce again.

 

What I do sometimes see though it taking over 30 seconds to get the first ECHO 
response. Since there are dtests that rely on having CQL up while nodes are 
down, I have attached a patch [^18845-seperate.patch] (against 5.0 branch) that 
is opt-in. Having settle just check for currentLive == liveSize is still 
allowing NTR to start while nodes are marked down. Yes you can increase 
cassandra.gossip_settle_poll_success_required (and/or the other properties) to 
mitigate it but these increase the minimum startup time. Whereas 
[^18845-seperate.patch] doesn't add to this when the cluster is healthy.

 

A more elaborate solution would be to specify the required consistency level. 
And for all token ranges owned by the node you check if you have the needed 
live endpoints to satisfy the consistency level.

> Waiting for gossip to settle on live endpoints
> --
>
> Key: CASSANDRA-18845
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18845
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Cameron Zemek
>Priority: Normal
> Attachments: 18845-seperate.patch, delay.log, example.log, 
> image-2023-09-14-11-16-23-020.png, stream.log, test1.log, test2.log, test3.log
>
>



[jira] [Comment Edited] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints

2023-09-19 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766850#comment-17766850
 ] 

Stefan Miklosovic edited comment on CASSANDRA-18845 at 9/19/23 3:31 PM:


[~cam1982] you can simulate a lost echo even in a setup with 2 nodes. This is 
definitely possible with in-jvm dtests. You can drop all communication 
between nodes like this (1) and then resume it afterwards like this (2).

(1) 
https://github.com/apache/cassandra/blob/trunk/test/distributed/org/apache/cassandra/distributed/test/AuthTest.java#L99-L101
(2) 
https://github.com/apache/cassandra/blob/trunk/test/distributed/org/apache/cassandra/distributed/test/AuthTest.java#L118-L120


was (Author: smiklosovic):
[~cam1982] you can simulate lost echo even in a setup with 2 nodes. This is 
possible with in-jvm dtests, definitely. You can drop whole communication 
between nodes like this (1)

(1) 
https://github.com/apache/cassandra/blob/trunk/test/distributed/org/apache/cassandra/distributed/test/AuthTest.java#L99-L101

> Waiting for gossip to settle on live endpoints
> --
>
> Key: CASSANDRA-18845
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18845
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Cameron Zemek
>Priority: Normal
> Attachments: delay.log, example.log, 
> image-2023-09-14-11-16-23-020.png, test1.log, test2.log, test3.log
>
>



[jira] [Comment Edited] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints

2023-09-19 Thread Cameron Zemek (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17766677#comment-17766677
 ] 

Cameron Zemek edited comment on CASSANDRA-18845 at 9/19/23 7:32 AM:


Tested the patch 3 times to confirm it is working. See test1.log, test2.log and 
test3.log.


was (Author: cam1982):
!test1.log|width=7,height=7,align=absmiddle!

!test2.log|width=7,height=7,align=absmiddle!

!test3.log|width=7,height=7,align=absmiddle!

Tested the patch 3 times to confirm it working.

> Waiting for gossip to settle on live endpoints
> --
>
> Key: CASSANDRA-18845
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18845
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Cameron Zemek
>Priority: Normal
> Attachments: example.log, image-2023-09-14-11-16-23-020.png, 
> test1.log, test2.log, test3.log
>
>



[jira] [Comment Edited] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints

2023-09-15 Thread Brandon Williams (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765558#comment-17765558
 ] 

Brandon Williams edited comment on CASSANDRA-18845 at 9/15/23 11:03 AM:


CASSANDRA-18543 is going to be reverted on CASSANDRA-18854 for causing a 
regression.  The next step for this ticket to move forward will be to create 
tests that demonstrate the problem and guard against regressions.


was (Author: brandon.williams):
CASSANDRA-18543 is going to be reverted for causing a regression.  The next 
step for this ticket to move forward will be to create tests that demonstrate 
the problem and guard against regressions.

> Waiting for gossip to settle on live endpoints
> --
>
> Key: CASSANDRA-18845
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18845
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Cameron Zemek
>Priority: Normal
> Attachments: 18845-3.11.patch, 18845-4.0.patch, 18845-4.1.patch, 
> 18845-5.0.patch, image-2023-09-14-11-16-23-020.png
>
>



[jira] [Comment Edited] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints

2023-09-14 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765035#comment-17765035
 ] 

Stefan Miklosovic edited comment on CASSANDRA-18845 at 9/14/23 7:29 AM:


Interesting. I am curious what causes that initial delay. What you are saying 
is that it takes a lot of time for the nodes to be up and then it appears (from 
the log you posted) like all of them are reported more or less at the same 
time? There is an initial delay of dozens of seconds before they start to get 
reported? If that is true then it probably makes sense to have a condition like 
that, so that we see at least some other nodes up before counting the round 
and increasing numOkay.

However, if we have this 
{code:java}
if (currentSize == epSize && currentLive == liveSize && (epSize == liveSize || 
liveSize > 1))
{code}

Then what if we have 

{code}
currentSize = 2 , epSize = 2, currentLive = 2, liveSize = 2
{code}

That "if" would return true, so numOkay would be increased and it would count 
it as a valid round.

However, and it is a little bit hard to formulate correctly, isn't it true 
that we are not guaranteeing that QUORUM would be satisfied here anyway? It 
could stay on all "twos" for all rounds, and we would say that gossip settled 
while there are a bunch of other nodes still to be reported; they just have 
not made it yet and we were stuck on 2 for three rounds.




was (Author: smiklosovic):
Interesting. I am curious what causes that initial delay. What you are saying 
is that it takes a lot of time for the nodes to be up and then it appears (from 
the log you posted)  like all of them are reported more or less at the same 
time? There is an initial delay of dozes of seconds before it starts to get 
reported? If that is true then it probably makes sense to have a condition like 
that so we see at least some other nodes to be up to count it and increase 
numOkay.

However, if we have this 
{code:java}
if (currentSize == epSize && currentLive == liveSize && (epSize == liveSize || 
liveSize > 1))
{code}

Then what if we have 

{code}
currentSize = 2 , epSize = 2, currentLive = 2, liveSize = 2
{code}

That "if" would return true, so numOkay would be increased and it would count 
it as a valid round.

However, and it is a little bit hard to formulate it correctly, but is not it 
true that we are not guaranteeing that QUORUM would be satisfied here anyway? 
Because it could stay on all "twos" for all rounds and we would say that gossip 
settled while there is bunch of other nodes to be reported but they just have 
not make it and we were stuck on 2 for three rounds.



> Waiting for gossip to settle on live endpoints
> --
>
> Key: CASSANDRA-18845
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18845
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Cameron Zemek
>Priority: Normal
> Attachments: 18845-3.11.patch, 18845-4.0.patch, 18845-4.1.patch, 
> 18845-5.0.patch, image-2023-09-14-11-16-23-020.png
>
>



[jira] [Comment Edited] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints

2023-09-13 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764726#comment-17764726
 ] 

Stefan Miklosovic edited comment on CASSANDRA-18845 at 9/13/23 2:51 PM:


Yeah, like ... if there are 20 nodes, RF is 5 and QUORUM is 3, then "liveSize > 
1" means at least 2. But how do we know that these "2" satisfy _each query on 
local quorum_? Maybe there is a query whose quorum requires nodes that are 
alive but not detected yet, or maybe I am missing something here.
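
To put numbers on that example, here is a small sketch under a deliberately simplified placement assumption: each of the 20 nodes owns one range, and the replicas of a range are its owner plus the next RF - 1 nodes on the ring. This is not Cassandra's replication strategy code, and QuorumGapSketch and rangesWithQuorum are hypothetical names.

```java
final class QuorumGapSketch {
    /** Quorum for a given replication factor: a strict majority of replicas. */
    static int quorum(int rf) {
        return rf / 2 + 1;
    }

    /**
     * Counts the ranges whose replica set (owner plus the next rf - 1 nodes on
     * the ring, a simplified placement) contains at least a quorum of live nodes.
     */
    static int rangesWithQuorum(int nodes, int rf, boolean[] live) {
        int ok = 0;
        for (int owner = 0; owner < nodes; owner++) {
            int liveReplicas = 0;
            for (int k = 0; k < rf; k++) {
                if (live[(owner + k) % nodes])
                    liveReplicas++;
            }
            if (liveReplicas >= quorum(rf))
                ok++;
        }
        return ok;
    }

    public static void main(String[] args) {
        boolean[] live = new boolean[20];
        live[0] = true; // this node
        live[1] = true; // the single peer seen UP, so liveSize == 2
        System.out.println(quorum(5));
        System.out.println(rangesWithQuorum(20, 5, live));
    }
}
```

With RF 5 the quorum is 3, so with only two nodes seen as live no range in this model can reach quorum, even though "liveSize > 1" already holds.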


was (Author: smiklosovic):
Yeah, like ... if there is 20 nodes, RF is 5 and QUORUM is 3, then "liveSize > 
1" is at least 2. But how do we know that these "2" satisfy _each query on 
local quorum_ ? Maybe there is a query for which quorum requires such nodes 
live which are not detected yet, or maybe I am missing something here.

> Waiting for gossip to settle on live endpoints
> --
>
> Key: CASSANDRA-18845
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18845
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Cameron Zemek
>Priority: Normal
> Attachments: 18845-3.11.patch, 18845-4.0.patch, 18845-4.1.patch, 
> 18845-5.0.patch
>
>



[jira] [Comment Edited] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints

2023-09-13 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764515#comment-17764515
 ] 

Stefan Miklosovic edited comment on CASSANDRA-18845 at 9/13/23 6:58 AM:


I instructed Cameron privately about the strong preference for an in-jvm dtest 
to verify and test this behavior. Looking at the test steps described in his 
comment, it should be rather straightforward to come up with one.


was (Author: smiklosovic):
I instructed Cameron privately about strong preference for an in-jvm dtest to 
verify and test this behavior. Looking at the test steps described in his 
comment about, it should be rather straightforward to come up with one.

> Waiting for gossip to settle on live endpoints
> --
>
> Key: CASSANDRA-18845
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18845
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Cameron Zemek
>Priority: Normal
> Attachments: 18845-3.11.patch, 18845-4.0.patch, 18845-4.1.patch, 
> 18845-5.0.patch
>
>



[jira] [Comment Edited] (CASSANDRA-18845) Waiting for gossip to settle on live endpoints

2023-09-12 Thread Cameron Zemek (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-18845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764467#comment-17764467
 ] 

Cameron Zemek edited comment on CASSANDRA-18845 at 9/13/23 3:32 AM:


I have attached a patch. Tested it as follows:
 # Spin up a single-node cluster. Works due to the epSize == liveSize check that 
lets it bypass the liveSize > 1 check.
 # Spin up a 3-node cluster. All 3 nodes start up NTR as expected.
 # Shut down all nodes. Start up the first node; it stays waiting in gossip due 
to the liveSize > 1 requirement.
 # Start up the second node. Now both nodes start NTR since liveSize > 1 and 
there are no other incoming `is now UP` events, so gossip looks settled.

NOTE: I had to disable the if condition for the call to Gossiper.waitToSettle() 
since I was using loopback addresses.


was (Author: cam1982):
I have attached patched. Tested this as follows:
 # Spin up single node cluster. Works due to epSize == liveSize check that lets 
it bypass the liveSize > 1 check
 # Spin up 3 node cluster. All 3 nodes start up NTR as expected.
 # Shutdown all nodes. Start up first node it stays waiting in gossip due to 
the liveSize > 1 requirement
 # Start up second node. Now both nodes start NTR since liveSize > 1 and there 
are no other incoming `is now UP` events so gossip looks settled.

> Waiting for gossip to settle on live endpoints
> --
>
> Key: CASSANDRA-18845
> URL: https://issues.apache.org/jira/browse/CASSANDRA-18845
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Cameron Zemek
>Priority: Normal
> Attachments: 18845-3.11.patch
>
>