[jira] [Updated] (CASSANDRA-12653) In-flight shadow round requests
[ https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-12653: -- Fix Version/s: (was: 3.11.x) (was: 4.x) (was: 3.0.x) (was: 2.2.x) 4.0 3.11.0 3.0.13 2.2.10 > In-flight shadow round requests > --- > > Key: CASSANDRA-12653 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12653 > Project: Cassandra > Issue Type: Bug > Components: Distributed Metadata >Reporter: Stefan Podkowinski >Assignee: Stefan Podkowinski >Priority: Minor > Fix For: 2.2.10, 3.0.13, 3.11.0, 4.0 > > > Bootstrapping or replacing a node in the cluster requires to gather and check > some host IDs or tokens by doing a gossip "shadow round" once before joining > the cluster. This is done by sending a gossip SYN to all seeds until we > receive a response with the cluster state, from where we can move on in the > bootstrap process. Receiving a response will call the shadow round done and > calls {{Gossiper.resetEndpointStateMap}} for cleaning up the received state > again. > The issue here is that at this point there might be other in-flight requests > and it's very likely that shadow round responses from other seeds will be > received afterwards, while the current state of the bootstrap process doesn't > expect this to happen (e.g. gossiper may or may not be enabled). > One side effect will be that MigrationTasks are spawned for each shadow round > reply except the first. Tasks might or might not execute based on whether at > execution time {{Gossiper.resetEndpointStateMap}} had been called, which > effects the outcome of {{FailureDetector.instance.isAlive(endpoint))}} at > start of the task. You'll see error log messages such as follows when this > happend: > {noformat} > INFO [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - > InetAddress /xx.xx.xx.xx is now UP > ERROR [MigrationStage:1]2016-09-08 08:36:39,255 FailureDetector.java:223 > - unknown endpoint /xx.xx.xx.xx > {noformat} > Although is isn't pretty, I currently don't see any serious harm from this, > but it would be good to get a second opinion (feel free to close as "wont > fix"). > /cc [~Stefania] [~thobbs] -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (CASSANDRA-12653) In-flight shadow round requests
[ https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anonymous updated CASSANDRA-12653: -- Resolution: Fixed Status: Resolved (was: Ready to Commit) > In-flight shadow round requests > --- > > Key: CASSANDRA-12653 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12653 > Project: Cassandra > Issue Type: Bug > Components: Distributed Metadata >Reporter: Stefan Podkowinski >Assignee: Stefan Podkowinski >Priority: Minor > Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x > > > Bootstrapping or replacing a node in the cluster requires to gather and check > some host IDs or tokens by doing a gossip "shadow round" once before joining > the cluster. This is done by sending a gossip SYN to all seeds until we > receive a response with the cluster state, from where we can move on in the > bootstrap process. Receiving a response will call the shadow round done and > calls {{Gossiper.resetEndpointStateMap}} for cleaning up the received state > again. > The issue here is that at this point there might be other in-flight requests > and it's very likely that shadow round responses from other seeds will be > received afterwards, while the current state of the bootstrap process doesn't > expect this to happen (e.g. gossiper may or may not be enabled). > One side effect will be that MigrationTasks are spawned for each shadow round > reply except the first. Tasks might or might not execute based on whether at > execution time {{Gossiper.resetEndpointStateMap}} had been called, which > effects the outcome of {{FailureDetector.instance.isAlive(endpoint))}} at > start of the task. You'll see error log messages such as follows when this > happend: > {noformat} > INFO [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - > InetAddress /xx.xx.xx.xx is now UP > ERROR [MigrationStage:1]2016-09-08 08:36:39,255 FailureDetector.java:223 > - unknown endpoint /xx.xx.xx.xx > {noformat} > Although is isn't pretty, I currently don't see any serious harm from this, > but it would be good to get a second opinion (feel free to close as "wont > fix"). > /cc [~Stefania] [~thobbs] -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (CASSANDRA-12653) In-flight shadow round requests
[ https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Brown updated CASSANDRA-12653: Status: Ready to Commit (was: Awaiting Feedback) > In-flight shadow round requests > --- > > Key: CASSANDRA-12653 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12653 > Project: Cassandra > Issue Type: Bug > Components: Distributed Metadata >Reporter: Stefan Podkowinski >Assignee: Stefan Podkowinski >Priority: Minor > Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x > > > Bootstrapping or replacing a node in the cluster requires to gather and check > some host IDs or tokens by doing a gossip "shadow round" once before joining > the cluster. This is done by sending a gossip SYN to all seeds until we > receive a response with the cluster state, from where we can move on in the > bootstrap process. Receiving a response will call the shadow round done and > calls {{Gossiper.resetEndpointStateMap}} for cleaning up the received state > again. > The issue here is that at this point there might be other in-flight requests > and it's very likely that shadow round responses from other seeds will be > received afterwards, while the current state of the bootstrap process doesn't > expect this to happen (e.g. gossiper may or may not be enabled). > One side effect will be that MigrationTasks are spawned for each shadow round > reply except the first. Tasks might or might not execute based on whether at > execution time {{Gossiper.resetEndpointStateMap}} had been called, which > effects the outcome of {{FailureDetector.instance.isAlive(endpoint))}} at > start of the task. You'll see error log messages such as follows when this > happend: > {noformat} > INFO [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - > InetAddress /xx.xx.xx.xx is now UP > ERROR [MigrationStage:1]2016-09-08 08:36:39,255 FailureDetector.java:223 > - unknown endpoint /xx.xx.xx.xx > {noformat} > Although is isn't pretty, I currently don't see any serious harm from this, > but it would be good to get a second opinion (feel free to close as "wont > fix"). > /cc [~Stefania] [~thobbs] -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (CASSANDRA-12653) In-flight shadow round requests
[ https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefan Podkowinski updated CASSANDRA-12653: --- Attachment: (was: 12653-3.0.patch) > In-flight shadow round requests > --- > > Key: CASSANDRA-12653 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12653 > Project: Cassandra > Issue Type: Bug > Components: Distributed Metadata >Reporter: Stefan Podkowinski >Assignee: Stefan Podkowinski >Priority: Minor > Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x > > > Bootstrapping or replacing a node in the cluster requires to gather and check > some host IDs or tokens by doing a gossip "shadow round" once before joining > the cluster. This is done by sending a gossip SYN to all seeds until we > receive a response with the cluster state, from where we can move on in the > bootstrap process. Receiving a response will call the shadow round done and > calls {{Gossiper.resetEndpointStateMap}} for cleaning up the received state > again. > The issue here is that at this point there might be other in-flight requests > and it's very likely that shadow round responses from other seeds will be > received afterwards, while the current state of the bootstrap process doesn't > expect this to happen (e.g. gossiper may or may not be enabled). > One side effect will be that MigrationTasks are spawned for each shadow round > reply except the first. Tasks might or might not execute based on whether at > execution time {{Gossiper.resetEndpointStateMap}} had been called, which > effects the outcome of {{FailureDetector.instance.isAlive(endpoint))}} at > start of the task. You'll see error log messages such as follows when this > happend: > {noformat} > INFO [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - > InetAddress /xx.xx.xx.xx is now UP > ERROR [MigrationStage:1]2016-09-08 08:36:39,255 FailureDetector.java:223 > - unknown endpoint /xx.xx.xx.xx > {noformat} > Although is isn't pretty, I currently don't see any serious harm from this, > but it would be good to get a second opinion (feel free to close as "wont > fix"). > /cc [~Stefania] [~thobbs] -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (CASSANDRA-12653) In-flight shadow round requests
[ https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefan Podkowinski updated CASSANDRA-12653: --- Attachment: (was: 12653-trunk.patch) > In-flight shadow round requests > --- > > Key: CASSANDRA-12653 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12653 > Project: Cassandra > Issue Type: Bug > Components: Distributed Metadata >Reporter: Stefan Podkowinski >Assignee: Stefan Podkowinski >Priority: Minor > Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x > > > Bootstrapping or replacing a node in the cluster requires to gather and check > some host IDs or tokens by doing a gossip "shadow round" once before joining > the cluster. This is done by sending a gossip SYN to all seeds until we > receive a response with the cluster state, from where we can move on in the > bootstrap process. Receiving a response will call the shadow round done and > calls {{Gossiper.resetEndpointStateMap}} for cleaning up the received state > again. > The issue here is that at this point there might be other in-flight requests > and it's very likely that shadow round responses from other seeds will be > received afterwards, while the current state of the bootstrap process doesn't > expect this to happen (e.g. gossiper may or may not be enabled). > One side effect will be that MigrationTasks are spawned for each shadow round > reply except the first. Tasks might or might not execute based on whether at > execution time {{Gossiper.resetEndpointStateMap}} had been called, which > effects the outcome of {{FailureDetector.instance.isAlive(endpoint))}} at > start of the task. You'll see error log messages such as follows when this > happend: > {noformat} > INFO [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - > InetAddress /xx.xx.xx.xx is now UP > ERROR [MigrationStage:1]2016-09-08 08:36:39,255 FailureDetector.java:223 > - unknown endpoint /xx.xx.xx.xx > {noformat} > Although is isn't pretty, I currently don't see any serious harm from this, > but it would be good to get a second opinion (feel free to close as "wont > fix"). > /cc [~Stefania] [~thobbs] -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (CASSANDRA-12653) In-flight shadow round requests
[ https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefan Podkowinski updated CASSANDRA-12653: --- Attachment: (was: 12653-2.2.patch) > In-flight shadow round requests > --- > > Key: CASSANDRA-12653 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12653 > Project: Cassandra > Issue Type: Bug > Components: Distributed Metadata >Reporter: Stefan Podkowinski >Assignee: Stefan Podkowinski >Priority: Minor > Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x > > > Bootstrapping or replacing a node in the cluster requires to gather and check > some host IDs or tokens by doing a gossip "shadow round" once before joining > the cluster. This is done by sending a gossip SYN to all seeds until we > receive a response with the cluster state, from where we can move on in the > bootstrap process. Receiving a response will call the shadow round done and > calls {{Gossiper.resetEndpointStateMap}} for cleaning up the received state > again. > The issue here is that at this point there might be other in-flight requests > and it's very likely that shadow round responses from other seeds will be > received afterwards, while the current state of the bootstrap process doesn't > expect this to happen (e.g. gossiper may or may not be enabled). > One side effect will be that MigrationTasks are spawned for each shadow round > reply except the first. Tasks might or might not execute based on whether at > execution time {{Gossiper.resetEndpointStateMap}} had been called, which > effects the outcome of {{FailureDetector.instance.isAlive(endpoint))}} at > start of the task. You'll see error log messages such as follows when this > happend: > {noformat} > INFO [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - > InetAddress /xx.xx.xx.xx is now UP > ERROR [MigrationStage:1]2016-09-08 08:36:39,255 FailureDetector.java:223 > - unknown endpoint /xx.xx.xx.xx > {noformat} > Although is isn't pretty, I currently don't see any serious harm from this, > but it would be good to get a second opinion (feel free to close as "wont > fix"). > /cc [~Stefania] [~thobbs] -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (CASSANDRA-12653) In-flight shadow round requests
[ https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-12653: -- Status: Awaiting Feedback (was: Open) A few questions/comments on the latest patches: * On all versions, a space is missing in the conditional {{if(firstSynSendAt == 0)}} * On 2.2, it looks like the patch adds the {{valuesEqual}} method from later versions. Was this intentional? It looks unused. * On all versions, using {{firstSynSendAt == 0}} to check if it has been initialized isn't entirely safe. It's entirely legal (although admittedly rare) for {{System.nanoTime}} to return 0. If this happened, all acks would be rejected. * Comparisons for two {{System.nanoTime}} values (such as in {{GossipDigestAckVerbHandler}}) should not use t1 < t2. Instead, one should check the difference (t1 - t2 < 0) because numerical overflow could occur in the {{System.nanoTime}} long. * In {{maybeFinishShadowRound}}/{{finishShadowRound}}, we should add the states to the {{endpointShadowStateMap}} before setting {{inShadowRound}} to false. It looks like the current behavior admits a race where {{doShadowRound}} could read {{inShadowRound == false}} and exit its loop and copy the endpointShadowStateMap before it is filled by shadow round finish. * I believe {{firstSynSendAt}} is accessed from multiple threads and needs to be volatile. > In-flight shadow round requests > --- > > Key: CASSANDRA-12653 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12653 > Project: Cassandra > Issue Type: Bug > Components: Distributed Metadata >Reporter: Stefan Podkowinski >Assignee: Stefan Podkowinski >Priority: Minor > Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x > > Attachments: 12653-2.2.patch, 12653-3.0.patch, 12653-trunk.patch > > > Bootstrapping or replacing a node in the cluster requires to gather and check > some host IDs or tokens by doing a gossip "shadow round" once before joining > the cluster. This is done by sending a gossip SYN to all seeds until we > receive a response with the cluster state, from where we can move on in the > bootstrap process. Receiving a response will call the shadow round done and > calls {{Gossiper.resetEndpointStateMap}} for cleaning up the received state > again. > The issue here is that at this point there might be other in-flight requests > and it's very likely that shadow round responses from other seeds will be > received afterwards, while the current state of the bootstrap process doesn't > expect this to happen (e.g. gossiper may or may not be enabled). > One side effect will be that MigrationTasks are spawned for each shadow round > reply except the first. Tasks might or might not execute based on whether at > execution time {{Gossiper.resetEndpointStateMap}} had been called, which > effects the outcome of {{FailureDetector.instance.isAlive(endpoint))}} at > start of the task. You'll see error log messages such as follows when this > happend: > {noformat} > INFO [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - > InetAddress /xx.xx.xx.xx is now UP > ERROR [MigrationStage:1]2016-09-08 08:36:39,255 FailureDetector.java:223 > - unknown endpoint /xx.xx.xx.xx > {noformat} > Although is isn't pretty, I currently don't see any serious harm from this, > but it would be good to get a second opinion (feel free to close as "wont > fix"). > /cc [~Stefania] [~thobbs] -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (CASSANDRA-12653) In-flight shadow round requests
[ https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-12653: -- Status: Open (was: Patch Available) > In-flight shadow round requests > --- > > Key: CASSANDRA-12653 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12653 > Project: Cassandra > Issue Type: Bug > Components: Distributed Metadata >Reporter: Stefan Podkowinski >Assignee: Stefan Podkowinski >Priority: Minor > Fix For: 2.2.x, 3.0.x, 3.11.x, 4.x > > Attachments: 12653-2.2.patch, 12653-3.0.patch, 12653-trunk.patch > > > Bootstrapping or replacing a node in the cluster requires to gather and check > some host IDs or tokens by doing a gossip "shadow round" once before joining > the cluster. This is done by sending a gossip SYN to all seeds until we > receive a response with the cluster state, from where we can move on in the > bootstrap process. Receiving a response will call the shadow round done and > calls {{Gossiper.resetEndpointStateMap}} for cleaning up the received state > again. > The issue here is that at this point there might be other in-flight requests > and it's very likely that shadow round responses from other seeds will be > received afterwards, while the current state of the bootstrap process doesn't > expect this to happen (e.g. gossiper may or may not be enabled). > One side effect will be that MigrationTasks are spawned for each shadow round > reply except the first. Tasks might or might not execute based on whether at > execution time {{Gossiper.resetEndpointStateMap}} had been called, which > effects the outcome of {{FailureDetector.instance.isAlive(endpoint))}} at > start of the task. You'll see error log messages such as follows when this > happend: > {noformat} > INFO [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - > InetAddress /xx.xx.xx.xx is now UP > ERROR [MigrationStage:1]2016-09-08 08:36:39,255 FailureDetector.java:223 > - unknown endpoint /xx.xx.xx.xx > {noformat} > Although is isn't pretty, I currently don't see any serious harm from this, > but it would be good to get a second opinion (feel free to close as "wont > fix"). > /cc [~Stefania] [~thobbs] -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (CASSANDRA-12653) In-flight shadow round requests
[ https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-12653: -- Fix Version/s: 4.x 3.x 3.0.x 2.2.x > In-flight shadow round requests > --- > > Key: CASSANDRA-12653 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12653 > Project: Cassandra > Issue Type: Bug > Components: Distributed Metadata >Reporter: Stefan Podkowinski >Assignee: Stefan Podkowinski >Priority: Minor > Fix For: 2.2.x, 3.0.x, 3.x, 4.x > > Attachments: 12653-2.2.patch, 12653-3.0.patch, 12653-trunk.patch > > > Bootstrapping or replacing a node in the cluster requires to gather and check > some host IDs or tokens by doing a gossip "shadow round" once before joining > the cluster. This is done by sending a gossip SYN to all seeds until we > receive a response with the cluster state, from where we can move on in the > bootstrap process. Receiving a response will call the shadow round done and > calls {{Gossiper.resetEndpointStateMap}} for cleaning up the received state > again. > The issue here is that at this point there might be other in-flight requests > and it's very likely that shadow round responses from other seeds will be > received afterwards, while the current state of the bootstrap process doesn't > expect this to happen (e.g. gossiper may or may not be enabled). > One side effect will be that MigrationTasks are spawned for each shadow round > reply except the first. Tasks might or might not execute based on whether at > execution time {{Gossiper.resetEndpointStateMap}} had been called, which > effects the outcome of {{FailureDetector.instance.isAlive(endpoint))}} at > start of the task. You'll see error log messages such as follows when this > happend: > {noformat} > INFO [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - > InetAddress /xx.xx.xx.xx is now UP > ERROR [MigrationStage:1]2016-09-08 08:36:39,255 FailureDetector.java:223 > - unknown endpoint /xx.xx.xx.xx > {noformat} > Although is isn't pretty, I currently don't see any serious harm from this, > but it would be good to get a second opinion (feel free to close as "wont > fix"). > /cc [~Stefania] [~thobbs] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-12653) In-flight shadow round requests
[ https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-12653: -- Status: Patch Available (was: Awaiting Feedback) > In-flight shadow round requests > --- > > Key: CASSANDRA-12653 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12653 > Project: Cassandra > Issue Type: Bug > Components: Distributed Metadata >Reporter: Stefan Podkowinski >Assignee: Stefan Podkowinski >Priority: Minor > Attachments: 12653-2.2.patch, 12653-3.0.patch, 12653-trunk.patch > > > Bootstrapping or replacing a node in the cluster requires to gather and check > some host IDs or tokens by doing a gossip "shadow round" once before joining > the cluster. This is done by sending a gossip SYN to all seeds until we > receive a response with the cluster state, from where we can move on in the > bootstrap process. Receiving a response will call the shadow round done and > calls {{Gossiper.resetEndpointStateMap}} for cleaning up the received state > again. > The issue here is that at this point there might be other in-flight requests > and it's very likely that shadow round responses from other seeds will be > received afterwards, while the current state of the bootstrap process doesn't > expect this to happen (e.g. gossiper may or may not be enabled). > One side effect will be that MigrationTasks are spawned for each shadow round > reply except the first. Tasks might or might not execute based on whether at > execution time {{Gossiper.resetEndpointStateMap}} had been called, which > effects the outcome of {{FailureDetector.instance.isAlive(endpoint))}} at > start of the task. You'll see error log messages such as follows when this > happend: > {noformat} > INFO [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - > InetAddress /xx.xx.xx.xx is now UP > ERROR [MigrationStage:1]2016-09-08 08:36:39,255 FailureDetector.java:223 > - unknown endpoint /xx.xx.xx.xx > {noformat} > Although is isn't pretty, I currently don't see any serious harm from this, > but it would be good to get a second opinion (feel free to close as "wont > fix"). > /cc [~Stefania] [~thobbs] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-12653) In-flight shadow round requests
[ https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-12653: -- Status: Awaiting Feedback (was: Open) Thanks for the ping - I gave the patch another skim and have some questions. The approach of returning endpoint states obtained through gossip seems sound. I definitely like this idea because it (mostly) prevents us from needing to reason about how shadow rounds affect the proper gossip process. That said, I'm not sure it fully accomplishes this goal. At the moment, it should be safe if a later response comes back after the gossiper has started, but we need to be careful to preserve this property in the future. I'm not sure how the timestamp check is supposed to work. It initializes the field using System.nanoTime() the first time we send a gossip, but in the gossip digest ack verb handler, we check timestamps using the endpoint state update timestamp, which is not serialized inter-node and also initialized using System.nanoTime() by the local JVM. It seems to me that this reduces to a boolean check that the gossiper has been properly started at least once, since this check will only fail when firstSynSendAt == 0. Am I missing something here? It also seems to me that we should initialize the field on starting the gossiper rather than checking and possibly initializing it every time we send a gossip message. > In-flight shadow round requests > --- > > Key: CASSANDRA-12653 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12653 > Project: Cassandra > Issue Type: Bug > Components: Distributed Metadata >Reporter: Stefan Podkowinski >Assignee: Stefan Podkowinski >Priority: Minor > Attachments: 12653-2.2.patch, 12653-3.0.patch, 12653-trunk.patch > > > Bootstrapping or replacing a node in the cluster requires to gather and check > some host IDs or tokens by doing a gossip "shadow round" once before joining > the cluster. This is done by sending a gossip SYN to all seeds until we > receive a response with the cluster state, from where we can move on in the > bootstrap process. Receiving a response will call the shadow round done and > calls {{Gossiper.resetEndpointStateMap}} for cleaning up the received state > again. > The issue here is that at this point there might be other in-flight requests > and it's very likely that shadow round responses from other seeds will be > received afterwards, while the current state of the bootstrap process doesn't > expect this to happen (e.g. gossiper may or may not be enabled). > One side effect will be that MigrationTasks are spawned for each shadow round > reply except the first. Tasks might or might not execute based on whether at > execution time {{Gossiper.resetEndpointStateMap}} had been called, which > effects the outcome of {{FailureDetector.instance.isAlive(endpoint))}} at > start of the task. You'll see error log messages such as follows when this > happend: > {noformat} > INFO [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - > InetAddress /xx.xx.xx.xx is now UP > ERROR [MigrationStage:1]2016-09-08 08:36:39,255 FailureDetector.java:223 > - unknown endpoint /xx.xx.xx.xx > {noformat} > Although is isn't pretty, I currently don't see any serious harm from this, > but it would be good to get a second opinion (feel free to close as "wont > fix"). > /cc [~Stefania] [~thobbs] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-12653) In-flight shadow round requests
[ https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joel Knighton updated CASSANDRA-12653: -- Status: Open (was: Patch Available) > In-flight shadow round requests > --- > > Key: CASSANDRA-12653 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12653 > Project: Cassandra > Issue Type: Bug > Components: Distributed Metadata >Reporter: Stefan Podkowinski >Assignee: Stefan Podkowinski >Priority: Minor > Attachments: 12653-2.2.patch, 12653-3.0.patch, 12653-trunk.patch > > > Bootstrapping or replacing a node in the cluster requires to gather and check > some host IDs or tokens by doing a gossip "shadow round" once before joining > the cluster. This is done by sending a gossip SYN to all seeds until we > receive a response with the cluster state, from where we can move on in the > bootstrap process. Receiving a response will call the shadow round done and > calls {{Gossiper.resetEndpointStateMap}} for cleaning up the received state > again. > The issue here is that at this point there might be other in-flight requests > and it's very likely that shadow round responses from other seeds will be > received afterwards, while the current state of the bootstrap process doesn't > expect this to happen (e.g. gossiper may or may not be enabled). > One side effect will be that MigrationTasks are spawned for each shadow round > reply except the first. Tasks might or might not execute based on whether at > execution time {{Gossiper.resetEndpointStateMap}} had been called, which > effects the outcome of {{FailureDetector.instance.isAlive(endpoint))}} at > start of the task. You'll see error log messages such as follows when this > happend: > {noformat} > INFO [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - > InetAddress /xx.xx.xx.xx is now UP > ERROR [MigrationStage:1]2016-09-08 08:36:39,255 FailureDetector.java:223 > - unknown endpoint /xx.xx.xx.xx > {noformat} > Although is isn't pretty, I currently don't see any serious harm from this, > but it would be good to get a second opinion (feel free to close as "wont > fix"). > /cc [~Stefania] [~thobbs] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-12653) In-flight shadow round requests
[ https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joshua McKenzie updated CASSANDRA-12653: Reviewer: Joel Knighton > In-flight shadow round requests > --- > > Key: CASSANDRA-12653 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12653 > Project: Cassandra > Issue Type: Bug > Components: Distributed Metadata >Reporter: Stefan Podkowinski >Assignee: Stefan Podkowinski >Priority: Minor > Attachments: 12653-2.2.patch, 12653-3.0.patch, 12653-trunk.patch > > > Bootstrapping or replacing a node in the cluster requires to gather and check > some host IDs or tokens by doing a gossip "shadow round" once before joining > the cluster. This is done by sending a gossip SYN to all seeds until we > receive a response with the cluster state, from where we can move on in the > bootstrap process. Receiving a response will call the shadow round done and > calls {{Gossiper.resetEndpointStateMap}} for cleaning up the received state > again. > The issue here is that at this point there might be other in-flight requests > and it's very likely that shadow round responses from other seeds will be > received afterwards, while the current state of the bootstrap process doesn't > expect this to happen (e.g. gossiper may or may not be enabled). > One side effect will be that MigrationTasks are spawned for each shadow round > reply except the first. Tasks might or might not execute based on whether at > execution time {{Gossiper.resetEndpointStateMap}} had been called, which > effects the outcome of {{FailureDetector.instance.isAlive(endpoint))}} at > start of the task. You'll see error log messages such as follows when this > happend: > {noformat} > INFO [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - > InetAddress /xx.xx.xx.xx is now UP > ERROR [MigrationStage:1]2016-09-08 08:36:39,255 FailureDetector.java:223 > - unknown endpoint /xx.xx.xx.xx > {noformat} > Although is isn't pretty, I currently don't see any serious harm from this, > but it would be good to get a second opinion (feel free to close as "wont > fix"). > /cc [~Stefania] [~thobbs] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-12653) In-flight shadow round requests
[ https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefan Podkowinski updated CASSANDRA-12653: --- Attachment: 12653-2.2.patch 12653-3.0.patch 12653-trunk.patch > In-flight shadow round requests > --- > > Key: CASSANDRA-12653 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12653 > Project: Cassandra > Issue Type: Bug > Components: Distributed Metadata >Reporter: Stefan Podkowinski >Assignee: Stefan Podkowinski >Priority: Minor > Attachments: 12653-2.2.patch, 12653-3.0.patch, 12653-trunk.patch > > > Bootstrapping or replacing a node in the cluster requires to gather and check > some host IDs or tokens by doing a gossip "shadow round" once before joining > the cluster. This is done by sending a gossip SYN to all seeds until we > receive a response with the cluster state, from where we can move on in the > bootstrap process. Receiving a response will call the shadow round done and > calls {{Gossiper.resetEndpointStateMap}} for cleaning up the received state > again. > The issue here is that at this point there might be other in-flight requests > and it's very likely that shadow round responses from other seeds will be > received afterwards, while the current state of the bootstrap process doesn't > expect this to happen (e.g. gossiper may or may not be enabled). > One side effect will be that MigrationTasks are spawned for each shadow round > reply except the first. Tasks might or might not execute based on whether at > execution time {{Gossiper.resetEndpointStateMap}} had been called, which > effects the outcome of {{FailureDetector.instance.isAlive(endpoint))}} at > start of the task. You'll see error log messages such as follows when this > happend: > {noformat} > INFO [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - > InetAddress /xx.xx.xx.xx is now UP > ERROR [MigrationStage:1]2016-09-08 08:36:39,255 FailureDetector.java:223 > - unknown endpoint /xx.xx.xx.xx > {noformat} > Although is isn't pretty, I currently don't see any serious harm from this, > but it would be good to get a second opinion (feel free to close as "wont > fix"). > /cc [~Stefania] [~thobbs] -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (CASSANDRA-12653) In-flight shadow round requests
[ https://issues.apache.org/jira/browse/CASSANDRA-12653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefan Podkowinski updated CASSANDRA-12653: --- Status: Patch Available (was: Open) > In-flight shadow round requests > --- > > Key: CASSANDRA-12653 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12653 > Project: Cassandra > Issue Type: Bug > Components: Distributed Metadata >Reporter: Stefan Podkowinski >Assignee: Stefan Podkowinski >Priority: Minor > > Bootstrapping or replacing a node in the cluster requires to gather and check > some host IDs or tokens by doing a gossip "shadow round" once before joining > the cluster. This is done by sending a gossip SYN to all seeds until we > receive a response with the cluster state, from where we can move on in the > bootstrap process. Receiving a response will call the shadow round done and > calls {{Gossiper.resetEndpointStateMap}} for cleaning up the received state > again. > The issue here is that at this point there might be other in-flight requests > and it's very likely that shadow round responses from other seeds will be > received afterwards, while the current state of the bootstrap process doesn't > expect this to happen (e.g. gossiper may or may not be enabled). > One side effect will be that MigrationTasks are spawned for each shadow round > reply except the first. Tasks might or might not execute based on whether at > execution time {{Gossiper.resetEndpointStateMap}} had been called, which > effects the outcome of {{FailureDetector.instance.isAlive(endpoint))}} at > start of the task. You'll see error log messages such as follows when this > happend: > {noformat} > INFO [SharedPool-Worker-1] 2016-09-08 08:36:39,255 Gossiper.java:993 - > InetAddress /xx.xx.xx.xx is now UP > ERROR [MigrationStage:1]2016-09-08 08:36:39,255 FailureDetector.java:223 > - unknown endpoint /xx.xx.xx.xx > {noformat} > Although is isn't pretty, I currently don't see any serious harm from this, > but it would be good to get a second opinion (feel free to close as "wont > fix"). > /cc [~Stefania] [~thobbs] -- This message was sent by Atlassian JIRA (v6.3.4#6332)