[jira] [Comment Edited] (CASSANDRA-15158) Wait for schema agreement rather than in flight schema requests when bootstrapping
[ https://issues.apache.org/jira/browse/CASSANDRA-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231782#comment-17231782 ]

David Capwell edited comment on CASSANDRA-15158 at 11/13/20, 8:38 PM:
--

Starting commit

CI Results: Yellow. 3.11 org.apache.cassandra.service.MigrationCoordinatorTest fails but passes locally, -trunk org.apache.cassandra.distributed.test.ring.BootstrapTest fails frequently due to schemas not being present; added a commit which increases the timeout from 30s to 90s-, and other expected issues.

||Branch||Source||Circle CI||Jenkins||
|cassandra-3.0|[branch|https://github.com/dcapwell/cassandra/tree/commit_remote_branch/CASSANDRA-15158-cassandra-3.0-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://app.circleci.com/pipelines/github/dcapwell/cassandra?branch=commit_remote_branch%2FCASSANDRA-15158-cassandra-3.0-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://ci-cassandra.apache.org/job/Cassandra-devbranch/200/]|
|cassandra-3.11|[branch|https://github.com/dcapwell/cassandra/tree/commit_remote_branch/CASSANDRA-15158-cassandra-3.11-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://app.circleci.com/pipelines/github/dcapwell/cassandra?branch=commit_remote_branch%2FCASSANDRA-15158-cassandra-3.11-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://ci-cassandra.apache.org/job/Cassandra-devbranch/201/]|
|trunk|[branch|https://github.com/dcapwell/cassandra/tree/commit_remote_branch/CASSANDRA-15158-trunk-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://app.circleci.com/pipelines/github/dcapwell/cassandra?branch=commit_remote_branch%2FCASSANDRA-15158-trunk-7E401495-E38F-4857-80C1-2C27028F572E]|[build|https://ci-cassandra.apache.org/job/Cassandra-devbranch/202/]|

> Wait for schema agreement rather than in flight schema requests when
> bootstrapping
> --
>
> Key: CASSANDRA-15158
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15158
> Project: Cassandra
> Issue Type: Bug
> Components: Cluster/Gossip, Cluster/Schema
> Reporter: Vincent White
> Assignee: Blake Eggleston
> Priority: Normal
> Fix For: 3.0.x, 3.11.x, 4.0-beta
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Currently when a node is bootstrapping we use a set of latches
> (org.apache.cassandra.service.MigrationTask#inflightTasks) to keep track of
> in-flight schema pull requests, and we don't proceed with
> bootstrapping/streaming until all the latches are released (or we time out
> waiting for each one). One issue with this is that if we have a large schema,
> or the retrieval of the schema from the other nodes is unexpectedly slow,
> then we have no explicit check in place to ensure we have actually received a
> schema before we proceed.
> While it's possible to increase "migration_task_wait_in_seconds" to force the
> node to wait on each latch longer, there are cases where this doesn't help
> because the callbacks for the schema pull requests have expired off the
> messaging service's callback map
> (org.apache.cassandra.net.MessagingService#callbacks) after
> request_timeout_in_ms (default 10 seconds) before the other nodes were able
> to respond to the new node.
> This patch checks for schema agreement between the bootstrapping node and the
> rest of the live nodes before proceeding with bootstrapping. It also adds a
> check to prevent the new node from flooding existing nodes with simultaneous
> schema pull requests, as can happen in large clusters.
> Removing the latch system should also prevent new nodes in large clusters
> getting stuck for extended amounts of time as they wait
> `migration_task_wait_in_seconds` on each of the latches left orphaned by the
> timed-out callbacks.
>
> ||3.11||
> |[PoC|https://github.com/apache/cassandra/compare/cassandra-3.11...vincewhite:check_for_schema]|
> |[dtest|https://github.com/apache/cassandra-dtest/compare/master...vincewhite:wait_for_schema_agreement]|
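The mechanism change described in the quoted ticket, polling for schema agreement instead of waiting on per-request latches, can be sketched in plain Java. This is only an illustrative model with hypothetical names; the real check lives in MigrationCoordinator and reads schema versions from gossip state:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SchemaAgreementSketch
{
    // Agreement holds when every live node reports the same schema version
    // as the bootstrapping node; no per-request latch can be left orphaned.
    static boolean hasSchemaAgreement(String localVersion, Map<String, String> liveVersions)
    {
        Set<String> versions = new HashSet<>(liveVersions.values());
        versions.add(localVersion);
        return versions.size() == 1;
    }

    public static void main(String[] args)
    {
        String local = "59adb24e-f3cd-3e02-97f0-5b395827453f";
        Map<String, String> live = new HashMap<>();
        live.put("/10.0.0.1", local);
        live.put("/10.0.0.2", local);

        // One node still on an older schema: no agreement yet, keep waiting.
        live.put("/10.0.0.3", "00000000-0000-0000-0000-000000000000");
        System.out.println(hasSchemaAgreement(local, live));

        // Once the lagging node catches up, bootstrap may proceed.
        live.put("/10.0.0.3", local);
        System.out.println(hasSchemaAgreement(local, live));
    }
}
```

In the real patch this predicate would be re-evaluated until it holds or a timeout is hit, rather than waiting a fixed period per in-flight request.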
[jira] [Comment Edited] (CASSANDRA-15158) Wait for schema agreement rather than in flight schema requests when bootstrapping
[ https://issues.apache.org/jira/browse/CASSANDRA-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216816#comment-17216816 ]

Aleksey Yeschenko edited comment on CASSANDRA-15158 at 10/30/20, 4:07 PM:
--

Left a small comment on the 3.0 branch. Also, the following nits for {{MigrationCoordinator}}:
1. A bunch of unused imports
2. {{shouldApplySchemaFrom()}} has an unused argument
3. {{requestQueue}} could be an {{ArrayDeque}} instead of a {{LinkedList}} - we should set a good example for anyone randomly reading this code, even if doing the right thing isn't critical in this context

EDIT: LGTM, +1, ship it

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
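On the {{ArrayDeque}}-versus-{{LinkedList}} nit: both implement {{Deque}}, but {{ArrayDeque}} is backed by a resizable array, avoiding a node allocation per element. A minimal illustration (the queue contents here are made up):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class RequestQueueExample
{
    public static void main(String[] args)
    {
        // Same Deque interface LinkedList offers, but with contiguous storage:
        // better cache locality and no per-node object overhead.
        Deque<String> requestQueue = new ArrayDeque<>();
        requestQueue.addLast("/10.0.0.1");
        requestQueue.addLast("/10.0.0.2");

        // FIFO processing, as a queue of pending schema pull targets would do.
        System.out.println(requestQueue.pollFirst());
        System.out.println(requestQueue.pollFirst());
    }
}
```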
[jira] [Comment Edited] (CASSANDRA-15158) Wait for schema agreement rather than in flight schema requests when bootstrapping
[ https://issues.apache.org/jira/browse/CASSANDRA-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17193009#comment-17193009 ]

Stefan Miklosovic edited comment on CASSANDRA-15158 at 9/9/20, 4:50 PM:
--

I have improved my original work and written a test for it. The Jenkins build does not fail anymore, so I believe this solution is on par with the other one when it comes to dtests, as I do not have time to fix the dtests which the other solution breaks. While I admit that the improved version is technically superior, having a clean build and the same dtest behaviour is more important to me at this moment. It would be awesome if the dtests and the issues I spotted were resolved, though.

The test is here (1). The main logic: a cluster of two nodes is started, a third node is started afterwards, and I drop all migration messages to the other two, simulating a communication error between them. After some time, migration messages start to flow again. This exercises the internals of the logic I wrote, and it seems to do its job.

One issue I am a little concerned about is that StorageService issues schema migration requests in "onAlive, onJoin ..." and these requests are not part of the waitForSchema() logic. It is understandable that it works this way, as we need to track migration requests after a node has fully bootstrapped, but we should skip this while a node is still bootstrapping. I wrapped the bodies of these methods in "if (hasJoined())" but they were invoked anyway. However, it does not matter too much that this is outside of my logic, because if schema migration was successful, the rewritten logic in waitForSchema has nothing left to deal with, so we are done anyway. To skip this in the test, I used ByteBuddy to intercept MigrationManager#scheduleSchemaPull to do nothing, so I effectively prevent schema migrations from being sent outside of my change.

The onChange method called from onJoin merges schemas again too, in case the state is SCHEMA, so I am not completely sure why we are merging schemas on a join anyway?
{code:java}
public void onJoin(InetAddress endpoint, EndpointState epState)
{
    for (Map.Entry<ApplicationState, VersionedValue> entry : epState.states())
    {
        onChange(endpoint, entry.getKey(), entry.getValue());
    }
    // this is weird
    MigrationManager.instance.scheduleSchemaPull(endpoint, epState);
}

public void onAlive(InetAddress endpoint, EndpointState state)
{
    // this is weird as well
    MigrationManager.instance.scheduleSchemaPull(endpoint, state);
    if (tokenMetadata.isMember(endpoint))
        notifyUp(endpoint);
}
{code}

(1) https://github.com/instaclustr/cassandra/blob/15158-original-fix/test/distributed/org/apache/cassandra/distributed/test/BootstrappingSchemaAgreementTest.java
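The dtest scenario described above (drop migration messages for a while, then let them through, and verify the joining node still converges) can be reduced to a toy retry loop. This is a simulation of the scenario, not the in-jvm dtest or ByteBuddy API, and all names are made up:

```java
public class MigrationDropSimulation
{
    public static void main(String[] args)
    {
        final String clusterVersion = "v2";
        String joiningNodeVersion = "v1";

        // Simulated fault: the first three schema pull messages are dropped,
        // standing in for the test dropping migration messages between nodes.
        int dropFirst = 3;
        int attempts = 0;

        while (!joiningNodeVersion.equals(clusterVersion))
        {
            attempts++;
            boolean delivered = attempts > dropFirst;
            if (delivered)
                joiningNodeVersion = clusterVersion; // pull answered, schemas merge
        }

        System.out.println(attempts + " attempts, schema " + joiningNodeVersion);
    }
}
```

The point the test relies on is exactly this shape: as long as pulls are retried rather than waited on once, a temporary communication error only delays convergence instead of preventing it.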
[jira] [Comment Edited] (CASSANDRA-15158) Wait for schema agreement rather than in flight schema requests when bootstrapping
[ https://issues.apache.org/jira/browse/CASSANDRA-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17191705#comment-17191705 ]

Stefan Miklosovic edited comment on CASSANDRA-15158 at 9/7/20, 1:49 PM:
--

I am getting this exception on a totally clean node while bootstrapping a cluster of 3 nodes:
{code:java}
cassandra_node_1 | INFO [ScheduledTasks:1] 2020-09-07 15:10:13,037 TokenMetadata.java:517 - Updating topology for all endpoints that have changed
cassandra_node_1 | INFO [HANDSHAKE-spark-master-1/172.19.0.5] 2020-09-07 15:10:13,311 OutboundTcpConnection.java:561 - Handshaking version with spark-master-1/172.19.0.5
cassandra_node_1 | INFO [GossipStage:1] 2020-09-07 15:10:13,870 Gossiper.java:1141 - Node /172.19.0.5 is now part of the cluster
cassandra_node_1 | INFO [GossipStage:1] 2020-09-07 15:10:13,904 TokenMetadata.java:497 - Updating topology for /172.19.0.5
cassandra_node_1 | INFO [GossipStage:1] 2020-09-07 15:10:13,907 TokenMetadata.java:497 - Updating topology for /172.19.0.5
cassandra_node_1 | INFO [GossipStage:1] 2020-09-07 15:10:14,052 Gossiper.java:1103 - InetAddress /172.19.0.5 is now UP
cassandra_node_1 | WARN [MessagingService-Incoming-/172.19.0.5] 2020-09-07 15:10:14,119 IncomingTcpConnection.java:103 - UnknownColumnFamilyException reading from socket; closing
cassandra_node_1 | org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find table for cfId 5bc52802-de25-35ed-aeab-188eecebb090. If a table was just created, this is likely due to the schema not being fully propagated. Please wait for schema agreement on table creation.
cassandra_node_1 |     at org.apache.cassandra.config.CFMetaData$Serializer.deserialize(CFMetaData.java:1578) ~[apache-cassandra-3.11.9-SNAPSHOT.jar:3.11.9-SNAPSHOT]
cassandra_node_1 |     at org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize30(PartitionUpdate.java:899) ~[apache-cassandra-3.11.9-SNAPSHOT.jar:3.11.9-SNAPSHOT]
cassandra_node_1 |     at org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize(PartitionUpdate.java:874) ~[apache-cassandra-3.11.9-SNAPSHOT.jar:3.11.9-SNAPSHOT]
cassandra_node_1 |     at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:415) ~[apache-cassandra-3.11.9-SNAPSHOT.jar:3.11.9-SNAPSHOT]
cassandra_node_1 |     at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:434) ~[apache-cassandra-3.11.9-SNAPSHOT.jar:3.11.9-SNAPSHOT]
cassandra_node_1 |     at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:371) ~[apache-cassandra-3.11.9-SNAPSHOT.jar:3.11.9-SNAPSHOT]
cassandra_node_1 |     at org.apache.cassandra.net.MessageIn.read(MessageIn.java:123) ~[apache-cassandra-3.11.9-SNAPSHOT.jar:3.11.9-SNAPSHOT]
cassandra_node_1 |     at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:195) ~[apache-cassandra-3.11.9-SNAPSHOT.jar:3.11.9-SNAPSHOT]
cassandra_node_1 |     at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:183) ~[apache-cassandra-3.11.9-SNAPSHOT.jar:3.11.9-SNAPSHOT]
{code}
That cfId stands for system_auth/roles. It seems we are applying changes before schema agreement has occurred, so that table is not yet there to apply mutations against. This is the log from the second node; the first node booted fine, the second throws this, and the third boots fine. Eventually everything seems to be fine, but that exception is ... concerning.
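The cfId-to-table mapping mentioned above can be checked by hand: for fixed system tables, Cassandra 3.x derives a deterministic name-based (type 3) UUID from the keyspace and table names (see CFMetaData.generateLegacyCfId). The snippet below reproduces that derivation for system_auth/roles; if it is right, it should print the cfId from the log:

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class CfIdCheck
{
    // Type 3 (MD5 name-based) UUID over the concatenated keyspace and table
    // names, matching CFMetaData.generateLegacyCfId. The names here are ASCII,
    // so explicitly using UTF-8 matches the default-charset call in Cassandra.
    static UUID cfId(String keyspace, String table)
    {
        byte[] ks = keyspace.getBytes(StandardCharsets.UTF_8);
        byte[] cf = table.getBytes(StandardCharsets.UTF_8);
        byte[] concat = new byte[ks.length + cf.length];
        System.arraycopy(ks, 0, concat, 0, ks.length);
        System.arraycopy(cf, 0, concat, ks.length, cf.length);
        return UUID.nameUUIDFromBytes(concat);
    }

    public static void main(String[] args)
    {
        System.out.println(cfId("system_auth", "roles"));
    }
}
```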
[jira] [Comment Edited] (CASSANDRA-15158) Wait for schema agreement rather than in flight schema requests when bootstrapping
[ https://issues.apache.org/jira/browse/CASSANDRA-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191711#comment-17191711 ] Stefan Miklosovic edited comment on CASSANDRA-15158 at 9/7/20, 1:42 PM: There is also a runtime error, as that concurrent hash map from that package is not on the class path. I removed it here; I just squashed all changes in Blake's branch + this one fix: https://github.com/instaclustr/cassandra/commit/e23677deeb7c836b4b7c80f98009353668351620

was (Author: stefan.miklosovic): There is also a runtime error, as that concurrent hash map from that package is not on the class path. I removed it here; I just squashed all changes in Blake's branch + this one fix: https://github.com/instaclustr/cassandra/commit/af82bc2f1a4f9eff09458101c63027e919873af9

> Wait for schema agreement rather than in flight schema requests when > bootstrapping > -- > > Key: CASSANDRA-15158 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15158 > Project: Cassandra > Issue Type: Bug > Components: Cluster/Gossip, Cluster/Schema >Reporter: Vincent White >Assignee: Blake Eggleston >Priority: Normal > Time Spent: 10m > Remaining Estimate: 0h > > Currently when a node is bootstrapping we use a set of latches > (org.apache.cassandra.service.MigrationTask#inflightTasks) to keep track of > in-flight schema pull requests, and we don't proceed with > bootstrapping/stream until all the latches are released (or we timeout > waiting for each one). One issue with this is that if we have a large schema, > or the retrieval of the schema from the other nodes was unexpectedly slow > then we have no explicit check in place to ensure we have actually received a > schema before we proceed. 
> While it's possible to increase "migration_task_wait_in_seconds" to force the > node to wait on each latch longer, there are cases where this doesn't help > because the callbacks for the schema pull requests have expired off the > messaging service's callback map > (org.apache.cassandra.net.MessagingService#callbacks) after > request_timeout_in_ms (default 10 seconds) before the other nodes were able > to respond to the new node. > This patch checks for schema agreement between the bootstrapping node and the > rest of the live nodes before proceeding with bootstrapping. It also adds a > check to prevent the new node from flooding existing nodes with simultaneous > schema pull requests as can happen in large clusters. > Removing the latch system should also prevent new nodes in large clusters > getting stuck for extended amounts of time as they wait > `migration_task_wait_in_seconds` on each of the latches left orphaned by the > timed out callbacks. > > ||3.11|| > |[PoC|https://github.com/apache/cassandra/compare/cassandra-3.11...vincewhite:check_for_schema]| > |[dtest|https://github.com/apache/cassandra-dtest/compare/master...vincewhite:wait_for_schema_agreement]| > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
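The description above boils down to: stop waiting on per-request latches and instead block the joining node until every live endpoint reports the same schema version. A minimal sketch of that idea, with hypothetical names (SchemaAgreementSketch, waitForSchemaAgreement) and a plain map standing in for gossip-reported versions — not Cassandra's actual API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Rough sketch of a schema-agreement check: poll the schema version each live
// endpoint reports (in Cassandra this information would come from gossip) and
// proceed only once all of them match the local version, bounded by a deadline.
// Class and method names are hypothetical, for illustration only.
public class SchemaAgreementSketch
{
    static boolean inAgreement(UUID localVersion, Map<String, UUID> liveVersions)
    {
        // Agreement: every live endpoint reports exactly the version we hold.
        return liveVersions.values().stream().allMatch(localVersion::equals);
    }

    static boolean waitForSchemaAgreement(UUID localVersion,
                                          Map<String, UUID> liveVersions,
                                          long timeoutMillis,
                                          long pollMillis) throws InterruptedException
    {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (!inAgreement(localVersion, liveVersions))
        {
            if (System.currentTimeMillis() >= deadline)
                return false; // caller decides whether to abort the bootstrap
            Thread.sleep(pollMillis);
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException
    {
        UUID local = UUID.randomUUID();
        Map<String, UUID> live = new HashMap<>();
        live.put("10.0.0.1", local);
        live.put("10.0.0.2", local);
        System.out.println(waitForSchemaAgreement(local, live, 100, 10)); // prints true

        live.put("10.0.0.3", UUID.randomUUID()); // one divergent endpoint
        System.out.println(waitForSchemaAgreement(local, live, 50, 10)); // prints false
    }
}
```

Unlike the latch approach, this check cannot be satisfied by merely receiving *some* schema: a divergent or missing version on any live endpoint keeps the node waiting until the deadline.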
[jira] [Comment Edited] (CASSANDRA-15158) Wait for schema agreement rather than in flight schema requests when bootstrapping
[ https://issues.apache.org/jira/browse/CASSANDRA-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191711#comment-17191711 ] Stefan Miklosovic edited comment on CASSANDRA-15158 at 9/7/20, 1:37 PM: There is also a runtime error, as that concurrent hash map from that package is not on the class path. I removed it here; I just squashed all changes in Blake's branch + this one fix: https://github.com/instaclustr/cassandra/commit/af82bc2f1a4f9eff09458101c63027e919873af9

was (Author: stefan.miklosovic): There is also a runtime error as that concurrent hash map from that package is not a class path. I removed it here, I just squashed all changes in Blake's branch + this one fix: https://github.com/instaclustr/cassandra/commit/af82bc2f1a4f9eff09458101c63027e919873af9
[jira] [Comment Edited] (CASSANDRA-15158) Wait for schema agreement rather than in flight schema requests when bootstrapping
[ https://issues.apache.org/jira/browse/CASSANDRA-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17191705#comment-17191705 ] Stefan Miklosovic edited comment on CASSANDRA-15158 at 9/7/20, 1:18 PM: I am getting this exception on a totally clean node while bootstrapping a cluster of 3 nodes: {code:java}
cassandra_node_1| INFO [ScheduledTasks:1] 2020-09-07 15:10:13,037 TokenMetadata.java:517 - Updating topology for all endpoints that have changed
cassandra_node_1| INFO [HANDSHAKE-spark-master-1/172.19.0.5] 2020-09-07 15:10:13,311 OutboundTcpConnection.java:561 - Handshaking version with spark-master-1/172.19.0.5
cassandra_node_1| INFO [GossipStage:1] 2020-09-07 15:10:13,870 Gossiper.java:1141 - Node /172.19.0.5 is now part of the cluster
cassandra_node_1| INFO [GossipStage:1] 2020-09-07 15:10:13,904 TokenMetadata.java:497 - Updating topology for /172.19.0.5
cassandra_node_1| INFO [GossipStage:1] 2020-09-07 15:10:13,907 TokenMetadata.java:497 - Updating topology for /172.19.0.5
cassandra_node_1| INFO [GossipStage:1] 2020-09-07 15:10:14,052 Gossiper.java:1103 - InetAddress /172.19.0.5 is now UP
cassandra_node_1| WARN [MessagingService-Incoming-/172.19.0.5] 2020-09-07 15:10:14,119 IncomingTcpConnection.java:103 - UnknownColumnFamilyException reading from socket; closing
cassandra_node_1| org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find table for cfId 5bc52802-de25-35ed-aeab-188eecebb090. If a table was just created, this is likely due to the schema not being fully propagated. Please wait for schema agreement on table creation.
cassandra_node_1| at org.apache.cassandra.config.CFMetaData$Serializer.deserialize(CFMetaData.java:1578) ~[apache-cassandra-3.11.9-SNAPSHOT.jar:3.11.9-SNAPSHOT]
cassandra_node_1| at org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize30(PartitionUpdate.java:899) ~[apache-cassandra-3.11.9-SNAPSHOT.jar:3.11.9-SNAPSHOT]
cassandra_node_1| at org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize(PartitionUpdate.java:874) ~[apache-cassandra-3.11.9-SNAPSHOT.jar:3.11.9-SNAPSHOT]
cassandra_node_1| at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:415) ~[apache-cassandra-3.11.9-SNAPSHOT.jar:3.11.9-SNAPSHOT]
cassandra_node_1| at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:434) ~[apache-cassandra-3.11.9-SNAPSHOT.jar:3.11.9-SNAPSHOT]
cassandra_node_1| at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:371) ~[apache-cassandra-3.11.9-SNAPSHOT.jar:3.11.9-SNAPSHOT]
cassandra_node_1| at org.apache.cassandra.net.MessageIn.read(MessageIn.java:123) ~[apache-cassandra-3.11.9-SNAPSHOT.jar:3.11.9-SNAPSHOT]
cassandra_node_1| at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:195) ~[apache-cassandra-3.11.9-SNAPSHOT.jar:3.11.9-SNAPSHOT]
cassandra_node_1| at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:183) ~[apache-cassandra-3.11.9-SNAPSHOT.jar:3.11.9-SNAPSHOT]
{code}
That cfId stands for system_auth/roles. It seems like we are applying changes before schema agreement has occurred, so that table is not there yet to apply mutations against. 
was (Author: stefan.miklosovic): I am getting this exception on totally clean node, I am bootstrapping a cluster of 3 nodes: {code:java} cassandra_node_1| INFO [ScheduledTasks:1] 2020-09-07 15:10:13,037 TokenMetadata.java:517 - Updating topology for all endpoints that have changed cassandra_node_1| INFO [HANDSHAKE-spark-master-1/172.19.0.5] 2020-09-07 15:10:13,311 OutboundTcpConnection.java:561 - Handshaking version with spark-master-1/172.19.0.5 cassandra_node_1| INFO [GossipStage:1] 2020-09-07 15:10:13,870 Gossiper.java:1141 - Node /172.19.0.5 is now part of the cluster cassandra_node_1| INFO [GossipStage:1] 2020-09-07 15:10:13,904 TokenMetadata.java:497 - Updating topology for /172.19.0.5 cassandra_node_1| INFO [GossipStage:1] 2020-09-07 15:10:13,907 TokenMetadata.java:497 - Updating topology for /172.19.0.5 cassandra_node_1| INFO [GossipStage:1] 2020-09-07 15:10:14,052 Gossiper.java:1103 - InetAddress /172.19.0.5 is now UP cassandra_node_1| WARN [MessagingService-Incoming-/172.19.0.5] 2020-09-07 15:10:14,119 IncomingTcpConnection.java:103 - UnknownColumnFamilyException reading from socket; closing cassandra_node_1| org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find table for cfId 5bc52802-de25-35ed-aeab-188eecebb090. If a table was just created, this is likely due to the schema not being fully propagated. Please wait for schema agreement on table creation. cassandra_node_1| at
[jira] [Comment Edited] (CASSANDRA-15158) Wait for schema agreement rather than in flight schema requests when bootstrapping
[ https://issues.apache.org/jira/browse/CASSANDRA-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17189642#comment-17189642 ] Blake Eggleston edited comment on CASSANDRA-15158 at 9/2/20, 7:18 PM: -- Possibly, you still need to check in the submission task in case the node has died in the meantime. There would still be an intersection of node flapping rate and unfortunate scheduling where the lockup could occur though. The queue, while a little awkward, also makes us a bit more resilient against other unanticipated states and/or bugs.

was (Author: bdeggleston): Possibly, you still need to check in the submission task in case the node has died in the meantime. There would still be an intersection of node flapping rate and unfortunate scheduling where the lockup could occur though
[jira] [Comment Edited] (CASSANDRA-15158) Wait for schema agreement rather than in flight schema requests when bootstrapping
[ https://issues.apache.org/jira/browse/CASSANDRA-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188581#comment-17188581 ] Aleksey Yeschenko edited comment on CASSANDRA-15158 at 9/1/20, 3:49 PM: Pushed some minor tweaks [here|https://github.com/iamaleksey/cassandra/commits/15158-review]. Made some bits more idiomatic, and changed the way in-flight requests are being kept track of. In general, this does the job and solves the problem in the description. It doesn't, however, fully deal with storms in large clusters caused by a sequence of updates in quick succession, but, it's not intended to, either. EDIT: the amount of synchronisation here bothers me a tiny bit, as all of it will likely have to be eventually gotten rid of, when and if TPC happens, but I can live with it.

was (Author: iamaleksey): Pushed some minor tweaks [here|https://github.com/iamaleksey/cassandra/commits/15158-review]. Made some bits more idiomatic, and changed the way in-flight requests are being kept track of. In general, this does the job and solves the problem in the description. It doesn't, however, fully deal with storms in large clusters caused by a sequence of updates in quick succession, but, it's not intended to, either.
[jira] [Comment Edited] (CASSANDRA-15158) Wait for schema agreement rather than in flight schema requests when bootstrapping
[ https://issues.apache.org/jira/browse/CASSANDRA-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17133717#comment-17133717 ] Stefan Miklosovic edited comment on CASSANDRA-15158 at 6/11/20, 9:58 PM: - Hi Blake, because of your very helpful explanation I was able to put together yet another version of the solution to this problem. You will find it here [https://github.com/apache/cassandra/pull/628] Thanks for the review in advance

was (Author: stefan.miklosovic): Hi Blake, because of your very helpful explanation I was able to put together yet another version of the solution to this problem. You will find it here [https://github.com/apache/cassandra/compare/trunk...smiklosovic:CASSANDRA-15158-rework] Thanks for the review in advance
[jira] [Comment Edited] (CASSANDRA-15158) Wait for schema agreement rather than in flight schema requests when bootstrapping
[ https://issues.apache.org/jira/browse/CASSANDRA-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102543#comment-17102543 ] Stefan Miklosovic edited comment on CASSANDRA-15158 at 5/8/20, 12:49 PM: - It seems to me that one aspect of the PR was overlooked, so I will just iterate on that one. The mechanism for not flooding nodes with schema pull messages is incorporated into the loop over callbacks. If you look closely, there are sleeps of various lengths depending on whether a request has already been sent. This sleep will actually "delay" the next schema pull from the other node, because during the sleep some schema could arrive from the node we just sent a message to, so on the next iteration, when another node is compared for schema equality, it may turn out that there is no need to pull from it anymore because they are on par. Hence we are not blindly sending messages to all nodes. If some discrepancies remain, there is a global timeout after which the whole bootstrapping process will be evaluated as erroneous and (in the current code) we throw a ConfigurationException. This behaviour might be relaxed, but I consider it more appropriate to just throw there.

was (Author: stefan.miklosovic): It seems to me that one aspect of the PR was overlooked, so I will just iterate on that one. The mechanism for not flooding nodes with schema pull messages is incorporated in a loop over callbacks. If you look closely, there are sleeps of various lengths depending on whether a request has already been sent. This sleep will actually "delay" the next schema pull from the other node, because during the sleep a schema could come in, so on the next iteration, when another node is compared for schema equality, it may turn out that there is no need to pull from it anymore because they are on par. Hence we are not blindly sending messages to all nodes. 
If some discrepancies remain, there is a global timeout after which the whole bootstrapping process will be evaluated as erroneous and (in the current code) we throw a ConfigurationException. This behaviour might be relaxed, but I consider it more appropriate to just throw there.
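The flood-avoidance mechanism described in this comment — sleep between pulls, re-check each endpoint's reported version right before sending (a reply received during the sleep may already have brought us into agreement), and fail the bootstrap after a global timeout — could be sketched as below. All names are hypothetical; this illustrates the described loop, not the actual patch:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Illustrative sketch: pull only from endpoints whose reported schema version
// still differs from ours, sleeping between requests so replies can land and
// spare later pulls. A global deadline bounds the whole process.
public class SchemaPullLoopSketch
{
    interface PullSender
    {
        void sendPullRequest(String endpoint);
    }

    // Returns the endpoints we actually pulled from; throws if agreement is
    // not reached before the global deadline.
    static List<String> pullUntilAgreement(UUID localVersion,
                                           Map<String, UUID> reportedVersions,
                                           PullSender sender,
                                           long globalTimeoutMillis,
                                           long sleepMillis) throws InterruptedException
    {
        List<String> pulled = new ArrayList<>();
        long deadline = System.currentTimeMillis() + globalTimeoutMillis;
        while (true)
        {
            boolean mismatch = false;
            for (Map.Entry<String, UUID> e : reportedVersions.entrySet())
            {
                if (localVersion.equals(e.getValue()))
                    continue; // converged (possibly while we slept); no pull needed
                mismatch = true;
                sender.sendPullRequest(e.getKey());
                pulled.add(e.getKey());
                Thread.sleep(sleepMillis); // delay the next pull; replies may arrive meanwhile
            }
            if (!mismatch)
                return pulled;
            if (System.currentTimeMillis() >= deadline)
                throw new IllegalStateException("no schema agreement before global timeout");
        }
    }

    public static void main(String[] args) throws InterruptedException
    {
        UUID local = UUID.randomUUID();
        Map<String, UUID> reported = new LinkedHashMap<>();
        reported.put("10.0.0.1", local);             // already in agreement
        reported.put("10.0.0.2", UUID.randomUUID()); // needs a pull
        // Toy sender: pretend the pull converges the endpoint immediately.
        List<String> pulled = pullUntilAgreement(local, reported,
                                                 ep -> reported.put(ep, local), 1000, 1);
        System.out.println(pulled); // prints [10.0.0.2]
    }
}
```

The re-check at the top of the inner loop is the key point from the comment: the sleep after each request gives in-flight replies a chance to make later pulls unnecessary, so nodes are not blindly messaged.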
[jira] [Comment Edited] (CASSANDRA-15158) Wait for schema agreement rather than in flight schema requests when bootstrapping
[ https://issues.apache.org/jira/browse/CASSANDRA-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102374#comment-17102374 ] Stefan Miklosovic edited comment on CASSANDRA-15158 at 5/8/20, 8:50 AM: Hi [~bdeggleston], commenting on the design issues, I am not completely sure if the issues you are talking about are related to this patch or already existing? We could indeed focus on the points you raised, but it seems to me that the current (committed) code is worse without this patch than with it, as I guess these problems are already there? Isn't the goal here to have all nodes on the same version? Isn't the very fact that there are multiple versions pretty strange to begin with, so we should not even try to join a node if they mismatch, hence there is nothing to deal with in the first place? {quote}It will only wait until it has _some_ schema to begin bootstrapping, not all {quote} This is most likely not true, unless I am missing something. The node to be bootstrapped will never advance unless all nodes have the same version. {quote} For instance, if a single node is reporting a schema version that no one else has, but the node is unreachable, what do we do? {quote} We should fail the whole bootstrap and one should go and fix it. {quote}For instance, if a single node is reporting a schema version that no one else has, but the node is unreachable, what do we do? {quote} How can a node report its schema while being unreachable? {quote}Next, I like how this limits the number of messages sent to a given endpoint, but we should also limit the number of messages we send out for a given schema version. If we have a large cluster, and all nodes are reporting the same version, we don't need to ask every node for its schema. {quote} Got you, this might be tracked. When it comes to testing, I admit that adding the isRunningForcibly method feels like a hack, but I had a very hard time testing this stuff out. 
It was basically the only reasonable way possible at the time I was coding it; if you know of a better approach, please tell me, otherwise I am not sure what might be better here and we could stick with this for the time being? The whole testing methodology was based on these callbacks and checking their inner state, which results in having methods which accept them so we can examine their state. Without "injecting" them from outside, I would not be able to do that.

was (Author: stefan.miklosovic): Hi [~bdeggleston], commenting on the design issues, I am not completely sure if the issues you are talking about are related to this patch or already existing? We could indeed focus on the points you raised, but it seems to me that the current (committed) code is worse without this patch than with it, as I guess these problems are already there? Isn't the goal here to have all nodes on the same version? Isn't the very fact that there are multiple versions pretty strange to begin with, so we should not even try to join a node if they mismatch, hence there is nothing to deal with in the first place? {quote}It will only wait until it has _some_ schema to begin bootstrapping, not all {quote} This is most likely not true, unless I am missing something. The node to be bootstrapped will never advance unless all nodes have the same version. {quote} For instance, if a single node is reporting a schema version that no one else has, but the node is unreachable, what do we do? {quote} We should fail the whole bootstrap and one should go and fix it. {quote}For instance, if a single node is reporting a schema version that no one else has, but the node is unreachable, what do we do? {quote} How can a node report its schema while being unreachable? {quote}Next, I like how this limits the number of messages sent to a given endpoint, but we should also limit the number of messages we send out for a given schema version. 
If we have a large cluster, and all nodes are reporting the same version, we don't need to ask every node for its schema. {quote} -I am sorry, I am not following what you say here, in particular the very last sentence. I think the schema is ever pulled (a message is sent) _only_ in case the schema version reported from Gossiper is different; only after that do we ever send a message.- I am taking this back, you might be right here, I see what you mean, but this makes the whole solution even more complicated. When it comes to testing, I admit that adding the isRunningForcibly method feels like a hack, but I had a very hard time testing this stuff out. It was basically the only reasonable way possible at the time I was coding it; if you know of a better approach, please tell me, otherwise I am not sure what might be better here and we could stick with this for the time being? The whole testing methodology was based on these
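The reviewer's point quoted in the exchange above — when many endpoints all report the same divergent schema version, there is no need to pull it from every one of them — can be illustrated with a small sketch that caps the number of pull targets per distinct version. Names here are hypothetical, not Cassandra's real API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

// Sketch of per-version request limiting: endpoints already in agreement are
// skipped, and at most maxRequestsPerVersion endpoints are selected for each
// distinct divergent schema version.
public class VersionLimitedPullSketch
{
    static List<String> endpointsToPull(UUID localVersion,
                                        Map<String, UUID> reportedVersions,
                                        int maxRequestsPerVersion)
    {
        Map<UUID, Integer> requestsPerVersion = new HashMap<>();
        List<String> targets = new ArrayList<>();
        for (Map.Entry<String, UUID> e : reportedVersions.entrySet())
        {
            UUID remote = e.getValue();
            if (localVersion.equals(remote))
                continue; // already in agreement, nothing to pull
            int sent = requestsPerVersion.getOrDefault(remote, 0);
            if (sent >= maxRequestsPerVersion)
                continue; // enough requests planned for this version already
            requestsPerVersion.put(remote, sent + 1);
            targets.add(e.getKey());
        }
        return targets;
    }

    public static void main(String[] args)
    {
        UUID local = UUID.randomUUID();
        UUID other = UUID.randomUUID();
        Map<String, UUID> reported = new LinkedHashMap<>();
        reported.put("10.0.0.1", other);
        reported.put("10.0.0.2", other);
        reported.put("10.0.0.3", other); // three nodes, one divergent version
        System.out.println(endpointsToPull(local, reported, 1)); // prints [10.0.0.1]
    }
}
```

In a large cluster where all peers agree with each other but not with the joining node, this selects a bounded number of pull targets instead of one per peer, which is the storm the reviewer wanted to avoid.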
[jira] [Comment Edited] (CASSANDRA-15158) Wait for schema agreement rather than in flight schema requests when bootstrapping
[ https://issues.apache.org/jira/browse/CASSANDRA-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102374#comment-17102374 ] Stefan Miklosovic edited comment on CASSANDRA-15158 at 5/8/20, 8:42 AM: Hi [~bdeggleston], commenting on design issues, I am not completely sure if these issues you are talking about are related to this patch or they are already existing? We could indeed focus on the points you raised but it seems to me that the current (comitted) code is worse without this patch than with as I guess these problems are already there? Isn't the goal here to have all nodes on same versions? Isn't the very fact that there are multiple versions pretty strange to begin with so we should not even try to join a node if they mismatch hence there is nothing to deal with in the first place? {quote}It will only wait until it has _some_ schema to begin bootstrapping, not all {quote} This is the most likely not true unless I am not getting something. The node to be bootstrapped will never advance in doing so unless all nodes have same versions. {quote} For instance, if a single node is reporting a schema version that no one else has, but the node is unreachable, what do we do? {quote} We should fail whole bootstrapping and one should go and fix it. {quote}For instance, if a single node is reporting a schema version that no one else has, but the node is unreachable, what do we do? {quote} How can a node report its schema while being unreachable? {quote}Next, I like how this limits the number of messages sent to a given endpoint, but we should also limit the number of messages we send out for a given schema version. If we have a large cluster, and all nodes are reporting the same version, we don't need to ask every node for it's schema. {quote} -I am sorry, I am not following what you say here, in particular the very last sentence. 
I think a schema pull (a message) is _only_ sent when the schema version reported via Gossiper differs; only then do we send a message.- I am taking this back, you might be right here; I see what you mean, but this makes the whole solution even more complicated. When it comes to testing, I admit that adding the isRunningForcibly method feels like a hack, but I had a very hard time testing this. It was basically the only reasonable way possible at the time I was coding it; if you know of a better approach, please tell me, otherwise I am not sure what might be better here and we could stick with this for the time being. The whole testing methodology is based on these callbacks and checking their inner state, which results in methods that accept them so we can inspect their state. Without "injecting" them from outside, I would not be able to do that. was (Author: stefan.miklosovic): Hi [~bdeggleston], commenting on the design issues, I am not completely sure whether the issues you are talking about are related to this patch or whether they already exist. We could indeed focus on the points you raised, but it seems to me that the current (committed) code is worse without this patch than with it, since I guess these problems are already there. Isn't the goal here to have all nodes on the same schema version? Isn't the very fact that there are multiple versions strange to begin with, so we should not even try to join a node if they mismatch, and hence there is nothing to deal with in the first place? {quote}It will only wait until it has _some_ schema to begin bootstrapping, not all {quote} This is most likely not true, unless I am missing something. The node being bootstrapped will never advance unless all nodes have the same version. {quote} For instance, if a single node is reporting a schema version that no one else has, but the node is unreachable, what do we do? 
{quote} We should fail the whole bootstrap and the operator should go and fix it. {quote}For instance, if a single node is reporting a schema version that no one else has, but the node is unreachable, what do we do?{quote} How can a node report its schema while being unreachable? {quote}Next, I like how this limits the number of messages sent to a given endpoint, but we should also limit the number of messages we send out for a given schema version. If we have a large cluster, and all nodes are reporting the same version, we don't need to ask every node for its schema.{quote} I am sorry, I am not following what you say here, in particular the very last sentence. I think a schema pull (a message) is _only_ sent when the schema version reported via Gossiper differs; only then do we send a message. When it comes to testing, I admit that adding the isRunningForcibly method feels like a hack, but I had a very hard time testing this. It was basically the only reasonable way possible at the time I was coding it.
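The per-version limit discussed in the quoted review comment could be sketched roughly as follows. This is a hypothetical illustration only, not the actual MigrationCoordinator code; the class name SchemaPullThrottle, the field names, and the limit of 2 are all invented for the example.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: cap concurrent schema pull requests per schema version,
// so a large cluster that agrees on one version is not asked N times for it.
public class SchemaPullThrottle
{
    private static final int MAX_REQUESTS_PER_VERSION = 2;   // invented limit

    private final Map<UUID, AtomicInteger> inflightByVersion = new ConcurrentHashMap<>();

    /** Returns true if a pull for this schema version may be sent now. */
    public boolean tryAcquire(UUID schemaVersion)
    {
        AtomicInteger count = inflightByVersion.computeIfAbsent(schemaVersion, v -> new AtomicInteger());
        while (true)
        {
            int current = count.get();
            if (current >= MAX_REQUESTS_PER_VERSION)
                return false;                 // enough pulls already in flight for this version
            if (count.compareAndSet(current, current + 1))
                return true;
        }
    }

    /** Call when a pull for this version completes or fails. */
    public void release(UUID schemaVersion)
    {
        AtomicInteger count = inflightByVersion.get(schemaVersion);
        if (count != null)
            count.decrementAndGet();
    }
}
```

A caller would check tryAcquire before sending a pull to an endpoint and release in both the success and failure callbacks; versions no one reports anymore could additionally be evicted from the map.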
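The callback-injection testing approach described in the comment above could look something like this. A minimal sketch under stated assumptions: MigrationCallback, MigrationRunner, and RecordingCallback are invented names for illustration, not Cassandra classes; the point is only that the callback is passed in from outside so a test can inspect its recorded state.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical callback interface a schema pull would report to.
interface MigrationCallback
{
    void onCompleted(String endpoint);
    void onFailure(String endpoint, Throwable cause);
}

// Production code accepts the callback instead of constructing it internally,
// so a test can hand in an instance and examine it afterwards.
class MigrationRunner
{
    private final MigrationCallback callback;

    MigrationRunner(MigrationCallback callback) { this.callback = callback; }

    void pullSchema(String endpoint, boolean simulateFailure)
    {
        if (simulateFailure)
            callback.onFailure(endpoint, new RuntimeException("timeout"));
        else
            callback.onCompleted(endpoint);
    }
}

// A recording callback that a test would inject and then assert on.
class RecordingCallback implements MigrationCallback
{
    final List<String> completed = new ArrayList<>();
    final List<String> failed = new ArrayList<>();

    public void onCompleted(String endpoint) { completed.add(endpoint); }
    public void onFailure(String endpoint, Throwable cause) { failed.add(endpoint); }
}
```

Without this kind of injection, the callback's inner state is not reachable from a test, which is the difficulty the comment describes.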
[jira] [Comment Edited] (CASSANDRA-15158) Wait for schema agreement rather than in flight schema requests when bootstrapping
[ https://issues.apache.org/jira/browse/CASSANDRA-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17093752#comment-17093752 ] Stefan Miklosovic edited comment on CASSANDRA-15158 at 4/28/20, 7:35 PM: - Hi [~bdeggleston], I took the patch and reworked it a little bit: [https://github.com/smiklosovic/cassandra/tree/CASSANDRA-15158-2] Looking forward to some feedback! was (Author: stefan.miklosovic): Hi [~bdeggleston], I took the patch and reworked it a little bit: [https://github.com/smiklosovic/cassandra/commits/CASSANDRA-15158] You said that there is no way to confirm that any in-flight migration tasks have been completed and applied. I do not know how to verify that, but what I did is expose whether a particular migration task failed (based on the onFailure callback), so we can work on this further if necessary. Logically, it is the same as it was in the original patch, but the code is reorganised a bit. The "escape hatch" is one global bootstrap timeout; if it passes and schemas are still not in agreement, it is still unknown to me what we want to do - either fail completely and halt that node, or allow it to proceed with a big fat warning. Looking forward to some feedback! > Wait for schema agreement rather than in flight schema requests when > bootstrapping > -- > > Key: CASSANDRA-15158 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15158 > Project: Cassandra > Issue Type: Bug > Components: Cluster/Gossip, Cluster/Schema >Reporter: Vincent White >Assignee: Ben Bromhead >Priority: Normal > > Currently when a node is bootstrapping we use a set of latches > (org.apache.cassandra.service.MigrationTask#inflightTasks) to keep track of > in-flight schema pull requests, and we don't proceed with > bootstrapping/streaming until all the latches are released (or we time out > waiting for each one). 
One issue with this is that if we have a large schema, > or the retrieval of the schema from the other nodes was unexpectedly slow, > then we have no explicit check in place to ensure we have actually received a > schema before we proceed. > While it's possible to increase "migration_task_wait_in_seconds" to force the > node to wait on each latch longer, there are cases where this doesn't help > because the callbacks for the schema pull requests have expired off the > messaging service's callback map > (org.apache.cassandra.net.MessagingService#callbacks) after > request_timeout_in_ms (default 10 seconds) before the other nodes were able > to respond to the new node. > This patch checks for schema agreement between the bootstrapping node and the > rest of the live nodes before proceeding with bootstrapping. It also adds a > check to prevent the new node from flooding existing nodes with simultaneous > schema pull requests, as can happen in large clusters. > Removing the latch system should also prevent new nodes in large clusters > getting stuck for extended amounts of time as they wait > `migration_task_wait_in_seconds` on each of the latches left orphaned by the > timed out callbacks. > > ||3.11|| > |[PoC|https://github.com/apache/cassandra/compare/cassandra-3.11...vincewhite:check_for_schema]| > |[dtest|https://github.com/apache/cassandra-dtest/compare/master...vincewhite:wait_for_schema_agreement]| > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
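The agreement check the issue description proposes, as opposed to one latch per in-flight pull request, can be sketched as a simple poll over the schema versions that live nodes gossip. This is a simplified, hypothetical illustration, not the patch itself: the SchemaAgreement class and the Supplier-based view of endpoint versions are invented stand-ins for Cassandra's Gossiper APIs, and the 1-second poll interval is arbitrary.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import java.util.function.Supplier;

// Hypothetical sketch: bootstrap proceeds only once every live node reports
// the same schema version, or a global deadline passes.
public class SchemaAgreement
{
    /** Returns true if all live endpoints report a single schema version. */
    public static boolean inAgreement(Map<String, UUID> liveEndpointVersions)
    {
        Set<UUID> versions = new HashSet<>(liveEndpointVersions.values());
        return versions.size() == 1;
    }

    /** Polls the cluster view until agreement or until timeoutMillis elapses. */
    public static boolean waitForAgreement(Supplier<Map<String, UUID>> view, long timeoutMillis)
            throws InterruptedException
    {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (System.currentTimeMillis() < deadline)
        {
            if (inAgreement(view.get()))
                return true;
            Thread.sleep(1000);   // arbitrary settle interval for this sketch
        }
        return false;             // caller decides: fail the bootstrap or proceed with a warning
    }
}
```

Note how this sidesteps the orphaned-latch problem described above: there is no per-request state to expire, only the current gossiped versions, so a slow responder delays the loop rather than silently wedging it.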