[jira] [Commented] (CASSANDRA-14408) Transient Replication: Incremental & Validation repair handling of transient replicas
[ https://issues.apache.org/jira/browse/CASSANDRA-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599466#comment-16599466 ]

Ariel Weisberg commented on CASSANDRA-14408:

Obviously +1 from me as I committed it.

> Transient Replication: Incremental & Validation repair handling of transient replicas
> -------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-14408
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14408
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Repair
>            Reporter: Ariel Weisberg
>            Assignee: Blake Eggleston
>            Priority: Major
>             Fix For: 4.0
>
> At transient replicas anti-compaction shouldn't output any data for transient ranges as the data will be dropped after repair.
> Transient replicas should also never have data streamed to them.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-14407) Transient Replication: Add support for correct reads when transient replication is in use
[ https://issues.apache.org/jira/browse/CASSANDRA-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599465#comment-16599465 ]

Ariel Weisberg commented on CASSANDRA-14407:

Obviously +1 from me as I committed it.

> Transient Replication: Add support for correct reads when transient replication is in use
> -----------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-14407
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14407
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Coordination
>            Reporter: Ariel Weisberg
>            Assignee: Blake Eggleston
>            Priority: Major
>             Fix For: 4.0
>
> Digest reads should never be sent to transient replicas.
> Mismatches with results from transient replicas shouldn't trigger read repair.
> Read repair should never attempt to repair a transient replica.
> Reads should always include at least one full replica. They should also prefer transient replicas where possible.
> Range scans must ensure the entire scanned range performs replica selection that satisfies the requirement that every range scanned includes one full replica.
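The read rules above can be sketched as a small replica-selection function. This is an illustrative model, not Cassandra's actual read path or API: replicas are plain `(endpoint, is_transient)` tuples, and the function enforces the invariant that every read includes at least one full replica while preferring transient replicas for the remaining slots (digest reads, which never go to transient replicas, are out of scope here).

```python
def read_targets(replicas, count):
    """Pick `count` endpoints for a read from (endpoint, is_transient) tuples.

    Invariant from the ticket: at least one full replica is always included,
    because a transient replica may be missing data that repair has not yet
    moved to it. Transient replicas fill the remaining slots first, since
    they can often answer with little or no data to merge.
    """
    full = [ep for ep, transient in replicas if not transient]
    transient = [ep for ep, transient in replicas if transient]
    if not full:
        raise ValueError("every read must include at least one full replica")
    # One full replica for correctness, then prefer transient replicas.
    chosen = full[:1] + transient[:count - 1]
    # Top up with more full replicas if there weren't enough transients.
    chosen += full[1:1 + count - len(chosen)]
    return chosen

# RF 3 with one transient replica, reading at quorum (2 replicas):
print(read_targets([("10.0.0.1", False), ("10.0.0.2", False), ("10.0.0.3", True)], 2))
# ['10.0.0.1', '10.0.0.3']
```

At quorum the transient replica is used alongside exactly one full replica; only when more replicas are needed do additional full replicas join the read.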
[jira] [Commented] (CASSANDRA-14406) Transient Replication: Implement cheap quorum write optimizations
[ https://issues.apache.org/jira/browse/CASSANDRA-14406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599464#comment-16599464 ]

Ariel Weisberg commented on CASSANDRA-14406:

Obviously +1 from me as I committed it.

> Transient Replication: Implement cheap quorum write optimizations
> -----------------------------------------------------------------
>
>                 Key: CASSANDRA-14406
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14406
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Coordination
>            Reporter: Ariel Weisberg
>            Assignee: Blake Eggleston
>            Priority: Major
>             Fix For: 4.0
>
> Writes should never be sent to transient replicas unless necessary to satisfy the requested consistency level, such as when RF is not sufficient for strong consistency or not enough full replicas are marked as alive.
> If a write doesn't receive sufficient responses in time, additional replicas should be sent the write, similar to Rapid Read Protection.
> Hints should never be written for a transient replica.
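The cheap-quorum write rule above can be sketched in a few lines. This is a toy model under assumed names, not Cassandra's write path: replicas are `(endpoint, is_transient, is_alive)` tuples, live full replicas always receive the write, and transient replicas are enlisted only when the live full replicas alone cannot supply enough acks for the requested consistency level.

```python
def write_plan(replicas, required_acks):
    """Select write targets from (endpoint, is_transient, is_alive) tuples.

    All live full replicas always receive the write. Transient replicas are
    added only to cover the gap when full replicas can't satisfy the
    requested consistency level on their own.
    """
    full = [ep for ep, transient, alive in replicas if alive and not transient]
    plan = list(full)
    if len(plan) < required_acks:
        transients = [ep for ep, transient, alive in replicas if alive and transient]
        plan += transients[:required_acks - len(plan)]
    return plan

# RF 3, one transient replica, quorum of 2: with both full replicas up,
# the transient replica never sees the write.
print(write_plan([("f1", False, True), ("f2", False, True), ("t1", True, True)], 2))
# ['f1', 'f2']

# With one full replica down, the transient replica is enlisted.
print(write_plan([("f1", False, True), ("f2", False, False), ("t1", True, True)], 2))
# ['f1', 't1']
```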
[jira] [Commented] (CASSANDRA-14405) Transient Replication: Metadata refactor
[ https://issues.apache.org/jira/browse/CASSANDRA-14405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599463#comment-16599463 ]

Ariel Weisberg commented on CASSANDRA-14405:

Obviously +1 from me as I committed it.

> Transient Replication: Metadata refactor
> ----------------------------------------
>
>                 Key: CASSANDRA-14405
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14405
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Core, Distributed Metadata, Documentation and Website
>            Reporter: Ariel Weisberg
>            Assignee: Blake Eggleston
>            Priority: Major
>             Fix For: 4.0
>
> Add support to CQL and NTS for configuring keyspaces to have transient replicas.
> Add syntax allowing a keyspace using NTS to declare some replicas in each DC as transient.
> Implement metadata internal to the DB so that it's possible to identify which replicas are transient for a given token or range.
> Introduce Replica, which is an InetAddressAndPort and a boolean indicating whether the replica is transient, and ReplicatedRange, which is a wrapper around a Range that indicates if the range is transient.
> Block altering of keyspaces to use transient replication if they already contain MVs or 2i.
> Block the creation of MVs or 2i in keyspaces using transient replication.
> Block the creation/alteration of keyspaces using transient replication if the experimental flag is not set.
> Update the web site, CQL spec, and any other documentation for the new syntax.
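The shape of the Replica metadata described in the ticket can be sketched as follows. This is a hedged illustration: the real class is Java and wraps an InetAddressAndPort, and the field and helper names here are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    """An endpoint plus a flag saying whether it replicates a range fully
    or only transiently (i.e., holds data only until repair moves it)."""
    endpoint: str   # stands in for InetAddressAndPort
    transient: bool


def transient_endpoints(replicas):
    """Return the endpoints that only transiently replicate a given range."""
    return [r.endpoint for r in replicas if r.transient]


replicas = [
    Replica("127.0.0.1", False),
    Replica("127.0.0.2", False),
    Replica("127.0.0.3", True),  # transient: data dropped after repair
]
print(transient_endpoints(replicas))  # ['127.0.0.3']
```

The point of making transience a property of the replica (rather than of the node) is that the same node can be a full replica for one range and a transient replica for another.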
[jira] [Updated] (CASSANDRA-14406) Transient Replication: Implement cheap quorum write optimizations
[ https://issues.apache.org/jira/browse/CASSANDRA-14406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ariel Weisberg updated CASSANDRA-14406:
---------------------------------------
    Fix Version/s: 4.0
[jira] [Updated] (CASSANDRA-14408) Transient Replication: Incremental & Validation repair handling of transient replicas
[ https://issues.apache.org/jira/browse/CASSANDRA-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ariel Weisberg updated CASSANDRA-14408:
---------------------------------------
           Status: Ready to Commit  (was: Patch Available)
[jira] [Updated] (CASSANDRA-14409) Transient Replication: Support ring changes when transient replication is in use (add/remove node, change RF, add/remove DC)
[ https://issues.apache.org/jira/browse/CASSANDRA-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ariel Weisberg updated CASSANDRA-14409:
---------------------------------------
       Resolution: Fixed
           Status: Resolved  (was: Ready to Commit)

The initial implementation of Transient Replication and Cheap Quorums was committed as [f7431b432875e334170ccdb19934d05545d2cebd|https://github.com/apache/cassandra/commit/f7431b432875e334170ccdb19934d05545d2cebd].

> Transient Replication: Support ring changes when transient replication is in use (add/remove node, change RF, add/remove DC)
> ----------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-14409
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14409
>             Project: Cassandra
>          Issue Type: Sub-task
>          Components: Coordination, Core, Documentation and Website
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>            Priority: Major
>             Fix For: 4.0
>
> The additional state transitions that transient replication introduces require streaming and nodetool cleanup to behave differently. We already have code that does the streaming, but in some cases we shouldn't stream any data, and in others, when we stream to receive data, we have to make sure we stream from a full replica and not a transient replica.
> Transitioning from not replicated to transiently replicated means that a node must stay pending until the next incremental repair completes, at which point the data for that range is known to be available at full replicas.
> Transitioning from transiently replicated to fully replicated requires streaming from a full replica and is identical to how we stream from not replicated to replicated. The transition must be managed so the transient replica is not read from as a full replica until streaming completes. It can be used immediately for a write quorum.
> Transitioning from fully replicated to transiently replicated requires cleanup to remove repaired data from the transiently replicated range to reclaim space. It can be used immediately for a write quorum.
> Transitioning from transiently replicated to not replicated requires cleanup to be run to remove the formerly transiently replicated data.
> nodetool move, removenode, cleanup, decommission, and rebuild need to handle these issues, as does bootstrap.
> Update the web site, documentation, and NEWS.txt with a description of the steps for doing common operations: add/remove DC, add/remove node(s), replace node, change RF.
[jira] [Updated] (CASSANDRA-14409) Transient Replication: Support ring changes when transient replication is in use (add/remove node, change RF, add/remove DC)
[ https://issues.apache.org/jira/browse/CASSANDRA-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ariel Weisberg updated CASSANDRA-14409:
---------------------------------------
           Status: Ready to Commit  (was: Patch Available)
[jira] [Updated] (CASSANDRA-14408) Transient Replication: Incremental & Validation repair handling of transient replicas
[ https://issues.apache.org/jira/browse/CASSANDRA-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ariel Weisberg updated CASSANDRA-14408:
---------------------------------------
       Resolution: Fixed
           Status: Resolved  (was: Ready to Commit)

The initial implementation of Transient Replication and Cheap Quorums was committed as [f7431b432875e334170ccdb19934d05545d2cebd|https://github.com/apache/cassandra/commit/f7431b432875e334170ccdb19934d05545d2cebd].
[jira] [Updated] (CASSANDRA-14407) Transient Replication: Add support for correct reads when transient replication is in use
[ https://issues.apache.org/jira/browse/CASSANDRA-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ariel Weisberg updated CASSANDRA-14407:
---------------------------------------
       Resolution: Fixed
           Status: Resolved  (was: Ready to Commit)

The initial implementation of Transient Replication and Cheap Quorums was committed as [f7431b432875e334170ccdb19934d05545d2cebd|https://github.com/apache/cassandra/commit/f7431b432875e334170ccdb19934d05545d2cebd].
[jira] [Updated] (CASSANDRA-14406) Transient Replication: Implement cheap quorum write optimizations
[ https://issues.apache.org/jira/browse/CASSANDRA-14406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ariel Weisberg updated CASSANDRA-14406:
---------------------------------------
           Status: Patch Available  (was: Open)
[jira] [Updated] (CASSANDRA-14406) Transient Replication: Implement cheap quorum write optimizations
[ https://issues.apache.org/jira/browse/CASSANDRA-14406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ariel Weisberg updated CASSANDRA-14406:
---------------------------------------
           Status: Ready to Commit  (was: Patch Available)
[jira] [Updated] (CASSANDRA-14407) Transient Replication: Add support for correct reads when transient replication is in use
[ https://issues.apache.org/jira/browse/CASSANDRA-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ariel Weisberg updated CASSANDRA-14407:
---------------------------------------
           Status: Ready to Commit  (was: Patch Available)
[jira] [Updated] (CASSANDRA-14405) Transient Replication: Metadata refactor
[ https://issues.apache.org/jira/browse/CASSANDRA-14405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ariel Weisberg updated CASSANDRA-14405:
---------------------------------------
       Resolution: Fixed
           Status: Resolved  (was: Ready to Commit)
[jira] [Commented] (CASSANDRA-14405) Transient Replication: Metadata refactor
[ https://issues.apache.org/jira/browse/CASSANDRA-14405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599458#comment-16599458 ]

Ariel Weisberg commented on CASSANDRA-14405:

The initial implementation of Transient Replication and Cheap Quorums was committed as [f7431b432875e334170ccdb19934d05545d2cebd|https://github.com/apache/cassandra/commit/f7431b432875e334170ccdb19934d05545d2cebd].
[jira] [Updated] (CASSANDRA-14406) Transient Replication: Implement cheap quorum write optimizations
[ https://issues.apache.org/jira/browse/CASSANDRA-14406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ariel Weisberg updated CASSANDRA-14406:
---------------------------------------
       Resolution: Fixed
           Status: Resolved  (was: Ready to Commit)

The initial implementation of Transient Replication and Cheap Quorums was committed as [f7431b432875e334170ccdb19934d05545d2cebd|https://github.com/apache/cassandra/commit/f7431b432875e334170ccdb19934d05545d2cebd].
[jira] [Updated] (CASSANDRA-14405) Transient Replication: Metadata refactor
[ https://issues.apache.org/jira/browse/CASSANDRA-14405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ariel Weisberg updated CASSANDRA-14405:
---------------------------------------
           Status: Ready to Commit  (was: Patch Available)
[jira] [Updated] (CASSANDRA-14407) Transient Replication: Add support for correct reads when transient replication is in use
[ https://issues.apache.org/jira/browse/CASSANDRA-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ariel Weisberg updated CASSANDRA-14407:
---------------------------------------
           Status: Patch Available  (was: Open)
[jira] [Commented] (CASSANDRA-14404) Transient Replication & Cheap Quorums: Decouple storage requirements from consensus group size using incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-14404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599457#comment-16599457 ]

Ariel Weisberg commented on CASSANDRA-14404:

The initial implementation of Transient Replication and Cheap Quorums was committed as [f7431b432875e334170ccdb19934d05545d2cebd|https://github.com/apache/cassandra/commit/f7431b432875e334170ccdb19934d05545d2cebd].

> Transient Replication & Cheap Quorums: Decouple storage requirements from consensus group size using incremental repair
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-14404
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14404
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Coordination, Core, CQL, Distributed Metadata, Hints, Local Write-Read Paths, Materialized Views, Repair, Secondary Indexes, Testing, Tools
>            Reporter: Ariel Weisberg
>            Assignee: Ariel Weisberg
>            Priority: Major
>             Fix For: 4.0
>
> Transient Replication is an implementation of [Witness Replicas|http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.146.3429=rep1=pdf] that leverages incremental repair to make full replicas consistent with transient replicas that don't store the entire data set. Witness replicas are used in real-world systems such as Megastore and Spanner to increase availability inexpensively without having to commit to more full copies of the database. Transient replicas implement functionality similar to upgradable and temporary replicas from the paper.
> With transient replication, the replication factor is increased beyond the desired level of data redundancy by adding replicas that only store data when sufficient full replicas are unavailable to store it. These replicas are called transient replicas. When incremental repair runs, transient replicas stream any data they have received to full replicas, and once the data is fully replicated it is dropped at the transient replicas.
> Cheap quorums are a further set of optimizations on the write path to avoid writing to transient replicas unless sufficient full replicas are available, as well as optimizations on the read path to prefer reading from transient replicas. When writing at quorum to a table configured to use transient replication, the quorum will always prefer available full replicas over transient replicas so that transient replicas don't have to process writes. Rapid write protection (similar to rapid read protection) reduces tail latency when full replicas are slow or unavailable to respond by sending writes to additional replicas if necessary.
> Transient replicas can generally service reads faster because they don't have to do anything beyond bloom filter checks if they have no data. With vnodes and larger clusters they will not have a large quantity of data even in failure cases where transient replicas start to serve a steady amount of write traffic for some of their transiently replicated ranges.
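The rapid write protection described above can be modeled as a simple decision step. This is a toy sketch, not Cassandra's implementation: it assumes the write initially targets only the full replicas, and fans out to additional (here, transient) replicas once the required acks have not arrived within the timeout, mirroring rapid read protection.

```python
def targets_after_timeout(full_replicas, transient_replicas, acks_received, acks_needed):
    """Decide which replicas should hold the write once the ack timeout fires.

    Full replicas were already sent the write; if they haven't produced
    enough acks, fan out to just enough extra replicas to cover the gap.
    """
    targets = list(full_replicas)
    if acks_received < acks_needed:
        targets += transient_replicas[:acks_needed - acks_received]
    return targets


# Quorum of 2: f2 is slow, so the transient replica t1 also receives the write.
print(targets_after_timeout(["f1", "f2"], ["t1"], acks_received=1, acks_needed=2))
# ['f1', 'f2', 't1']
```

When all full replicas respond in time, the transient replica never processes the write, which is what keeps these quorums "cheap".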
cassandra-dtest git commit: Transient Replication and Cheap Quorums, update existing tests
Repository: cassandra-dtest
Updated Branches:
  refs/heads/master 3d760e6da -> 4e1c05565


Transient Replication and Cheap Quorums, update existing tests

Patch by Ariel Weisberg; Reviewed by Blake Eggleston for CASSANDRA-14404

Co-authored-by: Blake Eggleston
Co-authored-by: Alex Petrov


Project: http://git-wip-us.apache.org/repos/asf/cassandra-dtest/repo
Commit: http://git-wip-us.apache.org/repos/asf/cassandra-dtest/commit/4e1c0556
Tree: http://git-wip-us.apache.org/repos/asf/cassandra-dtest/tree/4e1c0556
Diff: http://git-wip-us.apache.org/repos/asf/cassandra-dtest/diff/4e1c0556

Branch: refs/heads/master
Commit: 4e1c05565aada57466b8edcdff43f1c7ebb7cd3e
Parents: 3d760e6
Author: Ariel Weisberg
Authored: Fri Jun 22 12:28:30 2018 -0700
Committer: Ariel Weisberg
Committed: Fri Aug 31 21:41:09 2018 -0400

----------------------------------------------------------------------
 byteman/failing_repair.btm                    |  7
 byteman/read_repair/sorted_live_endpoints.btm | 13 ++-
 byteman/read_repair/stop_data_reads.btm       |  2 +-
 byteman/read_repair/stop_digest_reads.btm     |  2 +-
 byteman/slow_writes.btm                       |  7
 byteman/stop_reads.btm                        |  8
 byteman/stop_rr_writes.btm                    |  8
 byteman/stop_writes.btm                       |  8
 byteman/throw_on_digest.btm                   |  7
 read_repair_test.py                           | 44 --
 repair_tests/repair_test.py                   |  3 +-
 11 files changed, 93 insertions(+), 16 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/cassandra-dtest/blob/4e1c0556/byteman/failing_repair.btm
----------------------------------------------------------------------
diff --git a/byteman/failing_repair.btm b/byteman/failing_repair.btm
new file mode 100644
index 000..ea82888
--- /dev/null
+++ b/byteman/failing_repair.btm
@@ -0,0 +1,7 @@
+RULE fail repairs
+CLASS org.apache.cassandra.repair.RepairMessageVerbHandler
+METHOD doVerb
+AT ENTRY
+IF true
+DO throw new RuntimeException("Repair failed");
+ENDRULE


http://git-wip-us.apache.org/repos/asf/cassandra-dtest/blob/4e1c0556/byteman/read_repair/sorted_live_endpoints.btm
----------------------------------------------------------------------
diff --git a/byteman/read_repair/sorted_live_endpoints.btm b/byteman/read_repair/sorted_live_endpoints.btm
index 221e958..bfcfb1a 100644
--- a/byteman/read_repair/sorted_live_endpoints.btm
+++ b/byteman/read_repair/sorted_live_endpoints.btm
@@ -1,15 +1,8 @@
 RULE sorted live endpoints
-CLASS org.apache.cassandra.service.StorageProxy
-METHOD getLiveSortedEndpoints
+CLASS org.apache.cassandra.locator.SimpleSnitch
+METHOD sortedByProximity
 AT ENTRY
-BIND ep1 = org.apache.cassandra.locator.InetAddressAndPort.getByName("127.0.0.1");
-     ep2 = org.apache.cassandra.locator.InetAddressAndPort.getByName("127.0.0.2");
-     ep3 = org.apache.cassandra.locator.InetAddressAndPort.getByName("127.0.0.3");
-     eps = new java.util.ArrayList();
 IF true
 DO
-eps.add(ep1);
-eps.add(ep2);
-eps.add(ep3);
-return eps;
+return $unsortedAddress.sorted(java.util.Comparator.naturalOrder());
 ENDRULE
\ No newline at end of file


http://git-wip-us.apache.org/repos/asf/cassandra-dtest/blob/4e1c0556/byteman/read_repair/stop_data_reads.btm
----------------------------------------------------------------------
diff --git a/byteman/read_repair/stop_data_reads.btm b/byteman/read_repair/stop_data_reads.btm
index 9506aba..905a110 100644
--- a/byteman/read_repair/stop_data_reads.btm
+++ b/byteman/read_repair/stop_data_reads.btm
@@ -4,7 +4,7 @@ CLASS org.apache.cassandra.db.ReadCommandVerbHandler
 METHOD doVerb
 # wait until command is declared locally. because generics
 AFTER WRITE $command
-# bail out if it's not a digest request
+# bail out if it's a data request
 IF NOT $command.isDigestQuery()
 DO return;
 ENDRULE


http://git-wip-us.apache.org/repos/asf/cassandra-dtest/blob/4e1c0556/byteman/read_repair/stop_digest_reads.btm
----------------------------------------------------------------------
diff --git a/byteman/read_repair/stop_digest_reads.btm b/byteman/read_repair/stop_digest_reads.btm
index 92c54f6..adb9b31 100644
--- a/byteman/read_repair/stop_digest_reads.btm
+++ b/byteman/read_repair/stop_digest_reads.btm
@@ -4,7 +4,7 @@ CLASS org.apache.cassandra.db.ReadCommandVerbHandler
 METHOD doVerb
 # wait until command is declared locally. because generics
 AFTER WRITE $command
-# bail out if it's not a digest request
+# bail out if it's a digest request
 IF $command.isDigestQuery()
 DO return;
 ENDRULE


http://git-wip-us.apache.org/repos/asf/cassandra-dtest/blob/4e1c0556/byteman/slow_writes.btm
----------------------------------------------------------------------
diff --git a/byteman/slow_writes.btm b/byteman/slow_writes.btm
new file mode 100644
index 000..a82dd0a
--- /dev/null
+++
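The stop_data_reads and stop_digest_reads rules above both return early from ReadCommandVerbHandler.doVerb, distinguished only by the polarity of the isDigestQuery() check. A minimal Python model of the two predicates (illustrative only, not part of the patch):

```python
def should_drop(command_is_digest_query, rule):
    """Model of the two byteman early-return rules above.

    "stop_data_reads" returns early for data requests
    (IF NOT $command.isDigestQuery()); "stop_digest_reads"
    returns early for digest requests (IF $command.isDigestQuery()).
    """
    if rule == "stop_data_reads":
        return not command_is_digest_query
    if rule == "stop_digest_reads":
        return command_is_digest_query
    raise ValueError("unknown rule: " + rule)


# Data reads are dropped by stop_data_reads but pass stop_digest_reads.
assert should_drop(False, "stop_data_reads") is True
assert should_drop(False, "stop_digest_reads") is False
# Digest reads: the opposite.
assert should_drop(True, "stop_data_reads") is False
assert should_drop(True, "stop_digest_reads") is True
```

Together the two rules let a dtest disable exactly one half of a speculative read, which is what the updated read_repair_test.py cases need.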
[11/18] cassandra git commit: Transient Replication and Cheap Quorums
http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/src/java/org/apache/cassandra/repair/SymmetricLocalSyncTask.java -- diff --git a/src/java/org/apache/cassandra/repair/SymmetricLocalSyncTask.java b/src/java/org/apache/cassandra/repair/SymmetricLocalSyncTask.java new file mode 100644 index 000..7eedab7 --- /dev/null +++ b/src/java/org/apache/cassandra/repair/SymmetricLocalSyncTask.java @@ -0,0 +1,142 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.cassandra.repair; + +import java.util.Collections; +import java.util.List; +import java.util.UUID; + +import com.google.common.annotations.VisibleForTesting; +import org.slf4j.Logger; +import org.slf4j.LoggerFactory; + +import org.apache.cassandra.dht.Range; +import org.apache.cassandra.dht.Token; +import org.apache.cassandra.locator.InetAddressAndPort; +import org.apache.cassandra.locator.RangesAtEndpoint; +import org.apache.cassandra.streaming.PreviewKind; +import org.apache.cassandra.streaming.ProgressInfo; +import org.apache.cassandra.streaming.StreamEvent; +import org.apache.cassandra.streaming.StreamEventHandler; +import org.apache.cassandra.streaming.StreamOperation; +import org.apache.cassandra.streaming.StreamPlan; +import org.apache.cassandra.streaming.StreamState; +import org.apache.cassandra.tracing.TraceState; +import org.apache.cassandra.tracing.Tracing; +import org.apache.cassandra.utils.FBUtilities; + +/** + * SymmetricLocalSyncTask performs streaming between local(coordinator) node and remote replica. 
+ */ +public class SymmetricLocalSyncTask extends SymmetricSyncTask implements StreamEventHandler +{ +private final TraceState state = Tracing.instance.get(); + +private static final Logger logger = LoggerFactory.getLogger(SymmetricLocalSyncTask.class); + +private final boolean remoteIsTransient; +private final UUID pendingRepair; +private final boolean pullRepair; + +public SymmetricLocalSyncTask(RepairJobDesc desc, TreeResponse r1, TreeResponse r2, boolean remoteIsTransient, UUID pendingRepair, boolean pullRepair, PreviewKind previewKind) +{ +super(desc, r1, r2, previewKind); +this.remoteIsTransient = remoteIsTransient; +this.pendingRepair = pendingRepair; +this.pullRepair = pullRepair; +} + +@VisibleForTesting +StreamPlan createStreamPlan(InetAddressAndPort dst, List> differences) +{ +StreamPlan plan = new StreamPlan(StreamOperation.REPAIR, 1, false, pendingRepair, previewKind) + .listeners(this) + .flushBeforeTransfer(pendingRepair == null) + // see comment on RangesAtEndpoint.toDummyList for why we synthesize replicas here + .requestRanges(dst, desc.keyspace, RangesAtEndpoint.toDummyList(differences), + RangesAtEndpoint.toDummyList(Collections.emptyList()), desc.columnFamily); // request ranges from the remote node + +if (!pullRepair && !remoteIsTransient) +{ +// send ranges to the remote node if we are not performing a pull repair +// see comment on RangesAtEndpoint.toDummyList for why we synthesize replicas here +plan.transferRanges(dst, desc.keyspace, RangesAtEndpoint.toDummyList(differences), desc.columnFamily); +} + +return plan; +} + +/** + * Starts sending/receiving our list of differences to/from the remote endpoint: creates a callback + * that will be called out of band once the streams complete. 
+ */ +@Override +protected void startSync(List> differences) +{ +InetAddressAndPort local = FBUtilities.getBroadcastAddressAndPort(); +// We can take anyone of the node as source or destination, however if one is localhost, we put at source to avoid a forwarding +InetAddressAndPort dst = r2.endpoint.equals(local) ? r1.endpoint : r2.endpoint; + +String message = String.format("Performing streaming repair of %d ranges with %s", differences.size(), dst); +logger.info("{} {}",
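The startSync/createStreamPlan logic above makes two decisions: prefer the non-local endpoint as the stream peer, and only transfer data back when the repair is not a pull repair and the remote replica is not transient. A sketch of that decision logic, modeled in Python for illustration (endpoint names are hypothetical):

```python
def plan_streams(local, ep1, ep2, pull_repair, remote_is_transient):
    """Illustrative model of SymmetricLocalSyncTask: pick the non-local
    endpoint as the peer, always request the differing ranges from it,
    and transfer our copy back only when this is not a pull repair and
    the remote replica is not transient."""
    # mirrors: dst = r2.endpoint.equals(local) ? r1.endpoint : r2.endpoint
    dst = ep1 if ep2 == local else ep2
    actions = ["request ranges from " + dst]
    if not pull_repair and not remote_is_transient:
        actions.append("transfer ranges to " + dst)
    return actions


# A transient remote never receives streamed data, matching the ticket's
# requirement that transient replicas are never streamed to during repair.
assert plan_streams("n1", "n1", "n2", pull_repair=False,
                    remote_is_transient=True) == ["request ranges from n2"]
assert plan_streams("n1", "n1", "n2", pull_repair=False,
                    remote_is_transient=False) == ["request ranges from n2",
                                                   "transfer ranges to n2"]
```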
[03/18] cassandra git commit: Transient Replication and Cheap Quorums
http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/test/unit/org/apache/cassandra/service/ActiveRepairServiceTest.java -- diff --git a/test/unit/org/apache/cassandra/service/ActiveRepairServiceTest.java b/test/unit/org/apache/cassandra/service/ActiveRepairServiceTest.java index 294731a..4f7cde0 100644 --- a/test/unit/org/apache/cassandra/service/ActiveRepairServiceTest.java +++ b/test/unit/org/apache/cassandra/service/ActiveRepairServiceTest.java @@ -20,8 +20,6 @@ package org.apache.cassandra.service; import java.util.*; -import javax.xml.crypto.Data; - import com.google.common.collect.ImmutableList; import com.google.common.collect.Sets; import org.junit.Assert; @@ -36,13 +34,13 @@ import org.apache.cassandra.db.Keyspace; import org.apache.cassandra.db.RowUpdateBuilder; import org.apache.cassandra.db.lifecycle.SSTableSet; import org.apache.cassandra.db.lifecycle.View; -import org.apache.cassandra.dht.IPartitioner; import org.apache.cassandra.dht.Range; import org.apache.cassandra.dht.Token; import org.apache.cassandra.exceptions.ConfigurationException; import org.apache.cassandra.io.sstable.format.SSTableReader; import org.apache.cassandra.locator.AbstractReplicationStrategy; import org.apache.cassandra.locator.InetAddressAndPort; +import org.apache.cassandra.locator.Replica; import org.apache.cassandra.locator.TokenMetadata; import org.apache.cassandra.repair.messages.RepairOption; import org.apache.cassandra.streaming.PreviewKind; @@ -107,13 +105,13 @@ public class ActiveRepairServiceTest public void testGetNeighborsPlusOne() throws Throwable { // generate rf+1 nodes, and ensure that all nodes are returned -Set expected = addTokens(1 + Keyspace.open(KEYSPACE5).getReplicationStrategy().getReplicationFactor()); +Set expected = addTokens(1 + Keyspace.open(KEYSPACE5).getReplicationStrategy().getReplicationFactor().allReplicas); expected.remove(FBUtilities.getBroadcastAddressAndPort()); -Collection> ranges = 
StorageService.instance.getLocalRanges(KEYSPACE5); +Iterable> ranges = StorageService.instance.getLocalReplicas(KEYSPACE5).ranges(); Set neighbors = new HashSet<>(); for (Range range : ranges) { -neighbors.addAll(ActiveRepairService.getNeighbors(KEYSPACE5, ranges, range, null, null)); +neighbors.addAll(ActiveRepairService.getNeighbors(KEYSPACE5, ranges, range, null, null).endpoints()); } assertEquals(expected, neighbors); } @@ -124,19 +122,19 @@ public class ActiveRepairServiceTest TokenMetadata tmd = StorageService.instance.getTokenMetadata(); // generate rf*2 nodes, and ensure that only neighbors specified by the ARS are returned -addTokens(2 * Keyspace.open(KEYSPACE5).getReplicationStrategy().getReplicationFactor()); +addTokens(2 * Keyspace.open(KEYSPACE5).getReplicationStrategy().getReplicationFactor().allReplicas); AbstractReplicationStrategy ars = Keyspace.open(KEYSPACE5).getReplicationStrategy(); Set expected = new HashSet<>(); -for (Range replicaRange : ars.getAddressRanges().get(FBUtilities.getBroadcastAddressAndPort())) +for (Replica replica : ars.getAddressReplicas().get(FBUtilities.getBroadcastAddressAndPort())) { - expected.addAll(ars.getRangeAddresses(tmd.cloneOnlyTokenMap()).get(replicaRange)); + expected.addAll(ars.getRangeAddresses(tmd.cloneOnlyTokenMap()).get(replica.range()).endpoints()); } expected.remove(FBUtilities.getBroadcastAddressAndPort()); -Collection> ranges = StorageService.instance.getLocalRanges(KEYSPACE5); +Iterable> ranges = StorageService.instance.getLocalReplicas(KEYSPACE5).ranges(); Set neighbors = new HashSet<>(); for (Range range : ranges) { -neighbors.addAll(ActiveRepairService.getNeighbors(KEYSPACE5, ranges, range, null, null)); +neighbors.addAll(ActiveRepairService.getNeighbors(KEYSPACE5, ranges, range, null, null).endpoints()); } assertEquals(expected, neighbors); } @@ -147,18 +145,18 @@ public class ActiveRepairServiceTest TokenMetadata tmd = StorageService.instance.getTokenMetadata(); // generate rf+1 nodes, and ensure 
that all nodes are returned -Set expected = addTokens(1 + Keyspace.open(KEYSPACE5).getReplicationStrategy().getReplicationFactor()); +Set expected = addTokens(1 + Keyspace.open(KEYSPACE5).getReplicationStrategy().getReplicationFactor().allReplicas); expected.remove(FBUtilities.getBroadcastAddressAndPort()); // remove remote endpoints TokenMetadata.Topology topology = tmd.cloneOnlyTokenMap().getTopology(); HashSet localEndpoints =
[15/18] cassandra git commit: Transient Replication and Cheap Quorums
http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/src/java/org/apache/cassandra/dht/RangeStreamer.java -- diff --git a/src/java/org/apache/cassandra/dht/RangeStreamer.java b/src/java/org/apache/cassandra/dht/RangeStreamer.java index 110fed6..e8aa5d3 100644 --- a/src/java/org/apache/cassandra/dht/RangeStreamer.java +++ b/src/java/org/apache/cassandra/dht/RangeStreamer.java @@ -18,27 +18,40 @@ package org.apache.cassandra.dht; import java.util.*; +import java.util.function.BiFunction; +import java.util.function.Function; +import java.util.stream.Collectors; import com.google.common.annotations.VisibleForTesting; import com.google.common.base.Preconditions; -import com.google.common.collect.ArrayListMultimap; +import com.google.common.base.Predicate; import com.google.common.collect.HashMultimap; +import com.google.common.collect.Iterables; +import com.google.common.collect.Iterators; import com.google.common.collect.Multimap; -import com.google.common.collect.Sets; +import org.apache.cassandra.gms.FailureDetector; +import org.apache.cassandra.locator.Endpoints; +import org.apache.cassandra.locator.EndpointsByReplica; import org.apache.cassandra.locator.InetAddressAndPort; import org.apache.cassandra.locator.LocalStrategy; +import org.apache.cassandra.locator.EndpointsByRange; +import org.apache.cassandra.locator.EndpointsForRange; +import org.apache.cassandra.locator.RangesAtEndpoint; +import org.apache.cassandra.locator.ReplicaCollection.Mutable.Conflict; import org.apache.commons.lang3.StringUtils; import org.slf4j.Logger; import org.slf4j.LoggerFactory; import org.apache.cassandra.db.Keyspace; -import org.apache.cassandra.gms.EndpointState; import org.apache.cassandra.gms.Gossiper; import org.apache.cassandra.gms.IFailureDetector; import org.apache.cassandra.locator.AbstractReplicationStrategy; import org.apache.cassandra.locator.IEndpointSnitch; +import org.apache.cassandra.locator.Replica; +import org.apache.cassandra.locator.ReplicaCollection; 
+import org.apache.cassandra.locator.Replicas; import org.apache.cassandra.locator.TokenMetadata; import org.apache.cassandra.service.StorageService; import org.apache.cassandra.streaming.PreviewKind; @@ -47,13 +60,25 @@ import org.apache.cassandra.streaming.StreamResultFuture; import org.apache.cassandra.streaming.StreamOperation; import org.apache.cassandra.utils.FBUtilities; +import static com.google.common.base.Predicates.and; +import static com.google.common.base.Predicates.not; +import static com.google.common.collect.Iterables.all; +import static com.google.common.collect.Iterables.any; +import static org.apache.cassandra.locator.Replica.fullReplica; + /** - * Assists in streaming ranges to a node. + * Assists in streaming ranges to this node. */ public class RangeStreamer { private static final Logger logger = LoggerFactory.getLogger(RangeStreamer.class); +public static Predicate ALIVE_PREDICATE = replica -> + (!Gossiper.instance.isEnabled() || + (Gossiper.instance.getEndpointStateForEndpoint(replica.endpoint()) == null || + Gossiper.instance.getEndpointStateForEndpoint(replica.endpoint()).isAlive())) && + FailureDetector.instance.isAlive(replica.endpoint()); + /* bootstrap tokens. can be null if replacing the node. */ private final Collection tokens; /* current token ring */ @@ -62,26 +87,59 @@ public class RangeStreamer private final InetAddressAndPort address; /* streaming description */ private final String description; -private final Multimap>>> toFetch = HashMultimap.create(); -private final Set sourceFilters = new HashSet<>(); +private final Multimap> toFetch = HashMultimap.create(); +private final Set> sourceFilters = new HashSet<>(); private final StreamPlan streamPlan; private final boolean useStrictConsistency; private final IEndpointSnitch snitch; private final StreamStateStore stateStore; -/** - * A filter applied to sources to stream from when constructing a fetch map. 
- */ -public static interface ISourceFilter +public static class FetchReplica { -public boolean shouldInclude(InetAddressAndPort endpoint); +public final Replica local; +public final Replica remote; + +public FetchReplica(Replica local, Replica remote) +{ +Preconditions.checkNotNull(local); +Preconditions.checkNotNull(remote); +assert local.isLocal() && !remote.isLocal(); +this.local = local; +this.remote = remote; +} + +public String toString() +{ +return "FetchReplica{" + +
[04/18] cassandra git commit: Transient Replication and Cheap Quorums
http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/test/unit/org/apache/cassandra/locator/OldNetworkTopologyStrategyTest.java -- diff --git a/test/unit/org/apache/cassandra/locator/OldNetworkTopologyStrategyTest.java b/test/unit/org/apache/cassandra/locator/OldNetworkTopologyStrategyTest.java index 9c90d57..4afeb5a 100644 --- a/test/unit/org/apache/cassandra/locator/OldNetworkTopologyStrategyTest.java +++ b/test/unit/org/apache/cassandra/locator/OldNetworkTopologyStrategyTest.java @@ -20,7 +20,6 @@ package org.apache.cassandra.locator; import java.net.UnknownHostException; import java.util.ArrayList; import java.util.Arrays; -import java.util.Collection; import java.util.Collections; import java.util.HashMap; import java.util.List; @@ -39,9 +38,11 @@ import org.apache.cassandra.service.StorageService; import org.apache.cassandra.utils.Pair; import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertTrue; public class OldNetworkTopologyStrategyTest { + private List keyTokens; private TokenMetadata tmd; private Map> expectedResults; @@ -53,7 +54,7 @@ public class OldNetworkTopologyStrategyTest } @Before -public void init() +public void init() throws Exception { keyTokens = new ArrayList(); tmd = new TokenMetadata(); @@ -160,11 +161,11 @@ public class OldNetworkTopologyStrategyTest { for (Token keyToken : keyTokens) { -List endpoints = strategy.getNaturalEndpoints(keyToken); -for (int j = 0; j < endpoints.size(); j++) +int j = 0; +for (InetAddressAndPort endpoint : strategy.getNaturalReplicasForToken(keyToken).endpoints()) { ArrayList hostsExpected = expectedResults.get(keyToken.toString()); -assertEquals(endpoints.get(j), hostsExpected.get(j)); +assertEquals(endpoint, hostsExpected.get(j++)); } } } @@ -188,7 +189,6 @@ public class OldNetworkTopologyStrategyTest assertEquals(ranges.left.iterator().next().left, tokensAfterMove[movingNodeIdx]); assertEquals(ranges.left.iterator().next().right, tokens[movingNodeIdx]); assertEquals("No 
data should be fetched", ranges.right.size(), 0); - } @Test @@ -205,7 +205,6 @@ public class OldNetworkTopologyStrategyTest assertEquals("No data should be streamed", ranges.left.size(), 0); assertEquals(ranges.right.iterator().next().left, tokens[movingNodeIdx]); assertEquals(ranges.right.iterator().next().right, tokensAfterMove[movingNodeIdx]); - } @SuppressWarnings("unchecked") @@ -366,16 +365,21 @@ public class OldNetworkTopologyStrategyTest TokenMetadata tokenMetadataAfterMove = initTokenMetadata(tokensAfterMove); AbstractReplicationStrategy strategy = new OldNetworkTopologyStrategy("Keyspace1", tokenMetadataCurrent, endpointSnitch, optsWithRF(2)); -Collection> currentRanges = strategy.getAddressRanges().get(movingNode); -Collection> updatedRanges = strategy.getPendingAddressRanges(tokenMetadataAfterMove, tokensAfterMove[movingNodeIdx], movingNode); - -Pair>, Set>> ranges = StorageService.instance.calculateStreamAndFetchRanges(currentRanges, updatedRanges); +RangesAtEndpoint currentRanges = strategy.getAddressReplicas().get(movingNode); +RangesAtEndpoint updatedRanges = strategy.getPendingAddressRanges(tokenMetadataAfterMove, tokensAfterMove[movingNodeIdx], movingNode); -return ranges; +return asRanges(StorageService.calculateStreamAndFetchRanges(currentRanges, updatedRanges)); } private static Map optsWithRF(int rf) { return Collections.singletonMap("replication_factor", Integer.toString(rf)); } + +public static Pair>, Set>> asRanges(Pair replicas) +{ +Set> leftRanges = replicas.left.ranges(); +Set> rightRanges = replicas.right.ranges(); +return Pair.create(leftRanges, rightRanges); +} } http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/test/unit/org/apache/cassandra/locator/PendingRangeMapsTest.java -- diff --git a/test/unit/org/apache/cassandra/locator/PendingRangeMapsTest.java b/test/unit/org/apache/cassandra/locator/PendingRangeMapsTest.java index 56fd181..8e0bc00 100644 --- a/test/unit/org/apache/cassandra/locator/PendingRangeMapsTest.java 
+++ b/test/unit/org/apache/cassandra/locator/PendingRangeMapsTest.java @@ -26,7 +26,6 @@ import org.apache.cassandra.dht.Token; import org.junit.Test; import java.net.UnknownHostException; -import java.util.Collection; import static org.junit.Assert.assertEquals; import static org.junit.Assert.assertTrue; @@ -38,17 +37,29 @@ public class PendingRangeMapsTest {
[13/18] cassandra git commit: Transient Replication and Cheap Quorums
http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java -- diff --git a/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java b/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java index cb2ea46..c63f4f3 100644 --- a/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java +++ b/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java @@ -20,10 +20,12 @@ package org.apache.cassandra.locator; import java.util.*; import java.util.Map.Entry; +import org.apache.cassandra.locator.ReplicaCollection.Mutable.Conflict; import org.slf4j.Logger; import org.slf4j.LoggerFactory; import org.apache.cassandra.dht.Datacenters; +import org.apache.cassandra.dht.Range; import org.apache.cassandra.exceptions.ConfigurationException; import org.apache.cassandra.dht.Token; import org.apache.cassandra.locator.TokenMetadata.Topology; @@ -49,14 +51,17 @@ import com.google.common.collect.Multimap; */ public class NetworkTopologyStrategy extends AbstractReplicationStrategy { -private final Map datacenters; +private final Map datacenters; +private final ReplicationFactor aggregateRf; private static final Logger logger = LoggerFactory.getLogger(NetworkTopologyStrategy.class); public NetworkTopologyStrategy(String keyspaceName, TokenMetadata tokenMetadata, IEndpointSnitch snitch, Map configOptions) throws ConfigurationException { super(keyspaceName, tokenMetadata, snitch, configOptions); -Map newDatacenters = new HashMap(); +int replicas = 0; +int trans = 0; +Map newDatacenters = new HashMap<>(); if (configOptions != null) { for (Entry entry : configOptions.entrySet()) @@ -64,12 +69,15 @@ public class NetworkTopologyStrategy extends AbstractReplicationStrategy String dc = entry.getKey(); if (dc.equalsIgnoreCase("replication_factor")) throw new ConfigurationException("replication_factor is an option for SimpleStrategy, not NetworkTopologyStrategy"); -Integer replicas = 
Integer.valueOf(entry.getValue()); -newDatacenters.put(dc, replicas); +ReplicationFactor rf = ReplicationFactor.fromString(entry.getValue()); +replicas += rf.allReplicas; +trans += rf.transientReplicas(); +newDatacenters.put(dc, rf); } } datacenters = Collections.unmodifiableMap(newDatacenters); +aggregateRf = ReplicationFactor.withTransient(replicas, trans); logger.info("Configured datacenter replicas are {}", FBUtilities.toString(datacenters)); } @@ -79,7 +87,8 @@ public class NetworkTopologyStrategy extends AbstractReplicationStrategy private static final class DatacenterEndpoints { /** List accepted endpoints get pushed into. */ -Set endpoints; +EndpointsForRange.Mutable replicas; + /** * Racks encountered so far. Replicas are put into separate racks while possible. * For efficiency the set is shared between the instances, using the location pair (dc, rack) to make sure @@ -90,41 +99,51 @@ public class NetworkTopologyStrategy extends AbstractReplicationStrategy /** Number of replicas left to fill from this DC. */ int rfLeft; int acceptableRackRepeats; +int transients; -DatacenterEndpoints(int rf, int rackCount, int nodeCount, Set endpoints, Set> racks) +DatacenterEndpoints(ReplicationFactor rf, int rackCount, int nodeCount, EndpointsForRange.Mutable replicas, Set> racks) { -this.endpoints = endpoints; +this.replicas = replicas; this.racks = racks; // If there aren't enough nodes in this DC to fill the RF, the number of nodes is the effective RF. -this.rfLeft = Math.min(rf, nodeCount); +this.rfLeft = Math.min(rf.allReplicas, nodeCount); // If there aren't enough racks in this DC to fill the RF, we'll still use at least one node from each rack, // and the difference is to be filled by the first encountered nodes. 
-acceptableRackRepeats = rf - rackCount; +acceptableRackRepeats = rf.allReplicas - rackCount; + +// if we have fewer replicas than rf calls for, reduce transients accordingly +int reduceTransients = rf.allReplicas - this.rfLeft; +transients = Math.max(rf.transientReplicas() - reduceTransients, 0); +ReplicationFactor.validate(rfLeft, transients); } /** - * Attempts to add an endpoint to the replicas for this datacenter, adding to the endpoints set if successful. + * Attempts to add an endpoint to the replicas
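The DatacenterEndpoints bookkeeping above caps the effective RF at the DC's node count and then reduces the transient count by the same shortfall, never below zero. The arithmetic, restated in Python for illustration:

```python
def dc_replica_targets(rf_all, rf_transient, rack_count, node_count):
    """Model of the DatacenterEndpoints fields above. If a DC has fewer
    nodes than the requested RF, the node count caps the effective RF,
    and the shortfall is taken out of the transient count first."""
    rf_left = min(rf_all, node_count)
    acceptable_rack_repeats = rf_all - rack_count
    transients = max(rf_transient - (rf_all - rf_left), 0)
    return rf_left, acceptable_rack_repeats, transients


# rf=3 with 1 transient in a DC of only 2 nodes: effective RF is 2, and
# the one-replica shortfall eliminates the transient replica.
assert dc_replica_targets(3, 1, 2, 2) == (2, 1, 0)
# With enough nodes, nothing is reduced.
assert dc_replica_targets(3, 1, 3, 5) == (3, 0, 1)
```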
[08/18] cassandra git commit: Transient Replication and Cheap Quorums
http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/src/java/org/apache/cassandra/service/reads/AbstractReadExecutor.java -- diff --git a/src/java/org/apache/cassandra/service/reads/AbstractReadExecutor.java b/src/java/org/apache/cassandra/service/reads/AbstractReadExecutor.java index 61b9948..031326e 100644 --- a/src/java/org/apache/cassandra/service/reads/AbstractReadExecutor.java +++ b/src/java/org/apache/cassandra/service/reads/AbstractReadExecutor.java @@ -17,11 +17,12 @@ */ package org.apache.cassandra.service.reads; -import java.util.List; import java.util.concurrent.TimeUnit; import com.google.common.base.Preconditions; -import com.google.common.collect.Iterables; + +import org.apache.cassandra.locator.ReplicaLayout; + import org.slf4j.Logger; import org.slf4j.LoggerFactory; @@ -37,15 +38,20 @@ import org.apache.cassandra.db.partitions.PartitionIterator; import org.apache.cassandra.exceptions.ReadFailureException; import org.apache.cassandra.exceptions.ReadTimeoutException; import org.apache.cassandra.exceptions.UnavailableException; +import org.apache.cassandra.locator.EndpointsForToken; import org.apache.cassandra.locator.InetAddressAndPort; +import org.apache.cassandra.locator.Replica; +import org.apache.cassandra.locator.ReplicaCollection; import org.apache.cassandra.net.MessageOut; import org.apache.cassandra.net.MessagingService; -import org.apache.cassandra.service.StorageProxy; import org.apache.cassandra.service.reads.repair.ReadRepair; import org.apache.cassandra.service.StorageProxy.LocalReadRunnable; import org.apache.cassandra.tracing.TraceState; import org.apache.cassandra.tracing.Tracing; +import static com.google.common.collect.Iterables.all; +import static com.google.common.collect.Iterables.tryFind; + /** * Sends a read request to the replicas needed to satisfy a given ConsistencyLevel. 
* @@ -59,32 +65,27 @@ public abstract class AbstractReadExecutor private static final Logger logger = LoggerFactory.getLogger(AbstractReadExecutor.class); protected final ReadCommand command; -protected final ConsistencyLevel consistency; -protected final List targetReplicas; -protected final ReadRepair readRepair; -protected final DigestResolver digestResolver; -protected final ReadCallback handler; +private final ReplicaLayout.ForToken replicaLayout; +protected final ReadRepair readRepair; +protected final DigestResolver digestResolver; +protected final ReadCallback handler; protected final TraceState traceState; protected final ColumnFamilyStore cfs; protected final long queryStartNanoTime; +private final int initialDataRequestCount; protected volatile PartitionIterator result = null; -protected final Keyspace keyspace; -protected final int blockFor; - -AbstractReadExecutor(Keyspace keyspace, ColumnFamilyStore cfs, ReadCommand command, ConsistencyLevel consistency, List targetReplicas, long queryStartNanoTime) +AbstractReadExecutor(ColumnFamilyStore cfs, ReadCommand command, ReplicaLayout.ForToken replicaLayout, int initialDataRequestCount, long queryStartNanoTime) { this.command = command; -this.consistency = consistency; -this.targetReplicas = targetReplicas; -this.readRepair = ReadRepair.create(command, queryStartNanoTime, consistency); -this.digestResolver = new DigestResolver(keyspace, command, consistency, readRepair, targetReplicas.size()); -this.handler = new ReadCallback(digestResolver, consistency, command, targetReplicas, queryStartNanoTime); +this.replicaLayout = replicaLayout; +this.initialDataRequestCount = initialDataRequestCount; +this.readRepair = ReadRepair.create(command, replicaLayout, queryStartNanoTime); +this.digestResolver = new DigestResolver<>(command, replicaLayout, readRepair, queryStartNanoTime); +this.handler = new ReadCallback<>(digestResolver, replicaLayout.consistencyLevel().blockFor(replicaLayout.keyspace()), command, 
replicaLayout, queryStartNanoTime); this.cfs = cfs; this.traceState = Tracing.instance.get(); this.queryStartNanoTime = queryStartNanoTime; -this.keyspace = keyspace; -this.blockFor = consistency.blockFor(keyspace); // Set the digest version (if we request some digests). This is the smallest version amongst all our target replicas since new nodes @@ -92,8 +93,8 @@ public abstract class AbstractReadExecutor // TODO: we need this when talking with pre-3.0 nodes. So if we preserve the digest format moving forward, we can get rid of this once // we stop being compatible with pre-3.0 nodes. int digestVersion = MessagingService.current_version; -for (InetAddressAndPort replica : targetReplicas) -digestVersion = Math.min(digestVersion,
[16/18] cassandra git commit: Transient Replication and Cheap Quorums
http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/src/java/org/apache/cassandra/db/compaction/CompactionStrategyManager.java -- diff --git a/src/java/org/apache/cassandra/db/compaction/CompactionStrategyManager.java b/src/java/org/apache/cassandra/db/compaction/CompactionStrategyManager.java index 9766454..afe628b 100644 --- a/src/java/org/apache/cassandra/db/compaction/CompactionStrategyManager.java +++ b/src/java/org/apache/cassandra/db/compaction/CompactionStrategyManager.java @@ -56,6 +56,7 @@ import org.apache.cassandra.index.Index; import org.apache.cassandra.io.sstable.Component; import org.apache.cassandra.io.sstable.Descriptor; import org.apache.cassandra.io.sstable.ISSTableScanner; +import org.apache.cassandra.io.sstable.SSTable; import org.apache.cassandra.io.sstable.SSTableMultiWriter; import org.apache.cassandra.io.sstable.format.SSTableReader; import org.apache.cassandra.io.sstable.metadata.MetadataCollector; @@ -112,6 +113,7 @@ public class CompactionStrategyManager implements INotificationConsumer /** * Variables guarded by read and write lock above */ +private final PendingRepairHolder transientRepairs; private final PendingRepairHolder pendingRepairs; private final CompactionStrategyHolder repaired; private final CompactionStrategyHolder unrepaired; @@ -156,10 +158,11 @@ public class CompactionStrategyManager implements INotificationConsumer return compactionStrategyIndexForDirectory(descriptor); } }; -pendingRepairs = new PendingRepairHolder(cfs, router); +transientRepairs = new PendingRepairHolder(cfs, router, true); +pendingRepairs = new PendingRepairHolder(cfs, router, false); repaired = new CompactionStrategyHolder(cfs, router, true); unrepaired = new CompactionStrategyHolder(cfs, router, false); -holders = ImmutableList.of(pendingRepairs, repaired, unrepaired); +holders = ImmutableList.of(transientRepairs, pendingRepairs, repaired, unrepaired); cfs.getTracker().subscribe(this); logger.trace("{} subscribed to the data tracker.", 
this); @@ -176,7 +179,6 @@ public class CompactionStrategyManager implements INotificationConsumer * Return the next background task * * Returns a task for the compaction strategy that needs it the most (most estimated remaining tasks) - * */ public AbstractCompactionTask getNextBackgroundTask(int gcBefore) { @@ -188,18 +190,16 @@ public class CompactionStrategyManager implements INotificationConsumer return null; int numPartitions = getNumTokenPartitions(); + // first try to promote/demote sstables from completed repairs -List repairFinishedSuppliers = pendingRepairs.getRepairFinishedTaskSuppliers(); -if (!repairFinishedSuppliers.isEmpty()) -{ -Collections.sort(repairFinishedSuppliers); -for (TaskSupplier supplier : repairFinishedSuppliers) -{ -AbstractCompactionTask task = supplier.getTask(); -if (task != null) -return task; -} -} +AbstractCompactionTask repairFinishedTask; +repairFinishedTask = pendingRepairs.getNextRepairFinishedTask(); +if (repairFinishedTask != null) +return repairFinishedTask; + +repairFinishedTask = transientRepairs.getNextRepairFinishedTask(); +if (repairFinishedTask != null) +return repairFinishedTask; // sort compaction task suppliers by remaining tasks descending List suppliers = new ArrayList<>(numPartitions * holders.size()); @@ -393,64 +393,28 @@ public class CompactionStrategyManager implements INotificationConsumer } } - - @VisibleForTesting -List getRepaired() +CompactionStrategyHolder getRepairedUnsafe() { -readLock.lock(); -try -{ -return Lists.newArrayList(repaired.allStrategies()); -} -finally -{ -readLock.unlock(); -} +return repaired; } @VisibleForTesting -List getUnrepaired() +CompactionStrategyHolder getUnrepairedUnsafe() { -readLock.lock(); -try -{ -return Lists.newArrayList(unrepaired.allStrategies()); -} -finally -{ -readLock.unlock(); -} +return unrepaired; } @VisibleForTesting -Iterable getForPendingRepair(UUID sessionID) +PendingRepairHolder getPendingRepairsUnsafe() { -readLock.lock(); -try -{ -return 
pendingRepairs.getStrategiesFor(sessionID); -} -finally -{ -
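The CompactionStrategyManager hunk above splits the single pending-repair holder into `transientRepairs` and `pendingRepairs`, giving four holders in total. A minimal sketch of how an sstable's repair metadata could route it to one of those four buckets; the names (`RepairBucket`, `classify`) are illustrative, not the actual Cassandra API:

```java
public class HolderRouting {
    enum RepairBucket { TRANSIENT_REPAIR, PENDING_REPAIR, REPAIRED, UNREPAIRED }

    // repairedAt == 0 conventionally means "not repaired"; hasPendingSession
    // stands in for "pendingRepair session id is non-null"
    static RepairBucket classify(long repairedAt, boolean hasPendingSession, boolean isTransient) {
        if (hasPendingSession)
            return isTransient ? RepairBucket.TRANSIENT_REPAIR : RepairBucket.PENDING_REPAIR;
        return repairedAt > 0 ? RepairBucket.REPAIRED : RepairBucket.UNREPAIRED;
    }

    public static void main(String[] args) {
        System.out.println(classify(0, true, true));        // TRANSIENT_REPAIR
        System.out.println(classify(0, true, false));       // PENDING_REPAIR
        System.out.println(classify(12345L, false, false)); // REPAIRED
        System.out.println(classify(0, false, false));      // UNREPAIRED
    }
}
```

The separate transient bucket matters because, per CASSANDRA-14408, transiently replicated data is dropped after repair rather than promoted to repaired, so `getNextBackgroundTask` polls both pending holders for repair-finished tasks.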
[17/18] cassandra git commit: Transient Replication and Cheap Quorums
http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/src/java/org/apache/cassandra/db/DiskBoundaryManager.java -- diff --git a/src/java/org/apache/cassandra/db/DiskBoundaryManager.java b/src/java/org/apache/cassandra/db/DiskBoundaryManager.java index 72b5e2a..acfe71a 100644 --- a/src/java/org/apache/cassandra/db/DiskBoundaryManager.java +++ b/src/java/org/apache/cassandra/db/DiskBoundaryManager.java @@ -19,8 +19,9 @@ package org.apache.cassandra.db; import java.util.ArrayList; -import java.util.Collection; +import java.util.Comparator; import java.util.List; +import java.util.stream.Collectors; import org.slf4j.Logger; import org.slf4j.LoggerFactory; @@ -30,6 +31,8 @@ import org.apache.cassandra.dht.IPartitioner; import org.apache.cassandra.dht.Range; import org.apache.cassandra.dht.Splitter; import org.apache.cassandra.dht.Token; +import org.apache.cassandra.locator.RangesAtEndpoint; +import org.apache.cassandra.locator.Replica; import org.apache.cassandra.locator.TokenMetadata; import org.apache.cassandra.service.PendingRangeCalculatorService; import org.apache.cassandra.service.StorageService; @@ -68,7 +71,7 @@ public class DiskBoundaryManager private static DiskBoundaries getDiskBoundaryValue(ColumnFamilyStore cfs) { -Collection> localRanges; +RangesAtEndpoint localRanges; long ringVersion; TokenMetadata tmd; @@ -87,7 +90,7 @@ public class DiskBoundaryManager // Reason we use use the future settled TMD is that if we decommission a node, we want to stream // from that node to the correct location on disk, if we didn't, we would put new files in the wrong places. 
// We do this to minimize the amount of data we need to move in rebalancedisks once everything settled -localRanges = cfs.keyspace.getReplicationStrategy().getAddressRanges(tmd.cloneAfterAllSettled()).get(FBUtilities.getBroadcastAddressAndPort()); +localRanges = cfs.keyspace.getReplicationStrategy().getAddressReplicas(tmd.cloneAfterAllSettled(), FBUtilities.getBroadcastAddressAndPort()); } logger.debug("Got local ranges {} (ringVersion = {})", localRanges, ringVersion); } @@ -106,9 +109,18 @@ public class DiskBoundaryManager if (localRanges == null || localRanges.isEmpty()) return new DiskBoundaries(dirs, null, ringVersion, directoriesVersion); -List> sortedLocalRanges = Range.sort(localRanges); +// note that Range.sort unwraps any wraparound ranges, so we need to sort them here +List> fullLocalRanges = Range.sort(localRanges.stream() + .filter(Replica::isFull) + .map(Replica::range) + .collect(Collectors.toList())); +List> transientLocalRanges = Range.sort(localRanges.stream() + .filter(Replica::isTransient) + .map(Replica::range) + .collect(Collectors.toList())); + +List positions = getDiskBoundaries(fullLocalRanges, transientLocalRanges, cfs.getPartitioner(), dirs); -List positions = getDiskBoundaries(sortedLocalRanges, cfs.getPartitioner(), dirs); return new DiskBoundaries(dirs, positions, ringVersion, directoriesVersion); } @@ -121,15 +133,26 @@ public class DiskBoundaryManager * * The final entry in the returned list will always be the partitioner maximum tokens upper key bound */ -private static List getDiskBoundaries(List> sortedLocalRanges, IPartitioner partitioner, Directories.DataDirectory[] dataDirectories) +private static List getDiskBoundaries(List> fullRanges, List> transientRanges, IPartitioner partitioner, Directories.DataDirectory[] dataDirectories) { assert partitioner.splitter().isPresent(); + Splitter splitter = partitioner.splitter().get(); boolean dontSplitRanges = DatabaseDescriptor.getNumTokens() > 1; -List boundaries = 
splitter.splitOwnedRanges(dataDirectories.length, sortedLocalRanges, dontSplitRanges); + +List weightedRanges = new ArrayList<>(fullRanges.size() + transientRanges.size()); +for (Range r : fullRanges) +weightedRanges.add(new Splitter.WeightedRange(1.0, r)); + +for (Range r : transientRanges) +weightedRanges.add(new Splitter.WeightedRange(0.1, r)); + + weightedRanges.sort(Comparator.comparing(Splitter.WeightedRange::left)); + +List boundaries = splitter.splitOwnedRanges(dataDirectories.length,
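The DiskBoundaryManager hunk above weights full ranges at 1.0 and transient ranges at 0.1 before splitting them across data directories, since transient ranges are expected to hold far less data (they are emptied after repair). A toy illustration of how that weighting shifts the balance; the 0.1 constant mirrors the patch, everything else is illustrative:

```java
import java.util.List;

public class WeightedRangesSketch {
    static final double FULL_WEIGHT = 1.0;
    static final double TRANSIENT_WEIGHT = 0.1; // mirrors Splitter.WeightedRange(0.1, r)

    // ranges given as {start, end} token pairs; the weight scales the token span
    static double weightedTotal(List<long[]> fullRanges, List<long[]> transientRanges) {
        double total = 0;
        for (long[] r : fullRanges) total += FULL_WEIGHT * (r[1] - r[0]);
        for (long[] r : transientRanges) total += TRANSIENT_WEIGHT * (r[1] - r[0]);
        return total;
    }

    public static void main(String[] args) {
        List<long[]> full = List.of(new long[]{0, 100});    // full range, span 100
        List<long[]> trans = List.of(new long[]{100, 200}); // transient range, span 100
        // the transient span counts for only a tenth when balancing directories
        System.out.println(weightedTotal(full, trans));
    }
}
```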
[14/18] cassandra git commit: Transient Replication and Cheap Quorums
http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/src/java/org/apache/cassandra/io/sstable/format/big/BigFormat.java -- diff --git a/src/java/org/apache/cassandra/io/sstable/format/big/BigFormat.java b/src/java/org/apache/cassandra/io/sstable/format/big/BigFormat.java index db73b4f..8eb8603 100644 --- a/src/java/org/apache/cassandra/io/sstable/format/big/BigFormat.java +++ b/src/java/org/apache/cassandra/io/sstable/format/big/BigFormat.java @@ -21,6 +21,9 @@ import java.util.Collection; import java.util.Set; import java.util.UUID; +import com.google.common.base.Preconditions; + +import org.apache.cassandra.io.sstable.SSTable; import org.apache.cassandra.schema.TableMetadata; import org.apache.cassandra.schema.TableMetadataRef; import org.apache.cassandra.db.RowIndexEntry; @@ -85,13 +88,15 @@ public class BigFormat implements SSTableFormat long keyCount, long repairedAt, UUID pendingRepair, + boolean isTransient, TableMetadataRef metadata, MetadataCollector metadataCollector, SerializationHeader header, Collection observers, LifecycleTransaction txn) { -return new BigTableWriter(descriptor, keyCount, repairedAt, pendingRepair, metadata, metadataCollector, header, observers, txn); +SSTable.validateRepairedMetadata(repairedAt, pendingRepair, isTransient); +return new BigTableWriter(descriptor, keyCount, repairedAt, pendingRepair, isTransient, metadata, metadataCollector, header, observers, txn); } } @@ -120,7 +125,7 @@ public class BigFormat implements SSTableFormat // mb (3.0.7, 3.7): commit log lower bound included // mc (3.0.8, 3.9): commit log intervals included -// na (4.0.0): uncompressed chunks, pending repair session, checksummed sstable metadata file, new Bloomfilter format +// na (4.0.0): uncompressed chunks, pending repair session, isTransient, checksummed sstable metadata file, new Bloomfilter format // // NOTE: when adding a new version, please add that to LegacySSTableTest, too. 
@@ -131,6 +136,7 @@ public class BigFormat implements SSTableFormat public final boolean hasMaxCompressedLength; private final boolean hasPendingRepair; private final boolean hasMetadataChecksum; +private final boolean hasIsTransient; /** * CASSANDRA-9067: 4.0 bloom filter representation changed (two longs just swapped) * have no 'static' bits caused by using the same upper bits for both bloom filter and token distribution. @@ -148,6 +154,7 @@ public class BigFormat implements SSTableFormat hasCommitLogIntervals = version.compareTo("mc") >= 0; hasMaxCompressedLength = version.compareTo("na") >= 0; hasPendingRepair = version.compareTo("na") >= 0; +hasIsTransient = version.compareTo("na") >= 0; hasMetadataChecksum = version.compareTo("na") >= 0; hasOldBfFormat = version.compareTo("na") < 0; } @@ -176,6 +183,12 @@ public class BigFormat implements SSTableFormat } @Override +public boolean hasIsTransient() +{ +return hasIsTransient; +} + +@Override public int correspondingMessagingVersion() { return correspondingMessagingVersion; http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/src/java/org/apache/cassandra/io/sstable/format/big/BigTableWriter.java -- diff --git a/src/java/org/apache/cassandra/io/sstable/format/big/BigTableWriter.java b/src/java/org/apache/cassandra/io/sstable/format/big/BigTableWriter.java index b5488ed..7513e95 100644 --- a/src/java/org/apache/cassandra/io/sstable/format/big/BigTableWriter.java +++ b/src/java/org/apache/cassandra/io/sstable/format/big/BigTableWriter.java @@ -68,13 +68,14 @@ public class BigTableWriter extends SSTableWriter long keyCount, long repairedAt, UUID pendingRepair, + boolean isTransient, TableMetadataRef metadata, MetadataCollector metadataCollector, SerializationHeader header, Collection observers, LifecycleTransaction txn) { -super(descriptor, keyCount, repairedAt, pendingRepair, metadata, metadataCollector, header, observers); +super(descriptor, keyCount, repairedAt,
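The BigFormat hunk above gates the new `isTransient` metadata flag on the two-letter sstable version string: because versions are assigned in lexicographic order, a simple `compareTo` against the version that introduced a feature tells you whether it is present. A toy reproduction of that gating:

```java
public class VersionFlags {
    // features introduced in "na" (4.0) are present for "na" and every later version
    static boolean hasIsTransient(String version) { return version.compareTo("na") >= 0; }
    // commit log intervals arrived in "mc" (3.0.8 / 3.9)
    static boolean hasCommitLogIntervals(String version) { return version.compareTo("mc") >= 0; }

    public static void main(String[] args) {
        System.out.println(hasIsTransient("mc"));        // false: 3.x sstable predates the flag
        System.out.println(hasIsTransient("na"));        // true
        System.out.println(hasIsTransient("nb"));        // true: later versions keep it
        System.out.println(hasCommitLogIntervals("mb")); // false
    }
}
```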
[12/18] cassandra git commit: Transient Replication and Cheap Quorums
http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/src/java/org/apache/cassandra/locator/SimpleStrategy.java -- diff --git a/src/java/org/apache/cassandra/locator/SimpleStrategy.java b/src/java/org/apache/cassandra/locator/SimpleStrategy.java index 545ad28..7a000b7 100644 --- a/src/java/org/apache/cassandra/locator/SimpleStrategy.java +++ b/src/java/org/apache/cassandra/locator/SimpleStrategy.java @@ -21,9 +21,9 @@ import java.util.ArrayList; import java.util.Collections; import java.util.Collection; import java.util.Iterator; -import java.util.List; import java.util.Map; +import org.apache.cassandra.dht.Range; import org.apache.cassandra.exceptions.ConfigurationException; import org.apache.cassandra.dht.Token; @@ -36,34 +36,41 @@ import org.apache.cassandra.dht.Token; */ public class SimpleStrategy extends AbstractReplicationStrategy { +private final ReplicationFactor rf; + public SimpleStrategy(String keyspaceName, TokenMetadata tokenMetadata, IEndpointSnitch snitch, Map configOptions) { super(keyspaceName, tokenMetadata, snitch, configOptions); +this.rf = ReplicationFactor.fromString(this.configOptions.get("replication_factor")); } -public List calculateNaturalEndpoints(Token token, TokenMetadata metadata) +public EndpointsForRange calculateNaturalReplicas(Token token, TokenMetadata metadata) { -int replicas = getReplicationFactor(); -ArrayList tokens = metadata.sortedTokens(); -List endpoints = new ArrayList(replicas); +ArrayList ring = metadata.sortedTokens(); +if (ring.isEmpty()) +return EndpointsForRange.empty(new Range<>(metadata.partitioner.getMinimumToken(), metadata.partitioner.getMinimumToken())); + +Token replicaEnd = TokenMetadata.firstToken(ring, token); +Token replicaStart = metadata.getPredecessor(replicaEnd); +Range replicaRange = new Range<>(replicaStart, replicaEnd); +Iterator iter = TokenMetadata.ringIterator(ring, token, false); -if (tokens.isEmpty()) -return endpoints; +EndpointsForRange.Builder replicas = 
EndpointsForRange.builder(replicaRange, rf.allReplicas); // Add the token at the index by default -Iterator iter = TokenMetadata.ringIterator(tokens, token, false); -while (endpoints.size() < replicas && iter.hasNext()) +while (replicas.size() < rf.allReplicas && iter.hasNext()) { -InetAddressAndPort ep = metadata.getEndpoint(iter.next()); -if (!endpoints.contains(ep)) -endpoints.add(ep); +Token tk = iter.next(); +InetAddressAndPort ep = metadata.getEndpoint(tk); +if (!replicas.containsEndpoint(ep)) +replicas.add(new Replica(ep, replicaRange, replicas.size() < rf.fullReplicas)); } -return endpoints; +return replicas.build(); } -public int getReplicationFactor() +public ReplicationFactor getReplicationFactor() { -return Integer.parseInt(this.configOptions.get("replication_factor")); +return rf; } public void validateOptions() throws ConfigurationException http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/src/java/org/apache/cassandra/locator/SystemReplicas.java -- diff --git a/src/java/org/apache/cassandra/locator/SystemReplicas.java b/src/java/org/apache/cassandra/locator/SystemReplicas.java new file mode 100644 index 000..13a9d74 --- /dev/null +++ b/src/java/org/apache/cassandra/locator/SystemReplicas.java @@ -0,0 +1,62 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.cassandra.locator; + +import java.util.ArrayList; +import java.util.Collection; +import java.util.List; +import java.util.Map; +import java.util.concurrent.ConcurrentHashMap; + +import org.apache.cassandra.config.DatabaseDescriptor; +import org.apache.cassandra.dht.Range; +import org.apache.cassandra.dht.Token; + +public class SystemReplicas +{ +private static final Map
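In the rewritten `SimpleStrategy.calculateNaturalReplicas` above, the check `replicas.size() < rf.fullReplicas` decides each replica's flavor while walking the ring: the first `fullReplicas` distinct endpoints become full replicas, and any remainder up to `allReplicas` become transient. A hedged sketch of just that selection loop (a replication factor of "3/1" means 3 replicas total, 1 of them transient, so 2 full):

```java
import java.util.ArrayList;
import java.util.List;

public class ReplicaSelectionSketch {
    // ringOrder: distinct endpoints in ring-walk order from the key's token
    static List<String> select(List<String> ringOrder, int allReplicas, int fullReplicas) {
        List<String> out = new ArrayList<>();
        for (String ep : ringOrder) {
            if (out.size() >= allReplicas)
                break;
            // mirrors: new Replica(ep, range, replicas.size() < rf.fullReplicas)
            boolean full = out.size() < fullReplicas;
            out.add(ep + (full ? ":full" : ":transient"));
        }
        return out;
    }

    public static void main(String[] args) {
        // rf "3/1" => allReplicas = 3, fullReplicas = 2
        System.out.println(select(List.of("a", "b", "c", "d"), 3, 2));
        // [a:full, b:full, c:transient]
    }
}
```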
[07/18] cassandra git commit: Transient Replication and Cheap Quorums
http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/src/java/org/apache/cassandra/service/reads/repair/NoopReadRepair.java -- diff --git a/src/java/org/apache/cassandra/service/reads/repair/NoopReadRepair.java b/src/java/org/apache/cassandra/service/reads/repair/NoopReadRepair.java index a43e3eb..4af4a92 100644 --- a/src/java/org/apache/cassandra/service/reads/repair/NoopReadRepair.java +++ b/src/java/org/apache/cassandra/service/reads/repair/NoopReadRepair.java @@ -18,7 +18,6 @@ package org.apache.cassandra.service.reads.repair; -import java.util.List; import java.util.Map; import java.util.function.Consumer; @@ -27,24 +26,28 @@ import org.apache.cassandra.db.Mutation; import org.apache.cassandra.db.partitions.PartitionIterator; import org.apache.cassandra.db.partitions.UnfilteredPartitionIterators; import org.apache.cassandra.exceptions.ReadTimeoutException; -import org.apache.cassandra.locator.InetAddressAndPort; +import org.apache.cassandra.locator.ReplicaLayout; +import org.apache.cassandra.locator.Endpoints; +import org.apache.cassandra.locator.Replica; import org.apache.cassandra.service.reads.DigestResolver; /** * Bypasses the read repair path for short read protection and testing */ -public class NoopReadRepair implements ReadRepair +public class NoopReadRepair, L extends ReplicaLayout> implements ReadRepair { public static final NoopReadRepair instance = new NoopReadRepair(); private NoopReadRepair() {} -public UnfilteredPartitionIterators.MergeListener getMergeListener(InetAddressAndPort[] endpoints) +@Override +public UnfilteredPartitionIterators.MergeListener getMergeListener(L replicas) { return UnfilteredPartitionIterators.MergeListener.NOOP; } -public void startRepair(DigestResolver digestResolver, List allEndpoints, List contactedEndpoints, Consumer resultConsumer) +@Override +public void startRepair(DigestResolver digestResolver, Consumer resultConsumer) { resultConsumer.accept(digestResolver.getData()); } @@ -72,7 +75,7 @@ public class 
NoopReadRepair implements ReadRepair } @Override -public void repairPartition(DecoratedKey key, Map mutations, InetAddressAndPort[] destinations) +public void repairPartition(DecoratedKey partitionKey, Map mutations, L replicaLayout) { } http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/src/java/org/apache/cassandra/service/reads/repair/PartitionIteratorMergeListener.java -- diff --git a/src/java/org/apache/cassandra/service/reads/repair/PartitionIteratorMergeListener.java b/src/java/org/apache/cassandra/service/reads/repair/PartitionIteratorMergeListener.java index 6cf761a..4cae3ae 100644 --- a/src/java/org/apache/cassandra/service/reads/repair/PartitionIteratorMergeListener.java +++ b/src/java/org/apache/cassandra/service/reads/repair/PartitionIteratorMergeListener.java @@ -28,18 +28,18 @@ import org.apache.cassandra.db.RegularAndStaticColumns; import org.apache.cassandra.db.partitions.UnfilteredPartitionIterators; import org.apache.cassandra.db.rows.UnfilteredRowIterator; import org.apache.cassandra.db.rows.UnfilteredRowIterators; -import org.apache.cassandra.locator.InetAddressAndPort; +import org.apache.cassandra.locator.ReplicaLayout; public class PartitionIteratorMergeListener implements UnfilteredPartitionIterators.MergeListener { -private final InetAddressAndPort[] sources; +private final ReplicaLayout replicaLayout; private final ReadCommand command; private final ConsistencyLevel consistency; private final ReadRepair readRepair; -public PartitionIteratorMergeListener(InetAddressAndPort[] sources, ReadCommand command, ConsistencyLevel consistency, ReadRepair readRepair) +public PartitionIteratorMergeListener(ReplicaLayout replicaLayout, ReadCommand command, ConsistencyLevel consistency, ReadRepair readRepair) { -this.sources = sources; +this.replicaLayout = replicaLayout; this.command = command; this.consistency = consistency; this.readRepair = readRepair; @@ -47,10 +47,10 @@ public class PartitionIteratorMergeListener implements 
UnfilteredPartitionIterat public UnfilteredRowIterators.MergeListener getRowMergeListener(DecoratedKey partitionKey, List versions) { -return new RowIteratorMergeListener(partitionKey, columns(versions), isReversed(versions), sources, command, consistency, readRepair); +return new RowIteratorMergeListener(partitionKey, columns(versions), isReversed(versions), replicaLayout, command, consistency, readRepair); } -private RegularAndStaticColumns columns(List versions) +protected RegularAndStaticColumns columns(List versions) { Columns statics = Columns.NONE;
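The read-repair classes above now carry replica layouts instead of bare endpoint arrays, because the rules from CASSANDRA-14407 are per-replica: digest reads are never sent to transient replicas, and read repair never targets them. A sketch of those two rules as predicates; `Replica` here is a stand-in for `org.apache.cassandra.locator.Replica`, not the real API:

```java
public class TransientReadRules {
    static class Replica {
        final String endpoint;
        final boolean full;
        Replica(String endpoint, boolean full) { this.endpoint = endpoint; this.full = full; }
    }

    // transient replicas may serve data reads, but never digest reads
    static boolean canReceiveDigestRead(Replica r) { return r.full; }

    // read repair never attempts to repair a transient replica
    static boolean canBeReadRepaired(Replica r) { return r.full; }

    public static void main(String[] args) {
        Replica full = new Replica("10.0.0.1", true);
        Replica trans = new Replica("10.0.0.2", false);
        System.out.println(canReceiveDigestRead(full));  // true
        System.out.println(canReceiveDigestRead(trans)); // false
        System.out.println(canBeReadRepaired(trans));    // false
    }
}
```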
[06/18] cassandra git commit: Transient Replication and Cheap Quorums
http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/test/unit/org/apache/cassandra/db/CleanupTest.java -- diff --git a/test/unit/org/apache/cassandra/db/CleanupTest.java b/test/unit/org/apache/cassandra/db/CleanupTest.java index a096c78..46c0afd 100644 --- a/test/unit/org/apache/cassandra/db/CleanupTest.java +++ b/test/unit/org/apache/cassandra/db/CleanupTest.java @@ -107,7 +107,6 @@ public class CleanupTest SchemaLoader.compositeIndexCFMD(KEYSPACE2, CF_INDEXED2, true)); } -/* @Test public void testCleanup() throws ExecutionException, InterruptedException { @@ -116,7 +115,6 @@ public class CleanupTest Keyspace keyspace = Keyspace.open(KEYSPACE1); ColumnFamilyStore cfs = keyspace.getColumnFamilyStore(CF_STANDARD1); -UnfilteredPartitionIterator iter; // insert data and verify we get it back w/ range query fillCF(cfs, "val", LOOPS); @@ -124,8 +122,7 @@ public class CleanupTest // record max timestamps of the sstables pre-cleanup List expectedMaxTimestamps = getMaxTimestampList(cfs); -iter = Util.getRangeSlice(cfs); -assertEquals(LOOPS, Iterators.size(iter)); +assertEquals(LOOPS, Util.getAll(Util.cmd(cfs).build()).size()); // with one token in the ring, owned by the local node, cleanup should be a no-op CompactionManager.instance.performCleanup(cfs, 2); @@ -134,10 +131,8 @@ public class CleanupTest assert expectedMaxTimestamps.equals(getMaxTimestampList(cfs)); // check data is still there -iter = Util.getRangeSlice(cfs); -assertEquals(LOOPS, Iterators.size(iter)); +assertEquals(LOOPS, Util.getAll(Util.cmd(cfs).build()).size()); } -*/ @Test public void testCleanupWithIndexes() throws IOException, ExecutionException, InterruptedException http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/test/unit/org/apache/cassandra/db/CleanupTransientTest.java -- diff --git a/test/unit/org/apache/cassandra/db/CleanupTransientTest.java b/test/unit/org/apache/cassandra/db/CleanupTransientTest.java new file mode 100644 index 000..9789183 --- /dev/null +++ 
b/test/unit/org/apache/cassandra/db/CleanupTransientTest.java @@ -0,0 +1,195 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.cassandra.db; + +import java.nio.ByteBuffer; +import java.util.ArrayList; +import java.util.LinkedList; +import java.util.List; +import java.util.UUID; + +import org.apache.cassandra.locator.RangesAtEndpoint; +import org.junit.BeforeClass; +import org.junit.Test; + +import org.apache.cassandra.SchemaLoader; +import org.apache.cassandra.Util; +import org.apache.cassandra.config.DatabaseDescriptor; +import org.apache.cassandra.db.compaction.CompactionManager; +import org.apache.cassandra.db.partitions.FilteredPartition; +import org.apache.cassandra.dht.IPartitioner; +import org.apache.cassandra.dht.RandomPartitioner; +import org.apache.cassandra.dht.Token; +import org.apache.cassandra.io.sstable.format.SSTableReader; +import org.apache.cassandra.locator.AbstractNetworkTopologySnitch; +import org.apache.cassandra.locator.InetAddressAndPort; +import org.apache.cassandra.locator.Replica; +import org.apache.cassandra.locator.TokenMetadata; +import org.apache.cassandra.schema.KeyspaceParams; +import org.apache.cassandra.service.PendingRangeCalculatorService; +import 
org.apache.cassandra.service.StorageService; +import org.apache.cassandra.utils.ByteBufferUtil; + +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertTrue; + +public class CleanupTransientTest +{ +private static final IPartitioner partitioner = RandomPartitioner.instance; +private static IPartitioner oldPartitioner; + +public static final int LOOPS = 200; +public static final String KEYSPACE1 = "CleanupTest1"; +public static final String CF_INDEXED1 = "Indexed1"; +public static final String CF_STANDARD1 = "Standard1"; + +public static final String KEYSPACE2
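Cleanup, which the tests above exercise, decides key by key whether a token still falls in a locally owned range. Cassandra token ranges are half-open intervals `(start, end]` that may wrap around the ring; a small sketch of that containment test (simplified to `long` tokens, with equal bounds treated as the full ring, as in Cassandra's `Range`):

```java
public class RangeContains {
    // true if token is in (start, end]; a wrapping range covers (start, max] ∪ (min, end]
    static boolean contains(long start, long end, long token) {
        if (start < end)
            return token > start && token <= end;
        if (start > end)
            return token > start || token <= end; // wraparound
        return true; // start == end: full ring
    }

    public static void main(String[] args) {
        System.out.println(contains(3, 9, 6));  // true
        System.out.println(contains(3, 9, 3));  // false: left bound is exclusive
        System.out.println(contains(9, 3, 11)); // true: wraparound range
        System.out.println(contains(9, 3, 5));  // false
    }
}
```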
[10/18] cassandra git commit: Transient Replication and Cheap Quorums
http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/src/java/org/apache/cassandra/service/StorageProxy.java -- diff --git a/src/java/org/apache/cassandra/service/StorageProxy.java b/src/java/org/apache/cassandra/service/StorageProxy.java index ed0cafc..c23eb88 100644 --- a/src/java/org/apache/cassandra/service/StorageProxy.java +++ b/src/java/org/apache/cassandra/service/StorageProxy.java @@ -29,7 +29,6 @@ import javax.management.MBeanServer; import javax.management.ObjectName; import com.google.common.base.Preconditions; -import com.google.common.base.Predicate; import com.google.common.cache.CacheLoader; import com.google.common.collect.*; import com.google.common.primitives.Ints; @@ -133,18 +132,10 @@ public class StorageProxy implements StorageProxyMBean HintsService.instance.registerMBean(); HintedHandOffManager.instance.registerMBean(); -standardWritePerformer = new WritePerformer() +standardWritePerformer = (mutation, targets, responseHandler, localDataCenter) -> { -public void apply(IMutation mutation, - Iterable targets, - AbstractWriteResponseHandler responseHandler, - String localDataCenter, - ConsistencyLevel consistency_level) -throws OverloadedException -{ -assert mutation instanceof Mutation; -sendToHintedEndpoints((Mutation) mutation, targets, responseHandler, localDataCenter, Stage.MUTATION); -} +assert mutation instanceof Mutation; +sendToHintedReplicas((Mutation) mutation, targets.selected(), responseHandler, localDataCenter, Stage.MUTATION); }; /* @@ -153,29 +144,19 @@ public class StorageProxy implements StorageProxyMBean * but on the latter case, the verb handler already run on the COUNTER_MUTATION stage, so we must not execute the * underlying on the stage otherwise we risk a deadlock. Hence two different performer. 
*/ -counterWritePerformer = new WritePerformer() +counterWritePerformer = (mutation, targets, responseHandler, localDataCenter) -> { -public void apply(IMutation mutation, - Iterable targets, - AbstractWriteResponseHandler responseHandler, - String localDataCenter, - ConsistencyLevel consistencyLevel) -{ -counterWriteTask(mutation, targets, responseHandler, localDataCenter).run(); -} +EndpointsForToken selected = targets.selected().withoutSelf(); +Replicas.temporaryAssertFull(selected); // TODO CASSANDRA-14548 +counterWriteTask(mutation, selected, responseHandler, localDataCenter).run(); }; -counterWriteOnCoordinatorPerformer = new WritePerformer() +counterWriteOnCoordinatorPerformer = (mutation, targets, responseHandler, localDataCenter) -> { -public void apply(IMutation mutation, - Iterable targets, - AbstractWriteResponseHandler responseHandler, - String localDataCenter, - ConsistencyLevel consistencyLevel) -{ -StageManager.getStage(Stage.COUNTER_MUTATION) -.execute(counterWriteTask(mutation, targets, responseHandler, localDataCenter)); -} +EndpointsForToken selected = targets.selected().withoutSelf(); +Replicas.temporaryAssertFull(selected); // TODO CASSANDRA-14548 +StageManager.getStage(Stage.COUNTER_MUTATION) +.execute(counterWriteTask(mutation, selected, responseHandler, localDataCenter)); }; for(ConsistencyLevel level : ConsistencyLevel.values()) @@ -251,11 +232,9 @@ public class StorageProxy implements StorageProxyMBean while (System.nanoTime() - queryStartNanoTime < timeout) { // for simplicity, we'll do a single liveness check at the start of each attempt -PaxosParticipants p = getPaxosParticipants(metadata, key, consistencyForPaxos); -List liveEndpoints = p.liveEndpoints; -int requiredParticipants = p.participants; +ReplicaLayout.ForPaxos replicaLayout = ReplicaLayout.forPaxos(Keyspace.open(keyspaceName), key, consistencyForPaxos); -final PaxosBallotAndContention pair = beginAndRepairPaxos(queryStartNanoTime, key, metadata, liveEndpoints, 
requiredParticipants, consistencyForPaxos, consistencyForCommit, true, state); +final PaxosBallotAndContention pair =
[02/18] cassandra git commit: Transient Replication and Cheap Quorums
http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/test/unit/org/apache/cassandra/service/StorageServiceTest.java -- diff --git a/test/unit/org/apache/cassandra/service/StorageServiceTest.java b/test/unit/org/apache/cassandra/service/StorageServiceTest.java new file mode 100644 index 000..9d5c324 --- /dev/null +++ b/test/unit/org/apache/cassandra/service/StorageServiceTest.java @@ -0,0 +1,148 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.cassandra.service; + +import org.apache.cassandra.locator.EndpointsByReplica; +import org.apache.cassandra.locator.ReplicaCollection; +import org.junit.Before; +import org.junit.BeforeClass; +import org.junit.Test; + +import org.apache.cassandra.config.DatabaseDescriptor; +import org.apache.cassandra.dht.RandomPartitioner; +import org.apache.cassandra.dht.Range; +import org.apache.cassandra.dht.Token; +import org.apache.cassandra.locator.AbstractEndpointSnitch; +import org.apache.cassandra.locator.AbstractReplicationStrategy; +import org.apache.cassandra.locator.IEndpointSnitch; +import org.apache.cassandra.locator.InetAddressAndPort; +import org.apache.cassandra.locator.Replica; +import org.apache.cassandra.locator.ReplicaMultimap; +import org.apache.cassandra.locator.SimpleStrategy; +import org.apache.cassandra.locator.TokenMetadata; + +import static org.junit.Assert.assertEquals; + +public class StorageServiceTest +{ +static InetAddressAndPort aAddress; +static InetAddressAndPort bAddress; +static InetAddressAndPort cAddress; +static InetAddressAndPort dAddress; +static InetAddressAndPort eAddress; + +@BeforeClass +public static void setUpClass() throws Exception +{ +aAddress = InetAddressAndPort.getByName("127.0.0.1"); +bAddress = InetAddressAndPort.getByName("127.0.0.2"); +cAddress = InetAddressAndPort.getByName("127.0.0.3"); +dAddress = InetAddressAndPort.getByName("127.0.0.4"); +eAddress = InetAddressAndPort.getByName("127.0.0.5"); +} + +private static final Token threeToken = new RandomPartitioner.BigIntegerToken("3"); +private static final Token sixToken = new RandomPartitioner.BigIntegerToken("6"); +private static final Token nineToken = new RandomPartitioner.BigIntegerToken("9"); +private static final Token elevenToken = new RandomPartitioner.BigIntegerToken("11"); +private static final Token oneToken = new RandomPartitioner.BigIntegerToken("1"); + +Range aRange = new Range<>(oneToken, threeToken); +Range bRange = new 
Range<>(threeToken, sixToken); +Range cRange = new Range<>(sixToken, nineToken); +Range dRange = new Range<>(nineToken, elevenToken); +Range eRange = new Range<>(elevenToken, oneToken); + +@Before +public void setUp() +{ +DatabaseDescriptor.daemonInitialization(); +DatabaseDescriptor.setTransientReplicationEnabledUnsafe(true); +IEndpointSnitch snitch = new AbstractEndpointSnitch() +{ +public int compareEndpoints(InetAddressAndPort target, Replica r1, Replica r2) +{ +return 0; +} + +public String getRack(InetAddressAndPort endpoint) +{ +return "R1"; +} + +public String getDatacenter(InetAddressAndPort endpoint) +{ +return "DC1"; +} +}; + +DatabaseDescriptor.setEndpointSnitch(snitch); +} + +private AbstractReplicationStrategy simpleStrategy(TokenMetadata tmd) +{ +return new SimpleStrategy("MoveTransientTest", + tmd, + DatabaseDescriptor.getEndpointSnitch(), + com.google.common.collect.ImmutableMap.of("replication_factor", "3/1")); +} + +public static > void assertMultimapEqualsIgnoreOrder(ReplicaMultimap a, ReplicaMultimap b) +{ +if (!a.keySet().equals(b.keySet())) +assertEquals(a, b); +for (K key : a.keySet()) +{ +C ac = a.get(key); +C bc = b.get(key); +
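The test fixture above builds a five-node ring with tokens {1, 3, 6, 9, 11}, the setting in which helpers like `TokenMetadata.firstToken` operate: find the first ring token at or after a key's token, wrapping to the smallest token when the key sorts past all of them. A linear-scan sketch of that idea (the real implementation uses binary search over the sorted token list):

```java
import java.util.List;

public class FirstTokenSketch {
    // sortedRing must be in ascending order
    static long firstToken(List<Long> sortedRing, long keyToken) {
        for (long t : sortedRing)
            if (t >= keyToken)
                return t;
        return sortedRing.get(0); // wrapped past the largest token
    }

    public static void main(String[] args) {
        List<Long> ring = List.of(1L, 3L, 6L, 9L, 11L);
        System.out.println(firstToken(ring, 4));  // 6
        System.out.println(firstToken(ring, 11)); // 11
        System.out.println(firstToken(ring, 12)); // 1 (wraps around)
    }
}
```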
[18/18] cassandra git commit: Transient Replication and Cheap Quorums
Transient Replication and Cheap Quorums

Patch by Blake Eggleston, Benedict Elliott Smith, Marcus Eriksson, Alex Petrov, Ariel Weisberg; Reviewed by Blake Eggleston, Marcus Eriksson, Benedict Elliott Smith, Alex Petrov, Ariel Weisberg for CASSANDRA-14404

Co-authored-by: Blake Eggleston
Co-authored-by: Benedict Elliott Smith
Co-authored-by: Marcus Eriksson
Co-authored-by: Alex Petrov

Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo
Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/f7431b43
Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/f7431b43
Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/f7431b43

Branch: refs/heads/trunk
Commit: f7431b432875e334170ccdb19934d05545d2cebd
Parents: 5b645de
Author: Ariel Weisberg
Authored: Thu Jul 5 18:10:40 2018 -0400
Committer: Ariel Weisberg
Committed: Fri Aug 31 21:34:22 2018 -0400

--
 CHANGES.txt                                     |   1 +
 NEWS.txt                                        |   4 +
 conf/cassandra.yaml                             |   4 +
 doc/source/architecture/dynamo.rst              |  29 +
 doc/source/cql/ddl.rst                          |  14 +-
 ...iver-internal-only-3.12.0.post0-5838e2fd.zip | Bin 0 -> 269418 bytes
 pylib/cqlshlib/cql3handling.py                  |   1 +
 pylib/cqlshlib/cqlshhandling.py                 |   1 +
 pylib/cqlshlib/test/test_cqlsh_completion.py    |   6 +-
 pylib/cqlshlib/test/test_cqlsh_output.py        |   3 +-
 .../cassandra/batchlog/BatchlogManager.java     |  45 +-
 .../org/apache/cassandra/config/Config.java     |   2 +
 .../cassandra/config/DatabaseDescriptor.java    |  35 +-
 .../apache/cassandra/cql3/QueryProcessor.java   |  13 +-
 .../cql3/statements/BatchStatement.java         |   4 +-
 .../cql3/statements/BatchUpdatesCollector.java  |   2 +-
 .../cql3/statements/ModificationStatement.java  |   4 +-
 .../statements/SingleTableUpdatesCollector.java |   2 +-
 .../cql3/statements/UpdatesCollector.java       |   5 +-
 .../schema/AlterKeyspaceStatement.java          |  86 +-
 .../statements/schema/AlterTableStatement.java  |   7 +
 .../statements/schema/CreateIndexStatement.java |   5 +
 .../statements/schema/CreateTableStatement.java |   9 +
 .../statements/schema/CreateViewStatement.java  |   5 +
 .../cql3/statements/schema/TableAttributes.java |   3 +
 .../apache/cassandra/db/ColumnFamilyStore.java  |  24 +-
 .../apache/cassandra/db/ConsistencyLevel.java   | 211 +++--
 .../cassandra/db/DiskBoundaryManager.java       |  39 +-
 src/java/org/apache/cassandra/db/Memtable.java  |   1 +
 .../cassandra/db/MutationVerbHandler.java       |   5 +-
 .../cassandra/db/PartitionRangeReadCommand.java |  28 +-
 .../org/apache/cassandra/db/ReadCommand.java    |  33 +-
 .../apache/cassandra/db/SSTableImporter.java    |   6 +-
 .../db/SinglePartitionReadCommand.java          |  26 +-
 .../org/apache/cassandra/db/SystemKeyspace.java |  98 ++-
 .../cassandra/db/SystemKeyspaceMigrator40.java  |  45 +
 .../org/apache/cassandra/db/TableCQLHelper.java |   1 +
 .../compaction/AbstractCompactionStrategy.java  |   3 +-
 .../db/compaction/AbstractStrategyHolder.java   |   7 +-
 .../db/compaction/CompactionManager.java        | 295 ---
 .../db/compaction/CompactionStrategyHolder.java |  34 +-
 .../compaction/CompactionStrategyManager.java   | 108 +--
 .../cassandra/db/compaction/CompactionTask.java |  26 +-
 .../db/compaction/PendingRepairHolder.java      |  42 +-
 .../db/compaction/PendingRepairManager.java     |  45 +-
 .../cassandra/db/compaction/Scrubber.java       |   4 +-
 .../cassandra/db/compaction/Upgrader.java       |  10 +-
 .../cassandra/db/compaction/Verifier.java       |   3 +-
 .../writers/CompactionAwareWriter.java          |   2 +
 .../writers/DefaultCompactionWriter.java        |   1 +
 .../writers/MajorLeveledCompactionWriter.java   |   1 +
 .../writers/MaxSSTableSizeWriter.java           |   1 +
 .../SplittingSizeTieredCompactionWriter.java    |   1 +
 .../db/partitions/PartitionIterators.java       |  12 -
 .../repair/CassandraKeyspaceRepairManager.java  |  10 +-
 .../db/repair/PendingAntiCompaction.java        |  22 +-
 .../db/streaming/CassandraOutgoingFile.java     |  11 +-
 .../db/streaming/CassandraStreamManager.java    |  36 +-
 .../db/streaming/CassandraStreamReader.java     |   2 +-
 .../apache/cassandra/db/view/TableViews.java    |   5 +
 .../apache/cassandra/db/view/ViewBuilder.java   |  19 +-
 .../apache/cassandra/db/view/ViewManager.java   |   2 +-
 .../org/apache/cassandra/db/view/ViewUtils.java |  64 +-
 src/java/org/apache/cassandra/dht/Range.java    |  27 +-
 .../cassandra/dht/RangeFetchMapCalculator.java  |  58 +-
 .../org/apache/cassandra/dht/RangeStreamer.java | 571
 src/java/org/apache/cassandra/dht/Splitter.java |  95 +-
 .../apache/cassandra/dht/StreamStateStore.java  |  25 +-
 .../ReplicationAwareTokenAllocator.java         |   2 +-
[05/18] cassandra git commit: Transient Replication and Cheap Quorums
http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/test/unit/org/apache/cassandra/db/repair/PendingAntiCompactionTest.java -- diff --git a/test/unit/org/apache/cassandra/db/repair/PendingAntiCompactionTest.java b/test/unit/org/apache/cassandra/db/repair/PendingAntiCompactionTest.java index 447d504..374a760 100644 --- a/test/unit/org/apache/cassandra/db/repair/PendingAntiCompactionTest.java +++ b/test/unit/org/apache/cassandra/db/repair/PendingAntiCompactionTest.java @@ -18,6 +18,7 @@ package org.apache.cassandra.db.repair; +import java.net.UnknownHostException; import java.util.ArrayList; import java.util.Collection; import java.util.Collections; @@ -42,6 +43,8 @@ import org.slf4j.LoggerFactory; import org.apache.cassandra.SchemaLoader; import org.apache.cassandra.config.DatabaseDescriptor; import org.apache.cassandra.db.compaction.CompactionManager; +import org.apache.cassandra.locator.RangesAtEndpoint; +import org.apache.cassandra.locator.Replica; import org.apache.cassandra.schema.TableId; import org.apache.cassandra.schema.TableMetadata; import org.apache.cassandra.schema.Schema; @@ -64,6 +67,9 @@ public class PendingAntiCompactionTest { private static final Logger logger = LoggerFactory.getLogger(PendingAntiCompactionTest.class); private static final Collection> FULL_RANGE; +private static final Collection> NO_RANGES = Collections.emptyList(); +private static InetAddressAndPort local; + static { DatabaseDescriptor.daemonInitialization(); @@ -77,9 +83,10 @@ public class PendingAntiCompactionTest private ColumnFamilyStore cfs; @BeforeClass -public static void setupClass() +public static void setupClass() throws Throwable { SchemaLoader.prepareServer(); +local = InetAddressAndPort.getByName("127.0.0.1"); } @Before @@ -89,6 +96,7 @@ public class PendingAntiCompactionTest cfm = CreateTableStatement.parse(String.format("CREATE TABLE %s.%s (k INT PRIMARY KEY, v INT)", ks, tbl), ks).build(); SchemaLoader.createKeyspace(ks, KeyspaceParams.simple(1), cfm); 
cfs = Schema.instance.getColumnFamilyStoreInstance(cfm.id); + } private void makeSSTables(int num) @@ -105,7 +113,7 @@ public class PendingAntiCompactionTest private static class InstrumentedAcquisitionCallback extends PendingAntiCompaction.AcquisitionCallback { -public InstrumentedAcquisitionCallback(UUID parentRepairSession, Collection> ranges) +public InstrumentedAcquisitionCallback(UUID parentRepairSession, RangesAtEndpoint ranges) { super(parentRepairSession, ranges); } @@ -155,7 +163,7 @@ public class PendingAntiCompactionTest ExecutorService executor = Executors.newSingleThreadExecutor(); try { -pac = new PendingAntiCompaction(sessionID, tables, ranges, executor); +pac = new PendingAntiCompaction(sessionID, tables, atEndpoint(ranges, NO_RANGES), executor); pac.run().get(); } finally @@ -217,7 +225,7 @@ public class PendingAntiCompactionTest Assert.assertTrue(repaired.intersects(FULL_RANGE)); Assert.assertTrue(unrepaired.intersects(FULL_RANGE)); - repaired.descriptor.getMetadataSerializer().mutateRepaired(repaired.descriptor, 1, null); + repaired.descriptor.getMetadataSerializer().mutateRepairMetadata(repaired.descriptor, 1, null, false); repaired.reloadSSTableMetadata(); PendingAntiCompaction.AcquisitionCallable acquisitionCallable = new PendingAntiCompaction.AcquisitionCallable(cfs, FULL_RANGE, UUIDGen.getTimeUUID()); @@ -243,7 +251,7 @@ public class PendingAntiCompactionTest Assert.assertTrue(repaired.intersects(FULL_RANGE)); Assert.assertTrue(unrepaired.intersects(FULL_RANGE)); - repaired.descriptor.getMetadataSerializer().mutateRepaired(repaired.descriptor, 0, UUIDGen.getTimeUUID()); + repaired.descriptor.getMetadataSerializer().mutateRepairMetadata(repaired.descriptor, 0, UUIDGen.getTimeUUID(), false); repaired.reloadSSTableMetadata(); Assert.assertTrue(repaired.isPendingRepair()); @@ -284,7 +292,7 @@ public class PendingAntiCompactionTest PendingAntiCompaction.AcquireResult result = acquisitionCallable.call(); Assert.assertNotNull(result); 
-InstrumentedAcquisitionCallback cb = new InstrumentedAcquisitionCallback(UUIDGen.getTimeUUID(), FULL_RANGE); +InstrumentedAcquisitionCallback cb = new InstrumentedAcquisitionCallback(UUIDGen.getTimeUUID(), atEndpoint(FULL_RANGE, NO_RANGES)); Assert.assertTrue(cb.submittedCompactions.isEmpty()); cb.apply(Lists.newArrayList(result)); @@ -308,7 +316,7 @@ public class PendingAntiCompactionTest Assert.assertNotNull(result);
[09/18] cassandra git commit: Transient Replication and Cheap Quorums
http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/src/java/org/apache/cassandra/service/StorageService.java -- diff --git a/src/java/org/apache/cassandra/service/StorageService.java b/src/java/org/apache/cassandra/service/StorageService.java index 9467c9a..7f4ae14 100644 --- a/src/java/org/apache/cassandra/service/StorageService.java +++ b/src/java/org/apache/cassandra/service/StorageService.java @@ -29,6 +29,7 @@ import java.util.concurrent.atomic.AtomicBoolean; import java.util.concurrent.atomic.AtomicInteger; import java.util.regex.MatchResult; import java.util.regex.Pattern; +import java.util.stream.Collectors; import java.util.stream.StreamSupport; import javax.annotation.Nullable; @@ -41,9 +42,12 @@ import javax.management.openmbean.TabularDataSupport; import com.google.common.annotations.VisibleForTesting; import com.google.common.base.Preconditions; import com.google.common.base.Predicate; +import com.google.common.base.Predicates; import com.google.common.collect.*; import com.google.common.util.concurrent.*; +import org.apache.cassandra.dht.RangeStreamer.FetchReplica; +import org.apache.cassandra.locator.ReplicaCollection.Mutable.Conflict; import org.apache.commons.lang3.StringUtils; import org.slf4j.Logger; @@ -110,6 +114,8 @@ import org.apache.cassandra.utils.progress.ProgressEventType; import org.apache.cassandra.utils.progress.jmx.JMXBroadcastExecutor; import org.apache.cassandra.utils.progress.jmx.JMXProgressSupport; +import static com.google.common.collect.Iterables.transform; +import static com.google.common.collect.Iterables.tryFind; import static java.util.Arrays.asList; import static java.util.stream.Collectors.toList; import static org.apache.cassandra.index.SecondaryIndexManager.getIndexName; @@ -164,9 +170,9 @@ public class StorageService extends NotificationBroadcasterSupport implements IE return isShutdown; } -public Collection> getLocalRanges(String keyspaceName) +public RangesAtEndpoint getLocalReplicas(String keyspaceName) { 
-return getRangesForEndpoint(keyspaceName, FBUtilities.getBroadcastAddressAndPort()); +return getReplicasForEndpoint(keyspaceName, FBUtilities.getBroadcastAddressAndPort()); } public List> getLocalAndPendingRanges(String ks) @@ -174,9 +180,11 @@ public class StorageService extends NotificationBroadcasterSupport implements IE InetAddressAndPort broadcastAddress = FBUtilities.getBroadcastAddressAndPort(); Keyspace keyspace = Keyspace.open(ks); List> ranges = new ArrayList<>(); - ranges.addAll(keyspace.getReplicationStrategy().getAddressRanges().get(broadcastAddress)); -ranges.addAll(getTokenMetadata().getPendingRanges(ks, broadcastAddress)); -return Range.normalize(ranges); +for (Replica r : keyspace.getReplicationStrategy().getAddressReplicas(broadcastAddress)) +ranges.add(r.range()); +for (Replica r : getTokenMetadata().getPendingRanges(ks, broadcastAddress)) +ranges.add(r.range()); +return ranges; } public Collection> getPrimaryRanges(String keyspace) @@ -1225,11 +1233,11 @@ public class StorageService extends NotificationBroadcasterSupport implements IE if (keyspace == null) { for (String keyspaceName : Schema.instance.getNonLocalStrategyKeyspaces()) -streamer.addRanges(keyspaceName, getLocalRanges(keyspaceName)); +streamer.addRanges(keyspaceName, getLocalReplicas(keyspaceName)); } else if (tokens == null) { -streamer.addRanges(keyspace, getLocalRanges(keyspace)); +streamer.addRanges(keyspace, getLocalReplicas(keyspace)); } else { @@ -1251,14 +1259,16 @@ public class StorageService extends NotificationBroadcasterSupport implements IE } // Ensure all specified ranges are actually ranges owned by this host -Collection> localRanges = getLocalRanges(keyspace); +RangesAtEndpoint localReplicas = getLocalReplicas(keyspace); +RangesAtEndpoint.Builder streamRanges = new RangesAtEndpoint.Builder(FBUtilities.getBroadcastAddressAndPort(), ranges.size()); for (Range specifiedRange : ranges) { boolean foundParentRange = false; -for (Range localRange : localRanges) +for 
(Replica localReplica : localReplicas) { -if (localRange.contains(specifiedRange)) +if (localReplica.contains(specifiedRange)) { + streamRanges.add(localReplica.decorateSubrange(specifiedRange));
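The ownership check in the hunk above (each user-specified range must be contained in some locally owned replica's range before being decorated as a stream subrange) can be illustrated with a small sketch. The `TokenRange` and `OwnershipCheck` names below are hypothetical stand-ins for illustration only, not Cassandra's actual Range/Replica classes:

```java
// Simplified stand-in for a token range, illustrating the containment check
// used when validating user-specified ranges against locally owned ranges.
class TokenRange
{
    final long left, right; // (left, right] on a non-wrapping token line

    TokenRange(long left, long right)
    {
        this.left = left;
        this.right = right;
    }

    // this range fully contains 'that' if both endpoints fall inside it
    boolean contains(TokenRange that)
    {
        return this.left <= that.left && that.right <= this.right;
    }
}

class OwnershipCheck
{
    // returns true iff some local range fully contains the specified range,
    // mirroring the foundParentRange loop in the diff above
    static boolean ownedLocally(TokenRange[] localRanges, TokenRange specified)
    {
        for (TokenRange local : localRanges)
            if (local.contains(specified))
                return true;
        return false;
    }
}
```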
[01/18] cassandra git commit: Transient Replication and Cheap Quorums
Repository: cassandra Updated Branches: refs/heads/trunk 5b645de13 -> f7431b432 http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/test/unit/org/apache/cassandra/service/reads/DataResolverTransientTest.java -- diff --git a/test/unit/org/apache/cassandra/service/reads/DataResolverTransientTest.java b/test/unit/org/apache/cassandra/service/reads/DataResolverTransientTest.java new file mode 100644 index 000..8119400 --- /dev/null +++ b/test/unit/org/apache/cassandra/service/reads/DataResolverTransientTest.java @@ -0,0 +1,226 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.cassandra.service.reads; + +import java.util.concurrent.TimeUnit; + +import com.google.common.primitives.Ints; + +import org.apache.cassandra.Util; +import org.apache.cassandra.db.DecoratedKey; +import org.junit.Assert; +import org.junit.Before; +import org.junit.Test; + +import org.apache.cassandra.db.Clustering; +import org.apache.cassandra.db.ConsistencyLevel; +import org.apache.cassandra.db.DeletionTime; +import org.apache.cassandra.db.EmptyIterators; +import org.apache.cassandra.db.RangeTombstone; +import org.apache.cassandra.db.SimpleBuilders; +import org.apache.cassandra.db.SinglePartitionReadCommand; +import org.apache.cassandra.db.Slice; +import org.apache.cassandra.db.partitions.PartitionUpdate; +import org.apache.cassandra.db.rows.BTreeRow; +import org.apache.cassandra.db.rows.Row; +import org.apache.cassandra.locator.EndpointsForToken; +import org.apache.cassandra.locator.ReplicaLayout; +import org.apache.cassandra.schema.TableMetadata; +import org.apache.cassandra.service.reads.repair.TestableReadRepair; +import org.apache.cassandra.utils.ByteBufferUtil; + +import static org.apache.cassandra.db.ConsistencyLevel.QUORUM; +import static org.apache.cassandra.locator.Replica.fullReplica; +import static org.apache.cassandra.locator.Replica.transientReplica; +import static org.apache.cassandra.locator.ReplicaUtils.full; +import static org.apache.cassandra.locator.ReplicaUtils.trans; + +/** + * Tests DataResolvers handing of transient replicas + */ +public class DataResolverTransientTest extends AbstractReadResponseTest +{ +private static DecoratedKey key; + +@Before +public void setUp() +{ +key = Util.dk("key1"); +} + +private static PartitionUpdate.Builder update(TableMetadata metadata, String key, Row... 
rows) +{ +PartitionUpdate.Builder builder = new PartitionUpdate.Builder(metadata, dk(key), metadata.regularAndStaticColumns(), rows.length, false); +for (Row row: rows) +{ +builder.add(row); +} +return builder; +} + +private static PartitionUpdate.Builder update(Row... rows) +{ +return update(cfm, "key1", rows); +} + +private static Row.SimpleBuilder rowBuilder(int clustering) +{ +return new SimpleBuilders.RowBuilder(cfm, Integer.toString(clustering)); +} + +private static Row row(long timestamp, int clustering, int value) +{ +return rowBuilder(clustering).timestamp(timestamp).add("c1", Integer.toString(value)).build(); +} + +private static DeletionTime deletion(long timeMillis) +{ +TimeUnit MILLIS = TimeUnit.MILLISECONDS; +return new DeletionTime(MILLIS.toMicros(timeMillis), Ints.checkedCast(MILLIS.toSeconds(timeMillis))); +} + +/** + * Tests that the given update doesn't cause data resolver to attempt to repair a transient replica + */ +private void assertNoTransientRepairs(PartitionUpdate update) +{ +SinglePartitionReadCommand command = SinglePartitionReadCommand.fullPartitionRead(update.metadata(), nowInSec, key); +EndpointsForToken targetReplicas = EndpointsForToken.of(key.getToken(), full(EP1), full(EP2), trans(EP3)); +TestableReadRepair repair = new TestableReadRepair(command, QUORUM); +DataResolver resolver = new DataResolver(command, plan(targetReplicas, ConsistencyLevel.QUORUM), repair, 0); + +Assert.assertFalse(resolver.isDataPresent()); +resolver.preprocess(response(command, EP1, iter(update),
[jira] [Comment Edited] (CASSANDRA-14145) Detecting data resurrection during read
[ https://issues.apache.org/jira/browse/CASSANDRA-14145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599444#comment-16599444 ] Jordan West edited comment on CASSANDRA-14145 at 9/1/18 1:16 AM: - Thanks [~beobal]. The recent updates are a big improvement. The {{InputCollector}} changes make things much more readable. I feel like we've reduced the footprint and risk considerably, which makes me more comfortable about merging this late in the game. The feature is well hidden behind a flag: if tracking is not enabled we add a couple of bytes to the internode messaging protocol that are not used, but otherwise the changes to the existing path are negligible, {{InputCollector}} probably being the biggest. I do have one comment below that I think should be addressed before merge. The other one we can open an improvement JIRA for. So I will give my +1 assuming the comment below is addressed, so that we don't miss the deadline due to tz differences. We'll definitely need to test this further for correctness and performance, but that can happen after 9/1. * This change I think should be made before we merge: In {{InputCollector}}, is {{repairedSSTables}} necessary? It seems like an extra upfront allocation and iteration that we could skip. It also widens the unconfirmed window slightly, since the status could change between the call to the constructor and the call to {{addSSTableIterator}}. What about going back to the on-demand allocation of {{repairedIters}} and checking if it's {{null}} in {{finalizeIterators}}? * This can be addressed in a subsequent JIRA if you agree: In {{RepairedDataVerifier}}, we could report confirmed failures more quickly if we separated tracking of confirmed and unconfirmed digests. If the number of confirmed digests is > 1 we still have a confirmed issue regardless of the number of unconfirmed digests. We could separately increment unconfirmed in this case if the unconfirmed digests don't match any of the confirmed ones (if they do, we would assume it's consistent). Minor nits (up to you if you want to fix before merge): * Re: comments in cassandra.yaml, until we’ve benchmarked the changes maybe we shouldn’t try to characterize (“slight”) the performance impact beyond saying that it exists. Same goes for the identical comment in {{Config.java}} > Detecting data resurrection during read > > > Key: CASSANDRA-14145 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14145 > Project: Cassandra > Issue Type: Improvement >Reporter: sankalp kohli >Assignee: Sam Tunnicliffe >Priority: Minor > Fix For: 4.x > > > We have
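The on-demand allocation suggested above for {{repairedIters}} (allocate only when a repaired input actually appears, and null-check in {{finalizeIterators}}) could look something like this sketch. The {{LazyInputCollector}} class and its members are hypothetical illustrations, not the actual Cassandra {{InputCollector}}:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the on-demand allocation pattern: the list of
// repaired iterators is only created when the first repaired sstable is
// seen, so the common case (no repaired input) allocates nothing.
class LazyInputCollector
{
    private List<String> repairedIters = null; // allocated on demand

    void addSSTableIterator(boolean repaired, String iter)
    {
        if (repaired)
        {
            if (repairedIters == null)
                repairedIters = new ArrayList<>(); // first repaired input: allocate now
            repairedIters.add(iter);
        }
    }

    int finalizeIterators()
    {
        // null means no repaired input was ever added, so nothing to track
        return repairedIters == null ? 0 : repairedIters.size();
    }
}
```

Checking repaired status at {{addSSTableIterator}} time, rather than upfront in the constructor, also narrows the window in which the status could change.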
[jira] [Commented] (CASSANDRA-14145) Detecting data resurrection during read
[ https://issues.apache.org/jira/browse/CASSANDRA-14145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599444#comment-16599444 ] Jordan West commented on CASSANDRA-14145: - Thanks [~beobal]. The recent updates are a big improvement. The {{InputCollector}} changes make things much more readable. I feel like we've reduced the footprint and risk considerably, which makes me more comfortable about merging this late in the game. The feature is well hidden behind a flag: if tracking is not enabled we add a couple of bytes to the internode messaging protocol that are not used, but otherwise the changes to the existing path are negligible, {{InputCollector}} probably being the biggest. I do have one comment below that I think should be addressed before merge. The other one we can open an improvement JIRA for. So I will give my +1 assuming the comment below is addressed, so that we don't miss the deadline due to tz differences. We'll definitely need to test this further for correctness and performance, but that can happen after 9/1. * This change I think should be made before we merge: In {{InputCollector}}, is {{repairedSSTables}} necessary? It seems like an extra upfront allocation and iteration that we could skip. It also widens the unconfirmed window slightly, since the status could change between the call to the constructor and the call to {{addSSTableIterator}}. What about going back to the on-demand allocation of {{repairedIters}} and checking if it's {{null}} in {{finalizeIterators}}? * This can be addressed in a subsequent JIRA if you agree: In {{RepairedDataVerifier}}, we could report confirmed failures more quickly if we separated tracking of confirmed and unconfirmed digests. If the number of confirmed digests is > 1 we still have a confirmed issue regardless of the number of unconfirmed digests. 
We could separately increment unconfirmed in this case if the unconfirmed digests don't match any of the confirmed ones (if they do, we would assume it's consistent). Minor nits (up to you if you want to fix before merge): * Re: comments in cassandra.yaml, until we’ve benchmarked the changes maybe we shouldn’t try to characterize (“slight”) the performance impact beyond saying that it exists. Same goes for the identical comment in {{Config.java}} > Detecting data resurrection during read > > > Key: CASSANDRA-14145 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14145 > Project: Cassandra > Issue Type: Improvement >Reporter: sankalp kohli >Assignee: Sam Tunnicliffe >Priority: Minor > Fix For: 4.x > > > We have seen several bugs in which deleted data gets resurrected. We should > try to see if we can detect this on the read path and possibly fix it. Here > are a few examples which brought back data: > A replica lost an sstable on startup, which caused one replica to lose the > tombstone and not the data. This tombstone was past gc grace, which means this > could resurrect data. We can detect such invalid states by looking at other > replicas. > If we are running incremental repair, Cassandra will keep repaired and > non-repaired data separate. Every time incremental repair runs, it will > move the data from non-repaired to repaired. Repaired data across all > replicas should be 100% consistent. > Here is an example of how we can detect and mitigate the issue in most cases. > Say we have 3 machines, A, B and C. All these machines will have data split > b/w repaired and non-repaired. > 1. Machine A, due to some bug, brings back data D. This data D is in the repaired > dataset. All other replicas will have data D and tombstone T. > 2. A read for data D comes from the application which involves replicas A and B. The > data being read involves data which is in the repaired state. A will respond > back to the co-ordinator with data D and B will send nothing, as the tombstone is past > gc grace. 
This will cause a digest mismatch. > 3. This patch will only kick in when there is a digest mismatch. The co-ordinator > will ask both replicas to send back all data like we do today, but with this > patch, replicas will indicate whether the data they return comes from > repaired vs non-repaired. If the data coming from repaired does not match, we > know there is something wrong! At this time, the co-ordinator cannot determine > whether replica A has resurrected some data or replica B has lost some data. We > can still log an error saying we hit an invalid state. > 4. Besides the log, we can take this further and even correct the response to > the query. After logging an invalid state, we can ask replicas A and B (and > also C if alive) to send back all data for this, including gcable tombstones. > If any machine returns a tombstone which is after this data, we know we > cannot return
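The confirmed/unconfirmed split suggested in the review comment above could look something like this sketch. The {{DigestTracker}} class is a hypothetical illustration, not the actual {{RepairedDataVerifier}}:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of separately tracking confirmed and unconfirmed
// repaired-data digests: a mismatch among confirmed digests is reported as a
// definite inconsistency regardless of how many unconfirmed digests exist.
class DigestTracker
{
    private final Set<String> confirmed = new HashSet<>();
    private final Set<String> unconfirmed = new HashSet<>();

    void record(String digest, boolean isConfirmed)
    {
        if (isConfirmed)
            confirmed.add(digest);
        else if (!confirmed.contains(digest))
            unconfirmed.add(digest); // matching a confirmed digest is assumed consistent

    }

    boolean confirmedMismatch()
    {
        // more than one distinct confirmed digest is a definite inconsistency
        return confirmed.size() > 1;
    }

    int unconfirmedCount()
    {
        return unconfirmed.size();
    }
}
```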
[jira] [Comment Edited] (CASSANDRA-13304) Add checksumming to the native protocol
[ https://issues.apache.org/jira/browse/CASSANDRA-13304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599420#comment-16599420 ] Jordan West edited comment on CASSANDRA-13304 at 9/1/18 12:33 AM: -- Other than the bug (and minor nits if you choose to address them) above, I am +1 on these changes. We will need to make sure we test it more thoroughly once there is client support. Thanks for the updates [~beobal]! > Add checksumming to the native protocol > --- > > Key: CASSANDRA-13304 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13304 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Michael Kjellman >Assignee: Sam Tunnicliffe >Priority: Blocker > Labels: client-impacting > Fix For: 4.x > > Attachments: 13304_v1.diff, boxplot-read-throughput.png, > boxplot-write-throughput.png > > > The native binary transport implementation doesn't include checksums. This > makes it highly susceptible to silently inserting corrupted data either due > to hardware issues causing bit flips on the sender/client side, C*/receiver > side, or network in between. > Attaching an implementation that makes checksum'ing mandatory (assuming both > client and server know about a protocol version that supports checksums) -- > and also adds checksumming to clients that request compression. > The serialized format looks something like this: > {noformat} > * 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 > * 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * | Number of Compressed Chunks | Compressed Length (e1)/ > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * / Compressed Length cont. 
(e1) |Uncompressed Length (e1) / > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * | Uncompressed Length cont. (e1)| CRC32 Checksum of Lengths (e1)| > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * | Checksum of Lengths cont. (e1)|Compressed Bytes (e1)+// > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * | CRC32 Checksum (e1) || > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * |Compressed Length (e2) | > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * | Uncompressed Length (e2)| > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * |CRC32 Checksum of Lengths (e2) | > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * | Compressed Bytes (e2) +// > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * | CRC32 Checksum (e2) || > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * |Compressed Length (en) | > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * | Uncompressed Length (en)| > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * |CRC32 Checksum of Lengths (en) | > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * | Compressed Bytes (en) +// > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * | CRC32 Checksum (en) || > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > {noformat} > The first pass here adds checksums only to the actual contents of the frame > body itself (and doesn't actually checksum lengths and headers). While it > would be great to fully add checksuming across the entire protocol, the > proposed implementation will ensure we at least catch corrupted data and > likely protect ourselves pretty well anyways. 
> I didn't go to the trouble of implementing a Snappy Checksum'ed Compressor > implementation as it's been deprecated for a while -- it's really slow and > crappy compared to LZ4 -- and we should do everything in our power to make > sure no one in the community is still using it. I left it in (for obvious > backwards compatibility reasons) for old clients that don't know about the
[jira] [Commented] (CASSANDRA-13304) Add checksumming to the native protocol
[ https://issues.apache.org/jira/browse/CASSANDRA-13304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599420#comment-16599420 ] Jordan West commented on CASSANDRA-13304: - Other than the bug above, I am +1 on these changes. We will need to make sure we test it more thoroughly once there is client support. Thanks for the updates [~beobal]! > Add checksumming to the native protocol > --- > > Key: CASSANDRA-13304 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13304 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Michael Kjellman >Assignee: Sam Tunnicliffe >Priority: Blocker > Labels: client-impacting > Fix For: 4.x > > Attachments: 13304_v1.diff, boxplot-read-throughput.png, > boxplot-write-throughput.png > > > The native binary transport implementation doesn't include checksums. This > makes it highly susceptible to silently inserting corrupted data either due > to hardware issues causing bit flips on the sender/client side, C*/receiver > side, or network in between. > Attaching an implementation that makes checksum'ing mandatory (assuming both > client and server know about a protocol version that supports checksums) -- > and also adds checksumming to clients that request compression. > The serialized format looks something like this: > {noformat} > * 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 > * 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * | Number of Compressed Chunks | Compressed Length (e1)/ > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * / Compressed Length cont. (e1) |Uncompressed Length (e1) / > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * | Uncompressed Length cont. (e1)| CRC32 Checksum of Lengths (e1)| > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * | Checksum of Lengths cont. 
(e1)|Compressed Bytes (e1)+// > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * | CRC32 Checksum (e1) || > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * |Compressed Length (e2) | > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * | Uncompressed Length (e2)| > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * |CRC32 Checksum of Lengths (e2) | > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * | Compressed Bytes (e2) +// > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * | CRC32 Checksum (e2) || > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * |Compressed Length (en) | > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * | Uncompressed Length (en)| > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * |CRC32 Checksum of Lengths (en) | > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * | Compressed Bytes (en) +// > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * | CRC32 Checksum (en) || > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > {noformat} > The first pass here adds checksums only to the actual contents of the frame > body itself (and doesn't actually checksum lengths and headers). While it > would be great to fully add checksuming across the entire protocol, the > proposed implementation will ensure we at least catch corrupted data and > likely protect ourselves pretty well anyways. > I didn't go to the trouble of implementing a Snappy Checksum'ed Compressor > implementation as it's been deprecated for a while -- is really slow and > crappy compared to LZ4 -- and we should do everything in our power to make > sure no one in the community is still using it. I left it in (for obvious > backwards compatibility aspects) old for clients that don't know about the > new protocol. 
> The current protocol has a 256MB (max) frame body -- where the serialized > contents are simply written into the frame body. > If the client sends a compression option in the startup, we will install a > FrameCompressor inline. Unfortunately, we went with a decision to treat the > frame body separately from the
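The per-chunk layout quoted in the ticket above (compressed and uncompressed lengths, a CRC32 over those lengths, then the chunk bytes followed by their own CRC32) can be sketched as follows. This is a hypothetical Python illustration of the framing idea, not the actual server implementation; the field widths and byte order are assumptions:

```python
import struct
import zlib

def frame_chunk(payload: bytes, compress=None) -> bytes:
    """Serialize one chunk: compressed length, uncompressed length,
    CRC32 of the two lengths, the (possibly compressed) bytes, and a
    CRC32 of those bytes, mirroring the layout quoted in the ticket."""
    compressed = compress(payload) if compress else payload
    lengths = struct.pack(">ii", len(compressed), len(payload))
    lengths_crc = struct.pack(">I", zlib.crc32(lengths) & 0xFFFFFFFF)
    payload_crc = struct.pack(">I", zlib.crc32(compressed) & 0xFFFFFFFF)
    return lengths + lengths_crc + compressed + payload_crc

def read_chunk(buf: bytes) -> bytes:
    """Validate both checksums and return the chunk bytes
    (still compressed if a compressor was used on the way in)."""
    comp_len, _uncomp_len = struct.unpack_from(">ii", buf, 0)
    (lengths_crc,) = struct.unpack_from(">I", buf, 8)
    if zlib.crc32(buf[:8]) & 0xFFFFFFFF != lengths_crc:
        raise IOError("corrupt chunk lengths")
    body = buf[12:12 + comp_len]
    (body_crc,) = struct.unpack_from(">I", buf, 12 + comp_len)
    if zlib.crc32(body) & 0xFFFFFFFF != body_crc:
        raise IOError("corrupt chunk body")
    return body
```

Checksumming the lengths separately means a corrupted length field is caught before it can misdirect the read of the chunk body.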
[jira] [Commented] (CASSANDRA-14408) Transient Replication: Incremental & Validation repair handling of transient replicas
[ https://issues.apache.org/jira/browse/CASSANDRA-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599415#comment-16599415 ] Blake Eggleston commented on CASSANDRA-14408: - +1 from me as well > Transient Replication: Incremental & Validation repair handling of transient > replicas > - > > Key: CASSANDRA-14408 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14408 > Project: Cassandra > Issue Type: Sub-task > Components: Repair >Reporter: Ariel Weisberg >Assignee: Blake Eggleston >Priority: Major > Fix For: 4.0 > > > At transient replicas anti-compaction shouldn't output any data for transient > ranges as the data will be dropped after repair. > Transient replicas should also never have data streamed to them. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-13262) Incorrect cqlsh results when selecting same columns multiple times
[ https://issues.apache.org/jira/browse/CASSANDRA-13262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597316#comment-16597316 ] mck edited comment on CASSANDRA-13262 at 8/31/18 11:56 PM: --- New dtests running… || branch || testall || dtest || | [cassandra-2.2_13262|https://github.com/thelastpickle/cassandra/tree/mck/cassandra-2.2_13262] | [!https://circleci.com/gh/thelastpickle/cassandra/tree/mck%2Fcassandra-2.2_13262.svg?style=svg!|https://circleci.com/gh/thelastpickle/cassandra/tree/mck%2Fcassandra-2.2_13262] | [!https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/625/badge/icon!|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/625/] | | [cassandra-3.0_13262|https://github.com/thelastpickle/cassandra/tree/mck/cassandra-3.0_13262] | [!https://circleci.com/gh/thelastpickle/cassandra/tree/mck%2Fcassandra-3.0_13262.svg?style=svg!|https://circleci.com/gh/thelastpickle/cassandra/tree/mck%2Fcassandra-3.0_13262] | [!https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/626/badge/icon!|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/626/] | | [cassandra-3.11_13262|https://github.com/thelastpickle/cassandra/tree/mck/cassandra-3.11_13262] | [!https://circleci.com/gh/thelastpickle/cassandra/tree/mck%2Fcassandra-3.11_13262.svg?style=svg!|https://circleci.com/gh/thelastpickle/cassandra/tree/mck%2Fcassandra-3.11_13262] | [!https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/627/badge/icon!|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/627/] | EDIT: rebased branches. 
was (Author: michaelsembwever): New dtests running… || branch || testall || dtest || | [cassandra-2.2_13262|https://github.com/michaelsembwever/cassandra/tree/mck/cassandra-2.2_13262] | [!https://circleci.com/gh/michaelsembwever/cassandra/tree/mck%2Fcassandra-2.2_13262.svg?style=svg!|https://circleci.com/gh/michaelsembwever/cassandra/tree/mck%2Fcassandra-2.2_13262] | [!https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/618/badge/icon!|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/618/] | | [cassandra-3.0_13262|https://github.com/michaelsembwever/cassandra/tree/mck/cassandra-3.0_13262] | [!https://circleci.com/gh/michaelsembwever/cassandra/tree/mck%2Fcassandra-3.0_13262.svg?style=svg!|https://circleci.com/gh/michaelsembwever/cassandra/tree/mck%2Fcassandra-3.0_13262] | [!https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/619/badge/icon!|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/619/] | | [cassandra-3.11_13262|https://github.com/michaelsembwever/cassandra/tree/mck/cassandra-3.11_13262] | [!https://circleci.com/gh/michaelsembwever/cassandra/tree/mck%2Fcassandra-3.11_13262.svg?style=svg!|https://circleci.com/gh/michaelsembwever/cassandra/tree/mck%2Fcassandra-3.11_13262] | [!https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/620/badge/icon!|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/620/] | > Incorrect cqlsh results when selecting same columns multiple times > -- > > Key: CASSANDRA-13262 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13262 > Project: Cassandra > Issue Type: Bug >Reporter: Stefan Podkowinski >Assignee: Murukesh Mohanan >Priority: Minor > Labels: lhf > Fix For: 4.0 > > Attachments: > 0001-Fix-incorrect-cqlsh-results-when-selecting-same-colu.patch, > CASSANDRA-13262-v2.2.txt, CASSANDRA-13262-v3.0.txt, CASSANDRA-13262-v3.11.txt > > > Just stumbled over this 
on trunk: > {quote} > cqlsh:test1> select a, b, c from table1; > a | b| c > ---+--+- > 1 |b | 2 > 2 | null | 2.2 > (2 rows) > cqlsh:test1> select a, a, b, c from table1; > a | a| b | c > ---+--+-+-- > 1 |b | 2 | null > 2 | null | 2.2 | null > (2 rows) > cqlsh:test1> select a, a, a, b, c from table1; > a | a| a | b| c > ---+--+---+--+-- > 1 |b | 2.0 | null | null > 2 | null | 2.2004768 | null | null > {quote} > My guess is that this is on the Python side, but I haven't really looked into it. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
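A plausible mechanism for the behaviour quoted above is result formatting keyed by column name, so that `select a, a, b` collapses the duplicated column into a single slot and values get clobbered. A minimal, hypothetical Python reproduction of that failure mode, not the actual cqlsh code:

```python
def format_row(column_names, values):
    # Buggy approach: storing values in a dict keyed by column name
    # collapses duplicate columns, so "select a, a, b" keeps only the
    # last value seen for 'a' and loses the rest.
    by_name = {}
    for name, value in zip(column_names, values):
        by_name[name] = value
    return [by_name.get(name) for name in column_names]

# Distinct names survive intact...
print(format_row(["a", "b", "c"], [1, "b", 2]))   # [1, 'b', 2]
# ...but duplicated names silently clobber earlier values.
print(format_row(["a", "a", "b"], [1, "b", 2]))   # ['b', 'b', 2]
```

The usual fix is to address result columns positionally rather than by name when building display rows.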
[jira] [Updated] (CASSANDRA-14408) Transient Replication: Incremental & Validation repair handling of transient replicas
[ https://issues.apache.org/jira/browse/CASSANDRA-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Blake Eggleston updated CASSANDRA-14408: Reviewers: Alex Petrov, Ariel Weisberg, Benedict, Blake Eggleston, Marcus Eriksson (was: Alex Petrov, Ariel Weisberg, Benedict, Marcus Eriksson) > Transient Replication: Incremental & Validation repair handling of transient > replicas > - > > Key: CASSANDRA-14408 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14408 > Project: Cassandra > Issue Type: Sub-task > Components: Repair >Reporter: Ariel Weisberg >Assignee: Blake Eggleston >Priority: Major > Fix For: 4.0 > > > At transient replicas anti-compaction shouldn't output any data for transient > ranges as the data will be dropped after repair. > Transient replicas should also never have data streamed to them. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-14406) Transient Replication: Implement cheap quorum write optimizations
[ https://issues.apache.org/jira/browse/CASSANDRA-14406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Blake Eggleston updated CASSANDRA-14406: Reviewers: Alex Petrov, Ariel Weisberg, Benedict, Blake Eggleston (was: Alex Petrov, Ariel Weisberg, Benedict) > Transient Replication: Implement cheap quorum write optimizations > - > > Key: CASSANDRA-14406 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14406 > Project: Cassandra > Issue Type: Sub-task > Components: Coordination >Reporter: Ariel Weisberg >Assignee: Blake Eggleston >Priority: Major > > Writes should never be sent to transient replicas unless necessary to satisfy > the requested consistency level. Such as RF not being sufficient for strong > consistency or not enough full replicas marked as alive. > If a write doesn't receive sufficient responses in time additional replicas > should be sent the write similar to Rapid Read Protection. > Hints should never be written for a transient replica. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-14405) Transient Replication: Metadata refactor
[ https://issues.apache.org/jira/browse/CASSANDRA-14405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599413#comment-16599413 ] Blake Eggleston commented on CASSANDRA-14405: - +1 from me as well > Transient Replication: Metadata refactor > > > Key: CASSANDRA-14405 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14405 > Project: Cassandra > Issue Type: Sub-task > Components: Core, Distributed Metadata, Documentation and Website >Reporter: Ariel Weisberg >Assignee: Blake Eggleston >Priority: Major > Fix For: 4.0 > > > Add support to CQL and NTS for configuring keyspaces to have transient > replicas. > Add syntax allowing a keyspace using NTS to declare some replicas in each DC > as transient. > Implement metadata internal to the DB so that it's possible to identify what > replicas are transient for a given token or range. > Introduce Replica which is an InetAddressAndPort and a boolean indicating > whether the replica is transient. ReplicatedRange which is a wrapper around a > Range that indicates if the range is transient. > Block altering of keyspaces to use transient replication if they already > contain MVs or 2i. > Block the creation of MV or 2i in keyspaces using transient replication. > Block the creation/alteration of keyspaces using transient replication if the > experimental flag is not set. > Update web site, CQL spec, and any other documentation for the new syntax. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-14406) Transient Replication: Implement cheap quorum write optimizations
[ https://issues.apache.org/jira/browse/CASSANDRA-14406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599414#comment-16599414 ] Blake Eggleston commented on CASSANDRA-14406: - +1 from me as well > Transient Replication: Implement cheap quorum write optimizations > - > > Key: CASSANDRA-14406 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14406 > Project: Cassandra > Issue Type: Sub-task > Components: Coordination >Reporter: Ariel Weisberg >Assignee: Blake Eggleston >Priority: Major > > Writes should never be sent to transient replicas unless necessary to satisfy > the requested consistency level. Such as RF not being sufficient for strong > consistency or not enough full replicas marked as alive. > If a write doesn't receive sufficient responses in time additional replicas > should be sent the write similar to Rapid Read Protection. > Hints should never be written for a transient replica. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
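The write-path rules in the ticket above (full replicas first, transient replicas only when needed to satisfy the consistency level, and no hints for transient replicas) can be sketched roughly like this. The `Replica` shape and function names are invented for illustration; this is not the actual Cassandra write path:

```python
from collections import namedtuple

# Minimal stand-in for a replica: an endpoint plus a flag saying
# whether it is a full or a transient replica.
Replica = namedtuple("Replica", ["endpoint", "full"])

def select_write_targets(replicas, required_acks, alive):
    """Prefer live full replicas; rope in transient replicas only when
    there are not enough live full replicas to satisfy the requested
    consistency level."""
    live = [r for r in replicas if r.endpoint in alive]
    fulls = [r for r in live if r.full]
    if len(fulls) >= required_acks:
        return fulls
    transients = [r for r in live if not r.full]
    return fulls + transients[:required_acks - len(fulls)]

def should_hint(replica):
    # Hints should never be written for a transient replica.
    return replica.full
```

A timeout-driven fallback (sending the write to additional replicas when acks are slow, as with Rapid Read Protection) would layer on top of this selection.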
[jira] [Updated] (CASSANDRA-14405) Transient Replication: Metadata refactor
[ https://issues.apache.org/jira/browse/CASSANDRA-14405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Blake Eggleston updated CASSANDRA-14405: Reviewers: Alex Petrov, Ariel Weisberg, Benedict, Blake Eggleston (was: Alex Petrov, Ariel Weisberg, Benedict) > Transient Replication: Metadata refactor > > > Key: CASSANDRA-14405 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14405 > Project: Cassandra > Issue Type: Sub-task > Components: Core, Distributed Metadata, Documentation and Website >Reporter: Ariel Weisberg >Assignee: Blake Eggleston >Priority: Major > Fix For: 4.0 > > > Add support to CQL and NTS for configuring keyspaces to have transient > replicas. > Add syntax allowing a keyspace using NTS to declare some replicas in each DC > as transient. > Implement metadata internal to the DB so that it's possible to identify what > replicas are transient for a given token or range. > Introduce Replica which is an InetAddressAndPort and a boolean indicating > whether the replica is transient. ReplicatedRange which is a wrapper around a > Range that indicates if the range is transient. > Block altering of keyspaces to use transient replication if they already > contain MVs or 2i. > Block the creation of MV or 2i in keyspaces using transient replication. > Block the creation/alteration of keyspaces using transient replication if the > experimental flag is not set. > Update web site, CQL spec, and any other documentation for the new syntax. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-14407) Transient Replication: Add support for correct reads when transient replication is in use
[ https://issues.apache.org/jira/browse/CASSANDRA-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599406#comment-16599406 ] Benedict commented on CASSANDRA-14407: -- +1 also; I will be following up next week with a detailed comment and some follow up tickets, once I've had time to collate my notes. > Transient Replication: Add support for correct reads when transient > replication is in use > - > > Key: CASSANDRA-14407 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14407 > Project: Cassandra > Issue Type: Sub-task > Components: Coordination >Reporter: Ariel Weisberg >Assignee: Blake Eggleston >Priority: Major > Fix For: 4.0 > > > Digest reads should never be sent to transient replicas. > Mismatches with results from transient replicas shouldn't trigger read repair. > Read repair should never attempt to repair a transient replica. > Reads should always include at least one full replica. They should also > prefer transient replicas where possible. > Range scans must ensure the entire scanned range performs replica selection > that satisfies the requirement that every range scanned includes one full > replica. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
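The read rules in the ticket above (every read includes at least one full replica, transient replicas are otherwise preferred, and digest reads go only to full replicas) amount to a replica-selection filter. A rough Python sketch under assumed data structures, not the actual read path:

```python
from collections import namedtuple

# Stand-in for a replica: endpoint plus a full/transient flag.
Replica = namedtuple("Replica", ["endpoint", "full"])

def select_read_replicas(replicas, required):
    """Choose `required` replicas, guaranteeing at least one full
    replica while otherwise preferring transient ones."""
    fulls = [r for r in replicas if r.full]
    transients = [r for r in replicas if not r.full]
    if not fulls:
        raise ValueError("every read must include at least one full replica")
    chosen = [fulls[0]]                   # the mandatory full replica
    for r in transients + fulls[1:]:      # fill the rest, transients first
        if len(chosen) >= required:
            break
        chosen.append(r)
    return chosen

def digest_targets(chosen):
    # Digest reads should never be sent to transient replicas, and a
    # mismatch from a transient must not trigger read repair.
    return [r for r in chosen if r.full]
```

Range scans would apply the same constraint per scanned range, so that every range touched includes at least one full replica.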
[jira] [Commented] (CASSANDRA-14406) Transient Replication: Implement cheap quorum write optimizations
[ https://issues.apache.org/jira/browse/CASSANDRA-14406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599407#comment-16599407 ] Benedict commented on CASSANDRA-14406: -- +1 also; I will be following up next week with a detailed comment and some follow up tickets, once I've had time to collate my notes. > Transient Replication: Implement cheap quorum write optimizations > - > > Key: CASSANDRA-14406 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14406 > Project: Cassandra > Issue Type: Sub-task > Components: Coordination >Reporter: Ariel Weisberg >Assignee: Blake Eggleston >Priority: Major > > Writes should never be sent to transient replicas unless necessary to satisfy > the requested consistency level. Such as RF not being sufficient for strong > consistency or not enough full replicas marked as alive. > If a write doesn't receive sufficient responses in time additional replicas > should be sent the write similar to Rapid Read Protection. > Hints should never be written for a transient replica. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-14405) Transient Replication: Metadata refactor
[ https://issues.apache.org/jira/browse/CASSANDRA-14405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599405#comment-16599405 ] Benedict commented on CASSANDRA-14405: -- +1 also; I will be following up next week with a detailed comment and some follow up tickets, once I've had time to collate my notes. > Transient Replication: Metadata refactor > > > Key: CASSANDRA-14405 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14405 > Project: Cassandra > Issue Type: Sub-task > Components: Core, Distributed Metadata, Documentation and Website >Reporter: Ariel Weisberg >Assignee: Blake Eggleston >Priority: Major > Fix For: 4.0 > > > Add support to CQL and NTS for configuring keyspaces to have transient > replicas. > Add syntax allowing a keyspace using NTS to declare some replicas in each DC > as transient. > Implement metadata internal to the DB so that it's possible to identify what > replicas are transient for a given token or range. > Introduce Replica which is an InetAddressAndPort and a boolean indicating > whether the replica is transient. ReplicatedRange which is a wrapper around a > Range that indicates if the range is transient. > Block altering of keyspaces to use transient replication if they already > contain MVs or 2i. > Block the creation of MV or 2i in keyspaces using transient replication. > Block the creation/alteration of keyspaces using transient replication if the > experimental flag is not set. > Update web site, CQL spec, and any other documentation for the new syntax. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-14406) Transient Replication: Implement cheap quorum write optimizations
[ https://issues.apache.org/jira/browse/CASSANDRA-14406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599393#comment-16599393 ] Alex Petrov commented on CASSANDRA-14406: - The patch was reviewed, modified and incorporated into the final Transient Replication patch. +1 from my side of the review > Transient Replication: Implement cheap quorum write optimizations > - > > Key: CASSANDRA-14406 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14406 > Project: Cassandra > Issue Type: Sub-task > Components: Coordination >Reporter: Ariel Weisberg >Assignee: Blake Eggleston >Priority: Major > > Writes should never be sent to transient replicas unless necessary to satisfy > the requested consistency level. Such as RF not being sufficient for strong > consistency or not enough full replicas marked as alive. > If a write doesn't receive sufficient responses in time additional replicas > should be sent the write similar to Rapid Read Protection. > Hints should never be written for a transient replica. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-14408) Transient Replication: Incremental & Validation repair handling of transient replicas
[ https://issues.apache.org/jira/browse/CASSANDRA-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599395#comment-16599395 ] Alex Petrov commented on CASSANDRA-14408: - The patch was reviewed, modified and incorporated into the final Transient Replication patch. +1 from my side of the review > Transient Replication: Incremental & Validation repair handling of transient > replicas > - > > Key: CASSANDRA-14408 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14408 > Project: Cassandra > Issue Type: Sub-task > Components: Repair >Reporter: Ariel Weisberg >Assignee: Blake Eggleston >Priority: Major > Fix For: 4.0 > > > At transient replicas anti-compaction shouldn't output any data for transient > ranges as the data will be dropped after repair. > Transient replicas should also never have data streamed to them. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-14408) Transient Replication: Incremental & Validation repair handling of transient replicas
[ https://issues.apache.org/jira/browse/CASSANDRA-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-14408: Reviewers: Alex Petrov, Ariel Weisberg, Benedict, Marcus Eriksson (was: Ariel Weisberg, Benedict, Marcus Eriksson) > Transient Replication: Incremental & Validation repair handling of transient > replicas > - > > Key: CASSANDRA-14408 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14408 > Project: Cassandra > Issue Type: Sub-task > Components: Repair >Reporter: Ariel Weisberg >Assignee: Blake Eggleston >Priority: Major > Fix For: 4.0 > > > At transient replicas anti-compaction shouldn't output any data for transient > ranges as the data will be dropped after repair. > Transient replicas should also never have data streamed to them. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-14406) Transient Replication: Implement cheap quorum write optimizations
[ https://issues.apache.org/jira/browse/CASSANDRA-14406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-14406: Reviewers: Alex Petrov, Ariel Weisberg, Benedict > Transient Replication: Implement cheap quorum write optimizations > - > > Key: CASSANDRA-14406 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14406 > Project: Cassandra > Issue Type: Sub-task > Components: Coordination >Reporter: Ariel Weisberg >Assignee: Blake Eggleston >Priority: Major > > Writes should never be sent to transient replicas unless necessary to satisfy > the requested consistency level. Such as RF not being sufficient for strong > consistency or not enough full replicas marked as alive. > If a write doesn't receive sufficient responses in time additional replicas > should be sent the write similar to Rapid Read Protection. > Hints should never be written for a transient replica. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-14405) Transient Replication: Metadata refactor
[ https://issues.apache.org/jira/browse/CASSANDRA-14405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-14405: Reviewers: Alex Petrov, Ariel Weisberg, Benedict (was: Ariel Weisberg, Benedict) > Transient Replication: Metadata refactor > > > Key: CASSANDRA-14405 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14405 > Project: Cassandra > Issue Type: Sub-task > Components: Core, Distributed Metadata, Documentation and Website >Reporter: Ariel Weisberg >Assignee: Blake Eggleston >Priority: Major > Fix For: 4.0 > > > Add support to CQL and NTS for configuring keyspaces to have transient > replicas. > Add syntax allowing a keyspace using NTS to declare some replicas in each DC > as transient. > Implement metadata internal to the DB so that it's possible to identify what > replicas are transient for a given token or range. > Introduce Replica which is an InetAddressAndPort and a boolean indicating > whether the replica is transient. ReplicatedRange which is a wrapper around a > Range that indicates if the range is transient. > Block altering of keyspaces to use transient replication if they already > contain MVs or 2i. > Block the creation of MV or 2i in keyspaces using transient replication. > Block the creation/alteration of keyspaces using transient replication if the > experimental flag is not set. > Update web site, CQL spec, and any other documentation for the new syntax. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
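The Replica abstraction the ticket introduces — an InetAddressAndPort plus a flag marking the replica as full or transient — might look like the following sketch. Names and the String endpoint stand-in are illustrative, not Cassandra's actual types.

```java
// Hypothetical sketch of the Replica abstraction described in
// CASSANDRA-14405: an endpoint paired with a transient/full flag.
import java.util.Objects;

public class Replica
{
    private final String endpoint;      // stands in for InetAddressAndPort
    private final boolean isTransient;

    public Replica(String endpoint, boolean isTransient)
    {
        this.endpoint = Objects.requireNonNull(endpoint);
        this.isTransient = isTransient;
    }

    public String endpoint() { return endpoint; }
    public boolean isTransient() { return isTransient; }
    public boolean isFull() { return !isTransient; }

    public static void main(String[] args)
    {
        Replica full = new Replica("127.0.0.1:7000", false);
        Replica trans = new Replica("127.0.0.2:7000", true);
        System.out.println(full.isFull());   // true
        System.out.println(trans.isFull());  // false
    }
}
```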
[jira] [Commented] (CASSANDRA-14405) Transient Replication: Metadata refactor
[ https://issues.apache.org/jira/browse/CASSANDRA-14405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599389#comment-16599389 ] Alex Petrov commented on CASSANDRA-14405: - The patch was reviewed, modified and incorporated into the final Transient Replication patch. +1 from my side of the review > Transient Replication: Metadata refactor > > > Key: CASSANDRA-14405 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14405 > Project: Cassandra > Issue Type: Sub-task > Components: Core, Distributed Metadata, Documentation and Website >Reporter: Ariel Weisberg >Assignee: Blake Eggleston >Priority: Major > Fix For: 4.0 > > > Add support to CQL and NTS for configuring keyspaces to have transient > replicas. > Add syntax allowing a keyspace using NTS to declare some replicas in each DC > as transient. > Implement metadata internal to the DB so that it's possible to identify what > replicas are transient for a given token or range. > Introduce Replica which is an InetAddressAndPort and a boolean indicating > whether the replica is transient. ReplicatedRange which is a wrapper around a > Range that indicates if the range is transient. > Block altering of keyspaces to use transient replication if they already > contain MVs or 2i. > Block the creation of MV or 2i in keyspaces using transient replication. > Block the creation/alteration of keyspaces using transient replication if the > experimental flag is not set. > Update web site, CQL spec, and any other documentation for the new syntax. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-10699) Make schema alterations strongly consistent
[ https://issues.apache.org/jira/browse/CASSANDRA-10699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] C. Scott Andreas updated CASSANDRA-10699: - Fix Version/s: (was: 4.0) 4.x > Make schema alterations strongly consistent > --- > > Key: CASSANDRA-10699 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10699 > Project: Cassandra > Issue Type: Sub-task >Reporter: Aleksey Yeschenko >Assignee: Aleksey Yeschenko >Priority: Major > Fix For: 4.x > > > Schema changes do not necessarily commute. This has been the case before > CASSANDRA-5202, but now is particularly problematic. > We should employ a strongly consistent protocol instead of relying on > marshalling {{Mutation}} objects with schema changes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Comment Edited] (CASSANDRA-13304) Add checksumming to the native protocol
[ https://issues.apache.org/jira/browse/CASSANDRA-13304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599314#comment-16599314 ] Jordan West edited comment on CASSANDRA-13304 at 8/31/18 10:08 PM: --- [~beobal] I'm still going over the changes but I wanted to post this bug now before I finish. The issue is that the checksum over the lengths is only calculated over the least-significant byte of the compressed and uncompressed lengths. This means that if we introduce corruption in the three most significant bytes we won't catch it, which leads to a host of different bugs (index out of bounds exceptions, LZ4 decompression issues, etc.). I pushed a patch [here|https://github.com/jrwest/cassandra/commit/e57a2508c26f05efb826a7f4342964fa6d6691bd] with a new test that catches the issue. I left in the fixed seed for now so it's easy for you to run and see the failing example in a debugger (it's the second example that fails). I've pasted a stack trace of the failure with that seed below. The example generated has a single-byte input and introduces corruption into the 3rd byte in the stream (the most significant byte of the first length in the stream). This leads to a case where the checksums match but when we go to read the data we read past the total length of the buffer. A couple of other comments while I'm here. All minor: * [~djoshi3] and I were reviewing the bug above before I posted and we were thinking it would be nice to refactor {{ChecksummingTransformer#transformInbound/transformOutbound}}. They are a bit large/unwieldy right now. We can open a JIRA to address this later if you prefer. * Re: the {{roundTripZeroLength}} property. This is mostly covered by the property I already added although this makes it more likely to generate a few cases. If you want to keep it I would recommend setting {{withExamples}} and using something small like 10 or 20 examples (since the state space is small). 
* The {{System.out.println}} I added in {{roundTripSafetyProperty}} should be removed Stack Trace: {code:java} java.lang.AssertionError: Property falsified after 2 example(s) Smallest found falsifying value(s) :- \{(c,3), 0, null, Adler32} Cause was :- java.lang.IndexOutOfBoundsException: readerIndex(10) + length(16711681) exceeds writerIndex(15): UnpooledHeapByteBuf(ridx: 10, widx: 15, cap: 54/54) at io.netty.buffer.AbstractByteBuf.checkReadableBytes0(AbstractByteBuf.java:1401) at io.netty.buffer.AbstractByteBuf.checkReadableBytes(AbstractByteBuf.java:1388) at io.netty.buffer.AbstractByteBuf.readBytes(AbstractByteBuf.java:870) at org.apache.cassandra.transport.frame.checksum.ChecksummingTransformer.transformInbound(ChecksummingTransformer.java:289) at org.apache.cassandra.transport.frame.checksum.ChecksummingTransformerTest.roundTripWithCorruption(ChecksummingTransformerTest.java:106) at org.quicktheories.dsl.TheoryBuilder4.lambda$checkAssert$9(TheoryBuilder4.java:163) at org.quicktheories.dsl.TheoryBuilder4.lambda$check$8(TheoryBuilder4.java:151) at org.quicktheories.impl.Property.tryFalsification(Property.java:23) at org.quicktheories.impl.Core.shrink(Core.java:111) at org.quicktheories.impl.Core.run(Core.java:39) at org.quicktheories.impl.TheoryRunner.check(TheoryRunner.java:35) at org.quicktheories.dsl.TheoryBuilder4.check(TheoryBuilder4.java:150) at org.quicktheories.dsl.TheoryBuilder4.checkAssert(TheoryBuilder4.java:162) at org.apache.cassandra.transport.frame.checksum.ChecksummingTransformerTest.corruptionCausesFailure(ChecksummingTransformerTest.java:87) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50) at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78) at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57) at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290) at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71) at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288) at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58) at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268) at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) at
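The bug Jordan describes — checksumming only the least-significant byte of each length — can be demonstrated in isolation. The sketch below is illustrative, not Cassandra's ChecksummingTransformer code; it uses the corrupted length 16711681 (1 with bits flipped in the third byte) from the stack trace above.

```java
// Demonstrates the review finding: if the checksum of the chunk lengths
// covers only the least-significant byte of each length, corruption in
// the upper three bytes goes undetected. A sketch, not Cassandra's code.
import java.util.zip.CRC32;

public class LengthChecksumDemo
{
    // Buggy: feeds only the low byte of each length into the checksum.
    static long checksumLowBytes(int compressedLen, int uncompressedLen)
    {
        CRC32 crc = new CRC32();
        crc.update(compressedLen & 0xFF);
        crc.update(uncompressedLen & 0xFF);
        return crc.getValue();
    }

    // Fixed: feeds all four bytes of each length into the checksum.
    static long checksumAllBytes(int compressedLen, int uncompressedLen)
    {
        CRC32 crc = new CRC32();
        for (int shift = 24; shift >= 0; shift -= 8)
            crc.update((compressedLen >>> shift) & 0xFF);
        for (int shift = 24; shift >= 0; shift -= 8)
            crc.update((uncompressedLen >>> shift) & 0xFF);
        return crc.getValue();
    }

    public static void main(String[] args)
    {
        int len = 1;                     // a one-byte chunk, as in the failing example
        int corrupted = len | 0xFF0000;  // bits flipped in the 3rd byte: 16711681
        // The buggy checksum cannot tell the two lengths apart...
        System.out.println(checksumLowBytes(len, len) == checksumLowBytes(corrupted, len));
        // ...while a checksum over all four bytes catches the corruption.
        System.out.println(checksumAllBytes(len, len) == checksumAllBytes(corrupted, len));
    }
}
```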
[jira] [Commented] (CASSANDRA-14618) Create fqltool replay command
[ https://issues.apache.org/jira/browse/CASSANDRA-14618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599274#comment-16599274 ] Jason Brown commented on CASSANDRA-14618: - +1, and please commit with CASSANDRA-14619 (as both patches are linked, code and review wise) > Create fqltool replay command > - > > Key: CASSANDRA-14618 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14618 > Project: Cassandra > Issue Type: New Feature >Reporter: Marcus Eriksson >Assignee: Marcus Eriksson >Priority: Major > Labels: fqltool > Fix For: 4.x > > > Make it possible to replay the full query logs from CASSANDRA-13983 against > one or several clusters. The goal is to be able to compare different runs of > production traffic against different versions/configurations of Cassandra. > * It should be possible to take logs from several machines and replay them in > "order" by the timestamps recorded > * Record the results from each run to be able to compare different runs > (against different clusters/versions/etc) > * If {{fqltool replay}} is run against 2 or more clusters, the results should > be compared as we go -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-14618) Create fqltool replay command
[ https://issues.apache.org/jira/browse/CASSANDRA-14618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Brown updated CASSANDRA-14618: Status: Ready to Commit (was: Patch Available) > Create fqltool replay command > - > > Key: CASSANDRA-14618 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14618 > Project: Cassandra > Issue Type: New Feature >Reporter: Marcus Eriksson >Assignee: Marcus Eriksson >Priority: Major > Labels: fqltool > Fix For: 4.x > > > Make it possible to replay the full query logs from CASSANDRA-13983 against > one or several clusters. The goal is to be able to compare different runs of > production traffic against different versions/configurations of Cassandra. > * It should be possible to take logs from several machines and replay them in > "order" by the timestamps recorded > * Record the results from each run to be able to compare different runs > (against different clusters/versions/etc) > * If {{fqltool replay}} is run against 2 or more clusters, the results should > be compared as we go -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-14619) Create fqltool compare command
[ https://issues.apache.org/jira/browse/CASSANDRA-14619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Brown updated CASSANDRA-14619: Status: Ready to Commit (was: Patch Available) > Create fqltool compare command > -- > > Key: CASSANDRA-14619 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14619 > Project: Cassandra > Issue Type: New Feature >Reporter: Marcus Eriksson >Assignee: Marcus Eriksson >Priority: Major > Labels: fqltool > Fix For: 4.x > > > We need a {{fqltool compare}} command that can take the recorded runs from > CASSANDRA-14618 and compares them, it should output any differences and > potentially all queries against the mismatching partition up until the > mismatch -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-14619) Create fqltool compare command
[ https://issues.apache.org/jira/browse/CASSANDRA-14619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599273#comment-16599273 ] Jason Brown commented on CASSANDRA-14619: - - ColumnDefsReader.readMarshallable - you read an int32 value, but in ColumnDefsWriter.writeMarshallable, you wrote an int16. Is this correct? The unit tests pass, but I'm not sure if RecordStore is being fully exercised. The same thing happens in RowReader vs RowWriter. UPDATE: I stepped through the Chronicle code and it looks like the library can optimize the value it writes out (it only gets written as a byte, basically, since your value is zero). So, while your API calls are incongruous, the library does a correct thing under the hood. I would still prefer you to switch the reads to int16(), but that can be done on commit. I also had a few trivial comments on the PR linked above. They are minor, so just address them on commit (if you choose). Otherwise, +1 from me. > Create fqltool compare command > -- > > Key: CASSANDRA-14619 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14619 > Project: Cassandra > Issue Type: New Feature >Reporter: Marcus Eriksson >Assignee: Marcus Eriksson >Priority: Major > Labels: fqltool > Fix For: 4.x > > > We need a {{fqltool compare}} command that can take the recorded runs from > CASSANDRA-14618 and compares them, it should output any differences and > potentially all queries against the mismatching partition up until the > mismatch -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
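Jason's observation — that the int16-write/int32-read mismatch still round-trips because the library writes values in a variable-length form — can be illustrated with a generic stop-bit (varint) encoder. This is a hedged sketch of the general technique, not Chronicle's actual wire format or API.

```java
// Illustrates why the int16-write / int32-read mismatch noted above can
// still round-trip: with a variable-length "stop-bit" encoding, a small
// value occupies one byte regardless of which fixed-width call produced
// it. A generic sketch of the technique, not Chronicle's implementation.
import java.io.ByteArrayOutputStream;

public class StopBitDemo
{
    // Encode 7 bits per byte; a set high bit means "more bytes follow".
    static byte[] encode(long value)
    {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0)
        {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
        return out.toByteArray();
    }

    static long decode(byte[] bytes)
    {
        long value = 0;
        int shift = 0;
        for (byte b : bytes)
        {
            value |= (long) (b & 0x7F) << shift;
            shift += 7;
        }
        return value;
    }

    public static void main(String[] args)
    {
        // Zero encodes to a single byte whether the writer declared it as
        // an int16 or an int32, so the mismatched read still decodes it.
        System.out.println(encode(0).length);    // 1
        System.out.println(decode(encode(0)));   // 0
        System.out.println(encode(300).length);  // 2
        System.out.println(decode(encode(300))); // 300
    }
}
```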
[jira] [Comment Edited] (CASSANDRA-14497) Add Role login cache
[ https://issues.apache.org/jira/browse/CASSANDRA-14497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599244#comment-16599244 ] Sam Tunnicliffe edited comment on CASSANDRA-14497 at 8/31/18 8:24 PM: -- {quote}{{logical}} role doesn't have password right? Can we use that? {quote} I'm not sure I exactly follow. If you mean can we infer {{LOGIN}} from the lack of a password, then the answer is no because alternative {{IAuthenticator}} implementations may also not use passwords, but you still want users to be able to login. {quote}Then if we disable {{authorizer}}, it should not do the login check right? {quote} No, it still has to do that because it's a required privilege for connecting. I guess I over-simplified when I said perms are only the concern of the {{IAuthorizer}}. {quote}Maybe my questions are beyond the scope of this ticket. If we just want to add cache with minimized the impact. I think the patch looks good. {quote} I think there's definitely plenty of scope to improve the design of the auth subsystem, so let's open a 4.x JIRA to figure out exactly what we want. I'll commit this patch in the meantime (after rebasing and CI) to reduce the impact of high login rates. Thanks [~jay.zhuang] was (Author: beobal): bq. {{logical}} role doesn't have password right? Can we use that? I'm not sure I exactly follow. If you mean can we infer {{LOGIN}} from the lack of a password, then the answer is no because alternative {{IAuthenticator}} implementations may also not use passwords, but you still want users to be able to login. bq. Then if we disable {{authorizer}}, it should not do the login check right? No, it still has to do that because it's a required privilege for connecting. I guess I over-simplified when I said perms are only the concern of the {{IAuthorizer}}. bq. Maybe my questions are beyond the scope of this ticket. If we just want to add cache with minimized the impact. I think the patch looks good. 
I think there's definitely plenty of scope to improve the design of the auth subsystem, so let's open a 4.x JIRA to figure out exactly what we want. I'll commit this patch in the meantime to reduce the impact of high login rates. Thanks [~jay.zhuang] > Add Role login cache > > > Key: CASSANDRA-14497 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14497 > Project: Cassandra > Issue Type: Improvement > Components: Auth >Reporter: Jay Zhuang >Assignee: Sam Tunnicliffe >Priority: Major > Labels: security > Fix For: 4.0 > > > The > [{{ClientState.login()}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/ClientState.java#L313] > function is used for all auth messages: > [{{AuthResponse.java:82}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/transport/messages/AuthResponse.java#L82]. > But the > [{{role.canLogin}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/auth/CassandraRoleManager.java#L521] > information is not cached. So it hits the database every time: > [{{CassandraRoleManager.java:407}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/auth/CassandraRoleManager.java#L407]. > For a cluster with lots of new connections, it's causing performance issues. > The mitigation for us is to increase the {{system_auth}} replication factor > to match the number of nodes, so > [{{local_one}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/auth/CassandraRoleManager.java#L488] > would be very cheap. The P99 dropped immediately, but I don't think it is > a good solution. > I would propose to add {{Role.canLogin}} to the RolesCache to improve the > auth performance. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
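The core idea of the ticket — caching the canLogin flag per role so repeated connections don't hit the system_auth tables — can be sketched minimally. Names, the loader stand-in, and the lack of TTL/invalidation handling are all illustrative; this is not Cassandra's RolesCache implementation.

```java
// A minimal sketch of the idea in CASSANDRA-14497: memoize the canLogin
// flag per role so a flood of new connections costs one database read
// per role instead of one per login. Illustrative names only.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class LoginCacheSketch
{
    private final Map<String, Boolean> cache = new ConcurrentHashMap<>();
    private final Function<String, Boolean> loader; // stands in for the DB lookup
    private int dbHits = 0;

    LoginCacheSketch(Function<String, Boolean> loader)
    {
        this.loader = loader;
    }

    boolean canLogin(String role)
    {
        // Load on first access, serve from the cache afterwards.
        return cache.computeIfAbsent(role, r -> { dbHits++; return loader.apply(r); });
    }

    public static void main(String[] args)
    {
        LoginCacheSketch loginCache =
            new LoginCacheSketch(role -> !role.equals("app_readonly"));
        for (int i = 0; i < 1000; i++)        // 1000 logins for the same role...
            loginCache.canLogin("app_user");
        System.out.println(loginCache.dbHits);          // ...one database read
        System.out.println(loginCache.canLogin("app_user")); // true
    }
}
```

A production cache would also need a TTL and invalidation on ALTER ROLE, which is what piggybacking on the existing RolesCache buys.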
[jira] [Commented] (CASSANDRA-13304) Add checksumming to the native protocol
[ https://issues.apache.org/jira/browse/CASSANDRA-13304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599238#comment-16599238 ] Dinesh Joshi commented on CASSANDRA-13304: -- {{ChecksummingTransformer::transformInbound}} - Very minor nit, I prefer a ternary operator here. Its more concise. You don't have to change it. I'm +1 on the patch. > Add checksumming to the native protocol > --- > > Key: CASSANDRA-13304 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13304 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Michael Kjellman >Assignee: Sam Tunnicliffe >Priority: Blocker > Labels: client-impacting > Fix For: 4.x > > Attachments: 13304_v1.diff, boxplot-read-throughput.png, > boxplot-write-throughput.png > > > The native binary transport implementation doesn't include checksums. This > makes it highly susceptible to silently inserting corrupted data either due > to hardware issues causing bit flips on the sender/client side, C*/receiver > side, or network in between. > Attaching an implementation that makes checksum'ing mandatory (assuming both > client and server know about a protocol version that supports checksums) -- > and also adds checksumming to clients that request compression. > The serialized format looks something like this: > {noformat} > * 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 > * 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * | Number of Compressed Chunks | Compressed Length (e1)/ > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * / Compressed Length cont. (e1) |Uncompressed Length (e1) / > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * | Uncompressed Length cont. (e1)| CRC32 Checksum of Lengths (e1)| > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * | Checksum of Lengths cont. 
(e1)|Compressed Bytes (e1)+// > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * | CRC32 Checksum (e1) || > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * |Compressed Length (e2) | > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * | Uncompressed Length (e2)| > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * |CRC32 Checksum of Lengths (e2) | > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * | Compressed Bytes (e2) +// > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * | CRC32 Checksum (e2) || > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * |Compressed Length (en) | > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * | Uncompressed Length (en)| > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * |CRC32 Checksum of Lengths (en) | > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * | Compressed Bytes (en) +// > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > * | CRC32 Checksum (en) || > * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ > {noformat} > The first pass here adds checksums only to the actual contents of the frame > body itself (and doesn't actually checksum lengths and headers). While it > would be great to fully add checksuming across the entire protocol, the > proposed implementation will ensure we at least catch corrupted data and > likely protect ourselves pretty well anyways. > I didn't go to the trouble of implementing a Snappy Checksum'ed Compressor > implementation as it's been deprecated for a while -- is really slow and > crappy compared to LZ4 -- and we should do everything in our power to make > sure no one in the community is still using it. I left it in (for obvious > backwards compatibility aspects) old for clients that don't know about the > new protocol. 
> The current protocol has a 256MB (max) frame body -- where the serialized > contents are simply written in to the frame body. > If the client sends a compression option in the startup, we will install a > FrameCompressor inline. Unfortunately, we went with a decision to treat the > frame body separately from the header
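The per-chunk layout diagrammed above — compressed length, uncompressed length, CRC32 of the lengths, payload, CRC32 of the payload — can be sketched as a serializer. Field widths follow the diagram; the code is an illustration, not Cassandra's ChecksummingTransformer.

```java
// A sketch of one chunk in the checksummed framing diagrammed above:
// [compressed length][uncompressed length][CRC32 of the lengths]
// [payload bytes][CRC32 of the payload]. Illustrative only.
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ChunkFramingDemo
{
    static ByteBuffer frameChunk(byte[] payload, int uncompressedLength)
    {
        CRC32 lengthsCrc = new CRC32();
        ByteBuffer lengths = ByteBuffer.allocate(8)
                                       .putInt(payload.length)
                                       .putInt(uncompressedLength);
        lengthsCrc.update(lengths.array()); // checksum covers all 8 length bytes

        CRC32 payloadCrc = new CRC32();
        payloadCrc.update(payload);

        ByteBuffer chunk = ByteBuffer.allocate(16 + payload.length);
        chunk.putInt(payload.length)
             .putInt(uncompressedLength)
             .putInt((int) lengthsCrc.getValue())
             .put(payload)
             .putInt((int) payloadCrc.getValue());
        chunk.flip();
        return chunk;
    }

    public static void main(String[] args)
    {
        byte[] payload = "hello".getBytes(StandardCharsets.UTF_8);
        ByteBuffer chunk = frameChunk(payload, payload.length);
        System.out.println(chunk.remaining()); // 16 bytes of framing + 5 of payload
        System.out.println(chunk.getInt());    // compressed length
        System.out.println(chunk.getInt());    // uncompressed length
    }
}
```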
[jira] [Commented] (CASSANDRA-13304) Add checksumming to the native protocol
[ https://issues.apache.org/jira/browse/CASSANDRA-13304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599223#comment-16599223 ] Sam Tunnicliffe commented on CASSANDRA-13304: - thanks both. [~jrwest], I've pulled your patch into my branch and added a couple more tests. bq. Protocol comments I agree that this could be useful, so I've implemented a first cut of it to validate the protocol changes. The test for whether compression is applied or not is super-brutal (i.e. it just applies the compression and checks the compressed size) but that could be refined later. I haven't updated either the diagram in {{ChecksummingTransformer}}, or the protocol spec, but those can come later (soon) if we commit this. bq. In ChecksummingTransformer::transformOutbound L185 - consider adding a debug log statement as this is an unexpected event. I considered this, but decided in the end that it wouldn't really be actionable, so a bit redundant. It would be more useful to add a metric for the number of times we have to resize a buffer, but probably only in conjunction with some targeted testing which we can't easily do with the existing drivers. I'll open a follow-up JIRA for that. bq. In ChecksummingTransformerTest - Could you add a zero length round trip test so we cover that corner case as well? Added when pulling in Jordan's conversion of ChecksummingTransformerTest to a property-based test. bq. cassandra.yaml entry for compression/checksum block size is missing Added, thanks. bq. There was some conversation above about using ProtocolException instead of IOException when the checksums don't match. It seemed like there was agreement on using ProtocolException but the code still uses IOException. Good catch, thanks. bq. Would be nice to move ChecksummingTransformer#readUnsignedShort to something like ByteBufUtil#readUnsignedShort. Similar to ByteBufferUtil#readShortLength. 
I've moved it to CBUtil, which is the de facto equivalent of {{ByteBufferUtil}} bq. I thought that Optionals were not adding much value bq. StartupMessage#getChecksumType/getCompressor(): I'm not sure there is much benefit to using optional here given how it's used at the call sites. argh, both good catches. The use of {{Optional}} was much more widespread earlier on & I thought I'd removed it all, but I missed this. bq. The comment about why the frame.compress package defines an ICompressor-like interface was removed but is helpful since it's not obvious at first. It should probably be expanded on a bit as well. https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/transport/FrameCompressor.java#L37 I've added some javadoc to {{o.a.c.transport.frame.compress.Compressor}} bq. The "created by" comment at the top of ChecksummingCompressorTest should be removed Done > Add checksumming to the native protocol > --- > > Key: CASSANDRA-13304 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13304 > Project: Cassandra > Issue Type: Improvement > Components: Core >Reporter: Michael Kjellman >Assignee: Sam Tunnicliffe >Priority: Blocker > Labels: client-impacting > Fix For: 4.x > > Attachments: 13304_v1.diff, boxplot-read-throughput.png, > boxplot-write-throughput.png > > > The native binary transport implementation doesn't include checksums. This > makes it highly susceptible to silently inserting corrupted data either due > to hardware issues causing bit flips on the sender/client side, C*/receiver > side, or network in between. > Attaching an implementation that makes checksum'ing mandatory (assuming both > client and server know about a protocol version that supports checksums) -- > and also adds checksumming to clients that request compression. 
> The serialized format looks something like this:
> {noformat}
>  *  1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
>  *  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |  Number of Compressed Chunks  |     Compressed Length (e1)    /
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * /  Compressed Length cont. (e1) |    Uncompressed Length (e1)   /
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * | Uncompressed Length cont. (e1)| CRC32 Checksum of Lengths (e1)|
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * | Checksum of Lengths cont. (e1)|    Compressed Bytes (e1)    +//
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |         CRC32 Checksum (e1)                                  ||
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |
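The layout above reads as: a 2-byte chunk count, then per chunk a 4-byte compressed length, a 4-byte uncompressed length, a CRC32 over those two lengths, the compressed bytes, and a CRC32 over those bytes. A rough Java sketch of framing a single chunk (illustrative only; the class and method names, buffer sizing, and endianness are assumptions, not the actual ChecksummingTransformer code):

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Illustrative sketch of the per-chunk frame layout in the diagram above.
// The lengths get their own CRC32 so a corrupted length field is detected
// before it can trigger a bogus-sized read of the chunk body.
public class ChecksummedChunkSketch
{
    static int crc32(byte[] bytes)
    {
        CRC32 crc = new CRC32();
        crc.update(bytes);
        return (int) crc.getValue();
    }

    // Frame one already-compressed chunk (hypothetical helper, not C* code).
    public static ByteBuffer frameChunk(byte[] compressed, int uncompressedLength)
    {
        // Checksum the two length fields separately from the payload.
        ByteBuffer lengths = ByteBuffer.allocate(8);
        lengths.putInt(compressed.length).putInt(uncompressedLength);

        ByteBuffer out = ByteBuffer.allocate(2 + 4 + 4 + 4 + compressed.length + 4);
        out.putShort((short) 1);               // number of compressed chunks
        out.putInt(compressed.length);         // compressed length (e1)
        out.putInt(uncompressedLength);        // uncompressed length (e1)
        out.putInt(crc32(lengths.array()));    // CRC32 checksum of lengths (e1)
        out.put(compressed);                   // compressed bytes (e1)
        out.putInt(crc32(compressed));         // CRC32 checksum of the chunk (e1)
        out.flip();
        return out;
    }
}
```

On the read side the decoder would verify the lengths checksum before trusting either length, then verify the payload checksum before decompressing.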
[jira] [Commented] (CASSANDRA-14619) Create fqltool compare command
[ https://issues.apache.org/jira/browse/CASSANDRA-14619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599225#comment-16599225 ] Dinesh Joshi commented on CASSANDRA-14619: -- [~krummas] Other than just one minor thing, I'm +1 on the PR. Please fix on commit - {{FQLQuery::toString}} is using {{"\n"}}, it would be better to use {{System.getProperty("line.separator")}}. > Create fqltool compare command > -- > > Key: CASSANDRA-14619 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14619 > Project: Cassandra > Issue Type: New Feature >Reporter: Marcus Eriksson >Assignee: Marcus Eriksson >Priority: Major > Labels: fqltool > Fix For: 4.x > > > We need a {{fqltool compare}} command that can take the recorded runs from > CASSANDRA-14618 and compares them, it should output any differences and > potentially all queries against the mismatching partition up until the > mismatch -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
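For context on that review nit, a minimal illustration (a hypothetical demo class, not the fqltool code): {{"\n"}} is a hard-coded Unix newline, while {{System.getProperty("line.separator")}} (equivalently {{System.lineSeparator()}}) follows the host platform's convention.

```java
// Hypothetical demo of the review nit above: hard-coded "\n" versus the
// platform-dependent separator returned by System.lineSeparator().
public class LineSeparatorDemo
{
    public static String joinHardcoded(String a, String b)
    {
        return a + "\n" + b;                   // always LF, even on Windows
    }

    public static String joinPlatform(String a, String b)
    {
        return a + System.lineSeparator() + b; // LF on Unix, CRLF on Windows
    }
}
```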
[jira] [Commented] (CASSANDRA-14619) Create fqltool compare command
[ https://issues.apache.org/jira/browse/CASSANDRA-14619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599169#comment-16599169 ] Dinesh Joshi commented on CASSANDRA-14619: -- I will take a look at it today. > Create fqltool compare command > -- > > Key: CASSANDRA-14619 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14619 > Project: Cassandra > Issue Type: New Feature >Reporter: Marcus Eriksson >Assignee: Marcus Eriksson >Priority: Major > Labels: fqltool > Fix For: 4.x > > > We need a {{fqltool compare}} command that can take the recorded runs from > CASSANDRA-14618 and compares them, it should output any differences and > potentially all queries against the mismatching partition up until the > mismatch
[jira] [Commented] (CASSANDRA-14619) Create fqltool compare command
[ https://issues.apache.org/jira/browse/CASSANDRA-14619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599153#comment-16599153 ] Marcus Eriksson commented on CASSANDRA-14619: - bq. , every columnDefinition and row entry that is written out is prefixed with a 4-byte version number. I would really like to keep it in each document to be able to parse them out-of-context, let me know if you have really strong feelings about it, but I think the size of this will be tiny compared to actual result sets etc. And I changed it to two bytes. I have pushed a branch rebased on almost latest trunk including the updated fields etc here: https://github.com/krummas/cassandra/commits/marcuse/fql_rebase > Create fqltool compare command > -- > > Key: CASSANDRA-14619 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14619 > Project: Cassandra > Issue Type: New Feature >Reporter: Marcus Eriksson >Assignee: Marcus Eriksson >Priority: Major > Labels: fqltool > Fix For: 4.x > > > We need a {{fqltool compare}} command that can take the recorded runs from > CASSANDRA-14618 and compares them, it should output any differences and > potentially all queries against the mismatching partition up until the > mismatch
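The per-entry version prefix being discussed could look something like this sketch (hypothetical names and version value, not the actual fqltool serialization): every serialized entry carries its own 2-byte version, so a single document can be decoded without any surrounding context.

```java
import java.nio.ByteBuffer;

// Illustrative sketch of prefixing each serialized entry with a 2-byte
// version number so every document is parseable out-of-context.
public class VersionedEntrySketch
{
    static final short CURRENT_VERSION = 1; // assumed value

    public static ByteBuffer serialize(byte[] entry)
    {
        ByteBuffer out = ByteBuffer.allocate(2 + entry.length);
        out.putShort(CURRENT_VERSION); // 2 bytes, written per entry
        out.put(entry);
        out.flip();
        return out;
    }

    public static short readVersion(ByteBuffer in)
    {
        return in.getShort(); // consume the version before the entry body
    }
}
```

The cost is two bytes per entry, negligible next to the actual result sets, in exchange for self-describing documents.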
[jira] [Commented] (CASSANDRA-14404) Transient Replication & Cheap Quorums: Decouple storage requirements from consensus group size using incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-14404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599146#comment-16599146 ] Ariel Weisberg commented on CASSANDRA-14404: There are no transient nodes. All nodes are the same. If you have transient replication enabled each node will transiently replicate some ranges instead of fully replicating them. Capacity requirements are reduced evenly across all nodes in the cluster. Nodes are not temporarily transient replicas during expansion. They need to stream data like a full replica for the transient range before they can serve reads. There is a pending state similar to how there is a pending state for full replicas. Transient replicas also always receive writes when they are pending. There may be some room to relax how that is handled, but for now we opt to send pending transient ranges a bit more data and avoid reading from them when maybe we could. This doesn't change how expansion works with vnodes. The same restrictions still apply. We won't officially support vnodes until we have done more testing and really thought through the corner cases. It's quite possible we will relax the restriction on creating transient keyspaces with vnodes in 4.0.x. 
> Transient Replication & Cheap Quorums: Decouple storage requirements from > consensus group size using incremental repair > --- > > Key: CASSANDRA-14404 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14404 > Project: Cassandra > Issue Type: New Feature > Components: Coordination, Core, CQL, Distributed Metadata, Hints, > Local Write-Read Paths, Materialized Views, Repair, Secondary Indexes, > Testing, Tools >Reporter: Ariel Weisberg >Assignee: Ariel Weisberg >Priority: Major > Fix For: 4.0 > > > Transient Replication is an implementation of [Witness > Replicas|http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.146.3429=rep1=pdf] > that leverages incremental repair to make full replicas consistent with > transient replicas that don't store the entire data set. Witness replicas are > used in real world systems such as Megastore and Spanner to increase > availability inexpensively without having to commit to more full copies of > the database. Transient replicas implement functionality similar to > upgradable and temporary replicas from the paper. > With transient replication the replication factor is increased beyond the > desired level of data redundancy by adding replicas that only store data when > sufficient full replicas are unavailable to store the data. These replicas > are called transient replicas. When incremental repair runs transient > replicas stream any data they have received to full replicas and once the > data is fully replicated it is dropped at the transient replicas. > Cheap quorums are a further set of optimizations on the write path to avoid > writing to transient replicas unless sufficient full replicas are available > as well as optimizations on the read path to prefer reading from transient > replicas. When writing at quorum to a table configured to use transient > replication the quorum will always prefer available full replicas over > transient replicas so that transient replicas don't have to process writes. 
> Rapid write protection (similar to rapid read protection) reduces tail > latency when full replicas are slow/unavailable to respond by sending writes > to additional replicas if necessary. > Transient replicas can generally service reads faster because they don't have > to do anything beyond bloom filter checks if they have no data. With vnodes > and larger size clusters they will not have a large quantity of data even in > failure cases where transient replicas start to serve a steady amount of > write traffic for some of their transiently replicated ranges.
[jira] [Updated] (CASSANDRA-14675) Log the actual (if server-generated) timestamp and nowInSeconds used by queries in FQL
[ https://issues.apache.org/jira/browse/CASSANDRA-14675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aleksey Yeschenko updated CASSANDRA-14675: -- Status: Patch Available (was: In Progress) > Log the actual (if server-generated) timestamp and nowInSeconds used by > queries in FQL > -- > > Key: CASSANDRA-14675 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14675 > Project: Cassandra > Issue Type: Improvement >Reporter: Aleksey Yeschenko >Assignee: Aleksey Yeschenko >Priority: Major > Labels: fqltool > Fix For: 4.0.x > > > FQL doesn't currently log the actual timestamp - in microseconds - if it's > been server generated, nor the nowInSeconds value. It needs to, to allow for > - in conjunction with CASSANDRA-14664 and CASSANDRA-14671 - deterministic > playback tests.
[jira] [Updated] (CASSANDRA-14675) Log the actual (if server-generated) timestamp and nowInSeconds used by queries in FQL
[ https://issues.apache.org/jira/browse/CASSANDRA-14675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aleksey Yeschenko updated CASSANDRA-14675: -- Resolution: Fixed Fix Version/s: (was: 4.0.x) 4.0 Status: Resolved (was: Patch Available) > Log the actual (if server-generated) timestamp and nowInSeconds used by > queries in FQL > -- > > Key: CASSANDRA-14675 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14675 > Project: Cassandra > Issue Type: Improvement >Reporter: Aleksey Yeschenko >Assignee: Aleksey Yeschenko >Priority: Major > Labels: fqltool > Fix For: 4.0 > > > FQL doesn't currently log the actual timestamp - in microseconds - if it's > been server generated, nor the nowInSeconds value. It needs to, to allow for > - in conjunction with CASSANDRA-14664 and CASSANDRA-14671 - deterministic > playback tests.
[jira] [Commented] (CASSANDRA-14675) Log the actual (if server-generated) timestamp and nowInSeconds used by queries in FQL
[ https://issues.apache.org/jira/browse/CASSANDRA-14675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599139#comment-16599139 ] Aleksey Yeschenko commented on CASSANDRA-14675: --- Thanks, committed as [5b645de13f8bea775d5a979712b3bea910960255|https://github.com/apache/cassandra/commit/5b645de13f8bea775d5a979712b3bea910960255] to trunk. FQL/AuditLog code needs more cleanup. I did as much as I could as part of this patch. > Log the actual (if server-generated) timestamp and nowInSeconds used by > queries in FQL > -- > > Key: CASSANDRA-14675 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14675 > Project: Cassandra > Issue Type: Improvement >Reporter: Aleksey Yeschenko >Assignee: Aleksey Yeschenko >Priority: Major > Labels: fqltool > Fix For: 4.0 > > > FQL doesn't currently log the actual timestamp - in microseconds - if it's > been server generated, nor the nowInSeconds value. It needs to, to allow for > - in conjunction with CASSANDRA-14664 and CASSANDRA-14671 - deterministic > playback tests.
cassandra git commit: Log the server-generated timestamp and nowInSeconds used by queries in FQL
Repository: cassandra Updated Branches: refs/heads/trunk 1e2f5244e -> 5b645de13 Log the server-generated timestamp and nowInSeconds used by queries in FQL patch by Aleksey Yeschenko; reviewed by Marcus Eriksson for CASSANDRA-14675 Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/5b645de1 Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/5b645de1 Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/5b645de1 Branch: refs/heads/trunk Commit: 5b645de13f8bea775d5a979712b3bea910960255 Parents: 1e2f524 Author: Aleksey Yeshchenko Authored: Fri Aug 31 14:47:02 2018 +0100 Committer: Aleksey Yeshchenko Committed: Fri Aug 31 19:42:23 2018 +0100 -- CHANGES.txt | 1 + .../apache/cassandra/audit/AuditLogEntry.java | 5 + .../apache/cassandra/audit/AuditLogManager.java | 11 +- .../apache/cassandra/audit/BinAuditLogger.java | 6 +- .../cassandra/audit/BinLogAuditLogger.java | 93 --- .../apache/cassandra/audit/FullQueryLogger.java | 255 ++- .../org/apache/cassandra/cql3/QueryOptions.java | 3 +- .../apache/cassandra/service/QueryState.java| 22 +- .../apache/cassandra/tools/fqltool/Dump.java| 148 +++ .../transport/messages/BatchMessage.java| 2 +- .../cassandra/audit/FullQueryLoggerTest.java| 129 ++ 11 files changed, 418 insertions(+), 257 deletions(-) -- http://git-wip-us.apache.org/repos/asf/cassandra/blob/5b645de1/CHANGES.txt -- diff --git a/CHANGES.txt b/CHANGES.txt index d2d9c86..9e76586 100644 --- a/CHANGES.txt +++ b/CHANGES.txt @@ -1,4 +1,5 @@ 4.0 + * Log server-generated timestamp and nowInSeconds used by queries in FQL (CASSANDRA-14675) * Add diagnostic events for read repairs (CASSANDRA-14668) * Use consistent nowInSeconds and timestamps values within a request (CASSANDRA-14671) * Add sampler for query time and expose with nodetool (CASSANDRA-14436) http://git-wip-us.apache.org/repos/asf/cassandra/blob/5b645de1/src/java/org/apache/cassandra/audit/AuditLogEntry.java -- diff --git 
a/src/java/org/apache/cassandra/audit/AuditLogEntry.java b/src/java/org/apache/cassandra/audit/AuditLogEntry.java index 0b891d4..4d3b867 100644 --- a/src/java/org/apache/cassandra/audit/AuditLogEntry.java +++ b/src/java/org/apache/cassandra/audit/AuditLogEntry.java @@ -153,6 +153,11 @@ public class AuditLogEntry return options; } +public QueryState getState() +{ +return state; +} + public static class Builder { private static final InetAddressAndPort DEFAULT_SOURCE; http://git-wip-us.apache.org/repos/asf/cassandra/blob/5b645de1/src/java/org/apache/cassandra/audit/AuditLogManager.java -- diff --git a/src/java/org/apache/cassandra/audit/AuditLogManager.java b/src/java/org/apache/cassandra/audit/AuditLogManager.java index ab9c2e9..25966f7 100644 --- a/src/java/org/apache/cassandra/audit/AuditLogManager.java +++ b/src/java/org/apache/cassandra/audit/AuditLogManager.java @@ -33,6 +33,7 @@ import org.apache.cassandra.config.DatabaseDescriptor; import org.apache.cassandra.cql3.CQLStatement; import org.apache.cassandra.cql3.QueryHandler; import org.apache.cassandra.cql3.QueryOptions; +import org.apache.cassandra.cql3.statements.BatchStatement; import org.apache.cassandra.exceptions.AuthenticationException; import org.apache.cassandra.exceptions.ConfigurationException; import org.apache.cassandra.exceptions.UnauthorizedException; @@ -187,7 +188,13 @@ public class AuditLogManager /** * Logs Batch queries to both FQL and standard audit logger. 
*/ -public void logBatch(String batchTypeName, List queryOrIdList, List> values, List prepared, QueryOptions options, QueryState state, long queryStartTimeMillis) +public void logBatch(BatchStatement.Type type, + List queryOrIdList, + List> values, + List prepared, + QueryOptions options, + QueryState state, + long queryStartTimeMillis) { if (isAuditingEnabled()) { @@ -205,7 +212,7 @@ public class AuditLogManager { queryStrings.add(prepStatment.rawCQLStatement); } -fullQueryLogger.logBatch(batchTypeName, queryStrings, values, options, queryStartTimeMillis); +fullQueryLogger.logBatch(type, queryStrings, values, options, state, queryStartTimeMillis); } }
[jira] [Commented] (CASSANDRA-14685) Incremental repair 4.0 : SSTables remain locked forever if the coordinator dies during streaming
[ https://issues.apache.org/jira/browse/CASSANDRA-14685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599128#comment-16599128 ] Alexander Dejanovski commented on CASSANDRA-14685: -- [~jasobrown], indeed, nodes 2 and 3 are still showing ongoing streams although node1 is down:
{noformat}
$ ccm node2 nodetool netstats
Mode: NORMAL
Repair e28883b0-ad4b-11e8-82ca-5fbf27df5fb6
    /127.0.0.1
        Sending 2 files, 49304220 bytes total. Already sent 0 files, 5373952 bytes total
            /Users/adejanovski/.ccm/inc-repair-issue/node2/data0/tlp_stress/sensor_data-67193da0ad4b11e88663cb45de9ab9e9/na-9-big-Data.db 5373952/34243878 bytes(15%) sent to idx:0/127.0.0.1
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name                    Active   Pending      Completed   Dropped
Large messages                  n/a         0              2         0
Small messages                  n/a         0         244612         0
Gossip messages                 n/a        23            531         0

$ ccm node3 nodetool netstats
Mode: NORMAL
Repair e269d820-ad4b-11e8-82ca-5fbf27df5fb6
    /127.0.0.1
        Sending 2 files, 49166315 bytes total.
        Already sent 1 files, 11748602 bytes total
            /Users/adejanovski/.ccm/inc-repair-issue/node3/data0/tlp_stress/sensor_data-67193da0ad4b11e88663cb45de9ab9e9/na-11-big-Data.db 8865018/8865018 bytes(100%) sent to idx:0/127.0.0.1
            /Users/adejanovski/.ccm/inc-repair-issue/node3/data0/tlp_stress/sensor_data-67193da0ad4b11e88663cb45de9ab9e9/na-9-big-Data.db 2883584/34198115 bytes(8%) sent to idx:0/127.0.0.1
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name                    Active   Pending      Completed   Dropped
Large messages                  n/a         0              2         0
Small messages                  n/a         0         244611         0
Gossip messages                 n/a         0            820         0
{noformat}
> Incremental repair 4.0 : SSTables remain locked forever if the coordinator > dies during streaming > - > > Key: CASSANDRA-14685 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14685 > Project: Cassandra > Issue Type: Bug > Components: Repair >Reporter: Alexander Dejanovski >Assignee: Jason Brown >Priority: Critical > > The changes in CASSANDRA-9143 modified the way incremental repair performs by > applying the following sequence of events : > * Anticompaction is executed on all replicas for all SSTables overlapping > the repaired ranges > * Anticompacted SSTables are then marked as "Pending repair" and cannot be > compacted anymore, nor part of another repair session > * Merkle trees are generated and compared > * Streaming takes place if needed > * Anticompaction is committed and "pending repair" table are marked as > repaired if it succeeded, or they are released if the repair session failed. > If the repair coordinator dies during the streaming phase, *the SSTables on > the replicas will remain in "pending repair" state and will never be eligible > for repair or compaction*, even after all the nodes in the cluster are > restarted. 
> Steps to reproduce (I've used Jason's 13938 branch that fixes streaming > errors) : > {noformat} > ccm create inc-repair-issue -v github:jasobrown/13938 -n 3 > # Allow jmx access and remove all rpc_ settings in yaml > for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra-env.sh; > do > sed -i'' -e > 's/com.sun.management.jmxremote.authenticate=true/com.sun.management.jmxremote.authenticate=false/g' > $f > done > for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra.yaml; > do > grep -v "rpc_" $f > ${f}.tmp > cat ${f}.tmp > $f > done > ccm start > {noformat} > I used [tlp-stress|https://github.com/thelastpickle/tlp-stress] to generate a > few 10s of MBs of data (killed it after some time). Obviously > cassandra-stress works as well : > {noformat} > bin/tlp-stress run BasicTimeSeries -i 1M -p 1M -t 2 --rate 5000 > --replication "{'class':'SimpleStrategy', 'replication_factor':2}" > --compaction "{'class': 'SizeTieredCompactionStrategy'}" --host > 127.0.0.1 > {noformat} > Flush and delete all SSTables in node1 : > {noformat} > ccm node1 nodetool flush > ccm node1 stop > rm -f ~/.ccm/inc-repair-issue/node1/data0/tlp_stress/sensor*/*.* > ccm node1 start{noformat} > Then throttle streaming throughput to 1MB/s so we have time to take node1 > down during the streaming phase and run repair: > {noformat} > ccm node1 nodetool setstreamthroughput 1 > ccm node2 nodetool setstreamthroughput 1 > ccm node3 nodetool setstreamthroughput 1 > ccm node1 nodetool repair tlp_stress > {noformat} > Once streaming starts, shut down node1 and start it again : > {noformat} > ccm node1 stop > ccm node1 start > {noformat} > Run repair again : > {noformat} > ccm node1 nodetool repair tlp_stress > {noformat} > The command will return very quickly, showing that it skipped all sstables : > {noformat} > [2018-08-31 19:05:16,292] Repair completed successfully >
[jira] [Updated] (CASSANDRA-14685) Incremental repair 4.0 : SSTables remain locked forever if the coordinator dies during streaming
[ https://issues.apache.org/jira/browse/CASSANDRA-14685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Dejanovski updated CASSANDRA-14685: - Description: The changes in CASSANDRA-9143 modified the way incremental repair performs by applying the following sequence of events : * Anticompaction is executed on all replicas for all SSTables overlapping the repaired ranges * Anticompacted SSTables are then marked as "Pending repair" and cannot be compacted anymore, nor part of another repair session * Merkle trees are generated and compared * Streaming takes place if needed * Anticompaction is committed and "pending repair" table are marked as repaired if it succeeded, or they are released if the repair session failed. If the repair coordinator dies during the streaming phase, *the SSTables on the replicas will remain in "pending repair" state and will never be eligible for repair or compaction*, even after all the nodes in the cluster are restarted. Steps to reproduce (I've used Jason's 13938 branch that fixes streaming errors) : {noformat} ccm create inc-repair-issue -v github:jasobrown/13938 -n 3 # Allow jmx access and remove all rpc_ settings in yaml for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra-env.sh; do sed -i'' -e 's/com.sun.management.jmxremote.authenticate=true/com.sun.management.jmxremote.authenticate=false/g' $f done for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra.yaml; do grep -v "rpc_" $f > ${f}.tmp cat ${f}.tmp > $f done ccm start {noformat} I used [tlp-stress|https://github.com/thelastpickle/tlp-stress] to generate a few 10s of MBs of data (killed it after some time). 
Obviously cassandra-stress works as well : {noformat} bin/tlp-stress run BasicTimeSeries -i 1M -p 1M -t 2 --rate 5000 --replication "{'class':'SimpleStrategy', 'replication_factor':2}" --compaction "{'class': 'SizeTieredCompactionStrategy'}" --host 127.0.0.1 {noformat} Flush and delete all SSTables in node1 : {noformat} ccm node1 nodetool flush ccm node1 stop rm -f ~/.ccm/inc-repair-issue/node1/data0/tlp_stress/sensor*/*.* ccm node1 start{noformat} Then throttle streaming throughput to 1MB/s so we have time to take node1 down during the streaming phase and run repair: {noformat} ccm node1 nodetool setstreamthroughput 1 ccm node2 nodetool setstreamthroughput 1 ccm node3 nodetool setstreamthroughput 1 ccm node1 nodetool repair tlp_stress {noformat} Once streaming starts, shut down node1 and start it again : {noformat} ccm node1 stop ccm node1 start {noformat} Run repair again : {noformat} ccm node1 nodetool repair tlp_stress {noformat} The command will return very quickly, showing that it skipped all sstables : {noformat} [2018-08-31 19:05:16,292] Repair completed successfully [2018-08-31 19:05:16,292] Repair command #1 finished in 2 seconds $ ccm node1 nodetool status Datacenter: datacenter1 === Status=Up/Down |/ State=Normal/Leaving/Joining/Moving -- AddressLoad Tokens OwnsHost ID Rack UN 127.0.0.1 228,64 KiB 256 ? 437dc9cd-b1a1-41a5-961e-cfc99763e29f rack1 UN 127.0.0.2 60,09 MiB 256 ? fbcbbdbb-e32a-4716-8230-8ca59aa93e62 rack1 UN 127.0.0.3 57,59 MiB 256 ? a0b1bcc6-0fad-405a-b0bf-180a0ca31dd0 rack1 {noformat} sstablemetadata will then show that nodes 2 and 3 have SSTables still in "pending repair" state : {noformat} ~/.ccm/repository/gitCOLONtrunk/tools/bin/sstablemetadata na-4-big-Data.db | grep repair SSTable: /Users/adejanovski/.ccm/inc-repair-4.0/node2/data0/tlp_stress/sensor_data-b7375660ad3111e8a0e59357ff9c9bda/na-4-big Pending repair: 3844a400-ad33-11e8-b5a7-6b8dd8f31b62 {noformat} Restarting these nodes wouldn't help either. 
was: The changes in CASSANDRA-9143 modified the way incremental repair performs by applying the following sequence of events : * Anticompaction is executed on all replicas for all SSTables overlapping the repaired ranges * Anticompacted SSTables are then marked as "Pending repair" and cannot be compacted anymore, nor part of another repair session * Merkle trees are generated and compared * Streaming takes place if needed * Anticompaction is committed and "pending repair" table are marked as repaired if it succeeded, or they are released if the repair session failed. If the repair coordinator dies during the streaming phase, *the SSTables on the replicas will remain in "pending repair" state and will never be eligible for repair or compaction*, even after all the nodes in the cluster are restarted. Steps to reproduce (I've used Jason's 13938 branch that fixes streaming errors) : {noformat} ccm create inc-repair-issue -v github:jasobrown/13938 -n 3 # Allow jmx access and remove all rpc_ settings in yaml for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra-env.sh; do sed -i'' -e
[jira] [Commented] (CASSANDRA-14404) Transient Replication & Cheap Quorums: Decouple storage requirements from consensus group size using incremental repair
[ https://issues.apache.org/jira/browse/CASSANDRA-14404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599094#comment-16599094 ] Constance Eustace commented on CASSANDRA-14404: --- So are (basically) these transient nodes basically serving as centralized hinted handoff caches rather than having the hinted handoffs cluttering up full replicas, especially nodes that have no concern for the token range involved? I understand that hinted handoffs aren't being replaced by this, but is that kind of the idea? Are the transient nodes sitting around? Will the transient nodes have cheaper/lower hardware requirements? During cluster expansion, does the newly streaming node acquiring data function as a temporary transient node until it becomes a full replica? Likewise while shrinking, does a previously full replica function as a transient while it streams off data? Can this help vnode expansion with multiple concurrent nodes? Admittedly I'm not familiar with how much work has gone into fixing cluster expansion with vnodes, it is my understanding that you typically expand only one node at a time or in multiples of the datacenter size > Transient Replication & Cheap Quorums: Decouple storage requirements from > consensus group size using incremental repair > --- > > Key: CASSANDRA-14404 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14404 > Project: Cassandra > Issue Type: New Feature > Components: Coordination, Core, CQL, Distributed Metadata, Hints, > Local Write-Read Paths, Materialized Views, Repair, Secondary Indexes, > Testing, Tools >Reporter: Ariel Weisberg >Assignee: Ariel Weisberg >Priority: Major > Fix For: 4.0 > > > Transient Replication is an implementation of [Witness > Replicas|http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.146.3429=rep1=pdf] > that leverages incremental repair to make full replicas consistent with > transient replicas that don't store the entire data set. 
Witness replicas are > used in real world systems such as Megastore and Spanner to increase > availability inexpensively without having to commit to more full copies of > the database. Transient replicas implement functionality similar to > upgradable and temporary replicas from the paper. > With transient replication the replication factor is increased beyond the > desired level of data redundancy by adding replicas that only store data when > sufficient full replicas are unavailable to store the data. These replicas > are called transient replicas. When incremental repair runs transient > replicas stream any data they have received to full replicas and once the > data is fully replicated it is dropped at the transient replicas. > Cheap quorums are a further set of optimizations on the write path to avoid > writing to transient replicas unless sufficient full replicas are available > as well as optimizations on the read path to prefer reading from transient > replicas. When writing at quorum to a table configured to use transient > replication the quorum will always prefer available full replicas over > transient replicas so that transient replicas don't have to process writes. > Rapid write protection (similar to rapid read protection) reduces tail > latency when full replicas are slow/unavailable to respond by sending writes > to additional replicas if necessary. > Transient replicas can generally service reads faster because they don't have > to do anything beyond bloom filter checks if they have no data. With vnodes > and larger size clusters they will not have a large quantity of data even in > failure cases where transient replicas start to serve a steady amount of > write traffic for some of their transiently replicated ranges.
[jira] [Assigned] (CASSANDRA-14685) Incremental repair 4.0 : SSTables remain locked forever if the coordinator dies during streaming
[ https://issues.apache.org/jira/browse/CASSANDRA-14685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Brown reassigned CASSANDRA-14685: --- Assignee: Jason Brown > Incremental repair 4.0 : SSTables remain locked forever if the coordinator > dies during streaming > - > > Key: CASSANDRA-14685 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14685 > Project: Cassandra > Issue Type: Bug > Components: Repair >Reporter: Alexander Dejanovski >Assignee: Jason Brown >Priority: Critical > > The changes in CASSANDRA-9143 modified the way incremental repair works by > applying the following sequence of events : > * Anticompaction is executed on all replicas for all SSTables overlapping > the repaired ranges > * Anticompacted SSTables are then marked as "Pending repair" and cannot be > compacted anymore, nor be part of another repair session > * Merkle trees are generated and compared > * Streaming takes place if needed > * Anticompaction is committed and "pending repair" tables are marked as > repaired if it succeeded, or they are released if the repair session failed. > If the repair coordinator dies during the streaming phase, *the SSTables on > the replicas will remain in "pending repair" state and will never be eligible > for repair or compaction*, even after all the nodes in the cluster are > restarted. 
> Steps to reproduce (I've used Jason's 13938 branch that fixes streaming > errors) : > {noformat} > ccm create inc-repair-issue -v github:jasobrown/13938 -n 3 > # Allow jmx access and remove all rpc_ settings in yaml > for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra-env.sh; > do > sed -i'' -e > 's/com.sun.management.jmxremote.authenticate=true/com.sun.management.jmxremote.authenticate=false/g' > $f > done > for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra.yaml; > do > grep -v "rpc_" $f > ${f}.tmp > cat ${f}.tmp > $f > done > ccm start > {noformat} > I used [tlp-stress|https://github.com/thelastpickle/tlp-stress] to generate a > few 10s of MBs of data (killed it after some time). Obviously > cassandra-stress works as well : > {noformat} > bin/tlp-stress run BasicTimeSeries -i 1M -p 1M -t 2 --rate 5000 > --replication "{'class':'SimpleStrategy', 'replication_factor':2}" > --compaction "{'class': 'SizeTieredCompactionStrategy'}" --host > 127.0.0.1 > {noformat} > Flush and delete all SSTables in node1 : > {noformat} > ccm node1 nodetool flush > rm -f ~/.ccm/inc-repair-issue/node1/data0/tlp_stress/sensor*/*.* > {noformat} > Then throttle streaming throughput to 1MB/s so we have time to take node1 > down during the streaming phase and run repair: > {noformat} > ccm node1 nodetool setstreamthroughput 1 > ccm node2 nodetool setstreamthroughput 1 > ccm node3 nodetool setstreamthroughput 1 > ccm node1 nodetool repair tlp_stress > {noformat} > Once streaming starts, shut down node1 and start it again : > {noformat} > ccm node1 stop > ccm node1 start > {noformat} > Run repair again : > {noformat} > ccm node1 nodetool repair tlp_stress > {noformat} > The command will return very quickly, showing that it skipped all sstables : > {noformat} > [2018-08-31 19:05:16,292] Repair completed successfully > [2018-08-31 19:05:16,292] Repair command #1 finished in 2 seconds > $ ccm node1 nodetool status > Datacenter: datacenter1 > === > Status=Up/Down > |/ 
State=Normal/Leaving/Joining/Moving > -- AddressLoad Tokens OwnsHost ID >Rack > UN 127.0.0.1 228,64 KiB 256 ? > 437dc9cd-b1a1-41a5-961e-cfc99763e29f rack1 > UN 127.0.0.2 60,09 MiB 256 ? > fbcbbdbb-e32a-4716-8230-8ca59aa93e62 rack1 > UN 127.0.0.3 57,59 MiB 256 ? > a0b1bcc6-0fad-405a-b0bf-180a0ca31dd0 rack1 > {noformat} > sstablemetadata will then show that nodes 2 and 3 have SSTables still in > "pending repair" state : > {noformat} > ~/.ccm/repository/gitCOLONtrunk/tools/bin/sstablemetadata na-4-big-Data.db | > grep repair > SSTable: > /Users/adejanovski/.ccm/inc-repair-4.0/node2/data0/tlp_stress/sensor_data-b7375660ad3111e8a0e59357ff9c9bda/na-4-big > Pending repair: 3844a400-ad33-11e8-b5a7-6b8dd8f31b62 > {noformat} > Restarting these nodes wouldn't help either.
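The sequence of events described in the ticket (anticompaction marks sstables as pending a repair session, which must later be either committed or released) can be sketched as a tiny state machine to illustrate the failure mode. This is a hypothetical illustration, not Cassandra's actual pending-repair code: if the coordinator dies before the session is finalized or failed, the sstables stay pending forever.

```java
import java.util.UUID;

// Hypothetical sketch of the pending-repair lifecycle: an sstable marked
// with a pending-repair session id is excluded from compaction and from
// other repair sessions until that session is finalized or failed.
public class PendingRepairSketch
{
    public enum State { UNREPAIRED, PENDING, REPAIRED }

    public static class SSTable
    {
        public State state = State.UNREPAIRED;
        public UUID pendingSession;   // set while anticompacted for a session
    }

    // An sstable is eligible for compaction only when it is not pending.
    public static boolean eligibleForCompaction(SSTable t)
    {
        return t.state != State.PENDING;
    }

    public static void antiCompact(SSTable t, UUID session)
    {
        t.state = State.PENDING;
        t.pendingSession = session;
    }

    // Called when the session completes: mark repaired on success,
    // or release the sstable back to unrepaired on failure.
    public static void finalizeSession(SSTable t, UUID session, boolean success)
    {
        if (session.equals(t.pendingSession))
        {
            t.state = success ? State.REPAIRED : State.UNREPAIRED;
            t.pendingSession = null;
        }
    }
}
```

If the coordinator dies mid-streaming, `finalizeSession` is never invoked for the session, so `eligibleForCompaction` stays false indefinitely, matching the behavior reported above.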
[jira] [Commented] (CASSANDRA-14685) Incremental repair 4.0 : SSTables remain locked forever if the coordinator dies during streaming
[ https://issues.apache.org/jira/browse/CASSANDRA-14685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599086#comment-16599086 ] Jason Brown commented on CASSANDRA-14685: - Thanks for the report, [~adejanovski]. I'll be able to look into this next week, and I'm assigning the ticket to myself as a reminder. I'm not sure [~bdeggleston] can get to it before next week either. I'm not sure if this is due to the stream sessions on nodes 2 and 3 not properly closing (and thus not informing the repair sessions they are part of), or if it's something getting lost in the repair session. Do nodes 2/3 show any streaming or repair activities (via nodetool cmds) after the repair coordinator dies? > Incremental repair 4.0 : SSTables remain locked forever if the coordinator > dies during streaming > - > > Key: CASSANDRA-14685 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14685 > Project: Cassandra > Issue Type: Bug > Components: Repair >Reporter: Alexander Dejanovski >Priority: Critical > > The changes in CASSANDRA-9143 modified the way incremental repair works by > applying the following sequence of events : > * Anticompaction is executed on all replicas for all SSTables overlapping > the repaired ranges > * Anticompacted SSTables are then marked as "Pending repair" and cannot be > compacted anymore, nor be part of another repair session > * Merkle trees are generated and compared > * Streaming takes place if needed > * Anticompaction is committed and "pending repair" tables are marked as > repaired if it succeeded, or they are released if the repair session failed. > If the repair coordinator dies during the streaming phase, *the SSTables on > the replicas will remain in "pending repair" state and will never be eligible > for repair or compaction*, even after all the nodes in the cluster are > restarted. 
> Steps to reproduce (I've used Jason's 13938 branch that fixes streaming > errors) : > {noformat} > ccm create inc-repair-issue -v github:jasobrown/13938 -n 3 > # Allow jmx access and remove all rpc_ settings in yaml > for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra-env.sh; > do > sed -i'' -e > 's/com.sun.management.jmxremote.authenticate=true/com.sun.management.jmxremote.authenticate=false/g' > $f > done > for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra.yaml; > do > grep -v "rpc_" $f > ${f}.tmp > cat ${f}.tmp > $f > done > ccm start > {noformat} > I used [tlp-stress|https://github.com/thelastpickle/tlp-stress] to generate a > few 10s of MBs of data (killed it after some time). Obviously > cassandra-stress works as well : > {noformat} > bin/tlp-stress run BasicTimeSeries -i 1M -p 1M -t 2 --rate 5000 > --replication "{'class':'SimpleStrategy', 'replication_factor':2}" > --compaction "{'class': 'SizeTieredCompactionStrategy'}" --host > 127.0.0.1 > {noformat} > Flush and delete all SSTables in node1 : > {noformat} > ccm node1 nodetool flush > rm -f ~/.ccm/inc-repair-issue/node1/data0/tlp_stress/sensor*/*.* > {noformat} > Then throttle streaming throughput to 1MB/s so we have time to take node1 > down during the streaming phase and run repair: > {noformat} > ccm node1 nodetool setstreamthroughput 1 > ccm node2 nodetool setstreamthroughput 1 > ccm node3 nodetool setstreamthroughput 1 > ccm node1 nodetool repair tlp_stress > {noformat} > Once streaming starts, shut down node1 and start it again : > {noformat} > ccm node1 stop > ccm node1 start > {noformat} > Run repair again : > {noformat} > ccm node1 nodetool repair tlp_stress > {noformat} > The command will return very quickly, showing that it skipped all sstables : > {noformat} > [2018-08-31 19:05:16,292] Repair completed successfully > [2018-08-31 19:05:16,292] Repair command #1 finished in 2 seconds > $ ccm node1 nodetool status > Datacenter: datacenter1 > === > Status=Up/Down > |/ 
State=Normal/Leaving/Joining/Moving > -- AddressLoad Tokens OwnsHost ID >Rack > UN 127.0.0.1 228,64 KiB 256 ? > 437dc9cd-b1a1-41a5-961e-cfc99763e29f rack1 > UN 127.0.0.2 60,09 MiB 256 ? > fbcbbdbb-e32a-4716-8230-8ca59aa93e62 rack1 > UN 127.0.0.3 57,59 MiB 256 ? > a0b1bcc6-0fad-405a-b0bf-180a0ca31dd0 rack1 > {noformat} > sstablemetadata will then show that nodes 2 and 3 have SSTables still in > "pending repair" state : > {noformat} > ~/.ccm/repository/gitCOLONtrunk/tools/bin/sstablemetadata na-4-big-Data.db | > grep repair > SSTable: > /Users/adejanovski/.ccm/inc-repair-4.0/node2/data0/tlp_stress/sensor_data-b7375660ad3111e8a0e59357ff9c9bda/na-4-big > Pending repair: 3844a400-ad33-11e8-b5a7-6b8dd8f31b62 > {noformat} > Restarting these nodes wouldn't help either.
[jira] [Updated] (CASSANDRA-14668) Diag events for read repairs
[ https://issues.apache.org/jira/browse/CASSANDRA-14668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stefan Podkowinski updated CASSANDRA-14668: --- Resolution: Fixed Fix Version/s: (was: 4.x) 4.0 Status: Resolved (was: Patch Available) Committed as 1e2f5244e5 > Diag events for read repairs > > > Key: CASSANDRA-14668 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14668 > Project: Cassandra > Issue Type: Improvement > Components: Observability >Reporter: Stefan Podkowinski >Assignee: Stefan Podkowinski >Priority: Major > Fix For: 4.0 > > > Read repairs have been a highly discussed topic during the last months and > also saw some significant code changes. I'd like to be better prepared in > case we need to investigate any further RR issues in the future, by adding > diagnostic events that can be enabled for exposing information such as: > * contacted endpoints > * digest responses by endpoint > * affected partition keys > * speculated reads / writes
cassandra git commit: Add diag events for read repairs
Repository: cassandra Updated Branches: refs/heads/trunk aed682513 -> 1e2f5244e Add diag events for read repairs patch by Stefan Podkowinski; reviewed by Mick Semb Wever for CASSANDRA-14668 Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/1e2f5244 Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/1e2f5244 Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/1e2f5244 Branch: refs/heads/trunk Commit: 1e2f5244e5e341f32d23872104fad3b55dbf0cb0 Parents: aed6825 Author: Stefan Podkowinski Authored: Mon Aug 27 13:45:27 2018 +0200 Committer: Stefan Podkowinski Committed: Fri Aug 31 19:40:57 2018 +0200 -- CHANGES.txt | 1 + .../cassandra/service/reads/DigestResolver.java | 30 ++- .../reads/repair/AbstractReadRepair.java| 2 + .../reads/repair/BlockingPartitionRepair.java | 18 ++ .../reads/repair/PartitionRepairEvent.java | 102 ++ .../reads/repair/ReadRepairDiagnostics.java | 78 .../service/reads/repair/ReadRepairEvent.java | 114 +++ .../DiagEventsBlockingReadRepairTest.java | 192 +++ 8 files changed, 536 insertions(+), 1 deletion(-) -- http://git-wip-us.apache.org/repos/asf/cassandra/blob/1e2f5244/CHANGES.txt -- diff --git a/CHANGES.txt b/CHANGES.txt index e40cf27..d2d9c86 100644 --- a/CHANGES.txt +++ b/CHANGES.txt @@ -1,4 +1,5 @@ 4.0 + * Add diagnostic events for read repairs (CASSANDRA-14668) * Use consistent nowInSeconds and timestamps values within a request (CASSANDRA-14671) * Add sampler for query time and expose with nodetool (CASSANDRA-14436) * Clean up Message.Request implementations (CASSANDRA-14677) http://git-wip-us.apache.org/repos/asf/cassandra/blob/1e2f5244/src/java/org/apache/cassandra/service/reads/DigestResolver.java -- diff --git a/src/java/org/apache/cassandra/service/reads/DigestResolver.java b/src/java/org/apache/cassandra/service/reads/DigestResolver.java index b2eb0c6..897892f 100644 --- a/src/java/org/apache/cassandra/service/reads/DigestResolver.java +++ 
b/src/java/org/apache/cassandra/service/reads/DigestResolver.java @@ -25,9 +25,10 @@ import com.google.common.base.Preconditions; import org.apache.cassandra.db.*; import org.apache.cassandra.db.partitions.PartitionIterator; import org.apache.cassandra.db.partitions.UnfilteredPartitionIterators; +import org.apache.cassandra.locator.InetAddressAndPort; import org.apache.cassandra.net.MessageIn; import org.apache.cassandra.service.reads.repair.ReadRepair; -import org.apache.cassandra.tracing.TraceState; +import org.apache.cassandra.utils.ByteBufferUtil; public class DigestResolver extends ResponseResolver { @@ -82,4 +83,31 @@ public class DigestResolver extends ResponseResolver { return dataResponse != null; } + +public DigestResolverDebugResult[] getDigestsByEndpoint() +{ +DigestResolverDebugResult[] ret = new DigestResolverDebugResult[responses.size()]; +for (int i = 0; i < responses.size(); i++) +{ +MessageIn message = responses.get(i); +ReadResponse response = message.payload; +String digestHex = ByteBufferUtil.bytesToHex(response.digest(command)); +ret[i] = new DigestResolverDebugResult(message.from, digestHex, message.payload.isDigestResponse()); +} +return ret; +} + +public static class DigestResolverDebugResult +{ +public InetAddressAndPort from; +public String digestHex; +public boolean isDigestResponse; + +private DigestResolverDebugResult(InetAddressAndPort from, String digestHex, boolean isDigestResponse) +{ +this.from = from; +this.digestHex = digestHex; +this.isDigestResponse = isDigestResponse; +} +} } http://git-wip-us.apache.org/repos/asf/cassandra/blob/1e2f5244/src/java/org/apache/cassandra/service/reads/repair/AbstractReadRepair.java -- diff --git a/src/java/org/apache/cassandra/service/reads/repair/AbstractReadRepair.java b/src/java/org/apache/cassandra/service/reads/repair/AbstractReadRepair.java index a1cf827..7e3f0ae 100644 --- a/src/java/org/apache/cassandra/service/reads/repair/AbstractReadRepair.java +++ 
b/src/java/org/apache/cassandra/service/reads/repair/AbstractReadRepair.java @@ -122,6 +122,7 @@ public abstract class AbstractReadRepair implements ReadRepair Tracing.trace("Enqueuing full data read to {}", endpoint); sendReadCommand(endpoint, readCallback); } +
[jira] [Updated] (CASSANDRA-14671) Use consistent nowInSeconds and timestamps values within a request
[ https://issues.apache.org/jira/browse/CASSANDRA-14671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aleksey Yeschenko updated CASSANDRA-14671: -- Resolution: Fixed Fix Version/s: (was: 4.0.x) 4.0 Status: Resolved (was: Patch Available) > Use consistent nowInSeconds and timestamps values within a request > -- > > Key: CASSANDRA-14671 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14671 > Project: Cassandra > Issue Type: Improvement >Reporter: Aleksey Yeschenko >Assignee: Aleksey Yeschenko >Priority: Minor > Labels: fqltool > Fix For: 4.0 > > > We don't currently use consistent values of {{nowInSeconds}} and > {{timestamp}} in the codebase, and sometimes generate several server-side > timestamps for each in the same request. {{QueryState}} should cache the > values it generated so that the same values are used for the duration of > write/read.
[jira] [Commented] (CASSANDRA-14671) Use consistent nowInSeconds and timestamps values within a request
[ https://issues.apache.org/jira/browse/CASSANDRA-14671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599063#comment-16599063 ] Aleksey Yeschenko commented on CASSANDRA-14671: --- Committed as [aed682513cc381b80705d1f971fddc394e8a62a5|https://github.com/apache/cassandra/commit/aed682513cc381b80705d1f971fddc394e8a62a5] to trunk, thanks. > Use consistent nowInSeconds and timestamps values within a request > -- > > Key: CASSANDRA-14671 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14671 > Project: Cassandra > Issue Type: Improvement >Reporter: Aleksey Yeschenko >Assignee: Aleksey Yeschenko >Priority: Minor > Labels: fqltool > Fix For: 4.0.x > > > We don't currently use consistent values of {{nowInSeconds}} and > {{timestamp}} in the codebase, and sometimes generate several server-side > timestamps for each in the same request. {{QueryState}} should cache the > values it generated so that the same values are used for the duration of > write/read.
cassandra git commit: Use consistent nowInSeconds and timestamps values within a request
Repository: cassandra Updated Branches: refs/heads/trunk f31d1a05a -> aed682513 Use consistent nowInSeconds and timestamps values within a request patch by Aleksey Yeschenko; reviewed by Chris Lohfink for CASSANDRA-14671 Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/aed68251 Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/aed68251 Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/aed68251 Branch: refs/heads/trunk Commit: aed682513cc381b80705d1f971fddc394e8a62a5 Parents: f31d1a05 Author: Aleksey Yeshchenko Authored: Fri Aug 31 11:13:03 2018 +0100 Committer: Aleksey Yeshchenko Committed: Fri Aug 31 18:29:33 2018 +0100 -- CHANGES.txt | 1 + .../cassandra/cql3/BatchQueryOptions.java | 4 +- .../org/apache/cassandra/cql3/QueryOptions.java | 4 +- .../apache/cassandra/cql3/UpdateParameters.java | 4 +- .../cql3/statements/BatchStatement.java | 64 +++- .../cql3/statements/CQL3CasRequest.java | 43 +--- .../cql3/statements/ModificationStatement.java | 101 +-- .../cql3/statements/SelectStatement.java| 7 +- .../cassandra/io/sstable/CQLSSTableWriter.java | 10 +- .../apache/cassandra/service/QueryState.java| 54 +++--- .../org/apache/cassandra/cql3/ListsTest.java| 4 +- .../cassandra/transport/SerDeserTest.java | 7 +- .../io/sstable/StressCQLSSTableWriter.java | 9 +- 13 files changed, 206 insertions(+), 106 deletions(-) -- http://git-wip-us.apache.org/repos/asf/cassandra/blob/aed68251/CHANGES.txt -- diff --git a/CHANGES.txt b/CHANGES.txt index 475cd48..e40cf27 100644 --- a/CHANGES.txt +++ b/CHANGES.txt @@ -1,4 +1,5 @@ 4.0 + * Use consistent nowInSeconds and timestamps values within a request (CASSANDRA-14671) * Add sampler for query time and expose with nodetool (CASSANDRA-14436) * Clean up Message.Request implementations (CASSANDRA-14677) * Disable old native protocol versions on demand (CASANDRA-14659) 
http://git-wip-us.apache.org/repos/asf/cassandra/blob/aed68251/src/java/org/apache/cassandra/cql3/BatchQueryOptions.java -- diff --git a/src/java/org/apache/cassandra/cql3/BatchQueryOptions.java b/src/java/org/apache/cassandra/cql3/BatchQueryOptions.java index ac0d148..ac8f179 100644 --- a/src/java/org/apache/cassandra/cql3/BatchQueryOptions.java +++ b/src/java/org/apache/cassandra/cql3/BatchQueryOptions.java @@ -84,9 +84,9 @@ public abstract class BatchQueryOptions return wrapped.getTimestamp(state); } -public int getNowInSeconds() +public int getNowInSeconds(QueryState state) { -return wrapped.getNowInSeconds(); +return wrapped.getNowInSeconds(state); } private static class WithoutPerStatementVariables extends BatchQueryOptions http://git-wip-us.apache.org/repos/asf/cassandra/blob/aed68251/src/java/org/apache/cassandra/cql3/QueryOptions.java -- diff --git a/src/java/org/apache/cassandra/cql3/QueryOptions.java b/src/java/org/apache/cassandra/cql3/QueryOptions.java index e546304..f76d6b2 100644 --- a/src/java/org/apache/cassandra/cql3/QueryOptions.java +++ b/src/java/org/apache/cassandra/cql3/QueryOptions.java @@ -200,10 +200,10 @@ public abstract class QueryOptions return tstamp != Long.MIN_VALUE ? tstamp : state.getTimestamp(); } -public int getNowInSeconds() +public int getNowInSeconds(QueryState state) { int nowInSeconds = getSpecificOptions().nowInSeconds; -return Integer.MIN_VALUE == nowInSeconds ? FBUtilities.nowInSeconds() : nowInSeconds; +return nowInSeconds != Integer.MIN_VALUE ? nowInSeconds : state.getNowInSeconds(); } /** The keyspace that this query is bound to, or null if not relevant. 
*/ http://git-wip-us.apache.org/repos/asf/cassandra/blob/aed68251/src/java/org/apache/cassandra/cql3/UpdateParameters.java -- diff --git a/src/java/org/apache/cassandra/cql3/UpdateParameters.java b/src/java/org/apache/cassandra/cql3/UpdateParameters.java index 500862e..740cd91 100644 --- a/src/java/org/apache/cassandra/cql3/UpdateParameters.java +++ b/src/java/org/apache/cassandra/cql3/UpdateParameters.java @@ -28,7 +28,6 @@ import org.apache.cassandra.db.filter.ColumnFilter; import org.apache.cassandra.db.partitions.Partition; import org.apache.cassandra.db.rows.*; import org.apache.cassandra.exceptions.InvalidRequestException; -import org.apache.cassandra.utils.FBUtilities; /** * Groups the parameters of an update
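The pattern in the patch above — {{QueryState}} caching server-generated values so every consumer within a single request observes the same timestamp and nowInSeconds — can be sketched in isolation. This is a simplified illustration of the caching idea, not the committed QueryState code.

```java
// Simplified sketch of per-request caching of server-generated time values:
// the first call generates the value, later calls within the same request
// reuse it, so a single write/read never sees two different timestamps.
public class QueryStateSketch
{
    private long timestampMicros = Long.MIN_VALUE;
    private int nowInSeconds = Integer.MIN_VALUE;

    public long getTimestamp()
    {
        if (timestampMicros == Long.MIN_VALUE)                 // first use: generate
            timestampMicros = System.currentTimeMillis() * 1000;
        return timestampMicros;                                // later uses: cached
    }

    public int getNowInSeconds()
    {
        if (nowInSeconds == Integer.MIN_VALUE)
            nowInSeconds = (int) (System.currentTimeMillis() / 1000);
        return nowInSeconds;
    }
}
```

This also explains the signature change in the diff from `getNowInSeconds()` to `getNowInSeconds(QueryState state)`: options without an explicit client-supplied value defer to the per-request state instead of generating a fresh value each time.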
[jira] [Commented] (CASSANDRA-14145) Detecting data resurrection during read
[ https://issues.apache.org/jira/browse/CASSANDRA-14145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599053#comment-16599053 ] Sam Tunnicliffe commented on CASSANDRA-14145: - bq. Since we lose a good chunk of the benefits of the optimization, would it make more sense to disable it entirely I agree. We've been pushing and pulling this around for a while now, in an attempt to retain some of the optimisations of {{queryMemtableAndSSTablesInTimestampOrder}} when tracking is enabled, but nothing so far has really been satisfactory. +1 to just disabling it when tracking for now & revisiting later. bq. I would turn inconclusive tracking off by default Fair enough. As everything is off by default for now it probably makes for a better UI to selectively turn things on rather than flipping some switches up and some down. To that end I've reversed the semantics of that third option from "only report confirmed" to "also report unconfirmed". bq. Missing new cassandra.yaml entries Sorry, forgot to git add the yaml file before pushing the previous commits. Updated for default changes above & pushed. bq. Minor nits Addressed and bundled in a single commit, with a couple of caveats: I came to the conclusion that {{RepairedDataInfo}} really ought to be private to the {{ReadCommand}} using it, so I removed the iface and just made it a private class. The digest and isConclusive are accessed via {{ReadCommand}} itself which I think does a better job of hiding unnecessary information. I did leave {{NO_OP_REPAIRED_DATA_INFO}} in (slightly renamed) as I'd rather a safe nullobject there than worry about null checking every use of it in case some path is overlooked or untested. I also felt that the splitting of iterators according to the repaired status of the sstables (or memtable) that they come from was a bit ugly and invasive, so I've refactored that into another inner class of {{ReadCommand}}, which tidied up {{SPRC/PRRC}} quite a bit. 
One last thing, [~jrwest] mentioned off-JIRA that there's an edge case where compaction is backed up and so an sstable may be marked pending, but the session to which it belongs has been purged. In that case, we'd mistakenly consider digests inconclusive, so I've added a check that the session actually exists, and if not we consider the sstable unrepaired. > Detecting data resurrection during read > > > Key: CASSANDRA-14145 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14145 > Project: Cassandra > Issue Type: Improvement >Reporter: sankalp kohli >Assignee: Sam Tunnicliffe >Priority: Minor > Fix For: 4.x > > > We have seen several bugs in which deleted data gets resurrected. We should > try to see if we can detect this on the read path and possibly fix it. Here > are a few examples which brought back data > A replica lost an sstable on startup which caused one replica to lose the > tombstone and not the data. This tombstone was past gc grace which means this > could resurrect data. We can detect such invalid states by looking at other > replicas. > If we are running incremental repair, Cassandra will keep repaired and > non-repaired data separate. Every-time incremental repair will run, it will > move the data from non-repaired to repaired. Repaired data across all > replicas should be 100% consistent. > Here is an example of how we can detect and mitigate the issue in most cases. > Say we have 3 machines, A,B and C. All these machines will have data split > b/w repaired and non-repaired. > 1. Machine A due to some bug bring backs data D. This data D is in repaired > dataset. All other replicas will have data D and tombstone T > 2. Read for data D comes from application which involve replicas A and B. The > data being read involves data which is in repaired state. A will respond > back to co-ordinator with data D and B will send nothing as tombstone is past > gc grace. This will cause digest mismatch. > 3. 
This patch will only kick in when there is a digest mismatch. Co-ordinator > will ask both replicas to send back all data like we do today but with this > patch, replicas will respond back what data it is returning is coming from > repaired vs non-repaired. If data coming from repaired does not match, we > know there is a something wrong!! At this time, co-ordinator cannot determine > if replica A has resurrected some data or replica B has lost some data. We > can still log error in the logs saying we hit an invalid state. > 4. Besides the log, we can take this further and even correct the response to > the query. After logging an invalid state, we can ask replica A and B (and > also C if alive) to send back all data for this including gcable tombstones. > If any machine returns a tombstone which is
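The detection scheme discussed in this ticket — replicas annotate full-data responses with a digest over the repaired portion of the data, and the coordinator flags mismatches among conclusive digests — could look roughly like the following. The names here (`RepairedDigest`, `repairedDataMatches`) are hypothetical, and this is only a sketch of the comparison logic, not the actual patch.

```java
import java.util.List;

// Hypothetical sketch of coordinator-side repaired-data digest comparison:
// repaired data should be identical on all replicas, so any mismatch among
// conclusive digests signals resurrection or data loss somewhere.
public class RepairedDataCheckSketch
{
    public static class RepairedDigest
    {
        public final String digestHex;
        public final boolean conclusive;  // false if e.g. pending-repair sstables were read

        public RepairedDigest(String digestHex, boolean conclusive)
        {
            this.digestHex = digestHex;
            this.conclusive = conclusive;
        }
    }

    /** Returns true if all conclusive repaired-data digests agree. */
    public static boolean repairedDataMatches(List<RepairedDigest> digests)
    {
        String reference = null;
        for (RepairedDigest d : digests)
        {
            if (!d.conclusive)
                continue;                 // inconclusive digests can't prove a mismatch
            if (reference == null)
                reference = d.digestHex;
            else if (!reference.equals(d.digestHex))
                return false;             // invalid state: log and investigate
        }
        return true;
    }
}
```

As in step 3 of the description, a mismatch cannot by itself say which replica resurrected data and which lost it; it only establishes that the cluster reached an invalid state worth logging.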
[jira] [Commented] (CASSANDRA-14671) Use consistent nowInSeconds and timestamps values within a request
[ https://issues.apache.org/jira/browse/CASSANDRA-14671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599052#comment-16599052 ] Chris Lohfink commented on CASSANDRA-14671: --- +1 > Use consistent nowInSeconds and timestamps values within a request > -- > > Key: CASSANDRA-14671 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14671 > Project: Cassandra > Issue Type: Improvement >Reporter: Aleksey Yeschenko >Assignee: Aleksey Yeschenko >Priority: Minor > Labels: fqltool > Fix For: 4.0.x > > > We don't currently use consistent values of {{nowInSeconds}} and > {{timestamp}} in the codebase, and sometimes generate several server-side > timestamps for each in the same request. {{QueryState}} should cache the > values it generated so that the same values are used for the duration of > write/read.
[jira] [Commented] (CASSANDRA-14675) Log the actual (if server-generated) timestamp and nowInSeconds used by queries in FQL
[ https://issues.apache.org/jira/browse/CASSANDRA-14675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599051#comment-16599051 ] Marcus Eriksson commented on CASSANDRA-14675: - and just realised that Dump.java needs update as well > Log the actual (if server-generated) timestamp and nowInSeconds used by > queries in FQL > -- > > Key: CASSANDRA-14675 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14675 > Project: Cassandra > Issue Type: Improvement >Reporter: Aleksey Yeschenko >Assignee: Aleksey Yeschenko >Priority: Major > Labels: fqltool > Fix For: 4.0.x > > > FQL doesn't currently log the actual timestamp - in microseconds - if it's > been server generated, nor the nowInSeconds value. It needs to, to allow for > - in conjunction with CASSANDRA-14664 and CASSANDRA-14671 - deterministic > playback tests.
[jira] [Commented] (CASSANDRA-14675) Log the actual (if server-generated) timestamp and nowInSeconds used by queries in FQL
[ https://issues.apache.org/jira/browse/CASSANDRA-14675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599042#comment-16599042 ] Marcus Eriksson commented on CASSANDRA-14675: - oh and https://github.com/krummas/cassandra/commit/f22ea233de6886b7beae7ef57d50d4b5dd9bad4f - don't know how well chronicle works with repeated fields > Log the actual (if server-generated) timestamp and nowInSeconds used by > queries in FQL > -- > > Key: CASSANDRA-14675 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14675 > Project: Cassandra > Issue Type: Improvement >Reporter: Aleksey Yeschenko >Assignee: Aleksey Yeschenko >Priority: Major > Labels: fqltool > Fix For: 4.0.x > > > FQL doesn't currently log the actual timestamp - in microseconds - if it's > been server generated, nor the nowInSeconds value. It needs to, to allow for > - in conjunction with CASSANDRA-14664 and CASSANDRA-14671 - deterministic > playback tests.
[jira] [Updated] (CASSANDRA-14436) Add sampler for query time and expose with nodetool
[ https://issues.apache.org/jira/browse/CASSANDRA-14436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aleksey Yeschenko updated CASSANDRA-14436: -- Resolution: Fixed Fix Version/s: (was: 4.x) 4.0 Status: Resolved (was: Patch Available) > Add sampler for query time and expose with nodetool > --- > > Key: CASSANDRA-14436 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14436 > Project: Cassandra > Issue Type: Improvement >Reporter: Chris Lohfink >Assignee: Chris Lohfink >Priority: Major > Labels: 4.0-feature-freeze-review-requested, > pull-request-available > Fix For: 4.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > Create a new {{nodetool profileload}} that functions just like toppartitions > but with more data, returning the slowest local reads and writes on the host > during a given duration and highest frequency touched partitions (same as > {{nodetool toppartitions}}). Refactor included to extend use of the sampler > for uses outside of top frequency (max instead of total sample values). > Future work to this is to include top cpu and allocations by query and > possibly tasks/cpu/allocations by stage during time window.
[jira] [Commented] (CASSANDRA-14436) Add sampler for query time and expose with nodetool
[ https://issues.apache.org/jira/browse/CASSANDRA-14436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599037#comment-16599037 ] Aleksey Yeschenko commented on CASSANDRA-14436: --- Committed as [f31d1a05a1f6f85f64c9b965009db814960c4eca|https://github.com/apache/cassandra/commit/f31d1a05a1f6f85f64c9b965009db814960c4eca] to trunk. Mostly just looked at potential negative effects on the read path and found none, but cleaned up {{ReadExecutionController}} a little in the process. I trust Chris and Dinesh to have collectively done a good job at implementation and review of the rest. > Add sampler for query time and expose with nodetool > --- > > Key: CASSANDRA-14436 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14436 > Project: Cassandra > Issue Type: Improvement >Reporter: Chris Lohfink >Assignee: Chris Lohfink >Priority: Major > Labels: 4.0-feature-freeze-review-requested, > pull-request-available > Fix For: 4.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > Create a new {{nodetool profileload}} that functions just like toppartitions > but with more data, returning the slowest local reads and writes on the host > during a given duration and highest frequency touched partitions (same as > {{nodetool toppartitions}}). Refactor included to extend use of the sampler > for uses outside of top frequency (max instead of total sample values). > Future work to this is to include top cpu and allocations by query and > possibly tasks/cpu/allocations by stage during time window.
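The refactor described in CASSANDRA-14436 extends the sampler beyond top-frequency counting to max-value sampling (keep the k largest observed values, e.g. the slowest reads in a window). A minimal sketch of such a max sampler, with hypothetical names not taken from the actual patch, is a size-bounded min-heap:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of a max-value sampler: retain the k samples with the
// largest values (e.g. slowest queries in a window) using a min-heap of size k.
public class MaxSamplerSketch
{
    private final int capacity;
    private final PriorityQueue<long[]> heap =                // entries: [value, id]
        new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));

    public MaxSamplerSketch(int capacity) { this.capacity = capacity; }

    public void addSample(long id, long valueMicros)
    {
        heap.offer(new long[]{ valueMicros, id });
        if (heap.size() > capacity)
            heap.poll();                                      // evict the smallest value
    }

    /** Ids of the retained (largest-value) samples, in no particular order. */
    public List<Long> topIds()
    {
        List<Long> ids = new ArrayList<>();
        for (long[] s : heap)
            ids.add(s[1]);
        return ids;
    }
}
```

The same bounded structure works whether the sampled value is latency, CPU time, or allocations, which is what makes the "max instead of total sample values" generalization mentioned in the ticket possible.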
[jira] [Created] (CASSANDRA-14685) Incremental repair 4.0 : SSTables remain locked forever if the coordinator dies during streaming
Alexander Dejanovski created CASSANDRA-14685:
------------------------------------------------

             Summary: Incremental repair 4.0: SSTables remain locked forever if the coordinator dies during streaming
                 Key: CASSANDRA-14685
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14685
             Project: Cassandra
          Issue Type: Bug
          Components: Repair
            Reporter: Alexander Dejanovski

The changes in CASSANDRA-9143 modified the way incremental repair works by applying the following sequence of events:

* Anticompaction is executed on all replicas for all SSTables overlapping the repaired ranges
* Anticompacted SSTables are then marked as "pending repair" and can no longer be compacted, nor be part of another repair session
* Merkle trees are generated and compared
* Streaming takes place if needed
* Anticompaction is committed: "pending repair" SSTables are marked as repaired if the session succeeded, or released if it failed

If the repair coordinator dies during the streaming phase, *the SSTables on the replicas will remain in "pending repair" state and will never be eligible for repair or compaction*, even after all the nodes in the cluster are restarted.

Steps to reproduce (I've used Jason's 13938 branch, which fixes streaming errors):

{noformat}
ccm create inc-repair-issue -v github:jasobrown/13938 -n 3

# Allow jmx access and remove all rpc_ settings in yaml
for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra-env.sh; do
  sed -i'' -e 's/com.sun.management.jmxremote.authenticate=true/com.sun.management.jmxremote.authenticate=false/g' $f
done
for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra.yaml; do
  grep -v "rpc_" $f > ${f}.tmp
  cat ${f}.tmp > $f
done

ccm start
{noformat}

I used [tlp-stress|https://github.com/thelastpickle/tlp-stress] to generate a few tens of MBs of data (killed it after some time).
cassandra-stress obviously works as well:

{noformat}
bin/tlp-stress run BasicTimeSeries -i 1M -p 1M -t 2 --rate 5000 --replication "{'class':'SimpleStrategy', 'replication_factor':2}" --compaction "{'class': 'SizeTieredCompactionStrategy'}" --host 127.0.0.1
{noformat}

Flush and delete all SSTables in node1:

{noformat}
ccm node1 nodetool flush
rm -f ~/.ccm/inc-repair-issue/node1/data0/tlp_stress/sensor*/*.*
{noformat}

Then throttle streaming throughput to 1 MB/s, so we have time to take node1 down during the streaming phase, and run repair:

{noformat}
ccm node1 nodetool setstreamthroughput 1
ccm node2 nodetool setstreamthroughput 1
ccm node3 nodetool setstreamthroughput 1
ccm node1 nodetool repair tlp_stress
{noformat}

Once streaming starts, shut down node1 and start it again:

{noformat}
ccm node1 stop
ccm node1 start
{noformat}

Run repair again:

{noformat}
ccm node1 nodetool repair tlp_stress
{noformat}

The command will return very quickly, showing that it skipped all SSTables:

{noformat}
[2018-08-31 19:05:16,292] Repair completed successfully
[2018-08-31 19:05:16,292] Repair command #1 finished in 2 seconds

$ ccm node1 nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load        Tokens  Owns  Host ID                               Rack
UN  127.0.0.1  228,64 KiB  256     ?     437dc9cd-b1a1-41a5-961e-cfc99763e29f  rack1
UN  127.0.0.2  60,09 MiB   256     ?     fbcbbdbb-e32a-4716-8230-8ca59aa93e62  rack1
UN  127.0.0.3  57,59 MiB   256     ?     a0b1bcc6-0fad-405a-b0bf-180a0ca31dd0  rack1
{noformat}

sstablemetadata will then show that nodes 2 and 3 have SSTables still in "pending repair" state:

{noformat}
~/.ccm/repository/gitCOLONtrunk/tools/bin/sstablemetadata na-4-big-Data.db | grep repair
SSTable: /Users/adejanovski/.ccm/inc-repair-4.0/node2/data0/tlp_stress/sensor_data-b7375660ad3111e8a0e59357ff9c9bda/na-4-big
Pending repair: 3844a400-ad33-11e8-b5a7-6b8dd8f31b62
{noformat}

Restarting these nodes wouldn't help either.
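A toy model of why restarting doesn't release the SSTables may make the report easier to follow. The class and field names below are hypothetical, not Cassandra's actual code: the point is that the pending-repair session id is persisted in the sstable metadata (as the sstablemetadata output shows), so it survives restarts, while regular compaction simply skips any sstable still carrying one:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

// Hypothetical sketch (not Cassandra's classes) of the stranding mechanism:
// nothing ever clears pendingRepair once the coordinating session is gone.
public class PendingRepairSketch {
    static final class SSTable {
        UUID pendingRepair;                       // persisted in sstable metadata
        SSTable(UUID sessionId) { this.pendingRepair = sessionId; }
    }

    static boolean eligibleForRegularCompaction(SSTable t) {
        // Locked sstables are excluded from normal compaction and repair.
        return t.pendingRepair == null;
    }

    public static void main(String[] args) {
        UUID deadSession = UUID.randomUUID();     // coordinator died mid-stream
        SSTable stranded = new SSTable(deadSession);

        Set<UUID> liveSessions = new HashSet<>(); // empty after every restart
        // No live session matches, yet no code path resets pendingRepair:
        System.out.println(eligibleForRegularCompaction(stranded)); // false, forever
    }
}
```

Under this model, a fix needs some reconciliation step that notices the session id no longer corresponds to any live repair session and either commits or releases the affected sstables.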
[2/2] cassandra git commit: Add sampler for query time and expose with nodetool
Add sampler for query time and expose with nodetool

patch by Chris Lohfink; reviewed by Dinesh Joshi for CASSANDRA-14436

Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo
Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/f31d1a05
Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/f31d1a05
Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/f31d1a05

Branch: refs/heads/trunk
Commit: f31d1a05a1f6f85f64c9b965009db814960c4eca
Parents: a09dc3a
Author: Chris Lohfink
Authored: Fri Aug 17 00:40:54 2018 -0500
Committer: Aleksey Yeshchenko
Committed: Fri Aug 31 18:14:31 2018 +0100

----------------------------------------------------------------------
 CHANGES.txt                                     |   1 +
 .../apache/cassandra/db/ColumnFamilyStore.java  |  67 ++---
 .../cassandra/db/ColumnFamilyStoreMBean.java    |   4 +-
 .../cassandra/db/ReadExecutionController.java   |  89 ---
 .../db/SinglePartitionReadCommand.java          |   4 +-
 .../cassandra/metrics/FrequencySampler.java     | 103
 .../apache/cassandra/metrics/MaxSampler.java    |  75 ++
 .../org/apache/cassandra/metrics/Sampler.java   |  97
 .../apache/cassandra/metrics/TableMetrics.java  |  93 +--
 .../apache/cassandra/net/MessagingService.java  |   2 +
 .../apache/cassandra/service/StorageProxy.java  |   4 +-
 .../cassandra/service/StorageService.java       |  30 ++-
 .../cassandra/service/StorageServiceMBean.java  |   3 +-
 .../org/apache/cassandra/tools/NodeProbe.java   |  17 +-
 .../org/apache/cassandra/tools/NodeTool.java    |  54 ++--
 .../cassandra/tools/nodetool/ProfileLoad.java   | 192 ++
 .../cassandra/tools/nodetool/TopPartitions.java | 121 +
 .../org/apache/cassandra/utils/TopKSampler.java | 139 ---
 .../cassandra/metrics/MaxSamplerTest.java       |  93 +++
 .../apache/cassandra/metrics/SamplerTest.java   | 247 +++
 .../metrics/TopFrequencySamplerTest.java        |  71 ++
 .../cassandra/tools/TopPartitionsTest.java      |  14 +-
 .../apache/cassandra/utils/TopKSamplerTest.java | 171 -
 23 files changed, 1120 insertions(+), 571 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/cassandra/blob/f31d1a05/CHANGES.txt
----------------------------------------------------------------------
diff --git a/CHANGES.txt b/CHANGES.txt
index 6489038..475cd48 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -1,4 +1,5 @@
 4.0
+ * Add sampler for query time and expose with nodetool (CASSANDRA-14436)
 * Clean up Message.Request implementations (CASSANDRA-14677)
 * Disable old native protocol versions on demand (CASANDRA-14659)
 * Allow specifying now-in-seconds in native protocol (CASSANDRA-14664)

http://git-wip-us.apache.org/repos/asf/cassandra/blob/f31d1a05/src/java/org/apache/cassandra/db/ColumnFamilyStore.java
----------------------------------------------------------------------
diff --git a/src/java/org/apache/cassandra/db/ColumnFamilyStore.java b/src/java/org/apache/cassandra/db/ColumnFamilyStore.java
index 496fc10..5e38584 100644
--- a/src/java/org/apache/cassandra/db/ColumnFamilyStore.java
+++ b/src/java/org/apache/cassandra/db/ColumnFamilyStore.java
@@ -42,7 +42,6 @@ import com.google.common.util.concurrent.*;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
-import com.clearspring.analytics.stream.Counter;
 import org.apache.cassandra.cache.*;
 import org.apache.cassandra.concurrent.*;
 import org.apache.cassandra.config.*;
@@ -65,11 +64,9 @@ import org.apache.cassandra.exceptions.StartupException;
 import org.apache.cassandra.index.SecondaryIndexManager;
 import org.apache.cassandra.index.internal.CassandraIndex;
 import org.apache.cassandra.index.transactions.UpdateTransaction;
-import org.apache.cassandra.io.FSError;
 import org.apache.cassandra.io.FSReadError;
 import org.apache.cassandra.io.FSWriteError;
 import org.apache.cassandra.io.sstable.Component;
-import org.apache.cassandra.io.sstable.CorruptSSTableException;
 import org.apache.cassandra.io.sstable.Descriptor;
 import org.apache.cassandra.io.sstable.KeyIterator;
 import org.apache.cassandra.io.sstable.SSTableMultiWriter;
@@ -77,8 +74,10 @@ import org.apache.cassandra.io.sstable.format.*;
 import org.apache.cassandra.io.sstable.metadata.MetadataCollector;
 import org.apache.cassandra.io.util.FileUtils;
 import org.apache.cassandra.io.util.RandomAccessReader;
+import org.apache.cassandra.metrics.Sampler;
+import org.apache.cassandra.metrics.Sampler.Sample;
+import org.apache.cassandra.metrics.Sampler.SamplerType;
 import org.apache.cassandra.metrics.TableMetrics;
-import org.apache.cassandra.metrics.TableMetrics.Sampler;
 import org.apache.cassandra.repair.TableRepairManager;
 import org.apache.cassandra.schema.*;
 import org.apache.cassandra.schema.CompactionParams.TombstoneOption;
@@ -87,7 +86,6 @@ import org.apache.cassandra.service.CacheService;
 import
[1/2] cassandra git commit: Add sampler for query time and expose with nodetool
Repository: cassandra
Updated Branches:
  refs/heads/trunk a09dc3a53 -> f31d1a05a

http://git-wip-us.apache.org/repos/asf/cassandra/blob/f31d1a05/test/unit/org/apache/cassandra/utils/TopKSamplerTest.java
----------------------------------------------------------------------
diff --git a/test/unit/org/apache/cassandra/utils/TopKSamplerTest.java b/test/unit/org/apache/cassandra/utils/TopKSamplerTest.java
deleted file mode 100644
index f35d072..0000000
--- a/test/unit/org/apache/cassandra/utils/TopKSamplerTest.java
+++ /dev/null
@@ -1,171 +0,0 @@
-/*
- *
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements.  See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership.  The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License.  You may obtain a copy of the License at
- *
- *     http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied.  See the License for the
- * specific language governing permissions and limitations
- * under the License.
- *
- */
-package org.apache.cassandra.utils;
-
-import java.util.List;
-import java.util.Map;
-import java.util.concurrent.CountDownLatch;
-import java.util.concurrent.TimeUnit;
-import java.util.concurrent.TimeoutException;
-import java.util.concurrent.atomic.AtomicBoolean;
-
-import com.google.common.collect.Maps;
-import com.google.common.util.concurrent.Uninterruptibles;
-import org.junit.Test;
-
-import com.clearspring.analytics.hash.MurmurHash;
-import com.clearspring.analytics.stream.Counter;
-import org.junit.Assert;
-import org.apache.cassandra.concurrent.NamedThreadFactory;
-import org.apache.cassandra.utils.TopKSampler.SamplerResult;
-
-public class TopKSamplerTest
-{
-
-    @Test
-    public void testSamplerSingleInsertionsEqualMulti() throws TimeoutException
-    {
-        TopKSampler sampler = new TopKSampler();
-        sampler.beginSampling(10);
-        insert(sampler);
-        waitForEmpty(1000);
-        SamplerResult single = sampler.finishSampling(10);
-
-        TopKSampler sampler2 = new TopKSampler();
-        sampler2.beginSampling(10);
-        for (int i = 1; i <= 10; i++)
-        {
-            String key = "item" + i;
-            sampler2.addSample(key, MurmurHash.hash64(key), i);
-        }
-        waitForEmpty(1000);
-        Assert.assertEquals(countMap(single.topK), countMap(sampler2.finishSampling(10).topK));
-        Assert.assertEquals(sampler2.hll.cardinality(), 10);
-        Assert.assertEquals(sampler.hll.cardinality(), sampler2.hll.cardinality());
-    }
-
-    @Test
-    public void testSamplerOutOfOrder() throws TimeoutException
-    {
-        TopKSampler sampler = new TopKSampler();
-        sampler.beginSampling(10);
-        insert(sampler);
-        waitForEmpty(1000);
-        SamplerResult single = sampler.finishSampling(10);
-        single = sampler.finishSampling(10);
-    }
-
-    /**
-     * checking for exceptions from SS/HLL which are not thread safe
-     */
-    @Test
-    public void testMultithreadedAccess() throws Exception
-    {
-        final AtomicBoolean running = new AtomicBoolean(true);
-        final CountDownLatch latch = new CountDownLatch(1);
-        final TopKSampler sampler = new TopKSampler();
-
-        NamedThreadFactory.createThread(new Runnable()
-        {
-            public void run()
-            {
-                try
-                {
-                    while (running.get())
-                    {
-                        insert(sampler);
-                    }
-                } finally
-                {
-                    latch.countDown();
-                }
-            }
-
-        }
-        , "inserter").start();
-        try
-        {
-            // start/stop in fast iterations
-            for (int i = 0; i < 100; i++)
-            {
-                sampler.beginSampling(i);
-                sampler.finishSampling(i);
-            }
-            // start/stop with pause to let it build up past capacity
-            for (int i = 0; i < 3; i++)
-            {
-                sampler.beginSampling(i);
-                Thread.sleep(250);
-                sampler.finishSampling(i);
-            }
-
-            // with empty results
-            running.set(false);
-            latch.await(1, TimeUnit.SECONDS);
-            waitForEmpty(1000);
-            for (int i = 0; i < 10; i++)
-            {
-                sampler.beginSampling(i);
-                Thread.sleep(i);
-                sampler.finishSampling(i);
-            }
-        } finally
-        {
-            running.set(false);
-        }
-    }
-
-    private void insert(TopKSampler sampler)
[jira] [Commented] (CASSANDRA-14683) Pagestate is null after 2^31 rows
[ https://issues.apache.org/jira/browse/CASSANDRA-14683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16599030#comment-16599030 ]

Abhishek commented on CASSANDRA-14683:
--------------------------------------

[~iamaleksey] I'm willing to fix this, but I'm new to the codebase. Can you point me to the class that needs changes?

> Pagestate is null after 2^31 rows
> ---------------------------------
>
>                 Key: CASSANDRA-14683
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14683
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Abhishek
>            Priority: Major
>
> I am using the nodejs driver to take a dump of my table via [pagination|http://datastax.github.io/nodejs-driver/features/paging/#manual-paging] for a simple query.
> My query is {{select * from mytable}}.
> The table has close to 4 billion rows, and Cassandra stops returning results after exactly 2147483647 rows. The page state is not returned after this.
> Cassandra version: 3.0.9
> Nodejs cassandra driver version: 3.5.0
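Worth noting for anyone triaging: 2147483647 is exactly {{Integer.MAX_VALUE}}, which strongly suggests a signed 32-bit counter somewhere in the paging path. A minimal Java illustration of the suspected failure mode — the variable names are hypothetical, this is not the actual Cassandra paging code:

```java
// Hypothetical sketch of the suspected bug class: a row position held in a
// signed 32-bit int cannot represent counts past Integer.MAX_VALUE
// (2147483647), the exact row count at which the page state disappears.
public class PagingOverflowSketch {
    public static void main(String[] args) {
        long rowsScanned = 4_000_000_000L;  // ~4 billion rows fits in a long
        int narrowed = (int) rowsScanned;   // silent wrap on narrowing conversion
        System.out.println(narrowed);       // negative: such a cursor is unusable
    }
}
```

If that guess is right, the fix would be widening the relevant counter (and its serialized form in the page state) to a 64-bit long.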
[jira] [Commented] (CASSANDRA-14655) Upgrade C* to use latest guava (26.0)
[ https://issues.apache.org/jira/browse/CASSANDRA-14655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16599028#comment-16599028 ]

Andy Tolbert commented on CASSANDRA-14655:
------------------------------------------

Thanks for providing the stack traces. Ah, I suppose the cause of the NPE could simply be that no {{Cluster}} instance is provided in {{executeNet}}, because it can't properly initialize due to the {{system_virtual_schema}} issue.

As for {{system_virtual_schema}} being missing: I think the code that loads/creates that keyspace is [here|https://github.com/apache/cassandra/blob/06209037ea56b5a2a49615a99f1542d6ea1b2947/src/java/org/apache/cassandra/service/CassandraDaemon.java#L255-L256] during the setup of {{CassandraDaemon}}. My guess is that if you add the relevant code in {{CQLTester.requireNetwork}}, like [is done|https://github.com/apache/cassandra/blob/207c80c1fd63dfbd8ca7e615ec8002ee8983c5d6/test/unit/org/apache/cassandra/cql3/CQLTester.java#L381] for the system keyspace via {{SystemKeyspace.finishStartup()}}, that will fix this issue.

> Upgrade C* to use latest guava (26.0)
> -------------------------------------
>
>                 Key: CASSANDRA-14655
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14655
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Libraries
>            Reporter: Sumanth Pasupuleti
>            Assignee: Sumanth Pasupuleti
>            Priority: Minor
>             Fix For: 4.x
>
> C* currently uses guava 23.3. This JIRA is about changing C* to use the latest guava (26.0). Originated from a discussion on the mailing list.