[jira] [Commented] (CASSANDRA-14408) Transient Replication: Incremental & Validation repair handling of transient replicas

2018-08-31 Thread Ariel Weisberg (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16599466#comment-16599466
 ] 

Ariel Weisberg commented on CASSANDRA-14408:


Obviously +1 from me as I committed it.

> Transient Replication: Incremental & Validation repair handling of transient 
> replicas
> -
>
> Key: CASSANDRA-14408
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14408
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Repair
>Reporter: Ariel Weisberg
>Assignee: Blake Eggleston
>Priority: Major
> Fix For: 4.0
>
>
> At transient replicas, anti-compaction shouldn't output any data for transient 
> ranges, as the data will be dropped after repair.
> Transient replicas should also never have data streamed to them.
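The two repair rules above can be sketched in Python. This is an illustrative model only; the function and data shapes are hypothetical, not Cassandra's internal API:

```python
def anticompact(rows, transient_ranges):
    """Split rows at a transient replica into (repaired, dropped).

    Data falling in transiently replicated token ranges is dropped rather
    than marked repaired, since it would be discarded after repair anyway.
    """
    repaired, dropped = [], []
    for token, row in rows:
        if any(lo < token <= hi for lo, hi in transient_ranges):
            dropped.append((token, row))   # transient range: discard
        else:
            repaired.append((token, row))  # fully replicated range: keep
    return repaired, dropped

# This node fully replicates everything except the token range (10, 20].
repaired, dropped = anticompact(
    rows=[(5, "a"), (15, "b"), (25, "c")],
    transient_ranges=[(10, 20)],
)
# repaired == [(5, "a"), (25, "c")], dropped == [(15, "b")]
```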



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14407) Transient Replication: Add support for correct reads when transient replication is in use

2018-08-31 Thread Ariel Weisberg (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16599465#comment-16599465
 ] 

Ariel Weisberg commented on CASSANDRA-14407:


Obviously +1 from me as I committed it.

> Transient Replication: Add support for correct reads when transient 
> replication is in use
> -
>
> Key: CASSANDRA-14407
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14407
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Coordination
>Reporter: Ariel Weisberg
>Assignee: Blake Eggleston
>Priority: Major
> Fix For: 4.0
>
>
> Digest reads should never be sent to transient replicas.
> Mismatches with results from transient replicas shouldn't trigger read repair.
> Read repair should never attempt to repair a transient replica.
> Reads should always include at least one full replica. They should also 
> prefer transient replicas where possible.
> Range scans must perform replica selection so that every scanned range 
> includes at least one full replica.
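The read-path rules above can be sketched as a small Python model. This is illustrative only, not Cassandra's actual replica-selection code; names and shapes are assumptions:

```python
def select_read_replicas(live_replicas, needed):
    """Pick read targets from (endpoint, is_transient) pairs: at least one
    full replica is mandatory (only full replicas can serve digest reads),
    and the remaining slots prefer transient replicas.
    """
    full = [ep for ep, is_transient in live_replicas if not is_transient]
    transient = [ep for ep, is_transient in live_replicas if is_transient]
    if not full:
        raise RuntimeError("cannot satisfy read: no live full replica")
    # One full replica always participates; fill the rest preferring
    # transient replicas, falling back to further full replicas.
    return ([full[0]] + transient + full[1:])[:needed]

targets = select_read_replicas([("n1", False), ("n2", False), ("n3", True)], 2)
# targets == ["n1", "n3"]: one mandatory full replica, then a transient one
```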






[jira] [Commented] (CASSANDRA-14406) Transient Replication: Implement cheap quorum write optimizations

2018-08-31 Thread Ariel Weisberg (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16599464#comment-16599464
 ] 

Ariel Weisberg commented on CASSANDRA-14406:


Obviously +1 from me as I committed it.

> Transient Replication: Implement cheap quorum write optimizations
> -
>
> Key: CASSANDRA-14406
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14406
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Coordination
>Reporter: Ariel Weisberg
>Assignee: Blake Eggleston
>Priority: Major
> Fix For: 4.0
>
>
> Writes should never be sent to transient replicas unless necessary to satisfy 
> the requested consistency level, such as when RF is insufficient for strong 
> consistency or not enough full replicas are marked as alive.
> If a write doesn't receive sufficient responses in time, the write should be 
> sent to additional replicas, similar to Rapid Read Protection.
> Hints should never be written for a transient replica.
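The cheap-quorum write path above can be sketched as follows. This is an illustrative model under assumed names, not Cassandra's API:

```python
from collections import namedtuple

Replica = namedtuple("Replica", "endpoint alive transient")

def select_write_replicas(replicas, required_acks):
    """Write only to live full replicas when they alone can satisfy the
    consistency level; involve transient replicas only when full replicas
    fall short.
    """
    live_full = [r for r in replicas if r.alive and not r.transient]
    if len(live_full) >= required_acks:
        return live_full              # transient replicas see no writes
    live_transient = [r for r in replicas if r.alive and r.transient]
    # Top up with transient replicas (a real implementation would also send
    # late extra writes on timeout, similar to rapid read protection).
    return live_full + live_transient[: required_acks - len(live_full)]

ring = [Replica("n1", True, False), Replica("n2", False, False),
        Replica("n3", True, True)]
targets = select_write_replicas(ring, required_acks=2)
# n2 is down, so the transient n3 must take the write: ["n1", "n3"]
```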






[jira] [Commented] (CASSANDRA-14405) Transient Replication: Metadata refactor

2018-08-31 Thread Ariel Weisberg (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16599463#comment-16599463
 ] 

Ariel Weisberg commented on CASSANDRA-14405:


Obviously +1 from me as I committed it.

> Transient Replication: Metadata refactor
> 
>
> Key: CASSANDRA-14405
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14405
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Core, Distributed Metadata, Documentation and Website
>Reporter: Ariel Weisberg
>Assignee: Blake Eggleston
>Priority: Major
> Fix For: 4.0
>
>
> Add support to CQL and NTS for configuring keyspaces to have transient 
> replicas.
> Add syntax allowing a keyspace using NTS to declare some replicas in each DC 
> as transient.
> Implement metadata internal to the DB so that it's possible to identify what 
> replicas are transient for a given token or range.
> Introduce Replica, which pairs an InetAddressAndPort with a boolean 
> indicating whether the replica is transient, and ReplicatedRange, which 
> wraps a Range and indicates whether the range is transient.
> Block altering of keyspaces to use transient replication if they already 
> contain MVs or 2i.
> Block the creation of MV or 2i in keyspaces using transient replication.
> Block the creation/alteration of keyspaces using transient replication if the 
> experimental flag is not set.
> Update web site, CQL spec, and any other documentation for the new syntax.
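The Replica metadata described above can be sketched in Python. The shape below is a hypothetical stand-in for the real Java class pairing an InetAddressAndPort with a transient flag:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Replica:
    """An endpoint plus a flag for whether it replicates a range
    transiently (illustrative; field names are assumptions)."""
    endpoint: str
    transient: bool

def transient_endpoints(replicas_for_range):
    """Which replicas hold this range only transiently?"""
    return {r.endpoint for r in replicas_for_range if r.transient}

replicas = [Replica("10.0.0.1:7000", False), Replica("10.0.0.2:7000", True)]
# transient_endpoints(replicas) == {"10.0.0.2:7000"}
```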






[jira] [Updated] (CASSANDRA-14406) Transient Replication: Implement cheap quorum write optimizations

2018-08-31 Thread Ariel Weisberg (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ariel Weisberg updated CASSANDRA-14406:
---
Fix Version/s: 4.0

> Transient Replication: Implement cheap quorum write optimizations
> -
>
> Key: CASSANDRA-14406
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14406
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Coordination
>Reporter: Ariel Weisberg
>Assignee: Blake Eggleston
>Priority: Major
> Fix For: 4.0
>
>
> Writes should never be sent to transient replicas unless necessary to satisfy 
> the requested consistency level, such as when RF is insufficient for strong 
> consistency or not enough full replicas are marked as alive.
> If a write doesn't receive sufficient responses in time, the write should be 
> sent to additional replicas, similar to Rapid Read Protection.
> Hints should never be written for a transient replica.






[jira] [Updated] (CASSANDRA-14408) Transient Replication: Incremental & Validation repair handling of transient replicas

2018-08-31 Thread Ariel Weisberg (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ariel Weisberg updated CASSANDRA-14408:
---
Status: Ready to Commit  (was: Patch Available)

> Transient Replication: Incremental & Validation repair handling of transient 
> replicas
> -
>
> Key: CASSANDRA-14408
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14408
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Repair
>Reporter: Ariel Weisberg
>Assignee: Blake Eggleston
>Priority: Major
> Fix For: 4.0
>
>
> At transient replicas, anti-compaction shouldn't output any data for transient 
> ranges, as the data will be dropped after repair.
> Transient replicas should also never have data streamed to them.






[jira] [Updated] (CASSANDRA-14409) Transient Replication: Support ring changes when transient replication is in use (add/remove node, change RF, add/remove DC)

2018-08-31 Thread Ariel Weisberg (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ariel Weisberg updated CASSANDRA-14409:
---
Resolution: Fixed
Status: Resolved  (was: Ready to Commit)

The initial implementation of Transient Replication and Cheap Quorums was 
committed as 
[f7431b432875e334170ccdb19934d05545d2cebd|https://github.com/apache/cassandra/commit/f7431b432875e334170ccdb19934d05545d2cebd].

> Transient Replication: Support ring changes when transient replication is in 
> use (add/remove node, change RF, add/remove DC)
> 
>
> Key: CASSANDRA-14409
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14409
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Coordination, Core, Documentation and Website
>Reporter: Ariel Weisberg
>Assignee: Ariel Weisberg
>Priority: Major
> Fix For: 4.0
>
>
> The additional state transitions that transient replication introduces 
> require streaming and nodetool cleanup to behave differently. We already have 
> code that does the streaming, but in some cases we shouldn't stream any data 
> at all, and in others, when we stream to receive data, we must ensure we 
> stream from a full replica rather than a transient replica.
> Transitioning from not replicated to transiently replicated means that a node 
> must stay pending until the next incremental repair completes, at which point 
> the data for that range is known to be available at full replicas.
> Transitioning from transiently replicated to fully replicated requires 
> streaming from a full replica and is identical to how we stream from not 
> replicated to replicated. The transition must be managed so the transient 
> replica is not read from as a full replica until streaming completes. It can 
> be used immediately for a write quorum.
> Transitioning from fully replicated to transiently replicated requires 
> cleanup to remove repaired data from the transiently replicated range to 
> reclaim space. It can be used immediately for a write quorum.
> Transitioning from transiently replicated to not replicated requires cleanup 
> to be run to remove the formerly transiently replicated data.
> nodetool move, removenode, cleanup, decommission, and rebuild need to handle 
> these issues, as does bootstrap.
> Update the web site, documentation, and NEWS.txt with the steps for common 
> operations: add/remove DC, add/remove node(s), replace node, and change RF.
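The four state transitions above and the action each requires can be summarized in a small lookup. The state names are informal labels for this sketch, not Cassandra terminology:

```python
# (old_state, new_state) -> the operational requirement described above.
TRANSITION_ACTIONS = {
    ("unreplicated", "transient"):
        "stay pending until the next incremental repair completes",
    ("transient", "full"):
        "stream from a full replica; serve as a full read replica only "
        "after streaming completes (usable for write quorum immediately)",
    ("full", "transient"):
        "run cleanup to drop repaired data from the now-transient range",
    ("transient", "unreplicated"):
        "run cleanup to drop the formerly transient data",
}

def required_action(old_state, new_state):
    """Look up what an operator/node must do for a replication change."""
    return TRANSITION_ACTIONS[(old_state, new_state)]
```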






[jira] [Updated] (CASSANDRA-14409) Transient Replication: Support ring changes when transient replication is in use (add/remove node, change RF, add/remove DC)

2018-08-31 Thread Ariel Weisberg (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ariel Weisberg updated CASSANDRA-14409:
---
Status: Ready to Commit  (was: Patch Available)

> Transient Replication: Support ring changes when transient replication is in 
> use (add/remove node, change RF, add/remove DC)
> 
>
> Key: CASSANDRA-14409
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14409
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Coordination, Core, Documentation and Website
>Reporter: Ariel Weisberg
>Assignee: Ariel Weisberg
>Priority: Major
> Fix For: 4.0
>
>
> The additional state transitions that transient replication introduces 
> require streaming and nodetool cleanup to behave differently. We already have 
> code that does the streaming, but in some cases we shouldn't stream any data 
> at all, and in others, when we stream to receive data, we must ensure we 
> stream from a full replica rather than a transient replica.
> Transitioning from not replicated to transiently replicated means that a node 
> must stay pending until the next incremental repair completes, at which point 
> the data for that range is known to be available at full replicas.
> Transitioning from transiently replicated to fully replicated requires 
> streaming from a full replica and is identical to how we stream from not 
> replicated to replicated. The transition must be managed so the transient 
> replica is not read from as a full replica until streaming completes. It can 
> be used immediately for a write quorum.
> Transitioning from fully replicated to transiently replicated requires 
> cleanup to remove repaired data from the transiently replicated range to 
> reclaim space. It can be used immediately for a write quorum.
> Transitioning from transiently replicated to not replicated requires cleanup 
> to be run to remove the formerly transiently replicated data.
> nodetool move, removenode, cleanup, decommission, and rebuild need to handle 
> these issues, as does bootstrap.
> Update the web site, documentation, and NEWS.txt with the steps for common 
> operations: add/remove DC, add/remove node(s), replace node, and change RF.






[jira] [Updated] (CASSANDRA-14408) Transient Replication: Incremental & Validation repair handling of transient replicas

2018-08-31 Thread Ariel Weisberg (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ariel Weisberg updated CASSANDRA-14408:
---
Resolution: Fixed
Status: Resolved  (was: Ready to Commit)

The initial implementation of Transient Replication and Cheap Quorums was 
committed as 
[f7431b432875e334170ccdb19934d05545d2cebd|https://github.com/apache/cassandra/commit/f7431b432875e334170ccdb19934d05545d2cebd].

> Transient Replication: Incremental & Validation repair handling of transient 
> replicas
> -
>
> Key: CASSANDRA-14408
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14408
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Repair
>Reporter: Ariel Weisberg
>Assignee: Blake Eggleston
>Priority: Major
> Fix For: 4.0
>
>
> At transient replicas, anti-compaction shouldn't output any data for transient 
> ranges, as the data will be dropped after repair.
> Transient replicas should also never have data streamed to them.






[jira] [Updated] (CASSANDRA-14407) Transient Replication: Add support for correct reads when transient replication is in use

2018-08-31 Thread Ariel Weisberg (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ariel Weisberg updated CASSANDRA-14407:
---
Resolution: Fixed
Status: Resolved  (was: Ready to Commit)

The initial implementation of Transient Replication and Cheap Quorums was 
committed as 
[f7431b432875e334170ccdb19934d05545d2cebd|https://github.com/apache/cassandra/commit/f7431b432875e334170ccdb19934d05545d2cebd].

> Transient Replication: Add support for correct reads when transient 
> replication is in use
> -
>
> Key: CASSANDRA-14407
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14407
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Coordination
>Reporter: Ariel Weisberg
>Assignee: Blake Eggleston
>Priority: Major
> Fix For: 4.0
>
>
> Digest reads should never be sent to transient replicas.
> Mismatches with results from transient replicas shouldn't trigger read repair.
> Read repair should never attempt to repair a transient replica.
> Reads should always include at least one full replica. They should also 
> prefer transient replicas where possible.
> Range scans must perform replica selection so that every scanned range 
> includes at least one full replica.






[jira] [Updated] (CASSANDRA-14406) Transient Replication: Implement cheap quorum write optimizations

2018-08-31 Thread Ariel Weisberg (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ariel Weisberg updated CASSANDRA-14406:
---
Status: Patch Available  (was: Open)

> Transient Replication: Implement cheap quorum write optimizations
> -
>
> Key: CASSANDRA-14406
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14406
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Coordination
>Reporter: Ariel Weisberg
>Assignee: Blake Eggleston
>Priority: Major
>
> Writes should never be sent to transient replicas unless necessary to satisfy 
> the requested consistency level, such as when RF is insufficient for strong 
> consistency or not enough full replicas are marked as alive.
> If a write doesn't receive sufficient responses in time, the write should be 
> sent to additional replicas, similar to Rapid Read Protection.
> Hints should never be written for a transient replica.






[jira] [Updated] (CASSANDRA-14406) Transient Replication: Implement cheap quorum write optimizations

2018-08-31 Thread Ariel Weisberg (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ariel Weisberg updated CASSANDRA-14406:
---
Status: Ready to Commit  (was: Patch Available)

> Transient Replication: Implement cheap quorum write optimizations
> -
>
> Key: CASSANDRA-14406
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14406
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Coordination
>Reporter: Ariel Weisberg
>Assignee: Blake Eggleston
>Priority: Major
>
> Writes should never be sent to transient replicas unless necessary to satisfy 
> the requested consistency level, such as when RF is insufficient for strong 
> consistency or not enough full replicas are marked as alive.
> If a write doesn't receive sufficient responses in time, the write should be 
> sent to additional replicas, similar to Rapid Read Protection.
> Hints should never be written for a transient replica.






[jira] [Updated] (CASSANDRA-14407) Transient Replication: Add support for correct reads when transient replication is in use

2018-08-31 Thread Ariel Weisberg (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ariel Weisberg updated CASSANDRA-14407:
---
Status: Ready to Commit  (was: Patch Available)

> Transient Replication: Add support for correct reads when transient 
> replication is in use
> -
>
> Key: CASSANDRA-14407
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14407
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Coordination
>Reporter: Ariel Weisberg
>Assignee: Blake Eggleston
>Priority: Major
> Fix For: 4.0
>
>
> Digest reads should never be sent to transient replicas.
> Mismatches with results from transient replicas shouldn't trigger read repair.
> Read repair should never attempt to repair a transient replica.
> Reads should always include at least one full replica. They should also 
> prefer transient replicas where possible.
> Range scans must perform replica selection so that every scanned range 
> includes at least one full replica.






[jira] [Updated] (CASSANDRA-14405) Transient Replication: Metadata refactor

2018-08-31 Thread Ariel Weisberg (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ariel Weisberg updated CASSANDRA-14405:
---
Resolution: Fixed
Status: Resolved  (was: Ready to Commit)

> Transient Replication: Metadata refactor
> 
>
> Key: CASSANDRA-14405
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14405
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Core, Distributed Metadata, Documentation and Website
>Reporter: Ariel Weisberg
>Assignee: Blake Eggleston
>Priority: Major
> Fix For: 4.0
>
>
> Add support to CQL and NTS for configuring keyspaces to have transient 
> replicas.
> Add syntax allowing a keyspace using NTS to declare some replicas in each DC 
> as transient.
> Implement metadata internal to the DB so that it's possible to identify what 
> replicas are transient for a given token or range.
> Introduce Replica, which pairs an InetAddressAndPort with a boolean 
> indicating whether the replica is transient, and ReplicatedRange, which 
> wraps a Range and indicates whether the range is transient.
> Block altering of keyspaces to use transient replication if they already 
> contain MVs or 2i.
> Block the creation of MV or 2i in keyspaces using transient replication.
> Block the creation/alteration of keyspaces using transient replication if the 
> experimental flag is not set.
> Update web site, CQL spec, and any other documentation for the new syntax.






[jira] [Commented] (CASSANDRA-14405) Transient Replication: Metadata refactor

2018-08-31 Thread Ariel Weisberg (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16599458#comment-16599458
 ] 

Ariel Weisberg commented on CASSANDRA-14405:


The initial implementation of Transient Replication and Cheap Quorums was 
committed as 
[f7431b432875e334170ccdb19934d05545d2cebd|https://github.com/apache/cassandra/commit/f7431b432875e334170ccdb19934d05545d2cebd].

> Transient Replication: Metadata refactor
> 
>
> Key: CASSANDRA-14405
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14405
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Core, Distributed Metadata, Documentation and Website
>Reporter: Ariel Weisberg
>Assignee: Blake Eggleston
>Priority: Major
> Fix For: 4.0
>
>
> Add support to CQL and NTS for configuring keyspaces to have transient 
> replicas.
> Add syntax allowing a keyspace using NTS to declare some replicas in each DC 
> as transient.
> Implement metadata internal to the DB so that it's possible to identify what 
> replicas are transient for a given token or range.
> Introduce Replica, which pairs an InetAddressAndPort with a boolean 
> indicating whether the replica is transient, and ReplicatedRange, which 
> wraps a Range and indicates whether the range is transient.
> Block altering of keyspaces to use transient replication if they already 
> contain MVs or 2i.
> Block the creation of MV or 2i in keyspaces using transient replication.
> Block the creation/alteration of keyspaces using transient replication if the 
> experimental flag is not set.
> Update web site, CQL spec, and any other documentation for the new syntax.






[jira] [Updated] (CASSANDRA-14406) Transient Replication: Implement cheap quorum write optimizations

2018-08-31 Thread Ariel Weisberg (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ariel Weisberg updated CASSANDRA-14406:
---
Resolution: Fixed
Status: Resolved  (was: Ready to Commit)

The initial implementation of Transient Replication and Cheap Quorums was 
committed as 
[f7431b432875e334170ccdb19934d05545d2cebd|https://github.com/apache/cassandra/commit/f7431b432875e334170ccdb19934d05545d2cebd].

> Transient Replication: Implement cheap quorum write optimizations
> -
>
> Key: CASSANDRA-14406
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14406
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Coordination
>Reporter: Ariel Weisberg
>Assignee: Blake Eggleston
>Priority: Major
>
> Writes should never be sent to transient replicas unless necessary to satisfy 
> the requested consistency level, such as when RF is insufficient for strong 
> consistency or not enough full replicas are marked as alive.
> If a write doesn't receive sufficient responses in time, the write should be 
> sent to additional replicas, similar to Rapid Read Protection.
> Hints should never be written for a transient replica.






[jira] [Updated] (CASSANDRA-14405) Transient Replication: Metadata refactor

2018-08-31 Thread Ariel Weisberg (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ariel Weisberg updated CASSANDRA-14405:
---
Status: Ready to Commit  (was: Patch Available)

> Transient Replication: Metadata refactor
> 
>
> Key: CASSANDRA-14405
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14405
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Core, Distributed Metadata, Documentation and Website
>Reporter: Ariel Weisberg
>Assignee: Blake Eggleston
>Priority: Major
> Fix For: 4.0
>
>
> Add support to CQL and NTS for configuring keyspaces to have transient 
> replicas.
> Add syntax allowing a keyspace using NTS to declare some replicas in each DC 
> as transient.
> Implement metadata internal to the DB so that it's possible to identify what 
> replicas are transient for a given token or range.
> Introduce Replica, which pairs an InetAddressAndPort with a boolean 
> indicating whether the replica is transient, and ReplicatedRange, which 
> wraps a Range and indicates whether the range is transient.
> Block altering of keyspaces to use transient replication if they already 
> contain MVs or 2i.
> Block the creation of MV or 2i in keyspaces using transient replication.
> Block the creation/alteration of keyspaces using transient replication if the 
> experimental flag is not set.
> Update web site, CQL spec, and any other documentation for the new syntax.






[jira] [Updated] (CASSANDRA-14407) Transient Replication: Add support for correct reads when transient replication is in use

2018-08-31 Thread Ariel Weisberg (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ariel Weisberg updated CASSANDRA-14407:
---
Status: Patch Available  (was: Open)

> Transient Replication: Add support for correct reads when transient 
> replication is in use
> -
>
> Key: CASSANDRA-14407
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14407
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Coordination
>Reporter: Ariel Weisberg
>Assignee: Blake Eggleston
>Priority: Major
> Fix For: 4.0
>
>
> Digest reads should never be sent to transient replicas.
> Mismatches with results from transient replicas shouldn't trigger read repair.
> Read repair should never attempt to repair a transient replica.
> Reads should always include at least one full replica. They should also 
> prefer transient replicas where possible.
> Range scans must perform replica selection so that every scanned range 
> includes at least one full replica.






[jira] [Commented] (CASSANDRA-14404) Transient Replication & Cheap Quorums: Decouple storage requirements from consensus group size using incremental repair

2018-08-31 Thread Ariel Weisberg (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16599457#comment-16599457
 ] 

Ariel Weisberg commented on CASSANDRA-14404:


The initial implementation of Transient Replication and Cheap Quorums was 
committed as 
[f7431b432875e334170ccdb19934d05545d2cebd|https://github.com/apache/cassandra/commit/f7431b432875e334170ccdb19934d05545d2cebd].

> Transient Replication & Cheap Quorums: Decouple storage requirements from 
> consensus group size using incremental repair
> ---
>
> Key: CASSANDRA-14404
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14404
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Coordination, Core, CQL, Distributed Metadata, Hints, 
> Local Write-Read Paths, Materialized Views, Repair, Secondary Indexes, 
> Testing, Tools
>Reporter: Ariel Weisberg
>Assignee: Ariel Weisberg
>Priority: Major
> Fix For: 4.0
>
>
> Transient Replication is an implementation of [Witness 
> Replicas|http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.146.3429&rep=rep1&type=pdf]
>  that leverages incremental repair to make full replicas consistent with 
> transient replicas that don't store the entire data set. Witness replicas are 
> used in real world systems such as Megastore and Spanner to increase 
> availability inexpensively without having to commit to more full copies of 
> the database. Transient replicas implement functionality similar to 
> upgradable and temporary replicas from the paper.
> With transient replication, the replication factor is increased beyond the 
> desired level of data redundancy by adding replicas that only store data when 
> sufficient full replicas are unavailable to store it. These replicas are 
> called transient replicas. When incremental repair runs, transient replicas 
> stream any data they have received to full replicas, and once the data is 
> fully replicated it is dropped at the transient replicas.
> Cheap quorums are a further set of optimizations on the write path to avoid 
> writing to transient replicas unless sufficient full replicas are available 
> as well as optimizations on the read path to prefer reading from transient 
> replicas. When writing at quorum to a table configured to use transient 
> replication the quorum will always prefer available full replicas over 
> transient replicas so that transient replicas don't have to process writes. 
> Rapid write protection (similar to rapid read protection) reduces tail 
> latency when full replicas are slow/unavailable to respond by sending writes 
> to additional replicas if necessary.
> Transient replicas can generally service reads faster because they don't have 
> to do anything beyond bloom filter checks if they have no data. With vnodes 
> and larger size clusters they will not have a large quantity of data even in 
> failure cases where transient replicas start to serve a steady amount of 
> write traffic for some of their transiently replicated ranges.






cassandra-dtest git commit: Transient Replication and Cheap Quorums, update existing tests

2018-08-31 Thread aweisberg
Repository: cassandra-dtest
Updated Branches:
  refs/heads/master 3d760e6da -> 4e1c05565


Transient Replication and Cheap Quorums, update existing tests

Patch by Ariel Weisberg; Reviewed by Blake Eggleston for CASSANDRA-14404

Co-authored-by: Blake Eggleston 
Co-authored-by: Alex Petrov 


Project: http://git-wip-us.apache.org/repos/asf/cassandra-dtest/repo
Commit: http://git-wip-us.apache.org/repos/asf/cassandra-dtest/commit/4e1c0556
Tree: http://git-wip-us.apache.org/repos/asf/cassandra-dtest/tree/4e1c0556
Diff: http://git-wip-us.apache.org/repos/asf/cassandra-dtest/diff/4e1c0556

Branch: refs/heads/master
Commit: 4e1c05565aada57466b8edcdff43f1c7ebb7cd3e
Parents: 3d760e6
Author: Ariel Weisberg 
Authored: Fri Jun 22 12:28:30 2018 -0700
Committer: Ariel Weisberg 
Committed: Fri Aug 31 21:41:09 2018 -0400

--
 byteman/failing_repair.btm|  7 
 byteman/read_repair/sorted_live_endpoints.btm | 13 ++-
 byteman/read_repair/stop_data_reads.btm   |  2 +-
 byteman/read_repair/stop_digest_reads.btm |  2 +-
 byteman/slow_writes.btm   |  7 
 byteman/stop_reads.btm|  8 
 byteman/stop_rr_writes.btm|  8 
 byteman/stop_writes.btm   |  8 
 byteman/throw_on_digest.btm   |  7 
 read_repair_test.py   | 44 --
 repair_tests/repair_test.py   |  3 +-
 11 files changed, 93 insertions(+), 16 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/cassandra-dtest/blob/4e1c0556/byteman/failing_repair.btm
--
diff --git a/byteman/failing_repair.btm b/byteman/failing_repair.btm
new file mode 100644
index 000..ea82888
--- /dev/null
+++ b/byteman/failing_repair.btm
@@ -0,0 +1,7 @@
+RULE fail repairs
+CLASS org.apache.cassandra.repair.RepairMessageVerbHandler
+METHOD doVerb
+AT ENTRY
+IF true
+DO throw new RuntimeException("Repair failed");
+ENDRULE

http://git-wip-us.apache.org/repos/asf/cassandra-dtest/blob/4e1c0556/byteman/read_repair/sorted_live_endpoints.btm
--
diff --git a/byteman/read_repair/sorted_live_endpoints.btm 
b/byteman/read_repair/sorted_live_endpoints.btm
index 221e958..bfcfb1a 100644
--- a/byteman/read_repair/sorted_live_endpoints.btm
+++ b/byteman/read_repair/sorted_live_endpoints.btm
@@ -1,15 +1,8 @@
 RULE sorted live endpoints
-CLASS org.apache.cassandra.service.StorageProxy
-METHOD getLiveSortedEndpoints
+CLASS org.apache.cassandra.locator.SimpleSnitch
+METHOD sortedByProximity
 AT ENTRY
-BIND ep1 = 
org.apache.cassandra.locator.InetAddressAndPort.getByName("127.0.0.1");
- ep2 = 
org.apache.cassandra.locator.InetAddressAndPort.getByName("127.0.0.2");
- ep3 = 
org.apache.cassandra.locator.InetAddressAndPort.getByName("127.0.0.3");
- eps = new java.util.ArrayList();
 IF true
 DO
-eps.add(ep1);
-eps.add(ep2);
-eps.add(ep3);
-return eps;
+return $unsortedAddress.sorted(java.util.Comparator.naturalOrder());
 ENDRULE
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/cassandra-dtest/blob/4e1c0556/byteman/read_repair/stop_data_reads.btm
--
diff --git a/byteman/read_repair/stop_data_reads.btm 
b/byteman/read_repair/stop_data_reads.btm
index 9506aba..905a110 100644
--- a/byteman/read_repair/stop_data_reads.btm
+++ b/byteman/read_repair/stop_data_reads.btm
@@ -4,7 +4,7 @@ CLASS org.apache.cassandra.db.ReadCommandVerbHandler
 METHOD doVerb
 # wait until command is declared locally. because generics
 AFTER WRITE $command
-# bail out if it's not a digest request
+# bail out if it's a data request
 IF NOT $command.isDigestQuery()
 DO return;
 ENDRULE

http://git-wip-us.apache.org/repos/asf/cassandra-dtest/blob/4e1c0556/byteman/read_repair/stop_digest_reads.btm
--
diff --git a/byteman/read_repair/stop_digest_reads.btm 
b/byteman/read_repair/stop_digest_reads.btm
index 92c54f6..adb9b31 100644
--- a/byteman/read_repair/stop_digest_reads.btm
+++ b/byteman/read_repair/stop_digest_reads.btm
@@ -4,7 +4,7 @@ CLASS org.apache.cassandra.db.ReadCommandVerbHandler
 METHOD doVerb
 # wait until command is declared locally. because generics
 AFTER WRITE $command
-# bail out if it's not a digest request
+# bail out if it's a digest request
 IF $command.isDigestQuery()
 DO return;
 ENDRULE

http://git-wip-us.apache.org/repos/asf/cassandra-dtest/blob/4e1c0556/byteman/slow_writes.btm
--
diff --git a/byteman/slow_writes.btm b/byteman/slow_writes.btm
new file mode 100644
index 000..a82dd0a
--- /dev/null
+++ 

[11/18] cassandra git commit: Transient Replication and Cheap Quorums

2018-08-31 Thread aweisberg
http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/src/java/org/apache/cassandra/repair/SymmetricLocalSyncTask.java
--
diff --git a/src/java/org/apache/cassandra/repair/SymmetricLocalSyncTask.java 
b/src/java/org/apache/cassandra/repair/SymmetricLocalSyncTask.java
new file mode 100644
index 000..7eedab7
--- /dev/null
+++ b/src/java/org/apache/cassandra/repair/SymmetricLocalSyncTask.java
@@ -0,0 +1,142 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.cassandra.repair;
+
+import java.util.Collections;
+import java.util.List;
+import java.util.UUID;
+
+import com.google.common.annotations.VisibleForTesting;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.cassandra.dht.Range;
+import org.apache.cassandra.dht.Token;
+import org.apache.cassandra.locator.InetAddressAndPort;
+import org.apache.cassandra.locator.RangesAtEndpoint;
+import org.apache.cassandra.streaming.PreviewKind;
+import org.apache.cassandra.streaming.ProgressInfo;
+import org.apache.cassandra.streaming.StreamEvent;
+import org.apache.cassandra.streaming.StreamEventHandler;
+import org.apache.cassandra.streaming.StreamOperation;
+import org.apache.cassandra.streaming.StreamPlan;
+import org.apache.cassandra.streaming.StreamState;
+import org.apache.cassandra.tracing.TraceState;
+import org.apache.cassandra.tracing.Tracing;
+import org.apache.cassandra.utils.FBUtilities;
+
+/**
+ * SymmetricLocalSyncTask performs streaming between local(coordinator) node 
and remote replica.
+ */
+public class SymmetricLocalSyncTask extends SymmetricSyncTask implements 
StreamEventHandler
+{
+private final TraceState state = Tracing.instance.get();
+
+private static final Logger logger = 
LoggerFactory.getLogger(SymmetricLocalSyncTask.class);
+
+private final boolean remoteIsTransient;
+private final UUID pendingRepair;
+private final boolean pullRepair;
+
+public SymmetricLocalSyncTask(RepairJobDesc desc, TreeResponse r1, 
TreeResponse r2, boolean remoteIsTransient, UUID pendingRepair, boolean 
pullRepair, PreviewKind previewKind)
+{
+super(desc, r1, r2, previewKind);
+this.remoteIsTransient = remoteIsTransient;
+this.pendingRepair = pendingRepair;
+this.pullRepair = pullRepair;
+}
+
+@VisibleForTesting
+StreamPlan createStreamPlan(InetAddressAndPort dst, List> 
differences)
+{
+StreamPlan plan = new StreamPlan(StreamOperation.REPAIR, 1, false, 
pendingRepair, previewKind)
+  .listeners(this)
+  .flushBeforeTransfer(pendingRepair == null)
+  // see comment on RangesAtEndpoint.toDummyList for 
why we synthesize replicas here
+  .requestRanges(dst, desc.keyspace, 
RangesAtEndpoint.toDummyList(differences),
+  
RangesAtEndpoint.toDummyList(Collections.emptyList()), desc.columnFamily);  // 
request ranges from the remote node
+
+if (!pullRepair && !remoteIsTransient)
+{
+// send ranges to the remote node if we are not performing a pull 
repair
+// see comment on RangesAtEndpoint.toDummyList for why we 
synthesize replicas here
+plan.transferRanges(dst, desc.keyspace, 
RangesAtEndpoint.toDummyList(differences), desc.columnFamily);
+}
+
+return plan;
+}
+
+/**
+ * Starts sending/receiving our list of differences to/from the remote 
endpoint: creates a callback
+ * that will be called out of band once the streams complete.
+ */
+@Override
+protected void startSync(List> differences)
+{
+InetAddressAndPort local = FBUtilities.getBroadcastAddressAndPort();
+// We can take anyone of the node as source or destination, however if 
one is localhost, we put at source to avoid a forwarding
+InetAddressAndPort dst = r2.endpoint.equals(local) ? r1.endpoint : 
r2.endpoint;
+
+String message = String.format("Performing streaming repair of %d 
ranges with %s", differences.size(), dst);
+logger.info("{} {}", 

[03/18] cassandra git commit: Transient Replication and Cheap Quorums

2018-08-31 Thread aweisberg
http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/test/unit/org/apache/cassandra/service/ActiveRepairServiceTest.java
--
diff --git 
a/test/unit/org/apache/cassandra/service/ActiveRepairServiceTest.java 
b/test/unit/org/apache/cassandra/service/ActiveRepairServiceTest.java
index 294731a..4f7cde0 100644
--- a/test/unit/org/apache/cassandra/service/ActiveRepairServiceTest.java
+++ b/test/unit/org/apache/cassandra/service/ActiveRepairServiceTest.java
@@ -20,8 +20,6 @@ package org.apache.cassandra.service;
 
 import java.util.*;
 
-import javax.xml.crypto.Data;
-
 import com.google.common.collect.ImmutableList;
 import com.google.common.collect.Sets;
 import org.junit.Assert;
@@ -36,13 +34,13 @@ import org.apache.cassandra.db.Keyspace;
 import org.apache.cassandra.db.RowUpdateBuilder;
 import org.apache.cassandra.db.lifecycle.SSTableSet;
 import org.apache.cassandra.db.lifecycle.View;
-import org.apache.cassandra.dht.IPartitioner;
 import org.apache.cassandra.dht.Range;
 import org.apache.cassandra.dht.Token;
 import org.apache.cassandra.exceptions.ConfigurationException;
 import org.apache.cassandra.io.sstable.format.SSTableReader;
 import org.apache.cassandra.locator.AbstractReplicationStrategy;
 import org.apache.cassandra.locator.InetAddressAndPort;
+import org.apache.cassandra.locator.Replica;
 import org.apache.cassandra.locator.TokenMetadata;
 import org.apache.cassandra.repair.messages.RepairOption;
 import org.apache.cassandra.streaming.PreviewKind;
@@ -107,13 +105,13 @@ public class ActiveRepairServiceTest
 public void testGetNeighborsPlusOne() throws Throwable
 {
 // generate rf+1 nodes, and ensure that all nodes are returned
-Set expected = addTokens(1 + 
Keyspace.open(KEYSPACE5).getReplicationStrategy().getReplicationFactor());
+Set expected = addTokens(1 + 
Keyspace.open(KEYSPACE5).getReplicationStrategy().getReplicationFactor().allReplicas);
 expected.remove(FBUtilities.getBroadcastAddressAndPort());
-Collection> ranges = 
StorageService.instance.getLocalRanges(KEYSPACE5);
+Iterable> ranges = 
StorageService.instance.getLocalReplicas(KEYSPACE5).ranges();
 Set neighbors = new HashSet<>();
 for (Range range : ranges)
 {
-neighbors.addAll(ActiveRepairService.getNeighbors(KEYSPACE5, 
ranges, range, null, null));
+neighbors.addAll(ActiveRepairService.getNeighbors(KEYSPACE5, 
ranges, range, null, null).endpoints());
 }
 assertEquals(expected, neighbors);
 }
@@ -124,19 +122,19 @@ public class ActiveRepairServiceTest
 TokenMetadata tmd = StorageService.instance.getTokenMetadata();
 
 // generate rf*2 nodes, and ensure that only neighbors specified by 
the ARS are returned
-addTokens(2 * 
Keyspace.open(KEYSPACE5).getReplicationStrategy().getReplicationFactor());
+addTokens(2 * 
Keyspace.open(KEYSPACE5).getReplicationStrategy().getReplicationFactor().allReplicas);
 AbstractReplicationStrategy ars = 
Keyspace.open(KEYSPACE5).getReplicationStrategy();
 Set expected = new HashSet<>();
-for (Range replicaRange : 
ars.getAddressRanges().get(FBUtilities.getBroadcastAddressAndPort()))
+for (Replica replica : 
ars.getAddressReplicas().get(FBUtilities.getBroadcastAddressAndPort()))
 {
-
expected.addAll(ars.getRangeAddresses(tmd.cloneOnlyTokenMap()).get(replicaRange));
+
expected.addAll(ars.getRangeAddresses(tmd.cloneOnlyTokenMap()).get(replica.range()).endpoints());
 }
 expected.remove(FBUtilities.getBroadcastAddressAndPort());
-Collection> ranges = 
StorageService.instance.getLocalRanges(KEYSPACE5);
+Iterable> ranges = 
StorageService.instance.getLocalReplicas(KEYSPACE5).ranges();
 Set neighbors = new HashSet<>();
 for (Range range : ranges)
 {
-neighbors.addAll(ActiveRepairService.getNeighbors(KEYSPACE5, 
ranges, range, null, null));
+neighbors.addAll(ActiveRepairService.getNeighbors(KEYSPACE5, 
ranges, range, null, null).endpoints());
 }
 assertEquals(expected, neighbors);
 }
@@ -147,18 +145,18 @@ public class ActiveRepairServiceTest
 TokenMetadata tmd = StorageService.instance.getTokenMetadata();
 
 // generate rf+1 nodes, and ensure that all nodes are returned
-Set expected = addTokens(1 + 
Keyspace.open(KEYSPACE5).getReplicationStrategy().getReplicationFactor());
+Set expected = addTokens(1 + 
Keyspace.open(KEYSPACE5).getReplicationStrategy().getReplicationFactor().allReplicas);
 expected.remove(FBUtilities.getBroadcastAddressAndPort());
 // remove remote endpoints
 TokenMetadata.Topology topology = 
tmd.cloneOnlyTokenMap().getTopology();
 HashSet localEndpoints = 

[15/18] cassandra git commit: Transient Replication and Cheap Quorums

2018-08-31 Thread aweisberg
http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/src/java/org/apache/cassandra/dht/RangeStreamer.java
--
diff --git a/src/java/org/apache/cassandra/dht/RangeStreamer.java 
b/src/java/org/apache/cassandra/dht/RangeStreamer.java
index 110fed6..e8aa5d3 100644
--- a/src/java/org/apache/cassandra/dht/RangeStreamer.java
+++ b/src/java/org/apache/cassandra/dht/RangeStreamer.java
@@ -18,27 +18,40 @@
 package org.apache.cassandra.dht;
 
 import java.util.*;
+import java.util.function.BiFunction;
+import java.util.function.Function;
+import java.util.stream.Collectors;
 
 import com.google.common.annotations.VisibleForTesting;
 import com.google.common.base.Preconditions;
-import com.google.common.collect.ArrayListMultimap;
+import com.google.common.base.Predicate;
 import com.google.common.collect.HashMultimap;
+import com.google.common.collect.Iterables;
+import com.google.common.collect.Iterators;
 import com.google.common.collect.Multimap;
-import com.google.common.collect.Sets;
 
+import org.apache.cassandra.gms.FailureDetector;
+import org.apache.cassandra.locator.Endpoints;
+import org.apache.cassandra.locator.EndpointsByReplica;
 import org.apache.cassandra.locator.InetAddressAndPort;
 import org.apache.cassandra.locator.LocalStrategy;
 
+import org.apache.cassandra.locator.EndpointsByRange;
+import org.apache.cassandra.locator.EndpointsForRange;
+import org.apache.cassandra.locator.RangesAtEndpoint;
+import org.apache.cassandra.locator.ReplicaCollection.Mutable.Conflict;
 import org.apache.commons.lang3.StringUtils;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
 import org.apache.cassandra.db.Keyspace;
-import org.apache.cassandra.gms.EndpointState;
 import org.apache.cassandra.gms.Gossiper;
 import org.apache.cassandra.gms.IFailureDetector;
 import org.apache.cassandra.locator.AbstractReplicationStrategy;
 import org.apache.cassandra.locator.IEndpointSnitch;
+import org.apache.cassandra.locator.Replica;
+import org.apache.cassandra.locator.ReplicaCollection;
+import org.apache.cassandra.locator.Replicas;
 import org.apache.cassandra.locator.TokenMetadata;
 import org.apache.cassandra.service.StorageService;
 import org.apache.cassandra.streaming.PreviewKind;
@@ -47,13 +60,25 @@ import org.apache.cassandra.streaming.StreamResultFuture;
 import org.apache.cassandra.streaming.StreamOperation;
 import org.apache.cassandra.utils.FBUtilities;
 
+import static com.google.common.base.Predicates.and;
+import static com.google.common.base.Predicates.not;
+import static com.google.common.collect.Iterables.all;
+import static com.google.common.collect.Iterables.any;
+import static org.apache.cassandra.locator.Replica.fullReplica;
+
 /**
- * Assists in streaming ranges to a node.
+ * Assists in streaming ranges to this node.
  */
 public class RangeStreamer
 {
 private static final Logger logger = 
LoggerFactory.getLogger(RangeStreamer.class);
 
+public static Predicate ALIVE_PREDICATE = replica ->
+ 
(!Gossiper.instance.isEnabled() ||
+  
(Gossiper.instance.getEndpointStateForEndpoint(replica.endpoint()) == null ||
+   
Gossiper.instance.getEndpointStateForEndpoint(replica.endpoint()).isAlive())) &&
+ 
FailureDetector.instance.isAlive(replica.endpoint());
+
 /* bootstrap tokens. can be null if replacing the node. */
 private final Collection tokens;
 /* current token ring */
@@ -62,26 +87,59 @@ public class RangeStreamer
 private final InetAddressAndPort address;
 /* streaming description */
 private final String description;
-private final Multimap>>> toFetch = HashMultimap.create();
-private final Set sourceFilters = new HashSet<>();
+private final Multimap> 
toFetch = HashMultimap.create();
+private final Set> sourceFilters = new HashSet<>();
 private final StreamPlan streamPlan;
 private final boolean useStrictConsistency;
 private final IEndpointSnitch snitch;
 private final StreamStateStore stateStore;
 
-/**
- * A filter applied to sources to stream from when constructing a fetch 
map.
- */
-public static interface ISourceFilter
+public static class FetchReplica
 {
-public boolean shouldInclude(InetAddressAndPort endpoint);
+public final Replica local;
+public final Replica remote;
+
+public FetchReplica(Replica local, Replica remote)
+{
+Preconditions.checkNotNull(local);
+Preconditions.checkNotNull(remote);
+assert local.isLocal() && !remote.isLocal();
+this.local = local;
+this.remote = remote;
+}
+
+public String toString()
+{
+return "FetchReplica{" +
+   

[04/18] cassandra git commit: Transient Replication and Cheap Quorums

2018-08-31 Thread aweisberg
http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/test/unit/org/apache/cassandra/locator/OldNetworkTopologyStrategyTest.java
--
diff --git 
a/test/unit/org/apache/cassandra/locator/OldNetworkTopologyStrategyTest.java 
b/test/unit/org/apache/cassandra/locator/OldNetworkTopologyStrategyTest.java
index 9c90d57..4afeb5a 100644
--- a/test/unit/org/apache/cassandra/locator/OldNetworkTopologyStrategyTest.java
+++ b/test/unit/org/apache/cassandra/locator/OldNetworkTopologyStrategyTest.java
@@ -20,7 +20,6 @@ package org.apache.cassandra.locator;
 import java.net.UnknownHostException;
 import java.util.ArrayList;
 import java.util.Arrays;
-import java.util.Collection;
 import java.util.Collections;
 import java.util.HashMap;
 import java.util.List;
@@ -39,9 +38,11 @@ import org.apache.cassandra.service.StorageService;
 import org.apache.cassandra.utils.Pair;
 
 import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
 
 public class OldNetworkTopologyStrategyTest
 {
+
 private List keyTokens;
 private TokenMetadata tmd;
 private Map> expectedResults;
@@ -53,7 +54,7 @@ public class OldNetworkTopologyStrategyTest
 }
 
 @Before
-public void init()
+public void init() throws Exception
 {
 keyTokens = new ArrayList();
 tmd = new TokenMetadata();
@@ -160,11 +161,11 @@ public class OldNetworkTopologyStrategyTest
 {
 for (Token keyToken : keyTokens)
 {
-List endpoints = 
strategy.getNaturalEndpoints(keyToken);
-for (int j = 0; j < endpoints.size(); j++)
+int j = 0;
+for (InetAddressAndPort endpoint : 
strategy.getNaturalReplicasForToken(keyToken).endpoints())
 {
 ArrayList hostsExpected = 
expectedResults.get(keyToken.toString());
-assertEquals(endpoints.get(j), hostsExpected.get(j));
+assertEquals(endpoint, hostsExpected.get(j++));
 }
 }
 }
@@ -188,7 +189,6 @@ public class OldNetworkTopologyStrategyTest
 assertEquals(ranges.left.iterator().next().left, 
tokensAfterMove[movingNodeIdx]);
 assertEquals(ranges.left.iterator().next().right, 
tokens[movingNodeIdx]);
 assertEquals("No data should be fetched", ranges.right.size(), 0);
-
 }
 
 @Test
@@ -205,7 +205,6 @@ public class OldNetworkTopologyStrategyTest
 assertEquals("No data should be streamed", ranges.left.size(), 0);
 assertEquals(ranges.right.iterator().next().left, 
tokens[movingNodeIdx]);
 assertEquals(ranges.right.iterator().next().right, 
tokensAfterMove[movingNodeIdx]);
-
 }
 
 @SuppressWarnings("unchecked")
@@ -366,16 +365,21 @@ public class OldNetworkTopologyStrategyTest
 TokenMetadata tokenMetadataAfterMove = 
initTokenMetadata(tokensAfterMove);
 AbstractReplicationStrategy strategy = new 
OldNetworkTopologyStrategy("Keyspace1", tokenMetadataCurrent, endpointSnitch, 
optsWithRF(2));
 
-Collection> currentRanges = 
strategy.getAddressRanges().get(movingNode);
-Collection> updatedRanges = 
strategy.getPendingAddressRanges(tokenMetadataAfterMove, 
tokensAfterMove[movingNodeIdx], movingNode);
-
-Pair>, Set>> ranges = 
StorageService.instance.calculateStreamAndFetchRanges(currentRanges, 
updatedRanges);
+RangesAtEndpoint currentRanges = 
strategy.getAddressReplicas().get(movingNode);
+RangesAtEndpoint updatedRanges = 
strategy.getPendingAddressRanges(tokenMetadataAfterMove, 
tokensAfterMove[movingNodeIdx], movingNode);
 
-return ranges;
+return 
asRanges(StorageService.calculateStreamAndFetchRanges(currentRanges, 
updatedRanges));
 }
 
 private static Map optsWithRF(int rf)
 {
 return Collections.singletonMap("replication_factor", 
Integer.toString(rf));
 }
+
+public static Pair>, Set>> 
asRanges(Pair replicas)
+{
+Set> leftRanges = replicas.left.ranges();
+Set> rightRanges = replicas.right.ranges();
+return Pair.create(leftRanges, rightRanges);
+}
 }

http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/test/unit/org/apache/cassandra/locator/PendingRangeMapsTest.java
--
diff --git a/test/unit/org/apache/cassandra/locator/PendingRangeMapsTest.java 
b/test/unit/org/apache/cassandra/locator/PendingRangeMapsTest.java
index 56fd181..8e0bc00 100644
--- a/test/unit/org/apache/cassandra/locator/PendingRangeMapsTest.java
+++ b/test/unit/org/apache/cassandra/locator/PendingRangeMapsTest.java
@@ -26,7 +26,6 @@ import org.apache.cassandra.dht.Token;
 import org.junit.Test;
 
 import java.net.UnknownHostException;
-import java.util.Collection;
 
 import static org.junit.Assert.assertEquals;
 import static org.junit.Assert.assertTrue;
@@ -38,17 +37,29 @@ public class PendingRangeMapsTest {
 

[13/18] cassandra git commit: Transient Replication and Cheap Quorums

2018-08-31 Thread aweisberg
http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java
--
diff --git a/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java 
b/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java
index cb2ea46..c63f4f3 100644
--- a/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java
+++ b/src/java/org/apache/cassandra/locator/NetworkTopologyStrategy.java
@@ -20,10 +20,12 @@ package org.apache.cassandra.locator;
 import java.util.*;
 import java.util.Map.Entry;
 
+import org.apache.cassandra.locator.ReplicaCollection.Mutable.Conflict;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
 import org.apache.cassandra.dht.Datacenters;
+import org.apache.cassandra.dht.Range;
 import org.apache.cassandra.exceptions.ConfigurationException;
 import org.apache.cassandra.dht.Token;
 import org.apache.cassandra.locator.TokenMetadata.Topology;
@@ -49,14 +51,17 @@ import com.google.common.collect.Multimap;
  */
 public class NetworkTopologyStrategy extends AbstractReplicationStrategy
 {
-private final Map datacenters;
+private final Map datacenters;
+private final ReplicationFactor aggregateRf;
 private static final Logger logger = 
LoggerFactory.getLogger(NetworkTopologyStrategy.class);
 
 public NetworkTopologyStrategy(String keyspaceName, TokenMetadata 
tokenMetadata, IEndpointSnitch snitch, Map configOptions) 
throws ConfigurationException
 {
 super(keyspaceName, tokenMetadata, snitch, configOptions);
 
-Map newDatacenters = new HashMap();
+int replicas = 0;
+int trans = 0;
+Map newDatacenters = new HashMap<>();
 if (configOptions != null)
 {
 for (Entry entry : configOptions.entrySet())
@@ -64,12 +69,15 @@ public class NetworkTopologyStrategy extends 
AbstractReplicationStrategy
 String dc = entry.getKey();
 if (dc.equalsIgnoreCase("replication_factor"))
 throw new ConfigurationException("replication_factor is an 
option for SimpleStrategy, not NetworkTopologyStrategy");
-Integer replicas = Integer.valueOf(entry.getValue());
-newDatacenters.put(dc, replicas);
+ReplicationFactor rf = 
ReplicationFactor.fromString(entry.getValue());
+replicas += rf.allReplicas;
+trans += rf.transientReplicas();
+newDatacenters.put(dc, rf);
 }
 }
 
 datacenters = Collections.unmodifiableMap(newDatacenters);
+aggregateRf = ReplicationFactor.withTransient(replicas, trans);
 logger.info("Configured datacenter replicas are {}", 
FBUtilities.toString(datacenters));
 }
 
@@ -79,7 +87,8 @@ public class NetworkTopologyStrategy extends 
AbstractReplicationStrategy
 private static final class DatacenterEndpoints
 {
 /** List accepted endpoints get pushed into. */
-Set endpoints;
+EndpointsForRange.Mutable replicas;
+
 /**
  * Racks encountered so far. Replicas are put into separate racks 
while possible.
  * For efficiency the set is shared between the instances, using the 
location pair (dc, rack) to make sure
@@ -90,41 +99,51 @@ public class NetworkTopologyStrategy extends 
AbstractReplicationStrategy
 /** Number of replicas left to fill from this DC. */
 int rfLeft;
 int acceptableRackRepeats;
+int transients;
 
-DatacenterEndpoints(int rf, int rackCount, int nodeCount, 
Set endpoints, Set> racks)
+DatacenterEndpoints(ReplicationFactor rf, int rackCount, int 
nodeCount, EndpointsForRange.Mutable replicas, Set> racks)
 {
-this.endpoints = endpoints;
+this.replicas = replicas;
 this.racks = racks;
 // If there aren't enough nodes in this DC to fill the RF, the 
number of nodes is the effective RF.
-this.rfLeft = Math.min(rf, nodeCount);
+this.rfLeft = Math.min(rf.allReplicas, nodeCount);
 // If there aren't enough racks in this DC to fill the RF, we'll 
still use at least one node from each rack,
 // and the difference is to be filled by the first encountered 
nodes.
-acceptableRackRepeats = rf - rackCount;
+acceptableRackRepeats = rf.allReplicas - rackCount;
+
+// if we have fewer replicas than rf calls for, reduce transients 
accordingly
+int reduceTransients = rf.allReplicas - this.rfLeft;
+transients = Math.max(rf.transientReplicas() - reduceTransients, 
0);
+ReplicationFactor.validate(rfLeft, transients);
 }
 
 /**
- * Attempts to add an endpoint to the replicas for this datacenter, 
adding to the endpoints set if successful.
+ * Attempts to add an endpoint to the replicas 

[08/18] cassandra git commit: Transient Replication and Cheap Quorums

2018-08-31 Thread aweisberg
http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/src/java/org/apache/cassandra/service/reads/AbstractReadExecutor.java
--
diff --git 
a/src/java/org/apache/cassandra/service/reads/AbstractReadExecutor.java 
b/src/java/org/apache/cassandra/service/reads/AbstractReadExecutor.java
index 61b9948..031326e 100644
--- a/src/java/org/apache/cassandra/service/reads/AbstractReadExecutor.java
+++ b/src/java/org/apache/cassandra/service/reads/AbstractReadExecutor.java
@@ -17,11 +17,12 @@
  */
 package org.apache.cassandra.service.reads;
 
-import java.util.List;
 import java.util.concurrent.TimeUnit;
 
 import com.google.common.base.Preconditions;
-import com.google.common.collect.Iterables;
+
+import org.apache.cassandra.locator.ReplicaLayout;
+
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
@@ -37,15 +38,20 @@ import org.apache.cassandra.db.partitions.PartitionIterator;
 import org.apache.cassandra.exceptions.ReadFailureException;
 import org.apache.cassandra.exceptions.ReadTimeoutException;
 import org.apache.cassandra.exceptions.UnavailableException;
+import org.apache.cassandra.locator.EndpointsForToken;
 import org.apache.cassandra.locator.InetAddressAndPort;
+import org.apache.cassandra.locator.Replica;
+import org.apache.cassandra.locator.ReplicaCollection;
 import org.apache.cassandra.net.MessageOut;
 import org.apache.cassandra.net.MessagingService;
-import org.apache.cassandra.service.StorageProxy;
 import org.apache.cassandra.service.reads.repair.ReadRepair;
 import org.apache.cassandra.service.StorageProxy.LocalReadRunnable;
 import org.apache.cassandra.tracing.TraceState;
 import org.apache.cassandra.tracing.Tracing;
 
+import static com.google.common.collect.Iterables.all;
+import static com.google.common.collect.Iterables.tryFind;
+
 /**
  * Sends a read request to the replicas needed to satisfy a given 
ConsistencyLevel.
  *
@@ -59,32 +65,27 @@ public abstract class AbstractReadExecutor
 private static final Logger logger = 
LoggerFactory.getLogger(AbstractReadExecutor.class);
 
 protected final ReadCommand command;
-protected final ConsistencyLevel consistency;
-protected final List targetReplicas;
-protected final ReadRepair readRepair;
-protected final DigestResolver digestResolver;
-protected final ReadCallback handler;
+private   final ReplicaLayout.ForToken replicaLayout;
+protected final ReadRepair 
readRepair;
+protected final DigestResolver 
digestResolver;
+protected final ReadCallback 
handler;
 protected final TraceState traceState;
 protected final ColumnFamilyStore cfs;
 protected final long queryStartNanoTime;
+private   final int initialDataRequestCount;
 protected volatile PartitionIterator result = null;
 
-protected final Keyspace keyspace;
-protected final int blockFor;
-
-AbstractReadExecutor(Keyspace keyspace, ColumnFamilyStore cfs, ReadCommand 
command, ConsistencyLevel consistency, List targetReplicas, 
long queryStartNanoTime)
+AbstractReadExecutor(ColumnFamilyStore cfs, ReadCommand command, 
ReplicaLayout.ForToken replicaLayout, int initialDataRequestCount, long 
queryStartNanoTime)
 {
 this.command = command;
-this.consistency = consistency;
-this.targetReplicas = targetReplicas;
-this.readRepair = ReadRepair.create(command, queryStartNanoTime, 
consistency);
-this.digestResolver = new DigestResolver(keyspace, command, 
consistency, readRepair, targetReplicas.size());
-this.handler = new ReadCallback(digestResolver, consistency, command, 
targetReplicas, queryStartNanoTime);
+this.replicaLayout = replicaLayout;
+this.initialDataRequestCount = initialDataRequestCount;
+this.readRepair = ReadRepair.create(command, replicaLayout, 
queryStartNanoTime);
+this.digestResolver = new DigestResolver<>(command, replicaLayout, 
readRepair, queryStartNanoTime);
+this.handler = new ReadCallback<>(digestResolver, 
replicaLayout.consistencyLevel().blockFor(replicaLayout.keyspace()), command, 
replicaLayout, queryStartNanoTime);
 this.cfs = cfs;
 this.traceState = Tracing.instance.get();
 this.queryStartNanoTime = queryStartNanoTime;
-this.keyspace = keyspace;
-this.blockFor = consistency.blockFor(keyspace);
 
 
 // Set the digest version (if we request some digests). This is the smallest version amongst all our target replicas since new nodes
@@ -92,8 +93,8 @@ public abstract class AbstractReadExecutor
 // TODO: we need this when talking with pre-3.0 nodes. So if we preserve the digest format moving forward, we can get rid of this once
 // we stop being compatible with pre-3.0 nodes.
 int digestVersion = MessagingService.current_version;
-for (InetAddressAndPort replica : targetReplicas)
-digestVersion = Math.min(digestVersion, 

[16/18] cassandra git commit: Transient Replication and Cheap Quorums

2018-08-31 Thread aweisberg
http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/src/java/org/apache/cassandra/db/compaction/CompactionStrategyManager.java
--
diff --git a/src/java/org/apache/cassandra/db/compaction/CompactionStrategyManager.java b/src/java/org/apache/cassandra/db/compaction/CompactionStrategyManager.java
index 9766454..afe628b 100644
--- a/src/java/org/apache/cassandra/db/compaction/CompactionStrategyManager.java
+++ b/src/java/org/apache/cassandra/db/compaction/CompactionStrategyManager.java
@@ -56,6 +56,7 @@ import org.apache.cassandra.index.Index;
 import org.apache.cassandra.io.sstable.Component;
 import org.apache.cassandra.io.sstable.Descriptor;
 import org.apache.cassandra.io.sstable.ISSTableScanner;
+import org.apache.cassandra.io.sstable.SSTable;
 import org.apache.cassandra.io.sstable.SSTableMultiWriter;
 import org.apache.cassandra.io.sstable.format.SSTableReader;
 import org.apache.cassandra.io.sstable.metadata.MetadataCollector;
@@ -112,6 +113,7 @@ public class CompactionStrategyManager implements INotificationConsumer
 /**
  * Variables guarded by read and write lock above
  */
+private final PendingRepairHolder transientRepairs;
 private final PendingRepairHolder pendingRepairs;
 private final CompactionStrategyHolder repaired;
 private final CompactionStrategyHolder unrepaired;
@@ -156,10 +158,11 @@ public class CompactionStrategyManager implements INotificationConsumer
 return compactionStrategyIndexForDirectory(descriptor);
 }
 };
-pendingRepairs = new PendingRepairHolder(cfs, router);
+transientRepairs = new PendingRepairHolder(cfs, router, true);
+pendingRepairs = new PendingRepairHolder(cfs, router, false);
 repaired = new CompactionStrategyHolder(cfs, router, true);
 unrepaired = new CompactionStrategyHolder(cfs, router, false);
-holders = ImmutableList.of(pendingRepairs, repaired, unrepaired);
+holders = ImmutableList.of(transientRepairs, pendingRepairs, repaired, unrepaired);
 
 cfs.getTracker().subscribe(this);
 logger.trace("{} subscribed to the data tracker.", this);
@@ -176,7 +179,6 @@ public class CompactionStrategyManager implements INotificationConsumer
  * Return the next background task
  *
  * Returns a task for the compaction strategy that needs it the most (most estimated remaining tasks)
- *
  */
 public AbstractCompactionTask getNextBackgroundTask(int gcBefore)
 {
@@ -188,18 +190,16 @@ public class CompactionStrategyManager implements INotificationConsumer
 return null;
 
 int numPartitions = getNumTokenPartitions();
+
 // first try to promote/demote sstables from completed repairs
-List repairFinishedSuppliers = pendingRepairs.getRepairFinishedTaskSuppliers();
-if (!repairFinishedSuppliers.isEmpty())
-{
-Collections.sort(repairFinishedSuppliers);
-for (TaskSupplier supplier : repairFinishedSuppliers)
-{
-AbstractCompactionTask task = supplier.getTask();
-if (task != null)
-return task;
-}
-}
+AbstractCompactionTask repairFinishedTask;
+repairFinishedTask = pendingRepairs.getNextRepairFinishedTask();
+if (repairFinishedTask != null)
+return repairFinishedTask;
+
+repairFinishedTask = transientRepairs.getNextRepairFinishedTask();
+if (repairFinishedTask != null)
+return repairFinishedTask;
 
 // sort compaction task suppliers by remaining tasks descending
 List suppliers = new ArrayList<>(numPartitions * holders.size());
@@ -393,64 +393,28 @@ public class CompactionStrategyManager implements INotificationConsumer
 }
 }
 
-
-
 @VisibleForTesting
-List getRepaired()
+CompactionStrategyHolder getRepairedUnsafe()
 {
-readLock.lock();
-try
-{
-return Lists.newArrayList(repaired.allStrategies());
-}
-finally
-{
-readLock.unlock();
-}
+return repaired;
 }
 
 @VisibleForTesting
-List getUnrepaired()
+CompactionStrategyHolder getUnrepairedUnsafe()
 {
-readLock.lock();
-try
-{
-return Lists.newArrayList(unrepaired.allStrategies());
-}
-finally
-{
-readLock.unlock();
-}
+return unrepaired;
 }
 
 @VisibleForTesting
-Iterable getForPendingRepair(UUID sessionID)
+PendingRepairHolder getPendingRepairsUnsafe()
 {
-readLock.lock();
-try
-{
-return pendingRepairs.getStrategiesFor(sessionID);
-}
-finally
-{
-

[17/18] cassandra git commit: Transient Replication and Cheap Quorums

2018-08-31 Thread aweisberg
http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/src/java/org/apache/cassandra/db/DiskBoundaryManager.java
--
diff --git a/src/java/org/apache/cassandra/db/DiskBoundaryManager.java b/src/java/org/apache/cassandra/db/DiskBoundaryManager.java
index 72b5e2a..acfe71a 100644
--- a/src/java/org/apache/cassandra/db/DiskBoundaryManager.java
+++ b/src/java/org/apache/cassandra/db/DiskBoundaryManager.java
@@ -19,8 +19,9 @@
 package org.apache.cassandra.db;
 
 import java.util.ArrayList;
-import java.util.Collection;
+import java.util.Comparator;
 import java.util.List;
+import java.util.stream.Collectors;
 
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
@@ -30,6 +31,8 @@ import org.apache.cassandra.dht.IPartitioner;
 import org.apache.cassandra.dht.Range;
 import org.apache.cassandra.dht.Splitter;
 import org.apache.cassandra.dht.Token;
+import org.apache.cassandra.locator.RangesAtEndpoint;
+import org.apache.cassandra.locator.Replica;
 import org.apache.cassandra.locator.TokenMetadata;
 import org.apache.cassandra.service.PendingRangeCalculatorService;
 import org.apache.cassandra.service.StorageService;
@@ -68,7 +71,7 @@ public class DiskBoundaryManager
 
 private static DiskBoundaries getDiskBoundaryValue(ColumnFamilyStore cfs)
 {
-Collection> localRanges;
+RangesAtEndpoint localRanges;
 
 long ringVersion;
 TokenMetadata tmd;
@@ -87,7 +90,7 @@ public class DiskBoundaryManager
 // Reason we use the future settled TMD is that if we decommission a node, we want to stream
 // from that node to the correct location on disk, if we didn't, we would put new files in the wrong places.
 // We do this to minimize the amount of data we need to move in rebalancedisks once everything settled
-localRanges = cfs.keyspace.getReplicationStrategy().getAddressRanges(tmd.cloneAfterAllSettled()).get(FBUtilities.getBroadcastAddressAndPort());
+localRanges = cfs.keyspace.getReplicationStrategy().getAddressReplicas(tmd.cloneAfterAllSettled(), FBUtilities.getBroadcastAddressAndPort());
 }
 logger.debug("Got local ranges {} (ringVersion = {})", localRanges, ringVersion);
 }
@@ -106,9 +109,18 @@ public class DiskBoundaryManager
 if (localRanges == null || localRanges.isEmpty())
 return new DiskBoundaries(dirs, null, ringVersion, directoriesVersion);
 
-List> sortedLocalRanges = Range.sort(localRanges);
+// note that Range.sort unwraps any wraparound ranges, so we need to sort them here
+List> fullLocalRanges = Range.sort(localRanges.stream().filter(Replica::isFull).map(Replica::range).collect(Collectors.toList()));
+List> transientLocalRanges = Range.sort(localRanges.stream().filter(Replica::isTransient).map(Replica::range).collect(Collectors.toList()));
+
+List positions = getDiskBoundaries(fullLocalRanges, transientLocalRanges, cfs.getPartitioner(), dirs);
 
-List positions = getDiskBoundaries(sortedLocalRanges, cfs.getPartitioner(), dirs);
 return new DiskBoundaries(dirs, positions, ringVersion, directoriesVersion);
 }
 
@@ -121,15 +133,26 @@
  *
  * The final entry in the returned list will always be the partitioner maximum tokens upper key bound
  */
-private static List getDiskBoundaries(List> sortedLocalRanges, IPartitioner partitioner, Directories.DataDirectory[] dataDirectories)
+private static List getDiskBoundaries(List> fullRanges, List> transientRanges, IPartitioner partitioner, Directories.DataDirectory[] dataDirectories)
 {
 assert partitioner.splitter().isPresent();
+
 Splitter splitter = partitioner.splitter().get();
 boolean dontSplitRanges = DatabaseDescriptor.getNumTokens() > 1;
-List boundaries = splitter.splitOwnedRanges(dataDirectories.length, sortedLocalRanges, dontSplitRanges);
+
+List weightedRanges = new ArrayList<>(fullRanges.size() + transientRanges.size());
+for (Range r : fullRanges)
+weightedRanges.add(new Splitter.WeightedRange(1.0, r));
+
+for (Range r : transientRanges)
+weightedRanges.add(new Splitter.WeightedRange(0.1, r));
+
+weightedRanges.sort(Comparator.comparing(Splitter.WeightedRange::left));
+
+List boundaries = splitter.splitOwnedRanges(dataDirectories.length, 

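The hunk above replaces the flat sorted-range split with weighted ranges: full replicas count at 1.0 and transient replicas at 0.1 when dividing token ranges among data directories, since transient ranges hold far less data. A minimal, self-contained sketch of that weighting step, using plain `long` token bounds and a hypothetical `WeightedRange` stand-in for Cassandra's `Splitter.WeightedRange`:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Simplified stand-in for Splitter.WeightedRange: a token range is just (left, right] longs here.
class WeightedRangesSketch
{
    static final class WeightedRange
    {
        final double weight;
        final long left, right;

        WeightedRange(double weight, long left, long right)
        {
            this.weight = weight;
            this.left = left;
            this.right = right;
        }

        long left() { return left; }

        // The weight scales how much of the range counts toward disk-balancing math.
        double weightedSize() { return weight * (right - left); }
    }

    // Mirrors the diff: full ranges get weight 1.0, transient ranges 0.1, then sort by left token.
    static List<WeightedRange> weigh(List<long[]> fullRanges, List<long[]> transientRanges)
    {
        List<WeightedRange> weighted = new ArrayList<>(fullRanges.size() + transientRanges.size());
        for (long[] r : fullRanges)
            weighted.add(new WeightedRange(1.0, r[0], r[1]));
        for (long[] r : transientRanges)
            weighted.add(new WeightedRange(0.1, r[0], r[1]));
        weighted.sort(Comparator.comparingLong(WeightedRange::left));
        return weighted;
    }

    public static void main(String[] args)
    {
        // One full range and one equally wide transient range: the transient
        // one contributes only a tenth of its span to the balance.
        List<WeightedRange> w = weigh(List.of(new long[]{0, 100}), List.of(new long[]{100, 200}));
        System.out.println(w.get(0).weightedSize() + " " + w.get(1).weightedSize());
    }
}
```

The sorted weighted list then feeds a splitter that balances the *weighted* sizes, so a directory full of transient ranges ends up holding roughly ten times the token span of one holding full ranges.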
[14/18] cassandra git commit: Transient Replication and Cheap Quorums

2018-08-31 Thread aweisberg
http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/src/java/org/apache/cassandra/io/sstable/format/big/BigFormat.java
--
diff --git a/src/java/org/apache/cassandra/io/sstable/format/big/BigFormat.java b/src/java/org/apache/cassandra/io/sstable/format/big/BigFormat.java
index db73b4f..8eb8603 100644
--- a/src/java/org/apache/cassandra/io/sstable/format/big/BigFormat.java
+++ b/src/java/org/apache/cassandra/io/sstable/format/big/BigFormat.java
@@ -21,6 +21,9 @@ import java.util.Collection;
 import java.util.Set;
 import java.util.UUID;
 
+import com.google.common.base.Preconditions;
+
+import org.apache.cassandra.io.sstable.SSTable;
 import org.apache.cassandra.schema.TableMetadata;
 import org.apache.cassandra.schema.TableMetadataRef;
 import org.apache.cassandra.db.RowIndexEntry;
@@ -85,13 +88,15 @@ public class BigFormat implements SSTableFormat
   long keyCount,
   long repairedAt,
   UUID pendingRepair,
+  boolean isTransient,
   TableMetadataRef metadata,
   MetadataCollector metadataCollector,
   SerializationHeader header,
   Collection observers,
   LifecycleTransaction txn)
 {
-return new BigTableWriter(descriptor, keyCount, repairedAt, pendingRepair, metadata, metadataCollector, header, observers, txn);
+SSTable.validateRepairedMetadata(repairedAt, pendingRepair, isTransient);
+return new BigTableWriter(descriptor, keyCount, repairedAt, pendingRepair, isTransient, metadata, metadataCollector, header, observers, txn);
 }
 }
 
@@ -120,7 +125,7 @@ public class BigFormat implements SSTableFormat
 // mb (3.0.7, 3.7): commit log lower bound included
 // mc (3.0.8, 3.9): commit log intervals included
 
-// na (4.0.0): uncompressed chunks, pending repair session, checksummed sstable metadata file, new Bloomfilter format
+// na (4.0.0): uncompressed chunks, pending repair session, isTransient, checksummed sstable metadata file, new Bloomfilter format
 //
 // NOTE: when adding a new version, please add that to LegacySSTableTest, too.
 
@@ -131,6 +136,7 @@ public class BigFormat implements SSTableFormat
 public final boolean hasMaxCompressedLength;
 private final boolean hasPendingRepair;
 private final boolean hasMetadataChecksum;
+private final boolean hasIsTransient;
 /**
 * CASSANDRA-9067: 4.0 bloom filter representation changed (two longs just swapped)
 * have no 'static' bits caused by using the same upper bits for both bloom filter and token distribution.
@@ -148,6 +154,7 @@ public class BigFormat implements SSTableFormat
 hasCommitLogIntervals = version.compareTo("mc") >= 0;
 hasMaxCompressedLength = version.compareTo("na") >= 0;
 hasPendingRepair = version.compareTo("na") >= 0;
+hasIsTransient = version.compareTo("na") >= 0;
 hasMetadataChecksum = version.compareTo("na") >= 0;
 hasOldBfFormat = version.compareTo("na") < 0;
 }
@@ -176,6 +183,12 @@ public class BigFormat implements SSTableFormat
 }
 
 @Override
+public boolean hasIsTransient()
+{
+return hasIsTransient;
+}
+
+@Override
 public int correspondingMessagingVersion()
 {
 return correspondingMessagingVersion;

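All the capability flags above derive from lexicographic comparison of the two-letter format version name ("ma" < "mb" < "mc" < "na"), so a single `compareTo` against the first version that introduced a feature answers whether a given sstable has it. A standalone sketch of that idiom — the class is illustrative, not Cassandra's API; the version letters come from the comments in the diff:

```java
// Illustrative sketch of sstable-format capability flags derived from the
// two-letter version name via lexicographic comparison, as in BigFormat.BigVersion.
class VersionFlags
{
    final String version;
    final boolean hasPendingRepair;
    final boolean hasIsTransient;
    final boolean hasOldBfFormat;

    VersionFlags(String version)
    {
        this.version = version;
        // "na" (4.0.0) introduced pending repair sessions and the isTransient flag...
        hasPendingRepair = version.compareTo("na") >= 0;
        hasIsTransient = version.compareTo("na") >= 0;
        // ...while versions before "na" still use the old bloom filter layout.
        hasOldBfFormat = version.compareTo("na") < 0;
    }

    public static void main(String[] args)
    {
        VersionFlags mc = new VersionFlags("mc"); // 3.0.8 / 3.9 era
        VersionFlags na = new VersionFlags("na"); // 4.0.0
        System.out.println(mc.hasIsTransient + " " + na.hasIsTransient); // false true
    }
}
```

Because the comparison is on the whole name, every future version sorting after "na" automatically inherits the 4.0 capabilities without touching the flag logic.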
http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/src/java/org/apache/cassandra/io/sstable/format/big/BigTableWriter.java
--
diff --git a/src/java/org/apache/cassandra/io/sstable/format/big/BigTableWriter.java b/src/java/org/apache/cassandra/io/sstable/format/big/BigTableWriter.java
index b5488ed..7513e95 100644
--- a/src/java/org/apache/cassandra/io/sstable/format/big/BigTableWriter.java
+++ b/src/java/org/apache/cassandra/io/sstable/format/big/BigTableWriter.java
@@ -68,13 +68,14 @@ public class BigTableWriter extends SSTableWriter
   long keyCount,
   long repairedAt,
   UUID pendingRepair,
+  boolean isTransient,
   TableMetadataRef metadata,
   MetadataCollector metadataCollector, 
   SerializationHeader header,
   Collection observers,
   LifecycleTransaction txn)
 {
-super(descriptor, keyCount, repairedAt, pendingRepair, metadata, metadataCollector, header, observers);
+super(descriptor, keyCount, repairedAt, 

[12/18] cassandra git commit: Transient Replication and Cheap Quorums

2018-08-31 Thread aweisberg
http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/src/java/org/apache/cassandra/locator/SimpleStrategy.java
--
diff --git a/src/java/org/apache/cassandra/locator/SimpleStrategy.java 
b/src/java/org/apache/cassandra/locator/SimpleStrategy.java
index 545ad28..7a000b7 100644
--- a/src/java/org/apache/cassandra/locator/SimpleStrategy.java
+++ b/src/java/org/apache/cassandra/locator/SimpleStrategy.java
@@ -21,9 +21,9 @@ import java.util.ArrayList;
 import java.util.Collections;
 import java.util.Collection;
 import java.util.Iterator;
-import java.util.List;
 import java.util.Map;
 
+import org.apache.cassandra.dht.Range;
 import org.apache.cassandra.exceptions.ConfigurationException;
 import org.apache.cassandra.dht.Token;
 
@@ -36,34 +36,41 @@ import org.apache.cassandra.dht.Token;
  */
 public class SimpleStrategy extends AbstractReplicationStrategy
 {
+private final ReplicationFactor rf;
+
 public SimpleStrategy(String keyspaceName, TokenMetadata tokenMetadata, IEndpointSnitch snitch, Map configOptions)
 {
 super(keyspaceName, tokenMetadata, snitch, configOptions);
+this.rf = ReplicationFactor.fromString(this.configOptions.get("replication_factor"));
 }
 
-public List calculateNaturalEndpoints(Token token, TokenMetadata metadata)
+public EndpointsForRange calculateNaturalReplicas(Token token, TokenMetadata metadata)
 {
-int replicas = getReplicationFactor();
-ArrayList tokens = metadata.sortedTokens();
-List endpoints = new ArrayList(replicas);
+ArrayList ring = metadata.sortedTokens();
+if (ring.isEmpty())
+return EndpointsForRange.empty(new Range<>(metadata.partitioner.getMinimumToken(), metadata.partitioner.getMinimumToken()));
+
+Token replicaEnd = TokenMetadata.firstToken(ring, token);
+Token replicaStart = metadata.getPredecessor(replicaEnd);
+Range replicaRange = new Range<>(replicaStart, replicaEnd);
+Iterator iter = TokenMetadata.ringIterator(ring, token, false);
 
-if (tokens.isEmpty())
-return endpoints;
+EndpointsForRange.Builder replicas = EndpointsForRange.builder(replicaRange, rf.allReplicas);
 
 // Add the token at the index by default
-Iterator iter = TokenMetadata.ringIterator(tokens, token, false);
-while (endpoints.size() < replicas && iter.hasNext())
+while (replicas.size() < rf.allReplicas && iter.hasNext())
 {
-InetAddressAndPort ep = metadata.getEndpoint(iter.next());
-if (!endpoints.contains(ep))
-endpoints.add(ep);
+Token tk = iter.next();
+InetAddressAndPort ep = metadata.getEndpoint(tk);
+if (!replicas.containsEndpoint(ep))
+replicas.add(new Replica(ep, replicaRange, replicas.size() < rf.fullReplicas));
 }
-return endpoints;
+return replicas.build();
 }
 
-public int getReplicationFactor()
+public ReplicationFactor getReplicationFactor()
 {
-return Integer.parseInt(this.configOptions.get("replication_factor"));
+return rf;
 }
 
 public void validateOptions() throws ConfigurationException

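In the new `calculateNaturalReplicas` above, the ring is walked clockwise from the key's token and distinct endpoints are collected until `rf.allReplicas` have been found; because `replicas.size() < rf.fullReplicas` is evaluated as each replica is added, the first `rf.fullReplicas` ring positions become full replicas and the remainder transient. A simplified sketch of that selection loop, with plain strings for endpoints and a cut-down `Replica` stand-in:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the replica-selection walk in SimpleStrategy after the diff:
// take distinct endpoints clockwise from the key's ring position until allReplicas
// are found; the first fullReplicas of them are full, the rest transient.
class SimpleStrategySketch
{
    record Replica(String endpoint, boolean full) {}

    static List<Replica> naturalReplicas(List<String> ringEndpoints, int startIndex, int fullReplicas, int allReplicas)
    {
        List<Replica> replicas = new ArrayList<>(allReplicas);
        for (int i = 0; replicas.size() < allReplicas && i < ringEndpoints.size(); i++)
        {
            String ep = ringEndpoints.get((startIndex + i) % ringEndpoints.size());
            // Skip endpoints already chosen (one node may own several ring positions).
            if (replicas.stream().noneMatch(r -> r.endpoint().equals(ep)))
                replicas.add(new Replica(ep, replicas.size() < fullReplicas));
        }
        return replicas;
    }

    public static void main(String[] args)
    {
        // rf "3/1": three replicas in total, of which one is transient (so two full).
        List<Replica> r = naturalReplicas(List.of("a", "b", "c", "d"), 1, 2, 3);
        System.out.println(r); // b and c full, d transient
    }
}
```

The order of the walk is what makes the split deterministic: for a given ring and token, the same endpoints are always full and the same endpoint is always the transient one.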
http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/src/java/org/apache/cassandra/locator/SystemReplicas.java
--
diff --git a/src/java/org/apache/cassandra/locator/SystemReplicas.java 
b/src/java/org/apache/cassandra/locator/SystemReplicas.java
new file mode 100644
index 000..13a9d74
--- /dev/null
+++ b/src/java/org/apache/cassandra/locator/SystemReplicas.java
@@ -0,0 +1,62 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.cassandra.locator;
+
+import java.util.ArrayList;
+import java.util.Collection;
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.ConcurrentHashMap;
+
+import org.apache.cassandra.config.DatabaseDescriptor;
+import org.apache.cassandra.dht.Range;
+import org.apache.cassandra.dht.Token;
+
+public class SystemReplicas
+{
+private static final Map 

[07/18] cassandra git commit: Transient Replication and Cheap Quorums

2018-08-31 Thread aweisberg
http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/src/java/org/apache/cassandra/service/reads/repair/NoopReadRepair.java
--
diff --git a/src/java/org/apache/cassandra/service/reads/repair/NoopReadRepair.java b/src/java/org/apache/cassandra/service/reads/repair/NoopReadRepair.java
index a43e3eb..4af4a92 100644
--- a/src/java/org/apache/cassandra/service/reads/repair/NoopReadRepair.java
+++ b/src/java/org/apache/cassandra/service/reads/repair/NoopReadRepair.java
@@ -18,7 +18,6 @@
 
 package org.apache.cassandra.service.reads.repair;
 
-import java.util.List;
 import java.util.Map;
 import java.util.function.Consumer;
 
@@ -27,24 +26,28 @@ import org.apache.cassandra.db.Mutation;
 import org.apache.cassandra.db.partitions.PartitionIterator;
 import org.apache.cassandra.db.partitions.UnfilteredPartitionIterators;
 import org.apache.cassandra.exceptions.ReadTimeoutException;
-import org.apache.cassandra.locator.InetAddressAndPort;
+import org.apache.cassandra.locator.ReplicaLayout;
+import org.apache.cassandra.locator.Endpoints;
+import org.apache.cassandra.locator.Replica;
 import org.apache.cassandra.service.reads.DigestResolver;
 
 /**
  * Bypasses the read repair path for short read protection and testing
  */
-public class NoopReadRepair implements ReadRepair
+public class NoopReadRepair, L extends ReplicaLayout> implements ReadRepair
 {
 public static final NoopReadRepair instance = new NoopReadRepair();
 
 private NoopReadRepair() {}
 
-public UnfilteredPartitionIterators.MergeListener getMergeListener(InetAddressAndPort[] endpoints)
+@Override
+public UnfilteredPartitionIterators.MergeListener getMergeListener(L replicas)
 {
 return UnfilteredPartitionIterators.MergeListener.NOOP;
 }
 
-public void startRepair(DigestResolver digestResolver, List allEndpoints, List contactedEndpoints, Consumer resultConsumer)
+@Override
+public void startRepair(DigestResolver digestResolver, Consumer resultConsumer)
 {
 resultConsumer.accept(digestResolver.getData());
 }
@@ -72,7 +75,7 @@ public class NoopReadRepair implements ReadRepair
 }
 
 @Override
-public void repairPartition(DecoratedKey key, Map mutations, InetAddressAndPort[] destinations)
+public void repairPartition(DecoratedKey partitionKey, Map mutations, L replicaLayout)
 {
 
 }
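NoopReadRepair is a null-object implementation: one shared stateless instance satisfies the ReadRepair interface while `startRepair` simply hands the digest resolver's data back to the consumer and the repair hooks do nothing — which is the behavior wanted when mismatches involving transient replicas must not trigger read repair. A tiny sketch of the pattern; the interface and names here are simplified stand-ins, not Cassandra's API:

```java
import java.util.function.Consumer;

// Illustrative null-object sketch in the spirit of NoopReadRepair: the no-op
// implementation fulfils the contract by passing the resolved data through
// untouched instead of scheduling any repair work.
class NoopSketch
{
    interface Repair
    {
        void startRepair(String resolvedData, Consumer<String> resultConsumer);
    }

    // A single shared instance suffices because the implementation holds no state.
    static final Repair NOOP = (resolvedData, resultConsumer) -> resultConsumer.accept(resolvedData);

    public static void main(String[] args)
    {
        StringBuilder out = new StringBuilder();
        NOOP.startRepair("row", out::append);
        System.out.println(out); // row
    }
}
```

Callers on the read path can then be written against the interface alone, with no `if (repairEnabled)` branching at each call site.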

http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/src/java/org/apache/cassandra/service/reads/repair/PartitionIteratorMergeListener.java
--
diff --git a/src/java/org/apache/cassandra/service/reads/repair/PartitionIteratorMergeListener.java b/src/java/org/apache/cassandra/service/reads/repair/PartitionIteratorMergeListener.java
index 6cf761a..4cae3ae 100644
--- a/src/java/org/apache/cassandra/service/reads/repair/PartitionIteratorMergeListener.java
+++ b/src/java/org/apache/cassandra/service/reads/repair/PartitionIteratorMergeListener.java
@@ -28,18 +28,18 @@ import org.apache.cassandra.db.RegularAndStaticColumns;
 import org.apache.cassandra.db.partitions.UnfilteredPartitionIterators;
 import org.apache.cassandra.db.rows.UnfilteredRowIterator;
 import org.apache.cassandra.db.rows.UnfilteredRowIterators;
-import org.apache.cassandra.locator.InetAddressAndPort;
+import org.apache.cassandra.locator.ReplicaLayout;
 
 public class PartitionIteratorMergeListener implements UnfilteredPartitionIterators.MergeListener
 {
-private final InetAddressAndPort[] sources;
+private final ReplicaLayout replicaLayout;
 private final ReadCommand command;
 private final ConsistencyLevel consistency;
 private final ReadRepair readRepair;
 
-public PartitionIteratorMergeListener(InetAddressAndPort[] sources, ReadCommand command, ConsistencyLevel consistency, ReadRepair readRepair)
+public PartitionIteratorMergeListener(ReplicaLayout replicaLayout, ReadCommand command, ConsistencyLevel consistency, ReadRepair readRepair)
 {
-this.sources = sources;
+this.replicaLayout = replicaLayout;
 this.command = command;
 this.consistency = consistency;
 this.readRepair = readRepair;
@@ -47,10 +47,10 @@ public class PartitionIteratorMergeListener implements UnfilteredPartitionIterat
 
 public UnfilteredRowIterators.MergeListener getRowMergeListener(DecoratedKey partitionKey, List versions)
 {
-return new RowIteratorMergeListener(partitionKey, columns(versions), isReversed(versions), sources, command, consistency, readRepair);
+return new RowIteratorMergeListener(partitionKey, columns(versions), isReversed(versions), replicaLayout, command, consistency, readRepair);
 }
 
-private RegularAndStaticColumns columns(List versions)
+protected RegularAndStaticColumns columns(List versions)
 {
 Columns statics = Columns.NONE;
 

[06/18] cassandra git commit: Transient Replication and Cheap Quorums

2018-08-31 Thread aweisberg
http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/test/unit/org/apache/cassandra/db/CleanupTest.java
--
diff --git a/test/unit/org/apache/cassandra/db/CleanupTest.java b/test/unit/org/apache/cassandra/db/CleanupTest.java
index a096c78..46c0afd 100644
--- a/test/unit/org/apache/cassandra/db/CleanupTest.java
+++ b/test/unit/org/apache/cassandra/db/CleanupTest.java
@@ -107,7 +107,6 @@ public class CleanupTest
 SchemaLoader.compositeIndexCFMD(KEYSPACE2, CF_INDEXED2, true));
 }
 
-/*
 @Test
 public void testCleanup() throws ExecutionException, InterruptedException
 {
@@ -116,7 +115,6 @@ public class CleanupTest
 Keyspace keyspace = Keyspace.open(KEYSPACE1);
 ColumnFamilyStore cfs = keyspace.getColumnFamilyStore(CF_STANDARD1);
 
-UnfilteredPartitionIterator iter;
 
 // insert data and verify we get it back w/ range query
 fillCF(cfs, "val", LOOPS);
@@ -124,8 +122,7 @@ public class CleanupTest
 // record max timestamps of the sstables pre-cleanup
 List expectedMaxTimestamps = getMaxTimestampList(cfs);
 
-iter = Util.getRangeSlice(cfs);
-assertEquals(LOOPS, Iterators.size(iter));
+assertEquals(LOOPS, Util.getAll(Util.cmd(cfs).build()).size());
 
 // with one token in the ring, owned by the local node, cleanup should be a no-op
 CompactionManager.instance.performCleanup(cfs, 2);
@@ -134,10 +131,8 @@ public class CleanupTest
 assert expectedMaxTimestamps.equals(getMaxTimestampList(cfs));
 
 // check data is still there
-iter = Util.getRangeSlice(cfs);
-assertEquals(LOOPS, Iterators.size(iter));
+assertEquals(LOOPS, Util.getAll(Util.cmd(cfs).build()).size());
 }
-*/
 
 @Test
 public void testCleanupWithIndexes() throws IOException, ExecutionException, InterruptedException

http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/test/unit/org/apache/cassandra/db/CleanupTransientTest.java
--
diff --git a/test/unit/org/apache/cassandra/db/CleanupTransientTest.java b/test/unit/org/apache/cassandra/db/CleanupTransientTest.java
new file mode 100644
index 000..9789183
--- /dev/null
+++ b/test/unit/org/apache/cassandra/db/CleanupTransientTest.java
@@ -0,0 +1,195 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.cassandra.db;
+
+import java.nio.ByteBuffer;
+import java.util.ArrayList;
+import java.util.LinkedList;
+import java.util.List;
+import java.util.UUID;
+
+import org.apache.cassandra.locator.RangesAtEndpoint;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import org.apache.cassandra.SchemaLoader;
+import org.apache.cassandra.Util;
+import org.apache.cassandra.config.DatabaseDescriptor;
+import org.apache.cassandra.db.compaction.CompactionManager;
+import org.apache.cassandra.db.partitions.FilteredPartition;
+import org.apache.cassandra.dht.IPartitioner;
+import org.apache.cassandra.dht.RandomPartitioner;
+import org.apache.cassandra.dht.Token;
+import org.apache.cassandra.io.sstable.format.SSTableReader;
+import org.apache.cassandra.locator.AbstractNetworkTopologySnitch;
+import org.apache.cassandra.locator.InetAddressAndPort;
+import org.apache.cassandra.locator.Replica;
+import org.apache.cassandra.locator.TokenMetadata;
+import org.apache.cassandra.schema.KeyspaceParams;
+import org.apache.cassandra.service.PendingRangeCalculatorService;
+import org.apache.cassandra.service.StorageService;
+import org.apache.cassandra.utils.ByteBufferUtil;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+public class CleanupTransientTest
+{
+private static final IPartitioner partitioner = RandomPartitioner.instance;
+private static IPartitioner oldPartitioner;
+
+public static final int LOOPS = 200;
+public static final String KEYSPACE1 = "CleanupTest1";
+public static final String CF_INDEXED1 = "Indexed1";
+public static final String CF_STANDARD1 = "Standard1";
+
+public static final String KEYSPACE2 

[10/18] cassandra git commit: Transient Replication and Cheap Quorums

2018-08-31 Thread aweisberg
http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/src/java/org/apache/cassandra/service/StorageProxy.java
--
diff --git a/src/java/org/apache/cassandra/service/StorageProxy.java b/src/java/org/apache/cassandra/service/StorageProxy.java
index ed0cafc..c23eb88 100644
--- a/src/java/org/apache/cassandra/service/StorageProxy.java
+++ b/src/java/org/apache/cassandra/service/StorageProxy.java
@@ -29,7 +29,6 @@ import javax.management.MBeanServer;
 import javax.management.ObjectName;
 
 import com.google.common.base.Preconditions;
-import com.google.common.base.Predicate;
 import com.google.common.cache.CacheLoader;
 import com.google.common.collect.*;
 import com.google.common.primitives.Ints;
@@ -133,18 +132,10 @@ public class StorageProxy implements StorageProxyMBean
 HintsService.instance.registerMBean();
 HintedHandOffManager.instance.registerMBean();
 
-standardWritePerformer = new WritePerformer()
+standardWritePerformer = (mutation, targets, responseHandler, localDataCenter) ->
 {
-public void apply(IMutation mutation,
-  Iterable targets,
-  AbstractWriteResponseHandler responseHandler,
-  String localDataCenter,
-  ConsistencyLevel consistency_level)
-throws OverloadedException
-{
-assert mutation instanceof Mutation;
-sendToHintedEndpoints((Mutation) mutation, targets, responseHandler, localDataCenter, Stage.MUTATION);
-}
+assert mutation instanceof Mutation;
+sendToHintedReplicas((Mutation) mutation, targets.selected(), responseHandler, localDataCenter, Stage.MUTATION);
 };
 
 /*
@@ -153,29 +144,19 @@ public class StorageProxy implements StorageProxyMBean
 * but on the latter case, the verb handler already run on the COUNTER_MUTATION stage, so we must not execute the
 * underlying on the stage otherwise we risk a deadlock. Hence two different performer.
 */
-counterWritePerformer = new WritePerformer()
+counterWritePerformer = (mutation, targets, responseHandler, localDataCenter) ->
 {
-public void apply(IMutation mutation,
-  Iterable targets,
-  AbstractWriteResponseHandler responseHandler,
-  String localDataCenter,
-  ConsistencyLevel consistencyLevel)
-{
-counterWriteTask(mutation, targets, responseHandler, localDataCenter).run();
-}
+EndpointsForToken selected = targets.selected().withoutSelf();
+Replicas.temporaryAssertFull(selected); // TODO CASSANDRA-14548
+counterWriteTask(mutation, selected, responseHandler, localDataCenter).run();
 };
 
-counterWriteOnCoordinatorPerformer = new WritePerformer()
+counterWriteOnCoordinatorPerformer = (mutation, targets, responseHandler, localDataCenter) ->
 {
-public void apply(IMutation mutation,
-  Iterable targets,
-  AbstractWriteResponseHandler responseHandler,
-  String localDataCenter,
-  ConsistencyLevel consistencyLevel)
-{
-StageManager.getStage(Stage.COUNTER_MUTATION)
-.execute(counterWriteTask(mutation, targets, responseHandler, localDataCenter));
-}
+EndpointsForToken selected = targets.selected().withoutSelf();
+Replicas.temporaryAssertFull(selected); // TODO CASSANDRA-14548
+StageManager.getStage(Stage.COUNTER_MUTATION)
+.execute(counterWriteTask(mutation, selected, responseHandler, localDataCenter));
 };
 
 for(ConsistencyLevel level : ConsistencyLevel.values())
@@ -251,11 +232,9 @@ public class StorageProxy implements StorageProxyMBean
 while (System.nanoTime() - queryStartNanoTime < timeout)
 {
 // for simplicity, we'll do a single liveness check at the start of each attempt
-PaxosParticipants p = getPaxosParticipants(metadata, key, consistencyForPaxos);
-List liveEndpoints = p.liveEndpoints;
-int requiredParticipants = p.participants;
+ReplicaLayout.ForPaxos replicaLayout = ReplicaLayout.forPaxos(Keyspace.open(keyspaceName), key, consistencyForPaxos);
 
-final PaxosBallotAndContention pair = beginAndRepairPaxos(queryStartNanoTime, key, metadata, liveEndpoints, requiredParticipants, consistencyForPaxos, consistencyForCommit, true, state);
+final PaxosBallotAndContention pair = 
[02/18] cassandra git commit: Transient Replication and Cheap Quorums

2018-08-31 Thread aweisberg
http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/test/unit/org/apache/cassandra/service/StorageServiceTest.java
--
diff --git a/test/unit/org/apache/cassandra/service/StorageServiceTest.java b/test/unit/org/apache/cassandra/service/StorageServiceTest.java
new file mode 100644
index 000..9d5c324
--- /dev/null
+++ b/test/unit/org/apache/cassandra/service/StorageServiceTest.java
@@ -0,0 +1,148 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.cassandra.service;
+
+import org.apache.cassandra.locator.EndpointsByReplica;
+import org.apache.cassandra.locator.ReplicaCollection;
+import org.junit.Before;
+import org.junit.BeforeClass;
+import org.junit.Test;
+
+import org.apache.cassandra.config.DatabaseDescriptor;
+import org.apache.cassandra.dht.RandomPartitioner;
+import org.apache.cassandra.dht.Range;
+import org.apache.cassandra.dht.Token;
+import org.apache.cassandra.locator.AbstractEndpointSnitch;
+import org.apache.cassandra.locator.AbstractReplicationStrategy;
+import org.apache.cassandra.locator.IEndpointSnitch;
+import org.apache.cassandra.locator.InetAddressAndPort;
+import org.apache.cassandra.locator.Replica;
+import org.apache.cassandra.locator.ReplicaMultimap;
+import org.apache.cassandra.locator.SimpleStrategy;
+import org.apache.cassandra.locator.TokenMetadata;
+
+import static org.junit.Assert.assertEquals;
+
+public class StorageServiceTest
+{
+static InetAddressAndPort aAddress;
+static InetAddressAndPort bAddress;
+static InetAddressAndPort cAddress;
+static InetAddressAndPort dAddress;
+static InetAddressAndPort eAddress;
+
+@BeforeClass
+public static void setUpClass() throws Exception
+{
+aAddress = InetAddressAndPort.getByName("127.0.0.1");
+bAddress = InetAddressAndPort.getByName("127.0.0.2");
+cAddress = InetAddressAndPort.getByName("127.0.0.3");
+dAddress = InetAddressAndPort.getByName("127.0.0.4");
+eAddress = InetAddressAndPort.getByName("127.0.0.5");
+}
+
+private static final Token threeToken = new RandomPartitioner.BigIntegerToken("3");
+private static final Token sixToken = new RandomPartitioner.BigIntegerToken("6");
+private static final Token nineToken = new RandomPartitioner.BigIntegerToken("9");
+private static final Token elevenToken = new RandomPartitioner.BigIntegerToken("11");
+private static final Token oneToken = new RandomPartitioner.BigIntegerToken("1");
+
+Range<Token> aRange = new Range<>(oneToken, threeToken);
+Range<Token> bRange = new Range<>(threeToken, sixToken);
+Range<Token> cRange = new Range<>(sixToken, nineToken);
+Range<Token> dRange = new Range<>(nineToken, elevenToken);
+Range<Token> eRange = new Range<>(elevenToken, oneToken);
+
+@Before
+public void setUp()
+{
+DatabaseDescriptor.daemonInitialization();
+DatabaseDescriptor.setTransientReplicationEnabledUnsafe(true);
+IEndpointSnitch snitch = new AbstractEndpointSnitch()
+{
+public int compareEndpoints(InetAddressAndPort target, Replica r1, Replica r2)
+{
+return 0;
+}
+
+public String getRack(InetAddressAndPort endpoint)
+{
+return "R1";
+}
+
+public String getDatacenter(InetAddressAndPort endpoint)
+{
+return "DC1";
+}
+};
+
+DatabaseDescriptor.setEndpointSnitch(snitch);
+}
+
+private AbstractReplicationStrategy simpleStrategy(TokenMetadata tmd)
+{
+return new SimpleStrategy("MoveTransientTest",
+  tmd,
+  DatabaseDescriptor.getEndpointSnitch(),
+  com.google.common.collect.ImmutableMap.of("replication_factor", "3/1"));
+}
+
+public static <K, C extends ReplicaCollection<?>> void assertMultimapEqualsIgnoreOrder(ReplicaMultimap<K, C> a, ReplicaMultimap<K, C> b)
+{
+if (!a.keySet().equals(b.keySet()))
+assertEquals(a, b);
+for (K key : a.keySet())
+{
+C ac = a.get(key);
+C bc = b.get(key);
+   

[18/18] cassandra git commit: Transient Replication and Cheap Quorums

2018-08-31 Thread aweisberg
Transient Replication and Cheap Quorums

Patch by Blake Eggleston, Benedict Elliott Smith, Marcus Eriksson, Alex Petrov, 
Ariel Weisberg; Reviewed by Blake Eggleston, Marcus Eriksson, Benedict Elliott 
Smith, Alex Petrov, Ariel Weisberg for CASSANDRA-14404

Co-authored-by: Blake Eggleston 
Co-authored-by: Benedict Elliott Smith 
Co-authored-by: Marcus Eriksson 
Co-authored-by: Alex Petrov 


Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo
Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/f7431b43
Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/f7431b43
Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/f7431b43

Branch: refs/heads/trunk
Commit: f7431b432875e334170ccdb19934d05545d2cebd
Parents: 5b645de
Author: Ariel Weisberg 
Authored: Thu Jul 5 18:10:40 2018 -0400
Committer: Ariel Weisberg 
Committed: Fri Aug 31 21:34:22 2018 -0400

--
 CHANGES.txt |   1 +
 NEWS.txt|   4 +
 conf/cassandra.yaml |   4 +
 doc/source/architecture/dynamo.rst  |  29 +
 doc/source/cql/ddl.rst  |  14 +-
 ...iver-internal-only-3.12.0.post0-5838e2fd.zip | Bin 0 -> 269418 bytes
 pylib/cqlshlib/cql3handling.py  |   1 +
 pylib/cqlshlib/cqlshhandling.py |   1 +
 pylib/cqlshlib/test/test_cqlsh_completion.py|   6 +-
 pylib/cqlshlib/test/test_cqlsh_output.py|   3 +-
 .../cassandra/batchlog/BatchlogManager.java |  45 +-
 .../org/apache/cassandra/config/Config.java |   2 +
 .../cassandra/config/DatabaseDescriptor.java|  35 +-
 .../apache/cassandra/cql3/QueryProcessor.java   |  13 +-
 .../cql3/statements/BatchStatement.java |   4 +-
 .../cql3/statements/BatchUpdatesCollector.java  |   2 +-
 .../cql3/statements/ModificationStatement.java  |   4 +-
 .../statements/SingleTableUpdatesCollector.java |   2 +-
 .../cql3/statements/UpdatesCollector.java   |   5 +-
 .../schema/AlterKeyspaceStatement.java  |  86 +-
 .../statements/schema/AlterTableStatement.java  |   7 +
 .../statements/schema/CreateIndexStatement.java |   5 +
 .../statements/schema/CreateTableStatement.java |   9 +
 .../statements/schema/CreateViewStatement.java  |   5 +
 .../cql3/statements/schema/TableAttributes.java |   3 +
 .../apache/cassandra/db/ColumnFamilyStore.java  |  24 +-
 .../apache/cassandra/db/ConsistencyLevel.java   | 211 +++--
 .../cassandra/db/DiskBoundaryManager.java   |  39 +-
 src/java/org/apache/cassandra/db/Memtable.java  |   1 +
 .../cassandra/db/MutationVerbHandler.java   |   5 +-
 .../cassandra/db/PartitionRangeReadCommand.java |  28 +-
 .../org/apache/cassandra/db/ReadCommand.java|  33 +-
 .../apache/cassandra/db/SSTableImporter.java|   6 +-
 .../db/SinglePartitionReadCommand.java  |  26 +-
 .../org/apache/cassandra/db/SystemKeyspace.java |  98 ++-
 .../cassandra/db/SystemKeyspaceMigrator40.java  |  45 +
 .../org/apache/cassandra/db/TableCQLHelper.java |   1 +
 .../compaction/AbstractCompactionStrategy.java  |   3 +-
 .../db/compaction/AbstractStrategyHolder.java   |   7 +-
 .../db/compaction/CompactionManager.java| 295 ---
 .../db/compaction/CompactionStrategyHolder.java |  34 +-
 .../compaction/CompactionStrategyManager.java   | 108 +--
 .../cassandra/db/compaction/CompactionTask.java |  26 +-
 .../db/compaction/PendingRepairHolder.java  |  42 +-
 .../db/compaction/PendingRepairManager.java |  45 +-
 .../cassandra/db/compaction/Scrubber.java   |   4 +-
 .../cassandra/db/compaction/Upgrader.java   |  10 +-
 .../cassandra/db/compaction/Verifier.java   |   3 +-
 .../writers/CompactionAwareWriter.java  |   2 +
 .../writers/DefaultCompactionWriter.java|   1 +
 .../writers/MajorLeveledCompactionWriter.java   |   1 +
 .../writers/MaxSSTableSizeWriter.java   |   1 +
 .../SplittingSizeTieredCompactionWriter.java|   1 +
 .../db/partitions/PartitionIterators.java   |  12 -
 .../repair/CassandraKeyspaceRepairManager.java  |  10 +-
 .../db/repair/PendingAntiCompaction.java|  22 +-
 .../db/streaming/CassandraOutgoingFile.java |  11 +-
 .../db/streaming/CassandraStreamManager.java|  36 +-
 .../db/streaming/CassandraStreamReader.java |   2 +-
 .../apache/cassandra/db/view/TableViews.java|   5 +
 .../apache/cassandra/db/view/ViewBuilder.java   |  19 +-
 .../apache/cassandra/db/view/ViewManager.java   |   2 +-
 .../org/apache/cassandra/db/view/ViewUtils.java |  64 +-
 src/java/org/apache/cassandra/dht/Range.java|  27 +-
 .../cassandra/dht/RangeFetchMapCalculator.java  |  58 +-
 .../org/apache/cassandra/dht/RangeStreamer.java | 571 
 src/java/org/apache/cassandra/dht/Splitter.java |  95 +-
 .../apache/cassandra/dht/StreamStateStore.java  |  25 +-
 .../ReplicationAwareTokenAllocator.java |   2 +-
 

[05/18] cassandra git commit: Transient Replication and Cheap Quorums

2018-08-31 Thread aweisberg
http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/test/unit/org/apache/cassandra/db/repair/PendingAntiCompactionTest.java
--
diff --git a/test/unit/org/apache/cassandra/db/repair/PendingAntiCompactionTest.java b/test/unit/org/apache/cassandra/db/repair/PendingAntiCompactionTest.java
index 447d504..374a760 100644
--- a/test/unit/org/apache/cassandra/db/repair/PendingAntiCompactionTest.java
+++ b/test/unit/org/apache/cassandra/db/repair/PendingAntiCompactionTest.java
@@ -18,6 +18,7 @@
 
 package org.apache.cassandra.db.repair;
 
+import java.net.UnknownHostException;
 import java.util.ArrayList;
 import java.util.Collection;
 import java.util.Collections;
@@ -42,6 +43,8 @@ import org.slf4j.LoggerFactory;
 import org.apache.cassandra.SchemaLoader;
 import org.apache.cassandra.config.DatabaseDescriptor;
 import org.apache.cassandra.db.compaction.CompactionManager;
+import org.apache.cassandra.locator.RangesAtEndpoint;
+import org.apache.cassandra.locator.Replica;
 import org.apache.cassandra.schema.TableId;
 import org.apache.cassandra.schema.TableMetadata;
 import org.apache.cassandra.schema.Schema;
@@ -64,6 +67,9 @@ public class PendingAntiCompactionTest
 {
 private static final Logger logger = LoggerFactory.getLogger(PendingAntiCompactionTest.class);
 private static final Collection<Range<Token>> FULL_RANGE;
+private static final Collection<Range<Token>> NO_RANGES = Collections.emptyList();
+private static InetAddressAndPort local;
+
 static
 {
 DatabaseDescriptor.daemonInitialization();
@@ -77,9 +83,10 @@ public class PendingAntiCompactionTest
 private ColumnFamilyStore cfs;
 
 @BeforeClass
-public static void setupClass()
+public static void setupClass() throws Throwable
 {
 SchemaLoader.prepareServer();
+local = InetAddressAndPort.getByName("127.0.0.1");
 }
 
 @Before
@@ -89,6 +96,7 @@ public class PendingAntiCompactionTest
 cfm = CreateTableStatement.parse(String.format("CREATE TABLE %s.%s (k INT PRIMARY KEY, v INT)", ks, tbl), ks).build();
 SchemaLoader.createKeyspace(ks, KeyspaceParams.simple(1), cfm);
 cfs = Schema.instance.getColumnFamilyStoreInstance(cfm.id);
+
 }
 
 private void makeSSTables(int num)
@@ -105,7 +113,7 @@ public class PendingAntiCompactionTest
 
 private static class InstrumentedAcquisitionCallback extends PendingAntiCompaction.AcquisitionCallback
 {
-public InstrumentedAcquisitionCallback(UUID parentRepairSession, Collection<Range<Token>> ranges)
+public InstrumentedAcquisitionCallback(UUID parentRepairSession, RangesAtEndpoint ranges)
 {
 super(parentRepairSession, ranges);
 }
@@ -155,7 +163,7 @@ public class PendingAntiCompactionTest
 ExecutorService executor = Executors.newSingleThreadExecutor();
 try
 {
-pac = new PendingAntiCompaction(sessionID, tables, ranges, executor);
+pac = new PendingAntiCompaction(sessionID, tables, atEndpoint(ranges, NO_RANGES), executor);
 pac.run().get();
 }
 finally
@@ -217,7 +225,7 @@ public class PendingAntiCompactionTest
 Assert.assertTrue(repaired.intersects(FULL_RANGE));
 Assert.assertTrue(unrepaired.intersects(FULL_RANGE));
 
-repaired.descriptor.getMetadataSerializer().mutateRepaired(repaired.descriptor, 1, null);
+repaired.descriptor.getMetadataSerializer().mutateRepairMetadata(repaired.descriptor, 1, null, false);
 repaired.reloadSSTableMetadata();
 
 PendingAntiCompaction.AcquisitionCallable acquisitionCallable = new PendingAntiCompaction.AcquisitionCallable(cfs, FULL_RANGE, UUIDGen.getTimeUUID());
@@ -243,7 +251,7 @@ public class PendingAntiCompactionTest
 Assert.assertTrue(repaired.intersects(FULL_RANGE));
 Assert.assertTrue(unrepaired.intersects(FULL_RANGE));
 
-repaired.descriptor.getMetadataSerializer().mutateRepaired(repaired.descriptor, 0, UUIDGen.getTimeUUID());
+repaired.descriptor.getMetadataSerializer().mutateRepairMetadata(repaired.descriptor, 0, UUIDGen.getTimeUUID(), false);
 repaired.reloadSSTableMetadata();
 Assert.assertTrue(repaired.isPendingRepair());
 
@@ -284,7 +292,7 @@ public class PendingAntiCompactionTest
 PendingAntiCompaction.AcquireResult result = acquisitionCallable.call();
 Assert.assertNotNull(result);
 
-InstrumentedAcquisitionCallback cb = new InstrumentedAcquisitionCallback(UUIDGen.getTimeUUID(), FULL_RANGE);
+InstrumentedAcquisitionCallback cb = new InstrumentedAcquisitionCallback(UUIDGen.getTimeUUID(), atEndpoint(FULL_RANGE, NO_RANGES));
 Assert.assertTrue(cb.submittedCompactions.isEmpty());
 cb.apply(Lists.newArrayList(result));
 
@@ -308,7 +316,7 @@ public class PendingAntiCompactionTest
 Assert.assertNotNull(result);
 

[09/18] cassandra git commit: Transient Replication and Cheap Quorums

2018-08-31 Thread aweisberg
http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/src/java/org/apache/cassandra/service/StorageService.java
--
diff --git a/src/java/org/apache/cassandra/service/StorageService.java b/src/java/org/apache/cassandra/service/StorageService.java
index 9467c9a..7f4ae14 100644
--- a/src/java/org/apache/cassandra/service/StorageService.java
+++ b/src/java/org/apache/cassandra/service/StorageService.java
@@ -29,6 +29,7 @@ import java.util.concurrent.atomic.AtomicBoolean;
 import java.util.concurrent.atomic.AtomicInteger;
 import java.util.regex.MatchResult;
 import java.util.regex.Pattern;
+import java.util.stream.Collectors;
 import java.util.stream.StreamSupport;
 
 import javax.annotation.Nullable;
@@ -41,9 +42,12 @@ import javax.management.openmbean.TabularDataSupport;
 import com.google.common.annotations.VisibleForTesting;
 import com.google.common.base.Preconditions;
 import com.google.common.base.Predicate;
+import com.google.common.base.Predicates;
 import com.google.common.collect.*;
 import com.google.common.util.concurrent.*;
 
+import org.apache.cassandra.dht.RangeStreamer.FetchReplica;
+import org.apache.cassandra.locator.ReplicaCollection.Mutable.Conflict;
 import org.apache.commons.lang3.StringUtils;
 
 import org.slf4j.Logger;
@@ -110,6 +114,8 @@ import 
org.apache.cassandra.utils.progress.ProgressEventType;
 import org.apache.cassandra.utils.progress.jmx.JMXBroadcastExecutor;
 import org.apache.cassandra.utils.progress.jmx.JMXProgressSupport;
 
+import static com.google.common.collect.Iterables.transform;
+import static com.google.common.collect.Iterables.tryFind;
 import static java.util.Arrays.asList;
 import static java.util.stream.Collectors.toList;
 import static org.apache.cassandra.index.SecondaryIndexManager.getIndexName;
@@ -164,9 +170,9 @@ public class StorageService extends 
NotificationBroadcasterSupport implements IE
 return isShutdown;
 }
 
-public Collection<Range<Token>> getLocalRanges(String keyspaceName)
+public RangesAtEndpoint getLocalReplicas(String keyspaceName)
 {
-return getRangesForEndpoint(keyspaceName, FBUtilities.getBroadcastAddressAndPort());
+return getReplicasForEndpoint(keyspaceName, FBUtilities.getBroadcastAddressAndPort());
 }
 
 public List<Range<Token>> getLocalAndPendingRanges(String ks)
@@ -174,9 +180,11 @@ public class StorageService extends 
NotificationBroadcasterSupport implements IE
 InetAddressAndPort broadcastAddress = FBUtilities.getBroadcastAddressAndPort();
 Keyspace keyspace = Keyspace.open(ks);
 List<Range<Token>> ranges = new ArrayList<>();
-ranges.addAll(keyspace.getReplicationStrategy().getAddressRanges().get(broadcastAddress));
-ranges.addAll(getTokenMetadata().getPendingRanges(ks, broadcastAddress));
-return Range.normalize(ranges);
+for (Replica r : keyspace.getReplicationStrategy().getAddressReplicas(broadcastAddress))
+ranges.add(r.range());
+for (Replica r : getTokenMetadata().getPendingRanges(ks, broadcastAddress))
+ranges.add(r.range());
+return ranges;
 }
 
 public Collection<Range<Token>> getPrimaryRanges(String keyspace)
@@ -1225,11 +1233,11 @@ public class StorageService extends 
NotificationBroadcasterSupport implements IE
 if (keyspace == null)
 {
 for (String keyspaceName : Schema.instance.getNonLocalStrategyKeyspaces())
-streamer.addRanges(keyspaceName, getLocalRanges(keyspaceName));
+streamer.addRanges(keyspaceName, getLocalReplicas(keyspaceName));
 }
 else if (tokens == null)
 {
-streamer.addRanges(keyspace, getLocalRanges(keyspace));
+streamer.addRanges(keyspace, getLocalReplicas(keyspace));
 }
 else
 {
@@ -1251,14 +1259,16 @@ public class StorageService extends 
NotificationBroadcasterSupport implements IE
 }
 
 // Ensure all specified ranges are actually ranges owned by this host
-Collection<Range<Token>> localRanges = getLocalRanges(keyspace);
+RangesAtEndpoint localReplicas = getLocalReplicas(keyspace);
+RangesAtEndpoint.Builder streamRanges = new RangesAtEndpoint.Builder(FBUtilities.getBroadcastAddressAndPort(), ranges.size());
 for (Range<Token> specifiedRange : ranges)
 {
 boolean foundParentRange = false;
-for (Range<Token> localRange : localRanges)
+for (Replica localReplica : localReplicas)
 {
-if (localRange.contains(specifiedRange))
+if (localReplica.contains(specifiedRange))
 {
+streamRanges.add(localReplica.decorateSubrange(specifiedRange));
 

[01/18] cassandra git commit: Transient Replication and Cheap Quorums

2018-08-31 Thread aweisberg
Repository: cassandra
Updated Branches:
  refs/heads/trunk 5b645de13 -> f7431b432


http://git-wip-us.apache.org/repos/asf/cassandra/blob/f7431b43/test/unit/org/apache/cassandra/service/reads/DataResolverTransientTest.java
--
diff --git a/test/unit/org/apache/cassandra/service/reads/DataResolverTransientTest.java b/test/unit/org/apache/cassandra/service/reads/DataResolverTransientTest.java
new file mode 100644
index 000..8119400
--- /dev/null
+++ b/test/unit/org/apache/cassandra/service/reads/DataResolverTransientTest.java
@@ -0,0 +1,226 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.cassandra.service.reads;
+
+import java.util.concurrent.TimeUnit;
+
+import com.google.common.primitives.Ints;
+
+import org.apache.cassandra.Util;
+import org.apache.cassandra.db.DecoratedKey;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+
+import org.apache.cassandra.db.Clustering;
+import org.apache.cassandra.db.ConsistencyLevel;
+import org.apache.cassandra.db.DeletionTime;
+import org.apache.cassandra.db.EmptyIterators;
+import org.apache.cassandra.db.RangeTombstone;
+import org.apache.cassandra.db.SimpleBuilders;
+import org.apache.cassandra.db.SinglePartitionReadCommand;
+import org.apache.cassandra.db.Slice;
+import org.apache.cassandra.db.partitions.PartitionUpdate;
+import org.apache.cassandra.db.rows.BTreeRow;
+import org.apache.cassandra.db.rows.Row;
+import org.apache.cassandra.locator.EndpointsForToken;
+import org.apache.cassandra.locator.ReplicaLayout;
+import org.apache.cassandra.schema.TableMetadata;
+import org.apache.cassandra.service.reads.repair.TestableReadRepair;
+import org.apache.cassandra.utils.ByteBufferUtil;
+
+import static org.apache.cassandra.db.ConsistencyLevel.QUORUM;
+import static org.apache.cassandra.locator.Replica.fullReplica;
+import static org.apache.cassandra.locator.Replica.transientReplica;
+import static org.apache.cassandra.locator.ReplicaUtils.full;
+import static org.apache.cassandra.locator.ReplicaUtils.trans;
+
+/**
+ * Tests DataResolver's handling of transient replicas
+ */
+public class DataResolverTransientTest extends AbstractReadResponseTest
+{
+private static DecoratedKey key;
+
+@Before
+public void setUp()
+{
+key = Util.dk("key1");
+}
+
+private static PartitionUpdate.Builder update(TableMetadata metadata, String key, Row... rows)
+{
+PartitionUpdate.Builder builder = new PartitionUpdate.Builder(metadata, dk(key), metadata.regularAndStaticColumns(), rows.length, false);
+for (Row row: rows)
+{
+builder.add(row);
+}
+return builder;
+}
+
+private static PartitionUpdate.Builder update(Row... rows)
+{
+return update(cfm, "key1", rows);
+}
+
+private static Row.SimpleBuilder rowBuilder(int clustering)
+{
+return new SimpleBuilders.RowBuilder(cfm, Integer.toString(clustering));
+}
+
+private static Row row(long timestamp, int clustering, int value)
+{
+return rowBuilder(clustering).timestamp(timestamp).add("c1", Integer.toString(value)).build();
+}
+
+private static DeletionTime deletion(long timeMillis)
+{
+TimeUnit MILLIS = TimeUnit.MILLISECONDS;
+return new DeletionTime(MILLIS.toMicros(timeMillis), Ints.checkedCast(MILLIS.toSeconds(timeMillis)));
+}
+
+/**
+ * Tests that the given update doesn't cause data resolver to attempt to 
repair a transient replica
+ */
+private void assertNoTransientRepairs(PartitionUpdate update)
+{
+SinglePartitionReadCommand command = SinglePartitionReadCommand.fullPartitionRead(update.metadata(), nowInSec, key);
+EndpointsForToken targetReplicas = EndpointsForToken.of(key.getToken(), full(EP1), full(EP2), trans(EP3));
+TestableReadRepair repair = new TestableReadRepair(command, QUORUM);
+DataResolver resolver = new DataResolver(command, plan(targetReplicas, ConsistencyLevel.QUORUM), repair, 0);
+
+Assert.assertFalse(resolver.isDataPresent());
+resolver.preprocess(response(command, EP1, iter(update), 

[jira] [Comment Edited] (CASSANDRA-14145) Detecting data resurrection during read

2018-08-31 Thread Jordan West (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599444#comment-16599444
 ] 

Jordan West edited comment on CASSANDRA-14145 at 9/1/18 1:16 AM:
-

Thanks [~beobal]. The recent updates are a big improvement. The 
{{InputCollector}} changes make things much more readable. I feel like we've 
reduced the footprint and risk considerably which is making me more comfortable 
about merging this late in the game. The feature is well hidden behind a flag: 
if tracking is not enabled we add a couple bytes to the internode messaging 
protocol that are not used but otherwise the changes to the existing path are 
negligible; {{InputCollector}} probably being the biggest. I do have one 
comment below I think should be addressed before merge. The other one we can 
open an improvement JIRA for. So I will give my +1 assuming the comment 
below is addressed so that we don't miss the deadline due to tz differences. 
We'll definitely need to test this for correctness and performance further but 
that can happen after 9/1.
 * This change I think should be made before we merge: In {{InputCollector}} is 
{{repairedSSTables}} necessary? It seems like an extra upfront allocation and 
iteration that we could skip. It also widens the unconfirmed window slightly 
since the status could change between the call to the constructor and the call 
to {{addSSTableIterator}}. What about going back to the on-demand allocation of 
{{repairedIters}} and checking if it's {{null}} in {{finalizeIterators}}?
 * This can be addressed in a subsequent JIRA if you agree: In 
{{RepairedDataVerifier}}, we could more quickly report confirmed failures if we 
separated tracking of confirmed and unconfirmed digests. If the number of 
confirmed digests is > 1 we still have a confirmed issue regardless of the 
number of unconfirmed digests. We could separately increment unconfirmed in 
this case if the unconfirmed digests don't match any of the confirmed ones (if 
it does we would assume its consistent).

Minor Nits (up to you if you want to fix before merge):
 * Re: comments in cassandra.yaml, until we’ve benchmarked the changes maybe we 
shouldn’t try to characterize (“slight”) the performance impact besides to say 
it exists. Same goes for the identical comment in {{Config.java}}


was (Author: jrwest):
Thanks [~beobal]. The recent updates are a big improvement. The 
{{InputCollector}} changes make things much more readable. I feel like we've 
reduced the footprint and risk considerably which is making me more comfortable 
about merging this late in the game. The feature is well hidden behind a flag 
(if tracking is not enabled we add a couple bytes to the internode messaging 
protocol that are not used but otherwise the changes to the existing path are 
negligible; {{InputCollector}} probably being the biggest. I do have one 
comment below I think should be addressed before merge. The other one we can 
open an open an improvement JIRA for. So I will give my +1 assuming the comment 
below is addressed so that we don't miss the deadline due to tz differences. 
We'll definitely need to test this for correctness and performance further but 
that can happen after 9/1.
 * This change I think should be made before we merge: In {{InputCollector}} is 
{{repairedSSTables}} necessary? It seems like an extra upfront allocation and 
iteration that we could skip. It also widens the unconfirmed window slightly 
since the status could change between the call to the constructor and the call 
to {{addSSTableIterator}}. What about going back to the on-demand allocation of 
{{repairedIters}} and checking if its {{null}} in {{finalizeIterators}}?
 * This can be addressed in a subsequent JIRA if you agree: In 
{{RepairedDataVerifier}}, we could more quickly report confirmed failures if we 
separated tracking of confirmed and unconfirmed digests. If the number of 
confirmed digests is > 1 we still have a confirmed issue regardless of the 
number of unconfirmed digests. We could separately increment unconfirmed in 
this case if the unconfirmed digests don't match any of the confirmed ones (if 
it does we would assume its consistent).

Minor Nits (up to you if you want to fix before merge):
 * Re: comments in cassandra.yaml, until we’ve benchmarked the changes maybe we 
shouldn’t try to characterize (“slight”) the performance impact besides to say 
it exists. Same goes for the identical comment in {{Config.java}}

>  Detecting data resurrection during read
> 
>
> Key: CASSANDRA-14145
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14145
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: sankalp kohli
>Assignee: Sam Tunnicliffe
>Priority: Minor
> Fix For: 4.x
>
>
> We have 

[jira] [Comment Edited] (CASSANDRA-14145) Detecting data resurrection during read

2018-08-31 Thread Jordan West (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599444#comment-16599444
 ] 

Jordan West edited comment on CASSANDRA-14145 at 9/1/18 1:16 AM:
-

Thanks [~beobal]. The recent updates are a big improvement. The 
{{InputCollector}} changes make things much more readable. I feel like we've 
reduced the footprint and risk considerably which is making me more comfortable 
about merging this late in the game. The feature is well hidden behind a flag: 
if tracking is not enabled we add a couple bytes to the internode messaging 
protocol that are not used but otherwise the changes to the existing path are 
negligible; {{InputCollector}} probably being the biggest. I do have one 
comment below I think should be addressed before merge. The other one we can 
open an improvement JIRA for. So I will give my +1 assuming the comment below 
is addressed so that we don't miss the deadline due to tz differences. We'll 
definitely need to test this for correctness and performance further but that 
can happen after 9/1.
 * This change I think should be made before we merge: In {{InputCollector}} is 
{{repairedSSTables}} necessary? It seems like an extra upfront allocation and 
iteration that we could skip. It also widens the unconfirmed window slightly 
since the status could change between the call to the constructor and the call 
to {{addSSTableIterator}}. What about going back to the on-demand allocation of 
{{repairedIters}} and checking if its {{null}} in {{finalizeIterators}}?
 * This can be addressed in a subsequent JIRA if you agree: In 
{{RepairedDataVerifier}}, we could more quickly report confirmed failures if we 
separated tracking of confirmed and unconfirmed digests. If the number of 
confirmed digests is > 1 we still have a confirmed issue regardless of the 
number of unconfirmed digests. We could separately increment unconfirmed in 
this case if the unconfirmed digests don't match any of the confirmed ones (if 
it does we would assume its consistent).

Minor Nits (up to you if you want to fix before merge):
 * Re: comments in cassandra.yaml, until we’ve benchmarked the changes maybe we 
shouldn’t try to characterize (“slight”) the performance impact besides to say 
it exists. Same goes for the identical comment in {{Config.java}}


was (Author: jrwest):
Thanks [~beobal]. The recent updates are a big improvement. The 
{{InputCollector}} changes make things much more readable. I feel like we've 
reduced the footprint and risk considerably which is making me more comfortable 
about merging this late in the game. The feature is well hidden behind a flag: 
if tracking is not enabled we add a couple bytes to the internode messaging 
protocol that are not used but otherwise the changes to the existing path are 
negligible; {{InputCollector}} probably being the biggest. I do have one 
comment below I think should be addressed before merge. The other one we can 
open an open an improvement JIRA for. So I will give my +1 assuming the comment 
below is addressed so that we don't miss the deadline due to tz differences. 
We'll definitely need to test this for correctness and performance further but 
that can happen after 9/1.
 * This change I think should be made before we merge: In {{InputCollector}} is 
{{repairedSSTables}} necessary? It seems like an extra upfront allocation and 
iteration that we could skip. It also widens the unconfirmed window slightly 
since the status could change between the call to the constructor and the call 
to {{addSSTableIterator}}. What about going back to the on-demand allocation of 
{{repairedIters}} and checking if its {{null}} in {{finalizeIterators}}?
 * This can be addressed in a subsequent JIRA if you agree: In 
{{RepairedDataVerifier}}, we could more quickly report confirmed failures if we 
separated tracking of confirmed and unconfirmed digests. If the number of 
confirmed digests is > 1 we still have a confirmed issue regardless of the 
number of unconfirmed digests. We could separately increment unconfirmed in 
this case if the unconfirmed digests don't match any of the confirmed ones (if 
it does we would assume its consistent).

Minor Nits (up to you if you want to fix before merge):
 * Re: comments in cassandra.yaml, until we’ve benchmarked the changes maybe we 
shouldn’t try to characterize (“slight”) the performance impact besides to say 
it exists. Same goes for the identical comment in {{Config.java}}

>  Detecting data resurrection during read
> 
>
> Key: CASSANDRA-14145
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14145
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: sankalp kohli
>Assignee: Sam Tunnicliffe
>Priority: Minor
> Fix For: 4.x
>
>
> We have seen 

[jira] [Commented] (CASSANDRA-14145) Detecting data resurrection during read

2018-08-31 Thread Jordan West (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599444#comment-16599444
 ] 

Jordan West commented on CASSANDRA-14145:
-

Thanks [~beobal]. The recent updates are a big improvement, and the 
{{InputCollector}} changes make things much more readable. I feel we've reduced 
the footprint and risk considerably, which makes me more comfortable about 
merging this late in the game. The feature is well hidden behind a flag (if 
tracking is not enabled we add a couple of bytes to the internode messaging 
protocol that go unused, but otherwise the changes to the existing path are 
negligible, {{InputCollector}} probably being the biggest). I do have one 
comment below that I think should be addressed before merge; for the other we 
can open an improvement JIRA. So I will give my +1 assuming the comment below 
is addressed, so that we don't miss the deadline due to tz differences. We'll 
definitely need to test this further for correctness and performance, but 
that can happen after 9/1.
 * I think this change should be made before we merge: is 
{{repairedSSTables}} in {{InputCollector}} necessary? It seems like an extra 
upfront allocation and iteration that we could skip. It also widens the 
unconfirmed window slightly, since the status could change between the call to 
the constructor and the call to {{addSSTableIterator}}. What about going back 
to the on-demand allocation of {{repairedIters}} and checking if it's {{null}} 
in {{finalizeIterators}}?
 * This can be addressed in a subsequent JIRA if you agree: in 
{{RepairedDataVerifier}}, we could report confirmed failures more quickly if we 
separated the tracking of confirmed and unconfirmed digests. If the number of 
distinct confirmed digests is > 1 we have a confirmed issue regardless of the 
number of unconfirmed digests. We could separately increment the unconfirmed 
count in this case if the unconfirmed digests don't match any of the confirmed 
ones (if one does, we would assume it's consistent).
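
If it helps to picture the suggestion, here is a minimal, self-contained sketch of the separated digest tracking (class and method names are hypothetical; the real {{RepairedDataVerifier}} API differs):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: track confirmed and unconfirmed repaired-data digests
// separately so a confirmed mismatch can be reported immediately, regardless
// of how many unconfirmed digests were seen.
public class RepairedDigestTracker {
    private final Set<String> confirmed = new HashSet<>();
    private final Set<String> unconfirmed = new HashSet<>();
    private int unconfirmedMismatches = 0;

    public void record(String digest, boolean isConfirmed) {
        if (isConfirmed) {
            confirmed.add(digest);
            return;
        }
        unconfirmed.add(digest);
        // Count an unconfirmed digest only if it matches no confirmed digest;
        // if it matches one, assume it is consistent.
        if (!confirmed.contains(digest))
            unconfirmedMismatches++;
    }

    // More than one distinct confirmed digest is a definite inconsistency.
    public boolean hasConfirmedMismatch() { return confirmed.size() > 1; }

    public boolean hasUnconfirmedMismatch() { return unconfirmedMismatches > 0; }

    public static void main(String[] args) {
        RepairedDigestTracker t = new RepairedDigestTracker();
        t.record("d1", true);
        t.record("d1", false); // matches a confirmed digest: assumed consistent
        t.record("d2", true);  // second distinct confirmed digest
        System.out.println("confirmed mismatch: " + t.hasConfirmedMismatch());
    }
}
```

Note one simplification: an unconfirmed digest that arrives before its matching confirmed one would still be counted here, which real bookkeeping would need to handle.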

Minor Nits (up to you if you want to fix before merge):
 * Re: comments in cassandra.yaml, until we’ve benchmarked the changes maybe we 
shouldn’t try to characterize (“slight”) the performance impact besides to say 
it exists. Same goes for the identical comment in {{Config.java}}

>  Detecting data resurrection during read
> 
>
> Key: CASSANDRA-14145
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14145
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: sankalp kohli
>Assignee: Sam Tunnicliffe
>Priority: Minor
> Fix For: 4.x
>
>
> We have seen several bugs in which deleted data gets resurrected. We should 
> try to see if we can detect this on the read path and possibly fix it. Here 
> are a few examples which brought back data
> A replica lost an sstable on startup, which caused it to lose the 
> tombstone but not the data. The tombstone was past gc grace, which means it 
> could resurrect data. We can detect such invalid states by looking at other 
> replicas. 
> If we are running incremental repair, Cassandra will keep repaired and 
> non-repaired data separate. Every time incremental repair runs, it will 
> move data from non-repaired to repaired. Repaired data across all 
> replicas should be 100% consistent. 
> Here is an example of how we can detect and mitigate the issue in most cases. 
> Say we have 3 machines: A, B and C. Each machine's data is split 
> between repaired and non-repaired. 
> 1. Machine A, due to some bug, brings back data D. This data D is in the 
> repaired dataset. All other replicas will have data D and tombstone T. 
> 2. A read for data D comes from the application, involving replicas A and B. 
> The data being read is in the repaired state. A will respond to the 
> co-ordinator with data D, and B will send nothing, as the tombstone is past 
> gc grace. This will cause a digest mismatch. 
> 3. This patch will only kick in when there is a digest mismatch. The 
> co-ordinator will ask both replicas to send back all data, as we do today, 
> but with this patch each replica will also indicate which of the data it 
> returns comes from the repaired vs non-repaired set. If the repaired data 
> does not match, we know something is wrong! At this point the co-ordinator 
> cannot determine whether replica A has resurrected some data or replica B 
> has lost some data, but we can still log an error saying we hit an invalid 
> state.
> 4. Beyond the log, we can take this further and even correct the response to 
> the query. After logging the invalid state, we can ask replicas A and B (and 
> also C if alive) to send back all data for this read, including gcable 
> tombstones. If any machine returns a tombstone which is after this data, we 
> know we cannot return 
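
The repaired-data check in step 3 boils down to a simple invariant, sketched below with hypothetical names (the real patch tracks digests per response rather than a plain map):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of step 3: after a digest mismatch, each replica reports a digest of
// the repaired portion of its response. Repaired data should be identical on
// all replicas, so more than one distinct digest means either resurrection
// (A) or data loss (B) -- an invalid state worth logging even if we can't
// tell which replica is wrong.
public class ResurrectionCheck {
    public static boolean repairedDataInconsistent(Map<String, String> repairedDigestByReplica) {
        Set<String> distinct = new HashSet<>(repairedDigestByReplica.values());
        return distinct.size() > 1;
    }

    public static void main(String[] args) {
        Map<String, String> digests = new HashMap<>();
        digests.put("A", "d1"); // A resurrected data D in its repaired set
        digests.put("B", "d2"); // B's repaired set has tombstone T purged
        if (repairedDataInconsistent(digests))
            System.out.println("invalid state: repaired data differs across replicas");
    }
}
```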

[jira] [Comment Edited] (CASSANDRA-13304) Add checksumming to the native protocol

2018-08-31 Thread Jordan West (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-13304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599420#comment-16599420
 ] 

Jordan West edited comment on CASSANDRA-13304 at 9/1/18 12:33 AM:
--

Other than the bug (and the minor nits, if you choose to address them) above I 
am +1 on these changes. We will need to make sure we test it more thoroughly 
once there is client support. Thanks for the updates [~beobal]!


was (Author: jrwest):
Other thank the bug (and minor nits if you choose to address them) above I am 
+1 on the these changes. We will need to make sure we test it more thoroughly 
once there is client support. Thanks for the updates [~beobal]!

> Add checksumming to the native protocol
> ---
>
> Key: CASSANDRA-13304
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13304
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Michael Kjellman
>Assignee: Sam Tunnicliffe
>Priority: Blocker
>  Labels: client-impacting
> Fix For: 4.x
>
> Attachments: 13304_v1.diff, boxplot-read-throughput.png, 
> boxplot-write-throughput.png
>
>
> The native binary transport implementation doesn't include checksums. This 
> makes it highly susceptible to silently inserting corrupted data either due 
> to hardware issues causing bit flips on the sender/client side, C*/receiver 
> side, or network in between.
> Attaching an implementation that makes checksum'ing mandatory (assuming both 
> client and server know about a protocol version that supports checksums) -- 
> and also adds checksumming to clients that request compression.
> The serialized format looks something like this:
> {noformat}
>  *  1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
>  *  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |  Number of Compressed Chunks  | Compressed Length (e1)/
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * /  Compressed Length cont. (e1) |Uncompressed Length (e1)   /
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * | Uncompressed Length cont. (e1)| CRC32 Checksum of Lengths (e1)|
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * | Checksum of Lengths cont. (e1)|Compressed Bytes (e1)+//
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |  CRC32 Checksum (e1) ||
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |Compressed Length (e2) |
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |   Uncompressed Length (e2)|
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |CRC32 Checksum of Lengths (e2) |
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * | Compressed Bytes (e2)   +//
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |  CRC32 Checksum (e2) ||
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |Compressed Length (en) |
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |   Uncompressed Length (en)|
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |CRC32 Checksum of Lengths (en) |
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |  Compressed Bytes (en)  +//
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |  CRC32 Checksum (en) ||
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
> {noformat}
> The first pass here adds checksums only to the actual contents of the frame 
> body itself (and doesn't actually checksum lengths and headers). While it 
> would be great to fully add checksuming across the entire protocol, the 
> proposed implementation will ensure we at least catch corrupted data and 
> likely protect ourselves pretty well anyways.
> I didn't go to the trouble of implementing a Snappy Checksum'ed Compressor 
> implementation as it's been deprecated for a while -- it's really slow and 
> crappy compared to LZ4 -- and we should do everything in our power to make 
> sure no one in the community is still using it. I left it in (for obvious 
> backwards-compatibility reasons) for old clients that don't know about the 
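
A minimal sketch of the per-chunk layout in the diagram above, using {{java.util.zip.CRC32}} (field widths and framing are simplified; this is not the actual Cassandra transport code):

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Illustrative sketch of one checksummed chunk: two length fields, a CRC32
// over the serialized lengths, the payload bytes, then a CRC32 over the
// payload -- mirroring the quoted diagram.
public class ChecksummedChunk {
    static long crc32(byte[] bytes) {
        CRC32 crc = new CRC32();
        crc.update(bytes, 0, bytes.length);
        return crc.getValue();
    }

    public static byte[] encode(byte[] compressed, int uncompressedLength) {
        ByteBuffer lengths = ByteBuffer.allocate(8);
        lengths.putInt(compressed.length);
        lengths.putInt(uncompressedLength);

        ByteBuffer out = ByteBuffer.allocate(8 + 4 + compressed.length + 4);
        out.put(lengths.array());
        out.putInt((int) crc32(lengths.array())); // CRC32 of the length fields
        out.put(compressed);
        out.putInt((int) crc32(compressed));      // CRC32 of the chunk contents
        return out.array();
    }

    // Recompute and compare both checksums on the receiving side.
    public static boolean verify(byte[] frame) {
        ByteBuffer in = ByteBuffer.wrap(frame);
        byte[] lengths = new byte[8];
        in.get(lengths);
        int compressedLength = ByteBuffer.wrap(lengths).getInt();
        long lengthsCrc = in.getInt() & 0xFFFFFFFFL;
        if (crc32(lengths) != lengthsCrc)
            return false; // corrupted lengths
        byte[] payload = new byte[compressedLength];
        in.get(payload);
        long payloadCrc = in.getInt() & 0xFFFFFFFFL;
        return crc32(payload) == payloadCrc;
    }

    public static void main(String[] args) {
        byte[] data = "hello".getBytes();
        byte[] frame = encode(data, data.length);
        System.out.println("valid: " + verify(frame));
        frame[12] ^= 1; // flip a bit in the payload
        System.out.println("after corruption: " + verify(frame));
    }
}
```

Checksumming the lengths separately lets the receiver reject a corrupted length field before allocating or reading a bogus payload size.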

[jira] [Comment Edited] (CASSANDRA-13304) Add checksumming to the native protocol

2018-08-31 Thread Jordan West (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-13304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599420#comment-16599420
 ] 

Jordan West edited comment on CASSANDRA-13304 at 9/1/18 12:32 AM:
--

Other thank the bug (and minor nits if you choose to address them) above I am 
+1 on the these changes. We will need to make sure we test it more thoroughly 
once there is client support. Thanks for the updates [~beobal]!


was (Author: jrwest):
Other thank the bug above I am +1 on the these changes. We will need to make 
sure we test it more thoroughly once there is client support. Thanks for the 
updates [~beobal]!

> Add checksumming to the native protocol
> ---
>
> Key: CASSANDRA-13304
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13304
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Michael Kjellman
>Assignee: Sam Tunnicliffe
>Priority: Blocker
>  Labels: client-impacting
> Fix For: 4.x
>
> Attachments: 13304_v1.diff, boxplot-read-throughput.png, 
> boxplot-write-throughput.png
>
>
> The native binary transport implementation doesn't include checksums. This 
> makes it highly susceptible to silently inserting corrupted data either due 
> to hardware issues causing bit flips on the sender/client side, C*/receiver 
> side, or network in between.
> Attaching an implementation that makes checksum'ing mandatory (assuming both 
> client and server know about a protocol version that supports checksums) -- 
> and also adds checksumming to clients that request compression.
> The serialized format looks something like this:
> {noformat}
>  *  1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
>  *  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |  Number of Compressed Chunks  | Compressed Length (e1)/
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * /  Compressed Length cont. (e1) |Uncompressed Length (e1)   /
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * | Uncompressed Length cont. (e1)| CRC32 Checksum of Lengths (e1)|
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * | Checksum of Lengths cont. (e1)|Compressed Bytes (e1)+//
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |  CRC32 Checksum (e1) ||
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |Compressed Length (e2) |
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |   Uncompressed Length (e2)|
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |CRC32 Checksum of Lengths (e2) |
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * | Compressed Bytes (e2)   +//
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |  CRC32 Checksum (e2) ||
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |Compressed Length (en) |
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |   Uncompressed Length (en)|
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |CRC32 Checksum of Lengths (en) |
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |  Compressed Bytes (en)  +//
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |  CRC32 Checksum (en) ||
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
> {noformat}
> The first pass here adds checksums only to the actual contents of the frame 
> body itself (and doesn't actually checksum lengths and headers). While it 
> would be great to fully add checksuming across the entire protocol, the 
> proposed implementation will ensure we at least catch corrupted data and 
> likely protect ourselves pretty well anyways.
> I didn't go to the trouble of implementing a Snappy Checksum'ed Compressor 
> implementation as it's been deprecated for a while -- it's really slow and 
> crappy compared to LZ4 -- and we should do everything in our power to make 
> sure no one in the community is still using it. I left it in (for obvious 
> backwards-compatibility reasons) for old clients that don't know about the 
> new protocol.
> The current protocol has a 

[jira] [Commented] (CASSANDRA-13304) Add checksumming to the native protocol

2018-08-31 Thread Jordan West (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-13304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599420#comment-16599420
 ] 

Jordan West commented on CASSANDRA-13304:
-

Other thank the bug above I am +1 on the these changes. We will need to make 
sure we test it more thoroughly once there is client support. Thanks for the 
updates [~beobal]!

> Add checksumming to the native protocol
> ---
>
> Key: CASSANDRA-13304
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13304
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Michael Kjellman
>Assignee: Sam Tunnicliffe
>Priority: Blocker
>  Labels: client-impacting
> Fix For: 4.x
>
> Attachments: 13304_v1.diff, boxplot-read-throughput.png, 
> boxplot-write-throughput.png
>
>
> The native binary transport implementation doesn't include checksums. This 
> makes it highly susceptible to silently inserting corrupted data either due 
> to hardware issues causing bit flips on the sender/client side, C*/receiver 
> side, or network in between.
> Attaching an implementation that makes checksum'ing mandatory (assuming both 
> client and server know about a protocol version that supports checksums) -- 
> and also adds checksumming to clients that request compression.
> The serialized format looks something like this:
> {noformat}
>  *  1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
>  *  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |  Number of Compressed Chunks  | Compressed Length (e1)/
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * /  Compressed Length cont. (e1) |Uncompressed Length (e1)   /
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * | Uncompressed Length cont. (e1)| CRC32 Checksum of Lengths (e1)|
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * | Checksum of Lengths cont. (e1)|Compressed Bytes (e1)+//
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |  CRC32 Checksum (e1) ||
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |Compressed Length (e2) |
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |   Uncompressed Length (e2)|
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |CRC32 Checksum of Lengths (e2) |
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * | Compressed Bytes (e2)   +//
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |  CRC32 Checksum (e2) ||
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |Compressed Length (en) |
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |   Uncompressed Length (en)|
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |CRC32 Checksum of Lengths (en) |
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |  Compressed Bytes (en)  +//
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |  CRC32 Checksum (en) ||
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
> {noformat}
> The first pass here adds checksums only to the actual contents of the frame 
> body itself (and doesn't actually checksum lengths and headers). While it 
> would be great to fully add checksuming across the entire protocol, the 
> proposed implementation will ensure we at least catch corrupted data and 
> likely protect ourselves pretty well anyways.
> I didn't go to the trouble of implementing a Snappy Checksum'ed Compressor 
> implementation as it's been deprecated for a while -- it's really slow and 
> crappy compared to LZ4 -- and we should do everything in our power to make 
> sure no one in the community is still using it. I left it in (for obvious 
> backwards-compatibility reasons) for old clients that don't know about the 
> new protocol.
> The current protocol has a 256MB (max) frame body -- where the serialized 
> contents are simply written into the frame body.
> If the client sends a compression option in the startup, we will install a 
> FrameCompressor inline. Unfortunately, we went with a decision to treat the 
> frame body separately from the 

[jira] [Commented] (CASSANDRA-14408) Transient Replication: Incremental & Validation repair handling of transient replicas

2018-08-31 Thread Blake Eggleston (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599415#comment-16599415
 ] 

Blake Eggleston commented on CASSANDRA-14408:
-

+1 from me as well

> Transient Replication: Incremental & Validation repair handling of transient 
> replicas
> -
>
> Key: CASSANDRA-14408
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14408
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Repair
>Reporter: Ariel Weisberg
>Assignee: Blake Eggleston
>Priority: Major
> Fix For: 4.0
>
>
> At transient replicas anti-compaction shouldn't output any data for transient 
> ranges as the data will be dropped after repair.
> Transient replicas should also never have data streamed to them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-13262) Incorrect cqlsh results when selecting same columns multiple times

2018-08-31 Thread mck (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-13262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597316#comment-16597316
 ] 

mck edited comment on CASSANDRA-13262 at 8/31/18 11:56 PM:
---

New dtests running…

|| branch || testall || dtest ||
| 
[cassandra-2.2_13262|https://github.com/thelastpickle/cassandra/tree/mck/cassandra-2.2_13262]
 | 
[!https://circleci.com/gh/thelastpickle/cassandra/tree/mck%2Fcassandra-2.2_13262.svg?style=svg!|https://circleci.com/gh/thelastpickle/cassandra/tree/mck%2Fcassandra-2.2_13262]
   | 
[!https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/625/badge/icon!|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/625/]
 |
| 
[cassandra-3.0_13262|https://github.com/thelastpickle/cassandra/tree/mck/cassandra-3.0_13262]
 | 
[!https://circleci.com/gh/thelastpickle/cassandra/tree/mck%2Fcassandra-3.0_13262.svg?style=svg!|https://circleci.com/gh/thelastpickle/cassandra/tree/mck%2Fcassandra-3.0_13262]
   | 
[!https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/626/badge/icon!|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/626/]
 |
| 
[cassandra-3.11_13262|https://github.com/thelastpickle/cassandra/tree/mck/cassandra-3.11_13262]
   | 
[!https://circleci.com/gh/thelastpickle/cassandra/tree/mck%2Fcassandra-3.11_13262.svg?style=svg!|https://circleci.com/gh/thelastpickle/cassandra/tree/mck%2Fcassandra-3.11_13262]
 | 
[!https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/627/badge/icon!|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/627/]
 |


EDIT: rebased branches.


was (Author: michaelsembwever):
New dtests running…

|| branch || testall || dtest ||
| 
[cassandra-2.2_13262|https://github.com/michaelsembwever/cassandra/tree/mck/cassandra-2.2_13262]
  | 
[!https://circleci.com/gh/michaelsembwever/cassandra/tree/mck%2Fcassandra-2.2_13262.svg?style=svg!|https://circleci.com/gh/michaelsembwever/cassandra/tree/mck%2Fcassandra-2.2_13262]
 | 
[!https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/618/badge/icon!|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/618/]
 |
| 
[cassandra-3.0_13262|https://github.com/michaelsembwever/cassandra/tree/mck/cassandra-3.0_13262]
  | 
[!https://circleci.com/gh/michaelsembwever/cassandra/tree/mck%2Fcassandra-3.0_13262.svg?style=svg!|https://circleci.com/gh/michaelsembwever/cassandra/tree/mck%2Fcassandra-3.0_13262]
 | 
[!https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/619/badge/icon!|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/619/]
 |
| 
[cassandra-3.11_13262|https://github.com/michaelsembwever/cassandra/tree/mck/cassandra-3.11_13262]
| 
[!https://circleci.com/gh/michaelsembwever/cassandra/tree/mck%2Fcassandra-3.11_13262.svg?style=svg!|https://circleci.com/gh/michaelsembwever/cassandra/tree/mck%2Fcassandra-3.11_13262]
   | 
[!https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/620/badge/icon!|https://builds.apache.org/view/A-D/view/Cassandra/job/Cassandra-devbranch-dtest/620/]
 |


> Incorrect cqlsh results when selecting same columns multiple times
> --
>
> Key: CASSANDRA-13262
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13262
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Stefan Podkowinski
>Assignee: Murukesh Mohanan
>Priority: Minor
>  Labels: lhf
> Fix For: 4.0
>
> Attachments: 
> 0001-Fix-incorrect-cqlsh-results-when-selecting-same-colu.patch, 
> CASSANDRA-13262-v2.2.txt, CASSANDRA-13262-v3.0.txt, CASSANDRA-13262-v3.11.txt
>
>
> Just stumbled over this on trunk:
> {quote}
> cqlsh:test1> select a, b, c from table1;
>  a | b| c
> ---+--+-
>  1 |b |   2
>  2 | null | 2.2
> (2 rows)
> cqlsh:test1> select a, a, b, c from table1;
>  a | a| b   | c
> ---+--+-+--
>  1 |b |   2 | null
>  2 | null | 2.2 | null
> (2 rows)
> cqlsh:test1> select a, a, a, b, c from table1;
>  a | a| a | b| c
> ---+--+---+--+--
>  1 |b |   2.0 | null | null
>  2 | null | 2.2004768 | null | null
> {quote}
> My guess is that his is on the Python side, but haven't really looked into it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-14408) Transient Replication: Incremental & Validation repair handling of transient replicas

2018-08-31 Thread Blake Eggleston (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Blake Eggleston updated CASSANDRA-14408:

Reviewers: Alex Petrov, Ariel Weisberg, Benedict, Blake Eggleston, Marcus 
Eriksson  (was: Alex Petrov, Ariel Weisberg, Benedict, Marcus Eriksson)

> Transient Replication: Incremental & Validation repair handling of transient 
> replicas
> -
>
> Key: CASSANDRA-14408
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14408
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Repair
>Reporter: Ariel Weisberg
>Assignee: Blake Eggleston
>Priority: Major
> Fix For: 4.0
>
>
> At transient replicas anti-compaction shouldn't output any data for transient 
> ranges as the data will be dropped after repair.
> Transient replicas should also never have data streamed to them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-14406) Transient Replication: Implement cheap quorum write optimizations

2018-08-31 Thread Blake Eggleston (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Blake Eggleston updated CASSANDRA-14406:

Reviewers: Alex Petrov, Ariel Weisberg, Benedict, Blake Eggleston  (was: 
Alex Petrov, Ariel Weisberg, Benedict)

> Transient Replication: Implement cheap quorum write optimizations
> -
>
> Key: CASSANDRA-14406
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14406
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Coordination
>Reporter: Ariel Weisberg
>Assignee: Blake Eggleston
>Priority: Major
>
> Writes should never be sent to transient replicas unless necessary to satisfy 
> the requested consistency level, such as when RF is not sufficient for strong 
> consistency or not enough full replicas are marked as alive.
> If a write doesn't receive sufficient responses in time, additional replicas 
> should be sent the write, similar to Rapid Read Protection.
> Hints should never be written for a transient replica.
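
The write rule above can be sketched as follows (hypothetical helper, not the real write-path code):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the cheap-quorum write rule: initial write contacts
// are full replicas only; transient replicas are added only when the live
// full replicas can't satisfy the required number of responses on their own.
public class WriteReplicaSelection {
    public static List<String> initialContacts(List<String> liveFull,
                                               List<String> liveTransient,
                                               int required) {
        List<String> contacts = new ArrayList<>(liveFull);
        // Fall back to transient replicas only to reach the consistency level.
        for (String t : liveTransient) {
            if (contacts.size() >= required) break;
            contacts.add(t);
        }
        return contacts;
    }

    public static void main(String[] args) {
        // Full replicas suffice: no transient replica is contacted.
        System.out.println(initialContacts(List.of("F1", "F2"), List.of("T1"), 2));
        // Only one live full replica: a transient replica must take a write.
        System.out.println(initialContacts(List.of("F1"), List.of("T1"), 2));
    }
}
```

A timeout-driven second pass (as with Rapid Read Protection) would then add further replicas if responses don't arrive in time.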



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14405) Transient Replication: Metadata refactor

2018-08-31 Thread Blake Eggleston (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599413#comment-16599413
 ] 

Blake Eggleston commented on CASSANDRA-14405:
-

+1 from me as well

> Transient Replication: Metadata refactor
> 
>
> Key: CASSANDRA-14405
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14405
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Core, Distributed Metadata, Documentation and Website
>Reporter: Ariel Weisberg
>Assignee: Blake Eggleston
>Priority: Major
> Fix For: 4.0
>
>
> Add support to CQL and NTS for configuring keyspaces to have transient 
> replicas.
> Add syntax allowing a keyspace using NTS to declare some replicas in each DC 
> as transient.
> Implement metadata internal to the DB so that it's possible to identify what 
> replicas are transient for a given token or range.
> Introduce Replica, which is an InetAddressAndPort and a boolean indicating 
> whether the replica is transient, and ReplicatedRange, which is a wrapper 
> around a Range that indicates whether the range is transient.
> Block altering of keyspaces to use transient replication if they already 
> contain MVs or 2i.
> Block the creation of MV or 2i in keyspaces using transient replication.
> Block the creation/alteration of keyspaces using transient replication if the 
> experimental flag is not set.
> Update web site, CQL spec, and any other documentation for the new syntax.
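
A minimal sketch of the Replica concept described above (a String endpoint stands in for InetAddressAndPort here; the real class also carries range information):

```java
import java.util.Objects;

// Hypothetical sketch of the Replica metadata type: an endpoint plus a flag
// distinguishing transient from full replication.
public final class Replica {
    private final String endpoint;
    private final boolean isTransient;

    public Replica(String endpoint, boolean isTransient) {
        this.endpoint = Objects.requireNonNull(endpoint);
        this.isTransient = isTransient;
    }

    public String endpoint() { return endpoint; }
    public boolean isTransient() { return isTransient; }
    public boolean isFull() { return !isTransient; }

    public static void main(String[] args) {
        Replica full = new Replica("10.0.0.1:7000", false);
        Replica trans = new Replica("10.0.0.2:7000", true);
        System.out.println(full.endpoint() + " full=" + full.isFull());
        System.out.println(trans.endpoint() + " full=" + trans.isFull());
    }
}
```

Making the flag part of the replica object (rather than a side table) lets every read, write, repair, and streaming path ask the same question uniformly.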



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14406) Transient Replication: Implement cheap quorum write optimizations

2018-08-31 Thread Blake Eggleston (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599414#comment-16599414
 ] 

Blake Eggleston commented on CASSANDRA-14406:
-

+1 from me as well

> Transient Replication: Implement cheap quorum write optimizations
> -
>
> Key: CASSANDRA-14406
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14406
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Coordination
>Reporter: Ariel Weisberg
>Assignee: Blake Eggleston
>Priority: Major
>
> Writes should never be sent to transient replicas unless necessary to satisfy 
> the requested consistency level, such as when RF is not sufficient for strong 
> consistency or not enough full replicas are marked as alive.
> If a write doesn't receive sufficient responses in time, additional replicas 
> should be sent the write, similar to Rapid Read Protection.
> Hints should never be written for a transient replica.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-14405) Transient Replication: Metadata refactor

2018-08-31 Thread Blake Eggleston (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Blake Eggleston updated CASSANDRA-14405:

Reviewers: Alex Petrov, Ariel Weisberg, Benedict, Blake Eggleston  (was: 
Alex Petrov, Ariel Weisberg, Benedict)

> Transient Replication: Metadata refactor
> 
>
> Key: CASSANDRA-14405
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14405
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Core, Distributed Metadata, Documentation and Website
>Reporter: Ariel Weisberg
>Assignee: Blake Eggleston
>Priority: Major
> Fix For: 4.0
>
>
> Add support to CQL and NTS for configuring keyspaces to have transient 
> replicas.
> Add syntax allowing a keyspace using NTS to declare some replicas in each DC 
> as transient.
> Implement metadata internal to the DB so that it's possible to identify what 
> replicas are transient for a given token or range.
> Introduce Replica, which is an InetAddressAndPort and a boolean indicating 
> whether the replica is transient, and ReplicatedRange, which is a wrapper 
> around a Range that indicates whether the range is transient.
> Block altering of keyspaces to use transient replication if they already 
> contain MVs or 2i.
> Block the creation of MV or 2i in keyspaces using transient replication.
> Block the creation/alteration of keyspaces using transient replication if the 
> experimental flag is not set.
> Update web site, CQL spec, and any other documentation for the new syntax.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14407) Transient Replication: Add support for correct reads when transient replication is in use

2018-08-31 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599406#comment-16599406
 ] 

Benedict commented on CASSANDRA-14407:
--

+1 also; I will be following up next week with a detailed comment and some 
follow up tickets, once I've had time to collate my notes.

> Transient Replication: Add support for correct reads when transient 
> replication is in use
> -
>
> Key: CASSANDRA-14407
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14407
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Coordination
>Reporter: Ariel Weisberg
>Assignee: Blake Eggleston
>Priority: Major
> Fix For: 4.0
>
>
> Digest reads should never be sent to transient replicas.
> Mismatches with results from transient replicas shouldn't trigger read repair.
> Read repair should never attempt to repair a transient replica.
> Reads should always include at least one full replica. They should also 
> prefer transient replicas where possible.
> Range scans must ensure the entire scanned range performs replica selection 
> that satisfies the requirement that every range scanned includes one full 
> replica.
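
The read-path rules above can be sketched as a replica-selection routine that always honors the at-least-one-full-replica requirement (a hypothetical Python illustration, not the actual Cassandra coordinator code):

```python
def select_read_replicas(replicas, required):
    """Pick `required` replicas for a read: at least one full replica is
    mandatory, and transient replicas are preferred for the other slots."""
    full = [r for r in replicas if r["full"]]
    transient = [r for r in replicas if not r["full"]]
    if not full:
        raise ValueError("a read must include at least one full replica")
    chosen = full[:1] + transient[: required - 1]
    # Top up with more full replicas when there aren't enough transient ones.
    chosen += full[1 : 1 + max(0, required - len(chosen))]
    return chosen[:required]

replicas = [
    {"endpoint": "10.0.0.1", "full": True},
    {"endpoint": "10.0.0.2", "full": True},
    {"endpoint": "10.0.0.3", "full": False},
]
picked = select_read_replicas(replicas, required=2)
```

For a quorum of 2 out of the 3 replicas above, the selection keeps one full replica and fills the remaining slot with the transient one.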






[jira] [Commented] (CASSANDRA-14406) Transient Replication: Implement cheap quorum write optimizations

2018-08-31 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599407#comment-16599407
 ] 

Benedict commented on CASSANDRA-14406:
--

+1 also; I will be following up next week with a detailed comment and some 
follow up tickets, once I've had time to collate my notes.

> Transient Replication: Implement cheap quorum write optimizations
> -
>
> Key: CASSANDRA-14406
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14406
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Coordination
>Reporter: Ariel Weisberg
>Assignee: Blake Eggleston
>Priority: Major
>
> Writes should never be sent to transient replicas unless necessary to satisfy 
> the requested consistency level, such as when RF is not sufficient for strong 
> consistency or not enough full replicas are marked as alive.
> If a write doesn't receive sufficient responses in time, additional replicas 
> should be sent the write, similar to Rapid Read Protection.
> Hints should never be written for a transient replica.
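
The write-path rules above can be sketched as follows (a hypothetical Python illustration; the real coordinator logic is more involved):

```python
def select_write_targets(full, transient, alive, required_acks):
    """Send writes to live full replicas first; add live transient
    replicas only when full replicas alone can't satisfy the
    requested consistency level."""
    targets = [r for r in full if r in alive]
    for r in transient:
        if len(targets) >= required_acks:
            break
        if r in alive:
            targets.append(r)
    return targets

# All full replicas up: the transient replica receives no write.
healthy = select_write_targets(["f1", "f2"], ["t1"], {"f1", "f2", "t1"}, 2)
# One full replica down: a transient replica is enlisted to reach quorum.
degraded = select_write_targets(["f1", "f2"], ["t1"], {"f1", "t1"}, 2)
```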






[jira] [Commented] (CASSANDRA-14405) Transient Replication: Metadata refactor

2018-08-31 Thread Benedict (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599405#comment-16599405
 ] 

Benedict commented on CASSANDRA-14405:
--

+1 also; I will be following up next week with a detailed comment and some 
follow up tickets, once I've had time to collate my notes.

> Transient Replication: Metadata refactor
> 
>
> Key: CASSANDRA-14405
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14405
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Core, Distributed Metadata, Documentation and Website
>Reporter: Ariel Weisberg
>Assignee: Blake Eggleston
>Priority: Major
> Fix For: 4.0
>
>
> Add support to CQL and NTS for configuring keyspaces to have transient 
> replicas.
> Add syntax allowing a keyspace using NTS to declare some replicas in each DC 
> as transient.
> Implement metadata internal to the DB so that it's possible to identify what 
> replicas are transient for a given token or range.
> Introduce Replica, which is an InetAddressAndPort and a boolean indicating 
> whether the replica is transient, and ReplicatedRange, which is a wrapper 
> around a Range that indicates whether the range is transient.
> Block altering of keyspaces to use transient replication if they already 
> contain MVs or 2i.
> Block the creation of MV or 2i in keyspaces using transient replication.
> Block the creation/alteration of keyspaces using transient replication if the 
> experimental flag is not set.
> Update web site, CQL spec, and any other documentation for the new syntax.






[jira] [Commented] (CASSANDRA-14406) Transient Replication: Implement cheap quorum write optimizations

2018-08-31 Thread Alex Petrov (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599393#comment-16599393
 ] 

Alex Petrov commented on CASSANDRA-14406:
-

The patch was reviewed, modified, and incorporated into the final Transient 
Replication patch. +1 from my side of the review.

> Transient Replication: Implement cheap quorum write optimizations
> -
>
> Key: CASSANDRA-14406
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14406
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Coordination
>Reporter: Ariel Weisberg
>Assignee: Blake Eggleston
>Priority: Major
>
> Writes should never be sent to transient replicas unless necessary to satisfy 
> the requested consistency level, such as when RF is not sufficient for strong 
> consistency or not enough full replicas are marked as alive.
> If a write doesn't receive sufficient responses in time, additional replicas 
> should be sent the write, similar to Rapid Read Protection.
> Hints should never be written for a transient replica.






[jira] [Commented] (CASSANDRA-14408) Transient Replication: Incremental & Validation repair handling of transient replicas

2018-08-31 Thread Alex Petrov (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599395#comment-16599395
 ] 

Alex Petrov commented on CASSANDRA-14408:
-

The patch was reviewed, modified, and incorporated into the final Transient 
Replication patch. +1 from my side of the review.

> Transient Replication: Incremental & Validation repair handling of transient 
> replicas
> -
>
> Key: CASSANDRA-14408
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14408
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Repair
>Reporter: Ariel Weisberg
>Assignee: Blake Eggleston
>Priority: Major
> Fix For: 4.0
>
>
> At transient replicas, anti-compaction shouldn't output any data for 
> transient ranges, as the data will be dropped after repair.
> Transient replicas should also never have data streamed to them.
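
The anti-compaction rule above can be sketched like this (a hypothetical Python illustration of the range filtering; the actual implementation operates on SSTables, not bare range tuples):

```python
def anticompaction_ranges(sstable_ranges, transient_ranges):
    """At a transient replica, anti-compaction keeps only data for
    non-transient ranges; transiently replicated data is dropped
    because it becomes redundant once repair completes."""
    drop = set(transient_ranges)
    kept = [r for r in sstable_ranges if r not in drop]
    dropped = [r for r in sstable_ranges if r in drop]
    return kept, dropped

kept, dropped = anticompaction_ranges([(0, 10), (10, 20)], [(10, 20)])
```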






[jira] [Updated] (CASSANDRA-14408) Transient Replication: Incremental & Validation repair handling of transient replicas

2018-08-31 Thread Alex Petrov (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-14408:

Reviewers: Alex Petrov, Ariel Weisberg, Benedict, Marcus Eriksson  (was: 
Ariel Weisberg, Benedict, Marcus Eriksson)

> Transient Replication: Incremental & Validation repair handling of transient 
> replicas
> -
>
> Key: CASSANDRA-14408
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14408
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Repair
>Reporter: Ariel Weisberg
>Assignee: Blake Eggleston
>Priority: Major
> Fix For: 4.0
>
>
> At transient replicas, anti-compaction shouldn't output any data for 
> transient ranges, as the data will be dropped after repair.
> Transient replicas should also never have data streamed to them.






[jira] [Updated] (CASSANDRA-14406) Transient Replication: Implement cheap quorum write optimizations

2018-08-31 Thread Alex Petrov (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-14406:

Reviewers: Alex Petrov, Ariel Weisberg, Benedict

> Transient Replication: Implement cheap quorum write optimizations
> -
>
> Key: CASSANDRA-14406
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14406
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Coordination
>Reporter: Ariel Weisberg
>Assignee: Blake Eggleston
>Priority: Major
>
> Writes should never be sent to transient replicas unless necessary to satisfy 
> the requested consistency level, such as when RF is not sufficient for strong 
> consistency or not enough full replicas are marked as alive.
> If a write doesn't receive sufficient responses in time, additional replicas 
> should be sent the write, similar to Rapid Read Protection.
> Hints should never be written for a transient replica.






[jira] [Updated] (CASSANDRA-14405) Transient Replication: Metadata refactor

2018-08-31 Thread Alex Petrov (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Petrov updated CASSANDRA-14405:

Reviewers: Alex Petrov, Ariel Weisberg, Benedict  (was: Ariel Weisberg, 
Benedict)

> Transient Replication: Metadata refactor
> 
>
> Key: CASSANDRA-14405
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14405
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Core, Distributed Metadata, Documentation and Website
>Reporter: Ariel Weisberg
>Assignee: Blake Eggleston
>Priority: Major
> Fix For: 4.0
>
>
> Add support to CQL and NTS for configuring keyspaces to have transient 
> replicas.
> Add syntax allowing a keyspace using NTS to declare some replicas in each DC 
> as transient.
> Implement metadata internal to the DB so that it's possible to identify what 
> replicas are transient for a given token or range.
> Introduce Replica, which is an InetAddressAndPort and a boolean indicating 
> whether the replica is transient, and ReplicatedRange, which is a wrapper 
> around a Range that indicates whether the range is transient.
> Block altering of keyspaces to use transient replication if they already 
> contain MVs or 2i.
> Block the creation of MV or 2i in keyspaces using transient replication.
> Block the creation/alteration of keyspaces using transient replication if the 
> experimental flag is not set.
> Update web site, CQL spec, and any other documentation for the new syntax.






[jira] [Commented] (CASSANDRA-14405) Transient Replication: Metadata refactor

2018-08-31 Thread Alex Petrov (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599389#comment-16599389
 ] 

Alex Petrov commented on CASSANDRA-14405:
-

The patch was reviewed, modified, and incorporated into the final Transient 
Replication patch. +1 from my side of the review.

> Transient Replication: Metadata refactor
> 
>
> Key: CASSANDRA-14405
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14405
> Project: Cassandra
>  Issue Type: Sub-task
>  Components: Core, Distributed Metadata, Documentation and Website
>Reporter: Ariel Weisberg
>Assignee: Blake Eggleston
>Priority: Major
> Fix For: 4.0
>
>
> Add support to CQL and NTS for configuring keyspaces to have transient 
> replicas.
> Add syntax allowing a keyspace using NTS to declare some replicas in each DC 
> as transient.
> Implement metadata internal to the DB so that it's possible to identify what 
> replicas are transient for a given token or range.
> Introduce Replica, which is an InetAddressAndPort and a boolean indicating 
> whether the replica is transient, and ReplicatedRange, which is a wrapper 
> around a Range that indicates whether the range is transient.
> Block altering of keyspaces to use transient replication if they already 
> contain MVs or 2i.
> Block the creation of MV or 2i in keyspaces using transient replication.
> Block the creation/alteration of keyspaces using transient replication if the 
> experimental flag is not set.
> Update web site, CQL spec, and any other documentation for the new syntax.






[jira] [Updated] (CASSANDRA-10699) Make schema alterations strongly consistent

2018-08-31 Thread C. Scott Andreas (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-10699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

C. Scott Andreas updated CASSANDRA-10699:
-
Fix Version/s: (was: 4.0)
   4.x

> Make schema alterations strongly consistent
> ---
>
> Key: CASSANDRA-10699
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10699
> Project: Cassandra
>  Issue Type: Sub-task
>Reporter: Aleksey Yeschenko
>Assignee: Aleksey Yeschenko
>Priority: Major
> Fix For: 4.x
>
>
> Schema changes do not necessarily commute. This has been the case before 
> CASSANDRA-5202, but now is particularly problematic.
> We should employ a strongly consistent protocol instead of relying on 
> marshalling {{Mutation}} objects with schema changes.






[jira] [Comment Edited] (CASSANDRA-13304) Add checksumming to the native protocol

2018-08-31 Thread Jordan West (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-13304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599314#comment-16599314
 ] 

Jordan West edited comment on CASSANDRA-13304 at 8/31/18 10:08 PM:
---

[~beobal] I'm still going over the changes but I wanted to post this bug now 
before I finish. The issue is that the checksum over the lengths is only 
calculated over the least-significant byte of the compressed and uncompressed 
lengths. This means that if we introduce corruption in the three most 
significant bytes we won't catch it, which leads to a host of different bugs 
(index out of bounds exceptions, LZ4 decompression issues, etc.). I pushed a 
patch 
[here|https://github.com/jrwest/cassandra/commit/e57a2508c26f05efb826a7f4342964fa6d6691bd]
 with a new test that catches the issue. I left in the fixed seed for now so 
it's easy for you to run and see the failing example in a debugger (it's the 
second example that fails). I've pasted a stack trace of the failure with that 
seed below. The generated example has a single-byte input and introduces 
corruption into the 3rd byte in the stream (the most significant byte of the 
first length in the stream). This leads to a case where the checksums match, 
but when we go to read the data we read past the total length of the buffer.
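
The failure mode can be reproduced with a toy model of the flawed length checksum (a Python stand-in for the Java code, with {{zlib.adler32}} playing the role of the transformer's Adler32 checksum). Note that the corrupted value below, 0x00FF0001, is exactly the bogus length 16711681 from the stack trace:

```python
import zlib

def weak_length_checksum(length: int) -> int:
    # Bug pattern: checksum computed over only the least-significant byte
    return zlib.adler32(bytes([length & 0xFF]))

def full_length_checksum(length: int) -> int:
    # Fix: checksum over all four bytes of the 32-bit length
    return zlib.adler32(length.to_bytes(4, "big"))

original = 0x00000001   # length = 1 (the single-byte input)
corrupted = 0x00FF0001  # corruption in a more-significant byte -> 16711681

# The weak checksum cannot tell the two lengths apart...
assert weak_length_checksum(original) == weak_length_checksum(corrupted)
# ...while checksumming the full width detects the corruption.
assert full_length_checksum(original) != full_length_checksum(corrupted)
```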

A couple other comments while I'm here. All minor:
 * [~djoshi3] and I were reviewing the bug above before I posted and we were 
thinking it would be nice to refactor 
{{ChecksummingTransformer#transformInbound/transformOutbound}}. They are a bit 
large/unwieldy right now. We can open a JIRA to address this later if you 
prefer.
 * Re: the {{roundTripZeroLength}} property. This is mostly covered by the 
property I already added, although this makes it more likely to generate a few 
cases. If you want to keep it, I would recommend setting {{withExamples}} and 
using something small like 10 or 20 examples (since the state space is small).
 * The {{System.out.println}} I added in {{roundTripSafetyProperty}} should be 
removed.

Stack Trace:
{code:java}
java.lang.AssertionError: Property falsified after 2 example(s) 
Smallest found falsifying value(s) :- \{(c,3), 0, null, Adler32}

Cause was :-
 java.lang.IndexOutOfBoundsException: readerIndex(10) + length(16711681) 
exceeds writerIndex(15): UnpooledHeapByteBuf(ridx: 10, widx: 15, cap: 54/54)
 at 
io.netty.buffer.AbstractByteBuf.checkReadableBytes0(AbstractByteBuf.java:1401)
 at 
io.netty.buffer.AbstractByteBuf.checkReadableBytes(AbstractByteBuf.java:1388)
 at io.netty.buffer.AbstractByteBuf.readBytes(AbstractByteBuf.java:870)
 at 
org.apache.cassandra.transport.frame.checksum.ChecksummingTransformer.transformInbound(ChecksummingTransformer.java:289)
 at 
org.apache.cassandra.transport.frame.checksum.ChecksummingTransformerTest.roundTripWithCorruption(ChecksummingTransformerTest.java:106)
 at 
org.quicktheories.dsl.TheoryBuilder4.lambda$checkAssert$9(TheoryBuilder4.java:163)
 at org.quicktheories.dsl.TheoryBuilder4.lambda$check$8(TheoryBuilder4.java:151)
 at org.quicktheories.impl.Property.tryFalsification(Property.java:23)
 at org.quicktheories.impl.Core.shrink(Core.java:111)
 at org.quicktheories.impl.Core.run(Core.java:39)
 at org.quicktheories.impl.TheoryRunner.check(TheoryRunner.java:35)
 at org.quicktheories.dsl.TheoryBuilder4.check(TheoryBuilder4.java:150)
 at org.quicktheories.dsl.TheoryBuilder4.checkAssert(TheoryBuilder4.java:162)
 at 
org.apache.cassandra.transport.frame.checksum.ChecksummingTransformerTest.corruptionCausesFailure(ChecksummingTransformerTest.java:87)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
 at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
 at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
 at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
 at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
 at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
 at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
 at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
 at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
 at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
 at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
 at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
 at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
{code}

[jira] [Commented] (CASSANDRA-14618) Create fqltool replay command

2018-08-31 Thread Jason Brown (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599274#comment-16599274
 ] 

Jason Brown commented on CASSANDRA-14618:
-

+1, and please commit with CASSANDRA-14619 (as both patches are linked, code 
and review wise)

> Create fqltool replay command
> -
>
> Key: CASSANDRA-14618
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14618
> Project: Cassandra
>  Issue Type: New Feature
>Reporter: Marcus Eriksson
>Assignee: Marcus Eriksson
>Priority: Major
>  Labels: fqltool
> Fix For: 4.x
>
>
> Make it possible to replay the full query logs from CASSANDRA-13983 against 
> one or several clusters. The goal is to be able to compare different runs of 
> production traffic against different versions/configurations of Cassandra.
> * It should be possible to take logs from several machines and replay them in 
> "order" by the timestamps recorded
> * Record the results from each run to be able to compare different runs 
> (against different clusters/versions/etc)
> * If {{fqltool replay}} is run against 2 or more clusters, the results should 
> be compared as we go
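
Replaying logs from several machines in timestamp order is a k-way merge; a minimal sketch (hypothetical log format, not fqltool's actual on-disk format):

```python
import heapq

# Each node's log is assumed to be already ordered by timestamp.
node_a = [(1, "SELECT ..."), (4, "UPDATE ...")]
node_b = [(2, "INSERT ..."), (3, "DELETE ...")]

# heapq.merge lazily interleaves the sorted streams into one
# globally timestamp-ordered replay sequence.
replay = list(heapq.merge(node_a, node_b))
timestamps = [ts for ts, _ in replay]
```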






[jira] [Updated] (CASSANDRA-14618) Create fqltool replay command

2018-08-31 Thread Jason Brown (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Brown updated CASSANDRA-14618:

Status: Ready to Commit  (was: Patch Available)

> Create fqltool replay command
> -
>
> Key: CASSANDRA-14618
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14618
> Project: Cassandra
>  Issue Type: New Feature
>Reporter: Marcus Eriksson
>Assignee: Marcus Eriksson
>Priority: Major
>  Labels: fqltool
> Fix For: 4.x
>
>
> Make it possible to replay the full query logs from CASSANDRA-13983 against 
> one or several clusters. The goal is to be able to compare different runs of 
> production traffic against different versions/configurations of Cassandra.
> * It should be possible to take logs from several machines and replay them in 
> "order" by the timestamps recorded
> * Record the results from each run to be able to compare different runs 
> (against different clusters/versions/etc)
> * If {{fqltool replay}} is run against 2 or more clusters, the results should 
> be compared as we go






[jira] [Updated] (CASSANDRA-14619) Create fqltool compare command

2018-08-31 Thread Jason Brown (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Brown updated CASSANDRA-14619:

Status: Ready to Commit  (was: Patch Available)

> Create fqltool compare command
> --
>
> Key: CASSANDRA-14619
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14619
> Project: Cassandra
>  Issue Type: New Feature
>Reporter: Marcus Eriksson
>Assignee: Marcus Eriksson
>Priority: Major
>  Labels: fqltool
> Fix For: 4.x
>
>
> We need a {{fqltool compare}} command that can take the recorded runs from 
> CASSANDRA-14618 and compare them; it should output any differences and 
> potentially all queries against the mismatching partition up until the 
> mismatch.






[jira] [Commented] (CASSANDRA-14619) Create fqltool compare command

2018-08-31 Thread Jason Brown (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599273#comment-16599273
 ] 

Jason Brown commented on CASSANDRA-14619:
-

- ColumnDefsReader.readMarshallable - you read an int32 value, but in 
ColumnDefsWriter.writeMarshallable, you wrote an int16. Is this correct? The 
unit tests pass, but I'm not sure if RecordStore is being fully exercised. The 
same thing happens in RowReader vs RowWriter.

UPDATE: I stepped through the Chronicle code and it looks like the library can 
optimize the value it writes out (it only gets written as a byte, basically, 
since your value is zero). So, while your API calls are incongruous, the 
library does the correct thing under the hood. I would still prefer you to 
switch the reads to int16(), but that can be done on commit.
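
With a plain fixed-width encoding the read/write asymmetry would fail immediately; Chronicle's self-describing wire format is what masked it here. A quick illustration using Python's {{struct}} module as the fixed-width stand-in:

```python
import struct

# Value written as a 16-bit int (2 bytes on the wire)...
buf = struct.pack(">h", 0)

# ...but read back as a 32-bit int, which needs 4 bytes.
try:
    struct.unpack(">i", buf)
    mismatch_detected = False
except struct.error:
    # A fixed-width codec surfaces the asymmetry as an immediate error.
    mismatch_detected = True
```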

I also had a few trivial comments on the PR linked above. They are minor, so 
just address them on commit (if you choose).

Otherwise, +1 from me.

> Create fqltool compare command
> --
>
> Key: CASSANDRA-14619
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14619
> Project: Cassandra
>  Issue Type: New Feature
>Reporter: Marcus Eriksson
>Assignee: Marcus Eriksson
>Priority: Major
>  Labels: fqltool
> Fix For: 4.x
>
>
> We need a {{fqltool compare}} command that can take the recorded runs from 
> CASSANDRA-14618 and compare them; it should output any differences and 
> potentially all queries against the mismatching partition up until the 
> mismatch.






[jira] [Comment Edited] (CASSANDRA-14497) Add Role login cache

2018-08-31 Thread Sam Tunnicliffe (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599244#comment-16599244
 ] 

Sam Tunnicliffe edited comment on CASSANDRA-14497 at 8/31/18 8:24 PM:
--

{quote}{{logical}} role doesn't have password right? Can we use that?
{quote}
I'm not sure I exactly follow. If you mean can we infer {{LOGIN}} from the lack 
of a password, then the answer is no because alternative {{IAuthenticator}} 
implementations may also not use passwords, but you still want users to be able 
to login.
{quote}Then if we disable {{authorizer}}, it should not do the login check 
right?
{quote}
No, it still has to do that because it's a required privilege for connecting. I 
guess I over-simplified when I said perms are only the concern of the 
{{IAuthorizer}}.
{quote}Maybe my questions are beyond the scope of this ticket. If we just want 
to add cache with minimized the impact. I think the patch looks good.
{quote}
I think there's definitely plenty of scope to improve the design of the auth 
subsystem, so let's open a 4.x JIRA to figure out exactly what we want. I'll 
commit this patch in the meantime (after rebasing and CI) to reduce the impact 
of high login rates.

Thanks [~jay.zhuang]



> Add Role login cache
> 
>
> Key: CASSANDRA-14497
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14497
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Auth
>Reporter: Jay Zhuang
>Assignee: Sam Tunnicliffe
>Priority: Major
>  Labels: security
> Fix For: 4.0
>
>
> The 
> [{{ClientState.login()}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/ClientState.java#L313]
>  function is used for all auth messages: 
> [{{AuthResponse.java:82}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/transport/messages/AuthResponse.java#L82].
>  But the 
> [{{role.canLogin}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/auth/CassandraRoleManager.java#L521]
>  information is not cached. So it hits the database every time: 
> [{{CassandraRoleManager.java:407}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/auth/CassandraRoleManager.java#L407].
> For a cluster with lots of new connections, it's causing a performance issue. 
> The mitigation for us is to increase the {{system_auth}} replication factor 
> to match the number of nodes, so 
> [{{local_one}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/auth/CassandraRoleManager.java#L488]
>  would be very cheap. The P99 dropped immediately, but I don't think it is 
> a good solution.
> I would propose to add {{Role.canLogin}} to the RolesCache to improve the 
> auth performance.
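
The proposal above — caching {{Role.canLogin}} so each new connection does not hit {{system_auth}} — can be illustrated with a small TTL cache. This is a minimal sketch of the idea only, not Cassandra's actual {{RolesCache}}: the names ({{LoginCache}}, {{fetchCanLogin}}) and the expiry scheme are assumptions for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

// Illustrative TTL cache for login permission, standing in for the idea of
// adding Role.canLogin to the RolesCache. Not Cassandra's real implementation.
public class LoginCache
{
    private static final class Entry
    {
        final boolean canLogin;
        final long expiresAtMillis;
        Entry(boolean canLogin, long expiresAtMillis)
        {
            this.canLogin = canLogin;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final Predicate<String> fetchCanLogin; // the expensive lookup that hits system_auth

    public LoginCache(long ttlMillis, Predicate<String> fetchCanLogin)
    {
        this.ttlMillis = ttlMillis;
        this.fetchCanLogin = fetchCanLogin;
    }

    public boolean canLogin(String roleName)
    {
        long now = System.currentTimeMillis();
        Entry e = cache.get(roleName);
        if (e == null || e.expiresAtMillis < now)
        {
            // Cache miss or expired entry: do the expensive read once, then cache it
            // so repeated logins for the same role don't query the database each time.
            e = new Entry(fetchCanLogin.test(roleName), now + ttlMillis);
            cache.put(roleName, e);
        }
        return e.canLogin;
    }
}
```

With a cache like this, the per-connection cost drops to a map lookup once an entry is populated, which is the effect the ticket is after.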






[jira] [Commented] (CASSANDRA-14497) Add Role login cache

2018-08-31 Thread Sam Tunnicliffe (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599244#comment-16599244
 ] 

Sam Tunnicliffe commented on CASSANDRA-14497:
-

bq. {{logical}} role doesn't have password right? Can we use that?

I'm not sure I exactly follow. If you mean can we infer {{LOGIN}} from the lack 
of a password, then the answer is no because alternative {{IAuthenticator}} 
implementations may also not use passwords, but you still want users to be able 
to login.

 bq. Then if we disable {{authorizer}}, it should not do the login check right?

No, it still has to do that because it's a required privilege for connecting. I 
guess I over-simplified when I said perms are only the concern of the 
{{IAuthorizer}}.

 bq. Maybe my questions are beyond the scope of this ticket. If we just want to 
add a cache while minimizing the impact, I think the patch looks good.

I think there's definitely plenty of scope to improve the design of the auth 
subsystem, so let's open a 4.x JIRA to figure out exactly what we want. I'll 
commit this patch in the meantime to reduce the impact of high login rates.

Thanks [~jay.zhuang]

> Add Role login cache
> 
>
> Key: CASSANDRA-14497
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14497
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Auth
>Reporter: Jay Zhuang
>Assignee: Sam Tunnicliffe
>Priority: Major
>  Labels: security
> Fix For: 4.0
>
>
> The 
> [{{ClientState.login()}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/ClientState.java#L313]
>  function is used for all auth messages: 
> [{{AuthResponse.java:82}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/transport/messages/AuthResponse.java#L82].
>  But the 
> [{{role.canLogin}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/auth/CassandraRoleManager.java#L521]
>  information is not cached. So it hits the database every time: 
> [{{CassandraRoleManager.java:407}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/auth/CassandraRoleManager.java#L407].
> For a cluster with lots of new connections, it's causing a performance issue. 
> The mitigation for us is to increase the {{system_auth}} replication factor 
> to match the number of nodes, so 
> [{{local_one}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/auth/CassandraRoleManager.java#L488]
>  would be very cheap. The P99 dropped immediately, but I don't think it is 
> a good solution.
> I would propose to add {{Role.canLogin}} to the RolesCache to improve the 
> auth performance.






[jira] [Commented] (CASSANDRA-13304) Add checksumming to the native protocol

2018-08-31 Thread Dinesh Joshi (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-13304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599238#comment-16599238
 ] 

Dinesh Joshi commented on CASSANDRA-13304:
--

{{ChecksummingTransformer::transformInbound}} - Very minor nit: I prefer a 
ternary operator here. It's more concise. You don't have to change it. I'm +1 on 
the patch.

> Add checksumming to the native protocol
> ---
>
> Key: CASSANDRA-13304
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13304
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Michael Kjellman
>Assignee: Sam Tunnicliffe
>Priority: Blocker
>  Labels: client-impacting
> Fix For: 4.x
>
> Attachments: 13304_v1.diff, boxplot-read-throughput.png, 
> boxplot-write-throughput.png
>
>
> The native binary transport implementation doesn't include checksums. This 
> makes it highly susceptible to silently inserting corrupted data either due 
> to hardware issues causing bit flips on the sender/client side, C*/receiver 
> side, or network in between.
> Attaching an implementation that makes checksum'ing mandatory (assuming both 
> client and server know about a protocol version that supports checksums) -- 
> and also adds checksumming to clients that request compression.
> The serialized format looks something like this:
> {noformat}
>  *  1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
>  *  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |  Number of Compressed Chunks  | Compressed Length (e1)/
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * /  Compressed Length cont. (e1) |Uncompressed Length (e1)   /
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * | Uncompressed Length cont. (e1)| CRC32 Checksum of Lengths (e1)|
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * | Checksum of Lengths cont. (e1)|Compressed Bytes (e1)+//
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |  CRC32 Checksum (e1) ||
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |Compressed Length (e2) |
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |   Uncompressed Length (e2)|
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |CRC32 Checksum of Lengths (e2) |
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * | Compressed Bytes (e2)   +//
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |  CRC32 Checksum (e2) ||
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |Compressed Length (en) |
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |   Uncompressed Length (en)|
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |CRC32 Checksum of Lengths (en) |
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |  Compressed Bytes (en)  +//
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |  CRC32 Checksum (en) ||
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
> {noformat}
> The first pass here adds checksums only to the actual contents of the frame 
> body itself (and doesn't actually checksum lengths and headers). While it 
> would be great to fully add checksuming across the entire protocol, the 
> proposed implementation will ensure we at least catch corrupted data and 
> likely protect ourselves pretty well anyways.
> I didn't go to the trouble of implementing a Snappy Checksum'ed Compressor 
> implementation as it's been deprecated for a while -- it's really slow and 
> crappy compared to LZ4 -- and we should do everything in our power to make 
> sure no one in the community is still using it. I left it in (for obvious 
> backwards compatibility reasons) for old clients that don't know about the 
> new protocol.
> The current protocol has a 256MB (max) frame body -- where the serialized 
> contents are simply written in to the frame body.
> If the client sends a compression option in the startup, we will install a 
> FrameCompressor inline. Unfortunately, we went with a decision to treat the 
> frame body separately from the header 
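
The per-chunk layout in the diagram above — lengths first, a CRC32 guarding the lengths, then the compressed bytes followed by their own CRC32 — can be sketched as a simple encoder. This is a hedged illustration of the framing idea only, not Cassandra's {{ChecksummingTransformer}}; the class and method names are invented.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Illustrative encoder for one checksummed chunk, following the layout in the
// ticket description: compressed length, uncompressed length, CRC32 of the two
// lengths, payload bytes, CRC32 of the payload.
public final class ChunkWriter
{
    public static byte[] writeChunk(byte[] compressed, int uncompressedLength)
    {
        ByteBuffer buf = ByteBuffer.allocate(4 + 4 + 4 + compressed.length + 4);

        // Lengths first, then a CRC32 over the length fields themselves, so a
        // corrupted length can't make the reader allocate a garbage-sized buffer.
        buf.putInt(compressed.length);
        buf.putInt(uncompressedLength);
        CRC32 lengthsCrc = new CRC32();
        lengthsCrc.update(buf.array(), 0, 8);
        buf.putInt((int) lengthsCrc.getValue());

        // Payload, followed by a CRC32 over the payload bytes.
        buf.put(compressed);
        CRC32 payloadCrc = new CRC32();
        payloadCrc.update(compressed, 0, compressed.length);
        buf.putInt((int) payloadCrc.getValue());

        return buf.array();
    }
}
```

Checksumming the lengths separately from the payload is the design point the diagram encodes: a flipped bit in a length field is caught before any payload is read.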

[jira] [Commented] (CASSANDRA-13304) Add checksumming to the native protocol

2018-08-31 Thread Sam Tunnicliffe (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-13304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599223#comment-16599223
 ] 

Sam Tunnicliffe commented on CASSANDRA-13304:
-

Thanks both. [~jrwest], I've pulled your patch into my branch and added a couple 
more tests.

bq. Protocol comments
I agree that this could be useful, so I've implemented a first cut of it to 
validate the protocol changes. The test for whether compression is applied or 
not is super-brutal (i.e. it just applies the compression and checks the 
compressed size) but that could be refined later. I haven't updated either the 
diagram in {{ChecksummingTransformer}}, or the protocol spec, but those can 
come later (soon) if we commit this.

bq. In ChecksummingTransformer::transformOutbound L185 - consider adding a 
debug log statement as this is an unexpected event.
I considered this, but decided in the end that it wouldn't really be 
actionable, so a bit redundant. It would be more useful to add a metric for the 
number of times we have to resize a buffer, but probably only in 
conjunction with some targeted testing, which we can't easily do with the 
existing drivers. I'll open a follow-up JIRA for that.

bq. In ChecksummingTransformerTest - Could you add a zero length round trip 
test so we cover that corner case as well?
Added when pulling in Jordan's conversion of ChecksummingTransformerTest to a 
property-based test.

bq. cassandra.yaml entry for compression/checksum block size is missing
Added, thanks.

bq. There was some conversation above about using ProtocolException instead of 
IOException when the checksums don't match. It seemed like there was agreement 
on using ProtocolException but the code still uses IOException.
Good catch, thanks.

bq. Would be nice to move ChecksummingTransformer#readUnsignedShort to 
something like ByteBufUtil#readUnsignedShort. Similar to 
ByteBufferUtil#readShortLength.
I've moved it to CBUtil, which is the de facto equivalent of {{ByteBufferUtil}}.

bq. I thought that Optionals were not adding much value 
bq. StartupMessage#getChecksumType/getCompressor(): I'm not sure there is much 
benefit to using optional here given how its used at the call sites.
argh, both good catches. The use of {{Optional}} was much more widespread 
earlier on & I thought I'd removed it all, but I missed this. 

bq. The comment about why the frame.compress package defines an 
ICompressor-like interface was removed but is helpful since it's not obvious at 
first. It should probably be expanded on a bit as well. 
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/transport/FrameCompressor.java#L37
I've added some javadoc to {{o.a.c.transport.frame.compress.Compressor}}

bq. The created by comment at the top of ChecksummingCompressorTest should be 
removed
Done
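
The zero-length round-trip corner case discussed above can be exercised with a self-contained sketch: encode a payload with a trailing CRC32, decode it, and fail loudly on mismatch. The {{RuntimeException}} here merely stands in for the protocol-level exception the review settled on; the class and its format are illustrative, not Cassandra's actual code.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Illustrative checksummed round trip: length, payload, CRC32 of the payload.
public final class RoundTrip
{
    static byte[] encode(byte[] payload)
    {
        ByteBuffer buf = ByteBuffer.allocate(4 + payload.length + 4);
        buf.putInt(payload.length);
        buf.put(payload);
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        buf.putInt((int) crc.getValue());
        return buf.array();
    }

    static byte[] decode(byte[] frame)
    {
        ByteBuffer buf = ByteBuffer.wrap(frame);
        byte[] payload = new byte[buf.getInt()];
        buf.get(payload);
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        if ((int) crc.getValue() != buf.getInt())
            throw new RuntimeException("checksum mismatch"); // stand-in for a protocol exception
        return payload;
    }
}
```

A property-based test of this shape simply asserts `decode(encode(x)).equals(x)` for arbitrary byte arrays, including the empty one.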


> Add checksumming to the native protocol
> ---
>
> Key: CASSANDRA-13304
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13304
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Reporter: Michael Kjellman
>Assignee: Sam Tunnicliffe
>Priority: Blocker
>  Labels: client-impacting
> Fix For: 4.x
>
> Attachments: 13304_v1.diff, boxplot-read-throughput.png, 
> boxplot-write-throughput.png
>
>
> The native binary transport implementation doesn't include checksums. This 
> makes it highly susceptible to silently inserting corrupted data either due 
> to hardware issues causing bit flips on the sender/client side, C*/receiver 
> side, or network in between.
> Attaching an implementation that makes checksum'ing mandatory (assuming both 
> client and server know about a protocol version that supports checksums) -- 
> and also adds checksumming to clients that request compression.
> The serialized format looks something like this:
> {noformat}
>  *  1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3
>  *  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |  Number of Compressed Chunks  | Compressed Length (e1)/
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * /  Compressed Length cont. (e1) |Uncompressed Length (e1)   /
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * | Uncompressed Length cont. (e1)| CRC32 Checksum of Lengths (e1)|
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * | Checksum of Lengths cont. (e1)|Compressed Bytes (e1)+//
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |  CRC32 Checksum (e1) ||
>  * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
>  * |

[jira] [Commented] (CASSANDRA-14619) Create fqltool compare command

2018-08-31 Thread Dinesh Joshi (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599225#comment-16599225
 ] 

Dinesh Joshi commented on CASSANDRA-14619:
--

[~krummas] Other than just one minor thing, I'm +1 on the PR. Please fix on 
commit - {{FQLQuery::toString}} is using {{"\n"}}, it would be better to use 
{{System.getProperty("line.separator")}}.

> Create fqltool compare command
> --
>
> Key: CASSANDRA-14619
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14619
> Project: Cassandra
>  Issue Type: New Feature
>Reporter: Marcus Eriksson
>Assignee: Marcus Eriksson
>Priority: Major
>  Labels: fqltool
> Fix For: 4.x
>
>
> We need a {{fqltool compare}} command that can take the recorded runs from 
> CASSANDRA-14618 and compares them, it should output any differences and 
> potentially all queries against the mismatching partition up until the 
> mismatch






[jira] [Commented] (CASSANDRA-14619) Create fqltool compare command

2018-08-31 Thread Dinesh Joshi (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599169#comment-16599169
 ] 

Dinesh Joshi commented on CASSANDRA-14619:
--

I will take a look at it today.

> Create fqltool compare command
> --
>
> Key: CASSANDRA-14619
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14619
> Project: Cassandra
>  Issue Type: New Feature
>Reporter: Marcus Eriksson
>Assignee: Marcus Eriksson
>Priority: Major
>  Labels: fqltool
> Fix For: 4.x
>
>
> We need a {{fqltool compare}} command that can take the recorded runs from 
> CASSANDRA-14618 and compares them, it should output any differences and 
> potentially all queries against the mismatching partition up until the 
> mismatch






[jira] [Commented] (CASSANDRA-14619) Create fqltool compare command

2018-08-31 Thread Marcus Eriksson (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599153#comment-16599153
 ] 

Marcus Eriksson commented on CASSANDRA-14619:
-

bq. , every columnDefinition and row entry that is written out is prefixed with 
a 4-byte version number.
I would really like to keep it in each document, to be able to parse them 
out of context. Let me know if you have really strong feelings about it, but I 
think the size of this will be tiny compared to actual result sets etc. And I 
changed it to two bytes.

I have pushed a branch rebased on almost latest trunk including the updated 
fields etc here: https://github.com/krummas/cassandra/commits/marcuse/fql_rebase
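
The two-byte version prefix described above can be sketched as follows. {{Versioned}} and {{CURRENT_VERSION}} are illustrative names, not the actual fqltool serialization code; the point is only that the prefix is cheap and lets each record be parsed out of context.

```java
import java.nio.ByteBuffer;

// Illustrative two-byte version prefix on each serialized document.
final class Versioned
{
    static final short CURRENT_VERSION = 1;

    static byte[] write(byte[] payload)
    {
        ByteBuffer buf = ByteBuffer.allocate(2 + payload.length);
        buf.putShort(CURRENT_VERSION); // two bytes, negligible next to real result sets
        buf.put(payload);
        return buf.array();
    }

    static short readVersion(byte[] record)
    {
        // Readable without any surrounding context, which is the motivation above.
        return ByteBuffer.wrap(record).getShort();
    }
}
```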


> Create fqltool compare command
> --
>
> Key: CASSANDRA-14619
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14619
> Project: Cassandra
>  Issue Type: New Feature
>Reporter: Marcus Eriksson
>Assignee: Marcus Eriksson
>Priority: Major
>  Labels: fqltool
> Fix For: 4.x
>
>
> We need a {{fqltool compare}} command that can take the recorded runs from 
> CASSANDRA-14618 and compares them, it should output any differences and 
> potentially all queries against the mismatching partition up until the 
> mismatch






[jira] [Commented] (CASSANDRA-14404) Transient Replication & Cheap Quorums: Decouple storage requirements from consensus group size using incremental repair

2018-08-31 Thread Ariel Weisberg (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599146#comment-16599146
 ] 

Ariel Weisberg commented on CASSANDRA-14404:


There are no transient nodes. All nodes are the same. If you have transient 
replication enabled, each node will transiently replicate some ranges instead of 
fully replicating them.

Capacity requirements are reduced evenly across all nodes in the cluster.

Nodes are not temporarily transient replicas during expansion. They need to 
stream data like a full replica for the transient range before they can serve 
reads. There is a pending state similar to how there is a pending state for 
full replicas. Transient replicas also always receive writes when they are 
pending. There may be some room to relax how that is handled, but for now we 
opt to send pending transient ranges a bit more data and avoid reading from 
them when maybe we could.

This doesn't change how expansion works with vnodes. The same restrictions 
still apply. We won't officially support vnodes until we have done more testing 
and really thought through the corner cases. It's quite possible we will relax 
the restriction on creating transient keyspaces with vnodes in 4.0.x.

> Transient Replication & Cheap Quorums: Decouple storage requirements from 
> consensus group size using incremental repair
> ---
>
> Key: CASSANDRA-14404
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14404
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Coordination, Core, CQL, Distributed Metadata, Hints, 
> Local Write-Read Paths, Materialized Views, Repair, Secondary Indexes, 
> Testing, Tools
>Reporter: Ariel Weisberg
>Assignee: Ariel Weisberg
>Priority: Major
> Fix For: 4.0
>
>
> Transient Replication is an implementation of [Witness 
> Replicas|http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.146.3429=rep1=pdf]
>  that leverages incremental repair to make full replicas consistent with 
> transient replicas that don't store the entire data set. Witness replicas are 
> used in real world systems such as Megastore and Spanner to increase 
> availability inexpensively without having to commit to more full copies of 
> the database. Transient replicas implement functionality similar to 
> upgradable and temporary replicas from the paper.
> With transient replication the replication factor is increased beyond the 
> desired level of data redundancy by adding replicas that only store data when 
> sufficient full replicas are unavailable to store the data. These replicas 
> are called transient replicas. When incremental repair runs transient 
> replicas stream any data they have received to full replicas and once the 
> data is fully replicated it is dropped at the transient replicas.
> Cheap quorums are a further set of optimizations on the write path to avoid 
> writing to transient replicas unless sufficient full replicas are available 
> as well as optimizations on the read path to prefer reading from transient 
> replicas. When writing at quorum to a table configured to use transient 
> replication the quorum will always prefer available full replicas over 
> transient replicas so that transient replicas don't have to process writes. 
> Rapid write protection (similar to rapid read protection) reduces tail 
> latency when full replicas are slow/unavailable to respond by sending writes 
> to additional replicas if necessary.
> Transient replicas can generally service reads faster because they don't have 
> to do anything beyond bloom filter checks if they have no data. With vnodes 
> and larger size clusters they will not have a large quantity of data even in 
> failure cases where transient replicas start to serve a steady amount of 
> write traffic for some of their transiently replicated ranges.
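
The write-path preference described above — available full replicas first, transient replicas only when too few full replicas are up to reach the quorum — can be sketched as a simple selection routine. {{Replica}} and {{CheapQuorum}} here are illustrative stand-ins; Cassandra's real replica plans are considerably more involved.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model of a replica for one token range.
final class Replica
{
    final String endpoint;
    final boolean full;   // full vs transient replica for this range
    final boolean alive;
    Replica(String endpoint, boolean full, boolean alive)
    {
        this.endpoint = endpoint;
        this.full = full;
        this.alive = alive;
    }
}

// Sketch of "cheap quorums": transient replicas only receive a write when
// insufficient full replicas are available to satisfy the quorum.
final class CheapQuorum
{
    static List<Replica> selectForWrite(List<Replica> replicas, int quorum)
    {
        List<Replica> targets = new ArrayList<>();
        for (Replica r : replicas)               // prefer available full replicas
            if (r.full && r.alive)
                targets.add(r);
        for (Replica r : replicas)               // fall back to transient replicas
        {
            if (targets.size() >= quorum)
                break;
            if (!r.full && r.alive)
                targets.add(r);
        }
        return targets;
    }
}
```

Under this scheme transient replicas stay empty in the healthy case, which is what lets incremental repair drop their data after it has been streamed to full replicas.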






[jira] [Updated] (CASSANDRA-14675) Log the actual (if server-generated) timestamp and nowInSeconds used by queries in FQL

2018-08-31 Thread Aleksey Yeschenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksey Yeschenko updated CASSANDRA-14675:
--
Status: Patch Available  (was: In Progress)

> Log the actual (if server-generated) timestamp and nowInSeconds used by 
> queries in FQL
> --
>
> Key: CASSANDRA-14675
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14675
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Aleksey Yeschenko
>Assignee: Aleksey Yeschenko
>Priority: Major
>  Labels: fqltool
> Fix For: 4.0.x
>
>
> FQL doesn't currently log the actual timestamp - in microseconds - if it's 
> been server generated, nor the nowInSeconds value. It needs to, to allow for 
> - in conjunction with CASSANDRA-14664 and CASSANDRA-14671 - deterministic 
> playback tests.






[jira] [Updated] (CASSANDRA-14675) Log the actual (if server-generated) timestamp and nowInSeconds used by queries in FQL

2018-08-31 Thread Aleksey Yeschenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksey Yeschenko updated CASSANDRA-14675:
--
   Resolution: Fixed
Fix Version/s: (was: 4.0.x)
   4.0
   Status: Resolved  (was: Patch Available)

> Log the actual (if server-generated) timestamp and nowInSeconds used by 
> queries in FQL
> --
>
> Key: CASSANDRA-14675
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14675
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Aleksey Yeschenko
>Assignee: Aleksey Yeschenko
>Priority: Major
>  Labels: fqltool
> Fix For: 4.0
>
>
> FQL doesn't currently log the actual timestamp - in microseconds - if it's 
> been server generated, nor the nowInSeconds value. It needs to, to allow for 
> - in conjunction with CASSANDRA-14664 and CASSANDRA-14671 - deterministic 
> playback tests.






[jira] [Commented] (CASSANDRA-14675) Log the actual (if server-generated) timestamp and nowInSeconds used by queries in FQL

2018-08-31 Thread Aleksey Yeschenko (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599139#comment-16599139
 ] 

Aleksey Yeschenko commented on CASSANDRA-14675:
---

Thanks, committed as 
[5b645de13f8bea775d5a979712b3bea910960255|https://github.com/apache/cassandra/commit/5b645de13f8bea775d5a979712b3bea910960255]
 to trunk.

FQL/AuditLog code needs more cleanup. I did as much as I could as part of this 
patch.

> Log the actual (if server-generated) timestamp and nowInSeconds used by 
> queries in FQL
> --
>
> Key: CASSANDRA-14675
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14675
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Aleksey Yeschenko
>Assignee: Aleksey Yeschenko
>Priority: Major
>  Labels: fqltool
> Fix For: 4.0
>
>
> FQL doesn't currently log the actual timestamp - in microseconds - if it's 
> been server generated, nor the nowInSeconds value. It needs to, to allow for 
> - in conjunction with CASSANDRA-14664 and CASSANDRA-14671 - deterministic 
> playback tests.






cassandra git commit: Log the server-generated timestamp and nowInSeconds used by queries in FQL

2018-08-31 Thread aleksey
Repository: cassandra
Updated Branches:
  refs/heads/trunk 1e2f5244e -> 5b645de13


Log the server-generated timestamp and nowInSeconds used by queries in FQL

patch by Aleksey Yeschenko; reviewed by Marcus Eriksson for
CASSANDRA-14675


Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo
Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/5b645de1
Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/5b645de1
Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/5b645de1

Branch: refs/heads/trunk
Commit: 5b645de13f8bea775d5a979712b3bea910960255
Parents: 1e2f524
Author: Aleksey Yeshchenko 
Authored: Fri Aug 31 14:47:02 2018 +0100
Committer: Aleksey Yeshchenko 
Committed: Fri Aug 31 19:42:23 2018 +0100

--
 CHANGES.txt |   1 +
 .../apache/cassandra/audit/AuditLogEntry.java   |   5 +
 .../apache/cassandra/audit/AuditLogManager.java |  11 +-
 .../apache/cassandra/audit/BinAuditLogger.java  |   6 +-
 .../cassandra/audit/BinLogAuditLogger.java  |  93 ---
 .../apache/cassandra/audit/FullQueryLogger.java | 255 ++-
 .../org/apache/cassandra/cql3/QueryOptions.java |   3 +-
 .../apache/cassandra/service/QueryState.java|  22 +-
 .../apache/cassandra/tools/fqltool/Dump.java| 148 +++
 .../transport/messages/BatchMessage.java|   2 +-
 .../cassandra/audit/FullQueryLoggerTest.java| 129 ++
 11 files changed, 418 insertions(+), 257 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/cassandra/blob/5b645de1/CHANGES.txt
--
diff --git a/CHANGES.txt b/CHANGES.txt
index d2d9c86..9e76586 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -1,4 +1,5 @@
 4.0
+ * Log server-generated timestamp and nowInSeconds used by queries in FQL 
(CASSANDRA-14675)
  * Add diagnostic events for read repairs (CASSANDRA-14668)
  * Use consistent nowInSeconds and timestamps values within a request 
(CASSANDRA-14671)
  * Add sampler for query time and expose with nodetool (CASSANDRA-14436)

http://git-wip-us.apache.org/repos/asf/cassandra/blob/5b645de1/src/java/org/apache/cassandra/audit/AuditLogEntry.java
--
diff --git a/src/java/org/apache/cassandra/audit/AuditLogEntry.java 
b/src/java/org/apache/cassandra/audit/AuditLogEntry.java
index 0b891d4..4d3b867 100644
--- a/src/java/org/apache/cassandra/audit/AuditLogEntry.java
+++ b/src/java/org/apache/cassandra/audit/AuditLogEntry.java
@@ -153,6 +153,11 @@ public class AuditLogEntry
 return options;
 }
 
+public QueryState getState()
+{
+return state;
+}
+
 public static class Builder
 {
 private static final InetAddressAndPort DEFAULT_SOURCE;

http://git-wip-us.apache.org/repos/asf/cassandra/blob/5b645de1/src/java/org/apache/cassandra/audit/AuditLogManager.java
--
diff --git a/src/java/org/apache/cassandra/audit/AuditLogManager.java 
b/src/java/org/apache/cassandra/audit/AuditLogManager.java
index ab9c2e9..25966f7 100644
--- a/src/java/org/apache/cassandra/audit/AuditLogManager.java
+++ b/src/java/org/apache/cassandra/audit/AuditLogManager.java
@@ -33,6 +33,7 @@ import org.apache.cassandra.config.DatabaseDescriptor;
 import org.apache.cassandra.cql3.CQLStatement;
 import org.apache.cassandra.cql3.QueryHandler;
 import org.apache.cassandra.cql3.QueryOptions;
+import org.apache.cassandra.cql3.statements.BatchStatement;
 import org.apache.cassandra.exceptions.AuthenticationException;
 import org.apache.cassandra.exceptions.ConfigurationException;
 import org.apache.cassandra.exceptions.UnauthorizedException;
@@ -187,7 +188,13 @@ public class AuditLogManager
 /**
  * Logs Batch queries to both FQL and standard audit logger.
  */
-public void logBatch(String batchTypeName, List queryOrIdList, 
List> values, List prepared, 
QueryOptions options, QueryState state, long queryStartTimeMillis)
+public void logBatch(BatchStatement.Type type,
+ List queryOrIdList,
+ List> values,
+ List prepared,
+ QueryOptions options,
+ QueryState state,
+ long queryStartTimeMillis)
 {
 if (isAuditingEnabled())
 {
@@ -205,7 +212,7 @@ public class AuditLogManager
 {
 queryStrings.add(prepStatment.rawCQLStatement);
 }
-fullQueryLogger.logBatch(batchTypeName, queryStrings, values, 
options, queryStartTimeMillis);
+fullQueryLogger.logBatch(type, queryStrings, values, options, 
state, queryStartTimeMillis);
 }
 }
 


[jira] [Commented] (CASSANDRA-14685) Incremental repair 4.0 : SSTables remain locked forever if the coordinator dies during streaming

2018-08-31 Thread Alexander Dejanovski (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599128#comment-16599128
 ] 

Alexander Dejanovski commented on CASSANDRA-14685:
--

[~jasobrown],

Indeed, nodes 2 and 3 are still showing ongoing streams although node1 is down:

 
{noformat}
$ ccm node2 nodetool netstats
Mode: NORMAL
Repair e28883b0-ad4b-11e8-82ca-5fbf27df5fb6
 /127.0.0.1
 Sending 2 files, 49304220 bytes total. Already sent 0 files, 5373952 bytes 
total
 
/Users/adejanovski/.ccm/inc-repair-issue/node2/data0/tlp_stress/sensor_data-67193da0ad4b11e88663cb45de9ab9e9/na-9-big-Data.db
 5373952/34243878 bytes(15%) sent to idx:0/127.0.0.1
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name Active Pending Completed Dropped
Large messages n/a 0 2 0
Small messages n/a 0 244612 0
Gossip messages n/a 23 531 0
$ ccm node3 nodetool netstats
Mode: NORMAL
Repair e269d820-ad4b-11e8-82ca-5fbf27df5fb6
 /127.0.0.1
 Sending 2 files, 49166315 bytes total. Already sent 1 files, 11748602 bytes 
total
 
/Users/adejanovski/.ccm/inc-repair-issue/node3/data0/tlp_stress/sensor_data-67193da0ad4b11e88663cb45de9ab9e9/na-11-big-Data.db
 8865018/8865018 bytes(100%) sent to idx:0/127.0.0.1
 
/Users/adejanovski/.ccm/inc-repair-issue/node3/data0/tlp_stress/sensor_data-67193da0ad4b11e88663cb45de9ab9e9/na-9-big-Data.db
 2883584/34198115 bytes(8%) sent to idx:0/127.0.0.1
Read Repair Statistics:
Attempted: 0
Mismatch (Blocking): 0
Mismatch (Background): 0
Pool Name Active Pending Completed Dropped
Large messages n/a 0 2 0
Small messages n/a 0 244611 0
Gossip messages n/a 0 820 0
{noformat}
 

> Incremental repair 4.0 : SSTables remain locked forever if the coordinator 
> dies during streaming 
> -
>
> Key: CASSANDRA-14685
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14685
> Project: Cassandra
>  Issue Type: Bug
>  Components: Repair
>Reporter: Alexander Dejanovski
>Assignee: Jason Brown
>Priority: Critical
>
> The changes in CASSANDRA-9143 modified the way incremental repair works by 
> applying the following sequence of events: 
>  * Anticompaction is executed on all replicas for all SSTables overlapping 
> the repaired ranges
>  * Anticompacted SSTables are then marked as "Pending repair" and cannot be 
> compacted anymore, nor be part of another repair session
>  * Merkle trees are generated and compared
>  * Streaming takes place if needed
>  * Anticompaction is committed and "pending repair" tables are marked as 
> repaired if it succeeded, or they are released if the repair session failed.
> If the repair coordinator dies during the streaming phase, *the SSTables on 
> the replicas will remain in "pending repair" state and will never be eligible 
> for repair or compaction*, even after all the nodes in the cluster are 
> restarted. 
> Steps to reproduce (I've used Jason's 13938 branch that fixes streaming 
> errors) : 
> {noformat}
> ccm create inc-repair-issue -v github:jasobrown/13938 -n 3
> # Allow jmx access and remove all rpc_ settings in yaml
> for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra-env.sh;
> do
>   sed -i'' -e 
> 's/com.sun.management.jmxremote.authenticate=true/com.sun.management.jmxremote.authenticate=false/g'
>  $f
> done
> for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra.yaml;
> do
>   grep -v "rpc_" $f > ${f}.tmp
>   cat ${f}.tmp > $f
> done
> ccm start
> {noformat}
> I used [tlp-stress|https://github.com/thelastpickle/tlp-stress] to generate a 
> few tens of MBs of data (killed it after some time). Obviously 
> cassandra-stress works as well:
> {noformat}
> bin/tlp-stress run BasicTimeSeries -i 1M -p 1M -t 2 --rate 5000  
> --replication "{'class':'SimpleStrategy', 'replication_factor':2}"   
> --compaction "{'class': 'SizeTieredCompactionStrategy'}"   --host 
> 127.0.0.1
> {noformat}
> Flush and delete all SSTables in node1 :
> {noformat}
> ccm node1 nodetool flush
> ccm node1 stop
> rm -f ~/.ccm/inc-repair-issue/node1/data0/tlp_stress/sensor*/*.*
> ccm node1 start{noformat}
> Then throttle streaming throughput to 1MB/s so we have time to take node1 
> down during the streaming phase and run repair:
> {noformat}
> ccm node1 nodetool setstreamthroughput 1
> ccm node2 nodetool setstreamthroughput 1
> ccm node3 nodetool setstreamthroughput 1
> ccm node1 nodetool repair tlp_stress
> {noformat}
> Once streaming starts, shut down node1 and start it again :
> {noformat}
> ccm node1 stop
> ccm node1 start
> {noformat}
> Run repair again :
> {noformat}
> ccm node1 nodetool repair tlp_stress
> {noformat}
> The command will return very quickly, showing that it skipped all sstables :
> {noformat}
> [2018-08-31 19:05:16,292] Repair completed successfully
> 

[jira] [Updated] (CASSANDRA-14685) Incremental repair 4.0 : SSTables remain locked forever if the coordinator dies during streaming

2018-08-31 Thread Alexander Dejanovski (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Dejanovski updated CASSANDRA-14685:
-
Description: 
The changes in CASSANDRA-9143 modified the way incremental repair operates, 
applying the following sequence of events: 
 * Anticompaction is executed on all replicas for all SSTables overlapping the 
repaired ranges
 * Anticompacted SSTables are then marked as "Pending repair" and can no longer 
be compacted, nor be part of another repair session
 * Merkle trees are generated and compared
 * Streaming takes place if needed
 * Anticompaction is committed: "pending repair" tables are marked as repaired 
if the session succeeded, or released if it failed.

If the repair coordinator dies during the streaming phase, *the SSTables on the 
replicas will remain in "pending repair" state and will never be eligible for 
repair or compaction*, even after all the nodes in the cluster are restarted. 
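The stuck state can be pictured with a minimal state machine. This is an illustrative sketch only; the class and method names are hypothetical and do not match Cassandra's actual repair code:

```java
import java.util.UUID;

// Hypothetical model of the sstable repair states described above.
enum RepairState { UNREPAIRED, PENDING_REPAIR, REPAIRED }

class SSTableRepairStatus
{
    RepairState state = RepairState.UNREPAIRED;
    UUID pendingSession;

    // Anticompaction claims the sstable for one session; from then on it can
    // no longer be compacted or be part of another repair session.
    void markPending(UUID session)
    {
        if (state == RepairState.PENDING_REPAIR)
            throw new IllegalStateException("already pending for " + pendingSession);
        state = RepairState.PENDING_REPAIR;
        pendingSession = session;
    }

    // Runs when the session commits or fails. If the coordinator dies before
    // this step ever runs, the sstable stays in PENDING_REPAIR indefinitely,
    // which is the symptom reported in this ticket.
    void finishSession(boolean success)
    {
        state = success ? RepairState.REPAIRED : RepairState.UNREPAIRED;
        pendingSession = null;
    }
}
```

In this model, a dead coordinator means no `finishSession` call, leaving a non-null pending repair session that nothing will ever clear.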

Steps to reproduce (I've used Jason's 13938 branch, which fixes streaming 
errors): 
{noformat}
ccm create inc-repair-issue -v github:jasobrown/13938 -n 3

# Allow jmx access and remove all rpc_ settings in yaml
for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra-env.sh;
do
  sed -i'' -e 
's/com.sun.management.jmxremote.authenticate=true/com.sun.management.jmxremote.authenticate=false/g'
 $f
done

for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra.yaml;
do
  grep -v "rpc_" $f > ${f}.tmp
  cat ${f}.tmp > $f
done

ccm start
{noformat}
I used [tlp-stress|https://github.com/thelastpickle/tlp-stress] to generate a 
few tens of MBs of data (killed it after some time). Obviously cassandra-stress 
works as well:
{noformat}
bin/tlp-stress run BasicTimeSeries -i 1M -p 1M -t 2 --rate 5000  
--replication "{'class':'SimpleStrategy', 'replication_factor':2}"   
--compaction "{'class': 'SizeTieredCompactionStrategy'}"   --host 127.0.0.1
{noformat}
Flush and delete all SSTables in node1 :
{noformat}
ccm node1 nodetool flush
ccm node1 stop
rm -f ~/.ccm/inc-repair-issue/node1/data0/tlp_stress/sensor*/*.*
ccm node1 start{noformat}
Then throttle streaming throughput to 1MB/s so we have time to take node1 down 
during the streaming phase and run repair:
{noformat}
ccm node1 nodetool setstreamthroughput 1
ccm node2 nodetool setstreamthroughput 1
ccm node3 nodetool setstreamthroughput 1
ccm node1 nodetool repair tlp_stress
{noformat}
Once streaming starts, shut down node1 and start it again :
{noformat}
ccm node1 stop
ccm node1 start
{noformat}
Run repair again :
{noformat}
ccm node1 nodetool repair tlp_stress
{noformat}
The command will return very quickly, showing that it skipped all sstables :
{noformat}
[2018-08-31 19:05:16,292] Repair completed successfully
[2018-08-31 19:05:16,292] Repair command #1 finished in 2 seconds

$ ccm node1 nodetool status

Datacenter: datacenter1
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  AddressLoad   Tokens   OwnsHost ID  
 Rack
UN  127.0.0.1  228,64 KiB  256  ?   
437dc9cd-b1a1-41a5-961e-cfc99763e29f  rack1
UN  127.0.0.2  60,09 MiB  256  ?   
fbcbbdbb-e32a-4716-8230-8ca59aa93e62  rack1
UN  127.0.0.3  57,59 MiB  256  ?   
a0b1bcc6-0fad-405a-b0bf-180a0ca31dd0  rack1
{noformat}
sstablemetadata will then show that nodes 2 and 3 have SSTables still in 
"pending repair" state :
{noformat}
~/.ccm/repository/gitCOLONtrunk/tools/bin/sstablemetadata na-4-big-Data.db | 
grep repair
SSTable: 
/Users/adejanovski/.ccm/inc-repair-4.0/node2/data0/tlp_stress/sensor_data-b7375660ad3111e8a0e59357ff9c9bda/na-4-big
Pending repair: 3844a400-ad33-11e8-b5a7-6b8dd8f31b62
{noformat}
Restarting these nodes wouldn't help either.

  was:
The changes in CASSANDRA-9143 modified the way incremental repair performs by 
applying the following sequence of events : 
 * Anticompaction is executed on all replicas for all SSTables overlapping the 
repaired ranges
 * Anticompacted SSTables are then marked as "Pending repair" and cannot be 
compacted anymore, nor part of another repair session
 * Merkle trees are generated and compared
 * Streaming takes place if needed
 * Anticompaction is committed and "pending repair" table are marked as 
repaired if it succeeded, or they are released if the repair session failed.

If the repair coordinator dies during the streaming phase, *the SSTables on the 
replicas will remain in "pending repair" state and will never be eligible for 
repair or compaction*, even after all the nodes in the cluster are restarted. 

Steps to reproduce (I've used Jason's 13938 branch that fixes streaming errors) 
: 
{noformat}
ccm create inc-repair-issue -v github:jasobrown/13938 -n 3

# Allow jmx access and remove all rpc_ settings in yaml
for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra-env.sh;
do
  sed -i'' -e 

[jira] [Commented] (CASSANDRA-14404) Transient Replication & Cheap Quorums: Decouple storage requirements from consensus group size using incremental repair

2018-08-31 Thread Constance Eustace (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599094#comment-16599094
 ] 

Constance Eustace commented on CASSANDRA-14404:
---

So are these transient nodes basically serving as centralized hinted handoff 
caches, rather than having hinted handoffs clutter up full replicas, especially 
nodes that have no concern for the token range involved? I understand that 
hinted handoffs aren't being replaced by this, but is that the general idea?

Are the transient nodes sitting around?

Will the transient nodes have cheaper/lower hardware requirements?

During cluster expansion, does a newly joining node that is streaming in data 
function as a temporary transient replica until it becomes a full replica? 
Likewise, while shrinking, does a previously full replica function as a 
transient replica while it streams off its data?

Can this help vnode expansion with multiple concurrent nodes? Admittedly I'm 
not familiar with how much work has gone into fixing cluster expansion with 
vnodes; my understanding is that you typically expand only one node at a time, 
or in multiples of the datacenter size.

> Transient Replication & Cheap Quorums: Decouple storage requirements from 
> consensus group size using incremental repair
> ---
>
> Key: CASSANDRA-14404
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14404
> Project: Cassandra
>  Issue Type: New Feature
>  Components: Coordination, Core, CQL, Distributed Metadata, Hints, 
> Local Write-Read Paths, Materialized Views, Repair, Secondary Indexes, 
> Testing, Tools
>Reporter: Ariel Weisberg
>Assignee: Ariel Weisberg
>Priority: Major
> Fix For: 4.0
>
>
> Transient Replication is an implementation of [Witness 
> Replicas|http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.146.3429=rep1=pdf]
>  that leverages incremental repair to make full replicas consistent with 
> transient replicas that don't store the entire data set. Witness replicas are 
> used in real world systems such as Megastore and Spanner to increase 
> availability inexpensively without having to commit to more full copies of 
> the database. Transient replicas implement functionality similar to 
> upgradable and temporary replicas from the paper.
> With transient replication the replication factor is increased beyond the 
> desired level of data redundancy by adding replicas that only store data when 
> sufficient full replicas are unavailable to store the data. These replicas 
> are called transient replicas. When incremental repair runs transient 
> replicas stream any data they have received to full replicas and once the 
> data is fully replicated it is dropped at the transient replicas.
> Cheap quorums are a further set of optimizations on the write path to avoid 
> writing to transient replicas unless sufficient full replicas are available 
> as well as optimizations on the read path to prefer reading from transient 
> replicas. When writing at quorum to a table configured to use transient 
> replication the quorum will always prefer available full replicas over 
> transient replicas so that transient replicas don't have to process writes. 
> Rapid write protection (similar to rapid read protection) reduces tail 
> latency when full replicas are slow/unavailable to respond by sending writes 
> to additional replicas if necessary.
> Transient replicas can generally service reads faster because they don't have 
> to do anything beyond bloom filter checks if they have no data. With vnodes 
> and larger size clusters they will not have a large quantity of data even in 
> failure cases where transient replicas start to serve a steady amount of 
> write traffic for some of their transiently replicated ranges.
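The write-path preference described above (use transient replicas only when too few full replicas are available) can be sketched as a simple selection routine. This is an illustration under stated assumptions; `writePlan` and its arguments are hypothetical names, not Cassandra's actual replica-plan API:

```java
import java.util.ArrayList;
import java.util.List;

class CheapQuorumSketch
{
    // Prefer live full replicas; fall back to transient replicas only when
    // there are not enough live full replicas to reach the quorum.
    static List<String> writePlan(List<String> liveFull, List<String> liveTransient, int quorum)
    {
        List<String> plan = new ArrayList<>(liveFull.subList(0, Math.min(quorum, liveFull.size())));
        for (String t : liveTransient)
        {
            if (plan.size() >= quorum)
                break;
            plan.add(t);
        }
        if (plan.size() < quorum)
            throw new IllegalStateException("insufficient live replicas for quorum");
        return plan;
    }
}
```

With all full replicas healthy the plan never touches a transient replica, which is what keeps transient replicas cheap in the common case.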



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Assigned] (CASSANDRA-14685) Incremental repair 4.0 : SSTables remain locked forever if the coordinator dies during streaming

2018-08-31 Thread Jason Brown (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Brown reassigned CASSANDRA-14685:
---

Assignee: Jason Brown

> Incremental repair 4.0 : SSTables remain locked forever if the coordinator 
> dies during streaming 
> -
>
> Key: CASSANDRA-14685
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14685
> Project: Cassandra
>  Issue Type: Bug
>  Components: Repair
>Reporter: Alexander Dejanovski
>Assignee: Jason Brown
>Priority: Critical
>
> The changes in CASSANDRA-9143 modified the way incremental repair operates, 
> applying the following sequence of events: 
>  * Anticompaction is executed on all replicas for all SSTables overlapping 
> the repaired ranges
>  * Anticompacted SSTables are then marked as "Pending repair" and can no 
> longer be compacted, nor be part of another repair session
>  * Merkle trees are generated and compared
>  * Streaming takes place if needed
>  * Anticompaction is committed: "pending repair" tables are marked as 
> repaired if the session succeeded, or released if it failed.
> If the repair coordinator dies during the streaming phase, *the SSTables on 
> the replicas will remain in "pending repair" state and will never be eligible 
> for repair or compaction*, even after all the nodes in the cluster are 
> restarted. 
> Steps to reproduce (I've used Jason's 13938 branch, which fixes streaming 
> errors): 
> {noformat}
> ccm create inc-repair-issue -v github:jasobrown/13938 -n 3
> # Allow jmx access and remove all rpc_ settings in yaml
> for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra-env.sh;
> do
>   sed -i'' -e 
> 's/com.sun.management.jmxremote.authenticate=true/com.sun.management.jmxremote.authenticate=false/g'
>  $f
> done
> for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra.yaml;
> do
>   grep -v "rpc_" $f > ${f}.tmp
>   cat ${f}.tmp > $f
> done
> ccm start
> {noformat}
> I used [tlp-stress|https://github.com/thelastpickle/tlp-stress] to generate a 
> few tens of MBs of data (killed it after some time). Obviously 
> cassandra-stress works as well:
> {noformat}
> bin/tlp-stress run BasicTimeSeries -i 1M -p 1M -t 2 --rate 5000  
> --replication "{'class':'SimpleStrategy', 'replication_factor':2}"   
> --compaction "{'class': 'SizeTieredCompactionStrategy'}"   --host 
> 127.0.0.1
> {noformat}
> Flush and delete all SSTables in node1 :
> {noformat}
> ccm node1 nodetool flush
> rm -f ~/.ccm/inc-repair-issue/node1/data0/tlp_stress/sensor*/*.*
> {noformat}
> Then throttle streaming throughput to 1MB/s so we have time to take node1 
> down during the streaming phase and run repair:
> {noformat}
> ccm node1 nodetool setstreamthroughput 1
> ccm node2 nodetool setstreamthroughput 1
> ccm node3 nodetool setstreamthroughput 1
> ccm node1 nodetool repair tlp_stress
> {noformat}
> Once streaming starts, shut down node1 and start it again :
> {noformat}
> ccm node1 stop
> ccm node1 start
> {noformat}
> Run repair again :
> {noformat}
> ccm node1 nodetool repair tlp_stress
> {noformat}
> The command will return very quickly, showing that it skipped all sstables :
> {noformat}
> [2018-08-31 19:05:16,292] Repair completed successfully
> [2018-08-31 19:05:16,292] Repair command #1 finished in 2 seconds
> $ ccm node1 nodetool status
> Datacenter: datacenter1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  AddressLoad   Tokens   OwnsHost ID
>Rack
> UN  127.0.0.1  228,64 KiB  256  ?   
> 437dc9cd-b1a1-41a5-961e-cfc99763e29f  rack1
> UN  127.0.0.2  60,09 MiB  256  ?   
> fbcbbdbb-e32a-4716-8230-8ca59aa93e62  rack1
> UN  127.0.0.3  57,59 MiB  256  ?   
> a0b1bcc6-0fad-405a-b0bf-180a0ca31dd0  rack1
> {noformat}
> sstablemetadata will then show that nodes 2 and 3 have SSTables still in 
> "pending repair" state :
> {noformat}
> ~/.ccm/repository/gitCOLONtrunk/tools/bin/sstablemetadata na-4-big-Data.db | 
> grep repair
> SSTable: 
> /Users/adejanovski/.ccm/inc-repair-4.0/node2/data0/tlp_stress/sensor_data-b7375660ad3111e8a0e59357ff9c9bda/na-4-big
> Pending repair: 3844a400-ad33-11e8-b5a7-6b8dd8f31b62
> {noformat}
> Restarting these nodes wouldn't help either.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14685) Incremental repair 4.0 : SSTables remain locked forever if the coordinator dies during streaming

2018-08-31 Thread Jason Brown (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599086#comment-16599086
 ] 

Jason Brown commented on CASSANDRA-14685:
-

Thanks for the report, [~adejanovski]. I'll be able to look into this next 
week, and I'm assigning the ticket to myself as a reminder. I'm not sure 
[~bdeggleston] can get to it before next week either.

I'm not sure if this is due to the stream sessions on nodes 2 and 3 not 
properly closing (and thus not informing the repair sessions they are part of), 
or if it's something getting lost in the repair session. Do nodes 2/3 show any 
streaming or repair activities (via nodetool cmds) after the repair coordinator 
dies? 

> Incremental repair 4.0 : SSTables remain locked forever if the coordinator 
> dies during streaming 
> -
>
> Key: CASSANDRA-14685
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14685
> Project: Cassandra
>  Issue Type: Bug
>  Components: Repair
>Reporter: Alexander Dejanovski
>Priority: Critical
>
> The changes in CASSANDRA-9143 modified the way incremental repair operates, 
> applying the following sequence of events: 
>  * Anticompaction is executed on all replicas for all SSTables overlapping 
> the repaired ranges
>  * Anticompacted SSTables are then marked as "Pending repair" and can no 
> longer be compacted, nor be part of another repair session
>  * Merkle trees are generated and compared
>  * Streaming takes place if needed
>  * Anticompaction is committed: "pending repair" tables are marked as 
> repaired if the session succeeded, or released if it failed.
> If the repair coordinator dies during the streaming phase, *the SSTables on 
> the replicas will remain in "pending repair" state and will never be eligible 
> for repair or compaction*, even after all the nodes in the cluster are 
> restarted. 
> Steps to reproduce (I've used Jason's 13938 branch, which fixes streaming 
> errors): 
> {noformat}
> ccm create inc-repair-issue -v github:jasobrown/13938 -n 3
> # Allow jmx access and remove all rpc_ settings in yaml
> for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra-env.sh;
> do
>   sed -i'' -e 
> 's/com.sun.management.jmxremote.authenticate=true/com.sun.management.jmxremote.authenticate=false/g'
>  $f
> done
> for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra.yaml;
> do
>   grep -v "rpc_" $f > ${f}.tmp
>   cat ${f}.tmp > $f
> done
> ccm start
> {noformat}
> I used [tlp-stress|https://github.com/thelastpickle/tlp-stress] to generate a 
> few tens of MBs of data (killed it after some time). Obviously 
> cassandra-stress works as well:
> {noformat}
> bin/tlp-stress run BasicTimeSeries -i 1M -p 1M -t 2 --rate 5000  
> --replication "{'class':'SimpleStrategy', 'replication_factor':2}"   
> --compaction "{'class': 'SizeTieredCompactionStrategy'}"   --host 
> 127.0.0.1
> {noformat}
> Flush and delete all SSTables in node1 :
> {noformat}
> ccm node1 nodetool flush
> rm -f ~/.ccm/inc-repair-issue/node1/data0/tlp_stress/sensor*/*.*
> {noformat}
> Then throttle streaming throughput to 1MB/s so we have time to take node1 
> down during the streaming phase and run repair:
> {noformat}
> ccm node1 nodetool setstreamthroughput 1
> ccm node2 nodetool setstreamthroughput 1
> ccm node3 nodetool setstreamthroughput 1
> ccm node1 nodetool repair tlp_stress
> {noformat}
> Once streaming starts, shut down node1 and start it again :
> {noformat}
> ccm node1 stop
> ccm node1 start
> {noformat}
> Run repair again :
> {noformat}
> ccm node1 nodetool repair tlp_stress
> {noformat}
> The command will return very quickly, showing that it skipped all sstables :
> {noformat}
> [2018-08-31 19:05:16,292] Repair completed successfully
> [2018-08-31 19:05:16,292] Repair command #1 finished in 2 seconds
> $ ccm node1 nodetool status
> Datacenter: datacenter1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  AddressLoad   Tokens   OwnsHost ID
>Rack
> UN  127.0.0.1  228,64 KiB  256  ?   
> 437dc9cd-b1a1-41a5-961e-cfc99763e29f  rack1
> UN  127.0.0.2  60,09 MiB  256  ?   
> fbcbbdbb-e32a-4716-8230-8ca59aa93e62  rack1
> UN  127.0.0.3  57,59 MiB  256  ?   
> a0b1bcc6-0fad-405a-b0bf-180a0ca31dd0  rack1
> {noformat}
> sstablemetadata will then show that nodes 2 and 3 have SSTables still in 
> "pending repair" state :
> {noformat}
> ~/.ccm/repository/gitCOLONtrunk/tools/bin/sstablemetadata na-4-big-Data.db | 
> grep repair
> SSTable: 
> /Users/adejanovski/.ccm/inc-repair-4.0/node2/data0/tlp_stress/sensor_data-b7375660ad3111e8a0e59357ff9c9bda/na-4-big
> Pending repair: 3844a400-ad33-11e8-b5a7-6b8dd8f31b62
> {noformat}
> Restarting 

[jira] [Updated] (CASSANDRA-14668) Diag events for read repairs

2018-08-31 Thread Stefan Podkowinski (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefan Podkowinski updated CASSANDRA-14668:
---
    Resolution: Fixed
Fix Version/s: (was: 4.x)
               4.0
       Status: Resolved  (was: Patch Available)

Committed as 1e2f5244e5

> Diag events for read repairs
> 
>
> Key: CASSANDRA-14668
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14668
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Observability
>Reporter: Stefan Podkowinski
>Assignee: Stefan Podkowinski
>Priority: Major
> Fix For: 4.0
>
>
> Read repairs have been a highly discussed topic during the last months and 
> also saw some significant code changes. I'd like to be better prepared in 
> case we need to investigate any further RR issues in the future, by adding 
> diagnostic events that can be enabled for exposing informations such as:
>  * contacted endpoints
>  * digest responses by endpoint
>  * affected partition keys
>  * speculated reads / writes
>  
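The event categories listed above can be thought of as a lightweight publish/subscribe channel that stays inert unless a consumer subscribes. The sketch below is purely illustrative; Cassandra's real diagnostic-event service differs in detail:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

class DiagEventsSketch
{
    private static final Map<Class<?>, List<Consumer<Object>>> subscribers = new HashMap<>();

    // Consumers register interest in one event type.
    static void subscribe(Class<?> eventType, Consumer<Object> consumer)
    {
        subscribers.computeIfAbsent(eventType, k -> new ArrayList<>()).add(consumer);
    }

    // Publishing is effectively a no-op when nobody subscribed to the event
    // type, so instrumentation can remain enabled-on-demand on the read path.
    static void publish(Object event)
    {
        List<Consumer<Object>> consumers = subscribers.get(event.getClass());
        if (consumers != null)
            consumers.forEach(c -> c.accept(event));
    }
}
```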



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



cassandra git commit: Add diag events for read repairs

2018-08-31 Thread spod
Repository: cassandra
Updated Branches:
  refs/heads/trunk aed682513 -> 1e2f5244e


Add diag events for read repairs

patch by Stefan Podkowinski; reviewed by Mick Semb Wever for CASSANDRA-14668


Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo
Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/1e2f5244
Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/1e2f5244
Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/1e2f5244

Branch: refs/heads/trunk
Commit: 1e2f5244e5e341f32d23872104fad3b55dbf0cb0
Parents: aed6825
Author: Stefan Podkowinski 
Authored: Mon Aug 27 13:45:27 2018 +0200
Committer: Stefan Podkowinski 
Committed: Fri Aug 31 19:40:57 2018 +0200

--
 CHANGES.txt |   1 +
 .../cassandra/service/reads/DigestResolver.java |  30 ++-
 .../reads/repair/AbstractReadRepair.java|   2 +
 .../reads/repair/BlockingPartitionRepair.java   |  18 ++
 .../reads/repair/PartitionRepairEvent.java  | 102 ++
 .../reads/repair/ReadRepairDiagnostics.java |  78 
 .../service/reads/repair/ReadRepairEvent.java   | 114 +++
 .../DiagEventsBlockingReadRepairTest.java   | 192 +++
 8 files changed, 536 insertions(+), 1 deletion(-)
--


http://git-wip-us.apache.org/repos/asf/cassandra/blob/1e2f5244/CHANGES.txt
--
diff --git a/CHANGES.txt b/CHANGES.txt
index e40cf27..d2d9c86 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -1,4 +1,5 @@
 4.0
+ * Add diagnostic events for read repairs (CASSANDRA-14668)
  * Use consistent nowInSeconds and timestamps values within a request 
(CASSANDRA-14671)
  * Add sampler for query time and expose with nodetool (CASSANDRA-14436)
  * Clean up Message.Request implementations (CASSANDRA-14677)

http://git-wip-us.apache.org/repos/asf/cassandra/blob/1e2f5244/src/java/org/apache/cassandra/service/reads/DigestResolver.java
--
diff --git a/src/java/org/apache/cassandra/service/reads/DigestResolver.java 
b/src/java/org/apache/cassandra/service/reads/DigestResolver.java
index b2eb0c6..897892f 100644
--- a/src/java/org/apache/cassandra/service/reads/DigestResolver.java
+++ b/src/java/org/apache/cassandra/service/reads/DigestResolver.java
@@ -25,9 +25,10 @@ import com.google.common.base.Preconditions;
 import org.apache.cassandra.db.*;
 import org.apache.cassandra.db.partitions.PartitionIterator;
 import org.apache.cassandra.db.partitions.UnfilteredPartitionIterators;
+import org.apache.cassandra.locator.InetAddressAndPort;
 import org.apache.cassandra.net.MessageIn;
 import org.apache.cassandra.service.reads.repair.ReadRepair;
-import org.apache.cassandra.tracing.TraceState;
+import org.apache.cassandra.utils.ByteBufferUtil;
 
 public class DigestResolver extends ResponseResolver
 {
@@ -82,4 +83,31 @@ public class DigestResolver extends ResponseResolver
 {
 return dataResponse != null;
 }
+
+public DigestResolverDebugResult[] getDigestsByEndpoint()
+{
+DigestResolverDebugResult[] ret = new 
DigestResolverDebugResult[responses.size()];
+for (int i = 0; i < responses.size(); i++)
+{
+MessageIn message = responses.get(i);
+ReadResponse response = message.payload;
+String digestHex = 
ByteBufferUtil.bytesToHex(response.digest(command));
+ret[i] = new DigestResolverDebugResult(message.from, digestHex, 
message.payload.isDigestResponse());
+}
+return ret;
+}
+
+public static class DigestResolverDebugResult
+{
+public InetAddressAndPort from;
+public String digestHex;
+public boolean isDigestResponse;
+
+private DigestResolverDebugResult(InetAddressAndPort from, String 
digestHex, boolean isDigestResponse)
+{
+this.from = from;
+this.digestHex = digestHex;
+this.isDigestResponse = isDigestResponse;
+}
+}
 }

http://git-wip-us.apache.org/repos/asf/cassandra/blob/1e2f5244/src/java/org/apache/cassandra/service/reads/repair/AbstractReadRepair.java
--
diff --git 
a/src/java/org/apache/cassandra/service/reads/repair/AbstractReadRepair.java 
b/src/java/org/apache/cassandra/service/reads/repair/AbstractReadRepair.java
index a1cf827..7e3f0ae 100644
--- a/src/java/org/apache/cassandra/service/reads/repair/AbstractReadRepair.java
+++ b/src/java/org/apache/cassandra/service/reads/repair/AbstractReadRepair.java
@@ -122,6 +122,7 @@ public abstract class AbstractReadRepair implements 
ReadRepair
 Tracing.trace("Enqueuing full data read to {}", endpoint);
 sendReadCommand(endpoint, readCallback);
 }
+

[jira] [Updated] (CASSANDRA-14671) Use consistent nowInSeconds and timestamps values within a request

2018-08-31 Thread Aleksey Yeschenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksey Yeschenko updated CASSANDRA-14671:
--
   Resolution: Fixed
Fix Version/s: (was: 4.0.x)
   4.0
   Status: Resolved  (was: Patch Available)

> Use consistent nowInSeconds and timestamps values within a request
> --
>
> Key: CASSANDRA-14671
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14671
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Aleksey Yeschenko
>Assignee: Aleksey Yeschenko
>Priority: Minor
>  Labels: fqltool
> Fix For: 4.0
>
>
> We don't currently use consistent values of {{nowInSeconds}} and 
> {{timestamp}} in the codebase, and sometimes generate several server-side 
> timestamps for each in the same request. {{QueryState}} should cache the 
> values it generated so that the same values are used for the duration of 
> write/read.
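The caching behaviour described in the last sentence can be sketched as follows. This is an illustrative model of the idea, not the committed QueryState code:

```java
class QueryStateSketch
{
    private long timestampMicros = Long.MIN_VALUE;
    private int nowInSeconds = Integer.MIN_VALUE;

    // Generated lazily on first use, then reused for the rest of the request,
    // so every statement in the request sees the same server-side timestamp.
    long getTimestamp()
    {
        if (timestampMicros == Long.MIN_VALUE)
            timestampMicros = System.currentTimeMillis() * 1000;
        return timestampMicros;
    }

    int getNowInSeconds()
    {
        if (nowInSeconds == Integer.MIN_VALUE)
            nowInSeconds = (int) (System.currentTimeMillis() / 1000);
        return nowInSeconds;
    }
}
```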



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14671) Use consistent nowInSeconds and timestamps values within a request

2018-08-31 Thread Aleksey Yeschenko (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599063#comment-16599063
 ] 

Aleksey Yeschenko commented on CASSANDRA-14671:
---

Committed as 
[aed682513cc381b80705d1f971fddc394e8a62a5|https://github.com/apache/cassandra/commit/aed682513cc381b80705d1f971fddc394e8a62a5]
 to trunk, thanks.

> Use consistent nowInSeconds and timestamps values within a request
> --
>
> Key: CASSANDRA-14671
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14671
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Aleksey Yeschenko
>Assignee: Aleksey Yeschenko
>Priority: Minor
>  Labels: fqltool
> Fix For: 4.0.x
>
>
> We don't currently use consistent values of {{nowInSeconds}} and 
> {{timestamp}} in the codebase, and sometimes generate several server-side 
> timestamps for each in the same request. {{QueryState}} should cache the 
> values it generated so that the same values are used for the duration of 
> write/read.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



cassandra git commit: Use consistent nowInSeconds and timestamps values within a request

2018-08-31 Thread aleksey
Repository: cassandra
Updated Branches:
  refs/heads/trunk f31d1a05a -> aed682513


Use consistent nowInSeconds and timestamps values within a request

patch by Aleksey Yeschenko; reviewed by Chris Lohfink for
CASSANDRA-14671


Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo
Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/aed68251
Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/aed68251
Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/aed68251

Branch: refs/heads/trunk
Commit: aed682513cc381b80705d1f971fddc394e8a62a5
Parents: f31d1a05
Author: Aleksey Yeshchenko 
Authored: Fri Aug 31 11:13:03 2018 +0100
Committer: Aleksey Yeshchenko 
Committed: Fri Aug 31 18:29:33 2018 +0100

--
 CHANGES.txt |   1 +
 .../cassandra/cql3/BatchQueryOptions.java   |   4 +-
 .../org/apache/cassandra/cql3/QueryOptions.java |   4 +-
 .../apache/cassandra/cql3/UpdateParameters.java |   4 +-
 .../cql3/statements/BatchStatement.java |  64 +++-
 .../cql3/statements/CQL3CasRequest.java |  43 +---
 .../cql3/statements/ModificationStatement.java  | 101 +--
 .../cql3/statements/SelectStatement.java|   7 +-
 .../cassandra/io/sstable/CQLSSTableWriter.java  |  10 +-
 .../apache/cassandra/service/QueryState.java|  54 +++---
 .../org/apache/cassandra/cql3/ListsTest.java|   4 +-
 .../cassandra/transport/SerDeserTest.java   |   7 +-
 .../io/sstable/StressCQLSSTableWriter.java  |   9 +-
 13 files changed, 206 insertions(+), 106 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/cassandra/blob/aed68251/CHANGES.txt
--
diff --git a/CHANGES.txt b/CHANGES.txt
index 475cd48..e40cf27 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -1,4 +1,5 @@
 4.0
+ * Use consistent nowInSeconds and timestamps values within a request 
(CASSANDRA-14671)
  * Add sampler for query time and expose with nodetool (CASSANDRA-14436)
  * Clean up Message.Request implementations (CASSANDRA-14677)
 * Disable old native protocol versions on demand (CASSANDRA-14659)

http://git-wip-us.apache.org/repos/asf/cassandra/blob/aed68251/src/java/org/apache/cassandra/cql3/BatchQueryOptions.java
--
diff --git a/src/java/org/apache/cassandra/cql3/BatchQueryOptions.java 
b/src/java/org/apache/cassandra/cql3/BatchQueryOptions.java
index ac0d148..ac8f179 100644
--- a/src/java/org/apache/cassandra/cql3/BatchQueryOptions.java
+++ b/src/java/org/apache/cassandra/cql3/BatchQueryOptions.java
@@ -84,9 +84,9 @@ public abstract class BatchQueryOptions
 return wrapped.getTimestamp(state);
 }
 
-public int getNowInSeconds()
+public int getNowInSeconds(QueryState state)
 {
-return wrapped.getNowInSeconds();
+return wrapped.getNowInSeconds(state);
 }
 
 private static class WithoutPerStatementVariables extends BatchQueryOptions

http://git-wip-us.apache.org/repos/asf/cassandra/blob/aed68251/src/java/org/apache/cassandra/cql3/QueryOptions.java
--
diff --git a/src/java/org/apache/cassandra/cql3/QueryOptions.java 
b/src/java/org/apache/cassandra/cql3/QueryOptions.java
index e546304..f76d6b2 100644
--- a/src/java/org/apache/cassandra/cql3/QueryOptions.java
+++ b/src/java/org/apache/cassandra/cql3/QueryOptions.java
@@ -200,10 +200,10 @@ public abstract class QueryOptions
 return tstamp != Long.MIN_VALUE ? tstamp : state.getTimestamp();
 }
 
-public int getNowInSeconds()
+public int getNowInSeconds(QueryState state)
 {
 int nowInSeconds = getSpecificOptions().nowInSeconds;
-return Integer.MIN_VALUE == nowInSeconds ? FBUtilities.nowInSeconds() 
: nowInSeconds;
+return nowInSeconds != Integer.MIN_VALUE ? nowInSeconds : 
state.getNowInSeconds();
 }
 
 /** The keyspace that this query is bound to, or null if not relevant. */

http://git-wip-us.apache.org/repos/asf/cassandra/blob/aed68251/src/java/org/apache/cassandra/cql3/UpdateParameters.java
--
diff --git a/src/java/org/apache/cassandra/cql3/UpdateParameters.java 
b/src/java/org/apache/cassandra/cql3/UpdateParameters.java
index 500862e..740cd91 100644
--- a/src/java/org/apache/cassandra/cql3/UpdateParameters.java
+++ b/src/java/org/apache/cassandra/cql3/UpdateParameters.java
@@ -28,7 +28,6 @@ import org.apache.cassandra.db.filter.ColumnFilter;
 import org.apache.cassandra.db.partitions.Partition;
 import org.apache.cassandra.db.rows.*;
 import org.apache.cassandra.exceptions.InvalidRequestException;
-import org.apache.cassandra.utils.FBUtilities;
 
 /**
  * Groups the parameters of an update 

[jira] [Commented] (CASSANDRA-14145) Detecting data resurrection during read

2018-08-31 Thread Sam Tunnicliffe (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599053#comment-16599053
 ] 

Sam Tunnicliffe commented on CASSANDRA-14145:
-

bq. Since we lose a good chunk of the benefits of the optimization, would it 
make more sense to disable it entirely 
I agree. We've been pushing and pulling this around for a while now, in an 
attempt to retain some of the optimisations of 
{{queryMemtableAndSSTablesInTimestampOrder}} when tracking is enabled, but 
nothing so far has really been satisfactory. +1 to just disabling it when 
tracking for now & revisiting later.

bq. I would turn inconclusive tracking off by default
Fair enough. As everything is off by default for now, it probably makes for a 
better UI to selectively turn things on rather than flipping some switches up 
and some down. To that end I've reversed the semantics of that third option 
from "only report confirmed" to "also report unconfirmed".

bq. Missing new cassandra.yaml entries
Sorry, forgot to git add the yaml file before pushing the previous commits. 
Updated for default changes above & pushed.
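For reference, the tracking options discussed here surface in cassandra.yaml 
roughly along these lines. The exact option names below are an assumption 
(taken from what eventually shipped in 4.0, and they may differ from the patch 
at this point in its history); what is certain from the discussion is that all 
three default to off, with the third expressed as "also report unconfirmed":

```yaml
# Track a digest of the repaired data covered by each read so that
# coordinators can detect mismatches between replicas' repaired datasets.
# All tracking is off by default.
repaired_data_tracking_for_range_reads_enabled: false
repaired_data_tracking_for_partition_reads_enabled: false

# When tracking is enabled, also report mismatches whose confirmation is
# inconclusive (e.g. pending-repair sstables were involved in the read),
# rather than only confirmed mismatches.
report_unconfirmed_repaired_data_mismatches: false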

bq. Minor nits
Addressed and bundled in a single commit, with a couple of caveats:
 I came to the conclusion that {{RepairedDataInfo}} really ought to be private 
to the {{ReadCommand}} using it, so I removed the interface and just made it a 
private class. The digest and isConclusive are accessed via {{ReadCommand}} 
itself, which I think does a better job of hiding unnecessary information. I 
did leave {{NO_OP_REPAIRED_DATA_INFO}} in (slightly renamed), as I'd rather 
have a safe null object there than worry about null-checking every use of it 
in case some path is overlooked or untested. 

I also felt that the splitting of iterators according to the repaired status of 
the sstables (or memtable) that they come from was a bit ugly and invasive, so 
I've refactored that into another inner class of {{ReadCommand}}, which tidied 
up {{SPRC/PRRC}} quite a bit.

One last thing, [~jrwest] mentioned off-JIRA that there's an edge case where 
compaction is backed up and so an sstable may be marked pending, but the 
session to which it belongs has been purged. In that case, we'd mistakenly 
consider digests inconclusive, so I've added a check that the session actually 
exists, and if not we consider the sstable unrepaired. 
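A minimal sketch of that guard, with hypothetical helper names (`pendingRepairSession`, `sessionExists`) standing in for the real sstable metadata and ActiveRepairService calls — illustrative only, not the patch's actual code:

```java
import java.util.UUID;
import java.util.function.Predicate;

// Sketch: classify an sstable for repaired-data tracking purposes.
final class RepairedStatus
{
    enum Status { REPAIRED, UNREPAIRED, PENDING }

    static Status classify(UUID pendingRepairSession, boolean isRepaired,
                           Predicate<UUID> sessionExists)
    {
        if (pendingRepairSession != null)
        {
            // Edge case: a compaction backlog can leave an sstable marked
            // pending for a session that has already been purged. Treat such
            // sstables as unrepaired instead of making digests inconclusive.
            return sessionExists.test(pendingRepairSession)
                   ? Status.PENDING
                   : Status.UNREPAIRED;
        }
        return isRepaired ? Status.REPAIRED : Status.UNREPAIRED;
    }
}
```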


>  Detecting data resurrection during read
> 
>
> Key: CASSANDRA-14145
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14145
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: sankalp kohli
>Assignee: Sam Tunnicliffe
>Priority: Minor
> Fix For: 4.x
>
>
> We have seen several bugs in which deleted data gets resurrected. We should 
> try to see if we can detect this on the read path and possibly fix it. Here 
> are a few examples which brought back data
> A replica lost an sstable on startup, causing it to lose the tombstone but 
> not the data. This tombstone was past gc grace, which means this could 
> resurrect data. We can detect such invalid states by looking at other 
> replicas. 
> If we are running incremental repair, Cassandra will keep repaired and 
> non-repaired data separate. Every time incremental repair runs, it moves 
> data from non-repaired to repaired. Repaired data across all replicas should 
> be 100% consistent. 
> Here is an example of how we can detect and mitigate the issue in most cases. 
> Say we have 3 machines: A, B and C. All these machines will have data split 
> between repaired and non-repaired. 
> 1. Machine A, due to some bug, brings back data D. This data D is in the 
> repaired dataset. All other replicas will have data D and tombstone T. 
> 2. A read for data D comes from the application, involving replicas A and B. 
> The data being read is in the repaired state. A will respond to the 
> co-ordinator with data D, and B will send nothing as the tombstone is past 
> gc grace. This will cause a digest mismatch. 
> 3. This patch only kicks in when there is a digest mismatch. The 
> co-ordinator will ask both replicas to send back all data, like we do today, 
> but with this patch the replicas will indicate whether the data they return 
> comes from the repaired or the non-repaired set. If the repaired data does 
> not match, we know something is wrong! At this time, the co-ordinator cannot 
> determine whether replica A has resurrected some data or replica B has lost 
> some data. We can still log an error saying we hit an invalid state.
> 4. Besides the log, we can take this further and even correct the response 
> to the query. After logging an invalid state, we can ask replicas A and B 
> (and also C if alive) to send back all data for this, including gcable 
> tombstones. If any machine returns a tombstone which is 

[jira] [Commented] (CASSANDRA-14671) Use consistent nowInSeconds and timestamps values within a request

2018-08-31 Thread Chris Lohfink (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599052#comment-16599052
 ] 

Chris Lohfink commented on CASSANDRA-14671:
---

+1

> Use consistent nowInSeconds and timestamps values within a request
> --
>
> Key: CASSANDRA-14671
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14671
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Aleksey Yeschenko
>Assignee: Aleksey Yeschenko
>Priority: Minor
>  Labels: fqltool
> Fix For: 4.0.x
>
>
> We don't currently use consistent values of {{nowInSeconds}} and 
> {{timestamp}} in the codebase, and sometimes generate several server-side 
> timestamps for each in the same request. {{QueryState}} should cache the 
> values it generated so that the same values are used for the duration of 
> write/read.
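The caching pattern described here can be reduced to a short sketch. The accessor names mirror the patch's {{QueryState}} methods, but this is an illustrative reduction under that assumption, not the actual class:

```java
// Sketch: generate server-side time values lazily and cache them, so every
// part of a single request observes the same timestamp / nowInSeconds.
final class QueryStateSketch
{
    private long timestampMicros = Long.MIN_VALUE; // sentinel: not generated yet
    private int nowInSeconds = Integer.MIN_VALUE;  // sentinel: not generated yet

    long getTimestamp()
    {
        if (timestampMicros == Long.MIN_VALUE)
            timestampMicros = System.currentTimeMillis() * 1000; // generate once
        return timestampMicros;
    }

    int getNowInSeconds()
    {
        if (nowInSeconds == Integer.MIN_VALUE)
            nowInSeconds = (int) (System.currentTimeMillis() / 1000); // generate once
        return nowInSeconds;
    }
}
```

Client-supplied values (as in CASSANDRA-14664) then simply take precedence over these cached server-generated ones.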



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-14675) Log the actual (if server-generated) timestamp and nowInSeconds used by queries in FQL

2018-08-31 Thread Marcus Eriksson (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599051#comment-16599051
 ] 

Marcus Eriksson commented on CASSANDRA-14675:
-

and just realised that Dump.java needs updating as well

> Log the actual (if server-generated) timestamp and nowInSeconds used by 
> queries in FQL
> --
>
> Key: CASSANDRA-14675
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14675
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Aleksey Yeschenko
>Assignee: Aleksey Yeschenko
>Priority: Major
>  Labels: fqltool
> Fix For: 4.0.x
>
>
> FQL doesn't currently log the actual timestamp - in microseconds - if it's 
> been server generated, nor the nowInSeconds value. It needs to, to allow for 
> - in conjunction with CASSANDRA-14664 and CASSANDRA-14671 - deterministic 
> playback tests.






[jira] [Commented] (CASSANDRA-14675) Log the actual (if server-generated) timestamp and nowInSeconds used by queries in FQL

2018-08-31 Thread Marcus Eriksson (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599042#comment-16599042
 ] 

Marcus Eriksson commented on CASSANDRA-14675:
-

oh and 
https://github.com/krummas/cassandra/commit/f22ea233de6886b7beae7ef57d50d4b5dd9bad4f
 - don't know how well chronicle works with repeated fields

> Log the actual (if server-generated) timestamp and nowInSeconds used by 
> queries in FQL
> --
>
> Key: CASSANDRA-14675
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14675
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Aleksey Yeschenko
>Assignee: Aleksey Yeschenko
>Priority: Major
>  Labels: fqltool
> Fix For: 4.0.x
>
>
> FQL doesn't currently log the actual timestamp - in microseconds - if it's 
> been server generated, nor the nowInSeconds value. It needs to, to allow for 
> - in conjunction with CASSANDRA-14664 and CASSANDRA-14671 - deterministic 
> playback tests.






[jira] [Updated] (CASSANDRA-14436) Add sampler for query time and expose with nodetool

2018-08-31 Thread Aleksey Yeschenko (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksey Yeschenko updated CASSANDRA-14436:
--
   Resolution: Fixed
Fix Version/s: (was: 4.x)
   4.0
   Status: Resolved  (was: Patch Available)

> Add sampler for query time and expose with nodetool
> ---
>
> Key: CASSANDRA-14436
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14436
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Chris Lohfink
>Assignee: Chris Lohfink
>Priority: Major
>  Labels: 4.0-feature-freeze-review-requested, 
> pull-request-available
> Fix For: 4.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Create a new {{nodetool profileload}} command that functions just like 
> toppartitions but with more data, returning the slowest local reads and 
> writes on the host during a given duration, plus the most frequently touched 
> partitions (same as {{nodetool toppartitions}}). A refactor is included to 
> extend the sampler to uses beyond top frequency (max instead of total sample 
> values).
> Future work is to include top CPU and allocations by query, and possibly 
> tasks/CPU/allocations by stage during the time window.






[jira] [Commented] (CASSANDRA-14436) Add sampler for query time and expose with nodetool

2018-08-31 Thread Aleksey Yeschenko (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599037#comment-16599037
 ] 

Aleksey Yeschenko commented on CASSANDRA-14436:
---

Committed as 
[f31d1a05a1f6f85f64c9b965009db814960c4eca|https://github.com/apache/cassandra/commit/f31d1a05a1f6f85f64c9b965009db814960c4eca]
 to trunk. Mostly just looked at potential negative effects on the read path 
and found none, but cleaned up {{ReadExecutionController}} a little in the 
process. I trust Chris and Dinesh to have collectively done a good job at 
implementation and review of the rest.

> Add sampler for query time and expose with nodetool
> ---
>
> Key: CASSANDRA-14436
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14436
> Project: Cassandra
>  Issue Type: Improvement
>Reporter: Chris Lohfink
>Assignee: Chris Lohfink
>Priority: Major
>  Labels: 4.0-feature-freeze-review-requested, 
> pull-request-available
> Fix For: 4.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Create a new {{nodetool profileload}} command that functions just like 
> toppartitions but with more data, returning the slowest local reads and 
> writes on the host during a given duration, plus the most frequently touched 
> partitions (same as {{nodetool toppartitions}}). A refactor is included to 
> extend the sampler to uses beyond top frequency (max instead of total sample 
> values).
> Future work is to include top CPU and allocations by query, and possibly 
> tasks/CPU/allocations by stage during the time window.






[jira] [Created] (CASSANDRA-14685) Incremental repair 4.0 : SSTables remain locked forever if the coordinator dies during streaming

2018-08-31 Thread Alexander Dejanovski (JIRA)
Alexander Dejanovski created CASSANDRA-14685:


 Summary: Incremental repair 4.0 : SSTables remain locked forever 
if the coordinator dies during streaming 
 Key: CASSANDRA-14685
 URL: https://issues.apache.org/jira/browse/CASSANDRA-14685
 Project: Cassandra
  Issue Type: Bug
  Components: Repair
Reporter: Alexander Dejanovski


The changes in CASSANDRA-9143 modified incremental repair to apply the 
following sequence of events: 
 * Anticompaction is executed on all replicas for all SSTables overlapping the 
repaired ranges
 * Anticompacted SSTables are then marked as "Pending repair" and can no 
longer be compacted, nor be part of another repair session
 * Merkle trees are generated and compared
 * Streaming takes place if needed
 * Anticompaction is committed and "pending repair" tables are marked as 
repaired if the session succeeded, or released if it failed.

If the repair coordinator dies during the streaming phase, *the SSTables on the 
replicas will remain in "pending repair" state and will never be eligible for 
repair or compaction*, even after all the nodes in the cluster are restarted. 
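The lifecycle described above can be reduced to a small state-machine sketch (hypothetical names; the real logic lives in the local repair session and anticompaction code):

```java
// Sketch of the pending-repair lifecycle described above. The reported bug
// is that no transition fires when the coordinator dies mid-streaming, so
// PENDING effectively becomes a terminal state.
enum SSTableRepairState
{
    UNREPAIRED, PENDING, REPAIRED;

    SSTableRepairState onAnticompaction()   { return this == UNREPAIRED ? PENDING : this; }
    SSTableRepairState onSessionCommitted() { return this == PENDING ? REPAIRED : this; }
    SSTableRepairState onSessionFailed()    { return this == PENDING ? UNREPAIRED : this; }
    // Missing transition: coordinator death during streaming -> stuck in PENDING.
}
```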

Steps to reproduce (I've used Jason's 13938 branch, which fixes streaming 
errors): 
{noformat}
ccm create inc-repair-issue -v github:jasobrown/13938 -n 3

# Allow jmx access and remove all rpc_ settings in yaml
for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra-env.sh;
do
  sed -i'' -e 
's/com.sun.management.jmxremote.authenticate=true/com.sun.management.jmxremote.authenticate=false/g'
 $f
done

for f in ~/.ccm/inc-repair-issue/node*/conf/cassandra.yaml;
do
  grep -v "rpc_" $f > ${f}.tmp
  cat ${f}.tmp > $f
done

ccm start
{noformat}
I used [tlp-stress|https://github.com/thelastpickle/tlp-stress] to generate a 
few tens of MBs of data (killed it after some time). Obviously cassandra-stress 
works as well:
{noformat}
bin/tlp-stress run BasicTimeSeries -i 1M -p 1M -t 2 --rate 5000  
--replication "{'class':'SimpleStrategy', 'replication_factor':2}"   
--compaction "{'class': 'SizeTieredCompactionStrategy'}"   --host 127.0.0.1
{noformat}
Flush and delete all SSTables in node1 :
{noformat}
ccm node1 nodetool flush
rm -f ~/.ccm/inc-repair-issue/node1/data0/tlp_stress/sensor*/*.*
{noformat}
Then throttle streaming throughput to 1MB/s so we have time to take node1 down 
during the streaming phase and run repair:
{noformat}
ccm node1 nodetool setstreamthroughput 1
ccm node2 nodetool setstreamthroughput 1
ccm node3 nodetool setstreamthroughput 1
ccm node1 nodetool repair tlp_stress
{noformat}
Once streaming starts, shut down node1 and start it again :
{noformat}
ccm node1 stop
ccm node1 start
{noformat}
Run repair again :
{noformat}
ccm node1 nodetool repair tlp_stress
{noformat}
The command will return very quickly, showing that it skipped all sstables :
{noformat}
[2018-08-31 19:05:16,292] Repair completed successfully
[2018-08-31 19:05:16,292] Repair command #1 finished in 2 seconds

$ ccm node1 nodetool status

Datacenter: datacenter1
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load        Tokens  Owns  Host ID                               Rack
UN  127.0.0.1  228,64 KiB  256     ?     437dc9cd-b1a1-41a5-961e-cfc99763e29f  rack1
UN  127.0.0.2  60,09 MiB   256     ?     fbcbbdbb-e32a-4716-8230-8ca59aa93e62  rack1
UN  127.0.0.3  57,59 MiB   256     ?     a0b1bcc6-0fad-405a-b0bf-180a0ca31dd0  rack1
{noformat}
sstablemetadata will then show that nodes 2 and 3 have SSTables still in the 
"pending repair" state:
{noformat}
~/.ccm/repository/gitCOLONtrunk/tools/bin/sstablemetadata na-4-big-Data.db | 
grep repair
SSTable: 
/Users/adejanovski/.ccm/inc-repair-4.0/node2/data0/tlp_stress/sensor_data-b7375660ad3111e8a0e59357ff9c9bda/na-4-big
Pending repair: 3844a400-ad33-11e8-b5a7-6b8dd8f31b62
{noformat}
Restarting these nodes wouldn't help either.






[2/2] cassandra git commit: Add sampler for query time and expose with nodetool

2018-08-31 Thread aleksey
Add sampler for query time and expose with nodetool

patch by Chris Lohfink; reviewed by Dinesh Joshi for CASSANDRA-14436


Project: http://git-wip-us.apache.org/repos/asf/cassandra/repo
Commit: http://git-wip-us.apache.org/repos/asf/cassandra/commit/f31d1a05
Tree: http://git-wip-us.apache.org/repos/asf/cassandra/tree/f31d1a05
Diff: http://git-wip-us.apache.org/repos/asf/cassandra/diff/f31d1a05

Branch: refs/heads/trunk
Commit: f31d1a05a1f6f85f64c9b965009db814960c4eca
Parents: a09dc3a
Author: Chris Lohfink 
Authored: Fri Aug 17 00:40:54 2018 -0500
Committer: Aleksey Yeshchenko 
Committed: Fri Aug 31 18:14:31 2018 +0100

--
 CHANGES.txt |   1 +
 .../apache/cassandra/db/ColumnFamilyStore.java  |  67 ++---
 .../cassandra/db/ColumnFamilyStoreMBean.java|   4 +-
 .../cassandra/db/ReadExecutionController.java   |  89 ---
 .../db/SinglePartitionReadCommand.java  |   4 +-
 .../cassandra/metrics/FrequencySampler.java | 103 
 .../apache/cassandra/metrics/MaxSampler.java|  75 ++
 .../org/apache/cassandra/metrics/Sampler.java   |  97 
 .../apache/cassandra/metrics/TableMetrics.java  |  93 +--
 .../apache/cassandra/net/MessagingService.java  |   2 +
 .../apache/cassandra/service/StorageProxy.java  |   4 +-
 .../cassandra/service/StorageService.java   |  30 ++-
 .../cassandra/service/StorageServiceMBean.java  |   3 +-
 .../org/apache/cassandra/tools/NodeProbe.java   |  17 +-
 .../org/apache/cassandra/tools/NodeTool.java|  54 ++--
 .../cassandra/tools/nodetool/ProfileLoad.java   | 192 ++
 .../cassandra/tools/nodetool/TopPartitions.java | 121 +
 .../org/apache/cassandra/utils/TopKSampler.java | 139 ---
 .../cassandra/metrics/MaxSamplerTest.java   |  93 +++
 .../apache/cassandra/metrics/SamplerTest.java   | 247 +++
 .../metrics/TopFrequencySamplerTest.java|  71 ++
 .../cassandra/tools/TopPartitionsTest.java  |  14 +-
 .../apache/cassandra/utils/TopKSamplerTest.java | 171 -
 23 files changed, 1120 insertions(+), 571 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/cassandra/blob/f31d1a05/CHANGES.txt
--
diff --git a/CHANGES.txt b/CHANGES.txt
index 6489038..475cd48 100644
--- a/CHANGES.txt
+++ b/CHANGES.txt
@@ -1,4 +1,5 @@
 4.0
+ * Add sampler for query time and expose with nodetool (CASSANDRA-14436)
  * Clean up Message.Request implementations (CASSANDRA-14677)
  * Disable old native protocol versions on demand (CASANDRA-14659)
  * Allow specifying now-in-seconds in native protocol (CASSANDRA-14664)

http://git-wip-us.apache.org/repos/asf/cassandra/blob/f31d1a05/src/java/org/apache/cassandra/db/ColumnFamilyStore.java
--
diff --git a/src/java/org/apache/cassandra/db/ColumnFamilyStore.java 
b/src/java/org/apache/cassandra/db/ColumnFamilyStore.java
index 496fc10..5e38584 100644
--- a/src/java/org/apache/cassandra/db/ColumnFamilyStore.java
+++ b/src/java/org/apache/cassandra/db/ColumnFamilyStore.java
@@ -42,7 +42,6 @@ import com.google.common.util.concurrent.*;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
-import com.clearspring.analytics.stream.Counter;
 import org.apache.cassandra.cache.*;
 import org.apache.cassandra.concurrent.*;
 import org.apache.cassandra.config.*;
@@ -65,11 +64,9 @@ import org.apache.cassandra.exceptions.StartupException;
 import org.apache.cassandra.index.SecondaryIndexManager;
 import org.apache.cassandra.index.internal.CassandraIndex;
 import org.apache.cassandra.index.transactions.UpdateTransaction;
-import org.apache.cassandra.io.FSError;
 import org.apache.cassandra.io.FSReadError;
 import org.apache.cassandra.io.FSWriteError;
 import org.apache.cassandra.io.sstable.Component;
-import org.apache.cassandra.io.sstable.CorruptSSTableException;
 import org.apache.cassandra.io.sstable.Descriptor;
 import org.apache.cassandra.io.sstable.KeyIterator;
 import org.apache.cassandra.io.sstable.SSTableMultiWriter;
@@ -77,8 +74,10 @@ import org.apache.cassandra.io.sstable.format.*;
 import org.apache.cassandra.io.sstable.metadata.MetadataCollector;
 import org.apache.cassandra.io.util.FileUtils;
 import org.apache.cassandra.io.util.RandomAccessReader;
+import org.apache.cassandra.metrics.Sampler;
+import org.apache.cassandra.metrics.Sampler.Sample;
+import org.apache.cassandra.metrics.Sampler.SamplerType;
 import org.apache.cassandra.metrics.TableMetrics;
-import org.apache.cassandra.metrics.TableMetrics.Sampler;
 import org.apache.cassandra.repair.TableRepairManager;
 import org.apache.cassandra.schema.*;
 import org.apache.cassandra.schema.CompactionParams.TombstoneOption;
@@ -87,7 +86,6 @@ import org.apache.cassandra.service.CacheService;
 import 

[1/2] cassandra git commit: Add sampler for query time and expose with nodetool

2018-08-31 Thread aleksey
Repository: cassandra
Updated Branches:
  refs/heads/trunk a09dc3a53 -> f31d1a05a


http://git-wip-us.apache.org/repos/asf/cassandra/blob/f31d1a05/test/unit/org/apache/cassandra/utils/TopKSamplerTest.java
--
diff --git a/test/unit/org/apache/cassandra/utils/TopKSamplerTest.java 
b/test/unit/org/apache/cassandra/utils/TopKSamplerTest.java
deleted file mode 100644
index f35d072..000
--- a/test/unit/org/apache/cassandra/utils/TopKSamplerTest.java
+++ /dev/null
@@ -1,171 +0,0 @@
-/*
- *
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements.  See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership.  The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License.  You may obtain a copy of the License at
- *
- *   http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing,
- * software distributed under the License is distributed on an
- * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
- * KIND, either express or implied.  See the License for the
- * specific language governing permissions and limitations
- * under the License.
- *
- */
-package org.apache.cassandra.utils;
-
-import java.util.List;
-import java.util.Map;
-import java.util.concurrent.CountDownLatch;
-import java.util.concurrent.TimeUnit;
-import java.util.concurrent.TimeoutException;
-import java.util.concurrent.atomic.AtomicBoolean;
-
-import com.google.common.collect.Maps;
-import com.google.common.util.concurrent.Uninterruptibles;
-import org.junit.Test;
-
-import com.clearspring.analytics.hash.MurmurHash;
-import com.clearspring.analytics.stream.Counter;
-import org.junit.Assert;
-import org.apache.cassandra.concurrent.NamedThreadFactory;
-import org.apache.cassandra.utils.TopKSampler.SamplerResult;
-
-public class TopKSamplerTest
-{
-
-@Test
-public void testSamplerSingleInsertionsEqualMulti() throws TimeoutException
-{
-TopKSampler sampler = new TopKSampler();
-sampler.beginSampling(10);
-insert(sampler);
-waitForEmpty(1000);
-SamplerResult single = sampler.finishSampling(10);
-
-TopKSampler sampler2 = new TopKSampler();
-sampler2.beginSampling(10);
-for(int i = 1; i <= 10; i++)
-{
-   String key = "item" + i;
-   sampler2.addSample(key, MurmurHash.hash64(key), i);
-}
-waitForEmpty(1000);
-Assert.assertEquals(countMap(single.topK), 
countMap(sampler2.finishSampling(10).topK));
-Assert.assertEquals(sampler2.hll.cardinality(), 10);
-Assert.assertEquals(sampler.hll.cardinality(), 
sampler2.hll.cardinality());
-}
-
-@Test
-public void testSamplerOutOfOrder() throws TimeoutException
-{
-TopKSampler sampler = new TopKSampler();
-sampler.beginSampling(10);
-insert(sampler);
-waitForEmpty(1000);
-SamplerResult single = sampler.finishSampling(10);
-single = sampler.finishSampling(10);
-}
-
-/**
- * checking for exceptions from SS/HLL which are not thread safe
- */
-@Test
-public void testMultithreadedAccess() throws Exception
-{
-final AtomicBoolean running = new AtomicBoolean(true);
-final CountDownLatch latch = new CountDownLatch(1);
-final TopKSampler sampler = new TopKSampler();
-
-NamedThreadFactory.createThread(new Runnable()
-{
-public void run()
-{
-try
-{
-while (running.get())
-{
-insert(sampler);
-}
-} finally
-{
-latch.countDown();
-}
-}
-
-}
-, "inserter").start();
-try
-{
-// start/stop in fast iterations
-for(int i = 0; i<100; i++)
-{
-sampler.beginSampling(i);
-sampler.finishSampling(i);
-}
-// start/stop with pause to let it build up past capacity
-for(int i = 0; i<3; i++)
-{
-sampler.beginSampling(i);
-Thread.sleep(250);
-sampler.finishSampling(i);
-}
-
-// with empty results
-running.set(false);
-latch.await(1, TimeUnit.SECONDS);
-waitForEmpty(1000);
-for(int i = 0; i<10; i++)
-{
-sampler.beginSampling(i);
-Thread.sleep(i);
-sampler.finishSampling(i);
-}
-} finally
-{
-running.set(false);
-}
-}
-
-private void insert(TopKSampler sampler)

[jira] [Commented] (CASSANDRA-14683) Pagestate is null after 2^31 rows

2018-08-31 Thread Abhishek (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599030#comment-16599030
 ] 

Abhishek commented on CASSANDRA-14683:
--

[~iamaleksey]

I'm willing to fix this, but I'm new to the codebase. Can you point me to the 
class that needs changes?

> Pagestate is null after 2^31 rows
> -
>
> Key: CASSANDRA-14683
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14683
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Abhishek
>Priority: Major
>
> I am using the nodejs driver to take a dump of my table via 
> [pagination|http://datastax.github.io/nodejs-driver/features/paging/#manual-paging]
>  for a simple query.
> My query is \{{select * from mytable}}
> The table has close to 4 billion rows and cassandra stops returning results 
> exactly after 2147483647 rows. The pagestate is not returned after this.
> Cassandra version - 3.0.9
> Nodejs cassandra driver version - 3.5.0
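Worth noting: the reported cutoff of 2147483647 rows is exactly Integer.MAX_VALUE, which suggests a row or position counter held in a signed 32-bit int somewhere on the paging path. A minimal illustration of the wrap-around (illustration only, not Cassandra code):

```java
// Illustration: a 32-bit signed counter silently wraps to a negative value
// one past Integer.MAX_VALUE; downstream code that checks for a "remaining"
// count <= 0 could then treat paging as finished and return no page state.
public final class CounterOverflow
{
    static int increment(int rowsCounted)
    {
        return rowsCounted + 1; // silently overflows at 2^31 - 1
    }
}
```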






[jira] [Commented] (CASSANDRA-14655) Upgrade C* to use latest guava (26.0)

2018-08-31 Thread Andy Tolbert (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-14655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16599028#comment-16599028
 ] 

Andy Tolbert commented on CASSANDRA-14655:
--

Thanks for providing the stack traces.

Ahh, I suppose the cause of the NPE could just be that no {{Cluster}} 
instance is provided in {{executeNet}}, since it can't properly initialize 
due to the {{system_virtual_schema}} issue.

In terms of {{system_virtual_schema}} being missing, I think the code that 
loads/creates that keyspace is 
[here|https://github.com/apache/cassandra/blob/06209037ea56b5a2a49615a99f1542d6ea1b2947/src/java/org/apache/cassandra/service/CassandraDaemon.java#L255-L256]
 during the setup of {{CassandraDaemon}}. My guess is that if you add the 
relevant code in CQLTester.requireNetwork, like [is 
done|https://github.com/apache/cassandra/blob/207c80c1fd63dfbd8ca7e615ec8002ee8983c5d6/test/unit/org/apache/cassandra/cql3/CQLTester.java#L381]
 for the system keyspace via {{SystemKeyspace.finishStartup()}}, that will fix 
this issue.

> Upgrade C* to use latest guava (26.0)
> -
>
> Key: CASSANDRA-14655
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14655
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Libraries
>Reporter: Sumanth Pasupuleti
>Assignee: Sumanth Pasupuleti
>Priority: Minor
> Fix For: 4.x
>
>
> C* currently uses guava 23.3. This JIRA is about changing C* to use latest 
> guava (26.0). Originated from a discussion in the mailing list.





