[jira] [Updated] (CASSANDRA-15861) Mutating sstable component may race with entire-sstable-streaming(ZCS) causing checksum validation failure

2020-09-01 Thread ZhaoYang (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZhaoYang updated CASSANDRA-15861:
-
Test and Documentation Plan: 
[https://app.circleci.com/pipelines/github/jasonstack/cassandra/300/workflows/f41f6585-cd97-4791-abdc-a2935694948e]
  (was: 
[https://app.circleci.com/pipelines/github/jasonstack/cassandra/298/workflows/4e56c8b2-a998-4785-9daf-0ceee52a9a83])

> Mutating sstable component may race with entire-sstable-streaming(ZCS) 
> causing checksum validation failure
> --
>
> Key: CASSANDRA-15861
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15861
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair, Consistency/Streaming, 
> Local/Compaction
>Reporter: ZhaoYang
>Assignee: ZhaoYang
>Priority: Normal
> Fix For: 4.0-beta
>
>
> Flaky dtest: [test_dead_sync_initiator - 
> repair_tests.repair_test.TestRepair|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/143/testReport/junit/dtest.repair_tests.repair_test/TestRepair/test_dead_sync_initiator/]
> {code:java|title=stacktrace}
> Unexpected error found in node logs (see stdout for full details). Errors: 
> [ERROR [Stream-Deserializer-127.0.0.1:7000-570871f3] 2020-06-03 04:05:19,081 
> CassandraEntireSSTableStreamReader.java:145 - [Stream 
> 6f1c3360-a54f-11ea-a808-2f23710fdc90] Error while reading sstable from stream 
> for table = keyspace1.standard1
> org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.maybeValidateChecksum(MetadataSerializer.java:219)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:198)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:129)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.mutate(MetadataSerializer.java:226)
>   at 
> org.apache.cassandra.db.streaming.CassandraEntireSSTableStreamReader.read(CassandraEntireSSTableStreamReader.java:140)
>   at 
> org.apache.cassandra.db.streaming.CassandraIncomingFile.read(CassandraIncomingFile.java:78)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:36)
>   at 
> org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.async.StreamingInboundHandler$StreamDeserializingTask.run(StreamingInboundHandler.java:181)
>   at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Checksums do not match for 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
> {code}
>  
> In the above test, it executes "nodetool repair" on node1 and kills node2 
> during repair. At the end, node3 reports checksum validation failure on 
> sstable transferred from node1.
> {code:java|title=what happened}
> 1. When repair started on node1, it performs anti-compaction which modifies 
> sstable's repairAt to 0 and pending repair id to session-id.
> 2. Then node1 creates {{ComponentManifest}} which contains file lengths to be 
> transferred to node3.
> 3. Before node1 actually sends the files to node3, node2 is killed and node1 
> starts to broadcast repair-failure-message to all participants in 
> {{CoordinatorSession#fail}}
> 4. Node1 receives its own repair-failure-message and fails its local repair 
> sessions at {{LocalSessions#failSession}} which triggers async background 
> compaction.
> 5. Node1's background compaction will mutate sstable's repairAt to 0 and 
> pending repair id to null via  
> {{PendingRepairManager#getNextRepairFinishedTask}}, as there is no more 
> in-progress repair.
> 6. Node1 actually sends the sstable to node3 where the sstable's STATS 
> component size is different from the original size recorded in the manifest.
> 7. At the end, node3 reports checksum validation failure when it tries to 
> mutate sstable level and "isTransient" attribute in 
> {{CassandraEntireSSTableStreamReader#read}}.
> {code}
> Currently, entire-sstable-streaming requires sstable components to be 
> immutable, because \{{ComponentManifest}}
> with component sizes are 

[jira] [Commented] (CASSANDRA-15861) Mutating sstable component may race with entire-sstable-streaming(ZCS) causing checksum validation failure

2020-09-01 Thread ZhaoYang (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188976#comment-17188976
 ] 

ZhaoYang commented on CASSANDRA-15861:
--

Pushed a unit test to verify "compressionMetadata" is used to calculate the 
transferred size for compressed sstable.

> Mutating sstable component may race with entire-sstable-streaming(ZCS) 
> causing checksum validation failure
> --
>
> Key: CASSANDRA-15861
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15861
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair, Consistency/Streaming, 
> Local/Compaction
>Reporter: ZhaoYang
>Assignee: ZhaoYang
>Priority: Normal
> Fix For: 4.0-beta
>
>
> Flaky dtest: [test_dead_sync_initiator - 
> repair_tests.repair_test.TestRepair|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/143/testReport/junit/dtest.repair_tests.repair_test/TestRepair/test_dead_sync_initiator/]
> {code:java|title=stacktrace}
> Unexpected error found in node logs (see stdout for full details). Errors: 
> [ERROR [Stream-Deserializer-127.0.0.1:7000-570871f3] 2020-06-03 04:05:19,081 
> CassandraEntireSSTableStreamReader.java:145 - [Stream 
> 6f1c3360-a54f-11ea-a808-2f23710fdc90] Error while reading sstable from stream 
> for table = keyspace1.standard1
> org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.maybeValidateChecksum(MetadataSerializer.java:219)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:198)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:129)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.mutate(MetadataSerializer.java:226)
>   at 
> org.apache.cassandra.db.streaming.CassandraEntireSSTableStreamReader.read(CassandraEntireSSTableStreamReader.java:140)
>   at 
> org.apache.cassandra.db.streaming.CassandraIncomingFile.read(CassandraIncomingFile.java:78)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:36)
>   at 
> org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.async.StreamingInboundHandler$StreamDeserializingTask.run(StreamingInboundHandler.java:181)
>   at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Checksums do not match for 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
> {code}
>  
> In the above test, it executes "nodetool repair" on node1 and kills node2 
> during repair. At the end, node3 reports checksum validation failure on 
> sstable transferred from node1.
> {code:java|title=what happened}
> 1. When repair started on node1, it performs anti-compaction which modifies 
> sstable's repairAt to 0 and pending repair id to session-id.
> 2. Then node1 creates {{ComponentManifest}} which contains file lengths to be 
> transferred to node3.
> 3. Before node1 actually sends the files to node3, node2 is killed and node1 
> starts to broadcast repair-failure-message to all participants in 
> {{CoordinatorSession#fail}}
> 4. Node1 receives its own repair-failure-message and fails its local repair 
> sessions at {{LocalSessions#failSession}} which triggers async background 
> compaction.
> 5. Node1's background compaction will mutate sstable's repairAt to 0 and 
> pending repair id to null via  
> {{PendingRepairManager#getNextRepairFinishedTask}}, as there is no more 
> in-progress repair.
> 6. Node1 actually sends the sstable to node3 where the sstable's STATS 
> component size is different from the original size recorded in the manifest.
> 7. At the end, node3 reports checksum validation failure when it tries to 
> mutate sstable level and "isTransient" attribute in 
> {{CassandraEntireSSTableStreamReader#read}}.
> {code}
> Currently, entire-sstable-streaming requires sstable components to be 
> immutable, because \{{ComponentManifest}}
> with component sizes are sent before sending actual files. This isn't a 
> problem in legacy streaming as STATS file length didn't matter.
>  

[jira] [Updated] (CASSANDRA-15861) Mutating sstable component may race with entire-sstable-streaming(ZCS) causing checksum validation failure

2020-09-01 Thread ZhaoYang (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZhaoYang updated CASSANDRA-15861:
-
Test and Documentation Plan: 
[https://app.circleci.com/pipelines/github/jasonstack/cassandra/298/workflows/4e56c8b2-a998-4785-9daf-0ceee52a9a83]
  (was: https://circleci.com/workflow-run/610e8169-e60c-420b-a556-4120967db6cb)

> Mutating sstable component may race with entire-sstable-streaming(ZCS) 
> causing checksum validation failure
> --
>
> Key: CASSANDRA-15861
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15861
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair, Consistency/Streaming, 
> Local/Compaction
>Reporter: ZhaoYang
>Assignee: ZhaoYang
>Priority: Normal
> Fix For: 4.0-beta
>
>
> Flaky dtest: [test_dead_sync_initiator - 
> repair_tests.repair_test.TestRepair|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/143/testReport/junit/dtest.repair_tests.repair_test/TestRepair/test_dead_sync_initiator/]
> {code:java|title=stacktrace}
> Unexpected error found in node logs (see stdout for full details). Errors: 
> [ERROR [Stream-Deserializer-127.0.0.1:7000-570871f3] 2020-06-03 04:05:19,081 
> CassandraEntireSSTableStreamReader.java:145 - [Stream 
> 6f1c3360-a54f-11ea-a808-2f23710fdc90] Error while reading sstable from stream 
> for table = keyspace1.standard1
> org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.maybeValidateChecksum(MetadataSerializer.java:219)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:198)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:129)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.mutate(MetadataSerializer.java:226)
>   at 
> org.apache.cassandra.db.streaming.CassandraEntireSSTableStreamReader.read(CassandraEntireSSTableStreamReader.java:140)
>   at 
> org.apache.cassandra.db.streaming.CassandraIncomingFile.read(CassandraIncomingFile.java:78)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:36)
>   at 
> org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.async.StreamingInboundHandler$StreamDeserializingTask.run(StreamingInboundHandler.java:181)
>   at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Checksums do not match for 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
> {code}
>  
> In the above test, it executes "nodetool repair" on node1 and kills node2 
> during repair. At the end, node3 reports checksum validation failure on 
> sstable transferred from node1.
> {code:java|title=what happened}
> 1. When repair started on node1, it performs anti-compaction which modifies 
> sstable's repairAt to 0 and pending repair id to session-id.
> 2. Then node1 creates {{ComponentManifest}} which contains file lengths to be 
> transferred to node3.
> 3. Before node1 actually sends the files to node3, node2 is killed and node1 
> starts to broadcast repair-failure-message to all participants in 
> {{CoordinatorSession#fail}}
> 4. Node1 receives its own repair-failure-message and fails its local repair 
> sessions at {{LocalSessions#failSession}} which triggers async background 
> compaction.
> 5. Node1's background compaction will mutate sstable's repairAt to 0 and 
> pending repair id to null via  
> {{PendingRepairManager#getNextRepairFinishedTask}}, as there is no more 
> in-progress repair.
> 6. Node1 actually sends the sstable to node3 where the sstable's STATS 
> component size is different from the original size recorded in the manifest.
> 7. At the end, node3 reports checksum validation failure when it tries to 
> mutate sstable level and "isTransient" attribute in 
> {{CassandraEntireSSTableStreamReader#read}}.
> {code}
> Currently, entire-sstable-streaming requires sstable components to be 
> immutable, because \{{ComponentManifest}}
> with component sizes are sent before sending actual files. This isn't 

[jira] [Updated] (CASSANDRA-16095) Nodetool tablestats WriteLatency error

2020-09-01 Thread Jira


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Altan Özlü updated CASSANDRA-16095:
---
Labels:   (was: 4)

> Nodetool tablestats WriteLatency error
> --
>
> Key: CASSANDRA-16095
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16095
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Altan Özlü
>Priority: Normal
>
> when i use
> {code:java}
> bin/nodetool tablestats {code}
> i get this error
> {code:java}
> error: 
> org.apache.cassandra.metrics:type=Table,keyspace=,scope=feed,name=WriteLatency
> -- StackTrace --
> javax.management.InstanceNotFoundException: 
> org.apache.cassandra.metrics:type=Table,keyspace=,scope=feed,name=WriteLatency
> {code}
> if i restart the node and check it works but if i write something then i get 
> the error again
>  * Cassandra 4.0-beta1
>  * Ubuntu 20



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16095) Nodetool tablestats WriteLatency error

2020-09-01 Thread Jira


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Altan Özlü updated CASSANDRA-16095:
---
Labels: 4  (was: )

> Nodetool tablestats WriteLatency error
> --
>
> Key: CASSANDRA-16095
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16095
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Altan Özlü
>Priority: Normal
>  Labels: 4
>
> when i use
> {code:java}
> bin/nodetool tablestats {code}
> i get this error
> {code:java}
> error: 
> org.apache.cassandra.metrics:type=Table,keyspace=,scope=feed,name=WriteLatency
> -- StackTrace --
> javax.management.InstanceNotFoundException: 
> org.apache.cassandra.metrics:type=Table,keyspace=,scope=feed,name=WriteLatency
> {code}
> if i restart the node and check it works but if i write something then i get 
> the error again
>  * Cassandra 4.0-beta1
>  * Ubuntu 20



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Created] (CASSANDRA-16095) Nodetool tablestats WriteLatency error

2020-09-01 Thread Jira
Altan Özlü created CASSANDRA-16095:
--

 Summary: Nodetool tablestats WriteLatency error
 Key: CASSANDRA-16095
 URL: https://issues.apache.org/jira/browse/CASSANDRA-16095
 Project: Cassandra
  Issue Type: Bug
Reporter: Altan Özlü


when i use
{code:java}
bin/nodetool tablestats {code}
i get this error
{code:java}
error: 
org.apache.cassandra.metrics:type=Table,keyspace=,scope=feed,name=WriteLatency
-- StackTrace --
javax.management.InstanceNotFoundException: 
org.apache.cassandra.metrics:type=Table,keyspace=,scope=feed,name=WriteLatency
{code}
if i restart the node and check it works but if i write something then i get 
the error again
 * Cassandra 4.0-beta1
 * Ubuntu 20



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-15909) Make Table/Keyspace Metric Names Consistent With Each Other

2020-09-01 Thread Dinesh Joshi (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dinesh Joshi updated CASSANDRA-15909:
-
Reviewers: Caleb Rackliffe, David Capwell, Dinesh Joshi  (was: Caleb 
Rackliffe, David Capwell)

> Make Table/Keyspace Metric Names Consistent With Each Other
> ---
>
> Key: CASSANDRA-15909
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15909
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Observability/Metrics
>Reporter: Stephen Mallette
>Assignee: Caleb Rackliffe
>Priority: Normal
> Fix For: 4.0-beta
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> As part of CASSANDRA-15821 it became apparent that certain metric names found 
> in keyspace and tables had different names but were in fact the same metric - 
> they are as follows:
> * Table.SyncTime == Keyspace.RepairSyncTime
> * Table.RepairedDataTrackingOverreadRows == Keyspace.RepairedOverreadRows
> * Table.RepairedDataTrackingOverreadTime == Keyspace.RepairedOverreadTime
> * Table.AllMemtablesHeapSize == Keyspace.AllMemtablesOnHeapDataSize
> * Table.AllMemtablesOffHeapSize == Keyspace.AllMemtablesOffHeapDataSize
> * Table.MemtableOnHeapSize == Keyspace.MemtableOnHeapDataSize
> * Table.MemtableOffHeapSize == Keyspace.MemtableOffHeapDataSize
> Also, client metrics are the only metrics to start with a lower case letter. 
> Change those to upper case to match all the other metrics.
> Unifying this naming would help make metrics more consistent as part of 
> CASSANDRA-15582



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15909) Make Table/Keyspace Metric Names Consistent With Each Other

2020-09-01 Thread Dinesh Joshi (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188837#comment-17188837
 ] 

Dinesh Joshi commented on CASSANDRA-15909:
--

I'll take a look this week.

> Make Table/Keyspace Metric Names Consistent With Each Other
> ---
>
> Key: CASSANDRA-15909
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15909
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Observability/Metrics
>Reporter: Stephen Mallette
>Assignee: Caleb Rackliffe
>Priority: Normal
> Fix For: 4.0-beta
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> As part of CASSANDRA-15821 it became apparent that certain metric names found 
> in keyspace and tables had different names but were in fact the same metric - 
> they are as follows:
> * Table.SyncTime == Keyspace.RepairSyncTime
> * Table.RepairedDataTrackingOverreadRows == Keyspace.RepairedOverreadRows
> * Table.RepairedDataTrackingOverreadTime == Keyspace.RepairedOverreadTime
> * Table.AllMemtablesHeapSize == Keyspace.AllMemtablesOnHeapDataSize
> * Table.AllMemtablesOffHeapSize == Keyspace.AllMemtablesOffHeapDataSize
> * Table.MemtableOnHeapSize == Keyspace.MemtableOnHeapDataSize
> * Table.MemtableOffHeapSize == Keyspace.MemtableOffHeapDataSize
> Also, client metrics are the only metrics to start with a lower case letter. 
> Change those to upper case to match all the other metrics.
> Unifying this naming would help make metrics more consistent as part of 
> CASSANDRA-15582



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Assigned] (CASSANDRA-16057) Should update in-jvm dtest to expose stdout and stderr for nodetool

2020-09-01 Thread Yifan Cai (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yifan Cai reassigned CASSANDRA-16057:
-

Assignee: Yifan Cai

> Should update in-jvm dtest to expose stdout and stderr for nodetool
> ---
>
> Key: CASSANDRA-16057
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16057
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Test/dtest/java
>Reporter: David Capwell
>Assignee: Yifan Cai
>Priority: Normal
>
> Many nodetool commands output to stdout or stderr so running nodetool using 
> in-jvm dtest should expose that to tests.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-15909) Make Table/Keyspace Metric Names Consistent With Each Other

2020-09-01 Thread Caleb Rackliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188004#comment-17188004
 ] 

Caleb Rackliffe edited comment on CASSANDRA-15909 at 9/1/20, 8:25 PM:
--

Created CASSANDRA-16090 to track eventual removal of the (new and existing) 
clutter in {{TableMetrics}}, and CASSANDRA-16094 to address the preexisting 
flaky test in {{TestIncRepair}}.


was (Author: maedhroz):
Created CASSANDRA-16090 to track eventual removal of the (new and existing) 
clutter in {{TableMetrics}}.

> Make Table/Keyspace Metric Names Consistent With Each Other
> ---
>
> Key: CASSANDRA-15909
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15909
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Observability/Metrics
>Reporter: Stephen Mallette
>Assignee: Caleb Rackliffe
>Priority: Normal
> Fix For: 4.0-beta
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> As part of CASSANDRA-15821 it became apparent that certain metric names found 
> in keyspace and tables had different names but were in fact the same metric - 
> they are as follows:
> * Table.SyncTime == Keyspace.RepairSyncTime
> * Table.RepairedDataTrackingOverreadRows == Keyspace.RepairedOverreadRows
> * Table.RepairedDataTrackingOverreadTime == Keyspace.RepairedOverreadTime
> * Table.AllMemtablesHeapSize == Keyspace.AllMemtablesOnHeapDataSize
> * Table.AllMemtablesOffHeapSize == Keyspace.AllMemtablesOffHeapDataSize
> * Table.MemtableOnHeapSize == Keyspace.MemtableOnHeapDataSize
> * Table.MemtableOffHeapSize == Keyspace.MemtableOffHeapDataSize
> Also, client metrics are the only metrics to start with a lower case letter. 
> Change those to upper case to match all the other metrics.
> Unifying this naming would help make metrics more consistent as part of 
> CASSANDRA-15582



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16094) Flaky Test: TestIncRepair.test_repaired_tracking_with_mismatching_replicas

2020-09-01 Thread Caleb Rackliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Rackliffe updated CASSANDRA-16094:

Labels: dtest incremental_repair repair  (was: )

> Flaky Test: TestIncRepair.test_repaired_tracking_with_mismatching_replicas
> --
>
> Key: CASSANDRA-16094
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16094
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest/python
>Reporter: Caleb Rackliffe
>Priority: Normal
>  Labels: dtest, incremental_repair, repair
> Fix For: 4.0-beta
>
>
> We have two recent failures for this test on trunk: 
> 1.) 
> https://app.circleci.com/pipelines/github/maedhroz/cassandra/102/workflows/37ed8dab-9da4-4730-a883-20b7a99d88b4/jobs/518/tests
>  (CASSANDRA-15909)
> 2.) 
> https://app.circleci.com/pipelines/github/jolynch/cassandra/6/workflows/41e080e0-d7ff-4256-899e-b4010c6ef5ab/jobs/716/tests
>  (CASSANDRA-15379)
> The test expects there to be mismatches and then read repair executed on a 
> following SELECT, but either those mismatches aren’t there, read repair isn’t 
> happening, or both.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16094) Flaky Test: TestIncRepair.test_repaired_tracking_with_mismatching_replicas

2020-09-01 Thread Caleb Rackliffe (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Caleb Rackliffe updated CASSANDRA-16094:

 Bug Category: Parent values: Correctness(12982)Level 1 values: Test 
Failure(12990)
   Complexity: Normal
Discovered By: Adhoc Test
Fix Version/s: 4.0-beta
 Severity: Normal
   Status: Open  (was: Triage Needed)

> Flaky Test: TestIncRepair.test_repaired_tracking_with_mismatching_replicas
> --
>
> Key: CASSANDRA-16094
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16094
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/dtest/python
>Reporter: Caleb Rackliffe
>Priority: Normal
> Fix For: 4.0-beta
>
>
> We have two recent failures for this test on trunk: 
> 1.) 
> https://app.circleci.com/pipelines/github/maedhroz/cassandra/102/workflows/37ed8dab-9da4-4730-a883-20b7a99d88b4/jobs/518/tests
>  (CASSANDRA-15909)
> 2.) 
> https://app.circleci.com/pipelines/github/jolynch/cassandra/6/workflows/41e080e0-d7ff-4256-899e-b4010c6ef5ab/jobs/716/tests
>  (CASSANDRA-15379)
> The test expects there to be mismatches and then read repair executed on a 
> following SELECT, but either those mismatches aren’t there, read repair isn’t 
> happening, or both.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Created] (CASSANDRA-16094) Flaky Test: TestIncRepair.test_repaired_tracking_with_mismatching_replicas

2020-09-01 Thread Caleb Rackliffe (Jira)
Caleb Rackliffe created CASSANDRA-16094:
---

 Summary: Flaky Test: 
TestIncRepair.test_repaired_tracking_with_mismatching_replicas
 Key: CASSANDRA-16094
 URL: https://issues.apache.org/jira/browse/CASSANDRA-16094
 Project: Cassandra
  Issue Type: Bug
  Components: Test/dtest/python
Reporter: Caleb Rackliffe


We have two recent failures for this test on trunk: 

1.) 
https://app.circleci.com/pipelines/github/maedhroz/cassandra/102/workflows/37ed8dab-9da4-4730-a883-20b7a99d88b4/jobs/518/tests
 (CASSANDRA-15909)
2.) 
https://app.circleci.com/pipelines/github/jolynch/cassandra/6/workflows/41e080e0-d7ff-4256-899e-b4010c6ef5ab/jobs/716/tests
 (CASSANDRA-15379)

The test expects there to be mismatches and then read repair executed on a 
following SELECT, but either those mismatches aren’t there, read repair isn’t 
happening, or both.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15909) Make Table/Keyspace Metric Names Consistent With Each Other

2020-09-01 Thread Caleb Rackliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188787#comment-17188787
 ] 

Caleb Rackliffe commented on CASSANDRA-15909:
-

UPDATE: Looking for a second committer +1 here...

> Make Table/Keyspace Metric Names Consistent With Each Other
> ---
>
> Key: CASSANDRA-15909
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15909
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Observability/Metrics
>Reporter: Stephen Mallette
>Assignee: Caleb Rackliffe
>Priority: Normal
> Fix For: 4.0-beta
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> As part of CASSANDRA-15821 it became apparent that certain metric names found 
> in keyspace and tables had different names but were in fact the same metric - 
> they are as follows:
> * Table.SyncTime == Keyspace.RepairSyncTime
> * Table.RepairedDataTrackingOverreadRows == Keyspace.RepairedOverreadRows
> * Table.RepairedDataTrackingOverreadTime == Keyspace.RepairedOverreadTime
> * Table.AllMemtablesHeapSize == Keyspace.AllMemtablesOnHeapDataSize
> * Table.AllMemtablesOffHeapSize == Keyspace.AllMemtablesOffHeapDataSize
> * Table.MemtableOnHeapSize == Keyspace.MemtableOnHeapDataSize
> * Table.MemtableOffHeapSize == Keyspace.MemtableOffHeapDataSize
> Also, client metrics are the only metrics to start with a lower case letter. 
> Change those to upper case to match all the other metrics.
> Unifying this naming would help make metrics more consistent as part of 
> CASSANDRA-15582



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-16093) Cassandra website is building/including the wrong versioned nodetool docs

2020-09-01 Thread Michael Semb Wever (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188738#comment-17188738
 ] 

Michael Semb Wever commented on CASSANDRA-16093:


`ant realclean` will remove the final in-tree generated docs at {{build/html}}
but it doesn't remove the nodetool generated docs at 
{{doc/source/tools/nodetool}}

The docker-compose.yaml file in cassandra-website, rebuilds the versions using 
the same cassandra directory just switching branches/tags.

> Cassandra website is building/including the wrong versioned nodetool docs
> -
>
> Key: CASSANDRA-16093
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16093
> Project: Cassandra
>  Issue Type: Bug
>  Components: Documentation/Website
>Reporter: Michael Semb Wever
>Assignee: Michael Semb Wever
>Priority: Normal
>
> For example
> https://cassandra.apache.org/doc/3.11/tools/nodetool/enablefullquerylog.html
> shouldn't be under the 3.11 documentation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16093) Cassandra website is building/including the wrong versioned nodetool docs

2020-09-01 Thread Michael Semb Wever (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Semb Wever updated CASSANDRA-16093:
---
Summary: Cassandra website is building/including the wrong versioned 
nodetool docs  (was: Cassandra website is building the wrong versioned docs)

> Cassandra website is building/including the wrong versioned nodetool docs
> -
>
> Key: CASSANDRA-16093
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16093
> Project: Cassandra
>  Issue Type: Bug
>  Components: Documentation/Website
>Reporter: Michael Semb Wever
>Assignee: Michael Semb Wever
>Priority: Normal
>
> For example
> https://cassandra.apache.org/doc/3.11/tools/nodetool/enablefullquerylog.html
> shouldn't be under the 3.11 documentation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15701) Does Cassandra 3.11.3/3.11.5 is affected by CVE-2019-10712 or not ?

2020-09-01 Thread Brandon Williams (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188714#comment-17188714
 ] 

Brandon Williams commented on CASSANDRA-15701:
--

It's hard to say since no details of that vulnerability seem to be published. 
CASSANDRA-15867 removed this in 3.11.7 anyway, however.

> Does  Cassandra 3.11.3/3.11.5  is affected by CVE-2019-10712 or not ?
> -
>
> Key: CASSANDRA-15701
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15701
> Project: Cassandra
>  Issue Type: Bug
>  Components: Dependencies
>Reporter: wht
>Priority: Normal
>
> Because  cassandra 3.11.3/3.11.5 rely on jackson-mapper-asl-1.9.13.jar which 
> has been reported a vulnerability CVE-2019-10172, 
> [https://nvd.nist.gov/vuln/detail/CVE-2019-10172], so I want to know if it 
> has an impact to cassandra. Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16093) Cassandra website is building the wrong versioned docs

2020-09-01 Thread Michael Semb Wever (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Semb Wever updated CASSANDRA-16093:
---
Summary: Cassandra website is building the wrong versioned docs  (was: 
Cassandra docs are building the wrong versioned docs)

> Cassandra website is building the wrong versioned docs
> --
>
> Key: CASSANDRA-16093
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16093
> Project: Cassandra
>  Issue Type: Bug
>  Components: Documentation/Website
>Reporter: Michael Semb Wever
>Assignee: Michael Semb Wever
>Priority: Normal
>
> For example
> https://cassandra.apache.org/doc/3.11/tools/nodetool/enablefullquerylog.html
> shouldn't be under the 3.11 documentation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16091) rpc server gets wrongly initialized with rpc_enabled:false

2020-09-01 Thread Brandon Williams (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-16091:
-
Fix Version/s: 3.11.9
Since Version: 3.11.8

> rpc server gets wrongly initialized with rpc_enabled:false
> --
>
> Key: CASSANDRA-16091
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16091
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Config
>Reporter: Tom van der Woerdt
>Priority: Normal
> Fix For: 3.11.9
>
>
> After upgrading to Cassandra 3.11.8, Cassandra no longer starts. An exception 
> is thrown:
> {code:java}
>  java.lang.RuntimeException: Client SSL is not supported for non-blocking 
> sockets (hsha). Please remove client ssl from the configuration.
>   at 
> org.apache.cassandra.thrift.THsHaDisruptorServer$Factory.buildTServer(THsHaDisruptorServer.java:74)
>   at 
> org.apache.cassandra.thrift.TServerCustomFactory.buildTServer(TServerCustomFactory.java:55)
>   at 
> org.apache.cassandra.thrift.ThriftServer$ThriftServerThread.(ThriftServer.java:128)
>   at org.apache.cassandra.thrift.ThriftServer.start(ThriftServer.java:55)
>   at 
> org.apache.cassandra.service.CassandraDaemon.startNativeTransport(CassandraDaemon.java:713)
>   at 
> org.apache.cassandra.service.CassandraDaemon.start(CassandraDaemon.java:538)
>   at 
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:643)
>   at 
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:768)
> {code}
> No configuration changed between 3.11.7 and 3.11.8. rpc_enabled is false in 
> both versions.
> I created this Jira issue because clearly something changed between 3.11.7 
> and 3.11.8. Maybe intentional, maybe not. Changing `rpc_server_type` (which 
> is not clearly documented to be about Thrift only) from `hsha` to `sync` does 
> resolve the issue, as expected, but this does seem like a regression, hence 
> the Jira issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-16091) rpc server gets wrongly initialized with rpc_enabled:false

2020-09-01 Thread Brandon Williams (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-16091:
-
 Bug Category: Parent values: Availability(12983)Level 1 values: 
Unavailable(12994)
   Complexity: Normal
Discovered By: User Report
 Severity: Normal
   Status: Open  (was: Triage Needed)

> rpc server gets wrongly initialized with rpc_enabled:false
> --
>
> Key: CASSANDRA-16091
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16091
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Config
>Reporter: Tom van der Woerdt
>Priority: Normal
>
> After upgrading to Cassandra 3.11.8, Cassandra no longer starts. An exception 
> is thrown:
> {code:java}
>  java.lang.RuntimeException: Client SSL is not supported for non-blocking 
> sockets (hsha). Please remove client ssl from the configuration.
>   at 
> org.apache.cassandra.thrift.THsHaDisruptorServer$Factory.buildTServer(THsHaDisruptorServer.java:74)
>   at 
> org.apache.cassandra.thrift.TServerCustomFactory.buildTServer(TServerCustomFactory.java:55)
>   at 
> org.apache.cassandra.thrift.ThriftServer$ThriftServerThread.(ThriftServer.java:128)
>   at org.apache.cassandra.thrift.ThriftServer.start(ThriftServer.java:55)
>   at 
> org.apache.cassandra.service.CassandraDaemon.startNativeTransport(CassandraDaemon.java:713)
>   at 
> org.apache.cassandra.service.CassandraDaemon.start(CassandraDaemon.java:538)
>   at 
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:643)
>   at 
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:768)
> {code}
> No configuration changed between 3.11.7 and 3.11.8. rpc_enabled is false in 
> both versions.
> I created this Jira issue because clearly something changed between 3.11.7 
> and 3.11.8. Maybe intentional, maybe not. Changing `rpc_server_type` (which 
> is not clearly documented to be about Thrift only) from `hsha` to `sync` does 
> resolve the issue, as expected, but this does seem like a regression, hence 
> the Jira issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-16072) Reduce thread contention in CommitLogSegment and HintsBuffer by rewriting CAS loops to atomic adds

2020-09-01 Thread Benjamin Lerer (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188608#comment-17188608
 ] 

Benjamin Lerer commented on CASSANDRA-16072:


Thanks a lot. 
The patches look go to me. Feel free to commit if CI is green.


> Reduce thread contention in CommitLogSegment and HintsBuffer by rewriting CAS 
> loops to atomic adds
> --
>
> Key: CASSANDRA-16072
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16072
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Consistency/Hints, Local/Commit Log
>Reporter: Michael Semb Wever
>Assignee: Michael Semb Wever
>Priority: Normal
> Fix For: 3.11.x, 4.0-beta
>
>
> Follow up to CASSANDRA-15922
> Both CommitLogSegment and HintsBuffer use AtomicIntegers for the current 
> offset when allocating. Like in CASSANDRA\-15922 the loops on 
> {{.compareAndSet(..)}} can be replaced with atomic adds using the {{. 
> getAndAdd(..)}} method.
> In highly contended environments the CAS failures can be high, starving 
> writes in a running Cassandra node. On the same cluster CASSANDRA\-15922 was 
> found, after CASSANDRA\-15922's fix was deployed, there was still problems 
> around commit log flushing and hints. No flamegraph was collected that 
> demonstrated the thread contention as clearly as was found in 
> CASSANDRA\-15922, but the performance fix proposed here hopefully is obvious 
> enough.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-15158) Wait for schema agreement rather than in flight schema requests when bootstrapping

2020-09-01 Thread Aleksey Yeschenko (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksey Yeschenko updated CASSANDRA-15158:
--
Status: Changes Suggested  (was: Review In Progress)

> Wait for schema agreement rather than in flight schema requests when 
> bootstrapping
> --
>
> Key: CASSANDRA-15158
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15158
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Gossip, Cluster/Schema
>Reporter: Vincent White
>Assignee: Blake Eggleston
>Priority: Normal
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently when a node is bootstrapping we use a set of latches 
> (org.apache.cassandra.service.MigrationTask#inflightTasks) to keep track of 
> in-flight schema pull requests, and we don't proceed with 
> bootstrapping/stream until all the latches are released (or we timeout 
> waiting for each one). One issue with this is that if we have a large schema, 
> or the retrieval of the schema from the other nodes was unexpectedly slow 
> then we have no explicit check in place to ensure we have actually received a 
> schema before we proceed.
> While it's possible to increase "migration_task_wait_in_seconds" to force the 
> node to wait on each latche longer, there are cases where this doesn't help 
> because the callbacks for the schema pull requests have expired off the 
> messaging service's callback map 
> (org.apache.cassandra.net.MessagingService#callbacks) after 
> request_timeout_in_ms (default 10 seconds) before the other nodes were able 
> to respond to the new node.
> This patch checks for schema agreement between the bootstrapping node and the 
> rest of the live nodes before proceeding with bootstrapping. It also adds a 
> check to prevent the new node from flooding existing nodes with simultaneous 
> schema pull requests as can happen in large clusters.
> Removing the latch system should also prevent new nodes in large clusters 
> getting stuck for extended amounts of time as they wait 
> `migration_task_wait_in_seconds` on each of the latches left orphaned by the 
> timed out callbacks.
>  
> ||3.11||
> |[PoC|https://github.com/apache/cassandra/compare/cassandra-3.11...vincewhite:check_for_schema]|
> |[dtest|https://github.com/apache/cassandra-dtest/compare/master...vincewhite:wait_for_schema_agreement]|
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-15158) Wait for schema agreement rather than in flight schema requests when bootstrapping

2020-09-01 Thread Aleksey Yeschenko (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188581#comment-17188581
 ] 

Aleksey Yeschenko edited comment on CASSANDRA-15158 at 9/1/20, 3:49 PM:


Pushed some minor tweaks 
[here|https://github.com/iamaleksey/cassandra/commits/15158-review]. Made some 
bits more idiomatic, and changed the way in-flight requests are being kept 
track of.

In general, this does the job and solves the problem in the description. It 
doesn't, however, fully deal with storms in large clusters caused by a sequence 
of updates in quick succession, but, it's not intended to, either.

EDIT: the amount of synchronisation here bothers me a tiny bit, as all of it 
will likely have to be eventually gotten rid of, when and if TPC happens, but I 
can live with it.


was (Author: iamaleksey):
Pushed some minor tweaks 
[here|https://github.com/iamaleksey/cassandra/commits/15158-review]. Made some 
bits more idiomatic, and changed the way in-flight requests are being kept 
track of.

In general, this does the job and solves the problem in the description. It 
doesn't, however, fully deal with storms in large clusters caused by a sequence 
of updates in quick succession, but, it's not intended to, either.

> Wait for schema agreement rather than in flight schema requests when 
> bootstrapping
> --
>
> Key: CASSANDRA-15158
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15158
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Gossip, Cluster/Schema
>Reporter: Vincent White
>Assignee: Blake Eggleston
>Priority: Normal
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently when a node is bootstrapping we use a set of latches 
> (org.apache.cassandra.service.MigrationTask#inflightTasks) to keep track of 
> in-flight schema pull requests, and we don't proceed with 
> bootstrapping/stream until all the latches are released (or we timeout 
> waiting for each one). One issue with this is that if we have a large schema, 
> or the retrieval of the schema from the other nodes was unexpectedly slow 
> then we have no explicit check in place to ensure we have actually received a 
> schema before we proceed.
> While it's possible to increase "migration_task_wait_in_seconds" to force the 
> node to wait on each latche longer, there are cases where this doesn't help 
> because the callbacks for the schema pull requests have expired off the 
> messaging service's callback map 
> (org.apache.cassandra.net.MessagingService#callbacks) after 
> request_timeout_in_ms (default 10 seconds) before the other nodes were able 
> to respond to the new node.
> This patch checks for schema agreement between the bootstrapping node and the 
> rest of the live nodes before proceeding with bootstrapping. It also adds a 
> check to prevent the new node from flooding existing nodes with simultaneous 
> schema pull requests as can happen in large clusters.
> Removing the latch system should also prevent new nodes in large clusters 
> getting stuck for extended amounts of time as they wait 
> `migration_task_wait_in_seconds` on each of the latches left orphaned by the 
> timed out callbacks.
>  
> ||3.11||
> |[PoC|https://github.com/apache/cassandra/compare/cassandra-3.11...vincewhite:check_for_schema]|
> |[dtest|https://github.com/apache/cassandra-dtest/compare/master...vincewhite:wait_for_schema_agreement]|
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15158) Wait for schema agreement rather than in flight schema requests when bootstrapping

2020-09-01 Thread Aleksey Yeschenko (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188581#comment-17188581
 ] 

Aleksey Yeschenko commented on CASSANDRA-15158:
---

Pushed some minor tweaks 
[here|https://github.com/iamaleksey/cassandra/commits/15158-review]. Made some 
bits more idiomatic, and changed the way in-flight requests are being kept 
track of.

In general, this does the job and solves the problem in the description. It 
doesn't, however, fully deal with storms in large clusters caused by a sequence 
of updates in quick succession, but, it's not intended to, either.

> Wait for schema agreement rather than in flight schema requests when 
> bootstrapping
> --
>
> Key: CASSANDRA-15158
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15158
> Project: Cassandra
>  Issue Type: Bug
>  Components: Cluster/Gossip, Cluster/Schema
>Reporter: Vincent White
>Assignee: Blake Eggleston
>Priority: Normal
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently when a node is bootstrapping we use a set of latches 
> (org.apache.cassandra.service.MigrationTask#inflightTasks) to keep track of 
> in-flight schema pull requests, and we don't proceed with 
> bootstrapping/stream until all the latches are released (or we timeout 
> waiting for each one). One issue with this is that if we have a large schema, 
> or the retrieval of the schema from the other nodes was unexpectedly slow 
> then we have no explicit check in place to ensure we have actually received a 
> schema before we proceed.
> While it's possible to increase "migration_task_wait_in_seconds" to force the 
> node to wait on each latche longer, there are cases where this doesn't help 
> because the callbacks for the schema pull requests have expired off the 
> messaging service's callback map 
> (org.apache.cassandra.net.MessagingService#callbacks) after 
> request_timeout_in_ms (default 10 seconds) before the other nodes were able 
> to respond to the new node.
> This patch checks for schema agreement between the bootstrapping node and the 
> rest of the live nodes before proceeding with bootstrapping. It also adds a 
> check to prevent the new node from flooding existing nodes with simultaneous 
> schema pull requests as can happen in large clusters.
> Removing the latch system should also prevent new nodes in large clusters 
> getting stuck for extended amounts of time as they wait 
> `migration_task_wait_in_seconds` on each of the latches left orphaned by the 
> timed out callbacks.
>  
> ||3.11||
> |[PoC|https://github.com/apache/cassandra/compare/cassandra-3.11...vincewhite:check_for_schema]|
> |[dtest|https://github.com/apache/cassandra-dtest/compare/master...vincewhite:wait_for_schema_agreement]|
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15861) Mutating sstable component may race with entire-sstable-streaming(ZCS) causing checksum validation failure

2020-09-01 Thread Benjamin Lerer (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188539#comment-17188539
 ] 

Benjamin Lerer commented on CASSANDRA-15861:


This part of the code was obviously broken. Having some tests to ensure that it 
does not happen again makes total sense.

> Mutating sstable component may race with entire-sstable-streaming(ZCS) 
> causing checksum validation failure
> --
>
> Key: CASSANDRA-15861
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15861
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair, Consistency/Streaming, 
> Local/Compaction
>Reporter: ZhaoYang
>Assignee: ZhaoYang
>Priority: Normal
> Fix For: 4.0-beta
>
>
> Flaky dtest: [test_dead_sync_initiator - 
> repair_tests.repair_test.TestRepair|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/143/testReport/junit/dtest.repair_tests.repair_test/TestRepair/test_dead_sync_initiator/]
> {code:java|title=stacktrace}
> Unexpected error found in node logs (see stdout for full details). Errors: 
> [ERROR [Stream-Deserializer-127.0.0.1:7000-570871f3] 2020-06-03 04:05:19,081 
> CassandraEntireSSTableStreamReader.java:145 - [Stream 
> 6f1c3360-a54f-11ea-a808-2f23710fdc90] Error while reading sstable from stream 
> for table = keyspace1.standard1
> org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.maybeValidateChecksum(MetadataSerializer.java:219)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:198)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:129)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.mutate(MetadataSerializer.java:226)
>   at 
> org.apache.cassandra.db.streaming.CassandraEntireSSTableStreamReader.read(CassandraEntireSSTableStreamReader.java:140)
>   at 
> org.apache.cassandra.db.streaming.CassandraIncomingFile.read(CassandraIncomingFile.java:78)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:36)
>   at 
> org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.async.StreamingInboundHandler$StreamDeserializingTask.run(StreamingInboundHandler.java:181)
>   at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Checksums do not match for 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
> {code}
>  
> In the above test, it executes "nodetool repair" on node1 and kills node2 
> during repair. At the end, node3 reports checksum validation failure on 
> sstable transferred from node1.
> {code:java|title=what happened}
> 1. When repair started on node1, it performs anti-compaction which modifies 
> sstable's repairAt to 0 and pending repair id to session-id.
> 2. Then node1 creates {{ComponentManifest}} which contains file lengths to be 
> transferred to node3.
> 3. Before node1 actually sends the files to node3, node2 is killed and node1 
> starts to broadcast repair-failure-message to all participants in 
> {{CoordinatorSession#fail}}
> 4. Node1 receives its own repair-failure-message and fails its local repair 
> sessions at {{LocalSessions#failSession}} which triggers async background 
> compaction.
> 5. Node1's background compaction will mutate sstable's repairAt to 0 and 
> pending repair id to null via  
> {{PendingRepairManager#getNextRepairFinishedTask}}, as there is no more 
> in-progress repair.
> 6. Node1 actually sends the sstable to node3 where the sstable's STATS 
> component size is different from the original size recorded in the manifest.
> 7. At the end, node3 reports checksum validation failure when it tries to 
> mutate sstable level and "isTransient" attribute in 
> {{CassandraEntireSSTableStreamReader#read}}.
> {code}
> Currently, entire-sstable-streaming requires sstable components to be 
> immutable, because \{{ComponentManifest}}
> with component sizes are sent before sending actual files. This isn't a 
> problem in legacy streaming as STATS file length 

[jira] [Commented] (CASSANDRA-15861) Mutating sstable component may race with entire-sstable-streaming(ZCS) causing checksum validation failure

2020-09-01 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188529#comment-17188529
 ] 

Stefan Miklosovic commented on CASSANDRA-15861:
---

Thanks [~blerer], but anyway, a test as I mentioned would be awesome to see 
when the code / logic which is touching this has changed little bit and we 
would be sure it is just computed right. The fact that we found that issue as 
part of 15406 is telling ... It would also simplify a lot my patch because 
right now I am literally parsing the output of netstats and I track that these 
numbers do make sense and the total is eventually equal to that sum. I am not 
so strong in internals so I am not able to write that "low level" test on my 
own. But ultimately it is not about me having things easier, but imho we should 
really test that and this patch seems to be like a good candidate to do it.

> Mutating sstable component may race with entire-sstable-streaming(ZCS) 
> causing checksum validation failure
> --
>
> Key: CASSANDRA-15861
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15861
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair, Consistency/Streaming, 
> Local/Compaction
>Reporter: ZhaoYang
>Assignee: ZhaoYang
>Priority: Normal
> Fix For: 4.0-beta
>
>
> Flaky dtest: [test_dead_sync_initiator - 
> repair_tests.repair_test.TestRepair|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/143/testReport/junit/dtest.repair_tests.repair_test/TestRepair/test_dead_sync_initiator/]
> {code:java|title=stacktrace}
> Unexpected error found in node logs (see stdout for full details). Errors: 
> [ERROR [Stream-Deserializer-127.0.0.1:7000-570871f3] 2020-06-03 04:05:19,081 
> CassandraEntireSSTableStreamReader.java:145 - [Stream 
> 6f1c3360-a54f-11ea-a808-2f23710fdc90] Error while reading sstable from stream 
> for table = keyspace1.standard1
> org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.maybeValidateChecksum(MetadataSerializer.java:219)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:198)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:129)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.mutate(MetadataSerializer.java:226)
>   at 
> org.apache.cassandra.db.streaming.CassandraEntireSSTableStreamReader.read(CassandraEntireSSTableStreamReader.java:140)
>   at 
> org.apache.cassandra.db.streaming.CassandraIncomingFile.read(CassandraIncomingFile.java:78)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:36)
>   at 
> org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.async.StreamingInboundHandler$StreamDeserializingTask.run(StreamingInboundHandler.java:181)
>   at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Checksums do not match for 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
> {code}
>  
> In the above test, it executes "nodetool repair" on node1 and kills node2 
> during repair. At the end, node3 reports checksum validation failure on 
> sstable transferred from node1.
> {code:java|title=what happened}
> 1. When repair started on node1, it performs anti-compaction which modifies 
> sstable's repairAt to 0 and pending repair id to session-id.
> 2. Then node1 creates {{ComponentManifest}} which contains file lengths to be 
> transferred to node3.
> 3. Before node1 actually sends the files to node3, node2 is killed and node1 
> starts to broadcast repair-failure-message to all participants in 
> {{CoordinatorSession#fail}}
> 4. Node1 receives its own repair-failure-message and fails its local repair 
> sessions at {{LocalSessions#failSession}} which triggers async background 
> compaction.
> 5. Node1's background compaction will mutate sstable's repairAt to 0 and 
> pending repair id to null via  
> {{PendingRepairManager#getNextRepairFinishedTask}}, as there is no more 
> in-progress repair.

[jira] [Updated] (CASSANDRA-16093) Cassandra docs are building the wrong versioned docs

2020-09-01 Thread Michael Semb Wever (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Semb Wever updated CASSANDRA-16093:
---
 Bug Category: Parent values: Correctness(12982)Level 1 values: API / 
Semantic Definition(13162)
   Complexity: Normal
Discovered By: User Report
 Severity: Low
   Status: Open  (was: Triage Needed)

> Cassandra docs are building the wrong versioned docs
> 
>
> Key: CASSANDRA-16093
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16093
> Project: Cassandra
>  Issue Type: Bug
>  Components: Documentation/Website
>Reporter: Michael Semb Wever
>Assignee: Michael Semb Wever
>Priority: Normal
>
> For example
> https://cassandra.apache.org/doc/3.11/tools/nodetool/enablefullquerylog.html
> shouldn't be under the 3.11 documentation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Assigned] (CASSANDRA-16093) Cassandra docs are building the wrong versioned docs

2020-09-01 Thread Michael Semb Wever (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Semb Wever reassigned CASSANDRA-16093:
--

Assignee: Michael Semb Wever

> Cassandra docs are building the wrong versioned docs
> 
>
> Key: CASSANDRA-16093
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16093
> Project: Cassandra
>  Issue Type: Bug
>  Components: Documentation/Website
>Reporter: Michael Semb Wever
>Assignee: Michael Semb Wever
>Priority: Normal
>
> For example
> https://cassandra.apache.org/doc/3.11/tools/nodetool/enablefullquerylog.html
> shouldn't be under the 3.11 documentation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Created] (CASSANDRA-16093) Cassandra docs are building the wrong versioned docs

2020-09-01 Thread Michael Semb Wever (Jira)
Michael Semb Wever created CASSANDRA-16093:
--

 Summary: Cassandra docs are building the wrong versioned docs
 Key: CASSANDRA-16093
 URL: https://issues.apache.org/jira/browse/CASSANDRA-16093
 Project: Cassandra
  Issue Type: Bug
  Components: Documentation/Website
Reporter: Michael Semb Wever


For example
https://cassandra.apache.org/doc/3.11/tools/nodetool/enablefullquerylog.html

shouldn't be under the 3.11 documentation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Created] (CASSANDRA-16092) Add Index Group Interface for Storage Attached Index

2020-09-01 Thread ZhaoYang (Jira)
ZhaoYang created CASSANDRA-16092:


 Summary: Add Index Group Interface for Storage Attached Index
 Key: CASSANDRA-16092
 URL: https://issues.apache.org/jira/browse/CASSANDRA-16092
 Project: Cassandra
  Issue Type: New Feature
  Components: Feature/SASI
Reporter: ZhaoYang


[Index 
group|https://github.com/datastax/cassandra/blob/storage_attached_index/src/java/org/apache/cassandra/index/Index.java#L634]
 interface allows:
* indexes on the same table to receive centralized lifecycle events called 
secondary index groups. Sharing of data between multiple column indexes on the 
same table allows SAI disk usage to realise significant space savings over 
other index implementations.
* index-group to analyze user query and provide a query plan that leverages all 
available indexes within the group.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15861) Mutating sstable component may race with entire-sstable-streaming(ZCS) causing checksum validation failure

2020-09-01 Thread Benjamin Lerer (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188453#comment-17188453
 ] 

Benjamin Lerer commented on CASSANDRA-15861:


[~stefan.miklosovic] CASSANDRA-15406 as been going on for so long that I 
perfectly understand your frustration and the fact that you are worried to go 
to square one. The good news is that I am reviewer on both patches and I will 
do my best to ensure that there are no problem between them.
[~jasonstack] found an interesting issue with the missing part in the size 
calculation. I missed that part by not digging deeper enough in the code 
history. That would have caused us to reintroduce CASSANDRA-10680. I need to 
have a new look at that part for CASSANDRA-15406

> Mutating sstable component may race with entire-sstable-streaming(ZCS) 
> causing checksum validation failure
> --
>
> Key: CASSANDRA-15861
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15861
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair, Consistency/Streaming, 
> Local/Compaction
>Reporter: ZhaoYang
>Assignee: ZhaoYang
>Priority: Normal
> Fix For: 4.0-beta
>
>
> Flaky dtest: [test_dead_sync_initiator - 
> repair_tests.repair_test.TestRepair|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/143/testReport/junit/dtest.repair_tests.repair_test/TestRepair/test_dead_sync_initiator/]
> {code:java|title=stacktrace}
> Unexpected error found in node logs (see stdout for full details). Errors: 
> [ERROR [Stream-Deserializer-127.0.0.1:7000-570871f3] 2020-06-03 04:05:19,081 
> CassandraEntireSSTableStreamReader.java:145 - [Stream 
> 6f1c3360-a54f-11ea-a808-2f23710fdc90] Error while reading sstable from stream 
> for table = keyspace1.standard1
> org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.maybeValidateChecksum(MetadataSerializer.java:219)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:198)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:129)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.mutate(MetadataSerializer.java:226)
>   at 
> org.apache.cassandra.db.streaming.CassandraEntireSSTableStreamReader.read(CassandraEntireSSTableStreamReader.java:140)
>   at 
> org.apache.cassandra.db.streaming.CassandraIncomingFile.read(CassandraIncomingFile.java:78)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:36)
>   at 
> org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.async.StreamingInboundHandler$StreamDeserializingTask.run(StreamingInboundHandler.java:181)
>   at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Checksums do not match for 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
> {code}
>  
> In the above test, it executes "nodetool repair" on node1 and kills node2 
> during repair. At the end, node3 reports checksum validation failure on 
> sstable transferred from node1.
> {code:java|title=what happened}
> 1. When repair started on node1, it performs anti-compaction which modifies 
> sstable's repairAt to 0 and pending repair id to session-id.
> 2. Then node1 creates {{ComponentManifest}} which contains file lengths to be 
> transferred to node3.
> 3. Before node1 actually sends the files to node3, node2 is killed and node1 
> starts to broadcast repair-failure-message to all participants in 
> {{CoordinatorSession#fail}}
> 4. Node1 receives its own repair-failure-message and fails its local repair 
> sessions at {{LocalSessions#failSession}} which triggers async background 
> compaction.
> 5. Node1's background compaction will mutate sstable's repairAt to 0 and 
> pending repair id to null via  
> {{PendingRepairManager#getNextRepairFinishedTask}}, as there is no more 
> in-progress repair.
> 6. Node1 actually sends the sstable to node3 where the sstable's STATS 
> component size is different from the original size recorded 

[jira] [Commented] (CASSANDRA-15958) org.apache.cassandra.net.ConnectionTest testMessagePurging

2020-09-01 Thread Aleksey Yeschenko (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188398#comment-17188398
 ] 

Aleksey Yeschenko commented on CASSANDRA-15958:
---

Ping [~sbtourist] / CASSANDRA-15700

> org.apache.cassandra.net.ConnectionTest testMessagePurging
> --
>
> Key: CASSANDRA-15958
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15958
> Project: Cassandra
>  Issue Type: Bug
>  Components: Test/unit
>Reporter: David Capwell
>Assignee: Adam Holmberg
>Priority: Normal
> Fix For: 4.0-beta
>
>
> Build: 
> https://ci-cassandra.apache.org/job/Cassandra-trunk-test/196/testReport/junit/org.apache.cassandra.net/ConnectionTest/testMessagePurging/
> Build: 
> https://ci-cassandra.apache.org/job/Cassandra-trunk-test/194/testReport/junit/org.apache.cassandra.net/ConnectionTest/testMessagePurging/
> java.util.concurrent.TimeoutException
>   at org.apache.cassandra.net.AsyncPromise.get(AsyncPromise.java:258)
>   at org.apache.cassandra.net.FutureDelegate.get(FutureDelegate.java:143)
>   at 
> org.apache.cassandra.net.ConnectionTest.doTestManual(ConnectionTest.java:268)
>   at 
> org.apache.cassandra.net.ConnectionTest.testManual(ConnectionTest.java:236)
>   at 
> org.apache.cassandra.net.ConnectionTest.testMessagePurging(ConnectionTest.java:679)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-16072) Reduce thread contention in CommitLogSegment and HintsBuffer by rewriting CAS loops to atomic adds

2020-09-01 Thread Michael Semb Wever (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188367#comment-17188367
 ] 

Michael Semb Wever commented on CASSANDRA-16072:


Thanks for the feedback [~blerer].

bq. If the slab size is around 1GB, the maximum hint size will be around 500MB.

I see now that the slab size can in fact be 2GB, and max mutation size 1GB. 
This makes the problem worse (even if very edge-case). {{HintsBuffer.position}} 
has been changed to {{AtomicLong}}

bq. Regarding CommitLogSegment, it will be good to have a comment explaining 
the negative value logic.

Done. Two comments added, explaining when the overflow is harmless, and when it 
isn't and hence the cast to long.


Patches
 - 
[3.11|https://github.com/apache/cassandra/compare/cassandra-3.11...thelastpickle:mck/cassandra-3.11_cas_improvements]
 with CI 
[run|https://ci-cassandra.apache.org/blue/organizations/jenkins/Cassandra-devbranch/detail/Cassandra-devbranch/301/pipeline]
 - 
[trunk|https://github.com/apache/cassandra/compare/trunk...thelastpickle:mck/trunk_cas_improvements]
 with CI 
[run|https://ci-cassandra.apache.org/blue/organizations/jenkins/Cassandra-devbranch/detail/Cassandra-devbranch/303/pipeline]




> Reduce thread contention in CommitLogSegment and HintsBuffer by rewriting CAS 
> loops to atomic adds
> --
>
> Key: CASSANDRA-16072
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16072
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Consistency/Hints, Local/Commit Log
>Reporter: Michael Semb Wever
>Assignee: Michael Semb Wever
>Priority: Normal
> Fix For: 3.11.x, 4.0-beta
>
>
> Follow up to CASSANDRA-15922
> Both CommitLogSegment and HintsBuffer use AtomicIntegers for the current 
> offset when allocating. Like in CASSANDRA\-15922 the loops on 
> {{.compareAndSet(..)}} can be replaced with atomic adds using the {{. 
> getAndAdd(..)}} method.
> In highly contended environments the CAS failures can be high, starving 
> writes in a running Cassandra node. On the same cluster CASSANDRA\-15922 was 
> found, after CASSANDRA\-15922's fix was deployed, there was still problems 
> around commit log flushing and hints. No flamegraph was collected that 
> demonstrated the thread contention as clearly as was found in 
> CASSANDRA\-15922, but the performance fix proposed here hopefully is obvious 
> enough.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[cassandra-website] branch master updated (5f47e16 -> 85519e8)

2020-09-01 Thread mck
This is an automated email from the ASF dual-hosted git repository.

mck pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/cassandra-website.git.


from 5f47e16  Add 3.11.8, 3.11.9 4.0-beta2, 4.0-beta3
 add 85519e8  add generated docs for 3.11.9 and 4.0-beta2

No new revisions were added by this update.

Summary of changes:
 src/doc/3.11   |2 +-
 src/doc/3.11.9/.buildinfo  |4 +
 .../_images/eclipse_debug0.png |  Bin
 .../_images/eclipse_debug1.png |  Bin
 .../_images/eclipse_debug2.png |  Bin
 .../_images/eclipse_debug3.png |  Bin
 .../_images/eclipse_debug4.png |  Bin
 .../_images/eclipse_debug5.png |  Bin
 .../_images/eclipse_debug6.png |  Bin
 .../_sources/architecture/dynamo.rst.txt   |0
 .../_sources/architecture/guarantees.rst.txt   |0
 .../_sources/architecture/index.rst.txt|0
 .../_sources/architecture/overview.rst.txt |0
 .../_sources/architecture/storage_engine.rst.txt   |0
 src/doc/{3.11.8 => 3.11.9}/_sources/bugs.rst.txt   |0
 .../configuration/cassandra_config_file.rst.txt| 1920 ++
 .../_sources/configuration/index.rst.txt   |0
 .../{3.11.8 => 3.11.9}/_sources/contactus.rst.txt  |0
 .../_sources/cql/appendices.rst.txt|0
 .../_sources/cql/changes.rst.txt   |0
 .../{3.11.8 => 3.11.9}/_sources/cql/ddl.rst.txt|0
 .../_sources/cql/definitions.rst.txt   |0
 .../{4.0-beta2 => 3.11.9}/_sources/cql/dml.rst.txt |0
 .../_sources/cql/functions.rst.txt |0
 .../{3.11.8 => 3.11.9}/_sources/cql/index.rst.txt  |0
 .../_sources/cql/indexes.rst.txt   |0
 .../_sources/cql/json.rst.txt  |0
 .../{3.11.8 => 3.11.9}/_sources/cql/mvs.rst.txt|0
 .../_sources/cql/security.rst.txt  |0
 .../_sources/cql/triggers.rst.txt  |0
 .../_sources/cql/types.rst.txt |0
 .../_sources/data_modeling/index.rst.txt   |0
 .../_sources/development/code_style.rst.txt|0
 .../_sources/development/how_to_commit.rst.txt |0
 .../_sources/development/how_to_review.rst.txt |0
 .../_sources/development/ide.rst.txt   |0
 .../_sources/development/index.rst.txt |0
 .../_sources/development/patches.rst.txt   |0
 .../_sources/development/testing.rst.txt   |0
 .../{3.11.8 => 3.11.9}/_sources/faq/index.rst.txt  |0
 .../_sources/getting_started/configuring.rst.txt   |0
 .../_sources/getting_started/drivers.rst.txt   |0
 .../_sources/getting_started/index.rst.txt |0
 .../_sources/getting_started/installing.rst.txt|0
 .../_sources/getting_started/querying.rst.txt  |0
 src/doc/{3.11.8 => 3.11.9}/_sources/index.rst.txt  |0
 .../_sources/operating/backups.rst.txt |0
 .../_sources/operating/bloom_filters.rst.txt   |0
 .../_sources/operating/bulk_loading.rst.txt|0
 .../_sources/operating/cdc.rst.txt |0
 .../_sources/operating/compaction.rst.txt  |0
 .../_sources/operating/compression.rst.txt |0
 .../_sources/operating/hardware.rst.txt|0
 .../_sources/operating/hints.rst.txt   |0
 .../_sources/operating/index.rst.txt   |0
 src/doc/3.11.9/_sources/operating/metrics.rst.txt  |  710 +++
 .../_sources/operating/read_repair.rst.txt |0
 .../_sources/operating/repair.rst.txt  |0
 .../_sources/operating/security.rst.txt|0
 .../_sources/operating/snitch.rst.txt  |0
 .../_sources/operating/topo_changes.rst.txt|0
 .../_sources/tools/cqlsh.rst.txt   |0
 .../_sources/tools/index.rst.txt   |0
 .../_sources/tools/nodetool.rst.txt|0
 .../_sources/tools/nodetool/assassinate.rst.txt|0
 .../_sources/tools/nodetool/bootstrap.rst.txt  |0
 .../_sources/tools/nodetool/cleanup.rst.txt|0
 .../_sources/tools/nodetool/clearsnapshot.rst.txt  |0
 .../_sources/tools/nodetool/clientstats.rst.txt|0
 .../_sources/tools/nodetool/compact.rst.txt|0
 .../tools/nodetool/compactionhistory.rst.txt   |0
 .../tools/nodetool/compactionstats.rst.txt |0
 .../_sources/tools/nodetool/decommission.rst.txt   |0
 .../tools/nodetool/describecluster.rst.txt |0
 .../_sources/tools/nodetool/describering.rst.txt   |0
 .../tools/nodetool/disableauditlog.rst.txt |0
 .../tools/nodetool/disableautocompaction.rst.txt   |0
 .../_sources/tools/nodetool/disablebackup.rst.txt  |

[jira] [Updated] (CASSANDRA-16091) rpc server gets wrongly initialized with rpc_enabled:false

2020-09-01 Thread Tom van der Woerdt (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-16091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom van der Woerdt updated CASSANDRA-16091:
---
Description: 
After upgrading to Cassandra 3.11.8, Cassandra no longer starts. An exception 
is thrown:
{code:java}
 java.lang.RuntimeException: Client SSL is not supported for non-blocking 
sockets (hsha). Please remove client ssl from the configuration.
at 
org.apache.cassandra.thrift.THsHaDisruptorServer$Factory.buildTServer(THsHaDisruptorServer.java:74)
at 
org.apache.cassandra.thrift.TServerCustomFactory.buildTServer(TServerCustomFactory.java:55)
at 
org.apache.cassandra.thrift.ThriftServer$ThriftServerThread.(ThriftServer.java:128)
at org.apache.cassandra.thrift.ThriftServer.start(ThriftServer.java:55)
at 
org.apache.cassandra.service.CassandraDaemon.startNativeTransport(CassandraDaemon.java:713)
at 
org.apache.cassandra.service.CassandraDaemon.start(CassandraDaemon.java:538)
at 
org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:643)
at 
org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:768)
{code}
No configuration changed between 3.11.7 and 3.11.8. rpc_enabled is false in 
both versions.

I created this Jira issue because clearly something changed between 3.11.7 and 
3.11.8. Maybe intentional, maybe not. Changing `rpc_server_type` (which is not 
clearly documented to be about Thrift only) from `hsha` to `sync` does resolve 
the issue, as expected, but this does seem like a regression, hence the Jira 
issue.

  was:
After upgrading to Cassandra 3.11.8, Cassandra no longer starts. An exception 
is thrown:
{code:java}
 java.lang.RuntimeException: Client SSL is not supported for non-blocking 
sockets (hsha). Please remove client ssl from the configuration.
at 
org.apache.cassandra.thrift.THsHaDisruptorServer$Factory.buildTServer(THsHaDisruptorServer.java:74)
at 
org.apache.cassandra.thrift.TServerCustomFactory.buildTServer(TServerCustomFactory.java:55)
at 
org.apache.cassandra.thrift.ThriftServer$ThriftServerThread.(ThriftServer.java:128)
at org.apache.cassandra.thrift.ThriftServer.start(ThriftServer.java:55)
at 
org.apache.cassandra.service.CassandraDaemon.startNativeTransport(CassandraDaemon.java:713)
at 
org.apache.cassandra.service.CassandraDaemon.start(CassandraDaemon.java:538)
at 
org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:643)
at 
org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:768)
{code}
No configuration changed between 3.11.7 and 3.11.8. rpc_enabled is false in 
both versions.


> rpc server gets wrongly initialized with rpc_enabled:false
> --
>
> Key: CASSANDRA-16091
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16091
> Project: Cassandra
>  Issue Type: Bug
>  Components: Local/Config
>Reporter: Tom van der Woerdt
>Priority: Normal
>
> After upgrading to Cassandra 3.11.8, Cassandra no longer starts. An exception 
> is thrown:
> {code:java}
>  java.lang.RuntimeException: Client SSL is not supported for non-blocking 
> sockets (hsha). Please remove client ssl from the configuration.
>   at 
> org.apache.cassandra.thrift.THsHaDisruptorServer$Factory.buildTServer(THsHaDisruptorServer.java:74)
>   at 
> org.apache.cassandra.thrift.TServerCustomFactory.buildTServer(TServerCustomFactory.java:55)
>   at 
> org.apache.cassandra.thrift.ThriftServer$ThriftServerThread.(ThriftServer.java:128)
>   at org.apache.cassandra.thrift.ThriftServer.start(ThriftServer.java:55)
>   at 
> org.apache.cassandra.service.CassandraDaemon.startNativeTransport(CassandraDaemon.java:713)
>   at 
> org.apache.cassandra.service.CassandraDaemon.start(CassandraDaemon.java:538)
>   at 
> org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:643)
>   at 
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:768)
> {code}
> No configuration changed between 3.11.7 and 3.11.8. rpc_enabled is false in 
> both versions.
> I created this Jira issue because clearly something changed between 3.11.7 
> and 3.11.8. Maybe intentional, maybe not. Changing `rpc_server_type` (which 
> is not clearly documented to be about Thrift only) from `hsha` to `sync` does 
> resolve the issue, as expected, but this does seem like a regression, hence 
> the Jira issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Created] (CASSANDRA-16091) rpc server gets wrongly initialized with rpc_enabled:false

2020-09-01 Thread Tom van der Woerdt (Jira)
Tom van der Woerdt created CASSANDRA-16091:
--

 Summary: rpc server gets wrongly initialized with rpc_enabled:false
 Key: CASSANDRA-16091
 URL: https://issues.apache.org/jira/browse/CASSANDRA-16091
 Project: Cassandra
  Issue Type: Bug
  Components: Local/Config
Reporter: Tom van der Woerdt


After upgrading to Cassandra 3.11.8, Cassandra no longer starts. An exception 
is thrown:
{code:java}
 java.lang.RuntimeException: Client SSL is not supported for non-blocking 
sockets (hsha). Please remove client ssl from the configuration.
at 
org.apache.cassandra.thrift.THsHaDisruptorServer$Factory.buildTServer(THsHaDisruptorServer.java:74)
at 
org.apache.cassandra.thrift.TServerCustomFactory.buildTServer(TServerCustomFactory.java:55)
at 
org.apache.cassandra.thrift.ThriftServer$ThriftServerThread.(ThriftServer.java:128)
at org.apache.cassandra.thrift.ThriftServer.start(ThriftServer.java:55)
at 
org.apache.cassandra.service.CassandraDaemon.startNativeTransport(CassandraDaemon.java:713)
at 
org.apache.cassandra.service.CassandraDaemon.start(CassandraDaemon.java:538)
at 
org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:643)
at 
org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:768)
{code}
No configuration changed between 3.11.7 and 3.11.8. rpc_enabled is false in 
both versions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[cassandra-builds] branch master updated: ninja-fix: in finish_release.sh fix quote character in echo'd instructions at end

2020-09-01 Thread mck
This is an automated email from the ASF dual-hosted git repository.

mck pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/cassandra-builds.git


The following commit(s) were added to refs/heads/master by this push:
 new 4b6caff  ninja-fix: in finish_release.sh fix quote character in echo'd 
instructions at end
4b6caff is described below

commit 4b6caffd63b635eaf330ba0232095f7863e946c9
Author: Mick Semb Wever 
AuthorDate: Tue Sep 1 12:35:02 2020 +0200

ninja-fix: in finish_release.sh fix quote character in echo'd instructions 
at end
---
 cassandra-release/finish_release.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/cassandra-release/finish_release.sh 
b/cassandra-release/finish_release.sh
index ce4cf5c..45aea38 100755
--- a/cassandra-release/finish_release.sh
+++ b/cassandra-release/finish_release.sh
@@ -262,4 +262,4 @@ echo ' 9) tweet from @cassandra'
 echo ' 10) release version in JIRA'
 echo ' 11) remove old version (eg: `svn rm 
https://dist.apache.org/repos/dist/release/cassandra/`)'
 echo ' 12) increment build.xml base.version for the next release'
-echo ' 13) follow instructions in email you will receive from the \"Apache 
Reporter Service\" to update the project\'s list of releases in 
reporter.apache.org'
+echo ' 13) follow instructions in email you will receive from the \"Apache 
Reporter Service\" to update the project`s list of releases in 
reporter.apache.org'


-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15991) 15583 - Add UX tests to intree LHF tooling

2020-09-01 Thread Berenguer Blasi (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188324#comment-17188324
 ] 

Berenguer Blasi commented on CASSANDRA-15991:
-

Thanks a lot, really appreciated. I'll comment on the PR.

> 15583 - Add UX tests to intree LHF tooling
> --
>
> Key: CASSANDRA-15991
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15991
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Test/unit
>Reporter: Berenguer Blasi
>Assignee: Berenguer Blasi
>Priority: Normal
> Fix For: 4.0-beta
>
>
> As per CASSANDRA-15583 many in tree tools lack proper UX tooling: mandatory 
> params are indeed mandatory, 'help' produces an actual help, return codes etc
> This ticket is an attempt to add it to those tools that classify as LHF. 
> Other tools such as nodetool, with many sub-commands, deserve a separate 
> ticket of their own



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-16072) Reduce thread contention in CommitLogSegment and HintsBuffer by rewriting CAS loops to atomic adds

2020-09-01 Thread Benjamin Lerer (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188321#comment-17188321
 ] 

Benjamin Lerer commented on CASSANDRA-16072:


If we do some rough calculation on the limits of things for {{HintBuffer}}:
If the slab size is around {{1GB}}, the maximum hint size will be around 
{{500MB}}.
Once we exceed the slab capacity the {{position}} will be set to {{-2GB}}. It 
will only take 5 hints at a size close from the maximum hint size to be back on 
the positive side.
It seems to me that we are on a risky side. What do you think?

Regarding {{CommitLogSegment}}, it will be good to have a comment explaining 
the negative value logic. People might not get it straight away as it is an 
obvious way to handle overflow.
 
  

> Reduce thread contention in CommitLogSegment and HintsBuffer by rewriting CAS 
> loops to atomic adds
> --
>
> Key: CASSANDRA-16072
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16072
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Consistency/Hints, Local/Commit Log
>Reporter: Michael Semb Wever
>Assignee: Michael Semb Wever
>Priority: Normal
> Fix For: 3.11.x, 4.0-beta
>
>
> Follow up to CASSANDRA-15922
> Both CommitLogSegment and HintsBuffer use AtomicIntegers for the current 
> offset when allocating. Like in CASSANDRA\-15922 the loops on 
> {{.compareAndSet(..)}} can be replaced with atomic adds using the {{. 
> getAndAdd(..)}} method.
> In highly contended environments the CAS failures can be high, starving 
> writes in a running Cassandra node. On the same cluster CASSANDRA\-15922 was 
> found, after CASSANDRA\-15922's fix was deployed, there was still problems 
> around commit log flushing and hints. No flamegraph was collected that 
> demonstrated the thread contention as clearly as was found in 
> CASSANDRA\-15922, but the performance fix proposed here hopefully is obvious 
> enough.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15701) Does Cassandra 3.11.3/3.11.5 is affected by CVE-2019-10712 or not ?

2020-09-01 Thread Tigran Mardanyan (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188281#comment-17188281
 ] 

Tigran Mardanyan commented on CASSANDRA-15701:
--

Sorry guys, but my security scanners warn about this as well. It seems Apache 
Cassandra is not affected but need confirmation from you experts.

If this is not correct place to talk about this issue, please direct me to the 
correct platform where I can ask for impact analyze.

Any updated on this much appreciated.

> Does  Cassandra 3.11.3/3.11.5  is affected by CVE-2019-10712 or not ?
> -
>
> Key: CASSANDRA-15701
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15701
> Project: Cassandra
>  Issue Type: Bug
>  Components: Dependencies
>Reporter: wht
>Priority: Normal
>
> Because  cassandra 3.11.3/3.11.5 rely on jackson-mapper-asl-1.9.13.jar which 
> has been reported a vulnerability CVE-2019-10172, 
> [https://nvd.nist.gov/vuln/detail/CVE-2019-10172], so I want to know if it 
> has an impact to cassandra. Thanks!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Comment Edited] (CASSANDRA-15861) Mutating sstable component may race with entire-sstable-streaming(ZCS) causing checksum validation failure

2020-09-01 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188181#comment-17188181
 ] 

Stefan Miklosovic edited comment on CASSANDRA-15861 at 9/1/20, 6:48 AM:


I understand that this is the most probably just more important patch than my 
"percentage tracking" where I am just accidentally fixing some size-related bug 
and thats life ... But I would really like to see the test where we are 
checking this properly because netstats can again report some weird numbers and 
we are back in square one. The ideal test would look like - generate some 
sstables compressed / uncompressed and try to stream it in their entirety / 
without and check that the total sizes to be streamed are equal to sum of sizes 
of all individual tables. For all 4 combinations. I have checked your PR and I 
dont see this kind of test and I am worried that it will be not be matching 
again.


was (Author: stefan.miklosovic):
I understand that this is the most probably just more important patch then my 
"percentage tracking" where I am just accidentally fixing some size-related bug 
and thats life ... But I would really like to see the test where we are 
checking this properly because netstats can again report some weird numbers and 
we are back in square one. The ideal test would look like - generate some 
sstables compressed / uncompressed and try to stream it in their entirety / 
without and check that the total sizes to be streamed are equal to sum of sizes 
of all individual tables. For all 4 combinations. I have checked your PR and I 
dont see this kind of test and I am worried that it will be not be matching 
again.

> Mutating sstable component may race with entire-sstable-streaming(ZCS) 
> causing checksum validation failure
> --
>
> Key: CASSANDRA-15861
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15861
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair, Consistency/Streaming, 
> Local/Compaction
>Reporter: ZhaoYang
>Assignee: ZhaoYang
>Priority: Normal
> Fix For: 4.0-beta
>
>
> Flaky dtest: [test_dead_sync_initiator - 
> repair_tests.repair_test.TestRepair|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/143/testReport/junit/dtest.repair_tests.repair_test/TestRepair/test_dead_sync_initiator/]
> {code:java|title=stacktrace}
> Unexpected error found in node logs (see stdout for full details). Errors: 
> [ERROR [Stream-Deserializer-127.0.0.1:7000-570871f3] 2020-06-03 04:05:19,081 
> CassandraEntireSSTableStreamReader.java:145 - [Stream 
> 6f1c3360-a54f-11ea-a808-2f23710fdc90] Error while reading sstable from stream 
> for table = keyspace1.standard1
> org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.maybeValidateChecksum(MetadataSerializer.java:219)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:198)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:129)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.mutate(MetadataSerializer.java:226)
>   at 
> org.apache.cassandra.db.streaming.CassandraEntireSSTableStreamReader.read(CassandraEntireSSTableStreamReader.java:140)
>   at 
> org.apache.cassandra.db.streaming.CassandraIncomingFile.read(CassandraIncomingFile.java:78)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:36)
>   at 
> org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.async.StreamingInboundHandler$StreamDeserializingTask.run(StreamingInboundHandler.java:181)
>   at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Checksums do not match for 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
> {code}
>  
> In the above test, it executes "nodetool repair" on node1 and kills node2 
> during repair. At the end, node3 reports checksum validation failure on 
> sstable transferred from node1.
> {code:java|title=what 

[jira] [Commented] (CASSANDRA-15861) Mutating sstable component may race with entire-sstable-streaming(ZCS) causing checksum validation failure

2020-09-01 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188181#comment-17188181
 ] 

Stefan Miklosovic commented on CASSANDRA-15861:
---

I understand that this is the most probably just more important patch that my 
"percentage tracking" where I am just accidentally fixing some size-related bug 
and thats life ... But I would really like to see the test where we are 
checking this properly because netstats can again report some weird numbers and 
we are back in square one. The ideal test would look like - generate some 
sstables compressed / uncompressed and try to stream it in their entirety / 
without and check that the total sizes to be streamed are equal to sum of sizes 
of all individual tables. For all 4 combinations. I have checked your PR and I 
dont see this kind of test and I am worried that it will be not be matching 
again.

> Mutating sstable component may race with entire-sstable-streaming(ZCS) 
> causing checksum validation failure
> --
>
> Key: CASSANDRA-15861
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15861
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair, Consistency/Streaming, 
> Local/Compaction
>Reporter: ZhaoYang
>Assignee: ZhaoYang
>Priority: Normal
> Fix For: 4.0-beta
>
>
> Flaky dtest: [test_dead_sync_initiator - 
> repair_tests.repair_test.TestRepair|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/143/testReport/junit/dtest.repair_tests.repair_test/TestRepair/test_dead_sync_initiator/]
> {code:java|title=stacktrace}
> Unexpected error found in node logs (see stdout for full details). Errors: 
> [ERROR [Stream-Deserializer-127.0.0.1:7000-570871f3] 2020-06-03 04:05:19,081 
> CassandraEntireSSTableStreamReader.java:145 - [Stream 
> 6f1c3360-a54f-11ea-a808-2f23710fdc90] Error while reading sstable from stream 
> for table = keyspace1.standard1
> org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.maybeValidateChecksum(MetadataSerializer.java:219)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:198)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:129)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.mutate(MetadataSerializer.java:226)
>   at 
> org.apache.cassandra.db.streaming.CassandraEntireSSTableStreamReader.read(CassandraEntireSSTableStreamReader.java:140)
>   at 
> org.apache.cassandra.db.streaming.CassandraIncomingFile.read(CassandraIncomingFile.java:78)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:36)
>   at 
> org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.async.StreamingInboundHandler$StreamDeserializingTask.run(StreamingInboundHandler.java:181)
>   at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Checksums do not match for 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
> {code}
>  
> In the above test, it executes "nodetool repair" on node1 and kills node2 
> during repair. At the end, node3 reports checksum validation failure on 
> sstable transferred from node1.
> {code:java|title=what happened}
> 1. When repair started on node1, it performs anti-compaction which modifies 
> sstable's repairAt to 0 and pending repair id to session-id.
> 2. Then node1 creates {{ComponentManifest}} which contains file lengths to be 
> transferred to node3.
> 3. Before node1 actually sends the files to node3, node2 is killed and node1 
> starts to broadcast repair-failure-message to all participants in 
> {{CoordinatorSession#fail}}
> 4. Node1 receives its own repair-failure-message and fails its local repair 
> sessions at {{LocalSessions#failSession}} which triggers async background 
> compaction.
> 5. Node1's background compaction will mutate sstable's repairAt to 0 and 
> pending repair id to null via  
> {{PendingRepairManager#getNextRepairFinishedTask}}, as there is no more 
> 

[jira] [Comment Edited] (CASSANDRA-15861) Mutating sstable component may race with entire-sstable-streaming(ZCS) causing checksum validation failure

2020-09-01 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188181#comment-17188181
 ] 

Stefan Miklosovic edited comment on CASSANDRA-15861 at 9/1/20, 6:48 AM:


I understand that this is the most probably just more important patch then my 
"percentage tracking" where I am just accidentally fixing some size-related bug 
and thats life ... But I would really like to see the test where we are 
checking this properly because netstats can again report some weird numbers and 
we are back in square one. The ideal test would look like - generate some 
sstables compressed / uncompressed and try to stream it in their entirety / 
without and check that the total sizes to be streamed are equal to sum of sizes 
of all individual tables. For all 4 combinations. I have checked your PR and I 
dont see this kind of test and I am worried that it will be not be matching 
again.


was (Author: stefan.miklosovic):
I understand that this is the most probably just more important patch that my 
"percentage tracking" where I am just accidentally fixing some size-related bug 
and thats life ... But I would really like to see the test where we are 
checking this properly because netstats can again report some weird numbers and 
we are back in square one. The ideal test would look like - generate some 
sstables compressed / uncompressed and try to stream it in their entirety / 
without and check that the total sizes to be streamed are equal to sum of sizes 
of all individual tables. For all 4 combinations. I have checked your PR and I 
dont see this kind of test and I am worried that it will be not be matching 
again.

> Mutating sstable component may race with entire-sstable-streaming(ZCS) 
> causing checksum validation failure
> --
>
> Key: CASSANDRA-15861
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15861
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair, Consistency/Streaming, 
> Local/Compaction
>Reporter: ZhaoYang
>Assignee: ZhaoYang
>Priority: Normal
> Fix For: 4.0-beta
>
>
> Flaky dtest: [test_dead_sync_initiator - 
> repair_tests.repair_test.TestRepair|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/143/testReport/junit/dtest.repair_tests.repair_test/TestRepair/test_dead_sync_initiator/]
> {code:java|title=stacktrace}
> Unexpected error found in node logs (see stdout for full details). Errors: 
> [ERROR [Stream-Deserializer-127.0.0.1:7000-570871f3] 2020-06-03 04:05:19,081 
> CassandraEntireSSTableStreamReader.java:145 - [Stream 
> 6f1c3360-a54f-11ea-a808-2f23710fdc90] Error while reading sstable from stream 
> for table = keyspace1.standard1
> org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.maybeValidateChecksum(MetadataSerializer.java:219)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:198)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:129)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.mutate(MetadataSerializer.java:226)
>   at 
> org.apache.cassandra.db.streaming.CassandraEntireSSTableStreamReader.read(CassandraEntireSSTableStreamReader.java:140)
>   at 
> org.apache.cassandra.db.streaming.CassandraIncomingFile.read(CassandraIncomingFile.java:78)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:36)
>   at 
> org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.async.StreamingInboundHandler$StreamDeserializingTask.run(StreamingInboundHandler.java:181)
>   at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Checksums do not match for 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
> {code}
>  
> In the above test, it executes "nodetool repair" on node1 and kills node2 
> during repair. At the end, node3 reports checksum validation failure on 
> sstable transferred from node1.
> {code:java|title=what 

[jira] [Comment Edited] (CASSANDRA-15861) Mutating sstable component may race with entire-sstable-streaming(ZCS) causing checksum validation failure

2020-09-01 Thread ZhaoYang (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188178#comment-17188178
 ] 

ZhaoYang edited comment on CASSANDRA-15861 at 9/1/20, 6:46 AM:
---

bq. Could you elaborate on what needs to be changed specifically in mine code 
so it will be fully ok again?

hmm.. some moved codes in {{CassandraOutgoingFiles}}. What are you worried 
about?

bq. Are you already sure that your changes are computing sizes in both 
compressed and uncompressed paths right?

this patch is for zero-copy-streaming to avoid partial written files, not about 
how size is calculated.

The change in {{CassandraStreamHeder}} around compressed size is to restore 
original behavior (reduce GC) before storege-engine refactoring.


was (Author: jasonstack):
bq. Could you elaborate on what needs to be changed specifically in mine code 
so it will be fully ok again?

hmm.. some moved codes in {{CassandraOutgoingFiles}}. What are you worried 
about?

bq. Are you already sure that your changes are computing sizes in both 
compressed and uncompressed paths right?

this patch is for zero-copy-streaming to avoid partial written files, not about 
how size is calculated.

The change in {{CassandraStreamHeder}} around compressed size is to restore 
original behavior before storege-engine refactoring.

> Mutating sstable component may race with entire-sstable-streaming(ZCS) 
> causing checksum validation failure
> --
>
> Key: CASSANDRA-15861
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15861
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair, Consistency/Streaming, 
> Local/Compaction
>Reporter: ZhaoYang
>Assignee: ZhaoYang
>Priority: Normal
> Fix For: 4.0-beta
>
>
> Flaky dtest: [test_dead_sync_initiator - 
> repair_tests.repair_test.TestRepair|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/143/testReport/junit/dtest.repair_tests.repair_test/TestRepair/test_dead_sync_initiator/]
> {code:java|title=stacktrace}
> Unexpected error found in node logs (see stdout for full details). Errors: 
> [ERROR [Stream-Deserializer-127.0.0.1:7000-570871f3] 2020-06-03 04:05:19,081 
> CassandraEntireSSTableStreamReader.java:145 - [Stream 
> 6f1c3360-a54f-11ea-a808-2f23710fdc90] Error while reading sstable from stream 
> for table = keyspace1.standard1
> org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.maybeValidateChecksum(MetadataSerializer.java:219)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:198)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:129)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.mutate(MetadataSerializer.java:226)
>   at 
> org.apache.cassandra.db.streaming.CassandraEntireSSTableStreamReader.read(CassandraEntireSSTableStreamReader.java:140)
>   at 
> org.apache.cassandra.db.streaming.CassandraIncomingFile.read(CassandraIncomingFile.java:78)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:36)
>   at 
> org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.async.StreamingInboundHandler$StreamDeserializingTask.run(StreamingInboundHandler.java:181)
>   at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Checksums do not match for 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
> {code}
>  
> In the above test, it executes "nodetool repair" on node1 and kills node2 
> during repair. At the end, node3 reports checksum validation failure on 
> sstable transferred from node1.
> {code:java|title=what happened}
> 1. When repair started on node1, it performs anti-compaction which modifies 
> sstable's repairAt to 0 and pending repair id to session-id.
> 2. Then node1 creates {{ComponentManifest}} which contains file lengths to be 
> transferred to node3.
> 3. Before node1 actually sends the files to node3, node2 is killed and node1 
> starts to 

[jira] [Comment Edited] (CASSANDRA-15861) Mutating sstable component may race with entire-sstable-streaming(ZCS) causing checksum validation failure

2020-09-01 Thread ZhaoYang (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188178#comment-17188178
 ] 

ZhaoYang edited comment on CASSANDRA-15861 at 9/1/20, 6:45 AM:
---

bq. Could you elaborate on what needs to be changed specifically in mine code 
so it will be fully ok again?

hmm.. some moved codes in {{CassandraOutgoingFiles}}. What are you worried 
about?

bq. Are you already sure that your changes are computing sizes in both 
compressed and uncompressed paths right?

this patch is for zero-copy-streaming to avoid partial written files, not about 
how size is calculated.

The change in {{CassandraStreamHeder}} around compressed size is to restore 
original behavior before storege-engine refactoring.


was (Author: jasonstack):
bq. Could you elaborate on what needs to be changed specifically in mine code 
so it will be fully ok again?

hmm.. some moved codes in {{CassandraOutgoingFiles}}. What are you worried 
about?

bq. Are you already sure that your changes are computing sizes in both 
compressed and uncompressed paths right?

this patch is for zero-copy-streaming to avoid partial written files, not about 
how size is calculated.

The change in {{ComponentMetadata}} around compressed size is to restore 
original behavior before storege-engine refactoring.

> Mutating sstable component may race with entire-sstable-streaming(ZCS) 
> causing checksum validation failure
> --
>
> Key: CASSANDRA-15861
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15861
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair, Consistency/Streaming, 
> Local/Compaction
>Reporter: ZhaoYang
>Assignee: ZhaoYang
>Priority: Normal
> Fix For: 4.0-beta
>
>
> Flaky dtest: [test_dead_sync_initiator - 
> repair_tests.repair_test.TestRepair|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/143/testReport/junit/dtest.repair_tests.repair_test/TestRepair/test_dead_sync_initiator/]
> {code:java|title=stacktrace}
> Unexpected error found in node logs (see stdout for full details). Errors: 
> [ERROR [Stream-Deserializer-127.0.0.1:7000-570871f3] 2020-06-03 04:05:19,081 
> CassandraEntireSSTableStreamReader.java:145 - [Stream 
> 6f1c3360-a54f-11ea-a808-2f23710fdc90] Error while reading sstable from stream 
> for table = keyspace1.standard1
> org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.maybeValidateChecksum(MetadataSerializer.java:219)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:198)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:129)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.mutate(MetadataSerializer.java:226)
>   at 
> org.apache.cassandra.db.streaming.CassandraEntireSSTableStreamReader.read(CassandraEntireSSTableStreamReader.java:140)
>   at 
> org.apache.cassandra.db.streaming.CassandraIncomingFile.read(CassandraIncomingFile.java:78)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:36)
>   at 
> org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.async.StreamingInboundHandler$StreamDeserializingTask.run(StreamingInboundHandler.java:181)
>   at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Checksums do not match for 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
> {code}
>  
> In the above test, it executes "nodetool repair" on node1 and kills node2 
> during repair. At the end, node3 reports checksum validation failure on 
> sstable transferred from node1.
> {code:java|title=what happened}
> 1. When repair started on node1, it performs anti-compaction which modifies 
> sstable's repairAt to 0 and pending repair id to session-id.
> 2. Then node1 creates {{ComponentManifest}} which contains file lengths to be 
> transferred to node3.
> 3. Before node1 actually sends the files to node3, node2 is killed and node1 
> starts to broadcast 

[jira] [Comment Edited] (CASSANDRA-15861) Mutating sstable component may race with entire-sstable-streaming(ZCS) causing checksum validation failure

2020-09-01 Thread ZhaoYang (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188178#comment-17188178
 ] 

ZhaoYang edited comment on CASSANDRA-15861 at 9/1/20, 6:44 AM:
---

bq. Could you elaborate on what needs to be changed specifically in mine code 
so it will be fully ok again?

hmm.. some moved codes in {{CassandraOutgoingFiles}}. What are you worried 
about?

bq. Are you already sure that your changes are computing sizes in both 
compressed and uncompressed paths right?

this patch is for zero-copy-streaming to avoid partial written files, not about 
how size is calculated.

The change in {{ComponentMetadata}} around compressed size is to restore 
original behavior before storege-engine refactoring.


was (Author: jasonstack):
bq. Could you elaborate on what needs to be changed specifically in mine code 
so it will be fully ok again?

hmm.. some moved codes in {{CassandraOutgoingFiles}}. What are you worried 
about?

bq. Are you already sure that your changes are computing sizes in both 
compressed and uncompressed paths right?

this patch is for zero-copy-streaming to avoid partial written files, not about 
how size is calculated.

> Mutating sstable component may race with entire-sstable-streaming(ZCS) 
> causing checksum validation failure
> --
>
> Key: CASSANDRA-15861
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15861
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair, Consistency/Streaming, 
> Local/Compaction
>Reporter: ZhaoYang
>Assignee: ZhaoYang
>Priority: Normal
> Fix For: 4.0-beta
>
>
> Flaky dtest: [test_dead_sync_initiator - 
> repair_tests.repair_test.TestRepair|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/143/testReport/junit/dtest.repair_tests.repair_test/TestRepair/test_dead_sync_initiator/]
> {code:java|title=stacktrace}
> Unexpected error found in node logs (see stdout for full details). Errors: 
> [ERROR [Stream-Deserializer-127.0.0.1:7000-570871f3] 2020-06-03 04:05:19,081 
> CassandraEntireSSTableStreamReader.java:145 - [Stream 
> 6f1c3360-a54f-11ea-a808-2f23710fdc90] Error while reading sstable from stream 
> for table = keyspace1.standard1
> org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.maybeValidateChecksum(MetadataSerializer.java:219)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:198)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:129)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.mutate(MetadataSerializer.java:226)
>   at 
> org.apache.cassandra.db.streaming.CassandraEntireSSTableStreamReader.read(CassandraEntireSSTableStreamReader.java:140)
>   at 
> org.apache.cassandra.db.streaming.CassandraIncomingFile.read(CassandraIncomingFile.java:78)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:36)
>   at 
> org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.async.StreamingInboundHandler$StreamDeserializingTask.run(StreamingInboundHandler.java:181)
>   at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Checksums do not match for 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
> {code}
>  
> In the above test, it executes "nodetool repair" on node1 and kills node2 
> during repair. At the end, node3 reports checksum validation failure on 
> sstable transferred from node1.
> {code:java|title=what happened}
> 1. When repair started on node1, it performs anti-compaction which modifies 
> sstable's repairAt to 0 and pending repair id to session-id.
> 2. Then node1 creates {{ComponentManifest}} which contains file lengths to be 
> transferred to node3.
> 3. Before node1 actually sends the files to node3, node2 is killed and node1 
> starts to broadcast repair-failure-message to all participants in 
> {{CoordinatorSession#fail}}
> 4. Node1 receives its own repair-failure-message and fails its 

[jira] [Commented] (CASSANDRA-15861) Mutating sstable component may race with entire-sstable-streaming(ZCS) causing checksum validation failure

2020-09-01 Thread ZhaoYang (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188178#comment-17188178
 ] 

ZhaoYang commented on CASSANDRA-15861:
--

bq. Could you elaborate on what needs to be changed specifically in mine code 
so it will be fully ok again?

hmm.. some moved codes in {{CassandraOutgoingFiles}}. What are you worried 
about?

bq. Are you already sure that your changes are computing sizes in both 
compressed and uncompressed paths right?

this patch is for zero-copy-streaming to avoid partial written files, not about 
how size is calculated.

> Mutating sstable component may race with entire-sstable-streaming(ZCS) 
> causing checksum validation failure
> --
>
> Key: CASSANDRA-15861
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15861
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair, Consistency/Streaming, 
> Local/Compaction
>Reporter: ZhaoYang
>Assignee: ZhaoYang
>Priority: Normal
> Fix For: 4.0-beta
>
>
> Flaky dtest: [test_dead_sync_initiator - 
> repair_tests.repair_test.TestRepair|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/143/testReport/junit/dtest.repair_tests.repair_test/TestRepair/test_dead_sync_initiator/]
> {code:java|title=stacktrace}
> Unexpected error found in node logs (see stdout for full details). Errors: 
> [ERROR [Stream-Deserializer-127.0.0.1:7000-570871f3] 2020-06-03 04:05:19,081 
> CassandraEntireSSTableStreamReader.java:145 - [Stream 
> 6f1c3360-a54f-11ea-a808-2f23710fdc90] Error while reading sstable from stream 
> for table = keyspace1.standard1
> org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.maybeValidateChecksum(MetadataSerializer.java:219)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:198)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:129)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.mutate(MetadataSerializer.java:226)
>   at 
> org.apache.cassandra.db.streaming.CassandraEntireSSTableStreamReader.read(CassandraEntireSSTableStreamReader.java:140)
>   at 
> org.apache.cassandra.db.streaming.CassandraIncomingFile.read(CassandraIncomingFile.java:78)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:36)
>   at 
> org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.async.StreamingInboundHandler$StreamDeserializingTask.run(StreamingInboundHandler.java:181)
>   at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Checksums do not match for 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
> {code}
>  
> In the above test, it executes "nodetool repair" on node1 and kills node2 
> during repair. At the end, node3 reports checksum validation failure on 
> sstable transferred from node1.
> {code:java|title=what happened}
> 1. When repair started on node1, it performs anti-compaction which modifies 
> sstable's repairAt to 0 and pending repair id to session-id.
> 2. Then node1 creates {{ComponentManifest}} which contains file lengths to be 
> transferred to node3.
> 3. Before node1 actually sends the files to node3, node2 is killed and node1 
> starts to broadcast repair-failure-message to all participants in 
> {{CoordinatorSession#fail}}
> 4. Node1 receives its own repair-failure-message and fails its local repair 
> sessions at {{LocalSessions#failSession}} which triggers async background 
> compaction.
> 5. Node1's background compaction will mutate sstable's repairAt to 0 and 
> pending repair id to null via  
> {{PendingRepairManager#getNextRepairFinishedTask}}, as there is no more 
> in-progress repair.
> 6. Node1 actually sends the sstable to node3 where the sstable's STATS 
> component size is different from the original size recorded in the manifest.
> 7. At the end, node3 reports checksum validation failure when it tries to 
> mutate sstable level and "isTransient" attribute in 
> 

[jira] [Commented] (CASSANDRA-15861) Mutating sstable component may race with entire-sstable-streaming(ZCS) causing checksum validation failure

2020-09-01 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188172#comment-17188172
 ] 

Stefan Miklosovic commented on CASSANDRA-15861:
---

Could you elaborate on what needs to be changed specifically in mine code so it 
will be fully ok again? Are you already sure that your changes are computing 
sizes in both compressed and uncompressed paths right? (with and without entire 
sstable streaming)

> Mutating sstable component may race with entire-sstable-streaming(ZCS) 
> causing checksum validation failure
> --
>
> Key: CASSANDRA-15861
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15861
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair, Consistency/Streaming, 
> Local/Compaction
>Reporter: ZhaoYang
>Assignee: ZhaoYang
>Priority: Normal
> Fix For: 4.0-beta
>
>
> Flaky dtest: [test_dead_sync_initiator - 
> repair_tests.repair_test.TestRepair|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/143/testReport/junit/dtest.repair_tests.repair_test/TestRepair/test_dead_sync_initiator/]
> {code:java|title=stacktrace}
> Unexpected error found in node logs (see stdout for full details). Errors: 
> [ERROR [Stream-Deserializer-127.0.0.1:7000-570871f3] 2020-06-03 04:05:19,081 
> CassandraEntireSSTableStreamReader.java:145 - [Stream 
> 6f1c3360-a54f-11ea-a808-2f23710fdc90] Error while reading sstable from stream 
> for table = keyspace1.standard1
> org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.maybeValidateChecksum(MetadataSerializer.java:219)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:198)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:129)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.mutate(MetadataSerializer.java:226)
>   at 
> org.apache.cassandra.db.streaming.CassandraEntireSSTableStreamReader.read(CassandraEntireSSTableStreamReader.java:140)
>   at 
> org.apache.cassandra.db.streaming.CassandraIncomingFile.read(CassandraIncomingFile.java:78)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:36)
>   at 
> org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.async.StreamingInboundHandler$StreamDeserializingTask.run(StreamingInboundHandler.java:181)
>   at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Checksums do not match for 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
> {code}
>  
> In the above test, it executes "nodetool repair" on node1 and kills node2 
> during repair. At the end, node3 reports checksum validation failure on 
> sstable transferred from node1.
> {code:java|title=what happened}
> 1. When repair started on node1, it performs anti-compaction which modifies 
> sstable's repairAt to 0 and pending repair id to session-id.
> 2. Then node1 creates {{ComponentManifest}} which contains file lengths to be 
> transferred to node3.
> 3. Before node1 actually sends the files to node3, node2 is killed and node1 
> starts to broadcast repair-failure-message to all participants in 
> {{CoordinatorSession#fail}}
> 4. Node1 receives its own repair-failure-message and fails its local repair 
> sessions at {{LocalSessions#failSession}} which triggers async background 
> compaction.
> 5. Node1's background compaction will mutate sstable's repairAt to 0 and 
> pending repair id to null via  
> {{PendingRepairManager#getNextRepairFinishedTask}}, as there is no more 
> in-progress repair.
> 6. Node1 actually sends the sstable to node3 where the sstable's STATS 
> component size is different from the original size recorded in the manifest.
> 7. At the end, node3 reports checksum validation failure when it tries to 
> mutate sstable level and "isTransient" attribute in 
> {{CassandraEntireSSTableStreamReader#read}}.
> {code}
> Currently, entire-sstable-streaming requires sstable components to be 
> immutable, because 

[jira] [Commented] (CASSANDRA-15861) Mutating sstable component may race with entire-sstable-streaming(ZCS) causing checksum validation failure

2020-09-01 Thread ZhaoYang (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188166#comment-17188166
 ] 

ZhaoYang commented on CASSANDRA-15861:
--

[~stefan.miklosovic] I took a brief look at the patch in 15406, I think there 
are just minor superficial conflicts, no compatibility issue.

> Mutating sstable component may race with entire-sstable-streaming(ZCS) 
> causing checksum validation failure
> --
>
> Key: CASSANDRA-15861
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15861
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair, Consistency/Streaming, 
> Local/Compaction
>Reporter: ZhaoYang
>Assignee: ZhaoYang
>Priority: Normal
> Fix For: 4.0-beta
>
>
> Flaky dtest: [test_dead_sync_initiator - 
> repair_tests.repair_test.TestRepair|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/143/testReport/junit/dtest.repair_tests.repair_test/TestRepair/test_dead_sync_initiator/]
> {code:java|title=stacktrace}
> Unexpected error found in node logs (see stdout for full details). Errors: 
> [ERROR [Stream-Deserializer-127.0.0.1:7000-570871f3] 2020-06-03 04:05:19,081 
> CassandraEntireSSTableStreamReader.java:145 - [Stream 
> 6f1c3360-a54f-11ea-a808-2f23710fdc90] Error while reading sstable from stream 
> for table = keyspace1.standard1
> org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.maybeValidateChecksum(MetadataSerializer.java:219)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:198)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:129)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.mutate(MetadataSerializer.java:226)
>   at 
> org.apache.cassandra.db.streaming.CassandraEntireSSTableStreamReader.read(CassandraEntireSSTableStreamReader.java:140)
>   at 
> org.apache.cassandra.db.streaming.CassandraIncomingFile.read(CassandraIncomingFile.java:78)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:36)
>   at 
> org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.async.StreamingInboundHandler$StreamDeserializingTask.run(StreamingInboundHandler.java:181)
>   at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Checksums do not match for 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
> {code}
>  
> In the above test, it executes "nodetool repair" on node1 and kills node2 
> during repair. At the end, node3 reports checksum validation failure on 
> sstable transferred from node1.
> {code:java|title=what happened}
> 1. When repair started on node1, it performs anti-compaction which modifies 
> sstable's repairAt to 0 and pending repair id to session-id.
> 2. Then node1 creates {{ComponentManifest}} which contains file lengths to be 
> transferred to node3.
> 3. Before node1 actually sends the files to node3, node2 is killed and node1 
> starts to broadcast repair-failure-message to all participants in 
> {{CoordinatorSession#fail}}
> 4. Node1 receives its own repair-failure-message and fails its local repair 
> sessions at {{LocalSessions#failSession}} which triggers async background 
> compaction.
> 5. Node1's background compaction will mutate sstable's repairAt to 0 and 
> pending repair id to null via  
> {{PendingRepairManager#getNextRepairFinishedTask}}, as there is no more 
> in-progress repair.
> 6. Node1 actually sends the sstable to node3 where the sstable's STATS 
> component size is different from the original size recorded in the manifest.
> 7. At the end, node3 reports checksum validation failure when it tries to 
> mutate sstable level and "isTransient" attribute in 
> {{CassandraEntireSSTableStreamReader#read}}.
> {code}
> Currently, entire-sstable-streaming requires sstable components to be 
> immutable, because \{{ComponentManifest}}
> with component sizes are sent before sending actual files. This isn't a 
> problem in legacy streaming as STATS file 

[jira] [Comment Edited] (CASSANDRA-15861) Mutating sstable component may race with entire-sstable-streaming(ZCS) causing checksum validation failure

2020-09-01 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188162#comment-17188162
 ] 

Stefan Miklosovic edited comment on CASSANDRA-15861 at 9/1/20, 6:21 AM:


This clashes with https://issues.apache.org/jira/browse/CASSANDRA-15406 I am 
working on for a very long time and I havent been able to manage to merge it. 
There seems to be a clash with some functionality related to how size of files 
is computed because it is broken and it reports wrong numbers in netstats.

Could somebody verify that this patch is compatible with 15406 and it does not 
break things? It is pretty frustrating to continuously rewrite already fully 
prepared and tested code.

The problem with the original code is nicely summarized in this comment 
downwards and input from Benjamin Lerer 
https://issues.apache.org/jira/browse/CASSANDRA-15406?focusedCommentId=17181389=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17181389


was (Author: stefan.miklosovic):
This clashes with https://issues.apache.org/jira/browse/CASSANDRA-15406 I am 
working on for a very long time and I havent been able to manage to merge it. 
There seems to be a clash with some functionality related to how size of files 
is computed because it is broken and it reports wrong numbers in netstats.

Could somebody verify that this patch is compatible with 15406 and it does not 
break things? It is pretty frustrating to continuously rewrite already fully 
prepared and tested code.

> Mutating sstable component may race with entire-sstable-streaming(ZCS) 
> causing checksum validation failure
> --
>
> Key: CASSANDRA-15861
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15861
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair, Consistency/Streaming, 
> Local/Compaction
>Reporter: ZhaoYang
>Assignee: ZhaoYang
>Priority: Normal
> Fix For: 4.0-beta
>
>
> Flaky dtest: [test_dead_sync_initiator - 
> repair_tests.repair_test.TestRepair|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/143/testReport/junit/dtest.repair_tests.repair_test/TestRepair/test_dead_sync_initiator/]
> {code:java|title=stacktrace}
> Unexpected error found in node logs (see stdout for full details). Errors: 
> [ERROR [Stream-Deserializer-127.0.0.1:7000-570871f3] 2020-06-03 04:05:19,081 
> CassandraEntireSSTableStreamReader.java:145 - [Stream 
> 6f1c3360-a54f-11ea-a808-2f23710fdc90] Error while reading sstable from stream 
> for table = keyspace1.standard1
> org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.maybeValidateChecksum(MetadataSerializer.java:219)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:198)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:129)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.mutate(MetadataSerializer.java:226)
>   at 
> org.apache.cassandra.db.streaming.CassandraEntireSSTableStreamReader.read(CassandraEntireSSTableStreamReader.java:140)
>   at 
> org.apache.cassandra.db.streaming.CassandraIncomingFile.read(CassandraIncomingFile.java:78)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:36)
>   at 
> org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.async.StreamingInboundHandler$StreamDeserializingTask.run(StreamingInboundHandler.java:181)
>   at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Checksums do not match for 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
> {code}
>  
> In the above test, it executes "nodetool repair" on node1 and kills node2 
> during repair. At the end, node3 reports checksum validation failure on 
> sstable transferred from node1.
> {code:java|title=what happened}
> 1. When repair started on node1, it performs anti-compaction which modifies 
> sstable's repairAt to 0 and pending repair id to session-id.
> 2. 

[jira] [Commented] (CASSANDRA-15861) Mutating sstable component may race with entire-sstable-streaming(ZCS) causing checksum validation failure

2020-09-01 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188162#comment-17188162
 ] 

Stefan Miklosovic commented on CASSANDRA-15861:
---

This clashes with https://issues.apache.org/jira/browse/CASSANDRA-15406 I am 
working on for a very long time and I havent been able to manage to merge it. 
There seems to be a clash with some functionality related to how size of files 
is computed because it is broken and it reports wrong numbers in netstats.

Could somebody verify that this patch is compatible with 15406 and it does not 
break things? It is pretty frustrating to continuously rewrite already fully 
prepared and tested code.

> Mutating sstable component may race with entire-sstable-streaming(ZCS) 
> causing checksum validation failure
> --
>
> Key: CASSANDRA-15861
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15861
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair, Consistency/Streaming, 
> Local/Compaction
>Reporter: ZhaoYang
>Assignee: ZhaoYang
>Priority: Normal
> Fix For: 4.0-beta
>
>
> Flaky dtest: [test_dead_sync_initiator - 
> repair_tests.repair_test.TestRepair|https://ci-cassandra.apache.org/view/all/job/Cassandra-devbranch-dtest/143/testReport/junit/dtest.repair_tests.repair_test/TestRepair/test_dead_sync_initiator/]
> {code:java|title=stacktrace}
> Unexpected error found in node logs (see stdout for full details). Errors: 
> [ERROR [Stream-Deserializer-127.0.0.1:7000-570871f3] 2020-06-03 04:05:19,081 
> CassandraEntireSSTableStreamReader.java:145 - [Stream 
> 6f1c3360-a54f-11ea-a808-2f23710fdc90] Error while reading sstable from stream 
> for table = keyspace1.standard1
> org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted: 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.maybeValidateChecksum(MetadataSerializer.java:219)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:198)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.deserialize(MetadataSerializer.java:129)
>   at 
> org.apache.cassandra.io.sstable.metadata.MetadataSerializer.mutate(MetadataSerializer.java:226)
>   at 
> org.apache.cassandra.db.streaming.CassandraEntireSSTableStreamReader.read(CassandraEntireSSTableStreamReader.java:140)
>   at 
> org.apache.cassandra.db.streaming.CassandraIncomingFile.read(CassandraIncomingFile.java:78)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:36)
>   at 
> org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:49)
>   at 
> org.apache.cassandra.streaming.async.StreamingInboundHandler$StreamDeserializingTask.run(StreamingInboundHandler.java:181)
>   at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Checksums do not match for 
> /home/cassandra/cassandra/cassandra-dtest/tmp/dtest-te4ty0r9/test/node3/data0/keyspace1/standard1-5f5ab140a54f11eaa8082f23710fdc90/na-2-big-Statistics.db
> {code}
>  
> In the above test, it executes "nodetool repair" on node1 and kills node2 
> during repair. At the end, node3 reports checksum validation failure on 
> sstable transferred from node1.
> {code:java|title=what happened}
> 1. When repair started on node1, it performs anti-compaction which modifies 
> sstable's repairAt to 0 and pending repair id to session-id.
> 2. Then node1 creates {{ComponentManifest}} which contains file lengths to be 
> transferred to node3.
> 3. Before node1 actually sends the files to node3, node2 is killed and node1 
> starts to broadcast repair-failure-message to all participants in 
> {{CoordinatorSession#fail}}
> 4. Node1 receives its own repair-failure-message and fails its local repair 
> sessions at {{LocalSessions#failSession}} which triggers async background 
> compaction.
> 5. Node1's background compaction will mutate sstable's repairAt to 0 and 
> pending repair id to null via  
> {{PendingRepairManager#getNextRepairFinishedTask}}, as there is no more 
> in-progress repair.
> 6. Node1 actually sends the sstable to node3 where the sstable's STATS 
> component size is different from the original size recorded in the manifest.
> 7. At the end, node3 reports checksum validation