[jira] [Comment Edited] (CASSANDRA-13938) Default repair is broken, crashes other nodes participating in repair (in trunk)
[ https://issues.apache.org/jira/browse/CASSANDRA-13938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17016527#comment-17016527 ] Dinesh Joshi edited comment on CASSANDRA-13938 at 1/16/20 3:56 AM:
---
Hi [~aleksey], Overall the code looks good. Two minor nits only. Feel free to make changes on commit.
- {{CompressedInputStream}} - could you pull the resizing multiplier (1.5) out as a constant? I think it's used in multiple locations.
- {{CompressedInputStream::chunkBytesRead}} can be package private.
- {{RebufferingInputStream}} - Line 106, the word 'length' has a typo in the comment.

+1

was (Author: djoshi3):
Hi [~aleksey], Overall the code looks good. Two minor nits only. Feel free to make changes on commit.
- {{CompressedInputStream}} - could you pull the resizing multiplier (1.5) out as a constant? I think it's used in multiple locations.
- {{CompressedInputStream::chunkBytesRead}} can be package private.
- {{RebufferingInputStream}} - Line 106, the word 'length' has a typo in the comment.

> Default repair is broken, crashes other nodes participating in repair (in trunk)
>
> Key: CASSANDRA-13938
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13938
> Project: Cassandra
> Issue Type: Bug
> Components: Consistency/Repair
> Reporter: Nate McCall
> Assignee: Aleksey Yeschenko
> Priority: Urgent
> Fix For: 4.0-alpha
>
> Attachments: 13938.yaml, test.sh
>
> Running through a simple scenario to test some of the new repair features, I was not able to make a repair command work. Further, the exception seemed to trigger a nasty failure state that basically shuts down the netty connections for messaging *and* CQL on the nodes transferring back data to the node being repaired. The following steps reproduce this issue consistently.
> Cassandra stress profile (probably not necessary, but this one provides a really simple schema and consistent data shape):
> {noformat}
> keyspace: standard_long
> keyspace_definition: |
>   CREATE KEYSPACE standard_long WITH replication = {'class':'SimpleStrategy', 'replication_factor':3};
> table: test_data
> table_definition: |
>   CREATE TABLE test_data (
>     key text,
>     ts bigint,
>     val text,
>     PRIMARY KEY (key, ts)
>   ) WITH COMPACT STORAGE AND
>     CLUSTERING ORDER BY (ts DESC) AND
>     bloom_filter_fp_chance=0.01 AND
>     caching={'keys':'ALL', 'rows_per_partition':'NONE'} AND
>     comment='' AND
>     dclocal_read_repair_chance=0.00 AND
>     gc_grace_seconds=864000 AND
>     read_repair_chance=0.00 AND
>     compaction={'class': 'SizeTieredCompactionStrategy'} AND
>     compression={'sstable_compression': 'LZ4Compressor'};
> columnspec:
>   - name: key
>     population: uniform(1..5000) # 50 million records available
>   - name: ts
>     cluster: gaussian(1..50) # Up to 50 inserts per record
>   - name: val
>     population: gaussian(128..1024) # varying size of value data
> insert:
>   partitions: fixed(1) # only one insert per batch for individual partitions
>   select: fixed(1)/1 # each insert comes in one at a time
>   batchtype: UNLOGGED
> queries:
>   single:
>     cql: select * from test_data where key = ? and ts = ? limit 1;
>   series:
>     cql: select key,ts,val from test_data where key = ? limit 10;
> {noformat}
> The commands to build and run:
> {noformat}
> ccm create 4_0_test -v git:trunk -n 3 -s
> ccm stress user profile=./histo-test-schema.yml ops\(insert=20,single=1,series=1\) duration=15s -rate threads=4
> # flush the memtable just to get everything on disk
> ccm node1 nodetool flush
> ccm node2 nodetool flush
> ccm node3 nodetool flush
> # disable hints for nodes 2 and 3
> ccm node2 nodetool disablehandoff
> ccm node3 nodetool disablehandoff
> # stop node1
> ccm node1 stop
> ccm stress user profile=./histo-test-schema.yml ops\(insert=20,single=1,series=1\) duration=45s -rate threads=4
> # wait 10 seconds
> ccm node1 start
> # Note that we are local to ccm's nodetool install 'cause repair preview is not reported yet
> node1/bin/nodetool repair --preview
> node1/bin/nodetool repair standard_long test_data
> {noformat}
> The error outputs from the last repair command follow. First, this is stdout from node1:
> {noformat}
> $ node1/bin/nodetool repair standard_long test_data
> objc[47876]: Class JavaLaunchHelper is implemented in both /Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home/bin/java (0x10274d4c0) and /Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home/jre/lib/libinstrument.dylib (0x1047b64e0). One of the two will be used. Which one is undefined.
> [2017-10-05 14:31:52,425] Starting repair command #4 (7e1a9150-a98e-11e7-ad86-cbd2801b8de2), repairi
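The first nit in the review above (hoisting the ad-hoc 1.5 resize multiplier into a named constant) could look roughly like the sketch below. The class and method names are illustrative only, not Cassandra's actual code.

```java
// Hypothetical sketch of the review nit: define the buffer growth factor
// once as a constant instead of repeating the literal 1.5 inline at each
// resize site. Names are invented for illustration.
import java.nio.ByteBuffer;

public class GrowableBuffer {
    // Single definition of the growth factor used when resizing.
    static final double GROWTH_FACTOR = 1.5;

    // Grow the capacity by the factor, but never below what is required.
    static int grownCapacity(int current, int required) {
        int grown = (int) (current * GROWTH_FACTOR);
        return Math.max(grown, required);
    }

    // Return a buffer with at least `required` capacity, copying any
    // already-written bytes if a larger buffer must be allocated.
    static ByteBuffer ensureCapacity(ByteBuffer buf, int required) {
        if (buf.capacity() >= required)
            return buf;
        ByteBuffer bigger = ByteBuffer.allocate(grownCapacity(buf.capacity(), required));
        buf.flip();
        bigger.put(buf);
        return bigger;
    }
}
```

With the constant in one place, changing the growth policy later touches a single line rather than every resize site.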
[jira] [Comment Edited] (CASSANDRA-13938) Default repair is broken, crashes other nodes participating in repair (in trunk)
[ https://issues.apache.org/jira/browse/CASSANDRA-13938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16658981#comment-16658981 ] Alex Lourie edited comment on CASSANDRA-13938 at 10/22/18 1:39 PM:
---
[~jasobrown] I've been testing both trunk and your branch in a simple repair scenario and it's still failing. The scenario I'm working with is:
1. Start a cluster.
2. Load the cluster for 10 minutes.
3. Stop one node and load the cluster for an additional 30 minutes.
4. Clear the hints.
5. Start the stopped node and let it resync with the others for a couple of minutes.
6. Start the repairs on the previously stopped node.

Repairs crash on the other nodes (on 2 nodes in my 3-node test cluster) with the following error:
{code}
Oct 22 13:10:54 ip-10-0-13-111 cassandra[5927]: INFO [AntiEntropyStage:1] 2018-10-22 13:10:54,716 Validator.java:417 - [repair #9c38dd00-d5fb-11e8-ac32-316a9d8f8d32] Sending completed merkle tree to 35.162.15.68:7000 for alex.test2
Oct 22 13:16:02 ip-10-0-13-111 cassandra[5927]: INFO [AntiEntropyStage:1] 2018-10-22 13:16:02,594 StreamingRepairTask.java:72 - [streaming task #9c38dd00-d5fb-11e8-ac32-316a9d8f8d32] Performing streaming repair of 7382 ranges with 35.162.15.68:7000
Oct 22 13:16:02 ip-10-0-13-111 cassandra[5927]: INFO [AntiEntropyStage:1] 2018-10-22 13:16:02,981 StreamResultFuture.java:89 - [Stream #9fe63820-d5fc-11e8-8a2b-3555ba61a619] Executing streaming plan for Repair
Oct 22 13:16:02 ip-10-0-13-111 cassandra[5927]: INFO [AntiEntropyStage:1] 2018-10-22 13:16:02,981 StreamSession.java:287 - [Stream #9fe63820-d5fc-11e8-8a2b-3555ba61a619] Starting streaming to 35.162.15.68:7000
Oct 22 13:16:02 ip-10-0-13-111 cassandra[5927]: INFO [AntiEntropyStage:1] 2018-10-22 13:16:02,987 StreamCoordinator.java:259 - [Stream #9fe63820-d5fc-11e8-8a2b-3555ba61a619, ID#0] Beginning stream session with 35.162.15.68:7000
Oct 22 13:16:03 ip-10-0-13-111 cassandra[5927]: INFO [Stream-Deserializer-35.162.15.68:7000-0b32ed63] 2018-10-22 13:16:03,783 StreamResultFuture.java:178 - [Stream #9fe63820-d5fc-11e8-8a2b-3555ba61a619 ID#0] Prepare completed. Receiving 6 files(215.878MiB), sending 17 files(720.317MiB)
Oct 22 13:16:04 ip-10-0-13-111 cassandra[5927]: WARN [Stream-Deserializer-35.162.15.68:60292-be7cb6ee] 2018-10-22 13:16:04,355 CassandraCompressedStreamReader.java:110 - [Stream 9fe63820-d5fc-11e8-8a2b-3555ba61a619] Error while reading partition DecoratedKey(-9088115514873584734, 646572706865616435393632373436) from stream on ks='alex' and table='test2'.
Oct 22 13:16:04 ip-10-0-13-111 cassandra[5927]: ERROR [Stream-Deserializer-35.162.15.68:60292-be7cb6ee] 2018-10-22 13:16:04,362 StreamingInboundHandler.java:213 - [Stream channel: be7cb6ee] stream operation from 35.162.15.68:60292 failed
Oct 22 13:16:04 ip-10-0-13-111 cassandra[5927]: java.lang.AssertionError: stream can only read forward.
Oct 22 13:16:04 ip-10-0-13-111 cassandra[5927]: at org.apache.cassandra.db.streaming.CompressedInputStream.position(CompressedInputStream.java:108)
Oct 22 13:16:04 ip-10-0-13-111 cassandra[5927]: at org.apache.cassandra.db.streaming.CassandraCompressedStreamReader.read(CassandraCompressedStreamReader.java:93)
Oct 22 13:16:04 ip-10-0-13-111 cassandra[5927]: at org.apache.cassandra.db.streaming.CassandraIncomingFile.read(CassandraIncomingFile.java:74)
Oct 22 13:16:04 ip-10-0-13-111 cassandra[5927]: at org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:49)
Oct 22 13:16:04 ip-10-0-13-111 cassandra[5927]: at org.apache.cassandra.streaming.messages.IncomingStreamMessage$1.deserialize(IncomingStreamMessage.java:36)
Oct 22 13:16:04 ip-10-0-13-111 cassandra[5927]: at org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:55)
Oct 22 13:16:04 ip-10-0-13-111 cassandra[5927]: at org.apache.cassandra.streaming.async.StreamingInboundHandler$StreamDeserializingTask.run(StreamingInboundHandler.java:177)
Oct 22 13:16:04 ip-10-0-13-111 cassandra[5927]: at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
Oct 22 13:16:04 ip-10-0-13-111 cassandra[5927]: at java.lang.Thread.run(Thread.java:748)
{code}
The data is created as follows:
{code:sql}
CREATE KEYSPACE IF NOT EXISTS alex with replication = { 'class': 'NetworkTopologyStrategy', 'alourie': 3 };
CREATE TABLE IF NOT EXISTS alex.test2 (
    part text,
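The {{java.lang.AssertionError: stream can only read forward}} in the trace above comes from a forward-only position contract: a streaming reader may skip ahead to the next section of the file, but a request to move backwards indicates broken offset bookkeeping. A minimal sketch of that contract follows; the names are assumed for illustration (the real check is the assertion in {{CompressedInputStream.position}}).

```java
// Minimal sketch (invented names) of a forward-only position contract like
// the one whose violation produces the AssertionError in the log above.
public class ForwardOnlyPosition {
    private long position; // absolute position in the uncompressed stream

    ForwardOnlyPosition(long start) {
        this.position = start;
    }

    // Skipping forward is allowed; rewinding means the caller's offset
    // bookkeeping has gone wrong, so fail loudly.
    void position(long newPosition) {
        if (newPosition < position)
            throw new AssertionError("stream can only read forward.");
        position = newPosition;
    }

    long current() {
        return position;
    }
}
```

The bug under discussion is not the assertion itself but the miscounted offsets that make a legitimate forward seek look like a rewind.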
[jira] [Comment Edited] (CASSANDRA-13938) Default repair is broken, crashes other nodes participating in repair (in trunk)
[ https://issues.apache.org/jira/browse/CASSANDRA-13938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542961#comment-16542961 ] Dimitar Dimitrov edited comment on CASSANDRA-13938 at 9/11/18 5:52 AM:
---
{quote}The problem is that when {{CompressedInputStream#position()}} is called, the new position might be in the middle of a buffer. We need to remember that offset, and subtract that value when updating {{current}} in {{#reBuffer(boolean)}}. The reason why is that those offset bytes get double counted on the first call to {{#reBuffer()}} after {{#position()}} as we add the {{buffer.position()}} to {{current}}. {{current}} already accounts for those offset bytes when {{#position()}} was called.
{quote}
[~jasobrown], isn't that equivalent (although a bit more complex) to just setting {{current}} to the last reached/read position in the stream when rebuffering? (i.e. {{current = streamOffset + buffer.position()}}). I might be missing something, but the role of {{currentBufferOffset}} seems to be solely to "align" {{current}} and {{streamOffset}} the first time after a new section is started. Then {{current += buffer.position() - currentBufferOffset}} expands to {{current = -current- + buffer.position() + streamOffset - -current-}}, which is the same as {{current = streamOffset + buffer.position()}}. After that first time, {{current}} naturally follows {{streamOffset}} without the need of any adjustment, but it seems more natural to express this as {{streamOffset + buffer.position()}} instead of the new expression or the old {{current + buffer.position()}}. To me, it's also a bit more intuitive and easier to understand (hopefully it's also right in addition to intuitive :)).
The equivalence above would hold true if {{current}} and {{streamOffset}} don't change their value in the meantime, but I think this is ensured by the well-ordered sequential fashion in which the decompressing and the offset bookkeeping functionality of {{CompressedInputStream}} happen in the thread running the corresponding {{StreamDeserializingTask}}.
* The aforementioned well-ordered sequential fashion seems to be POSITION followed by 0-N times REBUFFER + DECOMPRESS, where the first REBUFFER might not update {{current}} with the above calculation in case {{current}} is already too far ahead (i.e. the new section is not starting within the current buffer).

was (Author: dimitarndimitrov):
{quote}The problem is that when {{CompressedInputStream#position()}} is called, the new position might be in the middle of a buffer. We need to remember that offset, and subtract that value when updating {{current}} in {{#reBuffer(boolean)}}. The reason why is that those offset bytes get double counted on the first call to {{#reBuffer()}} after {{#position()}} as we add the {{buffer.position()}} to {{current}}. {{current}} already accounts for those offset bytes when {{#position()}} was called.
{quote}
[~jasobrown], isn't that equivalent (although a bit more complex) to just setting {{current}} to the last reached/read position in the stream when rebuffering? (i.e. {{current = streamOffset + buffer.position()}}). I might be missing something, but the role of {{currentBufferOffset}} seems to be solely to "align" {{current}} and {{streamOffset}} the first time after a new section is started. Then {{current += buffer.position() - currentBufferOffset}} expands to {{current = -current- + buffer.position() + streamOffset - -current-}}, which is the same as {{current = streamOffset + buffer.position()}}.
After that first time, {{current}} naturally follows {{streamOffset}} without the need of any adjustment, but it seems more natural to express this as {{streamOffset + buffer.position()}} instead of the new expression or the old {{current + buffer.position()}}. To me, it's also a bit more intuitive and easier to understand (hopefully it's also right in addition to intuitive :)). The equivalence above would hold true if {{current}} and {{streamOffset}} don't change their value in the meantime, but I think this is ensured by the well-ordered sequential fashion in which the decompressing and the offset bookkeeping functionality of {{CompressedInputStream}} happen in the thread running the corresponding {{StreamDeserializingTask}}.
* The aforementioned well-ordered sequential fashion seems to be POSITION followed by 0-N times REBUFFER + DECOMPRESS, where the first REBUFFER might not update {{current}} with the above calculation in case {{current}} is already too far ahead (i.e. the new section is not starting within the current buffer).
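The expansion argued above can be checked with concrete numbers. A small standalone sketch follows; the names mirror the discussion ({{current}}, {{streamOffset}}, {{buffer.position()}}) but this is an illustration, not the actual {{CompressedInputStream}} code. It assumes, as in the comment, that {{currentBufferOffset}} was captured as {{current - streamOffset}} when {{position()}} landed mid-buffer.

```java
// Numeric check of the equivalence discussed above: the old adjusted
// update collapses algebraically to anchoring on the buffer's stream
// offset. Names are taken from the discussion, not from real code.
public class OffsetEquivalence {
    // Old bookkeeping: current += buffer.position() - currentBufferOffset,
    // where currentBufferOffset = (current - streamOffset) at position() time.
    static long oldStyle(long current, long streamOffset, long bufferPosition) {
        long currentBufferOffset = current - streamOffset;
        return current + bufferPosition - currentBufferOffset;
    }

    // Proposed simpler form: absolute position is just the buffer's stream
    // offset plus the cursor within the buffer.
    static long newStyle(long streamOffset, long bufferPosition) {
        return streamOffset + bufferPosition;
    }
}
```

Both forms give the same absolute position for any inputs, which is the equivalence the comment claims.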
[jira] [Comment Edited] (CASSANDRA-13938) Default repair is broken, crashes other nodes participating in repair (in trunk)
[ https://issues.apache.org/jira/browse/CASSANDRA-13938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610105#comment-16610105 ] Jason Brown edited comment on CASSANDRA-13938 at 9/11/18 5:01 AM:
--
[~dimitarndimitrov], Thanks for your comments, and apologies for the late response. While your proposed simplification indeed clarifies the logic, unfortunately it doesn't resolve the bug (my dtest still fails - this is due to the need to reset some value, like the currentBufferOffset, after rebuffering). However, your observation about simplifying this patch (in particular, eliminating {{currentBufferOffset}}) made me reconsider the needs of this class. Basically, we just need to correctly track the streamOffset for the current buffer, and that's all. When I ported this class from 3.11, I over-complicated the offsets and counters in the first version of this class (committed with CASSANDRA-12229), and then confused it again (while resolving the error) with the first patch. In short: as long as I correctly calculate streamOffset, that should satisfy the needs of the class. Thus, I eliminated both {{current}} and {{currentBufferOffset}}, and the result is clearer and correct. I've pushed a cleaned up branch (which has been rebased to trunk). Please note that, as with the first patch, the majority of this patch is refactoring to clean up the class in general. I've also updated my dtest patch as my version required a stress profile (based on [~zznate]'s original) to be committed, as well. (Note: my dtest branch also includes [~pauloricardomg]'s patch, but, as before, I'm unable to get that to fail on trunk.)

was (Author: jasobrown):
[~dimitarndimitrov], Thanks for your comments, and apologies for the late response.
While your proposed simplification indeed clarifies the logic, unfortunately it doesn't resolve the bug (my dtest still fails - this is due to the need to reset some value, like the currentBufferOffset, after rebuffering). However, your observation about simplifying this patch (in particular, eliminating {{currentBufferOffset}}) made me reconsider the needs of this class. Basically, we just need to correctly track the streamOffset for the current buffer, and that's all. When I ported this class from 3.11, I over-complicated the offsets and counters in the first version of this class (committed with CASSANDRA-12229), and then confused it again (while resolving the error) with the first patch. In short: as long as I correctly calculate streamOffset, that should satisfy the needs of the class. Thus, I eliminated both {{current}} and {{currentBufferOffset}}, and the result is clearer and correct. I've pushed a cleaned up branch (which has been rebased to trunk). Please note that, as with the first patch, the majority of this patch is refactoring to clean up the class in general. I've also updated my dtest patch as my version required a stress profile (based on [~zznate]'s original) to be committed, as well. (Note: my dtest branch also includes [~pauloricardomg]'s patch, but, as before, I'm unable to get that to fail on trunk.)

> Default repair is broken, crashes other nodes participating in repair (in trunk)
>
> Key: CASSANDRA-13938
> URL: https://issues.apache.org/jira/browse/CASSANDRA-13938
> Project: Cassandra
> Issue Type: Bug
> Components: Repair
> Reporter: Nate McCall
> Assignee: Jason Brown
> Priority: Critical
> Fix For: 4.x
>
> Attachments: 13938.yaml, test.sh
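The simplification described in the comment above (dropping both {{current}} and {{currentBufferOffset}} and tracking only the stream offset at which the current buffer starts) can be sketched as follows. The class and method names are assumed for illustration and are not the committed code; the sketch also assumes the first buffer starts at stream offset 0 and that rebuffering happens only once a buffer is fully consumed.

```java
// Hedged sketch (invented names) of the "track only streamOffset" idea:
// the absolute read position is always streamOffset + buffer.position(),
// so no separate running `current` counter is needed.
import java.nio.ByteBuffer;

public class StreamOffsetTracker {
    private long streamOffset;   // stream offset where the current buffer begins
    private ByteBuffer buffer;   // current decompressed buffer

    StreamOffsetTracker(ByteBuffer first) {
        this.buffer = first;
        this.streamOffset = 0;   // assumption: stream starts at offset 0
    }

    // Called when the exhausted buffer is swapped for the next one:
    // advance the anchor by the bytes the previous buffer covered.
    void onRebuffer(ByteBuffer next) {
        streamOffset += buffer.limit();
        buffer = next;
    }

    // Derived, never separately maintained - the point of the cleanup.
    long absolutePosition() {
        return streamOffset + buffer.position();
    }
}
```

Because the position is derived rather than incrementally maintained, there is no first-rebuffer-after-position special case to get wrong.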
[jira] [Comment Edited] (CASSANDRA-13938) Default repair is broken, crashes other nodes participating in repair (in trunk)
[ https://issues.apache.org/jira/browse/CASSANDRA-13938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16499721#comment-16499721 ] Lerh Chuan Low edited comment on CASSANDRA-13938 at 6/4/18 4:08 AM: Here's another stacktrace that may help - I've also been getting these while testing trunk in EC2. The steps I use are the same: - Disable hintedhandoff - Take out 1 node - Run stress for 10 mins, then run repair It will error out and the nodes also end up in a bizarre situation with gossip that I will have to stop the entire cluster and then start them up one at a time (in a rolling restart they still won't be able to sort themselves out). Sometimes it errors with {{stream can only read forward}} (as above and in the JIRA), but here's another stacktrace that has also showed up several times in some of the failed nodes: {code:java} May 31 02:07:24 ip-10-0-18-230 cassandra[6034]: ERROR [Stream-Deserializer-35.155.140.194:39371-28daf76d] 2018-05-31 02:07:24,445 StreamingInboundHandler.java:210 - [Stream channel: 28daf76d] stream operation from 35.155.140.194:39371 failed May 31 02:07:24 ip-10-0-18-230 cassandra[6034]: java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 1711542017 May 31 02:07:24 ip-10-0-18-230 cassandra[6034]: at net.jpountz.util.ByteBufferUtils.checkRange(ByteBufferUtils.java:20) May 31 02:07:24 ip-10-0-18-230 cassandra[6034]: at net.jpountz.util.ByteBufferUtils.checkRange(ByteBufferUtils.java:14) May 31 02:07:24 ip-10-0-18-230 cassandra[6034]: at net.jpountz.lz4.LZ4JNIFastDecompressor.decompress(LZ4JNIFastDecompressor.java:48) May 31 02:07:24 ip-10-0-18-230 cassandra[6034]: at org.apache.cassandra.io.compress.LZ4Compressor.uncompress(LZ4Compressor.java:162) May 31 02:07:24 ip-10-0-18-230 cassandra[6034]: at org.apache.cassandra.db.streaming.CompressedInputStream.decompress(CompressedInputStream.java:163) May 31 02:07:24 ip-10-0-18-230 cassandra[6034]: at 
org.apache.cassandra.db.streaming.CompressedInputStream.reBuffer(CompressedInputStream.java:144) May 31 02:07:24 ip-10-0-18-230 cassandra[6034]: at org.apache.cassandra.db.streaming.CompressedInputStream.reBuffer(CompressedInputStream.java:119) May 31 02:07:24 ip-10-0-18-230 cassandra[6034]: at org.apache.cassandra.io.util.RebufferingInputStream.readByte(RebufferingInputStream.java:144) May 31 02:07:24 ip-10-0-18-230 cassandra[6034]: at org.apache.cassandra.io.util.RebufferingInputStream.readPrimitiveSlowly(RebufferingInputStream.java:108) May 31 02:07:24 ip-10-0-18-230 cassandra[6034]: at org.apache.cassandra.io.util.RebufferingInputStream.readShort(RebufferingInputStream.java:164) May 31 02:07:24 ip-10-0-18-230 cassandra[6034]: at org.apache.cassandra.io.util.RebufferingInputStream.readUnsignedShort(RebufferingInputStream.java:170) May 31 02:07:24 ip-10-0-18-230 cassandra[6034]: at org.apache.cassandra.io.util.TrackedDataInputPlus.readUnsignedShort(TrackedDataInputPlus.java:139) May 31 02:07:24 ip-10-0-18-230 cassandra[6034]: at org.apache.cassandra.utils.ByteBufferUtil.readShortLength(ByteBufferUtil.java:367) May 31 02:07:24 ip-10-0-18-230 cassandra[6034]: at org.apache.cassandra.utils.ByteBufferUtil.readWithShortLength(ByteBufferUtil.java:377) May 31 02:07:24 ip-10-0-18-230 cassandra[6034]: at org.apache.cassandra.db.streaming.CassandraStreamReader$StreamDeserializer.newPartition(CassandraStreamReader.java:199) May 31 02:07:24 ip-10-0-18-230 cassandra[6034]: at org.apache.cassandra.db.streaming.CassandraStreamReader.writePartition(CassandraStreamReader.java:172){code} I get the feeling they may be related but I'm not sure...I can open a different Jira for this if you like, but otherwise hope it may point out more clues as to what is going on :/ was (Author: lerh low): Here's another stacktrace that may help - I've also been getting these while testing trunk in EC2. 
The steps I use are the same: - Disable hintedhandoff - Take out 1 node - Run stress for 10 mins, then run repair It will error out and the nodes also end up in a bizarre situation with gossip that I will have to stop the entire cluster and then start them up one at a time (in a rolling restart they still won't be able to sort themselves out). Sometimes it errors with {{stream can only read forward}} (as above and in the JIRA), but here's another stacktrace that has also showed up several times in some of the failed nodes: {code:java} May 31 02:07:24 ip-10-0-18-230 cassandra[6034]: WARN [Stream-Deserializer-35.155.140.194:39371-28daf76d] 2018-05-31 02:07:24,440 CompressedCassandraStreamReader.java:110 - [Stream e12b9b10-6476-11e8-936f-35a28469245e] Error while reading partition null from stream on May 31 02:07:24 ip-10-0-18-230 cassandra[6034]: ERROR [Stream-Deserializer-35.155.140.194:39371-28daf76d] 2018-05-31 02:07:24,445 StreamingInboundHandler.java:210 - [Stream channel: 28daf76d] stream o
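The out-of-range index in the trace above (1711542017) is characteristic of a length field read from a misaligned stream: once the reader's offsets drift, the bytes interpreted as a compressed-chunk length are garbage, and the decompressor indexes far past its buffer. A hypothetical guard, with assumed names (not the actual Cassandra or LZ4 API), that fails fast with a diagnosable message instead:

```java
// Hypothetical sanity check (invented names) for a compressed-chunk length
// read from the stream, rejecting garbage values before they reach the
// decompressor and surface as an ArrayIndexOutOfBoundsException.
public class ChunkGuard {
    static void checkChunkLength(int compressedLength, int bufferCapacity) {
        if (compressedLength < 0 || compressedLength > bufferCapacity)
            throw new IllegalStateException(
                "corrupt or misaligned chunk: length " + compressedLength
                + " exceeds buffer capacity " + bufferCapacity);
    }
}
```

Such a check would not fix the underlying offset drift, but it would make this failure mode report "misaligned stream" rather than a raw array-index error deep inside the LZ4 bindings.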