[jira] [Commented] (CASSANDRA-15082) SASI SPARSE mode 5 limit
[ https://issues.apache.org/jira/browse/CASSANDRA-15082?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16977210#comment-16977210 ] Alex Petrov commented on CASSANDRA-15082: - This seems to be related to, or possibly a duplicate of, [CASSANDRA-13478], depending on which part of the issue is considered more important here. I agree that a general-purpose database should have no such limitation. Possibly, we could perform a similar optimisation in a way that wouldn't force the user to pick an arbitrary number that sets an upper limit on the cardinality of the data: falling back to non-sparse mode, or creating overflow pages, would be two potential options. > SASI SPARSE mode 5 limit > > > Key: CASSANDRA-15082 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15082 > Project: Cassandra > Issue Type: Improvement > Components: Feature/SASI > Reporter: Edward Capriolo > Priority: Normal > > I do not know what the "improvement" should be here, but I ran into this: > [https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/index/sasi/disk/OnDiskIndexBuilder.java#L585] > Term '55.3' belongs to more than 5 keys in sparse mode, which is not allowed. > The only reference I can find to the limit is here: > [http://www.doanduyhai.com/blog/?p=2058] > Why is it 5? Could it be a variable? Could it be an option when creating the table? Why or why not? > This seems awkward. A user can insert more than 5 rows into a table, and it "works". I.e. you can write, and you can query that table and get more than 5 results, but the index will not flush to disk. It throws an IOException. > Maybe I am misunderstanding, but this seems impossible to support: if users insert the same value more than 5 times, the entire index will not flush to disk? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
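For illustration, the hard-coded limit and the "fall back to non-sparse mode" option suggested above can be sketched in Python. This is a hypothetical model, not the actual Java logic in OnDiskIndexBuilder; the names `build_index` and `SPARSE_KEY_LIMIT` are invented for the sketch:

```python
# Hypothetical model of SASI's SPARSE-mode constraint: instead of throwing an
# IOException at flush time when a term maps to more than 5 keys, the builder
# could silently downgrade to the dense (PREFIX) layout.
SPARSE_KEY_LIMIT = 5  # the hard-coded limit the ticket asks about

def build_index(term_to_keys, mode="SPARSE"):
    """Return (mode, index), downgrading SPARSE to PREFIX instead of failing."""
    if mode == "SPARSE":
        for term, keys in term_to_keys.items():
            if len(keys) > SPARSE_KEY_LIMIT:
                # Today: "Term '...' belongs to more than 5 keys in sparse mode"
                mode = "PREFIX"
                break
    return mode, dict(term_to_keys)

# '55.3' maps to 6 keys, so a SPARSE index would fall back to PREFIX:
mode, _ = build_index({"55.3": {1, 2, 3, 4, 5, 6}})
```

The overflow-pages alternative would instead keep SPARSE layout for terms under the limit and spill the rest to a secondary structure.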
[jira] [Updated] (CASSANDRA-15077) Dropping column via thrift renders cf unreadable via CQL, leads to missing data
[ https://issues.apache.org/jira/browse/CASSANDRA-15077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-15077: Bug Category: Parent values: Correctness(12982) Level 1 values: Unrecoverable Corruption / Loss(13161) Complexity: Normal Component/s: Legacy/Distributed Metadata Discovered By: User Report Status: Open (was: Triage Needed) > Dropping column via thrift renders cf unreadable via CQL, leads to missing data > --- > > Key: CASSANDRA-15077 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15077 > Project: Cassandra > Issue Type: Bug > Components: Legacy/Distributed Metadata > Reporter: Muir Manders > Priority: Normal > > Hello > We have a lot of thrift/compact storage column families in production. We upgraded to 3.11.4 last week. This week we ran a (thrift) schema change to drop a column from a column family. Our CQL clients immediately started getting a read error ("ReadFailure: Error from server: code=1300 ...") when trying to read the column family. Thrift clients were still able to read the column family. > We determined that restarting the nodes "fixed" CQL reads, so we did that, but soon discovered that we were missing data because Cassandra was skipping sstables it didn't like on startup. 
That exception looked like this: > {noformat} > INFO [main] 2019-04-04 20:06:35,676 ColumnFamilyStore.java:430 - > Initializing test.test > ERROR [SSTableBatchOpen:1] 2019-04-04 20:06:35,689 CassandraDaemon.java:228 - > Exception in thread Thread[SSTableBatchOpen:1,5,main] > java.lang.RuntimeException: Unknown column foo during deserialization > at > org.apache.cassandra.db.SerializationHeader$Component.toHeader(SerializationHeader.java:326) > ~[apache-cassandra-3.11.4.jar:3.11.4] > at > org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:522) > ~[apache-cassandra-3.11.4.jar:3.11.4] > at > org.apache.cassandra.io.sstable.format.SSTableReader.open(SSTableReader.java:385) > ~[apache-cassandra-3.11.4.jar:3.11.4] > at > org.apache.cassandra.io.sstable.format.SSTableReader$3.run(SSTableReader.java:570) > ~[apache-cassandra-3.11.4.jar:3.11.4] > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > ~[na:1.8.0_121] > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > ~[na:1.8.0_121] > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > ~[na:1.8.0_121] > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > [na:1.8.0_121] > at > org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81) > [apache-cassandra-3.11.4.jar:3.11.4] > at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_121] > {noformat} > > Below is a list of steps to reproduce the issue. Note that in production our > column families were all created via thrift, but I thought it was simpler to > create them using CQL for the reproduction script. 
> {code}
> ccm create test -v 3.11.4 -n 1
> ccm updateconf 'start_rpc: true'
> ccm start
> sleep 10
> ccm node1 cqlsh <<SCHEMA
> CREATE KEYSPACE test WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1};
> CREATE COLUMNFAMILY test.test (
>   id text,
>   foo text,
>   bar text,
>   PRIMARY KEY (id)
> ) WITH COMPACT STORAGE;
> INSERT INTO test.test (id, foo, bar) values ('1', 'hi', 'there');
> SCHEMA
> pip install pycassa
> python <<DROP_COLUMN
> import pycassa
> sys = pycassa.system_manager.SystemManager('127.0.0.1:9160')
> cf = sys.get_keyspace_column_families('test')['test']
> sys.alter_column_family('test', 'test', column_metadata=filter(lambda c: c.name != 'foo', cf.column_metadata))
> DROP_COLUMN
> # this produces the "ReadFailure: Error from server: code=1300" error
> ccm node1 cqlsh <<QUERY
> select * from test.test;
> QUERY
> ccm node1 stop
> ccm node1 start
> sleep 10
> # this returns 0 rows (i.e. demonstrates missing data)
> ccm node1 cqlsh <<QUERY
> select * from test.test;
> QUERY
> {code}
> We added the columns back via thrift and restarted Cassandra to restore the missing data. Later we realized a secondary index on the affected column family had become out of sync with the data. We assume that was somehow a side effect of running for a period with data missing.
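The startup failure above can be modelled in a few lines. This is an illustrative toy, not Cassandra's actual code (the real logic is `SerializationHeader$Component.toHeader` from the stack trace): the old sstable's serialization header still names the dropped column, the schema lookup fails, and the whole sstable gets skipped.

```python
# Toy model of the sstable-skip: resolve an sstable's header columns against
# the current schema, failing on any column the schema no longer knows about.
def to_header(header_columns, schema_columns):
    """Resolve an sstable serialization header against the current schema."""
    for col in header_columns:
        if col not in schema_columns:
            # mirrors "RuntimeException: Unknown column foo during deserialization"
            raise RuntimeError(f"Unknown column {col} during deserialization")
    return list(header_columns)

# 'foo' was dropped from the schema, but sstables written earlier still name it:
try:
    to_header(["id", "foo", "bar"], {"id", "bar"})
except RuntimeError as exc:
    skipped_reason = str(exc)
```

In the real system the exception is caught per sstable in SSTableBatchOpen, which is why the node starts "successfully" but silently serves less data.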
[jira] [Updated] (CASSANDRA-15433) Pending ranges are not recalculated on keyspace creation
[ https://issues.apache.org/jira/browse/CASSANDRA-15433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-15433: Since Version: 3.0.0 > Pending ranges are not recalculated on keyspace creation > > > Key: CASSANDRA-15433 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15433 > Project: Cassandra > Issue Type: Bug > Components: Cluster/Membership >Reporter: Josh Snyder >Priority: Normal > > When a node begins bootstrapping, Cassandra recalculates pending tokens for > each keyspace that exists when the state change is observed (in > StorageService:handleState*). When new keyspaces are created, we do not > recalculate pending ranges (around Schema:merge). As a result, writes for new > keyspaces are not received by nodes in BOOT or BOOT_REPLACE modes. When > bootstrapping finishes, the node which just bootstrapped will not have data > for the newly created keyspace. > Consider a ring with bootstrapped nodes A, B, and C. Node D is pending, and > when it finishes bootstrapping, C will cede ownership of some ranges to D. A > quorum write is acknowledged by C and A. B missed the write, and the > coordinator didn't send it to D at all. When D finishes bootstrapping, the > quorum B+D will not contain the mutation. > Steps to reproduce: > # Join a node in BOOT mode > # Create a keyspace > # Send writes to that keyspace > # On the joining node, observe that {{nodetool cfstats}} records zero writes > to the new keyspace > I have observed this directly in Cassandra 3.0, and based on my reading the > code, I believe it affects up through trunk. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15433) Pending ranges are not recalculated on keyspace creation
[ https://issues.apache.org/jira/browse/CASSANDRA-15433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-15433: Impacts: (was: None)
[jira] [Updated] (CASSANDRA-15433) Pending ranges are not recalculated on keyspace creation
[ https://issues.apache.org/jira/browse/CASSANDRA-15433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-15433: Bug Category: Parent values: Correctness(12982) Level 1 values: Recoverable Corruption / Loss(12986) Complexity: Normal Discovered By: User Report Severity: Normal Status: Open (was: Triage Needed)
[jira] [Updated] (CASSANDRA-15433) Pending ranges are not recalculated on keyspace creation
[ https://issues.apache.org/jira/browse/CASSANDRA-15433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-15433: Component/s: Cluster/Membership
[jira] [Updated] (CASSANDRA-15052) Dtests: Add acceptable warnings to offline tool tests in order to pass them
[ https://issues.apache.org/jira/browse/CASSANDRA-15052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Petrov updated CASSANDRA-15052: Change Category: Quality Assurance Complexity: Normal Status: Open (was: Triage Needed) > Dtests: Add acceptable warnings to offline tool tests in order to pass them > --- > > Key: CASSANDRA-15052 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15052 > Project: Cassandra > Issue Type: Improvement > Components: Test/dtest > Reporter: Stefan Miklosovic > Assignee: Stefan Miklosovic > Priority: Normal > Labels: pull-request-available > Attachments: SPICE-15052.txt > > Time Spent: 50m > Remaining Estimate: 0h > > I ran the whole dtest suite, and offline_tools_test.py::TestOfflineTools::test_sstablelevelreset failed because of additional warning logs which had not been added to the acceptable ones. After adding them, the test passed fine. I believe the added warning messages have nothing to do with the test itself; it was reproduced on a c5.9xlarge instance as well as on a "regular" notebook. > > https://github.com/apache/cassandra-dtest/pull/47
[jira] [Created] (CASSANDRA-15433) Pending ranges are not recalculated on keyspace creation
Josh Snyder created CASSANDRA-15433: --- Summary: Pending ranges are not recalculated on keyspace creation Key: CASSANDRA-15433 URL: https://issues.apache.org/jira/browse/CASSANDRA-15433 Project: Cassandra Issue Type: Bug Reporter: Josh Snyder When a node begins bootstrapping, Cassandra recalculates pending tokens for each keyspace that exists when the state change is observed (in StorageService:handleState*). When new keyspaces are created, we do not recalculate pending ranges (around Schema:merge). As a result, writes for new keyspaces are not received by nodes in BOOT or BOOT_REPLACE modes. When bootstrapping finishes, the node which just bootstrapped will not have data for the newly created keyspace. Consider a ring with bootstrapped nodes A, B, and C. Node D is pending, and when it finishes bootstrapping, C will cede ownership of some ranges to D. A quorum write is acknowledged by C and A. B missed the write, and the coordinator didn't send it to D at all. When D finishes bootstrapping, the quorum B+D will not contain the mutation. Steps to reproduce: # Join a node in BOOT mode # Create a keyspace # Send writes to that keyspace # On the joining node, observe that {{nodetool cfstats}} records zero writes to the new keyspace I have observed this directly in Cassandra 3.0, and based on my reading the code, I believe it affects up through trunk. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
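The quorum scenario in the description can be checked with a toy set model (illustrative Python, not Cassandra code): RF=3, quorum=2, nodes A/B/C own the data, D is bootstrapping, and because pending ranges were not recalculated for the new keyspace the coordinator never forwards the write to D.

```python
# Toy model of the consistency violation: a quorum write lands on {A, C},
# B misses it, and the pending node D receives nothing (the bug). After D
# bootstraps and takes over part of C's range, the legal read quorum {B, D}
# intersects the set of nodes holding the mutation in zero nodes.
write_acks = {"A", "C"}                      # quorum write acknowledged by A and C
writes_to_pending = set()                    # the bug: D got no writes for the new keyspace
replicas_after_bootstrap = {"A", "B", "D"}   # C ceded ownership of the range to D

read_quorum = {"B", "D"}                     # a perfectly legal quorum once D joins
nodes_with_mutation = write_acks | writes_to_pending
lost = not (read_quorum & nodes_with_mutation)  # True: the quorum read misses the write
```

Had pending ranges been recalculated on keyspace creation, `writes_to_pending` would contain D and every quorum would intersect the write set.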
[jira] [Updated] (CASSANDRA-13019) Improve clearsnapshot to delete the snapshot files slowly
[ https://issues.apache.org/jira/browse/CASSANDRA-13019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Jirsa updated CASSANDRA-13019: --- Reviewers: Aleksey Yeschenko, Chris Lohfink, maxwellguo, Jeff Jirsa (was: Aleksey Yeschenko, Chris Lohfink, Jeff Jirsa, maxwellguo) Status: Review In Progress (was: Patch Available) > Improve clearsnapshot to delete the snapshot files slowly > -- > > Key: CASSANDRA-13019 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13019 > Project: Cassandra > Issue Type: Improvement > Components: Legacy/Core > Reporter: Dikang Gu > Assignee: Jeff Jirsa > Priority: Normal > Labels: pull-request-available > Fix For: 4.x > > Time Spent: 2h 10m > Remaining Estimate: 0h > > In our environment, we create snapshots for backup; after we finish the backup, we run {{clearsnapshot}} to delete the snapshot files. At that time we may have thousands of files to delete, which causes a sudden disk usage spike. As a result, we experience a spike of dropped messages from Cassandra. > I think we should implement something like {{slowrm}} to delete the snapshot files slowly, avoiding the sudden disk usage spike.
[jira] [Updated] (CASSANDRA-13019) Improve clearsnapshot to delete the snapshot files slowly
[ https://issues.apache.org/jira/browse/CASSANDRA-13019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Jirsa updated CASSANDRA-13019: --- Status: Ready to Commit (was: Review In Progress)
[jira] [Commented] (CASSANDRA-13019) Improve clearsnapshot to delete the snapshot files slowly
[ https://issues.apache.org/jira/browse/CASSANDRA-13019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16976959#comment-16976959 ] Jeff Jirsa commented on CASSANDRA-13019: Patch is approved by 3 people in GH PR (Aleksey, Chris, Maxwell)
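A minimal "slowrm"-style sketch of the idea in the ticket, in Python. This is hypothetical (the actual patch is Java and its pacing strategy may differ): delete snapshot files at a bounded rate so clearsnapshot does not produce a sudden burst of disk I/O.

```python
# Throttled deletion: unlink files with a fixed delay between removals so the
# filesystem sees a smooth trickle of deletes rather than thousands at once.
import os
import time

def slow_remove(paths, files_per_second=100):
    """Delete files with a pause between unlinks to smooth out disk load."""
    delay = 1.0 / files_per_second
    removed = 0
    for p in paths:
        try:
            os.remove(p)
            removed += 1
        except FileNotFoundError:
            pass  # already gone; nothing to do
        time.sleep(delay)
    return removed
```

A production version would likely throttle on bytes freed rather than file count, since snapshot files vary wildly in size.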
[jira] [Commented] (CASSANDRA-13990) Remove OldNetworkTopologyStrategy
[ https://issues.apache.org/jira/browse/CASSANDRA-13990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976941#comment-16976941 ] Anthony Grasso commented on CASSANDRA-13990: Started reviewing the patch. > Remove OldNetworkTopologyStrategy > - > > Key: CASSANDRA-13990 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13990 > Project: Cassandra > Issue Type: Improvement > Components: Local/Config >Reporter: Jeremy Hanna >Assignee: Anthony Grasso >Priority: Low > Labels: lhf > Attachments: 13990-trunk.txt > > > RackAwareStrategy was renamed OldNetworkTopologyStrategy back in 0.7 > (CASSANDRA-1392) and it's still around. Is there any reason to keep this > relatively dead code in the codebase at this point? I'm not aware of its use > and it sometimes confuses users. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Assigned] (CASSANDRA-11370) Display sstable count per level according to repair status on nodetool tablestats
[ https://issues.apache.org/jira/browse/CASSANDRA-11370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ekaterina Dimitrova reassigned CASSANDRA-11370: --- Assignee: (was: Ekaterina Dimitrova) > Display sstable count per level according to repair status on nodetool > tablestats > - > > Key: CASSANDRA-11370 > URL: https://issues.apache.org/jira/browse/CASSANDRA-11370 > Project: Cassandra > Issue Type: Improvement > Components: Tool/nodetool >Reporter: Paulo Motta >Priority: Low > Labels: lhf > > After CASSANDRA-8004 we still display sstables in each level on nodetool > tablestats as if we had a single compaction strategy, while we have one > strategy for repaired and another for unrepaired data. > We should split display into repaired and unrepaired set, so this: > SSTables in each level: [2, 20/10, 15, 0, 0, 0, 0, 0, 0] > Would become: > SSTables in each level (repaired): [1, 10, 0, 0, 0, 0, 0, 0, 0] > SSTables in each level (unrepaired): [1, 10, 15, 0, 0, 0, 0, 0, 0] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Assigned] (CASSANDRA-11370) Display sstable count per level according to repair status on nodetool tablestats
[ https://issues.apache.org/jira/browse/CASSANDRA-11370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ekaterina Dimitrova reassigned CASSANDRA-11370: --- Assignee: Ekaterina Dimitrova
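The proposed output change for CASSANDRA-11370 is simple enough to sketch directly (hypothetical formatting helper, not nodetool's actual code): report per-level SSTable counts separately for the repaired and unrepaired strategies instead of one merged array.

```python
# Sketch of the split tablestats display: two lines, one per strategy,
# replacing the single merged "SSTables in each level" array.
def format_levels(repaired, unrepaired):
    """Render per-level counts for the repaired and unrepaired sets."""
    return "\n".join([
        f"SSTables in each level (repaired): {repaired}",
        f"SSTables in each level (unrepaired): {unrepaired}",
    ])

out = format_levels([1, 10, 0, 0, 0, 0, 0, 0, 0], [1, 10, 15, 0, 0, 0, 0, 0, 0])
```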
[jira] [Comment Edited] (CASSANDRA-15318) sendMessagesToNonlocalDC() should shuffle targets
[ https://issues.apache.org/jira/browse/CASSANDRA-15318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16976809#comment-16976809 ] Dinesh Joshi edited comment on CASSANDRA-15318 at 11/18/19 7:46 PM: +1 but before merging lets ensure that the test failures are unrelated. was (Author: djoshi3): +1 but before merging lets insure that the test failures are unrelated. > sendMessagesToNonlocalDC() should shuffle targets > - > > Key: CASSANDRA-15318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15318 > Project: Cassandra > Issue Type: Improvement > Components: Messaging/Internode > Reporter: Jon Meredith > Assignee: Jon Meredith > Priority: Normal > > To better spread load and reduce the impact of a node failure before detection (or other issues like host replacement), when forwarding messages to other data centers the forwarding non-local dc nodes should be selected at random rather than always selecting the first node in the list of endpoints for a token.
[jira] [Updated] (CASSANDRA-15318) sendMessagesToNonlocalDC() should shuffle targets
[ https://issues.apache.org/jira/browse/CASSANDRA-15318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dinesh Joshi updated CASSANDRA-15318: - Reviewers: Dinesh Joshi Status: Review In Progress (was: Patch Available)
[jira] [Commented] (CASSANDRA-15318) sendMessagesToNonlocalDC() should shuffle targets
[ https://issues.apache.org/jira/browse/CASSANDRA-15318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16976809#comment-16976809 ] Dinesh Joshi commented on CASSANDRA-15318: -- +1 but before merging lets insure that the test failures are unrelated.
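The change CASSANDRA-15318 proposes can be sketched in a few lines of Python (hypothetical; the real code is Java around `sendMessagesToNonlocalDC`, and `pick_forwarder` is an invented name): choose the remote-DC node that fans the message out at random, instead of always the first endpoint in the replica list.

```python
# Randomized forwarder selection: the coordinator sends one copy to a
# randomly chosen node in the remote DC, which forwards to the rest.
import random

def pick_forwarder(remote_dc_endpoints, rng=random):
    """Return (forwarder, remaining targets), forwarder chosen at random."""
    endpoints = list(remote_dc_endpoints)
    forwarder = rng.choice(endpoints)        # previously: endpoints[0]
    rest = [e for e in endpoints if e != forwarder]
    return forwarder, rest
```

With the old `endpoints[0]` behaviour, the first replica for a token carried all forwarding traffic, and its failure before detection affected every forwarded write for that range.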
[jira] [Commented] (CASSANDRA-15429) Support NodeTool for in-jvm dtest
[ https://issues.apache.org/jira/browse/CASSANDRA-15429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16976807#comment-16976807 ] Yifan Cai commented on CASSANDRA-15429: --- [~drohrer] and I collaborated on this. The changes in the PRs: # Added a {{NodeProbeFactory}} field in NodeTool. The field can be set to the mock version, {{InternalNodeProbeFactory}}, when running dtests. # Added {{InternalNodeProbe}}, which extends {{NodeProbe}}. It supports a subset of the nodetool functionality; the unsupported operations are basically 'printing info onto the terminal' or similar display ops, which dtests have little interest in. # The changes to the production code, i.e. under the {{tools}} package, are minimal. All existing functionality of nodetool should work as is. ||PR|| |[trunk|https://github.com/apache/cassandra/pull/385]| |[cassandra-3.11|https://github.com/apache/cassandra/pull/386]| |[cassandra-3.0|https://github.com/apache/cassandra/pull/387]| |[cassandra-2.2|https://github.com/apache/cassandra/pull/388]| > Support NodeTool for in-jvm dtest > - > > Key: CASSANDRA-15429 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15429 > Project: Cassandra > Issue Type: New Feature > Components: Test/dtest > Reporter: Yifan Cai > Assignee: Yifan Cai > Priority: Normal > Labels: pull-request-available > Time Spent: 40m > Remaining Estimate: 0h > > The in-JVM dtest framework does not support nodetool as of now. This functionality is wanted in some tests, e.g. for constructing an end-to-end test scenario that uses nodetool.
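The factory-injection pattern described in the comment can be rendered in Python for illustration. The class names mirror the Java ones (NodeProbe, InternalNodeProbe, NodeTool), but this is a sketch of the pattern, not the actual API:

```python
# Dependency injection of the probe: production NodeTool talks to a live node,
# while in-JVM dtests substitute an in-process probe via the factory field.
class NodeProbe:
    def get_load(self):
        return "connect over JMX"  # stand-in for the real remote probe

class InternalNodeProbe(NodeProbe):
    def get_load(self):
        return "read in-process state"  # stand-in for the dtest probe

class NodeTool:
    def __init__(self, probe_factory=NodeProbe):
        self.probe_factory = probe_factory  # injectable; defaults to production

    def info(self):
        return self.probe_factory().get_load()

# a dtest would construct NodeTool with the internal factory:
dtest_nodetool = NodeTool(probe_factory=InternalNodeProbe)
```

Keeping the default factory as the production probe is what makes the change to the `tools` package minimal: existing callers are unaffected.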
[jira] [Commented] (CASSANDRA-15413) Missing results on reading large frozen text map
[ https://issues.apache.org/jira/browse/CASSANDRA-15413?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16976805#comment-16976805 ] Tyler Codispoti commented on CASSANDRA-15413: - As a temporary workaround, we made a change to compareNextTo() in AbstractCompoundCellNameType to force using the BytesType comparator for this column. We made further changes to ensure we don't affect any other columns, but essentially the change boils down to:
{code:java}
ByteBuffer previous = null;
for (int i = 0; i < composite.size(); i++)
{
    if (!hasComponent(i))
        return nextEOC == Composite.EOC.END ? 1 : -1;

    AbstractType<?> comparator = type.subtype(i);
    ByteBuffer value1 = nextComponents[i];
    ByteBuffer value2 = composite.get(i);

    // For a frozen map, do not compare each key/value. Compare the whole serialized binary,
    // as was done when writing to sstables.
    if (comparator instanceof MapType)
        comparator = BytesType.instance;

    int cmp = comparator.compareCollectionMembers(value1, value2, previous);
    if (cmp != 0)
        return cmp;
    previous = value1;
}
{code}
This looks to resolve the issue. It seems what is happening is that the frozen map is stored sorted by its serialized binary representation, but is compared entry-by-entry in lexicographic order when reading back. When reading pages after the first, Cassandra compares the last value of the previous page to the current page to see if any records can be skipped. Since the sort comparator is of the wrong type, you can easily end up in a state where it skips records incorrectly. 
> Missing results on reading large frozen text map > > > Key: CASSANDRA-15413 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15413 > Project: Cassandra > Issue Type: Bug > Components: Local/SSTable >Reporter: Tyler Codispoti >Assignee: Alex Petrov >Priority: Normal > > Cassandra version: 2.2.15 > I have been running into a case where, when fetching the results from a table > with a frozen<map<text, text>>, if the number of results is greater than the > fetch size (default 5000), we can end up with missing data. > Side note: The table schema comes from using KairosDB, but we've isolated > this issue to Cassandra itself. But it looks like this can cause problems for > users of KairosDB as well. > Repro case. Tested against fresh install of Cassandra 2.2.15. > 1. Create table (cqlsh) > {code:sql} > CREATE KEYSPACE test > WITH REPLICATION = { >'class' : 'SimpleStrategy', >'replication_factor' : 1 > }; > CREATE TABLE test.test ( > name text, > tags frozen<map<text, text>>, > PRIMARY KEY (name, tags) > ) WITH CLUSTERING ORDER BY (tags ASC); > {code} > 2. Insert data (python3) > {code:python} > import time > from cassandra.cluster import Cluster > cluster = Cluster(['127.0.0.1']) > session = cluster.connect('test') > for i in range(0, 2): > session.execute( > """ > INSERT INTO test (name, tags) > VALUES (%s, %s) > """, > ("test_name", {'id':str(i)}) > ) > {code} > > 3. Flush > > {code:java} > nodetool flush{code} > > > 4. Fetch data (python3) > {code:python} > import time > from cassandra.cluster import Cluster > cluster = Cluster(['127.0.0.1'], control_connection_timeout=5000) > session = cluster.connect('test') > session.default_fetch_size = 5000 > session.default_timeout = 120 > count = 0 > rows = session.execute("select tags from test where name='test_name'") > for row in rows: > count += 1 > print(count) > {code} > Result: 10111 (expected 2) > > Changing the page size changes the result count. 
Some quick samples: > > ||default_fetch_size||count|| > |5000|10111| > |1000|1830| > |999|1840| > |998|1850| > |2|2| > |10|2| > > > In short, I cannot guarantee I'll get all the results back unless the page > size > number of rows. > This seems to get worse with multiple SSTables (eg nodetool flush between > some of the insert batches). When using replication, the issue can get > disgustingly bad - potentially giving a different result on each query. > Interestingly, if we pad the values on the tag map ("id" in this repro case) so > that the insertion is in lexicographical order, there is no issue. I believe > the issue also does not repro if I do not call "nodetool flush" before > querying. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
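The ordering mismatch behind the workaround above can be demonstrated with a toy example (Python; `serialize_map` is a simplified length-prefixed encoding loosely modeled on Cassandra's collection serialization, not the exact on-disk format): sorting maps by their keys element-wise and sorting them by their serialized bytes can disagree, because the length prefix participates in the byte comparison.

```python
import struct

def serialize_map(m):
    # Simplified sketch: 2-byte big-endian length prefixes before the entry
    # count, each key, and each value (loosely modeled on Cassandra's
    # collection encoding; illustrative only).
    out = struct.pack(">h", len(m))
    for k, v in sorted(m.items()):
        out += struct.pack(">h", len(k)) + k.encode()
        out += struct.pack(">h", len(v)) + v.encode()
    return out

a = {"ab": "1"}
b = {"b": "1"}

# Element-wise (lexicographic) comparison of the keys: "ab" < "b".
elementwise = sorted([a, b], key=lambda m: sorted(m))

# Binary comparison of the serialized form: the key-length prefix
# (1 for "b", 2 for "ab") is compared first, flipping the order.
binary = sorted([a, b], key=serialize_map)

print(elementwise[0], binary[0])  # the two orders disagree
```

With an index written in one order and read back in the other, a paged query that skips "already seen" records can skip records it never actually returned, which matches the missing-rows symptom in this ticket.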
[jira] [Updated] (CASSANDRA-15429) Support NodeTool for in-jvm dtest
[ https://issues.apache.org/jira/browse/CASSANDRA-15429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated CASSANDRA-15429: --- Labels: pull-request-available (was: ) > Support NodeTool for in-jvm dtest > - > > Key: CASSANDRA-15429 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15429 > Project: Cassandra > Issue Type: New Feature > Components: Test/dtest >Reporter: Yifan Cai >Assignee: Yifan Cai >Priority: Normal > Labels: pull-request-available > > In-JVM dtest framework does not support nodetool as of now. This > functionality is wanted in some tests, e.g. constructing an end-to-end test > scenario that uses nodetool. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15318) sendMessagesToNonlocalDC() should shuffle targets
[ https://issues.apache.org/jira/browse/CASSANDRA-15318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976802#comment-16976802 ] Jon Meredith commented on CASSANDRA-15318: -- Rebased and rerunning to double-check some (thought to be) unrelated unit test failures. [CircleCI|https://circleci.com/workflow-run/c6247670-e965-4260-9632-5bd3deb9ad06] > sendMessagesToNonlocalDC() should shuffle targets > - > > Key: CASSANDRA-15318 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15318 > Project: Cassandra > Issue Type: Improvement > Components: Messaging/Internode >Reporter: Jon Meredith >Assignee: Jon Meredith >Priority: Normal > > To better spread load and reduce the impact of a node failure before > detection (or other issues like host replacement), when forwarding > messages to other data centers the forwarding non-local dc nodes should be > selected at random rather than always selecting the first node in the list of > endpoints for a token. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-15429) Support NodeTool for in-jvm dtest
[ https://issues.apache.org/jira/browse/CASSANDRA-15429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yifan Cai updated CASSANDRA-15429: -- Authors: Doug Rohrer, Yifan Cai (was: Yifan Cai) > Support NodeTool for in-jvm dtest > - > > Key: CASSANDRA-15429 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15429 > Project: Cassandra > Issue Type: New Feature > Components: Test/dtest >Reporter: Yifan Cai >Assignee: Yifan Cai >Priority: Normal > > In-JVM dtest framework does not support nodetool as of now. This > functionality is wanted in some tests, e.g. constructing an end-to-end test > scenario that uses nodetool. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Assigned] (CASSANDRA-2848) Make the Client API support passing down timeouts
[ https://issues.apache.org/jira/browse/CASSANDRA-2848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yifan Cai reassigned CASSANDRA-2848: Assignee: Yifan Cai (was: Dinesh Joshi) > Make the Client API support passing down timeouts > - > > Key: CASSANDRA-2848 > URL: https://issues.apache.org/jira/browse/CASSANDRA-2848 > Project: Cassandra > Issue Type: Improvement >Reporter: Chris Goffinet >Assignee: Yifan Cai >Priority: Low > Fix For: 3.11.x > > Attachments: 2848-trunk-v2.txt, 2848-trunk.txt > > > Having a max server RPC timeout is good for worst case, but many applications > that have middleware in front of Cassandra, might have higher timeout > requirements. In a fail fast environment, if my application starting at say > the front-end, only has 20ms to process a request, and it must connect to X > services down the stack, by the time it hits Cassandra, we might only have > 10ms. I propose we provide the ability to specify the timeout on each call we > do optionally. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-15410) Avoid over-allocation of bytes for UTF8 string serialization
[ https://issues.apache.org/jira/browse/CASSANDRA-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976776#comment-16976776 ] Yifan Cai commented on CASSANDRA-15410: --- Updated the PR with the {{sizeOfAsciiString()}} method. Since the input of the method is a US-ASCII string, the method simply returns {{2 (size) + str.length}}. It is therefore about two orders of magnitude faster than {{sizeOfString()}}, which iterates through the string.
{code:java}
[java] Benchmark                             Mode  Cnt    Score    Error  Units
[java] StringsEncodeBench.sizeOfAsciiString  avgt    6    1.999 ±  0.153  ns/op
[java] StringsEncodeBench.sizeOfString       avgt    6  283.413 ± 24.614  ns/op
{code}
> Avoid over-allocation of bytes for UTF8 string serialization > - > > Key: CASSANDRA-15410 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15410 > Project: Cassandra > Issue Type: Improvement > Components: Messaging/Client >Reporter: Yifan Cai >Assignee: Yifan Cai >Priority: Normal > Fix For: 4.0 > > > In the current message encoding implementation, it first calculates the > `encodeSize` and allocates the bytebuffer with that size. > However, during encoding, it assumes the worst case of writing UTF8 string to > allocate bytes, i.e. assuming each letter takes 3 bytes. > The over-estimation further leads to resizing the underlying array and data > copy. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
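The two sizing strategies compared in the benchmark above can be sketched as follows (Python; the function names mirror the Java methods under discussion, but the bodies are illustrative, not Cassandra's actual implementation). The general path walks every character to count UTF-8 bytes; the ASCII fast path is constant-time because every US-ASCII character is exactly one byte:

```python
def size_of_string(s):
    # General UTF-8 sizing: inspects every character (the slow path).
    n = 2  # 2-byte length prefix
    for ch in s:
        cp = ord(ch)
        if cp < 0x80:
            n += 1
        elif cp < 0x800:
            n += 2
        elif cp < 0x10000:
            n += 3
        else:
            n += 4
    return n

def size_of_ascii_string(s):
    # Fast path: every US-ASCII character is exactly one byte,
    # so the size is just the prefix plus the string length.
    return 2 + len(s)

# For ASCII input, both strategies agree; only the cost differs.
print(size_of_string("system_auth"), size_of_ascii_string("system_auth"))
```

The fast path is O(1) versus O(n) per string, which is where the two-orders-of-magnitude difference in the benchmark comes from (the benchmark input is presumably long enough for the loop to dominate).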
[jira] [Updated] (CASSANDRA-15410) Avoid over-allocation of bytes for UTF8 string serialization
[ https://issues.apache.org/jira/browse/CASSANDRA-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yifan Cai updated CASSANDRA-15410: -- Reviewers: Aleksey Yeschenko, Dinesh Joshi (was: Aleksey Yeschenko) > Avoid over-allocation of bytes for UTF8 string serialization > - > > Key: CASSANDRA-15410 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15410 > Project: Cassandra > Issue Type: Improvement > Components: Messaging/Client >Reporter: Yifan Cai >Assignee: Yifan Cai >Priority: Normal > Fix For: 4.0 > > > In the current message encoding implementation, it first calculates the > `encodeSize` and allocates the bytebuffer with that size. > However, during encoding, it assumes the worst case of writing UTF8 string to > allocate bytes, i.e. assuming each letter takes 3 bytes. > The over-estimation further leads to resizing the underlying array and data > copy. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Updated] (CASSANDRA-14888) Several mbeans are not unregistered when dropping a keyspace and table
[ https://issues.apache.org/jira/browse/CASSANDRA-14888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Lohfink updated CASSANDRA-14888: -- Reviewers: Chris Lohfink, Dinesh Joshi (was: Dinesh Joshi) > Several mbeans are not unregistered when dropping a keyspace and table > -- > > Key: CASSANDRA-14888 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14888 > Project: Cassandra > Issue Type: Bug > Components: Observability/Metrics >Reporter: Ariel Weisberg >Assignee: Alex Deparvu >Priority: Urgent > Labels: patch-available > Fix For: 4.0, 4.0-rc > > Attachments: CASSANDRA-14888.patch > > > CasCommit, CasPrepare, CasPropose, ReadRepairRequests, > ShortReadProtectionRequests, AntiCompactionTime, BytesValidated, > PartitionsValidated, RepairPrepareTime, RepairSyncTime, > RepairedDataInconsistencies, ViewLockAcquireTime, ViewReadTime, > WriteFailedIdealCL > Basically for 3 years people haven't known what they are doing because the > entire thing is kind of obscure. Fix it and also add a dtest that detects if > any mbeans are left behind after dropping a table and keyspace. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Commented] (CASSANDRA-14888) Several mbeans are not unregistered when dropping a keyspace and table
[ https://issues.apache.org/jira/browse/CASSANDRA-14888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976643#comment-16976643 ] Chris Lohfink commented on CASSANDRA-14888: --- You should add a unit test to cover MVs as well, as they have some conditionally registered metrics. There are utility methods to create the metrics and automatically deregister them on cleanup; all the metrics with issues skipped that and created the metrics manually. This patch does fix the issue, but by doing more manual cleanup. I think we should instead change these metrics to register appropriately (which may also provide keyspace metrics) or clean that mechanism up a bit to make it easier (maybe using annotations, reflection, or something?). We should try to enforce the registration and automatic cleanup, or make it easier and more obvious, instead of setting a further precedent for doing it manually. In the past the "wall of removeMetric" calls in {{release}} was never kept in sync, and while the junit should work, this actually isn't the first time a unit test like this has been made (3rd to my knowledge); there are some "flaky" scenarios with dropping tables, and the test doesn't capture everything (although we can improve that). This is the 4th time (that I remember, at least) this issue has come up, so I think we should look at it in a bigger-picture sense. That said, if you are not interested or don't have the bandwidth, we could just patch this up a little here, as it has definite value, and file a follow-up ticket to try to prevent the issue in future. 
> Several mbeans are not unregistered when dropping a keyspace and table > -- > > Key: CASSANDRA-14888 > URL: https://issues.apache.org/jira/browse/CASSANDRA-14888 > Project: Cassandra > Issue Type: Bug > Components: Observability/Metrics >Reporter: Ariel Weisberg >Assignee: Alex Deparvu >Priority: Urgent > Labels: patch-available > Fix For: 4.0, 4.0-rc > > Attachments: CASSANDRA-14888.patch > > > CasCommit, CasPrepare, CasPropose, ReadRepairRequests, > ShortReadProtectionRequests, AntiCompactionTime, BytesValidated, > PartitionsValidated, RepairPrepareTime, RepairSyncTime, > RepairedDataInconsistencies, ViewLockAcquireTime, ViewReadTime, > WriteFailedIdealCL > Basically for 3 years people haven't known what they are doing because the > entire thing is kind of obscure. Fix it and also add a dtest that detects if > any mbeans are left behind after dropping a table and keyspace. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
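The registration pattern suggested in the comment above (route every metric through a helper that records it, so release can deregister everything without a manually maintained "wall of removeMetric" calls) can be sketched as follows (Python; the class and metric names are illustrative, not Cassandra's actual API):

```python
class MetricRegistry:
    """Minimal stand-in for a metrics/MBean registry."""
    def __init__(self):
        self.metrics = {}
    def register(self, name, metric):
        self.metrics[name] = metric
        return metric
    def remove(self, name):
        self.metrics.pop(name, None)

class TableMetrics:
    """Every metric goes through create_metric(), which records the name so
    release() can deregister them all; nothing can be forgotten, even
    conditionally registered metrics."""
    def __init__(self, registry):
        self.registry = registry
        self._names = []
    def create_metric(self, name, metric):
        self._names.append(name)
        return self.registry.register(name, metric)
    def release(self):
        for name in self._names:
            self.registry.remove(name)
        self._names.clear()

reg = MetricRegistry()
tm = TableMetrics(reg)
tm.create_metric("CasCommit", object())
tm.create_metric("ViewReadTime", object())  # conditionally registered in real life
tm.release()
print(len(reg.metrics))  # nothing left behind after release
```

A dtest-style check then reduces to asserting that the registry is empty after dropping the table, rather than enumerating every metric by hand.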
[jira] [Commented] (CASSANDRA-15410) Avoid over-allocation of bytes for UTF8 string serialization
[ https://issues.apache.org/jira/browse/CASSANDRA-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976583#comment-16976583 ] Aleksey Yeschenko commented on CASSANDRA-15410: --- While you are at it, maybe update {{encodedSize()}} implementation as well to use the faster {{sizeOfAsciiString()}} - if not for performance then for symmetry? > Avoid over-allocation of bytes for UTF8 string serialization > - > > Key: CASSANDRA-15410 > URL: https://issues.apache.org/jira/browse/CASSANDRA-15410 > Project: Cassandra > Issue Type: Improvement > Components: Messaging/Client >Reporter: Yifan Cai >Assignee: Yifan Cai >Priority: Normal > Fix For: 4.0 > > > In the current message encoding implementation, it first calculates the > `encodeSize` and allocates the bytebuffer with that size. > However, during encoding, it assumes the worst case of writing UTF8 string to > allocate bytes, i.e. assuming each letter takes 3 bytes. > The over-estimation further leads to resizing the underlying array and data > copy. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org
[jira] [Created] (CASSANDRA-15432) The "read defragmentation" optimization does not work
Sylvain Lebresne created CASSANDRA-15432: Summary: The "read defragmentation" optimization does not work Key: CASSANDRA-15432 URL: https://issues.apache.org/jira/browse/CASSANDRA-15432 Project: Cassandra Issue Type: Bug Reporter: Sylvain Lebresne The so-called "read defragmentation" that was added way back in CASSANDRA-2503 actually does not work, and never has. That is, the defragmentation writes do happen, but they only add load on the nodes without helping anything, and are thus a clear negative. The "read defragmentation" (which only impacts so-called "names queries") kicks in when a read hits "too many" sstables (> 4 by default), and when it does, it writes down the result of that read. The assumption is that the next read for that data would only read the newly written data, which if not still in the memtable would at least be in a single sstable, thus speeding up that next read. Unfortunately, this is not how it works. When we defrag and write the result of our original read, we do so with the timestamps of the data read (as we should; changing the timestamps would be plain wrong). As a result, following reads will read that data first, but will have no way to tell that no more sstables should be read. Technically, the [{{reduceFilter}}|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/SinglePartitionReadCommand.java#L830] call will not return {{null}}, because {{currentMaxTs}} will be higher than at least some of the data in the result, and this holds until we've read from as many sstables as in the original read. I see no easy way to fix this. It might be possible to make it work with additional per-sstable metadata, but nothing sufficiently simple and cheap to be worth it comes to mind, and I thus suggest simply removing that code. For the record, I'll note that there is actually a 2nd problem with that code: currently, we "defrag" a read even if we didn't get data for everything that the query requests. 
This also is "wrong" even if we ignore the first issue: a following read that would read the defragmented data would also have no way to know to not read more sstables to try to get the missing parts. This problem would be fixeable, but is obviously overshadowed by the previous one anyway. Anyway, as mentioned, I suggest to just remove the "optimization" (which again, never optimized anything) altogether, and happy to provide the simple patch. The only question might be in which versions? This impact all versions, but this isn't a correction bug either, "just" a performance one. So do we want 4.0 only or is there appetite for earlier? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org