[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14392146#comment-14392146 ] Stefania commented on CASSANDRA-8180:
-
Thanks for your input. With these two optimizations in place, I think it's starting to be in decent shape and I would be happy for a first round of review. However, we can delay this until CASSANDRA-8099 has been completed, since people are busy with it and this change is based on the same branch. Also, once CASSANDRA-8915 is available we will have to integrate it, so more work and another review will be required then. Finally, although I ran all the unit tests and the cql dtests and verified that the only failures also apply to the 8099 branch, it could use more testing once 8099 is stable and 8915 is integrated.

This is what the trace for the basic example above looks like now:
{code}
cqlsh:test> SELECT * FROM test WHERE id=1 LIMIT 1;
[...]
 activity | timestamp | source | source_elapsed
----------+-----------+--------+----------------
 Execute CQL3 query | 2015-04-02 11:22:00.544000 | 127.0.0.1 | 0
 Parsing SELECT * FROM test WHERE id=1 LIMIT 1; [SharedPool-Worker-1] | 2015-04-02 11:22:00.545000 | 127.0.0.1 | 211
 Preparing statement [SharedPool-Worker-1] | 2015-04-02 11:22:00.545000 | 127.0.0.1 | 430
 Executing single-partition query on test [SharedPool-Worker-3] | 2015-04-02 11:22:00.545000 | 127.0.0.1 | 849
 Acquiring sstable references [SharedPool-Worker-3] | 2015-04-02 11:22:00.545000 | 127.0.0.1 | 915
 Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-3] | 2015-04-02 11:22:00.546000 | 127.0.0.1 | 1075
 Merging data from memtables and sstables [SharedPool-Worker-3] | 2015-04-02 11:22:00.546000 | 127.0.0.1 | 1133
 Key cache hit for sstable 3 [SharedPool-Worker-1] | 2015-04-02 11:22:00.546000 | 127.0.0.1 | 1390
 Accessed 1 sstables [SharedPool-Worker-1] | 2015-04-02 11:22:00.546000 | 127.0.0.1 | 1901
 Read 1 live and 0 tombstoned cells [SharedPool-Worker-1] | 2015-04-02 11:22:00.546000 | 127.0.0.1 | 1986
 Request complete | 2015-04-02 11:22:00.546035 | 127.0.0.1 | 2035
{code}

Optimize disk seek using min/max column name meta data when the LIMIT clause is used

Key: CASSANDRA-8180
URL: https://issues.apache.org/jira/browse/CASSANDRA-8180
Project: Cassandra
Issue Type: Improvement
Components: Core
Environment: Cassandra 2.0.10
Reporter: DOAN DuyHai
Assignee: Stefania
Priority: Minor
Fix For: 3.0

I was working on an example of a sensor data table (timeseries) and faced a use case where C* does not optimize the read on disk.
{code}
cqlsh:test> CREATE TABLE test(id int, col int, val text, PRIMARY KEY(id,col)) WITH CLUSTERING ORDER BY (col DESC);
cqlsh:test> INSERT INTO test(id, col, val) VALUES (1, 10, '10');
...
nodetool flush test test
...
cqlsh:test> INSERT INTO test(id, col, val) VALUES (1, 20, '20');
...
nodetool flush test test
...
cqlsh:test> INSERT INTO test(id, col, val) VALUES (1, 30, '30');
...
nodetool flush test test
{code}
After that, I activate request tracing:
{code}
cqlsh:test> SELECT * FROM test WHERE id=1 LIMIT 1;
 activity | timestamp | source | source_elapsed
----------+-----------+--------+----------------
 execute_cql3_query | 23:48:46,498 | 127.0.0.1 | 0
 Parsing SELECT * FROM test WHERE id=1 LIMIT 1; | 23:48:46,498 | 127.0.0.1 | 74
 Preparing statement | 23:48:46,499 | 127.0.0.1 | 253
 Executing single-partition query on test | 23:48:46,499 | 127.0.0.1 | 930
 Acquiring sstable references | 23:48:46,499 | 127.0.0.1 | 943
 Merging memtable tombstones | 23:48:46,499 | 127.0.0.1 | 1032
 Key cache hit for sstable 3 | 23:48:46,500 | 127.0.0.1 | 1160
 Seeking to partition beginning in data file | 23:48:46,500 | 127.0.0.1 | 1173
 Key cache hit for sstable 2 | 23:48:46,500 | 127.0.0.1 | 1889
 Seeking to partition beginning in data file | 23:48:46,500 | 127.0.0.1 | 1901
 Key cache hit for sstable 1 | 23:48:46,501 | 127.0.0.1 | 2373
 Seeking to partition beginning in data file | 23:48:46,501 | 127.0.0.1 | 2384
 Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones | 23:48:46,501 | 127.0.0.1 | 2768
 Merging data from memtables and 3 sstables | 23:48:46,501 | 127.0.0.1 | 2784
 Read 2 live and 0 tombstoned cells | 23:48:46,501 | 127.0.0.1 | 2976
 Request complete | 23:48:46,501 | 127.0.0.1 | 3551
{code}
We can clearly see that C* hits 3 SSTables on disk instead of just one, although it has the min/max column meta data to decide which SSTable contains the most recent data.

Funny enough, if we add a clause on the clustering column to the select, this time C* optimizes the read path:
{code}
cqlsh:test> SELECT * FROM test WHERE id=1 AND col < 25 LIMIT 1;
 activity | timestamp | source | source_elapsed
----------+-----------+--------+----------------
 execute_cql3_query | 23:52:31,888 | 127.0.0.1 | 0
 Parsing SELECT * FROM test WHERE id=1 AND col < 25 LIMIT 1; | 23:52:31,888 | 127.0.0.1 | 60
 Preparing statement | 23:52:31,888 | 127.0.0.1 | 277
 Executing single-partition query on test | 23:52:31,889 | 127.0.0.1 | 961
 Acquiring sstable references | 23:52:31,889 | 127.0.0.1 | 971
{code}
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388229#comment-14388229 ] Stefania commented on CASSANDRA-8180:
-
You are correct, a Clusterable was sufficient. There is one disk access I am not sure we can remove, when we read the partition level deletion:
{code}
status.mostRecentPartitionTombstone = Math.max(status.mostRecentPartitionTombstone,
                                               iter.partitionLevelDeletion().markedForDeleteAt());
{code}
We need to read the partition level deletion in the initial for loop of {{SinglePartitionReadCommand}}, with the sstables ordered by max timestamp, in order to skip older sstables:
{code}
if (sstable.getMaxTimestamp() < status.mostRecentPartitionTombstone)
    break;
{code}
I don't see how we could skip older sstables if they are not ordered by max timestamp and are instead picked lazily by the merge iterator when they become eligible according to their lower bound. As a consequence, I don't know how to postpone reading the partition level deletion. So far I have some code that still calls {{iter.partitionLevelDeletion()}} in that initial for loop, but other iterator methods should not be called until the table is picked by the merge iterator. Can we do better? Code is here: https://github.com/stef1927/cassandra/tree/8180-8099
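The timestamp-ordered skip loop discussed above can be sketched as follows. This is an illustrative model, not Cassandra's actual classes: `SSTable`, `selectForRead`, and the field names are hypothetical, and `Long.MIN_VALUE` stands in for "no partition tombstone".

```java
import java.util.*;

// Illustrative sketch: skip sstables whose newest data is older than the most
// recent partition-level tombstone seen so far. Requires iterating sstables
// in descending max-timestamp order, which is why the loop must read each
// partition-level deletion up front (the disk access discussed above).
public class TimestampSkip {
    static class SSTable {
        final long maxTimestamp;          // newest cell timestamp in this sstable
        final long partitionDeletionTime; // markedForDeleteAt, or Long.MIN_VALUE if none
        SSTable(long maxTimestamp, long partitionDeletionTime) {
            this.maxTimestamp = maxTimestamp;
            this.partitionDeletionTime = partitionDeletionTime;
        }
    }

    // Returns the sstables that must actually be read for the partition.
    static List<SSTable> selectForRead(List<SSTable> sstables) {
        // Newest first, so the break below is safe: everything after a skipped
        // sstable is older still.
        sstables.sort((a, b) -> Long.compare(b.maxTimestamp, a.maxTimestamp));
        long mostRecentPartitionTombstone = Long.MIN_VALUE;
        List<SSTable> selected = new ArrayList<>();
        for (SSTable sstable : sstables) {
            // All data in this (and every later) sstable is shadowed by a
            // newer partition tombstone: skip the rest entirely.
            if (sstable.maxTimestamp < mostRecentPartitionTombstone)
                break;
            mostRecentPartitionTombstone =
                Math.max(mostRecentPartitionTombstone, sstable.partitionDeletionTime);
            selected.add(sstable);
        }
        return selected;
    }

    public static void main(String[] args) {
        List<SSTable> tables = new ArrayList<>(Arrays.asList(
            new SSTable(100, Long.MIN_VALUE),
            new SSTable(50, 80),                 // partition tombstone at 80
            new SSTable(30, Long.MIN_VALUE)));   // shadowed: maxTimestamp 30 < 80
        System.out.println(selectForRead(tables).size()); // prints 2
    }
}
```

The sketch shows why the ordering matters: without descending max timestamps, the `break` could not prune the remaining sstables in one step.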
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388301#comment-14388301 ] Sylvain Lebresne commented on CASSANDRA-8180:
-
bq. There is one disk access I am not sure if we can remove, when we read the partition level deletion

It's a good point. A simple optimization, however, could be to collect whether a sstable has any partition level deletion in the first place. I suspect it's not that uncommon to not use partition level deletions at all, so we could avoid that seek at least in that case.

Now I'll also note that when we do have to fetch the partition deletion, if the partition is indexed, we get it from the index file. That means we at least avoid seeking into the data file, but it also means that with the index entry comes the partition row-index, which is actually more precise than the global sstable min/max clustering values. In other words, another optimization we could do: when we already have the index entry at the time the sstable iterator's lower bound is queried, we should return the min clustering from the row index rather than the one from the sstable min clustering values.
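The first optimization suggested above (skip the seek when the sstable is known to hold no partition-level deletions) can be sketched like this. The `StatsMetadata` flag and method names here are hypothetical, not Cassandra's real API:

```java
import java.util.function.LongSupplier;

// Sketch: record at write time whether an sstable contains any partition-level
// deletion, and consult that flag before paying a disk seek to read one.
public class DeletionFlag {
    static class StatsMetadata {
        final boolean hasPartitionLevelDeletions; // persisted with the other stats
        StatsMetadata(boolean hasPartitionLevelDeletions) {
            this.hasPartitionLevelDeletions = hasPartitionLevelDeletions;
        }
    }

    // seekAndRead models the expensive path: seeking into the index/data file.
    static long partitionDeletionTime(StatsMetadata stats, LongSupplier seekAndRead) {
        // No partition tombstones anywhere in the sstable: answer for free.
        if (!stats.hasPartitionLevelDeletions)
            return Long.MIN_VALUE; // "no deletion"
        return seekAndRead.getAsLong(); // only now pay the disk access
    }

    public static void main(String[] args) {
        long t = partitionDeletionTime(new StatsMetadata(false),
                () -> { throw new AssertionError("should not seek"); });
        System.out.println(t == Long.MIN_VALUE); // prints true, and no seek happened
    }
}
```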
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383642#comment-14383642 ] Stefania commented on CASSANDRA-8180:
-
Using {{MergeIterator}} is a great idea. I still have some details to iron out, but it's already looking much better. I have one question: the iterator is over atoms, but the sstable min and max column names are lists of ByteBuffer. I can compare these with atoms using the ClusteringComparator, but it would be nice to convert the lower bound to an atom, so we can have only one generic type (the {{In}} type) in the MergeIterator specialization, which must feed atoms upstream. Is there a way to do this, or do I just have to settle for having two different (comparable) types in MergeIterator?
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383652#comment-14383652 ] Sylvain Lebresne commented on CASSANDRA-8180:
-
bq. it would be nice to convert the lower bound to an atom

I suspect you don't need an Atom, only a Clusterable, in which case you can convert the lower bound to a Clustering with something like {{new SimpleClustering(sstable.minClusteringValues.toArray(new ByteBuffer\[metadata.clusteringColumns().size()\]))}}.
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381674#comment-14381674 ] Sylvain Lebresne commented on CASSANDRA-8180:
-
bq. I found it much easier to understand

Glad that's the case.

bq. I think it might make sense if I implement this change directly on a branch based on {{8099_engine_refactor}}

I wouldn't be the one to blame you for that.

bq. I cannot find a way to implement this unless we iterate twice, the first time to count until the limit has been reached in {{SinglePartitionSliceCommand}} and the second time to return the data

You actually don't have to care about the limit (in SinglePartitionSliceCommand at least). The way to do this would be to return an iterator that first queries and returns the results of the first sstable, and once it has returned all of those, transparently queries the 2nd sstable and starts returning its results, etc.

That being said, I do suspect doing this at the merging level (in MergeIterator) would be better. The idea would be to specialize the merge iterator to take specific iterators that expose some {{lowerBound()}} method. That method would be allowed to return a value that is not returned by the iterator but is lower than anything it will return. The merge iterator would use those lower bounds as initial {{Candidate}}s for the iterators, but would know that when it consumes one of those candidates it should just discard it (and get the actual next value of the iterator). Basically, we'd add a way for an iterator to say "don't bother using me until you've at least reached value X". The sstable iterators would typically implement that {{lowerBound}} method by returning the sstable min column name. Provided we make sure the sstable iterators don't do any work unless their {{hasNext/next}} methods are called, we wouldn't actually use a sstable until we've reached its min column name.

Doing it that way would have 2 advantages over doing it at the collation level:
# it is more general, as it would work even if the sstables' min/max column names intersect (it's harder/uglier to do the same at the collation level imo);
# it would work for range queries too.

We may want to build that on top of CASSANDRA-8915, however.
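The {{lowerBound()}} idea described above can be sketched with a small merge over integer sources. This is a toy model under assumed names (`Source`, `mergeWithLimit`), not MergeIterator's real API: each source advertises a bound lower than anything it will produce, and the merge only opens a source (pays its disk seek) once the frontier actually reaches that bound.

```java
import java.util.*;
import java.util.function.Supplier;

// Toy model of a merge iterator seeded with per-source lower bounds.
// A lower-bound candidate is a placeholder: when consumed, it is discarded
// and the source is opened lazily to supply its first real value.
public class LazyMerge {
    static class Source {
        final int lowerBound;                   // e.g. the sstable min column name
        final Supplier<Iterator<Integer>> open; // opening may cost a disk seek
        Iterator<Integer> iter;                 // null until actually opened
        Source(int lowerBound, Supplier<Iterator<Integer>> open) {
            this.lowerBound = lowerBound;
            this.open = open;
        }
    }

    // Ascending merge, stopping after `limit` values; returns how many sources
    // were actually opened (the interesting metric for this optimization).
    static int mergeWithLimit(List<Source> sources, int limit, List<Integer> out) {
        // Candidate = {value, sourceIndex, isLowerBoundPlaceholder}
        PriorityQueue<int[]> pq =
            new PriorityQueue<>(Comparator.comparingInt((int[] c) -> c[0]));
        for (int i = 0; i < sources.size(); i++)
            pq.add(new int[]{sources.get(i).lowerBound, i, 1}); // seed with bounds
        int openedCount = 0;
        while (!pq.isEmpty() && out.size() < limit) {
            int[] cand = pq.poll();
            Source s = sources.get(cand[1]);
            if (cand[2] == 1) {
                // Placeholder reached: discard it and open the source for real.
                s.iter = s.open.get();
                openedCount++;
            } else {
                out.add(cand[0]); // a real merged value
            }
            if (s.iter.hasNext())
                pq.add(new int[]{s.iter.next(), cand[1], 0});
        }
        return openedCount;
    }

    public static void main(String[] args) {
        List<Source> sources = Arrays.asList(
            new Source(1, () -> Arrays.asList(1, 2).iterator()),
            new Source(10, () -> Arrays.asList(10, 11).iterator()));
        List<Integer> out = new ArrayList<>();
        int opened = mergeWithLimit(sources, 2, out);
        System.out.println(out + " opened=" + opened); // prints [1, 2] opened=1
    }
}
```

With LIMIT 2 satisfied before the frontier reaches 10, the second source is never opened, which is exactly the seek this ticket wants to avoid. (The real use case iterates in the table's clustering order, descending in the example; ascending is used here only for brevity.)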
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381624#comment-14381624 ] Stefania commented on CASSANDRA-8180:
-
[~slebresne], [~thobbs], [~iamaleksey]: I think it might make sense if I implement this change directly on a branch based on {{8099_engine_refactor}}? First of all, I found it *much easier* to understand, and secondly I don't particularly want to rebase or merge later on once 8099 is merged into trunk. Any concerns?

I've been looking at the code on 8099 today, and I cannot find a way to implement this unless we iterate twice, the first time to count until the limit has been reached in {{SinglePartitionSliceCommand}} and the second time to return the data. Or have I missed something? If not, I think we need to store the data in memory via an {{ArrayBackedPartition}}, is this correct? Here is a very inefficient and ugly way to do this, may I have some pointers on how to improve it? https://github.com/stef1927/cassandra/commits/8180-8099 Specifically in {{querySSTablesByClustering()}} at line 254 of {{SinglePartitionSliceCommand.java}}.
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182504#comment-14182504 ] Sylvain Lebresne commented on CASSANDRA-8180:
-
I'm also really curious as to how you expect this optimization to work, because I don't see how it could work.
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max column name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14182513#comment-14182513 ] Sylvain Lebresne commented on CASSANDRA-8180:
-
Oh, unless you mean to order the sstables by their min column names and to query them one at a time (like we do for names filters) in the specific case where their column name intervals don't intersect. In which case, yes, that could indeed be useful with the compaction strategy from CASSANDRA-6602.
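The precondition in the comment above (column name intervals that don't intersect) amounts to a pairwise-disjointness check over each sstable's [min, max] clustering range. A minimal sketch, with clustering values simplified to ints and all names hypothetical:

```java
import java.util.*;

// Sketch: the one-sstable-at-a-time strategy only applies when the sstables'
// [min, max] clustering ranges are pairwise disjoint.
public class DisjointCheck {
    static class Range {
        final int min, max; // sstable min/max column name, simplified to ints
        Range(int min, int max) { this.min = min; this.max = max; }
    }

    // True if no two ranges overlap: sorted by min, each range must end
    // strictly before the next one begins.
    static boolean pairwiseDisjoint(List<Range> ranges) {
        List<Range> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingInt(r -> r.min));
        for (int i = 1; i < sorted.size(); i++)
            if (sorted.get(i).min <= sorted.get(i - 1).max)
                return false;
        return true;
    }

    public static void main(String[] args) {
        // The three flushes from the issue description: single-row sstables.
        List<Range> flushes = Arrays.asList(
            new Range(10, 10), new Range(20, 20), new Range(30, 30));
        System.out.println(pairwiseDisjoint(flushes));     // prints true: query one at a time
        List<Range> overlapping = Arrays.asList(new Range(10, 25), new Range(20, 30));
        System.out.println(pairwiseDisjoint(overlapping)); // prints false: must merge
    }
}
```

When the check fails, results from different sstables can interleave, which is why the later lowerBound/MergeIterator approach (which tolerates intersection) ends up being the more general solution.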
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max colunm name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182733#comment-14182733 ] DOAN DuyHai commented on CASSANDRA-8180: - The idea is: if there is no restriction on the clustering columns but only a LIMIT:

1) Order the SSTables by min/max column name, following the clustering order of the first clustering column
2) Hit the first SSTable and read sequentially until the LIMIT is reached
3) If the LIMIT is large enough, switch to the next SSTable, and so on
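The three steps above can be sketched as follows (a toy model under strong assumptions, not Cassandra's actual read path: each sstable is a sorted list of (col, val) rows for the single queried partition, the sstables' clustering ranges do not overlap, and memtables and tombstones are ignored):

```python
# Sketch of the proposed LIMIT-aware read strategy from the 3 steps above.
# Hypothetical helper, not a Cassandra API.

def read_with_limit(sstables, limit, reverse=True):
    """Order sstables by their max (DESC clustering) or min (ASC)
    clustering value, then read them one at a time, stopping as soon
    as `limit` rows have been collected. Returns (rows, sstables_touched)."""
    ordered = sorted(
        sstables,
        key=lambda rows: max(c for c, _ in rows) if reverse else min(c for c, _ in rows),
        reverse=reverse,
    )
    out = []
    touched = 0
    for rows in ordered:               # steps 2 and 3: one sstable at a time
        touched += 1
        for row in sorted(rows, reverse=reverse):
            out.append(row)
            if len(out) == limit:      # stop once the LIMIT is reached
                return out, touched
    return out, touched

# The three flushed sstables from the example, one row each.
sstables = [[(10, '10')], [(20, '20')], [(30, '30')]]
rows, touched = read_with_limit(sstables, limit=1)
print(rows, touched)  # [(30, '30')] 1 -- only the newest sstable is hit
```

Note that this only returns correct results when the sstables' clustering ranges do not overlap, which is Sylvain's caveat above and why the approach pairs naturally with the compaction strategy from CASSANDRA-6602.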
[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max colunm name meta data when the LIMIT clause is used
[ https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182800#comment-14182800 ] DOAN DuyHai commented on CASSANDRA-8180: - Another test showing that it could be optimized further:

{code}
cqlsh:test> SELECT * FROM test WHERE id=1 AND col < 40 LIMIT 1;

 id | col | val
----+-----+-----
  1 |  30 |  30

(1 rows)

Tracing session: 2725c710-5b86-11e4-aeed-814585a29e7b

 activity | timestamp | source | source_elapsed
----------+-----------+--------+----------------
 execute_cql3_query | 16:00:46,850 | 127.0.0.1 | 0
 Parsing SELECT * FROM test WHERE id=1 AND col < 40 LIMIT 1; | 16:00:46,850 | 127.0.0.1 | 77
 Preparing statement | 16:00:46,850 | 127.0.0.1 | 244
 Executing single-partition query on test | 16:00:46,851 | 127.0.0.1 | 1485
 Acquiring sstable references | 16:00:46,851 | 127.0.0.1 | 1500
 Merging memtable tombstones | 16:00:46,851 | 127.0.0.1 | 1547
 Key cache hit for sstable 3 | 16:00:46,852 | 127.0.0.1 | 1641
 Seeking to partition indexed section in data file | 16:00:46,852 | 127.0.0.1 | 1651
 Key cache hit for sstable 2 | 16:00:46,854 | 127.0.0.1 | 4054
 Seeking to partition indexed section in data file | 16:00:46,854 | 127.0.0.1 | 4068
 Key cache hit for sstable 1 | 16:00:46,855 | 127.0.0.1 | 5232
 Seeking to partition indexed section in data file | 16:00:46,855 | 127.0.0.1 | 5249
 Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones | 16:00:46,855 | 127.0.0.1 | 5499
 Merging data from memtables and 3 sstables | 16:00:46,855 | 127.0.0.1 | 5515
 Read 2 live and 0 tombstoned cells | 16:00:46,855 | 127.0.0.1 | 5598
 Request complete | 16:00:46,855 | 127.0.0.1 | 5997
{code}

Now, reversing the inequality on the clustering column, C* scans 3 SSTables instead of just one (which would be enough, since LIMIT = 1).
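To connect the two observations: all three sstables intersect the slice (-inf, 40), so the existing intersection check alone cannot exclude any of them; combining it with min/max ordering and early termination would touch only sstable 3. A hypothetical sketch (the `plan_sstables` helper is illustrative only, and it assumes each sstable contributes one matching row, as in this example):

```python
# Hypothetical combination of slice filtering + min/max ordering +
# early termination for "col < slice_end LIMIT n" on a DESC-clustered table.
# A sketch under the stated assumptions, not Cassandra code.

def plan_sstables(sstables, slice_end, limit):
    """sstables: {name: (min_col, max_col)}; slice is col < slice_end.
    Returns the sstable names that would actually be touched."""
    # 1) Drop sstables whose min/max range misses the slice entirely.
    surviving = [(n, lo, hi) for n, (lo, hi) in sstables.items()
                 if slice_end is None or lo < slice_end]
    # 2) Walk survivors from highest max downward (CLUSTERING ORDER BY col DESC).
    surviving.sort(key=lambda t: t[2], reverse=True)
    touched, rows = [], 0
    for name, lo, hi in surviving:
        touched.append(name)
        rows += 1          # simplification: one matching row per sstable here
        if rows >= limit:  # 3) stop once the LIMIT is satisfied
            break
    return touched

sstables = {1: (10, 10), 2: (20, 20), 3: (30, 30)}
print(plan_sstables(sstables, slice_end=40, limit=1))  # [3]
```

With this plan, the `col < 40 LIMIT 1` query above would seek into a single sstable instead of three.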