[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max colunm name meta data when the LIMIT clause is used

2015-04-01 Thread Stefania (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14392146#comment-14392146
 ] 

Stefania commented on CASSANDRA-8180:
-

Thanks for your input. With these two optimizations in place, I think it's 
starting to be in a decent shape and I would be happy for a first round of 
review.

However, we can delay this until CASSANDRA-8099 has been completed since people 
are busy with it and this change in based on the same branch.

Also, once CASSANDRA-8915 is available, we will have to integrate it, so more 
work and another review will be required then.

Finally, although I did run all unit tests and the cql dtests, and verified 
that the only failures also apply to the 8099 branch, it could use more testing 
once 8099 is stable and 8915 is integrated.

This is what the trace for the basic example above looks like now:

{code}
cqlsh:test SELECT * FROM test WHERE id=1 LIMIT 1;

[...]

 activity   
 | timestamp  | source| source_elapsed
-++---+
  
Execute CQL3 query | 2015-04-02 11:22:00.544000 | 127.0.0.1 |  0
Parsing SELECT * FROM test WHERE id=1 LIMIT 1; 
[SharedPool-Worker-1] | 2015-04-02 11:22:00.545000 | 127.0.0.1 |211
   Preparing statement 
[SharedPool-Worker-1] | 2015-04-02 11:22:00.545000 | 127.0.0.1 |430
  Executing single-partition query on test 
[SharedPool-Worker-3] | 2015-04-02 11:22:00.545000 | 127.0.0.1 |849
  Acquiring sstable references 
[SharedPool-Worker-3] | 2015-04-02 11:22:00.545000 | 127.0.0.1 |915
 Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones 
[SharedPool-Worker-3] | 2015-04-02 11:22:00.546000 | 127.0.0.1 |   1075
  Merging data from memtables and sstables 
[SharedPool-Worker-3] | 2015-04-02 11:22:00.546000 | 127.0.0.1 |   1133
   Key cache hit for sstable 3 
[SharedPool-Worker-1] | 2015-04-02 11:22:00.546000 | 127.0.0.1 |   1390
   Accessed 1 sstables 
[SharedPool-Worker-1] | 2015-04-02 11:22:00.546000 | 127.0.0.1 |   1901
Read 1 live and 0 tombstoned cells 
[SharedPool-Worker-1] | 2015-04-02 11:22:00.546000 | 127.0.0.1 |   1986

Request complete | 2015-04-02 11:22:00.546035 | 127.0.0.1 |   2035
{code}

 Optimize disk seek using min/max colunm name meta data when the LIMIT clause 
 is used
 

 Key: CASSANDRA-8180
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8180
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
 Environment: Cassandra 2.0.10
Reporter: DOAN DuyHai
Assignee: Stefania
Priority: Minor
 Fix For: 3.0


 I was working on an example of sensor data table (timeseries) and face a use 
 case where C* does not optimize read on disk.
 {code}
 cqlsh:test CREATE TABLE test(id int, col int, val text, PRIMARY KEY(id,col)) 
 WITH CLUSTERING ORDER BY (col DESC);
 cqlsh:test INSERT INTO test(id, col , val ) VALUES ( 1, 10, '10');
 ...
 nodetool flush test test
 ...
 cqlsh:test INSERT INTO test(id, col , val ) VALUES ( 1, 20, '20');
 ...
 nodetool flush test test
 ...
 cqlsh:test INSERT INTO test(id, col , val ) VALUES ( 1, 30, '30');
 ...
 nodetool flush test test
 {code}
 After that, I activate request tracing:
 {code}
 cqlsh:test SELECT * FROM test WHERE id=1 LIMIT 1;
  activity  | 
 timestamp| source| source_elapsed
 ---+--+---+
 execute_cql3_query | 
 23:48:46,498 | 127.0.0.1 |  0
 Parsing SELECT * FROM test WHERE id=1 LIMIT 1; | 
 23:48:46,498 | 127.0.0.1 | 74
Preparing statement | 
 23:48:46,499 | 127.0.0.1 |253
   Executing single-partition query on test | 
 23:48:46,499 | 127.0.0.1 |930
   

[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max colunm name meta data when the LIMIT clause is used

2015-03-31 Thread Stefania (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14388229#comment-14388229
 ] 

Stefania commented on CASSANDRA-8180:
-

You are correct a Clusterable was sufficient.

There is one disk access I am not sure if we can remove, when we read the 
partition level deletion:

{code}
status.mostRecentPartitionTombstone = 
Math.max(status.mostRecentPartitionTombstone, 
iter.partitionLevelDeletion().markedForDeleteAt());
{code}

We need to read the partition level deletion in the initial for loop of 
{{SinglePartitionReadCommand}} with the sstables ordered by max timestamp in 
order to skip older sstables:

{code}
if (sstable.getMaxTimestamp()  status.mostRecentPartitionTombstone)
break;
{code}

I don't see how we could skip older sstables if they are not ordered by max 
timestamp and are instead picked lazily by the merge iterator when they become 
eligible according to their lower bound. As a consequence I don't know how to 
postpone reading the partition level deletion.

So far I have some code that still calls {{iter.partitionLevelDeletion()}} in 
that initial for loop, but other iterator methods should not be called until 
the table is picked by the merge iterator.

Can we do better?

Code is here: https://github.com/stef1927/cassandra/tree/8180-8099

 Optimize disk seek using min/max colunm name meta data when the LIMIT clause 
 is used
 

 Key: CASSANDRA-8180
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8180
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
 Environment: Cassandra 2.0.10
Reporter: DOAN DuyHai
Assignee: Stefania
Priority: Minor
 Fix For: 3.0


 I was working on an example of sensor data table (timeseries) and face a use 
 case where C* does not optimize read on disk.
 {code}
 cqlsh:test CREATE TABLE test(id int, col int, val text, PRIMARY KEY(id,col)) 
 WITH CLUSTERING ORDER BY (col DESC);
 cqlsh:test INSERT INTO test(id, col , val ) VALUES ( 1, 10, '10');
 ...
 nodetool flush test test
 ...
 cqlsh:test INSERT INTO test(id, col , val ) VALUES ( 1, 20, '20');
 ...
 nodetool flush test test
 ...
 cqlsh:test INSERT INTO test(id, col , val ) VALUES ( 1, 30, '30');
 ...
 nodetool flush test test
 {code}
 After that, I activate request tracing:
 {code}
 cqlsh:test SELECT * FROM test WHERE id=1 LIMIT 1;
  activity  | 
 timestamp| source| source_elapsed
 ---+--+---+
 execute_cql3_query | 
 23:48:46,498 | 127.0.0.1 |  0
 Parsing SELECT * FROM test WHERE id=1 LIMIT 1; | 
 23:48:46,498 | 127.0.0.1 | 74
Preparing statement | 
 23:48:46,499 | 127.0.0.1 |253
   Executing single-partition query on test | 
 23:48:46,499 | 127.0.0.1 |930
   Acquiring sstable references | 
 23:48:46,499 | 127.0.0.1 |943
Merging memtable tombstones | 
 23:48:46,499 | 127.0.0.1 |   1032
Key cache hit for sstable 3 | 
 23:48:46,500 | 127.0.0.1 |   1160
Seeking to partition beginning in data file | 
 23:48:46,500 | 127.0.0.1 |   1173
Key cache hit for sstable 2 | 
 23:48:46,500 | 127.0.0.1 |   1889
Seeking to partition beginning in data file | 
 23:48:46,500 | 127.0.0.1 |   1901
Key cache hit for sstable 1 | 
 23:48:46,501 | 127.0.0.1 |   2373
Seeking to partition beginning in data file | 
 23:48:46,501 | 127.0.0.1 |   2384
  Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones | 
 23:48:46,501 | 127.0.0.1 |   2768
 Merging data from memtables and 3 sstables | 
 23:48:46,501 | 127.0.0.1 |   2784
 Read 2 live and 0 tombstoned cells | 
 23:48:46,501 | 127.0.0.1 |   2976
   Request complete | 
 23:48:46,501 | 127.0.0.1 |   3551
 {code}
 We can clearly see that C* hits 3 SSTables on disk instead of just one, 
 although it has the min/max column meta data to decide which SSTable contains 
 the most recent data.
 Funny 

[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max colunm name meta data when the LIMIT clause is used

2015-03-31 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14388301#comment-14388301
 ] 

Sylvain Lebresne commented on CASSANDRA-8180:
-

bq. There is one disk access I am not sure if we can remove, when we read the 
partition level deletion

It's a good point. A simple optimization however could be to collect whether a 
sstable has any partition level deletion in the first place. I suspect that 
it's not that uncommon to not use partition level deletion at all and so we 
could avoid that seeks at least in that case.

Now I'll also note that when we do have to fetch the partition deletion, if the 
partition is indexed, we get it from the index file. Which means we at least 
avoid seeking into the data file, but also mean that with the index entry comes 
the partition row-index, and that's actually more precise than the global 
sstable min/max clustering values. In other words, an optimization we could do 
is that when we do have the index entry when the sstable iterator lower bound 
is queried, we should return the min clustering from the row index rather than 
the one from the sstable min clustering values.

 Optimize disk seek using min/max colunm name meta data when the LIMIT clause 
 is used
 

 Key: CASSANDRA-8180
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8180
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
 Environment: Cassandra 2.0.10
Reporter: DOAN DuyHai
Assignee: Stefania
Priority: Minor
 Fix For: 3.0


 I was working on an example of sensor data table (timeseries) and face a use 
 case where C* does not optimize read on disk.
 {code}
 cqlsh:test CREATE TABLE test(id int, col int, val text, PRIMARY KEY(id,col)) 
 WITH CLUSTERING ORDER BY (col DESC);
 cqlsh:test INSERT INTO test(id, col , val ) VALUES ( 1, 10, '10');
 ...
 nodetool flush test test
 ...
 cqlsh:test INSERT INTO test(id, col , val ) VALUES ( 1, 20, '20');
 ...
 nodetool flush test test
 ...
 cqlsh:test INSERT INTO test(id, col , val ) VALUES ( 1, 30, '30');
 ...
 nodetool flush test test
 {code}
 After that, I activate request tracing:
 {code}
 cqlsh:test SELECT * FROM test WHERE id=1 LIMIT 1;
  activity  | 
 timestamp| source| source_elapsed
 ---+--+---+
 execute_cql3_query | 
 23:48:46,498 | 127.0.0.1 |  0
 Parsing SELECT * FROM test WHERE id=1 LIMIT 1; | 
 23:48:46,498 | 127.0.0.1 | 74
Preparing statement | 
 23:48:46,499 | 127.0.0.1 |253
   Executing single-partition query on test | 
 23:48:46,499 | 127.0.0.1 |930
   Acquiring sstable references | 
 23:48:46,499 | 127.0.0.1 |943
Merging memtable tombstones | 
 23:48:46,499 | 127.0.0.1 |   1032
Key cache hit for sstable 3 | 
 23:48:46,500 | 127.0.0.1 |   1160
Seeking to partition beginning in data file | 
 23:48:46,500 | 127.0.0.1 |   1173
Key cache hit for sstable 2 | 
 23:48:46,500 | 127.0.0.1 |   1889
Seeking to partition beginning in data file | 
 23:48:46,500 | 127.0.0.1 |   1901
Key cache hit for sstable 1 | 
 23:48:46,501 | 127.0.0.1 |   2373
Seeking to partition beginning in data file | 
 23:48:46,501 | 127.0.0.1 |   2384
  Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones | 
 23:48:46,501 | 127.0.0.1 |   2768
 Merging data from memtables and 3 sstables | 
 23:48:46,501 | 127.0.0.1 |   2784
 Read 2 live and 0 tombstoned cells | 
 23:48:46,501 | 127.0.0.1 |   2976
   Request complete | 
 23:48:46,501 | 127.0.0.1 |   3551
 {code}
 We can clearly see that C* hits 3 SSTables on disk instead of just one, 
 although it has the min/max column meta data to decide which SSTable contains 
 the most recent data.
 Funny enough, if we add a clause on the clustering column to the select, this 
 time C* optimizes the read path:
 {code}
 cqlsh:test SELECT * FROM test 

[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max colunm name meta data when the LIMIT clause is used

2015-03-27 Thread Stefania (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14383642#comment-14383642
 ] 

Stefania commented on CASSANDRA-8180:
-

Using {{MergeIterator}} is a great idea, I still have some details to iron out 
but it's already looking much better.

I have one question : the iterator is over atoms but the sstable min and max 
column names are lists of ByteBuffer, which I can compare with atoms using the 
ClusteringComparator but it would be nice to convert the lower bound to an 
atom, so we can  have only one generic type (the {{In}} type) in the 
MergeIterator specialization, which must feed atoms upstream. Is there a way to 
do this or do I just have to settle for having two different (comparable) types 
in MergeIterator?


 Optimize disk seek using min/max colunm name meta data when the LIMIT clause 
 is used
 

 Key: CASSANDRA-8180
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8180
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
 Environment: Cassandra 2.0.10
Reporter: DOAN DuyHai
Assignee: Stefania
Priority: Minor
 Fix For: 3.0


 I was working on an example of sensor data table (timeseries) and face a use 
 case where C* does not optimize read on disk.
 {code}
 cqlsh:test CREATE TABLE test(id int, col int, val text, PRIMARY KEY(id,col)) 
 WITH CLUSTERING ORDER BY (col DESC);
 cqlsh:test INSERT INTO test(id, col , val ) VALUES ( 1, 10, '10');
 ...
 nodetool flush test test
 ...
 cqlsh:test INSERT INTO test(id, col , val ) VALUES ( 1, 20, '20');
 ...
 nodetool flush test test
 ...
 cqlsh:test INSERT INTO test(id, col , val ) VALUES ( 1, 30, '30');
 ...
 nodetool flush test test
 {code}
 After that, I activate request tracing:
 {code}
 cqlsh:test SELECT * FROM test WHERE id=1 LIMIT 1;
  activity  | 
 timestamp| source| source_elapsed
 ---+--+---+
 execute_cql3_query | 
 23:48:46,498 | 127.0.0.1 |  0
 Parsing SELECT * FROM test WHERE id=1 LIMIT 1; | 
 23:48:46,498 | 127.0.0.1 | 74
Preparing statement | 
 23:48:46,499 | 127.0.0.1 |253
   Executing single-partition query on test | 
 23:48:46,499 | 127.0.0.1 |930
   Acquiring sstable references | 
 23:48:46,499 | 127.0.0.1 |943
Merging memtable tombstones | 
 23:48:46,499 | 127.0.0.1 |   1032
Key cache hit for sstable 3 | 
 23:48:46,500 | 127.0.0.1 |   1160
Seeking to partition beginning in data file | 
 23:48:46,500 | 127.0.0.1 |   1173
Key cache hit for sstable 2 | 
 23:48:46,500 | 127.0.0.1 |   1889
Seeking to partition beginning in data file | 
 23:48:46,500 | 127.0.0.1 |   1901
Key cache hit for sstable 1 | 
 23:48:46,501 | 127.0.0.1 |   2373
Seeking to partition beginning in data file | 
 23:48:46,501 | 127.0.0.1 |   2384
  Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones | 
 23:48:46,501 | 127.0.0.1 |   2768
 Merging data from memtables and 3 sstables | 
 23:48:46,501 | 127.0.0.1 |   2784
 Read 2 live and 0 tombstoned cells | 
 23:48:46,501 | 127.0.0.1 |   2976
   Request complete | 
 23:48:46,501 | 127.0.0.1 |   3551
 {code}
 We can clearly see that C* hits 3 SSTables on disk instead of just one, 
 although it has the min/max column meta data to decide which SSTable contains 
 the most recent data.
 Funny enough, if we add a clause on the clustering column to the select, this 
 time C* optimizes the read path:
 {code}
 cqlsh:test SELECT * FROM test WHERE id=1 AND col  25 LIMIT 1;
  activity  | 
 timestamp| source| source_elapsed
 ---+--+---+
 execute_cql3_query | 
 23:52:31,888 | 127.0.0.1 |  0
   

[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max colunm name meta data when the LIMIT clause is used

2015-03-27 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14383652#comment-14383652
 ] 

Sylvain Lebresne commented on CASSANDRA-8180:
-

bq. it would be nice to convert the lower bound to an atom

I suspect you don't need to have an Atom, only a Clusterable, in which case 
you can convert the lower bound to a Clustering with something like {{new 
SimpleClustering(sstable.minClusteringValues.toArray(new 
ByteBuffer\[metadata.clusteringColumns().size()\]))}}.

 Optimize disk seek using min/max colunm name meta data when the LIMIT clause 
 is used
 

 Key: CASSANDRA-8180
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8180
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
 Environment: Cassandra 2.0.10
Reporter: DOAN DuyHai
Assignee: Stefania
Priority: Minor
 Fix For: 3.0


 I was working on an example of sensor data table (timeseries) and face a use 
 case where C* does not optimize read on disk.
 {code}
 cqlsh:test CREATE TABLE test(id int, col int, val text, PRIMARY KEY(id,col)) 
 WITH CLUSTERING ORDER BY (col DESC);
 cqlsh:test INSERT INTO test(id, col , val ) VALUES ( 1, 10, '10');
 ...
 nodetool flush test test
 ...
 cqlsh:test INSERT INTO test(id, col , val ) VALUES ( 1, 20, '20');
 ...
 nodetool flush test test
 ...
 cqlsh:test INSERT INTO test(id, col , val ) VALUES ( 1, 30, '30');
 ...
 nodetool flush test test
 {code}
 After that, I activate request tracing:
 {code}
 cqlsh:test SELECT * FROM test WHERE id=1 LIMIT 1;
  activity  | 
 timestamp| source| source_elapsed
 ---+--+---+
 execute_cql3_query | 
 23:48:46,498 | 127.0.0.1 |  0
 Parsing SELECT * FROM test WHERE id=1 LIMIT 1; | 
 23:48:46,498 | 127.0.0.1 | 74
Preparing statement | 
 23:48:46,499 | 127.0.0.1 |253
   Executing single-partition query on test | 
 23:48:46,499 | 127.0.0.1 |930
   Acquiring sstable references | 
 23:48:46,499 | 127.0.0.1 |943
Merging memtable tombstones | 
 23:48:46,499 | 127.0.0.1 |   1032
Key cache hit for sstable 3 | 
 23:48:46,500 | 127.0.0.1 |   1160
Seeking to partition beginning in data file | 
 23:48:46,500 | 127.0.0.1 |   1173
Key cache hit for sstable 2 | 
 23:48:46,500 | 127.0.0.1 |   1889
Seeking to partition beginning in data file | 
 23:48:46,500 | 127.0.0.1 |   1901
Key cache hit for sstable 1 | 
 23:48:46,501 | 127.0.0.1 |   2373
Seeking to partition beginning in data file | 
 23:48:46,501 | 127.0.0.1 |   2384
  Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones | 
 23:48:46,501 | 127.0.0.1 |   2768
 Merging data from memtables and 3 sstables | 
 23:48:46,501 | 127.0.0.1 |   2784
 Read 2 live and 0 tombstoned cells | 
 23:48:46,501 | 127.0.0.1 |   2976
   Request complete | 
 23:48:46,501 | 127.0.0.1 |   3551
 {code}
 We can clearly see that C* hits 3 SSTables on disk instead of just one, 
 although it has the min/max column meta data to decide which SSTable contains 
 the most recent data.
 Funny enough, if we add a clause on the clustering column to the select, this 
 time C* optimizes the read path:
 {code}
 cqlsh:test SELECT * FROM test WHERE id=1 AND col  25 LIMIT 1;
  activity  | 
 timestamp| source| source_elapsed
 ---+--+---+
 execute_cql3_query | 
 23:52:31,888 | 127.0.0.1 |  0
Parsing SELECT * FROM test WHERE id=1 AND col  25 LIMIT 1; | 
 23:52:31,888 | 127.0.0.1 | 60
Preparing statement | 
 23:52:31,888 | 127.0.0.1 |277
   

[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max colunm name meta data when the LIMIT clause is used

2015-03-26 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14381674#comment-14381674
 ] 

Sylvain Lebresne commented on CASSANDRA-8180:
-

bq. l I found it much easier to understand

Glad that it's the case.

bq. I think it might make sense if I implement this change directly on a branch 
based on {{8099_engine_refactor}}

I wouldn't be the one to blame you for that.

bq. I cannot find a way to implement this unless we iterate twice, the first 
time to count until the limit has been reached in 
{{SinglePartitionSliceCommand}} and the second time to return the data

You actually don't have to care about the limit (in SinglePartitionSliceCommand 
at least). The way to do this would be to return an iterator that first query 
and return the results of the first sstable and once it has returned all 
results, it transparently query the 2nd sstable and start returning those 
results, etc...

That being said, I do suspect doing this at the merging level (in 
MergeIterator) would be better. The idea would be to special the merge iterator 
to take specific iterators that expose some {{lowerBound()}} method. That 
method would be allowed to return a value that is not returned by the iterator 
but is lower than anything it will return. The merge iterator would use those 
lower bound as initial {{Candidate}} for the iterators but know that when it 
consumes those canditates it should just discard them (and get the actual next 
value of the iterator). Basically, we'd add a way for the iterator to say 
don't bother using me until you've at least reached value X.  The sstable 
iterators would typically implement that {{lowerBound}} method by returning the 
sstable min column name. Provided we make sure the sstable iterators don't do 
any work unless their {{hasNext/next}} methods are called, we wouldn't actually 
use a sstable until we've reached it's min column name.

Doing it that way would 2 advantages over doing it at the collation level:
# this is more general as it would work even if the sstables min/max column 
name intersects (it's harder/uglier to do the same at the collation level imo).
# this would work for range queries too.

We may want to build that on top of CASSANDRA-8915 however.


 Optimize disk seek using min/max colunm name meta data when the LIMIT clause 
 is used
 

 Key: CASSANDRA-8180
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8180
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
 Environment: Cassandra 2.0.10
Reporter: DOAN DuyHai
Assignee: Stefania
Priority: Minor
 Fix For: 3.0


 I was working on an example of sensor data table (timeseries) and face a use 
 case where C* does not optimize read on disk.
 {code}
 cqlsh:test CREATE TABLE test(id int, col int, val text, PRIMARY KEY(id,col)) 
 WITH CLUSTERING ORDER BY (col DESC);
 cqlsh:test INSERT INTO test(id, col , val ) VALUES ( 1, 10, '10');
 ...
 nodetool flush test test
 ...
 cqlsh:test INSERT INTO test(id, col , val ) VALUES ( 1, 20, '20');
 ...
 nodetool flush test test
 ...
 cqlsh:test INSERT INTO test(id, col , val ) VALUES ( 1, 30, '30');
 ...
 nodetool flush test test
 {code}
 After that, I activate request tracing:
 {code}
 cqlsh:test SELECT * FROM test WHERE id=1 LIMIT 1;
  activity  | 
 timestamp| source| source_elapsed
 ---+--+---+
 execute_cql3_query | 
 23:48:46,498 | 127.0.0.1 |  0
 Parsing SELECT * FROM test WHERE id=1 LIMIT 1; | 
 23:48:46,498 | 127.0.0.1 | 74
Preparing statement | 
 23:48:46,499 | 127.0.0.1 |253
   Executing single-partition query on test | 
 23:48:46,499 | 127.0.0.1 |930
   Acquiring sstable references | 
 23:48:46,499 | 127.0.0.1 |943
Merging memtable tombstones | 
 23:48:46,499 | 127.0.0.1 |   1032
Key cache hit for sstable 3 | 
 23:48:46,500 | 127.0.0.1 |   1160
Seeking to partition beginning in data file | 
 23:48:46,500 | 127.0.0.1 |   1173
Key cache hit for sstable 2 | 
 23:48:46,500 | 127.0.0.1 |   1889
Seeking to partition beginning in data file | 
 23:48:46,500 | 

[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max colunm name meta data when the LIMIT clause is used

2015-03-26 Thread Stefania (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14381624#comment-14381624
 ] 

Stefania commented on CASSANDRA-8180:
-

[~slebresne], [~thobbs], [~iamaleksey] : 

I think it might make sense if I implement this change directly on a branch 
based on {{8099_engine_refactor}}? First of all I found it *much easier* to 
understand and secondly I don't particularly want to rebase or merge later on 
once 8099 is merged into trunk. Any concerns?

I've been looking at the code on 8099 today, and I cannot find a way to 
implement this unless we iterate twice, the first time to count until the limit 
has been reached in {{SinglePartitionSliceCommand}} and the second time to 
return the data. Or have I missed something? If not, I think we need to store 
the data in memory via an {{ArrayBackedPartition}}, is this correct?

Here is a very inefficient and ugly way to do this, may I have some pointers on 
to improve on it?

https://github.com/stef1927/cassandra/commits/8180-8099

Specifically in {{querySSTablesByClustering()}} at line 254 of 
{{SinglePartitionSliceCommand.java}}.

 Optimize disk seek using min/max colunm name meta data when the LIMIT clause 
 is used
 

 Key: CASSANDRA-8180
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8180
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
 Environment: Cassandra 2.0.10
Reporter: DOAN DuyHai
Assignee: Stefania
Priority: Minor
 Fix For: 3.0


 I was working on an example of sensor data table (timeseries) and face a use 
 case where C* does not optimize read on disk.
 {code}
 cqlsh:test CREATE TABLE test(id int, col int, val text, PRIMARY KEY(id,col)) 
 WITH CLUSTERING ORDER BY (col DESC);
 cqlsh:test INSERT INTO test(id, col , val ) VALUES ( 1, 10, '10');
 ...
 nodetool flush test test
 ...
 cqlsh:test INSERT INTO test(id, col , val ) VALUES ( 1, 20, '20');
 ...
 nodetool flush test test
 ...
 cqlsh:test INSERT INTO test(id, col , val ) VALUES ( 1, 30, '30');
 ...
 nodetool flush test test
 {code}
 After that, I activate request tracing:
 {code}
 cqlsh:test SELECT * FROM test WHERE id=1 LIMIT 1;
  activity  | 
 timestamp| source| source_elapsed
 ---+--+---+
 execute_cql3_query | 
 23:48:46,498 | 127.0.0.1 |  0
 Parsing SELECT * FROM test WHERE id=1 LIMIT 1; | 
 23:48:46,498 | 127.0.0.1 | 74
Preparing statement | 
 23:48:46,499 | 127.0.0.1 |253
   Executing single-partition query on test | 
 23:48:46,499 | 127.0.0.1 |930
   Acquiring sstable references | 
 23:48:46,499 | 127.0.0.1 |943
Merging memtable tombstones | 
 23:48:46,499 | 127.0.0.1 |   1032
Key cache hit for sstable 3 | 
 23:48:46,500 | 127.0.0.1 |   1160
Seeking to partition beginning in data file | 
 23:48:46,500 | 127.0.0.1 |   1173
Key cache hit for sstable 2 | 
 23:48:46,500 | 127.0.0.1 |   1889
Seeking to partition beginning in data file | 
 23:48:46,500 | 127.0.0.1 |   1901
Key cache hit for sstable 1 | 
 23:48:46,501 | 127.0.0.1 |   2373
Seeking to partition beginning in data file | 
 23:48:46,501 | 127.0.0.1 |   2384
  Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones | 
 23:48:46,501 | 127.0.0.1 |   2768
 Merging data from memtables and 3 sstables | 
 23:48:46,501 | 127.0.0.1 |   2784
 Read 2 live and 0 tombstoned cells | 
 23:48:46,501 | 127.0.0.1 |   2976
   Request complete | 
 23:48:46,501 | 127.0.0.1 |   3551
 {code}
 We can clearly see that C* hits 3 SSTables on disk instead of just one, 
 although it has the min/max column meta data to decide which SSTable contains 
 the most recent data.
 Funny enough, if we add a clause on the clustering column to the select, this 
 time C* optimizes the read path:
 {code}
 cqlsh:test SELECT * FROM test WHERE id=1 AND col  25 LIMIT 1;
  activity 

[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max colunm name meta data when the LIMIT clause is used

2014-10-24 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182504#comment-14182504
 ] 

Sylvain Lebresne commented on CASSANDRA-8180:
-

I'm also really curious as to how you expect this optimization to work, because 
I don't see how this could work.

 Optimize disk seek using min/max colunm name meta data when the LIMIT clause 
 is used
 

 Key: CASSANDRA-8180
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8180
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
 Environment: Cassandra 2.0.10
Reporter: DOAN DuyHai
Assignee: Tyler Hobbs
Priority: Minor

 I was working on an example of sensor data table (timeseries) and face a use 
 case where C* does not optimize read on disk.
 {code}
 cqlsh:test CREATE TABLE test(id int, col int, val text, PRIMARY KEY(id,col)) 
 WITH CLUSTERING ORDER BY (col DESC);
 cqlsh:test INSERT INTO test(id, col , val ) VALUES ( 1, 10, '10');
 ...
 nodetool flush test test
 ...
 cqlsh:test INSERT INTO test(id, col , val ) VALUES ( 1, 20, '20');
 ...
 nodetool flush test test
 ...
 cqlsh:test INSERT INTO test(id, col , val ) VALUES ( 1, 30, '30');
 ...
 nodetool flush test test
 {code}
 After that, I activate request tracing:
 {code}
 cqlsh:test SELECT * FROM test WHERE id=1 LIMIT 1;
  activity  | 
 timestamp| source| source_elapsed
 ---+--+---+
 execute_cql3_query | 
 23:48:46,498 | 127.0.0.1 |  0
 Parsing SELECT * FROM test WHERE id=1 LIMIT 1; | 
 23:48:46,498 | 127.0.0.1 | 74
Preparing statement | 
 23:48:46,499 | 127.0.0.1 |253
   Executing single-partition query on test | 
 23:48:46,499 | 127.0.0.1 |930
   Acquiring sstable references | 
 23:48:46,499 | 127.0.0.1 |943
Merging memtable tombstones | 
 23:48:46,499 | 127.0.0.1 |   1032
Key cache hit for sstable 3 | 
 23:48:46,500 | 127.0.0.1 |   1160
Seeking to partition beginning in data file | 
 23:48:46,500 | 127.0.0.1 |   1173
Key cache hit for sstable 2 | 
 23:48:46,500 | 127.0.0.1 |   1889
Seeking to partition beginning in data file | 
 23:48:46,500 | 127.0.0.1 |   1901
Key cache hit for sstable 1 | 
 23:48:46,501 | 127.0.0.1 |   2373
Seeking to partition beginning in data file | 
 23:48:46,501 | 127.0.0.1 |   2384
  Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones | 
 23:48:46,501 | 127.0.0.1 |   2768
 Merging data from memtables and 3 sstables | 
 23:48:46,501 | 127.0.0.1 |   2784
 Read 2 live and 0 tombstoned cells | 
 23:48:46,501 | 127.0.0.1 |   2976
   Request complete | 
 23:48:46,501 | 127.0.0.1 |   3551
 {code}
 We can clearly see that C* hits 3 SSTables on disk instead of just one, 
 although it has the min/max column meta data to decide which SSTable contains 
 the most recent data.
 Funny enough, if we add a clause on the clustering column to the select, this 
 time C* optimizes the read path:
 {code}
 cqlsh:test SELECT * FROM test WHERE id=1 AND col  25 LIMIT 1;
  activity  | 
 timestamp| source| source_elapsed
 ---+--+---+
 execute_cql3_query | 
 23:52:31,888 | 127.0.0.1 |  0
Parsing SELECT * FROM test WHERE id=1 AND col  25 LIMIT 1; | 
 23:52:31,888 | 127.0.0.1 | 60
Preparing statement | 
 23:52:31,888 | 127.0.0.1 |277
   Executing single-partition query on test | 
 23:52:31,889 | 127.0.0.1 |961
   Acquiring sstable references | 
 23:52:31,889 | 127.0.0.1 |971
  

[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max colunm name meta data when the LIMIT clause is used

2014-10-24 Thread Sylvain Lebresne (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182513#comment-14182513
 ] 

Sylvain Lebresne commented on CASSANDRA-8180:
-

Oh, unless you mean to order the sstables by mean min column names and to query 
them one at a time (like we do for names filter) in the specific case where 
their column names interval don't intersect. In which case, yes, that could be 
indeed useful with the compaction strategy from CASSANDRA-6602.

 Optimize disk seek using min/max colunm name meta data when the LIMIT clause 
 is used
 

 Key: CASSANDRA-8180
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8180
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
 Environment: Cassandra 2.0.10
Reporter: DOAN DuyHai
Assignee: Tyler Hobbs
Priority: Minor

 I was working on an example of sensor data table (timeseries) and face a use 
 case where C* does not optimize read on disk.
 {code}
 cqlsh:test CREATE TABLE test(id int, col int, val text, PRIMARY KEY(id,col)) 
 WITH CLUSTERING ORDER BY (col DESC);
 cqlsh:test INSERT INTO test(id, col , val ) VALUES ( 1, 10, '10');
 ...
 nodetool flush test test
 ...
 cqlsh:test INSERT INTO test(id, col , val ) VALUES ( 1, 20, '20');
 ...
 nodetool flush test test
 ...
 cqlsh:test INSERT INTO test(id, col , val ) VALUES ( 1, 30, '30');
 ...
 nodetool flush test test
 {code}
 After that, I activate request tracing:
 {code}
 cqlsh:test SELECT * FROM test WHERE id=1 LIMIT 1;
  activity  | 
 timestamp| source| source_elapsed
 ---+--+---+
 execute_cql3_query | 
 23:48:46,498 | 127.0.0.1 |  0
 Parsing SELECT * FROM test WHERE id=1 LIMIT 1; | 
 23:48:46,498 | 127.0.0.1 | 74
Preparing statement | 
 23:48:46,499 | 127.0.0.1 |253
   Executing single-partition query on test | 
 23:48:46,499 | 127.0.0.1 |930
   Acquiring sstable references | 
 23:48:46,499 | 127.0.0.1 |943
Merging memtable tombstones | 
 23:48:46,499 | 127.0.0.1 |   1032
Key cache hit for sstable 3 | 
 23:48:46,500 | 127.0.0.1 |   1160
Seeking to partition beginning in data file | 
 23:48:46,500 | 127.0.0.1 |   1173
Key cache hit for sstable 2 | 
 23:48:46,500 | 127.0.0.1 |   1889
Seeking to partition beginning in data file | 
 23:48:46,500 | 127.0.0.1 |   1901
Key cache hit for sstable 1 | 
 23:48:46,501 | 127.0.0.1 |   2373
Seeking to partition beginning in data file | 
 23:48:46,501 | 127.0.0.1 |   2384
  Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones | 
 23:48:46,501 | 127.0.0.1 |   2768
 Merging data from memtables and 3 sstables | 
 23:48:46,501 | 127.0.0.1 |   2784
 Read 2 live and 0 tombstoned cells | 
 23:48:46,501 | 127.0.0.1 |   2976
   Request complete | 
 23:48:46,501 | 127.0.0.1 |   3551
 {code}
 We can clearly see that C* hits 3 SSTables on disk instead of just one, 
 although it has the min/max column meta data to decide which SSTable contains 
 the most recent data.
 Funny enough, if we add a clause on the clustering column to the select, this 
 time C* optimizes the read path:
 {code}
 cqlsh:test SELECT * FROM test WHERE id=1 AND col  25 LIMIT 1;
  activity  | 
 timestamp| source| source_elapsed
 ---+--+---+
 execute_cql3_query | 
 23:52:31,888 | 127.0.0.1 |  0
Parsing SELECT * FROM test WHERE id=1 AND col  25 LIMIT 1; | 
 23:52:31,888 | 127.0.0.1 | 60
Preparing statement | 
 23:52:31,888 | 127.0.0.1 |277
   Executing single-partition query 

[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max colunm name meta data when the LIMIT clause is used

2014-10-24 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182733#comment-14182733
 ] 

DOAN DuyHai commented on CASSANDRA-8180:


The idea is:

If there is no restriction on clustering columns but only on LIMIT:

1) Order SSTables by min/max column depending on the first clustering order
2) Hit the first SSTable and start sequential read until reaching LIMIT
3) If LIMIT is large enough, switch to another SSTable and so on



 Optimize disk seek using min/max colunm name meta data when the LIMIT clause 
 is used
 

 Key: CASSANDRA-8180
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8180
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
 Environment: Cassandra 2.0.10
Reporter: DOAN DuyHai
Assignee: Tyler Hobbs
Priority: Minor

 I was working on an example of sensor data table (timeseries) and face a use 
 case where C* does not optimize read on disk.
 {code}
 cqlsh:test CREATE TABLE test(id int, col int, val text, PRIMARY KEY(id,col)) 
 WITH CLUSTERING ORDER BY (col DESC);
 cqlsh:test INSERT INTO test(id, col , val ) VALUES ( 1, 10, '10');
 ...
 nodetool flush test test
 ...
 cqlsh:test INSERT INTO test(id, col , val ) VALUES ( 1, 20, '20');
 ...
 nodetool flush test test
 ...
 cqlsh:test INSERT INTO test(id, col , val ) VALUES ( 1, 30, '30');
 ...
 nodetool flush test test
 {code}
 After that, I activate request tracing:
 {code}
 cqlsh:test SELECT * FROM test WHERE id=1 LIMIT 1;
  activity  | 
 timestamp| source| source_elapsed
 ---+--+---+
 execute_cql3_query | 
 23:48:46,498 | 127.0.0.1 |  0
 Parsing SELECT * FROM test WHERE id=1 LIMIT 1; | 
 23:48:46,498 | 127.0.0.1 | 74
Preparing statement | 
 23:48:46,499 | 127.0.0.1 |253
   Executing single-partition query on test | 
 23:48:46,499 | 127.0.0.1 |930
   Acquiring sstable references | 
 23:48:46,499 | 127.0.0.1 |943
Merging memtable tombstones | 
 23:48:46,499 | 127.0.0.1 |   1032
Key cache hit for sstable 3 | 
 23:48:46,500 | 127.0.0.1 |   1160
Seeking to partition beginning in data file | 
 23:48:46,500 | 127.0.0.1 |   1173
Key cache hit for sstable 2 | 
 23:48:46,500 | 127.0.0.1 |   1889
Seeking to partition beginning in data file | 
 23:48:46,500 | 127.0.0.1 |   1901
Key cache hit for sstable 1 | 
 23:48:46,501 | 127.0.0.1 |   2373
Seeking to partition beginning in data file | 
 23:48:46,501 | 127.0.0.1 |   2384
  Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones | 
 23:48:46,501 | 127.0.0.1 |   2768
 Merging data from memtables and 3 sstables | 
 23:48:46,501 | 127.0.0.1 |   2784
 Read 2 live and 0 tombstoned cells | 
 23:48:46,501 | 127.0.0.1 |   2976
   Request complete | 
 23:48:46,501 | 127.0.0.1 |   3551
 {code}
 We can clearly see that C* hits 3 SSTables on disk instead of just one, 
 although it has the min/max column meta data to decide which SSTable contains 
 the most recent data.
 Funny enough, if we add a clause on the clustering column to the select, this 
 time C* optimizes the read path:
 {code}
 cqlsh:test SELECT * FROM test WHERE id=1 AND col  25 LIMIT 1;
  activity  | 
 timestamp| source| source_elapsed
 ---+--+---+
 execute_cql3_query | 
 23:52:31,888 | 127.0.0.1 |  0
Parsing SELECT * FROM test WHERE id=1 AND col  25 LIMIT 1; | 
 23:52:31,888 | 127.0.0.1 | 60
Preparing statement | 
 23:52:31,888 | 127.0.0.1 |277
   Executing single-partition query on test | 
 

[jira] [Commented] (CASSANDRA-8180) Optimize disk seek using min/max colunm name meta data when the LIMIT clause is used

2014-10-24 Thread DOAN DuyHai (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182800#comment-14182800
 ] 

DOAN DuyHai commented on CASSANDRA-8180:


Another test showing that it could be optimized further:

{code}
cqlsh:test SELECT * FROM test WHERE id=1 AND col40 LIMIT 1;

 id | col | val
+-+-
  1 |  30 |  30

(1 rows)


Tracing session: 2725c710-5b86-11e4-aeed-814585a29e7b

 activity  | 
timestamp| source| source_elapsed
---+--+---+
execute_cql3_query | 
16:00:46,850 | 127.0.0.1 |  0
 Parsing SELECT * FROM test WHERE id=1 AND col40 LIMIT 1; | 
16:00:46,850 | 127.0.0.1 | 77
   Preparing statement | 
16:00:46,850 | 127.0.0.1 |244
  Executing single-partition query on test | 
16:00:46,851 | 127.0.0.1 |   1485
  Acquiring sstable references | 
16:00:46,851 | 127.0.0.1 |   1500
   Merging memtable tombstones | 
16:00:46,851 | 127.0.0.1 |   1547
   Key cache hit for sstable 3 | 
16:00:46,852 | 127.0.0.1 |   1641
 Seeking to partition indexed section in data file | 
16:00:46,852 | 127.0.0.1 |   1651
   Key cache hit for sstable 2 | 
16:00:46,854 | 127.0.0.1 |   4054
 Seeking to partition indexed section in data file | 
16:00:46,854 | 127.0.0.1 |   4068
   Key cache hit for sstable 1 | 
16:00:46,855 | 127.0.0.1 |   5232
 Seeking to partition indexed section in data file | 
16:00:46,855 | 127.0.0.1 |   5249
 Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones | 
16:00:46,855 | 127.0.0.1 |   5499
Merging data from memtables and 3 sstables | 
16:00:46,855 | 127.0.0.1 |   5515
Read 2 live and 0 tombstoned cells | 
16:00:46,855 | 127.0.0.1 |   5598
  Request complete | 
16:00:46,855 | 127.0.0.1 |   5997
{code}

 Now reversing the inequality on the clustering column, C* does scan 3 SSTables 
instead of just one (since LIMIT = 1)

 Optimize disk seek using min/max colunm name meta data when the LIMIT clause 
 is used
 

 Key: CASSANDRA-8180
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8180
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
 Environment: Cassandra 2.0.10
Reporter: DOAN DuyHai
Assignee: Tyler Hobbs
Priority: Minor

 I was working on an example of sensor data table (timeseries) and face a use 
 case where C* does not optimize read on disk.
 {code}
 cqlsh:test CREATE TABLE test(id int, col int, val text, PRIMARY KEY(id,col)) 
 WITH CLUSTERING ORDER BY (col DESC);
 cqlsh:test INSERT INTO test(id, col , val ) VALUES ( 1, 10, '10');
 ...
 nodetool flush test test
 ...
 cqlsh:test INSERT INTO test(id, col , val ) VALUES ( 1, 20, '20');
 ...
 nodetool flush test test
 ...
 cqlsh:test INSERT INTO test(id, col , val ) VALUES ( 1, 30, '30');
 ...
 nodetool flush test test
 {code}
 After that, I activate request tracing:
 {code}
 cqlsh:test SELECT * FROM test WHERE id=1 LIMIT 1;
  activity  | 
 timestamp| source| source_elapsed
 ---+--+---+
 execute_cql3_query | 
 23:48:46,498 | 127.0.0.1 |  0
 Parsing SELECT * FROM test WHERE id=1 LIMIT 1; | 
 23:48:46,498 | 127.0.0.1 | 74
Preparing statement | 
 23:48:46,499 | 127.0.0.1 |253
   Executing single-partition query on test | 
 23:48:46,499 | 127.0.0.1 |930
   Acquiring sstable references | 
 23:48:46,499 | 127.0.0.1 |943
Merging memtable tombstones | 
 23:48:46,499 | 127.0.0.1 |   1032
Key