[jira] [Comment Edited] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes

2012-09-01 Thread David Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13446542#comment-13446542
 ] 

David Alves edited comment on CASSANDRA-1337 at 9/1/12 5:09 PM:


Clean rehash that addresses Sylvain's (very helpful) comments, including an
implementation for the CQL3 case. It estimates the concurrency factor in the
following ways:

Estimate Rows:
- Primary Indexes - uses cfs's estimated keys divided by RF
- 2ndary indexes - uses the mean col count of the most selective index to 
estimate the total num keys
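
A minimal sketch of the two row estimates above, in Python for brevity. The function names and inputs are hypothetical and are not taken from the patch; the secondary-index helper assumes that "most selective" means the index with the smallest mean column count.

```python
from typing import Sequence


def estimate_rows_primary(estimated_keys: int, replication_factor: int) -> float:
    """Primary index case: the store's estimated keys divided by RF.

    Hypothetical helper; the division keeps replicated keys from being
    counted once per replica.
    """
    if replication_factor < 1:
        raise ValueError("replication_factor must be >= 1")
    return estimated_keys / replication_factor


def estimate_rows_secondary(index_mean_column_counts: Sequence[int]) -> int:
    """Secondary index case: mean column count of the most selective index.

    Assumption: the most selective index is the one with the smallest
    mean column count among the indexes usable for the query.
    """
    if not index_mean_column_counts:
        raise ValueError("at least one index is required")
    return min(index_mean_column_counts)
```

For example, `estimate_rows_primary(3000, 3)` gives `1000.0`, and `estimate_rows_secondary([500, 20, 75])` gives `20`.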

Estimate Cols (CQL3):
- IdentityFilter - uses the estimated keys + mean col count to estimate total
cols
- NamesFilter - assumes cols with names are present and uses estimated keys to
estimate total cols
- Other filters - as Sylvain mentioned, because we have no idea of the
selectivity of the col filter we cannot estimate how many cols will be returned
per node, so we revert to concurrency factor = 1.
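
One way the column estimates above could be turned into a concurrency factor, sketched in Python. Everything here is an assumption for illustration: the filter kinds are plain strings, "estimated keys + mean col count" is read as a product, and the ceil(limit / estimate) rule capped by the node count is not confirmed by the patch.

```python
import math
from typing import Optional


def estimate_columns(filter_kind: str, estimated_keys: float,
                     mean_column_count: float = 0.0,
                     named_columns: int = 0) -> Optional[float]:
    """Estimated columns for a CQL3 query, or None when it cannot be estimated."""
    if filter_kind == "identity":
        # Assumption: every column of every estimated row is returned.
        return estimated_keys * mean_column_count
    if filter_kind == "names":
        # Assumption: each requested column name is present in every row.
        return estimated_keys * named_columns
    # Any other filter: the selectivity is unknown, so no estimate.
    return None


def concurrency_factor(limit: int, estimate_per_node: Optional[float],
                       num_nodes: int) -> int:
    """Hypothetical rule for how many nodes to query in one round."""
    if estimate_per_node is None or estimate_per_node <= 0:
        return 1  # no usable estimate: fall back to one node at a time
    wanted = math.ceil(limit / estimate_per_node)
    return max(1, min(num_nodes, wanted))
```

With a limit of 1000 and an estimate of 300 per node, `concurrency_factor(1000, 300, 10)` returns `4`; with no estimate it returns `1`.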

Reimplemented the parallel execution part to make it a lot cleaner IMO
(the previous implementation was forcefully adapted from the initial sequential
execution, which made it difficult to read)

Notes:
- cql_test.py dtest is failing in the same place as trunk, need to look into it
to make sure Sylvain's dtest passes
- not sure whether to wait on read repair results for all handlers or just for 
the ones we actually use

  was (Author: dr-alves):
Clean rehash that addresses Sylvain's (very helpful) comments, including an
implementation for the CQL3 case. It estimates the concurrency factor in the
following ways:

Estimate Rows:
- Primary Indexes - uses cfs's estimated keys divided by RF
- 2ndary indexes - uses the mean col count of the most selective index to
estimate the total num keys

Estimate Cols (CQL3):
- IdentityFilter - uses the estimated keys + mean col count to estimate total
cols
- NamesFilter - assumes cols with names are present and uses estimated keys to
estimate total cols
- Other filters - as ylvain mentioned, because we have no idea of the
selectivity of the col filter we cannot estimate how many cols will be returned
per node, so we revert to concurrency factor = 1.

Reimplemented the parallel execution part to make it a lot cleaner IMO
(the previous implementation was forcefully adapted from the initial sequential
execution, which made it difficult to read)

Notes:
- cql_test.py dtest is failing in the same place as trunk, need to look into it
to make sure Sylvain's dtest passes
- not sure whether to wait on read repair results for all handlers or just for 
the ones we actually use
  
 parallelize fetching rows for low-cardinality indexes
 -

 Key: CASSANDRA-1337
 URL: https://issues.apache.org/jira/browse/CASSANDRA-1337
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: David Alves
Priority: Minor
 Fix For: 1.2.1

 Attachments: 1137-bugfix.patch, 1337.patch, 
 ASF.LICENSE.NOT.GRANTED--0001-CASSANDRA-1337-scan-concurrently-depending-on-num-rows.txt,
  CASSANDRA-1337.patch

   Original Estimate: 8h
  Remaining Estimate: 8h

 currently, we read the indexed rows from the first node (in partitioner 
 order); if that does not have enough matching rows, we read the rows from the 
 next, and so forth.
 we should use the statistics from CASSANDRA-1155 to query multiple nodes in
 parallel, such that we have a high chance of getting enough rows w/o having 
 to do another round of queries (but, if our estimate is incorrect, we do need 
 to loop and do more rounds until we have enough data or we have fetched from 
 each node).
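
The round-based behaviour described above can be sketched as follows, in Python for brevity. This is an illustration under stated assumptions, not the code in the attached patches: `fetch_rows`, `fetch_from_node` and the thread pool are hypothetical stand-ins for the coordinator's range scan.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def fetch_rows(nodes: Sequence[str],
               fetch_from_node: Callable[[str], List[str]],
               limit: int,
               concurrency: int) -> List[str]:
    """Query `concurrency` nodes per round until `limit` rows are
    collected or every node has been asked."""
    concurrency = max(1, concurrency)
    rows: List[str] = []
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for start in range(0, len(nodes), concurrency):
            batch = nodes[start:start + concurrency]
            # pool.map yields results in the order of `batch`, so rows
            # stay in the same order as the node list.
            for node_rows in pool.map(fetch_from_node, batch):
                rows.extend(node_rows)
            if len(rows) >= limit:
                break  # enough rows: no further round is needed
    return rows[:limit]
```

With `data = {"n1": ["a"], "n2": ["b", "c"], "n3": ["d"], "n4": ["e"]}`, the call `fetch_rows(list(data), data.__getitem__, limit=3, concurrency=2)` returns `["a", "b", "c"]` after a single round; with `limit=4` a second round is needed to reach `n3` and `n4`.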

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Comment Edited] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes

2012-09-01 Thread David Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13446542#comment-13446542
 ] 

David Alves edited comment on CASSANDRA-1337 at 9/1/12 5:09 PM:


Clean rehash that addresses Sylvain's (very helpful) comments, including an
implementation for the CQL3 case. It estimates the concurrency factor in the
following ways:

Estimate Rows:
- Primary Indexes - uses cfs's estimated keys divided by RF
- 2ndary indexes - uses the mean col count of the most selective index to
estimate the total num keys

Estimate Cols (CQL3):
- IdentityFilter - uses the estimated keys + mean col count to estimate total
cols
- NamesFilter - assumes cols with names are present and uses estimated keys to
estimate total cols
- Other filters - as Sylvain mentioned, because we have no idea of the
selectivity of the col filter we cannot estimate how many cols will be returned
per node, so we revert to concurrency factor = 1.

Reimplemented the parallel execution part to make it a lot cleaner IMO
(the previous implementation was forcefully adapted from the initial sequential
execution, which made it difficult to read)

Notes:
- cql_test.py dtest is failing in the same place as trunk, need to look into it
to make sure Sylvain's dtest passes
- not sure whether to wait on read repair results for all handlers or just for 
the ones we actually use

  was (Author: dr-alves):
Clean rehash that addressed Sylvain's (very helpful comments) including 
implementing for the CQL3 case. It estimated concurrency factor the following 
ways:

- Primary Indexes + Thrift - divides cfs by RF
- 2ndary indexes + Thrift - uses the mean col count of the most selective index 
to estimate the number of keys
- CQL3 + IdentityFilter - uses the estimated keys + mean col count to estimate 
cols per node
- CQL3 + Names filter - assumes cols with names are present and uses estimated 
keys to calculate cols per node
- CQL3 - Other filters - as sylvain mentioned, because we have no idea of the
selectivity of the col filter we cannot estimate how many cols will be returned
per node, so we revert to concurrency factor = 1.

Reimplemented the parallel execution part to make it a lot cleaner IMO
(previous implementation was adapting sequential execution, which made it
difficult to read)

cql_test.py dtest is failing in the same place as trunk, need to look into it
to make sure Sylvain's dtest passes
  



[jira] [Comment Edited] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes

2012-07-27 Thread David Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13424071#comment-13424071
 ] 

David Alves edited comment on CASSANDRA-1337 at 7/27/12 7:07 PM:
-

patch that addresses the bugs raised by Sylvain (StorageProxyTest and
cql_test.py both pass), namely:
- local path counts as one less handler
- enough check moved out of the remote branch
- estimatedKeysPerRange takes into account replication factor
- columns.maxIsColumns sets concurrency to 1

still working on the dtest that proves (or disproves) that this works, but both
StorageProxyTest and the regression test created by Sylvain pass

I'd like to move the rest of the issues raised by Sylvain to another ticket.

  was (Author: dr-alves):
patch that addresses the bugs raised by Sylvain (StorageProxyTest and
cql_test.py both pass), namely:
- local path counts as one less handler
- enough check moven out of the remote branch
- estimatedKeysPerRange takes into account replication factor
- columns.maxIsColumns sets concurrency to 1

still working on the dtest that proves (or disproves) that this works

I'd like to move the rest of the issues raised by Sylvain to another ticket.
  
 parallelize fetching rows for low-cardinality indexes
 -

 Key: CASSANDRA-1337
 URL: https://issues.apache.org/jira/browse/CASSANDRA-1337
 Project: Cassandra
  Issue Type: Improvement
Reporter: Jonathan Ellis
Assignee: David Alves
 Fix For: 1.2

 Attachments: 
 0001-CASSANDRA-1337-scan-concurrently-depending-on-num-rows.txt, 
 1137-bugfix.patch, CASSANDRA-1337.patch

   Original Estimate: 8h
  Remaining Estimate: 8h

 currently, we read the indexed rows from the first node (in partitioner 
 order); if that does not have enough matching rows, we read the rows from the 
 next, and so forth.
 we should use the statistics from CASSANDRA-1155 to query multiple nodes in
 parallel, such that we have a high chance of getting enough rows w/o having 
 to do another round of queries (but, if our estimate is incorrect, we do need 
 to loop and do more rounds until we have enough data or we have fetched from 
 each node).





[jira] [Comment Edited] (CASSANDRA-1337) parallelize fetching rows for low-cardinality indexes

2012-07-27 Thread David Alves (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13424071#comment-13424071
 ] 

David Alves edited comment on CASSANDRA-1337 at 7/27/12 7:09 PM:
-

patch that addresses the bugs raised by Sylvain (StorageProxyTest and
cql_test.py both pass), namely:
- local path counts as one less handler
- enough check moved out of the remote branch
- estimatedKeysPerRange takes into account replication factor
- columns.maxIsColumns sets concurrency to 1

still working on the dtest that proves (or disproves) that this works

I'd like to move the rest of the issues raised by Sylvain to another ticket.

  was (Author: dr-alves):
patch that addresses the bugs raised by Sylvain (StorageProxyTest and
cql_test.py both pass), namely:
- local path counts as one less handler
- enough check moved out of the remote branch
- estimatedKeysPerRange takes into account replication factor
- columns.maxIsColumns sets concurrency to 1

still working on the dtest that proves (or disproves) that this works, but both
StorageProxyTest and the regression test created by Sylvain pass

I'd like to move the rest of the issues raised by Sylvain to another ticket.
  
