Ok,

So I understand now: you choose the index with the smallest average number of 
matches per key. Unfortunately this doesn't work out so well for me. I am 
running a query against the "edges" columnfamily of a graph database, which 
should return edges whose source and target labels equal given values.

I have about 30M edges, and on average the target labels have more matching 
rows than the source labels. In the case at hand, however, there are 2 matches 
on the target label and about 100K on the source label, and I have 5000 
similar queries to perform for the overall task.

What I think you should be doing is the following: open an iterator on the 
matching keys for each of the indexes; the inner loop would then pick an 
iterator at random and pull a match from it. This would ensure that the 
expected number of entries examined is a small multiple (# of other indexes) of 
the number of matches in the index with the most "precision". 
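
To make the idea concrete, here is a minimal sketch in Python. It is not 
Cassandra code: randomized_index_scan, index_iters and matches_all are 
hypothetical names, and I am assuming each index can hand back an iterator over 
the row keys matching its own expression. The loop stops as soon as any one 
iterator is drained, since every full match must appear in that index.

```python
import random

def randomized_index_scan(index_iters, matches_all, rng=random):
    """Interleave several index iterators at random.

    index_iters: one iterator of candidate row keys per index expression
                 (hypothetical interface, assumed for illustration).
    matches_all: callable(key) -> bool, checks the row against every expression.
    Returns (matching keys, number of index entries examined).
    """
    iters = [iter(it) for it in index_iters]
    seen = set()        # keys already checked, possibly via another index
    results = []
    examined = 0
    while True:
        it = rng.choice(iters)          # uniform choice among the indexes
        try:
            key = next(it)
        except StopIteration:
            # This index is fully drained, so every possible match
            # has already been examined.
            break
        examined += 1
        if key in seen:
            continue
        seen.add(key)
        if matches_all(key):
            results.append(key)
    return results, examined
```

With 2 matches on one index and 100K on the other, the loop should end after 
the short iterator has been picked three times, whichever index the overall 
statistics favour.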

Then (if you want) you can optimize by using the overall statistics to set the 
initial probabilities. But as you process the query you should mix these 
initial probabilities with probabilities proportional to the actual fraction of 
overall matches generated by each index. (I guess you could control the speed 
of mixing using the standard deviations of the initial key counts.)
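
One way the mixing might look, again only as a sketch: mixed_weights, 
pick_index and prior_strength are names I made up, and prior_strength is just 
a stand-in for whatever controls the speed of mixing (I have not worked out 
how to derive it from the standard deviations). The prior is assumed to sum 
to 1.

```python
import random

def mixed_weights(prior, hits, prior_strength=10.0):
    """Blend prior probabilities with observed match counts.

    prior: initial probability per index (assumed to sum to 1).
    hits:  number of full matches produced so far by each index.
    prior_strength: hypothetical knob; larger values mean the observed
                    counts take longer to override the prior.
    """
    total_hits = sum(hits)
    denom = prior_strength + total_hits
    return [(prior_strength * p + h) / denom for p, h in zip(prior, hits)]

def pick_index(prior, hits, prior_strength=10.0, rng=random):
    """Choose which index iterator to pull from next."""
    weights = mixed_weights(prior, hits, prior_strength)
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]
```

Before any matches are seen the weights equal the prior; as matches 
accumulate, the index that produces most of them gets picked more often.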

I know you have a new type of index in the works... but it doesn't look like 
"trunk" has any modifications for "scan", and presumably the strategy I just 
mentioned is pretty general (not depending on histograms, etc). Does it sound 
like a good idea?

-- Shaun

On Feb 6, 2011, at 12:15 AM, Jonathan Ellis wrote:

> ColumnFamilyStore.scan
> 
> On Sat, Feb 5, 2011 at 10:32 PM, Shaun Cutts <sh...@cuttshome.net> wrote:
>> Thanks for the response!
>> 
>> So.. I *may* have a bug to report (at least I can generate radically 
>> different response times based on expression order with a multiply indexed 
>> columnfamily), but first I'll have to upgrade to a stable version (currently 
>> I have 7.0rc2 installed).
>> 
>> I was also wondering where the code that does this is... is it in
>> 
>> java.org.apache.cassandra.db.columniterator.IndexedSliceReader?
>> 
>> 
>> Thanks,
>> 
>> -- Shaun
>> 
>> On Feb 5, 2011, at 2:39 PM, Jonathan Ellis wrote:
>> 
>>> On Sat, Feb 5, 2011 at 8:48 AM, Shaun Cutts <sh...@cuttshome.net> wrote:
>>>> Hello,
>>>> I'm wondering if cassandra is sensitive to the order of index expressions 
>>>> in
>>>> (pycassa call) get_indexed_slices?
>>> 
>>> No.
>>> 
>>>> If I have several column indexes available, will it attempt to optimize the
>>>> order?
>>> 
>>> Yes.
>>> 
>>> --
>>> Jonathan Ellis
>>> Project Chair, Apache Cassandra
>>> co-founder of DataStax, the source for professional Cassandra support
>>> http://www.datastax.com
>> 
>> 
> 
> 
> 
> -- 
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
