Thanks, comments inline:

On Mon, 23 Jan 2012 20:59:34 +1300
aaron morton <aa...@thelastpickle.com> wrote:
> It depends a bit on the data and the query patterns. 
> 
> * How many versions do you have ? 
We may have 10k versions in some cases; any given version may hold up
to a million names in total, though more often <10k. To manage this we
are currently using two CFs, one for storing compacted complete lists
and one for storing deltas against a compacted list. Based on usage, we
will create a new compacted list and start writing deltas against that.
We should be able to limit the number of deltas in a single row to
below 100; I'd like to keep it lower, but I'm not sure we can maintain
that under all load scenarios. The compacted lists are straightforward,
but there are many ways to structure the deltas, and they all have
trade-offs. A CF with composite columns that supported two-dimensional
slicing would be perfect.
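
To make that concrete, here is a rough sketch of our current read path
(pycassa; the CF names, key layout, and '+'/'-' delta encoding are
illustrative, not our exact schema):

    import pycassa

    pool = pycassa.ConnectionPool('VersionedLists')
    compacted = pycassa.ColumnFamily(pool, 'CompactedLists')  # hypothetical
    deltas = pycassa.ColumnFamily(pool, 'ListDeltas')         # hypothetical

    def read_names(list_id, base_version, target_version):
        # One wide row per compacted snapshot; column name = list entry.
        row_key = '%s:%s' % (list_id, base_version)
        names = set(name for name, _ in compacted.xget(row_key))
        # The delta row shares the key.  Comparator is CompositeType
        # (delta_version, name); the value is '+' (add) or '-' (remove).
        # Slice every delta up to and including target_version.
        for (version, name), op in deltas.xget(
                row_key, column_finish=(target_version,)):
            if op == '+':
                names.add(name)
            else:
                names.discard(name)
        return names

The catch is that this only slices the delta row on version; slicing on
version and a name range at the same time, in one query, is the
two-dimensional piece that's missing.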

> * How many names in each version ?
We plan on limiting to a total of 1 million names, and around 10,000 per
version (by limiting the batch size), but many deltas will have <10
names.

> * When querying do you know the version numbers you want to query
> from? How many are there normally?
Currently we don't know the version numbers in advance - they are
timestamps, and we are querying for versions less than or equal to the
desired timestamp. We have talked about using vector clock versions and
maintaining an index mapping time to version numbers, in which case we
would know the exact versions after the index lookup, at the expense of
another RTT on every operation.
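
If we did add such an index, the lookup itself would be simple, just a
reversed slice of count 1, something like this (pycassa; the index CF
and its layout are hypothetical):

    # One index row per list: column name = timestamp, value = version
    # id.  The newest version at or before a timestamp is a reversed
    # slice of count 1; this is the extra RTT mentioned above.
    def version_at(index_cf, list_id, timestamp):
        cols = index_cf.get(list_id,
                            column_start=timestamp,  # high end when reversed
                            column_reversed=True,    # walk backwards in time
                            column_count=1)
        return cols.values()[0]  # NotFoundException if nothing matches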

> * How frequent are the updates and the reads ?
We expect reads to be more frequent than writes. Unfortunately we don't
have solid numbers on what to expect, but I would guess 20x. Update
operations will involve several reads to determine where to write.


> I would lean towards using two standard CFs, one to list all the
> version numbers (in a single row probably) and one to hold the names
> in a particular version. 
> 
> To do your query slice the first CF and then run multi gets to the
> second. 
> 
> That's probably not the best solution; if you can add some more info
> it may get better.
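
If I follow, the read path under that layout would look roughly like
this (pycassa again; the row layout is my guess at what you mean):

    # Slice the single row of version numbers, then multiget the rows
    # holding the names for each version: two round trips, with the
    # multiget fanning out across nodes under RP.
    def read_names_rp(versions_cf, names_cf, list_id, max_version):
        versions = versions_cf.get(list_id, column_finish=max_version,
                                   column_count=10000)
        row_keys = ['%s:%s' % (list_id, v) for v in versions]
        return names_cf.multiget(row_keys)
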
I'm actually leaning back toward BOP, as I run into more issues
and complexity with the RP models. I'd really like to implement both
and compare them, but at this point I need to focus on one to get
things working, so I'm trying to make a best initial guess.


> 
> On 21/01/2012, at 6:20 AM, Bryce Allen wrote:
> 
> > I'm storing very large versioned lists of names, and I'd like to
> > query a range of names within a given range of versions, which is a
> > two-dimensional slice, in a single query. This is easy to do using
> > ByteOrderedPartitioner, but seems to require multiple (non-parallel)
> > queries and extra CFs when using RandomPartitioner.
> > 
> > I see two approaches when using RP:
> > 
> > 1) Data is stored in a super column family, with one dimension being
> > the super column names and the other the sub column names. Since
> > slicing on sub columns requires a list of super column names, a
> > second standard CF is needed to get a range of names before doing a
> > query on the main super CF. With CASSANDRA-2710, the same is
> > possible using a standard CF with composite types instead of a
> > super CF.
> > 
> > 2) If one of the dimensions is small, a two-dimensional slice isn't
> > required. The data can be stored in a standard CF with linear
> > ordering on a composite type (large_dimension, small_dimension).
> > Data is queried based on the large dimension, and the client throws
> > out the extra data in the other dimension.
> > 
> > Neither of the above solutions is ideal. Does anyone else have a
> > use case where two-dimensional slicing is useful? Given the
> > disadvantages of BOP, is it practical to make the composite column
> > query model richer to support this sort of use case?
> > 
> > Thanks,
> > Bryce
> 
