"Those sequences are not fixed. All sequences with the same seq_id tend to grow at the same rate. If it's one partition per seq_id, the size will most likely exceed the threshold quickly"
--> Then use bucketing to avoid too-wide partitions.

"Also new seq_types can be added and old seq_types can be deleted. This means I often need to ALTER TABLE to add and drop columns. I am not sure if this is good practice from an operational point of view."

--> I don't understand why altering the table is necessary to add seq_types. If seq_type is defined as your clustering column, you can have many of them using the same table structure ...

On Sat, Dec 6, 2014 at 10:09 PM, Kai Wang <dep...@gmail.com> wrote:

> On Sat, Dec 6, 2014 at 11:18 AM, Eric Stevens <migh...@gmail.com> wrote:
>
>> It depends on the size of your data, but if your data is reasonably small, there should be no trouble including thousands of records on the same partition key. So a data model using PRIMARY KEY ((seq_id), seq_type) ought to work fine.
>>
>> If the data size per partition exceeds some threshold that represents the right tradeoff of increased repair cost, GC pressure, the threat of unbalanced loads, and the other issues that come with wide partitions, then you can subpartition in some manner consistent with your workload, with something like PRIMARY KEY ((seq_id, subpartition), seq_type).
>>
>> For example, if seq_type can be processed for a given seq_id in any order, and you need to be able to locate specific records for a known seq_id/seq_type pair, you can compute subpartition deterministically. Or if you only ever need to read *all* values for a given seq_id, and the processing order is not important, just randomly generate a value for subpartition at write time, as long as you know all possible values for subpartition.
>>
>> If the values of seq_type for a given seq_id must always be processed in order based on seq_type, then your subpartition calculation would need to reflect that and place adjacent seq_types in the same partition.
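[Editor's note: a minimal CQL sketch of the bucketed layout described above. Table and column names (sequences, bucket, payload) are illustrative assumptions, not from the thread.]

```cql
-- Hypothetical sketch only: names and types are assumed, not from the thread.
-- One partition per (seq_id, bucket); seq_type is the clustering column, so
-- adding a new seq_type is just a new row -- no ALTER TABLE required.
CREATE TABLE sequences (
    seq_id   text,
    bucket   int,     -- the "subpartition": caps the size of any one partition
    seq_type text,    -- clustering column: many seq_types per (seq_id, bucket)
    payload  blob,
    PRIMARY KEY ((seq_id, bucket), seq_type)
);
```

With a deterministic bucket (for example, a hash of seq_type modulo a fixed bucket count), a known seq_id/seq_type pair maps to exactly one partition; with randomly assigned buckets, reads for a seq_id must fan out over all possible bucket values.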
>> As a contrived example, say seq_type was an incrementing integer; then your subpartition could be seq_type / 100.
>>
>> On Fri Dec 05 2014 at 7:34:38 PM Kai Wang <dep...@gmail.com> wrote:
>>
>>> I have a data model question. I am trying to figure out how to model the data to achieve the best data locality for analytic purposes. Our application processes sequences. Each sequence has a unique key in the format [seq_id]_[seq_type]. For any given seq_id, there is an unlimited number of seq_types. The typical read is to load a subset of sequences with the same seq_id. Naturally, I would like all the sequences with the same seq_id to co-locate on the same node(s).
>>>
>>> However, I can't simply create one partition per seq_id and use seq_id as my partition key. That's because:
>>>
>>> 1. There could be thousands or even more seq_types for each seq_id. It's not feasible to include all the seq_types in one table.
>>> 2. Each seq_id might have a different set of seq_types.
>>> 3. Each application only needs to access a subset of seq_types for a seq_id. Based on CASSANDRA-5762, selecting part of a row loads the whole row. I prefer to touch only the data that's needed.
>>>
>>> As per the above, I think I should use one partition per [seq_id]_[seq_type]. But then how can I achieve data locality on seq_id? One possible approach is to override IPartitioner so that I use only part of the field (say 64 bytes) to compute the token (for location) while still using the whole field as the partition key (for lookup). But before heading in that direction, I would like to see if there are better options out there. Maybe some new or upcoming feature in C* 3.0?
>>>
>>> Thanks.
>>
> Thanks, Eric.
>
> Those sequences are not fixed. All sequences with the same seq_id tend to grow at the same rate. If it's one partition per seq_id, the size will most likely exceed the threshold quickly.
> Also, new seq_types can be added and old seq_types can be deleted. This means I would often need to ALTER TABLE to add and drop columns. I am not sure if this is good practice from an operational point of view.
>
> I thought about your subpartition idea. If there were only a few applications and each of them used a fixed subset of seq_types, I could easily create one table per application, since I can compute the subpartition deterministically as you said. But in my case, data scientists need to be able to easily write new applications using any combination of seq_types for a seq_id. So I want the data model to be flexible enough to support applications using any different set of seq_types without creating new tables, duplicating all the data, etc.
>
> -Kai
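[Editor's note: the read patterns discussed in this thread — seq_type as a clustering column, with deterministic or random buckets — can be sketched as the following queries. The table and values (sequences, bucket counts, 'seq42', 'typeA') are illustrative assumptions, not from the thread.]

```cql
-- Deterministic buckets (bucket = f(seq_type)): a known (seq_id, seq_type)
-- pair maps to exactly one partition, so a point read is possible.
SELECT payload FROM sequences
 WHERE seq_id = 'seq42' AND bucket = 3 AND seq_type = 'typeA';

-- Random buckets: to read all sequences for a seq_id, fan the read out
-- over every possible bucket value (assuming 4 buckets here).
SELECT payload FROM sequences
 WHERE seq_id = 'seq42' AND bucket IN (0, 1, 2, 3);
```

Because seq_type is a clustering column rather than a real column, an application can pick any ad-hoc subset of seq_types at query time without ALTER TABLE, new tables, or duplicated data.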