"Those sequences are not fixed. All sequences with the same seq_id tend to grow at the same rate. If it's one partition per seq_id, the size will most likely exceed the threshold quickly"
--> Then use bucketing to avoid too-wide partitions.

"Also new seq_types can be added and old seq_types can be deleted. This means I often need to ALTER TABLE to add and drop columns. I am not sure if this is good practice from an operational point of view."

--> I don't understand why altering the table is necessary to add seq_types. If seq_type is defined as your clustering column, you can have many of them using the same table structure ...

On Sat, Dec 6, 2014 at 10:09 PM, Kai Wang <dep...@gmail.com> wrote:

> On Sat, Dec 6, 2014 at 11:18 AM, Eric Stevens <migh...@gmail.com> wrote:
>
>> It depends on the size of your data, but if your data is reasonably small, there should be no trouble including thousands of records on the same partition key. So a data model using PRIMARY KEY ((seq_id), seq_type) ought to work fine.
>>
>> If the data size per partition exceeds some threshold that represents the right tradeoff of increased repair cost, GC pressure, the threat of unbalanced loads, and the other issues that come with wide partitions, then you can subpartition in some manner consistent with your workload, with something like PRIMARY KEY ((seq_id, subpartition), seq_type).
>>
>> For example, if seq_type can be processed for a given seq_id in any order, and you need to be able to locate specific records for a known seq_id/seq_type pair, you can compute subpartition deterministically. Or if you only ever need to read *all* values for a given seq_id, and the processing order is not important, just randomly generate a value for subpartition at write time, as long as you know all possible values for subpartition.
>>
>> If the values of seq_type for a given seq_id must always be processed in order based on seq_type, then your subpartition calculation would need to reflect that and place adjacent seq_types in the same partition.
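[Editor's note: a minimal CQL sketch of the bucketed layout described above. Table and column names (sequences, bucket, payload) are illustrative assumptions, not from the thread.]

```cql
-- Hypothetical sketch only: names and types are assumed, not from the thread.
-- One partition per (seq_id, bucket); seq_type is the clustering column, so
-- adding a new seq_type is just a new row -- no ALTER TABLE required.
CREATE TABLE sequences (
    seq_id   text,
    bucket   int,     -- the "subpartition": caps the size of any one partition
    seq_type text,    -- clustering column: many seq_types per (seq_id, bucket)
    payload  blob,
    PRIMARY KEY ((seq_id, bucket), seq_type)
);
```

With a deterministic bucket (for example, a hash of seq_type modulo a fixed bucket count), a known seq_id/seq_type pair maps to exactly one partition; with randomly assigned buckets, reads for a seq_id must fan out over all possible bucket values.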
>> As a contrived example, say seq_type was an incrementing integer; then your subpartition could be seq_type / 100.
>>
>> On Fri Dec 05 2014 at 7:34:38 PM Kai Wang <dep...@gmail.com> wrote:
>>
>>> I have a data model question. I am trying to figure out how to model the data to achieve the best data locality for analytic purposes. Our application processes sequences. Each sequence has a unique key in the format [seq_id]_[seq_type]. For any given seq_id, there is an unlimited number of seq_types. The typical read is to load a subset of sequences with the same seq_id. Naturally, I would like all the sequences with the same seq_id to co-locate on the same node(s).
>>>
>>> However, I can't simply create one partition per seq_id and use seq_id as my partition key. That's because:
>>>
>>> 1. There could be thousands or even more seq_types for each seq_id. It's not feasible to include all the seq_types in one table.
>>> 2. Each seq_id might have a different set of seq_types.
>>> 3. Each application only needs to access a subset of seq_types for a seq_id. Based on CASSANDRA-5762, selecting part of a row loads the whole row. I prefer to touch only the data that's needed.
>>>
>>> As per the above, I think I should use one partition per [seq_id]_[seq_type]. But then how can I achieve data locality on seq_id? One possible approach is to override IPartitioner so that I use only part of the field (say 64 bytes) to compute the token (for location) while still using the whole field as the partition key (for lookup). But before heading in that direction, I would like to see if there are better options out there. Maybe some new or upcoming feature in C* 3.0?
>>>
>>> Thanks.
>>
> Thanks, Eric.
>
> Those sequences are not fixed. All sequences with the same seq_id tend to grow at the same rate. If it's one partition per seq_id, the size will most likely exceed the threshold quickly.
> Also, new seq_types can be added and old seq_types can be deleted. This means I would often need to ALTER TABLE to add and drop columns. I am not sure if this is good practice from an operational point of view.
>
> I thought about your subpartition idea. If there were only a few applications and each of them used a fixed subset of seq_types, I could easily create one table per application, since I can compute the subpartition deterministically as you said. But in my case, data scientists need to be able to easily write new applications using any combination of seq_types for a seq_id. So I want the data model to be flexible enough to support applications using any different set of seq_types without creating new tables, duplicating all the data, etc.
>
> -Kai
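[Editor's note: the read patterns discussed in this thread — seq_type as a clustering column, with deterministic or random buckets — can be sketched as the following queries. The table and values (sequences, bucket counts, 'seq42', 'typeA') are illustrative assumptions, not from the thread.]

```cql
-- Deterministic buckets (bucket = f(seq_type)): a known (seq_id, seq_type)
-- pair maps to exactly one partition, so a point read is possible.
SELECT payload FROM sequences
 WHERE seq_id = 'seq42' AND bucket = 3 AND seq_type = 'typeA';

-- Random buckets: to read all sequences for a seq_id, fan the read out
-- over every possible bucket value (assuming 4 buckets here).
SELECT payload FROM sequences
 WHERE seq_id = 'seq42' AND bucket IN (0, 1, 2, 3);
```

Because seq_type is a clustering column rather than a real column, an application can pick any ad-hoc subset of seq_types at query time without ALTER TABLE, new tables, or duplicated data.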