I think he mentioned 100 MB as the max size - planning for 1 MB might make your data model difficult to work with.
On Sun Dec 07 2014 at 12:07:47 PM Kai Wang <dep...@gmail.com> wrote:

> Thanks for the help. I wasn't clear on how clustering columns work. Coming from a Thrift background, it took me a while to understand how a clustering column impacts partition storage on disk. Now I believe using seq_type as the first clustering column solves my problem. As for partition size, I will start with some bucketing assumption. If the partition size exceeds the threshold I may need to re-bucket using a smaller bucket size.
>
> On another thread Eric mentions the optimal partition size should be 100 KB ~ 1 MB. I will use that as the starting point to design my bucketing strategy.
>
> On Sun, Dec 7, 2014 at 10:32 AM, Jack Krupansky <j...@basetechnology.com> wrote:
>
>> It would be helpful to look at some specific examples of sequences, showing how they grow. I suspect that the term “sequence” is being overloaded in some subtly misleading way here.
>>
>> Besides, we’ve already answered the headline question – data locality is achieved by having a common partition key. So we need some clarity as to what question we are really focusing on.
>>
>> And, of course, we should be asking the “Cassandra Data Modeling 101” question of what you want your queries to look like – how exactly you want to access your data. Only after we have a handle on how you need to read your data can we decide how it should be stored.
>>
>> My immediate question to get things back on track: when you say “The typical read is to load a subset of sequences with the same seq_id”, what type of “subset” are you talking about? Again, a few explicit and concise example queries (in some concise, easy-to-read pseudo language or even plain English, but not belabored with full CQL syntax) would be very helpful. I mean, Cassandra has no “subset” concept, nor a “load subset” command, so what are we really talking about?
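[Editor's note: Kai's sizing plan above could be roughed out as a back-of-the-envelope calculation. This is a hypothetical helper, not from the thread; the row count and per-row size are made-up inputs, and the 1 MB target is the upper end of the range Eric is quoted as suggesting.]

```python
import math

TARGET_PARTITION_BYTES = 1_000_000  # ~1 MB, upper end of the suggested 100 KB ~ 1 MB range

def bucket_count(expected_rows: int, avg_row_bytes: int,
                 target: int = TARGET_PARTITION_BYTES) -> int:
    """Number of buckets needed to keep each partition under `target` bytes."""
    total = expected_rows * avg_row_bytes
    return max(1, math.ceil(total / target))

# e.g. 50,000 rows of ~200 bytes each -> 10 MB total -> 10 buckets
print(bucket_count(50_000, 200))  # 10
```

If growth later pushes a bucket past the threshold, the same arithmetic gives the smaller re-bucketed size Kai mentions.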
>> Also, I presume we are talking CQL, but some of the references seem more Thrift/slice oriented.
>>
>> -- Jack Krupansky
>>
>> *From:* Eric Stevens <migh...@gmail.com>
>> *Sent:* Sunday, December 7, 2014 10:12 AM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: How to model data to achieve specific data locality
>>
>> > Also new seq_types can be added and old seq_types can be deleted. This means I often need to ALTER TABLE to add and drop columns.
>>
>> Kai, unless I'm misunderstanding something, I don't see why you need to alter the table to add a new seq_type. From a data model perspective, these are just new values in a row.
>>
>> If you do have columns which are specific to particular seq_types, data modeling does become a little more challenging. In that case you may get some advantage from using collections (especially map) to store data which applies to only a few seq_types. Or you can define a schema which includes the set of all possible columns (that's when you get into ALTERs whenever a new column comes or goes).
>>
>> > All sequences with the same seq_id tend to grow at the same rate.
>>
>> Note that it is an anti-pattern in Cassandra to append to the same row indefinitely. I think you understand this because of your original question. But please note that a sub-partitioning strategy which reuses subpartitions will result in degraded read performance after a while. You'll need to rotate subpartitions by something that doesn't repeat in order to keep the data for a given partition key grouped into just a few sstables. A typical pattern there is to use some kind of time bucket (hour, day, week, etc., depending on your write volume).
>>
>> I do note that your original question was about preserving data locality - and having a consistent locality for a given seq_id - for best offline analytics.
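[Editor's note: the time-bucket rotation Eric describes could be sketched like this. This is an illustrative helper, assuming the bucket label is derived from the write timestamp; because the label advances with time, new writes land in fresh partitions instead of appending to one partition forever.]

```python
from datetime import datetime, timezone

def time_bucket(ts: datetime, granularity: str = "day") -> str:
    """Derive a non-repeating bucket label from a write timestamp."""
    if granularity == "hour":
        return ts.strftime("%Y-%m-%d-%H")
    if granularity == "day":
        return ts.strftime("%Y-%m-%d")
    if granularity == "week":
        # ISO year/week keeps buckets contiguous across year boundaries.
        iso = ts.isocalendar()
        return f"{iso[0]}-W{iso[1]:02d}"
    raise ValueError(f"unknown granularity: {granularity}")

# The full partition key would then be (seq_id, bucket), e.g.:
bucket = time_bucket(datetime(2014, 12, 7, 10, 12, tzinfo=timezone.utc), "day")
print(bucket)  # 2014-12-07
```

The right granularity depends on write volume, per Eric's note: pick one where each bucket stays within the partition-size range discussed above.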
>> If you wanted to work toward this, you could certainly also include a blob value in your partitioning key, whose value is calculated to force a ring collision with this record's sibling data. With Cassandra's default partitioner of Murmur3, that's probably pretty challenging - Murmur3 isn't designed to be cryptographically strong (it doesn't aim to make collisions difficult to force), but it is meant to have good distribution (so it may still be computationally expensive to force a collision - I'm not that familiar with its internal workings). In this case, ByteOrderedPartitioner would be a lot easier to force a ring collision on, but then you need to work on a good ring-balancing strategy to distribute your data evenly over the ring.
>>
>> On Sun Dec 07 2014 at 2:56:26 AM DuyHai Doan <doanduy...@gmail.com> wrote:
>>
>>> "Those sequences are not fixed. All sequences with the same seq_id tend to grow at the same rate. If it's one partition per seq_id, the size will most likely exceed the threshold quickly"
>>>
>>> --> Then use bucketing to avoid too-wide partitions.
>>>
>>> "Also new seq_types can be added and old seq_types can be deleted. This means I often need to ALTER TABLE to add and drop columns. I am not sure if this is a good practice from operation point of view."
>>>
>>> --> I don't understand why altering the table is necessary to add seq_types. If "seq_type" is defined as your clustering column, you can have many of them using the same table structure.
>>>
>>> On Sat, Dec 6, 2014 at 10:09 PM, Kai Wang <dep...@gmail.com> wrote:
>>>
>>>> On Sat, Dec 6, 2014 at 11:18 AM, Eric Stevens <migh...@gmail.com> wrote:
>>>>
>>>>> It depends on the size of your data, but if your data is reasonably small, there should be no trouble including thousands of records on the same partition key. So a data model using PRIMARY KEY ((seq_id), seq_type) ought to work fine.
>>>>> If the data size per partition exceeds some threshold that represents the right tradeoff of increasing repair cost, GC pressure, threatening unbalanced loads, and other issues that come with wide partitions, then you can subpartition via some means in a manner consistent with your workload, with something like PRIMARY KEY ((seq_id, subpartition), seq_type).
>>>>>
>>>>> For example, if seq_type can be processed for a given seq_id in any order, and you need to be able to locate specific records for a known seq_id/seq_type pair, you can compute the subpartition deterministically. Or if you only ever need to read *all* values for a given seq_id, and the processing order is not important, just randomly generate a value for subpartition at write time, as long as you know all possible values for subpartition.
>>>>>
>>>>> If the values for the seq_types for a given seq_id must always be processed in order based on seq_type, then your subpartition calculation would need to reflect that and place adjacent seq_types in the same partition. As a contrived example, if seq_type were an incrementing integer, your subpartition could be seq_type / 100.
>>>>>
>>>>> On Fri Dec 05 2014 at 7:34:38 PM Kai Wang <dep...@gmail.com> wrote:
>>>>>
>>>>>> I have a data model question. I am trying to figure out how to model the data to achieve the best data locality for analytic purposes. Our application processes sequences. Each sequence has a unique key in the format of [seq_id]_[seq_type]. For any given seq_id, there is an unlimited number of seq_types. The typical read is to load a subset of sequences with the same seq_id. Naturally I would like all the sequences with the same seq_id to co-locate on the same node(s).
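[Editor's note: Eric's two subpartitioning schemes could be sketched like this. These are hypothetical helpers; the bucket width of 100 matches his contrived example, while the number of random buckets is an illustrative assumption.]

```python
import random

def ordered_subpartition(seq_type: int, width: int = 100) -> int:
    """Deterministic scheme: adjacent seq_types share a partition, preserving
    processing order, and a known seq_id/seq_type pair is always locatable."""
    return seq_type // width

# Random scheme: only usable when every read fetches *all* values for a
# seq_id, by querying each of the NUM_BUCKETS possible subpartitions.
NUM_BUCKETS = 16

def random_subpartition() -> int:
    return random.randrange(NUM_BUCKETS)

print(ordered_subpartition(142))  # 1 -> seq_types 100..199 share partition 1
print(ordered_subpartition(99))   # 0
```

Either value would then join seq_id in the composite partition key, i.e. PRIMARY KEY ((seq_id, subpartition), seq_type) as above.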
>>>>>> However I can't simply create one partition per seq_id and use seq_id as my partition key. That's because:
>>>>>>
>>>>>> 1. There could be thousands or even more seq_types for each seq_id. It's not feasible to include all the seq_types in one table.
>>>>>>
>>>>>> 2. Each seq_id might have a different set of seq_types.
>>>>>>
>>>>>> 3. Each application only needs to access a subset of seq_types for a seq_id. Based on CASSANDRA-5762, selecting part of a row loads the whole row. I prefer only touching the data that's needed.
>>>>>>
>>>>>> As per the above, I think I should use one partition per [seq_id]_[seq_type]. But how can I achieve data locality on seq_id? One possible approach is to override IPartitioner so that I use only part of the field (say 64 bytes) to compute the token (for location) while still using the whole field as the partition key (for lookup). But before heading in that direction, I would like to see if there are better options out there. Maybe some new or upcoming feature in C* 3.0?
>>>>>>
>>>>>> Thanks.
>>>>>
>>>> Thanks, Eric.
>>>>
>>>> Those sequences are not fixed. All sequences with the same seq_id tend to grow at the same rate. If it's one partition per seq_id, the size will most likely exceed the threshold quickly. Also new seq_types can be added and old seq_types can be deleted. This means I often need to ALTER TABLE to add and drop columns. I am not sure if this is a good practice from an operations point of view.
>>>>
>>>> I thought about your subpartition idea. If there are only a few applications and each one of them uses a subset of seq_types, I can easily create one table per application since I can compute the subpartition deterministically as you said.
>>>> But in my case data scientists need to easily write new applications using any combination of seq_types of a seq_id. So I want the data model to be flexible enough to support applications using any different set of seq_types without creating new tables, duplicating all the data, etc.
>>>>
>>>> -Kai