I think he mentioned 100 MB as the max size - planning for 1 MB might make your data model difficult to work with.
On Sun Dec 07 2014 at 12:07:47 PM Kai Wang <dep...@gmail.com> wrote:

> Thanks for the help. I wasn't clear on how clustering columns work. Coming from a Thrift background, it took me a while to understand how a clustering column impacts partition storage on disk. Now I believe using seq_type as the first clustering column solves my problem. As for partition size, I will start with some bucketing assumption. If the partition size exceeds the threshold I may need to re-bucket using a smaller bucket size.
>
> On another thread Eric mentions the optimal partition size should be 100 KB ~ 1 MB. I will use that as the starting point to design my bucketing strategy.
>
> On Sun, Dec 7, 2014 at 10:32 AM, Jack Krupansky <j...@basetechnology.com> wrote:
>
>> It would be helpful to look at some specific examples of sequences, showing how they grow. I suspect that the term “sequence” is being overloaded in some subtly misleading way here.
>>
>> Besides, we’ve already answered the headline question – data locality is achieved by having a common partition key. So we need some clarity as to what question we are really focusing on.
>>
>> And, of course, we should be asking the “Cassandra Data Modeling 101” question of what you want your queries to look like – how exactly you want to access your data. Only after we have a handle on how you need to read your data can we decide how it should be stored.
>>
>> My immediate question to get things back on track: when you say “The typical read is to load a subset of sequences with the same seq_id”, what type of “subset” are you talking about? Again, a few explicit and concise example queries (in some concise, easy-to-read pseudo language or even plain English, but not belabored with full CQL syntax) would be very helpful. I mean, Cassandra has no “subset” concept, nor a “load subset” command, so what are we really talking about?
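[Editor's note: Kai's sizing plan above could be roughed out as a back-of-the-envelope calculation. This is a hypothetical helper, not from the thread; the row count and per-row size are made-up inputs, and the 1 MB target is the upper end of the range Eric is quoted as suggesting.]

```python
import math

TARGET_PARTITION_BYTES = 1_000_000  # ~1 MB, upper end of the suggested 100 KB ~ 1 MB range

def bucket_count(expected_rows: int, avg_row_bytes: int,
                 target: int = TARGET_PARTITION_BYTES) -> int:
    """Number of buckets needed to keep each partition under `target` bytes."""
    total = expected_rows * avg_row_bytes
    return max(1, math.ceil(total / target))

# e.g. 50,000 rows of ~200 bytes each -> 10 MB total -> 10 buckets
print(bucket_count(50_000, 200))  # 10
```

If growth later pushes a bucket past the threshold, the same arithmetic gives the smaller re-bucketed size Kai mentions.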
>> Also, I presume we are talking CQL, but some of the references seem more Thrift/slice oriented.
>>
>> -- Jack Krupansky
>>
>> *From:* Eric Stevens <migh...@gmail.com>
>> *Sent:* Sunday, December 7, 2014 10:12 AM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: How to model data to achieve specific data locality
>>
>> > Also new seq_types can be added and old seq_types can be deleted. This means I often need to ALTER TABLE to add and drop columns.
>>
>> Kai, unless I'm misunderstanding something, I don't see why you need to alter the table to add a new seq_type. From a data model perspective, these are just new values in a row.
>>
>> If you do have columns which are specific to particular seq_types, data modeling does become a little more challenging. In that case you may get some advantage from using collections (especially map) to store data which applies to only a few seq_types. Or you can define a schema which includes the set of all possible columns (that's when you get into ALTERs whenever a new column comes or goes).
>>
>> > All sequences with the same seq_id tend to grow at the same rate.
>>
>> Note that it is an anti-pattern in Cassandra to append to the same row indefinitely. I think you understand this because of your original question. But please note that a sub-partitioning strategy which reuses subpartitions will result in degraded read performance after a while. You'll need to rotate subpartitions by something that doesn't repeat in order to keep the data for a given partition key grouped into just a few sstables. A typical pattern there is to use some kind of time bucket (hour, day, week, etc., depending on your write volume).
>>
>> I do note that your original question was about preserving data locality - and having a consistent locality for a given seq_id - for best offline analytics.
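[Editor's note: the time-bucket rotation Eric describes could be sketched like this. This is an illustrative helper, assuming the bucket label is derived from the write timestamp; because the label advances with time, new writes land in fresh partitions instead of appending to one partition forever.]

```python
from datetime import datetime, timezone

def time_bucket(ts: datetime, granularity: str = "day") -> str:
    """Derive a non-repeating bucket label from a write timestamp."""
    if granularity == "hour":
        return ts.strftime("%Y-%m-%d-%H")
    if granularity == "day":
        return ts.strftime("%Y-%m-%d")
    if granularity == "week":
        # ISO year/week keeps buckets contiguous across year boundaries.
        iso = ts.isocalendar()
        return f"{iso[0]}-W{iso[1]:02d}"
    raise ValueError(f"unknown granularity: {granularity}")

# The full partition key would then be (seq_id, bucket), e.g.:
bucket = time_bucket(datetime(2014, 12, 7, 10, 12, tzinfo=timezone.utc), "day")
print(bucket)  # 2014-12-07
```

The right granularity depends on write volume, per Eric's note: pick one where each bucket stays within the partition-size range discussed above.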
>> If you wanted to work toward this, you could certainly also include a blob value in your partitioning key, whose value is calculated to force a ring collision with this record's sibling data. With Cassandra's default partitioner of Murmur3, that's probably pretty challenging - Murmur3 isn't designed to be cryptographically strong (it doesn't aim to make collisions difficult to force), but it is meant to have good distribution (so it may still be computationally expensive to force a collision - I'm not that familiar with its internal workings). In this case, ByteOrderedPartitioner would be a lot easier to force a ring collision on, but then you need to work on a good ring-balancing strategy to distribute your data evenly over the ring.
>>
>> On Sun Dec 07 2014 at 2:56:26 AM DuyHai Doan <doanduy...@gmail.com> wrote:
>>
>>> "Those sequences are not fixed. All sequences with the same seq_id tend to grow at the same rate. If it's one partition per seq_id, the size will most likely exceed the threshold quickly"
>>>
>>> --> Then use bucketing to avoid too-wide partitions.
>>>
>>> "Also new seq_types can be added and old seq_types can be deleted. This means I often need to ALTER TABLE to add and drop columns. I am not sure if this is a good practice from operation point of view."
>>>
>>> --> I don't understand why altering the table is necessary to add seq_types. If "seq_type" is defined as your clustering column, you can have many of them using the same table structure.
>>>
>>> On Sat, Dec 6, 2014 at 10:09 PM, Kai Wang <dep...@gmail.com> wrote:
>>>
>>>> On Sat, Dec 6, 2014 at 11:18 AM, Eric Stevens <migh...@gmail.com> wrote:
>>>>
>>>>> It depends on the size of your data, but if your data is reasonably small, there should be no trouble including thousands of records on the same partition key. So a data model using PRIMARY KEY ((seq_id), seq_type) ought to work fine.
>>>>> If the data size per partition exceeds some threshold that represents the right tradeoff of increasing repair cost, GC pressure, threatening unbalanced loads, and other issues that come with wide partitions, then you can subpartition via some means in a manner consistent with your workload, with something like PRIMARY KEY ((seq_id, subpartition), seq_type).
>>>>>
>>>>> For example, if seq_type can be processed for a given seq_id in any order, and you need to be able to locate specific records for a known seq_id/seq_type pair, you can compute the subpartition deterministically. Or if you only ever need to read *all* values for a given seq_id, and the processing order is not important, just randomly generate a value for subpartition at write time, as long as you know all possible values for subpartition.
>>>>>
>>>>> If the values for the seq_types for a given seq_id must always be processed in order based on seq_type, then your subpartition calculation would need to reflect that and place adjacent seq_types in the same partition. As a contrived example, if seq_type were an incrementing integer, your subpartition could be seq_type / 100.
>>>>>
>>>>> On Fri Dec 05 2014 at 7:34:38 PM Kai Wang <dep...@gmail.com> wrote:
>>>>>
>>>>>> I have a data model question. I am trying to figure out how to model the data to achieve the best data locality for analytic purposes. Our application processes sequences. Each sequence has a unique key in the format of [seq_id]_[seq_type]. For any given seq_id, there is an unlimited number of seq_types. The typical read is to load a subset of sequences with the same seq_id. Naturally I would like all the sequences with the same seq_id to co-locate on the same node(s).
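[Editor's note: Eric's two subpartitioning schemes could be sketched like this. These are hypothetical helpers; the bucket width of 100 matches his contrived example, while the number of random buckets is an illustrative assumption.]

```python
import random

def ordered_subpartition(seq_type: int, width: int = 100) -> int:
    """Deterministic scheme: adjacent seq_types share a partition, preserving
    processing order, and a known seq_id/seq_type pair is always locatable."""
    return seq_type // width

# Random scheme: only usable when every read fetches *all* values for a
# seq_id, by querying each of the NUM_BUCKETS possible subpartitions.
NUM_BUCKETS = 16

def random_subpartition() -> int:
    return random.randrange(NUM_BUCKETS)

print(ordered_subpartition(142))  # 1 -> seq_types 100..199 share partition 1
print(ordered_subpartition(99))   # 0
```

Either value would then join seq_id in the composite partition key, i.e. PRIMARY KEY ((seq_id, subpartition), seq_type) as above.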
>>>>>> However I can't simply create one partition per seq_id and use seq_id as my partition key. That's because:
>>>>>>
>>>>>> 1. There could be thousands or even more seq_types for each seq_id. It's not feasible to include all the seq_types in one table.
>>>>>>
>>>>>> 2. Each seq_id might have a different set of seq_types.
>>>>>>
>>>>>> 3. Each application only needs to access a subset of seq_types for a seq_id. Based on CASSANDRA-5762, selecting part of a row loads the whole row. I prefer only touching the data that's needed.
>>>>>>
>>>>>> As per the above, I think I should use one partition per [seq_id]_[seq_type]. But how can I achieve data locality on seq_id? One possible approach is to override IPartitioner so that I use only part of the field (say 64 bytes) to compute the token (for location) while still using the whole field as the partition key (for lookup). But before heading in that direction, I would like to see if there are better options out there. Maybe some new or upcoming feature in C* 3.0?
>>>>>>
>>>>>> Thanks.
>>>>>
>>>> Thanks, Eric.
>>>>
>>>> Those sequences are not fixed. All sequences with the same seq_id tend to grow at the same rate. If it's one partition per seq_id, the size will most likely exceed the threshold quickly. Also new seq_types can be added and old seq_types can be deleted. This means I often need to ALTER TABLE to add and drop columns. I am not sure if this is a good practice from an operations point of view.
>>>>
>>>> I thought about your subpartition idea. If there are only a few applications and each one of them uses a subset of seq_types, I can easily create one table per application since I can compute the subpartition deterministically as you said.
>>>> But in my case data scientists need to easily write new applications using any combination of seq_types of a seq_id. So I want the data model to be flexible enough to support applications using any different set of seq_types without creating new tables, duplicating all the data, etc.
>>>>
>>>> -Kai