Re: How to model data to achieve specific data locality

2014-12-09 Thread Kai Wang
> concise example queries (in some concise, easy to read pseudo language or
>>>> even plain English, but not belabored with full CQL syntax.) would be very
>>>> helpful. I mean, Cassandra has no “subset” concept, nor a “load subset”
>>>> command, so what are we really talking about?
>>>>
>>>> Also, I presume we are talking CQL, but some of the references seem
>>>> more Thrift/slice oriented.
>>>>
>>>> -- Jack Krupansky
>>>>
>>>>  *From:* Eric Stevens 
>>>> *Sent:* Sunday, December 7, 2014 10:12 AM
>>>> *To:* user@cassandra.apache.org
>>>> *Subject:* Re: How to model data to achieve specific data locality
>>>>
>>>> > Also new seq_types can be added and old seq_types can be deleted.
>>>> This means I often need to ALTER TABLE to add and drop columns.
>>>>
>>>> Kai, unless I'm misunderstanding something, I don't see why you need to
>>>> alter the table to add a new seq type.  From a data model perspective,
>>>> these are just new values in a row.
>>>>
>>>> If you do have columns which are specific to particular seq_types, data
>>>> modeling does become a little more challenging.  In that case you may get
>>>> some advantage from using collections (especially map) to store data which
>>>> applies to only a few seq types.  Or defining a schema which includes the
>>>> set of all possible columns (that's when you're getting into ALTERs when a
>>>> new column comes or goes).
>>>>
>>>> > All sequences with the same seq_id tend to grow at the same rate.
>>>>
>>>> Note that it is an anti pattern in Cassandra to append to the same row
>>>> indefinitely.  I think you understand this because of your original
>>>> question.  But please note that a sub partitioning strategy which reuses
>>>> subpartitions will result in degraded read performance after a while.
>>>> You'll need to rotate sub partitions by something that doesn't repeat in
>>>> order to keep the data for a given partition key grouped into just a few
>>>> sstables.  A typical pattern there is to use some kind of time bucket
>>>> (hour, day, week, etc., depending on your write volume).
>>>>
>>>> I do note that your original question was about preserving data
>>>> locality - and having a consistent locality for a given seq_id - for best
>>>> offline analytics.  If you wanted to work for this, you can certainly also
>>>> include a blob value in your partitioning key, whose value is calculated to
>>>> force a ring collision with this record's sibling data.  With Cassandra's
>>>> default partitioner of murmur3, that's probably pretty challenging -
>>>> murmur3 isn't designed to be cryptographically strong (it doesn't work to
>>>> make it difficult to force a collision), but it's meant to have good
>>>> distribution (it may still be computationally expensive to force a
>>>> collision - I'm not that familiar with its internal workings).  In this
>>>> case, ByteOrderedPartitioner would be a lot easier to force a ring
>>>> collision on, but then you need to work on a good ring balancing strategy
>>>> to distribute your data evenly over the ring.
>>>>
>>>> On Sun Dec 07 2014 at 2:56:26 AM DuyHai Doan 
>>>> wrote:
>>>>
>>>>> "Those sequences are not fixed. All sequences with the same seq_id
>>>>> tend to grow at the same rate. If it's one partition per seq_id, the size
>>>>> will most likely exceed the threshold quickly"
>>>>>
>>>>>  --> Then use bucketing to avoid too wide partitions
>>>>>
>>>>> "Also new seq_types can be added and old seq_types can be deleted.
>>>>> This means I often need to ALTER TABLE to add and drop columns. I am not
>>>>> sure if this is a good practice from operation point of view."
>>>>>
>>>>>  --> I don't understand why altering table is necessary to add
>>>>> seq_types. If "seq_types" is defined as your clustering column, you can
>>>>> have many of them using the same table structure ...
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Sat, Dec 6, 2014 at 10:09 PM, Kai Wang  wrote:
>>>>>

Re: How to model data to achieve specific data locality

2014-12-08 Thread Eric Stevens
The upper bound for the data size of a single column is 2GB, and the upper
bound for the number of columns in a row (partition) is 2 billion.  So if
you wanted to create the largest possible row, you probably can't afford
enough disks to hold it.
http://wiki.apache.org/cassandra/CassandraLimitations

Practically speaking you start running into trouble *way* before you reach
those thresholds though.  Large columns and large numbers of columns create
GC pressure in your cluster, and since all data for a given row resides on
the same primary and replicas, this tends to lead to hot spotting.  Repair
happens for entire rows, so large rows increase the cost of repairs,
including GC pressure during the repair.  And rows of this size are often
arrived at by appending to the same row repeatedly, which will cause the
data for that row to be scattered across a large number of SSTables, which
will hurt read performance.  Also, depending on your interface, you'll find
you start hitting limits that you have to increase, each with their own
implications (e.g., maximum Thrift message sizes and so forth).  The right
maximum practical size for a row definitely depends on your read and write
patterns, as well as your hardware and network.  More memory, SSD's, larger
SSTables, and faster networks will all raise the ceiling for where large
rows start to become painful.

@Kai, if you're familiar with the Thrift paradigm, the partition key
equates to a Thrift row key, and the clustering key equates to the first
part of a composite column name.  CQL PRIMARY KEY ((a,b), c, d) equates to
Thrift where row key is ['a:b'] and all columns begin with ['c:d:'].
Recommended reading: http://www.datastax.com/dev/blog/thrift-to-cql3
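
To make that mapping concrete, here is a hypothetical table (the names a, b,
c, d, value, and "ks" are placeholders, not anything from this thread) and,
as comments, roughly how it lays out in the Thrift/storage view:

    CREATE TABLE ks.example (
        a     text,
        b     int,
        c     text,
        d     int,
        value text,
        PRIMARY KEY ((a, b), c, d)
    );

    -- Thrift/storage view, roughly:
    --   row key      : a:b             (the composite partition key)
    --   column names : c:d:value, ...  (clustering values prefix every cell name)
    -- so all CQL rows sharing (a, b) are slices of one wide storage row.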

Whatever your partition key, if you need to sub-partition to maintain
reasonable row sizes, then the only way to preserve data locality for
related records is probably to switch to byte ordered partitioner, and
compute a blob or long column as part of your partition key that is meant to
cause the PK to map to the same token.  Just be aware that byte ordered
partitioner comes with a number of caveats, and you'll become responsible
for maintaining good data load distributions in your cluster. But the
benefits from being able to tune locality may be worth it.
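
A rough sketch of what that could look like, assuming ByteOrderedPartitioner
and an application-computed locality prefix (the column names, the bucket
idea, and the hashing scheme are all assumptions for illustration, not a
recommendation):

    CREATE TABLE ks.sequences_bop (
        locality blob,      -- e.g. a few bytes derived from seq_id by the application
        seq_id   text,
        bucket   int,       -- sub-partition so no single partition grows unbounded
        seq_type text,
        data     blob,
        PRIMARY KEY ((locality, seq_id, bucket), seq_type)
    );

    -- Under ByteOrderedPartitioner the token is just the partition key bytes,
    -- so every bucket of a given seq_id shares the same leading bytes and lands
    -- in one contiguous token range (same replicas) - at the cost of having to
    -- manage ring balance yourself.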


On Sun Dec 07 2014 at 3:12:11 PM Jonathan Haddad  wrote:

> I think he mentioned 100MB as the max size - planning for 1mb might make
> your data model difficult to work.
>
> On Sun Dec 07 2014 at 12:07:47 PM Kai Wang  wrote:
>
>> Thanks for the help. I wasn't clear how clustering column works. Coming
>> from Thrift experience, it took me a while to understand how clustering
>> column impacts partition storage on disk. Now I believe using seq_type as
>> the first clustering column solves my problem. As of partition size, I will
>> start with some bucket assumption. If the partition size exceeds the
>> threshold I may need to re-bucket using smaller bucket size.
>>
>> On another thread Eric mentions the optimal partition size should be at
>> 100 kb ~ 1 MB. I will use that as the start point to design my bucket
>> strategy.
>>
>>
>> On Sun, Dec 7, 2014 at 10:32 AM, Jack Krupansky 
>> wrote:
>>
>>>   It would be helpful to look at some specific examples of sequences,
>>> showing how they grow. I suspect that the term “sequence” is being
>>> overloaded in some subtly misleading way here.
>>>
>>> Besides, we’ve already answered the headline question – data locality is
>>> achieved by having a common partition key. So, we need some clarity as to
>>> what question we are really focusing on
>>>
>>> And, of course, we should be asking the “Cassandra Data Modeling 101”
>>> question of what do your queries want to look like, how exactly do you want
>>> to access your data. Only after we have a handle on how you need to read
>>> your data can we decide how it should be stored.
>>>
>>> My immediate question to get things back on track: When you say “The
>>> typical read is to load a subset of sequences with the same seq_id”,
>>> what type of “subset” are you talking about? Again, a few explicit and
>>> concise example queries (in some concise, easy to read pseudo language or
>>> even plain English, but not belabored with full CQL syntax.) would be very
>>> helpful. I mean, Cassandra has no “subset” concept, nor a “load subset”
>>> command, so what are we really talking about?
>>>
>>> Also, I presume we are talking CQL, but some of the references seem more
>>> Thrift/slice oriented.
>>>
>>> -- Jack Krupansky
>>>
>>>  *From:* Eric Stevens 
>>> *Sent:* Sunday, December 7, 2014 10:12 AM

Re: How to model data to achieve specific data locality

2014-12-07 Thread Jonathan Haddad
I think he mentioned 100 MB as the max size - planning for 1 MB might make
your data model difficult to work with.

On Sun Dec 07 2014 at 12:07:47 PM Kai Wang  wrote:

> Thanks for the help. I wasn't clear how clustering column works. Coming
> from Thrift experience, it took me a while to understand how clustering
> column impacts partition storage on disk. Now I believe using seq_type as
> the first clustering column solves my problem. As of partition size, I will
> start with some bucket assumption. If the partition size exceeds the
> threshold I may need to re-bucket using smaller bucket size.
>
> On another thread Eric mentions the optimal partition size should be at
> 100 kb ~ 1 MB. I will use that as the start point to design my bucket
> strategy.
>
>
> On Sun, Dec 7, 2014 at 10:32 AM, Jack Krupansky 
> wrote:
>
>>   It would be helpful to look at some specific examples of sequences,
>> showing how they grow. I suspect that the term “sequence” is being
>> overloaded in some subtly misleading way here.
>>
>> Besides, we’ve already answered the headline question – data locality is
>> achieved by having a common partition key. So, we need some clarity as to
>> what question we are really focusing on
>>
>> And, of course, we should be asking the “Cassandra Data Modeling 101”
>> question of what do your queries want to look like, how exactly do you want
>> to access your data. Only after we have a handle on how you need to read
>> your data can we decide how it should be stored.
>>
>> My immediate question to get things back on track: When you say “The
>> typical read is to load a subset of sequences with the same seq_id”,
>> what type of “subset” are you talking about? Again, a few explicit and
>> concise example queries (in some concise, easy to read pseudo language or
>> even plain English, but not belabored with full CQL syntax.) would be very
>> helpful. I mean, Cassandra has no “subset” concept, nor a “load subset”
>> command, so what are we really talking about?
>>
>> Also, I presume we are talking CQL, but some of the references seem more
>> Thrift/slice oriented.
>>
>> -- Jack Krupansky
>>
>>  *From:* Eric Stevens 
>> *Sent:* Sunday, December 7, 2014 10:12 AM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: How to model data to achieve specific data locality
>>
>> > Also new seq_types can be added and old seq_types can be deleted. This
>> means I often need to ALTER TABLE to add and drop columns.
>>
>> Kai, unless I'm misunderstanding something, I don't see why you need to
>> alter the table to add a new seq type.  From a data model perspective,
>> these are just new values in a row.
>>
>> If you do have columns which are specific to particular seq_types, data
>> modeling does become a little more challenging.  In that case you may get
>> some advantage from using collections (especially map) to store data which
>> applies to only a few seq types.  Or defining a schema which includes the
>> set of all possible columns (that's when you're getting into ALTERs when a
>> new column comes or goes).
>>
>> > All sequences with the same seq_id tend to grow at the same rate.
>>
>> Note that it is an anti pattern in Cassandra to append to the same row
>> indefinitely.  I think you understand this because of your original
>> question.  But please note that a sub partitioning strategy which reuses
>> subpartitions will result in degraded read performance after a while.
>> You'll need to rotate sub partitions by something that doesn't repeat in
>> order to keep the data for a given partition key grouped into just a few
>> sstables.  A typical pattern there is to use some kind of time bucket
>> (hour, day, week, etc., depending on your write volume).
>>
>> I do note that your original question was about preserving data locality
>> - and having a consistent locality for a given seq_id - for best offline
>> analytics.  If you wanted to work for this, you can certainly also include
>> a blob value in your partitioning key, whose value is calculated to force a
>> ring collision with this record's sibling data.  With Cassandra's default
>> partitioner of murmur3, that's probably pretty challenging - murmur3 isn't
>> designed to be cryptographically strong (it doesn't work to make it
>> difficult to force a collision), but it's meant to have good distribution
>> (it may still be computationally expensive to force a collision - I'm not
>> that familiar with its internal workings).  In this case,
>> ByteOrderedPartitioner would be a lot easier to force a ring collision on,
>> but then you need to work on a good ring balancing strategy to distribute
>> your data evenly over the ring.

Re: How to model data to achieve specific data locality

2014-12-07 Thread Jack Krupansky
As a general rule, partitions can certainly be much larger than 1 MB, even up 
to 100 MB. 5 MB to 10 MB might be a good target size.

Originally you stated that the number of seq_types could be “unlimited”... is 
that really true? Is there no practical upper limit you can establish, like 
10,000 or 10 million or...? Sure, buckets are a very real option, but if the 
number of seq_types was only 10,000 to 50,000, then bucketing might be 
unnecessary complexity and access overhead.
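
As a rough back-of-envelope check (assuming, purely for illustration, ~1 KB per
seq_type row - that figure is not from this thread): 10,000 seq_types per seq_id
would be on the order of 10 MB, right around that target, while 50,000 would be
roughly 50 MB and would start to argue for bucketing after all.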

-- Jack Krupansky

From: Kai Wang 
Sent: Sunday, December 7, 2014 3:06 PM
To: user@cassandra.apache.org 
Subject: Re: How to model data to achieve specific data locality

Thanks for the help. I wasn't clear how clustering column works. Coming from 
Thrift experience, it took me a while to understand how clustering column 
impacts partition storage on disk. Now I believe using seq_type as the first 
clustering column solves my problem. As of partition size, I will start with 
some bucket assumption. If the partition size exceeds the threshold I may need 
to re-bucket using smaller bucket size.


On another thread Eric mentions the optimal partition size should be at 100 kb 
~ 1 MB. I will use that as the start point to design my bucket strategy.



On Sun, Dec 7, 2014 at 10:32 AM, Jack Krupansky  wrote:

  It would be helpful to look at some specific examples of sequences, showing 
how they grow. I suspect that the term “sequence” is being overloaded in some 
subtly misleading way here.

  Besides, we’ve already answered the headline question – data locality is 
achieved by having a common partition key. So, we need some clarity as to what 
question we are really focusing on

  And, of course, we should be asking the “Cassandra Data Modeling 101” 
question of what do your queries want to look like, how exactly do you want to 
access your data. Only after we have a handle on how you need to read your data 
can we decide how it should be stored.

  My immediate question to get things back on track: When you say “The typical 
read is to load a subset of sequences with the same seq_id”, what type of 
“subset” are you talking about? Again, a few explicit and concise example 
queries (in some concise, easy to read pseudo language or even plain English, 
but not belabored with full CQL syntax.) would be very helpful. I mean, 
Cassandra has no “subset” concept, nor a “load subset” command, so what are we 
really talking about?

  Also, I presume we are talking CQL, but some of the references seem more 
Thrift/slice oriented.

  -- Jack Krupansky

  From: Eric Stevens 
  Sent: Sunday, December 7, 2014 10:12 AM
  To: user@cassandra.apache.org 
  Subject: Re: How to model data to achieve specific data locality

  > Also new seq_types can be added and old seq_types can be deleted. This 
means I often need to ALTER TABLE to add and drop columns. 

  Kai, unless I'm misunderstanding something, I don't see why you need to alter 
the table to add a new seq type.  From a data model perspective, these are just 
new values in a row.  

  If you do have columns which are specific to particular seq_types, data 
modeling does become a little more challenging.  In that case you may get some 
advantage from using collections (especially map) to store data which applies 
to only a few seq types.  Or defining a schema which includes the set of all 
possible columns (that's when you're getting into ALTERs when a new column 
comes or goes).

  > All sequences with the same seq_id tend to grow at the same rate.


  Note that it is an anti pattern in Cassandra to append to the same row 
indefinitely.  I think you understand this because of your original question.  
But please note that a sub partitioning strategy which reuses subpartitions 
will result in degraded read performance after a while.  You'll need to rotate 
sub partitions by something that doesn't repeat in order to keep the data for a 
given partition key grouped into just a few sstables.  A typical pattern there 
is to use some kind of time bucket (hour, day, week, etc., depending on your 
write volume).


  I do note that your original question was about preserving data locality - 
and having a consistent locality for a given seq_id - for best offline 
analytics.  If you wanted to work for this, you can certainly also include a 
blob value in your partitioning key, whose value is calculated to force a ring 
collision with this record's sibling data.  With Cassandra's default 
partitioner of murmur3, that's probably pretty challenging - murmur3 isn't 
designed to be cryptographically strong (it doesn't work to make it difficult 
to force a collision), but it's meant to have good distribution (it may still 
be computationally expensive to force a collision - I'm not that familiar with 
its internal workings).  In this case, ByteOrderedPartitioner would be a lot 
easier to force a ring collision on, but then you need to wor

Re: How to model data to achieve specific data locality

2014-12-07 Thread Kai Wang
Thanks for the help. I wasn't clear on how clustering columns work. Coming
from a Thrift background, it took me a while to understand how clustering
columns affect partition storage on disk. Now I believe using seq_type as
the first clustering column solves my problem. As for partition size, I will
start with some bucketing assumption. If the partition size exceeds the
threshold, I may need to re-bucket using a smaller bucket size.

On another thread Eric mentions the optimal partition size should be around
100 KB ~ 1 MB. I will use that as the starting point to design my bucket
strategy.
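
A minimal sketch of that model (the column names and the bucketing rule are
placeholders, and the bucket value is computed by the application):

    CREATE TABLE ks.sequences_by_bucket (
        seq_id   text,
        bucket   int,        -- application-chosen sub-partition
        seq_type text,       -- first clustering column
        data     blob,
        PRIMARY KEY ((seq_id, bucket), seq_type)
    );

    -- Read only the seq_types a given application needs from one bucket:
    SELECT seq_type, data
      FROM ks.sequences_by_bucket
     WHERE seq_id = 'S1' AND bucket = 0
       AND seq_type IN ('typeA', 'typeB');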


On Sun, Dec 7, 2014 at 10:32 AM, Jack Krupansky 
wrote:

>   It would be helpful to look at some specific examples of sequences,
> showing how they grow. I suspect that the term “sequence” is being
> overloaded in some subtly misleading way here.
>
> Besides, we’ve already answered the headline question – data locality is
> achieved by having a common partition key. So, we need some clarity as to
> what question we are really focusing on
>
> And, of course, we should be asking the “Cassandra Data Modeling 101”
> question of what do your queries want to look like, how exactly do you want
> to access your data. Only after we have a handle on how you need to read
> your data can we decide how it should be stored.
>
> My immediate question to get things back on track: When you say “The
> typical read is to load a subset of sequences with the same seq_id”, what
> type of “subset” are you talking about? Again, a few explicit and concise
> example queries (in some concise, easy to read pseudo language or even
> plain English, but not belabored with full CQL syntax.) would be very
> helpful. I mean, Cassandra has no “subset” concept, nor a “load subset”
> command, so what are we really talking about?
>
> Also, I presume we are talking CQL, but some of the references seem more
> Thrift/slice oriented.
>
> -- Jack Krupansky
>
>  *From:* Eric Stevens 
> *Sent:* Sunday, December 7, 2014 10:12 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: How to model data to achieve specific data locality
>
> > Also new seq_types can be added and old seq_types can be deleted. This
> means I often need to ALTER TABLE to add and drop columns.
>
> Kai, unless I'm misunderstanding something, I don't see why you need to
> alter the table to add a new seq type.  From a data model perspective,
> these are just new values in a row.
>
> If you do have columns which are specific to particular seq_types, data
> modeling does become a little more challenging.  In that case you may get
> some advantage from using collections (especially map) to store data which
> applies to only a few seq types.  Or defining a schema which includes the
> set of all possible columns (that's when you're getting into ALTERs when a
> new column comes or goes).
>
> > All sequences with the same seq_id tend to grow at the same rate.
>
> Note that it is an anti pattern in Cassandra to append to the same row
> indefinitely.  I think you understand this because of your original
> question.  But please note that a sub partitioning strategy which reuses
> subpartitions will result in degraded read performance after a while.
> You'll need to rotate sub partitions by something that doesn't repeat in
> order to keep the data for a given partition key grouped into just a few
> sstables.  A typical pattern there is to use some kind of time bucket
> (hour, day, week, etc., depending on your write volume).
>
> I do note that your original question was about preserving data locality -
> and having a consistent locality for a given seq_id - for best offline
> analytics.  If you wanted to work for this, you can certainly also include
> a blob value in your partitioning key, whose value is calculated to force a
> ring collision with this record's sibling data.  With Cassandra's default
> partitioner of murmur3, that's probably pretty challenging - murmur3 isn't
> designed to be cryptographically strong (it doesn't work to make it
> difficult to force a collision), but it's meant to have good distribution
> (it may still be computationally expensive to force a collision - I'm not
> that familiar with its internal workings).  In this case,
> ByteOrderedPartitioner would be a lot easier to force a ring collision on,
> but then you need to work on a good ring balancing strategy to distribute
> your data evenly over the ring.
>
> On Sun Dec 07 2014 at 2:56:26 AM DuyHai Doan  wrote:
>
>> "Those sequences are not fixed. All sequences with the same seq_id tend
>> to grow at the same rate. If it's one partition per seq_id, the size will
>> most likely exceed the threshold quickly"
>>
>>

Re: How to model data to achieve specific data locality

2014-12-07 Thread Jack Krupansky
It would be helpful to look at some specific examples of sequences, showing how 
they grow. I suspect that the term “sequence” is being overloaded in some 
subtly misleading way here.

Besides, we’ve already answered the headline question – data locality is 
achieved by having a common partition key. So, we need some clarity as to what 
question we are really focusing on.

And, of course, we should be asking the “Cassandra Data Modeling 101” question 
of what you want your queries to look like, and how exactly you want to access 
your data. Only after we have a handle on how you need to read your data can we 
decide how it should be stored.

My immediate question to get things back on track: When you say “The typical 
read is to load a subset of sequences with the same seq_id”, what type of 
“subset” are you talking about? Again, a few explicit and concise example 
queries (in some concise, easy-to-read pseudo language or even plain English, 
but not belabored with full CQL syntax) would be very helpful. I mean, 
Cassandra has no “subset” concept, nor a “load subset” command, so what are we 
really talking about?

Also, I presume we are talking CQL, but some of the references seem more 
Thrift/slice oriented.

-- Jack Krupansky

From: Eric Stevens 
Sent: Sunday, December 7, 2014 10:12 AM
To: user@cassandra.apache.org 
Subject: Re: How to model data to achieve specific data locality

> Also new seq_types can be added and old seq_types can be deleted. This means 
> I often need to ALTER TABLE to add and drop columns. 

Kai, unless I'm misunderstanding something, I don't see why you need to alter 
the table to add a new seq type.  From a data model perspective, these are just 
new values in a row.  

If you do have columns which are specific to particular seq_types, data 
modeling does become a little more challenging.  In that case you may get some 
advantage from using collections (especially map) to store data which applies 
to only a few seq types.  Or defining a schema which includes the set of all 
possible columns (that's when you're getting into ALTERs when a new column 
comes or goes).

> All sequences with the same seq_id tend to grow at the same rate.


Note that it is an anti pattern in Cassandra to append to the same row 
indefinitely.  I think you understand this because of your original question.  
But please note that a sub partitioning strategy which reuses subpartitions 
will result in degraded read performance after a while.  You'll need to rotate 
sub partitions by something that doesn't repeat in order to keep the data for a 
given partition key grouped into just a few sstables.  A typical pattern there 
is to use some kind of time bucket (hour, day, week, etc., depending on your 
write volume).


I do note that your original question was about preserving data locality - and 
having a consistent locality for a given seq_id - for best offline analytics.  
If you wanted to work for this, you can certainly also include a blob value in 
your partitioning key, whose value is calculated to force a ring collision with 
this record's sibling data.  With Cassandra's default partitioner of murmur3, 
that's probably pretty challenging - murmur3 isn't designed to be 
cryptographically strong (it doesn't work to make it difficult to force a 
collision), but it's meant to have good distribution (it may still be 
computationally expensive to force a collision - I'm not that familiar with its 
internal workings).  In this case, ByteOrderedPartitioner would be a lot easier 
to force a ring collision on, but then you need to work on a good ring 
balancing strategy to distribute your data evenly over the ring.

On Sun Dec 07 2014 at 2:56:26 AM DuyHai Doan  wrote:

  "Those sequences are not fixed. All sequences with the same seq_id tend to 
grow at the same rate. If it's one partition per seq_id, the size will most 
likely exceed the threshold quickly" 


  --> Then use bucketing to avoid too wide partitions


  "Also new seq_types can be added and old seq_types can be deleted. This means 
I often need to ALTER TABLE to add and drop columns. I am not sure if this is a 
good practice from operation point of view."


  --> I don't understand why altering table is necessary to add seq_types. If 
"seq_types" is defined as your clustering column, you can have many of them 
using the same table structure ...









  On Sat, Dec 6, 2014 at 10:09 PM, Kai Wang  wrote:

On Sat, Dec 6, 2014 at 11:18 AM, Eric Stevens  wrote:

  It depends on the size of your data, but if your data is reasonably 
small, there should be no trouble including thousands of records on the same 
partition key.  So a data model using PRIMARY KEY ((seq_id), seq_type) ought to 
work fine.  


  If the data size per partition exceeds some threshold that represents the 
right tradeoff of increasing repair cost, gc pressure, threatening unbalanced 
loads, and other issues that come with wide partitions, then you can 
subpartition via some means in a manner consistent with your work load, with 
something like PRIMARY KEY ((seq_id, subpartition), seq_type).

Re: How to model data to achieve specific data locality

2014-12-07 Thread Eric Stevens
> Also new seq_types can be added and old seq_types can be deleted. This
means I often need to ALTER TABLE to add and drop columns.

Kai, unless I'm misunderstanding something, I don't see why you need to
alter the table to add a new seq type.  From a data model perspective,
these are just new values in a row.

If you do have columns which are specific to particular seq_types, data
modeling does become a little more challenging.  In that case you may get
some advantage from using collections (especially map) to store data which
applies to only a few seq types.  Or defining a schema which includes the
set of all possible columns (that's when you're getting into ALTERs when a
new column comes or goes).

> All sequences with the same seq_id tend to grow at the same rate.

Note that it is an anti-pattern in Cassandra to append to the same row
indefinitely.  I think you understand this because of your original
question.  But please note that a sub-partitioning strategy which reuses
sub-partitions will result in degraded read performance after a while.
You'll need to rotate sub-partitions by something that doesn't repeat in
order to keep the data for a given partition key grouped into just a few
sstables.  A typical pattern there is to use some kind of time bucket
(hour, day, week, etc., depending on your write volume).
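
For example, a sketch of a day-bucketed variant (the day granularity, column
names, and types are assumptions - pick a bucket width that matches your write
volume):

    CREATE TABLE ks.sequences_by_day (
        seq_id   text,
        day      text,       -- e.g. '2014-12-07', computed by the writer
        seq_type text,
        data     blob,
        PRIMARY KEY ((seq_id, day), seq_type)
    );

    -- Each day's writes land in a fresh partition; once the day rolls over, the
    -- old partition stops growing and compaction can settle it into few sstables.
    INSERT INTO ks.sequences_by_day (seq_id, day, seq_type, data)
    VALUES ('S1', '2014-12-07', 'typeA', 0xCAFE);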

I do note that your original question was about preserving data locality -
and having a consistent locality for a given seq_id - for best offline
analytics.  If you want to work toward this, you can certainly also include
a blob value in your partitioning key, whose value is calculated to force a
ring collision with this record's sibling data.  With Cassandra's default
partitioner of murmur3, that's probably pretty challenging - murmur3 isn't
designed to be cryptographically strong (it makes no attempt to make
collisions hard to force), but it is meant to have good distribution
(so forcing a collision may still be computationally expensive - I'm not
that familiar with its internal workings).  In this case,
ByteOrderedPartitioner would be a lot easier to force a ring collision on,
but then you need to work on a good ring balancing strategy to distribute
your data evenly over the ring.

On Sun Dec 07 2014 at 2:56:26 AM DuyHai Doan  wrote:

> "Those sequences are not fixed. All sequences with the same seq_id tend
> to grow at the same rate. If it's one partition per seq_id, the size will
> most likely exceed the threshold quickly"
>
> --> Then use bucketing to avoid too wide partitions
>
> "Also new seq_types can be added and old seq_types can be deleted. This
> means I often need to ALTER TABLE to add and drop columns. I am not sure if
> this is a good practice from operation point of view."
>
>  --> I don't understand why altering table is necessary to add seq_types.
> If "seq_types" is defined as your clustering column, you can have many of
> them using the same table structure ...
>
>
>
>
>
> On Sat, Dec 6, 2014 at 10:09 PM, Kai Wang  wrote:
>
>> On Sat, Dec 6, 2014 at 11:18 AM, Eric Stevens  wrote:
>>
>>> It depends on the size of your data, but if your data is reasonably
>>> small, there should be no trouble including thousands of records on the
>>> same partition key.  So a data model using PRIMARY KEY ((seq_id), seq_type)
>>> ought to work fine.
>>>
>>> If the data size per partition exceeds some threshold that represents
>>> the right tradeoff of increasing repair cost, gc pressure, threatening
>>> unbalanced loads, and other issues that come with wide partitions, then you
>>> can subpartition via some means in a manner consistent with your work load,
>>> with something like PRIMARY KEY ((seq_id, subpartition), seq_type).
>>>
>>> For example, if seq_type can be processed for a given seq_id in any
>>> order, and you need to be able to locate specific records for a known
>>> seq_id/seq_type pair, you can compute subpartition
>>> deterministically.  Or if you only ever need to read *all* values for a
>>> given seq_id, and the processing order is not important, just randomly
>>> generate a value for subpartition at write time, as long as you can know
>>> all possible values for subpartition.
>>>
>>> If the values for the seq_types for a given seq_id must always be
>>> processed in order based on seq_type, then your subpartition calculation
>>> would need to reflect that and place adjacent seq_types in the same
>>> partition.  As a contrived example, say seq_type was an incrementing
>>> integer, your subpartition could be seq_type / 100.
>>>
>>> On Fri Dec 05 2014 at 7:34:38 PM Kai Wang  wrote:
>>>
 I have a data model question. I am trying to figure out how to model
 the data to achieve the best data locality for analytic purpose. Our
 application processes sequences. Each sequence has a unique key in the
 format of [seq_id]_[seq_type]. For any given seq_id, there are unlimited
 number of seq_types. The typical read is to load a subset of sequences with
 the same seq_id.

Re: How to model data to achieve specific data locality

2014-12-07 Thread DuyHai Doan
"Those sequences are not fixed. All sequences with the same seq_id tend to
grow at the same rate. If it's one partition per seq_id, the size will most
likely exceed the threshold quickly"

--> Then use bucketing to avoid overly wide partitions

"Also new seq_types can be added and old seq_types can be deleted. This
means I often need to ALTER TABLE to add and drop columns. I am not sure if
this is a good practice from operation point of view."

 --> I don't understand why altering the table is necessary to add seq_types.
If "seq_types" is defined as your clustering column, you can have many of
them using the same table structure ...
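
For instance, against a hypothetically shaped table (names are placeholders), a
brand-new seq_type is just another row - no ALTER involved:

    -- assuming a table shaped like:
    --   CREATE TABLE ks.sequences_by_bucket (seq_id text, bucket int,
    --       seq_type text, data blob, PRIMARY KEY ((seq_id, bucket), seq_type));
    INSERT INTO ks.sequences_by_bucket (seq_id, bucket, seq_type, data)
    VALUES ('S1', 0, 'brand_new_type', 0xBEEF);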





On Sat, Dec 6, 2014 at 10:09 PM, Kai Wang  wrote:

> On Sat, Dec 6, 2014 at 11:18 AM, Eric Stevens  wrote:
>
>> It depends on the size of your data, but if your data is reasonably
>> small, there should be no trouble including thousands of records on the
>> same partition key.  So a data model using PRIMARY KEY ((seq_id), seq_type)
>> ought to work fine.
>>
>> If the data size per partition exceeds some threshold that represents the
>> right tradeoff of increasing repair cost, gc pressure, threatening
>> unbalanced loads, and other issues that come with wide partitions, then you
>> can subpartition via some means in a manner consistent with your work load,
>> with something like PRIMARY KEY ((seq_id, subpartition), seq_type).
>>
>> For example, if seq_type can be processed for a given seq_id in any
>> order, and you need to be able to locate specific records for a known
>> seq_id/seq_type pair, you can compute subpartition
>> deterministically.  Or if you only ever need to read *all* values for a
>> given seq_id, and the processing order is not important, just randomly
>> generate a value for subpartition at write time, as long as you can know
>> all possible values for subpartition.
>>
>> If the values for the seq_types for a given seq_id must always be
>> processed in order based on seq_type, then your subpartition calculation
>> would need to reflect that and place adjacent seq_types in the same
>> partition.  As a contrived example, say seq_type was an incrementing
>> integer, your subpartition could be seq_type / 100.
>>
>> On Fri Dec 05 2014 at 7:34:38 PM Kai Wang  wrote:
>>
>>> I have a data model question. I am trying to figure out how to model the
>>> data to achieve the best data locality for analytic purpose. Our
>>> application processes sequences. Each sequence has a unique key in the
>>> format of [seq_id]_[seq_type]. For any given seq_id, there are unlimited
>>> number of seq_types. The typical read is to load a subset of sequences with
>>> the same seq_id. Naturally I would like to have all the sequences with the
>>> same seq_id to co-locate on the same node(s).
>>>
>>>
>>> However I can't simply create one partition per seq_id and use seq_id as
>>> my partition key. That's because:
>>>
>>>
>>> 1. there could be thousands or even more seq_types for each seq_id. It's
>>> not feasible to include all the seq_types into one table.
>>>
>>> 2. each seq_id might have different sets of seq_types.
>>>
>>> 3. each application only needs to access a subset of seq_types for a
>>> seq_id. Based on CASSANDRA-5762, select partial row loads the whole row. I
>>> prefer only touching the data that's needed.
>>>
>>>
>>> As per above, I think I should use one partition per
>>> [seq_id]_[seq_type]. But how can I achieve the data locality on seq_id? One
>>> possible approach is to override IPartitioner so that I just use part of
>>> the field (say 64 bytes) to get the token (for location) while still using
>>> the whole field as partition key (for look up). But before heading that
>>> direction, I would like to see if there are better options out there. Maybe
>>> any new or upcoming features in C* 3.0?
>>>
>>>
>>> Thanks.
>>>
>>
> Thanks, Eric.
>
> Those sequences are not fixed. All sequences with the same seq_id tend to
> grow at the same rate. If it's one partition per seq_id, the size will most
> likely exceed the threshold quickly. Also new seq_types can be added and
> old seq_types can be deleted. This means I often need to ALTER TABLE to add
> and drop columns. I am not sure if this is a good practice from operation
> point of view.
>
> I thought about your subpartition idea. If there are only a few
> applications and each one of them uses a subset of seq_types, I can easily
> create one table per application since I can compute the subpartition
> deterministically as you said. But in my case data scientists need to
> easily write new applications using any combination of seq_types of a
> seq_id. So I want the data model to be flexible enough to support
> applications using any different set of seq_types without creating new
> tables, duplicate all the data etc.
>
> -Kai
>
>
>


Re: How to model data to achieve specific data locality

2014-12-06 Thread Kai Wang
On Sat, Dec 6, 2014 at 11:18 AM, Eric Stevens  wrote:

> It depends on the size of your data, but if your data is reasonably small,
> there should be no trouble including thousands of records on the same
> partition key.  So a data model using PRIMARY KEY ((seq_id), seq_type)
> ought to work fine.
>
> If the data size per partition exceeds some threshold that represents the
> right tradeoff of increasing repair cost, gc pressure, threatening
> unbalanced loads, and other issues that come with wide partitions, then you
> can subpartition via some means in a manner consistent with your work load,
> with something like PRIMARY KEY ((seq_id, subpartition), seq_type).
>
> For example, if seq_type can be processed for a given seq_id in any order,
> and you need to be able to locate specific records for a known
> seq_id/seq_type pair, you can compute subpartition
> deterministically.  Or if you only ever need to read *all* values for a
> given seq_id, and the processing order is not important, just randomly
> generate a value for subpartition at write time, as long as you can know
> all possible values for subpartition.
>
> If the values for the seq_types for a given seq_id must always be
> processed in order based on seq_type, then your subpartition calculation
> would need to reflect that and place adjacent seq_types in the same
> partition.  As a contrived example, say seq_type was an incrementing
> integer, your subpartition could be seq_type / 100.
>
> On Fri Dec 05 2014 at 7:34:38 PM Kai Wang  wrote:
>
>> I have a data model question. I am trying to figure out how to model the
>> data to achieve the best data locality for analytic purpose. Our
>> application processes sequences. Each sequence has a unique key in the
>> format of [seq_id]_[seq_type]. For any given seq_id, there are unlimited
>> number of seq_types. The typical read is to load a subset of sequences with
>> the same seq_id. Naturally I would like to have all the sequences with the
>> same seq_id to co-locate on the same node(s).
>>
>>
>> However I can't simply create one partition per seq_id and use seq_id as
>> my partition key. That's because:
>>
>>
>> 1. there could be thousands or even more seq_types for each seq_id. It's
>> not feasible to include all the seq_types into one table.
>>
>> 2. each seq_id might have different sets of seq_types.
>>
>> 3. each application only needs to access a subset of seq_types for a
>> seq_id. Based on CASSANDRA-5762, select partial row loads the whole row. I
>> prefer only touching the data that's needed.
>>
>>
>> As per above, I think I should use one partition per [seq_id]_[seq_type].
>> But how can I achieve the data locality on seq_id? One possible approach is
>> to override IPartitioner so that I just use part of the field (say 64
>> bytes) to get the token (for location) while still using the whole field as
>> partition key (for look up). But before heading that direction, I would
>> like to see if there are better options out there. Maybe any new or
>> upcoming features in C* 3.0?
>>
>>
>> Thanks.
>>
>
Thanks, Eric.

Those sequences are not fixed. All sequences with the same seq_id tend to
grow at the same rate. If it's one partition per seq_id, the size will most
likely exceed the threshold quickly. Also new seq_types can be added and
old seq_types can be deleted. This means I often need to ALTER TABLE to add
and drop columns. I am not sure if this is a good practice from an
operations point of view.

I thought about your subpartition idea. If there are only a few
applications and each one of them uses a subset of seq_types, I can easily
create one table per application since I can compute the subpartition
deterministically as you said. But in my case data scientists need to
easily write new applications using any combination of seq_types of a
seq_id. So I want the data model to be flexible enough to support
applications using any different set of seq_types without creating new
tables, duplicating all the data, etc.

-Kai


Re: How to model data to achieve specific data locality

2014-12-06 Thread Eric Stevens
It depends on the size of your data, but if your data is reasonably small,
there should be no trouble including thousands of records on the same
partition key.  So a data model using PRIMARY KEY ((seq_id), seq_type)
ought to work fine.

If the data size per partition exceeds some threshold that represents the
right tradeoff of increasing repair cost, gc pressure, threatening
unbalanced loads, and other issues that come with wide partitions, then you
can subpartition via some means in a manner consistent with your work load,
with something like PRIMARY KEY ((seq_id, subpartition), seq_type).

For example, if seq_type can be processed for a given seq_id in any order,
and you need to be able to locate specific records for a known
seq_id/seq_type pair, you can compute subpartition
deterministically.  Or if you only ever need to read *all* values for a
given seq_id, and the processing order is not important, just randomly
generate a value for subpartition at write time, as long as you can know
all possible values for subpartition.

If the values for the seq_types for a given seq_id must always be processed
in order based on seq_type, then your subpartition calculation would need
to reflect that and place adjacent seq_types in the same partition.  As a
contrived example, say seq_type was an incrementing integer, your
subpartition could be seq_type / 100.
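
In CQL terms, a sketch of those two shapes (the types, names, and the /100 rule
are just the contrived example above, not a recommendation):

    -- One partition per seq_id:
    CREATE TABLE ks.sequences_simple (
        seq_id   text,
        seq_type int,
        data     blob,
        PRIMARY KEY ((seq_id), seq_type)
    );

    -- Sub-partitioned variant for seq_ids that would otherwise grow too wide:
    CREATE TABLE ks.sequences_by_subpart (
        seq_id       text,
        subpartition int,
        seq_type     int,
        data         blob,
        PRIMARY KEY ((seq_id, subpartition), seq_type)
    );

    -- The writer derives the bucket deterministically, e.g. subpartition = seq_type / 100:
    INSERT INTO ks.sequences_by_subpart (seq_id, subpartition, seq_type, data)
    VALUES ('S1', 1, 137, 0xCAFE);   -- 137 / 100 = 1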

On Fri Dec 05 2014 at 7:34:38 PM Kai Wang  wrote:

> I have a data model question. I am trying to figure out how to model the
> data to achieve the best data locality for analytic purpose. Our
> application processes sequences. Each sequence has a unique key in the
> format of [seq_id]_[seq_type]. For any given seq_id, there are unlimited
> number of seq_types. The typical read is to load a subset of sequences with
> the same seq_id. Naturally I would like to have all the sequences with the
> same seq_id to co-locate on the same node(s).
>
>
> However I can't simply create one partition per seq_id and use seq_id as
> my partition key. That's because:
>
>
> 1. there could be thousands or even more seq_types for each seq_id. It's
> not feasible to include all the seq_types into one table.
>
> 2. each seq_id might have different sets of seq_types.
>
> 3. each application only needs to access a subset of seq_types for a
> seq_id. Based on CASSANDRA-5762, select partial row loads the whole row. I
> prefer only touching the data that's needed.
>
>
> As per above, I think I should use one partition per [seq_id]_[seq_type].
> But how can I achieve the data locality on seq_id? One possible approach is
> to override IPartitioner so that I just use part of the field (say 64
> bytes) to get the token (for location) while still using the whole field as
> partition key (for look up). But before heading that direction, I would
> like to see if there are better options out there. Maybe any new or
> upcoming features in C* 3.0?
>
>
> Thanks.
>


How to model data to achieve specific data locality

2014-12-05 Thread Kai Wang
I have a data model question. I am trying to figure out how to model the
data to achieve the best data locality for analytic purpose. Our
application processes sequences. Each sequence has a unique key in the
format of [seq_id]_[seq_type]. For any given seq_id, there is an unlimited
number of seq_types. The typical read is to load a subset of sequences with
the same seq_id. Naturally I would like to have all the sequences with the
same seq_id to co-locate on the same node(s).


However I can't simply create one partition per seq_id and use seq_id as my
partition key. That's because:


1. there could be thousands or even more seq_types for each seq_id. It's
not feasible to include all the seq_types into one table.

2. each seq_id might have different sets of seq_types.

3. each application only needs to access a subset of seq_types for a
seq_id. Based on CASSANDRA-5762, selecting part of a row loads the whole row. I
prefer only touching the data that's needed.


As per above, I think I should use one partition per [seq_id]_[seq_type].
But how can I achieve the data locality on seq_id? One possible approach is
to override IPartitioner so that I just use part of the field (say 64
bytes) to get the token (for location) while still using the whole field as
partition key (for look up). But before heading that direction, I would
like to see if there are better options out there. Maybe any new or
upcoming features in C* 3.0?


Thanks.