Cassandra Data Model with Narrow partition

2015-10-30 Thread chandrasekar.krc
Hi,

Could you please suggest whether a narrow partition is a good choice for the
use case below?


1)  Write-heavy event log table with 50M inserts per day and a peak load
of 20K transactions per second. There are no updates/deletes to records
once inserted. Records are inserted with a TTL of 60 days (the retention period)

2)  The table has a single primary key: a 27-digit sequence number
generated by the source application

3)  There are only two access patterns: one by sequence number alone, and
the other by sequence number + event date (range scans on date are also
possible)

4)  My target data model in Cassandra uses the sequence number as the
partition key and the event date as a clustering column to enable range scans
on date (a rough CQL sketch follows this list).

5)  The table has 120+ columns, and the average row size is close to 32 KB

6)  Reads are very infrequent, accounting for <5% of traffic, while inserts
account for close to 95%.

7)  From a functional standpoint, I do not see any other columns that could
be added to the primary key to keep the partition size reasonable (<100 MB)
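
For illustration, the target model would look roughly like the following CQL
(table and column names here are placeholders, not the actual schema):

CREATE TABLE event_log (
    seq_no      varint,      -- 27-digit sequence number from the source application
    event_date  timestamp,   -- clustering column to allow date range scans
    attr_1      text,        -- one of roughly 120 attribute columns (rest omitted)
    PRIMARY KEY (seq_no, event_date)
) WITH default_time_to_live = 5184000;  -- 60-day retention (60 * 86400 seconds)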

Questions:

1)  Is a narrow partition an ideal choice for the above use case?

2)  Is artificial bucketing an alternative way to keep the partition size
reasonable?

3)  We are using varint as the data type for the sequence number, which is 27
digits long (too large for bigint). Would the DECIMAL data type be a better choice?

4)  Any suggestions on performance impacts during compaction?

Regards, Chandra Sekar KR


Re: Cassandra Data Model with Narrow partition

2015-10-30 Thread Carlos Alonso
Hi Chandra,

Narrow partition is probably your best choice, but you need to bucket the data
somehow; otherwise your partitions will soon become unmanageable and you'll
have problems reading them, both because the partitions will grow very
big and because of the tombstones that your expired records will
generate.

In general, having a partition that can grow indefinitely is a bad idea, so
I'd advise you to use time-based artificial bucketing to keep the maximum
size of your partitions as close as possible to the recommended limits.
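
As a rough illustration of day-based bucketing (table and column names are
hypothetical, and the bucket granularity would need to be tuned to the actual
write rate):

CREATE TABLE event_log_bucketed (
    seq_no      varint,
    day_bucket  text,        -- e.g. '2015-10-30'; caps how much one partition can hold
    event_date  timestamp,
    -- ... attribute columns ...
    PRIMARY KEY ((seq_no, day_bucket), event_date)
) WITH default_time_to_live = 5184000;  -- 60-day TTL

Note that reads then have to supply both the sequence number and the bucket,
so the bucket value needs to be derivable at query time (e.g. from the event
date being searched).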

Also, 120+ columns sounds like quite a lot; is there a way you can split them
across different column families, or maybe use collections? I'd advise doing
some benchmarking here: http://mrcalonso.com/benchmarking-cassandra-models/.
That post is a bit outdated, as nowadays you can use cassandra-stress with
your own models, but the idea is the same.

For compaction I'd use DTCS or LCS; given that you will have a large
number of tombstones due to TTLs, I'd never go with STCS.
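
For reference, switching compaction strategy is just a table property change
(the table name is illustrative, and 160 MB is only the commonly used LCS
sstable target, not a recommendation):

ALTER TABLE event_log
  WITH compaction = {'class': 'DateTieredCompactionStrategy'};

-- or, for leveled compaction:
ALTER TABLE event_log
  WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': '160'};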

Hope it helps!

Carlos Alonso | Software Engineer | @calonso 


Re: Cassandra Data Model with Narrow partition

2015-10-30 Thread Kai Wang
I agree with Carlos: you should bucket your key, for example into (pk, day,
hour). Otherwise your partitions are going to grow large enough to cause
problems.
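
In CQL terms that is just a finer-grained partition key than the day-bucketed
sketch above, along these lines (column names are illustrative):

CREATE TABLE event_log_by_hour (
    seq_no      varint,
    day         text,        -- e.g. '2015-10-30'
    hour        int,         -- 0-23
    event_date  timestamp,
    -- ... attribute columns ...
    PRIMARY KEY ((seq_no, day, hour), event_date)
) WITH default_time_to_live = 5184000;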


Re: Cassandra Data Model with Narrow partition

2015-10-30 Thread Jeff Jirsa
I’m going to disagree with Carlos on two points.

You do have a lot of columns, so many that it’s likely to impact performance.
Rather than using collections, serializing them into a single JSON field is
far more performant. Since you write each record exactly once, this should be
easy to manage.
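
A minimal sketch of that approach (table and column names are hypothetical;
the application serializes the 120+ attributes to JSON on write and parses
them on read):

CREATE TABLE event_log (
    seq_no      varint,
    event_date  timestamp,
    payload     text,        -- all event attributes serialized as one JSON document
    PRIMARY KEY (seq_no, event_date)
) WITH default_time_to_live = 5184000;  -- 60-day TTL

The trade-off is that individual attributes can no longer be selected or
filtered server-side, which seems acceptable here given how rare reads are.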
Your use case is the canonical use case for DTCS – all data has a TTL, you 
write in order, you write each row one time. However, I’m going to suggest to 
you that DTCS is currently a bit buggy in certain operations 
(adding/removing/replacing nodes, in particular, can be very painful with 
larger installations using DTCS). You’re small enough it may not matter, but if 
you expect to grow above a handful of nodes, you may consider waiting until 
DTCS is more stable. 
- Jeff
