Haris,

As with most things in Cassandra, you will need to create a separate
down-sampled table: either run a cron job over the raw table, or, if you are
using a streaming solution like Flink/Storm/Spark, extract the aggregate
values there and write them into your down-sampled table.
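
A minimal sketch of what such a down-sampled table could look like (the
table and column names here are hypothetical, assuming hourly aggregates
computed per id by the cron or streaming job):

CREATE TABLE metrics_hourly_rollup (
    id timeuuid,
    date date,
    hour timestamp,            -- raw timestamp truncated to the hour by the job
    metricName1_avg double,
    metricName1_max bigint,
    PRIMARY KEY ((id, date), hour)
);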

HTH

- Affan

On Tue, May 29, 2018 at 10:24 PM, Haris Altaf <m.haris...@gmail.com> wrote:

> Hi All,
> I have a related question. How do you down-sample your timeseries data?
>
>
> regards,
> Haris
>
> On Tue, 29 May 2018 at 22:11 Jonathan Haddad <j...@jonhaddad.com> wrote:
>
>> I wrote a post on this topic a while ago, might be worth reading over:
>> http://thelastpickle.com/blog/2017/08/02/time-series-data-modeling-massive-scale.html
>> On Tue, May 29, 2018 at 8:02 AM Jeff Jirsa <jji...@gmail.com> wrote:
>>
>> > There's a third option, which is bucketing by time instead of by hash.
>> > That tends to perform quite well if you're using TWCS, as it makes it
>> > quite likely that a read can be served by a single sstable.
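>>
>> > A rough sketch of that idea (the table name and the daily bucket size
>> > are hypothetical choices, assuming the 5-minute / 2 million id workload
>> > described below, which gives 288 rows per id per day):
>>
>> > CREATE TABLE metrics_by_day (
>> >     id timeuuid,
>> >     date date,                 -- time bucket: one partition per id per day
>> >     timestamp timestamp,
>> >     metricName1 bigint,
>> >     -- ... remaining metric columns ...
>> >     PRIMARY KEY ((id, date), timestamp)
>> > ) WITH compaction = {'class': 'TimeWindowCompactionStrategy',
>> >                      'compaction_window_unit': 'DAYS',
>> >                      'compaction_window_size': 1};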
>>
>> > --
>> > Jeff Jirsa
>>
>>
>> > On May 29, 2018, at 6:49 AM, sujeet jog <sujeet....@gmail.com> wrote:
>>
>> > Folks,
>> > I have two alternatives for my time series schema, and wanted to weigh
>> > in on one of them.
>>
>> > The query is: given an id & timestamp, read the metrics associated with
>> > that id.
>>
>> > Records are inserted every 5 mins, and the number of ids = 2 million,
>> > so every 5 mins 2 million records are written.
>>
>> > Bucket Range  : 0 - 5K.
>>
>> > Schema 1 )
>>
>> > create table (
>> > id timeuuid,
>> > bucketid int,
>> > date date,
>> > timestamp timestamp,
>> > metricName1 bigint,
>> > metricName2 bigint,
>> > ...
>> > metricName300 bigint,
>>
>> > Primary Key (( date, bucketid ), id, timestamp)
>> > )
>>
>> > BucketId is just a murmur3 hash of the id, which acts as a splitter to
>> > group ids into a partition.
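>>
>> > For illustration, a read under Schema 1 would look roughly like this
>> > (the table name is a placeholder, since the CREATE TABLE above doesn't
>> > name it; the bucketid has to be computed client-side from the id):
>>
>> > SELECT metricName1, metricName300
>> > FROM   schema1_table          -- placeholder name
>> > WHERE  date = ?
>> >   AND  bucketid = ?           -- e.g. murmur3(id) mapped into the 0 - 5K range
>> >   AND  id = ?
>> >   AND  timestamp = ?;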
>>
>>
>> > Pros : -
>>
>> > Efficient write performance, since data is written to a small number of
>> > partitions.
>>
>> > Cons : -
>>
>> > The first schema works best when queried programmatically, but is a bit
>> > inflexible if it has to be integrated with 3rd party BI tools like
>> > Tableau: the bucket-id cannot be generated from Tableau, as it's not
>> > part of the view, etc.
>>
>>
>> > Schema 2 )
>> > Same as above, without bucketid &  date.
>>
>> > Primary Key (id, timestamp )
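>>
>> > Written out, Schema 2 would look roughly like this (same hypothetical
>> > metric columns as above); note that id alone is the partition key, so
>> > each id becomes a single partition that grows over time:
>>
>> > create table (
>> > id timeuuid,
>> > timestamp timestamp,
>> > metricName1 bigint,
>> > ...
>> > metricName300 bigint,
>> > Primary Key (id, timestamp)
>> > )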
>>
>> > Pros : -
>>
>> > BI tools don't need to generate bucket id lookups,
>>
>> > Cons :-
>> > Too many partitions are written every 5 mins: 2 million records are
>> > written into 2 million distinct partitions.
>>
>>
>>
>> > I believe writing this data to the commit log is the same for Schema 1
>> > and Schema 2, but the actual performance bottleneck could be compaction,
>> > since data from the memtable is flushed to SSTables frequently
>> > (depending on the memory settings), and the header of every SSTable
>> > maintains a partition index with byte offsets.
>>
>> > I wanted to gauge how bad the performance of Schema 2 can get with
>> > respect to writes/compaction having to do many disk seeks.
>>
>> > Compacting many SSTables, but with too many partition index entries
>> > because of the high number of partitions: can this be a bottleneck?
>>
>> > Any in-depth performance explanation of Schema 2 would be very helpful.
>>
>>
>> > Thanks,
>>
>>
>>
>>
>> --
>> Jon Haddad
>> http://www.rustyrazorblade.com
>> twitter: rustyrazorblade
>>
>
