There’s a third option: bucketing by time instead of by hash. That tends to 
perform quite well if you’re using TWCS, since it makes it quite likely that 
a read can be served by a single SSTable.
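The time-bucket idea can be sketched as follows; this helper and the 24-hour window size are illustrative assumptions, not something from the thread:

```python
from datetime import datetime, timezone

def time_bucket(ts: datetime, window_hours: int = 24) -> str:
    """Derive a partition-key bucket from the event time, so that rows
    written inside the same TWCS time window land in the same partition
    and a windowed read can be served by a single SSTable."""
    epoch_hours = int(ts.timestamp()) // 3600
    start = (epoch_hours // window_hours) * window_hours
    return datetime.fromtimestamp(start * 3600, tz=timezone.utc).strftime("%Y-%m-%d-%H")

# With a 24h window, all samples from the same day share one bucket:
print(time_bucket(datetime(2018, 5, 29, 6, 49, tzinfo=timezone.utc)))  # 2018-05-29-00
print(time_bucket(datetime(2018, 5, 29, 23, 5, tzinfo=timezone.utc)))  # 2018-05-29-00
```

The bucket string becomes the partition key (optionally combined with a hash splitter if one window holds too many rows for a single partition).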

-- 
Jeff Jirsa


> On May 29, 2018, at 6:49 AM, sujeet jog <sujeet....@gmail.com> wrote:
> 
> Folks, 
> I have two alternatives for the time series schema I have, and wanted to 
> weigh in on one of the schemas. 
> 
> The query is: given an id and a timestamp, read the metrics associated with that id.
> 
> Records are inserted every 5 minutes, and there are 2 million ids, so every 
> 5 minutes 2 million records are written. 
> 
> Bucket Range  : 0 - 5K.
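For scale, the raw numbers above work out as follows (a back-of-envelope sketch, assuming the 2 million ids spread evenly over the 5000 buckets):

```python
ids = 2_000_000
interval_sec = 5 * 60      # one insert per id every 5 minutes
buckets = 5_000            # bucket range 0 - 5K

print(round(ids / interval_sec))   # ~6667 sustained writes/sec
print(ids // buckets)              # 400 ids per bucket
```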
> 
> Schema 1) 
> 
> create table metrics (
>     id timeuuid,
>     bucketid int,
>     date date,
>     timestamp timestamp,
>     metricName1 bigint,
>     metricName2 bigint,
>     ...
>     metricName300 bigint,
>     primary key ((date, bucketid), id, timestamp)
> );
> 
> BucketId is just a murmur3 hash of the id, which acts as a splitter to group 
> ids into partitions.
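The bucket derivation might look like the following sketch; the post uses a murmur3 hash, but here stdlib `zlib.crc32` stands in purely for illustration, since any stable hash spreads the ids the same way:

```python
import zlib

NUM_BUCKETS = 5000  # matches the 0 - 5K bucket range above

def bucket_id(row_id: str) -> int:
    """Hash an id into one of NUM_BUCKETS buckets, bounding the number
    of partitions written per day. Note: crc32 is a stand-in here; the
    original scheme hashes with murmur3."""
    return zlib.crc32(row_id.encode("utf-8")) % NUM_BUCKETS

b = bucket_id("5f55ae20-630f-11e8-9c2d-fa7ae01bbebc")
assert 0 <= b < NUM_BUCKETS                                     # always in range
assert b == bucket_id("5f55ae20-630f-11e8-9c2d-fa7ae01bbebc")   # deterministic
```

The important property is determinism: the same id always maps to the same bucket, so reads can recompute the bucket from the id.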
> 
> 
> Pros:
> 
> Efficient write performance, since data is written to a small number of partitions.
> 
> Cons:
> 
> The schema works well when queried programmatically, but it is a bit 
> inflexible if it has to be integrated with third-party BI tools like Tableau: 
> the bucket id cannot be generated from Tableau, since it is not part of the view.
> 
> 
> Schema 2 ) 
> Same as above, without bucketid &  date. 
> 
> Primary Key (id, timestamp ) 
> 
> Pros:
> 
> BI tools don't need to generate bucket-id lookups.
> 
> Cons:
> 
> Too many partitions are written every 5 minutes: 2 million records go to 
> 2 million distinct partitions.
> 
> 
> 
> I believe the commit-log write volume is the same for Schema 1 and Schema 2, 
> but the real bottleneck could be compaction: data from the memtable is 
> flushed to SSTables based on the memory settings, and each SSTable maintains 
> a partition index with byte offsets.
> 
> I wanted to gauge how bad Schema 2's performance can get with respect to 
> writes and compaction, given the many disk seeks involved.
> 
> Compacting many SSTables, each with a huge number of partition-index entries 
> because of the high partition count: can this be a bottleneck?
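A rough sketch of the index pressure, using the numbers from this thread ("one SSTable flush per 5-minute batch" is a simplifying assumption for illustration):

```python
ids = 2_000_000
intervals_per_day = 24 * 60 // 5   # a batch of 2M rows every 5 minutes
buckets = 5_000

rows_per_day = ids * intervals_per_day

# Partition-index entries in an SSTable flushed from one batch
# (simplifying assumption: one flush covers one 5-minute batch):
schema1_index_entries = buckets    # at most one partition per (date, bucketid)
schema2_index_entries = ids        # every id is its own partition

print(rows_per_day)                                     # 576000000 rows/day either way
print(schema2_index_entries // schema1_index_entries)   # 400x more index entries
```

The row volume is identical; what differs is how many partition-index entries each SSTable (and each compaction) has to carry and merge.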
> 
> Any in-depth performance explanation of Schema 2 would be very helpful.
> 
> 
> Thanks, 
> 
> 
