Re: Timeseries analysis using Cassandra and partition by date period

Serega Sheypak Mon, 06 Apr 2015 02:59:03 -0700

Thanks. Is it a kind of OpenTSDB?

2015-04-05 18:28 GMT+02:00 Kevin Burton <bur...@spinn3r.com>:


> > Hi, I switched from HBase to Cassandra and am trying to find a solution
> > for time-series analysis on top of Cassandra.
>
> Depending on what you’re looking for, you might want to check out KairosDB.
>
> 0.95 beta2 just shipped yesterday as well so you have good timing.
>
> https://github.com/kairosdb/kairosdb
>
> On Sat, Apr 4, 2015 at 11:29 AM, Serega Sheypak <serega.shey...@gmail.com>
> wrote:
>
>> Okay, so bucketing by day/week/month is a matter of capacity planning and
>> of the actual questions I want to ask.
>> As a conclusion:
>> I have a table of events
>>
>> CREATE TABLE user_plans (
>>   id timeuuid,
>>   user_id timeuuid,
>>   event_ts timestamp,
>>   event_type int,
>>   some_other_attr text,
>>   PRIMARY KEY (user_id, event_ts)
>> );
>> which fits tactical queries:
>> select smth from user_plans where user_id='xxx' and event_ts > now()
>>
>> Then I create a second table user_plans_daily (or weekly, monthly)
>>
>> with DDL:
>> CREATE TABLE user_plans_daily/weekly/monthly (
>>   ymd int,
>>   user_id timeuuid,
>>   event_ts timestamp,
>>   event_type int,
>>   some_other_attr text,
>>   PRIMARY KEY ((ymd, user_id), event_ts)
>> )
>> WITH CLUSTERING ORDER BY (event_ts DESC);
>>
>> And this table is good for answering strategic questions:
>> select * from
>> user_plans_daily/weekly/monthly
>> where ymd in (....)
>> And I should avoid a long list of values inside the IN clause; that is why
>> you suggest creating a bigger bucket, correct?
>>
>>
>> 2015-04-04 20:00 GMT+02:00 Jack Krupansky <jack.krupan...@gmail.com>:
>>
>>> It sounds like your time bucket should be a month, but it depends on the
>>> amount of data per user per day and your main query range. Within the
>>> partition you can then query for a range of days.
>>>
>>> Yes, all of the rows within a partition are stored on one physical node
>>> as well as the replica nodes.
>>>
>>> -- Jack Krupansky
>>>
>>> On Sat, Apr 4, 2015 at 1:38 PM, Serega Sheypak <serega.shey...@gmail.com
>>> > wrote:
>>>
>>>> >non-equal relation on a partition key is not supported
>>>> OK, can I generate a select query:
>>>> select some_attributes
>>>> from events where ymd = 20150101 or ymd = 20150102 or ymd = 20150103 ...
>>>> or ymd = 20150331
>>>>
>>>> > The partition key determines which node can satisfy the query
>>>> So you mean that all rows with the same *(ymd, user_id)* would be on
>>>> one physical node?
>>>>
>>>>
>>>> 2015-04-04 16:38 GMT+02:00 Jack Krupansky <jack.krupan...@gmail.com>:
>>>>
>>>>> Unfortunately, a non-equal relation on a partition key is not
>>>>> supported. You would need to bucket by some larger unit, like a month, and
>>>>> then use the date/time as a clustering column for the row key. Then you
>>>>> could query within the partition. The partition key determines which node
>>>>> can satisfy the query. Designing your partition key judiciously is the key
>>>>> (haha!) to performant Cassandra applications.
>>>>>
>>>>> -- Jack Krupansky
>>>>>
>>>>> On Sat, Apr 4, 2015 at 9:33 AM, Serega Sheypak <
>>>>> serega.shey...@gmail.com> wrote:
>>>>>
>>>>>> Hi, we plan to have 10^8 users and each user could generate 10 events
>>>>>> per day.
>>>>>> So we have up to:
>>>>>> 10^9 records per day
>>>>>> 10^9*30 records per month.
>>>>>> Our analysis time window could be from 1 to 6 months.
>>>>>>
>>>>>> Right now the PK is PRIMARY KEY (user_id, event_ts), where event_ts is
>>>>>> the exact timestamp of the event.
>>>>>>
>>>>>> So you suggest this approach:
>>>>>> PRIMARY KEY ((ymd, user_id), event_ts)
>>>>>> WITH CLUSTERING ORDER BY (event_ts DESC);
>>>>>>
>>>>>> where ymd=20150102 (the Second of January)?
>>>>>>
>>>>>> *What happens to writes:*
>>>>>> SSTables with past days (ymd < current_day) stay untouched and don't
>>>>>> take part in the compaction process since there are no changes to them?
>>>>>>
>>>>>> *What happens to reads:*
>>>>>> I issue the query:
>>>>>> select some_attributes
>>>>>> from events where ymd >= 20150101 and ymd < 20150301
>>>>>> Does Cassandra skip SSTables which don't have ymd in the specified range
>>>>>> and give me a kind of partition elimination, like in traditional DBs?
>>>>>>
>>>>>>
>>>>>> 2015-04-04 14:41 GMT+02:00 Jack Krupansky <jack.krupan...@gmail.com>:
>>>>>>
>>>>>>> It depends on the actual number of events per user, but simply
>>>>>>> bucketing the partition key can give you the same effect - clustering
>>>>>>> rows by time range. A composite partition key could be comprised of
>>>>>>> the user name and the date.
>>>>>>>
>>>>>>> It also depends on the data rate - is it many events per day or just
>>>>>>> a few events per week, or over what time period. You need to be
>>>>>>> careful - you don't want your Cassandra partitions to be too big
>>>>>>> (millions of rows) or too small (just a few or even one row per
>>>>>>> partition).
>>>>>>>
>>>>>>> -- Jack Krupansky
>>>>>>>
>>>>>>> On Sat, Apr 4, 2015 at 7:03 AM, Serega Sheypak <
>>>>>>> serega.shey...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi, I switched from HBase to Cassandra and am trying to find a
>>>>>>>> solution for time-series analysis on top of Cassandra.
>>>>>>>> I have an entity named "Event".
>>>>>>>> "Event" has attributes:
>>>>>>>> user_id - the user who triggered the event
>>>>>>>> event_ts - when the event happened
>>>>>>>> event_type - type of event
>>>>>>>> some_other_attr - some other attrs we don't care about right now.
>>>>>>>>
>>>>>>>> The DDL for the event entity looks like this:
>>>>>>>>
>>>>>>>> CREATE TABLE user_plans (
>>>>>>>>   id timeuuid,
>>>>>>>>   user_id timeuuid,
>>>>>>>>   event_ts timestamp,
>>>>>>>>   event_type int,
>>>>>>>>   some_other_attr text,
>>>>>>>>   PRIMARY KEY (user_id, event_ts)
>>>>>>>> );
>>>>>>>>
>>>>>>>> The table is "infinite"; it would grow continuously during the
>>>>>>>> application lifetime.
>>>>>>>> I want to ask the question:
>>>>>>>> Cassandra, give me all events where event_ts >= xxx
>>>>>>>> and event_ts <= yyy.
>>>>>>>>
>>>>>>>> Right now it would lead to a full table scan.
>>>>>>>>
>>>>>>>> There is a trick in HBase. HBase has a table abstraction and a
>>>>>>>> Column Family abstraction.
>>>>>>>> Column families should be declared in advance.
>>>>>>>> A column family is physically a pack of HFiles ("SSTables" in C*).
>>>>>>>> So I can easily add partitioning to my HBase table:
>>>>>>>> alter table hbase_events add column family '2015_01'
>>>>>>>> and store all January 2015 data in the column family named '2015_01'.
>>>>>>>>
>>>>>>>> When I want to get January data, I would directly access the column
>>>>>>>> family named '2015_01' and I won't touch all the data in the table,
>>>>>>>> just this piece.
>>>>>>>>
>>>>>>>> What is the approach in C* in this case?
>>>>>>>> I have an idea to create several tables: event_2015_01, event_2015_02,
>>>>>>>> etc., but it looks rather ugly from my current understanding of how it
>>>>>>>> works.
>>>>>>>>
>
> --
>
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> <https://plus.google.com/102718274791889610666/posts>
> <http://spinn3r.com>
>
>
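
A minimal sketch of the monthly bucketing discussed in the quoted thread. Since a non-equal relation on the partition key is not supported, the client can enumerate the month buckets that cover the time window and issue one equality query per bucket, with the range condition on the clustering column. Everything named here is an assumption for illustration: the `ym` bucket column, a `user_plans_monthly` table with PRIMARY KEY ((ym, user_id), event_ts), the helper names, and the `%s` placeholder style, which depends on the client driver.

```python
from datetime import datetime


def month_buckets(start, end):
    """Return yyyymm bucket ids for every month touched by [start, end]."""
    if end < start:
        raise ValueError("end must not be before start")
    buckets = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        buckets.append(year * 100 + month)
        # Advance one calendar month, rolling over the year after December.
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return buckets


# Hypothetical table layout: PRIMARY KEY ((ym, user_id), event_ts).
# The partition key columns get equality relations only; the range
# condition goes on the clustering column event_ts.
QUERY = (
    "SELECT event_ts, event_type FROM user_plans_monthly "
    "WHERE ym = %s AND user_id = %s AND event_ts >= %s AND event_ts <= %s"
)


def bucket_queries(user_id, start, end):
    """Yield (cql, params) pairs, one query per month bucket."""
    for ym in month_buckets(start, end):
        yield QUERY, (ym, user_id, start, end)


if __name__ == "__main__":
    window = (datetime(2015, 1, 1), datetime(2015, 3, 31))
    for cql, params in bucket_queries("some-user-id", *window):
        print(cql, params)
```

With about 10 events per user per day, a (month, user) partition in this layout would hold on the order of 300 rows, and a six-month window would need six queries per user.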
