Thanks, is it a kind of opentsdb? 2015-04-05 18:28 GMT+02:00 Kevin Burton <bur...@spinn3r.com>:
> > Hi, I switched from HBase to Cassandra and try to find problem solution > for timeseries analysis on top Cassandra. > > Depending on what you’re looking for, you might want to check out KairosDB. > > 0.95 beta2 just shipped yesterday as well so you have good timing. > > https://github.com/kairosdb/kairosdb > > On Sat, Apr 4, 2015 at 11:29 AM, Serega Sheypak <serega.shey...@gmail.com> > wrote: > >> Okay, so bucketing by day/week/month is a capacity planning stuff and >> actual questions I want to ask. >> As as a conclusion: >> I have a table events >> >> CREATE TABLE user_plans ( >> id timeuuid, >> user_id timeuuid, >> event_ts timestamp, >> event_type int, >> some_other_attr text >> >> PRIMARY KEY (user_id, ends) >> ); >> which fits tactic queries: >> select smth from user_plans where user_id='xxx' and end_ts > now() >> >> Then I create second table user_plans_daily (or weekly, monthy) >> >> with DDL: >> CREATE TABLE user_plans_daily/weekly/monthly ( >> ymd int, >> user_id timeuuid, >> event_ts timestamp, >> event_type int, >> some_other_attr text >> ) >> PRIMARY KEY ((ymd, user_id), event_ts ) >> WITH CLUSTERING ORDER BY (event_ts DESC); >> >> And this table is good for answering strategic questions: >> select * from >> user_plans_daily/weekly/monthly >> where ymd in (....) >> And I should avoid long condition inside IN clause, that is why you >> suggest me to create bigger bucket, correct? >> >> >> 2015-04-04 20:00 GMT+02:00 Jack Krupansky <jack.krupan...@gmail.com>: >> >>> It sounds like your time bucket should be a month, but it depends on the >>> amount of data per user per day and your main query range. Within the >>> partition you can then query for a range of days. >>> >>> Yes, all of the rows within a partition are stored on one physical node >>> as well as the replica nodes. >>> >>> -- Jack Krupansky >>> >>> On Sat, Apr 4, 2015 at 1:38 PM, Serega Sheypak <serega.shey...@gmail.com >>> > wrote: >>> >>>> >non-equal relation on a partition key is not supported >>>> Ok, can I generate select query: >>>> select some_attributes >>>> from events where ymd = 20150101 or ymd = 20150102 or 20150103 ... or >>>> 20150331 >>>> >>>> > The partition key determines which node can satisfy the query >>>> So you mean that all rows with the same *(ymd, user_id)* would be on >>>> one physical node? >>>> >>>> >>>> 2015-04-04 16:38 GMT+02:00 Jack Krupansky <jack.krupan...@gmail.com>: >>>> >>>>> Unfortunately, a non-equal relation on a partition key is not >>>>> supported. You would need to bucket by some larger unit, like a month, and >>>>> then use the date/time as a clustering column for the row key. Then you >>>>> could query within the partition. The partition key determines which node >>>>> can satisfy the query. Designing your partition key judiciously is the key >>>>> (haha!) to performant Cassandra applications. >>>>> >>>>> -- Jack Krupansky >>>>> >>>>> On Sat, Apr 4, 2015 at 9:33 AM, Serega Sheypak < >>>>> serega.shey...@gmail.com> wrote: >>>>> >>>>>> Hi, we plan to have 10^8 users and each user could generate 10 events >>>>>> per day. >>>>>> So we have: >>>>>> 10^8 records per day >>>>>> 10^8*30 records per month. >>>>>> Our timewindow analysis could be from 1 to 6 months. >>>>>> >>>>>> Right now PK is PRIMARY KEY (user_id, ends) where endts is exact ts >>>>>> of event. >>>>>> >>>>>> So you suggest this approach: >>>>>> *PRIMARY KEY ((ymd, user_id), event_ts ) * >>>>>> *WITH CLUSTERING ORDER BY (**event_ts* >>>>>> * DESC);* >>>>>> >>>>>> where ymd=20150102 (the Second of January)? >>>>>> >>>>>> *What happens to writes:* >>>>>> SSTable with past days (ymd < current_day) stay untouched and don't >>>>>> take part in Compaction process since there are o changes to them? >>>>>> >>>>>> What happens to read: >>>>>> I issue query: >>>>>> select some_attributes >>>>>> from events where ymd >= 20150101 and ymd < 20150301 >>>>>> Does Cassandra skip SSTables which don't have ymd in specified range >>>>>> and give me a kind of partition elimination, like in traditional DBs? >>>>>> >>>>>> >>>>>> 2015-04-04 14:41 GMT+02:00 Jack Krupansky <jack.krupan...@gmail.com>: >>>>>> >>>>>>> It depends on the actual number of events per user, but simply >>>>>>> bucketing the partition key can give you the same effect - clustering >>>>>>> rows >>>>>>> by time range. A composite partition key could be comprised of the user >>>>>>> name and the date. >>>>>>> >>>>>>> It also depends on the data rate - is it many events per day or just >>>>>>> a few events per week, or over what time period. You need to be careful >>>>>>> - >>>>>>> you don't want your Cassandra partitions to be too big (millions of >>>>>>> rows) >>>>>>> or too small (just a few or even one row per partition.) >>>>>>> >>>>>>> -- Jack Krupansky >>>>>>> >>>>>>> On Sat, Apr 4, 2015 at 7:03 AM, Serega Sheypak < >>>>>>> serega.shey...@gmail.com> wrote: >>>>>>> >>>>>>>> Hi, I switched from HBase to Cassandra and try to find problem >>>>>>>> solution for timeseries analysis on top Cassandra. >>>>>>>> I have a entity named "Event". >>>>>>>> "Event" has attributes: >>>>>>>> user_id - a guy who triggered event >>>>>>>> event_ts - when even happened >>>>>>>> event_type - type of event >>>>>>>> some_other_attr - some other attrs we don't care about right now. >>>>>>>> >>>>>>>> The DDL for entity event looks this way: >>>>>>>> >>>>>>>> CREATE TABLE user_plans ( >>>>>>>> >>>>>>>> id timeuuid, >>>>>>>> user_id timeuuid, >>>>>>>> event_ts timestamp, >>>>>>>> event_type int, >>>>>>>> some_other_attr text >>>>>>>> >>>>>>>> PRIMARY KEY (user_id, ends) >>>>>>>> ); >>>>>>>> >>>>>>>> Table is "infinite", It would grow continuously during application >>>>>>>> lifetime. >>>>>>>> I want to ask question: >>>>>>>> Cassandra, give me all event where event_ts >= xxx >>>>>>>> and event_ts <=yyy. >>>>>>>> >>>>>>>> Right now it would lead to full table scan. >>>>>>>> >>>>>>>> There is a trick in HBase. HBase has table abstraction and HBase >>>>>>>> has Column Family abstraction. >>>>>>>> Column family should be declared in advance. >>>>>>>> Column family - physically is a pack of HFiles ("SSTables in C*"). >>>>>>>> So I can easily add partitioning for my HBase table: >>>>>>>> alter table hbase_events add column familiy '2015_01' >>>>>>>> and store all 2015 January data to Column familiy named '2015_01'. >>>>>>>> >>>>>>>> When I want to get January data, I would directly access column >>>>>>>> family named '2015_01' and I won't massage all data in table, just this >>>>>>>> piece. >>>>>>>> >>>>>>>> What is approach in C* in this case? >>>>>>>> I have an idea create several tables: event_2015_01, event_2015_02, >>>>>>>> e.t.c. but it looks rather ugly from my current understanding how it >>>>>>>> works. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> > > > -- > > Founder/CEO Spinn3r.com > Location: *San Francisco, CA* > blog: http://burtonator.wordpress.com > … or check out my Google+ profile > <https://plus.google.com/102718274791889610666/posts> > <http://spinn3r.com> > >