> Is the requirement to pre-aggregate by time window?

No, I am thinking of creating a column, say "minute". It's basically the minute field of the timestamp column (possibly rounded to a 5-min bucket, depending on the needs). So it's a computed column that is filled in at data ingestion. My goal is for this field to help with data filtering at read/query time, say selecting a certain projection at minutes 10-15, to speed up the read queries.
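To make the computed column concrete, here is a minimal sketch in plain Python of how the "minute" value could be derived at ingestion and then used as a filter. The name `minute_bucket` and the sample timestamps are made up for illustration; nothing here calls the Kudu API, and a real filter would be expressed as a scan predicate on the stored column.

```python
from datetime import datetime, timezone


def minute_bucket(ts: datetime, width: int = 5) -> int:
    """Return the minute-of-hour of ts, rounded down to a multiple of width.

    The result is always in 0..59, so it fits in an INT8 column.
    """
    if width <= 0 or 60 % width != 0:
        raise ValueError("width must be a positive divisor of 60")
    return (ts.minute // width) * width


# Hypothetical sample timestamps, all within the same hour.
rows = [
    datetime(2016, 5, 12, 10, m, tzinfo=timezone.utc)
    for m in (3, 11, 14, 17, 42)
]

# Equivalent of filtering on the computed column: keep minutes 10-14.
selected = [ts for ts in rows if minute_bucket(ts) == 10]
```

With `width=1` the function returns the plain minute field, so dropping down to 1-min bars changes only the computed value, not the column type.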
Thanks for the info, I will follow them.

On Thu, May 12, 2016 at 10:50 AM, Dan Burkert <d...@cloudera.com> wrote:

> Hey Sand,
>
> Sorry for the delayed response. I'm not quite following your use case. Is the requirement to pre-aggregate by time window? I don't think Kudu can help you directly with that (nothing built in), but you could always create a separate table to store the pre-aggregated values. As far as applying functions to do row splits, that is an interesting idea, but I think once Kudu has support for range bounds (the non-covering range partition design doc linked above), you can simply create the bounds where the function would have put them. For example, if you want a partition for every five minutes, you can create the bounds accordingly.
>
> Earlier this week I gave a talk on timeseries in Kudu, I've included some slides that may be interesting to you. Additionally, you may want to check out https://github.com/danburkert/kudu-ts, it's a very young (not feature complete) metrics layer on top of Kudu, it may give you some ideas.
>
> - Dan
>
> On Sat, May 7, 2016 at 1:28 PM, Sand Stone <sand.m.st...@gmail.com> wrote:
>
>> Thanks for sharing, Dan. The diagrams explained clearly how the current system works.
>>
>> As for things in my mind. Take the schema of <host,metric,time,...>, say, I am interested in data for the past 5 mins, 10 mins, etc. Or, aggregate at 5 mins interval for the past 3 days, 7 days, ... Looks like I need to introduce a special 5-min bar column, use that column to do range partition to spread data across the tablet servers so that I could leverage parallel filtering.
>>
>> The cost of this extra column (INT8) is not ideal but not too bad either (storage cost wise, compression should do wonders). So I am thinking whether it would be better to take "functions" as row split instead of only constants.
>> Of course if business requires to drop down to 1-min bar, the data has to be re-sharded again. So a more cost effective way of doing this on a production cluster would be good.
>>
>> On Sat, May 7, 2016 at 8:50 AM, Dan Burkert <d...@cloudera.com> wrote:
>>
>>> Hi Sand,
>>>
>>> I've been working on some diagrams to help explain some of the more advanced partitioning types, it's attached. Still pretty rough at this point, but the goal is to clean it up and move it into the Kudu documentation proper. I'm interested to hear what kind of time series you are interested in Kudu for. I'm tasked with improving Kudu for time series, you can follow progress here <https://issues.apache.org/jira/browse/KUDU-1306>. If you have any additional ideas I'd love to hear them. You may also be interested in a small project that a JD and I have been working on in the past week to build an OpenTSDB style store on top of Kudu, you can find it here <https://github.com/danburkert/kudu-ts>. Still quite feature limited at this point.
>>>
>>> - Dan
>>>
>>> On Fri, May 6, 2016 at 4:51 PM, Sand Stone <sand.m.st...@gmail.com> wrote:
>>>
>>>> Thanks. Will read.
>>>>
>>>> Given that I am researching time series data, row locality is crucial :-)
>>>>
>>>> On Fri, May 6, 2016 at 3:57 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>
>>>>> We do have non-covering range partitions coming in the next few months, here's the design (in review):
>>>>> http://gerrit.cloudera.org:8080/#/c/2772/9/docs/design-docs/non-covering-range-partitions.md
>>>>>
>>>>> The "Background & Motivation" section should give you a good idea of why I'm mentioning this.
>>>>>
>>>>> Meanwhile, if you don't need row locality, using hash partitioning could be good enough.
>>>>>
>>>>> J-D
>>>>>
>>>>> On Fri, May 6, 2016 at 3:53 PM, Sand Stone <sand.m.st...@gmail.com> wrote:
>>>>>
>>>>>> Makes sense.
>>>>>>
>>>>>> Yeah it would be cool if users could specify/control the split rows after the table is created. Now, I have to "think ahead" to pre-create the range buckets.
>>>>>>
>>>>>> On Fri, May 6, 2016 at 3:49 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>>>>>
>>>>>>> You will only get 1 tablet and no data distribution, which is bad.
>>>>>>>
>>>>>>> That's also how HBase works, but it will split regions as you insert data and eventually you'll get some data distribution even if it doesn't start in an ideal situation. Tablet splitting will come later for Kudu.
>>>>>>>
>>>>>>> J-D
>>>>>>>
>>>>>>> On Fri, May 6, 2016 at 3:42 PM, Sand Stone <sand.m.st...@gmail.com> wrote:
>>>>>>>
>>>>>>>> One more questions, how does the range partition work if I don't specify the split rows?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> On Fri, May 6, 2016 at 3:37 PM, Sand Stone <sand.m.st...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks, Misty. The "advanced" impala example helped.
>>>>>>>>>
>>>>>>>>> I was just reading the Java API,CreateTableOptions.java, it's unclear how the range partition column names associated with the partial rows params in the addSplitRow API.
>>>>>>>>>
>>>>>>>>> On Fri, May 6, 2016 at 3:08 PM, Misty Stanley-Jones <mstanleyjo...@cloudera.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Sand,
>>>>>>>>>>
>>>>>>>>>> Please have a look at http://getkudu.io/docs/kudu_impala_integration.html#partitioning_tables and see if it is helpful to you.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Misty
>>>>>>>>>>
>>>>>>>>>> On Fri, May 6, 2016 at 2:00 PM, Sand Stone <sand.m.st...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi, I am new to Kudu. I wonder how the split rows work. I know from some docs, this is currently for pre-creation the table.
>>>>>>>>>>> I am researching how to partition (hash+range) some time series test data.
>>>>>>>>>>>
>>>>>>>>>>> Is there an example? or notes somewhere I could read upon.
>>>>>>>>>>>
>>>>>>>>>>> Thanks much.
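
For the "partition for every five minutes" idea discussed in the thread, the bounds themselves can be computed ahead of time. The sketch below is plain Python and does not use the Kudu client; `range_bounds` is a hypothetical helper, and representing each bound as epoch seconds is an assumption made only for illustration. The resulting pairs would still have to be handed to whatever table-creation API is in use.

```python
from datetime import datetime, timedelta, timezone


def range_bounds(start, end, step=timedelta(minutes=5)):
    """Return contiguous [lower, upper) bounds covering start..end.

    Each bound is a pair of epoch seconds (an illustrative choice).
    """
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    bounds = []
    lower = start
    while lower < end:
        # The last partition is clipped so it never extends past end.
        upper = min(lower + step, end)
        bounds.append((int(lower.timestamp()), int(upper.timestamp())))
        lower = upper
    return bounds


# Hypothetical 15-minute window split into five-minute partitions.
window_start = datetime(2016, 5, 12, 10, 0, tzinfo=timezone.utc)
window_end = window_start + timedelta(minutes=15)
bounds = range_bounds(window_start, window_end)
```

Switching to 1-min bars is a matter of passing `step=timedelta(minutes=1)`; the re-sharding cost raised in the thread is not addressed by this sketch.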