Re: Partition and Split rows

Sand Stone Thu, 12 May 2016 11:07:49 -0700

> Is the requirement to pre-aggregate by time window?
No, I am thinking of creating a column, say "minute". It's basically the
minute field of the timestamp column (possibly rounded to a 5-min bucket,
depending on the needs). So it's a computed column filled in at data ingestion.
My goal is for this field to help with data filtering at read/query
time, say selecting a certain projection for minutes 10-15, to speed up the
read queries.
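
To make the idea concrete, here is a minimal sketch in plain Python (it is not
Kudu API code). The function names, the sample rows, and the 5-minute width are
assumptions made up for illustration; only the <host,metric,time,...> shape
comes from this thread.

```python
from datetime import datetime, timezone

BUCKET_MINUTES = 5  # assumed bucket width; 1 would keep the raw minute field


def minute_bucket(ts: datetime, width: int = BUCKET_MINUTES) -> int:
    """Return the minute-of-hour of ts, rounded down to a multiple of width."""
    return (ts.minute // width) * width


def with_minute_column(row: dict) -> dict:
    """Ingestion step: copy the row and fill in the computed "minute" column."""
    out = dict(row)
    out["minute"] = minute_bucket(row["time"])
    return out


# Hypothetical sample rows in the <host, metric, time, ...> shape.
rows = [
    {"host": "h1", "metric": "cpu",
     "time": datetime(2016, 5, 12, 11, 7, 49, tzinfo=timezone.utc), "value": 0.42},
    {"host": "h1", "metric": "cpu",
     "time": datetime(2016, 5, 12, 11, 13, 2, tzinfo=timezone.utc), "value": 0.57},
    {"host": "h2", "metric": "cpu",
     "time": datetime(2016, 5, 12, 11, 31, 0, tzinfo=timezone.utc), "value": 0.10},
]

ingested = [with_minute_column(r) for r in rows]

# Read side: keep only rows whose bucket falls in minutes 10-15.
selected = [r for r in ingested if 10 <= r["minute"] <= 15]
```

With 5-minute buckets the column only takes the values 0, 5, ..., 55, so a
filter on "minutes 10-15" matches the buckets 10 and 15.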


Thanks for the info. I will follow up on them.
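
On the range bounds suggestion quoted below (one partition per five minutes),
here is a rough sketch of how the bound values could be computed ahead of time,
in plain Python. It calls no Kudu API. Treating the lower bound as inclusive and
the upper bound as exclusive is my assumption, and five_minute_bounds is a
made-up helper name.

```python
from datetime import datetime, timedelta, timezone


def five_minute_bounds(start: datetime, end: datetime, width_minutes: int = 5):
    """Return contiguous (lower, upper) pairs covering [start, end).

    Each pair is meant as one range partition: lower inclusive, upper
    exclusive (an assumption for this sketch).
    """
    step = timedelta(minutes=width_minutes)
    bounds = []
    lower = start
    while lower < end:
        upper = min(lower + step, end)  # clip the last range at end
        bounds.append((lower, upper))
        lower = upper
    return bounds


# Hypothetical window: half an hour starting at 11:00 UTC.
start = datetime(2016, 5, 12, 11, 0, tzinfo=timezone.utc)
end = start + timedelta(minutes=30)

for lower, upper in five_minute_bounds(start, end):
    print(lower.isoformat(), "->", upper.isoformat())
```

Dropping to 1-minute bars would then only mean generating the bounds with
width_minutes=1, although existing data would still need to be re-sharded.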

On Thu, May 12, 2016 at 10:50 AM, Dan Burkert <d...@cloudera.com> wrote:

> Hey Sand,
>
> Sorry for the delayed response.  I'm not quite following your use case.
> Is the requirement to pre-aggregate by time window? I don't think Kudu can
> help you directly with that (nothing built in), but you could always create
> a separate table to store the pre-aggregated values.  As far as applying
> functions to do row splits, that is an interesting idea, but I think once
> Kudu has support for range bounds (the non-covering range partition design
> doc linked above), you can simply create the bounds where the function
> would have put them.  For example, if you want a partition for every five
> minutes, you can create the bounds accordingly.
>
> Earlier this week I gave a talk on time series in Kudu; I've included some
> slides that may be interesting to you.  Additionally, you may want to check
> out https://github.com/danburkert/kudu-ts. It's a very young (not
> feature-complete) metrics layer on top of Kudu, and it may give you some ideas.
>
> - Dan
>
> On Sat, May 7, 2016 at 1:28 PM, Sand Stone <sand.m.st...@gmail.com> wrote:
>
>> Thanks for sharing, Dan. The diagrams explained clearly how the current
>> system works.
>>
>> As for what I have in mind: take the schema <host,metric,time,...>. Say
>> I am interested in data for the past 5 mins, 10 mins, etc., or in aggregates
>> at 5-min intervals for the past 3 days, 7 days, ... It looks like I need to
>> introduce a special 5-min bar column and use that column for range
>> partitioning, to spread data across the tablet servers so that I can
>> leverage parallel filtering.
>>
>> The cost of this extra column (INT8) is not ideal but not too bad either
>> (storage cost wise, compression should do wonders). So I am wondering
>> whether it would be better to accept "functions" as row splits instead of
>> only constants. Of course, if the business requires dropping down to a 1-min
>> bar, the data has to be re-sharded. So a more cost-effective way of doing
>> this on a production cluster would be good.
>>
>>
>>
>>
>> On Sat, May 7, 2016 at 8:50 AM, Dan Burkert <d...@cloudera.com> wrote:
>>
>>> Hi Sand,
>>>
>>> I've been working on some diagrams to help explain some of the more
>>> advanced partitioning types; it's attached.  It's still pretty rough at this
>>> point, but the goal is to clean it up and move it into the Kudu
>>> documentation proper.  I'm interested to hear what kind of time series you
>>> are considering Kudu for.  I'm tasked with improving Kudu for time
>>> series; you can follow progress here
>>> <https://issues.apache.org/jira/browse/KUDU-1306>. If you have any
>>> additional ideas I'd love to hear them.  You may also be interested in a
>>> small project that J-D and I have been working on in the past week to
>>> build an OpenTSDB-style store on top of Kudu; you can find it here
>>> <https://github.com/danburkert/kudu-ts>.  It's still quite feature limited
>>> at this point.
>>>
>>> - Dan
>>>
>>> On Fri, May 6, 2016 at 4:51 PM, Sand Stone <sand.m.st...@gmail.com>
>>> wrote:
>>>
>>>> Thanks. Will read.
>>>>
>>>> Given that I am researching time series data, row locality is crucial
>>>> :-)
>>>>
>>>> On Fri, May 6, 2016 at 3:57 PM, Jean-Daniel Cryans <jdcry...@apache.org>
>>>> wrote:
>>>>
>>>>> We do have non-covering range partitions coming in the next few
>>>>> months, here's the design (in review):
>>>>> http://gerrit.cloudera.org:8080/#/c/2772/9/docs/design-docs/non-covering-range-partitions.md
>>>>>
>>>>> The "Background & Motivation" section should give you a good idea of
>>>>> why I'm mentioning this.
>>>>>
>>>>> Meanwhile, if you don't need row locality, using hash partitioning
>>>>> could be good enough.
>>>>>
>>>>> J-D
>>>>>
>>>>> On Fri, May 6, 2016 at 3:53 PM, Sand Stone <sand.m.st...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Makes sense.
>>>>>>
>>>>>> Yeah it would be cool if users could specify/control the split rows
>>>>>> after the table is created. Now, I have to "think ahead" to pre-create
>>>>>> the range buckets.
>>>>>>
>>>>>> On Fri, May 6, 2016 at 3:49 PM, Jean-Daniel Cryans <
>>>>>> jdcry...@apache.org> wrote:
>>>>>>
>>>>>>> You will only get 1 tablet and no data distribution, which is bad.
>>>>>>>
>>>>>>> That's also how HBase works, but it will split regions as you insert
>>>>>>> data and eventually you'll get some data distribution even if it doesn't
>>>>>>> start in an ideal situation. Tablet splitting will come later for Kudu.
>>>>>>>
>>>>>>> J-D
>>>>>>>
>>>>>>> On Fri, May 6, 2016 at 3:42 PM, Sand Stone <sand.m.st...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> One more question: how does the range partition work if I don't
>>>>>>>> specify the split rows?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> On Fri, May 6, 2016 at 3:37 PM, Sand Stone <sand.m.st...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks, Misty. The "advanced" impala example helped.
>>>>>>>>>
>>>>>>>>> I was just reading the Java API (CreateTableOptions.java), and it's
>>>>>>>>> unclear how the range partition column names are associated with the
>>>>>>>>> partial row params in the addSplitRow API.
>>>>>>>>>
>>>>>>>>> On Fri, May 6, 2016 at 3:08 PM, Misty Stanley-Jones <
>>>>>>>>> mstanleyjo...@cloudera.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Sand,
>>>>>>>>>>
>>>>>>>>>> Please have a look at
>>>>>>>>>> http://getkudu.io/docs/kudu_impala_integration.html#partitioning_tables
>>>>>>>>>> and see if it is helpful to you.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Misty
>>>>>>>>>>
>>>>>>>>>> On Fri, May 6, 2016 at 2:00 PM, Sand Stone <
>>>>>>>>>> sand.m.st...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi, I am new to Kudu. I wonder how the split rows work. I understand
>>>>>>>>>>> from some docs that they are currently specified when pre-creating
>>>>>>>>>>> the table. I am researching how to partition (hash+range) some time
>>>>>>>>>>> series test data.
>>>>>>>>>>>
>>>>>>>>>>> Is there an example, or notes somewhere I could read up on?
>>>>>>>>>>>
>>>>>>>>>>> Thanks much.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
