John, [[ this is from my experience and I may be wrong ]]
the number of partitions you can have depends mainly on two factors:
1) how much memory (RAM) your Hive client possesses
2) how many users will use the same Hive machines at any given point in time

I have easily crossed 10k+ partitions in my experience. I have worked with
data partitioned by date, with different criteria for sub-partitions (around
24*365*n partitions in total; I had a 64GB RAM machine to myself and still
managed to hit OOM exceptions). I did not bother debugging, and instead spent
the time optimising query performance so it would not fail.

The best aspect of bucketing, as I understand it, is that it is evaluated
under a partition (if you don't partition, the entire table is one
partition). Hive takes care of bucketing by hashing inside a partition, and
if the unique values in your bucketed column outnumber the buckets you
choose, it takes care of that as well.

Based on the approach above, I built a shell-script-based worker engine that
limited the data per query and then limited the failures. If you want me to
explain more about how we handled more than 200 billion records, we can talk
on a different thread :)

On Sat, Aug 10, 2013 at 11:05 PM, John Omernik <[email protected]> wrote:

> Are there any effective limits on the number of partitions? Partitions are
> the answer we chose because it makes logical sense, i.e. I have days, and
> on a given day I have a number of sources. Sometimes I want to query by day
> and search all sources; other times I want to focus on specific sources.
> With bucketing, will it prune on the column automatically, like partitions
> do? (Remember, this is specific to the ORC files I am working with here.)
>
>
> On Sat, Aug 10, 2013 at 12:19 PM, Edward Capriolo <[email protected]> wrote:
>
>> Bucketing does deal with that: if you bucket on a column you always get
>> bucket-number-of-files, because you are hashing the value into a bucket.
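[A hedged aside on the hashing described above: Hive assigns each row to a
bucket by hashing the bucketing column modulo the number of buckets, which
can be approximated with the built-in hash() and pmod() UDFs. The table and
column names below are illustrative, not from this thread.]

```sql
-- Sketch: with 16 buckets, every distinct `source` value maps to one of
-- 16 bucket ids, so even hundreds of distinct sources produce at most 16
-- files per partition.
SELECT source,
       pmod(hash(source), 16) AS bucket_id
FROM events
GROUP BY source;
```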
>>
>> A query scanning many partitions and files is needlessly slow from MR
>> overhead.
>>
>>
>> On Sat, Aug 10, 2013 at 12:58 PM, John Omernik <[email protected]> wrote:
>>
>>> One issue with the bucketing is that the number of sources on any given
>>> day is dynamic. On some days it's 4, on others it's 14, and it's also
>>> constantly changing. I am hoping to use some of the features of the ORC
>>> files to almost make virtual partitions, but apparently I am going to
>>> run into issues either way.
>>>
>>> On another note, is there a limit to Hive and partitions? I am hovering
>>> around 10k partitions on one table right now. It's still working, but
>>> some metadata operations can take a long time. The sub-partitions are
>>> going to hurt me here going forward, I am guessing, so it may be worth
>>> flattening out to only days, even at the expense of read queries...
>>> thoughts?
>>>
>>>
>>> On Sat, Aug 10, 2013 at 11:46 AM, Nitin Pawar <[email protected]> wrote:
>>>
>>>> Agree with Edward,
>>>>
>>>> the whole purpose of bucketing, for me, is to prune the data in the
>>>> where clause. Otherwise it totally defeats the purpose of splitting
>>>> data into a finite number of identifiable distributions to improve
>>>> performance.
>>>>
>>>> But is my understanding correct that bucketing helps limit the number
>>>> of sub-partitions we create at the bottom of a table, if we can
>>>> identify that the data does not exceed a finite number of values in
>>>> that partition? (Even if it crosses this limit, bucketing takes care
>>>> of it up to some volume.)
>>>>
>>>>
>>>> On Sat, Aug 10, 2013 at 10:09 PM, Edward Capriolo <[email protected]> wrote:
>>>>
>>>>> So there is one thing to be really careful about with bucketing. Say
>>>>> you bucket a table into 10 buckets: a select with a where clause does
>>>>> not actually prune the input buckets, so many queries scan all the
>>>>> buckets.
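[To make the caveat above concrete, again with illustrative names: a
predicate on the partition column prunes whole directories, but a predicate
on the bucketed column still reads every bucket file in the surviving
partitions. Where bucketing does pay off at read time is sampling (and
bucketed map joins), e.g. via TABLESAMPLE.]

```sql
-- Partition pruning works: only the 2013-08-10 directory is read.
SELECT count(*) FROM events WHERE day = '2013-08-10';

-- Bucket pruning does not: all bucket files under the matching partition
-- are scanned, even though `source` is the bucketing column.
SELECT count(*) FROM events WHERE day = '2013-08-10' AND source = 'dns';

-- Bucketing does let you read a single bucket explicitly:
SELECT * FROM events TABLESAMPLE (BUCKET 1 OUT OF 16 ON source);
```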
>>>>>
>>>>>
>>>>> On Sat, Aug 10, 2013 at 12:34 PM, Nitin Pawar <[email protected]> wrote:
>>>>>
>>>>>> will bucketing help, if you know a finite # of partitions?
>>>>>>
>>>>>>
>>>>>> On Sat, Aug 10, 2013 at 9:26 PM, John Omernik <[email protected]> wrote:
>>>>>>
>>>>>>> I have a table that currently uses RC files and has two levels of
>>>>>>> partitions: day and source. The table is first partitioned by day,
>>>>>>> then within each day there are 6-15 source partitions. This makes
>>>>>>> for a lot of crazy partitions, and I was wondering if there'd be a
>>>>>>> way to optimize this with ORC files and some sorting.
>>>>>>>
>>>>>>> Specifically, would there be a way in a new table to make source a
>>>>>>> field (removing the partition) and somehow, as I am inserting into
>>>>>>> this new setup, sort by source in such a way that it helps separate
>>>>>>> the files/indexes enough to give me almost the same performance as
>>>>>>> ORC with the two-level partitions? Just trying to optimize here,
>>>>>>> and curious what people think.
>>>>>>>
>>>>>>> John
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Nitin Pawar
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>>
>>>
>>
>
-- 
Nitin Pawar
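[If it helps, the layout John describes (day as the only partition key,
source folded into a fixed number of buckets, stored as ORC) might be
sketched roughly as below. This is only a sketch under assumptions: the
table name, columns, and bucket count are invented, and per the caveat
earlier in the thread, a WHERE on source will not prune buckets.]

```sql
-- Hypothetical DDL: one partition per day, and a fixed 16 files per day
-- regardless of how many sources show up that day.
CREATE TABLE events (
  source  STRING,
  payload STRING
)
PARTITIONED BY (day STRING)
CLUSTERED BY (source) SORTED BY (source) INTO 16 BUCKETS
STORED AS ORC;

-- Populating it; older Hive releases need bucketing enforced explicitly:
SET hive.enforce.bucketing = true;
INSERT INTO TABLE events PARTITION (day = '2013-08-10')
SELECT source, payload
FROM staging_events
WHERE day = '2013-08-10';
```

[Sorting by source within each bucket should also help ORC's built-in
indexes skip stripes when filtering on source, which is close to the
"virtual partition" effect John is after.]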
