Bucketing does deal with that. If you bucket on a column, you always get the same number of files as buckets, because you're hashing each value into a bucket.
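
To make the hashing point concrete, here is a rough sketch in Python (not Hive code). The bucket count of 8, the source names, and the use of crc32 as the hash are all assumptions for illustration; Hive uses its own hash function. The point is only that the number of non-empty buckets is capped by the bucket count, whether a day has 4 sources or 14.

```python
import zlib

NUM_BUCKETS = 8  # hypothetical bucket count, fixed when the table is created


def bucket_for(source: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a source value to a bucket index in [0, num_buckets).

    crc32 is a stand-in hash chosen for determinism; it is NOT the
    hash function Hive itself uses.
    """
    return zlib.crc32(source.encode("utf-8")) % num_buckets


def buckets_used(sources):
    """Return the set of bucket indexes that receive at least one source."""
    return {bucket_for(s) for s in sources}


# Hypothetical source names for a quiet day (4 sources) and a busy day (14).
quiet_day = [f"source_{i}" for i in range(4)]
busy_day = [f"source_{i}" for i in range(14)]

for name, day in (("quiet", quiet_day), ("busy", busy_day)):
    used = buckets_used(day)
    print(f"{name}: {len(day)} sources -> {len(used)} of {NUM_BUCKETS} buckets non-empty")
```

Several sources can land in the same bucket, so a bucket file is not the same thing as a per-source partition.
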
A query that scans many partitions and files is needlessly slow because of MapReduce overhead.

On Sat, Aug 10, 2013 at 12:58 PM, John Omernik <[email protected]> wrote:

> One issue with the bucketing is that the number of sources on any given
> day is dynamic. On some days it's 4, others it's 14 and it's also
> constantly changing. I am hoping to use some of the features of the ORC
> files to almost make virtual partitions, but apparently I am going to run
> into issues either way.
>
> On another note, is there a limit to Hive and partitions? I am hovering
> around 10k partitions on one table right now. It's still working, but some
> metadata operations can take a long time. The sub-partitions are going to
> hurt me here going forward I am guessing, so it may be worth flattening out
> to only days, even at the expense of read queries... thoughts?
>
>
> On Sat, Aug 10, 2013 at 11:46 AM, Nitin Pawar <[email protected]> wrote:
>
>> Agree with Edward,
>>
>> the whole purpose of bucketing for me is to prune the data in the where
>> clause. Else it totally defeats the purpose of splitting data into a
>> finite number of identifiable distributions to improve the performance.
>>
>> But is my understanding correct that it does help in reducing the number
>> of sub-partitions we create at the bottom of the table, if we identify
>> that the pattern does not exceed a finite number of values on that
>> partition? (Even if it crosses this limit, bucketing does take care of it
>> up to some volume.)
>>
>>
>> On Sat, Aug 10, 2013 at 10:09 PM, Edward Capriolo
>> <[email protected]> wrote:
>>
>>> So there is one thing to be really careful about with bucketing. Say you
>>> bucket a table into 10 buckets: a select with a where clause does not
>>> actually prune the input buckets, so many queries scan all the buckets.
>>>
>>>
>>> On Sat, Aug 10, 2013 at 12:34 PM, Nitin Pawar
>>> <[email protected]> wrote:
>>>
>>>> Will bucketing help, if you know there is a finite # of partitions?
>>>>
>>>>
>>>> On Sat, Aug 10, 2013 at 9:26 PM, John Omernik <[email protected]> wrote:
>>>>
>>>>> I have a table that currently uses RC files and has two levels of
>>>>> partitions: day and source. The table is first partitioned by day, then
>>>>> within each day there are 6-15 source partitions. This makes for a lot
>>>>> of crazy partitions, and I was wondering if there'd be a way to
>>>>> optimize this with ORC files and some sorting.
>>>>>
>>>>> Specifically, would there be a way in a new table to make source a
>>>>> field (removing the partition) and somehow, as I am inserting into this
>>>>> new setup, sort by source in such a way that will help separate the
>>>>> files/indexes in a way that gives me almost the same performance as ORC
>>>>> with the two-level partitions? Just trying to optimize here and curious
>>>>> what people think.
>>>>>
>>>>> John
>>>>>
>>>>
>>>>
>>>> --
>>>> Nitin Pawar
>>>>
>>>
>>
>>
>> --
>> Nitin Pawar
>>
>
