Thanks Gopal for the feedback.

Actually, we don't have that many partitions - there are lot of gaps both
in days and time events as well. But, I would like to understand when you
say " time spent might partly be query planning with million partitions"? I
presume, this is in producing the physical plan? -- does it spend time in
allocating group of partition directories to each map task? Can you
elaborate? or point me to any material to better understand this..

Thanks
VJ

On Thu, Dec 15, 2016 at 1:03 PM, Gopal Vijayaraghavan <[email protected]>
wrote:

> > The partition is by year/month/day/hour/minute. I have two directories -
> over two years, and the total number of records is 50Million.
>
> That's a million partitions with 50 rows in each of them?
>
> > I am seeing it takes more than 1hr to complete. Any thoughts, on what
> could be the issue or approach that can be taken to improve the performance?
>
> Looks like you have over-partitioned your data massively - the 1 hour
> might be partly query planning with million partitions and the rest might
> be file-count related overheads.
>
> At least in case of ORC, I recommend that the partitions contain at least
> 1 Gb of data & that if you really need to query down to finer levels, to
> use bloom filters (PARQUET-41 is not fixed yet, so YMMV) + sorted ordering.
>
> http://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/4
>
> Cheers,
> Gopal
>
>
>


-- 
*VJ Anand*
*Founder *
*Sankia*
[email protected]
925-640-1340
www.sankia.com

*Confidentiality Notice*: This e-mail message, including any attachments,
is for the sole use of the intended recipient(s) and may contain
confidential and privileged information. Any unauthorized review, use,
disclosure or distribution is prohibited. If you are not the intended
recipient, please contact the sender by reply e-mail and destroy all copies
of the original message

Reply via email to