> The partition is by year/month/day/hour/minute. I have two directories - over 
> two years, and the total number of records is 50Million.  

That's a million partitions with 50 rows in each of them?
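
A quick back-of-the-envelope check of that estimate, assuming two non-leap years and one partition for every minute (both are assumptions, your actual layout may have gaps):

```python
# Rough partition count for minute-level partitioning over two years.
MINUTES_PER_DAY = 24 * 60
DAYS = 2 * 365  # assumption: two non-leap years, no missing minutes

partitions = DAYS * MINUTES_PER_DAY
total_rows = 50_000_000  # "50Million" from the original question

rows_per_partition = total_rows / partitions

print(partitions)                    # 1051200
print(round(rows_per_partition, 1))  # 47.6
```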

> I am seeing it takes more than 1hr to complete. Any thoughts, on what could 
> be the issue or approach that can be taken to improve the performance?

Looks like you have massively over-partitioned your data - the 1 hour is 
probably partly query planning across a million partitions, and the rest is 
likely file-count related overhead.

At least in the case of ORC, I recommend that each partition contain at least 
1 GB of data, and that if you really need to query down to finer levels, you 
use bloom filters (PARQUET-41 is not fixed yet, so YMMV) + sorted ordering.
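
As a rough sketch of how to apply the 1 GB guideline, the snippet below picks the finest time granularity whose average partition is still at least 1 GB. The helper `finest_granularity` and the ~1 KB per-row size are hypothetical, made up for illustration; plug in your real on-disk size.

```python
# Hypothetical helper: choose the finest partition granularity whose average
# partition still holds at least `min_bytes` of data.
GRANULARITIES = [  # (name, partitions per year), ordered coarse to fine
    ("year", 1),
    ("month", 12),
    ("day", 365),
    ("hour", 365 * 24),
    ("minute", 365 * 24 * 60),
]


def finest_granularity(total_bytes, years, min_bytes=1024**3):
    """Return the finest granularity meeting min_bytes per partition, or None."""
    best = None
    for name, per_year in GRANULARITIES:
        avg_partition_bytes = total_bytes / (per_year * years)
        if avg_partition_bytes >= min_bytes:
            best = name  # keep going: a finer level may still qualify
    return best


# Assumed figure: 50M rows at roughly 1 KB each (not stated in the thread).
total = 50_000_000 * 1024
print(finest_granularity(total, years=2))
```

Under that assumed row size the data only justifies partitioning down to the month; anything finer is better served by sorting and bloom filters within the files.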

http://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/4

Cheers,
Gopal

