> The partition is by year/month/day/hour/minute. I have two directories - over
> two years, and the total number of records is 50Million.
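
For a sense of scale, here is a rough sketch of what per-minute partitioning works out to. It assumes exactly two 365-day years, a partition for every single minute, and the 50 million rows spread evenly; none of that is confirmed by the original post, and the function name is just for illustration.

```python
# Back-of-the-envelope sizing using the approximate figures quoted above.
MINUTES_PER_DAY = 24 * 60


def partition_stats(days, total_rows):
    """Return (partition count, average rows per partition) for per-minute partitions."""
    partitions = days * MINUTES_PER_DAY  # assumes every minute has its own partition
    return partitions, total_rows / partitions


# Assumption: "over two years" is taken as exactly 2 * 365 days.
partitions, rows_each = partition_stats(days=2 * 365, total_rows=50_000_000)
print(f"{partitions:,} partitions, ~{rows_each:.0f} rows each")
```

If some minutes have no data the partition count is lower, but the order of magnitude stays the same.
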
That's a million partitions with 50 rows in each of them?

> I am seeing it takes more than 1hr to complete. Any thoughts, on what could
> be the issue or approach that can be taken to improve the performance?

Looks like you have massively over-partitioned your data. The 1 hour might be partly query planning across a million partitions, and the rest might be file-count related overheads.

At least in the case of ORC, I recommend that each partition contain at least 1 GB of data and that, if you really need to query down to finer levels, you use bloom filters (PARQUET-41 is not fixed yet, so YMMV) plus sorted ordering.

http://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/4

Cheers,
Gopal
