Thanks Gopal for the feedback. Actually, we don't have that many partitions - there are lot of gaps both in days and time events as well. But, I would like to understand when you say " time spent might partly be query planning with million partitions"? I presume, this is in producing the physical plan? -- does it spend time in allocating group of partition directories to each map task? Can you elaborate? or point me to any material to better understand this..
Thanks VJ On Thu, Dec 15, 2016 at 1:03 PM, Gopal Vijayaraghavan <[email protected]> wrote: > > The partition is by year/month/day/hour/minute. I have two directories - > over two years, and the total number of records is 50Million. > > That's a million partitions with 50 rows in each of them? > > > I am seeing it takes more than 1hr to complete. Any thoughts, on what > could be the issue or approach that can be taken to improve the performance? > > Looks like you have over-partitioned your data massively - the 1 hour > might be partly query planning with million partitions and the rest might > be file-count related overheads. > > At least in case of ORC, I recommend that the partitions contain at least > 1 Gb of data & that if you really need to query down to finer levels, to > use bloom filters (PARQUET-41 is not fixed yet, so YMMV) + sorted ordering. > > http://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/4 > > Cheers, > Gopal > > > -- *VJ Anand* *Founder * *Sankia* [email protected] 925-640-1340 www.sankia.com *Confidentiality Notice*: This e-mail message, including any attachments, is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply e-mail and destroy all copies of the original message
