I have a partitioned external Hive table stored as Parquet files. The
table is partitioned by year/month/day/hour/minute. The data sits in two
directories covering two years, and the total number of records is 50
million. My cluster has 5 nodes, each with 8 cores and 64GB of RAM, for a
total of 40 cores and about 300GB. I am running Hive with Tez as the
execution engine. Each container is set to 4GB and 1 VCore. Additionally, I
set the Tez min input split to 36MB and the max input split to the same
value, 36MB.
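
For reference, the settings described above would look roughly like this in
a Hive session. This is a sketch only: the property names are the standard
Hive-on-Tez ones and are my assumption of what maps to each setting, the
byte value is simply 36 * 1024 * 1024, and my_table is a placeholder name.

```sql
-- Assumed property names; values taken from the description above.
SET hive.execution.engine=tez;
SET hive.tez.container.size=4096;      -- container memory in MB (4GB)
SET hive.tez.cpu.vcores=1;             -- 1 VCore per container
SET tez.grouping.min-size=37748736;    -- 36MB in bytes
SET tez.grouping.max-size=37748736;    -- same value as the minimum

-- my_table is a placeholder for the partitioned external table.
SELECT COUNT(*) FROM my_table;
```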

When I submit the query "select count(*) from table", I see that it
allocates 43 map tasks and 1 reducer task.

The query takes more than 1 hour to complete. Any thoughts on what the
issue could be, or on an approach I could take to improve the performance?

Thanks
VJ
