I have a partitioned external Hive table stored as Parquet files. It is partitioned by year/month/day/hour/minute. The data spans a little over two years, so there are two top-level year directories, and the total number of records is 50 million. My cluster has 5 nodes, each with 8 cores and 64GB of RAM, for a total of 40 cores and roughly 300GB of RAM. I am running Hive with Tez as the execution engine. Each container is set to 4GB of memory and 1 vcore. Additionally, I set both the Tez min input split size and the max input split size to the same value, 36MB.
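For reference, the settings described above would look roughly like this as session-level Hive commands. This is a sketch under assumptions: it takes the "min/max input split" to mean the Tez grouping properties `tez.grouping.min-size` and `tez.grouping.max-size` (specified in bytes), and the container settings to mean `hive.tez.container.size` (in MB) and `hive.tez.cpu.vcores`. The post does not name the exact properties, so the actual configuration may differ.

```sql
-- Assumed mapping of the settings described in the post to Hive/Tez properties.
SET hive.execution.engine=tez;

-- 4GB per Tez container, 1 vcore per container.
SET hive.tez.container.size=4096;
SET hive.tez.cpu.vcores=1;

-- Min and max split grouping size, both 36MB (36 * 1024 * 1024 bytes).
SET tez.grouping.min-size=37748736;
SET tez.grouping.max-size=37748736;
```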
When I submit the query SELECT COUNT(*) FROM table, it allocates 43 map tasks and 1 reducer task, and it takes more than an hour to complete. Any thoughts on what the issue could be, or on an approach I could take to improve performance?

Thanks,
VJ
