Hi, I am testing Hive 0.6 on part of my data set. It's only a couple of GB of log files that I am reading through a custom SerDe. The table is partitioned. I am using Hadoop local mode for testing.
When I run simple GROUP BY queries (4 MR jobs), I get log output such as

map : 100% - reduce : 0%
map : 85% - reduce : 0%
map : 86% - reduce : 0%

all the while using only one core on an 8-core server. Kind of a waste...

I have activated the parallel option, but it still won't parallelize. I have set the number of reduce tasks to 8.

My expectation is that since my data set is partitioned (=> different files), at least some of the map-reduce phases could run in parallel on those files. Is my understanding wrong? Is there a specific way to write the queries?

Thanks,
Philippe
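
P.S. For reference, the setup described above would typically be entered in the Hive CLI roughly as follows. This is only a sketch: the post does not show the exact commands, so the property names (hive.exec.parallel, mapred.reduce.tasks, mapred.job.tracker) are assumed to be the ones meant by "the parallel option", "number of reduce tasks" and "local mode".

```sql
-- Assumption: "the parallel option" refers to hive.exec.parallel,
-- which lets independent stages of one query be launched concurrently.
SET hive.exec.parallel=true;

-- Assumption: the reducer count was requested through mapred.reduce.tasks.
SET mapred.reduce.tasks=8;

-- Assumption: Hadoop local mode is selected with this job tracker setting.
SET mapred.job.tracker=local;
```

Running SET with just the property name (for example SET hive.exec.parallel;) prints the current value, which is a quick way to check that a setting actually took effect in the session.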