I understand that Hive and Hadoop are designed to run many jobs at once, so most tuning parameters are geared toward increasing the throughput of a Hadoop cluster rather than reducing latency. In our case, we use Elastic Map Reduce to run a single Hive script on a daily basis, so our top priority is making that one script run faster. So far, it's been a pretty frustrating experience. I am curious whether there are workarounds for the things that are not easy to tune:
1) Hadoop lets you configure mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum individually, but there is no way to cap the total of the two. In my jobs the mappers always finish before the reducers, and I wish I could run one more reducer once no mappers are running alongside them. That doesn't seem to be possible.

2) Similarly, there is only one parameter controlling per-task memory: mapred.child.java.opts. So if a box is configured for 4 mappers and 2 reducers, I have to set the child heap to less than 1/6 of the total memory available. The problem is that once the mappers are done, 4/6 (two thirds) of the memory sits essentially unused. Is there anything I can do about that?

3) Another odd thing is that there is no easy way to run a single wave of reducers, which, as I understand it, is the optimal scenario in most cases. To make it work, I have to know the total number of reducer slots in the cluster and then set mapred.reduce.tasks accordingly. EMR appears to have a solution for this (mapred.reduce.tasksperslot), but it doesn't seem to work for me.

Any suggestions would be greatly appreciated!

Thank you,
igor
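P.S. To make the memory arithmetic in (2) concrete, here is a sketch of how I'm sizing the child heap today. The instance size (roughly 7 GB usable for task JVMs) and the script name are assumptions for illustration, not my actual values:

```shell
# Assumed box: ~7 GB usable for task JVMs, split across 4 map + 2 reduce slots.
TOTAL_MB=7168
MAP_SLOTS=4
REDUCE_SLOTS=2

# One heap setting has to cover both phases, so divide by the combined slot count.
HEAP_MB=$((TOTAL_MB / (MAP_SLOTS + REDUCE_SLOTS)))   # 1194 MB per child JVM

# The same mapred.child.java.opts applies to mappers and reducers alike,
# which is exactly why two thirds of the memory idles once the map phase ends.
hive -hiveconf mapred.child.java.opts="-Xmx${HEAP_MB}m" -f daily_script.hql
```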
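P.S. For (3), the manual workaround I've been trying is to compute the reducer-slot total by hand and pass it in before the query runs. A sketch; the node count, per-node slot count, and script name are assumptions for illustration:

```shell
# Assumed cluster: 10 task-tracker nodes, each configured with
# mapred.tasktracker.reduce.tasks.maximum=2.
NODES=10
REDUCE_SLOTS_PER_NODE=2

# One reducer per slot => a single wave of reducers.
TOTAL_REDUCE_SLOTS=$((NODES * REDUCE_SLOTS_PER_NODE))   # 20

# Pass the computed value to Hive so every reducer starts in the first wave.
hive -hiveconf mapred.reduce.tasks="$TOTAL_REDUCE_SLOTS" -f daily_script.hql
```

This works, but it means hard-coding knowledge of the cluster size into the job submission, which is what I hoped mapred.reduce.tasksperslot would avoid.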