I understand that Hive and Hadoop are designed to run many jobs at once, so most tuning parameters are geared toward increasing the throughput of a Hadoop cluster rather than reducing latency. In our case, we use Elastic Map Reduce to run a single Hive script on a daily basis, so our top priority is making that one script run faster. So far, it's been a pretty frustrating experience. I am curious whether there are workarounds for the things that are not easy to tune:
1) Hadoop lets you configure mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum individually, but there is no way to cap the total of the two. In my jobs the mappers always finish before the reducers, and I wish I could run one more reducer once no mappers are running alongside them. That doesn't seem to be possible.

2) Similarly, there is only one parameter controlling per-task memory: mapred.child.java.opts. So if a box is configured for 4 mappers and 2 reducers, I have to set the child heap to less than 1/6 of the total memory available. The problem is that once the mappers are done, 4/6 (two thirds) of the memory sits essentially unused. Is there anything I can do about that?

3) Another odd thing is that there is no easy way to run a single wave of reducers, which, as I understand it, is the optimal scenario in most cases. To make it work, I have to know the total number of reducer slots in the cluster and then set mapred.reduce.tasks accordingly. EMR appears to have a solution for this (mapred.reduce.tasksperslot), but it doesn't seem to work for me.

Any suggestions would be greatly appreciated!

Thank you,
igor
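P.S. To make the memory arithmetic in (2) concrete, here is a sketch of how I'm sizing the child heap today. The instance size (roughly 7 GB usable for task JVMs) and the script name are assumptions for illustration, not my actual values:

```shell
# Assumed box: ~7 GB usable for task JVMs, split across 4 map + 2 reduce slots.
TOTAL_MB=7168
MAP_SLOTS=4
REDUCE_SLOTS=2

# One heap setting has to cover both phases, so divide by the combined slot count.
HEAP_MB=$((TOTAL_MB / (MAP_SLOTS + REDUCE_SLOTS)))   # 1194 MB per child JVM

# The same mapred.child.java.opts applies to mappers and reducers alike,
# which is exactly why two thirds of the memory idles once the map phase ends.
hive -hiveconf mapred.child.java.opts="-Xmx${HEAP_MB}m" -f daily_script.hql
```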
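P.S. For (3), the manual workaround I've been trying is to compute the reducer-slot total by hand and pass it in before the query runs. A sketch; the node count, per-node slot count, and script name are assumptions for illustration:

```shell
# Assumed cluster: 10 task-tracker nodes, each configured with
# mapred.tasktracker.reduce.tasks.maximum=2.
NODES=10
REDUCE_SLOTS_PER_NODE=2

# One reducer per slot => a single wave of reducers.
TOTAL_REDUCE_SLOTS=$((NODES * REDUCE_SLOTS_PER_NODE))   # 20

# Pass the computed value to Hive so every reducer starts in the first wave.
hive -hiveconf mapred.reduce.tasks="$TOTAL_REDUCE_SLOTS" -f daily_script.hql
```

This works, but it means hard-coding knowledge of the cluster size into the job submission, which is what I hoped mapred.reduce.tasksperslot would avoid.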