Hi guys,
I am trying to understand how Tez works internally. I opened the
Tez wordcount example and saw that at some point its classes refer
back to Hadoop MapReduce classes. Because of that, and because the
Tokenizer vertex has to finish before the Summation vertex can start,
I can't see how this differs from a normal MapReduce job, where the
maps have to finish before the reduce can start. There must be some
difference, since performance is clearly better than with Hadoop's
wordcount example (although I'd need dedicated machines to be sure
about this; I am running VMs on my rather old laptop).
Also, I see Tez has tez.runtime.io.sort.mb set to 32 MB, so I set
mapreduce.task.io.sort.mb to 32 and ran a comparison. The "old"
wordcount generates 21 spill files (named something like
attempt_1411492856064_0001_m_000000_0_spill_XX.out) of about 712 KB
each, while the Tez wordcount generates 7 such files of about 37 MB
each. Considering I am working on a 120 MB input file (a little less
than one HDFS block), Tez writes far more than MapReduce to the
temporary directory. I thought these files were the intermediate map
results, but I can't see how they could be so much larger than both
the original input and the MapReduce wordcount's spill files.
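To put rough numbers on that comparison, here is a quick tally of the figures above. It is only a sketch: it assumes every spill file has exactly the size I quoted and that KB and MB are binary units, neither of which I have verified.

```python
# Rough totals for the spill files reported above.
# Assumption: all files in a run have the same size, and the quoted
# sizes (712 KB, 37 MB, 120 MB) are exact binary-unit values.
KB = 1024
MB = 1024 * KB


def total_spill_bytes(file_count, bytes_per_file):
    """Total bytes written across equally sized spill files."""
    return file_count * bytes_per_file


input_bytes = 120 * MB
mr_spill = total_spill_bytes(21, 712 * KB)   # MapReduce wordcount
tez_spill = total_spill_bytes(7, 37 * MB)    # Tez wordcount

print(f"MapReduce spills: {mr_spill / MB:.1f} MB")
print(f"Tez spills:       {tez_spill / MB:.1f} MB")
print(f"Tez spills / input size: {tez_spill / input_bytes:.2f}x")
```

So under these assumptions the Tez run writes about 259 MB of spill data, a bit more than twice the input size, while the MapReduce run writes under 15 MB.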
I didn't enable any compression in the config file. I am using Hadoop
2.5.0 and Tez 0.5.0 on a master node plus 2 slaves. Only one slave is
used for the wordcount example (I always get 1 map/tokenizer and 1
reduce/summation running on the same node).
Thanks in advance
Fabio