Hi guys,
I am trying to understand how Tez actually works internally. I opened the Tez wordcount example and I see that at some point there are classes referring back to Hadoop MapReduce classes. For this reason, and since I see the Tokenizer vertex has to finish before the Summation vertex can start, I can't understand the difference from a normal MapReduce job, where the map(s) have to finish before the reduce can start. There must be some difference, since performance is clearly better than with Hadoop's wordcount example (even if I'd need dedicated machines to be sure about this; I am running VMs on my rather old laptop).

Also, I see Tez has tez.runtime.io.sort.mb set to 32 MB, so I set mapreduce.task.io.sort.mb to 32 to make a comparison. The "old" wordcount generates 21 spill files (named something like attempt_1411492856064_0001_m_000000_0_spill_XX.out) of about 712 KB each, while with the Tez wordcount I get 7 such files, but 37 MB each. Considering I am working on a 120 MB input file (a little less than one HDFS block), Tez writes far more than MapReduce to the temporary dir. I thought these files were the intermediate map results, but I can't see how they could be that much larger than both the original input and the original wordcount spill files. I haven't enabled any compression in the config files. I am working with Hadoop 2.5.0 and Tez 0.5.0, one master node + 2 slaves. Only one slave is used for the wordcount example (I always get 1 map/tokenizer and 1 reduce/summation running on the same node).
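To make the gap concrete, here is the back-of-envelope arithmetic from the file counts and sizes I reported above (all sizes approximate, as read from the local temporary directory):

```python
# Rough totals of intermediate data written, from the spill file counts/sizes above.
mr_spill_mb = 21 * 712 / 1024   # MapReduce: 21 spills x ~712 KB, approx. 14.6 MB
tez_spill_mb = 7 * 37           # Tez: 7 files x ~37 MB = 259 MB

input_mb = 120                  # input file size (just under one HDFS block)

print(f"MapReduce intermediate: ~{mr_spill_mb:.1f} MB ({mr_spill_mb / input_mb:.2f}x input)")
print(f"Tez intermediate:       ~{tez_spill_mb} MB ({tez_spill_mb / input_mb:.2f}x input)")
```

So the Tez run writes intermediate data of roughly twice the input size, versus about a tenth of the input for the MapReduce run, which is exactly what puzzles me.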

Thanks in advance

Fabio
