Hi guys,
I am trying to understand how Tez actually works internally. I opened the Tez wordcount example and I see that at some point there are classes referring back to Hadoop MapReduce classes. For this reason, and since I see the Tokenizer vertex has to finish before the Summation vertex can start, I can't understand the difference from a normal MapReduce job, where the map(s) have to finish before the reduce can start. There must be some difference, since performance is clearly better than with Hadoop's wordcount example (even if I'd need dedicated machines to be sure about this; I am running VMs on my rather old laptop).

Also, I see Tez has tez.runtime.io.sort.mb set to 32 MB, so I set mapreduce.task.io.sort.mb to 32 to make a comparison. The "old" wordcount generates 21 spill files (named something like attempt_1411492856064_0001_m_000000_0_spill_XX.out) of about 712 KB each, while with the Tez wordcount I get 7 such files, but 37 MB each. Considering I am working on a 120 MB input file (a little less than one HDFS block), Tez writes far more than MapReduce to the temporary dir. I thought these files were the intermediate map results, but I can't see how they could be that much larger than both the original input and the original wordcount spill files. I haven't enabled any compression in the config files. I am working with Hadoop 2.5.0 and Tez 0.5.0, one master node + 2 slaves. Only one slave is used for the wordcount example (I always get 1 map/tokenizer and 1 reduce/summation running on the same node).
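To make the gap concrete, here is the back-of-envelope arithmetic from the file counts and sizes I reported above (all sizes approximate, as read from the local temporary directory):

```python
# Rough totals of intermediate data written, from the spill file counts/sizes above.
mr_spill_mb = 21 * 712 / 1024   # MapReduce: 21 spills x ~712 KB, approx. 14.6 MB
tez_spill_mb = 7 * 37           # Tez: 7 files x ~37 MB = 259 MB

input_mb = 120                  # input file size (just under one HDFS block)

print(f"MapReduce intermediate: ~{mr_spill_mb:.1f} MB ({mr_spill_mb / input_mb:.2f}x input)")
print(f"Tez intermediate:       ~{tez_spill_mb} MB ({tez_spill_mb / input_mb:.2f}x input)")
```

So the Tez run writes intermediate data of roughly twice the input size, versus about a tenth of the input for the MapReduce run, which is exactly what puzzles me.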

Thanks in advance

Fabio
