$ cat abook.txt | base64 -w 0 > onelinetext.b64
$ hadoop fs -put onelinetext.b64 /input/onelinetext.b64
$ hadoop jar hadoop-streaming.jar \
    -input /input/onelinetext.b64 \
    -output /output \
    -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
    -mapper wc

Number of tasks: 1, and the output has one line:

Line 1: 1 2 202699

This makes sense, because one line per mapper is intended.
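The "1 2" at the start of that output line can be reproduced locally. The sketch below assumes that streaming hands the mapper each record as key, tab, value when the input format is not the default TextInputFormat, so wc sees the byte offset and the base64 text as two words on one line. sample.txt is a hypothetical stand-in for abook.txt, and -w 0 assumes GNU coreutils base64.

```shell
# Hypothetical small stand-in for abook.txt.
printf 'first line\nsecond line\n' > sample.txt

# Encode to a single line; -w 0 disables line wrapping (GNU coreutils).
base64 -w 0 sample.txt > sample.b64

# Simulate one record as the mapper would see it under the assumption above:
# key (byte offset 0), a tab, the encoded value, then a newline.
# Keep only the line and word counts reported by wc.
record_counts=$(printf '0\t%s\n' "$(cat sample.b64)" | wc | awk '{print $1, $2}')
echo "$record_counts"
```

Base64 text contains no whitespace, so the record is always one line and two words regardless of the size of the input file; only the byte count changes.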
$ bzip2 onelinetext.b64
$ hadoop fs -put onelinetext.b64.bz2 /input/onelinetext.b64.bz2
$ hadoop jar hadoop-streaming.jar \
    -Dmapred.input.compress=true \
    -Dmapred.input.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
    -input /input/onelinetext.b64.bz2 \
    -output /output \
    -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
    -mapper wc

I am expecting the same results as above, because decompression should occur before the one-line text is processed (i.e. by wc). However, I am getting:

Number of tasks: 397, and the output has 397 lines:

Lines 1-396: 0 0 0
Line 397: 1 2 202699

Any idea why there are so many map tasks (mapred.map.tasks <> 1)? Is the input being split? I purposely chose gzip because I believe it is NOT splittable. I got similar results when using the bzip2 and lzop codecs.

Thanks in advance for your answer.

--
Dr. Qiming He
[email protected]
301-525-6612 (Phone)
815-327-2122 (Fax)
