I have found out why. When an executable is specified for mappers, each mapper task launches the executable as a separate process when the mapper is initialized. As the mapper task runs, it converts its inputs into lines and feeds the lines to the stdin of the process. In the meantime, the mapper collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the mapper. By default, the *prefix of a line up to the first tab character* is the *key* and the rest of the line (excluding the tab character) is the *value*. However, this can be customized.
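The default line-to-key/value split described above can be sketched in a few lines of Python (an illustration of the rule only, not Hadoop's actual code):

```python
def line_to_kv(line: str, separator: str = "\t"):
    """Split a line the way Streaming does by default:
    key = prefix up to the first separator, value = the rest.
    A line with no separator becomes (whole line, empty value)."""
    line = line.rstrip("\n")
    idx = line.find(separator)
    if idx == -1:
        return line, ""          # no tab: the entire line is the key
    return line[:idx], line[idx + 1:]

print(line_to_kv("hello\tworld"))   # ('hello', 'world')
print(line_to_kv("no-tab-here"))    # ('no-tab-here', '')
```

Note that a line containing several tabs splits only at the first one; everything after it stays in the value.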
When an executable is specified for reducers, each reducer task launches the executable as a separate process when the reducer is initialized. As the reducer task runs, it converts its input key/value pairs into lines and feeds the lines to the stdin of the process. In the meantime, the reducer collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the reducer. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. However, this can be customized.

So a tab (0x09) was added at the end of each line: each input line contains no tab, so the whole line becomes the key with an empty value, and each output record is written as key, tab, value.

Andy

2013/2/1 周梦想 <[email protected]>

> hello,
> I process a file using hadoop streaming, but I found streaming will add
> byte 0x09 before 0x0a, so the file is changed after the streaming process.
> Can someone tell me why this byte is added to the output?
>
> [zhouhh@Hadoop48 ~]$ ls -l README.txt
> -rw-r--r-- 1 zhouhh zhouhh 1399 Feb  1 10:53 README.txt
>
> [zhouhh@Hadoop48 ~]$ wc README.txt
>   34  182 1399 README.txt
>
> [zhouhh@Hadoop48 ~]$ hadoop fs -ls
> Found 3 items
> -rw-r--r--   2 zhouhh supergroup   9358 2013-01-10 17:52 /user/zhouhh/fsimage
> drwxr-xr-x   - zhouhh supergroup      0 2013-02-01 10:30 /user/zhouhh/gz
> -rw-r--r--   2 zhouhh supergroup     65 2012-09-26 14:10 /user/zhouhh/test中文.txt
>
> [zhouhh@Hadoop48 ~]$ hadoop fs -put README.txt .
> [zhouhh@Hadoop48 ~]$ hadoop fs -ls
> Found 4 items
> -rw-r--r--   2 zhouhh supergroup   1399 2013-02-01 10:56 /user/zhouhh/README.txt
> -rw-r--r--   2 zhouhh supergroup   9358 2013-01-10 17:52 /user/zhouhh/fsimage
> drwxr-xr-x   - zhouhh supergroup      0 2013-02-01 10:30 /user/zhouhh/gz
> -rw-r--r--   2 zhouhh supergroup     65 2012-09-26 14:10 /user/zhouhh/test中文.txt
>
> [zhouhh@Hadoop48 ~]$ hadoop fs -ls README.txt
> Found 1 items
> -rw-r--r--   2 zhouhh supergroup   1399 2013-02-01 10:56 /user/zhouhh/README.txt
>
> [zhouhh@Hadoop48 ~]$ hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-streaming-*.jar -input README.txt -output wordcount1 -mapper /bin/cat -reducer /bin/sort
>
> [zhouhh@Hadoop48 ~]$ hadoop fs -ls wordcount/part*
> Found 1 items
> -rw-r--r--   2 zhouhh supergroup   *1433* 2013-02-01 11:20 /user/zhouhh/wordcount/part-00000
>
> [zhouhh@Hadoop48 ~]$ hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-streaming-*.jar -input README.txt -output wordcount1 -mapper /bin/cat -reducer /usr/bin/wc
>
> [zhouhh@Hadoop48 ~]$ hadoop fs -cat wordcount1/p*
>      34     182    *1433*
>
> Part of the hex dumps of the two files (left: sort README.txt; right: streaming README.txt with reduce sort):
> 0000000: 0a0a 0a0a 0a0a 0a61 6c67 6f72 6974 686d  .......algorithm   | 0000000: *090a 090a 090a 090a 090a 090a 090a* 616c  ..............al
> 0000010: 732e 2020 5468 6520 666f 726d 2061 6e64  s.  The form and   | 0000010: 676f 7269 7468 6d73 2e20 2054 6865 2066  gorithms.  The f
> 0000020: 206d 616e 6e65 7220 6f66 2074 6869 7320   manner of this    | 0000020: 6f72 6d20 616e 6420 6d61 6e6e 6572 206f  orm and manner o
> 0000030: 4170 6163 6865 2053 6f66 7477 6172 6520  Apache Software    | 0000030: 6620 7468 6973 2041 7061 6368 6520 536f  f this Apache So
> 0000040: 466f 756e 6461 7469 6f6e 0a61 6e64 206f  Foundation.and o   | 0000040: 6674 7761 7265 2046 6f75 6e64 6174 696f  ftware Foundatio
>
> Because there are 34 lines, the file size grew by 34 bytes of 0x09: 1399 + 34 = 1433. Why?
>
> Best regards,
> Andy
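The size difference can be reproduced with a small sketch that mimics the default behavior described above (split each line at the first tab, then write the record back as key, tab, value); this is a simplified illustration, not Hadoop's actual code:

```python
def add_tabs(data: bytes) -> bytes:
    """Simulate a tab-less file passing through Streaming's default
    key/value handling: each line becomes key + TAB + empty value."""
    out = bytearray()
    for line in data.split(b"\n")[:-1]:        # input ends with a newline
        key, _, value = line.partition(b"\t")  # default split at first tab
        out += key + b"\t" + value + b"\n"     # record written as key TAB value
    return bytes(out)

readme = b"\n\nalgorithms.  The form\n"   # 3 lines, none containing a tab
out = add_tabs(readme)
print(len(out) - len(readme))             # grows by one 0x09 per line: 3
print(out[:4])                            # b'\t\n\t\n', matching the 090a hexdump
```

With 34 tab-less lines, the output is 34 bytes larger, exactly the observed 1399 + 34 = 1433.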
