I have found out why. When an executable is specified for mappers, each mapper task launches the executable as a separate process when the mapper is initialized. As the mapper task runs, it converts its inputs into lines and feeds the lines to the stdin of the process. In the meantime, the mapper collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the mapper. By default, the *prefix of a line up to the first tab character* is the *key* and the rest of the line (excluding the tab character) is the *value*. However, this can be customized.
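The default line-to-key/value split described above can be sketched in a few lines of Python (an illustration of the rule only, not Hadoop's actual code):

```python
def line_to_kv(line: str, separator: str = "\t"):
    """Split a line the way Streaming does by default:
    key = prefix up to the first separator, value = the rest.
    A line with no separator becomes (whole line, empty value)."""
    line = line.rstrip("\n")
    idx = line.find(separator)
    if idx == -1:
        return line, ""          # no tab: the entire line is the key
    return line[:idx], line[idx + 1:]

print(line_to_kv("hello\tworld"))   # ('hello', 'world')
print(line_to_kv("no-tab-here"))    # ('no-tab-here', '')
```

Note that a line containing several tabs splits only at the first one; everything after it stays in the value.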
When an executable is specified for reducers, each reducer task launches the executable as a separate process when the reducer is initialized. As the reducer task runs, it converts its input key/value pairs into lines and feeds the lines to the stdin of the process. In the meantime, the reducer collects the line-oriented outputs from the stdout of the process and converts each line into a key/value pair, which is collected as the output of the reducer. By default, the prefix of a line up to the first tab character is the key and the rest of the line (excluding the tab character) is the value. However, this can be customized.

So a tab (0x09) was added at the end of each line: each input line contains no tab, so the whole line becomes the key with an empty value, and each output record is written as key, tab, value.

Andy

2013/2/1 周梦想 <[email protected]>

> hello,
> I process a file using hadoop streaming, but I found streaming will add
> byte 0x09 before 0x0a, so the file is changed after the streaming process.
> Can someone tell me why this byte is added to the output?
>
> [zhouhh@Hadoop48 ~]$ ls -l README.txt
> -rw-r--r-- 1 zhouhh zhouhh 1399 Feb  1 10:53 README.txt
>
> [zhouhh@Hadoop48 ~]$ wc README.txt
>   34  182 1399 README.txt
>
> [zhouhh@Hadoop48 ~]$ hadoop fs -ls
> Found 3 items
> -rw-r--r--   2 zhouhh supergroup   9358 2013-01-10 17:52 /user/zhouhh/fsimage
> drwxr-xr-x   - zhouhh supergroup      0 2013-02-01 10:30 /user/zhouhh/gz
> -rw-r--r--   2 zhouhh supergroup     65 2012-09-26 14:10 /user/zhouhh/test中文.txt
>
> [zhouhh@Hadoop48 ~]$ hadoop fs -put README.txt .
> [zhouhh@Hadoop48 ~]$ hadoop fs -ls
> Found 4 items
> -rw-r--r--   2 zhouhh supergroup   1399 2013-02-01 10:56 /user/zhouhh/README.txt
> -rw-r--r--   2 zhouhh supergroup   9358 2013-01-10 17:52 /user/zhouhh/fsimage
> drwxr-xr-x   - zhouhh supergroup      0 2013-02-01 10:30 /user/zhouhh/gz
> -rw-r--r--   2 zhouhh supergroup     65 2012-09-26 14:10 /user/zhouhh/test中文.txt
>
> [zhouhh@Hadoop48 ~]$ hadoop fs -ls README.txt
> Found 1 items
> -rw-r--r--   2 zhouhh supergroup   1399 2013-02-01 10:56 /user/zhouhh/README.txt
>
> [zhouhh@Hadoop48 ~]$ hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-streaming-*.jar -input README.txt -output wordcount1 -mapper /bin/cat -reducer /bin/sort
>
> [zhouhh@Hadoop48 ~]$ hadoop fs -ls wordcount/part*
> Found 1 items
> -rw-r--r--   2 zhouhh supergroup   *1433* 2013-02-01 11:20 /user/zhouhh/wordcount/part-00000
>
> [zhouhh@Hadoop48 ~]$ hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-streaming-*.jar -input README.txt -output wordcount1 -mapper /bin/cat -reducer /usr/bin/wc
>
> [zhouhh@Hadoop48 ~]$ hadoop fs -cat wordcount1/p*
>      34     182    *1433*
>
> Part of the hex dumps of the two files (left: sort README.txt; right: streaming README.txt with reduce sort):
> 0000000: 0a0a 0a0a 0a0a 0a61 6c67 6f72 6974 686d  .......algorithm   | 0000000: *090a 090a 090a 090a 090a 090a 090a* 616c  ..............al
> 0000010: 732e 2020 5468 6520 666f 726d 2061 6e64  s.  The form and   | 0000010: 676f 7269 7468 6d73 2e20 2054 6865 2066  gorithms.  The f
> 0000020: 206d 616e 6e65 7220 6f66 2074 6869 7320   manner of this    | 0000020: 6f72 6d20 616e 6420 6d61 6e6e 6572 206f  orm and manner o
> 0000030: 4170 6163 6865 2053 6f66 7477 6172 6520  Apache Software    | 0000030: 6620 7468 6973 2041 7061 6368 6520 536f  f this Apache So
> 0000040: 466f 756e 6461 7469 6f6e 0a61 6e64 206f  Foundation.and o   | 0000040: 6674 7761 7265 2046 6f75 6e64 6174 696f  ftware Foundatio
>
> Because there are 34 lines, the file size grew by 34 bytes of 0x09: 1399 + 34 = 1433. Why?
>
> Best regards,
> Andy
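The size difference can be reproduced with a small sketch that mimics the default behavior described above (split each line at the first tab, then write the record back as key, tab, value); this is a simplified illustration, not Hadoop's actual code:

```python
def add_tabs(data: bytes) -> bytes:
    """Simulate a tab-less file passing through Streaming's default
    key/value handling: each line becomes key + TAB + empty value."""
    out = bytearray()
    for line in data.split(b"\n")[:-1]:        # input ends with a newline
        key, _, value = line.partition(b"\t")  # default split at first tab
        out += key + b"\t" + value + b"\n"     # record written as key TAB value
    return bytes(out)

readme = b"\n\nalgorithms.  The form\n"   # 3 lines, none containing a tab
out = add_tabs(readme)
print(len(out) - len(readme))             # grows by one 0x09 per line: 3
print(out[:4])                            # b'\t\n\t\n', matching the 090a hexdump
```

With 34 tab-less lines, the output is 34 bytes larger, exactly the observed 1399 + 34 = 1433.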
