Re: hadoop 2.4.0 streaming generic parser options using TAB as separator

Kiran Dangeti Tue, 09 Jun 2015 23:13:32 -0700

On Jun 10, 2015 10:58 AM, "anvesh ragi" <[email protected]> wrote:


> Hello all,
>
> I know that tab is the default value of the following field separator
> properties:
>
> stream.map.output.field.separator
> stream.reduce.input.field.separator
> stream.reduce.output.field.separator
> mapreduce.textoutputformat.separator
>
> but when I set the generic parser option as:
>
> stream.map.output.field.separator=\t (or)
> stream.map.output.field.separator="\t"
>
> to test how Hadoop parses whitespace characters such as "\t" and "\n" when
> they are used as separators, I observe that Hadoop reads the value as the
> literal string \t and not as an actual tab character. I checked this by
> printing each line in the reducer (Python) as it is read, using:
>
> sys.stdout.write(str(line))
>
> My mapper emits key/value pairs as: key value1 value2
>
> using print(key, value1, value2, sep='\t', end='\n').
>
> So I expected my reducer to read each line as: key value1 value2 as well,
> but instead sys.stdout.write(str(line)) printed:
>
> key value1 value2   (with a trailing space)
>
> From Hadoop streaming - remove trailing tab from reducer output
> <http://stackoverflow.com/questions/18133290/hadoop-streaming-remove-trailing-tab-from-reducer-output>,
> I understood that the trailing space appears because
> mapreduce.textoutputformat.separator was not set and kept its default value.
>
> This confirmed my assumption that Hadoop treated my whole map output line:
>
> key value1 value2
>
> as the key, with an empty Text object as the value, because it read the
> separator from stream.map.output.field.separator=\t as the literal string
> "\t" instead of an actual tab character.
>
> Please help me understand this behavior. How can I use a tab (\t) as the
> separator if I want to?
>
> Thanks & Regards,
> Anvesh R
>
>
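
The behavior described in the quoted message can be modeled with a short Python sketch. It assumes, as a simplification, that streaming splits each mapper output line at the first occurrence of the configured separator, treats the whole line as the key (with an empty value) when the separator is not found, and then passes key and value to the reducer joined by a tab. The names split_key_value and reducer_input_line are hypothetical and are not part of Hadoop.

```python
def split_key_value(line, separator, num_key_fields=1):
    """Split a line into (key, value) at the num_key_fields-th separator.

    Simplified model (an assumption, not Hadoop's actual code): if the
    separator occurs fewer than num_key_fields times, the whole line
    becomes the key and the value is empty.
    """
    start = 0
    pos = -1
    for _ in range(num_key_fields):
        pos = line.find(separator, start)
        if pos == -1:
            return line, ""
        start = pos + len(separator)
    return line[:pos], line[start:]


def reducer_input_line(key, value, separator="\t"):
    """Join key and value the way the reducer is assumed to receive them."""
    return key + separator + value


# What the mapper prints with sep='\t' (real tab characters).
mapper_line = "key\tvalue1\tvalue2"

real_tab = "\t"        # one character: an actual tab
literal_text = "\\t"   # two characters: a backslash followed by 't'

# Case 1: the separator is a real tab, so the line splits after "key".
key, value = split_key_value(mapper_line, real_tab)
with_real_tab = reducer_input_line(key, value)

# Case 2: the separator is the literal text \t, which never occurs in the
# line, so the whole line is the key and the value is empty.
key, value = split_key_value(mapper_line, literal_text)
with_literal_text = reducer_input_line(key, value)

# repr() makes the otherwise invisible trailing tab visible.
print(repr(with_real_tab))
print(repr(with_literal_text))
```

In this model the second case ends with an extra tab, which matches the trailing whitespace seen in the reducer. Printing repr(line) instead of str(line) in the reducer is one way to confirm which character is actually trailing.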
