RE: Using hadoop streaming with binary data

Venkatesh Kavuluri Wed, 06 Feb 2013 13:38:40 -0800

You can use Hadoop's DistCp to copy files via MapReduce.

Date: Wed, 6 Feb 2013 16:19:23 -0500
Subject: Using hadoop streaming with binary data
From: [email protected]
To: [email protected]


Is it possible to pass unmolested binary data through a map-only streaming job 
from the command line?  I.e., is there a way to avoid extra tabs and newlines 
in the output?  I don't need input splits or key/value pairs, I just want one 
whole input file fed unmodified into a program, and its output written 
unmodified to HDFS.  For example, I'd like to run:


    hadoop jar hadoop-streaming.jar -mapper cat -numReduceTasks 0 -input in -output out

and have 'out' be exactly the same as 'in'.
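
To illustrate why 'out' differs from 'in', here is a rough Python model of what the default text path appears to do to the bytes. It is only a sketch under stated assumptions, not Hadoop's actual code: the function name simulate_text_output is made up, and it assumes records are split at newlines, each mapper output line is split into key and value at the first tab, and the output format writes key, separator, value, newline.

```python
def simulate_text_output(data: bytes, separator: bytes = b"\t") -> bytes:
    """Simplified, hypothetical model of a map-only streaming job with -mapper cat."""
    # Assumption: the input is cut into records at b"\n"; a trailing newline
    # does not produce an extra empty record.
    records = data.split(b"\n")
    if records and records[-1] == b"":
        records.pop()

    out = bytearray()
    for record in records:
        # Assumption: the line is split into key and value at the first tab;
        # a line with no tab becomes the key with an empty value.
        key, _, value = record.partition(b"\t")
        # Assumption: the output format always writes key, separator, value, newline.
        out += key + separator + value + b"\n"
    return bytes(out)


if __name__ == "__main__":
    original = b"\x00\x01binary\nno trailing newline"
    result = simulate_text_output(original)
    print(result == original)  # False: a tab and a newline were added per record
```

Under this model, any record without a tab gains one, and a final record without a newline gains one, so arbitrary binary input does not survive byte for byte.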

There does not seem to be a way to set 
mapreduce.output.textoutputformat.separator to the empty string, and typedbytes 
prepends the size.  Is there a way to leave data alone out of the box, or will 
I have to write a custom InputFormat and OutputFormat?
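
For reference, the size prefix added by typedbytes can be sketched as below. This assumes the typed bytes layout for a raw byte sequence is a one-byte type code 0 followed by a 4-byte big-endian length and then the payload; the function name typedbytes_encode is hypothetical and only models that framing.

```python
import struct


def typedbytes_encode(data: bytes) -> bytes:
    """Hypothetical sketch of typed bytes framing for a raw byte sequence."""
    # Assumption: type code 0 marks a byte sequence, followed by a
    # 32-bit big-endian length, followed by the bytes themselves.
    header = struct.pack(">bi", 0, len(data))
    return header + data


if __name__ == "__main__":
    payload = b"abc"
    framed = typedbytes_encode(payload)
    # The framed record is 5 bytes longer than the payload under this assumption.
    print(len(framed) - len(payload))
```

So even with typedbytes the output would carry a header per record and would not be identical to the input.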


Thanks!
