I wrote a little code to make this happen and submitted it as a patch to Hadoop:
https://issues.apache.org/jira/browse/MAPREDUCE-5018
A jar file and shell script are available there for anybody who wants to try this without recompiling all of Hadoop. It lets you run something like "mapstream indir md5sum outdir" and get one map task per file in indir, with the real raw binary data passed to your map command and the output written to a file in outdir. This makes it easy to run all your favorite Unix commands as map-only streaming jobs and take advantage of reliable distributed execution.
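
To make the per-file behavior concrete, here is a rough local sketch in Python of what a command like "mapstream indir md5sum outdir" is meant to do. It is not the code from the patch and it does not use Hadoop at all: it runs everything sequentially on one machine. The function name map_files is made up for illustration, and it assumes that the command reads the file on stdin and that each output file is named after its input file.

```python
import subprocess
from pathlib import Path


def map_files(indir, command, outdir):
    """Run `command` once per file in `indir`, writing stdout to `outdir`.

    Hypothetical local stand-in for the per-file map behavior described
    above; no Hadoop involved, and the files are processed one at a time.
    """
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(indir).iterdir()):
        if not src.is_file():
            continue
        # Assumption: the output file takes the name of the input file.
        dest = out / src.name
        # Binary mode on both ends, so the command sees the raw bytes
        # with no decoding and no splitting into lines or records.
        with src.open("rb") as fin, dest.open("wb") as fout:
            subprocess.run(command, stdin=fin, stdout=fout, check=True)
        written.append(dest)
    return written
```

Calling map_files("indir", ["md5sum"], "outdir") would leave one checksum file in outdir per file in indir, provided md5sum is installed. The real thing differs in the part that matters: each file is handled by its own map task somewhere on the cluster, and Hadoop takes care of scheduling and retries.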
