1. I did try using NLineInputFormat, but this causes "stream.map.input.ignoreKey" to no longer work. As per the streaming documentation:

"The configuration parameter is valid only if stream.map.input.writer.class is org.apache.hadoop.streaming.io.TextInputWriter.class."

My mapper prefers the streaming stdin to not have the key as part of the input. I could obviously parse that out in the mapper, but the mapper belongs to a 3rd party. This is why I tried to go the RecordReader route.

2. Yes - I did export the classpath before running.

3. This may be the problem (the class sits at the jar root, not under mypackage/):

bash-3.2$ jar -tf NLineRecordReader.jar
META-INF/
META-INF/MANIFEST.MF
NLineRecordReader.class

I have specified "package mypackage;" at the top of the java file though. I then compiled using "javac" and packaged with "jar cf".

4. The class is public.

On Wed, Oct 17, 2012 at 11:53 PM, Harsh J <[email protected]> wrote:
> Hi Jason,
>
> A few questions (in order):
>
> 1. Does Hadoop's own NLineInputFormat not suffice?
> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
>
> 2. Do you make sure to pass your jar into the front-end too?
>
> $ export HADOOP_CLASSPATH=/path/to/your/jar
> $ command…
>
> 3. Does jar -tf <yourjar> carry a proper mypackage.NLineRecordReader?
>
> 4. Is your class marked public?
>
> On Thu, Oct 18, 2012 at 9:32 AM, Jason Wang <[email protected]> wrote:
> > Hi all,
> >
> > I'm experimenting with hadoop streaming on build 1.0.3.
> >
> > To give background info, I'm streaming a text file into a mapper written
> > in C. Using the default settings, streaming uses TextInputFormat, which
> > creates one record from each line. The problem I am having is that I need
> > record boundaries to be every 4 lines. When the splitter breaks up the
> > input into the mappers, I have partial records on the boundaries due to
> > this. To address this, my approach was to write a new RecordReader class
> > in Java that is almost identical to LineRecordReader, but with a modified
> > next() method that reads 4 lines instead of one.
> >
> > I then compiled the new class and created a jar. I wanted to import this
> > at run time using the -libjars argument, like so:
> >
> > hadoop jar ../contrib/streaming/hadoop-streaming-1.0.3.jar -libjars
> > NLineRecordReader.jar -files test_stream.sh -inputreader
> > mypackage.NLineRecordReader -input /Users/hadoop/test/test.txt -output
> > /Users/hadoop/test/output -mapper "test_stream.sh" -reducer NONE
> >
> > Unfortunately, I keep getting the following error:
> > -inputreader: class not found: mypackage.NLineRecordReader
> >
> > My question is twofold. Am I using the right approach to handle the 4-line
> > records with the custom RecordReader implementation? And why isn't
> > -libjars working to include my class in hadoop streaming at runtime?
> >
> > Thanks,
> > Jason
>
> --
> Harsh J
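
P.S. To make point 3 concrete, here is a minimal sketch of the check that "jar -tf" is effectively being used for. A class loader looks up a class inside a jar at the entry path formed from its fully qualified name, so mypackage.NLineRecordReader has to appear as mypackage/NLineRecordReader.class. The class name JarEntryCheck and its two helper methods are hypothetical and only for illustration; they are not part of Hadoop or the JDK. The listing in main() is the one pasted above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of Hadoop or the JDK.
class JarEntryCheck {

    // Map a fully qualified class name to the jar entry a class loader looks for.
    public static String expectedEntry(String className) {
        return className.replace('.', '/') + ".class";
    }

    // True if a "jar -tf" style listing contains the entry for the given class.
    public static boolean jarContainsClass(List<String> entries, String className) {
        return entries.contains(expectedEntry(className));
    }

    public static void main(String[] args) {
        // Entries exactly as printed by "jar -tf NLineRecordReader.jar" above.
        List<String> listing = Arrays.asList(
                "META-INF/",
                "META-INF/MANIFEST.MF",
                "NLineRecordReader.class");

        String name = "mypackage.NLineRecordReader";
        System.out.println(expectedEntry(name));             // mypackage/NLineRecordReader.class
        System.out.println(jarContainsClass(listing, name)); // false
    }
}
```

With the listing from my jar the check comes back false, which is consistent with the "class not found" error: the .class file is at the jar root, so the jar would need to be rebuilt with the class under a mypackage/ directory.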
