(3)'s your problem for sure. Try this:
mkdir mypackage
mv <class file> mypackage/
jar cvf NLineRecordReader.jar mypackage

[Use this jar]

On Thu, Oct 18, 2012 at 10:54 AM, Jason Wang <[email protected]> wrote:
> 1. I did try using NLineInputFormat, but this causes the
> "stream.map.input.ignoreKey" to no longer work. As per the streaming
> documentation:
>
> "The configuration parameter is valid only if stream.map.input.writer.class
> is org.apache.hadoop.streaming.io.TextInputWriter.class."
>
> My mapper prefers the streaming stdin to not have the key as part of the
> input. I could obviously parse that out in the mapper, but the mapper
> belongs to a 3rd party. This is why I tried to do the RecordReader route.
>
> 2. Yes - I did export the classpath before running.
>
> 3. This may be the problem:
>
> bash-3.2$ jar -tf NLineRecordReader.jar
> META-INF/
> META-INF/MANIFEST.MF
> NLineRecordReader.class
>
> I have specified "package mypackage;" at the top of the java file though.
> Then compiled using "javac" and then "jar cf".
>
> 4. The class is public.
>
>
>
> On Wed, Oct 17, 2012 at 11:53 PM, Harsh J <[email protected]> wrote:
>>
>> Hi Jason,
>>
>> A few questions (in order):
>>
>> 1. Does Hadoop's own NLineInputFormat not suffice?
>>
>> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
>>
>> 2. Do you make sure to pass your jar into the front-end too?
>>
>> $ export HADOOP_CLASSPATH=/path/to/your/jar
>> $ command…
>>
>> 3. Does jar -tf <yourjar> carry a proper mypackage.NLineRecordReader?
>>
>> 4. Is your class marked public?
>>
>> On Thu, Oct 18, 2012 at 9:32 AM, Jason Wang <[email protected]>
>> wrote:
>> > Hi all,
>> > I'm experimenting with hadoop streaming on build 1.0.3.
>> >
>> > To give background info, I'm streaming a text file into a mapper
>> > written in C. Using the default settings, streaming uses
>> > TextInputFormat, which creates one record from each line.
>> > The problem I am having is that I need record boundaries to be every
>> > 4 lines. When the splitter breaks up the input into the mappers, I
>> > have partial records on the boundaries due to this. To address this,
>> > my approach was to write a new RecordReader class in Java that is
>> > almost identical to LineRecordReader, but with a modified next()
>> > method that reads 4 lines instead of one.
>> >
>> > I then compiled the new class and created a jar. I wanted to import
>> > this at run time using the -libjars argument, like such:
>> >
>> > hadoop jar ../contrib/streaming/hadoop-streaming-1.0.3.jar -libjars
>> > NLineRecordReader.jar -files test_stream.sh -inputreader
>> > mypackage.NLineRecordReader -input /Users/hadoop/test/test.txt -output
>> > /Users/hadoop/test/output -mapper "test_stream.sh" -reducer NONE
>> >
>> > Unfortunately, I keep getting the following error:
>> > -inputreader: class not found: mypackage.NLineRecordReader
>> >
>> > My question is two-fold. Am I using the right approach to handle the
>> > 4 line records with the custom RecordReader implementation? And why
>> > isn't -libjars working to include my class to hadoop streaming at
>> > runtime?
>> >
>> > Thanks,
>> > Jason
>>
>>
>>
>> --
>> Harsh J
>
>

--
Harsh J
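
For reference, the "read 4 lines instead of one" idea from the original question can be sketched in plain Java. This is only a rough illustration of the grouping logic, not the poster's NLineRecordReader and not the Hadoop RecordReader API: the class name FourLineRecords and the helpers nextRecord() and readAll() are hypothetical, and the sketch uses only java.io. It also says nothing about how input splits line up with record boundaries, which the thread leaves open.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration; it does not use Hadoop.
class FourLineRecords {

    // Reads up to linesPerRecord lines and joins them with '\n'.
    // Returns null once the input is exhausted. A final record may be
    // shorter if the input does not hold a multiple of linesPerRecord lines.
    static String nextRecord(BufferedReader in, int linesPerRecord) throws IOException {
        StringBuilder record = new StringBuilder();
        int linesRead = 0;
        String line;
        while (linesRead < linesPerRecord && (line = in.readLine()) != null) {
            if (linesRead > 0) {
                record.append('\n');
            }
            record.append(line);
            linesRead++;
        }
        return linesRead == 0 ? null : record.toString();
    }

    // Collects every record from the reader.
    static List<String> readAll(BufferedReader in, int linesPerRecord) throws IOException {
        List<String> records = new ArrayList<>();
        String record;
        while ((record = nextRecord(in, linesPerRecord)) != null) {
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        String input = "a1\na2\na3\na4\nb1\nb2\nb3\nb4\n";
        BufferedReader in = new BufferedReader(new StringReader(input));
        for (String record : readAll(in, 4)) {
            // Show each 4-line record on one output line.
            System.out.println("[" + record.replace("\n", " | ") + "]");
        }
    }
}
```

In a real reader the same loop would sit inside next(), with the joined lines becoming the record value; that wiring depends on the Hadoop version and is not shown here.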
