Also, consider using Maven for this kind of development; it helps build sane jars automatically :)
On Thu, Oct 18, 2012 at 11:28 AM, Harsh J <[email protected]> wrote:
> (3)'s your problem for sure.
>
> Try this:
>
> mkdir mypackage
> mv <class file> mypackage/
> jar cvf NLineRecordReader.jar mypackage
> [Use this jar]
>
> On Thu, Oct 18, 2012 at 10:54 AM, Jason Wang <[email protected]> wrote:
>> 1. I did try using NLineInputFormat, but this causes
>> "stream.map.input.ignoreKey" to no longer work. As per the streaming
>> documentation:
>>
>> "The configuration parameter is valid only if stream.map.input.writer.class
>> is org.apache.hadoop.streaming.io.TextInputWriter.class."
>>
>> My mapper prefers the streaming stdin not to have the key as part of the
>> input. I could obviously parse that out in the mapper, but the mapper
>> belongs to a 3rd party. This is why I tried the RecordReader route.
>>
>> 2. Yes - I did export the classpath before running.
>>
>> 3. This may be the problem:
>>
>> bash-3.2$ jar -tf NLineRecordReader.jar
>> META-INF/
>> META-INF/MANIFEST.MF
>> NLineRecordReader.class
>>
>> I did specify "package mypackage;" at the top of the java file, though,
>> then compiled with "javac" and then "jar cf".
>>
>> 4. The class is public.
>>
>> On Wed, Oct 17, 2012 at 11:53 PM, Harsh J <[email protected]> wrote:
>>>
>>> Hi Jason,
>>>
>>> A few questions (in order):
>>>
>>> 1. Does Hadoop's own NLineInputFormat not suffice?
>>>
>>> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
>>>
>>> 2. Do you make sure to pass your jar to the front end too?
>>>
>>> $ export HADOOP_CLASSPATH=/path/to/your/jar
>>> $ command…
>>>
>>> 3. Does jar -tf <yourjar> carry a proper mypackage.NLineRecordReader?
>>>
>>> 4. Is your class marked public?
>>>
>>> On Thu, Oct 18, 2012 at 9:32 AM, Jason Wang <[email protected]> wrote:
>>> > Hi all,
>>> > I'm experimenting with hadoop streaming on build 1.0.3.
>>> >
>>> > To give some background, I'm streaming a text file into a mapper
>>> > written in C.
>>> > With the default settings, streaming uses TextInputFormat, which
>>> > creates one record from each line. The problem I am having is that I
>>> > need record boundaries to fall every 4 lines. When the splitter breaks
>>> > the input up for the mappers, I get partial records at the boundaries
>>> > because of this. To address it, my approach was to write a new
>>> > RecordReader class in Java that is almost identical to
>>> > LineRecordReader, but with a modified next() method that reads 4 lines
>>> > instead of one.
>>> >
>>> > I then compiled the new class and created a jar. I wanted to include
>>> > this at run time using the -libjars argument, like so:
>>> >
>>> > hadoop jar ../contrib/streaming/hadoop-streaming-1.0.3.jar -libjars
>>> > NLineRecordReader.jar -files test_stream.sh -inputreader
>>> > mypackage.NLineRecordReader -input /Users/hadoop/test/test.txt -output
>>> > /Users/hadoop/test/output -mapper "test_stream.sh" -reducer NONE
>>> >
>>> > Unfortunately, I keep getting the following error:
>>> > -inputreader: class not found: mypackage.NLineRecordReader
>>> >
>>> > My question is twofold. Am I using the right approach to handle the
>>> > 4-line records with a custom RecordReader implementation? And why
>>> > isn't -libjars working to include my class in hadoop streaming at
>>> > runtime?
>>> >
>>> > Thanks,
>>> > Jason
>>>
>>>
>>>
>>> --
>>> Harsh J
>>
>>
>
>
>
> --
> Harsh J

--
Harsh J
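[Editor's note: for readers following the thread, the grouping logic Jason describes, a next() that joins 4 lines into one record, can be sketched in plain Java. This is an illustration only, not the actual class from the thread: it uses java.io.BufferedReader so it runs without Hadoop on the classpath, and the class name FourLineReaderSketch is invented here. A real implementation would extend Hadoop's RecordReader API the way LineRecordReader does and also handle split boundaries, which this sketch ignores.]

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (not Hadoop code): group every 4 input lines
// into a single record string, the core of the modified next() method
// described in the thread.
public class FourLineReaderSketch {
    static final int LINES_PER_RECORD = 4;

    // Reads up to LINES_PER_RECORD lines and joins them with '\n';
    // returns null when the input is exhausted. A trailing partial
    // group still comes back as one (shorter) record.
    static String nextRecord(BufferedReader in) throws IOException {
        StringBuilder record = new StringBuilder();
        int read = 0;
        String line;
        while (read < LINES_PER_RECORD && (line = in.readLine()) != null) {
            if (read > 0) record.append('\n');
            record.append(line);
            read++;
        }
        return read == 0 ? null : record.toString();
    }

    public static void main(String[] args) throws IOException {
        // 9 input lines -> two full 4-line records plus one partial record
        BufferedReader in = new BufferedReader(
                new StringReader("a\nb\nc\nd\ne\nf\ng\nh\ni"));
        List<String> records = new ArrayList<>();
        String rec;
        while ((rec = nextRecord(in)) != null) {
            records.add(rec);
        }
        System.out.println(records.size());   // 3
        System.out.println(records.get(0));   // a, b, c, d on four lines
        System.out.println(records.get(2));   // i
    }
}
```

Note that in a real Hadoop job this per-reader grouping does not by itself fix split boundaries: a FileSplit can still start mid-group, which is exactly the partial-record problem from the original post, so a production reader must also skip to the next group boundary when its split does not begin at offset 0.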
