Re: hadoop streaming with custom RecordReader class

Jason Wang Wed, 17 Oct 2012 22:25:13 -0700

1. I did try NLineInputFormat, but with it the
"stream.map.input.ignoreKey" setting no longer takes effect.  As per the
streaming documentation:


"The configuration parameter is valid only if stream.map.input.writer.class
is org.apache.hadoop.streaming.io.TextInputWriter.class."

My mapper expects its stdin not to include the key.  I could strip the key
in the mapper, but the mapper belongs to a 3rd party.  This is why I went
the RecordReader route.

2. Yes - I did export the classpath before running.

3. This may be the problem:

bash-3.2$ jar -tf NLineRecordReader.jar
META-INF/
META-INF/MANIFEST.MF
NLineRecordReader.class

I did specify "package mypackage;" at the top of the Java file, though.  I
then compiled with "javac" and packaged with "jar cf".

4. The class is public.
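
On point 3: the listing shows NLineRecordReader.class at the root of the
jar, whereas a class declared in "package mypackage;" is looked up under
mypackage/NLineRecordReader.class.  That mismatch alone would be consistent
with the "class not found" error.  Compiling with "javac -d <dir>" creates
the package directory, and the jar can then be built from that directory.
A minimal sketch of a check for the expected entry follows; the class name
JarLayoutCheck and its helper methods are hypothetical and not part of
Hadoop.

```java
import java.io.IOException;
import java.util.jar.JarFile;

// Hypothetical helper (not part of Hadoop): checks that a jar stores a class
// at the path implied by its package declaration.
public class JarLayoutCheck {

    // Maps a fully qualified class name to the jar entry a class loader
    // looks up, e.g. a.b.C becomes a/b/C.class.
    static String expectedEntry(String className) {
        return className.replace('.', '/') + ".class";
    }

    // Returns true if the jar contains the class file at the expected path.
    static boolean jarContains(String jarPath, String className) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.getEntry(expectedEntry(className)) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults mirror the names used in this thread; pass others as arguments.
        String jarPath = args.length > 0 ? args[0] : "NLineRecordReader.jar";
        String className = args.length > 1 ? args[1] : "mypackage.NLineRecordReader";
        System.out.println("expected entry: " + expectedEntry(className));
        System.out.println("present: " + jarContains(jarPath, className));
    }
}
```

Run against the jar listed above, the check should report the entry as
missing, since the only class entry is at the jar root.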



On Wed, Oct 17, 2012 at 11:53 PM, Harsh J <[email protected]> wrote:

> Hi Jason,
>
> A few questions (in order):
>
> 1. Does Hadoop's own NLineInputFormat not suffice?
>
> http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/NLineInputFormat.html
>
> 2. Do you make sure to pass your jar into the front-end too?
>
> $ export HADOOP_CLASSPATH=/path/to/your/jar
> $ command…
>
> 3. Does jar -tf <yourjar> carry a proper mypackage.NLineRecordReader?
>
> 4. Is your class marked public?
>
> On Thu, Oct 18, 2012 at 9:32 AM, Jason Wang <[email protected]>
> wrote:
> > Hi all,
> > I'm experimenting with hadoop streaming on build 1.0.3.
> >
> > To give background info, I'm streaming a text file into a mapper written
> > in C.  Using the default settings, streaming uses TextInputFormat, which
> > creates one record from each line.  The problem I am having is that I
> > need record boundaries to be every 4 lines.  When the splitter breaks up
> > the input for the mappers, I get partial records at the split boundaries.
> > To address this, my approach was to write a new RecordReader class in
> > Java that is almost identical to LineRecordReader, but with a modified
> > next() method that reads 4 lines instead of one.
> >
> > I then compiled the new class and created a jar.  I wanted to import
> > this at run time using the -libjars argument, like so:
> >
> > hadoop jar ../contrib/streaming/hadoop-streaming-1.0.3.jar -libjars
> > NLineRecordReader.jar -files test_stream.sh -inputreader
> > mypackage.NLineRecordReader -input /Users/hadoop/test/test.txt -output
> > /Users/hadoop/test/output -mapper "test_stream.sh" -reducer NONE
> >
> > Unfortunately, I keep getting the following error:
> > -inputreader: class not found: mypackage.NLineRecordReader
> >
> > My question is twofold.  Am I using the right approach to handle the
> > 4-line records with the custom RecordReader implementation?  And why
> > isn't -libjars working to include my class in hadoop streaming at
> > runtime?
> >
> > Thanks,
> > Jason
>
>
>
> --
> Harsh J
>
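
For reference, the four-lines-per-record idea from the original message can
be sketched outside Hadoop.  This is only an illustration of the grouping
logic, assuming plain text input read through a BufferedReader.  It does not
use the Hadoop RecordReader interface and does not deal with split
boundaries, which is the harder part of the actual problem.  The class name
FourLineRecords and its methods are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration (not a Hadoop RecordReader): groups consecutive
// input lines into records of four lines each.
public class FourLineRecords {
    static final int LINES_PER_RECORD = 4;

    // Reads up to LINES_PER_RECORD lines and joins them with '\n'.
    // Returns null once the input is exhausted; a trailing record may be short.
    static String nextRecord(BufferedReader in) throws IOException {
        StringBuilder record = new StringBuilder();
        int read = 0;
        while (read < LINES_PER_RECORD) {
            String line = in.readLine();
            if (line == null) {
                break; // end of input
            }
            if (read > 0) {
                record.append('\n');
            }
            record.append(line);
            read++;
        }
        return read == 0 ? null : record.toString();
    }

    // Convenience wrapper: splits an in-memory string into records.
    static List<String> readAll(String text) {
        List<String> records = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new StringReader(text))) {
            String record;
            while ((record = nextRecord(in)) != null) {
                records.add(record);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        // Six input lines: one full record followed by a short trailing one.
        for (String record : readAll("a\nb\nc\nd\ne\nf")) {
            System.out.println("[" + record.replace("\n", " | ") + "]");
        }
    }
}
```

Whether a short trailing record should be emitted or dropped is a design
choice; the sketch emits it so that no input is silently lost.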
