Hi, I have some questions about how to use NLineInputFormat with AvroMapper.
I am new to Avro, so please correct me if you think what I am doing is
wrong.
In my project, I need to pass data to an MR job so that, ideally, each mapper
consumes one line of a text file. Each line of text is the location of another
resource, and the mapper then loads the data from that resource. The mapper
output is an Avro record; I have already written the schema file and generated
the record class.
So far, my mapper works fine when I use AvroUtf8InputFormat.
Here is my driver class, with most of the logic listed:
    JobConf conf = new JobConf(getConf(), getClass());
    conf.setJobName("Test job");
    FileInputFormat.addInputPath(conf, ....);
    FileOutputFormat.setOutputPath(conf, .....);
    conf.setInputFormat(AvroUtf8InputFormat.class);
    AvroJob.setMapperClass(conf, Tracking3Mapper.class);
    AvroJob.setInputSchema(conf, Schema.create(Schema.Type.STRING));
    AvroJob.setMapOutputSchema(conf,
        Pair.getPairSchema(Schema.create(Schema.Type.STRING),
            TrackingActivities.SCHEMA$));
    AvroJob.setOutputSchema(conf,
        Pair.getPairSchema(Schema.create(Schema.Type.STRING),
            TrackingActivities.SCHEMA$));
    JobClient.runJob(conf);
And my mapper looks like this:
    public class Tracking3Mapper
            extends AvroMapper<Utf8, Pair<CharSequence, TrackingActivities>> {
        public void map(Utf8 value,
                AvroCollector<Pair<CharSequence, TrackingActivities>> output,
                Reporter reporter) throws IOException {
        }
    }
Everything works as I expected, but here is my question. I want to use
NLineInputFormat, because I want to make sure that each line in my data file
goes to its own mapper, i.e. one mapper consumes one line of text. I tested
Hadoop's NLineInputFormat without Avro, and it works perfectly for my use case:
by default, each line of data goes to exactly one mapper. So I want to use it
with Avro.
It looks like I have two options, and I don't know which one works with Avro.
Option 1: change the driver code this way:
    NLineInputFormat.addInputPath(conf, .....);
    conf.setInputFormat(AvroUtf8InputFormat.class);
Will AvroUtf8InputFormat wrap the NLineInputFormat class correctly this way?
Option 2, which is the part I need help with:
    NLineInputFormat.addInputPath(conf, ....);
    conf.setInputFormat(NLineInputFormat.class);
If I do the above, I get the following exception:
    java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.avro.mapred.AvroWrapper
        at org.apache.avro.mapred.HadoopMapper.map(HadoopMapper.java:34)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
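The trace appears consistent with a key-type mismatch: the Avro mapper bridge
(HadoopMapper) expects AvroWrapper keys, while NLineInputFormat supplies
LongWritable keys. The sketch below is plain Java using hypothetical stand-in
classes (Wrapper, Mapper, WrapperMapper and run are illustrations, not Hadoop or
Avro types). It only shows how an unchecked hand-off of the wrong key type
fails inside the mapper, and is not the actual Hadoop code path.

```java
public class KeyTypeMismatchDemo {
    // Hypothetical stand-in for a wrapper key type such as AvroWrapper.
    static final class Wrapper<T> {
        final T datum;
        Wrapper(T datum) { this.datum = datum; }
    }

    // Hypothetical stand-in for a mapper interface that is generic in its key.
    interface Mapper<K> {
        String map(K key);
    }

    // A mapper declared to receive wrapped keys only.
    static final class WrapperMapper implements Mapper<Wrapper<String>> {
        @Override
        public String map(Wrapper<String> key) {
            return key.datum;
        }
    }

    // Stand-in for a framework runner: it passes along whatever key the input
    // format produced, without any compile-time check (raw type on purpose).
    @SuppressWarnings({"unchecked", "rawtypes"})
    static String run(Mapper mapper, Object key) {
        return mapper.map(key);
    }

    // Returns true if handing this key to WrapperMapper fails with a
    // ClassCastException, false if the mapper accepts it.
    static boolean failsWithClassCast(Object key) {
        try {
            run(new WrapperMapper(), key);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // A wrapped key is accepted by the mapper.
        System.out.println(failsWithClassCast(new Wrapper<String>("line")));
        // A bare Long (standing in for LongWritable) is rejected at the cast.
        System.out.println(failsWithClassCast(Long.valueOf(0L)));
    }
}
```

If this reading is right, the input format and the mapper have to agree on the
key type, which is what my second question below is about.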
So my questions are:
1) To use NLineInputFormat in my MR job, will option 1 work?
2) If I have to set NLineInputFormat in conf.setInputFormat(), how can I make
it work with my current mapper and driver? Does that mean my mapper should no
longer extend AvroMapper? Can anyone point me to an example online, or to an
example in the Avro test cases in the source code?
Thanks
Yong