Question related to AvroJob.setMapOutputSchema(org.apache.hadoop.mapred.JobConf job, Schema s)

java8964 java8964 Mon, 30 Sep 2013 19:38:28 -0700
Hi, 
I am new to Avro. I am currently working on an existing project, and I 
want to see whether using Avro makes sense.
The project does some ETL on 5 data sets. The ETL logic is not complex: it 
applies different transformation logic to each data set and partitions the 
data daily in the reducer.
Originally, one MR job handled all 5 data sets. The data files follow a 
naming convention that distinguishes the data sets, so the mapper uses the 
file name to determine which data set it is reading and generates the key as 
"dataset_name + date", partitioning the data first by data set, then by day.
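For illustration, the key generation described above could be sketched roughly as follows. This is only a minimal sketch, not the actual job: the data set names, the "name_" file prefix convention, the "_" separator in the key, and the class and method names are all assumptions made up for the example.

```java
import java.util.Arrays;
import java.util.List;

final class PartitionKeys {
    // Hypothetical data set names; the real ones are not part of this post.
    private static final List<String> DATA_SETS =
            Arrays.asList("orders", "clicks", "users", "items", "payments");

    /** Returns the data set whose name prefixes the file name (assumed convention). */
    static String dataSetOf(String fileName) {
        for (String name : DATA_SETS) {
            if (fileName.startsWith(name + "_")) {
                return name;
            }
        }
        throw new IllegalArgumentException("Unknown data set for file: " + fileName);
    }

    /** Builds the map output key: data set first, then the day. */
    static String partitionKey(String fileName, String day) {
        return dataSetOf(fileName) + "_" + day;
    }

    public static void main(String[] args) {
        // Prints "orders_20130930"
        System.out.println(partitionKey("orders_part-00001.dat", "20130930"));
    }
}
```

Because the data set name comes first in the key, all records of one data set for one day end up in the same reducer group.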
Now, if I want to store the data in Avro format, it is straightforward to 
write an MR job for a single data set by following the many online examples. 
I have no problem changing the MR job to store the data in Avro format for 
one data set.
But if I still want to use one MR job for all 5 data sets, I run into a 
problem. I tried both "SpecificRecord" and "GenericRecord", but I don't know 
how to solve it.
For example, I created 5 avsc files for the 5 data sets and generated the 
Record classes for all of them. But in the mapper/reducer I don't want to 
specify any concrete Record class, because the same mapper/reducer should be 
able to handle all data sets. So I tried to use the SpecificRecord class in 
my mapper/reducer, but in that case I don't have a SpecificRecord.SCHEMA$ to 
pass to AvroJob.setMapOutputSchema(conf, Schema) in my driver, even though I 
would really prefer "SpecificRecord".
That led me to try "GenericRecord". I changed my mapper and reducer to use 
the "GenericRecord" class, but I still don't know what schema to use in my 
driver class for AvroJob.setMapOutputSchema(conf, Schema). My question is: 
is there a generic, abstract schema class I can use in 
AvroJob.setMapOutputSchema or AvroJob.setOutputSchema? At runtime, my mapper 
will correctly generate either a "GenericRecord" or a "SpecificRecord" object 
based on the file name, and the reducer will write that object to the right 
output location without knowing the concrete Record class. What stops me now 
is which schema object to pass to AvroJob. At the driver stage I don't know 
what my output schema is; the mapper/reducer figure that out at runtime. Can 
I do this in Avro?
Thanks
Yong