Hi Alan,

Thanks for your suggestion. I will take a look at AvroMultipleOutputs.

But in this case, I still need to specify the schema in my driver, right? You
mean I should use a union schema here? And in my mapper, should I use
SpecificRecord or GenericRecord? I can use (K,V) in my reducer, but in the
mapper I need a concrete Record object to serialize my data, right?
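
To check my understanding, is the idea something like the sketch below?
(Untested; I am assuming each GenericRecord is built against its own concrete
data-set schema and that Avro picks the matching branch of the union at write
time. The schema file and field name are made up.)

    import java.io.File;
    import java.io.IOException;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;

    public class RecordSketch {
      // Build a record against the concrete per-data-set schema. The map
      // output schema registered in the driver would be the union of all
      // five schemas, so any of the five record shapes can be emitted.
      public static GenericRecord newRecord(File avscFile) throws IOException {
        Schema dataSetSchema = new Schema.Parser().parse(avscFile);
        GenericRecord record = new GenericData.Record(dataSetSchema);
        record.put("someField", "someValue"); // placeholder field name/value
        return record;
      }
    }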
Yong

From: [email protected]
To: [email protected]
Subject: RE: Question related to 
AvroJob.setMapOutputSchema(org.apache.hadoop.mapred.JobConf job, Schema s)
Date: Mon, 30 Sep 2013 22:21:57 -0500

Hi Yong,

It sounds like you might need to use AvroMultipleOutputs here. You can set all
five of your output schemas in your driver, then route each record to the
appropriate output in your reducer. See the following for mapred:

http://avro.apache.org/docs/1.7.5/api/java/org/apache/avro/mapred/AvroMultipleOutputs.html

And the following for mapreduce:

http://avro.apache.org/docs/1.7.5/api/java/org/apache/avro/mapreduce/AvroMultipleOutputs.html

If your mapper is generating the Avro records, then you will probably have to
set AvroJob.setMapOutputSchema to a union of all five of your schemas.
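
As a rough sketch (untested, old mapred API; the .avsc paths and named-output
names are placeholders, and you should double-check the addNamedOutput
signature against the javadoc for the API flavor you use), the driver setup
might look like:

    import java.io.File;
    import java.util.Arrays;

    import org.apache.avro.Schema;
    import org.apache.avro.mapred.AvroJob;
    import org.apache.avro.mapred.AvroMultipleOutputs;
    import org.apache.avro.mapred.AvroOutputFormat;
    import org.apache.hadoop.mapred.JobConf;

    public class EtlDriver {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(EtlDriver.class);

        // With specific records these would come from the generated classes
        // (e.g. DataSetA.SCHEMA$); here they are parsed from the .avsc files.
        Schema a = new Schema.Parser().parse(new File("dataSetA.avsc"));
        Schema b = new Schema.Parser().parse(new File("dataSetB.avsc"));
        // ... and c, d, e for the remaining data sets

        // The map output schema is the union of all five record schemas, so
        // a single mapper class can emit any of them.
        AvroJob.setMapOutputSchema(conf,
            Schema.createUnion(Arrays.asList(a, b /*, c, d, e */)));

        // One named output per data set; the reducer routes each record to
        // the output whose schema matches.
        AvroMultipleOutputs.addNamedOutput(conf, "dataSetA", AvroOutputFormat.class, a);
        AvroMultipleOutputs.addNamedOutput(conf, "dataSetB", AvroOutputFormat.class, b);
        // ... and the remaining three
      }
    }

In the reducer you would then obtain the collector for the matching named
output and write the record to it.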
Thanks,
Alan

From: java8964 java8964 [mailto:[email protected]]
Sent: Monday, September 30, 2013 9:37 PM
To: [email protected]
Subject: Question related to 
AvroJob.setMapOutputSchema(org.apache.hadoop.mapred.JobConf job, Schema s)

Hi,

I am new to Avro. I am working on an existing project, and I want to see
whether using Avro makes sense. The project does some ETL over 5 data sets.
The ETL logic is not complex: it applies different transformation logic to
each data set and partitions the data daily in the reducer. Originally there
was one MR job handling all 5 data sets. The data files follow a naming
convention that identifies the data set, so the mapper uses the file name to
determine which data set a record belongs to and generates the key as
"dataset_name + date", partitioning first by data set and then by day.
Now, if I want to store the data in Avro format, it is straightforward to
write an MR job for a single data set by following the many online examples,
and I had no problem changing the MR job to store one data set as Avro. But if
I still want one MR job for all 5 data sets, I run into a problem. I tried
both SpecificRecord and GenericRecord, but I don't know how to solve it.

For example, I created 5 avsc files for the 5 data sets and generated the
Record classes for all of them. But in the mapper/reducer I don't want to name
any concrete Record class; the same mapper/reducer should be able to handle
all the data sets. So I tried to use the SpecificRecord class in my
mapper/reducer, but then I don't have a SpecificRecord.SCHEMA$ to pass to
AvroJob.setMapOutputSchema(conf, Schema) in my driver, even though I really
prefer SpecificRecord. That pushed me to try GenericRecord, so I changed all
my mappers and reducers to use GenericRecord.
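
To make the runtime routing concrete, here is roughly what my mapper does
today (simplified and untested; the prefix-to-schema lookup and the key format
are just my convention):

    import java.util.Map;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;

    // Simplified sketch: pick the schema from the file-name convention and
    // build a GenericRecord against it. The prefix-to-schema map is loaded
    // from the five .avsc files when the mapper is configured.
    public class SchemaRouter {
      private final Map<String, Schema> schemaByPrefix;

      public SchemaRouter(Map<String, Schema> schemaByPrefix) {
        this.schemaByPrefix = schemaByPrefix;
      }

      public GenericRecord newRecordFor(String fileName) {
        for (Map.Entry<String, Schema> e : schemaByPrefix.entrySet()) {
          if (fileName.startsWith(e.getKey())) {
            return new GenericData.Record(e.getValue());
          }
        }
        throw new IllegalArgumentException("Unrecognized data set: " + fileName);
      }

      // Partition key: data set first, then day.
      public static String partitionKey(String dataSetName, String date) {
        return dataSetName + "_" + date;
      }
    }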
But I still don't know what schema to use in my driver for
AvroJob.setMapOutputSchema(conf, Schema). Is there a generic, abstract schema
class I can pass to AvroJob.setMapOutputSchema or AvroJob.setOutputSchema? My
mapper will generate the correct GenericRecord or SpecificRecord at runtime
based on the file name, and the reducer will write each GenericRecord or
SpecificRecord to the right output location without knowing the concrete
Record class. What stops me now is what kind of schema object I can give
AvroJob: at driver time I don't know what my output schema is, but the
mapper/reducer will figure it out at runtime. Can I do this in Avro?

Thanks
Yong
                        
