I have just spent several frustrating hours getting an example MR job using 
Avro to work with Hadoop, and having finally got it working I thought I would 
share my findings with everyone.
I wrote an example job using Avro MR 1.6.3 to serialize between Map and 
Reduce, then attempted to deploy and run it. I am setting up a development 
cluster with Hadoop 0.23 running pseudo-distributed under Cygwin. I ran my job 
and it failed with:
"org.apache.avro.generic.GenericData$Record cannot be cast to 
net.jacobmetcalf.avro.Room" 
where Room is an Avro-generated class. I found two problems. The first I have 
partly solved; the second has more to do with Hadoop and is as yet unsolved:
1) Why, when I am using Avro Specific, does it end up going Generic?
When deserializing, SpecificDatumReader attempts to instantiate your target 
class through reflection. If it fails to create your class it falls back to a 
GenericData.Record. Doug has explained this here: 
http://mail-archives.apache.org/mod_mbox/avro-user/201101.mbox/%[email protected]%3E
Why it was doing so was a little harder to work out. While debugging I saw that 
SpecificDatumReader could not find my class on its classpath. However, in my 
Job Runner I had done: 
                // This should ensure the JAR is distributed around the cluster
                job.setJarByClass(HouseAssemblyJob.class);
I expected that with this Hadoop would distribute my jar around the cluster. It 
may be doing the distribution, but it definitely did not add it to the 
Reducer's classpath. To get round this I have now set HADOOP_CLASSPATH to the 
directory I am running from. This is not going to work in a real cluster, where 
the Job Runner is on a different machine from the Reducer, so I am keen to 
figure out whether the problem is Hadoop 0.23, my environment variables or the 
fact that I am running under Cygwin.
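
A quick way to check the classpath theory is to ask a class loader directly 
whether it can see the generated class. The sketch below is not Avro code. It 
uses only the JDK and assumes that a reader which cannot resolve the class will 
hand back generic records, as described above. The class name ClassVisibility 
and its methods are made up for illustration, and to be meaningful the check 
would need to run inside the task JVM (for example from a Reducer's setup 
code), not on the submitting machine.

```java
// Diagnostic sketch (hypothetical helper, not part of Avro or Hadoop).
class ClassVisibility {

    // Returns the class if the given loader can resolve it, or null if not.
    static Class<?> tryLoad(String className, ClassLoader loader) {
        try {
            // initialize=false: resolve the class without running static initializers
            return Class.forName(className, false, loader);
        } catch (ClassNotFoundException e) {
            return null;
        } catch (LinkageError e) {
            // the class was found but could not be linked, so it is still unusable
            return null;
        }
    }

    // Assumption: a reader that cannot resolve the class falls back to generic records.
    static String expectedRecordKind(String className, ClassLoader loader) {
        return tryLoad(className, loader) != null ? "specific" : "generic";
    }

    public static void main(String[] args) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        // Default is the generated class from this post; pass another name as args[0].
        String name = args.length > 0 ? args[0] : "net.jacobmetcalf.avro.Room";
        System.out.println(name + " -> " + expectedRecordKind(name, loader));
    }
}
```

If this reports "generic" inside the Reducer but "specific" in the Job Runner, 
the job jar is not on the task's classpath, whatever happened to its 
distribution.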

2) How can I upgrade Hadoop 0.23 to use Avro 1.6.3?
Whilst debugging I realised that Hadoop ships with Avro 1.5.3. However, I want 
to use 1.6.3 (and 1.7 when it comes out) because of its support for 
immutability and builders in the generated classes. I could probably just hack 
the old Avro lib out of my Hadoop distribution and drop the new one in. 
However, I thought it would be cleaner to get Hadoop to distribute my jar to 
all datanodes and then manipulate my classpath to get the latest version of 
Avro to the top. So I have packaged Avro 1.6.3 into my job jar using Maven 
assembly and tried to do this in my JobRunner:
                // This should ensure the JAR is distributed around the cluster
                job.setJarByClass(MyJob.class);
                // Ensure my version of Avro?
                config.setBoolean(MRJobConfig.MAPREDUCE_JOB_USER_CLASSPATH_FIRST, true);
But it continues to use 1.5.3. I suspect it is again to do with my 
HADOOP_CLASSPATH, which has avro-1.5.3 in it:
                export HADOOP_CLASSPATH="$HADOOP_COMMON_HOME/share/hadoop/mapreduce/*"
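
To confirm which copy of Avro a JVM has actually picked up, it can help to 
print where a class was loaded from. The following is a minimal sketch using 
only the JDK. JarLocator is a hypothetical helper name, and 
org.apache.avro.Schema is used merely as an example of a class that lives in 
the Avro jar.

```java
import java.security.CodeSource;

// Diagnostic sketch (hypothetical helper): reports the jar or directory a class came from.
class JarLocator {

    // Returns the code source location of the class, or a marker when none is recorded
    // (classes loaded by the bootstrap loader, such as java.lang.String, have none).
    static String locationOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        return source == null ? "(bootstrap or unknown)" : String.valueOf(source.getLocation());
    }

    public static void main(String[] args) {
        // Example class name only; pass a different one as args[0].
        String name = args.length > 0 ? args[0] : "org.apache.avro.Schema";
        try {
            System.out.println(name + " loaded from " + locationOf(Class.forName(name)));
        } catch (ClassNotFoundException e) {
            System.out.println(name + " is not on this classpath");
        }
    }
}
```

Calling locationOf from inside a Mapper or Reducer and logging the result 
should show whether the avro-1.5.3 jar from HADOOP_CLASSPATH or the copy 
bundled in the job jar comes first.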
If anyone has done this and has any ideas, please let me know.
Thanks
Jacob                                     
