Ken, thanks for getting back to me.
1) The Avro specific classes are generated and packed in the same JAR as the
mapper and reducer. Attached is my example
http://markmail.org/download.xqy?id=m6te4atgmyrrqyv5&number=1 which, in
parallel, I am also getting working with MRUnit, so I am discussing that on the
MRUnit forum as well. If you want to build it you will need to build
odagio-avro.
I agree, and I cannot see how the mapper can serialize but the reducer cannot
deserialize. My only guess is that the reducer runs in a separate JVM and it is
only that JVM which has classpath issues. Logically the mapper output would be
deserialized before my reducer is instantiated. I noticed that the JAR does get
exploded, so my only thought is that something is going wrong in the
Cygwin/Hadoop layer at the reduce stage.
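One way to test this theory (a sketch, not code from the attached example) is a
small diagnostic called from the reducer's setup()/configure(), to see whether
the generated class is visible in that JVM:

    private static void checkRoomOnClasspath() {
        try {
            // Same class name as in the ClassCastException message
            Class.forName("net.jacobmetcalf.avro.Room");
            System.err.println("Room is visible to the reducer JVM");
        } catch (ClassNotFoundException e) {
            System.err.println("Room NOT visible; reducer classpath = "
                    + System.getProperty("java.class.path"));
        }
    }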
2) Yes, the latest version of Avro is in my job JAR. However I am again not
sure how to manipulate the Hadoop classpath to ensure it comes first. This is
possibly more a topic for the Hadoop list.
Regards
Jacob
From: [email protected]
Subject: Re: Hadoop 0.23, Avro Specific 1.6.3 and
"org.apache.avro.generic.GenericData$Record cannot be cast to "
Date: Sun, 13 May 2012 11:18:13 -0700
To: [email protected]
Hi Jacob,
On May 13, 2012, at 4:48am, Jacob Metcalf wrote:

I have just spent several frustrating hours getting an example MR job using
Avro working with Hadoop, and after finally getting it working I thought I
would share my findings with everyone.
I wrote an example job trying to use Avro MR 1.6.3 to serialize between Map and
Reduce, then attempted to deploy and run it. I am setting up a development
cluster with Hadoop 0.23 running pseudo-distributed under Cygwin. I ran my job
and it failed with:
"org.apache.avro.generic.GenericData$Record cannot be cast to
net.jacobmetcalf.avro.Room"
Where Room is an Avro generated class. I found two problems. The first I have
partly solved, the second one is more to do with Hadoop and is as yet unsolved:
1) Why when I am using Avro Specific does it end up going Generic?
When deserializing, SpecificDatumReader attempts to instantiate your target
class through reflection. If it fails to create your class it defaults to a
GenericData.Record. Doug has explained this here:
http://mail-archives.apache.org/mod_mbox/avro-user/201101.mbox/%[email protected]%3E
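As a rough illustration of that fallback (my sketch, assuming the generated
Room class with its SCHEMA$ field, and that SpecificData.getClass() returns
null when the class cannot be loaded, as per Doug's explanation):

    import org.apache.avro.specific.SpecificData;
    import net.jacobmetcalf.avro.Room;

    public class CheckSpecificResolution {
        public static void main(String[] args) {
            // SpecificDatumReader resolves the record class via SpecificData;
            // if the lookup fails, it falls back to GenericData.Record.
            Class<?> c = SpecificData.get().getClass(Room.SCHEMA$);
            System.out.println(c == null
                    ? "Room not resolvable -> GenericData.Record fallback"
                    : "Room resolves to " + c.getName());
        }
    }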
But why it was doing this was a little harder to work out. While debugging I
saw that SpecificDatumReader could not find my class on its classpath. However
in my Job Runner I had done:

job.setJarByClass(HouseAssemblyJob.class); // This should ensure the JAR is distributed around the cluster
I expected that with this Hadoop would distribute my JAR around the cluster. It
may be doing the distribution, but it definitely did not add it to the
Reducer's classpath. So to get around this I have now set HADOOP_CLASSPATH to
the directory I am running from. This is not going to work in a real cluster,
where the Job Runner is on a different machine to the one the Reducer runs on,
so I am keen to figure out whether the problem is Hadoop 0.23, my environment
variables or the fact I am running under Cygwin.
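Concretely, that workaround amounts to something like the following (the exact
value depends on where the job classes sit; this is just an illustration):

export HADOOP_CLASSPATH="$(pwd)"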
If your reducer is running, then Hadoop must have distributed your job jar.
In that case, any class that's actually in your job jar (in the proper
position) will be distributed and on the classpath.
Sometimes the problem is that you've got a dependent jar, which then needs to
be in the "lib" subdirectory inside of your job jar. Are you maybe building
your Avro generated classes into a separate jar, and then adding that to the
job jar?
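For illustration only (hypothetical names, not your actual jar), a job jar
built that way would list something like this via "jar tvf":

    META-INF/MANIFEST.MF
    net/jacobmetcalf/HouseAssemblyJob.class
    net/jacobmetcalf/avro/Room.class
    lib/avro-1.6.3.jar
    lib/avro-mapred-1.6.3.jar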
Finally, running under Cygwin is…challenging. I teach a Hadoop class, and often
the hardest part of the lab is getting everybody's Cygwin installation working
with Hadoop. The fact that you've got pseudo-distributed mode working on Cygwin
is impressive in itself, but I would suggest trying your job on a real cluster,
e.g. use Elastic MapReduce.
2) How can I upgrade Hadoop 0.23 to use Avro 1.6.3?
Whilst debugging I realised that Hadoop ships with Avro 1.5.3. However I want
to use 1.6.3 (and 1.7 when it comes out) because of its support for
immutability and builders in the generated classes. I probably could just hack
the old Avro lib out of my Hadoop distribution and drop the new one in, but I
thought it would be cleaner to get Hadoop to distribute my jar to all
datanodes and then manipulate my classpath to get the latest version of Avro
to the top. So I have packaged Avro 1.6.3 into my job jar using Maven assembly
Did you ensure that it's inside of the /lib subdirectory? What does your job
jar look like (via "jar tvf <path to job jar>")?
-- Ken
and tried to do this in my JobRunner:

job.setJarByClass(MyJob.class); // This should ensure the JAR is distributed around the cluster
config.setBoolean(MRJobConfig.MAPREDUCE_JOB_USER_CLASSPATH_FIRST, true); // ensure my version of avro?
But it continues to use 1.5.3. I suspect it is again to do with my
HADOOP_CLASSPATH which has avro-1.5.3 in it:
export HADOOP_CLASSPATH="$HADOOP_COMMON_HOME/share/hadoop/mapreduce/*"
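One way to check which Avro version actually wins at runtime (a diagnostic
sketch, not something in my job yet) is to log where the Schema class was
loaded from, e.g. in the mapper's or reducer's setup:

    // Prints the jar that org.apache.avro.Schema was loaded from
    System.out.println("Avro loaded from: "
            + org.apache.avro.Schema.class.getProtectionDomain()
                    .getCodeSource().getLocation());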
If anyone has done this and has any ideas, please let me know.
Thanks
Jacob
--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr