Hi Jacob,

On May 13, 2012, at 2:03pm, Jacob Metcalf wrote:
> Ken, thanks for getting back to me.
>
> 1) The Avro specific classes are generated and packed in the same JAR as the mapper and reducer. Attached is my example <http://markmail.org/download.xqy?id=m6te4atgmyrrqyv5&number=1>, which in parallel I am also getting working with MRUnit, so I am discussing it on that forum. If you want to build it you will need to build odagio-avro.
>
> I agree, and cannot comprehend how, if the mapper can serialize, the reducer cannot deserialize. My only guess is that the reducer is running in a separate JVM, and it is only this JVM which has classpath issues. Logically the mapper output would be deserialized before my reducer is instantiated. I noticed that the JAR does get exploded, so my only thought is that something is going wrong in the Cygwin/Hadoop layer at reduction.
>
> 2) Yes, the latest version of Avro is in my job jar. However I am again not sure how to manipulate the Hadoop classpath to ensure it comes first. This is possibly more a topic for the Hadoop list.

Two comments…

1. Your pom.xml doesn't look like it's set up to build a proper Hadoop job jar. After running "mvn assembly:assembly" you should have a job jar that has a lib subdirectory, and inside of that sub-dir you'll have all of the jars (NOT the classes) for your dependencies such as Avro. See http://exported.wordpress.com/2010/01/30/building-hadoop-job-jar-with-maven/ (and the assembly descriptor sketch below these two comments).

After running "mvn assembly:assembly" in your example directory, I get a target/hadoop-example.jar file that's got Hadoop classes (and a bunch of others) all jammed inside it. And your job jar shouldn't have Hadoop classes or jars inside it at all - those should be provided.

2. I would suggest using Hadoop 0.20.2 if you're on Cygwin. That version avoids issues with Hadoop not being able to set permissions on local file system directories.
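For reference, the descriptor from that blog post looks roughly like the sketch below. This is a minimal, untested outline; src/main/assembly/job.xml is just a conventional location, so adjust the path and names for your project. Marking the Hadoop dependency as "provided" in your pom keeps it out of lib, since a runtime-scoped dependency set skips provided jars:

    <!-- src/main/assembly/job.xml - illustrative sketch, not your actual file -->
    <assembly>
      <id>job</id>
      <formats>
        <format>jar</format>
      </formats>
      <includeBaseDirectory>false</includeBaseDirectory>
      <dependencySets>
        <!-- dependent jars (e.g. avro-1.6.3.jar) go unexploded into lib/ -->
        <dependencySet>
          <unpack>false</unpack>
          <scope>runtime</scope>
          <outputDirectory>lib</outputDirectory>
          <excludes>
            <exclude>${groupId}:${artifactId}</exclude>
          </excludes>
        </dependencySet>
        <!-- your own classes (including Avro generated ones) get unpacked
             into the root of the job jar -->
        <dependencySet>
          <unpack>true</unpack>
          <includes>
            <include>${groupId}:${artifactId}</include>
          </includes>
        </dependencySet>
      </dependencySets>
    </assembly>

With a layout like that, "jar tvf target/<your job jar>" should show your classes at the top level, plus entries like lib/avro-1.6.3.jar for the dependent jars.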
Regards,

-- Ken

> From: [email protected]
> Subject: Re: Hadoop 0.23, Avro Specific 1.6.3 and "org.apache.avro.generic.GenericData$Record cannot be cast to "
> Date: Sun, 13 May 2012 11:18:13 -0700
> To: [email protected]
>
> Hi Jacob,
>
> On May 13, 2012, at 4:48am, Jacob Metcalf wrote:
>
>> I have just spent several frustrating hours getting an example MR job using Avro working with Hadoop, and after finally getting it working I thought I would share my findings with everyone.
>>
>> I wrote an example job trying to use Avro MR 1.6.3 to serialize between map and reduce, then attempted to deploy and run it. I am setting up a development cluster with Hadoop 0.23 running pseudo-distributed under Cygwin. I ran my job and it failed with:
>>
>> "org.apache.avro.generic.GenericData$Record cannot be cast to net.jacobmetcalf.avro.Room"
>>
>> where Room is an Avro generated class. I found two problems. The first I have partly solved; the second one is more to do with Hadoop and is as yet unsolved:
>>
>> 1) Why, when I am using Avro Specific, does it end up going Generic?
>>
>> When deserializing, SpecificDatumReader.java attempts to instantiate your target class through reflection. If it fails to create your class, it defaults to a GenericData.Record. Doug has explained this here:
>> http://mail-archives.apache.org/mod_mbox/avro-user/201101.mbox/%[email protected]%3E
>>
>> But why it was doing this was a little harder to work out. Debugging, I saw the SpecificDatumReader could not find my class in its classpath. However in my job runner I had done:
>>
>>   job.setJarByClass(HouseAssemblyJob.class); // This should ensure the JAR is distributed around the cluster
>>
>> I expected that with this, Hadoop would distribute my jar around the cluster. It may be doing the distribution, but it definitely did not add the jar to the reducer's classpath. So to get round this I have now set HADOOP_CLASSPATH to the directory I am running from. This is not going to work in a real cluster, where the job runner is on a different machine from the reducer, so I am keen to figure out whether the problem is Hadoop 0.23, my environment variables, or the fact that I am running under Cygwin.
>
> If your reducer is running, then Hadoop must have distributed your job jar.
>
> In that case, any class that's actually in your job jar (in the proper position) will be distributed and on the classpath.
>
> Sometimes the problem is that you've got a dependent jar, which then needs to be in the "lib" subdirectory inside of your job jar. Are you maybe building your Avro generated classes into a separate jar, and then adding that to the job jar?
>
> Finally, running under Cygwin is…challenging. I teach a Hadoop class, and often the hardest part of the lab is getting everybody's Cygwin installation working with Hadoop. The fact that you've got pseudo-distributed mode working on Cygwin is impressive in itself, but I would suggest trying your job on a real cluster, e.g. using Elastic MapReduce.
>
>> 2) How can I upgrade Hadoop 0.23 to use Avro 1.6.3?
>>
>> Whilst debugging I realised that Hadoop is shipping with Avro 1.5.3. I however want to use 1.6.3 (and 1.7 when it comes out) because of its support for immutability & builders in the generated classes. I probably could just hack the old Avro lib out of my Hadoop distribution and drop the new one in. However I thought it would be cleaner to get Hadoop to distribute my jar to all datanodes and then manipulate my classpath to get the latest version of Avro to the top. So I have packaged Avro 1.6.3 into my job jar using Maven assembly
>
> Did you ensure that it's inside of the /lib subdirectory? What does your job jar look like (via "jar tvf <path to job jar>")?
>
> -- Ken
>
>> and tried to do this in my JobRunner:
>>
>>   job.setJarByClass(MyJob.class);  // This should ensure the JAR is distributed around the cluster
>>   config.setBoolean(MRJobConfig.MAPREDUCE_JOB_USER_CLASSPATH_FIRST, true);  // ensure my version of avro?
>>
>> But it continues to use 1.5.3. I suspect it is again to do with my HADOOP_CLASSPATH, which has avro-1.5.3 in it:
>>
>>   export HADOOP_CLASSPATH="$HADOOP_COMMON_HOME/share/hadoop/mapreduce/*"
>>
>> If anyone has done this and has any ideas, please let me know?
>>
>> Thanks
>>
>> Jacob
>
> --------------------------
> Ken Krugler
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
