Hi Jacob,

On May 13, 2012, at 2:03pm, Jacob Metcalf wrote:
> Ken, thanks for getting back to me.
>
> 1) The Avro specific classes are generated and packed in the same JAR as the mapper and reducer. Attached is my example <http://markmail.org/download.xqy?id=m6te4atgmyrrqyv5&number=1>, which in parallel I am also getting working with MRUnit, so I am discussing it on that forum. If you want to build it you will need to build odagio-avro.
>
> I agree, and cannot comprehend how, if the mapper can serialize, the reducer cannot deserialize. My only guess is that the reducer is running in a separate JVM, and it is only this JVM which has classpath issues. Logically the mapper output would be deserialized before my reducer is instantiated. I noticed that the JAR does get exploded, so my only thought is that something is going wrong in the Cygwin/Hadoop layer at reduction.
>
> 2) Yes, the latest version of Avro is in my job jar. However I am again not sure how to manipulate the Hadoop classpath to ensure it comes first. This is possibly more a topic for the Hadoop list.

Two comments…

1. Your pom.xml doesn't look like it's set up to build a proper Hadoop job jar. After running "mvn assembly:assembly" you should have a job jar that has a lib subdirectory, and inside of that sub-dir you'll have all of the jars (NOT the classes) for your dependencies such as Avro. See http://exported.wordpress.com/2010/01/30/building-hadoop-job-jar-with-maven/ (and the assembly descriptor sketch below these two comments).

After running "mvn assembly:assembly" in your example directory, I get a target/hadoop-example.jar file that's got Hadoop classes (and a bunch of others) all jammed inside it. And your job jar shouldn't have Hadoop classes or jars inside it at all - those should be provided.

2. I would suggest using Hadoop 0.20.2 if you're on Cygwin. That version avoids issues with Hadoop not being able to set permissions on local file system directories.
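For reference, the descriptor from that blog post looks roughly like the sketch below. This is a minimal, untested outline; src/main/assembly/job.xml is just a conventional location, so adjust the path and names for your project. Marking the Hadoop dependency as "provided" in your pom keeps it out of lib, since a runtime-scoped dependency set skips provided jars:

    <!-- src/main/assembly/job.xml - illustrative sketch, not your actual file -->
    <assembly>
      <id>job</id>
      <formats>
        <format>jar</format>
      </formats>
      <includeBaseDirectory>false</includeBaseDirectory>
      <dependencySets>
        <!-- dependent jars (e.g. avro-1.6.3.jar) go unexploded into lib/ -->
        <dependencySet>
          <unpack>false</unpack>
          <scope>runtime</scope>
          <outputDirectory>lib</outputDirectory>
          <excludes>
            <exclude>${groupId}:${artifactId}</exclude>
          </excludes>
        </dependencySet>
        <!-- your own classes (including Avro generated ones) get unpacked
             into the root of the job jar -->
        <dependencySet>
          <unpack>true</unpack>
          <includes>
            <include>${groupId}:${artifactId}</include>
          </includes>
        </dependencySet>
      </dependencySets>
    </assembly>

With a layout like that, "jar tvf target/<your job jar>" should show your classes at the top level, plus entries like lib/avro-1.6.3.jar for the dependent jars.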
Regards,

-- Ken

> From: [email protected]
> Subject: Re: Hadoop 0.23, Avro Specific 1.6.3 and "org.apache.avro.generic.GenericData$Record cannot be cast to "
> Date: Sun, 13 May 2012 11:18:13 -0700
> To: [email protected]
>
> Hi Jacob,
>
> On May 13, 2012, at 4:48am, Jacob Metcalf wrote:
>
>> I have just spent several frustrating hours getting an example MR job using Avro working with Hadoop, and after finally getting it working I thought I would share my findings with everyone.
>>
>> I wrote an example job trying to use Avro MR 1.6.3 to serialize between map and reduce, then attempted to deploy and run it. I am setting up a development cluster with Hadoop 0.23 running pseudo-distributed under Cygwin. I ran my job and it failed with:
>>
>> "org.apache.avro.generic.GenericData$Record cannot be cast to net.jacobmetcalf.avro.Room"
>>
>> where Room is an Avro generated class. I found two problems. The first I have partly solved; the second one is more to do with Hadoop and is as yet unsolved:
>>
>> 1) Why, when I am using Avro Specific, does it end up going Generic?
>>
>> When deserializing, SpecificDatumReader.java attempts to instantiate your target class through reflection. If it fails to create your class, it defaults to a GenericData.Record. Doug has explained this here:
>> http://mail-archives.apache.org/mod_mbox/avro-user/201101.mbox/%[email protected]%3E
>>
>> But why it was doing this was a little harder to work out. Debugging, I saw the SpecificDatumReader could not find my class in its classpath. However in my job runner I had done:
>>
>>   job.setJarByClass(HouseAssemblyJob.class); // This should ensure the JAR is distributed around the cluster
>>
>> I expected that with this, Hadoop would distribute my jar around the cluster. It may be doing the distribution, but it definitely did not add the jar to the reducer's classpath. So to get round this I have now set HADOOP_CLASSPATH to the directory I am running from. This is not going to work in a real cluster, where the job runner is on a different machine from the reducer, so I am keen to figure out whether the problem is Hadoop 0.23, my environment variables, or the fact that I am running under Cygwin.
>
> If your reducer is running, then Hadoop must have distributed your job jar.
>
> In that case, any class that's actually in your job jar (in the proper position) will be distributed and on the classpath.
>
> Sometimes the problem is that you've got a dependent jar, which then needs to be in the "lib" subdirectory inside of your job jar. Are you maybe building your Avro generated classes into a separate jar, and then adding that to the job jar?
>
> Finally, running under Cygwin is…challenging. I teach a Hadoop class, and often the hardest part of the lab is getting everybody's Cygwin installation working with Hadoop. The fact that you've got pseudo-distributed mode working on Cygwin is impressive in itself, but I would suggest trying your job on a real cluster, e.g. using Elastic MapReduce.
>
>> 2) How can I upgrade Hadoop 0.23 to use Avro 1.6.3?
>>
>> Whilst debugging I realised that Hadoop is shipping with Avro 1.5.3. I however want to use 1.6.3 (and 1.7 when it comes out) because of its support for immutability & builders in the generated classes. I probably could just hack the old Avro lib out of my Hadoop distribution and drop the new one in. However I thought it would be cleaner to get Hadoop to distribute my jar to all datanodes and then manipulate my classpath to get the latest version of Avro to the top. So I have packaged Avro 1.6.3 into my job jar using Maven assembly
>
> Did you ensure that it's inside of the /lib subdirectory? What does your job jar look like (via "jar tvf <path to job jar>")?
>
> -- Ken
>
>> and tried to do this in my JobRunner:
>>
>>   job.setJarByClass(MyJob.class);  // This should ensure the JAR is distributed around the cluster
>>   config.setBoolean(MRJobConfig.MAPREDUCE_JOB_USER_CLASSPATH_FIRST, true);  // ensure my version of avro?
>>
>> But it continues to use 1.5.3. I suspect it is again to do with my HADOOP_CLASSPATH, which has avro-1.5.3 in it:
>>
>>   export HADOOP_CLASSPATH="$HADOOP_COMMON_HOME/share/hadoop/mapreduce/*"
>>
>> If anyone has done this and has any ideas, please let me know?
>>
>> Thanks
>>
>> Jacob
>
> --------------------------
> Ken Krugler
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
