SPARK-3039 "Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API"
was marked resolved with the Spark 1.2.0 release. However, when I download the
pre-built Spark distribution for Hadoop 2.4 and later (spark-1.2.0-bin-hadoop2.4.tgz)
and run it against Avro code compiled against Hadoop 2.4/the new Hadoop API, I still get:

java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
        at org.apache.avro.mapreduce.AvroRecordReaderBase.initialize(AvroRecordReaderBase.java:87)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:135)
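For context, the error surfaces from a job along these lines (a minimal sketch;
the input path and record type are placeholders, not from the original report):

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
import org.apache.spark.{SparkConf, SparkContext}

object AvroReadRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("avro-repro"))
    // Reading Avro through the new Hadoop API goes through
    // AvroRecordReaderBase.initialize, which is where the
    // IncompatibleClassChangeError surfaces when the hadoop1 build of
    // avro-mapred is on the classpath.
    val records = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
      AvroKeyInputFormat[GenericRecord]]("hdfs:///path/to/data.avro")
    println(records.count())
    sc.stop()
  }
}
```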

TaskAttemptContext was a class in the Hadoop 1.x series but became an interface
in Hadoop 2.x. Avro therefore publishes two artifacts: avro-mapred-1.7.6.jar
(built against the Hadoop 1 API) and avro-mapred-1.7.6-hadoop2.jar. For Hadoop
2.x, avro-mapred-1.7.6-hadoop2.jar is the one that is needed.
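In an application's own build, the hadoop2 variant can be requested explicitly
via the classifier. A sketch (the version number is illustrative; adjust to
your Avro release):

```xml
<dependency>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-mapred</artifactId>
  <version>1.7.6</version>
  <classifier>hadoop2</classifier>
</dependency>
```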

So it seemed that spark-assembly-1.2.0-hadoop2.4.0.jar still did not contain
org.apache.avro.mapreduce.AvroRecordReaderBase from avro-mapred-1.7.6-hadoop2.jar.

I then downloaded the source code and compiled with:
mvn -Pyarn -Phadoop-2.4 -Phive-0.13.1 -DskipTests clean package

The hadoop-2.4 profile sets
<avro.mapred.classifier>hadoop2</avro.mapred.classifier>, which, through
dependency management, should pull in the correct hadoop2 variant:

    <dependency>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro-mapred</artifactId>
      <version>${avro.version}</version>
      <classifier>${avro.mapred.classifier}</classifier>
      <exclusions>
        ...
      </exclusions>
    </dependency>

However, I hit the same IncompatibleClassChangeError after replacing the assembly jar.

I had cleaned my local ~/.m2/repository before the build and found that two
avro-mapred versions had been downloaded: 1.7.5 (no classifier, i.e. hadoop1)
and 1.7.6 (hadoop2). That seemed a likely culprit.

After installing the newly built jar files into my local repo (I had to
hand-copy the poms/jars for the repl/yarn subprojects) and then running:

mvn -Pyarn -Phadoop-2.4 -Phive-0.13.1 -DskipTests dependency:tree -Dincludes=org.apache.avro:avro-mapred

[INFO] Building Spark Project Hive 1.2.0
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-dependency-plugin:2.4:tree (default-cli) @ spark-hive_2.10 ---
[INFO] org.apache.spark:spark-hive_2.10:jar:1.2.0
[INFO] +- org.spark-project.hive:hive-exec:jar:0.13.1a:compile
[INFO] |  \- org.apache.avro:avro-mapred:jar:1.7.5:compile
[INFO] \- org.apache.avro:avro-mapred:jar:hadoop2:1.7.6:compile
[INFO]

This showed that hive-exec transitively pulled in avro-mapred-1.7.5.jar (the
hadoop1 variant). Fix for Spark 1.2.x:

spark-1.2.0/sql/hive/pom.xml

    <dependency>
      <groupId>org.spark-project.hive</groupId>
      <artifactId>hive-exec</artifactId>
      <version>${hive.version}</version>
      <exclusions>
        <exclusion>
          <groupId>commons-logging</groupId>
          <artifactId>commons-logging</artifactId>
        </exclusion>
        <exclusion>
          <groupId>com.esotericsoftware.kryo</groupId>
          <artifactId>kryo</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.apache.avro</groupId>
          <artifactId>avro-mapred</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

Just add the last exclusion, for avro-mapred (see the comparison at https://github.com/medale/spark/compare/apache:v1.2.1-rc2...medale:avro-hadoop2-v1.2.1-rc2).
With that fix in place I was able to build Spark and run my Avro code against it.

Fix for current master: https://github.com/apache/spark/pull/4315

Any feedback much appreciated,
Markus

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
