It means that a hadoop1 dependency is getting into the jar somehow: Hadoop 1.x defined org.apache.hadoop.mapreduce.JobContext as a concrete class, and Hadoop 2.x changed it to an interface, so bytecode compiled against the Hadoop 1 class fails with exactly this IncompatibleClassChangeError once a Hadoop 2 JobContext is on the runtime classpath. It's not obvious to me how it's sneaking in, though... do you have a dependency tree you can tease apart?
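If it helps narrow things down, here's a minimal diagnostic sketch you could run against the same classpath the job sees (the JobContextCheck class name and the report helper are just mine, not anything from Crunch or Hadoop). It uses plain reflection to show whether the JobContext that actually wins is the Hadoop 1 class or the Hadoop 2 interface, and which jars JobContext and the failing CrunchInputFormat were loaded from:

import java.security.CodeSource;

public class JobContextCheck {

    // Print whether the named class is an interface and which jar
    // on the classpath it was actually loaded from.
    private static void report(String name) throws Exception {
        Class<?> cls = Class.forName(name);
        CodeSource cs = cls.getProtectionDomain().getCodeSource();
        System.out.println(name
                + "\n  isInterface = " + cls.isInterface()
                + "\n  loaded from = " + (cs == null ? "<bootstrap>" : cs.getLocation()));
    }

    public static void main(String[] args) throws Exception {
        // Hadoop 1.x shipped JobContext as a concrete class; Hadoop 2.x
        // changed it to an interface. "Found interface ... but class was
        // expected" means the runtime sees the interface while some
        // bytecode was compiled against the old class.
        report("org.apache.hadoop.mapreduce.JobContext");

        // Also show which Crunch jar the failing class comes from; a
        // hadoop1-built Crunch artifact here would explain the error.
        report("org.apache.crunch.impl.mr.run.CrunchInputFormat");
    }
}

If you're building with Maven, mvn dependency:tree -Dincludes=org.apache.hadoop is the quickest way to spot a stray hadoop-core (the Hadoop 1.x artifact) being pulled in transitively next to the Hadoop 2 client jars.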
On Thu, Apr 26, 2018 at 12:17 PM, David Ortiz <[email protected]> wrote:

> Hello,
>
> I am playing around with trying to run a mapreduce pipeline we've had
> in production for a little while on Spark. When I switch to the spark
> pipeline and try to run it, I have run into the following exception:
>
> Exception in thread "Thread-32" java.lang.IncompatibleClassChangeError:
> Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
>         at org.apache.crunch.impl.mr.run.CrunchInputFormat.getSplits(CrunchInputFormat.java:44)
>         at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:124)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>         at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952)
>         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1144)
>         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1074)
>         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1074)
>         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>         at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
>         at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1074)
>         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply$mcV$sp(PairRDDFunctions.scala:994)
>         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:985)
>         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:985)
>         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>         at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
>         at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:985)
>         at org.apache.spark.api.java.JavaPairRDD.saveAsNewAPIHadoopFile(JavaPairRDD.scala:800)
>         at org.apache.crunch.impl.spark.SparkRuntime.monitorLoop(SparkRuntime.java:321)
>         at org.apache.crunch.impl.spark.SparkRuntime.access$000(SparkRuntime.java:77)
>         at org.apache.crunch.impl.spark.SparkRuntime$2.run(SparkRuntime.java:136)
>         at java.lang.Thread.run(Thread.java:745)
>
> This happens both on EMR 5.12.0 using the 0.15.0 artifacts, as well as on
> a non-production CDH cluster running CDH 5.13.1 parcels.
>
> Any idea what would cause this?
>
> Thanks,
> Dave
