The next thing to check is whether you are mixing versions of Scala (2.11 vs 2.12), or, more specifically, whether you are compiling against a different Scala version than the one being packaged in your assembly.
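One quick runtime sanity check (a hedged suggestion, not something from this thread; the class name below is made up) is to print the Scala version that actually gets loaded on the driver and on an executor. If the assembly bundles a different scala-library than the cluster's Spark distribution, the two prints can disagree; on the build side, every _2.10/_2.11/_2.12-suffixed artifact in the dependency tree should match whichever Scala line the cluster's Spark was built for. A minimal sketch, assuming a Spark 2.x / Java 8 setup:

    import java.util.Arrays;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    // Hypothetical sanity check: compares the Scala library loaded by the
    // driver with the one loaded inside an executor task.
    public class ScalaVersionCheck {
        public static void main(String[] args) {
            // Scala version of the scala-library jar on the driver classpath.
            System.out.println("driver scala: " + scala.util.Properties.versionNumberString());

            SparkConf conf = new SparkConf().setAppName("scala-version-check");
            try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
                // Same check on an executor; a mismatch points at conflicting
                // Scala artifacts between the assembly and the cluster.
                String executorScala = jsc.parallelize(Arrays.asList(1))
                        .map(x -> scala.util.Properties.versionNumberString())
                        .first();
                System.out.println("executor scala: " + executorScala);
            }
        }
    }

On the compile side, mvn dependency:tree -Dincludes=org.scala-lang will also show which scala-library versions are being pulled into the build.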
On Fri, Apr 27, 2018 at 3:02 PM, David Ortiz <[email protected]> wrote:

> Alright. After double checking all the versions, and rebuilding as a fat
> jar, I'm now getting this.
>
> Driver stacktrace:
>         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1708)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1696)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1695)
>         at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>         at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>         at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1695)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:855)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:855)
>         at scala.Option.foreach(Option.scala:257)
>         at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:855)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1923)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1878)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1867)
>         at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>         at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:671)
>         at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
>         at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
>         at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
>         at org.apache.spark.internal.io.SparkHadoopMapReduceWriter$.write(SparkHadoopMapReduceWriter.scala:88)
>         ... 19 more
> Caused by: java.lang.AbstractMethodError: org.apache.crunch.impl.spark.fn.CrunchPairTuple2.call(Ljava/lang/Object;)Ljava/util/Iterator;
>         at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:186)
>         at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:186)
>         at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
>         at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>         at org.apache.spark.scheduler.Task.run(Task.scala:108)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         ... 1 more
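For what it's worth, an AbstractMethodError on a call method whose descriptor ends in Ljava/util/Iterator; is the classic symptom of crossing the Spark 1.x/2.x Java API boundary: the FlatMapFunction-style interfaces changed their return type from Iterable to Iterator in Spark 2.0, so a class compiled against the old signature and invoked by a Spark 2.x runtime fails in exactly this way. A minimal sketch of the signature Spark 2.x expects (a hypothetical class for illustration, not the actual Crunch code):

    import java.util.Collections;
    import java.util.Iterator;
    import org.apache.spark.api.java.function.PairFlatMapFunction;
    import scala.Tuple2;

    // Against Spark 2.x, call must return Iterator; against Spark 1.x the same
    // method returned Iterable. A class compiled with the old signature only
    // provides call(...)Ljava/lang/Iterable;, so when a Spark 2.x executor
    // invokes call(...)Ljava/util/Iterator; the JVM throws AbstractMethodError.
    public class ExamplePairFn implements PairFlatMapFunction<String, String, Integer> {
        @Override
        public Iterator<Tuple2<String, Integer>> call(String line) {
            return Collections.singletonList(new Tuple2<>(line, line.length())).iterator();
        }
    }

If that is what is happening here, it would be worth confirming that the crunch-spark artifact going into the fat jar was built against the same Spark major version the cluster is actually running.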
>
> On Thu, Apr 26, 2018 at 6:54 PM David Ortiz <[email protected]> wrote:
>
>> Oh wow. I'll take a look tomorrow morning and see if I can figure it out.
>>
>> On Thu, Apr 26, 2018, 6:08 PM Josh Wills <[email protected]> wrote:
>>
>>> It means that a hadoop1 dependency is getting into the jar somehow,
>>> although it's not obvious to me how... do you have a dependency tree you
>>> can tease apart?
>>>
>>> On Thu, Apr 26, 2018 at 12:17 PM, David Ortiz <[email protected]> wrote:
>>>
>>>> Hello,
>>>>
>>>> I am playing around with trying to run a mapreduce pipeline we've had
>>>> in production for a little while on Spark. When I switch to the spark
>>>> pipeline and try to run it, I have run into the following exception:
>>>>
>>>> Exception in thread "Thread-32" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
>>>>         at org.apache.crunch.impl.mr.run.CrunchInputFormat.getSplits(CrunchInputFormat.java:44)
>>>>         at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:124)
>>>>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>>>>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>>>>         at scala.Option.getOrElse(Option.scala:120)
>>>>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>>>>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>>>>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>>>>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>>>>         at scala.Option.getOrElse(Option.scala:120)
>>>>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>>>>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>>>>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>>>>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>>>>         at scala.Option.getOrElse(Option.scala:120)
>>>>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>>>>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>>>>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>>>>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>>>>         at scala.Option.getOrElse(Option.scala:120)
>>>>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>>>>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>>>>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>>>>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>>>>         at scala.Option.getOrElse(Option.scala:120)
>>>>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>>>>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>>>>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>>>>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>>>>         at scala.Option.getOrElse(Option.scala:120)
>>>>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>>>>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>>>>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>>>>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>>>>         at scala.Option.getOrElse(Option.scala:120)
>>>>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>>>>         at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>>>>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>>>>         at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>>>>         at scala.Option.getOrElse(Option.scala:120)
>>>>         at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>>>>         at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952)
>>>>         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1144)
>>>>         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1074)
>>>>         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1074)
>>>>         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>>>>         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>>>>         at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
>>>>         at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1074)
>>>>         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply$mcV$sp(PairRDDFunctions.scala:994)
>>>>         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:985)
>>>>         at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:985)
>>>>         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>>>>         at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>>>>         at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
>>>>         at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:985)
>>>>         at org.apache.spark.api.java.JavaPairRDD.saveAsNewAPIHadoopFile(JavaPairRDD.scala:800)
>>>>         at org.apache.crunch.impl.spark.SparkRuntime.monitorLoop(SparkRuntime.java:321)
>>>>         at org.apache.crunch.impl.spark.SparkRuntime.access$000(SparkRuntime.java:77)
>>>>         at org.apache.crunch.impl.spark.SparkRuntime$2.run(SparkRuntime.java:136)
>>>>         at java.lang.Thread.run(Thread.java:745)
>>>>
>>>> This happens both on EMR 5.12.0 using the 0.15.0 artifacts, as well as
>>>> on a non-production CDH cluster running CDH 5.13.1 parcels.
>>>>
>>>> Any idea what would cause this?
>>>>
>>>> Thanks,
>>>> Dave
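Coming back to the hadoop1 theory above: "Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected" is the standard signature of that problem, because JobContext was a concrete class in Hadoop 1 and became an interface in Hadoop 2, so code compiled against one line and run against the other breaks at exactly this point. Besides teasing apart mvn dependency:tree -Dincludes=org.apache.hadoop, a tiny probe run with the same classpath the Spark job sees can show which flavor is actually being loaded. This is only an illustrative sketch (the class name is made up):

    import org.apache.hadoop.mapreduce.JobContext;

    // Hypothetical diagnostic: reports whether the hadoop1 (class) or
    // hadoop2 (interface) flavor of JobContext is on the classpath, and
    // which jar it was loaded from.
    public class JobContextProbe {
        public static void main(String[] args) {
            Class<?> cls = JobContext.class;
            System.out.println("JobContext is an interface: " + cls.isInterface());
            // CodeSource can be null for classes from the bootstrap classpath.
            java.security.CodeSource src = cls.getProtectionDomain().getCodeSource();
            System.out.println("loaded from: " + (src == null ? "bootstrap/unknown" : src.getLocation()));
        }
    }

If it reports a class rather than an interface, or a jar you did not expect, that is the hadoop1 dependency to hunt down in the dependency tree.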
