I am trying to read an Avro dataset as an RDD, transform it, and write the result back. The job runs fine locally, but when I run it on the YARN cluster I hit an Avro-related failure.
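For context, the read side of the job is essentially the standard newAPIHadoopFile + AvroKeyInputFormat pattern. A minimal sketch of what the code does — the object name and the transform are illustrative, not the actual application code (paths are the ones from the command below):

```scala
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
import org.apache.spark.{SparkConf, SparkContext}

object AvroReadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AvroReadSketch"))

    // Read Avro container files as (AvroKey[GenericRecord], NullWritable) pairs.
    // AvroKeyInputFormat.createRecordReader is the call that blows up on the cluster.
    val records = sc.newAPIHadoopFile(
      "/user/dvasthimal/epdatasets_small/exptsession/2015/02/16",
      classOf[AvroKeyInputFormat[GenericRecord]],
      classOf[AvroKey[GenericRecord]],
      classOf[NullWritable])

    // Placeholder transform: extract each record, then write the result out.
    val out = records.map { case (k, _) => k.datum().toString }
    out.saveAsTextFile("/user/dvasthimal/epdatasets/successdetail")
  }
}
```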
export SPARK_HOME=/home/dvasthimal/spark/spark-1.0.2-bin-2.4.1
export SPARK_YARN_USER_ENV="CLASSPATH=/apache/hadoop/conf"
export HADOOP_CONF_DIR=/apache/hadoop/conf
export YARN_CONF_DIR=/apache/hadoop/conf
export SPARK_JAR=$SPARK_HOME/lib/spark-assembly-1.0.2-hadoop2.4.1.jar
export SPARK_LIBRARY_PATH=/apache/hadoop/lib/native
export SPARK_YARN_USER_ENV="CLASSPATH=/apache/hadoop/conf"
export SPARK_YARN_USER_ENV="CLASSPATH=/apache/hadoop/conf"
export SPARK_CLASSPATH=/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-company-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/home/dvasthimal/spark/avro-mapred-1.7.7-hadoop2.jar:/home/dvasthimal/spark/avro-1.7.7.jar
export SPARK_LIBRARY_PATH="/apache/hadoop/lib/native"
export YARN_CONF_DIR=/apache/hadoop/conf/

cd $SPARK_HOME
./bin/spark-submit --master yarn-cluster \
  --jars /home/dvasthimal/spark/avro-mapred-1.7.7-hadoop2.jar,/home/dvasthimal/spark/avro-1.7.7.jar \
  --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 \
  --queue hdmi-spark \
  --class com.company.ep.poc.spark.reporting.SparkApp \
  /home/dvasthimal/spark/spark_reporting-1.0-SNAPSHOT.jar \
  startDate=2015-02-16 endDate=2015-02-16 \
  epoutputdirectory=/user/dvasthimal/epdatasets_small/exptsession \
  subcommand=successevents outputdir=/user/dvasthimal/epdatasets/successdetail

Spark assembly has been built with Hive, including Datanucleus jars on classpath
15/03/04 03:20:29 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
15/03/04 03:20:30 INFO yarn.Client: Got Cluster metric info from ApplicationsManager (ASM), number of NodeManagers: 2221
15/03/04 03:20:30 INFO yarn.Client: Queue info ...
queueName: hdmi-spark, queueCurrentCapacity: 0.7162806, queueMaxCapacity: 0.08, queueApplicationCount = 7, queueChildQueueCount = 0
15/03/04 03:20:30 INFO yarn.Client: Max mem capabililty of a single resource in this cluster 16384
15/03/04 03:20:30 INFO yarn.Client: Preparing Local resources
15/03/04 03:20:30 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/03/04 03:20:30 WARN hdfs.BlockReaderLocal: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
15/03/04 03:20:46 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 7780745 for dvasthimal on 10.115.206.112:8020
15/03/04 03:20:46 INFO yarn.Client: Uploading file:/home/dvasthimal/spark/spark_reporting-1.0-SNAPSHOT.jar to hdfs://apollo-phx-nn.company.com:8020/user/dvasthimal/.sparkStaging/application_1425075571333_61948/spark_reporting-1.0-SNAPSHOT.jar
15/03/04 03:20:47 INFO yarn.Client: Uploading file:/home/dvasthimal/spark/spark-1.0.2-bin-2.4.1/lib/spark-assembly-1.0.2-hadoop2.4.1.jar to hdfs://apollo-phx-nn.company.com:8020/user/dvasthimal/.sparkStaging/application_1425075571333_61948/spark-assembly-1.0.2-hadoop2.4.1.jar
15/03/04 03:20:52 INFO yarn.Client: Uploading file:/home/dvasthimal/spark/avro-mapred-1.7.7-hadoop2.jar to hdfs://apollo-phx-nn.company.com:8020/user/dvasthimal/.sparkStaging/application_1425075571333_61948/avro-mapred-1.7.7-hadoop2.jar
15/03/04 03:20:52 INFO yarn.Client: Uploading file:/home/dvasthimal/spark/avro-1.7.7.jar to hdfs://apollo-phx-nn.company.com:8020/user/dvasthimal/.sparkStaging/application_1425075571333_61948/avro-1.7.7.jar
15/03/04 03:20:54 INFO yarn.Client: Setting up the launch environment
15/03/04 03:20:54 INFO yarn.Client: Setting up container launch context
15/03/04 03:20:54 INFO yarn.Client: Command for starting the Spark ApplicationMaster: List($JAVA_HOME/bin/java, -server, -Xmx4096m, -Djava.io.tmpdir=$PWD/tmp, -Dspark.app.name=\"com.company.ep.poc.spark.reporting.SparkApp\", -Dlog4j.configuration=log4j-spark-container.properties, org.apache.spark.deploy.yarn.ApplicationMaster, --class, com.company.ep.poc.spark.reporting.SparkApp, --jar , file:/home/dvasthimal/spark/spark_reporting-1.0-SNAPSHOT.jar, --args 'startDate=2015-02-16' --args 'endDate=2015-02-16' --args 'epoutputdirectory=/user/dvasthimal/epdatasets_small/exptsession' --args 'subcommand=successevents' --args 'outputdir=/user/dvasthimal/epdatasets/successdetail' , --executor-memory, 2048, --executor-cores, 1, --num-executors , 3, 1>, <LOG_DIR>/stdout, 2>, <LOG_DIR>/stderr)
15/03/04 03:20:54 INFO yarn.Client: Submitting application to ASM
15/03/04 03:20:54 INFO impl.YarnClientImpl: Submitted application application_1425075571333_61948
15/03/04 03:20:56 INFO yarn.Client: Application report from ASM:
  application identifier: application_1425075571333_61948
  appId: 61948
  clientToAMToken: null
  appDiagnostics:
  appMasterHost: N/A
  appQueue: hdmi-spark
  appMasterRpcPort: -1
  appStartTime: 1425464454263
  yarnAppState: ACCEPTED
  distributedFinalState: UNDEFINED
  appTrackingUrl: https://apollo-phx-rm-2.company.com:50030/proxy/application_1425075571333_61948/
  appUser: dvasthimal
15/03/04 03:21:18 INFO yarn.Client: Application report from ASM:
  application identifier: application_1425075571333_61948
  appId: 61948
  clientToAMToken: Token { kind: YARN_CLIENT_TOKEN, service: }
  appDiagnostics:
  appMasterHost: phxaishdc9dn0169.phx.company.com
  appQueue: hdmi-spark
  appMasterRpcPort: 0
  appStartTime: 1425464454263
  yarnAppState: RUNNING
  distributedFinalState: UNDEFINED
  appTrackingUrl: https://apollo-phx-rm-2.company.com:50030/proxy/application_1425075571333_61948/
  appUser: dvasthimal
…
15/03/04 03:21:22 INFO yarn.Client: Application report from ASM:
  application identifier: application_1425075571333_61948
  appId: 61948
  clientToAMToken: Token { kind: YARN_CLIENT_TOKEN, service: }
  appDiagnostics:
  appMasterHost: phxaishdc9dn0169.phx.company.com
  appQueue: hdmi-spark
  appMasterRpcPort: 0
  appStartTime: 1425464454263
  yarnAppState: FINISHED
  distributedFinalState: FAILED
  appTrackingUrl: https://apollo-phx-rm-2.company.com:50030/proxy/application_1425075571333_61948/
  appUser: dvasthimal

The AM failed with the following exception (container logs retrieved with /apache/hadoop/bin/yarn logs -applicationId application_1425075571333_61948):

15/03/04 03:21:22 INFO NewHadoopRDD: Input split: hdfs://apollo-phx-nn.company.com:8020/user/dvasthimal/epdatasets_small/exptsession/2015/02/16/part-r-00000.avro:0+13890
15/03/04 03:21:22 ERROR Executor: Exception in task ID 3
java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
    at org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
    at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:111)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:99)
    at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:61)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
    at org.apache.spark.scheduler.Task.run(Task.scala:51)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

1) Having identified the error, I assumed the fix was to put the right version of the Avro libs on the AM's classpath, so I added --jars /home/dvasthimal/spark/avro-mapred-1.7.7-hadoop2.jar,/home/dvasthimal/spark/avro-1.7.7.jar to the spark-submit command. I still see the same exception.
2) I also tried including these libs in SPARK_CLASSPATH, and again I see the same exception.

-- Deepak
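PS: one more thing I plan to try, in case the executors are loading the assembly's bundled (Hadoop-1-flavoured) Avro classes ahead of the jars I ship — pinning the hadoop2 Avro jars first on the executor classpath. A sketch; whether these properties fully shadow the assembly's Avro on 1.0.2 is my assumption, not something I have verified:

```shell
# Sketch: --jars (already in the command above) localizes the Avro jars into
# each container's working directory, so bare jar names on
# spark.executor.extraClassPath should resolve there and take precedence
# over the assembly's bundled Avro. Unverified on Spark 1.0.2.
cat >> $SPARK_HOME/conf/spark-defaults.conf <<'EOF'
spark.executor.extraClassPath  avro-mapred-1.7.7-hadoop2.jar:avro-1.7.7.jar
spark.files.userClassPathFirst true
EOF
# then re-run the same spark-submit command as before
```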