Hi, I am running Spark on Zeppelin and trying to create some temp tables to run SQL queries on. I have JSON data on HDFS which I am trying to load as a JSON RDD. Here are my commands:
val data = sc.sequenceFile("/user/ds=01-02-2015/hour=2/*", classOf[Null], classOf[org.apache.hadoop.io.Text]).map { case (k, v) => v.toString() }

import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val recordsJson = sqlContext.jsonRDD(data)

And here is the error I get, which shows it failing on the jsonRDD step:

data: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[11] at map at <console>:26
import org.apache.spark.sql.SQLContext
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@313547c4
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, gdoop-worker31.snc1): java.lang.ClassNotFoundException: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1
    at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:69)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:278)

I built Zeppelin using:

mvn clean package -DskipTests -Pspark-1.3 -Phadoop-2.6 -Dhadoop.version=2.6.0 -Pyarn
mvn clean package -P build-distr -DskipTests

Lastly, here are my configs.

interpreter.json (spark section):

"id": "2ARHCUUUZ",
"name": "spark",
"group": "spark",
"properties": {
  "spark.executor.memory": "512m",
  "args": "",
  "spark.yarn.jar": "hdfs://namenode-vip.snc1:8020/spark/spark-assembly-1.3.1-hadoop2.6.0.jar",
  "spark.cores.max": "",
  "zeppelin.spark.concurrentSQL": "false",
  "zeppelin.spark.useHiveContext": "true",
  "zeppelin.pyspark.python": "python",
  "zeppelin.dep.localrepo": "local-repo",
  "spark.home": "/usr/local/lib/spark-1.3",
  "spark.yarn.am.extraJavaOptions": "-Dhdp.version\u003d2.2.0.0-2041",
  "zeppelin.spark.maxResult": "1000",
  "master": "yarn-client",
  "spark.yarn.queue": "public",
  "spark.yarn.access.namenodes": "hdfs://namenode1.snc1:8032,hdfs://namenode2.snc1:8032",
  "spark.scheduler.mode": "FAIR",
  "spark.dynamicAllocation.enabled": "false",
  "spark.executor.extraLibraryPath": "/usr/lib/hadoop/lib/native/Linux-amd64-64",
  "spark.executor.extraJavaOptions": "-Dhdp.version\u003d2.2.0.0-2041",
  "spark.app.name": "Zeppelin",
  "spark.driver.extraLibraryPath": "/usr/lib/hadoop/lib/native/Linux-amd64-64",
  "spark.driver.extraJavaOptions": "-Dhdp.version\u003d2.2.0.0-2041"
}

zeppelin-env.sh:

export HADOOP_CONF_DIR=/etc/hadoop/conf
export SPARK_CLASSPATH=/usr/lib/hadoop/lib/*:/usr/lib/hadoop/lib/native/Linux-amd64-64
export ZEPPELIN_PORT=10020
export ZEPPELIN_JAVA_OPTS="-Dhdp.version=2.2.0.0-2041 -Dspark.jars=/usr/hdp/current/hadoop-client/lib/hadoop-lzo-0.6.0.2.2.0.0-2041.jar"

Would anyone be able to help with the problem?

Thanks in advance,
Udit
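P.S. For context, once jsonRDD stops failing, this is roughly where I am headed with the temp tables. The table name "records" and the query below are only placeholders, not what I actually run:

import org.apache.spark.sql.SQLContext

// sc is the SparkContext provided by the Zeppelin Spark interpreter,
// and data is the RDD[String] of JSON lines built from the sequence file above.
val sqlContext = new SQLContext(sc)
val recordsJson = sqlContext.jsonRDD(data)

// Register the inferred schema as a temp table and query it with SQL
// ("records" and the SELECT are placeholders for my real queries).
recordsJson.registerTempTable("records")
sqlContext.sql("SELECT COUNT(*) FROM records").show()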