Hi all,
I want to use Zeppelin on my CDH cluster to run Spark and PySpark code
through YARN.
My environment is:
CDH 5.4.2 (spark and yarn installed)
Zeppelin 0.5.0
I built Zeppelin with:
"mvn clean package -Pspark-1.3 -Ppyspark -Phadoop-2.6 -Pyarn \
  -Dhadoop.version=2.6.0-cdh5.4.2 -DskipTests"
and installed it on a node of my CDH cluster.
I've set the following environment variables:
export ZEPPELIN_HOME=/opt/incubator-zeppelin
export ZEPPELIN_PORT=7979
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
export SPARK_DIST_CLASSPATH=$(hadoop classpath)
export HIVE_CONF_DIR="/etc/hive/conf"
export HIVECLASSPATH=$(find /opt/cloudera/parcels/CDH/lib/hive/lib/ -name
'*.jar' -print0 | sed 's/\x0/:/g')
My zeppelin-env.sh is:
export MASTER=yarn
export ZEPPELIN_JAVA_OPTS="-Dspark.executor.memory=512m -Dspark.cores.max=1"
export HADOOP_CONF_DIR="/etc/hadoop/conf.cloudera.yarn"
export PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
Spark assembly jar:
/opt/cloudera/parcels/CDH/lib/spark/assembly/lib/spark-assembly-1.3.0-cdh5.4.2-hadoop2.6.0-cdh5.4.2.jar
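For reference, I would expect the same count to work outside Zeppelin through the stock PySpark shell (the paths and the test file below are just the ones from my setup; this is only a sanity-check sketch, not something I've confirmed yet):

```shell
# Sanity check outside Zeppelin: run the same count through the plain
# PySpark shell on YARN in client mode. Paths match my CDH parcel layout.
export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
export HADOOP_CONF_DIR=/etc/hadoop/conf.cloudera.yarn

$SPARK_HOME/bin/pyspark --master yarn-client <<'EOF'
print(sc.textFile("/user/example/test.txt").count())
EOF
```

If this hangs or fails the same way, the problem would be in the Spark-on-YARN setup rather than in Zeppelin itself.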
When I execute this simple PySpark snippet in my notebook:
sc.textFile("/user/example/test.txt").count()
I see the new application in the YARN console and Zeppelin reports the
paragraph as RUNNING, but nothing happens: I never receive a result and I
can't execute any other code. There are no errors in the YARN logs.
In Zeppelin log:
INFO [2015-08-21 12:15:13,740] ({pool-1-thread-2}
SchedulerFactory.java[jobStarted]:132) - Job
paragraph_1440152108501_1888952717 started by scheduler
remoteinterpreter_695055854
INFO [2015-08-21 12:15:13,745] ({pool-1-thread-2}
Paragraph.java[jobRun]:189) - run paragraph 20150821-121508_181152709 using
null org.apache.zeppelin.interpreter.LazyOpenInterpreter@91f60ce
INFO [2015-08-21 12:15:13,776] ({pool-1-thread-2}
RemoteInterpreterProcess.java[reference]:108) - Run interpreter process
/opt/incubator-zeppelin/bin/interpreter.sh -d
/opt/incubator-zeppelin/interpreter/spark$
INFO [2015-08-21 12:15:15,346] ({pool-1-thread-2}
RemoteInterpreter.java[init]:144) - Create remote interpreter
org.apache.zeppelin.spark.PySparkInterpreter
INFO [2015-08-21 12:15:15,416] ({pool-1-thread-2}
RemoteInterpreter.java[init]:144) - Create remote interpreter
org.apache.zeppelin.spark.SparkInterpreter
INFO [2015-08-21 12:15:15,423] ({pool-1-thread-2}
RemoteInterpreter.java[init]:144) - Create remote interpreter
org.apache.zeppelin.spark.SparkSqlInterpreter
INFO [2015-08-21 12:15:15,425] ({pool-1-thread-2}
RemoteInterpreter.java[init]:144) - Create remote interpreter
org.apache.zeppelin.spark.DepInterpreter
INFO [2015-08-21 12:15:15,440] ({pool-1-thread-2}
Paragraph.java[jobRun]:206) - RUN : sc.textFile("/user/admin/1.txt").count()
INFO [2015-08-21 12:15:25,146] ({qtp1379078592-48}
NotebookServer.java[onMessage]:112) - RECEIVE << PING
INFO [2015-08-21 12:15:58,657] ({Thread-33}
NotebookServer.java[broadcast]:264) - SEND >> NOTE
INFO [2015-08-21 12:15:58,800] ({Thread-34}
NotebookServer.java[broadcast]:264) - SEND >> PROGRESS
INFO [2015-08-21 12:15:59,314] ({Thread-34}
NotebookServer.java[broadcast]:264) - SEND >> PROGRESS
(the SEND >> PROGRESS line repeats roughly every half second)
If I execute similar code with %spark, I see the application running in the
YARN console, but after a few seconds Zeppelin gives me this error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
0.0 (TID 3, cdhlva03.gcio.unicredit.eu): ExecutorLostFailure (executor 4
lost) Driver stacktrace: at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236) at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
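The ExecutorLostFailure makes me suspect YARN is killing the 512 MB executors for exceeding their container limit. If that's the case, would raising the memory settings help? A sketch of what I'd try in zeppelin-env.sh (the values here are just a guess on my part):

```shell
# Guess: give each executor more headroom so YARN doesn't kill the container.
# spark.yarn.executor.memoryOverhead is the off-heap allowance YARN adds on
# top of spark.executor.memory when sizing the container.
export ZEPPELIN_JAVA_OPTS="-Dspark.executor.memory=1g \
  -Dspark.yarn.executor.memoryOverhead=512 \
  -Dspark.cores.max=1"
```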
Does anyone have any ideas?
Thanks a lot in advance.