Hello, my name is Russell. My company is currently using Toree with PySpark, but I have run into a problem that I can't figure out.
Here is the kernel setting:

{
  "language": "python",
  "display_name": "Apache Toree - PySpark",
  "env": {
    "__TOREE_SPARK_OPTS__": "--master yarn-client",
    "SPARK_HOME": "/usr/hdp/current/spark-client",
    "__TOREE_OPTS__": "",
    "DEFAULT_INTERPRETER": "PySpark",
    "PYTHONPATH": "/usr/hdp/current/spark-client/python:/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip",
    "PYTHON_EXEC": "python"
  },
  "argv": [
    "/usr/local/share/jupyter/kernels/apache_toree_pyspark/bin/run.sh",
    "--profile",
    "{connection_file}"
  ]
}

When I run a small PySpark program in a Jupyter notebook, I get this error message:

Error from python worker:
  /usr/bin/python: No module named pyspark
PYTHONPATH was:
  /data18/hadoop/yarn/local/filecache/10/spark-hdp-assembly.jar
java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:164)
    at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:87)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:63)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:134)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:101)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:277)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Could you give me some hints about the possible cause?
I have tried submitting a Python job from the command line in yarn-client mode, and it returns the result without any problem, so I think there must be a configuration issue somewhere. Any help is greatly appreciated. Thanks ahead!

Regards,
Russell