Does anyone use IPython notebooks? I am able to use them with Spark on my local machine; however, I cannot get them to work on my cluster.
For some unknown reason, on my cluster I have to create the SparkContext manually. My test code raises this exception:

    Exception: Python in worker has different version 2.7 than that in driver 2.6, PySpark cannot run with different minor versions

On my Mac I can solve the problem by setting:

    export PYSPARK_PYTHON=python3
    export PYSPARK_DRIVER_PYTHON=python3
    IPYTHON_OPTS=notebook $SPARK_ROOT/bin/pyspark

On my cluster I set both values to python2.7 and launch with IPYTHON_OPTS="notebook --no-browser --port=7000"; I connect through an SSH tunnel from my local machine. I also tried installing Python 3, pip, IPython, and Jupyter on the cluster, and I tried adding

    export PYSPARK_PYTHON=python2.7

to /root/spark/conf/spark-env.sh on all of my machines. (See the two sketches after the traceback below for how I compare interpreter versions and try to force a match.)

My test code:

    from pyspark import SparkContext

    textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
    textFile.take(3)

Here is the full session and traceback:

    In [1]: from pyspark import SparkContext
       ...: sc = SparkContext("local", "Simple App")
       ...: textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
       ...: textFile.take(3)
    ---------------------------------------------------------------------------
    Py4JJavaError                             Traceback (most recent call last)
    <ipython-input-1-e0006b323300> in <module>()
          2 sc = SparkContext("local", "Simple App")
          3 textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
    ----> 4 textFile.take(3)

    /root/spark/python/pyspark/rdd.py in take(self, num)
       1297
       1298             p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
    -> 1299             res = self.context.runJob(self, takeUpToNumLeft, p)
       1300
       1301             items += res

    /root/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal)
        914         # SparkContext#runJob.
        915         mappedRDD = rdd.mapPartitions(partitionFunc)
    --> 916         port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
        917         return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
        918

    /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
        536         answer = self.gateway_client.send_command(command)
        537         return_value = get_return_value(answer, self.gateway_client,
    --> 538                 self.target_id, self.name)
        539
        540         for temp_arg in temp_args:

    /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
        298                 raise Py4JJavaError(
        299                     'An error occurred while calling {0}{1}{2}.\n'.
    --> 300                     format(target_id, '.', name), value)
        301             else:
        302                 raise Py4JError(

    Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
      File "/root/spark/python/lib/pyspark.zip/pyspark/worker.py", line 64, in main
        ("%d.%d" % sys.version_info[:2], version))
    Exception: Python in worker has different version 2.7 than that in driver 2.6, PySpark cannot run with different minor versions

      at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
      at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
      at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
      at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
      at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
      at org.apache.spark.scheduler.Task.run(Task.scala:88)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)

    Driver stacktrace:
      at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
      at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
      at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
      at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
      at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
      at scala.Option.foreach(Option.scala:236)
      at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
      at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
      at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
      at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822)
      at org.apache.spark.SparkContext.runJob(SparkContext.scala:1835)
      at org.apache.spark.SparkContext.runJob(SparkContext.scala:1848)
      at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:393)
      at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:497)
      at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
      at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
      at py4j.Gateway.invoke(Gateway.java:259)
      at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
      at py4j.commands.CallCommand.execute(CallCommand.java:79)
      at py4j.GatewayConnection.run(GatewayConnection.java:207)
      at java.lang.Thread.run(Thread.java:745)
    Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
      File "/root/spark/python/lib/pyspark.zip/pyspark/worker.py", line 64, in main
        ("%d.%d" % sys.version_info[:2], version))
    Exception: Python in worker has different version 2.7 than that in driver 2.6, PySpark cannot run with different minor versions

      at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
      at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
      at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
      at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
      at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
      at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
      at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
      at org.apache.spark.scheduler.Task.run(Task.scala:88)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      ... 1 more
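
For reference, this is the quick diagnostic I run to see which interpreter the driver and the workers actually use. It is only my own sketch, and it assumes a bare SparkContext can be created at all; on the cluster the worker half of it dies with the same version exception, which at least confirms where the mismatch is:

    import sys
    from pyspark import SparkContext

    sc = SparkContext("local", "Version Check")

    # The driver reports its own interpreter directly.
    print("driver: %d.%d" % sys.version_info[:2])

    # Each worker evaluates sys.version_info in its own process; with a
    # consistent setup this collapses to a single value matching the driver.
    worker_versions = (sc.parallelize(range(4), 2)
                         .map(lambda _: "%d.%d" % sys.version_info[:2])
                         .distinct()
                         .collect())
    print("workers: %s" % worker_versions)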
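
As a last resort I also tried forcing the worker interpreter from inside the notebook, before the context is created. This is only a sketch of that attempt: /usr/bin/python2.7 is my guess at where the matching interpreter lives on the nodes, and I am not sure whether setting the variable this late is even honored:

    import os
    from pyspark import SparkContext

    # Point the workers at the same interpreter the driver runs. This has to
    # happen before the SparkContext exists; the driver process itself is
    # already running, so only the worker side can still be influenced here.
    os.environ["PYSPARK_PYTHON"] = "/usr/bin/python2.7"

    sc = SparkContext("local", "Simple App")
    textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
    print(textFile.take(3))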