I would try setting the PYSPARK_DRIVER_PYTHON environment variable to the location of your Python binary, especially if you are using a virtual environment.
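Something along these lines (a minimal sketch; the virtualenv path $HOME/venv is illustrative, not from your setup). PYSPARK_DRIVER_PYTHON picks the interpreter for the driver; PYSPARK_PYTHON does the same for the Python worker processes on the executor nodes, which is where your ImportError is being raised:

```shell
# Point the driver at the virtualenv's interpreter (path is an example):
export PYSPARK_DRIVER_PYTHON="$HOME/venv/bin/python"
# Point the executors' Python workers at the same interpreter,
# so they see the same site-packages (where skylark is installed):
export PYSPARK_PYTHON="$HOME/venv/bin/python"
# then submit as before:
# spark-submit --master spark://master:7077 rfmtest.py
```

Both variables need to resolve to a valid interpreter on every node, since the workers launch Python locally on each machine.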
-Don

On Wed, Jun 3, 2015 at 8:24 PM, AlexG <swift...@gmail.com> wrote:
> I have libskylark installed on both machines in my two node cluster in the
> same locations, and checked that the following code, which calls
> libskylark, works on both nodes with 'pyspark rfmtest.py':
>
> import re
> import numpy
> import skylark.ml.kernels
> import random
> import os
>
> from pyspark import SparkContext
> sc = SparkContext(appName="test")
>
> SIGMA = 10
> NUM_RF = 500
> numfeatures = 100
> numpoints = 1000
> kernel = skylark.ml.kernels.Gaussian(numfeatures, SIGMA)
> S = kernel.rft(NUM_RF)
>
> rows = sc.parallelize(numpy.random.rand(numpoints, numfeatures).tolist(), 6)
> sketched_rows = rows.map(lambda row : S /
>     numpy.ndarray(shape=(1,numfeatures), buffer=numpy.array(row)).copy())
>
> os.system("rm -rf spark_out")
> sketched_rows.saveAsTextFile('spark_out')
>
> However, when I try to run the same code on the cluster with 'spark-submit
> --master spark://master:7077 rfmtest.py', I get an ImportError saying that
> skylark.sketch does not exist:
>
> 15/06/04 01:21:50 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory
> on master:40244 (size: 67.5 KB, free: 265.3 MB)
> 15/06/04 01:21:51 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory
> on node001:45690 (size: 67.5 KB, free: 265.3 MB)
> 15/06/04 01:21:51 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
> master): org.apache.spark.api.python.PythonException: Traceback (most
> recent call last):
>   File "/opt/Spark/python/pyspark/worker.py", line 88, in main
>     command = pickleSer._read_with_length(infile)
>   File "/opt/Spark/python/pyspark/serializers.py", line 156, in _read_with_length
>     return self.loads(obj)
>   File "/opt/Spark/python/pyspark/serializers.py", line 405, in loads
>     return cPickle.loads(obj)
> ImportError: No module named skylark.sketch
>
>         at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135)
>         at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:176)
>         at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>         at org.apache.spark.scheduler.Task.run(Task.scala:64)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>
> Any ideas what might be going on?
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/importerror-using-external-library-with-pyspark-tp23145.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org

--
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
http://www.MailLaunder.com/
800-733-2143