I would try setting the PYSPARK_DRIVER_PYTHON environment variable to the
location of your Python binary, especially if you are using a virtual
environment.
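
Note that PYSPARK_DRIVER_PYTHON only changes the Python the driver runs; since
your traceback is raised in the executors' worker.py, PYSPARK_PYTHON (the
interpreter used on the worker nodes) is probably the one that matters most
here, and both need to point at a Python that can import skylark.sketch on
every node. A rough sketch, with /opt/venv standing in for wherever your
environment actually lives on each machine:

  export PYSPARK_DRIVER_PYTHON=/opt/venv/bin/python   # interpreter for the driver
  export PYSPARK_PYTHON=/opt/venv/bin/python          # interpreter for the executors
  spark-submit --master spark://master:7077 rfmtest.py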

-Don

On Wed, Jun 3, 2015 at 8:24 PM, AlexG <swift...@gmail.com> wrote:

> I have libskylark installed on both machines in my two node cluster in the
> same locations, and checked that the following code, which calls
> libskylark,
> works on both nodes with 'pyspark rfmtest.py':
>
> import re
> import numpy
> import skylark.ml.kernels
> import random
> import os
>
> from pyspark import SparkContext
> sc = SparkContext(appName="test")
>
> SIGMA = 10
> NUM_RF = 500
> numfeatures = 100
> numpoints = 1000
> kernel = skylark.ml.kernels.Gaussian(numfeatures, SIGMA)
> S = kernel.rft(NUM_RF)
>
> rows = sc.parallelize(numpy.random.rand(numpoints, numfeatures).tolist(), 6)
> sketched_rows = rows.map(lambda row : S / numpy.ndarray(shape=(1,numfeatures), buffer=numpy.array(row)).copy())
>
> os.system("rm -rf spark_out")
> sketched_rows.saveAsTextFile('spark_out')
>
> However, when I try to run the same code on the cluster with 'spark-submit
> --master spark://master:7077 rfmtest.py', I get an ImportError saying that
> skylark.sketch does not exist:
>
> 15/06/04 01:21:50 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on master:40244 (size: 67.5 KB, free: 265.3 MB)
> 15/06/04 01:21:51 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on node001:45690 (size: 67.5 KB, free: 265.3 MB)
> 15/06/04 01:21:51 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, master): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>   File "/opt/Spark/python/pyspark/worker.py", line 88, in main
>     command = pickleSer._read_with_length(infile)
>   File "/opt/Spark/python/pyspark/serializers.py", line 156, in _read_with_length
>     return self.loads(obj)
>   File "/opt/Spark/python/pyspark/serializers.py", line 405, in loads
>     return cPickle.loads(obj)
> ImportError: No module named skylark.sketch
>
>         at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135)
>         at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:176)
>         at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>         at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>         at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>         at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>         at org.apache.spark.scheduler.Task.run(Task.scala:64)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
>
> Any ideas what might be going on?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/importerror-using-external-library-with-pyspark-tp23145.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>


-- 
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
http://www.MailLaunder.com/
800-733-2143
