I have libskylark installed in the same location on both machines of my
two-node cluster, and I have checked that the following code, which calls
libskylark, runs fine on both nodes with 'pyspark rfmtest.py':

import re
import numpy
import skylark.ml.kernels
import random
import os

from pyspark import SparkContext
sc = SparkContext(appName="test")

SIGMA = 10
NUM_RF = 500
numfeatures = 100
numpoints = 1000
# Gaussian kernel and a random features transform (sketch) with NUM_RF features
kernel = skylark.ml.kernels.Gaussian(numfeatures, SIGMA)
S = kernel.rft(NUM_RF)

rows = sc.parallelize(numpy.random.rand(numpoints, numfeatures).tolist(), 6)
# apply the sketch S to each row (viewed as a 1 x numfeatures matrix)
sketched_rows = rows.map(
    lambda row: S / numpy.ndarray(shape=(1, numfeatures),
                                  buffer=numpy.array(row)).copy())

os.system("rm -rf spark_out")
sketched_rows.saveAsTextFile('spark_out')

However, when I try to run the same code on the cluster with 'spark-submit
--master spark://master:7077 rfmtest.py', I get an ImportError saying that
skylark.sketch does not exist:

15/06/04 01:21:50 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on master:40244 (size: 67.5 KB, free: 265.3 MB)
15/06/04 01:21:51 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on node001:45690 (size: 67.5 KB, free: 265.3 MB)
15/06/04 01:21:51 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, master): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/Spark/python/pyspark/worker.py", line 88, in main
    command = pickleSer._read_with_length(infile)
  File "/opt/Spark/python/pyspark/serializers.py", line 156, in _read_with_length
    return self.loads(obj)
  File "/opt/Spark/python/pyspark/serializers.py", line 405, in loads
    return cPickle.loads(obj)
ImportError: No module named skylark.sketch

        at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135)
        at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:176)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:64)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
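
In case it helps narrow things down, here is the kind of quick probe I could
add to the same script to see what the executors' Python environment looks
like (just the standard library plus the SparkContext 'sc' defined above;
nothing libskylark-specific beyond the import attempt):

import sys

def probe(_):
    # runs on the executors; the import below happens on the worker side
    try:
        import skylark.sketch
        status = "skylark.sketch found at " + skylark.sketch.__file__
    except ImportError as e:
        status = "import failed: " + str(e)
    return (status, sys.executable, sys.path)

print(sc.parallelize(range(2), 2).map(probe).collect())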

Any ideas what might be going on?


