I have libskylark installed in the same location on both machines of my two-node cluster, and I've verified that the following code, which calls libskylark, runs successfully on each node locally with 'pyspark rfmtest.py':
import re
import numpy
import skylark.ml.kernels
import random
import os
from pyspark import SparkContext

sc = SparkContext(appName="test")

SIGMA = 10
NUM_RF = 500       # number of random features
numfeatures = 100
numpoints = 1000

# Random Fourier transform sketch of a Gaussian kernel
kernel = skylark.ml.kernels.Gaussian(numfeatures, SIGMA)
S = kernel.rft(NUM_RF)

# Sketch each row of a random numpoints x numfeatures matrix
rows = sc.parallelize(numpy.random.rand(numpoints, numfeatures).tolist(), 6)
sketched_rows = rows.map(lambda row: S / numpy.ndarray(shape=(1, numfeatures),
                                                       buffer=numpy.array(row)).copy())

os.system("rm -rf spark_out")
sketched_rows.saveAsTextFile('spark_out')

However, when I run the same script on the cluster with 'spark-submit --master spark://master:7077 rfmtest.py', the workers fail with an ImportError saying that skylark.sketch does not exist:

15/06/04 01:21:50 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on master:40244 (size: 67.5 KB, free: 265.3 MB)
15/06/04 01:21:51 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on node001:45690 (size: 67.5 KB, free: 265.3 MB)
15/06/04 01:21:51 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, master): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/Spark/python/pyspark/worker.py", line 88, in main
    command = pickleSer._read_with_length(infile)
  File "/opt/Spark/python/pyspark/serializers.py", line 156, in _read_with_length
    return self.loads(obj)
  File "/opt/Spark/python/pyspark/serializers.py", line 405, in loads
    return cPickle.loads(obj)
ImportError: No module named skylark.sketch

        at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:135)
        at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:176)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:94)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
        at org.apache.spark.scheduler.Task.run(Task.scala:64)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Any ideas what might be going on?
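P.S. In case it helps narrow this down, here is a minimal diagnostic sketch (the probe function, appName, and partition count are arbitrary choices of mine, nothing skylark-specific) that reports which Python binary each executor runs and whether it can import skylark at all:

import sys
from pyspark import SparkContext

sc = SparkContext(appName="envprobe")

def probe(_):
    # Imports are done inside the function so they run on the worker,
    # not the driver, and reflect the executor's environment.
    import sys, socket
    try:
        import skylark.ml.kernels  # same import the real job needs
        status = "skylark import OK"
    except ImportError as e:
        status = "skylark import FAILED: %s" % e
    yield (socket.gethostname(), sys.executable, status)

print(sc.parallelize(range(4), 4).mapPartitions(probe).collect())

If the import fails only under spark-submit, that would suggest the executors are picking up a different Python or PYTHONPATH than the interactive 'pyspark' shell does.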