I am running Spark programs on a large cluster (for which I do not have administrative privileges). numpy is not installed on the worker nodes, so I bundled numpy with my program, but I get the following error:
```
Traceback (most recent call last):
  File "/home/user/spark-script.py", line 12, in <module>
    import numpy
  File "/usr/local/lib/python2.7/dist-packages/numpy/__init__.py", line 170, in <module>
  File "/usr/local/lib/python2.7/dist-packages/numpy/add_newdocs.py", line 13, in <module>
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/__init__.py", line 8, in <module>
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/type_check.py", line 11, in <module>
  File "/usr/local/lib/python2.7/dist-packages/numpy/core/__init__.py", line 6, in <module>
ImportError: cannot import name multiarray
```

The script is actually quite simple:

```python
from pyspark import SparkConf, SparkContext

sc = SparkContext()
sc.addPyFile('numpy.zip')   # ship the bundled numpy to the worker nodes

import numpy

a = sc.parallelize(numpy.array([12, 23, 34, 45, 56, 67, 78, 89, 90]))
print a.collect()
```

I understand that the error occurs because numpy dynamically loads its `multiarray.so` extension, and even though `multiarray.so` is included in my `numpy.zip`, the dynamic loading somehow does not work under Apache Spark. Why is that? And how would I otherwise build a standalone numpy module with static linking?

P.S. The `numpy.zip` I bundled with the program was a zipped copy of the numpy installation on my Ubuntu machine. I also tried downloading the numpy source, building it on my local machine, and bundling that instead, but the problem persisted. My local machine and the worker nodes both run 64-bit Ubuntu.

Thanks.
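Update: for reference, I created `numpy.zip` with something equivalent to the following sketch (the `dist-packages` path is simply where numpy happens to live on my machine, as seen in the traceback above):

```python
import shutil

# Archive the installed numpy package so that a top-level 'numpy/' directory
# sits at the root of the zip; sc.addPyFile() adds the zip itself to sys.path
# on the workers, so the package must be importable from the archive root.
shutil.make_archive('numpy', 'zip',
                    root_dir='/usr/local/lib/python2.7/dist-packages',
                    base_dir='numpy')
```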
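And to confirm that the shared object really is inside the archive, a quick check (a small sketch; the exact path of the extension inside the zip may vary between numpy versions):

```python
import zipfile

# List every entry in the archive that mentions multiarray; I expect to see
# something like 'numpy/core/multiarray.so' in the output.
with zipfile.ZipFile('numpy.zip') as zf:
    print [name for name in zf.namelist() if 'multiarray' in name]
```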