I am running Spark programs on a large cluster (for which I do not have
administrative privileges). numpy is not installed on the worker nodes.
Hence, I bundled numpy with my program, but I get the following error:

Traceback (most recent call last):
  File "/home/user/spark-script.py", line 12, in <module>
    import numpy
  File "/usr/local/lib/python2.7/dist-packages/numpy/__init__.py", line
170, in <module>
  File "/usr/local/lib/python2.7/dist-packages/numpy/add_newdocs.py", line
13, in <module>
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/__init__.py", line
8, in <module>
  File "/usr/local/lib/python2.7/dist-packages/numpy/lib/type_check.py",
line 11, in <module>
  File "/usr/local/lib/python2.7/dist-packages/numpy/core/__init__.py",
line 6, in <module>
ImportError: cannot import name multiarray

The script is actually quite simple:

from pyspark import SparkConf, SparkContext
sc = SparkContext()

sc.addPyFile('numpy.zip')

import numpy

a = sc.parallelize(numpy.array([12, 23, 34, 45, 56, 67, 78, 89, 90]))
print a.collect()

I understand that the error occurs because numpy dynamically loads its
multiarray.so dependency, and even though my numpy.zip file includes the
multiarray.so file, the dynamic loading somehow doesn't work under Apache
Spark. Why is that? And how would you otherwise create a standalone numpy
module with static linking?
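
If I understand correctly, addPyFile essentially puts numpy.zip on sys.path
of the worker Python processes, and Python's zipimport machinery cannot load
compiled extension modules (.so files) from inside a zip archive. So I would
expect something like the following to reproduce the same ImportError even
without Spark (assuming numpy.zip sits in the current directory):

import sys

# Roughly what addPyFile does on the workers (my understanding):
# make the zip archive importable.
sys.path.insert(0, 'numpy.zip')

# The pure-Python parts of numpy load from the zip, but the compiled
# multiarray.so cannot be imported from inside a zip archive, so this
# should fail with "ImportError: cannot import name multiarray".
import numpy

Is that the right explanation for what I am seeing?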

P.S. The numpy.zip file I included with the program was a zipped copy of the
numpy installation on my Ubuntu machine. I also tried downloading the numpy
source, building it, and bundling that with the program, but the problem
persisted.

Thanks.
