Hey everyone,
I am trying to run a Python Spark job (in YARN client mode) that uses multiple libraries on a Spark cluster on Amazon EMR. To do that, I create a dependencies.zip file that contains all of the pip-installed dependencies the job needs, such as pandas, scipy, tqdm, psycopg2, etc. dependencies.zip sits inside a directory (let’s call it “project”) that contains all the code for my Spark job. I then zip up everything within project (including dependencies.zip) into project.zip.

Then I call spark-submit on the master node of my EMR cluster as follows:

`spark-submit --packages … --py-files project.zip --jars ... run_command.py`

Within run_command.py I add dependencies.zip as follows:

`self.spark.sparkContext.addPyFile("dependencies.zip")`

run_command.py then uses other files within project.zip to complete the Spark job, and those files import various libraries found in dependencies.zip.

The strange part is that every library imports without problems except scipy and pandas. For scipy I get the following error:

`File "/mnt/tmp/pip-install-79wp6w/scipy/scipy/__init__.py", line 119, in <module> File "/mnt/tmp/pip-install-79wp6w/scipy/scipy/_lib/_ccallback.py", line 1, in <module> ImportError: cannot import name _ccallback_c`

And for pandas I get this error message:

`File "/mnt/tmp/pip-install-79wp6w/pandas/pandas/__init__.py", line 35, in <module> ImportError: C extension: No module named tslib not built. If you want to import pandas from the source directory, you may need to run 'python setup.py build_ext --inplace --force' to build the C extensions first.`

When I comment out the imports for these two libraries (and every use of them in the code), everything works fine.
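One thing that may be relevant: as far as I understand, Python's zipimport does not load compiled extension modules (.so or .pyd files) from inside a zip archive, and both failing imports point at C extensions (`_ccallback_c` and `tslib`). Below is a minimal sketch for listing such files in an archive like dependencies.zip. The helper name `find_native_modules` is hypothetical, and the suffix list is my assumption about what counts as a compiled module.

```python
import zipfile

# Assumed suffixes for compiled extension modules (Linux/macOS and Windows).
NATIVE_SUFFIXES = (".so", ".pyd")


def find_native_modules(zip_path):
    """Return the sorted names of compiled extension files stored in a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(
            name for name in archive.namelist()
            if name.endswith(NATIVE_SUFFIXES)
        )
```

Calling `find_native_modules("dependencies.zip")` and finding scipy or pandas entries in the result would at least be consistent with the errors above, though it would not prove that this is the cause.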
Surprisingly, when I run the application locally (on the master node) without passing in dependencies.zip, it picks up the libraries from site-packages, resolves them correctly, and runs to completion. dependencies.zip is created by zipping the contents of that same site-packages directory.

Does anyone have any ideas as to what may be happening here? I would really appreciate it.

Versions: pip 18.0, Spark 2.3.1, Python 2.7.

Thank you,
Jonas