Thanks Patrick. Using a conda virtual environment did help with libraries that required the extra C stuff.
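For readers landing on this thread later: the approach in the article Patrick links below amounts to shipping a whole conda environment to YARN instead of a zip of site-packages, so the compiled `.so` files travel with the interpreter. A rough sketch of that workflow (environment name, paths, and the Python/package versions here are illustrative, not from the thread — adapt them to your cluster):

```shell
# Build a conda env containing the compiled dependencies, then zip it.
conda create -y -n spark_env python=2.7 pandas scipy
cd ~/miniconda2/envs && zip -r ~/spark_env.zip spark_env && cd ~

# Ship the env via --archives; YARN unpacks it next to each executor as
# ./env/spark_env, and PYSPARK_PYTHON points every worker at that interpreter.
spark-submit \
  --master yarn --deploy-mode client \
  --archives spark_env.zip#env \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./env/spark_env/bin/python \
  --conf spark.executorEnv.PYSPARK_PYTHON=./env/spark_env/bin/python \
  run_command.py
```

Because every executor runs the archived interpreter with its own compiled extensions, there is no need to import pandas or scipy out of a zip at all.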
Jonas

On Fri, Sep 14, 2018 at 8:02 AM Patrick McCarthy <pmccar...@dstillery.com> wrote:

> You didn't say how you're zipping the dependencies, but I'm guessing you either include .egg files or zipped up a virtualenv. In either case, the extra C stuff that scipy and pandas rely upon doesn't get included.
>
> An approach like this solved the last problem I had that looked like this one:
> https://community.hortonworks.com/articles/58418/running-pyspark-with-conda-env.html
>
> On Thu, Sep 13, 2018 at 10:08 PM, Jonas Shomorony <js...@stanford.edu> wrote:
>
>> Hey everyone,
>>
>> I am currently trying to run a Python Spark job (using YARN client mode) that uses multiple libraries, on a Spark cluster on Amazon EMR. To do that, I create a dependencies.zip file that contains all of the dependencies/libraries (installed through pip) that the job needs to run successfully, such as pandas, scipy, tqdm, psycopg2, etc. The dependencies.zip file sits inside an outer directory (let's call it "project") that contains all the code for my Spark job. I then zip up everything within project (including dependencies.zip) into project.zip. Then I call spark-submit on the master node of my EMR cluster as follows:
>>
>> `spark-submit --packages … --py-files project.zip --jars ... run_command.py`
>>
>> Within run_command.py I add dependencies.zip as follows:
>>
>> `self.spark.sparkContext.addPyFile("dependencies.zip")`
>>
>> run_command.py then uses other files within project.zip to complete the Spark job, and within those files I import various libraries (found in dependencies.zip).
>>
>> I am running into a strange issue where all of the libraries import correctly (with no problems), with the exception of scipy and pandas.
>>
>> For scipy I get the following error:
>>
>> `File "/mnt/tmp/pip-install-79wp6w/scipy/scipy/__init__.py", line 119, in <module>
>> File "/mnt/tmp/pip-install-79wp6w/scipy/scipy/_lib/_ccallback.py", line 1, in <module>
>> ImportError: cannot import name _ccallback_c`
>>
>> And for pandas I get this error message:
>>
>> `File "/mnt/tmp/pip-install-79wp6w/pandas/pandas/__init__.py", line 35, in <module>
>> ImportError: C extension: No module named tslib not built. If you want to import pandas from the source directory, you may need to run 'python setup.py build_ext --inplace --force' to build the C extensions first.`
>>
>> When I comment out the imports for these two libraries (and their uses within the code), everything works fine.
>>
>> Surprisingly, when I run the application locally (on the master node) without passing in dependencies.zip, it picks up and resolves the libraries from site-packages correctly and runs successfully to completion. dependencies.zip is created by zipping the contents of site-packages.
>>
>> Does anyone have any ideas as to what may be happening here? I would really appreciate it.
>>
>> pip version: 18.0
>> spark version: 2.3.1
>> python version: 2.7
>>
>> Thank you,
>>
>> Jonas
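The failure pattern above is consistent with how Python's zipimport mechanism works: pure-Python modules load fine from a zip added via `addPyFile`, but compiled extension modules (`.so` files, such as scipy's `_ccallback_c` or the C extensions behind pandas' `tslib`) cannot be loaded from inside a zip by the standard zip importer. A minimal self-contained illustration (the module name `purelib` is made up for the demo):

```python
import os
import sys
import tempfile
import zipfile

# Build a zip containing a pure-Python module, mimicking dependencies.zip.
tmp = tempfile.mkdtemp()
zip_path = os.path.join(tmp, "dependencies.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("purelib.py", "VALUE = 42\n")

# Putting the zip on sys.path is roughly what addPyFile does on each executor.
sys.path.insert(0, zip_path)

import purelib  # pure-Python modules import fine straight out of a zip
print(purelib.VALUE)

# A compiled extension (.so) placed in the same zip would NOT import this
# way: zipimport only loads .py/.pyc from archives, which is why scipy and
# pandas fail while pure-Python libraries like tqdm succeed.
```

This is why shipping an unpacked environment (e.g. via `--archives`, as in the conda approach discussed above) works where zipping site-packages does not.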