[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944564#comment-16944564 ] Furcy Pin commented on SPARK-13587:

Hello, I don't know where to ask this, but we have been using this feature on HDInsight 2.6.5 and we sometimes hit a concurrency issue with pip. On rare occasions, several executors set up the virtualenv simultaneously, which ends up in a kind of deadlock. When running the pip install command used by the executor manually, it hangs, and when cancelled it throws this error:

{code:java}
File "/mnt/resource/hadoop/yarn/local/usercache/livy/appcache/application_XXX/container_XXX/virtualenv_application_XXX/lib/python3.5/site-packages/pip/_vendor/lockfile/linklockfile.py", line 31, in acquire
    os.link(self.unique_name, self.lock_file)
FileExistsError: [Errno 17] File exists: '/home/yarn/-' -> '/home/yarn/selfcheck.json.lock'
{code}

This happens with "spark.pyspark.virtualenv.type=native". We haven't tried with conda yet. It is pretty bad because when it happens:
- some executors of the Spark job just get stuck, and the Spark job gets stuck
- even if the job is restarted, the lock file stays there and renders the whole YARN host useless

Any suggestion or workaround would be appreciated. One idea would be to remove the "--cache-dir /home/yarn" option which is currently used in the pip install command, but it doesn't seem to be configurable right now.
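So far the only stopgap we have is clearing the stale lock by hand on the affected host. A minimal sketch, using the lock path from the traceback above (it may differ on other setups):

{code:bash}
# Stopgap cleanup on the affected YARN host: remove the stale pip lock
# left behind by a killed install. The path is taken from the
# FileExistsError above and may differ on other clusters.
sudo rm -f /home/yarn/selfcheck.json.lock
{code}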
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16659374#comment-16659374 ] Ruslan Dautkhanov commented on SPARK-13587:

We're using conda environments shared across worker nodes through NFS. Has anyone used something like this?

Another option that's more directly in line with this jira's description is `conda-pack`, using YARN's `--archives` option to distribute the environment:

{code:bash}
$ PYSPARK_DRIVER_PYTHON=`which python` \
  PYSPARK_PYTHON=./environment/bin/python \
  spark-submit \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
  --master yarn \
  --deploy-mode client \
  --archives environment.tar.gz#environment \
  script.py
{code}

More details: https://conda.github.io/conda-pack/spark.html
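For completeness, building {{environment.tar.gz}} in the first place looks roughly like this (a sketch following the conda-pack docs; the environment name and package list are just examples):

{code:bash}
# Build a relocatable environment tarball with conda-pack.
# Environment name and packages below are examples only.
conda create -y -n example python=3.6 numpy pandas
conda install -y -c conda-forge conda-pack
conda pack -n example -o environment.tar.gz   # the archive used by --archives above
{code}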
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16512433#comment-16512433 ] Matt Mould commented on SPARK-13587:

What is the current status of this ticket, please? This [article|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html] suggests that it's done, but it doesn't work for me with the following command:

{code:java}
spark-submit \
  --deploy-mode cluster \
  --master yarn \
  --py-files parallelisation_hack-0.1-py2.7.egg \
  --conf spark.pyspark.virtualenv.enabled=true \
  --conf spark.pyspark.virtualenv.type=native \
  --conf spark.pyspark.virtualenv.requirements=requirements.txt \
  --conf spark.pyspark.virtualenv.bin.path=virtualenv \
  --conf spark.pyspark.python=python3 \
  pyspark_poc_runner.py
{code}
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16217221#comment-16217221 ] Semet commented on SPARK-13587:

Hello. For me this solution is equivalent to the "Wheelhouse" proposal I made (SPARK-16367), even without having to modify PySpark at all. I even think you can package a wheelhouse using this {{--archives}} argument. The drawback is that spark-submit has to send this package to each node (1 to n). If PySpark supported {{requirements.txt}}/{{Pipfile}} dependency description formats, each node would download the dependencies by itself...
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16217115#comment-16217115 ] Nicholas Chammas commented on SPARK-13587:

To follow up on my [earlier comment|https://issues.apache.org/jira/browse/SPARK-13587?focusedCommentId=15740419&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15740419], I created a [completely self-contained sample repo|https://github.com/massmutual/sample-pyspark-application] demonstrating a technique for bundling PySpark app dependencies in an isolated way. It's the technique that Ben, I, and several others discussed here in this JIRA issue.

https://github.com/massmutual/sample-pyspark-application

The approach has advantages (like letting you ship a completely isolated Python environment, so you don't even need Python installed on the workers) and disadvantages (requires YARN; increases job startup time).

Hope some of you find the sample repo useful until Spark adds more "first-class" support for Python dependency isolation.
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15923683#comment-15923683 ] Jeff Zhang commented on SPARK-13587:

I linked a detailed document about how to use it in both batch mode and interactive mode. If anyone is interested in this, you can try this PR.
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15747407#comment-15747407 ] Jeff Zhang commented on SPARK-13587:

If it is a pretty large cluster, then I would suggest setting up a private PyPI repository server.
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15747379#comment-15747379 ] Prasanna Santhanam commented on SPARK-13587:

[~zjffdu] In the case of Anaconda Python the environment is self-contained. The conda environment is managed using binaries that are hardlinked, unlike in virtualenv, so zipping the conda environment zips the entire Python system together. After that, Spark only needs to know that the Python binary it should use is relative to the archives that were distributed. Hence I was able to run my Spark application without installing Anaconda on any of my workers.

Let's give these mechanisms names so they are easier to compare: on the one hand we have the "library distribution mechanism" and on the other the "library download mechanism". Right now, the overhead is greater for the distribution mechanism because zipping the binaries eats up significant time before application startup. This was my observation on the 16-core machine, YMMV. The distribution mechanism could, however, be improved with some basic caching on the gateway node, so that subsequent zip operations for a different Spark application with similar library requirements are faster.

With the download mechanism, on the other hand, the downloads are very fast and can even be cached locally or proxied to a locally managed egg/PyPI repository/conda channel. So the download mechanism is still superior, save for some complexity in the implementation and the options to be specified. However, I'm not sure whether downloads will be throttled on publicly exposed repositories like PyPI when, say, a 1000-node Spark cluster is simultaneously requesting Python packages for all its workers.
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15747338#comment-15747338 ] Jeff Zhang commented on SPARK-13587:

[~prasanna.santha...@icloud.com] I don't understand how this can work without Python installed on the worker nodes. And regarding the overhead: whether we distribute the dependencies or download them, the overhead cannot be avoided. From my experience, one advantage of the downloading approach is that it caches the dependencies on the worker node, so if the executor runs on a node where the dependencies are already cached, it saves a lot of time setting up the virtualenv.
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15747291#comment-15747291 ] Prasanna Santhanam commented on SPARK-13587:

[~nchammas] sorry, this got buried under several other emails at my org. What you've implemented as a shell installer is exactly what I've done, except within Spark code branched off of the 2.0.1 release. I use the YARN archives mechanism to distribute the zip files and control the conda environment binaries. It should be straightforward to change my diff to work with virtualenv as well. As you've explained, the advantage of this process is that Python doesn't need to be installed at all on the worker nodes. I've also implemented the mechanism with {{--py-files}} so standalone Spark can take advantage of it, but I haven't got around to testing that yet.

The downside of the zip distribution solution, however, is that the start time of the application increased significantly: nearly 4 minutes to zip all the libraries. I tested this on a 16-core machine with 30GB memory. When I zip the binaries it ends up creating a 400MB archive for just basic libraries like {{matplotlib}}, {{scipy}}, {{numpy}}. What zip times did you experience? Much to my surprise, contrasted with the original proposal by [~zjffdu] of downloading the dependencies on all the workers, this eats up significant time for a Spark program that runs for no more than 2s. This restricted me from pushing the implementation further. I would like to hear your observations from testing your shell implementation of the same.
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740419#comment-15740419 ] Nicholas Chammas commented on SPARK-13587:

Thanks to a lot of help from [~quasi...@gmail.com] and [his blog post on this problem|http://quasiben.github.io/blog/2016/4/15/conda-spark/], I was able to develop a solution that works for Spark on YARN:

{code}
set -e

# Both these directories exist on all of our YARN nodes.
# Otherwise, everything else is built and shipped out at submit-time
# with our application.
export HADOOP_CONF_DIR="/etc/hadoop/conf"
export SPARK_HOME="/hadoop/spark/spark-2.0.2-bin-hadoop2.6"
export PATH="$SPARK_HOME/bin:$PATH"

python3 -m venv venv/
source venv/bin/activate
pip install -U pip
pip install -r requirements.pip
pip install -r requirements-dev.pip
deactivate

# This convoluted zip machinery is to ensure that the paths to the files inside the zip
# look the same to Python when it runs within YARN.
# If there is a simpler way to express this, I'd be interested to know!
pushd venv/
zip -rq ../venv.zip *
popd
pushd myproject/
zip -rq ../myproject.zip *
popd
pushd tests/
zip -rq ../tests.zip *
popd

export PYSPARK_PYTHON="venv/bin/python"

spark-submit \
  --conf "spark.yarn.appMasterEnv.PYSPARK_PYTHON=venv/bin/python" \
  --conf "spark.yarn.appMasterEnv.SPARK_HOME=$SPARK_HOME" \
  --master yarn \
  --deploy-mode client \
  --archives "venv.zip#venv,myproject.zip#myproject,tests.zip#tests" \
  run_tests.py -v
{code}

My solution is based off of Ben's, except where Ben uses conda I just use pip. I don't know if there is a way to adapt this solution to work with Spark on Mesos or Spark Standalone (and I haven't tried, since my environment is YARN), but if someone figures it out please post your solution here!

As Ben explains in [his blog post|http://quasiben.github.io/blog/2016/4/15/conda-spark/], this lets you build and ship an isolated environment with your PySpark application out to the YARN cluster. The YARN nodes don't even need to have the correct version of Python (or Python at all!) installed, because you are shipping out a complete Python environment via the {{--archives}} option.

I hope this helps some people who are looking for a workaround they can use today while a more robust solution is developed directly into Spark. And I wonder... if this {{--archives}} technique can be extended or translated to Mesos and Standalone somehow, maybe that would be a good enough solution for the time being? People would be able to run their jobs in an isolated Python environment using their tool of choice (conda or pip), and Spark wouldn't need to add any virtualenv-specific machinery.
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15714622#comment-15714622 ] Semet commented on SPARK-13587:

For myself, I share an NFS folder with all the executors. It works because they all have the same architecture and distribution.

Frankly, I am beginning to be a bit disappointed that there is no momentum, no real will to solve this huge hole in PySpark. Dependency management was solved years ago in Python, with virtualenv in general and with Anaconda in data science, but PySpark still continues to play with the PYTHONPATH, and there is no Spark core developer actively involved in helping us integrate such a patch. Dependency management for JARs is handled in a modern way by {{--packages}}, automatically downloading the files from a remote repository; why not do that for Python as well? And maybe for R too, if available? I even proposed a way to package everything in a single zip archive, called a "wheelhouse", so executors might not have to download anything.

So please help us raise this concern with the core developers, to tell them that there are several people interested in solving this issue.
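For reference, this is the JVM-side experience I mean; the coordinates below are only an example, and a hypothetical Python analogue would resolve from PyPI the way this resolves from Maven Central:

{code:bash}
# Existing JVM-side UX: --packages fetches the artifact and its transitive
# dependencies from Maven Central (or --repositories) at submit time.
# The coordinates are just an example.
spark-submit \
  --packages com.databricks:spark-csv_2.11:1.5.0 \
  my_job.py
{code}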
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15713636#comment-15713636 ] Nicholas Chammas commented on SPARK-13587:

[~tsp]:
{quote}
Previously, I have had reasonable success with zipping the contents of my conda environment in the gateway/driver node and submitting the zip file as an argument to --archives in the spark-submit command line. This approach works perfectly because it uses the existing spark infrastructure to distribute dependencies through to the workers. You actually don't even need anaconda installed on the workers since the zip can package the entire python installation within it. The downside of it being that conda zip files can bloat up quickly in a production spark application.
{quote}
Can you elaborate on how you did this? I'm willing to jump through some hoops to create a hackish way of distributing dependencies while this JIRA task gets worked out.

What I'm trying is:
# Create a virtual environment and activate it.
# Pip install my requirements into that environment, as one would in a regular Python project.
# Zip up the venv/ folder and ship it with my application using {{--py-files}}.

I'm struggling to get the workers to pick up Python dependencies from the packaged venv over what's in the system site-packages. All I want is to be able to ship out the dependencies with the application from a virtual environment all at once (i.e. without having to enumerate each dependency).

Has anyone been able to do this today? It would be good to document it as a workaround for people until this issue is resolved.
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15647331#comment-15647331 ] Prasanna Santhanam commented on SPARK-13587:

Thanks for this JIRA and the work related to it. I have been testing this patch a little with conda environments. Previously, I have had reasonable success with zipping the contents of my conda environment on the gateway/driver node and submitting the zip file as an argument to {{--archives}} on the {{spark-submit}} command line. This approach works perfectly because it uses the existing Spark infrastructure to distribute dependencies through to the workers. You actually don't even need Anaconda installed on the workers, since the zip can package the entire Python installation within it. The downside is that conda zip files can bloat up quickly in a production Spark application.

[~zjffdu] In your approach I find that the driver program still executes on the native Python installation and only the workers run within conda (virtualenv) environments. Would it not be possible to use the same conda environment throughout, i.e. set it up once on the gateway node and propagate it over the distributed cache, as mentioned in a related PR comment? I can always force the driver Python to be conda using `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON`, but that is not the same conda environment as the one created by your PythonWorkerFactory. Or is it not your intention to make it work this way?
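Roughly, the zipping approach looks like this (a sketch; the environment name and paths are examples, and it assumes the conda environment relocates cleanly):

{code:bash}
# Zip an existing conda environment and ship it via YARN's distributed cache.
# Environment name and paths are examples only.
cd ~/anaconda2/envs/myenv
zip -rq ~/myenv.zip .
cd -

PYSPARK_PYTHON=./MYENV/bin/python \
spark-submit \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./MYENV/bin/python \
  --master yarn \
  --deploy-mode client \
  --archives ~/myenv.zip#MYENV \
  my_job.py
{code}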
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15362294#comment-15362294 ] Semet commented on SPARK-13587:

Full proposal is here: https://issues.apache.org/jira/browse/SPARK-16367
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15362014#comment-15362014 ] Takao Magoori commented on SPARK-13587:

Sorry, it seems there is no isolated site-packages directory; the workdir is just added to sys.path.
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15361940#comment-15361940 ] Takao Magoori commented on SPARK-13587:

What is the reason why virtualenv is required? I feel that supporting native virtualenv is unnecessary, since each Spark executor (worker) already has its own isolated site-packages directory (the workdir), just like a native virtualenv. Is it only for conda?
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15355172#comment-15355172 ] Semet commented on SPARK-13587:

Yes, it looks cool! Here is what I have in mind; tell me if it is the wrong direction:
- Each job should execute in its own environment.
- I love wheels, and wheelhouses. Provided we build all the needed wheels on the same architecture as the cluster, or retrieve the right wheels from PyPI, pip can install all dependencies at lightning speed, without the need for an internet connection (no need to configure the proxy at some corporates, or maintain an internal mirror, etc.).
- So we deploy the job with a command line such as:
{code}
bin/spark-submit \
  --master $(spark_master) \
  --deploy-mode client \
  --conf "spark.pyspark.virtualenv.enabled=true" \
  --conf "spark.pyspark.virtualenv.type=native" \
  --conf "spark.pyspark.virtualenv.wheelhouse=/path/to/wheelhouse.zip" \
  --conf "spark.pyspark.virtualenv.script=script_name" \
  --conf "spark.pyspark.virtualenv.args='--opt1 --opt2'"
{code}
So:
- {{wheelhouse.zip}} contains all the wheels to install in a fresh virtualenv. No internet connection is needed; the script is also deployed and installed, provided it is packaged as a proper module (so easy to do with pbr).
- {{spark.pyspark.virtualenv.script}} is the entry point of the script. It should be declared in the {{scripts}} section of {{setup.py}}.
- {{spark.pyspark.virtualenv.args}} allows passing extra arguments to the script.
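Building the wheelhouse itself is stock pip; something like this (a sketch, assuming a {{requirements.txt}} that pins the job's dependencies):

{code:bash}
# Build wheels for every dependency (including transitive ones) into a
# directory, then bundle them into the archive referenced above.
pip wheel -r requirements.txt -w wheelhouse/
cd wheelhouse && zip -rq ../wheelhouse.zip . && cd -
{code}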
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15351512#comment-15351512 ] Jeff Zhang commented on SPARK-13587:

Thanks [~gae...@xeberon.net]. Have you taken a look at my PR? Any comments are welcome.
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15351074#comment-15351074 ] Semet commented on SPARK-13587:

I back this proposal and am willing to work on it.
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15324323#comment-15324323 ] Jeff Zhang commented on SPARK-13587:

Sorry, guys, I have been busy with other stuff recently and am late updating this ticket. I have just attached the design doc and created the PR. If you are interested, please help review the design doc and the PR. Thanks [~gbow...@fastmail.co.uk] [~Dripple] [~juliet] [~msukmanowsky] [~dan.blanchard] [~JnBrymn]
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15324318#comment-15324318 ] Apache Spark commented on SPARK-13587:

User 'zjffdu' has created a pull request for this issue: https://github.com/apache/spark/pull/13599
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15324316#comment-15324316 ] Apache Spark commented on SPARK-13587:

User 'zjffdu' has created a pull request for this issue: https://github.com/apache/spark/pull/13598
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311649#comment-15311649 ] Jeff Zhang commented on SPARK-13587:

Yes, I focused on YARN mode. I did some testing in local mode, but not a full test. I haven't done it on Mesos yet.
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15311646#comment-15311646 ] Jeff Zhang commented on SPARK-13587:

Thanks [~gbow...@fastmail.co.uk]. In my POC, I implemented it using both conda and native virtualenv/pip. Users can choose whichever they want by specifying the option.
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15310881#comment-15310881 ] Greg Bowyer commented on SPARK-13587:

There is also a small wrapper for this: http://knit.readthedocs.io/en/latest/ I guess there is no Mesos and local-mode support, but it's a start.
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15310872#comment-15310872 ] Greg Bowyer commented on SPARK-13587:

Does this work for people? https://www.continuum.io/blog/developer-blog/conda-spark

The downsides are:
1. It's conda, which is not quite as standard as virtualenv/pip.
2. Conda will install non-free software (e.g. MKL) - this might be a problem.

However, I think it might also help in that conda bundles dependencies like BLAS, LAPACK and friends, which might save some folks heartache when dealing with badly configured clusters.
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15310782#comment-15310782 ] Greg Bowyer commented on SPARK-13587:

I have been out of this world for a long time. The spex extension was designed to get around this, and was designed for a world where NFS was being removed from a cluster. Right now it's an ugly, ugly hack, but I might be able to spend some time making it less of one.

If this were integrated into Spark, would we want to make it transparent, or provide a flag to the launcher scripts? I figure a --py-deps=requirements.txt might work?
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15234264#comment-15234264 ] John Berryman commented on SPARK-13587:

At my work we're using devpi and a home-cooked proxy server to "fake" pip and serve up cached wheels. I'm not sure such a solution should be built into Spark, but it is at least a POC for being able to make the virtualenv idea fast (after the first build, at least). Really, it wouldn't be hard to set this up outside of Spark; you would just need to make {{pip}} actually point to {{my_own_specially_wheel_caching_pip}}.
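Pointing pip at such an index is a small config change; a sketch, with a hypothetical internal hostname:

{code:bash}
# Point pip at a local devpi index instead of PyPI. The hostname is
# hypothetical; /root/pypi is devpi's built-in PyPI-mirroring index.
mkdir -p ~/.pip
cat > ~/.pip/pip.conf <<'EOF'
[global]
index-url = http://devpi.internal:3141/root/pypi/+simple/
trusted-host = devpi.internal
EOF
{code}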
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15218659#comment-15218659 ] Mike Sukmanowsky commented on SPARK-13587:

That's the (hopefully) beautiful thing about pex.

{noformat}
$ pex numpy pandas -o c_requirements.pex
$ ./c_requirements.pex
Python 2.7.10 (default, Aug 4 2015, 19:54:05)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> import numpy
>>> import pandas
>>> print numpy.__version__
1.11.0
>>> print pandas.__version__
0.18.0
>>>
{noformat}

The catch of course is that the pex file itself would have to be built on a node with the same arch as the Spark workers (i.e. you can't build a pex on Mac OS and ship it to a Linux cluster unless all dependencies are pure Python). To build a platform-agnostic env, we'd have to look at conda.

I know there was an effort to support pex with PySpark, https://github.com/URXtech/spex, but it hasn't seen much activity recently. I tried reaching out to the author but got no response. I could take a shot at adding support for this unless @zjffdu already has plans.
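Since a pex with no entry point behaves like a Python interpreter, in principle it could stand in for PYSPARK_PYTHON directly; an untested sketch (file names are examples):

{code:bash}
# Untested sketch: ship the pex to executors and use it as the Python
# interpreter. A pex built without an entry point acts like a drop-in
# interpreter. File names are illustrative.
pex numpy pandas -o env.pex
export PYSPARK_PYTHON=./env.pex
spark-submit \
  --master yarn \
  --deploy-mode client \
  --files env.pex \
  my_job.py
{code}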
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218610#comment-15218610 ] Juliet Hougland commented on SPARK-13587: - Being able to ship around pex files like we do .py and .egg files sounds very reasonable from a delineation-of-responsibilities perspective. I like the idea and would support a change like that. A question/edge case worth working out is how pex files relate to compiled C libraries that Python libraries may need to link against. I don't know much about pex, but my initial assessment is that it shouldn't be a huge problem. I like this solution.
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218109#comment-15218109 ] Mike Sukmanowsky commented on SPARK-13587: -- NFS is a good idea as a current workaround. It's still a pain to build all the required eggs from a dependency file like requirements.txt, unless pyspark supports something like .pex files, which build a self-contained Python executable from said requirements file. If pex files were supported by pyspark, the daemon and workers would simply use /path/to/file.pex as the Python executable and get a fully baked virtualenv for free, without Spark needing to concern itself with building these files. Twitter has reportedly used this to ship distributed Python applications for some time.
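To make that concrete, a hypothetical sketch of what such support could look like; the pex invocation is real, but using a .pex file as the workers' interpreter is not something pyspark supports today, and whether the daemon launch arguments would pass through cleanly is untested:
{code}
# Build the self-contained environment once, on a box matching the workers' platform
$ pex numpy pandas -o c_requirements.pex

# Hypothetical: point the workers at the pex file as their Python executable
$ PYSPARK_PYTHON=./c_requirements.pex spark-submit --master yarn script.py
{code}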
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15214038#comment-15214038 ] Dripple commented on SPARK-13587: - [~juliet] can you give details or a pointer on the solution you describe? Thanks.
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15212848#comment-15212848 ] Jeff Zhang commented on SPARK-13587: bq. Have you considered using NFS or Amazon EFS to allow users to create and manage their own envs and then mounting those on worker/executor nodes?
The problem is that most of the time you are not the administrator and don't have permission to do that. It's inefficient to ask your administrator to install the environment for you.
bq. one alternative to shared mounts is to store the thing in HDFS and use something like --files / --archives in Spark.
Some packages are binary and need to be compiled, and it is not easy to do dependency management this way.
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15212191#comment-15212191 ] Juliet Hougland commented on SPARK-13587: - I really do think Spark and PySpark need to stay out of the business of installing anything for people. A generic executable is relatively neutral as to what exactly that executable does, which is good. Spark's scope should be computation/execution, not environment setup and teardown. Have you considered using NFS or Amazon EFS to allow users to create and manage their own envs and then mounting those on worker/executor nodes? This is an elegant solution that we have seen deployed successfully (many experienced people at Cloudera, like Guru M and Tristan Z, recommend it). Given the description of your problem, I believe it should suit your needs. [~vanzin] has suggested "one alternative to shared mounts is to store the thing in HDFS and use something like --files / --archives in Spark. The distribution to new containers is handled by YARN, and Spark just would need some adjustments to find the right executable inside those archives."
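A minimal sketch of the shared-mount approach, assuming an environment has already been created under an NFS path that is mounted identically on every node (/mnt/shared/envs/myproject is a placeholder):
{code}
$ PYSPARK_PYTHON=/mnt/shared/envs/myproject/bin/python \
  spark-submit --master yarn --deploy-mode client script.py
{code}
Users manage the environment with plain virtualenv/conda on the shared mount, and Spark only needs to be told which interpreter to run.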
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15201763#comment-15201763 ] Mike Sukmanowsky commented on SPARK-13587: -- Sorry to bug [~juliet] - any thoughts? We're currently trying to come up with some kind of workaround for Spark 1.6.0 on Amazon EMR that'd allow us to create a conda/virtual env on YARN nodes prior to the application running, but I don't think there's anything we can really do. Building on my suggestion, it'd probably also be helpful to have a --teardown option added to spark-submit that'd allow execution of some script after the Spark application terminates. This way, conda/virtual envs with temporary names could be created and destroyed.
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184210#comment-15184210 ] Mike Sukmanowsky commented on SPARK-13587: -- [~juliet] I get the concerns about Spark supporting a complex virtualenv process. My main objection to only supporting something like --pyspark-python is the difficulty we currently face on Amazon EMR, and really on any Spark cluster where nodes may be added after an application is submitted. We have a bootstrap script which provisions our EMR nodes with the required Python dependencies. This approach works alright for a cluster which tends to run very few applications, but with multiple tenants it quickly gets unwieldy. Ideally, Spark applications could be submitted from a master node without the user ever having to worry about dependency management at the node-bootstrapping level. I was thinking that an interesting approach to this problem would be to provide some sort of --bootstrap option to spark-submit which points to any executable; Spark would run it and check for a 0 exit code before continuing to launch the application itself. This script could execute any code, such as creating a virtualenv or conda env and installing requirements. If a non-zero exit code were received, the Spark application would not continue. The generalization gets the Spark community away from having to support conda/virtualenv eccentricities. Thoughts?
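To illustrate the proposal, a sketch of what such a bootstrap executable might look like, under the assumption that the hypothetical --bootstrap flag simply runs it on each node and aborts the launch on a non-zero exit; the env path and requirements file are placeholders:
{code}
#!/usr/bin/env bash
# Hypothetical bootstrap script run before the application launches.
# set -e makes any failed step exit non-zero, which would abort the launch.
set -e
ENV_DIR=/tmp/spark_app_env        # placeholder per-application path
virtualenv "$ENV_DIR"
"$ENV_DIR/bin/pip" install -r /path/to/requirements.txt
{code}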
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179254#comment-15179254 ] Jeff Zhang commented on SPARK-13587: SPARK-13081 is almost done. But for this ticket I definitely need your feedback and help; I am not a heavy Python user, so I may miss things that Python programmers care about.
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179253#comment-15179253 ] Juliet Hougland commented on SPARK-13587: - That is wonderful. Let me know if you'd like me to help work on it. It has been dangling on my mental todo list for a while.
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179251#comment-15179251 ] Jeff Zhang commented on SPARK-13587: Actually, I created this ticket several weeks ago: SPARK-13081 :)
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179250#comment-15179250 ] Juliet Hougland commented on SPARK-13587: - I made a comment related to this below. TL;DR: I think the suggested --py-env option could be encompassed by an already needed --pyspark_python sort of option.
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179247#comment-15179247 ] Juliet Hougland commented on SPARK-13587: - Currently, the way users specify the workers' Python interpreter is through the PYSPARK_PYTHON env variable. It would be beneficial to users to allow that path to be specified by a CLI flag; that is a current rough edge of using already-installed envs on a cluster. If this were added as a CLI flag, I could see valid options being a path like 'pyspark/python/path', 'venv' (temp virtualenv), and 'conda' (temp conda env), with a second flag required to specify the requirements file. I think it helps prevent an explosion of flags for spark-submit while handling a very important and often-changed parameter for a job. What do you think of this?
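For comparison, the current mechanism next to a hypothetical flag form; the --pyspark-python flag does not exist and is shown only to illustrate the proposal:
{code}
# Today: interpreter chosen via an environment variable
$ PYSPARK_PYTHON=/opt/envs/project/bin/python spark-submit script.py

# Hypothetical CLI equivalent
$ spark-submit --pyspark-python /opt/envs/project/bin/python script.py
{code}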
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179012#comment-15179012 ] Jeff Zhang commented on SPARK-13587: Thanks for the feedback [~j_houg]. Yes, that's why I make the virtualenv application-scoped. Creating a Python package management tool for a whole cluster would be a big project and too heavy. And what does "--pyspark_python" mean?
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178883#comment-15178883 ] Jeff Zhang commented on SPARK-13587: Thanks for letting me know about this.
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178536#comment-15178536 ] Juliet Hougland commented on SPARK-13587: - If pyspark allows users to create virtual environments, users will also want and need other features of Python environment management on a cluster. I think this change would broaden the scope of PySpark to include Python package management on a cluster, and I do not think that Spark should be in the business of creating Python environments. The support load in terms of feature requests, mailing list traffic, etc. would be very large. This feature would begin to solve a problem, but would also put us on the hook for many more. I agree with the general intention of this JIRA: make it easier to manage and interact with complex Python environments on a cluster. Perhaps there are other ways to accomplish this without broadening scope and functionality as much, for example checking a requirements file against an environment before execution.
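That last idea can be done without installing anything; a minimal sketch using the long-standing pkg_resources API to verify the current interpreter's environment against a requirements file before launching a job (the file path is a placeholder):
{code}
import sys
import pkg_resources

def check_requirements(path):
    # Read non-empty, non-comment lines from a pip-style requirements file.
    with open(path) as f:
        reqs = [line.strip() for line in f
                if line.strip() and not line.startswith("#")]
    try:
        # Raises if any requirement is missing or at a conflicting version.
        pkg_resources.require(reqs)
    except (pkg_resources.DistributionNotFound,
            pkg_resources.VersionConflict) as e:
        sys.exit("Environment does not satisfy %s: %s" % (path, e))

check_requirements("requirements.txt")
{code}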
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175776#comment-15175776 ] Dan Blanchard commented on SPARK-13587: --- `conda list --export` will omit all pip-installed requirements and only export the conda ones. It is pretty common for people to have conda environments where they've used pip to install things that weren't available through conda.
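A quick illustration of the gap, assuming an env holding one conda package and one pip-installed package; the names, versions, and exact output are illustrative and may vary by conda version:
{code}
$ conda list --export          # conda packages only; the pip-installed dep is missing
numpy=1.11.0=py27_0

$ conda env export             # YAML form; pip-installed deps appear under a pip: section
dependencies:
  - numpy=1.11.0=py27_0
  - pip:
    - requests==2.9.1
{code}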
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175715#comment-15175715 ] Jeff Zhang commented on SPARK-13587: Thanks for the feedback [~dan.blanchard]. In my POC I use "conda list --export" to create the requirements file, and conda seems able to consume it, although the format is a little different from virtualenv's.
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175672#comment-15175672 ] Dan Blanchard commented on SPARK-13587: --- One thing to note is that conda doesn't use the same requirements file format as virtualenv/pip. You'll want to use [conda env create|http://conda.pydata.org/docs/commands/env/conda-env-create.html], which uses its own separate YAML format. I have a [pull request open to support requirements.txt files|https://github.com/conda/conda-env/pull/172], but it has been waiting for action for some time.
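For reference, a minimal sketch of the YAML format that `conda env create` expects; the env name and packages are illustrative:
{code}
# environment.yml, consumed via: conda env create -f environment.yml
name: myenv
dependencies:
  - python=2.7
  - numpy
  - pip:
    - some-pip-only-package
{code}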
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175646#comment-15175646 ] Mike Sukmanowsky commented on SPARK-13587: -- Perfect and understood about not wanting to promote these to first-class citizens without wider feedback. At the least, I'd say both {{--py-files}} and {{--py-venv}} options could be supported if we're concerned about introducing a deprecation like this.
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175045#comment-15175045 ] Jeff Zhang commented on SPARK-13587: spark.pyspark.virtualenv.requirements is a local file (which would be distributed to all nodes). Regarding upgrading these to first-class citizens, I would be conservative about that; it needs more feedback from other users.
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175018#comment-15175018 ] Mike Sukmanowsky commented on SPARK-13587: -- One thought that just occurred to me: does {{spark.pyspark.virtualenv.requirements}} point to a path on the master node for a requirements file? It'd make sense if that were the case and the requirements file was then shipped to the other nodes, instead of assuming the file exists on all Spark nodes at the same location. It might also be a good idea to upgrade these to first-class citizens of spark-submit by supporting them as optional params instead of config properties. I'd go so far as to say it makes sense to deprecate {{--py-files}} in favour of:
* {{--py-venv-type=conda}}
* {{--py-venv-bin=/path/to/conda}}
* {{--py-venv-requirements=/local/path/to/requirements.txt}}
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175010#comment-15175010 ] Mike Sukmanowsky commented on SPARK-13587: -- Gotcha. I might suggest {{spark.pyspark.virtualenv.bin.path}} in that case.
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175003#comment-15175003 ] Jeff Zhang commented on SPARK-13587: Thanks for your feedback [~msukmanowsky]. spark.pyspark.virtualenv.path is not the path where the virtualenv is created; it is the path to the virtualenv/conda executable used to create the virtualenv (I need to rename it to something clearer to avoid confusion). In my POC, the virtualenv is created on all the executors, not only the driver. As you said, some Python packages depend on C libraries; we cannot guarantee they would work if we compiled them on the driver and distributed them to the other nodes.
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174996#comment-15174996 ] Mike Sukmanowsky commented on SPARK-13587: -- Thanks for letting me know about this [~jeffzhang]. In general, I'm +1 on the proposal. Virtualenvs are the way to go to install requirements and ensure isolation of dependencies between multiple driver scripts. As you noted, though, installing hefty requirements like pandas or numpy (assuming you aren't using conda) would add a pretty significant overhead to startup, which could be amortized if the driver is assumed to run for a long enough period of time. Conda would pretty well eliminate that problem, as it provides pre-compiled binaries for most OSs. I'd like to offer [PEX|https://pex.readthedocs.org/en/stable/] as an alternative, where spark-submit would build a self-contained virtualenv in a .pex file on the Spark master node and then distribute it to all other nodes. However, it turns out PEX doesn't support editable requirements, and it introduces the assumption that all nodes in a cluster are homogeneous, so that a Python package with C extensions compiled on the master node would run on worker nodes without issue. That assumption may be a leap too far to make for all Spark users. One thing I'm not entirely sure of is the need for the spark.pyspark.virtualenv.path property. If the virtualenv is temporary, why would this path ever be specified? Wouldn't a temporary path be used and subsequently removed after the Python worker completes?
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15173590#comment-15173590 ] Sean Owen commented on SPARK-13587: --- CC [~j_houg] for interest, comment
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15173228#comment-15173228 ] Jeff Zhang commented on SPARK-13587: I have implemented a POC for this feature. Here's one simple command showing how to use virtualenv in pyspark:
{code}
bin/spark-submit --master yarn --deploy-mode client \
  --conf "spark.pyspark.virtualenv.enabled=true" \
  --conf "spark.pyspark.virtualenv.type=conda" \
  --conf "spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" \
  --conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda" \
  ~/work/virtualenv/spark.py
{code}
There are 4 properties that need to be set:
* spark.pyspark.virtualenv.enabled (enables virtualenv)
* spark.pyspark.virtualenv.type (native/conda are supported; default is native)
* spark.pyspark.virtualenv.requirements (requirements file for the dependencies)
* spark.pyspark.virtualenv.path (path to the virtualenv/conda executable)
Comments and feedback are welcome about how to improve it and whether it's valuable for users.
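For the native virtualenv type, the requirements file is the usual pip format; a minimal illustrative example (the package versions are placeholders):
{code}
# requirements file passed via spark.pyspark.virtualenv.requirements
numpy==1.11.0
pandas==0.18.0
{code}
Note that the "conda list --export" format consumed for the conda type differs slightly, as discussed above.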