[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2019-10-04 Thread Furcy Pin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16944564#comment-16944564
 ] 

Furcy Pin commented on SPARK-13587:
---

Hello,

I don't know where to ask this, but we have been using this feature on 
HDInsight 2.6.5 and we sometimes hit a concurrency issue with pip.
 Basically, it looks like on rare occasions several executors set up the 
virtualenv simultaneously, which ends up in a kind of deadlock.

When we run the pip install command used by the executor manually, it 
hangs, and when cancelled it throws this error:
{code:java}
File 
"/mnt/resource/hadoop/yarn/local/usercache/livy/appcache/application_XXX/container_XXX/virtualenv_application_XXX/lib/python3.5/site-packages/pip/_vendor/lockfile/linklockfile.py",
 line 31, in acquire
 os.link(self.unique_name, self.lock_file)
 FileExistsError: [Errno 17] File exists: '/home/yarn/-' -> 
'/home/yarn/selfcheck.json.lock'{code}
This happens with "spark.pyspark.virtualenv.type=native". 
We haven't tried with conda yet.

It is pretty bad because when it happens:
 - some executors of the Spark job just get stuck, and so the Spark job gets stuck
 - even if the job is restarted, the lock file stays there and renders the 
whole YARN host useless.

Any suggestions or workarounds would be appreciated. 
 One idea would be to remove the "--cache-dir /home/yarn" option which is 
currently used in the pip install command, but it doesn't seem to be 
configurable right now.
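
For anyone hitting the same hang, here is a rough sketch of the kind of pip 
invocation that avoids the shared cache and its lock file. This is only an 
illustration (the exact command Spark builds is not configurable today, and the 
flock wrapper is a hypothetical workaround, not something Spark does):
{code:bash}
# Illustration only: these standard pip flags avoid the shared cache directory
# and the selfcheck.json.lock file shown in the traceback above.
pip install --no-cache-dir --disable-pip-version-check -r requirements.txt

# Alternatively, serialize concurrent installs on the same host with an
# explicit lock (hypothetical wrapper):
flock /tmp/pyspark-virtualenv.lock -c \
  "pip install --no-cache-dir -r requirements.txt"
{code}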

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Jeff Zhang
>Priority: Major
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> for complicated ones, especially those with transitive dependencies)
> * Another way is to install packages manually on each node (time-consuming, 
> and not easy to switch between environments)
> Python now has two different virtualenv implementations: one is the native 
> virtualenv, the other is conda. This JIRA aims to bring these two tools to 
> the distributed environment.






[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2018-10-22 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16659374#comment-16659374
 ] 

Ruslan Dautkhanov commented on SPARK-13587:
---

We're using conda environments shared across worker nodes through NFS. Has 
anyone used something like this?

Another option that's more in line with this JIRA's description is `conda-pack`, 
using YARN's `--archives` option to distribute the packed environment:

{code:bash}
$ PYSPARK_DRIVER_PYTHON=`which python` \
PYSPARK_PYTHON=./environment/bin/python \
spark-submit \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
--master yarn \
--deploy-mode client \
--archives environment.tar.gz#environment \
script.py
{code}

More details - https://conda.github.io/conda-pack/spark.html
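
For completeness, a minimal sketch of how the environment.tar.gz above could be 
produced with conda-pack (the environment name "example" and the package list 
are placeholders):
{code:bash}
# Create a conda environment with the application's dependencies
conda create -y -n example python=3.6 numpy pandas

# Install conda-pack and pack the environment into a relocatable archive
conda install -y -c conda-forge conda-pack
conda pack -n example -o environment.tar.gz
{code}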






[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2018-06-14 Thread Matt Mould (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512433#comment-16512433
 ] 

Matt Mould commented on SPARK-13587:


What is the current status of this ticket, please? This 
[article|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html]
 suggests that it's done, but it doesn't work for me with the following 
command.
{code:java}
spark-submit --deploy-mode cluster --master yarn --py-files 
parallelisation_hack-0.1-py2.7.egg --conf spark.pyspark.virtualenv.enabled=true 
 --conf spark.pyspark.virtualenv.type=native --conf 
spark.pyspark.virtualenv.requirements=requirements.txt --conf 
spark.pyspark.virtualenv.bin.path=virtualenv --conf 
spark.pyspark.python=python3 pyspark_poc_runner.py{code}




[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2017-10-24 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217221#comment-16217221
 ] 

Semet commented on SPARK-13587:
---

Hello. To me this solution is equivalent to the "Wheelhouse" proposal I made 
(SPARK-16367), even without having to modify PySpark at all. I even think you 
can package a wheelhouse using this {{--archives}} argument.
The drawback is indeed that your spark-submit has to send this package to each 
node (1 to n). If PySpark supported {{requirements.txt}}/{{Pipfile}} dependency 
description formats, each node would download the dependencies by itself...




[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2017-10-24 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217115#comment-16217115
 ] 

Nicholas Chammas commented on SPARK-13587:
--

To follow up on my [earlier 
comment|https://issues.apache.org/jira/browse/SPARK-13587?focusedCommentId=15740419=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15740419],
 I created a [completely self-contained sample 
repo|https://github.com/massmutual/sample-pyspark-application] demonstrating a 
technique for bundling PySpark app dependencies in an isolated way. It's the 
technique that Ben, I, and several others discussed here in this JIRA issue.

https://github.com/massmutual/sample-pyspark-application

The approach has advantages (like letting you ship a completely isolated Python 
environment, so you don't even need Python installed on the workers) and 
disadvantages (requires YARN; increases job startup time). Hope some of you 
find the sample repo useful until Spark adds more "first-class" support for 
Python dependency isolation.




[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2017-03-14 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15923683#comment-15923683
 ] 

Jeff Zhang commented on SPARK-13587:


I linked a detailed document about how to use it in both batch mode and 
interactive mode. If anyone is interested in this, you can try this PR.




[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-12-13 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15747407#comment-15747407
 ] 

Jeff Zhang commented on SPARK-13587:


If it is a pretty large cluster, then I would suggest setting up a private 
PyPI repository server.




[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-12-13 Thread Prasanna Santhanam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15747379#comment-15747379
 ] 

Prasanna Santhanam commented on SPARK-13587:


[~zjffdu] In the case of Anaconda Python, the environment is self-contained. The 
conda environment is managed using binaries that are hardlinked, unlike in 
virtualenv, so zipping the conda environment zips the entire Python system 
together. After that, Spark only needs to know that the Python binary it 
should use is relative to the archives that were distributed. Hence I was able 
to run my Spark application without installing Anaconda on any of my workers.

Let's give these mechanisms names to make them easier to compare. On the one 
hand we have the "Library Distribution Mechanism" and on the other the 
"Library Download Mechanism". Right now, the overhead is greater for the 
distribution mechanism because zipping the binaries eats up significant 
time before application startup. This was my observation on a 16-core 
machine, YMMV. The distribution mechanism can however be improved with 
some basic caching on the gateway node, so that subsequent zip operations 
for a different Spark application with similar library requirements are 
faster (see the sketch below).

On the other hand, with the library download mechanism the downloads are very 
fast and can even be cached locally or proxied to a locally managed egg/PyPI 
repository/conda channel. So the download mechanism is still superior, save for 
some complexity in the implementation and the options to be specified. However, I'm 
not sure whether downloads will be throttled by publicly exposed repositories 
like PyPI when, say, a 1000-node Spark cluster simultaneously requests 
Python packages from all its workers.
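
To make the gateway-node caching idea concrete, here is a rough sketch (not 
part of any patch; environment.yml, the conda env path, and the cache location 
are placeholders) of reusing a previously zipped conda environment when the 
library requirements have not changed:
{code:bash}
# Key the cached archive on a hash of the environment specification.
SPEC_HASH=$(md5sum environment.yml | awk '{print $1}')
CACHED_ZIP="$HOME/.cache/spark-envs/conda_env_${SPEC_HASH}.zip"

if [ ! -f "$CACHED_ZIP" ]; then
  mkdir -p "$(dirname "$CACHED_ZIP")"
  # Zip the conda environment only when no cached archive exists yet.
  (cd /opt/anaconda/envs/myenv && zip -rq "$CACHED_ZIP" .)
fi

# Ship the (possibly cached) archive to the workers via YARN's distributed cache.
spark-submit --master yarn --deploy-mode cluster \
  --archives "$CACHED_ZIP#CONDA" \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./CONDA/bin/python \
  --conf spark.executorEnv.PYSPARK_PYTHON=./CONDA/bin/python \
  my_app.py
{code}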









[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-12-13 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15747338#comment-15747338
 ] 

Jeff Zhang commented on SPARK-13587:


[~prasanna.santha...@icloud.com] I don't understand how this can work without 
Python installed on the worker nodes. And regarding the overhead, I think that 
whether we distribute the dependencies or download them, the overhead cannot 
be avoided. From my experience, one advantage of the downloading approach is 
that it caches the dependencies on the worker node, so if the executor runs on 
a node where the dependencies are already cached, it saves a lot of time when 
setting up the virtualenv.




[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-12-13 Thread Prasanna Santhanam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15747291#comment-15747291
 ] 

Prasanna Santhanam commented on SPARK-13587:


[~nchammas] sorry, this got buried in several other emails at my org. 

What you've implemented as a shell installer is exactly what I've done, except 
within Spark code branched off the 2.0.1 release. I use the YARN archives 
mechanism to distribute the zip files and control the conda environment 
binaries. It should be straightforward to change my diff to work with 
virtualenv as well. As you've explained, the advantage of this process is that 
Python doesn't need to be installed at all on the worker nodes. I've also 
implemented the mechanism with {{--py-files}} so standalone Spark can take 
advantage of it, but I haven't gotten around to testing it yet.

The downside of the zip distribution solution, however, is that the start time 
of the application increases significantly - nearly 4 minutes to zip all 
libraries. I tested this on a 16-core machine with 30 GB of memory. When I zip 
the binaries it ends up creating a 400 MB archive for just basic libraries like 
{{matplotlib}}, {{scipy}}, and {{numpy}}. What zip times did you experience?

Much to my surprise, contrasted with the original proposal by [~zjffdu] of 
downloading the dependencies on all the workers, this eats up significant time 
for a Spark program that runs for no more than 2s. This kept me from pushing 
the implementation further. I would like to hear your observations from 
testing your shell implementation of the same.




[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-12-11 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15740419#comment-15740419
 ] 

Nicholas Chammas commented on SPARK-13587:
--

Thanks to a lot of help from [~quasi...@gmail.com] and [his blog post on this 
problem|http://quasiben.github.io/blog/2016/4/15/conda-spark/], I was able to 
develop a solution that works for Spark on YARN:

{code}
set -e

# Both these directories exist on all of our YARN nodes.
# Otherwise, everything else is built and shipped out at submit-time
# with our application.
export HADOOP_CONF_DIR="/etc/hadoop/conf"
export SPARK_HOME="/hadoop/spark/spark-2.0.2-bin-hadoop2.6"

export PATH="$SPARK_HOME/bin:$PATH"

python3 -m venv venv/
source venv/bin/activate

pip install -U pip
pip install -r requirements.pip
pip install -r requirements-dev.pip

deactivate

# This convoluted zip machinery is to ensure that the paths to the files inside the zip
# look the same to Python when it runs within YARN.
# If there is a simpler way to express this, I'd be interested to know!
pushd venv/
zip -rq ../venv.zip *
popd
pushd myproject/
zip -rq ../myproject.zip *
popd
pushd tests/
zip -rq ../tests.zip *
popd

export PYSPARK_PYTHON="venv/bin/python"

spark-submit \
  --conf "spark.yarn.appMasterEnv.PYSPARK_PYTHON=venv/bin/python" \
  --conf "spark.yarn.appMasterEnv.SPARK_HOME=$SPARK_HOME" \
  --master yarn \
  --deploy-mode client \
  --archives "venv.zip#venv,myproject.zip#myproject,tests.zip#tests" \
  run_tests.py -v
{code}

My solution is based on Ben's, except that where Ben uses conda I just use pip. 
I don't know if there is a way to adapt this solution to work with Spark on 
Mesos or Spark Standalone (I haven't tried, since my environment is YARN), 
but if someone figures it out, please post your solution here!

As Ben explains in [his blog 
post|http://quasiben.github.io/blog/2016/4/15/conda-spark/], this lets you 
build and ship an isolated environment with your PySpark application out to the 
YARN cluster. The YARN nodes don't even need to have the correct version of 
Python (or Python at all!) installed, because you are shipping out a complete 
Python environment via the {{--archives}} option.

I hope this helps some people who are looking for a workaround they can use 
today while a more robust solution is developed directly into Spark.

And I wonder... if this {{--archives}} technique can be extended or translated 
to Mesos and Standalone somehow, maybe that would be a good enough solution for 
the time being? People would be able to run their jobs in an isolated Python 
environment using their tool of choice (conda or pip), and Spark wouldn't need 
to add any virtualenv-specific machinery.




[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-12-02 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15714622#comment-15714622
 ] 

Semet commented on SPARK-13587:
---

For myself, I share an NFS folder with all the executors. It works because they 
all have the same architecture and distribution.

Frankly, I am beginning to be a bit disappointed that there is no enthusiasm, 
no real will to solve this huge hole in PySpark. Dependency management was 
solved years ago in Python, with virtualenv in general and with Anaconda in 
data science, but PySpark still continues to play with the PYTHONPATH, and 
there is no Spark core developer actively involved in helping us integrate such 
a patch. Dependency management for JARs is now handled by {{--packages}}, which 
automatically downloads the files from a remote repository - why not do that 
for Python as well? And maybe for R too, if available? I even proposed a way to 
package everything in a single zip archive, called a "wheelhouse", so executors 
might not have to download anything.

So please help us raise this concern with the core developers, to tell them 
that several people are interested in solving this issue.




[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-12-01 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15713636#comment-15713636
 ] 

Nicholas Chammas commented on SPARK-13587:
--

[~tsp]:

{quote}
Previously, I have had reasonable success with zipping the contents of my conda 
environment in the gateway/driver node and submitting the zip file as an 
argument to --archives in the spark-submit command line. This approach works 
perfectly because it uses the existing spark infrastructure to distribute 
dependencies through to the workers. You actually don't even need anaconda 
installed on the workers since the zip can package the entire python 
installation within it. The downside of it being that conda zip files can bloat 
up quickly in a production spark application.
{quote}

Can you elaborate on how you did this? I'm willing to jump through some hoops 
to create a hackish way of distributing dependencies while this JIRA task gets 
worked out.

What I'm trying to do is:
# Create a virtual environment and activate it.
# pip install my requirements into that environment, as one would in a regular 
Python project.
# Zip up the venv/ folder and ship it with my application using {{--py-files}}.

I'm struggling to get the workers to pick up Python dependencies from the 
packaged venv over what's in the system site-packages. All I want is to be able 
to ship out the dependencies with the application from a virtual environment 
all at once (i.e. without having to enumerate each dependency).

Has anyone been able to do this today? It would be good to document it as a 
workaround for people until this issue is resolved.




[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-11-08 Thread Prasanna Santhanam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15647331#comment-15647331
 ] 

Prasanna Santhanam commented on SPARK-13587:


Thanks for this JIRA and the work related to it - I have been testing this 
patch a little with conda environments.

Previously, I have had reasonable success with zipping the contents of my conda 
environment on the gateway/driver node and submitting the zip file as an 
argument to {{--archives}} in the {{spark-submit}} command line. This approach 
works perfectly because it uses the existing Spark infrastructure to distribute 
dependencies through to the workers. You actually don't even need Anaconda 
installed on the workers, since the zip can package the entire Python 
installation within it. The downside is that conda zip files can bloat up 
quickly in a production Spark application.

[~zjffdu] In your approach I find that the driver program still executes on the 
native Python installation and only the workers run within conda (virtualenv) 
environments. Would it not be possible to use the same conda environment 
throughout? i.e. set it up once on the gateway node and propagate it over the 
distributed cache, as mentioned in a related PR comment.

I can always force the driver Python to be conda using `PYSPARK_PYTHON` and 
`PYSPARK_DRIVER_PYTHON`, but that is not the same conda environment as the one 
created by your PythonWorkerFactory. Or is it not your intention to make it 
work this way?
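
For reference, a minimal sketch of forcing both the driver and the executors 
onto one and the same conda environment via environment variables (the path is 
a placeholder and must exist on each node, or be shipped via --archives as 
discussed elsewhere in this thread; this does not reuse the environment created 
by PythonWorkerFactory, which is exactly the gap described above):
{code:bash}
# Point both the driver and the executors at the same conda environment.
export PYSPARK_DRIVER_PYTHON=/opt/anaconda/envs/myenv/bin/python
export PYSPARK_PYTHON=/opt/anaconda/envs/myenv/bin/python

spark-submit --master yarn --deploy-mode client my_app.py
{code}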




[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-07-05 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15362294#comment-15362294
 ] 

Semet commented on SPARK-13587:
---

Full proposal is here: https://issues.apache.org/jira/browse/SPARK-16367




[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-07-04 Thread Takao Magoori (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15362014#comment-15362014
 ] 

Takao Magoori commented on SPARK-13587:
---

Sorry. It seems there is no isolated site-packages directory. The workdir is 
just added to sys.path.




[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-07-04 Thread Takao Magoori (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15361940#comment-15361940
 ] 

Takao Magoori commented on SPARK-13587:
---

What is the reason virtualenv is required?
I feel supporting native virtualenv is unnecessary, since each Spark 
executor (worker) already has its own isolated site-packages directory (workdir), 
just like native virtualenv. Is this only for conda?





[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-06-29 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355172#comment-15355172
 ] 

Semet commented on SPARK-13587:
---

Yes, it looks cool!
Here is what I have in mind; tell me if it is the wrong direction:
- each job should execute in its own environment. 
- I love wheels, and wheelhouses. Provided we build all the needed wheels on a 
machine with the same architecture as the cluster, or we retrieve the right 
wheels from PyPI, pip can install all dependencies with lightning speed, 
without the need for an internet connection (no need to configure a proxy at 
some corporations, or to maintain an internal mirror, etc.).
- so we deploy the job with a command line such as:
  
{code}
bin/spark-submit --master $(spark_master) --deploy-mode client --conf 
"spark.pyspark.virtualenv.enabled=true" --conf 
"spark.pyspark.virtualenv.type=native" --conf 
"spark.pyspark.virtualenv.wheelhouse=/path/to/wheelhouse.zip" --conf 
"spark.pyspark.virtualenv.script=script_name" --conf 
"spark.pyspark.virtualenv.args='--opt1 --opt2'"
{code}

so:
- {{wheelhouse.zip}} contains all the wheels to install into a fresh 
virtualenv. No internet connection is needed; the script is also deployed and 
installed, provided it is packaged as a proper module (easy to do with pbr). A 
sketch of building such an archive follows below.
- {{spark.pyspark.virtualenv.script}} is the entry point of the script. It 
should be declared in the {{scripts}} section of {{setup.py}}
- {{spark.pyspark.virtualenv.args}} allows passing extra arguments to the script
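
As an illustration of the wheelhouse idea (not part of this proposal's patch; 
the file names are placeholders), such an archive could be built with plain pip:
{code:bash}
# Build wheels for the application and all of its transitive dependencies.
# Run this on a machine with the same architecture/OS as the cluster nodes.
pip wheel -r requirements.txt -w wheelhouse/

# Bundle the wheels into a single archive that can be shipped to the cluster.
cd wheelhouse && zip -rq ../wheelhouse.zip . && cd ..

# On the worker side, installation would need no network access at all:
# pip install --no-index --find-links=wheelhouse/ -r requirements.txt
{code}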




[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-06-27 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351512#comment-15351512
 ] 

Jeff Zhang commented on SPARK-13587:


Thanks [~gae...@xeberon.net]. Have you taken a look at my PR? Any comments are 
welcome.




[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-06-27 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351074#comment-15351074
 ] 

Semet commented on SPARK-13587:
---

I back this proposal and am willing to work on it.




[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-06-10 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15324323#comment-15324323
 ] 

Jeff Zhang commented on SPARK-13587:


Sorry, guys, I have been busy with other stuff recently and late in updating 
this ticket. I just attached the design doc and created the PR. If you are 
interested, please help review the design doc and PR. Thanks 
[~gbow...@fastmail.co.uk] [~Dripple] [~juliet] [~msukmanowsky] 
[~dan.blanchard] [~JnBrymn]




[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-06-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15324318#comment-15324318
 ] 

Apache Spark commented on SPARK-13587:
--

User 'zjffdu' has created a pull request for this issue:
https://github.com/apache/spark/pull/13599




[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-06-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15324316#comment-15324316
 ] 

Apache Spark commented on SPARK-13587:
--

User 'zjffdu' has created a pull request for this issue:
https://github.com/apache/spark/pull/13598




[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-06-01 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15311649#comment-15311649
 ] 

Jeff Zhang commented on SPARK-13587:


Yes, I focused on YARN mode. I did some testing in local mode, but not a full 
test. I haven't done it on Mesos yet.




[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-06-01 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15311646#comment-15311646
 ] 

Jeff Zhang commented on SPARK-13587:


Thanks [~gbow...@fastmail.co.uk]. In my POC, I implemented it using both conda 
and native virtualenv/pip. Users can choose whichever they want by specifying 
the option.




[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-06-01 Thread Greg Bowyer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15310881#comment-15310881
 ] 

Greg Bowyer commented on SPARK-13587:
-

There is also a small wrapper for this:

http://knit.readthedocs.io/en/latest/

I guess there is no Mesos or local-mode support, but it's a start.




[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-06-01 Thread Greg Bowyer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15310872#comment-15310872
 ] 

Greg Bowyer commented on SPARK-13587:
-

Does this work for people?

https://www.continuum.io/blog/developer-blog/conda-spark

The downsides are:
1. It's conda, which is not quite as standard as virtualenv / pip.
2. Conda will install non-free software (e.g. MKL) - this might be a problem.

However, I think it might also help in that conda bundles dependencies like 
BLAS, LAPACK, and friends, which might save some folks heartache when dealing 
with badly configured clusters.




[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-06-01 Thread Greg Bowyer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15310782#comment-15310782
 ] 

Greg Bowyer commented on SPARK-13587:
-

I have been out of this world for a long time.

The spex extension was designed to get around this, and was designed for a 
world where NFS was being removed from a cluster. Right now it's an ugly, ugly 
hack, but I might be able to spend some time making it less of a hack.

If this were integrated into Spark, would we want to make it transparent, or 
provide a flag to the launcher scripts? I figure a --py-deps=requirements.txt 
might work.




[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-04-10 Thread John Berryman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15234264#comment-15234264
 ] 

John Berryman commented on SPARK-13587:
---

At my work we're using devpi and a home-cooked proxy server to "fake" pip and 
serve up cached wheels. I'm not sure such a solution should be built into 
Spark, but it is at least a POC showing that the virtualenv idea can be made 
fast (after the first build, at least). Really, it wouldn't be hard to set this 
up outside of Spark; you would just need to make {{pip}} actually point to 
{{my_own_specially_wheel_caching_pip}}, as sketched below.
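
One way to point pip at such a caching index without changing any Spark code is 
plain pip configuration (the devpi URL below is a hypothetical internal host):
{code:bash}
# Point pip at an internal devpi/caching index instead of the public PyPI.
# http://devpi.internal:3141/root/pypi/+simple/ is a placeholder URL.
pip install \
  --index-url http://devpi.internal:3141/root/pypi/+simple/ \
  --trusted-host devpi.internal \
  -r requirements.txt

# Or make it the default for every pip invocation on the node:
cat > /etc/pip.conf <<'EOF'
[global]
index-url = http://devpi.internal:3141/root/pypi/+simple/
trusted-host = devpi.internal
EOF
{code}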




[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-30 Thread Mike Sukmanowsky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218659#comment-15218659
 ] 

Mike Sukmanowsky commented on SPARK-13587:
--

That's the (hopefully) beautiful thing about pex.

{noformat}
$ pex numpy pandas -o c_requirements.pex
$ ./c_requirements.pex
Python 2.7.10 (default, Aug  4 2015, 19:54:05)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> import numpy
>>> import pandas
>>> print numpy.__version__
1.11.0
>>> print pandas.__version__
0.18.0
>>>
{noformat}

The catch of course is that the pex file itself would have to be built on a node with the same arch as the Spark workers (i.e. you can't build a pex on Mac OS and ship it to a Linux cluster unless all dependencies are pure Python). To build a platform-agnostic env, we'd have to look at conda.

I know there was an effort to support pex with pyspark 
https://github.com/URXtech/spex but it hasn't seen much activity recently. I 
tried reaching out to the author but got no response.

I could take a shot at adding support for this unless @zjffdu already has plans.

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for user to add third party python packages in 
> pyspark.
> * One way is to using --py-files (suitable for simple dependency, but not 
> suitable for complicated dependency, especially with transitive dependency)
> * Another way is install packages manually on each node (time wasting, and 
> not easy to switch to different environment)
> Python has now 2 different virtualenv implementation. One is native 
> virtualenv another is through conda. This jira is trying to migrate these 2 
> tools to distributed environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-30 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218610#comment-15218610
 ] 

Juliet Hougland commented on SPARK-13587:
-

Being able to ship around pex files like we do .py and .egg files sounds very 
reasonable from a delineation of responsibilities perspective.

I like the idea and would support a change like that. A question/edge case worth working out is how pex files relate to compiled C libs that Python libs may need to link to. I don't know much about pex, but my initial assessment is that it shouldn't be a huge problem. I like this solution.

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for user to add third party python packages in 
> pyspark.
> * One way is to using --py-files (suitable for simple dependency, but not 
> suitable for complicated dependency, especially with transitive dependency)
> * Another way is install packages manually on each node (time wasting, and 
> not easy to switch to different environment)
> Python has now 2 different virtualenv implementation. One is native 
> virtualenv another is through conda. This jira is trying to migrate these 2 
> tools to distributed environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-30 Thread Mike Sukmanowsky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218109#comment-15218109
 ] 

Mike Sukmanowsky commented on SPARK-13587:
--

NFS is a good idea as a workaround for now. It's still a pain to build all the eggs required by a dependency file like requirements.txt unless pyspark supports something like .pex files, which build a self-contained Python executable from that requirements file.

If pex files were supported by pyspark, the daemon and workers would simply use /path/to/file.pex as the Python executable and get a fully baked virtualenv for free, without Spark needing to concern itself with building these files.
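
A rough sketch of that idea, assuming the .pex file has already been shipped to the same path on every node (paths and package names are illustrative):

{code:bash}
# Build a self-contained pex with the job's dependencies, then tell PySpark to
# use it as the worker Python executable (sketch only; assumes the pex is
# present at the same path on every node).
pex numpy pandas -o /opt/envs/deps.pex
PYSPARK_PYTHON=/opt/envs/deps.pex spark-submit my_job.py
{code}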

Twitter has reportedly used this approach to ship distributed Python applications for some time.

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for user to add third party python packages in 
> pyspark.
> * One way is to using --py-files (suitable for simple dependency, but not 
> suitable for complicated dependency, especially with transitive dependency)
> * Another way is install packages manually on each node (time wasting, and 
> not easy to switch to different environment)
> Python has now 2 different virtualenv implementation. One is native 
> virtualenv another is through conda. This jira is trying to migrate these 2 
> tools to distributed environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-28 Thread Dripple (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15214038#comment-15214038
 ] 

Dripple commented on SPARK-13587:
-

[~juliet] can you give details/pointers on the solution you describe?

Thanks.

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for user to add third party python packages in 
> pyspark.
> * One way is to using --py-files (suitable for simple dependency, but not 
> suitable for complicated dependency, especially with transitive dependency)
> * Another way is install packages manually on each node (time wasting, and 
> not easy to switch to different environment)
> Python has now 2 different virtualenv implementation. One is native 
> virtualenv another is through conda. This jira is trying to migrate these 2 
> tools to distributed environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-26 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15212848#comment-15212848
 ] 

Jeff Zhang commented on SPARK-13587:


bq. Have you considered using NFS or Amazon EFS to allow users to create and 
manage their own envs and then mounting those on worker/executor nodes? 
The problem is that most of the time you are not the administrator and don't have permission to do that. It's inefficient to have to ask your administrator to install the environment for you. 

bq. "one alternative to shared mounts is to store the thing in HDFS and use 
something like --files / --archives in Spark. 
Some packages are binary and need to be compiled, and it is not easy to do dependency management this way. 

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for user to add third party python packages in 
> pyspark.
> * One way is to using --py-files (suitable for simple dependency, but not 
> suitable for complicated dependency, especially with transitive dependency)
> * Another way is install packages manually on each node (time wasting, and 
> not easy to switch to different environment)
> Python has now 2 different virtualenv implementation. One is native 
> virtualenv another is through conda. This jira is trying to migrate these 2 
> tools to distributed environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-25 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15212191#comment-15212191
 ] 

Juliet Hougland commented on SPARK-13587:
-

I really do think Spark and PySpark need to stay out of the business of installing anything for people. A generic executable is relatively neutral as to what exactly that executable does, which is good. Spark's scope should be computation/execution, not environment setup and teardown.

Have you considered using NFS or Amazon EFS to allow users to create and manage their own envs and then mounting those on worker/executor nodes? This is an elegant solution that we have seen deployed successfully (many experienced people at Cloudera, like Guru M and Tristan Z, recommend it). Given the description of your problem, I believe it should suit your needs.

As [~vanzin] suggested, "one alternative to shared mounts is to store the thing in HDFS and use something like --files / --archives in Spark. The distribution to new containers is handled by YARN, and Spark would just need some adjustments to find the right executable inside those archives."

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for user to add third party python packages in 
> pyspark.
> * One way is to using --py-files (suitable for simple dependency, but not 
> suitable for complicated dependency, especially with transitive dependency)
> * Another way is install packages manually on each node (time wasting, and 
> not easy to switch to different environment)
> Python has now 2 different virtualenv implementation. One is native 
> virtualenv another is through conda. This jira is trying to migrate these 2 
> tools to distributed environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-18 Thread Mike Sukmanowsky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15201763#comment-15201763
 ] 

Mike Sukmanowsky commented on SPARK-13587:
--

Sorry to bug you [~juliet] - any thoughts? We're currently trying to come up with some kind of workaround for Spark 1.6.0 on Amazon EMR that would allow us to create a conda/virtualenv on YARN nodes before the application runs, but I don't think there's anything we can really do.

Building on my suggestion, it'd probably also be helpful to have a --teardown option added to spark-submit that would allow execution of some script after the Spark application terminates. This way conda/virtualenvs with temporary names could be created and destroyed.

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for user to add third party python packages in 
> pyspark.
> * One way is to using --py-files (suitable for simple dependency, but not 
> suitable for complicated dependency, especially with transitive dependency)
> * Another way is install packages manually on each node (time wasting, and 
> not easy to switch to different environment)
> Python has now 2 different virtualenv implementation. One is native 
> virtualenv another is through conda. This jira is trying to migrate these 2 
> tools to distributed environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-07 Thread Mike Sukmanowsky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184210#comment-15184210
 ] 

Mike Sukmanowsky commented on SPARK-13587:
--

[~juliet] I get the concerns about Spark supporting a complex virtualenv process. My main objection to only supporting something like --pyspark-python is the difficulty we currently face in places like Amazon EMR, but really any Spark cluster where nodes may be added after an application is submitted faces the same issue.

We have a bootstrap script which provisions our EMR nodes with required Python 
dependencies. This approach works alright for a cluster which tends to run very 
few applications, but if we have multiple tenants, this approach quickly gets 
unwieldy. Ideally, Spark applications could be submitted from a master node 
with a user never having to worry about dependency management at the node 
bootstrapping level.

I was thinking that an interesting approach to this problem would be to provide some sort of --bootstrap option to spark-submit which points to any executable that Spark will run, checking for a 0 exit code before continuing to launch the application itself. This script could execute any code, such as creating a virtualenv or conda env and installing requirements. If a non-zero exit code were returned, the Spark application would not launch.

The generalization gets the Spark community away from having to support 
conda/virtualenv eccentricities. Thoughts?
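
For illustration, a bootstrap executable under such an option might be nothing more than a small shell script (hypothetical; no --bootstrap flag exists in spark-submit today):

{code:bash}
#!/usr/bin/env bash
# Hypothetical bootstrap script for the proposed --bootstrap option: create an
# application-scoped env and install its requirements; any failure exits
# non-zero, which would abort the launch of the Spark application.
set -euo pipefail
ENV_DIR="/tmp/app_env_$$"        # illustrative per-application env location
virtualenv "$ENV_DIR"
"$ENV_DIR/bin/pip" install -r requirements.txt
{code}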

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for user to add third party python packages in 
> pyspark.
> * One way is to using --py-files (suitable for simple dependency, but not 
> suitable for complicated dependency, especially with transitive dependency)
> * Another way is install packages manually on each node (time wasting, and 
> not easy to switch to different environment)
> Python has now 2 different virtualenv implementation. One is native 
> virtualenv another is through conda. This jira is trying to migrate these 2 
> tools to distributed environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-03 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179254#comment-15179254
 ] 

Jeff Zhang commented on SPARK-13587:


SPARK-13081 is almost done. But for this ticket, I definitely need your feedback and help; I am not a heavy Python user, so I may miss things that Python programmers care about. 

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for user to add third party python packages in 
> pyspark.
> * One way is to using --py-files (suitable for simple dependency, but not 
> suitable for complicated dependency, especially with transitive dependency)
> * Another way is install packages manually on each node (time wasting, and 
> not easy to switch to different environment)
> Python has now 2 different virtualenv implementation. One is native 
> virtualenv another is through conda. This jira is trying to migrate these 2 
> tools to distributed environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-03 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179253#comment-15179253
 ] 

Juliet Hougland commented on SPARK-13587:
-

That is wonderful. Let me know if you'd like me to help work on it. It has been 
dangling on my mental todo list for a while.

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for user to add third party python packages in 
> pyspark.
> * One way is to using --py-files (suitable for simple dependency, but not 
> suitable for complicated dependency, especially with transitive dependency)
> * Another way is install packages manually on each node (time wasting, and 
> not easy to switch to different environment)
> Python has now 2 different virtualenv implementation. One is native 
> virtualenv another is through conda. This jira is trying to migrate these 2 
> tools to distributed environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-03 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179251#comment-15179251
 ] 

Jeff Zhang commented on SPARK-13587:


Actually, I created this ticket several weeks ago: SPARK-13081 :)

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for user to add third party python packages in 
> pyspark.
> * One way is to using --py-files (suitable for simple dependency, but not 
> suitable for complicated dependency, especially with transitive dependency)
> * Another way is install packages manually on each node (time wasting, and 
> not easy to switch to different environment)
> Python has now 2 different virtualenv implementation. One is native 
> virtualenv another is through conda. This jira is trying to migrate these 2 
> tools to distributed environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-03 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179250#comment-15179250
 ] 

Juliet Hougland commented on SPARK-13587:
-

I made a comment related to this below. TL;DR: I think the suggested --py-env option could be encompassed by an already needed --pyspark_python sort of option.

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for user to add third party python packages in 
> pyspark.
> * One way is to using --py-files (suitable for simple dependency, but not 
> suitable for complicated dependency, especially with transitive dependency)
> * Another way is install packages manually on each node (time wasting, and 
> not easy to switch to different environment)
> Python has now 2 different virtualenv implementation. One is native 
> virtualenv another is through conda. This jira is trying to migrate these 2 
> tools to distributed environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-03 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179247#comment-15179247
 ] 

Juliet Hougland commented on SPARK-13587:
-

Currently the way users specify the workers' Python interpreter is through the PYSPARK_PYTHON env variable. It would be beneficial to users to allow that path to be specified by a CLI flag. That is a current rough edge of using already installed envs on a cluster.

If this were added as a CLI flag, I could see valid options being 'pyspark/python/path', 'venv' (temp virtualenv), and 'conda' (temp conda env), with a second flag required to specify the requirements file. I think it helps prevent an explosion of flags for spark-submit while handling a very important and often changed parameter for a job. What do you think of this?
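
To make the contrast concrete, here is a sketch of the current env-variable approach next to the kind of flag being suggested (the flag syntax is hypothetical; paths are illustrative):

{code:bash}
# Today: point the workers at an already installed interpreter via the env var.
PYSPARK_PYTHON=/opt/envs/myenv/bin/python spark-submit my_job.py

# Suggested: the same thing as a spark-submit flag (hypothetical syntax),
# possibly accepting 'venv'/'conda' values plus a separate requirements flag.
spark-submit --pyspark-python /opt/envs/myenv/bin/python my_job.py
{code}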

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for user to add third party python packages in 
> pyspark.
> * One way is to using --py-files (suitable for simple dependency, but not 
> suitable for complicated dependency, especially with transitive dependency)
> * Another way is install packages manually on each node (time wasting, and 
> not easy to switch to different environment)
> Python has now 2 different virtualenv implementation. One is native 
> virtualenv another is through conda. This jira is trying to migrate these 2 
> tools to distributed environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-03 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179012#comment-15179012
 ] 

Jeff Zhang commented on SPARK-13587:


Thanks for the feedback [~j_houg]. Yes, that's why I make the virtualenv application-scoped. Creating a Python package management tool for a whole cluster would be a big project and too heavy. And what does "--pyspark_python" mean? 

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for user to add third party python packages in 
> pyspark.
> * One way is to using --py-files (suitable for simple dependency, but not 
> suitable for complicated dependency, especially with transitive dependency)
> * Another way is install packages manually on each node (time wasting, and 
> not easy to switch to different environment)
> Python has now 2 different virtualenv implementation. One is native 
> virtualenv another is through conda. This jira is trying to migrate these 2 
> tools to distributed environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-03 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178883#comment-15178883
 ] 

Jeff Zhang commented on SPARK-13587:


Thanks for letting me know this. 

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for user to add third party python packages in 
> pyspark.
> * One way is to using --py-files (suitable for simple dependency, but not 
> suitable for complicated dependency, especially with transitive dependency)
> * Another way is install packages manually on each node (time wasting, and 
> not easy to switch to different environment)
> Python has now 2 different virtualenv implementation. One is native 
> virtualenv another is through conda. This jira is trying to migrate these 2 
> tools to distributed environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-03 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178536#comment-15178536
 ] 

Juliet Hougland commented on SPARK-13587:
-

If pyspark allows users to create virtual environments, users will also want 
and need other features of python environment management on a cluster. I think 
this change would broaden the scope of PySpark to include python package 
management on a cluster. I do not think that spark should be in the business of 
creating python environments. I think the support load in terms of feature 
requests, mailing list traffic, etc would be very large. This feature would 
begin to solve a problem, but would also put us on the hook for many more. 

I agree with the general intention of this JIRA -- make it easier to manage and 
interact with complex python environments on a cluster. Perhaps there are other 
ways to accomplish this without broadening scope and functionality as much. For 
example, checking a requirements file against an environment before execution.

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for user to add third party python packages in 
> pyspark.
> * One way is to using --py-files (suitable for simple dependency, but not 
> suitable for complicated dependency, especially with transitive dependency)
> * Another way is install packages manually on each node (time wasting, and 
> not easy to switch to different environment)
> Python has now 2 different virtualenv implementation. One is native 
> virtualenv another is through conda. This jira is trying to migrate these 2 
> tools to distributed environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-02 Thread Dan Blanchard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175776#comment-15175776
 ] 

Dan Blanchard commented on SPARK-13587:
---

`conda list --export` will omit all pip-installed requirements and only export the conda ones. It is pretty common for people to have conda environments where they've used pip to install things that weren't available through conda.

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for user to add third party python packages in 
> pyspark.
> * One way is to using --py-files (suitable for simple dependency, but not 
> suitable for complicated dependency, especially with transitive dependency)
> * Another way is install packages manually on each node (time wasting, and 
> not easy to switch to different environment)
> Python has now 2 different virtualenv implementation. One is native 
> virtualenv another is through conda. This jira is trying to migrate these 2 
> tools to distributed environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-02 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175715#comment-15175715
 ] 

Jeff Zhang commented on SPARK-13587:


Thanks for the feedback [~dan.blanchard]. In my POC, I use "conda list --export" to create the requirements file, and it seems conda can consume it, although the format is a little different from virtualenv's. 

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for user to add third party python packages in 
> pyspark.
> * One way is to using --py-files (suitable for simple dependency, but not 
> suitable for complicated dependency, especially with transitive dependency)
> * Another way is install packages manually on each node (time wasting, and 
> not easy to switch to different environment)
> Python has now 2 different virtualenv implementation. One is native 
> virtualenv another is through conda. This jira is trying to migrate these 2 
> tools to distributed environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-02 Thread Dan Blanchard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175672#comment-15175672
 ] 

Dan Blanchard commented on SPARK-13587:
---

One thing to note is that conda doesn't use the same requirements file format as virtualenv/pip. You'll want to use [conda env create|http://conda.pydata.org/docs/commands/env/conda-env-create.html], which uses its own separate YAML format. I have a [pull request open to support requirements.txt files|https://github.com/conda/conda-env/pull/172], but it has been waiting for action for some time.
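
For reference, a minimal round trip through that YAML-based flow looks roughly like this (environment and file names are illustrative):

{code:bash}
# Export an existing conda env to conda's own YAML format, then recreate it
# elsewhere from that file (instead of from a pip-style requirements.txt).
conda env export -n myenv > environment.yml
conda env create -f environment.yml
{code}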

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for user to add third party python packages in 
> pyspark.
> * One way is to using --py-files (suitable for simple dependency, but not 
> suitable for complicated dependency, especially with transitive dependency)
> * Another way is install packages manually on each node (time wasting, and 
> not easy to switch to different environment)
> Python has now 2 different virtualenv implementation. One is native 
> virtualenv another is through conda. This jira is trying to migrate these 2 
> tools to distributed environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-02 Thread Mike Sukmanowsky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175646#comment-15175646
 ] 

Mike Sukmanowsky commented on SPARK-13587:
--

Perfect and understood about not wanting to promote these to first-class 
citizens without wider feedback. At the least, I'd say both {{--py-files}} and 
{{--py-venv}} options could be supported if we're concerned about introducing a 
deprecation like this.

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for user to add third party python packages in 
> pyspark.
> * One way is to using --py-files (suitable for simple dependency, but not 
> suitable for complicated dependency, especially with transitive dependency)
> * Another way is install packages manually on each node (time wasting, and 
> not easy to switch to different environment)
> Python has now 2 different virtualenv implementation. One is native 
> virtualenv another is through conda. This jira is trying to migrate these 2 
> tools to distributed environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-01 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175045#comment-15175045
 ] 

Jeff Zhang commented on SPARK-13587:


spark.pyspark.virtualenv.requirements is a local file (which would be distributed to all nodes). Regarding upgrading these to first-class citizens, I would be conservative about that; it needs more feedback from other users. 

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for user to add third party python packages in 
> pyspark.
> * One way is to using --py-files (suitable for simple dependency, but not 
> suitable for complicated dependency, especially with transitive dependency)
> * Another way is install packages manually on each node (time wasting, and 
> not easy to switch to different environment)
> Python has now 2 different virtualenv implementation. One is native 
> virtualenv another is through conda. This jira is trying to migrate these 2 
> tools to distributed environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-01 Thread Mike Sukmanowsky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175018#comment-15175018
 ] 

Mike Sukmanowsky commented on SPARK-13587:
--

One thought that just occurred to me, does 
{{spark.pyspark.virtualenv.requirements}} point to a path on the master node 
for a requirements file? It'd make sense if that was the case and then the 
requirements file was shipped to other nodes instead of assuming that this file 
existed on all Spark nodes at the same location.

It might also be a good idea to upgrade these to first-class citizens of spark-submit by supporting them as optional params instead of config properties. I'd go so far as to say it makes sense to deprecate {{--py-files}} in favour of:

* {{--py-venv-type=conda}}
* {{--py-venv-bin=/path/to/conda}}
* {{--py-venv-requirements=/local/path/to/requirements.txt}}


> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for user to add third party python packages in 
> pyspark.
> * One way is to using --py-files (suitable for simple dependency, but not 
> suitable for complicated dependency, especially with transitive dependency)
> * Another way is install packages manually on each node (time wasting, and 
> not easy to switch to different environment)
> Python has now 2 different virtualenv implementation. One is native 
> virtualenv another is through conda. This jira is trying to migrate these 2 
> tools to distributed environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-01 Thread Mike Sukmanowsky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175010#comment-15175010
 ] 

Mike Sukmanowsky commented on SPARK-13587:
--

Gotcha. I might suggest {{spark.pyspark.virtualenv.bin.path}} in that case.

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for user to add third party python packages in 
> pyspark.
> * One way is to using --py-files (suitable for simple dependency, but not 
> suitable for complicated dependency, especially with transitive dependency)
> * Another way is install packages manually on each node (time wasting, and 
> not easy to switch to different environment)
> Python has now 2 different virtualenv implementation. One is native 
> virtualenv another is through conda. This jira is trying to migrate these 2 
> tools to distributed environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-01 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175003#comment-15175003
 ] 

Jeff Zhang commented on SPARK-13587:


Thanks for your feedback [~msukmanowsky]. spark.pyspark.virtualenv.path is not the path where the virtualenv is created; it is the path to the virtualenv/conda executable used to create the virtualenv (I need to rename it to something clearer to avoid confusion).
In my POC, I create the virtualenv on all the executors, not only the driver. As you said, some Python packages depend on C libraries, so we cannot guarantee they would work if we compiled them on the driver and distributed them to the other nodes.

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for user to add third party python packages in 
> pyspark.
> * One way is to using --py-files (suitable for simple dependency, but not 
> suitable for complicated dependency, especially with transitive dependency)
> * Another way is install packages manually on each node (time wasting, and 
> not easy to switch to different environment)
> Python has now 2 different virtualenv implementation. One is native 
> virtualenv another is through conda. This jira is trying to migrate these 2 
> tools to distributed environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-01 Thread Mike Sukmanowsky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174996#comment-15174996
 ] 

Mike Sukmanowsky commented on SPARK-13587:
--

Thanks for letting me know about this [~jeffzhang].

I think in general, I'm +1 on the proposal.

virtualenvs are the way to go to install requirements and ensure isolation of dependencies between multiple driver scripts. As you noted though, installing hefty requirements like pandas or numpy (assuming you aren't using conda) would add pretty significant startup overhead, which could be amortized if the driver is assumed to run for a long enough period of time. Conda, of course, would pretty well eliminate that problem as it provides pre-compiled binaries for most OSs.

I'd like to offer [PEX|https://pex.readthedocs.org/en/stable/] as an alternative, where spark-submit would build a self-contained virtualenv in a .pex file on the Spark master node and then distribute it to all other nodes. However, it turns out PEX doesn't support editable requirements, and it introduces the assumption that all nodes in a cluster are homogeneous, so that a Python package with C extensions compiled on the master node would run on worker nodes without issue. The latter assumption may be a leap too far for all Spark users.

One thing I'm not entirely sure of is the need for the 
spark.pyspark.virtualenv.path property. If the virtualenv is temporary, why 
would this path ever be specified? Wouldn't a temporary path be used and 
subsequently removed after the Python worker completes?

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for user to add third party python packages in 
> pyspark.
> * One way is to using --py-files (suitable for simple dependency, but not 
> suitable for complicated dependency, especially with transitive dependency)
> * Another way is install packages manually on each node (time wasting, and 
> not easy to switch to different environment)
> Python has now 2 different virtualenv implementation. One is native 
> virtualenv another is through conda. This jira is trying to migrate these 2 
> tools to distributed environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-03-01 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15173590#comment-15173590
 ] 

Sean Owen commented on SPARK-13587:
---

CC [~j_houg] for interest, comment

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for user to add third party python packages in 
> pyspark.
> * One way is to using --py-files (suitable for simple dependency, but not 
> suitable for complicated dependency, especially with transitive dependency)
> * Another way is install packages manually on each node (time wasting, and 
> not easy to switch to different environment)
> Python has now 2 different virtualenv implementation. One is native 
> virtualenv another is through conda. This jira is trying to migrate these 2 
> tools to distributed environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-02-29 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15173228#comment-15173228
 ] 

Jeff Zhang commented on SPARK-13587:


I have implemented a POC for this feature. Here's one simple command showing how to use virtualenv in pyspark:

{code}
bin/spark-submit --master yarn --deploy-mode client --conf 
"spark.pyspark.virtualenv.enabled=true" --conf 
"spark.pyspark.virtualenv.type=conda" --conf 
"spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" 
--conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda"  
~/work/virtualenv/spark.py
{code}

There are 4 properties that need to be set:
* spark.pyspark.virtualenv.enabled (enable virtualenv)
* spark.pyspark.virtualenv.type (native/conda are supported; the default is native)
* spark.pyspark.virtualenv.requirements (requirements file for the dependencies)
* spark.pyspark.virtualenv.path (path to the executable for virtualenv/conda)
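
For comparison, a sketch of the same submit using the native virtualenv type instead of conda (paths and file names are illustrative):

{code}
bin/spark-submit --master yarn --deploy-mode client \
  --conf "spark.pyspark.virtualenv.enabled=true" \
  --conf "spark.pyspark.virtualenv.type=native" \
  --conf "spark.pyspark.virtualenv.requirements=/path/to/requirements.txt" \
  --conf "spark.pyspark.virtualenv.path=/usr/bin/virtualenv" \
  my_script.py
{code}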

Comments and feedback are welcome about how to improve it and whether it's 
valuable for users. 

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for user to add third party python packages in 
> pyspark.
> * One way is to using --py-files (suitable for simple dependency, but not 
> suitable for complicated dependency, especially with transitive dependency)
> * Another way is install packages manually on each node (time wasting, and 
> not easy to switch to different environment)
> Python has now 2 different virtualenv implementation. One is native 
> virtualenv another is through conda. This jira is trying to migrate these 2 
> tools to distributed environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org