[jira] [Comment Edited] (SPARK-13587) Support virtualenv in PySpark

2018-06-14 Thread Matt Mould (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16512433#comment-16512433
 ] 

Matt Mould edited comment on SPARK-13587 at 6/14/18 1:21 PM:
-

What is the current status of this ticket, please? This 
[article|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html]
 suggests that it's done, but it doesn't work for me with the following command.
{code:bash}
spark-submit --deploy-mode cluster --master yarn --py-files 
parallelisation_hack-0.1-py2.7.egg --conf spark.pyspark.virtualenv.enabled=true 
 --conf spark.pyspark.virtualenv.type=native --conf 
spark.pyspark.virtualenv.requirements=requirements.txt --conf 
spark.pyspark.virtualenv.bin.path=virtualenv --conf 
spark.pyspark.python=python3 pyspark_poc_runner.py{code}


was (Author: mattmould):
What is the current status of this ticket please? This 
[article|https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html]
 suggests that it's done, but the it doesn't work for me with the following 
command.
{code:java}
spark-submit --deploy-mode cluster --master yarn --py-files 
parallelisation_hack-0.1-py2.7.egg --conf spark.pyspark.virtualenv.enabled=true 
 --conf spark.pyspark.virtualenv.type=native --conf 
spark.pyspark.virtualenv.requirements=requirements.txt --conf 
spark.pyspark.virtualenv.bin.path=virtualenv --conf 
spark.pyspark.python=python3 pyspark_poc_runner.py{code}

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
>Priority: Major
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> for complicated ones, especially those with transitive dependencies)
> * Another way is to install packages manually on each node (time-consuming, 
> and not easy when switching between environments)
> Python now has two different virtualenv implementations: one is native 
> virtualenv, the other is conda. This JIRA aims to bring these two tools to a 
> distributed environment.
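For context, a rough sketch of the first status-quo workaround described above (bundling pure-Python dependencies with {{--py-files}}); the package and file names are illustrative, not taken from this ticket:
{code}
# Bundle pure-Python dependency packages into a zip and ship it with --py-files
zip -r deps.zip mypkg/ another_pkg/
spark-submit --master yarn --py-files deps.zip my_job.py
{code}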






[jira] [Comment Edited] (SPARK-13587) Support virtualenv in PySpark

2017-10-24 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217221#comment-16217221
 ] 

Semet edited comment on SPARK-13587 at 10/24/17 4:46 PM:
-

Hello. For me, this solution is equivalent to the "Wheelhouse" proposal I made 
(SPARK-16367), even without having to modify PySpark at all. I even think you 
can package a wheelhouse using the {{--archives}} argument.
The drawback is indeed that spark-submit has to send this package to each node 
(1 to n). If PySpark supported the {{requirements.txt}}/{{Pipfile}} dependency 
description formats, each node could download its own dependencies...
The strong argument for a wheelhouse is that it only packages the libraries used 
by the project, not the complete environment. The drawback is that it may not 
work well with Anaconda.
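For illustration, a rough sketch of the wheelhouse-plus-{{--archives}} route, assuming pip and a YARN deployment; the file names and the job script are placeholders, and the offline install step on each node would still have to be wired into the job itself:
{code}
# Build wheels for every dependency into a local wheelhouse (requires pip)
pip wheel -r requirements.txt -w wheelhouse/
zip -r wheelhouse.zip wheelhouse/

# Ship the archive to every YARN container; it is unpacked under the alias "wheels"
spark-submit --master yarn --deploy-mode cluster \
  --archives wheelhouse.zip#wheels \
  my_job.py

# Inside the job, an offline install from the unpacked archive could then be:
#   pip install --no-index --find-links=wheels/wheelhouse -r requirements.txt
{code}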


was (Author: gae...@xeberon.net):
Hello. For me this solution is equivalent with my "Wheelhouse" (SPARK-16367) 
proposal I made, even without having to modify pyspark at all. I even think you 
can package a wheelhouse using this {{--archive}} argument.
The drawback is indeed your spark-submit has to send this package to each node 
(1 to n). If Pyspark supported {{requirements.txt}}/{{Pipfile}} dependencies 
description formats, each node would download by itself the dependencies...

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> for complicated ones, especially those with transitive dependencies)
> * Another way is to install packages manually on each node (time-consuming, 
> and not easy when switching between environments)
> Python now has two different virtualenv implementations: one is native 
> virtualenv, the other is conda. This JIRA aims to bring these two tools to a 
> distributed environment.






[jira] [Comment Edited] (SPARK-13587) Support virtualenv in PySpark

2016-06-29 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355172#comment-15355172
 ] 

Semet edited comment on SPARK-13587 at 6/29/16 12:52 PM:
-

Yes, it looks cool!
Here is what I have in mind; tell me if it is the wrong direction:
- Each job should execute in its own environment.
- I love wheels and wheelhouses. Provided we build all the needed wheels on the 
same kind of machine as the cluster, or we retrieved the right wheels from PyPI, 
pip can install all dependencies with lightning speed, without needing an 
internet connection (no need to configure a proxy in some corporate 
environments, or to maintain an internal mirror, etc.).
- So we deploy the job with a command line such as:
{code}
bin/spark-submit --master $(spark_master) --deploy-mode client --conf 
"spark.pyspark.virtualenv.enabled=true" --conf 
"spark.pyspark.virtualenv.type=native" --conf 
"spark.pyspark.virtualenv.wheelhouse=/path/to/wheelhouse.zip" --conf 
"spark.pyspark.virtualenv.script=script_name" --conf 
"spark.pyspark.virtualenv.args='--opt1 --opt2'"
{code}

So:
- {{wheelhouse.zip}} contains all the wheels to install into a fresh virtualenv. 
No internet connection is needed; the job's own script is also deployed and 
installed, provided it is packaged as a proper module (easy to do with pbr).
- {{spark.pyspark.virtualenv.script}} is the entry point of the script. It 
should be declared in the {{script}} section of {{setup.py}}.
- {{spark.pyspark.virtualenv.args}} allows passing extra arguments to the script.

I don't have much experience with YARN or Mesos; what are the big differences?
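To make the intent concrete, here is a hedged sketch (not an implementation) of what each executor node would effectively do under this proposal; {{my_job}} stands in for the project's package name, and {{script_name}} and {{--opt1 --opt2}} echo the configuration values in the command above:
{code}
# Create a fresh, job-scoped virtualenv
virtualenv job_env

# Install the project and all of its dependencies offline, straight from the
# unpacked wheelhouse.zip -- no internet access or PyPI mirror required
job_env/bin/pip install --no-index --find-links=wheelhouse/ my_job

# Run the console script declared in setup.py, passing the extra arguments
job_env/bin/script_name --opt1 --opt2
{code}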


was (Author: gae...@xeberon.net):
yes it looks cool!
Here is what I have in mind, tell me if it is the wrong direction
- each job should execute in its own environment. 
- I love wheels, and wheelhouse. Providen the fact we build all the needed 
wheels on the same machine as the cluster, of we did retrived the right wheels 
on Pypi, pypi can install all dependencies with lightning speed, without the 
need of an internet connection (have configure the proxy for some corporates, 
or handle an internal mirror, etc).
- so we deploy the job with a command line such as:
  
{code}
bin/spark-submit --master $(spark_master) --deploy-mode client --conf 
"spark.pyspark.virtualenv.enabled=true" --conf 
"spark.pyspark.virtualenv.type=native" --conf 
"spark.pyspark.virtualenv.wheelhouse=/path/to/wheelhouse.zip" --conf 
"spark.pyspark.virtualenv.script=script_name" --conf 
"spark.pyspark.virtualenv.args='--opt1 --opt2'"
{code}

so:
- {{wheelhouse.zip}} contains the whole wheels to install in a fresh 
virtualenv. No internet connection, the script it also deployed and installed, 
provided they go created like a nice module page (so easy to do with pbr)
- {{spark.pyspark.virtualenv.script}} is the execution point of the script. It 
should be declared in the {{script}} section in the {{setup.py}}
- {{spark.pyspark.virtualenv.args}} allows to pass extra arguments to the script

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> for complicated ones, especially those with transitive dependencies)
> * Another way is to install packages manually on each node (time-consuming, 
> and not easy when switching between environments)
> Python now has two different virtualenv implementations: one is native 
> virtualenv, the other is conda. This JIRA aims to bring these two tools to a 
> distributed environment.






[jira] [Comment Edited] (SPARK-13587) Support virtualenv in PySpark

2016-06-10 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15324323#comment-15324323
 ] 

Jeff Zhang edited comment on SPARK-13587 at 6/10/16 12:01 PM:
--

Sorry, guys, I have been busy with other things recently and am late in 
updating this ticket. I just attached the design doc and created the PR. If you 
are interested, please help review the design doc and try the PR. Thanks 
[~gbow...@fastmail.co.uk] [~Dripple] [~juliet] [~msukmanowsky] [~dan.blanchard] 
[~JnBrymn]


was (Author: zjffdu):
Sorry, guys, I am busy on other stuff recently and late for updating this 
ticket.  I just attached the design and doc, and create the PR. If you are 
interested, please help review the design doc and PR, thanks 
[~gbow...@fastmail.co.uk] [~Dripple] [~juliet] [~msukmanowsky] [~dan.blanchard] 
[~JnBrymn]

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> for complicated ones, especially those with transitive dependencies)
> * Another way is to install packages manually on each node (time-consuming, 
> and not easy when switching between environments)
> Python now has two different virtualenv implementations: one is native 
> virtualenv, the other is conda. This JIRA aims to bring these two tools to a 
> distributed environment.






[jira] [Comment Edited] (SPARK-13587) Support virtualenv in PySpark

2016-06-10 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15324323#comment-15324323
 ] 

Jeff Zhang edited comment on SPARK-13587 at 6/10/16 11:43 AM:
--

Sorry, guys, I have been busy with other things recently and am late in 
updating this ticket. I just attached the design doc and created the PR. If you 
are interested, please help review the design doc and the PR. Thanks 
[~gbow...@fastmail.co.uk] [~Dripple] [~juliet] [~msukmanowsky] [~dan.blanchard] 
[~JnBrymn]


was (Author: zjffdu):
Sorry, guys, I am busy on other stuff recently and late for update this ticket. 
 I just attached the design and doc, and create the PR. If you are interested, 
please help review the design doc and PR, thanks [~gbow...@fastmail.co.uk] 
[~Dripple] [~juliet] [~msukmanowsky] [~dan.blanchard] [~JnBrymn]

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> for complicated ones, especially those with transitive dependencies)
> * Another way is to install packages manually on each node (time-consuming, 
> and not easy when switching between environments)
> Python now has two different virtualenv implementations: one is native 
> virtualenv, the other is conda. This JIRA aims to bring these two tools to a 
> distributed environment.






[jira] [Comment Edited] (SPARK-13587) Support virtualenv in PySpark

2016-03-03 Thread Juliet Hougland (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178536#comment-15178536
 ] 

Juliet Hougland edited comment on SPARK-13587 at 3/3/16 9:21 PM:
-

If pyspark allows users to create virtual environments, users will also want 
and need other features of python environment management on a cluster. I think 
this change would broaden the scope of PySpark to include python package 
management on a cluster. I do not think that spark should be in the business of 
creating python environments. I think the support load in terms of feature 
requests, mailing list traffic, etc would be very large. This feature would 
begin to solve a problem, but would also put us on the hook for many more. 

I agree with the general intention of this JIRA -- make it easier to manage and 
interact with complex python environments on a cluster. Perhaps there are other 
ways to accomplish this without broadening scope and functionality as much. For 
example, checking a requirements file against an environment before execution.

Edit: I see now that you are proposing a short-lived virtualenv. My objections 
about the broadening of scope still apply. I generally do not agree with 
suggestions that tightly tie us (and users) to a specific method of Python 
environment management. The loose coupling of Python envs on a cluster to 
PySpark (via a path to an interpreter) is a positive feature. I would much 
rather add --pyspark_python to the CLI tool (and deprecate the env var) than add 
a ton of logic to create environments for users. 
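For reference, a minimal sketch of the existing loose coupling mentioned above, assuming an environment has already been provisioned at the same path on every node; the interpreter path and job script are illustrative:
{code}
# Point PySpark at an interpreter inside a pre-built environment on the nodes
export PYSPARK_PYTHON=/opt/envs/analytics/bin/python
spark-submit --master yarn my_job.py
{code}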


was (Author: juliet):
If pyspark allows users to create virtual environments, users will also want 
and need other features of python environment management on a cluster. I think 
this change would broaden the scope of PySpark to include python package 
management on a cluster. I do not think that spark should be in the business of 
creating python environments. I think the support load in terms of feature 
requests, mailing list traffic, etc would be very large. This feature would 
begin to solve a problem, but would also put us on the hook for many more. 

I agree with the general intention of this JIRA -- make it easier to manage and 
interact with complex python environments on a cluster. Perhaps there are other 
ways to accomplish this without broadening scope and functionality as much. For 
example, checking a requirements file against an environment before execution.

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> for complicated ones, especially those with transitive dependencies)
> * Another way is to install packages manually on each node (time-consuming, 
> and not easy when switching between environments)
> Python now has two different virtualenv implementations: one is native 
> virtualenv, the other is conda. This JIRA aims to bring these two tools to a 
> distributed environment.






[jira] [Comment Edited] (SPARK-13587) Support virtualenv in PySpark

2016-03-02 Thread Mike Sukmanowsky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175646#comment-15175646
 ] 

Mike Sukmanowsky edited comment on SPARK-13587 at 3/2/16 2:19 PM:
--

Perfect and understood about not wanting to promote these to first-class 
citizens without wider feedback. At the least, I'd say both {{--py-files}} and 
{{--py-venv}} options could be supported if we're concerned about introducing a 
deprecation like this.


was (Author: msukmanowsky):
Perfect and understood about not wanting to promote these to first-class 
citizens without wider feedback. At the least, I'd say both {{ --py-files }} 
and {{ --py-venv }} options could be supported if we're concerned about 
introducing a deprecation like this.

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> for complicated ones, especially those with transitive dependencies)
> * Another way is to install packages manually on each node (time-consuming, 
> and not easy when switching between environments)
> Python now has two different virtualenv implementations: one is native 
> virtualenv, the other is conda. This JIRA aims to bring these two tools to a 
> distributed environment.






[jira] [Comment Edited] (SPARK-13587) Support virtualenv in PySpark

2016-03-02 Thread Mike Sukmanowsky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15175646#comment-15175646
 ] 

Mike Sukmanowsky edited comment on SPARK-13587 at 3/2/16 2:19 PM:
--

Perfect and understood about not wanting to promote these to first-class 
citizens without wider feedback. At the least, I'd say both {{ --py-files }} 
and {{ --py-venv }} options could be supported if we're concerned about 
introducing a deprecation like this.


was (Author: msukmanowsky):
Perfect and understood about not wanting to promote these to first-class 
citizens without wider feedback. At the least, I'd say both {{--py-files}} and 
{{--py-venv}} options could be supported if we're concerned about introducing a 
deprecation like this.

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> for complicated ones, especially those with transitive dependencies)
> * Another way is to install packages manually on each node (time-consuming, 
> and not easy when switching between environments)
> Python now has two different virtualenv implementations: one is native 
> virtualenv, the other is conda. This JIRA aims to bring these two tools to a 
> distributed environment.






[jira] [Comment Edited] (SPARK-13587) Support virtualenv in PySpark

2016-03-01 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15173228#comment-15173228
 ] 

Jeff Zhang edited comment on SPARK-13587 at 3/2/16 4:17 AM:


This method creates the virtualenv before the Python worker starts, and the 
virtualenv is application-scoped: after the Spark application finishes, the 
virtualenv is cleaned up. The virtualenvs also don't need to be at the same path 
on each node (in my POC, it is the YARN container working directory). That means 
users don't need to install packages manually on each node (sometimes you can't 
even install packages on the cluster for security reasons). This is the biggest 
benefit and purpose: users can create a virtualenv on demand without touching 
each node, even when they are not administrators. The con is the extra cost of 
installing the required packages before the Python worker starts, but for an 
application that runs for several hours this extra cost is negligible.

I have implemented a POC for this feature. Here's one simple command showing how 
to use virtualenv in pyspark.
{code}
bin/spark-submit --master yarn --deploy-mode client --conf 
"spark.pyspark.virtualenv.enabled=true" --conf 
"spark.pyspark.virtualenv.type=conda" --conf 
"spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" 
--conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda"  
~/work/virtualenv/spark.py
{code}

There are 4 properties that need to be set:
* spark.pyspark.virtualenv.enabled  (flag to enable virtualenv)
* spark.pyspark.virtualenv.type  (native/conda are supported, default is native)
* spark.pyspark.virtualenv.requirements  (requirements file for the dependencies)
* spark.pyspark.virtualenv.path  (path to the virtualenv/conda executable used 
to create the virtualenv)

Comments and feedback are welcome about how to improve it and whether it's 
valuable for users. 
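For comparison, a hedged sketch of the same invocation using the native virtualenv type instead of conda; the requirements file and the path to the virtualenv executable are illustrative:
{code}
bin/spark-submit --master yarn --deploy-mode client \
  --conf "spark.pyspark.virtualenv.enabled=true" \
  --conf "spark.pyspark.virtualenv.type=native" \
  --conf "spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/requirements.txt" \
  --conf "spark.pyspark.virtualenv.path=/usr/local/bin/virtualenv" \
  ~/work/virtualenv/spark.py
{code}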


was (Author: zjffdu):
This method is trying to create virtualenv before python worker start, and this 
virtualenv is application scope, after the spark application job finish, the 
virtualenv will be cleanup. And the virtualenvs don't need to be the same path 
for each node (In my POC, it is the yarn container working directory). So that 
means user don't need to manually install packages on each node (sometimes you 
even can't install packages on cluster due to security reason). This is the 
biggest benefit and purpose that user can create virtualenv on demand without 
touching each node even when you are not administrator.  The cons is the extra 
cost for installing the required packages before starting python worker. But if 
it is an application which will run for several hours then the extra cost can 
be ignored.  

I have implemented POC for this features. Here's one simple command for how to 
use virtualenv in pyspark.
{code}
bin/spark-submit --master yarn --deploy-mode client --conf 
"spark.pyspark.virtualenv.enabled=true" --conf 
"spark.pyspark.virtualenv.type=conda" --conf 
"spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" 
--conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda"  
~/work/virtualenv/spark.py
{code}

There's 4 properties needs to be set 
* spark.pyspark.virtualenv.enabled(enable virtualenv)
* spark.pyspark.virtualenv.type  (default/conda are supported, default is 
native)
* spark.pyspark.virtualenv.requirements  (requirement file for the dependencies)
* spark.pyspark.virtualenv.path  (path to the executable file for for 
virtualenv/conda which is used for creating virutalenv)

Comments and feedback are welcome about how to improve it and whether it's 
valuable for users. 

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> for complicated ones, especially those with transitive dependencies)
> * Another way is to install packages manually on each node (time-consuming, 
> and not easy when switching between environments)
> Python now has two different virtualenv implementations: one is native 
> virtualenv, the other is conda. This JIRA aims to bring these two tools to a 
> distributed environment.






[jira] [Comment Edited] (SPARK-13587) Support virtualenv in PySpark

2016-03-01 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15173228#comment-15173228
 ] 

Jeff Zhang edited comment on SPARK-13587 at 3/2/16 4:12 AM:


This method creates the virtualenv before the Python worker starts, and the 
virtualenv is application-scoped: after the Spark application finishes, the 
virtualenv is cleaned up. The virtualenvs also don't need to be at the same path 
on each node (in my POC, it is the YARN container working directory). That means 
users don't need to install packages manually on each node (sometimes you can't 
even install packages on the cluster for security reasons). This is the biggest 
benefit and purpose: users can create a virtualenv on demand without touching 
each node, even when they are not administrators. The con is the extra cost of 
installing the required packages before the Python worker starts, but for an 
application that runs for several hours this extra cost is negligible.

I have implemented a POC for this feature. Here's one simple command showing how 
to use virtualenv in pyspark.
{code}
bin/spark-submit --master yarn --deploy-mode client --conf 
"spark.pyspark.virtualenv.enabled=true" --conf 
"spark.pyspark.virtualenv.type=conda" --conf 
"spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" 
--conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda"  
~/work/virtualenv/spark.py
{code}

There are 4 properties that need to be set:
* spark.pyspark.virtualenv.enabled  (enables virtualenv)
* spark.pyspark.virtualenv.type  (native/conda are supported, default is native)
* spark.pyspark.virtualenv.requirements  (requirements file for the dependencies)
* spark.pyspark.virtualenv.path  (path to the virtualenv/conda executable used 
to create the virtualenv)

Comments and feedback are welcome about how to improve it and whether it's 
valuable for users. 


was (Author: zjffdu):
This method is trying to create virtualenv before python worker start, and this 
virtualenv is application scope, after the spark application job finish, the 
virtualenv will be cleanup. And the virtualenvs don't need to be the same path 
for each node (In my POC, it is the yarn container working directory). So that 
means user don't need to manually install packages on each node (sometimes you 
even can't install packages on cluster due to security reason). This is the 
biggest benefit and purpose that user can create virtualenv on demand without 
touching each node even when you are not administrator.  The cons is the extra 
cost for installing the required packages before starting python worker. But if 
it is an application which will run for several hours then the extra cost can 
be ignored.  

I have implemented POC for this features. Here's one simple command for how to 
use virtualenv in pyspark.
{code}
bin/spark-submit --master yarn --deploy-mode client --conf 
"spark.pyspark.virtualenv.enabled=true" --conf 
"spark.pyspark.virtualenv.type=conda" --conf 
"spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" 
--conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda"  
~/work/virtualenv/spark.py
{code}

There's 4 properties needs to be set 
* spark.pyspark.virtualenv.enabled(enable virtualenv)
* spark.pyspark.virtualenv.type  (default/conda are supported, default is 
native)
* spark.pyspark.virtualenv.requirements  (requirement file for the dependencies)
* spark.pyspark.virtualenv.path  (path to the executable for for 
virtualenv/conda)

Comments and feedback are welcome about how to improve it and whether it's 
valuable for users. 

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> for complicated ones, especially those with transitive dependencies)
> * Another way is to install packages manually on each node (time-consuming, 
> and not easy when switching between environments)
> Python now has two different virtualenv implementations: one is native 
> virtualenv, the other is conda. This JIRA aims to bring these two tools to a 
> distributed environment.






[jira] [Comment Edited] (SPARK-13587) Support virtualenv in PySpark

2016-03-01 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15173228#comment-15173228
 ] 

Jeff Zhang edited comment on SPARK-13587 at 3/1/16 8:30 AM:


This method creates the virtualenv before the Python worker starts, and the 
virtualenv is application-scoped: after the Spark application finishes, the 
virtualenv is cleaned up. The virtualenvs also don't need to be at the same path 
on each node (in my POC, it is the YARN container working directory). That means 
users don't need to install packages manually on each node (sometimes you can't 
even install packages on the cluster for security reasons). This is the biggest 
benefit and purpose: users can create a virtualenv on demand without touching 
each node, even when they are not administrators. The con is the extra cost of 
installing the required packages before the Python worker starts, but for an 
application that runs for several hours this extra cost is negligible.

I have implemented a POC for this feature. Here's one simple command showing how 
to use virtualenv in pyspark.
{code}
bin/spark-submit --master yarn --deploy-mode client --conf 
"spark.pyspark.virtualenv.enabled=true" --conf 
"spark.pyspark.virtualenv.type=conda" --conf 
"spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" 
--conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda"  
~/work/virtualenv/spark.py
{code}

There are 4 properties that need to be set:
* spark.pyspark.virtualenv.enabled  (enables virtualenv)
* spark.pyspark.virtualenv.type  (native/conda are supported, default is native)
* spark.pyspark.virtualenv.requirements  (requirements file for the dependencies)
* spark.pyspark.virtualenv.path  (path to the virtualenv/conda executable)

Comments and feedback are welcome about how to improve it and whether it's 
valuable for users. 


was (Author: zjffdu):
I have implemented POC for this features. Here's oen simple command for how to 
use virtualenv in pyspark

{code}
bin/spark-submit --master yarn --deploy-mode client --conf 
"spark.pyspark.virtualenv.enabled=true" --conf 
"spark.pyspark.virtualenv.type=conda" --conf 
"spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" 
--conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda"  
~/work/virtualenv/spark.py
{code}

There's 4 properties needs to be set 
* spark.pyspark.virtualenv.enabled(enable virtualenv)
* spark.pyspark.virtualenv.type  (default/conda are supported, default is 
native)
* spark.pyspark.virtualenv.requirements  (requirement file for the dependencies)
* spark.pyspark.virtualenv.path  (path to the executable for for 
virtualenv/conda)

Comments and feedback are welcome about how to improve it and whether it's 
valuable for users. 

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> for complicated ones, especially those with transitive dependencies)
> * Another way is to install packages manually on each node (time-consuming, 
> and not easy when switching between environments)
> Python now has two different virtualenv implementations: one is native 
> virtualenv, the other is conda. This JIRA aims to bring these two tools to a 
> distributed environment.






[jira] [Comment Edited] (SPARK-13587) Support virtualenv in PySpark

2016-02-29 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15173228#comment-15173228
 ] 

Jeff Zhang edited comment on SPARK-13587 at 3/1/16 5:04 AM:


I have implemented a POC for this feature. Here's one simple command showing 
how to use virtualenv in pyspark.

{code}
bin/spark-submit --master yarn --deploy-mode client --conf 
"spark.pyspark.virtualenv.enabled=true" --conf 
"spark.pyspark.virtualenv.type=conda" --conf 
"spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" 
--conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda"  
~/work/virtualenv/spark.py
{code}

There are 4 properties that need to be set:
* spark.pyspark.virtualenv.enabled  (enables virtualenv)
* spark.pyspark.virtualenv.type  (native/conda are supported, default is native)
* spark.pyspark.virtualenv.requirements  (requirements file for the dependencies)
* spark.pyspark.virtualenv.path  (path to the virtualenv/conda executable)

Comments and feedback are welcome about how to improve it and whether it's 
valuable for users. 


was (Author: zjffdu):
I have implemented POC for this features. Here's oen simple command for how to 
execute use virtualenv in pyspark

{code}
bin/spark-submit --master yarn --deploy-mode client --conf 
"spark.pyspark.virtualenv.enabled=true" --conf 
"spark.pyspark.virtualenv.type=conda" --conf 
"spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" 
--conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda"  
~/work/virtualenv/spark.py
{code}

There's 4 properties needs to be set 
* spark.pyspark.virtualenv.enabled(enable virtualenv)
* spark.pyspark.virtualenv.type  (default/conda are supported, default is 
native)
* spark.pyspark.virtualenv.requirements  (requirement file for the dependencies)
* spark.pyspark.virtualenv.path  (path to the executable for for 
virtualenv/conda)

Comments and feedback are welcome about how to improve it and whether it's 
valuable for users. 

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> for complicated ones, especially those with transitive dependencies)
> * Another way is to install packages manually on each node (time-consuming, 
> and not easy when switching between environments)
> Python now has two different virtualenv implementations: one is native 
> virtualenv, the other is conda. This JIRA aims to bring these two tools to a 
> distributed environment.






[jira] [Comment Edited] (SPARK-13587) Support virtualenv in PySpark

2016-02-29 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15173228#comment-15173228
 ] 

Jeff Zhang edited comment on SPARK-13587 at 3/1/16 5:02 AM:


I have implemented a POC for this feature. Here's one simple command showing 
how to use virtualenv in pyspark.

{code}
bin/spark-submit --master yarn --deploy-mode client --conf 
"spark.pyspark.virtualenv.enabled=true" --conf 
"spark.pyspark.virtualenv.type=conda" --conf 
"spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" 
--conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda"  
~/work/virtualenv/spark.py
{code}

There are 4 properties that need to be set:
* spark.pyspark.virtualenv.enabled  (enables virtualenv)
* spark.pyspark.virtualenv.type  (native/conda are supported, default is native)
* spark.pyspark.virtualenv.requirements  (requirements file for the dependencies)
* spark.pyspark.virtualenv.path  (path to the virtualenv/conda executable)

Comments and feedback are welcome about how to improve it and whether it's 
valuable for users. 


was (Author: zjffdu):
I have implemented POC for this features. Here's oen simple command for how to 
execute use virtualenv in pyspark

{code}
bin/spark-submit --master yarn --deploy-mode client --conf 
"spark.pyspark.virtualenv.enabled=true" --conf 
"spark.pyspark.virtualenv.type=conda" --conf 
"spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" 
--conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda"  
~/work/virtualenv/spark.py
{code}

There's 4 property needs to be set 
* spark.pyspark.virtualenv.enabled(enable virtualenv)
* spark.pyspark.virtualenv.type  (default/conda are supported, default is 
native)
* spark.pyspark.virtualenv.requirements  (requirement file for the dependencies)
* spark.pyspark.virtualenv.path  (path to the executable for for 
virtualenv/conda)

Comments and feedback are welcome about how to improve it and whether it's 
valuable for users. 

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> for complicated ones, especially those with transitive dependencies)
> * Another way is to install packages manually on each node (time-consuming, 
> and not easy when switching between environments)
> Python now has two different virtualenv implementations: one is native 
> virtualenv, the other is conda. This JIRA aims to bring these two tools to a 
> distributed environment.


