Re: Support virtualenv in PySpark

2016-03-01 Thread Jeff Zhang
I may not have expressed it clearly. This approach creates the virtualenv
before the python worker starts, and the virtualenv is application-scoped:
after the spark application finishes, the virtualenv is cleaned up. The
virtualenvs also don't need to be at the same path on each node (in my POC it
is the yarn container working directory). That means users don't need to
manually install packages on each node (sometimes you can't even install
packages on the cluster for security reasons). This is the biggest benefit and
the main purpose: users can create a virtualenv on demand without touching
each node, even when they are not administrators. The downside is the extra
cost of installing the required packages before the python worker starts, but
for an application that runs for several hours that cost can be ignored.
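
Roughly, the per-worker bootstrap amounts to something like the following (an
illustrative sketch only, not the actual POC code; the directory and file
names are placeholders):

# sketch: run in the yarn container working directory, before the python worker starts
virtualenv .pyspark_venv                            # create an application-scoped virtualenv
.pyspark_venv/bin/pip install -r requirements.txt   # install the application's dependencies
# the python worker is then launched with .pyspark_venv/bin/python; when the
# application finishes, yarn removes the container working directory, which
# cleans up the virtualenv as well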

On Tue, Mar 1, 2016 at 4:15 PM, Mohannad Ali  wrote:

> Hello Jeff,
>
> Well, this would also mean that you have to manage the same virtualenv
> (same path) on all nodes and install your packages into it, the same way
> you would install packages into the default python path.
>
> In any case, at the moment you can already do what you proposed by
> creating identical virtualenvs on all nodes at the same path and changing
> the spark python path to point to the virtualenv.
>
> Best Regards,
> Mohannad


-- 
Best Regards

Jeff Zhang


Re: Support virtualenv in PySpark

2016-03-01 Thread Mohannad Ali
Hello Jeff,

Well, this would also mean that you have to manage the same virtualenv (same
path) on all nodes and install your packages into it, the same way you would
install packages into the default python path.

In any case, at the moment you can already do what you proposed by creating
identical virtualenvs on all nodes at the same path and changing the spark
python path to point to the virtualenv.
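
For reference, that status-quo approach looks roughly like this (the
virtualenv path is only a placeholder and has to exist at the same location
on every node; spark.executorEnv.* passes PYSPARK_PYTHON to the executors):

export PYSPARK_PYTHON=/opt/venvs/myapp/bin/python   # placeholder: pre-built virtualenv, same path on all nodes
bin/spark-submit --master yarn --deploy-mode client \
  --conf "spark.executorEnv.PYSPARK_PYTHON=/opt/venvs/myapp/bin/python" \
  ~/work/virtualenv/spark.py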

Best Regards,
Mohannad


Support virtualenv in PySpark

2016-02-29 Thread Jeff Zhang
I have created a jira for this feature; comments and feedback are welcome on
how to improve it and whether it's valuable for users.

https://issues.apache.org/jira/browse/SPARK-13587


Here's some background info and the status of this work.


Currently, it's not easy for users to add third-party python packages in
pyspark.

   - One way is to use --py-files (suitable for simple dependencies, but not
   for complicated ones, especially those with transitive dependencies); see
   the example below
   - Another way is to install packages manually on each node (time
   consuming, and not easy to switch between different environments)
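
For reference, the --py-files route looks like this (the archive names are
placeholders):

bin/spark-submit --master yarn --deploy-mode client \
  --py-files deps.zip,extra_package.egg \
  ~/work/virtualenv/spark.py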

Python now has 2 different virtualenv implementations: one is the native
virtualenv, the other is through conda.

I have implemented a POC for this feature. Here's a simple command showing how
to use virtualenv in pyspark:

bin/spark-submit --master yarn --deploy-mode client \
  --conf "spark.pyspark.virtualenv.enabled=true" \
  --conf "spark.pyspark.virtualenv.type=conda" \
  --conf "spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt" \
  --conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda" \
  ~/work/virtualenv/spark.py

There are 4 properties that need to be set:

   - spark.pyspark.virtualenv.enabled (enable virtualenv)
   - spark.pyspark.virtualenv.type (native/conda are supported, default is
   native)
   - spark.pyspark.virtualenv.requirements (requirements file for the
   dependencies)
   - spark.pyspark.virtualenv.path (path to the virtualenv/conda executable)
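
For the native type, the invocation would presumably look the same apart from
the type, requirements file, and executable path, e.g. (the pip-style
requirements file and the virtualenv path below are placeholders, not taken
from the POC):

bin/spark-submit --master yarn --deploy-mode client \
  --conf "spark.pyspark.virtualenv.enabled=true" \
  --conf "spark.pyspark.virtualenv.type=native" \
  --conf "spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/requirements.txt" \
  --conf "spark.pyspark.virtualenv.path=/usr/bin/virtualenv" \
  ~/work/virtualenv/spark.py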






Best Regards

Jeff Zhang