[jira] [Comment Edited] (SPARK-16367) Wheelhouse Support for PySpark

2017-10-24 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217322#comment-16217322
 ] 

Semet edited comment on SPARK-16367 at 10/24/17 5:29 PM:
-

Yes, I don't use it because it is a feature of {{pip}}:
{code}
pip wheel --wheel-dir wheelhouse .
{code}

It is described [in here|https://wheel.readthedocs.io/en/stable/].
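For the consumer side, installing from such a wheelhouse without any network access is equally simple (a sketch only; the package name and wheelhouse directory are illustrative):
{code}
# Install a project and its pinned dependencies purely from the local wheelhouse,
# never contacting pypi.python.org
pip install --no-index --find-links=wheelhouse my_package
{code}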


was (Author: gae...@xeberon.net):
Yes, I don't use it because it is a feature of {{pip}}:
{{code}
pip wheel --wheel-dir wheelhouse .
{{code}}

It is described [in here|https://wheel.readthedocs.io/en/stable/].

> Wheelhouse Support for PySpark
> --
>
> Key: SPARK-16367
> URL: https://issues.apache.org/jira/browse/SPARK-16367
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, PySpark
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Semet
>  Labels: newbie, python, python-wheel, wheelhouse
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> *Rationale* 
> To deploy Scala packages, it is recommended to build big fat jar files. This 
> bundles all dependencies into one package, so the only "cost" is the copy time 
> needed to deploy this file on every Spark node. 
> On the other hand, Python deployment is more difficult once you want to use 
> external packages, and you don't really want to involve IT to deploy the 
> packages into the virtualenv of each node. 
> This ticket proposes to give users the ability to deploy their job as "wheel" 
> packages. The Python community strongly advocates this way of packaging and 
> distributing Python applications as the "standard way of deploying a Python 
> app". In other words, this is the "Pythonic way of deployment".
> *Previous approaches* 
> I based the current proposal on the following two issues related to this 
> point: 
> - SPARK-6764 ("Wheel support for PySpark") 
> - SPARK-13587 ("Support virtualenv in PySpark")
> The first part of my proposal is to merge them, in order to support both wheel 
> installation and virtualenv creation. 
> *Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark* 
> In Python, the packaging standard is now the "wheel" file format, which goes 
> further than the good old ".egg" files. With a wheel file (".whl"), the 
> package is already prepared for a given architecture. You can have several 
> wheels for a given package version, each specific to an architecture or 
> environment. 
> For example, look at https://pypi.python.org/pypi/numpy to see all the 
> different wheels available. 
> The {{pip}} tool knows how to select the right wheel file matching the 
> current system, and how to install the package very quickly (without 
> compilation). Said otherwise, a package that requires compilation of a C 
> module, for instance "numpy", does *not* compile anything when installed 
> from a wheel file. 
> {{pypi.python.org}} already provides wheels for the major Python versions. If 
> the wheel is not available, pip will compile it from source anyway. Mirroring 
> of PyPI is possible through projects such as http://doc.devpi.net/latest/ 
> (untested) or the PyPI mirror support in Artifactory (tested personally). 
> {{pip}} can also easily generate the wheels of all packages used by a given 
> project inside a "virtualenv". This is called a "wheelhouse". You can even 
> skip the compilation and retrieve the wheels directly from pypi.python.org. 
> *Use Case 1: no internet connectivity* 
> Here is my first proposal for a deployment workflow, in the case where the 
> Spark cluster does not have any internet connectivity or access to a PyPI 
> mirror. In this case the simplest way to deploy a project with several 
> dependencies is to build and then send the complete "wheelhouse": 
> - you are writing a PySpark script that keeps growing in size and 
> dependencies. Deploying on Spark, for example, requires building numpy or 
> Theano and other dependencies 
> - to use the "Big Fat Wheelhouse" support of PySpark, you need to turn this 
> script into a standard Python package: 
> -- write a {{requirements.txt}}. I recommend pinning all package versions. 
> You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
> requirements.txt 
> {code} 
> astroid==1.4.6 # via pylint 
> autopep8==1.2.4 
> click==6.6 # via pip-tools 
> colorama==0.3.7 # via pylint 
> enum34==1.1.6 # via hypothesis 
> findspark==1.0.0 # via spark-testing-base 
> first==2.0.1 # via pip-tools 
> hypothesis==3.4.0 # via spark-testing-base 
> lazy-object-proxy==1.2.2 # via astroid 
> linecache2==1.0.0 # via traceback2 
> pbr==1.10.0 
> pep8==1.7.0 # via autopep8 
> pip-tools==1.6.5 
> py==1.4.31 # via pytest 
> pyflakes==1.2.3 
> pylint==1.5.6 
> pytest==2.9.2 # via spark-testing-base 
> six==1.10.0 # via astroid, 

[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark

2017-10-24 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217322#comment-16217322
 ] 

Semet commented on SPARK-16367:
---

Yes, I don't use it because it is a feature of {{pip}}:
{code}
pip wheel --wheel-dir wheelhouse .
{code}

It is described [in here|https://wheel.readthedocs.io/en/stable/].

> Wheelhouse Support for PySpark
> --
>
> Key: SPARK-16367
> URL: https://issues.apache.org/jira/browse/SPARK-16367
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, PySpark
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Semet
>  Labels: newbie, python, python-wheel, wheelhouse
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> *Rationale* 
> To deploy Scala packages, it is recommended to build big fat jar files. This 
> bundles all dependencies into one package, so the only "cost" is the copy time 
> needed to deploy this file on every Spark node. 
> On the other hand, Python deployment is more difficult once you want to use 
> external packages, and you don't really want to involve IT to deploy the 
> packages into the virtualenv of each node. 
> This ticket proposes to give users the ability to deploy their job as "wheel" 
> packages. The Python community strongly advocates this way of packaging and 
> distributing Python applications as the "standard way of deploying a Python 
> app". In other words, this is the "Pythonic way of deployment".
> *Previous approaches* 
> I based the current proposal on the following two issues related to this 
> point: 
> - SPARK-6764 ("Wheel support for PySpark") 
> - SPARK-13587 ("Support virtualenv in PySpark")
> The first part of my proposal is to merge them, in order to support both wheel 
> installation and virtualenv creation. 
> *Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark* 
> In Python, the packaging standard is now the "wheel" file format, which goes 
> further than the good old ".egg" files. With a wheel file (".whl"), the 
> package is already prepared for a given architecture. You can have several 
> wheels for a given package version, each specific to an architecture or 
> environment. 
> For example, look at https://pypi.python.org/pypi/numpy to see all the 
> different wheels available. 
> The {{pip}} tool knows how to select the right wheel file matching the 
> current system, and how to install the package very quickly (without 
> compilation). Said otherwise, a package that requires compilation of a C 
> module, for instance "numpy", does *not* compile anything when installed 
> from a wheel file. 
> {{pypi.python.org}} already provides wheels for the major Python versions. If 
> the wheel is not available, pip will compile it from source anyway. Mirroring 
> of PyPI is possible through projects such as http://doc.devpi.net/latest/ 
> (untested) or the PyPI mirror support in Artifactory (tested personally). 
> {{pip}} can also easily generate the wheels of all packages used by a given 
> project inside a "virtualenv". This is called a "wheelhouse". You can even 
> skip the compilation and retrieve the wheels directly from pypi.python.org. 
> *Use Case 1: no internet connectivity* 
> Here is my first proposal for a deployment workflow, in the case where the 
> Spark cluster does not have any internet connectivity or access to a PyPI 
> mirror. In this case the simplest way to deploy a project with several 
> dependencies is to build and then send the complete "wheelhouse": 
> - you are writing a PySpark script that keeps growing in size and 
> dependencies. Deploying on Spark, for example, requires building numpy or 
> Theano and other dependencies 
> - to use the "Big Fat Wheelhouse" support of PySpark, you need to turn this 
> script into a standard Python package: 
> -- write a {{requirements.txt}}. I recommend pinning all package versions. 
> You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
> requirements.txt 
> {code} 
> astroid==1.4.6 # via pylint 
> autopep8==1.2.4 
> click==6.6 # via pip-tools 
> colorama==0.3.7 # via pylint 
> enum34==1.1.6 # via hypothesis 
> findspark==1.0.0 # via spark-testing-base 
> first==2.0.1 # via pip-tools 
> hypothesis==3.4.0 # via spark-testing-base 
> lazy-object-proxy==1.2.2 # via astroid 
> linecache2==1.0.0 # via traceback2 
> pbr==1.10.0 
> pep8==1.7.0 # via autopep8 
> pip-tools==1.6.5 
> py==1.4.31 # via pytest 
> pyflakes==1.2.3 
> pylint==1.5.6 
> pytest==2.9.2 # via spark-testing-base 
> six==1.10.0 # via astroid, pip-tools, pylint, unittest2 
> spark-testing-base==0.0.7.post2 
> traceback2==1.4.0 # via unittest2 
> unittest2==1.1.0 # via spark-testing-base 
> wheel==0.29.0 
> wrapt==1.10.8 # via astroid 
> {code} 
> -- write a setup.py with some entry points or package. Use 
> 

[jira] [Comment Edited] (SPARK-13587) Support virtualenv in PySpark

2017-10-24 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217221#comment-16217221
 ] 

Semet edited comment on SPARK-13587 at 10/24/17 4:46 PM:
-

Hello. For me this solution is equivalent to the "Wheelhouse" (SPARK-16367) 
proposal I made, even without having to modify PySpark at all. I even think you 
can package a wheelhouse using this {{--archives}} argument.
The drawback is indeed that spark-submit has to send this package to each node 
(1 to n). If PySpark supported the {{requirements.txt}}/{{Pipfile}} dependency 
description formats, each node would download the dependencies by itself...
The strong argument for the wheelhouse is that it only packages the libraries 
used by the project, not the complete environment. The drawback is that it may 
not work well with Anaconda.
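For reference, a rough sketch of what shipping a packed environment through {{--archives}} can look like on YARN (the archive name, its layout and the packing step are assumptions, not something this proposal defines):
{code}
# Assumes the environment has already been packed into environment.tar.gz
# (e.g. with a tool such as conda-pack); spark-submit then distributes it to
# every executor and Python is pointed at the unpacked copy.
bin/spark-submit \
  --master yarn \
  --archives environment.tar.gz#environment \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
  --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python \
  my_job.py
{code}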


was (Author: gae...@xeberon.net):
Hello. For me this solution is equivalent with my "Wheelhouse" (SPARK-16367) 
proposal I made, even without having to modify pyspark at all. I even think you 
can package a wheelhouse using this {{--archive}} argument.
The drawback is indeed your spark-submit has to send this package to each node 
(1 to n). If Pyspark supported {{requirements.txt}}/{{Pipfile}} dependencies 
description formats, each node would download by itself the dependencies...

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> pyspark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> for complicated ones, especially with transitive dependencies)
> * Another way is to install packages manually on each node (time-wasting, and 
> not easy to switch to a different environment)
> Python now has 2 different virtualenv implementations: one is the native 
> virtualenv, the other is through conda. This JIRA is trying to bring these 2 
> tools to a distributed environment.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2017-10-24 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217221#comment-16217221
 ] 

Semet commented on SPARK-13587:
---

Hello. For me this solution is equivalent to the "Wheelhouse" (SPARK-16367) 
proposal I made, even without having to modify PySpark at all. I even think you 
can package a wheelhouse using this {{--archives}} argument.
The drawback is indeed that spark-submit has to send this package to each node 
(1 to n). If PySpark supported the {{requirements.txt}}/{{Pipfile}} dependency 
description formats, each node would download the dependencies by itself...

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> pyspark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> for complicated ones, especially with transitive dependencies)
> * Another way is to install packages manually on each node (time-wasting, and 
> not easy to switch to a different environment)
> Python now has 2 different virtualenv implementations: one is the native 
> virtualenv, the other is through conda. This JIRA is trying to bring these 2 
> tools to a distributed environment.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark

2017-01-10 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15815466#comment-15815466
 ] 

Semet commented on SPARK-16367:
---

Yes, in our pull request both conda and pip are supported. Wheels allow pip to 
behave like conda, i.e., not recompiling numpy every time.

> Wheelhouse Support for PySpark
> --
>
> Key: SPARK-16367
> URL: https://issues.apache.org/jira/browse/SPARK-16367
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, PySpark
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Semet
>  Labels: newbie, python, python-wheel, wheelhouse
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> *Rationale* 
> To deploy Scala packages, it is recommended to build big fat jar files. This 
> bundles all dependencies into one package, so the only "cost" is the copy time 
> needed to deploy this file on every Spark node. 
> On the other hand, Python deployment is more difficult once you want to use 
> external packages, and you don't really want to involve IT to deploy the 
> packages into the virtualenv of each node. 
> This ticket proposes to give users the ability to deploy their job as "wheel" 
> packages. The Python community strongly advocates this way of packaging and 
> distributing Python applications as the "standard way of deploying a Python 
> app". In other words, this is the "Pythonic way of deployment".
> *Previous approaches* 
> I based the current proposal on the following two issues related to this 
> point: 
> - SPARK-6764 ("Wheel support for PySpark") 
> - SPARK-13587 ("Support virtualenv in PySpark")
> The first part of my proposal is to merge them, in order to support both wheel 
> installation and virtualenv creation. 
> *Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark* 
> In Python, the packaging standard is now the "wheel" file format, which goes 
> further than the good old ".egg" files. With a wheel file (".whl"), the 
> package is already prepared for a given architecture. You can have several 
> wheels for a given package version, each specific to an architecture or 
> environment. 
> For example, look at https://pypi.python.org/pypi/numpy to see all the 
> different wheels available. 
> The {{pip}} tool knows how to select the right wheel file matching the 
> current system, and how to install the package very quickly (without 
> compilation). Said otherwise, a package that requires compilation of a C 
> module, for instance "numpy", does *not* compile anything when installed 
> from a wheel file. 
> {{pypi.python.org}} already provides wheels for the major Python versions. If 
> the wheel is not available, pip will compile it from source anyway. Mirroring 
> of PyPI is possible through projects such as http://doc.devpi.net/latest/ 
> (untested) or the PyPI mirror support in Artifactory (tested personally). 
> {{pip}} can also easily generate the wheels of all packages used by a given 
> project inside a "virtualenv". This is called a "wheelhouse". You can even 
> skip the compilation and retrieve the wheels directly from pypi.python.org. 
> *Use Case 1: no internet connectivity* 
> Here is my first proposal for a deployment workflow, in the case where the 
> Spark cluster does not have any internet connectivity or access to a PyPI 
> mirror. In this case the simplest way to deploy a project with several 
> dependencies is to build and then send the complete "wheelhouse": 
> - you are writing a PySpark script that keeps growing in size and 
> dependencies. Deploying on Spark, for example, requires building numpy or 
> Theano and other dependencies 
> - to use the "Big Fat Wheelhouse" support of PySpark, you need to turn this 
> script into a standard Python package: 
> -- write a {{requirements.txt}}. I recommend pinning all package versions. 
> You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
> requirements.txt 
> {code} 
> astroid==1.4.6 # via pylint 
> autopep8==1.2.4 
> click==6.6 # via pip-tools 
> colorama==0.3.7 # via pylint 
> enum34==1.1.6 # via hypothesis 
> findspark==1.0.0 # via spark-testing-base 
> first==2.0.1 # via pip-tools 
> hypothesis==3.4.0 # via spark-testing-base 
> lazy-object-proxy==1.2.2 # via astroid 
> linecache2==1.0.0 # via traceback2 
> pbr==1.10.0 
> pep8==1.7.0 # via autopep8 
> pip-tools==1.6.5 
> py==1.4.31 # via pytest 
> pyflakes==1.2.3 
> pylint==1.5.6 
> pytest==2.9.2 # via spark-testing-base 
> six==1.10.0 # via astroid, pip-tools, pylint, unittest2 
> spark-testing-base==0.0.7.post2 
> traceback2==1.4.0 # via unittest2 
> unittest2==1.1.0 # via spark-testing-base 
> wheel==0.29.0 
> wrapt==1.10.8 # via astroid 
> {code} 
> -- write a setup.py with some entry points or package. Use 
> 

[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-12-02 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15714622#comment-15714622
 ] 

Semet commented on SPARK-13587:
---

For myself, I share an NFS folder with all the executors. It works because they 
all have the same architecture and distribution.

Frankly, I am starting to be a bit disappointed that there is no enthusiasm, no 
real will to solve this huge hole in PySpark. Dependency management was solved 
years ago in Python with virtualenv in general and with Anaconda in data 
science, but PySpark still plays with the PYTHONPATH, and there is no Spark core 
developer actively involved in helping us integrate such a patch. Dependency 
management for JARs is now handled by {{--packages}}, which automatically 
downloads the files from a remote repository, so why not do the same for Python 
as well? And maybe for R, if available? I even proposed a way to package 
everything in a single zip archive, called a "wheelhouse", so executors might 
not have to download anything.

So please help us raise this concern with the core developers and tell them 
that there are several people interested in solving this issue.
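To illustrate the analogy (the Python flag below is purely hypothetical; it does not exist today):
{code}
# JAR dependencies are already resolved and downloaded automatically:
bin/spark-submit --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.1 my_app.jar

# Hypothetical Python counterpart, where each executor would resolve the
# pinned dependencies itself (NOT an existing option, just the idea):
# bin/spark-submit --py-requirements requirements.txt my_job.py
{code}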

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for users to add third-party Python packages in 
> pyspark.
> * One way is to use --py-files (suitable for simple dependencies, but not 
> for complicated ones, especially with transitive dependencies)
> * Another way is to install packages manually on each node (time-wasting, and 
> not easy to switch to a different environment)
> Python now has 2 different virtualenv implementations: one is the native 
> virtualenv, the other is through conda. This JIRA is trying to bring these 2 
> tools to a distributed environment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5864) support .jar as python package

2016-10-08 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557842#comment-15557842
 ] 

Semet commented on SPARK-5864:
--

Can I recommend adding .whl and .tar.gz to the file pattern? Python source 
distributions (.tar.gz) or wheels (.whl) could also be added to the PYTHONPATH 
to inject Python dependencies for a job.
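A rough sketch of what this would look like (the wheel file name is only illustrative):
{code}
# Works today: ship a zip of pure-Python dependencies with the job
bin/spark-submit --py-files deps.zip my_job.py

# The proposal: accept wheels (and .tar.gz sdists) the same way
# (illustrative only -- not matched by the current file pattern)
# bin/spark-submit --py-files numpy-1.11.1-cp27-cp27mu-manylinux1_x86_64.whl my_job.py
{code}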

> support .jar as python package
> --
>
> Key: SPARK-5864
> URL: https://issues.apache.org/jira/browse/SPARK-5864
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 1.3.0
>
>
> Support .jar files as Python packages (same as .zip or .egg)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-969) Persistent web ui

2016-10-01 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15538162#comment-15538162
 ] 

Semet commented on SPARK-969:
-

I agree that archiving this page for post mortem analysis is helpful.

> Persistent web ui
> -
>
> Key: SPARK-969
> URL: https://issues.apache.org/jira/browse/SPARK-969
> Project: Spark
>  Issue Type: Improvement
>Reporter: Reynold Xin
> Fix For: 1.0.0
>
>
> The Spark application web ui (at port 4040) is extremely helpful for 
> debugging application correctness & performance. However, once the 
> application completes (and thus SparkContext is stopped), the web ui is no 
> longer accessible. It would be great to refactor the UI so stage information 
> (perhaps in JSON format or directly in HTML format) is stored persistently 
> in the file system and can be viewed after the fact.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10057) Faill to load class org.slf4j.impl.StaticLoggerBinder

2016-09-20 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15506972#comment-15506972
 ] 

Semet commented on SPARK-10057:
---

Hello

I confirm we have this issue on Spark 1.6.1 when writing a Parquet file. There 
is some information in [this Stack Overflow 
question|http://stackoverflow.com/questions/33832804/spark-1-5-2-and-slf4j-staticloggerbinder].

> Faill to load class org.slf4j.impl.StaticLoggerBinder
> -
>
> Key: SPARK-10057
> URL: https://issues.apache.org/jira/browse/SPARK-10057
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Davies Liu
>
> Some log messages are dropped, because the class 
> "org.slf4j.impl.StaticLoggerBinder" cannot be loaded:
> {code}
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16367) Wheelhouse Support for PySpark

2016-09-09 Thread Semet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semet updated SPARK-16367:
--
Description: 
*Rationale* 
To deploy Scala packages, it is recommended to build big fat jar files. This 
bundles all dependencies into one package, so the only "cost" is the copy time 
needed to deploy this file on every Spark node. 

On the other hand, Python deployment is more difficult once you want to use 
external packages, and you don't really want to involve IT to deploy the 
packages into the virtualenv of each node. 

This ticket proposes to give users the ability to deploy their job as "wheel" 
packages. The Python community strongly advocates this way of packaging and 
distributing Python applications as the "standard way of deploying a Python 
app". In other words, this is the "Pythonic way of deployment".

*Previous approaches* 
I based the current proposal on the following two issues related to this point: 
- SPARK-6764 ("Wheel support for PySpark") 
- SPARK-13587 ("Support virtualenv in PySpark")

The first part of my proposal is to merge them, in order to support both wheel 
installation and virtualenv creation. 

*Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark* 
In Python, the packaging standard is now the "wheel" file format, which goes 
further than the good old ".egg" files. With a wheel file (".whl"), the package 
is already prepared for a given architecture. You can have several wheels for a 
given package version, each specific to an architecture or environment. 

For example, look at https://pypi.python.org/pypi/numpy to see all the 
different wheels available. 

The {{pip}} tool knows how to select the right wheel file matching the current 
system, and how to install the package very quickly (without compilation). 
Said otherwise, a package that requires compilation of a C module, for instance 
"numpy", does *not* compile anything when installed from a wheel file. 

{{pypi.python.org}} already provides wheels for the major Python versions. If 
the wheel is not available, pip will compile it from source anyway. Mirroring 
of PyPI is possible through projects such as http://doc.devpi.net/latest/ 
(untested) or the PyPI mirror support in Artifactory (tested personally). 

{{pip}} can also easily generate the wheels of all packages used by a given 
project inside a "virtualenv". This is called a "wheelhouse". You can even skip 
the compilation and retrieve the wheels directly from pypi.python.org. 

*Use Case 1: no internet connectivity* 
Here is my first proposal for a deployment workflow, in the case where the 
Spark cluster does not have any internet connectivity or access to a PyPI 
mirror. In this case the simplest way to deploy a project with several 
dependencies is to build and then send the complete "wheelhouse": 

- you are writing a PySpark script that keeps growing in size and 
dependencies. Deploying on Spark, for example, requires building numpy or 
Theano and other dependencies 
- to use the "Big Fat Wheelhouse" support of PySpark, you need to turn this 
script into a standard Python package: 
-- write a {{requirements.txt}}. I recommend pinning all package versions. 
You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
requirements.txt 
{code} 
astroid==1.4.6 # via pylint 
autopep8==1.2.4 
click==6.6 # via pip-tools 
colorama==0.3.7 # via pylint 
enum34==1.1.6 # via hypothesis 
findspark==1.0.0 # via spark-testing-base 
first==2.0.1 # via pip-tools 
hypothesis==3.4.0 # via spark-testing-base 
lazy-object-proxy==1.2.2 # via astroid 
linecache2==1.0.0 # via traceback2 
pbr==1.10.0 
pep8==1.7.0 # via autopep8 
pip-tools==1.6.5 
py==1.4.31 # via pytest 
pyflakes==1.2.3 
pylint==1.5.6 
pytest==2.9.2 # via spark-testing-base 
six==1.10.0 # via astroid, pip-tools, pylint, unittest2 
spark-testing-base==0.0.7.post2 
traceback2==1.4.0 # via unittest2 
unittest2==1.1.0 # via spark-testing-base 
wheel==0.29.0 
wrapt==1.10.8 # via astroid 
{code} 
-- write a setup.py with some entry points or packages. Use 
[PBR|http://docs.openstack.org/developer/pbr/]; it makes the job of maintaining 
a setup.py file really easy 
-- create a virtualenv if you are not already in one: 
{code} 
virtualenv env 
{code} 
-- work in your environment, define the requirements you need in 
{{requirements.txt}}, and do all the {{pip install}} you need 
- create the wheelhouse for your current project 
{code} 
pip install wheel 
pip wheel . --wheel-dir wheelhouse 
{code} 
This can take some time, but at the end you have all the .whl files required 
*for your current system* in a {{wheelhouse}} directory. 
- zip it into a {{wheelhouse.zip}}. 

Note that your own package (for instance 'my_package') can also be built into a 
wheel and thus installed by {{pip}} automatically. 

Now comes the time to submit the project: 
{code} 
bin/spark-submit 

[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark

2016-09-07 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15470328#comment-15470328
 ] 

Semet commented on SPARK-16367:
---

Blog post about PySpark job deployment: 
http://www.great-a-blog.co/wheel-deployment-for-pyspark/

> Wheelhouse Support for PySpark
> --
>
> Key: SPARK-16367
> URL: https://issues.apache.org/jira/browse/SPARK-16367
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, PySpark
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Semet
>  Labels: newbie, python, python-wheel, wheelhouse
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> *Rationale* 
> To deploy Scala packages, it is recommended to build big fat jar files. This 
> bundles all dependencies into one package, so the only "cost" is the copy time 
> needed to deploy this file on every Spark node. 
> On the other hand, Python deployment is more difficult once you want to use 
> external packages, and you don't really want to involve IT to deploy the 
> packages into the virtualenv of each node. 
> This ticket proposes to give users the ability to deploy their job as "wheel" 
> packages. The Python community strongly advocates this way of packaging and 
> distributing Python applications as the "standard way of deploying a Python 
> app". In other words, this is the "Pythonic way of deployment".
> *Previous approaches* 
> I based the current proposal on the following two issues related to this 
> point: 
> - SPARK-6764 ("Wheel support for PySpark") 
> - SPARK-13587 ("Support virtualenv in PySpark")
> The first part of my proposal is to merge them, in order to support both wheel 
> installation and virtualenv creation. 
> *Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark* 
> In Python, the packaging standard is now the "wheel" file format, which goes 
> further than the good old ".egg" files. With a wheel file (".whl"), the 
> package is already prepared for a given architecture. You can have several 
> wheels for a given package version, each specific to an architecture or 
> environment. 
> For example, look at https://pypi.python.org/pypi/numpy to see all the 
> different wheels available. 
> The {{pip}} tool knows how to select the right wheel file matching the 
> current system, and how to install the package very quickly (without 
> compilation). Said otherwise, a package that requires compilation of a C 
> module, for instance "numpy", does *not* compile anything when installed 
> from a wheel file. 
> {{pypi.python.org}} already provides wheels for the major Python versions. If 
> the wheel is not available, pip will compile it from source anyway. Mirroring 
> of PyPI is possible through projects such as http://doc.devpi.net/latest/ 
> (untested) or the PyPI mirror support in Artifactory (tested personally). 
> {{pip}} can also easily generate the wheels of all packages used by a given 
> project inside a "virtualenv". This is called a "wheelhouse". You can even 
> skip the compilation and retrieve the wheels directly from pypi.python.org. 
> *Use Case 1: no internet connectivity* 
> Here is my first proposal for a deployment workflow, in the case where the 
> Spark cluster does not have any internet connectivity or access to a PyPI 
> mirror. In this case the simplest way to deploy a project with several 
> dependencies is to build and then send the complete "wheelhouse": 
> - you are writing a PySpark script that keeps growing in size and 
> dependencies. Deploying on Spark, for example, requires building numpy or 
> Theano and other dependencies 
> - to use the "Big Fat Wheelhouse" support of PySpark, you need to turn this 
> script into a standard Python package: 
> -- write a {{requirements.txt}}. I recommend pinning all package versions. 
> You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
> requirements.txt 
> {code} 
> astroid==1.4.6 # via pylint 
> autopep8==1.2.4 
> click==6.6 # via pip-tools 
> colorama==0.3.7 # via pylint 
> enum34==1.1.6 # via hypothesis 
> findspark==1.0.0 # via spark-testing-base 
> first==2.0.1 # via pip-tools 
> hypothesis==3.4.0 # via spark-testing-base 
> lazy-object-proxy==1.2.2 # via astroid 
> linecache2==1.0.0 # via traceback2 
> pbr==1.10.0 
> pep8==1.7.0 # via autopep8 
> pip-tools==1.6.5 
> py==1.4.31 # via pytest 
> pyflakes==1.2.3 
> pylint==1.5.6 
> pytest==2.9.2 # via spark-testing-base 
> six==1.10.0 # via astroid, pip-tools, pylint, unittest2 
> spark-testing-base==0.0.7.post2 
> traceback2==1.4.0 # via unittest2 
> unittest2==1.1.0 # via spark-testing-base 
> wheel==0.29.0 
> wrapt==1.10.8 # via astroid 
> {code} 
> -- write a setup.py with some entry points or package. Use 
> [PBR|http://docs.openstack.org/developer/pbr/] it makes the jobs of 
> 

[jira] [Commented] (SPARK-5091) Hooks for PySpark tasks

2016-09-06 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15466855#comment-15466855
 ] 

Semet commented on SPARK-5091:
--

It is a better option to use a virtualenv and a proper installation with pip, 
which is more scalable for Python jobs. Manipulating the PYTHONPATH can lead to 
a lot of strange behavior.

See [#14180|https://github.com/apache/spark/pull/14180].
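A minimal sketch of the idea (the environment and file names are only illustrative):
{code}
# Isolate the job's dependencies in a virtualenv instead of hand-editing sys.path
virtualenv job_env
. job_env/bin/activate
pip install -r requirements.txt   # requirements.txt pins the job's dependencies
{code}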

> Hooks for PySpark tasks
> ---
>
> Key: SPARK-5091
> URL: https://issues.apache.org/jira/browse/SPARK-5091
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Davies Liu
>
> Currently, it's not convenient to add a package on the executor to PYTHONPATH 
> (we do not assume the environments of the driver and executor are identical). 
> It would be nice to have a hook called before/after every task; then users 
> could manipulate sys.path via pre-task hooks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17360) PySpark can create dataframe from a Python generator

2016-09-01 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15455415#comment-15455415
 ] 

Semet commented on SPARK-17360:
---

Looks like it works straight away, without modifying the code; however, I 
propose a minor optimization in my PR to avoid useless loops.

> PySpark can create dataframe from a Python generator
> 
>
> Key: SPARK-17360
> URL: https://issues.apache.org/jira/browse/SPARK-17360
> Project: Spark
>  Issue Type: Improvement
>Reporter: Semet
>Priority: Trivial
>
> It looks like one can create a dataframe from a Python generator, which might 
> be more efficient than creating the list of rows and using createDataFrame:
> {code}
> >>> # On Python 3, you want to use "range" on the following line
> >>> d = ({'name': 'Alice-{}'.format(i), 'age': i} for i in xrange(0, 
> >>> 1000))
> >>> d  # Please note that 'd' is a generator and not a structure with the 
> >>> 1000 elements.
> <generator object <genexpr> at 0x7f1234b92af0>
> >>> sqlContext.createDataFrame(d).take(5)
> [Row(age=1, name=u'Alice-1')]
> [Row(age=2, name=u'Alice-2')]
> [Row(age=3, name=u'Alice-3')]
> [Row(age=4, name=u'Alice-4')]
> [Row(age=5, name=u'Alice-5')]
> {code}
> Looking at the code, there is nothing important to change, only docs 
> and unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17360) PySpark can create dataframe from a Python generator

2016-09-01 Thread Semet (JIRA)
Semet created SPARK-17360:
-

 Summary: PySpark can create dataframe from a Python generator
 Key: SPARK-17360
 URL: https://issues.apache.org/jira/browse/SPARK-17360
 Project: Spark
  Issue Type: Improvement
Reporter: Semet
Priority: Trivial


It looks like one can create a dataframe from a Python generator, which might 
be more efficient than creating the list of rows and using createDataFrame:

{code}
>>> # On Python 3, you want to use "range" on the following line
>>> d = ({'name': 'Alice-{}'.format(i), 'age': i} for i in xrange(0, 1000))
>>> d  # Please note that 'd' is a generator and not a structure with the 
>>> 1000 elements.
<generator object <genexpr> at 0x7f1234b92af0>
>>> sqlContext.createDataFrame(d).take(5)
[Row(age=1, name=u'Alice-1')]
[Row(age=2, name=u'Alice-2')]
[Row(age=3, name=u'Alice-3')]
[Row(age=4, name=u'Alice-4')]
[Row(age=5, name=u'Alice-5')]
{code}

Looking at the code, there is nothing important to change, only docs and unit 
tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16367) Wheelhouse Support for PySpark

2016-08-22 Thread Semet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semet updated SPARK-16367:
--
Description: 
*Rationale* 
To deploy Scala packages, it is recommended to build big fat jar files. This 
bundles all dependencies into one package, so the only "cost" is the copy time 
needed to deploy this file on every Spark node. 

On the other hand, Python deployment is more difficult once you want to use 
external packages, and you don't really want to involve IT to deploy the 
packages into the virtualenv of each node. 

This ticket proposes to give users the ability to deploy their job as "wheel" 
packages. The Python community strongly advocates this way of packaging and 
distributing Python applications as the "standard way of deploying a Python 
app". In other words, this is the "Pythonic way of deployment".

*Previous approaches* 
I based the current proposal on the following two issues related to this point: 
- SPARK-6764 ("Wheel support for PySpark") 
- SPARK-13587 ("Support virtualenv in PySpark")

The first part of my proposal is to merge them, in order to support both wheel 
installation and virtualenv creation. 

*Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark* 
In Python, the packaging standard is now the "wheel" file format, which goes 
further than the good old ".egg" files. With a wheel file (".whl"), the package 
is already prepared for a given architecture. You can have several wheels for a 
given package version, each specific to an architecture or environment. 

For example, look at https://pypi.python.org/pypi/numpy to see all the 
different wheels available. 

The {{pip}} tool knows how to select the right wheel file matching the current 
system, and how to install the package very quickly (without compilation). 
Said otherwise, a package that requires compilation of a C module, for instance 
"numpy", does *not* compile anything when installed from a wheel file. 

{{pypi.python.org}} already provides wheels for the major Python versions. If 
the wheel is not available, pip will compile it from source anyway. Mirroring 
of PyPI is possible through projects such as http://doc.devpi.net/latest/ 
(untested) or the PyPI mirror support in Artifactory (tested personally). 

{{pip}} can also easily generate the wheels of all packages used by a given 
project inside a "virtualenv". This is called a "wheelhouse". You can even skip 
the compilation and retrieve the wheels directly from pypi.python.org. 

*Use Case 1: no internet connectivity* 
Here is my first proposal for a deployment workflow, in the case where the 
Spark cluster does not have any internet connectivity or access to a PyPI 
mirror. In this case the simplest way to deploy a project with several 
dependencies is to build and then send the complete "wheelhouse": 

- you are writing a PySpark script that keeps growing in size and 
dependencies. Deploying on Spark, for example, requires building numpy or 
Theano and other dependencies 
- to use the "Big Fat Wheelhouse" support of PySpark, you need to turn this 
script into a standard Python package: 
-- write a {{requirements.txt}}. I recommend pinning all package versions. 
You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
requirements.txt 
{code} 
astroid==1.4.6 # via pylint 
autopep8==1.2.4 
click==6.6 # via pip-tools 
colorama==0.3.7 # via pylint 
enum34==1.1.6 # via hypothesis 
findspark==1.0.0 # via spark-testing-base 
first==2.0.1 # via pip-tools 
hypothesis==3.4.0 # via spark-testing-base 
lazy-object-proxy==1.2.2 # via astroid 
linecache2==1.0.0 # via traceback2 
pbr==1.10.0 
pep8==1.7.0 # via autopep8 
pip-tools==1.6.5 
py==1.4.31 # via pytest 
pyflakes==1.2.3 
pylint==1.5.6 
pytest==2.9.2 # via spark-testing-base 
six==1.10.0 # via astroid, pip-tools, pylint, unittest2 
spark-testing-base==0.0.7.post2 
traceback2==1.4.0 # via unittest2 
unittest2==1.1.0 # via spark-testing-base 
wheel==0.29.0 
wrapt==1.10.8 # via astroid 
{code} 
-- write a setup.py with some entry points or packages. Use 
[PBR|http://docs.openstack.org/developer/pbr/]; it makes the job of maintaining 
a setup.py file really easy 
-- create a virtualenv if you are not already in one: 
{code} 
virtualenv env 
{code} 
-- work in your environment, define the requirements you need in 
{{requirements.txt}}, and do all the {{pip install}} you need 
- create the wheelhouse for your current project 
{code} 
pip install wheel 
pip wheel . --wheel-dir wheelhouse 
{code} 
This can take some time, but at the end you have all the .whl files required 
*for your current system* in a {{wheelhouse}} directory. 
- zip it into a {{wheelhouse.zip}}. 

Note that your own package (for instance 'my_package') can also be built into a 
wheel and thus installed by {{pip}} automatically. 

Now comes the time to submit the project: 
{code} 
bin/spark-submit 

[jira] [Commented] (SPARK-5160) Python module in jars

2016-08-11 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15416907#comment-15416907
 ] 

Semet commented on SPARK-5160:
--

Zip files are already supported: just add the zip to {{--py-files}} and it gets 
added to the PYTHONPATH. I am working on wheel support for Spark: SPARK-16367.

> Python module in jars
> -
>
> Key: SPARK-5160
> URL: https://issues.apache.org/jira/browse/SPARK-5160
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, Spark Core
>Reporter: Davies Liu
>
> In order to simplify publishing of Spark packages with Python modules, we 
> could put the Python module into jars (a jar is a zip but with a different 
> extension). Python can import the module in a jar when:
> 1) the module is at the top level of the jar
> 2) the path to the jar is in sys.path
> So, we should put the path of the jar into PYTHONPATH in the driver and 
> executors.
> cc [~pwendell]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16992) Pep8 code style

2016-08-10 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15415127#comment-15415127
 ] 

Semet edited comment on SPARK-16992 at 8/10/16 11:15 AM:
-

For the import statement ordering, this helped us a lot to set up an automatic 
merge tool between the prod and main branches (so patches done in prod get 
integrated into main automatically). At the very least, it enforces the sorting 
of imports.

I am pretty strict about code style, and Python provides so many tools to 
check and automate code-style formatting. autopep8 does a pretty good job.

If you agree, I can also put it in a post-commit hook that automatically fixes 
the code.


was (Author: gae...@xeberon.net):
For the import statement ordering, this helped us a lot to set up an automatic 
merge tool between the prod and main branches (so patches done in prod get 
integrated into main automatically). At the very least, it enforces the sorting 
of imports.

I am pretty strict about code style, and Python provides so many tools to 
check and automate code-style formatting. autopep8 does a pretty good job.

> Pep8 code style
> ---
>
> Key: SPARK-16992
> URL: https://issues.apache.org/jira/browse/SPARK-16992
> Project: Spark
>  Issue Type: Improvement
>Reporter: Semet
>
> Add code style checks and auto-formatting to the Python code.
> Features:
> - add an {{.editorconfig}} file (Spark's Scala files use 2-space indentation, 
> while Python files use 4) for compatible editors (almost every editor has a 
> plugin supporting the {{.editorconfig}} file)
> - use autopep8 to fix basic pep8 mistakes
> - use isort to automatically sort {{import}} statements and organise them 
> into logically linked order (see its documentation). The most important thing 
> is that it splits import statements that load more than one object into 
> several lines, and it keeps the imports sorted. Said otherwise, for a given 
> module import, the line where it should be added is fixed. This will increase 
> the number of lines in the file, but it greatly facilitates file maintenance 
> and file merges if needed.
> - add a 'validate.sh' script in order to automate the correction (needs isort 
> and autopep8 installed).
> You can see a similar script in production in the 
> [Buildbot|https://github.com/buildbot/buildbot/blob/master/common/validate.sh] 
> project.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16992) Pep8 code style

2016-08-10 Thread Semet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semet updated SPARK-16992:
--
Description: 
Add code style checks and auto-formatting to the Python code.

Features:

- add an {{.editorconfig}} file (Spark's Scala files use 2-space indentation, 
while Python files use 4) for compatible editors (almost every editor has a 
plugin supporting the {{.editorconfig}} file)
- use autopep8 to fix basic pep8 mistakes
- use isort to automatically sort {{import}} statements and organise them into 
logically linked order (see its documentation). The most important thing is 
that it splits import statements that load more than one object into several 
lines, and it keeps the imports sorted. Said otherwise, for a given module 
import, the line where it should be added is fixed. This will increase the 
number of lines in the file, but it greatly facilitates file maintenance and 
file merges if needed.
- add a 'validate.sh' script in order to automate the correction (needs isort 
and autopep8 installed).
You can see a similar script in production in the 
[Buildbot|https://github.com/buildbot/buildbot/blob/master/common/validate.sh] 
project.
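A minimal sketch of what such a validate.sh could be (the {{python/pyspark}} path and the tool options are assumptions, kept deliberately simple):
{code}
#!/bin/bash
set -e
FILES=$(find python/pyspark -name '*.py')
isort $FILES                 # sort and group the import statements
autopep8 --in-place $FILES   # fix basic pep8 violations in place
pep8 $FILES                  # report anything left to fix manually
{code}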



  was:
Add code style checks and auto-formatting to the Python code.

Features:

- add an {{.editorconfig}} file (Spark's Scala files use 2-space indentation, 
while Python files use 4) for compatible editors (almost every editor has a 
plugin supporting the {{.editorconfig}} file)
- use autopep8 to fix basic pep8 mistakes
- use isort to automatically sort {{import}} statements and organise them into 
logically linked order (see its documentation). The most important thing is 
that it splits import statements that load more than one object into several 
lines, and it keeps the imports sorted. Said otherwise, for a given module 
import, the line where it should be added is fixed. This will increase the 
number of lines in the file, but it greatly facilitates file maintenance and 
file merges if needed.
- add a 'validate.sh' script in order to automate the correction (needs isort 
and autopep8 installed).
You can see a similar script in production in the Buildbot project.




> Pep8 code style
> ---
>
> Key: SPARK-16992
> URL: https://issues.apache.org/jira/browse/SPARK-16992
> Project: Spark
>  Issue Type: Improvement
>Reporter: Semet
>
> Add code style checks and auto-formatting to the Python code.
> Features:
> - add an {{.editorconfig}} file (Spark's Scala files use 2-space indentation, 
> while Python files use 4) for compatible editors (almost every editor has a 
> plugin supporting the {{.editorconfig}} file)
> - use autopep8 to fix basic pep8 mistakes
> - use isort to automatically sort {{import}} statements and organise them 
> into logically linked order (see its documentation). The most important thing 
> is that it splits import statements that load more than one object into 
> several lines, and it keeps the imports sorted. Said otherwise, for a given 
> module import, the line where it should be added is fixed. This will increase 
> the number of lines in the file, but it greatly facilitates file maintenance 
> and file merges if needed.
> - add a 'validate.sh' script in order to automate the correction (needs isort 
> and autopep8 installed).
> You can see a similar script in production in the 
> [Buildbot|https://github.com/buildbot/buildbot/blob/master/common/validate.sh] 
> project.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16992) Pep8 code style

2016-08-10 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15415127#comment-15415127
 ] 

Semet commented on SPARK-16992:
---

For the import statement ordering, this helped us a lot to set up an automatic 
merge tool between the prod and main branches (so patches done in prod get 
integrated into main automatically). At the very least, it enforces the sorting 
of imports.

I am pretty strict about code style, and Python provides so many tools to 
check and automate code-style formatting. autopep8 does a pretty good job.

> Pep8 code style
> ---
>
> Key: SPARK-16992
> URL: https://issues.apache.org/jira/browse/SPARK-16992
> Project: Spark
>  Issue Type: Improvement
>Reporter: Semet
>
> Add code style checks and auto-formatting to the Python code.
> Features:
> - add an {{.editorconfig}} file (Spark's Scala files use 2-space indentation, 
> while Python files use 4) for compatible editors (almost every editor has a 
> plugin supporting the {{.editorconfig}} file)
> - use autopep8 to fix basic pep8 mistakes
> - use isort to automatically sort {{import}} statements and organise them 
> into logically linked order (see its documentation). The most important thing 
> is that it splits import statements that load more than one object into 
> several lines, and it keeps the imports sorted. Said otherwise, for a given 
> module import, the line where it should be added is fixed. This will increase 
> the number of lines in the file, but it greatly facilitates file maintenance 
> and file merges if needed.
> - add a 'validate.sh' script in order to automate the correction (needs isort 
> and autopep8 installed).
> You can see a similar script in production in the Buildbot project.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16992) Pep8 code style

2016-08-10 Thread Semet (JIRA)
Semet created SPARK-16992:
-

 Summary: Pep8 code style
 Key: SPARK-16992
 URL: https://issues.apache.org/jira/browse/SPARK-16992
 Project: Spark
  Issue Type: Improvement
Reporter: Semet


Add code style checks and auto-formatting to the Python code.

Features:

- add an {{.editorconfig}} file (Spark's Scala files use 2-space indentation, 
while Python files use 4) for compatible editors (almost every editor has a 
plugin supporting the {{.editorconfig}} file)
- use autopep8 to fix basic pep8 mistakes
- use isort to automatically sort {{import}} statements and organise them into 
logically linked order (see its documentation). The most important thing is 
that it splits import statements that load more than one object into several 
lines, and it keeps the imports sorted. Said otherwise, for a given module 
import, the line where it should be added is fixed. This will increase the 
number of lines in the file, but it greatly facilitates file maintenance and 
file merges if needed.
- add a 'validate.sh' script in order to automate the correction (needs isort 
and autopep8 installed).
You can see a similar script in production in the Buildbot project.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-13 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15375005#comment-15375005
 ] 

Semet edited comment on SPARK-16367 at 7/13/16 1:48 PM:


I have sent a design doc and a pull request. They are deliberately based on 
[#13599|https://github.com/apache/spark/pull/13599] and [design doc from 
SPARK-13587|https://docs.google.com/document/d/1MpURTPv0xLvIWhcJdkc5lDMWYBRJ4zAQ69rP2WA8-TM/edit?usp=sharing].

Pull Request: [#14180|https://github.com/apache/spark/pull/14180]
Design Doc: [Wheel and Virtualenv 
support|https://docs.google.com/document/d/1oXN7c2xE42-MHhuGqt_i7oeIjAwBI0E9phoNuSfR5Bs/edit?usp=sharing]


was (Author: gae...@xeberon.net):
I have sent a design doc and pullrequest. They are deliberately based on 
[#13599|https://github.com/apache/spark/pull/13599] and [design doc from 
SPARK-13587|https://docs.google.com/document/d/1MpURTPv0xLvIWhcJdkc5lDMWYBRJ4zAQ69rP2WA8-TM/edit?usp=sharing].

Pull Request: [#14180|https://github.com/apache/spark/pull/14180]

> Wheelhouse Support for PySpark
> --
>
> Key: SPARK-16367
> URL: https://issues.apache.org/jira/browse/SPARK-16367
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, PySpark
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Semet
>  Labels: newbie, python, python-wheel, wheelhouse
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> *Rationale* 
> To deploy Scala packages, it is recommended to build big fat jar files. This 
> bundles all dependencies into one package, so the only "cost" is the copy time 
> needed to deploy this file on every Spark node. 
> On the other hand, Python deployment is more difficult once you want to use 
> external packages, and you don't really want to involve IT to deploy the 
> packages into the virtualenv of each node. 
> *Previous approaches* 
> I based the current proposal on the following two issues related to this 
> point: 
> - SPARK-6764 ("Wheel support for PySpark") 
> - SPARK-13587 ("Support virtualenv in PySpark")
> The first part of my proposal is to merge them, in order to support both wheel 
> installation and virtualenv creation. 
> *Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark* 
> In Python, the packaging standard is now the "wheel" file format, which goes 
> further than the good old ".egg" files. With a wheel file (".whl"), the 
> package is already prepared for a given architecture. You can have several 
> wheels for a given package version, each specific to an architecture or 
> environment. 
> For example, look at https://pypi.python.org/pypi/numpy to see all the 
> different wheels available. 
> The {{pip}} tool knows how to select the right wheel file matching the 
> current system, and how to install the package very quickly (without 
> compilation). Said otherwise, a package that requires compilation of a C 
> module, for instance "numpy", does *not* compile anything when installed 
> from a wheel file. 
> {{pypi.python.org}} already provides wheels for the major Python versions. If 
> the wheel is not available, pip will compile it from source anyway. Mirroring 
> of PyPI is possible through projects such as http://doc.devpi.net/latest/ 
> (untested) or the PyPI mirror support in Artifactory (tested personally). 
> {{pip}} can also easily generate the wheels of all packages used by a given 
> project inside a "virtualenv". This is called a "wheelhouse". You can even 
> skip the compilation and retrieve the wheels directly from pypi.python.org. 
> *Use Case 1: no internet connectivity* 
> Here is my first proposal for a deployment workflow, in the case where the 
> Spark cluster does not have any internet connectivity or access to a PyPI 
> mirror. In this case the simplest way to deploy a project with several 
> dependencies is to build and then send the complete "wheelhouse": 
> - you are writing a PySpark script that keeps growing in size and 
> dependencies. Deploying on Spark, for example, requires building numpy or 
> Theano and other dependencies 
> - to use the "Big Fat Wheelhouse" support of PySpark, you need to turn this 
> script into a standard Python package: 
> -- write a {{requirements.txt}}. I recommend pinning all package versions. 
> You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
> requirements.txt 
> {code} 
> astroid==1.4.6 # via pylint 
> autopep8==1.2.4 
> click==6.6 # via pip-tools 
> colorama==0.3.7 # via pylint 
> enum34==1.1.6 # via hypothesis 
> findspark==1.0.0 # via spark-testing-base 
> first==2.0.1 # via pip-tools 
> hypothesis==3.4.0 # via spark-testing-base 
> lazy-object-proxy==1.2.2 # via astroid 
> linecache2==1.0.0 # via traceback2 
> pbr==1.10.0 
> pep8==1.7.0 # via autopep8 
> pip-tools==1.6.5 
> py==1.4.31 # via 


[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-13 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15375005#comment-15375005
 ] 

Semet commented on SPARK-16367:
---

I have sent a design doc and a pull request. They are deliberately based on 
[#13599|https://github.com/apache/spark/pull/13599] and the [design doc from 
SPARK-13587|https://docs.google.com/document/d/1MpURTPv0xLvIWhcJdkc5lDMWYBRJ4zAQ69rP2WA8-TM/edit?usp=sharing].


[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-06 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365599#comment-15365599
 ] 

Semet commented on SPARK-16367:
---

Wheels are tagged by OS, architecture and Python version, and that seems to be 
enough for a wheel compiled on one machine to work on another compatible one. 
{{pip install}} is responsible for finding the right wheel for the requested 
module.

For example, on my machine, when I do a "pip install numpy" there is no 
compilation: pip directly takes the binary wheel from PyPI, so installation is 
fast. But if you have an older version of Python, for instance 2.6, for which 
no wheel is published, pip install will compile the C modules and store the 
resulting wheel in ~/.cache/pip, so future installations will not require 
compilation.

You can even take this wheel and add it to your pypi-local repository on 
Artifactory so that the package becomes available on your PyPI mirror (see the 
Artifactory documentation about PyPI support).
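
As a rough illustration of that workflow (package and directory names are arbitrary):
{code}
# Build the wheel once on a machine that has the required compilers;
# pip also keeps a copy in its cache (~/.cache/pip) for later installs.
pip wheel numpy --wheel-dir wheelhouse

# Reuse the prebuilt wheel elsewhere without compiling, either from a shared
# wheelhouse directory or after uploading it to your PyPI mirror.
pip install --no-index --find-links wheelhouse numpy
{code}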


[jira] [Updated] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-06 Thread Semet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semet updated SPARK-16367:
--
Description: 
*Rationale*
For Scala packages, the recommended approach is to build a big fat jar file. 
All dependencies end up in one package, so the only "cost" is the copy time 
needed to deploy this file to every Spark node.

On the other hand, Python deployment is more difficult once you want to use 
external packages, and you don't really want to involve IT to deploy the 
packages into the virtualenv of each node.

*Previous approaches*
I based the current proposal on the two following tickets related to this point:
- SPARK-6764 ("Wheel support for PySpark")
- SPARK-13587 ("Support virtualenv in PySpark")

The first part of my proposal is to merge them, in order to support both wheel 
installation and virtualenv creation.

*Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark*
In Python, the packaging standard is now the "wheel" file format, which goes 
further than the good old ".egg" files. With a wheel file (".whl"), the package 
is already built for a given architecture. You can have several wheels for a 
given package version, each specific to an architecture or environment.

For example, look at https://pypi.python.org/pypi/numpy to see all the 
different wheels available.

{{pip}} knows how to select the right wheel file for the current system and can 
install it very quickly, without compilation. Said otherwise, a package that 
requires compilation of a C module, for instance "numpy", does *not* compile 
anything when installed from a wheel file.

{{pypi.python.org}} already provides wheels for the major Python versions. If 
no wheel is available, pip compiles the package from source anyway. Mirroring 
of PyPI is possible through projects such as http://doc.devpi.net/latest/ 
(untested) or the PyPI mirror support in Artifactory (tested personally), for 
example in companies with unusual internet proxy settings or if you want to 
keep your Spark cluster off the web.
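
For illustration only, pointing {{pip}} at such a mirror is a matter of overriding the index URL; the URL below is a placeholder for an Artifactory or devpi instance, not a real endpoint:
{code}
# Install from an internal PyPI mirror instead of pypi.python.org.
# The index URL is a placeholder for your own Artifactory/devpi server.
pip install --index-url https://artifactory.example.com/artifactory/api/pypi/pypi-remote/simple numpy
{code}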

{{pip}} also makes it easy to generate the wheels of all packages used by a 
given project inside a "virtualenv". This is called a "wheelhouse". You can 
even skip the compilation entirely and retrieve the wheels directly from 
pypi.python.org.

*Use Case 1: no internet connectivity*
Here is my first proposal for a deployment workflow, for the case where the 
Spark cluster does not have any internet connectivity or access to a PyPI 
mirror. In this case the simplest way to deploy a project with several 
dependencies is to build and then ship the complete "wheelhouse" (a 
consolidated sketch of the build steps is given after the list):

- you are writing a PySpark script that keeps growing in size and dependencies; 
deploying it on Spark requires building numpy or Theano and other dependencies
- to use the "Big Fat Wheelhouse" support of PySpark, you need to turn this 
script into a standard Python package:
-- write a {{requirements.txt}}. I recommend pinning all package versions. You 
can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
requirements.txt
{code} 
astroid==1.4.6 # via pylint 
autopep8==1.2.4 
click==6.6 # via pip-tools 
colorama==0.3.7 # via pylint 
enum34==1.1.6 # via hypothesis 
findspark==1.0.0 # via spark-testing-base 
first==2.0.1 # via pip-tools 
hypothesis==3.4.0 # via spark-testing-base 
lazy-object-proxy==1.2.2 # via astroid 
linecache2==1.0.0 # via traceback2 
pbr==1.10.0 
pep8==1.7.0 # via autopep8 
pip-tools==1.6.5 
py==1.4.31 # via pytest 
pyflakes==1.2.3 
pylint==1.5.6 
pytest==2.9.2 # via spark-testing-base 
six==1.10.0 # via astroid, pip-tools, pylint, unittest2 
spark-testing-base==0.0.7.post2 
traceback2==1.4.0 # via unittest2 
unittest2==1.1.0 # via spark-testing-base 
wheel==0.29.0 
wrapt==1.10.8 # via astroid 
{code} 
-- write a setup.py with some entry points or packages. Use 
[PBR|http://docs.openstack.org/developer/pbr/]; it makes the job of maintaining 
a setup.py file really easy
-- create a virtualenv if you are not already in one:
{code} 
virtualenv env 
{code} 
-- work in your environment, declare the requirements you need in 
{{requirements.txt}}, and do all the {{pip install}} you need
- create the wheelhouse for your current project:
{code} 
pip install wheel 
pip wheel . --wheel-dir wheelhouse 
{code} 
This can take some time, but at the end you have all the .whl files required 
*for your current system* in a {{wheelhouse}} directory.
- zip it into a {{wheelhouse.zip}}.

Note that your own package (for instance 'my_package') can also be built into a 
wheel and so installed by {{pip}} automatically.
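
For clarity, here is a minimal consolidated sketch of the wheelhouse build and packaging steps above (the paths, the archive name and the use of {{zip}} are illustrative assumptions, not the only possible layout):
{code}
# Run inside the project's virtualenv.
# Build every wheel listed in requirements.txt, plus the current project
# itself, into a local "wheelhouse" directory.
pip install wheel
pip wheel -r requirements.txt --wheel-dir wheelhouse
pip wheel . --wheel-dir wheelhouse

# Archive the wheelhouse so it can be shipped to the cluster with --files.
zip -r wheelhouse.zip wheelhouse
{code}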

Now comes the time to submit the project: 
{code} 
bin/spark-submit --master master --deploy-mode client --files 
/path/to/virtualenv/requirements.txt,/path/to/virtualenv/wheelhouse.zip --conf 
"spark.pyspark.virtualenv.enabled=true" ~/path/to/launcher_script.py 
{code} 

You can see that:
- no extra arguments are added to the command line; all configuration goes 
through {{--conf}}

[jira] [Updated] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-06 Thread Semet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semet updated SPARK-16367:
--
Description: 
*Rational* 
Is it recommended, in order to deploying Scala packages written in Scala, to 
build big fat jar files. This allows to have all dependencies on one package so 
the only "cost" is copy time to deploy this file on every Spark Node. 

On the other hand, Python deployment is more difficult once you want to use 
external packages, and you don't really want to mess with the IT to deploy the 
packages on the virtualenv of each nodes. 

*Previous approaches* 
I based the current proposal over the two following bugs related to this point: 
- SPARK-6764 ("Wheel support for PySpark") 
- SPARK-13587("Support virtualenv in PySpark")

First part of my proposal was to merge, in order to support wheels install and 
virtualenv creation 

*Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark* 
In Python, the packaging standard is now the "wheels" file format, which goes 
further that good old ".egg" files. With a wheel file (".whl"), the package is 
already prepared for a given architecture. You can have several wheels for a 
given package version, each specific to an architecture, or environment. 

For example, look at https://pypi.python.org/pypi/numpy all the different 
version of Wheel available. 

The {{pip}} tools knows how to select the right wheel file matching the current 
system, and how to install this package in a light speed (without compilation). 
Said otherwise, package that requires compilation of a C module, for instance 
"numpy", does *not* compile anything when installing from wheel file. 

{{pypi.pypthon.org}} already provided wheels for major python version. It the 
wheel is not available, pip will compile it from source anyway. Mirroring of 
Pypi is possible through projects such as http://doc.devpi.net/latest/ 
(untested) or the Pypi mirror support on Artifactory (tested personnally). 

{{pip}} also provides the ability to generate easily all wheels of all packages 
used for a given project which is inside a "virtualenv". This is called 
"wheelhouse". You can even don't mess with this compilation and retrieve it 
directly from pypi.python.org. 

*Use Case 1: no internet connectivity* 
Here my first proposal for a deployment workflow, in the case where the Spark 
cluster does not have any internet connectivity or access to a Pypi mirror. In 
this case the simplest way to deploy a project with several dependencies is to 
build and then send to complete "wheelhouse": 

- you are writing a PySpark script that increase in term of size and 
dependencies. Deploying on Spark for example requires to build numpy or Theano 
and other dependencies 
- to use "Big Fat Wheelhouse" support of Pyspark, you need to turn his script 
into a standard Python package: 
-- write a {{requirements.txt}}. I recommend to specify all package version. 
You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
requirements.txt 
{code} 
astroid==1.4.6 # via pylint 
autopep8==1.2.4 
click==6.6 # via pip-tools 
colorama==0.3.7 # via pylint 
enum34==1.1.6 # via hypothesis 
findspark==1.0.0 # via spark-testing-base 
first==2.0.1 # via pip-tools 
hypothesis==3.4.0 # via spark-testing-base 
lazy-object-proxy==1.2.2 # via astroid 
linecache2==1.0.0 # via traceback2 
pbr==1.10.0 
pep8==1.7.0 # via autopep8 
pip-tools==1.6.5 
py==1.4.31 # via pytest 
pyflakes==1.2.3 
pylint==1.5.6 
pytest==2.9.2 # via spark-testing-base 
six==1.10.0 # via astroid, pip-tools, pylint, unittest2 
spark-testing-base==0.0.7.post2 
traceback2==1.4.0 # via unittest2 
unittest2==1.1.0 # via spark-testing-base 
wheel==0.29.0 
wrapt==1.10.8 # via astroid 
{code} 
-- write a setup.py with some entry points or package. Use 
[PBR|http://docs.openstack.org/developer/pbr/] it makes the jobs of maitaining 
a setup.py files really easy 
-- create a virtualenv if not already in one: 
{code} 
virtualenv env 
{code} 
-- Work on your environment, define the requirement you need in 
{{requirements.txt}}, do all the {{pip install}} you need. 
- create the wheelhouse for your current project 
{code} 
pip install wheelhouse 
pip wheel . --wheel-dir wheelhouse 
{code} 
This can take some times, but at the end you have all the .whl required *for 
your current system* in a directory {{wheelhouse}}. 
- zip it into a {{wheelhouse.zip}}. 

Note that you can have your own package (for instance 'my_package') be 
generated into a wheel and so installed by {{pip}} automatically. 

Now comes the time to submit the project: 
{code} 
bin/spark-submit --master master --deploy-mode client --files 
/path/to/virtualenv/requirements.txt,/path/to/virtualenv/wheelhouse.zip --conf 
"spark.pyspark.virtualenv.enabled=true" ~/path/to/launcher_script.py 
{code} 

You can see that: 
- no extra argument is add in the command line. All configuration goes through 
{{--conf}} 

[jira] [Updated] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-06 Thread Semet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semet updated SPARK-16367:
--
Description: 
*Rational*
Is it recommended, in order to deploying Scala packages written in Scala, to 
build big fat jar files. This allows to have all dependencies on one package so 
the only "cost" is copy time to deploy this file on every Spark Node.

On the other hand, Python deployment is more difficult once you want to use 
external packages, and you don't really want to mess with the IT to deploy the 
packages on the virtualenv of each nodes.

*Previous approaches*
I based the current proposal over the two following bugs related to this point:
- SPARK-6764 ("Wheel support for PySpark")
- SPARK-13587("Support virtualenv in PySpark")

First part of my proposal was to merge, in order to support wheels install and 
virtualenv creation

*Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark*
In Python, the packaging standard is now the "wheels" file format, which goes 
further that good old ".egg" files. With a wheel file (".whl"), the package is 
already prepared for a given architecture. You can have several wheels for a 
given package version, each specific to an architecture, or environment. 

For example, look at https://pypi.python.org/pypi/numpy all the different 
version of Wheel available.

The {{pip}} tools knows how to select the right wheel file matching the current 
system, and how to install this package in a light speed (without compilation). 
Said otherwise, package that requires compilation of a C module, for instance 
"numpy", does *not* compile anything when installing from wheel file.

{{pypi.pypthon.org}} already provided wheels for major python version. It the 
wheel is not available, pip will compile it from source anyway. Mirroring of 
Pypi is possible through projects such as http://doc.devpi.net/latest/ 
(untested) or the Pypi mirror support on Artifactory (tested personnally), for 
example in companies with a weird internet proxy settings or if you want to 
protect your spark cluster from the web.

{{pip}} also provides the ability to generate easily all wheels of all packages 
used for a given project which is inside a "virtualenv". This is called 
"wheelhouse". You can even don't mess with this compilation and retrieve it 
directly from pypi.python.org.

*Use Case 1: no internet connectivity*
Here my first proposal for a deployment workflow, in the case where the Spark 
cluster does not have any internet connectivity or access to a Pypi mirror. In 
this case the simplest way to deploy a project with several dependencies is to 
build and then send to complete "wheelhouse":

- you are writing a PySpark script that increase in term of size and 
dependencies. Deploying on Spark for example requires to build numpy or Theano 
and other dependencies
- to use "Big Fat Wheelhouse" support of Pyspark, you need to turn his script 
into a standard Python package:
-- write a {{requirements.txt}}. I recommend to specify all package version. 
You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
requirements.txt
{code}
astroid==1.4.6# via pylint
autopep8==1.2.4
click==6.6# via pip-tools
colorama==0.3.7   # via pylint
enum34==1.1.6 # via hypothesis
findspark==1.0.0  # via spark-testing-base
first==2.0.1  # via pip-tools
hypothesis==3.4.0 # via spark-testing-base
lazy-object-proxy==1.2.2  # via astroid
linecache2==1.0.0 # via traceback2
pbr==1.10.0
pep8==1.7.0   # via autopep8
pip-tools==1.6.5
py==1.4.31# via pytest
pyflakes==1.2.3
pylint==1.5.6
pytest==2.9.2 # via spark-testing-base
six==1.10.0   # via astroid, pip-tools, pylint, unittest2
spark-testing-base==0.0.7.post2
traceback2==1.4.0 # via unittest2
unittest2==1.1.0  # via spark-testing-base
wheel==0.29.0
wrapt==1.10.8 # via astroid
{code}
-- write a setup.py with some entry points or package. Use 
[PBR|http://docs.openstack.org/developer/pbr/] it makes the jobs of maitaining 
a setup.py files really easy
-- create a virtualenv if not already in one:
{code}
virtualenv env
{code}
-- Work on your environment, define the requirement you need in 
{{requirements.txt}}, do all the {{pip install}} you need.
- create the wheelhouse for your current project
{code}
pip install wheelhouse
pip wheel . --wheel-dir wheelhouse
{code}
This can take some times, but at the end you have all the .whl required *for 
your current system* in a directory {{wheelhouse}}.
- zip it into a {{wheelhouse.zip}}.

Note that you can have your own package (for instance 'my_package') be 
generated into a wheel and so installed by {{pip}} automatically.

Now comes the time to submit the project:
{code}
bin/spark-submit  --master master --deploy-mode client --files 

[jira] [Updated] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-06 Thread Semet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semet updated SPARK-16367:
--
Description: 
*Rational*
Is it recommended, in order to deploying Scala packages written in Scala, to 
build big fat jar files. This allows to have all dependencies on one package so 
the only "cost" is copy time to deploy this file on every Spark Node.

On the other hand, Python deployment is more difficult once you want to use 
external packages, and you don't really want to mess with the IT to deploy the 
packages on the virtualenv of each nodes.

*Previous approaches*
I based the current proposal over the two following bugs related to this point:
- SPARK-6764 ("Wheel support for PySpark")
- SPARK-13587("Support virtualenv in PySpark")
First part of my proposal was to merge, in order to support wheels install and 
virtualenv creation

*Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark*
In Python, the packaging standard is now the "wheels" file format, which goes 
further that good old ".egg" files. With a wheel file (".whl"), the package is 
already prepared for a given architecture. You can have several wheels for a 
given package version, each specific to an architecture, or environment. 

For example, look at https://pypi.python.org/pypi/numpy all the different 
version of Wheel available.

The {{pip}} tools knows how to select the right wheel file matching the current 
system, and how to install this package in a light speed (without compilation). 
Said otherwise, package that requires compilation of a C module, for instance 
"numpy", does *not* compile anything when installing from wheel file.

{{pypi.pypthon.org}} already provided wheels for major python version. It the 
wheel is not available, pip will compile it from source anyway. Mirroring of 
Pypi is possible through projects such as http://doc.devpi.net/latest/ 
(untested) or the Pypi mirror support on Artifactory (tested personnally), for 
example in companies with a weird internet proxy settings or if you want to 
protect your spark cluster from the web.

{{pip}} also provides the ability to generate easily all wheels of all packages 
used for a given project which is inside a "virtualenv". This is called 
"wheelhouse". You can even don't mess with this compilation and retrieve it 
directly from pypi.python.org.

*Use Case 1: no internet connectivity*
Here my first proposal for a deployment workflow, in the case where the Spark 
cluster does not have any internet connectivity or access to a Pypi mirror. In 
this case the simplest way to deploy a project with several dependencies is to 
build and then send to complete "wheelhouse":

- you are writing a PySpark script that increase in term of size and 
dependencies. Deploying on Spark for example requires to build numpy or Theano 
and other dependencies
- to use "Big Fat Wheelhouse" support of Pyspark, you need to turn his script 
into a standard Python package:
-- write a {{requirements.txt}}. I recommend to specify all package version. 
You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
requirements.txt
{code}
astroid==1.4.6# via pylint
autopep8==1.2.4
click==6.6# via pip-tools
colorama==0.3.7   # via pylint
enum34==1.1.6 # via hypothesis
findspark==1.0.0  # via spark-testing-base
first==2.0.1  # via pip-tools
hypothesis==3.4.0 # via spark-testing-base
lazy-object-proxy==1.2.2  # via astroid
linecache2==1.0.0 # via traceback2
pbr==1.10.0
pep8==1.7.0   # via autopep8
pip-tools==1.6.5
py==1.4.31# via pytest
pyflakes==1.2.3
pylint==1.5.6
pytest==2.9.2 # via spark-testing-base
six==1.10.0   # via astroid, pip-tools, pylint, unittest2
spark-testing-base==0.0.7.post2
traceback2==1.4.0 # via unittest2
unittest2==1.1.0  # via spark-testing-base
wheel==0.29.0
wrapt==1.10.8 # via astroid
{code}
-- write a setup.py with some entry points or package. Use 
[PBR|http://docs.openstack.org/developer/pbr/] it makes the jobs of maitaining 
a setup.py files really easy
-- create a virtualenv if not already in one:
{code}
virtualenv env
{code}
-- Work on your environment, define the requirement you need in 
{{requirements.txt}}, do all the {{pip install}} you need.
- create the wheelhouse for your current project
{code}
pip install wheelhouse
pip wheel . --wheel-dir wheelhouse
{code}
This can take some times, but at the end you have all the .whl required *for 
your current system* in a directory {{wheelhouse}}.
- zip it into a {{wheelhouse.zip}}.

Note that you can have your own package (for instance 'my_package') be 
generated into a wheel and so installed by {{pip}} automatically.

Now comes the time to submit the project:
{code}
bin/spark-submit  --master master --deploy-mode client --files 

[jira] [Updated] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-06 Thread Semet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semet updated SPARK-16367:
--
Description: 
*Rational*
Is it recommended, in order to deploying Scala packages written in Scala, to 
build big fat jar files. This allows to have all dependencies on one package so 
the only "cost" is copy time to deploy this file on every Spark Node.

On the other hand, Python deployment is more difficult once you want to use 
external packages, and you don't really want to mess with the IT to deploy the 
packages on the virtualenv of each nodes.

*Previous approaches*
I based the current proposal over the two following bugs related to this point:
- SPARK-6764 ("Wheel support for PySpark")
- SPARK-13587("Support virtualenv in PySpark")
First part of my proposal was to merge, in order to support wheels install and 
virtualenv creation

*Virtualenv, wheel suppoer and "Uber Fat Wheelhouse" for PySpark*
In Python, the packaging standard is now the "wheels" file format, which goes 
further that good old ".egg" files. With a wheel file (".whl"), the package is 
already prepared for a given architecture. You can have several wheels for a 
given package version, each specific to an architecture, or environment. 

For example, look at https://pypi.python.org/pypi/numpy all the different 
version of Wheel available.

The {{pip}} tools knows how to select the right wheel file matching the current 
system, and how to install this package in a light speed (without compilation). 
Said otherwise, package that requires compilation of a C module, for instance 
"numpy", does *not* compile anything when installing from wheel file.

{{pypi.pypthon.org}} already provided wheels for major python version. It the 
wheel is not available, pip will compile it from source anyway. Mirroring of 
Pypi is possible through projects such as http://doc.devpi.net/latest/ 
(untested) or the Pypi mirror support on Artifactory (tested personnally).

{{pip}} also provides the ability to generate easily all wheels of all packages 
used for a given project which is inside a "virtualenv". This is called 
"wheelhouse". You can even don't mess with this compilation and retrieve it 
directly from pypi.python.org.

*Use Case 1: no internet connectivity*
Here my first proposal for a deployment workflow, in the case where the Spark 
cluster does not have any internet connectivity or access to a Pypi mirror. In 
this case the simplest way to deploy a project with several dependencies is to 
build and then send to complete "wheelhouse":

- you are writing a PySpark script that increase in term of size and 
dependencies. Deploying on Spark for example requires to build numpy or Theano 
and other dependencies
- to use "Big Fat Wheelhouse" support of Pyspark, you need to turn his script 
into a standard Python package:
-- write a {{requirements.txt}}. I recommend to specify all package version. 
You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
requirements.txt
{code}
astroid==1.4.6# via pylint
autopep8==1.2.4
click==6.6# via pip-tools
colorama==0.3.7   # via pylint
enum34==1.1.6 # via hypothesis
findspark==1.0.0  # via spark-testing-base
first==2.0.1  # via pip-tools
hypothesis==3.4.0 # via spark-testing-base
lazy-object-proxy==1.2.2  # via astroid
linecache2==1.0.0 # via traceback2
pbr==1.10.0
pep8==1.7.0   # via autopep8
pip-tools==1.6.5
py==1.4.31# via pytest
pyflakes==1.2.3
pylint==1.5.6
pytest==2.9.2 # via spark-testing-base
six==1.10.0   # via astroid, pip-tools, pylint, unittest2
spark-testing-base==0.0.7.post2
traceback2==1.4.0 # via unittest2
unittest2==1.1.0  # via spark-testing-base
wheel==0.29.0
wrapt==1.10.8 # via astroid
{code}
-- write a setup.py with some entry points or package. Use 
[PBR|http://docs.openstack.org/developer/pbr/] it makes the jobs of maitaining 
a setup.py files really easy
-- create a virtualenv if not already in one:
{code}
virtualenv env
{code}
-- Work on your environment, define the requirement you need in 
{{requirements.txt}}, do all the {{pip install}} you need.
- create the wheelhouse for your current project
{code}
pip install wheelhouse
pip wheel . --wheel-dir wheelhouse
{code}
This can take some times, but at the end you have all the .whl required *for 
your current system* in a directory {{wheelhouse}}.
- zip it into a {{wheelhouse.zip}}.

Note that you can have your own package (for instance 'my_package') be 
generated into a wheel and so installed by {{pip}} automatically.

Now comes the time to submit the project:
{code}
bin/spark-submit  --master master --deploy-mode client --files 
/path/to/virtualenv/requirements.txt,/path/to/virtualenv/wheelhouse.zip --conf 
"spark.pyspark.virtualenv.enabled=true" ~/path/to/launcher_script.py
{code}

[jira] [Updated] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-06 Thread Semet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semet updated SPARK-16367:
--
Description: 
*Rational*
Is it recommended, in order to deploying Scala packages written in Scala, to 
build big fat jar files. This allows to have all dependencies on one package so 
the only "cost" is copy time to deploy this file on every Spark Node.

On the other hand, Python deployment is more difficult once you want to use 
external packages, and you don't really want to mess with the IT to deploy the 
packages on the virtualenv of each nodes.

*Previous approaches*
I based the current proposal over the two following bugs related to this point:
- SPARK-6764 ("Wheel support for PySpark")
- SPARK-13587("Support virtualenv in PySpark")
First part of my proposal was to merge, in order to support wheels install and 
virtualenv creation

*Virtualenv, wheel suppoer and "Uber Fat Wheelhouse" for PySpark*
In Python, the packaging standard is now the "wheels" file format, which goes 
further that good old ".egg" files. With a wheel file (".whl"), the package is 
already prepared for a given architecture. You can have several wheels for a 
given package version, each specific to an architecture, or environment. 

For example, look at https://pypi.python.org/pypi/numpy all the different 
version of Wheel available.

The {{pip}} tools knows how to select the right wheel file matching the current 
system, and how to install this package in a light speed (without compilation). 
Said otherwise, package that requires compilation of a C module, for instance 
"numpy", does *not* compile anything when installing from wheel file.

{{pypi.pypthon.org}} already provided wheels for major python version. It the 
wheel is not available, pip will compile it from source anyway. Mirroring of 
Pypi is possible through projects such as http://doc.devpi.net/latest/ 
(untested) or the Pypi mirror support on Artifactory (tested personnally).

{{pip}} also provides the ability to generate easily all wheels of all packages 
used for a given project which is inside a "virtualenv". This is called 
"wheelhouse". You can even don't mess with this compilation and retrieve it 
directly from pypi.python.org.

*Use Case 1: no internet connectivity*
Here my first proposal for a deployment workflow, in the case where the Spark 
cluster does not have any internet connectivity or access to a Pypi mirror. In 
this case the simplest way to deploy a project with several dependencies is to 
build and then send to complete "wheelhouse":

- you are writing a PySpark script that increase in term of size and 
dependencies. Deploying on Spark for example requires to build numpy or Theano 
and other dependencies
- to use "Big Fat Wheelhouse" support of Pyspark, you need to turn his script 
into a standard Python package:
-- write a {{requirements.txt}}. I recommend to specify all package version. 
You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
requirements.txt
{code}
astroid==1.4.6# via pylint
autopep8==1.2.4
click==6.6# via pip-tools
colorama==0.3.7   # via pylint
enum34==1.1.6 # via hypothesis
findspark==1.0.0  # via spark-testing-base
first==2.0.1  # via pip-tools
hypothesis==3.4.0 # via spark-testing-base
lazy-object-proxy==1.2.2  # via astroid
linecache2==1.0.0 # via traceback2
pbr==1.10.0
pep8==1.7.0   # via autopep8
pip-tools==1.6.5
py==1.4.31# via pytest
pyflakes==1.2.3
pylint==1.5.6
pytest==2.9.2 # via spark-testing-base
six==1.10.0   # via astroid, pip-tools, pylint, unittest2
spark-testing-base==0.0.7.post2
traceback2==1.4.0 # via unittest2
unittest2==1.1.0  # via spark-testing-base
wheel==0.29.0
wrapt==1.10.8 # via astroid
{code}
-- write a setup.py with some entry points or package. Use 
[PBR|http://docs.openstack.org/developer/pbr/] it makes the jobs of maitaining 
a setup.py files really easy
-- create a virtualenv if not already in one:
{code}
virtualenv env
{code}
-- Work on your environment, define the requirement you need in 
{{requirements.txt}}, do all the {{pip install}} you need.
- create the wheelhouse for your current project
{code}
pip install wheelhouse
pip wheel . --wheel-dir wheelhouse
{code}
This can take some times, but at the end you have all the .whl required *for 
your current system* in a directory {{wheelhouse}}.
- zip it into a {{wheelhouse.zip}}.

Note that you can have your own package (for instance 'my_package') be 
generated into a wheel and so installed by {{pip}} automatically.

Now comes the time to submit the project:
{code}
bin/spark-submit  --master master --deploy-mode client --files 
/path/to/virtualenv/requirements.txt,/path/to/virtualenv/wheelhouse.zip --conf 
"spark.pyspark.virtualenv.enabled=true" ~/path/to/launcher_script.py
{code}

[jira] [Updated] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-06 Thread Semet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semet updated SPARK-16367:
--
Description: 
*Rational*
Is it recommended, in order to deploying Scala packages written in Scala, to 
build big fat jar files. This allows to have all dependencies on one package so 
the only "cost" is copy time to deploy this file on every Spark Node.

On the other hand, Python deployment is more difficult once you want to use 
external packages, and you don't really want to mess with the IT to deploy the 
packages on the virtualenv of each nodes.

*Previous approaches*
I based the current proposal over the two following bugs related to this point:
- SPARK-6764 ("Wheel support for PySpark")
- SPARK-13587("Support virtualenv in PySpark")
First part of my proposal was to merge, in order to support wheels install and 
virtualenv creation

*Virtualenv, wheel suppoer and "Uber Fat Wheelhouse" for PySpark*
In Python, the packaging standard is now the "wheels" file format, which goes 
further that good old ".egg" files. With a wheel file (".whl"), the package is 
already prepared for a given architecture. You can have several wheels for a 
given package version, each specific to an architecture, or environment. 

For example, look at https://pypi.python.org/pypi/numpy all the different 
version of Wheel available.

The {{pip}} tools knows how to select the right wheel file matching the current 
system, and how to install this package in a light speed (without compilation). 
Said otherwise, package that requires compilation of a C module, for instance 
"numpy", does *not* compile anything when installing from wheel file.

{{pypi.pypthon.org}} already provided wheels for major python version. It the 
wheel is not available, pip will compile it from source anyway. Mirroring of 
Pypi is possible through projects such as http://doc.devpi.net/latest/ 
(untested) or the Pypi mirror support on Artifactory (tested personnally).

{{pip}} also provides the ability to generate easily all wheels of all packages 
used for a given project which is inside a "virtualenv". This is called 
"wheelhouse". You can even don't mess with this compilation and retrieve it 
directly from pypi.python.org.

*Use Case 1: no internet connectivity*
Here my first proposal for a deployment workflow, in the case where the Spark 
cluster does not have any internet connectivity or access to a Pypi mirror. In 
this case the simplest way to deploy a project with several dependencies is to 
build and then send to complete "wheelhouse":

- you are writing a PySpark script that increase in term of size and 
dependencies. Deploying on Spark for example requires to build numpy or Theano 
and other dependencies
- to use "Big Fat Wheelhouse" support of Pyspark, you need to turn his script 
into a standard Python package:
-- write a {{requirements.txt}}. I recommend to specify all package version. 
You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
requirements.txt
{code}
astroid==1.4.6# via pylint
autopep8==1.2.4
click==6.6# via pip-tools
colorama==0.3.7   # via pylint
enum34==1.1.6 # via hypothesis
findspark==1.0.0  # via spark-testing-base
first==2.0.1  # via pip-tools
hypothesis==3.4.0 # via spark-testing-base
lazy-object-proxy==1.2.2  # via astroid
linecache2==1.0.0 # via traceback2
pbr==1.10.0
pep8==1.7.0   # via autopep8
pip-tools==1.6.5
py==1.4.31# via pytest
pyflakes==1.2.3
pylint==1.5.6
pytest==2.9.2 # via spark-testing-base
six==1.10.0   # via astroid, pip-tools, pylint, unittest2
spark-testing-base==0.0.7.post2
traceback2==1.4.0 # via unittest2
unittest2==1.1.0  # via spark-testing-base
wheel==0.29.0
wrapt==1.10.8 # via astroid
{code}
-- write a setup.py with some entry points or package. Use 
[PBR|http://docs.openstack.org/developer/pbr/] it makes the jobs of maitaining 
a setup.py files really easy
-- create a virtualenv if not already in one:
{code}
virtualenv env
{code}
-- Work on your environment, define the requirement you need in 
{{requirements.txt}}, do all the {{pip install}} you need.
- create the wheelhouse for your current project
{code}
pip install wheelhouse
pip wheel . --wheel-dir wheelhouse
{code}
This can take some times, but at the end you have all the .whl required *for 
your current system* in a directory {{wheelhouse}}.
- zip it into a {{wheelhouse.zip}}.

Note that you can have your own package (for instance 'my_package') be 
generated into a wheel and so installed by {{pip}} automatically.

Now comes the time to submit the project:
{code}
bin/spark-submit  --master master --deploy-mode client --files 
/path/to/virtualenv/requirements.txt,/path/to/virtualenv/wheelhouse.zip --conf 
"spark.pyspark.virtualenv.enabled=true" ~/path/to/launcher_script.py
{code}

[jira] [Updated] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-06 Thread Semet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semet updated SPARK-16367:
--
Description: 
*Rational*
Is it recommended, in order to deploying Scala packages written in Scala, to 
build big fat jar files. This allows to have all dependencies on one package so 
the only "cost" is copy time to deploy this file on every Spark Node.

On the other hand, Python deployment is more difficult once you want to use 
external packages, and you don't really want to mess with the IT to deploy the 
packages on the virtualenv of each nodes.

*Previous approaches*
I based the current proposal over the two following bugs related to this point:
- SPARK-6764 ("Wheel support for PySpark")
- SPARK-13587("Support virtualenv in PySpark")
First part of my proposal was to merge, in order to support wheels install and 
virtualenv creation

*Virtualenv, wheel suppoer and "Uber Fat Wheelhouse" for PySpark*
In Python, the packaging standard is now the "wheels" file format, which goes 
further that good old ".egg" files. With a wheel file (".whl"), the package is 
already prepared for a given architecture. You can have several wheels for a 
given package version, each specific to an architecture, or environment. 

For example, look at https://pypi.python.org/pypi/numpy all the different 
version of Wheel available.

The {{pip}} tools knows how to select the right wheel file matching the current 
system, and how to install this package in a light speed (without compilation). 
Said otherwise, package that requires compilation of a C module, for instance 
"numpy", does *not* compile anything when installing from wheel file.

{{pypi.pypthon.org}} already provided wheels for major python version. It the 
wheel is not available, pip will compile it from source anyway. Mirroring of 
Pypi is possible through projects such as http://doc.devpi.net/latest/ 
(untested) or the Pypi mirror support on Artifactory (tested personnally).

{{pip}} also provides the ability to generate easily all wheels of all packages 
used for a given project which is inside a "virtualenv". This is called 
"wheelhouse". You can even don't mess with this compilation and retrieve it 
directly from pypi.python.org.

*Use Case 1: no internet connectivity*
Here my first proposal for a deployment workflow, in the case where the Spark 
cluster does not have any internet connectivity or access to a Pypi mirror. In 
this case the simplest way to deploy a project with several dependencies is to 
build and then send to complete "wheelhouse":

- you are writing a PySpark script that increase in term of size and 
dependencies. Deploying on Spark for example requires to build numpy or Theano 
and other dependencies
- to use "Big Fat Wheelhouse" support of Pyspark, you need to turn his script 
into a standard Python package:
-- write a {{requirements.txt}}. I recommend to specify all package version. 
You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
requirements.txt
{code}
astroid==1.4.6# via pylint
autopep8==1.2.4
click==6.6# via pip-tools
colorama==0.3.7   # via pylint
enum34==1.1.6 # via hypothesis
findspark==1.0.0  # via spark-testing-base
first==2.0.1  # via pip-tools
hypothesis==3.4.0 # via spark-testing-base
lazy-object-proxy==1.2.2  # via astroid
linecache2==1.0.0 # via traceback2
pbr==1.10.0
pep8==1.7.0   # via autopep8
pip-tools==1.6.5
py==1.4.31# via pytest
pyflakes==1.2.3
pylint==1.5.6
pytest==2.9.2 # via spark-testing-base
six==1.10.0   # via astroid, pip-tools, pylint, unittest2
spark-testing-base==0.0.7.post2
traceback2==1.4.0 # via unittest2
unittest2==1.1.0  # via spark-testing-base
wheel==0.29.0
wrapt==1.10.8 # via astroid
{code}
-- write a setup.py with some entry points or package. Use 
[PBR|http://docs.openstack.org/developer/pbr/] it makes the jobs of maitaining 
a setup.py files really easy
-- create a virtualenv if not already in one:
{code}
virtualenv env
{code}
-- Work on your environment, define the requirement you need in 
{{requirements.txt}}, do all the {{pip install}} you need.
- create the wheelhouse for your current project
{code}
pip install wheelhouse
pip wheel . --wheel-dir wheelhouse
{code}
This can take some times, but at the end you have all the .whl required *for 
your current system* in a directory {{wheelhouse}}.
- zip it into a {{wheelhouse.zip}}.

Note that you can have your own package (for instance 'my_package') be 
generated into a wheel and so installed by {{pip}} automatically.

Now comes the time to submit the project:
{code}
bin/spark-submit  --master master --deploy-mode client --files 
/path/to/virtualenv/requirements.txt,/path/to/virtualenv/wheelhouse.zip --conf 
"spark.pyspark.virtualenv.enabled=true" ~/path/to/launcher_script.py
{code}
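
On the worker side, the intent is that everything can then be installed offline 
from the shipped archive. Conceptually, what the proposed virtualenv support would 
run on each node boils down to something like this sketch (the intent, not the 
actual implementation):
{code}
# Install strictly from the unpacked wheelhouse, without touching the network.
pip install --no-index --find-links=wheelhouse -r requirements.txt
{code}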

[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-06 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363980#comment-15363980
 ] 

Semet commented on SPARK-16367:
---

Description of the bug updated :)

[jira] [Updated] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-06 Thread Semet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semet updated SPARK-16367:
--
Description: 
*Rationale*
It is recommended, in order to deploy packages written in Scala, to build big fat 
jar files. This allows having all dependencies in one package, so the only "cost" 
is the copy time to deploy this file on every Spark node.

On the other hand, Python deployment is more difficult once you want to use 
external packages, and you don't really want to bother the IT team to deploy the 
packages into the virtualenv of each node.

*Previous approaches*
I based the current proposal on the two following tickets related to this point:
- SPARK-6764 ("Wheel support for PySpark")
- SPARK-13587 ("Support virtualenv in PySpark")
The first part of my proposal is to merge them, in order to support wheel 
installation and virtualenv creation.

*Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark*
In Python, the packaging standard is now the "wheel" file format, which goes 
further than the good old ".egg" files. With a wheel file (".whl"), the package is 
already prepared for a given architecture. You can have several wheels for a 
given package version, each specific to an architecture or environment.

For example, look at https://pypi.python.org/pypi/numpy to see all the different 
wheels available for a single release.

The {{pip}} tool knows how to select the right wheel file matching the current 
system, and how to install the package very quickly (without compilation). 
In other words, a package that requires compilation of a C module, for instance 
"numpy", does *not* compile anything when installed from a wheel file.

{{pypi.python.org}} already provides wheels for the major Python versions. If the 
wheel is not available, pip will compile the package from source anyway.

{{pip}} also provides the ability to easily generate the wheels of all packages 
used by a given project inside a "virtualenv". This is called a "wheelhouse". You 
can even skip the compilation entirely and retrieve the wheels directly from 
pypi.python.org.

*Use Case 1: no internet connectivity*
Here is my first proposal for a deployment workflow, for the case where the Spark 
cluster does not have any internet connectivity or access to a PyPI mirror.

- you are writing a PySpark script that keeps growing in size and number of 
dependencies. Deploying it on Spark, for example, requires building numpy or 
Theano and other dependencies
- to use the "Big Fat Wheelhouse" support of PySpark, you need to turn this script 
into a standard Python package:
-- write a {{requirements.txt}}. I recommend pinning every package version. 
You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
requirements.txt
{code}
astroid==1.4.6              # via pylint
autopep8==1.2.4
click==6.6                  # via pip-tools
colorama==0.3.7             # via pylint
enum34==1.1.6               # via hypothesis
findspark==1.0.0            # via spark-testing-base
first==2.0.1                # via pip-tools
hypothesis==3.4.0           # via spark-testing-base
lazy-object-proxy==1.2.2    # via astroid
linecache2==1.0.0           # via traceback2
pbr==1.10.0
pep8==1.7.0                 # via autopep8
pip-tools==1.6.5
py==1.4.31                  # via pytest
pyflakes==1.2.3
pylint==1.5.6
pytest==2.9.2               # via spark-testing-base
six==1.10.0                 # via astroid, pip-tools, pylint, unittest2
spark-testing-base==0.0.7.post2
traceback2==1.4.0           # via unittest2
unittest2==1.1.0            # via spark-testing-base
wheel==0.29.0
wrapt==1.10.8               # via astroid
{code}
-- write a setup.py with some entry points or packages. Use 
[PBR|http://docs.openstack.org/developer/pbr/]; it makes the job of maintaining 
a setup.py file really easy
-- create a virtualenv if you are not already in one:
{code}
virtualenv env
{code}
-- work on your environment, define the requirements you need in 
{{requirements.txt}}, do all the {{pip install}} you need.
- create the wheelhouse for your current project ({{pip wheel}} needs the 
{{wheel}} package installed):
{code}
pip install wheel
pip wheel . --wheel-dir wheelhouse
{code}
This can take some time, but at the end you have all the .whl files required 
*for your current system*
- zip it into a {{wheelhouse.zip}} (a possible command for this is sketched 
right after this list).
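A possible way to build that archive, assuming the wheels were generated into a 
local {{wheelhouse/}} directory as above (the exact archive layout is an 
assumption of this sketch, not something mandated here):
{code}
# bundle every generated .whl file into a single wheelhouse.zip
cd wheelhouse
zip -r ../wheelhouse.zip .
cd ..
{code}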

Note that you can have your own package (for instance 'my_package') generated 
into a wheel as well, and thus installed by {{pip}} automatically.
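As a minimal sketch of what the packaging of 'my_package' can look like with 
PBR (the metadata itself would live in a companion {{setup.cfg}}; this file is 
an illustration, not something required by the proposal):
{code}
# setup.py -- minimal PBR-based packaging; name, version and entry points
# are declared in the accompanying setup.cfg
import setuptools

setuptools.setup(
    setup_requires=['pbr'],
    pbr=True,
)
{code}
With such a layout, {{pip wheel . --wheel-dir wheelhouse}} also produces the 
wheel of 'my_package' next to the wheels of its dependencies.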

Now comes the time to submit the project:
{code}
bin/spark-submit  --master master --deploy-mode client --files 
/path/to/virtualenv/requirements.txt,/path/to/virtualenv/wheelhouse.zip --conf 
"spark.pyspark.virtualenv.enabled=true" ~/path/to/launcher_script.py
{code}

You can see that:
- no extra argument is added to the command line. All configuration goes 
through the {{--conf}} argument (this has been directly taken from 
SPARK-13587). According to the history of the Spark source code, I guess the 
goal is to simplify the maintenance of the various command line interfaces, by 
avoiding too 

[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-06 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363902#comment-15363902
 ] 

Semet commented on SPARK-16367:
---

I am wrong on one point: even if each node fetches independently from the 
official PyPI or a mirror, the nodes won't compile anything, since they can 
find the precompiled wheels. 

I'll add an option to specify the PyPI URL to use for fetching. For users that 
have a PyPI mirror (Artifactory or the project you mentioned), wheelhouse 
creation is not even needed. Only the current script (and dependencies that are 
not on PyPI, for example other internal projects) will have to be sent as a 
wheel or egg to be installed properly in the virtualenv.

Sounds like a good solution. But for users that don't have the Spark cluster 
connected to the Internet or to a PyPI mirror, I will keep the ability to 
install from a full wheelhouse.
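
For illustration, building the wheels against such a mirror only requires 
pointing {{pip}} at it; the URL below is a placeholder for whatever internal 
index is actually available, not something defined by this proposal:
{code}
# build the project's wheels against an internal PyPI mirror (placeholder URL)
pip wheel . --wheel-dir wheelhouse \
    --index-url https://artifactory.example.com/api/pypi/pypi-remote/simple
{code}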

> Wheelhouse Support for PySpark
> --
>
> Key: SPARK-16367
> URL: https://issues.apache.org/jira/browse/SPARK-16367
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, PySpark
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Semet
>  Labels: newbie, python, python-wheel, wheelhouse
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> *Rational*
> Is it recommended, in order to deploying Scala packages written in Scala, to 
> build big fat jar files. This allows to have all dependencies on one package 
> so the only "cost" is copy time to deploy this file on every Spark Node.
> On the other hand, Python deployment is more difficult once you want to use 
> external packages, and you don't really want to mess with the IT to deploy 
> the packages on the virtualenv of each nodes.
> *Previous approaches*
> I based the current proposal over the two following bugs related to this 
> point:
> - SPARK-6764 ("Wheel support for PySpark")
> - SPARK-13587("Support virtualenv in PySpark")
> First part of my proposal was to merge, in order to support wheels install 
> and virtualenv creation
> *Uber Fat Wheelhouse for Python Deployment*
> In Python, the packaging standard is now "wheels", which goes further that 
> old good ".egg" files. With a wheel file (".whl"), the package is already 
> prepared for a given architecture. You can have several wheel, each specific 
> to an architecture, or environment. 
> The {{pip}} tools now how to select the package matching the current system, 
> how to install this package in a light speed. Said otherwise, package that 
> requires compilation of a C module, for instance, does *not* compile anything 
> when installing from wheel file.
> {{pip}} also provides the ability to generate easily all wheel of all 
> packages used for a given module (inside a "virtualenv"). This is called 
> "wheelhouse". You can even don't mess with this compilation and retrieve it 
> directly from pypi.python.org.
> *Developer workflow*
> Here is, in a more concrete way, my proposal for on Pyspark developers point 
> of view:
> - you are writing a PySpark script that increase in term of size and 
> dependencies. Deploying on Spark for example requires to build numpy or 
> Theano and other dependencies
> - to use "Big Fat Wheelhouse" support of Pyspark, you need to turn his script 
> into a standard Python package:
> -- write a {{requirements.txt}}. I recommend to specify all package version. 
> You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
> requirements.txt
> {code}
> astroid==1.4.6# via pylint
> autopep8==1.2.4
> click==6.6# via pip-tools
> colorama==0.3.7   # via pylint
> enum34==1.1.6 # via hypothesis
> findspark==1.0.0  # via spark-testing-base
> first==2.0.1  # via pip-tools
> hypothesis==3.4.0 # via spark-testing-base
> lazy-object-proxy==1.2.2  # via astroid
> linecache2==1.0.0 # via traceback2
> pbr==1.10.0
> pep8==1.7.0   # via autopep8
> pip-tools==1.6.5
> py==1.4.31# via pytest
> pyflakes==1.2.3
> pylint==1.5.6
> pytest==2.9.2 # via spark-testing-base
> six==1.10.0   # via astroid, pip-tools, pylint, unittest2
> spark-testing-base==0.0.7.post2
> traceback2==1.4.0 # via unittest2
> unittest2==1.1.0  # via spark-testing-base
> wheel==0.29.0
> wrapt==1.10.8 # via astroid
> {code}
> -- write a setup.py with some entry points or package. Use 
> [PBR|http://docs.openstack.org/developer/pbr/] it makes the jobs of 
> maitaining a setup.py files really easy
> -- create a virtualenv if not already in one:
> {code}
> virtualenv env
> {code}
> -- Work on your environment, define the requirement you need in 
> {{requirements.txt}}, do all the {{pip install}} you need.
> - create the wheelhouse for your current project
> {code}
> pip install 

[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-06 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363899#comment-15363899
 ] 

Semet commented on SPARK-16367:
---

Yes, and Artifactory has an automatic mirroring capability for PyPI. Works 
pretty well. 
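
For illustration, one way to make every {{pip}} invocation resolve packages 
from such a mirror is pip's standard environment variable (the URL is only a 
placeholder for an internal Artifactory instance):
{code}
# point all pip commands at the internal mirror (placeholder URL)
export PIP_INDEX_URL=https://artifactory.example.com/api/pypi/pypi-remote/simple
pip wheel -r requirements.txt --wheel-dir wheelhouse
{code}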

> Wheelhouse Support for PySpark
> --
>
> Key: SPARK-16367
> URL: https://issues.apache.org/jira/browse/SPARK-16367
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, PySpark
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Semet
>  Labels: newbie, python, python-wheel, wheelhouse
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> *Rational*
> Is it recommended, in order to deploying Scala packages written in Scala, to 
> build big fat jar files. This allows to have all dependencies on one package 
> so the only "cost" is copy time to deploy this file on every Spark Node.
> On the other hand, Python deployment is more difficult once you want to use 
> external packages, and you don't really want to mess with the IT to deploy 
> the packages on the virtualenv of each nodes.
> *Previous approaches*
> I based the current proposal over the two following bugs related to this 
> point:
> - SPARK-6764 ("Wheel support for PySpark")
> - SPARK-13587("Support virtualenv in PySpark")
> First part of my proposal was to merge, in order to support wheels install 
> and virtualenv creation
> *Uber Fat Wheelhouse for Python Deployment*
> In Python, the packaging standard is now "wheels", which goes further that 
> old good ".egg" files. With a wheel file (".whl"), the package is already 
> prepared for a given architecture. You can have several wheel, each specific 
> to an architecture, or environment. 
> The {{pip}} tools now how to select the package matching the current system, 
> how to install this package in a light speed. Said otherwise, package that 
> requires compilation of a C module, for instance, does *not* compile anything 
> when installing from wheel file.
> {{pip}} also provides the ability to generate easily all wheel of all 
> packages used for a given module (inside a "virtualenv"). This is called 
> "wheelhouse". You can even don't mess with this compilation and retrieve it 
> directly from pypi.python.org.
> *Developer workflow*
> Here is, in a more concrete way, my proposal for on Pyspark developers point 
> of view:
> - you are writing a PySpark script that increase in term of size and 
> dependencies. Deploying on Spark for example requires to build numpy or 
> Theano and other dependencies
> - to use "Big Fat Wheelhouse" support of Pyspark, you need to turn his script 
> into a standard Python package:
> -- write a {{requirements.txt}}. I recommend to specify all package version. 
> You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
> requirements.txt
> {code}
> astroid==1.4.6# via pylint
> autopep8==1.2.4
> click==6.6# via pip-tools
> colorama==0.3.7   # via pylint
> enum34==1.1.6 # via hypothesis
> findspark==1.0.0  # via spark-testing-base
> first==2.0.1  # via pip-tools
> hypothesis==3.4.0 # via spark-testing-base
> lazy-object-proxy==1.2.2  # via astroid
> linecache2==1.0.0 # via traceback2
> pbr==1.10.0
> pep8==1.7.0   # via autopep8
> pip-tools==1.6.5
> py==1.4.31# via pytest
> pyflakes==1.2.3
> pylint==1.5.6
> pytest==2.9.2 # via spark-testing-base
> six==1.10.0   # via astroid, pip-tools, pylint, unittest2
> spark-testing-base==0.0.7.post2
> traceback2==1.4.0 # via unittest2
> unittest2==1.1.0  # via spark-testing-base
> wheel==0.29.0
> wrapt==1.10.8 # via astroid
> {code}
> -- write a setup.py with some entry points or package. Use 
> [PBR|http://docs.openstack.org/developer/pbr/] it makes the jobs of 
> maitaining a setup.py files really easy
> -- create a virtualenv if not already in one:
> {code}
> virtualenv env
> {code}
> -- Work on your environment, define the requirement you need in 
> {{requirements.txt}}, do all the {{pip install}} you need.
> - create the wheelhouse for your current project
> {code}
> pip install wheelhouse
> pip wheel . --wheel-dir wheelhouse
> {code}
> This can take some times, but at the end you have all the .whl required *for 
> your current system*
> - zip it into a {{wheelhouse.zip}}.
> Note that you can have your own package (for instance 'my_package') be 
> generated into a wheel and so installed by {{pip}} automatically.
> Now comes the time to submit the project:
> {code}
> bin/spark-submit  --master master --deploy-mode client --files 
> /path/to/virtualenv/requirements.txt,/path/to/virtualenv/wheelhouse.zip 
> --conf "spark.pyspark.virtualenv.enabled=true" ~/path/to/launcher_script.py
> {code}
> You 

[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-06 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363898#comment-15363898
 ] 

Semet commented on SPARK-16367:
---

For me, you as a developer are already inside a virtualenv when you write your 
script, so everything has already been compiled on your machine during {{pip 
install}}. And you only need to create the wheels once. 

I also advocate sending your script as a wheel so it is already packaged as 
well, but this is not mandatory.

This follows the same pattern as the uber fat jar: send everything the Python 
executor will need to execute your Spark job, without assuming anything is 
already installed on each node.
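
As an illustration (the script name and the numpy dependency are only 
assumptions for this sketch), this is the kind of job where every executor 
needs the dependencies locally:
{code}
# launcher_script.py -- hypothetical example of a job whose closures need numpy
# on every executor, which is why the dependencies have to ship with the job
from pyspark import SparkContext
import numpy as np

sc = SparkContext(appName="wheelhouse-example")
rdd = sc.parallelize(range(1000), 4)
# this lambda runs on the executors, so numpy must be importable there
total = rdd.map(lambda x: float(np.sqrt(x))).sum()
print(total)
sc.stop()
{code}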

This indeed costs the creation time of the wheels. But once created (it takes a 
few seconds, maybe a minute if you build numpy), the deployment cost is really 
low. Each node won't have to compile independently; it is just file transfers 
and unzipping, automatically handled by pip.

Users can also directly copy the wheel from pypi.python.org, for numpy for 
example.

I am not a fan of letting the cluster fetch from the Internet: you'll have to 
ensure the proxy is correctly set in a corporate environment, which might not 
be the case everywhere. But as an option, yes, I would also provide it. It is 
just more expensive to deploy: each node will fetch and compile, versus all 
wheels being prepared in advance and pip only choosing the right one to unzip.
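
A sketch of "copying the wheels directly from pypi.python.org" with a 
reasonably recent {{pip}} (the package list is only an example):
{code}
# fetch prebuilt wheels straight from PyPI into the wheelhouse, no compilation
pip download --only-binary=:all: --dest wheelhouse numpy
{code}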

> Wheelhouse Support for PySpark
> --
>
> Key: SPARK-16367
> URL: https://issues.apache.org/jira/browse/SPARK-16367
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, PySpark
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Semet
>  Labels: newbie, python, python-wheel, wheelhouse
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> *Rational*
> Is it recommended, in order to deploying Scala packages written in Scala, to 
> build big fat jar files. This allows to have all dependencies on one package 
> so the only "cost" is copy time to deploy this file on every Spark Node.
> On the other hand, Python deployment is more difficult once you want to use 
> external packages, and you don't really want to mess with the IT to deploy 
> the packages on the virtualenv of each nodes.
> *Previous approaches*
> I based the current proposal over the two following bugs related to this 
> point:
> - SPARK-6764 ("Wheel support for PySpark")
> - SPARK-13587("Support virtualenv in PySpark")
> First part of my proposal was to merge, in order to support wheels install 
> and virtualenv creation
> *Uber Fat Wheelhouse for Python Deployment*
> In Python, the packaging standard is now "wheels", which goes further that 
> old good ".egg" files. With a wheel file (".whl"), the package is already 
> prepared for a given architecture. You can have several wheel, each specific 
> to an architecture, or environment. 
> The {{pip}} tools now how to select the package matching the current system, 
> how to install this package in a light speed. Said otherwise, package that 
> requires compilation of a C module, for instance, does *not* compile anything 
> when installing from wheel file.
> {{pip}} also provides the ability to generate easily all wheel of all 
> packages used for a given module (inside a "virtualenv"). This is called 
> "wheelhouse". You can even don't mess with this compilation and retrieve it 
> directly from pypi.python.org.
> *Developer workflow*
> Here is, in a more concrete way, my proposal for on Pyspark developers point 
> of view:
> - you are writing a PySpark script that increase in term of size and 
> dependencies. Deploying on Spark for example requires to build numpy or 
> Theano and other dependencies
> - to use "Big Fat Wheelhouse" support of Pyspark, you need to turn his script 
> into a standard Python package:
> -- write a {{requirements.txt}}. I recommend to specify all package version. 
> You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
> requirements.txt
> {code}
> astroid==1.4.6# via pylint
> autopep8==1.2.4
> click==6.6# via pip-tools
> colorama==0.3.7   # via pylint
> enum34==1.1.6 # via hypothesis
> findspark==1.0.0  # via spark-testing-base
> first==2.0.1  # via pip-tools
> hypothesis==3.4.0 # via spark-testing-base
> lazy-object-proxy==1.2.2  # via astroid
> linecache2==1.0.0 # via traceback2
> pbr==1.10.0
> pep8==1.7.0   # via autopep8
> pip-tools==1.6.5
> py==1.4.31# via pytest
> pyflakes==1.2.3
> pylint==1.5.6
> pytest==2.9.2 # via spark-testing-base
> six==1.10.0   # via astroid, pip-tools, pylint, unittest2
> spark-testing-base==0.0.7.post2
> traceback2==1.4.0 # via unittest2
> unittest2==1.1.0  # via spark-testing-base
> wheel==0.29.0
> 

[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-05 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363189#comment-15363189
 ] 

Semet commented on SPARK-16367:
---

This is where the magic of wheels lies:
- look at https://pypi.python.org/pypi/numpy : there are wheels for the various 
Python versions, 32/64 bit, Linux/Mac/Windows. Simply copy them from 
pypi.python.org, drop them in, and that's all, no compilation needed
- no compilation is needed upon installation, and if all wheels are put in the 
wheelhouse archive, the installation will only consist of unzipping the 
packages (automatically handled by {{pip install}})
- the creation of the wheelhouse is really simple: {{pip install wheel}}, and 
then {{pip wheel}}. I'll write a tutorial in the documentation.

I have actually rebased your patch, so the cache thing will be kept :)

> Wheelhouse Support for PySpark
> --
>
> Key: SPARK-16367
> URL: https://issues.apache.org/jira/browse/SPARK-16367
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, PySpark
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Semet
>  Labels: newbie, python, python-wheel, wheelhouse
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> *Rational*
> Is it recommended, in order to deploying Scala packages written in Scala, to 
> build big fat jar files. This allows to have all dependencies on one package 
> so the only "cost" is copy time to deploy this file on every Spark Node.
> On the other hand, Python deployment is more difficult once you want to use 
> external packages, and you don't really want to mess with the IT to deploy 
> the packages on the virtualenv of each nodes.
> *Previous approaches*
> I based the current proposal over the two following bugs related to this 
> point:
> - SPARK-6764 ("Wheel support for PySpark")
> - SPARK-13587("Support virtualenv in PySpark")
> First part of my proposal was to merge, in order to support wheels install 
> and virtualenv creation
> *Uber Fat Wheelhouse for Python Deployment*
> In Python, the packaging standard is now "wheels", which goes further that 
> old good ".egg" files. With a wheel file (".whl"), the package is already 
> prepared for a given architecture. You can have several wheel, each specific 
> to an architecture, or environment. 
> The {{pip}} tools now how to select the package matching the current system, 
> how to install this package in a light speed. Said otherwise, package that 
> requires compilation of a C module, for instance, does *not* compile anything 
> when installing from wheel file.
> {{pip}} also provides the ability to generate easily all wheel of all 
> packages used for a given module (inside a "virtualenv"). This is called 
> "wheelhouse". You can even don't mess with this compilation and retrieve it 
> directly from pypi.python.org.
> *Developer workflow*
> Here is, in a more concrete way, my proposal for on Pyspark developers point 
> of view:
> - you are writing a PySpark script that increase in term of size and 
> dependencies. Deploying on Spark for example requires to build numpy or 
> Theano and other dependencies
> - to use "Big Fat Wheelhouse" support of Pyspark, you need to turn his script 
> into a standard Python package:
> -- write a {{requirements.txt}}. I recommend to specify all package version. 
> You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
> requirements.txt
> {code}
> astroid==1.4.6# via pylint
> autopep8==1.2.4
> click==6.6# via pip-tools
> colorama==0.3.7   # via pylint
> enum34==1.1.6 # via hypothesis
> findspark==1.0.0  # via spark-testing-base
> first==2.0.1  # via pip-tools
> hypothesis==3.4.0 # via spark-testing-base
> lazy-object-proxy==1.2.2  # via astroid
> linecache2==1.0.0 # via traceback2
> pbr==1.10.0
> pep8==1.7.0   # via autopep8
> pip-tools==1.6.5
> py==1.4.31# via pytest
> pyflakes==1.2.3
> pylint==1.5.6
> pytest==2.9.2 # via spark-testing-base
> six==1.10.0   # via astroid, pip-tools, pylint, unittest2
> spark-testing-base==0.0.7.post2
> traceback2==1.4.0 # via unittest2
> unittest2==1.1.0  # via spark-testing-base
> wheel==0.29.0
> wrapt==1.10.8 # via astroid
> {code}
> -- write a setup.py with some entry points or package. Use 
> [PBR|http://docs.openstack.org/developer/pbr/] it makes the jobs of 
> maitaining a setup.py files really easy
> -- create a virtualenv if not already in one:
> {code}
> virtualenv env
> {code}
> -- Work on your environment, define the requirement you need in 
> {{requirements.txt}}, do all the {{pip install}} you need.
> - create the wheelhouse for your current project
> {code}
> pip install wheelhouse
> pip wheel . --wheel-dir wheelhouse
> {code}
> This can take some 

[jira] [Updated] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-05 Thread Semet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semet updated SPARK-16367:
--
Description: 
*Rationale*
It is recommended, in order to deploy packages written in Scala, to build big 
fat jar files. This allows having all dependencies in one package, so the only 
"cost" is the copy time to deploy this file on every Spark node.

On the other hand, Python deployment is more difficult once you want to use 
external packages, and you don't really want to ask IT to deploy the packages 
in the virtualenv of each node.

*Previous approaches*
I based the current proposal on the two following tickets related to this point:
- SPARK-6764 ("Wheel support for PySpark")
- SPARK-13587 ("Support virtualenv in PySpark")
The first part of my proposal is to merge them, in order to support both wheel 
installation and virtualenv creation.

*Uber Fat Wheelhouse for Python Deployment*
In Python, the packaging standard is now "wheels", which go further than the 
good old ".egg" files. With a wheel file (".whl"), the package is already 
prepared for a given architecture. You can have several wheels, each specific 
to an architecture or environment. 

The {{pip}} tool knows how to select the package matching the current system 
and how to install it at light speed. In other words, a package that requires 
compilation of a C module does *not* compile anything when installed from a 
wheel file.

{{pip}} also provides the ability to easily generate all the wheels of all the 
packages used by a given module (inside a "virtualenv"). This is called a 
"wheelhouse". You can even skip the compilation entirely and retrieve the 
wheels directly from pypi.python.org.

*Developer workflow*
Here is, in a more concrete way, my proposal from a PySpark developer's point 
of view:

- you are writing a PySpark script that increases in size and dependencies. 
Deploying it on Spark, for example, requires building numpy or Theano and other 
dependencies
- to use the "Big Fat Wheelhouse" support of PySpark, you need to turn this 
script into a standard Python package:
-- write a {{requirements.txt}}. I recommend pinning all package versions. You 
can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
requirements.txt
{code}
astroid==1.4.6# via pylint
autopep8==1.2.4
click==6.6# via pip-tools
colorama==0.3.7   # via pylint
enum34==1.1.6 # via hypothesis
findspark==1.0.0  # via spark-testing-base
first==2.0.1  # via pip-tools
hypothesis==3.4.0 # via spark-testing-base
lazy-object-proxy==1.2.2  # via astroid
linecache2==1.0.0 # via traceback2
pbr==1.10.0
pep8==1.7.0   # via autopep8
pip-tools==1.6.5
py==1.4.31# via pytest
pyflakes==1.2.3
pylint==1.5.6
pytest==2.9.2 # via spark-testing-base
six==1.10.0   # via astroid, pip-tools, pylint, unittest2
spark-testing-base==0.0.7.post2
traceback2==1.4.0 # via unittest2
unittest2==1.1.0  # via spark-testing-base
wheel==0.29.0
wrapt==1.10.8 # via astroid
{code}
-- write a setup.py with some entry points or packages. Use 
[PBR|http://docs.openstack.org/developer/pbr/]; it makes the job of maintaining 
a setup.py file really easy
-- create a virtualenv if you are not already in one:
{code}
virtualenv env
{code}
-- work on your environment, define the requirements you need in 
{{requirements.txt}}, do all the {{pip install}} you need.
- create the wheelhouse for your current project:
{code}
pip install wheel
pip wheel . --wheel-dir wheelhouse
{code}
This can take some time, but at the end you have all the .whl files required 
*for your current system*
- zip it into a {{wheelhouse.zip}}.

Note that you can have your own package (for instance 'my_package') generated 
into a wheel as well, and thus installed by {{pip}} automatically.

Now comes the time to submit the project:
{code}
bin/spark-submit  --master master --deploy-mode client --files 
/path/to/virtualenv/requirements.txt,/path/to/virtualenv/wheelhouse.zip --conf 
"spark.pyspark.virtualenv.enabled=true" ~/path/to/launcher_script.py
{code}

You can see that:
- no extra argument is added to the command line. All configuration goes 
through the {{--conf}} argument (this has been directly taken from 
SPARK-13587). According to the history of the Spark source code, I guess the 
goal is to simplify the maintenance of the various command line interfaces, by 
avoiding too many specific arguments.
- the wheelhouse deployment is triggered by the {{ --conf 
"spark.pyspark.virtualenv.enabled=true" }} argument. The {{requirements.txt}} 
and {{wheelhouse.zip}} are copied through {{--files}}. The names of both files 
can be changed through {{--conf}} arguments. I guess with proper documentation 
this might not be a problem
- you still need to define the path to {{requirements.txt}} and 
{{wheelhouse.zip}} (they will be 

[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-07-05 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15362294#comment-15362294
 ] 

Semet commented on SPARK-13587:
---

Full proposal is here: https://issues.apache.org/jira/browse/SPARK-16367

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for user to add third party python packages in 
> pyspark.
> * One way is to using --py-files (suitable for simple dependency, but not 
> suitable for complicated dependency, especially with transitive dependency)
> * Another way is install packages manually on each node (time wasting, and 
> not easy to switch to different environment)
> Python has now 2 different virtualenv implementation. One is native 
> virtualenv another is through conda. This jira is trying to migrate these 2 
> tools to distributed environment






[jira] [Comment Edited] (SPARK-6764) Add wheel package support for PySpark

2016-07-05 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15361408#comment-15361408
 ] 

Semet edited comment on SPARK-6764 at 7/5/16 9:44 AM:
--

Hello
I am working on a new proposal for complete wheel support, along with 
virtualenv. I think this will solve many dependency problems with Python 
packages.

Full proposal is here: https://issues.apache.org/jira/browse/SPARK-16367


was (Author: gae...@xeberon.net):
Hello
I am working on a new proposal for complete wheel support, along with 
virtualenv. I think this will solve many dependency problem with python 
packages.

> Add wheel package support for PySpark
> -
>
> Key: SPARK-6764
> URL: https://issues.apache.org/jira/browse/SPARK-6764
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, PySpark
>Reporter: Takao Magoori
>Priority: Minor
>  Labels: newbie
>
> We can do _spark-submit_ with one or more Python packages (.egg,.zip and 
> .jar) by *--py-files* option.
> h4. zip packaging
> Spark put a zip file on its working directory and adds the absolute path to 
> Python's sys.path. When the user program imports it, 
> [zipimport|https://docs.python.org/2.7/library/zipimport.html] is 
> automatically invoked under the hood. That is, data-files and dynamic 
> modules(.pyd .so) can not be used since zipimport supports only .py, .pyc and 
> .pyo.
> h4. egg packaging
> Spark put an egg file on its working directory and adds the absolute path to 
> Python's sys.path. Unlike zipimport, egg can handle data files and dynamid 
> modules as far as the author of the package uses [pkg_resources 
> API|https://pythonhosted.org/setuptools/formats.html#other-technical-considerations]
>  properly. But so many python modules does not use pkg_resources API, that 
> causes "ImportError"or "No such file" error. Moreover, creating eggs of 
> dependencies and further dependencies are troublesome job.
> h4. wheel packaging
> Supporting new Python standard package-format 
> "[wheel|https://wheel.readthedocs.org/en/latest/]; would be nice. With wheel, 
> we can do spark-submit with complex dependencies simply as follows.
> 1. Write requirements.txt file.
> {noformat}
> SQLAlchemy
> MySQL-python
> requests
> simplejson>=3.6.0,<=3.6.5
> pydoop
> {noformat}
> 2. Do wheel packaging by only one command. All dependencies are wheel-ed.
> {noformat}
> $ your_pip_dir/pip wheel --wheel-dir /tmp/wheelhouse --requirement 
> requirements.txt
> {noformat}
> 3. Do spark-submit
> {noformat}
> your_spark_home/bin/spark-submit --master local[4] --py-files $(find 
> /tmp/wheelhouse/ -name "*.whl" -print0 | sed -e 's/\x0/,/g') your_driver.py
> {noformat}
> If your pyspark driver is a package which consists of many modules,
> 1. Write setup.py for your pyspark driver package.
> {noformat}
> from setuptools import (
> find_packages,
> setup,
> )
> setup(
> name='yourpkg',
> version='0.0.1',
> packages=find_packages(),
> install_requires=[
> 'SQLAlchemy',
> 'MySQL-python',
> 'requests',
> 'simplejson>=3.6.0,<=3.6.5',
> 'pydoop',
> ],
> )
> {noformat}
> 2. Do wheel packaging by only one command. Your driver package and all 
> dependencies are wheel-ed.
> {noformat}
> your_pip_dir/pip wheel --wheel-dir /tmp/wheelhouse your_driver_package/.
> {noformat}
> 3. Do spark-submit
> {noformat}
> your_spark_home/bin/spark-submit --master local[4] --py-files $(find 
> /tmp/wheelhouse/ -name "*.whl" -print0 | sed -e 's/\x0/,/g') 
> your_driver_bootstrap.py
> {noformat}






[jira] [Updated] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-04 Thread Semet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semet updated SPARK-16367:
--
Labels: newbie wh  (was: newbie)

> Wheelhouse Support for PySpark
> --
>
> Key: SPARK-16367
> URL: https://issues.apache.org/jira/browse/SPARK-16367
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, PySpark
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Semet
>  Labels: newbie, python, python-wheel, wheelhouse
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> *Rational*
> Is it recommended, in order to deploying Scala packages written in Scala, to 
> build big fat jar files. This allows to have all dependencies on one package 
> so the only "cost" is copy time to deploy this file on every Spark Node.
> On the other hand, Python deployment is more difficult once you want to use 
> external packages, and you don't really want to mess with the IT to deploy 
> the packages on the virtualenv of each nodes.
> *Previous approaches*
> I based the current proposal over the two following bugs related to this 
> point:
> - SPARK-6764 ("Wheel support for PySpark")
> - SPARK-13587("Support virtualenv in PySpark")
> First part of my proposal was to merge, in order to support wheels install 
> and virtualenv creation
> *Uber Fat Wheelhouse for Python Deployment*
> In Python, the packaging standard is now "wheels", which goes further that 
> old good ".egg" files. With a wheel file (".whl"), the package is already 
> prepared for a given architecture. You can have several wheel, each specific 
> to an architecture, or environment. 
> The {{pip}} tools now how to select the package matching the current system, 
> how to install this package in a light speed. Said otherwise, package that 
> requires compilation of a C module, for instance, does *not* compile anything 
> when installing from wheel file.
> {{pip}} also provides the ability to generate easily all wheel of all 
> packages used for a given module (inside a "virtualenv"). This is called 
> "wheelhouse". You can even don't mess with this compilation and retrieve it 
> directly from pypi.python.org.
> *Developer workflow*
> Here is, in a more concrete way, my proposal for on Pyspark developers point 
> of view:
> - you are writing a PySpark script that increase in term of size and 
> dependencies. Deploying on Spark for example requires to build numpy or 
> Theano and other dependencies
> - to use "Big Fat Wheelhouse" support of Pyspark, you need to turn his script 
> into a standard Python package:
> -- write a {{requirements.txt}}. I recommend to specify all package version. 
> You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
> requirements.txt
> {code}
> astroid==1.4.6# via pylint
> autopep8==1.2.4
> click==6.6# via pip-tools
> colorama==0.3.7   # via pylint
> enum34==1.1.6 # via hypothesis
> findspark==1.0.0  # via spark-testing-base
> first==2.0.1  # via pip-tools
> hypothesis==3.4.0 # via spark-testing-base
> lazy-object-proxy==1.2.2  # via astroid
> linecache2==1.0.0 # via traceback2
> pbr==1.10.0
> pep8==1.7.0   # via autopep8
> pip-tools==1.6.5
> py==1.4.31# via pytest
> pyflakes==1.2.3
> pylint==1.5.6
> pytest==2.9.2 # via spark-testing-base
> six==1.10.0   # via astroid, pip-tools, pylint, unittest2
> spark-testing-base==0.0.7.post2
> traceback2==1.4.0 # via unittest2
> unittest2==1.1.0  # via spark-testing-base
> wheel==0.29.0
> wrapt==1.10.8 # via astroid
> {code}
> -- write a setup.py with some entry points or package. Use 
> [PBR|http://docs.openstack.org/developer/pbr/] it makes the jobs of 
> maitaining a setup.py files really easy
> -- create a virtualenv if not already in one:
> {code}
> virtualenv env
> {code}
> -- Work on your environment, define the requirement you need in 
> {{requirements.txt}}, do all the {{pip install}} you need.
> - create the wheelhouse for your current project
> {code}
> pip install wheelhouse
> pip wheel . --wheel-dir wheelhouse
> {code}
> This can take some times, but at the end you have all the .whl required *for 
> your current system*
> - zip it into a {{wheelhouse.zip}}.
> Note that you can have your own package (for instance 'my_package') be 
> generated into a wheel and so installed by {{pip}} automatically.
> Now comes the time to submit the project:
> {code}
> bin/spark-submit  --master master --deploy-mode client --conf 
> "spark.pyspark.virtualenv.enabled=true" --conf 
> "spark.pyspark.virtualenv.type=native" --conf 
> "spark.pyspark.virtualenv.requirements=/path/to/virtualenv/requirements.txt" 
> --conf "spark.pyspark.virtualenv.bin.path=virtualenv" 
> 

[jira] [Updated] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-04 Thread Semet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semet updated SPARK-16367:
--
Labels: newbie python python-wheel wheelhouse  (was: newbie wh)

> Wheelhouse Support for PySpark
> --
>
> Key: SPARK-16367
> URL: https://issues.apache.org/jira/browse/SPARK-16367
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, PySpark
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Semet
>  Labels: newbie, python, python-wheel, wheelhouse
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> *Rational*
> Is it recommended, in order to deploying Scala packages written in Scala, to 
> build big fat jar files. This allows to have all dependencies on one package 
> so the only "cost" is copy time to deploy this file on every Spark Node.
> On the other hand, Python deployment is more difficult once you want to use 
> external packages, and you don't really want to mess with the IT to deploy 
> the packages on the virtualenv of each nodes.
> *Previous approaches*
> I based the current proposal over the two following bugs related to this 
> point:
> - SPARK-6764 ("Wheel support for PySpark")
> - SPARK-13587("Support virtualenv in PySpark")
> First part of my proposal was to merge, in order to support wheels install 
> and virtualenv creation
> *Uber Fat Wheelhouse for Python Deployment*
> In Python, the packaging standard is now "wheels", which goes further that 
> old good ".egg" files. With a wheel file (".whl"), the package is already 
> prepared for a given architecture. You can have several wheel, each specific 
> to an architecture, or environment. 
> The {{pip}} tools now how to select the package matching the current system, 
> how to install this package in a light speed. Said otherwise, package that 
> requires compilation of a C module, for instance, does *not* compile anything 
> when installing from wheel file.
> {{pip}} also provides the ability to generate easily all wheel of all 
> packages used for a given module (inside a "virtualenv"). This is called 
> "wheelhouse". You can even don't mess with this compilation and retrieve it 
> directly from pypi.python.org.
> *Developer workflow*
> Here is, in a more concrete way, my proposal for on Pyspark developers point 
> of view:
> - you are writing a PySpark script that increase in term of size and 
> dependencies. Deploying on Spark for example requires to build numpy or 
> Theano and other dependencies
> - to use "Big Fat Wheelhouse" support of Pyspark, you need to turn his script 
> into a standard Python package:
> -- write a {{requirements.txt}}. I recommend to specify all package version. 
> You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
> requirements.txt
> {code}
> astroid==1.4.6# via pylint
> autopep8==1.2.4
> click==6.6# via pip-tools
> colorama==0.3.7   # via pylint
> enum34==1.1.6 # via hypothesis
> findspark==1.0.0  # via spark-testing-base
> first==2.0.1  # via pip-tools
> hypothesis==3.4.0 # via spark-testing-base
> lazy-object-proxy==1.2.2  # via astroid
> linecache2==1.0.0 # via traceback2
> pbr==1.10.0
> pep8==1.7.0   # via autopep8
> pip-tools==1.6.5
> py==1.4.31# via pytest
> pyflakes==1.2.3
> pylint==1.5.6
> pytest==2.9.2 # via spark-testing-base
> six==1.10.0   # via astroid, pip-tools, pylint, unittest2
> spark-testing-base==0.0.7.post2
> traceback2==1.4.0 # via unittest2
> unittest2==1.1.0  # via spark-testing-base
> wheel==0.29.0
> wrapt==1.10.8 # via astroid
> {code}
> -- write a setup.py with some entry points or package. Use 
> [PBR|http://docs.openstack.org/developer/pbr/] it makes the jobs of 
> maitaining a setup.py files really easy
> -- create a virtualenv if not already in one:
> {code}
> virtualenv env
> {code}
> -- Work on your environment, define the requirement you need in 
> {{requirements.txt}}, do all the {{pip install}} you need.
> - create the wheelhouse for your current project
> {code}
> pip install wheelhouse
> pip wheel . --wheel-dir wheelhouse
> {code}
> This can take some times, but at the end you have all the .whl required *for 
> your current system*
> - zip it into a {{wheelhouse.zip}}.
> Note that you can have your own package (for instance 'my_package') be 
> generated into a wheel and so installed by {{pip}} automatically.
> Now comes the time to submit the project:
> {code}
> bin/spark-submit  --master master --deploy-mode client --conf 
> "spark.pyspark.virtualenv.enabled=true" --conf 
> "spark.pyspark.virtualenv.type=native" --conf 
> "spark.pyspark.virtualenv.requirements=/path/to/virtualenv/requirements.txt" 
> --conf "spark.pyspark.virtualenv.bin.path=virtualenv" 
> 

[jira] [Updated] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-04 Thread Semet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semet updated SPARK-16367:
--
Description: 
*Rationale*
It is recommended, in order to deploy packages written in Scala, to build big 
fat jar files. This allows having all dependencies in one package, so the only 
"cost" is the copy time to deploy this file on every Spark node.

On the other hand, Python deployment is more difficult once you want to use 
external packages, and you don't really want to ask IT to deploy the packages 
in the virtualenv of each node.

*Previous approaches*
I based the current proposal on the two following tickets related to this point:
- SPARK-6764 ("Wheel support for PySpark")
- SPARK-13587 ("Support virtualenv in PySpark")
The first part of my proposal is to merge them, in order to support both wheel 
installation and virtualenv creation.

*Uber Fat Wheelhouse for Python Deployment*
In Python, the packaging standard is now "wheels", which go further than the 
good old ".egg" files. With a wheel file (".whl"), the package is already 
prepared for a given architecture. You can have several wheels, each specific 
to an architecture or environment. 

The {{pip}} tool knows how to select the package matching the current system 
and how to install it at light speed. In other words, a package that requires 
compilation of a C module does *not* compile anything when installed from a 
wheel file.

{{pip}} also provides the ability to easily generate all the wheels of all the 
packages used by a given module (inside a "virtualenv"). This is called a 
"wheelhouse". You can even skip the compilation entirely and retrieve the 
wheels directly from pypi.python.org.

*Developer workflow*
Here is, in a more concrete way, my proposal from a PySpark developer's point 
of view:

- you are writing a PySpark script that increases in size and dependencies. 
Deploying it on Spark, for example, requires building numpy or Theano and other 
dependencies
- to use the "Big Fat Wheelhouse" support of PySpark, you need to turn this 
script into a standard Python package:
-- write a {{requirements.txt}}. I recommend pinning all package versions. You 
can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
requirements.txt
{code}
astroid==1.4.6# via pylint
autopep8==1.2.4
click==6.6# via pip-tools
colorama==0.3.7   # via pylint
enum34==1.1.6 # via hypothesis
findspark==1.0.0  # via spark-testing-base
first==2.0.1  # via pip-tools
hypothesis==3.4.0 # via spark-testing-base
lazy-object-proxy==1.2.2  # via astroid
linecache2==1.0.0 # via traceback2
pbr==1.10.0
pep8==1.7.0   # via autopep8
pip-tools==1.6.5
py==1.4.31# via pytest
pyflakes==1.2.3
pylint==1.5.6
pytest==2.9.2 # via spark-testing-base
six==1.10.0   # via astroid, pip-tools, pylint, unittest2
spark-testing-base==0.0.7.post2
traceback2==1.4.0 # via unittest2
unittest2==1.1.0  # via spark-testing-base
wheel==0.29.0
wrapt==1.10.8 # via astroid
{code}
-- write a setup.py with some entry points or packages. Use 
[PBR|http://docs.openstack.org/developer/pbr/]; it makes the job of maintaining 
a setup.py file really easy
-- create a virtualenv if you are not already in one:
{code}
virtualenv env
{code}
-- work on your environment, define the requirements you need in 
{{requirements.txt}}, do all the {{pip install}} you need.
- create the wheelhouse for your current project:
{code}
pip install wheel
pip wheel . --wheel-dir wheelhouse
{code}
This can take some time, but at the end you have all the .whl files required 
*for your current system*
- zip it into a {{wheelhouse.zip}}.

Note that you can have your own package (for instance 'my_package') generated 
into a wheel as well, and thus installed by {{pip}} automatically.

Now comes the time to submit the project:
{code}
bin/spark-submit  --master master --deploy-mode client --conf 
"spark.pyspark.virtualenv.enabled=true" --conf 
"spark.pyspark.virtualenv.type=native" --conf 
"spark.pyspark.virtualenv.requirements=/path/to/virtualenv/requirements.txt" 
--conf "spark.pyspark.virtualenv.bin.path=virtualenv" --conf 
"spark.pyspark.virtualenv.wheelhouse=/path/to/virtualenv/wheelhouse.zip" 
~/path/to/launcher_script.py
{code}

You can see that:
- no extra argument is added to the command line. All configuration goes 
through the {{--conf}} argument (this has been directly taken from 
SPARK-13587). According to the history of the Spark source code, I guess the 
goal is to simplify the maintenance of the various command line interfaces, by 
avoiding too many specific arguments.
- the command line is pretty complex indeed. I guess with proper documentation 
this might not be a problem
- you still need to define the path to {{requirements.txt}} and 
{{wheelhouse.zip}} (they will be automatically copied to each node). This 

[jira] [Updated] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-04 Thread Semet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semet updated SPARK-16367:
--
Issue Type: New Feature  (was: Improvement)

> Wheelhouse Support for PySpark
> --
>
> Key: SPARK-16367
> URL: https://issues.apache.org/jira/browse/SPARK-16367
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, PySpark
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Semet
>  Labels: newbie
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> *Rational*
> Is it recommended, in order to deploying Scala packages written in Scala, to 
> build big fat jar files. This allows to have all dependencies on one package 
> so the only "cost" is copy time to deploy this file on every Spark Node.
> On the other hand, Python deployment is more difficult once you want to use 
> external packages, and you don't really want to mess with the IT to deploy 
> the packages on the virtualenv of each nodes.
> *Previous approaches*
> I based the current proposal over the two following bugs related to this 
> point:
> - SPARK-6764 ("Wheel support for PySpark")
> - SPARK-13587("Support virtualenv in PySpark")
> So here is my proposal:
> *Uber Fat Wheelhouse for Python Deployment*
> In Python, the packaging standard is now "wheels", which goes further that 
> old good ".egg" files. With a wheel file (".whl"), the package is already 
> prepared for a given architecture. You can have several wheel, each specific 
> to an architecture, or environment. 
> The {{pip}} tools now how to select the package matching the current system, 
> how to install this package in a light speed. Said otherwise, package that 
> requires compilation of a C module, for instance, does *not* compile anything 
> when installing from wheel file.
> {{pip}} also provides the ability to generate easily all wheel of all 
> packages used for a given module (inside a "virtualenv"). This is called 
> "wheelhouse". You can even don't mess with this compilation and retrieve it 
> directly from pypi.python.org.
> *Developer workflow*
> Here is, in a more concrete way, my proposal for on Pyspark developers point 
> of view:
> - you are writing a PySpark script that increase in term of size and 
> dependencies. Deploying on Spark for example requires to build numpy or 
> Theano and other dependencies
> - to use "Big Fat Wheelhouse" support of Pyspark, you need to turn his script 
> into a standard Python package:
> -- write a {{requirements.txt}}. I recommend to specify all package version. 
> You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
> requirements.txt
> {code}
> astroid==1.4.6# via pylint
> autopep8==1.2.4
> click==6.6# via pip-tools
> colorama==0.3.7   # via pylint
> enum34==1.1.6 # via hypothesis
> findspark==1.0.0  # via spark-testing-base
> first==2.0.1  # via pip-tools
> hypothesis==3.4.0 # via spark-testing-base
> lazy-object-proxy==1.2.2  # via astroid
> linecache2==1.0.0 # via traceback2
> pbr==1.10.0
> pep8==1.7.0   # via autopep8
> pip-tools==1.6.5
> py==1.4.31# via pytest
> pyflakes==1.2.3
> pylint==1.5.6
> pytest==2.9.2 # via spark-testing-base
> six==1.10.0   # via astroid, pip-tools, pylint, unittest2
> spark-testing-base==0.0.7.post2
> traceback2==1.4.0 # via unittest2
> unittest2==1.1.0  # via spark-testing-base
> wheel==0.29.0
> wrapt==1.10.8 # via astroid
> {code}
> -- write a setup.py with some entry points or package. Use 
> [PBR|http://docs.openstack.org/developer/pbr/] it makes the jobs of 
> maitaining a setup.py files really easy
> -- create a virtualenv if not already in one:
> {code}
> virtualenv env
> {code}
> -- Work on your environment, define the requirement you need in 
> {{requirements.txt}}, do all the {{pip install}} you need.
> - create the wheelhouse for your current project
> {code}
> pip install wheelhouse
> pip wheel . --wheel-dir wheelhouse
> {code}
> This can take some times, but at the end you have all the .whl required *for 
> your current system*
> - zip it into a {{wheelhouse.zip}}.
> Note that you can have your own package (for instance 'my_package') be 
> generated into a wheel and so installed by {{pip}} automatically.
> Now comes the time to submit the project:
> {code}
> bin/spark-submit  --master master --deploy-mode client --conf 
> "spark.pyspark.virtualenv.enabled=true" --conf 
> "spark.pyspark.virtualenv.type=native" --conf 
> "spark.pyspark.virtualenv.requirements=/path/to/virtualenv/requirements.txt" 
> --conf "spark.pyspark.virtualenv.bin.path=virtualenv" 
> "spark.pyspark.virtualenv.wheelhouse=/path/to/virtualenv/wheelhouse.zip"  
> ~/path/to/launcher_script.py
> {code}
> You can see that:
> - no 

[jira] [Updated] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-04 Thread Semet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semet updated SPARK-16367:
--
Description: 
*Rationale*
It is recommended, in order to deploy packages written in Scala, to build big 
fat jar files. This allows having all dependencies in one package, so the only 
"cost" is the copy time to deploy this file on every Spark node.

On the other hand, Python deployment is more difficult once you want to use 
external packages, and you don't really want to ask IT to deploy the packages 
in the virtualenv of each node.

*Previous approaches*
I based the current proposal on the two following tickets related to this point:
- SPARK-6764 ("Wheel support for PySpark")
- SPARK-13587 ("Support virtualenv in PySpark")

So here is my proposal:

*Uber Fat Wheelhouse for Python Deployment*
In Python, the packaging standard is now "wheels", which go further than the 
good old ".egg" files. With a wheel file (".whl"), the package is already 
prepared for a given architecture. You can have several wheels, each specific 
to an architecture or environment. 

The {{pip}} tool knows how to select the package matching the current system 
and how to install it at light speed. In other words, a package that requires 
compilation of a C module does *not* compile anything when installed from a 
wheel file.

{{pip}} also provides the ability to easily generate all the wheels of all the 
packages used by a given module (inside a "virtualenv"). This is called a 
"wheelhouse". You can even skip the compilation entirely and retrieve the 
wheels directly from pypi.python.org.

*Developer workflow*
Here is, in a more concrete way, my proposal from a PySpark developer's point 
of view:

- you are writing a PySpark script that increases in size and dependencies. 
Deploying it on Spark, for example, requires building numpy or Theano and other 
dependencies
- to use the "Big Fat Wheelhouse" support of PySpark, you need to turn this 
script into a standard Python package:
-- write a {{requirements.txt}}. I recommend pinning all package versions. You 
can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
requirements.txt
{code}
astroid==1.4.6# via pylint
autopep8==1.2.4
click==6.6# via pip-tools
colorama==0.3.7   # via pylint
enum34==1.1.6 # via hypothesis
findspark==1.0.0  # via spark-testing-base
first==2.0.1  # via pip-tools
hypothesis==3.4.0 # via spark-testing-base
lazy-object-proxy==1.2.2  # via astroid
linecache2==1.0.0 # via traceback2
pbr==1.10.0
pep8==1.7.0   # via autopep8
pip-tools==1.6.5
py==1.4.31# via pytest
pyflakes==1.2.3
pylint==1.5.6
pytest==2.9.2 # via spark-testing-base
six==1.10.0   # via astroid, pip-tools, pylint, unittest2
spark-testing-base==0.0.7.post2
traceback2==1.4.0 # via unittest2
unittest2==1.1.0  # via spark-testing-base
wheel==0.29.0
wrapt==1.10.8 # via astroid
{code}
-- write a setup.py with some entry points or packages. Use 
[PBR|http://docs.openstack.org/developer/pbr/]; it makes the job of maintaining 
a setup.py file really easy
-- create a virtualenv if you are not already in one:
{code}
virtualenv env
{code}
-- work on your environment, define the requirements you need in 
{{requirements.txt}}, do all the {{pip install}} you need.
- create the wheelhouse for your current project:
{code}
pip install wheel
pip wheel . --wheel-dir wheelhouse
{code}
This can take some time, but at the end you have all the .whl files required 
*for your current system*
- zip it into a {{wheelhouse.zip}}.

Note that you can have your own package (for instance 'my_package') generated 
into a wheel as well, and thus installed by {{pip}} automatically.

Now comes the time to submit the project:
{code}
bin/spark-submit  --master master --deploy-mode client --conf 
"spark.pyspark.virtualenv.enabled=true" --conf 
"spark.pyspark.virtualenv.type=native" --conf 
"spark.pyspark.virtualenv.requirements=/path/to/virtualenv/requirements.txt" 
--conf "spark.pyspark.virtualenv.bin.path=virtualenv" --conf 
"spark.pyspark.virtualenv.wheelhouse=/path/to/virtualenv/wheelhouse.zip" 
~/path/to/launcher_script.py
{code}

You can see that:
- no extra argument is added to the command line. All configuration goes 
through the {{--conf}} argument (this has been directly taken from 
SPARK-13587). According to the history of the Spark source code, I guess the 
goal is to simplify the maintenance of the various command line interfaces, by 
avoiding too many specific arguments.
- the command line is pretty complex indeed. I guess with proper documentation 
this might not be a problem
- you still need to define the path to {{requirements.txt}} and 
{{wheelhouse.zip}} (they will be automatically copied to each node). This is 
important since this will allow {{pip install}}, running on each node, 

[jira] [Updated] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-04 Thread Semet (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Semet updated SPARK-16367:
--
Description: 
*Rationale*
It is recommended, in order to deploy packages written in Scala, to build big 
fat jar files. This allows having all dependencies in one package, so the only 
"cost" is the copy time to deploy this file on every Spark node.

On the other hand, Python deployment is more difficult once you want to use 
external packages, and you don't really want to ask IT to deploy the packages 
in the virtualenv of each node.

*Previous approaches*
I based the current proposal on the two following tickets related to this point:
- SPARK-6764 ("Wheel support for PySpark")
- SPARK-13587 ("Support virtualenv in PySpark")

So here is my proposal:

*Uber Fat Wheelhouse for Python Deployment*
In Python, the packaging standard is now "wheels", which go further than the 
good old ".egg" files. With a wheel file (".whl"), the package is already 
prepared for a given architecture. You can have several wheels, each specific 
to an architecture or environment. 

The {{pip}} tool knows how to select the package matching the current system 
and how to install it at light speed. In other words, a package that requires 
compilation of a C module does *not* compile anything when installed from a 
wheel file.

{{pip}} also provides the ability to easily generate all the wheels of all the 
packages used by a given module (inside a "virtualenv"). This is called a 
"wheelhouse". You can even skip the compilation entirely and retrieve the 
wheels directly from pypi.python.org.

*Developer workflow*
Here is, in a more concrete way, my proposal from a PySpark developer's point 
of view:

- you are writing a PySpark script that increases in size and dependencies. 
Deploying it on Spark, for example, requires building numpy or Theano and other 
dependencies
- to use the "Big Fat Wheelhouse" support of PySpark, you need to turn this 
script into a standard Python package:
-- write a {{requirements.txt}}. I recommend pinning all package versions. You 
can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
requirements.txt
{code}
astroid==1.4.6# via pylint
autopep8==1.2.4
click==6.6# via pip-tools
colorama==0.3.7   # via pylint
enum34==1.1.6 # via hypothesis
findspark==1.0.0  # via spark-testing-base
first==2.0.1  # via pip-tools
hypothesis==3.4.0 # via spark-testing-base
lazy-object-proxy==1.2.2  # via astroid
linecache2==1.0.0 # via traceback2
pbr==1.10.0
pep8==1.7.0   # via autopep8
pip-tools==1.6.5
py==1.4.31# via pytest
pyflakes==1.2.3
pylint==1.5.6
pytest==2.9.2 # via spark-testing-base
six==1.10.0   # via astroid, pip-tools, pylint, unittest2
spark-testing-base==0.0.7.post2
traceback2==1.4.0 # via unittest2
unittest2==1.1.0  # via spark-testing-base
wheel==0.29.0
wrapt==1.10.8 # via astroid
{code}
-- write a {{setup.py}} with some entry points or packages. Use 
[PBR|http://docs.openstack.org/developer/pbr/]; it makes the job of maintaining 
a {{setup.py}} file really easy (a minimal sketch follows below)
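A minimal PBR-based layout could look like this (a sketch only; the package 
name and entry point are hypothetical, the real options are in the PBR docs):
{code}
# setup.py -- with PBR the file stays minimal; metadata lives in setup.cfg
from setuptools import setup

setup(
    setup_requires=['pbr'],
    pbr=True,
)

# setup.cfg (a separate file, shown here as comments for illustration):
#
# [metadata]
# name = my_package
# summary = Example PySpark job packaged as a wheel
#
# [files]
# packages =
#     my_package
#
# [entry_points]
# console_scripts =
#     my_job = my_package.main:run
{code}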
-- create a virtualenv if not already in one:
{code}
virtualenv env
{code}
-- work in your environment, define the requirements you need in 
{{requirements.txt}}, and do all the {{pip install}} you need.
- create the wheelhouse for your current project:
{code}
pip install wheel
pip wheel . --wheel-dir wheelhouse
{code}
This can take some time, but at the end you have all the {{.whl}} files 
required *for your current system*
- zip it into a {{wheelhouse.zip}}.

Note that your own package (for instance 'my_package') can also be built into a 
wheel and thus installed by {{pip}} automatically.

Now comes the time to submit the project:
{code}
bin/spark-submit  --master master --deploy-mode client --conf 
"spark.pyspark.virtualenv.enabled=true" --conf 
"spark.pyspark.virtualenv.type=native" --conf 
"spark.pyspark.virtualenv.requirements=/path/to/virtualenv/requirements.txt" 
--conf "spark.pyspark.virtualenv.bin.path=virtualenv" 
"spark.pyspark.virtualenv.wheelhouse=/path/to/virtualenv/wheelhouse.zip"  
~/path/to/launcher_script.py
{code}

You can see that:
- no extra argument is added to the command line. All configuration goes through 
the {{--conf}} argument (this has been taken directly from SPARK-13587). 
According to the history of the Spark source code, I guess the goal is to 
simplify the maintenance of the various command-line interfaces by avoiding too 
many specific arguments.
- the command line is indeed pretty complex. I guess with proper documentation 
this should not be a problem
- you still need to define the paths to {{requirements.txt}} and 
{{wheelhouse.zip}} (they will be automatically copied to each node). This is 
important since it allows {{pip install}}, running on each node, to pick only 
the wheels it needs. For example, if you have a package compiled for both 
32-bit and 64-bit systems, you will have two wheels, and on each node {{pip}} 
will only select the right one, as illustrated below.
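
For illustration, such a wheelhouse built for two platforms could contain 
wheels like these (hypothetical file names; the platform tag at the end is what 
{{pip}} matches against the local system):
{code}
numpy-1.11.1-cp27-cp27mu-manylinux1_x86_64.whl
numpy-1.11.1-cp27-cp27m-manylinux1_i686.whl
my_package-0.0.1-py2.py3-none-any.whl
{code}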

[jira] [Created] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-04 Thread Semet (JIRA)
Semet created SPARK-16367:
-

 Summary: Wheelhouse Support for PySpark
 Key: SPARK-16367
 URL: https://issues.apache.org/jira/browse/SPARK-16367
 Project: Spark
  Issue Type: Improvement
  Components: Deploy, PySpark
Affects Versions: 1.6.2, 1.6.1, 2.0.0
Reporter: Semet


*Rationale*
It is recommended, in order to deploy packages written in Scala, to build big 
fat jar files. This puts all dependencies in one package, so the only "cost" is 
the copy time to deploy this file on every Spark node.

On the other hand, Python deployment is more difficult once you want to use 
external packages, and you don't really want to ask IT to deploy the packages 
in the virtualenv of each node.

*Previous approaches*
I based the current proposal on the two following tickets related to this point:
- SPARK-6764 ("Wheel support for PySpark")
- SPARK-13587 ("Support virtualenv in PySpark")

So here is my proposal:

*Uber Fat Wheelhouse for Python Deployment*
In Python, the packaging standard is now "wheels", which goes further than the 
good old ".egg" files. With a wheel file (".whl"), the package is already built 
for a given architecture. You can have several wheels for a given package 
version, each specific to an architecture or environment.

The {{pip}} tool knows how to select the wheel matching the current system and 
how to install it at light speed, without compilation. In other words, a 
package that requires compilation of a C module does *not* compile anything 
when installed from a wheel file.

{{pip}} also makes it easy to generate the wheels of all packages used in a 
given project (inside a "virtualenv"). This is called a "wheelhouse". You can 
even skip the compilation step entirely and retrieve the wheels directly from 
pypi.python.org.

*Developer workflow*
Here is, more concretely, how my proposal would look for developers:

- you are writing a PySpark script that keeps growing in size and dependencies. 
Deploying it on Spark, for example, requires building numpy or Theano and other 
dependencies
- to use the "Big Fat Wheelhouse" support of PySpark, you need to turn this 
script into a standard Python package:
-- write a {{requirements.txt}}
-- write a {{setup.py}}. Use [PBR|http://docs.openstack.org/developer/pbr/]; it 
makes the job of maintaining a {{setup.py}} file really easy
-- use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
{{requirements.txt}}
-- create a virtualenv if not already in one:
{code}
virtualenv env
{code}
-- work in your environment, define the requirements you need in 
{{requirements.txt}}, and do all the {{pip install}} you need.
- create the wheelhouse for your current project:
{code}
pip install wheel
pip wheel . --wheel-dir wheelhouse
{code}
- zip it into a {{wheelhouse.zip}}.

Note that your own package (for instance 'my_package') can also be built into a 
wheel and thus installed by {{pip}} automatically.

Now comes the time to submit the project:
{code}
bin/spark-submit  --master master --deploy-mode client --conf 
"spark.pyspark.virtualenv.enabled=true" --conf 
"spark.pyspark.virtualenv.type=native" --conf 
"spark.pyspark.virtualenv.requirements=/path/to/virtualenv/requirements.txt" 
--conf "spark.pyspark.virtualenv.bin.path=virtualenv" 
"spark.pyspark.virtualenv.wheelhouse=/path/to/virtualenv/wheelhouse.zip"  
~/path/to/launcher_script.py
{code}

You can see that:
- no extra argument is added to the command line. All configuration goes through 
the {{--conf}} argument (this has been taken directly from SPARK-13587). 
According to the history of the Spark source code, I guess the goal is to 
simplify the maintenance of the various command-line interfaces by avoiding too 
many specific arguments.
- the command line is indeed pretty complex. I guess with proper documentation 
this should not be a problem
- you still need to define the paths to {{requirements.txt}} and 
{{wheelhouse.zip}} (they will be automatically copied to each node). This is 
important since it allows {{pip install}}, running on each node, to pick only 
the wheels it needs. For example, if you have a package compiled for both 
32-bit and 64-bit systems, you will have two wheels, and on each node {{pip}} 
will only select the right one
- I have chosen to keep the script at the end of the command line, but to me it 
is just a launcher script; it can be as short as four lines:
{code}
#!/usr/bin/env python

from mypackage import run
run()
{code}
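
To make the node-side behaviour concrete, here is a rough sketch of what the 
per-node bootstrap implied by this proposal could do (my assumption only, not 
existing Spark code): unpack the wheelhouse and install from it without any 
network access.
{code}
# node_bootstrap_sketch.py -- illustrative only, not actual Spark code
import subprocess
import zipfile

# 1. unpack the wheelhouse shipped via spark.pyspark.virtualenv.wheelhouse
with zipfile.ZipFile("wheelhouse.zip") as zf:
    zf.extractall("wheelhouse")

# 2. create a fresh virtualenv dedicated to this job
subprocess.check_call(["virtualenv", "env"])

# 3. install every requirement from the local wheels only (no index, no network)
subprocess.check_call([
    "env/bin/pip", "install",
    "--no-index", "--find-links=wheelhouse",
    "-r", "requirements.txt",
])
{code}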

*advantages*
- quick installation, since there is no compilation
- works without internet connectivity: no need to mess with the corporate proxy 
or to maintain a local mirror of PyPI
- package version isolation (two Spark jobs can depend on two different 
versions of a given library)

*disadvantages*
- slightly more complex to set up than sending a simple Python script, but that 
feature is not lost
- support for heterogeneous Spark nodes (e.g., 32-bit and 64-bit) is possible, 
but one has to provide wheels for every target architecture in the wheelhouse

[jira] [Commented] (SPARK-6764) Add wheel package support for PySpark

2016-07-04 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15361408#comment-15361408
 ] 

Semet commented on SPARK-6764:
--

Hello
I am working on a new proposal for complete wheel support, along with 
virtualenv. I think this will solve many dependency problems with Python 
packages.

> Add wheel package support for PySpark
> -
>
> Key: SPARK-6764
> URL: https://issues.apache.org/jira/browse/SPARK-6764
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, PySpark
>Reporter: Takao Magoori
>Priority: Minor
>  Labels: newbie
>
> We can do _spark-submit_ with one or more Python packages (.egg, .zip and 
> .jar) using the *--py-files* option.
> h4. zip packaging
> Spark puts a zip file in its working directory and adds the absolute path to 
> Python's sys.path. When the user program imports it, 
> [zipimport|https://docs.python.org/2.7/library/zipimport.html] is 
> automatically invoked under the hood. That is, data files and dynamic 
> modules (.pyd, .so) cannot be used since zipimport supports only .py, .pyc and 
> .pyo.
> h4. egg packaging
> Spark puts an egg file in its working directory and adds the absolute path to 
> Python's sys.path. Unlike zipimport, egg can handle data files and dynamic 
> modules as far as the author of the package uses the [pkg_resources 
> API|https://pythonhosted.org/setuptools/formats.html#other-technical-considerations]
>  properly. But many Python modules do not use the pkg_resources API, which 
> causes "ImportError" or "No such file" errors. Moreover, creating eggs of 
> dependencies and their transitive dependencies is a troublesome job.
> h4. wheel packaging
> Supporting the new Python standard package format 
> "[wheel|https://wheel.readthedocs.org/en/latest/]" would be nice. With wheel, 
> we can do spark-submit with complex dependencies simply as follows.
> 1. Write requirements.txt file.
> {noformat}
> SQLAlchemy
> MySQL-python
> requests
> simplejson>=3.6.0,<=3.6.5
> pydoop
> {noformat}
> 2. Do wheel packaging by only one command. All dependencies are wheel-ed.
> {noformat}
> $ your_pip_dir/pip wheel --wheel-dir /tmp/wheelhouse --requirement 
> requirements.txt
> {noformat}
> 3. Do spark-submit
> {noformat}
> your_spark_home/bin/spark-submit --master local[4] --py-files $(find 
> /tmp/wheelhouse/ -name "*.whl" -print0 | sed -e 's/\x0/,/g') your_driver.py
> {noformat}
> If your pyspark driver is a package which consists of many modules,
> 1. Write setup.py for your pyspark driver package.
> {noformat}
> from setuptools import (
> find_packages,
> setup,
> )
> setup(
> name='yourpkg',
> version='0.0.1',
> packages=find_packages(),
> install_requires=[
> 'SQLAlchemy',
> 'MySQL-python',
> 'requests',
> 'simplejson>=3.6.0,<=3.6.5',
> 'pydoop',
> ],
> )
> {noformat}
> 2. Do wheel packaging by only one command. Your driver package and all 
> dependencies are wheel-ed.
> {noformat}
> your_pip_dir/pip wheel --wheel-dir /tmp/wheelhouse your_driver_package/.
> {noformat}
> 3. Do spark-submit
> {noformat}
> your_spark_home/bin/spark-submit --master local[4] --py-files $(find 
> /tmp/wheelhouse/ -name "*.whl" -print0 | sed -e 's/\x0/,/g') 
> your_driver_bootstrap.py
> {noformat}






[jira] [Comment Edited] (SPARK-13587) Support virtualenv in PySpark

2016-06-29 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355172#comment-15355172
 ] 

Semet edited comment on SPARK-13587 at 6/29/16 12:52 PM:
-

yes it looks cool!
Here is what I have in mind, tell me if it is the wrong direction:
- each job should execute in its own environment.
- I love wheels, and wheelhouses. Provided we build all the needed wheels on a 
machine matching the cluster nodes, or we retrieve the right wheels from PyPI, 
{{pip}} can install all dependencies at lightning speed, without the need for 
an internet connection (no need to configure the proxy for some corporate 
environments, or to maintain an internal mirror, etc.).
- so we deploy the job with a command line such as:
  
{code}
bin/spark-submit --master $(spark_master) --deploy-mode client --conf 
"spark.pyspark.virtualenv.enabled=true" --conf 
"spark.pyspark.virtualenv.type=native" --conf 
"spark.pyspark.virtualenv.wheelhouse=/path/to/wheelhouse.zip" --conf 
"spark.pyspark.virtualenv.script=script_name" --conf 
"spark.pyspark.virtualenv.args='--opt1 --opt2'"
{code}

so:
- {{wheelhouse.zip}} contains all the wheels to install in a fresh virtualenv. 
No internet connection is needed; the script is also deployed and installed, 
provided it is packaged as a proper module (very easy to do with PBR)
- {{spark.pyspark.virtualenv.script}} is the entry point of the script. It 
should be declared in the {{scripts}} section of the {{setup.py}} (see the 
sketch below)
- {{spark.pyspark.virtualenv.args}} allows passing extra arguments to the script
I don't have much experience on YARN or MESOS, what are the big differences?


was (Author: gae...@xeberon.net):
yes it looks cool!
Here is what I have in mind, tell me if it is the wrong direction
- each job should execute in its own environment. 
- I love wheels, and wheelhouse. Providen the fact we build all the needed 
wheels on the same machine as the cluster, of we did retrived the right wheels 
on Pypi, pypi can install all dependencies with lightning speed, without the 
need of an internet connection (have configure the proxy for some corporates, 
or handle an internal mirror, etc).
- so we deploy the job with a command line such as:
  
{code}
bin/spark-submit --master $(spark_master) --deploy-mode client --conf 
"spark.pyspark.virtualenv.enabled=true" --conf 
"spark.pyspark.virtualenv.type=native" --conf 
"spark.pyspark.virtualenv.wheelhouse=/path/to/wheelhouse.zip" --conf 
"spark.pyspark.virtualenv.script=script_name" --conf 
"spark.pyspark.virtualenv.args='--opt1 --opt2'"
{code}

so:
- {{wheelhouse.zip}} contains the whole wheels to install in a fresh 
virtualenv. No internet connection, the script it also deployed and installed, 
provided they go created like a nice module page (so easy to do with pbr)
- {{spark.pyspark.virtualenv.script}} is the execution point of the script. It 
should be declared in the {{script}} section in the {{setup.py}}
- {{spark.pyspark.virtualenv.args}} allows to pass extra arguments to the script

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for user to add third party python packages in 
> pyspark.
> * One way is to using --py-files (suitable for simple dependency, but not 
> suitable for complicated dependency, especially with transitive dependency)
> * Another way is install packages manually on each node (time wasting, and 
> not easy to switch to different environment)
> Python has now 2 different virtualenv implementation. One is native 
> virtualenv another is through conda. This jira is trying to migrate these 2 
> tools to distributed environment






[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-06-29 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15355172#comment-15355172
 ] 

Semet commented on SPARK-13587:
---

yes it looks cool!
Here is what I have in mind, tell me if it is the wrong direction:
- each job should execute in its own environment.
- I love wheels, and wheelhouses. Provided we build all the needed wheels on a 
machine matching the cluster nodes, or we retrieve the right wheels from PyPI, 
{{pip}} can install all dependencies at lightning speed, without the need for 
an internet connection (no need to configure the proxy for some corporate 
environments, or to maintain an internal mirror, etc.).
- so we deploy the job with a command line such as:
  
{code}
bin/spark-submit --master $(spark_master) --deploy-mode client --conf 
"spark.pyspark.virtualenv.enabled=true" --conf 
"spark.pyspark.virtualenv.type=native" --conf 
"spark.pyspark.virtualenv.wheelhouse=/path/to/wheelhouse.zip" --conf 
"spark.pyspark.virtualenv.script=script_name" --conf 
"spark.pyspark.virtualenv.args='--opt1 --opt2'"
{code}

so:
- {{wheelhouse.zip}} contains all the wheels to install in a fresh virtualenv. 
No internet connection is needed; the script is also deployed and installed, 
provided it is packaged as a proper module (very easy to do with PBR)
- {{spark.pyspark.virtualenv.script}} is the entry point of the script. It 
should be declared in the {{scripts}} section of the {{setup.py}}
- {{spark.pyspark.virtualenv.args}} allows passing extra arguments to the script

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for user to add third party python packages in 
> pyspark.
> * One way is to using --py-files (suitable for simple dependency, but not 
> suitable for complicated dependency, especially with transitive dependency)
> * Another way is install packages manually on each node (time wasting, and 
> not easy to switch to different environment)
> Python has now 2 different virtualenv implementation. One is native 
> virtualenv another is through conda. This jira is trying to migrate these 2 
> tools to distributed environment






[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2016-06-27 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351074#comment-15351074
 ] 

Semet commented on SPARK-13587:
---

I back this proposal and am willing to work on it.

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Jeff Zhang
>
> Currently, it's not easy for user to add third party python packages in 
> pyspark.
> * One way is to using --py-files (suitable for simple dependency, but not 
> suitable for complicated dependency, especially with transitive dependency)
> * Another way is install packages manually on each node (time wasting, and 
> not easy to switch to different environment)
> Python has now 2 different virtualenv implementation. One is native 
> virtualenv another is through conda. This jira is trying to migrate these 2 
> tools to distributed environment


