[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark

2018-05-30 Thread Cyril Scetbon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16495979#comment-16495979
 ] 

Cyril Scetbon commented on SPARK-16367:
---

Is there a better way to do this today? I see that this ticket has been open for 
almost 2 years.

> Wheelhouse Support for PySpark
> --
>
> Key: SPARK-16367
> URL: https://issues.apache.org/jira/browse/SPARK-16367
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, PySpark
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Semet
>Priority: Major
>  Labels: newbie, python, python-wheel, wheelhouse
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> *Rationale* 
> In order to deploy packages written in Scala, it is recommended to build big 
> fat jar files. This puts all dependencies into one package, so the only "cost" 
> is the copy time to deploy this file on every Spark node. 
> Python deployment, on the other hand, is more difficult once you want to use 
> external packages, and you don't really want to involve IT to deploy the 
> packages into the virtualenv of each node. 
> This ticket proposes to give users the ability to deploy their job as "wheel" 
> packages. The Python community strongly advocates this way of packaging and 
> distributing Python applications as the "standard way of deploying a Python 
> app". In other words, this is the "Pythonic Way of Deployment".
> *Previous approaches* 
> I based the current proposal on the two following tickets related to this 
> point: 
> - SPARK-6764 ("Wheel support for PySpark") 
> - SPARK-13587 ("Support virtualenv in PySpark")
> The first part of my proposal is to merge them, in order to support wheel 
> installation and virtualenv creation. 
> *Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark* 
> In Python, the packaging standard is now the "wheel" file format, which goes 
> further than the good old ".egg" files. With a wheel file (".whl"), the package 
> is already prepared for a given architecture. You can have several wheels for 
> a given package version, each specific to an architecture or environment. 
> For example, look at https://pypi.python.org/pypi/numpy to see all the 
> different wheels available. 
> The {{pip}} tool knows how to select the right wheel file matching the 
> current system, and how to install the package very quickly (without 
> compilation). Put differently, a package that requires compilation of a C 
> module, for instance "numpy", does *not* compile anything when installed 
> from a wheel file. 
> {{pypi.python.org}} already provides wheels for the major Python versions. If 
> a wheel is not available, pip will compile the package from source anyway. 
> Mirroring of PyPI is possible through projects such as 
> http://doc.devpi.net/latest/ (untested) or the PyPI mirror support in 
> Artifactory (tested personally). 
> {{pip}} also provides the ability to easily generate the wheels of all 
> packages used by a given project inside a "virtualenv". This is called a 
> "wheelhouse". You can even skip the compilation entirely and retrieve the 
> wheels directly from pypi.python.org. 
> *Use Case 1: no internet connectivity* 
> Here is my first proposal for a deployment workflow, for the case where the 
> Spark cluster does not have any internet connectivity or access to a PyPI 
> mirror. In this case the simplest way to deploy a project with several 
> dependencies is to build and then ship the complete "wheelhouse": 
> - you are writing a PySpark script that keeps growing in size and 
> dependencies. Deploying it on Spark, for example, requires building numpy or 
> Theano and other dependencies 
> - to use the "Big Fat Wheelhouse" support of PySpark, you need to turn your 
> script into a standard Python package: 
> -- write a {{requirements.txt}}. I recommend pinning all package versions. 
> You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
> requirements.txt 
> {code} 
> astroid==1.4.6 # via pylint 
> autopep8==1.2.4 
> click==6.6 # via pip-tools 
> colorama==0.3.7 # via pylint 
> enum34==1.1.6 # via hypothesis 
> findspark==1.0.0 # via spark-testing-base 
> first==2.0.1 # via pip-tools 
> hypothesis==3.4.0 # via spark-testing-base 
> lazy-object-proxy==1.2.2 # via astroid 
> linecache2==1.0.0 # via traceback2 
> pbr==1.10.0 
> pep8==1.7.0 # via autopep8 
> pip-tools==1.6.5 
> py==1.4.31 # via pytest 
> pyflakes==1.2.3 
> pylint==1.5.6 
> pytest==2.9.2 # via spark-testing-base 
> six==1.10.0 # via astroid, pip-tools, pylint, unittest2 
> spark-testing-base==0.0.7.post2 
> traceback2==1.4.0 # via unittest2 
> unittest2==1.1.0 # via spark-testing-base 
> wheel==0.29.0 
> wrapt==1.10.8 # via astroid 
> {code} 
> -- write a setup.py with some entry points or packages. Use 
> [PBR|http://docs.openstack.org/developer/pbr/]; it makes the job of 
> maintaining a setup.py file really easy 
> -- create a virtualenv if not already in one: 
> {code} 
> virtualenv env 
> {code} 
> -- work on your environment: define the requirements you need in 
> {{requirements.txt}}, do all the {{pip install}} you need 
> - create the wheelhouse for your current project (a sketch of this step 
> follows the quote below) 
> 
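As a concrete illustration of the wheelhouse-building step described above, here is a minimal sketch using only standard {{pip}} commands. The paths and the final zip step are assumptions for illustration; the spark-submit side proposed in this ticket is not shown.

{code}
# create and activate an isolated environment (assumes virtualenv is installed)
virtualenv env
. env/bin/activate

# install the project's pinned dependencies
pip install -r requirements.txt

# build wheels for every dependency, and for the project itself, into ./wheelhouse
pip install wheel
pip wheel --wheel-dir ./wheelhouse -r requirements.txt
pip wheel --wheel-dir ./wheelhouse .

# archive the wheelhouse so it can be shipped to the cluster as a single file
zip -r wheelhouse.zip wheelhouse
{code}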

[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark

2017-10-24 Thread Dan Blanchard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217354#comment-16217354
 ] 

Dan Blanchard commented on SPARK-16367:
---

Right, sorry. I just meant to point out that the proposal you have up there has 
someone installing the {{wheelhouse}} package, which is unnecessary. 


[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark

2017-10-24 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217322#comment-16217322
 ] 

Semet commented on SPARK-16367:
---

Yes, I don't use it because it is a feature of {{pip}}:
{code}
pip wheel --wheel-dir wheelhouse .
{code}

It is described [here|https://wheel.readthedocs.io/en/stable/].
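For the receiving side, a hedged sketch of installing from such a directory of wheels on a node without internet access, using only standard {{pip}} options ({{./wheelhouse}} is the directory produced by the command above):

{code}
# install every pinned dependency from the shipped wheelhouse, never touching the network
pip install --no-index --find-links ./wheelhouse -r requirements.txt
{code}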


[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark

2017-10-24 Thread Dan Blanchard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16217253#comment-16217253
 ] 

Dan Blanchard commented on SPARK-16367:
---

You don't actually appear to use the {{wheelhouse}} Python package at all in 
this proposal. You use {{pip wheel --wheel-dir=/path/to/wheelhouse}} to create 
a directory with all the wheels for the project, and that just requires {{pip}} 
and {{wheel}}. {{wheelhouse}} is a separate package for performing operations 
on those directories, but you do not appear to use it.

The whole wheelhouse-as-a-name-for-a-directory-of-wheels thing is just a pun 
that a lot of people like to use, separate from the {{wheelhouse}} package.


[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark

2017-01-10 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15815466#comment-15815466
 ] 

Semet commented on SPARK-16367:
---

Yes, in our pull request both conda and pip are supported. Wheels allow pip to 
behave like conda, i.e., numpy is not recompiled every time.
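As an illustration of the "behave like conda" point, a recent {{pip}} can even be told to use prebuilt wheels only, so it never falls back to compiling from source (a sketch, not part of the pull request itself):

{code}
# install numpy from a binary wheel only; pip fails instead of compiling if no compatible wheel exists
pip install --only-binary :all: numpy
{code}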


[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark

2017-01-10 Thread Benjamin Zaitlen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15815083#comment-15815083
 ] 

Benjamin Zaitlen commented on SPARK-16367:
--

[~gae...@xeberon.net] do you also have plans to support conda packages and 
environments?


[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark

2016-09-07 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15470328#comment-15470328
 ] 

Semet commented on SPARK-16367:
---

Blog post about PySpark job deployment: 
http://www.great-a-blog.co/wheel-deployment-for-pyspark/


[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-13 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15375005#comment-15375005
 ] 

Semet commented on SPARK-16367:
---

I have sent a design doc and pull request. They are deliberately based on 
[#13599|https://github.com/apache/spark/pull/13599] and the [design doc from 
SPARK-13587|https://docs.google.com/document/d/1MpURTPv0xLvIWhcJdkc5lDMWYBRJ4zAQ69rP2WA8-TM/edit?usp=sharing].


[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-06 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15365599#comment-15365599
 ] 

Semet commented on SPARK-16367:
---

Wheels are tagged by OS, architecture and Python version, and that seems to be 
enough for a wheel compiled on one machine to work on another, compatible one. 
{{pip install}} is responsible for finding the right wheel for the requested 
module.

For example, on my machine, when I do a "pip install numpy" there is no 
compilation; pip takes the binary wheel directly from PyPI, so installation is 
fast. But if you have an older version of Python, for instance 2.6, since there 
are no wheels for 2.6, pip install will compile the C modules and store the 
resulting wheel in ~/.cache/pip, so future installations will not require 
compilation.

You can even take this wheel and add it to your pypi-local repository on 
Artifactory so the package becomes available on your PyPI mirror (see the 
Artifactory documentation on PyPI support).
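To make the tagging concrete, here is a sketch of how a wheel file name encodes what it was built for (the version number is illustrative), plus the standard {{pip}} command to materialize such a wheel in a local directory so it can be uploaded to a private repository:

{code}
# numpy-1.11.1-cp27-cp27mu-manylinux1_x86_64.whl
#   numpy              distribution name
#   1.11.1             version
#   cp27               Python tag (CPython 2.7)
#   cp27mu             ABI tag
#   manylinux1_x86_64  platform tag
# pip only installs a wheel whose tags are compatible with the running interpreter and OS.

# build (or fetch) the wheel into a local directory instead of relying only on the pip cache
pip wheel --wheel-dir ./wheelhouse numpy
{code}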


[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-06 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15364972#comment-15364972
 ] 

Jeff Zhang commented on SPARK-16367:


[~gae...@xeberon.net] I still don't understand how binary wheels work across 
machines with different operating systems, since you build the wheel on the 
client machine.
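For reference, one way to check which wheel tags a given machine's interpreter accepts (and therefore whether a wheel built on the client is installable there) is pip's debug command; this is a hedged example, since {{pip debug}} only exists in newer pip releases and is marked experimental:

{code}
pip debug --verbose   # lists the compatible tags, e.g. cp27-cp27mu-manylinux1_x86_64
{code}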


[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-06 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363980#comment-15363980
 ] 

Semet commented on SPARK-16367:
---

Description of the bug updated :)


[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-06 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363902#comment-15363902
 ] 

Semet commented on SPARK-16367:
---

I was wrong on one point: even if each node fetches independently from the 
official PyPI or a mirror, the nodes won't compile anything, since they can 
find the precompiled wheels.

I'll add an option to specify the PyPI URL to use for fetching. For users that 
have a PyPI mirror (Artifactory or the project you mentioned), wheelhouse 
creation is not even needed. Only the current script (and dependencies that are 
not on PyPI, for example other internal projects) will have to be sent as a 
wheel or egg to be installed properly in the virtualenv.

That sounds like a good solution. But for users whose Spark cluster is not 
connected to the Internet or to a PyPI mirror, I will keep the ability to 
install from a full wheelhouse.
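A sketch of what pointing {{pip}} at such a mirror could look like on the nodes (the Artifactory URL is hypothetical):

{code}
pip install --index-url https://artifactory.example.com/artifactory/api/pypi/pypi-remote/simple numpy
{code}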


[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-06 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363899#comment-15363899
 ] 

Semet commented on SPARK-16367:
---

Yes, and there is Artifactory, which has an automatic mirroring capability for 
pypi. It works pretty well.
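
As a sketch, pointing every node at such a mirror only takes a global pip 
configuration (the repository URL below is a placeholder for whatever the 
Artifactory instance exposes):

{code}
# /etc/pip.conf on each Spark node: pip then resolves all packages
# through the Artifactory PyPI remote repository.
[global]
index-url = https://artifactory.example.com/api/pypi/pypi-remote/simple
{code}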

> Wheelhouse Support for PySpark
> --
>
> Key: SPARK-16367
> URL: https://issues.apache.org/jira/browse/SPARK-16367
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, PySpark
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Semet
>  Labels: newbie, python, python-wheel, wheelhouse
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> *Rationale*
> It is recommended, in order to deploy packages written in Scala, to build big 
> fat jar files. This allows having all dependencies in one package, so the only 
> "cost" is the copy time to deploy this file on every Spark node.
> On the other hand, Python deployment is more difficult once you want to use 
> external packages, and you don't really want to ask IT to deploy the packages 
> into the virtualenv of each node.
> *Previous approaches*
> I based the current proposal on the two following tickets related to this 
> point:
> - SPARK-6764 ("Wheel support for PySpark")
> - SPARK-13587 ("Support virtualenv in PySpark")
> The first part of my proposal is to merge them, in order to support wheel 
> installation and virtualenv creation.
> *Uber Fat Wheelhouse for Python Deployment*
> In Python, the packaging standard is now "wheels", which go further than the 
> good old ".egg" files. With a wheel file (".whl"), the package is already 
> prepared for a given architecture. You can have several wheels for a given 
> package version, each specific to an architecture or environment. 
> The {{pip}} tool knows how to select the wheel matching the current system and 
> how to install it at light speed. Said otherwise, a package that requires 
> compilation of a C module, for instance, does *not* compile anything when 
> installed from a wheel file.
> {{pip}} also provides the ability to easily generate the wheels of all 
> packages used by a given project (inside a "virtualenv"). This is called a 
> "wheelhouse". You can even skip the compilation entirely and retrieve the 
> wheels directly from pypi.python.org.
> *Developer workflow*
> Here is, in a more concrete way, my proposal from a PySpark developer's point 
> of view:
> - you are writing a PySpark script that keeps growing in size and 
> dependencies. Deploying it on Spark, for example, requires building numpy or 
> Theano and other dependencies
> - to use the "Big Fat Wheelhouse" support of PySpark, you need to turn your 
> script into a standard Python package:
> -- write a {{requirements.txt}}. I recommend pinning all package versions. 
> You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
> requirements.txt:
> {code}
> astroid==1.4.6# via pylint
> autopep8==1.2.4
> click==6.6# via pip-tools
> colorama==0.3.7   # via pylint
> enum34==1.1.6 # via hypothesis
> findspark==1.0.0  # via spark-testing-base
> first==2.0.1  # via pip-tools
> hypothesis==3.4.0 # via spark-testing-base
> lazy-object-proxy==1.2.2  # via astroid
> linecache2==1.0.0 # via traceback2
> pbr==1.10.0
> pep8==1.7.0   # via autopep8
> pip-tools==1.6.5
> py==1.4.31# via pytest
> pyflakes==1.2.3
> pylint==1.5.6
> pytest==2.9.2 # via spark-testing-base
> six==1.10.0   # via astroid, pip-tools, pylint, unittest2
> spark-testing-base==0.0.7.post2
> traceback2==1.4.0 # via unittest2
> unittest2==1.1.0  # via spark-testing-base
> wheel==0.29.0
> wrapt==1.10.8 # via astroid
> {code}
> -- write a setup.py with some entry points or packages. Use 
> [PBR|http://docs.openstack.org/developer/pbr/]; it makes the job of 
> maintaining a setup.py file really easy
> -- create a virtualenv if not already in one:
> {code}
> virtualenv env
> {code}
> -- Work in your environment, define the requirements you need in 
> {{requirements.txt}}, and do all the {{pip install}} you need.
> - create the wheelhouse for your current project
> {code}
> pip install wheelhouse
> pip wheel . --wheel-dir wheelhouse
> {code}
> This can take some time, but at the end you have all the .whl files required 
> *for your current system*.
> - zip it into a {{wheelhouse.zip}}.
> Note that your own package (for instance 'my_package') can also be generated 
> as a wheel and thus installed by {{pip}} automatically.
> Now comes the time to submit the project:
> {code}
> bin/spark-submit  --master master --deploy-mode client --files 
> /path/to/virtualenv/requirements.txt,/path/to/virtualenv/wheelhouse.zip 
> --conf "spark.pyspark.virtualenv.enabled=true" ~/path/to/launcher_script.py
> {code}
> You 

[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-06 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363898#comment-15363898
 ] 

Semet commented on SPARK-16367:
---

For me, you as a developer are already inside a virtualenv when you write your 
script, so you have already compiled everything on your machine during pip 
install. And you only need to create the wheels once.

I also advocate sending your script as a wheel so that it is already packaged 
as well, but this is not mandatory.

The idea is to follow the same pattern as the uber fat jar: send everything 
the Python executor will need to run your Spark job, without assuming anything 
is already installed on each node.

This indeed costs the creation time of the wheels. But once created (it takes 
a few seconds, maybe a minute if you build numpy), the deployment cost is 
really low. Each node won't have to compile independently; it is just file 
transfers and unzipping, automatically handled by pip.

Users can also directly copy wheels from pypi.python.org, for numpy for 
example.

I am not a fan of letting the cluster fetch from the Internet; you'll have to 
ensure the proxy is correctly set in a corporate environment, which might not 
be the case everywhere. But yes, I would also provide it as an option. It is 
just more expensive to deploy: each node fetches and compiles, versus all 
wheels being prepared in advance and pip only choosing the right one to unzip.
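
As a sketch of that deployment flow (the master URL and paths are placeholders, 
and {{spark.pyspark.virtualenv.enabled}} is the property proposed in 
SPARK-13587/this ticket, not an existing Spark option):

{code}
# Archive the pre-built wheelhouse and ship it with the job; nothing is
# assumed to be installed on the executor nodes.
zip -r wheelhouse.zip wheelhouse

bin/spark-submit --master yarn --deploy-mode client \
  --files /path/to/requirements.txt,/path/to/wheelhouse.zip \
  --conf "spark.pyspark.virtualenv.enabled=true" \
  /path/to/launcher_script.py
{code}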


[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-05 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363509#comment-15363509
 ] 

Jeff Zhang commented on SPARK-16367:


Preparing the wheelhouse seems time-consuming to me, especially when many 
packages are needed and they have dependencies of their own. If the Internet 
is accessible, I would rather ask the cluster to do that.
[~gae...@xeberon.net] Do you know whether local repositories are supported by 
Python, so that a cluster administrator can create a private wheelhouse that 
all the machines in the cluster can access, just like a private Maven 
repository?
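
One way this already works with plain pip, assuming the administrator exposes 
a pre-built wheel directory on a shared mount (the path below is a 
placeholder):

{code}
# Install only from the shared wheel directory, never from the Internet.
pip install --no-index --find-links /mnt/shared/wheelhouse -r requirements.txt
{code}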


[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-05 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363513#comment-15363513
 ] 

Jeff Zhang commented on SPARK-16367:


Oh, I happened to find this project for building a local Python package 
repository. Maybe users can use it: http://doc.devpi.net/latest/
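
If such a devpi instance is running, pip can be pointed at it like any other 
index (the host, port, and index path below are placeholders):

{code}
# Resolve packages through the local devpi mirror instead of pypi.python.org.
pip install --index-url http://devpi.internal:3141/root/pypi/+simple/ numpy
{code}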



[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-05 Thread Semet (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15363189#comment-15363189
 ] 

Semet commented on SPARK-16367:
---

This is where the magic of wheels lies:
- look at https://pypi.python.org/pypi/numpy : there are wheels for various 
Python versions, 32/64 bit, Linux/Mac/Windows. Simply copy them from 
pypi.python.org, drop them in, and that's all; no compilation needed
- no compilation is needed upon installation either, and if all wheels are put 
in the wheelhouse archive, the installation only consists of package unzipping 
(automatically handled by pip install)
- the creation of the wheelhouse is really simple: {{pip install wheel}}, then 
{{pip wheel}} (see the sketch below). I'll write a tutorial in the 
documentation.

I have actually rebased your patch, so the cache thing will be kept :)
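
For reference, a minimal sketch of that creation step with the exact command 
names (assuming the project's pinned {{requirements.txt}}):

{code}
# One-time step on the developer machine: collect every wheel into a local
# wheelhouse, downloading prebuilt wheels from PyPI when they exist.
pip install wheel
pip wheel -r requirements.txt --wheel-dir wheelhouse
{code}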


[jira] [Commented] (SPARK-16367) Wheelhouse Support for PySpark

2016-07-05 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15362764#comment-15362764
 ] 

Jeff Zhang commented on SPARK-16367:


[~gae...@xeberon.net] Thanks for the new idea; this makes the community more 
powerful. 
Here are my few concerns and comments.

bq. support of heterogenous Spark nodes (ex: 32 bits, 64 bits) is possible but 
one has to send all wheels flavours and ensure pip is able to install in every 
environment.
What about different OSes? Can wheels compiled on the client machine be used 
on a different OS? This is my largest concern with this approach. 

bq. and disk space
For YARN, this is not a problem, because the container is cleaned up after it 
exits.

I am not sure whether the extra steps for creating the wheelhouse are too 
complicated for users. 
BTW, in the approach of SPARK-13587, I specified the cache dir when creating 
the virtualenv. That means compilation is only needed the first time; after 
that, each installation picks up the wheel file from the cache dir. 
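
For illustration, a sketch of that caching behaviour with plain pip (the 
shared cache path is a placeholder):

{code}
# The first install compiles and fills the cache; later installs on the
# same node reuse the cached wheels instead of recompiling.
pip install --cache-dir /var/cache/pip-wheels -r requirements.txt
{code}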


> Wheelhouse Support for PySpark
> --
>
> Key: SPARK-16367
> URL: https://issues.apache.org/jira/browse/SPARK-16367
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, PySpark
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Semet
>  Labels: newbie, python, python-wheel, wheelhouse
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> *Rationale*
> It is recommended, in order to deploy packages written in Scala, to build big 
> fat jar files. This allows having all dependencies in one package, so the only 
> "cost" is the copy time to deploy this file on every Spark node.
> On the other hand, Python deployment is more difficult once you want to use 
> external packages, and you don't really want to ask IT to deploy the packages 
> into the virtualenv of each node.
> *Previous approaches*
> I based the current proposal on the two following tickets related to this 
> point:
> - SPARK-6764 ("Wheel support for PySpark")
> - SPARK-13587 ("Support virtualenv in PySpark")
> The first part of my proposal is to merge them, in order to support wheel 
> installation and virtualenv creation.
> *Uber Fat Wheelhouse for Python Deployment*
> In Python, the packaging standard is now "wheels", which go further than the 
> good old ".egg" files. With a wheel file (".whl"), the package is already 
> prepared for a given architecture. You can have several wheels for a given 
> package version, each specific to an architecture or environment. 
> The {{pip}} tool knows how to select the wheel matching the current system and 
> how to install it at light speed. Said otherwise, a package that requires 
> compilation of a C module, for instance, does *not* compile anything when 
> installed from a wheel file.
> {{pip}} also provides the ability to easily generate the wheels of all 
> packages used by a given project (inside a "virtualenv"). This is called a 
> "wheelhouse". You can even skip the compilation entirely and retrieve the 
> wheels directly from pypi.python.org.
> *Developer workflow*
> Here is, in a more concrete way, my proposal from a PySpark developer's point 
> of view:
> - you are writing a PySpark script that keeps growing in size and 
> dependencies. Deploying it on Spark, for example, requires building numpy or 
> Theano and other dependencies
> - to use the "Big Fat Wheelhouse" support of PySpark, you need to turn your 
> script into a standard Python package:
> -- write a {{requirements.txt}}. I recommend pinning all package versions. 
> You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
> requirements.txt:
> {code}
> astroid==1.4.6# via pylint
> autopep8==1.2.4
> click==6.6# via pip-tools
> colorama==0.3.7   # via pylint
> enum34==1.1.6 # via hypothesis
> findspark==1.0.0  # via spark-testing-base
> first==2.0.1  # via pip-tools
> hypothesis==3.4.0 # via spark-testing-base
> lazy-object-proxy==1.2.2  # via astroid
> linecache2==1.0.0 # via traceback2
> pbr==1.10.0
> pep8==1.7.0   # via autopep8
> pip-tools==1.6.5
> py==1.4.31# via pytest
> pyflakes==1.2.3
> pylint==1.5.6
> pytest==2.9.2 # via spark-testing-base
> six==1.10.0   # via astroid, pip-tools, pylint, unittest2
> spark-testing-base==0.0.7.post2
> traceback2==1.4.0 # via unittest2
> unittest2==1.1.0  # via spark-testing-base
> wheel==0.29.0
> wrapt==1.10.8 # via astroid
> {code}
> -- write a setup.py with some entry points or packages. Use 
> [PBR|http://docs.openstack.org/developer/pbr/]; it makes the job of 
> maintaining a setup.py file really easy
> -- create a virtualenv if not already in one:
> {code}
> virtualenv env
> {code}
> -- Work on your environment, define the