[ 
https://issues.apache.org/jira/browse/SPARK-16367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16495979#comment-16495979
 ] 

Cyril Scetbon commented on SPARK-16367:
---------------------------------------

Is there a better way to do it today? I see that this ticket has been open for 
almost 2 years.

> Wheelhouse Support for PySpark
> ------------------------------
>
>                 Key: SPARK-16367
>                 URL: https://issues.apache.org/jira/browse/SPARK-16367
>             Project: Spark
>          Issue Type: New Feature
>          Components: Deploy, PySpark
>    Affects Versions: 1.6.1, 1.6.2, 2.0.0
>            Reporter: Semet
>            Priority: Major
>              Labels: newbie, python, python-wheel, wheelhouse
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> *Rationale* 
> To deploy packages written in Scala, it is recommended to build big fat jar 
> files. This bundles all dependencies into one package, so the only "cost" is 
> the time needed to copy this file to every Spark node. 
> On the other hand, Python deployment is more difficult once you want to use 
> external packages, and you don't really want to involve IT to deploy the 
> packages into the virtualenv of each node. 
> This ticket proposes to give users the ability to deploy their jobs as 
> "wheel" packages. The Python community strongly advocates this way of 
> packaging and distributing Python applications as the standard one. In other 
> words, this is the "Pythonic way of deployment".
> *Previous approaches* 
> I based the current proposal on the two following issues related to this 
> point: 
> - SPARK-6764 ("Wheel support for PySpark") 
> - SPARK-13587 ("Support virtualenv in PySpark") 
> The first part of my proposal is to merge them, in order to support both 
> wheel installation and virtualenv creation. 
> *Virtualenv, wheel support and "Uber Fat Wheelhouse" for PySpark* 
> In Python, the packaging standard is now the "wheel" file format, which goes 
> further than the good old ".egg" files. With a wheel file (".whl"), the 
> package is already prepared for a given architecture. You can have several 
> wheels for a given package version, each specific to an architecture or 
> environment. For example, look at https://pypi.python.org/pypi/numpy to see 
> all the different wheels available. 
> The {{pip}} tool knows how to select the right wheel file matching the 
> current system, and how to install the package at light speed (without 
> compilation). In other words, a package that requires compilation of a C 
> module, for instance "numpy", does *not* compile anything when installed 
> from a wheel file. 
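> As a rough illustration (the wheel file name here is just an example), the 
> selection is driven by the interpreter version and platform of the machine 
> running {{pip}}: 
> {code} 
> # A wheel file name encodes the Python version and platform it targets, 
> # e.g. numpy-1.11.1-cp27-cp27mu-manylinux1_x86_64.whl. pip compares these 
> # tags against the running interpreter and platform (roughly the values 
> # printed below) and picks the matching wheel. 
> import distutils.util 
> import sys 
> 
> print(sys.version_info[:2])            # e.g. (2, 7) 
> print(distutils.util.get_platform())   # e.g. 'linux-x86_64' 
> {code} 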
> {{pypi.python.org}} already provides wheels for the major Python versions. 
> If a wheel is not available, pip will compile the package from source 
> anyway. Mirroring of PyPI is possible through projects such as 
> http://doc.devpi.net/latest/ (untested) or the PyPI mirror support in 
> Artifactory (tested personally). 
> {{pip}} also makes it easy to generate the wheels of all packages used by a 
> given project inside a "virtualenv". This is called a "wheelhouse". You can 
> even skip the compilation altogether and retrieve the wheels directly from 
> pypi.python.org. 
> *Use Case 1: no internet connectivity* 
> Here is my first proposal for a deployment workflow, in the case where the 
> Spark cluster does not have any internet connectivity or access to a PyPI 
> mirror. In this case the simplest way to deploy a project with several 
> dependencies is to build and then send the complete "wheelhouse": 
> - you are writing a PySpark script that keeps growing in size and 
> dependencies. Deploying it on Spark, for example, requires building numpy or 
> Theano and other dependencies 
> - to use the "Big Fat Wheelhouse" support of PySpark, you need to turn this 
> script into a standard Python package: 
> -- write a {{requirements.txt}}. I recommend pinning all package versions. 
> You can use [pip-tools|https://github.com/nvie/pip-tools] to maintain the 
> requirements.txt, for example: 
> {code} 
> astroid==1.4.6 # via pylint 
> autopep8==1.2.4 
> click==6.6 # via pip-tools 
> colorama==0.3.7 # via pylint 
> enum34==1.1.6 # via hypothesis 
> findspark==1.0.0 # via spark-testing-base 
> first==2.0.1 # via pip-tools 
> hypothesis==3.4.0 # via spark-testing-base 
> lazy-object-proxy==1.2.2 # via astroid 
> linecache2==1.0.0 # via traceback2 
> pbr==1.10.0 
> pep8==1.7.0 # via autopep8 
> pip-tools==1.6.5 
> py==1.4.31 # via pytest 
> pyflakes==1.2.3 
> pylint==1.5.6 
> pytest==2.9.2 # via spark-testing-base 
> six==1.10.0 # via astroid, pip-tools, pylint, unittest2 
> spark-testing-base==0.0.7.post2 
> traceback2==1.4.0 # via unittest2 
> unittest2==1.1.0 # via spark-testing-base 
> wheel==0.29.0 
> wrapt==1.10.8 # via astroid 
> {code} 
> -- write a {{setup.py}} with some entry points or packages. Use 
> [PBR|http://docs.openstack.org/developer/pbr/]; it makes the job of 
> maintaining a {{setup.py}} file really easy (see the sketch below). 
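> A minimal sketch of such a {{setup.py}} (assuming, as PBR documents, that 
> the actual metadata, entry points and package list live in {{setup.cfg}}): 
> {code} 
> # setup.py -- minimal when using PBR: all project metadata lives in setup.cfg 
> import setuptools 
> 
> setuptools.setup( 
>     setup_requires=['pbr'], 
>     pbr=True, 
> ) 
> {code} 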
> -- create a virtualenv if not already in one: 
> {code} 
> virtualenv env 
> {code} 
> -- work in your environment, define the requirements you need in 
> {{requirements.txt}}, and do all the {{pip install}} you need 
> - create the wheelhouse for your current project: 
> {code} 
> pip install wheel 
> pip wheel . --wheel-dir wheelhouse 
> {code} 
> This can take some time, but at the end you have all the .whl files required 
> *for your current system* in a {{wheelhouse}} directory. 
> - zip it into a {{wheelhouse.zip}}, for example as sketched below. 
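> One way to build this archive (a sketch; it assumes the .whl files are 
> expected at the root of the zip, which is equivalent to running "cd 
> wheelhouse && zip -r ../wheelhouse.zip ."): 
> {code} 
> # Zip the content of the "wheelhouse" directory into wheelhouse.zip 
> import shutil 
> 
> shutil.make_archive("wheelhouse", "zip", "wheelhouse") 
> {code} 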
> Note that your own package (for instance 'my_package') can also be built 
> into a wheel and thus installed by {{pip}} automatically. 
> Now comes the time to submit the project: 
> {code} 
> bin/spark-submit --master master --deploy-mode client --files 
> /path/to/virtualenv/requirements.txt,/path/to/virtualenv/wheelhouse.zip 
> --conf "spark.pyspark.virtualenv.enabled=true" ~/path/to/launcher_script.py 
> {code} 
> You can see that: 
> - no extra argument is added to the command line. All configuration goes 
> through the {{--conf}} argument (this has been taken directly from 
> SPARK-13587). According to the history of the Spark source code, I guess the 
> goal is to simplify the maintenance of the various command line interfaces 
> by avoiding too many specific arguments. 
> - the wheelhouse deployment is triggered by the {{\-\-conf 
> "spark.pyspark.virtualenv.enabled=true"}} argument. The {{requirements.txt}} 
> and {{wheelhouse.zip}} are copied through {{--files}}. The names of both 
> files can be changed through {{\-\-conf}} arguments. I guess with proper 
> documentation this should not be a problem 
> - you still need to give the paths to {{requirements.txt}} and 
> {{wheelhouse.zip}} (they will be automatically copied to each node). This is 
> important since it allows {{pip install}}, running on each node, to pick 
> only the wheels it needs. For example, if you have a package compiled for 
> both 32 bits and 64 bits, you will have 2 wheels, and on each node {{pip}} 
> will select only the right one 
> - I have chosen to keep the script at the end of the command line, but to me 
> it is just a launcher script; it can be as short as 4 lines: 
> {code} 
> #!/usr/bin/env python 
> from mypackage import run 
> run() 
> {code} 
> - on each node, a new virtualenv is created *at each deployment*. This has a 
> cost, but not a big one, since {{pip install}} will only install wheels; no 
> compilation nor internet connection will be required. The command line for 
> installing the wheels on each node will look like: 
> {code} 
> pip install --no-index --find-links=/path/to/node/wheelhouse -r 
> requirements.txt 
> {code} 
> *advantages* 
> - quick installation, since there is no compilation 
> - works without Internet connectivity; no need to mess with the corporate 
> proxy or to require a local mirror of PyPI 
> - package version isolation (two Spark jobs can depend on two different 
> versions of a given library) 
> *disadvantages* 
> - creating a virtualenv at each execution takes time; not that much, but it 
> can still take some seconds 
> - and disk space 
> - slightly more complex to set up than sending a simple Python script, but 
> that possibility is not lost 
> - support for heterogeneous Spark nodes (e.g. 32 bits, 64 bits) is possible, 
> but one has to send all wheel flavours and ensure pip is able to install on 
> every environment. The complexity of this task is in the hands of the 
> developer and no longer of the IT staff! (IMHO, this is an advantage) 
> *Use Case 2: the Spark cluster has access to Pypi or a mirror of Pypi* 
> This is the more elegant situation. The Spark cluster (each node) can 
> install the dependencies of your project on its own, from the wheels 
> provided by PyPI. Your internal dependencies and your job project can come 
> as independent wheel files as well. In this case the workflow is much 
> simpler: 
> - turn your project into a Python package 
> - write {{requirements.txt}} and {{setup.py}} like in Use Case 1 
> - create the wheels with {{pip wheel}}. But now we will not send *ALL* the 
> dependencies, only the ones that are not on PyPI (the current job project, 
> other internal dependencies, etc.) 
> - no need to create a wheelhouse. You can still ship the wheels either with 
> {{--py-files}} (they will be automatically installed) or inside a wheelhouse 
> named {{wheelhouse.zip}} 
> Deployment then becomes: 
> {code} 
> bin/spark-submit --master master --deploy-mode client --files 
> /path/to/project/requirements.txt --py-files 
> /path/to/project/internal_dependency_1.whl,/path/to/project/internal_dependency_2.whl,/path/to/project/current_project.whl
>  --conf "spark.pyspark.virtualenv.enabled=true" --conf 
> "spark.pyspark.virtualenv.index_url=http://pypi.mycompany.com/"; 
> ~/path/to/launcher_script.py 
> {code} 
> or with a wheelhouse that only contains the internal dependencies and the 
> current project's wheels: 
> project wheels: 
> {code} 
> bin/spark-submit --master master --deploy-mode client --files 
> /path/to/project/requirements.txt,/path/to/project/wheelhouse.zip --conf 
> "spark.pyspark.virtualenv.enabled=true" --conf 
> "spark.pyspark.virtualenv.index_url=http://pypi.mycompany.com/"; 
> ~/path/to/launcher_script.py 
> {code} 
> or, if you want to use the official PyPI or have configured {{pip.conf}} to 
> hit the internal PyPI mirror (see the doc below): 
> {code} 
> bin/spark-submit --master master --deploy-mode client --files 
> /path/to/project/requirements.txt,/path/to/project/wheelhouse.zip --conf 
> "spark.pyspark.virtualenv.enabled=true" ~/path/to/launcher_script.py 
> {code} 
> On each node, the deployment will be done with a command such as: 
> {code} 
> pip install --index-url http://pypi.mycompany.com 
> --find-links=/path/to/node/wheelhouse -r requirements.txt 
> {code} 
> Note: 
> - {{\-\-conf 
> "spark.pyspark.virtualenv.index_url=http://pypi.mycompany.com/"}} allows you 
> to specify a PyPI mirror, for example a mirror internal to your company 
> network. If not provided, the default PyPI index (pypi.python.org) will be 
> used 
> - to send a wheelhouse, use {{\-\-files}}. To send individual wheels, use 
> {{\-\-py-files}}. With the latter, all wheels will be installed. For a 
> multi-architecture cluster, prepare all the needed wheels for all 
> architectures and use a wheelhouse archive; this allows {{pip}} to choose 
> the right version of each wheel automatically. 
> *code submission* 
> I have already started working on this point, starting by merging the 2 
> pull requests [#5408|https://github.com/apache/spark/pull/5408] and 
> [#13599|https://github.com/apache/spark/pull/13599]. 
> I'll upload a patch ASAP for review. 
> I see two major open questions: 
> - I don't know YARN or Mesos that well, so I might require some help for the 
> final integration 
> - the documentation should really be carefully crafted so users are not lost 
> among all these concepts 
> I really think having this "wheelhouse" support in Spark will help with 
> using, maintaining, and evolving Python scripts on Spark. Python has a rich 
> set of mature libraries, and Spark should do everything it can to help 
> developers easily access and use them in their everyday jobs. 
> *Important notes about complex packages such as numpy* 
> Numpy is the kind of package that takes several minutes to build and deploy, 
> and we want to avoid having all nodes install it from source each time. PyPI 
> provides several precompiled wheels, but it may happen that none of them is 
> right for your platform or the platform of your cluster. 
> Wheels are *not* cached for pip versions < 7.0. From pip 7.0 onwards, wheels 
> are automatically cached when built (if needed), so the first installation 
> might take some time, but afterwards installation will be straightforward. 
> On most of my machines, numpy is installed without any compilation thanks to 
> wheels. 
> *Certificate* 
> pip does not use the system SSL certificates. If you use a local PyPI mirror 
> behind HTTPS with an internal certificate, you'll have to set up pip 
> correctly with the following content in {{~/.pip/pip.conf}}: 
> {code} 
> [global] 
> cert = /path/to/your/internal/certificates.pem 
> {code} 
> The first build might take some time, but pip will automatically cache the 
> wheel for your system in {{~/.cache/pip/wheels}}. You can of course recreate 
> the wheel with {{pip wheel}} or find it in {{~/.cache/pip/wheels}}. 
> You can use {{pip -v install numpy}} to see where it has placed the wheel in 
> the cache. 
> If you use Artifactory, you can upload your wheels to a local, central cache 
> that can be shared across all your slaves. See [this 
> documentation|https://www.jfrog.com/confluence/display/RTF/PyPI+Repositories#PyPIRepositories-LocalRepositories]
>  to see how this works. This way, you can insert wheels into this local 
> cache and they will be seen as if they had been uploaded to the official 
> repository (the local cache and the remote cache can be "merged" into a 
> virtual repository with Artifactory). 
> *Setting use of an internal PyPI mirror* 
> Ask your IT to update the {{~/.pip/pip.conf}} of each node to point to the 
> internal mirror by default: 
> {code} 
> [global] 
> ; Low timeout 
> timeout = 20 
> index-url = https://<user>:<pass>@pypi.mycompany.org/ 
> {code} 
> Now there is no more need to specify {{\-\-conf 
> "spark.pyspark.virtualenv.index_url=http://pypi.mycompany.com/"}} in your 
> spark-submit command line. 
> Note: this will not work when installing a package with the {{python 
> setup.py install}} syntax. In that case you need to update {{~/.pypirc}} and 
> use the {{-r}} argument. This syntax is not used by spark-submit. 
> *Extra*
> This approach is explained in more detail in the [following blog 
> post|http://www.great-a-blog.co/wheel-deployment-for-pyspark/]. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
