Not totally sure it will help your use case, but I'd recommend that
you consider these too (rough sketch of the pex route below the list):
* pex <https://github.com/pantsbuild/pex> A library and tool for
generating .pex (Python EXecutable) files
* cluster-pack <https://github.com/criteo/cluster-pack> cluster-pack
is a library on top of either pex or conda-pack to make your Python
code easily available on a cluster.
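If you go the pex route, the flow is roughly: build a .pex from your
requirements with the pex CLI (e.g. "pex -r requirements.txt -o deps.pex"),
ship it with the job, and point the executors' Python at it. A rough,
untested sketch follows; the file names and configs are just my
assumptions, and cluster-pack automates most of this for you:

from pyspark.sql import SparkSession

# Ship a pre-built deps.pex to the executors and use it as their Python
# interpreter, so compiled (.so) dependencies travel with the job.
spark = (
    SparkSession.builder
    .appName("pex-demo")
    .config("spark.files", "deps.pex")
    .config("spark.executorEnv.PYSPARK_PYTHON", "./deps.pex")
    .getOrCreate()
)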
Masood
__________________
Masood Krohy, Ph.D.
Data Science Advisor|Platform Architect
https://www.analytical.works
On 6/5/20 4:29 AM, Stone Zhong wrote:
Thanks Dark. I looked at that article. I think the article describes
approach B; let me summarize both approach A and approach B:
A) Put libraries in a network share, mount on each node, and in your
code, manually set PYTHONPATH
B) In your code, manually install the necessary packages using "pip
install -r <temp_dir>"
I think approach B is very similar to approach A; both have pros and
cons. With B), your cluster needs internet access (in my case, our
cluster runs in an isolated environment for security reasons), though
you can always set up a private pip server and stage the needed
packages on it. With A), you need admin permission to mount the
network share, which is also a DevOps burden.
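For B) on an isolated cluster, the install step would look roughly like
this (the index URL and requirements path are placeholders):

import subprocess
import sys

# Install the staged packages from an internal PyPI mirror instead of
# the public index; run this before the dependent imports happen.
subprocess.check_call([
    sys.executable, "-m", "pip", "install",
    "--index-url", "https://pypi.internal.example/simple",
    "-r", "/tmp/requirements.txt",
])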
I am wondering if Spark could provide a new API to tackle this
scenario instead of these workarounds, which I suppose would be
cleaner and more elegant.
Regards,
Stone
On Fri, Jun 5, 2020 at 1:02 AM Dark Crusader
<relinquisheddra...@gmail.com> wrote:
Hi Stone,
Have you looked into this article?
https://medium.com/@SSKahani/pyspark-applications-dependencies-99415e0df987
I haven't tried it with .so files; however, I did use the approach
he recommends to install my other dependencies.
I hope it helps.
On Fri, Jun 5, 2020 at 1:12 PM Stone Zhong <stone.zh...@gmail.com> wrote:
Hi,
So my PySpark app depends on some Python libraries. At first that was
not a problem: I packed all the dependencies into a file libs.zip,
called *sc.addPyFile("libs.zip")*, and it worked pretty well for a
while.
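For reference, that zip-based setup was basically this (names are just
examples):

from pyspark import SparkContext

sc = SparkContext(appName="libs-zip-demo")

# Ship the zipped pure-Python dependencies to every executor and add
# the archive to the executors' module search path.
sc.addPyFile("libs.zip")

# After this, imports of the packed libraries work inside functions
# that run on the executors, as long as they have no compiled (.so)
# parts.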
Then I encountered a problem: if any of my libraries has a binary
dependency (like .so files), this approach does not work. Mainly,
when you put a zip file on PYTHONPATH, Python does not look up the
needed binary library (e.g. a .so file) inside the zip file; this is
a Python /*limitation*/. So I use a workaround:
1) Do not call sc.addPyFile; instead, extract libs.zip into the
current directory
2) When my Python code starts, manually call
*sys.path.insert(0, f"{os.getcwd()}/libs")* to put the extracted
libraries on the module search path
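Concretely, the driver-side part of that workaround looks roughly like
this (I'm assuming the archive contains a top-level libs/ directory):

import os
import sys
import zipfile

# Unpack the dependency archive instead of relying on zip imports, so
# compiled extensions (.so files) can actually be loaded.
with zipfile.ZipFile("libs.zip") as zf:
    zf.extractall(".")

# Put the extracted directory at the front of the module search path.
sys.path.insert(0, os.path.join(os.getcwd(), "libs"))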
This workaround works well for me. Then I hit another problem: what
if my code running in the executors needs a Python library that has
binary code? Below is an example:
def do_something(p):
    ...

rdd = sc.parallelize([
    {"x": 1, "y": 2},
    {"x": 2, "y": 3},
    {"x": 3, "y": 4},
])
a = rdd.map(do_something)
What if the function "do_something" needs a Python library that has
binary code? My current solution is to extract libs.zip onto an NFS
share (or an SMB share) mounted on every node, and manually do
*sys.path.insert(0, f"share_mount_dir/libs")* in my "do_something"
function, but adding such code in each function looks ugly. Is there
any better/more elegant solution?
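For reference, what I'm doing now looks roughly like this
("share_mount_dir" and "some_binary_lib" are placeholders for the real
mount point and library):

import sys

def do_something(p):
    # Every function shipped to the executors has to repeat this path
    # hack so its compiled (.so) dependency can be imported there.
    libs_path = "/share_mount_dir/libs"
    if libs_path not in sys.path:
        sys.path.insert(0, libs_path)
    import some_binary_lib  # placeholder for the library with .so files
    return some_binary_lib.transform(p)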
Thanks,
Stone