Re: Add python library with native code

2020-06-06 Thread Stone Zhong
Great, thank you Masood, will look into it.

Regards,
Stone

On Fri, Jun 5, 2020 at 7:47 PM Masood Krohy 
wrote:

> Not totally sure it's gonna help your use case, but I'd recommend that you
> consider these too:
>
>    - pex: A library and tool for generating .pex (Python EXecutable) files
>    - cluster-pack: a library on top of either pex or conda-pack to make your
>    Python code easily available on a cluster.
>
> Masood
>
> __
>
> Masood Krohy, Ph.D.
> Data Science Advisor | Platform Architect
> https://www.analytical.works
>
> On 6/5/20 4:29 AM, Stone Zhong wrote:
>
> Thanks Dark. Looked at that article. I think the article describes approach
> B; let me summarize both approach A and approach B:
> A) Put libraries on a network share, mount it on each node, and in your code,
> manually set PYTHONPATH
> B) In your code, manually install the necessary packages using "pip install
> -r "
>
> I think approach B is very similar to approach A; both have pros and cons.
> With B), your cluster needs to have internet access (in my case, our
> cluster runs in an isolated environment for security reasons), but you can
> set up a private pip server and stage the needed packages there, while for
> A, you need admin permission to be able to mount the network share,
> which is also a DevOps burden.
>
> I am wondering if Spark could provide a new API to tackle this scenario
> instead of these workarounds, which I suppose would be cleaner and more
> elegant.
>
> Regards,
> Stone
>
>
> On Fri, Jun 5, 2020 at 1:02 AM Dark Crusader 
> wrote:
>
>> Hi Stone,
>>
>> Have you looked into this article?
>>
>> https://medium.com/@SSKahani/pyspark-applications-dependencies-99415e0df987
>>
>>
>> I haven't tried it with .so files, however I did use the approach he
>> recommends to install my other dependencies.
>> I hope it helps.
>>
>> On Fri, Jun 5, 2020 at 1:12 PM Stone Zhong  wrote:
>>
>>> Hi,
>>>
>>> So my pyspark app depends on some python libraries. That is not a problem:
>>> I pack all the dependencies into a file libs.zip, and then call
>>> *sc.addPyFile("libs.zip")*, and it works pretty well for a while.
>>>
>>> Then I encountered a problem: if any of my libraries has a binary file
>>> dependency (like .so files), this approach does not work, mainly because
>>> when PYTHONPATH points to a zip file, Python does not load the needed
>>> binary libraries (e.g. .so files) from inside the zip file; this is a Python
>>> *limitation*. So I came up with a workaround:
>>>
>>> 1) Do not call sc.addPyFile; instead, extract libs.zip into the current
>>> directory
>>> 2) When my python code starts, manually call *sys.path.insert(0,
>>> f"{os.getcwd()}/libs")* to extend the import path
>>>
>>> This workaround works well for me. Then I ran into another problem: what if
>>> my code running in the executor needs a python library that has binary
>>> code? Below is an example:
>>>
>>> def do_something(p):
>>> ...
>>>
>>> rdd = sc.parallelize([
>>> {"x": 1, "y": 2},
>>> {"x": 2, "y": 3},
>>> {"x": 3, "y": 4},
>>> ])
>>> a = rdd.map(do_something)
>>>
>>> What if the function "do_something" needs a python library that has
>>> binary code? My current solution is to extract libs.zip onto an NFS share
>>> (or an SMB share) and manually do *sys.path.insert(0,
>>> f"share_mount_dir/libs")* in my "do_something" function, but adding
>>> such code in each function looks ugly. Is there any better/more elegant
>>> solution?
>>>
>>> Thanks,
>>> Stone
>>>
>>>


Re: Add python library with native code

2020-06-05 Thread Masood Krohy
Not totally sure it's gonna help your use case, but I'd recommend that 
you consider these too:


 * pex: A library and tool for generating .pex (Python EXecutable) files
   (see the sketch after this list)
 * cluster-pack: a library on top of either pex or conda-pack to make your
   Python code easily available on a cluster.

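To make the pex route concrete, here is a minimal sketch, assuming the
dependencies are listed in a requirements.txt; the file names and the
spark-submit invocation below are illustrative, not something pex or Spark
mandates:

import os
import subprocess

# Build one self-contained Python environment file from the requirements.
subprocess.run(["pex", "-r", "requirements.txt", "-o", "deps.pex"], check=True)

env = dict(os.environ)
env["PYSPARK_DRIVER_PYTHON"] = "python"  # keep the regular interpreter on the driver
env["PYSPARK_PYTHON"] = "./deps.pex"     # executors run inside the pex environment

# Ship the pex with the job so "./deps.pex" exists in each executor's work dir.
subprocess.run(
    ["spark-submit", "--files", "deps.pex", "my_app.py"],
    check=True,
    env=env,
)

cluster-pack, as I understand it, automates roughly this kind of packaging
and upload so you do not have to script it yourself.
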
Masood

__

Masood Krohy, Ph.D.
Data Science Advisor|Platform Architect
https://www.analytical.works

On 6/5/20 4:29 AM, Stone Zhong wrote:
Thanks Dark. Looked at that article. I think the article describes
approach B; let me summarize both approach A and approach B:
A) Put libraries on a network share, mount it on each node, and in your
code, manually set PYTHONPATH
B) In your code, manually install the necessary packages using "pip
install -r "

I think approach B is very similar to approach A; both have pros and
cons. With B), your cluster needs to have internet access (in my
case, our cluster runs in an isolated environment for security
reasons), but you can set up a private pip server and stage the
needed packages there, while for A, you need admin permission to be
able to mount the network share, which is also a DevOps burden.

I am wondering if Spark could provide a new API to tackle this
scenario instead of these workarounds, which I suppose would be
cleaner and more elegant.


Regards,
Stone


On Fri, Jun 5, 2020 at 1:02 AM Dark Crusader wrote:


Hi Stone,

Have you looked into this article?
https://medium.com/@SSKahani/pyspark-applications-dependencies-99415e0df987


I haven't tried it with .so files, however I did use the approach
he recommends to install my other dependencies.
I hope it helps.

On Fri, Jun 5, 2020 at 1:12 PM Stone Zhong wrote:

Hi,

So my pyspark app depends on some python libraries. That is not
a problem: I pack all the dependencies into a file libs.zip,
and then call *sc.addPyFile("libs.zip")*, and it works pretty
well for a while.

Then I encountered a problem: if any of my libraries has a
binary file dependency (like .so files), this approach does
not work, mainly because when PYTHONPATH points to a zip
file, Python does not load the needed binary libraries (e.g.
.so files) from inside the zip file; this is a Python
*limitation*. So I came up with a workaround:

1) Do not call sc.addPyFile; instead, extract libs.zip into the
current directory
2) When my python code starts, manually call
*sys.path.insert(0, f"{os.getcwd()}/libs")* to extend the import path

This workaround works well for me. Then I ran into another problem:
what if my code running in the executor needs a python library that
has binary code? Below is an example:

def do_something(p):
    ...

rdd = sc.parallelize([
    {"x": 1, "y": 2},
    {"x": 2, "y": 3},
    {"x": 3, "y": 4},
])
a = rdd.map(do_something)

What if the function "do_something" needs a python library
that has binary code? My current solution is to extract libs.zip
onto an NFS share (or an SMB share) and manually do
*sys.path.insert(0, f"share_mount_dir/libs")* in my
"do_something" function, but adding such code in each function
looks ugly. Is there any better/more elegant solution?

Thanks,
Stone



Re: Add python library with native code

2020-06-05 Thread Stone Zhong
Thanks Dark. Looked at that article. I think the article describes approach
B; let me summarize both approach A and approach B:
A) Put libraries on a network share, mount it on each node, and in your code,
manually set PYTHONPATH
B) In your code, manually install the necessary packages using "pip install
-r "

I think approach B is very similar to approach A; both have pros and cons.
With B), your cluster needs to have internet access (in my case, our
cluster runs in an isolated environment for security reasons), but you can
set up a private pip server and stage the needed packages there, while for
A, you need admin permission to be able to mount the network share,
which is also a DevOps burden.

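For concreteness, a rough sketch of what approach B can look like at run
time; the internal index URL and the requirements file name are placeholders
for whatever gets staged:

import subprocess
import sys

# Install the staged packages from an internal mirror, so no public
# internet access is needed (index URL and file name are placeholders).
subprocess.check_call([
    sys.executable, "-m", "pip", "install",
    "--index-url", "http://pypi.internal.example/simple",
    "-r", "requirements.txt",
])

On a cluster this has to run on every node before the executors import
anything, which is exactly the kind of operational burden described above.
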
I am wondering if Spark could provide a new API to tackle this scenario
instead of these workarounds, which I suppose would be cleaner and more
elegant.

Regards,
Stone


On Fri, Jun 5, 2020 at 1:02 AM Dark Crusader 
wrote:

> Hi Stone,
>
> Have you looked into this article?
> https://medium.com/@SSKahani/pyspark-applications-dependencies-99415e0df987
>
>
> I haven't tried it with .so files, however I did use the approach he
> recommends to install my other dependencies.
> I hope it helps.
>
> On Fri, Jun 5, 2020 at 1:12 PM Stone Zhong  wrote:
>
>> Hi,
>>
>> So my pyspark app depends on some python libraries. That is not a problem:
>> I pack all the dependencies into a file libs.zip, and then call
>> *sc.addPyFile("libs.zip")*, and it works pretty well for a while.
>>
>> Then I encountered a problem: if any of my libraries has a binary file
>> dependency (like .so files), this approach does not work, mainly because
>> when PYTHONPATH points to a zip file, Python does not load the needed
>> binary libraries (e.g. .so files) from inside the zip file; this is a Python
>> *limitation*. So I came up with a workaround:
>>
>> 1) Do not call sc.addPyFile; instead, extract libs.zip into the current
>> directory
>> 2) When my python code starts, manually call *sys.path.insert(0,
>> f"{os.getcwd()}/libs")* to extend the import path
>>
>> This workaround works well for me. Then I ran into another problem: what if
>> my code running in the executor needs a python library that has binary
>> code? Below is an example:
>>
>> def do_something(p):
>> ...
>>
>> rdd = sc.parallelize([
>> {"x": 1, "y": 2},
>> {"x": 2, "y": 3},
>> {"x": 3, "y": 4},
>> ])
>> a = rdd.map(do_something)
>>
>> What if the function "do_something" needs a python library that has
>> binary code? My current solution is to extract libs.zip onto an NFS share
>> (or an SMB share) and manually do *sys.path.insert(0,
>> f"share_mount_dir/libs")* in my "do_something" function, but adding such
>> code in each function looks ugly. Is there any better/more elegant solution?
>>
>> Thanks,
>> Stone
>>
>>


Re: Add python library with native code

2020-06-05 Thread Dark Crusader
Hi Stone,

Have you looked into this article?
https://medium.com/@SSKahani/pyspark-applications-dependencies-99415e0df987

I haven't tried it with .so files, however I did use the approach he
recommends to install my other dependencies.
I hope it helps.

On Fri, Jun 5, 2020 at 1:12 PM Stone Zhong  wrote:

> Hi,
>
> So my pyspark app depends on some python libraries. That is not a problem: I
> pack all the dependencies into a file libs.zip, and then call
> *sc.addPyFile("libs.zip")*, and it works pretty well for a while.
>
> Then I encountered a problem: if any of my libraries has a binary file
> dependency (like .so files), this approach does not work, mainly because
> when PYTHONPATH points to a zip file, Python does not load the needed
> binary libraries (e.g. .so files) from inside the zip file; this is a Python
> *limitation*. So I came up with a workaround:
>
> 1) Do not call sc.addPyFile; instead, extract libs.zip into the current
> directory
> 2) When my python code starts, manually call *sys.path.insert(0,
> f"{os.getcwd()}/libs")* to extend the import path
>
> This workaround works well for me. Then I ran into another problem: what if
> my code running in the executor needs a python library that has binary
> code? Below is an example:
>
> def do_something(p):
> ...
>
> rdd = sc.parallelize([
> {"x": 1, "y": 2},
> {"x": 2, "y": 3},
> {"x": 3, "y": 4},
> ])
> a = rdd.map(do_something)
>
> What if the function "do_something" needs a python library that has
> binary code? My current solution is to extract libs.zip onto an NFS share
> (or an SMB share) and manually do *sys.path.insert(0,
> f"share_mount_dir/libs")* in my "do_something" function, but adding such
> code in each function looks ugly. Is there any better/more elegant solution?
>
> Thanks,
> Stone
>
>


Add python library with native code

2020-06-05 Thread Stone Zhong
Hi,

So my pyspark app depends on some python libraries. That is not a problem: I
pack all the dependencies into a file libs.zip, and then call
*sc.addPyFile("libs.zip")*, and it works pretty well for a while.

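To be concrete, here is a minimal sketch of this packaging step, assuming
pure-Python dependencies listed in a requirements.txt (the file and app
names are just examples):

# Build the zip once, outside Spark (shell steps shown as comments):
#   pip install -r requirements.txt -t libs/
#   cd libs && zip -r ../libs.zip .
from pyspark import SparkContext

sc = SparkContext(appName="my_app")
sc.addPyFile("libs.zip")  # ships libs.zip to every executor and puts it on sys.path there
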
Then I encountered a problem: if any of my libraries has a binary file
dependency (like .so files), this approach does not work, mainly because
when PYTHONPATH points to a zip file, Python does not load the needed
binary libraries (e.g. .so files) from inside the zip file; this is a Python
*limitation*. So I came up with a workaround:

1) Do not call sc.addPyFile; instead, extract libs.zip into the current
directory
2) When my python code starts, manually call *sys.path.insert(0,
f"{os.getcwd()}/libs")* to extend the import path

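In code, the workaround looks roughly like this (a sketch, assuming libs.zip
sits in the driver's working directory):

import os
import sys
import zipfile

# Extract the bundled dependencies onto local disk instead of shipping the zip.
with zipfile.ZipFile("libs.zip") as zf:
    zf.extractall(os.path.join(os.getcwd(), "libs"))

# Put the extracted directory at the front of the import path so native
# extensions (.so files) can be loaded from a real directory on disk.
sys.path.insert(0, os.path.join(os.getcwd(), "libs"))
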
This workaround works well for me. Then I ran into another problem: what if my
code running in the executor needs a python library that has binary code?
Below is an example:

def do_something(p):
...

rdd = sc.parallelize([
{"x": 1, "y": 2},
{"x": 2, "y": 3},
{"x": 3, "y": 4},
])
a = rdd.map(do_something)

What if the function "do_something" needs a python library that has
binary code? My current solution is to extract libs.zip onto an NFS share (or
an SMB share) and manually do *sys.path.insert(0, f"share_mount_dir/libs")*
in my "do_something" function, but adding such code in each function looks
ugly. Is there any better/more elegant solution?

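For completeness, this is roughly what the current approach looks like; the
mount path and the imported library below are placeholders:

import sys

SHARE_LIBS = "/mnt/shared/libs"  # placeholder: where the NFS/SMB share is mounted on each worker

def do_something(p):
    # Must run on the executor, before importing anything with native code.
    if SHARE_LIBS not in sys.path:
        sys.path.insert(0, SHARE_LIBS)
    import native_dep  # hypothetical library that ships .so files
    return native_dep.process(p)

a = rdd.map(do_something)  # rdd as defined in the example above
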
Thanks,
Stone