Not totally sure it will help your use case, but I'd recommend that
you consider these too (rough sketch of the pex route below the list):
* pex <https://github.com/pantsbuild/pex> A library and tool for
generating .pex (Python EXecutable) files
* cluster-pack <https://github.com/criteo/cluster-pack> cluster-pack
is a library on top of either pex or conda-pack to make your Python
code easily available on a cluster.
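If you go the pex route, the flow is roughly: build a .pex from your
requirements with the pex CLI (e.g. "pex -r requirements.txt -o deps.pex"),
ship it with the job, and point the executors' Python at it. A rough,
untested sketch follows; the file names and configs are just my
assumptions, and cluster-pack automates most of this for you:

from pyspark.sql import SparkSession

# Ship a pre-built deps.pex to the executors and use it as their Python
# interpreter, so compiled (.so) dependencies travel with the job.
spark = (
    SparkSession.builder
    .appName("pex-demo")
    .config("spark.files", "deps.pex")
    .config("spark.executorEnv.PYSPARK_PYTHON", "./deps.pex")
    .getOrCreate()
)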
Masood
__________________
Masood Krohy, Ph.D.
Data Science Advisor|Platform Architect
https://www.analytical.works
On 6/5/20 4:29 AM, Stone Zhong wrote:
Thanks Dark. I looked at that article. I think the article describes
approach B; let me summarize both approach A and approach B:
A) Put libraries in a network share, mount on each node, and in your
code, manually set PYTHONPATH
B) In your code, manually install the necessary packages using "pip
install -r <temp_dir>"
I think approach B is very similar to approach A; both have pros and
cons. With B), your cluster needs internet access (in my case, our
cluster runs in an isolated environment for security reasons), though
you can always set up a private pip server and stage the needed
packages on it. With A), you need admin permission to mount the
network share, which is also a DevOps burden.
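For B) on an isolated cluster, the install step would look roughly like
this (the index URL and requirements path are placeholders):

import subprocess
import sys

# Install the staged packages from an internal PyPI mirror instead of
# the public index; run this before the dependent imports happen.
subprocess.check_call([
    sys.executable, "-m", "pip", "install",
    "--index-url", "https://pypi.internal.example/simple",
    "-r", "/tmp/requirements.txt",
])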
I am wondering if Spark could provide a new API to tackle this
scenario instead of these workarounds, which I suppose would be
cleaner and more elegant.
Regards,
Stone
On Fri, Jun 5, 2020 at 1:02 AM Dark Crusader
<relinquisheddra...@gmail.com> wrote:
Hi Stone,
Have you looked into this article?
https://medium.com/@SSKahani/pyspark-applications-dependencies-99415e0df987
I haven't tried it with .so files; however, I did use the approach
he recommends to install my other dependencies.
I hope it helps.
On Fri, Jun 5, 2020 at 1:12 PM Stone Zhong <stone.zh...@gmail.com> wrote:
Hi,
So my PySpark app depends on some Python libraries. At first that was
not a problem: I packed all the dependencies into a file libs.zip,
called *sc.addPyFile("libs.zip")*, and it worked pretty well for a
while.
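For reference, that zip-based setup was basically this (names are just
examples):

from pyspark import SparkContext

sc = SparkContext(appName="libs-zip-demo")

# Ship the zipped pure-Python dependencies to every executor and add
# the archive to the executors' module search path.
sc.addPyFile("libs.zip")

# After this, imports of the packed libraries work inside functions
# that run on the executors, as long as they have no compiled (.so)
# parts.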
Then I encountered a problem: if any of my libraries has a binary
dependency (like .so files), this approach does not work. Mainly,
when you put a zip file on PYTHONPATH, Python does not look up the
needed binary library (e.g. a .so file) inside the zip file; this is
a Python /*limitation*/. So I use a workaround:
1) Do not call sc.addPyFile; instead, extract libs.zip into the
current directory
2) When my Python code starts, manually call
*sys.path.insert(0, f"{os.getcwd()}/libs")* to put the extracted
libraries on the module search path
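Concretely, the driver-side part of that workaround looks roughly like
this (I'm assuming the archive contains a top-level libs/ directory):

import os
import sys
import zipfile

# Unpack the dependency archive instead of relying on zip imports, so
# compiled extensions (.so files) can actually be loaded.
with zipfile.ZipFile("libs.zip") as zf:
    zf.extractall(".")

# Put the extracted directory at the front of the module search path.
sys.path.insert(0, os.path.join(os.getcwd(), "libs"))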
This workaround works well for me. Then I hit another problem: what
if my code running in the executors needs a Python library that has
binary code? Below is an example:
def do_something(p):
    ...

rdd = sc.parallelize([
    {"x": 1, "y": 2},
    {"x": 2, "y": 3},
    {"x": 3, "y": 4},
])
a = rdd.map(do_something)
What if the function "do_something" needs a Python library that has
binary code? My current solution is to extract libs.zip onto an NFS
share (or an SMB share) mounted on every node, and manually do
*sys.path.insert(0, f"share_mount_dir/libs")* in my "do_something"
function, but adding such code in each function looks ugly. Is there
any better/more elegant solution?
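For reference, what I'm doing now looks roughly like this
("share_mount_dir" and "some_binary_lib" are placeholders for the real
mount point and library):

import sys

def do_something(p):
    # Every function shipped to the executors has to repeat this path
    # hack so its compiled (.so) dependency can be imported there.
    libs_path = "/share_mount_dir/libs"
    if libs_path not in sys.path:
        sys.path.insert(0, libs_path)
    import some_binary_lib  # placeholder for the library with .so files
    return some_binary_lib.transform(p)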
Thanks,
Stone