Re: NoClassDefFoundError: scala/Product$class

2020-06-06 Thread James Moore
How are you depending on that org.bdgenomics.adam library?  Maybe you're
pulling the 2.11 version of that.


Add python library

2020-06-06 Thread Anwar AliKhan
 " > Have you looked into this article?
https://medium.com/@SSKahani/pyspark-applications-dependencies-99415e0df987
 "

This is weird !
I was hanging out here https://machinelearningmastery.com/start-here/.
When I came across this post.

The weird part is that I was just wondering how I can take one of the
projects (the OpenAI Gym taxi-vt2 project, in Python), a project I want to
develop further.

I want to run it on Spark, using Spark's parallelism features and GPU
capabilities when I am working with bigger datasets, with the workers
(slaves) installed on the new 8GB RAM Raspberry Pi (Linux) doing the sliced
dataset computations.

Are there any other documents on the official website, or in any other
location, which show how to do that, preferably with full self-contained
examples?



On Fri, 5 Jun 2020, 09:02 Dark Crusader, 
wrote:

> Hi Stone,
>
>
> I haven't tried it with .so files; however, I did use the approach he
> recommends to install my other dependencies.
> I hope it helps.
>
> On Fri, Jun 5, 2020 at 1:12 PM Stone Zhong  wrote:
>
>> Hi,
>>
>> So my pyspark app depends on some python libraries. That is not a problem:
>> I pack all the dependencies into a file libs.zip, and then call
>> *sc.addPyFile("libs.zip")*, and it worked pretty well for a while.
>>
>> Then I encountered a problem: if any of my libraries has a binary
>> dependency (like .so files), this approach does not work, mainly because
>> when you set PYTHONPATH to a zip file, python does not look up a needed
>> binary library (e.g. a .so file) inside the zip file; this is a python
>> *limitation*. So I came up with a workaround:
>>
>> 1) Do not call sc.addPyFile; instead, extract libs.zip into the current
>> directory
>> 2) When my python code starts, manually call *sys.path.insert(0,
>> f"{os.getcwd()}/libs")* to set PYTHONPATH
>>
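A minimal sketch of that driver-side workaround, assuming libs.zip sits in
the working directory and holds the packages at its top level (the libs_dir
name is illustrative):

    import os
    import sys
    import zipfile

    # Instead of sc.addPyFile("libs.zip"): extract the archive next to the driver...
    libs_dir = os.path.join(os.getcwd(), "libs")
    with zipfile.ZipFile("libs.zip") as zf:
        zf.extractall(libs_dir)

    # ...and put the extracted directory on sys.path so the packages are importable.
    sys.path.insert(0, libs_dir)
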
>> This workaround works well for me. Then I hit another problem: what if my
>> code in an executor needs a python library that has binary code? Below is
>> an example:
>>
>> def do_something(p):
>>     ...
>>
>> rdd = sc.parallelize([
>>     {"x": 1, "y": 2},
>>     {"x": 2, "y": 3},
>>     {"x": 3, "y": 4},
>> ])
>> a = rdd.map(do_something)
>>
>> What if the function "do_something" need a python library that has
>> binary code? My current solution is, extract libs.zip into a NFS share (or
>> a SMB share) and manually do *sys.path.insert(0,
>> f"share_mount_dir/libs") *in my "do_something" function, but adding such
>> code in each function looks ugly, is there any better/elegant solution?
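One way to avoid repeating that sys.path manipulation in every executor
function is a small decorator, sketched below on the assumption that the
share is mounted at the same path on every worker; with_shared_libs,
SHARED_LIBS_DIR, and some_native_lib are illustrative names, not part of
the original thread:

    import functools
    import sys

    SHARED_LIBS_DIR = "/mnt/share/libs"  # assumed NFS/SMB mount point on every worker

    def with_shared_libs(func):
        """Put the shared libs directory on sys.path before func runs."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if SHARED_LIBS_DIR not in sys.path:
                sys.path.insert(0, SHARED_LIBS_DIR)
            return func(*args, **kwargs)
        return wrapper

    @with_shared_libs
    def do_something(p):
        import some_native_lib  # hypothetical package with .so dependencies
        return some_native_lib.transform(p)

    a = rdd.map(do_something)  # rdd as in the example above
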
>>
>> Thanks,
>> Stone
>>
>>


[pyspark 2.3+] Add scala library to pyspark app and use to derive columns

2020-06-06 Thread Rishi Shah
Hi All,

I have a use case where I need to utilize java/scala for regex mapping (as
lookbehinds are not well supported in python). However, our entire codebase
is python based, so I was wondering if there's a suggested way of creating
a scala/java lib and using it within pyspark.

I came across
https://diogoalexandrefranco.github.io/scala-code-in-pyspark/ and will try
it out, but my colleague previously ran into some serialization issues
while trying to use a java lib with pyspark.

Typical use case is to use library functions to derive columns.
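One commonly used route (a sketch only, not necessarily the approach in the
article above): compile the regex logic as a JVM UDF, ship the jar with the
application, and register it from Python with registerJavaFunction, which
is available since Spark 2.3. The class name com.example.RegexLookbehind
and the jar name regex-udfs.jar are assumptions for illustration; the Scala
side would implement org.apache.spark.sql.api.java.UDF1.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import expr
    from pyspark.sql.types import StringType

    spark = (SparkSession.builder
             .appName("scala-udf-from-pyspark")
             .config("spark.jars", "regex-udfs.jar")  # assumed jar with the compiled Scala UDF
             .getOrCreate())

    # Register the JVM class (assumed to implement UDF1[String, String]) under a SQL name.
    spark.udf.registerJavaFunction("regex_map", "com.example.RegexLookbehind", StringType())

    # Use it to derive a column from Python.
    df = spark.createDataFrame([("abc123",)], ["raw"])
    df.withColumn("mapped", expr("regex_map(raw)")).show()
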

Any input helps, appreciate it!

-- 
Regards,

Rishi Shah



Re: NoClassDefFoundError: scala/Product$class

2020-06-06 Thread Sean Owen
Spark 3 supports only Scala 2.12. This sounds like a third-party library
compiled for 2.11 or something.
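To confirm which Scala binary version the running Spark JVM was built
against, a quick check from PySpark (a sketch using the py4j gateway; the
sample output is only an example):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Ask the JVM which Scala version Spark was built with; third-party jars
    # must be compiled for the same binary version (2.12 for Spark 3).
    print(spark.sparkContext._jvm.scala.util.Properties.versionString())
    # e.g. "version 2.12.10"
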

On Fri, Jun 5, 2020 at 11:11 PM charles_cai <1620075...@qq.com> wrote:

> Hi Pol,
>
> thanks for your suggestion. I am going to use Spark 3.0.0 for GPU
> acceleration, so I updated scala to *version 2.12.11* and the latest
> *2.13*, but the error is still there. By the way, the Spark version is
> *spark-3.0.0-preview2-bin-without-hadoop*.
>
> Caused by: java.lang.ClassNotFoundException: scala.Product$class
> at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
>
> Charles cai
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
>
>


Re: Add python library with native code

2020-06-06 Thread Stone Zhong
Great, thank you Masood, will look into it.

Regards,
Stone

On Fri, Jun 5, 2020 at 7:47 PM Masood Krohy 
wrote:

> Not totally sure it's gonna help your use case, but I'd recommend that you
> consider these too:
>
>    - pex: a library and tool for generating .pex (Python EXecutable) files
>    - cluster-pack: a library on top of either pex or conda-pack to make your
>    Python code easily available on a cluster.
>
> Masood
>
> __
>
> Masood Krohy, Ph.D.
> Data Science Advisor | Platform Architect
> https://www.analytical.works
>
> On 6/5/20 4:29 AM, Stone Zhong wrote:
>
> Thanks Dark. I looked at that article. I think the article describes
> approach B; let me summarize both approach A and approach B:
> A) Put the libraries on a network share, mount it on each node, and in
> your code manually set PYTHONPATH
> B) In your code, manually install the necessary packages using "pip install
> -r "
>
> I think approach B is very similar to approach A; both have pros and cons.
> With B), your cluster needs to have internet access (in my case, our
> cluster runs in an isolated environment for security reasons), though you
> can set up a private pip server and stage the needed packages there. For
> A, you need admin permission to be able to mount the network share, which
> is also a devops burden.
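For concreteness, a minimal sketch of approach B as described above; the
requirements path and the private index URL are placeholders, not values
from this thread:

    import os
    import subprocess
    import sys

    def install_requirements(requirements_path, index_url):
        """Install packages into a local ./libs directory and make them importable."""
        target = os.path.join(os.getcwd(), "libs")
        subprocess.check_call([
            sys.executable, "-m", "pip", "install",
            "--index-url", index_url,  # e.g. a private mirror inside an isolated cluster
            "--target", target,
            "-r", requirements_path,
        ])
        sys.path.insert(0, target)

    # Placeholder values for illustration only.
    install_requirements("requirements.txt", "https://pypi.internal.example/simple")
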
>
> I am wondering if spark could add some new API to tackle this scenario
> instead of these workarounds, which I suppose would be cleaner and more
> elegant.
>
> Regards,
> Stone
>
>
> On Fri, Jun 5, 2020 at 1:02 AM Dark Crusader 
> wrote:
>
>> Hi Stone,
>>
>> Have you looked into this article?
>>
>> https://medium.com/@SSKahani/pyspark-applications-dependencies-99415e0df987
>>
>>
>> I haven't tried it with .so files; however, I did use the approach he
>> recommends to install my other dependencies.
>> I hope it helps.
>>
>> On Fri, Jun 5, 2020 at 1:12 PM Stone Zhong  wrote:
>>
>>> Hi,
>>>
>>> So my pyspark app depends on some python libraries. That is not a problem:
>>> I pack all the dependencies into a file libs.zip, and then call
>>> *sc.addPyFile("libs.zip")*, and it worked pretty well for a while.
>>>
>>> Then I encountered a problem: if any of my libraries has a binary
>>> dependency (like .so files), this approach does not work, mainly because
>>> when you set PYTHONPATH to a zip file, python does not look up a needed
>>> binary library (e.g. a .so file) inside the zip file; this is a python
>>> *limitation*. So I came up with a workaround:
>>>
>>> 1) Do not call sc.addPyFile; instead, extract libs.zip into the current
>>> directory
>>> 2) When my python code starts, manually call *sys.path.insert(0,
>>> f"{os.getcwd()}/libs")* to set PYTHONPATH
>>>
>>> This workaround works well for me. Then I hit another problem: what if
>>> my code in an executor needs a python library that has binary code?
>>> Below is an example:
>>>
>>> def do_something(p):
>>>     ...
>>>
>>> rdd = sc.parallelize([
>>>     {"x": 1, "y": 2},
>>>     {"x": 2, "y": 3},
>>>     {"x": 3, "y": 4},
>>> ])
>>> a = rdd.map(do_something)
>>>
>>> What if the function "do_something" need a python library that has
>>> binary code? My current solution is, extract libs.zip into a NFS share (or
>>> a SMB share) and manually do *sys.path.insert(0,
>>> f"share_mount_dir/libs") *in my "do_something" function, but adding
>>> such code in each function looks ugly, is there any better/elegant
>>> solution?
>>>
>>> Thanks,
>>> Stone
>>>
>>>