Re: PySpark API on top of Apache Arrow
Actually, we do mention that Pandas UDFs are built upon Apache Arrow :-) and point to the blog by their contributors from Two Sigma: "On the other hand, Pandas UDF built atop Apache Arrow accords high-performance to Python developers, whether you use Pandas UDFs on a single-node machine or distributed cluster."

Cheers,
Jules

Sent from my iPhone. Pardon the dumb thumb typos :)

> On May 26, 2018, at 12:41 PM, Corey Nolet <cjno...@gmail.com> wrote:
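The scalar Pandas UDF pattern mentioned above can be sketched minimally. The Spark wiring is shown as comments because it needs a running SparkSession (the `cubed` name and the column `"x"` are hypothetical); the pandas function itself runs standalone, which is the point: it receives whole pandas Series, delivered to the Python worker as Arrow record batches, instead of one row at a time.

```python
import pandas as pd

# Core of a scalar Pandas UDF: a vectorized function over whole
# pandas Series (transferred via Arrow), not per-row Python objects.
def cubed(col: pd.Series) -> pd.Series:
    return col ** 3

# Hypothetical Spark 2.3+ wiring (requires a SparkSession and a
# DataFrame `df` with a numeric column "x"):
# from pyspark.sql.functions import pandas_udf
# from pyspark.sql.types import LongType
# cubed_udf = pandas_udf(cubed, returnType=LongType())
# df.select(cubed_udf("x"))
```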
Re: PySpark API on top of Apache Arrow
Gourav & Nicolas,

Thank you! It does look like the PySpark Pandas UDF is exactly what I want; the article I read didn't mention that it used Arrow underneath. It looks like Wes McKinney was also a key part of building the Pandas UDF.

Gourav,

I totally apologize for my long and drawn-out response to you. I initially misunderstood your response. I also need to take the time to dive into the PySpark source code; I was assuming that it was just firing up JVMs under the hood.

Thanks again! I'll report back with findings.

On Sat, May 26, 2018 at 2:51 PM, Nicolas Paris <nipari...@gmail.com> wrote:
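One related detail worth noting for the "Arrow underneath" point: in Spark 2.3, Arrow is also used for plain DataFrame-to-pandas conversion, but only behind a config flag. A hedged fragment (assumes `spark` is an existing SparkSession and `df` an existing DataFrame; not runnable on its own):

```python
# Enables Arrow-backed columnar transfer for toPandas() and
# createDataFrame(pandas_df) in Spark 2.3+.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

pdf = df.toPandas()  # now transferred as Arrow record batches
```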
Re: PySpark API on top of Apache Arrow
Hi Corey,

I'm not familiar with Arrow or Plasma. However, I recently read an article about Spark on a standalone machine (your case). It sounds like you could benefit from PySpark "as-is":

https://databricks.com/blog/2018/05/03/benchmarking-apache-spark-on-a-single-node-machine.html

Regards,

2018-05-23 22:30 GMT+02:00 Corey Nolet <cjno...@gmail.com>:
PySpark API on top of Apache Arrow
Please forgive me if this question has been asked already.

I'm working in Python with Arrow, Plasma, and pandas DataFrames. I'm curious whether anyone knows of any efforts to implement the PySpark API on top of Apache Arrow directly. In my case, I'm doing data science on a machine with 288 cores and 1 TB of RAM.

It would make life much easier if I were able to use the flexibility of the PySpark API (rather than being tied to the operations in pandas). It seems like an implementation would be fairly straightforward using the Plasma server and object_ids.

If you have not heard of an effort underway to accomplish this, are there any reasons why it would be a bad idea?

Thanks!