Re: PySpark API on top of Apache Arrow

2018-05-26 Thread Jules Damji
Actually, we do mention that Pandas UDFs are built upon Apache Arrow. :-) And we
point to the blog post by their contributors from Two Sigma. :-)

“On the other hand, Pandas UDF built atop Apache Arrow accords high-performance 
to Python developers, whether you use Pandas UDFs on a single-node machine or 
distributed cluster.”
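
For the curious, here is a minimal sketch of a scalar Pandas UDF against the
Spark 2.3 API (the function and column names are just illustrative). Spark
ships each partition to the Python workers as Arrow record batches, so the
UDF operates on whole pandas Series rather than one row at a time:

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, pandas_udf, PandasUDFType

    spark = SparkSession.builder.getOrCreate()

    # Vectorized UDF: each call receives a pd.Series backed by an Arrow
    # batch, avoiding per-row serialization between the JVM and Python.
    @pandas_udf("double", PandasUDFType.SCALAR)
    def times_two(v):
        return v * 2.0

    df = spark.range(1000).select(times_two(col("id")).alias("doubled"))
    df.show(5)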

Cheers
Jules 

Sent from my iPhone
Pardon the dumb thumb typos :)


Re: PySpark API on top of Apache Arrow

2018-05-26 Thread Corey Nolet
Gourav & Nicholas,

Thank you! It does look like the PySpark Pandas UDF is exactly what I want;
the article I read didn't mention that it uses Arrow underneath. It looks
like Wes McKinney was also a key part of building the Pandas UDF.

Gourav,

I totally apologize for my long and drawn-out response to you. I initially
misunderstood your response. I also need to take the time to dive into the
PySpark source code; I was assuming that it was just firing up JVMs under
the hood.

Thanks again! I'll report back with findings.
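
For reference, PySpark does run Python worker processes alongside the JVM,
and in Spark 2.3 the Arrow transfer path is gated behind a config flag. A
minimal sketch (the flag name below is the Spark 2.3-era one; it was renamed
in later releases):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Spark 2.3: route toPandas() and Pandas UDF data movement through
    # Arrow record batches instead of pickled rows.
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")

    pdf = spark.range(10 ** 6).toPandas()  # columnar Arrow transfer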


Re: PySpark API on top of Apache Arrow

2018-05-26 Thread Nicolas Paris
Hi Corey,

I'm not familiar with Arrow or Plasma. However, I recently read an article
about running Spark on a standalone machine (your case). It sounds like you
could benefit from PySpark "as-is":

https://databricks.com/blog/2018/05/03/benchmarking-apache-spark-on-a-single-node-machine.html
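
For a machine like yours, a local-mode session along these lines would use
all the cores in a single JVM (the memory settings below are illustrative,
not tuned recommendations):

    from pyspark.sql import SparkSession

    # Local mode: one JVM with 288 task threads; no cluster manager needed.
    spark = (
        SparkSession.builder
        .master("local[288]")
        .config("spark.driver.memory", "512g")
        .config("spark.driver.maxResultSize", "32g")
        .getOrCreate()
    )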

Regards,

2018-05-23 22:30 GMT+02:00 Corey Nolet :

> Please forgive me if this question has been asked already.
>
> I'm working in Python with Arrow + Plasma + Pandas DataFrames. I'm curious
> whether anyone knows of any efforts to implement the PySpark API on top of
> Apache Arrow directly. In my case, I'm doing data science on a machine with
> 288 cores and 1 TB of RAM.
>
> It would make life much easier if I were able to use the flexibility of the
> PySpark API (rather than being tied to the operations in Pandas). It seems
> like an implementation would be fairly straightforward using the Plasma
> server and object_ids.
>
> If you have not heard of an effort underway to accomplish this, are there
> any reasons why it would be a bad idea?
>
>
> Thanks!
>
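
On the Plasma + object_ids idea above: for reference, a minimal sketch of
round-tripping a pandas DataFrame through the Plasma object store with
pyarrow (circa 0.9). It assumes a store is already running at /tmp/plasma,
e.g. started with "plasma_store -m 1000000000 -s /tmp/plasma", and the exact
plasma.connect() signature has varied across pyarrow versions:

    import numpy as np
    import pandas as pd
    import pyarrow as pa
    import pyarrow.plasma as plasma

    client = plasma.connect("/tmp/plasma")

    # Serialize a DataFrame as an Arrow record batch into shared memory.
    df = pd.DataFrame({"x": np.arange(5), "y": np.arange(5) * 2.0})
    batch = pa.RecordBatch.from_pandas(df)

    # Plasma object IDs are 20 raw bytes; random here for illustration.
    object_id = plasma.ObjectID(np.random.bytes(20))

    # First pass measures the serialized size so the allocation is exact.
    mock = pa.MockOutputStream()
    writer = pa.RecordBatchStreamWriter(mock, batch.schema)
    writer.write_batch(batch)
    writer.close()

    # Allocate, write, and seal; a sealed object is visible to any process
    # connected to the same store, and reads are zero-copy.
    buf = client.create(object_id, mock.size())
    writer = pa.RecordBatchStreamWriter(pa.FixedSizeBufferWriter(buf), batch.schema)
    writer.write_batch(batch)
    writer.close()
    client.seal(object_id)

    # Read back, potentially from an entirely different process.
    [data] = client.get_buffers([object_id])
    reader = pa.RecordBatchStreamReader(pa.BufferReader(data))
    print(reader.read_next_batch().to_pandas())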