Re: PySpark API on top of Apache Arrow
Actually, we do mention that the Pandas UDF is built upon Apache Arrow :-) and we point to the blog by its contributors from Two Sigma:

"On the other hand, Pandas UDF built atop Apache Arrow accords high-performance to Python developers, whether you use Pandas UDFs on a single-node machine or distributed cluster."

Cheers,
Jules

> On May 26, 2018, at 12:41 PM, Corey Nolet wrote:
>
> Gourav & Nicholas,
>
> Thank you! It does look like the PySpark Pandas UDF is exactly what I want;
> the article I read didn't mention that it uses Arrow underneath. It looks
> like Wes McKinney was also a key part of building the Pandas UDF.
>
> Gourav,
>
> I apologize for my long and drawn-out response to you. I initially
> misunderstood your response. I also need to take the time to dive into the
> PySpark source code; I was assuming that it was just firing up JVMs under
> the hood.
>
> Thanks again! I'll report back with findings.
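The performance claim quoted above comes down to batching: a scalar Pandas UDF receives whole columns as pandas Series (shipped via Arrow) instead of one Python object per row. A minimal sketch of such a UDF body follows; the PySpark registration lines are shown as comments, since they assume a running SparkSession and a DataFrame `df` with columns `a` and `b`:

```python
import pandas as pd

# The body of a scalar Pandas UDF operates on whole pandas Series at once,
# so NumPy's vectorized arithmetic does the work rather than a per-row
# Python loop.
def multiply_vectorized(a: pd.Series, b: pd.Series) -> pd.Series:
    return a * b

# Registering it in PySpark (Spark 2.3+) is one call -- shown as comments
# because it needs a SparkSession:
#
#   from pyspark.sql.functions import pandas_udf
#   multiply = pandas_udf(multiply_vectorized, returnType="double")
#   df.select(multiply(df.a, df.b))

result = multiply_vectorized(pd.Series([1.0, 2.0]), pd.Series([10.0, 20.0]))
print(result.tolist())  # [10.0, 40.0]
```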
Re: PySpark API on top of Apache Arrow
Gourav & Nicholas,

Thank you! It does look like the PySpark Pandas UDF is exactly what I want; the article I read didn't mention that it uses Arrow underneath. It looks like Wes McKinney was also a key part of building the Pandas UDF.

Gourav,

I apologize for my long and drawn-out response to you. I initially misunderstood your response. I also need to take the time to dive into the PySpark source code; I was assuming that it was just firing up JVMs under the hood.

Thanks again! I'll report back with findings.

On Sat, May 26, 2018 at 2:51 PM, Nicolas Paris wrote:
> Hi Corey,
>
> I'm not familiar with Arrow or Plasma. However, I recently read an article
> about Spark on a standalone machine (your case). It sounds like you could
> benefit from PySpark "as-is":
>
> https://databricks.com/blog/2018/05/03/benchmarking-apache-spark-on-a-single-node-machine.html
>
> Regards,
Re: PySpark API on top of Apache Arrow
Hi Corey,

I'm not familiar with Arrow or Plasma. However, I recently read an article about Spark on a standalone machine (your case). It sounds like you could benefit from PySpark "as-is":

https://databricks.com/blog/2018/05/03/benchmarking-apache-spark-on-a-single-node-machine.html

Regards,

2018-05-23 22:30 GMT+02:00 Corey Nolet:
> Please forgive me if this question has been asked already.
>
> I'm working in Python with Arrow + Plasma + Pandas DataFrames. I'm curious
> whether anyone knows of any efforts to implement the PySpark API on top of
> Apache Arrow directly. In my case, I'm doing data science on a machine
> with 288 cores and 1 TB of RAM.
>
> It would make life much easier if I were able to use the flexibility of
> the PySpark API (rather than being tied to the operations in Pandas). It
> seems like an implementation would be fairly straightforward using the
> Plasma server and object_ids.
>
> If you have not heard of an effort underway to accomplish this, are there
> any reasons why it would be a bad idea?
>
> Thanks!
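The "Plasma server and object_ids" idea from the original question can be sketched with a toy in-process stand-in for the object store. The real pyarrow.plasma client (circa 2018) is a separate shared-memory server reached via `plasma.connect`, with `client.put` returning an opaque ObjectID and `client.get` resolving it; this dict-backed version is only an illustration of that put/get-by-id contract, not the real thing:

```python
import uuid

class ToyObjectStore:
    """In-process stand-in for a Plasma-style object store.

    The real store is a separate server process holding objects in shared
    memory so multiple workers can read them without copies; this toy only
    mimics the put/get-by-object_id contract the thread discusses.
    """

    def __init__(self):
        self._objects = {}

    def put(self, obj):
        # Plasma returns an opaque 20-byte ObjectID; a uuid stands in here.
        object_id = uuid.uuid4().hex
        self._objects[object_id] = obj
        return object_id

    def get(self, object_id):
        return self._objects[object_id]

# The point of the design: workers exchange only small object_ids and
# resolve them against the shared store, never copying the data itself.
store = ToyObjectStore()
oid = store.put([1, 2, 3])
print(store.get(oid))  # [1, 2, 3]
```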