Re: PySpark API on top of Apache Arrow
Actually, we do mention that Pandas UDFs are built upon Apache Arrow :-) and point to the blog by their contributors from Two Sigma: "On the other hand, Pandas UDF built atop Apache Arrow accords high-performance to Python developers, whether you use Pandas UDFs on a single-node machine or distributed cluster."

Cheers,
Jules

Sent from my iPhone. Pardon the dumb thumb typos :)

> On May 26, 2018, at 12:41 PM, Corey Nolet <cjno...@gmail.com> wrote:
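The scalar Pandas UDF pattern mentioned above can be sketched minimally. The Spark wiring is shown as comments because it needs a running SparkSession (the `cubed` name and the column `"x"` are hypothetical); the pandas function itself runs standalone, which is the point: it receives whole pandas Series, delivered to the Python worker as Arrow record batches, instead of one row at a time.

```python
import pandas as pd

# Core of a scalar Pandas UDF: a vectorized function over whole
# pandas Series (transferred via Arrow), not per-row Python objects.
def cubed(col: pd.Series) -> pd.Series:
    return col ** 3

# Hypothetical Spark 2.3+ wiring (requires a SparkSession and a
# DataFrame `df` with a numeric column "x"):
# from pyspark.sql.functions import pandas_udf
# from pyspark.sql.types import LongType
# cubed_udf = pandas_udf(cubed, returnType=LongType())
# df.select(cubed_udf("x"))
```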
Re: PySpark API on top of Apache Arrow
Gourav & Nicolas,

Thank you! It does look like the PySpark Pandas UDF is exactly what I want; the article I read didn't mention that it used Arrow underneath. It looks like Wes McKinney was also a key part of building the Pandas UDF.

Gourav,

I totally apologize for my long and drawn-out response to you. I initially misunderstood your response. I also need to take the time to dive into the PySpark source code; I was assuming that it was just firing up JVMs under the hood.

Thanks again! I'll report back with findings.

On Sat, May 26, 2018 at 2:51 PM, Nicolas Paris <nipari...@gmail.com> wrote:
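One related detail worth noting for the "Arrow underneath" point: in Spark 2.3, Arrow is also used for plain DataFrame-to-pandas conversion, but only behind a config flag. A hedged fragment (assumes `spark` is an existing SparkSession and `df` an existing DataFrame; not runnable on its own):

```python
# Enables Arrow-backed columnar transfer for toPandas() and
# createDataFrame(pandas_df) in Spark 2.3+.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

pdf = df.toPandas()  # now transferred as Arrow record batches
```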
Re: PySpark API on top of Apache Arrow
Hi Corey,

I'm not familiar with Arrow or Plasma. However, I recently read an article about Spark on a standalone machine (your case). It sounds like you could benefit from PySpark "as-is":

https://databricks.com/blog/2018/05/03/benchmarking-apache-spark-on-a-single-node-machine.html

Regards,

2018-05-23 22:30 GMT+02:00 Corey Nolet <cjno...@gmail.com>:
PySpark API on top of Apache Arrow
Please forgive me if this question has been asked already.

I'm working in Python with Arrow, Plasma, and pandas DataFrames. I'm curious whether anyone knows of any efforts to implement the PySpark API on top of Apache Arrow directly. In my case, I'm doing data science on a machine with 288 cores and 1 TB of RAM.

It would make life much easier if I were able to use the flexibility of the PySpark API (rather than being tied to the operations in pandas). It seems like an implementation would be fairly straightforward using the Plasma server and object_ids.

If you have not heard of an effort underway to accomplish this, are there any reasons why it would be a bad idea?

Thanks!