Re: Optimising pandas relational ops with pyarrow

Chris Nuernberger Fri, 01 Jan 2021 11:07:43 -0800

Ivan,

The Clojure dataset abstraction does not copy the data, uses mmap, and is
generally extremely fast for aggregate group-by operations
<https://github.com/zero-one-group/geni-performance-benchmark/>. Just FYI.


On Fri, Jan 1, 2021 at 10:24 AM Ivan Petrov <[email protected]> wrote:

> Hi!
> I plan to:
> - join
> - group by
> - filter
> data using pyarrow (I’m new to it). The idea is to get better performance and
> memory utilisation (Apache Arrow columnar compression) compared to pandas.
> It seems pyarrow has no support for joining two Tables / Datasets by key,
> so I have to fall back to pandas.
> I don’t really follow how the pyarrow <-> pandas integration works. Will
> pandas rely on the Apache Arrow data structures? I’m fine with using only these
> flat column types to avoid "corner cases":
> - string
> - int
> - long
> - decimal
>
> I have a feeling that pandas will copy all the data from Apache Arrow and
> double the memory footprint (according to the docs). Did I get that right?
> What is the right way to join, group by and filter several "Tables" /
> "Datasets" while making use of pyarrow (and the underlying Apache Arrow)?
>
> Thank you!
>
