Re: Optimising pandas relational ops with pyarrow

Vibhatha Abeykoon Fri, 01 Jan 2021 15:25:28 -0800

Check out https://cylondata.org/.


We have also worked on this problem in both sequential and distributed
execution modes. An early DataFrame API is also available.

[1]. https://cylondata.org/docs/python
[2]. https://cylondata.org/docs/python_api_docs


On Fri, Jan 1, 2021 at 2:07 PM Chris Nuernberger <[email protected]>
wrote:

> Ivan,
>
> The Clojure dataset abstraction does not copy the data, uses mmap, and is
> generally extremely fast for aggregate group-by operations
> <https://github.com/zero-one-group/geni-performance-benchmark/>. Just FYI.
>
> On Fri, Jan 1, 2021 at 10:24 AM Ivan Petrov <[email protected]> wrote:
>
>> Hi!
>> I plan to:
>> - join
>> - group by
>> - filter
>> data using pyarrow (new to it). The idea is to get better performance and
>> memory utilisation (Apache Arrow columnar compression) compared to pandas.
>> Seems like pyarrow has no support for joining two Tables / Datasets by key
>> so I have to fall back to pandas.
>> I don’t really follow how pyarrow <-> pandas integration works. Will
>> pandas rely on apache arrow data structure? I’m fine with using only these
>> flat types for columns to avoid "corner cases"
>> - string
>> - int
>> - long
>> - decimal
>>
>> I have a feeling that pandas will copy all data from apache arrow and
>> double the size (according to the doc). Did I get it right?
>> What is the right way to join, groupBy and filter several "Tables" /
>> "Datasets" utilizing pyarrow (underlying apache arrow) power?
>>
>> Thank you!
>>
> --
Vibhatha Abeykoon
