Re: Optimising pandas relational ops with pyarrow

Wes McKinney Fri, 01 Jan 2021 15:48:04 -0800

Note that many of us think it's important to have canonical
implementations of important algorithms (aggregate / hash aggregate,
joins, sorts, etc.) in the Apache project and available to e.g.
pyarrow users, as opposed to having to direct them to a third party
project. I've been unable to do this work myself given my other
responsibilities, but I will be continuing to direct funding /
engineering time from my organization toward these goals. I hope that
others from the community can join in and help the work go faster.


On Fri, Jan 1, 2021 at 5:36 PM Ivan Petrov <[email protected]> wrote:
>
> Hi, thanks for the pointers. We have already tried cylondata. We found it
> hard to build, the Java side lacks tests, and sort and filter do not seem
> to be supported yet.
> We are short on time, which is why we can’t afford to build our own CI/CD
> for cylondata.
> The project looks very promising, but for now it is a huge technical risk
> for us.
>
>
> On Sat, 2 Jan 2021 at 00:25, Vibhatha Abeykoon <[email protected]> wrote:
>>
>> Check out https://cylondata.org/.
>>
>> We have also worked on this problem in both sequential and distributed
>> execution modes. An early DataFrame API is also available.
>>
>> [1]. https://cylondata.org/docs/python
>> [2]. https://cylondata.org/docs/python_api_docs
>>
>>
>> On Fri, Jan 1, 2021 at 2:07 PM Chris Nuernberger <[email protected]> 
>> wrote:
>>>
>>> Ivan,
>>>
>>> The Clojure dataset abstraction does not copy the data, uses mmap, and is 
>>> generally extremely fast for aggregate group-by operations. Just FYI.
>>>
>>>
>>> On Fri, Jan 1, 2021 at 10:24 AM Ivan Petrov <[email protected]> wrote:
>>>>
>>>> Hi!
>>>> I plan to:
>>>> - join
>>>> - group by
>>>> - filter
>>>> data using pyarrow (I am new to it). The idea is to get better performance
>>>> and memory utilisation (Apache Arrow columnar compression) compared to
>>>> pandas.
>>>> It seems that pyarrow has no support for joining two Tables / Datasets by
>>>> key, so I have to fall back to pandas.
>>>> I don’t really follow how the pyarrow <-> pandas integration works. Will
>>>> pandas rely on the Apache Arrow data structures? I’m fine with using only
>>>> these flat types for columns to avoid "corner cases":
>>>> - string
>>>> - int
>>>> - long
>>>> - decimal
>>>>
>>>> I have a feeling that pandas will copy all the data from Apache Arrow and
>>>> double the memory footprint (according to the doc). Did I get it right?
>>>> What is the right way to join, group by and filter several "Tables" /
>>>> "Datasets" while making use of pyarrow (and the underlying Apache Arrow)?
>>>>
>>>> Thank you!
>>
>> --
>> Vibhatha Abeykoon
