Note that many of us think it's important to have canonical implementations of core algorithms (aggregate / hash aggregate, joins, sorts, etc.) in the Apache project and available to e.g. pyarrow users, as opposed to having to direct them to a third-party project. I've been unable to do this work myself given my other responsibilities, but I will continue to direct funding / engineering time from my organization toward these goals. I hope that others from the community can join in to help the work go faster.
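For anyone on the list less familiar with the terms, here is a rough pure-Python sketch of what a hash join and a hash aggregate do over column-oriented data. It is only an illustration and not how Arrow or pyarrow implements anything: the helper names `hash_join` and `hash_aggregate`, the dict-of-lists "table" layout, and the sample data are all made up for the example.

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Inner-join two dict-of-lists tables on `key` (hypothetical helper).

    Assumes the non-key column names of `left` and `right` do not collide.
    """
    # Build phase: map each key value in `right` to the row indices holding it.
    index = defaultdict(list)
    for j, k in enumerate(right[key]):
        index[k].append(j)

    # Probe phase: scan `left` and emit one output row per matching pair.
    out = {name: [] for name in left}
    out.update({name: [] for name in right if name != key})
    for i, k in enumerate(left[key]):
        for j in index.get(k, ()):
            for name in left:
                out[name].append(left[name][i])
            for name in right:
                if name != key:
                    out[name].append(right[name][j])
    return out


def hash_aggregate(table, key, value):
    """Group by `key` and sum `value`, keeping running totals in a dict."""
    totals = {}
    for k, v in zip(table[key], table[value]):
        totals[k] = totals.get(k, 0) + v
    # Groups come out in first-seen order (dicts preserve insertion order).
    return {key: list(totals), value: list(totals.values())}


# Made-up sample data: the order for customer 3 has no match and is dropped.
orders = {"customer_id": [1, 2, 1, 3], "amount": [10, 20, 5, 7]}
customers = {"customer_id": [1, 2], "country": ["DE", "FR"]}

joined = hash_join(orders, customers, "customer_id")
per_country = hash_aggregate(joined, "country", "amount")
print(per_country)
```

A real columnar engine would do the same build/probe and running-total steps over typed, contiguous buffers rather than Python lists, which is where the performance and memory benefits come from.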
On Fri, Jan 1, 2021 at 5:36 PM Ivan Petrov <[email protected]> wrote:
>
> Hi, thanks for the pointers. We tried cylondata already. We find it hard to
> build, some lack of tests for Java, seems like sort and filter not supported
> yet...
> We are short on time that is why we can’t afford to build own ci/cd for
> cylondata...
> Project looks very promising and for now it’s a huge technical risk for us.
>
>
> On Sat, 2 Jan 2021 at 00:25, Vibhatha Abeykoon <[email protected]> wrote:
>>
>> Checkout https://cylondata.org/.
>>
>> We have also worked on this problem in both sequential and distributed
>> execution mode. An early DataFrame API is also available.
>>
>> [1]. https://cylondata.org/docs/python
>> [2]. https://cylondata.org/docs/python_api_docs
>>
>>
>> On Fri, Jan 1, 2021 at 2:07 PM Chris Nuernberger <[email protected]>
>> wrote:
>>>
>>> Ivan,
>>>
>>> The Clojure dataset abstraction does not copy the data, uses mmap, and is
>>> generally extremely fast for aggregate group-by operations. Just FYI.
>>>
>>>
>>> On Fri, Jan 1, 2021 at 10:24 AM Ivan Petrov <[email protected]> wrote:
>>>>
>>>> Hi!
>>>> I plan to:
>>>> - join
>>>> - group by
>>>> - filter
>>>> data using pyarrow (new to it). The idea is to get better performance and
>>>> memory utilisation ( apache arrow columnar compression) compared to pandas.
>>>> Seems like pyarrow has no support for joining two Tables / Dataset by key
>>>> so I have to fallback to pandas.
>>>> I don’t really follow how pyarrow <-> pandas integration works. Will
>>>> pandas rely on apache arrow data structure? I’m fine with using only these
>>>> flat types for columns to avoid "corner cases"
>>>> - string
>>>> - int
>>>> - long
>>>> - decimal
>>>>
>>>> I have a feeling that pandas will copy all data from apache arrow and
>>>> double the size (according to the doc). Did I get it right?
>>>> What is the right way to join, groupBy and filter several "Tables" /
>>>> "Datasets" utilizing pyarrow (underlying apache arrow) power?
>>>>
>>>> Thank you!
>>
>> --
>> Vibhatha Abeykoon
