Check out https://cylondata.org/.
We have also worked on this problem in both sequential and distributed execution modes. An early DataFrame API is also available.

[1] https://cylondata.org/docs/python
[2] https://cylondata.org/docs/python_api_docs

On Fri, Jan 1, 2021 at 2:07 PM Chris Nuernberger <[email protected]> wrote:

> Ivan,
>
> The Clojure dataset abstraction does not copy the data, uses mmap, and is
> generally extremely fast for aggregate group-by operations
> <https://github.com/zero-one-group/geni-performance-benchmark/>. Just FYI.
>
> On Fri, Jan 1, 2021 at 10:24 AM Ivan Petrov <[email protected]> wrote:
>
>> Hi!
>> I plan to:
>> - join
>> - group by
>> - filter
>> data using pyarrow (new to it). The idea is to get better performance and
>> memory utilisation (apache arrow columnar compression) compared to pandas.
>> Seems like pyarrow has no support for joining two Tables / Dataset by key
>> so I have to fallback to pandas.
>> I don’t really follow how pyarrow <-> pandas integration works. Will
>> pandas rely on apache arrow data structure? I’m fine with using only these
>> flat types for columns to avoid "corner cases"
>> - string
>> - int
>> - long
>> - decimal
>>
>> I have a feeling that pandas will copy all data from apache arrow and
>> double the size (according to the doc). Did I get it right?
>> What is the right way to join, groupBy and filter several "Tables" /
>> "Datasets" utilizing pyarrow (underlying apache arrow) power?
>>
>> Thank you!
>>
>
--
Vibhatha Abeykoon
