Ivan,

Just FYI: the Clojure dataset abstraction does not copy the data, uses mmap, and is generally extremely fast for aggregate group-by operations. See <https://github.com/zero-one-group/geni-performance-benchmark/>.
On Fri, Jan 1, 2021 at 10:24 AM Ivan Petrov <[email protected]> wrote:

> Hi!
> I plan to:
> - join
> - group by
> - filter
> data using pyarrow (new to it). The idea is to get better performance and
> memory utilisation ( apache arrow columnar compression) compared to pandas.
> Seems like pyarrow has no support for joining two Tables / Dataset by key
> so I have to fallback to pandas.
> I don’t really follow how pyarrow <-> pandas integration works. Will
> pandas rely on apache arrow data structure? I’m fine with using only these
> flat types for columns to avoid "corner cases"
> - string
> - int
> - long
> - decimal
>
> I have a feeling that pandas will copy all data from apache arrow and
> double the size (according to the doc). Did I get it right?
> What is the right way to join, groupBy and filter several "Tables" /
> "Datasets" utilizing pyarrow (underlying apache arrow) power?
>
> Thank you!
