Yes, Ivan we are continuously working on improving things on Java. Our project https://twister2.org/ heavily involves with the JVM based high performance data engineering. Cylon will be powering Twister2 on top of Apache Arrow. Currently we mainly focus on Python and C++ stack towards a fully-fledged DataFrame. Also we would like to go with the community to support the required tools for the users.
On Fri, Jan 1, 2021 at 6:48 PM Wes McKinney <[email protected]> wrote: > Note that many of us think it's important to have canonical > implementations of important algorithms (aggregate / hash aggregate, > joins, sorts, etc.) in the Apache project and available to e.g. > pyarrow users, as opposed to having to direct them to a third party > project. I've been unable to do this work myself given my other > responsibilities, but I will be continuing to direct funding / > engineering time from my organization toward these goals. I hope that > others from the community can join in to help out to make the work go > faster. > > On Fri, Jan 1, 2021 at 5:36 PM Ivan Petrov <[email protected]> wrote: > > > > Hi, thanks for the pointers. We tried cylondata already. We find it hard > to build, some lack of tests for Java, seems like sort and filter not > supported yet... > > We are short on time that is why we can’t afford to build own ci/cd for > cylondata... > > Project looks very promising and for now it’s a huge technical risk for > us. > > > > > > On Sat, 2 Jan 2021 at 00:25, Vibhatha Abeykoon <[email protected]> > wrote: > >> > >> Checkout https://cylondata.org/. > >> > >> We have also worked on this problem in both sequential and distributed > execution mode. An early DataFrame API is also available. > >> > >> [1]. https://cylondata.org/docs/python > >> [2]. https://cylondata.org/docs/python_api_docs > >> > >> > >> On Fri, Jan 1, 2021 at 2:07 PM Chris Nuernberger <[email protected]> > wrote: > >>> > >>> Ivan, > >>> > >>> The Clojure dataset abstraction does not copy the data, uses mmap, and > is generally extremely fast for aggregate group-by operations. Just FYI. > >>> > >>> > >>> On Fri, Jan 1, 2021 at 10:24 AM Ivan Petrov <[email protected]> > wrote: > >>>> > >>>> Hi! > >>>> I plan to: > >>>> - join > >>>> - group by > >>>> - filter > >>>> data using pyarrow (new to it). The idea is to get better performance > and memory utilisation ( apache arrow columnar compression) compared to > pandas. > >>>> Seems like pyarrow has no support for joining two Tables / Dataset by > key so I have to fallback to pandas. > >>>> I don’t really follow how pyarrow <-> pandas integration works. Will > pandas rely on apache arrow data structure? I’m fine with using only these > flat types for columns to avoid "corner cases" > >>>> - string > >>>> - int > >>>> - long > >>>> - decimal > >>>> > >>>> I have a feeling that pandas will copy all data from apache arrow and > double the size (according to the doc). Did I get it right? > >>>> What is the right way to join, groupBy and filter several "Tables" / > "Datasets" utilizing pyarrow (underlying apache arrow) power? > >>>> > >>>> Thank you! > >> > >> -- > >> Vibhatha Abeykoon > -- Vibhatha Abeykoon
