Re: Optimising pandas relational ops with pyarrow

Vibhatha Abeykoon Fri, 01 Jan 2021 16:19:31 -0800

Yes, Ivan, we are continuously working on improving things on the Java side.
Our project https://twister2.org/ is heavily involved in JVM-based
high-performance data engineering. Cylon will be powering Twister2 on top of
Apache Arrow. Currently we mainly focus on the Python and C++ stack, working
towards a fully-fledged DataFrame. We would also like to work with the
community to support the tools that users require.


On Fri, Jan 1, 2021 at 6:48 PM Wes McKinney <[email protected]> wrote:

> Note that many of us think it's important to have canonical
> implementations of important algorithms (aggregate / hash aggregate,
> joins, sorts, etc.) in the Apache project and available to e.g.
> pyarrow users, as opposed to having to direct them to a third party
> project. I've been unable to do this work myself given my other
> responsibilities, but I will be continuing to direct funding /
> engineering time from my organization toward these goals. I hope that
> others from the community can join in to help out to make the work go
> faster.
>
> On Fri, Jan 1, 2021 at 5:36 PM Ivan Petrov <[email protected]> wrote:
> >
> > Hi, thanks for the pointers. We tried cylondata already. We find it hard
> > to build, some lack of tests for Java, seems like sort and filter not
> > supported yet...
> > We are short on time that is why we can’t afford to build own ci/cd for
> > cylondata...
> > Project looks very promising and for now it’s a huge technical risk for
> > us.
> >
> >
> > On Sat, 2 Jan 2021 at 00:25, Vibhatha Abeykoon <[email protected]> wrote:
> >>
> >> Checkout https://cylondata.org/.
> >>
> >> We have also worked on this problem in both sequential and distributed
> >> execution mode. An early DataFrame API is also available.
> >>
> >> [1]. https://cylondata.org/docs/python
> >> [2]. https://cylondata.org/docs/python_api_docs
> >>
> >>
> >> On Fri, Jan 1, 2021 at 2:07 PM Chris Nuernberger <[email protected]> wrote:
> >>>
> >>> Ivan,
> >>>
> >>> The Clojure dataset abstraction does not copy the data, uses mmap, and
> >>> is generally extremely fast for aggregate group-by operations. Just FYI.
> >>>
> >>>
> >>> On Fri, Jan 1, 2021 at 10:24 AM Ivan Petrov <[email protected]> wrote:
> >>>>
> >>>> Hi!
> >>>> I plan to:
> >>>> - join
> >>>> - group by
> >>>> - filter
> >>>> data using pyarrow (new to it). The idea is to get better performance
> >>>> and memory utilisation (apache arrow columnar compression) compared to
> >>>> pandas.
> >>>> Seems like pyarrow has no support for joining two Tables / Dataset by
> >>>> key so I have to fallback to pandas.
> >>>> I don’t really follow how pyarrow <-> pandas integration works. Will
> >>>> pandas rely on apache arrow data structure? I’m fine with using only these
> >>>> flat types for columns to avoid "corner cases"
> >>>> - string
> >>>> - int
> >>>> - long
> >>>> - decimal
> >>>>
> >>>> I have a feeling that pandas will copy all data from apache arrow and
> >>>> double the size (according to the doc). Did I get it right?
> >>>> What is the right way to join, groupBy and filter several "Tables" /
> >>>> "Datasets" utilizing pyarrow (underlying apache arrow) power?
> >>>>
> >>>> Thank you!
> >>
> >> --
> >> Vibhatha Abeykoon
>
-- 
Vibhatha Abeykoon
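
For readers following the thread, the three operations asked about in the
original question (filter, join, group by) and the "hash aggregate" and "joins"
that Wes mentions can be illustrated without any library. The following is a
minimal, library-free sketch in plain Python over columnar data (one list per
column). It is not how pyarrow, pandas or Cylon implement these operations. The
table contents and the function names filter_rows, hash_join and
hash_aggregate_sum are hypothetical, chosen only for illustration, and the join
assumes the two tables share no column name other than the key.

```python
from collections import defaultdict

# Hypothetical columnar "tables": a dict mapping column name -> list of values.
orders = {
    "customer_id": [1, 2, 1, 3],
    "amount": [10, 25, 5, 40],
}
customers = {
    "customer_id": [1, 2, 3],
    "country": ["DE", "US", "DE"],
}


def filter_rows(table, column, predicate):
    """Keep only the rows whose value in `column` satisfies `predicate`."""
    keep = [i for i, v in enumerate(table[column]) if predicate(v)]
    return {name: [values[i] for i in keep] for name, values in table.items()}


def hash_join(left, right, key):
    """Inner join: build a hash index on `right`, then probe it with `left`."""
    index = defaultdict(list)
    for j, k in enumerate(right[key]):
        index[k].append(j)
    # Output columns: all of `left`, plus the non-key columns of `right`.
    out = {name: [] for name in left}
    out.update({name: [] for name in right if name != key})
    for i, k in enumerate(left[key]):
        for j in index.get(k, []):
            for name in left:
                out[name].append(left[name][i])
            for name in right:
                if name != key:
                    out[name].append(right[name][j])
    return out


def hash_aggregate_sum(table, key, value):
    """Group by `key` and sum `value`, keeping running totals in a hash table."""
    totals = defaultdict(int)
    for k, v in zip(table[key], table[value]):
        totals[k] += v
    return {key: list(totals), value + "_sum": list(totals.values())}


large = filter_rows(orders, "amount", lambda a: a >= 10)
joined = hash_join(large, customers, "customer_id")
result = hash_aggregate_sum(joined, "country", "amount")
print(result)
```

The sketch filters first so that the join and the aggregation touch fewer rows.
A columnar engine applies the same idea to typed, contiguous buffers rather
than Python lists.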
