Hi Jacek,

I don't think there is any great way to do this today. There is an open issue that describes a possible workaround: https://github.com/apache/arrow/issues/14882
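As a rough illustration of the sort-then-slice approach discussed there, something like the following might work (untested sketch; it assumes pyarrow >= 7.0 for Table.sort_by, and split_by_key is just an illustrative name):

    import pyarrow.compute as pc

    def split_by_key(table, key):
        # Untested sketch: one sort plus zero-copy slices, instead of
        # one filter pass per distinct key value.
        sorted_tbl = table.sort_by(key)
        # value_counts lists distinct values in order of first
        # occurrence; on a sorted column that is ascending order, so
        # the cumulative counts are exactly the slice boundaries.
        counts = pc.value_counts(sorted_tbl[key]).to_pylist()
        out = {}
        offset = 0
        for entry in counts:
            out[entry["values"]] = sorted_tbl.slice(offset, entry["counts"])
            offset += entry["counts"]
        return out

With 100 distinct keys this does one sort and 100 zero-copy slices, rather than 100 full filter passes over the data.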
Ian

On Fri, Sep 29, 2023 at 3:50 PM Jacek Pliszka <[email protected]> wrote:
>
> Hi!
>
> I am looking for an efficient way of working with pieces of a Table.
>
> Let's say I have a table with 100M rows and a key column with 100
> distinct values. I would like to be able to quickly get a subtable
> with just the rows for a given key value. Currently I run filter 100
> times to generate 100 subtables, which is not very performant.
>
> In pandas this is trivial with list(pd.DataFrame.groupby("key")), but
> I would rather not switch to pandas just for that. Spark's
> rdd.groupByKey is another example.
>
> Even something smaller would help - a "rechunk" on ChunkedArray that
> reorders values between chunks based on some other array
> (e.g. rechunk([[1,2,3,4,5]], [1,0,1,0,1]) == [[2,4], [1,3,5]]),
> as working on chunks is then faster.
>
> Any ideas better than filtering 100 times?
>
> Best Regards,
>
> Jacek
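P.S. On the "rechunk" idea: the semantics in your example can be spelled today roughly like this (untested sketch; rechunk_by_group is just an illustrative name). Note it still does one filter pass per distinct label under the hood, so it shows the API shape you are asking for rather than solving the performance problem:

    import pyarrow as pa
    import pyarrow.compute as pc

    def rechunk_by_group(values, groups):
        # Untested sketch: partition `values` into one chunk per
        # distinct label in `groups` (labels taken in ascending
        # order, to match the example above).
        chunks = []
        for g in sorted(pc.unique(groups).to_pylist()):
            part = values.filter(pc.equal(groups, g))
            chunks.append(pa.concat_arrays(part.chunks))
        return pa.chunked_array(chunks)

    values = pa.chunked_array([[1, 2, 3, 4, 5]])
    groups = pa.array([1, 0, 1, 0, 1])
    rechunk_by_group(values, groups)  # -> [[2, 4], [1, 3, 5]]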
