Hi Jacek,

I don't think there is any great way to do this today. There is an
open issue for this that describes a possible workaround:
https://github.com/apache/arrow/issues/14882
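
I'm not sure that's exactly what the issue describes, but one pattern
that avoids the 100 filter passes is to sort by the key once and then
take zero-copy slices at the group boundaries. A minimal sketch,
assuming a pyarrow Table named "table" with a key column named "key"
(both placeholders):

    import pyarrow.compute as pc

    sorted_tab = table.sort_by("key")
    # On sorted input, value_counts returns each key once, in order,
    # so the counts give the contiguous group extents.
    counts = pc.value_counts(sorted_tab["key"])
    subtables, offset = {}, 0
    for entry in counts:
        n = entry["counts"].as_py()
        subtables[entry["values"].as_py()] = sorted_tab.slice(offset, n)
        offset += n

The sort costs O(n log n) once, and each slice is zero-copy, so this
should beat 100 full scans.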

Ian

On Fri, Sep 29, 2023 at 3:50 PM Jacek Pliszka <[email protected]> wrote:
>
> Hi!
>
> I am looking for an efficient way of working with pieces of a Table.
>
> Let's say I have a table with 100M rows and a key column with 100
> distinct values.
>
> I would like to be able to quickly get a subtable with just the rows
> for a given key value. Currently I run filter 100 times to generate
> the 100 subtables, which is not very performant.
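>
> For reference, what I do now is roughly this sketch ("table" and the
> column name "key" are placeholders):
>
>     import pyarrow.compute as pc
>
>     keys = pc.unique(table["key"])
>     subtables = {k.as_py(): table.filter(pc.equal(table["key"], k))
>                  for k in keys}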
>
> In pandas this is trivial with list(df.groupby("key")), but I would
> rather not switch to pandas just for that. Spark's rdd.groupByKey is
> another example.
>
> Even something smaller would be great - a "rechunk" on ChunkedArray
> that redistributes values between chunks based on some other array
> (e.g. rechunk([[1,2,3,4,5]], [1,0,1,0,1]) == [[2,4], [1,3,5]]),
> since working on whole chunks is then faster.
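>
> A reference implementation of the semantics I have in mind (just to
> pin down the behaviour - it still does one filter pass per label, so
> it is not the fast path I am asking for):
>
>     import pyarrow as pa
>     import pyarrow.compute as pc
>
>     def rechunk(chunks, labels):
>         # Flatten the input chunks, then emit one output chunk per
>         # distinct label, in order of first appearance.
>         values = pa.concat_arrays([pa.array(c) for c in chunks])
>         labels = pa.array(labels)
>         return pa.chunked_array(
>             [values.filter(pc.equal(labels, v))
>              for v in pc.unique(labels)])
>
>     # rechunk([[1, 2, 3, 4, 5]], [1, 0, 1, 0, 1])
>     # -> chunks [1, 3, 5] and [2, 4] (first-appearance order)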
>
> Any ideas better than filtering 100 times?
>
> Best Regards,
>
> Jacek
