Hi!
I am looking for an efficient way of working with pieces of a Table.
Let's say I have a table with 100M rows and a key column with 100
distinct values.
I would like to be able to quickly get a subtable containing just the
rows for a given key value.
Currently I run filter 100 times to generate the 100 subtables, which
is not very performant.
In pandas this is trivial with list(pd.DataFrame.groupby("key")), but I
would rather not switch to pandas just for that. Spark's
rdd.groupByKey is another example.
Even something smaller would help - say, a "rechunk" on ChunkedArray
that reorders values between chunks based on some other array
(e.g. rechunk([[1,2,3,4,5]], [1,0,1,0,1]) == [[2,4], [1,3,5]]),
as then working chunk by chunk is fast.
Any ideas better than filtering 100 times?
Best Regards,
Jacek