Hi!
I am looking for an efficient way of working with pieces of a Table.
Let's say I have a table with 100M rows and a key column with 100
distinct values.
I would like to be able to quickly get a subtable containing just the
rows for a given key value.
Currently I run filter 100 times to generate the 100 subtables, which
is not very performant.
In pandas this is trivial with list(pd.DataFrame.groupby("key")), but I
would rather not switch to pandas just for that. Spark's
rdd.groupByKey is another example.
Even something smaller would help - say, a "rechunk" on ChunkedArray
that reorders values between chunks based on some other array
(e.g. rechunk([[1,2,3,4,5]], [1,0,1,0,1]) == [[2,4], [1,3,5]]),
as then working chunk by chunk is fast.
Any ideas better than filtering 100 times?
Best Regards,
Jacek