Based on your gist, I came up with the following (it can be executed after the end of what you had written; I removed the `if __name__ == '__main__':` guard and ran it with `python -i create_groups_table.py` for convenience):
```py
# As an example, we start with the following groups of id1 values:
#   [[10], [11], [12]]
#   [[13, 14, 15], [16, 17, 18]]
#   [[19, 20]]
# And say we want to end with:
#   [[10]]
#   [[13, 14], [18]]

flat_list = pyarrow.compute.list_flatten(group_table['groups'])
id1_lists = pyarrow.compute.struct_field(flat_list, 0)

# these indices tell us how to regroup id1 values into lists
id1_parent_indices = pyarrow.compute.list_parent_indices(id1_lists)

id1_arr = pyarrow.compute.list_flatten(id1_lists)
id2_arr = pyarrow.compute.struct_field(flat_list, 1)

# keep only the id1 values in `valid_id1`
valid_id1 = np.array([10, 13, 14, 18], dtype=np.int64)
mask = pyarrow.compute.is_in(id1_arr, pyarrow.array(valid_id1))
masked_id1_arr = pyarrow.compute.filter(id1_arr, mask)
masked_id1_parent_indices = pyarrow.compute.filter(id1_parent_indices, mask)

# regroup the surviving id1 values into lists and keep the matching id2 structs
remaining_id1_parents, masked_id1_split_indices = np.unique(
    masked_id1_parent_indices, return_index=True
)
new_id1_lists = np.split(masked_id1_arr, masked_id1_split_indices[1:])
new_id2_arr = id2_arr.take(remaining_id1_parents)

# these indices tell us how to assign id1 lists to groups
groups_parent_indices = group_table['groups'].chunks[0].value_parent_indices()
retained_group_indices = groups_parent_indices.take(remaining_id1_parents)
unique_retained_group_indices, retained_group_indices_splits = np.unique(
    retained_group_indices, return_index=True
)

new_group_id = group_table['group_id'].take(unique_retained_group_indices)
new_groups_flat = pyarrow.compute.make_struct(
    new_id1_lists, new_id2_arr, field_names=['id1', 'id2']
)
new_groups = CreateGroups(np.split(new_groups_flat, retained_group_indices_splits[1:]))
new_table = CreateTable(new_group_id, new_groups)
```

I think this accomplishes what I want, but I don't have a good sense of how it could be made better. If you have ideas, let me know!

- db

On Fri, Mar 10, 2023 at 6:12 PM Aldrin <[email protected]> wrote:

> The first solution that comes to mind is:
> * don't flatten the structure, just walk it
> * create a new version of each id1 list that contains only the data you want to **keep**
> * construct a new table containing all of the old data, but with the id1 lists replaced by their reduced versions
> * after constructing the new table, delete the old id1 lists to free up memory
>
> I think using `take` on the id1 list will maintain the structure for you. The tricky part is definitely keeping the structure of the groups list. You could potentially use `value_parent_indices` ([1]) on the groups list to gather each resulting item (struct<modified id1 list, id2 struct>) into the correct ListArray structure. Essentially, you:
> * use `value_parent_indices` to create a list of lists
> * filter each item's id1 list as needed
> * gather the new items into the appropriate groups list (list of lists) based on `value_parent_indices`
> * construct a table from the old `group_ids` and the new `groups`
>
> I'm not sure this is a good solution, but it seems like a reasonable baseline. The idea is to rely on functions over the id1 list to move data efficiently (or avoid moving it at all); reconstructing as described above should be mostly zero-copy.
>
> This probably isn't easy to follow... so I wrote some code in a gist [2]. I can get around to doing the filtering part and creating the reduced id1 lists next week if that would be helpful, but maybe the description above and the excerpt in the gist are collectively helpful enough?
>
> good luck!
>
> [1]: https://arrow.apache.org/docs/python/generated/pyarrow.ListArray.html#pyarrow.ListArray.value_parent_indices
> [2]: https://gist.github.com/drin/b8fe9c5bb239b27bece81d817c83f0f4
>
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz
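As a self-contained illustration of the `list_parent_indices` regrouping pattern used in both messages above, here is a minimal sketch on made-up data (the toy values and the `keep` set are arbitrary, and the offset-based rebuild is one possible approach rather than the thread's exact code):

```py
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

# A toy stand-in for the id1 lists: filter values while preserving
# (and compacting) the list structure.
lists = pa.array([[10, 11], [12], [13, 14, 15]], type=pa.list_(pa.int64()))

values = pc.list_flatten(lists)          # [10, 11, 12, 13, 14, 15]
parents = pc.list_parent_indices(lists)  # [0, 0, 1, 2, 2, 2]

keep = pc.is_in(values, value_set=pa.array([10, 13, 14], type=pa.int64()))
kept_values = pc.filter(values, keep)    # [10, 13, 14]
kept_parents = pc.filter(parents, keep)  # [0, 2, 2]

# Count survivors per original list and turn the counts into offsets;
# lists with zero survivors (index 1 here) simply disappear.
surviving, counts = np.unique(kept_parents, return_counts=True)
offsets = np.concatenate([[0], np.cumsum(counts)]).astype(np.int32)

rebuilt = pa.ListArray.from_arrays(pa.array(offsets, type=pa.int32()), kept_values)
print(rebuilt.to_pylist())  # [[10], [13, 14]]
print(surviving)            # [0 2] -> take() indices for the matching id2 structs
```

The `surviving` indices play the same role as `remaining_id1_parents` in the snippet at the top of this message.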
>
> On Fri, Mar 10, 2023 at 2:44 PM Data Bytes <[email protected]> wrote:
>
>> I have data in the following format:
>>
>> >>> import pyarrow as pa
>> >>> import pyarrow.parquet as pq
>> >>> import pyarrow.compute as pc
>> >>> import numpy as np
>> >>> t = pq.read_table('/tmp/example.parquet')
>> >>> t.schema
>> group_id: double
>> groups: list<item: struct<id1: list<item: int64>, id2: struct<type: string, value: string>>>
>>   child 0, item: struct<id1: list<item: int64>, id2: struct<type: string, value: string>>
>>       child 0, id1: list<item: int64>
>>           child 0, item: int64
>>       child 1, id2: struct<type: string, value: string>
>>           child 0, type: string
>>           child 1, value: string
>>
>> I want to efficiently (i.e., fast, with minimal copying) eliminate some of the values in the id1 lists, and eliminate entire group items if no values are left in the corresponding id1 list.
>>
>> So far I am able to do the following, which seems efficient:
>>
>> >>> flat_list = pc.list_flatten(t['groups'])
>> >>> flat_list.type
>> StructType(struct<id1: list<item: int64>, id2: struct<type: string, value: string>>)
>> >>> id1_lists = pc.struct_field(flat_list, 0)
>> >>> id1_lists.type
>> ListType(list<item: int64>)
>> >>> parent_indices = pc.list_parent_indices(id1_lists)
>> >>> id1_arr = pc.list_flatten(id1_lists)
>> >>> id1_arr.type
>> DataType(int64)
>>
>> From here I am able to mask id1_arr and parent_indices:
>>
>> >>> # for example, I typically have 0.1% of the data survive this step
>> >>> mask = np.random.choice([True, False], size=len(id1_arr), p=[0.001, 0.999])
>> >>> masked_id1_arr = pc.filter(id1_arr, mask)
>> >>> masked_parent_indices = pc.filter(parent_indices, mask)
>> >>> len(id1_arr)
>> 1309610
>> >>> len(masked_id1_arr)
>> 1343
>>
>> But I am stuck on how to reconstruct the original schema efficiently. Any ideas are appreciated! The example data file can be found here: https://wormhole.app/nWWQ4#ZNR9BZATe-N3dF-HUIftUA
>>
>> - db
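For the reconstruction step the original question ends on, rebuilding the nested arrays from children plus offsets is one possible direction (a sketch on hand-built inputs matching the target output above, not verified against the example file):

```py
import pyarrow as pa

# Hypothetical flat pieces left after filtering, mirroring the schema in
# the question; the target is the example from the reply at the top:
#   group 0 -> [[10]]
#   group 1 -> [[13, 14], [18]]
id1 = pa.array([[10], [13, 14], [18]], type=pa.list_(pa.int64()))
id2 = pa.array(
    [{'type': 'a', 'value': 'x'}, {'type': 'b', 'value': 'y'}, {'type': 'c', 'value': 'z'}],
    type=pa.struct([('type', pa.string()), ('value', pa.string())]),
)

# Zip the children back into struct<id1, id2> items without copying them.
items = pa.StructArray.from_arrays([id1, id2], names=['id1', 'id2'])

# Outer offsets assign consecutive items to groups: item 0 belongs to
# group 0, items 1-2 to group 1.
groups = pa.ListArray.from_arrays(pa.array([0, 1, 3], type=pa.int32()), items)

table = pa.table({'group_id': pa.array([1.0, 2.0]), 'groups': groups})
print(table.schema)
```

Building `groups` this way wraps the existing child buffers rather than materializing Python lists, which may sidestep the `np.split` round-trip in the reply above.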
