The first solution that comes to mind is:

* don't flatten the structure, just walk the structure
* create a new version of each id1 list that contains the data you want to **keep**
* construct a new table containing all of the old data, but replace the old id1 lists with their reduced versions (rough sketch below)
* after constructing the new table, you can delete the old id1 lists to free up memory
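One way this could look in pyarrow (a rough, untested sketch: it assumes `t` is the table from your example, that nothing is sliced, and that there are no nulls at any nesting level; `keep_mask` here is just a stand-in for whatever filter you actually apply over the flattened id1 values):

    import numpy as np
    import pyarrow as pa
    import pyarrow.compute as pc

    groups  = t['groups'].combine_chunks()     # ListArray of struct<id1, id2>
    structs = groups.flatten()                 # one struct per group item
    id1     = structs.field('id1')             # ListArray of int64
    id2     = structs.field('id2')             # untouched, reused as-is

    flat_vals = pc.list_flatten(id1)
    parents   = pc.list_parent_indices(id1)
    keep_mask = pa.array(np.random.choice([True, False], size=len(flat_vals),
                                          p=[0.001, 0.999]))

    kept_vals    = pc.filter(flat_vals, keep_mask)
    kept_parents = pc.filter(parents, keep_mask)

    # surviving values per id1 list -> offsets for the reduced id1 lists
    lengths = np.bincount(np.asarray(kept_parents), minlength=len(id1))
    offsets = pa.array(np.concatenate([[0], np.cumsum(lengths)]), pa.int32())
    new_id1 = pa.ListArray.from_arrays(offsets, kept_vals)

    # reassemble: id2 and the outer offsets are reused, so this part should
    # be close to zero-copy
    new_structs = pa.StructArray.from_arrays([new_id1, id2], names=['id1', 'id2'])
    new_groups  = pa.ListArray.from_arrays(groups.offsets, new_structs)
    new_t = t.set_column(t.schema.get_field_index('groups'), 'groups', new_groups)

Note that this keeps group items whose reduced id1 list is now empty; dropping those items means recomputing the outer offsets too, which is what the next idea handles.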
I think using `take` on the id1 list will maintain the structure for you. The tricky part is definitely keeping the structure of the groups list. You could potentially use `value_parent_indices` ([1]) on the groups list to gather each resulting item (struct<modified id1 list, id2 struct>) into the correct ListArray structure. Essentially, you:

* use `value_parent_indices` on the groups list to record which row each item belongs to
* filter each item's id1 list as needed
* gather the new items into the appropriate groups lists (list of lists) based on `value_parent_indices`
* construct the table from the old `group_id` column and the new `groups`

I'm not sure this is the best solution, but it seems like a reasonable baseline. The idea is to rely on compute functions over the id1 lists to move data efficiently (or avoid moving it at all), so reconstructing as described above should be mostly zero-copy.
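Continuing from the sketch above, the regrouping could look something like this (same caveats: untested, assumes no nulls at any nesting level; `groups`, `new_id1`, and `new_structs` are the variables from the first sketch):

    # drop group items whose reduced id1 list ended up empty, then
    # regroup the survivors into the outer groups list
    item_parents = groups.value_parent_indices()   # row index of each item
    item_keep    = pc.greater(pc.list_value_length(new_id1), 0)

    kept_structs = pc.filter(new_structs, item_keep)
    kept_parents = pc.filter(item_parents, item_keep)

    # surviving items per row -> offsets for the rebuilt outer list
    row_lengths  = np.bincount(np.asarray(kept_parents), minlength=len(groups))
    row_offsets  = pa.array(np.concatenate([[0], np.cumsum(row_lengths)]), pa.int32())
    final_groups = pa.ListArray.from_arrays(row_offsets, kept_structs)

    new_table = pa.table({'group_id': t['group_id'], 'groups': final_groups})

The bincount/cumsum step only materializes new offsets; the `filter` calls gather the surviving items in one pass each.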
This probably isn't easy to follow... so I decided to write some code in a gist [2]. I can get around to doing the filtering part and creating the reduced id1 lists next week if it'd be helpful, but maybe the description above and the excerpt in the gist are collectively helpful enough?

Good luck!

[1]: https://arrow.apache.org/docs/python/generated/pyarrow.ListArray.html#pyarrow.ListArray.value_parent_indices
[2]: https://gist.github.com/drin/b8fe9c5bb239b27bece81d817c83f0f4

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz


On Fri, Mar 10, 2023 at 2:44 PM Data Bytes <[email protected]> wrote:

> I have data in the following format:
>
> >>> import pyarrow as pa
> >>> import pyarrow.parquet as pq
> >>> import pyarrow.compute as pc
> >>> import numpy as np
> >>> t = pq.read_table('/tmp/example.parquet')
> >>> t.schema
> group_id: double
> groups: list<item: struct<id1: list<item: int64>, id2: struct<type: string, value: string>>>
>   child 0, item: struct<id1: list<item: int64>, id2: struct<type: string, value: string>>
>     child 0, id1: list<item: int64>
>       child 0, item: int64
>     child 1, id2: struct<type: string, value: string>
>       child 0, type: string
>       child 1, value: string
>
> I want to efficiently (i.e., fast with minimal copying) eliminate some of the values in the id1 lists, and eliminate entire group items if no values are left in the corresponding id1 list.
>
> So far I am able to do the following, which seems efficient:
>
> >>> flat_list = pc.list_flatten(t['groups'])
> >>> flat_list.type
> StructType(struct<id1: list<item: int64>, id2: struct<type: string, value: string>>)
> >>> id1_lists = pc.struct_field(flat_list, 0)
> >>> id1_lists.type
> ListType(list<item: int64>)
> >>> parent_indices = pc.list_parent_indices(id1_lists)
> >>> id1_arr = pc.list_flatten(id1_lists)
> >>> id1_arr.type
> DataType(int64)
>
> From here I am able to mask id1_arr and parent_indices:
>
> >>> # for example, I typically have 0.1% of the data survive this step
> >>> mask = np.random.choice([True, False], size=len(id1_arr), p=[0.001, 0.999])
> >>> masked_id1_arr = pc.filter(id1_arr, mask)
> >>> masked_parent_indices = pc.filter(parent_indices, mask)
> >>> len(id1_arr)
> 1309610
> >>> len(masked_id1_arr)
> 1343
>
> But I am stuck on how to reconstruct the original schema efficiently. Any ideas are appreciated! Example data file can be found here:
> https://wormhole.app/nWWQ4#ZNR9BZATe-N3dF-HUIftUA
>
> - db