Based on your gist, I came up with the following, which can be executed
after the end of what you had written (I removed the `if __name__ ==
'__main__':` guard and ran it with `python -i create_groups_table.py` for
convenience):

```py
# Assumes everything from the gist has already run: it defines
# group_table, CreateGroups, and CreateTable, and imports pyarrow and
# numpy (as np).

# As an example, we start with the following groups of id1 values:
#   [[10], [11], [12]]
#   [[13, 14, 15], [16, 17, 18]]
#   [[19, 20]]

# And say we want to end with:
#   [[10]]
#   [[13, 14], [18]]

flat_list = pyarrow.compute.list_flatten(group_table['groups'])
id1_lists = pyarrow.compute.struct_field(flat_list, 0)
# These indices tell us how to regroup id1 values into lists.
id1_parent_indices = pyarrow.compute.list_parent_indices(id1_lists)
id1_arr = pyarrow.compute.list_flatten(id1_lists)
id2_arr = pyarrow.compute.struct_field(flat_list, 1)

# Keep only these id1 values; everything else is dropped.
valid_id1 = np.array([10, 13, 14, 18], dtype=np.int64)
mask = pyarrow.compute.is_in(id1_arr, pyarrow.array(valid_id1))
masked_id1_arr = pyarrow.compute.filter(id1_arr, mask)
masked_id1_parent_indices = pyarrow.compute.filter(id1_parent_indices, mask)

# Each surviving parent index appears once in remaining_id1_parents, and
# masked_id1_split_indices holds the offset of its first surviving value,
# so splitting at those offsets rebuilds the (reduced) id1 lists.
remaining_id1_parents, masked_id1_split_indices = np.unique(
    masked_id1_parent_indices, return_index=True)
new_id1_lists = np.split(masked_id1_arr, masked_id1_split_indices[1:])

new_id2_arr = id2_arr.take(remaining_id1_parents)

# These indices tell us how to assign id1 lists to groups.
groups_parent_indices = group_table['groups'].chunks[0].value_parent_indices()
retained_group_indices = groups_parent_indices.take(remaining_id1_parents)
unique_retained_group_indices, retained_group_indices_splits = np.unique(
    retained_group_indices, return_index=True)

new_group_id = group_table['group_id'].take(unique_retained_group_indices)

new_groups_flat = pyarrow.compute.make_struct(
    new_id1_lists, new_id2_arr, field_names=['id1', 'id2'])
new_groups = CreateGroups(
    np.split(new_groups_flat, retained_group_indices_splits[1:]))

new_table = CreateTable(new_group_id, new_groups)
```
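
For reference, here is the regrouping trick in isolation on toy data (none
of these values come from the real table, and I spell out the `value_set=`
keyword for clarity):

```py
import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

# Toy stand-ins for the flattened id1 lists.
lists = pa.array([[10], [13, 14, 15], [19, 20]], type=pa.list_(pa.int64()))
parents = pc.list_parent_indices(lists)  # [0, 1, 1, 1, 2, 2]
values = pc.list_flatten(lists)          # [10, 13, 14, 15, 19, 20]

keep = pc.is_in(values, value_set=pa.array([10, 13, 14], type=pa.int64()))
kept_values = np.asarray(pc.filter(values, keep))    # [10, 13, 14]
kept_parents = np.asarray(pc.filter(parents, keep))  # [0, 1, 1]

# Each surviving parent appears once, paired with the offset of its first
# surviving value; splitting at those offsets rebuilds the reduced lists.
survivors, first_offsets = np.unique(kept_parents, return_index=True)
rebuilt = np.split(kept_values, first_offsets[1:])
# survivors -> [0, 1]; rebuilt -> [array([10]), array([13, 14])]
```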

I think this accomplishes what I want, but I don't have a good sense of how
it could be made better. If you have ideas, let me know!

- db

On Fri, Mar 10, 2023 at 6:12 PM Aldrin <[email protected]> wrote:

> The first solution that comes to mind is:
> * don't flatten the structure, just walk the structure
> * create a new version of each id1 list that contains the data you want to
> **keep**
> * construct a new table containing all of the old data, but replace the
> id1 structs with their reduced versions
> * after constructing the new table, you can delete the old id1 lists to
> free up memory
>
> I think using `take` on the id1 list will maintain the structure for you.
> The tricky part is definitely keeping the structure of the groups list. You
> could potentially use `value_parent_indices` ([1]) on the groups list to
> gather each resulting item (struct<modified id1 list, id2 struct>) into the
> correct ListArray structure. Essentially, you:
> * use `value_parent_indices` to create a list of lists (see the toy
> sketch after this list)
> * you filter each item's id1 list as needed
> * then you gather new items into the appropriate groups list (list of
> lists) based on `value_parent_indices`
> * construct table from old `group_ids` and new `groups`
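>
> As a toy illustration of what `value_parent_indices` gives you
> (hypothetical data, not your real schema):
>
> ```py
> import pyarrow as pa
>
> # Three groups with 1, 2, and 3 items respectively.
> groups = pa.array([[0], [1, 2], [3, 4, 5]], type=pa.list_(pa.int64()))
> groups.value_parent_indices()  # -> [0, 1, 1, 2, 2, 2]
> # Each flattened item maps back to the outer list it came from, so after
> # filtering you can gather the survivors into the right group again.
> ```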
>
> I'm not sure this is a good solution, but it seems like a reasonable
> baseline. The idea is that you rely on compute functions over the id1
> lists to move data efficiently (or to avoid moving it at all), and I
> think reconstructing as I mentioned above should be mostly zero-copy.
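>
> One hypothetical way to spot-check that kind of zero-copy claim is to
> compare buffer addresses after rebuilding a list around existing values:
>
> ```py
> import pyarrow as pa
>
> values = pa.array([1, 2, 3, 4], type=pa.int64())
> offsets = pa.array([0, 2, 4], type=pa.int32())
> lst = pa.ListArray.from_arrays(offsets, values)
> # Same underlying data buffer address: the list array was built around
> # the existing values without copying them.
> lst.values.buffers()[1].address == values.buffers()[1].address  # True
> ```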
>
> This probably isn't easy to follow... so I decided to write some code in a
> gist [2]. I can get around to doing the filtering part and creating reduced
> id1 lists next week if it'd be helpful, but maybe the description above and
> the excerpt in the gist are collectively helpful enough?
>
> good luck!
>
> [1]:
> https://arrow.apache.org/docs/python/generated/pyarrow.ListArray.html#pyarrow.ListArray.value_parent_indices
> [2]: https://gist.github.com/drin/b8fe9c5bb239b27bece81d817c83f0f4
>
> Aldrin Montana
> Computer Science PhD Student
> UC Santa Cruz
>
>
> On Fri, Mar 10, 2023 at 2:44 PM Data Bytes <[email protected]> wrote:
>
>> I have data in the following format:
>> >>> import pyarrow as pa
>> >>> import pyarrow.parquet as pq
>> >>> import pyarrow.compute as pc
>> >>> import numpy as np
>> >>> t = pq.read_table('/tmp/example.parquet')
>> >>> t.schema
>> group_id: double
>> groups: list<item: struct<id1: list<item: int64>, id2: struct<type:
>> string, value: string>>>
>>   child 0, item: struct<id1: list<item: int64>, id2: struct<type: string,
>> value: string>>
>>       child 0, id1: list<item: int64>
>>           child 0, item: int64
>>       child 1, id2: struct<type: string, value: string>
>>           child 0, type: string
>>           child 1, value: string
>>
>>
>> I want to efficiently (i.e., fast with minimal copying) eliminate some of
>> the values in the id1 lists, and eliminate entire group items if no values
>> are left in the corresponding id1 list.
>>
>> So far I am able to do the following, which seems efficient:
>> >>> flat_list = pc.list_flatten(t['groups'])
>> >>> flat_list.type
>> StructType(struct<id1: list<item: int64>, id2: struct<type: string,
>> value: string>>)
>> >>> id1_lists = pc.struct_field(flat_list, 0)
>> >>> id1_lists.type
>> ListType(list<item: int64>)
>> >>> parent_indices = pc.list_parent_indices(id1_lists)
>> >>> id1_arr = pc.list_flatten(id1_lists)
>> >>> id1_arr.type
>> DataType(int64)
>>
>>
>> From here I am able to mask id1_arr and parent_indices:
>> >>> # for example, I typically have 0.1% of the data survive this step
>> >>> mask = np.random.choice([True, False], size=len(id1_arr), p=[0.001,
>> 0.999])
>> >>> masked_id1_arr = pc.filter(id1_arr, mask)
>> >>> masked_parent_indices = pc.filter(parent_indices, mask)
>> >>> len(id1_arr)
>> 1309610
>> >>> len(masked_id1_arr)
>> 1343
>>
>>
>> But I am stuck on how to reconstruct the original schema efficiently. Any
>> ideas are appreciated! Example data file can be found here:
>> https://wormhole.app/nWWQ4#ZNR9BZATe-N3dF-HUIftUA
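>>
>> (If the link is unavailable, a toy table with the same schema can be
>> built directly; the values below are made up:)
>> >>> id2_t = pa.struct([('type', pa.string()), ('value', pa.string())])
>> >>> item_t = pa.struct([('id1', pa.list_(pa.int64())), ('id2', id2_t)])
>> >>> toy = pa.table({
>> ...     'group_id': pa.array([1.0]),
>> ...     'groups': pa.array(
>> ...         [[{'id1': [10, 11], 'id2': {'type': 'a', 'value': 'x'}}]],
>> ...         type=pa.list_(item_t)),
>> ... })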
>>
>> - db
>>
>
