I have data in the following format:
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import pyarrow.compute as pc
>>> import numpy as np
>>> t = pq.read_table('/tmp/example.parquet')
>>> t.schema
group_id: double
groups: list<item: struct<id1: list<item: int64>, id2: struct<type: string, value: string>>>
  child 0, item: struct<id1: list<item: int64>, id2: struct<type: string, value: string>>
    child 0, id1: list<item: int64>
      child 0, item: int64
    child 1, id2: struct<type: string, value: string>
      child 0, type: string
      child 1, value: string
I want to efficiently (i.e., fast, with minimal copying) remove some of
the values from the id1 lists, and drop a group struct entirely when its
id1 list ends up empty.
So far I am able to do the following, which seems efficient:
>>> flat_list = pc.list_flatten(t['groups'])
>>> flat_list.type
StructType(struct<id1: list<item: int64>, id2: struct<type: string, value: string>>)
>>> id1_lists = pc.struct_field(flat_list, 0)
>>> id1_lists.type
ListType(list<item: int64>)
>>> parent_indices = pc.list_parent_indices(id1_lists)
>>> id1_arr = pc.list_flatten(id1_lists)
>>> id1_arr.type
DataType(int64)
From here I am able to mask id1_arr and parent_indices:
>>> # for example, I typically have 0.1% of the data survive this step
>>> mask = np.random.choice([True, False], size=len(id1_arr), p=[0.001, 0.999])
>>> masked_id1_arr = pc.filter(id1_arr, mask)
>>> masked_parent_indices = pc.filter(parent_indices, mask)
>>> len(id1_arr)
1309610
>>> len(masked_id1_arr)
1343
But I am stuck on how to efficiently reconstruct an array with the
original nested schema from masked_id1_arr and masked_parent_indices.
Any ideas are appreciated! Example data file can be found here:
https://wormhole.app/nWWQ4#ZNR9BZATe-N3dF-HUIftUA
- db