I have data in the following format:
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> import pyarrow.compute as pc
>>> import numpy as np
>>> t = pq.read_table('/tmp/example.parquet')
>>> t.schema
group_id: double
groups: list<item: struct<id1: list<item: int64>, id2: struct<type: string, value: string>>>
  child 0, item: struct<id1: list<item: int64>, id2: struct<type: string, value: string>>
    child 0, id1: list<item: int64>
      child 0, item: int64
    child 1, id2: struct<type: string, value: string>
      child 0, type: string
      child 1, value: string
I want to efficiently (i.e., fast, with minimal copying) remove some of
the values from the id1 lists, and drop a group struct entirely when its
id1 list ends up empty.
So far I am able to do the following, which seems efficient:
>>> flat_list = pc.list_flatten(t['groups'])
>>> flat_list.type
StructType(struct<id1: list<item: int64>, id2: struct<type: string, value: string>>)
>>> id1_lists = pc.struct_field(flat_list, 0)
>>> id1_lists.type
ListType(list<item: int64>)
>>> parent_indices = pc.list_parent_indices(id1_lists)
>>> id1_arr = pc.list_flatten(id1_lists)
>>> id1_arr.type
DataType(int64)
From here I am able to mask id1_arr and parent_indices:
>>> # for example, I typically have 0.1% of the data survive this step
>>> mask = np.random.choice([True, False], size=len(id1_arr), p=[0.001, 0.999])
>>> masked_id1_arr = pc.filter(id1_arr, mask)
>>> masked_parent_indices = pc.filter(parent_indices, mask)
>>> len(id1_arr)
1309610
>>> len(masked_id1_arr)
1343
But I am stuck on how to efficiently reconstruct an array with the
original nested schema from masked_id1_arr and masked_parent_indices.
Any ideas are appreciated! Example data file can be found here:
https://wormhole.app/nWWQ4#ZNR9BZATe-N3dF-HUIftUA
- db