Here you go: https://gist.github.com/lidavidm/2375cf34ee57fc694ba90d85025ab894
Pasted inline (let's hope the formatting holds up):
import pyarrow as pa

list_of_struct = pa.array([
    [{"item_id": 0, "price": 100}, {"item_id": 1, "price": 50}],
    [{"item_id": 10, "price": 20}, None],
    None,
], type=pa.list_(pa.struct([
    pa.field("item_id", pa.int64()),
    pa.field("price", pa.int64()),
])))

# One array per struct field (this incurs some overhead as it may
# allocate new validity bitmaps)
subarrays = list_of_struct.values.flatten()

# The rest of this is just manipulating array container objects
# Validity bitmap, offsets
buffers = list_of_struct.buffers()[:2]

item_id = pa.ListArray.from_buffers(
    pa.list_(pa.int64()),
    len(list_of_struct),
    buffers,
    list_of_struct.null_count,
    list_of_struct.offset,
    [subarrays[0]])

prices = pa.ListArray.from_buffers(
    pa.list_(pa.int64()),
    len(list_of_struct),
    buffers,
    list_of_struct.null_count,
    list_of_struct.offset,
    [subarrays[1]])

print(item_id)
print(prices)
-David
On Wed, Nov 10, 2021, at 16:32, Tim Nicolson wrote:
> David,
>
> Thanks for the info - glad that this feature is in the pipeline!
>
> I'd really appreciate some pointers on how to efficiently decompose the
> ListArray/StructArray - happy to flesh it out and come back with an example
> for posterity...
>
> Thanks again,
>
> Tim
>
> On Wed, Nov 10, 2021 at 5:20 PM David Li wrote:
>> Hey Tim,
>>
>> We're still wiring up all the work needed for nested field refs in general
>> (see ARROW-14658 [1]). And we haven't listed out what kinds of references we
>> want to support. I would say we want to support things that Substrait
>> supports [2], and the behavior you describe here appears to correspond to
>> "masked complex expression" references there. That said, the way it
>> ultimately gets implemented/exposed may be different.
>>
>> For now, you will have to read the column and then postprocess it yourself
>> (this will require you to manually decompose the ListArray/StructArray and
>> reconstruct the ListArray - I can work out an example if that would help).
>>
>> By the way, thank you for the example here - it reminds me that we also
>> likely should support pushing down the projection so that we only load the
>> necessary leaf nodes in Parquet as well.
>>
>> [1]: https://issues.apache.org/jira/browse/ARROW-14658
>> [2]:
>> https://substrait.io/expressions/field_references/#masked-complex-expression
>>
>> Best,
>> David
>>
>> On Tue, Nov 9, 2021, at 15:45, Tim Nicolson wrote:
>>> Hi,
>>>
>>> I have a parquet dataset containing "order" structs each of which has a
>>> list of "item" structs. I would like to read a subset of the item structs.
>>> e.g.
>>>
>>> order_id: int64
>>> ...other fields...
>>> items: list<item: struct<item_id: int64, price: int64, ...other fields...>>
>>>
>>> # is this/will this be possible?
>>> dataset.to_table(columns=["order_id", "items.item_id", "items.price"])
>>>
>>> I guess they'd be lists of scalars rather than a list of structs with fewer
>>> fields?
>>>
>>> I couldn't see any reference to *lists* in
>>> https://github.com/apache/arrow/pull/11466.
>>>
>>> Is this possible or planned? Is there another way to achieve this?
>>>
>>> Thanks in advance,
>>>
>>> Tim
>>