Super helpful. I've productionised that - we can strip it out once the projection can be pushed down.
Thanks again David,
Tim

On Thu, Nov 11, 2021 at 1:05 AM David Li <[email protected]> wrote:

> Here you go:
> https://gist.github.com/lidavidm/2375cf34ee57fc694ba90d85025ab894
>
> Pasted inline (let's hope the formatting holds up):
>
> import pyarrow as pa
>
> list_of_struct = pa.array([
>     [{"item_id": 0, "price": 100}, {"item_id": 1, "price": 50}],
>     [{"item_id": 10, "price": 20}, None],
>     None
> ], type=pa.list_(pa.struct([
>     pa.field("item_id", pa.int64()),
>     pa.field("price", pa.int64()),
> ])))
>
> # One array per struct field (this incurs some overhead as it may
> # allocate new validity bitmaps)
> subarrays = list_of_struct.values.flatten()
> # The rest of this is just manipulating array container objects
>
> # Validity bitmap, offsets
> buffers = list_of_struct.buffers()[:2]
>
> item_id = pa.ListArray.from_buffers(
>     pa.list_(pa.int64()),
>     len(list_of_struct),
>     buffers,
>     list_of_struct.null_count,
>     list_of_struct.offset,
>     [subarrays[0]])
>
> prices = pa.ListArray.from_buffers(
>     pa.list_(pa.int64()),
>     len(list_of_struct),
>     buffers,
>     list_of_struct.null_count,
>     list_of_struct.offset,
>     [subarrays[1]])
>
> print(item_id)
> print(prices)
>
> -David
>
> On Wed, Nov 10, 2021, at 16:32, Tim Nicolson wrote:
>
> David,
>
> Thanks for the info - glad that this feature is in the pipeline!
>
> I'd really appreciate some pointers on how to efficiently decompose the
> ListArray/StructArray - happy to flesh it out and come back with an example
> for posterity...
>
> Thanks again,
>
> Tim
>
> On Wed, Nov 10, 2021 at 5:20 PM David Li <[email protected]> wrote:
>
>
> Hey Tim,
>
> We're still wiring up all the work needed for nested field refs in general
> (see ARROW-14658 [1]). And we haven't listed out what kinds of references
> we want to support. I would say we want to support things that Substrait
> supports [2] and the behavior you describe here appears to correspond to
> "masked complex expression" references there. That said, the way it
> ultimately gets implemented/exposed may be different.
>
> For now, you will have to read the column and then postprocess it yourself
> (this will require you to manually decompose the ListArray/StructArray and
> reconstruct the ListArray - I can work out an example if that would help).
>
> By the way, thank you for the example here - it reminds me that we also
> likely should support pushing down the projection so that we only load the
> necessary leaf nodes in Parquet as well.
>
> [1]: https://issues.apache.org/jira/browse/ARROW-14658
> [2]:
> https://substrait.io/expressions/field_references/#masked-complex-expression
>
> Best,
> David
>
> On Tue, Nov 9, 2021, at 15:45, Tim Nicolson wrote:
>
> Hi,
>
> I have a parquet dataset containing "order" structs each of which has a
> list of "item" structs. I would like to read a subset of the item structs.
> e.g.
>
> order_id: int64
>
> ...other fields...
>
> items: list<item: struct<item_id: int64, price: int64, ...other fields...>>
>
>
> # is this/will this be possible?
>
> dataset.to_table(columns=["order_id", "items.item_id", "items.price"])
>
>
> I guess they'd be lists of scalars rather than a list of structs with
> fewer fields?
>
> I couldn't see any reference to *lists* in
> https://github.com/apache/arrow/pull/11466.
>
> Is this possible or planned? Is there another way to achieve this?
>
> Thanks in advance,
>
> Tim
