> However, I'm wondering if there's a better path to integrating that more 
> "natively" into Arrow. Happy to make contributions if that's an option.

I'm not quite sure I understand where Arrow integration comes into
play here.  Would that scanner use Arrow internally?  Or would you
only convert the output to Arrow?

If you're planning on using Arrow's scanner internally then I think
there are improvements that could be made to add support for partially
reading a dataset.  There have been a few tentative explorations here
(e.g. I think marsupialtail has been looking at slicing fragments, and
there is some existing Spark work along the same lines).
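
In the meantime, limit/offset can be roughly approximated on top of
today's scanner by iterating record batches and slicing (Dataset.head
already covers the limit-only case).  A sketch, with the caveat that
take_slice is a hypothetical helper and this is not true pushdown,
since batches before the offset are still read and decoded:

```
import pyarrow as pa
import pyarrow.dataset as ds

def take_slice(dataset, offset, limit, **scan_kwargs):
    # Hypothetical helper: emulate limit/offset by walking record batches.
    # Batches before the offset are still scanned; this only avoids
    # materializing the whole table in memory.
    skip, take, batches = offset, limit, []
    for batch in dataset.to_batches(**scan_kwargs):
        if skip >= batch.num_rows:
            skip -= batch.num_rows
            continue
        sliced = batch.slice(skip, take)
        skip = 0
        take -= sliced.num_rows
        batches.append(sliced)
        if take <= 0:
            break
    if not batches:
        return dataset.head(0, **scan_kwargs)
    return pa.Table.from_batches(batches)

# e.g. "page 2" of 10 rows:
# take_slice(ds.dataset('/tmp/my_dataset'), offset=10, limit=10)
```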

If you're only converting the output to Arrow then I don't think there
are any special interoperability concerns.

What does `scanner` return?  Does it return a table?  Or an iterator?

> it's hard to express nested field references (i have to use a non-public 
> pc.Expression._nested_field and convert list-of-struct to struct-of-list)

There was some work recently to add better support for nested field references:

```
>>> import pyarrow as pa
>>> import pyarrow.dataset as ds

>>> points = [{'x': 7, 'y': 10}, {'x': 12, 'y': 44}]
>>> table = pa.Table.from_pydict({'points': points})
>>> ds.write_dataset(table, '/tmp/my_dataset', format='parquet')

>>> ds.dataset('/tmp/my_dataset').to_table(
...     columns={'x': ds.field('points', 'x')},
...     filter=(ds.field('points', 'y') > 20))
pyarrow.Table
x: int64
----
x: [[12]]
```

> the compute operations are not implemented for List arrays.

What sorts of operations would you like to see on list arrays?  Can
you give an example?
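
For reference, a handful of list kernels do already exist (this is just
a sketch, assuming a reasonably recent pyarrow where list_value_length,
list_flatten and list_element are available), so it would help to know
which specific operations are missing for your workloads:

```
import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array([[1, 2, 3], [4, 5]])

pc.list_value_length(arr)  # per-row list lengths: 3, 2
pc.list_flatten(arr)       # inner values concatenated: 1, 2, 3, 4, 5
pc.list_element(arr, 0)    # first element of each list: 1, 4
```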

On Tue, Sep 20, 2022 at 10:13 AM Chang She <[email protected]> wrote:
>
> Hi there,
>
> We're creating a new columnar data format for computer vision with Arrow 
> integration as a first class citizen (github.com/eto-ai/lance). It 
> significantly outperforms Parquet in a variety of computer vision workloads.
>
> Question 1:
>
> Because vision data tends to be large-ish blobs, we want to be very careful 
> about how much data is being retrieved. So we want to be able to push-down 
> limit/offset when it's appropriate to support data exploration queries (e.g., 
> "show me page 2 of N images that meet these filtering criteria"). For now 
> we've implemented our own Dataset/Scanner subclass to support these extra 
> options.
>
> Example:
>
> ```python
> lance.dataset(uri).scanner(limit=10, offset=20)
> ```
>
> And the idea is that it would only retrieve those rows from disk.
>
> However, I'm wondering if there's a better path to integrating that more 
> "natively" into Arrow. Happy to make contributions if that's an option.
>
>
> Question 2:
>
> In computer vision we're often dealing with deeply nested data types (e.g., 
> for object detection annotations that have a list of labels, bounding boxes, 
> polygons, etc. for each image). Lance supports efficient filtering scans on 
> these list-of-struct columns, but a) it's hard to express nested field 
> references (I have to use a non-public pc.Expression._nested_field and 
> convert list-of-struct to struct-of-list), and b) the compute operations are 
> not implemented for List arrays.
>
> Any guidance/thoughts on how y'all are thinking about efficient compute on 
> nested data?
>
> Thanks!
>
> Chang
>
