Hi Chang,

There are a few rough edges here that you've run into:

* It looks like Array.to_numpy does not "automatically lower" to the
storage type when trying to convert to NumPy format. In the absence of
some other conversion rule, converting to the storage type seems like
a reasonable alternative to failing. Can you open a Jira issue about
this? This could probably be fixed in time for the 10.0.0 release,
and much more easily than the next issue.

* On the query, it looks like at least the filter portion is being
handled by Arrow/Acero. The syntax and UX for nested types here are
much less explored than for non-nested types. Comparing the label
column (itself a list of dictionary-encoded strings) to a string
seems invalid; you would probably need to check whether the string
is included in the label list-of-strings instead. I do not know what
the DuckDB syntax for that inclusion check would be, but in principle
this should be possible to make work with some effort.

- Wes

On Sun, Sep 18, 2022 at 8:23 PM Chang She <ch...@eto.ai> wrote:
>
> Hey y'all, thanks in advance for the discussion.
>
> I'm creating Arrow extensions for computer vision and I'm running into issues 
> in two scenarios. I couldn't find the answers in the archive so I thought I'd 
> post here.
>
> Example:
> I make an extension type called "Label" that has storage type 
> "dictionary<int8, string>". This is an object detection dataset so each row 
> represents an image and has multiple detected objects that need to be 
> labeled. So there's a "name" column that is "list<label>":
>
> Example table schema:
> image_id: int
> uri: string
> label: list<label>   # list<dictionary<int8, string>>  storage type
>
>
> Problems:
> 1. `to_numpy` does not seem to work with a nested column. e.g., if I try to 
> call `to_numpy` on the `label` column, then I get "Not implemented type for 
> Arrow list to pandas: extension<label<LabelType>>"
> 2. If I'm querying this dataset using duckdb, running "select * from dataset 
> where label='person'" results in: "Function 'equal' has no kernel matching 
> input types (extension<label<LabelType>>, string)"
>
> Am I missing an alternate path to make this work with extension types?
> Does implementing this in Arrow consist of checking if something is an 
> extension type and if so, use the storage type instead? Is this something 
> that's already on the roadmap at all?
>
> Thanks!
>
> Chang She
