Re: guidance on extension types

Chang She Wed, 21 Sep 2022 20:49:23 -0700

Thanks Wes.

=> Array.to_numpy : I opened ARROW-17813
<https://issues.apache.org/jira/browse/ARROW-17813> as you suggested and
added some details / repro code. There's also a follow-up issue in the
other direction: converting a pandas DataFrame column to an Arrow
list<extension>.

=> You're right, I was a little hasty in the description and it wasn't very
accurate:

Scenario 1:

If I have a non-nested ExtensionArray whose storage is a DictionaryArray,
`pc.field("extension") == 'string'` would be a valid filter but
currently triggers the "function 'equal' has no kernel matching input
types" error.
This is the path DuckDB takes if you add something like
`extension=='string'` to the WHERE clause.
If Arrow/Acero could automatically lower extension-typed arguments to their
storage type when dispatching compute functions, running compute on
extension types would be a lot easier. Even for a list<label> column, at
least in DuckDB you could then use "UNNEST" to make it work.

Scenario 2:

The trouble with using UNNEST is that it makes the query a lot more
complicated and has performance implications. If we're working a lot with
nested data types, it would be easier to have a set of array functions that
operate on lists directly.
For a nested ExtensionArray, something like a list-contains function would
make things a lot easier. However, I think this is a lot more work (and
depends on other systems like DuckDB integrating these functions as
well).

Would it make sense for me to create a JIRA for scenario 1 to continue
further discussion?

Thanks again.

On Tue, Sep 20, 2022 at 6:11 PM Wes McKinney <[email protected]> wrote:

> hi Chang,
>
> There are a few rough edges here that you've run into:
>
> * It looks like Array.to_numpy does not "automatically lower" to the
> storage type when trying to convert to NumPy format. In the absence of
> some other conversion rule, converting to the storage type seems like
> a reasonable alternative to failing. Can you open a Jira issue about
> this? This could probably be fixed easily in time for the 10.0.0
> release, much more easily than the next issue
>
> * On the query, it looks like the filter portion at least is being
> handled by Arrow/Acero — the syntax / UX relating to nested types here
> is relatively unexplored relative to non-nested types. Here comparing
> the label type (itself a list of dictionary-encoded strings) to a
> string seems invalid, probably you would need to check for inclusion
> of the string in the label list-of-strings. I do not know what the
> syntax for this would be with DuckDB (to check for inclusion of a
> string in a list of strings) but in principle this is something that
> should be able to be made to work with some effort
>
> - Wes
>
> On Sun, Sep 18, 2022 at 8:23 PM Chang She <[email protected]> wrote:
> >
> > Hey y'all, thanks in advance for the discussion.
> >
> > I'm creating Arrow extensions for computer vision and I'm running into
> > issues in two scenarios. I couldn't find the answers in the archive so I
> > thought I'd post here.
> >
> > Example:
> > I make an extension type called "Label" that has storage type
> > "dictionary<int8, string>". This is an object detection dataset, so each
> > row represents an image and has multiple detected objects that need to be
> > labeled. So there's a "label" column that is "list<label>":
> >
> > Example table schema:
> > image_id: int
> > uri: string
> > label: list<label>   # list<dictionary<int8, string>>  storage type
> >
> >
> > Problems:
> > 1. `to_numpy` does not seem to work with a nested column. e.g., if I try
> > to call `to_numpy` on the `label` column, then I get "Not implemented type
> > for Arrow list to pandas: extension<label<LabelType>>"
> > 2. If I'm querying this dataset using duckdb, running "select * from
> > dataset where label='person'" results in: "Function 'equal' has no kernel
> > matching input types (extension<label<LabelType>>, string)"
> >
> > Am I missing an alternate path to make this work with extension types?
> > Does implementing this in Arrow consist of checking if something is an
> > extension type and, if so, using the storage type instead? Is this
> > something that's already on the roadmap at all?
> >
> > Thanks!
> >
> > Chang She
>
