Re: guidance on extension types

Chang She Wed, 21 Sep 2022 21:45:43 -0700

Yup, we’ve run into this as well. Though I think you could control this by
implementing a pandas extension dtype to go with the Arrow extension type?



On Wed, Sep 21, 2022 at 9:17 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> Also, note I've raised a similar issue (
> https://issues.apache.org/jira/browse/ARROW-17535) for to_pandas calls.
> One thing that I think would be nice is to be able to hook into the Python
> conversion to translate to Python objects when necessary.
>
>
>
> On Wed, Sep 21, 2022 at 8:49 PM Chang She <ch...@eto.ai> wrote:
>
>> Thanks Wes.
>>
>> => Array.to_numpy : I opened ARROW-17813
>> <https://issues.apache.org/jira/browse/ARROW-17813> as you suggested and
>> added some details / repro code. There's also a follow-up thing about the
>> other direction, converting from a pandas DataFrame column to an Arrow
>> list<extension>.
>>
>> => You're right, I was a little hasty in the description and it wasn't
>> very accurate:
>>
>> Scenario 1:
>>
>> If I have a non-nested ExtensionArray whose storage is a DictionaryArray,
>> `pc.field("extension") == 'string'` would be a valid filter but
>> currently triggers the "function 'equal' has no kernel matching input
>> types" error.
>> This is the path used by DuckDB if you add something like
>> `extension=='string'` in the where clause.
>> If Arrow/Acero is also able to automatically lower to storage type for
>> the functions then it would make running compute on extension types a lot
>> easier. Even for a list<label> column, at least in duckdb you could use
>> "UNNEST" to make it work.
>>
>>
>> Scenario 2:
>>
>> The trouble with using UNNEST is it makes the query a lot more
>> complicated and has perf implications. If we're working a lot with nested
>> data types, it would be easier to have a set of array functions.
>> If there's a nested ExtensionArray, then something like a list-contains
>> function would make things a lot easier. However, I think this is a lot
>> more work (and depends on other systems like duckdb to integrate with these
>> functions as well).
>>
>>
>> Would it make sense for me to create a JIRA for scenario 1 to continue
>> further discussion?
>>
>>
>> Thanks again.
>>
>>
>> On Tue, Sep 20, 2022 at 6:11 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>>> hi Chang,
>>>
>>> There are a few rough edges here that you've run into:
>>>
>>> * It looks like Array.to_numpy does not "automatically lower" to the
>>> storage type when trying to convert to NumPy format. In the absence of
>>> some other conversion rule, converting to the storage type seems like
>>> a reasonable alternative to failing. Can you open a Jira issue about
>>> this? This could probably be fixed easily in time for the 10.0.0
>>> release, much more easily than the next issue
>>>
>>> * On the query, it looks like the filter portion at least is being
>>> handled by Arrow/Acero. The syntax / UX relating to nested types here
>>> is largely unexplored compared to non-nested types. Here comparing
>>> the label type (itself a list of dictionary-encoded strings) to a
>>> string seems invalid, probably you would need to check for inclusion
>>> of the string in the label list-of-strings. I do not know what the
>>> syntax for this would be with DuckDB (to check for inclusion of a
>>> string in a list of strings) but in principle this is something that
>>> should be able to be made to work with some effort
>>>
>>> - Wes
>>>
>>> On Sun, Sep 18, 2022 at 8:23 PM Chang She <ch...@eto.ai> wrote:
>>> >
>>> > Hey y'all, thanks in advance for the discussion.
>>> >
>>> > I'm creating Arrow extensions for computer vision and I'm running into
>>> issues in two scenarios. I couldn't find the answers in the archive so I
>>> thought I'd post here.
>>> >
>>> > Example:
>>> > I make an extension type called "Label" that has storage type
>>> "dictionary<int8, string>". This is an object detection dataset so each row
>>> represents an image and has multiple detected objects that need to be
>>> labeled. So there's a "label" column that is "list<label>":
>>> >
>>> > Example table schema:
>>> > image_id: int
>>> > uri: string
>>> > label: list<label>   # list<dictionary<int8, string>>  storage type
>>> >
>>> >
>>> > Problems:
>>> > 1. `to_numpy` does not seem to work with a nested column. e.g., if I
>>> try to call `to_numpy` on the `label` column, then I get "Not implemented
>>> type for Arrow list to pandas: extension<label<LabelType>>"
>>> > 2. If I'm querying this dataset using duckdb, running "select * from
>>> dataset where label='person'" results in: "Function 'equal' has no kernel
>>> matching input types (extension<label<LabelType>>, string)"
>>> >
>>> > Am I missing an alternate path to make this work with extension types?
>>> > Does implementing this in Arrow consist of checking if something is an
>>> extension type and if so, use the storage type instead? Is this something
>>> that's already on the roadmap at all?
>>> >
>>> > Thanks!
>>> >
>>> > Chang She
>>>
>>
