Yup, we’ve run into this as well. Though I think you could control the to_pandas conversion by implementing a pandas extension dtype to go with the Arrow extension type?
On Wed, Sep 21, 2022 at 9:17 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Also, note I've raised a similar issue
> (https://issues.apache.org/jira/browse/ARROW-17535) for to_pandas calls.
> One thing that I think would be nice is to be able to hook into the python
> conversion when necessary translate to Python objects when necessary.
>
> On Wed, Sep 21, 2022 at 8:49 PM Chang She <ch...@eto.ai> wrote:
>
>> Thanks Wes.
>>
>> => Array.to_numpy : I opened ARROW-17813
>> <https://issues.apache.org/jira/browse/ARROW-17813> as you suggested and
>> added some details / repro code. There's also a follow-up thing about the
>> other direction, converting from a pandas DataFrame column to an Arrow
>> list<extension>.
>>
>> => You're right, I was a little hasty in the description and it wasn't
>> very accurate:
>>
>> Scenario 1:
>>
>> If I have a non-nested ExtensionArray whose storage is a DictionaryArray,
>> `pc.field("extension") == 'string'` would be a valid filter but
>> currently triggers the "function 'equal' has no kernel matching input
>> types" error.
>> This is the path used by DuckDB if you add something like
>> `extension=='string'` in the where clause.
>> If Arrow/Acero is also able to automatically lower to storage type for
>> the functions then it would make running compute on extension types a lot
>> easier. Even for a list<label> column, at least in duckdb you could use
>> "UNNEST" to make it work.
>>
>> Scenario 2:
>>
>> The trouble with using UNNEST is it makes the query a lot more
>> complicated and has perf implications. If we're working a lot with nested
>> data types, it would be easier to have a set of array functions.
>> If there's a nested ExtensionArray, then something like a list-contains
>> function would make things a lot easier. However, I think this is a lot
>> more work (and depends on other systems like duckdb to integrate with these
>> functions as well).
>>
>> Would it make sense for me to create a JIRA for scenario 1 to continue
>> further discussion?
>>
>> Thanks again.
>>
>> On Tue, Sep 20, 2022 at 6:11 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>>> hi Chang,
>>>
>>> There are a few rough edges here that you've run into:
>>>
>>> * It looks like Array.to_numpy does not "automatically lower" to the
>>> storage type when trying to convert to NumPy format. In the absence of
>>> some other conversion rule, converting to the storage type seems like
>>> a reasonable alternative to failing. Can you open a Jira issue about
>>> this? This could probably be fixed easily in time for the 10.0.0
>>> release, much more easily than the next issue
>>>
>>> * On the query, it looks like the filter portion at least is being
>>> handled by Arrow/Acero - the syntax / UX relating to nested types here
>>> is relatively unexplored relative to non-nested types. Here comparing
>>> the label type (itself a list of dictionary-encoded strings) to a
>>> string seems invalid, probably you would need to check for inclusion
>>> of the string in the label list-of-strings. I do not know what the
>>> syntax for this would be with DuckDB (to check for inclusion of a
>>> string in a list of strings) but in principle this is something that
>>> should be able to be made to work with some effort
>>>
>>> - Wes
>>>
>>> On Sun, Sep 18, 2022 at 8:23 PM Chang She <ch...@eto.ai> wrote:
>>> >
>>> > Hey y'all, thanks in advance for the discussion.
>>> >
>>> > I'm creating Arrow extensions for computer vision and I'm running into
>>> > issues in two scenarios. I couldn't find the answers in the archive so I
>>> > thought I'd post here.
>>> >
>>> > Example:
>>> > I make an extension type called "Label" that has storage type
>>> > "dictionary<int8, string>". This is an object detection dataset so each row
>>> > represents an image and has multiple detected objects that needs to be
>>> > labeled. So there's a "name" column that is "list<label>":
>>> >
>>> > Example table schema:
>>> > image_id: int
>>> > uri: string
>>> > label: list<label> # list<dictionary<int8, string>> storage type
>>> >
>>> > Problems:
>>> > 1. `to_numpy` does not seem to work with a nested column. e.g., if I
>>> > try to call `to_numpy` on the `label` column, then I get "Not implemented
>>> > type for Arrow list to pandas: extension<label<LabelType>>"
>>> > 2. If I'm querying this dataset using duckdb, running "select * from
>>> > dataset where label='person'" results in: "Function 'equal' has no kernel
>>> > matching input types (extension<label<LabelType>>, string)"
>>> >
>>> > Am I missing an alternate path to make this work with extension types?
>>> > Does implementing this in Arrow consist of checking if something is an
>>> > extension type and if so, use the storage type instead? Is this something
>>> > that's already on the roadmap at all?
>>> >
>>> > Thanks!
>>> >
>>> > Chang She