[jira] [Commented] (ARROW-2428) [Python] Support ExtensionArrays in to_pandas conversion

2019-05-06 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834100#comment-16834100
 ] 

Joris Van den Bossche commented on ARROW-2428:
--

[~xhochy] did you already have a specific hook in mind, or did you try
something specific at the AHL hackathon?

One way might be to let the user specify the target dtypes in {{to_pandas}},
optionally on a per-column basis. If an ExtensionDtype instance is passed
there, Arrow could delegate the conversion of the Arrow array to a pandas
ExtensionArray to the ExtensionDtype/Array class itself.
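
A minimal sketch of that delegation, using plain-Python stand-ins rather than
the real pandas or Arrow classes. The names {{to_pandas_sketch}}, the
{{dtypes}} argument, {{from_arrow_sketch}}, {{FractionDtype}} and
{{FractionArray}} are all hypothetical and only illustrate the idea:

```python
class FractionDtype:
    """Hypothetical stand-in for a pandas ExtensionDtype."""

    name = "fraction"

    def from_arrow_sketch(self, values):
        # Hypothetical hook: the dtype builds its own array from the
        # column data, so the converter needs no knowledge of the type.
        return FractionArray(values)


class FractionArray:
    """Hypothetical stand-in for a pandas ExtensionArray."""

    def __init__(self, values):
        self.values = list(values)
        self.dtype = FractionDtype()


def to_pandas_sketch(table, dtypes=None):
    """Convert a mapping of column name -> values.

    `dtypes` optionally maps a column name to a target dtype instance.
    Columns without an entry go through the default conversion.
    """
    dtypes = dtypes or {}
    result = {}
    for name, column in table.items():
        target = dtypes.get(name)
        if target is not None and hasattr(target, "from_arrow_sketch"):
            # Delegate the conversion to the dtype itself.
            result[name] = target.from_arrow_sketch(column)
        else:
            # Default conversion (here simply a list copy).
            result[name] = list(column)
    return result


table = {"a": [1, 2, 3], "b": ["1/2", "3/4"]}
converted = to_pandas_sketch(table, dtypes={"b": FractionDtype()})
```

Only column {{b}} is routed through the dtype here; column {{a}} keeps the
default path, which is what "optional per-column" would mean in practice.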

Similarly, if we start storing the name of the ExtensionDtype in the pandas
metadata, we could automatically re-create the dtype from that name by
default, without the user needing to pass it explicitly.
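
As a rough sketch of the metadata idea, assuming a simple name-to-class
registry and a JSON metadata blob: the {{extension_dtype}} key, the registry
and both helper functions below are hypothetical, not the actual pandas
metadata layout.

```python
import json

# Hypothetical registry mapping a dtype name to its class.
DTYPE_REGISTRY = {}


def register_dtype(cls):
    """Class decorator that records the dtype under its `name`."""
    DTYPE_REGISTRY[cls.name] = cls
    return cls


@register_dtype
class FractionDtype:
    """Hypothetical stand-in for a pandas ExtensionDtype."""

    name = "fraction"


def write_metadata(column_dtypes):
    """Serialize column name -> dtype instance (or None) to JSON."""
    columns = [
        {"name": col, "extension_dtype": getattr(dtype, "name", None)}
        for col, dtype in column_dtypes.items()
    ]
    return json.dumps({"columns": columns})


def dtypes_from_metadata(metadata_json):
    """Re-create dtype instances for columns whose stored name is registered."""
    meta = json.loads(metadata_json)
    dtypes = {}
    for col in meta["columns"]:
        cls = DTYPE_REGISTRY.get(col.get("extension_dtype"))
        if cls is not None:
            dtypes[col["name"]] = cls()
    return dtypes


metadata = write_metadata({"a": None, "b": FractionDtype()})
restored = dtypes_from_metadata(metadata)
```

Columns whose stored name is missing or unknown are simply skipped, so they
would fall back to the default conversion.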

See also the discussion in https://github.com/pandas-dev/pandas/issues/20612

> [Python] Support ExtensionArrays in to_pandas conversion
> 
>
> Key: ARROW-2428
> URL: https://issues.apache.org/jira/browse/ARROW-2428
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Uwe L. Korn
> Priority: Major
> Fix For: 1.0.0
>
>
> With the next release of Pandas, it will be possible to define custom column 
> types that back a {{pandas.Series}}. Thus we will not be able to cover all 
> possible column types in the {{to_pandas}} conversion by default as we won't 
> be aware of all extension arrays.
> To enable users to create {{ExtensionArray}} instances from Arrow columns in
> the {{to_pandas}} conversion, we should provide a hook in the {{to_pandas}}
> call where they can override the default conversion routines with ones
> that produce their {{ExtensionArray}} instances.
> This should avoid additional copies in the case where we would currently first
> convert the Arrow column into a default Pandas column (probably of object
> type) and the user would afterwards convert it to a more efficient
> {{ExtensionArray}}. This hook will be especially useful when you build
> {{ExtensionArrays}} where the storage is backed by Arrow.
> The meta-issue that tracks the implementation inside of Pandas is: 
> https://github.com/pandas-dev/pandas/issues/19696



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2428) [Python] Support ExtensionArrays in to_pandas conversion

2018-05-13 Thread Uwe L. Korn (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16473529#comment-16473529
 ] 

Uwe L. Korn commented on ARROW-2428:


We had a stab at this at the AHL hackathon and came to the conclusion that we
should wait for the Pandas 0.23 release. The interface still needs to settle
a bit more, and it is not as easy as initially expected because we use (deep)
internal APIs from Pandas to make the conversion from Arrow Tables to
DataFrames fast. Thus I have removed the {{beginner}} label. Someone who knows
the internals of Pandas very well might solve this easily, but others first
need to understand the mechanics of Pandas' BlockManager.






[jira] [Commented] (ARROW-2428) [Python] Support ExtensionArrays in to_pandas conversion

2018-05-12 Thread Alex Hagerman (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16473253#comment-16473253
 ] 

Alex Hagerman commented on ARROW-2428:
--

[~xhochy] I was reading through the meta-issue and trying to understand what we
have to make sure to pass. Do you think this has settled enough to begin work?
It appears pandas will expect a class defining the type; I'm guessing the
objects in the Arrow column will be instances of that user type? Do we expect
Arrow columns to meet all the requirements of ExtensionArray?

I was specifically looking at this to understand what options have to be passed
and what the ExtensionArray requires:

https://github.com/pandas-dev/pandas/pull/19174/files#diff-e448fe09dbe8aed468d89a4c90e65cff

> [Python] Support ExtensionArrays in to_pandas conversion
> 
>
> Key: ARROW-2428
> URL: https://issues.apache.org/jira/browse/ARROW-2428
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Uwe L. Korn
>Priority: Major
>  Labels: beginner
> Fix For: 1.0.0


