[jira] [Commented] (ARROW-1664) [Python] Support for xarray.DataArray and xarray.Dataset

2019-09-19 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933897#comment-16933897
 ] 

Wes McKinney commented on ARROW-1664:
-

I don't think that xarray is compatible with the Arrow columnar format. Let's 
move the discussion to the mailing list if there is something more actionable.

> [Python] Support for xarray.DataArray and xarray.Dataset
> 
>
> Key: ARROW-1664
> URL: https://issues.apache.org/jira/browse/ARROW-1664
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Mitar
> Priority: Minor
>
> DataArray and Dataset are efficient in-memory representations for 
> multi-dimensional data. It would be great if one could share them between processes 
> using Arrow.
> http://xarray.pydata.org/en/stable/generated/xarray.DataArray.html#xarray.DataArray
> http://xarray.pydata.org/en/stable/generated/xarray.Dataset.html#xarray.Dataset



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-1664) [Python] Support for xarray.DataArray and xarray.Dataset

2019-09-18 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932827#comment-16932827
 ] 

Joris Van den Bossche commented on ARROW-1664:
--

In general, xarray Datasets/DataArrays do not necessarily match Arrow's data 
model (e.g. they can have multiple dimensions). Of course, there is a subset 
of cases where an xarray object would map nicely to an Arrow table.  
Also, given that xarray uses contiguous numpy arrays and Arrow uses 1D arrays, 
I am not sure that Arrow is well suited to zero-copy serialization of such 
objects? (Converting to Arrow could be zero-copy, but not the other way around?)

So given that, I am not sure pyarrow should necessarily support xarray objects 
specifically. 
We could indeed think about a "table protocol", but for that I think it would 
be nice to have some more practical use cases.




[jira] [Commented] (ARROW-1664) [Python] Support for xarray.DataArray and xarray.Dataset

2019-09-18 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932813#comment-16932813
 ] 

Antoine Pitrou commented on ARROW-1664:
---

Ah, perhaps at some point we want to define a PyArrow table protocol like we 
already have a PyArrow array protocol. [~jorisvandenbossche] what do you think?



[jira] [Commented] (ARROW-1664) [Python] Support for xarray.DataArray and xarray.Dataset

2019-09-18 Thread Mitar (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932797#comment-16932797
 ] 

Mitar commented on ARROW-1664:
--

It is like an extension of DataFrame to multiple dimensions.
{quote}Xarray introduces labels in the form of dimensions, coordinates and 
attributes on top of raw [NumPy|http://www.numpy.org/]-like arrays, which 
allows for a more intuitive, more concise, and less error-prone developer 
experience. The package includes a large and growing library of domain-agnostic 
functions for advanced analytics and visualization with these data structures.

Xarray was inspired by and borrows heavily from 
[pandas|http://pandas.pydata.org/], the popular data analysis package focused 
on labelled tabular data.
{quote}
So internally it uses ndarrays. This is why I think serialization could be 
possible, similar to how Pandas DataFrames internally use ndarrays.



[jira] [Commented] (ARROW-1664) [Python] Support for xarray.DataArray and xarray.Dataset

2019-09-18 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932790#comment-16932790
 ] 

Antoine Pitrou commented on ARROW-1664:
---

Does xarray have a Table-like or DataFrame-like concept?



[jira] [Commented] (ARROW-1664) [Python] Support for xarray.DataArray and xarray.Dataset

2019-09-18 Thread Mitar (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932788#comment-16932788
 ] 

Mitar commented on ARROW-1664:
--

I see. So why not then also have `pa.Table.from_xarray`?



[jira] [Commented] (ARROW-1664) [Python] Support for xarray.DataArray and xarray.Dataset

2019-09-18 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932688#comment-16932688
 ] 

Antoine Pitrou commented on ARROW-1664:
---

> There is no special handling of Pandas DataFrame in arrow?

What do you mean? You can ingest a DataFrame using pa.Table.from_pandas(), for 
example.



[jira] [Commented] (ARROW-1664) [Python] Support for xarray.DataArray and xarray.Dataset

2019-09-18 Thread Mitar (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932670#comment-16932670
 ] 

Mitar commented on ARROW-1664:
--

Nice. And so Arrow support for Pandas DataFrame is only through:

[https://github.com/pandas-dev/pandas/blob/34fff1f336d3b083dd09f5036c2bb9b80edfb619/pandas/core/arrays/integer.py#L370]

There is no special handling of Pandas DataFrame in arrow?



[jira] [Commented] (ARROW-1664) [Python] Support for xarray.DataArray and xarray.Dataset

2019-09-18 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932629#comment-16932629
 ] 

Antoine Pitrou commented on ARROW-1664:
---

In ARROW-3829 we added a Python protocol to allow arbitrary objects to expose 
Arrow conversion capabilities. Does that solve the issue for you? Of course 
xarray would have to implement that protocol.



[jira] [Commented] (ARROW-1664) [Python] Support for xarray.DataArray and xarray.Dataset

2019-09-18 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932630#comment-16932630
 ] 

Antoine Pitrou commented on ARROW-1664:
---

See example in the PR:
https://github.com/apache/arrow/pull/5106/files#diff-8e181378bc711f4297fbe708ba95d3b8L1566




[jira] [Commented] (ARROW-1664) [Python] Support for xarray.DataArray and xarray.Dataset

2019-09-18 Thread Mitar (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932606#comment-16932606
 ] 

Mitar commented on ARROW-1664:
--

As [~wesmckinn] wrote: the idea is to get zero-copy reads. So serializing 
might be slow, but deserializing would be fast.

I think Pandas DataFrame also does not use "arrow under the hood", but Arrow 
supports it. Why not then also work on supporting xarray?

It is maybe not a priority now, but it should/could be done, in my view. So I 
would ask to reopen this issue.



[jira] [Commented] (ARROW-1664) [Python] Support for xarray.DataArray and xarray.Dataset

2019-09-18 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932601#comment-16932601
 ] 

Antoine Pitrou commented on ARROW-1664:
---

I'm not sure what the exact request is here. The way to integrate xarray and 
Arrow should be for xarray to use Arrow under the hood, IMHO. Arrow isn't meant 
as a general serialization layer - and besides, there's no reason it should be 
faster than anything else at serializing foreign binary data.



[jira] [Commented] (ARROW-1664) [Python] Support for xarray.DataArray and xarray.Dataset

2018-11-10 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682614#comment-16682614
 ] 

Wes McKinney commented on ARROW-1664:
-

You should be able to write serializers for use with `pa.serialize` that result 
in zero-copy reads. It may require a bit of work to get the various pandas Index 
types serializing properly.

> [Python] Support for xarray.DataArray and xarray.Dataset
> 
>
> Key: ARROW-1664
> URL: https://issues.apache.org/jira/browse/ARROW-1664
> Project: Apache Arrow
> Issue Type: Bug
> Reporter: Mitar
> Priority: Major
>
> DataArray and Dataset are efficient in-memory representations for 
> multi-dimensional data. It would be great if one could share them between processes 
> using Arrow.
> http://xarray.pydata.org/en/stable/generated/xarray.DataArray.html#xarray.DataArray
> http://xarray.pydata.org/en/stable/generated/xarray.Dataset.html#xarray.Dataset


