[jira] [Commented] (ARROW-1664) [Python] Support for xarray.DataArray and xarray.Dataset
[ https://issues.apache.org/jira/browse/ARROW-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933897#comment-16933897 ]

Wes McKinney commented on ARROW-1664:
-------------------------------------

I don't think that xarray is compatible with the Arrow columnar format. Let's move the discussion to the mailing list if there is something more actionable.

> [Python] Support for xarray.DataArray and xarray.Dataset
>
> Key: ARROW-1664
> URL: https://issues.apache.org/jira/browse/ARROW-1664
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Mitar
> Priority: Minor
>
> DataArray and Dataset are efficient in-memory representations for multi
> dimensional data. It would be great if one could share them between processes
> using Arrow.
> http://xarray.pydata.org/en/stable/generated/xarray.DataArray.html#xarray.DataArray
> http://xarray.pydata.org/en/stable/generated/xarray.Dataset.html#xarray.Dataset

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (ARROW-1664) [Python] Support for xarray.DataArray and xarray.Dataset
[ https://issues.apache.org/jira/browse/ARROW-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932827#comment-16932827 ]

Joris Van den Bossche commented on ARROW-1664:
----------------------------------------------

In general, xarray datasets/dataarrays do not necessarily match Arrow's data model (e.g. they can have multiple dimensions). Of course, there can be a subset of cases where your xarray object would map nicely to an Arrow table.

Also, given that xarray uses contiguous numpy arrays and Arrow uses 1D arrays, I am not sure that Arrow is well suited for zero-copy serialization of such objects? (Converting to Arrow could be zero-copy, but not the other way around?)

So given that, I am not sure pyarrow should necessarily support xarray objects specifically. We could indeed think about a "table protocol", but for that I think it would be nice to have some more practical use cases.
[jira] [Commented] (ARROW-1664) [Python] Support for xarray.DataArray and xarray.Dataset
[ https://issues.apache.org/jira/browse/ARROW-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932813#comment-16932813 ]

Antoine Pitrou commented on ARROW-1664:
---------------------------------------

Ah, perhaps at some point we want to define a PyArrow table protocol like we already have a PyArrow array protocol. [~jorisvandenbossche] what do you think?
[jira] [Commented] (ARROW-1664) [Python] Support for xarray.DataArray and xarray.Dataset
[ https://issues.apache.org/jira/browse/ARROW-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932797#comment-16932797 ]

Mitar commented on ARROW-1664:
------------------------------

It is like an extension of DataFrame to multiple dimensions.

{quote}Xarray introduces labels in the form of dimensions, coordinates and attributes on top of raw [NumPy|http://www.numpy.org/]-like arrays, which allows for a more intuitive, more concise, and less error-prone developer experience. The package includes a large and growing library of domain-agnostic functions for advanced analytics and visualization with these data structures.

Xarray was inspired by and borrows heavily from [pandas|http://pandas.pydata.org/], the popular data analysis package focused on labelled tabular data.
{quote}

So internally it uses ndarrays. This is why I think serialization could be possible, similar to how Pandas DataFrames internally use ndarrays.
[jira] [Commented] (ARROW-1664) [Python] Support for xarray.DataArray and xarray.Dataset
[ https://issues.apache.org/jira/browse/ARROW-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932790#comment-16932790 ]

Antoine Pitrou commented on ARROW-1664:
---------------------------------------

Does xarray have a Table-like or DataFrame-like concept?
[jira] [Commented] (ARROW-1664) [Python] Support for xarray.DataArray and xarray.Dataset
[ https://issues.apache.org/jira/browse/ARROW-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932788#comment-16932788 ]

Mitar commented on ARROW-1664:
------------------------------

I see. So why not then also have `pa.Table.from_xarray`?
[jira] [Commented] (ARROW-1664) [Python] Support for xarray.DataArray and xarray.Dataset
[ https://issues.apache.org/jira/browse/ARROW-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932688#comment-16932688 ]

Antoine Pitrou commented on ARROW-1664:
---------------------------------------

> There is no special handling of Pandas DataFrame in arrow?

What do you mean? You can ingest a DataFrame using pa.Table.from_pandas(), for example.
[jira] [Commented] (ARROW-1664) [Python] Support for xarray.DataArray and xarray.Dataset
[ https://issues.apache.org/jira/browse/ARROW-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932670#comment-16932670 ]

Mitar commented on ARROW-1664:
------------------------------

Nice. And so Arrow support for Pandas DataFrame is only through:

[https://github.com/pandas-dev/pandas/blob/34fff1f336d3b083dd09f5036c2bb9b80edfb619/pandas/core/arrays/integer.py#L370]

Is there no special handling of Pandas DataFrame in Arrow?
[jira] [Commented] (ARROW-1664) [Python] Support for xarray.DataArray and xarray.Dataset
[ https://issues.apache.org/jira/browse/ARROW-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932629#comment-16932629 ]

Antoine Pitrou commented on ARROW-1664:
---------------------------------------

In ARROW-3829 we added a Python protocol to allow arbitrary objects to expose Arrow conversion capabilities. Does that solve the issue for you? Of course xarray would have to implement that protocol.
[jira] [Commented] (ARROW-1664) [Python] Support for xarray.DataArray and xarray.Dataset
[ https://issues.apache.org/jira/browse/ARROW-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932630#comment-16932630 ]

Antoine Pitrou commented on ARROW-1664:
---------------------------------------

See example in the PR: https://github.com/apache/arrow/pull/5106/files#diff-8e181378bc711f4297fbe708ba95d3b8L1566
[jira] [Commented] (ARROW-1664) [Python] Support for xarray.DataArray and xarray.Dataset
[ https://issues.apache.org/jira/browse/ARROW-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932606#comment-16932606 ]

Mitar commented on ARROW-1664:
------------------------------

As [~wesmckinn] wrote: the idea is to get zero-copy reads. So serializing might be slow, but deserializing would be fast.

I think Pandas DataFrame also does not use "arrow under the hood", yet Arrow supports it. Why not then also work on supporting xarray? It is maybe not a priority now, but it should/could be done, in my view. So I would ask to reopen this issue.
[jira] [Commented] (ARROW-1664) [Python] Support for xarray.DataArray and xarray.Dataset
[ https://issues.apache.org/jira/browse/ARROW-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932601#comment-16932601 ]

Antoine Pitrou commented on ARROW-1664:
---------------------------------------

I'm not sure what the exact request is here. The way to integrate xarray and Arrow should be for xarray to use Arrow under the hood, IMHO. Arrow isn't meant as a general serialization layer - and besides, there's no reason it should be faster than anything else at serializing foreign binary data.
[jira] [Commented] (ARROW-1664) [Python] Support for xarray.DataArray and xarray.Dataset
[ https://issues.apache.org/jira/browse/ARROW-1664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16682614#comment-16682614 ]

Wes McKinney commented on ARROW-1664:
-------------------------------------

You should be able to write serializers for use with `pa.serialize` that result in zero-copy reads. May require a bit of work to get various pandas Index types serializing properly.

> [Python] Support for xarray.DataArray and xarray.Dataset
>
> Key: ARROW-1664
> URL: https://issues.apache.org/jira/browse/ARROW-1664
> Project: Apache Arrow
> Issue Type: Bug
> Reporter: Mitar
> Priority: Major
>
> DataArray and Dataset are efficient in-memory representations for multi
> dimensional data. It would be great if one could share them between processes
> using Arrow.
> http://xarray.pydata.org/en/stable/generated/xarray.DataArray.html#xarray.DataArray
> http://xarray.pydata.org/en/stable/generated/xarray.Dataset.html#xarray.Dataset

--
This message was sent by Atlassian JIRA (v7.6.3#76005)