[jira] [Commented] (ARROW-2121) Consider special casing object arrays in pandas serializers.
[ https://issues.apache.org/jira/browse/ARROW-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369640#comment-16369640 ]

ASF GitHub Bot commented on ARROW-2121:
---

wesm closed pull request #1581: ARROW-2121: [Python] Handle object arrays directly in pandas serializer.
URL: https://github.com/apache/arrow/pull/1581

This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

diff --git a/python/README-benchmarks.md b/python/README-benchmarks.md
index 3fecb35cb..60fa88f4a 100644
--- a/python/README-benchmarks.md
+++ b/python/README-benchmarks.md
@@ -41,8 +41,6 @@ First you have to install ASV's development version:
 pip install git+https://github.com/airspeed-velocity/asv.git
 ```
-
-
 Then you need to set up a few environment variables:
 
 ```shell
diff --git a/python/benchmarks/convert_pandas.py b/python/benchmarks/convert_pandas.py
index c4a7a59cb..244b3dcc8 100644
--- a/python/benchmarks/convert_pandas.py
+++ b/python/benchmarks/convert_pandas.py
@@ -48,3 +48,23 @@ def setup(self, n, dtype):
 
     def time_to_series(self, n, dtype):
         self.arrow_data.to_pandas()
+
+
+class ZeroCopyPandasRead(object):
+
+    def setup(self):
+        # Transpose to make column-major
+        values = np.random.randn(10, 10)
+
+        df = pd.DataFrame(values.T)
+        ctx = pa.default_serialization_context()
+
+        self.serialized = ctx.serialize(df)
+        self.as_buffer = self.serialized.to_buffer()
+        self.as_components = self.serialized.to_components()
+
+    def time_deserialize_from_buffer(self):
+        pa.deserialize(self.as_buffer)
+
+    def time_deserialize_from_components(self):
+        pa.deserialize_components(self.as_components)
diff --git a/python/doc/source/ipc.rst b/python/doc/source/ipc.rst
index 9bf93ffe8..bce8b1ed1 100644
--- a/python/doc/source/ipc.rst
+++ b/python/doc/source/ipc.rst
@@ -317,9 +317,8 @@ An object can be reconstructed from its component-based representation using
 
 Serializing pandas Objects
 --------------------------
 
-We provide a serialization context that has optimized handling of pandas
-objects like ``DataFrame`` and ``Series``. This can be created with
-``pyarrow.pandas_serialization_context()``. Combined with component-based
+The default serialization context has optimized handling of pandas
+objects like ``DataFrame`` and ``Series``. Combined with component-based
 serialization above, this enables zero-copy transport of pandas DataFrame
 objects not containing any Python objects:
 
@@ -327,7 +326,7 @@ objects not containing any Python objects:
    import pandas as pd
 
    df = pd.DataFrame({'a': [1, 2, 3, 4, 5]})
-   context = pa.pandas_serialization_context()
+   context = pa.default_serialization_context()
    serialized_df = context.serialize(df)
    df_components = serialized_df.to_components()
    original_df = context.deserialize_components(df_components)
diff --git a/python/pyarrow/__init__.py b/python/pyarrow/__init__.py
index d95954ed3..15a37ca10 100644
--- a/python/pyarrow/__init__.py
+++ b/python/pyarrow/__init__.py
@@ -125,7 +125,6 @@ localfs = LocalFileSystem.get_instance()
 
 from pyarrow.serialization import (default_serialization_context,
-                                   pandas_serialization_context,
                                    register_default_serialization_handlers,
                                    register_torch_serialization_handlers)
diff --git a/python/pyarrow/pandas_compat.py b/python/pyarrow/pandas_compat.py
index e8fa83fe7..6d4bf5e78 100644
--- a/python/pyarrow/pandas_compat.py
+++ b/python/pyarrow/pandas_compat.py
@@ -27,7 +27,7 @@ import six
 
 import pyarrow as pa
-from pyarrow.compat import PY2, zip_longest  # noqa
+from pyarrow.compat import builtin_pickle, PY2, zip_longest  # noqa
 
 
 def infer_dtype(column):
@@ -424,11 +424,19 @@ def dataframe_to_serialized_dict(frame):
                 block_data.update(dictionary=values.categories,
                                   ordered=values.ordered)
                 values = values.codes
-
             block_data.update(
                 placement=block.mgr_locs.as_array,
                 block=values
             )
+
+            # If we are dealing with an object array, pickle it instead. Note
+            # that we do not use isinstance here because _int.CategoricalBlock
+            # is a subclass of _int.ObjectBlock.
+            if type(block) == _int.ObjectBlock:
+                block_data['object'] = None
+                block_data['block'] = builtin_pickle.dumps(
+                    values, protocol=builtin_pickle.HIGHEST_PROTOCOL)
+
             blocks.append(block_data)
 
     return {
@@ -463,6 +471,9 @@ def _reconstruct_block(item):
         block = _int.make_block(block_arr, placement=placement,
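The exact-type check in the hunk above can be illustrated with a toy class hierarchy (hypothetical stand-in classes, not the real pandas internals): a subclass instance also passes `isinstance()` against its parent class, so an `isinstance` check would wrongly send `CategoricalBlock` down the object-pickling path.

```python
# Toy stand-ins for pandas' internal block classes (hypothetical names,
# mirroring the ObjectBlock/CategoricalBlock subclass relationship).
class ObjectBlock:
    pass

class CategoricalBlock(ObjectBlock):
    pass

block = CategoricalBlock()

# isinstance() matches parent classes too, so it cannot tell the two apart.
print(isinstance(block, ObjectBlock))  # True
# An exact-type comparison matches only ObjectBlock itself.
print(type(block) == ObjectBlock)      # False
```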
[ https://issues.apache.org/jira/browse/ARROW-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369638#comment-16369638 ]

ASF GitHub Bot commented on ARROW-2121:
---

wesm commented on issue #1581: ARROW-2121: [Python] Handle object arrays directly in pandas serializer.
URL: https://github.com/apache/arrow/pull/1581#issuecomment-366845422

Merging this, since the last Appveyor build had passed

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Consider special casing object arrays in pandas serializers.
>
> Key: ARROW-2121
> URL: https://issues.apache.org/jira/browse/ARROW-2121
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Robert Nishihara
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[ https://issues.apache.org/jira/browse/ARROW-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369616#comment-16369616 ]

ASF GitHub Bot commented on ARROW-2121:
---

robertnishihara commented on issue #1581: ARROW-2121: [Python] Handle object arrays directly in pandas serializer.
URL: https://github.com/apache/arrow/pull/1581#issuecomment-366835235

Thanks @wesm I *think* I've enabled it now.
[ https://issues.apache.org/jira/browse/ARROW-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369606#comment-16369606 ]

ASF GitHub Bot commented on ARROW-2121:
---

wesm commented on issue #1581: ARROW-2121: [Python] Handle object arrays directly in pandas serializer.
URL: https://github.com/apache/arrow/pull/1581#issuecomment-366831288

@robertnishihara would you mind enabling appveyor on your fork when you have a chance?
[ https://issues.apache.org/jira/browse/ARROW-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369590#comment-16369590 ]

ASF GitHub Bot commented on ARROW-2121:
---

wesm commented on issue #1581: ARROW-2121: [Python] Handle object arrays directly in pandas serializer.
URL: https://github.com/apache/arrow/pull/1581#issuecomment-366821860

Sorry for the delay, looking now, and may as well add a benchmark for zero-copy DataFrame while I'm at it
[ https://issues.apache.org/jira/browse/ARROW-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16360836#comment-16360836 ]

ASF GitHub Bot commented on ARROW-2121:
---

wesm commented on issue #1581: ARROW-2121: [Python] Handle object arrays directly in pandas serializer.
URL: https://github.com/apache/arrow/pull/1581#issuecomment-364946568

Yep, I have this on deck to look at today
[ https://issues.apache.org/jira/browse/ARROW-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359097#comment-16359097 ]

ASF GitHub Bot commented on ARROW-2121:
---

robertnishihara commented on issue #1581: ARROW-2121: [Python] Handle object arrays directly in pandas serializer.
URL: https://github.com/apache/arrow/pull/1581#issuecomment-364599769

Ok, I'm pretty happy with this now. @wesm @pcmoritz let me know if you have any comments.
[ https://issues.apache.org/jira/browse/ARROW-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16359080#comment-16359080 ]

ASF GitHub Bot commented on ARROW-2121:
---

robertnishihara commented on issue #1581: ARROW-2121: [Python] Handle object arrays directly in pandas serializer.
URL: https://github.com/apache/arrow/pull/1581#issuecomment-364595801

Let's not merge this just yet, I'd like to brainstorm other approaches a little.
[ https://issues.apache.org/jira/browse/ARROW-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358991#comment-16358991 ]

ASF GitHub Bot commented on ARROW-2121:
---

robertnishihara commented on a change in pull request #1581: ARROW-2121: [Python] Handle object arrays directly in pandas serializer.
URL: https://github.com/apache/arrow/pull/1581#discussion_r167156435

## File path: python/pyarrow/pandas_compat.py

@@ -421,11 +421,18 @@ def dataframe_to_serialized_dict(frame):
                 block_data.update(dictionary=values.categories,
                                   ordered=values.ordered)
                 values = values.codes
-
             block_data.update(
                 placement=block.mgr_locs.as_array,
                 block=values
             )
+
+            # If we are dealing with an object array, pickle it instead. Note
+            # that we do not use isinstance here because _int.CategoricalBlock
+            # is a subclass of _int.ObjectBlock.
+            if type(block) == _int.ObjectBlock:
+                block_data['object'] = None
+                block_data['block'] = builtin_pickle.dumps(values)

Review comment: Should we be using `_pickle_to_buffer` here? Does that make a difference?
[ https://issues.apache.org/jira/browse/ARROW-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358992#comment-16358992 ]

ASF GitHub Bot commented on ARROW-2121:
---

robertnishihara commented on a change in pull request #1581: ARROW-2121: [Python] Handle object arrays directly in pandas serializer.
URL: https://github.com/apache/arrow/pull/1581#discussion_r167350906

## File path: python/pyarrow/pandas_compat.py

@@ -421,11 +421,19 @@ def dataframe_to_serialized_dict(frame):
                 block_data.update(dictionary=values.categories,
                                   ordered=values.ordered)
                 values = values.codes
-
             block_data.update(
                 placement=block.mgr_locs.as_array,
                 block=values
             )
+
+            # If we are dealing with an object array, pickle it instead. Note
+            # that we do not use isinstance here because _int.CategoricalBlock
+            # is a subclass of _int.ObjectBlock.
+            if type(block) == _int.ObjectBlock:
+                block_data['object'] = None
+                block_data['block'] = builtin_pickle.dumps(
+                    values, protocol=builtin_pickle.HIGHEST_PROTOCOL)

Review comment: Should we be using `_pickle_to_buffer` here? Does that make a difference?
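The pickling fallback under review can be sketched with the standard library alone (plain `pickle` stands in here for `pyarrow.compat.builtin_pickle`): object-dtype arrays may hold arbitrary Python objects that Arrow's columnar format cannot represent directly, so the whole block is pickled and round-trips by value.

```python
import pickle

import numpy as np

# An object-dtype array mixing types that have no single Arrow column type.
values = np.array([1, "1", {"a": 2}], dtype=object)

# Serialize with the highest available protocol, as the patch does.
payload = pickle.dumps(values, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(payload)

print(restored.dtype)                         # object
print(list(restored) == [1, "1", {"a": 2}])   # True
```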
[ https://issues.apache.org/jira/browse/ARROW-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358990#comment-16358990 ]

ASF GitHub Bot commented on ARROW-2121:
---

robertnishihara commented on issue #1581: ARROW-2121: [Python] Handle object arrays directly in pandas serializer.
URL: https://github.com/apache/arrow/pull/1581#issuecomment-364573786

Some performance numbers. The numbers are somewhat variable if you run the benchmarks multiple times.

```python
import pyarrow as pa
import pandas as pd

df = pd.DataFrame(data={str(i): [i, str(i)] for i in range(10 ** 6)})
```

Before this PR

```python
context = pa.pandas_serialization_context()
%time s = pa.serialize(df, context=context).to_buffer()  # 570ms
%time d = pa.deserialize(s, context=context)             # 485ms
%timeit s = pa.serialize(df, context=context).to_buffer()  # 482ms
%timeit d = pa.deserialize(s, context=context)             # 376ms
```

After this PR

```python
%time s = pa.serialize(df).to_buffer()  # 577ms
%time d = pa.deserialize(s)             # 672ms
%timeit s = pa.serialize(df).to_buffer()  # 467ms
%timeit d = pa.deserialize(s)             # 349ms
```
[ https://issues.apache.org/jira/browse/ARROW-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16358038#comment-16358038 ]

ASF GitHub Bot commented on ARROW-2121:
---

robertnishihara commented on a change in pull request #1581: ARROW-2121: [Python] Handle object arrays directly in pandas serializer.
URL: https://github.com/apache/arrow/pull/1581#discussion_r167156435

## File path: python/pyarrow/pandas_compat.py

@@ -421,11 +421,18 @@ def dataframe_to_serialized_dict(frame):
                 block_data.update(dictionary=values.categories,
                                   ordered=values.ordered)
                 values = values.codes
-
             block_data.update(
                 placement=block.mgr_locs.as_array,
                 block=values
             )
+
+            # If we are dealing with an object array, pickle it instead. Note
+            # that we do not use isinstance here because _int.CategoricalBlock
+            # is a subclass of _int.ObjectBlock.
+            if type(block) == _int.ObjectBlock:
+                block_data['object'] = None
+                block_data['block'] = builtin_pickle.dumps(values)

Review comment: Should we be using `_pickle_to_buffer` here? Does that make a difference?
[ https://issues.apache.org/jira/browse/ARROW-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357976#comment-16357976 ]

ASF GitHub Bot commented on ARROW-2121:
---

robertnishihara commented on a change in pull request #1581: [WIP] ARROW-2121: [Python] Handle object arrays directly in pandas serializer.
URL: https://github.com/apache/arrow/pull/1581#discussion_r167148817

## File path: python/pyarrow/pandas_compat.py

@@ -421,11 +421,16 @@ def dataframe_to_serialized_dict(frame):
                 block_data.update(dictionary=values.categories,
                                   ordered=values.ordered)
                 values = values.codes
-
             block_data.update(
                 placement=block.mgr_locs.as_array,
                 block=values
             )
+
+            # If we are dealing with an object array, pickle it instead.
+            if isinstance(block, _int.ObjectBlock):
+                block_data['object'] = None
+                block_data['block'] = builtin_pickle.dumps(values)

Review comment: Should we be using `_pickle_to_buffer` here? Does that make a difference?
[ https://issues.apache.org/jira/browse/ARROW-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357967#comment-16357967 ]

ASF GitHub Bot commented on ARROW-2121:
---

wesm commented on issue #1581: [WIP] ARROW-2121: [Python] Handle object arrays directly in pandas serializer.
URL: https://github.com/apache/arrow/pull/1581#issuecomment-364344672

Well, we need to preserve the zero-copy pandas reads. Now that our ASV benchmarking setup has been rehabilitated we should be able to do that in this patch to verify performance
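The zero-copy property discussed above can be illustrated in plain NumPy (a simplified sketch of the idea, not Arrow's actual read path): a "zero-copy read" reconstructs an array as a view over an existing buffer rather than duplicating the bytes.

```python
import numpy as np

# "Zero-copy" means the reconstructed array is a view over an existing
# buffer: no bytes are duplicated, so mutations remain visible through it.
values = np.arange(8, dtype=np.float64)
buf = memoryview(values)                       # exposes the same memory
view = np.frombuffer(buf, dtype=np.float64)    # a view, not a copy

values[0] = 42.0
print(view[0])   # 42.0, because the view shares memory with `values`
```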
[ https://issues.apache.org/jira/browse/ARROW-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16357951#comment-16357951 ]

ASF GitHub Bot commented on ARROW-2121:
---

robertnishihara opened a new pull request #1581: [WIP] ARROW-2121: [Python] Handle object arrays directly in pandas serializer.
URL: https://github.com/apache/arrow/pull/1581

The goal here is to get the best of both the `pandas_serialization_context` (speed at serializing pandas dataframes containing strings and other objects) and the `default_serialization_context` (correctly serializing a large class of numpy object arrays).

This PR sort of messes up the function `pa.pandas_compat.dataframe_to_serialized_dict`. Is that function just a helper function for implementing the custom pandas serializers? Or is it intended to be used in other places?

TODO in this PR (assuming you think this approach is reasonable):
- [ ] remove `pandas_serialization_context`
- [ ] make sure this code path is tested
- [ ] double check that performance is good

cc @wesm @pcmoritz @devin-petersohn