[jira] [Commented] (ARROW-2205) [Python] Option for integer object nulls
[ https://issues.apache.org/jira/browse/ARROW-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383637#comment-16383637 ]

ASF GitHub Bot commented on ARROW-2205:
---

xhochy closed pull request #1650: ARROW-2205: [Python] Option for integer object nulls
URL: https://github.com/apache/arrow/pull/1650

This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

diff --git a/cpp/src/arrow/python/arrow_to_pandas.cc b/cpp/src/arrow/python/arrow_to_pandas.cc
index aefd4d76d..21e848281 100644
--- a/cpp/src/arrow/python/arrow_to_pandas.cc
+++ b/cpp/src/arrow/python/arrow_to_pandas.cc
@@ -362,6 +362,29 @@ static void ConvertBooleanNoNulls(PandasOptions options, const ChunkedArray& dat
   }
 }
 
+template <typename T>
+static Status ConvertIntegerObjects(PandasOptions options, const ChunkedArray& data,
+                                    PyObject** out_values) {
+  PyAcquireGIL lock;
+  for (int c = 0; c < data.num_chunks(); c++) {
+    const auto& arr = *data.chunk(c);
+    const T* in_values = GetPrimitiveValues<T>(arr);
+
+    for (int i = 0; i < arr.length(); ++i) {
+      if (arr.IsNull(i)) {
+        Py_INCREF(Py_None);
+        *out_values++ = Py_None;
+      } else {
+        *out_values++ = std::is_signed<T>::value
+                            ? PyLong_FromLongLong(in_values[i])
+                            : PyLong_FromUnsignedLongLong(in_values[i]);
+        RETURN_IF_PYERROR();
+      }
+    }
+  }
+  return Status::OK();
+}
+
 template
 inline Status ConvertBinaryLike(PandasOptions options, const ChunkedArray& data,
                                 PyObject** out_values) {
@@ -684,6 +707,22 @@ class ObjectBlock : public PandasBlock {
 
     if (type == Type::BOOL) {
       RETURN_NOT_OK(ConvertBooleanWithNulls(options_, data, out_buffer));
+    } else if (type == Type::UINT8) {
+      RETURN_NOT_OK(ConvertIntegerObjects<uint8_t>(options_, data, out_buffer));
+    } else if (type == Type::INT8) {
+      RETURN_NOT_OK(ConvertIntegerObjects<int8_t>(options_, data, out_buffer));
+    } else if (type == Type::UINT16) {
+      RETURN_NOT_OK(ConvertIntegerObjects<uint16_t>(options_, data, out_buffer));
+    } else if (type == Type::INT16) {
+      RETURN_NOT_OK(ConvertIntegerObjects<int16_t>(options_, data, out_buffer));
+    } else if (type == Type::UINT32) {
+      RETURN_NOT_OK(ConvertIntegerObjects<uint32_t>(options_, data, out_buffer));
+    } else if (type == Type::INT32) {
+      RETURN_NOT_OK(ConvertIntegerObjects<int32_t>(options_, data, out_buffer));
+    } else if (type == Type::UINT64) {
+      RETURN_NOT_OK(ConvertIntegerObjects<uint64_t>(options_, data, out_buffer));
+    } else if (type == Type::INT64) {
+      RETURN_NOT_OK(ConvertIntegerObjects<int64_t>(options_, data, out_buffer));
     } else if (type == Type::BINARY) {
       RETURN_NOT_OK(ConvertBinaryLike(options_, data, out_buffer));
     } else if (type == Type::STRING) {
@@ -1202,34 +1241,33 @@ using BlockMap = std::unordered_map>;
 
 static Status GetPandasBlockType(const Column& col, const PandasOptions& options,
                                  PandasBlock::type* output_type) {
+#define INTEGER_CASE(NAME) \
+  *output_type = \
+      col.null_count() > 0 \
+          ? options.integer_object_nulls ? PandasBlock::OBJECT : PandasBlock::DOUBLE \
+          : PandasBlock::NAME; \
+  break;
+
   switch (col.type()->id()) {
     case Type::BOOL:
       *output_type = col.null_count() > 0 ? PandasBlock::OBJECT : PandasBlock::BOOL;
       break;
     case Type::UINT8:
-      *output_type = col.null_count() > 0 ? PandasBlock::DOUBLE : PandasBlock::UINT8;
-      break;
+      INTEGER_CASE(UINT8);
     case Type::INT8:
-      *output_type = col.null_count() > 0 ? PandasBlock::DOUBLE : PandasBlock::INT8;
-      break;
+      INTEGER_CASE(INT8);
     case Type::UINT16:
-      *output_type = col.null_count() > 0 ? PandasBlock::DOUBLE : PandasBlock::UINT16;
-      break;
+      INTEGER_CASE(UINT16);
     case Type::INT16:
-      *output_type = col.null_count() > 0 ? PandasBlock::DOUBLE : PandasBlock::INT16;
-      break;
+      INTEGER_CASE(INT16);
     case Type::UINT32:
-      *output_type = col.null_count() > 0 ? PandasBlock::DOUBLE : PandasBlock::UINT32;
-      break;
+      INTEGER_CASE(UINT32);
     case Type::INT32:
-      *output_type = col.null_count() > 0 ? PandasBlock::DOUBLE : PandasBlock::INT32;
-      break;
-    case Type::INT64:
-      *output_type = col.null_count() > 0 ? PandasBlock::DOUBLE : PandasBlock::INT64;
-      break;
+      INTEGER_CASE(INT32);
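
The decision made by the INTEGER_CASE macro in the diff above can be restated in a few lines of Python. This is only a rough sketch of the logic as it reads from the diff, not the Arrow implementation: `pandas_block_type` and `convert_integer_column` are hypothetical helper names, block types are represented as plain strings, and a Python list stands in for a column.

```python
import math
from typing import List, Optional, Union


def pandas_block_type(arrow_type: str, null_count: int,
                      integer_object_nulls: bool = False) -> str:
    """Pick a block type for an integer column (hypothetical helper).

    Mirrors the ternary in INTEGER_CASE: no nulls keeps the integer type,
    nulls give OBJECT when the option is set and DOUBLE otherwise.
    """
    if null_count == 0:
        return arrow_type
    return "OBJECT" if integer_object_nulls else "DOUBLE"


def convert_integer_column(
    values: List[Optional[int]], integer_object_nulls: bool = False
) -> List[Union[int, float, None]]:
    """Convert a list of ints with optional None entries (hypothetical helper)."""
    has_nulls = any(v is None for v in values)
    if not has_nulls or integer_object_nulls:
        # Integers are kept as Python ints; nulls stay None.
        return list(values)
    # Default path: nulls become NaN, so every value is cast to float.
    return [math.nan if v is None else float(v) for v in values]


print(pandas_block_type("INT64", null_count=0))
print(pandas_block_type("INT64", null_count=1))
print(pandas_block_type("INT64", null_count=1, integer_object_nulls=True))
print(convert_integer_column([None, 1], integer_object_nulls=True))
```

With the option left at its default, a column containing a null goes down the float path, which is the behavior the issue reports.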
[jira] [Commented] (ARROW-2205) [Python] Option for integer object nulls
[ https://issues.apache.org/jira/browse/ARROW-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382265#comment-16382265 ]

ASF GitHub Bot commented on ARROW-2205:
---

wesm commented on issue #1650: ARROW-2205: [Python] Option for integer object nulls
URL: https://github.com/apache/arrow/pull/1650#issuecomment-369649945

Rebasing this again

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> [Python] Option for integer object nulls
>
> Key: ARROW-2205
> URL: https://issues.apache.org/jira/browse/ARROW-2205
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++, Python
> Affects Versions: 0.8.0
> Reporter: Albert Shieh
> Assignee: Albert Shieh
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.9.0
>
> I have a use case where the loss of precision in casting integers to floats
> matters, and pandas supports storing integers with nulls without loss of
> precision in object columns. However, a roundtrip through arrow will cast the
> object columns to float columns, even though the object columns are stored in
> arrow as integers with nulls.
> This is a minimal example demonstrating the behavior of a roundtrip:
> {code}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({"a": np.array([None, 1], dtype=object)})
> df_pa = pa.Table.from_pandas(df).to_pandas()
> print(df)
> print(df_pa)
> {code}
> The output is:
> {code}
>       a
> 0  None
> 1     1
>      a
> 0  NaN
> 1  1.0
> {code}
> This seems to be the desired behavior, given test_int_object_nulls in
> test_convert_pandas.
> I think it would be useful to add an option in the to_pandas methods to allow
> integers with nulls to be returned as object columns. The option can default
> to false in order to preserve the current behavior.
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
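
The precision concern raised in the issue description can be checked with the standard library alone. The sketch below does not use pandas or pyarrow; it only illustrates why a 64-bit integer may not survive a cast to a 64-bit float, which has a 53-bit significand. `survives_float_cast` is a hypothetical helper name.

```python
# Every integer up to 2**53 in magnitude has an exact float64 representation;
# above that limit, some integers do not.
EXACT_LIMIT = 2 ** 53


def survives_float_cast(value: int) -> bool:
    """Return True if casting value to float and back yields the same integer."""
    return int(float(value)) == value


small = 1
large = EXACT_LIMIT + 1  # fits in int64, but is not exactly representable as float64

print(survives_float_cast(small))   # True
print(survives_float_cast(large))   # False
print(int(float(large)))            # 9007199254740992
```

Keeping such values as Python integers in an object column avoids the cast entirely, which is the motivation stated in the issue.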
[jira] [Commented] (ARROW-2205) [Python] Option for integer object nulls
[ https://issues.apache.org/jira/browse/ARROW-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16380977#comment-16380977 ]

ASF GitHub Bot commented on ARROW-2205:
---

wesm commented on issue #1650: ARROW-2205: [Python] Option for integer object nulls
URL: https://github.com/apache/arrow/pull/1650#issuecomment-369369328

Rebased. Going to wait for the builds to run
[jira] [Commented] (ARROW-2205) [Python] Option for integer object nulls
[ https://issues.apache.org/jira/browse/ARROW-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16380978#comment-16380978 ]

ASF GitHub Bot commented on ARROW-2205:
---

wesm commented on issue #1650: ARROW-2205: [Python] Option for integer object nulls
URL: https://github.com/apache/arrow/pull/1650#issuecomment-369369328

Rebased. Going to wait for the builds to run
[jira] [Commented] (ARROW-2205) [Python] Option for integer object nulls
[ https://issues.apache.org/jira/browse/ARROW-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378833#comment-16378833 ]

ASF GitHub Bot commented on ARROW-2205:
---

wesm commented on issue #1650: ARROW-2205: [Python] Option for integer object nulls
URL: https://github.com/apache/arrow/pull/1650#issuecomment-368927373

yes, plan to review this today
[jira] [Commented] (ARROW-2205) [Python] Option for integer object nulls
[ https://issues.apache.org/jira/browse/ARROW-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16377105#comment-16377105 ]

ASF GitHub Bot commented on ARROW-2205:
---

cpcloud commented on a change in pull request #1650: ARROW-2205: [Python] Option for integer object nulls
URL: https://github.com/apache/arrow/pull/1650#discussion_r170646464

## File path: python/pyarrow/tests/test_convert_pandas.py ##
@@ -615,6 +615,36 @@ def test_int_object_nulls(self):
         _check_pandas_roundtrip(df, expected=expected, expected_schema=schema)
 
+    def test_int_object_nulls_option(self):

Review comment:
Grouping by modules (which contain functions) is the solution to that particular problem.
[jira] [Commented] (ARROW-2205) [Python] Option for integer object nulls
[ https://issues.apache.org/jira/browse/ARROW-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16377102#comment-16377102 ]

ASF GitHub Bot commented on ARROW-2205:
---

pitrou commented on a change in pull request #1650: ARROW-2205: [Python] Option for integer object nulls
URL: https://github.com/apache/arrow/pull/1650#discussion_r170645928

## File path: python/pyarrow/tests/test_convert_pandas.py ##
@@ -615,6 +615,36 @@ def test_int_object_nulls(self):
         _check_pandas_roundtrip(df, expected=expected, expected_schema=schema)
 
+    def test_int_object_nulls_option(self):

Review comment:
I would vote against the pytest-style of a forest of functions. In my experience the lack of organization produces difficult to maintain test modules. Organizing the test methods into several classes helped me figure out which features were tested and how. An alternative would be to split the tests into several modules (or perhaps several modules inside a subpackage).
[jira] [Commented] (ARROW-2205) [Python] Option for integer object nulls
[ https://issues.apache.org/jira/browse/ARROW-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16377094#comment-16377094 ]

ASF GitHub Bot commented on ARROW-2205:
---

cpcloud commented on a change in pull request #1650: ARROW-2205: [Python] Option for integer object nulls
URL: https://github.com/apache/arrow/pull/1650#discussion_r170642827

## File path: python/pyarrow/tests/test_convert_pandas.py ##
@@ -615,6 +615,36 @@ def test_int_object_nulls(self):
         _check_pandas_roundtrip(df, expected=expected, expected_schema=schema)
 
+    def test_int_object_nulls_option(self):

Review comment:
Sure, but if we are going to do it eventually then we shouldn't knowingly add to the debt in the name of consistency.
[jira] [Commented] (ARROW-2205) [Python] Option for integer object nulls
[ https://issues.apache.org/jira/browse/ARROW-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16377083#comment-16377083 ]

ASF GitHub Bot commented on ARROW-2205:
---

adshieh commented on a change in pull request #1650: ARROW-2205: [Python] Option for integer object nulls
URL: https://github.com/apache/arrow/pull/1650#discussion_r170640785

## File path: python/pyarrow/tests/test_convert_pandas.py ##
@@ -615,6 +615,36 @@ def test_int_object_nulls(self):
         _check_pandas_roundtrip(df, expected=expected, expected_schema=schema)
 
+    def test_int_object_nulls_option(self):

Review comment:
Sure! However, it seems like none of the test methods use `self` and the test classes are just for organizational purposes, so moving it to a test function would be a deviation?
[jira] [Commented] (ARROW-2205) [Python] Option for integer object nulls
[ https://issues.apache.org/jira/browse/ARROW-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16377082#comment-16377082 ]

ASF GitHub Bot commented on ARROW-2205:
---

wesm commented on a change in pull request #1650: ARROW-2205: [Python] Option for integer object nulls
URL: https://github.com/apache/arrow/pull/1650#discussion_r170640655

## File path: python/pyarrow/tests/test_convert_pandas.py ##
@@ -615,6 +615,36 @@ def test_int_object_nulls(self):
         _check_pandas_roundtrip(df, expected=expected, expected_schema=schema)
 
+    def test_int_object_nulls_option(self):

Review comment:
We should probably convert this whole module to pytest-style in a separate patch
[jira] [Commented] (ARROW-2205) [Python] Option for integer object nulls
[ https://issues.apache.org/jira/browse/ARROW-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16377068#comment-16377068 ]

ASF GitHub Bot commented on ARROW-2205:
---

cpcloud commented on a change in pull request #1650: ARROW-2205: [Python] Option for integer object nulls
URL: https://github.com/apache/arrow/pull/1650#discussion_r170637799

## File path: python/pyarrow/tests/test_convert_pandas.py ##
@@ -615,6 +615,36 @@ def test_int_object_nulls(self):
         _check_pandas_roundtrip(df, expected=expected, expected_schema=schema)
 
+    def test_int_object_nulls_option(self):

Review comment:
It doesn't look like you're using `self` here. Can you make this into a test function and [`pytest.mark.parametrize`](https://docs.pytest.org/en/latest/parametrize.html) it on the `int_dtypes` parameter?
[jira] [Commented] (ARROW-2205) [Python] Option for integer object nulls
[ https://issues.apache.org/jira/browse/ARROW-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16376976#comment-16376976 ]

ASF GitHub Bot commented on ARROW-2205:
---

adshieh commented on issue #1650: ARROW-2205: [Python] Option for integer object nulls
URL: https://github.com/apache/arrow/pull/1650#issuecomment-368526062

I personally prefer keyword arguments because the number of calls to `to_pandas` in my use cases has been limited, so having the documentation in the method and avoiding the extra step of creating an options object has been convenient.
[jira] [Commented] (ARROW-2205) [Python] Option for integer object nulls
[ https://issues.apache.org/jira/browse/ARROW-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16375756#comment-16375756 ]

ASF GitHub Bot commented on ARROW-2205:
---

pitrou commented on issue #1650: ARROW-2205: [Python] Option for integer object nulls
URL: https://github.com/apache/arrow/pull/1650#issuecomment-368256301

> I am wondering out loud if there's anything we can do to help with the API for a growing number of pandas conversion arguments (like using an options object instead of keyword args)

Perhaps the dialect concept used by the [csv module](https://docs.python.org/3/library/csv.html) can be re-used?
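
For readers unfamiliar with the dialect concept mentioned in the comment above, here is a short illustration using only the standard-library csv module. It shows how a named bundle of formatting options is registered once and then passed by name; the dialect name "pipes" is made up for this example, and nothing here is a pyarrow API.

```python
import csv
import io

# Register a named bundle of options once ("pipes" is an arbitrary example name).
csv.register_dialect("pipes", delimiter="|", quoting=csv.QUOTE_MINIMAL)

# Writers and readers then refer to the bundle by name instead of
# repeating each keyword argument.
buffer = io.StringIO()
writer = csv.writer(buffer, dialect="pipes")
writer.writerow(["a", "b|c"])
line = buffer.getvalue()

rows = list(csv.reader(io.StringIO(line), dialect="pipes"))
print(rows)
```

Applied to pandas conversion, the same idea would mean naming a set of conversion options once and reusing it across calls; whether that is worth the extra indirection is the question the thread is discussing.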
[jira] [Commented] (ARROW-2205) [Python] Option for integer object nulls
[ https://issues.apache.org/jira/browse/ARROW-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16375704#comment-16375704 ]

ASF GitHub Bot commented on ARROW-2205:
---

xhochy commented on issue #1650: ARROW-2205: [Python] Option for integer object nulls
URL: https://github.com/apache/arrow/pull/1650#issuecomment-368247892

@wesm I think using `kwargs` seems to be the most pythonic way to do this. With Pandas I also wondered in the beginning over the large number of kwargs but in the end, it seems like a good-enough solution.
[jira] [Commented] (ARROW-2205) [Python] Option for integer object nulls
[ https://issues.apache.org/jira/browse/ARROW-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16375665#comment-16375665 ]

ASF GitHub Bot commented on ARROW-2205:
---

wesm commented on issue #1650: ARROW-2205: [Python] Option for integer object nulls
URL: https://github.com/apache/arrow/pull/1650#issuecomment-368241657

Thanks for working on this @adshieh! I am wondering out loud if there's anything we can do to help with the API for a growing number of pandas conversion arguments (like using an options object instead of keyword args)
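The two API shapes raised in this thread (individual keyword arguments versus an options object, or a csv-style dialect) can be sketched side by side. Everything below is hypothetical and is not the pyarrow API: `PandasConversionOptions` and `resolve_options` are illustrative names, and `strings_to_categorical` is included only as a second example field. Only the option name `integer_object_nulls` is taken from the PR.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class PandasConversionOptions:
    """Hypothetical bundle of conversion settings (not the pyarrow API)."""
    integer_object_nulls: bool = False    # option name taken from the PR
    strings_to_categorical: bool = False  # second field, for illustration only


def resolve_options(options: Optional[PandasConversionOptions] = None,
                    **overrides) -> PandasConversionOptions:
    """Start from an options object (or the defaults) and apply kwargs on top."""
    base = options if options is not None else PandasConversionOptions()
    # dataclasses.replace raises TypeError for unknown field names,
    # so typos in keyword arguments are still caught.
    return replace(base, **overrides)


# Keyword-argument style: callers pass individual flags.
opts_a = resolve_options(integer_object_nulls=True)

# Options-object style: a reusable preset, similar in spirit to a csv dialect.
lossless = PandasConversionOptions(integer_object_nulls=True)
opts_b = resolve_options(lossless, strings_to_categorical=True)

print(opts_a)
print(opts_b)
```

A helper of this shape would let both styles coexist: plain kwargs stay available for one-off calls, while a preset object can be passed around and reused.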