[jira] [Updated] (ARROW-3053) [Python] pandas decimal conversion segfault
[ https://issues.apache.org/jira/browse/ARROW-3053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Albert Shieh updated ARROW-3053: Description: This example segfaults when trying to convert a pandas DataFrame with a decimal column and at least one other object column to a pyarrow Table after a round trip through HDF5: {code:java} import decimal import pandas as pd import pyarrow as pa data = {'a': {0: 'a'}, 'b': {0: decimal.Decimal('0.0')}} df = pd.DataFrame.from_dict(data) df.to_hdf('test.h5', 'test') df = pd.read_hdf('test.h5', 'test') table = pa.Table.from_pandas(df) {code} This is the gdb backtrace: {code:java} #0 0x7f188a08fc0b in arrow::py::internal::PandasObjectIsNull(_object*) () from /home/ashieh/.local/lib/python2.7/site-packages/pyarrow/libarrow_python.so.10 #1 0x7f188a09931c in arrow::py::NumPyConverter::ConvertDecimals() () from /home/ashieh/.local/lib/python2.7/site-packages/pyarrow/libarrow_python.so.10 #2 0x7f188a09ef4b in arrow::py::NumPyConverter::ConvertObjectsInfer() () from /home/ashieh/.local/lib/python2.7/site-packages/pyarrow/libarrow_python.so.10 #3 0x7f188a09f5db in arrow::py::NumPyConverter::ConvertObjects() () from /home/ashieh/.local/lib/python2.7/site-packages/pyarrow/libarrow_python.so.10 #4 0x7f188a09f715 in arrow::py::NumPyConverter::Convert() () from /home/ashieh/.local/lib/python2.7/site-packages/pyarrow/libarrow_python.so.10 #5 0x7f188a0a0f5e in arrow::py::NdarrayToArrow(arrow::MemoryPool*, _object*, _object*, bool, std::shared_ptr const&, std::shared_ptr*) () from /home/ashieh/.local/lib/python2.7/site-packages/pyarrow/libarrow_python.so.10 #6 0x7f188ab1a13e in __pyx_pw_7pyarrow_3lib_79array(_object*, _object*, _object*) () from /home/ashieh/.local/lib/python2.7/site-packages/pyarrow/lib.so #7 0x004c37ed in PyEval_EvalFrameEx () #8 0x004b9ab6 in PyEval_EvalCodeEx () #9 0x004c1e6f in PyEval_EvalFrameEx () #10 0x004b9ab6 in PyEval_EvalCodeEx () #11 0x004d55f3 in ?? 
() #12 0x7f188aa75eac in __pyx_pw_7pyarrow_3lib_5Table_17from_pandas(_object*, _object*, _object*) () from /home/ashieh/.local/lib/python2.7/site-packages/pyarrow/lib.so #13 0x004bc3fa in PyEval_EvalFrameEx () #14 0x004b9ab6 in PyEval_EvalCodeEx () #15 0x004eb30f in ?? () #16 0x004e5422 in PyRun_FileExFlags () #17 0x004e3cd6 in PyRun_SimpleFileExFlags () #18 0x00493ae2 in Py_Main () #19 0x7f18a79c4830 in __libc_start_main (main=0x4934c0 , argc=2, argv=0x7fffcf079508, init=, fini=, rtld_fini=, stack_end=0x7fffcf0794f8) at ../csu/libc-start.c:291 #20 0x004933e9 in _start () {code} was: This example segfaults when trying to convert a pandas DataFrame with a decimal column and at least one other object column and at to a pyarrow Table after a round trip through HDF5: {code:java} import decimal import pandas as pd import pyarrow as pa data = {'a': {0: 'a'}, 'b': {0: decimal.Decimal('0.0')}} df = pd.DataFrame.from_dict(data) df.to_hdf('test.h5', 'test') df = pd.read_hdf('test.h5', 'test') table = pa.Table.from_pandas(df) {code} This is the gdb backtrace: {code:java} #0 0x7f188a08fc0b in arrow::py::internal::PandasObjectIsNull(_object*) () from /home/ashieh/.local/lib/python2.7/site-packages/pyarrow/libarrow_python.so.10 #1 0x7f188a09931c in arrow::py::NumPyConverter::ConvertDecimals() () from /home/ashieh/.local/lib/python2.7/site-packages/pyarrow/libarrow_python.so.10 #2 0x7f188a09ef4b in arrow::py::NumPyConverter::ConvertObjectsInfer() () from /home/ashieh/.local/lib/python2.7/site-packages/pyarrow/libarrow_python.so.10 #3 0x7f188a09f5db in arrow::py::NumPyConverter::ConvertObjects() () from /home/ashieh/.local/lib/python2.7/site-packages/pyarrow/libarrow_python.so.10 #4 0x7f188a09f715 in arrow::py::NumPyConverter::Convert() () from /home/ashieh/.local/lib/python2.7/site-packages/pyarrow/libarrow_python.so.10 #5 0x7f188a0a0f5e in arrow::py::NdarrayToArrow(arrow::MemoryPool*, _object*, _object*, bool, std::shared_ptr const&, std::shared_ptr*) () from 
/home/ashieh/.local/lib/python2.7/site-packages/pyarrow/libarrow_python.so.10 #6 0x7f188ab1a13e in __pyx_pw_7pyarrow_3lib_79array(_object*, _object*, _object*) () from /home/ashieh/.local/lib/python2.7/site-packages/pyarrow/lib.so #7 0x004c37ed in PyEval_EvalFrameEx () #8 0x004b9ab6 in PyEval_EvalCodeEx () #9 0x004c1e6f in PyEval_EvalFrameEx () #10 0x004b9ab6 in PyEval_EvalCodeEx () #11 0x004d55f3 in ?? () #12 0x7f188aa75eac in __pyx_pw_7pyarrow_3lib_5Table_17from_pandas(_object*, _object*, _object*) () from /home/ashieh/.local/lib/python2.7/site-packages/pyarrow/lib.so #13 0x004bc3fa in PyEval_EvalFrameEx () #14 0x004b9ab6 in PyEval_EvalCodeEx () #15 0x004eb30f in ?? () #16 0x004e5422 in PyRun_FileExFlags () #17 0x004e3cd
[jira] [Created] (ARROW-3053) [Python] pandas decimal conversion segfault
Albert Shieh created ARROW-3053: --- Summary: [Python] pandas decimal conversion segfault Key: ARROW-3053 URL: https://issues.apache.org/jira/browse/ARROW-3053 Project: Apache Arrow Issue Type: Bug Components: C++, Python Affects Versions: 0.10.0 Reporter: Albert Shieh This example segfaults when trying to convert a pandas DataFrame with a decimal column and at least one other object column to a pyarrow Table after a round trip through HDF5: {code:java} import decimal import pandas as pd import pyarrow as pa data = {'a': {0: 'a'}, 'b': {0: decimal.Decimal('0.0')}} df = pd.DataFrame.from_dict(data) df.to_hdf('test.h5', 'test') df = pd.read_hdf('test.h5', 'test') table = pa.Table.from_pandas(df) {code} This is the gdb backtrace: {code:java} #0 0x7f188a08fc0b in arrow::py::internal::PandasObjectIsNull(_object*) () from /home/ashieh/.local/lib/python2.7/site-packages/pyarrow/libarrow_python.so.10 #1 0x7f188a09931c in arrow::py::NumPyConverter::ConvertDecimals() () from /home/ashieh/.local/lib/python2.7/site-packages/pyarrow/libarrow_python.so.10 #2 0x7f188a09ef4b in arrow::py::NumPyConverter::ConvertObjectsInfer() () from /home/ashieh/.local/lib/python2.7/site-packages/pyarrow/libarrow_python.so.10 #3 0x7f188a09f5db in arrow::py::NumPyConverter::ConvertObjects() () from /home/ashieh/.local/lib/python2.7/site-packages/pyarrow/libarrow_python.so.10 #4 0x7f188a09f715 in arrow::py::NumPyConverter::Convert() () from /home/ashieh/.local/lib/python2.7/site-packages/pyarrow/libarrow_python.so.10 #5 0x7f188a0a0f5e in arrow::py::NdarrayToArrow(arrow::MemoryPool*, _object*, _object*, bool, std::shared_ptr const&, std::shared_ptr*) () from /home/ashieh/.local/lib/python2.7/site-packages/pyarrow/libarrow_python.so.10 #6 0x7f188ab1a13e in __pyx_pw_7pyarrow_3lib_79array(_object*, _object*, _object*) () from /home/ashieh/.local/lib/python2.7/site-packages/pyarrow/lib.so #7 0x004c37ed in PyEval_EvalFrameEx () #8 0x004b9ab6 in PyEval_EvalCodeEx () #9 0x004c1e6f in 
PyEval_EvalFrameEx () #10 0x004b9ab6 in PyEval_EvalCodeEx () #11 0x004d55f3 in ?? () #12 0x7f188aa75eac in __pyx_pw_7pyarrow_3lib_5Table_17from_pandas(_object*, _object*, _object*) () from /home/ashieh/.local/lib/python2.7/site-packages/pyarrow/lib.so #13 0x004bc3fa in PyEval_EvalFrameEx () #14 0x004b9ab6 in PyEval_EvalCodeEx () #15 0x004eb30f in ?? () #16 0x004e5422 in PyRun_FileExFlags () #17 0x004e3cd6 in PyRun_SimpleFileExFlags () #18 0x00493ae2 in Py_Main () #19 0x7f18a79c4830 in __libc_start_main (main=0x4934c0 , argc=2, argv=0x7fffcf079508, init=, fini=, rtld_fini=, stack_end=0x7fffcf0794f8) at ../csu/libc-start.c:291 #20 0x004933e9 in _start () {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
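The backtrace shows the crash inside arrow::py::internal::PandasObjectIsNull while converting the decimal column. For illustration only, the kind of null test that routine performs can be sketched in pure Python; this is an analogue of the check, not the actual C++ implementation, and the helper name is ours:

```python
import decimal
import math

def object_is_null(obj):
    # Sketch of a null test over object-array cells: None, float NaN,
    # and Decimal NaN all count as null; anything else does not.
    if obj is None:
        return True
    if isinstance(obj, float) and math.isnan(obj):
        return True
    if isinstance(obj, decimal.Decimal) and obj.is_nan():
        return True
    return False

# Cells like those in the reproduction above: a string and a Decimal.
values = ['a', decimal.Decimal('0.0'), None, float('nan')]
nulls = [object_is_null(v) for v in values]
```

A robust check of this shape must tolerate arbitrary Python objects in the column; the segfault suggests the C++ path did not for some object layouts after the HDF5 round trip.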
[jira] [Commented] (ARROW-2122) [Python] Pyarrow fails to serialize dataframe with timestamp.
[ https://issues.apache.org/jira/browse/ARROW-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386615#comment-16386615 ] Albert Shieh commented on ARROW-2122: - How about '+{:d}'.format(tz._minutes), or some other prefix besides '+'? > [Python] Pyarrow fails to serialize dataframe with timestamp. > - > > Key: ARROW-2122 > URL: https://issues.apache.org/jira/browse/ARROW-2122 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Robert Nishihara >Priority: Major > Fix For: 0.9.0 > > > The bug can be reproduced as follows. > {code:java} > import pyarrow as pa > import pandas as pd > df = pd.DataFrame({'A': [pd.Timestamp('2012-11-11 00:00:00+01:00'), pd.NaT]}) > s = pa.serialize(df).to_buffer() > new_df = pa.deserialize(s) # this fails{code} > The last line fails with > {code:java} > Traceback (most recent call last): > File "", line 1, in > File "serialization.pxi", line 441, in pyarrow.lib.deserialize > File "serialization.pxi", line 404, in pyarrow.lib.deserialize_from > File "serialization.pxi", line 257, in > pyarrow.lib.SerializedPyObject.deserialize > File "serialization.pxi", line 174, in > pyarrow.lib.SerializationContext._deserialize_callback > File "/home/ubuntu/arrow/python/pyarrow/serialization.py", line 77, in > _deserialize_pandas_dataframe > return pdcompat.serialized_dict_to_dataframe(data) > File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in > serialized_dict_to_dataframe > for block in data['blocks']] > File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in > > for block in data['blocks']] > File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 466, in > _reconstruct_block > dtype = _make_datetimetz(item['timezone']) > File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 481, in > _make_datetimetz > return DatetimeTZDtype('ns', tz=tz) > File > "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pandas/core/dtypes/dtypes.py", > 
line 409, in __new__ > raise ValueError("DatetimeTZDtype constructor must have a tz " > ValueError: DatetimeTZDtype constructor must have a tz supplied{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (ARROW-2122) [Python] Pyarrow fails to serialize dataframe with timestamp.
[ https://issues.apache.org/jira/browse/ARROW-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386615#comment-16386615 ] Albert Shieh edited comment on ARROW-2122 at 3/5/18 7:31 PM: - How about {code} '+{:d}'.format(tz._minutes) {code} or some other prefix? was (Author: adshieh): How about '+{:d}'.format(tz._minutes), or some other prefix besides '+'? > [Python] Pyarrow fails to serialize dataframe with timestamp. > - > > Key: ARROW-2122 > URL: https://issues.apache.org/jira/browse/ARROW-2122 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Robert Nishihara >Priority: Major > Fix For: 0.9.0 > > > The bug can be reproduced as follows. > {code:java} > import pyarrow as pa > import pandas as pd > df = pd.DataFrame({'A': [pd.Timestamp('2012-11-11 00:00:00+01:00'), pd.NaT]}) > s = pa.serialize(df).to_buffer() > new_df = pa.deserialize(s) # this fails{code} > The last line fails with > {code:java} > Traceback (most recent call last): > File "", line 1, in > File "serialization.pxi", line 441, in pyarrow.lib.deserialize > File "serialization.pxi", line 404, in pyarrow.lib.deserialize_from > File "serialization.pxi", line 257, in > pyarrow.lib.SerializedPyObject.deserialize > File "serialization.pxi", line 174, in > pyarrow.lib.SerializationContext._deserialize_callback > File "/home/ubuntu/arrow/python/pyarrow/serialization.py", line 77, in > _deserialize_pandas_dataframe > return pdcompat.serialized_dict_to_dataframe(data) > File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in > serialized_dict_to_dataframe > for block in data['blocks']] > File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in > > for block in data['blocks']] > File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 466, in > _reconstruct_block > dtype = _make_datetimetz(item['timezone']) > File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 481, in > _make_datetimetz > return 
DatetimeTZDtype('ns', tz=tz) > File > "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pandas/core/dtypes/dtypes.py", > line 409, in __new__ > raise ValueError("DatetimeTZDtype constructor must have a tz " > ValueError: DatetimeTZDtype constructor must have a tz supplied{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2122) [Python] Pyarrow fails to serialize dataframe with timestamp.
[ https://issues.apache.org/jira/browse/ARROW-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386217#comment-16386217 ] Albert Shieh commented on ARROW-2122: - The issue is that the timestamp has a pytz.FixedOffset timezone, which has a zone attribute of None where arrow expects a string. > [Python] Pyarrow fails to serialize dataframe with timestamp. > - > > Key: ARROW-2122 > URL: https://issues.apache.org/jira/browse/ARROW-2122 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Reporter: Robert Nishihara >Priority: Major > Fix For: 0.9.0 > > > The bug can be reproduced as follows. > {code:java} > import pyarrow as pa > import pandas as pd > df = pd.DataFrame({'A': [pd.Timestamp('2012-11-11 00:00:00+01:00'), pd.NaT]}) > s = pa.serialize(df).to_buffer() > new_df = pa.deserialize(s) # this fails{code} > The last line fails with > {code:java} > Traceback (most recent call last): > File "", line 1, in > File "serialization.pxi", line 441, in pyarrow.lib.deserialize > File "serialization.pxi", line 404, in pyarrow.lib.deserialize_from > File "serialization.pxi", line 257, in > pyarrow.lib.SerializedPyObject.deserialize > File "serialization.pxi", line 174, in > pyarrow.lib.SerializationContext._deserialize_callback > File "/home/ubuntu/arrow/python/pyarrow/serialization.py", line 77, in > _deserialize_pandas_dataframe > return pdcompat.serialized_dict_to_dataframe(data) > File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in > serialized_dict_to_dataframe > for block in data['blocks']] > File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in > > for block in data['blocks']] > File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 466, in > _reconstruct_block > dtype = _make_datetimetz(item['timezone']) > File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 481, in > _make_datetimetz > return DatetimeTZDtype('ns', tz=tz) > File > 
"/home/ubuntu/anaconda3/lib/python3.5/site-packages/pandas/core/dtypes/dtypes.py", > line 409, in __new__ > raise ValueError("DatetimeTZDtype constructor must have a tz " > ValueError: DatetimeTZDtype constructor must have a tz supplied{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
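The fix suggested in the comments — encoding a pytz.FixedOffset as '+{:d}'.format(tz._minutes) — can be sketched with the stdlib as follows. The function names and the fallback error are ours; the point is only that a '+' prefix distinguishes a fixed offset in minutes from a named zone string like 'America/New_York':

```python
from datetime import timedelta, timezone

def serialize_tz(tz_minutes):
    # Hypothetical encoding from the comment above: a leading '+' marks a
    # fixed offset in minutes, so it cannot collide with a named-zone string.
    return '+{:d}'.format(tz_minutes)

def deserialize_tz(s):
    if s.startswith('+'):
        # Strip the marker and rebuild a fixed-offset tzinfo.
        return timezone(timedelta(minutes=int(s[1:])))
    # A named zone would be looked up via a tz database (e.g. pytz); omitted here.
    raise ValueError('named zone lookup not sketched: %r' % s)

s = serialize_tz(60)     # the +01:00 offset from the reproduction above
tz = deserialize_tz(s)
```

This round-trips because a fixed offset is fully described by its minute count, whereas the original failure path lost the zone when FixedOffset's zone attribute was None.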
[jira] [Updated] (ARROW-2205) [Python] Option for integer object nulls
[ https://issues.apache.org/jira/browse/ARROW-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Albert Shieh updated ARROW-2205: Affects Version/s: 0.8.0 > [Python] Option for integer object nulls > > > Key: ARROW-2205 > URL: https://issues.apache.org/jira/browse/ARROW-2205 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Python >Affects Versions: 0.8.0 >Reporter: Albert Shieh >Priority: Major > Labels: pull-request-available > > I have a use case where the loss of precision in casting integers to floats > matters, and pandas supports storing integers with nulls without loss of > precision in object columns. However, a roundtrip through arrow will cast the > object columns to float columns, even though the object columns are stored in > arrow as integers with nulls. > This is a minimal example demonstrating the behavior of a roundtrip: > {code} > import numpy as np > import pandas as pd > import pyarrow as pa > df = pd.DataFrame({"a": np.array([None, 1], dtype=object)}) > df_pa = pa.Table.from_pandas(df).to_pandas() > print(df) > print(df_pa) > {code} > The output is: > {code} > a > 0 None > 1 1 > a > 0 NaN > 1 1.0 > {code} > This seems to be the desired behavior, given test_int_object_nulls in > test_convert_pandas. > I think it would be useful to add an option in the to_pandas methods to allow > integers with nulls to be returned as object columns. The option can default > to false in order to preserve the current behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2205) [Python] Option for integer object nulls
[ https://issues.apache.org/jira/browse/ARROW-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Albert Shieh updated ARROW-2205: Summary: [Python] Option for integer object nulls (was: Option for integer object nulls) > [Python] Option for integer object nulls > > > Key: ARROW-2205 > URL: https://issues.apache.org/jira/browse/ARROW-2205 > Project: Apache Arrow > Issue Type: New Feature > Components: C++, Python >Reporter: Albert Shieh >Priority: Major > Labels: pull-request-available > > I have a use case where the loss of precision in casting integers to floats > matters, and pandas supports storing integers with nulls without loss of > precision in object columns. However, a roundtrip through arrow will cast the > object columns to float columns, even though the object columns are stored in > arrow as integers with nulls. > This is a minimal example demonstrating the behavior of a roundtrip: > {code} > import numpy as np > import pandas as pd > import pyarrow as pa > df = pd.DataFrame({"a": np.array([None, 1], dtype=object)}) > df_pa = pa.Table.from_pandas(df).to_pandas() > print(df) > print(df_pa) > {code} > The output is: > {code} > a > 0 None > 1 1 > a > 0 NaN > 1 1.0 > {code} > This seems to be the desired behavior, given test_int_object_nulls in > test_convert_pandas. > I think it would be useful to add an option in the to_pandas methods to allow > integers with nulls to be returned as object columns. The option can default > to false in order to preserve the current behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-2205) Option for integer object nulls
Albert Shieh created ARROW-2205: --- Summary: Option for integer object nulls Key: ARROW-2205 URL: https://issues.apache.org/jira/browse/ARROW-2205 Project: Apache Arrow Issue Type: New Feature Components: C++, Python Reporter: Albert Shieh I have a use case where the loss of precision in casting integers to floats matters, and pandas supports storing integers with nulls without loss of precision in object columns. However, a roundtrip through arrow will cast the object columns to float columns, even though the object columns are stored in arrow as integers with nulls. This is a minimal example demonstrating the behavior of a roundtrip: {code} import numpy as np import pandas as pd import pyarrow as pa df = pd.DataFrame({"a": np.array([None, 1], dtype=object)}) df_pa = pa.Table.from_pandas(df).to_pandas() print(df) print(df_pa) {code} The output is: {code} a 0 None 1 1 a 0 NaN 1 1.0 {code} This seems to be the desired behavior, given test_int_object_nulls in test_convert_pandas. I think it would be useful to add an option in the to_pandas methods to allow integers with nulls to be returned as object columns. The option can default to false in order to preserve the current behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
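The precision loss the request is about is concrete: float64 has a 53-bit significand, so not every integer above 2**53 survives a cast to float. A minimal sketch of the proposed option's two behaviors, using plain lists in place of pandas columns (the flag name is assumed from the proposal, and the helper is ours, not pyarrow API):

```python
# Integers above 2**53 are not all representable in float64.
big = 2**53 + 1

def ints_to_column(values, integer_object_nulls=False):
    # Hypothetical sketch of the proposed to_pandas option:
    # off -> nulls force a lossy float column (current behavior),
    # on  -> values stay Python ints/None, as in an object column.
    if integer_object_nulls:
        return list(values)
    return [float('nan') if v is None else float(v) for v in values]

lossy = ints_to_column([None, big])
exact = ints_to_column([None, big], integer_object_nulls=True)
```

Defaulting the flag to off, as proposed, preserves the behavior that test_int_object_nulls currently pins down.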
[jira] [Updated] (ARROW-1958) [Python] Error in pandas conversion for datetimetz row index
[ https://issues.apache.org/jira/browse/ARROW-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Albert Shieh updated ARROW-1958: Component/s: (was: C) Python > [Python] Error in pandas conversion for datetimetz row index > > > Key: ARROW-1958 > URL: https://issues.apache.org/jira/browse/ARROW-1958 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.8.0 > Environment: Ubuntu 16.04 >Reporter: Albert Shieh > > The pandas conversion of a datetimetz row index in a Table fails with non-UTC > time zones because the values are stored as datetime64\[ns\] and interpreted > as datetime64\[ns, tz\], rather than interpreted as datetime64\[ns, UTC\] and > converted to datetime64\[ns, tz\]. There's correct handling for time zones > for columns in Column.to_pandas, but not for the row index in > table_to_blockmanager. > This is a minimal example demonstrating the failure of a roundtrip between a > DataFrame and a Table: > {code} > import pandas as pd > import pyarrow as pa > df = pd.DataFrame({ > 'a': pd.date_range( > start='2017-01-01', periods=3, tz='America/New_York' > ) > }) > df = df.set_index('a') > df_pa = pa.Table.from_pandas(df).to_pandas() > print(df) > print(df_pa) > {code} > The output is: > {noformat} > Empty DataFrame > Columns: [] > Index: [2017-01-01 00:00:00-05:00, 2017-01-02 00:00:00-05:00, 2017-01-03 > 00:00:00-05:00] > Empty DataFrame > Columns: [] > Index: [2017-01-01 05:00:00-05:00, 2017-01-02 05:00:00-05:00, 2017-01-03 > 05:00:00-05:00] > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (ARROW-1958) [Python] Error in pandas conversion for datetimetz row index
[ https://issues.apache.org/jira/browse/ARROW-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Albert Shieh updated ARROW-1958: Description: The pandas conversion of a datetimetz row index in a Table fails with non-UTC time zones because the values are stored as datetime64\[ns\] and interpreted as datetime64\[ns, tz\], rather than interpreted as datetime64\[ns, UTC\] and converted to datetime64\[ns, tz\]. There's correct handling for time zones for columns in Column.to_pandas, but not for the row index in table_to_blockmanager. This is a minimal example demonstrating the failure of a roundtrip between a DataFrame and a Table: {code} import pandas as pd import pyarrow as pa df = pd.DataFrame({ 'a': pd.date_range( start='2017-01-01', periods=3, tz='America/New_York' ) }) df = df.set_index('a') df_pa = pa.Table.from_pandas(df).to_pandas() print(df) print(df_pa) {code} The output is: {noformat} Empty DataFrame Columns: [] Index: [2017-01-01 00:00:00-05:00, 2017-01-02 00:00:00-05:00, 2017-01-03 00:00:00-05:00] Empty DataFrame Columns: [] Index: [2017-01-01 05:00:00-05:00, 2017-01-02 05:00:00-05:00, 2017-01-03 05:00:00-05:00] {noformat} was: The pandas conversion of a datetimetz row index in a Table fails with non-UTC time zones because the values are stored as datetime64\[ns\] and interpreted as datetime64\[ns, tz\], rather than interpreted as datetime64\[ns, UTC\] and converted to datetime64\[ns, tz\]. There's correct handling for time zones for columns in Column.to_pandas, but not for the row index in table_to_blockmanager. 
This is a minimal example demonstrating the failure of a roundtrip between a DataFrame and a Table: {code} import pandas as pd import pyarrow as pa df = pd.DataFrame({ 'a': pd.date_range( start='2017-01-01', periods=3, tz='America/New_York' ) }) df.set_index('a') df_pa = pa.Table.from_pandas(df).to_pandas() print(df) print(df_pa) {code} The output is: {noformat} Empty DataFrame Columns: [] Index: [2017-01-01 00:00:00-05:00, 2017-01-02 00:00:00-05:00, 2017-01-03 00:00:00-05:00] Empty DataFrame Columns: [] Index: [2017-01-01 05:00:00-05:00, 2017-01-02 05:00:00-05:00, 2017-01-03 05:00:00-05:00] {noformat} > [Python] Error in pandas conversion for datetimetz row index > > > Key: ARROW-1958 > URL: https://issues.apache.org/jira/browse/ARROW-1958 > Project: Apache Arrow > Issue Type: Bug > Components: C >Affects Versions: 0.8.0 > Environment: Ubuntu 16.04 >Reporter: Albert Shieh > > The pandas conversion of a datetimetz row index in a Table fails with non-UTC > time zones because the values are stored as datetime64\[ns\] and interpreted > as datetime64\[ns, tz\], rather than interpreted as datetime64\[ns, UTC\] and > converted to datetime64\[ns, tz\]. There's correct handling for time zones > for columns in Column.to_pandas, but not for the row index in > table_to_blockmanager. 
> This is a minimal example demonstrating the failure of a roundtrip between a > DataFrame and a Table: > {code} > import pandas as pd > import pyarrow as pa > df = pd.DataFrame({ > 'a': pd.date_range( > start='2017-01-01', periods=3, tz='America/New_York' > ) > }) > df = df.set_index('a') > df_pa = pa.Table.from_pandas(df).to_pandas() > print(df) > print(df_pa) > {code} > The output is: > {noformat} > Empty DataFrame > Columns: [] > Index: [2017-01-01 00:00:00-05:00, 2017-01-02 00:00:00-05:00, 2017-01-03 > 00:00:00-05:00] > Empty DataFrame > Columns: [] > Index: [2017-01-01 05:00:00-05:00, 2017-01-02 05:00:00-05:00, 2017-01-03 > 05:00:00-05:00] > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (ARROW-1958) [Python] Error in pandas conversion for datetimetz row index
Albert Shieh created ARROW-1958: --- Summary: [Python] Error in pandas conversion for datetimetz row index Key: ARROW-1958 URL: https://issues.apache.org/jira/browse/ARROW-1958 Project: Apache Arrow Issue Type: Bug Components: C Affects Versions: 0.8.0 Environment: Ubuntu 16.04 Reporter: Albert Shieh The pandas conversion of a datetimetz row index in a Table fails with non-UTC time zones because the values are stored as datetime64\[ns\] and interpreted as datetime64\[ns, tz\], rather than interpreted as datetime64\[ns, UTC\] and converted to datetime64\[ns, tz\]. There's correct handling for time zones for columns in Column.to_pandas, but not for the row index in table_to_blockmanager. This is a minimal example demonstrating the failure of a roundtrip between a DataFrame and a Table: {code} import pandas as pd import pyarrow as pa df = pd.DataFrame({ 'a': pd.date_range( start='2017-01-01', periods=3, tz='America/New_York' ) }) df = df.set_index('a') df_pa = pa.Table.from_pandas(df).to_pandas() print(df) print(df_pa) {code} The output is: {noformat} Empty DataFrame Columns: [] Index: [2017-01-01 00:00:00-05:00, 2017-01-02 00:00:00-05:00, 2017-01-03 00:00:00-05:00] Empty DataFrame Columns: [] Index: [2017-01-01 05:00:00-05:00, 2017-01-02 05:00:00-05:00, 2017-01-03 05:00:00-05:00] {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
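The interpret-vs-convert distinction described above can be sketched with stdlib datetime. A fixed -05:00 offset stands in for America/New_York (a real conversion needs a tz database); the stored value is a naive timestamp that actually means UTC:

```python
from datetime import datetime, timedelta, timezone

target = timezone(timedelta(hours=-5))   # simplified stand-in for America/New_York
stored = datetime(2017, 1, 1, 5, 0)      # naive datetime64[ns]-style value: 05:00 UTC

# The bug: attach the target zone directly, reinterpreting the wall clock.
wrong = stored.replace(tzinfo=target)                           # 05:00-05:00
# The fix: interpret as UTC first, then convert to the target zone.
right = stored.replace(tzinfo=timezone.utc).astimezone(target)  # 00:00-05:00
```

`right` matches the original index value (2017-01-01 00:00:00-05:00) while `wrong` reproduces the shifted 05:00-05:00 output in the report.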