[jira] [Updated] (ARROW-3053) [Python] pandas decimal conversion segfault

2018-08-14 Thread Albert Shieh (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-3053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Albert Shieh updated ARROW-3053:

Description: 
This example segfaults when trying to convert a pandas DataFrame with a decimal 
column and at least one other object column to a pyarrow Table after a round 
trip through HDF5:
{code:java}
import decimal
import pandas as pd
import pyarrow as pa

data = {'a': {0: 'a'}, 'b': {0: decimal.Decimal('0.0')}}

df = pd.DataFrame.from_dict(data)
df.to_hdf('test.h5', 'test')
df = pd.read_hdf('test.h5', 'test')

table = pa.Table.from_pandas(df)
{code}
This is the gdb backtrace:
{code:java}
#0 0x7f188a08fc0b in arrow::py::internal::PandasObjectIsNull(_object*) () 
from 
/home/ashieh/.local/lib/python2.7/site-packages/pyarrow/libarrow_python.so.10
#1 0x7f188a09931c in arrow::py::NumPyConverter::ConvertDecimals() () from 
/home/ashieh/.local/lib/python2.7/site-packages/pyarrow/libarrow_python.so.10
#2 0x7f188a09ef4b in arrow::py::NumPyConverter::ConvertObjectsInfer() () 
from 
/home/ashieh/.local/lib/python2.7/site-packages/pyarrow/libarrow_python.so.10
#3 0x7f188a09f5db in arrow::py::NumPyConverter::ConvertObjects() () from 
/home/ashieh/.local/lib/python2.7/site-packages/pyarrow/libarrow_python.so.10
#4 0x7f188a09f715 in arrow::py::NumPyConverter::Convert() () from 
/home/ashieh/.local/lib/python2.7/site-packages/pyarrow/libarrow_python.so.10
#5 0x7f188a0a0f5e in arrow::py::NdarrayToArrow(arrow::MemoryPool*, 
_object*, _object*, bool, std::shared_ptr const&, 
std::shared_ptr*) () from 
/home/ashieh/.local/lib/python2.7/site-packages/pyarrow/libarrow_python.so.10
#6 0x7f188ab1a13e in __pyx_pw_7pyarrow_3lib_79array(_object*, _object*, 
_object*) () from /home/ashieh/.local/lib/python2.7/site-packages/pyarrow/lib.so
#7 0x004c37ed in PyEval_EvalFrameEx ()
#8 0x004b9ab6 in PyEval_EvalCodeEx ()
#9 0x004c1e6f in PyEval_EvalFrameEx ()
#10 0x004b9ab6 in PyEval_EvalCodeEx ()
#11 0x004d55f3 in ?? ()
#12 0x7f188aa75eac in __pyx_pw_7pyarrow_3lib_5Table_17from_pandas(_object*, 
_object*, _object*) () from 
/home/ashieh/.local/lib/python2.7/site-packages/pyarrow/lib.so
#13 0x004bc3fa in PyEval_EvalFrameEx ()
#14 0x004b9ab6 in PyEval_EvalCodeEx ()
#15 0x004eb30f in ?? ()
#16 0x004e5422 in PyRun_FileExFlags ()
#17 0x004e3cd6 in PyRun_SimpleFileExFlags ()
#18 0x00493ae2 in Py_Main ()
#19 0x7f18a79c4830 in __libc_start_main (main=0x4934c0 , argc=2, 
argv=0x7fffcf079508, init=, fini=, 
rtld_fini=, stack_end=0x7fffcf0794f8) at ../csu/libc-start.c:291
#20 0x004933e9 in _start ()
{code}

  was:
This example segfaults when trying to convert a pandas DataFrame with a decimal 
column and at least one other object column and at to a pyarrow Table after a 
round trip through HDF5:
{code:java}
import decimal
import pandas as pd
import pyarrow as pa

data = {'a': {0: 'a'}, 'b': {0: decimal.Decimal('0.0')}}

df = pd.DataFrame.from_dict(data)
df.to_hdf('test.h5', 'test')
df = pd.read_hdf('test.h5', 'test')

table = pa.Table.from_pandas(df)
{code}
This is the gdb backtrace:
{code:java}
#0 0x7f188a08fc0b in arrow::py::internal::PandasObjectIsNull(_object*) () 
from 
/home/ashieh/.local/lib/python2.7/site-packages/pyarrow/libarrow_python.so.10
#1 0x7f188a09931c in arrow::py::NumPyConverter::ConvertDecimals() () from 
/home/ashieh/.local/lib/python2.7/site-packages/pyarrow/libarrow_python.so.10
#2 0x7f188a09ef4b in arrow::py::NumPyConverter::ConvertObjectsInfer() () 
from 
/home/ashieh/.local/lib/python2.7/site-packages/pyarrow/libarrow_python.so.10
#3 0x7f188a09f5db in arrow::py::NumPyConverter::ConvertObjects() () from 
/home/ashieh/.local/lib/python2.7/site-packages/pyarrow/libarrow_python.so.10
#4 0x7f188a09f715 in arrow::py::NumPyConverter::Convert() () from 
/home/ashieh/.local/lib/python2.7/site-packages/pyarrow/libarrow_python.so.10
#5 0x7f188a0a0f5e in arrow::py::NdarrayToArrow(arrow::MemoryPool*, 
_object*, _object*, bool, std::shared_ptr const&, 
std::shared_ptr*) () from 
/home/ashieh/.local/lib/python2.7/site-packages/pyarrow/libarrow_python.so.10
#6 0x7f188ab1a13e in __pyx_pw_7pyarrow_3lib_79array(_object*, _object*, 
_object*) () from /home/ashieh/.local/lib/python2.7/site-packages/pyarrow/lib.so
#7 0x004c37ed in PyEval_EvalFrameEx ()
#8 0x004b9ab6 in PyEval_EvalCodeEx ()
#9 0x004c1e6f in PyEval_EvalFrameEx ()
#10 0x004b9ab6 in PyEval_EvalCodeEx ()
#11 0x004d55f3 in ?? ()
#12 0x7f188aa75eac in __pyx_pw_7pyarrow_3lib_5Table_17from_pandas(_object*, 
_object*, _object*) () from 
/home/ashieh/.local/lib/python2.7/site-packages/pyarrow/lib.so
#13 0x004bc3fa in PyEval_EvalFrameEx ()
#14 0x004b9ab6 in PyEval_EvalCodeEx ()
#15 0x004eb30f in ?? ()
#16 0x004e5422 in PyRun_FileExFlags ()
#17 0x004e3cd6 in PyRun_SimpleFileExFlags ()
#18 0x00493ae2 in Py_Main ()
#19 0x7f18a79c4830 in __libc_start_main (main=0x4934c0 , argc=2, 
argv=0x7fffcf079508, init=, fini=, 
rtld_fini=, stack_end=0x7fffcf0794f8) at ../csu/libc-start.c:291
#20 0x004933e9 in _start ()
{code}

[jira] [Created] (ARROW-3053) [Python] pandas decimal conversion segfault

2018-08-14 Thread Albert Shieh (JIRA)
Albert Shieh created ARROW-3053:
---

 Summary: [Python] pandas decimal conversion segfault
 Key: ARROW-3053
 URL: https://issues.apache.org/jira/browse/ARROW-3053
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
Affects Versions: 0.10.0
Reporter: Albert Shieh


This example segfaults when trying to convert a pandas DataFrame with a decimal 
column and at least one other object column to a pyarrow Table after a round 
trip through HDF5:
{code:java}
import decimal
import pandas as pd
import pyarrow as pa

data = {'a': {0: 'a'}, 'b': {0: decimal.Decimal('0.0')}}

df = pd.DataFrame.from_dict(data)
df.to_hdf('test.h5', 'test')
df = pd.read_hdf('test.h5', 'test')

table = pa.Table.from_pandas(df)
{code}
This is the gdb backtrace:
{code:java}
#0 0x7f188a08fc0b in arrow::py::internal::PandasObjectIsNull(_object*) () 
from 
/home/ashieh/.local/lib/python2.7/site-packages/pyarrow/libarrow_python.so.10
#1 0x7f188a09931c in arrow::py::NumPyConverter::ConvertDecimals() () from 
/home/ashieh/.local/lib/python2.7/site-packages/pyarrow/libarrow_python.so.10
#2 0x7f188a09ef4b in arrow::py::NumPyConverter::ConvertObjectsInfer() () 
from 
/home/ashieh/.local/lib/python2.7/site-packages/pyarrow/libarrow_python.so.10
#3 0x7f188a09f5db in arrow::py::NumPyConverter::ConvertObjects() () from 
/home/ashieh/.local/lib/python2.7/site-packages/pyarrow/libarrow_python.so.10
#4 0x7f188a09f715 in arrow::py::NumPyConverter::Convert() () from 
/home/ashieh/.local/lib/python2.7/site-packages/pyarrow/libarrow_python.so.10
#5 0x7f188a0a0f5e in arrow::py::NdarrayToArrow(arrow::MemoryPool*, 
_object*, _object*, bool, std::shared_ptr const&, 
std::shared_ptr*) () from 
/home/ashieh/.local/lib/python2.7/site-packages/pyarrow/libarrow_python.so.10
#6 0x7f188ab1a13e in __pyx_pw_7pyarrow_3lib_79array(_object*, _object*, 
_object*) () from /home/ashieh/.local/lib/python2.7/site-packages/pyarrow/lib.so
#7 0x004c37ed in PyEval_EvalFrameEx ()
#8 0x004b9ab6 in PyEval_EvalCodeEx ()
#9 0x004c1e6f in PyEval_EvalFrameEx ()
#10 0x004b9ab6 in PyEval_EvalCodeEx ()
#11 0x004d55f3 in ?? ()
#12 0x7f188aa75eac in __pyx_pw_7pyarrow_3lib_5Table_17from_pandas(_object*, 
_object*, _object*) () from 
/home/ashieh/.local/lib/python2.7/site-packages/pyarrow/lib.so
#13 0x004bc3fa in PyEval_EvalFrameEx ()
#14 0x004b9ab6 in PyEval_EvalCodeEx ()
#15 0x004eb30f in ?? ()
#16 0x004e5422 in PyRun_FileExFlags ()
#17 0x004e3cd6 in PyRun_SimpleFileExFlags ()
#18 0x00493ae2 in Py_Main ()
#19 0x7f18a79c4830 in __libc_start_main (main=0x4934c0 , argc=2, 
argv=0x7fffcf079508, init=, fini=, 
rtld_fini=, stack_end=0x7fffcf0794f8) at ../csu/libc-start.c:291
#20 0x004933e9 in _start ()
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2122) [Python] Pyarrow fails to serialize dataframe with timestamp.

2018-03-05 Thread Albert Shieh (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386615#comment-16386615
 ] 

Albert Shieh commented on ARROW-2122:
-

How about '+{:d}'.format(tz._minutes), or some other prefix besides '+'?
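A minimal sketch of that encode/decode scheme using only the standard library (`tz._minutes` is a pytz internal; here the offset is read via `utcoffset` instead, and the `{:+d}` format supplies the sign prefix that distinguishes a fixed offset from a named zone string):

```python
from datetime import timedelta, timezone

def encode_offset(tz):
    # Serialize a fixed-offset timezone as a signed minute count,
    # e.g. '+60' for UTC+01:00, so it cannot be confused with a
    # named zone string like 'America/New_York'.
    minutes = int(tz.utcoffset(None).total_seconds() // 60)
    return '{:+d}'.format(minutes)

def decode_offset(token):
    # Rebuild a fixed-offset timezone from the signed minute string.
    return timezone(timedelta(minutes=int(token)))

tz = timezone(timedelta(hours=1))
token = encode_offset(tz)
print(token)  # +60
assert decode_offset(token).utcoffset(None) == timedelta(hours=1)
```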

> [Python] Pyarrow fails to serialize dataframe with timestamp.
> -
>
> Key: ARROW-2122
> URL: https://issues.apache.org/jira/browse/ARROW-2122
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Robert Nishihara
>Priority: Major
> Fix For: 0.9.0
>
>
> The bug can be reproduced as follows.
> {code:java}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({'A': [pd.Timestamp('2012-11-11 00:00:00+01:00'), pd.NaT]}) 
> s = pa.serialize(df).to_buffer()
> new_df = pa.deserialize(s) # this fails{code}
> The last line fails with
> {code:java}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "serialization.pxi", line 441, in pyarrow.lib.deserialize
>   File "serialization.pxi", line 404, in pyarrow.lib.deserialize_from
>   File "serialization.pxi", line 257, in 
> pyarrow.lib.SerializedPyObject.deserialize
>   File "serialization.pxi", line 174, in 
> pyarrow.lib.SerializationContext._deserialize_callback
>   File "/home/ubuntu/arrow/python/pyarrow/serialization.py", line 77, in 
> _deserialize_pandas_dataframe
>     return pdcompat.serialized_dict_to_dataframe(data)
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in 
> serialized_dict_to_dataframe
>     for block in data['blocks']]
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in 
> 
>     for block in data['blocks']]
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 466, in 
> _reconstruct_block
>     dtype = _make_datetimetz(item['timezone'])
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 481, in 
> _make_datetimetz
>     return DatetimeTZDtype('ns', tz=tz)
>   File 
> "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pandas/core/dtypes/dtypes.py",
>  line 409, in __new__
>     raise ValueError("DatetimeTZDtype constructor must have a tz "
> ValueError: DatetimeTZDtype constructor must have a tz supplied{code}
>  





[jira] [Comment Edited] (ARROW-2122) [Python] Pyarrow fails to serialize dataframe with timestamp.

2018-03-05 Thread Albert Shieh (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386615#comment-16386615
 ] 

Albert Shieh edited comment on ARROW-2122 at 3/5/18 7:31 PM:
-

How about 
{code}
'+{:d}'.format(tz._minutes)
{code}
or some other prefix?


was (Author: adshieh):
How about '+{:d}'.format(tz._minutes), or some other prefix besides '+'?

> [Python] Pyarrow fails to serialize dataframe with timestamp.
> -
>
> Key: ARROW-2122
> URL: https://issues.apache.org/jira/browse/ARROW-2122
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Robert Nishihara
>Priority: Major
> Fix For: 0.9.0
>
>
> The bug can be reproduced as follows.
> {code:java}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({'A': [pd.Timestamp('2012-11-11 00:00:00+01:00'), pd.NaT]}) 
> s = pa.serialize(df).to_buffer()
> new_df = pa.deserialize(s) # this fails{code}
> The last line fails with
> {code:java}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "serialization.pxi", line 441, in pyarrow.lib.deserialize
>   File "serialization.pxi", line 404, in pyarrow.lib.deserialize_from
>   File "serialization.pxi", line 257, in 
> pyarrow.lib.SerializedPyObject.deserialize
>   File "serialization.pxi", line 174, in 
> pyarrow.lib.SerializationContext._deserialize_callback
>   File "/home/ubuntu/arrow/python/pyarrow/serialization.py", line 77, in 
> _deserialize_pandas_dataframe
>     return pdcompat.serialized_dict_to_dataframe(data)
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in 
> serialized_dict_to_dataframe
>     for block in data['blocks']]
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in 
> 
>     for block in data['blocks']]
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 466, in 
> _reconstruct_block
>     dtype = _make_datetimetz(item['timezone'])
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 481, in 
> _make_datetimetz
>     return DatetimeTZDtype('ns', tz=tz)
>   File 
> "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pandas/core/dtypes/dtypes.py",
>  line 409, in __new__
>     raise ValueError("DatetimeTZDtype constructor must have a tz "
> ValueError: DatetimeTZDtype constructor must have a tz supplied{code}
>  





[jira] [Commented] (ARROW-2122) [Python] Pyarrow fails to serialize dataframe with timestamp.

2018-03-05 Thread Albert Shieh (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386217#comment-16386217
 ] 

Albert Shieh commented on ARROW-2122:
-

The issue is that the timestamp has a pytz.FixedOffset timezone, which has a 
zone attribute of None where arrow expects a string.

> [Python] Pyarrow fails to serialize dataframe with timestamp.
> -
>
> Key: ARROW-2122
> URL: https://issues.apache.org/jira/browse/ARROW-2122
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Reporter: Robert Nishihara
>Priority: Major
> Fix For: 0.9.0
>
>
> The bug can be reproduced as follows.
> {code:java}
> import pyarrow as pa
> import pandas as pd
> df = pd.DataFrame({'A': [pd.Timestamp('2012-11-11 00:00:00+01:00'), pd.NaT]}) 
> s = pa.serialize(df).to_buffer()
> new_df = pa.deserialize(s) # this fails{code}
> The last line fails with
> {code:java}
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "serialization.pxi", line 441, in pyarrow.lib.deserialize
>   File "serialization.pxi", line 404, in pyarrow.lib.deserialize_from
>   File "serialization.pxi", line 257, in 
> pyarrow.lib.SerializedPyObject.deserialize
>   File "serialization.pxi", line 174, in 
> pyarrow.lib.SerializationContext._deserialize_callback
>   File "/home/ubuntu/arrow/python/pyarrow/serialization.py", line 77, in 
> _deserialize_pandas_dataframe
>     return pdcompat.serialized_dict_to_dataframe(data)
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in 
> serialized_dict_to_dataframe
>     for block in data['blocks']]
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 446, in 
> 
>     for block in data['blocks']]
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 466, in 
> _reconstruct_block
>     dtype = _make_datetimetz(item['timezone'])
>   File "/home/ubuntu/arrow/python/pyarrow/pandas_compat.py", line 481, in 
> _make_datetimetz
>     return DatetimeTZDtype('ns', tz=tz)
>   File 
> "/home/ubuntu/anaconda3/lib/python3.5/site-packages/pandas/core/dtypes/dtypes.py",
>  line 409, in __new__
>     raise ValueError("DatetimeTZDtype constructor must have a tz "
> ValueError: DatetimeTZDtype constructor must have a tz supplied{code}
>  





[jira] [Updated] (ARROW-2205) [Python] Option for integer object nulls

2018-02-23 Thread Albert Shieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Albert Shieh updated ARROW-2205:

Affects Version/s: 0.8.0

> [Python] Option for integer object nulls
> 
>
> Key: ARROW-2205
> URL: https://issues.apache.org/jira/browse/ARROW-2205
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Affects Versions: 0.8.0
>Reporter: Albert Shieh
>Priority: Major
>  Labels: pull-request-available
>
> I have a use case where the loss of precision in casting integers to floats 
> matters, and pandas supports storing integers with nulls without loss of 
> precision in object columns. However, a roundtrip through arrow will cast the 
> object columns to float columns, even though the object columns are stored in 
> arrow as integers with nulls.
> This is a minimal example demonstrating the behavior of a roundtrip:
> {code}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({"a": np.array([None, 1], dtype=object)})
> df_pa = pa.Table.from_pandas(df).to_pandas()
> print(df)
> print(df_pa)
> {code}
> The output is:
> {code}
>       a
> 0  None
> 1     1
>      a
> 0  NaN
> 1  1.0
> {code}
> This seems to be the desired behavior, given test_int_object_nulls in 
> test_convert_pandas.
> I think it would be useful to add an option in the to_pandas methods to allow 
> integers with nulls to be returned as object columns. The option can default 
> to false in order to preserve the current behavior.





[jira] [Updated] (ARROW-2205) [Python] Option for integer object nulls

2018-02-23 Thread Albert Shieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Albert Shieh updated ARROW-2205:

Summary: [Python] Option for integer object nulls  (was: Option for integer 
object nulls)

> [Python] Option for integer object nulls
> 
>
> Key: ARROW-2205
> URL: https://issues.apache.org/jira/browse/ARROW-2205
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: Albert Shieh
>Priority: Major
>  Labels: pull-request-available
>
> I have a use case where the loss of precision in casting integers to floats 
> matters, and pandas supports storing integers with nulls without loss of 
> precision in object columns. However, a roundtrip through arrow will cast the 
> object columns to float columns, even though the object columns are stored in 
> arrow as integers with nulls.
> This is a minimal example demonstrating the behavior of a roundtrip:
> {code}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({"a": np.array([None, 1], dtype=object)})
> df_pa = pa.Table.from_pandas(df).to_pandas()
> print(df)
> print(df_pa)
> {code}
> The output is:
> {code}
>       a
> 0  None
> 1     1
>      a
> 0  NaN
> 1  1.0
> {code}
> This seems to be the desired behavior, given test_int_object_nulls in 
> test_convert_pandas.
> I think it would be useful to add an option in the to_pandas methods to allow 
> integers with nulls to be returned as object columns. The option can default 
> to false in order to preserve the current behavior.





[jira] [Created] (ARROW-2205) Option for integer object nulls

2018-02-23 Thread Albert Shieh (JIRA)
Albert Shieh created ARROW-2205:
---

 Summary: Option for integer object nulls
 Key: ARROW-2205
 URL: https://issues.apache.org/jira/browse/ARROW-2205
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++, Python
Reporter: Albert Shieh


I have a use case where the loss of precision in casting integers to floats 
matters, and pandas supports storing integers with nulls without loss of 
precision in object columns. However, a roundtrip through arrow will cast the 
object columns to float columns, even though the object columns are stored in 
arrow as integers with nulls.

This is a minimal example demonstrating the behavior of a roundtrip:
{code}
import numpy as np
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"a": np.array([None, 1], dtype=object)})
df_pa = pa.Table.from_pandas(df).to_pandas()

print(df)
print(df_pa)
{code}
The output is:
{code}
      a
0  None
1     1
     a
0  NaN
1  1.0
{code}
This seems to be the desired behavior, given test_int_object_nulls in 
test_convert_pandas.

I think it would be useful to add an option in the to_pandas methods to allow 
integers with nulls to be returned as object columns. The option can default to 
false in order to preserve the current behavior.





[jira] [Updated] (ARROW-1958) [Python] Error in pandas conversion for datetimetz row index

2017-12-29 Thread Albert Shieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Albert Shieh updated ARROW-1958:

Component/s: (was: C)
 Python

> [Python] Error in pandas conversion for datetimetz row index
> 
>
> Key: ARROW-1958
> URL: https://issues.apache.org/jira/browse/ARROW-1958
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: Ubuntu 16.04
>Reporter: Albert Shieh
>
> The pandas conversion of a datetimetz row index in a Table fails with non-UTC 
> time zones because the values are stored as datetime64\[ns\] and interpreted 
> as datetime64\[ns, tz\], rather than interpreted as datetime64\[ns, UTC\] and 
> converted to datetime64\[ns, tz\]. There's correct handling for time zones 
> for columns in Column.to_pandas, but not for the row index in 
> table_to_blockmanager.
> This is a minimal example demonstrating the failure of a roundtrip between a 
> DataFrame and a Table:
> {code}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({
> 'a': pd.date_range(
> start='2017-01-01', periods=3, tz='America/New_York'
> )
> })
> df = df.set_index('a')
> df_pa = pa.Table.from_pandas(df).to_pandas()
> print(df)
> print(df_pa)
> {code}
> The output is:
> {noformat}
> Empty DataFrame
> Columns: []
> Index: [2017-01-01 00:00:00-05:00, 2017-01-02 00:00:00-05:00, 2017-01-03 
> 00:00:00-05:00]
> Empty DataFrame
> Columns: []
> Index: [2017-01-01 05:00:00-05:00, 2017-01-02 05:00:00-05:00, 2017-01-03 
> 05:00:00-05:00]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (ARROW-1958) [Python] Error in pandas conversion for datetimetz row index

2017-12-29 Thread Albert Shieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-1958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Albert Shieh updated ARROW-1958:

Description: 
The pandas conversion of a datetimetz row index in a Table fails with non-UTC 
time zones because the values are stored as datetime64\[ns\] and interpreted as 
datetime64\[ns, tz\], rather than interpreted as datetime64\[ns, UTC\] and 
converted to datetime64\[ns, tz\]. There's correct handling for time zones for 
columns in Column.to_pandas, but not for the row index in table_to_blockmanager.

This is a minimal example demonstrating the failure of a roundtrip between a 
DataFrame and a Table:
{code}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({
'a': pd.date_range(
start='2017-01-01', periods=3, tz='America/New_York'
)
})
df = df.set_index('a')
df_pa = pa.Table.from_pandas(df).to_pandas()

print(df)
print(df_pa)
{code}

The output is:
{noformat}
Empty DataFrame
Columns: []
Index: [2017-01-01 00:00:00-05:00, 2017-01-02 00:00:00-05:00, 2017-01-03 
00:00:00-05:00]
Empty DataFrame
Columns: []
Index: [2017-01-01 05:00:00-05:00, 2017-01-02 05:00:00-05:00, 2017-01-03 
05:00:00-05:00]
{noformat}

  was:
The pandas conversion of a datetimetz row index in a Table fails with non-UTC 
time zones because the values are stored as datetime64\[ns\] and interpreted as 
datetime64\[ns, tz\], rather than interpreted as datetime64\[ns, UTC\] and 
converted to datetime64\[ns, tz\]. There's correct handling for time zones for 
columns in Column.to_pandas, but not for the row index in table_to_blockmanager.

This is a minimal example demonstrating the failure of a roundtrip between a 
DataFrame and a Table:
{code}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({
'a': pd.date_range(
start='2017-01-01', periods=3, tz='America/New_York'
)
})
df.set_index('a')
df_pa = pa.Table.from_pandas(df).to_pandas()

print(df)
print(df_pa)
{code}

The output is:
{noformat}
Empty DataFrame
Columns: []
Index: [2017-01-01 00:00:00-05:00, 2017-01-02 00:00:00-05:00, 2017-01-03 
00:00:00-05:00]
Empty DataFrame
Columns: []
Index: [2017-01-01 05:00:00-05:00, 2017-01-02 05:00:00-05:00, 2017-01-03 
05:00:00-05:00]
{noformat}


> [Python] Error in pandas conversion for datetimetz row index
> 
>
> Key: ARROW-1958
> URL: https://issues.apache.org/jira/browse/ARROW-1958
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C
>Affects Versions: 0.8.0
> Environment: Ubuntu 16.04
>Reporter: Albert Shieh
>
> The pandas conversion of a datetimetz row index in a Table fails with non-UTC 
> time zones because the values are stored as datetime64\[ns\] and interpreted 
> as datetime64\[ns, tz\], rather than interpreted as datetime64\[ns, UTC\] and 
> converted to datetime64\[ns, tz\]. There's correct handling for time zones 
> for columns in Column.to_pandas, but not for the row index in 
> table_to_blockmanager.
> This is a minimal example demonstrating the failure of a roundtrip between a 
> DataFrame and a Table:
> {code}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({
> 'a': pd.date_range(
> start='2017-01-01', periods=3, tz='America/New_York'
> )
> })
> df = df.set_index('a')
> df_pa = pa.Table.from_pandas(df).to_pandas()
> print(df)
> print(df_pa)
> {code}
> The output is:
> {noformat}
> Empty DataFrame
> Columns: []
> Index: [2017-01-01 00:00:00-05:00, 2017-01-02 00:00:00-05:00, 2017-01-03 
> 00:00:00-05:00]
> Empty DataFrame
> Columns: []
> Index: [2017-01-01 05:00:00-05:00, 2017-01-02 05:00:00-05:00, 2017-01-03 
> 05:00:00-05:00]
> {noformat}





[jira] [Created] (ARROW-1958) [Python] Error in pandas conversion for datetimetz row index

2017-12-29 Thread Albert Shieh (JIRA)
Albert Shieh created ARROW-1958:
---

 Summary: [Python] Error in pandas conversion for datetimetz row 
index
 Key: ARROW-1958
 URL: https://issues.apache.org/jira/browse/ARROW-1958
 Project: Apache Arrow
  Issue Type: Bug
  Components: C
Affects Versions: 0.8.0
 Environment: Ubuntu 16.04
Reporter: Albert Shieh


The pandas conversion of a datetimetz row index in a Table fails with non-UTC 
time zones because the values are stored as datetime64\[ns\] and interpreted as 
datetime64\[ns, tz\], rather than interpreted as datetime64\[ns, UTC\] and 
converted to datetime64\[ns, tz\]. There's correct handling for time zones for 
columns in Column.to_pandas, but not for the row index in table_to_blockmanager.

This is a minimal example demonstrating the failure of a roundtrip between a 
DataFrame and a Table:
{code}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({
'a': pd.date_range(
start='2017-01-01', periods=3, tz='America/New_York'
)
})
df = df.set_index('a')
df_pa = pa.Table.from_pandas(df).to_pandas()

print(df)
print(df_pa)
{code}

The output is:
{noformat}
Empty DataFrame
Columns: []
Index: [2017-01-01 00:00:00-05:00, 2017-01-02 00:00:00-05:00, 2017-01-03 
00:00:00-05:00]
Empty DataFrame
Columns: []
Index: [2017-01-01 05:00:00-05:00, 2017-01-02 05:00:00-05:00, 2017-01-03 
05:00:00-05:00]
{noformat}
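The interpreted-versus-converted distinction described above can be shown directly in pandas, independent of arrow (the five-hour shift matches the output of the roundtrip):

```python
import pandas as pd

# The stored index values are UTC epoch nanoseconds. Reinterpreting the
# naive wall time directly in the target zone (plain tz_localize) gives
# the shifted result; localizing to UTC first and then converting
# (tz_convert) recovers the original timestamps.
naive = pd.DatetimeIndex(['2017-01-01 05:00:00'])  # stored UTC value

wrong = naive.tz_localize('America/New_York')
right = naive.tz_localize('UTC').tz_convert('America/New_York')

print(wrong[0])  # 2017-01-01 05:00:00-05:00
print(right[0])  # 2017-01-01 00:00:00-05:00
```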


