[jira] [Created] (ARROW-2429) [Python] Timestamp unit in schema changes when writing to Parquet file then reading back
Dave Challis created ARROW-2429:
---

Summary: [Python] Timestamp unit in schema changes when writing to Parquet file then reading back
Key: ARROW-2429
URL: https://issues.apache.org/jira/browse/ARROW-2429
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.9.0
Environment: Mac OS High Sierra, PyArrow 0.9.0 (py36_1), Python
Reporter: Dave Challis

When creating an Arrow table from a Pandas DataFrame, the table schema contains a field of type `timestamp[ns]`.

After writing that table to a Parquet file and immediately reading it back, the schema of the resulting table instead contains a field of type `timestamp[us]`.

{code:python}
#!/usr/bin/env python

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# create DataFrame with a datetime column
df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
df['created'] = pd.to_datetime(df['created'])

# create Arrow table from DataFrame
table = pa.Table.from_pandas(df, preserve_index=False)

# write the table as a parquet file, then read it back again
pq.write_table(table, 'foo.parquet')
table2 = pq.read_table('foo.parquet')

print(table.schema[0])   # pyarrow.Field (nanosecond units)
print(table2.schema[0])  # pyarrow.Field (microsecond units)
{code}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
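To make the difference between the two units concrete, here is a minimal standard-library sketch. It assumes that going from `timestamp[ns]` to `timestamp[us]` simply drops sub-microsecond digits; the report only states that the schema unit changes, not how values are converted. The helpers `ns_to_us` and `us_to_ns` are hypothetical names and not part of PyArrow.

```python
from datetime import datetime, timezone

NS_PER_US = 1_000  # nanoseconds per microsecond


def ns_to_us(ns: int) -> int:
    """Hypothetical helper: nanoseconds since epoch -> microseconds, dropping the remainder."""
    return ns // NS_PER_US


def us_to_ns(us: int) -> int:
    """Hypothetical helper: microseconds since epoch -> nanoseconds."""
    return us * NS_PER_US


# The timestamp used in the report; it has whole-second precision.
created = datetime(2018, 4, 4, 10, 14, 14, tzinfo=timezone.utc)
created_ns = int(created.timestamp()) * 1_000_000_000

# A whole-second value survives the ns -> us -> ns round trip unchanged.
value_preserved = us_to_ns(ns_to_us(created_ns)) == created_ns

# An illustrative value with one extra nanosecond does not survive it
# (under the truncation assumption stated above).
finer_ns = created_ns + 1
finer_preserved = us_to_ns(ns_to_us(finer_ns)) == finer_ns

print(value_preserved, finer_preserved)
```

Under that assumption, the value in the reproduction script is unaffected by the unit change, but a schema comparison between the two tables would still differ, and values carrying sub-microsecond detail would not round-trip.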
[jira] [Created] (ARROW-2423) [Python] PyArrow datatypes raise ValueError on equality checks against non-PyArrow objects
Dave Challis created ARROW-2423:
---

Summary: [Python] PyArrow datatypes raise ValueError on equality checks against non-PyArrow objects
Key: ARROW-2423
URL: https://issues.apache.org/jira/browse/ARROW-2423
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.9.0
Environment: Mac OS High Sierra, PyArrow 0.9.0 (py36_1), Python 3.6.3
Reporter: Dave Challis

Comparing a PyArrow datatype object for equality with a non-PyArrow object raises a `ValueError`, rather than returning a True/False value or returning [NotImplemented|https://docs.python.org/3/library/constants.html#NotImplemented] when the comparison isn't implemented.

E.g. attempting to call:

{code:python}
import pyarrow
pyarrow.int32() == 'foo'
{code}

results in:

{code:none}
Traceback (most recent call last):
  File "types.pxi", line 1221, in pyarrow.lib.type_for_alias
KeyError: 'foo'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "t.py", line 2, in <module>
    pyarrow.int32() == 'foo'
  File "types.pxi", line 90, in pyarrow.lib.DataType.__richcmp__
  File "types.pxi", line 113, in pyarrow.lib.DataType.equals
  File "types.pxi", line 1223, in pyarrow.lib.type_for_alias
ValueError: No type alias for foo
{code}

The expected outcome for the above would be for the comparison to return `False`, as that's the general behaviour for comparisons between objects of different types (e.g. `1 == 'foo'` or `object() == 12.4` both return `False`).
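The expected behaviour can be sketched in plain Python. The class `FakeType` below is a hypothetical stand-in used only for illustration; it is not PyArrow's `DataType` and says nothing about how PyArrow implements comparison internally. It shows the general protocol: when `__eq__` returns `NotImplemented` for an unrelated operand, Python tries the reflected comparison and finally falls back to identity, so the expression evaluates to `False` instead of raising.

```python
class FakeType:
    """Hypothetical datatype-like object, for illustration only."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        # Defer to Python's fallback for operands we know nothing about,
        # instead of trying to interpret them and raising.
        if not isinstance(other, FakeType):
            return NotImplemented
        return self.name == other.name

    def __hash__(self):
        # Defining __eq__ disables the default hash, so restore one.
        return hash(self.name)


print(FakeType("int32") == "foo")              # False
print(FakeType("int32") == FakeType("int32"))  # True
```

With this pattern the comparison behaves like `1 == 'foo'`: no exception, just `False`.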
[jira] [Created] (ARROW-2406) [Python] Segfault when creating PyArrow table from Pandas for empty string column when schema provided
Dave Challis created ARROW-2406:
---

Summary: [Python] Segfault when creating PyArrow table from Pandas for empty string column when schema provided
Key: ARROW-2406
URL: https://issues.apache.org/jira/browse/ARROW-2406
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.9.0
Environment: Mac OS High Sierra, Python 3.6.3
Reporter: Dave Challis

Minimal example to recreate:

{code:python}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'a': []})
df['a'] = df['a'].astype(str)
schema = pa.schema([pa.field('a', pa.string())])
pa.Table.from_pandas(df, schema=schema)
{code}

This causes the python interpreter to exit with "Segmentation fault: 11".

The following examples all work without any issue:

{code:python}
# column 'a' is no longer empty
df = pd.DataFrame({'a': ['foo']})
df['a'] = df['a'].astype(str)
schema = pa.schema([pa.field('a', pa.string())])
pa.Table.from_pandas(df, schema=schema)
{code}

{code:python}
# column 'a' is empty, but no schema is specified
df = pd.DataFrame({'a': []})
df['a'] = df['a'].astype(str)
pa.Table.from_pandas(df)
{code}
[jira] [Created] (ARROW-2391) Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64
Dave Challis created ARROW-2391:
---

Summary: Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64
Key: ARROW-2391
URL: https://issues.apache.org/jira/browse/ARROW-2391
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.9.0
Environment: Mac OS High Sierra, Python 3.6
Reporter: Dave Challis

When calling `pyarrow.Table.from_pandas` with a `pandas.DataFrame` and a `pyarrow.Schema`, the call results in a segmentation fault if the schema maps a Pandas `datetime64[ns]` column to the `pyarrow.date64` type.

A minimal example which shows this is:

{{import pandas as pd}}
{{import pyarrow as pa}}
{{df = pd.DataFrame(\{'created': ['2018-05-10T10:24:01']})}}
{{df['created'] = pd.to_datetime(df['created'])}}
{{schema = pa.schema([pa.field('created', pa.date64())])}}
{{pa.Table.from_pandas(df, schema=schema)}}

Executing the above causes the python interpreter to exit with "Segmentation fault: 11".

Attempting to convert into various other datatypes (by specifying different schemas) either succeeds, or raises an exception if the conversion is invalid.