[jira] [Commented] (ARROW-2429) [Python] Timestamp unit in schema changes when writing to Parquet file then reading back
[ https://issues.apache.org/jira/browse/ARROW-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441058#comment-16441058 ]

Dave Challis commented on ARROW-2429:
-------------------------------------

Thanks [~xhochy] and [~joshuastorck], that makes sense now.

> [Python] Timestamp unit in schema changes when writing to Parquet file then reading back
> -----------------------------------------------------------------------------------------
>
> Key: ARROW-2429
> URL: https://issues.apache.org/jira/browse/ARROW-2429
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Environment: Mac OS High Sierra
> PyArrow 0.9.0 (py36_1)
> Python
> Reporter: Dave Challis
> Assignee: Uwe L. Korn
> Priority: Minor
>
> When creating an Arrow table from a Pandas DataFrame, the table schema contains a field of type `timestamp[ns]`.
> When serialising that table to a parquet file and then immediately reading it back, the schema of the table read instead contains a field with type `timestamp[us]`.
> Minimal example:
>
> {code:python}
> #!/usr/bin/env python
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> # create DataFrame with a datetime column
> df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
> df['created'] = pd.to_datetime(df['created'])
> # create Arrow table from DataFrame
> table = pa.Table.from_pandas(df, preserve_index=False)
> # write the table as a parquet file, then read it back again
> pq.write_table(table, 'foo.parquet')
> table2 = pq.read_table('foo.parquet')
> print(table.schema[0])  # pyarrow.Field (nanosecond units)
> print(table2.schema[0]) # pyarrow.Field (microsecond units)
> {code}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-2429) [Python] Timestamp unit in schema changes when writing to Parquet file then reading back
[ https://issues.apache.org/jira/browse/ARROW-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Challis updated ARROW-2429:
--------------------------------
Description:
When creating an Arrow table from a Pandas DataFrame, the table schema contains a field of type `timestamp[ns]`.

When serialising that table to a parquet file and then immediately reading it back, the schema of the table read instead contains a field with type `timestamp[us]`.

Minimal example:

{code:python}
#!/usr/bin/env python
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
# create DataFrame with a datetime column
df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
df['created'] = pd.to_datetime(df['created'])
# create Arrow table from DataFrame
table = pa.Table.from_pandas(df, preserve_index=False)
# write the table as a parquet file, then read it back again
pq.write_table(table, 'foo.parquet')
table2 = pq.read_table('foo.parquet')
print(table.schema[0])  # pyarrow.Field (nanosecond units)
print(table2.schema[0]) # pyarrow.Field (microsecond units)
{code}

was:
When creating an Arrow table from a Pandas DataFrame, the table schema contains a field of type `timestamp[ns]`.

When serialising that table to a parquet file and then immediately reading it back, the schema of the table read instead contains a field with type `timestamp[us]`.

{code:python}
#!/usr/bin/env python
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
# create DataFrame with a datetime column
df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
df['created'] = pd.to_datetime(df['created'])
# create Arrow table from DataFrame
table = pa.Table.from_pandas(df, preserve_index=False)
# write the table as a parquet file, then read it back again
pq.write_table(table, 'foo.parquet')
table2 = pq.read_table('foo.parquet')
print(table.schema[0])  # pyarrow.Field (nanosecond units)
print(table2.schema[0]) # pyarrow.Field (microsecond units)
{code}

> [Python] Timestamp unit in schema changes when writing to Parquet file then reading back
> -----------------------------------------------------------------------------------------
>
> Key: ARROW-2429
> URL: https://issues.apache.org/jira/browse/ARROW-2429
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Environment: Mac OS High Sierra
> PyArrow 0.9.0 (py36_1)
> Python
> Reporter: Dave Challis
> Priority: Minor
>
> When creating an Arrow table from a Pandas DataFrame, the table schema contains a field of type `timestamp[ns]`.
> When serialising that table to a parquet file and then immediately reading it back, the schema of the table read instead contains a field with type `timestamp[us]`.
> Minimal example:
>
> {code:python}
> #!/usr/bin/env python
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> # create DataFrame with a datetime column
> df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
> df['created'] = pd.to_datetime(df['created'])
> # create Arrow table from DataFrame
> table = pa.Table.from_pandas(df, preserve_index=False)
> # write the table as a parquet file, then read it back again
> pq.write_table(table, 'foo.parquet')
> table2 = pq.read_table('foo.parquet')
> print(table.schema[0])  # pyarrow.Field (nanosecond units)
> print(table2.schema[0]) # pyarrow.Field (microsecond units)
> {code}
[jira] [Created] (ARROW-2429) [Python] Timestamp unit in schema changes when writing to Parquet file then reading back
Dave Challis created ARROW-2429:
-----------------------------------

Summary: [Python] Timestamp unit in schema changes when writing to Parquet file then reading back
Key: ARROW-2429
URL: https://issues.apache.org/jira/browse/ARROW-2429
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.9.0
Environment: Mac OS High Sierra
PyArrow 0.9.0 (py36_1)
Python
Reporter: Dave Challis

When creating an Arrow table from a Pandas DataFrame, the table schema contains a field of type `timestamp[ns]`.

When serialising that table to a parquet file and then immediately reading it back, the schema of the table read instead contains a field with type `timestamp[us]`.

{code:python}
#!/usr/bin/env python
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
# create DataFrame with a datetime column
df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
df['created'] = pd.to_datetime(df['created'])
# create Arrow table from DataFrame
table = pa.Table.from_pandas(df, preserve_index=False)
# write the table as a parquet file, then read it back again
pq.write_table(table, 'foo.parquet')
table2 = pq.read_table('foo.parquet')
print(table.schema[0])  # pyarrow.Field (nanosecond units)
print(table2.schema[0]) # pyarrow.Field (microsecond units)
{code}
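The practical effect of the unit change reported above can be illustrated without PyArrow. What follows is only a minimal sketch using the standard library: the helper names `to_epoch_ns`, `ns_to_us` and `us_to_ns` are hypothetical, and nothing here reproduces PyArrow's Parquet writer. It shows that a nanosecond value survives a trip through microseconds only when its sub-microsecond digits are zero, which is the case for the timestamp in the report.

```python
from datetime import datetime, timezone


def to_epoch_ns(iso_utc: str) -> int:
    """Nanoseconds since the Unix epoch for a second-resolution UTC timestamp."""
    dt = datetime.strptime(iso_utc, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return int(dt.timestamp()) * 1_000_000_000


def ns_to_us(ns: int) -> int:
    """Convert nanoseconds to microseconds, discarding the last three digits."""
    return ns // 1_000


def us_to_ns(us: int) -> int:
    """Convert microseconds back to nanoseconds."""
    return us * 1_000


# The timestamp from the report has no sub-second part, so nothing is lost.
created_ns = to_epoch_ns("2018-04-04T10:14:14Z")
lossless = us_to_ns(ns_to_us(created_ns)) == created_ns

# A value with non-zero sub-microsecond digits does not round-trip.
with_extra_ns = created_ns + 123
truncated_ns = us_to_ns(ns_to_us(with_extra_ns))
lossy = truncated_ns != with_extra_ns
```

For data that only carries second or millisecond precision, the values compare equal after the round trip even though the schema unit differs.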
[jira] [Updated] (ARROW-2429) [Python] Timestamp unit in schema changes when writing to Parquet file then reading back
[ https://issues.apache.org/jira/browse/ARROW-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Challis updated ARROW-2429:
--------------------------------
Description:
When creating an Arrow table from a Pandas DataFrame, the table schema contains a field of type `timestamp[ns]`.

When serialising that table to a parquet file and then immediately reading it back, the schema of the table read instead contains a field with type `timestamp[us]`.

{code:python}
#!/usr/bin/env python
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
# create DataFrame with a datetime column
df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
df['created'] = pd.to_datetime(df['created'])
# create Arrow table from DataFrame
table = pa.Table.from_pandas(df, preserve_index=False)
# write the table as a parquet file, then read it back again
pq.write_table(table, 'foo.parquet')
table2 = pq.read_table('foo.parquet')
print(table.schema[0])  # pyarrow.Field (nanosecond units)
print(table2.schema[0]) # pyarrow.Field (microsecond units)
{code}

was:
When creating an Arrow table from a Pandas DataFrame, the table schema contains a field of type `timestamp[ns]`.

When serialising that table to a parquet file and then immediately reading it back, the schema of the table read instead contains a field with type `timestamp[us]`.

{code:python}
#!/usr/bin/env python
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
# create DataFrame with a datetime column
df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
df['created'] = pd.to_datetime(df['created'])
# create Arrow table from DataFrame
table = pa.Table.from_pandas(df, preserve_index=False)
# write the table as a parquet file, then read it back again
pq.write_table(table, 'foo.parquet')
table2 = pq.read_table('foo.parquet')
print(table.schema[0])  # pyarrow.Field (nanosecond units)
print(table2.schema[0]) # pyarrow.Field (microsecond units)
{code}

> [Python] Timestamp unit in schema changes when writing to Parquet file then reading back
> -----------------------------------------------------------------------------------------
>
> Key: ARROW-2429
> URL: https://issues.apache.org/jira/browse/ARROW-2429
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Environment: Mac OS High Sierra
> PyArrow 0.9.0 (py36_1)
> Python
> Reporter: Dave Challis
> Priority: Minor
>
> When creating an Arrow table from a Pandas DataFrame, the table schema contains a field of type `timestamp[ns]`.
> When serialising that table to a parquet file and then immediately reading it back, the schema of the table read instead contains a field with type `timestamp[us]`.
>
> {code:python}
> #!/usr/bin/env python
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> # create DataFrame with a datetime column
> df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
> df['created'] = pd.to_datetime(df['created'])
> # create Arrow table from DataFrame
> table = pa.Table.from_pandas(df, preserve_index=False)
> # write the table as a parquet file, then read it back again
> pq.write_table(table, 'foo.parquet')
> table2 = pq.read_table('foo.parquet')
> print(table.schema[0])  # pyarrow.Field (nanosecond units)
> print(table2.schema[0]) # pyarrow.Field (microsecond units)
> {code}
[jira] [Created] (ARROW-2423) [Python] PyArrow datatypes raise ValueError on equality checks against non-PyArrow objects
Dave Challis created ARROW-2423:
-----------------------------------

Summary: [Python] PyArrow datatypes raise ValueError on equality checks against non-PyArrow objects
Key: ARROW-2423
URL: https://issues.apache.org/jira/browse/ARROW-2423
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.9.0
Environment: Mac OS High Sierra
PyArrow 0.9.0 (py36_1)
Python 3.6.3
Reporter: Dave Challis

Checking a PyArrow datatype object for equality with a non-PyArrow object causes a `ValueError` to be raised, rather than either returning a True/False value, or returning [NotImplemented|https://docs.python.org/3/library/constants.html#NotImplemented] if the comparison isn't implemented.

E.g. attempting to call:

{code:python}
import pyarrow
pyarrow.int32() == 'foo'
{code}

results in:

{code}
Traceback (most recent call last):
  File "types.pxi", line 1221, in pyarrow.lib.type_for_alias
KeyError: 'foo'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "t.py", line 2, in <module>
    pyarrow.int32() == 'foo'
  File "types.pxi", line 90, in pyarrow.lib.DataType.__richcmp__
  File "types.pxi", line 113, in pyarrow.lib.DataType.equals
  File "types.pxi", line 1223, in pyarrow.lib.type_for_alias
ValueError: No type alias for foo
{code}

The expected outcome for the above would be for the comparison to return `False`, as that's the general behaviour for comparisons between objects of different types (e.g. `1 == 'foo'` and `object() == 12.4` both return `False`).
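The expected behaviour described in this report can be sketched in plain Python. The class `Int32Type` below is a hypothetical stand-in and is not how PyArrow implements `DataType`; it only illustrates the standard protocol in which `__eq__` returns `NotImplemented` for operands it does not understand, so that Python falls back to its default and the comparison evaluates to `False` instead of raising.

```python
class Int32Type:
    """Hypothetical stand-in for a data type object (not PyArrow's code)."""

    def __eq__(self, other):
        if not isinstance(other, Int32Type):
            # Decline the comparison: Python then tries the reflected
            # operation and finally falls back to an identity check.
            return NotImplemented
        return True

    def __hash__(self):
        # Defining __eq__ removes the default __hash__, so restore one.
        return hash(type(self).__name__)


same_type = Int32Type() == Int32Type()
against_str = Int32Type() == 'foo'
not_equal = Int32Type() != 'foo'
```

Returning `NotImplemented` rather than `False` directly also leaves room for the other operand's type to define its own comparison against the class.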
[jira] [Commented] (ARROW-2406) [Python] Segfault when creating PyArrow table from Pandas for empty string column when schema provided
[ https://issues.apache.org/jira/browse/ARROW-2406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430261#comment-16430261 ]

Dave Challis commented on ARROW-2406:
-------------------------------------

[~kszucs] My mistake, retested and noticed I was using an older env with pyarrow 0.8.0, looks like the issue was resolved in 0.9.0.

> [Python] Segfault when creating PyArrow table from Pandas for empty string column when schema provided
> -------------------------------------------------------------------------------------------------------
>
> Key: ARROW-2406
> URL: https://issues.apache.org/jira/browse/ARROW-2406
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.8.0
> Environment: Mac OS High Sierra
> Python 3.6.3
> Reporter: Dave Challis
> Priority: Major
> Fix For: 0.9.0
>
> Minimal example to recreate:
> {code}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({'a': []})
> df['a'] = df['a'].astype(str)
> schema = pa.schema([pa.field('a', pa.string())])
> pa.Table.from_pandas(df, schema=schema)
> {code}
>
> This causes the python interpreter to exit with "Segmentation fault: 11".
> The following examples all work without any issue:
> {code}
> # column 'a' is no longer empty
> df = pd.DataFrame({'a': ['foo']})
> df['a'] = df['a'].astype(str)
> schema = pa.schema([pa.field('a', pa.string())])
> pa.Table.from_pandas(df, schema=schema)
> {code}
> {code}
> # column 'a' is empty, but no schema is specified
> df = pd.DataFrame({'a': []})
> df['a'] = df['a'].astype(str)
> pa.Table.from_pandas(df)
> {code}
> {code}
> # column 'a' is empty, but no type 'str' specified in Pandas
> df = pd.DataFrame({'a': []})
> schema = pa.schema([pa.field('a', pa.string())])
> pa.Table.from_pandas(df, schema=schema)
> {code}
[jira] [Closed] (ARROW-2406) [Python] Segfault when creating PyArrow table from Pandas for empty string column when schema provided
[ https://issues.apache.org/jira/browse/ARROW-2406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Challis closed ARROW-2406.
-------------------------------
Resolution: Fixed
Fix Version/s: (was: 0.10.0)
0.9.0

> [Python] Segfault when creating PyArrow table from Pandas for empty string column when schema provided
> -------------------------------------------------------------------------------------------------------
>
> Key: ARROW-2406
> URL: https://issues.apache.org/jira/browse/ARROW-2406
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.8.0
> Environment: Mac OS High Sierra
> Python 3.6.3
> Reporter: Dave Challis
> Priority: Major
> Fix For: 0.9.0
>
> Minimal example to recreate:
> {code}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({'a': []})
> df['a'] = df['a'].astype(str)
> schema = pa.schema([pa.field('a', pa.string())])
> pa.Table.from_pandas(df, schema=schema)
> {code}
>
> This causes the python interpreter to exit with "Segmentation fault: 11".
> The following examples all work without any issue:
> {code}
> # column 'a' is no longer empty
> df = pd.DataFrame({'a': ['foo']})
> df['a'] = df['a'].astype(str)
> schema = pa.schema([pa.field('a', pa.string())])
> pa.Table.from_pandas(df, schema=schema)
> {code}
> {code}
> # column 'a' is empty, but no schema is specified
> df = pd.DataFrame({'a': []})
> df['a'] = df['a'].astype(str)
> pa.Table.from_pandas(df)
> {code}
> {code}
> # column 'a' is empty, but no type 'str' specified in Pandas
> df = pd.DataFrame({'a': []})
> schema = pa.schema([pa.field('a', pa.string())])
> pa.Table.from_pandas(df, schema=schema)
> {code}
[jira] [Updated] (ARROW-2406) [Python] Segfault when creating PyArrow table from Pandas for empty string column when schema provided
[ https://issues.apache.org/jira/browse/ARROW-2406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Challis updated ARROW-2406:
--------------------------------
Affects Version/s: (was: 0.9.0)
0.8.0

> [Python] Segfault when creating PyArrow table from Pandas for empty string column when schema provided
> -------------------------------------------------------------------------------------------------------
>
> Key: ARROW-2406
> URL: https://issues.apache.org/jira/browse/ARROW-2406
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.8.0
> Environment: Mac OS High Sierra
> Python 3.6.3
> Reporter: Dave Challis
> Priority: Major
> Fix For: 0.9.0
>
> Minimal example to recreate:
> {code}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({'a': []})
> df['a'] = df['a'].astype(str)
> schema = pa.schema([pa.field('a', pa.string())])
> pa.Table.from_pandas(df, schema=schema)
> {code}
>
> This causes the python interpreter to exit with "Segmentation fault: 11".
> The following examples all work without any issue:
> {code}
> # column 'a' is no longer empty
> df = pd.DataFrame({'a': ['foo']})
> df['a'] = df['a'].astype(str)
> schema = pa.schema([pa.field('a', pa.string())])
> pa.Table.from_pandas(df, schema=schema)
> {code}
> {code}
> # column 'a' is empty, but no schema is specified
> df = pd.DataFrame({'a': []})
> df['a'] = df['a'].astype(str)
> pa.Table.from_pandas(df)
> {code}
> {code}
> # column 'a' is empty, but no type 'str' specified in Pandas
> df = pd.DataFrame({'a': []})
> schema = pa.schema([pa.field('a', pa.string())])
> pa.Table.from_pandas(df, schema=schema)
> {code}
[jira] [Updated] (ARROW-2391) [Python] Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64
[ https://issues.apache.org/jira/browse/ARROW-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Challis updated ARROW-2391:
--------------------------------
Description:
When calling `pyarrow.Table.from_pandas` with a `pandas.DataFrame` and a `pyarrow.Schema` provided, the call results in a segmentation fault if a Pandas `datetime64[ns]` column is to be converted to a `pyarrow.date64` type.

A minimal example which shows this is:

{code:python}
import pandas as pd
import pyarrow as pa
df = pd.DataFrame({'created': ['2018-05-10T10:24:01']})
df['created'] = pd.to_datetime(df['created'])
schema = pa.schema([pa.field('created', pa.date64())])
pa.Table.from_pandas(df, schema=schema)
{code}

Executing the above causes the Python interpreter to exit with "Segmentation fault: 11".

Attempting to convert into various other datatypes (by specifying different schemas) either succeeds, or raises an exception if the conversion is invalid.

was:
When trying to call `pyarrow.Table.from_pandas` with a `pandas.DataFrame` and a `pyarrow.Schema` provided, the function call results in a segmentation fault if Pandas `datetime64[ns]` column tries to be converted to a `pyarrow.date64` type.

A minimal example which shows this is:

{{import pandas as pd}}
{{import pyarrow as pa}}
{{df = pd.DataFrame(\{'created': ['2018-05-10T10:24:01']})}}
{{df['created'] = pd.to_datetime(df['created'])}}
{{schema = pa.schema([pa.field('created', pa.date64())])}}
{{pa.Table.from_pandas(df, schema=schema)}}

Executing the above causes the python interpreter to exit with "Segmentation fault: 11".

Attempting to convert into various other datatypes (by specifying different schemas) either succeeds, or raises an exception if the conversion is invalid.

Summary: [Python] Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64 (was: Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64)

> [Python] Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64
> -----------------------------------------------------------------------------------------------
>
> Key: ARROW-2391
> URL: https://issues.apache.org/jira/browse/ARROW-2391
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Environment: Mac OS High Sierra
> Python 3.6
> Reporter: Dave Challis
> Priority: Major
>
> When calling `pyarrow.Table.from_pandas` with a `pandas.DataFrame` and a `pyarrow.Schema` provided, the call results in a segmentation fault if a Pandas `datetime64[ns]` column is to be converted to a `pyarrow.date64` type.
> A minimal example which shows this is:
> {code:python}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({'created': ['2018-05-10T10:24:01']})
> df['created'] = pd.to_datetime(df['created'])
> schema = pa.schema([pa.field('created', pa.date64())])
> pa.Table.from_pandas(df, schema=schema)
> {code}
> Executing the above causes the Python interpreter to exit with "Segmentation fault: 11".
> Attempting to convert into various other datatypes (by specifying different schemas) either succeeds, or raises an exception if the conversion is invalid.
[jira] [Updated] (ARROW-2406) [Python] Segfault when creating PyArrow table from Pandas for empty string column when schema provided
[ https://issues.apache.org/jira/browse/ARROW-2406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Challis updated ARROW-2406:
--------------------------------
Description:
Minimal example to recreate:

{code}
import pandas as pd
import pyarrow as pa
df = pd.DataFrame({'a': []})
df['a'] = df['a'].astype(str)
schema = pa.schema([pa.field('a', pa.string())])
pa.Table.from_pandas(df, schema=schema)
{code}

This causes the python interpreter to exit with "Segmentation fault: 11".

The following examples all work without any issue:

{code}
# column 'a' is no longer empty
df = pd.DataFrame({'a': ['foo']})
df['a'] = df['a'].astype(str)
schema = pa.schema([pa.field('a', pa.string())])
pa.Table.from_pandas(df, schema=schema)
{code}
{code}
# column 'a' is empty, but no schema is specified
df = pd.DataFrame({'a': []})
df['a'] = df['a'].astype(str)
pa.Table.from_pandas(df)
{code}
{code}
# column 'a' is empty, but no type 'str' specified in Pandas
df = pd.DataFrame({'a': []})
schema = pa.schema([pa.field('a', pa.string())])
pa.Table.from_pandas(df, schema=schema)
{code}

was:
Minimal example to recreate:

{code}
import pandas as pd
import pyarrow as pa
df = pd.DataFrame({'a': []})
df['a'] = df['a'].astype(str)
schema = pa.schema([pa.field('a', pa.string())])
pa.Table.from_pandas(df, schema=schema)
{code}

This causes the python interpreter to exit with "Segmentation fault: 11".

The following examples all work without any issue:

{code}
# column 'a' is no longer empty
df = pd.DataFrame({'a': ['foo']})
df['a'] = df['a'].astype(str)
schema = pa.schema([pa.field('a', pa.string())])
pa.Table.from_pandas(df, schema=schema)
{code}
{code}
# column 'a' is empty, but no schema is specified
df = pd.DataFrame({'a': []})
df['a'] = df['a'].astype(str)
pa.Table.from_pandas(df)
{code}
{code}
# column 'a' is empty, but no type 'str' specified in Pandas
df = pd.DataFrame({'a': []})
schema = pa.schema([pa.field('a', pa.string())])
pa.Table.from_pandas(df, schema=schema)
{code}

> [Python] Segfault when creating PyArrow table from Pandas for empty string column when schema provided
> -------------------------------------------------------------------------------------------------------
>
> Key: ARROW-2406
> URL: https://issues.apache.org/jira/browse/ARROW-2406
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Environment: Mac OS High Sierra
> Python 3.6.3
> Reporter: Dave Challis
> Priority: Major
>
> Minimal example to recreate:
> {code}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({'a': []})
> df['a'] = df['a'].astype(str)
> schema = pa.schema([pa.field('a', pa.string())])
> pa.Table.from_pandas(df, schema=schema)
> {code}
>
> This causes the python interpreter to exit with "Segmentation fault: 11".
> The following examples all work without any issue:
> {code}
> # column 'a' is no longer empty
> df = pd.DataFrame({'a': ['foo']})
> df['a'] = df['a'].astype(str)
> schema = pa.schema([pa.field('a', pa.string())])
> pa.Table.from_pandas(df, schema=schema)
> {code}
> {code}
> # column 'a' is empty, but no schema is specified
> df = pd.DataFrame({'a': []})
> df['a'] = df['a'].astype(str)
> pa.Table.from_pandas(df)
> {code}
> {code}
> # column 'a' is empty, but no type 'str' specified in Pandas
> df = pd.DataFrame({'a': []})
> schema = pa.schema([pa.field('a', pa.string())])
> pa.Table.from_pandas(df, schema=schema)
> {code}
[jira] [Updated] (ARROW-2406) [Python] Segfault when creating PyArrow table from Pandas for empty string column when schema provided
[ https://issues.apache.org/jira/browse/ARROW-2406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dave Challis updated ARROW-2406:
--------------------------------
Description:
Minimal example to recreate:

{code}
import pandas as pd
import pyarrow as pa
df = pd.DataFrame({'a': []})
df['a'] = df['a'].astype(str)
schema = pa.schema([pa.field('a', pa.string())])
pa.Table.from_pandas(df, schema=schema)
{code}

This causes the python interpreter to exit with "Segmentation fault: 11".

The following examples all work without any issue:

{code}
# column 'a' is no longer empty
df = pd.DataFrame({'a': ['foo']})
df['a'] = df['a'].astype(str)
schema = pa.schema([pa.field('a', pa.string())])
pa.Table.from_pandas(df, schema=schema)
{code}
{code}
# column 'a' is empty, but no schema is specified
df = pd.DataFrame({'a': []})
df['a'] = df['a'].astype(str)
pa.Table.from_pandas(df)
{code}

was:
Minimal example to recreate:

{code:python}
import pandas as pd
import pyarrow as pa
df = pd.DataFrame({'a': []})
df['a'] = df['a'].astype(str)
schema = pa.schema([pa.field('a', pa.string())])
pa.Table.from_pandas(df, schema=schema)
{code}

This causes the python interpreter to exit with "Segmentation fault: 11".

The following examples all work without any issue:

{code:python}
# column 'a' is no longer empty
df = pd.DataFrame({'a': ['foo']})
df['a'] = df['a'].astype(str)
schema = pa.schema([pa.field('a', pa.string())])
pa.Table.from_pandas(df, schema=schema)
{code}
{code:python}
# column 'a' is empty, but no schema is specified
df = pd.DataFrame({'a': []})
df['a'] = df['a'].astype(str)
pa.Table.from_pandas(df)
{code}

> [Python] Segfault when creating PyArrow table from Pandas for empty string column when schema provided
> -------------------------------------------------------------------------------------------------------
>
> Key: ARROW-2406
> URL: https://issues.apache.org/jira/browse/ARROW-2406
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.9.0
> Environment: Mac OS High Sierra
> Python 3.6.3
> Reporter: Dave Challis
> Priority: Major
>
> Minimal example to recreate:
> {code}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({'a': []})
> df['a'] = df['a'].astype(str)
> schema = pa.schema([pa.field('a', pa.string())])
> pa.Table.from_pandas(df, schema=schema)
> {code}
>
> This causes the python interpreter to exit with "Segmentation fault: 11".
> The following examples all work without any issue:
> {code}
> # column 'a' is no longer empty
> df = pd.DataFrame({'a': ['foo']})
> df['a'] = df['a'].astype(str)
> schema = pa.schema([pa.field('a', pa.string())])
> pa.Table.from_pandas(df, schema=schema)
> {code}
> {code}
> # column 'a' is empty, but no schema is specified
> df = pd.DataFrame({'a': []})
> df['a'] = df['a'].astype(str)
> pa.Table.from_pandas(df)
> {code}
[jira] [Created] (ARROW-2406) [Python] Segfault when creating PyArrow table from Pandas for empty string column when schema provided
Dave Challis created ARROW-2406:
-----------------------------------

Summary: [Python] Segfault when creating PyArrow table from Pandas for empty string column when schema provided
Key: ARROW-2406
URL: https://issues.apache.org/jira/browse/ARROW-2406
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.9.0
Environment: Mac OS High Sierra
Python 3.6.3
Reporter: Dave Challis

Minimal example to recreate:

{code:python}
import pandas as pd
import pyarrow as pa
df = pd.DataFrame({'a': []})
df['a'] = df['a'].astype(str)
schema = pa.schema([pa.field('a', pa.string())])
pa.Table.from_pandas(df, schema=schema)
{code}

This causes the python interpreter to exit with "Segmentation fault: 11".

The following examples all work without any issue:

{code:python}
# column 'a' is no longer empty
df = pd.DataFrame({'a': ['foo']})
df['a'] = df['a'].astype(str)
schema = pa.schema([pa.field('a', pa.string())])
pa.Table.from_pandas(df, schema=schema)
{code}
{code:python}
# column 'a' is empty, but no schema is specified
df = pd.DataFrame({'a': []})
df['a'] = df['a'].astype(str)
pa.Table.from_pandas(df)
{code}
[jira] [Created] (ARROW-2391) Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64
Dave Challis created ARROW-2391:
-----------------------------------

Summary: Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64
Key: ARROW-2391
URL: https://issues.apache.org/jira/browse/ARROW-2391
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.9.0
Environment: Mac OS High Sierra
Python 3.6
Reporter: Dave Challis

When trying to call `pyarrow.Table.from_pandas` with a `pandas.DataFrame` and a `pyarrow.Schema` provided, the function call results in a segmentation fault if Pandas `datetime64[ns]` column tries to be converted to a `pyarrow.date64` type.

A minimal example which shows this is:

{{import pandas as pd}}
{{import pyarrow as pa}}
{{df = pd.DataFrame(\{'created': ['2018-05-10T10:24:01']})}}
{{df['created'] = pd.to_datetime(df['created'])}}
{{schema = pa.schema([pa.field('created', pa.date64())])}}
{{pa.Table.from_pandas(df, schema=schema)}}

Executing the above causes the python interpreter to exit with "Segmentation fault: 11".

Attempting to convert into various other datatypes (by specifying different schemas) either succeeds, or raises an exception if the conversion is invalid.
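Because a segmentation fault terminates the interpreter, a reproduction like the one above is easier to check from a separate process. The following is a minimal sketch using only the standard library; the helper names `run_isolated` and `describe` are hypothetical, and the snippets passed in here are harmless placeholders rather than the PyArrow reproduction itself. It assumes a POSIX system, where `subprocess` reports a child killed by a signal as a negative return code, so a segfault (signal 11) would appear as `-11`.

```python
import subprocess
import sys


def run_isolated(snippet: str) -> int:
    """Run a code snippet in a fresh interpreter and return its exit status."""
    completed = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,
        text=True,
    )
    return completed.returncode


def describe(returncode: int) -> str:
    """Turn a subprocess return code into a short human-readable summary."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # POSIX only: a negative value is the number of the terminating signal.
        return f"killed by signal {-returncode}"
    return f"exited with status {returncode}"


# Placeholder snippets; a real check would pass the reproduction as a string.
clean_status = run_isolated("pass")
failing_status = run_isolated("import sys; sys.exit(3)")
```

Running the reproduction this way lets a test suite record the crash as a failed case instead of being terminated by it.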