[jira] [Commented] (ARROW-2429) [Python] Timestamp unit in schema changes when writing to Parquet file then reading back

2018-04-17 Thread Dave Challis (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441058#comment-16441058
 ] 

Dave Challis commented on ARROW-2429:
-

Thanks [~xhochy] and [~joshuastorck], that makes sense now.
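
For context, the unit change comes from the Parquet format itself: the 1.0
logical types only cover millisecond and microsecond timestamps, so pyarrow
coerces nanosecond values to microseconds on write. A minimal sketch of the
options that control this, assuming the `coerce_timestamps` and
`use_deprecated_int96_timestamps` arguments of `pyarrow.parquet.write_table`
(present in current releases; worth confirming for 0.9.0):

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
df['created'] = pd.to_datetime(df['created'])
table = pa.Table.from_pandas(df, preserve_index=False)

# Make the coercion explicit: store microseconds (same outcome as the default).
pq.write_table(table, 'foo_us.parquet', coerce_timestamps='us')

# Or keep nanosecond precision via the legacy INT96 physical type.
pq.write_table(table, 'foo_int96.parquet', use_deprecated_int96_timestamps=True)

print(pq.read_table('foo_us.parquet').schema[0])     # expected: timestamp[us]
print(pq.read_table('foo_int96.parquet').schema[0])  # typically timestamp[ns]
{code}

The INT96 route preserves the nanosecond unit on read-back at the cost of
writing a deprecated physical type; the default behaviour is the silent
coercion to `timestamp[us]` seen above.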

> [Python] Timestamp unit in schema changes when writing to Parquet file then 
> reading back
> 
>
> Key: ARROW-2429
> URL: https://issues.apache.org/jira/browse/ARROW-2429
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
> Environment: Mac OS High Sierra
> PyArrow 0.9.0 (py36_1)
> Python
>Reporter: Dave Challis
>Assignee: Uwe L. Korn
>Priority: Minor
>
> When creating an Arrow table from a Pandas DataFrame, the table schema 
> contains a field of type `timestamp[ns]`.
> When serialising that table to a parquet file and then immediately reading it 
> back, the schema of the table read instead contains a field with type 
> `timestamp[us]`.
> Minimal example:
>  
> {code:python}
> #!/usr/bin/env python
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> # create DataFrame with a datetime column
> df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
> df['created'] = pd.to_datetime(df['created'])
> # create Arrow table from DataFrame
> table = pa.Table.from_pandas(df, preserve_index=False)
> # write the table as a parquet file, then read it back again
> pq.write_table(table, 'foo.parquet')
> table2 = pq.read_table('foo.parquet')
> print(table.schema[0])  # pyarrow.Field<created: timestamp[ns]> (nanosecond units)
> print(table2.schema[0]) # pyarrow.Field<created: timestamp[us]> (microsecond units)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2429) [Python] Timestamp unit in schema changes when writing to Parquet file then reading back

2018-04-09 Thread Dave Challis (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Challis updated ARROW-2429:

Description: 
When creating an Arrow table from a Pandas DataFrame, the table schema contains 
a field of type `timestamp[ns]`.

When serialising that table to a parquet file and then immediately reading it 
back, the schema of the table read instead contains a field with type 
`timestamp[us]`.

Minimal example:
 
{code:python}
#!/usr/bin/env python

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# create DataFrame with a datetime column
df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
df['created'] = pd.to_datetime(df['created'])

# create Arrow table from DataFrame
table = pa.Table.from_pandas(df, preserve_index=False)

# write the table as a parquet file, then read it back again
pq.write_table(table, 'foo.parquet')
table2 = pq.read_table('foo.parquet')

print(table.schema[0])  # pyarrow.Field<created: timestamp[ns]> (nanosecond units)
print(table2.schema[0]) # pyarrow.Field<created: timestamp[us]> (microsecond units)
{code}

  was:
When creating an Arrow table from a Pandas DataFrame, the table schema contains 
a field of type `timestamp[ns]`.

When serialising that table to a parquet file and then immediately reading it 
back, the schema of the table read instead contains a field with type 
`timestamp[us]`.

 
{code:python}
#!/usr/bin/env python

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# create DataFrame with a datetime column
df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
df['created'] = pd.to_datetime(df['created'])

# create Arrow table from DataFrame
table = pa.Table.from_pandas(df, preserve_index=False)

# write the table as a parquet file, then read it back again
pq.write_table(table, 'foo.parquet')
table2 = pq.read_table('foo.parquet')

print(table.schema[0])  # pyarrow.Field<created: timestamp[ns]> (nanosecond units)
print(table2.schema[0]) # pyarrow.Field<created: timestamp[us]> (microsecond units)
{code}


> [Python] Timestamp unit in schema changes when writing to Parquet file then 
> reading back
> 
>
> Key: ARROW-2429
> URL: https://issues.apache.org/jira/browse/ARROW-2429
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
> Environment: Mac OS High Sierra
> PyArrow 0.9.0 (py36_1)
> Python
>Reporter: Dave Challis
>Priority: Minor
>
> When creating an Arrow table from a Pandas DataFrame, the table schema 
> contains a field of type `timestamp[ns]`.
> When serialising that table to a parquet file and then immediately reading it 
> back, the schema of the table read instead contains a field with type 
> `timestamp[us]`.
> Minimal example:
>  
> {code:python}
> #!/usr/bin/env python
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> # create DataFrame with a datetime column
> df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
> df['created'] = pd.to_datetime(df['created'])
> # create Arrow table from DataFrame
> table = pa.Table.from_pandas(df, preserve_index=False)
> # write the table as a parquet file, then read it back again
> pq.write_table(table, 'foo.parquet')
> table2 = pq.read_table('foo.parquet')
> print(table.schema[0])  # pyarrow.Field<created: timestamp[ns]> (nanosecond units)
> print(table2.schema[0]) # pyarrow.Field<created: timestamp[us]> (microsecond units)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2429) [Python] Timestamp unit in schema changes when writing to Parquet file then reading back

2018-04-09 Thread Dave Challis (JIRA)
Dave Challis created ARROW-2429:
---

 Summary: [Python] Timestamp unit in schema changes when writing to 
Parquet file then reading back
 Key: ARROW-2429
 URL: https://issues.apache.org/jira/browse/ARROW-2429
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.9.0
 Environment: Mac OS High Sierra
PyArrow 0.9.0 (py36_1)
Python
Reporter: Dave Challis


When creating an Arrow table from a Pandas DataFrame, the table schema contains 
a field of type `timestamp[ns]`.

When serialising that table to a parquet file and then immediately reading it 
back, the schema of the table read instead contains a field with type 
`timestamp[us]`.

 
{code:python}
#!/usr/bin/env python

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# create DataFrame with a datetime column
df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
df['created'] = pd.to_datetime(df['created'])

# create Arrow table from DataFrame
table = pa.Table.from_pandas(df, preserve_index=False)

# write the table as a parquet file, then read it back again
pq.write_table(table, 'foo.parquet')
table2 = pq.read_table('foo.parquet')



print(table.schema[0])  # pyarrow.Field<created: timestamp[ns]> (nanosecond units)
print(table2.schema[0]) # pyarrow.Field<created: timestamp[us]> (microsecond units)
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2429) [Python] Timestamp unit in schema changes when writing to Parquet file then reading back

2018-04-09 Thread Dave Challis (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Challis updated ARROW-2429:

Description: 
When creating an Arrow table from a Pandas DataFrame, the table schema contains 
a field of type `timestamp[ns]`.

When serialising that table to a parquet file and then immediately reading it 
back, the schema of the table read instead contains a field with type 
`timestamp[us]`.

 
{code:python}
#!/usr/bin/env python

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# create DataFrame with a datetime column
df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
df['created'] = pd.to_datetime(df['created'])

# create Arrow table from DataFrame
table = pa.Table.from_pandas(df, preserve_index=False)

# write the table as a parquet file, then read it back again
pq.write_table(table, 'foo.parquet')
table2 = pq.read_table('foo.parquet')

print(table.schema[0])  # pyarrow.Field<created: timestamp[ns]> (nanosecond units)
print(table2.schema[0]) # pyarrow.Field<created: timestamp[us]> (microsecond units)
{code}

  was:
When creating an Arrow table from a Pandas DataFrame, the table schema contains 
a field of type `timestamp[ns]`.

When serialising that table to a parquet file and then immediately reading it 
back, the schema of the table read instead contains a field with type 
`timestamp[us]`.

 
{code:python}
#!/usr/bin/env python

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# create DataFrame with a datetime column
df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
df['created'] = pd.to_datetime(df['created'])

# create Arrow table from DataFrame
table = pa.Table.from_pandas(df, preserve_index=False)

# write the table as a parquet file, then read it back again
pq.write_table(table, 'foo.parquet')
table2 = pq.read_table('foo.parquet')



print(table.schema[0])  # pyarrow.Field<created: timestamp[ns]> (nanosecond units)
print(table2.schema[0]) # pyarrow.Field<created: timestamp[us]> (microsecond units)
{code}


> [Python] Timestamp unit in schema changes when writing to Parquet file then 
> reading back
> 
>
> Key: ARROW-2429
> URL: https://issues.apache.org/jira/browse/ARROW-2429
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
> Environment: Mac OS High Sierra
> PyArrow 0.9.0 (py36_1)
> Python
>Reporter: Dave Challis
>Priority: Minor
>
> When creating an Arrow table from a Pandas DataFrame, the table schema 
> contains a field of type `timestamp[ns]`.
> When serialising that table to a parquet file and then immediately reading it 
> back, the schema of the table read instead contains a field with type 
> `timestamp[us]`.
>  
> {code:python}
> #!/usr/bin/env python
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> # create DataFrame with a datetime column
> df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
> df['created'] = pd.to_datetime(df['created'])
> # create Arrow table from DataFrame
> table = pa.Table.from_pandas(df, preserve_index=False)
> # write the table as a parquet file, then read it back again
> pq.write_table(table, 'foo.parquet')
> table2 = pq.read_table('foo.parquet')
> print(table.schema[0])  # pyarrow.Field<created: timestamp[ns]> (nanosecond units)
> print(table2.schema[0]) # pyarrow.Field<created: timestamp[us]> (microsecond units)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2423) [Python] PyArrow datatypes raise ValueError on equality checks against non-PyArrow objects

2018-04-09 Thread Dave Challis (JIRA)
Dave Challis created ARROW-2423:
---

 Summary: [Python] PyArrow datatypes raise ValueError on equality 
checks against non-PyArrow objects
 Key: ARROW-2423
 URL: https://issues.apache.org/jira/browse/ARROW-2423
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.9.0
 Environment: Mac OS High Sierra
PyArrow 0.9.0 (py36_1)
Python 3.6.3
Reporter: Dave Challis


Checking a PyArrow datatype object for equality with non-PyArrow datatypes 
causes a `ValueError` to be raised, rather than either returning a True/False 
value, or returning 
[NotImplemented|https://docs.python.org/3/library/constants.html#NotImplemented]
 if the comparison isn't implemented.

E.g. attempting to call:
{code:java}
import pyarrow
pyarrow.int32() == 'foo'
{code}
results in:
{code:java}
Traceback (most recent call last):
  File "types.pxi", line 1221, in pyarrow.lib.type_for_alias
KeyError: 'foo'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "t.py", line 2, in 
pyarrow.int32() == 'foo'
  File "types.pxi", line 90, in pyarrow.lib.DataType.__richcmp__
  File "types.pxi", line 113, in pyarrow.lib.DataType.equals
  File "types.pxi", line 1223, in pyarrow.lib.type_for_alias
ValueError: No type alias for foo
{code}
The expected outcome for the above would be for the comparison to return 
`False`, as that's the general behaviour for comparisons between objects of 
different types (e.g. `1 == 'foo'` or `object() == 12.4` both return `False`).
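
A user-side guard avoids the exception until the comparison semantics are fixed
in the library. A minimal sketch, assuming only the public `pyarrow.DataType`
class and its `equals` method; the `safe_type_equals` helper below is
hypothetical and not part of pyarrow:

{code:python}
import pyarrow as pa

def safe_type_equals(dtype, other):
    # Hypothetical helper: delegate to DataType.equals only when the other
    # operand is itself a pyarrow DataType; everything else compares unequal.
    if not isinstance(other, pa.DataType):
        return False
    return dtype.equals(other)

print(safe_type_equals(pa.int32(), pa.int32()))  # True
print(safe_type_equals(pa.int32(), 'foo'))       # False, no ValueError raised
{code}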



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2406) [Python] Segfault when creating PyArrow table from Pandas for empty string column when schema provided

2018-04-09 Thread Dave Challis (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16430261#comment-16430261
 ] 

Dave Challis commented on ARROW-2406:
-

[~kszucs] My mistake, I retested and noticed I was using an older env with 
pyarrow 0.8.0; it looks like the issue was resolved in 0.9.0.
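
For anyone retesting, a quick check of which build the active environment
actually has before relying on the fix (a minimal sketch; only `pa.__version__`
and the original repro are used here):

{code:python}
import pandas as pd
import pyarrow as pa

# The segfault was observed with pyarrow 0.8.0 and reported fixed in 0.9.0.
print(pa.__version__)

df = pd.DataFrame({'a': []})
df['a'] = df['a'].astype(str)
schema = pa.schema([pa.field('a', pa.string())])
table = pa.Table.from_pandas(df, schema=schema)  # should succeed on >= 0.9.0
print(table.schema)
{code}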

> [Python] Segfault when creating PyArrow table from Pandas for empty string 
> column when schema provided
> --
>
> Key: ARROW-2406
> URL: https://issues.apache.org/jira/browse/ARROW-2406
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: Mac OS High Sierra
> Python 3.6.3
>Reporter: Dave Challis
>Priority: Major
> Fix For: 0.9.0
>
>
> Minimal example to recreate:
> {code}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({'a': []})
> df['a'] = df['a'].astype(str)
> schema = pa.schema([pa.field('a', pa.string())])
> pa.Table.from_pandas(df, schema=schema){code}
>  
> This causes the python interpreter to exit with "Segmentation fault: 11".
> The following examples all work without any issue:
> {code}
> # column 'a' is no longer empty
> df = pd.DataFrame({'a': ['foo']})
> df['a'] = df['a'].astype(str)
> schema = pa.schema([pa.field('a', pa.string())])
> pa.Table.from_pandas(df, schema=schema)
> {code}
> {code}
> # column 'a' is empty, but no schema is specified
> df = pd.DataFrame({'a': []})
> df['a'] = df['a'].astype(str)
> pa.Table.from_pandas(df)
> {code}
> {code}
> # column 'a' is empty, but no type 'str' specified in Pandas
> df = pd.DataFrame({'a': []})
> schema = pa.schema([pa.field('a', pa.string())])
> pa.Table.from_pandas(df, schema=schema)
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (ARROW-2406) [Python] Segfault when creating PyArrow table from Pandas for empty string column when schema provided

2018-04-09 Thread Dave Challis (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Challis closed ARROW-2406.
---
   Resolution: Fixed
Fix Version/s: (was: 0.10.0)
   0.9.0

> [Python] Segfault when creating PyArrow table from Pandas for empty string 
> column when schema provided
> --
>
> Key: ARROW-2406
> URL: https://issues.apache.org/jira/browse/ARROW-2406
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: Mac OS High Sierra
> Python 3.6.3
>Reporter: Dave Challis
>Priority: Major
> Fix For: 0.9.0
>
>
> Minimal example to recreate:
> {code}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({'a': []})
> df['a'] = df['a'].astype(str)
> schema = pa.schema([pa.field('a', pa.string())])
> pa.Table.from_pandas(df, schema=schema){code}
>  
> This causes the python interpreter to exit with "Segmentation fault: 11".
> The following examples all work without any issue:
> {code}
> # column 'a' is no longer empty
> df = pd.DataFrame({'a': ['foo']})
> df['a'] = df['a'].astype(str)
> schema = pa.schema([pa.field('a', pa.string())])
> pa.Table.from_pandas(df, schema=schema)
> {code}
> {code}
> # column 'a' is empty, but no schema is specified
> df = pd.DataFrame({'a': []})
> df['a'] = df['a'].astype(str)
> pa.Table.from_pandas(df)
> {code}
> {code}
> # column 'a' is empty, but no type 'str' specified in Pandas
> df = pd.DataFrame({'a': []})
> schema = pa.schema([pa.field('a', pa.string())])
> pa.Table.from_pandas(df, schema=schema)
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2406) [Python] Segfault when creating PyArrow table from Pandas for empty string column when schema provided

2018-04-09 Thread Dave Challis (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Challis updated ARROW-2406:

Affects Version/s: (was: 0.9.0)
   0.8.0

> [Python] Segfault when creating PyArrow table from Pandas for empty string 
> column when schema provided
> --
>
> Key: ARROW-2406
> URL: https://issues.apache.org/jira/browse/ARROW-2406
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.8.0
> Environment: Mac OS High Sierra
> Python 3.6.3
>Reporter: Dave Challis
>Priority: Major
> Fix For: 0.9.0
>
>
> Minimal example to recreate:
> {code}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({'a': []})
> df['a'] = df['a'].astype(str)
> schema = pa.schema([pa.field('a', pa.string())])
> pa.Table.from_pandas(df, schema=schema){code}
>  
> This causes the python interpreter to exit with "Segmentation fault: 11".
> The following examples all work without any issue:
> {code}
> # column 'a' is no longer empty
> df = pd.DataFrame({'a': ['foo']})
> df['a'] = df['a'].astype(str)
> schema = pa.schema([pa.field('a', pa.string())])
> pa.Table.from_pandas(df, schema=schema)
> {code}
> {code}
> # column 'a' is empty, but no schema is specified
> df = pd.DataFrame({'a': []})
> df['a'] = df['a'].astype(str)
> pa.Table.from_pandas(df)
> {code}
> {code}
> # column 'a' is empty, but no type 'str' specified in Pandas
> df = pd.DataFrame({'a': []})
> schema = pa.schema([pa.field('a', pa.string())])
> pa.Table.from_pandas(df, schema=schema)
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2391) [Python] Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64

2018-04-06 Thread Dave Challis (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Challis updated ARROW-2391:

Description: 
When trying to call `pyarrow.Table.from_pandas` with a `pandas.DataFrame` and a 
`pyarrow.Schema` provided, the function call results in a segmentation fault if 
a Pandas `datetime64[ns]` column is converted to a `pyarrow.date64` type.

A minimal example which shows this is:
{code:python}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'created': ['2018-05-10T10:24:01']})
df['created'] = pd.to_datetime(df['created'])
schema = pa.schema([pa.field('created', pa.date64())])
pa.Table.from_pandas(df, schema=schema)
{code}

Executing the above causes the python interpreter to exit with "Segmentation 
fault: 11".

Attempting to convert into various other datatypes (by specifying different 
schemas) either succeeds, or raises an exception if the conversion is invalid.

  was:
When trying to call `pyarrow.Table.from_pandas` with a `pandas.DataFrame` and a 
`pyarrow.Schema` provided, the function call results in a segmentation fault if 
a Pandas `datetime64[ns]` column is converted to a `pyarrow.date64` type.

 

A minimal example which shows this is:

{{import pandas as pd}}
{{import pyarrow as pa}}

{{df = pd.DataFrame(\{'created': ['2018-05-10T10:24:01']})}}
{{df['created'] = pd.to_datetime(df['created'])}}
{{schema = pa.schema([pa.field('created', pa.date64())])}}
{{pa.Table.from_pandas(df, schema=schema)}}

 

Executing the above causes the python interpreter to exit with "Segmentation 
fault: 11".

 

Attempting to convert into various other datatypes (by specifying different 
schemas) either succeeds, or raises an exception if the conversion is invalid.

Summary: [Python] Segmentation fault from PyArrow when mapping Pandas 
datetime column to pyarrow.date64  (was: Segmentation fault from PyArrow when 
mapping Pandas datetime column to pyarrow.date64)

> [Python] Segmentation fault from PyArrow when mapping Pandas datetime column 
> to pyarrow.date64
> --
>
> Key: ARROW-2391
> URL: https://issues.apache.org/jira/browse/ARROW-2391
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
> Environment: Mac OS High Sierra
> Python 3.6
>Reporter: Dave Challis
>Priority: Major
>
> When trying to call `pyarrow.Table.from_pandas` with a `pandas.DataFrame` and 
> a `pyarrow.Schema` provided, the function call results in a segmentation 
> fault if a Pandas `datetime64[ns]` column is converted to a `pyarrow.date64` 
> type.
> A minimal example which shows this is:
> {code:python}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({'created': ['2018-05-10T10:24:01']})
> df['created'] = pd.to_datetime(df['created'])
> schema = pa.schema([pa.field('created', pa.date64())])
> pa.Table.from_pandas(df, schema=schema)
> {code}
> Executing the above causes the python interpreter to exit with "Segmentation 
> fault: 11".
> Attempting to convert into various other datatypes (by specifying different 
> schemas) either succeeds, or raises an exception if the conversion is invalid.
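
One possible workaround while the crash is open is to hand pyarrow plain
`datetime.date` values instead of asking `from_pandas` to cast `datetime64[ns]`
to a date type itself. The sketch below uses `date32` in the schema and is
illustrative only, not verified against 0.9.0:

{code:python}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'created': ['2018-05-10T10:24:01']})
df['created'] = pd.to_datetime(df['created'])

# Drop the time component in pandas, yielding a column of datetime.date objects.
df['created'] = df['created'].dt.date

schema = pa.schema([pa.field('created', pa.date32())])
table = pa.Table.from_pandas(df, schema=schema)
print(table.schema)
{code}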



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2406) [Python] Segfault when creating PyArrow table from Pandas for empty string column when schema provided

2018-04-06 Thread Dave Challis (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Challis updated ARROW-2406:

Description: 
Minimal example to recreate:
{code}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'a': []})
df['a'] = df['a'].astype(str)
schema = pa.schema([pa.field('a', pa.string())])
pa.Table.from_pandas(df, schema=schema){code}
 
This causes the python interpreter to exit with "Segmentation fault: 11".

The following examples all work without any issue:
{code}
# column 'a' is no longer empty
df = pd.DataFrame({'a': ['foo']})
df['a'] = df['a'].astype(str)
schema = pa.schema([pa.field('a', pa.string())])
pa.Table.from_pandas(df, schema=schema)
{code}
{code}
# column 'a' is empty, but no schema is specified
df = pd.DataFrame({'a': []})
df['a'] = df['a'].astype(str)
pa.Table.from_pandas(df)
{code}
{code}
# column 'a' is empty, but no type 'str' specified in Pandas
df = pd.DataFrame({'a': []})
schema = pa.schema([pa.field('a', pa.string())])
pa.Table.from_pandas(df, schema=schema)
{code}
 

  was:
Minimal example to recreate:
{code}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'a': []})
df['a'] = df['a'].astype(str)
schema = pa.schema([pa.field('a', pa.string())])
pa.Table.from_pandas(df, schema=schema){code}
 

This causes the python interpreter to exit with "Segmentation fault: 11".

The following examples all work without any issue:
{code}
# column 'a' is no longer empty
df = pd.DataFrame({'a': ['foo']})
df['a'] = df['a'].astype(str)
schema = pa.schema([pa.field('a', pa.string())])
pa.Table.from_pandas(df, schema=schema)
{code}
{code}
# column 'a' is empty, but no schema is specified
df = pd.DataFrame({'a': []})
df['a'] = df['a'].astype(str)
pa.Table.from_pandas(df)
{code}
{code}
# column 'a' is empty, but no type 'str' specified in Pandas
df = pd.DataFrame({'a': []})
schema = pa.schema([pa.field('a', pa.string())])
pa.Table.from_pandas(df, schema=schema)
{code}
 


> [Python] Segfault when creating PyArrow table from Pandas for empty string 
> column when schema provided
> --
>
> Key: ARROW-2406
> URL: https://issues.apache.org/jira/browse/ARROW-2406
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
> Environment: Mac OS High Sierra
> Python 3.6.3
>Reporter: Dave Challis
>Priority: Major
>
> Minimal example to recreate:
> {code}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({'a': []})
> df['a'] = df['a'].astype(str)
> schema = pa.schema([pa.field('a', pa.string())])
> pa.Table.from_pandas(df, schema=schema){code}
>  
> This causes the python interpreter to exit with "Segmentation fault: 11".
> The following examples all work without any issue:
> {code}
> # column 'a' is no longer empty
> df = pd.DataFrame({'a': ['foo']})
> df['a'] = df['a'].astype(str)
> schema = pa.schema([pa.field('a', pa.string())])
> pa.Table.from_pandas(df, schema=schema)
> {code}
> {code}
> # column 'a' is empty, but no schema is specified
> df = pd.DataFrame({'a': []})
> df['a'] = df['a'].astype(str)
> pa.Table.from_pandas(df)
> {code}
> {code}
> # column 'a' is empty, but no type 'str' specified in Pandas
> df = pd.DataFrame({'a': []})
> schema = pa.schema([pa.field('a', pa.string())])
> pa.Table.from_pandas(df, schema=schema)
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-2406) [Python] Segfault when creating PyArrow table from Pandas for empty string column when schema provided

2018-04-06 Thread Dave Challis (JIRA)

 [ 
https://issues.apache.org/jira/browse/ARROW-2406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Challis updated ARROW-2406:

Description: 
Minimal example to recreate:
{code}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'a': []})
df['a'] = df['a'].astype(str)
schema = pa.schema([pa.field('a', pa.string())])
pa.Table.from_pandas(df, schema=schema){code}
 

This causes the python interpreter to exit with "Segmentation fault: 11".

The following examples all work without any issue:
{code}
# column 'a' is no longer empty
df = pd.DataFrame({'a': ['foo']})
df['a'] = df['a'].astype(str)
schema = pa.schema([pa.field('a', pa.string())])
pa.Table.from_pandas(df, schema=schema)
{code}
{code}
# column 'a' is empty, but no schema is specified
df = pd.DataFrame({'a': []})
df['a'] = df['a'].astype(str)
pa.Table.from_pandas(df)
{code}
 

  was:
Minimal example to recreate:

 

 
{code:python}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'a': []})
df['a'] = df['a'].astype(str)
schema = pa.schema([pa.field('a', pa.string())])
pa.Table.from_pandas(df, schema=schema){code}
 

This causes the python interpreter to exit with "Segmentation fault: 11".

The following examples all work without any issue:
{code:python}
# column 'a' is no longer empty
df = pd.DataFrame({'a': ['foo']})
df['a'] = df['a'].astype(str)
schema = pa.schema([pa.field('a', pa.string())])
pa.Table.from_pandas(df, schema=schema)
{code}

{code:python}
# column 'a' is empty, but no schema is specified
df = pd.DataFrame({'a': []})
df['a'] = df['a'].astype(str)
pa.Table.from_pandas(df)
{code}


 


> [Python] Segfault when creating PyArrow table from Pandas for empty string 
> column when schema provided
> --
>
> Key: ARROW-2406
> URL: https://issues.apache.org/jira/browse/ARROW-2406
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.9.0
> Environment: Mac OS High Sierra
> Python 3.6.3
>Reporter: Dave Challis
>Priority: Major
>
> Minimal example to recreate:
> {code}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({'a': []})
> df['a'] = df['a'].astype(str)
> schema = pa.schema([pa.field('a', pa.string())])
> pa.Table.from_pandas(df, schema=schema){code}
>  
> This causes the python interpreter to exit with "Segmentation fault: 11".
> The following examples all work without any issue:
> {code}
> # column 'a' is no longer empty
> df = pd.DataFrame({'a': ['foo']})
> df['a'] = df['a'].astype(str)
> schema = pa.schema([pa.field('a', pa.string())])
> pa.Table.from_pandas(df, schema=schema)
> {code}
> {code}
> # column 'a' is empty, but no schema is specified
> df = pd.DataFrame({'a': []})
> df['a'] = df['a'].astype(str)
> pa.Table.from_pandas(df)
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2406) [Python] Segfault when creating PyArrow table from Pandas for empty string column when schema provided

2018-04-06 Thread Dave Challis (JIRA)
Dave Challis created ARROW-2406:
---

 Summary: [Python] Segfault when creating PyArrow table from Pandas 
for empty string column when schema provided
 Key: ARROW-2406
 URL: https://issues.apache.org/jira/browse/ARROW-2406
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.9.0
 Environment: Mac OS High Sierra
Python 3.6.3
Reporter: Dave Challis


Minimal example to recreate:

 

 
{code:python}
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'a': []})
df['a'] = df['a'].astype(str)
schema = pa.schema([pa.field('a', pa.string())])
pa.Table.from_pandas(df, schema=schema){code}
 

This causes the python interpreter to exit with "Segmentation fault: 11".

The following examples all work without any issue:
{code:python}
# column 'a' is no longer empty
df = pd.DataFrame({'a': ['foo']})
df['a'] = df['a'].astype(str)
schema = pa.schema([pa.field('a', pa.string())])
pa.Table.from_pandas(df, schema=schema)
{code}

{code:python}
# column 'a' is empty, but no schema is specified
df = pd.DataFrame({'a': []})
df['a'] = df['a'].astype(str)
pa.Table.from_pandas(df)
{code}


 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-2391) Segmentation fault from PyArrow when mapping Pandas datetime column to pyarrow.date64

2018-04-04 Thread Dave Challis (JIRA)
Dave Challis created ARROW-2391:
---

 Summary: Segmentation fault from PyArrow when mapping Pandas 
datetime column to pyarrow.date64
 Key: ARROW-2391
 URL: https://issues.apache.org/jira/browse/ARROW-2391
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Affects Versions: 0.9.0
 Environment: Mac OS High Sierra
Python 3.6
Reporter: Dave Challis


When trying to call `pyarrow.Table.from_pandas` with a `pandas.DataFrame` and a 
`pyarrow.Schema` provided, the function call results in a segmentation fault if 
a Pandas `datetime64[ns]` column is converted to a `pyarrow.date64` type.

 

A minimal example which shows this is:

{{import pandas as pd}}
{{import pyarrow as pa}}

{{df = pd.DataFrame(\{'created': ['2018-05-10T10:24:01']})}}
{{df['created'] = pd.to_datetime(df['created'])}}
{{schema = pa.schema([pa.field('created', pa.date64())])}}
{{pa.Table.from_pandas(df, schema=schema)}}

 

Executing the above causes the python interpreter to exit with "Segmentation 
fault: 11".

 

Attempting to convert into various other datatypes (by specifying different 
schemas) either succeeds, or raises an exception if the conversion is invalid.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)