[jira] [Commented] (ARROW-6038) [Python] pyarrow.Table.from_batches produces corrupted table if any of the batches were empty

2019-08-15 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908642#comment-16908642
 ] 

Wes McKinney commented on ARROW-6038:
-

I confirmed that the MWE now behaves properly: it raises ArrowInvalid instead of segfaulting.

{code}
$ python ~/Downloads/segfault_ex.py 
Creating table
Traceback (most recent call last):
  File "/home/wesm/Downloads/segfault_ex.py", line 11, in <module>
    pa.RecordBatch.from_arrays([pa.array(["C", "C", "C"])], schema),
  File "pyarrow/table.pxi", line 1117, in pyarrow.lib.Table.from_batches
    return pyarrow_wrap_table(c_table)
  File "pyarrow/public-api.pxi", line 316, in pyarrow.lib.pyarrow_wrap_table
    check_status(ctable.get().Validate())
  File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
    raise ArrowInvalid(message)
pyarrow.lib.ArrowInvalid: Column 0: In chunk 1 expected type string but saw null
{code}

This is still weird and dangerous though:

{code}
In [4]: pa.RecordBatch.from_arrays([pa.array([])], schema)
Out[4]: 

In [5]: rb = pa.RecordBatch.from_arrays([pa.array([])], schema)

In [6]: rb
Out[6]: 

In [7]: rb.schema
Out[7]: col: string

In [8]: rb[0]
Out[8]: 
0 nulls
{code}

I opened ARROW-6263 for this.

> [Python] pyarrow.Table.from_batches produces corrupted table if any of the 
> batches were empty
> -
>
> Key: ARROW-6038
> URL: https://issues.apache.org/jira/browse/ARROW-6038
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Python
>Affects Versions: 0.13.0, 0.14.0, 0.14.1
>Reporter: Piotr Bajger
>Assignee: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available, windows
> Fix For: 0.15.0
>
> Attachments: segfault_ex.py
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> When creating a Table from a list/iterator of batches which contains an 
> "empty" RecordBatch a Table is produced but attempts to run any pyarrow 
> built-in functions (such as unique()) occasionally result in a Segfault.
> The MWE is attached: [^segfault_ex.py]
>  # The segfaults happen randomly, around 30% of the time.
>  # Commenting out line 10 in the MWE results in no segfaults.
>  # The segfault is triggered using the unique() function, but I doubt the 
> behaviour is specific to that function, from what I gather the problem lies 
> in Table creation.
> I'm on Windows 10, using Python 3.6 and pyarrow 0.14.0 installed through pip 
> (problem also occurs with 0.13.0 from conda-forge).



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (ARROW-6038) [Python] pyarrow.Table.from_batches produces corrupted table if any of the batches were empty

2019-07-31 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897136#comment-16897136
 ] 

Antoine Pitrou commented on ARROW-6038:
---

Note that it would be possible to have faster type equality comparisons, if we 
want to invest a bit of time.
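
As a rough illustration of what a faster comparison could look like (a sketch only, not a description of what Arrow implements): a nested type can cache a string fingerprint of its structure, so that repeated equality checks reduce to an identity test or a string comparison instead of a recursive walk. ToyType, deep_equals and fast_equals are hypothetical names invented for this example.

```python
class ToyType:
    """Hypothetical stand-in for a possibly nested data type (not an Arrow class)."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = tuple(children)
        self._fingerprint = None  # computed lazily, then reused

    def fingerprint(self):
        # Build the structural fingerprint once; later calls return the cache.
        if self._fingerprint is None:
            inner = ",".join(child.fingerprint() for child in self.children)
            self._fingerprint = f"{self.name}<{inner}>"
        return self._fingerprint


def deep_equals(a, b):
    """Recursive structural comparison; the cost grows with the nesting."""
    return (
        a.name == b.name
        and len(a.children) == len(b.children)
        and all(deep_equals(x, y) for x, y in zip(a.children, b.children))
    )


def fast_equals(a, b):
    """Identity shortcut first, then compare the cached fingerprints."""
    if a is b:
        return True
    return a.fingerprint() == b.fingerprint()


string = ToyType("string")
nested_a = ToyType("list", [ToyType("struct", [string, ToyType("int64")])])
nested_b = ToyType("list", [ToyType("struct", [ToyType("string"), ToyType("int64")])])
null_list = ToyType("list", [ToyType("null")])

print(fast_equals(nested_a, nested_b))   # True
print(fast_equals(nested_a, null_list))  # False
```

The toy fingerprint assumes type names never contain "<", ">" or ","; a real scheme would need an unambiguous encoding.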






[jira] [Commented] (ARROW-6038) [Python] pyarrow.Table.from_batches produces corrupted table if any of the batches were empty

2019-07-31 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897133#comment-16897133
 ] 

Antoine Pitrou commented on ARROW-6038:
---

Ok, the issue here is that you are creating a Table column whose chunks have 
different types. The second array is empty, so its type is inferred as "null" 
rather than string. Arrow should reject this with an error instead of crashing.

However, comparing types can be a bit expensive (e.g. if they are nested 
types). [~wesmckinn] what do you think?






[jira] [Commented] (ARROW-6038) [Python] pyarrow.Table.from_batches produces corrupted table if any of the batches were empty

2019-07-26 Thread Piotr Bajger (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893383#comment-16893383
 ] 

Piotr Bajger commented on ARROW-6038:
-

Yes, it does, I updated the version labels.

> [Python] pyarrow.Table.from_batches produces corrupted table if any of the 
> batches were empty
> -
>
> Key: ARROW-6038
> URL: https://issues.apache.org/jira/browse/ARROW-6038
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.13.0, 0.14.0
>Reporter: Piotr Bajger
>Priority: Minor
>  Labels: windows
> Attachments: segfault_ex.py
>
>
> When creating a Table from a list/iterator of batches which contains an 
> "empty" RecordBatch a Table is produced but attempts to run any pyarrow 
> built-in functions (such as unique()) occasionally result in a Segfault.
> The MWE is attached: [^segfault_ex.py]
>  # The segfaults happen randomly, around 30% of the time.
>  # Commenting out line 10 in the MWE results in no segfaults.
>  # The segfault is triggered using the unique() function, but I doubt the 
> behaviour is specific to that function, from what I gather the problem lies 
> in Table creation.
> I'm on Windows 10, using Python 3.6 and pyarrow 0.13.0 (py36h8c67754_1) from 
> conda-forge.





[jira] [Commented] (ARROW-6038) [Python] pyarrow.Table.from_batches produces corrupted table if any of the batches were empty

2019-07-25 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892963#comment-16892963
 ] 

Wes McKinney commented on ARROW-6038:
-

Does the problem occur with 0.14.0? 

> [Python] pyarrow.Table.from_batches produces corrupted table if any of the 
> batches were empty
> -
>
> Key: ARROW-6038
> URL: https://issues.apache.org/jira/browse/ARROW-6038
> Project: Apache Arrow
>  Issue Type: Bug
>Affects Versions: 0.13.0
>Reporter: Piotr Bajger
>Priority: Minor
>  Labels: windows
> Attachments: segfault_ex.py
>
>
> When creating a Table from a list/iterator of batches which contains an 
> "empty" RecordBatch a Table is produced but attempts to run any pyarrow 
> built-in functions (such as unique()) occasionally result in a Segfault.
> The MWE is attached: [^segfault_ex.py]
>  # The segfaults happen randomly, around 30% of the time.
>  # Commenting out line 10 in the MWE results in no segfaults.
>  # The segfault is triggered using the unique() function, but I doubt the 
> behaviour is specific to that function, from what I gather the problem lies 
> in Table creation.
> I'm on Windows 10, using Python 3.6 and pyarrow 0.13.0 (py36h8c67754_1) from 
> conda-forge.


