[jira] [Commented] (ARROW-6038) [Python] pyarrow.Table.from_batches produces corrupted table if any of the batches were empty
[ https://issues.apache.org/jira/browse/ARROW-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908642#comment-16908642 ]

Wes McKinney commented on ARROW-6038:
-------------------------------------

I confirmed that the MWE is behaving properly now

{code}
$ python ~/Downloads/segfault_ex.py
Creating table
Traceback (most recent call last):
  File "/home/wesm/Downloads/segfault_ex.py", line 11, in <module>
    pa.RecordBatch.from_arrays([pa.array(["C", "C", "C"])], schema),
  File "pyarrow/table.pxi", line 1117, in pyarrow.lib.Table.from_batches
    return pyarrow_wrap_table(c_table)
  File "pyarrow/public-api.pxi", line 316, in pyarrow.lib.pyarrow_wrap_table
    check_status(ctable.get().Validate())
  File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
    raise ArrowInvalid(message)
pyarrow.lib.ArrowInvalid: Column 0: In chunk 1 expected type string but saw null
{code}

This is still weird and dangerous though:

{code}
In [4]: pa.RecordBatch.from_arrays([pa.array([])], schema)
Out[4]:

In [5]: rb = pa.RecordBatch.from_arrays([pa.array([])], schema)

In [6]: rb
Out[6]:

In [7]: rb.schema
Out[7]: col: string

In [8]: rb[0]
Out[8]: 0 nulls
{code}

I opened ARROW-6263

> [Python] pyarrow.Table.from_batches produces corrupted table if any of the batches were empty
> ---------------------------------------------------------------------------------------------
>
> Key: ARROW-6038
> URL: https://issues.apache.org/jira/browse/ARROW-6038
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 0.13.0, 0.14.0, 0.14.1
> Reporter: Piotr Bajger
> Assignee: Antoine Pitrou
> Priority: Minor
> Labels: pull-request-available, windows
> Fix For: 0.15.0
>
> Attachments: segfault_ex.py
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> When creating a Table from a list/iterator of batches which contains an "empty" RecordBatch a Table is produced but attempts to run any pyarrow built-in functions (such as unique()) occasionally result in a Segfault.
> The MWE is attached: [^segfault_ex.py]
> # The segfaults happen randomly, around 30% of the time.
> # Commenting out line 10 in the MWE results in no segfaults.
> # The segfault is triggered using the unique() function, but I doubt the behaviour is specific to that function, from what I gather the problem lies in Table creation.
>
> I'm on Windows 10, using Python 3.6 and pyarrow 0.14.0 installed through pip (problem also occurs with 0.13.0 from conda-forge).

--
This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (ARROW-6038) [Python] pyarrow.Table.from_batches produces corrupted table if any of the batches were empty
[ https://issues.apache.org/jira/browse/ARROW-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897136#comment-16897136 ]

Antoine Pitrou commented on ARROW-6038:
---------------------------------------

Note that it would be possible to have faster type equality comparisons, if we want to invest a bit of time.

> [Python] pyarrow.Table.from_batches produces corrupted table if any of the batches were empty
> ---------------------------------------------------------------------------------------------
>
> Key: ARROW-6038
> URL: https://issues.apache.org/jira/browse/ARROW-6038
> Project: Apache Arrow
> Issue Type: Bug
> Affects Versions: 0.13.0, 0.14.0, 0.14.1
> Reporter: Piotr Bajger
> Priority: Minor
> Labels: windows
> Attachments: segfault_ex.py
>
> When creating a Table from a list/iterator of batches which contains an "empty" RecordBatch a Table is produced but attempts to run any pyarrow built-in functions (such as unique()) occasionally result in a Segfault.
> The MWE is attached: [^segfault_ex.py]
> # The segfaults happen randomly, around 30% of the time.
> # Commenting out line 10 in the MWE results in no segfaults.
> # The segfault is triggered using the unique() function, but I doubt the behaviour is specific to that function, from what I gather the problem lies in Table creation.
>
> I'm on Windows 10, using Python 3.6 and pyarrow 0.14.0 installed through pip (problem also occurs with 0.13.0 from conda-forge).
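The comment above does not say how type equality could be made faster. One possible approach, sketched here purely as an assumption and not as a description of what Arrow does, is to precompute a string fingerprint for each type when it is constructed, so that comparing two nested types becomes a single string comparison instead of a recursive walk. `ToyType`, `equals_slow` and `equals_fast` are hypothetical names invented for this illustration; they are not pyarrow or Arrow C++ APIs.

```python
class ToyType:
    """Hypothetical nested data type used only for this sketch (not an Arrow class)."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = tuple(children)
        # Computed once at construction; each child already carries its own fingerprint.
        inner = ",".join(child.fingerprint for child in self.children)
        self.fingerprint = f"{name}<{inner}>" if inner else name

    def equals_slow(self, other):
        """Recursive structural comparison: visits every nested child type."""
        return (
            self.name == other.name
            and len(self.children) == len(other.children)
            and all(a.equals_slow(b) for a, b in zip(self.children, other.children))
        )

    def equals_fast(self, other):
        """Compare the precomputed fingerprints instead of walking the children."""
        return self.fingerprint == other.fingerprint


# Two structurally identical nested types built from separate objects,
# and a third that differs only in a deeply nested child.
struct_a = ToyType("struct", [ToyType("list", [ToyType("string")]), ToyType("int64")])
struct_b = ToyType("struct", [ToyType("list", [ToyType("string")]), ToyType("int64")])
struct_c = ToyType("struct", [ToyType("list", [ToyType("null")]), ToyType("int64")])

print(struct_a.fingerprint)
print(struct_a.equals_fast(struct_b), struct_a.equals_fast(struct_c))
```

The trade-off in this sketch is a small amount of extra memory per type object and the requirement that types be immutable after construction, in exchange for a comparison that no longer recurses through nested children.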
[jira] [Commented] (ARROW-6038) [Python] pyarrow.Table.from_batches produces corrupted table if any of the batches were empty
[ https://issues.apache.org/jira/browse/ARROW-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897133#comment-16897133 ]

Antoine Pitrou commented on ARROW-6038:
---------------------------------------

Ok, the issue here is that you are creating a Table column with different types. The second array is inferred to be an array of type "null".

Arrow should prevent you from doing that instead of crashing. However, comparing types can be a bit expensive (if e.g. they are nested types). [~wesmckinn] what do you think?

> [Python] pyarrow.Table.from_batches produces corrupted table if any of the batches were empty
> ---------------------------------------------------------------------------------------------
>
> Key: ARROW-6038
> URL: https://issues.apache.org/jira/browse/ARROW-6038
> Project: Apache Arrow
> Issue Type: Bug
> Affects Versions: 0.13.0, 0.14.0, 0.14.1
> Reporter: Piotr Bajger
> Priority: Minor
> Labels: windows
> Attachments: segfault_ex.py
>
> When creating a Table from a list/iterator of batches which contains an "empty" RecordBatch a Table is produced but attempts to run any pyarrow built-in functions (such as unique()) occasionally result in a Segfault.
> The MWE is attached: [^segfault_ex.py]
> # The segfaults happen randomly, around 30% of the time.
> # Commenting out line 10 in the MWE results in no segfaults.
> # The segfault is triggered using the unique() function, but I doubt the behaviour is specific to that function, from what I gather the problem lies in Table creation.
>
> I'm on Windows 10, using Python 3.6 and pyarrow 0.14.0 installed through pip (problem also occurs with 0.13.0 from conda-forge).
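The diagnosis above (an empty array is inferred as type "null" and then ends up as a chunk of a string column) can be modeled in a few lines of plain Python. This is only a toy sketch of the kind of check being proposed, not Arrow's actual validation code: `infer_type` and `validate_column` are hypothetical helpers, types are represented as plain strings, and the sample chunks are made up because the attached MWE is not reproduced in this thread.

```python
from typing import List, Sequence


def infer_type(values: Sequence[object]) -> str:
    """Toy type inference: an empty sequence carries no information, so it is 'null'."""
    if len(values) == 0:
        return "null"
    if all(isinstance(v, str) for v in values):
        return "string"
    raise TypeError("this sketch only models the 'string' and 'null' types")


def validate_column(expected: str, chunks: List[Sequence[object]], column: int = 0) -> None:
    """Raise ValueError if any chunk's inferred type differs from the declared column type."""
    for i, chunk in enumerate(chunks):
        seen = infer_type(chunk)
        if seen != expected:
            raise ValueError(
                f"Column {column}: In chunk {i} expected type {expected} but saw {seen}"
            )


# Hypothetical data: the second chunk is empty, so its inferred type is 'null'.
chunks = [["A", "B"], [], ["C", "C", "C"]]
try:
    validate_column("string", chunks)
    message = None
except ValueError as exc:
    message = str(exc)
print(message)
```

Rejecting the mismatched chunk up front turns a sporadic crash into a deterministic error. In pyarrow itself, the usual way to avoid the mismatch is to give the empty array an explicit type, for example `pa.array([], type=pa.string())`, so that inference does not fall back to null.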
[jira] [Commented] (ARROW-6038) [Python] pyarrow.Table.from_batches produces corrupted table if any of the batches were empty
[ https://issues.apache.org/jira/browse/ARROW-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16893383#comment-16893383 ]

Piotr Bajger commented on ARROW-6038:
-------------------------------------

Yes, it does, I updated the version labels.

> [Python] pyarrow.Table.from_batches produces corrupted table if any of the batches were empty
> ---------------------------------------------------------------------------------------------
>
> Key: ARROW-6038
> URL: https://issues.apache.org/jira/browse/ARROW-6038
> Project: Apache Arrow
> Issue Type: Bug
> Affects Versions: 0.13.0, 0.14.0
> Reporter: Piotr Bajger
> Priority: Minor
> Labels: windows
> Attachments: segfault_ex.py
>
> When creating a Table from a list/iterator of batches which contains an "empty" RecordBatch a Table is produced but attempts to run any pyarrow built-in functions (such as unique()) occasionally result in a Segfault.
> The MWE is attached: [^segfault_ex.py]
> # The segfaults happen randomly, around 30% of the time.
> # Commenting out line 10 in the MWE results in no segfaults.
> # The segfault is triggered using the unique() function, but I doubt the behaviour is specific to that function, from what I gather the problem lies in Table creation.
>
> I'm on Windows 10, using Python 3.6 and pyarrow 0.13.0 (py36h8c67754_1) from conda-forge.
[jira] [Commented] (ARROW-6038) [Python] pyarrow.Table.from_batches produces corrupted table if any of the batches were empty
[ https://issues.apache.org/jira/browse/ARROW-6038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892963#comment-16892963 ]

Wes McKinney commented on ARROW-6038:
-------------------------------------

Does the problem occur with 0.14.0?

> [Python] pyarrow.Table.from_batches produces corrupted table if any of the batches were empty
> ---------------------------------------------------------------------------------------------
>
> Key: ARROW-6038
> URL: https://issues.apache.org/jira/browse/ARROW-6038
> Project: Apache Arrow
> Issue Type: Bug
> Affects Versions: 0.13.0
> Reporter: Piotr Bajger
> Priority: Minor
> Labels: windows
> Attachments: segfault_ex.py
>
> When creating a Table from a list/iterator of batches which contains an "empty" RecordBatch a Table is produced but attempts to run any pyarrow built-in functions (such as unique()) occasionally result in a Segfault.
> The MWE is attached: [^segfault_ex.py]
> # The segfaults happen randomly, around 30% of the time.
> # Commenting out line 10 in the MWE results in no segfaults.
> # The segfault is triggered using the unique() function, but I doubt the behaviour is specific to that function, from what I gather the problem lies in Table creation.
>
> I'm on Windows 10, using Python 3.6 and pyarrow 0.13.0 (py36h8c67754_1) from conda-forge.