[jira] [Commented] (ARROW-6157) [Python][C++] UnionArray with invalid data passes validation / leads to segfaults
    [ https://issues.apache.org/jira/browse/ARROW-6157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932715#comment-16932715 ]

Antoine Pitrou commented on ARROW-6157:
---------------------------------------

Hmm. Perhaps that validation can be moved to a separate method :-) Then we'll have to make sure that all tests call the thorough validation method, rather than the light one.

> [Python][C++] UnionArray with invalid data passes validation / leads to segfaults
> ---------------------------------------------------------------------------------
>
>                 Key: ARROW-6157
>                 URL: https://issues.apache.org/jira/browse/ARROW-6157
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>            Reporter: Joris Van den Bossche
>            Priority: Major
>             Fix For: 1.0.0
>
>
> From the Python side, you can create an "invalid" UnionArray:
> {code}
> import pyarrow as pa
>
> binary = pa.array([b'a', b'b', b'c', b'd'], type='binary')
> int64 = pa.array([1, 2, 3], type='int64')
> # The type id 2 below is out of bounds: there are only two children (0 and 1)
> types = pa.array([0, 1, 0, 0, 2, 1, 0], type='int8')
> value_offsets = pa.array([0, 0, 2, 1, 1, 2, 3], type='int32')
> a = pa.UnionArray.from_dense(types, value_offsets, [binary, int64])
> {code}
> E.g. on conversion to Python this leads to a segfault:
> {code}
> In [7]: a.to_pylist()
> Segmentation fault (core dumped)
> {code}
> On the other hand, doing an explicit validation does not give an error:
> {code}
> In [8]: a.validate()
> {code}
> Should the validation raise errors for this case? (The C++ {{ValidateVisitor}} for UnionArray does nothing.)
> (This would allow it to be called from the Python API to avoid creating invalid arrays / segfaults there.)

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (ARROW-6157) [Python][C++] UnionArray with invalid data passes validation / leads to segfaults
    [ https://issues.apache.org/jira/browse/ARROW-6157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932703#comment-16932703 ]

Joris Van den Bossche commented on ARROW-6157:
----------------------------------------------

The ListArray validation actually does something like the latter (it checks if all offsets are valid), so there is at least _some_ precedent.
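The comment above mentions that ListArray validation checks whether all offsets are valid, but the thread does not spell out which checks that covers. As a rough illustration only, here is a plain-Python sketch of what an O(N) offsets check could look like. The function name `check_list_offsets` is hypothetical, plain lists stand in for Arrow buffers, and the three conditions (non-negative start, non-decreasing offsets, last offset within the values length) are assumptions rather than a description of Arrow's actual C++ code.

```python
# Illustrative sketch only; check_list_offsets is a hypothetical helper,
# not a pyarrow or Arrow C++ function.

def check_list_offsets(offsets, values_length):
    """Raise ValueError unless `offsets` describes valid slices of the values."""
    if not offsets:
        # Assumption for this sketch: an offsets sequence has at least one entry.
        raise ValueError("offsets must contain at least one entry")
    if offsets[0] < 0:
        raise ValueError("first offset is negative")
    # Walk every offset once: this is the O(N) part of the check.
    for i in range(1, len(offsets)):
        if offsets[i] < offsets[i - 1]:
            raise ValueError(
                f"offset {i} decreases: {offsets[i - 1]} -> {offsets[i]}"
            )
    if offsets[-1] > values_length:
        raise ValueError("last offset points past the end of the values")


# Three lists ([v0, v1], [], [v2, v3, v4]) over five values: accepted.
check_list_offsets([0, 2, 2, 5], 5)

# Decreasing offsets: rejected.
try:
    check_list_offsets([0, 3, 1], 5)
    decreasing_rejected = False
except ValueError:
    decreasing_rejected = True
```

A check like this touches every offset, which is the cost the rest of the thread is weighing against a cheaper validation.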
[jira] [Commented] (ARROW-6157) [Python][C++] UnionArray with invalid data passes validation / leads to segfaults
    [ https://issues.apache.org/jira/browse/ARROW-6157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932693#comment-16932693 ]

Antoine Pitrou commented on ARROW-6157:
---------------------------------------

Yes, we may need a {{ValidateData}} method that's more thorough.
[jira] [Commented] (ARROW-6157) [Python][C++] UnionArray with invalid data passes validation / leads to segfaults
    [ https://issues.apache.org/jira/browse/ARROW-6157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932632#comment-16932632 ]

Wes McKinney commented on ARROW-6157:
-------------------------------------

Not obvious to me either. It seems like there is a dual need:

* Checking very basic validity preconditions
* Actual data validation (bounds checking, checking monotonicity in the case of variable offsets, etc.)

AFAICT we haven't really implemented much in the way of the latter. I think it'd be useful to have this in C++, but separate from the current {{Array::Validate}}, I guess, and something that users can opt in to if they need to sanitize inputs.
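The two levels of checking described above can be sketched in plain Python using the values from the issue's reproducer. This is only an illustration of the split, not Arrow's implementation: `validate_light` and `validate_full` are hypothetical names, plain lists stand in for Arrow arrays, and the sketch assumes that a type id indexes directly into the list of children.

```python
# Illustrative sketch only; validate_light and validate_full are hypothetical
# names, not pyarrow or Arrow C++ APIs. Plain lists stand in for Arrow arrays.

def validate_light(type_ids, value_offsets, children):
    """Cheap precondition check: both per-slot sequences have equal length."""
    if len(type_ids) != len(value_offsets):
        raise ValueError("type_ids and value_offsets differ in length")


def validate_full(type_ids, value_offsets, children):
    """O(N) data check: every slot must point at an existing child element."""
    validate_light(type_ids, value_offsets, children)
    for i, (type_id, offset) in enumerate(zip(type_ids, value_offsets)):
        # Assumption: the type id is used directly as the child index.
        if not 0 <= type_id < len(children):
            raise ValueError(
                f"slot {i}: type id {type_id} is out of bounds "
                f"for {len(children)} children"
            )
        if not 0 <= offset < len(children[type_id]):
            raise ValueError(
                f"slot {i}: offset {offset} is out of bounds for child {type_id}"
            )


# Same values as the reproducer in the issue description.
binary = [b'a', b'b', b'c', b'd']
int64 = [1, 2, 3]
types = [0, 1, 0, 0, 2, 1, 0]
value_offsets = [0, 0, 2, 1, 1, 2, 3]

# The light check accepts the data, mirroring the behaviour reported in the issue.
validate_light(types, value_offsets, [binary, int64])

# The full check rejects it at the slot holding type id 2.
error_message = None
try:
    validate_full(types, value_offsets, [binary, int64])
except ValueError as exc:
    error_message = str(exc)
```

Keeping the thorough check in its own function leaves the cheap one usable on hot paths, while callers that handle untrusted input can opt in to the full scan.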
[jira] [Commented] (ARROW-6157) [Python][C++] UnionArray with invalid data passes validation / leads to segfaults
    [ https://issues.apache.org/jira/browse/ARROW-6157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932624#comment-16932624 ]

Antoine Pitrou commented on ARROW-6157:
---------------------------------------

It's not obvious whether we want validate() to do an O(N) validation of the data. [~wesmckinn]