[jira] [Commented] (ARROW-6157) [Python][C++] UnionArray with invalid data passes validation / leads to segfaults

2019-09-18 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932715#comment-16932715
 ] 

Antoine Pitrou commented on ARROW-6157:
---

Hmm. Perhaps that validation can be moved to a separate method :-)
Then we'll have to make sure that all tests call the thorough validation 
method, rather than the light one.

> [Python][C++] UnionArray with invalid data passes validation / leads to 
> segfaults
> -
>
> Key: ARROW-6157
> URL: https://issues.apache.org/jira/browse/ARROW-6157
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Reporter: Joris Van den Bossche
> Priority: Major
> Fix For: 1.0.0
>
>
> From the Python side, you can create an "invalid" UnionArray:
> {code}
> import pyarrow as pa
> binary = pa.array([b'a', b'b', b'c', b'd'], type='binary')
> int64 = pa.array([1, 2, 3], type='int64')
> types = pa.array([0, 1, 0, 0, 2, 1, 0], type='int8')   # <- value of 2 is out of bounds for the number of children
> value_offsets = pa.array([0, 0, 2, 1, 1, 2, 3], type='int32')
> a = pa.UnionArray.from_dense(types, value_offsets, [binary, int64])
> {code}
> E.g., converting it to a Python list leads to a segfault:
> {code}
> In [7]: a.to_pylist()
> Segmentation fault (core dumped)
> {code}
> On the other hand, doing an explicit validation does not give an error:
> {code}
> In [8]: a.validate()
> {code}
> Should validation raise an error in this case? (The C++ 
> {{ValidateVisitor}} for UnionArray currently does nothing.) 
> It could then be called from the Python API to avoid creating invalid 
> arrays and the segfaults they cause.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6157) [Python][C++] UnionArray with invalid data passes validation / leads to segfaults

2019-09-18 Thread Joris Van den Bossche (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932703#comment-16932703
 ] 

Joris Van den Bossche commented on ARROW-6157:
--

The ListArray validation actually does something like the latter (it checks if 
all offsets are valid), so there is at least _some_ precedent.
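
As a rough illustration of what such an offsets check involves, here is a minimal pure-Python sketch. It is not Arrow's implementation: `check_list_offsets` and its parameters are hypothetical names, and plain Python lists stand in for the offsets buffer and the child values. It only assumes the usual list layout, where offsets must be non-negative, non-decreasing, and no larger than the length of the child array.

```python
def check_list_offsets(offsets, values_length):
    """Return None if the offsets look valid, else a short error message.

    Hypothetical helper: `offsets` is a plain list of ints and
    `values_length` is the number of elements in the child array.
    """
    if len(offsets) == 0:
        return "offsets must contain at least one entry"
    if offsets[0] < 0:
        return "first offset is negative"
    # O(N) pass: every offset must be >= the previous one.
    for i in range(1, len(offsets)):
        if offsets[i] < offsets[i - 1]:
            return f"offset at index {i} is smaller than the previous offset"
    # Because the offsets are non-decreasing, checking the last one
    # is enough to bound all of them.
    if offsets[-1] > values_length:
        return "last offset exceeds the length of the child array"
    return None


print(check_list_offsets([0, 2, 2, 5], 5))
print(check_list_offsets([0, 3, 2], 5))
```

The loop is what makes this check O(N) in the number of list elements, which is the cost being weighed in this thread.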



[jira] [Commented] (ARROW-6157) [Python][C++] UnionArray with invalid data passes validation / leads to segfaults

2019-09-18 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932693#comment-16932693
 ] 

Antoine Pitrou commented on ARROW-6157:
---

Yes, we may need a {{ValidateData}} method that's more thorough.



[jira] [Commented] (ARROW-6157) [Python][C++] UnionArray with invalid data passes validation / leads to segfaults

2019-09-18 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932632#comment-16932632
 ] 

Wes McKinney commented on ARROW-6157:
-

Not obvious to me either. It seems like there is a dual need:

* Checking very basic validity preconditions
* Actual data validation (bounds checking, checking monotonicity in the case of 
variable offsets, etc.)

AFAICT we haven't really implemented much in the way of the latter. I think 
it'd be useful to have this in C++, but separate from the current 
{{Array::Validate}}, as something that users can opt in to if they 
need to sanitize inputs.
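
To illustrate the two tiers, a minimal pure-Python sketch for a dense union follows. Everything in it is an assumption for illustration: the names `check_preconditions` and `check_data` are hypothetical, plain lists stand in for Arrow buffers, and each type id is taken to be a direct index into the list of children. It covers bounds checking only, not every rule a real implementation might enforce.

```python
def check_preconditions(type_ids, value_offsets):
    """Cheap check that does not look at individual values."""
    if len(type_ids) != len(value_offsets):
        raise ValueError("type_ids and value_offsets differ in length")


def check_data(type_ids, value_offsets, child_lengths):
    """Opt-in O(N) check that bounds-checks every slot."""
    check_preconditions(type_ids, value_offsets)
    for i, (type_id, offset) in enumerate(zip(type_ids, value_offsets)):
        # The type id must select an existing child.
        if not 0 <= type_id < len(child_lengths):
            raise ValueError(
                f"type id {type_id} at index {i} is out of bounds "
                f"for {len(child_lengths)} children"
            )
        # The offset must point inside the selected child.
        if not 0 <= offset < child_lengths[type_id]:
            raise ValueError(
                f"offset {offset} at index {i} is out of bounds "
                f"for child {type_id}"
            )


# Data from the issue description: two children of lengths 4 and 3.
types = [0, 1, 0, 0, 2, 1, 0]
offsets = [0, 0, 2, 1, 1, 2, 3]

check_preconditions(types, offsets)  # passes: the lengths match
try:
    check_data(types, offsets, [4, 3])
except ValueError as exc:
    print(exc)
```

With the issue's data, the cheap check passes and only the per-element pass notices the type id of 2, which is why the invalid array gets through a light validation.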



[jira] [Commented] (ARROW-6157) [Python][C++] UnionArray with invalid data passes validation / leads to segfaults

2019-09-18 Thread Antoine Pitrou (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932624#comment-16932624
 ] 

Antoine Pitrou commented on ARROW-6157:
---

It's not obvious whether we want validate() to do an O(N) validation of the data.

[~wesmckinn]
