Re: [PR] Fix Python UDAF list-of-timestamps return by enforcing list-valued scalars and caching PyArrow types [datafusion-python]
timsaucer merged PR #1347: URL: https://github.com/apache/datafusion-python/pull/1347 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Fix Python UDAF list-of-timestamps return by enforcing list-valued scalars and caching PyArrow types [datafusion-python]
timsaucer commented on PR #1347: URL: https://github.com/apache/datafusion-python/pull/1347#issuecomment-3921628046 Thank you for all the work on this @kosiew ! I think the final solution is very nice. We couldn't have gotten here so quickly without your work. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Fix Python UDAF list-of-timestamps return by enforcing list-valued scalars and caching PyArrow types [datafusion-python]
timsaucer commented on PR #1347: URL: https://github.com/apache/datafusion-python/pull/1347#issuecomment-3897116364 @kosiew If you're okay merging my PR https://github.com/kosiew/datafusion-python/pull/7 into this branch, I think this PR is the last thing we have left before we start the release candidate for 52. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Fix Python UDAF list-of-timestamps return by enforcing list-valued scalars and caching PyArrow types [datafusion-python]
kosiew commented on PR #1347: URL: https://github.com/apache/datafusion-python/pull/1347#issuecomment-3853586582 @timsaucer, Thanks for your suggestions. I agree on both points and have refactored the implementation to align more closely with your approach, while also taking care to address the nanoarrow concern in a clear and safe way. ### Implementation Details 1. **Direct scalar handling** When the user returns a PyArrow scalar, we simply use it as-is via the existing `PyScalarValue` extraction—no extra work required. 2. **Fallback to `py_obj_to_scalar_value`** For anything that isn’t already a scalar (native Python values, arrays, etc.), we route through `py_obj_to_scalar_value`, which now cleanly handles the conversion. 3. **Extended `py_obj_to_scalar_value`** This function now: * Checks whether the object is already a `pyarrow.Scalar` using an explicit `isinstance()` check and extracts it directly. * Detects `pyarrow.Array` or `pyarrow.ChunkedArray` (also via `isinstance()`) and converts them into `ScalarValue::List` using the Arrow C data interface—**no `to_pylist()` calls involved**. * Falls back to the original behavior for native Python values (ints, floats, strings, etc.), converting them via `pyarrow.scalar()`. 4. **Applied consistently to `state()` and `evaluate()`** Both methods now share this unified conversion path, ensuring consistent and predictable behavior. ### Why `isinstance()` instead of `__arrow_c_array__`? I intentionally avoided checking the `__arrow_c_array__` protocol and opted for explicit `isinstance()` checks against `pyarrow.Scalar`, `pyarrow.Array`, and `pyarrow.ChunkedArray`. This keeps things clear and robust: * Scalar objects from libraries like nanoarrow that implement `__arrow_c_array__` are still correctly treated as scalars, rather than being misclassified as lists. * Only true PyArrow array types are converted into `ScalarValue::List`. * The type checks remain explicit, readable, and safe. ### Benefits * **No performance penalty**: Avoids `to_pylist()` entirely by relying on the Arrow C data interface. * **Flexible**: Users can return native Python values, PyArrow scalars, or PyArrow arrays—everything is handled gracefully. * **Consistent**: Both `state()` and `evaluate()` now follow the same conversion logic. * **Safe**: Clear type discrimination prevents nanoarrow scalars from being misclassified. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Fix Python UDAF list-of-timestamps return by enforcing list-valued scalars and caching PyArrow types [datafusion-python]
timsaucer commented on PR #1347: URL: https://github.com/apache/datafusion-python/pull/1347#issuecomment-3850280625 One problem I see with my answer above ^ is that some libraries like nanoarrow DO implement `__arrow_c_array__` on a scalar value, and we wouldn't want to accidentally turn that into a `ScalarValue::List`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] Fix Python UDAF list-of-timestamps return by enforcing list-valued scalars and caching PyArrow types [datafusion-python]
timsaucer commented on PR #1347:
URL:
https://github.com/apache/datafusion-python/pull/1347#issuecomment-3849797441
Sorry it's taken me a while to get around to this PR. It feels like we are
doing two different things
1. telling users that they need to return pyarrow scalars as the return of
evaluate
2. detect when it is a list and then we convert it to a python value and
back into a pyarrow scalar
It feels like this isn't the best option. I think we want to avoid doing any
kind of `to_pylist()` calls.
I think a more general solution would be something like
1. Determine if they have passed in a pyarrow scalar value. If so, use it.
2. If they have not passed in a pyarrow scalar value, use
`py_obj_to_scalar_value` to convert to a scalar value
3. Update `py_obj_to_scalar_value` to detect pyarrow arrays and convert them
to `ScalarValue::List`
For the last part we could do something like
```
if let Ok(_) = obj.getattr(py, "__arrow_c_array__") {
let obj = obj.bind(py);
let array_data = ArrayData::from_pyarrow_bound(obj)?;
let array = Arc::new(GenericListArraytry_from(array_data)?);
return Ok(ScalarValue::List(array))
}
```
Additionally, if we're going to go down this route I think we would want to
treat both the `state()` and `evaluate()` since both of them should be
returning scalars.
An advantage of the point described above is that I think it adds more
flexibility to the users because their python functions can just return python
integers and such without having to convert them to pyarrow scalars.
What do you think?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
