Re: [PR] Fix Python UDAF list-of-timestamps return by enforcing list-valued scalars and caching PyArrow types [datafusion-python]

2026-02-18 Thread via GitHub


timsaucer merged PR #1347:
URL: https://github.com/apache/datafusion-python/pull/1347


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Fix Python UDAF list-of-timestamps return by enforcing list-valued scalars and caching PyArrow types [datafusion-python]

2026-02-18 Thread via GitHub


timsaucer commented on PR #1347:
URL: 
https://github.com/apache/datafusion-python/pull/1347#issuecomment-3921628046

   Thank you for all the work on this @kosiew ! I think the final solution is 
very nice. We couldn't have gotten here so quickly without your work.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Fix Python UDAF list-of-timestamps return by enforcing list-valued scalars and caching PyArrow types [datafusion-python]

2026-02-13 Thread via GitHub


timsaucer commented on PR #1347:
URL: 
https://github.com/apache/datafusion-python/pull/1347#issuecomment-3897116364

   @kosiew If you're okay merging my PR 
https://github.com/kosiew/datafusion-python/pull/7 into this branch, I think 
this PR is the last thing we have left before we start the release candidate 
for 52.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Fix Python UDAF list-of-timestamps return by enforcing list-valued scalars and caching PyArrow types [datafusion-python]

2026-02-05 Thread via GitHub


kosiew commented on PR #1347:
URL: 
https://github.com/apache/datafusion-python/pull/1347#issuecomment-3853586582

   @timsaucer,
   
   Thanks for your suggestions. 
   
   I agree on both points  and have refactored the implementation to align more 
closely with your approach, while also taking care to address the nanoarrow 
concern in a clear and safe way.
   
   ### Implementation Details
   
   1. **Direct scalar handling**
  When the user returns a PyArrow scalar, we simply use it as-is via the 
existing `PyScalarValue` extraction—no extra work required.
   
   2. **Fallback to `py_obj_to_scalar_value`**
  For anything that isn’t already a scalar (native Python values, arrays, 
etc.), we route through `py_obj_to_scalar_value`, which now cleanly handles the 
conversion.
   
   3. **Extended `py_obj_to_scalar_value`**
  This function now:
   
  * Checks whether the object is already a `pyarrow.Scalar` using an 
explicit `isinstance()` check and extracts it directly.
  * Detects `pyarrow.Array` or `pyarrow.ChunkedArray` (also via 
`isinstance()`) and converts them into `ScalarValue::List` using the Arrow C 
data interface—**no `to_pylist()` calls involved**.
  * Falls back to the original behavior for native Python values (ints, 
floats, strings, etc.), converting them via `pyarrow.scalar()`.
   
   4. **Applied consistently to `state()` and `evaluate()`**
  Both methods now share this unified conversion path, ensuring consistent 
and predictable behavior.
   
   ### Why `isinstance()` instead of `__arrow_c_array__`?
   
   I intentionally avoided checking the `__arrow_c_array__` protocol and opted 
for explicit `isinstance()` checks against `pyarrow.Scalar`, `pyarrow.Array`, 
and `pyarrow.ChunkedArray`. This keeps things clear and robust:
   
   * Scalar objects from libraries like nanoarrow that implement 
`__arrow_c_array__` are still correctly treated as scalars, rather than being 
misclassified as lists.
   * Only true PyArrow array types are converted into `ScalarValue::List`.
   * The type checks remain explicit, readable, and safe.
   
   ### Benefits 
   
   * **No performance penalty**: Avoids `to_pylist()` entirely by relying on 
the Arrow C data interface.
   * **Flexible**: Users can return native Python values, PyArrow scalars, or 
PyArrow arrays—everything is handled gracefully.
   * **Consistent**: Both `state()` and `evaluate()` now follow the same 
conversion logic.
   * **Safe**: Clear type discrimination prevents nanoarrow scalars from being 
misclassified.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Fix Python UDAF list-of-timestamps return by enforcing list-valued scalars and caching PyArrow types [datafusion-python]

2026-02-04 Thread via GitHub


timsaucer commented on PR #1347:
URL: 
https://github.com/apache/datafusion-python/pull/1347#issuecomment-3850280625

   One problem I see with my answer above ^ is that some libraries like 
nanoarrow DO implement `__arrow_c_array__` on a scalar value, and we wouldn't 
want to accidentally turn that into a `ScalarValue::List`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] Fix Python UDAF list-of-timestamps return by enforcing list-valued scalars and caching PyArrow types [datafusion-python]

2026-02-04 Thread via GitHub


timsaucer commented on PR #1347:
URL: 
https://github.com/apache/datafusion-python/pull/1347#issuecomment-3849797441

   Sorry it's taken me a while to get around to this PR. It feels like we are 
doing two different things
   
   1. telling users that they need to return pyarrow scalars as the return of 
evaluate
   2. detect when it is a list and then we convert it to a python value and 
back into a pyarrow scalar
   
   It feels like this isn't the best option. I think we want to avoid doing any 
kind of `to_pylist()` calls.
   
   I think a more general solution would be something like
   
   1. Determine if they have passed in a pyarrow scalar value. If so, use it.
   2. If they have not passed in a pyarrow scalar value, use 
`py_obj_to_scalar_value` to convert to a scalar value
   3. Update `py_obj_to_scalar_value` to detect pyarrow arrays and convert them 
to `ScalarValue::List`
   
   For the last part we could do something like
   
   ```
   if let Ok(_) = obj.getattr(py, "__arrow_c_array__") {
   let obj = obj.bind(py);
   let array_data = ArrayData::from_pyarrow_bound(obj)?;
   let array = Arc::new(GenericListArraytry_from(array_data)?);
   return Ok(ScalarValue::List(array))
   }
   ```
   
   Additionally, if we're going to go down this route I think we would want to 
treat both the `state()` and `evaluate()` since both of them should be 
returning scalars.
   
   An advantage of the point described above is that I think it adds more 
flexibility to the users because their python functions can just return python 
integers and such without having to convert them to pyarrow scalars.
   
   What do you think?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]