Re: pyarrow: pa.compute.scalar vs pa.scalar

Joris Van den Bossche Wed, 05 Jun 2024 06:18:15 -0700

I opened https://github.com/apache/arrow/issues/41985 to capture that
we should update the `pc.scalar()` docstring.


On Thu, 30 May 2024 at 01:57, Aldrin <[email protected]> wrote:
>
> Just did a bit more digging.
>
> pyarrow.scalar is a function [1] returning a cython equivalent of 
> arrow::Scalar in C++ [2].
>
> From Felipe's reference [3], I would say you should not use
> pyarrow.compute.scalar unless you've tried to use a pyarrow.Scalar and it is
> not being converted to the expression you're trying to build.
>
> My interpretation of the Cython code is that pyarrow.compute.scalar returns
> an Expression instance while pyarrow.scalar returns a pyarrow.Scalar
> instance. Most of the Cython code likely checks whether it needs to convert
> to an Expression; I am not sure it does the opposite. So if the code is not
> converting a pyarrow.Scalar to an Expression, you can fall back on
> constructing the Expression directly, but you should prefer using
> pyarrow.Scalar and letting the library do the conversions as necessary.
>
> Additionally, if you are going to do any integration with C++ code, the 
> wrap/unwrap functions will return/expect pyarrow.Scalar instances [4].
>
> [1]: 
> https://github.com/apache/arrow/blob/main/python/pyarrow/scalar.pxi#L1145-L1220
> [2]: 
> https://github.com/apache/arrow/blob/main/python/pyarrow/includes/libarrow.pxd#L1163-L1172
> [3]: 
> https://github.com/apache/arrow/blob/main/python/pyarrow/compute.py#L718-L732
> [4]: 
> https://arrow.apache.org/docs/python/integration/extending.html#_CPPv4N5arrow5arrow2py11wrap_scalarERKNSt10shared_ptrI6ScalarEE
>
>
> # ------------------------------
> # Aldrin
>
> https://github.com/drin/
> https://gitlab.com/octalene
> https://keybase.io/octalene
>
> On Wednesday, May 29th, 2024 at 14:51, Adrian Garcia Badaracco 
> <[email protected]> wrote:
>
> Thank you. So it sounds like I should always use pyarrow.scalar. Do you know
> if libraries (like something using or creating a pyarrow dataset) are
> expected to handle both?
>
> On Mon, May 27, 2024 at 6:26 PM Felipe Oliveira Carvalho 
> <[email protected]> wrote:
>>
>> I couldn't find the docs for compute.scalar, but by checking the
>> source code I can say this:
>>
>> pyarrow.scalar [1] creates an instance of a pyarrow.*Scalar class from
>> a Python object.
>> pyarrow.compute.scalar [2] creates an Arrow compute Expression
>> wrapping a scalar object.
>>
>> You rarely need pyarrow.compute.scalar because when you pass an Arrow
>> Scalar or a Python object where an Expression is expected, it gets
>> automatically wrapped by Expression._expr_or_scalar() [3].
>>
>> [1] 
>> https://arrow.apache.org/docs/python/generated/pyarrow.scalar.html#pyarrow.scalar
>> [2] https://github.com/apache/arrow/blob/main/python/pyarrow/compute.py#L718
>> [3] 
>> https://github.com/apache/arrow/blob/main/python/pyarrow/_compute.pyx#L2494
>>
>> --
>> Felipe
>>
>> On Mon, May 27, 2024 at 11:43 AM Adrian Garcia Badaracco
>> <[email protected]> wrote:
>> >
>> > These seem to be two different things, but there’s nothing in the docs 
>> > explaining what the difference is. Some things like 
>> > pyarrow.dataset.dataset seem to work with either or even a mix (for 
>> > partitions / fragments).
>> >
>> > ```python
>> > from datetime import datetime, timezone
>> > import pyarrow as pa
>> > import pyarrow.compute as pc
>> >
>> > v = datetime(2000, 1, 1, tzinfo=timezone.utc)
>> > print(v) # 2000-01-01 00:00:00+00:00
>> >
>> > print(pa.scalar(v, pa.timestamp('ns', tz='UTC'))) # 2000-01-01 00:00:00+00:00
>> >
>> > print(pc.scalar(v)) # 2000-01-01 00:00:00.000000Z
>> > # according to the docs this should be a bool, int, float or str, but at
>> > # runtime a datetime is accepted
>> > # seems to assume UTC but can't set ns precision
>> > ```
>> >
>> > Could someone clarify what the differences are, and if they’re on purpose 
>> > or accidental, etc.?
>
>
