Re: pyarrow: pa.compute.scalar vs pa.scalar

Aldrin Wed, 29 May 2024 16:57:11 -0700

Just did a bit more digging.

pyarrow.scalar is a function [1] that returns the Cython equivalent of the C++
arrow::Scalar [2].


From Felipe's reference [3], I would say you should not use
pyarrow.compute.scalar unless you've tried pyarrow.scalar and it is not being
converted to the expression you're trying to build.

My interpretation of the Cython code is that pyarrow.compute.scalar returns an
Expression instance while pyarrow.scalar returns a pyarrow.Scalar instance.
Most of the Cython code likely checks whether it needs to convert its input to
an Expression; I am not sure it ever does the opposite. So if some code path is
not converting a pyarrow.Scalar to an Expression, you can fall back on
constructing the Expression directly, but you should prefer pyarrow.scalar and
let the library do the conversions as necessary.

Additionally, if you are going to do any integration with C++ code, the 
wrap/unwrap functions will return/expect pyarrow.Scalar instances [4].

[1]: 
https://github.com/apache/arrow/blob/main/python/pyarrow/scalar.pxi#L1145-L1220
[2]: 
https://github.com/apache/arrow/blob/main/python/pyarrow/includes/libarrow.pxd#L1163-L1172
[3]: 
https://github.com/apache/arrow/blob/main/python/pyarrow/compute.py#L718-L732

[4]: 
https://arrow.apache.org/docs/python/integration/extending.html#_CPPv4N5arrow5arrow2py11wrap_scalarERKNSt10shared_ptrI6ScalarEE




# ------------------------------

# Aldrin


https://github.com/drin/

https://gitlab.com/octalene

https://keybase.io/octalene


On Wednesday, May 29th, 2024 at 14:51, Adrian Garcia Badaracco 
<[email protected]> wrote:

> Thank you. So it sounds like I should always use pyarrow.scalar. Do you know
> if libraries (like something using or creating a pyarrow dataset) are
> expected to handle both?
> 

> On Mon, May 27, 2024 at 6:26 PM Felipe Oliveira Carvalho 
> <[email protected]> wrote:
> 

> > I couldn't find the docs for compute.scalar, but by checking the
> > source code I can say this:
> > 

> > pyarrow.scalar [1] creates an instance of a pyarrow.*Scalar class from
> > a Python object.
> > pyarrow.compute.scalar [2] creates an Arrow compute Expression
> > wrapping a scalar object.
> > 

> > You rarely need pyarrow.compute.scalar because when you pass an Arrow
> > Scalar or a Python object where an Expression is expected, it gets
> > automatically wrapped by Expression._expr_or_scalar() [3].
> > 

> > [1] 
> > https://arrow.apache.org/docs/python/generated/pyarrow.scalar.html#pyarrow.scalar
> > [2] https://github.com/apache/arrow/blob/main/python/pyarrow/compute.py#L718
> > [3] 
> > https://github.com/apache/arrow/blob/main/python/pyarrow/_compute.pyx#L2494
> > 

> > --
> > Felipe
> > 

> > On Mon, May 27, 2024 at 11:43 AM Adrian Garcia Badaracco
> > <[email protected]> wrote:
> > >
> > > These seem to be two different things, but there’s nothing in the docs 
> > > explaining what the difference is. Some things like 
> > > pyarrow.dataset.dataset seem to work with either or even a mix (for 
> > > partitions / fragments).
> > >
> > > ```python
> > > from datetime import datetime, timezone
> > > import pyarrow as pa
> > > import pyarrow.compute as pc
> > >
> > > v = datetime(2000, 1, 1, tzinfo=timezone.utc)
> > > print(v) # 2000-01-01 00:00:00+00:00
> > >
> > > print(pa.scalar(v, pa.timestamp('ns', tz='UTC'))) # 2000-01-01 00:00:00+00:00
> > >
> > > print(pc.scalar(v)) # 2000-01-01 00:00:00.000000Z
> > > # according to the docs this should be a bool, int, float or str but at runtime a datetime is accepted
> > > # seems to assume UTC but can't set ns precision
> > > ```
> > >
> > > Could someone clarify what the differences are, and if they’re on purpose 
> > > or accidental, etc.?
