I opened https://github.com/apache/arrow/issues/41985 to capture that we should update the `pc.scalar()` docstring.
On Thu, 30 May 2024 at 01:57, Aldrin <[email protected]> wrote:
>
> Just did a bit more digging.
>
> pyarrow.scalar is a function [1] returning a cython equivalent of
> arrow::Scalar in C++ [2].
>
> From Felipe's reference [3], I would say you should not use
> pyarrow.compute.scalar unless you've tried to use pyarrow.Scalar and it's not
> converting to expressions you're trying to build.
>
> My interpretation of the cython code is that pyarrow.compute.scalar returns
> an Expression instance while pyarrow.scalar returns a pyarrow.Scalar
> instance. Most of the cython code likely checks if it needs to convert to an
> Expression; I am not sure it does the opposite. So if the code is not
> converting a pyarrow.Scalar to an Expression, you can fall back on
> constructing the Expression directly, but you should prefer using
> pyarrow.Scalar and letting the library do the conversions as necessary.
>
> Additionally, if you are going to do any integration with C++ code, the
> wrap/unwrap functions will return/expect pyarrow.Scalar instances [4].
>
> [1]:
> https://github.com/apache/arrow/blob/main/python/pyarrow/scalar.pxi#L1145-L1220
> [2]:
> https://github.com/apache/arrow/blob/main/python/pyarrow/includes/libarrow.pxd#L1163-L1172
> [3]:
> https://github.com/apache/arrow/blob/main/python/pyarrow/compute.py#L718-L732
> [4]:
> https://arrow.apache.org/docs/python/integration/extending.html#_CPPv4N5arrow5arrow2py11wrap_scalarERKNSt10shared_ptrI6ScalarEE
>
>
> # ------------------------------
> # Aldrin
>
> https://github.com/drin/
> https://gitlab.com/octalene
> https://keybase.io/octalene
>
> On Wednesday, May 29th, 2024 at 14:51, Adrian Garcia Badaracco
> <[email protected]> wrote:
>
> Thank you. So it sounds like always use pyarrow.scalar. Do you know if
> libraries (like something using or creating a pyarrow dataset) are expected
> to handle both?
>
> On Mon, May 27, 2024 at 6:26 PM Felipe Oliveira Carvalho
> <[email protected]> wrote:
>>
>> I couldn't find the docs for compute.scalar, but by checking the
>> source code I can say this:
>>
>> pyarrow.scalar [1] creates an instance of a pyarrow.*Scalar class from
>> a Python object.
>> pyarrow.compute.scalar [2] creates an Arrow compute Expression
>> wrapping a scalar object.
>>
>> You rarely need pyarrow.compute.scalar because when you pass an Arrow
>> Scalar or a Python object where an Expression is expected, it gets
>> automatically wrapped by Expression._expr_or_scalar() [3].
>>
>> [1]
>> https://arrow.apache.org/docs/python/generated/pyarrow.scalar.html#pyarrow.scalar
>> [2] https://github.com/apache/arrow/blob/main/python/pyarrow/compute.py#L718
>> [3]
>> https://github.com/apache/arrow/blob/main/python/pyarrow/_compute.pyx#L2494
>>
>> --
>> Felipe
>>
>> On Mon, May 27, 2024 at 11:43 AM Adrian Garcia Badaracco
>> <[email protected]> wrote:
>> >
>> > These seem to be two different things, but there’s nothing in the docs
>> > explaining what the difference is. Some things like
>> > pyarrow.dataset.dataset seem to work with either or even a mix (for
>> > partitions / fragments).
>> >
>> > ```python
>> > from datetime import datetime, timezone
>> > import pyarrow as pa
>> > import pyarrow.compute as pc
>> >
>> > v = datetime(2000, 1, 1, tzinfo=timezone.utc)
>> > print(v) # 2000-01-01 00:00:00+00:00
>> >
>> > print(pa.scalar(v, pa.timestamp('ns', tz='UTC'))) # 2000-01-01 00:00:00+00:00
>> >
>> > print(pc.scalar(v)) # 2000-01-01 00:00:00.000000Z
>> > # according to the docs this should be a bool, int, float or str but at
>> > # runtime a datetime is accepted
>> > # seems to assume UTC but can't set ns precision
>> > ```
>> >
>> > Could someone clarify what the differences are, and if they’re on purpose
>> > or accidental, etc.?
