I couldn't find the docs for pyarrow.compute.scalar, but from reading the source code I can say this:
pyarrow.scalar [1] creates an instance of a pyarrow.*Scalar class from a
Python object. pyarrow.compute.scalar [2] creates an Arrow compute
Expression wrapping a scalar object.

You rarely need pyarrow.compute.scalar because when you pass an Arrow
Scalar or a Python object where an Expression is expected, it gets
automatically wrapped by Expression._expr_or_scalar() [3].

[1] https://arrow.apache.org/docs/python/generated/pyarrow.scalar.html#pyarrow.scalar
[2] https://github.com/apache/arrow/blob/main/python/pyarrow/compute.py#L718
[3] https://github.com/apache/arrow/blob/main/python/pyarrow/_compute.pyx#L2494

--
Felipe

On Mon, May 27, 2024 at 11:43 AM Adrian Garcia Badaracco <[email protected]> wrote:
>
> These seem to be two different things, but there’s nothing in the docs
> explaining what the difference is. Some things like pyarrow.dataset.dataset
> seem to work with either or even a mix (for partitions / fragments).
>
> ```python
> from datetime import datetime, timezone
> import pyarrow as pa
> import pyarrow.compute as pc
>
> v = datetime(2000, 1, 1, tzinfo=timezone.utc)
> print(v)  # 2000-01-01 00:00:00+00:00
>
> print(pa.scalar(v, pa.timestamp('ns', tz='UTC')))  # 2000-01-01 00:00:00+00:00
>
> print(pc.scalar(v))  # 2000-01-01 00:00:00.000000Z
> # according to the docs this should be a bool, int, float or str, but at
> # runtime a datetime is accepted
> # seems to assume UTC but can't set ns precision
> ```
>
> Could someone clarify what the differences are, and if they’re on purpose or
> accidental, etc.?
