Thank you. So it sounds like I should always use pyarrow.scalar. Do you know if libraries (like something using or creating a pyarrow dataset) are expected to handle both?
On Mon, May 27, 2024 at 6:26 PM Felipe Oliveira Carvalho <[email protected]> wrote:

> I couldn't find the docs for compute.scalar, but by checking the
> source code I can say this:
>
> pyarrow.scalar [1] creates an instance of a pyarrow.*Scalar class from
> a Python object.
> pyarrow.compute.scalar [2] creates an Arrow compute Expression
> wrapping a scalar object.
>
> You rarely need pyarrow.compute.scalar because when you pass an Arrow
> Scalar or a Python object where an Expression is expected, it gets
> automatically wrapped by Expression._expr_or_scalar() [3].
>
> [1] https://arrow.apache.org/docs/python/generated/pyarrow.scalar.html#pyarrow.scalar
> [2] https://github.com/apache/arrow/blob/main/python/pyarrow/compute.py#L718
> [3] https://github.com/apache/arrow/blob/main/python/pyarrow/_compute.pyx#L2494
>
> --
> Felipe
>
> On Mon, May 27, 2024 at 11:43 AM Adrian Garcia Badaracco
> <[email protected]> wrote:
> >
> > These seem to be two different things, but there's nothing in the docs
> > explaining what the difference is. Some things like pyarrow.dataset.dataset
> > seem to work with either or even a mix (for partitions / fragments).
> >
> > ```python
> > from datetime import datetime, timezone
> > import pyarrow as pa
> > import pyarrow.compute as pc
> >
> > v = datetime(2000, 1, 1, tzinfo=timezone.utc)
> > print(v)  # 2000-01-01 00:00:00+00:00
> >
> > print(pa.scalar(v, pa.timestamp('ns', tz='UTC')))  # 2000-01-01 00:00:00+00:00
> >
> > print(pc.scalar(v))  # 2000-01-01 00:00:00.000000Z
> > # according to the docs this should be a bool, int, float or str, but at
> > # runtime a datetime is accepted
> > # seems to assume UTC but can't set ns precision
> > ```
> >
> > Could someone clarify what the differences are, and if they're on
> > purpose or accidental, etc.?
