Hi Jacek,

The reason for the big difference between using a numpy scalar vs a Python scalar with `pc.equal` is that pyarrow does not do any smart casting of Python scalars based on the other operand's type. The Python scalar gets converted to a pyarrow scalar and defaults to int64. As a result, the comparison is between uint8 and int64, and the uint8 array first gets cast to the common type, i.e. int64. When you pass a numpy scalar instead, pyarrow preserves its dtype and converts it to a uint8 pyarrow scalar; the comparison is then uint8 vs uint8 and no casting is needed.
You can see this difference as well when explicitly creating a pyarrow scalar:

import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

data_np = np.random.randint(0, 100, 10_000_000, dtype="uint8")
data_pa = pa.array(data_np)

In [12]: %timeit pc.equal(data_pa, pa.scalar(115, pa.uint8()))
3.38 ms ± 29.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [13]: %timeit pc.equal(data_pa, pa.scalar(115, pa.int64()))
35.6 ms ± 700 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

(my numbers are generally a bit bigger, but the relative difference of around 10x without vs with casting is similar to your timings)

Best,
Joris

On Wed, 8 Nov 2023 at 22:04, Jacek Pliszka <[email protected]> wrote:
>
> Hi!
>
> I got surprising results when comparing numpy and pyarrow performance.
>
> val = np.uint8(115)
>
> numpy has similar speed if using 115 and np.uint8(115):
>
> %timeit np.count_nonzero(data_np == val)
> 591 µs ± 3.56 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
> %timeit np.count_nonzero(data_np == 115)
> 598 µs ± 3.73 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>
> strangely it is fastest for b"s":
>
> %timeit np.count_nonzero(data_np == b"s")
> 403 µs ± 3.15 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>
> pc.equal is 2.5x slower for np.uint8(115):
>
> %timeit pc.equal(data_pa, val).sum().as_py()
> 1.64 ms ± 8.23 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
>
> but much, much slower for 115:
>
> %timeit pc.equal(data_pa, 115).sum().as_py()
> 15.6 ms ± 21.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>
> And fails for b"s":
>
> %timeit pc.equal(data_pa, b"s").sum().as_py()
> ArrowNotImplementedError: Function 'equal' has no kernel matching
> input types (uint8, binary)
>
> I wrote it down in https://github.com/apache/arrow/issues/38640
>
> Any chance to get performance closer to numpy?
>
> BR,
>
> Jacek
