pitrou opened a new pull request #8777: URL: https://github.com/apache/arrow/pull/8777
Instead of applying the same boolean filter to all N columns, first convert the filter to take indices. Selecting by a linear array of indices is much faster, since it avoids bit-unpacking the boolean filter.

Using the Python benchmark setup from ARROW-10569:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.compute as pc
import numpy as np

num_rows = 10_000_000
data = np.random.randn(num_rows)

df = pd.DataFrame({'data{}'.format(i): data for i in range(100)})
df['key'] = np.random.randint(0, 100, size=num_rows)

rb = pa.record_batch(df)
t = pa.table(df)
```

* before:

```python
>>> %timeit rb.filter(pc.equal(rb[-1], 5))
60 ms ± 509 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit t.filter(pc.equal(t[-1], 5))
1.22 s ± 2.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

* after:

```python
>>> %timeit rb.filter(pc.equal(rb[-1], 5))
59.2 ms ± 583 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit t.filter(pc.equal(t[-1], 5))
59.3 ms ± 339 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
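For readers unfamiliar with the approach, the idea of converting a bit-packed filter to indices once can be sketched in plain Python. This is only an illustration under simplifying assumptions (no nulls, a single chunk, columns as Python lists); it is not the C++ code in this PR, and `pack_bits`, `bitmap_to_indices`, and `filter_columns` are hypothetical helper names.

```python
def pack_bits(mask):
    """Pack a list of bools into bytes, least-significant bit first."""
    packed = bytearray((len(mask) + 7) // 8)
    for i, keep in enumerate(mask):
        if keep:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def bitmap_to_indices(bitmap, length):
    """Unpack the bitmap once and return the positions of the set bits."""
    return [i for i in range(length) if (bitmap[i // 8] >> (i % 8)) & 1]


def filter_columns(columns, bitmap, length):
    """Filter every column using indices computed from the bitmap once."""
    # Bit-unpacking happens here only, not once per column.
    indices = bitmap_to_indices(bitmap, length)
    # Each column is then gathered by plain integer position.
    return {name: [values[i] for i in indices] for name, values in columns.items()}


# Toy data standing in for the benchmark's 'data*' and 'key' columns.
columns = {"data0": [0.5, 1.5, 2.5, 3.5], "key": [5, 7, 5, 9]}
bitmap = pack_bits([k == 5 for k in columns["key"]])
filtered = filter_columns(columns, bitmap, len(columns["key"]))
```

The cost of decoding the mask is paid a single time in `bitmap_to_indices`; the per-column work reduces to an index gather, which is the property the timings above rely on.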