Adding to the performance scenario: I also implemented some operators on top of the Arrow compute API and observed performance similar to NumPy and Pandas.
But underneath, what I observed in Pandas was the usage of NumPy ops
( https://github.com/pandas-dev/pandas/blob/44406a65848a820a3708eda092044796e8c11cb5/pandas/core/ops/array_ops.py#L195 ,
https://github.com/pandas-dev/pandas/blob/44406a65848a820a3708eda092044796e8c11cb5/pandas/core/series.py#L4999 ).

@Wes, this would mean that Pandas may have performance similar to NumPy in filtering cases. Is this a correct assumption?

The filter compute function itself was very fast; most of the time is spent on creating the mask when there are multiple columns. For about 10M records I observed a 1.5x ratio of execution time between the Arrow-compute-based filtering method and Pandas. Is the performance gap due to vectorization or some other factor?

With Regards,
Vibhatha Abeykoon

On Wed, Nov 11, 2020 at 2:36 PM Jason Sachs <[email protected]> wrote:

> Ugh, let me reformat that since the PonyMail browser interface thinks
> ">>>" is a triply quoted message.
>
> <<< t = pa.Table.from_pandas(df0)
> <<< t
> pyarrow.Table
> timestamp: int64
> index: int32
> value: int64
> <<< import pyarrow.compute as pc
> <<< def select_by_index(table, ival):
>         value_index = table.column('index')
>         index_type = value_index.type.to_pandas_dtype()
>         mask = pc.equal(value_index, index_type(ival))
>         return table.filter(mask)
> <<< %timeit t2 = select_by_index(t, 515)
> 2.58 ms ± 31.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> <<< %timeit t2 = select_by_index(t, 3)
> 8.6 ms ± 91.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> <<< %timeit df0[df0['index'] == 515]
> 1.59 ms ± 5.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> <<< %timeit df0[df0['index'] == 3]
> 10 ms ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> <<< print("ALL:%d, 3:%d, 515:%d" % (len(df0),
>                                     np.count_nonzero(df0['index'] == 3),
>                                     np.count_nonzero(df0['index'] == 515)))
> ALL:1225000, 3:200000, 515:195
> <<< df0.memory_usage()
> Index           128
> timestamp   9800000
> index       4900000
> value       9800000
> dtype: int64
>
>
