Adding to the performance scenario: I also implemented some operators on top of the Arrow compute API and observed performance similar to NumPy and Pandas.
But underneath, what I observed in Pandas was the usage of NumPy ops
( https://github.com/pandas-dev/pandas/blob/44406a65848a820a3708eda092044796e8c11cb5/pandas/core/ops/array_ops.py#L195 ,
https://github.com/pandas-dev/pandas/blob/44406a65848a820a3708eda092044796e8c11cb5/pandas/core/series.py#L4999 ).

@Wes, this would mean that Pandas may have performance similar to NumPy in filtering cases. Is this a correct assumption?

The filter compute function itself was very fast; most of the time is spent on creating the mask when there are multiple columns. For about 10M records I observed a 1.5x ratio of execution time between the Arrow-compute-based filtering method and Pandas. Is the performance gap due to vectorization or some other factor?

With Regards,
Vibhatha Abeykoon

On Wed, Nov 11, 2020 at 2:36 PM Jason Sachs <[email protected]> wrote:

> Ugh, let me reformat that since the PonyMail browser interface thinks
> ">>>" is a triply quoted message.
>
> <<< t = pa.Table.from_pandas(df0)
> <<< t
> pyarrow.Table
> timestamp: int64
> index: int32
> value: int64
> <<< import pyarrow.compute as pc
> <<< def select_by_index(table, ival):
>         value_index = table.column('index')
>         index_type = value_index.type.to_pandas_dtype()
>         mask = pc.equal(value_index, index_type(ival))
>         return table.filter(mask)
> <<< %timeit t2 = select_by_index(t, 515)
> 2.58 ms ± 31.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> <<< %timeit t2 = select_by_index(t, 3)
> 8.6 ms ± 91.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> <<< %timeit df0[df0['index'] == 515]
> 1.59 ms ± 5.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
> <<< %timeit df0[df0['index'] == 3]
> 10 ms ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
> <<< print("ALL:%d, 3:%d, 515:%d" % (len(df0),
>                                     np.count_nonzero(df0['index'] == 3),
>                                     np.count_nonzero(df0['index'] == 515)))
> ALL:1225000, 3:200000, 515:195
> <<< df0.memory_usage()
> Index           128
> timestamp   9800000
> index       4900000
> value       9800000
> dtype: int64
>
>
