In my setup here I did:
import pandas as pd
import pyarrow as pa
import pyarrow.compute as pc
import numpy as np
num_rows = 10_000_000
data = np.random.randn(num_rows)
df = pd.DataFrame({'data{}'.format(i): data
                   for i in range(100)})
df['key'] = np.random.randint(0, 100, size=num_rows)
rb = pa.record_batch(df)
t = pa.table(df)
I found that the performance of filtering a record batch is very similar
to filtering with pandas:
In [22]: %timeit df[df.key == 5]
71.3 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [24]: %timeit rb.filter(pc.equal(rb[-1], 5))
75.8 ms ± 2.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Whereas the performance of filtering a table is absolutely abysmal (no
idea what's going on here)
In [23]: %timeit t.filter(pc.equal(t[-1], 5))
961 ms ± 3.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
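To localize where the time is going, here is a quick sketch (just against
the rb / t objects defined above, not a diagnosis) that times the
predicate evaluation and the filter step separately:

mask_rb = pc.equal(rb[-1], 5)   # boolean Array over the 'key' column
mask_t = pc.equal(t[-1], 5)     # boolean ChunkedArray for the table
%timeit pc.equal(t[-1], 5)      # cost of building the mask alone
%timeit t.filter(mask_t)        # cost of just the filter step on the table
%timeit rb.filter(mask_rb)      # cost of just the filter step on the batch

If the filter step alone accounts for the gap, that points at the table
filter path rather than the comparison kernel.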
A few obvious notes:
* Evidently, these code paths haven't been greatly optimized, so
someone ought to take a look at this
* Everything here is single-threaded in Arrow-land. The end-goal for
all of this is to parallelize everything (predicate evaluation,
filtering) on the CPU thread pool (see the short note below on where
that pool is configured)
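For reference, the CPU thread pool mentioned above is the one sized by
pa.cpu_count() / pa.set_cpu_count(); a minimal sketch (with the caveat
that, as noted, the filter/comparison kernels don't use it yet):

print(pa.cpu_count())   # current size of Arrow's CPU thread pool
pa.set_cpu_count(8)     # e.g. request 8 threads for pool-based kernels
# The mask computation and filter above still run single-threaded today;
# the goal stated above is to move them onto this pool.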
On Wed, Nov 11, 2020 at 4:27 PM Vibhatha Abeykoon <[email protected]> wrote:
>
> Adding to the performance scenario: I implemented some operators on top of
> the Arrow compute API and also observed similar performance compared to
> NumPy and Pandas.
>
> But underneath, what I observed is that Pandas uses NumPy ops:
>
> (https://github.com/pandas-dev/pandas/blob/44406a65848a820a3708eda092044796e8c11cb5/pandas/core/ops/array_ops.py#L195,
> https://github.com/pandas-dev/pandas/blob/44406a65848a820a3708eda092044796e8c11cb5/pandas/core/series.py#L4999)
>
> @Wes
>
> So this would mean that Pandas may have similar performance to NumPy in
> filtering cases. Is this a correct assumption?
>
> But the filter compute function itself was very fast; most of the time is
> spent creating the mask when there are multiple columns.
> For about 10M records I observed roughly a 1.5x ratio in execution time
> between the Arrow-compute-based filtering method and Pandas.
>
> Is the performance gap due to vectorization or some other factor?
>
>
> With Regards,
> Vibhatha Abeykoon
>
>
> On Wed, Nov 11, 2020 at 2:36 PM Jason Sachs <[email protected]> wrote:
>>
>> Ugh, let me reformat that since the PonyMail browser interface thinks ">>>"
>> is a triply quoted message.
>>
>> <<< t = pa.Table.from_pandas(df0)
>> <<< t
>> pyarrow.Table
>> timestamp: int64
>> index: int32
>> value: int64
>> <<< import pyarrow.compute as pc
>> <<< def select_by_index(table, ival):
>>     value_index = table.column('index')
>>     index_type = value_index.type.to_pandas_dtype()
>>     mask = pc.equal(value_index, index_type(ival))
>>     return table.filter(mask)
>> <<< %timeit t2 = select_by_index(t, 515)
>> 2.58 ms ± 31.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>> <<< %timeit t2 = select_by_index(t, 3)
>> 8.6 ms ± 91.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>> <<< %timeit df0[df0['index'] == 515]
>> 1.59 ms ± 5.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>> <<< %timeit df0[df0['index'] == 3]
>> 10 ms ± 28.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>> <<< print("ALL:%d, 3:%d, 515:%d" % (len(df0),
>>           np.count_nonzero(df0['index'] == 3),
>>           np.count_nonzero(df0['index'] == 515)))
>> ALL:1225000, 3:200000, 515:195
>> <<< df0.memory_usage()
>> Index 128
>> timestamp 9800000
>> index 4900000
>> value 9800000
>> dtype: int64
>>