Pyarrow is not scaling well when compared with duckdb. Is there something
that needs to be done differently?

Minimal example:

import pyarrow.parquet as pq
lineitem = pq.read_table('lineitemsf1.snappy.parquet')
con = duckdb.connect()

%timeit lineitem.group_by("l_returnflag").aggregate([("l_extendedprice",
"sum")])
ungrouped_aggregate = '''SELECT SUM(l_extendedprice) FROM lineitem GROUP BY
l_returnflag'''

%timeit con.execute(ungrouped_aggregate).fetch_arrow_table()

Results
%timeit lineitem.group_by("l_returnflag").aggregate([("l_extendedprice",
"sum")])
207 ms ± 9.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit con.execute(ungrouped_aggregate).fetch_arrow_table()
71.8 ms ± 2.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Reply via email to