Pyarrow is not scaling well when compared with duckdb. Is there something
that needs to be done differently?
Minimal example:
import pyarrow.parquet as pq
lineitem = pq.read_table('lineitemsf1.snappy.parquet')
con = duckdb.connect()
%timeit lineitem.group_by("l_returnflag").aggregate([("l_extendedprice",
"sum")])
ungrouped_aggregate = '''SELECT SUM(l_extendedprice) FROM lineitem GROUP BY
l_returnflag'''
%timeit con.execute(ungrouped_aggregate).fetch_arrow_table()
Results
%timeit lineitem.group_by("l_returnflag").aggregate([("l_extendedprice",
"sum")])
207 ms ± 9.31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit con.execute(ungrouped_aggregate).fetch_arrow_table()
71.8 ms ± 2.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)