Wes McKinney created ARROW-9924:
-----------------------------------

             Summary: [Python] Performance regression reading individual Parquet files using Dataset interface
                 Key: ARROW-9924
                 URL: https://issues.apache.org/jira/browse/ARROW-9924
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
            Reporter: Wes McKinney
             Fix For: 2.0.0

I haven't investigated this very deeply, but it seems symptomatic of a problem: reading a single Parquet file through the default (Dataset-based) code path is slower than reading it with use_legacy_dataset=True, roughly 80 ms versus 66 ms in this session:

{code}
In [27]: df = pd.DataFrame({'A': np.random.randn(10000000)})

In [28]: pq.write_table(pa.table(df), 'test.parquet')

In [29]: timeit pq.read_table('test.parquet')
79.8 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [30]: timeit pq.read_table('test.parquet', use_legacy_dataset=True)
66.4 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
{code}
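
For anyone who wants to reproduce the comparison outside IPython, a standalone script along these lines may be a useful starting point. It is only a sketch under a few assumptions: numpy and pyarrow are installed, the helper names time_call and run_benchmark are hypothetical (they are not part of pyarrow), the table is built directly from a NumPy array instead of going through pandas, and the use_legacy_dataset keyword is taken from the session above and may be deprecated or absent in other pyarrow releases.

```python
import statistics
import timeit


def time_call(fn, repeat=7, number=10):
    """Return (mean, std dev) in seconds per call over `repeat` runs.

    Each run calls `fn` `number` times, similar in spirit to the
    "7 runs, 10 loops each" summary printed by IPython's timeit.
    """
    totals = timeit.repeat(fn, repeat=repeat, number=number)
    per_call = [total / number for total in totals]
    return statistics.mean(per_call), statistics.pstdev(per_call)


def run_benchmark(path="test.parquet", n_rows=10_000_000):
    """Write one float64 column to `path` and time both read paths."""
    # Imported here so that time_call stays usable without pyarrow.
    import numpy as np
    import pyarrow as pa
    import pyarrow.parquet as pq

    # Assumption: a dict of NumPy arrays stands in for the pandas
    # DataFrame used in the original session.
    table = pa.table({"A": np.random.randn(n_rows)})
    pq.write_table(table, path)

    candidates = {
        "default": lambda: pq.read_table(path),
        # Keyword taken from the report; may not exist in every release.
        "use_legacy_dataset=True": lambda: pq.read_table(
            path, use_legacy_dataset=True
        ),
    }
    results = {}
    for label, fn in candidates.items():
        mean, std = time_call(fn)
        results[label] = (mean, std)
        print(f"{label}: {mean * 1e3:.1f} ms ± {std * 1e3:.1f} ms per call")
    return results
```

Calling run_benchmark() writes test.parquet to the current directory and prints one line per read path. Absolute timings will differ by machine and pyarrow version, so only the relative gap between the two lines is meaningful.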