That doesn't really solve it, but it does confirm that the problem is in the newer datasets logic. I need more information to know what is going on, as this still seems like a problem.
How many row groups and how many columns does your file have? Or do you have a sample parquet file that shows this issue?

On Wed, Feb 23, 2022, 10:34 PM Shawn Zeng <[email protected]> wrote:

> use_legacy_dataset=True fixes the problem. Could you explain a little
> about the reason? Thanks!
>
> On Thu, Feb 24, 2022 at 1:44 PM, Weston Pace <[email protected]> wrote:
>
>> What version of pyarrow are you using? What's your OS? Is the file on a
>> local disk or S3? How many row groups are in your file?
>>
>> A difference of that much is not expected. However, they do use
>> different infrastructure under the hood. Do you also get the faster
>> performance with pq.read_table(use_legacy_dataset=True)?
>>
>> On Wed, Feb 23, 2022, 7:07 PM Shawn Zeng <[email protected]> wrote:
>>
>>> Hi all, I found that for the same parquet file,
>>> using pq.ParquetFile(file_name).read() takes 6s while
>>> pq.read_table(file_name) takes 17s. How do those two APIs differ? I
>>> thought they used the same internals, but it seems they do not. The
>>> parquet file is 865MB, snappy-compressed with dictionary encoding
>>> enabled. All other settings are default, written with pyarrow.
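For anyone wanting to reproduce the comparison, here is a minimal sketch of timing the three code paths under discussion. The file path "example.parquet" is a placeholder (the original 865MB file is not attached), and the perf_counter timing is just one reasonable way to measure; it is not taken from the thread.

    # Sketch: compare the two read APIs and the legacy dataset path.
    # "example.parquet" is a hypothetical placeholder for your own file.
    import time
    import pyarrow.parquet as pq

    path = "example.parquet"

    start = time.perf_counter()
    t1 = pq.ParquetFile(path).read()
    print(f"pq.ParquetFile(...).read():              {time.perf_counter() - start:.2f}s")

    start = time.perf_counter()
    t2 = pq.read_table(path)  # goes through the newer datasets machinery
    print(f"pq.read_table(...):                      {time.perf_counter() - start:.2f}s")

    start = time.perf_counter()
    t3 = pq.read_table(path, use_legacy_dataset=True)  # older reader path
    print(f"pq.read_table(..., use_legacy_dataset=True): {time.perf_counter() - start:.2f}s")

Reporting the three numbers along with the row group and column counts (e.g. from pq.ParquetFile(path).metadata) would help narrow down where the datasets-based reader is losing time.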
