I am using a public benchmark. The original file is https://homepages.cwi.nl/~boncz/PublicBIbenchmark/Generico/Generico_1.csv.bz2 . I used pyarrow 7.0.0 and the pq.write_table API to write the CSV file as a Parquet file, with compression='snappy' and use_dictionary=True. The data has ~20M rows and 43 columns, so with the default row_group_size of 64M rows there is only a single row group. The OS is Ubuntu 20.04 and the file is on a local disk.
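
For reference, the conversion was roughly like this (a minimal sketch, not my exact script; the CSV parse options, in particular the delimiter, are assumptions):

import pyarrow.csv as pacsv
import pyarrow.parquet as pq

# Read the decompressed benchmark CSV. The delimiter is an assumption here;
# adjust the parse options to match the actual file.
table = pacsv.read_csv(
    "Generico_1.csv",
    parse_options=pacsv.ParseOptions(delimiter="|"),
)

# Write a single Parquet file with snappy compression and dictionary encoding.
# row_group_size is left at the default (64M rows), so the ~20M rows all land
# in one row group.
pq.write_table(
    table,
    "Generico_1.parquet",
    compression="snappy",
    use_dictionary=True,
)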
On Thu, Feb 24, 2022, 4:45 PM Weston Pace <[email protected]> wrote:

> That doesn't really solve it but just confirms that the problem is the
> newer datasets logic. I need more information to really know what is
> going on, as this still seems like a problem.
>
> How many row groups and how many columns does your file have? Or do you
> have a sample parquet file that shows this issue?
>
> On Wed, Feb 23, 2022, 10:34 PM Shawn Zeng <[email protected]> wrote:
>
>> use_legacy_dataset=True fixes the problem. Could you explain a little
>> about the reason? Thanks!
>>
>> On Thu, Feb 24, 2022, 1:44 PM Weston Pace <[email protected]> wrote:
>>
>>> What version of pyarrow are you using? What's your OS? Is the file on
>>> a local disk or S3? How many row groups are in your file?
>>>
>>> A difference of that much is not expected. However, they do use
>>> different infrastructure under the hood. Do you also get the faster
>>> performance with pq.read_table(use_legacy_dataset=True)?
>>>
>>> On Wed, Feb 23, 2022, 7:07 PM Shawn Zeng <[email protected]> wrote:
>>>
>>>> Hi all, I found that for the same parquet file,
>>>> pq.ParquetFile(file_name).read() takes 6s while
>>>> pq.read_table(file_name) takes 17s. How do those two APIs differ? I
>>>> thought they used the same internals, but it seems not. The parquet
>>>> file is 865MB, with snappy compression and dictionary encoding
>>>> enabled. All other settings are default, written with pyarrow.
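
PS: to make the comparison above concrete, this is roughly how I time the three code paths (a minimal sketch; the file path is a placeholder, and use_legacy_dataset is still accepted in pyarrow 7.0.0):

import time
import pyarrow.parquet as pq

path = "Generico_1.parquet"  # placeholder for the 865MB file

def timed(label, fn):
    start = time.perf_counter()
    table = fn()
    print(f"{label}: {time.perf_counter() - start:.1f}s, {table.num_rows} rows")

# Single-file reader, no datasets layer involved.
timed("ParquetFile().read()", lambda: pq.ParquetFile(path).read())

# Default read_table(), which goes through the newer datasets code path.
timed("read_table()", lambda: pq.read_table(path))

# Older code path, bypassing the datasets logic.
timed("read_table(use_legacy_dataset=True)",
      lambda: pq.read_table(path, use_legacy_dataset=True))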
