I am using a public benchmark. The original file is
https://homepages.cwi.nl/~boncz/PublicBIbenchmark/Generico/Generico_1.csv.bz2.
I used pyarrow 7.0.0 and the pq.write_table API to write the CSV file as a
Parquet file, with compression=snappy and use_dictionary=True. The data has
~20M rows and 43 columns, so with the default row_group_size of 64M rows
there is only a single row group. The OS is Ubuntu 20.04 and the file is on
a local disk.
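
For reference, this is roughly how I produced the Parquet file (a minimal
sketch; the decompressed CSV path, the pipe delimiter and the output name are
assumptions on my side, while the write options are the ones stated above):

import pyarrow.csv as pa_csv
import pyarrow.parquet as pq

# Read the decompressed benchmark CSV into an Arrow table.
# The PublicBI files are pipe-delimited as far as I know; adjust if needed.
table = pa_csv.read_csv(
    "Generico_1.csv",
    parse_options=pa_csv.ParseOptions(delimiter="|"),
)

# Write it out as a single Parquet file. With ~20M rows and the default
# row_group_size this ends up as one row group.
pq.write_table(
    table,
    "Generico_1.parquet",
    compression="snappy",
    use_dictionary=True,
)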

On Thu, Feb 24, 2022, 4:45 PM Weston Pace <[email protected]> wrote:

> That doesn't really solve it but just confirms that the problem is the
> newer datasets logic.  I need more information to really know what is going
> on as this still seems like a problem.
>
> How many row groups and how many columns does your file have?  Or do you
> have a sample parquet file that shows this issue?
>
> On Wed, Feb 23, 2022, 10:34 PM Shawn Zeng <[email protected]> wrote:
>
>> use_legacy_dataset=True fixes the problem. Could you explain a little
>> about the reason? Thanks!
>>
>> On Thu, Feb 24, 2022, 1:44 PM Weston Pace <[email protected]> wrote:
>>
>>> What version of pyarrow are you using?  What's your OS?  Is the file on
>>> a local disk or S3?  How many row groups are in your file?
>>>
>>> A difference of that much is not expected.  However, the two APIs do use
>>> different infrastructure under the hood.  Do you also get the faster
>>> performance with pq.read_table(use_legacy_dataset=True)?
>>>
>>> On Wed, Feb 23, 2022, 7:07 PM Shawn Zeng <[email protected]> wrote:
>>>
>>>> Hi all, I found that for the same Parquet file,
>>>> using pq.ParquetFile(file_name).read() takes 6s while
>>>> pq.read_table(file_name) takes 17s. How do those two APIs differ? I thought
>>>> they used the same internals, but it seems they do not. The Parquet file is
>>>> 865MB, with snappy compression and dictionary encoding enabled. All other
>>>> settings are the defaults, writing with pyarrow.
>>>>
>>>
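
Below is a minimal sketch of the three read paths discussed in this thread,
in case it helps reproduce the difference (the file path is the one from my
write step above; the timings will of course vary by machine):

import time
import pyarrow.parquet as pq

path = "Generico_1.parquet"

# Path 1: ParquetFile.read(), the single-file reader (~6s in my case).
start = time.time()
pq.ParquetFile(path).read()
print("ParquetFile().read():", time.time() - start)

# Path 2: read_table(), which goes through the newer datasets code path
# by default (~17s in my case).
start = time.time()
pq.read_table(path)
print("read_table():", time.time() - start)

# Path 3: read_table() with use_legacy_dataset=True, which skips the
# datasets layer and matched the faster timing for me.
start = time.time()
pq.read_table(path, use_legacy_dataset=True)
print("read_table(use_legacy_dataset=True):", time.time() - start)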
