Thanks for reporting this.  It seems a regression crept into 7.0.0
that accidentally disabled parallel column decoding when
pyarrow.parquet.read_table is called with a single file.  I have filed
[1] and should have a fix for it before the next release.  As a
workaround you can use the datasets API directly; this is already what
pyarrow.parquet.read_table uses under the hood when
use_legacy_dataset=False.  Or you can continue using
use_legacy_dataset=True.

import pyarrow.dataset as ds
table = ds.dataset('file.parquet', format='parquet').to_table()
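
For completeness, the legacy-reader workaround looks like this (a minimal
sketch; the file name is a placeholder):

import pyarrow.parquet as pq
table = pq.read_table('file.parquet', use_legacy_dataset=True)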

[1] https://issues.apache.org/jira/browse/ARROW-15784

On Wed, Feb 23, 2022 at 10:59 PM Shawn Zeng <[email protected]> wrote:
>
> I am using a public benchmark. The original file is
> https://homepages.cwi.nl/~boncz/PublicBIbenchmark/Generico/Generico_1.csv.bz2
> . I used pyarrow version 7.0.0 and the pq.write_table API to write the CSV
> file as a parquet file, with compression=snappy and use_dictionary=True. The
> data has ~20M rows and 43 columns, so there is only one row group with the
> default row_group_size of 64M. The OS is Ubuntu 20.04 and the file is on a
> local disk.
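>
> For reference, a minimal sketch of that conversion (file names are
> placeholders, and this assumes the CSV has been decompressed first):
>
> from pyarrow import csv
> import pyarrow.parquet as pq
>
> # Read the CSV into an Arrow table, then write it as a single parquet file
> table = csv.read_csv('Generico_1.csv')
> pq.write_table(table, 'Generico_1.parquet',
>                compression='snappy', use_dictionary=True)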
>
> Weston Pace <[email protected]> wrote on Thu, Feb 24, 2022 at 16:45:
>>
>> That doesn't really solve it, but it confirms that the problem is in the
>> newer datasets logic.  This still seems like a bug, so I need more
>> information to understand what is going on.
>>
>> How many row groups and how many columns does your file have?  Or do you 
>> have a sample parquet file that shows this issue?
>>
>> On Wed, Feb 23, 2022, 10:34 PM Shawn Zeng <[email protected]> wrote:
>>>
>>> use_legacy_dataset=True fixes the problem. Could you briefly explain why?
>>> Thanks!
>>>
>>> Weston Pace <[email protected]> wrote on Thu, Feb 24, 2022 at 13:44:
>>>>
>>>> What version of pyarrow are you using?  What's your OS?  Is the file on a 
>>>> local disk or S3?  How many row groups are in your file?
>>>>
>>>> A difference of that much is not expected.  However, they do use different 
>>>> infrastructure under the hood.  Do you also get the faster performance
>>>> with pq.read_table(use_legacy_dataset=True)?
>>>>
>>>> On Wed, Feb 23, 2022, 7:07 PM Shawn Zeng <[email protected]> wrote:
>>>>>
>>>>> Hi all, I found that for the same parquet file, using
>>>>> pq.ParquetFile(file_name).read() takes 6s while pq.read_table(file_name)
>>>>> takes 17s. How do those two APIs differ? I thought they used the same
>>>>> internals, but it seems they do not. The parquet file is 865MB, with
>>>>> snappy compression and dictionary encoding enabled. All other settings
>>>>> are the defaults, and the file was written with pyarrow.
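>>>>>
>>>>> A minimal sketch of how the two reads compare (the file name is a
>>>>> placeholder):
>>>>>
>>>>> import time
>>>>> import pyarrow.parquet as pq
>>>>>
>>>>> start = time.time()
>>>>> pq.ParquetFile('data.parquet').read()     # ~6s in this report
>>>>> print('ParquetFile.read():', time.time() - start)
>>>>>
>>>>> start = time.time()
>>>>> pq.read_table('data.parquet')             # ~17s in this report
>>>>> print('read_table():', time.time() - start)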
