Thanks for reporting this. It seems a regression crept into 7.0.0
that accidentally disabled parallel column decoding when
pyarrow.parquet.read_table is called with a single file. I have filed
[1] and should have a fix for it before the next release. As a
workaround you can use the datasets API directly, which is what
pyarrow.parquet.read_table is already using under the hood when
use_legacy_dataset=False. Or you can continue using
use_legacy_dataset=True.
import pyarrow.dataset as ds
# Scan with the datasets API directly (avoids the affected read_table path).
table = ds.dataset('file.parquet', format='parquet').to_table()
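The use_legacy_dataset=True route would look roughly like this (just a
sketch; 'file.parquet' is a placeholder path):
import pyarrow.parquet as pq
table = pq.read_table('file.parquet', use_legacy_dataset=True)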
[1] https://issues.apache.org/jira/browse/ARROW-15784
On Wed, Feb 23, 2022 at 10:59 PM Shawn Zeng <[email protected]> wrote:
>
> I am using a public benchmark. The original file is
> https://homepages.cwi.nl/~boncz/PublicBIbenchmark/Generico/Generico_1.csv.bz2
> . I used pyarrow version 7.0.0 and the pq.write_table API to write the CSV
> file as a parquet file, with compression=snappy and use_dictionary=true. The
> data has ~20M rows and 43 columns, so there is only one row group with the
> default row_group_size of 64M. The OS is Ubuntu 20.04 and the file is on a
> local disk.
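(As an aside, for anyone trying to reproduce this, the write step was
presumably roughly as follows; the CSV parse options here are a guess on
my part and may need adjusting.)
import pyarrow.csv as pcsv
import pyarrow.parquet as pq
# pyarrow decompresses .bz2 automatically based on the file extension;
# the delimiter is an assumption about this dataset's format.
table = pcsv.read_csv('Generico_1.csv.bz2',
                      parse_options=pcsv.ParseOptions(delimiter='|'))
# Snappy compression with dictionary encoding; at ~20M rows the default
# row_group_size leaves a single row group.
pq.write_table(table, 'Generico_1.parquet',
               compression='snappy', use_dictionary=True)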
>
> On Thu, Feb 24, 2022 at 4:45 PM Weston Pace <[email protected]> wrote:
>>
>> That doesn't really solve it, but it does confirm that the problem is in the
>> newer datasets logic. I need more information to really know what is going
>> on, as this still seems like a problem.
>>
>> How many row groups and how many columns does your file have? Or do you
>> have a sample parquet file that shows this issue?
>>
>> On Wed, Feb 23, 2022, 10:34 PM Shawn Zeng <[email protected]> wrote:
>>>
>>> use_legacy_dataset=True fixes the problem. Could you explain the reason a
>>> little? Thanks!
>>>
>>> On Thu, Feb 24, 2022 at 1:44 PM Weston Pace <[email protected]> wrote:
>>>>
>>>> What version of pyarrow are you using? What's your OS? Is the file on a
>>>> local disk or S3? How many row groups are in your file?
>>>>
>>>> A difference of that much is not expected. However, the two do use
>>>> different infrastructure under the hood. Do you get the faster performance
>>>> with pq.read_table(use_legacy_dataset=True) as well?
>>>>
>>>> On Wed, Feb 23, 2022, 7:07 PM Shawn Zeng <[email protected]> wrote:
>>>>>
>>>>> Hi all, I found that for the same parquet file, using
>>>>> pq.ParquetFile(file_name).read() takes 6s while pq.read_table(file_name)
>>>>> takes 17s. How do those two APIs differ? I thought they used the same
>>>>> internals, but it seems they do not. The parquet file is 865MB, with
>>>>> snappy compression and dictionary encoding enabled. All other settings
>>>>> are defaults, and the file was written with pyarrow.
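For what it's worth, a minimal timing comparison along these lines (with
a placeholder file name) should reproduce the gap:
import time
import pyarrow.parquet as pq
# Time the ParquetFile path.
start = time.time()
pq.ParquetFile('file.parquet').read()
print('ParquetFile.read():', time.time() - start)
# Time the read_table path (datasets-based by default in 7.0.0).
start = time.time()
pq.read_table('file.parquet')
print('read_table():', time.time() - start)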