use_legacy_dataset=True fixes the problem. Could you briefly explain the reason? Thanks!
Weston Pace <[email protected]> wrote on Thu, Feb 24, 2022, 13:44:

> What version of pyarrow are you using? What's your OS? Is the file on a
> local disk or S3? How many row groups are in your file?
>
> A difference of that much is not expected. However, the two do use different
> infrastructure under the hood. Do you also get the faster performance with
> pq.read_table(use_legacy_dataset=True)?
>
> On Wed, Feb 23, 2022, 7:07 PM Shawn Zeng <[email protected]> wrote:
>
>> Hi all, I found that for the same parquet file,
>> pq.ParquetFile(file_name).read() takes 6s while
>> pq.read_table(file_name) takes 17s. How do these two APIs differ? I thought
>> they used the same internals, but apparently not. The parquet file is 865MB,
>> with snappy compression and dictionary encoding enabled. All other settings
>> are default, written with pyarrow.
>>
