That doesn't really solve it, but it does confirm that the problem is
in the newer datasets logic.  This still looks like a real problem, so
I need more information to understand what is going on.

How many row groups and how many columns does your file have?  Or do you
have a sample parquet file that shows this issue?
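
If it helps, a quick way to get those stats is from the file footer
metadata (a minimal sketch; the path is a placeholder):

    import pyarrow.parquet as pq

    # Reads only the footer metadata, not the data itself
    md = pq.ParquetFile("your_file.parquet").metadata
    print("row groups:", md.num_row_groups)
    print("columns:", md.num_columns)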

On Wed, Feb 23, 2022, 10:34 PM Shawn Zeng <[email protected]> wrote:

> Using use_legacy_dataset=True fixes the problem. Could you explain a
> little about why that is? Thanks!
>
> On Thu, Feb 24, 2022, 1:44 PM Weston Pace <[email protected]> wrote:
>
>> What version of pyarrow are you using?  What's your OS?  Is the file on a
>> local disk or S3?  How many row groups are in your file?
>>
>> A difference that large is not expected, although the two do use
>> different infrastructure under the hood.  Do you also get the faster
>> performance with pq.read_table(use_legacy_dataset=True)?
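>>
>> For example, a minimal version of that check would be (the path is a
>> placeholder):
>>
>>     import pyarrow.parquet as pq
>>
>>     # Forces the older, pre-datasets read path
>>     table = pq.read_table("your_file.parquet", use_legacy_dataset=True)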
>>
>> On Wed, Feb 23, 2022, 7:07 PM Shawn Zeng <[email protected]> wrote:
>>
>>> Hi all, I found that for the same parquet file,
>>> pq.ParquetFile(file_name).read() takes 6s while
>>> pq.read_table(file_name) takes 17s.  How do those two APIs differ?  I
>>> thought they used the same internals, but it seems they do not.  The
>>> parquet file is 865MB, with Snappy compression and dictionary encoding
>>> enabled.  All other settings are defaults, and the file was written
>>> with pyarrow.
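>>>
>>> A minimal sketch of the comparison (the file name is a placeholder;
>>> the 6s/17s numbers above are from this file):
>>>
>>>     import time
>>>     import pyarrow.parquet as pq
>>>
>>>     start = time.perf_counter()
>>>     pq.ParquetFile("data.parquet").read()  # ~6s on this file
>>>     print("ParquetFile.read():", time.perf_counter() - start)
>>>
>>>     start = time.perf_counter()
>>>     pq.read_table("data.parquet")  # ~17s on this file
>>>     print("pq.read_table():", time.perf_counter() - start)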
>>>
>>
