Since this isn't the first time this specific issue has happened in a major release, is there a way that a test or benchmark regression check could be introduced to prevent this category of problem in the future?
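For what it's worth, here is a rough sketch of the kind of guard I have in mind; the test name, data shape, and 2x threshold are my own assumptions, and a wall-clock assert like this is inherently noisy, so a proper benchmark suite would be more robust:

import os
import tempfile
import time

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq


def test_read_table_not_slower_than_parquet_file():
    # Wide table written as a single row group, roughly the shape from this thread.
    n_rows, n_cols = 1_000_000, 43
    table = pa.table({f"c{i}": np.random.rand(n_rows) for i in range(n_cols)})

    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "wide.parquet")
        pq.write_table(table, path, compression="snappy", use_dictionary=True)

        start = time.perf_counter()
        pq.ParquetFile(path).read()
        parquet_file_time = time.perf_counter() - start

        start = time.perf_counter()
        pq.read_table(path)
        read_table_time = time.perf_counter() - start

    # The 2x bound is arbitrary; the regression reported here was closer to 3x.
    assert read_table_time < 2 * parquet_file_time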
On Thu, Feb 24, 2022 at 9:48 PM Weston Pace <[email protected]> wrote:
>
> Thanks for reporting this. It seems a regression crept into 7.0.0
> that accidentally disabled parallel column decoding when
> pyarrow.parquet.read_table is called with a single file. I have filed
> [1] and should have a fix for it before the next release. As a
> workaround you can use the datasets API directly; this is already what
> pyarrow.parquet.read_table uses under the hood when
> use_legacy_dataset=False. Or you can continue using
> use_legacy_dataset=True.
>
> import pyarrow.dataset as ds
> table = ds.dataset('file.parquet', format='parquet').to_table()
>
> [1] https://issues.apache.org/jira/browse/ARROW-15784
>
> On Wed, Feb 23, 2022 at 10:59 PM Shawn Zeng <[email protected]> wrote:
> >
> > I am using a public benchmark. The original file is
> > https://homepages.cwi.nl/~boncz/PublicBIbenchmark/Generico/Generico_1.csv.bz2
> > I used pyarrow version 7.0.0 and the pq.write_table API to write the CSV
> > file as a Parquet file, with compression=snappy and use_dictionary=True.
> > The data has ~20M rows and 43 columns, so there is only one row group with
> > the default row_group_size=64M. The OS is Ubuntu 20.04 and the file is on
> > local disk.
> >
> > On Thu, Feb 24, 2022 at 4:45 PM Weston Pace <[email protected]> wrote:
> >>
> >> That doesn't really solve it but just confirms that the problem is the
> >> newer datasets logic. I need more information to really know what is
> >> going on, as this still seems like a problem.
> >>
> >> How many row groups and how many columns does your file have? Or do you
> >> have a sample parquet file that shows this issue?
> >>
> >> On Wed, Feb 23, 2022, 10:34 PM Shawn Zeng <[email protected]> wrote:
> >>>
> >>> use_legacy_dataset=True fixes the problem. Could you explain a little
> >>> about the reason? Thanks!
> >>>
> >>> On Thu, Feb 24, 2022 at 1:44 PM Weston Pace <[email protected]> wrote:
> >>>>
> >>>> What version of pyarrow are you using? What's your OS? Is the file on
> >>>> a local disk or S3? How many row groups are in your file?
> >>>>
> >>>> A difference of that much is not expected. However, they do use
> >>>> different infrastructure under the hood. Do you also get the faster
> >>>> performance with pq.read_table(use_legacy_dataset=True)?
> >>>>
> >>>> On Wed, Feb 23, 2022, 7:07 PM Shawn Zeng <[email protected]> wrote:
> >>>>>
> >>>>> Hi all, I found that for the same Parquet file, using
> >>>>> pq.ParquetFile(file_name).read() takes 6s while
> >>>>> pq.read_table(file_name) takes 17s. How do those two APIs differ? I
> >>>>> thought they used the same internals, but it seems not. The Parquet file
> >>>>> is 865MB, snappy compressed with dictionary encoding enabled. All other
> >>>>> settings are default, writing with pyarrow.
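Putting the two workarounds from the quoted reply in one place for anyone else who hits this ('file.parquet' is a placeholder path):

import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Option 1: use the datasets API directly, which is what read_table uses
# internally when use_legacy_dataset=False.
table = ds.dataset("file.parquet", format="parquet").to_table()

# Option 2: fall back to the legacy reader, which is not affected by the
# regression described above.
table = pq.read_table("file.parquet", use_legacy_dataset=True)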
