The issue was a combination of Python & C++, so it isn't something we'd see in the micro benchmarks. In the macro benchmarks this regression actually did show up pretty clearly [1], but I didn't notice it in the PR comment that conbench made. Jonathan Keane opened [2] on the conbench repo to consider more salient reporting of regressions. We may also consider reviewing some of the largest outstanding regressions as we approach a release or as part of the RC process.
[1] https://conbench.ursa.dev/compare/runs/c4d5e65d088243259e5198f4c0e219c9...5a1c693586c74471b7c8ba775005db54/
[2] https://github.com/conbench/conbench/issues/307

On Tue, Mar 8, 2022 at 9:05 AM Wes McKinney <[email protected]> wrote:
>
> Since this isn't the first time this specific issue has happened in a
> major release, is there a way that a test or benchmark regression
> check could be introduced to prevent this category of problem in the
> future?
>
> On Thu, Feb 24, 2022 at 9:48 PM Weston Pace <[email protected]> wrote:
> >
> > Thanks for reporting this. It seems a regression crept into 7.0.0
> > that accidentally disabled parallel column decoding when
> > pyarrow.parquet.read_table is called with a single file. I have filed
> > [1] and should have a fix for it before the next release. As a
> > workaround you can use the datasets API directly; this is already what
> > pyarrow.parquet.read_table uses under the hood when
> > use_legacy_dataset=False. Or you can continue using
> > use_legacy_dataset=True.
> >
> > import pyarrow.dataset as ds
> > table = ds.dataset('file.parquet', format='parquet').to_table()
> >
> > [1] https://issues.apache.org/jira/browse/ARROW-15784
> >
> > On Wed, Feb 23, 2022 at 10:59 PM Shawn Zeng <[email protected]> wrote:
> > >
> > > I am using a public benchmark. The original file is
> > > https://homepages.cwi.nl/~boncz/PublicBIbenchmark/Generico/Generico_1.csv.bz2
> > > and I used pyarrow version 7.0.0 and the pq.write_table API to write the
> > > CSV file as a Parquet file, with compression=snappy and
> > > use_dictionary=True. The data has ~20M rows and 43 columns, so there is
> > > only one row group with the default row_group_size=64M. The OS is
> > > Ubuntu 20.04 and the file is on local disk.
> > >
> > > On Thu, Feb 24, 2022 at 4:45 PM Weston Pace <[email protected]> wrote:
> > >>
> > >> That doesn't really solve it, but it confirms that the problem is in the
> > >> newer datasets logic. I need more information to really know what is
> > >> going on, as this still seems like a problem.
> > >>
> > >> How many row groups and how many columns does your file have? Or do you
> > >> have a sample parquet file that shows this issue?
> > >>
> > >> On Wed, Feb 23, 2022, 10:34 PM Shawn Zeng <[email protected]> wrote:
> > >>>
> > >>> use_legacy_dataset=True fixes the problem. Could you explain a little
> > >>> about the reason? Thanks!
> > >>>
> > >>> On Thu, Feb 24, 2022 at 1:44 PM Weston Pace <[email protected]> wrote:
> > >>>>
> > >>>> What version of pyarrow are you using? What's your OS? Is the file
> > >>>> on a local disk or S3? How many row groups are in your file?
> > >>>>
> > >>>> A difference of that much is not expected. However, they do use
> > >>>> different infrastructure under the hood. Do you also get the faster
> > >>>> performance with pq.read_table(use_legacy_dataset=True)?
> > >>>>
> > >>>> On Wed, Feb 23, 2022, 7:07 PM Shawn Zeng <[email protected]> wrote:
> > >>>>>
> > >>>>> Hi all, I found that for the same parquet file, using
> > >>>>> pq.ParquetFile(file_name).read() takes 6s while
> > >>>>> pq.read_table(file_name) takes 17s. How do these two APIs differ? I
> > >>>>> thought they used the same internals, but it seems not. The parquet
> > >>>>> file is 865MB, snappy-compressed, with dictionary encoding enabled.
> > >>>>> All other settings are default, written with pyarrow.
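
For reference, below is a minimal, untested sketch that exercises the read paths discussed in the thread. It assumes pyarrow 7.0.0 and a hypothetical local file 'file.parquet'; the timing helper and its labels are illustrative only, and actual numbers will vary with the file and machine.

import time

import pyarrow.dataset as ds
import pyarrow.parquet as pq

path = 'file.parquet'  # hypothetical example file; substitute your own

def timed(label, read_fn):
    # Time a single read and report the row count as a sanity check.
    start = time.perf_counter()
    table = read_fn()
    print(f"{label}: {time.perf_counter() - start:.2f}s, {table.num_rows} rows")

# The affected path: in 7.0.0 this goes through the datasets API and,
# per the thread and ARROW-15784, loses parallel column decoding when
# reading a single file.
timed("pq.read_table", lambda: pq.read_table(path))

# Workaround suggested in the thread: call the datasets API directly.
timed("ds.dataset().to_table()",
      lambda: ds.dataset(path, format='parquet').to_table())

# Alternative workaround from the thread: fall back to the legacy reader.
timed("pq.read_table(use_legacy_dataset=True)",
      lambda: pq.read_table(path, use_legacy_dataset=True))

# For comparison, the path the original reporter found to be fast.
timed("pq.ParquetFile().read()", lambda: pq.ParquetFile(path).read())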
