The issue was a combination of Python & C++, so it isn't something we'd see in the micro benchmarks. In the macro benchmarks this regression actually did show up pretty clearly [1], but I didn't notice it in the PR comment that conbench made. Jonathan Keane opened [2] on the conbench repo to consider more salient reporting of regressions. We may also consider reviewing some of the largest outstanding regressions as we approach a release or as part of the RC process.
[1] https://conbench.ursa.dev/compare/runs/c4d5e65d088243259e5198f4c0e219c9...5a1c693586c74471b7c8ba775005db54/
[2] https://github.com/conbench/conbench/issues/307

On Tue, Mar 8, 2022 at 9:05 AM Wes McKinney <[email protected]> wrote:
>
> Since this isn't the first time this specific issue has happened in a
> major release, is there a way that a test or benchmark regression
> check could be introduced to prevent this category of problem in the
> future?
>
> On Thu, Feb 24, 2022 at 9:48 PM Weston Pace <[email protected]> wrote:
> >
> > Thanks for reporting this. It seems a regression crept into 7.0.0
> > that accidentally disabled parallel column decoding when
> > pyarrow.parquet.read_table is called with a single file. I have filed
> > [1] and should have a fix for it before the next release. As a
> > workaround you can use the datasets API directly; this is already what
> > pyarrow.parquet.read_table uses under the hood when
> > use_legacy_dataset=False. Or you can continue using
> > use_legacy_dataset=True.
> >
> > import pyarrow.dataset as ds
> > table = ds.dataset('file.parquet', format='parquet').to_table()
> >
> > [1] https://issues.apache.org/jira/browse/ARROW-15784
> >
> > On Wed, Feb 23, 2022 at 10:59 PM Shawn Zeng <[email protected]> wrote:
> > >
> > > I am using a public benchmark. The original file is
> > > https://homepages.cwi.nl/~boncz/PublicBIbenchmark/Generico/Generico_1.csv.bz2
> > > and I used pyarrow version 7.0.0 and the pq.write_table API to write the
> > > CSV file as a Parquet file, with compression=snappy and
> > > use_dictionary=True. The data has ~20M rows and 43 columns, so there is
> > > only one row group with the default row_group_size=64M. The OS is
> > > Ubuntu 20.04 and the file is on local disk.
> > >
> > > On Thu, Feb 24, 2022 at 4:45 PM Weston Pace <[email protected]> wrote:
> > >>
> > >> That doesn't really solve it, but it confirms that the problem is in the
> > >> newer datasets logic. I need more information to really know what is
> > >> going on, as this still seems like a problem.
> > >>
> > >> How many row groups and how many columns does your file have? Or do you
> > >> have a sample parquet file that shows this issue?
> > >>
> > >> On Wed, Feb 23, 2022, 10:34 PM Shawn Zeng <[email protected]> wrote:
> > >>>
> > >>> use_legacy_dataset=True fixes the problem. Could you explain a little
> > >>> about the reason? Thanks!
> > >>>
> > >>> On Thu, Feb 24, 2022 at 1:44 PM Weston Pace <[email protected]> wrote:
> > >>>>
> > >>>> What version of pyarrow are you using? What's your OS? Is the file
> > >>>> on a local disk or S3? How many row groups are in your file?
> > >>>>
> > >>>> A difference of that much is not expected. However, they do use
> > >>>> different infrastructure under the hood. Do you also get the faster
> > >>>> performance with pq.read_table(use_legacy_dataset=True)?
> > >>>>
> > >>>> On Wed, Feb 23, 2022, 7:07 PM Shawn Zeng <[email protected]> wrote:
> > >>>>>
> > >>>>> Hi all, I found that for the same parquet file, using
> > >>>>> pq.ParquetFile(file_name).read() takes 6s while
> > >>>>> pq.read_table(file_name) takes 17s. How do these two APIs differ? I
> > >>>>> thought they used the same internals, but it seems not. The parquet
> > >>>>> file is 865MB, snappy-compressed, with dictionary encoding enabled.
> > >>>>> All other settings are default, written with pyarrow.
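
For reference, below is a minimal, untested sketch that exercises the read paths discussed in the thread. It assumes pyarrow 7.0.0 and a hypothetical local file 'file.parquet'; the timing helper and its labels are illustrative only, and actual numbers will vary with the file and machine.

import time

import pyarrow.dataset as ds
import pyarrow.parquet as pq

path = 'file.parquet'  # hypothetical example file; substitute your own

def timed(label, read_fn):
    # Time a single read and report the row count as a sanity check.
    start = time.perf_counter()
    table = read_fn()
    print(f"{label}: {time.perf_counter() - start:.2f}s, {table.num_rows} rows")

# The affected path: in 7.0.0 this goes through the datasets API and,
# per the thread and ARROW-15784, loses parallel column decoding when
# reading a single file.
timed("pq.read_table", lambda: pq.read_table(path))

# Workaround suggested in the thread: call the datasets API directly.
timed("ds.dataset().to_table()",
      lambda: ds.dataset(path, format='parquet').to_table())

# Alternative workaround from the thread: fall back to the legacy reader.
timed("pq.read_table(use_legacy_dataset=True)",
      lambda: pq.read_table(path, use_legacy_dataset=True))

# For comparison, the path the original reporter found to be fast.
timed("pq.ParquetFile().read()", lambda: pq.ParquetFile(path).read())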
