Since this isn't the first time this specific issue has happened in a major release, is there a way that a test or benchmark regression check could be introduced to prevent this category of problem in the future?
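For what it's worth, here is a rough sketch of the kind of guard I have in mind; the test name, data shape, and 2x threshold are my own assumptions, and a wall-clock assert like this is inherently noisy, so a proper benchmark suite would be more robust:

import os
import tempfile
import time

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq


def test_read_table_not_slower_than_parquet_file():
    # Wide table written as a single row group, roughly the shape from this thread.
    n_rows, n_cols = 1_000_000, 43
    table = pa.table({f"c{i}": np.random.rand(n_rows) for i in range(n_cols)})

    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "wide.parquet")
        pq.write_table(table, path, compression="snappy", use_dictionary=True)

        start = time.perf_counter()
        pq.ParquetFile(path).read()
        parquet_file_time = time.perf_counter() - start

        start = time.perf_counter()
        pq.read_table(path)
        read_table_time = time.perf_counter() - start

    # The 2x bound is arbitrary; the regression reported here was closer to 3x.
    assert read_table_time < 2 * parquet_file_time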
On Thu, Feb 24, 2022 at 9:48 PM Weston Pace <[email protected]> wrote:
>
> Thanks for reporting this. It seems a regression crept into 7.0.0
> that accidentally disabled parallel column decoding when
> pyarrow.parquet.read_table is called with a single file. I have filed
> [1] and should have a fix for it before the next release. As a
> workaround you can use the datasets API directly; this is already what
> pyarrow.parquet.read_table uses under the hood when
> use_legacy_dataset=False. Or you can continue using
> use_legacy_dataset=True.
>
> import pyarrow.dataset as ds
> table = ds.dataset('file.parquet', format='parquet').to_table()
>
> [1] https://issues.apache.org/jira/browse/ARROW-15784
>
> On Wed, Feb 23, 2022 at 10:59 PM Shawn Zeng <[email protected]> wrote:
> >
> > I am using a public benchmark. The original file is
> > https://homepages.cwi.nl/~boncz/PublicBIbenchmark/Generico/Generico_1.csv.bz2
> > I used pyarrow version 7.0.0 and the pq.write_table API to write the CSV
> > file as a Parquet file, with compression=snappy and use_dictionary=True.
> > The data has ~20M rows and 43 columns, so there is only one row group with
> > the default row_group_size=64M. The OS is Ubuntu 20.04 and the file is on
> > local disk.
> >
> > On Thu, Feb 24, 2022 at 4:45 PM Weston Pace <[email protected]> wrote:
> >>
> >> That doesn't really solve it but just confirms that the problem is the
> >> newer datasets logic. I need more information to really know what is
> >> going on, as this still seems like a problem.
> >>
> >> How many row groups and how many columns does your file have? Or do you
> >> have a sample parquet file that shows this issue?
> >>
> >> On Wed, Feb 23, 2022, 10:34 PM Shawn Zeng <[email protected]> wrote:
> >>>
> >>> use_legacy_dataset=True fixes the problem. Could you explain a little
> >>> about the reason? Thanks!
> >>>
> >>> On Thu, Feb 24, 2022 at 1:44 PM Weston Pace <[email protected]> wrote:
> >>>>
> >>>> What version of pyarrow are you using? What's your OS? Is the file on
> >>>> a local disk or S3? How many row groups are in your file?
> >>>>
> >>>> A difference of that much is not expected. However, they do use
> >>>> different infrastructure under the hood. Do you also get the faster
> >>>> performance with pq.read_table(use_legacy_dataset=True)?
> >>>>
> >>>> On Wed, Feb 23, 2022, 7:07 PM Shawn Zeng <[email protected]> wrote:
> >>>>>
> >>>>> Hi all, I found that for the same Parquet file, using
> >>>>> pq.ParquetFile(file_name).read() takes 6s while
> >>>>> pq.read_table(file_name) takes 17s. How do those two APIs differ? I
> >>>>> thought they used the same internals, but it seems not. The Parquet file
> >>>>> is 865MB, snappy compressed with dictionary encoding enabled. All other
> >>>>> settings are default, writing with pyarrow.
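Putting the two workarounds from the quoted reply in one place for anyone else who hits this ('file.parquet' is a placeholder path):

import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Option 1: use the datasets API directly, which is what read_table uses
# internally when use_legacy_dataset=False.
table = ds.dataset("file.parquet", format="parquet").to_table()

# Option 2: fall back to the legacy reader, which is not affected by the
# regression described above.
table = pq.read_table("file.parquet", use_legacy_dataset=True)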
