Hi, thank you for your reply. The confusion comes from what Micah mentions: the row_group_size in pyarrow is 64M rows, not 64MB. In that case it does not align with the Hadoop block size unless each row is only 1 byte, so in most cases a row group will be much larger than 64MB. I think the fact that this parameter counts rows rather than bytes already caused confusion when I read the docs and changed the parameter.
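To make the rows-vs-bytes point concrete, here is a minimal sketch (file name and row counts are just for illustration):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"x": range(10_000_000)})  # 10M rows

    # row_group_size counts rows, not bytes: without it, the default of
    # 64 * 1024 * 1024 rows would put this whole table in a single row group.
    pq.write_table(table, "data.parquet", row_group_size=1_000_000)

    print(pq.ParquetFile("data.parquet").metadata.num_row_groups)  # -> 10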
I don't clearly understand why the current thinking is to have 1 row group per file. Could you explain more? (For reference, I pasted a small write_dataset sketch below the quoted thread.)

Micah Kornfield <[email protected]> wrote on Wed, Feb 23, 2022 at 03:52:

> What is the reason for this? Do you plan to change the default?
>
> I think there is some confusion, I do believe this is the number of rows
> but I'd guess it was set to 64M because it wasn't carefully adapted from
> parquet-mr, which I would guess uses byte size and therefore aligns well
> with the HDFS block size.
>
> I don't recall seeing any open issues to change it. It looks like for
> datasets [1] the default is 1 million, so maybe we should try to align
> these two. I don't have a strong opinion here, but my impression is that
> the current thinking is to generally have 1 row group per file, and
> eliminate entire files. For sub-file pruning, I think column indexes are a
> better solution, but they have not been implemented in pyarrow yet.
>
> [1]
> https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html#pyarrow.dataset.write_dataset
>
> On Tue, Feb 22, 2022 at 4:50 AM Marnix van den Broek <
> [email protected]> wrote:
>
>> hi Shawn,
>>
>> I expect this is the default because Parquet comes from the Hadoop
>> ecosystem, and the Hadoop block size is usually set to 64MB. Why would
>> you need a different default? You can set it to the size that fits your
>> use case best, right?
>>
>> Marnix
>>
>> On Tue, Feb 22, 2022 at 1:42 PM Shawn Zeng <[email protected]> wrote:
>>
>>> For clarification, I am referring to pyarrow.parquet.write_table.
>>>
>>> Shawn Zeng <[email protected]> wrote on Tue, Feb 22, 2022 at 20:40:
>>>
>>>> Hi,
>>>>
>>>> The default row_group_size is really large, which means that even a
>>>> large table with fewer than 64M rows will not get the benefits of
>>>> row-group-level statistics. What is the reason for this? Do you plan
>>>> to change the default?
>>>>
>>>> Thanks,
>>>> Shawn
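P.S. If I understand Micah's suggestion correctly, the dataset writer already exposes knobs for this. A rough sketch of aligning the two (assuming a pyarrow recent enough to have these write_dataset parameters; the directory name and row counts are arbitrary):

    import pyarrow as pa
    import pyarrow.dataset as ds

    table = pa.table({"x": range(3_000_000)})

    # max_rows_per_group already defaults to 1Mi rows here (vs 64Mi rows in
    # write_table); setting max_rows_per_file to the same value gives
    # roughly one row group per file, matching the "prune whole files" idea.
    ds.write_dataset(
        table,
        "out_dir",
        format="parquet",
        max_rows_per_group=1_000_000,
        max_rows_per_file=1_000_000,
    )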
