Based on this, aligning the default for the non-dataset write path to 1 million rows seems to make sense in the short term.
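In the meantime, the row group size can be set explicitly at write time. A rough sketch below (the toy table and output paths are made up purely for illustration, and the dataset call just mirrors the defaults documented on the write_dataset page Micah linked):

    import pyarrow as pa
    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    # Toy table, purely illustrative.
    table = pa.table({
        "id": pa.array(range(3_000_000)),
        "value": pa.array(range(3_000_000)),
    })

    # Non-dataset write path: the default row_group_size is currently 64Mi rows,
    # so smaller row groups have to be requested explicitly.
    pq.write_table(table, "example.parquet", row_group_size=1_000_000)

    # Dataset write path, shown only for comparison: it already defaults to
    # roughly 1M rows per group (max_rows_per_group).
    ds.write_dataset(table, "example_dataset", format="parquet",
                     max_rows_per_group=1_000_000)

With the explicit row_group_size, the 3M-row toy table ends up as three 1M-row row groups instead of a single oversized one.
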
On Tuesday, February 22, 2022, Shawn Zeng <[email protected]> wrote:
> Thanks for Weston's clear explanation. Point 2 is what I've experienced
> without tuning the parameter, and point 3 is what I'm concerned about. Looking
> forward to finer granularity of reading/indexing Parquet than the row group
> level, which should fix the issue.
>
> On Wed, Feb 23, 2022 at 1:30 PM Weston Pace <[email protected]> wrote:
>
>> These are all great points. A few notes from my own experiments
>> (mostly confirming what others have said):
>>
>> 1) 1M rows is the minimum safe size for row groups on HDD (and
>> perhaps a hair too low in some situations) if you are doing any kind
>> of column selection (i.e. projection pushdown). As that number gets
>> lower, the ratio of skips to reads increases to the point where it
>> starts to look too much like a "random read" to the HDD and performance
>> suffers.
>> 2) 64M rows is too high for two (preventable) reasons in the C++
>> datasets API. The first reason is that the datasets API does not
>> currently support sub-row-group streaming reads (e.g. we can only read
>> one row group at a time from the disk). Large row groups lead to too
>> much initial latency (we have to read an entire row group before we start
>> processing) and too much RAM usage.
>> 3) 64M rows is also typically too high for the C++ datasets API
>> because, as Micah pointed out, we don't yet have support in the
>> datasets API for page-level column indices. This means that
>> statistics-based filtering is done at the row-group level and is very
>> coarse-grained. The bigger the block, the less likely it is that a
>> filter will eclipse the entire block.
>>
>> Points 2 & 3 above are (I'm fairly certain) entirely fixable. I've
>> found reasonable performance with 1M rows per row group and so I
>> haven't personally been as highly motivated to fix the latter two
>> issues, but they are somewhat high up on my personal priority list. If
>> anyone has time to devote to working on these issues I would be happy
>> to help someone get started. Ideally, if we can fix those two issues,
>> then something like what Micah described (one row group per file) is
>> fine and we can help shield users from frustrating parameter tuning.
>>
>> I have a draft of a partial fix for 2 at [1][2]. I expect I should be
>> able to get back to it before the 8.0.0 release. I couldn't find an
>> issue for the more complete fix (scanning at page resolution instead
>> of row-group resolution) so I created [3].
>>
>> A good read for the third point is at [4]. I couldn't find a JIRA
>> issue for this from a quick search, but I feel that we probably have
>> some issues somewhere.
>>
>> [1] https://github.com/apache/arrow/pull/12228
>> [2] https://issues.apache.org/jira/browse/ARROW-14510
>> [3] https://issues.apache.org/jira/browse/ARROW-15759
>> [4] https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
>>
>> On Tue, Feb 22, 2022 at 5:07 PM Shawn Zeng <[email protected]> wrote:
>> >
>> > Hi, thank you for your reply. The confusion comes from what Micah
>> > mentions: the row_group_size in pyarrow is 64M rows, instead of 64MB. In
>> > that case it does not align with the Hadoop block size unless you have
>> > only 1 byte per row, so in most cases the row group will be much larger
>> > than 64MB. I think the fact that this parameter is a number of rows
>> > instead of a size already brings confusion when I read the doc and
>> > change the parameter.
>> >
>> > I don't clearly understand why the current thinking is to have 1
>> > row group per file. Could you explain more?
>> >
>> > On Wed, Feb 23, 2022 at 3:52 AM Micah Kornfield <[email protected]> wrote:
>> >>>
>> >>> What is the reason for this? Do you plan to change the default?
>> >>
>> >> I think there is some confusion. I do believe this is the number of
>> >> rows, but I'd guess it was set to 64M because it wasn't carefully adapted
>> >> from parquet-mr, which I would guess uses byte size and therefore aligns
>> >> well with the HDFS block size.
>> >>
>> >> I don't recall seeing any open issues to change it. It looks like for
>> >> datasets [1] the default is 1 million, so maybe we should try to align
>> >> these two. I don't have a strong opinion here, but my impression is that
>> >> the current thinking is to generally have 1 row group per file and
>> >> eliminate entire files. For sub-file pruning, I think column indexes are a
>> >> better solution, but they have not been implemented in pyarrow yet.
>> >>
>> >> [1] https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html#pyarrow.dataset.write_dataset
>> >>
>> >> On Tue, Feb 22, 2022 at 4:50 AM Marnix van den Broek <[email protected]> wrote:
>> >>>
>> >>> Hi Shawn,
>> >>>
>> >>> I expect this is the default because Parquet comes from the Hadoop
>> >>> ecosystem, and the Hadoop block size is usually set to 64MB. Why would you
>> >>> need a different default? You can set it to the size that fits your use
>> >>> case best, right?
>> >>>
>> >>> Marnix
>> >>>
>> >>> On Tue, Feb 22, 2022 at 1:42 PM Shawn Zeng <[email protected]> wrote:
>> >>>>
>> >>>> For clarification, I am referring to pyarrow.parquet.write_table.
>> >>>>
>> >>>> On Tue, Feb 22, 2022 at 8:40 PM Shawn Zeng <[email protected]> wrote:
>> >>>>>
>> >>>>> Hi,
>> >>>>>
>> >>>>> The default row_group_size is really large, which means a large
>> >>>>> table smaller than 64M rows will not get the benefits of row-group-level
>> >>>>> statistics. What is the reason for this? Do you plan to change the default?
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Shawn
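
As a small footnote to Shawn's original point about row-group-level statistics: the footer metadata shows how a file was actually laid out and which statistics were written. A rough sketch, reading back the illustrative file from the snippet near the top of this mail:

    import pyarrow.parquet as pq

    # Inspect row-group layout and per-column statistics from the Parquet footer.
    md = pq.ParquetFile("example.parquet").metadata
    print("num_row_groups:", md.num_row_groups)

    for rg_idx in range(md.num_row_groups):
        rg = md.row_group(rg_idx)
        for col_idx in range(rg.num_columns):
            col = rg.column(col_idx)
            stats = col.statistics  # None if statistics were not written
            if stats is not None and stats.has_min_max:
                print(rg_idx, col.path_in_schema, stats.min, stats.max)

With the 64Mi-row default, a table smaller than that shows num_row_groups == 1, which is exactly the case where row-group statistics cannot prune anything.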
