Based on this, aligning the default for the non-dataset write path to 1
million rows seems to make sense in the short term.
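
For anyone who wants the smaller row groups today, a minimal sketch of
setting it explicitly via the existing row_group_size option of
pyarrow.parquet.write_table (the table contents and file name below are
made up for illustration):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Hypothetical table, just to have something to write: 10M int64 rows.
    table = pa.table({"x": pa.array(range(10_000_000), type=pa.int64())})

    # Cap each row group at 1 million rows instead of relying on the
    # current 64M-row default of the non-dataset write path.
    pq.write_table(table, "example.parquet", row_group_size=1_000_000)

    # The file should now contain 10 row groups of 1M rows each.
    print(pq.ParquetFile("example.parquet").metadata.num_row_groups)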

On Tuesday, February 22, 2022, Shawn Zeng <[email protected]> wrote:

> Thanks for Weston's clear explanation. Point 2 is what I've experienced
> without tuning the parameter, and point 3 is what I was concerned about. I'm
> looking forward to finer granularity than the row group level for reading
> and indexing Parquet, which should fix the issue.
>
> On Wed, Feb 23, 2022 at 1:30 PM Weston Pace <[email protected]> wrote:
>
>> These are all great points.  A few notes from my own experiments
>> (mostly confirming what others have said):
>>
>>  1) 1M rows is the minimum safe size for row groups on HDD (and
>> perhaps a hair too low in some situations) if you are doing any kind
>> of column selection (i.e. projection pushdown; see the short sketch
>> after this list).  As that number gets lower the ratio of skips to
>> reads increases to the point where it starts to look too much like
>> "random read" to the HDD and performance suffers.
>>  2) 64M rows is too high for two (preventable) reasons in the C++
>> datasets API.  The first reason is that the datasets API does not
>> currently support sub-row-group streaming reads (e.g. we can only read
>> one row group at a time from the disk).  Large row groups lead to too
>> much initial latency (have to read an entire row group before we start
>> processing) and too much RAM usage.
>>  3) 64M rows is also typically too high for the C++ datasets API
>> because, as Micah pointed out, we don't yet have support in the
>> datasets API for page-level column indices.  This means that
>> statistics-based filtering is done at the row-group level and is very
>> coarse-grained.  The bigger the row group, the less likely it is that a
>> filter will rule out the entire row group.
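>>
>> For concreteness, a rough sketch of the column selection and row-group
>> statistics I'm describing above (file and column names are hypothetical,
>> and this is untested):
>>
>>     import pyarrow.parquet as pq
>>
>>     # Projection pushdown: only the requested column chunks are read from
>>     # disk, so small row groups mean many seeks between the chunks we keep.
>>     table = pq.read_table("example.parquet", columns=["timestamp", "value"])
>>
>>     # In the datasets API today, row-group min/max statistics are the only
>>     # pruning granularity: a filter can skip a row group only if the whole
>>     # group's min/max rules it out, which gets less likely as groups grow.
>>     md = pq.ParquetFile("example.parquet").metadata
>>     for i in range(md.num_row_groups):
>>         stats = md.row_group(i).column(0).statistics
>>         if stats is not None and stats.has_min_max:
>>             print(i, stats.min, stats.max)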
>>
>> Points 2 & 3 above are (I'm fairly certain) entirely fixable.  I've
>> found reasonable performance with 1M rows per row group and so I
>> haven't personally been as highly motivated to fix the latter two
>> issues but they are somewhat high up on my personal priority list.  If
>> anyone has time to devote to working on these issues I would be happy
>> to help someone get started.  Ideally, if we can fix those two issues,
>> then something like what Micah described (one row group per file) is
>> fine and we can help shield users from frustrating parameter tuning.
>>
>> I have a draft of a partial fix for 2 at [1][2].  I expect I should be
>> able to get back to it before the 8.0.0 release.  I couldn't find an
>> issue for the more complete fix (scanning at page-resolution instead
>> of row-group-resolution) so I created [3].
>>
>> A good read for the third point is at [4].  I couldn't find a JIRA
>> issue for this from a quick search but I feel that we probably have
>> some issues somewhere.
>>
>> [1] https://github.com/apache/arrow/pull/12228
>> [2] https://issues.apache.org/jira/browse/ARROW-14510
>> [3] https://issues.apache.org/jira/browse/ARROW-15759
>> [4] https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
>>
>> On Tue, Feb 22, 2022 at 5:07 PM Shawn Zeng <[email protected]> wrote:
>> >
>> > Hi, thank you for your reply. The confusion comes from what Micah
>> mentions: the row_group_size in pyarrow is 64M rows, not 64MB. In that
>> case it does not align with the Hadoop block size unless you have only 1
>> byte per row, so in most cases the row group will be much larger than
>> 64MB. I think the fact that this parameter is a number of rows rather
>> than a byte size already caused confusion when I read the docs and
>> changed the parameter.
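>> >
>> > As a concrete check (a made-up example just to show the rows-vs-bytes
>> > difference; the file name is hypothetical):
>> >
>> >     import pyarrow.parquet as pq
>> >
>> >     rg = pq.ParquetFile("example.parquet").metadata.row_group(0)
>> >     # num_rows is a row count, while total_byte_size is the uncompressed
>> >     # size in bytes, so a 64M-row group is usually far beyond 64MB.
>> >     print(rg.num_rows, rg.total_byte_size)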
>> >
>> > I don't clearly understand why the current thinking is to have 1
>> row group per file. Could you explain more?
>> >
>> > On Wed, Feb 23, 2022 at 3:52 AM Micah Kornfield <[email protected]> wrote:
>> >>>
>> >>> What is the reason for this? Do you plan to change the default?
>> >>
>> >>
>> >> I think there is some confusion. I do believe this is a number of
>> rows, but I'd guess it was set to 64M because it wasn't carefully adapted
>> from parquet-mr, which I would guess uses a byte size and therefore aligns
>> well with the HDFS block size.
>> >>
>> >> I don't recall seeing any open issues to change it.  It looks like for
>> datasets [1] the default is 1 million, so maybe we should try to align
>> these two.  I don't have a strong opinion here, but my impression is that
>> the current thinking is to generally have 1 row-group per file, and
>> eliminate entire files.  For sub-file pruning, I think column indexes are a
>> better solution but they have not been implemented in pyarrow yet.
>> >>
>> >> [1] https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html#pyarrow.dataset.write_dataset
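>> >>
>> >> For reference, a rough sketch of the dataset-side write path and its
>> >> row-group knobs (keyword names as in recent pyarrow releases per [1];
>> >> the table and output directory below are made up):
>> >>
>> >>     import pyarrow as pa
>> >>     import pyarrow.dataset as ds
>> >>
>> >>     # Hypothetical table, just to have something to write.
>> >>     table = pa.table({"x": pa.array(range(5_000_000))})
>> >>
>> >>     # max_rows_per_group defaults to 1024*1024 here; shown explicitly.
>> >>     ds.write_dataset(
>> >>         table,
>> >>         "out_dir",
>> >>         format="parquet",
>> >>         max_rows_per_group=1024 * 1024,
>> >>     )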
>> >>
>> >> On Tue, Feb 22, 2022 at 4:50 AM Marnix van den Broek <
>> [email protected]> wrote:
>> >>>
>> >>> hi Shawn,
>> >>>
>> >>> I expect this is the default because Parquet comes from the Hadoop
>> ecosystem, and the Hadoop block size is usually set to 64MB. Why would you
>> need a different default? You can set it to the size that fits your use
>> case best, right?
>> >>>
>> >>> Marnix
>> >>>
>> >>>
>> >>>
>> >>> On Tue, Feb 22, 2022 at 1:42 PM Shawn Zeng <[email protected]> wrote:
>> >>>>
>> >>>> For a clarification, I am referring to pyarrow.parquet.write_table
>> >>>>
>> >>>> On Tue, Feb 22, 2022 at 8:40 PM Shawn Zeng <[email protected]> wrote:
>> >>>>>
>> >>>>> Hi,
>> >>>>>
>> >>>>> The default row_group_size is really large, which means even a large
>> table with fewer than 64M rows will not get the benefits of row-group-level
>> statistics. What is the reason for this? Do you plan to change the default?
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Shawn
>>
>
