These are all great points.  A few notes from my own experiments
(mostly confirming what others have said):

 1) 1M rows is about the minimum safe row-group size on HDD (and
perhaps a hair too low in some situations) if you are doing any kind
of column selection (i.e. projection pushdown).  As that number gets
lower, the ratio of skips to reads increases to the point where the
access pattern starts to look too much like random reads to the HDD,
and performance suffers.
 2) 64M rows is too high for two (preventable) reasons in the C++
datasets API.  The first reason is that the datasets API does not
currently support sub-row-group streaming reads (i.e. we can only
read one row group at a time from the disk).  Large row groups lead
to too much initial latency (we have to read an entire row group
before we start processing) and too much RAM usage.
 3) 64M rows is also typically too high for the C++ datasets API
because, as Micah pointed out, we don't yet have support in the
datasets API for page-level column indices.  This means that
statistics-based filtering is done at the row-group level and is very
coarse grained.  The bigger the row group, the less likely it is that
a filter can exclude the entire row group (see the pyarrow sketch
after this list).
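
To make the tuning concrete, here is a rough pyarrow sketch of mine
(not something from this thread; the file name, column names, and
sizes are made up) of writing with ~1M-row row groups and then
reading with a projection and a filter:

import numpy as np
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Toy table standing in for real data.
n = 3_000_000
table = pa.table({"id": np.arange(n), "value": np.random.rand(n)})

# Point 1: ~1M rows per row group (the write_table default is 64M rows).
pq.write_table(table, "example.parquet", row_group_size=1_000_000)

# Points 1 & 3: the projection only reads the requested column, and the
# filter can currently only skip data at row-group granularity, so
# smaller row groups give the row-group statistics more chances to
# exclude data.
dataset = ds.dataset("example.parquet", format="parquet")
subset = dataset.to_table(columns=["value"],
                          filter=ds.field("id") < 1_000_000)
print(subset.num_rows)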

Points 2 & 3 above are (I'm fairly certain) entirely fixable.  I've
found reasonable performance with 1M rows per row group, so I haven't
personally been highly motivated to fix those two issues, but they
are fairly high on my personal priority list.  If anyone has time to
devote to working on them I would be happy to help them get started.
Ideally, if we can fix those two issues, then something like what
Micah described (one row group per file) is fine and we can help
shield users from frustrating parameter tuning.
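
For reference, here is a minimal sketch (again mine, not from this
thread; the chunk size, paths, and helper name are just for
illustration) of the one-row-group-per-file layout as it can be
written today with pyarrow.parquet.write_table:

import pyarrow as pa
import pyarrow.parquet as pq

def write_one_row_group_per_file(table: pa.Table, rows_per_file: int,
                                 prefix: str):
    # Split the table into file-sized chunks and write each chunk as a
    # single row group (row_group_size >= chunk length).
    for i, start in enumerate(range(0, table.num_rows, rows_per_file)):
        chunk = table.slice(start, rows_per_file)
        pq.write_table(chunk, f"{prefix}-{i:05d}.parquet",
                       row_group_size=chunk.num_rows)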

I have a draft of a partial fix for point 2 at [1][2].  I expect to
be able to get back to it before the 8.0.0 release.  I couldn't find
an issue for the more complete fix (scanning at page resolution
instead of row-group resolution) so I created [3].

A good read on the third point is [4].  I couldn't find a JIRA issue
for it from a quick search, but I suspect we have some issues
somewhere.

[1] https://github.com/apache/arrow/pull/12228
[2] https://issues.apache.org/jira/browse/ARROW-14510
[3] https://issues.apache.org/jira/browse/ARROW-15759
[4] https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/

On Tue, Feb 22, 2022 at 5:07 PM Shawn Zeng <[email protected]> wrote:
>
> Hi, thank you for your reply. The confusion comes from what Micah mentioned:
> the row_group_size in pyarrow is 64M rows, not 64MB. In that case it does not
> align with the Hadoop block size unless you have only 1 byte per row, so in
> most cases the row group will be much larger than 64MB. I think the fact that
> this parameter takes a number of rows instead of a size already caused
> confusion when I read the docs and changed the parameter.
>
> I don't clearly understand why the current thinking is to have 1 row group
> per file. Could you explain more?
>
> Micah Kornfield <[email protected]> wrote on Wed, Feb 23, 2022 at 03:52:
>>>
>>> What is the reason for this? Do you plan to change the default?
>>
>>
>> I think there is some confusion.  I do believe this is a number of rows,
>> but I'd guess it was set to 64M because it wasn't carefully adapted from
>> parquet-mr, which I would guess uses a byte size and therefore aligns well
>> with the HDFS block size.
>>
>> I don't recall seeing any open issues to change it.  It looks like for
>> datasets [1] the default is 1 million, so maybe we should try to align
>> these two.  I don't have a strong opinion here, but my impression is that
>> the current thinking is to generally have one row group per file and
>> eliminate entire files.  For sub-file pruning, I think column indexes are a
>> better solution, but they have not been implemented in pyarrow yet.
>>
>> [1] 
>> https://arrow.apache.org/docs/python/generated/pyarrow.dataset.write_dataset.html#pyarrow.dataset.write_dataset
>>
>> On Tue, Feb 22, 2022 at 4:50 AM Marnix van den Broek 
>> <[email protected]> wrote:
>>>
>>> hi Shawn,
>>>
>>> I expect this is the default because Parquet comes from the Hadoop 
>>> ecosystem, and the Hadoop block size is usually set to 64MB. Why would you 
>>> need a different default? You can set it to the size that fits your use 
>>> case best, right?
>>>
>>> Marnix
>>>
>>>
>>>
>>> On Tue, Feb 22, 2022 at 1:42 PM Shawn Zeng <[email protected]> wrote:
>>>>
>>>> For a clarification, I am referring to pyarrow.parquet.write_table
>>>>
>>>> Shawn Zeng <[email protected]> wrote on Tue, Feb 22, 2022 at 20:40:
>>>>>
>>>>> Hi,
>>>>>
>>>>> The default row_group_size is really large, which means that even a
>>>>> fairly large table with fewer than 64M rows will not get the benefit of
>>>>> row-group-level statistics. What is the reason for this? Do you plan to
>>>>> change the default?
>>>>>
>>>>> Thanks,
>>>>> Shawn
