[ 
https://issues.apache.org/jira/browse/PARQUET-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17584533#comment-17584533
 ] 

ASF GitHub Bot commented on PARQUET-1115:
-----------------------------------------

NickCrews commented on PR #433:
URL: https://github.com/apache/parquet-mr/pull/433#issuecomment-1226667307

   It might be nice if we actually suggested an alternative instead of just 
saying "don't do this."
   
   You can see my solution at 
https://gist.github.com/NickCrews/7a47ef4083160011e8e533531d73428c.




> Warn users when misusing parquet-tools merge
> --------------------------------------------
>
>                 Key: PARQUET-1115
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1115
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Zoltan Ivanfi
>            Assignee: Nándor Kollár
>            Priority: Major
>             Fix For: 1.10.0
>
>
> To prevent users from using {{parquet-tools merge}} in scenarios where its 
> use is not practical, we should describe its limitations in the help text of 
> this command. Additionally, we should add a warning to the output of the 
> merge command if the original row groups are smaller than a threshold.
> Reasoning:
> Many users are tempted to use the new {{parquet-tools merge}} functionality, 
> because they want to achieve good performance and historically that has been 
> associated with large Parquet files. However, in practice Hive performance 
> won't change significantly after using {{parquet-tools merge}}, but Impala 
> performance will be much worse. The reason is that good performance comes 
> from large row groups (up to the HDFS block size), not from large files.
> However, {{parquet-tools merge}} does not merge row groups; it just places 
> them one after the other. It was intended to be used for Parquet files that 
> are already arranged in row groups of the desired size. When used to merge 
> many small files, the resulting file will still contain small row groups and 
> one loses most of the advantages of larger files (the only one that remains 
> is that it takes a single HDFS operation to read them).
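
The difference described above can be illustrated with a small model. This is only a sketch, not the parquet-tools implementation: files are represented as plain lists of row group sizes in bytes, and the function names, the 128 MiB target, and the 64 MiB warning threshold are all hypothetical values chosen for illustration. Real row group sizes after a rewrite also depend on encoding and compression, which this model ignores.

```python
from typing import List, Optional

MIB = 1024 * 1024  # bytes per mebibyte


def concat_merge(files: List[List[int]]) -> List[int]:
    """Model a merge that places existing row groups one after the other.

    Each input file is a list of row group sizes in bytes. The sizes are
    copied unchanged, so small row groups stay small.
    """
    merged: List[int] = []
    for row_groups in files:
        merged.extend(row_groups)
    return merged


def rewrite_merge(files: List[List[int]], target_size: int) -> List[int]:
    """Model a rewrite that repacks all data into row groups of up to
    target_size bytes (simplification: sizes are treated as additive)."""
    total = sum(sum(row_groups) for row_groups in files)
    full, rest = divmod(total, target_size)
    return [target_size] * full + ([rest] if rest else [])


def small_row_group_warning(row_groups: List[int], threshold: int) -> Optional[str]:
    """Return a warning message if any row group is below the threshold
    (a hypothetical value), otherwise None."""
    small = [size for size in row_groups if size < threshold]
    if not small:
        return None
    return (
        f"warning: {len(small)} of {len(row_groups)} row groups are smaller "
        f"than {threshold} bytes; merging will not improve read performance"
    )


if __name__ == "__main__":
    # 100 small files, each holding a single 1 MiB row group.
    files = [[1 * MIB] for _ in range(100)]
    print(len(concat_merge(files)), "row groups after concatenation")
    print(len(rewrite_merge(files, 128 * MIB)), "row group(s) after a rewrite")
    print(small_row_group_warning(concat_merge(files), 64 * MIB))
```

In this model, concatenating 100 one-row-group files still yields 100 small row groups, while a rewrite packs the same data into a single larger one. That is why the issue proposes warning the user rather than merging silently.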



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
