[ https://issues.apache.org/jira/browse/PARQUET-1115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17584533#comment-17584533 ]
ASF GitHub Bot commented on PARQUET-1115:
-----------------------------------------

NickCrews commented on PR #433:
URL: https://github.com/apache/parquet-mr/pull/433#issuecomment-1226667307

   It might be nice if we actually suggested an alternative instead of just saying "don't do this." You can see my solution at https://gist.github.com/NickCrews/7a47ef4083160011e8e533531d73428c.

> Warn users when misusing parquet-tools merge
> --------------------------------------------
>
>                 Key: PARQUET-1115
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1115
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Zoltan Ivanfi
>            Assignee: Nándor Kollár
>            Priority: Major
>             Fix For: 1.10.0
>
>
> To prevent users from using {{parquet-tools merge}} in scenarios where its
> use is not practical, we should describe its limitations in the help text of
> this command. Additionally, we should add a warning to the output of the
> merge command if the size of the original row groups is below a threshold.
> Reasoning:
> Many users are tempted to use the new {{parquet-tools merge}} functionality,
> because they want to achieve good performance and historically that has been
> associated with large Parquet files. However, in practice Hive performance
> won't change significantly after using {{parquet-tools merge}}, but Impala
> performance will be much worse. The reason for that is that good performance
> is not a result of large files but of large row groups instead (up to the HDFS
> block size).
> However, {{parquet-tools merge}} does not merge row groups, it just places
> them one after the other. It was intended to be used for Parquet files that
> are already arranged in row groups of the desired size. When used to merge
> many small files, the resulting file will still contain small row groups and
> one loses most of the advantages of larger files (the only one that remains
> is that it takes a single HDFS operation to read them).
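
To make the distinction in the issue description concrete, here is a minimal sketch in plain Python. It is a toy model, not parquet-tools or parquet-mr code: a "file" is just a list of row groups and a row group is a list of rows. The function names (`concat_merge`, `rewrite_merge`, `small_row_group_warning`) and the row-count threshold are hypothetical, chosen only for illustration; a real check would more likely look at row group size in bytes.

```python
from typing import Dict, List, Optional

# Toy model (assumption): a row group is a list of rows,
# and a Parquet "file" is a list of row groups.
Row = Dict[str, int]
RowGroup = List[Row]
ParquetFile = List[RowGroup]


def concat_merge(files: List[ParquetFile]) -> ParquetFile:
    """Mimic the behaviour described in the issue: the row groups of the
    input files are placed one after the other, unchanged."""
    return [row_group for f in files for row_group in f]


def rewrite_merge(files: List[ParquetFile], rows_per_group: int) -> ParquetFile:
    """Hypothetical alternative: read every row and re-chunk the rows
    into row groups of the desired size."""
    rows = [row for f in files for row_group in f for row in row_group]
    return [rows[i:i + rows_per_group] for i in range(0, len(rows), rows_per_group)]


def small_row_group_warning(file: ParquetFile, min_rows: int) -> Optional[str]:
    """Return a warning message if any row group is below the threshold,
    otherwise None. Row counts stand in for byte sizes here."""
    small = sum(1 for row_group in file if len(row_group) < min_rows)
    if small:
        return f"warning: {small} of {len(file)} row groups have fewer than {min_rows} rows"
    return None


# 100 small input files, each holding a single row group of 10 rows.
small_files = [[[{"id": f * 10 + i} for i in range(10)]] for f in range(100)]

concatenated = concat_merge(small_files)
rewritten = rewrite_merge(small_files, rows_per_group=500)

print(len(concatenated), len(concatenated[0]))  # row group count, rows in first group
print(len(rewritten), len(rewritten[0]))
print(small_row_group_warning(concatenated, min_rows=100))
print(small_row_group_warning(rewritten, min_rows=100))
```

In this model the concatenated file still holds 100 row groups of 10 rows each, so the warning fires, while the rewritten file holds 2 row groups of 500 rows. The rewrite has to touch every row, which is the cost a plain concatenation avoids.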
-- This message was sent by Atlassian Jira (v8.20.10#820010)