[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file
[ https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17698051#comment-17698051 ]

ASF GitHub Bot commented on PARQUET-2228:
-----------------------------------------

vectorijk commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1129917664

## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java: ##
@@ -183,12 +189,69 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+          HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            LOG.error("Input files have different schemas, expect: {}, input: {}, current file: {}",
+              this.schema, inputFileSchema, inputFile);
+            throw new InvalidSchemaException("Input files have different schemas, current file: " + inputFile);
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);

Review Comment:
sounds good

> ParquetRewriter supports more than one input file
> -------------------------------------------------
>
>                 Key: PARQUET-2228
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2228
>             Project: Parquet
>          Issue Type: Sub-task
>          Components: parquet-mr
>            Reporter: Gang Wu
>            Assignee: Gang Wu
>            Priority: Major
>
> ParquetRewriter currently supports only one input file. The scope of this
> task is to support multiple input files and the rewriter merges them into a
> single one w/o some rewrite options specified.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file
[ https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17697669#comment-17697669 ]

ASF GitHub Bot commented on PARQUET-2228:
-----------------------------------------

wgtmac commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1128833713

## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java: ##
@@ -183,12 +189,69 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+          HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            LOG.error("Input files have different schemas, expect: {}, input: {}, current file: {}",
+              this.schema, inputFileSchema, inputFile);
+            throw new InvalidSchemaException("Input files have different schemas, current file: " + inputFile);
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);

Review Comment:
That sounds reasonable. Would you like to create a PR with your proposed change? @vectorijk
[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file
[ https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17697647#comment-17697647 ]

ASF GitHub Bot commented on PARQUET-2228:
-----------------------------------------

vectorijk commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1128712362

## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java: ##
@@ -183,12 +189,69 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+          HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            LOG.error("Input files have different schemas, expect: {}, input: {}, current file: {}",
+              this.schema, inputFileSchema, inputFile);
+            throw new InvalidSchemaException("Input files have different schemas, current file: " + inputFile);
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);

Review Comment:
Shall we close the connection opened by `TransParquetFileReader` when the exception is thrown here?

In `initNextReader`, shall we also consider closing the previous reader explicitly if it is not null?

A try-with-resources statement might not be suitable for closing the reader in this function.
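The leak described in this review comment can be illustrated without any Parquet dependency. The following is a minimal sketch, not the code from the PR: `ReaderOpenSketch`, `Opener`, and `openAll` are hypothetical names, and any `Closeable` stands in for `TransParquetFileReader`. try-with-resources does not fit because the readers must stay open after the method returns, so the sketch closes the already-opened readers only on the failure path.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

final class ReaderOpenSketch {

  /** Hypothetical factory; stands in for constructing a TransParquetFileReader. */
  interface Opener<R extends Closeable> {
    R open(String path) throws IOException;
  }

  /**
   * Opens every path in order and returns the open readers.
   * If any open fails, the readers opened so far are closed before the
   * exception propagates, so no file handle is leaked.
   */
  static <R extends Closeable> List<R> openAll(List<String> paths, Opener<R> opener) {
    if (paths == null || paths.isEmpty()) {
      throw new IllegalArgumentException("No input files");
    }
    List<R> opened = new ArrayList<>();
    for (String path : paths) {
      try {
        opened.add(opener.open(path));
      } catch (IOException e) {
        IllegalArgumentException failure =
            new IllegalArgumentException("Failed to open input file: " + path, e);
        for (R reader : opened) {
          try {
            reader.close();
          } catch (IOException closeError) {
            // Keep the open failure as the primary cause; record close errors.
            failure.addSuppressed(closeError);
          }
        }
        throw failure;
      }
    }
    return opened;
  }
}
```

The sketch only covers `IOException` thrown while opening. A runtime exception raised after a reader was opened (such as the schema mismatch in the hunk above) would need the same cleanup, including the reader that was just created.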
[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file
[ https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17697646#comment-17697646 ]

ASF GitHub Bot commented on PARQUET-2228:
-----------------------------------------

vectorijk commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1128712362

## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java: ##
@@ -183,12 +189,69 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+          HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            LOG.error("Input files have different schemas, expect: {}, input: {}, current file: {}",
+              this.schema, inputFileSchema, inputFile);
+            throw new InvalidSchemaException("Input files have different schemas, current file: " + inputFile);
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);

Review Comment:
Shall we close the connection opened by `TransParquetFileReader` when the exception is thrown here?

In `initNextReader`, shall we also consider closing the previous reader explicitly if it is not null?
[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file
[ https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691509#comment-17691509 ]

ASF GitHub Bot commented on PARQUET-2228:
-----------------------------------------

gszadovszky merged PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026
[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file
[ https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691508#comment-17691508 ]

ASF GitHub Bot commented on PARQUET-2228:
-----------------------------------------

gszadovszky commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1112880384

## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/RewriteOptions.java: ##
@@ -101,37 +103,121 @@ public static class Builder {
     private List encryptColumns;
     private FileEncryptionProperties fileEncryptionProperties;
 
+    /**
+     * Create a builder to create a RewriterOptions.
+     *
+     * @param conf configuration for reading from input files and writing to output file
+     * @param inputFile input file path to read from
+     * @param outputFile output file path to rewrite to
+     */
     public Builder(Configuration conf, Path inputFile, Path outputFile) {
       this.conf = conf;
       this.inputFiles = Arrays.asList(inputFile);
       this.outputFile = outputFile;
     }
 
+    /**
+     * Create a builder to create a RewriterOptions.
+     *
+     * Please note that if merging more than one file, the schema of all files must be the same.
+     * Otherwise, the rewrite will fail.
+     *
+     * The rewrite will keep original row groups from all input files. This may not be optimal

Review Comment:
Thank you, @wgtmac
[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file
[ https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691259#comment-17691259 ]

ASF GitHub Bot commented on PARQUET-2228:
-----------------------------------------

shangxinli commented on PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#issuecomment-1437355639

Let me know if you have more feedback, @ggershinsky @gszadovszky
[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file
[ https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690085#comment-17690085 ]

ASF GitHub Bot commented on PARQUET-2228:
-----------------------------------------

wgtmac commented on PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#issuecomment-1433984809

I saw the test failure below in the [GHA](https://github.com/apache/parquet-mr/actions/runs/4195487917/jobs/7275103509) run; the failing test is unstable:
```
Error: Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.637 s <<< FAILURE! - in org.apache.parquet.hadoop.TestParquetWriter
Error: testParquetFileWithBloomFilterWithFpp(org.apache.parquet.hadoop.TestParquetWriter) Time elapsed: 0.368 s <<< FAILURE!
java.lang.AssertionError
    at org.apache.parquet.hadoop.TestParquetWriter.testParquetFileWithBloomFilterWithFpp(TestParquetWriter.java:342)
```
[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file
[ https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689805#comment-17689805 ]

ASF GitHub Bot commented on PARQUET-2228:
-----------------------------------------

wgtmac commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1108635269

## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/RewriteOptions.java: ##
@@ -101,37 +103,121 @@ public static class Builder {
     private List encryptColumns;
     private FileEncryptionProperties fileEncryptionProperties;
 
+    /**
+     * Create a builder to create a RewriterOptions.
+     *
+     * @param conf configuration for reading from input files and writing to output file
+     * @param inputFile input file path to read from
+     * @param outputFile output file path to rewrite to
+     */
     public Builder(Configuration conf, Path inputFile, Path outputFile) {
       this.conf = conf;
       this.inputFiles = Arrays.asList(inputFile);
       this.outputFile = outputFile;
     }
 
+    /**
+     * Create a builder to create a RewriterOptions.
+     *
+     * Please note that if merging more than one file, the schema of all files must be the same.
+     * Otherwise, the rewrite will fail.
+     *
+     * The rewrite will keep original row groups from all input files. This may not be optimal

Review Comment:
I have added some comments to elaborate on the small row group problem. Please check, @gszadovszky
[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file
[ https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689138#comment-17689138 ]

ASF GitHub Bot commented on PARQUET-2228:
-----------------------------------------

wgtmac commented on PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#issuecomment-1431423625

> > You're right. We might add an option to force rewriting the input files record by record so row groups are regenerated by the writer. Does that sound good? @gszadovszky
>
> It sounds great, @wgtmac, but I am fine implementing it separately or in this PR as you prefer. However, we still need to highlight somehow to the user that in the other cases the user should not expect performance improvements in case of merging several files into one. (Moreover it'll increase the footer size which might also generate additional issues.) What do you think?

I agree. Let me add some comments to explain the issue at the moment.
[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file
[ https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689118#comment-17689118 ]

ASF GitHub Bot commented on PARQUET-2228:
-----------------------------------------

gszadovszky commented on PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#issuecomment-1431359196

> You're right. We might add an option to force rewriting the input files record by record so row groups are regenerated by the writer. Does that sound good? @gszadovszky

It sounds great, @wgtmac, but I am fine implementing it separately or in this PR as you prefer. However, we still need to highlight somehow to the user that in the other cases the user should not expect performance improvements in case of merging several files into one. (Moreover it'll increase the footer size which might also generate additional issues.) What do you think?
[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file
[ https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689050#comment-17689050 ]

ASF GitHub Bot commented on PARQUET-2228:
-----------------------------------------

wgtmac commented on PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#issuecomment-1431185993

> > @wgtmac, by supporting multiple files to rewrite them into one we will end up with the same number of row-groups, right? Therefore, this tool is not meant to be used to solve the "small files problem". I am highlighting this because we've had issues with users misunderstanding the purpose of features like this. Maybe, we should add some notes about it to the help of parquet-cli.
>
> You're right. We might add an option to force rewriting the input files record by record so row groups are regenerated by the writer. Does that sound good? @gszadovszky

Something like https://github.com/apache/parquet-mr/blob/master/parquet-cli/src/main/java/org/apache/parquet/cli/commands/ConvertCommand.java
[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file
[ https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689046#comment-17689046 ]

ASF GitHub Bot commented on PARQUET-2228:
-----------------------------------------

wgtmac commented on PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#issuecomment-1431182101

> @wgtmac, by supporting multiple files to rewrite them into one we will end up with the same number of row-groups, right? Therefore, this tool is not meant to be used to solve the "small files problem". I am highlighting this because we've had issues with users misunderstanding the purpose of features like this. Maybe, we should add some notes about it to the help of parquet-cli.

You're right. We might add an option to force rewriting the input files record by record so row groups are regenerated by the writer. Does that sound good?
[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file
[ https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689011#comment-17689011 ]

ASF GitHub Bot commented on PARQUET-2228:
-----------------------------------------

gszadovszky commented on PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#issuecomment-1431102177

@wgtmac, by supporting multiple files to rewrite them into one we will end up with the same number of row-groups, right? Therefore, this tool is not meant to be used to solve the "small files problem". I am highlighting this because we've had issues with users misunderstanding the purpose of features like this. Maybe, we should add some notes about it to the help of parquet-cli.
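The point about row groups can be made concrete with a toy model. This is only an illustration under simplifying assumptions, not Parquet code: `RowGroupCountSketch` and its methods are hypothetical names, a file is modeled as a list of row counts (one per row group), and the target size is expressed in rows although real writers size row groups in bytes. Appending row groups unchanged keeps their total number, while re-packing the rows record by record reduces it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RowGroupCountSketch {

  /** Appends the row groups of every input unchanged, as a block-level merge would. */
  static List<Integer> mergeKeepingRowGroups(List<List<Integer>> files) {
    List<Integer> merged = new ArrayList<>();
    for (List<Integer> rowGroups : files) {
      merged.addAll(rowGroups);
    }
    return merged;
  }

  /** Re-packs all rows into row groups holding at most targetRows rows each. */
  static List<Integer> rewriteRecordByRecord(List<List<Integer>> files, int targetRows) {
    if (targetRows <= 0) {
      throw new IllegalArgumentException("targetRows must be positive");
    }
    long totalRows = 0;
    for (List<Integer> rowGroups : files) {
      for (int rows : rowGroups) {
        totalRows += rows;
      }
    }
    List<Integer> repacked = new ArrayList<>();
    while (totalRows > 0) {
      int rows = (int) Math.min(targetRows, totalRows);
      repacked.add(rows);
      totalRows -= rows;
    }
    return repacked;
  }

  public static void main(String[] args) {
    // Three small input files with four row groups in total.
    List<List<Integer>> files = Arrays.asList(
        Arrays.asList(100), Arrays.asList(150), Arrays.asList(50, 200));
    System.out.println(mergeKeepingRowGroups(files));       // prints [100, 150, 50, 200]
    System.out.println(rewriteRecordByRecord(files, 1000)); // prints [500]
  }
}
```

In this model the merged file has one footer instead of three, but readers still see four small row groups, which is why the thread asks for a note that merging alone does not address the "small files problem".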
[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file
[ https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689007#comment-17689007 ]

ASF GitHub Bot commented on PARQUET-2228:
-----------------------------------------

gszadovszky commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1106931407

## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java: ##
@@ -183,12 +186,61 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+          HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            throw new InvalidSchemaException("Input files have different schemas");
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);
+      }
+    }
+
+    extraMetaData.put(ORIGINAL_CREATED_BY_KEY, String.join("\n", allOriginalCreatedBys));

Review Comment:
Sounds good to me. Thanks, @wgtmac.
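The last added line of the hunk in this message joins the `created_by` strings of all inputs with a newline. The hunk does not show how `allOriginalCreatedBys` is declared, so the following is only a sketch under the assumption that duplicates are collapsed before joining; `CreatedBySketch` and `joinCreatedBys` are hypothetical names, and `LinkedHashSet` is chosen here to keep first-seen order.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class CreatedBySketch {

  /** Collapses duplicate created_by values (first-seen order) and joins them with '\n'. */
  static String joinCreatedBys(List<String> createdBys) {
    Set<String> distinct = new LinkedHashSet<>(createdBys);
    return String.join("\n", distinct);
  }

  public static void main(String[] args) {
    // Two inputs written by the same writer yield a single entry.
    System.out.println(joinCreatedBys(Arrays.asList("writer-a", "writer-a")));
    // Inputs from different writers yield one line per distinct writer.
    System.out.println(joinCreatedBys(Arrays.asList("writer-a", "writer-b", "writer-a")));
  }
}
```

With a single distinct writer the joined value equals that writer's `created_by` string, which is the case the `validateCreatedBy` test elsewhere in this thread asserts.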
[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file
[ https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688778#comment-17688778 ]

ASF GitHub Bot commented on PARQUET-2228:
-----------------------------------------

wgtmac commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1106565210

## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java: ##
@@ -183,12 +186,61 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+          HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            throw new InvalidSchemaException("Input files have different schemas");

Review Comment:
Fixed

## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java: ##
@@ -183,12 +186,61 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+          HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            throw new InvalidSchemaException("Input files have different schemas");
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);
+      }
+    }
+
+    extraMetaData.put(ORIGINAL_CREATED_BY_KEY, String.join("\n", allOriginalCreatedBys));
+  }
+
+  // Routines to get reader of next input file.
+  // Returns true if there is a next file to read, false otherwise.

Review Comment:
Fixed

## parquet-hadoop/src/test/java/org/apache/parquet/hadoop/rewrite/ParquetRewriterTest.java: ##
@@ -484,15 +673,22 @@ private List getOffsets(TransParquetFileReader reader, ColumnChunkMetaData
   }
 
   private void validateCreatedBy() throws Exception {
-    FileMetaData inFMD = getFileMetaData(inputFile.getFileName(), null).getFileMetaData();
-    FileMetaData outFMD = getFileMetaData(outputFile, null).getFileMetaData();
+    Set<String> createdBySet = new HashSet<>();
+    for (EncryptionTestFile inputFile : inputFiles) {
+      ParquetMetadata pmd = getFileMetaData(inputFile.getFileName(), null);
+      createdBySet.add(pmd.getFileMetaData().getCreatedBy());
+      assertNull(pmd.getFileMetaData().getKeyValueMetaData().get(ParquetRewriter.ORIGINAL_CREATED_BY_KEY));
+    }
+    Object[] inputCreatedBys = createdBySet.toArray();
+    assertEquals(1, inputCreatedBys.length);
+    String inputCreatedBy = (String) inputCreatedBys[0];
 
-    assertEquals(inFMD.getCreatedBy(), outFMD.getCreatedBy());
-    assertNull(inFMD.getKeyValueMetaData().get(ParquetRewriter.ORIGINAL_CREATED_BY_KEY));
+    FileMetaData outFMD = getFileMetaData(outputFile, null).getFileMetaData();
+    assertEquals(inputCreatedBy, outFMD.getCreatedBy());
 
     String originalCreatedBy = outFMD.getKeyValueMetaData().get(ParquetRewriter.ORIGINAL_CREATED_BY_KEY);
     assertNotNull(originalCreatedBy);
-    assertEquals(inFMD.getCreatedBy(), originalCreatedBy);
+    assertEquals(inputCreatedBy, originalCreatedBy);
   }

Review Comment:
Fixed
[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file
    [ https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688499#comment-17688499 ]

ASF GitHub Bot commented on PARQUET-2228:
-----------------------------------------

ggershinsky commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1105785309

## parquet-hadoop/src/test/java/org/apache/parquet/hadoop/rewrite/ParquetRewriterTest.java:
##
@@ -484,15 +673,22 @@ private List getOffsets(TransParquetFileReader reader, ColumnChunkMetaData
   }
   private void validateCreatedBy() throws Exception {
-    FileMetaData inFMD = getFileMetaData(inputFile.getFileName(), null).getFileMetaData();
-    FileMetaData outFMD = getFileMetaData(outputFile, null).getFileMetaData();
+    Set createdBySet = new HashSet<>();
+    for (EncryptionTestFile inputFile : inputFiles) {
+      ParquetMetadata pmd = getFileMetaData(inputFile.getFileName(), null);
+      createdBySet.add(pmd.getFileMetaData().getCreatedBy());
+      assertNull(pmd.getFileMetaData().getKeyValueMetaData().get(ParquetRewriter.ORIGINAL_CREATED_BY_KEY));
+    }
+    Object[] inputCreatedBys = createdBySet.toArray();
+    assertEquals(1, inputCreatedBys.length);
+    String inputCreatedBy = (String) inputCreatedBys[0];
-    assertEquals(inFMD.getCreatedBy(), outFMD.getCreatedBy());
-    assertNull(inFMD.getKeyValueMetaData().get(ParquetRewriter.ORIGINAL_CREATED_BY_KEY));
+    FileMetaData outFMD = getFileMetaData(outputFile, null).getFileMetaData();
+    assertEquals(inputCreatedBy, outFMD.getCreatedBy());
     String originalCreatedBy = outFMD.getKeyValueMetaData().get(ParquetRewriter.ORIGINAL_CREATED_BY_KEY);
     assertNotNull(originalCreatedBy);
-    assertEquals(inFMD.getCreatedBy(), originalCreatedBy);
+    assertEquals(inputCreatedBy, originalCreatedBy);
   }

Review Comment:
   nit: empty line

## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +186,61 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+          HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            throw new InvalidSchemaException("Input files have different schemas");

Review Comment:
   maybe some info (eg file path)?

## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +186,61 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+          HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            throw new InvalidSchemaException("Input files have different schemas");
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);
+      }
+    }
+
+    extraMetaData.put(ORIGINAL_CREATED_BY_KEY, String.join("\n", allOriginalCreatedBys));
+  }
+
+  // Routines to get reader of next input file.
+  // Returns true if there is a next file to read, false otherwise.

Review Comment:
   this function returns void
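The request for "some info (eg file path)" in the schema-mismatch error can be illustrated with a small stand-alone sketch. This is not the parquet-mr code: schemas are modeled as plain strings rather than `MessageType`, `IllegalStateException` stands in for `InvalidSchemaException`, and the class and method names (`SchemaCheck`, `requireSameSchema`) are hypothetical. The message text follows the later revision of the PR, which appends the offending file to the exception message.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SchemaCheck {
    // Hypothetical helper (assumption, not parquet-mr API): the first file's
    // schema becomes the reference, and every later file must match it exactly.
    static String requireSameSchema(Map<String, String> schemaByPath) {
        if (schemaByPath == null || schemaByPath.isEmpty()) {
            throw new IllegalArgumentException("No input files");
        }
        String expected = null;
        for (Map.Entry<String, String> entry : schemaByPath.entrySet()) {
            if (expected == null) {
                expected = entry.getValue();
            } else if (!expected.equals(entry.getValue())) {
                // Naming the offending file tells the caller which input to fix.
                throw new IllegalStateException(
                    "Input files have different schemas, current file: " + entry.getKey());
            }
        }
        return expected;
    }

    public static void main(String[] args) {
        Map<String, String> inputs = new LinkedHashMap<>();
        inputs.put("a.parquet", "message m { required int32 id; }");
        inputs.put("b.parquet", "message m { required int64 id; }");
        try {
            requireSameSchema(inputs);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With the path in the message, a failed merge of many files points directly at the first file that deviates from the reference schema.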
[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file
    [ https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688227#comment-17688227 ]

ASF GitHub Bot commented on PARQUET-2228:
-----------------------------------------

wgtmac commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1105193381

## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +186,61 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {

Review Comment:
   It is tricky to set a max size, whether in terms of number of files or total file sizes. Based on my knowledge, the performance varies when input files have different number of rows, number of columns, number of blocks, block sizes, etc. So it would be better for the caller to do the planning.

> ParquetRewriter supports more than one input file
> -------------------------------------------------
>
>                 Key: PARQUET-2228
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2228
>             Project: Parquet
>          Issue Type: Sub-task
>          Components: parquet-mr
>            Reporter: Gang Wu
>            Assignee: Gang Wu
>            Priority: Major
>
> ParquetRewriter currently supports only one input file. The scope of this
> task is to support multiple input files and the rewriter merges them into a
> single one w/o some rewrite options specified.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
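As a rough illustration of what "the caller does the planning" could look like, the sketch below groups input files into batches under a byte budget before each batch is handed to a rewriter. Everything here is an assumption for illustration: the class `MergePlanner`, the method `planBatches`, and the choice of total bytes as the budget are hypothetical, and the review comment above notes that row counts, column counts, and block layout also affect performance, so a byte budget is only one possible proxy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MergePlanner {
    // Hypothetical caller-side planner (not part of parquet-mr): packs file
    // sizes, in order, into batches whose total stays at or under
    // maxBatchBytes. A single file larger than the budget gets its own batch.
    static List<List<Long>> planBatches(List<Long> fileSizes, long maxBatchBytes) {
        List<List<Long>> batches = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current batch if adding this file would exceed the budget.
            if (!current.isEmpty() && currentBytes + size > maxBatchBytes) {
                batches.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Long> sizes = Arrays.asList(40L, 50L, 30L, 200L, 10L);
        System.out.println(planBatches(sizes, 100L));
    }
}
```

Each resulting batch would then be passed to its own rewrite call, which keeps the limit policy outside the rewriter, as the comment suggests.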
[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file
    [ https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688224#comment-17688224 ]

ASF GitHub Bot commented on PARQUET-2228:
-----------------------------------------

wgtmac commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1105191438

## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +186,61 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+          HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            throw new InvalidSchemaException("Input files have different schemas");
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);
+      }
+    }
+
+    extraMetaData.put(ORIGINAL_CREATED_BY_KEY, String.join("\n", allOriginalCreatedBys));

Review Comment:
   > Do we do dedup?

   Yes, `allOriginalCreatedBys` is a HashSet which does the job.
[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file
    [ https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688030#comment-17688030 ]

ASF GitHub Bot commented on PARQUET-2228:
-----------------------------------------

shangxinli commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1104769279

## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +186,61 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {

Review Comment:
   Do we want to set a max size?
[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file
    [ https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688029#comment-17688029 ]

ASF GitHub Bot commented on PARQUET-2228:
-----------------------------------------

shangxinli commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1104767997

## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +186,61 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+          HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            throw new InvalidSchemaException("Input files have different schemas");
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);
+      }
+    }
+
+    extraMetaData.put(ORIGINAL_CREATED_BY_KEY, String.join("\n", allOriginalCreatedBys));

Review Comment:
   Do we do dedup?
[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file
    [ https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688001#comment-17688001 ]

ASF GitHub Bot commented on PARQUET-2228:
-----------------------------------------

wgtmac commented on PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#issuecomment-1428175723

   @ggershinsky @shangxinli PTAL, thanks!
[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file
    [ https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687985#comment-17687985 ]

ASF GitHub Bot commented on PARQUET-2228:
-----------------------------------------

wgtmac commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1104610792

## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +186,61 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+          HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            throw new InvalidSchemaException("Input files have different schemas");
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);
+      }
+    }
+
+    extraMetaData.put(ORIGINAL_CREATED_BY_KEY, String.join("\n", allOriginalCreatedBys));

Review Comment:
   We have discussed the issue to consolidate original created_by when merging several input files. Now I use a HashSet to make sure there is no duplicate and they are concatenated by `\n`. Does this sound good? @gszadovszky
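The created_by consolidation described in this comment (collect values in a set, then join with `\n`) can be sketched in isolation as follows. The helper name `mergeCreatedBys` and the placeholder writer strings are hypothetical. One deliberate difference from the PR is labeled in the code: the sketch uses a `LinkedHashSet` so the joined string has a predictable order, whereas the PR text says a `HashSet` is used, for which Java does not guarantee iteration order.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class CreatedByMerge {
    // Hypothetical helper (not part of parquet-mr): drops duplicate created_by
    // strings and joins the remaining ones with a newline.
    static String mergeCreatedBys(List<String> createdBys) {
        // Assumption: LinkedHashSet instead of the PR's HashSet, so that the
        // first-seen order is preserved and the result is deterministic.
        Set<String> unique = new LinkedHashSet<>(createdBys);
        return String.join("\n", unique);
    }

    public static void main(String[] args) {
        // Placeholder values standing in for real created_by strings.
        List<String> inputs = Arrays.asList("writer-A", "writer-B", "writer-A");
        System.out.println(mergeCreatedBys(inputs));
    }
}
```

When all inputs share one writer, the merged value is that single string, which is the case the `validateCreatedBy` test in this PR asserts on.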
[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file
    [ https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687977#comment-17687977 ]

ASF GitHub Bot commented on PARQUET-2228:
-----------------------------------------

wgtmac opened a new pull request, #1026:
URL: https://github.com/apache/parquet-mr/pull/1026

   ### Jira

   https://issues.apache.org/jira/browse/PARQUET-2228

   ### Tests

   - Refactor and add various cases to `ParquetRewriterTest` for merging files.

   ### Commits

   - RewriteOptions supports merging more than one input file.
   - Enforce schema equality of all input files.
   - ParquetRewriter supports various rewrite operations on several input files.