[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-03-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17698051#comment-17698051
 ] 

ASF GitHub Bot commented on PARQUET-2228:
-

vectorijk commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1129917664


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +189,69 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+            HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            LOG.error("Input files have different schemas, expect: {}, input: {}, current file: {}",
+                this.schema, inputFileSchema, inputFile);
+            throw new InvalidSchemaException("Input files have different schemas, current file: " + inputFile);
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);

Review Comment:
   sounds good 





> ParquetRewriter supports more than one input file
> -
>
> Key: PARQUET-2228
> URL: https://issues.apache.org/jira/browse/PARQUET-2228
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> ParquetRewriter currently supports only one input file. The scope of this 
> task is to support multiple input files and the rewriter merges them into a 
> single one w/o some rewrite options specified.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-03-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17697669#comment-17697669
 ] 

ASF GitHub Bot commented on PARQUET-2228:
-

wgtmac commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1128833713


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +189,69 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+            HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            LOG.error("Input files have different schemas, expect: {}, input: {}, current file: {}",
+                this.schema, inputFileSchema, inputFile);
+            throw new InvalidSchemaException("Input files have different schemas, current file: " + inputFile);
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);

Review Comment:
   That sounds reasonable. Would you like to create a PR with your proposed 
change? @vectorijk 







[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-03-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17697647#comment-17697647
 ] 

ASF GitHub Bot commented on PARQUET-2228:
-

vectorijk commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1128712362


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +189,69 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+            HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            LOG.error("Input files have different schemas, expect: {}, input: {}, current file: {}",
+                this.schema, inputFileSchema, inputFile);
+            throw new InvalidSchemaException("Input files have different schemas, current file: " + inputFile);
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);

Review Comment:
   Should we close the connection opened by `TransParquetFileReader` when the 
exception is thrown here?
   In `initNextReader`, should we also explicitly close the previous reader if 
it is not null? A try-with-resources statement might not be suitable for 
closing the reader in this function.
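
   The cleanup concern raised above can be sketched in isolation. The following 
is a minimal, self-contained illustration and not parquet-mr code: the class 
`ReaderCleanupSketch`, the `Opener` interface, and the `validator` callback are 
hypothetical names introduced only for this example. It assumes each reader is 
a `Closeable` and shows one way to close every reader opened so far when a 
later open or validation step fails, without try-with-resources (which would 
also close the readers on success).

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper for illustration only; not part of parquet-mr.
final class ReaderCleanupSketch {

  /** Hypothetical stand-in for whatever opens one input file. */
  interface Opener<T extends Closeable> {
    T open(String path) throws IOException;
  }

  /**
   * Opens and validates every path. If any step fails, all readers opened
   * so far (including the one being validated) are closed before rethrowing.
   */
  static <T extends Closeable> List<T> openAll(
      List<String> paths, Opener<T> opener, Consumer<T> validator) {
    List<T> opened = new ArrayList<>();
    try {
      for (String path : paths) {
        T reader = opener.open(path);
        opened.add(reader);        // track first, so a failed validation still closes it
        validator.accept(reader);  // may throw, e.g. on a schema mismatch
      }
      return opened;
    } catch (IOException | RuntimeException e) {
      for (T reader : opened) {
        try {
          reader.close();
        } catch (IOException closeFailure) {
          e.addSuppressed(closeFailure); // keep the original failure as the primary cause
        }
      }
      throw new IllegalArgumentException("Failed to open input files", e);
    }
  }
}
```

   On success the caller owns the returned readers and remains responsible for 
closing them, which is why try-with-resources does not fit here.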










[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-03-07 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17697646#comment-17697646
 ] 

ASF GitHub Bot commented on PARQUET-2228:
-

vectorijk commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1128712362


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +189,69 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+            HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            LOG.error("Input files have different schemas, expect: {}, input: {}, current file: {}",
+                this.schema, inputFileSchema, inputFile);
+            throw new InvalidSchemaException("Input files have different schemas, current file: " + inputFile);
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);

Review Comment:
   Should we close the connection opened by `TransParquetFileReader` when the 
exception is thrown here?
   In `initNextReader`, should we also explicitly close the previous reader if 
it is not null?







[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-02-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691509#comment-17691509
 ] 

ASF GitHub Bot commented on PARQUET-2228:
-

gszadovszky merged PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026






[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-02-21 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691508#comment-17691508
 ] 

ASF GitHub Bot commented on PARQUET-2228:
-

gszadovszky commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1112880384


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/RewriteOptions.java:
##
@@ -101,37 +103,121 @@ public static class Builder {
     private List encryptColumns;
     private FileEncryptionProperties fileEncryptionProperties;
 
+    /**
+     * Create a builder to create a RewriterOptions.
+     *
+     * @param conf       configuration for reading from input files and writing to output file
+     * @param inputFile  input file path to read from
+     * @param outputFile output file path to rewrite to
+     */
     public Builder(Configuration conf, Path inputFile, Path outputFile) {
       this.conf = conf;
       this.inputFiles = Arrays.asList(inputFile);
       this.outputFile = outputFile;
     }
 
+    /**
+     * Create a builder to create a RewriterOptions.
+     *
+     * Please note that if merging more than one file, the schema of all files must be the same.
+     * Otherwise, the rewrite will fail.
+     *
+     * The rewrite will keep original row groups from all input files. This may not be optimal

Review Comment:
   Thank you, @wgtmac 







[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-02-20 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17691259#comment-17691259
 ] 

ASF GitHub Bot commented on PARQUET-2228:
-

shangxinli commented on PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#issuecomment-1437355639

   Let me know if you have more feedback, @ggershinsky @gszadovszky.






[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-02-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690085#comment-17690085
 ] 

ASF GitHub Bot commented on PARQUET-2228:
-

wgtmac commented on PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#issuecomment-1433984809

   I saw a test failure below from the 
[GHA](https://github.com/apache/parquet-mr/actions/runs/4195487917/jobs/7275103509) 
which is unstable:
   ```
   Error:  Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.637 s <<< FAILURE! - in org.apache.parquet.hadoop.TestParquetWriter
   Error:  testParquetFileWithBloomFilterWithFpp(org.apache.parquet.hadoop.TestParquetWriter)  Time elapsed: 0.368 s  <<< FAILURE!
   java.lang.AssertionError
       at org.apache.parquet.hadoop.TestParquetWriter.testParquetFileWithBloomFilterWithFpp(TestParquetWriter.java:342)
   ```






[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-02-16 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689805#comment-17689805
 ] 

ASF GitHub Bot commented on PARQUET-2228:
-

wgtmac commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1108635269


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/RewriteOptions.java:
##
@@ -101,37 +103,121 @@ public static class Builder {
     private List encryptColumns;
     private FileEncryptionProperties fileEncryptionProperties;
 
+    /**
+     * Create a builder to create a RewriterOptions.
+     *
+     * @param conf       configuration for reading from input files and writing to output file
+     * @param inputFile  input file path to read from
+     * @param outputFile output file path to rewrite to
+     */
     public Builder(Configuration conf, Path inputFile, Path outputFile) {
       this.conf = conf;
       this.inputFiles = Arrays.asList(inputFile);
       this.outputFile = outputFile;
     }
 
+    /**
+     * Create a builder to create a RewriterOptions.
+     *
+     * Please note that if merging more than one file, the schema of all files must be the same.
+     * Otherwise, the rewrite will fail.
+     *
+     * The rewrite will keep original row groups from all input files. This may not be optimal

Review Comment:
   I have added a comment to elaborate on the small row group problem. Please 
check, @gszadovszky.







[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-02-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689138#comment-17689138
 ] 

ASF GitHub Bot commented on PARQUET-2228:
-

wgtmac commented on PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#issuecomment-1431423625

   > > You're right. We might add an option to force rewriting the input files 
record by record so row groups are regenerated by the writer. Does that sound 
good? @gszadovszky
   > 
   > It sounds great, @wgtmac, but I am fine implementing it separately or in 
this PR as you prefer. However, we still need to highlight somehow to the user 
that in the other cases the user should not expect performance improvements in 
case of merging several files into one. (Moreover it'll increase the footer 
size which might also generate additional issues.) What do you think?
   
   I agree. Let me add some comments to explain the issue at the moment.






[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-02-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689118#comment-17689118
 ] 

ASF GitHub Bot commented on PARQUET-2228:
-

gszadovszky commented on PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#issuecomment-1431359196

   > You're right. We might add an option to force rewriting the input files 
record by record so row groups are regenerated by the writer. Does that sound 
good? @gszadovszky
   
   It sounds great, @wgtmac, but I am fine implementing it separately or in 
this PR as you prefer. However, we still need to highlight somehow to the user 
that in the other cases the user should not expect performance improvements in 
case of merging several files into one. (Moreover it'll increase the footer 
size which might also generate additional issues.) What do you think?






[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-02-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689050#comment-17689050
 ] 

ASF GitHub Bot commented on PARQUET-2228:
-

wgtmac commented on PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#issuecomment-1431185993

   > > @wgtmac, by supporting multiple files to rewrite them into one we will 
end up with the same number of row-groups, right? Therefore, this tool is not 
meant to be used to solve the "small files problem". I am highlighting this 
because we've had issues with users misunderstanding the purpose of features 
like this. Maybe, we should add some notes about it to the help of parquet-cli.
   > 
   > You're right. We might add an option to force rewriting the input files 
record by record so row groups are regenerated by the writer. Does that sound 
good? @gszadovszky
   
   Something like 
https://github.com/apache/parquet-mr/blob/master/parquet-cli/src/main/java/org/apache/parquet/cli/commands/ConvertCommand.java
 






[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-02-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689046#comment-17689046
 ] 

ASF GitHub Bot commented on PARQUET-2228:
-

wgtmac commented on PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#issuecomment-1431182101

   > @wgtmac, by supporting multiple files to rewrite them into one we will end 
up with the same number of row-groups, right? Therefore, this tool is not meant 
to be used to solve the "small files problem". I am highlighting this because 
we've had issues with users misunderstanding the purpose of features like this. 
Maybe, we should add some notes about it to the help of parquet-cli.
   
   You're right. We might add an option to force rewriting the input files 
record by record so row groups are regenerated by the writer. Does that sound 
good?
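
   To make the distinction above concrete, here is a toy model rather than 
parquet-mr code. The class `RowGroupMergeSketch` and both method names are 
hypothetical, and each input file is represented only by the row counts of its 
row groups. It assumes, as a simplification, that a record-by-record rewrite 
fills row groups up to a fixed number of rows; real writers size row groups by 
bytes.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model for illustration only; not part of parquet-mr.
final class RowGroupMergeSketch {

  /** Block-level merge: row groups are copied as they are, so their counts add up. */
  static List<Long> mergeKeepingRowGroups(List<List<Long>> inputFiles) {
    List<Long> merged = new ArrayList<>();
    for (List<Long> rowGroupRowCounts : inputFiles) {
      merged.addAll(rowGroupRowCounts);
    }
    return merged;
  }

  /** Record-level rewrite: all rows are re-batched into row groups of up to targetRows rows. */
  static List<Long> rewriteRecordByRecord(List<List<Long>> inputFiles, long targetRows) {
    if (targetRows <= 0) {
      throw new IllegalArgumentException("targetRows must be positive");
    }
    long totalRows = 0;
    for (List<Long> rowGroupRowCounts : inputFiles) {
      for (long rows : rowGroupRowCounts) {
        totalRows += rows;
      }
    }
    List<Long> regenerated = new ArrayList<>();
    while (totalRows > 0) {
      long rows = Math.min(targetRows, totalRows);
      regenerated.add(rows);
      totalRows -= rows;
    }
    return regenerated;
  }
}
```

   In this model, merging three files that hold five small row groups still 
yields five row groups, while re-batching the same rows with a large enough 
target yields one.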






[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-02-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689011#comment-17689011
 ] 

ASF GitHub Bot commented on PARQUET-2228:
-

gszadovszky commented on PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#issuecomment-1431102177

   @wgtmac, by supporting multiple files to rewrite them into one we will end 
up with the same number of row-groups, right? Therefore, this tool is not meant 
to be used to solve the "small files problem". I am highlighting this because 
we've had issues with users misunderstanding the purpose of features like this. 
Maybe, we should add some notes about it to the help of parquet-cli.






[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-02-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689007#comment-17689007
 ] 

ASF GitHub Bot commented on PARQUET-2228:
-

gszadovszky commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1106931407


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +186,61 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+            HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            throw new InvalidSchemaException("Input files have different schemas");
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);
+      }
+    }
+
+    extraMetaData.put(ORIGINAL_CREATED_BY_KEY, String.join("\n", allOriginalCreatedBys));

Review Comment:
   Sounds good to me. Thanks, @wgtmac.







[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-02-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688778#comment-17688778
 ] 

ASF GitHub Bot commented on PARQUET-2228:
-

wgtmac commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1106565210


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +186,61 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+            HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            throw new InvalidSchemaException("Input files have different schemas");

Review Comment:
   Fixed



##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +186,61 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+            HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            throw new InvalidSchemaException("Input files have different schemas");
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);
+      }
+    }
+
+    extraMetaData.put(ORIGINAL_CREATED_BY_KEY, String.join("\n", allOriginalCreatedBys));
+  }
+
+  // Routines to get reader of next input file.
+  // Returns true if there is a next file to read, false otherwise.

Review Comment:
   Fixed



##
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/rewrite/ParquetRewriterTest.java:
##
@@ -484,15 +673,22 @@ private List getOffsets(TransParquetFileReader reader, ColumnChunkMetaData
   }
 
   private void validateCreatedBy() throws Exception {
-    FileMetaData inFMD = getFileMetaData(inputFile.getFileName(), null).getFileMetaData();
-    FileMetaData outFMD = getFileMetaData(outputFile, null).getFileMetaData();
+    Set<String> createdBySet = new HashSet<>();
+    for (EncryptionTestFile inputFile : inputFiles) {
+      ParquetMetadata pmd = getFileMetaData(inputFile.getFileName(), null);
+      createdBySet.add(pmd.getFileMetaData().getCreatedBy());
+      assertNull(pmd.getFileMetaData().getKeyValueMetaData().get(ParquetRewriter.ORIGINAL_CREATED_BY_KEY));
+    }
+    Object[] inputCreatedBys = createdBySet.toArray();
+    assertEquals(1, inputCreatedBys.length);
+    String inputCreatedBy = (String) inputCreatedBys[0];
 
-    assertEquals(inFMD.getCreatedBy(), outFMD.getCreatedBy());
-    assertNull(inFMD.getKeyValueMetaData().get(ParquetRewriter.ORIGINAL_CREATED_BY_KEY));
+    FileMetaData outFMD = getFileMetaData(outputFile, null).getFileMetaData();
+    assertEquals(inputCreatedBy, outFMD.getCreatedBy());
 
     String originalCreatedBy = outFMD.getKeyValueMetaData().get(ParquetRewriter.ORIGINAL_CREATED_BY_KEY);
     assertNotNull(originalCreatedBy);
-    assertEquals(inFMD.getCreatedBy(), originalCreatedBy);
+    assertEquals(inputCreatedBy, originalCreatedBy);
   }
 

Review Comment:
   Fixed





> ParquetRewriter supports more than one input file
> -
>
> Key: PARQUET-2228
> URL: https://issues.apache.org/jira/browse/PARQUET-2228
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Reporter: Gang Wu
>

[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-02-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688499#comment-17688499
 ] 

ASF GitHub Bot commented on PARQUET-2228:
-

ggershinsky commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1105785309


##
parquet-hadoop/src/test/java/org/apache/parquet/hadoop/rewrite/ParquetRewriterTest.java:
##
@@ -484,15 +673,22 @@ private List getOffsets(TransParquetFileReader reader, ColumnChunkMetaData
   }
 
   private void validateCreatedBy() throws Exception {
-    FileMetaData inFMD = getFileMetaData(inputFile.getFileName(), null).getFileMetaData();
-    FileMetaData outFMD = getFileMetaData(outputFile, null).getFileMetaData();
+    Set<String> createdBySet = new HashSet<>();
+    for (EncryptionTestFile inputFile : inputFiles) {
+      ParquetMetadata pmd = getFileMetaData(inputFile.getFileName(), null);
+      createdBySet.add(pmd.getFileMetaData().getCreatedBy());
+      assertNull(pmd.getFileMetaData().getKeyValueMetaData().get(ParquetRewriter.ORIGINAL_CREATED_BY_KEY));
+    }
+    Object[] inputCreatedBys = createdBySet.toArray();
+    assertEquals(1, inputCreatedBys.length);
+    String inputCreatedBy = (String) inputCreatedBys[0];
 
-    assertEquals(inFMD.getCreatedBy(), outFMD.getCreatedBy());
-    assertNull(inFMD.getKeyValueMetaData().get(ParquetRewriter.ORIGINAL_CREATED_BY_KEY));
+    FileMetaData outFMD = getFileMetaData(outputFile, null).getFileMetaData();
+    assertEquals(inputCreatedBy, outFMD.getCreatedBy());
 
     String originalCreatedBy = outFMD.getKeyValueMetaData().get(ParquetRewriter.ORIGINAL_CREATED_BY_KEY);
     assertNotNull(originalCreatedBy);
-    assertEquals(inFMD.getCreatedBy(), originalCreatedBy);
+    assertEquals(inputCreatedBy, originalCreatedBy);
   }
 

Review Comment:
   nit: empty line



##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +186,61 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+            HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            throw new InvalidSchemaException("Input files have different schemas");

Review Comment:
   maybe some info (eg file path)?
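
To illustrate the suggestion, here is a minimal, library-free sketch of a schema-equality check whose error message names the offending file. The class `SchemaCheckSketch`, the method `requireSameSchema`, and the use of plain strings for paths and schemas are assumptions made for this example; they are not part of parquet-mr.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in: schemas and paths are plain strings here, not Parquet types.
final class SchemaCheckSketch {
  // Returns the schema shared by all inputs, or throws naming the first file that differs.
  static String requireSameSchema(Map<String, String> schemaByPath) {
    String expected = null;
    for (Map.Entry<String, String> entry : schemaByPath.entrySet()) {
      if (expected == null) {
        // The first file seen defines the expected schema.
        expected = entry.getValue();
      } else if (!expected.equals(entry.getValue())) {
        // Put the file path in the message so the caller can tell which input is wrong.
        throw new IllegalArgumentException(
            "Input files have different schemas, current file: " + entry.getKey());
      }
    }
    return expected;
  }

  public static void main(String[] args) {
    Map<String, String> inputs = new LinkedHashMap<>();
    inputs.put("part-0.parquet", "message m { required int32 id; }");
    inputs.put("part-1.parquet", "message m { required int64 id; }");
    try {
      requireSameSchema(inputs);
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

With the path in the message, a failed merge of many inputs can be traced to a single file without re-reading every footer.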



##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +186,61 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+            HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            throw new InvalidSchemaException("Input files have different schemas");
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);
+      }
+    }
+
+    extraMetaData.put(ORIGINAL_CREATED_BY_KEY, String.join("\n", allOriginalCreatedBys));
+  }
+
+  // Routines to get reader of next input file.
+  // Returns true if there is a next file to read, false otherwise.

Review Comment:
   this function returns void






[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-02-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688227#comment-17688227
 ] 

ASF GitHub Bot commented on PARQUET-2228:
-

wgtmac commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1105193381


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +186,61 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {

Review Comment:
   It is tricky to set a max size, whether in terms of number of files or total 
file size. As far as I know, performance varies with the number of rows, 
columns, and blocks in the input files, their block sizes, etc., so it would 
be better for the caller to do the planning.
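
As a rough illustration of caller-side planning, the sketch below greedily groups input files into batches under a byte budget, so that each batch could be handed to a separate rewrite. Everything here is hypothetical: `RewritePlannerSketch`, `planBatches`, and the budget are invented for the example, and total file size is only one possible proxy given that row, column, and block counts also affect performance.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical caller-side planner; it only looks at file sizes in bytes.
final class RewritePlannerSketch {
  // Greedily packs sizes, in input order, into batches whose total stays within maxBatchBytes.
  // A single file larger than the budget ends up in a batch of its own.
  static List<List<Long>> planBatches(List<Long> fileSizes, long maxBatchBytes) {
    List<List<Long>> batches = new ArrayList<>();
    List<Long> current = new ArrayList<>();
    long currentBytes = 0;
    for (long size : fileSizes) {
      // Close the current batch if adding this file would exceed the budget.
      if (!current.isEmpty() && currentBytes + size > maxBatchBytes) {
        batches.add(current);
        current = new ArrayList<>();
        currentBytes = 0;
      }
      current.add(size);
      currentBytes += size;
    }
    if (!current.isEmpty()) {
      batches.add(current);
    }
    return batches;
  }

  public static void main(String[] args) {
    List<Long> sizes = Arrays.asList(40L, 50L, 30L, 200L, 10L);
    System.out.println(planBatches(sizes, 100L));
  }
}
```

A real planner would likely weigh more than byte counts, which is the reason given above for leaving the decision to the caller rather than hard-coding a limit in the rewriter.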





> ParquetRewriter supports more than one input file
> -
>
> Key: PARQUET-2228
> URL: https://issues.apache.org/jira/browse/PARQUET-2228
> Project: Parquet
>  Issue Type: Sub-task
>  Components: parquet-mr
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> ParquetRewriter currently supports only one input file. The scope of this 
> task is to support multiple input files and the rewriter merges them into a 
> single one w/o some rewrite options specified.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-02-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688224#comment-17688224
 ] 

ASF GitHub Bot commented on PARQUET-2228:
-

wgtmac commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1105191438


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +186,61 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+            HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            throw new InvalidSchemaException("Input files have different schemas");
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);
+      }
+    }
+
+    extraMetaData.put(ORIGINAL_CREATED_BY_KEY, String.join("\n", allOriginalCreatedBys));

Review Comment:
   > Do we do dedup?
   
   Yes, `allOriginalCreatedBys` is a HashSet which does the job. 







[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-02-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688030#comment-17688030
 ] 

ASF GitHub Bot commented on PARQUET-2228:
-

shangxinli commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1104769279


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +186,61 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {

Review Comment:
   Do we want to set a max size?







[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-02-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688029#comment-17688029
 ] 

ASF GitHub Bot commented on PARQUET-2228:
-

shangxinli commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1104767997


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +186,61 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+            HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            throw new InvalidSchemaException("Input files have different schemas");
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);
+      }
+    }
+
+    extraMetaData.put(ORIGINAL_CREATED_BY_KEY, String.join("\n", allOriginalCreatedBys));

Review Comment:
   Do we do dedup? 







[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-02-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17688001#comment-17688001
 ] 

ASF GitHub Bot commented on PARQUET-2228:
-

wgtmac commented on PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#issuecomment-1428175723

   @ggershinsky @shangxinli PTAL, thanks!






[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-02-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687985#comment-17687985
 ] 

ASF GitHub Bot commented on PARQUET-2228:
-

wgtmac commented on code in PR #1026:
URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1104610792


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java:
##
@@ -183,12 +186,61 @@ public ParquetRewriter(TransParquetFileReader reader,
     }
   }
 
+  // Open all input files to validate their schemas are compatible to merge
+  private void openInputFiles(List<Path> inputFiles, Configuration conf) {
+    Preconditions.checkArgument(inputFiles != null && !inputFiles.isEmpty(), "No input files");
+
+    for (Path inputFile : inputFiles) {
+      try {
+        TransParquetFileReader reader = new TransParquetFileReader(
+            HadoopInputFile.fromPath(inputFile, conf), HadoopReadOptions.builder(conf).build());
+        MessageType inputFileSchema = reader.getFooter().getFileMetaData().getSchema();
+        if (this.schema == null) {
+          this.schema = inputFileSchema;
+        } else {
+          // Now we enforce equality of schemas from input files for simplicity.
+          if (!this.schema.equals(inputFileSchema)) {
+            throw new InvalidSchemaException("Input files have different schemas");
+          }
+        }
+        this.allOriginalCreatedBys.add(reader.getFooter().getFileMetaData().getCreatedBy());
+        this.inputFiles.add(reader);
+      } catch (IOException e) {
+        throw new IllegalArgumentException("Failed to open input file: " + inputFile, e);
+      }
+    }
+
+    extraMetaData.put(ORIGINAL_CREATED_BY_KEY, String.join("\n", allOriginalCreatedBys));

Review Comment:
   We have discussed how to consolidate the original created_by values when 
merging several input files. I now use a HashSet to make sure there are no 
duplicates, and the values are concatenated with `\n`. Does this sound good? 
@gszadovszky
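
The dedup-then-join step can be sketched in isolation as follows. This is an illustration, not the PR's code: `CreatedByMergeSketch` and `mergeCreatedBy` are hypothetical names, the sample strings are made up, and the sketch uses a `LinkedHashSet` instead of the `HashSet` mentioned above so that the joined output has a predictable (first-seen) order.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper showing how duplicate created_by values collapse before joining.
final class CreatedByMergeSketch {
  // Drops duplicate created_by strings, then joins the remaining ones with '\n'.
  static String mergeCreatedBy(List<String> createdBys) {
    // LinkedHashSet dedups like HashSet but iterates in insertion order.
    Set<String> unique = new LinkedHashSet<>(createdBys);
    return String.join("\n", unique);
  }

  public static void main(String[] args) {
    // Made-up sample values; real created_by strings come from each file's footer.
    List<String> inputs = Arrays.asList("writer A 1.0", "writer B 2.0", "writer A 1.0");
    System.out.println(mergeCreatedBy(inputs));
  }
}
```

With a plain `HashSet` the set of values is the same, but the order of the lines in the joined string is not guaranteed.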







[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-02-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687977#comment-17687977
 ] 

ASF GitHub Bot commented on PARQUET-2228:
-

wgtmac opened a new pull request, #1026:
URL: https://github.com/apache/parquet-mr/pull/1026

   ### Jira
   
   https://issues.apache.org/jira/browse/PARQUET-2228
   
   ### Tests
   
   - Refactor and add various cases to `ParquetRewriterTest` for merging files.
   
   ### Commits
   
   - RewriteOptions supports merging more than one input file.
   - Enforce schema equality of all input files.
   - ParquetRewriter supports various rewrite operations on several input files.



