[jira] [Updated] (PARQUET-2226) Support union Bloom Filter

2023-01-12 Thread Mars (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mars updated PARQUET-2226:
--
Description: 
We need to collect the Parquet bloom filters of multiple files, and then
synthesize a more comprehensive bloom filter for common use.
Guava supports a similar API operation:

https://guava.dev/releases/31.0.1-jre/api/docs/src-html/com/google/common/hash/BloomFilter.html#line.252

  was:
We need to collect the Parquet bloom filters of multiple files, and then
synthesize a more comprehensive bloom filter for common use.
Guava supports a similar API operation:

https://guava.dev/releases/31.0.1-jre/api/docs/com/google/common/hash/BloomFilter.html#putAll(com.google.common.hash.BloomFilter)


> Support union Bloom Filter
> --
>
> Key: PARQUET-2226
> URL: https://issues.apache.org/jira/browse/PARQUET-2226
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Mars
>Priority: Major
>
> We need to collect the Parquet bloom filters of multiple files, and then
> synthesize a more comprehensive bloom filter for common use.
> Guava supports a similar API operation:
> https://guava.dev/releases/31.0.1-jre/api/docs/src-html/com/google/common/hash/BloomFilter.html#line.252
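
[Editor's note] For readers unfamiliar with the Guava operation cited above, a minimal, self-contained sketch of the union semantics being requested; this uses Guava's API directly, and a Parquet-side equivalent would merge the bitsets of filters built with identical parameters:

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class BloomFilterUnionSketch {
  public static void main(String[] args) {
    // Filters are only mergeable when built with the same funnel,
    // expected insertions, and false-positive probability.
    BloomFilter<CharSequence> merged =
        BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);
    BloomFilter<CharSequence> perFile =
        BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);
    perFile.put("some-row-key");

    // putAll is effectively a bitwise OR of the underlying bitsets: afterwards,
    // merged answers mightContain for everything either filter has seen.
    if (merged.isCompatible(perFile)) {
      merged.putAll(perFile);
    }
    System.out.println(merged.mightContain("some-row-key")); // true
  }
}
```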



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-2226) Support union Bloom Filter

2023-01-12 Thread Mars (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mars updated PARQUET-2226:
--
Description: 
We need to collect the Parquet bloom filters of multiple files, and then
synthesize a more comprehensive bloom filter for common use.
Guava supports a similar API operation:

https://guava.dev/releases/31.0.1-jre/api/docs/com/google/common/hash/BloomFilter.html#putAll(com.google.common.hash.BloomFilter)

  was:
We need to collect the Parquet bloom filters of multiple files, and then
synthesize a more comprehensive bloom filter for common use.
Guava supports a similar API operation:

https://guava.dev/releases/31.0.1-jre/api/docs/com/google/common/hash/BloomFilter.html#putAll(com.google.common.hash.BloomFilter)


> Support union Bloom Filter
> --
>
> Key: PARQUET-2226
> URL: https://issues.apache.org/jira/browse/PARQUET-2226
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Mars
>Priority: Major
>
> We need to collect the Parquet bloom filters of multiple files, and then
> synthesize a more comprehensive bloom filter for common use.
> Guava supports a similar API operation:
> https://guava.dev/releases/31.0.1-jre/api/docs/com/google/common/hash/BloomFilter.html#putAll(com.google.common.hash.BloomFilter)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-2226) Support union Bloom Filter

2023-01-12 Thread Mars (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mars updated PARQUET-2226:
--
Description: 
We need to collect the Parquet bloom filters of multiple files, and then
synthesize a more comprehensive bloom filter for common use.
Guava supports a similar API operation:

https://guava.dev/releases/31.0.1-jre/api/docs/com/google/common/hash/BloomFilter.html#putAll(com.google.common.hash.BloomFilter)

  was:
We need to collect the Parquet bloom filters of multiple files, and then
synthesize a more comprehensive bloom filter for common use.
Guava supports a similar API operation:
https://guava.dev/releases/31.0.1-jre/api/docs/com/google/common/hash/BloomFilter.html#putAll(com.google.common.hash.BloomFilter)


> Support union Bloom Filter
> --
>
> Key: PARQUET-2226
> URL: https://issues.apache.org/jira/browse/PARQUET-2226
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Mars
>Priority: Major
>
> We need to collect the Parquet bloom filters of multiple files, and then
> synthesize a more comprehensive bloom filter for common use.
> Guava supports a similar API operation:
> https://guava.dev/releases/31.0.1-jre/api/docs/com/google/common/hash/BloomFilter.html#putAll(com.google.common.hash.BloomFilter)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-2226) Support union Bloom Filter

2023-01-12 Thread Mars (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mars updated PARQUET-2226:
--
Description: 
We need to collect the Parquet bloom filters of multiple files, and then
synthesize a more comprehensive bloom filter for common use.
Guava supports a similar API operation:
https://guava.dev/releases/31.0.1-jre/api/docs/com/google/common/hash/BloomFilter.html#putAll(com.google.common.hash.BloomFilter)

  was:
We need to collect the Parquet bloom filters of multiple files, and then
synthesize a more comprehensive bloom filter for our own use.
Guava supports a similar API operation:
https://guava.dev/releases/31.0.1-jre/api/docs/com/google/common/hash/BloomFilter.html#putAll(com.google.common.hash.BloomFilter)


> Support union Bloom Filter
> --
>
> Key: PARQUET-2226
> URL: https://issues.apache.org/jira/browse/PARQUET-2226
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Mars
>Priority: Major
>
> We need to collect the Parquet bloom filters of multiple files, and then
> synthesize a more comprehensive bloom filter for common use.
> Guava supports a similar API operation:
> https://guava.dev/releases/31.0.1-jre/api/docs/com/google/common/hash/BloomFilter.html#putAll(com.google.common.hash.BloomFilter)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-2226) Support union Bloom Filter

2023-01-12 Thread Mars (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mars updated PARQUET-2226:
--
Summary: Support union Bloom Filter  (was: Support union Bloom Filter 
operation)

> Support union Bloom Filter
> --
>
> Key: PARQUET-2226
> URL: https://issues.apache.org/jira/browse/PARQUET-2226
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Mars
>Priority: Major
>
> We need to collect the Parquet bloom filters of multiple files, and then
> synthesize a more comprehensive bloom filter for our own use.
> Guava supports a similar API operation:
> https://guava.dev/releases/31.0.1-jre/api/docs/com/google/common/hash/BloomFilter.html#putAll(com.google.common.hash.BloomFilter)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (PARQUET-2210) [C++] Skip pages based on header metadata using a callback

2023-01-12 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield reassigned PARQUET-2210:


Assignee: fatemah

> [C++] Skip pages based on header metadata using a callback
> --
>
> Key: PARQUET-2210
> URL: https://issues.apache.org/jira/browse/PARQUET-2210
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: fatemah
>Assignee: fatemah
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-11.0.0
>
>  Time Spent: 20.5h
>  Remaining Estimate: 0h
>
> Currently, we do not expose the page header metadata, and it cannot be used
> for skipping pages. I propose exposing the metadata through a callback that
> would allow the caller to decide whether to read or skip a page based
> on the metadata. The signature of the callback would be the following:
> std::function<bool(const DataPageStats& stats)> skip_page_callback
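
[Editor's note] To make the proposed protocol concrete, below is a hedged sketch rendered in Java for consistency with the rest of this digest; the actual proposal targets parquet-cpp with a C++ std::function, and every name here (the DataPageStats fields, PageReaderSketch) is illustrative rather than the real API:

```java
import java.util.function.Predicate;

// Illustrative stand-in for the page-header statistics the callback would see.
final class DataPageStats {
  final long numValues;   // value count from the page header
  final byte[] minBytes;  // encoded min statistic, if present
  final byte[] maxBytes;  // encoded max statistic, if present

  DataPageStats(long numValues, byte[] minBytes, byte[] maxBytes) {
    this.numValues = numValues;
    this.minBytes = minBytes;
    this.maxBytes = maxBytes;
  }
}

class PageReaderSketch {
  // The caller supplies a predicate; returning true means "skip this page",
  // so the reader never decompresses or decodes it.
  void readColumn(Iterable<DataPageStats> pageHeaders, Predicate<DataPageStats> skipPage) {
    for (DataPageStats stats : pageHeaders) {
      if (skipPage.test(stats)) {
        continue; // page skipped purely on header metadata
      }
      // ... decompress and decode the page ...
    }
  }
}
```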



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2210) [C++] Skip pages based on header metadata using a callback

2023-01-12 Thread Micah Kornfield (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Micah Kornfield resolved PARQUET-2210.
--
Fix Version/s: cpp-11.0.0
   Resolution: Fixed

Issue resolved by pull request 14603
https://github.com/apache/arrow/pull/14603

> [C++] Skip pages based on header metadata using a callback
> --
>
> Key: PARQUET-2210
> URL: https://issues.apache.org/jira/browse/PARQUET-2210
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-cpp
>Reporter: fatemah
>Priority: Major
>  Labels: pull-request-available
> Fix For: cpp-11.0.0
>
>  Time Spent: 20.5h
>  Remaining Estimate: 0h
>
> Currently, we do not expose the page header metadata, and it cannot be used
> for skipping pages. I propose exposing the metadata through a callback that
> would allow the caller to decide whether to read or skip a page based
> on the metadata. The signature of the callback would be the following:
> std::function<bool(const DataPageStats& stats)> skip_page_callback



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2103) crypto exception in print toPrettyJSON

2023-01-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17676041#comment-17676041
 ] 

ASF GitHub Bot commented on PARQUET-2103:
-

wgtmac commented on code in PR #1019:
URL: https://github.com/apache/parquet-mr/pull/1019#discussion_r1068164918


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/FileMetaData.java:
##
@@ -71,7 +79,7 @@ public MessageType getSchema() {
 
   @Override
   public String toString() {
-return "FileMetaData{schema: "+schema+ ", metadata: " + keyValueMetaData + 
"}";
+return "FileMetaData{schema: "+schema+ ", metadata: " + keyValueMetaData + 
", encryption: " + encryptionType + "}";

Review Comment:
   ```suggestion
   return "FileMetaData{schema: " + schema + ", metadata: " + 
keyValueMetaData + ", encryption: " + encryptionType + "}";
   ```



##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/FileMetaData.java:
##
@@ -50,16 +51,23 @@ public final class FileMetaData implements Serializable {
* @throws NullPointerException if schema or keyValueMetaData is {@code null}
*/
   public FileMetaData(MessageType schema, Map<String, String> keyValueMetaData, String createdBy) {
-this(schema, keyValueMetaData, createdBy, null);
+this(schema, keyValueMetaData, createdBy, EncryptionType.UNENCRYPTED, null);
   }
-  
-  public FileMetaData(MessageType schema, Map<String, String> keyValueMetaData, String createdBy, InternalFileDecryptor fileDecryptor) {
+
+  public FileMetaData(MessageType schema, Map<String, String> keyValueMetaData, String createdBy,
+  InternalFileDecryptor fileDecryptor) {
+this(schema, keyValueMetaData, createdBy, null, fileDecryptor);

Review Comment:
   ```suggestion
   this(schema, keyValueMetaData, createdBy, EncryptionType.UNENCRYPTED, fileDecryptor);
   ```





> crypto exception in print toPrettyJSON
> --
>
> Key: PARQUET-2103
> URL: https://issues.apache.org/jira/browse/PARQUET-2103
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.0, 1.12.1, 1.12.2, 1.12.3
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Minor
>
> In debug mode, this code
> if (LOG.isDebugEnabled()) {
>   LOG.debug(ParquetMetadata.toPrettyJSON(parquetMetadata));
> }
> called in
> org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata()
>
> in encrypted files with plaintext footer
> triggers an exception:
>
> Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: [id]. Null File Decryptor
>     at org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMetaData.java:602) ~[parquet-hadoop-1.12.0.jar:1.12.0]
>     at org.apache.parquet.hadoop.metadata.ColumnChunkMetaData.getEncodingStats(ColumnChunkMetaData.java:353) ~[parquet-hadoop-1.12.0.jar:1.12.0]
>     at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
>     at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
>     at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
>     at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
>     at shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:689) ~[parquet-jackson-1.12.0.jar:1.12.0]
>     at shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755) ~[parquet-jackson-1.12.0.jar:1.12.0]
>     at shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178) ~[parquet-jackson-1.12.0.jar:1.12.0]
>     at shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119) ~[parquet-jackson-1.12.0.jar:1.12.0]
>     at shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79) ~[parquet-jackson-1.12.0.jar:1.12.0]
>     at shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18) ~[parquet-jackson-1.12.0.jar:1.12.0]
>     at shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728) ~[parquet-jackson-1.12.0.jar:1.12.0]
>     at shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755) ~[parquet-jackson-1.12.0.jar:1.12.0]
>     at
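
[Editor's note] As context for the fix under review, a minimal sketch of a defensive variant of the logging call above; this illustrates the failure mode only and is not the change adopted in PR #1019, which instead records the encryption type in FileMetaData:

```java
import org.apache.parquet.crypto.ParquetCryptoRuntimeException;
import org.apache.parquet.hadoop.metadata.ParquetMetadata;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class FooterDebugLogging {
  private static final Logger LOG = LoggerFactory.getLogger(FooterDebugLogging.class);

  // With a plaintext footer, column chunk metadata may still be encrypted;
  // serializing it to JSON without a file decryptor throws, so the debug
  // log is guarded instead of letting the exception escape.
  static void logFooter(ParquetMetadata parquetMetadata) {
    if (LOG.isDebugEnabled()) {
      try {
        LOG.debug(ParquetMetadata.toPrettyJSON(parquetMetadata));
      } catch (ParquetCryptoRuntimeException e) {
        LOG.debug("Footer metadata is not fully printable without column decryptors", e);
      }
    }
  }
}
```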

[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1019: PARQUET-2103: Fix crypto exception in print toPrettyJSON

2023-01-12 Thread GitBox


wgtmac commented on code in PR #1019:
URL: https://github.com/apache/parquet-mr/pull/1019#discussion_r1068164918


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/FileMetaData.java:
##
@@ -71,7 +79,7 @@ public MessageType getSchema() {
 
   @Override
   public String toString() {
-return "FileMetaData{schema: "+schema+ ", metadata: " + keyValueMetaData + 
"}";
+return "FileMetaData{schema: "+schema+ ", metadata: " + keyValueMetaData + 
", encryption: " + encryptionType + "}";

Review Comment:
   ```suggestion
   return "FileMetaData{schema: " + schema + ", metadata: " + 
keyValueMetaData + ", encryption: " + encryptionType + "}";
   ```



##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/FileMetaData.java:
##
@@ -50,16 +51,23 @@ public final class FileMetaData implements Serializable {
* @throws NullPointerException if schema or keyValueMetaData is {@code null}
*/
   public FileMetaData(MessageType schema, Map<String, String> keyValueMetaData, String createdBy) {
-this(schema, keyValueMetaData, createdBy, null);
+this(schema, keyValueMetaData, createdBy, EncryptionType.UNENCRYPTED, null);
   }
-  
-  public FileMetaData(MessageType schema, Map<String, String> keyValueMetaData, String createdBy, InternalFileDecryptor fileDecryptor) {
+
+  public FileMetaData(MessageType schema, Map<String, String> keyValueMetaData, String createdBy,
+  InternalFileDecryptor fileDecryptor) {
+this(schema, keyValueMetaData, createdBy, null, fileDecryptor);

Review Comment:
   ```suggestion
   this(schema, keyValueMetaData, createdBy, EncryptionType.UNENCRYPTED, fileDecryptor);
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2075) Unified Rewriter Tool

2023-01-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17676033#comment-17676033
 ] 

ASF GitHub Bot commented on PARQUET-2075:
-

wgtmac commented on code in PR #1014:
URL: https://github.com/apache/parquet-mr/pull/1014#discussion_r1068159808


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/RewriteOptions.java:
##
@@ -0,0 +1,178 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop.rewrite;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.Preconditions;
+import org.apache.parquet.crypto.FileEncryptionProperties;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+
+import java.util.List;
+import java.util.Map;
+
+// A set of options to create a ParquetRewriter.
+public class RewriteOptions {
+
+  final Configuration conf;
+  final Path inputFile;
+  final Path outputFile;
+  final List<String> pruneColumns;
+  final CompressionCodecName newCodecName;
+  final Map<String, MaskMode> maskColumns;
+  final List<String> encryptColumns;
+  final FileEncryptionProperties fileEncryptionProperties;
+
+  private RewriteOptions(Configuration conf,
+ Path inputFile,
+ Path outputFile,
+ List<String> pruneColumns,
+ CompressionCodecName newCodecName,
+ Map<String, MaskMode> maskColumns,
+ List<String> encryptColumns,
+ FileEncryptionProperties fileEncryptionProperties) {
+this.conf = conf;
+this.inputFile = inputFile;
+this.outputFile = outputFile;
+this.pruneColumns = pruneColumns;
+this.newCodecName = newCodecName;
+this.maskColumns = maskColumns;
+this.encryptColumns = encryptColumns;
+this.fileEncryptionProperties = fileEncryptionProperties;
+  }
+
+  public Configuration getConf() {
+return conf;
+  }
+
+  public Path getInputFile() {
+return inputFile;
+  }
+
+  public Path getOutputFile() {
+return outputFile;
+  }
+
+  public List<String> getPruneColumns() {
+return pruneColumns;
+  }
+
+  public CompressionCodecName getNewCodecName() {
+return newCodecName;
+  }
+
+  public Map<String, MaskMode> getMaskColumns() {
+return maskColumns;
+  }
+
+  public List<String> getEncryptColumns() {
+return encryptColumns;
+  }
+
+  public FileEncryptionProperties getFileEncryptionProperties() {
+return fileEncryptionProperties;
+  }
+
+  // Builder to create a RewriterOptions.
+  public static class Builder {
+private Configuration conf;
+private Path inputFile;
+private Path outputFile;
+private List<String> pruneColumns;
+private CompressionCodecName newCodecName;
+private Map<String, MaskMode> maskColumns;
+private List<String> encryptColumns;
+private FileEncryptionProperties fileEncryptionProperties;
+
+public Builder(Configuration conf, Path inputFile, Path outputFile) {
+  this.conf = conf;
+  this.inputFile = inputFile;
+  this.outputFile = outputFile;
+}
+
+public Builder prune(List<String> columns) {
+  this.pruneColumns = columns;
+  return this;
+}
+
+public Builder transform(CompressionCodecName newCodecName) {
+  this.newCodecName = newCodecName;
+  return this;
+}
+
+public Builder mask(Map<String, MaskMode> maskColumns) {
+  this.maskColumns = maskColumns;
+  return this;
+}
+
+public Builder encrypt(List<String> encryptColumns) {
+  this.encryptColumns = encryptColumns;
+  return this;
+}
+
+public Builder encryptionProperties(FileEncryptionProperties fileEncryptionProperties) {
+  this.fileEncryptionProperties = fileEncryptionProperties;
+  return this;
+}
+
+public RewriteOptions build() {
+  Preconditions.checkArgument(inputFile != null, "Input file is required");
+  Preconditions.checkArgument(outputFile != null, "Output file is 
required");
+
+  if (pruneColumns != null) {
+if (maskColumns != null) {
+  for (String pruneColumn : pruneColumns) {
+Preconditions.checkArgument(!maskColumns.containsKey(pruneColumn),
+"Cannot prune and 

[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1014: PARQUET-2075: Implement unified file rewriter

2023-01-12 Thread GitBox


wgtmac commented on code in PR #1014:
URL: https://github.com/apache/parquet-mr/pull/1014#discussion_r1068159808


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/RewriteOptions.java:
##
@@ -0,0 +1,178 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop.rewrite;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.Preconditions;
+import org.apache.parquet.crypto.FileEncryptionProperties;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+
+import java.util.List;
+import java.util.Map;
+
+// A set of options to create a ParquetRewriter.
+public class RewriteOptions {
+
+  final Configuration conf;
+  final Path inputFile;
+  final Path outputFile;
+  final List<String> pruneColumns;
+  final CompressionCodecName newCodecName;
+  final Map<String, MaskMode> maskColumns;
+  final List<String> encryptColumns;
+  final FileEncryptionProperties fileEncryptionProperties;
+
+  private RewriteOptions(Configuration conf,
+ Path inputFile,
+ Path outputFile,
+ List<String> pruneColumns,
+ CompressionCodecName newCodecName,
+ Map<String, MaskMode> maskColumns,
+ List<String> encryptColumns,
+ FileEncryptionProperties fileEncryptionProperties) {
+this.conf = conf;
+this.inputFile = inputFile;
+this.outputFile = outputFile;
+this.pruneColumns = pruneColumns;
+this.newCodecName = newCodecName;
+this.maskColumns = maskColumns;
+this.encryptColumns = encryptColumns;
+this.fileEncryptionProperties = fileEncryptionProperties;
+  }
+
+  public Configuration getConf() {
+return conf;
+  }
+
+  public Path getInputFile() {
+return inputFile;
+  }
+
+  public Path getOutputFile() {
+return outputFile;
+  }
+
+  public List<String> getPruneColumns() {
+return pruneColumns;
+  }
+
+  public CompressionCodecName getNewCodecName() {
+return newCodecName;
+  }
+
+  public Map<String, MaskMode> getMaskColumns() {
+return maskColumns;
+  }
+
+  public List<String> getEncryptColumns() {
+return encryptColumns;
+  }
+
+  public FileEncryptionProperties getFileEncryptionProperties() {
+return fileEncryptionProperties;
+  }
+
+  // Builder to create a RewriterOptions.
+  public static class Builder {
+private Configuration conf;
+private Path inputFile;
+private Path outputFile;
+private List<String> pruneColumns;
+private CompressionCodecName newCodecName;
+private Map<String, MaskMode> maskColumns;
+private List<String> encryptColumns;
+private FileEncryptionProperties fileEncryptionProperties;
+
+public Builder(Configuration conf, Path inputFile, Path outputFile) {
+  this.conf = conf;
+  this.inputFile = inputFile;
+  this.outputFile = outputFile;
+}
+
+public Builder prune(List<String> columns) {
+  this.pruneColumns = columns;
+  return this;
+}
+
+public Builder transform(CompressionCodecName newCodecName) {
+  this.newCodecName = newCodecName;
+  return this;
+}
+
+public Builder mask(Map<String, MaskMode> maskColumns) {
+  this.maskColumns = maskColumns;
+  return this;
+}
+
+public Builder encrypt(List<String> encryptColumns) {
+  this.encryptColumns = encryptColumns;
+  return this;
+}
+
+public Builder encryptionProperties(FileEncryptionProperties fileEncryptionProperties) {
+  this.fileEncryptionProperties = fileEncryptionProperties;
+  return this;
+}
+
+public RewriteOptions build() {
+  Preconditions.checkArgument(inputFile != null, "Input file is required");
+  Preconditions.checkArgument(outputFile != null, "Output file is 
required");
+
+  if (pruneColumns != null) {
+if (maskColumns != null) {
+  for (String pruneColumn : pruneColumns) {
+Preconditions.checkArgument(!maskColumns.containsKey(pruneColumn),
+"Cannot prune and mask same column");
+  }
+}
+
+if (encryptColumns != null) {
+  for (String pruneColumn : pruneColumns) {
+Preconditions.checkArgument(!encryptColumns.contains(pruneColumn),
+"Cannot prune and encrypt same column");
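
[Editor's note] For readers following the review, a short usage sketch of the RewriteOptions builder shown above; the file paths and column name are hypothetical, and the resulting options object would be handed to the new ParquetRewriter:

```java
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.parquet.hadoop.rewrite.RewriteOptions;

public class RewriteOptionsExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Prune one column and transcode data pages to ZSTD; build() enforces
    // that a pruned column is not also masked or encrypted.
    RewriteOptions options =
        new RewriteOptions.Builder(conf, new Path("input.parquet"), new Path("output.parquet"))
            .prune(Arrays.asList("debug_payload"))
            .transform(CompressionCodecName.ZSTD)
            .build();
  }
}
```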

[jira] [Updated] (PARQUET-2226) Support union Bloom Filter operation

2023-01-12 Thread Mars (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mars updated PARQUET-2226:
--
Description: 
We need to collect the Parquet bloom filters of multiple files, and then
synthesize a more comprehensive bloom filter for our own use.
Guava supports a similar API operation:
https://guava.dev/releases/31.0.1-jre/api/docs/com/google/common/hash/BloomFilter.html#putAll(com.google.common.hash.BloomFilter)

> Support union Bloom Filter operation
> 
>
> Key: PARQUET-2226
> URL: https://issues.apache.org/jira/browse/PARQUET-2226
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Mars
>Priority: Major
>
> We need to collect the Parquet bloom filters of multiple files, and then
> synthesize a more comprehensive bloom filter for our own use.
> Guava supports a similar API operation:
> https://guava.dev/releases/31.0.1-jre/api/docs/com/google/common/hash/BloomFilter.html#putAll(com.google.common.hash.BloomFilter)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (PARQUET-2226) Support union Bloom Filter operation

2023-01-12 Thread Mars (Jira)
Mars created PARQUET-2226:
-

 Summary: Support union Bloom Filter operation
 Key: PARQUET-2226
 URL: https://issues.apache.org/jira/browse/PARQUET-2226
 Project: Parquet
  Issue Type: Improvement
Reporter: Mars






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2075) Unified Rewriter Tool

2023-01-12 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17675904#comment-17675904
 ] 

ASF GitHub Bot commented on PARQUET-2075:
-

ggershinsky commented on code in PR #1014:
URL: https://github.com/apache/parquet-mr/pull/1014#discussion_r1067893929


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/RewriteOptions.java:
##
@@ -0,0 +1,178 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop.rewrite;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.Preconditions;
+import org.apache.parquet.crypto.FileEncryptionProperties;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+
+import java.util.List;
+import java.util.Map;
+
+// A set of options to create a ParquetRewriter.
+public class RewriteOptions {
+
+  final Configuration conf;
+  final Path inputFile;
+  final Path outputFile;
+  final List<String> pruneColumns;
+  final CompressionCodecName newCodecName;
+  final Map<String, MaskMode> maskColumns;
+  final List<String> encryptColumns;
+  final FileEncryptionProperties fileEncryptionProperties;
+
+  private RewriteOptions(Configuration conf,
+ Path inputFile,
+ Path outputFile,
+ List<String> pruneColumns,
+ CompressionCodecName newCodecName,
+ Map<String, MaskMode> maskColumns,
+ List<String> encryptColumns,
+ FileEncryptionProperties fileEncryptionProperties) {
+this.conf = conf;
+this.inputFile = inputFile;
+this.outputFile = outputFile;
+this.pruneColumns = pruneColumns;
+this.newCodecName = newCodecName;
+this.maskColumns = maskColumns;
+this.encryptColumns = encryptColumns;
+this.fileEncryptionProperties = fileEncryptionProperties;
+  }
+
+  public Configuration getConf() {
+return conf;
+  }
+
+  public Path getInputFile() {
+return inputFile;
+  }
+
+  public Path getOutputFile() {
+return outputFile;
+  }
+
+  public List<String> getPruneColumns() {
+return pruneColumns;
+  }
+
+  public CompressionCodecName getNewCodecName() {
+return newCodecName;
+  }
+
+  public Map<String, MaskMode> getMaskColumns() {
+return maskColumns;
+  }
+
+  public List<String> getEncryptColumns() {
+return encryptColumns;
+  }
+
+  public FileEncryptionProperties getFileEncryptionProperties() {
+return fileEncryptionProperties;
+  }
+
+  // Builder to create a RewriterOptions.
+  public static class Builder {
+private Configuration conf;
+private Path inputFile;
+private Path outputFile;
+private List<String> pruneColumns;
+private CompressionCodecName newCodecName;
+private Map<String, MaskMode> maskColumns;
+private List<String> encryptColumns;
+private FileEncryptionProperties fileEncryptionProperties;
+
+public Builder(Configuration conf, Path inputFile, Path outputFile) {
+  this.conf = conf;
+  this.inputFile = inputFile;
+  this.outputFile = outputFile;
+}
+
+public Builder prune(List<String> columns) {
+  this.pruneColumns = columns;
+  return this;
+}
+
+public Builder transform(CompressionCodecName newCodecName) {
+  this.newCodecName = newCodecName;
+  return this;
+}
+
+public Builder mask(Map<String, MaskMode> maskColumns) {
+  this.maskColumns = maskColumns;
+  return this;
+}
+
+public Builder encrypt(List<String> encryptColumns) {
+  this.encryptColumns = encryptColumns;
+  return this;
+}
+
+public Builder encryptionProperties(FileEncryptionProperties fileEncryptionProperties) {
+  this.fileEncryptionProperties = fileEncryptionProperties;
+  return this;
+}
+
+public RewriteOptions build() {
+  Preconditions.checkArgument(inputFile != null, "Input file is required");
+  Preconditions.checkArgument(outputFile != null, "Output file is 
required");
+
+  if (pruneColumns != null) {
+if (maskColumns != null) {
+  for (String pruneColumn : pruneColumns) {
+Preconditions.checkArgument(!maskColumns.containsKey(pruneColumn),
+"Cannot prune 

[GitHub] [parquet-mr] ggershinsky commented on a diff in pull request #1014: PARQUET-2075: Implement unified file rewriter

2023-01-12 Thread GitBox


ggershinsky commented on code in PR #1014:
URL: https://github.com/apache/parquet-mr/pull/1014#discussion_r1067893929


##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/RewriteOptions.java:
##
@@ -0,0 +1,178 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.parquet.hadoop.rewrite;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.Path;
+import org.apache.parquet.Preconditions;
+import org.apache.parquet.crypto.FileEncryptionProperties;
+import org.apache.parquet.hadoop.metadata.CompressionCodecName;
+
+import java.util.List;
+import java.util.Map;
+
+// A set of options to create a ParquetRewriter.
+public class RewriteOptions {
+
+  final Configuration conf;
+  final Path inputFile;
+  final Path outputFile;
+  final List<String> pruneColumns;
+  final CompressionCodecName newCodecName;
+  final Map<String, MaskMode> maskColumns;
+  final List<String> encryptColumns;
+  final FileEncryptionProperties fileEncryptionProperties;
+
+  private RewriteOptions(Configuration conf,
+ Path inputFile,
+ Path outputFile,
+ List<String> pruneColumns,
+ CompressionCodecName newCodecName,
+ Map<String, MaskMode> maskColumns,
+ List<String> encryptColumns,
+ FileEncryptionProperties fileEncryptionProperties) {
+this.conf = conf;
+this.inputFile = inputFile;
+this.outputFile = outputFile;
+this.pruneColumns = pruneColumns;
+this.newCodecName = newCodecName;
+this.maskColumns = maskColumns;
+this.encryptColumns = encryptColumns;
+this.fileEncryptionProperties = fileEncryptionProperties;
+  }
+
+  public Configuration getConf() {
+return conf;
+  }
+
+  public Path getInputFile() {
+return inputFile;
+  }
+
+  public Path getOutputFile() {
+return outputFile;
+  }
+
+  public List<String> getPruneColumns() {
+return pruneColumns;
+  }
+
+  public CompressionCodecName getNewCodecName() {
+return newCodecName;
+  }
+
+  public Map<String, MaskMode> getMaskColumns() {
+return maskColumns;
+  }
+
+  public List<String> getEncryptColumns() {
+return encryptColumns;
+  }
+
+  public FileEncryptionProperties getFileEncryptionProperties() {
+return fileEncryptionProperties;
+  }
+
+  // Builder to create a RewriterOptions.
+  public static class Builder {
+private Configuration conf;
+private Path inputFile;
+private Path outputFile;
+private List<String> pruneColumns;
+private CompressionCodecName newCodecName;
+private Map<String, MaskMode> maskColumns;
+private List<String> encryptColumns;
+private FileEncryptionProperties fileEncryptionProperties;
+
+public Builder(Configuration conf, Path inputFile, Path outputFile) {
+  this.conf = conf;
+  this.inputFile = inputFile;
+  this.outputFile = outputFile;
+}
+
+public Builder prune(List<String> columns) {
+  this.pruneColumns = columns;
+  return this;
+}
+
+public Builder transform(CompressionCodecName newCodecName) {
+  this.newCodecName = newCodecName;
+  return this;
+}
+
+public Builder mask(Map<String, MaskMode> maskColumns) {
+  this.maskColumns = maskColumns;
+  return this;
+}
+
+public Builder encrypt(List<String> encryptColumns) {
+  this.encryptColumns = encryptColumns;
+  return this;
+}
+
+public Builder encryptionProperties(FileEncryptionProperties fileEncryptionProperties) {
+  this.fileEncryptionProperties = fileEncryptionProperties;
+  return this;
+}
+
+public RewriteOptions build() {
+  Preconditions.checkArgument(inputFile != null, "Input file is required");
+  Preconditions.checkArgument(outputFile != null, "Output file is 
required");
+
+  if (pruneColumns != null) {
+if (maskColumns != null) {
+  for (String pruneColumn : pruneColumns) {
+Preconditions.checkArgument(!maskColumns.containsKey(pruneColumn),
+"Cannot prune and mask same column");
+  }
+}
+
+if (encryptColumns != null) {
+  for (String pruneColumn : pruneColumns) {
+Preconditions.checkArgument(!encryptColumns.contains(pruneColumn),
+"Cannot prune and encrypt same column");
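
[Editor's note] The check above pairs pruning with encryption. A hedged sketch of the encryption path through the same builder, with hypothetical column names and a toy footer key; in real use the properties come from FileEncryptionProperties.builder with proper key management:

```java
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.crypto.FileEncryptionProperties;
import org.apache.parquet.hadoop.rewrite.RewriteOptions;

public class RewriteEncryptExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Toy all-zero 128-bit footer key, for illustration only; never do this
    // in production.
    FileEncryptionProperties encryptionProps =
        FileEncryptionProperties.builder(new byte[16]).build();
    RewriteOptions options =
        new RewriteOptions.Builder(conf, new Path("input.parquet"), new Path("output.parquet"))
            .encrypt(Arrays.asList("ssn", "email"))
            .encryptionProperties(encryptionProps)
            .build();
  }
}
```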