[jira] [Commented] (PARQUET-2226) Support union Bloom Filter

2023-01-13 Thread ASF GitHub Bot (Jira)


[ https://issues.apache.org/jira/browse/PARQUET-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17676858#comment-17676858 ]

ASF GitHub Bot commented on PARQUET-2226:
-----------------------------------------

yabola commented on PR #1020:
URL: https://github.com/apache/parquet-mr/pull/1020#issuecomment-1382679860

   @shangxinli @gszadovszky Could you take a look and see whether this is suitable?




> Support union Bloom Filter
> --------------------------
>
> Key: PARQUET-2226
> URL: https://issues.apache.org/jira/browse/PARQUET-2226
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Mars
>Priority: Major
>
> We need to collect the Bloom filters of multiple Parquet files and then 
> merge them into a single, more comprehensive Bloom filter for common use. 
> Guava supports a similar API operation:
> https://guava.dev/releases/31.0.1-jre/api/docs/src-html/com/google/common/hash/BloomFilter.html#line.252
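The Guava `putAll`-style union referenced above amounts to OR-ing two compatible bitsets. A minimal sketch follows; the class and method names are illustrative, not the parquet-mr API, and it assumes both filters were built with the same bitset size and the same hash strategy:

```java
import java.util.Arrays;

// Illustrative sketch only: merging two Bloom filters built with identical
// parameters by OR-ing their bitsets. The merged filter answers "might
// contain" for any element inserted into either input filter.
public class BloomFilterUnion {

  public static byte[] union(byte[] bitsetA, byte[] bitsetB) {
    // Filters are only mergeable when their bitsets have the same size
    // (and, not checked here, the same hash strategy).
    if (bitsetA.length != bitsetB.length) {
      throw new IllegalArgumentException(
          "Bloom filters must have equal bitset sizes to be merged");
    }
    byte[] merged = Arrays.copyOf(bitsetA, bitsetA.length);
    for (int i = 0; i < bitsetB.length; i++) {
      merged[i] |= bitsetB[i];  // bitwise OR combines set bits of both filters
    }
    return merged;
  }
}
```

The union preserves the no-false-negatives guarantee but raises the false-positive rate, since the merged bitset is denser than either input.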



--
This message was sent by Atlassian Jira
(v8.20.10#820010)



[jira] [Commented] (PARQUET-2226) Support union Bloom Filter

2023-01-13 Thread ASF GitHub Bot (Jira)


[ https://issues.apache.org/jira/browse/PARQUET-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17676850#comment-17676850 ]

ASF GitHub Bot commented on PARQUET-2226:
-----------------------------------------

yabola opened a new pull request, #1020:
URL: https://github.com/apache/parquet-mr/pull/1020

   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references 
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
 - https://issues.apache.org/jira/browse/PARQUET-XXX
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines from "[How to write a good git 
commit message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain Javadoc that 
explain what it does
   







[jira] [Commented] (PARQUET-2212) Add ByteBuffer api for decryptors to allow direct memory to be decrypted

2023-01-13 Thread ASF GitHub Bot (Jira)


[ https://issues.apache.org/jira/browse/PARQUET-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17676616#comment-17676616 ]

ASF GitHub Bot commented on PARQUET-2212:
-----------------------------------------

parthchandra commented on PR #1008:
URL: https://github.com/apache/parquet-mr/pull/1008#issuecomment-1381826030

   CI is failing at the pre-build step. Anyone know why?




> Add ByteBuffer api for decryptors to allow direct memory to be decrypted
> ------------------------------------------------------------------------
>
> Key: PARQUET-2212
> URL: https://issues.apache.org/jira/browse/PARQUET-2212
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.3
>Reporter: Parth Chandra
>Priority: Major
> Fix For: 1.12.3
>
>
> The decrypt API in BlockCipher.Decryptor currently only provides a method 
> that takes in a byte array:
> {code:java}
> byte[] decrypt(byte[] lengthAndCiphertext, byte[] AAD);{code}
> A Parquet reader that uses the DirectByteBufferAllocator has to incur the 
> cost of copying the data into a byte array (and sometimes back to a 
> DirectByteBuffer) to decrypt data.
> This issue proposes adding a new API that accepts a ByteBuffer as input and 
> avoids the data copy:
> {code:java}
> ByteBuffer decrypt(ByteBuffer from, byte[] AAD);{code}
> The decryption in ColumnChunkPageReadStore can also be updated to use the 
> ByteBuffer-based API if the buffer is a DirectByteBuffer. If the buffer is a 
> HeapByteBuffer, then we can continue to use the byte array API, since that 
> does not incur a copy when the underlying byte array is accessed.
> Also, some investigation has shown that decryption with ByteBuffers is not 
> able to use hardware acceleration in JVMs before JDK 17. In those cases, the 
> overall decryption speed is faster with byte arrays, even after incurring 
> the overhead of making a copy. 
> The proposal, then, is to enable the use of the ByteBuffer API for 
> DirectByteBuffers only, and only if the JDK is 17 or higher or the user 
> explicitly configures it. 
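The dispatch described above can be sketched as follows. The `Decryptor` interface mirrors the two signatures quoted in the issue; `DecryptDispatch`, `chooseAndDecrypt`, and the `offHeapEnabled` flag are hypothetical names used purely for illustration, not the actual parquet-mr code:

```java
import java.nio.ByteBuffer;

// Sketch of the two decrypt signatures discussed in the issue.
interface Decryptor {
  byte[] decrypt(byte[] lengthAndCiphertext, byte[] AAD);
  ByteBuffer decrypt(ByteBuffer from, byte[] AAD);
}

// Illustrative dispatch: use the zero-copy ByteBuffer path only for direct
// buffers, and only when explicitly enabled (e.g. on JDK 17+, where
// ByteBuffer decryption can use hardware acceleration).
class DecryptDispatch {
  static ByteBuffer chooseAndDecrypt(Decryptor d, ByteBuffer input,
                                     byte[] aad, boolean offHeapEnabled) {
    if (input.isDirect() && offHeapEnabled) {
      return d.decrypt(input, aad);  // direct buffer: avoid the copy
    }
    // Heap buffer: fall back to the byte[] path. Copying out of a heap
    // buffer is cheap, and byte[] decryption is faster on older JVMs.
    byte[] bytes = new byte[input.remaining()];
    input.get(bytes);
    return ByteBuffer.wrap(d.decrypt(bytes, aad));
  }
}
```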






[jira] [Commented] (PARQUET-2212) Add ByteBuffer api for decryptors to allow direct memory to be decrypted

2023-01-13 Thread ASF GitHub Bot (Jira)


[ https://issues.apache.org/jira/browse/PARQUET-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17676589#comment-17676589 ]

ASF GitHub Bot commented on PARQUET-2212:
-----------------------------------------

parthchandra commented on code in PR #1008:
URL: https://github.com/apache/parquet-mr/pull/1008#discussion_r1069182893


##
parquet-format-structures/src/main/java/org/apache/parquet/format/BlockCipher.java:
##
@@ -51,17 +52,26 @@
  * @param AAD - Additional Authenticated Data for the decryption (ignored 
in case of CTR cipher)
  * @return plaintext - starts at offset 0 of the output value, and fills 
up the entire byte array.
  */
-public byte[] decrypt(byte[] lengthAndCiphertext, byte[] AAD);
+byte[] decrypt(byte[] lengthAndCiphertext, byte[] AAD);
 
 /**
+ * Convenience decryption method that reads the length and ciphertext from 
a ByteBuffer
+ *
+ * @param from ByteBuffer with length and ciphertext.
+ * @param AAD - Additional Authenticated Data for the decryption (ignored 
in case of CTR cipher)
+ * @return plaintext -  starts at offset 0 of the output, and fills up the 
entire byte buffer.

Review Comment:
   Done



##
parquet-format-structures/src/main/java/org/apache/parquet/format/BlockCipher.java:
##
@@ -36,11 +37,11 @@
  * The ciphertext includes the nonce and (in case of GCM cipher) the tag, 
as detailed in the 
  * Parquet Modular Encryption specification.
  */
-public byte[] encrypt(byte[] plaintext, byte[] AAD);
+byte[] encrypt(byte[] plaintext, byte[] AAD);

Review Comment:
   Probably. This PR focuses only on the read path, though. I can add the 
encryptor API when I look at the write path. 



##
parquet-hadoop/src/main/java/org/apache/parquet/crypto/AesCtrDecryptor.java:
##
@@ -86,6 +87,48 @@ public byte[] decrypt(byte[] ciphertext, int 
cipherTextOffset, int cipherTextLen
 
 return plainText;
   }
+  public ByteBuffer decrypt(ByteBuffer ciphertext, byte[] AAD) {
+

Review Comment:
   Done



##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java:
##
@@ -144,6 +144,11 @@
*/
   public static final String BLOOM_FILTERING_ENABLED = 
"parquet.filter.bloom.enabled";
 
+  /**
+* Key to configure if off-heap buffer should be used for decryption
+*/
+  public static final String OFF_HEAP_DECRYPT_BUFFER_ENABLED = 
"parquet.decrypt.off-heap.buffer.enabled";

Review Comment:
   Done



##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageReadStore.java:
##
@@ -133,11 +135,36 @@ public DataPage readPage() {
 public DataPage visit(DataPageV1 dataPageV1) {

Review Comment:
   Actually, it does. Added it for V2 and updated the unit tests as well. 



##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java:
##
@@ -242,6 +245,12 @@ public Builder withFilter(Filter filter) {
   return this;
 }
 
+public Builder withAllocator(ByteBufferAllocator allocator) {

Review Comment:
   Well, yes. ParquetReader has an instance of `ParquetReadOptions`, and this 
builder simply passes the allocator down to that instance. Adding it here 
mirrors the other options set in `ParquetReader.Builder`. It is used in the 
unit tests.



##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageReadStore.java:
##
@@ -133,11 +135,36 @@ public DataPage readPage() {
 public DataPage visit(DataPageV1 dataPageV1) {
   try {
 BytesInput bytes = dataPageV1.getBytes();
-if (null != blockDecryptor) {
-  bytes = 
BytesInput.from(blockDecryptor.decrypt(bytes.toByteArray(), dataPageAAD));
+BytesInput decompressed;
+
+if (options.getAllocator().isDirect() && 
options.useOffHeapDecryptBuffer()) {
+  ByteBuffer byteBuffer = bytes.toByteBuffer();
+  if (!byteBuffer.isDirect()) {

Review Comment:
   I think there was a previous discussion on this, and we simplified the code 
to throw an error. Basically, we expect that a user of the ParquetReader will 
use either a direct buffer or a heap buffer all the way through, so 
encountering both a direct buffer and a heap buffer is an error.
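The consistency rule described in this comment could be sketched like this; the class and method names are illustrative, not the actual parquet-mr code:

```java
import java.nio.ByteBuffer;

// Illustrative sketch: when the reader is configured with a direct allocator
// and off-heap decryption, encountering a heap buffer on the page-read path
// indicates mixed buffer usage and is treated as an error rather than
// silently copied.
class BufferCheck {
  static ByteBuffer requireDirect(ByteBuffer buffer) {
    if (!buffer.isDirect()) {
      throw new IllegalStateException(
          "Expected a direct buffer: a reader should use either direct or "
              + "heap buffers consistently throughout");
    }
    return buffer;
  }
}
```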



##
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java:
##
@@ -187,6 +189,7 @@ public static  Builder builder(ReadSupport 
readSupport, Path path) {
 private final InputFile file;
 private final Path path;
 private Filter filter = null;
+private ByteBufferAllocator  allocator = new HeapByteBufferAllocator();

Review Comment:
   Done



##
parquet-hadoop/src/main/java/org/apache/parquet/crypto/AesGcmDecryptor.java:
##
@@ -78,6 +79,38 @@ public byte[] decrypt(byte[] ciphertext, int 
cipherTextOffset, int cipherTextLen
 return plainText;
   }
 
+  public ByteBuffer decrypt(ByteBuffer ciphertext, byte[] AAD) {

Review Comment:
   Done