[GitHub] [parquet-format] pitrou merged pull request #189: PARQUET-2231: [Format] Allow DELTA_BYTE_ARRAY for FIXED_LEN_BYTE_ARRAY

2023-01-19 Thread GitBox
pitrou merged PR #189: URL: https://github.com/apache/parquet-format/pull/189 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail:

[GitHub] [parquet-format] pitrou commented on a diff in pull request #189: PARQUET-2231: [Format] Allow DELTA_BYTE_ARRAY for FIXED_LEN_BYTE_ARRAY

2023-01-19 Thread GitBox
pitrou commented on code in PR #189: URL: https://github.com/apache/parquet-format/pull/189#discussion_r1081911430 ## Encodings.md: ## @@ -299,9 +302,18 @@ For a longer description, see https://en.wikipedia.org/wiki/Incremental_encoding This is stored as a sequence of

[GitHub] [parquet-format] wjones127 commented on a diff in pull request #189: PARQUET-2231: [Format] Allow DELTA_BYTE_ARRAY for FIXED_LEN_BYTE_ARRAY

2023-01-19 Thread GitBox
wjones127 commented on code in PR #189: URL: https://github.com/apache/parquet-format/pull/189#discussion_r1081899568 ## Encodings.md: ## @@ -280,16 +280,19 @@ concatenated back to back. The expected savings is from the cost of encoding the and possibly better compression in

[GitHub] [parquet-mr] vectorijk commented on pull request #1015: add support re-encryption in ColumnEncryptor

2023-01-17 Thread GitBox
vectorijk commented on PR #1015: URL: https://github.com/apache/parquet-mr/pull/1015#issuecomment-1385727824 @wgtmac thanks for the review! I will coordinate with https://github.com/apache/parquet-mr/pull/1014 and address the comments

[GitHub] [parquet-mr] zhangjiashen commented on pull request #1016: PARQUET-2223: Parquet Data Masking Enhancement for Column Encryption

2023-01-16 Thread GitBox
zhangjiashen commented on PR #1016: URL: https://github.com/apache/parquet-mr/pull/1016#issuecomment-1384639397 > I found the doc. Could you provide me with a "comment" access, so we'll discuss the goals and design there? Thanks. @ggershinsky thanks for looking at this, I have added

[GitHub] [parquet-mr] ggershinsky commented on pull request #1016: PARQUET-2223: Parquet Data Masking Enhancement for Column Encryption

2023-01-16 Thread GitBox
ggershinsky commented on PR #1016: URL: https://github.com/apache/parquet-mr/pull/1016#issuecomment-1384054816 I found the doc. Could you provide me with a "comment" access, so we'll discuss the goals and design there? Thanks.

[GitHub] [parquet-format] pitrou commented on pull request #189: PARQUET-2231: [Format] Allow DELTA_BYTE_ARRAY for FIXED_LEN_BYTE_ARRAY

2023-01-16 Thread GitBox
pitrou commented on PR #189: URL: https://github.com/apache/parquet-format/pull/189#issuecomment-1383840418 Also cc @rok

[GitHub] [parquet-format] pitrou commented on pull request #189: PARQUET-2231: [Format] Allow DELTA_BYTE_ARRAY for FIXED_LEN_BYTE_ARRAY

2023-01-16 Thread GitBox
pitrou commented on PR #189: URL: https://github.com/apache/parquet-format/pull/189#issuecomment-1383831257 @emkornfield @gszadovszky @rdblue

[GitHub] [parquet-format] pitrou commented on pull request #189: PARQUET-2231: [Format] Allow DELTA_BYTE_ARRAY for FIXED_LEN_BYTE_ARRAY

2023-01-16 Thread GitBox
pitrou commented on PR #189: URL: https://github.com/apache/parquet-format/pull/189#issuecomment-1383830870 @wjones127 Could you help review the wording?

[GitHub] [parquet-format] pitrou opened a new pull request, #189: PARQUET-2231: [Format] Allow DELTA_BYTE_ARRAY for FIXED_LEN_BYTE_ARRAY

2023-01-16 Thread GitBox
pitrou opened a new pull request, #189: URL: https://github.com/apache/parquet-format/pull/189 DELTA_BYTE_ARRAY has been supported for FIXED_LEN_BYTE_ARRAY by parquet-mr since 2015 (see PARQUET-152). Update the spec in consequence. Also improve wording, markup and add an example.
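For readers unfamiliar with the encoding this PR discusses, here is a minimal Python sketch of the incremental (front-coding) idea behind DELTA_BYTE_ARRAY: each value is stored as the length of the prefix it shares with the previous value, plus the remaining suffix. The function names are hypothetical and the sketch keeps plain Python lists; it does not implement the Parquet wire format, where the prefix lengths and suffixes are themselves further encoded. The sample values are fixed-length (4 bytes each) to mirror the FIXED_LEN_BYTE_ARRAY case.

```python
def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def front_encode(values):
    """Split each value into (shared-prefix length, suffix) relative to the previous value."""
    prefix_lengths, suffixes = [], []
    previous = b""
    for value in values:
        p = common_prefix_len(previous, value)
        prefix_lengths.append(p)
        suffixes.append(value[p:])
        previous = value
    return prefix_lengths, suffixes


def front_decode(prefix_lengths, suffixes):
    """Rebuild the original values from prefix lengths and suffixes."""
    values = []
    previous = b""
    for p, suffix in zip(prefix_lengths, suffixes):
        value = previous[:p] + suffix
        values.append(value)
        previous = value
    return values


# Hypothetical fixed-length (4-byte) sample values.
sample = [b"axis", b"axle", b"axon", b"bark"]
lengths, tails = front_encode(sample)
assert front_decode(lengths, tails) == sample
```

Sorted or clustered values benefit most, since long shared prefixes shrink the stored suffixes; the first value always has a prefix length of 0.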

[GitHub] [parquet-mr] gszadovszky commented on pull request #1020: PARQUET-2226 Support merge bloom filters

2023-01-16 Thread GitBox
gszadovszky commented on PR #1020: URL: https://github.com/apache/parquet-mr/pull/1020#issuecomment-1383783806 Sure. :) Please double-check the jira if I assigned it to the correct one.

[GitHub] [parquet-mr] yabola commented on pull request #1020: PARQUET-2226 Support merge bloom filters

2023-01-16 Thread GitBox
yabola commented on PR #1020: URL: https://github.com/apache/parquet-mr/pull/1020#issuecomment-1383715909 @wgtmac Thank you for your detailed review, and @gszadovszky for your help. My jira id is miracle

[GitHub] [parquet-mr] gszadovszky commented on pull request #1020: PARQUET-2226 Support merge bloom filters

2023-01-16 Thread GitBox
gszadovszky commented on PR #1020: URL: https://github.com/apache/parquet-mr/pull/1020#issuecomment-1383689874 @yabola, what is your jira account? I'd like to assign the jira to you before closing.

[GitHub] [parquet-mr] gszadovszky merged pull request #1020: PARQUET-2226 Support merge bloom filters

2023-01-16 Thread GitBox
gszadovszky merged PR #1020: URL: https://github.com/apache/parquet-mr/pull/1020

[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

2023-01-15 Thread GitBox
yabola commented on code in PR #1020: URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070907144 ## parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java: ## @@ -394,4 +395,24 @@ public long hash(float value) {

[GitHub] [parquet-mr] ggershinsky commented on pull request #1016: PARQUET-2223: Parquet Data Masking Enhancement for Column Encryption

2023-01-15 Thread GitBox
ggershinsky commented on PR #1016: URL: https://github.com/apache/parquet-mr/pull/1016#issuecomment-1383552737 As far as I understand, _data masking_ replaces content of sensitive columns; it does not remove the columns (schema and content). The latter is done by _column pruning_ - when

[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

2023-01-15 Thread GitBox
wgtmac commented on code in PR #1020: URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070853151 ## parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java: ## @@ -394,4 +395,24 @@ public long hash(float value) {

[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

2023-01-15 Thread GitBox
yabola commented on code in PR #1020: URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070638035 ## parquet-column/src/test/java/org/apache/parquet/column/values/bloomfilter/TestBlockSplitBloomFilter.java: ## @@ -181,6 +182,83 @@ public void

[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

2023-01-15 Thread GitBox
yabola commented on code in PR #1020: URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070637510 ## parquet-column/src/test/java/org/apache/parquet/column/values/bloomfilter/TestBlockSplitBloomFilter.java: ## @@ -181,6 +182,60 @@ public void

[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

2023-01-15 Thread GitBox
yabola commented on code in PR #1020: URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070637029 ## parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java: ## @@ -394,4 +395,24 @@ public long hash(float value) {

[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

2023-01-15 Thread GitBox
yabola commented on code in PR #1020: URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070636680 ## parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java: ## @@ -398,18 +398,21 @@ public long hash(Binary value) {

[GitHub] [parquet-mr] wgtmac commented on pull request #1014: PARQUET-2227: Refactor several file rewriters to use a new unified ParquetRewriter implementation

2023-01-15 Thread GitBox
wgtmac commented on PR #1014: URL: https://github.com/apache/parquet-mr/pull/1014#issuecomment-1383160117 > I agree that merging the key-value metadata is not an easy question. Let's discuss it separately as it is not related to this PR. > > I also agree to store the current writer

[GitHub] [parquet-mr] gszadovszky commented on pull request #1014: PARQUET-2227: Refactor several file rewriters to use a new unified ParquetRewriter implementation

2023-01-15 Thread GitBox
gszadovszky commented on PR #1014: URL: https://github.com/apache/parquet-mr/pull/1014#issuecomment-1383152341 I agree that merging the key-value metadata is not an easy question. Let's discuss it separately as it is not related to this PR. I also agree to store the current writer

[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

2023-01-15 Thread GitBox
wgtmac commented on code in PR #1020: URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070591970 ## parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java: ## @@ -398,18 +398,21 @@ public long hash(Binary value) {

[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

2023-01-15 Thread GitBox
yabola commented on code in PR #1020: URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070591022 ## parquet-column/src/test/java/org/apache/parquet/column/values/bloomfilter/TestBlockSplitBloomFilter.java: ## @@ -181,6 +182,60 @@ public void

[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

2023-01-15 Thread GitBox
yabola commented on code in PR #1020: URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070590290 ## parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java: ## @@ -394,4 +395,21 @@ public long hash(float value) {

[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

2023-01-15 Thread GitBox
yabola commented on code in PR #1020: URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070590238 ## parquet-column/src/test/java/org/apache/parquet/column/values/bloomfilter/TestBlockSplitBloomFilter.java: ## @@ -181,6 +182,60 @@ public void

[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

2023-01-15 Thread GitBox
wgtmac commented on code in PR #1020: URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070546933 ## parquet-column/src/test/java/org/apache/parquet/column/values/bloomfilter/TestBlockSplitBloomFilter.java: ## @@ -181,6 +182,60 @@ public void

[GitHub] [parquet-mr] yabola commented on pull request #1020: PARQUET-2226 Support merge bloom filters

2023-01-15 Thread GitBox
yabola commented on PR #1020: URL: https://github.com/apache/parquet-mr/pull/1020#issuecomment-1383102223 @wgtmac I have added a unit test, please take a look~ thanks!

[GitHub] [parquet-mr] wgtmac commented on pull request #1014: PARQUET-2227: Refactor several file rewriters to use a new unified ParquetRewriter implementation

2023-01-15 Thread GitBox
wgtmac commented on PR #1014: URL: https://github.com/apache/parquet-mr/pull/1014#issuecomment-1383101451 > > I am afraid some implementations may drop characters after `'\n'` when displaying the string content. Let me do some investigation. > > I do not have a strong opinion for

[GitHub] [parquet-mr] shangxinli commented on pull request #1016: PARQUET-2223: Parquet Data Masking Enhancement for Column Encryption

2023-01-14 Thread GitBox
shangxinli commented on PR #1016: URL: https://github.com/apache/parquet-mr/pull/1016#issuecomment-1383006808 @ggershinsky Do you have time to have a look?

[GitHub] [parquet-mr] shangxinli merged pull request #1018: PARQUET-2219: ParquetFileReader skips empty row group

2023-01-14 Thread GitBox
shangxinli merged PR #1018: URL: https://github.com/apache/parquet-mr/pull/1018

[GitHub] [parquet-mr] shangxinli commented on pull request #1020: PARQUET-2226 Support merge bloom filters

2023-01-14 Thread GitBox
shangxinli commented on PR #1020: URL: https://github.com/apache/parquet-mr/pull/1020#issuecomment-1383006013 @chenjunjiedada Do you still have time to review this change?

[GitHub] [parquet-mr] gszadovszky commented on pull request #1014: PARQUET-2227: Refactor several file rewriters to use a new unified ParquetRewriter implementation

2023-01-14 Thread GitBox
gszadovszky commented on PR #1014: URL: https://github.com/apache/parquet-mr/pull/1014#issuecomment-1382840526 > I am afraid some implementations may drop characters after `'\n'` when displaying the string content. Let me do some investigation. I do not have a strong opinion for

[GitHub] [parquet-mr] wgtmac commented on pull request #1014: PARQUET-2075: Implement unified file rewriter

2023-01-14 Thread GitBox
wgtmac commented on PR #1014: URL: https://github.com/apache/parquet-mr/pull/1014#issuecomment-1382815489 > > * I'd prefer creating a new JIRA for this refactor to be a prerequisite. Merging multiple files to a single one with customized pruning, encryption, and codec is also in my mind

[GitHub] [parquet-mr] gszadovszky commented on pull request #1014: PARQUET-2075: Implement unified file rewriter

2023-01-14 Thread GitBox
gszadovszky commented on PR #1014: URL: https://github.com/apache/parquet-mr/pull/1014#issuecomment-1382754916 > * I'd prefer creating a new JIRA for this refactor to be a prerequisite. Merging multiple files to a single one with customized pruning, encryption, and codec is also in my mind

[GitHub] [parquet-mr] wgtmac commented on pull request #1014: PARQUET-2075: Implement unified file rewriter

2023-01-14 Thread GitBox
wgtmac commented on PR #1014: URL: https://github.com/apache/parquet-mr/pull/1014#issuecomment-1382752637 > I think it is a great refactor. Thanks a lot for working on it, @wgtmac! On the other hand, I've thought about PARQUET-2075 as a request for a new feature in `parquet-cli`

[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

2023-01-14 Thread GitBox
yabola commented on code in PR #1020: URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070267028 ## parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BloomFilter.java: ## @@ -176,4 +176,10 @@ public String toString() { * @return

[GitHub] [parquet-mr] gszadovszky commented on a diff in pull request #1014: PARQUET-2075: Implement unified file rewriter

2023-01-14 Thread GitBox
gszadovszky commented on code in PR #1014: URL: https://github.com/apache/parquet-mr/pull/1014#discussion_r1070274495 ## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/ParquetRewriter.java: ## @@ -0,0 +1,733 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] [parquet-mr] gszadovszky commented on pull request #1020: PARQUET-2226 Support merge bloom filters

2023-01-14 Thread GitBox
gszadovszky commented on PR #1020: URL: https://github.com/apache/parquet-mr/pull/1020#issuecomment-1382737603 One more thing, @yabola. The compatibility tests fail because you have added a new method to a public interface. Even though this interface is not supposed to be implemented by

[GitHub] [parquet-mr] gszadovszky commented on pull request #1020: PARQUET-2226 Support merge bloom filters

2023-01-14 Thread GitBox
gszadovszky commented on PR #1020: URL: https://github.com/apache/parquet-mr/pull/1020#issuecomment-1382736990 Thanks, @yabola for working on this and also to @wgtmac for reviewing. I do not have much experience with bloom filters so I will rely on your review. Ping me if you have a +1.

[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

2023-01-14 Thread GitBox
yabola commented on code in PR #1020: URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070267080 ## parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java: ## @@ -394,4 +395,21 @@ public long hash(float value) {

[GitHub] [parquet-mr] yabola commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

2023-01-14 Thread GitBox
yabola commented on code in PR #1020: URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070267028 ## parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BloomFilter.java: ## @@ -176,4 +176,10 @@ public String toString() { * @return

[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1020: PARQUET-2226 Support merge bloom filters

2023-01-14 Thread GitBox
wgtmac commented on code in PR #1020: URL: https://github.com/apache/parquet-mr/pull/1020#discussion_r1070233210 ## parquet-column/src/main/java/org/apache/parquet/column/values/bloomfilter/BlockSplitBloomFilter.java: ## @@ -394,4 +395,21 @@ public long hash(float value) {

[GitHub] [parquet-mr] yabola commented on pull request #1020: PARQUET-2226 Support merge two Bloom Filters

2023-01-13 Thread GitBox
yabola commented on PR #1020: URL: https://github.com/apache/parquet-mr/pull/1020#issuecomment-1382679860 @shangxinli @gszadovszky Can you help take a look if it is suitable~

[GitHub] [parquet-mr] yabola opened a new pull request, #1020: PARQUET-2226 Support merge two Bloom Filters

2023-01-13 Thread GitBox
yabola opened a new pull request, #1020: URL: https://github.com/apache/parquet-mr/pull/1020 Make sure you have checked _all_ steps below. ### Jira - [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in

[GitHub] [parquet-mr] parthchandra commented on pull request #1008: PARQUET-2212: Add ByteBuffer api for decryptors to allow direct memory to be decrypted

2023-01-13 Thread GitBox
parthchandra commented on PR #1008: URL: https://github.com/apache/parquet-mr/pull/1008#issuecomment-1381826030 CI is failing at the pre-build step. Anyone know why?

[GitHub] [parquet-mr] parthchandra commented on a diff in pull request #1008: PARQUET-2212: Add ByteBuffer api for decryptors to allow direct memory to be decrypted

2023-01-13 Thread GitBox
parthchandra commented on code in PR #1008: URL: https://github.com/apache/parquet-mr/pull/1008#discussion_r1069182893 ## parquet-format-structures/src/main/java/org/apache/parquet/format/BlockCipher.java: ## @@ -51,17 +52,26 @@ * @param AAD - Additional Authenticated

[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1019: PARQUET-2103: Fix crypto exception in print toPrettyJSON

2023-01-12 Thread GitBox
wgtmac commented on code in PR #1019: URL: https://github.com/apache/parquet-mr/pull/1019#discussion_r1068164918 ## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/FileMetaData.java: ## @@ -71,7 +79,7 @@ public MessageType getSchema() { @Override public

[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1014: PARQUET-2075: Implement unified file rewriter

2023-01-12 Thread GitBox
wgtmac commented on code in PR #1014: URL: https://github.com/apache/parquet-mr/pull/1014#discussion_r1068159808 ## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/RewriteOptions.java: ## @@ -0,0 +1,178 @@ +/* + * Licensed to the Apache Software Foundation (ASF)

[GitHub] [parquet-mr] ggershinsky commented on a diff in pull request #1014: PARQUET-2075: Implement unified file rewriter

2023-01-12 Thread GitBox
ggershinsky commented on code in PR #1014: URL: https://github.com/apache/parquet-mr/pull/1014#discussion_r1067893929 ## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/RewriteOptions.java: ## @@ -0,0 +1,178 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1008: PARQUET-2212: Add ByteBuffer api for decryptors to allow direct memory to be decrypted

2023-01-11 Thread GitBox
wgtmac commented on code in PR #1008: URL: https://github.com/apache/parquet-mr/pull/1008#discussion_r1067699349 ## parquet-format-structures/src/main/java/org/apache/parquet/format/BlockCipher.java: ## @@ -51,17 +52,26 @@ * @param AAD - Additional Authenticated Data for

[GitHub] [parquet-mr] Kimahriman commented on pull request #982: PARQUET-2160: Close ZstdInputStream to free off-heap memory in time.

2023-01-11 Thread GitBox
Kimahriman commented on PR #982: URL: https://github.com/apache/parquet-mr/pull/982#issuecomment-1379431375 same, we have certain jobs that can't function without a patched jar. Seems to get worse with the more columns you read. Our worst offender (table with thousands of columns), can

[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1014: PARQUET-2075: Implement unified file rewriter

2023-01-11 Thread GitBox
wgtmac commented on code in PR #1014: URL: https://github.com/apache/parquet-mr/pull/1014#discussion_r1067083773 ## parquet-hadoop/src/test/java/org/apache/parquet/hadoop/rewrite/ParquetRewriterTest.java: ## @@ -0,0 +1,308 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1014: PARQUET-2075: Implement unified file rewriter

2023-01-11 Thread GitBox
wgtmac commented on code in PR #1014: URL: https://github.com/apache/parquet-mr/pull/1014#discussion_r1067082884 ## parquet-hadoop/src/test/java/org/apache/parquet/hadoop/rewrite/ParquetRewriterTest.java: ## @@ -0,0 +1,308 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1014: PARQUET-2075: Implement unified file rewriter

2023-01-11 Thread GitBox
wgtmac commented on code in PR #1014: URL: https://github.com/apache/parquet-mr/pull/1014#discussion_r1067082250 ## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/RewriteOptions.java: ## @@ -0,0 +1,178 @@ +/* + * Licensed to the Apache Software Foundation (ASF)

[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1018: PARQUET-2219: ParquetFileReader skips empty row group

2023-01-10 Thread GitBox
wgtmac commented on code in PR #1018: URL: https://github.com/apache/parquet-mr/pull/1018#discussion_r1066592694 ## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java: ## @@ -927,7 +925,15 @@ public PageReadStore readRowGroup(int blockIndex) throws

[GitHub] [parquet-mr] camper42 commented on pull request #982: PARQUET-2160: Close ZstdInputStream to free off-heap memory in time.

2023-01-10 Thread GitBox
camper42 commented on PR #982: URL: https://github.com/apache/parquet-mr/pull/982#issuecomment-1378181577 Same problem as @alexeykudinkin; currently we replace the parquet jar with a patched one in our image, waiting for a release

[GitHub] [parquet-mr] shangxinli commented on pull request #1014: PARQUET-2075: Implement unified file rewriter

2023-01-10 Thread GitBox
shangxinli commented on PR #1014: URL: https://github.com/apache/parquet-mr/pull/1014#issuecomment-1377840323 Thanks a lot @gszadovszky

[GitHub] [parquet-mr] gszadovszky commented on pull request #1014: PARQUET-2075: Implement unified file rewriter

2023-01-10 Thread GitBox
gszadovszky commented on PR #1014: URL: https://github.com/apache/parquet-mr/pull/1014#issuecomment-1377706622 > @gszadovszky I just want to check if you have time to have a look. @wgtmac just be nice to take over the work that we discussed earlier to have an aggregated rewriter.

[GitHub] [parquet-mr] gszadovszky commented on pull request #1018: PARQUET-2219: ParquetFileReader skips empty row group

2023-01-10 Thread GitBox
gszadovszky commented on PR #1018: URL: https://github.com/apache/parquet-mr/pull/1018#issuecomment-1377700950 > @gszadovszky Nice to see you are back! @shangxinli, I wouldn't say I'm back, unfortunately. I'm a bit closer to Parquet at Dremio but actually not working on it. We'll see

[GitHub] [parquet-mr] shangxinli commented on pull request #982: PARQUET-2160: Close ZstdInputStream to free off-heap memory in time.

2023-01-10 Thread GitBox
shangxinli commented on PR #982: URL: https://github.com/apache/parquet-mr/pull/982#issuecomment-1377598327 Thanks @alexeykudinkin for the explanation.

[GitHub] [parquet-mr] alexeykudinkin commented on pull request #982: PARQUET-2160: Close ZstdInputStream to free off-heap memory in time.

2023-01-10 Thread GitBox
alexeykudinkin commented on PR #982: URL: https://github.com/apache/parquet-mr/pull/982#issuecomment-1377589036 Totally @shangxinli We have running Spark clusters in production _ingesting_ from 100s of Apache Hudi tables (using Parquet and Zstd) and writing into other ones. We

[GitHub] [parquet-mr] shangxinli commented on a diff in pull request #1018: PARQUET-2219: ParquetFileReader skips empty row group

2023-01-10 Thread GitBox
shangxinli commented on code in PR #1018: URL: https://github.com/apache/parquet-mr/pull/1018#discussion_r1066042941 ## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java: ## @@ -1038,7 +1044,10 @@ public PageReadStore readNextFilteredRowGroup()

[GitHub] [parquet-mr] shangxinli commented on a diff in pull request #1018: PARQUET-2219: ParquetFileReader skips empty row group

2023-01-10 Thread GitBox
shangxinli commented on code in PR #1018: URL: https://github.com/apache/parquet-mr/pull/1018#discussion_r1066038932 ## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java: ## @@ -927,7 +925,15 @@ public PageReadStore readRowGroup(int blockIndex)

[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1014: PARQUET-2075: Implement unified file rewriter

2023-01-10 Thread GitBox
wgtmac commented on code in PR #1014: URL: https://github.com/apache/parquet-mr/pull/1014#discussion_r1065962705 ## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/RewriteOptions.java: ## @@ -0,0 +1,178 @@ +/* + * Licensed to the Apache Software Foundation (ASF)

[GitHub] [parquet-mr] wgtmac commented on pull request #1018: PARQUET-2219: ParquetFileReader skips empty row group

2023-01-10 Thread GitBox
wgtmac commented on PR #1018: URL: https://github.com/apache/parquet-mr/pull/1018#issuecomment-1377485663 > Thank you for fixing this. I've added some comments. Also, could you add a similar test for the filtered row groups? Thanks for your review @gszadovszky ! I have

[GitHub] [parquet-mr] wgtmac commented on pull request #1008: PARQUET-2212: Add ByteBuffer api for decryptors to allow direct memory to be decrypted

2023-01-10 Thread GitBox
wgtmac commented on PR #1008: URL: https://github.com/apache/parquet-mr/pull/1008#issuecomment-1376928785 > @wgtmac Do you have time to have a look? @shangxinli Thanks for mentioning me. Sure, I will take a look this week.

[GitHub] [parquet-mr] dongjoon-hyun commented on pull request #1017: PARQUET-2224: Publish SBOM artifacts

2023-01-09 Thread GitBox
dongjoon-hyun commented on PR #1017: URL: https://github.com/apache/parquet-mr/pull/1017#issuecomment-1376796050 Thank you all, @shangxinli , @ggershinsky , @sunchao , @wgtmac . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub

[GitHub] [parquet-mr] shangxinli commented on pull request #1008: PARQUET-2212: Add ByteBuffer api for decryptors to allow direct memory to be decrypted

2023-01-09 Thread GitBox
shangxinli commented on PR #1008: URL: https://github.com/apache/parquet-mr/pull/1008#issuecomment-1376755881 @wgtmac Do you have time to have a look? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to

[GitHub] [parquet-mr] shangxinli commented on a diff in pull request #1008: PARQUET-2212: Add ByteBuffer api for decryptors to allow direct memory to be decrypted

2023-01-09 Thread GitBox
shangxinli commented on code in PR #1008: URL: https://github.com/apache/parquet-mr/pull/1008#discussion_r1065346689 ## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageReadStore.java: ## @@ -133,11 +135,36 @@ public DataPage readPage() { public

[GitHub] [parquet-mr] shangxinli commented on pull request #1014: PARQUET-2075: Implement unified file rewriter

2023-01-09 Thread GitBox
shangxinli commented on PR #1014: URL: https://github.com/apache/parquet-mr/pull/1014#issuecomment-1376754942 @gszadovszky I just want to check if you have time to have a look. @wgtmac Just be nice to take over the work that we discussed earlier to have an aggregated rewriter. -- This

[GitHub] [parquet-mr] shangxinli merged pull request #1017: PARQUET-2224: Publish SBOM artifacts

2023-01-09 Thread GitBox
shangxinli merged PR #1017: URL: https://github.com/apache/parquet-mr/pull/1017 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail:

[GitHub] [parquet-mr] shangxinli commented on pull request #1017: PARQUET-2224: Publish SBOM artifacts

2023-01-09 Thread GitBox
shangxinli commented on PR #1017: URL: https://github.com/apache/parquet-mr/pull/1017#issuecomment-1376754236 Thank you @dongjoon-hyun for working on it! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above

[GitHub] [parquet-mr] shangxinli commented on pull request #1018: PARQUET-2219: ParquetFileReader skips empty row group

2023-01-09 Thread GitBox
shangxinli commented on PR #1018: URL: https://github.com/apache/parquet-mr/pull/1018#issuecomment-1376751333 @gszadovszky Nice to see you are back! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go

[GitHub] [parquet-mr] shangxinli commented on pull request #982: PARQUET-2160: Close ZstdInputStream to free off-heap memory in time.

2023-01-09 Thread GitBox
shangxinli commented on PR #982: URL: https://github.com/apache/parquet-mr/pull/982#issuecomment-1376750703 @alexeykudinkin We might release a new patch in the next 2 or 3 months. Can you elaborate why "this is a severe problem that does affect our ability to use Parquet w/ Zstd"?

[GitHub] [parquet-mr] alexeykudinkin commented on pull request #982: PARQUET-2160: Close ZstdInputStream to free off-heap memory in time.

2023-01-09 Thread GitBox
alexeykudinkin commented on PR #982: URL: https://github.com/apache/parquet-mr/pull/982#issuecomment-1376498280 @gszadovszky @ggershinsky @shangxinli Folks, do we have an approximate timeline for the next patch release that will be including this patch? This is a severe

[GitHub] [parquet-format] anjakefala commented on pull request #184: PARQUET-758: Add Float16/Half-float logical type

2023-01-09 Thread GitBox
anjakefala commented on PR #184: URL: https://github.com/apache/parquet-format/pull/184#issuecomment-1376199292 Hey @emkornfield! Is it reasonable for me to send a proposal to the mailing list for a vote? It seems @gszadovszky is not available for insight; is there anyone else that can

[GitHub] [parquet-mr] gszadovszky commented on a diff in pull request #1018: PARQUET-2219: ParquetFileReader skips empty row group

2023-01-09 Thread GitBox
gszadovszky commented on code in PR #1018: URL: https://github.com/apache/parquet-mr/pull/1018#discussion_r1064374553 ## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java: ## @@ -1038,7 +1044,9 @@ public PageReadStore readNextFilteredRowGroup()

[GitHub] [parquet-mr] ggershinsky commented on a diff in pull request #1014: PARQUET-2075: Implement unified file rewriter

2023-01-08 Thread GitBox
ggershinsky commented on code in PR #1014: URL: https://github.com/apache/parquet-mr/pull/1014#discussion_r1064348254 ## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/RewriteOptions.java: ## @@ -0,0 +1,178 @@ +/* + * Licensed to the Apache Software Foundation

[GitHub] [parquet-mr] dongjoon-hyun commented on pull request #1017: PARQUET-2224: Publish SBOM artifacts

2023-01-08 Thread GitBox
dongjoon-hyun commented on PR #1017: URL: https://github.com/apache/parquet-mr/pull/1017#issuecomment-1375203719 Thank you, @ggershinsky ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

[GitHub] [parquet-mr] dongjoon-hyun commented on pull request #1017: PARQUET-2224: Publish SBOM artifacts

2023-01-08 Thread GitBox
dongjoon-hyun commented on PR #1017: URL: https://github.com/apache/parquet-mr/pull/1017#issuecomment-1374985673 Thank you, @wgtmac and @sunchao . -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to

[GitHub] [parquet-mr] wgtmac commented on pull request #1018: PARQUET-2219: ParquetFileReader skips empty row group

2023-01-08 Thread GitBox
wgtmac commented on PR #1018: URL: https://github.com/apache/parquet-mr/pull/1018#issuecomment-1374806195 @gszadovszky @ggershinsky @shangxinli @sunchao Could you please take a look when you have time? cc @emkornfield -- This is an automated message from the Apache Git Service.

[GitHub] [parquet-mr] wgtmac opened a new pull request, #1018: PARQUET-2219: ParquetFileReader skips empty row group

2023-01-08 Thread GitBox
wgtmac opened a new pull request, #1018: URL: https://github.com/apache/parquet-mr/pull/1018 ### Jira My PR addresses the [PARQUET-2219](https://issues.apache.org/jira/browse/PARQUET/PARQUET-2219). ### Tests My PR adds the following unit test to read parquet file with

[GitHub] [parquet-mr] dongjoon-hyun commented on pull request #1017: PARQUET-2224: Publish SBOM artifacts

2023-01-06 Thread GitBox
dongjoon-hyun commented on PR #1017: URL: https://github.com/apache/parquet-mr/pull/1017#issuecomment-1373908959 FYI, here is the ASF SBOM wikipage. - https://cwiki.apache.org/confluence/display/COMDEV/SBOM -- This is an automated message from the Apache Git Service. To respond to the

[GitHub] [parquet-mr] dongjoon-hyun commented on pull request #1017: PARQUET-2224: Publish SBOM artifacts

2023-01-05 Thread GitBox
dongjoon-hyun commented on PR #1017: URL: https://github.com/apache/parquet-mr/pull/1017#issuecomment-1372782088 Also, cc @shangxinli and @gszadovszky -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to

[GitHub] [parquet-mr] dongjoon-hyun commented on pull request #1017: PARQUET-2224: Publish SBOM artifacts

2023-01-05 Thread GitBox
dongjoon-hyun commented on PR #1017: URL: https://github.com/apache/parquet-mr/pull/1017#issuecomment-1372539919 cc @ggershinsky and @sunchao -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the

[GitHub] [parquet-mr] dongjoon-hyun opened a new pull request, #1017: PARQUET-2224: Publish SBOM artifacts

2023-01-05 Thread GitBox
dongjoon-hyun opened a new pull request, #1017: URL: https://github.com/apache/parquet-mr/pull/1017 Make sure you have checked _all_ steps below. ### Jira - [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references

[GitHub] [parquet-mr] wgtmac commented on pull request #1015: add support re-encryption in ColumnEncryptor

2023-01-05 Thread GitBox
wgtmac commented on PR #1015: URL: https://github.com/apache/parquet-mr/pull/1015#issuecomment-1372241704 BTW, I am working on https://github.com/apache/parquet-mr/pull/1014 to unify several rewriters and most logic of `class ColumnEncryptor` will be relocated to `class ParquetRewriter`.

[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1015: add support re-encryption in ColumnEncryptor

2023-01-05 Thread GitBox
wgtmac commented on code in PR #1015: URL: https://github.com/apache/parquet-mr/pull/1015#discussion_r1062474157 ## parquet-hadoop/src/main/java/org/apache/parquet/crypto/ColumnDecryptionProperties.java: ## @@ -1,104 +1,109 @@ -/* - * Licensed to the Apache Software Foundation

[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1016: PARQUET-2223: Parquet Data Masking Enhancement for Column Encryption

2023-01-05 Thread GitBox
wgtmac commented on code in PR #1016: URL: https://github.com/apache/parquet-mr/pull/1016#discussion_r1062349914 ## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/DataMaskingUtil.java: ## @@ -0,0 +1,95 @@ +/* + * Licensed to the Apache Software Foundation (ASF)

[GitHub] [parquet-mr] wgtmac commented on pull request #1014: PARQUET-2075: Implement unified file rewriter

2023-01-04 Thread GitBox
wgtmac commented on PR #1014: URL: https://github.com/apache/parquet-mr/pull/1014#issuecomment-1371885483 > If you can add more unit tests, particularly the combinations of prune, mask, trans-compression etc, it would be better. I have added some test cases in the

[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1014: PARQUET-2075: Implement unified file rewriter

2023-01-04 Thread GitBox
wgtmac commented on code in PR #1014: URL: https://github.com/apache/parquet-mr/pull/1014#discussion_r1062185479 ## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/RewriteOptions.java: ## @@ -0,0 +1,144 @@ +/* + * Licensed to the Apache Software Foundation (ASF)
