[GitHub] [parquet-mr] hamza-tam opened a new pull request, #1041: Uniformizing booleans naming in ParquetWriter

2023-03-17 Thread via GitHub
hamza-tam opened a new pull request, #1041: URL: https://github.com/apache/parquet-mr/pull/1041 Make sure you have checked _all_ steps below. ### Jira - [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them

[jira] [Commented] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

2023-03-17 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701921#comment-17701921 ] ASF GitHub Bot commented on PARQUET-: - pitrou commented on PR #193: URL: ht

[jira] [Commented] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

2023-03-17 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701922#comment-17701922 ] ASF GitHub Bot commented on PARQUET-: - pitrou commented on PR #193: URL: ht

[GitHub] [parquet-format] pitrou commented on pull request #193: PARQUET-2222: RLE encoding spec incorrect for v2 data pages

2023-03-17 Thread via GitHub
pitrou commented on PR #193: URL: https://github.com/apache/parquet-format/pull/193#issuecomment-1474188220 Another possibility is a nice table: ``` +--++-+ | Page kind| RLE-encoded data kind | Prepend length? | +---

[GitHub] [parquet-format] pitrou commented on pull request #193: PARQUET-2222: RLE encoding spec incorrect for v2 data pages

2023-03-17 Thread via GitHub
pitrou commented on PR #193: URL: https://github.com/apache/parquet-format/pull/193#issuecomment-1474188853 Also, please someone with better knowledge of parquet-mr comment on [https://github.com/apache/parquet-format/pull/193#issuecomment-1474171946]. -- This is an automated message from

[jira] [Commented] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

2023-03-17 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701920#comment-17701920 ] ASF GitHub Bot commented on PARQUET-: - pitrou commented on PR #193: URL: ht

[GitHub] [parquet-format] pitrou commented on pull request #193: PARQUET-2222: RLE encoding spec incorrect for v2 data pages

2023-03-17 Thread via GitHub
pitrou commented on PR #193: URL: https://github.com/apache/parquet-format/pull/193#issuecomment-1474186504 I think this should be more explicit, e.g.: ``` // The length-prepended version is used for: // - in v1 data pages: definition levels, repetition levels, and RLE-encoded boole

[jira] [Commented] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

2023-03-17 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701914#comment-17701914 ] ASF GitHub Bot commented on PARQUET-: - pitrou commented on PR #193: URL: ht

[GitHub] [parquet-format] pitrou commented on pull request #193: PARQUET-2222: RLE encoding spec incorrect for v2 data pages

2023-03-17 Thread via GitHub
pitrou commented on PR #193: URL: https://github.com/apache/parquet-format/pull/193#issuecomment-1474171946 Ok, so v2 data pages for RLE-encoded boolean do encode the length: https://github.com/apache/parquet-mr/blob/1235003e742e6a76bf6cb8f7ed33e942fa12d0d5/parquet-column/src/main/java/or

[jira] [Commented] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

2023-03-17 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701907#comment-17701907 ] ASF GitHub Bot commented on PARQUET-: - pitrou commented on PR #193: URL: ht

[GitHub] [parquet-format] pitrou commented on pull request #193: PARQUET-2222: RLE encoding spec incorrect for v2 data pages

2023-03-17 Thread via GitHub
pitrou commented on PR #193: URL: https://github.com/apache/parquet-format/pull/193#issuecomment-1474154058 I was alluding to this comment: > @wgtmac is correct that `length` is left out only in case of `v2` `DL` and `RL`.
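
The rule quoted in this comment (the `length` prefix is left out only for v2 definition and repetition levels) can be sketched roughly as below. This is an illustration under stated assumptions, not parquet-mr code: the function names, the string labels for the data kinds, and the `known_size` parameter are hypothetical. The prefix is taken to be a 4-byte little-endian unsigned integer, in line with the "first 4B" remark elsewhere in this thread; for v2 levels the byte size is assumed to come from somewhere outside the buffer, such as the page header.

```python
import struct

# Hypothetical labels for the kinds of RLE-encoded data discussed in the thread.
LEVEL_KINDS = {"definition_levels", "repetition_levels"}


def has_length_prefix(page_version: int, data_kind: str) -> bool:
    """Return True when the RLE run is preceded by a 4-byte length.

    Per the comment quoted above, the prefix is omitted only for
    definition/repetition levels in v2 data pages.
    """
    return not (page_version == 2 and data_kind in LEVEL_KINDS)


def split_rle_run(buf: bytes, page_version: int, data_kind: str, known_size=None):
    """Split `buf` into (encoded_run, remaining_bytes).

    `known_size` is a hypothetical parameter: the caller must supply the
    run size when no length prefix is stored in the buffer.
    """
    if has_length_prefix(page_version, data_kind):
        # Assumed layout: 4-byte little-endian unsigned length, then the run.
        (length,) = struct.unpack_from("<I", buf, 0)
        return buf[4:4 + length], buf[4 + length:]
    if known_size is None:
        raise ValueError("size must be supplied when no length prefix is stored")
    return buf[:known_size], buf[known_size:]
```

With this sketch, a v1 page buffer is split by reading the prefix, while a v2 levels buffer is split with a caller-supplied size; RLE-encoded boolean values keep the prefix in both page versions, which is the point the thread settles on.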

[jira] [Commented] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

2023-03-17 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701904#comment-17701904 ] ASF GitHub Bot commented on PARQUET-: - mapleFU commented on PR #193: URL: h

[GitHub] [parquet-format] mapleFU commented on pull request #193: PARQUET-2222: RLE encoding spec incorrect for v2 data pages

2023-03-17 Thread via GitHub
mapleFU commented on PR #193: URL: https://github.com/apache/parquet-format/pull/193#issuecomment-1474140277 @pitrou I guess it's already within DataPage. It's the first 4B in deserialized data

[jira] [Commented] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

2023-03-17 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701880#comment-17701880 ] ASF GitHub Bot commented on PARQUET-: - pitrou commented on PR #193: URL: ht

[GitHub] [parquet-format] pitrou commented on pull request #193: PARQUET-2222: RLE encoding spec incorrect for v2 data pages

2023-03-17 Thread via GitHub
pitrou commented on PR #193: URL: https://github.com/apache/parquet-format/pull/193#issuecomment-1474072383 Hmm, can you point me to the place where the length is written out for RLE-encoded boolean data in v2 data pages?

[GitHub] [parquet-site] gszadovszky commented on pull request #31: PARQUET-2259: Update site to sync with latest parquet-format

2023-03-17 Thread via GitHub
gszadovszky commented on PR #31: URL: https://github.com/apache/parquet-site/pull/31#issuecomment-1474023977 @wgtmac, I've found it finally: https://parquet.staged.apache.org/ I don't think staging makes sense this way. The two branches are already diverged from each other. I think a bett

[GitHub] [parquet-site] wgtmac commented on pull request #31: PARQUET-2259: Update site to sync with latest parquet-format

2023-03-17 Thread via GitHub
wgtmac commented on PR #31: URL: https://github.com/apache/parquet-site/pull/31#issuecomment-1473954586 There is a [document](https://github.com/apache/parquet-site/tree/production#staging) for staging and production but I still don't know where is the staging site. OK, I will close

[jira] [Commented] (PARQUET-2256) Adding Compression for BloomFilter

2023-03-17 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701813#comment-17701813 ] ASF GitHub Bot commented on PARQUET-2256: - wgtmac commented on PR #195: URL: ht

[GitHub] [parquet-format] wgtmac commented on pull request #195: PARQUET-2256: Add BloomFilter Compression

2023-03-17 Thread via GitHub
wgtmac commented on PR #195: URL: https://github.com/apache/parquet-format/pull/195#issuecomment-1473943114 @gszadovszky Good point! I have a relevant proposal (https://github.com/apache/parquet-format/pull/194) to bloom filter, mind take a look as well?

[jira] [Commented] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

2023-03-17 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701808#comment-17701808 ] ASF GitHub Bot commented on PARQUET-: - wgtmac commented on PR #193: URL: ht

[GitHub] [parquet-format] wgtmac commented on pull request #193: PARQUET-2222: RLE encoding spec incorrect for v2 data pages

2023-03-17 Thread via GitHub
wgtmac commented on PR #193: URL: https://github.com/apache/parquet-format/pull/193#issuecomment-1473932634 Sounds good. Let me fix it.

[jira] [Commented] (PARQUET-1690) Integer Overflow of BinaryStatistics#isSmallerThan()

2023-03-17 Thread Xinli Shang (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701789#comment-17701789 ] Xinli Shang commented on PARQUET-1690: -- It is a quite long time ago. I don't remem

[jira] [Commented] (PARQUET-2198) Vulnerabilities in jackson-databind

2023-03-17 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701775#comment-17701775 ] ASF GitHub Bot commented on PARQUET-2198: - shangxinli commented on PR #1005: UR

[GitHub] [parquet-mr] shangxinli commented on pull request #1005: PARQUET-2198 : Updating jackson data bind version to fix CVEs

2023-03-17 Thread via GitHub
shangxinli commented on PR #1005: URL: https://github.com/apache/parquet-mr/pull/1005#issuecomment-1473894681 We already started working on the release. Please wait...

[jira] [Commented] (PARQUET-2198) Vulnerabilities in jackson-databind

2023-03-17 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701682#comment-17701682 ] ASF GitHub Bot commented on PARQUET-2198: - mdadil-dk commented on PR #1005: URL

[GitHub] [parquet-mr] mdadil-dk commented on pull request #1005: PARQUET-2198 : Updating jackson data bind version to fix CVEs

2023-03-17 Thread via GitHub
mdadil-dk commented on PR #1005: URL: https://github.com/apache/parquet-mr/pull/1005#issuecomment-1473738111 Any new release plan for this ?? Or have SNAPSHOT/RC build to test ??

[jira] [Commented] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

2023-03-17 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701660#comment-17701660 ] ASF GitHub Bot commented on PARQUET-: - gszadovszky commented on PR #193: UR

[GitHub] [parquet-format] gszadovszky commented on pull request #193: PARQUET-2222: RLE encoding spec incorrect for v2 data pages

2023-03-17 Thread via GitHub
gszadovszky commented on PR #193: URL: https://github.com/apache/parquet-format/pull/193#issuecomment-1473665965 @wgtmac is correct that `length` is left out only in case of `v2` `DL` and `RL`. Meanwhile I agree with @pitrou that the note is better to be mentioned at the grammar spec: ``

[jira] [Commented] (PARQUET-2256) Adding Compression for BloomFilter

2023-03-17 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701637#comment-17701637 ] ASF GitHub Bot commented on PARQUET-2256: - gszadovszky commented on PR #195: UR

[GitHub] [parquet-format] gszadovszky commented on pull request #195: PARQUET-2256: Add BloomFilter Compression

2023-03-17 Thread via GitHub
gszadovszky commented on PR #195: URL: https://github.com/apache/parquet-format/pull/195#issuecomment-1473613246 @mapleFU, I have discovered two unfortunate issues with the format definition of bloom filters that would be nice to be corrected before adding this change. (I am also fine solvi

[GitHub] [parquet-site] gszadovszky commented on pull request #31: PARQUET-2259: Update site to sync with latest parquet-format

2023-03-17 Thread via GitHub
gszadovszky commented on PR #31: URL: https://github.com/apache/parquet-site/pull/31#issuecomment-1473548025 Even though the last release is old without the release no one should implement the new features. It should work similarly to the releases of implementations like `parquet-mr`. So I

[jira] [Commented] (PARQUET-2256) Adding Compression for BloomFilter

2023-03-17 Thread Xuwei Fu (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701577#comment-17701577 ] Xuwei Fu commented on PARQUET-2256: --- [~gszadovszky] Yes, I'd like to. I think having

[jira] [Commented] (PARQUET-2256) Adding Compression for BloomFilter

2023-03-17 Thread Gabor Szadovszky (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701575#comment-17701575 ] Gabor Szadovszky commented on PARQUET-2256: --- [~mwish], would you mind to do s

[jira] [Commented] (PARQUET-2222) [Format] RLE encoding spec incorrect for v2 data pages

2023-03-17 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701573#comment-17701573 ] ASF GitHub Bot commented on PARQUET-: - wgtmac commented on PR #193: URL: ht

[GitHub] [parquet-format] wgtmac commented on pull request #193: PARQUET-2222: RLE encoding spec incorrect for v2 data pages

2023-03-17 Thread via GitHub
wgtmac commented on PR #193: URL: https://github.com/apache/parquet-format/pull/193#issuecomment-1473382293 Gentle ping @gszadovszky

[jira] [Assigned] (PARQUET-2256) Adding Compression for BloomFilter

2023-03-17 Thread Gabor Szadovszky (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Szadovszky reassigned PARQUET-2256: - Assignee: Xuwei Fu > Adding Compression for BloomFilter >

[GitHub] [parquet-site] wgtmac commented on pull request #31: PARQUET-2259: Update site to sync with latest parquet-format

2023-03-17 Thread via GitHub
wgtmac commented on PR #31: URL: https://github.com/apache/parquet-site/pull/31#issuecomment-1473373099 > Have you copied from master or from the latest release? (I think, the latest release would be preferred.) I copied from `master` because the latest v2.9.0 was released almost two yea

[GitHub] [parquet-site] gszadovszky commented on pull request #31: PARQUET-2259: Update site to sync with latest parquet-format

2023-03-17 Thread via GitHub
gszadovszky commented on PR #31: URL: https://github.com/apache/parquet-site/pull/31#issuecomment-1473359437 Thanks for taking care of this, @wgtmac! Have you copied from `master` or from the latest release? (I think, the latest release would be preferred.) Also, what do you think abou

[jira] [Commented] (PARQUET-2258) Storing toString fields in FilterPredicate instances can lead to memory pressure

2023-03-17 Thread Gabor Szadovszky (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701568#comment-17701568 ] Gabor Szadovszky commented on PARQUET-2258: --- Thanks for fixing this, [~abstra

[jira] [Commented] (PARQUET-2256) Adding Compression for BloomFilter

2023-03-17 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701567#comment-17701567 ] ASF GitHub Bot commented on PARQUET-2256: - mapleFU commented on PR #195: URL: h

[GitHub] [parquet-format] mapleFU commented on pull request #195: PARQUET-2256: Add BloomFilter Compression

2023-03-17 Thread via GitHub
mapleFU commented on PR #195: URL: https://github.com/apache/parquet-format/pull/195#issuecomment-1473351026 @gszadovszky Mind take a look?

[jira] [Commented] (PARQUET-2256) Adding Compression for BloomFilter

2023-03-17 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701566#comment-17701566 ] ASF GitHub Bot commented on PARQUET-2256: - mapleFU opened a new pull request, #

[GitHub] [parquet-format] mapleFU opened a new pull request, #195: PARQUET-2256: Add BloomFilter Compression

2023-03-17 Thread via GitHub
mapleFU opened a new pull request, #195: URL: https://github.com/apache/parquet-format/pull/195 Make sure you have checked _all_ steps below. ### Jira - [x] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them

[jira] [Commented] (PARQUET-1690) Integer Overflow of BinaryStatistics#isSmallerThan()

2023-03-17 Thread Gabor Szadovszky (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701561#comment-17701561 ] Gabor Szadovszky commented on PARQUET-1690: --- [~humanoid], I don't know/rememb

[jira] [Updated] (PARQUET-2259) [Site] Update parquet site

2023-03-17 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated PARQUET-2259: Labels: pull-request-available (was: ) > [Site] Update parquet site > --

[GitHub] [parquet-site] wgtmac commented on pull request #31: PARQUET-2259: Update site to sync with latest parquet-format

2023-03-17 Thread via GitHub
wgtmac commented on PR #31: URL: https://github.com/apache/parquet-site/pull/31#issuecomment-1473291804 I have copied corresponding text from parquet-format to make it easy to update in the future. Please take a look, thanks! @gszadovszky @shangxinli

[jira] [Commented] (PARQUET-2258) Storing toString fields in FilterPredicate instances can lead to memory pressure

2023-03-17 Thread Jira
[ https://issues.apache.org/jira/browse/PARQUET-2258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701543#comment-17701543 ] László Bodor commented on PARQUET-2258: --- thanks [~gszadovszky] and [~wgtmac] for

[jira] [Resolved] (PARQUET-2258) Storing toString fields in FilterPredicate instances can lead to memory pressure

2023-03-17 Thread Jira
[ https://issues.apache.org/jira/browse/PARQUET-2258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] László Bodor resolved PARQUET-2258. --- Resolution: Fixed > Storing toString fields in FilterPredicate instances can lead to memory

[jira] [Updated] (PARQUET-2258) Storing toString fields in FilterPredicate instances can lead to memory pressure

2023-03-17 Thread Jira
[ https://issues.apache.org/jira/browse/PARQUET-2258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] László Bodor updated PARQUET-2258: -- Fix Version/s: 1.12.3 > Storing toString fields in FilterPredicate instances can lead to memo

[jira] [Comment Edited] (PARQUET-1690) Integer Overflow of BinaryStatistics#isSmallerThan()

2023-03-17 Thread Alexey Diomin (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701541#comment-17701541 ] Alexey Diomin edited comment on PARQUET-1690 at 3/17/23 7:17 AM:

[jira] [Commented] (PARQUET-1690) Integer Overflow of BinaryStatistics#isSmallerThan()

2023-03-17 Thread Alexey Diomin (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17701541#comment-17701541 ] Alexey Diomin commented on PARQUET-1690: [~gszadovszky] could you review the l