[GitHub] [parquet-mr] zhongyujiang commented on pull request #1028: PARQUET-2244: Fix notIn for columns with null values

2023-02-16 Thread via GitHub
zhongyujiang commented on PR #1028: URL: https://github.com/apache/parquet-mr/pull/1028#issuecomment-1433005114 > I don't know whether any downstream relies on Parquet judging value <> null as TRUE instead of UNKNOWN; I guess that might be in some non-ANSI-standard engines. I

[jira] [Commented] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn

2023-02-16 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689729#comment-17689729 ] ASF GitHub Bot commented on PARQUET-2244: - zhongyujiang commented on PR #1028: URL:

[jira] [Created] (PARQUET-2247) Fail-fast if CapacityByteArrayOutputStream write overflow

2023-02-16 Thread dzcxzl (Jira)
dzcxzl created PARQUET-2247: --- Summary: Fail-fast if CapacityByteArrayOutputStream write overflow Key: PARQUET-2247 URL: https://issues.apache.org/jira/browse/PARQUET-2247 Project: Parquet Issue

[jira] [Commented] (PARQUET-2247) Fail-fast if CapacityByteArrayOutputStream write overflow

2023-02-16 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689770#comment-17689770 ] ASF GitHub Bot commented on PARQUET-2247: - cxzl25 opened a new pull request, #1031: URL:

[jira] [Commented] (PARQUET-2247) Fail-fast if CapacityByteArrayOutputStream write overflow

2023-02-16 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689782#comment-17689782 ] ASF GitHub Bot commented on PARQUET-2247: - cxzl25 commented on code in PR #1031: URL:

[GitHub] [parquet-mr] cxzl25 opened a new pull request, #1031: PARQUET-2247: Fail-fast if CapacityByteArrayOutputStream write overflow

2023-02-16 Thread via GitHub
cxzl25 opened a new pull request, #1031: URL: https://github.com/apache/parquet-mr/pull/1031 Make sure you have checked _all_ steps below. ### Jira - [ ] My PR addresses the following [PARQUET-2247](https://issues.apache.org/jira/browse/PARQUET-2247) ### Tests -

[GitHub] [parquet-mr] cxzl25 commented on a diff in pull request #1031: PARQUET-2247: Fail-fast if CapacityByteArrayOutputStream write overflow

2023-02-16 Thread via GitHub
cxzl25 commented on code in PR #1031: URL: https://github.com/apache/parquet-mr/pull/1031#discussion_r1108530943 ## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java: ## @@ -160,7 +160,7 @@ public void writePage(BytesInput bytes,

[jira] [Commented] (PARQUET-2243) Support zstd-jni in DirectCodecFactory

2023-02-16 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689600#comment-17689600 ] ASF GitHub Bot commented on PARQUET-2243: - gszadovszky commented on PR #1027: URL:

[jira] [Commented] (PARQUET-2241) ByteStreamSplitDecoder broken in presence of nulls

2023-02-16 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689605#comment-17689605 ] ASF GitHub Bot commented on PARQUET-2241: - wgtmac commented on PR #1025: URL:

[jira] [Commented] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn

2023-02-16 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689602#comment-17689602 ] ASF GitHub Bot commented on PARQUET-2244: - gszadovszky commented on PR #1028: URL:

[GitHub] [parquet-mr] gszadovszky commented on pull request #1028: PARQUET-2244: Fix notIn for columns with null values

2023-02-16 Thread via GitHub
gszadovszky commented on PR #1028: URL: https://github.com/apache/parquet-mr/pull/1028#issuecomment-1432685172 It seems I pushed it too quickly. Sorry for not giving you time to give feedback, @huaxingao and @wgtmac. @zhongyujiang, feel free to put up another PR with the revert. --

[GitHub] [parquet-mr] gszadovszky commented on pull request #1027: PARQUET-2243: Support zstd-jni in DirectCodecFactory

2023-02-16 Thread via GitHub
gszadovszky commented on PR #1027: URL: https://github.com/apache/parquet-mr/pull/1027#issuecomment-1432679571 Thank you @wgtmac for the review! I'll push it tomorrow if there are no objections. -- This is an automated message from the Apache Git Service. To respond to the message, please

[GitHub] [parquet-mr] wgtmac commented on pull request #1025: PARQUET-2241: Fix ByteStreamSplitValuesReader with nulls

2023-02-16 Thread via GitHub
wgtmac commented on PR #1025: URL: https://github.com/apache/parquet-mr/pull/1025#issuecomment-1432693254 @gszadovszky Have time to take a look? The fix is pretty trivial. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and

[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1026: PARQUET-2228: ParquetRewriter supports more than one input file

2023-02-16 Thread via GitHub
wgtmac commented on code in PR #1026: URL: https://github.com/apache/parquet-mr/pull/1026#discussion_r1108635269 ## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/rewrite/RewriteOptions.java: ## @@ -101,37 +103,121 @@ public static class Builder { private List

[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-02-16 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689805#comment-17689805 ] ASF GitHub Bot commented on PARQUET-2228: - wgtmac commented on code in PR #1026: URL:

[jira] [Created] (PARQUET-2248) ParquetRewriter supports merging files by record

2023-02-16 Thread Gang Wu (Jira)
Gang Wu created PARQUET-2248: Summary: ParquetRewriter supports merging files by record Key: PARQUET-2248 URL: https://issues.apache.org/jira/browse/PARQUET-2248 Project: Parquet Issue Type:

[jira] [Commented] (PARQUET-2247) Fail-fast if CapacityByteArrayOutputStream write overflow

2023-02-16 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690092#comment-17690092 ] ASF GitHub Bot commented on PARQUET-2247: - wgtmac commented on code in PR #1031: URL:

[GitHub] [parquet-mr] wgtmac commented on pull request #1026: PARQUET-2228: ParquetRewriter supports more than one input file

2023-02-16 Thread via GitHub
wgtmac commented on PR #1026: URL: https://github.com/apache/parquet-mr/pull/1026#issuecomment-1433984809 I saw a test failure below from the [GHA](https://github.com/apache/parquet-mr/actions/runs/4195487917/jobs/7275103509) which is unstable: ``` Error: Tests run: 6, Failures: 1,

[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1031: PARQUET-2247: Fail-fast if CapacityByteArrayOutputStream write overflow

2023-02-16 Thread via GitHub
wgtmac commented on code in PR #1031: URL: https://github.com/apache/parquet-mr/pull/1031#discussion_r1109219562 ## parquet-common/src/main/java/org/apache/parquet/bytes/CapacityByteArrayOutputStream.java: ## @@ -220,6 +221,11 @@ public void write(byte b[], int off, int len) {

[jira] [Commented] (PARQUET-2228) ParquetRewriter supports more than one input file

2023-02-16 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690085#comment-17690085 ] ASF GitHub Bot commented on PARQUET-2228: - wgtmac commented on PR #1026: URL:

[GitHub] [parquet-mr] cxzl25 commented on a diff in pull request #1031: PARQUET-2247: Fail-fast if CapacityByteArrayOutputStream write overflow

2023-02-16 Thread via GitHub
cxzl25 commented on code in PR #1031: URL: https://github.com/apache/parquet-mr/pull/1031#discussion_r1109244080 ## parquet-common/src/main/java/org/apache/parquet/bytes/CapacityByteArrayOutputStream.java: ## @@ -220,6 +221,11 @@ public void write(byte b[], int off, int len) {

[jira] [Commented] (PARQUET-2247) Fail-fast if CapacityByteArrayOutputStream write overflow

2023-02-16 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690107#comment-17690107 ] ASF GitHub Bot commented on PARQUET-2247: - cxzl25 commented on code in PR #1031: URL:

[jira] [Commented] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn

2023-02-16 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690099#comment-17690099 ] ASF GitHub Bot commented on PARQUET-2244: - wgtmac commented on PR #1028: URL:

[GitHub] [parquet-mr] wgtmac commented on pull request #1028: PARQUET-2244: Fix notIn for columns with null values

2023-02-16 Thread via GitHub
wgtmac commented on PR #1028: URL: https://github.com/apache/parquet-mr/pull/1028#issuecomment-1434018036 > I don't have a strong opinion on whether to keep or revert the fix. The fix won't cause any correctness issue on the engine side because the engine will filter again. Same here.

[jira] [Commented] (PARQUET-2247) Fail-fast if CapacityByteArrayOutputStream write overflow

2023-02-16 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690110#comment-17690110 ] ASF GitHub Bot commented on PARQUET-2247: - wgtmac commented on code in PR #1031: URL:

[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1031: PARQUET-2247: Fail-fast if CapacityByteArrayOutputStream write overflow

2023-02-16 Thread via GitHub
wgtmac commented on code in PR #1031: URL: https://github.com/apache/parquet-mr/pull/1031#discussion_r1109248942 ## parquet-common/src/main/java/org/apache/parquet/bytes/CapacityByteArrayOutputStream.java: ## @@ -205,6 +206,12 @@ public void write(byte b[], int off, int len) {

[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-02-16 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690164#comment-17690164 ] ASF GitHub Bot commented on PARQUET-2159: - jiangjiguang commented on code in PR #1011: URL:

[GitHub] [parquet-mr] jiangjiguang commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization

2023-02-16 Thread via GitHub
jiangjiguang commented on code in PR #1011: URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1109339286 ## parquet-generator/src/main/java/org/apache/parquet/encoding/vectorbitpacking/BitPackingGenerator512Vector.java: ## @@ -0,0 +1,67 @@ +/* + * Licensed to the

[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-02-16 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690168#comment-17690168 ] ASF GitHub Bot commented on PARQUET-2159: - jiangjiguang commented on code in PR #1011: URL:

[GitHub] [parquet-mr] jiangjiguang commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization

2023-02-16 Thread via GitHub
jiangjiguang commented on code in PR #1011: URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1109345316 ## parquet-column/src/main/java/org/apache/parquet/column/values/bitpacking/ParquetReadRouter.java: ## @@ -0,0 +1,133 @@ +/* + * Licensed to the Apache

[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-02-16 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690170#comment-17690170 ] ASF GitHub Bot commented on PARQUET-2159: - jiangjiguang commented on code in PR #1011: URL:

[GitHub] [parquet-mr] huaxingao commented on pull request #1028: PARQUET-2244: Fix notIn for columns with null values

2023-02-16 Thread via GitHub
huaxingao commented on PR #1028: URL: https://github.com/apache/parquet-mr/pull/1028#issuecomment-1434016401 I don't have a strong opinion on whether to keep or revert the fix. The fix won't cause any correctness issue on the engine side because the engine will filter again. -- This is an

[jira] [Commented] (PARQUET-2244) Dictionary filter may skip row-groups incorrectly when evaluating notIn

2023-02-16 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690097#comment-17690097 ] ASF GitHub Bot commented on PARQUET-2244: - huaxingao commented on PR #1028: URL:

[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1031: PARQUET-2247: Fail-fast if CapacityByteArrayOutputStream write overflow

2023-02-16 Thread via GitHub
wgtmac commented on code in PR #1031: URL: https://github.com/apache/parquet-mr/pull/1031#discussion_r1109249438 ## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java: ## @@ -160,7 +160,7 @@ public void writePage(BytesInput bytes,

[jira] [Commented] (PARQUET-2247) Fail-fast if CapacityByteArrayOutputStream write overflow

2023-02-16 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690111#comment-17690111 ] ASF GitHub Bot commented on PARQUET-2247: - wgtmac commented on code in PR #1031: URL:

[GitHub] [parquet-mr] jiangjiguang commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization

2023-02-16 Thread via GitHub
jiangjiguang commented on code in PR #1011: URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1109367368 ## parquet-generator/src/main/java/org/apache/parquet/encoding/vectorbitpacking/BitPackingGenerator512Vector.java: ## @@ -0,0 +1,67 @@ +/* + * Licensed to the

[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-02-16 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690176#comment-17690176 ] ASF GitHub Bot commented on PARQUET-2159: - jiangjiguang commented on code in PR #1011: URL:

[GitHub] [parquet-mr] jiangjiguang commented on a diff in pull request #1011: PARQUET-2159: java17 vector parquet bit-packing decode optimization

2023-02-16 Thread via GitHub
jiangjiguang commented on code in PR #1011: URL: https://github.com/apache/parquet-mr/pull/1011#discussion_r1109362203 ## parquet-generator/src/main/java/org/apache/parquet/encoding/vectorbitpacking/BitPackingGenerator512Vector.java: ## @@ -0,0 +1,67 @@ +/* + * Licensed to the

[jira] [Commented] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-02-16 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17690182#comment-17690182 ] ASF GitHub Bot commented on PARQUET-2159: - jiangjiguang commented on code in PR #1011: URL: