[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-08-25 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759204#comment-17759204 ] ASF GitHub Bot commented on PARQUET-2261: - etseidl commented on code in PR #197

[GitHub] [parquet-format] etseidl commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-08-25 Thread via GitHub
etseidl commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1306335556 ## src/main/thrift/parquet.thrift: ## @@ -974,6 +1050,13 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5: optio

[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-08-25 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759111#comment-17759111 ] ASF GitHub Bot commented on PARQUET-2261: - etseidl commented on PR #197: URL: h

[GitHub] [parquet-format] etseidl commented on pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-08-25 Thread via GitHub
etseidl commented on PR #197: URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1693675685 Question about the two implementation rule...is there a set of preferred implementations? Since I'll likely be the one implementing this change in libcudf, I'd be happy to prototype

[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-08-25 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759100#comment-17759100 ] ASF GitHub Bot commented on PARQUET-2261: - emkornfield commented on PR #197: UR

[GitHub] [parquet-format] emkornfield commented on pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-08-25 Thread via GitHub
emkornfield commented on PR #197: URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1693610362 Just wanted to tag that https://github.com/apache/parquet-format/pull/197#discussion_r1301338683 (tagging here because it is a comment on an outdated version) seems to be the la

[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-08-25 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759099#comment-17759099 ] ASF GitHub Bot commented on PARQUET-2261: - emkornfield commented on code in PR

[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-08-25 Thread via GitHub
emkornfield commented on code in PR #197: URL: https://github.com/apache/parquet-format/pull/197#discussion_r1305871232 ## src/main/thrift/parquet.thrift: ## @@ -974,6 +1050,13 @@ struct ColumnIndex { /** A list containing the number of null values for each page **/ 5: o

[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-08-25 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759079#comment-17759079 ] ASF GitHub Bot commented on PARQUET-2261: - m29498 commented on PR #197: URL: ht

[GitHub] [parquet-format] m29498 commented on pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

2023-08-25 Thread via GitHub
m29498 commented on PR #197: URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1693533716 Thanks @GregoryKimball and @etseidl We would also find this change very useful. As @GregoryKimball mentioned, we can use the extra size statistics in the page footer to be able to mor

[GitHub] [parquet-mr] wgtmac commented on pull request #1111: PARQUET-1822: Avoid requiring Hadoop installation for reading/writing

2023-08-25 Thread via GitHub
wgtmac commented on PR #: URL: https://github.com/apache/parquet-mr/pull/#issuecomment-1693451139 @eyeyar03 We haven't released the next major version 1.14.0 yet, so that's why you cannot see it from there. Usually we don't backport any new feature to a minor release, so the next mi

[jira] [Commented] (PARQUET-1822) Parquet without Hadoop dependencies

2023-08-25 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759069#comment-17759069 ] ASF GitHub Bot commented on PARQUET-1822: - wgtmac commented on PR #: URL: h

[jira] [Commented] (PARQUET-1822) Parquet without Hadoop dependencies

2023-08-25 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17759048#comment-17759048 ] ASF GitHub Bot commented on PARQUET-1822: - eyeyar03 commented on PR #: URL:

[GitHub] [parquet-mr] eyeyar03 commented on pull request #1111: PARQUET-1822: Avoid requiring Hadoop installation for reading/writing

2023-08-25 Thread via GitHub
eyeyar03 commented on PR #: URL: https://github.com/apache/parquet-mr/pull/#issuecomment-1693352011 Hi @wgtmac , just new to exploring and parsing parquet in Java.. I've been trying the sample code here but I can't make it work because of the LocalInputFile not yet in the current ve

[jira] [Commented] (PARQUET-2342) Parquet writer produced a corrupted file due to page value count overflow

2023-08-25 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758874#comment-17758874 ] ASF GitHub Bot commented on PARQUET-2342: - majdyz commented on PR #1135: URL: h

[GitHub] [parquet-mr] majdyz commented on pull request #1135: PARQUET-2342: Fix writing corrupted parquet file by avoiding overflow on page value count

2023-08-25 Thread via GitHub
majdyz commented on PR #1135: URL: https://github.com/apache/parquet-mr/pull/1135#issuecomment-1692954460 @wgtmac I renamed the config to align with its actual intention -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use

[jira] [Commented] (PARQUET-2342) Parquet writer produced a corrupted file due to page value count overflow

2023-08-25 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758865#comment-17758865 ] ASF GitHub Bot commented on PARQUET-2342: - majdyz commented on PR #1135: URL: h

[GitHub] [parquet-mr] majdyz commented on pull request #1135: PARQUET-2342: Fix writing corrupted parquet file by avoiding overflow on page value count

2023-08-25 Thread via GitHub
majdyz commented on PR #1135: URL: https://github.com/apache/parquet-mr/pull/1135#issuecomment-1692922180 @boneanxs It's `org.apache.parquet.io.ParquetDecodingException: totalValueCount '-2094967296' <= 0` You can run the added test with the fix excluded to repro the issue.
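A negative `totalValueCount` such as the one quoted above is what a count looks like after it exceeds the signed 32-bit range and wraps around. The sketch below is illustrative only and is not taken from the PR: it emulates a Java `int` narrowing cast in Python, and the count of 2,200,000,000 is a hypothetical value chosen because it wraps to the number in the error message. The thread does not state the actual count.

```python
def to_int32(n: int) -> int:
    """Wrap an integer to a signed 32-bit value, as a Java (int) cast would."""
    n &= 0xFFFFFFFF                      # keep only the low 32 bits
    return n - 0x1_0000_0000 if n >= 0x8000_0000 else n


INT32_MAX = 2**31 - 1                    # 2147483647, the largest Java int

# Hypothetical value count (an assumption, not a figure from the thread).
value_count = 2_200_000_000

wrapped = to_int32(value_count)
print(f"true count:    {value_count}")
print(f"as 32-bit int: {wrapped}")
print(f"exceeds int32: {value_count > INT32_MAX}")
```

Any count above 2147483647 stored in a 32-bit signed field wraps in this way, which is consistent with the reader rejecting the file with a `<= 0` check.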

[jira] [Commented] (PARQUET-2342) Parquet writer produced a corrupted file due to page value count overflow

2023-08-25 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758861#comment-17758861 ] ASF GitHub Bot commented on PARQUET-2342: - wgtmac commented on code in PR #1135

[GitHub] [parquet-mr] wgtmac commented on a diff in pull request #1135: PARQUET-2342: Fix writing corrupted parquet file by avoiding overflow on page value count

2023-08-25 Thread via GitHub
wgtmac commented on code in PR #1135: URL: https://github.com/apache/parquet-mr/pull/1135#discussion_r1305294522 ## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java: ## @@ -146,6 +146,7 @@ public static enum JobSummaryLevel { public static final

[jira] [Commented] (PARQUET-2342) Parquet writer produced a corrupted file due to page value count overflow

2023-08-25 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758859#comment-17758859 ] ASF GitHub Bot commented on PARQUET-2342: - boneanxs commented on PR #1135: URL:

[GitHub] [parquet-mr] boneanxs commented on pull request #1135: PARQUET-2342: Fix writing corrupted parquet file by avoiding overflow on page value count

2023-08-25 Thread via GitHub
boneanxs commented on PR #1135: URL: https://github.com/apache/parquet-mr/pull/1135#issuecomment-1692901137 Hey @majdyz Can you share the exact error you got when reading this corrupt file?

[jira] [Commented] (PARQUET-2342) Parquet writer produced a corrupted file due to page value count overflow

2023-08-25 Thread ASF GitHub Bot (Jira)
[ https://issues.apache.org/jira/browse/PARQUET-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758858#comment-17758858 ] ASF GitHub Bot commented on PARQUET-2342: - majdyz commented on code in PR #1135

[GitHub] [parquet-mr] majdyz commented on a diff in pull request #1135: PARQUET-2342: Fix writing corrupted parquet file by avoiding overflow on page value count

2023-08-25 Thread via GitHub
majdyz commented on code in PR #1135: URL: https://github.com/apache/parquet-mr/pull/1135#discussion_r1305292076 ## parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriteStoreBase.java: ## @@ -231,7 +231,9 @@ private void sizeCheck() { long usedMem = writ