[jira] [Commented] (PARQUET-2122) Adding Bloom filter to small Parquet file bloats in size X1700
[ https://issues.apache.org/jira/browse/PARQUET-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492418#comment-17492418 ]

Junjie Chen commented on PARQUET-2122:
--------------------------------------

That's the default size of the bloom filter. Please configure parquet.bloom.filter.max.bytes to fit.

> Adding Bloom filter to small Parquet file bloats in size X1700
> --------------------------------------------------------------
>
> Key: PARQUET-2122
> URL: https://issues.apache.org/jira/browse/PARQUET-2122
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cli, parquet-mr
> Affects Versions: 1.13.0
> Reporter: Ze'ev Maor
> Priority: Critical
> Attachments: data.csv, data_index_bloom.parquet
>
> Converting a small CSV file (14 rows, 1 string column) to Parquet without a
> Bloom filter yields a 600 B file; adding '.withBloomFilterEnabled(true)' to
> ParquetWriter then yields a 1049197 B file.
> It isn't clear what the extra space is used for.
> Attached are the CSV and the bloated Parquet files.

--
This message was sent by Atlassian Jira (v8.20.1#820001)
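The numbers in the report can be checked with a little arithmetic. The sketch below is only an illustration: it subtracts the two reported file sizes and compares the difference with 1 MiB, then estimates how small a filter for 14 values could be using the textbook Bloom filter formula. The textbook formula, the 14 distinct values, and the 1% false-positive rate are assumptions made here for illustration; they are not taken from parquet-mr's own sizing code.

```python
import math

# File sizes reported in PARQUET-2122, in bytes.
SIZE_WITHOUT_BLOOM = 600
SIZE_WITH_BLOOM = 1_049_197
ONE_MIB = 1024 * 1024  # 1,048,576 bytes


def classic_bloom_bytes(ndv: int, fpp: float) -> int:
    """Textbook Bloom filter size, m = -n * ln(p) / (ln 2)^2 bits, in whole bytes.

    This is the generic formula, used here as an assumption; it is not
    claimed to be the sizing rule that parquet-mr applies.
    """
    bits = -ndv * math.log(fpp) / (math.log(2) ** 2)
    return math.ceil(bits / 8)


# Extra bytes that appeared once the Bloom filter was enabled.
overhead = SIZE_WITH_BLOOM - SIZE_WITHOUT_BLOOM
print("overhead:", overhead)
print("overhead minus 1 MiB:", overhead - ONE_MIB)

# Assumed workload: 14 distinct values, 1% false-positive probability.
print("textbook size for 14 values:", classic_bloom_bytes(14, 0.01), "bytes")
```

The overhead lands within a few dozen bytes of 1 MiB, which is consistent with the comment that the extra space is a fixed default allocation rather than something proportional to the 14 rows. Under the assumed formula, a filter sized for the actual data would need only a handful of bytes, which is why lowering parquet.bloom.filter.max.bytes is the suggested remedy.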
[jira] [Created] (PARQUET-2125) ParquetFileReader has a currentBlock information in a private field
Tanuja Dubey created PARQUET-2125:
----------------------------------

Summary: ParquetFileReader has a currentBlock information in a private field
Key: PARQUET-2125
URL: https://issues.apache.org/jira/browse/PARQUET-2125
Project: Parquet
Issue Type: Wish
Components: parquet-avro
Affects Versions: 1.8.1
Reporter: Tanuja Dubey

The currentBlock variable holds useful information: it tells which block the current record is being read from. If this variable had a getter, it would be possible to skip over certain blocks.
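The request can be pictured with a small stand-in. The class below is hypothetical and is not the ParquetFileReader API: `BlockReader`, `current_block`, `skip_block`, and `read_block` are names invented for this sketch. It only shows how a read-only accessor on an otherwise private position lets a caller decide which blocks to skip.

```python
class BlockReader:
    """Hypothetical reader over a list of blocks (not the parquet-mr API)."""

    def __init__(self, blocks):
        self._blocks = [list(block) for block in blocks]
        self._current_block = 0  # private position, analogous to currentBlock

    @property
    def current_block(self) -> int:
        """Read-only index of the block that would be read next."""
        return self._current_block

    def has_next(self) -> bool:
        return self._current_block < len(self._blocks)

    def skip_block(self) -> None:
        """Advance past the current block without returning its records."""
        self._current_block += 1

    def read_block(self):
        """Return the records of the current block and advance."""
        records = self._blocks[self._current_block]
        self._current_block += 1
        return records


def read_except(reader: BlockReader, unwanted: set) -> list:
    """Read every record except those in blocks whose index is in `unwanted`."""
    records = []
    while reader.has_next():
        if reader.current_block in unwanted:
            reader.skip_block()
        else:
            records.extend(reader.read_block())
    return records


reader = BlockReader([["a"], ["b", "c"], ["d"]])
print(read_except(reader, {1}))
```

Without the accessor, the caller in `read_except` would have to keep its own counter in step with the reader, which is the bookkeeping the issue is asking to avoid.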
[jira] [Commented] (PARQUET-2121) Remove descriptions for the removed modules
[ https://issues.apache.org/jira/browse/PARQUET-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492347#comment-17492347 ]

ASF GitHub Bot commented on PARQUET-2121:
-----------------------------------------

sekikn commented on a change in pull request #947:
URL: https://github.com/apache/parquet-mr/pull/947#discussion_r806426714

## File path: README.md ##
@@ -66,10 +66,8 @@ Parquet is a very active project, and new features are being added quickly. Here
 * Type-specific encoding
 * Hive integration (deprecated)
 * Pig integration
-* Cascading integration

Review comment:
Thank you for the comment, @shangxinli! I updated the PR. Instead of removing the lines, I just added '(deprecated)' to parquet-cascading* and parquet-scrooge in README.md, just like parquet-hive.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Remove descriptions for the removed modules
> -------------------------------------------
>
> Key: PARQUET-2121
> URL: https://issues.apache.org/jira/browse/PARQUET-2121
> Project: Parquet
> Issue Type: Improvement
> Reporter: Kengo Seki
> Assignee: Kengo Seki
> Priority: Minor
>
> PARQUET-2020 removed some deprecated modules, but the related descriptions
> still remain in some documents. They should be removed since their existence
> is misleading.
[jira] [Updated] (PARQUET-2124) Bad DCHECK For Intermixed Dictionary Encoding
[ https://issues.apache.org/jira/browse/PARQUET-2124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated PARQUET-2124:
-----------------------------------
Labels: pull-request-available (was: )

> Bad DCHECK For Intermixed Dictionary Encoding
> ---------------------------------------------
>
> Key: PARQUET-2124
> URL: https://issues.apache.org/jira/browse/PARQUET-2124
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Reporter: William Butler
> Assignee: William Butler
> Priority: Minor
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Parquet CPP has a DCHECK for a dictionary encoded page coming after a
> non-dictionary encoded page. This is bad because the DCHECK can be triggered
> by Parquet files that have a column that has a dictionary page, then a
> non-dictionary encoded page, then a page of dictionary encoded values
> (indices). Fuzzing found such a file. While this could be turned into an
> exception, I don't see anything in the Parquet specification that prohibits
> such a sequence of pages.
> This situation has been brought up on the mailing list before
> (https://lists.apache.org/thread/3bzymmbxvmzj12km7cjz1150ndvy9bos), and it
> seems like this is valid but nobody is doing it.
> In the PR that added this check (https://github.com/apache/parquet-cpp/pull/73),
> it was noted that the check is probably not needed.
[jira] [Created] (PARQUET-2124) Bad DCHECK For Intermixed Dictionary Encoding
William Butler created PARQUET-2124:
------------------------------------

Summary: Bad DCHECK For Intermixed Dictionary Encoding
Key: PARQUET-2124
URL: https://issues.apache.org/jira/browse/PARQUET-2124
Project: Parquet
Issue Type: Bug
Components: parquet-cpp
Reporter: William Butler
Assignee: William Butler

Parquet CPP has a DCHECK for a dictionary encoded page coming after a non-dictionary encoded page. This is bad because the DCHECK can be triggered by Parquet files that have a column that has a dictionary page, then a non-dictionary encoded page, then a page of dictionary encoded values (indices). Fuzzing found such a file. While this could be turned into an exception, I don't see anything in the Parquet specification that prohibits such a sequence of pages.

This situation has been brought up on the mailing list before (https://lists.apache.org/thread/3bzymmbxvmzj12km7cjz1150ndvy9bos), and it seems like this is valid but nobody is doing it.

In the PR that added this check (https://github.com/apache/parquet-cpp/pull/73), it was noted that the check is probably not needed.
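The page sequence described above can be made concrete with a toy decoder. This is a simplified model, not parquet-cpp code: the `Page` type, its `kind` strings, and `decode_column` are invented for illustration, and real pages carry encoded bytes rather than Python lists. The point it tries to show is that a decoder which chooses its decoding path per page handles a dictionary-encoded page that follows a plain page without any special case.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Page:
    """Toy page model; `kind` is 'dictionary', 'plain', or 'dict_indices'."""
    kind: str
    payload: Sequence  # dictionary entries, raw values, or dictionary indices


def decode_column(pages):
    """Decode the pages of one column chunk, picking the decoder per page."""
    dictionary = None
    values = []
    for page in pages:
        if page.kind == "dictionary":
            dictionary = list(page.payload)
        elif page.kind == "plain":
            # Non-dictionary page: values are stored directly.
            values.extend(page.payload)
        elif page.kind == "dict_indices":
            # Dictionary-encoded page: only invalid if no dictionary was seen.
            if dictionary is None:
                raise ValueError("dictionary-encoded page without a dictionary page")
            values.extend(dictionary[i] for i in page.payload)
        else:
            raise ValueError(f"unknown page kind: {page.kind}")
    return values


# Dictionary page, then a plain page, then a dictionary-encoded page.
pages = [
    Page("dictionary", ["x", "y"]),
    Page("plain", ["p", "q"]),
    Page("dict_indices", [1, 0, 1]),
]
print(decode_column(pages))
```

In this model the only ordering that is rejected is a dictionary-encoded page with no dictionary available, which is a narrower condition than "dictionary-encoded page after a non-dictionary page".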
[jira] [Updated] (PARQUET-2123) Invalid memory access in ScanFileContents
[ https://issues.apache.org/jira/browse/PARQUET-2123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated PARQUET-2123:
-----------------------------------
Labels: pull-request-available (was: )

> Invalid memory access in ScanFileContents
> -----------------------------------------
>
> Key: PARQUET-2123
> URL: https://issues.apache.org/jira/browse/PARQUET-2123
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cpp
> Reporter: William Butler
> Assignee: William Butler
> Priority: Minor
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> When a Parquet file has 0 columns, ScanFileContents will try to access the
> 0th element of a size 0 vector.
[jira] [Created] (PARQUET-2123) Invalid memory access in ScanFileContents
William Butler created PARQUET-2123:
------------------------------------

Summary: Invalid memory access in ScanFileContents
Key: PARQUET-2123
URL: https://issues.apache.org/jira/browse/PARQUET-2123
Project: Parquet
Issue Type: Bug
Components: parquet-cpp
Reporter: William Butler
Assignee: William Butler

When a Parquet file has 0 columns, ScanFileContents will try to access the 0th element of a size 0 vector.
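The failure mode is the usual "index an empty container" bug. The sketch below is a hypothetical Python analogue, not the parquet-cpp function: `scan_file_contents` here takes a list of per-column row counts, which is an assumption about the shape of the data, and returning 0 for a file with no columns is one possible policy (raising an error would be another). In C++ the unguarded access is undefined behaviour rather than a clean exception, which is why the guard matters more there.

```python
def scan_file_contents(rows_per_column):
    """Return the number of rows scanned, given one row count per column.

    `rows_per_column` is a hypothetical input for this sketch. The guard at
    the top is the point: without it, `rows_per_column[0]` fails on a file
    that has zero columns.
    """
    if not rows_per_column:
        # Assumed policy: a file with no columns scans zero rows.
        return 0

    total = rows_per_column[0]
    for index, count in enumerate(rows_per_column[1:], start=1):
        if count != total:
            raise ValueError(
                f"column {index} has {count} rows, expected {total}"
            )
    return total


print(scan_file_contents([]))
print(scan_file_contents([200, 200, 200]))
```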
[jira] [Commented] (PARQUET-2120) parquet-cli dictionary command fails on pages without dictionary encoding
[ https://issues.apache.org/jira/browse/PARQUET-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492126#comment-17492126 ]

ASF GitHub Bot commented on PARQUET-2120:
-----------------------------------------

shangxinli commented on pull request #946:
URL: https://github.com/apache/parquet-mr/pull/946#issuecomment-1039360003

Thanks for working on it! Can you squash the commits?

> parquet-cli dictionary command fails on pages without dictionary encoding
> -------------------------------------------------------------------------
>
> Key: PARQUET-2120
> URL: https://issues.apache.org/jira/browse/PARQUET-2120
> Project: Parquet
> Issue Type: Bug
> Components: parquet-cli
> Affects Versions: 1.12.2
> Reporter: Willi Raschkowski
> Priority: Minor
>
> parquet-cli's {{dictionary}} command fails with an NPE if a page does not
> have dictionary encoding:
> {code}
> $ parquet dictionary --column col a-b-c.snappy.parquet
> Unknown error
> java.lang.NullPointerException: Cannot invoke "org.apache.parquet.column.page.DictionaryPage.getEncoding()" because "page" is null
>   at org.apache.parquet.cli.commands.ShowDictionaryCommand.run(ShowDictionaryCommand.java:78)
>   at org.apache.parquet.cli.Main.run(Main.java:155)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>   at org.apache.parquet.cli.Main.main(Main.java:185)
> $ parquet meta a-b-c.snappy.parquet
> ...
> Row group 0: count: 1 46.00 B records start: 4 total: 46 B
>      type    encodings  count  avg size  nulls  min / max
> col  BINARY  S _        1      46.00 B   0      "a" / "a"
> Row group 1: count: 200 0.34 B records start: 50 total: 69 B
>      type    encodings  count  avg size  nulls  min / max
> col  BINARY  S _ R      200    0.34 B    0      "b" / "c"
> {code}
> (Note the missing {{R}} / dictionary encoding on that first page.)
> Someone familiar with Parquet might guess from the NPE that there's no
> dictionary encoding. But for files that mix pages with and without dictionary
> encoding (like above), the command will fail before getting to pages that
> actually have dictionaries.
> The problem is that [this
> line|https://github.com/apache/parquet-mr/blob/300200eb72b9f16df36d9a68cf762683234aeb08/parquet-cli/src/main/java/org/apache/parquet/cli/commands/ShowDictionaryCommand.java#L76]
> assumes {{readDictionaryPage}} always returns a page and doesn't handle when
> it does not, i.e. when it returns {{null}}.
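The shape of a fix can be sketched independently of the Java code. The function below is a hypothetical Python analogue, not the actual ShowDictionaryCommand: `show_dictionaries`, the `read_dictionary_page` callback, and the dict-based page representation are all invented for illustration. It shows the one behaviour the report asks for, which is to treat a missing dictionary page as a normal case and keep going to later row groups.

```python
def show_dictionaries(row_groups, read_dictionary_page):
    """Describe the dictionary of each row group, tolerating missing ones.

    `read_dictionary_page` is a hypothetical callback that returns a page
    (here a dict with 'encoding' and 'values') or None when the row group
    has no dictionary page, mirroring a reader that returns null.
    """
    lines = []
    for index, row_group in enumerate(row_groups):
        page = read_dictionary_page(row_group)
        if page is None:
            # No dictionary here: report it and continue with the next group.
            lines.append(f"Row group {index}: no dictionary")
            continue
        lines.append(
            f"Row group {index}: {page['encoding']} dictionary "
            f"with {len(page['values'])} entries"
        )
    return lines


# Two row groups shaped like the report: the first has no dictionary page.
row_groups = [
    {"dictionary": None},
    {"dictionary": {"encoding": "PLAIN", "values": ["b", "c"]}},
]
for line in show_dictionaries(row_groups, lambda rg: rg["dictionary"]):
    print(line)
```

With this structure the second row group is still reported even though the first one has nothing to show, which is the case where the current command stops early.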
[jira] [Commented] (PARQUET-2121) Remove descriptions for the removed modules
[ https://issues.apache.org/jira/browse/PARQUET-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492117#comment-17492117 ]

ASF GitHub Bot commented on PARQUET-2121:
-----------------------------------------

shangxinli commented on pull request #947:
URL: https://github.com/apache/parquet-mr/pull/947#issuecomment-1039353753

@sekikn Thanks for working on it! I just left some minor comments.
[jira] [Commented] (PARQUET-2121) Remove descriptions for the removed modules
[ https://issues.apache.org/jira/browse/PARQUET-2121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492108#comment-17492108 ]

ASF GitHub Bot commented on PARQUET-2121:
-----------------------------------------

shangxinli commented on a change in pull request #947:
URL: https://github.com/apache/parquet-mr/pull/947#discussion_r806064746

## File path: README.md ##
@@ -66,10 +66,8 @@ Parquet is a very active project, and new features are being added quickly. Here
 * Type-specific encoding
 * Hive integration (deprecated)
 * Pig integration
-* Cascading integration

Review comment:
Since the code is still there, do you think we can just add '(deprecated)' here?
[jira] [Comment Edited] (PARQUET-2122) Adding Bloom filter to small Parquet file bloats in size X1700
[ https://issues.apache.org/jira/browse/PARQUET-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492099#comment-17492099 ]

Xinli Shang edited comment on PARQUET-2122 at 2/14/22, 4:56 PM:
----------------------------------------------------------------

[~junjie] Do you know why?

was (Author: sha...@uber.com):
[~junjie]Do you know why?
[jira] [Created] (PARQUET-2122) Adding Bloom filter to small Parquet file bloats in size X1700
Z M created PARQUET-2122:
-------------------------

Summary: Adding Bloom filter to small Parquet file bloats in size X1700
Key: PARQUET-2122
URL: https://issues.apache.org/jira/browse/PARQUET-2122
Project: Parquet
Issue Type: Bug
Components: parquet-cli, parquet-mr
Affects Versions: 1.13.0
Reporter: Z M
Attachments: data.csv, data_index_bloom.parquet

Converting a small CSV file (14 rows, 1 string column) to Parquet without a Bloom filter yields a 600 B file; adding '.withBloomFilterEnabled(true)' to ParquetWriter then yields a 1049197 B file.

It isn't clear what the extra space is used for.

Attached are the CSV and the bloated Parquet files.