[jira] [Updated] (PARQUET-2398) Make static variables final
[ https://issues.apache.org/jira/browse/PARQUET-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated PARQUET-2398: Labels: pull-request-available (was: ) > Make static variables final > --- > > Key: PARQUET-2398 > URL: https://issues.apache.org/jira/browse/PARQUET-2398 > Project: Parquet > Issue Type: Improvement >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Labels: pull-request-available > Fix For: 1.14.0 > > Time Spent: 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (PARQUET-2407) Add custom .asf.yaml for finer-grained control of email notifications
[ https://issues.apache.org/jira/browse/PARQUET-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated PARQUET-2407: Labels: pull-request-available (was: ) > Add custom .asf.yaml for finer-grained control of email notifications > - > > Key: PARQUET-2407 > URL: https://issues.apache.org/jira/browse/PARQUET-2407 > Project: Parquet > Issue Type: Improvement > Components: parquet-format, parquet-mr >Reporter: Gang Wu >Assignee: Gang Wu >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > As per the discussion on ML: > https://lists.apache.org/thread/4x2ob2ojkznfft3czz0gypmtoz7vo9fz, we can add > a customized .asf.yaml config file for better control of email notifications. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2407) Add custom .asf.yaml for finer-grained control of email notifications
[ https://issues.apache.org/jira/browse/PARQUET-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17793078#comment-17793078 ] ASF GitHub Bot commented on PARQUET-2407: - wgtmac merged PR #1232: URL: https://github.com/apache/parquet-mr/pull/1232 > Add custom .asf.yaml for finer-grained control of email notifications > - > > Key: PARQUET-2407 > URL: https://issues.apache.org/jira/browse/PARQUET-2407 > Project: Parquet > Issue Type: Improvement > Components: parquet-format, parquet-mr >Reporter: Gang Wu >Assignee: Gang Wu >Priority: Major > > As per the discussion on ML: > https://lists.apache.org/thread/4x2ob2ojkznfft3czz0gypmtoz7vo9fz, we can add > a customized .asf.yaml config file for better control of email notifications. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2407) Add custom .asf.yaml for finer-grained control of email notifications
[ https://issues.apache.org/jira/browse/PARQUET-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17793077#comment-17793077 ] ASF GitHub Bot commented on PARQUET-2407: - wgtmac commented on PR #1232: URL: https://github.com/apache/parquet-mr/pull/1232#issuecomment-1839865078 Thanks @gszadovszky @pitrou! I'll merge it to see what happens. If it works as expected, I'll create a PR for parquet-format as well. > Add custom .asf.yaml for finer-grained control of email notifications > - > > Key: PARQUET-2407 > URL: https://issues.apache.org/jira/browse/PARQUET-2407 > Project: Parquet > Issue Type: Improvement > Components: parquet-format, parquet-mr >Reporter: Gang Wu >Assignee: Gang Wu >Priority: Major > > As per the discussion on ML: > https://lists.apache.org/thread/4x2ob2ojkznfft3czz0gypmtoz7vo9fz, we can add > a customized .asf.yaml config file for better control of email notifications. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2408) Fix license header in .gitattributes
[ https://issues.apache.org/jira/browse/PARQUET-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792995#comment-17792995 ] ASF GitHub Bot commented on PARQUET-2408: - Fokko merged PR #1231: URL: https://github.com/apache/parquet-mr/pull/1231 > Fix license header in .gitattributes > > > Key: PARQUET-2408 > URL: https://issues.apache.org/jira/browse/PARQUET-2408 > Project: Parquet > Issue Type: Bug >Reporter: Gang Wu >Assignee: Gang Wu >Priority: Major > Fix For: 1.14.0 > > > {code:java} > ➜ parquet-mr git:(master) ✗ git status > (ASF) is not a valid attribute name: .gitattributes:2 > License, is not a valid attribute name: .gitattributes:6 > "License"); is not a valid attribute name: .gitattributes:7 > http://www.apache.org/licenses/LICENSE-2.0 is not a valid attribute name: > .gitattributes:10 > writing, is not a valid attribute name: .gitattributes:12 > "AS is not a valid attribute name: .gitattributes:14 > KIND, is not a valid attribute name: .gitattributes:15 > On branch master > Your branch is up to date with 'origin/master'. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata
[ https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792965#comment-17792965 ] ASF GitHub Bot commented on PARQUET-2261: - shangxinli commented on code in PR #1177: URL: https://github.com/apache/parquet-mr/pull/1177#discussion_r1414220467 ## parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java: ## @@ -409,4 +428,14 @@ abstract void writePage( ValuesWriter definitionLevels, ValuesWriter values) throws IOException; + + abstract void writePage( + int rowCount, + int valueCount, + Statistics statistics, + SizeStatistics sizeStatistics, Review Comment: The two names could be confusing: sizeStatistics is itself one type of statistics, but we are separating them. > [Format] Add statistics that reflect decoded size to metadata > - > > Key: PARQUET-2261 > URL: https://issues.apache.org/jira/browse/PARQUET-2261 > Project: Parquet > Issue Type: New Feature > Components: parquet-format >Reporter: Micah Kornfield >Assignee: Micah Kornfield >Priority: Major > Fix For: format-2.10.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata
[ https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792963#comment-17792963 ] ASF GitHub Bot commented on PARQUET-2261: - shangxinli commented on code in PR #1177: URL: https://github.com/apache/parquet-mr/pull/1177#discussion_r1414218649 ## parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java: ## @@ -389,7 +400,14 @@ void writePage() { this.rowsWrittenSoFar += pageRowCount; if (DEBUG) LOG.debug("write page"); try { - writePage(pageRowCount, valueCount, statistics, repetitionLevelColumn, definitionLevelColumn, dataColumn); + writePage( + pageRowCount, + valueCount, + statistics, + sizeStatisticsBuilder.build(), Review Comment: Can we have parity of line 406 and 407? You can use a variable in line 407 > [Format] Add statistics that reflect decoded size to metadata > - > > Key: PARQUET-2261 > URL: https://issues.apache.org/jira/browse/PARQUET-2261 > Project: Parquet > Issue Type: New Feature > Components: parquet-format >Reporter: Micah Kornfield >Assignee: Micah Kornfield >Priority: Major > Fix For: format-2.10.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata
[ https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792964#comment-17792964 ] ASF GitHub Bot commented on PARQUET-2261: - shangxinli commented on code in PR #1177: URL: https://github.com/apache/parquet-mr/pull/1177#discussion_r1414218649 ## parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java: ## @@ -389,7 +400,14 @@ void writePage() { this.rowsWrittenSoFar += pageRowCount; if (DEBUG) LOG.debug("write page"); try { - writePage(pageRowCount, valueCount, statistics, repetitionLevelColumn, definitionLevelColumn, dataColumn); + writePage( + pageRowCount, + valueCount, + statistics, + sizeStatisticsBuilder.build(), Review Comment: Can we have the parity of lines 406 and 407? You can use a variable in line 407 > [Format] Add statistics that reflect decoded size to metadata > - > > Key: PARQUET-2261 > URL: https://issues.apache.org/jira/browse/PARQUET-2261 > Project: Parquet > Issue Type: New Feature > Components: parquet-format >Reporter: Micah Kornfield >Assignee: Micah Kornfield >Priority: Major > Fix For: format-2.10.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2407) Add custom .asf.yaml for finer-grained control of email notifications
[ https://issues.apache.org/jira/browse/PARQUET-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792930#comment-17792930 ] ASF GitHub Bot commented on PARQUET-2407: - wgtmac commented on PR #1232: URL: https://github.com/apache/parquet-mr/pull/1232#issuecomment-1838950498 Per the [discussion on ML](https://lists.apache.org/thread/4x2ob2ojkznfft3czz0gypmtoz7vo9fz), I proposed to create an .asf.yaml file for customizing email notification. Please take a look, thanks! @gszadovszky @shangxinli @pitrou @emkornfield > Add custom .asf.yaml for finer-grained control of email notifications > - > > Key: PARQUET-2407 > URL: https://issues.apache.org/jira/browse/PARQUET-2407 > Project: Parquet > Issue Type: Improvement > Components: parquet-format, parquet-mr >Reporter: Gang Wu >Assignee: Gang Wu >Priority: Major > > As per the discussion on ML: > https://lists.apache.org/thread/4x2ob2ojkznfft3czz0gypmtoz7vo9fz, we can add > a customized .asf.yaml config file for better control of email notifications. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2407) Add custom .asf.yaml for finer-grained control of email notifications
[ https://issues.apache.org/jira/browse/PARQUET-2407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792927#comment-17792927 ] ASF GitHub Bot commented on PARQUET-2407: - wgtmac opened a new pull request, #1232: URL: https://github.com/apache/parquet-mr/pull/1232 Make sure you have checked _all_ steps below. ### Jira - [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR" - https://issues.apache.org/jira/browse/PARQUET-2407 - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x). ### Tests - [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: ### Commits - [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 1. Subject is limited to 50 characters (not including Jira issue reference) 1. Subject does not end with a period 1. Subject uses the imperative mood ("add", not "adding") 1. Body wraps at 72 characters 1. Body explains "what" and "why", not "how" ### Style - [ ] My contribution adheres to the code style guidelines and Spotless passes. - To apply the necessary changes, run `mvn spotless:apply -Pvector-plugins` ### Documentation - [ ] In case of new functionality, my PR adds documentation that describes how to use it. 
- All the public functions and the classes in the PR contain Javadoc that explain what it does > Add custom .asf.yaml for finer-grained control of email notifications > - > > Key: PARQUET-2407 > URL: https://issues.apache.org/jira/browse/PARQUET-2407 > Project: Parquet > Issue Type: Improvement > Components: parquet-format, parquet-mr >Reporter: Gang Wu >Assignee: Gang Wu >Priority: Major > > As per the discussion on ML: > https://lists.apache.org/thread/4x2ob2ojkznfft3czz0gypmtoz7vo9fz, we can add > a customized .asf.yaml config file for better control of email notifications. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2408) Fix license header in .gitattributes
[ https://issues.apache.org/jira/browse/PARQUET-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792926#comment-17792926 ] ASF GitHub Bot commented on PARQUET-2408: - wgtmac commented on PR #1231: URL: https://github.com/apache/parquet-mr/pull/1231#issuecomment-1838934525 PTAL @amousavigourabi @Fokko > Fix license header in .gitattributes > > > Key: PARQUET-2408 > URL: https://issues.apache.org/jira/browse/PARQUET-2408 > Project: Parquet > Issue Type: Bug >Reporter: Gang Wu >Assignee: Gang Wu >Priority: Major > > {code:java} > ➜ parquet-mr git:(master) ✗ git status > (ASF) is not a valid attribute name: .gitattributes:2 > License, is not a valid attribute name: .gitattributes:6 > "License"); is not a valid attribute name: .gitattributes:7 > http://www.apache.org/licenses/LICENSE-2.0 is not a valid attribute name: > .gitattributes:10 > writing, is not a valid attribute name: .gitattributes:12 > "AS is not a valid attribute name: .gitattributes:14 > KIND, is not a valid attribute name: .gitattributes:15 > On branch master > Your branch is up to date with 'origin/master'. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2408) Fix license header in .gitattributes
[ https://issues.apache.org/jira/browse/PARQUET-2408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792925#comment-17792925 ] ASF GitHub Bot commented on PARQUET-2408: - wgtmac opened a new pull request, #1231: URL: https://github.com/apache/parquet-mr/pull/1231 Make sure you have checked _all_ steps below. ### Jira - [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR" - https://issues.apache.org/jira/browse/PARQUET-2408 - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x). ### Tests - [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: ### Commits - [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 1. Subject is limited to 50 characters (not including Jira issue reference) 1. Subject does not end with a period 1. Subject uses the imperative mood ("add", not "adding") 1. Body wraps at 72 characters 1. Body explains "what" and "why", not "how" ### Style - [ ] My contribution adheres to the code style guidelines and Spotless passes. - To apply the necessary changes, run `mvn spotless:apply -Pvector-plugins` ### Documentation - [ ] In case of new functionality, my PR adds documentation that describes how to use it. 
- All the public functions and the classes in the PR contain Javadoc that explain what it does > Fix license header in .gitattributes > > > Key: PARQUET-2408 > URL: https://issues.apache.org/jira/browse/PARQUET-2408 > Project: Parquet > Issue Type: Bug >Reporter: Gang Wu >Assignee: Gang Wu >Priority: Major > > {code:java} > ➜ parquet-mr git:(master) ✗ git status > (ASF) is not a valid attribute name: .gitattributes:2 > License, is not a valid attribute name: .gitattributes:6 > "License"); is not a valid attribute name: .gitattributes:7 > http://www.apache.org/licenses/LICENSE-2.0 is not a valid attribute name: > .gitattributes:10 > writing, is not a valid attribute name: .gitattributes:12 > "AS is not a valid attribute name: .gitattributes:14 > KIND, is not a valid attribute name: .gitattributes:15 > On branch master > Your branch is up to date with 'origin/master'. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-1822) Parquet without Hadoop dependencies
[ https://issues.apache.org/jira/browse/PARQUET-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792829#comment-17792829 ] ASF GitHub Bot commented on PARQUET-1822: - drealeed commented on PR #: URL: https://github.com/apache/parquet-mr/pull/#issuecomment-1838584976 @amousavigourabi , that's actually what I did and it's working for us now. Thanks > Parquet without Hadoop dependencies > --- > > Key: PARQUET-1822 > URL: https://issues.apache.org/jira/browse/PARQUET-1822 > Project: Parquet > Issue Type: Improvement > Components: parquet-avro >Affects Versions: 1.11.0 > Environment: Amazon Fargate (linux), Windows development box. > We are writing Parquet to be read by the Snowflake and Athena databases. >Reporter: mark juchems >Priority: Minor > Labels: documentation, newbie > Fix For: 1.14.0 > > > I have been trying for weeks to create a parquet file from avro and write to > S3 in Java. This has been incredibly frustrating and odd as Spark can do it > easily (I'm told). > I have assembled the correct jars through luck and diligence, but now I find > out that I have to have hadoop installed on my machine. I am currently > developing in Windows and it seems a dll and exe can fix that up but am > wondering about Linux as the code will eventually run in Fargate on AWS. > *Why do I need external dependencies and not pure java?* > The thing really is how utterly complex all this is. I would like to create > an avro file and convert it to Parquet and write it to S3, but I am trapped > in "ParquetWriter" hell! > *Why can't I get a normal OutputStream and write it wherever I want?* > I have scoured the web for examples and there are a few but we really need > some documentation on this stuff. I understand that there may be reasons for > all this but I can't find them on the web anywhere. Any help? 
Can't we get > the "SimpleParquet" jar that does this: > > ParquetWriter writer = > AvroParquetWriter.builder(outputStream) > .withSchema(avroSchema) > .withConf(conf) > .withCompressionCodec(CompressionCodecName.SNAPPY) > .withWriteMode(Mode.OVERWRITE)//probably not good for prod. (overwrites > files). > .build(); > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2405) Include logic for a fallback CodecFactory implementation
[ https://issues.apache.org/jira/browse/PARQUET-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792814#comment-17792814 ] ASF GitHub Bot commented on PARQUET-2405: - amousavigourabi opened a new pull request, #1230: URL: https://github.com/apache/parquet-mr/pull/1230 Make sure you have checked _all_ steps below. ### Jira - [x] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR" - https://issues.apache.org/jira/browse/PARQUET-XXX - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x). ### Tests - [x] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: ### Commits - [x] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 1. Subject is limited to 50 characters (not including Jira issue reference) 1. Subject does not end with a period 1. Subject uses the imperative mood ("add", not "adding") 1. Body wraps at 72 characters 1. Body explains "what" and "why", not "how" ### Style - [x] My contribution adheres to the code style guidelines and Spotless passes. - To apply the necessary changes, run `mvn spotless:apply -Pvector-plugins` ### Documentation - [x] In case of new functionality, my PR adds documentation that describes how to use it. 
- All the public functions and the classes in the PR contain Javadoc that explain what it does > Include logic for a fallback CodecFactory implementation > > > Key: PARQUET-2405 > URL: https://issues.apache.org/jira/browse/PARQUET-2405 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Reporter: Atour Mousavi Gourabi >Priority: Major > > Now that CodecFactories are pluggable in the read/write API, a fallback > CodecFactory implementation should be included. This would allow users to > more easily implement their own partially delegating CodecFactory for, for > example, hardware accelerating a codec they expect to encounter a lot, > without failing on codecs they did not implement themselves. For these cases, > the fallback implementation could delegate to the default > (Direct)CodecFactory, or any other implementation of their liking. -- This message was sent by Atlassian Jira (v8.20.10#820010)
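The partially delegating factory described in PARQUET-2405 can be illustrated with a minimal, self-contained sketch. Everything in it is an assumption made for illustration: `CodecName`, `Compressor`, `CodecFactory`, `DefaultFactory` and `FallbackFactory` are hypothetical stand-ins, not the parquet-mr types and not the code from PR #1230. The idea shown is only the lookup order: use a user-supplied codec when one is registered, otherwise delegate to the fallback factory.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch of a partially delegating codec factory.
// None of these types are the real parquet-mr API.
final class FallbackCodecSketch {

    /** Hypothetical codec identifiers. */
    enum CodecName { UNCOMPRESSED, SNAPPY, ZSTD }

    /** Hypothetical compressor; describe() only reports who handles the codec. */
    interface Compressor {
        String describe();
    }

    /** Hypothetical factory contract. */
    interface CodecFactory {
        Compressor getCompressor(CodecName name);
    }

    /** Stand-in for a default factory that supports every codec. */
    static final class DefaultFactory implements CodecFactory {
        @Override
        public Compressor getCompressor(CodecName name) {
            return () -> "default:" + name;
        }
    }

    /** Serves registered overrides first and delegates everything else. */
    static final class FallbackFactory implements CodecFactory {
        private final Map<CodecName, Compressor> overrides = new EnumMap<>(CodecName.class);
        private final CodecFactory fallback;

        FallbackFactory(CodecFactory fallback) {
            this.fallback = fallback;
        }

        /** Registers a custom compressor for one codec, e.g. an accelerated one. */
        FallbackFactory override(CodecName name, Compressor compressor) {
            overrides.put(name, compressor);
            return this;
        }

        @Override
        public Compressor getCompressor(CodecName name) {
            Compressor custom = overrides.get(name);
            // Codecs without an override fall through to the delegate instead of failing.
            return custom != null ? custom : fallback.getCompressor(name);
        }
    }

    public static void main(String[] args) {
        CodecFactory factory = new FallbackFactory(new DefaultFactory())
                .override(CodecName.ZSTD, () -> "accelerated:ZSTD");
        System.out.println(factory.getCompressor(CodecName.ZSTD).describe());
        System.out.println(factory.getCompressor(CodecName.SNAPPY).describe());
    }
}
```

With this shape, a user only implements the codecs they care about and still reads files that use any other codec the fallback supports.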
[jira] [Commented] (PARQUET-2406) Remove redundant valueOf calls
[ https://issues.apache.org/jira/browse/PARQUET-2406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792792#comment-17792792 ] ASF GitHub Bot commented on PARQUET-2406: - amousavigourabi opened a new pull request, #1229: URL: https://github.com/apache/parquet-mr/pull/1229 Make sure you have checked _all_ steps below. ### Jira - [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR" - https://issues.apache.org/jira/browse/PARQUET-XXX - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x). ### Tests - [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: ### Commits - [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 1. Subject is limited to 50 characters (not including Jira issue reference) 1. Subject does not end with a period 1. Subject uses the imperative mood ("add", not "adding") 1. Body wraps at 72 characters 1. Body explains "what" and "why", not "how" ### Style - [ ] My contribution adheres to the code style guidelines and Spotless passes. - To apply the necessary changes, run `mvn spotless:apply -Pvector-plugins` ### Documentation - [ ] In case of new functionality, my PR adds documentation that describes how to use it. 
- All the public functions and the classes in the PR contain Javadoc that explain what it does > Remove redundant valueOf calls > -- > > Key: PARQUET-2406 > URL: https://issues.apache.org/jira/browse/PARQUET-2406 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Atour Mousavi Gourabi >Priority: Minor > > Remove redundant valueOf calls, or replace them with parseXXX where possible. > This could avoid some redundant calls and/or (un)boxing. -- This message was sent by Atlassian Jira (v8.20.10#820010)
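As a small illustration of what PARQUET-2406 refers to, the sketch below contrasts `valueOf` with the `parseXXX` methods when a primitive is wanted. The class and method names (`ParseVsValueOf`, `viaValueOf`, `viaParse`) are made up for this example and do not come from the PR; only the `java.lang` calls are real.

```java
// Illustrative only: the class and method names are hypothetical.
final class ParseVsValueOf {

    /** Integer.valueOf returns an Integer, which is then auto-unboxed to int. */
    static int viaValueOf(String s) {
        return Integer.valueOf(s);
    }

    /** Integer.parseInt returns the primitive directly, with no boxing step. */
    static int viaParse(String s) {
        return Integer.parseInt(s);
    }

    /** The same pattern applies to the other wrapper types. */
    static long parseLong(String s) {
        return Long.parseLong(s);
    }

    static boolean parseBoolean(String s) {
        return Boolean.parseBoolean(s);
    }

    public static void main(String[] args) {
        // Both calls yield the same value; only the intermediate boxing differs.
        System.out.println(viaValueOf("42") == viaParse("42"));
    }
}
```

When the result is assigned to a primitive, the `parseXXX` form states the intent directly and skips the box-then-unbox round trip.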
[jira] [Commented] (PARQUET-2401) Synchronize on final fields
[ https://issues.apache.org/jira/browse/PARQUET-2401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792774#comment-17792774 ] ASF GitHub Bot commented on PARQUET-2401: - Fokko merged PR #1224: URL: https://github.com/apache/parquet-mr/pull/1224 > Synchronize on final fields > --- > > Key: PARQUET-2401 > URL: https://issues.apache.org/jira/browse/PARQUET-2401 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.13.1 >Reporter: Atour Mousavi Gourabi >Priority: Minor > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2393) Make `ColumnIOCreatorVisitor` static
[ https://issues.apache.org/jira/browse/PARQUET-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792773#comment-17792773 ] ASF GitHub Bot commented on PARQUET-2393: - Fokko merged PR #1216: URL: https://github.com/apache/parquet-mr/pull/1216 > Make `ColumnIOCreatorVisitor` static > > > Key: PARQUET-2393 > URL: https://issues.apache.org/jira/browse/PARQUET-2393 > Project: Parquet > Issue Type: Improvement >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2392) Remove StringBuilder in `LogicalTypeAnnotation`
[ https://issues.apache.org/jira/browse/PARQUET-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792772#comment-17792772 ] ASF GitHub Bot commented on PARQUET-2392: - Fokko merged PR #1215: URL: https://github.com/apache/parquet-mr/pull/1215 > Remove StringBuilder in `LogicalTypeAnnotation` > --- > > Key: PARQUET-2392 > URL: https://issues.apache.org/jira/browse/PARQUET-2392 > Project: Parquet > Issue Type: Improvement >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2394) Use `computeIfAbsent` in `MessageColumnIO`
[ https://issues.apache.org/jira/browse/PARQUET-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792769#comment-17792769 ] ASF GitHub Bot commented on PARQUET-2394: - Fokko merged PR #1217: URL: https://github.com/apache/parquet-mr/pull/1217 > Use `computeIfAbsent` in `MessageColumnIO` > -- > > Key: PARQUET-2394 > URL: https://issues.apache.org/jira/browse/PARQUET-2394 > Project: Parquet > Issue Type: Improvement >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-1647) [Java] support for Arrow's float16
[ https://issues.apache.org/jira/browse/PARQUET-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792698#comment-17792698 ] ASF GitHub Bot commented on PARQUET-1647: - wgtmac commented on PR #1142: URL: https://github.com/apache/parquet-mr/pull/1142#issuecomment-1838002383 BTW, it would be good to add an interoperability test to read parquet files from here: https://github.com/apache/parquet-testing/commit/da467dac2f095b979af37bcf40fa0d1dee5ff652. You may want to take a look at this example: https://github.com/apache/parquet-mr/blob/44b56225be6fe7b74667f4f2430326ef1f076cc5/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/codec/TestInteropReadLz4RawCodec.java#L40 > [Java] support for Arrow's float16 > -- > > Key: PARQUET-1647 > URL: https://issues.apache.org/jira/browse/PARQUET-1647 > Project: Parquet > Issue Type: Improvement > Components: parquet-format, parquet-thrift >Reporter: The Alchemist >Priority: Minor > > h2. DESCRIPTION > > I'm wondering if there's any interest in supporting Arrow's {{float16}} type > in Parquet. > There seem to be one or two {{float16}} / {{halffloat}} tickets here (e.g., > PARQUET-1403) but nothing that speaks to adding half-float support to Parquet > in-general. > > h2. PLANS > I'm able to spend some time on this, if someone points me in the right > direction. > > # Add the {{HALFFLOAT}} or {{FLOAT16}} enum (any preferred naming > convention?) to > [https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L32] > # Add {{HALFFLOAT}} to {{org.apache.parquet.schema.PrimitiveType}} > # Add {{HALFFLOAT}} support to > {{org.apache.parquet.arrow.schema.SchemaConverter}} > # Add encoding for new type at {{org.apache.parquet.column.Encoding}} > # ?? > If anyone has any interest in this, pointers, or comments, they would be > greatly appreciated! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2395) Prefer `singletonList` over `asList`
[ https://issues.apache.org/jira/browse/PARQUET-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792695#comment-17792695 ] ASF GitHub Bot commented on PARQUET-2395: - wgtmac commented on PR #1218: URL: https://github.com/apache/parquet-mr/pull/1218#issuecomment-1837995639 Thanks for the explanation! > Prefer `singletonList` over `asList` > > > Key: PARQUET-2395 > URL: https://issues.apache.org/jira/browse/PARQUET-2395 > Project: Parquet > Issue Type: Improvement >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2395) Prefer `singletonList` over `asList`
[ https://issues.apache.org/jira/browse/PARQUET-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792693#comment-17792693 ] ASF GitHub Bot commented on PARQUET-2395: - Fokko merged PR #1218: URL: https://github.com/apache/parquet-mr/pull/1218 > Prefer `singletonList` over `asList` > > > Key: PARQUET-2395 > URL: https://issues.apache.org/jira/browse/PARQUET-2395 > Project: Parquet > Issue Type: Improvement >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2395) Prefer `singletonList` over `asList`
[ https://issues.apache.org/jira/browse/PARQUET-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792694#comment-17792694 ] ASF GitHub Bot commented on PARQUET-2395: - Fokko commented on PR #1218: URL: https://github.com/apache/parquet-mr/pull/1218#issuecomment-1837993837 Thanks for the review @wgtmac, @zhangjiashen and @amousavigourabi 🙌 > Prefer `singletonList` over `asList` > > > Key: PARQUET-2395 > URL: https://issues.apache.org/jira/browse/PARQUET-2395 > Project: Parquet > Issue Type: Improvement >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2395) Prefer `singletonList` over `asList`
[ https://issues.apache.org/jira/browse/PARQUET-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792692#comment-17792692 ] ASF GitHub Bot commented on PARQUET-2395: - Fokko commented on PR #1218: URL: https://github.com/apache/parquet-mr/pull/1218#issuecomment-1837993326 @wgtmac Two things: - `singletonList` is completely immutable, while with `asList` you can still mutate the reference. - `singletonList` is not backed by an array, reducing the memory footprint. > Prefer `singletonList` over `asList` > > > Key: PARQUET-2395 > URL: https://issues.apache.org/jira/browse/PARQUET-2395 > Project: Parquet > Issue Type: Improvement >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
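The difference between the two factories mentioned in the comment above can be checked with a short sketch. The class and helper names (`SingletonListDemo`, `allowsSet`, `allowsAdd`) are invented for this example. It relies on documented JDK behavior: `Arrays.asList` returns a fixed-size list that writes through to its backing array, so `set` works while `add` does not, whereas `Collections.singletonList` returns an immutable list that rejects both.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative only: the class and helper names are hypothetical.
final class SingletonListDemo {

    /** Returns true if the list accepts set(0, ...) without throwing. */
    static boolean allowsSet(List<String> list) {
        try {
            list.set(0, "changed");
            return true;
        } catch (UnsupportedOperationException e) {
            return false;
        }
    }

    /** Returns true if the list accepts add(...) without throwing. */
    static boolean allowsAdd(List<String> list) {
        try {
            list.add("extra");
            return true;
        } catch (UnsupportedOperationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] backing = {"a"};
        List<String> viaAsList = Arrays.asList(backing);
        List<String> viaSingleton = Collections.singletonList("a");

        // asList: the element can be replaced, and the change reaches the array.
        System.out.println(allowsSet(viaAsList));    // true
        System.out.println(backing[0]);              // changed

        // singletonList: neither replacing nor adding is permitted.
        System.out.println(allowsSet(viaSingleton)); // false
        System.out.println(allowsAdd(viaSingleton)); // false
    }
}
```

Neither list can grow, so the practical difference for a one-element list is that `singletonList` also rules out in-place replacement and does not need a backing array.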
[jira] [Commented] (PARQUET-2344) Bump to Thrift 0.19.0
[ https://issues.apache.org/jira/browse/PARQUET-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792691#comment-17792691 ] ASF GitHub Bot commented on PARQUET-2344: - wgtmac commented on code in PR #1192: URL: https://github.com/apache/parquet-mr/pull/1192#discussion_r141345 ## pom.xml: ## @@ -619,6 +622,9 @@ true true + + javax.annotation:javax.annotation-api:jar:1.3.2 Review Comment: Why do we need to ignore this? ## parquet-thrift/src/main/java/org/apache/parquet/thrift/ThriftSchemaConverter.java: ## @@ -225,14 +225,18 @@ private static ThriftField toThriftField(String name, Field field, ThriftField.R final Field listElemField = field.getListElemField(); type = new ThriftType.ListType(toThriftField(listElemField.getName(), listElemField, requirement)); break; + case UUID: case ENUM: -Collection enumValues = field.getEnumValues(); -List values = new ArrayList(); -for (TEnum tEnum : enumValues) { - values.add(new EnumValue(tEnum.getValue(), tEnum.toString())); +if (field.isEnum()) { Review Comment: Why mixing UUID and ENUM in this case? ## parquet-format-structures/pom.xml: ## @@ -156,6 +156,11 @@ libthrift ${format.thrift.version} + + javax.annotation + javax.annotation-api Review Comment: Where do we need this? ## parquet-thrift/src/main/java/org/apache/parquet/thrift/struct/ThriftTypeID.java: ## @@ -51,10 +51,15 @@ public enum ThriftTypeID { LIST (TType.LIST, true, ListType.class), ENUM (TType.ENUM, TType.I32, EnumType.class); - private static ThriftTypeID[] types = new ThriftTypeID[17]; + private static final ThriftTypeID[] types; static { +types = new ThriftTypeID[18]; for (ThriftTypeID t : ThriftTypeID.values()) { - types[t.thriftType] = t; + if (t.thriftType == -1) { Review Comment: It would be good to add the link to the comment as well. Or at least we need to explain why -1 is used here. 
> Bump to Thirft 0.19.0 > - > > Key: PARQUET-2344 > URL: https://issues.apache.org/jira/browse/PARQUET-2344 > Project: Parquet > Issue Type: Bug > Components: parquet-format, parquet-mr >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: format-2.10.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2396) Refactor `ColumnIndexBuilder`
[ https://issues.apache.org/jira/browse/PARQUET-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792690#comment-17792690 ] ASF GitHub Bot commented on PARQUET-2396: - Fokko merged PR #1219: URL: https://github.com/apache/parquet-mr/pull/1219 > Refactor `ColumnIndexBuilder` > - > > Key: PARQUET-2396 > URL: https://issues.apache.org/jira/browse/PARQUET-2396 > Project: Parquet > Issue Type: Improvement >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2391) Remove unnecessary unboxing
[ https://issues.apache.org/jira/browse/PARQUET-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792668#comment-17792668 ] ASF GitHub Bot commented on PARQUET-2391: - Fokko commented on PR #1214: URL: https://github.com/apache/parquet-mr/pull/1214#issuecomment-1837969587 Thanks for the review @wgtmac & @amousavigourabi 🙌 > Remove unnecessary unboxing > --- > > Key: PARQUET-2391 > URL: https://issues.apache.org/jira/browse/PARQUET-2391 > Project: Parquet > Issue Type: Improvement >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-1647) [Java] support for Arrow's float16
[ https://issues.apache.org/jira/browse/PARQUET-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792667#comment-17792667 ] ASF GitHub Bot commented on PARQUET-1647: - zhangjiashen commented on code in PR #1142: URL: https://github.com/apache/parquet-mr/pull/1142#discussion_r1413455235 ## pom.xml: ## @@ -596,6 +597,9 @@ [Java] support for Arrow's float16 > -- > > Key: PARQUET-1647 > URL: https://issues.apache.org/jira/browse/PARQUET-1647 > Project: Parquet > Issue Type: Improvement > Components: parquet-format, parquet-thrift >Reporter: The Alchemist >Priority: Minor > > h2. DESCRIPTION > > I'm wondering if there's any interest in supporting Arrow's {{float16}} type > in Parquet. > There seem to be one or two {{float16}} / {{halffloat}} tickets here (e.g., > PARQUET-1403) but nothing that speaks to adding half-float support to Parquet > in-general. > > h2. PLANS > I'm able to spend some time on this, if someone points me in the right > direction. > > # Add the {{HALFFLOAT}} or {{FLOAT16}} enum (any preferred naming > convention?) to > [https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L32] > # Add {{HALFFLOAT}} to {{org.apache.parquet.schema.PrimitiveType}} > # Add {{HALFFLOAT}} support to > {{org.apache.parquet.arrow.schema.SchemaConverter}} > # Add encoding for new type at {{org.apache.parquet.column.Encoding}} > # ?? > If anyone has any interest in this, pointers, or comments, they would be > greatly appreciated! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2391) Remove unnecessary unboxing
[ https://issues.apache.org/jira/browse/PARQUET-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792666#comment-17792666 ] ASF GitHub Bot commented on PARQUET-2391: - Fokko merged PR #1214: URL: https://github.com/apache/parquet-mr/pull/1214 > Remove unnecessary unboxing > --- > > Key: PARQUET-2391 > URL: https://issues.apache.org/jira/browse/PARQUET-2391 > Project: Parquet > Issue Type: Improvement >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2385) Don't initialize CodecFactory in ParquetWriter
[ https://issues.apache.org/jira/browse/PARQUET-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792645#comment-17792645 ] ASF GitHub Bot commented on PARQUET-2385: - wgtmac merged PR #1203: URL: https://github.com/apache/parquet-mr/pull/1203 > Don't initialize CodecFactory in ParquetWriter > -- > > Key: PARQUET-2385 > URL: https://issues.apache.org/jira/browse/PARQUET-2385 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Atour Mousavi Gourabi >Priority: Major > > In ParquetWriter we initialize a CodecFactory, instead we should allow users > to set their own via the builder as to provide a little more flexibility > (analogous to PARQUET-2282). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2390) Replace anonymouse functions with lambda's
[ https://issues.apache.org/jira/browse/PARQUET-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792641#comment-17792641 ] ASF GitHub Bot commented on PARQUET-2390: - Fokko merged PR #1213: URL: https://github.com/apache/parquet-mr/pull/1213 > Replace anonymouse functions with lambda's > -- > > Key: PARQUET-2390 > URL: https://issues.apache.org/jira/browse/PARQUET-2390 > Project: Parquet > Issue Type: Improvement >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2390) Replace anonymouse functions with lambda's
[ https://issues.apache.org/jira/browse/PARQUET-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792642#comment-17792642 ] ASF GitHub Bot commented on PARQUET-2390: - Fokko commented on PR #1213: URL: https://github.com/apache/parquet-mr/pull/1213#issuecomment-1837879063 Thanks for the review @wgtmac > Replace anonymouse functions with lambda's > -- > > Key: PARQUET-2390 > URL: https://issues.apache.org/jira/browse/PARQUET-2390 > Project: Parquet > Issue Type: Improvement >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2400) Update Spotless command in PR prompt to include vector plugins
[ https://issues.apache.org/jira/browse/PARQUET-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792640#comment-17792640 ] ASF GitHub Bot commented on PARQUET-2400: - wgtmac merged PR #1223: URL: https://github.com/apache/parquet-mr/pull/1223 > Update Spotless command in PR prompt to include vector plugins > -- > > Key: PARQUET-2400 > URL: https://issues.apache.org/jira/browse/PARQUET-2400 > Project: Parquet > Issue Type: Improvement >Reporter: Atour Mousavi Gourabi >Priority: Minor > > The Maven command to apply Spotless referenced in the PR prompt does not > include applying it to the parquet-plugins. This needs to be updated in those > docs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-1647) [Java] support for Arrow's float16
[ https://issues.apache.org/jira/browse/PARQUET-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792628#comment-17792628 ] ASF GitHub Bot commented on PARQUET-1647: - wgtmac commented on code in PR #1142: URL: https://github.com/apache/parquet-mr/pull/1142#discussion_r1413353347 ## pom.xml: ## @@ -596,6 +597,9 @@ [Java] support for Arrow's float16 > -- > > Key: PARQUET-1647 > URL: https://issues.apache.org/jira/browse/PARQUET-1647 > Project: Parquet > Issue Type: Improvement > Components: parquet-format, parquet-thrift >Reporter: The Alchemist >Priority: Minor > > h2. DESCRIPTION > > I'm wondering if there's any interest in supporting Arrow's {{float16}} type > in Parquet. > There seem to be one or two {{float16}} / {{halffloat}} tickets here (e.g., > PARQUET-1403) but nothing that speaks to adding half-float support to Parquet > in-general. > > h2. PLANS > I'm able to spend some time on this, if someone points me in the right > direction. > > # Add the {{HALFFLOAT}} or {{FLOAT16}} enum (any preferred naming > convention?) to > [https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L32] > # Add {{HALFFLOAT}} to {{org.apache.parquet.schema.PrimitiveType}} > # Add {{HALFFLOAT}} support to > {{org.apache.parquet.arrow.schema.SchemaConverter}} > # Add encoding for new type at {{org.apache.parquet.column.Encoding}} > # ?? > If anyone has any interest in this, pointers, or comments, they would be > greatly appreciated! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-1647) [Java] support for Arrow's float16
[ https://issues.apache.org/jira/browse/PARQUET-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792621#comment-17792621 ] ASF GitHub Bot commented on PARQUET-1647: - zhangjiashen commented on PR #1142: URL: https://github.com/apache/parquet-mr/pull/1142#issuecomment-1837800275 > Could you please rebase it? Rebased, can you help merge this PR? > [Java] support for Arrow's float16 > -- > > Key: PARQUET-1647 > URL: https://issues.apache.org/jira/browse/PARQUET-1647 > Project: Parquet > Issue Type: Improvement > Components: parquet-format, parquet-thrift >Reporter: The Alchemist >Priority: Minor > > h2. DESCRIPTION > > I'm wondering if there's any interest in supporting Arrow's {{float16}} type > in Parquet. > There seem to be one or two {{float16}} / {{halffloat}} tickets here (e.g., > PARQUET-1403) but nothing that speaks to adding half-float support to Parquet > in-general. > > h2. PLANS > I'm able to spend some time on this, if someone points me in the right > direction. > > # Add the {{HALFFLOAT}} or {{FLOAT16}} enum (any preferred naming > convention?) to > [https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L32] > # Add {{HALFFLOAT}} to {{org.apache.parquet.schema.PrimitiveType}} > # Add {{HALFFLOAT}} support to > {{org.apache.parquet.arrow.schema.SchemaConverter}} > # Add encoding for new type at {{org.apache.parquet.column.Encoding}} > # ?? > If anyone has any interest in this, pointers, or comments, they would be > greatly appreciated! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2388) Deprecate `CHARSETS` on `PlainValuesWriter`
[ https://issues.apache.org/jira/browse/PARQUET-2388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792618#comment-17792618 ] ASF GitHub Bot commented on PARQUET-2388: - Fokko merged PR #1211: URL: https://github.com/apache/parquet-mr/pull/1211 > Deprecate `CHARSETS` on `PlainValuesWriter` > --- > > Key: PARQUET-2388 > URL: https://issues.apache.org/jira/browse/PARQUET-2388 > Project: Parquet > Issue Type: Improvement >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2389) Remove redundant initializers
[ https://issues.apache.org/jira/browse/PARQUET-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792617#comment-17792617 ] ASF GitHub Bot commented on PARQUET-2389: - Fokko commented on PR #1212: URL: https://github.com/apache/parquet-mr/pull/1212#issuecomment-1837791938 Thanks for the review @wgtmac and @amousavigourabi 🙌 > Remove redundant initializers > - > > Key: PARQUET-2389 > URL: https://issues.apache.org/jira/browse/PARQUET-2389 > Project: Parquet > Issue Type: Improvement >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2389) Remove redundant initializers
[ https://issues.apache.org/jira/browse/PARQUET-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792616#comment-17792616 ] ASF GitHub Bot commented on PARQUET-2389: - Fokko merged PR #1212: URL: https://github.com/apache/parquet-mr/pull/1212 > Remove redundant initializers > - > > Key: PARQUET-2389 > URL: https://issues.apache.org/jira/browse/PARQUET-2389 > Project: Parquet > Issue Type: Improvement >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2387) Simplify `hasFieldsIgnored` expression
[ https://issues.apache.org/jira/browse/PARQUET-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792615#comment-17792615 ] ASF GitHub Bot commented on PARQUET-2387: - Fokko commented on PR #1210: URL: https://github.com/apache/parquet-mr/pull/1210#issuecomment-1837786670 Thanks @wgtmac 🙌 > Simplify `hasFieldsIgnored` expression > -- > > Key: PARQUET-2387 > URL: https://issues.apache.org/jira/browse/PARQUET-2387 > Project: Parquet > Issue Type: Improvement > Components: parquet-thrift >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2387) Simplify `hasFieldsIgnored` expression
[ https://issues.apache.org/jira/browse/PARQUET-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792614#comment-17792614 ] ASF GitHub Bot commented on PARQUET-2387: - Fokko merged PR #1210: URL: https://github.com/apache/parquet-mr/pull/1210 > Simplify `hasFieldsIgnored` expression > -- > > Key: PARQUET-2387 > URL: https://issues.apache.org/jira/browse/PARQUET-2387 > Project: Parquet > Issue Type: Improvement > Components: parquet-thrift >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2399) Use deprecated tag in Javadoc
[ https://issues.apache.org/jira/browse/PARQUET-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792463#comment-17792463 ] ASF GitHub Bot commented on PARQUET-2399: - Fokko merged PR #1222: URL: https://github.com/apache/parquet-mr/pull/1222 > Use deprecated tag in Javadoc > - > > Key: PARQUET-2399 > URL: https://issues.apache.org/jira/browse/PARQUET-2399 > Project: Parquet > Issue Type: Improvement >Reporter: Atour Mousavi Gourabi >Priority: Minor > > In some Javadoc, we use Deprecated: instead of the deprecated tag. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2401) Synchronize on final fields
[ https://issues.apache.org/jira/browse/PARQUET-2401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792271#comment-17792271 ] ASF GitHub Bot commented on PARQUET-2401: - amousavigourabi opened a new pull request, #1224: URL: https://github.com/apache/parquet-mr/pull/1224 Make sure you have checked _all_ steps below. ### Jira - [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR" - https://issues.apache.org/jira/browse/PARQUET-XXX - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x). ### Tests - [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: ### Commits - [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 1. Subject is limited to 50 characters (not including Jira issue reference) 1. Subject does not end with a period 1. Subject uses the imperative mood ("add", not "adding") 1. Body wraps at 72 characters 1. Body explains "what" and "why", not "how" ### Style - [ ] My contribution adheres to the code style guidelines and Spotless passes. - To apply the necessary changes, run the spotless:apply goal in Maven ### Documentation - [ ] In case of new functionality, my PR adds documentation that describes how to use it. 
- All the public functions and the classes in the PR contain Javadoc that explain what it does > Synchronize on final fields > --- > > Key: PARQUET-2401 > URL: https://issues.apache.org/jira/browse/PARQUET-2401 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Atour Mousavi Gourabi >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-1822) Parquet without Hadoop dependencies
[ https://issues.apache.org/jira/browse/PARQUET-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792267#comment-17792267 ] ASF GitHub Bot commented on PARQUET-1822: - amousavigourabi commented on PR #: URL: https://github.com/apache/parquet-mr/pull/#issuecomment-1836931968 > Our project needs this feature as well, is there a date for the next major release? @drealeed if you just need to be able to drop the Hadoop Path dependency, you might want to consider copying the InputFile, OutputFile implementations from this pull request before the next release is out. If you need to fully drop Hadoop, this is still being worked on. > Parquet without Hadoop dependencies > --- > > Key: PARQUET-1822 > URL: https://issues.apache.org/jira/browse/PARQUET-1822 > Project: Parquet > Issue Type: Improvement > Components: parquet-avro >Affects Versions: 1.11.0 > Environment: Amazon Fargate (linux), Windows development box. > We are writing Parquet to be read by the Snowflake and Athena databases. >Reporter: mark juchems >Priority: Minor > Labels: documentation, newbie > Fix For: 1.14.0 > > > I have been trying for weeks to create a parquet file from avro and write to > S3 in Java. This has been incredibly frustrating and odd as Spark can do it > easily (I'm told). > I have assembled the correct jars through luck and diligence, but now I find > out that I have to have hadoop installed on my machine. I am currently > developing in Windows and it seems a dll and exe can fix that up but am > wondering about Linus as the code will eventually run in Fargate on AWS. > *Why do I need external dependencies and not pure java?* > The thing really is how utterly complex all this is. I would like to create > an avro file and convert it to Parquet and write it to S3, but I am trapped > in "ParquetWriter" hell! 
> *Why can't I get a normal OutputStream and write it wherever I want?* > I have scoured the web for examples and there are a few but we really need > some documentation on this stuff. I understand that there may be reasons for > all this but I can't find them on the web anywhere. Any help? Can't we get > the "SimpleParquet" jar that does this: > > ParquetWriter writer = > AvroParquetWriter.builder(outputStream) > .withSchema(avroSchema) > .withConf(conf) > .withCompressionCodec(CompressionCodecName.SNAPPY) > .withWriteMode(Mode.OVERWRITE)//probably not good for prod. (overwrites > files). > .build(); > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-1822) Parquet without Hadoop dependencies
[ https://issues.apache.org/jira/browse/PARQUET-1822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17792120#comment-17792120 ] ASF GitHub Bot commented on PARQUET-1822: - drealeed commented on PR #: URL: https://github.com/apache/parquet-mr/pull/#issuecomment-1836333591 Our project needs this feature as well, is there a date for the next major release? > Parquet without Hadoop dependencies > --- > > Key: PARQUET-1822 > URL: https://issues.apache.org/jira/browse/PARQUET-1822 > Project: Parquet > Issue Type: Improvement > Components: parquet-avro >Affects Versions: 1.11.0 > Environment: Amazon Fargate (linux), Windows development box. > We are writing Parquet to be read by the Snowflake and Athena databases. >Reporter: mark juchems >Priority: Minor > Labels: documentation, newbie > Fix For: 1.14.0 > > > I have been trying for weeks to create a parquet file from avro and write to > S3 in Java. This has been incredibly frustrating and odd as Spark can do it > easily (I'm told). > I have assembled the correct jars through luck and diligence, but now I find > out that I have to have hadoop installed on my machine. I am currently > developing in Windows and it seems a dll and exe can fix that up but am > wondering about Linus as the code will eventually run in Fargate on AWS. > *Why do I need external dependencies and not pure java?* > The thing really is how utterly complex all this is. I would like to create > an avro file and convert it to Parquet and write it to S3, but I am trapped > in "ParquetWriter" hell! > *Why can't I get a normal OutputStream and write it wherever I want?* > I have scoured the web for examples and there are a few but we really need > some documentation on this stuff. I understand that there may be reasons for > all this but I can't find them on the web anywhere. Any help? 
Can't we get > the "SimpleParquet" jar that does this: > > ParquetWriter writer = > AvroParquetWriter.builder(outputStream) > .withSchema(avroSchema) > .withConf(conf) > .withCompressionCodec(CompressionCodecName.SNAPPY) > .withWriteMode(Mode.OVERWRITE)//probably not good for prod. (overwrites > files). > .build(); > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2387) Simplify `hasFieldsIgnored` expression
[ https://issues.apache.org/jira/browse/PARQUET-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791649#comment-17791649 ] ASF GitHub Bot commented on PARQUET-2387: - Fokko commented on code in PR #1210: URL: https://github.com/apache/parquet-mr/pull/1210#discussion_r1410732198 ## parquet-thrift/src/main/java/org/apache/parquet/thrift/BufferedProtocolReadToWrite.java: ## @@ -369,7 +369,7 @@ public String toDebugString() { ThriftField expectedField; if ((expectedField = type.getChildById(field.id)) == null) { handleUnrecognizedField(field, type, in); -hasFieldsIgnored |= true; Review Comment: `hasFieldsIgnored |= true` is equivalent to `hasFieldsIgnored = hasFieldsIgnored || true`, which always evaluates to true. > Simplify `hasFieldsIgnored` expression > -- > > Key: PARQUET-2387 > URL: https://issues.apache.org/jira/browse/PARQUET-2387 > Project: Parquet > Issue Type: Improvement > Components: parquet-thrift >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
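The point made in the PARQUET-2387 review comment above, that OR-assigning `true` to a boolean is the same as plain assignment, can be checked with a short sketch. The class `OrAssignDemo` and its two methods are hypothetical and exist only for this illustration; they are not taken from `BufferedProtocolReadToWrite`.

```java
public class OrAssignDemo {

    /** Mirrors the original form: OR-assign the constant true. */
    static boolean withOrAssign(boolean flag) {
        flag |= true; // flag = flag | true, which is true for either input
        return flag;
    }

    /** Mirrors the simplified form: assign true directly. */
    static boolean withAssign(boolean flag) {
        flag = true;
        return flag;
    }

    public static void main(String[] args) {
        // Exhaustively compare both forms over the two possible inputs.
        for (boolean start : new boolean[] {false, true}) {
            System.out.println(start + " -> " + withOrAssign(start) + " / " + withAssign(start));
        }
    }
}
```

Strictly speaking, `|=` on booleans uses the non-short-circuit `|` operator rather than `||`, but with a constant right-hand side the result is identical.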
[jira] [Commented] (PARQUET-2392) Remove StringBuilder in `LogicalTypeAnnotation`
[ https://issues.apache.org/jira/browse/PARQUET-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791643#comment-17791643 ] ASF GitHub Bot commented on PARQUET-2392: - Fokko commented on PR #1215: URL: https://github.com/apache/parquet-mr/pull/1215#issuecomment-1833844311 @zhangjiashen Thanks for the review, appreciate it! I don't think there are any negative performance implications since the compiler will just optimize it to a single concatenation. A `StringBuilder` should only be used when you concatenate in a loop (fields in a schema for example). > Remove StringBuilder in `LogicalTypeAnnotation` > --- > > Key: PARQUET-2392 > URL: https://issues.apache.org/jira/browse/PARQUET-2392 > Project: Parquet > Issue Type: Improvement >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
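The rule of thumb in the PARQUET-2392 comment above (plain `+` for a single expression, `StringBuilder` for concatenation in a loop) can be sketched as follows. This is an illustration only; `ConcatDemo`, `describe`, and `joinFields` are hypothetical names and do not correspond to methods in `LogicalTypeAnnotation`.

```java
import java.util.Arrays;
import java.util.List;

public class ConcatDemo {

    /** Single expression: plain concatenation is fine and easier to read. */
    static String describe(String name, int precision) {
        return name + "(" + precision + ")";
    }

    /** Repeated appends in a loop: reuse one StringBuilder for all of them. */
    static String joinFields(List<String> fields) {
        StringBuilder sb = new StringBuilder();
        String separator = "";
        for (String field : fields) {
            sb.append(separator).append(field);
            separator = ", "; // every element after the first gets a separator
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("DECIMAL", 9));
        System.out.println(joinFields(Arrays.asList("id", "name", "price")));
    }
}
```

Using `+` inside the loop would typically allocate a new intermediate string on each iteration, which is the case the comment singles out.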
[jira] [Commented] (PARQUET-2171) Implement vectored IO in parquet file format
[ https://issues.apache.org/jira/browse/PARQUET-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791631#comment-17791631 ] ASF GitHub Bot commented on PARQUET-2171: - steveloughran commented on code in PR #1139: URL: https://github.com/apache/parquet-mr/pull/1139#discussion_r1410699415 ## parquet-hadoop/README.md: ## @@ -501,3 +501,11 @@ If `false`, key material is stored in separate new files, created in the same fo **Description:** Length of key encryption keys (KEKs), randomly generated by parquet key management tools. Can be 128, 192 or 256 bits. **Default value:** `128` +--- + +**Property:** `parquet.hadoop.vectored.io.enabled` +**Description:** Flag to enable use of the FileSystem Vector IO API on Hadoop releases which support the feature. +If `true` then an attempt will be made to dynamically load the relevant classes; Review Comment: no, hdfs doesn't support it. Native IO does, so if you use file:// URLS you get direct NIO vectored IO into buffers (yay! hadoop APIs move to the 2010s!). S3A supports it with multiple parallel GET with some range coalescing in between. Would love ABFS connector to support it too... > Implement vectored IO in parquet file format > > > Key: PARQUET-2171 > URL: https://issues.apache.org/jira/browse/PARQUET-2171 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Reporter: Mukund Thakur >Priority: Major > > We recently added a new feature called vectored IO in Hadoop for improving > read performance for seek heavy readers. Spark Jobs and others which uses > parquet will greatly benefit from this api. Details can be found here > [https://github.com/apache/hadoop/commit/e1842b2a749d79cbdc15c524515b9eda64c339d5] > https://issues.apache.org/jira/browse/HADOOP-18103 > https://issues.apache.org/jira/browse/HADOOP-11867 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2171) Implement vectored IO in parquet file format
[ https://issues.apache.org/jira/browse/PARQUET-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791630#comment-17791630 ] ASF GitHub Bot commented on PARQUET-2171: - steveloughran commented on PR #1139: URL: https://github.com/apache/parquet-mr/pull/1139#issuecomment-1833813575 @Fokko that's japicmp getting its logic wrong because it's a new file; thought I'd edited the build rules so it would ignore that. anyway, need to fix the merge as something (#1209?) has just broken it > Implement vectored IO in parquet file format > > > Key: PARQUET-2171 > URL: https://issues.apache.org/jira/browse/PARQUET-2171 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Reporter: Mukund Thakur >Priority: Major > > We recently added a new feature called vectored IO in Hadoop for improving > read performance for seek heavy readers. Spark Jobs and others which uses > parquet will greatly benefit from this api. Details can be found here > [https://github.com/apache/hadoop/commit/e1842b2a749d79cbdc15c524515b9eda64c339d5] > https://issues.apache.org/jira/browse/HADOOP-18103 > https://issues.apache.org/jira/browse/HADOOP-11867 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2385) Don't initialize CodecFactory in ParquetWriter
[ https://issues.apache.org/jira/browse/PARQUET-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791571#comment-17791571 ] ASF GitHub Bot commented on PARQUET-2385: - amousavigourabi commented on PR #1203: URL: https://github.com/apache/parquet-mr/pull/1203#issuecomment-183334 > @amousavigourabi It seems a rebase is required. @wgtmac Done, thanks for the heads up😄 > Don't initialize CodecFactory in ParquetWriter > -- > > Key: PARQUET-2385 > URL: https://issues.apache.org/jira/browse/PARQUET-2385 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Atour Mousavi Gourabi >Priority: Major > > In ParquetWriter we initialize a CodecFactory, instead we should allow users > to set their own via the builder as to provide a little more flexibility > (analogous to PARQUET-2282). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2400) Update Spotless command in PR prompt to include vector plugins
[ https://issues.apache.org/jira/browse/PARQUET-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791563#comment-17791563 ] ASF GitHub Bot commented on PARQUET-2400: - amousavigourabi opened a new pull request, #1223: URL: https://github.com/apache/parquet-mr/pull/1223
> Update Spotless command in PR prompt to include vector plugins > -- > > Key: PARQUET-2400 > URL: https://issues.apache.org/jira/browse/PARQUET-2400 > Project: Parquet > Issue Type: Improvement >Reporter: Atour Mousavi Gourabi >Priority: Minor > > The Maven command to apply Spotless referenced in the PR prompt does not > include applying it to the parquet-plugins. This needs to be updated in those > docs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2399) Use deprecated tag in Javadoc
[ https://issues.apache.org/jira/browse/PARQUET-2399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791560#comment-17791560 ] ASF GitHub Bot commented on PARQUET-2399: - amousavigourabi opened a new pull request, #1222: URL: https://github.com/apache/parquet-mr/pull/1222 Make sure you have checked _all_ steps below. ### Jira - [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR" - https://issues.apache.org/jira/browse/PARQUET-XXX - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x). ### Tests - [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: ### Commits - [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 1. Subject is limited to 50 characters (not including Jira issue reference) 1. Subject does not end with a period 1. Subject uses the imperative mood ("add", not "adding") 1. Body wraps at 72 characters 1. Body explains "what" and "why", not "how" ### Style - [ ] My contribution adheres to the code style guidelines and Spotless passes. - To apply the necessary changes, run the spotless:apply goal in Maven ### Documentation - [ ] In case of new functionality, my PR adds documentation that describes how to use it. 
- All the public functions and the classes in the PR contain Javadoc that explain what it does > Use deprecated tag in Javadoc > - > > Key: PARQUET-2399 > URL: https://issues.apache.org/jira/browse/PARQUET-2399 > Project: Parquet > Issue Type: Improvement >Reporter: Atour Mousavi Gourabi >Priority: Minor > > In some Javadoc, we use Deprecated: instead of the deprecated tag. -- This message was sent by Atlassian Jira (v8.20.10#820010)
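The distinction raised in PARQUET-2399 can be illustrated with a minimal sketch. The class and method names below are hypothetical and are not taken from parquet-mr; the sketch only contrasts a plain-text "Deprecated:" note with the `@deprecated` Javadoc block tag, paired here with the `@Deprecated` annotation.

```java
// Hypothetical example class; not part of parquet-mr.
class DeprecatedTagExample {

  /**
   * Deprecated: use {@link #readAll()} instead.
   * Plain text like this is just prose; the Javadoc tool does not treat it as a deprecation notice.
   */
  static int readWithPlainTextNote() {
    return readAll();
  }

  /**
   * Reads everything through the older entry point.
   *
   * @deprecated use {@link #readAll()} instead
   */
  @Deprecated
  static int readWithDeprecatedTag() {
    return readAll();
  }

  /** Replacement method that both older variants delegate to. */
  static int readAll() {
    return 42;
  }

  public static void main(String[] args) {
    // Both variants behave the same at runtime; only the documentation and compiler warnings differ.
    System.out.println(readWithPlainTextNote() == readWithDeprecatedTag());
  }
}
```

With the tag and annotation in place, callers in other classes get a compiler deprecation warning and the generated Javadoc marks the method as deprecated, which the plain-text note does not achieve.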
[jira] [Commented] (PARQUET-1647) [Java] support for Arrow's float16
[ https://issues.apache.org/jira/browse/PARQUET-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791520#comment-17791520 ] ASF GitHub Bot commented on PARQUET-1647: - wgtmac commented on PR #1142: URL: https://github.com/apache/parquet-mr/pull/1142#issuecomment-1833370284 Could you please rebase it? > [Java] support for Arrow's float16 > -- > > Key: PARQUET-1647 > URL: https://issues.apache.org/jira/browse/PARQUET-1647 > Project: Parquet > Issue Type: Improvement > Components: parquet-format, parquet-thrift >Reporter: The Alchemist >Priority: Minor > > h2. DESCRIPTION > > I'm wondering if there's any interest in supporting Arrow's {{float16}} type > in Parquet. > There seem to be one or two {{float16}} / {{halffloat}} tickets here (e.g., > PARQUET-1403) but nothing that speaks to adding half-float support to Parquet > in-general. > > h2. PLANS > I'm able to spend some time on this, if someone points me in the right > direction. > > # Add the {{HALFFLOAT}} or {{FLOAT16}} enum (any preferred naming > convention?) to > [https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L32] > # Add {{HALFFLOAT}} to {{org.apache.parquet.schema.PrimitiveType}} > # Add {{HALFFLOAT}} support to > {{org.apache.parquet.arrow.schema.SchemaConverter}} > # Add encoding for new type at {{org.apache.parquet.column.Encoding}} > # ?? > If anyone has any interest in this, pointers, or comments, they would be > greatly appreciated! -- This message was sent by Atlassian Jira (v8.20.10#820010)
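For readers unfamiliar with the type discussed in PARQUET-1647, the following is a rough sketch of how a 16-bit half-precision value (IEEE 754 binary16: 1 sign bit, 5 exponent bits, 10 mantissa bits) maps to a Java `float`. The class and method names are hypothetical, and this is not the conversion code from the pull request.

```java
// Hypothetical helper; illustrates the binary16 layout only.
class Float16Example {

  /** Decodes the 16 bits stored in {@code h} as an IEEE 754 half-precision value. */
  static float halfToFloat(short h) {
    int bits = h & 0xFFFF;
    int sign = (bits >>> 15) & 0x1;
    int exponent = (bits >>> 10) & 0x1F;
    int mantissa = bits & 0x3FF;

    float value;
    if (exponent == 0) {
      // Zero or subnormal: mantissa * 2^-10 * 2^-14
      value = mantissa * (float) Math.pow(2, -24);
    } else if (exponent == 31) {
      // All exponent bits set: infinity or NaN
      value = (mantissa == 0) ? Float.POSITIVE_INFINITY : Float.NaN;
    } else {
      // Normal number: implicit leading 1, exponent bias of 15
      value = (1 + mantissa / 1024f) * (float) Math.pow(2, exponent - 15);
    }
    return sign == 0 ? value : -value;
  }

  public static void main(String[] args) {
    System.out.println(halfToFloat((short) 0x3C00)); // bit pattern for one
    System.out.println(halfToFloat((short) 0xC000)); // bit pattern for minus two
  }
}
```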
[jira] [Commented] (PARQUET-2385) Don't initialize CodecFactory in ParquetWriter
[ https://issues.apache.org/jira/browse/PARQUET-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791519#comment-17791519 ] ASF GitHub Bot commented on PARQUET-2385: - wgtmac commented on PR #1203: URL: https://github.com/apache/parquet-mr/pull/1203#issuecomment-1833368029 @amousavigourabi It seems a rebase is required. > Don't initialize CodecFactory in ParquetWriter > -- > > Key: PARQUET-2385 > URL: https://issues.apache.org/jira/browse/PARQUET-2385 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Atour Mousavi Gourabi >Priority: Major > > In ParquetWriter we initialize a CodecFactory, instead we should allow users > to set their own via the builder as to provide a little more flexibility > (analogous to PARQUET-2282). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2386) More consistent code style in parquet-mr
[ https://issues.apache.org/jira/browse/PARQUET-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791517#comment-17791517 ] ASF GitHub Bot commented on PARQUET-2386: - wgtmac commented on PR #1209: URL: https://github.com/apache/parquet-mr/pull/1209#issuecomment-1833367177 > I rebased on the latest master. I'd rather get this merged somewhat quickly (maybe by the end of the week?) as to avoid blocking other merges and/or endless rebasing. I just merged this PR to unblock others. Thanks @amousavigourabi! > More consistent code style in parquet-mr > > > Key: PARQUET-2386 > URL: https://issues.apache.org/jira/browse/PARQUET-2386 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Atour Mousavi Gourabi >Assignee: Atour Mousavi Gourabi >Priority: Major > Fix For: 1.14.0 > > > The code style conventions used in parquet-mr are generally inconsistent and > unenforced. We might want to consider using linters such as Spotless and a > more extensive .editorconfig configuration. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2386) More consistent code style in parquet-mr
[ https://issues.apache.org/jira/browse/PARQUET-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791516#comment-17791516 ] ASF GitHub Bot commented on PARQUET-2386: - wgtmac merged PR #1209: URL: https://github.com/apache/parquet-mr/pull/1209 > More consistent code style in parquet-mr > > > Key: PARQUET-2386 > URL: https://issues.apache.org/jira/browse/PARQUET-2386 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Atour Mousavi Gourabi >Assignee: Atour Mousavi Gourabi >Priority: Major > > The code style conventions used in parquet-mr are generally inconsistent and > unenforced. We might want to consider using linters such as Spotless and a > more extensive .editorconfig configuration. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2386) More consistent code style in parquet-mr
[ https://issues.apache.org/jira/browse/PARQUET-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791324#comment-17791324 ] ASF GitHub Bot commented on PARQUET-2386: - amousavigourabi commented on PR #1209: URL: https://github.com/apache/parquet-mr/pull/1209#issuecomment-1832709341 I rebased on the latest master. I'd rather get this merged somewhat quickly (maybe by the end of the week?) as to avoid blocking other merges and/or endless rebasing. > More consistent code style in parquet-mr > > > Key: PARQUET-2386 > URL: https://issues.apache.org/jira/browse/PARQUET-2386 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Atour Mousavi Gourabi >Assignee: Atour Mousavi Gourabi >Priority: Major > > The code style conventions used in parquet-mr are generally inconsistent and > unenforced. We might want to consider using linters such as Spotless and a > more extensive .editorconfig configuration. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2171) Implement vectored IO in parquet file format
[ https://issues.apache.org/jira/browse/PARQUET-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791320#comment-17791320 ] ASF GitHub Bot commented on PARQUET-2171: - Fokko commented on PR #1139: URL: https://github.com/apache/parquet-mr/pull/1139#issuecomment-1832684057 @steveloughran Can you fix the compatibility issue? ``` Error: Failed to execute goal com.github.siom79.japicmp:japicmp-maven-plugin:0.18.2:cmp (default) on project parquet-hadoop: There is at least one incompatibility: org.apache.parquet.hadoop.util.vectorio.BindingUtils.raiseInnerCause(java.util.concurrent.ExecutionException):CLASS_GENERIC_TEMPLATE_CHANGED -> [Help 1] ``` > Implement vectored IO in parquet file format > > > Key: PARQUET-2171 > URL: https://issues.apache.org/jira/browse/PARQUET-2171 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Reporter: Mukund Thakur >Priority: Major > > We recently added a new feature called vectored IO in Hadoop for improving > read performance for seek heavy readers. Spark Jobs and others which uses > parquet will greatly benefit from this api. Details can be found here > [https://github.com/apache/hadoop/commit/e1842b2a749d79cbdc15c524515b9eda64c339d5] > https://issues.apache.org/jira/browse/HADOOP-18103 > https://issues.apache.org/jira/browse/HADOOP-11867 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2386) More consistent code style in parquet-mr
[ https://issues.apache.org/jira/browse/PARQUET-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791300#comment-17791300 ] ASF GitHub Bot commented on PARQUET-2386: - Fokko commented on PR #1209: URL: https://github.com/apache/parquet-mr/pull/1209#issuecomment-1832643602 @amousavigourabi I can see that it is not nice having to rebase all the time. I can hold off on mine, they are probably easier to resolve than yours. > More consistent code style in parquet-mr > > > Key: PARQUET-2386 > URL: https://issues.apache.org/jira/browse/PARQUET-2386 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Atour Mousavi Gourabi >Assignee: Atour Mousavi Gourabi >Priority: Major > > The code style conventions used in parquet-mr are generally inconsistent and > unenforced. We might want to consider using linters such as Spotless and a > more extensive .editorconfig configuration. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2386) More consistent code style in parquet-mr
[ https://issues.apache.org/jira/browse/PARQUET-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791277#comment-17791277 ] ASF GitHub Bot commented on PARQUET-2386: - amousavigourabi commented on PR #1209: URL: https://github.com/apache/parquet-mr/pull/1209#issuecomment-1832537608 @Fokko , how do we want to coordinate this pull request and your refactoring PRs? I do not feel a lot for applying the editorconfig and Spotless a dozen or so times to resolve conflicts in this PR after each of those get merged. > More consistent code style in parquet-mr > > > Key: PARQUET-2386 > URL: https://issues.apache.org/jira/browse/PARQUET-2386 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Atour Mousavi Gourabi >Assignee: Atour Mousavi Gourabi >Priority: Major > > The code style conventions used in parquet-mr are generally inconsistent and > unenforced. We might want to consider using linters such as Spotless and a > more extensive .editorconfig configuration. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2386) More consistent code style in parquet-mr
[ https://issues.apache.org/jira/browse/PARQUET-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791272#comment-17791272 ] ASF GitHub Bot commented on PARQUET-2386: - amousavigourabi commented on code in PR #1209: URL: https://github.com/apache/parquet-mr/pull/1209#discussion_r1409749808 ## pom.xml: ## @@ -512,6 +515,43 @@ + More consistent code style in parquet-mr > > > Key: PARQUET-2386 > URL: https://issues.apache.org/jira/browse/PARQUET-2386 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Atour Mousavi Gourabi >Assignee: Atour Mousavi Gourabi >Priority: Major > > The code style conventions used in parquet-mr are generally inconsistent and > unenforced. We might want to consider using linters such as Spotless and a > more extensive .editorconfig configuration. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2386) More consistent code style in parquet-mr
[ https://issues.apache.org/jira/browse/PARQUET-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791271#comment-17791271 ] ASF GitHub Bot commented on PARQUET-2386: - amousavigourabi commented on code in PR #1209: URL: https://github.com/apache/parquet-mr/pull/1209#discussion_r1409748314 ## pom.xml: ## @@ -512,6 +515,43 @@ + More consistent code style in parquet-mr > > > Key: PARQUET-2386 > URL: https://issues.apache.org/jira/browse/PARQUET-2386 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Atour Mousavi Gourabi >Assignee: Atour Mousavi Gourabi >Priority: Major > > The code style conventions used in parquet-mr are generally inconsistent and > unenforced. We might want to consider using linters such as Spotless and a > more extensive .editorconfig configuration. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2398) Make static variables final
[ https://issues.apache.org/jira/browse/PARQUET-2398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791101#comment-17791101 ] ASF GitHub Bot commented on PARQUET-2398: - Fokko opened a new pull request, #1221: URL: https://github.com/apache/parquet-mr/pull/1221 Make sure you have checked _all_ steps below. These variables should not change, therefore they should be final. ### Jira - [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR" - https://issues.apache.org/jira/browse/PARQUET-XXX - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x). ### Tests - [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: ### Commits - [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 1. Subject is limited to 50 characters (not including Jira issue reference) 1. Subject does not end with a period 1. Subject uses the imperative mood ("add", not "adding") 1. Body wraps at 72 characters 1. Body explains "what" and "why", not "how" ### Documentation - [ ] In case of new functionality, my PR adds documentation that describes how to use it. 
- All the public functions and the classes in the PR contain Javadoc that explain what it does > Make static variables final > --- > > Key: PARQUET-2398 > URL: https://issues.apache.org/jira/browse/PARQUET-2398 > Project: Parquet > Issue Type: Improvement >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
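As a small illustration of the change described in PARQUET-2398 ("These variables should not change, therefore they should be final"), here is a sketch with hypothetical field names; it does not reproduce the fields touched by the pull request.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical example class; field names are made up for illustration.
class StaticFinalExample {

  // Before: a static field that any code in scope could reassign by accident.
  static int mutableLimit = 64;

  // After: 'final' makes the compiler reject any reassignment.
  static final int PAGE_LIMIT = 64;

  // 'final' only fixes the reference, so the list is also wrapped to prevent structural changes.
  static final List<String> CODECS =
      Collections.unmodifiableList(Arrays.asList("SNAPPY", "GZIP"));

  public static void main(String[] args) {
    mutableLimit = 128; // compiles, which is the problem
    // PAGE_LIMIT = 128; // would not compile
    System.out.println(PAGE_LIMIT + " " + CODECS);
  }
}
```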
[jira] [Commented] (PARQUET-2396) Refactor `ColumnIndexBuilder`
[ https://issues.apache.org/jira/browse/PARQUET-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791079#comment-17791079 ] ASF GitHub Bot commented on PARQUET-2396: - Fokko commented on code in PR #1219: URL: https://github.com/apache/parquet-mr/pull/1219#discussion_r1409257114 ## parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java: ## @@ -298,24 +295,22 @@ public > PrimitiveIterator.OfInt visit(NotEq notEq) { public > PrimitiveIterator.OfInt visit(In in) { Set values = in.getValues(); IntSet matchingIndexesForNull = new IntOpenHashSet(); // for null - Iterator it = values.iterator(); - while(it.hasNext()) { -T value = it.next(); -if (value == null) { - if (nullCounts == null) { -// Searching for nulls so if we don't have null related statistics we have to return all pages -return IndexIterator.all(getPageCount()); - } else { -for (int i = 0; i < nullCounts.length; i++) { - if (nullCounts[i] > 0) { -matchingIndexesForNull.add(i); + for (T value : values) { + if (value == null) { Review Comment: Ah, good point. I've updated the PR. Thanks for catching this 👍 > Refactor `ColumnIndexBuilder` > - > > Key: PARQUET-2396 > URL: https://issues.apache.org/jira/browse/PARQUET-2396 > Project: Parquet > Issue Type: Improvement >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
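The diff quoted in this review replaces an explicit `Iterator` loop with an enhanced `for` loop. A simplified sketch of that kind of refactor is shown below; the class, the method names, and the use of plain `java.util` collections are assumptions made for illustration and do not reproduce `ColumnIndexBuilder` itself.

```java
import java.util.Iterator;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical, simplified stand-in for the loop shown in the review diff.
class ForEachRefactorExample {

  // Before: explicit Iterator with hasNext()/next().
  static Set<Integer> pagesWithNullsIterator(Set<String> values, long[] nullCounts) {
    Set<Integer> matching = new TreeSet<>();
    Iterator<String> it = values.iterator();
    while (it.hasNext()) {
      String value = it.next();
      if (value == null) {
        for (int i = 0; i < nullCounts.length; i++) {
          if (nullCounts[i] > 0) {
            matching.add(i);
          }
        }
      }
    }
    return matching;
  }

  // After: enhanced for loop with the same behavior and consistent 2-space indentation.
  static Set<Integer> pagesWithNullsForEach(Set<String> values, long[] nullCounts) {
    Set<Integer> matching = new TreeSet<>();
    for (String value : values) {
      if (value == null) {
        for (int i = 0; i < nullCounts.length; i++) {
          if (nullCounts[i] > 0) {
            matching.add(i);
          }
        }
      }
    }
    return matching;
  }
}
```

Both methods collect the indexes of pages whose null count is positive whenever the searched values contain `null`; only the loop syntax differs.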
[jira] [Commented] (PARQUET-2386) More consistent code style in parquet-mr
[ https://issues.apache.org/jira/browse/PARQUET-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791077#comment-17791077 ] ASF GitHub Bot commented on PARQUET-2386: - Fokko commented on PR #1209: URL: https://github.com/apache/parquet-mr/pull/1209#issuecomment-1831861024 This is great @amousavigourabi. I was thinking of adding linting as well, so great work here! > More consistent code style in parquet-mr > > > Key: PARQUET-2386 > URL: https://issues.apache.org/jira/browse/PARQUET-2386 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Atour Mousavi Gourabi >Assignee: Atour Mousavi Gourabi >Priority: Major > > The code style conventions used in parquet-mr are generally inconsistent and > unenforced. We might want to consider using linters such as Spotless and a > more extensive .editorconfig configuration. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2397) Make use of `isEmpty`
[ https://issues.apache.org/jira/browse/PARQUET-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791074#comment-17791074 ] ASF GitHub Bot commented on PARQUET-2397: - Fokko merged PR #1220: URL: https://github.com/apache/parquet-mr/pull/1220 > Make use of `isEmpty` > - > > Key: PARQUET-2397 > URL: https://issues.apache.org/jira/browse/PARQUET-2397 > Project: Parquet > Issue Type: Improvement >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2171) Implement vectored IO in parquet file format
[ https://issues.apache.org/jira/browse/PARQUET-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17791061#comment-17791061 ] ASF GitHub Bot commented on PARQUET-2171: - steveloughran commented on PR #1139: URL: https://github.com/apache/parquet-mr/pull/1139#issuecomment-1831828739 Code wise, no, other than reviews from others about what is the best place for things, such as that awaitFuture stuff or any other suggestions which people who know the parquet codebase think is best. Code works and we have been testing this through Amazon S3 Express storage for extra speed up. To be ruthless: there's no point paying the premium for that until you've embraced the extra speed ups you get from this first > Implement vectored IO in parquet file format > > > Key: PARQUET-2171 > URL: https://issues.apache.org/jira/browse/PARQUET-2171 > Project: Parquet > Issue Type: New Feature > Components: parquet-mr >Reporter: Mukund Thakur >Priority: Major > > We recently added a new feature called vectored IO in Hadoop for improving > read performance for seek heavy readers. Spark Jobs and others which uses > parquet will greatly benefit from this api. Details can be found here > [https://github.com/apache/hadoop/commit/e1842b2a749d79cbdc15c524515b9eda64c339d5] > https://issues.apache.org/jira/browse/HADOOP-18103 > https://issues.apache.org/jira/browse/HADOOP-11867 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2386) More consistent code style in parquet-mr
[ https://issues.apache.org/jira/browse/PARQUET-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790960#comment-17790960 ] ASF GitHub Bot commented on PARQUET-2386: - gszadovszky commented on code in PR #1209: URL: https://github.com/apache/parquet-mr/pull/1209#discussion_r1408884415 ## pom.xml: ## @@ -512,6 +515,43 @@ + More consistent code style in parquet-mr > > > Key: PARQUET-2386 > URL: https://issues.apache.org/jira/browse/PARQUET-2386 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Atour Mousavi Gourabi >Assignee: Atour Mousavi Gourabi >Priority: Major > > The code style conventions used in parquet-mr are generally inconsistent and > unenforced. We might want to consider using linters such as Spotless and a > more extensive .editorconfig configuration. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2396) Refactor `ColumnIndexBuilder`
[ https://issues.apache.org/jira/browse/PARQUET-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790918#comment-17790918 ] ASF GitHub Bot commented on PARQUET-2396: - zhangjiashen commented on code in PR #1219: URL: https://github.com/apache/parquet-mr/pull/1219#discussion_r1408788644 ## parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java: ## @@ -298,24 +295,22 @@ public > PrimitiveIterator.OfInt visit(NotEq notEq) { public > PrimitiveIterator.OfInt visit(In in) { Set values = in.getValues(); IntSet matchingIndexesForNull = new IntOpenHashSet(); // for null - Iterator it = values.iterator(); - while(it.hasNext()) { -T value = it.next(); -if (value == null) { - if (nullCounts == null) { -// Searching for nulls so if we don't have null related statistics we have to return all pages -return IndexIterator.all(getPageCount()); - } else { -for (int i = 0; i < nullCounts.length; i++) { - if (nullCounts[i] > 0) { -matchingIndexesForNull.add(i); + for (T value : values) { + if (value == null) { Review Comment: Nit: Let's modify the indent spaces to 2 and ensure consistency and similar to changes below? > Refactor `ColumnIndexBuilder` > - > > Key: PARQUET-2396 > URL: https://issues.apache.org/jira/browse/PARQUET-2396 > Project: Parquet > Issue Type: Improvement >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2396) Refactor `ColumnIndexBuilder`
[ https://issues.apache.org/jira/browse/PARQUET-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790917#comment-17790917 ] ASF GitHub Bot commented on PARQUET-2396: - zhangjiashen commented on code in PR #1219: URL: https://github.com/apache/parquet-mr/pull/1219#discussion_r1408788644 ## parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java: ## @@ -298,24 +295,22 @@ public > PrimitiveIterator.OfInt visit(NotEq notEq) { public > PrimitiveIterator.OfInt visit(In in) { Set values = in.getValues(); IntSet matchingIndexesForNull = new IntOpenHashSet(); // for null - Iterator it = values.iterator(); - while(it.hasNext()) { -T value = it.next(); -if (value == null) { - if (nullCounts == null) { -// Searching for nulls so if we don't have null related statistics we have to return all pages -return IndexIterator.all(getPageCount()); - } else { -for (int i = 0; i < nullCounts.length; i++) { - if (nullCounts[i] > 0) { -matchingIndexesForNull.add(i); + for (T value : values) { + if (value == null) { Review Comment: Nit: Let's modify the indent spaces to 2 and ensure consistency? > Refactor `ColumnIndexBuilder` > - > > Key: PARQUET-2396 > URL: https://issues.apache.org/jira/browse/PARQUET-2396 > Project: Parquet > Issue Type: Improvement >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2386) More consistent code style in parquet-mr
[ https://issues.apache.org/jira/browse/PARQUET-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790840#comment-17790840 ] ASF GitHub Bot commented on PARQUET-2386: - wgtmac commented on PR #1209: URL: https://github.com/apache/parquet-mr/pull/1209#issuecomment-1831045581 Thanks for the improvement! Could you please take a look at this? @shangxinli @gszadovszky @Fokko > More consistent code style in parquet-mr > > > Key: PARQUET-2386 > URL: https://issues.apache.org/jira/browse/PARQUET-2386 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Atour Mousavi Gourabi >Assignee: Atour Mousavi Gourabi >Priority: Major > > The code style conventions used in parquet-mr are generally inconsistent and > unenforced. We might want to consider using linters such as Spotless and a > more extensive .editorconfig configuration. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2397) Make use of `isEmpty`
[ https://issues.apache.org/jira/browse/PARQUET-2397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790822#comment-17790822 ] ASF GitHub Bot commented on PARQUET-2397: - Fokko opened a new pull request, #1220: URL: https://github.com/apache/parquet-mr/pull/1220 Make sure you have checked _all_ steps below. ### Jira - [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR" - https://issues.apache.org/jira/browse/PARQUET-2397 - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x). ### Tests - [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: ### Commits - [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 1. Subject is limited to 50 characters (not including Jira issue reference) 1. Subject does not end with a period 1. Subject uses the imperative mood ("add", not "adding") 1. Body wraps at 72 characters 1. Body explains "what" and "why", not "how" ### Documentation - [ ] In case of new functionality, my PR adds documentation that describes how to use it. - All the public functions and the classes in the PR contain Javadoc that explain what it does > Make use of `isEmpty` > - > > Key: PARQUET-2397 > URL: https://issues.apache.org/jira/browse/PARQUET-2397 > Project: Parquet > Issue Type: Improvement >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
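The pull request body does not show the change itself, so the following is only a guess at the general pattern behind "Make use of `isEmpty`": replacing size or length comparisons with the `isEmpty()` method. The class and method names are hypothetical.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical example; shows the general pattern, not the actual parquet-mr changes.
class IsEmptyExample {

  // Before: emptiness expressed as a size comparison.
  static boolean hasNoColumnsBySize(List<String> columns) {
    return columns.size() == 0;
  }

  // After: the same check, stating the intent directly.
  static boolean hasNoColumns(List<String> columns) {
    return columns.isEmpty();
  }

  // The same idea applies to strings: name.length() == 0 becomes name.isEmpty().
  static boolean isBlankName(String name) {
    return name.isEmpty();
  }

  public static void main(String[] args) {
    List<String> none = Collections.emptyList();
    System.out.println(hasNoColumnsBySize(none) == hasNoColumns(none));
  }
}
```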
[jira] [Commented] (PARQUET-2396) Refactor `ColumnIndexBuilder`
[ https://issues.apache.org/jira/browse/PARQUET-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790819#comment-17790819 ] ASF GitHub Bot commented on PARQUET-2396: - Fokko opened a new pull request, #1219: URL: https://github.com/apache/parquet-mr/pull/1219 Small refactor to improve readability Make sure you have checked _all_ steps below. ### Jira - [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR" - https://issues.apache.org/jira/browse/PARQUET-XXX - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x). ### Tests - [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: ### Commits - [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 1. Subject is limited to 50 characters (not including Jira issue reference) 1. Subject does not end with a period 1. Subject uses the imperative mood ("add", not "adding") 1. Body wraps at 72 characters 1. Body explains "what" and "why", not "how" ### Documentation - [ ] In case of new functionality, my PR adds documentation that describes how to use it. - All the public functions and the classes in the PR contain Javadoc that explain what it does > Refactor `ColumnIndexBuilder` > - > > Key: PARQUET-2396 > URL: https://issues.apache.org/jira/browse/PARQUET-2396 > Project: Parquet > Issue Type: Improvement >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2395) Prefer `singletonList` over `asList`
[ https://issues.apache.org/jira/browse/PARQUET-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790815#comment-17790815 ] ASF GitHub Bot commented on PARQUET-2395: - Fokko opened a new pull request, #1218: URL: https://github.com/apache/parquet-mr/pull/1218 Make sure you have checked _all_ steps below. ### Jira - [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR" - https://issues.apache.org/jira/browse/PARQUET-XXX - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x). ### Tests - [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: ### Commits - [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 1. Subject is limited to 50 characters (not including Jira issue reference) 1. Subject does not end with a period 1. Subject uses the imperative mood ("add", not "adding") 1. Body wraps at 72 characters 1. Body explains "what" and "why", not "how" ### Documentation - [ ] In case of new functionality, my PR adds documentation that describes how to use it. - All the public functions and the classes in the PR contain Javadoc that explain what it does > Prefer `singletonList` over `asList` > > > Key: PARQUET-2395 > URL: https://issues.apache.org/jira/browse/PARQUET-2395 > Project: Parquet > Issue Type: Improvement >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
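To make the title of PARQUET-2395 concrete, here is a minimal sketch comparing `Arrays.asList` and `Collections.singletonList` for a one-element list. The class and method names are hypothetical; the pull request's actual call sites are not shown in this message.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical example class comparing the two JDK factory methods.
class SingletonListExample {

  // Before: allocates a varargs array and wraps it in a fixed-size list that still allows set().
  static List<String> viaAsList(String path) {
    return Arrays.asList(path);
  }

  // After: an immutable list holding exactly one element, with no backing array.
  static List<String> viaSingletonList(String path) {
    return Collections.singletonList(path);
  }

  public static void main(String[] args) {
    // The two lists compare equal; they differ in mutability and allocation.
    System.out.println(viaAsList("a.parquet").equals(viaSingletonList("a.parquet")));
  }
}
```

One behavioral difference to keep in mind when swapping them: `set(0, ...)` succeeds on the `Arrays.asList` result but throws `UnsupportedOperationException` on the singleton list.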
[jira] [Commented] (PARQUET-2394) Use `computeIfAbsent` in `MessageColumnIO`
[ https://issues.apache.org/jira/browse/PARQUET-2394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790812#comment-17790812 ] ASF GitHub Bot commented on PARQUET-2394: - Fokko opened a new pull request, #1217: URL: https://github.com/apache/parquet-mr/pull/1217 Make sure you have checked _all_ steps below. ### Jira - [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR" - https://issues.apache.org/jira/browse/PARQUET-2394 - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x). ### Tests - [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: ### Commits - [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 1. Subject is limited to 50 characters (not including Jira issue reference) 1. Subject does not end with a period 1. Subject uses the imperative mood ("add", not "adding") 1. Body wraps at 72 characters 1. Body explains "what" and "why", not "how" ### Documentation - [ ] In case of new functionality, my PR adds documentation that describes how to use it. - All the public functions and the classes in the PR contain Javadoc that explain what it does > Use `computeIfAbsent` in `MessageColumnIO` > -- > > Key: PARQUET-2394 > URL: https://issues.apache.org/jira/browse/PARQUET-2394 > Project: Parquet > Issue Type: Improvement >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
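The following sketch shows the usual get-then-put pattern that `Map.computeIfAbsent` replaces. It uses a hypothetical map of lists and hypothetical method names; the actual code in `MessageColumnIO` is not quoted in this message and may look different.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical example; not the MessageColumnIO code.
class ComputeIfAbsentExample {

  // Before: look up, check for null, create, and put back by hand.
  static void addWithGetPut(Map<String, List<Integer>> index, String key, int value) {
    List<Integer> bucket = index.get(key);
    if (bucket == null) {
      bucket = new ArrayList<>();
      index.put(key, bucket);
    }
    bucket.add(value);
  }

  // After: computeIfAbsent creates the bucket only when the key is missing.
  static void addWithComputeIfAbsent(Map<String, List<Integer>> index, String key, int value) {
    index.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
  }

  public static void main(String[] args) {
    Map<String, List<Integer>> index = new HashMap<>();
    addWithComputeIfAbsent(index, "col", 1);
    addWithComputeIfAbsent(index, "col", 2);
    System.out.println(index);
  }
}
```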
[jira] [Commented] (PARQUET-2393) Make `ColumnIOCreatorVisitor` static
[ https://issues.apache.org/jira/browse/PARQUET-2393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790810#comment-17790810 ] ASF GitHub Bot commented on PARQUET-2393: - Fokko opened a new pull request, #1216: URL: https://github.com/apache/parquet-mr/pull/1216 Make sure you have checked _all_ steps below. ### Jira - [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR" - https://issues.apache.org/jira/browse/PARQUET-XXX - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x). ### Tests - [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: ### Commits - [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 1. Subject is limited to 50 characters (not including Jira issue reference) 1. Subject does not end with a period 1. Subject uses the imperative mood ("add", not "adding") 1. Body wraps at 72 characters 1. Body explains "what" and "why", not "how" ### Documentation - [ ] In case of new functionality, my PR adds documentation that describes how to use it. - All the public functions and the classes in the PR contain Javadoc that explain what it does > Make `ColumnIOCreatorVisitor` static > > > Key: PARQUET-2393 > URL: https://issues.apache.org/jira/browse/PARQUET-2393 > Project: Parquet > Issue Type: Improvement >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
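As background for the issue title, a small sketch of the difference between an inner class and a static nested class in Java. An inner class carries an implicit reference to its enclosing instance, while a static nested class does not and can be created on its own. The names below are hypothetical and unrelated to the real `ColumnIOCreatorVisitor`.

```java
final class StaticNestedSketch {
    private final int factor = 2;

    // Inner class: needs an enclosing StaticNestedSketch instance and may read its fields.
    final class InnerVisitor {
        int visit(int x) {
            return x * factor;
        }
    }

    // Static nested class: no hidden reference to the outer object,
    // so all state it needs must be passed in explicitly.
    static final class NestedVisitor {
        private final int factor;

        NestedVisitor(int factor) {
            this.factor = factor;
        }

        int visit(int x) {
            return x * factor;
        }
    }
}
```

When a nested class never touches the outer instance, declaring it `static` avoids keeping that outer object reachable for no reason.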
[jira] [Commented] (PARQUET-2392) Remove StringBuilder in `LogicalTypeAnnotation`
[ https://issues.apache.org/jira/browse/PARQUET-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790807#comment-17790807 ] ASF GitHub Bot commented on PARQUET-2392: - Fokko opened a new pull request, #1215: URL: https://github.com/apache/parquet-mr/pull/1215 Make sure you have checked _all_ steps below. StringBuilder only makes sense when you concatenate in a loop. ### Jira - [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR" - https://issues.apache.org/jira/browse/PARQUET-2392 - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x). ### Tests - [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: ### Commits - [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 1. Subject is limited to 50 characters (not including Jira issue reference) 1. Subject does not end with a period 1. Subject uses the imperative mood ("add", not "adding") 1. Body wraps at 72 characters 1. Body explains "what" and "why", not "how" ### Documentation - [ ] In case of new functionality, my PR adds documentation that describes how to use it. 
- All the public functions and the classes in the PR contain Javadoc that explain what it does > Remove StringBuilder in `LogicalTypeAnnotation` > --- > > Key: PARQUET-2392 > URL: https://issues.apache.org/jira/browse/PARQUET-2392 > Project: Parquet > Issue Type: Improvement >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
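The remark in the PR description ("StringBuilder only makes sense when you concatenate in a loop") can be illustrated with a short sketch. The method names and the formatted string are hypothetical and do not come from `LogicalTypeAnnotation`.

```java
final class ConcatSketch {
    // Single expression built through a StringBuilder: correct, but verbose.
    static String withBuilder(String name, int precision, int scale) {
        StringBuilder sb = new StringBuilder();
        sb.append(name).append("(").append(precision).append(",").append(scale).append(")");
        return sb.toString();
    }

    // Same result with plain concatenation; the compiler already turns a single
    // '+' expression into efficient code, so nothing is gained by the builder above.
    static String withConcat(String name, int precision, int scale) {
        return name + "(" + precision + "," + scale + ")";
    }

    // A loop is where an explicit builder still pays off, because '+=' inside
    // the loop would allocate a new String on every iteration.
    static String joinAll(String[] parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            sb.append(part).append(';');
        }
        return sb.toString();
    }
}
```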
[jira] [Commented] (PARQUET-2391) Remove unnecessary unboxing
[ https://issues.apache.org/jira/browse/PARQUET-2391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790805#comment-17790805 ] ASF GitHub Bot commented on PARQUET-2391: - Fokko opened a new pull request, #1214: URL: https://github.com/apache/parquet-mr/pull/1214 Make sure you have checked _all_ steps below. ### Jira - [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR" - https://issues.apache.org/jira/browse/PARQUET-2391 - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x). ### Tests - [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: ### Commits - [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 1. Subject is limited to 50 characters (not including Jira issue reference) 1. Subject does not end with a period 1. Subject uses the imperative mood ("add", not "adding") 1. Body wraps at 72 characters 1. Body explains "what" and "why", not "how" ### Documentation - [ ] In case of new functionality, my PR adds documentation that describes how to use it. - All the public functions and the classes in the PR contain Javadoc that explain what it does > Remove unnecessary unboxing > --- > > Key: PARQUET-2391 > URL: https://issues.apache.org/jira/browse/PARQUET-2391 > Project: Parquet > Issue Type: Improvement >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
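A minimal sketch of what "unnecessary unboxing" typically refers to in Java: an explicit `intValue()` call where auto-unboxing would do the same thing. The methods are hypothetical and are not taken from PR #1214.

```java
final class UnboxingSketch {
    // Explicit unboxing: redundant, because the return type already forces the conversion.
    static int explicit(Integer boxed) {
        return boxed.intValue();
    }

    // Auto-unboxing: the compiler inserts the same intValue() call.
    // Both variants throw NullPointerException when 'boxed' is null.
    static int implicit(Integer boxed) {
        return boxed;
    }
}
```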
[jira] [Commented] (PARQUET-2390) Replace anonymouse functions with lambda's
[ https://issues.apache.org/jira/browse/PARQUET-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790800#comment-17790800 ] ASF GitHub Bot commented on PARQUET-2390: - Fokko opened a new pull request, #1213: URL: https://github.com/apache/parquet-mr/pull/1213 They are easier to read Make sure you have checked _all_ steps below. ### Jira - [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR" - https://issues.apache.org/jira/browse/PARQUET-2390 - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x). ### Tests - [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: ### Commits - [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 1. Subject is limited to 50 characters (not including Jira issue reference) 1. Subject does not end with a period 1. Subject uses the imperative mood ("add", not "adding") 1. Body wraps at 72 characters 1. Body explains "what" and "why", not "how" ### Documentation - [ ] In case of new functionality, my PR adds documentation that describes how to use it. - All the public functions and the classes in the PR contain Javadoc that explain what it does > Replace anonymouse functions with lambda's > -- > > Key: PARQUET-2390 > URL: https://issues.apache.org/jira/browse/PARQUET-2390 > Project: Parquet > Issue Type: Improvement >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
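To make the "easier to read" claim concrete, here is a small sketch that writes the same comparator once as an anonymous class and once as a lambda. The comparator itself is a hypothetical example and does not appear in PR #1213.

```java
import java.util.Comparator;

final class LambdaSketch {
    // Anonymous class: five lines of ceremony around one expression.
    static final Comparator<String> BY_LENGTH_ANONYMOUS = new Comparator<String>() {
        @Override
        public int compare(String a, String b) {
            return Integer.compare(a.length(), b.length());
        }
    };

    // Lambda: same behavior, since Comparator has a single abstract method.
    static final Comparator<String> BY_LENGTH_LAMBDA =
        (a, b) -> Integer.compare(a.length(), b.length());
}
```

This rewrite only applies to interfaces with exactly one abstract method; an anonymous class that keeps its own fields or overrides several methods has to stay as it is.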
[jira] [Commented] (PARQUET-2389) Remove redundant initializers
[ https://issues.apache.org/jira/browse/PARQUET-2389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790797#comment-17790797 ] ASF GitHub Bot commented on PARQUET-2389: - Fokko opened a new pull request, #1212: URL: https://github.com/apache/parquet-mr/pull/1212 Just some cleanup Make sure you have checked _all_ steps below. ### Jira - [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR" - https://issues.apache.org/jira/browse/PARQUET-2389 - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x). ### Tests - [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: ### Commits - [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 1. Subject is limited to 50 characters (not including Jira issue reference) 1. Subject does not end with a period 1. Subject uses the imperative mood ("add", not "adding") 1. Body wraps at 72 characters 1. Body explains "what" and "why", not "how" ### Documentation - [ ] In case of new functionality, my PR adds documentation that describes how to use it. - All the public functions and the classes in the PR contain Javadoc that explain what it does > Remove redundant initializers > - > > Key: PARQUET-2389 > URL: https://issues.apache.org/jira/browse/PARQUET-2389 > Project: Parquet > Issue Type: Improvement >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
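A short sketch of what a "redundant initializer" usually means in Java: assigning a field the value it already gets by default. The classes are hypothetical and are not part of PR #1212.

```java
final class InitializerSketch {
    // Each initializer repeats the default value that Java assigns to fields anyway.
    static final class Verbose {
        int count = 0;
        boolean done = false;
        String label = null;
    }

    // Equivalent declarations without the redundant initializers.
    static final class Concise {
        int count;
        boolean done;
        String label;
    }
}
```

This applies to fields only; local variables have no default value and must be assigned before use.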
[jira] [Commented] (PARQUET-2388) Deprecate `CHARSETS` on `PlainValuesWriter`
[ https://issues.apache.org/jira/browse/PARQUET-2388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790795#comment-17790795 ] ASF GitHub Bot commented on PARQUET-2388: - Fokko opened a new pull request, #1211: URL: https://github.com/apache/parquet-mr/pull/1211 Not being used Make sure you have checked _all_ steps below. ### Jira - [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR" - https://issues.apache.org/jira/browse/PARQUET-2388 - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x). ### Tests - [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: ### Commits - [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 1. Subject is limited to 50 characters (not including Jira issue reference) 1. Subject does not end with a period 1. Subject uses the imperative mood ("add", not "adding") 1. Body wraps at 72 characters 1. Body explains "what" and "why", not "how" ### Documentation - [ ] In case of new functionality, my PR adds documentation that describes how to use it. - All the public functions and the classes in the PR contain Javadoc that explain what it does > Deprecate `CHARSETS` on `PlainValuesWriter` > --- > > Key: PARQUET-2388 > URL: https://issues.apache.org/jira/browse/PARQUET-2388 > Project: Parquet > Issue Type: Improvement >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2387) Simplify `hasFieldsIgnored` expression
[ https://issues.apache.org/jira/browse/PARQUET-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790793#comment-17790793 ] ASF GitHub Bot commented on PARQUET-2387: - Fokko opened a new pull request, #1210: URL: https://github.com/apache/parquet-mr/pull/1210 Make sure you have checked _all_ steps below. ### Jira - [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR" - https://issues.apache.org/jira/browse/PARQUET-2387 - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x). ### Tests - [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: ### Commits - [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 1. Subject is limited to 50 characters (not including Jira issue reference) 1. Subject does not end with a period 1. Subject uses the imperative mood ("add", not "adding") 1. Body wraps at 72 characters 1. Body explains "what" and "why", not "how" ### Documentation - [ ] In case of new functionality, my PR adds documentation that describes how to use it. - All the public functions and the classes in the PR contain Javadoc that explain what it does > Simplify `hasFieldsIgnored` expression > -- > > Key: PARQUET-2387 > URL: https://issues.apache.org/jira/browse/PARQUET-2387 > Project: Parquet > Issue Type: Improvement > Components: parquet-thrift >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2344) Bump to Thirft 0.19.0
[ https://issues.apache.org/jira/browse/PARQUET-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790791#comment-17790791 ] ASF GitHub Bot commented on PARQUET-2344: - Fokko commented on PR #1192: URL: https://github.com/apache/parquet-mr/pull/1192#issuecomment-1830859619 @wgtmac Thanks for splitting out the format upgrade. Always a good idea to make PRs smaller. I finally fixed all the tests, and this looks good to go to me 👍 > Bump to Thirft 0.19.0 > - > > Key: PARQUET-2344 > URL: https://issues.apache.org/jira/browse/PARQUET-2344 > Project: Parquet > Issue Type: Bug > Components: parquet-format, parquet-mr >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: format-2.10.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2384) Mark toOriginalType as deprecated
[ https://issues.apache.org/jira/browse/PARQUET-2384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790790#comment-17790790 ] ASF GitHub Bot commented on PARQUET-2384: - Fokko merged PR #1202: URL: https://github.com/apache/parquet-mr/pull/1202 > Mark toOriginalType as deprecated > - > > Key: PARQUET-2384 > URL: https://issues.apache.org/jira/browse/PARQUET-2384 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2384) Mark toOriginalType as deprecated
[ https://issues.apache.org/jira/browse/PARQUET-2384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790789#comment-17790789 ] ASF GitHub Bot commented on PARQUET-2384: - Fokko commented on PR #1202: URL: https://github.com/apache/parquet-mr/pull/1202#issuecomment-1830855336 Thanks for the review @wgtmac > Mark toOriginalType as deprecated > - > > Key: PARQUET-2384 > URL: https://issues.apache.org/jira/browse/PARQUET-2384 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Affects Versions: 1.13.1 >Reporter: Fokko Driesprong >Assignee: Fokko Driesprong >Priority: Major > Fix For: 1.14.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2386) More consistent code style in parquet-mr
[ https://issues.apache.org/jira/browse/PARQUET-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790682#comment-17790682 ] ASF GitHub Bot commented on PARQUET-2386: - amousavigourabi commented on PR #1209: URL: https://github.com/apache/parquet-mr/pull/1209#issuecomment-1830328306 The `.editorconfig` has been expanded for IntelliJ and is mostly compliant with the Spotless configuration. IntelliJ refactoring and Spotless have some minor disagreements on continuation indents sometimes, which cannot really be resolved at the moment. As it is included in the Maven lifecycle, the Spotless configuration would of course be leading. > More consistent code style in parquet-mr > > > Key: PARQUET-2386 > URL: https://issues.apache.org/jira/browse/PARQUET-2386 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Atour Mousavi Gourabi >Assignee: Atour Mousavi Gourabi >Priority: Major > > The code style conventions used in parquet-mr are generally inconsistent and > unenforced. We might want to consider using linters such as Spotless and a > more extensive .editorconfig configuration. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2386) More consistent code style in parquet-mr
[ https://issues.apache.org/jira/browse/PARQUET-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790681#comment-17790681 ] ASF GitHub Bot commented on PARQUET-2386: - amousavigourabi commented on PR #1209: URL: https://github.com/apache/parquet-mr/pull/1209#issuecomment-1830320974 @wgtmac > More consistent code style in parquet-mr > > > Key: PARQUET-2386 > URL: https://issues.apache.org/jira/browse/PARQUET-2386 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Atour Mousavi Gourabi >Assignee: Atour Mousavi Gourabi >Priority: Major > > The code style conventions used in parquet-mr are generally inconsistent and > unenforced. We might want to consider using linters such as Spotless and a > more extensive .editorconfig configuration. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2386) More consistent code style in parquet-mr
[ https://issues.apache.org/jira/browse/PARQUET-2386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790680#comment-17790680 ] ASF GitHub Bot commented on PARQUET-2386: - amousavigourabi opened a new pull request, #1209: URL: https://github.com/apache/parquet-mr/pull/1209 Make sure you have checked _all_ steps below. ### Jira - [x] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR" - https://issues.apache.org/jira/browse/PARQUET-XXX - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x). ### Tests - [x] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: - This PR only refactors style, no logic is added or removed in any way shape or form ### Commits - [x] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 1. Subject is limited to 50 characters (not including Jira issue reference) 1. Subject does not end with a period 1. Subject uses the imperative mood ("add", not "adding") 1. Body wraps at 72 characters 1. Body explains "what" and "why", not "how" ### Documentation - [x] In case of new functionality, my PR adds documentation that describes how to use it. - All the public functions and the classes in the PR contain Javadoc that explain what it does - Adds note in the PR template on the style checks --- This PR contains two commits, the first adds the style checks and configurations, the second applies these changes to the repository. 
> More consistent code style in parquet-mr > > > Key: PARQUET-2386 > URL: https://issues.apache.org/jira/browse/PARQUET-2386 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Atour Mousavi Gourabi >Assignee: Atour Mousavi Gourabi >Priority: Major > > The code style conventions used in parquet-mr are generally inconsistent and > unenforced. We might want to consider using linters such as Spotless and a > more extensive .editorconfig configuration. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2385) Don't initialize CodecFactory in ParquetWriter
[ https://issues.apache.org/jira/browse/PARQUET-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790129#comment-17790129 ] ASF GitHub Bot commented on PARQUET-2385: - amousavigourabi commented on PR #1203: URL: https://github.com/apache/parquet-mr/pull/1203#issuecomment-1828048073 > > @wgtmac , after the comments on the code style I had a quick look through the repository and found that indentations and the such differ quite drastically between (and even within) files. ParquetWriter has indents of four spaces for the constructor arguments, where ParquetReader has them at 14 spaces. Would you feel anything for a more extensive `.editorconfig` and easy-to-use linter such as spotless? > > Yes, I think that would be good to make the style consistent across all files automatically. I'll get started on expanding the `.editorconfig` for at least IntelliJ and setting up a compatible Spotless configuration then. > Don't initialize CodecFactory in ParquetWriter > -- > > Key: PARQUET-2385 > URL: https://issues.apache.org/jira/browse/PARQUET-2385 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Atour Mousavi Gourabi >Priority: Major > > In ParquetWriter we initialize a CodecFactory, instead we should allow users > to set their own via the builder as to provide a little more flexibility > (analogous to PARQUET-2282). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2385) Don't initialize CodecFactory in ParquetWriter
[ https://issues.apache.org/jira/browse/PARQUET-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790122#comment-17790122 ] ASF GitHub Bot commented on PARQUET-2385: - wgtmac commented on PR #1203: URL: https://github.com/apache/parquet-mr/pull/1203#issuecomment-1828031389 cc @Fokko > Don't initialize CodecFactory in ParquetWriter > -- > > Key: PARQUET-2385 > URL: https://issues.apache.org/jira/browse/PARQUET-2385 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Atour Mousavi Gourabi >Priority: Major > > In ParquetWriter we initialize a CodecFactory, instead we should allow users > to set their own via the builder as to provide a little more flexibility > (analogous to PARQUET-2282). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2385) Don't initialize CodecFactory in ParquetWriter
[ https://issues.apache.org/jira/browse/PARQUET-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790121#comment-17790121 ] ASF GitHub Bot commented on PARQUET-2385: - wgtmac commented on PR #1203: URL: https://github.com/apache/parquet-mr/pull/1203#issuecomment-1828030327 > @wgtmac , after the comments on the code style I had a quick look through the repository and found that indentations and the such differ quite drastically between (and even within) files. ParquetWriter has indents of four spaces for the constructor arguments, where ParquetReader has them at 14 spaces. Would you feel anything for a more extensive `.editorconfig` and easy-to-use linter such as spotless? Yes, I think that would be good to make the style consistent across all files automatically. > Don't initialize CodecFactory in ParquetWriter > -- > > Key: PARQUET-2385 > URL: https://issues.apache.org/jira/browse/PARQUET-2385 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Atour Mousavi Gourabi >Priority: Major > > In ParquetWriter we initialize a CodecFactory, instead we should allow users > to set their own via the builder as to provide a little more flexibility > (analogous to PARQUET-2282). -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2385) Don't initialize CodecFactory in ParquetWriter
[ https://issues.apache.org/jira/browse/PARQUET-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790097#comment-17790097 ] ASF GitHub Bot commented on PARQUET-2385: - amousavigourabi commented on PR #1203: URL: https://github.com/apache/parquet-mr/pull/1203#issuecomment-1827942202 @wgtmac , after the comments on the code style I had a quick look through the repository and found that indentations and the such differ quite drastically between (and even within) files. ParquetWriter has indents of four spaces for the constructor arguments, where ParquetReader has them at 14 spaces. Would you feel anything for a more extensive `.editorconfig` and easy-to-use linter such as spotless? > Don't initialize CodecFactory in ParquetWriter > -- > > Key: PARQUET-2385 > URL: https://issues.apache.org/jira/browse/PARQUET-2385 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Atour Mousavi Gourabi >Priority: Major > > In ParquetWriter we initialize a CodecFactory, instead we should allow users > to set their own via the builder as to provide a little more flexibility > (analogous to PARQUET-2282). -- This message was sent by Atlassian Jira (v8.20.10#820010)
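The idea in the issue description (let users supply their own codec factory through the builder instead of always constructing one inside the writer) can be sketched as follows. Every type and method here, including `withCodecFactory`, is a hypothetical stand-in for illustration; none of it is the actual `ParquetWriter` or `CodecFactory` API.

```java
final class WriterSketch {
    // Hypothetical stand-in for a codec factory: maps a codec name to a compressor label.
    interface CodecFactory {
        String compressorFor(String codecName);
    }

    // Hypothetical default used when the caller does not supply a factory.
    static final class DefaultCodecFactory implements CodecFactory {
        @Override
        public String compressorFor(String codecName) {
            return "default:" + codecName;
        }
    }

    private final CodecFactory codecFactory;

    // The writer receives the factory instead of creating it itself.
    private WriterSketch(CodecFactory codecFactory) {
        this.codecFactory = codecFactory;
    }

    String compressor(String codecName) {
        return codecFactory.compressorFor(codecName);
    }

    static final class Builder {
        private CodecFactory codecFactory; // null means "fall back to the default"

        Builder withCodecFactory(CodecFactory factory) {
            this.codecFactory = factory;
            return this;
        }

        WriterSketch build() {
            return new WriterSketch(codecFactory != null ? codecFactory : new DefaultCodecFactory());
        }
    }
}
```

Existing callers that never set a factory keep the default behavior, while callers with special needs can inject their own.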
[jira] [Commented] (PARQUET-2373) Improve I/O performance with bloom_filter_length
[ https://issues.apache.org/jira/browse/PARQUET-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17789930#comment-17789930 ] ASF GitHub Bot commented on PARQUET-2373: - mapleFU commented on PR #1184: URL: https://github.com/apache/parquet-mr/pull/1184#issuecomment-1827228087 FYI, I've updated a BloomFilter with length for testing: https://github.com/apache/parquet-testing/pull/43 > Improve I/O performance with bloom_filter_length > > > Key: PARQUET-2373 > URL: https://issues.apache.org/jira/browse/PARQUET-2373 > Project: Parquet > Issue Type: Improvement >Reporter: Jiashen Zhang >Priority: Minor > > The spec PARQUET-2257 has added bloom_filter_length for reader to load the > bloom filter in a single shot. This implementation alters the code to make > use of the 'bloom_filter_length' field for loading the bloom filter > (consisting of the header and bitset) in order to enhance I/O scheduling. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2373) Improve I/O performance with bloom_filter_length
[ https://issues.apache.org/jira/browse/PARQUET-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17789926#comment-17789926 ] ASF GitHub Bot commented on PARQUET-2373: - zhangjiashen commented on code in PR #1184: URL: https://github.com/apache/parquet-mr/pull/1184#discussion_r1396702452 ## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java: ## @@ -1347,11 +1348,24 @@ public BloomFilter readBloomFilter(ColumnChunkMetaData meta) throws IOException } } -// Read Bloom filter data header. +// Seek to Bloom filter offset. f.seek(bloomFilterOffset); + +// Read Bloom filter length. +int bloomFilterLength = meta.getBloomFilterLength(); + +// If it is set, read Bloom filter header and bitset together. +// Otherwise, read Bloom filter header first and then bitset. +InputStream in = null; +if (bloomFilterLength > 0) { + byte[] headerAndBitSet = new byte[bloomFilterLength]; + f.readFully(headerAndBitSet); + in = new ByteArrayInputStream(headerAndBitSet); +} + BloomFilterHeader bloomFilterHeader; try { - bloomFilterHeader = Util.readBloomFilterHeader(f, bloomFilterDecryptor, bloomFilterHeaderAAD); + bloomFilterHeader = Util.readBloomFilterHeader(in != null ? in : f, bloomFilterDecryptor, bloomFilterHeaderAAD); Review Comment: It would make the code more complex to read if we separated these into two methods. I changed the code a little to avoid several checks; please check whether it makes sense. > Improve I/O performance with bloom_filter_length > > > Key: PARQUET-2373 > URL: https://issues.apache.org/jira/browse/PARQUET-2373 > Project: Parquet > Issue Type: Improvement >Reporter: Jiashen Zhang >Priority: Minor > > The spec PARQUET-2257 has added bloom_filter_length for reader to load the > bloom filter in a single shot. This implementation alters the code to make > use of the 'bloom_filter_length' field for loading the bloom filter > (consisting of the header and bitset) in order to enhance I/O scheduling. 
-- This message was sent by Atlassian Jira (v8.20.10#820010)
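The single-read idea discussed in this thread can be sketched with plain `java.io` streams. The layout used here (one header byte holding the bitset size, followed by the bitset) is a toy assumption chosen only to keep the example self-contained; it is not the Parquet Bloom filter format, and the class and method names are hypothetical rather than the `ParquetFileReader` code under review.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class BloomFilterReadSketch {
    // Two reads against the file stream: the header first, then the bitset.
    static byte[] readTwoStep(InputStream file) {
        try {
            DataInputStream in = new DataInputStream(file);
            int bitsetSize = in.readUnsignedByte(); // toy header: a single size byte
            byte[] bitset = new byte[bitsetSize];
            in.readFully(bitset);
            return bitset;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // One read when the total length (header plus bitset) is known up front;
    // the header is then parsed from the in-memory copy.
    static byte[] readSingleShot(InputStream file, int totalLength) {
        try {
            byte[] headerAndBitset = new byte[totalLength];
            new DataInputStream(file).readFully(headerAndBitset);
            return readTwoStep(new ByteArrayInputStream(headerAndBitset));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Mirrors the fallback in the diff: use the length only when it is set.
    static byte[] read(InputStream file, int totalLength) {
        return totalLength > 0 ? readSingleShot(file, totalLength) : readTwoStep(file);
    }
}
```

With a real file or object-store stream, the single-shot path issues one contiguous read instead of a small header read followed by a second read for the bitset, which is the I/O scheduling benefit the issue describes.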
[jira] [Commented] (PARQUET-1647) [Java] support for Arrow's float16
[ https://issues.apache.org/jira/browse/PARQUET-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17789925#comment-17789925 ] ASF GitHub Bot commented on PARQUET-1647: - zhangjiashen commented on PR #1142: URL: https://github.com/apache/parquet-mr/pull/1142#issuecomment-1827206658 > @zhangjiashen This can be rebased to adopt parquet-format 2.10.0 @wgtmac I just rebased onto the master branch; please take a look when you get a chance. > [Java] support for Arrow's float16 > -- > > Key: PARQUET-1647 > URL: https://issues.apache.org/jira/browse/PARQUET-1647 > Project: Parquet > Issue Type: Improvement > Components: parquet-format, parquet-thrift >Reporter: The Alchemist >Priority: Minor > > h2. DESCRIPTION > > I'm wondering if there's any interest in supporting Arrow's {{float16}} type > in Parquet. > There seem to be one or two {{float16}} / {{halffloat}} tickets here (e.g., > PARQUET-1403) but nothing that speaks to adding half-float support to Parquet > in-general. > > h2. PLANS > I'm able to spend some time on this, if someone points me in the right > direction. > > # Add the {{HALFFLOAT}} or {{FLOAT16}} enum (any preferred naming > convention?) to > [https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L32] > # Add {{HALFFLOAT}} to {{org.apache.parquet.schema.PrimitiveType}} > # Add {{HALFFLOAT}} support to > {{org.apache.parquet.arrow.schema.SchemaConverter}} > # Add encoding for new type at {{org.apache.parquet.column.Encoding}} > # ?? > If anyone has any interest in this, pointers, or comments, they would be > greatly appreciated! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2373) Improve I/O performance with bloom_filter_length
[ https://issues.apache.org/jira/browse/PARQUET-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17789924#comment-17789924 ] ASF GitHub Bot commented on PARQUET-2373: - zhangjiashen commented on PR #1184: URL: https://github.com/apache/parquet-mr/pull/1184#issuecomment-1827205946 > @zhangjiashen This can be rebased to adopt parquet-format 2.10.0 @wgtmac I just rebased onto the master branch; please take a look when you get a chance. > Improve I/O performance with bloom_filter_length > > > Key: PARQUET-2373 > URL: https://issues.apache.org/jira/browse/PARQUET-2373 > Project: Parquet > Issue Type: Improvement >Reporter: Jiashen Zhang >Priority: Minor > > The spec PARQUET-2257 has added bloom_filter_length for reader to load the > bloom filter in a single shot. This implementation alters the code to make > use of the 'bloom_filter_length' field for loading the bloom filter > (consisting of the header and bitset) in order to enhance I/O scheduling. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2042) Unwrap common Protobuf wrappers and logical Timestamps, Date, TimeOfDay
[ https://issues.apache.org/jira/browse/PARQUET-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17789880#comment-17789880 ] ASF GitHub Bot commented on PARQUET-2042: - shangxinli merged PR #900: URL: https://github.com/apache/parquet-mr/pull/900 > Unwrap common Protobuf wrappers and logical Timestamps, Date, TimeOfDay > --- > > Key: PARQUET-2042 > URL: https://issues.apache.org/jira/browse/PARQUET-2042 > Project: Parquet > Issue Type: Improvement > Components: parquet-protobuf >Reporter: Michael Wong >Priority: Major > > Related to https://issues.apache.org/jira/browse/PARQUET-1595 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (PARQUET-2042) Unwrap common Protobuf wrappers and logical Timestamps, Date, TimeOfDay
[ https://issues.apache.org/jira/browse/PARQUET-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17789881#comment-17789881 ] ASF GitHub Bot commented on PARQUET-2042: - shangxinli commented on PR #900: URL: https://github.com/apache/parquet-mr/pull/900#issuecomment-1827000412 Merged
[jira] [Commented] (PARQUET-2385) Don't initialize CodecFactory in ParquetWriter
[ https://issues.apache.org/jira/browse/PARQUET-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17789879#comment-17789879 ] ASF GitHub Bot commented on PARQUET-2385: - wgtmac commented on code in PR #1203: URL: https://github.com/apache/parquet-mr/pull/1203#discussion_r1405536183

## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java: ##
@@ -303,6 +303,32 @@ public ParquetWriter(Path file, Configuration conf, WriteSupport writeSupport
 int maxPaddingSize, ParquetProperties encodingProps, FileEncryptionProperties encryptionProperties) throws IOException {
+this(
+ file,
+ mode,
+ writeSupport,
+ compressionCodecName,
+ new CodecFactory(conf, encodingProps.getPageSizeThreshold()),
+ rowGroupSize,
+ validating,
+ conf,
+ maxPaddingSize,
+ encodingProps,
+ encryptionProperties);
+ }
+
+ ParquetWriter(
+OutputFile file,

Review Comment: It seems the indentation of constructor is 4 spaces elsewhere.

## parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java: ##
@@ -321,17 +347,17 @@ public ParquetWriter(Path file, Configuration conf, WriteSupport writeSupport
 encodingProps.getPageWriteChecksumEnabled(),
 encryptionProperties);
 fileWriter.start();
-this.codecFactory = new CodecFactory(conf, encodingProps.getPageSizeThreshold());
+this.codecFactory = codecFactory;
 CompressionCodecFactory.BytesInputCompressor compressor = codecFactory.getCompressor(compressionCodecName);
 this.writer = new InternalParquetRecordWriter(
-fileWriter,
-writeSupport,
-schema,
-writeContext.getExtraMetaData(),
-rowGroupSize,
-compressor,
-validating,
-encodingProps);
+ fileWriter,

Review Comment: Could you please revert the irrelevant style change?
> Don't initialize CodecFactory in ParquetWriter > -- > > Key: PARQUET-2385 > URL: https://issues.apache.org/jira/browse/PARQUET-2385 > Project: Parquet > Issue Type: Improvement > Components: parquet-mr >Reporter: Atour Mousavi Gourabi >Priority: Major > > In ParquetWriter we initialize a CodecFactory, instead we should allow users > to set their own via the builder as to provide a little more flexibility > (analogous to PARQUET-2282). -- This message was sent by Atlassian Jira (v8.20.10#820010)
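The change under review follows a common pattern: the existing constructor keeps creating a default factory and delegates to a new constructor that accepts one, so a builder can pass in a user-supplied factory. A minimal sketch of that pattern is below. All types here (CodecProvider, DefaultCodecProvider, SketchWriter, Builder and the withCodecProvider method) are hypothetical stand-ins invented for illustration; they are not the parquet-mr classes or builder methods.

```java
class CodecInjectionSketch {
    /** Hypothetical stand-in for a codec factory. */
    interface CodecProvider {
        String compressorFor(String codecName);
    }

    /** Hypothetical default, playing the role of the factory the writer used to create itself. */
    static final class DefaultCodecProvider implements CodecProvider {
        @Override
        public String compressorFor(String codecName) {
            return "default:" + codecName;
        }
    }

    static final class SketchWriter {
        private final String compressor;

        // Existing entry point: still builds the default factory, then delegates.
        SketchWriter(String codecName) {
            this(codecName, new DefaultCodecProvider());
        }

        // New entry point: the factory is supplied by the caller instead of created here.
        SketchWriter(String codecName, CodecProvider codecs) {
            this.compressor = codecs.compressorFor(codecName);
        }

        String compressor() {
            return compressor;
        }
    }

    static final class Builder {
        private String codecName = "UNCOMPRESSED";
        private CodecProvider codecs; // null means "use the default"

        Builder withCodec(String name) {
            this.codecName = name;
            return this;
        }

        Builder withCodecProvider(CodecProvider provider) {
            this.codecs = provider;
            return this;
        }

        SketchWriter build() {
            return codecs == null
                    ? new SketchWriter(codecName)
                    : new SketchWriter(codecName, codecs);
        }
    }

    public static void main(String[] args) {
        SketchWriter byDefault = new Builder().withCodec("SNAPPY").build();
        SketchWriter custom = new Builder()
                .withCodec("SNAPPY")
                .withCodecProvider(name -> "custom:" + name)
                .build();
        System.out.println(byDefault.compressor());
        System.out.println(custom.compressor());
    }
}
```

Keeping the old constructor as a thin delegate means existing callers see no behavior change, while new callers can substitute their own factory.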
[jira] [Commented] (PARQUET-2042) Unwrap common Protobuf wrappers and logical Timestamps, Date, TimeOfDay
[ https://issues.apache.org/jira/browse/PARQUET-2042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17789870#comment-17789870 ] ASF GitHub Bot commented on PARQUET-2042: - mwong38 commented on PR #900: URL: https://github.com/apache/parquet-mr/pull/900#issuecomment-1826944950 > I just noticed this PR and sorry to see it does not check in. @mwong38 Could you try rebase it one last time? Thanks! Done. I really hope it's the last time.
[jira] [Commented] (PARQUET-2385) Don't initialize CodecFactory in ParquetWriter
[ https://issues.apache.org/jira/browse/PARQUET-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17789508#comment-17789508 ] ASF GitHub Bot commented on PARQUET-2385: - amousavigourabi opened a new pull request, #1203: URL: https://github.com/apache/parquet-mr/pull/1203 Make sure you have checked _all_ steps below. ### Jira - [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR" - https://issues.apache.org/jira/browse/PARQUET-XXX - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x). ### Tests - [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason: ### Commits - [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)": 1. Subject is separated from body by a blank line 1. Subject is limited to 50 characters (not including Jira issue reference) 1. Subject does not end with a period 1. Subject uses the imperative mood ("add", not "adding") 1. Body wraps at 72 characters 1. Body explains "what" and "why", not "how" ### Documentation - [ ] In case of new functionality, my PR adds documentation that describes how to use it. 
- All the public functions and the classes in the PR contain Javadoc that explain what it does