[GitHub] [parquet-mr] gszadovszky merged pull request #778: PARQUET-1827: UUID type currently not supported by parquet-mr
gszadovszky merged pull request #778: URL: https://github.com/apache/parquet-mr/pull/778 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Resolved] (PARQUET-1827) UUID type currently not supported by parquet-mr
[ https://issues.apache.org/jira/browse/PARQUET-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabor Szadovszky resolved PARQUET-1827.
---
Resolution: Fixed

> UUID type currently not supported by parquet-mr
> ---
>
> Key: PARQUET-1827
> URL: https://issues.apache.org/jira/browse/PARQUET-1827
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.11.0
> Reporter: Brad Smith
> Assignee: Gabor Szadovszky
> Priority: Major
> Labels: pull-request-available
>
> The parquet-format project introduced a new UUID logical type in version 2.4:
> [https://github.com/apache/parquet-format/blob/master/CHANGES.md]
> This would be a useful type to have available in some circumstances, but it
> currently isn't supported in the parquet-mr library. Hopefully this feature
> can be implemented at some point.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
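For context, parquet-format defines the UUID logical type as an annotation on a 16-byte fixed-length binary, with the value stored in big-endian order. A minimal sketch of that byte layout in plain Java follows. The class `UuidBytes` and its method names are hypothetical; they are not taken from parquet-mr or from pull request #778.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

/** Hypothetical helper, not part of parquet-mr: converts between UUID and its 16-byte big-endian form. */
final class UuidBytes {
  private UuidBytes() {}

  /** Writes the most significant 8 bytes first, then the least significant 8 bytes. */
  static byte[] toBytes(UUID uuid) {
    // ByteBuffer is big-endian by default, which matches the layout described above.
    return ByteBuffer.allocate(16)
        .putLong(uuid.getMostSignificantBits())
        .putLong(uuid.getLeastSignificantBits())
        .array();
  }

  /** Reverses toBytes; rejects arrays that are not exactly 16 bytes long. */
  static UUID fromBytes(byte[] bytes) {
    if (bytes.length != 16) {
      throw new IllegalArgumentException("expected 16 bytes, got " + bytes.length);
    }
    ByteBuffer buffer = ByteBuffer.wrap(bytes);
    long mostSignificant = buffer.getLong();
    long leastSignificant = buffer.getLong();
    return new UUID(mostSignificant, leastSignificant);
  }
}
```

Calling `UuidBytes.fromBytes(UuidBytes.toBytes(id))` should return a value equal to `id`, which is a quick way to check the byte order.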
[jira] [Commented] (PARQUET-1827) UUID type currently not supported by parquet-mr
[ https://issues.apache.org/jira/browse/PARQUET-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125599#comment-17125599 ]

ASF GitHub Bot commented on PARQUET-1827:
-
gszadovszky merged pull request #778:
URL: https://github.com/apache/parquet-mr/pull/778
Re: Proposal for CompressionCodec Provider-aware Compression Codec Lookup for parquet-mr
Hi Xin,

Thanks for the proposal. Could you please make the Google doc public?

Cheers,
Walid

On Thu, Jun 4, 2020, 6:46 AM Dong, Xin wrote:

> Hi, All,
>
> The existing Parquet compression codec framework only supports looking up a
> compression implementation by codec name. The mapping is one-to-one, which
> means only one implementation is supported for a given codec name.
> However, there are various implementations of the same codec, and
> different implementations may not be compatible with each other because
> they serve different purposes. Taking Gzip as an example, some accelerators
> have limited memory capacity, so their history buffer size is smaller than
> that of CPU-based implementations. Currently, the codec framework doesn't
> provide a mechanism that allows users to customize a standard compression
> codec for their own purposes (e.g. performance acceleration, workload
> offloading).
> To address the problem, we propose a provider-aware compression codec
> lookup for parquet-mr. We've put the proposal here:
>
> https://docs.google.com/document/d/1sbCjDxEjM5UkbMPNmGqEfF-LYPDWhM-B474dZZeOFD4/edit?ts=5ecb2462#heading=h.5b2qz2ba32wm
>
> Any comments are welcome; please let us know your feedback.
>
> Thanks,
> Xin Dong
Proposal for CompressionCodec Provider-aware Compression Codec Lookup for parquet-mr
Hi, All,

The existing Parquet compression codec framework only supports looking up a compression implementation by codec name. The mapping is one-to-one, which means only one implementation is supported for a given codec name.

However, there are various implementations of the same codec, and different implementations may not be compatible with each other because they serve different purposes. Taking Gzip as an example, some accelerators have limited memory capacity, so their history buffer size is smaller than that of CPU-based implementations. Currently, the codec framework doesn't provide a mechanism that allows users to customize a standard compression codec for their own purposes (e.g. performance acceleration, workload offloading).

To address the problem, we propose a provider-aware compression codec lookup for parquet-mr. We've put the proposal here:

https://docs.google.com/document/d/1sbCjDxEjM5UkbMPNmGqEfF-LYPDWhM-B474dZZeOFD4/edit?ts=5ecb2462#heading=h.5b2qz2ba32wm

Any comments are welcome; please let us know your feedback.

Thanks,
Xin Dong
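To illustrate the idea only: the proposal document itself is not quoted in this thread, so the following is a minimal sketch of what a lookup keyed by both codec name and provider could look like, not the proposed design. `CodecRegistry`, `Codec`, and the provider names used with them are hypothetical and do not exist in parquet-mr.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical registry (not a parquet-mr class): several implementations per codec name. */
final class CodecRegistry {
  /** Stand-in for a compressor; a real one would wrap an actual codec implementation. */
  interface Codec {
    String describe();
  }

  // codec name -> (provider name -> implementation), kept in registration order
  private final Map<String, Map<String, Codec>> codecs = new LinkedHashMap<>();

  /** Registers an implementation of codecName supplied by the given provider. */
  void register(String codecName, String provider, Codec codec) {
    codecs.computeIfAbsent(codecName, key -> new LinkedHashMap<>()).put(provider, codec);
  }

  /**
   * Returns the implementation from preferredProvider if one is registered;
   * otherwise falls back to the first implementation registered for codecName.
   */
  Optional<Codec> lookup(String codecName, String preferredProvider) {
    Map<String, Codec> byProvider = codecs.get(codecName);
    if (byProvider == null || byProvider.isEmpty()) {
      return Optional.empty();
    }
    Codec preferred = byProvider.get(preferredProvider);
    if (preferred != null) {
      return Optional.of(preferred);
    }
    // Fallback: registration order decides the default implementation.
    return Optional.of(byProvider.values().iterator().next());
  }
}
```

With a built-in and an accelerator-backed Gzip both registered under the name `gzip`, a caller asking for the accelerator provider would get that implementation, while a caller naming an unknown provider would fall back to the first one registered. Whether a fallback is desirable at all is a design choice this sketch assumes; incompatible implementations might call for failing instead.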
[jira] [Commented] (PARQUET-1827) UUID type currently not supported by parquet-mr
[ https://issues.apache.org/jira/browse/PARQUET-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125026#comment-17125026 ]

ASF GitHub Bot commented on PARQUET-1827:
-
shangxinli commented on a change in pull request #778:
URL: https://github.com/apache/parquet-mr/pull/778#discussion_r434631781

## File path: parquet-avro/src/test/java/org/apache/parquet/avro/TestAvroSchemaConverter.java
##
@@ -766,6 +768,33 @@ public void testReuseNameInNestedStructureAtSameLevel() throws Exception {
     testParquetToAvroConversion(NEW_BEHAVIOR, schema, parquetSchema);
   }
+  @Test
+  public void testUUIDType() throws Exception {
+    Schema fromAvro = Schema.createRecord("myrecord", null, null, false,
+        Arrays.asList(new Schema.Field("uuid", LogicalTypes.uuid().addToSchema(Schema.create(STRING)), null, null)));
+    String parquet = "message myrecord {\n" +
+        " required binary uuid (STRING);\n" +
+        "}\n";
+    Schema toAvro = Schema.createRecord("myrecord", null, null, false,
+        Arrays.asList(new Schema.Field("uuid", Schema.create(STRING), null, null)));
+
+    testAvroToParquetConversion(fromAvro, parquet);
+    testParquetToAvroConversion(toAvro, parquet);
+  }

Review comment:
Sounds good
[GitHub] [parquet-mr] shangxinli commented on a change in pull request #778: PARQUET-1827: UUID type currently not supported by parquet-mr
shangxinli commented on a change in pull request #778:
URL: https://github.com/apache/parquet-mr/pull/778#discussion_r434631781

## File path: parquet-avro/src/test/java/org/apache/parquet/avro/TestAvroSchemaConverter.java
##
@@ -766,6 +768,33 @@ public void testReuseNameInNestedStructureAtSameLevel() throws Exception {
     testParquetToAvroConversion(NEW_BEHAVIOR, schema, parquetSchema);
   }
+  @Test
+  public void testUUIDType() throws Exception {
+    Schema fromAvro = Schema.createRecord("myrecord", null, null, false,
+        Arrays.asList(new Schema.Field("uuid", LogicalTypes.uuid().addToSchema(Schema.create(STRING)), null, null)));
+    String parquet = "message myrecord {\n" +
+        " required binary uuid (STRING);\n" +
+        "}\n";
+    Schema toAvro = Schema.createRecord("myrecord", null, null, false,
+        Arrays.asList(new Schema.Field("uuid", Schema.create(STRING), null, null)));
+
+    testAvroToParquetConversion(fromAvro, parquet);
+    testParquetToAvroConversion(toAvro, parquet);
+  }

Review comment:
Sounds good
[jira] [Commented] (PARQUET-1827) UUID type currently not supported by parquet-mr
[ https://issues.apache.org/jira/browse/PARQUET-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124935#comment-17124935 ]

ASF GitHub Bot commented on PARQUET-1827:
-
gszadovszky commented on a change in pull request #778:
URL: https://github.com/apache/parquet-mr/pull/778#discussion_r434550078

## File path: parquet-avro/src/test/java/org/apache/parquet/avro/TestAvroSchemaConverter.java
##
@@ -766,6 +768,33 @@ public void testReuseNameInNestedStructureAtSameLevel() throws Exception {
     testParquetToAvroConversion(NEW_BEHAVIOR, schema, parquetSchema);
   }
+  @Test
+  public void testUUIDType() throws Exception {
+    Schema fromAvro = Schema.createRecord("myrecord", null, null, false,
+        Arrays.asList(new Schema.Field("uuid", LogicalTypes.uuid().addToSchema(Schema.create(STRING)), null, null)));
+    String parquet = "message myrecord {\n" +
+        " required binary uuid (STRING);\n" +
+        "}\n";
+    Schema toAvro = Schema.createRecord("myrecord", null, null, false,
+        Arrays.asList(new Schema.Field("uuid", Schema.create(STRING), null, null)));
+
+    testAvroToParquetConversion(fromAvro, parquet);
+    testParquetToAvroConversion(toAvro, parquet);
+  }

Review comment:
The [`testRoundTripConversion`](https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/test/java/org/apache/parquet/avro/TestAvroSchemaConverter.java#L144) I'm using in `testUUIDTypeWithParquetUUID` is actually stronger than the one you suggested: it checks for equality (in two phases) of the initial and the resulting Avro schemas (and not only for compatibility).
For `testUUIDType`, though, it is a good idea to check the compatibility of the Avro schemas.
[GitHub] [parquet-mr] gszadovszky commented on a change in pull request #778: PARQUET-1827: UUID type currently not supported by parquet-mr
gszadovszky commented on a change in pull request #778:
URL: https://github.com/apache/parquet-mr/pull/778#discussion_r434550078

## File path: parquet-avro/src/test/java/org/apache/parquet/avro/TestAvroSchemaConverter.java
##
@@ -766,6 +768,33 @@ public void testReuseNameInNestedStructureAtSameLevel() throws Exception {
     testParquetToAvroConversion(NEW_BEHAVIOR, schema, parquetSchema);
   }
+  @Test
+  public void testUUIDType() throws Exception {
+    Schema fromAvro = Schema.createRecord("myrecord", null, null, false,
+        Arrays.asList(new Schema.Field("uuid", LogicalTypes.uuid().addToSchema(Schema.create(STRING)), null, null)));
+    String parquet = "message myrecord {\n" +
+        " required binary uuid (STRING);\n" +
+        "}\n";
+    Schema toAvro = Schema.createRecord("myrecord", null, null, false,
+        Arrays.asList(new Schema.Field("uuid", Schema.create(STRING), null, null)));
+
+    testAvroToParquetConversion(fromAvro, parquet);
+    testParquetToAvroConversion(toAvro, parquet);
+  }

Review comment:
The [`testRoundTripConversion`](https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/test/java/org/apache/parquet/avro/TestAvroSchemaConverter.java#L144) I'm using in `testUUIDTypeWithParquetUUID` is actually stronger than the one you suggested: it checks for equality (in two phases) of the initial and the resulting Avro schemas (and not only for compatibility).
For `testUUIDType`, though, it is a good idea to check the compatibility of the Avro schemas.
[jira] [Commented] (PARQUET-1842) Update Jackson Databind version to address CVE
[ https://issues.apache.org/jira/browse/PARQUET-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124919#comment-17124919 ]

Patrick OFriel commented on PARQUET-1842:
-
Really appreciate the update [~gszadovszky], thanks for the info!

> Update Jackson Databind version to address CVE
> --
>
> Key: PARQUET-1842
> URL: https://issues.apache.org/jira/browse/PARQUET-1842
> Project: Parquet
> Issue Type: Task
> Components: parquet-mr
> Affects Versions: 1.11.0
> Environment: Any
> Reporter: Patrick OFriel
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.12.0
>
> The current version of jackson-databind in parquet-mr has several CVEs
> associated with it: [https://nvd.nist.gov/vuln/detail/CVE-2020-10673],
> [https://nvd.nist.gov/vuln/detail/CVE-2020-10672],
> [https://nvd.nist.gov/vuln/detail/CVE-2020-10969],
> [https://nvd.nist.gov/vuln/detail/CVE-2020-1],
> [https://nvd.nist.gov/vuln/detail/CVE-2020-3], (and a few more). We
> should update to jackson-databind 2.9.10.4
[jira] [Resolved] (PARQUET-1866) Replace Hadoop ZSTD with JNI-ZSTD
[ https://issues.apache.org/jira/browse/PARQUET-1866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabor Szadovszky resolved PARQUET-1866.
---
Resolution: Fixed

> Replace Hadoop ZSTD with JNI-ZSTD
> -
>
> Key: PARQUET-1866
> URL: https://issues.apache.org/jira/browse/PARQUET-1866
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.12.0
> Reporter: Xinli Shang
> Assignee: Xinli Shang
> Priority: Major
> Fix For: 1.12.0
>
> The parquet-mr repo has been using
> [ZSTD-JNI|https://github.com/luben/zstd-jni/tree/master/src/main/java/com/github/luben/zstd]
> for the parquet-cli project. Using this JNI library is a cleaner approach than
> using Hadoop ZSTD compression, because 1) installing Hadoop on a development
> box is cumbersome, and 2) older versions of Hadoop don't support ZSTD, and
> upgrading Hadoop is another pain. This Jira is to replace Hadoop ZSTD with
> ZSTD-JNI for the parquet-hadoop project.
> According to the author of ZSTD-JNI, Flink, Spark, and Cassandra all use
> ZSTD-JNI for ZSTD.
> Another approach is to use https://github.com/airlift/aircompressor, which is
> a pure Java implementation. But it seems the compression level is not
> adjustable in aircompressor.
[jira] [Commented] (PARQUET-1866) Replace Hadoop ZSTD with JNI-ZSTD
[ https://issues.apache.org/jira/browse/PARQUET-1866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124682#comment-17124682 ]

ASF GitHub Bot commented on PARQUET-1866:
-
gszadovszky merged pull request #793:
URL: https://github.com/apache/parquet-mr/pull/793
[GitHub] [parquet-mr] gszadovszky merged pull request #793: PARQUET-1866: Replace Hadoop ZSTD with JNI-ZSTD
gszadovszky merged pull request #793:
URL: https://github.com/apache/parquet-mr/pull/793
[jira] [Commented] (PARQUET-1842) Update Jackson Databind version to address CVE
[ https://issues.apache.org/jira/browse/PARQUET-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124675#comment-17124675 ]

Gabor Szadovszky commented on PARQUET-1842:
---
[~pofriel], we are working intensively on integrating the encryption feature. I hope we can initiate a release in 30 days. How long it will take to get a successful vote on it is another question.