[GitHub] [parquet-mr] gszadovszky merged pull request #778: PARQUET-1827: UUID type currently not supported by parquet-mr

2020-06-03 Thread GitBox


gszadovszky merged pull request #778:
URL: https://github.com/apache/parquet-mr/pull/778


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Resolved] (PARQUET-1827) UUID type currently not supported by parquet-mr

2020-06-03 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1827.
---
Resolution: Fixed

> UUID type currently not supported by parquet-mr
> ---
>
> Key: PARQUET-1827
> URL: https://issues.apache.org/jira/browse/PARQUET-1827
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.11.0
> Reporter: Brad Smith
> Assignee: Gabor Szadovszky
> Priority: Major
> Labels: pull-request-available
>
> The parquet-format project introduced a new UUID logical type in version 2.4:
> [https://github.com/apache/parquet-format/blob/master/CHANGES.md]
> This would be a useful type to have available in some circumstances, but it 
> currently isn't supported in the parquet-mr library. Hopefully this feature 
> can be implemented at some point.
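
For context on what the requested type stores: the parquet-format specification describes the UUID logical type as an annotation on a 16-byte fixed-length binary, with the value in big-endian order. The sketch below shows one way that byte layout could be produced and read back using only the JDK. The class and method names (`UuidBytes`, `toBytes`, `fromBytes`) are hypothetical illustrations and are not part of the parquet-mr API or of pull request #778.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

public class UuidBytes {
    // Encode a UUID as 16 bytes, most significant byte first.
    // ByteBuffer is big-endian by default, which matches the layout
    // described for the UUID logical type.
    public static byte[] toBytes(UUID uuid) {
        return ByteBuffer.allocate(16)
            .putLong(uuid.getMostSignificantBits())
            .putLong(uuid.getLeastSignificantBits())
            .array();
    }

    // Decode 16 big-endian bytes back into a UUID.
    public static UUID fromBytes(byte[] bytes) {
        if (bytes.length != 16) {
            throw new IllegalArgumentException("Expected 16 bytes, got " + bytes.length);
        }
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long mostSignificant = buf.getLong();
        long leastSignificant = buf.getLong();
        return new UUID(mostSignificant, leastSignificant);
    }

    public static void main(String[] args) {
        UUID original = UUID.fromString("123e4567-e89b-12d3-a456-426614174000");
        byte[] encoded = toBytes(original);
        UUID decoded = fromBytes(encoded);
        System.out.println(encoded.length + " bytes, round trip equal: " + original.equals(decoded));
    }
}
```

A fixed 16-byte value is more compact than the 36-character string form, which is one reason a dedicated logical type can be preferable to annotating the column as a string.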



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1827) UUID type currently not supported by parquet-mr

2020-06-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125599#comment-17125599
 ] 

ASF GitHub Bot commented on PARQUET-1827:
-

gszadovszky merged pull request #778:
URL: https://github.com/apache/parquet-mr/pull/778


   





Re: Proposal for CompressionCodec Provider-aware Compression Codec Lookup for parquet-mr

2020-06-03 Thread Gara Walid
Hi Xin,

Thanks for the proposal. Could you please make the Google Doc public?

Cheers,
Walid

On Thu, Jun 4, 2020, 6:46 AM Dong, Xin  wrote:

> Hi, All,
>
> The existing Parquet compression codec framework looks up a compression
> implementation by codec name only. The mapping is one-to-one, so only one
> implementation is supported for a given codec name.
> However, there are various implementations of the same codec, and they may
> not be compatible with one another because they serve different purposes.
> Take Gzip as an example: some accelerators have limited memory capacity, so
> their history buffer is smaller than that of a CPU-based implementation.
> The codec framework currently provides no mechanism for users to customize
> a standard compression codec for their own purposes (e.g. performance
> acceleration, workload offloading).
> To address the problem, we propose a provider-aware compression codec
> lookup for parquet-mr. We've put the proposal here:
>
> https://docs.google.com/document/d/1sbCjDxEjM5UkbMPNmGqEfF-LYPDWhM-B474dZZeOFD4/edit?ts=5ecb2462#heading=h.5b2qz2ba32wm
>
> Any comments are welcome; please let us know your feedback.
>
> Thanks,
> Xin Dong
>


Proposal for CompressionCodec Provider-aware Compression Codec Lookup for parquet-mr

2020-06-03 Thread Dong, Xin
Hi, All,

The existing Parquet compression codec framework looks up a compression
implementation by codec name only. The mapping is one-to-one, so only one
implementation is supported for a given codec name.
However, there are various implementations of the same codec, and they may not
be compatible with one another because they serve different purposes. Take Gzip
as an example: some accelerators have limited memory capacity, so their history
buffer is smaller than that of a CPU-based implementation. The codec framework
currently provides no mechanism for users to customize a standard compression
codec for their own purposes (e.g. performance acceleration, workload
offloading).
To address the problem, we propose a provider-aware compression codec lookup
for parquet-mr. We've put the proposal here:
https://docs.google.com/document/d/1sbCjDxEjM5UkbMPNmGqEfF-LYPDWhM-B474dZZeOFD4/edit?ts=5ecb2462#heading=h.5b2qz2ba32wm

Any comments are welcome; please let us know your feedback.

Thanks,
Xin Dong
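
To make the idea in the message above concrete, here is a minimal sketch of what a lookup keyed by both codec name and provider could look like. It is not taken from the linked proposal. The `CodecRegistry` class, the `Codec` interface, the provider names, and the fallback to a "default" provider are all assumptions chosen for illustration, and none of them are existing parquet-mr APIs.

```java
import java.util.HashMap;
import java.util.Map;

public class CodecRegistry {
    /** Hypothetical stand-in for a compression implementation. */
    public interface Codec {
        String describe();
    }

    // Assumed name of the provider used when no specific one matches.
    private static final String DEFAULT_PROVIDER = "default";

    // codec name -> (provider name -> implementation)
    private final Map<String, Map<String, Codec>> codecs = new HashMap<>();

    public void register(String codecName, String provider, Codec codec) {
        codecs.computeIfAbsent(codecName, k -> new HashMap<>()).put(provider, codec);
    }

    // Look up by codec name and provider; fall back to the default provider
    // (an assumption of this sketch) when the requested one is not registered.
    public Codec lookup(String codecName, String provider) {
        Map<String, Codec> byProvider = codecs.get(codecName);
        if (byProvider == null) {
            throw new IllegalArgumentException("Unknown codec: " + codecName);
        }
        Codec codec = byProvider.get(provider);
        if (codec == null) {
            codec = byProvider.get(DEFAULT_PROVIDER);
        }
        if (codec == null) {
            throw new IllegalArgumentException(
                "No implementation of " + codecName + " for provider " + provider);
        }
        return codec;
    }

    public static void main(String[] args) {
        CodecRegistry registry = new CodecRegistry();
        registry.register("gzip", "default", () -> "CPU-based gzip");
        registry.register("gzip", "accelerator", () -> "accelerator gzip");

        System.out.println(registry.lookup("gzip", "accelerator").describe());
        System.out.println(registry.lookup("gzip", "unregistered").describe());
    }
}
```

With a name-only lookup the inner map collapses to a single entry per codec, which is the one-to-one limitation the message describes.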


[jira] [Commented] (PARQUET-1827) UUID type currently not supported by parquet-mr

2020-06-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17125026#comment-17125026
 ] 

ASF GitHub Bot commented on PARQUET-1827:
-

shangxinli commented on a change in pull request #778:
URL: https://github.com/apache/parquet-mr/pull/778#discussion_r434631781



##
File path: parquet-avro/src/test/java/org/apache/parquet/avro/TestAvroSchemaConverter.java
##
@@ -766,6 +768,33 @@ public void testReuseNameInNestedStructureAtSameLevel() throws Exception {
     testParquetToAvroConversion(NEW_BEHAVIOR, schema, parquetSchema);
   }
 
+  @Test
+  public void testUUIDType() throws Exception {
+    Schema fromAvro = Schema.createRecord("myrecord", null, null, false,
+        Arrays.asList(new Schema.Field("uuid", LogicalTypes.uuid().addToSchema(Schema.create(STRING)), null, null)));
+    String parquet = "message myrecord {\n" +
+        "  required binary uuid (STRING);\n" +
+        "}\n";
+    Schema toAvro = Schema.createRecord("myrecord", null, null, false,
+        Arrays.asList(new Schema.Field("uuid", Schema.create(STRING), null, null)));
+
+    testAvroToParquetConversion(fromAvro, parquet);
+    testParquetToAvroConversion(toAvro, parquet);
+  }

Review comment:
   Sounds good







[GitHub] [parquet-mr] shangxinli commented on a change in pull request #778: PARQUET-1827: UUID type currently not supported by parquet-mr

2020-06-03 Thread GitBox


shangxinli commented on a change in pull request #778:
URL: https://github.com/apache/parquet-mr/pull/778#discussion_r434631781



##
File path: parquet-avro/src/test/java/org/apache/parquet/avro/TestAvroSchemaConverter.java
##
@@ -766,6 +768,33 @@ public void testReuseNameInNestedStructureAtSameLevel() throws Exception {
     testParquetToAvroConversion(NEW_BEHAVIOR, schema, parquetSchema);
   }
 
+  @Test
+  public void testUUIDType() throws Exception {
+    Schema fromAvro = Schema.createRecord("myrecord", null, null, false,
+        Arrays.asList(new Schema.Field("uuid", LogicalTypes.uuid().addToSchema(Schema.create(STRING)), null, null)));
+    String parquet = "message myrecord {\n" +
+        "  required binary uuid (STRING);\n" +
+        "}\n";
+    Schema toAvro = Schema.createRecord("myrecord", null, null, false,
+        Arrays.asList(new Schema.Field("uuid", Schema.create(STRING), null, null)));
+
+    testAvroToParquetConversion(fromAvro, parquet);
+    testParquetToAvroConversion(toAvro, parquet);
+  }

Review comment:
   Sounds good









[jira] [Commented] (PARQUET-1827) UUID type currently not supported by parquet-mr

2020-06-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124935#comment-17124935
 ] 

ASF GitHub Bot commented on PARQUET-1827:
-

gszadovszky commented on a change in pull request #778:
URL: https://github.com/apache/parquet-mr/pull/778#discussion_r434550078



##
File path: parquet-avro/src/test/java/org/apache/parquet/avro/TestAvroSchemaConverter.java
##
@@ -766,6 +768,33 @@ public void testReuseNameInNestedStructureAtSameLevel() throws Exception {
     testParquetToAvroConversion(NEW_BEHAVIOR, schema, parquetSchema);
   }
 
+  @Test
+  public void testUUIDType() throws Exception {
+    Schema fromAvro = Schema.createRecord("myrecord", null, null, false,
+        Arrays.asList(new Schema.Field("uuid", LogicalTypes.uuid().addToSchema(Schema.create(STRING)), null, null)));
+    String parquet = "message myrecord {\n" +
+        "  required binary uuid (STRING);\n" +
+        "}\n";
+    Schema toAvro = Schema.createRecord("myrecord", null, null, false,
+        Arrays.asList(new Schema.Field("uuid", Schema.create(STRING), null, null)));
+
+    testAvroToParquetConversion(fromAvro, parquet);
+    testParquetToAvroConversion(toAvro, parquet);
+  }

Review comment:
   The [`testRoundTripConversion`](https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/test/java/org/apache/parquet/avro/TestAvroSchemaConverter.java#L144)
   that I'm using in `testUUIDTypeWithParquetUUID` is actually stronger than the
   one you suggested: it checks for equality (in two phases) of the initial and
   the resulting Avro schemas (and not only for compatibility). For
   `testUUIDType`, though, it is a good idea to check the compatibility of the
   Avro schemas.







[GitHub] [parquet-mr] gszadovszky commented on a change in pull request #778: PARQUET-1827: UUID type currently not supported by parquet-mr

2020-06-03 Thread GitBox


gszadovszky commented on a change in pull request #778:
URL: https://github.com/apache/parquet-mr/pull/778#discussion_r434550078



##
File path: parquet-avro/src/test/java/org/apache/parquet/avro/TestAvroSchemaConverter.java
##
@@ -766,6 +768,33 @@ public void testReuseNameInNestedStructureAtSameLevel() throws Exception {
     testParquetToAvroConversion(NEW_BEHAVIOR, schema, parquetSchema);
   }
 
+  @Test
+  public void testUUIDType() throws Exception {
+    Schema fromAvro = Schema.createRecord("myrecord", null, null, false,
+        Arrays.asList(new Schema.Field("uuid", LogicalTypes.uuid().addToSchema(Schema.create(STRING)), null, null)));
+    String parquet = "message myrecord {\n" +
+        "  required binary uuid (STRING);\n" +
+        "}\n";
+    Schema toAvro = Schema.createRecord("myrecord", null, null, false,
+        Arrays.asList(new Schema.Field("uuid", Schema.create(STRING), null, null)));
+
+    testAvroToParquetConversion(fromAvro, parquet);
+    testParquetToAvroConversion(toAvro, parquet);
+  }

Review comment:
   The [`testRoundTripConversion`](https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/test/java/org/apache/parquet/avro/TestAvroSchemaConverter.java#L144)
   that I'm using in `testUUIDTypeWithParquetUUID` is actually stronger than the
   one you suggested: it checks for equality (in two phases) of the initial and
   the resulting Avro schemas (and not only for compatibility). For
   `testUUIDType`, though, it is a good idea to check the compatibility of the
   Avro schemas.









[jira] [Commented] (PARQUET-1842) Update Jackson Databind version to address CVE

2020-06-03 Thread Patrick OFriel (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124919#comment-17124919
 ] 

Patrick OFriel commented on PARQUET-1842:
-

Really appreciate the update [~gszadovszky] , thanks for the info!

> Update Jackson Databind version to address CVE
> --
>
> Key: PARQUET-1842
> URL: https://issues.apache.org/jira/browse/PARQUET-1842
> Project: Parquet
> Issue Type: Task
> Components: parquet-mr
> Affects Versions: 1.11.0
> Environment: Any
> Reporter: Patrick OFriel
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.12.0
>
>
> The current version of jackson-databind in parquet-mr has several CVEs 
> associated with it: [https://nvd.nist.gov/vuln/detail/CVE-2020-10673], 
> [https://nvd.nist.gov/vuln/detail/CVE-2020-10672], 
> [https://nvd.nist.gov/vuln/detail/CVE-2020-10969], 
> [https://nvd.nist.gov/vuln/detail/CVE-2020-1], 
> [https://nvd.nist.gov/vuln/detail/CVE-2020-3], (and a few more). We 
> should update to jackson-databind 2.9.10.4





[jira] [Resolved] (PARQUET-1866) Replace Hadoop ZSTD with JNI-ZSTD

2020-06-03 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1866.
---
Resolution: Fixed

> Replace Hadoop ZSTD with JNI-ZSTD
> -
>
> Key: PARQUET-1866
> URL: https://issues.apache.org/jira/browse/PARQUET-1866
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.12.0
> Reporter: Xinli Shang
> Assignee: Xinli Shang
> Priority: Major
> Fix For: 1.12.0
>
>
> The parquet-mr repo has been using
> [ZSTD-JNI|https://github.com/luben/zstd-jni/tree/master/src/main/java/com/github/luben/zstd]
> for the parquet-cli project. Using this JNI binding is a cleaner approach than
> using Hadoop ZSTD compression, because 1) installing Hadoop on a development
> machine is cumbersome, and 2) older versions of Hadoop don't support ZSTD, and
> upgrading Hadoop is another pain. This Jira is to replace Hadoop ZSTD with
> ZSTD-JNI in the parquet-hadoop project.
> According to the author of ZSTD-JNI, Flink, Spark, and Cassandra all use
> ZSTD-JNI for ZSTD.
> Another approach is to use https://github.com/airlift/aircompressor, which is
> a pure Java implementation, but its compression level does not seem to be
> adjustable.





[jira] [Commented] (PARQUET-1866) Replace Hadoop ZSTD with JNI-ZSTD

2020-06-03 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124682#comment-17124682
 ] 

ASF GitHub Bot commented on PARQUET-1866:
-

gszadovszky merged pull request #793:
URL: https://github.com/apache/parquet-mr/pull/793


   





[GitHub] [parquet-mr] gszadovszky merged pull request #793: PARQUET-1866: Replace Hadoop ZSTD with JNI-ZSTD

2020-06-03 Thread GitBox


gszadovszky merged pull request #793:
URL: https://github.com/apache/parquet-mr/pull/793


   







[jira] [Commented] (PARQUET-1842) Update Jackson Databind version to address CVE

2020-06-03 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124675#comment-17124675
 ] 

Gabor Szadovszky commented on PARQUET-1842:
---

[~pofriel], we are working intensively on integrating the encryption feature. I
hope we can initiate a release in 30 days. How long it will then take to get a
successful vote on it is another question.
