[GitHub] [parquet-mr] shangxinli opened a new pull request #799: Parquet-1872: Add TransCompression command to parquet-tools - Add the…

2020-06-27 Thread GitBox


shangxinli opened a new pull request #799:
URL: https://github.com/apache/parquet-mr/pull/799


   … command to registry to complete
   
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references 
them in the PR title. For example, "PARQUET-1234: My Parquet PR"
 - https://issues.apache.org/jira/browse/PARQUET-XXX
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines from "[How to write a good git 
commit message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and classes in the PR contain Javadoc that 
explains what they do
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (PARQUET-1879) Apache Arrow can not read a Parquet File written with Parquet-Avro 1.11.0 with a Map field

2020-06-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17146873#comment-17146873
 ] 

ASF GitHub Bot commented on PARQUET-1879:
-

maccamlc edited a comment on pull request #798:
URL: https://github.com/apache/parquet-mr/pull/798#issuecomment-650527774


   @gszadovszky before this gets merged, I just wanted to clarify something 
myself after looking more into the format spec, that might tidy this issue up 
further.
   
   * Is MAP_KEY_VALUE required to still be written as the Converted Type when 
creating new files?
   
   From what I could see from some older issues, such as 
[PARQUET-335](https://issues.apache.org/jira/browse/PARQUET-335) and the 
[backwards-compatibility 
rules](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1), 
it seems to have always been an optional type, and it has also been used 
incorrectly in the past.
   
   It appears that older versions of Parquet would be able to read the Map type 
in the schema without MAP_KEY_VALUE.
   
   If that is true, I would probably suggest pushing this [additional 
commit](https://github.com/maccamlc/parquet-mr/commit/9ca7652b5d0f9946791089e60193dd10a4a97604)
 that I tested, onto this PR.
   
   It would mean that any unexpected uses of LogicalType.MAP_KEY_VALUE would 
result in UNKNOWN being written to the file. But it is removed from the 
ConversionPatterns path, meaning that my case of this occurring when converting 
an Avro schema is still fixed, and tested.
   
   Let me know if you believe this might be the preferred fix, or if what I 
have already done is better.
   
   From what I can see, it all depends on whether the MAP_KEY_VALUE type is 
required as an Original Type, or whether it is acceptable for it to be null 
for older readers.
   
   Thanks
   Matt
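
The fallback described in the comment above (any unexpected use of 
LogicalType.MAP_KEY_VALUE being written out as UNKNOWN) can be sketched as a 
simple lookup. This is an illustrative Python sketch, not the parquet-mr 
implementation; the names `LOGICAL_FOR_CONVERTED` and `logical_annotation` are 
invented here:

```python
# Hypothetical sketch of the converted-type -> logical-type translation
# discussed above; not the actual parquet-mr code.
LOGICAL_FOR_CONVERTED = {
    "MAP": "MAP",
    "LIST": "LIST",
    "UTF8": "STRING",
    # MAP_KEY_VALUE has no logical-type equivalent in parquet-format,
    # so it falls through to the UNKNOWN default below.
}

def logical_annotation(converted_type):
    """Return the logical type to write for a legacy converted type."""
    return LOGICAL_FOR_CONVERTED.get(converted_type, "UNKNOWN")

print(logical_annotation("MAP"))            # MAP
print(logical_annotation("MAP_KEY_VALUE"))  # UNKNOWN
```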





> Apache Arrow can not read a Parquet File written with Parquet-Avro 1.11.0 with 
> a Map field
> -
>
> Key: PARQUET-1879
> URL: https://issues.apache.org/jira/browse/PARQUET-1879
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro, parquet-format
>Affects Versions: 1.11.0
>Reporter: Matthew McMahon
>Priority: Critical
>
> From my 
> [StackOverflow question|https://stackoverflow.com/questions/62504757/issue-with-loading-parquet-data-into-snowflake-cloud-database-when-written-with], 
> in relation to an issue I'm having with getting Snowflake (Cloud DB) to load 
> Parquet files written with version 1.11.0.
> 
> The problem only appears when using a map schema field in the Avro schema. 
> For example:
> {code:java}
> {
>   "name": "FeatureAmounts",
>   "type": {
> "type": "map",
> "values": "records.MoneyDecimal"
>   }
> }
> {code}
> When using Parquet-Avro to write the file, it ends up with a bad Parquet 
> schema, for example:
> {code:java}
> message record.ResponseRecord {
>   required binary GroupId (STRING);
>   required int64 EntryTime (TIMESTAMP(MILLIS,true));
>   required int64 HandlingDuration;
>   required binary Id (STRING);
>   optional binary ResponseId (STRING);
>   required binary RequestId (STRING);
>   optional fixed_len_byte_array(12) CostInUSD (DECIMAL(28,15));
>   required group FeatureAmounts (MAP) {
> repeated group map (MAP_KEY_VALUE) {
>   required binary key (STRING);
>   required fixed_len_byte_array(12) value (DECIMAL(28,15));
> }
>   }
> }
> {code}
> From the great answer to my StackOverflow question, it seems the issue is that 
> the 1.11.0 Parquet-Avro is still using the legacy MAP_KEY_VALUE converted 
> type, which has no logical type equivalent. From the comment on 
> [LogicalTypeAnnotation|https://github.com/apache/parquet-mr/blob/84c954d8a4feef2d9bdad7a236a7268ef71a1c25/parquet-column/src/main/java/org/apache/parquet/schema/LogicalTypeAnnotation.java#L904]:
> {code:java}
> // This logical type annotation is implemented to support backward 
> compatibility with ConvertedType.
>   // The new logical type representation in parquet-format doesn't have any 
> key-value type,
>   // thus this annotation is mapped to UNKNOWN. This type shouldn't be used.
> {code}
> However, it seems this is being written with the latest 1.11.0, which then 
> causes Apache Arrow to fail with
> {code:java}
> Logical type Null can not be applied to group node
> {code}
> As it appears that 
> [Arrow|https://github.com/apache/arrow/blob/master/cpp/src/parquet/types.cc#L629-L632]
>  only looks for the new logical type of Map or List, this causes an error.
> I have seen in the Parquet format's 
> [LogicalTypes|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md]
>  that it should be something like
> {code:java}
> // Map
> requir
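
For reference, under the current 
[LogicalTypes.md](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md) 
rules the FeatureAmounts field shown above would be expected to take the 
three-level form below, with no annotation on the repeated group. This is a 
sketch based on the spec, not the output of any particular writer:

```
required group FeatureAmounts (MAP) {
  repeated group key_value {
    required binary key (STRING);
    required fixed_len_byte_array(12) value (DECIMAL(28,15));
  }
}
```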


[jira] [Commented] (PARQUET-1879) Apache Arrow can not read a Parquet File written with Parquet-Avro 1.11.0 with a Map field

2020-06-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17146869#comment-17146869
 ] 

ASF GitHub Bot commented on PARQUET-1879:
-

maccamlc edited a comment on pull request #798:
URL: https://github.com/apache/parquet-mr/pull/798#issuecomment-650527774


   @gszadovszky before this gets merged, I just wanted to clarify something 
myself after looking more into the format spec, that might tidy this issue up 
further.
   
   * Is MAP_KEY_VALUE required to still be written as the Converted Type when 
creating new files?
   
   From what I could see from some older issues and the 
[backwards-compatibility 
rules](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1), 
it seems to have always been an optional type, and it has also been used 
incorrectly in the past.
   
   It appears that older versions of Parquet would be able to read the Map type 
in the schema without MAP_KEY_VALUE.
   
   If that is true, I would probably suggest pushing this [additional 
commit](https://github.com/maccamlc/parquet-mr/commit/81738854062ea36f59a993cb4206c8874881d491)
 that I tested, onto this PR.
   
   It would mean that any unexpected uses of LogicalType.MAP_KEY_VALUE would 
result in UNKNOWN being written to the file. But it is removed from the 
ConversionPatterns path, meaning that my case of this occurring when converting 
an Avro schema is still fixed, and tested.
   
   Let me know if you believe this might be the preferred fix, or if what I 
have already done is better.
   
   Thanks
   Matt










[jira] [Commented] (PARQUET-1879) Apache Arrow can not read a Parquet File written with Parquet-Avro 1.11.0 with a Map field

2020-06-27 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17146850#comment-17146850
 ] 

ASF GitHub Bot commented on PARQUET-1879:
-

maccamlc commented on pull request #798:
URL: https://github.com/apache/parquet-mr/pull/798#issuecomment-650527774


   @gszadovszky before this gets merged, I just wanted to clarify something 
myself after looking more into the format spec, that might tidy this issue up 
further.
   
   * Is MAP_KEY_VALUE required to still be written as the Converted Type when 
creating new files?
   
   From what I could see from some older issues and the 
[backwards-compatibility 
rules](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1), 
it seems to have always been an optional type, and it has also been used 
incorrectly in the past.
   
   It appears that older versions of Parquet would be able to read the Map type 
in the schema without MAP_KEY_VALUE.
   
   If that is true, I would probably suggest pushing this [additional 
commit](https://github.com/maccamlc/parquet-mr/commit/3f774d123997a4c63631185ca409550ca03b960d)
 that I tested, onto this PR.
   
   It would mean that any unexpected uses of LogicalType.MAP_KEY_VALUE would 
result in UNKNOWN being written to the file. But it is removed from the 
ConversionPatterns path, meaning that my case of this occurring when converting 
an Avro schema is still fixed, and tested.
   
   Let me know if you believe this might be the preferred fix, or if what I 
have already done is better.
   
   Thanks
   Matt
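
The Arrow-side failure quoted in this issue ("Logical type Null can not be 
applied to group node") comes from Arrow accepting only the new MAP/LIST 
logical types on group nodes. Below is a toy Python model of that check; the 
names are invented for illustration, and this is not Arrow's C++ code:

```python
# Toy model of the group-node validation in Arrow's cpp/src/parquet/types.cc:
# a group node may carry a MAP or LIST logical type, or none at all.
# All names here are hypothetical.
ALLOWED_GROUP_LOGICAL_TYPES = {"MAP", "LIST", None}

def check_group_node(logical_type):
    """Raise if a group node carries a logical type Arrow does not accept."""
    if logical_type not in ALLOWED_GROUP_LOGICAL_TYPES:
        raise ValueError(
            "Logical type %s can not be applied to group node" % logical_type)

check_group_node("MAP")   # accepted
try:
    # A legacy MAP_KEY_VALUE annotation surfaces to readers as Null/UNKNOWN.
    check_group_node("Null")
except ValueError as e:
    print(e)  # Logical type Null can not be applied to group node
```

This is why dropping MAP_KEY_VALUE from the written logical types, as proposed 
in the comment above, avoids the error for readers that only recognize the new 
annotations.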




