[jira] [Commented] (PARQUET-1879) Apache Arrow can not read a Parquet File written with Parquet-Avro 1.11.0 with a Map field
[ https://issues.apache.org/jira/browse/PARQUET-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598759#comment-17598759 ] Daniel Dai commented on PARQUET-1879: - This seems to be a backward-incompatible change: with the new version we cannot read Parquet files created before 1.11.1. Here is a sample error message:
{code:java}
org.apache.parquet.io.InvalidRecordException: key_value not found in optional group canonicals (MAP) {
  repeated group map (MAP_KEY_VALUE) {
    required binary key (ENUM);
    optional group value {
      optional int32 index;
      optional int64 pinId;
      optional group indexableTextIndexes (LIST) {
        repeated int32 indexableTextIndexes_tuple;
      }
      optional int32 indexExpLq;
      optional int32 indexExp;
      optional boolean imageOnly;
      optional boolean link404;
      optional boolean unsafe;
      optional boolean imageNotOnPage;
      optional boolean linkStatusError;
    }
  }
}
	at org.apache.parquet.schema.GroupType.getFieldIndex(GroupType.java:176)
	at org.apache.parquet.schema.GroupType.getType(GroupType.java:208)
	at org.apache.parquet.schema.GroupType.checkGroupContains(GroupType.java:348)
	at org.apache.parquet.schema.GroupType.checkContains(GroupType.java:339)
	at org.apache.parquet.schema.GroupType.checkGroupContains(GroupType.java:349)
	at org.apache.parquet.schema.MessageType.checkContains(MessageType.java:124)
	at org.apache.parquet.hadoop.api.ReadSupport.getSchemaForRead(ReadSupport.java:56)
	at org.apache.parquet.hadoop.thrift.ThriftReadSupport.init(ThriftReadSupport.java:187)
	at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:200)
	at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:182)
	at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:216)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:213)
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:168)
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:71)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
{code}
I am not sure what the best way to fix this is. I am thinking about adding a walker in the constructor of FileMetaData to fix up the schema; is that a good idea?

> Apache Arrow can not read a Parquet File written with Parquet-Avro 1.11.0 with a Map field
> ------------------------------------------------------------------------------------------
>
> Key: PARQUET-1879
> URL: https://issues.apache.org/jira/browse/PARQUET-1879
> Project: Parquet
> Issue Type: Bug
> Components: parquet-avro, parquet-format
> Affects Versions: 1.11.0
> Reporter: Matthew McMahon
> Assignee: Matthew McMahon
> Priority: Critical
> Fix For: 1.12.0, 1.11.1
>
> From my [StackOverflow|https://stackoverflow.com/questions/62504757/issue-with-loading-parquet-data-into-snowflake-cloud-database-when-written-with] question, in relation to an issue I'm having with getting Snowflake (Cloud DB) to load Parquet files written with version 1.11.0.
>
> The problem only appears when using a map schema field in the Avro schema. For example:
> {code:java}
> {
>   "name": "FeatureAmounts",
>   "type": {
>     "type": "map",
>     "values": "records.MoneyDecimal"
>   }
> }
> {code}
> When using Parquet-Avro to write the file, a bad Parquet schema ends up with, for example:
> {code:java}
> message record.ResponseRecord {
>   required binary GroupId (STRING);
>   required int64 EntryTime (TIMESTAMP(MILLIS,true));
>   required int64 HandlingDuration;
>   required binary Id (STRING);
>   optional binary ResponseId (STRING);
>   required binary RequestId (STRING);
>   optional fixed_len_byte_array(12) CostInUSD (DECIMAL(28,15));
>   required group FeatureAmounts (MAP) {
>     repeated group map (MAP_KEY_VALUE) {
>       required binary key (STRING);
>       required fixed_len_byte_array(12) value (DECIMAL(28,15));
>     }
>   }
> }
> {code}
> From the great answer to my StackOverflow question, it seems the issue is that the 1.11.0 Parquet-Avro is still using the legacy MAP_KEY_VALUE converted type, which has no logical type equivalent. From the comment on [LogicalTypeAnnotation|https://github.com/apache/parquet-mr/blob/84c954d8a4feef2d9bdad7a236a7268ef71a1c25/parquet-column/src/main/java/org/apache/parquet/schema/LogicalTypeAnnotation.java#L904]:
> {code:java}
> // This logical type annotation is implemented to support backward compatibility with ConvertedType.
> // The new logical type representation in parquet-format doesn't have any key-value type,
> // thus this annotation is mapped to UNKNOWN. This type shouldn't be used.
> {code}
> However, it seems this is being written with the latest 1.11.0, which then causes Apache Arrow to fail with
> {code:java}
> Logical type Null can not be applied to group node
> {code}
> As it appears that [Arrow|https://github.com/apache/arrow/blob/master/cpp/src/parquet/types.cc#L629-L632] only looks for the new logical type of Map or List, this causes an error.
>
> I have seen in the Parquet format's [LogicalTypes|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md] that a map should be something like:
> {code:java}
> // Map
> required group my_map (MAP) {
>   repeated group key_value {
>     required binary key (UTF8);
>     optional int32 value;
>   }
> }
> {code}
> Is this on the correct path?
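One possible shape for the proposed "walker" over the file schema, sketched here with simplified stand-in node classes rather than the real org.apache.parquet.schema types (the Node and LegacyMapRewriter names and the rename rule are illustrative assumptions, not parquet-mr API):

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for a parquet group schema node; NOT the real
// org.apache.parquet.schema classes, just enough to show the rewrite.
class Node {
    String name;
    String annotation; // e.g. "MAP", "MAP_KEY_VALUE", or null
    List<Node> children = new ArrayList<>();

    Node(String name, String annotation) {
        this.name = name;
        this.annotation = annotation;
    }
}

public class LegacyMapRewriter {
    // Recursively rename the legacy "map (MAP_KEY_VALUE)" repeated group
    // under a MAP-annotated group to the modern "key_value" layout that
    // the reader in the stack trace above expects.
    static void rewrite(Node node) {
        if ("MAP".equals(node.annotation)) {
            for (Node child : node.children) {
                if ("MAP_KEY_VALUE".equals(child.annotation)) {
                    child.name = "key_value";
                    child.annotation = null; // drop the legacy annotation
                }
            }
        }
        for (Node child : node.children) {
            rewrite(child);
        }
    }

    public static void main(String[] args) {
        Node root = new Node("canonicals", "MAP");
        Node kv = new Node("map", "MAP_KEY_VALUE");
        root.children.add(kv);
        rewrite(root);
        System.out.println(kv.name); // prints "key_value"
    }
}
```

Whether such a rewrite belongs in the FileMetaData constructor or in the metadata converter is exactly the open question in this comment.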
[jira] [Commented] (PARQUET-1879) Apache Arrow can not read a Parquet File written with Parquet-Avro 1.11.0 with a Map field
[ https://issues.apache.org/jira/browse/PARQUET-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151837#comment-17151837 ] ASF GitHub Bot commented on PARQUET-1879: - gszadovszky merged pull request #798: URL: https://github.com/apache/parquet-mr/pull/798

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (PARQUET-1879) Apache Arrow can not read a Parquet File written with Parquet-Avro 1.11.0 with a Map field
[ https://issues.apache.org/jira/browse/PARQUET-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147611#comment-17147611 ] ASF GitHub Bot commented on PARQUET-1879: - maccamlc commented on pull request #798: URL: https://github.com/apache/parquet-mr/pull/798#issuecomment-651009716

> @maccamlc,
>
> The main problem, I think, is that the spec does not say anything about how the thrift objects shall be used. The specification is about the semantics of the schema and it is described using the parquet schema _language_. But in the file there is no such _language_; we only have [thrift objects](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift).
> When the specification says something about the _logical types_ (e.g. `MAP`) it does not say which thrift structure should be used (the converted type [`MAP`](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L53) or the logical type [`MAP`](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L324)).
> We added the new logical type structures in the thrift to support enhanced ways to specify _logical types_ (e.g. [`TimeStampType`](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L272)). The idea for backward compatibility was to write the old converted types wherever it makes sense (where the semantics of the actual _logical type_ are the same as before) along with the new logical type structures. So, for `MAP_KEY_VALUE`, I think we should write it in the correct place if it was written before (prior to `1.11.0`), since it helps other readers, but we should not expect it to be there.
>
> Cheers,
> Gabor

Sounds good, @gszadovszky. Thanks for the clarification. Therefore, depending on any other comments from other reviewers, it seems this PR is still ready to merge as-is :)
[jira] [Commented] (PARQUET-1879) Apache Arrow can not read a Parquet File written with Parquet-Avro 1.11.0 with a Map field
[ https://issues.apache.org/jira/browse/PARQUET-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17147599#comment-17147599 ] ASF GitHub Bot commented on PARQUET-1879: - gszadovszky commented on pull request #798: URL: https://github.com/apache/parquet-mr/pull/798#issuecomment-650992678

@maccamlc,

The main problem, I think, is that the spec does not say anything about how the thrift objects shall be used. The specification is about the semantics of the schema and it is described using the parquet schema _language_. But in the file there is no such _language_; we only have [thrift objects](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift). When the specification says something about the _logical types_ (e.g. `MAP`) it does not say which thrift structure should be used (the converted type [`MAP`](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L53) or the logical type [`MAP`](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L324)). We added the new logical type structures in the thrift to support enhanced ways to specify _logical types_ (e.g. [`TimeStampType`](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L272)). The idea for backward compatibility was to write the old converted types wherever it makes sense (where the semantics of the actual _logical type_ are the same as before) along with the new logical type structures. So, for `MAP_KEY_VALUE`, I think we should write it in the correct place if it was written before (prior to `1.11.0`), since it helps other readers, but we should not expect it to be there.

Cheers,
Gabor
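Gabor's rule, that each legacy ConvertedType is written alongside its new LogicalType where the semantics match, but MAP_KEY_VALUE gets no logical-type counterpart at all, can be sketched as a small table lookup. This is a toy model with stand-in enums (the enum names mirror parquet-format, but these are not the generated thrift classes):

```java
import java.util.Optional;

// Toy model of the writer-side backward-compatibility rule described in
// the comment above: emit the legacy ConvertedType together with the new
// LogicalType where the semantics are the same, and emit no LogicalType
// for MAP_KEY_VALUE, which has no logical-type equivalent.
public class AnnotationPair {
    enum ConvertedType { MAP, MAP_KEY_VALUE, LIST }
    enum LogicalType { MAP, LIST }

    // For a group that carries a legacy converted type, pick the logical
    // type (if any) that should be written next to it.
    static Optional<LogicalType> logicalTypeFor(ConvertedType ct) {
        switch (ct) {
            case MAP:  return Optional.of(LogicalType.MAP);
            case LIST: return Optional.of(LogicalType.LIST);
            case MAP_KEY_VALUE:
            default:   return Optional.empty(); // no logical equivalent
        }
    }

    public static void main(String[] args) {
        System.out.println(logicalTypeFor(ConvertedType.MAP));           // Optional[MAP]
        System.out.println(logicalTypeFor(ConvertedType.MAP_KEY_VALUE)); // Optional.empty
    }
}
```

The empty case is the crux of this issue: 1.11.0 mapped MAP_KEY_VALUE to an UNKNOWN logical type instead of omitting it, which is what Arrow rejects.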
[jira] [Commented] (PARQUET-1879) Apache Arrow can not read a Parquet File written with Parquet-Avro 1.11.0 with a Map field
[ https://issues.apache.org/jira/browse/PARQUET-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17146873#comment-17146873 ] ASF GitHub Bot commented on PARQUET-1879: - maccamlc edited a comment on pull request #798: URL: https://github.com/apache/parquet-mr/pull/798#issuecomment-650527774

@gszadovszky, before this gets merged I just wanted to clarify something myself, after looking more into the format spec, that might tidy this issue up further:

* Is MAP_KEY_VALUE still required to be written as the converted type when creating new files?

From what I could see in some older issues, such as [PARQUET-335](https://issues.apache.org/jira/browse/PARQUET-335), and the [backwards-compatibility rules](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1), it seems to have always been an optional type, and one that was also used incorrectly in the past. It appears that older versions of Parquet can read the Map type in the schema without MAP_KEY_VALUE.

If that is true, I would suggest pushing this [additional commit](https://github.com/maccamlc/parquet-mr/commit/9ca7652b5d0f9946791089e60193dd10a4a97604), which I tested, onto this PR. It would mean that any unexpected use of LogicalType.MAP_KEY_VALUE results in UNKNOWN being written to the file. But it is removed from the ConversionPatterns path, meaning that my case of this occurring when converting an Avro schema is still fixed, and tested.

Let me know if you believe this might be the preferred fix, or if what I have already done is better. From what I can see, it all depends on whether the MAP_KEY_VALUE type is required as an OriginalType or is OK being null for older readers.

Thanks,
Matt
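The backward-compatibility rules referenced above make MAP_KEY_VALUE optional on the reader side. A toy reader-side check of that rule (stand-in method, not parquet-mr API) looks like this:

```java
// Toy reader-side check modelled on the LogicalTypes.md backward-compat
// rules discussed in this comment: inside a MAP-annotated group, the
// single repeated child group is the key/value level regardless of its
// name ("map" in older files, "key_value" in the spec layout) and
// regardless of whether it carries the legacy MAP_KEY_VALUE annotation.
public class MapCompat {
    static boolean isKeyValueLevel(String parentAnnotation,
                                   boolean childRepeated,
                                   String childName) {
        // childName is deliberately ignored: readers must not require
        // "key_value" or a MAP_KEY_VALUE annotation to be present.
        return "MAP".equals(parentAnnotation) && childRepeated;
    }

    public static void main(String[] args) {
        System.out.println(isKeyValueLevel("MAP", true, "map"));        // true: legacy writer
        System.out.println(isKeyValueLevel("MAP", true, "key_value"));  // true: spec layout
        System.out.println(isKeyValueLevel(null, true, "key_value"));   // false: not a map
    }
}
```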
[jira] [Commented] (PARQUET-1879) Apache Arrow can not read a Parquet File written with Parquet-Avro 1.11.0 with a Map field
[ https://issues.apache.org/jira/browse/PARQUET-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17144855#comment-17144855 ] ASF GitHub Bot commented on PARQUET-1879: - maccamlc commented on pull request #798: URL: https://github.com/apache/parquet-mr/pull/798#issuecomment-649482253

> Thank you for creating the backward compatibility test for Map. It should have existed already.
> Unfortunately, this way you do not properly test backward compatibility. The problem is that you cannot generate an "old" file with the "new" library. To be more precise, the message parser is more for convenience and is not used while reading/writing a Parquet file. When you say you are testing the converted type, that is not really true, because the parser tries to read logical types in the first place. Also, the Parquet writer writes both logical types and converted types, so you cannot validate old files that have only converted types.
> I would suggest adding tests that cover the examples in the [spec](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1) by creating the thrift-generated format objects and converting them with `ParquetMetadataConverter`, just like you did in `TestParquetMetadataConverter.testMapLogicalType`. Maybe these tests would fit better in that class as well.
> I should have described this before you implemented this test. I am sorry about that.
>
> Please don't force-push your changes, because it makes it harder to track the review. The committer will squash the PR before merging it anyway.

Apologies for the force push. Good to know that it is squashed on commit. And thanks for the detailed reply.

I think I got it this time :) The tests were moved into TestParquetMetadataConverter, and for the old-format test the metadata is built through Thrift SchemaElements.

Regards, Matt

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
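The review suggestion above, building old-style metadata that carries only converted types and checking what logical types conversion derives from them, can be modelled in isolation. The sketch below is not the parquet-mr or parquet-format API; the enums and the `fromConvertedType` mapping are hypothetical stand-ins encoding the backward-compatibility rules from LogicalTypes.md (MAP_KEY_VALUE read as MAP where MAP is expected, ignored on the inner repeated group).

```java
// Hedged model of converted-type -> logical-type derivation for maps.
// These types are illustrative stand-ins, not the real parquet APIs.
class ConvertedTypeDemo {
    enum ConvertedType { MAP, MAP_KEY_VALUE, LIST, UTF8 }
    enum LogicalType { MAP, LIST, STRING }

    // Per the backward-compatibility rules: MAP_KEY_VALUE found where MAP is
    // expected (the outer group) is read as MAP; on the inner repeated group
    // it yields no logical type at all (null here means "nothing derived").
    static LogicalType fromConvertedType(ConvertedType ct, boolean outerGroup) {
        switch (ct) {
            case MAP:           return LogicalType.MAP;
            case MAP_KEY_VALUE: return outerGroup ? LogicalType.MAP : null;
            case LIST:          return LogicalType.LIST;
            case UTF8:          return LogicalType.STRING;
            default:            return null;
        }
    }

    public static void main(String[] args) {
        // An "old" file annotates the outer group MAP and the inner repeated
        // group MAP_KEY_VALUE; only the outer one surfaces as a logical type.
        if (fromConvertedType(ConvertedType.MAP, true) != LogicalType.MAP)
            throw new AssertionError();
        if (fromConvertedType(ConvertedType.MAP_KEY_VALUE, true) != LogicalType.MAP)
            throw new AssertionError();
        if (fromConvertedType(ConvertedType.MAP_KEY_VALUE, false) != null)
            throw new AssertionError();
        System.out.println("ok");
    }
}
```

A real test along these lines would build thrift `SchemaElement`s with only the converted-type field set and run them through `ParquetMetadataConverter`, as suggested in the review.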
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Apache Arrow cannot read a Parquet file written with Parquet-Avro 1.11.0 with a Map field
> ----------------------------------------------------------------------------------------
>
> Key: PARQUET-1879
> URL: https://issues.apache.org/jira/browse/PARQUET-1879
> Project: Parquet
> Issue Type: Bug
> Components: parquet-avro, parquet-format
> Affects Versions: 1.11.0
> Reporter: Matthew McMahon
> Priority: Critical
>
> From my [StackOverflow|https://stackoverflow.com/questions/62504757/issue-with-loading-parquet-data-into-snowflake-cloud-database-when-written-with] in relation to an issue I'm having with getting Snowflake (Cloud DB) to load Parquet files written with version 1.11.0.
>
> The problem only appears when using a map schema field in the Avro schema. For example:
> {code:java}
> {
>   "name": "FeatureAmounts",
>   "type": {
>     "type": "map",
>     "values": "records.MoneyDecimal"
>   }
> }
> {code}
> When using Parquet-Avro to write the file, a bad Parquet schema ends up with, for example:
> {code:java}
> message record.ResponseRecord {
>   required binary GroupId (STRING);
>   required int64 EntryTime (TIMESTAMP(MILLIS,true));
>   required int64 HandlingDuration;
>   required binary Id (STRING);
>   optional binary ResponseId (STRING);
>   required binary RequestId (STRING);
>   optional fixed_len_byte_array(12) CostInUSD (DECIMAL(28,15));
>   required group FeatureAmounts (MAP) {
>     repeated group map (MAP_KEY_VALUE) {
>       required binary key (STRING);
>       required fixed_len_byte_array(12) value (DECIMAL(28,15));
>     }
>   }
> }
> {code}
> From the great answer to my StackOverflow, it seems the issue is that 1.11.0 Parquet-Avro is still using the legacy MAP_KEY_VALUE converted type, which has no logical type equivalent. From the comment on [LogicalTypeAnnotation|https://github.com/apache/parquet-mr/blob/84c954d8a4feef2d9bdad7a236a7268ef71a1c25/parquet-column/src/main/java/org/apache/parquet/schema/LogicalTypeAnnotation.java#L904]:
> {code:java}
> // This logical type annotation is implemented to support backward compatibility with ConvertedType.
> // The new logical type representation in parquet-format doesn't have any key-value type,
> // thus this annotation is mapped to UNKNOWN. This type shouldn't be used.
> {code}
> However, it seems this is being written with the latest 1.11.0, which then causes Apache Arrow to fail with
> {code:java}
> Logical type Null can not be applied to group node
> {code}
> As it appears that [Arrow|https://github.com/apache/arrow/blob/master/cpp/src/parquet/types.cc#L629-L632] only looks for the new logical type of Map or List, this causes an error. I have seen in Parquet Formats that [LogicalTypes|https://github.com/apache/parquet-format/blob/master/LogicalTypes.md] should be something like:
> {code:java}
> // Map
> required group my_map (MAP) {
>   repeated group key_value {
>     required binary key (UTF8);
>     optional int32 value;
>   }
> }
> {code}
> Is this on the correct path?
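The compatibility rule at the heart of this issue can be sketched on its own: per LogicalTypes.md, a reader should accept the single repeated child of a MAP-annotated group as the key/value pair whatever that child is named ("map", "key_value", or anything else) and whether or not it carries the legacy MAP_KEY_VALUE annotation. The `Group` class below is a hypothetical stand-in for a Parquet group node, not `org.apache.parquet.schema.GroupType`.

```java
import java.util.List;

// Hedged sketch of the map backward-compatibility rule from LogicalTypes.md;
// Group is an illustrative stand-in, not a parquet-mr type.
class MapCompatDemo {
    static final class Group {
        final String name;
        final String annotation;   // "MAP", "MAP_KEY_VALUE", or null
        final boolean repeated;
        final List<Group> fields;
        Group(String name, String annotation, boolean repeated, List<Group> fields) {
            this.name = name; this.annotation = annotation;
            this.repeated = repeated; this.fields = fields;
        }
    }

    // A group is readable as a map if it is annotated MAP (or, in very old
    // files, MAP_KEY_VALUE in place of MAP) and has exactly one repeated
    // child with two fields. The child's own name and annotation are ignored.
    static boolean isReadableAsMap(Group g) {
        boolean mapAnnotated = "MAP".equals(g.annotation)
            || "MAP_KEY_VALUE".equals(g.annotation);
        if (!mapAnnotated || g.fields.size() != 1) return false;
        Group kv = g.fields.get(0);
        return kv.repeated && kv.fields.size() == 2;
    }

    public static void main(String[] args) {
        Group key = new Group("key", null, false, List.of());
        Group value = new Group("value", null, false, List.of());
        // Old writers (e.g. parquet-avro 1.11.0): repeated group "map" (MAP_KEY_VALUE).
        Group oldStyle = new Group("FeatureAmounts", "MAP", false,
            List.of(new Group("map", "MAP_KEY_VALUE", true, List.of(key, value))));
        // Spec form: repeated group "key_value" with no annotation.
        Group newStyle = new Group("FeatureAmounts", "MAP", false,
            List.of(new Group("key_value", null, true, List.of(key, value))));
        if (!isReadableAsMap(oldStyle)) throw new AssertionError();
        if (!isReadableAsMap(newStyle)) throw new AssertionError();
        System.out.println("ok");
    }
}
```

This is also what the thread observes empirically: the repeated group's name is not verified anywhere on the read path.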
[ https://issues.apache.org/jira/browse/PARQUET-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17144630#comment-17144630 ] ASF GitHub Bot commented on PARQUET-1879: - maccamlc commented on pull request #798: URL: https://github.com/apache/parquet-mr/pull/798#issuecomment-649202469

> Thanks for working on this.
>
> You have changed every naming from `"map"` to `"key_value"` in the tests. This is good for the expected data, but we should keep testing `"map"` at the read path as well. Based on the [spec](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1) it is still acceptable.
>
> I am not an expert in this topic, so I would be happy if someone else could also review this.

@gszadovszky no problem. I have tried to add a test to verify the backwards-compatible reading. Added TestReadWriteMapKeyValue to the commit. Not sure if this is the correct way, but it parses one schema to take the logical-type path, with key_value and no MAP_KEY_VALUE type, and then another with map and with the MAP_KEY_VALUE type. From what I can tell the name is not actually verified anywhere (I tried with a random name value too :) ), but both test paths are successful. Hopefully it's ok, but let me know if I might need to go a bit deeper somewhere else.
[ https://issues.apache.org/jira/browse/PARQUET-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17144631#comment-17144631 ] ASF GitHub Bot commented on PARQUET-1879: - maccamlc commented on a change in pull request #798: URL: https://github.com/apache/parquet-mr/pull/798#discussion_r445294784

## File path: parquet-hadoop/src/test/java/org/apache/parquet/format/converter/TestParquetMetadataConverter.java

## @@ -18,57 +18,9 @@
  */
 package org.apache.parquet.format.converter;
-import static java.util.Collections.emptyList;
-import static org.apache.parquet.format.converter.ParquetMetadataConverter.filterFileMetaDataByStart;
-import static org.apache.parquet.schema.LogicalTypeAnnotation.TimeUnit.MICROS;
-import static org.apache.parquet.schema.LogicalTypeAnnotation.TimeUnit.MILLIS;
-import static org.apache.parquet.schema.LogicalTypeAnnotation.TimeUnit.NANOS;
-import static org.apache.parquet.schema.LogicalTypeAnnotation.bsonType;
-import static org.apache.parquet.schema.LogicalTypeAnnotation.dateType;
-import static org.apache.parquet.schema.LogicalTypeAnnotation.decimalType;
-import static org.apache.parquet.schema.LogicalTypeAnnotation.enumType;
-import static org.apache.parquet.schema.LogicalTypeAnnotation.intType;
-import static org.apache.parquet.schema.LogicalTypeAnnotation.jsonType;
-import static org.apache.parquet.schema.LogicalTypeAnnotation.listType;
-import static org.apache.parquet.schema.LogicalTypeAnnotation.mapType;
-import static org.apache.parquet.schema.LogicalTypeAnnotation.stringType;
-import static org.apache.parquet.schema.LogicalTypeAnnotation.timeType;
-import static org.apache.parquet.schema.LogicalTypeAnnotation.timestampType;
-import static org.apache.parquet.schema.LogicalTypeAnnotation.uuidType;
-import static org.apache.parquet.schema.MessageTypeParser.parseMessageType;
-import static org.junit.Assert.assertEquals;
-import static org.junit.Assert.assertFalse;
-import static org.junit.Assert.assertNull;
-import static org.junit.Assert.assertSame;
-import static org.junit.Assert.assertTrue;
-import static org.junit.Assert.fail;
-import static org.apache.parquet.format.CompressionCodec.UNCOMPRESSED;
-import static org.apache.parquet.format.Type.INT32;
-import static org.apache.parquet.format.Util.readPageHeader;
-import static org.apache.parquet.format.Util.writePageHeader;
-import static org.apache.parquet.format.converter.ParquetMetadataConverter.filterFileMetaDataByMidpoint;
-import static org.apache.parquet.format.converter.ParquetMetadataConverter.getOffset;
-
-import java.io.ByteArrayInputStream;
-import java.io.ByteArrayOutputStream;
-import java.io.IOException;
-import java.math.BigInteger;
-import java.nio.ByteBuffer;
-import java.nio.charset.Charset;
-import java.security.SecureRandom;
-import java.util.ArrayList;
-import java.util.Arrays;
-import java.util.Collections;
-import java.util.HashMap;
-import java.util.HashSet;
-import java.util.List;
-import java.util.Random;
-import java.util.Set;
-import java.util.TreeSet;
-
+import com.google.common.collect.Lists;

Review comment: Didn't notice. Sorry, should be reverted now.
[ https://issues.apache.org/jira/browse/PARQUET-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17144629#comment-17144629 ] ASF GitHub Bot commented on PARQUET-1879: - maccamlc commented on a change in pull request #798: URL: https://github.com/apache/parquet-mr/pull/798#discussion_r445293759

## File path: parquet-column/src/main/java/org/apache/parquet/schema/Types.java

## @@ -1179,18 +1181,18 @@ protected Type build(String name) {
       keyType = STRING_KEY;
     }
-    GroupBuilder builder = buildGroup(repetition).as(OriginalType.MAP);
+    GroupBuilder builder = buildGroup(repetition).as(mapType());
     if (id != null) {
       builder.id(id.intValue());
     }
     if (valueType != null) {
       return builder
-        .repeatedGroup().addFields(keyType, valueType).named("map")
+        .repeatedGroup().addFields(keyType, valueType).named("key_value")

Review comment: :+1: updated

-- This message was sent by Atlassian Jira (v8.3.4#803005)
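For reference, with the repeated group renamed from `map` to `key_value` as in the diff above, and the legacy MAP_KEY_VALUE annotation no longer emitted on it, the FeatureAmounts field from the issue description would presumably be written as:

{code:java}
required group FeatureAmounts (MAP) {
  repeated group key_value {
    required binary key (STRING);
    required fixed_len_byte_array(12) value (DECIMAL(28,15));
  }
}
{code}

This matches the form Arrow expects, since Arrow derives the map structure from the MAP logical type on the outer group rather than from the inner group's annotation.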
[ https://issues.apache.org/jira/browse/PARQUET-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17143815#comment-17143815 ] ASF GitHub Bot commented on PARQUET-1879: - gszadovszky commented on a change in pull request #798: URL: https://github.com/apache/parquet-mr/pull/798#discussion_r444841963

## File path: parquet-column/src/main/java/org/apache/parquet/schema/Types.java

## @@ -1179,18 +1181,18 @@ protected Type build(String name) {
       keyType = STRING_KEY;
     }
-    GroupBuilder builder = buildGroup(repetition).as(OriginalType.MAP);
+    GroupBuilder builder = buildGroup(repetition).as(mapType());
     if (id != null) {
       builder.id(id.intValue());
     }
     if (valueType != null) {
       return builder
-        .repeatedGroup().addFields(keyType, valueType).named("map")
+        .repeatedGroup().addFields(keyType, valueType).named("key_value")

Review comment: I would suggest using `ConversionPatterns.MAP_REPEATED_NAME` here as well.

## File path: parquet-hadoop/src/test/java/org/apache/parquet/format/converter/TestParquetMetadataConverter.java

## @@ -18,57 +18,9 @@
  */
 package org.apache.parquet.format.converter;
-import static java.util.Collections.emptyList;
-import static org.apache.parquet.format.converter.ParquetMetadataConverter.filterFileMetaDataByStart;
-import static org.apache.parquet.schema.LogicalTypeAnnotation.TimeUnit.MICROS;
-import static org.apache.parquet.schema.LogicalTypeAnnotation.TimeUnit.MILLIS;
-import static org.apache.parquet.schema.LogicalTypeAnnotation.TimeUnit.NANOS;
-import static org.apache.parquet.schema.LogicalTypeAnnotation.bsonType;
-import static org.apache.parquet.schema.LogicalTypeAnnotation.dateType;
-import static org.apache.parquet.schema.LogicalTypeAnnotation.decimalType;
-import static org.apache.parquet.schema.LogicalTypeAnnotation.enumType;
-import static org.apache.parquet.schema.LogicalTypeAnnotation.intType;
-import static org.apache.parquet.schema.LogicalTypeAnnotation.jsonType;
-import static org.apache.parquet.schema.LogicalTypeAnnotation.listType;
-import static org.apache.parquet.schema.LogicalTypeAnnotation.mapType;
-import static org.apache.parquet.schema.LogicalTypeAnnotation.stringType;
-import static org.apache.parquet.schema.LogicalTypeAnnotation.timeType;
-import static org.apache.parquet.schema.LogicalTypeAnnotation.timestampType;
-import static org.apache.parquet.schema.LogicalTypeAnnotation.uuidType;
-import static org.apache.parquet.schema.MessageTypeParser.parseMessageType;
-import static org.junit.Assert.assertEquals;
-import static org.junit.Assert.assertFalse;
-import static org.junit.Assert.assertNull;
-import static org.junit.Assert.assertSame;
-import static org.junit.Assert.assertTrue;
-import static org.junit.Assert.fail;
-import static org.apache.parquet.format.CompressionCodec.UNCOMPRESSED;
-import static org.apache.parquet.format.Type.INT32;
-import static org.apache.parquet.format.Util.readPageHeader;
-import static org.apache.parquet.format.Util.writePageHeader;
-import static org.apache.parquet.format.converter.ParquetMetadataConverter.filterFileMetaDataByMidpoint;
-import static org.apache.parquet.format.converter.ParquetMetadataConverter.getOffset;
-
-import java.io.ByteArrayInputStream;
-import java.io.ByteArrayOutputStream;
-import java.io.IOException;
-import java.math.BigInteger;
-import java.nio.ByteBuffer;
-import java.nio.charset.Charset;
-import java.security.SecureRandom;
-import java.util.ArrayList;
-import java.util.Arrays;
-import java.util.Collections;
-import java.util.HashMap;
-import java.util.HashSet;
-import java.util.List;
-import java.util.Random;
-import java.util.Set;
-import java.util.TreeSet;
-
+import com.google.common.collect.Lists;

Review comment: Please, do not organize imports. It makes merge conflicts hard to resolve.
[ https://issues.apache.org/jira/browse/PARQUET-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142869#comment-17142869 ] ASF GitHub Bot commented on PARQUET-1879: - maccamlc opened a new pull request #798: URL: https://github.com/apache/parquet-mr/pull/798

* Writing the UNKNOWN logical type into the schema causes a breakage when parsing the file with Apache Arrow
* Instead, use the default of falling back to null when that backwards-compatibility-only logical type is present, but still write the original type

Make sure you have checked _all_ steps below.

### Jira
- [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
  - https://issues.apache.org/jira/browse/PARQUET-XXX
  - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).

### Tests
- [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:

### Commits
- [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
  1. Subject is separated from body by a blank line
  1. Subject is limited to 50 characters (not including Jira issue reference)
  1. Subject does not end with a period
  1. Subject uses the imperative mood ("add", not "adding")
  1. Body wraps at 72 characters
  1. Body explains "what" and "why", not "how"

### Documentation
- [ ] In case of new functionality, my PR adds documentation that describes how to use it.
  - All the public functions and the classes in the PR contain Javadoc that explain what it does
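The fix described in the PR body can be modelled as a contract on what gets serialized for the inner repeated group of a map: the legacy annotation keeps being written as a converted type for old readers, but contributes no logical type, rather than UNKNOWN, which Arrow rejects with "Logical type Null can not be applied to group node". The classes below are illustrative stand-ins, not the parquet-mr API.

```java
// Hedged model of the serialization contract from the PR description;
// SerializedField and serializeMapKeyValue are hypothetical names.
class FooterWriteDemo {
    // What the file footer carries for the map's inner repeated group.
    static final class SerializedField {
        final String convertedType;  // legacy field, still written for old readers
        final String logicalType;    // new field; null means "not written"
        SerializedField(String convertedType, String logicalType) {
            this.convertedType = convertedType;
            this.logicalType = logicalType;
        }
    }

    // buggy = true models the pre-fix behaviour that wrote UNKNOWN.
    static SerializedField serializeMapKeyValue(boolean buggy) {
        return new SerializedField("MAP_KEY_VALUE", buggy ? "UNKNOWN" : null);
    }

    public static void main(String[] args) {
        SerializedField fixed = serializeMapKeyValue(false);
        if (fixed.logicalType != null)
            throw new AssertionError("logical type must be omitted");
        if (!"MAP_KEY_VALUE".equals(fixed.convertedType))
            throw new AssertionError("converted type must still be written");
        System.out.println("ok");
    }
}
```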