[jira] [Comment Edited] (PARQUET-1879) Apache Arrow can not read a Parquet File written with Parqet-Avro 1.11.0 with a Map field
[ https://issues.apache.org/jira/browse/PARQUET-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598759#comment-17598759 ]

Daniel Dai edited comment on PARQUET-1879 at 9/1/22 7:46 AM:
-------------------------------------------------------------

This seems to be a backward-incompatible change. We cannot read Parquet files created pre-1.11.1 using the new version. Here is a sample error message:

{code:java}
org.apache.parquet.io.InvalidRecordException: key_value not found in optional group canonicals (MAP) {
  repeated group map (MAP_KEY_VALUE) {
    required binary key (ENUM);
    optional group value {
      optional int32 index;
      optional int64 pinId;
      optional group indexableTextIndexes (LIST) {
        repeated int32 indexableTextIndexes_tuple;
      }
      optional int32 indexExpLq;
      optional int32 indexExp;
      optional boolean imageOnly;
      optional boolean link404;
      optional boolean unsafe;
      optional boolean imageNotOnPage;
      optional boolean linkStatusError;
    }
  }
}
	at org.apache.parquet.schema.GroupType.getFieldIndex(GroupType.java:176)
	at org.apache.parquet.schema.GroupType.getType(GroupType.java:208)
	at org.apache.parquet.schema.GroupType.checkGroupContains(GroupType.java:348)
	at org.apache.parquet.schema.GroupType.checkContains(GroupType.java:339)
	at org.apache.parquet.schema.GroupType.checkGroupContains(GroupType.java:349)
	at org.apache.parquet.schema.MessageType.checkContains(MessageType.java:124)
	at org.apache.parquet.hadoop.api.ReadSupport.getSchemaForRead(ReadSupport.java:56)
	at org.apache.parquet.hadoop.thrift.ThriftReadSupport.init(ThriftReadSupport.java:187)
	at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:200)
	at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:182)
	at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:216)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:213)
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:168)
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:71)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
{code}

I am not sure what the best way to fix it is. I am thinking about adding a walker in the constructor of FileMetaData to fix the schema; is that a good idea?

was (Author: daijy):
This seems to be a backward-incompatible change. We cannot read Parquet files created pre-1.11.1 using the new version.
[jira] [Commented] (PARQUET-1879) Apache Arrow can not read a Parquet File written with Parqet-Avro 1.11.0 with a Map field
[ https://issues.apache.org/jira/browse/PARQUET-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598759#comment-17598759 ]

Daniel Dai commented on PARQUET-1879:
-------------------------------------

This seems to be a backward-incompatible change. We cannot read Parquet files created pre-1.11.1 using the new version. Here is a sample error message:

{code:java}
org.apache.parquet.io.InvalidRecordException: key_value not found in optional group canonicals (MAP) {
  repeated group map (MAP_KEY_VALUE) {
    required binary key (ENUM);
    optional group value {
      optional int32 index;
      optional int64 pinId;
      optional group indexableTextIndexes (LIST) {
        repeated int32 indexableTextIndexes_tuple;
      }
      optional int32 indexExpLq;
      optional int32 indexExp;
      optional boolean imageOnly;
      optional boolean link404;
      optional boolean unsafe;
      optional boolean imageNotOnPage;
      optional boolean linkStatusError;
    }
  }
}
	at org.apache.parquet.schema.GroupType.getFieldIndex(GroupType.java:176)
	at org.apache.parquet.schema.GroupType.getType(GroupType.java:208)
	at org.apache.parquet.schema.GroupType.checkGroupContains(GroupType.java:348)
	at org.apache.parquet.schema.GroupType.checkContains(GroupType.java:339)
	at org.apache.parquet.schema.GroupType.checkGroupContains(GroupType.java:349)
	at org.apache.parquet.schema.MessageType.checkContains(MessageType.java:124)
	at org.apache.parquet.hadoop.api.ReadSupport.getSchemaForRead(ReadSupport.java:56)
	at org.apache.parquet.hadoop.thrift.ThriftReadSupport.init(ThriftReadSupport.java:187)
	at org.apache.parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:200)
	at org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:182)
	at org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.liftedTree1$1(NewHadoopRDD.scala:216)
	at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:213)
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:168)
	at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:71)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
{code}

I am not sure what the best way to fix it is. I am thinking about adding a walker in the constructor of FileMetaData to fix the schema; is that a good idea?

> Apache Arrow can not read a Parquet File written with Parqet-Avro 1.11.0 with a Map field
> -----------------------------------------------------------------------------------------
>
>                 Key: PARQUET-1879
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1879
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-avro, parquet-format
>    Affects Versions: 1.11.0
>            Reporter: Matthew McMahon
>            Assignee: Matthew McMahon
>            Priority: Critical
>             Fix For: 1.12.0, 1.11.1
>
> From my [StackOverflow|https://stackoverflow.com/questions/62504757/issue-with-loading-parquet-data-into-snowflake-cloud-database-when-written-with] in relation to an issue I'm having with getting Snowflake (Cloud DB) to load Parquet files written with version 1.11.0
>
> The problem only appears when using a map schema field in the Avro schema.
> For example:
> {code:java}
> {
>   "name": "FeatureAmounts",
>   "type": {
>     "type": "map",
>     "values": "records.MoneyDecimal"
>   }
> }
> {code}
> When using Parquet-Avro to write the file, a bad Parquet schema ends up with, for example
> {code:java}
> message record.ResponseRecord {
>   required binary GroupId (STRING);
>   required int64 EntryTime (TIMESTAMP(MILLIS,true));
>   required int64 HandlingDuration;
>   required binary Id (STRING);
>   optional binary ResponseId (STRING);
>   required binary RequestId (STRING);
>   optional fixed_len_byte_array(12) CostInUSD (DECIMAL(28,15));
>   required group FeatureAmounts (MAP) {
>     repeated group map (MAP_KEY_VALUE) {
>       required binary ke
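The "walker" idea mentioned in the comment above could work by renaming the legacy 3-level map group (historically written as `map` with the `MAP_KEY_VALUE` annotation) to the modern `key_value` name the new reader expects. Below is a minimal, hypothetical sketch of that normalization over a toy schema tree; `SchemaNode` and `normalize` are illustrative stand-ins, not the real `org.apache.parquet.schema` API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal model of a Parquet group schema node.
class SchemaNode {
    final String name;
    final String annotation; // e.g. "MAP", "MAP_KEY_VALUE", "ENUM", or null
    final List<SchemaNode> children = new ArrayList<>();

    SchemaNode(String name, String annotation) {
        this.name = name;
        this.annotation = annotation;
    }

    // Walk the tree; directly inside a MAP group, a child annotated
    // MAP_KEY_VALUE (historically named "map") is renamed to "key_value".
    static SchemaNode normalize(SchemaNode node, boolean parentIsMap) {
        String newName = node.name;
        if (parentIsMap && "MAP_KEY_VALUE".equals(node.annotation)) {
            newName = "key_value";
        }
        SchemaNode copy = new SchemaNode(newName, node.annotation);
        boolean isMap = "MAP".equals(node.annotation);
        for (SchemaNode child : node.children) {
            copy.children.add(normalize(child, isMap));
        }
        return copy;
    }
}
```

A real implementation would have to rebuild `GroupType`/`MessageType` instances (they are immutable) and preserve repetition and logical-type annotations; this sketch only shows the renaming rule itself.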
[jira] [Commented] (PARQUET-1963) DeprecatedParquetInputFormat in CombineFileInputFormat throw NPE when the first sub-split is empty
[ https://issues.apache.org/jira/browse/PARQUET-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17268907#comment-17268907 ]

Daniel Dai commented on PARQUET-1963:
-------------------------------------

Thanks [~gszadovszky]!

> DeprecatedParquetInputFormat in CombineFileInputFormat throw NPE when the first sub-split is empty
> --------------------------------------------------------------------------------------------------
>
>                 Key: PARQUET-1963
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1963
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>            Reporter: Daniel Dai
>            Assignee: Daniel Dai
>            Priority: Major
>
> A follow-up of PARQUET-1947: after that fix, when the first sub-split in CombineFileInputFormat is empty, there is an NPE:
> {code}
> Caused by: java.lang.NullPointerException
> 	at org.apache.parquet.hadoop.mapred.DeprecatedParquetInputFormat$RecordReaderWrapper.next(DeprecatedParquetInputFormat.java:154)
> 	at org.apache.parquet.hadoop.mapred.DeprecatedParquetInputFormat$RecordReaderWrapper.next(DeprecatedParquetInputFormat.java:73)
> 	at cascading.tap.hadoop.io.CombineFileRecordReaderWrapper.next(CombineFileRecordReaderWrapper.java:70)
> 	at org.apache.hadoop.mapred.lib.CombineFileRecordReader.next(CombineFileRecordReader.java:58)
> 	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:199)
> 	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:185)
> 	at cascading.tap.hadoop.util.MeasuredRecordReader.next(MeasuredRecordReader.java:61)
> 	at org.apache.parquet.cascading.ParquetTupleScheme.source(ParquetTupleScheme.java:160)
> 	at cascading.tuple.TupleEntrySchemeIterator.getNext(TupleEntrySchemeIterator.java:163)
> 	at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:136)
> 	... 10 more
> {code}
> The reason is that CombineFileInputFormat uses the result of createValue() on the first sub-split as the value container. Since the first sub-split is empty, the value container is null.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (PARQUET-1963) DeprecatedParquetInputFormat in CombineFileInputFormat throw NPE when the first sub-split is empty
Daniel Dai created PARQUET-1963:
-----------------------------------

             Summary: DeprecatedParquetInputFormat in CombineFileInputFormat throw NPE when the first sub-split is empty
                 Key: PARQUET-1963
                 URL: https://issues.apache.org/jira/browse/PARQUET-1963
             Project: Parquet
          Issue Type: Bug
          Components: parquet-mr
            Reporter: Daniel Dai
            Assignee: Daniel Dai

A follow-up of PARQUET-1947: after that fix, when the first sub-split in CombineFileInputFormat is empty, there is an NPE:

{code}
Caused by: java.lang.NullPointerException
	at org.apache.parquet.hadoop.mapred.DeprecatedParquetInputFormat$RecordReaderWrapper.next(DeprecatedParquetInputFormat.java:154)
	at org.apache.parquet.hadoop.mapred.DeprecatedParquetInputFormat$RecordReaderWrapper.next(DeprecatedParquetInputFormat.java:73)
	at cascading.tap.hadoop.io.CombineFileRecordReaderWrapper.next(CombineFileRecordReaderWrapper.java:70)
	at org.apache.hadoop.mapred.lib.CombineFileRecordReader.next(CombineFileRecordReader.java:58)
	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.moveToNext(MapTask.java:199)
	at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:185)
	at cascading.tap.hadoop.util.MeasuredRecordReader.next(MeasuredRecordReader.java:61)
	at org.apache.parquet.cascading.ParquetTupleScheme.source(ParquetTupleScheme.java:160)
	at cascading.tuple.TupleEntrySchemeIterator.getNext(TupleEntrySchemeIterator.java:163)
	at cascading.tuple.TupleEntrySchemeIterator.hasNext(TupleEntrySchemeIterator.java:136)
	... 10 more
{code}

The reason is that CombineFileInputFormat uses the result of createValue() on the first sub-split as the value container. Since the first sub-split is empty, the value container is null.
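The failure mode described above can be sketched with a small, hypothetical model (not the real parquet-mr or Hadoop classes): the shared value container comes only from createValue() on the first sub-split, so an empty first sub-split leaves it null and the next read dereferences null. The sketch includes one possible guard, allocating the container lazily:

```java
import java.util.List;

// Hypothetical minimal model of the PARQUET-1963 failure mode.
class CombineValueContainerModel {
    static String readAll(List<List<String>> subSplits) {
        // Buggy pattern: the container is seeded ONLY from the first
        // sub-split; an empty first sub-split yields null.
        String[] container = subSplits.get(0).isEmpty() ? null : new String[1];
        StringBuilder out = new StringBuilder();
        for (List<String> split : subSplits) {
            for (String record : split) {
                if (container == null) {
                    // Guard (sketch of a fix): allocate lazily instead of
                    // dereferencing null, which is the NPE in the trace above.
                    container = new String[1];
                }
                container[0] = record;
                out.append(container[0]).append(';');
            }
        }
        return out.toString();
    }
}
```

Without the null check, the first `container[0] = record` after an empty first sub-split throws the NullPointerException shown in the issue.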
[jira] [Commented] (PARQUET-1666) Remove Unused Modules
[ https://issues.apache.org/jira/browse/PARQUET-1666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17244209#comment-17244209 ]

Daniel Dai commented on PARQUET-1666:
-------------------------------------

This sounds good to me. I can also put it into the old branches, since we are not using 1.12 anyway. As for consuming the patch, we have an internal branch, so it should not be a big deal for us.

> Remove Unused Modules
> ---------------------
>
>                 Key: PARQUET-1666
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1666
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.12.0
>            Reporter: Xinli Shang
>            Priority: Major
>             Fix For: 1.12.0
>
> In the last two meetings, Ryan Blue proposed to remove some unused Parquet modules. This is to open a task to track it.
> Here are the related meeting notes for the discussion on this.
> Remove old Parquet modules:
> Hive modules - sounds good
> Scrooge - Julien will reach out to Twitter
> Tools - undecided - Cloudera may still use parquet-tools, according to Gabor.
> Cascading - undecided
> We can mark the modules as deprecated in their descriptions.
[jira] [Commented] (PARQUET-1666) Remove Unused Modules
[ https://issues.apache.org/jira/browse/PARQUET-1666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17243531#comment-17243531 ]

Daniel Dai commented on PARQUET-1666:
-------------------------------------

[~gszadovszky] I am fine with removing Cascading from 1.12.0. We only use it for a legacy application and don't think we will upgrade Cascading to use 1.12.

> Remove Unused Modules
> ---------------------
>
>                 Key: PARQUET-1666
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1666
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.12.0
>            Reporter: Xinli Shang
>            Priority: Major
>             Fix For: 1.12.0
>
> In the last two meetings, Ryan Blue proposed to remove some unused Parquet modules. This is to open a task to track it.
> Here are the related meeting notes for the discussion on this.
> Remove old Parquet modules:
> Hive modules - sounds good
> Scrooge - Julien will reach out to Twitter
> Tools - undecided - Cloudera may still use parquet-tools, according to Gabor.
> Cascading - undecided
> We can mark the modules as deprecated in their descriptions.
[jira] [Updated] (PARQUET-1947) DeprecatedParquetInputFormat in CombineFileInputFormat would produce wrong data
[ https://issues.apache.org/jira/browse/PARQUET-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PARQUET-1947:
--------------------------------
    Attachment: Part1.java

> DeprecatedParquetInputFormat in CombineFileInputFormat would produce wrong data
> -------------------------------------------------------------------------------
>
>                 Key: PARQUET-1947
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1947
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cascading
>            Reporter: Daniel Dai
>            Priority: Major
>         Attachments: Part1.java
>
> When we read Parquet files using Cascading 2, we observe wrong data at the file boundary when we turn on input combining in Cascading (setUseCombinedInput set to true).
> This can be reproduced easily with two Parquet input files, each containing one record. A simple Cascading application (attached) reads the two inputs with setUseCombinedInput(true). What we get is a duplicated record from the first input file and a missing record from the second input file.
> Here is the call sequence, to understand what happens after the last record of the first input:
> 1. Cascading invokes DeprecatedParquetInputFormat.createValue(), which is the last record of the first input again
> 2. CombineFileRecordReader invokes RecordReader.next and reaches the EOF of the first input
> 3. CombineFileRecordReader creates a new DeprecatedParquetInputFormat.RecordReaderWrapper, which creates the new "value" variable containing the first record of the second input
> 4. CombineFileRecordReader invokes RecordReader.next on the new RecordReaderWrapper, but since the firstRecord flag is on, next does nothing
> 5. Thus the "value" variable containing the first record of the second input is lost, and Cascading reuses the last record of the first input
[jira] [Created] (PARQUET-1947) DeprecatedParquetInputFormat in CombineFileInputFormat would produce wrong data
Daniel Dai created PARQUET-1947:
-----------------------------------

             Summary: DeprecatedParquetInputFormat in CombineFileInputFormat would produce wrong data
                 Key: PARQUET-1947
                 URL: https://issues.apache.org/jira/browse/PARQUET-1947
             Project: Parquet
          Issue Type: Bug
          Components: parquet-cascading
            Reporter: Daniel Dai

When we read Parquet files using Cascading 2, we observe wrong data at the file boundary when we turn on input combining in Cascading (setUseCombinedInput set to true).

This can be reproduced easily with two Parquet input files, each containing one record. A simple Cascading application (attached) reads the two inputs with setUseCombinedInput(true). What we get is a duplicated record from the first input file and a missing record from the second input file.

Here is the call sequence, to understand what happens after the last record of the first input:
1. Cascading invokes DeprecatedParquetInputFormat.createValue(), which is the last record of the first input again
2. CombineFileRecordReader invokes RecordReader.next and reaches the EOF of the first input
3. CombineFileRecordReader creates a new DeprecatedParquetInputFormat.RecordReaderWrapper, which creates the new "value" variable containing the first record of the second input
4. CombineFileRecordReader invokes RecordReader.next on the new RecordReaderWrapper, but since the firstRecord flag is on, next does nothing
5. Thus the "value" variable containing the first record of the second input is lost, and Cascading reuses the last record of the first input
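The call sequence above can be sketched with a small, hypothetical model (not the real parquet-mr classes): each wrapper eagerly reads its first record into its own "value" at construction and sets a firstRecord flag, but the flagged next() never copies that record into the caller's shared container, so the container keeps whatever the previous sub-split left there.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical minimal model of the PARQUET-1947 record-loss bug.
class RecordReaderWrapperModel {
    private final Iterator<String> records;
    String value;                 // eagerly holds the first record; createValue() hands this out
    private boolean firstRecord;

    RecordReaderWrapperModel(List<String> split) {
        records = split.iterator();
        if (records.hasNext()) {
            value = records.next();
            firstRecord = true;
        }
    }

    // Buggy next(): while firstRecord is set, it reports success without
    // copying `value` into the shared container, so a container seeded by a
    // PREVIOUS sub-split silently keeps its stale record.
    boolean next(String[] container) {
        if (firstRecord) {
            firstRecord = false;
            return value != null;
        }
        if (!records.hasNext()) return false;
        container[0] = records.next();
        return true;
    }
}
```

For the first sub-split this happens to work, because the shared container IS that wrapper's own value object; for every later sub-split, the eagerly read first record is lost and the caller re-sees the previous split's last record.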
[jira] [Updated] (PARQUET-334) UT TestSummary failed with "java.lang.RuntimeException: Usage: B = FOREACH (GROUP A ALL) GENERATE Summary(A); Can not get schema from null" when Pig >=0.15
[ https://issues.apache.org/jira/browse/PARQUET-334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PARQUET-334:
-------------------------------
    Attachment: PARQUET-334-1.patch

The input schema is maintained by Pig inside EvalFunc, so there is no need to maintain it on the Parquet side. Patch attached.

> UT TestSummary failed with "java.lang.RuntimeException: Usage: B = FOREACH (GROUP A ALL) GENERATE Summary(A); Can not get schema from null" when Pig >=0.15
> -----------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: PARQUET-334
>                 URL: https://issues.apache.org/jira/browse/PARQUET-334
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-mr
>    Affects Versions: 1.6.0
>            Reporter: li xiang
>            Priority: Critical
>         Attachments: PARQUET-334-1.patch
>
> org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store alias B
> 	at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1694)
> 	at org.apache.pig.PigServer.registerQuery(PigServer.java:623)
> 	at org.apache.pig.PigServer.registerQuery(PigServer.java:636)
> 	at parquet.pig.summary.TestSummary.testMaxIsZero(TestSummary.java:154)
> 	...
> Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 0: java.lang.RuntimeException: Usage: B = FOREACH (GROUP A ALL) GENERATE Summary(A); Can not get schema from null
> 	at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:307)
> 	at org.apache.pig.PigServer.launchPlan(PigServer.java:1390)
> 	at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1375)
> 	at org.apache.pig.PigServer.execute(PigServer.java:1364)
> 	at org.apache.pig.PigServer.access$500(PigServer.java:113)
> 	at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:1689)
> 	... 32 more
> Caused by: java.lang.RuntimeException: Usage: B = FOREACH (GROUP A ALL) GENERATE Summary(A); Can not get schema from null
> 	at parquet.pig.summary.Summary.setInputSchema(Summary.java:266)
> 	at org.apache.pig.newplan.logical.expression.ExpToPhyTranslationVisitor.visit(ExpToPhyTranslationVisitor.java:530)
> 	at org.apache.pig.newplan.logical.expression.UserFuncExpression.accept(UserFuncExpression.java:132)
> 	at org.apache.pig.newplan.ReverseDependencyOrderWalkerWOSeenChk.walk(ReverseDependencyOrderWalkerWOSeenChk.java:69)
> 	at org.apache.pig.newplan.logical.relational.LogToPhyTranslationVisitor.visit(LogToPhyTranslationVisitor.java:808)
> 	at org.apache.pig.newplan.logical.relational.LOForEach.accept(LOForEach.java:87)
> 	at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
> 	at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:52)
> 	at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:258)
> 	at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:293)
> 	... 37 more
> Caused by: java.lang.NullPointerException
> 	at parquet.pig.summary.Summary.setInputSchema(Summary.java:261)
> 	... 46 more
>
> It relates to a change on the Pig side, in pig/src/org/apache/pig/newplan/logical/expression/ExpToPhyTranslationVisitor.java, introduced by PIG-3294

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
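The fix described in the comment (let Pig's EvalFunc keep the input schema instead of caching and eagerly dereferencing it in Summary.setInputSchema) can be illustrated with a small, hypothetical model; EvalFuncModel and SummaryModel below are illustrative stand-ins, not the real Pig classes.

```java
// Hypothetical minimal model: Pig's EvalFunc already stores the input
// schema, so a UDF can simply read it back where it is needed.
class EvalFuncModel {
    private Object inputSchema;                         // maintained by "Pig"
    void setInputSchema(Object schema) { inputSchema = schema; }
    Object getInputSchema() { return inputSchema; }
}

// Before the patch, the UDF overrode setInputSchema() and dereferenced the
// incoming schema immediately, which NPEs when plan translation passes null.
// Without the override, null is tolerated until the schema is actually used.
class SummaryModel extends EvalFuncModel {
    String describeInput() {
        Object s = getInputSchema();
        return s == null ? "<no input schema>" : s.toString();
    }
}
```

The design point is simply to move the null-sensitive work out of the setter that the planner may call with null, deferring it to the point of use.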