[jira] [Comment Edited] (SPARK-11844) can not read class org.apache.parquet.format.PageHeader: don't know what type: 13

2021-05-13 Thread Sungwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-11844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344105#comment-17344105
 ] 

Sungwon edited comment on SPARK-11844 at 5/13/21, 7:58 PM:
---

I'm in agreement with your intuition that S3aOutputStream may be closing 
without raising an underlying exception.

 

I've primarily seen this issue when writing large Parquet files (typically bigger than 1 or 2 GB), and I have confirmed that these files are corrupt when read back through the Spark/Parquet reader.

Interestingly enough, Hadoop 2.9 has a similar issue with its S3aInputStream, except that it actually throws an exception and closes the stream. The root cause on the input-stream side is that the S3Client object gets GC'ed prematurely even while the file read is still in progress. I'm curious whether something similar is happening with S3aOutputStream, but instead of surfacing the exception and closing the stream, it actually continues the write and closes it without an exception.

[https://stackoverflow.com/questions/22673698/local-variable-in-method-being-garbage-collected-before-method-exits-java]
 
[https://github.com/apache/hadoop/blob/rel/release-2.9.0/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java]
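
To illustrate the kind of premature-GC behaviour the StackOverflow link above describes, here is a minimal, hypothetical sketch (not Hadoop or AWS SDK code): an object becomes eligible for collection as soon as it is no longer strongly reachable, even while the method that created it is still running, so anything torn down on its collection (for example a connection pool inside an S3 client) can disappear while dependent I/O is still in flight.
{code:java}
import java.lang.ref.WeakReference;

// Hypothetical sketch (not Hadoop/AWS SDK code) of premature GC: once `client`
// is no longer strongly reachable, the JVM may collect it even though the
// surrounding method has not returned yet.
public class PrematureGcSketch {

    public static void main(String[] args) throws InterruptedException {
        Object client = new Object();                      // stands in for an S3 client
        WeakReference<Object> ref = new WeakReference<>(client);

        // Nothing below uses `client` any more, so it is eligible for collection
        // from this point on, even though the method is still running.
        client = null;
        System.gc();
        Thread.sleep(100);

        System.out.println("client collected: " + (ref.get() == null));
        // In real code, keeping a strong reference for the lifetime of the
        // dependent stream (or Reference.reachabilityFence on Java 9+) prevents this.
    }
}
{code}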


was (Author: sungwy92):
I'm in agreement with your intuition that S3aOutputStream may be closing 
without raising an underlying exception.

 

I've primarily seen this issue only when writing large parquet files (tends to 
be bigger than 1 or 2Gb), and have confirmed that these files are corrupt upon 
read through spark / parquet reader.

And interestingly enough, Hadoop 2.9 has a similar issue with their 
S3aInputStream, just that it actually throws an exception and closes the 
stream. The root cause in the inputstream is that the S3Client object gets 
GC'ed prematurely even when the file read is still happening. I'm curious if 
something similar is happening with S3aOutputStream, but instead of detecting 
the exception of closing the stream, it is actually continuing the write and 
closing it without an exception.

[https://stackoverflow.com/questions/22673698/local-variable-in-method-being-garbage-collected-before-method-exits-java]
 
[https://github.com/apache/hadoop/blob/rel/release-2.9.0/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java]

> can not read class org.apache.parquet.format.PageHeader: don't know what 
> type: 13
> -
>
> Key: SPARK-11844
> URL: https://issues.apache.org/jira/browse/SPARK-11844
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Minor
>  Labels: bulk-closed
>
> I got the following error once when I was running a query
> {code}
> java.io.IOException: can not read class org.apache.parquet.format.PageHeader: 
> don't know what type: 13
>   at org.apache.parquet.format.Util.read(Util.java:216)
>   at org.apache.parquet.format.Util.readPageHeader(Util.java:65)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:534)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:546)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:496)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
>  


[jira] [Comment Edited] (SPARK-11844) can not read class org.apache.parquet.format.PageHeader: don't know what type: 13

2021-05-13 Thread Sungwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-11844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344105#comment-17344105
 ] 

Sungwon edited comment on SPARK-11844 at 5/13/21, 7:57 PM:
---

I'm in agreement with your intuition that S3aOutputStream may be closing 
without raising an underlying exception.

 

I've primarily seen this issue only when writing large parquet files (tends to 
be bigger than 1 or 2Gb), and have confirmed that these files are corrupt upon 
read through spark / parquet reader.

And interestingly enough, Hadoop 2.9 has a similar issue with their 
S3aInputStream, just that it actually throws an exception and closes the 
stream. The root cause in the inputstream is that the S3Client object gets 
GC'ed prematurely even when the file read is still happening. I'm curious if 
something similar is happening with S3aOutputStream, but instead of detecting 
the exception of closing the stream, it is actually continuing the write and 
closing it without an exception.

[https://stackoverflow.com/questions/22673698/local-variable-in-method-being-garbage-collected-before-method-exits-java]
 
[https://github.com/apache/hadoop/blob/rel/release-2.9.0/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java]


was (Author: sungwy92):
I'm in agreement with your intuition that S3aOutputStream may be closing 
without raising an underlying exception.

 

I've primarily seen this issue only when writing large parquet files (tends to 
be bigger than 1 or 2Gb), and have confirmed that these files are corrupt upon 
read through spark / parquet reader.

And interestingly enough, Hadoop 2.9 has a similar issue with their 
S3aInputStream, just that it actually throws an exception and closes the 
stream. The root cause in the inputstream is that the S3Client object gets 
GC'ed prematurely even when the file read is still happening. I'm curious if 
something similar is happening with S3aOutputStream, but instead of detecting 
the exception of closing the stream, it is actually continuing the write and 
closing it without an exception.

[https://stackoverflow.com/questions/22673698/local-variable-in-method-being-garbage-collected-before-method-exits-java]
[https://github.com/apache/hadoop/blob/rel/release-2.9.0/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java]

> can not read class org.apache.parquet.format.PageHeader: don't know what 
> type: 13
> -
>
> Key: SPARK-11844
> URL: https://issues.apache.org/jira/browse/SPARK-11844
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Minor
>  Labels: bulk-closed
>
> I got the following error once when I was running a query
> {code}
> java.io.IOException: can not read class org.apache.parquet.format.PageHeader: 
> don't know what type: 13
>   at org.apache.parquet.format.Util.read(Util.java:216)
>   at org.apache.parquet.format.Util.readPageHeader(Util.java:65)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:534)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:546)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:496)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
>   

[jira] [Comment Edited] (SPARK-11844) can not read class org.apache.parquet.format.PageHeader: don't know what type: 13

2021-05-13 Thread Sungwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-11844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344095#comment-17344095
 ] 

Sungwon edited comment on SPARK-11844 at 5/13/21, 7:42 PM:
---

[~hryhoriev.nick], where do you see this option: *fast.data.upload*?

I don't see it in Hadoop's core-default.xml or in the hadoop-aws module: 
[https://github.com/apache/hadoop/blob/027c8fb257eb5144a4cee42341bf6b774c0fd8d1/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Constants.java#L318]


was (Author: sungwy92):
[~hryhoriev.nick] where do you see this option?

I don't see it in hadoop core-default.xml or on hadoop-aws module: 
https://github.com/apache/hadoop/blob/027c8fb257eb5144a4cee42341bf6b774c0fd8d1/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Constants.java#L318

> can not read class org.apache.parquet.format.PageHeader: don't know what 
> type: 13
> -
>
> Key: SPARK-11844
> URL: https://issues.apache.org/jira/browse/SPARK-11844
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Minor
>  Labels: bulk-closed
>
> I got the following error once when I was running a query
> {code}
> java.io.IOException: can not read class org.apache.parquet.format.PageHeader: 
> don't know what type: 13
>   at org.apache.parquet.format.Util.read(Util.java:216)
>   at org.apache.parquet.format.Util.readPageHeader(Util.java:65)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:534)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:546)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:496)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:704)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:704)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: parquet.org.apache.thrift.protocol.TProtocolException: don't know 
> what type: 13
>   at 
> parquet.org.apache.thrift.protocol.TCompactProtocol.getTType(TCompactProtocol.java:806)
>   at 
> parquet.org.apache.thrift.protocol.TCompactProtocol.readFieldBegin(TCompactProtocol.java:500)
>   at 
> 

[jira] [Comment Edited] (SPARK-11844) can not read class org.apache.parquet.format.PageHeader: don't know what type: 13

2021-04-14 Thread Xu Guang Lv (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-11844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17320926#comment-17320926
 ] 

Xu Guang Lv edited comment on SPARK-11844 at 4/14/21, 11:31 AM:


I got the same error too, but I can definitely reproduce it each time I run my job. Is there any patch that can solve this issue?

Here is my scenario:
https://github.com/apache/hudi/issues/2812


was (Author: xu_guang_lv):
I got the same error too, but I can definitely reproduce the error each time I 
ran my job. Is there any patch can solve this issue?

> can not read class org.apache.parquet.format.PageHeader: don't know what 
> type: 13
> -
>
> Key: SPARK-11844
> URL: https://issues.apache.org/jira/browse/SPARK-11844
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Minor
>  Labels: bulk-closed
>
> I got the following error once when I was running a query
> {code}
> java.io.IOException: can not read class org.apache.parquet.format.PageHeader: 
> don't know what type: 13
>   at org.apache.parquet.format.Util.read(Util.java:216)
>   at org.apache.parquet.format.Util.readPageHeader(Util.java:65)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:534)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:546)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:496)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:704)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:704)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: parquet.org.apache.thrift.protocol.TProtocolException: don't know 
> what type: 13
>   at 
> parquet.org.apache.thrift.protocol.TCompactProtocol.getTType(TCompactProtocol.java:806)
>   at 
> parquet.org.apache.thrift.protocol.TCompactProtocol.readFieldBegin(TCompactProtocol.java:500)
>   at 
> org.apache.parquet.format.InterningProtocol.readFieldBegin(InterningProtocol.java:158)
>   at 
> parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:108)
> {code}
> The next retry was good. Right now, seems not critical. But, let's still 
> track 

[jira] [Comment Edited] (SPARK-11844) can not read class org.apache.parquet.format.PageHeader: don't know what type: 13

2021-04-12 Thread Nick Hryhoriev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-11844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317066#comment-17317066
 ] 

Nick Hryhoriev edited comment on SPARK-11844 at 4/12/21, 3:05 PM:
--

I have experienced the same issue with Spark 2.4.7 and 3.1.1, with the Spark Structured Streaming FileSink and DeltaSink, writing to S3 using the S3A API.
 Specifically, the writer used the S3 multipart upload, which is hidden from the Spark developer inside the S3A API (hadoop-aws).
 Configuration for s3a:

.set("spark.hadoop.fs.s3a.threads.max", "128")
 .set("spark.hadoop.fs.s3a.connection.maximum", "500")
 .set("spark.hadoop.fs.s3a.max.total.tasks", "2500")
 .set("spark.hadoop.fs.s3a.multipart.threshold", "104857600")
 .set("spark.hadoop.fs.s3a.multipart.size", "104857600")

"spark.hadoop.fs.s3a.fast.upload": "true",
 "spark.hadoop.fs.s3a.fast.upload.buffer": "bytebuffer",
 "spark.hadoop.fs.s3a.fast.upload.active.blocks": "4"

It has happened twice in production over 5 months.
 I was not able to reproduce it, because we write near 1 files every hour, files are added every 10 minutes, and only two were damaged.

One file has a PageHeader: null exception.
 One file has PageHeader: unknown type -15.
 And this issue is always reproduced on these files.
 The footer of the file is OK, which means I can read the schema and do operations that require only statistics, like count.

The data itself does not affect this. We have a backup of the data and successfully re-ran the ETL on it; the new files are OK.

My personal *intuition* suggests that the problem is somewhere in the exception handling in ParquetWriter + S3aOutputStream.
 Why?
 1. The S3 SDK can confirm or abort an upload, because the upload is atomic.
 2. The Hadoop FileSystem API, at the FSDataOutputStream level, can only close the stream.
 So in case of a high-level exception in Spark or Parquet, it just does not close the stream.
 But because it is a stream, and part of the java.io decorator pattern, multiple streams wrap each other, so maybe somewhere there is a `try finally` block which calls `close`.
 That would commit the wrong file and hide the underlying exception.
 It's only a guess; I was not able to confirm it with code or reproduce it.
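
A minimal sketch of the failure mode I am guessing at (hypothetical code, not the actual Spark or Hadoop implementation): if `write` fails and a plain `finally` block still calls `close`, the close may complete whatever was buffered, and if `close` itself throws, its exception replaces the original write failure, so the more meaningful error is lost.
{code:java}
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical sketch of a write failure being masked by close() in a finally
// block; illustrative only, not Spark/Hadoop source.
public class MaskedWriteFailureSketch {

    static void copyPages(OutputStream out, byte[][] pages) throws IOException {
        try {
            for (byte[] page : pages) {
                out.write(page);   // suppose this throws halfway through
            }
        } finally {
            // Runs even after the write failed: it may still flush and "commit" a
            // truncated file, and if it throws, its exception replaces (hides)
            // the original write exception.
            out.close();
        }
    }
}
{code}
Spark's `Utils.tryWithSafeFinallyAndFailureCallbacks` is designed to avoid exactly this kind of suppression, which is why the `SparkHadoopWriter` write path looks interesting to me.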

FYI [~hyukjin.kwon]


was (Author: hryhoriev.nick):
I have experienced the same issue with spark 2.4.7 and 3.1.1. with Spark 
Structure Stream FileSink and DeltaSink.  writing to s3 using S3a API.
 This specific was a writer with the s3 multipart upload, which hidden for 
spark developer in S3a API(Hadoop-AWS).
 Configuration for s3a

.set("spark.hadoop.fs.s3a.threads.max", "128")
 .set("spark.hadoop.fs.s3a.connection.maximum", "500")
 .set("spark.hadoop.fs.s3a.max.total.tasks", "2500")
 .set("spark.hadoop.fs.s3a.multipart.threshold", "104857600")
 .set("spark.hadoop.fs.s3a.multipart.size", "104857600")

"spark.hadoop.fs.s3a.fast.upload": "true",
 "spark.hadoop.fs.s3a.fast.upload.buffer": "bytebuffer",
 "spark.hadoop.fs.s3a.fast.upload.active.blocks": "4"

It happens twice in production for 5 months.
 I was not lucky to reproduce it. 
 Because we write near 1 files every hour. files add every 10 minutes,
 And only two damaged.

One file has PageHeader: Null exception.
 One file has PageHeader: unknown type -15
 And this issue always reproduced on these files.
 The Footer of the file is OK, which means I can read schema and do an 
operation that requires only statistics like count.

Data itself do not affect this. We have a backup of data and successfully re 
ETL this data. New files are ok.

My personal *intuition* suggests that problem somewhere in reception handling 
in ParquetWriter + S3aOutputStream.
 Why? 
 1. S3 SDK can confirm or abort upload. because it's atomic
 2. While Hadoop Files system API on FsDataOutputStream level, can only close 
the stream.
 So in case of high-level exception in spark or parquet, it just does not close 
stream.
 But because it's Stream, which is part of the java.io decorator pattern. 
Multiple streams wrap each other. So maybe somewhere there is a `try finally` 
block which calls `close`.
 Which will commit the wrong file and hide the underlying exception.
 It's only a guess, I was not able to confirm it with code or reproduce it.

the most Strange place which I find is `SparkHadoopWriter` -> line 129.

 
{code:java}
// Write all rows in RDD partition.
try {
 val ret = Utils.tryWithSafeFinallyAndFailureCallbacks {
 while (iterator.hasNext) {
 val pair = iterator.next()
 config.write(pair)

 // Update bytes written metric every few records
 maybeUpdateOutputMetrics(outputMetrics, callback, recordsWritten)
 recordsWritten += 1
 }

 config.closeWriter(taskContext)
 committer.commitTask(taskContext)
 }{code}
And here is the java-doc for tryWithSafeFinallyAndFailureCallbacks.
{code:java}
/**
 * Execute a block of code and call the failure callbacks in the catch block. 
If exceptions occur
 * in either the catch or the finally block, they are appended to the list of 

[jira] [Comment Edited] (SPARK-11844) can not read class org.apache.parquet.format.PageHeader: don't know what type: 13

2021-04-08 Thread Nick Hryhoriev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-11844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317066#comment-17317066
 ] 

Nick Hryhoriev edited comment on SPARK-11844 at 4/8/21, 2:22 PM:
-

I have experienced the same issue with spark 2.4.7 and 3.1.1. with Spark 
Structure Stream FileSink and DeltaSink.  writing to s3 using S3a API.
 This specific was a writer with the s3 multipart upload, which hidden for 
spark developer in S3a API(Hadoop-AWS).
 Configuration for s3a

.set("spark.hadoop.fs.s3a.threads.max", "128")
 .set("spark.hadoop.fs.s3a.connection.maximum", "500")
 .set("spark.hadoop.fs.s3a.max.total.tasks", "2500")
 .set("spark.hadoop.fs.s3a.multipart.threshold", "104857600")
 .set("spark.hadoop.fs.s3a.multipart.size", "104857600")

"spark.hadoop.fs.s3a.fast.upload": "true",
 "spark.hadoop.fs.s3a.fast.upload.buffer": "bytebuffer",
 "spark.hadoop.fs.s3a.fast.upload.active.blocks": "4"

It happens twice in production for 5 months.
 I was not lucky to reproduce it. 
 Because we write near 1 files every hour. files add every 10 minutes,
 And only two damaged.

One file has PageHeader: Null exception.
 One file has PageHeader: unknown type -15
 And this issue always reproduced on these files.
 The Footer of the file is OK, which means I can read schema and do an 
operation that requires only statistics like count.

Data itself do not affect this. We have a backup of data and successfully re 
ETL this data. New files are ok.

My personal *intuition* suggests that problem somewhere in reception handling 
in ParquetWriter + S3aOutputStream.
 Why? 
 1. S3 SDK can confirm or abort upload. because it's atomic
 2. While Hadoop Files system API on FsDataOutputStream level, can only close 
the stream.
 So in case of high-level exception in spark or parquet, it just does not close 
stream.
 But because it's Stream, which is part of the java.io decorator pattern. 
Multiple streams wrap each other. So maybe somewhere there is a `try finally` 
block which calls `close`.
 Which will commit the wrong file and hide the underlying exception.
 It's only a guess, I was not able to confirm it with code or reproduce it.

the most Strange place which I find is `SparkHadoopWriter` -> line 129.

 
{code:java}
// Write all rows in RDD partition.
try {
 val ret = Utils.tryWithSafeFinallyAndFailureCallbacks {
 while (iterator.hasNext) {
 val pair = iterator.next()
 config.write(pair)

 // Update bytes written metric every few records
 maybeUpdateOutputMetrics(outputMetrics, callback, recordsWritten)
 recordsWritten += 1
 }

 config.closeWriter(taskContext)
 committer.commitTask(taskContext)
 }{code}
And here is the java-doc for tryWithSafeFinallyAndFailureCallbacks.
{code:java}
/**
 * Execute a block of code and call the failure callbacks in the catch block. 
If exceptions occur
 * in either the catch or the finally block, they are appended to the list of 
suppressed
 * exceptions in original exception which is then rethrown.
 *
 * This is primarily an issue with `catch { abort() }` or `finally { 
out.close() }` blocks,
 * where the abort/close needs to be called to clean up `out`, but if an 
exception happened
 * in `out.write`, it's likely `out` may be corrupted and `abort` or 
`out.close` will
 * fail as well. This would then suppress the original/likely more meaningful
 * exception from the original `out.write` call.
 */{code}
FIY [~hyukjin.kwon]


was (Author: hryhoriev.nick):
I have experienced the same issue with spark 2.4.7 and 3.1.1. with Spark 
Structure Stream FileSink and DeltaSink.  writing to s3 using S3a API.
 This specific was a writer with the s3 multipart upload, which hidden for 
spark developer in S3a API(Hadoop-AWS).
 Configuration for s3a

.set("spark.hadoop.fs.s3a.threads.max", "128")
 .set("spark.hadoop.fs.s3a.connection.maximum", "500")
 .set("spark.hadoop.fs.s3a.max.total.tasks", "2500")
 .set("spark.hadoop.fs.s3a.multipart.threshold", "104857600")
 .set("spark.hadoop.fs.s3a.multipart.size", "104857600")

"spark.hadoop.fs.s3a.fast.upload": "true",
 "spark.hadoop.fs.s3a.fast.upload.buffer": "bytebuffer",
 "spark.hadoop.fs.s3a.fast.upload.active.blocks": "4"

It happens twice in production for 5 months.
 I was not lucky to reproduce it. 
 Because we write near 1 files every hour. files add every 10 minutes,
 And only two damaged.

One file has PageHeader: Null exception.
 One file has PageHeader: unknown type -15
 And this issue always reproduced on these files.
 The Footer of the file is OK, which means I can read schema and do an 
operation that requires only statistics like count.

Data itself do not affect this. We have a backup of data and successfully re 
ETL this data. New files are ok.

My personal *intuition* suggests that problem somewhere in reception handling 
in ParquetWriter + S3aOutputStream.
 Why? 
 1. S3 SDK can confirm or abort upload. because it's atomic
 2. While Hadoop Files 

[jira] [Comment Edited] (SPARK-11844) can not read class org.apache.parquet.format.PageHeader: don't know what type: 13

2021-04-08 Thread Nick Hryhoriev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-11844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317066#comment-17317066
 ] 

Nick Hryhoriev edited comment on SPARK-11844 at 4/8/21, 11:50 AM:
--

I have experienced the same issue with spark 2.4.7 and 3.1.1. with Spark 
Structure Stream FileSink and DeltaSink.  writing to s3 using S3a API.
 This specific was a writer with the s3 multipart upload, which hidden for 
spark developer in S3a API(Hadoop-AWS).
 Configuration for s3a

.set("spark.hadoop.fs.s3a.threads.max", "128")
 .set("spark.hadoop.fs.s3a.connection.maximum", "500")
 .set("spark.hadoop.fs.s3a.max.total.tasks", "2500")
 .set("spark.hadoop.fs.s3a.multipart.threshold", "104857600")
 .set("spark.hadoop.fs.s3a.multipart.size", "104857600")

"spark.hadoop.fs.s3a.fast.upload": "true",
 "spark.hadoop.fs.s3a.fast.upload.buffer": "bytebuffer",
 "spark.hadoop.fs.s3a.fast.upload.active.blocks": "4"

It happens twice in production for 5 months.
 I was not lucky to reproduce it. 
 Because we write near 1 files every hour. files add every 10 minutes,
 And only two damaged.

One file has PageHeader: Null exception.
 One file has PageHeader: unknown type -15
 And this issue always reproduced on these files.
 The Footer of the file is OK, which means I can read schema and do an 
operation that requires only statistics like count.

Data itself do not affect this. We have a backup of data and successfully re 
ETL this data. New files are ok.

My personal *intuition* suggests that problem somewhere in reception handling 
in ParquetWriter + S3aOutputStream.
 Why? 
 1. S3 SDK can confirm or abort upload. because it's atomic
 2. While Hadoop Files system API on FsDataOutputStream level, can only close 
the stream.
 So in case of high-level exception in spark or parquet, it just does not close 
stream.
 But because it's Stream, which is part of the java.io decorator pattern. 
Multiple streams wrap each other. So maybe somewhere there is a `try finally` 
block which calls `close`.
 Which will commit the wrong file and hide the underlying exception.
 It's only a guess, I was not able to confirm it with code or reproduce it.

the most Strange place which I find is `SparkHadoopWriter` -> line 129.

 
{code:java}
// Write all rows in RDD partition.
try {
 val ret = Utils.tryWithSafeFinallyAndFailureCallbacks {
 while (iterator.hasNext) {
 val pair = iterator.next()
 config.write(pair)

 // Update bytes written metric every few records
 maybeUpdateOutputMetrics(outputMetrics, callback, recordsWritten)
 recordsWritten += 1
 }

 config.closeWriter(taskContext)
 committer.commitTask(taskContext)
 }{code}
And here is the java-doc for tryWithSafeFinallyAndFailureCallbacks.
{code:java}
/**
 * Execute a block of code and call the failure callbacks in the catch block. 
If exceptions occur
 * in either the catch or the finally block, they are appended to the list of 
suppressed
 * exceptions in original exception which is then rethrown.
 *
 * This is primarily an issue with `catch { abort() }` or `finally { 
out.close() }` blocks,
 * where the abort/close needs to be called to clean up `out`, but if an 
exception happened
 * in `out.write`, it's likely `out` may be corrupted and `abort` or 
`out.close` will
 * fail as well. This would then suppress the original/likely more meaningful
 * exception from the original `out.write` call.
 */{code}


was (Author: hryhoriev.nick):
I have experienced the same issue with spark 2.4.7 and 3.1.1. with Spark 
Structure Stream FileSink and DeltaSink.  writing to s3 using S3a API.
 This specific was a writer with the s3 multipart upload, which hidden for 
spark developer in S3a API(Hadoop-AWS).
Configuration for s3a

.set("spark.hadoop.fs.s3a.threads.max", "128")
.set("spark.hadoop.fs.s3a.connection.maximum", "500")
.set("spark.hadoop.fs.s3a.max.total.tasks", "2500")
.set("spark.hadoop.fs.s3a.multipart.threshold", "104857600")
.set("spark.hadoop.fs.s3a.multipart.size", "104857600")

"spark.hadoop.fs.s3a.fast.upload": "true",
"spark.hadoop.fs.s3a.fast.upload.buffer": "bytebuffer",
"spark.hadoop.fs.s3a.fast.upload.active.blocks": "4"


 It happens twice in production for 5 months.
 I was not lucky to reproduce it. 
 Because we write near 1 files every hour. files add every 10 minutes,
 And only two damaged.

One file has PageHeader: Null exception.
 One file has PageHeader: unknown type -15
 And this issue always reproduced on these files.
 The Footer of the file is OK, which means I can read schema and do an 
operation that requires only statistics like count.

Data itself do not affect this. We have a backup of data and successfully re 
ETL this data. New files are ok.

My personal *intuition* suggests that problem somewhere in reception handling 
in ParquetWriter + S3aOutputStream.
 Why? 
 1. S3 SDK can confirm or abort upload. because it's atomic
 2. While Hadoop Files system API on 

[jira] [Comment Edited] (SPARK-11844) can not read class org.apache.parquet.format.PageHeader: don't know what type: 13

2021-04-08 Thread Nick Hryhoriev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-11844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317066#comment-17317066
 ] 

Nick Hryhoriev edited comment on SPARK-11844 at 4/8/21, 10:51 AM:
--

I have experienced the same issue with spark 2.4.7 and 3.1.1. with Spark 
Structure Stream FileSink and DeltaSink.  writing to s3 using S3a API.
 This specific was a writer with the s3 multipart upload, which hidden for 
spark developer in S3a API(Hadoop-AWS).
Configuration for s3a

.set("spark.hadoop.fs.s3a.threads.max", "128")
.set("spark.hadoop.fs.s3a.connection.maximum", "500")
.set("spark.hadoop.fs.s3a.max.total.tasks", "2500")
.set("spark.hadoop.fs.s3a.multipart.threshold", "104857600")
.set("spark.hadoop.fs.s3a.multipart.size", "104857600")

"spark.hadoop.fs.s3a.fast.upload": "true",
"spark.hadoop.fs.s3a.fast.upload.buffer": "bytebuffer",
"spark.hadoop.fs.s3a.fast.upload.active.blocks": "4"


 It happens twice in production for 5 months.
 I was not lucky to reproduce it. 
 Because we write near 1 files every hour. files add every 10 minutes,
 And only two damaged.

One file has PageHeader: Null exception.
 One file has PageHeader: unknown type -15
 And this issue always reproduced on these files.
 The Footer of the file is OK, which means I can read schema and do an 
operation that requires only statistics like count.

Data itself do not affect this. We have a backup of data and successfully re 
ETL this data. New files are ok.

My personal *intuition* suggests that problem somewhere in reception handling 
in ParquetWriter + S3aOutputStream.
 Why? 
 1. S3 SDK can confirm or abort upload. because it's atomic
 2. While Hadoop Files system API on FsDataOutputStream level, can only close 
the stream.
 So in case of high-level exception in spark or parquet, it just does not close 
stream.
 But because it's Stream, which is part of the java.io decorator pattern. 
Multiple streams wrap each other. So maybe somewhere there is a `try finally` 
block which calls `close`.
 Which will commit the wrong file and hide the underlying exception.
 It's only a guess, I was not able to confirm it with code or reproduce it.


was (Author: hryhoriev.nick):
I have experienced the same issue with spark 2.4.7 and 3.1.1. with Spark 
Structure Stream FileSink and DeltaSink.  writing to s3 using S3a API.
This specific was a writer with the s3 multipart upload, which hidden for spark 
developer in S3a API(Hadoop-aws).
 It happens twice in production for 5 months.
 I was not lucky to reproduce it. 
 Because we write near 1 files every hour. files add every 10 minutes,
 And only two damaged.

One file has PageHeader: Null exception.
 One file has PageHeader: unknown type -15
 And this issue always reproduced on these files.
 The Footer of the file is OK, which means I can read schema and do an 
operation that requires only statistics like count.

Data itself do not affect this. We have a backup of data and successfully re 
ETL this data. New files are ok.

My personal *intuition* suggests that problem somewhere in reception handling 
in ParquetWriter + S3aOutputStream.
Why? 
1. S3 SDK can confirm or abort upload. because it's atomic
2. While Hadoop Files system API on FsDataOutputStream level, can only close 
the stream.
So in case of high-level exception in spark or parquet, it just does not close 
stream.
But because it's Stream, which is part of the java.io decorator pattern. 
Multiple streams wrap each other. So maybe somewhere there is a `try finally` 
block which calls `close`.
Which will commit the wrong file and hide the underlying exception.
It's only a guess, I was not able to confirm it with code or reproduce it.

> can not read class org.apache.parquet.format.PageHeader: don't know what 
> type: 13
> -
>
> Key: SPARK-11844
> URL: https://issues.apache.org/jira/browse/SPARK-11844
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Minor
>  Labels: bulk-closed
>
> I got the following error once when I was running a query
> {code}
> java.io.IOException: can not read class org.apache.parquet.format.PageHeader: 
> don't know what type: 13
>   at org.apache.parquet.format.Util.read(Util.java:216)
>   at org.apache.parquet.format.Util.readPageHeader(Util.java:65)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:534)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:546)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:496)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127)
>   at 
> 

[jira] [Comment Edited] (SPARK-11844) can not read class org.apache.parquet.format.PageHeader: don't know what type: 13

2021-04-08 Thread Nick Hryhoriev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-11844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317066#comment-17317066
 ] 

Nick Hryhoriev edited comment on SPARK-11844 at 4/8/21, 10:49 AM:
--

I have experienced the same issue with spark 2.4.7 and 3.1.1. with Spark 
Structure Stream FileSink and DeltaSink.  writing to s3 using S3a API.
This specific was a writer with the s3 multipart upload, which hidden for spark 
developer in S3a API(Hadoop-aws).
 It happens twice in production for 5 months.
 I was not lucky to reproduce it. 
 Because we write near 1 files every hour. files add every 10 minutes,
 And only two damaged.

One file has PageHeader: Null exception.
 One file has PageHeader: unknown type -15
 And this issue always reproduced on these files.
 The Footer of the file is OK, which means I can read schema and do an 
operation that requires only statistics like count.

Data itself do not affect this. We have a backup of data and successfully re 
ETL this data. New files are ok.

My personal *intuition* suggests that problem somewhere in reception handling 
in ParquetWriter + S3aOutputStream.
Why? 
1. S3 SDK can confirm or abort upload. because it's atomic
2. While Hadoop Files system API on FsDataOutputStream level, can only close 
the stream.
So in case of high-level exception in spark or parquet, it just does not close 
stream.
But because it's Stream, which is part of the java.io decorator pattern. 
Multiple streams wrap each other. So maybe somewhere there is a `try finally` 
block which calls `close`.
Which will commit the wrong file and hide the underlying exception.
It's only a guess, I was not able to confirm it with code or reproduce it.


was (Author: hryhoriev.nick):
I have experienced the same issue with spark 2.4.7 and 3.1.1. with Spark 
Structure Stream FileSink and DeltaSink.  writing to s3 using S3a api.
 It happens twice in production for 5 months.
 I was not lucky to reproduce it. 
 Because we write near 1 files every hour. files add every 10 minutes,
 And only two damaged.

One file has PageHeader: Null exception.
One file has PageHeader: unknown type -15
And this issue always reproduced on these files.
The Footer of the file is OK, which means I can read schema and do an operation 
that requires only statistics like count.

> can not read class org.apache.parquet.format.PageHeader: don't know what 
> type: 13
> -
>
> Key: SPARK-11844
> URL: https://issues.apache.org/jira/browse/SPARK-11844
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Minor
>  Labels: bulk-closed
>
> I got the following error once when I was running a query
> {code}
> java.io.IOException: can not read class org.apache.parquet.format.PageHeader: 
> don't know what type: 13
>   at org.apache.parquet.format.Util.read(Util.java:216)
>   at org.apache.parquet.format.Util.readPageHeader(Util.java:65)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:534)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:546)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:496)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
>   at 
> 

[jira] [Comment Edited] (SPARK-11844) can not read class org.apache.parquet.format.PageHeader: don't know what type: 13

2021-04-08 Thread Nick Hryhoriev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-11844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317066#comment-17317066
 ] 

Nick Hryhoriev edited comment on SPARK-11844 at 4/8/21, 10:35 AM:
--

I have experienced the same issue with spark 2.4.7 and 3.1.1. with Spark 
Structure Stream FileSink and DeltaSink.  writing to s3 using S3a api.
 It happens twice in production for 5 months.
 I was not lucky to reproduce it. 
 Because we write near 1 files every hour. files add every 10 minutes,
 And only two damaged.

One file has PageHeader: Null exception.
One file has PageHeader: unknown type -15
And this issue always reproduced on these files.
The Footer of the file is OK, which means I can read schema and do an operation 
that requires only statistics like count.


was (Author: hryhoriev.nick):
I have experienced the same issue with spark 2.4.7 and 3.1.1. with Spark 
Structure Stream FileSink and DeltaSink. 
It happens twice in production for 5 months.
I was not lucky to reproduce it. 
Because we write near 1 files every hour. files add ever 10 minutes,
And only two damaged.

> can not read class org.apache.parquet.format.PageHeader: don't know what 
> type: 13
> -
>
> Key: SPARK-11844
> URL: https://issues.apache.org/jira/browse/SPARK-11844
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Minor
>  Labels: bulk-closed
>
> I got the following error once when I was running a query
> {code}
> java.io.IOException: can not read class org.apache.parquet.format.PageHeader: 
> don't know what type: 13
>   at org.apache.parquet.format.Util.read(Util.java:216)
>   at org.apache.parquet.format.Util.readPageHeader(Util.java:65)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:534)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:546)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:496)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:704)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:704)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: 

[jira] [Comment Edited] (SPARK-11844) can not read class org.apache.parquet.format.PageHeader: don't know what type: 13

2017-10-20 Thread Cosmin Lehene (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16213216#comment-16213216
 ] 

Cosmin Lehene edited comment on SPARK-11844 at 10/20/17 8:59 PM:
-

I'm seeing these with spark-2.2.1-SNAPSHOT
{noformat}
java.io.IOException: can not read class org.apache.parquet.format.PageHeader: 
don't know what type: 14
at org.apache.parquet.format.Util.read(Util.java:216)
at org.apache.parquet.format.Util.readPageHeader(Util.java:65)
at 
org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:835)
at 
org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:849)
at 
org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:700)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:225)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:166)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: don't 
know what type: 14
at 
shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.getTType(TCompactProtocol.java:806)
at 
shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.readFieldBegin(TCompactProtocol.java:500)
at 
org.apache.parquet.format.InterningProtocol.readFieldBegin(InterningProtocol.java:158)
at org.apache.parquet.format.PageHeader.read(PageHeader.java:828)
at org.apache.parquet.format.Util.read(Util.java:213)
... 24 more
{noformat}
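
As a hedged aside on what the "don't know what type" number means (my own illustration, not Parquet or Thrift source): in Thrift's compact protocol the low nibble of each field-header byte encodes the field type, and only values up to 12 (STRUCT) are defined, so a garbage byte in the page header surfaces as an unknown-type error such as 13 or 14.
{code:java}
// Hypothetical illustration (not Thrift source): the compact protocol packs the
// field type into the low nibble of the field-header byte; defined types stop
// at 12 (STRUCT), so a corrupted byte shows up as "don't know what type: 13/14".
public class CompactTypeNibble {
    public static void main(String[] args) {
        byte corruptedHeader = (byte) 0x7E;        // an arbitrary corrupted byte
        int typeNibble = corruptedHeader & 0x0F;   // -> 14, outside the 0..12 range
        System.out.println("type nibble = " + typeNibble + " (valid: 0..12)");
    }
}
{code}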

They also seem to be accompanied by:

{noformat}
java.io.IOException: can not read class org.apache.parquet.format.PageHeader: 
Required field 'uncompressed_page_size' was not found in serialized data! 
Struct: PageHeader(type:null, uncompressed_page_size:0, compressed_page_size:0)
at org.apache.parquet.format.Util.read(Util.java:216)
at org.apache.parquet.format.Util.readPageHeader(Util.java:65)
at 
org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:835)
at 
org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:849)
at 
org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:700)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:225)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
at 

[jira] [Comment Edited] (SPARK-11844) can not read class org.apache.parquet.format.PageHeader: don't know what type: 13

2015-11-20 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15015497#comment-15015497
 ] 

Hyukjin Kwon edited comment on SPARK-11844 at 11/20/15 9:45 AM:


Just a question: does this just happen randomly? If the file is the same and you just try to read it again, would this be a Parquet issue (I mean, does it look like it possibly failed to read a page header in Thrift format)?


was (Author: hyukjin.kwon):
Just a question.. Does this just happen randomly? if the file was the same and 
you just tried to read it, would this be a Parquet's issue (I mean does it look 
possibly failed to read a page in thrift format)?

> can not read class org.apache.parquet.format.PageHeader: don't know what 
> type: 13
> -
>
> Key: SPARK-11844
> URL: https://issues.apache.org/jira/browse/SPARK-11844
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Minor
>
> I got the following error once when I was running a query
> {code}
> java.io.IOException: can not read class org.apache.parquet.format.PageHeader: 
> don't know what type: 13
>   at org.apache.parquet.format.Util.read(Util.java:216)
>   at org.apache.parquet.format.Util.readPageHeader(Util.java:65)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:534)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:546)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:496)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:704)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:704)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: parquet.org.apache.thrift.protocol.TProtocolException: don't know 
> what type: 13
>   at 
> parquet.org.apache.thrift.protocol.TCompactProtocol.getTType(TCompactProtocol.java:806)
>   at 
> parquet.org.apache.thrift.protocol.TCompactProtocol.readFieldBegin(TCompactProtocol.java:500)
>   at 
> org.apache.parquet.format.InterningProtocol.readFieldBegin(InterningProtocol.java:158)
>   at 
> parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:108)
> {code}
> The next retry was