[jira] [Comment Edited] (SPARK-11844) can not read class org.apache.parquet.format.PageHeader: don't know what type: 13

2021-05-13 Thread Sungwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-11844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344105#comment-17344105
 ] 

Sungwon edited comment on SPARK-11844 at 5/13/21, 7:58 PM:
---

I'm in agreement with your intuition that S3aOutputStream may be closing 
without raising an underlying exception.

 

I've primarily seen this issue when writing large Parquet files (typically bigger than 1 or 2 GB), and I have confirmed that these files are corrupt when read back through the Spark/Parquet reader.

Interestingly enough, Hadoop 2.9 has a similar issue with its S3aInputStream, except that it actually throws an exception and closes the stream. The root cause on the input-stream side is that the S3Client object gets GC'ed prematurely even while the file read is still in progress. I'm curious whether something similar is happening with S3aOutputStream, but instead of surfacing the exception and closing the stream, it actually continues the write and closes it without an exception.

[https://stackoverflow.com/questions/22673698/local-variable-in-method-being-garbage-collected-before-method-exits-java]
 
[https://github.com/apache/hadoop/blob/rel/release-2.9.0/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java]
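
To illustrate the kind of premature-GC behaviour the StackOverflow link above describes, here is a minimal, hypothetical sketch (not Hadoop or AWS SDK code): an object becomes eligible for collection as soon as it is no longer strongly reachable, even while the method that created it is still running, so anything torn down on its collection (for example a connection pool inside an S3 client) can disappear while dependent I/O is still in flight.
{code:java}
import java.lang.ref.WeakReference;

// Hypothetical sketch (not Hadoop/AWS SDK code) of premature GC: once `client`
// is no longer strongly reachable, the JVM may collect it even though the
// surrounding method has not returned yet.
public class PrematureGcSketch {

    public static void main(String[] args) throws InterruptedException {
        Object client = new Object();                      // stands in for an S3 client
        WeakReference<Object> ref = new WeakReference<>(client);

        // Nothing below uses `client` any more, so it is eligible for collection
        // from this point on, even though the method is still running.
        client = null;
        System.gc();
        Thread.sleep(100);

        System.out.println("client collected: " + (ref.get() == null));
        // In real code, keeping a strong reference for the lifetime of the
        // dependent stream (or Reference.reachabilityFence on Java 9+) prevents this.
    }
}
{code}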


was (Author: sungwy92):
I'm in agreement with your intuition that S3aOutputStream may be closing 
without raising an underlying exception.

 

I've primarily seen this issue only when writing large parquet files (tends to 
be bigger than 1 or 2Gb), and have confirmed that these files are corrupt upon 
read through spark / parquet reader.

And interestingly enough, Hadoop 2.9 has a similar issue with their 
S3aInputStream, just that it actually throws an exception and closes the 
stream. The root cause in the inputstream is that the S3Client object gets 
GC'ed prematurely even when the file read is still happening. I'm curious if 
something similar is happening with S3aOutputStream, but instead of detecting 
the exception of closing the stream, it is actually continuing the write and 
closing it without an exception.

[https://stackoverflow.com/questions/22673698/local-variable-in-method-being-garbage-collected-before-method-exits-java]
 
[https://github.com/apache/hadoop/blob/rel/release-2.9.0/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java]

> can not read class org.apache.parquet.format.PageHeader: don't know what 
> type: 13
> -
>
> Key: SPARK-11844
> URL: https://issues.apache.org/jira/browse/SPARK-11844
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Minor
>  Labels: bulk-closed
>
> I got the following error once when I was running a query
> {code}
> java.io.IOException: can not read class org.apache.parquet.format.PageHeader: 
> don't know what type: 13
>   at org.apache.parquet.format.Util.read(Util.java:216)
>   at org.apache.parquet.format.Util.readPageHeader(Util.java:65)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:534)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:546)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:496)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
>  


[jira] [Comment Edited] (SPARK-11844) can not read class org.apache.parquet.format.PageHeader: don't know what type: 13

2021-05-13 Thread Sungwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-11844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344105#comment-17344105
 ] 

Sungwon edited comment on SPARK-11844 at 5/13/21, 7:57 PM:
---

I'm in agreement with your intuition that S3aOutputStream may be closing 
without raising an underlying exception.

 

I've primarily seen this issue only when writing large parquet files (tends to 
be bigger than 1 or 2Gb), and have confirmed that these files are corrupt upon 
read through spark / parquet reader.

And interestingly enough, Hadoop 2.9 has a similar issue with their 
S3aInputStream, just that it actually throws an exception and closes the 
stream. The root cause in the inputstream is that the S3Client object gets 
GC'ed prematurely even when the file read is still happening. I'm curious if 
something similar is happening with S3aOutputStream, but instead of detecting 
the exception of closing the stream, it is actually continuing the write and 
closing it without an exception.

[https://stackoverflow.com/questions/22673698/local-variable-in-method-being-garbage-collected-before-method-exits-java]
 
[https://github.com/apache/hadoop/blob/rel/release-2.9.0/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java]


was (Author: sungwy92):
I'm in agreement with your intuition that S3aOutputStream may be closing 
without raising an underlying exception.

 

I've primarily seen this issue only when writing large parquet files (tends to 
be bigger than 1 or 2Gb), and have confirmed that these files are corrupt upon 
read through spark / parquet reader.

And interestingly enough, Hadoop 2.9 has a similar issue with their 
S3aInputStream, just that it actually throws an exception and closes the 
stream. The root cause in the inputstream is that the S3Client object gets 
GC'ed prematurely even when the file read is still happening. I'm curious if 
something similar is happening with S3aOutputStream, but instead of detecting 
the exception of closing the stream, it is actually continuing the write and 
closing it without an exception.

[https://stackoverflow.com/questions/22673698/local-variable-in-method-being-garbage-collected-before-method-exits-java]
[https://github.com/apache/hadoop/blob/rel/release-2.9.0/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java]

> can not read class org.apache.parquet.format.PageHeader: don't know what 
> type: 13
> -
>
> Key: SPARK-11844
> URL: https://issues.apache.org/jira/browse/SPARK-11844
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Minor
>  Labels: bulk-closed
>
> I got the following error once when I was running a query
> {code}
> java.io.IOException: can not read class org.apache.parquet.format.PageHeader: 
> don't know what type: 13
>   at org.apache.parquet.format.Util.read(Util.java:216)
>   at org.apache.parquet.format.Util.readPageHeader(Util.java:65)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:534)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:546)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:496)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
>   

[jira] [Comment Edited] (SPARK-11844) can not read class org.apache.parquet.format.PageHeader: don't know what type: 13

2021-05-13 Thread Sungwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-11844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344095#comment-17344095
 ] 

Sungwon edited comment on SPARK-11844 at 5/13/21, 7:42 PM:
---

[~hryhoriev.nick], where do you see this option: *fast.data.upload*?

I don't see it in Hadoop's core-default.xml or in the hadoop-aws module: 
[https://github.com/apache/hadoop/blob/027c8fb257eb5144a4cee42341bf6b774c0fd8d1/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Constants.java#L318]


was (Author: sungwy92):
[~hryhoriev.nick] where do you see this option?

I don't see it in hadoop core-default.xml or on hadoop-aws module: 
https://github.com/apache/hadoop/blob/027c8fb257eb5144a4cee42341bf6b774c0fd8d1/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/Constants.java#L318

> can not read class org.apache.parquet.format.PageHeader: don't know what 
> type: 13
> -
>
> Key: SPARK-11844
> URL: https://issues.apache.org/jira/browse/SPARK-11844
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Minor
>  Labels: bulk-closed
>
> I got the following error once when I was running a query
> {code}
> java.io.IOException: can not read class org.apache.parquet.format.PageHeader: 
> don't know what type: 13
>   at org.apache.parquet.format.Util.read(Util.java:216)
>   at org.apache.parquet.format.Util.readPageHeader(Util.java:65)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:534)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:546)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:496)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:704)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:704)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: parquet.org.apache.thrift.protocol.TProtocolException: don't know 
> what type: 13
>   at 
> parquet.org.apache.thrift.protocol.TCompactProtocol.getTType(TCompactProtocol.java:806)
>   at 
> parquet.org.apache.thrift.protocol.TCompactProtocol.readFieldBegin(TCompactProtocol.java:500)
>   at 
> 

[jira] [Comment Edited] (SPARK-11844) can not read class org.apache.parquet.format.PageHeader: don't know what type: 13

2021-04-14 Thread Xu Guang Lv (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-11844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17320926#comment-17320926
 ] 

Xu Guang Lv edited comment on SPARK-11844 at 4/14/21, 11:31 AM:


I got the same error too, but I can definitely reproduce it each time I run my job. Is there any patch that can solve this issue?

Here is my scenario:
https://github.com/apache/hudi/issues/2812


was (Author: xu_guang_lv):
I got the same error too, but I can definitely reproduce the error each time I 
ran my job. Is there any patch can solve this issue?

> can not read class org.apache.parquet.format.PageHeader: don't know what 
> type: 13
> -
>
> Key: SPARK-11844
> URL: https://issues.apache.org/jira/browse/SPARK-11844
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Minor
>  Labels: bulk-closed
>
> I got the following error once when I was running a query
> {code}
> java.io.IOException: can not read class org.apache.parquet.format.PageHeader: 
> don't know what type: 13
>   at org.apache.parquet.format.Util.read(Util.java:216)
>   at org.apache.parquet.format.Util.readPageHeader(Util.java:65)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:534)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:546)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:496)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:704)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:704)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: parquet.org.apache.thrift.protocol.TProtocolException: don't know 
> what type: 13
>   at 
> parquet.org.apache.thrift.protocol.TCompactProtocol.getTType(TCompactProtocol.java:806)
>   at 
> parquet.org.apache.thrift.protocol.TCompactProtocol.readFieldBegin(TCompactProtocol.java:500)
>   at 
> org.apache.parquet.format.InterningProtocol.readFieldBegin(InterningProtocol.java:158)
>   at 
> parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:108)
> {code}
> The next retry was good. Right now, seems not critical. But, let's still 
> track 

[jira] [Comment Edited] (SPARK-11844) can not read class org.apache.parquet.format.PageHeader: don't know what type: 13

2021-04-12 Thread Nick Hryhoriev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-11844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317066#comment-17317066
 ] 

Nick Hryhoriev edited comment on SPARK-11844 at 4/12/21, 3:05 PM:
--

I have experienced the same issue with Spark 2.4.7 and 3.1.1, with the Spark Structured Streaming FileSink and DeltaSink, writing to S3 using the S3A API.
 Specifically, the writer used the S3 multipart upload, which is hidden from the Spark developer inside the S3A API (hadoop-aws).
 Configuration for s3a:

.set("spark.hadoop.fs.s3a.threads.max", "128")
 .set("spark.hadoop.fs.s3a.connection.maximum", "500")
 .set("spark.hadoop.fs.s3a.max.total.tasks", "2500")
 .set("spark.hadoop.fs.s3a.multipart.threshold", "104857600")
 .set("spark.hadoop.fs.s3a.multipart.size", "104857600")

"spark.hadoop.fs.s3a.fast.upload": "true",
 "spark.hadoop.fs.s3a.fast.upload.buffer": "bytebuffer",
 "spark.hadoop.fs.s3a.fast.upload.active.blocks": "4"

It has happened twice in production over 5 months.
 I was not able to reproduce it, because we write near 1 files every hour, files are added every 10 minutes, and only two were damaged.

One file has a PageHeader: null exception.
 One file has PageHeader: unknown type -15.
 And this issue is always reproduced on these files.
 The footer of the file is OK, which means I can read the schema and do operations that require only statistics, like count.

The data itself does not affect this. We have a backup of the data and successfully re-ran the ETL on it; the new files are OK.

My personal *intuition* suggests that the problem is somewhere in the exception handling in ParquetWriter + S3aOutputStream.
 Why?
 1. The S3 SDK can confirm or abort an upload, because the upload is atomic.
 2. The Hadoop FileSystem API, at the FSDataOutputStream level, can only close the stream.
 So in case of a high-level exception in Spark or Parquet, it just does not close the stream.
 But because it is a stream, and part of the java.io decorator pattern, multiple streams wrap each other, so maybe somewhere there is a `try finally` block which calls `close`.
 That would commit the wrong file and hide the underlying exception.
 It's only a guess; I was not able to confirm it with code or reproduce it.
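
A minimal sketch of the failure mode I am guessing at (hypothetical code, not the actual Spark or Hadoop implementation): if `write` fails and a plain `finally` block still calls `close`, the close may complete whatever was buffered, and if `close` itself throws, its exception replaces the original write failure, so the more meaningful error is lost.
{code:java}
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical sketch of a write failure being masked by close() in a finally
// block; illustrative only, not Spark/Hadoop source.
public class MaskedWriteFailureSketch {

    static void copyPages(OutputStream out, byte[][] pages) throws IOException {
        try {
            for (byte[] page : pages) {
                out.write(page);   // suppose this throws halfway through
            }
        } finally {
            // Runs even after the write failed: it may still flush and "commit" a
            // truncated file, and if it throws, its exception replaces (hides)
            // the original write exception.
            out.close();
        }
    }
}
{code}
Spark's `Utils.tryWithSafeFinallyAndFailureCallbacks` is designed to avoid exactly this kind of suppression, which is why the `SparkHadoopWriter` write path looks interesting to me.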

FYI [~hyukjin.kwon]


was (Author: hryhoriev.nick):
I have experienced the same issue with spark 2.4.7 and 3.1.1. with Spark 
Structure Stream FileSink and DeltaSink.  writing to s3 using S3a API.
 This specific was a writer with the s3 multipart upload, which hidden for 
spark developer in S3a API(Hadoop-AWS).
 Configuration for s3a

.set("spark.hadoop.fs.s3a.threads.max", "128")
 .set("spark.hadoop.fs.s3a.connection.maximum", "500")
 .set("spark.hadoop.fs.s3a.max.total.tasks", "2500")
 .set("spark.hadoop.fs.s3a.multipart.threshold", "104857600")
 .set("spark.hadoop.fs.s3a.multipart.size", "104857600")

"spark.hadoop.fs.s3a.fast.upload": "true",
 "spark.hadoop.fs.s3a.fast.upload.buffer": "bytebuffer",
 "spark.hadoop.fs.s3a.fast.upload.active.blocks": "4"

It happens twice in production for 5 months.
 I was not lucky to reproduce it. 
 Because we write near 1 files every hour. files add every 10 minutes,
 And only two damaged.

One file has PageHeader: Null exception.
 One file has PageHeader: unknown type -15
 And this issue always reproduced on these files.
 The Footer of the file is OK, which means I can read schema and do an 
operation that requires only statistics like count.

Data itself do not affect this. We have a backup of data and successfully re 
ETL this data. New files are ok.

My personal *intuition* suggests that problem somewhere in reception handling 
in ParquetWriter + S3aOutputStream.
 Why? 
 1. S3 SDK can confirm or abort upload. because it's atomic
 2. While Hadoop Files system API on FsDataOutputStream level, can only close 
the stream.
 So in case of high-level exception in spark or parquet, it just does not close 
stream.
 But because it's Stream, which is part of the java.io decorator pattern. 
Multiple streams wrap each other. So maybe somewhere there is a `try finally` 
block which calls `close`.
 Which will commit the wrong file and hide the underlying exception.
 It's only a guess, I was not able to confirm it with code or reproduce it.

the most Strange place which I find is `SparkHadoopWriter` -> line 129.

 
{code:java}
// Write all rows in RDD partition.
try {
 val ret = Utils.tryWithSafeFinallyAndFailureCallbacks {
 while (iterator.hasNext) {
 val pair = iterator.next()
 config.write(pair)

 // Update bytes written metric every few records
 maybeUpdateOutputMetrics(outputMetrics, callback, recordsWritten)
 recordsWritten += 1
 }

 config.closeWriter(taskContext)
 committer.commitTask(taskContext)
 }{code}
And here is the java-doc for tryWithSafeFinallyAndFailureCallbacks.
{code:java}
/**
 * Execute a block of code and call the failure callbacks in the catch block. 
If exceptions occur
 * in either the catch or the finally block, they are appended to the list of 

[jira] [Comment Edited] (SPARK-11844) can not read class org.apache.parquet.format.PageHeader: don't know what type: 13

2021-04-08 Thread Nick Hryhoriev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-11844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317066#comment-17317066
 ] 

Nick Hryhoriev edited comment on SPARK-11844 at 4/8/21, 2:22 PM:
-

I have experienced the same issue with spark 2.4.7 and 3.1.1. with Spark 
Structure Stream FileSink and DeltaSink.  writing to s3 using S3a API.
 This specific was a writer with the s3 multipart upload, which hidden for 
spark developer in S3a API(Hadoop-AWS).
 Configuration for s3a

.set("spark.hadoop.fs.s3a.threads.max", "128")
 .set("spark.hadoop.fs.s3a.connection.maximum", "500")
 .set("spark.hadoop.fs.s3a.max.total.tasks", "2500")
 .set("spark.hadoop.fs.s3a.multipart.threshold", "104857600")
 .set("spark.hadoop.fs.s3a.multipart.size", "104857600")

"spark.hadoop.fs.s3a.fast.upload": "true",
 "spark.hadoop.fs.s3a.fast.upload.buffer": "bytebuffer",
 "spark.hadoop.fs.s3a.fast.upload.active.blocks": "4"

It happens twice in production for 5 months.
 I was not lucky to reproduce it. 
 Because we write near 1 files every hour. files add every 10 minutes,
 And only two damaged.

One file has PageHeader: Null exception.
 One file has PageHeader: unknown type -15
 And this issue always reproduced on these files.
 The Footer of the file is OK, which means I can read schema and do an 
operation that requires only statistics like count.

Data itself do not affect this. We have a backup of data and successfully re 
ETL this data. New files are ok.

My personal *intuition* suggests that problem somewhere in reception handling 
in ParquetWriter + S3aOutputStream.
 Why? 
 1. S3 SDK can confirm or abort upload. because it's atomic
 2. While Hadoop Files system API on FsDataOutputStream level, can only close 
the stream.
 So in case of high-level exception in spark or parquet, it just does not close 
stream.
 But because it's Stream, which is part of the java.io decorator pattern. 
Multiple streams wrap each other. So maybe somewhere there is a `try finally` 
block which calls `close`.
 Which will commit the wrong file and hide the underlying exception.
 It's only a guess, I was not able to confirm it with code or reproduce it.

the most Strange place which I find is `SparkHadoopWriter` -> line 129.

 
{code:java}
// Write all rows in RDD partition.
try {
 val ret = Utils.tryWithSafeFinallyAndFailureCallbacks {
 while (iterator.hasNext) {
 val pair = iterator.next()
 config.write(pair)

 // Update bytes written metric every few records
 maybeUpdateOutputMetrics(outputMetrics, callback, recordsWritten)
 recordsWritten += 1
 }

 config.closeWriter(taskContext)
 committer.commitTask(taskContext)
 }{code}
And here is the java-doc for tryWithSafeFinallyAndFailureCallbacks.
{code:java}
/**
 * Execute a block of code and call the failure callbacks in the catch block. 
If exceptions occur
 * in either the catch or the finally block, they are appended to the list of 
suppressed
 * exceptions in original exception which is then rethrown.
 *
 * This is primarily an issue with `catch { abort() }` or `finally { 
out.close() }` blocks,
 * where the abort/close needs to be called to clean up `out`, but if an 
exception happened
 * in `out.write`, it's likely `out` may be corrupted and `abort` or 
`out.close` will
 * fail as well. This would then suppress the original/likely more meaningful
 * exception from the original `out.write` call.
 */{code}
FIY [~hyukjin.kwon]


was (Author: hryhoriev.nick):
I have experienced the same issue with spark 2.4.7 and 3.1.1. with Spark 
Structure Stream FileSink and DeltaSink.  writing to s3 using S3a API.
 This specific was a writer with the s3 multipart upload, which hidden for 
spark developer in S3a API(Hadoop-AWS).
 Configuration for s3a

.set("spark.hadoop.fs.s3a.threads.max", "128")
 .set("spark.hadoop.fs.s3a.connection.maximum", "500")
 .set("spark.hadoop.fs.s3a.max.total.tasks", "2500")
 .set("spark.hadoop.fs.s3a.multipart.threshold", "104857600")
 .set("spark.hadoop.fs.s3a.multipart.size", "104857600")

"spark.hadoop.fs.s3a.fast.upload": "true",
 "spark.hadoop.fs.s3a.fast.upload.buffer": "bytebuffer",
 "spark.hadoop.fs.s3a.fast.upload.active.blocks": "4"

It happens twice in production for 5 months.
 I was not lucky to reproduce it. 
 Because we write near 1 files every hour. files add every 10 minutes,
 And only two damaged.

One file has PageHeader: Null exception.
 One file has PageHeader: unknown type -15
 And this issue always reproduced on these files.
 The Footer of the file is OK, which means I can read schema and do an 
operation that requires only statistics like count.

Data itself do not affect this. We have a backup of data and successfully re 
ETL this data. New files are ok.

My personal *intuition* suggests that problem somewhere in reception handling 
in ParquetWriter + S3aOutputStream.
 Why? 
 1. S3 SDK can confirm or abort upload. because it's atomic
 2. While Hadoop Files 

[jira] [Comment Edited] (SPARK-11844) can not read class org.apache.parquet.format.PageHeader: don't know what type: 13

2021-04-08 Thread Nick Hryhoriev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-11844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317066#comment-17317066
 ] 

Nick Hryhoriev edited comment on SPARK-11844 at 4/8/21, 11:50 AM:
--

I have experienced the same issue with spark 2.4.7 and 3.1.1. with Spark 
Structure Stream FileSink and DeltaSink.  writing to s3 using S3a API.
 This specific was a writer with the s3 multipart upload, which hidden for 
spark developer in S3a API(Hadoop-AWS).
 Configuration for s3a

.set("spark.hadoop.fs.s3a.threads.max", "128")
 .set("spark.hadoop.fs.s3a.connection.maximum", "500")
 .set("spark.hadoop.fs.s3a.max.total.tasks", "2500")
 .set("spark.hadoop.fs.s3a.multipart.threshold", "104857600")
 .set("spark.hadoop.fs.s3a.multipart.size", "104857600")

"spark.hadoop.fs.s3a.fast.upload": "true",
 "spark.hadoop.fs.s3a.fast.upload.buffer": "bytebuffer",
 "spark.hadoop.fs.s3a.fast.upload.active.blocks": "4"

It happens twice in production for 5 months.
 I was not lucky to reproduce it. 
 Because we write near 1 files every hour. files add every 10 minutes,
 And only two damaged.

One file has PageHeader: Null exception.
 One file has PageHeader: unknown type -15
 And this issue always reproduced on these files.
 The Footer of the file is OK, which means I can read schema and do an 
operation that requires only statistics like count.

Data itself do not affect this. We have a backup of data and successfully re 
ETL this data. New files are ok.

My personal *intuition* suggests that problem somewhere in reception handling 
in ParquetWriter + S3aOutputStream.
 Why? 
 1. S3 SDK can confirm or abort upload. because it's atomic
 2. While Hadoop Files system API on FsDataOutputStream level, can only close 
the stream.
 So in case of high-level exception in spark or parquet, it just does not close 
stream.
 But because it's Stream, which is part of the java.io decorator pattern. 
Multiple streams wrap each other. So maybe somewhere there is a `try finally` 
block which calls `close`.
 Which will commit the wrong file and hide the underlying exception.
 It's only a guess, I was not able to confirm it with code or reproduce it.

the most Strange place which I find is `SparkHadoopWriter` -> line 129.

 
{code:java}
// Write all rows in RDD partition.
try {
 val ret = Utils.tryWithSafeFinallyAndFailureCallbacks {
 while (iterator.hasNext) {
 val pair = iterator.next()
 config.write(pair)

 // Update bytes written metric every few records
 maybeUpdateOutputMetrics(outputMetrics, callback, recordsWritten)
 recordsWritten += 1
 }

 config.closeWriter(taskContext)
 committer.commitTask(taskContext)
 }{code}
And here is the java-doc for tryWithSafeFinallyAndFailureCallbacks.
{code:java}
/**
 * Execute a block of code and call the failure callbacks in the catch block. 
If exceptions occur
 * in either the catch or the finally block, they are appended to the list of 
suppressed
 * exceptions in original exception which is then rethrown.
 *
 * This is primarily an issue with `catch { abort() }` or `finally { 
out.close() }` blocks,
 * where the abort/close needs to be called to clean up `out`, but if an 
exception happened
 * in `out.write`, it's likely `out` may be corrupted and `abort` or 
`out.close` will
 * fail as well. This would then suppress the original/likely more meaningful
 * exception from the original `out.write` call.
 */{code}


was (Author: hryhoriev.nick):
I have experienced the same issue with spark 2.4.7 and 3.1.1. with Spark 
Structure Stream FileSink and DeltaSink.  writing to s3 using S3a API.
 This specific was a writer with the s3 multipart upload, which hidden for 
spark developer in S3a API(Hadoop-AWS).
Configuration for s3a

.set("spark.hadoop.fs.s3a.threads.max", "128")
.set("spark.hadoop.fs.s3a.connection.maximum", "500")
.set("spark.hadoop.fs.s3a.max.total.tasks", "2500")
.set("spark.hadoop.fs.s3a.multipart.threshold", "104857600")
.set("spark.hadoop.fs.s3a.multipart.size", "104857600")

"spark.hadoop.fs.s3a.fast.upload": "true",
"spark.hadoop.fs.s3a.fast.upload.buffer": "bytebuffer",
"spark.hadoop.fs.s3a.fast.upload.active.blocks": "4"


 It happens twice in production for 5 months.
 I was not lucky to reproduce it. 
 Because we write near 1 files every hour. files add every 10 minutes,
 And only two damaged.

One file has PageHeader: Null exception.
 One file has PageHeader: unknown type -15
 And this issue always reproduced on these files.
 The Footer of the file is OK, which means I can read schema and do an 
operation that requires only statistics like count.

Data itself do not affect this. We have a backup of data and successfully re 
ETL this data. New files are ok.

My personal *intuition* suggests that problem somewhere in reception handling 
in ParquetWriter + S3aOutputStream.
 Why? 
 1. S3 SDK can confirm or abort upload. because it's atomic
 2. While Hadoop Files system API on 

[jira] [Comment Edited] (SPARK-11844) can not read class org.apache.parquet.format.PageHeader: don't know what type: 13

2021-04-08 Thread Nick Hryhoriev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-11844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317066#comment-17317066
 ] 

Nick Hryhoriev edited comment on SPARK-11844 at 4/8/21, 10:51 AM:
--

I have experienced the same issue with spark 2.4.7 and 3.1.1. with Spark 
Structure Stream FileSink and DeltaSink.  writing to s3 using S3a API.
 This specific was a writer with the s3 multipart upload, which hidden for 
spark developer in S3a API(Hadoop-AWS).
Configuration for s3a

.set("spark.hadoop.fs.s3a.threads.max", "128")
.set("spark.hadoop.fs.s3a.connection.maximum", "500")
.set("spark.hadoop.fs.s3a.max.total.tasks", "2500")
.set("spark.hadoop.fs.s3a.multipart.threshold", "104857600")
.set("spark.hadoop.fs.s3a.multipart.size", "104857600")

"spark.hadoop.fs.s3a.fast.upload": "true",
"spark.hadoop.fs.s3a.fast.upload.buffer": "bytebuffer",
"spark.hadoop.fs.s3a.fast.upload.active.blocks": "4"


 It happens twice in production for 5 months.
 I was not lucky to reproduce it. 
 Because we write near 1 files every hour. files add every 10 minutes,
 And only two damaged.

One file has PageHeader: Null exception.
 One file has PageHeader: unknown type -15
 And this issue always reproduced on these files.
 The Footer of the file is OK, which means I can read schema and do an 
operation that requires only statistics like count.

Data itself do not affect this. We have a backup of data and successfully re 
ETL this data. New files are ok.

My personal *intuition* suggests that problem somewhere in reception handling 
in ParquetWriter + S3aOutputStream.
 Why? 
 1. S3 SDK can confirm or abort upload. because it's atomic
 2. While Hadoop Files system API on FsDataOutputStream level, can only close 
the stream.
 So in case of high-level exception in spark or parquet, it just does not close 
stream.
 But because it's Stream, which is part of the java.io decorator pattern. 
Multiple streams wrap each other. So maybe somewhere there is a `try finally` 
block which calls `close`.
 Which will commit the wrong file and hide the underlying exception.
 It's only a guess, I was not able to confirm it with code or reproduce it.


was (Author: hryhoriev.nick):
I have experienced the same issue with spark 2.4.7 and 3.1.1. with Spark 
Structure Stream FileSink and DeltaSink.  writing to s3 using S3a API.
This specific was a writer with the s3 multipart upload, which hidden for spark 
developer in S3a API(Hadoop-aws).
 It happens twice in production for 5 months.
 I was not lucky to reproduce it. 
 Because we write near 1 files every hour. files add every 10 minutes,
 And only two damaged.

One file has PageHeader: Null exception.
 One file has PageHeader: unknown type -15
 And this issue always reproduced on these files.
 The Footer of the file is OK, which means I can read schema and do an 
operation that requires only statistics like count.

Data itself do not affect this. We have a backup of data and successfully re 
ETL this data. New files are ok.

My personal *intuition* suggests that problem somewhere in reception handling 
in ParquetWriter + S3aOutputStream.
Why? 
1. S3 SDK can confirm or abort upload. because it's atomic
2. While Hadoop Files system API on FsDataOutputStream level, can only close 
the stream.
So in case of high-level exception in spark or parquet, it just does not close 
stream.
But because it's Stream, which is part of the java.io decorator pattern. 
Multiple streams wrap each other. So maybe somewhere there is a `try finally` 
block which calls `close`.
Which will commit the wrong file and hide the underlying exception.
It's only a guess, I was not able to confirm it with code or reproduce it.

> can not read class org.apache.parquet.format.PageHeader: don't know what 
> type: 13
> -
>
> Key: SPARK-11844
> URL: https://issues.apache.org/jira/browse/SPARK-11844
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Minor
>  Labels: bulk-closed
>
> I got the following error once when I was running a query
> {code}
> java.io.IOException: can not read class org.apache.parquet.format.PageHeader: 
> don't know what type: 13
>   at org.apache.parquet.format.Util.read(Util.java:216)
>   at org.apache.parquet.format.Util.readPageHeader(Util.java:65)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:534)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:546)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:496)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127)
>   at 
> 

[jira] [Comment Edited] (SPARK-11844) can not read class org.apache.parquet.format.PageHeader: don't know what type: 13

2021-04-08 Thread Nick Hryhoriev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-11844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317066#comment-17317066
 ] 

Nick Hryhoriev edited comment on SPARK-11844 at 4/8/21, 10:49 AM:
--

I have experienced the same issue with spark 2.4.7 and 3.1.1. with Spark 
Structure Stream FileSink and DeltaSink.  writing to s3 using S3a API.
This specific was a writer with the s3 multipart upload, which hidden for spark 
developer in S3a API(Hadoop-aws).
 It happens twice in production for 5 months.
 I was not lucky to reproduce it. 
 Because we write near 1 files every hour. files add every 10 minutes,
 And only two damaged.

One file has PageHeader: Null exception.
 One file has PageHeader: unknown type -15
 And this issue always reproduced on these files.
 The Footer of the file is OK, which means I can read schema and do an 
operation that requires only statistics like count.

Data itself do not affect this. We have a backup of data and successfully re 
ETL this data. New files are ok.

My personal *intuition* suggests that problem somewhere in reception handling 
in ParquetWriter + S3aOutputStream.
Why? 
1. S3 SDK can confirm or abort upload. because it's atomic
2. While Hadoop Files system API on FsDataOutputStream level, can only close 
the stream.
So in case of high-level exception in spark or parquet, it just does not close 
stream.
But because it's Stream, which is part of the java.io decorator pattern. 
Multiple streams wrap each other. So maybe somewhere there is a `try finally` 
block which calls `close`.
Which will commit the wrong file and hide the underlying exception.
It's only a guess, I was not able to confirm it with code or reproduce it.


was (Author: hryhoriev.nick):
I have experienced the same issue with spark 2.4.7 and 3.1.1. with Spark 
Structure Stream FileSink and DeltaSink.  writing to s3 using S3a api.
 It happens twice in production for 5 months.
 I was not lucky to reproduce it. 
 Because we write near 1 files every hour. files add every 10 minutes,
 And only two damaged.

One file has PageHeader: Null exception.
One file has PageHeader: unknown type -15
And this issue always reproduced on these files.
The Footer of the file is OK, which means I can read schema and do an operation 
that requires only statistics like count.

> can not read class org.apache.parquet.format.PageHeader: don't know what 
> type: 13
> -
>
> Key: SPARK-11844
> URL: https://issues.apache.org/jira/browse/SPARK-11844
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Minor
>  Labels: bulk-closed
>
> I got the following error once when I was running a query
> {code}
> java.io.IOException: can not read class org.apache.parquet.format.PageHeader: 
> don't know what type: 13
>   at org.apache.parquet.format.Util.read(Util.java:216)
>   at org.apache.parquet.format.Util.readPageHeader(Util.java:65)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:534)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:546)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:496)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
>   at 
> 

[jira] [Comment Edited] (SPARK-11844) can not read class org.apache.parquet.format.PageHeader: don't know what type: 13

2021-04-08 Thread Nick Hryhoriev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-11844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17317066#comment-17317066
 ] 

Nick Hryhoriev edited comment on SPARK-11844 at 4/8/21, 10:35 AM:
--

I have experienced the same issue with spark 2.4.7 and 3.1.1. with Spark 
Structure Stream FileSink and DeltaSink.  writing to s3 using S3a api.
 It happens twice in production for 5 months.
 I was not lucky to reproduce it. 
 Because we write near 1 files every hour. files add every 10 minutes,
 And only two damaged.

One file has PageHeader: Null exception.
One file has PageHeader: unknown type -15
And this issue always reproduced on these files.
The Footer of the file is OK, which means I can read schema and do an operation 
that requires only statistics like count.


was (Author: hryhoriev.nick):
I have experienced the same issue with spark 2.4.7 and 3.1.1. with Spark 
Structure Stream FileSink and DeltaSink. 
It happens twice in production for 5 months.
I was not lucky to reproduce it. 
Because we write near 1 files every hour. files add ever 10 minutes,
And only two damaged.

> can not read class org.apache.parquet.format.PageHeader: don't know what 
> type: 13
> -
>
> Key: SPARK-11844
> URL: https://issues.apache.org/jira/browse/SPARK-11844
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Minor
>  Labels: bulk-closed
>
> I got the following error once when I was running a query
> {code}
> java.io.IOException: can not read class org.apache.parquet.format.PageHeader: 
> don't know what type: 13
>   at org.apache.parquet.format.Util.read(Util.java:216)
>   at org.apache.parquet.format.Util.readPageHeader(Util.java:65)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:534)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:546)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:496)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:704)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:704)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: 

[jira] [Comment Edited] (SPARK-11844) can not read class org.apache.parquet.format.PageHeader: don't know what type: 13

2017-10-20 Thread Cosmin Lehene (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16213216#comment-16213216
 ] 

Cosmin Lehene edited comment on SPARK-11844 at 10/20/17 8:59 PM:
-

I'm seeing these with spark-2.2.1-SNAPSHOT
{noformat}
java.io.IOException: can not read class org.apache.parquet.format.PageHeader: 
don't know what type: 14
at org.apache.parquet.format.Util.read(Util.java:216)
at org.apache.parquet.format.Util.readPageHeader(Util.java:65)
at 
org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:835)
at 
org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:849)
at 
org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:700)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:225)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:166)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: don't 
know what type: 14
at 
shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.getTType(TCompactProtocol.java:806)
at 
shaded.parquet.org.apache.thrift.protocol.TCompactProtocol.readFieldBegin(TCompactProtocol.java:500)
at 
org.apache.parquet.format.InterningProtocol.readFieldBegin(InterningProtocol.java:158)
at org.apache.parquet.format.PageHeader.read(PageHeader.java:828)
at org.apache.parquet.format.Util.read(Util.java:213)
... 24 more
{noformat}
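
As a hedged aside on what the "don't know what type" number means (my own illustration, not Parquet or Thrift source): in Thrift's compact protocol the low nibble of each field-header byte encodes the field type, and only values up to 12 (STRUCT) are defined, so a garbage byte in the page header surfaces as an unknown-type error such as 13 or 14.
{code:java}
// Hypothetical illustration (not Thrift source): the compact protocol packs the
// field type into the low nibble of the field-header byte; defined types stop
// at 12 (STRUCT), so a corrupted byte shows up as "don't know what type: 13/14".
public class CompactTypeNibble {
    public static void main(String[] args) {
        byte corruptedHeader = (byte) 0x7E;        // an arbitrary corrupted byte
        int typeNibble = corruptedHeader & 0x0F;   // -> 14, outside the 0..12 range
        System.out.println("type nibble = " + typeNibble + " (valid: 0..12)");
    }
}
{code}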

They also seem to be accompanied by:

{noformat}
java.io.IOException: can not read class org.apache.parquet.format.PageHeader: 
Required field 'uncompressed_page_size' was not found in serialized data! 
Struct: PageHeader(type:null, uncompressed_page_size:0, compressed_page_size:0)
at org.apache.parquet.format.Util.read(Util.java:216)
at org.apache.parquet.format.Util.readPageHeader(Util.java:65)
at 
org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:835)
at 
org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:849)
at 
org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:700)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:225)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
at 

[jira] [Comment Edited] (SPARK-11844) can not read class org.apache.parquet.format.PageHeader: don't know what type: 13

2015-11-20 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15015497#comment-15015497
 ] 

Hyukjin Kwon edited comment on SPARK-11844 at 11/20/15 9:45 AM:


Just a question: does this just happen randomly? If the file is the same and you just try to read it again, would this be a Parquet issue (I mean, does it look like it possibly failed to read a page header in Thrift format)?


was (Author: hyukjin.kwon):
Just a question.. Does this just happen randomly? if the file was the same and 
you just tried to read it, would this be a Parquet's issue (I mean does it look 
possibly failed to read a page in thrift format)?

> can not read class org.apache.parquet.format.PageHeader: don't know what 
> type: 13
> -
>
> Key: SPARK-11844
> URL: https://issues.apache.org/jira/browse/SPARK-11844
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Minor
>
> I got the following error once when I was running a query
> {code}
> java.io.IOException: can not read class org.apache.parquet.format.PageHeader: 
> don't know what type: 13
>   at org.apache.parquet.format.Util.read(Util.java:216)
>   at org.apache.parquet.format.Util.readPageHeader(Util.java:65)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:534)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:546)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:496)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:127)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:208)
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:201)
>   at 
> org.apache.spark.rdd.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:168)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.joins.HashJoin$$anon$1.hasNext(HashJoin.scala:77)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:88)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:704)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:704)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: parquet.org.apache.thrift.protocol.TProtocolException: don't know 
> what type: 13
>   at 
> parquet.org.apache.thrift.protocol.TCompactProtocol.getTType(TCompactProtocol.java:806)
>   at 
> parquet.org.apache.thrift.protocol.TCompactProtocol.readFieldBegin(TCompactProtocol.java:500)
>   at 
> org.apache.parquet.format.InterningProtocol.readFieldBegin(InterningProtocol.java:158)
>   at 
> parquet.org.apache.thrift.protocol.TProtocolUtil.skip(TProtocolUtil.java:108)
> {code}
> The next retry was