dilip-k-m commented on pull request #4039: URL: https://github.com/apache/spark/pull/4039#issuecomment-635613051
I've hit the same issue in production and was able to reproduce it in our performance test environment. My conclusion: with the same cluster configuration, if a Spark job is fed input traffic at a growing rate, the Parquet files it writes after processing the feed can end up with corrupted footers. The probability of footer corruption increases when the input contains more unique values (i.e., the input file has more distinct field values and less redundancy). Increasing the number of write partitions also reduces this probability. I have found that Spark 2.x does not have this issue.
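For reference, a minimal sketch of the partition-count workaround described above. This is an illustration, not the reporter's actual job: the input/output paths and the partition count of 200 are placeholder assumptions, and it uses the Spark 2.x `SparkSession` API.

```scala
import org.apache.spark.sql.SparkSession

object RepartitionBeforeWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-footer-workaround")
      .getOrCreate()

    // Placeholder input path; any DataFrame source applies.
    val df = spark.read.json("/path/to/input")

    // Raising the partition count spreads the write across more, smaller
    // Parquet files, which the report above suggests lowers the chance of
    // a corrupted footer. The count of 200 is a placeholder.
    df.repartition(200)
      .write
      .parquet("/path/to/output") // placeholder output path

    spark.stop()
  }
}
```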