Hi, I'm facing a very strange error that occurs halfway through long-running Spark SQL jobs:
18/01/12 22:14:30 ERROR Utils: Aborting task
java.io.EOFException: reached end of stream after reading 0 bytes; 96 bytes expected
    at org.spark_project.guava.io.ByteStreams.readFully(ByteStreams.java:735)
    at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:127)
    at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:110)
    at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
    at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    (...)

Since I get this in several jobs, I wonder whether it might be a problem at the communication layer. Has anyone faced a similar problem?

It always happens in a job that shuffles about 200 GB and then reads it back in partitions of ~64 MB for a groupBy. What is also odd is that it only fails after more than 1000 partitions have been processed (16 cores on one node). I even tried changing the spark.shuffle.file.buffer setting, but that only seems to shift the point at which the error occurs. A rough sketch of the job shape is below my signature.

I would really appreciate some hints on what this could be, what to try or test, and how to debug it, as I feel pretty much blocked here.

Thanks in advance,
Fernando
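
For reference, the failing stage is roughly shaped like the sketch below. The paths, table, and column names are placeholders rather than the actual code; the partition count and buffer setting reflect what I described above (~200 GB shuffled into ~64 MB partitions):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("groupBy-shuffle-repro")
  // The buffer size I experimented with; changing it only moves the failure point.
  .config("spark.shuffle.file.buffer", "1m")
  // ~200 GB shuffled into roughly 64 MB partitions -> ~3200 shuffle partitions.
  .config("spark.sql.shuffle.partitions", "3200")
  .getOrCreate()

// Hypothetical input path; the real input is ~200 GB.
val events = spark.read.parquet("/data/events")

val aggregated = events
  .groupBy("customer_id")                          // triggers the ~200 GB shuffle
  .agg(count("*").as("n_events"), sum("amount").as("total_amount"))

// Hypothetical output path; the task aborts partway through this stage.
aggregated.write.parquet("/data/aggregated")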