Hi Jinxin,

The Spark web UI shows that all tasks completed successfully, but this error appears in the shell:

    java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:244)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:733)

More information can be seen here:
https://stackoverflow.com/questions/61202566/spark-sql-datasetrow-collect-to-driver-throw-java-io-eofexception-premature-e

I suspect a problem with deserialization, because after the web UI shows that the collect() tasks have completed, the memory used by the spark-submit process keeps increasing. After a few minutes the memory usage stops growing, and a few minutes after that the shell reports this error.

Best regards,
maqy
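[Editor's note: one way to sidestep materializing the whole 12 GB result on the driver at once is `Dataset.toLocalIterator()`, which fetches one partition at a time. A minimal sketch, assuming a CSV source like the one in the original mail; the path and processing are placeholders, not the poster's actual code:]

```scala
import org.apache.spark.sql.{Row, SparkSession}

object CollectSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("collect-sketch")
      .getOrCreate()

    // Same kind of load as in the original mail; path is a placeholder.
    val df = spark.read.format("csv").load("hdfs://master:9000/mydata")

    // toLocalIterator() streams partitions to the driver one at a time,
    // so the driver only needs memory for a single partition instead of
    // the whole result that collect() would hold at once.
    val it: java.util.Iterator[Row] = df.toLocalIterator()
    while (it.hasNext) {
      val row = it.next()
      // process row here
    }

    spark.stop()
  }
}
```

Note that `toLocalIterator()` still runs one job per partition fetch, so it trades driver memory for extra scheduling overhead.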
From: Tang Jinxin
Sent: 2020-04-22 23:16
To: maqy
Cc: user@spark.apache.org
Subject: Re: [Spark SQL] [Beginner] Dataset[Row] collect to driver throws java.io.EOFException: Premature EOF: no length prefix available

Maybe the datanode stopped the data transfer due to a timeout. Could you please provide the exception stack?

xiaoxingstack
Email: xiaoxingst...@gmail.com

On 2020-04-22 19:53, maqy wrote:

> Today I met the same problem using rdd.collect(); the rdd's element type is Tuple2[Int, Int]. The problem appears when the amount of data reaches about 100 GB. I guess there may be something wrong with deserialization. Has anyone else encountered this problem?
>
> Best regards,
> maqy
>
> From: maqy1...@outlook.com
> Sent: 2020-04-20 10:33
> To: user@spark.apache.org
> Subject: [Spark SQL] [Beginner] Dataset[Row] collect to driver throws java.io.EOFException: Premature EOF: no length prefix available
>
> Hi all,
>
> I get a Dataset[Row] through the following code:
>
>     val df: Dataset[Row] = spark.read.format("csv").schema(schema).load("hdfs://master:9000/mydata")
>
> After that I want to collect it to the driver:
>
>     val df_rows: Array[Row] = df.collect()
>
> The Spark web UI shows that all tasks have run successfully, but the application does not stop. After more than ten minutes, an error is generated in the shell:
>
>     java.io.EOFException: Premature EOF: no length prefix available
>
> Environment:
> Spark 2.4.3
> Hadoop 2.7.7
> Total rows of data: about 800,000,000 (12 GB)
>
> More detailed information can be seen here:
> https://stackoverflow.com/questions/61202566/spark-sql-datasetrow-collect-to-driver-throw-java-io-eofexception-premature-e
>
> Does anyone know the reason?
>
> Best regards,
> maqy
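[Editor's note: if Jinxin's datanode-timeout theory holds, the relevant HDFS socket timeouts can be raised in hdfs-site.xml. A sketch with illustrative values only, not verified against this cluster; both properties exist in Hadoop 2.7:]

```xml
<!-- hdfs-site.xml: illustrative timeout values, not a verified fix -->
<configuration>
  <property>
    <name>dfs.client.socket-timeout</name>
    <!-- HDFS client read timeout in ms (default 60000) -->
    <value>300000</value>
  </property>
  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <!-- datanode write timeout in ms (default 480000) -->
    <value>600000</value>
  </property>
</configuration>
```

These would need to be set on both the client and the datanodes, and the datanodes restarted, for the change to take effect.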