Hi Jinxin,

The Spark web UI shows that all tasks completed successfully, but this error appears in the shell:

    java.io.EOFException: Premature EOF: no length prefix available
        at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:244)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
        at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:733)

More information can be seen here:
https://stackoverflow.com/questions/61202566/spark-sql-datasetrow-collect-to-driver-throw-java-io-eofexception-premature-e

I suspect a problem with deserialization, because after the web UI shows that the collect() tasks have completed, the memory used by the spark-submit process keeps increasing. After a few minutes the memory usage stops growing, and a few minutes after that the shell reports this error.

Best regards,
maqy
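[Editor's note: one way to sidestep materializing the whole 12 GB result on the driver at once is `Dataset.toLocalIterator()`, which fetches one partition at a time. A minimal sketch, assuming a CSV source like the one in the original mail; the path and processing are placeholders, not the poster's actual code:]

```scala
import org.apache.spark.sql.{Row, SparkSession}

object CollectSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("collect-sketch")
      .getOrCreate()

    // Same kind of load as in the original mail; path is a placeholder.
    val df = spark.read.format("csv").load("hdfs://master:9000/mydata")

    // toLocalIterator() streams partitions to the driver one at a time,
    // so the driver only needs memory for a single partition instead of
    // the whole result that collect() would hold at once.
    val it: java.util.Iterator[Row] = df.toLocalIterator()
    while (it.hasNext) {
      val row = it.next()
      // process row here
    }

    spark.stop()
  }
}
```

Note that `toLocalIterator()` still runs one job per partition fetch, so it trades driver memory for extra scheduling overhead.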
From: Tang Jinxin
Sent: 2020-04-22 23:16
To: maqy
Cc: user@spark.apache.org
Subject: Re: [Spark SQL] [Beginner] Dataset[Row] collect to driver throws java.io.EOFException: Premature EOF: no length prefix available

Maybe the datanode stopped the data transfer due to a timeout. Could you please provide the exception stack?

xiaoxingstack
Email: xiaoxingst...@gmail.com

On 2020-04-22 19:53, maqy wrote:

> Today I met the same problem using rdd.collect(); the rdd's element type is Tuple2[Int, Int]. The problem appears when the amount of data reaches about 100 GB. I guess there may be something wrong with deserialization. Has anyone else encountered this problem?
>
> Best regards,
> maqy
>
> From: maqy1...@outlook.com
> Sent: 2020-04-20 10:33
> To: user@spark.apache.org
> Subject: [Spark SQL] [Beginner] Dataset[Row] collect to driver throws java.io.EOFException: Premature EOF: no length prefix available
>
> Hi all,
>
> I get a Dataset[Row] through the following code:
>
>     val df: Dataset[Row] = spark.read.format("csv").schema(schema).load("hdfs://master:9000/mydata")
>
> After that I want to collect it to the driver:
>
>     val df_rows: Array[Row] = df.collect()
>
> The Spark web UI shows that all tasks have run successfully, but the application does not stop. After more than ten minutes, an error is generated in the shell:
>
>     java.io.EOFException: Premature EOF: no length prefix available
>
> Environment:
> Spark 2.4.3
> Hadoop 2.7.7
> Total rows of data: about 800,000,000 (12 GB)
>
> More detailed information can be seen here:
> https://stackoverflow.com/questions/61202566/spark-sql-datasetrow-collect-to-driver-throw-java-io-eofexception-premature-e
>
> Does anyone know the reason?
>
> Best regards,
> maqy
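[Editor's note: if Jinxin's datanode-timeout theory holds, the relevant HDFS socket timeouts can be raised in hdfs-site.xml. A sketch with illustrative values only, not verified against this cluster; both properties exist in Hadoop 2.7:]

```xml
<!-- hdfs-site.xml: illustrative timeout values, not a verified fix -->
<configuration>
  <property>
    <name>dfs.client.socket-timeout</name>
    <!-- HDFS client read timeout in ms (default 60000) -->
    <value>300000</value>
  </property>
  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <!-- datanode write timeout in ms (default 480000) -->
    <value>600000</value>
  </property>
</configuration>
```

These would need to be set on both the client and the datanodes, and the datanodes restarted, for the change to take effect.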