Hello,

I'm running a Spark job on AWS EMR that reads many LZO files from an S3
bucket partitioned by date.
Sometimes I see errors in the logs similar to:

18/04/13 11:53:52 WARN TaskSetManager: Lost task 151177.0 in stage
43.0 (TID 1516123, ip-10-10-2-6.ec2.internal, executor 57):
java.io.IOException: Corrupted uncompressed block
        at 
com.hadoop.compression.lzo.LzopInputStream.verifyChecksums(LzopInputStream.java:219)
        at 
com.hadoop.compression.lzo.LzopInputStream.getCompressedData(LzopInputStream.java:284)
        at 
com.hadoop.compression.lzo.LzopInputStream.decompress(LzopInputStream.java:261)


I don't see the jobs fail, so I assume the task succeeded when it was
retried. If an input file were actually corrupted, even the task retries
should fail, and eventually the job would fail based on the
"spark.task.maxFailures" config, right?
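For reference, here is a sketch of how I understand that config would be raised at submit time (the value 8, and the class/jar names, are purely illustrative):

```shell
# Illustrative only: raise the per-task retry limit at submit time.
# Default is 4; once a task fails this many times, the whole job fails.
spark-submit \
  --conf spark.task.maxFailures=8 \
  --class com.example.MyJob \
  my-job.jar
```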

Is there a way to make the Spark/Hadoop LZO library print the full file
name when such failures happen? That would let me manually check whether
the file is indeed corrupted.
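As a first manual sanity check, this is the kind of thing I had in mind (a minimal sketch: it only validates the lzop magic header, so a file can pass this and still contain a corrupted compressed block — `lzop -t <file>` would be the real end-to-end test):

```python
# Minimal sketch: check whether a file begins with the lzop magic bytes.
# This only catches grossly damaged or truncated files; it does not
# verify the per-block checksums that LzopInputStream.verifyChecksums
# complains about in the stack trace above.
LZOP_MAGIC = b"\x89LZO\x00\x0d\x0a\x1a\x0a"

def looks_like_lzop(path):
    """Return True if the file starts with the lzop magic header."""
    with open(path, "rb") as f:
        return f.read(len(LZOP_MAGIC)) == LZOP_MAGIC
```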

Thanks,
Srikanth
