RE: BlockMissingException reading HDFS file, but the block exists and fsck shows OK

John Lilley Mon, 27 Jan 2014 07:42:14 -0800

None of the datanode logs have error messages.

From: Harsh J [mailto:[email protected]]
Sent: Monday, January 27, 2014 8:15 AM
To: <[email protected]>
Subject: Re: BlockMissingException reading HDFS file, but the block exists and fsck shows OK

Can you check the log of the DN that is holding the specific block for any 
errors?
On Jan 27, 2014 8:37 PM, "John Lilley" <[email protected]> wrote:
I am getting this perplexing error.  Our YARN application launches tasks that 
attempt to simultaneously open a large number of files for merge.  There seems 
to be a load threshold in terms of number of simultaneous tasks attempting to 
open a set of HDFS files on a four-node cluster.  The threshold is hit at 32 
tasks, each opening 450 files.  The threshold is not hit at 16 tasks, each 
opening 250 files.
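
To put rough numbers on the two cases above, here is a small sketch of the arithmetic. It assumes every task runs at the same time and holds all of its files open at once; the per-node figure further assumes the reads spread evenly over the four datanodes, which may not match the real block placement. The function names are made up for illustration.

```python
def concurrent_opens(tasks: int, files_per_task: int) -> int:
    """Total input streams open at once, assuming all tasks overlap fully."""
    return tasks * files_per_task


def opens_per_node(total_opens: int, nodes: int) -> float:
    """Average streams per datanode, assuming an even spread (an assumption)."""
    return total_opens / nodes


failing = concurrent_opens(32, 450)   # the case that hits the threshold
passing = concurrent_opens(16, 250)   # the case that does not

print("failing case:", failing, "total,", opens_per_node(failing, 4), "per node")
print("passing case:", passing, "total,", opens_per_node(passing, 4), "per node")
```

So the failing case asks for roughly 3.6 times as many simultaneous streams as the passing one.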

The files are stored in HDFS with replication=1.  I know that low replication 
leaves me open to node-failure issues, but bear with me, nothing is actually 
failing.

I get this exception when attempting to open a file:
org/apache/hadoop/fs/FSDataInputStream.read:org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block: BP-1827033441-192.168.57.112-1384284857542:blk_1073964234_223411 file=/rpdm/tmp/ProjectTemp_34_1/TempFolder_6/data00001_000003.dld
    org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:838)
    org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:889)
    org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1154)
    org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:77)

However, the block is definitely *not* missing.  I can run the following 
command continuously while all of this is going on:
hdfs fsck /rpdm/tmp/ProjectTemp_34_1/TempFolder_6 -files -blocks -locations
Well before the tasks start it is showing good files all around, including:
/rpdm/tmp/ProjectTemp_34_1/TempFolder_6/data00001_000003.dld 144838614 bytes, 2 block(s):  OK
0. BP-1827033441-192.168.57.112-1384284857542:blk_1073964208_223385 len=134217728 repl=1 [192.168.57.110:50010]
1. BP-1827033441-192.168.57.112-1384284857542:blk_1073964234_223411 len=10620886 repl=1 [192.168.57.110:50010]

My application logs also show that *some* tasks are able to open the files for 
which a missing block is reported.
In case you suspect otherwise, the files are not being deleted.  The fsck 
continues to show good status for these files well after the error report.
I've also checked to ensure that the files are not being held open by the 
creators of the files.

This leads me to believe that I've hit an HDFS open-file limit of some kind.  
We can compensate pretty easily, by doing a two-phase merge that opens far 
fewer files simultaneously, keeping a limited pool of open files, etc.  
However, I would still like to know what limit is being hit, and how to best 
predict that limit on various cluster configurations.
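
As a rough illustration of the two-phase merge idea, here is a minimal sketch that caps how many sorted inputs are read at once. In-memory lists stand in for the HDFS files, and `merge_with_fan_in`, `passes_needed`, and `max_open` are hypothetical names for this example, not part of our application or of any Hadoop API.

```python
import heapq
from typing import List


def merge_with_fan_in(runs: List[List[int]], max_open: int) -> List[int]:
    """Merge sorted runs while reading at most `max_open` runs at a time."""
    if max_open < 2:
        raise ValueError("max_open must be at least 2")
    if not runs:
        return []
    while len(runs) > 1:
        next_runs = []
        # Each batch models one merge step that holds only `max_open` inputs open.
        for start in range(0, len(runs), max_open):
            batch = runs[start:start + max_open]
            next_runs.append(list(heapq.merge(*batch)))
        runs = next_runs
    return runs[0]


def passes_needed(num_runs: int, max_open: int) -> int:
    """Count the merge passes needed to reduce `num_runs` inputs to one."""
    passes = 0
    while num_runs > 1:
        num_runs = -(-num_runs // max_open)  # ceiling division
        passes += 1
    return passes


merged = merge_with_fan_in([[1, 4], [2, 5], [3, 6]], max_open=2)
print(merged)
print(passes_needed(450, 32))
```

With 450 inputs per task and a cap of 32 open files, the first pass produces 15 intermediate runs and the second pass merges those, so two passes are enough. The cost is that the data is written and read one extra time.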

Thanks,
john
