Can you check the log of the DN that is holding the specific block for any errors?

On Jan 27, 2014 8:37 PM, "John Lilley" <[email protected]> wrote:
> I am getting this perplexing error. Our YARN application launches tasks
> that attempt to simultaneously open a large number of files for merge.
> There seems to be a load threshold in terms of the number of simultaneous
> tasks attempting to open a set of HDFS files on a four-node cluster. The
> threshold is hit at 32 tasks, each opening 450 files. It is not hit at
> 16 tasks, each opening 250 files.
>
> The files are stored in HDFS with replication=1. I know that low
> replication leaves me open to node-failure issues, but bear with me:
> nothing is actually failing.
>
> I get this exception when attempting to open a file:
>
> org/apache/hadoop/fs/FSDataInputStream.read:org.apache.hadoop.hdfs.BlockMissingException:
> Could not obtain block:
> BP-1827033441-192.168.57.112-1384284857542:blk_1073964234_223411
> file=/rpdm/tmp/ProjectTemp_34_1/TempFolder_6/data00001_000003.dld
> org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:838)
> org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:889)
> org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1154)
> org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:77)
>
> However, the block is definitely *not* missing. I can run the following
> command continuously while all of this is going on:
>
> hdfs fsck /rpdm/tmp/ProjectTemp_34_1/TempFolder_6 -files -blocks -locations
>
> Well before the tasks start, it shows good files all around, including:
>
> /rpdm/tmp/ProjectTemp_34_1/TempFolder_6/data00001_000003.dld 144838614 bytes, 2 block(s): OK
> 0. BP-1827033441-192.168.57.112-1384284857542:blk_1073964208_223385 len=134217728 repl=1 [192.168.57.110:50010]
> 1. BP-1827033441-192.168.57.112-1384284857542:blk_1073964234_223411 len=10620886 repl=1 [192.168.57.110:50010]
>
> My application logs also show that *some* tasks are able to open the
> files for which a missing block is reported.
> In case you suspect it: the files are not being deleted. The fsck
> continues to show good status for these files well after the error is
> reported.
>
> I've also checked to ensure that the files are not being held open by
> their creators.
>
> This leads me to believe that I've hit an HDFS open-file limit of some
> kind. We can compensate pretty easily by doing a two-phase merge that
> opens far fewer files simultaneously, keeping a limited pool of open
> files, etc. However, I would still like to know what limit is being hit,
> and how best to predict that limit on various cluster configurations.
>
> Thanks,
> john
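The two-phase merge described above -- capping how many inputs are merged at once and merging the intermediate results in a later pass -- can be sketched in plain Java. This is a minimal illustration only: in-memory sorted lists stand in for the HDFS input streams, and the class, method, and parameter names (`TwoPhaseMerge`, `fanIn`) are invented for the example, not taken from the application in the thread.

```java
import java.util.*;

public class TwoPhaseMerge {
    // Merge any number of sorted lists while never holding more than
    // 'fanIn' inputs open at once. When there are more inputs than the
    // cap, merge them in groups first, then merge the (fewer)
    // intermediate results in later passes.
    static List<Integer> merge(List<List<Integer>> inputs, int fanIn) {
        while (inputs.size() > fanIn) {
            List<List<Integer>> next = new ArrayList<>();
            for (int i = 0; i < inputs.size(); i += fanIn) {
                int end = Math.min(i + fanIn, inputs.size());
                next.add(mergeGroup(inputs.subList(i, end)));
            }
            inputs = next;  // each pass shrinks the input count by ~fanIn x
        }
        return mergeGroup(inputs);
    }

    // Standard heap-based k-way merge of a small group of sorted lists.
    // Heap entries are {listIndex, elementIndex} pairs ordered by value.
    static List<Integer> mergeGroup(List<List<Integer>> group) {
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            Comparator.comparingInt((int[] e) -> group.get(e[0]).get(e[1])));
        for (int i = 0; i < group.size(); i++)
            if (!group.get(i).isEmpty()) heap.add(new int[]{i, 0});
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] e = heap.poll();
            out.add(group.get(e[0]).get(e[1]));
            if (e[1] + 1 < group.get(e[0]).size())
                heap.add(new int[]{e[0], e[1] + 1});
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> runs = new ArrayList<>(Arrays.asList(
            Arrays.asList(1, 4, 9), Arrays.asList(2, 5),
            Arrays.asList(3, 8), Arrays.asList(6, 7)));
        // Fan-in capped at 2, so the four runs take two merge passes.
        System.out.println(merge(runs, 2));  // prints [1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}
```

The same shape works when the inputs are HDFS streams: phase one writes intermediate merged files, so the number of simultaneously open files stays bounded by the fan-in cap instead of growing with the total input count.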
