Can you check the log of the DN that is holding the specific block for any errors?

On Jan 27, 2014 8:37 PM, "John Lilley" <[email protected]> wrote:
> I am getting this perplexing error. Our YARN application launches tasks
> that attempt to simultaneously open a large number of files for merge.
> There seems to be a load threshold in terms of the number of simultaneous
> tasks attempting to open a set of HDFS files on a four-node cluster. The
> threshold is hit at 32 tasks, each opening 450 files. It is not hit at
> 16 tasks, each opening 250 files.
>
> The files are stored in HDFS with replication=1. I know that low
> replication leaves me open to node-failure issues, but bear with me:
> nothing is actually failing.
>
> I get this exception when attempting to open a file:
>
> org/apache/hadoop/fs/FSDataInputStream.read:org.apache.hadoop.hdfs.BlockMissingException:
> Could not obtain block:
> BP-1827033441-192.168.57.112-1384284857542:blk_1073964234_223411
> file=/rpdm/tmp/ProjectTemp_34_1/TempFolder_6/data00001_000003.dld
> org.apache.hadoop.hdfs.DFSInputStream.chooseDataNode(DFSInputStream.java:838)
> org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:889)
> org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1154)
> org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:77)
>
> However, the block is definitely *not* missing. I can run the following
> command continuously while all of this is going on:
>
> hdfs fsck /rpdm/tmp/ProjectTemp_34_1/TempFolder_6 -files -blocks -locations
>
> Well before the tasks start, it shows good files all around, including:
>
> /rpdm/tmp/ProjectTemp_34_1/TempFolder_6/data00001_000003.dld 144838614 bytes, 2 block(s): OK
> 0. BP-1827033441-192.168.57.112-1384284857542:blk_1073964208_223385 len=134217728 repl=1 [192.168.57.110:50010]
> 1. BP-1827033441-192.168.57.112-1384284857542:blk_1073964234_223411 len=10620886 repl=1 [192.168.57.110:50010]
>
> My application logs also show that *some* tasks are able to open the
> files for which a missing block is reported.
> In case you suspect it: the files are not being deleted. The fsck
> continues to show good status for these files well after the error is
> reported.
>
> I've also checked to ensure that the files are not being held open by
> their creators.
>
> This leads me to believe that I've hit an HDFS open-file limit of some
> kind. We can compensate pretty easily by doing a two-phase merge that
> opens far fewer files simultaneously, keeping a limited pool of open
> files, etc. However, I would still like to know what limit is being hit,
> and how best to predict that limit on various cluster configurations.
>
> Thanks,
> john
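The two-phase merge described above -- capping how many inputs are merged at once and merging the intermediate results in a later pass -- can be sketched in plain Java. This is a minimal illustration only: in-memory sorted lists stand in for the HDFS input streams, and the class, method, and parameter names (`TwoPhaseMerge`, `fanIn`) are invented for the example, not taken from the application in the thread.

```java
import java.util.*;

public class TwoPhaseMerge {
    // Merge any number of sorted lists while never holding more than
    // 'fanIn' inputs open at once. When there are more inputs than the
    // cap, merge them in groups first, then merge the (fewer)
    // intermediate results in later passes.
    static List<Integer> merge(List<List<Integer>> inputs, int fanIn) {
        while (inputs.size() > fanIn) {
            List<List<Integer>> next = new ArrayList<>();
            for (int i = 0; i < inputs.size(); i += fanIn) {
                int end = Math.min(i + fanIn, inputs.size());
                next.add(mergeGroup(inputs.subList(i, end)));
            }
            inputs = next;  // each pass shrinks the input count by ~fanIn x
        }
        return mergeGroup(inputs);
    }

    // Standard heap-based k-way merge of a small group of sorted lists.
    // Heap entries are {listIndex, elementIndex} pairs ordered by value.
    static List<Integer> mergeGroup(List<List<Integer>> group) {
        PriorityQueue<int[]> heap = new PriorityQueue<>(
            Comparator.comparingInt((int[] e) -> group.get(e[0]).get(e[1])));
        for (int i = 0; i < group.size(); i++)
            if (!group.get(i).isEmpty()) heap.add(new int[]{i, 0});
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] e = heap.poll();
            out.add(group.get(e[0]).get(e[1]));
            if (e[1] + 1 < group.get(e[0]).size())
                heap.add(new int[]{e[0], e[1] + 1});
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> runs = new ArrayList<>(Arrays.asList(
            Arrays.asList(1, 4, 9), Arrays.asList(2, 5),
            Arrays.asList(3, 8), Arrays.asList(6, 7)));
        // Fan-in capped at 2, so the four runs take two merge passes.
        System.out.println(merge(runs, 2));  // prints [1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}
```

The same shape works when the inputs are HDFS streams: phase one writes intermediate merged files, so the number of simultaneously open files stays bounded by the fan-in cap instead of growing with the total input count.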
