Kuhu Shukla created HDFS-9958:
---------------------------------

             Summary: BlockManager#createLocatedBlocks can throw NPE for corruptBlocks on failed storages.
                 Key: HDFS-9958
                 URL: https://issues.apache.org/jira/browse/HDFS-9958
             Project: Hadoop HDFS
          Issue Type: Bug
    Affects Versions: 2.7.2
            Reporter: Kuhu Shukla
            Assignee: Kuhu Shukla


When a corrupt replica resides on a failed storage and has not yet been 
removed from blocksMap, a race can cause a LocatedBlock to be created from a 
{{machines}} array that contains an unpopulated (null) element. 

The root cause is as follows:
{code}
final int numCorruptNodes = countNodes(blk).corruptReplicas();
{code}
countNodes only considers storages whose state is NORMAL, so when the corrupt 
replica is on a failed storage, that replica is not counted and 
numCorruptNodes comes out as zero. 
{code}
final int numNodes = blocksMap.numNodes(blk);
{code}
However, numNodes counts all nodes/storages irrespective of the state of the 
storage, so numMachines includes such (failed) nodes. The assert fails only 
if assertions are enabled; otherwise the code goes ahead and tries to create 
a LocatedBlock object from a {{machines}} array in which the slot for that 
replica was never populated.
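
The counting mismatch described above can be reproduced in isolation. The following is a minimal, self-contained sketch, not the BlockManager code: the class LocatedBlockRaceSketch, the Storage type, and the helpers buildMachines and toNames are hypothetical names made up for illustration. It also assumes that the fill loop skips corrupt replicas, which is what leaves a slot empty.

```java
import java.util.ArrayList;
import java.util.List;

public class LocatedBlockRaceSketch {
    enum State { NORMAL, FAILED }

    /** Hypothetical stand-in for a replica's storage; not Hadoop's DatanodeStorageInfo. */
    static final class Storage {
        final String name;
        final State state;
        final boolean corrupt;

        Storage(String name, State state, boolean corrupt) {
            this.name = name;
            this.state = state;
            this.corrupt = corrupt;
        }
    }

    /** Models the described countNodes behaviour: only NORMAL storages are examined. */
    static int countCorruptOnNormalStorages(List<Storage> storages) {
        int n = 0;
        for (Storage s : storages) {
            if (s.state == State.NORMAL && s.corrupt) {
                n++;
            }
        }
        return n;
    }

    /** Sizes and fills the machines array from the two inconsistent counts. */
    static Storage[] buildMachines(List<Storage> storages) {
        int numNodes = storages.size();  // every storage, whatever its state
        int numCorruptNodes = countCorruptOnNormalStorages(storages);
        int numMachines = numNodes - numCorruptNodes;
        Storage[] machines = new Storage[numMachines];
        int j = 0;
        for (Storage s : storages) {
            // Assumption: corrupt replicas are skipped when filling the array.
            if (!s.corrupt) {
                machines[j++] = s;
            }
        }
        // With assertions disabled, nothing checks that j == machines.length here.
        return machines;
    }

    /** Dereferences every element, as a conversion to datanode infos would. */
    static String[] toNames(Storage[] machines) {
        String[] names = new String[machines.length];
        for (int i = 0; i < machines.length; i++) {
            names[i] = machines[i].name;  // NPE if machines[i] was never populated
        }
        return names;
    }

    public static void main(String[] args) {
        List<Storage> storages = new ArrayList<>();
        storages.add(new Storage("s1", State.NORMAL, false));
        storages.add(new Storage("s2", State.NORMAL, false));
        storages.add(new Storage("s3", State.FAILED, true));  // corrupt replica on failed storage

        Storage[] machines = buildMachines(storages);
        System.out.println("machines.length = " + machines.length);
        System.out.println("last slot populated = " + (machines[machines.length - 1] != null));
        try {
            toNames(machines);
        } catch (NullPointerException e) {
            System.out.println("NullPointerException while reading machines");
        }
    }
}
```

In this model the corrupt replica on the failed storage is counted in numNodes but not in numCorruptNodes, so the array gets one slot more than the loop fills, and the first code that walks the array hits the null element.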

Here is the stack trace:
{code}
java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeStorageInfo.toDatanodeInfos(DatanodeStorageInfo.java:45)
        at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeStorageInfo.toDatanodeInfos(DatanodeStorageInfo.java:40)
        at org.apache.hadoop.hdfs.protocol.LocatedBlock.<init>(LocatedBlock.java:84)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlock(BlockManager.java:878)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlock(BlockManager.java:826)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlockList(BlockManager.java:799)
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlocks(BlockManager.java:899)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1849)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
        at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:588)
        at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
        at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
