Moving to cdh-user@cloudera.org
(https://groups.google.com/a/cloudera.org/forum/#!forum/cdh-user), as
this may be a CDH4-specific problem.

Could you share your whole DN log (from startup until the heartbeat
errors) please? I suspect it's a problem with DN registration, which the
log will help confirm.
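
In case it helps, a rough way to grab that section from a DataNode host
(the log directory and file name below are just guesses; adjust them to
wherever your DN logs actually live under your CM setup):

  ls -lt /var/log/hadoop-hdfs/
  sed -n '/STARTUP_MSG: Starting DataNode/,$p' /var/log/hadoop-hdfs/<your-DN-log-file>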

On Tue, Oct 30, 2012 at 4:40 PM, Takahiko Kawasaki <[email protected]> wrote:
> Hello,
>
> I have trouble in quorum-based HDFS HA of CDH 4.1.1.
>
> The NameNode Web UI of Cloudera Manager reports the NameNode status.
> It has a "Cluster Summary" section, and my cluster is summarized
> there as below.
>
> --- Cluster Summary ---
> Configured Capacity   : 0 KB
> DFS Used              : 0 KB
> Non DFS Used          : 0 KB
> DFS Remaining         : 0 KB
> DFS Used%             : 100 %
> DFS Remaining%        : 0 %
> Block Pool Used       : 0 KB
> Block Pool Used%      : 100 %
> DataNodes usages      : Min %  Median %  Max %  stdev %
>                           0 %       0 %    0 %      0 %
> Live Nodes            : 0 (Decommissioned: 0)
> Dead Nodes            : 5 (Decommissioned: 0)
> Decommissioning Nodes : 0
> --------------------
>
> As you can see, all the DataNodes are regarded as dead.
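>
> For reference, the same zero-live-nodes state can also be seen from the
> command line (a rough check, assuming the hdfs CLI is available on a
> cluster host):
>
> --------------------
> # sudo -u hdfs hdfs dfsadmin -report | head -20
> --------------------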
>
> I found that the DataNodes keep emitting logs about failing to
> send heartbeats to the NameNode.
>
> ---- DataNode Log (host names were manually edited) ---
> 2012-10-30 19:28:16,817 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: For namenode
> node02.example.com/192.168.62.232:8020 using DELETEREPORT_INTERVAL of
> 300000 msec  BLOCKREPORT_INTERVAL of 21600000msec Initial delay:
> 0msec; heartBeatInterval=3000
> 2012-10-30 19:28:16,817 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
> BPOfferService for Block pool
> BP-2063217961-192.168.62.231-1351263110470 (storage id
> DS-2090122187-192.168.62.233-50010-1338981658216) service to
> node02.example.com/192.168.62.232:8020
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:435)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:521)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:674)
>         at java.lang.Thread.run(Thread.java:662)
> --------------------
>
> So, I guess that the DataNodes are failing to locate the name service
> for some reason, but I don't have any clue how to solve the problem.
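>
> Here is the kind of check I can run from a DataNode host to try to rule
> out DNS or connectivity problems toward the NameNode RPC port (a rough
> sketch, assuming host and nc are installed there):
>
> --------------------
> # host node02.example.com
> # nc -z -v node02.example.com 8020
> --------------------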
>
> I confirmed that
> /var/run/cloudera-scm-agent/process/???-hdfs-DATANODE/core-site.xml
> of a DataNode contains
>
> --- core-site.xml ---
>   <property>
>     <name>fs.defaultFS</name>
>     <value>hdfs://nameservice1</value>
>   </property>
> --------------------
>
> and hdfs-site.xml contains
>
> --- hdfs-site.xml ---
>   <property>
>     <name>dfs.nameservices</name>
>     <value>nameservice1</value>
>   </property>
>   <property>
>     <name>dfs.client.failover.proxy.provider.nameservice1</name>
>     
> <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
>   </property>
>   <property>
>     <name>dfs.ha.namenodes.nameservice1</name>
>     <value>namenode38,namenode90</value>
>   </property>
>   <property>
>     <name>dfs.namenode.rpc-address.nameservice1.namenode38</name>
>     <value>node01.example.com:8020</value>
>   </property>
>   <property>
>     <name>dfs.namenode.http-address.nameservice1.namenode38</name>
>     <value>node01.example.com:50070</value>
>   </property>
>   <property>
>     <name>dfs.namenode.https-address.nameservice1.namenode38</name>
>     <value>node01.example.com:50470</value>
>   </property>
>   <property>
>     <name>dfs.namenode.rpc-address.nameservice1.namenode90</name>
>     <value>node02.example.com:8020</value>
>   </property>
>   <property>
>     <name>dfs.namenode.http-address.nameservice1.namenode90</name>
>     <value>node02.example.com:50070</value>
>   </property>
>   <property>
>     <name>dfs.namenode.https-address.nameservice1.namenode90</name>
>     <value>node02.example.com:50470</value>
>   </property>
>   <property>
>     <name>dfs.permissions.superusergroup</name>
>     <value>supergroup</value>
>   </property>
>   <property>
>     <name>dfs.replication</name>
>     <value>3</value>
>   </property>
>   <property>
>     <name>dfs.namenode.replication.min</name>
>     <value>1</value>
>   </property>
>   <property>
>     <name>dfs.replication.max</name>
>     <value>512</value>
>   </property>
> --------------------
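>
> As a rough sanity check (assuming the hdfs CLI on the DataNode picks up
> the same client configuration), the configured name service and its
> NameNodes can be listed like this:
>
> --------------------
> # hdfs getconf -confKey dfs.nameservices
> # hdfs getconf -namenodes
> --------------------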
>
> The following shows my attempt to create a file in HDFS, which failed.
>
> --------------------
> # vi /tmp/test.txt
> # sudo -u hdfs hadoop fs -mkdir /takahiko
> # sudo -u hdfs hadoop fs -ls /
> Found 3 items
> drwxr-xr-x   - hbase hbase               0 2012-10-30 15:12 /hbase
> drwxr-xr-x   - hdfs  supergroup          0 2012-10-30 18:55 /takahiko
> drwxrwxrwt   - hdfs  hdfs                0 2012-10-26 23:58 /tmp
> # sudo -u hdfs hadoop fs -copyFromLocal /tmp/test.txt /takahiko/
> 12/10/30 20:07:05 WARN hdfs.DFSClient: DataStreamer Exception
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> /takahiko/test.txt._COPYING_ could only be replicated to 0 nodes
> instead of minReplication (=1).  There are 0 datanode(s) running and
> no node(s) are excluded in this operation.
>         at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1322)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2170)
>         at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:471)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:898)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1693)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1689)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1687)
>
>         at org.apache.hadoop.ipc.Client.call(Client.java:1160)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
>         at $Proxy9.addBlock(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:290)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
>         at $Proxy10.addBlock(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1150)
>         at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1003)
>         at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:463)
> copyFromLocal: File /takahiko/test.txt._COPYING_ could only be
> replicated to 0 nodes instead of minReplication (=1).  There are 0
> datanode(s) running and no node(s) are excluded in this operation.
> 12/10/30 20:07:05 ERROR hdfs.DFSClient: Failed to close file
> /takahiko/test.txt._COPYING_
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> /takahiko/test.txt._COPYING_ could only be replicated to 0 nodes
> instead of minReplication (=1).  There are 0 datanode(s) running and
> no node(s) are excluded in this operation.
>         at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1322)
>         at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2170)
>         at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:471)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
>         at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:898)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1693)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1689)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1687)
>
>         at org.apache.hadoop.ipc.Client.call(Client.java:1160)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
>         at $Proxy9.addBlock(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:290)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
>         at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
>         at $Proxy10.addBlock(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1150)
>         at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1003)
>         at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:463)
> --------------------
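>
> I also thought about checking which NameNode is currently active; a
> rough sketch using the NameNode IDs from hdfs-site.xml above (assuming
> the HA admin commands are run as the hdfs user):
>
> --------------------
> # sudo -u hdfs hdfs haadmin -getServiceState namenode38
> # sudo -u hdfs hdfs haadmin -getServiceState namenode90
> --------------------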
>
>
> Could anyone give me any hints on how to solve this problem?
>
> Best Regards,
> Takahiko Kawasaki



-- 
Harsh J
