Moving to cdh-user@cloudera.org (https://groups.google.com/a/cloudera.org/forum/#!forum/cdh-user), as it may be a CDH4-specific problem.
Could you share your whole DN log (from startup until the heartbeat errors), please? I suspect it's a problem with DN registration, which the log will help confirm. (A few quick checks that may help narrow this down are sketched after the quoted message below.)

On Tue, Oct 30, 2012 at 4:40 PM, Takahiko Kawasaki <[email protected]> wrote:
> Hello,
>
> I am having trouble with quorum-based HDFS HA on CDH 4.1.1.
>
> The NameNode Web UI of Cloudera Manager reports the NameNode status.
> It has a "Cluster Summary" section, and my cluster is summarized
> there as below.
>
> --- Cluster Summary ---
> Configured Capacity   : 0 KB
> DFS Used              : 0 KB
> Non DFS Used          : 0 KB
> DFS Remaining         : 0 KB
> DFS Used%             : 100 %
> DFS Remaining%        : 0 %
> Block Pool Used       : 0 KB
> Block Pool Used%      : 100 %
> DataNodes usages      : Min %  Median %  Max %  stdev %
>                         0 %    0 %       0 %    0 %
> Live Nodes            : 0 (Decommissioned: 0)
> Dead Nodes            : 5 (Decommissioned: 0)
> Decommissioning Nodes : 0
> --------------------
>
> As you can see, all the DataNodes are regarded as dead.
>
> I found that the DataNodes keep emitting log entries about failing to
> send heartbeats to the NameNode.
>
> --- DataNode Log (host names were manually edited) ---
> 2012-10-30 19:28:16,817 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> For namenode node02.example.com/192.168.62.232:8020 using
> DELETEREPORT_INTERVAL of 300000 msec BLOCKREPORT_INTERVAL of 21600000msec
> Initial delay: 0msec; heartBeatInterval=3000
> 2012-10-30 19:28:16,817 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> Exception in BPOfferService for Block pool
> BP-2063217961-192.168.62.231-1351263110470 (storage id
> DS-2090122187-192.168.62.233-50010-1338981658216) service to
> node02.example.com/192.168.62.232:8020
> java.lang.NullPointerException
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:435)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:521)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:674)
>     at java.lang.Thread.run(Thread.java:662)
> --------------------
>
> So I guess the DataNodes are failing to locate the name service for
> some reason, but I don't have any clue how to solve the problem.
>
> I confirmed that
> /var/run/cloudera-scm-agent/process/???-hdfs-DATANODE/core-site.xml
> on a DataNode contains
>
> --- core-site.xml ---
> <property>
>   <name>fs.defaultFS</name>
>   <value>hdfs://nameservice1</value>
> </property>
> --------------------
>
> and hdfs-site.xml contains
>
> --- hdfs-site.xml ---
> <property>
>   <name>dfs.nameservices</name>
>   <value>nameservice1</value>
> </property>
> <property>
>   <name>dfs.client.failover.proxy.provider.nameservice1</name>
>   <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
> </property>
> <property>
>   <name>dfs.ha.namenodes.nameservice1</name>
>   <value>namenode38,namenode90</value>
> </property>
> <property>
>   <name>dfs.namenode.rpc-address.nameservice1.namenode38</name>
>   <value>node01.example.com:8020</value>
> </property>
> <property>
>   <name>dfs.namenode.http-address.nameservice1.namenode38</name>
>   <value>node01.example.com:50070</value>
> </property>
> <property>
>   <name>dfs.namenode.https-address.nameservice1.namenode38</name>
>   <value>node01.example.com:50470</value>
> </property>
> <property>
>   <name>dfs.namenode.rpc-address.nameservice1.namenode90</name>
>   <value>node02.example.com:8020</value>
> </property>
> <property>
>   <name>dfs.namenode.http-address.nameservice1.namenode90</name>
>   <value>node02.example.com:50070</value>
> </property>
> <property>
>   <name>dfs.namenode.https-address.nameservice1.namenode90</name>
>   <value>jbmnode02.jibemobile.jp:50470</value>
> </property>
> <property>
>   <name>dfs.permissions.superusergroup</name>
>   <value>supergroup</value>
> </property>
> <property>
>   <name>dfs.replication</name>
>   <value>3</value>
> </property>
> <property>
>   <name>dfs.namenode.replication.min</name>
>   <value>1</value>
> </property>
> <property>
>   <name>dfs.replication.max</name>
>   <value>512</value>
> </property>
> --------------------
>
> The following is my attempt to create a file in HDFS, which failed.
>
> --------------------
> # vi /tmp/test.txt
> # sudo -u hdfs hadoop fs -mkdir /takahiko
> # sudo -u hdfs hadoop fs -ls /
> Found 3 items
> drwxr-xr-x   - hbase hbase        0 2012-10-30 15:12 /hbase
> drwxr-xr-x   - hdfs  supergroup   0 2012-10-30 18:55 /takahiko
> drwxrwxrwt   - hdfs  hdfs         0 2012-10-26 23:58 /tmp
> # sudo -u hdfs hadoop fs -copyFromLocal /tmp/test.txt /takahiko/
> 12/10/30 20:07:05 WARN hdfs.DFSClient: DataStreamer Exception
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> /takahiko/test.txt._COPYING_ could only be replicated to 0 nodes
> instead of minReplication (=1). There are 0 datanode(s) running and
> no node(s) are excluded in this operation.
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1322)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2170)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:471)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:898)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1693)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1689)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1687)
>
>     at org.apache.hadoop.ipc.Client.call(Client.java:1160)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
>     at $Proxy9.addBlock(Unknown Source)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:290)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
>     at $Proxy10.addBlock(Unknown Source)
>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1150)
>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1003)
>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:463)
> copyFromLocal: File /takahiko/test.txt._COPYING_ could only be
> replicated to 0 nodes instead of minReplication (=1). There are 0
> datanode(s) running and no node(s) are excluded in this operation.
> 12/10/30 20:07:05 ERROR hdfs.DFSClient: Failed to close file
> /takahiko/test.txt._COPYING_
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
> /takahiko/test.txt._COPYING_ could only be replicated to 0 nodes
> instead of minReplication (=1). There are 0 datanode(s) running and
> no node(s) are excluded in this operation.
>     at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1322)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2170)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:471)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:898)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1693)
>     at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1689)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1687)
>
>     at org.apache.hadoop.ipc.Client.call(Client.java:1160)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:202)
>     at $Proxy9.addBlock(Unknown Source)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:290)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:164)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:83)
>     at $Proxy10.addBlock(Unknown Source)
>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1150)
>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1003)
>     at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:463)
> --------------------
>
>
> Could anyone give me a hint on how to solve this problem?
>
> Best Regards,
> Takahiko Kawasaki

--
Harsh J
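To help narrow down the DN-registration suspicion above, a few read-only checks can be run from one of the NameNode hosts. This is only a sketch, not a definitive diagnosis: the nameservice1/namenode38/namenode90 IDs are taken from the hdfs-site.xml quoted above, and the commands assume a host where the cluster's client configuration is deployed and the hdfs superuser is available.

--- Quick checks (sketch) ---
Which NameNode IDs and addresses does the client configuration resolve
for the HA nameservice?

# sudo -u hdfs hdfs getconf -confKey dfs.ha.namenodes.nameservice1
# sudo -u hdfs hdfs getconf -namenodes

Is exactly one of the two NameNodes active and the other standby?

# sudo -u hdfs hdfs haadmin -getServiceState namenode38
# sudo -u hdfs hdfs haadmin -getServiceState namenode90

Which DataNodes does the NameNode consider registered? The live/dead
counts here should match the Cluster Summary shown above.

# sudo -u hdfs hdfs dfsadmin -report
--------------------

If dfsadmin -report also shows 0 live DataNodes while the DataNode processes are running, the full DataNode log requested above (from startup through the first NullPointerException) is the most useful next piece of information.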
