Can you check the server log on node 106 around 19:19:20 to see if there are more clues?
bq. be informed somehow of the events that happened during their absence?

Did you mean after nodeA came back online?

Cheers

On Thu, Mar 31, 2016 at 9:57 AM, Zheng Shen <[email protected]> wrote:

> Hi Ted,
>
> Thank you very much for your reply!
>
> We do have multiple HMaster nodes; one of them is on the offline node
> (let's call it nodeA). Another is on a node which is always online (nodeB).
>
> I scanned the audit log and found that while nodeA was offline, the
> HDFS audit log on nodeB shows:
>
> hdfs-audit.log:2016-03-31 19:19:24,158 INFO FSNamesystem.audit:
> allowed=true ugi=hbase (auth:SIMPLE) ip=/192.168.1.106 cmd=delete
> src=/hbase/archive/data/default/vocabulary/2639c4d082646bb4a4fa2d8119f9aaef/cnt/2dc367d0e1c24a3b848c68d3b171b06d
> dst=null perm=null proto=rpc
>
> where 192.168.1.106 is the IP of nodeB.
>
> So it looks like nodeB deleted this file while nodeA was offline. However,
> shouldn't services on nodeA (like HMaster and the namenode) be informed
> somehow of the events that happened during their absence?
>
> Although we have only 5 nodes in this cluster, we do run HA at every
> level of the HBase service stack. So yes, there are multiple instances
> of every service wherever possible or necessary (e.g. we have 3 HMasters,
> 2 namenodes, 3 journal nodes).
>
> Thanks,
> Zheng
>
> ________________________________
> [email protected]
>
> From: Ted Yu<mailto:[email protected]>
> Date: 2016-04-01 00:00
> To: [email protected]<mailto:[email protected]>
> Subject: Re: Could not initialize all stores for the region
>
> bq. File does not exist: /hbase/data/default/vocabulary/2639c4d082646bb4a4fa2d8119f9aaef/cnt/2dc367d0e1c24a3b848c68d3b171b06d
>
> Can you search the namenode audit log to see which node initiated the
> delete request for the above file?
> Then you can search that node's region server log for more clues.
>
> bq. hosts the HDFS namenode and datanode, Cloudera Manager, as well as
> HBase master and region server
>
> Can you move some daemons off this node (e.g. the HBase master)?
> I assume you have a second HBase master running somewhere else. Otherwise
> this node becomes the weak point of the cluster.
>
> On Thu, Mar 31, 2016 at 7:58 AM, Zheng Shen <[email protected]> wrote:
>
> > Hi,
> >
> > Our HBase cannot perform any write operations, while read operations
> > are fine. I found the following error in the region server log:
> >
> > Could not initialize all stores for the
> > region=vocabulary,576206_6513944,1459420417369.19faeb6e4da0b1873f68da271b0f5788.
> >
> > Failed open of
> > region=vocabulary,576206_6513944,1459420417369.19faeb6e4da0b1873f68da271b0f5788.,
> > starting to roll back the global memstore size.
> > java.io.IOException: java.io.IOException: java.io.FileNotFoundException:
> > File does not exist:
> > /hbase/data/default/vocabulary/2639c4d082646bb4a4fa2d8119f9aaef/cnt/2dc367d0e1c24a3b848c68d3b171b06d
> > at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66)
> > at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56)
> > at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1932)
> > at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1873)
> > at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1853)
> > at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1825)
> > at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:559)
> > at ...
> >
> > Opening of region {ENCODED => 19faeb6e4da0b1873f68da271b0f5788, NAME =>
> > 'vocabulary,576206_6513944,1459420417369.19faeb6e4da0b1873f68da271b0f5788.',
> > STARTKEY => '576206_6513944', ENDKEY => '599122_6739914'} failed,
> > transitioning from OPENING to FAILED_OPEN in ZK, expecting version 22
> >
> > We are using Cloudera CDH 5.4.7; the HBase version is 1.0.0-cdh5.4.7,
> > with HDFS HA enabled (one of the namenodes was running on the server
> > that was shut down). Our HBase cluster experienced an unexpected node
> > shutdown today for about 4 hours. The node which was shut down hosts
> > the HDFS namenode and datanode, Cloudera Manager, as well as HBase
> > master and region server (5 nodes in total in our small cluster).
> > While that node was down, besides the services running on it, the
> > other HDFS namenode, the failover controller, and 2 of the 3 journal
> > nodes were also down. After the node recovered, we restarted the whole
> > CDH cluster, and then it ended up like this...
> >
> > The HDFS check "hdfs fsck" does not report any corrupted blocks.
> >
> > Any suggestion about where we should look for this problem?
> >
> > Thanks!
> > Zheng
> >
> > ________________________________
> > [email protected]
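[Editor's note] Ted's suggestion of searching the namenode audit log for the delete can be sketched as a grep pipeline. This is a minimal, self-contained sketch: the sample line is the one quoted in the thread, written to a scratch file so the pipeline can be shown; on a real cluster you would run the same filters against hdfs-audit.log on the active namenode.

```shell
# Reproduce the audit line from the thread in a scratch file; on a real
# cluster, grep the namenode's hdfs-audit.log instead.
cat > /tmp/hdfs-audit-sample.log <<'EOF'
2016-03-31 19:19:24,158 INFO FSNamesystem.audit: allowed=true ugi=hbase (auth:SIMPLE) ip=/192.168.1.106 cmd=delete src=/hbase/archive/data/default/vocabulary/2639c4d082646bb4a4fa2d8119f9aaef/cnt/2dc367d0e1c24a3b848c68d3b171b06d dst=null perm=null proto=rpc
EOF

# Keep only delete commands that touch the missing store file, then
# extract the client IP that issued each one.
grep 'cmd=delete' /tmp/hdfs-audit-sample.log \
  | grep '2dc367d0e1c24a3b848c68d3b171b06d' \
  | sed -E 's|.*ip=/([0-9.]+).*|\1|'
```

For the sample line above, the pipeline prints 192.168.1.106, which is how the thread pinned the delete on nodeB.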

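[Editor's note] The follow-up question at the top of the thread, checking the server log on node 106 around 19:19:20, amounts to filtering log entries by timestamp prefix. A sketch against a fabricated sample log (the entries below are illustrative, not from the real node 106 log, and the log path on a CDH node is an assumption, typically somewhere under /var/log/hbase/):

```shell
# Fabricated sample entries standing in for the real region server log.
cat > /tmp/regionserver-sample.log <<'EOF'
2016-03-31 19:19:19,900 INFO  regionserver.HRegionServer: sample entry inside the window
2016-03-31 19:19:24,158 ERROR handler.OpenRegionHandler: sample entry inside the window
2016-03-31 19:21:02,001 INFO  regionserver.HRegionServer: sample entry outside the window
EOF

# Keep entries from 19:19:10 through 19:19:29, i.e. roughly a +/-10s
# window around 19:19:20.
grep -E '^2016-03-31 19:19:(1|2)[0-9]' /tmp/regionserver-sample.log
```

Widening or narrowing the window is just a matter of adjusting the seconds pattern in the regex.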