Can you check the server log on node 106 around 19:19:20 to see if there are more clues?
bq. be informed somehow of the events that happened during their absence?

Did you mean after nodeA came back online?

Cheers

On Thu, Mar 31, 2016 at 9:57 AM, Zheng Shen <[email protected]> wrote:

> Hi Ted,
>
> Thank you very much for your reply!
>
> We do have multiple HMaster nodes; one of them is on the offline node
> (let's call it nodeA). Another is on a node which is always online (nodeB).
>
> I scanned the audit log and found that while nodeA was offline, the
> HDFS audit log on nodeB shows:
>
> hdfs-audit.log:2016-03-31 19:19:24,158 INFO FSNamesystem.audit:
> allowed=true ugi=hbase (auth:SIMPLE) ip=/192.168.1.106 cmd=delete
> src=/hbase/archive/data/default/vocabulary/2639c4d082646bb4a4fa2d8119f9aaef/cnt/2dc367d0e1c24a3b848c68d3b171b06d
> dst=null perm=null proto=rpc
>
> where 192.168.1.106 is the IP of nodeB.
>
> So it looks like nodeB deleted this file while nodeA was offline. However,
> shouldn't services on nodeA (like HMaster and the namenode) be informed
> somehow of the events that happened during their absence?
>
> Although we have only 5 nodes in this cluster, we do run HA at every
> level of the HBase service stack. So yes, there are multiple instances
> of every service wherever possible or necessary (e.g. we have 3 HMasters,
> 2 namenodes, 3 journal nodes).
>
> Thanks,
> Zheng
>
> ________________________________
> [email protected]
>
> From: Ted Yu<mailto:[email protected]>
> Date: 2016-04-01 00:00
> To: [email protected]<mailto:[email protected]>
> Subject: Re: Could not initialize all stores for the region
>
> bq. File does not exist: /hbase/data/default/vocabulary/2639c4d082646bb4a4fa2d8119f9aaef/cnt/2dc367d0e1c24a3b848c68d3b171b06d
>
> Can you search the namenode audit log to see which node initiated the
> delete request for the above file?
> Then you can search that node's region server log for more clues.
>
> bq. hosts the HDFS namenode and datanode, Cloudera Manager, as well as
> HBase master and region server
>
> Can you move some daemons off this node (e.g. the HBase master)?
> I assume you have a second HBase master running somewhere else. Otherwise
> this node becomes the weak point of the cluster.
>
> On Thu, Mar 31, 2016 at 7:58 AM, Zheng Shen <[email protected]> wrote:
>
> > Hi,
> >
> > Our HBase cannot perform any write operations, while read operations
> > are fine. I found the following error in the region server log:
> >
> > Could not initialize all stores for the
> > region=vocabulary,576206_6513944,1459420417369.19faeb6e4da0b1873f68da271b0f5788.
> >
> > Failed open of
> > region=vocabulary,576206_6513944,1459420417369.19faeb6e4da0b1873f68da271b0f5788.,
> > starting to roll back the global memstore size.
> > java.io.IOException: java.io.IOException: java.io.FileNotFoundException:
> > File does not exist:
> > /hbase/data/default/vocabulary/2639c4d082646bb4a4fa2d8119f9aaef/cnt/2dc367d0e1c24a3b848c68d3b171b06d
> > at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66)
> > at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56)
> > at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1932)
> > at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1873)
> > at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1853)
> > at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1825)
> > at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:559)
> > at ...
> >
> > Opening of region {ENCODED => 19faeb6e4da0b1873f68da271b0f5788, NAME =>
> > 'vocabulary,576206_6513944,1459420417369.19faeb6e4da0b1873f68da271b0f5788.',
> > STARTKEY => '576206_6513944', ENDKEY => '599122_6739914'} failed,
> > transitioning from OPENING to FAILED_OPEN in ZK, expecting version 22
> >
> > We are using Cloudera CDH 5.4.7; the HBase version is 1.0.0-cdh5.4.7,
> > with HDFS HA enabled (one of the namenodes was running on the server
> > that was shut down). Our HBase cluster experienced an unexpected node
> > shutdown today for about 4 hours. The node which was shut down hosts
> > the HDFS namenode and datanode, Cloudera Manager, as well as HBase
> > master and region server (5 nodes in total in our small cluster).
> > While that node was down, besides the services running on it, the
> > other HDFS namenode, the failover controller, and 2 of the 3 journal
> > nodes were also down. After the node recovered, we restarted the whole
> > CDH cluster, and then it ended up like this...
> >
> > The HDFS check "hdfs fsck" does not report any corrupted blocks.
> >
> > Any suggestion about where we should look for this problem?
> >
> > Thanks!
> > Zheng
> >
> > ________________________________
> > [email protected]
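[Editor's note] Ted's suggestion of searching the namenode audit log for the delete can be sketched as a grep pipeline. This is a minimal, self-contained sketch: the sample line is the one quoted in the thread, written to a scratch file so the pipeline can be shown; on a real cluster you would run the same filters against hdfs-audit.log on the active namenode.

```shell
# Reproduce the audit line from the thread in a scratch file; on a real
# cluster, grep the namenode's hdfs-audit.log instead.
cat > /tmp/hdfs-audit-sample.log <<'EOF'
2016-03-31 19:19:24,158 INFO FSNamesystem.audit: allowed=true ugi=hbase (auth:SIMPLE) ip=/192.168.1.106 cmd=delete src=/hbase/archive/data/default/vocabulary/2639c4d082646bb4a4fa2d8119f9aaef/cnt/2dc367d0e1c24a3b848c68d3b171b06d dst=null perm=null proto=rpc
EOF

# Keep only delete commands that touch the missing store file, then
# extract the client IP that issued each one.
grep 'cmd=delete' /tmp/hdfs-audit-sample.log \
  | grep '2dc367d0e1c24a3b848c68d3b171b06d' \
  | sed -E 's|.*ip=/([0-9.]+).*|\1|'
```

For the sample line above, the pipeline prints 192.168.1.106, which is how the thread pinned the delete on nodeB.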

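[Editor's note] The follow-up question at the top of the thread, checking the server log on node 106 around 19:19:20, amounts to filtering log entries by timestamp prefix. A sketch against a fabricated sample log (the entries below are illustrative, not from the real node 106 log, and the log path on a CDH node is an assumption, typically somewhere under /var/log/hbase/):

```shell
# Fabricated sample entries standing in for the real region server log.
cat > /tmp/regionserver-sample.log <<'EOF'
2016-03-31 19:19:19,900 INFO  regionserver.HRegionServer: sample entry inside the window
2016-03-31 19:19:24,158 ERROR handler.OpenRegionHandler: sample entry inside the window
2016-03-31 19:21:02,001 INFO  regionserver.HRegionServer: sample entry outside the window
EOF

# Keep entries from 19:19:10 through 19:19:29, i.e. roughly a +/-10s
# window around 19:19:20.
grep -E '^2016-03-31 19:19:(1|2)[0-9]' /tmp/regionserver-sample.log
```

Widening or narrowing the window is just a matter of adjusting the seconds pattern in the regex.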