Can you check the lv-295 region server log to see what happened to region
0015506030f086780f6154b4cace7c6a and why it was set to FAILED_CLOSE?
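
Something like this should locate the relevant entries (the log path is a guess for
a CDH-style install; adjust for your setup):

  grep 0015506030f086780f6154b4cace7c6a /var/log/hbase/*regionserver*.log*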

888 inconsistencies were detected. How many regions were there in total (the
snippet above shows only 10)?
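
If you still have it, the overall region count shows up in the full hbck report,
e.g. (a sketch):

  hbase hbck -details 2>&1 | tee hbck-details.out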

Can you go through the master log between the time of the first restart and
2017-07-25 05:50:32 to see if there is some clue?
You can pastebin the log after redaction.
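
If it helps, an awk range over the timestamps can pull out that window (the start
pattern below is a placeholder for the time of the first restart; the log file name
will differ on your install):

  awk '/^2017-07-25 03:5/,/^2017-07-25 05:50:32/' hbase-master.log > master-window.log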

It seems the log level was at INFO; more details would be visible if DEBUG logging
was on.
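
For next time, DEBUG can be enabled for the assignment code in log4j.properties (a
sketch; it can also be toggled at runtime from the Log Level page on the daemon web
UI):

  log4j.logger.org.apache.hadoop.hbase.master=DEBUG
  log4j.logger.org.apache.hadoop.hbase.master.AssignmentManager=DEBUG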

Cheers

On Tue, Jul 25, 2017 at 10:16 PM, Bo Zhang <[email protected]> wrote:

> Hello hbaseers,
> We currently use hbase-1.0.0-cdh5.5.2 for our hbase-cluster.
> However, we ran into some problems yesterday.
>
> We stopped and started our hbase-cluster only once, and then hbck reported an
> initial error:
> >Number of regions: 10
> >Deployed on:
> >  prod-lex-datanode-lv-238.prod.marinsw.net,60020,1500950475647
> >  prod-lex-datanode-lv-245.prod.marinsw.net,60020,1500950476164
> >  prod-lex-datanode-lv-247.prod.marinsw.net,60020,1500950476370
> >  prod-lex-datanode-lv-292.prod.marinsw.net,60020,1500950475711
> >  prod-lex-datanode-lv-294.prod.marinsw.net,60020,1500950475833
> >  prod-lex-datanode-lv-297.prod.marinsw.net,60020,1500950475948
> >  prod-lex-datanode-lv-302.prod.marinsw.net,60020,1500950475835
> >  prod-lex-datanode-lv-303.prod.marinsw.net,60020,1500950477303
> >888 inconsistencies detected.
> >Status: INCONSISTENT
>
> Then we restarted the hbase-cluster again and tried "hbck -fix" to repair the
> inconsistencies, but we received an error:
> >2017-07-25 03:52:06,341 WARN org.apache.hadoop.hbase.master.RegionStates: Failed to open/close 0015506030f086780f6154b4cace7c6a on prod-lex-datanode-lv-295.prod.marinsw.net,60020,1500953992312, set to FAILED_CLOSE
>
> Meanwhile, regions were stuck in “transition”.
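>
> (The invocation was just the stock hbck tool; roughly, and from memory:
>   hbase hbck -fix
> which per the hbck usage text is shorthand for -fixAssignments.)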
>
> In this case, we had to stop the cluster and wanted to use offlineMetaRepair
> to repair the meta table.
> However, we could not stop the region servers cleanly and had to kill the
> processes manually.
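>
> (For reference, our understanding is that the offline repair tool is invoked
> with HBase fully stopped, roughly:
>   hbase org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair
> hence the need to get every daemon down first.)
>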
> Here are the logs:
>
> >2017-07-25 05:50:32,233 INFO org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer: loading config
> >2017-07-25 05:50:32,278 INFO org.apache.hadoop.hbase.master.RegionStates: Transition {1588230740 state=OFFLINE, ts=1500961832240, server=null} to {1588230740 state=OPEN, ts=1500961832278, server=prod-lex-datanode-lv-235.prod.marinsw.net,60020,1500960907808}
> >2017-07-25 05:50:32,279 INFO org.apache.hadoop.hbase.master.ServerManager: AssignmentManager hasn't finished failover cleanup; waiting
> >2017-07-25 05:50:32,280 INFO org.apache.hadoop.hbase.master.HMaster: hbase:meta assigned=0, rit=false, location=prod-lex-datanode-lv-235.prod.marinsw.net,60020,1500960907808
> >2017-07-25 05:50:32,433 INFO org.apache.hadoop.hbase.MetaMigrationConvertingToPB: META already up-to date with PB serialization
> >2017-07-25 05:50:32,782 INFO org.apache.hadoop.hbase.master.AssignmentManager: Found regions out on cluster or in RIT; presuming failover
> >2017-07-25 05:50:32,834 INFO org.apache.hadoop.hbase.master.AssignmentManager: Joined the cluster in 400ms, failover=true
> >2017-07-25 05:50:32,905 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for prod-lex-datanode-lv-255.prod.marinsw.net,60020,1449178169931 before assignment; region count=0
> >2017-07-25 05:50:32,908 INFO org.apache.hadoop.hbase.master.SplitLogManager: dead splitlog workers [prod-lex-datanode-lv-255.prod.marinsw.net,60020,1449178169931]
> >2017-07-25 05:50:32,910 INFO org.apache.hadoop.hbase.master.SplitLogManager: hdfs://prod-lex/hbase/WALs/prod-lex-datanode-lv-255.prod.marinsw.net,60020,1449178169931-splitting is empty dir, no logs to split
> >2017-07-25 05:50:32,911 INFO org.apache.hadoop.hbase.master.SplitLogManager: started splitting 0 logs in [hdfs://prod-lex/hbase/WALs/prod-lex-datanode-lv-255.prod.marinsw.net,60020,1449178169931-splitting] for [prod-lex-datanode-lv-255.prod.marinsw.net,60020,1449178169931]
> >2017-07-25 05:50:32,917 WARN org.apache.hadoop.hbase.master.SplitLogManager: returning success without actually splitting and deleting all the log files in path hdfs://prod-lex/hbase/WALs/prod-lex-datanode-lv-255.prod.marinsw.net,60020,1449178169931-splitting
> >2017-07-25 05:50:32,917 INFO org.apache.hadoop.hbase.master.SplitLogManager: finished splitting (more than or equal to) 0 bytes in 0 log files in [hdfs://prod-lex/hbase/WALs/prod-lex-datanode-lv-255.prod.marinsw.net,60020,1449178169931-splitting] in 6ms
> >2017-07-25 05:50:32,918 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning 0 region(s) that prod-lex-datanode-lv-255.prod.marinsw.net,60020,1449178169931 was carrying (and 0 regions(s) that were opening on this server)
> >2017-07-25 05:50:32,918 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Finished processing of shutdown of prod-lex-datanode-lv-255.prod.marinsw.net,60020,1449178169931
>
>
> At last, we had to fall back to snapshots, delete the hbase znodes from
> zookeeper, and fully restart the whole cluster (hdfs, zookeeper, and hbase,
> along with the other required services, Hive and Oozie).
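>
> (For the znode cleanup we used HBase's bundled ZooKeeper CLI, roughly as
> follows, assuming the default /hbase parent znode and all HBase daemons
> already stopped:
>   hbase zkcli
>   rmr /hbase
> and then restarted everything.)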
>
> Although we got the hbase-cluster back up, we still don't know what caused
> the problems, and we need some suggestions and explanations to avoid them
> happening again.
>
> Do you have any idea why restarting the hbase-cluster would cause the
> inconsistency and "transition" problems?
>
> And is there a better (or smarter) way to solve them?
>
> Any suggestions and ideas are welcome.
>
> Thank you so much in advance.
>
> ++
>
> Bo ZHANG
>
