Re: Some problems in one accident on my production cluster

2016-02-24 Thread Heng Chen
Thanks Stack and Ted for your help. After checking the code, I think the reason is that the RS sent a split request with the parent region and two daughter regions, and then the RS crashed. The master updated the two daughter regions to the SPLITTING_NEW state and put them in regionsInTransition, which is stored in the master's memory. And

Re: Some problems in one accident on my production cluster

2016-02-24 Thread Stack
On Wed, Feb 24, 2016 at 3:31 PM, Heng Chen wrote:
> The story is I run one MR job on my production cluster (0.98.6), it needs to scan one table during map procedure.
>
> Because of the heavy load from the job, all my RS crashed due to OOM.
>
Really big rows? If

Re: Some problems in one accident on my production cluster

2016-02-24 Thread Ted Yu
bq. RegionStates: THIS SHOULD NOT HAPPEN: unexpected { ad283942aff2bba6c0b94ff98a904d1a state=SPLITTING_NEW
Looks like the above wouldn't have happened if you were using 0.98.11+. See HBASE-12958.
On Wed, Feb 24, 2016 at 6:39 PM, Heng Chen wrote:
> I pick up some logs

Re: Some problems in one accident on my production cluster

2016-02-24 Thread Heng Chen
Thanks @ted, your suggestions about 2 and 3 are what I need!
2016-02-25 10:39 GMT+08:00 Heng Chen:
> I pick up some logs in master.log about one region "ad283942aff2bba6c0b94ff98a904d1a"
>
> 2016-02-24 16:24:35,610 INFO [AM.ZK.Worker-pool2-t3491]

Re: Some problems in one accident on my production cluster

2016-02-24 Thread Heng Chen
I picked up some logs from master.log about one region, "ad283942aff2bba6c0b94ff98a904d1a":
2016-02-24 16:24:35,610 INFO [AM.ZK.Worker-pool2-t3491] master.RegionStates: Transition null to {ad283942aff2bba6c0b94ff98a904d1a state=SPLITTING_NEW, ts=1456302275610,

Re: Some problems in one accident on my production cluster

2016-02-24 Thread Ted Yu
bq. two regions were in transition
Can you pastebin the related server logs for these two regions so that we have more clues?
For #2, please see http://hbase.apache.org/book.html#big.cluster.config
For #3, please see

Some problems in one accident on my production cluster

2016-02-24 Thread Heng Chen
The story is that I ran one MR job on my production cluster (0.98.6); it needs to scan one table during the map phase. Because of the heavy load from the job, all my RSs crashed due to OOM. After I restarted all the RSs, I found one problem: all regions were reopened on one RS, and the balancer could not