Re: Some regions never get online after a region server crashes

2016-06-06 Thread Shuai Lin
Will do that, thanks!

On Sat, Jun 4, 2016 at 6:55 PM, Ted Yu  wrote:

> I think this sounds like a bug.
>
> Search in HBase JIRA first. If there is no JIRA with the same symptom,
> consider filing one.
>
> Cheers
>
> On Fri, Jun 3, 2016 at 1:10 AM, Shuai Lin  wrote:
>
>> Hi Ted,
>>
>> I'm kind of confused: is this normal behaviour or a bug in hbase? To
>> me it looks like a bug; should I file a JIRA?
>>
>> Thanks,
>>
>> Shuai
>>
>> On Fri, May 27, 2016 at 8:02 PM, Ted Yu  wrote:
>>
>>> There were 7 regions Master tried to close which were opening but not
>>> yet served.
>>>
>>> d1c7f3f455f2529da82a2f713b5ee067 was one of them.
>>>
>>> On Fri, May 27, 2016 at 12:47 AM, Shuai Lin 
>>> wrote:
>>>
 Here is the complete log on node6 between 13:10:47 and 13:11:47:
 http://paste.openstack.org/raw/505826/

 The master asked node6 to open several regions. Node6 opened the first
 4 very fast (within 1 second) and got stuck at the 5th one. But there were
 no errors at that time.

 On Wed, May 25, 2016 at 10:12 PM, Ted Yu  wrote:

> In AssignmentManager#assign(), you should find:
>
>   // Send OPEN RPC. If it fails on a IOE or RemoteException,
>   // regions will be assigned individually.
>   long maxWaitTime = System.currentTimeMillis() +
> this.server.getConfiguration().
>   getLong("hbase.regionserver.rpc.startup.waittime", 60000);
>
> BTW can you see what caused rs-node6 to not respond around 13:11:47 ?
>
> Cheers
>
> On Fri, May 20, 2016 at 6:20 AM, Shuai Lin 
> wrote:
>
>> Because the "opening regions" rpc call sent by the master to the
>> region server node6 timed out after 1 minute?
>>
>> *RPC call was sent:*
>>
>> 2016-04-30 13:10:47,702 INFO 
>> org.apache.hadoop.hbase.master.AssignmentManager:
>> Assigning 22 region(s) to rs-node6.example.com,60020,1458723856883
>>
>> *After 1 minute:*
>>
>> 2016-04-30 13:11:47,780 INFO
>> org.apache.hadoop.hbase.master.AssignmentManager: Unable to communicate
>> with rs-node6.example.com,60020,1458723856883 in order to assign
>> regions, java.io.IOException: Call to
>> rs-node6.example.com/172.16.6.6:60020 failed on local exception:
>> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=4,
>> waitTime=60001, operationTimeout=60000 expired.
>>
>> 2016-04-30 13:11:47,783 DEBUG
>> org.apache.hadoop.hbase.master.AssignmentManager: Force region state
>> offline {d1c7f3f455f2529da82a2f713b5ee067 state=PENDING_OPEN,
>> ts=1462021847743, server=rs-node6.example.com,60020,1458723856883}
>>
>>
>> I have checked the hbase source code, but didn't find any specific timeout
>> setting for the "open region" rpc call that I could use. So I guess it's using
>> the default "hbase.rpc.timeout", which defaults to 60 seconds. And since there
>> are 20+ regions being assigned to node6 at almost the same moment, node6
>> gets overloaded and can't finish opening all of them within one minute.
>>
>> So this looks like a hbase bug to me (regions never get online when
>> the region server failed to handle the OpenRegionRequest before the rpc
>> timeout), am I right?
>>
>>
>> On Fri, May 20, 2016 at 12:42 PM, Ted Yu  wrote:
>>
>>> Looks like region d1c7f3f455f2529da82a2f713b5ee067 received CLOSE
>>> request
>>> when it was opening, leading to RegionAlreadyInTransitionException.
>>>
>>> Was there any clue in master log why the close request was sent ?
>>>
>>> Cheers
>>>
>>> On Wed, May 4, 2016 at 8:02 PM, Shuai Lin 
>>> wrote:
>>>
>>> > Hi Ted,
>>> >
>>> > The hbase version is 1.0.0-cdh5.4.8, shipped with cloudera CDH
>>> 5.4.8. The
>>> > RS logs on node6 can be found here <
>>> http://paste.openstack.org/raw/496174/
>>> > >
>>> >  .
>>> >
>>> > Thanks!
>>> >
>>> > Shuai
>>> >
>>> > On Thu, May 5, 2016 at 9:15 AM, Ted Yu 
>>> wrote:
>>> >
>>> > > Can you pastebin related server log w.r.t.
>>> > d1c7f3f455f2529da82a2f713b5ee067
>>> > > from rs-node6 ?
>>> > >
>>> > > Which release of hbase are you using ?
>>> > >
>>> > > Cheers
>>> > >
>>> > > On Wed, May 4, 2016 at 6:07 PM, Shuai Lin <
>>> linshuai2...@gmail.com>
>>> > wrote:
>>> > >
>>> > > > Hi list,
>>> > > >
>>> > > > Last weekend I got a region server crashed, but some regions
>>> never got
>>> > > > online again on other RSes. I've gone through the logs, and
>>> here is the
>>> > > > timeline about some of the events:
>>> > > >
>>> > > > * 13:03:50 on of 

Re: Some regions never get online after a region server crashes

2016-06-04 Thread Ted Yu
I think this sounds like a bug.

Search in HBase JIRA first. If there is no JIRA with the same symptom,
consider filing one.

Cheers

On Fri, Jun 3, 2016 at 1:10 AM, Shuai Lin  wrote:

> Hi Ted,
>
> I'm kind of confused: is this normal behaviour or a bug in hbase? To
> me it looks like a bug; should I file a JIRA?
>
> Thanks,
>
> Shuai
>
> On Fri, May 27, 2016 at 8:02 PM, Ted Yu  wrote:
>
>> There were 7 regions Master tried to close which were opening but not
>> yet served.
>>
>> d1c7f3f455f2529da82a2f713b5ee067 was one of them.
>>
>> On Fri, May 27, 2016 at 12:47 AM, Shuai Lin 
>> wrote:
>>
>>> Here is the complete log on node6 between 13:10:47 and 13:11:47:
>>> http://paste.openstack.org/raw/505826/
>>>
>>> The master asked node6 to open several regions. Node6 opened the first 4
>>> very fast (within 1 second) and got stuck at the 5th one. But there were no
>>> errors at that time.
>>>
>>> On Wed, May 25, 2016 at 10:12 PM, Ted Yu  wrote:
>>>
 In AssignmentManager#assign(), you should find:

   // Send OPEN RPC. If it fails on a IOE or RemoteException,
   // regions will be assigned individually.
   long maxWaitTime = System.currentTimeMillis() +
 this.server.getConfiguration().
   getLong("hbase.regionserver.rpc.startup.waittime", 60000);

 BTW can you see what caused rs-node6 to not respond around 13:11:47 ?

 Cheers

 On Fri, May 20, 2016 at 6:20 AM, Shuai Lin 
 wrote:

> Because the "opening regions" rpc call sent by the master to the region
> server node6 timed out after 1 minute?
>
> *RPC call was sent:*
>
> 2016-04-30 13:10:47,702 INFO 
> org.apache.hadoop.hbase.master.AssignmentManager:
> Assigning 22 region(s) to rs-node6.example.com,60020,1458723856883
>
> *After 1 minute:*
>
> 2016-04-30 13:11:47,780 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Unable to communicate
> with rs-node6.example.com,60020,1458723856883 in order to assign
> regions, java.io.IOException: Call to
> rs-node6.example.com/172.16.6.6:60020 failed on local exception:
> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=4,
> waitTime=60001, operationTimeout=60000 expired.
>
> 2016-04-30 13:11:47,783 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Force region state
> offline {d1c7f3f455f2529da82a2f713b5ee067 state=PENDING_OPEN,
> ts=1462021847743, server=rs-node6.example.com,60020,1458723856883}
>
>
> I have checked the hbase source code, but didn't find any specific timeout
> setting for the "open region" rpc call that I could use. So I guess it's using
> the default "hbase.rpc.timeout", which defaults to 60 seconds. And since there
> are 20+ regions being assigned to node6 at almost the same moment, node6
> gets overloaded and can't finish opening all of them within one minute.
>
> So this looks like a hbase bug to me (regions never get online when
> the region server failed to handle the OpenRegionRequest before the rpc
> timeout), am I right?
>
>
> On Fri, May 20, 2016 at 12:42 PM, Ted Yu  wrote:
>
>> Looks like region d1c7f3f455f2529da82a2f713b5ee067 received CLOSE
>> request
>> when it was opening, leading to RegionAlreadyInTransitionException.
>>
>> Was there any clue in master log why the close request was sent ?
>>
>> Cheers
>>
>> On Wed, May 4, 2016 at 8:02 PM, Shuai Lin 
>> wrote:
>>
>> > Hi Ted,
>> >
>> > The hbase version is 1.0.0-cdh5.4.8, shipped with cloudera CDH
>> 5.4.8. The
>> > RS logs on node6 can be found here <
>> http://paste.openstack.org/raw/496174/
>> > >
>> >  .
>> >
>> > Thanks!
>> >
>> > Shuai
>> >
>> > On Thu, May 5, 2016 at 9:15 AM, Ted Yu  wrote:
>> >
>> > > Can you pastebin related server log w.r.t.
>> > d1c7f3f455f2529da82a2f713b5ee067
>> > > from rs-node6 ?
>> > >
>> > > Which release of hbase are you using ?
>> > >
>> > > Cheers
>> > >
>> > > On Wed, May 4, 2016 at 6:07 PM, Shuai Lin > >
>> > wrote:
>> > >
>> > > > Hi list,
>> > > >
>> > > > Last weekend I got a region server crashed, but some regions
>> never got
>> > > > online again on other RSes. I've gone through the logs, and
>> here is the
>> > > > timeline about some of the events:
>> > > >
>> > > > * 13:03:50 one of the region servers, rs-node7, died because of a
>> disk
>> > > > failure. Master started to split rs-node7's WALs
>> > > >
>> > > >
>> > > > 2016-04-30 13:03:50,953 INFO
>> > > > 

Re: Some regions never get online after a region server crashes

2016-06-03 Thread Shuai Lin
Hi Ted,

I'm kind of confused: is this normal behaviour or a bug in hbase? To me
it looks like a bug; should I file a JIRA?

Thanks,

Shuai

On Fri, May 27, 2016 at 8:02 PM, Ted Yu  wrote:

> There were 7 regions Master tried to close which were opening but not yet
> served.
>
> d1c7f3f455f2529da82a2f713b5ee067 was one of them.
>
> On Fri, May 27, 2016 at 12:47 AM, Shuai Lin 
> wrote:
>
>> Here is the complete log on node6 between 13:10:47 and 13:11:47:
>> http://paste.openstack.org/raw/505826/
>>
>> The master asked node6 to open several regions. Node6 opened the first 4
>> very fast (within 1 second) and got stuck at the 5th one. But there were no
>> errors at that time.
>>
>> On Wed, May 25, 2016 at 10:12 PM, Ted Yu  wrote:
>>
>>> In AssignmentManager#assign(), you should find:
>>>
>>>   // Send OPEN RPC. If it fails on a IOE or RemoteException,
>>>   // regions will be assigned individually.
>>>   long maxWaitTime = System.currentTimeMillis() +
>>> this.server.getConfiguration().
>>>   getLong("hbase.regionserver.rpc.startup.waittime", 60000);
>>>
>>> BTW can you see what caused rs-node6 to not respond around 13:11:47 ?
>>>
>>> Cheers
>>>
>>> On Fri, May 20, 2016 at 6:20 AM, Shuai Lin 
>>> wrote:
>>>
 Because the "opening regions" rpc call sent by the master to the region
 server node6 timed out after 1 minute?

 *RPC call was sent:*

 2016-04-30 13:10:47,702 INFO 
 org.apache.hadoop.hbase.master.AssignmentManager:
 Assigning 22 region(s) to rs-node6.example.com,60020,1458723856883

 *After 1 minute:*

 2016-04-30 13:11:47,780 INFO
 org.apache.hadoop.hbase.master.AssignmentManager: Unable to communicate
 with rs-node6.example.com,60020,1458723856883 in order to assign
 regions, java.io.IOException: Call to
 rs-node6.example.com/172.16.6.6:60020 failed on local exception:
 org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=4,
 waitTime=60001, operationTimeout=60000 expired.

 2016-04-30 13:11:47,783 DEBUG
 org.apache.hadoop.hbase.master.AssignmentManager: Force region state
 offline {d1c7f3f455f2529da82a2f713b5ee067 state=PENDING_OPEN,
 ts=1462021847743, server=rs-node6.example.com,60020,1458723856883}


 I have checked the hbase source code, but didn't find any specific timeout
 setting for the "open region" rpc call that I could use. So I guess it's using
 the default "hbase.rpc.timeout", which defaults to 60 seconds. And since there
 are 20+ regions being assigned to node6 at almost the same moment, node6
 gets overloaded and can't finish opening all of them within one minute.

 So this looks like a hbase bug to me (regions never get online when the
 region server failed to handle the OpenRegionRequest before the rpc
 timeout), am I right?


 On Fri, May 20, 2016 at 12:42 PM, Ted Yu  wrote:

> Looks like region d1c7f3f455f2529da82a2f713b5ee067 received CLOSE
> request
> when it was opening, leading to RegionAlreadyInTransitionException.
>
> Was there any clue in master log why the close request was sent ?
>
> Cheers
>
> On Wed, May 4, 2016 at 8:02 PM, Shuai Lin 
> wrote:
>
> > Hi Ted,
> >
> > The hbase version is 1.0.0-cdh5.4.8, shipped with cloudera CDH
> 5.4.8. The
> > RS logs on node6 can be found here <
> http://paste.openstack.org/raw/496174/
> > >
> >  .
> >
> > Thanks!
> >
> > Shuai
> >
> > On Thu, May 5, 2016 at 9:15 AM, Ted Yu  wrote:
> >
> > > Can you pastebin related server log w.r.t.
> > d1c7f3f455f2529da82a2f713b5ee067
> > > from rs-node6 ?
> > >
> > > Which release of hbase are you using ?
> > >
> > > Cheers
> > >
> > > On Wed, May 4, 2016 at 6:07 PM, Shuai Lin 
> > wrote:
> > >
> > > > Hi list,
> > > >
> > > > Last weekend I got a region server crashed, but some regions
> never got
> > > > online again on other RSes. I've gone through the logs, and here
> is the
> > > > timeline about some of the events:
> > > >
> > > > * 13:03:50 one of the region servers, rs-node7, died because of a
> disk
> > > > failure. Master started to split rs-node7's WALs
> > > >
> > > >
> > > > 2016-04-30 13:03:50,953 INFO
> > > > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler:
> Splitting
> > > > logs for rs-node7.example.com,60020,1458724695128 before
> assignment;
> > > > region
> > > > count=133
> > > > 2016-04-30 13:03:50,966 DEBUG
> > > > org.apache.hadoop.hbase.master.SplitLogManager: Scheduling batch
> of
> > logs
> > > to
> > > > split
> > > > 2016-04-30 

Re: Some regions never get online after a region server crashes

2016-05-27 Thread Ted Yu
There were 7 regions Master tried to close which were opening but not yet
served.

d1c7f3f455f2529da82a2f713b5ee067 was one of them.

On Fri, May 27, 2016 at 12:47 AM, Shuai Lin  wrote:

> Here is the complete log on node6 between 13:10:47 and 13:11:47:
> http://paste.openstack.org/raw/505826/
>
> The master asked node6 to open several regions. Node6 opened the first 4
> very fast (within 1 second) and got stuck at the 5th one. But there were no
> errors at that time.
>
> On Wed, May 25, 2016 at 10:12 PM, Ted Yu  wrote:
>
>> In AssignmentManager#assign(), you should find:
>>
>>   // Send OPEN RPC. If it fails on a IOE or RemoteException,
>>   // regions will be assigned individually.
>>   long maxWaitTime = System.currentTimeMillis() +
>> this.server.getConfiguration().
>>   getLong("hbase.regionserver.rpc.startup.waittime", 60000);
>>
>> BTW can you see what caused rs-node6 to not respond around 13:11:47 ?
>>
>> Cheers
>>
>> On Fri, May 20, 2016 at 6:20 AM, Shuai Lin 
>> wrote:
>>
>>> Because the "opening regions" rpc call sent by the master to the region
>>> server node6 timed out after 1 minute?
>>>
>>> *RPC call was sent:*
>>>
>>> 2016-04-30 13:10:47,702 INFO 
>>> org.apache.hadoop.hbase.master.AssignmentManager:
>>> Assigning 22 region(s) to rs-node6.example.com,60020,1458723856883
>>>
>>> *After 1 minute:*
>>>
>>> 2016-04-30 13:11:47,780 INFO
>>> org.apache.hadoop.hbase.master.AssignmentManager: Unable to communicate
>>> with rs-node6.example.com,60020,1458723856883 in order to assign
>>> regions, java.io.IOException: Call to
>>> rs-node6.example.com/172.16.6.6:60020 failed on local exception:
>>> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=4,
>>> waitTime=60001, operationTimeout=60000 expired.
>>>
>>> 2016-04-30 13:11:47,783 DEBUG
>>> org.apache.hadoop.hbase.master.AssignmentManager: Force region state
>>> offline {d1c7f3f455f2529da82a2f713b5ee067 state=PENDING_OPEN,
>>> ts=1462021847743, server=rs-node6.example.com,60020,1458723856883}
>>>
>>>
>>> I have checked the hbase source code, but didn't find any specific timeout
>>> setting for the "open region" rpc call that I could use. So I guess it's using
>>> the default "hbase.rpc.timeout", which defaults to 60 seconds. And since there
>>> are 20+ regions being assigned to node6 at almost the same moment, node6
>>> gets overloaded and can't finish opening all of them within one minute.
>>>
>>> So this looks like a hbase bug to me (regions never get online when the
>>> region server failed to handle the OpenRegionRequest before the rpc
>>> timeout), am I right?
>>>
>>>
>>> On Fri, May 20, 2016 at 12:42 PM, Ted Yu  wrote:
>>>
 Looks like region d1c7f3f455f2529da82a2f713b5ee067 received CLOSE
 request
 when it was opening, leading to RegionAlreadyInTransitionException.

 Was there any clue in master log why the close request was sent ?

 Cheers

 On Wed, May 4, 2016 at 8:02 PM, Shuai Lin 
 wrote:

 > Hi Ted,
 >
 > The hbase version is 1.0.0-cdh5.4.8, shipped with cloudera CDH 5.4.8.
 The
 > RS logs on node6 can be found here <
 http://paste.openstack.org/raw/496174/
 > >
 >  .
 >
 > Thanks!
 >
 > Shuai
 >
 > On Thu, May 5, 2016 at 9:15 AM, Ted Yu  wrote:
 >
 > > Can you pastebin related server log w.r.t.
 > d1c7f3f455f2529da82a2f713b5ee067
 > > from rs-node6 ?
 > >
 > > Which release of hbase are you using ?
 > >
 > > Cheers
 > >
 > > On Wed, May 4, 2016 at 6:07 PM, Shuai Lin 
 > wrote:
 > >
 > > > Hi list,
 > > >
 > > > Last weekend I got a region server crashed, but some regions
 never got
 > > > online again on other RSes. I've gone through the logs, and here
 is the
 > > > timeline about some of the events:
 > > >
 > > > * 13:03:50 one of the region servers, rs-node7, died because of a
 disk
 > > > failure. Master started to split rs-node7's WALs
 > > >
 > > >
 > > > 2016-04-30 13:03:50,953 INFO
 > > > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler:
 Splitting
 > > > logs for rs-node7.example.com,60020,1458724695128 before
 assignment;
 > > > region
 > > > count=133
 > > > 2016-04-30 13:03:50,966 DEBUG
 > > > org.apache.hadoop.hbase.master.SplitLogManager: Scheduling batch
 of
 > logs
 > > to
 > > > split
 > > > 2016-04-30 13:03:50,966 INFO
 > > > org.apache.hadoop.hbase.master.SplitLogManager: started splitting
 33
 > logs
 > > > in [hdfs://nameservice1/hbase/WALs/rs-node7.example.com
 > > > ,60020,1458724695128-splitting]
 > > > for [rs-node7.example.com,60020,1458724695128]
 > > >
 > > > * 13:10:47 WAL splits done, master began to re-assign 

Re: Some regions never get online after a region server crashes

2016-05-27 Thread Shuai Lin
Here is the complete log on node6 between 13:10:47 and 13:11:47:
http://paste.openstack.org/raw/505826/

The master asked node6 to open several regions. Node6 opened the first 4
very fast (within 1 second) and got stuck at the 5th one. But there were no
errors at that time.

On Wed, May 25, 2016 at 10:12 PM, Ted Yu  wrote:

> In AssignmentManager#assign(), you should find:
>
>   // Send OPEN RPC. If it fails on a IOE or RemoteException,
>   // regions will be assigned individually.
>   long maxWaitTime = System.currentTimeMillis() +
> this.server.getConfiguration().
>   getLong("hbase.regionserver.rpc.startup.waittime", 60000);
>
> BTW can you see what caused rs-node6 to not respond around 13:11:47 ?
>
> Cheers
>
> On Fri, May 20, 2016 at 6:20 AM, Shuai Lin  wrote:
>
>> Because the "opening regions" rpc call sent by the master to the region
>> server node6 timed out after 1 minute?
>>
>> *RPC call was sent:*
>>
>> 2016-04-30 13:10:47,702 INFO 
>> org.apache.hadoop.hbase.master.AssignmentManager:
>> Assigning 22 region(s) to rs-node6.example.com,60020,1458723856883
>>
>> *After 1 minute:*
>>
>> 2016-04-30 13:11:47,780 INFO
>> org.apache.hadoop.hbase.master.AssignmentManager: Unable to communicate
>> with rs-node6.example.com,60020,1458723856883 in order to assign
>> regions, java.io.IOException: Call to
>> rs-node6.example.com/172.16.6.6:60020 failed on local exception:
>> org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=4,
>> waitTime=60001, operationTimeout=60000 expired.
>>
>> 2016-04-30 13:11:47,783 DEBUG
>> org.apache.hadoop.hbase.master.AssignmentManager: Force region state
>> offline {d1c7f3f455f2529da82a2f713b5ee067 state=PENDING_OPEN,
>> ts=1462021847743, server=rs-node6.example.com,60020,1458723856883}
>>
>>
>> I have checked the hbase source code, but didn't find any specific timeout
>> setting for the "open region" rpc call that I could use. So I guess it's using
>> the default "hbase.rpc.timeout", which defaults to 60 seconds. And since there
>> are 20+ regions being assigned to node6 at almost the same moment, node6
>> gets overloaded and can't finish opening all of them within one minute.
>>
>> So this looks like a hbase bug to me (regions never get online when the
>> region server failed to handle the OpenRegionRequest before the rpc
>> timeout), am I right?
>>
>>
>> On Fri, May 20, 2016 at 12:42 PM, Ted Yu  wrote:
>>
>>> Looks like region d1c7f3f455f2529da82a2f713b5ee067 received CLOSE request
>>> when it was opening, leading to RegionAlreadyInTransitionException.
>>>
>>> Was there any clue in master log why the close request was sent ?
>>>
>>> Cheers
>>>
>>> On Wed, May 4, 2016 at 8:02 PM, Shuai Lin 
>>> wrote:
>>>
>>> > Hi Ted,
>>> >
>>> > The hbase version is 1.0.0-cdh5.4.8, shipped with cloudera CDH 5.4.8.
>>> The
>>> > RS logs on node6 can be found here <
>>> http://paste.openstack.org/raw/496174/
>>> > >
>>> >  .
>>> >
>>> > Thanks!
>>> >
>>> > Shuai
>>> >
>>> > On Thu, May 5, 2016 at 9:15 AM, Ted Yu  wrote:
>>> >
>>> > > Can you pastebin related server log w.r.t.
>>> > d1c7f3f455f2529da82a2f713b5ee067
>>> > > from rs-node6 ?
>>> > >
>>> > > Which release of hbase are you using ?
>>> > >
>>> > > Cheers
>>> > >
>>> > > On Wed, May 4, 2016 at 6:07 PM, Shuai Lin 
>>> > wrote:
>>> > >
>>> > > > Hi list,
>>> > > >
>>> > > > Last weekend I got a region server crashed, but some regions never
>>> got
>>> > > > online again on other RSes. I've gone through the logs, and here
>>> is the
>>> > > > timeline about some of the events:
>>> > > >
>>> > > > * 13:03:50 one of the region servers, rs-node7, died because of a
>>> disk
>>> > > > failure. Master started to split rs-node7's WALs
>>> > > >
>>> > > >
>>> > > > 2016-04-30 13:03:50,953 INFO
>>> > > > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler:
>>> Splitting
>>> > > > logs for rs-node7.example.com,60020,1458724695128 before
>>> assignment;
>>> > > > region
>>> > > > count=133
>>> > > > 2016-04-30 13:03:50,966 DEBUG
>>> > > > org.apache.hadoop.hbase.master.SplitLogManager: Scheduling batch of
>>> > logs
>>> > > to
>>> > > > split
>>> > > > 2016-04-30 13:03:50,966 INFO
>>> > > > org.apache.hadoop.hbase.master.SplitLogManager: started splitting
>>> 33
>>> > logs
>>> > > > in [hdfs://nameservice1/hbase/WALs/rs-node7.example.com
>>> > > > ,60020,1458724695128-splitting]
>>> > > > for [rs-node7.example.com,60020,1458724695128]
>>> > > >
>>> > > > * 13:10:47 WAL splits done, master began to re-assign regions
>>> > > >
>>> > > > 2016-04-30 13:10:47,655 INFO
>>> > > > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler:
>>> > Reassigning
>>> > > > 133 region(s) that rs-node7.example.com,60020,1458724695128 was
>>> > carrying
>>> > > > (and 0 regions(s) that were opening on this server)
>>> > > > 2016-04-30 13:10:47,665 INFO
>>> > > > 

Re: Some regions never get online after a region server crashes

2016-05-25 Thread Ted Yu
In AssignmentManager#assign(), you should find:

  // Send OPEN RPC. If it fails on a IOE or RemoteException,
  // regions will be assigned individually.
  long maxWaitTime = System.currentTimeMillis() +
this.server.getConfiguration().
  getLong("hbase.regionserver.rpc.startup.waittime", 60000);
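
For illustration, a minimal sketch (my own, not code from HBase itself) of reading or
overriding that key through the stock Hadoop/HBase Configuration API; the class name
and the 180000 ms override below are just examples, not recommendations:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;

  public class StartupWaitTimeSketch {
    public static void main(String[] args) {
      // Loads hbase-default.xml / hbase-site.xml from the classpath.
      Configuration conf = HBaseConfiguration.create();

      // The key consulted in AssignmentManager#assign() above; 60000 ms is the default.
      long waitMs = conf.getLong("hbase.regionserver.rpc.startup.waittime", 60000);
      System.out.println("hbase.regionserver.rpc.startup.waittime = " + waitMs + " ms");

      // The real override would go into the master's hbase-site.xml and take effect
      // after a restart; this is only the programmatic form of the same key/value.
      conf.setLong("hbase.regionserver.rpc.startup.waittime", 180000);
    }
  }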

BTW can you see what caused rs-node6 to not respond around 13:11:47 ?

Cheers

On Fri, May 20, 2016 at 6:20 AM, Shuai Lin  wrote:

> Because the "opening regions" rpc call sent by the master to the region
> server node6 timed out after 1 minute?
>
> *RPC call was sent:*
>
> 2016-04-30 13:10:47,702 INFO org.apache.hadoop.hbase.master.AssignmentManager:
> Assigning 22 region(s) to rs-node6.example.com,60020,1458723856883
>
> *After 1 minute:*
>
> 2016-04-30 13:11:47,780 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Unable to communicate
> with rs-node6.example.com,60020,1458723856883 in order to assign regions,
> java.io.IOException: Call to rs-node6.example.com/172.16.6.6:60020 failed
> on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call
> id=4, waitTime=60001, operationTimeout=60000 expired.
>
> 2016-04-30 13:11:47,783 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Force region state
> offline {d1c7f3f455f2529da82a2f713b5ee067 state=PENDING_OPEN,
> ts=1462021847743, server=rs-node6.example.com,60020,1458723856883}
>
>
> I have checked the hbase source code, but didn't find any specific timeout
> setting for the "open region" rpc call that I could use. So I guess it's using
> the default "hbase.rpc.timeout", which defaults to 60 seconds. And since there
> are 20+ regions being assigned to node6 at almost the same moment, node6
> gets overloaded and can't finish opening all of them within one minute.
>
> So this looks like a hbase bug to me (regions never get online when the
> region server failed to handle the OpenRegionRequest before the rpc
> timeout), am I right?
>
>
> On Fri, May 20, 2016 at 12:42 PM, Ted Yu  wrote:
>
>> Looks like region d1c7f3f455f2529da82a2f713b5ee067 received CLOSE request
>> when it was opening, leading to RegionAlreadyInTransitionException.
>>
>> Was there any clue in master log why the close request was sent ?
>>
>> Cheers
>>
>> On Wed, May 4, 2016 at 8:02 PM, Shuai Lin  wrote:
>>
>> > Hi Ted,
>> >
>> > The hbase version is 1.0.0-cdh5.4.8, shipped with cloudera CDH 5.4.8.
>> The
>> > RS logs on node6 can be found here <
>> http://paste.openstack.org/raw/496174/
>> > >
>> >  .
>> >
>> > Thanks!
>> >
>> > Shuai
>> >
>> > On Thu, May 5, 2016 at 9:15 AM, Ted Yu  wrote:
>> >
>> > > Can you pastebin related server log w.r.t.
>> > d1c7f3f455f2529da82a2f713b5ee067
>> > > from rs-node6 ?
>> > >
>> > > Which release of hbase are you using ?
>> > >
>> > > Cheers
>> > >
>> > > On Wed, May 4, 2016 at 6:07 PM, Shuai Lin 
>> > wrote:
>> > >
>> > > > Hi list,
>> > > >
>> > > > Last weekend I got a region server crashed, but some regions never
>> got
>> > > > online again on other RSes. I've gone through the logs, and here is
>> the
>> > > > timeline about some of the events:
>> > > >
>> > > > * 13:03:50 one of the region servers, rs-node7, died because of a disk
>> > > > failure. Master started to split rs-node7's WALs
>> > > >
>> > > >
>> > > > 2016-04-30 13:03:50,953 INFO
>> > > > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler:
>> Splitting
>> > > > logs for rs-node7.example.com,60020,1458724695128 before
>> assignment;
>> > > > region
>> > > > count=133
>> > > > 2016-04-30 13:03:50,966 DEBUG
>> > > > org.apache.hadoop.hbase.master.SplitLogManager: Scheduling batch of
>> > logs
>> > > to
>> > > > split
>> > > > 2016-04-30 13:03:50,966 INFO
>> > > > org.apache.hadoop.hbase.master.SplitLogManager: started splitting 33
>> > logs
>> > > > in [hdfs://nameservice1/hbase/WALs/rs-node7.example.com
>> > > > ,60020,1458724695128-splitting]
>> > > > for [rs-node7.example.com,60020,1458724695128]
>> > > >
>> > > > * 13:10:47 WAL splits done, master began to re-assign regions
>> > > >
>> > > > 2016-04-30 13:10:47,655 INFO
>> > > > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler:
>> > Reassigning
>> > > > 133 region(s) that rs-node7.example.com,60020,1458724695128 was
>> > carrying
>> > > > (and 0 regions(s) that were opening on this server)
>> > > > 2016-04-30 13:10:47,665 INFO
>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning 133
>> > > > region(s) across 6 server(s), round-robin=true
>> > > > 2016-04-30 13:10:47,667 INFO
>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
>> > region(s)
>> > > to
>> > > > rs-node1.example.com,60020,1458720625688
>> > > > 2016-04-30 13:10:47,667 INFO
>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
>> > region(s)
>> > > to
>> > > > rs-node2.example.com,60020,1458721110988
>> > > > 2016-04-30 

Re: Some regions never get online after a region server crashes

2016-05-25 Thread Shuai Lin
Hi Ted, does my conclusion above make sense? Should I open an issue in
hbase JIRA?
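
As a possible interim workaround (an assumption on my part, not something
confirmed in this thread), a stuck region could also be re-assigned manually,
either with the shell's 'assign' command or roughly like this with the 1.x
client Admin API (class name and lack of error handling are just a sketch):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Admin;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.util.Bytes;

  public class ReassignStuckRegion {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      try (Connection connection = ConnectionFactory.createConnection(conf);
           Admin admin = connection.getAdmin()) {
        // Region name taken from the master log. Depending on the release the
        // encoded name below or the full region name from hbase:meta is expected.
        admin.assign(Bytes.toBytes("d1c7f3f455f2529da82a2f713b5ee067"));
      }
    }
  }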

On Fri, May 20, 2016 at 9:20 PM, Shuai Lin  wrote:

> Because the "opening regions" rpc call sent by the master to the region
> server node6 timed out after 1 minute?
>
> *RPC call was sent:*
>
> 2016-04-30 13:10:47,702 INFO org.apache.hadoop.hbase.master.AssignmentManager:
> Assigning 22 region(s) to rs-node6.example.com,60020,1458723856883
>
> *After 1 minute:*
>
> 2016-04-30 13:11:47,780 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Unable to communicate
> with rs-node6.example.com,60020,1458723856883 in order to assign regions,
> java.io.IOException: Call to rs-node6.example.com/172.16.6.6:60020 failed
> on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call
> id=4, waitTime=60001, operationTimeout=60000 expired.
>
> 2016-04-30 13:11:47,783 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Force region state
> offline {d1c7f3f455f2529da82a2f713b5ee067 state=PENDING_OPEN,
> ts=1462021847743, server=rs-node6.example.com,60020,1458723856883}
>
>
> I have checked the hbase source code, but didn't find any specific timeout
> setting for the "open region" rpc call that I could use. So I guess it's using
> the default "hbase.rpc.timeout", which defaults to 60 seconds. And since there
> are 20+ regions being assigned to node6 at almost the same moment, node6
> gets overloaded and can't finish opening all of them within one minute.
>
> So this looks like a hbase bug to me (regions never get online when the
> region server failed to handle the OpenRegionRequest before the rpc
> timeout), am I right?
>
>
> On Fri, May 20, 2016 at 12:42 PM, Ted Yu  wrote:
>
>> Looks like region d1c7f3f455f2529da82a2f713b5ee067 received CLOSE request
>> when it was opening, leading to RegionAlreadyInTransitionException.
>>
>> Was there any clue in master log why the close request was sent ?
>>
>> Cheers
>>
>> On Wed, May 4, 2016 at 8:02 PM, Shuai Lin  wrote:
>>
>> > Hi Ted,
>> >
>> > The hbase version is 1.0.0-cdh5.4.8, shipped with cloudera CDH 5.4.8.
>> The
>> > RS logs on node6 can be found here <
>> http://paste.openstack.org/raw/496174/
>> > >
>> >  .
>> >
>> > Thanks!
>> >
>> > Shuai
>> >
>> > On Thu, May 5, 2016 at 9:15 AM, Ted Yu  wrote:
>> >
>> > > Can you pastebin related server log w.r.t.
>> > d1c7f3f455f2529da82a2f713b5ee067
>> > > from rs-node6 ?
>> > >
>> > > Which release of hbase are you using ?
>> > >
>> > > Cheers
>> > >
>> > > On Wed, May 4, 2016 at 6:07 PM, Shuai Lin 
>> > wrote:
>> > >
>> > > > Hi list,
>> > > >
>> > > > Last weekend I got a region server crashed, but some regions never
>> got
>> > > > online again on other RSes. I've gone through the logs, and here is
>> the
>> > > > timeline about some of the events:
>> > > >
>> > > > * 13:03:50 one of the region servers, rs-node7, died because of a disk
>> > > > failure. Master started to split rs-node7's WALs
>> > > >
>> > > >
>> > > > 2016-04-30 13:03:50,953 INFO
>> > > > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler:
>> Splitting
>> > > > logs for rs-node7.example.com,60020,1458724695128 before
>> assignment;
>> > > > region
>> > > > count=133
>> > > > 2016-04-30 13:03:50,966 DEBUG
>> > > > org.apache.hadoop.hbase.master.SplitLogManager: Scheduling batch of
>> > logs
>> > > to
>> > > > split
>> > > > 2016-04-30 13:03:50,966 INFO
>> > > > org.apache.hadoop.hbase.master.SplitLogManager: started splitting 33
>> > logs
>> > > > in [hdfs://nameservice1/hbase/WALs/rs-node7.example.com
>> > > > ,60020,1458724695128-splitting]
>> > > > for [rs-node7.example.com,60020,1458724695128]
>> > > >
>> > > > * 13:10:47 WAL splits done, master began to re-assign regions
>> > > >
>> > > > 2016-04-30 13:10:47,655 INFO
>> > > > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler:
>> > Reassigning
>> > > > 133 region(s) that rs-node7.example.com,60020,1458724695128 was
>> > carrying
>> > > > (and 0 regions(s) that were opening on this server)
>> > > > 2016-04-30 13:10:47,665 INFO
>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning 133
>> > > > region(s) across 6 server(s), round-robin=true
>> > > > 2016-04-30 13:10:47,667 INFO
>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
>> > region(s)
>> > > to
>> > > > rs-node1.example.com,60020,1458720625688
>> > > > 2016-04-30 13:10:47,667 INFO
>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
>> > region(s)
>> > > to
>> > > > rs-node2.example.com,60020,1458721110988
>> > > > 2016-04-30 13:10:47,667 INFO
>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
>> > region(s)
>> > > to
>> > > > rs-node3.example.com,60020,1458721713906
>> > > > 2016-04-30 13:10:47,679 INFO
>> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 23
>> > region(s)
>> > > to
>> > > > 

Re: Some regions never get online after a region server crashes

2016-05-20 Thread Shuai Lin
Because the "opening regions" rpc call sent by the master to the region
server node6 timed out after 1 minute?

*RPC call was sent:*

2016-04-30 13:10:47,702 INFO org.apache.hadoop.hbase.master.AssignmentManager:
Assigning 22 region(s) to rs-node6.example.com,60020,1458723856883

*After 1 minute:*

2016-04-30 13:11:47,780 INFO
org.apache.hadoop.hbase.master.AssignmentManager: Unable to communicate
with rs-node6.example.com,60020,1458723856883 in order to assign regions,
java.io.IOException: Call to rs-node6.example.com/172.16.6.6:60020 failed
on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call
id=4, waitTime=60001, operationTimeout=60000 expired.

2016-04-30 13:11:47,783 DEBUG
org.apache.hadoop.hbase.master.AssignmentManager: Force region state
offline {d1c7f3f455f2529da82a2f713b5ee067 state=PENDING_OPEN,
ts=1462021847743, server=rs-node6.example.com,60020,1458723856883}


I have checked the hbase source code, but didn't find any specific timeout
setting for the "open region" rpc call that I could use. So I guess it's using
the default "hbase.rpc.timeout", which defaults to 60 seconds. And since there
are 20+ regions being assigned to node6 at almost the same moment, node6
gets overloaded and can't finish opening all of them within one minute.

So this looks like a hbase bug to me (regions never get online when the
region server failed to handle the OpenRegionRequest before the rpc
timeout), am I right?
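
If that's what happened, one possible stopgap (just a guess on my side, not
verified) would be to give the open-region call more headroom by raising
hbase.rpc.timeout on the master, normally in hbase-site.xml; a rough
programmatic equivalent, with an example value only, would be:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;

  public class RaiseRpcTimeoutSketch {
    public static void main(String[] args) {
      Configuration conf = HBaseConfiguration.create();
      // hbase.rpc.timeout defaults to 60000 ms; 180000 ms is only an example value.
      // The real change would go into the master's hbase-site.xml plus a restart.
      conf.setLong("hbase.rpc.timeout", 180000L);
      System.out.println("hbase.rpc.timeout = "
          + conf.getLong("hbase.rpc.timeout", 60000L) + " ms");
    }
  }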


On Fri, May 20, 2016 at 12:42 PM, Ted Yu  wrote:

> Looks like region d1c7f3f455f2529da82a2f713b5ee067 received CLOSE request
> when it was opening, leading to RegionAlreadyInTransitionException.
>
> Was there any clue in master log why the close request was sent ?
>
> Cheers
>
> On Wed, May 4, 2016 at 8:02 PM, Shuai Lin  wrote:
>
> > Hi Ted,
> >
> > The hbase version is 1.0.0-cdh5.4.8, shipped with cloudera CDH 5.4.8. The
> > RS logs on node6 can be found here <
> http://paste.openstack.org/raw/496174/
> > >
> >  .
> >
> > Thanks!
> >
> > Shuai
> >
> > On Thu, May 5, 2016 at 9:15 AM, Ted Yu  wrote:
> >
> > > Can you pastebin related server log w.r.t.
> > d1c7f3f455f2529da82a2f713b5ee067
> > > from rs-node6 ?
> > >
> > > Which release of hbase are you using ?
> > >
> > > Cheers
> > >
> > > On Wed, May 4, 2016 at 6:07 PM, Shuai Lin 
> > wrote:
> > >
> > > > Hi list,
> > > >
> > > > Last weekend I got a region server crashed, but some regions never
> got
> > > > online again on other RSes. I've gone through the logs, and here is
> the
> > > > timeline about some of the events:
> > > >
> > > > * 13:03:50 one of the region servers, rs-node7, died because of a disk
> > > > failure. Master started to split rs-node7's WALs
> > > >
> > > >
> > > > 2016-04-30 13:03:50,953 INFO
> > > > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler:
> Splitting
> > > > logs for rs-node7.example.com,60020,1458724695128 before assignment;
> > > > region
> > > > count=133
> > > > 2016-04-30 13:03:50,966 DEBUG
> > > > org.apache.hadoop.hbase.master.SplitLogManager: Scheduling batch of
> > logs
> > > to
> > > > split
> > > > 2016-04-30 13:03:50,966 INFO
> > > > org.apache.hadoop.hbase.master.SplitLogManager: started splitting 33
> > logs
> > > > in [hdfs://nameservice1/hbase/WALs/rs-node7.example.com
> > > > ,60020,1458724695128-splitting]
> > > > for [rs-node7.example.com,60020,1458724695128]
> > > >
> > > > * 13:10:47 WAL splits done, master began to re-assign regions
> > > >
> > > > 2016-04-30 13:10:47,655 INFO
> > > > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler:
> > Reassigning
> > > > 133 region(s) that rs-node7.example.com,60020,1458724695128 was
> > carrying
> > > > (and 0 regions(s) that were opening on this server)
> > > > 2016-04-30 13:10:47,665 INFO
> > > > org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning 133
> > > > region(s) across 6 server(s), round-robin=true
> > > > 2016-04-30 13:10:47,667 INFO
> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
> > region(s)
> > > to
> > > > rs-node1.example.com,60020,1458720625688
> > > > 2016-04-30 13:10:47,667 INFO
> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
> > region(s)
> > > to
> > > > rs-node2.example.com,60020,1458721110988
> > > > 2016-04-30 13:10:47,667 INFO
> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
> > region(s)
> > > to
> > > > rs-node3.example.com,60020,1458721713906
> > > > 2016-04-30 13:10:47,679 INFO
> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 23
> > region(s)
> > > to
> > > > rs-node4.example.com,60020,1458722335527
> > > > 2016-04-30 13:10:47,691 INFO
> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
> > region(s)
> > > to
> > > > rs-node5.example.com,60020,1458722992296
> > > > 2016-04-30 13:10:47,702 INFO
> > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
> > region(s)
> > > to
> > > > 

Re: Some regions never get online after a region server crashes

2016-05-19 Thread Ted Yu
Looks like region d1c7f3f455f2529da82a2f713b5ee067 received CLOSE request
when it was opening, leading to RegionAlreadyInTransitionException.

Was there any clue in master log why the close request was sent ?

Cheers

On Wed, May 4, 2016 at 8:02 PM, Shuai Lin  wrote:

> Hi Ted,
>
> The hbase version is 1.0.0-cdh5.4.8, shipped with cloudera CDH 5.4.8. The
> RS logs on node6 can be found here <http://paste.openstack.org/raw/496174/>.
>
> Thanks!
>
> Shuai
>
> On Thu, May 5, 2016 at 9:15 AM, Ted Yu  wrote:
>
> > Can you pastebin related server log w.r.t.
> d1c7f3f455f2529da82a2f713b5ee067
> > from rs-node6 ?
> >
> > Which release of hbase are you using ?
> >
> > Cheers
> >
> > On Wed, May 4, 2016 at 6:07 PM, Shuai Lin 
> wrote:
> >
> > > Hi list,
> > >
> > > Last weekend I got a region server crashed, but some regions never got
> > > online again on other RSes. I've gone through the logs, and here is the
> > > timeline about some of the events:
> > >
> > > * 13:03:50 one of the region servers, rs-node7, died because of a disk
> > > failure. Master started to split rs-node7's WALs
> > >
> > >
> > > 2016-04-30 13:03:50,953 INFO
> > > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting
> > > logs for rs-node7.example.com,60020,1458724695128 before assignment;
> > > region
> > > count=133
> > > 2016-04-30 13:03:50,966 DEBUG
> > > org.apache.hadoop.hbase.master.SplitLogManager: Scheduling batch of
> logs
> > to
> > > split
> > > 2016-04-30 13:03:50,966 INFO
> > > org.apache.hadoop.hbase.master.SplitLogManager: started splitting 33
> logs
> > > in [hdfs://nameservice1/hbase/WALs/rs-node7.example.com
> > > ,60020,1458724695128-splitting]
> > > for [rs-node7.example.com,60020,1458724695128]
> > >
> > > * 13:10:47 WAL splits done, master began to re-assign regions
> > >
> > > 2016-04-30 13:10:47,655 INFO
> > > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler:
> Reassigning
> > > 133 region(s) that rs-node7.example.com,60020,1458724695128 was
> carrying
> > > (and 0 regions(s) that were opening on this server)
> > > 2016-04-30 13:10:47,665 INFO
> > > org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning 133
> > > region(s) across 6 server(s), round-robin=true
> > > 2016-04-30 13:10:47,667 INFO
> > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
> region(s)
> > to
> > > rs-node1.example.com,60020,1458720625688
> > > 2016-04-30 13:10:47,667 INFO
> > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
> region(s)
> > to
> > > rs-node2.example.com,60020,1458721110988
> > > 2016-04-30 13:10:47,667 INFO
> > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
> region(s)
> > to
> > > rs-node3.example.com,60020,1458721713906
> > > 2016-04-30 13:10:47,679 INFO
> > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 23
> region(s)
> > to
> > > rs-node4.example.com,60020,1458722335527
> > > 2016-04-30 13:10:47,691 INFO
> > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
> region(s)
> > to
> > > rs-node5.example.com,60020,1458722992296
> > > 2016-04-30 13:10:47,702 INFO
> > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
> region(s)
> > to
> > > rs-node6.example.com,60020,1458723856883
> > >
> > > * 13:11:47 the opening regions rpc sent by master to region servers
> timed
> > > out after 1 minute
> > >
> > > 2016-04-30 13:11:47,780 INFO
> > > org.apache.hadoop.hbase.master.AssignmentManager: Unable to communicate
> > > with rs-node3.example.com,60020,1458721713906 in order to assign
> regions
> > > java.io.IOException: Call to rs-node3.example.com/172.16.6.3:60020
> > failed
> > > on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException:
> > Call
> > > id=4, waitTime=60001, operationTimeout=60000 expired.
> > > 2016-04-30 13:11:47,782 INFO
> > > org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed assigning 22
> > > regions to server rs-node6.example.com,60020,1458723856883,
> reassigning
> > > them
> > > 2016-04-30 13:11:47,782 INFO
> > > org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed assigning 23
> > > regions to server rs-node4.example.com,60020,1458722335527,
> reassigning
> > > them
> > > 2016-04-30 13:11:47,782 INFO
> > > org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed assigning 22
> > > regions to server rs-node3.example.com,60020,1458721713906,
> reassigning
> > > them
> > > 2016-04-30 13:11:47,783 DEBUG
> > > org.apache.hadoop.hbase.master.AssignmentManager: Force region state
> > > offline {a65660e421f114e93862194f7cc35644 state=OPENING,
> > ts=1462021907753,
> > > server=rs-node6.example.com,60020,1458723856883}
> > >
> > >
> > > * After that, part of the regions (40 out of 130 regions) never got
> > online,
> > > and the following lines were logged repeatedly in the master log:
> > >
> > > 2016-04-30 13:12:37,188 DEBUG
> > > org.apache.hadoop.hbase.master.AssignmentManager: update
> 

Re: Some regions never get online after a region server crashes

2016-05-19 Thread Shuai Lin
Hi Ted, could you please take a look at the log I've pasted above? Thanks!

On Thu, May 5, 2016 at 11:02 AM, Shuai Lin  wrote:

> Hi Ted,
>
> The hbase version is 1.0.0-cdh5.4.8, shipped with cloudera CDH 5.4.8. The
> RS logs on node6 can be found here
>  .
>
> Thanks!
>
> Shuai
>
> On Thu, May 5, 2016 at 9:15 AM, Ted Yu  wrote:
>
>> Can you pastebin related server log w.r.t.
>> d1c7f3f455f2529da82a2f713b5ee067
>> from rs-node6 ?
>>
>> Which release of hbase are you using ?
>>
>> Cheers
>>
>> On Wed, May 4, 2016 at 6:07 PM, Shuai Lin  wrote:
>>
>> > Hi list,
>> >
>> > Last weekend I got a region server crashed, but some regions never got
>> > online again on other RSes. I've gone through the logs, and here is the
>> > timeline about some of the events:
>> >
>> > * 13:03:50 one of the region servers, rs-node7, died because of a disk
>> > failure. Master started to split rs-node7's WALs
>> >
>> >
>> > 2016-04-30 13:03:50,953 INFO
>> > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting
>> > logs for rs-node7.example.com,60020,1458724695128 before assignment;
>> > region
>> > count=133
>> > 2016-04-30 13:03:50,966 DEBUG
>> > org.apache.hadoop.hbase.master.SplitLogManager: Scheduling batch of
>> logs to
>> > split
>> > 2016-04-30 13:03:50,966 INFO
>> > org.apache.hadoop.hbase.master.SplitLogManager: started splitting 33
>> logs
>> > in [hdfs://nameservice1/hbase/WALs/rs-node7.example.com
>> > ,60020,1458724695128-splitting]
>> > for [rs-node7.example.com,60020,1458724695128]
>> >
>> > * 13:10:47 WAL splits done, master began to re-assign regions
>> >
>> > 2016-04-30 13:10:47,655 INFO
>> > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler:
>> Reassigning
>> > 133 region(s) that rs-node7.example.com,60020,1458724695128 was
>> carrying
>> > (and 0 regions(s) that were opening on this server)
>> > 2016-04-30 13:10:47,665 INFO
>> > org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning 133
>> > region(s) across 6 server(s), round-robin=true
>> > 2016-04-30 13:10:47,667 INFO
>> > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
>> region(s) to
>> > rs-node1.example.com,60020,1458720625688
>> > 2016-04-30 13:10:47,667 INFO
>> > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
>> region(s) to
>> > rs-node2.example.com,60020,1458721110988
>> > 2016-04-30 13:10:47,667 INFO
>> > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
>> region(s) to
>> > rs-node3.example.com,60020,1458721713906
>> > 2016-04-30 13:10:47,679 INFO
>> > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 23
>> region(s) to
>> > rs-node4.example.com,60020,1458722335527
>> > 2016-04-30 13:10:47,691 INFO
>> > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
>> region(s) to
>> > rs-node5.example.com,60020,1458722992296
>> > 2016-04-30 13:10:47,702 INFO
>> > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22
>> region(s) to
>> > rs-node6.example.com,60020,1458723856883
>> >
>> > * 13:11:47 the opening regions rpc sent by master to region servers
>> timed
>> > out after 1 minute
>> >
>> > 2016-04-30 13:11:47,780 INFO
>> > org.apache.hadoop.hbase.master.AssignmentManager: Unable to communicate
>> > with rs-node3.example.com,60020,1458721713906 in order to assign
>> regions
>> > java.io.IOException: Call to rs-node3.example.com/172.16.6.3:60020
>> failed
>> > on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException:
>> Call
>> > id=4, waitTime=60001, operationTimeout=60000 expired.
>> > 2016-04-30 13:11:47,782 INFO
>> > org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed assigning 22
>> > regions to server rs-node6.example.com,60020,1458723856883, reassigning
>> > them
>> > 2016-04-30 13:11:47,782 INFO
>> > org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed assigning 23
>> > regions to server rs-node4.example.com,60020,1458722335527, reassigning
>> > them
>> > 2016-04-30 13:11:47,782 INFO
>> > org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed assigning 22
>> > regions to server rs-node3.example.com,60020,1458721713906, reassigning
>> > them
>> > 2016-04-30 13:11:47,783 DEBUG
>> > org.apache.hadoop.hbase.master.AssignmentManager: Force region state
>> > offline {a65660e421f114e93862194f7cc35644 state=OPENING,
>> ts=1462021907753,
>> > server=rs-node6.example.com,60020,1458723856883}
>> >
>> >
>> > * After that, part of the regions (40 out of 130 regions) never got
>> online,
>> > and the following lines were logged repeatedly in the master log:
>> >
>> > 2016-04-30 13:12:37,188 DEBUG
>> > org.apache.hadoop.hbase.master.AssignmentManager: update
>> > {d1c7f3f455f2529da82a2f713b5ee067 state=PENDING_OPEN, ts=1462021957087,
>> > server=rs-node6.example.com,60020,1458723856883} the timestamp.
>> > 2016-04-30 13:12:37,188 DEBUG
>> > org.apache.hadoop.hbase.master.AssignmentManager: Region is already in
>> > 

Re: Some regions never get online after a region server crashes

2016-05-04 Thread Shuai Lin
Hi Ted,

The hbase version is 1.0.0-cdh5.4.8, shipped with cloudera CDH 5.4.8. The
RS logs on node6 can be found here <http://paste.openstack.org/raw/496174/>.

Thanks!

Shuai

On Thu, May 5, 2016 at 9:15 AM, Ted Yu  wrote:

> Can you pastebin related server log w.r.t. d1c7f3f455f2529da82a2f713b5ee067
> from rs-node6 ?
>
> Which release of hbase are you using ?
>
> Cheers
>
> On Wed, May 4, 2016 at 6:07 PM, Shuai Lin  wrote:
>
> > Hi list,
> >
> > Last weekend I got a region server crashed, but some regions never got
> > online again on other RSes. I've gone through the logs, and here is the
> > timeline about some of the events:
> >
> > * 13:03:50 one of the region servers, rs-node7, died because of a disk
> > failure. Master started to split rs-node7's WALs
> >
> >
> > 2016-04-30 13:03:50,953 INFO
> > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting
> > logs for rs-node7.example.com,60020,1458724695128 before assignment;
> > region
> > count=133
> > 2016-04-30 13:03:50,966 DEBUG
> > org.apache.hadoop.hbase.master.SplitLogManager: Scheduling batch of logs
> to
> > split
> > 2016-04-30 13:03:50,966 INFO
> > org.apache.hadoop.hbase.master.SplitLogManager: started splitting 33 logs
> > in [hdfs://nameservice1/hbase/WALs/rs-node7.example.com
> > ,60020,1458724695128-splitting]
> > for [rs-node7.example.com,60020,1458724695128]
> >
> > * 13:10:47 WAL splits done, master began to re-assign regions
> >
> > 2016-04-30 13:10:47,655 INFO
> > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning
> > 133 region(s) that rs-node7.example.com,60020,1458724695128 was carrying
> > (and 0 regions(s) that were opening on this server)
> > 2016-04-30 13:10:47,665 INFO
> > org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning 133
> > region(s) across 6 server(s), round-robin=true
> > 2016-04-30 13:10:47,667 INFO
> > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s)
> to
> > rs-node1.example.com,60020,1458720625688
> > 2016-04-30 13:10:47,667 INFO
> > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s)
> to
> > rs-node2.example.com,60020,1458721110988
> > 2016-04-30 13:10:47,667 INFO
> > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s)
> to
> > rs-node3.example.com,60020,1458721713906
> > 2016-04-30 13:10:47,679 INFO
> > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 23 region(s)
> to
> > rs-node4.example.com,60020,1458722335527
> > 2016-04-30 13:10:47,691 INFO
> > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s)
> to
> > rs-node5.example.com,60020,1458722992296
> > 2016-04-30 13:10:47,702 INFO
> > org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s)
> to
> > rs-node6.example.com,60020,1458723856883
> >
> > * 13:11:47 the opening regions rpc sent by master to region servers timed
> > out after 1 minute
> >
> > 2016-04-30 13:11:47,780 INFO
> > org.apache.hadoop.hbase.master.AssignmentManager: Unable to communicate
> > with rs-node3.example.com,60020,1458721713906 in order to assign regions
> > java.io.IOException: Call to rs-node3.example.com/172.16.6.3:60020
> failed
> > on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException:
> Call
> > id=4, waitTime=60001, operationTimeout=60000 expired.
> > 2016-04-30 13:11:47,782 INFO
> > org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed assigning 22
> > regions to server rs-node6.example.com,60020,1458723856883, reassigning
> > them
> > 2016-04-30 13:11:47,782 INFO
> > org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed assigning 23
> > regions to server rs-node4.example.com,60020,1458722335527, reassigning
> > them
> > 2016-04-30 13:11:47,782 INFO
> > org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed assigning 22
> > regions to server rs-node3.example.com,60020,1458721713906, reassigning
> > them
> > 2016-04-30 13:11:47,783 DEBUG
> > org.apache.hadoop.hbase.master.AssignmentManager: Force region state
> > offline {a65660e421f114e93862194f7cc35644 state=OPENING,
> ts=1462021907753,
> > server=rs-node6.example.com,60020,1458723856883}
> >
> >
> > * After that, part of the regions (40 out of 130 regions) never got
> online,
> > and the following lines were logged repeatedly in the master log:
> >
> > 2016-04-30 13:12:37,188 DEBUG
> > org.apache.hadoop.hbase.master.AssignmentManager: update
> > {d1c7f3f455f2529da82a2f713b5ee067 state=PENDING_OPEN, ts=1462021957087,
> > server=rs-node6.example.com,60020,1458723856883} the timestamp.
> > 2016-04-30 13:12:37,188 DEBUG
> > org.apache.hadoop.hbase.master.AssignmentManager: Region is already in
> > transition; waiting up to 10668ms
> >
> > $ grep 'AssignmentManager: update {d1c7f3f455f2529da82a2f713b5ee067'
> > /var/log/hbase/hbase-cmf-hbase-MASTER-head.example.com.log.out.1|wc -l
> > 484
> >
> >
> > I've searched in mailing list archive and hbase JIRA, but didn't find any
> > similar 

Re: Some regions never get online after a region server crashes

2016-05-04 Thread Ted Yu
Can you pastebin related server log w.r.t. d1c7f3f455f2529da82a2f713b5ee067
from rs-node6 ?

Which release of hbase are you using ?

Cheers

On Wed, May 4, 2016 at 6:07 PM, Shuai Lin  wrote:

> Hi list,
>
> Last weekend I got a region server crashed, but some regions never got
> online again on other RSes. I've gone through the logs, and here is the
> timeline about some of the events:
>
> * 13:03:50 one of the region servers, rs-node7, died because of a disk
> failure. Master started to split rs-node7's WALs
>
>
> 2016-04-30 13:03:50,953 INFO
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting
> logs for rs-node7.example.com,60020,1458724695128 before assignment;
> region
> count=133
> 2016-04-30 13:03:50,966 DEBUG
> org.apache.hadoop.hbase.master.SplitLogManager: Scheduling batch of logs to
> split
> 2016-04-30 13:03:50,966 INFO
> org.apache.hadoop.hbase.master.SplitLogManager: started splitting 33 logs
> in [hdfs://nameservice1/hbase/WALs/rs-node7.example.com
> ,60020,1458724695128-splitting]
> for [rs-node7.example.com,60020,1458724695128]
>
> * 13:10:47 WAL splits done, master began to re-assign regions
>
> 2016-04-30 13:10:47,655 INFO
> org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Reassigning
> 133 region(s) that rs-node7.example.com,60020,1458724695128 was carrying
> (and 0 regions(s) that were opening on this server)
> 2016-04-30 13:10:47,665 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Bulk assigning 133
> region(s) across 6 server(s), round-robin=true
> 2016-04-30 13:10:47,667 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to
> rs-node1.example.com,60020,1458720625688
> 2016-04-30 13:10:47,667 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to
> rs-node2.example.com,60020,1458721110988
> 2016-04-30 13:10:47,667 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to
> rs-node3.example.com,60020,1458721713906
> 2016-04-30 13:10:47,679 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Assigning 23 region(s) to
> rs-node4.example.com,60020,1458722335527
> 2016-04-30 13:10:47,691 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to
> rs-node5.example.com,60020,1458722992296
> 2016-04-30 13:10:47,702 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Assigning 22 region(s) to
> rs-node6.example.com,60020,1458723856883
>
> * 13:11:47 the opening regions rpc sent by master to region servers timed
> out after 1 minute
>
> 2016-04-30 13:11:47,780 INFO
> org.apache.hadoop.hbase.master.AssignmentManager: Unable to communicate
> with rs-node3.example.com,60020,1458721713906 in order to assign regions
> java.io.IOException: Call to rs-node3.example.com/172.16.6.3:60020 failed
> on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call
> id=4, waitTime=60001, operationTimeout=60000 expired.
> 2016-04-30 13:11:47,782 INFO
> org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed assigning 22
> regions to server rs-node6.example.com,60020,1458723856883, reassigning
> them
> 2016-04-30 13:11:47,782 INFO
> org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed assigning 23
> regions to server rs-node4.example.com,60020,1458722335527, reassigning
> them
> 2016-04-30 13:11:47,782 INFO
> org.apache.hadoop.hbase.master.GeneralBulkAssigner: Failed assigning 22
> regions to server rs-node3.example.com,60020,1458721713906, reassigning
> them
> 2016-04-30 13:11:47,783 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Force region state
> offline {a65660e421f114e93862194f7cc35644 state=OPENING, ts=1462021907753,
> server=rs-node6.example.com,60020,1458723856883}
>
>
> * After that, part of the regions (40 out of 130 regions) never got online,
> and the following lines were logged repeatedly in the master log:
>
> 2016-04-30 13:12:37,188 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: update
> {d1c7f3f455f2529da82a2f713b5ee067 state=PENDING_OPEN, ts=1462021957087,
> server=rs-node6.example.com,60020,1458723856883} the timestamp.
> 2016-04-30 13:12:37,188 DEBUG
> org.apache.hadoop.hbase.master.AssignmentManager: Region is already in
> transition; waiting up to 10668ms
>
> $ grep 'AssignmentManager: update {d1c7f3f455f2529da82a2f713b5ee067'
> /var/log/hbase/hbase-cmf-hbase-MASTER-head.example.com.log.out.1|wc -l
> 484
>
>
> I've searched in mailing list archive and hbase JIRA, but didn't find any
> similar situations like this one. The most similar one is HBASE-14407
> <https://issues.apache.org/jira/browse/HBASE-14407>, but after reading its
> discussion I don't think that's the same problem.
>
> Anyone have a clue? Thanks!
>
> Regards,
> Shuai
>