Hi Stack, 

Thanks for checking this issue and filing HBASE-3617. Well, that command was 
supposed to make the node crash and shut down. I'll check the detailed procedure 
and try to reproduce this issue over the weekend. 


> This is odd.  Communication with the RegionServer was working fine up
> until it crashed?  On crash, the Master starts doing NRTHE?  

Yes. NRTHE occurred about two minutes after the RS crash. He tried the same test 
procedure twice and got the same result.


> Master root filesystem is not full?

No, it shouldn't be full. I asked him to watch the disk space and network 
connection at a very early stage of our conversation. 


> Try to figure more on why the NRTHE above happened Tatsuya, if you can.

Sure. Let me work on it. I'll have some time on Saturday and Sunday morning to 
set up a test cluster and play with the issue. 

Thanks,

--
Tatsuya Kawano
Tokyo, Japan


On Mar 11, 2011, at 6:51 AM, Stack <[email protected]> wrote:

> On Thu, Mar 10, 2011 at 3:41 AM, Tatsuya Kawano <[email protected]> wrote:
>> I suggested that he upgrade his environment to the latest version, so
>> this time he used CDH3b4 (HBase 0.90.1) and performed the same
>> test procedure. Now he has a new issue: HMaster aborted
>> because it couldn't reach the host that had the kernel panic.
>> 
>> Can anybody verify this issue for us?
>> You can just issue "echo c > /proc/sysrq-trigger" on a worker node
>> running region server, and check what would happen after a couple of
>> minutes.
>> 
> 
> I did the above Tatsuya and saw this in the RS messages log:
> 
> Mar 10 10:25:46 sv4borg228 kernel: [1189382.838243] SysRq : Trigger a crashdump
> 
> ... but all just kept chugging along.
> 
> (The RS stays up).
> 
> 
>> ---------------------------------------------------------------------------------------------------
>> 2011-03-10 07:48:39,192 FATAL org.apache.hadoop.hbase.master.HMaster:
>> Remote unexpected exception
>> java.net.NoRouteToHostException: No route to host
> 
> This is odd.  Communication with the RegionServer was working fine up
> until it crashed?  On crash, the Master starts doing NRTHE?  Master
> root filesystem is not full?
> 
> Checking code, this exception will not be caught and it will trigger a
> Master abort.  That's a problem.  I opened
> https://issues.apache.org/jira/browse/HBASE-3617  Will fix for 0.90.2.
> 
> Try to figure more on why the NRTHE above happened Tatsuya, if you can.
> 
> St.Ack
