On Thu, Mar 10, 2011 at 3:41 AM, Tatsuya Kawano <[email protected]> wrote:
> I suggested him to upgrade his environment to the latest version, so
> at this time, he used CDH3b4 (HBase 0.90.1) and performed the same
> test procedure. Then now he got a new issue. HMaster was aborted
> because it couldn't reach to the host that had the kernel panic.
>
> Can anybody verify this issue for us?
> You can just issue "echo c > /proc/sysrq-trigger" on a worker node
> running region server, and check what would happen after a couple of
> minutes.
>

I did the above, Tatsuya, and saw this in the RS messages log:

Mar 10 10:25:46 sv4borg228 kernel: [1189382.838243] SysRq : Trigger a crashdump

... but all just kept chugging along. (The RS stays up).

> ---------------------------------------------------------------------------------------------------
> 2011-03-10 07:48:39,192 FATAL org.apache.hadoop.hbase.master.HMaster:
> Remote unexpected exception
> java.net.NoRouteToHostException: No route to host

This is odd. Communication with the RegionServer was working fine up
until it crashed? On crash, the Master starts doing NRTHE? Master
root filesystem is not full?

Checking the code, this exception will not be caught and it will
trigger a Master abort. That's a problem. I opened
https://issues.apache.org/jira/browse/HBASE-3617
Will fix for 0.90.2.

Try to figure out more on why the NRTHE above happened, Tatsuya, if you can.

St.Ack
