Have you looked at the NameNode log? The snippet you posted seems to imply an issue with data block placement.

Cheers
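For anyone following along, one way to act on that suggestion, assuming a stock Hadoop 1.x layout (the NameNode log path below is a guess; adjust it to your install):

    # Look for the failing block in the NameNode log:
    grep 'blk_1597245478875608321' "$HADOOP_HOME"/logs/hadoop-*-namenode-*.log

    # fsck reports block placement, replication, and corrupt/missing blocks:
    hadoop fsck /hbase -files -blocks -locations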
On Jun 5, 2013, at 4:12 AM, Vimal Jain <[email protected]> wrote:

> I am running HBase in pseudo-distributed mode, so there is only one
> machine involved.
> I am using Hadoop version 1.1.2 and HBase version 0.94.7.
>
> On Wed, Jun 5, 2013 at 4:38 PM, Ted Yu <[email protected]> wrote:
>
>> How many region servers / data nodes do you have?
>>
>> What Hadoop / HBase version are you using?
>>
>> Thanks
>>
>> On Jun 5, 2013, at 3:54 AM, Vimal Jain <[email protected]> wrote:
>>
>>> Yes, I did check those.
>>> But I am not sure those parameter settings are the issue, as there are
>>> some other exceptions in the logs ("DFSOutputStream ResponseProcessor
>>> exception", etc.).
>>>
>>> On Wed, Jun 5, 2013 at 4:19 PM, Ted Yu <[email protected]> wrote:
>>>
>>>> There are a few tips under:
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>
>>>> Can you check?
>>>>
>>>> Thanks
>>>>
>>>> On Jun 5, 2013, at 2:05 AM, Vimal Jain <[email protected]> wrote:
>>>>
>>>>> I don't think so, as I don't find any issues in the data node logs.
>>>>> Also there are lots of exceptions like "session expired" and "slept
>>>>> more than configured time". What are these?
>>>>>
>>>>> On Wed, Jun 5, 2013 at 2:27 PM, Azuryy Yu <[email protected]> wrote:
>>>>>
>>>>>> Because your data node 192.168.20.30 broke down, which led to the RS
>>>>>> going down.
>>>>>>
>>>>>> On Wed, Jun 5, 2013 at 3:19 PM, Vimal Jain <[email protected]> wrote:
>>>>>>
>>>>>>> Here is the complete log:
>>>>>>>
>>>>>>> http://bin.cakephp.org/saved/103001 - HRegionServer
>>>>>>> http://bin.cakephp.org/saved/103000 - HMaster
>>>>>>> http://bin.cakephp.org/saved/103002 - Datanode
>>>>>>>
>>>>>>> On Wed, Jun 5, 2013 at 11:58 AM, Vimal Jain <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>> I have set up HBase in pseudo-distributed mode.
>>>>>>>> It was working fine for 6 days, but suddenly this morning both the
>>>>>>>> HMaster and HRegionServer processes went down.
>>>>>>>> I checked the logs of both Hadoop and HBase.
>>>>>>>> Please help here.
>>>>>>>> Here are the snippets:
>>>>>>>>
>>>>>>>> *Datanode logs:*
>>>>>>>> 2013-06-05 05:12:51,436 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_1597245478875608321_2818 java.io.EOFException: while trying to read 2347 bytes
>>>>>>>> 2013-06-05 05:12:51,442 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_1597245478875608321_2818 received exception java.io.EOFException: while trying to read 2347 bytes
>>>>>>>> 2013-06-05 05:12:51,442 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.20.30:50010, storageID=DS-1816106352-192.168.20.30-50010-1369314076237, infoPort=50075, ipcPort=50020):DataXceiver
>>>>>>>> java.io.EOFException: while trying to read 2347 bytes
>>>>>>>>
>>>>>>>> *HRegionServer logs:*
>>>>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4694929ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>>>>> 2013-06-05 05:12:51,045 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_1597245478875608321_2818 java.net.SocketTimeoutException: 63000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.20.30:44333 remote=/192.168.20.30:50010]
>>>>>>>> 2013-06-05 05:12:51,046 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 11695345ms instead of 10000000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>>>>> 2013-06-05 05:12:51,048 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_1597245478875608321_2818 bad datanode[0] 192.168.20.30:50010
>>>>>>>> 2013-06-05 05:12:51,075 WARN org.apache.hadoop.hdfs.DFSClient: Error while syncing
>>>>>>>> java.io.IOException: All datanodes 192.168.20.30:50010 are bad. Aborting...
>>>>>>>>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
>>>>>>>> 2013-06-05 05:12:51,110 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync. Requesting close of hlog
>>>>>>>> java.io.IOException: Reflection
>>>>>>>> Caused by: java.lang.reflect.InvocationTargetException
>>>>>>>> Caused by: java.io.IOException: DFSOutputStream is closed
>>>>>>>> 2013-06-05 05:12:51,180 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync. Requesting close of hlog
>>>>>>>> java.io.IOException: Reflection
>>>>>>>> Caused by: java.lang.reflect.InvocationTargetException
>>>>>>>> Caused by: java.io.IOException: DFSOutputStream is closed
>>>>>>>> 2013-06-05 05:12:51,183 ERROR org.apache.hadoop.hbase.regionserver.wal.HLog: Failed close of HLog writer
>>>>>>>> java.io.IOException: Reflection
>>>>>>>> Caused by: java.lang.reflect.InvocationTargetException
>>>>>>>> Caused by: java.io.IOException: DFSOutputStream is closed
>>>>>>>> 2013-06-05 05:12:51,184 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: Riding over HLog close failure! error count=1
>>>>>>>> 2013-06-05 05:12:52,557 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server hbase.rummycircle.com,60020,1369877672964: regionserver:60020-0x13ef31264d00001 regionserver:60020-0x13ef31264d00001 received expired from ZooKeeper, aborting
>>>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>>>>>>>> 2013-06-05 05:12:52,557 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
>>>>>>>> 2013-06-05 05:12:52,621 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker interrupted while waiting for task, exiting: java.lang.InterruptedException
>>>>>>>> java.io.InterruptedIOException: Aborting compaction of store cfp_info in region event_data,244630,1369879570539.3ebddcd11a3c22585a690bf40911cb1e. because user requested stop.
>>>>>>>> 2013-06-05 05:12:53,425 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com,60020,1369877672964
>>>>>>>> 2013-06-05 05:12:55,426 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com,60020,1369877672964
>>>>>>>> 2013-06-05 05:12:59,427 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com,60020,1369877672964
>>>>>>>> 2013-06-05 05:13:07,427 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com,60020,1369877672964
>>>>>>>> 2013-06-05 05:13:07,427 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper delete failed after 3 retries
>>>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com,60020,1369877672964
>>>>>>>>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>>>>>>>>         at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>>>>>>>> 2013-06-05 05:13:07,436 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /hbase/.logs/hbase.rummycircle.com,60020,1369877672964/hbase.rummycircle.com%2C60020%2C1369877672964.1370382721642 : java.io.IOException: All datanodes 192.168.20.30:50010 are bad. Aborting...
>>>>>>>> java.io.IOException: All datanodes 192.168.20.30:50010 are bad. Aborting...
>>>>>>>>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
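A note on the region server log above: the 63000 ms ResponseProcessor timeout looks like dfs.socket.timeout's 60 s default plus the small per-datanode extension the 1.x client adds, and with a single datanode (replication 1) pipeline recovery has no other node to fall back on, hence "All datanodes ... are bad". If stalls rather than a real datanode failure are suspected, the relevant Hadoop 1.x timeouts can be raised in hdfs-site.xml. A sketch only, with illustrative values:

    <property>
      <name>dfs.socket.timeout</name>
      <value>180000</value>   <!-- client/datanode read timeout; default 60000 ms -->
    </property>
    <property>
      <name>dfs.datanode.socket.write.timeout</name>
      <value>960000</value>   <!-- pipeline write timeout; default 480000 ms -->
    </property>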
>>>>>>>> *HMaster logs:*
>>>>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4702394ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4988731ms instead of 300000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4988726ms instead of 300000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4698291ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>>>>> 2013-06-05 05:12:50,711 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4694502ms instead of 1000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>>>>> 2013-06-05 05:12:50,714 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4694492ms instead of 1000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>>>>> 2013-06-05 05:12:50,715 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4695589ms instead of 60000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>>>>> 2013-06-05 05:12:52,263 FATAL org.apache.hadoop.hbase.master.HMaster: Master server abort: loaded coprocessors are: []
>>>>>>>> 2013-06-05 05:12:52,465 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 1, slept for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
>>>>>>>> 2013-06-05 05:12:52,561 ERROR org.apache.hadoop.hbase.master.HMaster: Region server hbase.rummycircle.com,60020,1369877672964 reported a fatal error:
>>>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>>>>>>>> 2013-06-05 05:12:53,970 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 1, slept for 1506 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
>>>>>>>> 2013-06-05 05:12:55,476 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 1, slept for 3012 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
>>>>>>>> 2013-06-05 05:12:56,981 INFO org.apache.hadoop.hbase.master.ServerManager: Finished waiting for region servers count to settle; checked in 1, slept for 4517 ms, expecting minimum of 1, maximum of 2147483647, master is running.
>>>>>>>> 2013-06-05 05:12:57,019 INFO org.apache.hadoop.hbase.catalog.CatalogTracker: Failed verification of -ROOT-,,0 at address=hbase.rummycircle.com,60020,1369877672964; java.io.EOFException
>>>>>>>> 2013-06-05 05:17:52,302 WARN org.apache.hadoop.hbase.master.SplitLogManager: error while splitting logs in [hdfs://192.168.20.30:9000/hbase/.logs/hbase.rummycircle.com,60020,1369877672964-splitting] installed = 19 but only 0 done
>>>>>>>> 2013-06-05 05:17:52,321 FATAL org.apache.hadoop.hbase.master.HMaster: master:60000-0x13ef31264d00000 master:60000-0x13ef31264d00000 received expired from ZooKeeper, aborting
>>>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>>>>>>>> java.io.IOException: Giving up after tries=1
>>>>>>>> Caused by: java.lang.InterruptedException: sleep interrupted
>>>>>>>> 2013-06-05 05:17:52,381 ERROR org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start master
>>>>>>>> java.lang.RuntimeException: HMaster Aborted
>>>>>>>>
>>>>>>>> --
>>>>>>>> Thanks and Regards,
>>>>>>>> Vimal Jain
>>>>>>>
>>>>>>> --
>>>>>>> Thanks and Regards,
>>>>>>> Vimal Jain
>>>>>
>>>>> --
>>>>> Thanks and Regards,
>>>>> Vimal Jain
>>>
>>> --
>>> Thanks and Regards,
>>> Vimal Jain
>
> --
> Thanks and Regards,
> Vimal Jain
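For reference, the book section cited throughout the thread (trouble.rs.runtime.zkexpired) comes down to two knobs: give the ZooKeeper session more headroom, and keep GC pauses short. A minimal sketch with illustrative values follows; note that when HBase manages ZooKeeper, the ZK server's maxSessionTimeout (20 x tickTime by default) also caps whatever the client requests. And the "slept 4694929ms" lines above amount to a ~78-minute pause, far beyond what any session timeout can absorb, so these settings only mitigate much shorter stalls.

    <!-- hbase-site.xml -->
    <property>
      <name>zookeeper.session.timeout</name>
      <value>300000</value>   <!-- illustrative; bounded by the ZK server's maxSessionTimeout -->
    </property>

    # hbase-env.sh -- CMS flags commonly suggested for 0.94-era heaps; tune for your own:
    export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"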
