I could not find any issue in the NameNode. Here is the NameNode log at the time of the issue:
http://bin.cakephp.org/saved/103003

On Wed, Jun 5, 2013 at 5:00 PM, Ted Yu <[email protected]> wrote:

> Have you looked at the NameNode log?
>
> The snippet you posted seems to imply an issue with data block placement.
>
> Cheers
>
> On Jun 5, 2013, at 4:12 AM, Vimal Jain <[email protected]> wrote:
>
> > I am running HBase in pseudo-distributed mode, so there is only one machine involved.
> > I am using Hadoop 1.1.2 and HBase 0.94.7.
> >
> > On Wed, Jun 5, 2013 at 4:38 PM, Ted Yu <[email protected]> wrote:
> >
> >> How many region servers / data nodes do you have?
> >>
> >> What Hadoop / HBase versions are you using?
> >>
> >> Thanks
> >>
> >> On Jun 5, 2013, at 3:54 AM, Vimal Jain <[email protected]> wrote:
> >>
> >>> Yes, I did check those.
> >>> But I am not sure those parameter settings are the issue, as there are some other exceptions in the logs ("DFSOutputStream ResponseProcessor exception", etc.).
> >>>
> >>> On Wed, Jun 5, 2013 at 4:19 PM, Ted Yu <[email protected]> wrote:
> >>>
> >>>> There are a few tips under:
> >>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>
> >>>> Can you check them?
> >>>>
> >>>> Thanks
> >>>>
> >>>> On Jun 5, 2013, at 2:05 AM, Vimal Jain <[email protected]> wrote:
> >>>>
> >>>>> I don't think so, as I don't find any issues in the data node logs.
> >>>>> Also, there are a lot of exceptions like "session expired" and "slept more than configured time". What are these?
> >>>>>
> >>>>> On Wed, Jun 5, 2013 at 2:27 PM, Azuryy Yu <[email protected]> wrote:
> >>>>>
> >>>>>> Because your data node 192.168.20.30 broke down, which led to the RS going down.
> >>>>>>
> >>>>>> On Wed, Jun 5, 2013 at 3:19 PM, Vimal Jain <[email protected]> wrote:
> >>>>>>
> >>>>>>> Here are the complete logs:
> >>>>>>>
> >>>>>>> http://bin.cakephp.org/saved/103001 - HRegion
> >>>>>>> http://bin.cakephp.org/saved/103000 - HMaster
> >>>>>>> http://bin.cakephp.org/saved/103002 - Datanode
> >>>>>>>
> >>>>>>> On Wed, Jun 5, 2013 at 11:58 AM, Vimal Jain <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>> I have set up HBase in pseudo-distributed mode.
> >>>>>>>> It was working fine for 6 days, but suddenly this morning both the HMaster and HRegion processes went down.
> >>>>>>>> I checked the logs of both Hadoop and HBase.
> >>>>>>>> Please help here.
> >>>>>>>> Here are the snippets:
> >>>>>>>>
> >>>>>>>> *Datanode logs:*
> >>>>>>>> 2013-06-05 05:12:51,436 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_1597245478875608321_2818 java.io.EOFException: while trying to read 2347 bytes
> >>>>>>>> 2013-06-05 05:12:51,442 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_1597245478875608321_2818 received exception java.io.EOFException: while trying to read 2347 bytes
> >>>>>>>> 2013-06-05 05:12:51,442 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.20.30:50010, storageID=DS-1816106352-192.168.20.30-50010-1369314076237, infoPort=50075, ipcPort=50020):DataXceiver
> >>>>>>>> java.io.EOFException: while trying to read 2347 bytes
> >>>>>>>>
> >>>>>>>> *HRegion logs:*
> >>>>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4694929ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>>>>> 2013-06-05 05:12:51,045 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_1597245478875608321_2818 java.net.SocketTimeoutException: 63000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.20.30:44333 remote=/192.168.20.30:50010]
> >>>>>>>> 2013-06-05 05:12:51,046 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 11695345ms instead of 10000000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>>>>> 2013-06-05 05:12:51,048 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_1597245478875608321_2818 bad datanode[0] 192.168.20.30:50010
> >>>>>>>> 2013-06-05 05:12:51,075 WARN org.apache.hadoop.hdfs.DFSClient: Error while syncing
> >>>>>>>> java.io.IOException: All datanodes 192.168.20.30:50010 are bad. Aborting...
> >>>>>>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> >>>>>>>> 2013-06-05 05:12:51,110 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync. Requesting close of hlog
> >>>>>>>> java.io.IOException: Reflection
> >>>>>>>> Caused by: java.lang.reflect.InvocationTargetException
> >>>>>>>> Caused by: java.io.IOException: DFSOutputStream is closed
> >>>>>>>> 2013-06-05 05:12:51,180 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync. Requesting close of hlog
> >>>>>>>> java.io.IOException: Reflection
> >>>>>>>> Caused by: java.lang.reflect.InvocationTargetException
> >>>>>>>> Caused by: java.io.IOException: DFSOutputStream is closed
> >>>>>>>> 2013-06-05 05:12:51,183 ERROR org.apache.hadoop.hbase.regionserver.wal.HLog: Failed close of HLog writer
> >>>>>>>> java.io.IOException: Reflection
> >>>>>>>> Caused by: java.lang.reflect.InvocationTargetException
> >>>>>>>> Caused by: java.io.IOException: DFSOutputStream is closed
> >>>>>>>> 2013-06-05 05:12:51,184 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: Riding over HLog close failure! error count=1
> >>>>>>>> 2013-06-05 05:12:52,557 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server hbase.rummycircle.com,60020,1369877672964: regionserver:60020-0x13ef31264d00001 regionserver:60020-0x13ef31264d00001 received expired from ZooKeeper, aborting
> >>>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
> >>>>>>>> 2013-06-05 05:12:52,557 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
> >>>>>>>> 2013-06-05 05:12:52,621 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker interrupted while waiting for task, exiting: java.lang.InterruptedException
> >>>>>>>> java.io.InterruptedIOException: Aborting compaction of store cfp_info in region event_data,244630,1369879570539.3ebddcd11a3c22585a690bf40911cb1e. because user requested stop.
> >>>>>>>> 2013-06-05 05:12:53,425 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com,60020,1369877672964
> >>>>>>>> 2013-06-05 05:12:55,426 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com,60020,1369877672964
> >>>>>>>> 2013-06-05 05:12:59,427 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com,60020,1369877672964
> >>>>>>>> 2013-06-05 05:13:07,427 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com,60020,1369877672964
> >>>>>>>> 2013-06-05 05:13:07,427 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper delete failed after 3 retries
> >>>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com,60020,1369877672964
> >>>>>>>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> >>>>>>>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> >>>>>>>> 2013-06-05 05:13:07,436 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /hbase/.logs/hbase.rummycircle.com,60020,1369877672964/hbase.rummycircle.com%2C60020%2C1369877672964.1370382721642 : java.io.IOException: All datanodes 192.168.20.30:50010 are bad. Aborting...
> >>>>>>>> java.io.IOException: All datanodes 192.168.20.30:50010 are bad. Aborting...
> >>>>>>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> >>>>>>>>
> >>>>>>>> *HMaster logs:*
> >>>>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4702394ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4988731ms instead of 300000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4988726ms instead of 300000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4698291ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>>>>> 2013-06-05 05:12:50,711 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4694502ms instead of 1000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>>>>> 2013-06-05 05:12:50,714 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4694492ms instead of 1000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>>>>> 2013-06-05 05:12:50,715 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4695589ms instead of 60000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>>>>> 2013-06-05 05:12:52,263 FATAL org.apache.hadoop.hbase.master.HMaster: Master server abort: loaded coprocessors are: []
> >>>>>>>> 2013-06-05 05:12:52,465 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 1, slept for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
> >>>>>>>> 2013-06-05 05:12:52,561 ERROR org.apache.hadoop.hbase.master.HMaster: Region server hbase.rummycircle.com,60020,1369877672964 reported a fatal error:
> >>>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
> >>>>>>>> 2013-06-05 05:12:53,970 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 1, slept for 1506 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
> >>>>>>>> 2013-06-05 05:12:55,476 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 1, slept for 3012 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
> >>>>>>>> 2013-06-05 05:12:56,981 INFO org.apache.hadoop.hbase.master.ServerManager: Finished waiting for region servers count to settle; checked in 1, slept for 4517 ms, expecting minimum of 1, maximum of 2147483647, master is running.
> >>>>>>>> 2013-06-05 05:12:57,019 INFO org.apache.hadoop.hbase.catalog.CatalogTracker: Failed verification of -ROOT-,,0 at address=hbase.rummycircle.com,60020,1369877672964; java.io.EOFException
> >>>>>>>> 2013-06-05 05:17:52,302 WARN org.apache.hadoop.hbase.master.SplitLogManager: error while splitting logs in [hdfs://192.168.20.30:9000/hbase/.logs/hbase.rummycircle.com,60020,1369877672964-splitting] installed = 19 but only 0 done
> >>>>>>>> 2013-06-05 05:17:52,321 FATAL org.apache.hadoop.hbase.master.HMaster: master:60000-0x13ef31264d00000 master:60000-0x13ef31264d00000 received expired from ZooKeeper, aborting
> >>>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
> >>>>>>>> java.io.IOException: Giving up after tries=1
> >>>>>>>> Caused by: java.lang.InterruptedException: sleep interrupted
> >>>>>>>> 2013-06-05 05:17:52,381 ERROR org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start master
> >>>>>>>> java.lang.RuntimeException: HMaster Aborted
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Thanks and Regards,
> >>>>>>>> Vimal Jain

--
Thanks and Regards,
Vimal Jain
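
A note on the warnings above: a Sleeper line such as "We slept 4694929ms instead of 3000ms" means the JVM was stalled for roughly 78 minutes, and both the HMaster and the HRegionServer logged comparable stalls at the same instant, which suggests a machine-wide pause (heavy swapping, an overcommitted VM) at least as much as a per-process GC pause. A stall that long expires the ZooKeeper session (hence the SessionExpiredException aborts on both daemons) and also outlives the 63000 ms DFS socket timeout; with only one datanode in pseudo-distributed mode there is no replacement node for write-pipeline recovery, which is exactly the "All datanodes 192.168.20.30:50010 are bad" abort. The tips page linked in the warnings covers the ZooKeeper session timeout; as a minimal sketch of where that knob lives, assuming HBase 0.94 (the 300000 ms value is purely illustrative, and the ZooKeeper server's maxSessionTimeout can cap whatever the client asks for):

  <!-- conf/hbase-site.xml: ask ZooKeeper for a longer session timeout so
       short GC pauses do not expire the region server's session.
       300000 ms is an illustrative value, not a recommendation; no
       session timeout survives a ~78-minute stall. -->
  <property>
    <name>zookeeper.session.timeout</name>
    <value>300000</value>
  </property>

Raising the timeout only buys headroom for short pauses; a stall of this length has to be fixed at its source.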
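
Whether the stall really was garbage collection can be confirmed by enabling GC logging on the HBase JVMs and correlating it with the Sleeper timestamps. A sketch for conf/hbase-env.sh, assuming a Sun/Oracle Java 6/7 JVM of that era; the log path is hypothetical:

  # conf/hbase-env.sh: log every GC with timestamps so a long pause can be
  # matched against the 05:12 Sleeper warnings (the path is hypothetical).
  export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
    -XX:+PrintGCTimeStamPS -Xloggc:/var/log/hbase/gc-hbase.log"

If no multi-minute collection shows up there at the time of the next incident, the machine itself was paused, and swap activity (vmstat, /proc/meminfo) is the next place to look.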
