How many region servers / data nodes do you have? What Hadoop / HBase version are you using?

Thanks

On Jun 5, 2013, at 3:54 AM, Vimal Jain <[email protected]> wrote:

> Yes, I did check those.
> But I am not sure if those parameter settings are the issue, as there are
> some other exceptions in the logs ("DFSOutputStream ResponseProcessor
> exception", etc.).
>
> On Wed, Jun 5, 2013 at 4:19 PM, Ted Yu <[email protected]> wrote:
>
>> There are a few tips under:
>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>
>> Can you check?
>>
>> Thanks
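For context, the tips behind that link mostly reduce to two things: allow the region server a longer ZooKeeper session so a moderate GC pause does not expire it, and diagnose/tune GC itself. A minimal hbase-site.xml sketch along those lines (the values here are illustrative, not taken from this thread):

    <property>
      <!-- illustrative: tolerate up to ~2 minutes of pause before the RS session expires -->
      <name>zookeeper.session.timeout</name>
      <value>120000</value>
    </property>
    <property>
      <!-- matters when HBase manages ZooKeeper itself: ZooKeeper caps any session
           at roughly 20 * tickTime, so the tick has to be raised along with it -->
      <name>hbase.zookeeper.property.tickTime</name>
      <value>6000</value>
    </property>

Note that the multi-minute pauses in the logs below are far beyond what any session timeout can absorb; the timeout only buys headroom for ordinary GC hiccups.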
>> On Jun 5, 2013, at 2:05 AM, Vimal Jain <[email protected]> wrote:
>>
>>> I don't think so, as I don't find any issues in the data node logs.
>>> Also there are a lot of exceptions like "session expired" and "slept more
>>> than configured time". What are these?
>>>
>>> On Wed, Jun 5, 2013 at 2:27 PM, Azuryy Yu <[email protected]> wrote:
>>>
>>>> Because your data node 192.168.20.30 broke down, which led to the RS going down.
>>>>
>>>> On Wed, Jun 5, 2013 at 3:19 PM, Vimal Jain <[email protected]> wrote:
>>>>
>>>>> Here is the complete log:
>>>>>
>>>>> http://bin.cakephp.org/saved/103001 - Hregion
>>>>> http://bin.cakephp.org/saved/103000 - Hmaster
>>>>> http://bin.cakephp.org/saved/103002 - Datanode
>>>>>
>>>>> On Wed, Jun 5, 2013 at 11:58 AM, Vimal Jain <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>> I have set up HBase in pseudo-distributed mode.
>>>>>> It had been working fine for 6 days, but suddenly this morning both the
>>>>>> HMaster and HRegionServer processes went down.
>>>>>> I checked the logs of both Hadoop and HBase.
>>>>>> Please help here.
>>>>>> Here are the snippets:
>>>>>>
>>>>>> *Datanode logs:*
>>>>>> 2013-06-05 05:12:51,436 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_1597245478875608321_2818 java.io.EOFException: while trying to read 2347 bytes
>>>>>> 2013-06-05 05:12:51,442 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_1597245478875608321_2818 received exception java.io.EOFException: while trying to read 2347 bytes
>>>>>> 2013-06-05 05:12:51,442 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.20.30:50010, storageID=DS-1816106352-192.168.20.30-50010-1369314076237, infoPort=50075, ipcPort=50020):DataXceiver
>>>>>> java.io.EOFException: while trying to read 2347 bytes
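One note on the datanode excerpt above: the EOFException in receiveBlock is the datanode's view of the write pipeline breaking when its client went quiet mid-write, which (given the Sleeper warnings in the HRegion log below) points at the GC-paused region server rather than the datanode as the origin. Independently of that, two hdfs-site.xml settings were commonly hardened for HBase on Hadoop 1.x; a sketch with illustrative values:

    <property>
      <!-- HBase keeps many blocks open; the small default transceiver limit is a
           classic cause of dropped pipelines (the misspelled key is the real name) -->
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value>
    </property>
    <property>
      <!-- illustrative: lengthen the datanode's write timeout (0 disables it) so a
           briefly stalled client is not cut off -->
      <name>dfs.datanode.socket.write.timeout</name>
      <value>0</value>
    </property>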
>>>>>>
>>>>>> *HRegion logs:*
>>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4694929ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>>> 2013-06-05 05:12:51,045 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_1597245478875608321_2818 java.net.SocketTimeoutException: 63000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.20.30:44333 remote=/192.168.20.30:50010]
>>>>>> 2013-06-05 05:12:51,046 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 11695345ms instead of 10000000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>>> 2013-06-05 05:12:51,048 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_1597245478875608321_2818 bad datanode[0] 192.168.20.30:50010
>>>>>> 2013-06-05 05:12:51,075 WARN org.apache.hadoop.hdfs.DFSClient: Error while syncing
>>>>>> java.io.IOException: All datanodes 192.168.20.30:50010 are bad. Aborting...
>>>>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
>>>>>> 2013-06-05 05:12:51,110 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync. Requesting close of hlog
>>>>>> java.io.IOException: Reflection
>>>>>> Caused by: java.lang.reflect.InvocationTargetException
>>>>>> Caused by: java.io.IOException: DFSOutputStream is closed
>>>>>> 2013-06-05 05:12:51,180 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync. Requesting close of hlog
>>>>>> java.io.IOException: Reflection
>>>>>> Caused by: java.lang.reflect.InvocationTargetException
>>>>>> Caused by: java.io.IOException: DFSOutputStream is closed
>>>>>> 2013-06-05 05:12:51,183 ERROR org.apache.hadoop.hbase.regionserver.wal.HLog: Failed close of HLog writer
>>>>>> java.io.IOException: Reflection
>>>>>> Caused by: java.lang.reflect.InvocationTargetException
>>>>>> Caused by: java.io.IOException: DFSOutputStream is closed
>>>>>> 2013-06-05 05:12:51,184 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: Riding over HLog close failure! error count=1
>>>>>> 2013-06-05 05:12:52,557 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server hbase.rummycircle.com,60020,1369877672964: regionserver:60020-0x13ef31264d00001 regionserver:60020-0x13ef31264d00001 received expired from ZooKeeper, aborting
>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>>>>>> 2013-06-05 05:12:52,557 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
>>>>>> 2013-06-05 05:12:52,621 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker interrupted while waiting for task, exiting: java.lang.InterruptedException
>>>>>> java.io.InterruptedIOException: Aborting compaction of store cfp_info in region event_data,244630,1369879570539.3ebddcd11a3c22585a690bf40911cb1e. because user requested stop.
>>>>>> 2013-06-05 05:12:53,425 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com,60020,1369877672964
>>>>>> 2013-06-05 05:12:55,426 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com,60020,1369877672964
>>>>>> 2013-06-05 05:12:59,427 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com,60020,1369877672964
>>>>>> 2013-06-05 05:13:07,427 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com,60020,1369877672964
>>>>>> 2013-06-05 05:13:07,427 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper delete failed after 3 retries
>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com,60020,1369877672964
>>>>>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
>>>>>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>>>>>> 2013-06-05 05:13:07,436 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /hbase/.logs/hbase.rummycircle.com,60020,1369877672964/hbase.rummycircle.com%2C60020%2C1369877672964.1370382721642 : java.io.IOException: All datanodes 192.168.20.30:50010 are bad. Aborting...
>>>>>> java.io.IOException: All datanodes 192.168.20.30:50010 are bad. Aborting...
>>>>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
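The repeated session-expired warnings above are just the already-aborting region server failing to delete its znode; the trigger is the roughly 78-minute stall the Sleeper reported (4694929 ms against a 3000 ms period). To confirm that the stall really was GC rather than, say, swapping, GC logging can be enabled in hbase-env.sh; an illustrative line (the log path is an assumption):

    # append GC diagnostics to the JVM options HBase already passes to its daemons
    export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/hbase/gc-hbase.log"

If the resulting GC log shows no comparable pause, checking whether the box was swapping (vmstat, sar) would be the next step.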
>>>>>>
>>>>>> *HMaster logs:*
>>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4702394ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4988731ms instead of 300000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4988726ms instead of 300000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4698291ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>>> 2013-06-05 05:12:50,711 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4694502ms instead of 1000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>>> 2013-06-05 05:12:50,714 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4694492ms instead of 1000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>>> 2013-06-05 05:12:50,715 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4695589ms instead of 60000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>>>> 2013-06-05 05:12:52,263 FATAL org.apache.hadoop.hbase.master.HMaster: Master server abort: loaded coprocessors are: []
>>>>>> 2013-06-05 05:12:52,465 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 1, slept for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
>>>>>> 2013-06-05 05:12:52,561 ERROR org.apache.hadoop.hbase.master.HMaster: Region server hbase.rummycircle.com,60020,1369877672964 reported a fatal error:
>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>>>>>> 2013-06-05 05:12:53,970 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 1, slept for 1506 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
>>>>>> 2013-06-05 05:12:55,476 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 1, slept for 3012 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
>>>>>> 2013-06-05 05:12:56,981 INFO org.apache.hadoop.hbase.master.ServerManager: Finished waiting for region servers count to settle; checked in 1, slept for 4517 ms, expecting minimum of 1, maximum of 2147483647, master is running.
>>>>>> 2013-06-05 05:12:57,019 INFO org.apache.hadoop.hbase.catalog.CatalogTracker: Failed verification of -ROOT-,,0 at address=hbase.rummycircle.com,60020,1369877672964; java.io.EOFException
>>>>>> 2013-06-05 05:17:52,302 WARN org.apache.hadoop.hbase.master.SplitLogManager: error while splitting logs in [hdfs://192.168.20.30:9000/hbase/.logs/hbase.rummycircle.com,60020,1369877672964-splitting] installed = 19 but only 0 done
>>>>>> 2013-06-05 05:17:52,321 FATAL org.apache.hadoop.hbase.master.HMaster: master:60000-0x13ef31264d00000 master:60000-0x13ef31264d00000 received expired from ZooKeeper, aborting
>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>>>>>> java.io.IOException: Giving up after tries=1
>>>>>> Caused by: java.lang.InterruptedException: sleep interrupted
>>>>>> 2013-06-05 05:17:52,381 ERROR org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start master
>>>>>> java.lang.RuntimeException: HMaster Aborted
>>>>>>
>>>>>> --
>>>>>> Thanks and Regards,
>>>>>> Vimal Jain
