I could not find any issue in the NameNode. Here is the NameNode log at the time of the issue:
http://bin.cakephp.org/saved/103003

On Wed, Jun 5, 2013 at 5:00 PM, Ted Yu <[email protected]> wrote:

> Have you looked at the NameNode log?
>
> The snippet you posted seems to imply an issue with data block placement.
>
> Cheers
>
> On Jun 5, 2013, at 4:12 AM, Vimal Jain <[email protected]> wrote:
>
> > I am running HBase in pseudo-distributed mode, so there is only one machine involved.
> > I am using Hadoop 1.1.2 and HBase 0.94.7.
> >
> > On Wed, Jun 5, 2013 at 4:38 PM, Ted Yu <[email protected]> wrote:
> >
> >> How many region servers / data nodes do you have?
> >>
> >> What Hadoop / HBase versions are you using?
> >>
> >> Thanks
> >>
> >> On Jun 5, 2013, at 3:54 AM, Vimal Jain <[email protected]> wrote:
> >>
> >>> Yes, I did check those.
> >>> But I am not sure those parameter settings are the issue, as there are some other exceptions in the logs ("DFSOutputStream ResponseProcessor exception", etc.).
> >>>
> >>> On Wed, Jun 5, 2013 at 4:19 PM, Ted Yu <[email protected]> wrote:
> >>>
> >>>> There are a few tips under:
> >>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>
> >>>> Can you check them?
> >>>>
> >>>> Thanks
> >>>>
> >>>> On Jun 5, 2013, at 2:05 AM, Vimal Jain <[email protected]> wrote:
> >>>>
> >>>>> I don't think so, as I don't find any issues in the data node logs.
> >>>>> Also, there are a lot of exceptions like "session expired" and "slept more than configured time". What are these?
> >>>>>
> >>>>> On Wed, Jun 5, 2013 at 2:27 PM, Azuryy Yu <[email protected]> wrote:
> >>>>>
> >>>>>> Because your data node 192.168.20.30 broke down, which led to the RS going down.
> >>>>>>
> >>>>>> On Wed, Jun 5, 2013 at 3:19 PM, Vimal Jain <[email protected]> wrote:
> >>>>>>
> >>>>>>> Here are the complete logs:
> >>>>>>>
> >>>>>>> http://bin.cakephp.org/saved/103001 - HRegion
> >>>>>>> http://bin.cakephp.org/saved/103000 - HMaster
> >>>>>>> http://bin.cakephp.org/saved/103002 - Datanode
> >>>>>>>
> >>>>>>> On Wed, Jun 5, 2013 at 11:58 AM, Vimal Jain <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Hi,
> >>>>>>>> I have set up HBase in pseudo-distributed mode.
> >>>>>>>> It was working fine for 6 days, but suddenly this morning both the HMaster and HRegion processes went down.
> >>>>>>>> I checked the logs of both Hadoop and HBase.
> >>>>>>>> Please help here.
> >>>>>>>> Here are the snippets:
> >>>>>>>>
> >>>>>>>> *Datanode logs:*
> >>>>>>>> 2013-06-05 05:12:51,436 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_1597245478875608321_2818 java.io.EOFException: while trying to read 2347 bytes
> >>>>>>>> 2013-06-05 05:12:51,442 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_1597245478875608321_2818 received exception java.io.EOFException: while trying to read 2347 bytes
> >>>>>>>> 2013-06-05 05:12:51,442 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.20.30:50010, storageID=DS-1816106352-192.168.20.30-50010-1369314076237, infoPort=50075, ipcPort=50020):DataXceiver
> >>>>>>>> java.io.EOFException: while trying to read 2347 bytes
> >>>>>>>>
> >>>>>>>> *HRegion logs:*
> >>>>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4694929ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>>>>> 2013-06-05 05:12:51,045 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_1597245478875608321_2818 java.net.SocketTimeoutException: 63000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.20.30:44333 remote=/192.168.20.30:50010]
> >>>>>>>> 2013-06-05 05:12:51,046 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 11695345ms instead of 10000000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>>>>> 2013-06-05 05:12:51,048 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_1597245478875608321_2818 bad datanode[0] 192.168.20.30:50010
> >>>>>>>> 2013-06-05 05:12:51,075 WARN org.apache.hadoop.hdfs.DFSClient: Error while syncing
> >>>>>>>> java.io.IOException: All datanodes 192.168.20.30:50010 are bad. Aborting...
> >>>>>>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> >>>>>>>> 2013-06-05 05:12:51,110 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync. Requesting close of hlog
> >>>>>>>> java.io.IOException: Reflection
> >>>>>>>> Caused by: java.lang.reflect.InvocationTargetException
> >>>>>>>> Caused by: java.io.IOException: DFSOutputStream is closed
> >>>>>>>> 2013-06-05 05:12:51,180 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync. Requesting close of hlog
> >>>>>>>> java.io.IOException: Reflection
> >>>>>>>> Caused by: java.lang.reflect.InvocationTargetException
> >>>>>>>> Caused by: java.io.IOException: DFSOutputStream is closed
> >>>>>>>> 2013-06-05 05:12:51,183 ERROR org.apache.hadoop.hbase.regionserver.wal.HLog: Failed close of HLog writer
> >>>>>>>> java.io.IOException: Reflection
> >>>>>>>> Caused by: java.lang.reflect.InvocationTargetException
> >>>>>>>> Caused by: java.io.IOException: DFSOutputStream is closed
> >>>>>>>> 2013-06-05 05:12:51,184 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: Riding over HLog close failure! error count=1
> >>>>>>>> 2013-06-05 05:12:52,557 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server hbase.rummycircle.com,60020,1369877672964: regionserver:60020-0x13ef31264d00001 regionserver:60020-0x13ef31264d00001 received expired from ZooKeeper, aborting
> >>>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
> >>>>>>>> 2013-06-05 05:12:52,557 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
> >>>>>>>> 2013-06-05 05:12:52,621 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker interrupted while waiting for task, exiting: java.lang.InterruptedException
> >>>>>>>> java.io.InterruptedIOException: Aborting compaction of store cfp_info in region event_data,244630,1369879570539.3ebddcd11a3c22585a690bf40911cb1e. because user requested stop.
> >>>>>>>> 2013-06-05 05:12:53,425 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com,60020,1369877672964
> >>>>>>>> 2013-06-05 05:12:55,426 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com,60020,1369877672964
> >>>>>>>> 2013-06-05 05:12:59,427 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com,60020,1369877672964
> >>>>>>>> 2013-06-05 05:13:07,427 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com,60020,1369877672964
> >>>>>>>> 2013-06-05 05:13:07,427 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper delete failed after 3 retries
> >>>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com,60020,1369877672964
> >>>>>>>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> >>>>>>>>     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> >>>>>>>> 2013-06-05 05:13:07,436 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /hbase/.logs/hbase.rummycircle.com,60020,1369877672964/hbase.rummycircle.com%2C60020%2C1369877672964.1370382721642 : java.io.IOException: All datanodes 192.168.20.30:50010 are bad. Aborting...
> >>>>>>>> java.io.IOException: All datanodes 192.168.20.30:50010 are bad. Aborting...
> >>>>>>>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> >>>>>>>>
> >>>>>>>> *HMaster logs:*
> >>>>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4702394ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4988731ms instead of 300000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4988726ms instead of 300000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>>>>> 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4698291ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>>>>> 2013-06-05 05:12:50,711 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4694502ms instead of 1000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>>>>> 2013-06-05 05:12:50,714 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4694492ms instead of 1000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>>>>> 2013-06-05 05:12:50,715 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4695589ms instead of 60000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >>>>>>>> 2013-06-05 05:12:52,263 FATAL org.apache.hadoop.hbase.master.HMaster: Master server abort: loaded coprocessors are: []
> >>>>>>>> 2013-06-05 05:12:52,465 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 1, slept for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
> >>>>>>>> 2013-06-05 05:12:52,561 ERROR org.apache.hadoop.hbase.master.HMaster: Region server hbase.rummycircle.com,60020,1369877672964 reported a fatal error:
> >>>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
> >>>>>>>> 2013-06-05 05:12:53,970 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 1, slept for 1506 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
> >>>>>>>> 2013-06-05 05:12:55,476 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 1, slept for 3012 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
> >>>>>>>> 2013-06-05 05:12:56,981 INFO org.apache.hadoop.hbase.master.ServerManager: Finished waiting for region servers count to settle; checked in 1, slept for 4517 ms, expecting minimum of 1, maximum of 2147483647, master is running.
> >>>>>>>> 2013-06-05 05:12:57,019 INFO org.apache.hadoop.hbase.catalog.CatalogTracker: Failed verification of -ROOT-,,0 at address=hbase.rummycircle.com,60020,1369877672964; java.io.EOFException
> >>>>>>>> 2013-06-05 05:17:52,302 WARN org.apache.hadoop.hbase.master.SplitLogManager: error while splitting logs in [hdfs://192.168.20.30:9000/hbase/.logs/hbase.rummycircle.com,60020,1369877672964-splitting] installed = 19 but only 0 done
> >>>>>>>> 2013-06-05 05:17:52,321 FATAL org.apache.hadoop.hbase.master.HMaster: master:60000-0x13ef31264d00000 master:60000-0x13ef31264d00000 received expired from ZooKeeper, aborting
> >>>>>>>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
> >>>>>>>> java.io.IOException: Giving up after tries=1
> >>>>>>>> Caused by: java.lang.InterruptedException: sleep interrupted
> >>>>>>>> 2013-06-05 05:17:52,381 ERROR org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start master
> >>>>>>>> java.lang.RuntimeException: HMaster Aborted
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Thanks and Regards,
> >>>>>>>> Vimal Jain

--
Thanks and Regards,
Vimal Jain
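
A note on the warnings above: a Sleeper line such as "We slept 4694929ms instead of 3000ms" means the JVM was stalled for roughly 78 minutes, and both the HMaster and the HRegionServer logged comparable stalls at the same instant, which suggests a machine-wide pause (heavy swapping, an overcommitted VM) at least as much as a per-process GC pause. A stall that long expires the ZooKeeper session (hence the SessionExpiredException aborts on both daemons) and also outlives the 63000 ms DFS socket timeout; with only one datanode in pseudo-distributed mode there is no replacement node for write-pipeline recovery, which is exactly the "All datanodes 192.168.20.30:50010 are bad" abort. The tips page linked in the warnings covers the ZooKeeper session timeout; as a minimal sketch of where that knob lives, assuming HBase 0.94 (the 300000 ms value is purely illustrative, and the ZooKeeper server's maxSessionTimeout can cap whatever the client asks for):

  <!-- conf/hbase-site.xml: ask ZooKeeper for a longer session timeout so
       short GC pauses do not expire the region server's session.
       300000 ms is an illustrative value, not a recommendation; no
       session timeout survives a ~78-minute stall. -->
  <property>
    <name>zookeeper.session.timeout</name>
    <value>300000</value>
  </property>

Raising the timeout only buys headroom for short pauses; a stall of this length has to be fixed at its source.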
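
Whether the stall really was garbage collection can be confirmed by enabling GC logging on the HBase JVMs and correlating it with the Sleeper timestamps. A sketch for conf/hbase-env.sh, assuming a Sun/Oracle Java 6/7 JVM of that era; the log path is hypothetical:

  # conf/hbase-env.sh: log every GC with timestamps so a long pause can be
  # matched against the 05:12 Sleeper warnings (the path is hypothetical).
  export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
    -XX:+PrintGCTimeStamPS -Xloggc:/var/log/hbase/gc-hbase.log"

If no multi-minute collection shows up there at the time of the next incident, the machine itself was paused, and swap activity (vmstat, /proc/meminfo) is the next place to look.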
