Check the 'bad' datanode's logs. Anything in there? Have you upped the xceivers and ulimits? (See the HBase requirements in 'Getting Started'.)
St.Ack

2010/12/22 Zhou Shuaifeng <[email protected]>:
> Hi,
>
> There are many problem blocks, but the log I attached in my mail below has
> only one. Many others have 3 replicas:
>
> 2010-12-20 09:10:31,167 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_1292656843783_2494443 in pipeline 167.6.5.17:50010,
> 167.6.5.16:50010, 167.6.5.11:50010: bad datanode 167.6.5.17:50010
> 2010-12-20 09:10:31,206 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer
> Exception: java.io.IOException: Connection reset by peer
>
> The HBase version I use is 0.20.6, not 0.89.
>
> Zhou
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Stack
> Sent: December 22, 2010 3:12
> To: [email protected]
> Subject: Re: all regionserver shutdown after close hdfs datanode
>
> 2010/12/20 Zhou Shuaifeng <[email protected]>:
>> Hi,
>> I checked the log. It was not the master that caused the regionserver
>> shutdown; the regionserver shut down because its log rolling failed.
>>
>
> Did the problem block only have one replica? If you look in the hdfs
> emissions, it'll usually log other nodes that have the wanted block.
>
> I don't believe you said which hbase/hdfs you are using? In 0.89.x
> hbases, at least for the WAL log, we'll go out of our way to guarantee
> sufficient replicas.
>
> St.Ack
>
>
>> According to the log, the error occurred in the pipeline, but why is hdfs
>> not able to select another good datanode when one datanode in the
>> pipeline is not available?
>>
>> The log:
>>
>> 2010-12-20 09:15:41,769 FATAL
>> org.apache.hadoop.hbase.regionserver.LogRoller: Log rolling failed with ioe:
>> java.io.IOException: Error Recovery for block blk_1292656843439_2494096
>> failed because recovery from primary datanode 167.6.5.17:50010 failed 6
>> times. Pipeline was 167.6.5.17:50010. Aborting...
>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3249)
>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2654)
>>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2837)
>>
>> The corresponding code in the regionserver:
>>
>>     LOG.fatal("Log rolling failed with ioe: ",
>>         RemoteExceptionHandler.checkIOException(ex));
>>     server.checkFileSystem();
>>     // Abort if we get here.  We probably won't recover an IOE. HBASE-1132
>>     server.abort();
>>
>> The abort() code:
>>
>>     public void abort() {
>>       this.abortRequested = true;
>>       this.reservedSpace.clear();
>>       LOG.info("Dump of metrics: " + this.metrics.toString());
>>       stop();
>>     }
>>
>> The corresponding log:
>>
>> 2010-12-20 09:15:41,777 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
>> request=9.666667, regions=1512, stores=1512, storefiles=5833,
>> storefileIndexSize=1833, memstoreSize=2941, compactionQueueSize=1228,
>> usedHeap=6849, maxHeap=8165, blockCacheSize=14047672,
>> blockCacheFree=1698276936, blockCacheCount=0, blockCacheHitRatio=0,
>> fsReadLatency=0, fsWriteLatency=59, fsSyncLatency=0
>>
>>
>> Zhou Shuaifeng (Frank)
>> HUAWEI TECHNOLOGIES CO., LTD.
>>
>>
>> -----Original Message-----
>> From: Daniel Iancu [mailto:[email protected]]
>> Sent: December 20, 2010 23:46
>> To: [email protected]
>> Subject: Re: all regionserver shutdown after close hdfs datanode
>>
>> Hi Zhou
>> You should check whether the HMaster is still up. If not, check its logs;
>> if for some reason the HMaster thinks HDFS is not available, it will
>> shut down the HBase cluster.
>> Regards
>> Daniel
>>
>> On 12/20/2010 06:15 AM, Zhou Shuaifeng wrote:
>>> Hi,
>>>
>>> I have a cluster of 8 HDFS datanodes and 8 HBase regionservers. When I
>>> shut down one node (a PC running one datanode and one regionserver),
>>> all the HBase regionservers shut down after a while.
>>>
>>> The other 7 HDFS datanodes are OK.
>>>
>>> I don't think this is reasonable. HBase is a distributed system that
>>> should tolerate some abnormal nodes. So, what's the matter? Is there any
>>> configuration that can solve this problem, or is it a bug?
>>>
>>> Thanks and best regards.
>>>
>>> Zhou
>>>
>>
>> --
>> Daniel Iancu
>> Java Developer, Web Components Romania
>> 1&1 Internet Development srl.
>> 18 Mircea Eliade St
>> Sect 1, Bucharest
>> RO Bucharest, 012015
>> www.1and1.ro
>> Phone: +40-031-223-9081
>> Email: [email protected]
>> IM: [email protected]
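The xceivers and ulimit settings St.Ack points to at the top of the thread live in hdfs-site.xml and in the OS limits for the user running the datanode. A minimal sketch for a 0.20-era cluster follows; the value 4096 is a common illustrative starting point, not a number taken from this thread.

```xml
<!-- hdfs-site.xml on every datanode: raise the cap on concurrent
     block transceivers. Note the historical misspelling "xcievers"
     in the property name; 4096 is an illustrative value. -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
```

The matching OS-level change is raising the open-file limit for the user that runs the datanode and regionserver (for example via the `nofile` entry in /etc/security/limits.conf), as described in the 'Getting Started' requirements referenced at the top of the thread.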

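On Zhou's question of why the client cannot pick another good datanode: in 0.20.x HDFS, the write pipeline can only drop failed nodes, never add replacements, so a pipeline can shrink to a single datanode and then fail entirely, as in the log above. Later HDFS releases added a client-side option for exactly this case. A sketch, applicable only to those newer versions (these keys do not exist in 0.20.x):

```xml
<!-- hdfs-site.xml, client side, newer HDFS releases only:
     replace a failed datanode in the write pipeline instead of
     merely removing it. -->
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
  <value>true</value>
</property>
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>DEFAULT</value>
</property>
```

On a 0.20.6 cluster like Zhou's, the available mitigations are the xceivers/ulimit settings above and keeping the replication factor high enough that recovery has other replicas to work with.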