Hi,

There are many problem blocks. The one in the log I attached in my mail below has only one replica, but many others have 3 replicas:

2010-12-20 09:10:31,167 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_1292656843783_2494443 in pipeline 167.6.5.17:50010, 167.6.5.16:50010, 167.6.5.11:50010: bad datanode 167.6.5.17:50010
2010-12-20 09:10:31,206 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: java.io.IOException: Connection reset by peer

The hbase version I use is 0.20.6, not 0.89.

Zhou

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Stack
Sent: December 22, 2010 3:12
To: [email protected]
Subject: Re: all regionserver shutdown after close hdfs datanode

2010/12/20 Zhou Shuaifeng <[email protected]>:
> Hi,
> I checked the log. It was not the master that caused the regionserver
> shutdown; the regionservers shut down because log rolling failed.
>

The problem block only had one replica? If you look in the hdfs emissions, it'll usually log other nodes that have the wanted block. I don't believe you say which hbase/hdfs you are using? In 0.89.x hbases, at least for WAL log, we'll go out of our way to guarantee sufficient replicas.

St.Ack

> According to the log, the error occurred in the pipeline, but why is hdfs
> not able to select another good datanode when one datanode in the pipeline
> is not available?
>
> The log:
> 2010-12-20 09:15:41,769 FATAL org.apache.hadoop.hbase.regionserver.LogRoller: Log rolling failed with ioe:
> java.io.IOException: Error Recovery for block blk_1292656843439_2494096 failed because recovery from primary datanode 167.6.5.17:50010 failed 6 times. Pipeline was 167.6.5.17:50010. Aborting...
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3249)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2654)
>     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2837)
>
> the corresponding code in regionserver:
> LOG.fatal("Log rolling failed with ioe: ",
>     RemoteExceptionHandler.checkIOException(ex));
> server.checkFileSystem();
> // Abort if we get here. We probably won't recover an IOE.
> // HBASE-1132
> server.abort();
>
> the abort() code:
> public void abort() {
>   this.abortRequested = true;
>   this.reservedSpace.clear();
>   LOG.info("Dump of metrics: " + this.metrics.toString());
>   stop();
> }
>
> The corresponding log:
> 2010-12-20 09:15:41,777 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics: request=9.666667, regions=1512, stores=1512, storefiles=5833, storefileIndexSize=1833, memstoreSize=2941, compactionQueueSize=1228, usedHeap=6849, maxHeap=8165, blockCacheSize=14047672, blockCacheFree=1698276936, blockCacheCount=0, blockCacheHitRatio=0, fsReadLatency=0, fsWriteLatency=59, fsSyncLatency=0
>
> Zhou Shuaifeng(Frank)
> HUAWEI TECHNOLOGIES CO.,LTD.
>
> -----Original Message-----
> From: Daniel Iancu [mailto:[email protected]]
> Sent: December 20, 2010 23:46
> To: [email protected]
> Subject: Re: all regionserver shutdown after close hdfs datanode
>
> Hi Zhou
> You should check if the HMaster is still up. If not, check its logs. If
> for some reason HMaster thinks HDFS is not available, it will
> shut down the HBase cluster.
> Regards
> Daniel
>
> On 12/20/2010 06:15 AM, Zhou Shuaifeng wrote:
>> Hi,
>>
>> I have a cluster of 8 hdfs datanodes and 8 hbase regionservers. When I
>> shut down one node (a PC with one datanode and one regionserver running),
>> all hbase regionservers shut down after a while.
>>
>> The other 7 hdfs datanodes are OK.
>>
>> I think this is not reasonable. HBase is a distributed system that should
>> tolerate some abnormal nodes. So, what's the matter? Is there any
>> configuration that can solve this problem, or is it a bug?
>>
>> Thanks and best Regards.
>>
>> Zhou
>>
>> -------------------------------------------------------------------------
>> This e-mail and its attachments contain confidential information from
>> HUAWEI, which is intended only for the person or entity whose address is
>> listed above.
>> Any use of the information contained herein in any way (including, but
>> not limited to, total or partial disclosure, reproduction, or
>> dissemination) by persons other than the intended recipient(s) is
>> prohibited. If you receive this e-mail in error, please notify the sender
>> by phone or email immediately and delete it!
>
> --
> Daniel Iancu
> Java Developer,Web Components Romania
> 1&1 Internet Development srl.
> 18 Mircea Eliade St
> Sect 1, Bucharest
> RO Bucharest, 012015
> www.1and1.ro
> Phone:+40-031-223-9081
> Email:[email protected]
> IM:[email protected]
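
For anyone trying to answer the "only one replica?" question from their own logs, here is a rough sketch that reads a DFSClient "Error Recovery for block ... in pipeline ...: bad datanode ..." warning, like the one quoted in this thread, and lists the pipeline datanodes left once the bad one is dropped. The class name PipelineLogCheck and the method survivors are made up for illustration and are not part of Hadoop or HBase. The regular expression is based only on the single warning line quoted here, so other Hadoop versions may format the message differently. It also shows only the client's view of the write pipeline, not what the NameNode reports for the block.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hadoop or HBase.
public class PipelineLogCheck {
    // Assumed format, taken from the DFSClient warning quoted in this thread:
    // "Error Recovery for block <blk> in pipeline <dn>, <dn>, ...: bad datanode <dn>"
    private static final Pattern RECOVERY = Pattern.compile(
        "Error Recovery for block (blk_\\S+) in pipeline (.+?): bad datanode (\\S+)");

    /**
     * Returns the pipeline datanodes that remain after removing the one
     * reported as bad, or an empty Optional if the line is not a
     * pipeline-recovery warning in the assumed format.
     */
    public static Optional<List<String>> survivors(String logLine) {
        Matcher m = RECOVERY.matcher(logLine);
        if (!m.find()) {
            return Optional.empty();
        }
        List<String> nodes =
            new ArrayList<>(Arrays.asList(m.group(2).split(",\\s*")));
        nodes.remove(m.group(3)); // drop the datanode flagged as bad
        return Optional.of(nodes);
    }

    public static void main(String[] args) {
        String line = "2010-12-20 09:10:31,167 WARN org.apache.hadoop.hdfs.DFSClient: "
            + "Error Recovery for block blk_1292656843783_2494443 in pipeline "
            + "167.6.5.17:50010, 167.6.5.16:50010, 167.6.5.11:50010: "
            + "bad datanode 167.6.5.17:50010";
        // Prints the datanodes still in the pipeline for this block.
        survivors(line).ifPresent(System.out::println);
    }
}
```

An empty list for a block would mean the pipeline held only the failed datanode, which matches the "Pipeline was 167.6.5.17:50010" case in the FATAL message quoted above.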
