Check the 'bad datanode's' logs.  Anything in there?  Have you upped the
xceivers and ulimits? (See the hbase requirements in 'Getting Started'.)
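
For reference, a minimal sketch of those two settings as the 0.20-era docs described them (property name and values are typical recommendations, not taken from this thread, so double-check against your release):

```xml
<!-- hdfs-site.xml on each datanode: raise the transceiver cap.
     Note the property name really is spelled "xcievers" in this
     Hadoop generation; later releases renamed it to
     dfs.datanode.max.transfer.threads. -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
```

Also raise the open-file limit for the user running hadoop/hbase, e.g. `ulimit -n 32768` (or via /etc/security/limits.conf), then restart the datanodes so both take effect.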

St.Ack

2010/12/22 Zhou Shuaifeng <[email protected]>:
> Hi,
>
> There are many problem blocks, but the log I attached in my mail below has
> only one. Many others have 3 replicas:
> 2010-12-20 09:10:31,167 WARN org.apache.hadoop.hdfs.DFSClient: Error
> Recovery for block blk_1292656843783_2494443 in pipeline 167.6.5.17:50010,
> 167.6.5.16:50010, 167.6.5.11:50010: bad datanode 167.6.5.17:50010
> 2010-12-20 09:10:31,206 WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer
> Exception: java.io.IOException: Connection reset by peer
>
> The hbase version I use is 0.20.6, not 0.89.
>
> Zhou
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Stack
> Sent: December 22, 2010 3:12
> To: [email protected]
> Subject: Re: all regionserver shutdown after close hdfs datanode
>
> 2010/12/20 Zhou Shuaifeng <[email protected]>:
>> Hi,
>> I checked the log. It's not the master that caused the regionserver
>> shutdown; the regionservers shut down because log rolling failed.
>>
>
> The problem block only had one replica?  If you look in the hdfs
> emissions, it'll usually log other nodes that have the wanted block.
>
> I don't believe you said which hbase/hdfs you are using?  In 0.89.x
> hbases, at least for the WAL log, we'll go out of our way to guarantee
> sufficient replicas.
>
> St.Ack
>
>
>> According to the log, the error occurred in the pipeline, but why is hdfs
>> not able to select another good datanode when one datanode in the
>> pipeline is unavailable?
>>
>>
>> The log:
>> 2010-12-20 09:15:41,769 FATAL
>> org.apache.hadoop.hbase.regionserver.LogRoller: Log rolling failed with
>> ioe:
>>
>> java.io.IOException: Error Recovery for block blk_1292656843439_2494096
>> failed  because recovery from primary datanode 167.6.5.17:50010 failed 6
>> times.  Pipeline was 167.6.5.17:50010. Aborting...
>>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3249)
>>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2654)
>>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2837)
>>
>> the corresponding code in the regionserver:
>>        LOG.fatal("Log rolling failed with ioe: ",
>>          RemoteExceptionHandler.checkIOException(ex));
>>        server.checkFileSystem();
>>        // Abort if we get here.  We probably won't recover an IOE. HBASE-1132
>>        server.abort();
>>
>> the abort() code:
>>  public void abort() {
>>    this.abortRequested = true;
>>    this.reservedSpace.clear();
>>    LOG.info("Dump of metrics: " + this.metrics.toString());
>>    stop();
>>  }
>>
>> The corresponding log:
>> 2010-12-20 09:15:41,777 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Dump of metrics:
>> request=9.666667, regions=1512, stores=1512, storefiles=5833,
>> storefileIndexSize=1833, memstoreSize=2941, compactionQueueSize=1228,
>> usedHeap=6849, maxHeap=8165, blockCacheSize=14047672,
>> blockCacheFree=1698276936, blockCacheCount=0, blockCacheHitRatio=0,
>> fsReadLatency=0, fsWriteLatency=59, fsSyncLatency=0
>>
>>
>>
>>
>> Zhou Shuaifeng(Frank)
>> HUAWEI TECHNOLOGIES CO., LTD.
>>
>>
>> -----Original Message-----
>> From: Daniel Iancu [mailto:[email protected]]
>> Sent: December 20, 2010 23:46
>> To: [email protected]
>> Subject: Re: all regionserver shutdown after close hdfs datanode
>>
>> Hi Zhou
>> You should check if the HMaster is still up. If not, check its logs; if
>> for some reason the HMaster thinks HDFS is not available, it will
>> shut down the HBase cluster.
>> Regards
>> Daniel
>>
>> On 12/20/2010 06:15 AM, Zhou Shuaifeng wrote:
>>> Hi,
>>>
>>>
>>>
>>> I have a cluster of 8 hdfs datanodes and 8 hbase regionservers. When I
>>> shut down one node (a pc with one datanode and one regionserver
>>> running), all the hbase regionservers shut down after a while.
>>>
>>> The other 7 hdfs datanodes are OK.
>>>
>>>
>>>
>>> I think this is not reasonable. Hbase is a distributed system that
>>> should tolerate some abnormal nodes. So, what's the matter? Is there
>>> any configuration that can solve this problem, or is it a bug?
>>>
>>>
>>>
>>> Thanks and best regards.
>>>
>>>
>>>
>>> Zhou
>>>
>>>
>>
>>
>> --
>> Daniel Iancu
>> Java Developer,Web Components Romania
>> 1&1 Internet Development srl.
>> 18 Mircea Eliade St
>> Sect 1, Bucharest
>> RO Bucharest, 012015
>> www.1and1.ro
>> Phone:+40-031-223-9081
>> Email:[email protected]
>> IM:[email protected]
>>
>>
>>
>>
>
>
