xcievers exceptions will be in the datanodes' logs, and your problem
totally looks like it. 0.20.5 will have the same issue (since it's on
the HDFS side)
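
For reference, a minimal sketch of raising the limit, assuming the setting goes in hdfs-site.xml on every datanode under the property name dfs.datanode.max.xcievers (the misspelling is part of the name in Hadoop 0.20). The value 4096 is an illustrative assumption, not a number tuned for this cluster, and the datanodes need a restart to pick it up:

```xml
<!-- hdfs-site.xml on each datanode (sketch; the value is an assumption) -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
```

When the limit is hit, the datanode log typically carries a message along the lines of "exceeds the limit of concurrent xcievers", so that is the string to grep for there.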

J-D

On Wed, Jul 7, 2010 at 10:08 AM, Jamie Cockrill
<[email protected]> wrote:
> Hi Todd & JD,
>
> Environment:
> All (hadoop and HBase) installed as of karmic-cdh3, which means:
> Hadoop 0.20.2+228
> HBase 0.89.20100621+17
> Zookeeper 3.3.1+7
>
> Unfortunately my whole cluster of regionservers has now crashed, so I
> can't really say if it was swapping too much. There is a DEBUG
> statement just before it crashes saying:
>
> org.apache.hadoop.hbase.regionserver.wal.HLog: closing hlog writer in
> hdfs://<somewhere on my HDFS, in /hbase>
>
> What follows is:
>
> WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
> org.apache.hadoop.ipc.RemoteException:
> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease
> on <file location as above> File does not exist. Holder
> DFSClient_-11113603 does not have any open files
>
> It then seems to try and do some error recovery (Error Recovery for
> block null bad datanode[0] nodes == null), fails (Could not get block
> locations. Source file "<hbase file as before>" - Aborting). There is
> then an ERROR org.apache...HRegionServer: Close and delete failed.
> There is then a similar LeaseExpiredException as above.
>
> There are then a couple of messages from HRegionServer saying that
> it's notifying master of its shutdown and stopping itself. The
> shutdown hook then fires and the RemoteException and
> LeaseExpiredExceptions are printed again.
>
> ulimit is set to 65000 (it's in the regionserver log, printed as I
> restarted the regionserver), however I haven't got the xcievers set
> anywhere. I'll give that a go. It does seem very odd as I did have a
> few of them fall over one at a time with a few early loads, but that
> seemed to be because the regions weren't splitting properly, so all
> the traffic was going to one node and it was being overwhelmed. Once I
> throttled it, after one load a region split seemed to get
> triggered, which flung regions all over, which made subsequent loads
> much more distributed. However, perhaps the time-bomb was ticking...
> I'll have a go at specifying the xcievers property. I'm pretty
> certain I've got everything else covered, except the patches as
> referenced in the JIRA.
>
> I just grepped some of the log files and didn't get an explicit
> exception with 'xciever' in it.
>
> I am considering downgrading(?) to 0.20.5, however because everything
> is installed as per karmic-cdh3, I'm a bit reluctant to do so as
> presumably Cloudera has tested each of these versions against each
> other? And I don't really want to introduce further versioning issues.
>
> Thanks,
>
> Jamie
>
>
> On 7 July 2010 17:30, Jean-Daniel Cryans <[email protected]> wrote:
>> Jamie,
>>
>> Does your configuration meet the requirements?
>> http://hbase.apache.org/docs/r0.20.5/api/overview-summary.html#requirements
>>
>> ulimit and xcievers, if not set, are usually time bombs that go off when
>> the cluster is under load.
>>
>> J-D
>>
>> On Wed, Jul 7, 2010 at 9:11 AM, Jamie Cockrill
>> <[email protected]> wrote:
>>
>>> Dear all,
>>>
>>> My current HBase/Hadoop architecture has HBase region servers on the
>>> same physical boxes as the HDFS data-nodes. I'm getting an awful lot
>>> of region server crashes. The last thing that happens appears to be a
>>> DroppedSnapshotException, caused by an IOException: could not
>>> complete write to file <file on HDFS>. I am running it under load; I'm
>>> not sure how to quantify how heavy that is, but I'm guessing it is a
>>> load issue.
>>>
>>> Is it common practice to put region servers on data-nodes? Is it
>>> common to see region server crashes when either the HDFS or region
>>> server (or both) is under heavy load? I'm guessing that is the case as
>>> I've seen a few similar posts. I've not got a great deal of capacity
>>> to be separating region servers from HDFS data nodes, but it might be
>>> an argument I could make.
>>>
>>> Thanks
>>>
>>> Jamie
>>>
>>
>
