Hi Todd & JD,

Environment:
Everything (Hadoop and HBase) is installed from karmic-cdh3, which means:
Hadoop 0.20.2+228
HBase 0.89.20100621+17
Zookeeper 3.3.1+7

Unfortunately my whole cluster of regionservers has now crashed, so I
can't really say whether it was swapping too much. There is a DEBUG
statement just before it crashes saying:

org.apache.hadoop.hbase.regionserver.wal.HLog: closing hlog writer in
hdfs://<somewhere on my HDFS, in /hbase>

What follows is:

WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
org.apache.hadoop.ipc.RemoteException:
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease
on <file location as above> File does not exist. Holder
DFSClient_-11113603 does not have any open files

It then seems to attempt some error recovery (Error Recovery for
block null bad datanode[0] nodes == null), which fails (Could not get
block locations. Source file "<hbase file as before>" - Aborting).
There is then an ERROR org.apache...HRegionServer: Close and delete
failed, followed by a similar LeaseExpiredException to the one above.

There are then a couple of messages from HRegionServer saying that
it's notifying master of its shutdown and stopping itself. The
shutdown hook then fires and the RemoteException and
LeaseExpiredExceptions are printed again.

ulimit is set to 65000 (it's in the regionserver log, printed as I
restarted the regionserver), however I haven't got the xcievers set
anywhere. I'll give that a go. It does seem very odd, as I did have a
few of them fall over one at a time during a few early loads, but that
seemed to be because the regions weren't splitting properly, so all
the traffic was going to one node and it was being overwhelmed. Once I
throttled it, a region split seemed to get triggered after one load,
which flung regions all over and made subsequent loads much more
distributed. However, perhaps the time-bomb was ticking...
I'll have a go at specifying the xcievers property. I'm pretty
certain I've got everything else covered, except the patches as
referenced in the JIRA.

I just grepped some of the log files and didn't get an explicit
exception with 'xciever' in it.
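
A quick way to repeat that check is sketched below. It is only a
sketch, not an official tool: the helper name count_matches and the
two patterns are assumptions chosen for illustration. It matches both
spellings (xciever and xceiver) regardless of case, and also counts
the LeaseExpiredException seen above. The xcievers limit is usually
reported by the HDFS datanodes, so their logs are worth scanning as
well as the regionserver logs.

```python
import re
from collections import Counter

# Hypothetical patterns, based only on the messages quoted in this thread.
PATTERNS = {
    # Matches "xciever" and "xceiver" in any case, e.g. "xceiverCount".
    "xcievers": re.compile(r"xc(?:ie|ei)ver", re.IGNORECASE),
    "lease_expired": re.compile(r"LeaseExpiredException"),
}


def count_matches(lines):
    """Count how many lines match each pattern (a line counts once per pattern)."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

To use it, open a log file and pass the file object straight in, for
example count_matches(open(path, errors="replace")), where path is
whichever datanode or regionserver log you want to check.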

I am considering downgrading(?) to 0.20.5. However, because everything
is installed as per karmic-cdh3, I'm a bit reluctant to do so, as
presumably Cloudera has tested each of these versions against each
other? And I don't really want to introduce further versioning issues.

Thanks,

Jamie


On 7 July 2010 17:30, Jean-Daniel Cryans <[email protected]> wrote:
> Jamie,
>
> Does your configuration meet the requirements?
> http://hbase.apache.org/docs/r0.20.5/api/overview-summary.html#requirements
>
> ulimit and xcievers, if not set, are usually time bombs that blow off when
> the cluster is under load.
>
> J-D
>
> On Wed, Jul 7, 2010 at 9:11 AM, Jamie Cockrill
> <[email protected]> wrote:
>
>> Dear all,
>>
>> My current HBase/Hadoop architecture has HBase region servers on the
>> same physical boxes as the HDFS data-nodes. I'm getting an awful lot
>> of region server crashes. The last thing that happens appears to be a
>> DroppedSnapshotException, caused by an IOException: could not
>> complete write to file <file on HDFS>. I am running it under load;
>> I'm not sure how to quantify how heavy that load is, but I'm guessing
>> it is a load issue.
>>
>> Is it common practice to put region servers on data-nodes? Is it
>> common to see region server crashes when either the HDFS or region
>> server (or both) is under heavy load? I'm guessing that is the case as
>> I've seen a few similar posts. I've not got a great deal of capacity
>> to be separating region servers from HDFS data nodes, but it might be
>> an argument I could make.
>>
>> Thanks
>>
>> Jamie
>>
>
