Hi Todd & JD,

Environment: everything (Hadoop and HBase) is installed as per karmic-cdh3, which means:

Hadoop 0.20.2+228
HBase 0.89.20100621+17
ZooKeeper 3.3.1+7

Unfortunately my whole cluster of regionservers has now crashed, so I can't really say whether it was swapping too much. There is a DEBUG statement just before it crashes saying:

org.apache.hadoop.hbase.regionserver.wal.HLog: closing hlog writer in hdfs://<somewhere on my HDFS, in /hbase>

What follows is:

WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on <file location as above> File does not exist. Holder DFSClient_-11113603 does not have any open files

It then seems to try to do some error recovery (Error Recovery for block null bad datanode[0] nodes == null), which fails (Could not get block locations. Source file "<hbase file as before>" - Aborting). There is then an ERROR org.apache...HRegionServer: Close and delete failed, followed by a similar LeaseExpiredException to the one above. After that there are a couple of messages from HRegionServer saying that it's notifying the master of its shutdown and stopping itself. The shutdown hook then fires, and the RemoteException and LeaseExpiredException are printed again.

ulimit is set to 65000 (it's in the regionserver log, printed as I restarted the regionserver); however, I haven't got xcievers set anywhere. I'll give that a go.

It does seem very odd. I did have a few regionservers fall over one at a time during a few early loads, but that seemed to be because the regions weren't splitting properly, so all the traffic was going to one node and overwhelming it. Once I throttled the load, a region split seemed to get triggered after one load, which flung regions all over the cluster and made subsequent loads much more distributed. However, perhaps the time bomb was ticking...

I'll have a go at specifying the xcievers property. I'm pretty certain I've got everything else covered, except the patches referenced in the JIRA. I just grepped some of the log files and didn't find an explicit exception with 'xciever' in it.

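For reference, a minimal sketch of the setting in question as I understand it, placed in conf/hdfs-site.xml on each datanode. The property name really is spelled "xcievers". The value 4096 is only an assumed placeholder, not something tested on this cluster; the requirements page J-D linked gives the recommended figure, and the datanodes need restarting before the change takes effect.

```xml
<!-- conf/hdfs-site.xml on every datanode (sketch; value is an assumption) -->
<property>
  <!-- Upper bound on concurrent transfer threads a datanode will serve.
       Note the historical misspelling in the property name. -->
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
```
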
I am considering downgrading(?) to 0.20.5; however, because everything is installed as per karmic-cdh3, I'm a bit reluctant to do so, as presumably Cloudera has tested each of these versions against each other, and I don't really want to introduce further versioning issues.

Thanks,

Jamie

On 7 July 2010 17:30, Jean-Daniel Cryans <[email protected]> wrote:
> Jamie,
>
> Does your configuration meets the requirements?
> http://hbase.apache.org/docs/r0.20.5/api/overview-summary.html#requirements
>
> ulimit and xcievers, if not set, are usually time bombs that blow off when
> the cluster is under load.
>
> J-D
>
> On Wed, Jul 7, 2010 at 9:11 AM, Jamie Cockrill
> <[email protected]> wrote:
>
>> Dear all,
>>
>> My current HBase/Hadoop architecture has HBase region servers on the
>> same physical boxes as the HDFS data-nodes. I'm getting an awful lot
>> of region server crashes. The last thing that happens appears to be a
>> DroppedSnapshot Exception, caused by an IOException: could not
>> complete write to file <file on HDFS>. I am running it under load, how
>> heavy that is I'm not sure how that is quantified, but I'm guessing it
>> is a load issue.
>>
>> Is it common practice to put region servers on data-nodes? Is it
>> common to see region server crashes when either the HDFS or region
>> server (or both) is under heavy load? I'm guessing that is the case as
>> I've seen a few similar posts. I've not got a great deal of capacity
>> to be separating region servers from HDFS data nodes, but it might be
>> an argument I could make.
>>
>> Thanks
>>
>> Jamie
>>
>
