xcievers exceptions will be in the datanodes' logs, and your problem looks very much like one. 0.20.5 will have the same issue (since it's on the HDFS side).
J-D

On Wed, Jul 7, 2010 at 10:08 AM, Jamie Cockrill <[email protected]> wrote:
> Hi Todd & JD,
>
> Environment:
> All (hadoop and HBase) installed as of karmic-cdh3, which means:
> Hadoop 0.20.2+228
> HBase 0.89.20100621+17
> Zookeeper 3.3.1+7
>
> Unfortunately my whole cluster of regionservers have now crashed, so I
> can't really say if it was swapping too much. There is a DEBUG
> statement just before it crashes saying:
>
> org.apache.hadoop.hbase.regionserver.wal.HLog: closing hlog writer in
> hdfs://<somewhere on my HDFS, in /hbase>
>
> What follows is:
>
> WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
> org.apache.hadoop.ipc.RemoteException:
> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease
> on <file location as above> File does not exist. Holder
> DFSClient_-11113603 does not have any open files
>
> It then seems to try and do some error recovery (Error Recovery for
> block null bad datanode[0] nodes == null), fails (Could not get block
> locations. Source file "<hbase file as before>" - Aborting). There is
> then an ERROR org.apache...HRegionServer: Close and delete failed.
> There is then a similar LeaseExpiredException as above.
>
> There are then a couple of messages from HRegionServer saying that
> it's notifying master of its shutdown and stopping itself. The
> shutdown hook then fires and the RemoteException and
> LeaseExpiredExceptions are printed again.
>
> ulimit is set to 65000 (it's in the regionserver log, printed as I
> restarted the regionserver), however I haven't got the xceivers set
> anywhere. I'll give that a go. It does seem very odd as I did have a
> few of them fall over one at a time with a few early loads, but that
> seemed to be because the regions weren't splitting properly, so all
> the traffic was going to one node and it was being overwhelmed. Once I
> throttled it, after one load a region split seemed to get
> triggered, which flung regions all over, which made subsequent loads
> much more distributed. However, perhaps the time-bomb was ticking...
> I'll have a go at specifying the xcievers property. I'm pretty
> certain I've got everything else covered, except the patches as
> referenced in the JIRA.
>
> I just grepped some of the log files and didn't get an explicit
> exception with 'xciever' in it.
>
> I am considering downgrading(?) to 0.20.5, however because everything
> is installed as per karmic-cdh3, I'm a bit reluctant to do so as
> presumably Cloudera has tested each of these versions against each
> other? And I don't really want to introduce further versioning issues.
>
> Thanks,
>
> Jamie
>
>
> On 7 July 2010 17:30, Jean-Daniel Cryans <[email protected]> wrote:
>> Jamie,
>>
>> Does your configuration meet the requirements?
>> http://hbase.apache.org/docs/r0.20.5/api/overview-summary.html#requirements
>>
>> ulimit and xcievers, if not set, are usually time bombs that blow off when
>> the cluster is under load.
>>
>> J-D
>>
>> On Wed, Jul 7, 2010 at 9:11 AM, Jamie Cockrill
>> <[email protected]> wrote:
>>
>>> Dear all,
>>>
>>> My current HBase/Hadoop architecture has HBase region servers on the
>>> same physical boxes as the HDFS data-nodes. I'm getting an awful lot
>>> of region server crashes. The last thing that happens appears to be a
>>> DroppedSnapshot Exception, caused by an IOException: could not
>>> complete write to file <file on HDFS>. I am running it under load, how
>>> heavy that is I'm not sure how that is quantified, but I'm guessing it
>>> is a load issue.
>>>
>>> Is it common practice to put region servers on data-nodes? Is it
>>> common to see region server crashes when either the HDFS or region
>>> server (or both) is under heavy load? I'm guessing that is the case as
>>> I've seen a few similar posts. I've not got a great deal of capacity
>>> to be separating region servers from HDFS data nodes, but it might be
>>> an argument I could make.
>>>
>>> Thanks
>>>
>>> Jamie
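
A note on the log search discussed in this thread: a grep for 'xciever' in the regionserver logs can come up empty because the exceptions are expected in the datanodes' logs, and Hadoop uses both spellings ("xceiver" and "xciever") in its messages. Below is a minimal sketch of a search that covers both spellings, case-insensitively. The function names, the default file pattern, and the example log directory are illustrative assumptions, not part of Hadoop or HBase.

```python
import re
from pathlib import Path

# Match both spellings ("xceiver" and "xciever"), ignoring case,
# so that e.g. "DataXceiver" and "xcievers" are both caught.
XCEIVER_PATTERN = re.compile(r"xc(?:ei|ie)ver", re.IGNORECASE)


def find_xceiver_lines(lines):
    """Return (line_number, text) pairs for lines that mention xceivers."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if XCEIVER_PATTERN.search(line)
    ]


def scan_log_dir(log_dir, pattern="*datanode*"):
    """Scan files under log_dir whose names match pattern.

    The default pattern is a guess at how datanode log files are named;
    adjust it to the local installation.
    """
    results = {}
    for path in sorted(Path(log_dir).glob(pattern)):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going on oddly encoded log lines.
        with path.open("r", errors="replace") as handle:
            hits = find_xceiver_lines(handle)
        if hits:
            results[str(path)] = hits
    return results
```

For example, `scan_log_dir("/var/log/hadoop")` (a hypothetical path) would return a dictionary mapping each matching datanode log file to the lines that mention xceivers. It needs to be run on each datanode host, not only on the regionserver that crashed.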
