One last thing: a slight oddity of our setup is that although we're on Hadoop 0.20.2, we were previously on 0.18.something and upgraded. That went fine and there have been no problems. However, some convenience base classes that we created for our jobs were based on the old pre-0.20 API, so there are deprecation warnings all over. I am being consistent and using the mapred.TableOutputFormat (complete with deprecation), but just in case that's causing an issue, I thought I'd throw it in...
I might try to make a version that uses only classes in the 0.20 API.

Thanks,

Jamie

On 7 July 2010 18:08, Jamie Cockrill <[email protected]> wrote:
> Hi Todd & JD,
>
> Environment:
> All (hadoop and HBase) installed as of karmic-cdh3, which means:
> Hadoop 0.20.2+228
> HBase 0.89.20100621+17
> Zookeeper 3.3.1+7
>
> Unfortunately my whole cluster of regionservers has now crashed, so I
> can't really say if it was swapping too much. There is a DEBUG
> statement just before it crashes saying:
>
> org.apache.hadoop.hbase.regionserver.wal.HLog: closing hlog writer in
> hdfs://<somewhere on my HDFS, in /hbase>
>
> What follows is:
>
> WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
> org.apache.hadoop.ipc.RemoteException:
> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease
> on <file location as above> File does not exist. Holder
> DFSClient_-11113603 does not have any open files
>
> It then seems to try and do some error recovery (Error Recovery for
> block null bad datanode[0] nodes == null), fails (Could not get block
> locations. Source file "<hbase file as before>" - Aborting). There is
> then an ERROR org.apache...HRegionServer: Close and delete failed.
> There is then a similar LeaseExpiredException as above.
>
> There are then a couple of messages from HRegionServer saying that
> it's notifying master of its shutdown and stopping itself. The
> shutdown hook then fires and the RemoteException and
> LeaseExpiredExceptions are printed again.
>
> ulimit is set to 65000 (it's in the regionserver log, printed as I
> restarted the regionserver), however I haven't got the xcievers set
> anywhere. I'll give that a go. It does seem very odd as I did have a
> few of them fall over one at a time with a few early loads, but that
> seemed to be because the regions weren't splitting properly, so all
> the traffic was going to one node and it was being overwhelmed.
> Once I throttled it, after one load a region split seemed to get
> triggered, which flung regions all over, which made subsequent loads
> much more distributed. However, perhaps the time-bomb was ticking...
> I'll have a go at specifying the xcievers property. I'm pretty
> certain I've got everything else covered, except the patches as
> referenced in the JIRA.
>
> I just grepped some of the log files and didn't get an explicit
> exception with 'xciever' in it.
>
> I am considering downgrading(?) to 0.20.5, however because everything
> is installed as per karmic-cdh3, I'm a bit reluctant to do so as
> presumably Cloudera has tested each of these versions against each
> other? And I don't really want to introduce further versioning issues.
>
> Thanks,
>
> Jamie
>
>
> On 7 July 2010 17:30, Jean-Daniel Cryans <[email protected]> wrote:
>> Jamie,
>>
>> Does your configuration meet the requirements?
>> http://hbase.apache.org/docs/r0.20.5/api/overview-summary.html#requirements
>>
>> ulimit and xcievers, if not set, are usually time bombs that blow off when
>> the cluster is under load.
>>
>> J-D
>>
>> On Wed, Jul 7, 2010 at 9:11 AM, Jamie Cockrill
>> <[email protected]> wrote:
>>
>>> Dear all,
>>>
>>> My current HBase/Hadoop architecture has HBase region servers on the
>>> same physical boxes as the HDFS data-nodes. I'm getting an awful lot
>>> of region server crashes. The last thing that happens appears to be a
>>> DroppedSnapshot Exception, caused by an IOException: could not
>>> complete write to file <file on HDFS>. I am running it under load; I'm
>>> not sure how to quantify how heavy that is, but I'm guessing it is a
>>> load issue.
>>>
>>> Is it common practice to put region servers on data-nodes? Is it
>>> common to see region server crashes when either the HDFS or region
>>> server (or both) is under heavy load? I'm guessing that is the case as
>>> I've seen a few similar posts.
>>> I've not got a great deal of capacity
>>> to be separating region servers from HDFS data nodes, but it might be
>>> an argument I could make.
>>>
>>> Thanks
>>>
>>> Jamie
>>>
>>
>
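
For reference, the xcievers setting discussed in the thread is normally given as a property in hdfs-site.xml on each datanode. The fragment below is only a sketch: the property name is the (misspelled) one Hadoop 0.20 uses, but the value shown is an illustrative assumption and not a figure taken from this thread, so check it against the requirements page linked above. The datanodes would need restarting for it to take effect.

```xml
<!-- hdfs-site.xml on each datanode (sketch; the value is an assumption) -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <!-- Upper bound on concurrent data transfer threads per datanode.
       4096 is an illustrative figure, not one quoted in this thread. -->
  <value>4096</value>
</property>
```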
