Bad news: it looks like my xcievers setting is as it should be. It's in hdfs-site.xml, and the job.xml of one of my jobs in the job-tracker shows that property set to 2047. I've grepped one of the datanode logs and, although there were a few hits in there, they were from a few months ago. I've upped MAX_FILESIZE on my table to 1GB to see if that helps (not sure if it will!).
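For anyone wanting to repeat the two checks above (the configured value and the datanode log), here is a minimal Python sketch. It rests on assumptions that are not spelled out in this thread: that the property is named dfs.datanode.max.xcievers (the misspelling is the real one, as far as I know), and that the datanode reports the limit being hit with a message containing "exceeds the limit of concurrent xcievers". The function names are made up for illustration; adjust the property name and message text to whatever your Hadoop version actually uses.

```python
import re
import sys
import xml.etree.ElementTree as ET

# Assumed property name; the thread only ever calls it "xcievers".
XCIEVERS_PROPERTY = "dfs.datanode.max.xcievers"

# Assumed wording of the datanode message when the limit is exceeded.
LIMIT_RE = re.compile(r"exceeds the limit of concurrent xcievers")


def read_hadoop_property(site_xml_text, name):
    """Return the value of a property from *-site.xml text, or None if absent."""
    root = ET.fromstring(site_xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


def find_xciever_limit_lines(log_lines):
    """Return (line_number, line) pairs for lines reporting the limit was hit."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(log_lines, start=1)
        if LIMIT_RE.search(line)
    ]


if __name__ == "__main__":
    # Usage: python check_xcievers.py <hdfs-site.xml> <datanode log>
    site_path, log_path = sys.argv[1], sys.argv[2]
    with open(site_path) as handle:
        print(XCIEVERS_PROPERTY, "=", read_hadoop_property(handle.read(), XCIEVERS_PROPERTY))
    with open(log_path, errors="replace") as handle:
        for number, line in find_xciever_limit_lines(handle):
            print(number, line)
```

Matching on the full message rather than just 'xciever' avoids hits on unrelated lines that merely mention the DataXceiver class, and the printed lines keep their timestamps, so old hits are easy to tell from recent ones.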
Thanks,
Jamie

On 7 July 2010 18:12, Jean-Daniel Cryans <[email protected]> wrote:
> xcievers exceptions will be in the datanodes' logs, and your problem
> totally looks like it. 0.20.5 will have the same issue (since it's on
> the HDFS side)
>
> J-D
>
> On Wed, Jul 7, 2010 at 10:08 AM, Jamie Cockrill
> <[email protected]> wrote:
>> Hi Todd & JD,
>>
>> Environment:
>> All (hadoop and HBase) installed as of karmic-cdh3, which means:
>> Hadoop 0.20.2+228
>> HBase 0.89.20100621+17
>> Zookeeper 3.3.1+7
>>
>> Unfortunately my whole cluster of regionservers have now crashed, so I
>> can't really say if it was swapping too much. There is a DEBUG
>> statement just before it crashes saying:
>>
>> org.apache.hadoop.hbase.regionserver.wal.HLog: closing hlog writer in
>> hdfs://<somewhere on my HDFS, in /hbase>
>>
>> What follows is:
>>
>> WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
>> org.apache.hadoop.ipc.RemoteException:
>> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease
>> on <file location as above> File does not exist. Holder
>> DFSClient_-11113603 does not have any open files
>>
>> It then seems to try and do some error recovery (Error Recovery for
>> block null bad datanode[0] nodes == null), fails (Could not get block
>> locations. Source file "<hbase file as before>" - Aborting). There is
>> then an ERROR org.apache...HRegionServer: Close and delete failed.
>> There is then a similar LeaseExpiredException as above.
>>
>> There are then a couple of messages from HRegionServer saying that
>> it's notifying master of its shutdown and stopping itself. The
>> shutdown hook then fires and the RemoteException and
>> LeaseExpiredExceptions are printed again.
>>
>> ulimit is set to 65000 (it's in the regionserver log, printed as I
>> restarted the regionserver), however I haven't got the xceivers set
>> anywhere. I'll give that a go. It does seem very odd as I did have a
>> few of them fall over one at a time with a few early loads, but that
>> seemed to be because the regions weren't splitting properly, so all
>> the traffic was going to one node and it was being overwhelmed. Once I
>> throttled it, after one load it a region split seemed to get
>> triggered, which flung regions all over, which made subsequent loads
>> much more distributed. However, perhaps the time-bomb was ticking...
>> I'll have a go at specifying the xcievers property. I'm pretty
>> certain i've got everything else covered, except the patches as
>> referenced in the JIRA.
>>
>> I just grepped some of the log files and didn't get an explicit
>> exception with 'xciever' in it.
>>
>> I am considering downgrading(?) to 0.20.5, however because everything
>> is installed as per karmic-cdh3, I'm a bit reluctant to do so as
>> presumably Cloudera has tested each of these versions against each
>> other? And I don't really want to introduce further versioning issues.
>>
>> Thanks,
>>
>> Jamie
>>
>>
>> On 7 July 2010 17:30, Jean-Daniel Cryans <[email protected]> wrote:
>>> Jamie,
>>>
>>> Does your configuration meets the requirements?
>>> http://hbase.apache.org/docs/r0.20.5/api/overview-summary.html#requirements
>>>
>>> ulimit and xcievers, if not set, are usually time bombs that blow off when
>>> the cluster is under load.
>>>
>>> J-D
>>>
>>> On Wed, Jul 7, 2010 at 9:11 AM, Jamie Cockrill
>>> <[email protected]> wrote:
>>>
>>>> Dear all,
>>>>
>>>> My current HBase/Hadoop architecture has HBase region servers on the
>>>> same physical boxes as the HDFS data-nodes. I'm getting an awful lot
>>>> of region server crashes. The last thing that happens appears to be a
>>>> DroppedSnapshot Exception, caused by an IOException: could not
>>>> complete write to file <file on HDFS>. I am running it under load, how
>>>> heavy that is I'm not sure how that is quantified, but I'm guessing it
>>>> is a load issue.
>>>>
>>>> Is it common practice to put region servers on data-nodes? Is it
>>>> common to see region server crashes when either the HDFS or region
>>>> server (or both) is under heavy load? I'm guessing that is the case as
>>>> I've seen a few similar posts. I've not got a great deal of capacity
>>>> to be separating region servers from HDFS data nodes, but it might be
>>>> an argument I could make.
>>>>
>>>> Thanks
>>>>
>>>> Jamie
>>>>
>>>
>>
>
