On the subject of swapping, I'm re-running one of the jobs to have a go. All the load is going to one regionserver at the moment (no region splits have occurred yet) and it's on (via top):
Mem:   8184284k total, ~8130000k used, ~524000k free, 28000k buffers
       (might be inaccurate, can't type at a ms rate!)
Swap: 23976972k total, ~759000k used, ~23222000k free, 458000k cached

Not sure if that is indicative of anything.

thanks

Jamie

PS, I have disabled compression on my table for now, as having 'GZ'
compression specified slowed loading of data down massively and my RS
logs seemed to be filled with messages from a supposed CodecPool with
something like 'returning new codec instance'.

On 7 July 2010 18:32, Jamie Cockrill <[email protected]> wrote:
> On the subject of GC and heap, I've left those as defaults. I could
> look at those if that's the next logical step? Would there be anything
> in any of the logs that I should look at?
>
> One thing I have noticed is that it does take an absolute age to log
> in to the DN/RS to restart the RS once it's fallen over, in one
> instance it took about 10 minutes. These are 8GB, 4 core amd64 boxes
>
> ta
>
> Jamie
>
>
>
> On 7 July 2010 18:30, Jamie Cockrill <[email protected]> wrote:
>> Bad news, it looks like my xcievers is set as it should be, it's in
>> the hdfs-site.xml and looking at the job.xml of one of my jobs in the
>> job-tracker, it's showing that property as set to 2047. I've cat |
>> grepped one of the datanode logs and although there were a few in
>> there, they were from a few months ago. I've upped my MAX_FILESIZE on
>> my table to 1GB to see if that helps (not sure if it will!).
>>
>> Thanks,
>>
>> Jamie
>>
>> On 7 July 2010 18:12, Jean-Daniel Cryans <[email protected]> wrote:
>>> xcievers exceptions will be in the datanodes' logs, and your problem
>>> totally looks like it.
>>> 0.20.5 will have the same issue (since it's on
>>> the HDFS side)
>>>
>>> J-D
>>>
>>> On Wed, Jul 7, 2010 at 10:08 AM, Jamie Cockrill
>>> <[email protected]> wrote:
>>>> Hi Todd & JD,
>>>>
>>>> Environment:
>>>> All (hadoop and HBase) installed as of karmic-cdh3, which means:
>>>> Hadoop 0.20.2+228
>>>> HBase 0.89.20100621+17
>>>> Zookeeper 3.3.1+7
>>>>
>>>> Unfortunately my whole cluster of regionservers have now crashed, so I
>>>> can't really say if it was swapping too much. There is a DEBUG
>>>> statement just before it crashes saying:
>>>>
>>>> org.apache.hadoop.hbase.regionserver.wal.HLog: closing hlog writer in
>>>> hdfs://<somewhere on my HDFS, in /hbase>
>>>>
>>>> What follows is:
>>>>
>>>> WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
>>>> org.apache.hadoop.ipc.RemoteException:
>>>> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease
>>>> on <file location as above> File does not exist. Holder
>>>> DFSClient_-11113603 does not have any open files
>>>>
>>>> It then seems to try and do some error recovery (Error Recovery for
>>>> block null bad datanode[0] nodes == null), fails (Could not get block
>>>> locations. Source file "<hbase file as before>" - Aborting). There is
>>>> then an ERROR org.apache...HRegionServer: Close and delete failed.
>>>> There is then a similar LeaseExpiredException as above.
>>>>
>>>> There are then a couple of messages from HRegionServer saying that
>>>> it's notifying master of its shutdown and stopping itself. The
>>>> shutdown hook then fires and the RemoteException and
>>>> LeaseExpiredExceptions are printed again.
>>>>
>>>> ulimit is set to 65000 (it's in the regionserver log, printed as I
>>>> restarted the regionserver), however I haven't got the xceivers set
>>>> anywhere. I'll give that a go.
>>>> It does seem very odd as I did have a
>>>> few of them fall over one at a time with a few early loads, but that
>>>> seemed to be because the regions weren't splitting properly, so all
>>>> the traffic was going to one node and it was being overwhelmed. Once I
>>>> throttled it, after one load a region split seemed to get
>>>> triggered, which flung regions all over, which made subsequent loads
>>>> much more distributed. However, perhaps the time-bomb was ticking...
>>>> I'll have a go at specifying the xcievers property. I'm pretty
>>>> certain I've got everything else covered, except the patches as
>>>> referenced in the JIRA.
>>>>
>>>> I just grepped some of the log files and didn't get an explicit
>>>> exception with 'xciever' in it.
>>>>
>>>> I am considering downgrading(?) to 0.20.5, however because everything
>>>> is installed as per karmic-cdh3, I'm a bit reluctant to do so as
>>>> presumably Cloudera has tested each of these versions against each
>>>> other? And I don't really want to introduce further versioning issues.
>>>>
>>>> Thanks,
>>>>
>>>> Jamie
>>>>
>>>>
>>>> On 7 July 2010 17:30, Jean-Daniel Cryans <[email protected]> wrote:
>>>>> Jamie,
>>>>>
>>>>> Does your configuration meet the requirements?
>>>>> http://hbase.apache.org/docs/r0.20.5/api/overview-summary.html#requirements
>>>>>
>>>>> ulimit and xcievers, if not set, are usually time bombs that blow off when
>>>>> the cluster is under load.
>>>>>
>>>>> J-D
>>>>>
>>>>> On Wed, Jul 7, 2010 at 9:11 AM, Jamie Cockrill
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Dear all,
>>>>>>
>>>>>> My current HBase/Hadoop architecture has HBase region servers on the
>>>>>> same physical boxes as the HDFS data-nodes. I'm getting an awful lot
>>>>>> of region server crashes. The last thing that happens appears to be a
>>>>>> DroppedSnapshot Exception, caused by an IOException: could not
>>>>>> complete write to file <file on HDFS>.
>>>>>> I am running it under load, how
>>>>>> heavy that is I'm not sure how that is quantified, but I'm guessing it
>>>>>> is a load issue.
>>>>>>
>>>>>> Is it common practice to put region servers on data-nodes? Is it
>>>>>> common to see region server crashes when either the HDFS or region
>>>>>> server (or both) is under heavy load? I'm guessing that is the case as
>>>>>> I've seen a few similar posts. I've not got a great deal of capacity
>>>>>> to be separating region servers from HDFS data nodes, but it might be
>>>>>> an argument I could make.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Jamie
>>>>>>
>>>>>
>>>>
>>>
>>
>
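
For anyone repeating the "cat | grep" check on the datanode logs mentioned earlier in the thread, here is a minimal sketch of the same search in Python. It matches both spellings ('xciever' and 'xceiver') regardless of case, so a hit is not missed because of the spelling difference seen in this thread. The function name and the sample log lines are hypothetical and only for illustration; the exact wording of a real datanode log entry is an assumption, not copied from an actual log.

```python
import re

# Matches both spellings ("xciever" and "xceiver"), ignoring case.
XCEIVER_PATTERN = re.compile(r"xc(?:ie|ei)ver", re.IGNORECASE)


def find_xceiver_lines(log_lines):
    """Return (line_number, line) pairs for lines that mention xcievers/xceivers."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(log_lines, start=1)
        if XCEIVER_PATTERN.search(line)
    ]


# Hypothetical sample lines (assumed format, not taken from a real log).
sample_log = [
    "2010-07-07 17:01:12 INFO DataNode: Receiving block blk_1\n",
    "2010-07-07 17:01:13 ERROR DataNode: xceiverCount 2048 exceeds the limit of concurrent xcievers 2047\n",
    "2010-07-07 17:01:14 INFO DataNode: PacketResponder 0 terminating\n",
]

hits = find_xceiver_lines(sample_log)
for number, line in hits:
    print(number, line)
```

To scan a real log, pass an open file object instead of the sample list, since the function accepts any iterable of lines. Checking the timestamps on the hits then shows whether they are recent or, as in the case described above, months old.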
