On the subject of GC and heap, I've left those as defaults. I could
look at those if that's the next logical step? Would there be anything
in any of the logs that I should look at?
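
As a starting point for the log question, here is a minimal sketch of a scan over a datanode or regionserver log, assuming only the strings already mentioned in this thread (xcievers, LeaseExpiredException, DroppedSnapshot). The function names and pattern labels are made up for illustration, and the exact wording of real log lines may differ, so both spellings of "xciever" are matched.

```python
import re
from collections import Counter

# Substrings taken from this thread, not from any official list of
# Hadoop/HBase log messages. Both "xciever" and "xceiver" are matched
# because the spelling varies.
PATTERNS = {
    "xcievers": re.compile(r"xc(?:ie|ei)ver", re.IGNORECASE),
    "lease_expired": re.compile(r"LeaseExpiredException"),
    "dropped_snapshot": re.compile(r"DroppedSnapshot"),
}


def scan_lines(lines):
    """Count how many lines match each pattern (at most once per line)."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def scan_file(path):
    """Scan one log file; undecodable bytes are replaced, not fatal."""
    with open(path, errors="replace") as handle:
        return scan_lines(handle)
```

Calling scan_file() on each datanode and regionserver log and comparing the counts per host would show whether the xcievers messages are recent or, as with the ones I found, months old.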

One thing I have noticed is that it takes an absolute age to log in to
the DN/RS to restart the RS once it's fallen over; in one instance it
took about 10 minutes. These are 8GB, 4-core amd64 boxes.

ta

Jamie



On 7 July 2010 18:30, Jamie Cockrill <[email protected]> wrote:
> Bad news, it looks like my xcievers is set as it should be, it's in
> the hdfs-site.xml and looking at the job.xml of one of my jobs in the
> job-tracker, it's showing that property as set to 2047. I've cat |
> grepped one of the datanode logs and although there were a few in
> there, they were from a few months ago. I've upped my MAX_FILESIZE on
> my table to 1GB to see if that helps (not sure if it will!).
>
> Thanks,
>
> Jamie
>
> On 7 July 2010 18:12, Jean-Daniel Cryans <[email protected]> wrote:
>> xcievers exceptions will be in the datanodes' logs, and your problem
>> totally looks like it. 0.20.5 will have the same issue (since it's on
>> the HDFS side)
>>
>> J-D
>>
>> On Wed, Jul 7, 2010 at 10:08 AM, Jamie Cockrill
>> <[email protected]> wrote:
>>> Hi Todd & JD,
>>>
>>> Environment:
>>> All (hadoop and HBase) installed as of karmic-cdh3, which means:
>>> Hadoop 0.20.2+228
>>> HBase 0.89.20100621+17
>>> Zookeeper 3.3.1+7
>>>
>>> Unfortunately my whole cluster of regionservers has now crashed, so I
>>> can't really say if it was swapping too much. There is a DEBUG
>>> statement just before it crashes saying:
>>>
>>> org.apache.hadoop.hbase.regionserver.wal.HLog: closing hlog writer in
>>> hdfs://<somewhere on my HDFS, in /hbase>
>>>
>>> What follows is:
>>>
>>> WARN org.apache.hadoop.hdfs.DFSClient: DataStreamer Exception:
>>> org.apache.hadoop.ipc.RemoteException:
>>> org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease
>>> on <file location as above> File does not exist. Holder
>>> DFSClient_-11113603 does not have any open files
>>>
>>> It then seems to try and do some error recovery (Error Recovery for
>>> block null bad datanode[0] nodes == null), fails (Could not get block
>>> locations. Source file "<hbase file as before>" - Aborting). There is
>>> then an ERROR org.apache...HRegionServer: Close and delete failed.
>>> There is then a similar LeaseExpiredException as above.
>>>
>>> There are then a couple of messages from HRegionServer saying that
>>> it's notifying master of its shutdown and stopping itself. The
>>> shutdown hook then fires and the RemoteException and
>>> LeaseExpiredExceptions are printed again.
>>>
>>> ulimit is set to 65000 (it's in the regionserver log, printed as I
>>> restarted the regionserver), however I haven't got the xcievers set
>>> anywhere. I'll give that a go. It does seem very odd as I did have a
>>> few of them fall over one at a time with a few early loads, but that
>>> seemed to be because the regions weren't splitting properly, so all
>>> the traffic was going to one node and it was being overwhelmed. Once I
>>> throttled it, after one load a region split seemed to get
>>> triggered, which flung regions all over, which made subsequent loads
>>> much more distributed. However, perhaps the time-bomb was ticking...
>>> I'll have a go at specifying the xcievers property. I'm pretty
>>> certain I've got everything else covered, except the patches as
>>> referenced in the JIRA.
>>>
>>> I just grepped some of the log files and didn't get an explicit
>>> exception with 'xciever' in it.
>>>
>>> I am considering downgrading(?) to 0.20.5, however because everything
>>> is installed as per karmic-cdh3, I'm a bit reluctant to do so as
>>> presumably Cloudera has tested each of these versions against each
>>> other? And I don't really want to introduce further versioning issues.
>>>
>>> Thanks,
>>>
>>> Jamie
>>>
>>>
>>> On 7 July 2010 17:30, Jean-Daniel Cryans <[email protected]> wrote:
>>>> Jamie,
>>>>
>>>> Does your configuration meet the requirements?
>>>> http://hbase.apache.org/docs/r0.20.5/api/overview-summary.html#requirements
>>>>
>>>> ulimit and xcievers, if not set, are usually time bombs that go off when
>>>> the cluster is under load.
>>>>
>>>> J-D
>>>>
>>>> On Wed, Jul 7, 2010 at 9:11 AM, Jamie Cockrill
>>>> <[email protected]> wrote:
>>>>
>>>>> Dear all,
>>>>>
>>>>> My current HBase/Hadoop architecture has HBase region servers on the
>>>>> same physical boxes as the HDFS data-nodes. I'm getting an awful lot
>>>>> of region server crashes. The last thing that happens appears to be a
>>>>> DroppedSnapshot Exception, caused by an IOException: could not
>>>>> complete write to file <file on HDFS>. I am running it under load;
>>>>> I'm not sure how to quantify how heavy that load is, but I'm guessing
>>>>> it is a load issue.
>>>>>
>>>>> Is it common practice to put region servers on data-nodes? Is it
>>>>> common to see region server crashes when either the HDFS or region
>>>>> server (or both) is under heavy load? I'm guessing that is the case as
>>>>> I've seen a few similar posts. I've not got a great deal of capacity
>>>>> to be separating region servers from HDFS data nodes, but it might be
>>>>> an argument I could make.
>>>>>
>>>>> Thanks
>>>>>
>>>>> Jamie
>>>>>
>>>>
>>>
>>
>
