Re: Bulk loading job failed when one region server went down in the cluster

Stack Wed, 15 Aug 2012 14:53:20 -0700

On Mon, Aug 13, 2012 at 6:05 PM, anil gupta <[email protected]> wrote:
> It would be great if you can answer this simple question of mine: Is HBase
> Bulk Loading fault tolerant to Region Server failures in a viable/decent
> environment?
>


Bulk Loading is a MapReduce job.  Bulk Loading is as 'fault tolerant'
as MapReduce is (MapReduce tasks have long timeouts -- ten minutes IIRC
-- and each task is retried up to a maximum, 4 attempts by default, but
once all timeouts and retries are exhausted, the job will fail).
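
To make the retry behaviour concrete, here is a toy model of it.  This
is not Hadoop code, just a sketch: run_task, run_job and the sample
tasks are hypothetical names made up for illustration.  The default of
4 attempts is the figure mentioned above; as far as I recall it is
controlled by mapred.map.max.attempts in Hadoop 1.x, with the ten
minute timeout under mapred.task.timeout, but check the docs for your
version.

```python
# Toy model of MapReduce task retries. Illustrative only, not Hadoop code.
# All names below are hypothetical.

MAX_ATTEMPTS = 4  # default number of attempts per task, as noted above


def run_task(attempt_fn, max_attempts=MAX_ATTEMPTS):
    """Run one task, retrying on failure.

    attempt_fn(attempt_number) returns True on success, False on failure.
    Returns the 1-based attempt that succeeded, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        if attempt_fn(attempt):
            return attempt
    return None


def run_job(tasks, max_attempts=MAX_ATTEMPTS):
    """A job succeeds only if every task succeeds within max_attempts."""
    return all(run_task(task, max_attempts) is not None for task in tasks)


# A task that always succeeds on its first attempt.
def healthy(attempt):
    return True


# A task whose target region is back online by the third attempt.
def recovers_by_third(attempt):
    return attempt >= 3


# A task whose target region never comes back within the retry budget.
def never_recovers(attempt):
    return False


if __name__ == "__main__":
    print(run_job([healthy, recovers_by_third]))  # job survives the RS failure
    print(run_job([healthy, never_recovers]))     # one exhausted task fails the job
```

The point of the model: a single RS going down is survivable as long as
the affected regions are reassigned before any one task burns through
its attempts.  If reassignment is slower than that, one task is enough
to fail the whole job.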

You have RSs failing, maybe because you have too many slots allocated
to MapReduce for the hardware you are using for your PoC (as Michael
Segel suggests).  Maybe the MR task is not finding the regions' new
locations in time, or maybe the regions are not coming back online in
time for the MR job to complete?

The logs you provide for the MR task show us failing to go against an
RS that has died but doesn't know it yet (the YouAreDeadException).
Try looking at the subsequent map tasks that fail.  Why are they
failing?  For the same reason?  Look in the master log to see what's
happening around log splitting for the failed server.  Is it hung up,
preventing the regions from being assigned to new locations?

St.Ack
