Anil,

Do you have a root cause for the RS failure? I have never heard of one RS failure causing a whole job to fail.
On Tue, Aug 7, 2012 at 1:59 PM, anil gupta <[email protected]> wrote:
> Hi HBase Folks,
>
> I ran the bulk loader yesterday night to load data into a table. During the
> bulk loading job, one of the region servers crashed and the entire job
> failed. It takes around 2.5 hours for this job to finish, and the job
> failed when it was around 50% complete. After the failure, that table was
> also corrupted in HBase. My cluster has 8 region servers.
>
> Is bulk loading not fault tolerant to the failure of region servers?
>
> I am using this old email chain because at that time my question went
> unanswered. Please share your views.
>
> Thanks,
> Anil Gupta
>
> On Tue, Apr 3, 2012 at 9:12 AM, anil gupta <[email protected]> wrote:
> >
> > Hi Kevin,
> >
> > I am not really concerned about the RegionServer going down, as the same
> > thing can happen when deployed in production. Although in production we
> > won't have a VM environment, and I am aware that my current dev
> > environment is not good for heavy processing. What I am concerned about
> > is the failure of the bulk loading job when the Region Server failed.
> > Does this mean that the bulk loading job is not fault tolerant to the
> > failure of a Region Server? I was expecting the job to be successful even
> > though the RegionServer failed, because there are 6 more RS running in
> > the cluster. Fault tolerance is one of the biggest selling points of the
> > Hadoop platform. Let me know your views. Thanks for your time.
> >
> > Thanks,
> > Anil Gupta
> >
> > On Tue, Apr 3, 2012 at 7:34 AM, Kevin O'dell <[email protected]> wrote:
> >
> >> Anil,
> >>
> >> I am sorry for the delayed response.
> >> Reviewing the logs, it appears:
> >>
> >> 12/03/30 15:38:31 INFO zookeeper.ClientCnxn: Client session timed out,
> >> have not heard from server in 59311ms for sessionid 0x136557f99c90065,
> >> closing socket connection and attempting reconnect
> >>
> >> 12/03/30 15:38:32 FATAL regionserver.HRegionServer: ABORTING region
> >> server serverName=ihub-dn-b1,60020,1332955859363, load=(requests=0,
> >> regions=44, usedHeap=446, maxHeap=1197): Unhandled exception:
> >> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> >> currently processing ihub-dn-b1,60020,1332955859363 as dead server
> >>
> >> This looks like a classic overworked RS. The RS was doing too much and
> >> did not respond in time, so the Master marked it as dead; when the RS
> >> finally responded, the Master replied that it was already dead and
> >> aborted the server. This is why you see the YouAreDeadException. It is
> >> probably due to the shared resources of the VM infrastructure you are
> >> running on. You will either need to devote more resources or add more
> >> nodes (most likely physical) to the cluster if you would like to keep
> >> running these jobs.
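[Editor's note: the failure mode described above (a ZooKeeper session timing out under load, followed by YouAreDeadException) can sometimes be ridden out by giving the region server a longer session timeout. A minimal hbase-site.xml sketch follows; the 120000 ms value is an illustrative assumption, not a recommendation from this thread.]

```xml
<!-- hbase-site.xml (sketch): allow the RS more time before the Master
     declares it dead. The log above shows ~59311 ms of silence, which is
     right at the 60000 ms default session timeout of that HBase era. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>120000</value>
</property>
```

Two caveats: the ZooKeeper ensemble's own maxSessionTimeout caps whatever the client requests, and a longer timeout only masks the underlying resource starvation (GC pauses, oversubscribed VMs) rather than fixing it, as Kevin notes above.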
> >>
> >> On Fri, Mar 30, 2012 at 9:24 PM, anil gupta <[email protected]> wrote:
> >> > Hi Kevin,
> >> >
> >> > Here is a Dropbox link to the log file of the region server which failed:
> >> >
> >> > http://dl.dropbox.com/u/64149128/hbase-hbase-regionserver-ihub-dn-b1.out
> >> >
> >> > IMHO, the problem starts from line #3009, which says:
> >> >
> >> > 12/03/30 15:38:32 FATAL regionserver.HRegionServer: ABORTING region
> >> > server serverName=ihub-dn-b1,60020,1332955859363, load=(requests=0,
> >> > regions=44, usedHeap=446, maxHeap=1197): Unhandled exception:
> >> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> >> > currently processing ihub-dn-b1,60020,1332955859363 as dead server
> >> >
> >> > I have already tested the fault tolerance of HBase by manually
> >> > bringing down a RS while querying a table, and it worked fine, so I
> >> > was expecting the same today (even though the RS went down by itself
> >> > this time) when I was loading the data. But it didn't work out well.
> >> > Thanks for your time. Let me know if you need more details.
> >> >
> >> > ~Anil
> >> >
> >> > On Fri, Mar 30, 2012 at 6:05 PM, Kevin O'dell <[email protected]> wrote:
> >> >
> >> >> Anil,
> >> >>
> >> >> Can you please attach the RS logs from the failure?
> >> >>
> >> >> On Fri, Mar 30, 2012 at 7:05 PM, anil gupta <[email protected]> wrote:
> >> >> > Hi All,
> >> >> >
> >> >> > I am using cdh3u2 and I have 7 worker nodes (VMs spread across two
> >> >> > machines) which are running a Datanode, Tasktracker, and Region
> >> >> > Server (1200 MB heap size). I was loading data into HBase using the
> >> >> > Bulk Loader with a custom mapper. I was loading around 34 million
> >> >> > records, and I have loaded the same set of data in the same
> >> >> > environment many times before without any problem.
> >> >> > This time, while loading the data, one of the region servers failed
> >> >> > (but the DN and TT kept running on that node), and then after
> >> >> > numerous failures of map tasks the loading job failed. Is there any
> >> >> > setting/configuration which can make bulk loading fault-tolerant to
> >> >> > the failure of region servers?
> >> >> >
> >> >> > --
> >> >> > Thanks & Regards,
> >> >> > Anil Gupta
> >> >>
> >> >> --
> >> >> Kevin O'Dell
> >> >> Customer Operations Engineer, Cloudera
> >> >
> >> > --
> >> > Thanks & Regards,
> >> > Anil Gupta
> >>
> >> --
> >> Kevin O'Dell
> >> Customer Operations Engineer, Cloudera
> >
> > --
> > Thanks & Regards,
> > Anil Gupta
>
> --
> Thanks & Regards,
> Anil Gupta

--
Kevin O'Dell
Customer Operations Engineer, Cloudera
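[Editor's note: Anil's question above — whether any setting can make the load job survive region-server failures — is never answered in the thread. On the CDH3-era MapReduce in use here, the relevant knobs are the per-task retry count and the tolerated map-failure percentage. A sketch of the job configuration follows; the values shown are illustrative assumptions.]

```xml
<!-- Job configuration (sketch): let map tasks retry long enough for the
     regions of a dead RS to be reassigned before the job is failed. -->
<property>
  <name>mapred.map.max.attempts</name>
  <value>8</value>  <!-- default is 4; more attempts can outlast RS recovery -->
</property>
<property>
  <name>mapred.max.map.failures.percent</name>
  <value>0</value>  <!-- keep at 0 for a load job: any nonzero value lets the
                         job "succeed" while silently dropping input records -->
</property>
```

Raising the attempt count only helps if region reassignment completes within the retry window; tolerating failed maps via the percentage knob is usually the wrong trade-off for data loading, since the failed tasks' records are simply lost.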
