Anil,

Do you have a root cause for the RS failure? I have never heard of one RS failure causing a whole job to fail.
On Tue, Aug 7, 2012 at 1:59 PM, anil gupta <[email protected]> wrote:
> Hi HBase Folks,
>
> I ran the bulk loader yesterday night to load data into a table. During the
> bulk loading job, one of the region servers crashed and the entire job
> failed. It takes around 2.5 hours for this job to finish, and the job
> failed when it was around 50% complete. After the failure, that table was
> also corrupted in HBase. My cluster has 8 region servers.
>
> Is bulk loading not fault tolerant to the failure of region servers?
>
> I am using this old email chain because at that time my question went
> unanswered. Please share your views.
>
> Thanks,
> Anil Gupta
>
> On Tue, Apr 3, 2012 at 9:12 AM, anil gupta <[email protected]> wrote:
> >
> > Hi Kevin,
> >
> > I am not really concerned about the RegionServer going down, as the same
> > thing can happen when deployed in production. Although in production we
> > won't have a VM environment, and I am aware that my current dev
> > environment is not good for heavy processing. What I am concerned about
> > is the failure of the bulk loading job when the Region Server failed.
> > Does this mean that the bulk loading job is not fault tolerant to the
> > failure of a Region Server? I was expecting the job to be successful even
> > though the RegionServer failed, because there are 6 more RS running in
> > the cluster. Fault tolerance is one of the biggest selling points of the
> > Hadoop platform. Let me know your views. Thanks for your time.
> >
> > Thanks,
> > Anil Gupta
> >
> > On Tue, Apr 3, 2012 at 7:34 AM, Kevin O'dell <[email protected]> wrote:
> >
> >> Anil,
> >>
> >> I am sorry for the delayed response.
> >> Reviewing the logs, it appears:
> >>
> >> 12/03/30 15:38:31 INFO zookeeper.ClientCnxn: Client session timed out,
> >> have not heard from server in 59311ms for sessionid 0x136557f99c90065,
> >> closing socket connection and attempting reconnect
> >>
> >> 12/03/30 15:38:32 FATAL regionserver.HRegionServer: ABORTING region
> >> server serverName=ihub-dn-b1,60020,1332955859363, load=(requests=0,
> >> regions=44, usedHeap=446, maxHeap=1197): Unhandled exception:
> >> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> >> currently processing ihub-dn-b1,60020,1332955859363 as dead server
> >>
> >> This looks like a classic overworked RS. The RS was doing too much and
> >> did not respond in time, so the Master marked it as dead; when the RS
> >> finally responded, the Master replied that it was already dead and
> >> aborted the server. This is why you see the YouAreDeadException. It is
> >> probably due to the shared resources of the VM infrastructure you are
> >> running on. You will either need to devote more resources or add more
> >> nodes (most likely physical) to the cluster if you would like to keep
> >> running these jobs.
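[Editor's note: the failure mode described above (a ZooKeeper session timing out under load, followed by YouAreDeadException) can sometimes be ridden out by giving the region server a longer session timeout. A minimal hbase-site.xml sketch follows; the 120000 ms value is an illustrative assumption, not a recommendation from this thread.]

```xml
<!-- hbase-site.xml (sketch): allow the RS more time before the Master
     declares it dead. The log above shows ~59311 ms of silence, which is
     right at the 60000 ms default session timeout of that HBase era. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>120000</value>
</property>
```

Two caveats: the ZooKeeper ensemble's own maxSessionTimeout caps whatever the client requests, and a longer timeout only masks the underlying resource starvation (GC pauses, oversubscribed VMs) rather than fixing it, as Kevin notes above.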
> >>
> >> On Fri, Mar 30, 2012 at 9:24 PM, anil gupta <[email protected]> wrote:
> >> > Hi Kevin,
> >> >
> >> > Here is a Dropbox link to the log file of the region server which failed:
> >> >
> >> > http://dl.dropbox.com/u/64149128/hbase-hbase-regionserver-ihub-dn-b1.out
> >> >
> >> > IMHO, the problem starts from line #3009, which says:
> >> >
> >> > 12/03/30 15:38:32 FATAL regionserver.HRegionServer: ABORTING region
> >> > server serverName=ihub-dn-b1,60020,1332955859363, load=(requests=0,
> >> > regions=44, usedHeap=446, maxHeap=1197): Unhandled exception:
> >> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> >> > currently processing ihub-dn-b1,60020,1332955859363 as dead server
> >> >
> >> > I have already tested the fault tolerance of HBase by manually
> >> > bringing down a RS while querying a table, and it worked fine, so I
> >> > was expecting the same today (even though the RS went down by itself
> >> > this time) when I was loading the data. But it didn't work out well.
> >> > Thanks for your time. Let me know if you need more details.
> >> >
> >> > ~Anil
> >> >
> >> > On Fri, Mar 30, 2012 at 6:05 PM, Kevin O'dell <[email protected]> wrote:
> >> >
> >> >> Anil,
> >> >>
> >> >> Can you please attach the RS logs from the failure?
> >> >>
> >> >> On Fri, Mar 30, 2012 at 7:05 PM, anil gupta <[email protected]> wrote:
> >> >> > Hi All,
> >> >> >
> >> >> > I am using cdh3u2 and I have 7 worker nodes (VMs spread across two
> >> >> > machines) which are running a Datanode, Tasktracker, and Region
> >> >> > Server (1200 MB heap size). I was loading data into HBase using the
> >> >> > Bulk Loader with a custom mapper. I was loading around 34 million
> >> >> > records, and I have loaded the same set of data in the same
> >> >> > environment many times before without any problem.
> >> >> > This time, while loading the data, one of the region servers failed
> >> >> > (but the DN and TT kept running on that node), and then after
> >> >> > numerous failures of map tasks the loading job failed. Is there any
> >> >> > setting/configuration which can make bulk loading fault-tolerant to
> >> >> > the failure of region servers?
> >> >> >
> >> >> > --
> >> >> > Thanks & Regards,
> >> >> > Anil Gupta
> >> >>
> >> >> --
> >> >> Kevin O'Dell
> >> >> Customer Operations Engineer, Cloudera
> >> >
> >> > --
> >> > Thanks & Regards,
> >> > Anil Gupta
> >>
> >> --
> >> Kevin O'Dell
> >> Customer Operations Engineer, Cloudera
> >
> > --
> > Thanks & Regards,
> > Anil Gupta
>
> --
> Thanks & Regards,
> Anil Gupta

--
Kevin O'Dell
Customer Operations Engineer, Cloudera
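[Editor's note: Anil's question above — whether any setting can make the load job survive region-server failures — is never answered in the thread. On the CDH3-era MapReduce in use here, the relevant knobs are the per-task retry count and the tolerated map-failure percentage. A sketch of the job configuration follows; the values shown are illustrative assumptions.]

```xml
<!-- Job configuration (sketch): let map tasks retry long enough for the
     regions of a dead RS to be reassigned before the job is failed. -->
<property>
  <name>mapred.map.max.attempts</name>
  <value>8</value>  <!-- default is 4; more attempts can outlast RS recovery -->
</property>
<property>
  <name>mapred.max.map.failures.percent</name>
  <value>0</value>  <!-- keep at 0 for a load job: any nonzero value lets the
                         job "succeed" while silently dropping input records -->
</property>
```

Raising the attempt count only helps if region reassignment completes within the retry window; tolerating failed maps via the percentage knob is usually the wrong trade-off for data loading, since the failed tasks' records are simply lost.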
