Hi Mike,

Here is the link to my email on the Hadoop list regarding the YARN problem:
http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201208.mbox/%3ccaf1+vs8of4vshbg14b7sgzbb_8ty7gc9lw3nm1bm0v+24ck...@mail.gmail.com%3E
Somehow the link to the cloudera mail in my last email does not seem to
work. Here is the new link:
https://groups.google.com/a/cloudera.org/forum/?fromgroups#!searchin/cdh-user/yarn$20anil/cdh-user/J564g9A8tPE/ZpslzOkIGZYJ%5B1-25%5D

Thanks for your help,
Anil Gupta

On Mon, Aug 13, 2012 at 1:14 PM, anil gupta <[email protected]> wrote:
> Hi Mike,
>
> I tried doing that by setting the properties in mapred-site.xml, but
> YARN does not seem to honor the
> "mapreduce.tasktracker.map.tasks.maximum" property. Here is a reference
> to a discussion of the same problem:
> https://groups.google.com/a/cloudera.org/forum/?fromgroups#!searchin/cdh-user/yarn$20anil/cdh-user/J564g9A8tPE/ZpslzOkIGZYJ[1-25]
> I have also posted about the same problem on the Hadoop mailing list.
>
> I already admitted in my previous email that YARN has major issues when
> you try to control it in a low-memory environment. I was just trying to
> get the views of HBase experts on bulk load failures, since we will be
> relying heavily on fault tolerance.
> If the HBase bulk loader is fault tolerant to the failure of an RS in a
> viable environment, then I don't have any issue. I hope this clears up
> my purpose in posting on this topic.
>
> Thanks,
> Anil
>
> On Mon, Aug 13, 2012 at 12:39 PM, Michael Segel
> <[email protected]> wrote:
>
>> Anil,
>>
>> Do you know what happens when an airplane with too heavy a cargo tries
>> to take off?
>> You run out of runway and you crash and burn.
>>
>> Looking at your post, why are you starting 8 map processes on each
>> slave? That's tunable, and you clearly do not have enough memory in
>> each VM to support 8 slots on a node.
>> Here you swap, and when you swap you cause HBase to crash and burn.
>>
>> 3.2 GB of memory means no more than 1 slot per slave, and even then...
>> you're going to be very tight. Not to mention that you will need to
>> loosen up on your timings, since it's all virtual and you have way too
>> much I/O per drive going on.
>>
>> My suggestion is that you go back and tune your system before thinking
>> about running anything.
>>
>> HTH
>>
>> -Mike
>>
>> On Aug 13, 2012, at 2:11 PM, anil gupta <[email protected]> wrote:
>>
>> > Hi Guys,
>> >
>> > Sorry for not mentioning the version I am currently running. My
>> > current version is HBase 0.92.1 (CDH4), and I am running Hadoop
>> > 2.0.0-alpha with YARN for MR. My original post was for HBase 0.92.
>> > Here are some more details of my current setup:
>> > I am running an 8-slave, 4-admin-node cluster of CentOS 6.0 VMs
>> > installed on VMware Hypervisor 5.0. Each of my VMs has 3.2 GB of
>> > memory and 500 GB of HDFS space.
>> > I use this cluster for POCs (proofs of concept). I am not looking
>> > for any performance benchmarking from this setup. Due to some major
>> > bugs in YARN, I am unable to make it work properly with less than
>> > 4 GB of memory. I am already discussing those bugs on the Hadoop
>> > mailing list.
>> >
>> > Here is the log of a failed mapper: http://pastebin.com/f83xE2wv
>> >
>> > The problem is that when I start a bulk loading job in YARN, 8 map
>> > processes start on each slave, and all of my slaves get hammered
>> > badly. Since the slaves are getting hammered, the RegionServers get
>> > lease expirations or YouAreDeadException.
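>> >
>> > In MR2 the old per-TaskTracker slot counts are gone; concurrency is
>> > governed by memory, so the only way I know of to cap it is to size
>> > the containers against the NodeManager's budget. A minimal sketch of
>> > what I mean (property names are from the Hadoop 2.0 docs; the values
>> > are illustrative guesses for a 3.2 GB VM, not tested settings):
>> >
>> >   <!-- yarn-site.xml: total memory the NodeManager may hand out -->
>> >   <property>
>> >     <name>yarn.nodemanager.resource.memory-mb</name>
>> >     <value>2048</value>
>> >   </property>
>> >
>> >   <!-- mapred-site.xml: memory requested per map/reduce container -->
>> >   <property>
>> >     <name>mapreduce.map.memory.mb</name>
>> >     <value>1024</value>
>> >   </property>
>> >   <property>
>> >     <name>mapreduce.reduce.memory.mb</name>
>> >     <value>1024</value>
>> >   </property>
>> >
>> > With a 2048 MB budget and 1024 MB containers, the scheduler should
>> > fit at most two map tasks per slave instead of eight (and note the
>> > MR ApplicationMaster takes a container out of the same budget on
>> > whichever node it lands).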
>> > Here is the log of the RS that caused the job to fail:
>> > http://pastebin.com/9ZQx0DtD
>> >
>> > I am aware that this is happening due to underperforming hardware
>> > (two slaves share one 7200 rpm hard drive in my setup) and some
>> > major bugs around running YARN with less than 4 GB of memory. My
>> > only concern is the failure of the entire MR job and its fault
>> > tolerance to RS failures. I am not really concerned about the RS
>> > failure itself, since HBase is fault tolerant.
>> >
>> > Please let me know if you need anything else.
>> >
>> > Thanks,
>> > Anil
>> >
>> > On Mon, Aug 13, 2012 at 6:58 AM, Michael Segel
>> > <[email protected]> wrote:
>> >
>> >> Yes, it can.
>> >> You can see one RS failure causing a cascading RS failure. Of
>> >> course YMMV, and it depends on which version you are running.
>> >>
>> >> The OP is on CDH3u2, which still had some issues. CDH3u4 is the
>> >> latest, and he should upgrade.
>> >>
>> >> (Or go to CDH4...)
>> >>
>> >> HTH
>> >>
>> >> -Mike
>> >>
>> >> On Aug 13, 2012, at 8:51 AM, Kevin O'dell
>> >> <[email protected]> wrote:
>> >>
>> >>> Anil,
>> >>>
>> >>> Do you have a root cause for the RS failure? I have never heard
>> >>> of one RS failure causing a whole job to fail.
>> >>>
>> >>> On Tue, Aug 7, 2012 at 1:59 PM, anil gupta
>> >>> <[email protected]> wrote:
>> >>>
>> >>>> Hi HBase Folks,
>> >>>>
>> >>>> I ran the bulk loader last night to load data into a table.
>> >>>> During the bulk loading job, one of the region servers crashed
>> >>>> and the entire job failed. It takes around 2.5 hours for this
>> >>>> job to finish, and the job failed when it was around 50%
>> >>>> complete. After the failure, that table was also corrupted in
>> >>>> HBase. My cluster has 8 region servers.
>> >>>>
>> >>>> Is bulk loading not fault tolerant to the failure of region
>> >>>> servers?
>> >>>>
>> >>>> I am reviving this old email chain because at that time my
>> >>>> question went unanswered. Please share your views.
>> >>>>
>> >>>> Thanks,
>> >>>> Anil Gupta
>> >>>>
>> >>>> On Tue, Apr 3, 2012 at 9:12 AM, anil gupta
>> >>>> <[email protected]> wrote:
>> >>>>
>> >>>>> Hi Kevin,
>> >>>>>
>> >>>>> I am not really concerned about the RegionServer going down, as
>> >>>>> the same thing can happen in production. Although in production
>> >>>>> we won't have a VM environment, and I am aware that my current
>> >>>>> dev environment is not good for heavy processing, what I am
>> >>>>> concerned about is the failure of the bulk loading job when the
>> >>>>> Region Server failed. Does this mean that the bulk loading job
>> >>>>> is not fault tolerant to the failure of a Region Server? I was
>> >>>>> expecting the job to succeed even though the RegionServer
>> >>>>> failed, because there were 6 more RSs running in the cluster.
>> >>>>> Fault tolerance is one of the biggest selling points of the
>> >>>>> Hadoop platform. Let me know your views.
>> >>>>> Thanks for your time.
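>> >>>>>
>> >>>>> The only MR-side knob I have found so far is the per-task retry
>> >>>>> limit: a job is declared failed once any single task has failed
>> >>>>> mapred.map.max.attempts times (default 4; mapreduce.map.maxattempts
>> >>>>> in MR2). Raising it in mapred-site.xml should, in theory, give the
>> >>>>> job a window to ride out task attempts dying against a dead RS
>> >>>>> while its regions are reassigned. A sketch, not something I have
>> >>>>> verified on this cluster:
>> >>>>>
>> >>>>>   <property>
>> >>>>>     <name>mapred.map.max.attempts</name>
>> >>>>>     <value>8</value>
>> >>>>>   </property>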
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Anil Gupta
>> >>>>>
>> >>>>> On Tue, Apr 3, 2012 at 7:34 AM, Kevin O'dell
>> >>>>> <[email protected]> wrote:
>> >>>>>
>> >>>>>> Anil,
>> >>>>>>
>> >>>>>> I am sorry for the delayed response. Reviewing the logs, it
>> >>>>>> appears:
>> >>>>>>
>> >>>>>> 12/03/30 15:38:31 INFO zookeeper.ClientCnxn: Client session
>> >>>>>> timed out, have not heard from server in 59311ms for sessionid
>> >>>>>> 0x136557f99c90065, closing socket connection and attempting
>> >>>>>> reconnect
>> >>>>>>
>> >>>>>> 12/03/30 15:38:32 FATAL regionserver.HRegionServer: ABORTING
>> >>>>>> region server serverName=ihub-dn-b1,60020,1332955859363,
>> >>>>>> load=(requests=0, regions=44, usedHeap=446, maxHeap=1197):
>> >>>>>> Unhandled exception:
>> >>>>>> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
>> >>>>>> rejected; currently processing ihub-dn-b1,60020,1332955859363
>> >>>>>> as dead server
>> >>>>>>
>> >>>>>> This looks like a classic overworked RS. You were asking too
>> >>>>>> much of the RS and it did not respond in time, so the Master
>> >>>>>> marked it as dead; when the RS finally responded, the Master
>> >>>>>> said "no, you are already dead" and aborted the server. This is
>> >>>>>> why you see the YouAreDeadException. It is probably due to the
>> >>>>>> shared resources of the VM infrastructure you are running on.
>> >>>>>> You will either need to devote more resources or add more
>> >>>>>> nodes (most likely physical) to the cluster if you would like
>> >>>>>> to keep running these jobs.
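>> >>>>>>
>> >>>>>> In the meantime, the usual stopgap on overloaded hardware is to
>> >>>>>> lengthen the ZooKeeper session timeout in hbase-site.xml so that
>> >>>>>> a slow RS is not declared dead quite so quickly. A minimal
>> >>>>>> sketch; 120000 ms is an illustrative value, not a
>> >>>>>> recommendation, and a longer timeout also means real failures
>> >>>>>> are detected more slowly (the ZK server's maxSessionTimeout must
>> >>>>>> also be large enough, or it will negotiate the session down):
>> >>>>>>
>> >>>>>>   <property>
>> >>>>>>     <name>zookeeper.session.timeout</name>
>> >>>>>>     <value>120000</value>
>> >>>>>>   </property>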
>> >>>>>>
>> >>>>>> On Fri, Mar 30, 2012 at 9:24 PM, anil gupta
>> >>>>>> <[email protected]> wrote:
>> >>>>>>> Hi Kevin,
>> >>>>>>>
>> >>>>>>> Here is a Dropbox link to the log file of the region server
>> >>>>>>> that failed:
>> >>>>>>> http://dl.dropbox.com/u/64149128/hbase-hbase-regionserver-ihub-dn-b1.out
>> >>>>>>> IMHO, the problem starts at line #3009, which says: 12/03/30
>> >>>>>>> 15:38:32 FATAL regionserver.HRegionServer: ABORTING region
>> >>>>>>> server serverName=ihub-dn-b1,60020,1332955859363,
>> >>>>>>> load=(requests=0, regions=44, usedHeap=446, maxHeap=1197):
>> >>>>>>> Unhandled exception:
>> >>>>>>> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
>> >>>>>>> rejected; currently processing ihub-dn-b1,60020,1332955859363
>> >>>>>>> as dead server
>> >>>>>>>
>> >>>>>>> I have already tested the fault tolerance of HBase by manually
>> >>>>>>> bringing down an RS while querying a table, and it worked
>> >>>>>>> fine, so I was expecting the same today (even though the RS
>> >>>>>>> went down by itself this time) when I was loading the data.
>> >>>>>>> But it didn't work out well.
>> >>>>>>> Thanks for your time. Let me know if you need more details.
>> >>>>>>>
>> >>>>>>> ~Anil
>> >>>>>>>
>> >>>>>>> On Fri, Mar 30, 2012 at 6:05 PM, Kevin O'dell
>> >>>>>>> <[email protected]> wrote:
>> >>>>>>>
>> >>>>>>>> Anil,
>> >>>>>>>>
>> >>>>>>>> Can you please attach the RS logs from the failure?
>> >>>>>>>>
>> >>>>>>>> On Fri, Mar 30, 2012 at 7:05 PM, anil gupta
>> >>>>>>>> <[email protected]> wrote:
>> >>>>>>>>> Hi All,
>> >>>>>>>>>
>> >>>>>>>>> I am using cdh3u2, and I have 7 worker nodes (VMs spread
>> >>>>>>>>> across two machines), each running a Datanode, Tasktracker,
>> >>>>>>>>> and Region Server (1200 MB heap size). I was loading data
>> >>>>>>>>> into HBase using the bulk loader with a custom mapper. I was
>> >>>>>>>>> loading around 34 million records, and I have loaded the
>> >>>>>>>>> same set of data in the same environment many times before
>> >>>>>>>>> without any problem. This time, while loading the data, one
>> >>>>>>>>> of the region servers failed (though the DN and TT kept
>> >>>>>>>>> running on that node), and after numerous failures of map
>> >>>>>>>>> tasks the loading job failed. Is there any
>> >>>>>>>>> setting/configuration which can make bulk loading
>> >>>>>>>>> fault-tolerant to the failure of region servers?
>> >>>>>>>>>
>> >>>>>>>>> --
>> >>>>>>>>> Thanks & Regards,
>> >>>>>>>>> Anil Gupta
>> >>>>>>>>
>> >>>>>>>> --
>> >>>>>>>> Kevin O'Dell
>> >>>>>>>> Customer Operations Engineer, Cloudera
>> >>>>>>>
>> >>>>>>> --
>> >>>>>>> Thanks & Regards,
>> >>>>>>> Anil Gupta
>> >>>>>>
>> >>>>>> --
>> >>>>>> Kevin O'Dell
>> >>>>>> Customer Operations Engineer, Cloudera
>> >>>>>
>> >>>>> --
>> >>>>> Thanks & Regards,
>> >>>>> Anil Gupta
>> >>>>
>> >>>> --
>> >>>> Thanks & Regards,
>> >>>> Anil Gupta
>> >>>
>> >>> --
>> >>> Kevin O'Dell
>> >>> Customer Operations Engineer, Cloudera
>> >
>> > --
>> > Thanks & Regards,
>> > Anil Gupta
>
> --
> Thanks & Regards,
> Anil Gupta

--
Thanks & Regards,
Anil Gupta
