Hi Mike,

Here is the link to my email on the Hadoop list regarding the YARN problem:
http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201208.mbox/%3ccaf1+vs8of4vshbg14b7sgzbb_8ty7gc9lw3nm1bm0v+24ck...@mail.gmail.com%3E
Somehow the link to the cloudera mail in my last email does not seem to
work. Here is the new link:
https://groups.google.com/a/cloudera.org/forum/?fromgroups#!searchin/cdh-user/yarn$20anil/cdh-user/J564g9A8tPE/ZpslzOkIGZYJ%5B1-25%5D

Thanks for your help,
Anil Gupta

On Mon, Aug 13, 2012 at 1:14 PM, anil gupta <[email protected]> wrote:
> Hi Mike,
>
> I tried doing that by setting the properties in mapred-site.xml, but
> YARN does not seem to honor the
> "mapreduce.tasktracker.map.tasks.maximum" property. Here is a reference
> to a discussion of the same problem:
> https://groups.google.com/a/cloudera.org/forum/?fromgroups#!searchin/cdh-user/yarn$20anil/cdh-user/J564g9A8tPE/ZpslzOkIGZYJ[1-25]
> I have also posted about the same problem on the Hadoop mailing list.
>
> I already admitted in my previous email that YARN has major issues when
> you try to control it in a low-memory environment. I was just trying to
> get the views of HBase experts on bulk load failures, since we will be
> relying heavily on fault tolerance.
> If the HBase bulk loader is fault tolerant to the failure of an RS in a
> viable environment, then I don't have any issue. I hope this clears up
> my purpose in posting on this topic.
>
> Thanks,
> Anil
>
> On Mon, Aug 13, 2012 at 12:39 PM, Michael Segel
> <[email protected]> wrote:
>
>> Anil,
>>
>> Do you know what happens when an airplane with too heavy a cargo tries
>> to take off?
>> You run out of runway and you crash and burn.
>>
>> Looking at your post, why are you starting 8 map processes on each
>> slave? That's tunable, and you clearly do not have enough memory in
>> each VM to support 8 slots on a node.
>> Here you swap, and when you swap you cause HBase to crash and burn.
>>
>> 3.2 GB of memory means no more than 1 slot per slave, and even then...
>> you're going to be very tight. Not to mention that you will need to
>> loosen up on your timings, since it's all virtual and you have way too
>> much I/O per drive going on.
>>
>> My suggestion is that you go back and tune your system before thinking
>> about running anything.
>>
>> HTH
>>
>> -Mike
>>
>> On Aug 13, 2012, at 2:11 PM, anil gupta <[email protected]> wrote:
>>
>> > Hi Guys,
>> >
>> > Sorry for not mentioning the version I am currently running. My
>> > current version is HBase 0.92.1 (CDH4), and I am running Hadoop
>> > 2.0.0-alpha with YARN for MR. My original post was for HBase 0.92.
>> > Here are some more details of my current setup:
>> > I am running an 8-slave, 4-admin-node cluster of CentOS 6.0 VMs
>> > installed on VMware Hypervisor 5.0. Each of my VMs has 3.2 GB of
>> > memory and 500 GB of HDFS space.
>> > I use this cluster for POCs (proofs of concept). I am not looking
>> > for any performance benchmarking from this setup. Due to some major
>> > bugs in YARN, I am unable to make it work properly with less than
>> > 4 GB of memory. I am already discussing those bugs on the Hadoop
>> > mailing list.
>> >
>> > Here is the log of a failed mapper: http://pastebin.com/f83xE2wv
>> >
>> > The problem is that when I start a bulk loading job in YARN, 8 map
>> > processes start on each slave, and all of my slaves get hammered
>> > badly. Since the slaves are getting hammered, the RegionServers get
>> > lease expirations or YouAreDeadException.
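>> >
>> > In MR2 the old per-TaskTracker slot counts are gone; concurrency is
>> > governed by memory, so the only way I know of to cap it is to size
>> > the containers against the NodeManager's budget. A minimal sketch of
>> > what I mean (property names are from the Hadoop 2.0 docs; the values
>> > are illustrative guesses for a 3.2 GB VM, not tested settings):
>> >
>> >   <!-- yarn-site.xml: total memory the NodeManager may hand out -->
>> >   <property>
>> >     <name>yarn.nodemanager.resource.memory-mb</name>
>> >     <value>2048</value>
>> >   </property>
>> >
>> >   <!-- mapred-site.xml: memory requested per map/reduce container -->
>> >   <property>
>> >     <name>mapreduce.map.memory.mb</name>
>> >     <value>1024</value>
>> >   </property>
>> >   <property>
>> >     <name>mapreduce.reduce.memory.mb</name>
>> >     <value>1024</value>
>> >   </property>
>> >
>> > With a 2048 MB budget and 1024 MB containers, the scheduler should
>> > fit at most two map tasks per slave instead of eight (and note the
>> > MR ApplicationMaster takes a container out of the same budget on
>> > whichever node it lands).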
>> > Here is the log of the RS that caused the job to fail:
>> > http://pastebin.com/9ZQx0DtD
>> >
>> > I am aware that this is happening due to underperforming hardware
>> > (two slaves share one 7200 rpm hard drive in my setup) and some
>> > major bugs around running YARN with less than 4 GB of memory. My
>> > only concern is the failure of the entire MR job and its fault
>> > tolerance to RS failures. I am not really concerned about the RS
>> > failure itself, since HBase is fault tolerant.
>> >
>> > Please let me know if you need anything else.
>> >
>> > Thanks,
>> > Anil
>> >
>> > On Mon, Aug 13, 2012 at 6:58 AM, Michael Segel
>> > <[email protected]> wrote:
>> >
>> >> Yes, it can.
>> >> You can see one RS failure causing a cascading RS failure. Of
>> >> course YMMV, and it depends on which version you are running.
>> >>
>> >> The OP is on CDH3u2, which still had some issues. CDH3u4 is the
>> >> latest, and he should upgrade.
>> >>
>> >> (Or go to CDH4...)
>> >>
>> >> HTH
>> >>
>> >> -Mike
>> >>
>> >> On Aug 13, 2012, at 8:51 AM, Kevin O'dell
>> >> <[email protected]> wrote:
>> >>
>> >>> Anil,
>> >>>
>> >>> Do you have a root cause for the RS failure? I have never heard
>> >>> of one RS failure causing a whole job to fail.
>> >>>
>> >>> On Tue, Aug 7, 2012 at 1:59 PM, anil gupta
>> >>> <[email protected]> wrote:
>> >>>
>> >>>> Hi HBase Folks,
>> >>>>
>> >>>> I ran the bulk loader last night to load data into a table.
>> >>>> During the bulk loading job, one of the region servers crashed
>> >>>> and the entire job failed. It takes around 2.5 hours for this
>> >>>> job to finish, and the job failed when it was around 50%
>> >>>> complete. After the failure, that table was also corrupted in
>> >>>> HBase. My cluster has 8 region servers.
>> >>>>
>> >>>> Is bulk loading not fault tolerant to the failure of region
>> >>>> servers?
>> >>>>
>> >>>> I am reviving this old email chain because at that time my
>> >>>> question went unanswered. Please share your views.
>> >>>>
>> >>>> Thanks,
>> >>>> Anil Gupta
>> >>>>
>> >>>> On Tue, Apr 3, 2012 at 9:12 AM, anil gupta
>> >>>> <[email protected]> wrote:
>> >>>>
>> >>>>> Hi Kevin,
>> >>>>>
>> >>>>> I am not really concerned about the RegionServer going down, as
>> >>>>> the same thing can happen in production. Although in production
>> >>>>> we won't have a VM environment, and I am aware that my current
>> >>>>> dev environment is not good for heavy processing, what I am
>> >>>>> concerned about is the failure of the bulk loading job when the
>> >>>>> Region Server failed. Does this mean that the bulk loading job
>> >>>>> is not fault tolerant to the failure of a Region Server? I was
>> >>>>> expecting the job to succeed even though the RegionServer
>> >>>>> failed, because there were 6 more RSs running in the cluster.
>> >>>>> Fault tolerance is one of the biggest selling points of the
>> >>>>> Hadoop platform. Let me know your views.
>> >>>>> Thanks for your time.
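>> >>>>>
>> >>>>> The only MR-side knob I have found so far is the per-task retry
>> >>>>> limit: a job is declared failed once any single task has failed
>> >>>>> mapred.map.max.attempts times (default 4; mapreduce.map.maxattempts
>> >>>>> in MR2). Raising it in mapred-site.xml should, in theory, give the
>> >>>>> job a window to ride out task attempts dying against a dead RS
>> >>>>> while its regions are reassigned. A sketch, not something I have
>> >>>>> verified on this cluster:
>> >>>>>
>> >>>>>   <property>
>> >>>>>     <name>mapred.map.max.attempts</name>
>> >>>>>     <value>8</value>
>> >>>>>   </property>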
>> >>>>>
>> >>>>> Thanks,
>> >>>>> Anil Gupta
>> >>>>>
>> >>>>> On Tue, Apr 3, 2012 at 7:34 AM, Kevin O'dell
>> >>>>> <[email protected]> wrote:
>> >>>>>
>> >>>>>> Anil,
>> >>>>>>
>> >>>>>> I am sorry for the delayed response. Reviewing the logs, it
>> >>>>>> appears:
>> >>>>>>
>> >>>>>> 12/03/30 15:38:31 INFO zookeeper.ClientCnxn: Client session
>> >>>>>> timed out, have not heard from server in 59311ms for sessionid
>> >>>>>> 0x136557f99c90065, closing socket connection and attempting
>> >>>>>> reconnect
>> >>>>>>
>> >>>>>> 12/03/30 15:38:32 FATAL regionserver.HRegionServer: ABORTING
>> >>>>>> region server serverName=ihub-dn-b1,60020,1332955859363,
>> >>>>>> load=(requests=0, regions=44, usedHeap=446, maxHeap=1197):
>> >>>>>> Unhandled exception:
>> >>>>>> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
>> >>>>>> rejected; currently processing ihub-dn-b1,60020,1332955859363
>> >>>>>> as dead server
>> >>>>>>
>> >>>>>> This looks like a classic overworked RS. You were asking too
>> >>>>>> much of the RS and it did not respond in time, so the Master
>> >>>>>> marked it as dead; when the RS finally responded, the Master
>> >>>>>> said "no, you are already dead" and aborted the server. This is
>> >>>>>> why you see the YouAreDeadException. It is probably due to the
>> >>>>>> shared resources of the VM infrastructure you are running on.
>> >>>>>> You will either need to devote more resources or add more
>> >>>>>> nodes (most likely physical) to the cluster if you would like
>> >>>>>> to keep running these jobs.
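>> >>>>>>
>> >>>>>> In the meantime, the usual stopgap on overloaded hardware is to
>> >>>>>> lengthen the ZooKeeper session timeout in hbase-site.xml so that
>> >>>>>> a slow RS is not declared dead quite so quickly. A minimal
>> >>>>>> sketch; 120000 ms is an illustrative value, not a
>> >>>>>> recommendation, and a longer timeout also means real failures
>> >>>>>> are detected more slowly (the ZK server's maxSessionTimeout must
>> >>>>>> also be large enough, or it will negotiate the session down):
>> >>>>>>
>> >>>>>>   <property>
>> >>>>>>     <name>zookeeper.session.timeout</name>
>> >>>>>>     <value>120000</value>
>> >>>>>>   </property>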
>> >>>>>>
>> >>>>>> On Fri, Mar 30, 2012 at 9:24 PM, anil gupta
>> >>>>>> <[email protected]> wrote:
>> >>>>>>> Hi Kevin,
>> >>>>>>>
>> >>>>>>> Here is a Dropbox link to the log file of the region server
>> >>>>>>> that failed:
>> >>>>>>> http://dl.dropbox.com/u/64149128/hbase-hbase-regionserver-ihub-dn-b1.out
>> >>>>>>> IMHO, the problem starts at line #3009, which says: 12/03/30
>> >>>>>>> 15:38:32 FATAL regionserver.HRegionServer: ABORTING region
>> >>>>>>> server serverName=ihub-dn-b1,60020,1332955859363,
>> >>>>>>> load=(requests=0, regions=44, usedHeap=446, maxHeap=1197):
>> >>>>>>> Unhandled exception:
>> >>>>>>> org.apache.hadoop.hbase.YouAreDeadException: Server REPORT
>> >>>>>>> rejected; currently processing ihub-dn-b1,60020,1332955859363
>> >>>>>>> as dead server
>> >>>>>>>
>> >>>>>>> I have already tested the fault tolerance of HBase by manually
>> >>>>>>> bringing down an RS while querying a table, and it worked
>> >>>>>>> fine, so I was expecting the same today (even though the RS
>> >>>>>>> went down by itself this time) when I was loading the data.
>> >>>>>>> But it didn't work out well.
>> >>>>>>> Thanks for your time. Let me know if you need more details.
>> >>>>>>>
>> >>>>>>> ~Anil
>> >>>>>>>
>> >>>>>>> On Fri, Mar 30, 2012 at 6:05 PM, Kevin O'dell
>> >>>>>>> <[email protected]> wrote:
>> >>>>>>>
>> >>>>>>>> Anil,
>> >>>>>>>>
>> >>>>>>>> Can you please attach the RS logs from the failure?
>> >>>>>>>>
>> >>>>>>>> On Fri, Mar 30, 2012 at 7:05 PM, anil gupta
>> >>>>>>>> <[email protected]> wrote:
>> >>>>>>>>> Hi All,
>> >>>>>>>>>
>> >>>>>>>>> I am using cdh3u2, and I have 7 worker nodes (VMs spread
>> >>>>>>>>> across two machines), each running a Datanode, Tasktracker,
>> >>>>>>>>> and Region Server (1200 MB heap size). I was loading data
>> >>>>>>>>> into HBase using the bulk loader with a custom mapper. I was
>> >>>>>>>>> loading around 34 million records, and I have loaded the
>> >>>>>>>>> same set of data in the same environment many times before
>> >>>>>>>>> without any problem. This time, while loading the data, one
>> >>>>>>>>> of the region servers failed (though the DN and TT kept
>> >>>>>>>>> running on that node), and after numerous failures of map
>> >>>>>>>>> tasks the loading job failed. Is there any
>> >>>>>>>>> setting/configuration which can make bulk loading
>> >>>>>>>>> fault-tolerant to the failure of region servers?
>> >>>>>>>>>
>> >>>>>>>>> --
>> >>>>>>>>> Thanks & Regards,
>> >>>>>>>>> Anil Gupta
>> >>>>>>>>
>> >>>>>>>> --
>> >>>>>>>> Kevin O'Dell
>> >>>>>>>> Customer Operations Engineer, Cloudera
>> >>>>>>>
>> >>>>>>> --
>> >>>>>>> Thanks & Regards,
>> >>>>>>> Anil Gupta
>> >>>>>>
>> >>>>>> --
>> >>>>>> Kevin O'Dell
>> >>>>>> Customer Operations Engineer, Cloudera
>> >>>>>
>> >>>>> --
>> >>>>> Thanks & Regards,
>> >>>>> Anil Gupta
>> >>>>
>> >>>> --
>> >>>> Thanks & Regards,
>> >>>> Anil Gupta
>> >>>
>> >>> --
>> >>> Kevin O'Dell
>> >>> Customer Operations Engineer, Cloudera
>> >
>> > --
>> > Thanks & Regards,
>> > Anil Gupta
>
> --
> Thanks & Regards,
> Anil Gupta

--
Thanks & Regards,
Anil Gupta
