Hi Mike,

I am constrained by the hardware available for the POC cluster. We are waiting for the hardware we will use for performance benchmarking.
Best Regards,
Anil

On Aug 13, 2012, at 6:59 PM, Michael Segel <[email protected]> wrote:

> Anil,
>
> I don't know if you can call it a bug if you don't have enough memory available.
>
> I mean if you don't use HBase, then you may have more leeway in terms of swap.
>
> You can also do more tuning of HBase to handle the additional latency found in a virtual environment.
>
> Why don't you rebuild your VMs to be slightly larger in terms of memory?
>
> On Aug 13, 2012, at 8:05 PM, anil gupta <[email protected]> wrote:
>
>> Hi Mike,
>>
>> You hit the nail on the head that I need to lower the memory by setting yarn.nodemanager.resource.memory-mb. Here is another major YARN bug you are touching on: I already tried setting that property to 1500 MB in yarn-site.xml and setting yarn.app.mapreduce.am.resource.mb to 1000 MB in mapred-site.xml. If I make this change, then the YARN job does not run at all, even though the configuration is right. It's a bug and I have to file a JIRA for it. So I was left with only the option of letting it run with an incorrect YARN conf, since my objective is to load data into HBase rather than to play with YARN. MapReduce is only used for bulk loading in my cluster.
>>
>> Here is a link to the mailing list email regarding running YARN with less memory:
>> http://permalink.gmane.org/gmane.comp.jakarta.lucene.hadoop.user/33164
>>
>> It would be great if you can answer this simple question of mine: Is HBase bulk loading fault tolerant to region server failures in a viable/decent environment?
>>
>> Thanks,
>> Anil Gupta
>>
>> On Mon, Aug 13, 2012 at 5:17 PM, Michael Segel <[email protected]> wrote:
>>
>>> Not sure why you're having an issue in getting an answer. Even if you're not a YARN expert, Google is your friend.
>>>
>>> See:
>>> http://books.google.com/books?id=Wu_xeGdU4G8C&pg=PA323&lpg=PA323&dq=Hadoop+YARN+setting+number+of+slots&source=bl&ots=i7xQYwQf-u&sig=ceuDmiOkbqTqok_HfIr3udvm6C0&hl=en&sa=X&ei=8JYpUNeZJMnxygGzqIGwCw&ved=0CEQQ6AEwAQ#v=onepage&q=Hadoop%20YARN%20setting%20number%20of%20slots&f=false
>>>
>>> This is a web page from Tom White's 3rd edition.
>>>
>>> The bottom line...
>>> -=-
>>> The considerations for how much memory to dedicate to a node manager for running containers are similar to those discussed in "Memory" on page 307. Each Hadoop daemon uses 1,000 MB, so for a datanode and a node manager the total is 2,000 MB. Set aside enough for other processes that are running on the machine, and the remainder can be dedicated to the node manager's containers by setting the configuration property yarn.nodemanager.resource.memory-mb to the total allocation in MB. (The default is 8,192 MB.)
>>> -=-
>>> Taken per fair use. Page 323.
>>>
>>> As you can see, you need to drop this down to something like 1 GB, if you even have enough memory for that. Again, set yarn.nodemanager.resource.memory-mb to a more realistic value.
>>>
>>> 8 GB on a 3 GB node? Yeah, that would really hose you, especially if you're trying to run HBase too.
>>>
>>> Even here... you really don't have enough memory to do it all. (Maybe enough to do a small test.)
>>>
>>> Good luck.
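
(For reference, a minimal sketch of the two settings under discussion, assuming CDH4 / Hadoop 2.0.0-alpha; the 1500 MB and 1000 MB values are the ones Anil quotes and are illustrative only. Whether a job can actually start at these sizes also depends on the scheduler's minimum container allocation.)

    <!-- yarn-site.xml: total MB the NodeManager may hand out to containers (default 8192) -->
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>1500</value>
    </property>

    <!-- mapred-site.xml: memory requested for the MapReduce ApplicationMaster container -->
    <property>
      <name>yarn.app.mapreduce.am.resource.mb</name>
      <value>1000</value>
    </property>
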
>>>
>>> On Aug 13, 2012, at 3:24 PM, anil gupta <[email protected]> wrote:
>>>
>>>> Hi Mike,
>>>>
>>>> Here is the link to my email on the Hadoop list regarding the YARN problem:
>>>> http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201208.mbox/%3ccaf1+vs8of4vshbg14b7sgzbb_8ty7gc9lw3nm1bm0v+24ck...@mail.gmail.com%3E
>>>>
>>>> Somehow the link to the Cloudera mail in my last email does not seem to work. Here is the new link:
>>>> https://groups.google.com/a/cloudera.org/forum/?fromgroups#!searchin/cdh-user/yarn$20anil/cdh-user/J564g9A8tPE/ZpslzOkIGZYJ%5B1-25%5D
>>>>
>>>> Thanks for your help,
>>>> Anil Gupta
>>>>
>>>> On Mon, Aug 13, 2012 at 1:14 PM, anil gupta <[email protected]> wrote:
>>>>
>>>>> Hi Mike,
>>>>>
>>>>> I tried doing that by setting properties in mapred-site.xml, but YARN does not seem to honor the "mapreduce.tasktracker.map.tasks.maximum" property. Here is a reference to a discussion of the same problem:
>>>>> https://groups.google.com/a/cloudera.org/forum/?fromgroups#!searchin/cdh-user/yarn$20anil/cdh-user/J564g9A8tPE/ZpslzOkIGZYJ[1-25]
>>>>> I have also posted about the same problem on the Hadoop mailing list.
>>>>>
>>>>> I already admitted in my previous email that YARN has major issues when you try to control it in a low-memory environment. I was just trying to get the views of HBase experts on bulk load failures, since we will be relying heavily on fault tolerance. If the HBase bulk loader is fault tolerant to the failure of an RS in a viable environment, then I don't have any issue. I hope this clears up my purpose of posting on this topic.
>>>>>
>>>>> Thanks,
>>>>> Anil
>>>>>
>>>>> On Mon, Aug 13, 2012 at 12:39 PM, Michael Segel <[email protected]> wrote:
>>>>>
>>>>>> Anil,
>>>>>>
>>>>>> Do you know what happens when an airplane carries too heavy a cargo when it tries to take off? You run out of runway and you crash and burn.
>>>>>>
>>>>>> Looking at your post, why are you starting 8 map processes on each slave? That's tunable, and you clearly do not have enough memory in each VM to support 8 slots on a node. Here you swap, and when you swap you cause HBase to crash and burn.
>>>>>>
>>>>>> 3.2 GB of memory means no more than 1 slot per slave, and even then you're going to be very tight. Not to mention that you will need to loosen up on your timings, since it's all virtual and you have way too much I/O per drive going on.
>>>>>>
>>>>>> My suggestion is that you go back and tune your system before thinking about running anything.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> -Mike
>>>>>>
>>>>>> On Aug 13, 2012, at 2:11 PM, anil gupta <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Guys,
>>>>>>>
>>>>>>> Sorry for not mentioning the version I am currently running. My current version is HBase 0.92.1 (CDH4), running Hadoop 2.0.0-alpha with YARN for MR. My original post was for HBase 0.92. Here are some more details of my current setup: I am running an 8-slave, 4-admin-node cluster on CentOS 6.0 VMs installed on VMware Hypervisor 5.0. Each of my VMs has 3.2 GB of memory and 500 HDFS space. I use this cluster for POC (proof of concept) work; I am not looking for any performance benchmarking from this setup. Due to some major bugs in YARN, I am unable to make it work properly with less than 4 GB of memory.
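
(A quick sketch of why the "slots" property has no effect: MRv2/YARN has no tasktracker slots at all, and the number of concurrent map tasks on a node is roughly yarn.nodemanager.resource.memory-mb divided by the per-task container size. The properties below are the standard MRv2 ones; the values are illustrative, not a tested recommendation. With the NodeManager capped at 1500 MB as above, only one 1024 MB map container fits per node.)

    <!-- mapred-site.xml: per-task container sizes replace the old slot counts -->
    <property>
      <name>mapreduce.map.memory.mb</name>
      <value>1024</value>
    </property>
    <property>
      <name>mapreduce.reduce.memory.mb</name>
      <value>1024</value>
    </property>
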
>>>>>>> I am already discussing these issues on the Hadoop mailing list.
>>>>>>>
>>>>>>> Here is the log of the failed mapper: http://pastebin.com/f83xE2wv
>>>>>>>
>>>>>>> The problem is that when I start a bulk loading job in YARN, 8 map processes start on each slave, and all of my slaves get hammered badly as a result. Because the slaves are hammered so badly, the RegionServer gets its lease expired or a YouAreDeadException. Here is the log of the RS that caused the job to fail: http://pastebin.com/9ZQx0DtD
>>>>>>>
>>>>>>> I am aware that this is happening due to underperforming hardware (two slaves share one 7200 rpm hard drive in my setup) and some major issues with running YARN in less than 4 GB of memory. My only concern is the failure of the entire MR job and its fault tolerance to RS failures. I am not really concerned about the RS failure itself, since HBase is fault tolerant.
>>>>>>>
>>>>>>> Please let me know if you need anything else.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Anil
>>>>>>>
>>>>>>> On Mon, Aug 13, 2012 at 6:58 AM, Michael Segel <[email protected]> wrote:
>>>>>>>
>>>>>>>> Yes, it can. You can see an RS failure causing a cascading RS failure. Of course YMMV, and it depends on which version you are running.
>>>>>>>>
>>>>>>>> The OP is on CDH3u2, which still had some issues. CDH3u4 is the latest and he should upgrade.
>>>>>>>>
>>>>>>>> (Or go to CDH4...)
>>>>>>>>
>>>>>>>> HTH
>>>>>>>>
>>>>>>>> -Mike
>>>>>>>>
>>>>>>>> On Aug 13, 2012, at 8:51 AM, Kevin O'dell <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Anil,
>>>>>>>>>
>>>>>>>>> Do you have a root cause for the RS failure? I have never heard of one RS failure causing a whole job to fail.
>>>>>>>>>
>>>>>>>>> On Tue, Aug 7, 2012 at 1:59 PM, anil gupta <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi HBase Folks,
>>>>>>>>>>
>>>>>>>>>> I ran the bulk loader last night to load data into a table. During the bulk loading job, one of the region servers crashed and the entire job failed. The job takes around 2.5 hours to finish, and it failed when it was around 50% complete. After the failure, the table was also corrupted in HBase. My cluster has 8 region servers.
>>>>>>>>>>
>>>>>>>>>> Is bulk loading not fault tolerant to the failure of region servers?
>>>>>>>>>>
>>>>>>>>>> I am reusing this old email chain because at that time my question went unanswered. Please share your views.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Anil Gupta
>>>>>>>>>>
>>>>>>>>>> On Tue, Apr 3, 2012 at 9:12 AM, anil gupta <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Kevin,
>>>>>>>>>>>
>>>>>>>>>>> I am not really concerned about the RegionServer going down, as the same thing can happen in production. Although in production we won't be using a VM environment, I am aware that my current dev environment is not good for heavy processing. What I am concerned about is the failure of the bulk loading job when the region server failed. Does this mean that the bulk loading job is not fault tolerant to the failure of a region server?
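
(There is no single property that guarantees a bulk load survives a dead region server; the closest general-purpose knobs are the retry settings sketched below. These are standard Hadoop 2 / HBase property names, but the values shown are illustrative only.)

    <!-- mapred-site.xml: how many times a failed map/reduce task attempt is retried
         before the whole job is declared failed -->
    <property>
      <name>mapreduce.map.maxattempts</name>
      <value>4</value>
    </property>
    <property>
      <name>mapreduce.reduce.maxattempts</name>
      <value>4</value>
    </property>

    <!-- hbase-site.xml: client-side retries, e.g. while regions from a dead RS are being reassigned -->
    <property>
      <name>hbase.client.retries.number</name>
      <value>10</value>
    </property>
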
>>>>>>>>>>> I was expecting the job to be successful even though the RegionServer failed, because there were 6 more RSs running in the cluster. Fault tolerance is one of the biggest selling points of the Hadoop platform. Let me know your views. Thanks for your time.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Anil Gupta
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Apr 3, 2012 at 7:34 AM, Kevin O'dell <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Anil,
>>>>>>>>>>>>
>>>>>>>>>>>> I am sorry for the delayed response. Reviewing the logs, it appears:
>>>>>>>>>>>>
>>>>>>>>>>>> 12/03/30 15:38:31 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 59311ms for sessionid 0x136557f99c90065, closing socket connection and attempting reconnect
>>>>>>>>>>>>
>>>>>>>>>>>> 12/03/30 15:38:32 FATAL regionserver.HRegionServer: ABORTING region server serverName=ihub-dn-b1,60020,1332955859363, load=(requests=0, regions=44, usedHeap=446, maxHeap=1197): Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ihub-dn-b1,60020,1332955859363 as dead server
>>>>>>>>>>>>
>>>>>>>>>>>> It appears to be a classic overworked RS. You were doing too much for the RS and it did not respond in time, so the Master marked it as dead; when the RS finally responded, the Master said no, you are already dead, and aborted the server. This is why you see the YouAreDeadException. This is probably due to the shared resources of the VM infrastructure you are running on. You will either need to devote more resources or add more nodes (most likely physical) to the cluster if you would like to keep running these jobs.
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Mar 30, 2012 at 9:24 PM, anil gupta <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Kevin,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here is a Dropbox link to the log file of the region server that failed:
>>>>>>>>>>>>> http://dl.dropbox.com/u/64149128/hbase-hbase-regionserver-ihub-dn-b1.out
>>>>>>>>>>>>>
>>>>>>>>>>>>> IMHO, the problem starts at line #3009, which says: 12/03/30 15:38:32 FATAL regionserver.HRegionServer: ABORTING region server serverName=ihub-dn-b1,60020,1332955859363, load=(requests=0, regions=44, usedHeap=446, maxHeap=1197): Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ihub-dn-b1,60020,1332955859363 as dead server
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have already tested the fault tolerance of HBase by manually bringing down an RS while querying a table, and it worked fine, so I was expecting the same today (even though the RS went down by itself this time) while I was loading the data. But it didn't work out well. Thanks for your time. Let me know if you need more details.
>>>>>>>>>>>>>
>>>>>>>>>>>>> ~Anil
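
(Mike's earlier advice to loosen up on timings in a virtualized setup mostly comes down to the ZooKeeper session timeout that expired here after roughly 59 s without a heartbeat. A minimal sketch, assuming the standard hbase-site.xml property; the value is illustrative, and the negotiated timeout is still bounded by the ZooKeeper server's minSessionTimeout/maxSessionTimeout.)

    <!-- hbase-site.xml: give an I/O-starved or GC-pausing RS longer to heartbeat
         before the Master declares it dead -->
    <property>
      <name>zookeeper.session.timeout</name>
      <value>120000</value>
    </property>
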
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Mar 30, 2012 at 6:05 PM, Kevin O'dell <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Anil,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Can you please attach the RS logs from the failure?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Mar 30, 2012 at 7:05 PM, anil gupta <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I am using cdh3u2 and I have 7 worker nodes (VMs spread across two machines), each running a DataNode, TaskTracker, and RegionServer (1200 MB heap size). I was loading data into HBase using the bulk loader with a custom mapper. I was loading around 34 million records, and I have loaded the same set of data in the same environment many times before without any problem. This time, while loading the data, one of the region servers failed (though the DN and TT kept running on that node), and after numerous map-task failures the loading job failed. Is there any setting/configuration that can make bulk loading fault-tolerant to the failure of region servers?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> Thanks & Regards,
>>>>>>>>>>>>>>> Anil Gupta
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Kevin O'Dell
>>>>>>>>>>>>>> Customer Operations Engineer, Cloudera
