Anil,

Same hardware, fewer VMs.
On Aug 13, 2012, at 9:49 PM, Anil Gupta <[email protected]> wrote:

> Hi Mike,
> I am constrained by the hardware available for the POC cluster. We are waiting for the hardware which we will use for performance testing.
>
> Best Regards,
> Anil
>
> On Aug 13, 2012, at 6:59 PM, Michael Segel <[email protected]> wrote:
>
>> Anil,
>>
>> I don't know if you can call it a bug if you don't have enough memory available.
>>
>> I mean if you don't use HBase, then you may have more leeway in terms of swap.
>>
>> You can also do more tuning of HBase to handle the additional latency found in a virtual environment.
>>
>> Why don't you rebuild your VMs to be slightly larger in terms of memory?
>>
>> On Aug 13, 2012, at 8:05 PM, anil gupta <[email protected]> wrote:
>>
>>> Hi Mike,
>>>
>>> You hit the nail on the head: I need to lower the memory by setting yarn.nodemanager.resource.memory-mb. This is where another of the major YARN bugs I mentioned comes in. I already tried setting that property to 1500 MB in yarn-site.xml and setting yarn.app.mapreduce.am.resource.mb to 1000 MB in mapred-site.xml. If I make this change, then the YARN job does not run at all, even though the configuration is right. It's a bug, and I have to file a JIRA for it. So I was left with only the option of letting it run with the incorrect YARN conf, since my objective is to load data into HBase rather than to play with YARN. MapReduce is only used for bulk loading in my cluster.
>>>
>>> Here is a link to the mailing list email regarding running YARN with less memory:
>>> http://permalink.gmane.org/gmane.comp.jakarta.lucene.hadoop.user/33164
>>>
>>> It would be great if you could answer this simple question of mine: Is HBase bulk loading fault tolerant to Region Server failures in a viable/decent environment?
>>>
>>> Thanks,
>>> Anil Gupta
>>>
>>> On Mon, Aug 13, 2012 at 5:17 PM, Michael Segel <[email protected]> wrote:
>>>
>>>> Not sure why you're having an issue in getting an answer. Even if you're not a YARN expert, Google is your friend.
>>>>
>>>> See:
>>>> http://books.google.com/books?id=Wu_xeGdU4G8C&pg=PA323&lpg=PA323&dq=Hadoop+YARN+setting+number+of+slots&source=bl&ots=i7xQYwQf-u&sig=ceuDmiOkbqTqok_HfIr3udvm6C0&hl=en&sa=X&ei=8JYpUNeZJMnxygGzqIGwCw&ved=0CEQQ6AEwAQ#v=onepage&q=Hadoop%20YARN%20setting%20number%20of%20slots&f=false
>>>>
>>>> This is a web page from the 3rd edition of Tom White's book.
>>>>
>>>> The bottom line...
>>>> -=-
>>>> The considerations for how much memory to dedicate to a node manager for running containers are similar to those discussed in “Memory” on page 307. Each Hadoop daemon uses 1,000 MB, so for a datanode and a node manager, the total is 2,000 MB. Set aside enough for other processes that are running on the machine, and the remainder can be dedicated to the node manager's containers by setting the configuration property yarn.nodemanager.resource.memory-mb to the total allocation in MB. (The default is 8,192 MB.)
>>>> -=-
>>>> Taken per fair use, page 323.
>>>>
>>>> As you can see, you need to drop this down to something like 1 GB, if you even have enough memory for that. Again, set yarn.nodemanager.resource.memory-mb to a more realistic value.
>>>>
>>>> 8 GB on a 3 GB node? Yeah, that would really hose you, especially if you're trying to run HBase too.
>>>>
>>>> Even here... you really don't have enough memory to do it all. (Maybe enough to do a small test.)
>>>>
>>>> Good luck.
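For readers following this part of the thread, here is a minimal sketch of the low-memory settings being discussed. The property names are the standard Hadoop 2 ones; the 1500 MB and 1000 MB figures are the values Anil says he tried, and whether they actually work on his Hadoop 2.0.0-alpha build is exactly what is in dispute here, so treat this as a starting point rather than a verified fix.

In yarn-site.xml:

<!-- Total memory the NodeManager may hand out to containers on this node.
     1500 MB is the figure from the thread; it has to leave room for the
     DataNode, NodeManager and RegionServer heaps on the same 3.2 GB VM. -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>1500</value>
</property>

In mapred-site.xml:

<!-- Memory requested for the MapReduce ApplicationMaster container.
     It must fit inside the NodeManager allowance above; 1000 MB is the
     value Anil mentions trying. -->
<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>1000</value>
</property>

If a job submitted with values like these never starts, checking yarn.scheduler.minimum-allocation-mb and mapreduce.map.memory.mb is a reasonable next step, since each container request is rounded up to the scheduler's minimum allocation before it is matched against the NodeManager's allowance.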
>>>>
>>>> On Aug 13, 2012, at 3:24 PM, anil gupta <[email protected]> wrote:
>>>>
>>>>> Hi Mike,
>>>>>
>>>>> Here is the link to my email on the Hadoop list regarding the YARN problem:
>>>>> http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201208.mbox/%3ccaf1+vs8of4vshbg14b7sgzbb_8ty7gc9lw3nm1bm0v+24ck...@mail.gmail.com%3E
>>>>>
>>>>> Somehow the link to the Cloudera mail in my last email does not seem to work. Here is the new link:
>>>>> https://groups.google.com/a/cloudera.org/forum/?fromgroups#!searchin/cdh-user/yarn$20anil/cdh-user/J564g9A8tPE/ZpslzOkIGZYJ%5B1-25%5D
>>>>>
>>>>> Thanks for your help,
>>>>> Anil Gupta
>>>>>
>>>>> On Mon, Aug 13, 2012 at 1:14 PM, anil gupta <[email protected]> wrote:
>>>>>
>>>>>> Hi Mike,
>>>>>>
>>>>>> I tried doing that by setting properties in mapred-site.xml, but YARN doesn't seem to honor the "mapreduce.tasktracker.map.tasks.maximum" property. Here is a reference to a discussion of the same problem:
>>>>>> https://groups.google.com/a/cloudera.org/forum/?fromgroups#!searchin/cdh-user/yarn$20anil/cdh-user/J564g9A8tPE/ZpslzOkIGZYJ[1-25]
>>>>>> I have also posted about the same problem on the Hadoop mailing list.
>>>>>>
>>>>>> I already admitted in my previous email that YARN has major issues when you want to control it in a low-memory environment. I was just trying to get the views of HBase experts on bulk load failures, since we will be relying heavily on fault tolerance.
>>>>>> If the HBase Bulk Loader is fault tolerant to the failure of an RS in a viable environment, then I don't have any issue. I hope this clears up my purpose of posting on this topic.
>>>>>>
>>>>>> Thanks,
>>>>>> Anil
>>>>>>
>>>>>> On Mon, Aug 13, 2012 at 12:39 PM, Michael Segel <[email protected]> wrote:
>>>>>>
>>>>>>> Anil,
>>>>>>>
>>>>>>> Do you know what happens when you have an airplane with too heavy a cargo when it tries to take off? You run out of runway and you crash and burn.
>>>>>>>
>>>>>>> Looking at your post, why are you starting 8 map processes on each slave? That's tunable, and you clearly do not have enough memory in each VM to support 8 slots on a node. Here you swap, and when you swap you cause HBase to crash and burn.
>>>>>>>
>>>>>>> 3.2 GB of memory means no more than 1 slot per slave, and even then you're going to be very tight. Not to mention that you will need to loosen up on your timings, since it's all virtual and you have way too much I/O per drive going on.
>>>>>>>
>>>>>>> My suggestion is that you go back and tune your system before thinking about running anything.
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> -Mike
>>>>>>>
>>>>>>> On Aug 13, 2012, at 2:11 PM, anil gupta <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Guys,
>>>>>>>>
>>>>>>>> Sorry for not mentioning the version I am currently running. My current version is HBase 0.92.1 (CDH4), and I am running Hadoop 2.0.0-alpha with YARN for MR. My original post was for HBase 0.92. Here are some more details of my current setup:
>>>>>>>> I am running an 8-slave, 4-admin-node cluster on CentOS 6.0 VMs installed on VMware Hypervisor 5.0. Each of my VMs has 3.2 GB of memory and 500 GB of HDFS space.
>>>>>>>> I use this cluster for POC (proof of concept). I am not looking for any performance benchmarking from this set-up.
>>>>>>>> Due to some major bugs in YARN, I am unable to make it work properly with less than 4 GB of memory. I am already discussing them on the Hadoop mailing list.
>>>>>>>>
>>>>>>>> Here is the log of the failed mapper: http://pastebin.com/f83xE2wv
>>>>>>>>
>>>>>>>> The problem is that when I start a bulk loading job in YARN, 8 map processes start on each slave and then all of my slaves are hammered badly. Since the slaves are getting hammered badly, the RegionServer gets its lease expired or a YouAreDeadException. Here is the log of the RS which caused the job to fail: http://pastebin.com/9ZQx0DtD
>>>>>>>>
>>>>>>>> I am aware that this is happening due to underperforming hardware (two slaves share one 7200 rpm hard drive in my setup) and some major bugs regarding running YARN in less than 4 GB of memory. My only concern is the failure of the entire MR job and its fault tolerance to RS failures. I am not really concerned about the RS failure itself, since HBase is fault tolerant.
>>>>>>>>
>>>>>>>> Please let me know if you need anything else.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Anil
>>>>>>>>
>>>>>>>> On Mon, Aug 13, 2012 at 6:58 AM, Michael Segel <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Yes, it can. You can see one RS failure causing a cascading RS failure. Of course YMMV, and it depends on which version you are running.
>>>>>>>>>
>>>>>>>>> The OP is on CDH3u2, which still had some issues. CDH3u4 is the latest and he should upgrade.
>>>>>>>>>
>>>>>>>>> (Or go to CDH4...)
>>>>>>>>>
>>>>>>>>> HTH
>>>>>>>>>
>>>>>>>>> -Mike
>>>>>>>>>
>>>>>>>>> On Aug 13, 2012, at 8:51 AM, Kevin O'dell <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Anil,
>>>>>>>>>>
>>>>>>>>>> Do you have a root cause on the RS failure? I have never heard of one RS failure causing a whole job to fail.
>>>>>>>>>>
>>>>>>>>>> On Tue, Aug 7, 2012 at 1:59 PM, anil gupta <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi HBase Folks,
>>>>>>>>>>>
>>>>>>>>>>> I ran the bulk loader yesterday night to load data into a table. During the bulk loading job, one of the region servers crashed and the entire job failed. It takes around 2.5 hours for this job to finish, and the job failed when it was around 50% complete. After the failure, that table was also corrupted in HBase. My cluster has 8 region servers.
>>>>>>>>>>>
>>>>>>>>>>> Is bulk loading not fault tolerant to the failure of region servers?
>>>>>>>>>>>
>>>>>>>>>>> I am reusing this old email chain because at that time my question went unanswered. Please share your views.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Anil Gupta
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Apr 3, 2012 at 9:12 AM, anil gupta <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Kevin,
>>>>>>>>>>>>
>>>>>>>>>>>> I am not really concerned about the RegionServer going down, as the same thing can happen when deployed in production. In production, though, we won't be using a VM environment, and I am aware that my current dev environment is not suited for heavy processing. What I am concerned about is the failure of the bulk loading job when the Region Server failed.
>>>>>>>>>>>> Does this mean that the bulk loading job is not fault tolerant to the failure of a Region Server? I was expecting the job to succeed even though the RegionServer failed, because there were 6 more RSs running in the cluster. Fault tolerance is one of the biggest selling points of the Hadoop platform. Let me know your views.
>>>>>>>>>>>> Thanks for your time.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Anil Gupta
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Apr 3, 2012 at 7:34 AM, Kevin O'dell <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Anil,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am sorry for the delayed response. Reviewing the logs, it appears:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 12/03/30 15:38:31 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 59311ms for sessionid 0x136557f99c90065, closing socket connection and attempting reconnect
>>>>>>>>>>>>>
>>>>>>>>>>>>> 12/03/30 15:38:32 FATAL regionserver.HRegionServer: ABORTING region server serverName=ihub-dn-b1,60020,1332955859363, load=(requests=0, regions=44, usedHeap=446, maxHeap=1197): Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ihub-dn-b1,60020,1332955859363 as dead server
>>>>>>>>>>>>>
>>>>>>>>>>>>> It appears to be a classic overworked RS. You were doing too much for the RS and it did not respond in time; the Master marked it as dead, and when the RS finally responded, the Master said "no, you are already dead" and aborted the server. This is why you see the YouAreDeadException. It is probably due to the shared resources of the VM infrastructure you are running. You will either need to devote more resources or add more nodes (most likely physical) to the cluster if you would like to keep running these jobs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Mar 30, 2012 at 9:24 PM, anil gupta <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Kevin,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here is a Dropbox link to the log file of the region server which failed:
>>>>>>>>>>>>>> http://dl.dropbox.com/u/64149128/hbase-hbase-regionserver-ihub-dn-b1.out
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> IMHO, the problem starts at line #3009, which says:
>>>>>>>>>>>>>> 12/03/30 15:38:32 FATAL regionserver.HRegionServer: ABORTING region server serverName=ihub-dn-b1,60020,1332955859363, load=(requests=0, regions=44, usedHeap=446, maxHeap=1197): Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ihub-dn-b1,60020,1332955859363 as dead server
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have already tested the fault tolerance of HBase by manually bringing down an RS while querying a table, and it worked fine. I was expecting the same today (even though the RS went down by itself this time) when I was loading the data, but it didn't work out well.
>>>>>>>>>>>>>> Thanks for your time.
Let me know if you need more details. >>>>>>>>>>>>>> >>>>>>>>>>>>>> ~Anil >>>>>>>>>>>>>> >>>>>>>>>>>>>> On Fri, Mar 30, 2012 at 6:05 PM, Kevin O'dell < >>>>>>>>>>> [email protected] >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Anil, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Can you please attach the RS logs from the failure? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Fri, Mar 30, 2012 at 7:05 PM, anil gupta < >>>>>>> [email protected]> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>>> Hi All, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I am using cdh3u2 and i have 7 worker nodes(VM's spread across >>>>>>> two >>>>>>>>>>>>>>>> machines) which are running Datanode, Tasktracker, and Region >>>>>>>>>>>>> Server(1200 >>>>>>>>>>>>>>>> MB heap size). I was loading data into HBase using Bulk Loader >>>>>>>>>>> with a >>>>>>>>>>>>>>>> custom mapper. I was loading around 34 million records and I >>>>>>> have >>>>>>>>>>>>> loaded >>>>>>>>>>>>>>>> the same set of data in the same environment many times before >>>>>>>>>>>>> without >>>>>>>>>>>>>>> any >>>>>>>>>>>>>>>> problem. This time while loading the data, one of the region >>>>>>>>>>>>> server(but >>>>>>>>>>>>>>> the >>>>>>>>>>>>>>>> DN and TT kept on running on that node ) failed and then after >>>>>>>>>>>>> numerous >>>>>>>>>>>>>>>> failures of map-tasks the loding job failed. Is there any >>>>>>>>>>>>>>>> setting/configuration which can make Bulk Loading >>>>>>> fault-tolerant to >>>>>>>>>>>>>>> failure >>>>>>>>>>>>>>>> of region-servers? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>>> Thanks & Regards, >>>>>>>>>>>>>>>> Anil Gupta >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> -- >>>>>>>>>>>>>>> Kevin O'Dell >>>>>>>>>>>>>>> Customer Operations Engineer, Cloudera >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> -- >>>>>>>>>>>>>> Thanks & Regards, >>>>>>>>>>>>>> Anil Gupta >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> Kevin O'Dell >>>>>>>>>>>>> Customer Operations Engineer, Cloudera >>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> Thanks & Regards, >>>>>>>>>>>>> Anil Gupta >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Thanks & Regards, >>>>>>>>>>> Anil Gupta >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Kevin O'Dell >>>>>>>>>> Customer Operations Engineer, Cloudera >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> -- >>>>>>>> Thanks & Regards, >>>>>>>> Anil Gupta >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Thanks & Regards, >>>>>> Anil Gupta >>>>>> >>>>> >>>>> >>>>> >>>>> -- >>>>> Thanks & Regards, >>>>> Anil Gupta >>>> >>>> >>> >>> >>> -- >>> Thanks & Regards, >>> Anil Gupta >> >

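Finally, one concrete way to "loosen up on your timings", as Mike puts it, is the RegionServer's ZooKeeper session timeout; the pasted log shows a session that had gone roughly 59 seconds without contact before the RS was declared dead. A hedged sketch for hbase-site.xml with an illustrative value: a longer timeout only buys slack for I/O stalls in an oversubscribed VM setup, it does not fix the underlying resource starvation, and the value actually granted is also capped by the ZooKeeper server's maxSessionTimeout.

<!-- How long ZooKeeper waits without a heartbeat before the RegionServer's
     session expires and the Master marks it dead. 120000 ms is illustrative;
     the ZooKeeper ensemble's maxSessionTimeout must allow a value this large. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>120000</value>
</property>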