Hi Mike,

I knew this would be your next response. :) However, as I said earlier, this cluster is for HBase. At present, I only use MR for loading data.
Thanks,
Anil

On Mon, Aug 13, 2012 at 8:12 PM, Michael Segel <[email protected]> wrote:

> Anil,
>
> Same hardware, fewer VMs.
>
> On Aug 13, 2012, at 9:49 PM, Anil Gupta <[email protected]> wrote:
>
>> Hi Mike,
>> I am constrained by the hardware available for the POC cluster. We are waiting for the hardware which we will use for performance.
>>
>> Best Regards,
>> Anil
>>
>> On Aug 13, 2012, at 6:59 PM, Michael Segel <[email protected]> wrote:
>>
>>> Anil,
>>>
>>> I don't know if you can call it a bug if you don't have enough memory available.
>>>
>>> I mean if you don't use HBase, then you may have more leeway in terms of swap.
>>>
>>> You can also do more tuning of HBase to handle the additional latency found in a virtual environment.
>>>
>>> Why don't you rebuild your VMs to be slightly larger in terms of memory?
>>>
>>> On Aug 13, 2012, at 8:05 PM, anil gupta <[email protected]> wrote:
>>>
>>>> Hi Mike,
>>>>
>>>> You hit the nail on the head that I need to lower the memory by setting yarn.nodemanager.resource.memory-mb. Here's another major bug of YARN you are talking about: I already tried setting that property to 1500 MB in yarn-site.xml and setting yarn.app.mapreduce.am.resource.mb to 1000 MB in mapred-site.xml. If I make this change then the YARN job does not run at all, even though the configuration is right. It's a bug and I have to file a JIRA for it. So I was only left with the option of letting it run with an incorrect YARN conf, since my objective is to load data into HBase rather than play with YARN. MapReduce is only used for bulk loading in my cluster.
>>>>
>>>> Here is a link to the mailing list email regarding running YARN with less memory:
>>>> http://permalink.gmane.org/gmane.comp.jakarta.lucene.hadoop.user/33164
>>>>
>>>> It would be great if you can answer this simple question of mine: is HBase Bulk Loading fault tolerant to Region Server failures in a viable/decent environment?
>>>>
>>>> Thanks,
>>>> Anil Gupta
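For reference, this is the shape of the configuration being discussed above: a minimal sketch using the illustrative values from this thread (1500 MB for the NodeManager's container pool, 1000 MB for the MapReduce ApplicationMaster). Whether these exact numbers behave properly on the Hadoop 2.0.0-alpha build in question is exactly what is being debated here.

    <!-- yarn-site.xml on each node: total memory the NodeManager may hand out to containers -->
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>1500</value>
    </property>

    <!-- mapred-site.xml: memory requested for the MapReduce ApplicationMaster container -->
    <property>
      <name>yarn.app.mapreduce.am.resource.mb</name>
      <value>1000</value>
    </property>

On a 3.2 GB VM that also hosts a DataNode, a NodeManager and a RegionServer, even 1500 MB leaves very little headroom, which is Mike's point in the excerpt below.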
>>>> On Mon, Aug 13, 2012 at 5:17 PM, Michael Segel <[email protected]> wrote:
>>>>
>>>>> Not sure why you're having an issue in getting an answer.
>>>>> Even if you're not a YARN expert, google is your friend.
>>>>>
>>>>> See:
>>>>> http://books.google.com/books?id=Wu_xeGdU4G8C&pg=PA323&lpg=PA323&dq=Hadoop+YARN+setting+number+of+slots&source=bl&ots=i7xQYwQf-u&sig=ceuDmiOkbqTqok_HfIr3udvm6C0&hl=en&sa=X&ei=8JYpUNeZJMnxygGzqIGwCw&ved=0CEQQ6AEwAQ#v=onepage&q=Hadoop%20YARN%20setting%20number%20of%20slots&f=false
>>>>>
>>>>> This is a web page from Tom White's 3rd Edition.
>>>>>
>>>>> The bottom line...
>>>>> -=-
>>>>> The considerations for how much memory to dedicate to a node manager for running containers are similar to those discussed in “Memory” on page 307. Each Hadoop daemon uses 1,000 MB, so for a datanode and a node manager, the total is 2,000 MB. Set aside enough for other processes that are running on the machine, and the remainder can be dedicated to the node manager’s containers by setting the configuration property yarn.nodemanager.resource.memory-mb to the total allocation in MB. (The default is 8,192 MB.)
>>>>> -=-
>>>>>
>>>>> Taken per fair use. Page 323.
>>>>>
>>>>> As you can see, you need to drop this down to something like 1 GB, if you even have enough memory for that.
>>>>> Again, set yarn.nodemanager.resource.memory-mb to a more realistic value.
>>>>>
>>>>> 8 GB on a 3 GB node? Yeah, that would really hose you, especially if you're trying to run HBase too.
>>>>>
>>>>> Even here... you really don't have enough memory to do it all. (Maybe enough to do a small test.)
>>>>>
>>>>> Good luck.
>>>>>
>>>>> On Aug 13, 2012, at 3:24 PM, anil gupta <[email protected]> wrote:
>>>>>
>>>>>> Hi Mike,
>>>>>>
>>>>>> Here is the link to my email on the Hadoop list regarding the YARN problem:
>>>>>> http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201208.mbox/%3ccaf1+vs8of4vshbg14b7sgzbb_8ty7gc9lw3nm1bm0v+24ck...@mail.gmail.com%3E
>>>>>>
>>>>>> Somehow the link to the Cloudera mail in my last email does not seem to work. Here is the new link:
>>>>>> https://groups.google.com/a/cloudera.org/forum/?fromgroups#!searchin/cdh-user/yarn$20anil/cdh-user/J564g9A8tPE/ZpslzOkIGZYJ%5B1-25%5D
>>>>>>
>>>>>> Thanks for your help,
>>>>>> Anil Gupta
>>>>>>
>>>>>> On Mon, Aug 13, 2012 at 1:14 PM, anil gupta <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Mike,
>>>>>>>
>>>>>>> I tried doing that by setting properties in mapred-site.xml, but YARN does not seem to honor the "mapreduce.tasktracker.map.tasks.maximum" property. Here is a reference to a discussion of the same problem:
>>>>>>> https://groups.google.com/a/cloudera.org/forum/?fromgroups#!searchin/cdh-user/yarn$20anil/cdh-user/J564g9A8tPE/ZpslzOkIGZYJ[1-25]
>>>>>>> I have also posted about the same problem on the Hadoop mailing list.
>>>>>>>
>>>>>>> I already admitted in my previous email that YARN has major issues when you want to control it in a low-memory environment. I was just trying to get the views of HBase experts on bulk load failures, since we will be relying heavily on fault tolerance.
>>>>>>> If the HBase Bulk Loader is fault tolerant to failure of a RS in a viable environment then I don't have any issue. I hope this clears up my purpose of posting on this topic.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Anil
>>>>>>>
>>>>>>> On Mon, Aug 13, 2012 at 12:39 PM, Michael Segel <[email protected]> wrote:
>>>>>>>
>>>>>>>> Anil,
>>>>>>>>
>>>>>>>> Do you know what happens when you have an airplane that has too heavy a cargo when it tries to take off?
>>>>>>>> You run out of runway and you crash and burn.
>>>>>>>>
>>>>>>>> Looking at your post, why are you starting 8 map processes on each slave?
>>>>>>>> That's tunable, and you clearly do not have enough memory in each VM to support 8 slots on a node.
>>>>>>>> Here you swap; you swap, you cause HBase to crash and burn.
>>>>>>>>
>>>>>>>> 3.2 GB of memory means no more than 1 slot per slave, and even then... you're going to be very tight. Not to mention that you will need to loosen up on your timings, since it's all virtual and you have way too much i/o per drive going on.
>>>>>>>>
>>>>>>>> My suggestion is that you go back and tune your system before thinking about running anything.
>>>>>>>>
>>>>>>>> HTH
>>>>>>>>
>>>>>>>> -Mike
>>>>>>>>
>>>>>>>> On Aug 13, 2012, at 2:11 PM, anil gupta <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Guys,
>>>>>>>>>
>>>>>>>>> Sorry for not mentioning the version I am currently running. My current version is HBase 0.92.1 (cdh4), running Hadoop 2.0.0-alpha with YARN for MR. My original post was for HBase 0.92. Here are some more details of my current setup:
>>>>>>>>> I am running an 8 slave, 4 admin node cluster of CentOS 6.0 VMs installed on VMware Hypervisor 5.0. Each of my VMs has 3.2 GB of memory and 500 GB of HDFS space.
>>>>>>>>> I use this cluster for POC (proof of concept). I am not looking for any performance benchmarking from this set-up. Due to some major bugs in YARN I am unable to make it work in a proper way with less than 4 GB of memory. I am already discussing them on the Hadoop mailing list.
>>>>>>>>>
>>>>>>>>> Here is the log of the failed mapper: http://pastebin.com/f83xE2wv
>>>>>>>>>
>>>>>>>>> The problem is that when I start a bulk loading job in YARN, 8 map processes start on each slave and then all of my slaves are hammered badly due to this. Since the slaves are getting hammered badly, the RegionServer gets its lease expired or a YouAreDeadException. Here is the log of the RS which caused the job to fail: http://pastebin.com/9ZQx0DtD
>>>>>>>>>
>>>>>>>>> I am aware that this is happening due to underperforming hardware (two slaves are using one 7200 rpm hard drive in my setup) and some major bugs regarding running YARN in less than 4 GB of memory. My only concern is the failure of the entire MR job and its fault tolerance to RS failures. I am not really concerned about the RS failure itself, since HBase is fault tolerant.
>>>>>>>>>
>>>>>>>>> Please let me know if you need anything else.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Anil
>>>>>>>>>
>>>>>>>>> On Mon, Aug 13, 2012 at 6:58 AM, Michael Segel <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Yes, it can.
>>>>>>>>>> You can see a RS failure causing a cascading RS failure. Of course YMMV, and it depends on which version you are running.
>>>>>>>>>>
>>>>>>>>>> OP is on CDH3u2, which still had some issues. CDH3u4 is the latest and he should upgrade.
>>>>>>>>>>
>>>>>>>>>> (Or go to CDH4...)
>>>>>>>>>>
>>>>>>>>>> HTH
>>>>>>>>>>
>>>>>>>>>> -Mike
>>>>>>>>>>
>>>>>>>>>> On Aug 13, 2012, at 8:51 AM, Kevin O'dell <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Anil,
>>>>>>>>>>>
>>>>>>>>>>> Do you have a root cause on the RS failure? I have never heard of one RS failure causing a whole job to fail.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Aug 7, 2012 at 1:59 PM, anil gupta <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi HBase Folks,
>>>>>>>>>>>>
>>>>>>>>>>>> I ran the bulk loader yesterday night to load data into a table. During the bulk loading job one of the region servers crashed and the entire job failed. It takes around 2.5 hours for this job to finish, and the job failed when it was around 50% complete. After the failure, that table was also corrupted in HBase. My cluster has 8 region servers.
>>>>>>>>>>>>
>>>>>>>>>>>> Is bulk loading not fault tolerant to failure of region servers?
>>>>>>>>>>>>
>>>>>>>>>>>> I am using this old email chain because at that time my question went unanswered. Please share your views.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Anil Gupta
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Apr 3, 2012 at 9:12 AM, anil gupta <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Kevin,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am not really concerned about the RegionServer going down, as the same thing can happen when deployed in production. Although in production we won't be having a VM environment, and I am aware that my current dev environment is not good for heavy processing, what I am concerned about is the failure of the bulk loading job when the Region Server failed. Does this mean that the bulk loading job is not fault tolerant to failure of a Region Server? I was expecting the job to be successful even though the RegionServer failed, because there are 6 more RS running in the cluster. Fault tolerance is one of the biggest selling points of the Hadoop platform. Let me know your views.
>>>>>>>>>>>>> Thanks for your time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Anil Gupta
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Apr 3, 2012 at 7:34 AM, Kevin O'dell <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Anil,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am sorry for the delayed response. Reviewing the logs, it appears:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 12/03/30 15:38:31 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 59311ms for sessionid 0x136557f99c90065, closing socket connection and attempting reconnect
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 12/03/30 15:38:32 FATAL regionserver.HRegionServer: ABORTING region server serverName=ihub-dn-b1,60020,1332955859363, load=(requests=0, regions=44, usedHeap=446, maxHeap=1197): Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ihub-dn-b1,60020,1332955859363 as dead server
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It appears to be a classic overworked RS. You were doing too much for the RS and it did not respond in time; the Master marked it as dead, and when the RS responded the Master said "no, you are already dead" and aborted the server. This is why you see the YouAreDeadException. This is probably due to the shared resources of the VM infrastructure you are running on. You will either need to devote more resources or add more nodes (most likely physical) to the cluster if you would like to keep running these jobs.
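A note for anyone who lands on this thread later: besides Kevin's advice to add resources, the setting that is sometimes loosened in slow or virtualized environments (in the spirit of Mike's "loosen up on your timings" remark above) is the RegionServer's ZooKeeper session timeout. This is only a sketch with an illustrative value, not something Kevin is prescribing here; a longer timeout hides pauses rather than curing the overload, it slows detection of genuinely dead servers, and the ZooKeeper ensemble's maxSessionTimeout must permit the larger value.

    <!-- hbase-site.xml on the RegionServers; the log above suggests an effective timeout near 60s -->
    <property>
      <name>zookeeper.session.timeout</name>
      <value>120000</value>
    </property>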
>>>>>>>>>>>>>> On Fri, Mar 30, 2012 at 9:24 PM, anil gupta <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi Kevin,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Here is a Dropbox link to the log file of the region server which failed:
>>>>>>>>>>>>>>> http://dl.dropbox.com/u/64149128/hbase-hbase-regionserver-ihub-dn-b1.out
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> IMHO, the problem starts from line #3009, which says: 12/03/30 15:38:32 FATAL regionserver.HRegionServer: ABORTING region server serverName=ihub-dn-b1,60020,1332955859363, load=(requests=0, regions=44, usedHeap=446, maxHeap=1197): Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ihub-dn-b1,60020,1332955859363 as dead server
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have already tested the fault tolerance of HBase by manually bringing down a RS while querying a table, and it worked fine, so I was expecting the same today (even though the RS went down by itself today) when I was loading the data. But it didn't work out well.
>>>>>>>>>>>>>>> Thanks for your time. Let me know if you need more details.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> ~Anil
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Mar 30, 2012 at 6:05 PM, Kevin O'dell <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Anil,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Can you please attach the RS logs from the failure?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Mar 30, 2012 at 7:05 PM, anil gupta <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I am using cdh3u2 and I have 7 worker nodes (VMs spread across two machines) which are running DataNode, TaskTracker, and RegionServer (1200 MB heap size). I was loading data into HBase using the Bulk Loader with a custom mapper. I was loading around 34 million records, and I have loaded the same set of data in the same environment many times before without any problem. This time, while loading the data, one of the region servers failed (but the DN and TT kept running on that node) and then, after numerous failures of map tasks, the loading job failed. Is there any setting/configuration which can make Bulk Loading fault-tolerant to failure of region servers?
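A note for later readers on the question above: on cdh3u2 the load is an ordinary MapReduce job, so whether one RegionServer failure kills the whole job largely comes down to how many times a failed map task may be retried and, if the custom mapper writes through the HBase client, how patient that client is before giving up. The property names below are real MR1/HBase 0.90-era knobs, but the values are purely illustrative, and raising them does not guarantee that a job survives a RegionServer that stays down.

    <!-- job configuration or mapred-site.xml: allow more retries of a failed map task (default is 4) -->
    <property>
      <name>mapred.map.max.attempts</name>
      <value>8</value>
    </property>

    <!-- hbase-site.xml or job configuration: client-side retries and the pause (ms) between them -->
    <property>
      <name>hbase.client.retries.number</name>
      <value>20</value>
    </property>
    <property>
      <name>hbase.client.pause</name>
      <value>2000</value>
    </property>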
--
Thanks & Regards,
Anil Gupta
