I looked at your logs. They show you running with the default ulimit. Please fix that (see the requirements section of the book, in particular the part on ulimit and max processes, http://hbase.apache.org/book.html#ulimit, but make sure you satisfy all of the requirements). Let us know how it goes after you change this. St.Ack
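For reference, on Ubuntu the change the book describes usually amounts to something like the following. This is only a sketch: it assumes the daemons run as root, as the directory listings further down suggest (substitute whatever account actually runs HBase), and 32768 is the book's suggested starting point rather than a tuned value.

# /etc/security/limits.conf
root  -  nofile  32768
root  -  nproc   32768

# /etc/pam.d/common-session (so the limits actually apply to login sessions)
session required pam_limits.so

After a fresh restart, the first few lines of each regionserver log print the ulimit the process actually got, which is the quickest way to confirm the change took.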
On Wed, Jan 4, 2012 at 4:45 AM, Seraph Imalia <[email protected]> wrote: > > On 03 Jan 2012, at 6:39 PM, Stack wrote: > >> On Wed, Dec 28, 2011 at 6:27 AM, Seraph Imalia <[email protected]> wrote: >>> After updating from 0.20.6 to 0.90.4, we have been having serious RAM >>> issues. I had hbase-env.sh set to use 3 Gigs of RAM with 0.20.6 but with >>> 0.90.4 even 4.5 Gigs seems not enough. It does not matter how much load >>> the hbase services are under, it just crashes after 24-48 hours. >> >> What kind of a 'crash' is it? Is it OOME, or JVM seg faulting, or just >> a full GC making the RS look like it's gone away? > > The crash seems slightly different each time (which I suppose is consistent > with running out of RAM). When our monitoring system alerts me to the > problem and I log into the 3 servers, sometimes the regionservers on > dynobuntu10 and dynobuntu12 have already shut down and the last thing in their > logs says that the Shutdown Hook finished. The regionserver on dynobuntu17 > (which also has the master running) is usually frozen with the last item in > the log being 10-20 minutes prior. > > I then run bin/stop-hbase.sh on dynobuntu17: if the regionservers on > dynobuntu10 or dynobuntu12 are still running, sometimes they shut down > gracefully whilst other times the logs just show Shutdown Hook Initiated and > then nothing more. The master then keeps logging which servers it is waiting > on to shut down. I leave it like that for about 5-10 minutes, allowing any > processes that are still alive to do as much as they can before I do a kill > -9. > > That said, for the latest crash: when I logged in, the regionservers on > dynobuntu10 and dynobuntu12 had shut down already, and when I ran > bin/stop-hbase.sh on the master, everything shut down gracefully (kill -9 was > not necessary) - this is the first time this has happened so effortlessly. > >> >>> The only difference the load makes is how quickly the services crash. >>> Even over this holiday season with our lowest load of the year, it crashes >>> just after 36 hours of being started. To fix it, I have to run the >>> stop-hbase.sh command, wait a while, kill -9 any hbase processes that >>> have stopped outputting logs or stopped responding, and then run >>> start-hbase.sh again. >> >> The process is deadlocked? IIRC, 0.90.4 had a possible deadlock. >> You could try 0.90.5. > > Sometimes, yes - my answer above gives more detail. > Nice - I didn't notice 0.90.5 had been released; I will try that next! > >> >> I took a look at some of the logs. They do not run from server start >> because I do not see the ulimit output in there. I'd like to see >> that. > > Sorry, I see that now :(. I have put the logs for the last two crashes here > (it's 2.5 Megs): > https://rapidshare.com/files/4120740991/hbase-last-two-crashes-2012-01-03_2012-01-04.tgz > One crash was around 19:30 yesterday and the second was at 12:50 today. > >> Looking at dynobuntu10, I see some interesting 'keys': >> >> 2011-12-28 15:25:53,297 INFO >> org.apache.hadoop.hbase.regionserver.HRegionServer: Received request >> to open region: >> UrlIndex,http://www.hellopeter.com/write_report_preview.php?inclination=1&company=kalahari.net&countryid=168&location=cape town&industryid=14&person=&problem=out of stock&other=&headline=why advertise goods online and you cannot deliver%29&incident=i purchased goods online that were supposedly in stock on the 5th october. 2010. 
>> after numerous phone calls i was promised that i would receive the >> ordered goods on the 20th october 2010. this has not happened to date. >> i spoke with them today and they promised to answer my queries on >> 21st october2010. how can you run a online busines ans sell %22we dont >> have stock%22%3a this is the easy way out as we have no proof of >> that%0d%0ait is just common curtousy to return a phone call. they have >> had my money in their bank account for 15 days. this seems like a >> ****. they could be reaping interest on thousands of peoples money. >> easy way of making money.%0d%0akalahari. net are in a comfort zone. >> they need to realize that customers are king%0d%0athey reimburse my >> money. i paid bank charges and transfer fees. what about this. my >> unnessessary phone calls. do they reinburse this.%0d%0acome on stop >> taking the innocent public for a ride with your sweet >> talk.&incidentcharsleft=270&incident_day_select=21&incident_month_select=10&incident_year_select=2010&incident_hour_select=11&incident_min_select=45&incident_ampm_select=pm&policyno=3573210 >> %2f3573310 &cellno=%2b27 >> 766881896&preview=preview,1308921597915.1827414390 >> >> That's a single key. It looks like you have an issue in your crawler's >> URL extraction facility. > > Yeah, that URL actually exists, but I can see how that can be a problem to > use as a key. Not sure what to do here; perhaps we should exclude URLs like > this - or perhaps your hashing idea below will solve that. I don't really > know enough about hashing to make the call though - is it not possible to run > into duplicate keys using e.g. an MD5 hash? The MD5 hash of the above URL > is 8f157d290ceeacedb6c1be133f1ca153 - it seems logical to me that a string > that small cannot possibly be unique given that the URL was originally 1431 > characters long. What is your opinion on this? I will be doing some more > research on this though. Perhaps there is a hash type that is less prone to collisions > that you could suggest for our purposes (but keeping in mind our ad delivery > servers will need to hash the URL before querying HBase, so it needs to be > fast and not resource intensive)? > >> If you have lots of URLs like the above, my guess is that you have >> massive indices. Look at a regionserver and see how much RAM the >> indexes take up? > > Yeah, it looks pretty high; it is currently half the maxHeap on a fresh start... > > Below is what it is now; I have just disabled the block cache after the last > crash as you suggested, to try to keep it stable until we have a real fix. With > the block cache at the default of 25% (1 Gig) and the index size at around 2 > Gigs, that only leaves 1 Gig for everything else :( which is not much. 
> > dynobuntu10: requests=93, regions=225, stores=225, storefiles=354, > storefileIndexSize=2239, memstoreSize=35, compactionQueueSize=0, > flushQueueSize=0, usedHeap=2639, maxHeap=4087, blockCacheSize=0, > blockCacheFree=0, blockCacheCount=0, blockCacheHitCount=0, > blockCacheMissCount=0, blockCacheEvictedCount=0, blockCacheHitRatio=0, > blockCacheHitCachingRatio=0 > > dynobuntu12: requests=305, regions=225, stores=225, storefiles=435, > storefileIndexSize=2004, memstoreSize=31, compactionQueueSize=0, > flushQueueSize=0, usedHeap=2321, maxHeap=4087, blockCacheSize=0, > blockCacheFree=0, blockCacheCount=0, blockCacheHitCount=0, > blockCacheMissCount=0, blockCacheEvictedCount=0, blockCacheHitRatio=0, > blockCacheHitCachingRatio=0 > > dynobuntu17: requests=51, regions=226, stores=226, storefiles=410, > storefileIndexSize=2046, memstoreSize=36, compactionQueueSize=0, > flushQueueSize=0, usedHeap=2927, maxHeap=4087, blockCacheSize=0, > blockCacheFree=0, blockCacheCount=0, blockCacheHitCount=0, > blockCacheMissCount=0, blockCacheEvictedCount=0, blockCacheHitRatio=0, > blockCacheHitCachingRatio=0 > >> >> In dynoubuntu12 I see an OOME. Interestingly, the OOME is while >> trying to read in a file's index on: >> >> 2011-12-28 15:26:50,310 DEBUG >> org.apache.hadoop.hbase.regionserver.HRegion: Opening region: REGION >> => {NAME => >> 'UrlIndex,http://media.imgbnr.com/images/prep_ct.php?imgfile=4327_567146_7571713_250_300.html&partnerid=113471&appid=35229&subid=&advertiserid=567146&keywordid=42825417&type=11&uuid=e11ac4bea82d42838fde8eb306fbc354&keyword=www.&matchedby=c&ct=cpi&wid=5008233&size=300x250&lid=7571713&cid=230614&cc=us&rc=in&mc=602&dc=0&vt=1275659190365&refurl=mangafox.com&clickdomain=66.45.56.124&pinfo=&rurl=http://javascript,1283006905877', >> STARTKEY => >> 'http://media.imgbnr.com/images/prep_ct.php?imgfile=4327_567146_7571713_250_300.html&partnerid=113471&appid=35229&subid=&advertiserid=567146&keywordid=42825417&type=11&uuid=e11ac4bea82d42838fde8eb306fbc354&keyword=www.&matchedby=c&ct=cpi&wid=5008233&size=300x250&lid=7571713&cid=230614&cc=us&rc=in&mc=602&dc=0&vt=1275659190365&refurl=mangafox.com&clickdomain=66.45.56.124&pinfo=&rurl=http://javascript', >> ENDKEY => >> 'http://media.imgbnr.com/images/prep_ct.php?imgfile=6966_567146_7571715_90_728.html&partnerid=113474&appid=35224&subid=&advertiserid=567146&keywordid=42825616&type=11&uuid=6178294088f545ab938c403be5b7c957&keyword=www.&matchedby=c&ct=cpi&wid=5008236&size=728x90&lid=7571715&cid=230615&cc=us&rc=ny&mc=501&dc=0&vt=1275772980357&refurl=worldstarhiphop.com&clickdomain=66.45.56.124&pinfo=&rurl=http://javascript', >> ENCODED => 1246560666, TABLE => {{NAME => 'UrlIndex', INDEXES => >> ' indexUrlUID >> UrlIndex_Family:urluid=org.apache.hadoop.hbase.client.tableindexed.IndexKeyGenerator:com.entelligence.tools.hbase.index.UniqueIndexKeyGenerator >> UrlIndex_Family:urluid >> org.apache.hadoop.io.Writable0org.apache.hadoop.io.ObjectWritable$NullInstance'org.apache.hadoop.io.WritableComparable >> indexHostLocations UrlIndex_Family:host >> UrlIndex_Family:locationcodes=org.apache.hadoop.hbase.client.tableindexed.IndexKeyGeneratorCorg.apache.hadoop.hbase.client.tableindexed.SimpleIndexKeyGenerator >> UrlIndex_Family:host >> org.apache.hadoop.io.Writable0org.apache.hadoop.io.ObjectWritable$NullInstance'org.apache.hadoop.io.WritableComparable >> indexLocationCodes >> UrlIndex_Family:locationcodes=org.apache.hadoop.hbase.client.tableindexed.IndexKeyGeneratorCorg.apache.hadoop.hbase.client.tableindexed.SimpleIndexKeyGenerator >> 
UrlIndex_Family:locationcodes >> org.apache.hadoop.io.Writable0org.apache.hadoop.io.ObjectWritable$NullInstance'org.apache.hadoop.io.WritableComparable >> indexHost >> UrlIndex_Family:host=org.apache.hadoop.hbase.client.tableindexed.IndexKeyGeneratorCorg.apache.hadoop.hbase.client.tableindexed.SimpleIndexKeyGenerator >> UrlIndex_Family:host >> org.apache.hadoop.io.Writable0org.apache.hadoop.io.ObjectWritable$NullInstance'org.apache.hadoop.io.WritableComparable >> indexChannelUIDs >> UrlIndex_Family:channeluids=org.apache.hadoop.hbase.client.tableindexed.IndexKeyGeneratorCorg.apache.hadoop.hbase.client.tableindexed.SimpleIndexKeyGenerator >> UrlIndex_Family:channeluids >> org.apache.hadoop.io.Writable0org.apache.hadoop.io.ObjectWritable$NullInstance'org.apache.hadoop.io.WritableComparable', >> FAMILIES => [{NAME => 'UrlIndex_Family', query => '', datelastspidered >> => '', path => '', BLOOMFILTER => 'NONE', TTL => '2147483647', >> datenextspider => '', daylastaccesed => '', host => '', originalurl => >> '', locationcodes => '', extension => '', IN_MEMORY => 'false', >> COMPRESSION => 'LZO', VERSIONS => '3', protocol => '', failCount => >> '', datediscovered => '', contentDiff => '', BLOCKSIZE => '65536', >> datelastmodified => '', urluid => '', BLOCKCACHE => 'true', >> channeluids => ''}]}} >> >> >> What is that INDEXES thing in the above schema? Is that some >> secondary indexing thing you have going on? > > Yes, it is a secondary Index we created. Basically, we serve ads and ads are > queued per URL. When we discover a new URL, we add it to HBase and give it a > GUID which is stored as a column. Other servers build lists of ads for each > URL and store it against the GUID of the URL. So when a request comes in for > ads, we use HBase to lookup the URL and get the GUID so that it can then know > which ads to show. This is all handled by the main URLIndex Table. BUT, > sometimes and far less often, we have a situation where we have the GUID but > need to look up the URL - so we have another table where the GUID is the > rowKey and the URL is a column. We also do this for hosts and for Channels > (this is what we call a place where ads show). > >> >> You might take a look at the files under UrlIndex/1246560666 in the >> UrlIndex column family..... Print out their meta data and see what >> size indices you have. See >> http://hbase.apache.org/book.html#rowkey.design in the book. It has >> some pointers and some talk on issues you may be running into. >> > > root@dynobuntu17:/opt/hadoop-0.20.2# bin/hadoop fs -ls > /hbase/UrlIndex/1246560666/UrlIndex_Family > Found 1 items > -rw-r--r-- 2 root supergroup 199115380 2010-09-10 01:52 > /hbase/UrlIndex/1246560666/UrlIndex_Family/6442966743799940481 > root@dynobuntu17:/opt/hadoop-0.20.2# > > After reading that, it seems clear we need to make some minor changes to our > Table Design, unfortunately it means creating a new table and copying rows to > the new table - not a fun process because we can't be down whilst doing it, > so we'll have to write some good code to ease the process - doable, but not > fun. I am hoping that upgrading the version to 0.90.5 and disabling the > block cache buys us about a month so we have the time to plan it properly. > >> >>> Attached are my logs from the latest "start-to-crash". There are 3 servers >>> and hbase is being used for storing URL's - 7 client servers connect to >>> hbase and perform URL Lookups at about 40 requests per second (this is the >>> low load over this holiday season). 
If the URL does not exist, it gets >>> added. The key on the HTable is the URL and there are a few fields stored >>> against it - e.g. DateDiscovered, Host, Script, QueryString, etc. >>> >> >> Do you have to scan the URLs in order or by website? If not, you >> might have a key that is a hash of the URL (and keep the actual URL as >> column data). > > Yes, sometimes we need to do scans like that - but only for a manual > investigation, not during normal operation. We may be able to get by as long > as we can come up with a plan for how we can find the URLs for a particular > website. I am concerned about the uniqueness of a hash. I see there are > lots of different hashes. Will there be uniqueness issues? We can't have > two URLs ending up with the same hash. > >> >>> Each server has a Hadoop datanode and an HBase regionserver, and one of the >>> servers additionally has the namenode, master and ZooKeeper. On first >>> start, each regionserver uses 2 Gigs (usedHeap) and as soon as I restart >>> the clients, the usedHeap slowly climbs until it reaches the maxHeap and >>> shortly after that, the regionservers start crashing - sometimes they >>> actually shut down gracefully by themselves. >>> >> >> >> Are the URL lookups totally random? If so, turn off the block cache. >> That'll get you some more memory. > > Yes, pretty random, and as we grow, it will get more random. I have disabled > the block cache - and looking at the heap stats which I pasted above, it > seems like it will buy us some time to make some long-term changes - I will > keep you updated here. > >> >> Add more servers too to spread the load if you can afford it. Things >> tend to run smoother once you get above 5 servers or so. > > We currently have 4 instances of HBase - 2 of them have 5 servers each and are used > for ad delivery log storage, and the other two are used for URL lookups and > have 3 servers each. I will struggle to get our finance guys to approve more > servers for HBase, but if that is my only option I will definitely try :) > > Coincidentally, the 2 instances used for ad delivery log storage are down at > the moment, but it is because we are having stability issues with Ubuntu > Server - they periodically do a memory dump and shut down, even if nothing > is running on them. I have to tackle that problem too pretty soon. In the > meantime, MySQL is taking up the slack, but we will quickly run into > performance issues if we don't fix that soon. But anyway, at the moment I > don't need your help with those servers because it does not seem to be HBase > or Hadoop causing the crashes. I am tackling this URL server problem first. > >> >>> Originally, we had hbase.regionserver.handler.count set to 100 and I have >>> now removed that to leave it at the default, which has not helped. >>> >>> We have not made any changes to the clients, and we have a mirrored instance >>> of this in our UK Data Centre which is still running 0.20.6 and servicing >>> 10 clients currently at over 300 requests per second (again low load over >>> the holidays) and it is 100% stable. >>> >>> What do I do now? Your website says I cannot downgrade? >>> >> >> That is right. >> >> Let's get this stable again. >> >> St.Ack > > Thanks for your help so far. I have already disabled the block cache (which > I am sure will show an immediate improvement) and I will schedule an upgrade > of HBase to 0.90.5 during this week and then monitor it. If you do have some > knowledge about the uniqueness of an MD5 hash, please share it with me if you > have the time? 
- it will help me whilst I plan the changes we need to make to > the table structure. > > Regards, > Seraph >
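To make the hashed-key idea from the exchange above concrete, here is a rough sketch against the 0.90 Java client. It is only an illustration, not the actual redesign: the table name "UrlIndexByHash" and the sample URL and GUID values are invented, while the family and column names (UrlIndex_Family, originalurl, urluid) come from the schema dump earlier in the thread. On the uniqueness worry: MD5 produces a 128-bit digest, so collisions are possible in principle, but the chance of two distinct URLs colliding by accident is about n^2/2^129 for n keys - roughly 10^-21 even with a billion URLs.

import java.security.MessageDigest;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HashedUrlKeySketch {

  // Family and qualifiers as they appear in the UrlIndex schema above.
  private static final byte[] FAMILY = Bytes.toBytes("UrlIndex_Family");
  private static final byte[] ORIGINAL_URL = Bytes.toBytes("originalurl");
  private static final byte[] URL_UID = Bytes.toBytes("urluid");

  // 16-byte MD5 digest of the URL; this becomes the fixed-size row key.
  static byte[] rowKeyFor(String url) throws Exception {
    return MessageDigest.getInstance("MD5").digest(url.getBytes("UTF-8"));
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // "UrlIndexByHash" is a made-up name for the replacement table.
    HTable table = new HTable(conf, "UrlIndexByHash");

    String url = "http://example.com/landing?campaign=42";   // sample value only
    String guid = "0f8fad5b-d9cb-469f-a165-70867728950e";     // sample value only

    // Write path: key on the digest, keep the full URL and its GUID as columns.
    Put put = new Put(rowKeyFor(url));
    put.add(FAMILY, ORIGINAL_URL, Bytes.toBytes(url));
    put.add(FAMILY, URL_UID, Bytes.toBytes(guid));
    table.put(put);

    // Ad-serving read path: hash the incoming URL again and fetch the GUID.
    Get get = new Get(rowKeyFor(url));
    get.addColumn(FAMILY, URL_UID);
    Result result = table.get(get);
    System.out.println("guid = " + Bytes.toString(result.getValue(FAMILY, URL_UID)));

    table.close();
  }
}

The trade-off is that hashing throws away the lexicographic ordering of URLs, so the occasional "all URLs for this website" investigation would need either a filtered scan on the host column or a small host-keyed lookup table on the side - worth settling before committing to the new table.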
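On the suggestion to print the store file metadata: the HFile tool bundled with HBase can be pointed straight at the path from the fs -ls output above. If I have the options right, -f names the file and -m prints its metadata, including the index size:

bin/hbase org.apache.hadoop.hbase.io.hfile.HFile -m -f /hbase/UrlIndex/1246560666/UrlIndex_Family/6442966743799940481

Run it on one of the cluster nodes so it picks up the right Hadoop configuration.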
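And on switching the block cache off: one route is the per-family BLOCKCACHE attribute that already shows up as 'true' in the schema dump above. A sketch of that change from the HBase shell - note that in 0.90 the table has to be disabled before an alter, so it implies a short outage:

disable 'UrlIndex'
alter 'UrlIndex', {NAME => 'UrlIndex_Family', BLOCKCACHE => 'false'}
enable 'UrlIndex'

The other knob is hfile.block.cache.size in hbase-site.xml, which sets the cache's share of the heap; I believe setting it to 0 disables the cache globally, but that is worth verifying against 0.90 before relying on it.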
