On Wed, Dec 28, 2011 at 6:27 AM, Seraph Imalia <[email protected]> wrote:

> After updating from 0.20.6 to 0.90.4, we have been having serious RAM issues.
> I had hbase-env.sh set to use 3 Gigs of RAM with 0.20.6, but with 0.90.4 even
> 4.5 Gigs seems not enough. It does not matter how much load the hbase
> services are under; it just crashes after 24-48 hours.
What kind of a 'crash' is it? Is it an OOME, the JVM seg-faulting, or just a full GC making the RS look like it has gone away?

> The only difference the load makes is how quickly the services crash. Even
> over this holiday season with our lowest load of the year, it crashes just
> after 36 hours of being started. To fix it, I have to run the stop-hbase.sh
> command, wait a while and kill -9 any hbase processes that have stopped
> outputting logs or stopped responding, and then run start-hbase.sh again.

The process is deadlocked? IIRC, 0.90.4 had a possible deadlock. You could try 0.90.5.

I took a look at some of the logs. They do not run from server start, because I do not see the ulimit output in there. I'd like to see that.

Looking at dynobuntu10, I see some interesting 'keys':

2011-12-28 15:25:53,297 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: UrlIndex,http://www.hellopeter.com/write_report_preview.php?inclination=1&company=kalahari.net&countryid=168&location=cape town&industryid=14&person=&problem=out of stock&other=&headline=why advertise goods online and you cannot deliver%29&incident=i purchased goods online that were supposedly in stock on the 5th october. 2010. after numerous phone calls i was promised that i would receive the ordered goods on the 20th october 2010. this has not happened to date. i spoke with them today and they promised to answer my queries on 21st october2010. how can you run a online busines ans sell %22we dont have stock%22%3a this is the easy way out as we have no proof of that%0d%0ait is just common curtousy to return a phone call. they have had my money in their bank account for 15 days. this seems like a ****. they could be reaping interest on thousands of peoples money. easy way of making money.%0d%0akalahari. net are in a comfort zone. they need to realize that customers are king%0d%0athey reimburse my money. i paid bank charges and transfer fees. what about this. my unnessessary phone calls. do they reinburse this.%0d%0acome on stop taking the innocent public for a ride with your sweet talk.&incidentcharsleft=270&incident_day_select=21&incident_month_select=10&incident_year_select=2010&incident_hour_select=11&incident_min_select=45&incident_ampm_select=pm&policyno=3573210 %2f3573310 &cellno=%2b27 766881896&preview=preview,1308921597915.1827414390

That's a single key. It looks like you have an issue in your crawler's URL-extraction facility. If you have lots of URLs like the above, my guess is that you have massive indices. Look at a regionserver and see how much RAM the indexes take up.

In dynobuntu12 I see an OOME. Interestingly, the OOME is while trying to read in a file's index on:

2011-12-28 15:26:50,310 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Opening region: REGION => {NAME => 'UrlIndex,http://media.imgbnr.com/images/prep_ct.php?imgfile=4327_567146_7571713_250_300.html&partnerid=113471&appid=35229&subid=&advertiserid=567146&keywordid=42825417&type=11&uuid=e11ac4bea82d42838fde8eb306fbc354&keyword=www.&matchedby=c&ct=cpi&wid=5008233&size=300x250&lid=7571713&cid=230614&cc=us&rc=in&mc=602&dc=0&vt=1275659190365&refurl=mangafox.com&clickdomain=66.45.56.124&pinfo=&rurl=http://javascript,1283006905877', STARTKEY => 'http://media.imgbnr.com/images/prep_ct.php?imgfile=4327_567146_7571713_250_300.html&partnerid=113471&appid=35229&subid=&advertiserid=567146&keywordid=42825417&type=11&uuid=e11ac4bea82d42838fde8eb306fbc354&keyword=www.&matchedby=c&ct=cpi&wid=5008233&size=300x250&lid=7571713&cid=230614&cc=us&rc=in&mc=602&dc=0&vt=1275659190365&refurl=mangafox.com&clickdomain=66.45.56.124&pinfo=&rurl=http://javascript', ENDKEY =>
'http://media.imgbnr.com/images/prep_ct.php?imgfile=6966_567146_7571715_90_728.html&partnerid=113474&appid=35224&subid=&advertiserid=567146&keywordid=42825616&type=11&uuid=6178294088f545ab938c403be5b7c957&keyword=www.&matchedby=c&ct=cpi&wid=5008236&size=728x90&lid=7571715&cid=230615&cc=us&rc=ny&mc=501&dc=0&vt=1275772980357&refurl=worldstarhiphop.com&clickdomain=66.45.56.124&pinfo=&rurl=http://javascript', ENCODED => 1246560666, TABLE => {{NAME => 'UrlIndex', INDEXES => ' indexUrlUID UrlIndex_Family:urluid =org.apache.hadoop.hbase.client.tableindexed.IndexKeyGenerator :com.entelligence.tools.hbase.index.UniqueIndexKeyGeneratorUrlIndex_Family:urluid org.apache.hadoop.io.Writable 0org.apache.hadoop.io.ObjectWritable$NullInstance 'org.apache.hadoop.io.WritableComparable indexHostLocations UrlIndex_Family:hostUrlIndex_Family:locationcodes =org.apache.hadoop.hbase.client.tableindexed.IndexKeyGenerator Corg.apache.hadoop.hbase.client.tableindexed.SimpleIndexKeyGeneratorUrlIndex_Family:host org.apache.hadoop.io.Writable 0org.apache.hadoop.io.ObjectWritable$NullInstance 'org.apache.hadoop.io.WritableComparable indexLocationCodes UrlIndex_Family:locationcodes =org.apache.hadoop.hbase.client.tableindexed.IndexKeyGenerator Corg.apache.hadoop.hbase.client.tableindexed.SimpleIndexKeyGeneratorUrlIndex_Family:locationcodes org.apache.hadoop.io.Writable 0org.apache.hadoop.io.ObjectWritable$NullInstance 'org.apache.hadoop.io.WritableComparable indexHost UrlIndex_Family:host =org.apache.hadoop.hbase.client.tableindexed.IndexKeyGenerator Corg.apache.hadoop.hbase.client.tableindexed.SimpleIndexKeyGeneratorUrlIndex_Family:host org.apache.hadoop.io.Writable 0org.apache.hadoop.io.ObjectWritable$NullInstance 'org.apache.hadoop.io.WritableComparable indexChannelUIDs UrlIndex_Family:channeluids =org.apache.hadoop.hbase.client.tableindexed.IndexKeyGenerator Corg.apache.hadoop.hbase.client.tableindexed.SimpleIndexKeyGeneratorUrlIndex_Family:channeluids org.apache.hadoop.io.Writable 
0org.apache.hadoop.io.ObjectWritable$NullInstance 'org.apache.hadoop.io.WritableComparable', FAMILIES => [{NAME => 'UrlIndex_Family', query => '', datelastspidered => '', path => '', BLOOMFILTER => 'NONE', TTL => '2147483647', datenextspider => '', daylastaccesed => '', host => '', originalurl => '', locationcodes => '', extension => '', IN_MEMORY => 'false', COMPRESSION => 'LZO', VERSIONS => '3', protocol => '', failCount => '', datediscovered => '', contentDiff => '', BLOCKSIZE => '65536', datelastmodified => '', urluid => '', BLOCKCACHE => 'true', channeluids => ''}]}}

What is that INDEXES thing in the above schema? Is that some secondary-indexing thing you have going on?

You might take a look at the files under UrlIndex/1246560666 in the UrlIndex column family. Print out their metadata and see what size indices you have.

See http://hbase.apache.org/book.html#rowkey.design in the book. It has some pointers and some talk on issues you may be running into.

> Attached are my logs from the latest "start-to-crash". There are 3 servers
> and hbase is being used for storing URL's - 7 client servers connect to hbase
> and perform URL Lookups at about 40 requests per second (this is the low load
> over this holiday season). If the URL does not exist, it gets added. The
> Key on the HTable is the URL and there are a few fields stored against it -
> e.g. DateDiscovered, Host, Script, QueryString, etc.

Do you have to scan the URLs in order or by website? If not, you might use a key that is a hash of the URL (and keep the actual URL as column data).

> Each server has a hadoop datanode and an hbase regionserver, and 1 of the
> servers additionally has the namenode, master and zookeeper. On first start,
> each regionserver uses 2 Gigs (usedHeap) and as soon as I restart the
> clients, the usedHeap slowly climbs until it reaches the maxHeap, and shortly
> after that, the regionservers start crashing - sometimes they actually
> shut down gracefully by themselves.
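The hash-of-the-URL suggestion above can be sketched in plain Java (class and method names here are illustrative, not from the thread): MD5-hash each URL into a fixed-width hex row key, so multi-kilobyte URLs like the one in the log no longer bloat region names and block indices, and store the original URL back as column data.

```java
import java.security.MessageDigest;

public class UrlRowKeys {
    // Sketch: derive a fixed-width row key from a URL of any length.
    // The caller would use this as the HBase row key (via Bytes.toBytes)
    // and keep the raw URL in a column such as UrlIndex_Family:originalurl.
    public static String rowKeyFor(String url) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(url.getBytes("UTF-8"));
        StringBuilder hex = new StringBuilder(32);
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString(); // always 32 hex chars, however long the URL
    }

    public static void main(String[] args) throws Exception {
        String longUrl = "http://www.hellopeter.com/write_report_preview.php?inclination=1&company=kalahari.net";
        System.out.println(rowKeyFor(longUrl).length()); // 32
    }
}
```

The trade-off is the one the question above hints at: hashed keys destroy the lexicographic ordering of URLs, so per-site or in-order range scans are no longer possible. That is why this only applies if lookups are point gets.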
Are the URL lookups totally random? If so, turn off the block cache. That'll get you some more memory. Add more servers too, to spread the load, if you can afford it. Things tend to run smoother once you get above 5 servers or so.

> Originally, we had hbase.regionserver.handler.count set to 100 and I have now
> removed that to leave it as default, which has not helped.
>
> We have not made any changes to the clients, and we have a mirrored instance
> of this in our UK Data Centre which is still running 0.20.6 and servicing 10
> clients currently at over 300 requests per second (again low load over the
> holidays), and it is 100% stable.
>
> What do I do now? - your website says I cannot downgrade?

That is right. Let's get this stable again.

St.Ack
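For reference, turning off the block cache for the family shown in the schema dump above would look roughly like this from the hbase shell (a sketch assuming 0.90-era syntax, where the table must be disabled before it can be altered):

```
disable 'UrlIndex'
alter 'UrlIndex', {NAME => 'UrlIndex_Family', BLOCKCACHE => 'false'}
enable 'UrlIndex'
```

This is a per-column-family attribute, so each family whose reads are random would need the same change.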
