I looked at your logs.  They show you running w/ default ulimit.
Please fix it (see the requirements section, in particular the ulimit and
maxprocs part, http://hbase.apache.org/book.html#ulimit, but make sure
you satisfy all the requirements).  Let us know how it goes after you
change this.
St.Ack


On Wed, Jan 4, 2012 at 4:45 AM, Seraph Imalia <[email protected]> wrote:
>
> On 03 Jan 2012, at 6:39 PM, Stack wrote:
>
>> On Wed, Dec 28, 2011 at 6:27 AM, Seraph Imalia <[email protected]> wrote:
>>> After updating from 0.20.6 to 0.90.4, we have been having serious RAM 
>>> issues.  I had hbase-env.sh set to use 3 Gigs of RAM with 0.20.6 but with 
>>> 0.90.4 even 4.5 Gigs seems not enough.  It does not matter how much load 
>>> the hbase services are under, it just crashes after 24-48 hours.
>>
>> What kind of a 'crash' is it?  Is it OOME, or JVM seg faulting or just
>> a full GC making the RS look like its gone away?
>
> The crash seems slightly different each time (which I suppose is consistent 
> with running out of RAM).  When our monitoring system alerts me to the 
> problem and I log into the 3 servers, sometimes the regionservers on 
> dynobuntu10 and dynobuntu12 have already shut down and the last thing in their 
> logs says that the Shutdown Hook finished.  The regionserver on dynobuntu17 
> (which also has the master running) is usually frozen, with the last item in 
> the log being 10-20 minutes prior.
>
> I then run bin/stop-hbase.sh on dynobuntu17: if the regionservers on 
> dynobuntu10 or dynobuntu12 are still running, sometimes they shut down 
> gracefully, whilst other times the logs just show Shutdown Hook Initiated and 
> then nothing more.  The master then keeps logging which servers it is waiting 
> on to shut down.  I leave it like that for about 5-10 minutes, allowing any 
> processes that are still alive to do as much as they can before I do a kill 
> -9.
>
> That said, for the latest crash: when I logged in, the regionservers on 
> dynobuntu10 and dynobuntu12 had already shut down, and when I ran 
> bin/stop-hbase.sh on the master, everything shut down gracefully (kill -9 was 
> not necessary) - this is the first time it has happened so effortlessly.
>
>>
>>>  The only difference the load makes is how quickly the services crash.  
>>> Even over this holiday season with our lowest load of the year, it crashes 
>>> just after 36 hours of being started.  To fix it, I have to run the 
>>> stop-hbase.sh command, wait a while and kill -9 any hbase processes that 
>>> have stopped outputting logs or stopped responding, and then run 
>>> start-hbase.sh again.
>>
>> The process is deadlocked?   IIRC, 0.90.4 had a possible deadlock.
>> You could try 0.90.5.
>
> Sometimes, yes - my answer above gives more detail.
> Nice - I didn't notice 0.90.5 had been released; I will try that next!
>
>>
>> I took a look at some of the logs.  They do not run from server start
>> because I do not see the ulimit output in there.  I'd like to see
>> that.
>
> Sorry, I see that now :(.  I have put the logs for the last two crashes here 
> (it's 2.5 Megs):  
> https://rapidshare.com/files/4120740991/hbase-last-two-crashes-2012-01-03_2012-01-04.tgz
> One crash was around 19:30 yesterday and the second was at 12:50 today.
>
>> Looking at dynobuntu10, I see some interesting 'keys':
>>
>> 2011-12-28 15:25:53,297 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Received request
>> to open region:
>> UrlIndex,http://www.hellopeter.com/write_report_preview.php?inclination=1&company=kalahari.net&countryid=168&location=cape
>> town&industryid=14&person=&problem=out of stock&other=&headline=why
>> advertise goods online and you cannot deliver%29&incident=i purchased
>> goods online that were supposedly in stock on the 5th october. 2010.
>> after numerous phone calls i was promised that i would receive the
>> ordered goods on the 20th october 2010. this has not happened to date.
>> i spoke with them today and they promised to answer my queries on
>> 21st october2010. how can you run a online busines ans sell %22we dont
>> have stock%22%3a this is the easy way out as we have no proof of
>> that%0d%0ait is just common curtousy to return a phone call. they have
>> had my money in their bank account for 15 days. this seems like a
>> ****. they could be reaping interest on thousands of peoples money.
>> easy way of making money.%0d%0akalahari. net are in a comfort zone.
>> they need to realize that customers are king%0d%0athey reimburse my
>> money. i paid bank charges and transfer fees. what about this. my
>> unnessessary phone calls. do they reinburse this.%0d%0acome on stop
>> taking the innocent public for a ride with your sweet
>> talk.&incidentcharsleft=270&incident_day_select=21&incident_month_select=10&incident_year_select=2010&incident_hour_select=11&incident_min_select=45&incident_ampm_select=pm&policyno=3573210
>> %2f3573310 &cellno=%2b27
>> 766881896&preview=preview,1308921597915.1827414390
>>
>> That's a single key.  It looks like you have an issue in your crawler's
>> URL extraction facility.
>
> Yeah, that URL actually exists, but I can see how it is a problem to use as a 
> key.  Not sure what to do here - perhaps we should exclude URLs like this, or 
> perhaps your hashing idea below will solve it.  I don't really know enough 
> about hashing to make the call, though - is it not possible to run into 
> duplicate keys using e.g. an MD5 hash?  The MD5 hash of the above URL is 
> 8f157d290ceeacedb6c1be133f1ca153 - it seems logical to me that a string that 
> short cannot possibly be unique given that the URL was originally 1431 
> characters long.  What is your opinion on this?  I will be doing some more 
> research on it myself.  Perhaps there is a hash type with fewer collisions 
> that you could suggest for our purposes (keeping in mind that our ad delivery 
> servers will need to hash the URL before querying HBase, so it needs to be 
> fast and not resource intensive)?
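>
> To make sure I am picturing the idea correctly, below is a rough sketch of 
> what I think the write path would look like with an MD5 row key.  The table 
> name UrlIndexHashed is made up; the family and the originalurl qualifier are 
> from our existing schema.  Please correct me if this is not what you meant:
>
>     // Sketch only: hash the URL into a fixed-length row key and keep the
>     // original URL as column data.  "UrlIndexHashed" is a hypothetical table.
>     import java.security.MessageDigest;
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.hbase.HBaseConfiguration;
>     import org.apache.hadoop.hbase.client.HTable;
>     import org.apache.hadoop.hbase.client.Put;
>     import org.apache.hadoop.hbase.util.Bytes;
>
>     public class HashedUrlWriter {
>       public static void main(String[] args) throws Exception {
>         Configuration conf = HBaseConfiguration.create();
>         HTable table = new HTable(conf, "UrlIndexHashed");
>         String url = "http://www.example.com/some/long/url?with=params";
>
>         // The 16-byte MD5 of the URL becomes the row key.
>         byte[] rowKey = MessageDigest.getInstance("MD5").digest(Bytes.toBytes(url));
>
>         Put put = new Put(rowKey);
>         // Keep the full URL as a column so it is still recoverable.
>         put.add(Bytes.toBytes("UrlIndex_Family"), Bytes.toBytes("originalurl"),
>                 Bytes.toBytes(url));
>         table.put(put);
>         table.close();
>       }
>     }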
>
>> If you have lots of URLs like the above, my guess is that you have
>> massive indices.  Look at a regionserver and see how much RAM the
>> indexes take up?
>
> Yeah, it looks pretty high - it is currently half the maxHeap on a fresh start...
>
> Below is what it is now; I have just disabled the block cache after the last 
> crash, as you suggested, to try to keep it stable until we have a real fix.  
> With the block cache at the default of 25% (1 Gig) and the index size at 
> around 2 Gigs, that only leaves 1 Gig for everything else :( which is not much.
>
> dynobuntu10: requests=93, regions=225, stores=225, storefiles=354, 
> storefileIndexSize=2239, memstoreSize=35, compactionQueueSize=0, 
> flushQueueSize=0, usedHeap=2639, maxHeap=4087, blockCacheSize=0, 
> blockCacheFree=0, blockCacheCount=0, blockCacheHitCount=0, 
> blockCacheMissCount=0, blockCacheEvictedCount=0, blockCacheHitRatio=0, 
> blockCacheHitCachingRatio=0
>
> dynobuntu12: requests=305, regions=225, stores=225, storefiles=435, 
> storefileIndexSize=2004, memstoreSize=31, compactionQueueSize=0, 
> flushQueueSize=0, usedHeap=2321, maxHeap=4087, blockCacheSize=0, 
> blockCacheFree=0, blockCacheCount=0, blockCacheHitCount=0, 
> blockCacheMissCount=0, blockCacheEvictedCount=0, blockCacheHitRatio=0, 
> blockCacheHitCachingRatio=0
>
> dynobuntu17: requests=51, regions=226, stores=226, storefiles=410, 
> storefileIndexSize=2046, memstoreSize=36, compactionQueueSize=0, 
> flushQueueSize=0, usedHeap=2927, maxHeap=4087, blockCacheSize=0, 
> blockCacheFree=0, blockCacheCount=0, blockCacheHitCount=0, 
> blockCacheMissCount=0, blockCacheEvictedCount=0, blockCacheHitRatio=0, 
> blockCacheHitCachingRatio=0
>
>>
>> In dynobuntu12 I see an OOME.  Interestingly, the OOME is while
>> trying to read in a file's index for:
>>
>> 2011-12-28 15:26:50,310 DEBUG
>> org.apache.hadoop.hbase.regionserver.HRegion: Opening region: REGION
>> => {NAME => 
>> 'UrlIndex,http://media.imgbnr.com/images/prep_ct.php?imgfile=4327_567146_7571713_250_300.html&partnerid=113471&appid=35229&subid=&advertiserid=567146&keywordid=42825417&type=11&uuid=e11ac4bea82d42838fde8eb306fbc354&keyword=www.&matchedby=c&ct=cpi&wid=5008233&size=300x250&lid=7571713&cid=230614&cc=us&rc=in&mc=602&dc=0&vt=1275659190365&refurl=mangafox.com&clickdomain=66.45.56.124&pinfo=&rurl=http://javascript,1283006905877',
>> STARTKEY => 
>> 'http://media.imgbnr.com/images/prep_ct.php?imgfile=4327_567146_7571713_250_300.html&partnerid=113471&appid=35229&subid=&advertiserid=567146&keywordid=42825417&type=11&uuid=e11ac4bea82d42838fde8eb306fbc354&keyword=www.&matchedby=c&ct=cpi&wid=5008233&size=300x250&lid=7571713&cid=230614&cc=us&rc=in&mc=602&dc=0&vt=1275659190365&refurl=mangafox.com&clickdomain=66.45.56.124&pinfo=&rurl=http://javascript',
>> ENDKEY => 
>> 'http://media.imgbnr.com/images/prep_ct.php?imgfile=6966_567146_7571715_90_728.html&partnerid=113474&appid=35224&subid=&advertiserid=567146&keywordid=42825616&type=11&uuid=6178294088f545ab938c403be5b7c957&keyword=www.&matchedby=c&ct=cpi&wid=5008236&size=728x90&lid=7571715&cid=230615&cc=us&rc=ny&mc=501&dc=0&vt=1275772980357&refurl=worldstarhiphop.com&clickdomain=66.45.56.124&pinfo=&rurl=http://javascript',
>> ENCODED => 1246560666, TABLE => {{NAME => 'UrlIndex', INDEXES =>
>> '  indexUrlUID  
>> UrlIndex_Family:urluid=org.apache.hadoop.hbase.client.tableindexed.IndexKeyGenerator:com.entelligence.tools.hbase.index.UniqueIndexKeyGenerator
>>  UrlIndex_Family:urluid 
>> org.apache.hadoop.io.Writable0org.apache.hadoop.io.ObjectWritable$NullInstance'org.apache.hadoop.io.WritableComparable
>>  indexHostLocations  UrlIndex_Family:host 
>> UrlIndex_Family:locationcodes=org.apache.hadoop.hbase.client.tableindexed.IndexKeyGeneratorCorg.apache.hadoop.hbase.client.tableindexed.SimpleIndexKeyGenerator
>>  UrlIndex_Family:host 
>> org.apache.hadoop.io.Writable0org.apache.hadoop.io.ObjectWritable$NullInstance'org.apache.hadoop.io.WritableComparable
>>  indexLocationCodes  
>> UrlIndex_Family:locationcodes=org.apache.hadoop.hbase.client.tableindexed.IndexKeyGeneratorCorg.apache.hadoop.hbase.client.tableindexed.SimpleIndexKeyGenerator
>>  UrlIndex_Family:locationcodes 
>> org.apache.hadoop.io.Writable0org.apache.hadoop.io.ObjectWritable$NullInstance'org.apache.hadoop.io.WritableComparable
>>    indexHost  
>> UrlIndex_Family:host=org.apache.hadoop.hbase.client.tableindexed.IndexKeyGeneratorCorg.apache.hadoop.hbase.client.tableindexed.SimpleIndexKeyGenerator
>>  UrlIndex_Family:host 
>> org.apache.hadoop.io.Writable0org.apache.hadoop.io.ObjectWritable$NullInstance'org.apache.hadoop.io.WritableComparable
>>  indexChannelUIDs  
>> UrlIndex_Family:channeluids=org.apache.hadoop.hbase.client.tableindexed.IndexKeyGeneratorCorg.apache.hadoop.hbase.client.tableindexed.SimpleIndexKeyGenerator
>>  UrlIndex_Family:channeluids 
>> org.apache.hadoop.io.Writable0org.apache.hadoop.io.ObjectWritable$NullInstance'org.apache.hadoop.io.WritableComparable',
>> FAMILIES => [{NAME => 'UrlIndex_Family', query => '', datelastspidered
>> => '', path => '', BLOOMFILTER => 'NONE', TTL => '2147483647',
>> datenextspider => '', daylastaccesed => '', host => '', originalurl =>
>> '', locationcodes => '', extension => '', IN_MEMORY => 'false',
>> COMPRESSION => 'LZO', VERSIONS => '3', protocol => '', failCount =>
>> '', datediscovered => '', contentDiff => '', BLOCKSIZE => '65536',
>> datelastmodified => '', urluid => '', BLOCKCACHE => 'true',
>> channeluids => ''}]}}
>>
>>
>> What is that INDEXES thing in the above schema?  Is that some
>> secondary indexing thing you have going on?
>
> Yes, it is a secondary index we created.  Basically, we serve ads and ads are 
> queued per URL.  When we discover a new URL, we add it to HBase and give it a 
> GUID, which is stored as a column.  Other servers build lists of ads for each 
> URL and store them against the GUID of the URL.  So when a request comes in 
> for ads, we use HBase to look up the URL and get the GUID so that we then know 
> which ads to show.  This is all handled by the main UrlIndex table.  BUT, 
> sometimes, and far less often, we have a situation where we have the GUID but 
> need to look up the URL - so we have another table where the GUID is the 
> rowKey and the URL is a column.  We also do this for hosts and for channels 
> (a channel is what we call a place where ads show).
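>
> In code terms, the hot lookup on the ad servers boils down to roughly this 
> (a simplified sketch rather than our actual code; the UrlIndex_Family:urluid 
> column is the real one from the schema above, everything else is left out):
>
>     // Simplified ad-server lookup: URL in, GUID out (null if the URL is unknown).
>     import org.apache.hadoop.hbase.client.Get;
>     import org.apache.hadoop.hbase.client.HTable;
>     import org.apache.hadoop.hbase.client.Result;
>     import org.apache.hadoop.hbase.util.Bytes;
>
>     public class UrlToGuidLookup {
>       public static byte[] lookup(HTable urlIndex, String url) throws Exception {
>         Get get = new Get(Bytes.toBytes(url));   // today the row key is the raw URL
>         get.addColumn(Bytes.toBytes("UrlIndex_Family"), Bytes.toBytes("urluid"));
>         Result result = urlIndex.get(get);
>         return result.getValue(Bytes.toBytes("UrlIndex_Family"), Bytes.toBytes("urluid"));
>       }
>     }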
>
>>
>> You might take a look at the files under UrlIndex/1246560666 in the
>> UrlIndex column family..... Print out their meta data and see what
>> size indices you have.   See
>> http://hbase.apache.org/book.html#rowkey.design in the book.  It has
>> some pointers and some talk on issues you may be running into.
>>
>
> root@dynobuntu17:/opt/hadoop-0.20.2# bin/hadoop fs -ls 
> /hbase/UrlIndex/1246560666/UrlIndex_Family
> Found 1 items
> -rw-r--r--   2 root supergroup  199115380 2010-09-10 01:52 
> /hbase/UrlIndex/1246560666/UrlIndex_Family/6442966743799940481
> root@dynobuntu17:/opt/hadoop-0.20.2#
>
> After reading that, it seems clear we need to make some minor changes to our 
> table design.  Unfortunately, it means creating a new table and copying the 
> rows across - not a fun process because we can't be down whilst doing it, so 
> we'll have to write some good code to ease the migration; doable, but not 
> fun.  I am hoping that upgrading to 0.90.5 and disabling the block cache buys 
> us about a month so we have time to plan it properly.
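>
> The copy itself I am picturing roughly like this (just a sketch, not real 
> code - the destination table name UrlIndexHashed is made up, and we would 
> run it against the live table in batches):
>
>     // Rough sketch of the one-off copy: scan the existing UrlIndex table and
>     // rewrite every row into a new table under an MD5 row key.
>     import java.security.MessageDigest;
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.hbase.HBaseConfiguration;
>     import org.apache.hadoop.hbase.KeyValue;
>     import org.apache.hadoop.hbase.client.HTable;
>     import org.apache.hadoop.hbase.client.Put;
>     import org.apache.hadoop.hbase.client.Result;
>     import org.apache.hadoop.hbase.client.ResultScanner;
>     import org.apache.hadoop.hbase.client.Scan;
>     import org.apache.hadoop.hbase.util.Bytes;
>
>     public class UrlIndexCopy {
>       public static void main(String[] args) throws Exception {
>         Configuration conf = HBaseConfiguration.create();
>         HTable source = new HTable(conf, "UrlIndex");
>         HTable dest = new HTable(conf, "UrlIndexHashed");    // hypothetical new table
>         MessageDigest md5 = MessageDigest.getInstance("MD5");
>
>         Scan scan = new Scan();
>         scan.setCaching(500);                                // fetch rows in batches
>         ResultScanner scanner = source.getScanner(scan);
>         for (Result row : scanner) {
>           byte[] newKey = md5.digest(row.getRow());          // the old key is the URL itself
>           Put put = new Put(newKey);
>           for (KeyValue kv : row.raw()) {                    // carry every cell across unchanged
>             put.add(kv.getFamily(), kv.getQualifier(), kv.getTimestamp(), kv.getValue());
>           }
>           // Keep the old row key (the URL) as column data in the new table.
>           put.add(Bytes.toBytes("UrlIndex_Family"), Bytes.toBytes("originalurl"), row.getRow());
>           dest.put(put);
>         }
>         scanner.close();
>         source.close();
>         dest.close();
>       }
>     }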
>
>>
>>> Attached are my logs from the latest "start-to-crash".  There are 3 servers 
>>> and HBase is being used for storing URLs - 7 client servers connect to 
>>> HBase and perform URL lookups at about 40 requests per second (this is the 
>>> low load over this holiday season).  If the URL does not exist, it gets 
>>> added.  The key on the HTable is the URL and there are a few fields stored 
>>> against it - e.g. DateDiscovered, Host, Script, QueryString, etc.
>>>
>>
>> Do you have to scan the URLs in order or by website?  If not, you
>> might have a key that is a hash of the URL (and keep actual URL as
>> column data).
>
> Yes, sometimes we need to do scans like that - but only for a manual 
> investigation, not during normal operation.  We may be able to get by as long 
> as we can come up with a plan for how to find the URLs for a particular 
> website (one idea is sketched below).  I am still concerned about the 
> uniqueness of a hash.  I see there are lots of different hashes.  Will there 
> be uniqueness issues?  We can't have two URLs ending up with the same hash.
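>
> One idea for keeping per-website scans possible: build the row key as the 
> host, a zero-byte delimiter, then the MD5 of the full URL, so all the rows 
> for one site stay next to each other and can be found with a prefix scan.  
> Again just a sketch with made-up class names, not something we have written:
>
>     // Sketch: composite row key = host + 0x00 delimiter + 16-byte MD5 of the
>     // full URL.  Rows for one website stay contiguous in the table.
>     import java.net.URL;
>     import java.security.MessageDigest;
>     import org.apache.hadoop.hbase.client.HTable;
>     import org.apache.hadoop.hbase.client.Result;
>     import org.apache.hadoop.hbase.client.ResultScanner;
>     import org.apache.hadoop.hbase.client.Scan;
>     import org.apache.hadoop.hbase.filter.PrefixFilter;
>     import org.apache.hadoop.hbase.util.Bytes;
>
>     public class HostPrefixedKeys {
>       static byte[] rowKey(String url) throws Exception {
>         String host = new URL(url).getHost();
>         byte[] hash = MessageDigest.getInstance("MD5").digest(Bytes.toBytes(url));
>         return Bytes.add(Bytes.add(Bytes.toBytes(host), new byte[] { 0 }), hash);
>       }
>
>       // Manual investigation: list all the row keys for one website.
>       static void scanHost(HTable table, String host) throws Exception {
>         byte[] prefix = Bytes.add(Bytes.toBytes(host), new byte[] { 0 });
>         Scan scan = new Scan(prefix);                 // start at the first row for this host
>         scan.setFilter(new PrefixFilter(prefix));     // ignore rows once we leave the prefix
>         ResultScanner scanner = table.getScanner(scan);
>         for (Result row : scanner) {
>           System.out.println(Bytes.toStringBinary(row.getRow()));
>         }
>         scanner.close();
>       }
>     }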
>
>>
>>> Each server has a Hadoop datanode and an HBase regionserver, and one of the 
>>> servers additionally has the namenode, master and zookeeper.  On first 
>>> start, each regionserver uses 2 Gigs (usedHeap) and as soon as I restart 
>>> the clients, the usedHeap slowly climbs until it reaches the maxHeap and 
>>> shortly after that, the regionservers start crashing - sometimes they 
>>> actually shut down gracefully by themselves.
>>>
>>
>>
>> Are the URL lookups totally random?  If so, turn off the block cache.
>> That'll get you some more memory.
>
> Yes, pretty random, and as we grow it will get more random.  I have disabled 
> the block cache, and looking at the heap stats I pasted above, it seems like 
> it will buy us some time to make some long-term changes - I will keep you 
> updated here.
>
>>
>> Add more servers too to spread the load if you can afford it.  Things
>> tend to run smoother once you get above 5 servers or so.
>
> We currently have 4 instances of HBase - two of them have 5 servers each and 
> are used for ad delivery log storage, and the other two, with 3 servers each, 
> are used for URL lookups.  I will struggle to get our finance guys to approve 
> more servers for HBase, but if that is my only option I will definitely try :)
>
> Coincidentally, the 2 instances used for ad delivery log storage are down at 
> the moment, but that is because we are having stability issues with Ubuntu 
> Server - they periodically do a memory dump and shut down, even if nothing 
> is running on them.  I have to tackle that problem pretty soon too.  In the 
> meantime, MySQL is taking up the slack, but we will quickly run into 
> performance issues if we don't fix it.  But anyway, at the moment I don't 
> need your help with those servers because it does not seem to be HBase or 
> Hadoop causing the crashes.  I am tackling this URL server problem first.
>
>>
>>> Originally, we had hbase.regionserver.handler.count set to 100 and I have 
>>> now removed that to leave it at the default, which has not helped.
>>>
>>> We have not made any changes to the clients, and we have a mirrored instance 
>>> of this in our UK data centre which is still running 0.20.6 and currently 
>>> servicing 10 clients at over 300 requests per second (again, low load over 
>>> the holidays), and it is 100% stable.
>>>
>>> What do I do now?  Your website says I cannot downgrade?
>>>
>>
>> That is right.
>>
>> Lets get this stable again.
>>
>> St.Ack
>
> Thanks for your help so far.  I have already disabled the block cache (which 
> I am sure will show an immediate improvement) and I will schedule an upgrade 
> of HBase to 0.90.5 during this week and then monitor it.  If you do have some 
> knowledge about the uniqueness of an MD5 hash, please share it with me if you 
> have the time - it will help me whilst I plan the changes we need to make to 
> the table structure.
>
> Regards,
> Seraph
>
