On Wed, Dec 28, 2011 at 6:27 AM, Seraph Imalia <[email protected]> wrote:

> After updating from 0.20.6 to 0.90.4, we have been having serious RAM issues.
> I had hbase-env.sh set to use 3 Gigs of RAM with 0.20.6, but with 0.90.4 even
> 4.5 Gigs seems not enough. It does not matter how much load the hbase
> services are under; it just crashes after 24-48 hours.
What kind of a 'crash' is it? Is it an OOME, the JVM seg-faulting, or just a full GC making the RS look like it has gone away?

> The only difference the load makes is how quickly the services crash. Even
> over this holiday season with our lowest load of the year, it crashes just
> after 36 hours of being started. To fix it, I have to run the stop-hbase.sh
> command, wait a while and kill -9 any hbase processes that have stopped
> outputting logs or stopped responding, and then run start-hbase.sh again.

The process is deadlocked? IIRC, 0.90.4 had a possible deadlock. You could try 0.90.5.

I took a look at some of the logs. They do not run from server start, because I do not see the ulimit output in there. I'd like to see that.

Looking at dynobuntu10, I see some interesting 'keys':

2011-12-28 15:25:53,297 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: UrlIndex,http://www.hellopeter.com/write_report_preview.php?inclination=1&company=kalahari.net&countryid=168&location=cape town&industryid=14&person=&problem=out of stock&other=&headline=why advertise goods online and you cannot deliver%29&incident=i purchased goods online that were supposedly in stock on the 5th october. 2010. after numerous phone calls i was promised that i would receive the ordered goods on the 20th october 2010. this has not happened to date. i spoke with them today and they promised to answer my queries on 21st october2010. how can you run a online busines ans sell %22we dont have stock%22%3a this is the easy way out as we have no proof of that%0d%0ait is just common curtousy to return a phone call. they have had my money in their bank account for 15 days. this seems like a ****. they could be reaping interest on thousands of peoples money. easy way of making money.%0d%0akalahari. net are in a comfort zone. they need to realize that customers are king%0d%0athey reimburse my money. i paid bank charges and transfer fees. what about this. my unnessessary phone calls. do they reinburse this.%0d%0acome on stop taking the innocent public for a ride with your sweet talk.&incidentcharsleft=270&incident_day_select=21&incident_month_select=10&incident_year_select=2010&incident_hour_select=11&incident_min_select=45&incident_ampm_select=pm&policyno=3573210 %2f3573310 &cellno=%2b27 766881896&preview=preview,1308921597915.1827414390

That's a single key. It looks like you have an issue in your crawler's URL-extraction facility. If you have lots of URLs like the above, my guess is that you have massive indices. Look at a regionserver and see how much RAM the indexes take up.

In dynobuntu12 I see an OOME. Interestingly, the OOME is while trying to read in a file's index on:

2011-12-28 15:26:50,310 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Opening region: REGION => {NAME => 'UrlIndex,http://media.imgbnr.com/images/prep_ct.php?imgfile=4327_567146_7571713_250_300.html&partnerid=113471&appid=35229&subid=&advertiserid=567146&keywordid=42825417&type=11&uuid=e11ac4bea82d42838fde8eb306fbc354&keyword=www.&matchedby=c&ct=cpi&wid=5008233&size=300x250&lid=7571713&cid=230614&cc=us&rc=in&mc=602&dc=0&vt=1275659190365&refurl=mangafox.com&clickdomain=66.45.56.124&pinfo=&rurl=http://javascript,1283006905877', STARTKEY => 'http://media.imgbnr.com/images/prep_ct.php?imgfile=4327_567146_7571713_250_300.html&partnerid=113471&appid=35229&subid=&advertiserid=567146&keywordid=42825417&type=11&uuid=e11ac4bea82d42838fde8eb306fbc354&keyword=www.&matchedby=c&ct=cpi&wid=5008233&size=300x250&lid=7571713&cid=230614&cc=us&rc=in&mc=602&dc=0&vt=1275659190365&refurl=mangafox.com&clickdomain=66.45.56.124&pinfo=&rurl=http://javascript', ENDKEY =>
'http://media.imgbnr.com/images/prep_ct.php?imgfile=6966_567146_7571715_90_728.html&partnerid=113474&appid=35224&subid=&advertiserid=567146&keywordid=42825616&type=11&uuid=6178294088f545ab938c403be5b7c957&keyword=www.&matchedby=c&ct=cpi&wid=5008236&size=728x90&lid=7571715&cid=230615&cc=us&rc=ny&mc=501&dc=0&vt=1275772980357&refurl=worldstarhiphop.com&clickdomain=66.45.56.124&pinfo=&rurl=http://javascript', ENCODED => 1246560666, TABLE => {{NAME => 'UrlIndex', INDEXES => ' indexUrlUID UrlIndex_Family:urluid =org.apache.hadoop.hbase.client.tableindexed.IndexKeyGenerator :com.entelligence.tools.hbase.index.UniqueIndexKeyGeneratorUrlIndex_Family:urluid org.apache.hadoop.io.Writable 0org.apache.hadoop.io.ObjectWritable$NullInstance 'org.apache.hadoop.io.WritableComparable indexHostLocations UrlIndex_Family:hostUrlIndex_Family:locationcodes =org.apache.hadoop.hbase.client.tableindexed.IndexKeyGenerator Corg.apache.hadoop.hbase.client.tableindexed.SimpleIndexKeyGeneratorUrlIndex_Family:host org.apache.hadoop.io.Writable 0org.apache.hadoop.io.ObjectWritable$NullInstance 'org.apache.hadoop.io.WritableComparable indexLocationCodes UrlIndex_Family:locationcodes =org.apache.hadoop.hbase.client.tableindexed.IndexKeyGenerator Corg.apache.hadoop.hbase.client.tableindexed.SimpleIndexKeyGeneratorUrlIndex_Family:locationcodes org.apache.hadoop.io.Writable 0org.apache.hadoop.io.ObjectWritable$NullInstance 'org.apache.hadoop.io.WritableComparable indexHost UrlIndex_Family:host =org.apache.hadoop.hbase.client.tableindexed.IndexKeyGenerator Corg.apache.hadoop.hbase.client.tableindexed.SimpleIndexKeyGeneratorUrlIndex_Family:host org.apache.hadoop.io.Writable 0org.apache.hadoop.io.ObjectWritable$NullInstance 'org.apache.hadoop.io.WritableComparable indexChannelUIDs UrlIndex_Family:channeluids =org.apache.hadoop.hbase.client.tableindexed.IndexKeyGenerator Corg.apache.hadoop.hbase.client.tableindexed.SimpleIndexKeyGeneratorUrlIndex_Family:channeluids org.apache.hadoop.io.Writable 
0org.apache.hadoop.io.ObjectWritable$NullInstance 'org.apache.hadoop.io.WritableComparable', FAMILIES => [{NAME => 'UrlIndex_Family', query => '', datelastspidered => '', path => '', BLOOMFILTER => 'NONE', TTL => '2147483647', datenextspider => '', daylastaccesed => '', host => '', originalurl => '', locationcodes => '', extension => '', IN_MEMORY => 'false', COMPRESSION => 'LZO', VERSIONS => '3', protocol => '', failCount => '', datediscovered => '', contentDiff => '', BLOCKSIZE => '65536', datelastmodified => '', urluid => '', BLOCKCACHE => 'true', channeluids => ''}]}}

What is that INDEXES thing in the above schema? Is that some secondary-indexing thing you have going on?

You might take a look at the files under UrlIndex/1246560666 in the UrlIndex column family. Print out their metadata and see what size indices you have.

See http://hbase.apache.org/book.html#rowkey.design in the book. It has some pointers and some talk on issues you may be running into.

> Attached are my logs from the latest "start-to-crash". There are 3 servers
> and hbase is being used for storing URL's - 7 client servers connect to hbase
> and perform URL Lookups at about 40 requests per second (this is the low load
> over this holiday season). If the URL does not exist, it gets added. The
> Key on the HTable is the URL and there are a few fields stored against it -
> e.g. DateDiscovered, Host, Script, QueryString, etc.

Do you have to scan the URLs in order or by website? If not, you might use a key that is a hash of the URL (and keep the actual URL as column data).

> Each server has a hadoop datanode and an hbase regionserver, and 1 of the
> servers additionally has the namenode, master and zookeeper. On first start,
> each regionserver uses 2 Gigs (usedHeap) and as soon as I restart the
> clients, the usedHeap slowly climbs until it reaches the maxHeap, and shortly
> after that, the regionservers start crashing - sometimes they actually
> shut down gracefully by themselves.
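The hash-of-the-URL suggestion above can be sketched in plain Java (class and method names here are illustrative, not from the thread): MD5-hash each URL into a fixed-width hex row key, so multi-kilobyte URLs like the one in the log no longer bloat region names and block indices, and store the original URL back as column data.

```java
import java.security.MessageDigest;

public class UrlRowKeys {
    // Sketch: derive a fixed-width row key from a URL of any length.
    // The caller would use this as the HBase row key (via Bytes.toBytes)
    // and keep the raw URL in a column such as UrlIndex_Family:originalurl.
    public static String rowKeyFor(String url) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(url.getBytes("UTF-8"));
        StringBuilder hex = new StringBuilder(32);
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString(); // always 32 hex chars, however long the URL
    }

    public static void main(String[] args) throws Exception {
        String longUrl = "http://www.hellopeter.com/write_report_preview.php?inclination=1&company=kalahari.net";
        System.out.println(rowKeyFor(longUrl).length()); // 32
    }
}
```

The trade-off is the one the question above hints at: hashed keys destroy the lexicographic ordering of URLs, so per-site or in-order range scans are no longer possible. That is why this only applies if lookups are point gets.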
Are the URL lookups totally random? If so, turn off the block cache. That'll get you some more memory. Add more servers too, to spread the load, if you can afford it. Things tend to run smoother once you get above 5 servers or so.

> Originally, we had hbase.regionserver.handler.count set to 100 and I have now
> removed that to leave it as default, which has not helped.
>
> We have not made any changes to the clients, and we have a mirrored instance
> of this in our UK Data Centre which is still running 0.20.6 and servicing 10
> clients currently at over 300 requests per second (again low load over the
> holidays), and it is 100% stable.
>
> What do I do now? - your website says I cannot downgrade?

That is right. Let's get this stable again.

St.Ack
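For reference, turning off the block cache for the family shown in the schema dump above would look roughly like this from the hbase shell (a sketch assuming 0.90-era syntax, where the table must be disabled before it can be altered):

```
disable 'UrlIndex'
alter 'UrlIndex', {NAME => 'UrlIndex_Family', BLOCKCACHE => 'false'}
enable 'UrlIndex'
```

This is a per-column-family attribute, so each family whose reads are random would need the same change.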
