Re: YouAreDeadException

2011-01-15 Thread Stack
Yes. Currently, there are two heartbeats: the zk client one, and the hbase one, which we used to rely on to figure out whether a regionserver is alive but which is now just used to post regionserver stats, such as requests per second, to the master. This latter is going away in 0.92 (Pre-0.90.0 reg

Re: YouAreDeadException

2011-01-15 Thread Ted Yu
For #1, I assume I should look for 'received expired from ZooKeeper, aborting'. On Sat, Jan 15, 2011 at 5:02 PM, Ted Yu wrote: > For #1, what string should I look for in region server log? > For #4, what's the rationale behind sending YADE after receiving heartbeat? > I thought heartbeat means

Re: YouAreDeadException

2011-01-15 Thread Ted Yu
For #1, what string should I look for in region server log? For #4, what's the rationale behind sending YADE after receiving heartbeat? I thought heartbeat means the RS is alive. Thanks On Sat, Jan 15, 2011 at 4:49 PM, Stack wrote: > FYI Ted, the YouAreDeadException usually happens in follow

Re: YouAreDeadException

2011-01-15 Thread Stack
FYI Ted, the YouAreDeadException usually happens in the following context: 1. Regionserver has some kinda issue -- long GC pause for instance -- and it stops tickling zk. 2. Master gets zk session expired event. Starts up recovery of the hung region. 3. Regionserver recovers but has not yet processe

Re: YouAreDeadException

2011-01-14 Thread Ted Yu
Thanks for your analysis, Ryan. The dev cluster has half as many nodes as our staging cluster. Each node has half the number of cores as the node in staging. I agree with your conclusion. I will report back after I collect more data - the flow uses hbase heavily toward the end. On Fri, Jan 14, 2

Re: YouAreDeadException

2011-01-14 Thread Ryan Rawson
I'm not seeing much in the way of errors: timeouts, all to one machine ending with .80, so that is probably your failed node. Other than that, the log doesn't seem to say too much. Searching for strings like FATAL and Exception is the way to go here. Also things like this: 2011-01-14 23:38:52,936
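
The log search suggested in this thread could be sketched roughly as below. This is only an illustration, not anything posted by the participants: the function name find_suspect_lines is hypothetical, the FATAL and Exception strings come from Ryan's advice, and the ZooKeeper expiry string is the one Ted quotes. It assumes the region server log is plain text with one entry per line.

```python
# Strings mentioned in the thread as worth searching for in a region server log.
# More specific strings come first so they win when a line matches several.
PATTERNS = (
    "received expired from ZooKeeper, aborting",
    "FATAL",
    "Exception",
)


def find_suspect_lines(lines, patterns=PATTERNS):
    """Return (line_number, matched_pattern, text) for each line containing a pattern.

    `lines` is any iterable of strings, for example an open log file.
    Each line is reported once, under the first pattern in `patterns` it contains.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        for pattern in patterns:
            if pattern in line:
                hits.append((number, pattern, line.rstrip("\n")))
                break
    return hits
```

To use it, open the log and pass the file object, for example `with open(path) as f: hits = find_suspect_lines(f)`, where `path` is wherever the region server log lives on the node in question. Plain substring matching is used on purpose, since the goal is just to locate the abort and the lines around it.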

Re: YouAreDeadException

2011-01-14 Thread Ryan Rawson
This is the cause: org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=sjc1-hadoop1.sjc1.carrieriq.com,60020,1294856823378, load=(requests=0, regions=6, usedHeap=514, maxHeap=3983): regionserver:60020-0x12d7b7b1c760004 regionserver:60020-0x12d7b7b1c760004 received