On Thu, Jan 27, 2011 at 11:38 AM, Wayne <[email protected]> wrote:
> Should we schedule a rolling restart every 24 hours?

You could do this (see J-D's note).
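
If you went that route, a nightly cron entry on the box you run the
cluster scripts from could drive it.  The below is just a sketch -- the
rolling-restart.sh script location and the paths are assumptions, so
check what your install actually ships in bin/ first:

  # hypothetical crontab entry; rolling-restart the cluster at 03:00 daily
  0 3 * * * /opt/hbase/bin/rolling-restart.sh >> /var/log/hbase-rolling-restart.log 2>&1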

You could try a different JVM (Todd tripped over this beauty recently,
where fixes added in u20 to address fragmentation seem to have made
things worse:
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6999988 --- looking
at the u23 release notes, this issue is not yet fixed).  Not u18 --
that's broken -- but maybe a u16?
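
Swapping JVMs is just a matter of pointing hbase-env.sh at the other
install and restarting.  The path below is an example only:

  # conf/hbase-env.sh -- example path, adjust to wherever the JDK lives
  export JAVA_HOME=/usr/java/jdk1.6.0_16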


> export HBASE_HEAPSIZE=8192
> export HBASE_OPTS="-XX:+UseCMSInitiatingOccupancyOnly
> -XX:CMSInitiatingOccupancyFraction=60 -XX:+CMSParallelRemarkEnabled
> -XX:SurvivorRatio=8 -XX:NewRatio=3 -XX:MaxTenuringThreshold=1
> -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC
> -XX:+CMSIncrementalMode"
> export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails
> -XX:+PrintGCDateStamps -Xloggc:$HBASE_HOME/logs/gc-hbase.log"
>

You don't need CMSIncrementalMode on your class of hardware.

But the one that looks off is -XX:MaxTenuringThreshold.  It struck me
as odd (and struck Ted Dunning as frightening) since it means everything
gets promoted to the old gen after a single young gen GC.  Let the
young gen churn some more before items earn a promotion.  Make it 5 or
7 or something and see if you go longer between full GCs.  An
initiating fraction of 60 is low; it's probably costing you a bunch of
CPU.  Try the -XX:MaxTenuringThreshold change first.
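
For example, a first pass might look like the below -- the same flags
you have now minus CMSIncrementalMode and with the tenuring threshold
bumped (untested; raise the occupancy fraction in a later pass if this
helps):

  export HBASE_HEAPSIZE=8192
  export HBASE_OPTS="-XX:+UseConcMarkSweepGC -XX:+UseCMSInitiatingOccupancyOnly \
    -XX:CMSInitiatingOccupancyFraction=60 -XX:+CMSParallelRemarkEnabled \
    -XX:SurvivorRatio=8 -XX:NewRatio=3 -XX:MaxTenuringThreshold=5 \
    -XX:+HeapDumpOnOutOfMemoryError"
  export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
    -XX:+PrintGCDateStamps -Xloggc:$HBASE_HOME/logs/gc-hbase.log"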

St.Ack

>
> Thanks for your help!
>
>
> On Thu, Jan 27, 2011 at 1:53 PM, Stack <[email protected]> wrote:
>
>> On Thu, Jan 27, 2011 at 6:48 AM, Wayne <[email protected]> wrote:
>> > We have got .90 up and running well, but again after 24 hours of
>> > loading a node went down. Under it all I assume it is a GC issue,
>> > but the GC logging rolls every < 60 minutes so I can never see logs
>> > from 5 hours ago (working on getting Scribe up to solve that).
>>
>> Is it possible to ask that the GC logs roll?  What's the config, please?
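
(For what it's worth, one way I've seen folks keep GC history across
restarts is to timestamp the log file name in hbase-env.sh -- an
illustration only, not something I've tried on your setup:

  export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
    -XX:+PrintGCDateStamps \
    -Xloggc:$HBASE_HOME/logs/gc-hbase-$(date +%Y%m%d%H%M%S).log"

That way each daemon start writes its own file instead of clobbering
the last one.)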
>>
>> > Most of our issues are a node being
>> > marked as dead after being un-responsive. It often starts with a socket
>> > timeout. We can turn up the timeout for zookeeper but that is not dealing
>> > with the issue.
>> >
>> > Here is the first sign of trouble. Is the 1 min 34 second gap below
>> > most likely a stop the world GC?
>> >
>> > 2011-01-27 07:00:43,716 INFO org.apache.hadoop.hbase.regionserver.wal.HLog:
>> > Roll /hbase/.logs/x.x.x.6,60020,1295969329357/x.x.x.6%3A60020.1296111623011,
>> > entries=242, filesize=69508440. New hlog
>> > /hbase/.logs/x.x.x.6,60020,1295969329357/x.x.x.6%3A60020.1296111643436
>> > 2011-01-27 07:02:17,663 WARN org.apache.hadoop.hdfs.DFSClient:
>> > DFSOutputStream ResponseProcessor exception for block
>> > blk_-5705652521953118952_104835java.net.SocketTimeoutException: 69000
>> > millis timeout while waiting for channel to be ready for read. ch :
>> > java.nio.channels.SocketChannel[connected local=/x.x.x.6:48141
>> > remote=/x.x.x.6:50010]
>> >
>>
>> Possibly.  What's the log history like before that?
>>
>> What was going on in the local DataNode at that time?  It seems like
>> we failed talking to it (I'm guessing x.x.x.6 is the local datanode),
>> though it could be because of a GC pause.
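
A quick, hypothetical way to check, assuming your gc-hbase.log still
covers that window -- grep it for the usual CMS trouble markers and for
long 'real=' times (the format assumes -XX:+PrintGCDetails output):

  grep -E "concurrent mode failure|promotion failed|Full GC" \
      $HBASE_HOME/logs/gc-hbase.log
  # anything with a double-digit real= time is a long stop-the-world pause
  grep -E "real=[0-9]{2,}\." $HBASE_HOME/logs/gc-hbase.log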
>>
>>
>> > It is followed by zookeeper complaining due to lack of a response.
>> >
>> > 2011-01-27 07:02:17,665 INFO org.apache.zookeeper.ClientCnxn: Client
>> > session timed out, have not heard from server in 94590ms for sessionid
>> > 0x2dbdc88700000e, closing socket connection and attempting reconnect
>> >
>>
>> This is a strong indicator that there was a long GC pause around this time.
>>
>>
>> > There is also a message about the data node.
>> >
>> > 2011-01-27 07:02:17,665 WARN org.apache.hadoop.hdfs.DFSClient: Error
>> > Recovery for block blk_4267667433820446273_104837 bad datanode[0]
>> > x.x.x.6:50010
>> >
>>
>> This is probably just the DFSClient up in the RegionServer marking the
>> local datanode bad because of the above timeout talking to it
>> (possibly because of a GC pause).  It's marking the first datanode in
>> the pipeline as bad, and it will probably try the next node in the
>> pipeline.
>>
>>
>> > And eventually the node is brought down.
>> >
>> > 2011-01-27 07:02:17,783 FATAL
>> > org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region
>> > server...
>> >
>>
>> Did this happen because the session expired on zk?  You should be able
>> to see that further up in the log.
>>
>>
>> > The data node also shows some errors.
>> >
>> > 2011-01-27 07:02:17,667 ERROR
>> > org.apache.hadoop.hdfs.server.datanode.DataNode:
>> > DatanodeRegistration(x.x.x.6:50010,
>> > storageID=DS-1438948528-x.x.x.6-50010-1295969305669, infoPort=50075,
>> > ipcPort=50020):DataXceiver java.net.SocketException: Connection reset
>> >
>>
>> Oh, you did check the DN logs.
>>
>> So, does this look like the DN logging the RS dropping the connection
>> because of the timeout?  (Check the IPs -- see if it makes sense
>> reading it that way.)
>>
>> >
>> > Any help, advice, ideas, or guesses would be greatly appreciated. Can
>> > anyone sustain 30-40k writes/node/sec for days/weeks on end without
>> > using the bulk loader? Am I rolling a rock uphill against the reality
>> > of the JVM?
>> >
>>
>> You are going to run into the heap fragmentation that brings on
>> stop-the-world GCs.  Some good progress has been made on this
>> high-priority issue recently, see
>> https://issues.apache.org/jira/browse/HBASE-3455, but a 'solution' is
>> still to come.  Even then, my guess is that we would not be able to
>> guarantee never running into a stop-the-world GC.
>>
>> Are you all about writes?  If so, the bulk loader runs one to two
>> orders of magnitude faster than going via the API.  For this reason
>> alone it's worth exploring (you can do incremental bulk loads to a
>> running table).  If reads are rare, you might shrink your heap so the
>> stop-the-worlds run quicker, or you could try giving HBase a really
>> large heap; perhaps the elbow room will mean we avoid fragmentation
>> (but it might make it worse).  Reads, and reading from cache, do not
>> seem to bring on stop-the-world GCs the way writes do.  Are your
>> writes latency sensitive such that you can't tolerate the occasional
>> long pause?  If not, up your zk lease timeouts.
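
If you do go the longer-timeout route, it's the zookeeper.session.timeout
property in hbase-site.xml -- the value below is only an illustration,
and the ZooKeeper quorum's own maximum session timeout caps whatever
the client asks for:

  <property>
    <name>zookeeper.session.timeout</name>
    <value>120000</value>
  </property>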
>>
>> St.Ack
>>
>
