I do not believe GC logging is enabled. I will look into that for the future.
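Since 0.94.10 predates the JvmPauseMonitor mentioned further down the thread, one way to check for whole-JVM stalls in the meantime is a small watchdog thread. The following is only a minimal sketch of that general idea, not HBase's implementation: the class name PauseDetector, the 500 ms sleep interval, and the 1000 ms warning threshold are all hypothetical choices made for illustration.

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical stand-alone pause detector (not the HBase JvmPauseMonitor).
 * A thread sleeps for a fixed interval and checks how much longer than
 * requested it actually slept. A large overshoot suggests the whole JVM
 * (or the VM hosting it) was stalled: GC, swapping, CPU starvation, etc.
 */
public class PauseDetector implements Runnable {
    private final long sleepMillis;
    private final long warnThresholdMillis;
    private volatile boolean running = true;
    private volatile long longestPauseMillis = 0;

    public PauseDetector(long sleepMillis, long warnThresholdMillis) {
        this.sleepMillis = sleepMillis;
        this.warnThresholdMillis = warnThresholdMillis;
    }

    /** Time spent beyond the requested sleep, clamped so it is never negative. */
    public static long extraMillis(long startNanos, long endNanos, long sleepMillis) {
        long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
        return Math.max(0L, elapsedMillis - sleepMillis);
    }

    @Override
    public void run() {
        while (running) {
            long start = System.nanoTime();
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status and exit
                return;
            }
            long extra = extraMillis(start, System.nanoTime(), sleepMillis);
            if (extra > longestPauseMillis) {
                longestPauseMillis = extra;
            }
            if (extra > warnThresholdMillis) {
                System.err.println("Detected pause of approximately " + extra + " ms");
            }
        }
    }

    public void stop() {
        running = false;
    }

    public long longestPauseMillis() {
        return longestPauseMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        PauseDetector detector = new PauseDetector(500, 1000);
        Thread t = new Thread(detector, "pause-detector");
        t.setDaemon(true);
        t.start();
        Thread.sleep(2000); // let the detector run for a few intervals
        detector.stop();
        t.join();
        System.out.println("Longest observed pause: " + detector.longestPauseMillis() + " ms");
    }
}
```

A detector like this cannot say why the process stalled, only that it did. If it reported no long pause during a window in which requests were timing out, that would point away from a stop-the-world GC and toward something else, such as blocked I/O.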
The cluster is 6 machines, all with the same spec. I have not seen any evidence that any other server in the cluster had problems at the same time. There are/were no dead nodes. The master did not seem to notice anything during this time. The issue was detected because requests to a particular RS consistently timed out during the 20 minutes in question.

--Tom

On Tue, Jun 10, 2014 at 12:49 PM, Vladimir Rodionov <[email protected]> wrote:

> 1. Do you have GC logging enabled on your cluster? It does not look like
> GC - pause to me but for future troubleshooting it is better
> to enable GC logging.
>
> 2. How large is your cluster? Did you check NN and DN logs as well? Are
> all your nodes (RS and DN) up and running? No dead nodes?
>
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> e-mail: [email protected]
>
> ________________________________________
> From: Tom Brown [[email protected]]
> Sent: Tuesday, June 10, 2014 11:13 AM
> To: [email protected]
> Subject: Re: Is this a long GC pause, or something else?
>
> We are still using 0.94.10. We are looking at upgrading soon, but have not
> done so yet.
>
> --Tom
>
>
> On Tue, Jun 10, 2014 at 12:10 PM, Ted Yu <[email protected]> wrote:
>
> > Which release are you using ?
> >
> > In 0.98+, there is JvmPauseMonitor.
> >
> > Cheers
> >
> >
> > On Tue, Jun 10, 2014 at 11:05 AM, Tom Brown <[email protected]> wrote:
> >
> > > Last night a regionserver in my cluster stopped responding in a timely
> > > manner for about 20 minutes. I know that stop-the-world GC can cause this
> > > type of behavior, but 20 minutes seems excessive.
> > >
> > > The server is a 2 core VM with 16GB of RAM, (hbase max heap is 12GB). We
> > > are using the latest java 7 from oracle. HDFS is provided by an Isilon
> > > cluster.
> > >
> > > The server workload is read/write: the writing process reads all rows it is
> > > about to write, updates them if they exist, and then writes all the rows
> > > (replacing ones that were updated).
> > >
> > > The last messages before the pause were regarding an HLog roll:
> > >
> > > DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: HLog roll requested
> > > INFO org.apache.hadoop.hbase.util.FSUtils: FileSystem doesn't support getDefaultReplication
> > > INFO org.apache.hadoop.hbase.util.FSUtils: FileSystem doesn't support getDefaultBlockSize
> > >
> > > During the next 20 minutes there were a handful of sporadic LruBlockCache
> > > stats messages but nothing else. After 20 minutes, normal operation
> > > resumed.
> > >
> > > Is 20 minutes for a GC pause expected given the operational load and
> > > machine specs? Could a GC pause include periodic log messages? If it wasn't
> > > a GC pause, what else could it be?
> > >
> > > --Tom
> > >
> >
>
> Confidentiality Notice: The information contained in this message,
> including any attachments hereto, may be confidential and is intended to be
> read only by the individual or entity to whom this message is addressed. If
> the reader of this message is not the intended recipient or an agent or
> designee of the intended recipient, please note that any review, use,
> disclosure or distribution of this message or its attachments, in any form,
> is strictly prohibited. If you have received this message in error, please
> immediately notify the sender and/or [email protected] and
> delete or destroy any copy of this message and its attachments.
>
