Eric/Josef,
The issue is resolved now. You were right: I think the OS swapped out the
tservers because GC was not working properly. It had a conflicting port with
some other service, as I recently made some changes, and I have also
increased the GC heap memory limit. And yes, my Monitor was running on
192.168.10.124 :) .
Thanks
On 11/05/2015 07:46 PM, Josef Roehrl - PHEMI wrote:
Everything else notwithstanding, if you see any swap space being
used, you need to adjust things to prevent swapping first.
My 2 cents.
On Thu, Nov 5, 2015 at 2:12 PM, Eric Newton wrote:
Comments inline:
On Thu, Nov 5, 2015 at 2:18 AM, mohit.kaushik wrote:
I have a 3-node cluster (Accumulo-1.6.3, ZooKeeper 3.4.6)
which was working fine before I ran into this issue. Whenever
I start writing data with a BatchWriter, the tablet servers lose
their locks one by one. In the ZooKeeper logs I found socket
connections for the servers being repeatedly opened and closed,
and the log has endless repetitions of the following lines.
By far the most common reason locks are lost is Java GC
pauses. In turn, these pauses are almost always due to memory
pressure within the entire system. The OS sees a nice big hunk of
memory in the tserver and swaps it out. Over the years we've tuned
various settings to prevent this, and other memory-hogging, but if
you are pushing the system hard, you may have to tune your
existing memory settings.
The tserver occasionally prints some gc stats in the debug log. If
you see a >30s pause between these messages, memory pressure is
probably the problem.
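A rough way to check for such pauses is to scan the debug log for the gaps between consecutive GC lines. The sketch below is only an illustration: it assumes each line starts with a timestamp in the same `2015-11-05 12:11:23,860` form seen in the ZooKeeper log in this thread, and that GC stat lines can be recognised by the substring "gc". The function name `find_gaps` and the sample lines are hypothetical, so adjust the marker and format to whatever your tserver actually prints.

```python
import re
import sys
from datetime import datetime

# Assumed timestamp prefix: "YYYY-MM-DD HH:MM:SS,mmm" (log4j-style).
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})")


def find_gaps(lines, threshold_s=30.0, marker="gc"):
    """Return (previous, current, gap_seconds) for consecutive marker
    lines that are more than threshold_s seconds apart."""
    gaps = []
    prev = None
    for line in lines:
        m = TS.match(line)
        # Skip lines without a timestamp or without the (assumed) GC marker.
        if not m or marker not in line.lower():
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        ts = ts.replace(microsecond=int(m.group(2)) * 1000)
        if prev is not None:
            gap = (ts - prev).total_seconds()
            if gap > threshold_s:
                gaps.append((prev, ts, gap))
        prev = ts
    return gaps


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_gaps.py /path/to/tserver.debug.log
    with open(sys.argv[1]) as f:
        for before, after, gap in find_gaps(f):
            print(f"{gap:.1f}s gap between {before} and {after}")
```

Any gap it reports well above 30s is worth comparing against the time the tserver lost its lock.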
2015-11-05 12:11:23,860 [myid:3] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /192.168.10.124:47503
2015-11-05 12:11:23,861 [myid:3] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing stat command from /192.168.10.124:47503
2015-11-05 12:11:23,869 [myid:3] - INFO [Thread-244:NIOServerCnxn$StatCommand@663] - Stat command output
2015-11-05 12:11:23,870 [myid:3] - INFO [Thread-244:NIOServerCnxn@1007] - Closed socket connection for client /192.168.10.124:47503 (no session established for client)
Yes, this is quite annoying: you get these messages when the
monitor grabs the zookeeper status EVERY 5s. Your monitor is
running on 192.168.10.124, right?
These messages are expected.
It looks similar to ZOOKEEPER-832, if that is what it is. There
is one thread discussing socket connections, but it does not
provide much help in my case:
http://mail-archives.apache.org/mod_mbox/accumulo-user/201208.mbox/%3ccam1_12yvaxoe+kq9-qcqtpv1vegpwqvtkhn3ictifw6vq7l...@mail.gmail.com%3E
There are no exceptions in the tserver logs; the tablet servers
simply lose their locks.
Ah, is it possible the JVM is killing itself because GC overhead
is climbing too high? You can check the .out (or .err) file for
this error.
I can scan data without any problem/exception. I need to know
the cause of the problem and a workaround. Would upgrading
resolve the issue, or does it need some configuration
changes?
Check all your system processes. I know old versions of the SNMP
servers would leak resources, putting memory pressure on the
system after a few months. Check to see if your tserver is
approximately the size you need. If you aren't already doing it,
you will want to monitor system memory/swap usage, and see if it
correlates to the lost servers. Zookeeper itself is also subject
to gc pauses, so they can die from the same cause, although it's a
much smaller process.
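For the memory/swap monitoring, a minimal sketch on Linux is to read /proc/meminfo and compute the swap in use. This assumes a Linux host where /proc/meminfo exposes `SwapTotal` and `SwapFree` in kB; the helper names `parse_meminfo` and `swap_used_kb` are made up for the illustration. Run it from cron (or any poller) and compare the timestamps with the times the servers were lost.

```python
import os
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value}; most values are in kB."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def swap_used_kb(info):
    """Swap currently in use, in kB (0 if the fields are missing)."""
    return info.get("SwapTotal", 0) - info.get("SwapFree", 0)


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    # One line per sample, easy to line up against the tserver logs.
    print(time.strftime("%Y-%m-%d %H:%M:%S"),
          "swap_used_kB=%d" % swap_used_kb(info),
          "mem_free_kB=%d" % info.get("MemFree", 0))
```

Any non-zero and growing `swap_used_kB` on a tserver host is the signal to look for.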
My current zoo.cfg is as follows.
clientPort=2181
syncLimit=5
tickTime=2000
initLimit=10
maxClientCnxn=100
That's all fine, but you may want to turn on the zookeeper clean-up:
http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_advancedConfiguration
Search for "autopurge".
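As a sketch, the clean-up is enabled with two extra lines in zoo.cfg. The values below are only an example, not a recommendation for this cluster: `autopurge.snapRetainCount` is the number of recent snapshots (and their transaction logs) to keep, and `autopurge.purgeInterval` is the purge interval in hours, where the default of 0 leaves purging disabled.

```
# Keep the 3 most recent snapshots and matching transaction logs
autopurge.snapRetainCount=3
# Run the purge task every hour (0 disables it)
autopurge.purgeInterval=1
```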
I can upload the full logs if anyone needs them. Please do let
me know if you need any other info.
How much memory is allocated to the various processes? Do you have
swap turned on? Do you see the delay in the debug GC messages?
You could try turning off swap, so the OS will kill your process
instead of killing itself. :-)
-Eric
--
Josef Roehrl
Senior Software Developer
*PHEMI Systems*
180-887 Great Northern Way
Vancouver, BC V5T 4T5
604-336-1119