Everything else notwithstanding, if you see any swap space being used, you need to adjust things to prevent swapping first.

My 2 cents.
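For what it's worth, a quick way to see whether swap is in play on a Linux box (standard commands; adjust the pgrep pattern to however your tserver shows up in ps):

    # Is anything swapped out at all?
    free -m
    swapon -s

    # How much of the tserver itself is swapped (recent kernels expose VmSwap)
    grep VmSwap /proc/$(pgrep -f tserver | head -1)/status

    # Bias the kernel away from paging out the tserver heap...
    sudo sysctl vm.swappiness=1

    # ...or, as Eric suggests below, turn swap off entirely
    sudo swapoff -a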
On Thu, Nov 5, 2015 at 2:12 PM, Eric Newton <[email protected]> wrote:

> Comments inline:
>
> On Thu, Nov 5, 2015 at 2:18 AM, mohit.kaushik <[email protected]>
> wrote:
>
>> I have a 3-node cluster (Accumulo 1.6.3, ZooKeeper 3.4.6) which was
>> working fine before I ran into this issue. Whenever I start writing data
>> with a BatchWriter, the tablet servers lose their locks one by one. In
>> the zookeeper logs I found it repeatedly accepting and closing socket
>> connections for the servers, and the log has endless repetitions of the
>> following lines.
>
> By far, the most common reason why locks are lost is java gc pauses. In
> turn, these pauses are almost always due to memory pressure within the
> entire system. The OS sees a nice big hunk of memory in the tserver and
> swaps it out. Over the years we've tuned various settings to prevent this,
> and other memory-hogging, but if you are pushing the system hard, you may
> have to tune your existing memory settings.
>
> The tserver occasionally prints some gc stats in the debug log. If you see
> a >30s pause between these messages, memory pressure is probably the
> problem.
>
>> 2015-11-05 12:11:23,860 [myid:3] - INFO [NIOServerCxn.Factory:
>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket
>> connection from /192.168.10.124:47503
>> 2015-11-05 12:11:23,861 [myid:3] - INFO [NIOServerCxn.Factory:
>> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] - Processing stat command from /
>> 192.168.10.124:47503
>> 2015-11-05 12:11:23,869 [myid:3] - INFO
>> [Thread-244:NIOServerCnxn$StatCommand@663] - Stat command output
>> 2015-11-05 12:11:23,870 [myid:3] - INFO [Thread-244:NIOServerCnxn@1007]
>> - Closed socket connection for client /192.168.10.124:47503 (no session
>> established for client)
>
> Yes, this is quite annoying: you get these messages when the monitor grabs
> the zookeeper status EVERY 5s. Your monitor is running on 192.168.10.124,
> right?
>
> These messages are expected.
>
>> It looks similar to ZOOKEEPER-832. There is one thread discussing the
>> socket connections, but it does not provide much help in my case:
>> http://mail-archives.apache.org/mod_mbox/accumulo-user/201208.mbox/%3ccam1_12yvaxoe+kq9-qcqtpv1vegpwqvtkhn3ictifw6vq7l...@mail.gmail.com%3E
>>
>> There are no exceptions in the tserver logs; the tablet servers simply
>> lose their locks.
>
> Ah, is it possible the JVM is killing itself because GC overhead is
> climbing too high? You can check the .out (or .err) file for this error.
>
>> I can scan data without any problem/exception. I need to know the cause
>> of the problem and a workaround. Would upgrading resolve the issue, or
>> does it need some configuration changes?
>
> Check all your system processes. I know old versions of the SNMP servers
> would leak resources, putting memory pressure on the system after a few
> months. Check to see if your tserver is approximately the size you need.
> If you aren't already doing it, you will want to monitor system memory/swap
> usage, and see if it correlates to the lost servers. Zookeeper itself is
> also subject to gc pauses, so it can die from the same cause, although
> it's a much smaller process.
>
>> My current zoo.cfg is as follows:
>>
>> clientPort=2181
>> syncLimit=5
>> tickTime=2000
>> initLimit=10
>> maxClientCnxns=100
>
> That's all fine, but you may want to turn on the zookeeper clean-up:
>
> http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_advancedConfiguration
>
> Search for "autopurge".
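(For reference, the clean-up Eric points at is the pair of autopurge settings in zoo.cfg; something like the following keeps three snapshots and purges the rest every 24 hours, tune to taste. ZooKeeper needs a restart to pick them up.)

    autopurge.snapRetainCount=3
    autopurge.purgeInterval=24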
>> I can upload full logs if anyone needs them. Please do let me know if you
>> need any other info.
>
> How much memory is allocated to the various processes? Do you have swap
> turned on? Do you see the delay in the debug GC messages?
>
> You could try turning off swap, so the OS will kill your process instead
> of killing itself. :-)
>
> -Eric

--
Josef Roehrl
Senior Software Developer
*PHEMI Systems*
180-887 Great Northern Way
Vancouver, BC V5T 4T5
604-336-1119
Website <http://www.phemi.com/>
Twitter <https://twitter.com/PHEMISystems>
Linkedin <http://www.linkedin.com/company/3561810?trk=tyah&trkInfo=tarId%3A1403279580554%2Ctas%3Aphemi%20hea%2Cidx%3A1-1-1>
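P.S. One quick way to look for the GC pauses and the GC-overhead error Eric mentions. The file names below are the usual 1.6 defaults and assume ACCUMULO_LOG_DIR is set in your shell; adjust to your install, and note the exact format of the gc stats lines varies by version and collector:

    # Periodic gc stats; long time gaps between consecutive lines suggest long pauses
    grep ' gc ' $ACCUMULO_LOG_DIR/tserver_*.debug.log | tail -20

    # Did the JVM give up because collection overhead climbed too high?
    grep 'GC overhead limit exceeded' $ACCUMULO_LOG_DIR/tserver_*.out $ACCUMULO_LOG_DIR/tserver_*.err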
