Re: Unable to write data, tablet servers lose their locks

mohit.kaushik Fri, 06 Nov 2015 03:25:29 -0800

 Eric/Josef,

The issue is resolved now. You were right: I think the OS swapped out the tservers as GC was not working properly. It had a conflicting port with some other service, as I recently made some changes, and I have also increased the GC heap memory limit. And yes, my Monitor was running on 192.168.10.124 :) .


Thanks

On 11/05/2015 07:46 PM, Josef Roehrl - PHEMI wrote:

Everything else notwithstanding, if you see any swap space being used, you need to adjust things to prevent swapping first.


My 2 cents.

On Thu, Nov 5, 2015 at 2:12 PM, Eric Newton <[email protected]> wrote:


    Comments inline:

    On Thu, Nov 5, 2015 at 2:18 AM, mohit.kaushik <[email protected]> wrote:


        I have a 3-node cluster (Accumulo 1.6.3, ZooKeeper 3.4.6)
        which was working fine before I ran into this issue. Whenever
        I start writing data with a BatchWriter, the tablet servers lose
        their locks one by one. In the ZooKeeper logs I found socket
        connections from the servers being repeatedly accepted and
        closed, and the log has endless repetitions of the following
        lines.


    By far the most common reason locks are lost is Java GC pauses.
    In turn, these pauses are almost always due to memory pressure
    within the entire system. The OS sees a nice big hunk of memory in
    the tserver and swaps it out. Over the years we've tuned various
    settings to prevent this, and other memory-hogging, but if you are
    pushing the system hard, you may have to tune your existing memory
    settings.

    The tserver occasionally prints some gc stats in the debug log. If
    you see a >30s pause between these messages, memory pressure is
    probably the problem.
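A rough way to look for such gaps is sketched below. It assumes each debug log line starts with a timestamp in the same "YYYY-MM-DD HH:MM:SS,mmm" form as the ZooKeeper excerpt in this thread, and that the GC stat lines can be picked out by a substring; the `marker` value and the `find_gaps` name are illustrative, not anything Accumulo defines.

```python
from datetime import datetime, timedelta

# Assumption: lines begin with "YYYY-MM-DD HH:MM:SS,mmm" (23 characters).
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def find_gaps(lines, marker="gc", threshold=timedelta(seconds=30)):
    """Return (previous, current) timestamp pairs more than `threshold` apart.

    Only lines containing `marker` are considered; `marker` is a
    hypothetical filter and should be adjusted to the real log format.
    """
    gaps = []
    previous = None
    for line in lines:
        if marker not in line:
            continue
        try:
            current = datetime.strptime(line[:23], TS_FORMAT)
        except ValueError:
            continue  # no leading timestamp on this line, skip it
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current))
        previous = current
    return gaps
```

Passing an open log file to `find_gaps` and printing the result would list each pair of consecutive matching lines that are more than 30 seconds apart.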


        2015-11-05 12:11:23,860 [myid:3] - INFO
        [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] -
        Accepted socket connection from /192.168.10.124:47503
        2015-11-05 12:11:23,861 [myid:3] - INFO
        [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@827] -
        Processing stat command from /192.168.10.124:47503
        2015-11-05 12:11:23,869 [myid:3] - INFO
        [Thread-244:NIOServerCnxn$StatCommand@663] - Stat command output
        2015-11-05 12:11:23,870 [myid:3] - INFO
        [Thread-244:NIOServerCnxn@1007] - Closed socket connection for
        client /192.168.10.124:47503 (no session established for client)


    Yes, this is quite annoying: you get these messages when the
    monitor grabs the zookeeper status EVERY 5s.  Your monitor is
    running on 192.168.10.124, right?

    These messages are expected.
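For illustration, the log pattern above (connection accepted, "stat" processed, socket closed with no session) is what any client issuing ZooKeeper's four-letter "stat" command produces. The sketch below sends that command by hand; the `zk_four_letter` helper name is made up for this example, and the host is whatever ZooKeeper server you want to query.

```python
import socket


def zk_four_letter(host, port=2181, command=b"stat", timeout=5.0):
    """Send a ZooKeeper four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

Each call to `zk_four_letter("some-zk-host")` should add one accept/stat/close group to the ZooKeeper log, which is why a monitor polling every 5s fills the log with them.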

        I found it similar to ZOOKEEPER-832, if that is what it is.
        There is one thread discussing the socket connections, but it
        does not provide much help in my case:
        http://mail-archives.apache.org/mod_mbox/accumulo-user/201208.mbox/%3ccam1_12yvaxoe+kq9-qcqtpv1vegpwqvtkhn3ictifw6vq7l...@mail.gmail.com%3E

        There are no exceptions in the tserver logs and the tablet
        servers simply lose their locks.


    Ah, is it possible the JVM is killing itself because GC overhead
    is climbing too high? You can check the .out (or .err) file for
    this error.
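On HotSpot JVMs this condition is reported as "java.lang.OutOfMemoryError: GC overhead limit exceeded". A minimal sketch for scanning a .out or .err file for any OutOfMemoryError line follows; the `find_oom_lines` name is only illustrative, and the file path is whatever your installation uses.

```python
def find_oom_lines(lines):
    """Return (line_number, text) for lines mentioning java.lang.OutOfMemoryError."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if "java.lang.OutOfMemoryError" in line
    ]
```

Calling it on an open file, for example `find_oom_lines(open(path))`, returns an empty list when the JVM did not report running out of memory.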

        I can scan data without any problem/exception. I need to know
        the cause of the problem and a workaround. Would upgrading
        resolve the issue, or does it need some configuration
        changes?


    Check all your system processes. I know old versions of the SNMP
    servers would leak resources, putting memory pressure on the
    system after a few months.  Check to see if your tserver is
    approximately the size you need. If you aren't already doing it,
    you will want to monitor system memory/swap usage, and see if it
    correlates to the lost servers.  Zookeeper itself is also subject
    to gc pauses, so they can die from the same cause, although it's a
    much smaller process.
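One simple way to track swap usage on Linux is to read /proc/meminfo periodically and log the difference between SwapTotal and SwapFree. The sketch below assumes a Linux host; the `swap_used_kb` helper is a made-up name for this example.

```python
def swap_used_kb(meminfo_text):
    """Return swap in use, in kB, given the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            fields[key.strip()] = int(parts[0])  # first token is the value
    return fields["SwapTotal"] - fields["SwapFree"]
```

For example, `swap_used_kb(open("/proc/meminfo").read())` gives the current figure; logging it with a timestamp every minute or so makes it possible to line the values up against the times the tservers lost their locks.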

        My current zoo.cfg is as follows.

        clientPort=2181
        syncLimit=5
        tickTime=2000
        initLimit=10
        maxClientCnxn=100


    That's all fine, but you may want to turn on the zookeeper clean-up:

    
    http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_advancedConfiguration


    Search for "autopurge".
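As a sketch, the two autopurge settings could be added to the zoo.cfg shown above like this. The values are only illustrative: autopurge.snapRetainCount is the number of snapshots (and matching transaction logs) to keep, and autopurge.purgeInterval is the purge interval in hours, where 0 leaves the clean-up disabled.

```
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
```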


        I can upload full logs if anyone needs them. Please do let me
        know if you need any other info.


    How much memory is allocated to the various processes? Do you have
    swap turned on? Do you see the delay in the debug GC messages?

    You could try turning off swap, so the OS will kill your process
    instead of killing itself. :-)

    -Eric




--

Josef Roehrl
Senior Software Developer
*PHEMI Systems*
180-887 Great Northern Way
Vancouver, BC V5T 4T5
604-336-1119
Website <http://www.phemi.com/> Twitter<https://twitter.com/PHEMISystems> Linkedin<http://www.linkedin.com/company/3561810?trk=tyah&trkInfo=tarId%3A1403279580554%2Ctas%3Aphemi%20hea%2Cidx%3A1-1-1>
