I guess the issue affects version 1.6.x as well. We had exactly the same problem with that version, but unfortunately I have no thread dump from that time to investigate.

Erlend

On 17.09.14 12:01, Aeham Abushwashi wrote:
Thanks for finding and fixing the issue. Could you confirm whether it
affects 1.6.x? A quick look at ZooKeeperConnection.obtainWriteLock() in
1.6.1 shows the same pattern identified in CONNECTORS-1031 -
https://issues.apache.org/jira/browse/CONNECTORS-1031?focusedCommentId=14135978

On 16 September 2014 22:19, Karl Wright wrote:

    I believe I've fixed the problem for real.  There's a patch attached
    to the CONNECTORS-1031 ticket, which should be applicable to 1.7.
    The fix is already checked into the dev_1x branch, as well as trunk
    (which is MCF 2.0, so don't use that yet).

    I also believe that we're going to need to make a 1.7.1 release that
    contains this fix, and others of similar importance.

    Karl


    On Tue, Sep 16, 2014 at 9:15 AM, Karl Wright wrote:

        After some research, I found that increasing tickTime in
        zookeeper.cfg from 2000 to 5000 makes this problem go away for me.

        Clearly we have an issue, still, with resetting zookeeper
        connections after tick timeout failures.  The connections are
        reset but the state of the connections is somehow incorrect.
        I'll need to do more research to figure out how this can be
        addressed.

        For the interim, increasing the tick time seems to be a
        reasonable workaround.
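
To make the arithmetic behind that workaround concrete: by default the
ZooKeeper server only lets clients negotiate a session timeout between
2 and 20 times tickTime, so raising tickTime also raises those bounds.
The sketch below is only an illustration of that default rule. The
function names are hypothetical, and whether these bounds are the
actual reason the hang disappears is an assumption, not something
confirmed in this thread.

```python
def session_timeout_bounds(tick_time_ms):
    """Default ZooKeeper session timeout bounds (min, max) in milliseconds.

    Assumes minSessionTimeout and maxSessionTimeout are left at their
    defaults of 2x and 20x tickTime.
    """
    return 2 * tick_time_ms, 20 * tick_time_ms


def negotiated_timeout(requested_ms, tick_time_ms):
    """Clamp the timeout a client asks for to the server's default bounds."""
    low, high = session_timeout_bounds(tick_time_ms)
    return max(low, min(requested_ms, high))


# Compare the original setting (2000) with the workaround (5000).
for tick in (2000, 5000):
    low, high = session_timeout_bounds(tick)
    print(f"tickTime={tick}: session timeout between {low} and {high} ms")
```

With tickTime at 2000 a session can be declared dead after as little as
4 seconds without a heartbeat; at 5000 the floor becomes 10 seconds,
which gives a busy crawl more room before connections are reset.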

        Thanks,
        Karl


        On Tue, Sep 16, 2014 at 8:14 AM, Karl Wright wrote:

            Believe it or not, I was able to reproduce this here with a
            crawl of 100000 documents.  I get this in the Zookeeper
            server-side log, hundreds of times:

             >>>>>>
            [SyncThread:0] ERROR org.apache.zookeeper.server.NIOServerCnxn - Unexpected Exception:
            java.nio.channels.CancelledKeyException
                at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
                at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
                at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
                at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
                at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
                at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
                at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
            [SyncThread:0] ERROR org.apache.zookeeper.server.NIOServerCnxn - Unexpected Exception:
            java.nio.channels.CancelledKeyException
                at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
                at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
                at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
                at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
                at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
                at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
                at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
            <<<<<<

            ... and then everything locks up.  I have no idea what is
            happening; seems to be an NIO exception ZooKeeper is not
            expecting.

            Karl


            On Tue, Sep 16, 2014 at 7:52 AM, Erlend Garåsen wrote:


                Ouch, I forgot to put the ZooKeeper logs on the web. Since
                they do not include timestamps and I have restarted MCF
                after a few changes, I guess it will be difficult to find
                the relevant lines. I'll do that the next time it hangs,
                probably at the end of the day.

                I will add the new ZooKeeper configuration settings that
                Lalit suggested the next time I restart MCF.

                    How many worker threads are you using?  How many
                    documents (about) do
                    you crawl before things hang?


                Throttling -> max connections: 30
                Throttling -> Max fetches/min: 100
                Bandwidth -> max connections: 25
                Bandwidth -> max kbytes/sec: 8000
                Bandwidth -> max fetches/min: 20

                I have four jobs configured. The one I'm running now has
                100,000 documents configured. In total, there are around
                110,000 documents across all four jobs.

                I guess there are more documents involved since the
                largest job excludes a lot of documents based on
                sophisticated and complex filtering rules. Maybe 50%
                more even though they are not added to Solr (but they
                are of course fetched).

                Erlend


                    You may also want to try to increase the parameter:
                    maxClientCnxns in
                    zookeeper.cfg to something bigger, if you have a lot
                    of worker threads.
                    I'm thinking 1000 or some such.  See if it makes a
                    difference for you.


                I'll try that at the next restart.

                Erlend





