Thanks for finding and fixing the issue. Could you confirm whether it affects 1.6.x? A quick look at ZooKeeperConnection.obtainWriteLock() in 1.6.1 shows the same pattern identified in CONNECTORS-1031 - https://issues.apache.org/jira/browse/CONNECTORS-1031?focusedCommentId=14135978
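
For anyone on 1.6.x who needs the interim workaround quoted below, I think it amounts to roughly the following zookeeper.cfg fragment. This is only a sketch: tickTime and maxClientCnxns are standard ZooKeeper settings, the values are the ones suggested in this thread (5000, and "1000 or some such"), and I am assuming the rest of your zookeeper.cfg stays as it is.

```
# zookeeper.cfg (fragment): interim workaround values taken from this thread
# Length of one tick in milliseconds; raised from 2000
tickTime=5000
# Maximum concurrent connections from a single client IP address;
# 1000 is the rough figure suggested for setups with many worker threads
maxClientCnxns=1000
```

Note that ZooKeeper derives its default minimum and maximum session timeouts from tickTime (2x and 20x), so raising it also loosens the session timeout bounds. It does not fix the underlying reconnect problem.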
On 16 September 2014 22:19, Karl Wright <[email protected]> wrote:

> I believe I've fixed the problem for real.  There's a patch attached to
> the CONNECTORS-1031 ticket, which should be applicable to 1.7.  The fix is
> already checked into the dev_1x branch, as well as trunk (which is MCF 2.0,
> so don't use that yet).
>
> I also believe that we're going to need to make a 1.7.1 release that
> contains this fix, and others of similar importance.
>
> Karl
>
>
> On Tue, Sep 16, 2014 at 9:15 AM, Karl Wright <[email protected]> wrote:
>
>> After some research, I found that increasing the zookeeper.cfg tick time
>> count from 2000 to 5000 makes this problem go away for me.
>>
>> Clearly we have an issue, still, with resetting zookeeper connections
>> after tick timeout failures.  The connections are reset but the state of
>> the connections is somehow incorrect.  I'll need to do more research to
>> figure out how this can be addressed.
>>
>> For the interim, increasing the tick time seems to be a reasonable
>> workaround.
>>
>> Thanks,
>> Karl
>>
>>
>> On Tue, Sep 16, 2014 at 8:14 AM, Karl Wright <[email protected]> wrote:
>>
>>> Believe it or not, I was able to reproduce this here with a crawl of
>>> 100000 documents.  I get this in the Zookeeper server-side log, hundreds of
>>> times:
>>>
>>> >>>>>>
>>> [SyncThread:0] ERROR org.apache.zookeeper.server.NIOServerCnxn - Unexpected Exception:
>>> java.nio.channels.CancelledKeyException
>>>         at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>>>         at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>>>         at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
>>>         at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
>>>         at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
>>>         at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
>>>         at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
>>> [SyncThread:0] ERROR org.apache.zookeeper.server.NIOServerCnxn - Unexpected Exception:
>>> java.nio.channels.CancelledKeyException
>>>         at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>>>         at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>>>         at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
>>>         at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
>>>         at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
>>>         at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
>>>         at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
>>> <<<<<<
>>>
>>> ... and then everything locks up.  I have no idea what is happening;
>>> seems to be an NIO exception ZooKeeper is not expecting.
>>>
>>> Karl
>>>
>>>
>>> On Tue, Sep 16, 2014 at 7:52 AM, Erlend Garåsen <[email protected]> wrote:
>>>
>>>>
>>>> Ouch, I forgot to place the Zookeeper logs on the web. Since they do not
>>>> include timestamps and I have restarted MCF after a few changes, I guess it
>>>> will be difficult to get the relevant lines. I'll do that next time it
>>>> hangs, probably at the end of the day.
>>>>
>>>> I will add the new Zookeeper configuration settings as Lalit suggested
>>>> next time I'm restarting MCF.
>>>>
>>>>> How many worker threads are you using? How many documents (about) do
>>>>> you crawl before things hang?
>>>>>
>>>>
>>>> Throttling -> max connections: 30
>>>> Throttling -> Max fetches/min: 100
>>>> Bandwidth -> max connections: 25
>>>> Bandwidth -> max kbytes/sec: 8000
>>>> Bandwidth -> max fetches/min: 20
>>>>
>>>> I have four jobs configured. The one I'm running now has 100,000
>>>> documents configured. In total, around 110,000 documents for all four jobs.
>>>>
>>>> I guess there are more documents involved since the largest job
>>>> excludes a lot of documents based on sophisticated and complex filtering
>>>> rules. Maybe 50% more even though they are not added to Solr (but they are
>>>> of course fetched).
>>>>
>>>> Erlend
>>>>
>>>>
>>>>> You may also want to try to increase the parameter: maxClientCnxns in
>>>>> zookeeper.cfg to something bigger, if you have a lot of worker threads.
>>>>> I'm thinking 1000 or some such.  See if it makes a difference for you.
>>>>>
>>>>
>>>> I'll try that at next restart.
>>>>
>>>> Erlend
>>>>
>>>
>>>
>>
>
