I believe I've fixed the problem for real. There's a patch attached to the CONNECTORS-1031 ticket, which should be applicable to 1.7. The fix is already checked into the dev_1x branch, as well as trunk (which is MCF 2.0, so don't use that yet).
I also believe that we're going to need to make a 1.7.1 release that contains this fix, and others of similar importance.

Karl

On Tue, Sep 16, 2014 at 9:15 AM, Karl Wright <[email protected]> wrote:

> After some research, I found that increasing the zookeeper.cfg tick time
> count from 2000 to 5000 makes this problem go away for me.
>
> Clearly we have an issue, still, with resetting zookeeper connections
> after tick timeout failures. The connections are reset but the state of
> the connections are somehow incorrect. I'll need to do more research to
> figure out how this can be addressed.
>
> For the interim, increasing the tick time seems to be a reasonable
> workaround.
>
> Thanks,
> Karl
>
>
> On Tue, Sep 16, 2014 at 8:14 AM, Karl Wright <[email protected]> wrote:
>
>> Believe it or not, I was able to reproduce this here with a crawl of
>> 100000 documents. I get this in the Zookeeper server-side log, hundreds of
>> times:
>>
>> >>>>>>
>> [SyncThread:0] ERROR org.apache.zookeeper.server.NIOServerCnxn - Unexpected Exception:
>> java.nio.channels.CancelledKeyException
>>         at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>>         at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>>         at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
>>         at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
>>         at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
>>         at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
>>         at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
>> [SyncThread:0] ERROR org.apache.zookeeper.server.NIOServerCnxn - Unexpected Exception:
>> java.nio.channels.CancelledKeyException
>>         at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>>         at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>>         at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
>>         at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
>>         at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
>>         at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
>>         at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
>> <<<<<<
>>
>> ... and then everything locks up. I have no idea what is happening;
>> seems to be an NIO exception ZooKeeper is not expecting.
>>
>> Karl
>>
>>
>> On Tue, Sep 16, 2014 at 7:52 AM, Erlend Garåsen <[email protected]>
>> wrote:
>>
>>>
>>> Ouch, I forgot to place the Zookeeper logs on web. Since they do not
>>> include timestamps and I have restarted MCF after a few changes, I guess it
>>> will be difficult to get the relevant lines. I'll do that next time it
>>> hangs, probably in the end of the day.
>>>
>>> I will add the new Zookeeper configuration settings as Lalit suggested
>>> next time I'm restarting MCF.
>>>
>>>> How many worker threads are you using? How many documents (about) do
>>>> you crawl before things hang?
>>>>
>>>
>>> Throttling -> max connections: 30
>>> Throttling -> Max fetches/min: 100
>>> Bandwith -> max connections: 25
>>> Bandwith -> max kbytes/sec: 8000
>>> Bandwith -> max fetches/min: 20
>>>
>>> I have four jobs configured. The one I'm running now has 100,000
>>> documents configured. Totally around 110,000 documents for all four jobs.
>>>
>>> I guess there are more documents involved since the largest job excludes
>>> a lot of documents based on sophisticated and complex filtering rules.
>>> Maybe 50% more even though they are not added to Solr (but they are of
>>> course fetched).
>>>
>>> Erlend
>>>
>>>
>>>> You may also want to try to increase the parameter: maxClientCnxns in
>>>> zookeeper.cfg to something bigger, if you have a lot of worker threads.
>>>> I'm thinking 1000 or some such. See if it makes a difference for you.
>>>>
>>>
>>> I'll try that at next restart.
>>>
>>> Erlend
>>>
>>
>>
>
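
For reference, the two zookeeper.cfg changes discussed in this thread would look roughly like the fragment below. This is only a sketch: it assumes the standard ZooKeeper key names tickTime (in milliseconds) and maxClientCnxns, and the values are simply the ones mentioned in the messages above (5000 and "1000 or some such"), not tested recommendations. Any other settings already in your zookeeper.cfg stay as they are.

```
# Interim workaround from this thread: raise the tick time from 2000 to 5000 ms
tickTime=5000

# Suggested for setups with many worker threads; 1000 is the ballpark
# figure floated above, not a tuned value
maxClientCnxns=1000
```

ZooKeeper needs a restart to pick these up. The tick time increase is a workaround only; the actual fix is the patch attached to CONNECTORS-1031.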
