After some research, I found that increasing the tick time in zookeeper.cfg from 2000 to 5000 makes this problem go away for me.
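For reference, the workaround amounts to something like the following in zookeeper.cfg. The tickTime value is the change described above; maxClientCnxns is the separate suggestion from later in this thread, and the exact file layout will depend on your installation:

```properties
# Tick time in milliseconds; raised from the 2000 default as a workaround
tickTime=5000
# Allow many concurrent client connections (roughly one per worker thread)
maxClientCnxns=1000
```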
Clearly we still have an issue with resetting zookeeper connections after tick timeout failures. The connections are reset, but the state of the connections is somehow incorrect. I'll need to do more research to figure out how this can be addressed. In the interim, increasing the tick time seems to be a reasonable workaround.

Thanks,
Karl

On Tue, Sep 16, 2014 at 8:14 AM, Karl Wright <[email protected]> wrote:

> Believe it or not, I was able to reproduce this here with a crawl of
> 100000 documents. I get this in the Zookeeper server-side log, hundreds of
> times:
>
> >>>>>>
> [SyncThread:0] ERROR org.apache.zookeeper.server.NIOServerCnxn - Unexpected Exception:
> java.nio.channels.CancelledKeyException
>         at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>         at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>         at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
>         at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
>         at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
>         at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
>         at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
> [SyncThread:0] ERROR org.apache.zookeeper.server.NIOServerCnxn - Unexpected Exception:
> java.nio.channels.CancelledKeyException
>         at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>         at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>         at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
>         at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
>         at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
>         at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
>         at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
> <<<<<<
>
> ... and then everything locks up. I have no idea what is happening; it
> seems to be an NIO exception ZooKeeper is not expecting.
>
> Karl
>
>
> On Tue, Sep 16, 2014 at 7:52 AM, Erlend Garåsen <[email protected]>
> wrote:
>
>>
>> Ouch, I forgot to place the Zookeeper logs on the web. Since they do not
>> include timestamps and I have restarted MCF after a few changes, I guess it
>> will be difficult to get the relevant lines. I'll do that the next time it
>> hangs, probably at the end of the day.
>>
>> I will add the new Zookeeper configuration settings as Lalit suggested
>> the next time I restart MCF.
>>
>>> How many worker threads are you using? How many documents (about) do
>>> you crawl before things hang?
>>
>> Throttling -> max connections: 30
>> Throttling -> max fetches/min: 100
>> Bandwidth -> max connections: 25
>> Bandwidth -> max kbytes/sec: 8000
>> Bandwidth -> max fetches/min: 20
>>
>> I have four jobs configured. The one I'm running now has 100,000
>> documents configured. In total, around 110,000 documents for all four jobs.
>>
>> I guess more documents are involved, since the largest job excludes
>> a lot of documents based on sophisticated and complex filtering rules.
>> Maybe 50% more, even though they are not added to Solr (but they are of
>> course fetched).
>>
>> Erlend
>>
>>> You may also want to try to increase the parameter maxClientCnxns in
>>> zookeeper.cfg to something bigger, if you have a lot of worker threads.
>>> I'm thinking 1000 or some such. See if it makes a difference for you.
>>
>> I'll try that at the next restart.
>>
>> Erlend
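As background on the stack trace quoted above: the CancelledKeyException comes from java.nio itself, not from ZooKeeper's own logic. Once a SelectionKey has been cancelled (which happens when a connection is torn down), any further call to interestOps() fails ensureValid() and throws, which is exactly the frame sequence in the trace. A minimal sketch demonstrating the JDK behavior (this is not ZooKeeper code; the class name is made up for illustration):

```java
import java.net.InetSocketAddress;
import java.nio.channels.CancelledKeyException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;

public class CancelledKeyDemo {

    // Returns true if touching a cancelled key throws CancelledKeyException,
    // mirroring what the SyncThread hits in NIOServerCnxn.sendBuffer().
    static boolean throwsOnCancelledKey() throws Exception {
        try (Selector selector = Selector.open();
             ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress(0));
            server.configureBlocking(false);
            SelectionKey key = server.register(selector, SelectionKey.OP_ACCEPT);
            key.cancel();                                // simulate the connection being reset
            try {
                key.interestOps(SelectionKey.OP_ACCEPT); // what sendBuffer effectively does
                return false;
            } catch (CancelledKeyException expected) {
                return true;                             // ensureValid() rejects the dead key
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(throwsOnCancelledKey()); // prints "true"
    }
}
```

This suggests the server is still trying to send responses on connections that have already been closed out from under it, which fits the symptom that raising the tick time (and thus making timeouts rarer) hides the problem.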
