I guess the issue affects version 1.6.x as well. We had exactly the same problem with that version, but unfortunately I have no thread dump from that time to investigate.
Erlend

On 17.09.14 12:01, Aeham Abushwashi wrote:
Thanks for finding and fixing the issue. Could you confirm whether it affects 1.6.x? A quick look at ZooKeeperConnection.obtainWriteLock() in 1.6.1 shows the same pattern identified in CONNECTORS-1031 - https://issues.apache.org/jira/browse/CONNECTORS-1031?focusedCommentId=14135978

On 16 September 2014 22:19, Karl Wright <[email protected]> wrote:

I believe I've fixed the problem for real. There's a patch attached to the CONNECTORS-1031 ticket, which should be applicable to 1.7. The fix is already checked into the dev_1x branch, as well as trunk (which is MCF 2.0, so don't use that yet). I also believe that we're going to need to make a 1.7.1 release that contains this fix, and others of similar importance.

Karl

On Tue, Sep 16, 2014 at 9:15 AM, Karl Wright <[email protected]> wrote:

After some research, I found that increasing the zookeeper.cfg tick time count from 2000 to 5000 makes this problem go away for me. Clearly we have an issue, still, with resetting zookeeper connections after tick timeout failures. The connections are reset but the state of the connections is somehow incorrect. I'll need to do more research to figure out how this can be addressed. For the interim, increasing the tick time seems to be a reasonable workaround.

Thanks,
Karl

On Tue, Sep 16, 2014 at 8:14 AM, Karl Wright <[email protected]> wrote:

Believe it or not, I was able to reproduce this here with a crawl of 100000 documents.
I get this in the Zookeeper server-side log, hundreds of times:

>>>>>>
[SyncThread:0] ERROR org.apache.zookeeper.server.NIOServerCnxn - Unexpected Exception:
java.nio.channels.CancelledKeyException
        at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
        at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
        at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
        at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
        at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
        at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
        at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
[SyncThread:0] ERROR org.apache.zookeeper.server.NIOServerCnxn - Unexpected Exception:
java.nio.channels.CancelledKeyException
        at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
        at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
        at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
        at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
        at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
        at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
        at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
<<<<<<

... and then everything locks up. I have no idea what is happening; seems to be an NIO exception ZooKeeper is not expecting.

Karl

On Tue, Sep 16, 2014 at 7:52 AM, Erlend Garåsen <[email protected]> wrote:

Ouch, I forgot to place the Zookeeper logs on web. Since they do not include timestamps and I have restarted MCF after a few changes, I guess it will be difficult to get the relevant lines. I'll do that next time it hangs, probably in the end of the day.
I will add the new Zookeeper configuration settings as Lalit suggested next time I'm restarting MCF.

How many worker threads are you using? How many documents (about) do you crawl before things hang?

Throttling -> max connections: 30
Throttling -> Max fetches/min: 100
Bandwidth -> max connections: 25
Bandwidth -> max kbytes/sec: 8000
Bandwidth -> max fetches/min: 20

I have four jobs configured. The one I'm running now has 100,000 documents configured. In total, around 110,000 documents for all four jobs. I guess there are more documents involved since the largest job excludes a lot of documents based on sophisticated and complex filtering rules. Maybe 50% more, even though they are not added to Solr (but they are of course fetched).

Erlend

You may also want to try to increase the parameter: maxClientCnxns in zookeeper.cfg to something bigger, if you have a lot of worker threads. I'm thinking 1000 or some such. See if it makes a difference for you.

I'll try that at next restart.

Erlend
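
For reference, the two zookeeper.cfg changes suggested in this thread could look roughly like the fragment below. This is only a sketch: the value 1000 is Karl's "1000 or some such" estimate rather than a tested number, and any other settings already present in the file (not shown in the thread) are assumed to stay as they are.

```properties
# Length of one ZooKeeper tick, in milliseconds.
# Karl's interim workaround: raise this from 2000 to 5000.
tickTime=5000

# Maximum number of concurrent connections a single client host may open.
# Suggested for setups with many worker threads; the exact value is a guess.
maxClientCnxns=1000
```

Since ZooKeeper bounds negotiated session timeouts by multiples of tickTime (by default between 2 and 20 ticks), a larger tickTime also permits longer session timeouts, which may be why it masks the connection-reset problem rather than fixing it.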
