Thanks Erlend and Karl

On 17 September 2014 12:03, Karl Wright <[email protected]> wrote:
> Yes, this problem was introduced in 1.6.
>
> Karl
>
> Sent from my Windows Phone
> From: Erlend Garåsen
> Sent: 9/17/2014 6:06 AM
> To: [email protected]
> Subject: Re: Zookeeper configured MCF not working in production mode
>
> I guess the issue affects version 1.6.x as well. We had exactly the same
> problem with that version, but unfortunately I have no thread dump from
> that time to investigate.
>
> Erlend
>
> On 17.09.14 12:01, Aeham Abushwashi wrote:
> > Thanks for finding and fixing the issue. Could you confirm whether it
> > affects 1.6.x? A quick look at ZooKeeperConnection.obtainWriteLock() in
> > 1.6.1 shows the same pattern identified in CONNECTORS-1031 -
> > https://issues.apache.org/jira/browse/CONNECTORS-1031?focusedCommentId=14135978
> >
> > On 16 September 2014 22:19, Karl Wright <[email protected]> wrote:
> >
> >     I believe I've fixed the problem for real. There's a patch attached
> >     to the CONNECTORS-1031 ticket, which should be applicable to 1.7.
> >     The fix is already checked into the dev_1x branch, as well as trunk
> >     (which is MCF 2.0, so don't use that yet).
> >
> >     I also believe that we're going to need to make a 1.7.1 release that
> >     contains this fix, and others of similar importance.
> >
> >     Karl
> >
> >
> >     On Tue, Sep 16, 2014 at 9:15 AM, Karl Wright <[email protected]> wrote:
> >
> >         After some research, I found that increasing the zookeeper.cfg
> >         tick time count from 2000 to 5000 makes this problem go away
> >         for me.
> >
> >         Clearly we have an issue, still, with resetting zookeeper
> >         connections after tick timeout failures. The connections are
> >         reset but the state of the connections is somehow incorrect.
> >         I'll need to do more research to figure out how this can be
> >         addressed.
> >
> >         For the interim, increasing the tick time seems to be a
> >         reasonable workaround.
> >
> >         Thanks,
> >         Karl
> >
> >
> >         On Tue, Sep 16, 2014 at 8:14 AM, Karl Wright <[email protected]> wrote:
> >
> >             Believe it or not, I was able to reproduce this here with a
> >             crawl of 100000 documents. I get this in the Zookeeper
> >             server-side log, hundreds of times:
> >
> >             >>>>>>
> >             [SyncThread:0] ERROR org.apache.zookeeper.server.NIOServerCnxn - Unexpected Exception:
> >             java.nio.channels.CancelledKeyException
> >                     at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
> >                     at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
> >                     at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
> >                     at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
> >                     at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
> >                     at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
> >                     at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
> >             [SyncThread:0] ERROR org.apache.zookeeper.server.NIOServerCnxn - Unexpected Exception:
> >             java.nio.channels.CancelledKeyException
> >                     at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
> >                     at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
> >                     at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
> >                     at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
> >                     at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
> >                     at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
> >                     at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
> >             <<<<<<
> >
> >             ...
> >             and then everything locks up. I have no idea what is
> >             happening; seems to be an NIO exception ZooKeeper is not
> >             expecting.
> >
> >             Karl
> >
> >
> >             On Tue, Sep 16, 2014 at 7:52 AM, Erlend Garåsen <[email protected]> wrote:
> >
> >
> >                 Ouch, I forgot to place the Zookeeper logs on the web.
> >                 Since they do not include timestamps and I have restarted
> >                 MCF after a few changes, I guess it will be difficult to
> >                 get the relevant lines. I'll do that next time it hangs,
> >                 probably at the end of the day.
> >
> >                 I will add the new Zookeeper configuration settings as
> >                 Lalit suggested next time I'm restarting MCF.
> >
> >                     How many worker threads are you using? How many
> >                     documents (about) do you crawl before things hang?
> >
> >
> >                 Throttling -> max connections: 30
> >                 Throttling -> Max fetches/min: 100
> >                 Bandwidth -> max connections: 25
> >                 Bandwidth -> max kbytes/sec: 8000
> >                 Bandwidth -> max fetches/min: 20
> >
> >                 I have four jobs configured. The one I'm running now has
> >                 100,000 documents configured. In total, around 110,000
> >                 documents for all four jobs.
> >
> >                 I guess there are more documents involved since the
> >                 largest job excludes a lot of documents based on
> >                 sophisticated and complex filtering rules. Maybe 50%
> >                 more even though they are not added to Solr (but they
> >                 are of course fetched).
> >
> >                 Erlend
> >
> >
> >                     You may also want to try to increase the parameter:
> >                     maxClientCnxns in zookeeper.cfg to something bigger,
> >                     if you have a lot of worker threads. I'm thinking
> >                     1000 or some such. See if it makes a difference for
> >                     you.
> >
> >
> >                 I'll try that at next restart.
> >
> >                 Erlend
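
For reference, the two zookeeper.cfg changes discussed in this thread could look roughly like the fragment below. This is only a sketch: tickTime and maxClientCnxns are standard ZooKeeper settings, the values are the ones suggested in the thread (5000 as the interim workaround, 1000 as a rough guess for setups with many worker threads), and the rest of the file is assumed to stay as it is.

```
# zookeeper.cfg (fragment; all other settings left unchanged)

# Length of one tick in milliseconds. Raised from 2000 as the interim
# workaround reported in the thread.
tickTime=5000

# Limit on concurrent connections from a single client IP address.
# 1000 is the rough figure suggested in the thread, not a tuned value.
maxClientCnxns=1000
```

ZooKeeper derives its default session timeout bounds from tickTime (minimum 2 ticks, maximum 20 ticks), so raising tickTime also lengthens the session timeouts that clients can negotiate, which may be why the workaround helps.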
