RE: Zookeeper configured MCF not working in production mode

Karl Wright Wed, 17 Sep 2014 04:04:21 -0700

Yes, this problem was introduced in 1.6.

Karl


Sent from my Windows Phone
From: Erlend Garåsen
Sent: 9/17/2014 6:06 AM
To: [email protected]
Subject: Re: Zookeeper configured MCF not working in production mode

I guess the issue affects version 1.6.x as well. We had exactly the same
problem with that version, but unfortunately I have no thread dump from
that time to investigate.

Erlend

On 17.09.14 12:01, Aeham Abushwashi wrote:
> Thanks for finding and fixing the issue. Could you confirm whether it
> affects 1.6.x? A quick look at ZooKeeperConnection.obtainWriteLock() in
> 1.6.1 shows the same pattern identified in CONNECTORS-1031 -
> https://issues.apache.org/jira/browse/CONNECTORS-1031?focusedCommentId=14135978
>
> On 16 September 2014 22:19, Karl Wright <[email protected]
> <mailto:[email protected]>> wrote:
>
>     I believe I've fixed the problem for real.  There's a patch attached
>     to the CONNECTORS-1031 ticket, which should be applicable to 1.7.
>     The fix is already checked into the dev_1x branch, as well as trunk
>     (which is MCF 2.0, so don't use that yet).
>
>     I also believe that we're going to need to make a 1.7.1 release that
>     contains this fix, and others of similar importance.
>
>     Karl
>
>
>     On Tue, Sep 16, 2014 at 9:15 AM, Karl Wright <[email protected]
>     <mailto:[email protected]>> wrote:
>
>         After some research, I found that increasing the tickTime
>         setting in zookeeper.cfg from 2000 to 5000 makes this problem go
>         away for me.
>
>         Clearly we have an issue, still, with resetting zookeeper
>         connections after tick timeout failures.  The connections are
>         reset but their state is somehow incorrect.
>         I'll need to do more research to figure out how this can be
>         addressed.
>
>         For the interim, increasing the tick time seems to be a
>         reasonable workaround.
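>
>         A minimal sketch of that workaround, assuming a stock
>         zookeeper.cfg in which tickTime is given in milliseconds; only
>         the changed line is shown, and every other setting stays as it
>         is in your file:
>
>         ```properties
>         # Length of one ZooKeeper tick in milliseconds (was 2000).
>         # Session timeout bounds are derived from this value, so a
>         # larger tick also allows longer client session timeouts.
>         tickTime=5000
>         ```
>
>         The ZooKeeper server has to be restarted for the new value to
>         take effect.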
>
>         Thanks,
>         Karl
>
>
>         On Tue, Sep 16, 2014 at 8:14 AM, Karl Wright <[email protected]
>         <mailto:[email protected]>> wrote:
>
>             Believe it or not, I was able to reproduce this here with a
>             crawl of 100000 documents.  I get this in the Zookeeper
>             server-side log, hundreds of times:
>
>             >>>>>>
>             [SyncThread:0] ERROR org.apache.zookeeper.server.NIOServerCnxn - Unexpected Exception:
>             java.nio.channels.CancelledKeyException
>                 at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>                 at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>                 at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
>                 at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
>                 at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
>                 at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
>                 at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
>             [SyncThread:0] ERROR org.apache.zookeeper.server.NIOServerCnxn - Unexpected Exception:
>             java.nio.channels.CancelledKeyException
>                 at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>                 at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>                 at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
>                 at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
>                 at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
>                 at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
>                 at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
>             <<<<<<
>
>             ... and then everything locks up.  I have no idea what is
>             happening; seems to be an NIO exception ZooKeeper is not
>             expecting.
>
>             Karl
>
>
>             On Tue, Sep 16, 2014 at 7:52 AM, Erlend Garåsen
>             <[email protected] <mailto:[email protected]>>
>             wrote:
>
>
>                 Ouch, I forgot to put the Zookeeper logs on the web. Since
>                 they do not include timestamps and I have restarted MCF
>                 after a few changes, I guess it will be difficult to find
>                 the relevant lines. I'll do that the next time it hangs,
>                 probably at the end of the day.
>
>                 I will add the new Zookeeper configuration settings that
>                 Lalit suggested the next time I restart MCF.
>
>                     How many worker threads are you using?  How many
>                     documents (about) do
>                     you crawl before things hang?
>
>
>                 Throttling -> max connections: 30
>                 Throttling -> Max fetches/min: 100
>                 Bandwidth -> max connections: 25
>                 Bandwidth -> max kbytes/sec: 8000
>                 Bandwidth -> max fetches/min: 20
>
>                 I have four jobs configured. The one I'm running now has
>                 100,000 documents configured. In total, around 110,000
>                 documents for all four jobs.
>
>                 I guess there are more documents involved since the
>                 largest job excludes a lot of documents based on
>                 sophisticated and complex filtering rules. Maybe 50%
>                 more even though they are not added to Solr (but they
>                 are of course fetched).
>
>                 Erlend
>
>
>                     You may also want to try to increase the parameter:
>                     maxClientCnxns in
>                     zookeeper.cfg to something bigger, if you have a lot
>                     of worker threads.
>                     I'm thinking 1000 or some such.  See if it makes a
>                     difference for you.
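>
>                     As a rough sketch of that suggestion (the value 1000
>                     is the guess from above, not a tuned number), the
>                     zookeeper.cfg line would look like this:
>
>                     ```properties
>                     # Maximum number of concurrent connections that a
>                     # single client IP address may open to this server.
>                     maxClientCnxns=1000
>                     ```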
>
>
>                 I'll try that at the next restart.
>
>                 Erlend
>
>
>
>
>
