Re: Zookeeper configured MCF not working in production mode

Karl Wright Wed, 17 Sep 2014 18:14:53 -0700

A second release candidate has been built, which fixes the issues
discovered in RC0.  It can be downloaded from the same place.


Karl


On Wed, Sep 17, 2014 at 7:36 AM, Karl Wright <[email protected]> wrote:

> There is now a release candidate for 1.7.1 that can be downloaded and
> installed at http://people.apache.org/~kwright/apache-manifoldcf-1.7.1 .
>
> Thanks!
> Karl
>
>
> On Wed, Sep 17, 2014 at 7:05 AM, Aeham Abushwashi <
> [email protected]> wrote:
>
>> Thanks Erlend and Karl
>>
>> On 17 September 2014 12:03, Karl Wright <[email protected]> wrote:
>>
>>> Yes, this problem was introduced in 1.6.
>>>
>>> Karl
>>>
>>> Sent from my Windows Phone
>>> From: Erlend Garåsen
>>> Sent: 9/17/2014 6:06 AM
>>> To: [email protected]
>>> Subject: Re: Zookeeper configured MCF not working in production mode
>>>
>>> I guess the issue affects version 1.6.x as well. We had exactly the same
>>> problem with that version, but unfortunately I have no thread dump from
>>> that time to investigate.
>>>
>>> Erlend
>>>
>>> On 17.09.14 12:01, Aeham Abushwashi wrote:
>>> > Thanks for finding and fixing the issue. Could you confirm whether it
>>> > affects 1.6.x? A quick look at ZooKeeperConnection.obtainWriteLock() in
>>> > 1.6.1 shows the same pattern identified in CONNECTORS-1031:
>>> > https://issues.apache.org/jira/browse/CONNECTORS-1031?focusedCommentId=14135978
>>> >
>>> > On 16 September 2014 22:19, Karl Wright <[email protected]
>>> > <mailto:[email protected]>> wrote:
>>> >
>>> >     I believe I've fixed the problem for real.  There's a patch
>>> >     attached to the CONNECTORS-1031 ticket, which should be
>>> >     applicable to 1.7.
>>> >     The fix is already checked into the dev_1x branch, as well as trunk
>>> >     (which is MCF 2.0, so don't use that yet).
>>> >
>>> >     I also believe that we're going to need to make a 1.7.1 release
>>> >     that contains this fix, and others of similar importance.
>>> >
>>> >     Karl
>>> >
>>> >
>>> >     On Tue, Sep 16, 2014 at 9:15 AM, Karl Wright <[email protected]
>>> >     <mailto:[email protected]>> wrote:
>>> >
>>> >         After some research, I found that increasing the tickTime
>>> >         setting in zookeeper.cfg from 2000 to 5000 makes this problem
>>> >         go away for me.
>>> >
>>> >         Clearly we still have an issue with resetting ZooKeeper
>>> >         connections after tick timeout failures.  The connections are
>>> >         reset, but their state is somehow incorrect.
>>> >         I'll need to do more research to figure out how this can be
>>> >         addressed.
>>> >
>>> >         For the interim, increasing the tick time seems to be a
>>> >         reasonable workaround.
>>> >
>>> >         Thanks,
>>> >         Karl
>>> >
>>> >
>>> >         On Tue, Sep 16, 2014 at 8:14 AM, Karl Wright <[email protected]
>>> >         <mailto:[email protected]>> wrote:
>>> >
>>> >             Believe it or not, I was able to reproduce this here with a
>>> >             crawl of 100000 documents.  I get this in the Zookeeper
>>> >             server-side log, hundreds of times:
>>> >
>>> >             >>>>>>
>>> >             [SyncThread:0] ERROR org.apache.zookeeper.server.NIOServerCnxn - Unexpected Exception:
>>> >             java.nio.channels.CancelledKeyException
>>> >                 at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>>> >                 at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>>> >                 at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
>>> >                 at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
>>> >                 at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
>>> >                 at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
>>> >                 at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
>>> >             [SyncThread:0] ERROR org.apache.zookeeper.server.NIOServerCnxn - Unexpected Exception:
>>> >             java.nio.channels.CancelledKeyException
>>> >                 at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>>> >                 at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>>> >                 at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
>>> >                 at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
>>> >                 at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
>>> >                 at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
>>> >                 at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
>>> >             <<<<<<
>>> >
>>> >             ... and then everything locks up.  I have no idea what is
>>> >             happening; seems to be an NIO exception ZooKeeper is not
>>> >             expecting.
>>> >
>>> >             Karl
>>> >
>>> >
>>> >             On Tue, Sep 16, 2014 at 7:52 AM, Erlend Garåsen
>>> >             <[email protected] <mailto:[email protected]>>
>>> >             wrote:
>>> >
>>> >
>>> >                 Ouch, I forgot to put the Zookeeper logs on the web.
>>> >                 Since they do not include timestamps and I have restarted
>>> >                 MCF after a few changes, I guess it will be difficult to
>>> >                 get the relevant lines. I'll do that next time it hangs,
>>> >                 probably at the end of the day.
>>> >
>>> >                 I will add the new Zookeeper configuration settings as
>>> >                 Lalit suggested next time I'm restarting MCF.
>>> >
>>> >                     How many worker threads are you using?  How many
>>> >                     documents (about) do
>>> >                     you crawl before things hang?
>>> >
>>> >
>>> >                 Throttling -> max connections: 30
>>> >                 Throttling -> Max fetches/min: 100
>>> >                 Bandwidth -> max connections: 25
>>> >                 Bandwidth -> max kbytes/sec: 8000
>>> >                 Bandwidth -> max fetches/min: 20
>>> >
>>> >                 I have four jobs configured. The one I'm running now
>>> >                 has 100,000 documents configured. In total, around
>>> >                 110,000 documents for all four jobs.
>>> >
>>> >                 I guess there are more documents involved since the
>>> >                 largest job excludes a lot of documents based on
>>> >                 sophisticated and complex filtering rules. Maybe 50%
>>> >                 more even though they are not added to Solr (but they
>>> >                 are of course fetched).
>>> >
>>> >                 Erlend
>>> >
>>> >
>>> >                     You may also want to try to increase the parameter
>>> >                     maxClientCnxns in zookeeper.cfg to something bigger,
>>> >                     if you have a lot of worker threads.
>>> >                     I'm thinking 1000 or some such.  See if it makes a
>>> >                     difference for you.
>>> >
>>> >
>>> >                 I'll try that at next restart.
>>> >
>>> >                 Erlend
>>> >
>>> >
>>> >
>>> >
>>> >
>>>
>>
>>
>
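
For reference, the two zookeeper.cfg changes discussed in this thread can be sketched as the fragment below. This is only an illustration of the interim workaround, not a fix: the values 5000 and 1000 come from the messages above, the comments are assumptions about why they might help, and any other settings already in your zookeeper.cfg are left out and should stay as they are.

```properties
# zookeeper.cfg (partial sketch; only the two settings mentioned in the thread)

# Basic ZooKeeper time unit, in milliseconds. The thread reports that
# raising this from 2000 to 5000 made the lockups go away in one test.
# Note: unless minSessionTimeout/maxSessionTimeout are set explicitly,
# the allowed session timeout range scales with tickTime (2x to 20x),
# so this change also lengthens session timeouts. Whether that is the
# reason it helped is not established in the thread.
tickTime=5000

# Maximum number of concurrent connections a single client IP may open
# to this ZooKeeper server. 1000 is the rough figure suggested in the
# thread for setups with many worker threads.
maxClientCnxns=1000
```

ZooKeeper reads this file at startup, so the server has to be restarted for the new values to take effect. Per the later messages in the thread, the underlying locking problem is addressed by the CONNECTORS-1031 patch and the 1.7.1 release candidate.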
