Re: Zookeeper configured MCF not working in production mode

Karl Wright Tue, 16 Sep 2014 14:21:06 -0700

I believe I've fixed the problem for real.  There's a patch attached to the
CONNECTORS-1031 ticket, which should be applicable to 1.7.  The fix is
already checked into the dev_1x branch, as well as trunk (which is MCF 2.0,
so don't use that yet).


I also believe that we're going to need to make a 1.7.1 release that
contains this fix, and others of similar importance.

Karl


On Tue, Sep 16, 2014 at 9:15 AM, Karl Wright <[email protected]> wrote:

> After some research, I found that increasing the tick time (tickTime) in
> zookeeper.cfg from 2000 to 5000 makes this problem go away for me.
>
> Clearly we still have an issue with resetting ZooKeeper connections
> after tick timeout failures.  The connections are reset, but the state of
> the connections is somehow incorrect.  I'll need to do more research to
> figure out how this can be addressed.
>
> For the interim, increasing the tick time seems to be a reasonable
> workaround.
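>
> A minimal sketch of that change, assuming a zookeeper.cfg where tickTime
> is currently 2000 (the value is in milliseconds); all other settings stay
> as they are:
>
> >>>>>>
> # zookeeper.cfg (fragment)
> # Length of one ZooKeeper tick, in milliseconds; raised from 2000.
> tickTime=5000
> <<<<<<
>
> Note that ZooKeeper's default minimum and maximum session timeouts are
> derived from tickTime (2x and 20x), so raising it also lengthens the
> session timeouts that clients can negotiate.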
>
> Thanks,
> Karl
>
>
> On Tue, Sep 16, 2014 at 8:14 AM, Karl Wright <[email protected]> wrote:
>
>> Believe it or not, I was able to reproduce this here with a crawl of
>> 100000 documents.  I get this in the Zookeeper server-side log, hundreds of
>> times:
>>
>> >>>>>>
>> [SyncThread:0] ERROR org.apache.zookeeper.server.NIOServerCnxn - Unexpected Exception:
>> java.nio.channels.CancelledKeyException
>>         at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>>         at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>>         at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
>>         at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
>>         at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
>>         at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
>>         at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
>> [SyncThread:0] ERROR org.apache.zookeeper.server.NIOServerCnxn - Unexpected Exception:
>> java.nio.channels.CancelledKeyException
>>         at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>>         at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>>         at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
>>         at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
>>         at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
>>         at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
>>         at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
>> <<<<<<
>>
>> ... and then everything locks up.  I have no idea what is happening; it
>> seems to be an NIO exception that ZooKeeper is not expecting.
>>
>> Karl
>>
>>
>> On Tue, Sep 16, 2014 at 7:52 AM, Erlend Garåsen <[email protected]>
>> wrote:
>>
>>>
>>> Ouch, I forgot to put the Zookeeper logs on the web. Since they do not
>>> include timestamps and I have restarted MCF after a few changes, I guess it
>>> will be difficult to get the relevant lines. I'll do that next time it
>>> hangs, probably at the end of the day.
>>>
>>> I will add the new Zookeeper configuration settings as Lalit suggested
>>> the next time I restart MCF.
>>>
>>>> How many worker threads are you using?  How many documents (about) do
>>>> you crawl before things hang?
>>>>
>>>
>>> Throttling -> max connections: 30
>>> Throttling -> Max fetches/min: 100
>>> Bandwidth -> max connections: 25
>>> Bandwidth -> max kbytes/sec: 8000
>>> Bandwidth -> max fetches/min: 20
>>>
>>> I have four jobs configured. The one I'm running now has 100,000
>>> documents configured. In total, around 110,000 documents for all four jobs.
>>>
>>> I guess there are more documents involved since the largest job excludes
>>> a lot of documents based on sophisticated and complex filtering rules.
>>> Maybe 50% more even though they are not added to Solr (but they are of
>>> course fetched).
>>>
>>> Erlend
>>>
>>>
>>>> You may also want to try to increase the parameter: maxClientCnxns in
>>>> zookeeper.cfg to something bigger, if you have a lot of worker threads.
>>>> I'm thinking 1000 or some such.  See if it makes a difference for you.
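>>>>
>>>> As a rough sketch, assuming the rest of zookeeper.cfg is left untouched,
>>>> that would look like:
>>>>
>>>> >>>>>>
>>>> # zookeeper.cfg (fragment)
>>>> # Max concurrent connections a single client IP may open to one server.
>>>> maxClientCnxns=1000
>>>> <<<<<<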
>>>>
>>>
>>> I'll try that at the next restart.
>>>
>>> Erlend
>>>
>>
>>
>
