Thanks for finding and fixing the issue. Could you confirm whether it
affects 1.6.x? A quick look at ZooKeeperConnection.obtainWriteLock() in
1.6.1 shows the same pattern identified in CONNECTORS-1031 -
https://issues.apache.org/jira/browse/CONNECTORS-1031?focusedCommentId=14135978
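
In case it helps anyone still on 1.6.x in the meantime, the interim
workaround quoted below (a longer tick time), together with the
maxClientCnxns suggestion from earlier in the thread, might look roughly
like this in zookeeper.cfg. This is only a sketch: I'm assuming "tick time"
means the standard tickTime setting, and the values are simply the ones
mentioned in this thread, not tuned recommendations.

```
# zookeeper.cfg (fragment; all other settings left unchanged)
# Assumption: "tick time" in the quoted message is ZooKeeper's tickTime (milliseconds).
tickTime=5000
# Value floated earlier in the thread for setups with many worker threads.
maxClientCnxns=1000
```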

On 16 September 2014 22:19, Karl Wright <[email protected]> wrote:

> I believe I've fixed the problem for real.  There's a patch attached to
> the CONNECTORS-1031 ticket, which should be applicable to 1.7.  The fix is
> already checked into the dev_1x branch, as well as trunk (which is MCF 2.0,
> so don't use that yet).
>
> I also believe that we're going to need to make a 1.7.1 release that
> contains this fix, and others of similar importance.
>
> Karl
>
>
> On Tue, Sep 16, 2014 at 9:15 AM, Karl Wright <[email protected]> wrote:
>
>> After some research, I found that increasing the tick time in
>> zookeeper.cfg from 2000 to 5000 makes this problem go away for me.
>>
>> Clearly we still have an issue with resetting ZooKeeper connections
>> after tick timeout failures.  The connections are reset, but the state of
>> the connections is somehow incorrect.  I'll need to do more research to
>> figure out how this can be addressed.
>>
>> For the interim, increasing the tick time seems to be a reasonable
>> workaround.
>>
>> Thanks,
>> Karl
>>
>>
>> On Tue, Sep 16, 2014 at 8:14 AM, Karl Wright <[email protected]> wrote:
>>
>>> Believe it or not, I was able to reproduce this here with a crawl of
>>> 100000 documents.  I get this in the Zookeeper server-side log, hundreds of
>>> times:
>>>
>>> >>>>>>
>>> [SyncThread:0] ERROR org.apache.zookeeper.server.NIOServerCnxn - Unexpected Exception:
>>> java.nio.channels.CancelledKeyException
>>>         at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>>>         at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>>>         at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
>>>         at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
>>>         at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
>>>         at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
>>>         at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
>>> [SyncThread:0] ERROR org.apache.zookeeper.server.NIOServerCnxn - Unexpected Exception:
>>> java.nio.channels.CancelledKeyException
>>>         at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>>>         at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>>>         at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
>>>         at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
>>>         at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
>>>         at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
>>>         at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
>>> <<<<<<
>>>
>>> ... and then everything locks up.  I have no idea what is happening;
>>> it seems to be an NIO exception that ZooKeeper is not expecting.
>>>
>>> Karl
>>>
>>>
>>> On Tue, Sep 16, 2014 at 7:52 AM, Erlend Garåsen <[email protected]> wrote:
>>>
>>>>
>>>> Ouch, I forgot to put the ZooKeeper logs on the web. Since they do not
>>>> include timestamps and I have restarted MCF after a few changes, I guess it
>>>> will be difficult to find the relevant lines. I'll do that the next time it
>>>> hangs, probably at the end of the day.
>>>>
>>>> I will add the new ZooKeeper configuration settings Lalit suggested
>>>> the next time I restart MCF.
>>>>
>>>>> How many worker threads are you using?  How many documents (about) do
>>>>> you crawl before things hang?
>>>>>
>>>>
>>>> Throttling -> max connections: 30
>>>> Throttling -> Max fetches/min: 100
>>>> Bandwidth -> max connections: 25
>>>> Bandwidth -> max kbytes/sec: 8000
>>>> Bandwidth -> max fetches/min: 20
>>>>
>>>> I have four jobs configured. The one I'm running now has 100,000
>>>> documents configured. In total, around 110,000 documents across all four jobs.
>>>>
>>>> I guess there are more documents involved since the largest job
>>>> excludes a lot of documents based on sophisticated and complex filtering
>>>> rules. Maybe 50% more even though they are not added to Solr (but they are
>>>> of course fetched).
>>>>
>>>> Erlend
>>>>
>>>>
>>>>> You may also want to try increasing the maxClientCnxns parameter in
>>>>> zookeeper.cfg to something bigger, if you have a lot of worker threads.
>>>>> I'm thinking 1000 or some such.  See if it makes a difference for you.
>>>>>
>>>>
>>>> I'll try that at the next restart.
>>>>
>>>> Erlend
>>>>
>>>
>>>
>>
>
