RE: Zookeeper configured MCF not working in production mode

Adrian Conlon Wed, 17 Sep 2014 01:06:07 -0700

Hi Karl,

+1 on a 1.7.1 update to fix this.


Because of various issues with file based synchronisation, we’ve been looking 
at using zookeeper for synchronisation and have been hitting the 
CancelledKeyException problem all the time.  Up until this thread, I’d assumed 
we’d missed something obvious in our zookeeper setup, and hadn’t reported it.  
That’ll teach me!

Thanks,

Adrian

p.s.
I *always* feel better when you get one of my problems, Karl…

From: Karl Wright [mailto:[email protected]]
Sent: 16 September 2014 22:20
To: [email protected]
Subject: Re: Zookeeper configured MCF not working in production mode

I believe I've fixed the problem for real.  There's a patch attached to the 
CONNECTORS-1031 ticket, which should be applicable to 1.7.  The fix is already 
checked into the dev_1x branch, as well as trunk (which is MCF 2.0, so don't 
use that yet).
I also believe that we're going to need to make a 1.7.1 release that contains 
this fix, and others of similar importance.

Karl

On Tue, Sep 16, 2014 at 9:15 AM, Karl Wright 
<[email protected]<mailto:[email protected]>> wrote:
After some research, I found that increasing the zookeeper.cfg tick time 
(tickTime) from 2000 to 5000 makes this problem go away for me.
Clearly we still have an issue with resetting ZooKeeper connections after 
tick timeout failures.  The connections are reset, but the state of the 
connections is somehow incorrect.  I'll need to do more research to figure out 
how this can be addressed.

For the interim, increasing the tick time seems to be a reasonable workaround.
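
As a rough sketch of the workaround described above, the change is a single 
line in zookeeper.cfg. The values 2000 and 5000 come from this thread; 
everything else in the file is assumed to stay as already configured. The 
setting name is tickTime and its unit is milliseconds.

```
# zookeeper.cfg (fragment; other settings left as they are)
# Length of one ZooKeeper tick, in milliseconds.
# Raised from 2000 to 5000 as an interim workaround for the
# CancelledKeyException lockups discussed in this thread.
tickTime=5000
```

Note that ZooKeeper also derives its default minimum and maximum session 
timeouts from tickTime (2x and 20x respectively), so this change affects 
session expiry as well as heartbeats. The server needs a restart to pick it up.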

Thanks,
Karl

On Tue, Sep 16, 2014 at 8:14 AM, Karl Wright 
<[email protected]<mailto:[email protected]>> wrote:
Believe it or not, I was able to reproduce this here with a crawl of 100000 
documents.  I get this in the Zookeeper server-side log, hundreds of times:

>>>>>>
[SyncThread:0] ERROR org.apache.zookeeper.server.NIOServerCnxn - Unexpected Exception:
java.nio.channels.CancelledKeyException
        at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
        at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
        at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
        at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
        at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
        at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
        at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
[SyncThread:0] ERROR org.apache.zookeeper.server.NIOServerCnxn - Unexpected Exception:
java.nio.channels.CancelledKeyException
        at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
        at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
        at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
        at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
        at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:170)
        at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:167)
        at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:101)
<<<<<<
... and then everything locks up.  I have no idea what is happening; seems to 
be an NIO exception ZooKeeper is not expecting.

Karl

On Tue, Sep 16, 2014 at 7:52 AM, Erlend Garåsen 
<[email protected]<mailto:[email protected]>> wrote:

Ouch, I forgot to put the Zookeeper logs on the web. Since they do not include 
timestamps and I have restarted MCF after a few changes, I guess it will be 
difficult to find the relevant lines. I'll do that the next time it hangs, 
probably at the end of the day.

I will add the new Zookeeper configuration settings as Lalit suggested the next 
time I restart MCF.

> How many worker threads are you using?  How many documents (about) do
> you crawl before things hang?

Throttling -> max connections: 30
Throttling -> Max fetches/min: 100
Bandwidth -> max connections: 25
Bandwidth -> max kbytes/sec: 8000
Bandwidth -> max fetches/min: 20

I have four jobs configured. The one I'm running now has 100,000 documents 
configured. In total, around 110,000 documents for all four jobs.

I guess there are more documents involved since the largest job excludes a lot 
of documents based on sophisticated and complex filtering rules. Maybe 50% more 
even though they are not added to Solr (but they are of course fetched).

Erlend

> You may also want to try to increase the parameter: maxClientCnxns in
> zookeeper.cfg to something bigger, if you have a lot of worker threads.
> I'm thinking 1000 or some such.  See if it makes a difference for you.
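
For illustration only, the suggestion above would look roughly like this in 
zookeeper.cfg. The value 1000 is the figure floated in this thread, not a 
tested recommendation, and the right number presumably depends on how many 
worker threads and MCF processes connect from the same host.

```
# zookeeper.cfg (fragment; other settings left as they are)
# Maximum number of concurrent connections a single client, identified
# by IP address, may open to this ZooKeeper server. 0 removes the limit.
maxClientCnxns=1000
```

Because the limit is applied per client IP address, all MCF processes running 
on one machine count against the same allowance.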

I'll try that at next restart.

Erlend



