Implementing both suggestions worked.

Thanks!

- Eddie


On Tue, Feb 18, 2014 at 1:58 PM, Michael Rose <[email protected]> wrote:

> You may need to configure your cluster to give workers more time to start up.
> Additionally, given how long the Stanford NLP models can take to load,
> make sure you load them only once (e.g. in a static initializer or with
> double-checked locking) and share them between all your bolt
> instances.
>
> supervisor.worker.start.timeout.secs 120
> supervisor.worker.timeout.secs 60
>
> I'd try tuning your worker start timeout here. Try raising it to 300s
> and (again) make sure your prepare method initializes expensive
> resources only once, then shares them between instances in the JVM.
>
> Michael Rose (@Xorlev <https://twitter.com/xorlev>)
> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
> [email protected]
>
>
> On Tue, Feb 18, 2014 at 1:45 PM, Eddie Santos <[email protected]> wrote:
>
>> Hi all,
>>
>> How do you get bolts that take a ludicrously long time to load (we're
>> talking minutes here) to cooperate with Zookeeper?
>>
>> I may not be understanding my problem properly, but on my test cluster
>> (**not** in local mode!) my bolt keeps getting restarted in the middle of
>> its prepare() method -- which may take up to two minutes to return.
>>
>> The problem seems to be the "Client session timed out" error, but I'm not
>> knowledgeable enough about ZooKeeper to really know how to fix this.
>>
>> Here's a portion of the logs from the affected supervisor. The STDIO messages
>> come from a poorly coded third-party library that I have to use.
>>
>>     2014-01-17 23:19:28 o.a.z.ClientCnxn [INFO] Client session timed out, have not heard from server in 2747ms for sessionid 0x143a22eb4060078, closing socket connection and attempting reconnect
>>     2014-01-17 23:19:28 b.s.d.worker [DEBUG] Doing heartbeat #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1390000768, :storm-id "nlptools-test-1-1390000740", :executors #{[3 3] [6 6] [-1 -1]}, :port 6702}
>>     2014-01-17 23:19:28 b.s.d.worker [DEBUG] Doing heartbeat #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1390000768, :storm-id "nlptools-test-1-1390000740", :executors #{[3 3] [6 6] [-1 -1]}, :port 6702}
>>     2014-01-17 23:19:28 c.n.c.f.s.ConnectionStateManager [INFO] State change: SUSPENDED
>>     2014-01-17 23:19:28 c.n.c.f.s.ConnectionStateManager [WARN] There are no ConnectionStateListeners registered.
>>     2014-01-17 23:19:28 b.s.cluster [WARN] Received event :disconnected::none: with disconnected Zookeeper.
>>     2014-01-17 23:19:28 b.s.cluster [WARN] Received event :disconnected::none: with disconnected Zookeeper.
>>     2014-01-17 23:19:28 STDIO [ERROR] done [7.2 sec].
>>     2014-01-17 23:19:28 STDIO [ERROR] Adding annotator lemma
>>     2014-01-17 23:19:28 STDIO [ERROR] Adding annotator ner
>>     2014-01-17 23:19:28 STDIO [ERROR] Loading classifier from edu/stanford/nlp/models/ner/english.all.3class.distsim.crf.ser.gz
>>     2014-01-17 23:19:28 STDIO [ERROR] ...
>>     2014-01-17 23:19:29 b.s.d.worker [DEBUG] Doing heartbeat #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1390000769, :storm-id "nlptools-test-1-1390000740", :executors #{[3 3] [6 6] [-1 -1]}, :port 6702}
>>     2014-01-17 23:19:29 b.s.d.worker [DEBUG] Doing heartbeat #backtype.storm.daemon.common.WorkerHeartbeat{:time-secs 1390000769, :storm-id "nlptools-test-1-1390000740", :executors #{[3 3] [6 6] [-1 -1]}, :port 6702}
>>     2014-01-17 23:19:30 o.a.z.ClientCnxn [INFO] Opening socket connection to server zookeeper/192.168.50.3:2181
>>
>>   ^-- This is where the bolt gets restarted during its initialization.
>>
>> Thanks,
>> Eddie
>>
>
>
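A minimal sketch of the load-once, share-everywhere pattern Michael describes, here using double-checked locking. It is plain Java with no Storm dependency. `SharedResource`, `SharedResourceDemo`, and the loader are hypothetical names, and the `Supplier` stands in for whatever actually builds the Stanford NLP pipeline (an assumption, not code from this thread). In a bolt you would keep one `static final SharedResource<...>` field and call `get()` from `prepare()`, so every executor in the same worker JVM reuses one set of loaded models.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/**
 * Hypothetical helper (not part of Storm): builds one expensive object per JVM
 * on first use and returns that same instance to every later caller.
 */
final class SharedResource<T> {
    private final Supplier<T> loader;
    // volatile makes double-checked locking safe under the Java memory model
    private volatile T instance;

    SharedResource(Supplier<T> loader) {
        this.loader = loader;
    }

    T get() {
        T local = instance;                 // first check, without locking
        if (local == null) {
            synchronized (this) {
                local = instance;           // second check, while holding the lock
                if (local == null) {
                    local = loader.get();   // the expensive load runs at most once
                    instance = local;
                }
            }
        }
        return local;
    }
}

/** Demo: several threads (standing in for bolt executors) share one load. */
final class SharedResourceDemo {
    // Hypothetical stand-in: a real bolt's loader would build the NLP pipeline.
    static final AtomicInteger LOADS = new AtomicInteger();
    static final SharedResource<Object> MODELS =
            new SharedResource<>(() -> {
                LOADS.incrementAndGet();
                return new Object();
            });

    public static void main(String[] args) throws InterruptedException {
        Thread[] executors = new Thread[8];
        for (int i = 0; i < executors.length; i++) {
            executors[i] = new Thread(MODELS::get);  // each "executor" asks for the models
            executors[i].start();
        }
        for (Thread t : executors) {
            t.join();
        }
        System.out.println("loads = " + LOADS.get());  // prints "loads = 1"
    }
}
```

The "static initializer" alternative Michael mentions would be a `static final` field in a nested holder class; the JVM's class-initialization lock then gives the same load-once guarantee without explicit synchronization. Either way, the first executor still pays the full load time inside `prepare()`, so a longer `supervisor.worker.start.timeout.secs` may still be needed.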
