Hello list,
I have a 0.9.6 topology with 16 spout executors (a custom RMQ spout)
and around 80 executors on various bolts, taking up 8 slots on our
18-slot cluster. It's a straightforward affair with only
BaseBasicBolts.
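For reference, the topology is wired up roughly like this (bolt names, classes, and the exact split of the ~80 bolt executors are illustrative; only the parallelism hints and worker count match what I described):

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.topology.TopologyBuilder;

    public class Submit {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            // 16 executors for the custom RMQ spout
            builder.setSpout("rabbit", new RabbitSpout(), 16);
            // ~80 executors spread over a handful of BaseBasicBolt subclasses
            builder.setBolt("parse", new ParseBolt(), 40).shuffleGrouping("rabbit");
            builder.setBolt("persist", new PersistBolt(), 40).shuffleGrouping("parse");

            Config conf = new Config();
            conf.setNumWorkers(8);   // 8 of our 18 slots
            StormSubmitter.submitTopology("our-topology", conf, builder.createTopology());
        }
    }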
This morning it just stopped processing any tuples at 10:40 AM, on all
workers within the same minute, despite the RMQ queues having
thousands of messages ready. The logs do not contain any errors. Logging
is configured at INFO and above, except for loggers from our own
packages, which are set to TRACE and above.
The *very* first thing the nextTuple method does is emit a log
message. Grepping for this message shows that it stops appearing at
10:40, i.e. exactly when the topology hangs. (The other threads keep
processing the tuples they already have, then stop once they are
starved.)
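To be concrete, nextTuple is shaped roughly like this (simplified, the names are not the real ones):

    public void nextTuple() {
        // This is the TRACE line I grep for; it stops appearing at 10:40.
        LOG.trace("nextTuple called");
        Message msg = client.receive();   // non-blocking; null when nothing is ready
        if (msg != null) {
            collector.emit(new Values(msg.getBody()), msg.getDeliveryTag());
        }
    }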
netstat shows the RMQ connections are still up, and the RMQ cluster
agrees (sees 16 consumers on the queue).
Looking at the metrics, the internal queues are empty (population=0 or
1 everywhere on every worker). A thread dump shows this for the spout:
"Thread-44-rabbit" #76 prio=5 os_prio=0 tid=0x00007fcacc129000
nid=0x1a9e waiting on condition [0x00007fca3d1d4000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at
backtype.storm.spout.SleepSpoutWaitStrategy.emptyEmit(SleepSpoutWaitStrategy.java:36)
at
backtype.storm.daemon.executor$fn__3371$fn__3386$fn__3415.invoke(executor.clj:582)
at backtype.storm.util$async_loop$fn__460.invoke(util.clj:463)
at clojure.lang.AFn.run(AFn.java:24)
at java.lang.Thread.run(Thread.java:745)
That *looks* like it's just sitting in the spout wait strategy
(configured for 5000ms), but if nextTuple were being called between
sleeps the log message would show up, so it really seems nextTuple is
no longer being called at all.
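For completeness, the wait strategy is the stock one; assuming the standard config keys, it is set up roughly like this, and only the sleep time is overridden:

    Config conf = new Config();
    conf.put(Config.TOPOLOGY_SPOUT_WAIT_STRATEGY,
             "backtype.storm.spout.SleepSpoutWaitStrategy");
    conf.put(Config.TOPOLOGY_SLEEP_SPOUT_WAIT_STRATEGY_TIME_MS, 5000);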
Killing an RMQ connection from the admin interface triggers an
exception from the RMQ client, and after that nothing. The client is
designed to reconnect the next time its receive method is called,
which never happens (presumably because nextTuple is not being called).
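The relevant part of the client looks roughly like this (heavily simplified; Message, reconnect() and the field names are placeholders for our own code):

    // Reconnection only ever happens inside receive(), so if nextTuple
    // (and therefore receive) is never called, we never reconnect.
    public Message receive() throws IOException {
        if (connection == null || !connection.isOpen()) {
            reconnect();
        }
        GetResponse resp = channel.basicGet(queueName, false);  // non-blocking poll
        return resp == null ? null : Message.fromResponse(resp);
    }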
We used to see this type of problem when using Zaqar as a message
queue as well. That makes it even harder to debug, since Zaqar was at
the time purely HTTP polling (basically running GET /messages in a
loop, so no blocking calls) and the same hang still occurred.
Restarting the topology works, but I'd rather that not become a
permanent thing...
Does anybody have an idea where I should start looking?
--
Fabrice Gabolde