Are these tuples being acked by bolts?

Excuse typos
On Mar 10, 2016 11:19 PM, "Fabrice Gabolde" <[email protected]> wrote:

> Hello list,
>
> I have a 0.9.6 topology with 16 spout executors (a custom RMQ spout)
> and around 80 executors on various bolts, taking up 8 slots on our
> 18-slot cluster. It's a straightforward affair with only
> BaseBasicBolts.
>
> This morning it just stopped processing any tuples at 10:40 AM, on all
> workers within the same minute, despite the RMQ queues having
> thousands of messages ready. Logs do not contain any errors. We have
> the logs configured to emit on INFO+, except for logs from our org
> which are configured to emit on TRACE+.
>
> The *very* first thing the nextTuple method does is emit a log
> message. Grepping for this message shows that it stops appearing at
> 10:40, so exactly at the same time that the topology hangs. (The rest
> of the threads keep processing the tuples they already have, then they
> stop when starved.)
>
> netstat shows the RMQ connections are still up, and the RMQ cluster
> agrees (sees 16 consumers on the queue).
>
> Looking at the metrics, the internal queues are empty (population=0 or
> 1 everywhere on every worker). A thread dump shows this for the spout:
>
> "Thread-44-rabbit" #76 prio=5 os_prio=0 tid=0x00007fcacc129000
> nid=0x1a9e waiting on condition [0x00007fca3d1d4000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>     at java.lang.Thread.sleep(Native Method)
>     at
> backtype.storm.spout.SleepSpoutWaitStrategy.emptyEmit(SleepSpoutWaitStrategy.java:36)
>     at
> backtype.storm.daemon.executor$fn__3371$fn__3386$fn__3415.invoke(executor.clj:582)
>     at backtype.storm.util$async_loop$fn__460.invoke(util.clj:463)
>     at clojure.lang.AFn.run(AFn.java:24)
>     at java.lang.Thread.run(Thread.java:745)
>
> That *looks* like it's just running the spout wait strategy
> (configured for 5000ms), but then the log message should appear, so it
> really seems nextTuple is not called.
>
> Killing an RMQ connection from the admin interface triggers an
> exception from the RMQ client, and then nothing. The client is
> designed to reconnect the next time its receive method is called,
> which it isn't (since nextTuple is not called, presumably).
>
> We used to have this type of problem as well when using Zaqar as a
> message queue. This makes the problem even harder to debug, since
> Zaqar was at the time purely HTTP polling (running GET /messages in a
> loop, basically, so no blocking). Restarting the topology works but
> I'd rather that not become a permanent thing...
>
> Does anybody have an idea where I should start looking?
>
> --
> Fabrice Gabolde
>

Reply via email to