Are these tuples being acked by bolts? Excuse typos On Mar 10, 2016 11:19 PM, "Fabrice Gabolde" <[email protected]> wrote:
> Hello list, > > I have a 0.9.6 topology with 16 spout executors (a custom RMQ spout) > and around 80 executors on various bolts, taking up 8 slots on our > 18-slot cluster. It's a straightforward affair with only > BaseBasicBolts. > > This morning it just stopped processing any tuples at 10:40 AM, on all > workers within the same minute, despite the RMQ queues having > thousands of messages ready. Logs do not contain any errors. We have > the logs configured to emit on INFO+, except for logs from our org > which are configured to emit on TRACE+. > > The *very* first thing the nextTuple method does is emit a log > message. Grepping for this message shows that it stops appearing at > 10:40, so exactly at the same time that the topology hangs. (The rest > of the threads keep processing the tuples they already have, then they > stop when starved.) > > netstat shows the RMQ connections are still up, and the RMQ cluster > agrees (sees 16 consumers on the queue). > > Looking at the metrics, the internal queues are empty (population=0 or > 1 everywhere on every worker). A thread dump shows this for the spout: > > "Thread-44-rabbit" #76 prio=5 os_prio=0 tid=0x00007fcacc129000 > nid=0x1a9e waiting on condition [0x00007fca3d1d4000] > java.lang.Thread.State: TIMED_WAITING (sleeping) > at java.lang.Thread.sleep(Native Method) > at > backtype.storm.spout.SleepSpoutWaitStrategy.emptyEmit(SleepSpoutWaitStrategy.java:36) > at > backtype.storm.daemon.executor$fn__3371$fn__3386$fn__3415.invoke(executor.clj:582) > at backtype.storm.util$async_loop$fn__460.invoke(util.clj:463) > at clojure.lang.AFn.run(AFn.java:24) > at java.lang.Thread.run(Thread.java:745) > > That *looks* like it's just running the spout wait strategy > (configured for 5000ms), but then the log message should appear, so it > really seems nextTuple is not called. > > Killing an RMQ connection from the admin interface triggers an > exception from the RMQ client, and then nothing. The client is > designed to reconnect the next time its receive method is called, > which it isn't (since nextTuple is not called, presumably). > > We used to have this type of problem as well when using Zaqar as a > message queue. This makes the problem even harder to debug, since > Zaqar was at the time purely HTTP polling (running GET /messages in a > loop, basically, so no blocking). Restarting the topology works but > I'd rather that not become a permanent thing... > > Does anybody have an idea where I should start looking? > > -- > Fabrice Gabolde >
