I still haven't had a chance to try this on a large Storm 0.9.6/0.10.0 setup, so any similar experience would be of interest to me. Thanks.
2016-01-15 20:01 GMT+03:00 Yury Ruchin <[email protected]>:

> Hi,
>
> I'm facing an issue where a Storm 0.9.5 topology looks alive but effectively
> stops processing tuples. These are the steps to reproduce:
>
> 0. Have a large topology with dozens of workers. The topology reads data
> from a Kafka spout and has topology.max.spout.pending set to a finite size.
> 1. Deploy the topology so that all worker slots are occupied.
> 2. Take note of two worker processes, call them worker A and worker B.
> Assume worker B occupies slot (N+P), where N is a node name and P is a port.
> 3. Kill worker A.
> 4. Wait for Nimbus to detect A's death. Nimbus will initiate a restart of A.
> 5. Wait for A to establish a Netty client connection to B.
> 6. Kill B. From that point on, A's connection to B is stale. Nevertheless,
> it will remain in the ":cached-node+port->socket" map unless it is closed
> by a later refresh-connections() call.
> 7. If B restarts before the next scheduled refresh-connections() call goes
> off, A's stale connection to (N+P) will never be reestablished, since B is
> restarted in the same slot it occupied before death, so the assignment does
> not change with regard to (N+P).
> 8. Worker A hangs in the not-yet-started state (its storm-active flag is
> false), but from the Nimbus perspective it is alive, so other workers'
> spouts keep sending data to A, exhaust their topology.max.spout.pending
> budget, and stop emitting as well.
>
> This may look like a contrived case, but I hit it several times a day in
> my setup. Probably because ZooKeeper is slow, I observe massive worker
> restarts by heartbeat timeout at nearly the same time, which leads to the
> scenario above. I actually do have some free slots in the cluster, but
> that does not prevent workers from being reassigned to the same slot in
> rapid succession.
>
> Something very similar is described in this issue:
> https://issues.apache.org/jira/browse/STORM-946.
>
> Has anyone ever seen this? Maybe it's somehow fixed / alleviated in Storm
> 0.9.6/0.10.0?
>
> Thanks,
> Yury
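For anyone trying to follow steps 6-7 above, here is a minimal sketch of the failure mode in Java. This is not Storm's actual code (the real ":cached-node+port->socket" map and refresh-connections() live in Storm's Clojure worker code); the class and field names are hypothetical stand-ins chosen only to show why a refresh that diffs the cache against the current assignment cannot notice a peer that died and restarted in the same node+port slot:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model: connections are cached by node+port,
// and "refresh" only reconciles the cache against the assignment keys.
public class StaleConnectionSketch {

    static class Connection {
        final String nodePort;
        boolean stale = false;          // set when the remote peer dies
        Connection(String nodePort) { this.nodePort = nodePort; }
    }

    // Analogue of the :cached-node+port->socket map in worker A.
    static final Map<String, Connection> cache = new HashMap<>();

    // Analogue of refresh-connections(): drop connections to slots no
    // longer assigned, open connections to newly assigned slots. A slot
    // whose key is unchanged is left alone -- even if its peer restarted.
    static void refreshConnections(Set<String> assignedNodePorts) {
        cache.keySet().removeIf(np -> !assignedNodePorts.contains(np));
        for (String np : assignedNodePorts) {
            cache.computeIfAbsent(np, Connection::new);
        }
    }

    public static void main(String[] args) {
        // Worker A connects to worker B in slot "node1:6700" (N+P).
        refreshConnections(Set.of("node1:6700"));

        // Step 6: B dies. A's cached connection is now stale, but the
        // (N+P) key is still present in the cache.
        cache.get("node1:6700").stale = true;

        // Step 7: B restarts in the SAME slot before the next refresh,
        // so the assignment is identical and the stale entry survives.
        refreshConnections(Set.of("node1:6700"));
        System.out.println(cache.get("node1:6700").stale);
    }
}
```

Running this prints `true`: the refresh keeps the stale connection object because its key matches the unchanged assignment, which mirrors why worker A never reconnects to the restarted B.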
