Hi,

Apologies for the cold email. I'd appreciate a sanity check on a Storm spout shutdown design before we ship it.

Setup: 6 worker JVMs across 3 hosts, custom Kafka spout (not storm-kafka-client), non-idempotent bolts. On `storm kill` during deploys, we need every in-flight tuple acked/failed and its offset committed before exit; otherwise the uncommitted window replays and we get duplicates.

Our current shutdown path:

- `deactivate()`: pause consumer + emission, keep draining the staging queue via `nextTuple()`, advance watermarks as acks arrive.
- `close()`: stop consumer loop, `commitSync(watermarks)` via `onPartitionsRevoked`.

What I think I know (please correct):

- `storm kill` *is* graceful: spouts get `deactivate()`, then Storm waits `TOPOLOGY_MESSAGE_TIMEOUT_SECS` before tearing workers down.
- `close()` is explicitly best-effort: the ISpout Javadoc says so, and `supervisor.worker.shutdown.sleep.secs` caps how long the worker has before force-kill.

If that's right, putting the final `commitSync` in `close()` is structurally fragile: it may never run, or get cut off mid-call.

Two questions:

1. Is the recommended pattern "commit-on-ack from `nextTuple()`, treat `close()` as best-effort", so that the committed offset already reflects all fully-acked tuples by the time the JVM dies? That seems to be roughly how storm-kafka-client does it.
2. Beyond `TOPOLOGY_MESSAGE_TIMEOUT_SECS` and `supervisor.worker.shutdown.sleep.secs`, are there other knobs (`topology.kill.delay.secs`, shutdown hooks) we should tune to widen the drain window?

Already read: ISpout Javadoc, Running-topologies-on-a-production-cluster.md, STORM-794, STORM-2176, STORM-3355. A pointer to anything I've missed is plenty; I don't want to take up your time re-explaining things.

Please guide me on the best approach.

Thanks and regards,
Karthick
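
P.S. To make question 1 concrete, here is a minimal sketch of what per-partition commit-on-ack bookkeeping could look like. It is an illustration, not our production code: `OffsetWatermark` and its method names are hypothetical (not Storm or Kafka APIs). It assumes offsets are emitted in increasing order within a partition and that a failed tuple stays pending until its replay is acked.

```java
import java.util.TreeSet;

/** Hypothetical per-partition tracker; one instance per assigned partition. */
final class OffsetWatermark {
    // Offsets that have been emitted but not yet acked.
    private final TreeSet<Long> pending = new TreeSet<>();
    // One past the highest offset emitted so far.
    private long nextToEmit;

    OffsetWatermark(long startOffset) {
        this.nextToEmit = startOffset;
    }

    /** Record that the tuple for this offset was emitted. */
    void emitted(long offset) {
        pending.add(offset);
        nextToEmit = Math.max(nextToEmit, offset + 1);
    }

    /** Record that the tuple for this offset was fully acked. */
    void acked(long offset) {
        pending.remove(offset);
    }

    /**
     * Offset that is safe to commit, in Kafka's "next offset to consume" sense:
     * the lowest offset still pending, or one past the highest emitted offset
     * when nothing is pending.
     */
    long committable() {
        return pending.isEmpty() ? nextToEmit : pending.first();
    }
}
```

The intent would be for the spout's `ack()` to call `acked(offset)` and for `nextTuple()` to periodically commit `committable()` per partition, so that `close()` has nothing essential left to do if it never runs.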
