> Can someone explain to me how one would update a live topology?

This was always problematic for our topology, but this is how we did it. I make no claims about it being optimal or even the best way to do it. It worked for us after we figured out all the kinks.
For topology changes that are backwards compatible, we would deploy the new topology under a new name (e.g. name-<timestamp>), and once Storm UI showed the new topology was up and running properly we would kill the old one. Make sure the timeout you use to kill the old topology works for you: if you have any in-memory batching or other in-memory stateful processing, give the workers enough time to finish whatever they're doing. This obviously requires twice the resources for the topology temporarily, because the old and new versions run side by side for a short period.

For backwards-incompatible changes you can't do it that way if you're processing a lot of events, because you would swamp your topology with errors, unless your error processing is designed in such a way that you could recover easily. What we did instead was kill the old topology to stop processing events entirely (same caveats as above). Events would start backing up in the message queue (e.g. Kafka), and we would deploy the new topology only after the old topology was 100% gone. To prepare for this we would copy the topology jar files onto the Nimbus server ahead of time, and then use the Storm command line tool on the Nimbus server itself to kill the old topology and deploy the new one. We did this to avoid transferring the topology jar to Nimbus over the network; this way the new topology gets up and running MUCH faster, minimizing your downtime.

In this case you need to make sure your topology can handle the rush of events backed up in whatever event source you're using. Depending on your event volume, you could end up with an incredibly high event spike compared to your normal event volume. When we had prolonged downtime for maintenance, we would back events up into a raw-events data store and then use batch processing later to catch up. YMMV. -TPP
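For concreteness, here is a rough Java sketch of the backwards-compatible flow described above, assuming a Storm 1.x-style API (`org.apache.storm` packages). The topology name, worker count, wait time, and `buildTopology()` helper are placeholders I've made up for illustration, not part of the original procedure; in practice the kill step happens only after you've checked the new topology in Storm UI (or you just run `storm kill <name> -w <secs>` from the CLI).

```java
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.generated.KillOptions;
import org.apache.storm.generated.Nimbus;
import org.apache.storm.generated.StormTopology;
import org.apache.storm.utils.NimbusClient;
import org.apache.storm.utils.Utils;

public class RollingRedeploy {

    public static void main(String[] args) throws Exception {
        // args[0]: name of the currently running (old) topology version.
        String oldName = args[0];

        Config conf = new Config();
        conf.setNumWorkers(4); // placeholder; match your real sizing

        // 1. Submit the new version under a unique, timestamped name so it can
        //    run alongside the old version for a short overlap window.
        String newName = "my-topology-" + System.currentTimeMillis();
        StormSubmitter.submitTopology(newName, conf, buildTopology());

        // 2. After verifying in Storm UI that the new topology is healthy,
        //    kill the old one with a wait time long enough for any in-memory
        //    batching or stateful processing to drain before workers die.
        Map<String, Object> clusterConf = Utils.readStormConfig();
        Nimbus.Client nimbus = NimbusClient.getConfiguredClient(clusterConf).getClient();
        KillOptions kill = new KillOptions();
        kill.set_wait_secs(120); // tune to your in-memory flush time
        nimbus.killTopologyWithOpts(oldName, kill);
    }

    private static StormTopology buildTopology() {
        // Placeholder: build and return your actual topology here.
        throw new UnsupportedOperationException("wire up your own topology builder");
    }
}
```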
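For the kill-and-redeploy case, one way to keep the backlog replay from overwhelming the topology is Storm's max-spout-pending setting, which caps un-acked tuples per spout task when acking is enabled. This is a sketch of one possible knob, not the original poster's setup, and the numbers are placeholders to tune for your own volume.

```java
import org.apache.storm.Config;

public class BacklogThrottleConfig {

    // Build a Config that throttles how fast the backed-up queue is replayed
    // after the new topology comes up: each spout task keeps at most
    // maxSpoutPending tuples in flight, so the backlog drains at a rate the
    // downstream bolts can absorb instead of hitting them all at once.
    public static Config throttled(int maxSpoutPending, int messageTimeoutSecs) {
        Config conf = new Config();
        conf.setMaxSpoutPending(maxSpoutPending);       // e.g. a few thousand; tune for your volume
        conf.setMessageTimeoutSecs(messageTimeoutSecs); // extra headroom while catching up
        return conf;
    }
}
```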
