> Can someone explain to me how one would update a live topology?

This was always problematic for our topology, but this is how we did
it. I make no claims about it being optimal or even the best way to do
it. It worked for us once we worked out all the kinks.

For topology changes that are backwards compatible, we would just
deploy the new topology under a new name (e.g. name-<timestamp>), and
once the Storm UI showed that the new topology was up and running
properly we would kill the old topology. Make sure the timeout you're
using to kill the old topology works for you, i.e. if you have any
in-memory batching or other in-memory stateful processing, give the
workers enough time to finish whatever they're doing.
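As a rough sketch (the topology names, jar path, and main class below
are made up, and I'm assuming your main class takes the topology name
as an argument), the deploy-then-kill dance looked something like
this; the -w flag on storm kill is how many seconds to wait before the
workers are shut down:

    # submit the new version under a timestamped name
    storm jar target/my-topology.jar com.example.MyTopology my-topology-1418000000

    # once the Storm UI shows it running properly, kill the old one,
    # giving the workers 120 seconds to drain any in-memory state
    storm kill my-topology-1417900000 -w 120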

The above obviously requires twice the resources for the topology
temporarily, because you're running the old and new versions side by
side for a short period.


For backwards-incompatible changes you can't do it that way if you're
processing a lot of events, because you would swamp your topology with
errors, unless your error handling is designed in such a way that you
can recover easily.

What we used to do is kill the old topology to stop processing events
entirely (same caveats as above). Events would start backing up in the
message queue (e.g. Kafka). We would then deploy the new topology
after the old topology was 100% gone. To prepare for this we would
copy the topology jar files to the Nimbus server ahead of time, and
then use the Storm command line tool on the Nimbus server to kill the
old topology and deploy the new one. We did this to avoid transferring
the topology jar to the Nimbus server over the network at deploy time.
This way the new topology gets up and running MUCH faster, minimizing
your downtime.
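Roughly, the Nimbus-local deploy looked like this (hostnames and paths
are made up):

    # ahead of time: stage the new jar on the Nimbus host
    scp target/my-topology.jar nimbus-host:/opt/deploys/

    # on the Nimbus host: kill the old topology with a drain timeout
    storm kill my-topology -w 120

    # wait until `storm list` no longer shows the old topology, then
    # submit the new jar locally so no jar upload goes over the network
    storm jar /opt/deploys/my-topology.jar com.example.MyTopology my-topology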

You need to make sure your topology can handle the rush of events
backed up in whatever event source you're using in this case.
Depending on your event volume, you could end up with an incredibly
high spike compared to your normal rate. When we had prolonged
downtime for maintenance we would back up events into a raw events
data store and then use batch processing later to catch up. YMMV.
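If Kafka is your event source, one way to gauge the backlog before and
after the switch is to check consumer lag, e.g. something like the
following (assuming a spout that commits its offsets to a Kafka
consumer group, and a made-up group name; the older ZooKeeper-based
spout and older Kafka releases used a ConsumerOffsetChecker tool
instead):

    # show how far the topology's consumer group is behind, per partition
    kafka-consumer-groups.sh --bootstrap-server kafka-host:9092 \
      --describe --group my-topology-group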

-TPP
