Remember that spot instances receive shutdown notifications on a
best-effort basis[1]. You would have to disconnect the node, drain it, then
shut it down after draining, and hope you finish before the instance is
killed. You could also consider the new hibernation feature -- it hibernates
your node instead of terminating it, then rehydrates it at a later time.
Your cluster would have a disconnected node in the meantime, though. All of
these scenarios introduce a significant potential for data loss; be sure
you can reproduce the data from a durable source if needed (e.g. Kafka),
or be willing to accept the loss.
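The disconnect/drain/remove sequence could be sketched against NiFi's REST API roughly as follows. This is a minimal sketch: the coordinator address and node ID are placeholders, an unsecured cluster is assumed, the fixed sleeps stand in for real status polling, and offloading requires NiFi 1.8.0 or later.

```shell
#!/bin/bash
# Sketch of disconnect -> offload (drain) -> remove via NiFi's REST API.
# API address and node ID are placeholders; unsecured cluster assumed.

# Set a node's cluster status (e.g. DISCONNECTING, OFFLOADING).
set_node_status() {
  local api="$1" node_id="$2" status="$3"
  curl -s -X PUT -H 'Content-Type: application/json' \
    -d "{\"node\":{\"nodeId\":\"${node_id}\",\"status\":\"${status}\"}}" \
    "${api}/controller/cluster/nodes/${node_id}"
}

decommission_node() {
  local api="$1" node_id="$2"
  set_node_status "$api" "$node_id" "DISCONNECTING"  # stop accepting new work
  sleep 5                                            # crude wait; poll node status in practice
  set_node_status "$api" "$node_id" "OFFLOADING"     # rebalance queued flowfiles to other nodes
  sleep 30                                           # again, poll until OFFLOADED in practice
  curl -s -X DELETE "${api}/controller/cluster/nodes/${node_id}"
}

# Example (placeholders): decommission_node "http://nifi-coordinator:8088/nifi-api" "<node-uuid>"
```

Whether the offload finishes inside the two-minute spot window depends entirely on how much data is queued on the node.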


[1]
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-interruptions.html
"While we make every effort to provide this warning as soon as possible, it
is possible that your Spot Instance is terminated before the warning can be
made available. Test your application to ensure that it handles an
unexpected instance termination gracefully, even if you are testing for
interruption notices. You can do so by running the application using an
On-Demand Instance and then terminating the On-Demand Instance yourself."

On Wed, Aug 28, 2019 at 8:57 PM Jean-Sebastien Vachon <
[email protected]> wrote:

> Hi Craig,
>
> I made some additional tests and I am afraid I lost flowfiles... I used
> the same flow I described earlier, generated around 30k flowfiles, and
> load balanced them across the three nodes forming my cluster.
> I then shut down one of the machines. The result is that I lost the 10k
> flowfiles that were scheduled to be processed on that machine. This is a
> problem I need to address and I'll be looking for ideas shortly.
>
> For those interested in automating the removal of a spot instance from a
> cluster... here is something to get you started.
> AWS recommends polling the URL found in the if statement every 5 s or so.
> Since cron only supports 1-minute intervals and nothing smaller, I
> accomplished what I wanted by adding multiple cron entries, each sleeping
> for a different amount of time before checking.
>
> You will need jq and curl installed on your machine for this to work.
> The basic idea is to wait until the termination-time metadata endpoint
> appears to exist (i.e. stops returning 404) and then trigger a series of
> actions.
>
> ---
>
> #!/bin/bash
> sleep $1
>
> NODE_IP=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)
> NODE_ID=$(curl -s "http://${NODE_IP}:8088/nifi-api/controller/cluster" |
>   jq --arg IP "${NODE_IP}" -r '.cluster.nodes[] | select(.address == $IP).nodeId')
> OTHER_NODE=$(curl -s "http://${NODE_IP}:8088/nifi-api/controller/cluster" |
>   jq --arg IP "${NODE_IP}" -r '.cluster.nodes[] | select(.address != $IP).address' | head -1)
>
> # The termination-time endpoint returns 404 until a termination is scheduled.
> if [ -z "$(curl -Is http://169.254.169.254/latest/meta-data/spot/termination-time |
>   head -1 | grep 404)" ]
> then
>     echo "Running shutdown hook."
>     systemctl stop nifi
>     sleep 5
>     curl -s -X DELETE "http://${OTHER_NODE}:8088/nifi-api/controller/cluster/nodes/${NODE_ID}"
> fi
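The staggered cron trick described above (cron fires each entry once a minute, and each entry passes a different sleep offset as `$1` to the script) could be generated with a loop rather than written by hand. The script path below is a placeholder:

```shell
#!/bin/bash
# Emit twelve crontab entries that together poll every 5 seconds:
# each runs once per minute, sleeping 0, 5, 10, ..., 55 s before checking.
for offset in $(seq 0 5 55); do
  echo "* * * * * /opt/nifi/check-spot-termination.sh ${offset}"
done
```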
>
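As an aside, instead of grepping response headers for 404, curl can report the HTTP status code directly, which makes the check a little easier to read. A sketch (endpoint per the AWS instance metadata docs; no IMDSv2 token handling here):

```shell
#!/bin/bash
# Returns success (0) once AWS has scheduled a spot termination:
# the endpoint serves 404 until then, 200 with a timestamp afterwards.
is_termination_scheduled() {
  local url="${1:-http://169.254.169.254/latest/meta-data/spot/termination-time}"
  local code
  code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
  [ "$code" = "200" ]
}

# Example: if is_termination_scheduled; then systemctl stop nifi; fi
```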
> ------------------------------
> *From:* Jean-Sebastien Vachon <[email protected]>
> *Sent:* Wednesday, August 28, 2019 7:39 PM
> *To:* [email protected] <[email protected]>
> *Subject:* Re: clean shutdown
>
> Hi Craig,
>
> First the generic stuff...
>
> According to the tests I made, no flowfiles are lost when a machine is
> removed from the cluster; they seem to be requeued.
> However, I only tested with a very basic flow and not with my whole flow,
> which involves a lot of things.
> Basically, I used a GenerateFlowFile processor to generate some data and a
> dummy Python process to do something with it. The queue between the two
> processors was configured to load balance using round robin. I must admit
> that I haven't looked at whether the items were requeued and dispatched to
> another node.
> The output of the Python module was split between success and failure, and
> not a single flowfile reached the failure state.
>
> Then, the AWS-specific stuff...
>
> I had to script a few things to clean up within the two-minute warning AWS
> gives me.
> Since I am using spot instances, I know the instance will not come back,
> so I had to automate the cleanup of the cluster by using an API call to
> remove the machine. To remove the machine from the cluster, I need to stop
> NiFi first and then remove the machine through a call to the API on a
> second node. I am still polishing the script to accomplish this. I may
> share it once it is working as expected in case someone else has this
> issue.
>
> Let me know if you need more details about anything...
> ------------------------------
> *From:* Craig Knell <[email protected]>
> *Sent:* Wednesday, August 28, 2019 6:52 PM
> *To:* [email protected] <[email protected]>
> *Subject:* Re: clean shutdown
>
> Hi Jean-Sebastien,
>
> I’d be interested to hear how this performs
>
> Best regards
>
> Craig
>
> On 28 Aug 2019, at 22:28, Jean-Sebastien Vachon <[email protected]>
> wrote:
>
> Hi Pierre,
>
> thanks for your input.
>
> I am already intercepting the AWS termination notification, so I will add
> a few steps and see how it reacts.
>
> Thanks again
> ------------------------------
> *From:* Pierre Villard <[email protected]>
> *Sent:* Wednesday, August 28, 2019 4:17 AM
> *To:* [email protected] <[email protected]>
> *Subject:* Re: clean shutdown
>
> Hi Jean-Sebastien,
>
> When you stop NiFi, by default it will try to gracefully stop everything
> within 10 seconds; if not all components are nicely stopped after that, it
> will force shut down the NiFi process. This is configured with
> "nifi.flowcontroller.graceful.shutdown.period" in the nifi.properties
> file. If you have processors/controller services that might take longer to
> stop gracefully (because of connections to external systems, for
> instance), you could increase this value.
>
> I'm not very familiar with AWS spot instances but I'd try to catch the
> spot notification event to stop the NiFi service on the host before the
> instance is stopped/killed.
>
> Pierre
>
>
>
> On Tue, Aug 27, 2019 at 8:05 PM Jean-Sebastien Vachon <
> [email protected]> wrote:
>
> Hi everybody,
>
> I am working with AWS spot instances, and one thing that is giving me a
> hard time is performing a clean (and quick) shutdown of NiFi in order to
> prevent data loss.
>
> AWS gives you about two minutes to clean up everything before the machine
> is actually shut down.
> Is there a way to stop/kill all processes running on the host without
> losing anything? It is fine if all the flowfiles being processed are
> simply requeued.
>
> Would simply killing the processes achieve this? (I doubt it)... Would it
> be better to fetch a list of running processors and terminate them using
> NiFi's API?
>
> All ideas and thoughts are welcome
>
> thanks
>
>
