You could edit the
openshift-ansible/playbooks/common/openshift-node/restart.yml and add:

max_fail_percentage: 0

under

serial: "{{ openshift_restart_nodes_serial | default(1) }}"

That, in theory, should make the play abort as soon as any host fails, rather than carrying on with the remaining hosts.
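For example, the play would then look something like this (a sketch against the release-3.7 layout; the play/host names here are illustrative and may differ in your checkout):

```yaml
# openshift-ansible/playbooks/common/openshift-node/restart.yml (excerpt)
- name: Restart nodes
  hosts: oo_nodes_to_config          # illustrative; use whatever the play already targets
  serial: "{{ openshift_restart_nodes_serial | default(1) }}"
  max_fail_percentage: 0             # abort the play as soon as any host in a batch fails
```

Note that with `serial`, `max_fail_percentage` is evaluated per batch, so 0 means the first failed host ends the play. Setting `any_errors_fatal: true` on the play is another standard Ansible way to get "stop on fatal" behaviour.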

On Wed, Mar 14, 2018 at 9:46 PM Alan Christie <
achris...@informaticsmatters.com> wrote:

> Hi,
>
> I’ve been running the Ansible release-3.7 branch playbook and occasionally
> I get errors restarting nodes. I’m not looking for help on why my nodes are
> not restarting but I am curious as to why the playbook continues when there
> are fatal errors that eventually lead to a failure some 30 minutes or so
> later? It's especially annoying if you happen a) not to be looking at the
> screen at the time of the original failure or b) to be running the
> installation inside another IaC framework.
>
> Is there an option to “stop on fatal” I’m missing by chance?
>
> Here’s a typical failure at (in my case) 21 minutes in…
>
>
> RUNNING HANDLER [openshift_node : restart node] *******************************************************************
> Wednesday 14 March 2018  10:12:44 +0000 (0:00:00.081)       0:21:47.968 *******
> skipping: [os-master-1]
> skipping: [os-node-001]
> FAILED - RETRYING: restart node (3 retries left).
> FAILED - RETRYING: restart node (3 retries left).
> FAILED - RETRYING: restart node (2 retries left).
> FAILED - RETRYING: restart node (2 retries left).
> FAILED - RETRYING: restart node (1 retries left).
> FAILED - RETRYING: restart node (1 retries left).
>
>
> fatal: [os-infra-1]: FAILED! => {"attempts": 3, "changed": false, "msg":
> "Unable to restart service origin-node: Job for origin-node.service failed
> because the control process exited with error code. See \"systemctl status
> origin-node.service\" and \"journalctl -xe\" for details.\n"}
> fatal: [os-node-002]: FAILED! => {"attempts": 3, "changed": false, "msg":
> "Unable to restart service origin-node: Job for origin-node.service failed
> because the control process exited with error code. See \"systemctl status
> origin-node.service\" and \"journalctl -xe\" for details.\n"}
> And the roll-out finally "gives up the ghost" (in my case) after a further
> 30 minutes...
>
> TASK [debug]
> *****************************************************************************************************
> Wednesday 14 March 2018  10:42:20 +0000 (0:00:00.117)       0:51:23.829
> *******
> skipping: [os-master-1]
> to retry, use: --limit
> @/home/centos/abc/orchestrator/openshift/openshift-ansible/playbooks/byo/config.retry
>
> PLAY RECAP
> *******************************************************************************************************
> localhost                  : ok=13   changed=0    unreachable=0   failed=0
> os-infra-1                 : ok=182  changed=70   unreachable=0   failed=1
> os-master-1                : ok=539  changed=210  unreachable=0   failed=0
> os-node-001                : ok=188  changed=65   unreachable=0   failed=0
> os-node-002                : ok=165  changed=61   unreachable=0   failed=1
>
> Alan Christie
>
>
>
>
> _______________________________________________
> users mailing list
> users@lists.openshift.redhat.com
> http://lists.openshift.redhat.com/openshiftmm/listinfo/users
>