You could edit the
openshift-ansible/playbooks/common/openshift-node/restart.yml and add:
max_fail_percentage: 0
under
serial: "{{ openshift_restart_nodes_serial | default(1) }}"
That, in theory, should make it fail straight away.
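
A minimal sketch of what the top of that play might look like with the
extra line (the hosts and roles values here are illustrative guesses, not
copied from the actual file):

    # restart.yml (sketch) -- max_fail_percentage is the only suggested addition
    - name: Restart nodes
      hosts: oo_nodes_to_config   # illustrative group name; keep whatever the real play targets
      serial: "{{ openshift_restart_nodes_serial | default(1) }}"
      max_fail_percentage: 0      # abort the play as soon as any host in a batch fails
      roles:
        - openshift_node          # illustrative; keep whatever the file already lists

max_fail_percentage is evaluated per serial batch, so with the default
serial of 1 a single failed host is 100% of its batch and the run stops
immediately instead of carrying on. Setting any_errors_fatal: true on the
play should have much the same effect.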
On Wed, Mar 14, 2018 at 9:46 PM Alan Christie <[email protected]> wrote:
> Hi,
>
> I’ve been running the Ansible release-3.7 branch playbook and occasionally
> I get errors restarting nodes. I’m not looking for help on why my nodes are
> not restarting, but I am curious as to why the playbook continues when there
> are fatal errors that eventually lead to a failure some 30 minutes or so
> later. It’s especially annoying if you happen a) not to be looking at the
> screen at the time of the original failure or b) to be running the
> installation inside another IaC framework.
>
> Is there a “stop on fatal” option I’m missing, by chance?
>
> Here’s a typical failure at (in my case) 21 minutes in…
>
>
> RUNNING HANDLER [openshift_node : restart node] *******************************************************************
> Wednesday 14 March 2018  10:12:44 +0000 (0:00:00.081)       0:21:47.968 *******
> skipping: [os-master-1]
> skipping: [os-node-001]
> FAILED - RETRYING: restart node (3 retries left).
> FAILED - RETRYING: restart node (3 retries left).
> FAILED - RETRYING: restart node (2 retries left).
> FAILED - RETRYING: restart node (2 retries left).
> FAILED - RETRYING: restart node (1 retries left).
> FAILED - RETRYING: restart node (1 retries left).
>
>
> fatal: [os-infra-1]: FAILED! => {"attempts": 3, "changed": false, "msg":
> "Unable to restart service origin-node: Job for origin-node.service failed
> because the control process exited with error code. See \"systemctl status
> origin-node.service\" and \"journalctl -xe\" for details.\n"}
> fatal: [os-node-002]: FAILED! => {"attempts": 3, "changed": false, "msg":
> "Unable to restart service origin-node: Job for origin-node.service failed
> because the control process exited with error code. See \"systemctl status
> origin-node.service\" and \"journalctl -xe\" for details.\n"}
> And the roll-out finally "gives up the ghost" (in my case) after a further
> 30 minutes...
>
> TASK [debug] *****************************************************************************************************
> Wednesday 14 March 2018  10:42:20 +0000 (0:00:00.117)       0:51:23.829 *******
> skipping: [os-master-1]
> to retry, use: --limit
> @/home/centos/abc/orchestrator/openshift/openshift-ansible/playbooks/byo/config.retry
>
> PLAY RECAP *******************************************************************************************************
> localhost                  : ok=13   changed=0    unreachable=0   failed=0
> os-infra-1                 : ok=182  changed=70   unreachable=0   failed=1
> os-master-1                : ok=539  changed=210  unreachable=0   failed=0
> os-node-001                : ok=188  changed=65   unreachable=0   failed=0
> os-node-002                : ok=165  changed=61   unreachable=0   failed=1
>
> Alan Christie
>
_______________________________________________
users mailing list
[email protected]
http://lists.openshift.redhat.com/openshiftmm/listinfo/users