You could edit openshift-ansible/playbooks/common/openshift-node/restart.yml and add:
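Something like the following sketch. (The play name and `hosts` group here are illustrative, from memory — check the actual release-3.7 file; only the `serial` line and the added `max_fail_percentage` key are the point.)

```yaml
# playbooks/common/openshift-node/restart.yml -- illustrative excerpt
- name: Restart nodes                        # hypothetical play name
  hosts: oo_nodes_to_config                  # assumed host group; check the real file
  serial: "{{ openshift_restart_nodes_serial | default(1) }}"
  max_fail_percentage: 0                     # added: abort as soon as any host fails
```

With `serial` batching the hosts and `max_fail_percentage: 0`, a single failed host should abort the remaining batches instead of letting the play grind on for another half hour.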
max_fail_percentage: 0

directly under the line

serial: "{{ openshift_restart_nodes_serial | default(1) }}"

That, in theory, should make it fail straight away.

On Wed, Mar 14, 2018 at 9:46 PM Alan Christie <achris...@informaticsmatters.com> wrote:

> Hi,
>
> I’ve been running the Ansible release-3.7 branch playbook and occasionally
> I get errors restarting nodes. I’m not looking for help on why my nodes are
> not restarting, but I am curious as to why the playbook continues when there
> are fatal errors that eventually lead to a failure some 30 minutes or so
> later. It is especially annoying if you happen a) not to be looking at the
> screen at the time of the original failure or b) to be running the
> installation inside another IaC framework.
>
> Is there an option to “stop on fatal” I’m missing, by chance?
>
> Here’s a typical failure at (in my case) 21 minutes in...
>
> RUNNING HANDLER [openshift_node : restart node] ******************************
> Wednesday 14 March 2018 10:12:44 +0000 (0:00:00.081)  0:21:47.968 *******
> skipping: [os-master-1]
> skipping: [os-node-001]
> FAILED - RETRYING: restart node (3 retries left).
> FAILED - RETRYING: restart node (3 retries left).
> FAILED - RETRYING: restart node (2 retries left).
> FAILED - RETRYING: restart node (2 retries left).
> FAILED - RETRYING: restart node (1 retries left).
> FAILED - RETRYING: restart node (1 retries left).
>
> fatal: [os-infra-1]: FAILED! => {"attempts": 3, "changed": false, "msg":
> "Unable to restart service origin-node: Job for origin-node.service failed
> because the control process exited with error code. See \"systemctl status
> origin-node.service\" and \"journalctl -xe\" for details.\n"}
> fatal: [os-node-002]: FAILED! => {"attempts": 3, "changed": false, "msg":
> "Unable to restart service origin-node: Job for origin-node.service failed
> because the control process exited with error code. See \"systemctl status
> origin-node.service\" and \"journalctl -xe\" for details.\n"}
>
> And the roll-out finally "gives up the ghost" (in my case) after a further
> 30 minutes...
>
> TASK [debug] *****************************************************************
> Wednesday 14 March 2018 10:42:20 +0000 (0:00:00.117)  0:51:23.829 *******
> skipping: [os-master-1]
>         to retry, use: --limit @/home/centos/abc/orchestrator/openshift/openshift-ansible/playbooks/byo/config.retry
>
> PLAY RECAP *******************************************************************
> localhost    : ok=13   changed=0    unreachable=0  failed=0
> os-infra-1   : ok=182  changed=70   unreachable=0  failed=1
> os-master-1  : ok=539  changed=210  unreachable=0  failed=0
> os-node-001  : ok=188  changed=65   unreachable=0  failed=0
> os-node-002  : ok=165  changed=61   unreachable=0  failed=1
>
> Alan Christie
>
> _______________________________________________
> users mailing list
> users@lists.openshift.redhat.com
> http://lists.openshift.redhat.com/openshiftmm/listinfo/users