Hi Matthew,

The stock ha-restart script needs to include a proper fencing mechanism for the VM hosts. This is needed to prevent the split-brain condition described in your email.
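
As a rough illustration of what fencing in the hook could look like, here is a minimal shell sketch. It assumes a site-specific fencing tool at /usr/local/sbin/fence_host (a hypothetical path; substitute whatever your hardware supports, such as an IPMI or PDU power-off command) and that the hook is handed both the VM ID and the name of the failed host. The function name fence_then_boot and the FENCE_CMD/BOOT_CMD variables are likewise made up for this example; only "onevm boot" comes from the hook quoted in this thread.

```shell
#!/bin/sh
# Sketch only: fence the failed host before restarting a VM elsewhere.
# FENCE_CMD is a placeholder for a site-specific fencing tool (assumption);
# BOOT_CMD defaults to the command used by the hook in this thread.
FENCE_CMD="${FENCE_CMD:-/usr/local/sbin/fence_host}"
BOOT_CMD="${BOOT_CMD:-onevm boot}"

# fence_then_boot VM_ID HOSTNAME
# Returns 2 on bad arguments, 1 if fencing fails, otherwise the exit
# status of the boot command.
fence_then_boot() {
    vm_id="$1"
    host="$2"

    if [ -z "$vm_id" ] || [ -z "$host" ]; then
        echo "usage: fence_then_boot VM_ID HOSTNAME" >&2
        return 2
    fi

    # Fence first. If the host cannot be confirmed down, do not restart
    # the VM, because the old instance may still be running there.
    if ! $FENCE_CMD "$host"; then
        echo "fencing $host failed; not restarting VM $vm_id" >&2
        return 1
    fi

    # The host is fenced, so it is now safe to restart the VM.
    $BOOT_CMD "$vm_id"
}
```

Called as "fence_then_boot 14 vmhost3", it restarts the VM only once the fencing command reports success, so a host that cannot be confirmed down never leads to a second running copy of the guest.
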
Simply include the fencing command in the hook (you have the hostname of the target host in the script, so it should be straightforward). This will typically reboot the host, shutting down any VMs running on it.

Cheers

Ruben

On Thu, Jan 9, 2014 at 5:39 PM, Matthew Richardson <[email protected]> wrote:
> Hi,
>
> I'm running a ONE 4.2 pool, and had some issues with it earlier today.
>
> I had some vm hosts lock up due to networking issues, where the vm hosts
> could see the rest of the world, but not be reached by the ONE server.
>
> As a result, the ONE server called a hook script:
>
> VM_HOOK = [ name = "on_crash_boot", on = "UNKNOWN", command =
> "/usr/bin/env onevm boot", arguments = "$ID" ]
>
> This resulted in an attempted cleanup (which appears to fail due to the
> ongoing network problems) followed by a restart elsewhere. However, the
> failed cleanup meant that I then had 2 instances of the same guest
> running on 2 vm hosts, which led to mac address conflicts on the network.
>
> Is this a bug in ONE's handling of cleanup failure, or is there
> something else I should be doing in my hook script to ensure that it is
> safe to call onevm boot?
>
> Any advice appreciated!
> (other than to take better care of the network :) )
>
> thanks,
>
> Matthew
>
>
> oned.log starts as follows:
>
> Thu Jan 9 08:13:07 2014 [InM][I]: Command execution fail: 'if [ -x
> "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm 2
> vmhost3; else exit 42; fi'
> Thu Jan 9 08:13:07 2014 [InM][I]: Connection closed by 192.168.12.16
> Thu Jan 9 08:13:07 2014 [InM][I]: ExitCode: 255
> Thu Jan 9 08:13:07 2014 [ONE][E]: Error monitoring Host vmhost3 (2): -
> Thu Jan 9 08:13:07 2014 [ReM][D]: Req:3296 UID:0 VirtualMachineAction
> invoked, "boot", 14
> Thu Jan 9 08:13:07 2014 [DiM][D]: Restarting VM 14
> Thu Jan 9 08:13:07 2014 [ReM][D]: Req:3296 UID:0 VirtualMachineAction
> result SUCCESS, 14
> Thu Jan 9 08:13:07 2014 [HKM][D]: Message received: EXECUTE SUCCESS 14
> on_crash_boot:
>
> Thu Jan 9 08:13:08 2014 [ReM][D]: Req:3328 UID:0 VirtualMachineInfo
> invoked, 14
> Thu Jan 9 08:13:08 2014 [ReM][D]: Req:3328 UID:0 VirtualMachineInfo
> result SUCCESS, "<VM><ID>14</ID><UID>..."
>
> Thu Jan 9 08:13:08 2014 [ReM][D]: Req:9328 UID:0 VirtualMachineAction
> invoked, "delete-recreate", 14
> Thu Jan 9 08:13:08 2014 [ReM][D]: Req:9328 UID:0 VirtualMachineAction
> result SUCCESS, 14
>
> Thu Jan 9 08:13:08 2014 [VMM][D]: Message received: LOG I 14 Driver
> command for 14 cancelled
>
>
>
> The (slightly redacted) guest log (14.log) is as follows:
>
> Thu Jan 9 07:44:53 2014 [LCM][I]: New VM state is RUNNING
> Thu Jan 9 08:13:07 2014 [LCM][I]: New VM state is UNKNOWN
> Thu Jan 9 08:13:07 2014 [LCM][I]: New VM state is BOOT_UNKNOWN
> Thu Jan 9 08:13:07 2014 [HKM][I]: Success executing Hook: on_crash_boot: .
> Thu Jan 9 08:13:07 2014 [VMM][I]: Generating deployment file:
> /var/lib/one/vms/14/deployment.4917
> Thu Jan 9 08:13:08 2014 [LCM][I]: New VM state is CLEANUP.
> Thu Jan 9 08:13:08 2014 [VMM][I]: Driver command for 14 cancelled
> Thu Jan 9 08:18:52 2014 [VMM][I]: Command execution fail:
> /var/tmp/one/vmm/kvm/cancel 'one-14' 'vmhost3' 14 vmhost3
> Thu Jan 9 08:18:52 2014 [VMM][I]: Connection closed by 192.168.12.16
> Thu Jan 9 08:18:52 2014 [VMM][I]: ExitSSHCode: 255
> Thu Jan 9 08:18:52 2014 [VMM][E]: Error connecting to vmhost3
> Thu Jan 9 08:18:52 2014 [VMM][I]: Failed to execute virtualization
> driver operation: cancel.
> Thu Jan 9 08:18:52 2014 [VMM][I]: Command execution fail:
> /var/tmp/one/vnm/dummy/clean <...snip...>
> Thu Jan 9 08:18:52 2014 [VMM][I]: Connection closed by 192.168.12.16
> Thu Jan 9 08:18:52 2014 [VMM][I]: ExitSSHCode: 255
> Thu Jan 9 08:18:52 2014 [VMM][E]: Error connecting to vmhost3
> Thu Jan 9 08:18:52 2014 [VMM][I]: Failed to execute network driver
> operation: clean.
> Thu Jan 9 08:19:01 2014 [VMM][I]: Successfully execute transfer manager
> driver operation: tm_delete.
> Thu Jan 9 08:19:02 2014 [VMM][I]: Successfully execute transfer manager
> driver operation: tm_delete.
> Thu Jan 9 08:19:02 2014 [VMM][I]: Host successfully cleaned.
> Thu Jan 9 08:19:03 2014 [DiM][I]: New VM state is PENDING
> Thu Jan 9 08:20:54 2014 [DiM][I]: New VM state is ACTIVE.
> Thu Jan 9 08:20:54 2014 [LCM][I]: New VM state is PROLOG.
> Thu Jan 9 08:20:54 2014 [VM][I]: Virtual Machine has no context
> Thu Jan 9 08:20:54 2014 [LCM][I]: New VM state is BOOT
> Thu Jan 9 08:20:54 2014 [VMM][I]: Generating deployment file:
> /var/lib/one/vms/14/deployment.4918
> Thu Jan 9 08:20:56 2014 [VMM][I]: ExitCode: 0
> Thu Jan 9 08:20:56 2014 [VMM][I]: Successfully execute network driver
> operation: pre.
> Thu Jan 9 08:20:56 2014 [VMM][I]: ExitCode: 0
> Thu Jan 9 08:20:56 2014 [VMM][I]: Successfully execute virtualization
> driver operation: deploy.
> Thu Jan 9 08:20:56 2014 [VMM][I]: ExitCode: 0
> Thu Jan 9 08:20:56 2014 [VMM][I]: Successfully execute network driver
> operation: post.
> Thu Jan 9 08:20:56 2014 [LCM][I]: New VM state is RUNNING
>
>
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
> _______________________________________________
> Users mailing list
> [email protected]
> http://lists.opennebula.org/listinfo.cgi/users-opennebula.org
>
> --
> Ruben S. Montero, PhD
> Project co-Lead and Chief Architect
> OpenNebula - Flexible Enterprise Cloud Made Simple
> www.OpenNebula.org | [email protected] | @OpenNebula
