Greetings,
I have a four-node tashi cluster set up and I ran into a few problems with
migration that I was hoping someone could help with. Migration sort of
works: I create a brand new VM and migrate it to another node while it is
running. When the migration is completed, tashi reports the VM as running,
but network connectivity is not restored.
I am using qemu to run the virtual machines.
I suspect this happens because both nodes are connected to the same
network switch and ARP resolution fails: the switch already knows which
port the MAC address in question is behind, and it is never told that the
VM has moved.
To confirm this, I migrated the VM back to the previous node and the
network connection was restored.
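If that really is the cause, I imagine the fix is to have the destination
host send a gratuitous ARP for the VM's address right after the migration
finishes, so the switch relearns which port the MAC is on. Below is a rough
sketch of what I mean (it uses scapy rather than anything in tashi itself,
and the MAC, IP and bridge name are just placeholders from my setup), in
case someone can tell me whether this is the right direction:

    # Sketch: send a gratuitous ARP for a migrated VM's MAC/IP from the
    # destination host so the switch updates its MAC table.
    # Assumes scapy is installed and this runs as root; VM_MAC, VM_IP and
    # IFACE are placeholders, not values tashi provides.
    from scapy.all import ARP, Ether, sendp

    VM_MAC = "52:54:00:12:34:56"   # MAC of the migrated VM (placeholder)
    VM_IP = "192.168.1.42"         # IP of the migrated VM (placeholder)
    IFACE = "br0"                  # bridge the VM's tap device sits on

    # A gratuitous ARP has the VM's own address as both sender and target,
    # broadcast to everyone on the segment.
    pkt = Ether(src=VM_MAC, dst="ff:ff:ff:ff:ff:ff") / ARP(
        op=1,                      # who-has request; op=2 for an unsolicited reply
        hwsrc=VM_MAC,
        psrc=VM_IP,
        hwdst="00:00:00:00:00:00",
        pdst=VM_IP,
    )
    sendp(pkt, iface=IFACE, verbose=False)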
Then another problem surfaced. I have nodes MUCA03 and MUCA04. The VM was
running on MUCA03 and was migrated to MUCA04, at which point the network
died. I then migrated the VM back to MUCA03; networking was restored and
the cluster manager reported the VM as running on MUCA03.
However, the node manager on MUCA04 complained:
2010-07-30 12:55:38,702
[/usr/lib/python2.6/tashi/nodemanager/vmcontrol/qemu.pyc:ERROR]
Migration (transiently) failed:
migration failed
(qemu)
This message was repeated a few times and then the node manager thread died
a horrible death with a RuntimeError:
File "/usr/lib/python2.6/tashi/nodemanager/nodemanagerservice.py", line
207, in migrateVmHelper
self.vmm.migrateVm(instance.vmId, target.name, transportCookie)
File "/usr/lib/python2.6/tashi/nodemanager/vmcontrol/qemu.py", line 503,
in migrateVm
res = self.stopVm(vmId, "tcp:%s:%d" % (target, port), False)
File "/usr/lib/python2.6/tashi/nodemanager/vmcontrol/qemu.py", line 433,
in stopVm
raise RuntimeError
Qemu continued to run (and was unresponsive) on this particular
node.
The node manager on MUCA04 keeps claiming that VM for itself, and the
cluster manager goes nuts over this with errors:
[tashi.clustermanager.clustermanagerservice:INFO] Host muca04 is
claiming instance 4 actually owned by hostId 3 (decay)
[tashi.clustermanager.clustermanagerservice:WARNING] Fetching state from
host muca04 because it is decayed
Any insight on how to deal with the stubborn network switch would be
greatly appreciated.
Oh, one more thing, just to let you know, since it is not a primary
concern right now: I tried suspending the VM after it was migrated to a
different node and then resuming it, in the hope of waking up the network
switch. Unfortunately the resume failed and the node manager's thread again
died with a RuntimeError somewhere in the bowels
of /usr/lib/python2.6/tashi/nodemanager/vmcontrol/qemu.py
Regards,
David