I have noticed that the arping step can sometimes slow down the router drastically. At least in 4.7, arping will try to ping a hostname of "None" (Python's equivalent of null). I'm wondering whether this has been fixed in newer versions.
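To make that concrete: a literal hostname of "None" typically appears when an unset Python value is interpolated into the arping command line. A minimal sketch of the failure mode — the function and variable names here are hypothetical, not the actual CloudStack code:

# Illustration only: how an unset value becomes the literal hostname "None".
import subprocess

def arping_host(address, interface="eth0"):
    # If address is None, "%s" renders it as the string "None", so arping is
    # asked to resolve a host literally named "None", which can stall on DNS.
    cmd = "arping -c 1 -I %s %s" % (interface, address)
    return subprocess.call(cmd.split())

address = None            # e.g. a lookup that found no address
if address is not None:   # guarding against None avoids the bogus call
    arping_host(address)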
*Jeff Hair*
Technical Lead and Software Developer
Tel: (+354) 415 0200
j...@greenqloud.com
www.greenqloud.com

On Fri, Feb 17, 2017 at 4:22 AM, Syahrul Sazli Shaharir <sa...@pulasan.my> wrote:
> Hi,
>
> For the benefit of others, I've worked around this issue by:-
> - SSH to the router VM via its local IP, through the host running the VM
> - tail /var/log/cloud.log to pinpoint the source of the loop
> - find the looping script in /opt/cloud, edit it and disable the looping part
> - If you do this quickly (before the timeout), the VM will resume booting
>   successfully - if not, try again until it succeeds.
>
> Full notes for my particular case:
> https://pulasan.my/doku.php?id=cs_masalah_router_vm
>
> Thanks.
>
> On Tue, Dec 20, 2016 at 10:09 AM, Syahrul Sazli Shaharir
> <sa...@pulasan.my> wrote:
> > On Mon, Dec 19, 2016 at 8:54 PM, Simon Weller <swel...@ena.com> wrote:
> >> When you're in the console, can you ping the host IP?
> >
> > Yes - some (but not all) of the IPs assigned on the host.
> >
> >> What are your iptables rules on this host currently?
> >
> > Chain INPUT (policy ACCEPT)
> > target     prot opt source            destination
> > ACCEPT     udp  --  0.0.0.0/0         0.0.0.0/0         udp dpt:53
> > ACCEPT     tcp  --  0.0.0.0/0         0.0.0.0/0         tcp dpt:53
> > ACCEPT     udp  --  0.0.0.0/0         0.0.0.0/0         udp dpt:67
> > ACCEPT     tcp  --  0.0.0.0/0         0.0.0.0/0         tcp dpt:67
> >
> > Chain FORWARD (policy ACCEPT)
> > target     prot opt source            destination
> > ACCEPT     all  --  0.0.0.0/0         192.168.122.0/24  ctstate RELATED,ESTABLISHED
> > ACCEPT     all  --  192.168.122.0/24  0.0.0.0/0
> > ACCEPT     all  --  0.0.0.0/0         0.0.0.0/0
> > REJECT     all  --  0.0.0.0/0         0.0.0.0/0         reject-with icmp-port-unreachable
> > REJECT     all  --  0.0.0.0/0         0.0.0.0/0         reject-with icmp-port-unreachable
> >
> > Chain OUTPUT (policy ACCEPT)
> > target     prot opt source            destination
> > ACCEPT     udp  --  0.0.0.0/0         0.0.0.0/0         udp dpt:68
> >
> >> Can you dump the routing table as well?
> >
> > Kernel IP routing table
> > Destination     Gateway         Genmask          Flags MSS Window irtt Iface
> > 0.0.0.0         172.16.30.33    0.0.0.0          UG    0   0      0    cloudbr2.304
> > 10.1.30.0       0.0.0.0         255.255.255.0    U     0   0      0    bond1
> > 10.2.30.0       0.0.0.0         255.255.255.0    U     0   0      0    cloudbr2.352
> > 10.3.30.0       0.0.0.0         255.255.255.0    U     0   0      0    cloudbr2.353
> > 172.16.30.32    0.0.0.0         255.255.255.224  U     0   0      0    cloudbr2.304
> > 169.254.0.0     0.0.0.0         255.255.0.0      U     0   0      0    cloud0
> > 192.168.122.0   0.0.0.0         255.255.255.0    U     0   0      0    virbr0
> >
> >> Have you tried a restart of one of the working networks to see if it fails on restart?
> >
> > Yes, and it restarted OK. I logged on to each network's router VM
> > console from the patchviasocket.py stage onwards, and found the
> > following difference (both VMs were created and booted on the same
> > host):-
> >
> > 1. Working network router VM ( http://pastebin.com/Y6zpDa6M ):-
> > .....
> >
> > Dec 20 01:37:55 r-686-VM cloud: Boot up process done
> > Dec 20 01:37:55 r-686-VM cloud: VR config: configuation format version 1.0
> > Dec 20 01:37:55 r-686-VM cloud: VR config: creating file: /var/cache/cloud/monitor_service.json
> > Dec 20 01:37:55 r-686-VM cloud: VR config: create file success
> > Dec 20 01:37:55 r-686-VM cloud: VR config: executing: /opt/cloud/bin/update_config.py monitor_service.json
> > Dec 20 01:38:16 r-686-VM cloud: VR config: execution success
> > Dec 20 01:38:16 r-686-VM cloud: VR config: creating file: /var/cache/cloud/vm_dhcp_entry.json
> > Dec 20 01:38:16 r-686-VM cloud: VR config: create file success
> > Dec 20 01:38:16 r-686-VM cloud: VR config: executing: /opt/cloud/bin/update_config.py vm_dhcp_entry.json
> > Dec 20 01:38:38 r-686-VM cloud: VR config: execution success
> > Dec 20 01:38:38 r-686-VM cloud: VR config: creating file: /var/cache/cloud/vm_dhcp_entry.json
> > Dec 20 01:38:38 r-686-VM cloud: VR config: create file success
> > Dec 20 01:38:38 r-686-VM cloud: VR config: executing: /opt/cloud/bin/update_config.py vm_dhcp_entry.json
> > Dec 20 01:39:01 r-686-VM cloud: VR config: execution success
> > Dec 20 01:39:01 r-686-VM cloud: VR config: creating file: /var/cache/cloud/vm_metadata.json
> > Dec 20 01:39:01 r-686-VM cloud: VR config: create file success
> > Dec 20 01:39:01 r-686-VM cloud: VR config: executing: /opt/cloud/bin/update_config.py vm_metadata.json
> > Dec 20 01:39:21 r-686-VM cloud: VR config: execution success
> > Dec 20 01:39:21 r-686-VM cloud: VR config: creating file: /var/cache/cloud/vm_metadata.json
> > Dec 20 01:39:21 r-686-VM cloud: VR config: create file success
> > Dec 20 01:39:21 r-686-VM cloud: VR config: executing: /opt/cloud/bin/update_config.py vm_metadata.json
> > Dec 20 01:39:41 r-686-VM cloud: VR config: execution success
> > Dec 20 01:39:41 r-686-VM cloud: VR config: Flushing conntrack table
> > Dec 20 01:39:41 r-686-VM cloud: VR config: Flushing conntrack table completed
> > Dec 20 01:39:42 r-686-VM cloud: VR config: configuation format version 1.0
> > Dec 20 01:39:42 r-686-VM cloud: VR config: Flushing conntrack table
> > Dec 20 01:39:42 r-686-VM cloud: VR config: Flushing conntrack table completed
> >
> > 2. Non-working network router VM ( http://pastebin.com/jzfGMGQB ):-
> > .....
> >
> > Dec 20 01:44:21 r-687-VM cloud: Boot up process done
> > Dec 20 01:44:22 r-687-VM cloud: VR config: configuation format version 1.0
> > Dec 20 01:44:22 r-687-VM cloud: VR config: creating file: /var/cache/cloud/monitor_service.json
> > Dec 20 01:44:22 r-687-VM cloud: VR config: create file success
> > Dec 20 01:44:22 r-687-VM cloud: VR config: executing: /opt/cloud/bin/update_config.py monitor_service.json
> > Dec 20 01:44:42 r-687-VM cloud: VR config: execution success
> > Dec 20 01:44:42 r-687-VM cloud: VR config: creating file: /var/cache/cloud/vm_dhcp_entry.json
> > Dec 20 01:44:42 r-687-VM cloud: VR config: create file success
> > Dec 20 01:44:42 r-687-VM cloud: VR config: executing: /opt/cloud/bin/update_config.py vm_dhcp_entry.json
> > Dec 20 01:45:05 r-687-VM cloud: VR config: execution success
> > Dec 20 01:45:05 r-687-VM cloud: VR config: creating file: /var/cache/cloud/vm_dhcp_entry.json
> > Dec 20 01:45:05 r-687-VM cloud: VR config: create file success
> > Dec 20 01:45:05 r-687-VM cloud: VR config: executing: /opt/cloud/bin/update_config.py vm_dhcp_entry.json
> > Dec 20 01:45:27 r-687-VM cloud: VR config: execution success
> > Dec 20 01:45:27 r-687-VM cloud: VR config: creating file: /var/cache/cloud/vm_dhcp_entry.json
> > Dec 20 01:45:27 r-687-VM cloud: VR config: create file success
> > Dec 20 01:45:27 r-687-VM cloud: VR config: executing: /opt/cloud/bin/update_config.py vm_dhcp_entry.json
> > Dec 20 01:45:49 r-687-VM cloud: VR config: execution success
> > Dec 20 01:45:49 r-687-VM cloud: VR config: creating file: /var/cache/cloud/vm_dhcp_entry.json
> > Dec 20 01:45:49 r-687-VM cloud: VR config: create file success
> > Dec 20 01:45:49 r-687-VM cloud: VR config: executing: /opt/cloud/bin/update_config.py vm_dhcp_entry.json
> > Dec 20 01:46:12 r-687-VM cloud: VR config: execution success
> > Dec 20 01:46:12 r-687-VM cloud: VR config: creating file: /var/cache/cloud/vm_dhcp_entry.json
> > Dec 20 01:46:12 r-687-VM cloud: VR config: create file success
> > Dec 20 01:46:12 r-687-VM cloud: VR config: executing: /opt/cloud/bin/update_config.py vm_dhcp_entry.json
> > Dec 20 01:46:22 r-687-VM shutdown[3919]: shutting down for system halt
> >
> > Broadcast message from root@r-687-VM (Tue Dec 20 01:46:22 2016):
> >
> > The system is going down for system halt NOW!
> > Dec 20 01:46:22 r-687-VM shutdown[3962]: shutting down for system halt
> >
> > Broadcast message from root@r-687-VM (Tue Dec 20 01:46:22 2016):
> >
> > Power button pressed
> > The system is going down for system halt NOW!
> > Dec 20 01:46:23 r-687-VM KVP: KVP starting; pid is:4037
> > Dec 20 01:46:23 r-687-VM cloud: VR config: executing failed: /opt/cloud/bin/update_config.py vm_dhcp_entry.json
> > debug1: channel 0: free: client-session, nchannels 1
> > Connection to 169.254.0.197 closed by remote host.
> > Connection to 169.254.0.197 closed.
> > Transferred: sent 4336, received 93744 bytes, in 180.3 seconds
> > Bytes per second: sent 24.0, received 519.8
> > debug1: Exit status -1
> >
> > Looks like the config script didn't get past vm_dhcp_entry.json?
> >
> > Thanks.
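The repeated "creating file / executing" pairs in these logs are the router's boot-time config replay: each queued item is written as JSON under /var/cache/cloud and then applied with /opt/cloud/bin/update_config.py. A minimal sketch of that pattern — the loop itself is a hypothetical reduction; only the paths and script name are taken from the logs above:

# Sketch of the replay seen in the router logs: write a JSON payload to
# /var/cache/cloud, then apply it with update_config.py. Illustrative only.
import json
import subprocess

CACHE_DIR = "/var/cache/cloud"
UPDATE_SCRIPT = "/opt/cloud/bin/update_config.py"

def apply_config(filename, payload):
    with open("%s/%s" % (CACHE_DIR, filename), "w") as f:
        json.dump(payload, f)                        # "creating file: ..."
    rc = subprocess.call([UPDATE_SCRIPT, filename])  # "executing: ..."
    if rc != 0:
        # Matches the non-working router's last line before it halted:
        # "VR config: executing failed: .../update_config.py vm_dhcp_entry.json"
        raise RuntimeError("executing failed: %s %s" % (UPDATE_SCRIPT, filename))

The non-working router was still replaying vm_dhcp_entry.json items when the halt arrived, which is consistent with the "didn't get past vm_dhcp_entry.json" reading above.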
> >
> >>
> >>
> >> ________________________________
> >> From: Syahrul Sazli Shaharir <sa...@pulasan.my>
> >> Sent: Monday, December 19, 2016 2:09 AM
> >> To: users@cloudstack.apache.org
> >> Subject: Re: Router VM: patchviasocket.py timeout issue on 1 out of 4 networks
> >>
> >> On Tue, Dec 13, 2016 at 7:26 PM, Syahrul Sazli Shaharir
> >> <sa...@pulasan.my> wrote:
> >>> Hi Simon,
> >>>
> >>> On Tue, Dec 13, 2016 at 10:31 AM, Simon Weller <swel...@ena.com> wrote:
> >>>> Can you turn on agent debug mode and take a look at the debug-level logs?
> >>>>
> >>>> You can do that by running sed -i 's/INFO/DEBUG/g' /etc/cloudstack/agent/log4j-cloud.xml on the host and then restarting the agent.
> >>>
> >>> Here are the debug logs - patchviasocket.py executed OK, but the agent couldn't
> >>> connect to the router VM's internal IP:-
> >>>
> >>> 2016-12-13 19:23:18,627 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-4:null) (logid:0bf9a356) Executing: /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/patchviasocket.py -n r-669-VM -p %template=domP%name=r-669-VM%eth0ip=10.3.28.10%eth0mask=255.255.255.0%gateway=10.3.28.1%domain=nocser.net%cidrsize=24%dhcprange=10.3.28.1%eth1ip=169.254.3.7%eth1mask=255.255.0.0%type=dhcpsrvr%disable_rp_filter=true%dns1=8.8.8.8%dns2=8.8.4.4%ip6dns1=%ip6dns2=%baremetalnotificationsecuritykey=uavJByNGGjNLrELG-qbdN99__1I3tnp8qa0KbcsKokKJcPB43K9s6oQu2nMLqo3YP8p6jqDy5XT3WWOWBA2yNw%baremetalnotificationapikey=8JH4mdkxsEMhgIBgMonkNXAEKjVOeZnG1m5UVekvvo4v_iXQ4ZS7rh6NNS0qphhc7ZrCauiz23tp2-Wa3AASlg%host=10.2.30.11%port=8080
> >>> 2016-12-13 19:23:18,739 DEBUG [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-4:null) (logid:0bf9a356) Execution is successful.
> >>> 2016-12-13 19:23:18,742 DEBUG [resource.virtualnetwork.VirtualRoutingResource] (agentRequest-Handler-4:null) (logid:0bf9a356) Trying to connect to 169.254.3.7
> >>> 2016-12-13 19:23:21,749 DEBUG [resource.virtualnetwork.VirtualRoutingResource] (agentRequest-Handler-4:null) (logid:0bf9a356) Could not connect to 169.254.3.7
> >>> 2016-12-13 19:23:26,750 DEBUG [resource.virtualnetwork.VirtualRoutingResource] (agentRequest-Handler-4:null) (logid:0bf9a356) Trying to connect to 169.254.3.7
> >>> 2016-12-13 19:23:29,757 DEBUG [resource.virtualnetwork.VirtualRoutingResource] (agentRequest-Handler-4:null) (logid:0bf9a356) Could not connect to 169.254.3.7
> >>> 2016-12-13 19:23:29,869 DEBUG [cloud.agent.Agent] (agentRequest-Handler-5:null) (logid:981a5f6f) Processing command: com.cloud.agent.api.GetHostStatsCommand
> >>> 2016-12-13 19:23:34,759 DEBUG [resource.virtualnetwork.VirtualRoutingResource] (agentRequest-Handler-4:null) (logid:0bf9a356) Unable to logon to 169.254.3.7
> >>>
> >>> virsh console also failed to show anything.
> >>
> >> OK, after upgrading to the latest qemu-kvm-ev-2.6.0-27.1.el7, this time I
> >> got to the console at some stage, but patchviasocket.py still times
> >> out. Here is the console output:-
> >>
> >> http://pastebin.com/n37aHeSa
> >>
> >> Thanks.
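For anyone new to this step: patchviasocket.py pushes that %-delimited key=value payload into the guest over the VM's virtio-serial UNIX socket on the host, and VirtualRoutingResource then polls the router's link-local address (the "Trying to connect" lines above). A minimal sketch of the socket-write side — the socket path and framing here are assumptions for illustration, not the exact script:

# Rough sketch of the "patch via socket" step: write the boot payload to the
# VM's virtio-serial UNIX socket. The path below is an assumed example; check
# what your patchviasocket.py actually uses.
import socket

def patch_via_socket(sock_path, payload, timeout=5.0):
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.settimeout(timeout)  # without this, a wedged socket blocks indefinitely
    try:
        s.connect(sock_path)
        s.sendall(payload.encode())
    finally:
        s.close()

# Hypothetical usage; the VM name and payload come from the log above.
patch_via_socket("/var/lib/libvirt/qemu/r-669-VM.agent",   # assumed path
                 "%template=domP%name=r-669-VM%...")       # payload truncated

So a "Timed out" warning from the agent points at this write into the guest hanging, whereas "Could not connect" (as in the debug log above) means the write finished but the agent could not then reach the router at its link-local address.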
> >>
> >>
> >>>> ________________________________
> >>>> From: Syahrul Sazli Shaharir <sa...@pulasan.my>
> >>>> Sent: Monday, December 12, 2016 8:21 PM
> >>>> To: users@cloudstack.apache.org
> >>>> Subject: Router VM: patchviasocket.py timeout issue on 1 out of 4 networks
> >>>>
> >>>> Hi,
> >>>>
> >>>> I am running the latest CloudStack 4.9.0.1 in a CentOS 7 KVM + Ceph
> >>>> environment. After running for some time, I ran into an issue with
> >>>> one out of 4 networks - following a heartbeat-induced reset of all
> >>>> hosts, the associated virtual router would not get recreated and
> >>>> started properly on any of the 3 hosts I have, even after repeated
> >>>> attempts at the following:-
> >>>> - destroy-recreate cycles, via the CloudStack UI
> >>>> - restartNetwork cleanup=true API calls (failed with errorcode = 530)
> >>>> - re-downloading and re-registering the system VM template as another
> >>>>   entry and assigning it to the router VM in global settings (boots the
> >>>>   new template OK, but still the same problem)
> >>>> - tweaking the default system offering for the router VM (increased RAM
> >>>>   from 256 to 512 MB)
> >>>> - creating a new system offering, with the RAM tweak and Ceph RBD
> >>>>   storage, and assigning it to Cloud.Com-SoftwareRouter as per the docs -
> >>>>   which didn't work for some reason: it kept using the initial default
> >>>>   offering and created the image on local host storage
> >>>> - upgrading to the latest CloudStack (previously running 4.8)
> >>>>
> >>>> As with a handful of others in this list's archives, virsh list and
> >>>> dumpxml show the VM created OK, but it fails soon after booting, with
> >>>> the following error in agent.log:-
> >>>>
> >>>> 2016-12-13 10:03:33,894 WARN [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-1:null) (logid:633e6e03) Timed out: /usr/share/cloudstack-common/scripts/vm/hypervisor/kvm/patchviasocket.py -n r-668-VM -p %template=domP%name=r-668-VM%eth0ip=10.3.28.10%eth0mask=255.255.255.0%gateway=10.3.28.1%domain=nocser.net%cidrsize=24%dhcprange=10.3.28.1%eth1ip=169.254.0.33%eth1mask=255.255.0.0%type=dhcpsrvr%disable_rp_filter=true%dns1=8.8.8.8%dns2=8.8.4.4%ip6dns1=%ip6dns2=%baremetalnotificationsecuritykey=uavJByNGGjNLrELG-qbdN99__1I3tnp8qa0KbcsKokKJcPB43K9s6oQu2nMLqo3YP8p6jqDy5XT3WWOWBA2yNw%baremetalnotificationapikey=8JH4mdkxsEMhgIBgMonkNXAEKjVOeZnG1m5UVekvvo4v_iXQ4ZS7rh6NNS0qphhc7ZrCauiz23tp2-Wa3AASlg%host=10.2.30.11%port=8080
> >>>> . Output is:
> >>>> .....
> >>>> 2016-12-13 10:05:45,895 WARN [kvm.resource.LibvirtComputingResource] (agentRequest-Handler-1:null) (logid:633e6e03) Timed out: /usr/share/cloudstack-common/scripts/network/domr/router_proxy.sh vr_cfg.sh 169.254.0.33 -c /var/cache/cloud/VR-48ea8a95-6c02-499f-88d3-eae5bf9f9fbe.cfg . Output is:
> >>>>
> >>>> As mentioned, this only happens with 1 network (always the same
> >>>> network). The other router VMs work OK. Any clues on how to
> >>>> troubleshoot this further would be greatly appreciated.
> >>>>
> >>>> Thanks.
> >>>>
> >>>> --
> >>>> --sazli
> >
> >
> > --
> > --sazli
>
> --
> --sazli
> Syahrul Sazli Shaharir <sa...@pulasan.my>
> Mobile: +6019 385 8301 - YM/Skype: syahrulsazli
> System Administrator
> TMK Pulasan (002339810-M) http://pulasan.my/
> 11 Jalan 3/4, 43650 Bandar Baru Bangi, Selangor, Malaysia.
> Tel/Fax: +603 8926 0338
>
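A footnote on the restartNetwork attempt above, which only surfaced as "errorcode = 530": the same call can be scripted against the raw CloudStack API, which can return a fuller error text in the JSON response than the UI shows. A minimal sketch — the endpoint, keys, and network id are placeholders, and the signing follows CloudStack's documented scheme (case-insensitively sorted, lowercased query string, HMAC-SHA1 with the secret key, base64-encoded):

# Sketch: restartNetwork with cleanup=true via the raw CloudStack API.
# URL, keys, and network id below are placeholders, not values from this thread.
import base64
import hashlib
import hmac
import urllib.parse
import urllib.request

API_URL = "http://management-server:8080/client/api"  # placeholder
API_KEY = "YOUR_API_KEY"                               # placeholder
SECRET_KEY = "YOUR_SECRET_KEY"                         # placeholder

def call(params):
    params = dict(params, apiKey=API_KEY, response="json")
    # Canonical string: keys sorted case-insensitively, values URL-encoded,
    # whole string lowercased before signing.
    canonical = "&".join(
        "%s=%s" % (k, urllib.parse.quote(str(v), safe=""))
        for k, v in sorted(params.items(), key=lambda kv: kv[0].lower())
    ).lower()
    digest = hmac.new(SECRET_KEY.encode(), canonical.encode(), hashlib.sha1).digest()
    params["signature"] = base64.b64encode(digest).decode()
    with urllib.request.urlopen(API_URL + "?" + urllib.parse.urlencode(params)) as r:
        return r.read().decode()

# Equivalent of the UI's "restart network with cleanup" (async; returns a job id):
print(call({"command": "restartNetwork", "id": "NETWORK-UUID", "cleanup": "true"}))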