dubauski opened a new issue #3025: VPC Router Corruption when working with large number of networks containing instances with public IP addresses
URL: https://github.com/apache/cloudstack/issues/3025

[testCloudStack.zip](https://github.com/apache/cloudstack/files/2577548/testCloudStack.zip)

##### ISSUE TYPE
* Bug Report

##### COMPONENT NAME
VR

##### CLOUDSTACK VERSION
4.11.1

##### CONFIGURATION
N/A

##### OS / ENVIRONMENT
N/A

##### SUMMARY
The VPC network / virtual router ends up in an unstable/corrupted state and requires one or more restarts with the "Clean" option.

##### STEPS TO REPRODUCE
We are using CloudStack 4.11.1 running with KVM hosts. To simulate our use case, we created a small program that calls the CloudStack API to:

1) create a VPC network with 20 guest networks, each containing one virtual machine with a public IP address allocated;
2) delete the machines and networks one by one.

However, we frequently get a timeout error, sometimes during VM deletion, sometimes during guest network deletion, and sometimes even during the static NAT disable step. Once the timeout occurs, the VPC network / virtual router appears to be in an unstable/corrupted state. We need to restart the virtual router with the "Clean" option (sometimes we have to retry the restart several times, because deploying the router VM fails as well). After that, we can continue deleting the remaining environments.

Here are the high-level steps that we performed:

1. Create VPC Network
2. For each of the 20 "environments"
   3. Create Guest Network
   4. Add a VM to the network
   5. Acquire Public IP
   6. Associate the Public IP with VM
7. For each of the 20 environments
   8. Disassociate the Public IP
   9. Delete VM
   10. Delete Guest network
11. Delete VPC

I'm attaching a simple Java program which performs all of the steps described above and which allowed us to consistently run into the bug.

To use the application: `java -jar testCloudStack.jar <CloudStack API url: e.g. http://foo:8080/client/api> <apiKey> <secretKey> <zoneName>`

Note that the test application works successfully with CloudStack server 4.9.2 but consistently reproduces the bug with CloudStack server 4.11.1.

##### EXPECTED RESULTS
Network deletion is successful and the VPC is in an operational state.

##### ACTUAL RESULTS
The hang / timeout can occur at any point during environment deletion. The first few deletions may go through successfully, and then a later one fails. The failure can be at any stage, i.e. disassociating the public IP, deleting the VM, or deleting the guest network.

We looked at cloud.log, the agent log and the management server log but couldn't find any obvious errors. It seems that the management server sends the deletion request, but the VR does not respond and the system/network becomes stuck in an invalid state. As a result, the network often gets stuck in the “Shutdown” state.

Here are some errors from the management server log:

_2018-11-01 01:15:29,263 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-119:ctx-c14b2ab4 job-29965) (logid:dbe80d4f) Complete async job-29965, jobStatus: FAILED, resultCode: 530, result: org.apache.cloudstack.api.response.ExceptionResponse/null/{"uuidList":[],"errorcode":530,"errortext":"Failed to delete network"}_

_2018-11-01 01:15:29,245 DEBUG [c.c.a.t.Request] (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) Seq 4-667095694804259240: Received: { Ans: , MgmtId: 7474664765770, via: 4(cehv02.core.jazz.net), Ver: v1, Flags: 110, { GroupAnswer } }_

_2018-11-01 01:15:29,245 WARN [c.c.n.r.VpcVirtualNetworkApplianceManagerImpl] (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) Unable to destroy guest network on router VM[DomainRouter|r-3388-VM]_

_2018-11-01 01:15:29,247 WARN [c.c.n.r.VpcVirtualNetworkApplianceManagerImpl] (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) Failed to destroy guest network config Ntwk[1122|Guest|12] on router VM[DomainRouter|r-3388-VM]_

_2018-11-01 01:15:29,247 WARN [c.c.n.e.VpcVirtualRouterElement] (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) Failed to unplug nic in network Ntwk[1122|Guest|12] for virtual router VM[DomainRouter|r-3388-VM]_

_2018-11-01 01:15:29,247 WARN [o.a.c.e.o.NetworkOrchestrator] (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) Unable to complete shutdown of the network elements due to element: VpcVirtualRouter_

_2018-11-01 01:15:29,255 DEBUG [o.a.c.e.o.NetworkOrchestrator] (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) Lock is released for network Ntwk[1122|Guest|12] as a part of network shutdown_

_2018-11-01 01:15:29,256 DEBUG [o.a.c.e.o.NetworkOrchestrator] (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) Network is not not in the correct state to be destroyed: Shutdown_
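
For readers who do not want to unpack the attached jar, the call order from STEPS TO REPRODUCE can be sketched roughly as below. This is not the attached program: the class name `ReproPlan` is hypothetical, and the mapping of each step to a CloudStack API command (for example "Associate the Public IP with VM" to `enableStaticNat`, and "Disassociate the Public IP" to `disableStaticNat`) is an assumption based on the static NAT step mentioned above. The sketch only builds the ordered list of commands and sends nothing to a server.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists the API commands in the order the report describes.
// The command names are an assumed mapping of the steps, not taken from the jar.
class ReproPlan {

    // Returns the ordered command list for the given number of environments.
    static List<String> build(int environments) {
        List<String> plan = new ArrayList<>();
        plan.add("createVPC");                       // step 1

        // Steps 2-6: build every environment inside the single VPC.
        for (int i = 1; i <= environments; i++) {
            plan.add("createNetwork#" + i);          // guest network (VPC tier)
            plan.add("deployVirtualMachine#" + i);   // one VM per network
            plan.add("associateIpAddress#" + i);     // acquire public IP
            plan.add("enableStaticNat#" + i);        // bind public IP to the VM
        }

        // Steps 7-10: tear the environments down one by one.
        for (int i = 1; i <= environments; i++) {
            plan.add("disableStaticNat#" + i);
            plan.add("destroyVirtualMachine#" + i);
            plan.add("deleteNetwork#" + i);
        }

        plan.add("deleteVPC");                       // step 11
        return plan;
    }

    public static void main(String[] args) {
        List<String> plan = build(20);
        System.out.println(plan.size() + " API calls"); // 142 for 20 environments
        plan.forEach(System.out::println);
    }
}
```

With 20 environments this comes to 142 calls against a single VPC router, and the failures reported above occur somewhere in the second loop.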