dubauski opened a new issue #3025: VPC Router Corruption when working with 
large number of networks containing instances with public IP addresses
URL: https://github.com/apache/cloudstack/issues/3025
 
 
   
   
[testCloudStack.zip](https://github.com/apache/cloudstack/files/2577548/testCloudStack.zip)
   
   ##### ISSUE TYPE
    * Bug Report
   
   ##### COMPONENT NAME
   VR
   
   ##### CLOUDSTACK VERSION
   4.11.1
   
   ##### CONFIGURATION
   N/A
   
   
   ##### OS / ENVIRONMENT
   N/A
   
   
   ##### SUMMARY
   The VPC network / virtual router ends up in an unstable or corrupted state 
and requires one or more restarts with the "Clean" option.
   
   
   ##### STEPS TO REPRODUCE
   We are using CloudStack 4.11.1 with KVM hosts. To simulate our use case, we 
wrote a small program that calls the CloudStack API to:
   1) create a VPC with 20 guest networks, each containing one virtual machine 
with a public IP address allocated;
   2) delete the machines and networks one by one.
    
   However, we frequently get a timeout error, sometimes during VM deletion, 
sometimes during guest network deletion, and sometimes even during the static 
NAT disable step. Once the timeout occurs, the VPC network / virtual router 
appears to be left in an unstable or corrupted state. We then have to restart 
the virtual router with the "Clean" option (sometimes several times, because 
deploying the router VM can fail as well). After that, we can continue deleting 
the remaining environments. Here are the high-level steps we performed:
   
   1. Create VPC Network
   2. For each of the 20 "environments":
      1. Create Guest Network
      2. Add a VM to the network
      3. Acquire Public IP
      4. Associate the Public IP with VM
   3. For each of the 20 environments:
      1. Disassociate the Public IP
      2. Delete VM
      3. Delete Guest network
   4. Delete VPC
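
The sequence above can be sketched as a series of CloudStack API commands. This is a minimal illustration, not the code of the attached program: the `Api` interface is a hypothetical transport, the mapping of each step to a command (for example, "Associate the Public IP with VM" to `enableStaticNat`) is an assumption, and most required parameters (zone, offerings, template, CIDRs) are omitted for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class VpcScenario {
    /** Hypothetical transport: sends one API command and returns the id of the affected resource. */
    interface Api {
        String call(String command, Map<String, String> params);
    }

    /** Ids of the resources that make up one "environment". */
    static final class Env {
        final String networkId, vmId, ipId;
        Env(String networkId, String vmId, String ipId) {
            this.networkId = networkId;
            this.vmId = vmId;
            this.ipId = ipId;
        }
    }

    static void run(Api api, int count) {
        // Assumption: required parameters such as zoneid or offering ids are left out here.
        String vpcId = api.call("createVPC", Map.of("name", "test-vpc"));

        // Build phase: one guest network, one VM and one public IP per environment.
        List<Env> envs = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            String net = api.call("createNetwork", Map.of("vpcid", vpcId, "name", "net-" + i));
            String vm = api.call("deployVirtualMachine", Map.of("networkids", net));
            String ip = api.call("associateIpAddress", Map.of("vpcid", vpcId));
            api.call("enableStaticNat",
                    Map.of("ipaddressid", ip, "virtualmachineid", vm, "networkid", net));
            envs.add(new Env(net, vm, ip));
        }

        // Teardown phase: this is where the report sees timeouts.
        for (Env e : envs) {
            api.call("disableStaticNat", Map.of("ipaddressid", e.ipId));
            api.call("disassociateIpAddress", Map.of("id", e.ipId));
            api.call("destroyVirtualMachine", Map.of("id", e.vmId, "expunge", "true"));
            api.call("deleteNetwork", Map.of("id", e.networkId));
        }
        api.call("deleteVPC", Map.of("id", vpcId));
    }

    public static void main(String[] args) {
        // Fake transport that only prints the command sequence and hands out dummy ids.
        int[] counter = {0};
        run((cmd, params) -> {
            System.out.println(cmd + " " + params);
            return "id-" + counter[0]++;
        }, 2);
    }
}
```

Most of these commands are asynchronous in CloudStack, so a real client would also poll for job completion after each call; that part is left out of the sketch.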
   
    
   I'm attaching a simple Java program that performs all of the steps described 
above and lets us reproduce the bug consistently.
    
   To use the application:
    
   java -jar testCloudStack.jar <CloudStack API url: e.g. http://foo:8080/client/api> <apiKey> <secretKey> <zoneName>
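
For context on why the program needs both an `apiKey` and a `secretKey`: CloudStack API requests are signed. The sketch below follows the signing scheme as I understand it from the CloudStack developer documentation (URL-encode the values, lowercase and sort the query by field name, HMAC-SHA1 with the secret key, Base64-encode). The class and method names are hypothetical and are not taken from the attached program.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Base64;
import java.util.Locale;
import java.util.Map;
import java.util.StringJoiner;
import java.util.TreeMap;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class CloudStackSigner {
    /** URL-encodes a value; URLEncoder emits '+' for spaces, so convert those to %20. */
    static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8).replace("+", "%20");
    }

    /** Builds the lowercased query string, sorted by field name, that gets signed. */
    static String canonicalQuery(Map<String, String> params) {
        TreeMap<String, String> sorted = new TreeMap<>();
        params.forEach((k, v) ->
                sorted.put(k.toLowerCase(Locale.ROOT), encode(v).toLowerCase(Locale.ROOT)));
        StringJoiner joined = new StringJoiner("&");
        sorted.forEach((k, v) -> joined.add(k + "=" + v));
        return joined.toString();
    }

    /** Returns the Base64-encoded HMAC-SHA1 of the canonical query. */
    static String sign(Map<String, String> params, String secretKey) {
        try {
            Mac mac = Mac.getInstance("HmacSHA1");
            mac.init(new SecretKeySpec(secretKey.getBytes(StandardCharsets.UTF_8), "HmacSHA1"));
            byte[] digest = mac.doFinal(canonicalQuery(params).getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(digest);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HMAC-SHA1 signing failed", e);
        }
    }

    /** Appends the parameters and the URL-encoded signature to the API endpoint. */
    static String buildUrl(String apiUrl, Map<String, String> params, String secretKey) {
        StringJoiner query = new StringJoiner("&", apiUrl + "?", "");
        params.forEach((k, v) -> query.add(k + "=" + encode(v)));
        query.add("signature=" + encode(sign(params, secretKey)));
        return query.toString();
    }

    public static void main(String[] args) {
        // Placeholder credentials; the params map must include the apiKey itself.
        Map<String, String> params =
                Map.of("command", "listZones", "response", "json", "apiKey", "MY_API_KEY");
        System.out.println(buildUrl("http://foo:8080/client/api", params, "MY_SECRET_KEY"));
    }
}
```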
    
   Note that the test application runs successfully against CloudStack server 
4.9.2 but consistently reproduces the bug against CloudStack server 4.11.1.
   
   ##### EXPECTED RESULTS
   Network deletion succeeds and the VPC remains in an operational state.
   
   ##### ACTUAL RESULTS
   The hang / timeout can occur at any point during environment deletion. The 
first few deletions may go through successfully, and then one fails. The 
failure can happen at any stage: disassociating the public IP, deleting the VM, 
or deleting the guest network. We looked at cloud.log, the agent log, and the 
management server log but could not find any obvious errors. It seems that the 
management server sends the deletion request, but the VR does not respond and 
the system/network becomes stuck in an invalid state. As a result, the network 
often gets stuck in the "Shutdown" state.
    
   Here are some errors in the management server log:
   
   _2018-11-01 01:15:29,263 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-119:ctx-c14b2ab4 job-29965) (logid:dbe80d4f) Complete async job-29965, jobStatus: FAILED, resultCode: 530, result: org.apache.cloudstack.api.response.ExceptionResponse/null/{"uuidList":[],"errorcode":530,"errortext":"Failed to delete network"}
   2018-11-01 01:15:29,245 DEBUG [c.c.a.t.Request] (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) Seq 4-667095694804259240: Received: { Ans: , MgmtId: 7474664765770, via: 4(cehv02.core.jazz.net), Ver: v1, Flags: 110, { GroupAnswer } }
   2018-11-01 01:15:29,245 WARN  [c.c.n.r.VpcVirtualNetworkApplianceManagerImpl] (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) Unable to destroy guest network on router VM[DomainRouter|r-3388-VM]
   2018-11-01 01:15:29,247 WARN  [c.c.n.r.VpcVirtualNetworkApplianceManagerImpl] (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) Failed to destroy guest network config Ntwk[1122|Guest|12] on router VM[DomainRouter|r-3388-VM]
   2018-11-01 01:15:29,247 WARN  [c.c.n.e.VpcVirtualRouterElement] (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) Failed to unplug nic in network Ntwk[1122|Guest|12] for virtual router VM[DomainRouter|r-3388-VM]
   2018-11-01 01:15:29,247 WARN  [o.a.c.e.o.NetworkOrchestrator] (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) Unable to complete shutdown of the network elements due to element: VpcVirtualRouter
   2018-11-01 01:15:29,255 DEBUG [o.a.c.e.o.NetworkOrchestrator] (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) Lock is released for network Ntwk[1122|Guest|12] as a part of network shutdown
   2018-11-01 01:15:29,256 DEBUG [o.a.c.e.o.NetworkOrchestrator] (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) Network is not not in the correct state to be destroyed: Shutdown_
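
When triaging a failure like this, it can help to pull only the WARN entries for the failed async job out of the management server log. The sketch below is one possible way to do that. It assumes one log entry per line in the layout seen in the excerpt (timestamp, level, logger, thread context, logid, message), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JobLogFilter {
    // Groups: 1 timestamp, 2 level, 3 logger, 4 thread context, 5 logid, 6 message.
    private static final Pattern ENTRY = Pattern.compile(
            "^(\\S+ \\S+) (\\w+)\\s+\\[([^\\]]+)\\]\\s+\\(([^)]*)\\)\\s+\\(logid:([^)]*)\\)\\s+(.*)$");

    /** Returns the messages of all WARN entries whose thread context names the given job. */
    static List<String> warnings(List<String> lines, String jobId) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            Matcher m = ENTRY.matcher(line.trim());
            if (!m.matches() || !"WARN".equals(m.group(2))) {
                continue;
            }
            // The context looks like "API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94";
            // compare whole tokens so that job-29965 does not match job-299650.
            List<String> tokens = Arrays.asList(m.group(4).split("[\\s:]+"));
            if (tokens.contains(jobId)) {
                result.add(m.group(6));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "2018-11-01 01:15:29,247 WARN  [o.a.c.e.o.NetworkOrchestrator] "
                        + "(API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) "
                        + "Unable to complete shutdown of the network elements due to element: VpcVirtualRouter");
        warnings(lines, "job-29965").forEach(System.out::println);
    }
}
```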
   

