Hi Carlos, glad you figured it out. A colleague had a similar issue, but his finding was that the host table included a timestamp to identify the management server.

You are right that replacing the management server will pose a problem either way, whether the msid is derived from the MAC or from a timestamp.
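If I remember right, the msid on this version is simply the management server's MAC address read as a 48-bit integer. I have not double-checked that in the source, so treat the following as a sketch under that assumption; if it holds, you can sanity-check an msid straight from MySQL:

  -- Sketch, assuming msid = MAC interpreted as a 48-bit integer.
  -- MAC -> msid: strip the colons, convert base 16 to base 10.
  SELECT CONV(REPLACE('90:b1:1c:20:05:cf', ':', ''), 16, 10);   -- 159090355471823
  -- msid -> MAC: convert base 10 to base 16, pad to 12 hex digits.
  SELECT LOWER(LPAD(CONV(159090355471823, 10, 16), 12, '0'));   -- 90b11c2005cf
  SELECT LOWER(LPAD(CONV(159090355471825, 10, 16), 12, '0'));   -- 90b11c2005d1

If that derivation is right, your two msids map to MACs that differ only in the last octet, which fits your guess below that the bond picked up a different slave NIC's MAC after the move.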
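Also, for anyone who hits this later: instead of running sed over the whole dump, the msid swap could probably be done in place. A rough, untested sketch; the mgmt_server_id column name on host is from memory, so verify the actual msid columns with SHOW CREATE TABLE before running anything:

  -- Untested sketch: swap the old msid for the new one in place.
  -- FK checks are disabled because other tables reference mshost.msid.
  SET FOREIGN_KEY_CHECKS = 0;
  UPDATE mshost SET msid = 159090355471825 WHERE msid = 159090355471823;
  UPDATE host SET mgmt_server_id = 159090355471825
   WHERE mgmt_server_id = 159090355471823;   -- assumed column name
  SET FOREIGN_KEY_CHECKS = 1;

The async_job table you mention would need the same treatment on its msid column(s).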
On Fri, Jul 25, 2014 at 10:43 PM, Carlos Reátegui <create...@gmail.com> wrote:
> My system is back up and running.
>
> As I suspected in my second email, the problem was related to the msid in the mshost table. Upon bringing my system up, a new mshost entry was being created for the same MS, and for some reason it was unable to connect to my XenServer hosts.
>
> I decided to go back to my edited SQL with the new IPs and change the existing mshost entry to have the new msid value:
>
> sed -i.bak4 's/159090355471823/159090355471825/g' cloudstack_cloud-newips.sql
>
> I did a global replace since there are foreign key constraints on the msid that I saw in the host and async_job tables.
>
> After reloading this new SQL and starting the MS, everything is back to normal, but with a new subnet for my hosts and guests (this is a basic network).
>
>
> Question for the devs:
>
> Let's say the machine my MS was running on crashed and I replaced it with a new machine. Since the msid is derived from the MAC, wouldn’t I have encountered this same problem and not been able to have the new machine connect to the hosts?
>
> thanks,
> Carlos
>
>
> On Jul 25, 2014, at 8:46 AM, Carlos Reátegui <create...@gmail.com> wrote:
>
>> Any thoughts out there?
>>
>> It keeps trying to connect to the hosts, but it is unable to, and there are no clues in the logs as to why. I am successfully connected with XenCenter to the pool and am also able to ssh to all the hosts from the MS.
>>
>> What do “Disable Cluster” and “Unmanage Cluster” do? Should I try one of those and then re-enable/manage?
>>
>> From the UI, things appear OK, but starting any instance fails.
>>
>> Thanks,
>> Carlos
>>
>> Log snippet from this AM:
>>
>> 2014-07-25 21:03:19,599 DEBUG [agent.manager.ClusteredAgentAttache] (StatsCollector-2:null) Seq 2-65931749: Forwarding null to 159090355471823
>> 2014-07-25 21:03:19,599 DEBUG [agent.manager.ClusteredAgentAttache] (AgentManager-Handler-13:null) Seq 2-65931749: Routing from 159090355471825
>> 2014-07-25 21:03:19,599 DEBUG [agent.manager.ClusteredAgentAttache] (AgentManager-Handler-13:null) Seq 2-65931749: Link is closed
>> 2014-07-25 21:03:19,600 DEBUG [agent.manager.ClusteredAgentManagerImpl] (AgentManager-Handler-13:null) Seq 2-65931749: MgmtId 159090355471825: Req: Resource [Host:2] is unreachable: Host 2: Link is closed
>> 2014-07-25 21:03:19,600 DEBUG [agent.manager.ClusteredAgentManagerImpl] (AgentManager-Handler-13:null) Seq 2--1: MgmtId 159090355471825: Req: Routing to peer
>> 2014-07-25 21:03:19,601 DEBUG [agent.manager.ClusteredAgentManagerImpl] (AgentManager-Handler-14:null) Seq 2--1: MgmtId 159090355471825: Req: Cancel request received
>> 2014-07-25 21:03:19,601 DEBUG [agent.manager.AgentAttache] (AgentManager-Handler-14:null) Seq 2-65931749: Cancelling.
>> 2014-07-25 21:03:19,601 DEBUG [agent.manager.AgentAttache] (StatsCollector-2:null) Seq 2-65931749: Waiting some more time because this is the current command
>> 2014-07-25 21:03:19,601 DEBUG [agent.manager.AgentAttache] (StatsCollector-2:null) Seq 2-65931749: Waiting some more time because this is the current command
>> 2014-07-25 21:03:19,601 INFO [utils.exception.CSExceptionErrorCode] (StatsCollector-2:null) Could not find exception: com.cloud.exception.OperationTimedoutException in error code list for exceptions
>> 2014-07-25 21:03:19,601 WARN [agent.manager.AgentAttache] (StatsCollector-2:null) Seq 2-65931749: Timed out on null
>> 2014-07-25 21:03:19,601 DEBUG [agent.manager.AgentAttache] (StatsCollector-2:null) Seq 2-65931749: Cancelling.
>> 2014-07-25 21:03:19,601 WARN [agent.manager.AgentManagerImpl] (StatsCollector-2:null) Operation timed out: Commands 65931749 to Host 2 timed out after 3600
>> 2014-07-25 21:03:19,601 WARN [cloud.resource.ResourceManagerImpl] (StatsCollector-2:null) Unable to obtain host 2 statistics.
>> 2014-07-25 21:03:19,601 WARN [cloud.server.StatsCollector] (StatsCollector-2:null) Received invalid host stats for host: 2
>> 2014-07-25 21:03:19,606 DEBUG [agent.manager.ClusteredAgentAttache] (StatsCollector-2:null) Seq 3-602278373: Forwarding null to 159090355471823
>> 2014-07-25 21:03:19,607 DEBUG [agent.manager.ClusteredAgentAttache] (AgentManager-Handler-15:null) Seq 3-602278373: Routing from 159090355471825
>> 2014-07-25 21:03:19,607 DEBUG [agent.manager.ClusteredAgentAttache] (AgentManager-Handler-15:null) Seq 3-602278373: Link is closed
>> 2014-07-25 21:03:19,607 DEBUG [agent.manager.ClusteredAgentManagerImpl] (AgentManager-Handler-15:null) Seq 3-602278373: MgmtId 159090355471825: Req: Resource [Host:3] is unreachable: Host 3: Link is closed
>> 2014-07-25 21:03:19,608 DEBUG [agent.manager.ClusteredAgentManagerImpl] (AgentManager-Handler-15:null) Seq 3--1: MgmtId 159090355471825: Req: Routing to peer
>> 2014-07-25 21:03:19,608 DEBUG [agent.manager.ClusteredAgentManagerImpl] (AgentManager-Handler-4:null) Seq 3--1: MgmtId 159090355471825: Req: Cancel request received
>> 2014-07-25 21:03:19,609 DEBUG [agent.manager.AgentAttache] (AgentManager-Handler-4:null) Seq 3-602278373: Cancelling.
>> 2014-07-25 21:03:19,609 DEBUG [agent.manager.AgentAttache] (StatsCollector-2:null) Seq 3-602278373: Waiting some more time because this is the current command
>> 2014-07-25 21:03:19,609 DEBUG [agent.manager.AgentAttache] (StatsCollector-2:null) Seq 3-602278373: Waiting some more time because this is the current command
>> 2014-07-25 21:03:19,609 INFO [utils.exception.CSExceptionErrorCode] (StatsCollector-2:null) Could not find exception: com.cloud.exception.OperationTimedoutException in error code list for exceptions
>> 2014-07-25 21:03:19,609 WARN [agent.manager.AgentAttache] (StatsCollector-2:null) Seq 3-602278373: Timed out on null
>> 2014-07-25 21:03:19,609 DEBUG [agent.manager.AgentAttache] (StatsCollector-2:null) Seq 3-602278373: Cancelling.
>> 2014-07-25 21:03:19,609 WARN [agent.manager.AgentManagerImpl] (StatsCollector-2:null) Operation timed out: Commands 602278373 to Host 3 timed out after 3600
>> 2014-07-25 21:03:19,609 WARN [cloud.resource.ResourceManagerImpl] (StatsCollector-2:null) Unable to obtain host 3 statistics.
>> 2014-07-25 21:03:19,609 WARN [cloud.server.StatsCollector] (StatsCollector-2:null) Received invalid host stats for host: 3
>> 2014-07-25 21:03:19,614 DEBUG [agent.manager.ClusteredAgentAttache] (StatsCollector-2:null) Seq 5-1311574501: Forwarding null to 159090355471823
>> 2014-07-25 21:03:19,617 DEBUG [agent.manager.ClusteredAgentAttache] (AgentManager-Handler-1:null) Seq 5-1311574501: Routing from 159090355471825
>> 2014-07-25 21:03:19,617 DEBUG [agent.manager.ClusteredAgentAttache] (AgentManager-Handler-1:null) Seq 5-1311574501: Link is closed
>> 2014-07-25 21:03:19,617 DEBUG [agent.manager.ClusteredAgentManagerImpl] (AgentManager-Handler-1:null) Seq 5-1311574501: MgmtId 159090355471825: Req: Resource [Host:5] is unreachable: Host 5: Link is closed
>>
>>
>> On Jul 24, 2014, at 10:59 PM, Carlos Reátegui <car...@reategui.com> wrote:
>>
>>> Not sure if it is related, but I see two entries in the mshost table for the same server, each with a different msid. Both show as ‘Up’. Reading the table comments, it seems the msid is based on the MAC. I am guessing this may be due to using a bond and that it may have selected a different NIC to get the bond MAC from. Is it OK to have both of these entries? Should I mark the old one as Down?
>>>
>>> Along these lines, is there something similar with the hosts, and is that why the MS is having problems connecting to them, i.e. the MACs don’t match?
>>>
>>> thanks,
>>> Carlos
>>>
>>>
>>> On Jul 24, 2014, at 3:35 PM, Carlos Reategui <car...@reategui.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I had to move one of my clusters to a new subnet (e.g. 192.168.1.0/24 to 10.100.1.0/24), but it is not working. These are the steps I took:
>>>>
>>>> Environment: CS 4.1.1 on Ubuntu 12.04, XenServer 6.1, shared NFS SR.
>>>>
>>>> 1) Stopped all instances using the CloudStack UI.
>>>> 2) Stopped the cloudstack-management service on the MS.
>>>> 3) Used XenCenter to kill the system VMs (no other instances were running).
>>>> 4) Created a backup of the cloud DB.
>>>> 5) Followed http://support.citrix.com/article/CTX123477 and successfully changed the IPs of the hosts. According to XenCenter everything is good, including the SR.
>>>> 6) Changed the IP of the MS.
>>>> 7) Verified communication between the MS and the hosts using ssh and ping with the new IPs.
>>>> 8) Used sed to search and replace all old IPs with the new IPs in the cloud backup SQL file (e.g. sed -i.bak 's/192.168.1./10.100.1./g' clouddb.sql).
>>>> 9) Visually verified all diffs in the SQL file and made sure no references to 192.168 were left.
>>>> 10) Loaded up the new SQL.
>>>> 11) Searched all files under /etc on the MS for the old IP; found and edited /etc/cloudstack/management/db.properties.
>>>> 12) Started the cloudstack-management service on the MS.
>>>>
>>>> Unfortunately, things are not working. The MS is apparently unable to connect to the hosts, but I cannot figure out why from the logs.
>>>>
>>>> Logs here: https://www.dropbox.com/s/s5glxrbyatmsoug/management-server.log
>>>>
>>>> Any help recovering is appreciated. I do not want to have to re-install and create/import a template for each of the instance VHDs.
>>>>
>>>> thank you,
>>>> -Carlos

--
Daan