[GitHub] Slair1 commented on a change in pull request #2474: CLOUDSTACK-10246 Fix Host HA and VM HA issues
Slair1 commented on a change in pull request #2474: CLOUDSTACK-10246 Fix Host HA and VM HA issues
URL: https://github.com/apache/cloudstack/pull/2474#discussion_r171791837

## File path: engine/orchestration/src/com/cloud/agent/manager/AgentManagerImpl.java ##
@@ -843,72 +846,103 @@ protected boolean handleDisconnectWithInvestigation(final AgentAttache attache,
             s_logger.debug("Caught exception while getting agent's next status", ne);
         }
+        // For log and alert purposes later
+        final DataCenterVO dcVO = _dcDao.findById(host.getDataCenterId());
+        final HostPodVO podVO = _podDao.findById(host.getPodId());
+        final String hostDesc = "[name: " + host.getName() + " (id:" + host.getId() + "), availability zone: " + dcVO.getName() + ", pod: " + podVO.getName() + "]";
+        final String hostShortDesc = "Host " + host.getName() + " (id:" + host.getId() + ")";
+
+        final ResourceState resourceState = host.getResourceState();
+        if (resourceState == ResourceState.Disabled || resourceState == ResourceState.Maintenance || resourceState == ResourceState.ErrorInMaintenance) {
+            // If we are in this resourceState, no need to investigate or do anything. AgentMonitor will handle when in these resourceStates
+            s_logger.info(hostShortDesc + " has disconnected with event " + event + ", but is in Resource State of " + resourceState + ", so doing nothing");
+            return true;
+        }
+
         if (nextStatus == Status.Alert) {
-            /* OK, we are going to the bad status, let's see what happened */
-            s_logger.info("Investigating why host " + hostId + " has disconnected with event " + event);
+            /* Our next Agent transition state is Alert
+             * Let's see if the host down or why we had this event
+             */
+            s_logger.info("Investigating why host " + hostShortDesc + " has disconnected with event " + event);
             Status determinedState = investigate(attache);
             // if state cannot be determined do nothing and bail out
             if (determinedState == null) {
                 if ((System.currentTimeMillis() >> 10) - host.getLastPinged() > AlertWait.value()) {
-                    s_logger.warn("Agent " + hostId + " state cannot be determined for more than " + AlertWait + "(" + AlertWait.value() + ") seconds, will go to Alert state");
+                    s_logger.warn("State for " + hostShortDesc + " could not be determined for more than " + AlertWait + "(" + AlertWait.value() + ") seconds, will go to Alert state");
                     determinedState = Status.Alert;
                 } else {
-                    s_logger.warn("Agent " + hostId + " state cannot be determined, do nothing");
+                    s_logger.warn("State for " + hostShortDesc + " could not be determined, doing nothing");
                     return false;
                 }
             }
             final Status currentStatus = host.getStatus();
-            s_logger.info("The agent from host " + hostId + " state determined is " + determinedState);
+            s_logger.info("Status for " + hostShortDesc + " was " + currentStatus + ". Investigation determined the current state is " + determinedState);
-            if (determinedState == Status.Down) {
-                final String message = "Host is down: " + host.getId() + "-" + host.getName() + ". Starting HA on the VMs";
-                s_logger.error(message);
-                if (host.getType() != Host.Type.SecondaryStorage && host.getType() != Host.Type.ConsoleProxy) {
-                    _alertMgr.sendAlert(AlertManager.AlertType.ALERT_TYPE_HOST, host.getDataCenterId(), host.getPodId(), "Host down, " + host.getId(), message);
-                }
-                event = Status.Event.HostDown;
-            } else if (determinedState == Status.Up) {
-                /* Got ping response from host, bring it back */
-                s_logger.info("Agent is determined to be up and running");
+            if (determinedState == Status.Up) {
+                // Got ping response from host, bring it back
+                s_logger.info(hostShortDesc + " is up again");
                 agentStatusTransitTo(host, Status.Event.Ping, _nodeId);
-                return false;
             } else if (determinedState == Status.Disconnected) {
-                s_logger.warn("Agent is disconnected but the host is still up: " + host.getId() + "-" + host.getName());
+                // Investigation says host isn't down, just disconnected
                 if (currentStatus == Status.Disconnected) {
+                    // Last status was disconnected, only switch
Slair1 commented on a change in pull request #2474: CLOUDSTACK-10246 Fix Host HA and VM HA issues
URL: https://github.com/apache/cloudstack/pull/2474#discussion_r171791683

## File path: engine/orchestration/src/com/cloud/agent/manager/AgentManagerImpl.java ##
@@ -843,72 +846,103 @@ protected boolean handleDisconnectWithInvestigation(final AgentAttache attache,
             s_logger.debug("Caught exception while getting agent's next status", ne);
         }
+        // For log and alert purposes later
+        final DataCenterVO dcVO = _dcDao.findById(host.getDataCenterId());
+        final HostPodVO podVO = _podDao.findById(host.getPodId());
+        final String hostDesc = "[name: " + host.getName() + " (id:" + host.getId() + "), availability zone: " + dcVO.getName() + ", pod: " + podVO.getName() + "]";
+        final String hostShortDesc = "Host " + host.getName() + " (id:" + host.getId() + ")";
+
+        final ResourceState resourceState = host.getResourceState();
+        if (resourceState == ResourceState.Disabled || resourceState == ResourceState.Maintenance || resourceState == ResourceState.ErrorInMaintenance) {
+            // If we are in this resourceState, no need to investigate or do anything. AgentMonitor will handle when in these resourceStates
+            s_logger.info(hostShortDesc + " has disconnected with event " + event + ", but is in Resource State of " + resourceState + ", so doing nothing");
+            return true;
+        }
+
         if (nextStatus == Status.Alert) {
-            /* OK, we are going to the bad status, let's see what happened */
-            s_logger.info("Investigating why host " + hostId + " has disconnected with event " + event);
+            /* Our next Agent transition state is Alert
+             * Let's see if the host down or why we had this event
+             */
+            s_logger.info("Investigating why host " + hostShortDesc + " has disconnected with event " + event);
             Status determinedState = investigate(attache);
             // if state cannot be determined do nothing and bail out
             if (determinedState == null) {
                 if ((System.currentTimeMillis() >> 10) - host.getLastPinged() > AlertWait.value()) {
-                    s_logger.warn("Agent " + hostId + " state cannot be determined for more than " + AlertWait + "(" + AlertWait.value() + ") seconds, will go to Alert state");
+                    s_logger.warn("State for " + hostShortDesc + " could not be determined for more than " + AlertWait + "(" + AlertWait.value() + ") seconds, will go to Alert state");
                     determinedState = Status.Alert;
                 } else {
-                    s_logger.warn("Agent " + hostId + " state cannot be determined, do nothing");
+                    s_logger.warn("State for " + hostShortDesc + " could not be determined, doing nothing");
                     return false;
                 }
             }
             final Status currentStatus = host.getStatus();
-            s_logger.info("The agent from host " + hostId + " state determined is " + determinedState);
+            s_logger.info("Status for " + hostShortDesc + " was " + currentStatus + ". Investigation determined the current state is " + determinedState);
-            if (determinedState == Status.Down) {
-                final String message = "Host is down: " + host.getId() + "-" + host.getName() + ". Starting HA on the VMs";
-                s_logger.error(message);
-                if (host.getType() != Host.Type.SecondaryStorage && host.getType() != Host.Type.ConsoleProxy) {
-                    _alertMgr.sendAlert(AlertManager.AlertType.ALERT_TYPE_HOST, host.getDataCenterId(), host.getPodId(), "Host down, " + host.getId(), message);
-                }
-                event = Status.Event.HostDown;
-            } else if (determinedState == Status.Up) {
-                /* Got ping response from host, bring it back */
-                s_logger.info("Agent is determined to be up and running");
+            if (determinedState == Status.Up) {
+                // Got ping response from host, bring it back
+                s_logger.info(hostShortDesc + " is up again");
                 agentStatusTransitTo(host, Status.Event.Ping, _nodeId);
-                return false;
             } else if (determinedState == Status.Disconnected) {
-                s_logger.warn("Agent is disconnected but the host is still up: " + host.getId() + "-" + host.getName());

Review comment:
I removed it because it was extraneous. A similar but more detailed log entry is performed for every case that the host is in Disconnected. It's just a little hard to see that while reading
Slair1 commented on a change in pull request #2474: CLOUDSTACK-10246 Fix Host HA and VM HA issues
URL: https://github.com/apache/cloudstack/pull/2474#discussion_r171790992

## File path: engine/orchestration/src/com/cloud/agent/manager/AgentManagerImpl.java ##
@@ -843,72 +846,103 @@ protected boolean handleDisconnectWithInvestigation(final AgentAttache attache,
             s_logger.debug("Caught exception while getting agent's next status", ne);
         }
+        // For log and alert purposes later
+        final DataCenterVO dcVO = _dcDao.findById(host.getDataCenterId());
+        final HostPodVO podVO = _podDao.findById(host.getPodId());
+        final String hostDesc = "[name: " + host.getName() + " (id:" + host.getId() + "), availability zone: " + dcVO.getName() + ", pod: " + podVO.getName() + "]";
+        final String hostShortDesc = "Host " + host.getName() + " (id:" + host.getId() + ")";
+
+        final ResourceState resourceState = host.getResourceState();
+        if (resourceState == ResourceState.Disabled || resourceState == ResourceState.Maintenance || resourceState == ResourceState.ErrorInMaintenance) {
+            // If we are in this resourceState, no need to investigate or do anything. AgentMonitor will handle when in these resourceStates
+            s_logger.info(hostShortDesc + " has disconnected with event " + event + ", but is in Resource State of " + resourceState + ", so doing nothing");
+            return true;
+        }
+
         if (nextStatus == Status.Alert) {
-            /* OK, we are going to the bad status, let's see what happened */
-            s_logger.info("Investigating why host " + hostId + " has disconnected with event " + event);
+            /* Our next Agent transition state is Alert
+             * Let's see if the host down or why we had this event
+             */
+            s_logger.info("Investigating why host " + hostShortDesc + " has disconnected with event " + event);

Review comment:
@DaanHoogland good thought, I didn't think about that. However, the hostShortDesc does include the hostId as part of it. So maybe it's ok?

final String hostShortDesc = "Host " + host.getName() + " (id:" + host.getId() + ")";

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services
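
To make the point in the last review comment concrete, the following is a minimal, self-contained Java sketch of the strings the changed log statements would produce. It is not the CloudStack code: the class `HostLogSketch`, its method names, the sample host name, id, and event, and the trimmed-down `ResourceState` enum are all assumptions made for illustration. Only the string concatenations and the three resource states in the guard are copied from the diff.

```java
public class HostLogSketch {

    // Hypothetical stand-in for CloudStack's ResourceState; only the three
    // values named in the diff plus Enabled are modeled here.
    public enum ResourceState { Enabled, Disabled, Maintenance, ErrorInMaintenance }

    // Same concatenation as the hostShortDesc line quoted in the review comment.
    public static String hostShortDesc(String name, long id) {
        return "Host " + name + " (id:" + id + ")";
    }

    // Same concatenation as the new "Investigating why host" log statement.
    public static String investigatingMessage(String name, long id, String event) {
        return "Investigating why host " + hostShortDesc(name, id)
                + " has disconnected with event " + event;
    }

    // Mirrors the early-return guard added by the diff: a host in one of
    // these resource states is not investigated.
    public static boolean skipsInvestigation(ResourceState state) {
        return state == ResourceState.Disabled
                || state == ResourceState.Maintenance
                || state == ResourceState.ErrorInMaintenance;
    }

    public static void main(String[] args) {
        // Sample values are made up for the example.
        System.out.println(hostShortDesc("kvm01", 42L));
        System.out.println(investigatingMessage("kvm01", 42L, "PingTimeout"));
        System.out.println(skipsInvestigation(ResourceState.Maintenance));
        System.out.println(skipsInvestigation(ResourceState.Enabled));
    }
}
```

With these sample values the second line printed is `Investigating why host Host kvm01 (id:42) has disconnected with event PingTimeout`. The id is still present, as the comment says, but the word "host" now appears twice because `hostShortDesc` already begins with "Host ".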