The hosting service where our ACE server runs was down for maintenance, so our clients couldn't contact the ACE server for an extended period. I logged in to a few client sites (we have around 100) and saw that the agents eventually blacklisted the server IP and stopped checking it for updates, even though it's the only server we have. Once the server came back online, they still didn't resume synchronizing with it. Is this the correct behavior? Shouldn't the agent detect when the server is back online and connect to it again?
I now see the "agent.discovery.checking" option, which I guess we should set to false in the future. The troubling part is that we have a DS component running in our client application that pings the server periodically, and the clients all stopped pinging after the server outage. In every log I checked, the pings stopped immediately after the ACE agent blacklisted the server IP. The ping is just a task running under the standard Java ScheduledExecutorService that POSTs to our server every few minutes using the Apache HttpClient. Is it possible that the ACE agent could interfere with that somehow? The service running the ping task doesn't log that it was stopped or that it failed in any way. Other services on the client are working normally. After I restarted a few client processes, they all reconnected to ACE and started pinging normally again.
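One thing that might be relevant here, independent of ACE: per the ScheduledExecutorService Javadoc, if any execution of a periodic task throws an exception, subsequent executions are suppressed, and nothing is logged unless the returned future is inspected. The sketch below is not our actual ping code; `fakePing` is a hypothetical stand-in that starts failing after two calls (as a POST might once the server is unreachable), and the class name, 20 ms period, and 300 ms run time are arbitrary choices for illustration. It only shows how an unguarded task stops silently while a guarded one keeps running.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PingSuppressionDemo {

    // Hypothetical stand-in for the real HTTP POST: succeeds twice,
    // then throws on every later call, simulating an unreachable server.
    static void fakePing(AtomicInteger calls) {
        if (calls.incrementAndGet() > 2) {
            throw new IllegalStateException("server unreachable");
        }
    }

    // Schedules fakePing every 20 ms, lets it run for the given time,
    // and returns how many times the task body was actually invoked.
    static int runFor(long millis, boolean guarded) throws InterruptedException {
        AtomicInteger calls = new AtomicInteger();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

        Runnable task = guarded
            ? () -> {
                try {
                    fakePing(calls);
                } catch (RuntimeException e) {
                    // Swallow (or log) the failure so the schedule survives.
                }
            }
            : () -> fakePing(calls); // the first exception cancels all future runs

        scheduler.scheduleAtFixedRate(task, 0, 20, TimeUnit.MILLISECONDS);
        Thread.sleep(millis);
        scheduler.shutdownNow();
        scheduler.awaitTermination(1, TimeUnit.SECONDS);
        return calls.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("unguarded calls: " + runFor(300, false));
        System.out.println("guarded calls:   " + runFor(300, true));
    }
}
```

If the real ping task lets an exception from the HTTP call escape its `run()` method, this behavior alone could explain why the pings stopped without any log entry and only resumed after a restart. Whether that is what actually happened on our clients is an open question.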
