Re: [squid-dev] [PATCH] Do not revive unconditionally dead peers after DNS refresh

Eduard Bagdasaryan Mon, 15 May 2017 07:32:10 -0700

I followed your plan with few adjustments and reattached the patch.
Since the new CachePeer::reprobe is a kind of 'helper' flag with
default value, I think no other initialization/reporting/cloning steps
needed (as for similar CachePeer::tcp_up).


Eduard.

On 06.05.2017 00:33, Alex Rousskov wrote:

On 04/27/2017 02:39 PM, Eduard Bagdasaryan wrote:

+    // always start probing in order to effectively detect
+    // dead or revived peers
+    (void)peerProbeConnect(p);

I think we can simplify that comment while making it more precise:

     peerProbeConnect(p); // detect any died or revived peers ASAP

The (void) cast was correct, but is no longer needed after v5 r15133.

The above are minor polishing touches that could be done during commit,
but I am writing this message because I think we may be able to cover
one more corner case without significant changes...

Imagine a situation where the peers are already being probed during
peerDNSConfigure(). Patches Squid will do nothing nothing in this case.
However, the current probes may be using old/wrong IP addresses (or
other peer configuration parameters) and, hence, may produce wrong
result. This lack of a fresh probe may delay the discovery of a died or
revived peer and might even cause an HTTP transaction failure (in the
"the peer has recently died" case).

On the other hand, we still do not want to create too many concurrent or
too frequent probes so we want to continue to honor the current
peerProbeConnect() concurrency and rate limits.

Please see if the following small change allows Squid to handle the
above case better:

1. Add a boolean CachePeer::reprobe field, defaulted to false. It can be
described as "whether to do another TCP probe after the current TCP probes".

2. peerProbeConnect() should clear the reprobe field after passing the
two if-statement guards and before entering the loop (even if there are
no IPs to probe). The right timing here would eliminate any implicit
reprobing loops while providing the desired functionality.

3. After closing the connection, peerProbeConnectDone() should call
peerProbeConnect() if reprobe is true.

4. Add a boolean reprobeIfBusy parameter to peerProbeConnect() with
false as the default value. peerDNSConfigure() will call
peerProbeConnect() with a true parameter value. peerProbeConnect() will
start by setting reprobe to (reprobe || reprobeIfBusy).

There may be more initialization/reporting/cloning steps needed, just
like for any other CachePeer data member, but the above reflects the
core logic. Adjust as needed, of course.


Thank you,

Alex.

Do not revive unconditionally dead peers after DNS refresh.

Every hour, peerRefreshDNS() performs a DNS lookup of all cache_peer
addresses. Before this patch, even if DNS lookup results did not change,
the associated peerDNSConfigure() code silently cleared dead peer
marking, if any (CachePeer::tcp_up counter).  Forcefully reviving dead
peers every hour can lead to transaction delays (and delays may lead to
failures) due to connection timeouts when using a still dead peer.

This patch starts standard TCP probing instead of pointless dead peer
reviving, correctly refreshing peer state.  The primary goal is to cover
situation when DNS refresh changes peer address list. However TCP
probing may be useful for other situations either, without much overhead
(that is why it starts unconditionally).  For example, we need it when
DNS refresh returns the same addresses list but in different order. Also
it should help with dead idle peers detection.

Also: restart peer probing if peerDNSConfigure() is invoked when peers
are already being probed. This change should avoid situation when
current probes may produce wrong results due to using old/wrong IP
addresses.

=== modified file 'src/CachePeer.h'
--- src/CachePeer.h	2017-04-14 14:35:11 +0000
+++ src/CachePeer.h	2017-05-15 09:55:25 +0000
@@ -118,60 +118,62 @@ public:
         bool originserver = false;
         bool no_tproxy = false;
 #if PEER_MULTICAST_SIBLINGS
         bool mcast_siblings = false;
 #endif
         bool auth_no_keytab = false;
     } options;
 
     int weight = 1;
     int basetime = 0;
 
     struct {
         double avg_n_members = 0.0;
         int n_times_counted = 0;
         int n_replies_expected = 0;
         int ttl = 0;
         int id = 0;
 
         struct {
             bool count_event_pending = false;
             bool counting = false;
         } flags;
     } mcast;
 
 #if USE_CACHE_DIGESTS
     PeerDigest *digest = nullptr;
     char *digest_url = nullptr;
 #endif
 
     int tcp_up = 0;         /* 0 if a connect() fails */
+    /// whether to do another TCP probe after current TCP probes
+    bool reprobe = false;
 
     Ip::Address addresses[10];
     int n_addresses = 0;
     int rr_count = 0;
     CachePeer *next = nullptr;
     int testing_now = 0;
 
     struct {
         unsigned int hash = 0;
         double load_multiplier = 0.0;
         double load_factor = 0.0;     ///< normalized weight value
     } carp;
 #if USE_AUTH
     struct {
         unsigned int hash = 0;
         double load_multiplier = 0.0;
         double load_factor = 0.0;     ///< normalized weight value
     } userhash;
 #endif
     struct {
         unsigned int hash = 0;
         double load_multiplier = 0.0;
         double load_factor = 0.0;     ///< normalized weight value
     } sourcehash;
 
     char *login = nullptr;        /* Proxy authorization */
     time_t connect_timeout_raw = 0; ///< connect_timeout; use peerConnectTimeout() instead!
     int connect_fail_limit = 0;
     int max_conn = 0;
 

=== modified file 'src/neighbors.cc'
--- src/neighbors.cc	2017-05-05 19:42:51 +0000
+++ src/neighbors.cc	2017-05-15 14:04:58 +0000
@@ -32,61 +32,61 @@
 #include "multicast.h"
 #include "neighbors.h"
 #include "NeighborTypeDomainList.h"
 #include "pconn.h"
 #include "PeerDigest.h"
 #include "PeerPoolMgr.h"
 #include "PeerSelectState.h"
 #include "RequestFlags.h"
 #include "SquidConfig.h"
 #include "SquidMath.h"
 #include "SquidTime.h"
 #include "stat.h"
 #include "Store.h"
 #include "store_key_md5.h"
 #include "tools.h"
 #include "URL.h"
 
 /* count mcast group peers every 15 minutes */
 #define MCAST_COUNT_RATE 900
 
 bool peerAllowedToUse(const CachePeer *, HttpRequest *);
 static int peerWouldBePinged(const CachePeer *, HttpRequest *);
 static void neighborRemove(CachePeer *);
 static void neighborAlive(CachePeer *, const MemObject *, const icp_common_t *);
 #if USE_HTCP
 static void neighborAliveHtcp(CachePeer *, const MemObject *, const HtcpReplyData *);
 #endif
 static void neighborCountIgnored(CachePeer *);
 static void peerRefreshDNS(void *);
 static IPH peerDNSConfigure;
-static void peerProbeConnect(CachePeer *);
+static void peerProbeConnect(CachePeer *, const bool reprobeIfBusy = false);
 static CNCB peerProbeConnectDone;
 static void peerCountMcastPeersDone(void *data);
 static void peerCountMcastPeersStart(void *data);
 static void peerCountMcastPeersSchedule(CachePeer * p, time_t when);
 static IRCB peerCountHandleIcpReply;
 
 static void neighborIgnoreNonPeer(const Ip::Address &, icp_opcode);
 static OBJH neighborDumpPeers;
 static OBJH neighborDumpNonPeers;
 static void dump_peers(StoreEntry * sentry, CachePeer * peers);
 
 static unsigned short echo_port;
 
 static int NLateReplies = 0;
 static CachePeer *first_ping = NULL;
 
 const char *
 neighborTypeStr(const CachePeer * p)
 {
     if (p->type == PEER_NONE)
         return "Non-Peer";
 
     if (p->type == PEER_SIBLING)
         return "Sibling";
 
     if (p->type == PEER_MULTICAST)
         return "Multicast Group";
 
     return "Parent";
 }
@@ -1175,72 +1175,72 @@ positiveTimeout(const time_t timeout)
 
 static void
 peerDNSConfigure(const ipcache_addrs *ia, const Dns::LookupDetails &, void *data)
 {
     // TODO: connections to no-longer valid IP addresses should be
     // closed when we can detect such IP addresses.
 
     CachePeer *p = (CachePeer *)data;
 
     int j;
 
     if (p->n_addresses == 0) {
         debugs(15, DBG_IMPORTANT, "Configuring " << neighborTypeStr(p) << " " << p->host << "/" << p->http_port << "/" << p->icp.port);
 
         if (p->type == PEER_MULTICAST)
             debugs(15, DBG_IMPORTANT, "    Multicast TTL = " << p->mcast.ttl);
     }
 
     p->n_addresses = 0;
 
     if (ia == NULL) {
         debugs(0, DBG_CRITICAL, "WARNING: DNS lookup for '" << p->host << "' failed!");
         return;
     }
 
     if ((int) ia->count < 1) {
         debugs(0, DBG_CRITICAL, "WARNING: No IP address found for '" << p->host << "'!");
         return;
     }
 
-    p->tcp_up = p->connect_fail_limit;
-
     for (j = 0; j < (int) ia->count && j < PEER_MAX_ADDRESSES; ++j) {
         p->addresses[j] = ia->in_addrs[j];
         debugs(15, 2, "--> IP address #" << j << ": " << p->addresses[j]);
         ++ p->n_addresses;
     }
 
     p->in_addr.setEmpty();
     p->in_addr = p->addresses[0];
     p->in_addr.port(p->icp.port);
 
+    peerProbeConnect(p, true); // detect any died or revived peers ASAP
+
     if (p->type == PEER_MULTICAST)
         peerCountMcastPeersSchedule(p, 10);
 
 #if USE_ICMP
     if (p->type != PEER_MULTICAST && IamWorkerProcess())
         if (!p->options.no_netdb_exchange)
             eventAddIsh("netdbExchangeStart", netdbExchangeStart, p, 30.0, 1);
 #endif
 
     if (p->standby.mgr.valid())
         PeerPoolMgr::Checkpoint(p->standby.mgr, "resolved peer");
 }
 
 static void
 peerRefreshDNS(void *data)
 {
     CachePeer *p = NULL;
 
     if (eventFind(peerRefreshDNS, NULL))
         eventDelete(peerRefreshDNS, NULL);
 
     if (!data && 0 == stat5minClientRequests()) {
         /* no recent client traffic, wait a bit */
         eventAddIsh("peerRefreshDNS", peerRefreshDNS, NULL, 180.0, 1);
         return;
     }
 
     for (p = Config.peers; p; p = p->next)
         ipcache_nbgethostbyname(p->host, peerDNSConfigure, p);
 
@@ -1260,110 +1260,123 @@ peerConnectFailedSilent(CachePeer * p)
     }
 
     -- p->tcp_up;
 
     if (!p->tcp_up) {
         debugs(15, DBG_IMPORTANT, "Detected DEAD " << neighborTypeStr(p) << ": " << p->name);
         p->stats.logged_state = PEER_DEAD;
     }
 }
 
 void
 peerConnectFailed(CachePeer *p)
 {
     debugs(15, DBG_IMPORTANT, "TCP connection to " << p->host << "/" << p->http_port << " failed");
     peerConnectFailedSilent(p);
 }
 
 void
 peerConnectSucceded(CachePeer * p)
 {
     if (!p->tcp_up) {
         debugs(15, 2, "TCP connection to " << p->host << "/" << p->http_port << " succeded");
         p->tcp_up = p->connect_fail_limit; // NP: so peerAlive(p) works properly.
         peerAlive(p);
         if (!p->n_addresses)
             ipcache_nbgethostbyname(p->host, peerDNSConfigure, p);
     } else
         p->tcp_up = p->connect_fail_limit;
 }
 
-/*
-* peerProbeConnect will be called on dead peers by neighborUp
-*/
-static void
-peerProbeConnect(CachePeer * p)
+/// Is it possible to start new probing?
+static bool
+peerProbingIsBusy(const CachePeer *p)
 {
     if (p->testing_now > 0) {
         debugs(15, 8, "already probing " << p);
-        return;
+        return true;
     }
-
     if (squid_curtime - p->stats.last_connect_probe == 0) {
         debugs(15, 8, "just probed " << p);
+        return true;
+    }
+    return false;
+}
+/*
+* peerProbeConnect will be called on dead peers by neighborUp
+*/
+static void
+peerProbeConnect(CachePeer *p, const bool reprobeIfBusy)
+{
+    if (peerProbingIsBusy(p)) {
+        p->reprobe = reprobeIfBusy;
         return;
     }
+    p->reprobe = false;
 
     const time_t ctimeout = peerConnectTimeout(p);
     /* for each IP address of this CachePeer. find one that we can connect to and probe it. */
     for (int i = 0; i < p->n_addresses; ++i) {
         Comm::ConnectionPointer conn = new Comm::Connection;
         conn->remote = p->addresses[i];
         conn->remote.port(p->http_port);
         conn->setPeer(p);
         getOutgoingAddress(NULL, conn);
 
         ++ p->testing_now;
 
         AsyncCall::Pointer call = commCbCall(15,3, "peerProbeConnectDone", CommConnectCbPtrFun(peerProbeConnectDone, p));
         Comm::ConnOpener *cs = new Comm::ConnOpener(conn, call, ctimeout);
         cs->setHost(p->host);
         AsyncJob::Start(cs);
     }
 
     p->stats.last_connect_probe = squid_curtime;
 }
 
 static void
 peerProbeConnectDone(const Comm::ConnectionPointer &conn, Comm::Flag status, int, void *data)
 {
     CachePeer *p = (CachePeer*)data;
 
     if (status == Comm::OK) {
         peerConnectSucceded(p);
     } else {
         peerConnectFailedSilent(p);
     }
 
     -- p->testing_now;
     conn->close();
     // TODO: log this traffic.
+
+    if (p->reprobe)
+        peerProbeConnect(p);
 }
 
 static void
 peerCountMcastPeersSchedule(CachePeer * p, time_t when)
 {
     if (p->mcast.flags.count_event_pending)
         return;
 
     eventAdd("peerCountMcastPeersStart",
              peerCountMcastPeersStart,
              p,
              (double) when, 1);
 
     p->mcast.flags.count_event_pending = true;
 }
 
 static void
 peerCountMcastPeersStart(void *data)
 {
     CachePeer *p = (CachePeer *)data;
     ps_state *psstate;
     StoreEntry *fake;
     MemObject *mem;
     icp_common_t *query;
     int reqnum;
     LOCAL_ARRAY(char, url, MAX_URL);
     assert(p->type == PEER_MULTICAST);
     p->mcast.flags.count_event_pending = false;
     snprintf(url, MAX_URL, "http://";);
     p->in_addr.toUrl(url+7, MAX_URL -8 );

_______________________________________________
squid-dev mailing list
[email protected]
http://lists.squid-cache.org/listinfo/squid-dev

Re: [squid-dev] [PATCH] Do not revive unconditionally dead peers after DNS refresh

Reply via email to