Re: server-side pconn races no longer retried after r10057.1.8

Amos Jeffries Tue, 20 Sep 2011 16:57:32 -0700

On Tue, 20 Sep 2011 16:35:02 -0600, Alex Rousskov wrote:

Hello,

I am getting ERR_ZERO_SIZE_OBJECT during v3.2 performance tests.Ibelieve it is caused by Squid server-side failing when the servercloses

a persistent connection before reading our second request. We used to
retry in those cases, as RFC 2616 prescribes.

It looks like we stopped retrying after the following change:

revno: 10057.1.8
branch nick: cleanup-comm
timestamp: Wed 2010-05-19 23:28:21 +1200
message:
  Comm restructure part 2 - outbound connections

The relevant change in FwdState::retryOrBail() logic is below. First,we

unconditionally remove the server already tried:

+        paths.shift(); // last one failed. try another.


Then we are retrying only using the next hop, if any:

+        if (paths.size() > 0) {

               ... retry using the next path ...

+        }
+        // else bail. no more paths possible to try.


The old code used to retry using the current server if there were no
other servers to try.

When we are dealing with a pconn race condition, we should not throwthecurrent server/path/next hop/whatever away. We should retry byopening a

fresh connection the same server/path/next_hop/whatever.

[ I suspect the old code was not exactly doing the right thing when
multiple servers were present either, but it did work correctly in a
typical "single server" case when a pconn race was encountered. ]

The current code was not exactly a change in this area. Retrying theserver was not actually doing anything previously either. The fullpeer-select cycle was repeating on every retry. Which reset the stateabout what had already been tried and hid a few of these cases.

A simple change fixing pconn retrials in the new code may go alongthese
lines:

    if (paths.size() > 1) // if we can
        paths.shift(); // try another.
    // else retry using the same path

but I am not sure whether (a) that would be OK for all other kinds of
retries and (b) would not cause excessive retries since paths willnever
be emptied in retryOrBail(). Is there a better way to fix this?

Hmm, only if we have no other choice. In the worst-case this will makeall other CANNOT_FORWARD cases "hang" until forward_timeout has passed.



Basing it on the error type/code/something would be better.

There are three sets of cases feeding into this function (maybe more ifother races exist).a) server error responses 5xx/403. new path required beforeconnectStart().

 b) TCP connection failures. new path required before connectStart().

c) this pconn race. same path can be passed back to connectStart()_once_.

Cycles of repetition trying the same path on any of the (a) and (b)cases is a very bad idea for performance.

At this point it should have (local.GetPort() > 0) from a previouslyopen connection (c or a) [NP: fd will be -1 already]. So checking forthe zero-sized error AND a present local port we could unset the localport, and call connectStart() without the shift().


Amos

Re: server-side pconn races no longer retried after r10057.1.8

Reply via email to