On Tue, 20 Sep 2011 16:35:02 -0600, Alex Rousskov wrote:
Hello,
I am getting ERR_ZERO_SIZE_OBJECT during v3.2 performance tests.
I
believe it is caused by Squid server-side failing when the server
closes
a persistent connection before reading our second request. We used to
retry in those cases, as RFC 2616 prescribes.
It looks like we stopped retrying after the following change:
revno: 10057.1.8
branch nick: cleanup-comm
timestamp: Wed 2010-05-19 23:28:21 +1200
message:
Comm restructure part 2 - outbound connections
The relevant change in FwdState::retryOrBail() logic is below. First,
we
unconditionally remove the server already tried:
+ paths.shift(); // last one failed. try another.
Then we are retrying only using the next hop, if any:
+ if (paths.size() > 0) {
... retry using the next path ...
+ }
+ // else bail. no more paths possible to try.
The old code used to retry using the current server if there were no
other servers to try.
When we are dealing with a pconn race condition, we should not throw
the
current server/path/next hop/whatever away. We should retry by
opening a
fresh connection the same server/path/next_hop/whatever.
[ I suspect the old code was not exactly doing the right thing when
multiple servers were present either, but it did work correctly in a
typical "single server" case when a pconn race was encountered. ]
The current code was not exactly a change in this area. Retrying the
server was not actually doing anything previously either. The full
peer-select cycle was repeating on every retry. Which reset the state
about what had already been tried and hid a few of these cases.
A simple change fixing pconn retrials in the new code may go along
these
lines:
if (paths.size() > 1) // if we can
paths.shift(); // try another.
// else retry using the same path
but I am not sure whether (a) that would be OK for all other kinds of
retries and (b) would not cause excessive retries since paths will
never
be emptied in retryOrBail(). Is there a better way to fix this?
Hmm, only if we have no other choice. In the worst-case this will make
all other CANNOT_FORWARD cases "hang" until forward_timeout has passed.
Basing it on the error type/code/something would be better.
There are three sets of cases feeding into this function (maybe more if
other races exist).
a) server error responses 5xx/403. new path required before
connectStart().
b) TCP connection failures. new path required before connectStart().
c) this pconn race. same path can be passed back to connectStart()
_once_.
Cycles of repetition trying the same path on any of the (a) and (b)
cases is a very bad idea for performance.
At this point it should have (local.GetPort() > 0) from a previously
open connection (c or a) [NP: fd will be -1 already]. So checking for
the zero-sized error AND a present local port we could unset the local
port, and call connectStart() without the shift().
Amos