Re: [squid-users] CARP Failover behavior - multiple parents chosen for URL

2009-05-11 Thread Amos Jeffries
> Moving this to squid-dev due to increasingly propellerhead-like
> content... :)
>
> Looking over the code and some debugging output, it's pretty clear
> what's happening here.
>
> The carpSelectParent() function does the appropriate hashing of each
> URL+parent combination and the requisite ranking of the results. To determine
> whether or not the highest-hash-value parent is the parent that
> should, in fact, be returned, it uses peerHTTPOkay() as its test.
>
> The problem here is that peerHTTPOkay only returns 0 if the peer in
> question has been marked DEAD; carpSelectParent has no way of knowing
> if the peer is down unless squid has "officially" marked it DEAD.
>
> So, if the highest-ranked peer is a peer that is refusing connections
> but isn't marked DEAD yet, then peer_select tries to use it, and when
> it fails, falls back to ANY_PARENT - this actually shows up in the
> access.log, which I didn't realize when I initially sent this in. Once
> we've tried to hit the parent 10 times, we officially mark it DEAD,
> and then carpSelectParent() does the Right Thing.
>
> So, we have a couple of options here as far as how to resolve this:
>
> 1. Adjust PEER_TCP_MAGIC_COUNT from 10 to 1, so that a parent is
> marked DEAD after only one failure. This may be overly sensitive
> however. Alternatively, carpSelectParent() can check peer->tcp_up and
> disqualify the peer if it's not equal to PEER_TCP_MAGIC_COUNT; this
> will have a similar effect without going through the overhead of
> actually marking the peer DEAD and then "reviving" it.

Patches went in recently to make that setting a squid.conf option.

Squid-3:
  http://www.squid-cache.org/Versions/v3/HEAD/changesets/b9678.patch

Squid-2:
  http://www.squid-cache.org/Versions/v2/HEAD/changesets/12208.patch
  http://www.squid-cache.org/Versions/v2/HEAD/changesets/12209.patch
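
If I'm remembering those changesets right, the knob they add is a
cache_peer connect-fail-limit=N option rather than a new global
directive, so the one-failure behaviour from option #1 would be
spelled something like (adapting the cache_peer line quoted later in
this thread):

  cache_peer http-cache-1a.den002 parent 1080 0 carp connect-fail-limit=1

i.e. mark the parent DEAD after a single failed connect instead of
the compiled-in default of 10.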


>
> 2. Somehow have carpSelectParent() return the entire sorted list of
> peers, so that if the top choice is found to be down, then
> peer_select() already knows where to go next...
>
> 3. Add some special-case code (I'm guessing this would be either in
> forward.c or peer_select.c) so that if a connection to a peer selected
> by carpSelectParent() fails, then increment a counter (which would be
> unique to that request) and call carpSelectParent() again. This
> counter can be used in carpSelectParent() to ignore the X highest-ranked
> entries. Once this peer gets officially declared DEAD, this becomes
> moot.
>
> Personally, I'm partial to #3, but other approaches are welcome :)
>

I'm partial to #2. But not for any particular reason.
Patches for either #2 or #3 are welcome.

Amos




Re: [squid-users] CARP Failover behavior - multiple parents chosen for URL

2009-05-11 Thread Mark Nottingham

A patch to make PEER_TCP_MAGIC_COUNT configurable is on 2-HEAD;
   http://www.squid-cache.org/Versions/v2/HEAD/changesets/12208.patch

Cheers,


On 12/05/2009, at 1:15 PM, Chris Woodfield wrote:

1. Adjust PEER_TCP_MAGIC_COUNT from 10 to 1, so that a parent is  
marked DEAD after only one failure. This may be overly sensitive  
however.


--
Mark Nottingham   m...@yahoo-inc.com




Re: [squid-users] CARP Failover behavior - multiple parents chosen for URL

2009-05-11 Thread Chris Woodfield
Moving this to squid-dev due to increasingly propellerhead-like  
content... :)


Looking over the code and some debugging output, it's pretty clear  
what's happening here.


The carpSelectParent() function does the appropriate hashing of each
URL+parent combination and the requisite ranking of the results. To determine
whether or not the highest-hash-value parent is the parent that  
should, in fact, be returned, it uses peerHTTPOkay() as its test.
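
For anyone following along at home, here's a compressed sketch of that
flow in C. It is not the verbatim carp.c code -- the real load-factor
setup and tie-breaking differ, and the http_okay field below is just a
stand-in for the peerHTTPOkay() test discussed next:

  #include <stdint.h>

  #define ROTATE_LEFT(x, n) (((x) << (n)) | ((x) >> (32 - (n))))

  struct peer {
      const char *name;
      uint32_t hash;        /* precomputed from the peer host name */
      double load_factor;   /* derived from the cache_peer weights */
      int http_okay;        /* stand-in for peerHTTPOkay() */
  };

  /* Hash the request URL with the same rotate-and-add scheme CARP
   * uses for host names. */
  static uint32_t url_hash(const char *url) {
      uint32_t h = 0;
      for (const char *c = url; *c; ++c)
          h += ROTATE_LEFT(h, 19) + (unsigned char)*c;
      return h;
  }

  /* Combine the URL hash with each parent's hash and return the
   * highest-scoring parent that passes the usability test. */
  const struct peer *carp_select(const char *url, struct peer *peers, int n) {
      const struct peer *best = NULL;
      double best_score = 0.0;
      uint32_t uh = url_hash(url);
      for (int i = 0; i < n; ++i) {
          uint32_t combined = peers[i].hash ^ uh;
          combined += combined * 0x62531965;
          combined = ROTATE_LEFT(combined, 21);
          double score = combined * peers[i].load_factor;
          if (peers[i].http_okay && score > best_score) {
              best = &peers[i];
              best_score = score;
          }
      }
      return best;    /* NULL if every peer failed the usability test */
  }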


The problem here is that peerHTTPOkay only returns 0 if the peer in  
question has been marked DEAD; carpSelectParent has no way of knowing  
if the peer is down unless squid has "officially" marked it DEAD.


So, if the highest-ranked peer is a peer that is refusing connections  
but isn't marked DEAD yet, then peer_select tries to use it, and when  
it fails, falls back to ANY_PARENT - this actually shows up in the  
access.log, which I didn't realize when I initially sent this in. Once  
we've tried to hit the parent 10 times, we officially mark it DEAD,  
and then carpSelectParent() does the Right Thing.


So, we have a couple of options here as far as how to resolve this:

1. Adjust PEER_TCP_MAGIC_COUNT from 10 to 1, so that a parent is  
marked DEAD after only one failure. This may be overly sensitive  
however. Alternatively, carpSelectParent() can check peer->tcp_up and  
disqualify the peer if it's not equal to PEER_TCP_MAGIC_COUNT; this  
will have a similar effect without going through the overhead of  
actually marking the peer DEAD and then "reviving" it.


2. Somehow have carpSelectParent() return the entire sorted list of  
peers, so that if the top choice is found to be down, then
peer_select() already knows where to go next...


3. Add some special-case code (I'm guessing this would be either in
forward.c or peer_select.c) so that if a connection to a peer selected
by carpSelectParent() fails, then increment a counter (which would be
unique to that request) and call carpSelectParent() again. This
counter can be used in carpSelectParent() to ignore the X highest-ranked
entries (a sketch follows below). Once this peer gets officially
declared DEAD, this becomes moot.
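
To make #3 concrete, here's a minimal sketch of what the selector
could look like with that counter, reusing the carp_select() pieces
from earlier in this message; carp_select_nth and skip_top are
hypothetical names, not existing squid code:

  /* Score one peer against a URL hash, exactly as in carp_select(). */
  static double carp_score(const struct peer *p, uint32_t url_h) {
      uint32_t combined = p->hash ^ url_h;
      combined += combined * 0x62531965;
      combined = ROTATE_LEFT(combined, 21);
      return combined * p->load_factor;
  }

  /* skip_top is the per-request failure counter: ignore that many of
   * the highest-ranked parents and return the next one down. (Score
   * ties are glossed over here.) Option #1's lighter variant would
   * instead just skip peers with tcp_up != PEER_TCP_MAGIC_COUNT in
   * the inner loop. */
  const struct peer *carp_select_nth(const char *url, struct peer *peers,
                                     int n, int skip_top) {
      uint32_t uh = url_hash(url);
      double ceiling = -1.0;        /* no upper bound on the first round */
      const struct peer *pick = NULL;

      for (int round = 0; round <= skip_top; ++round) {
          double best = -1.0;
          pick = NULL;
          for (int i = 0; i < n; ++i) {
              if (!peers[i].http_okay)
                  continue;
              double s = carp_score(&peers[i], uh);
              if (s > best && (ceiling < 0.0 || s < ceiling)) {
                  best = s;
                  pick = &peers[i];
              }
          }
          if (!pick)
              break;                /* fewer usable peers than failures */
          ceiling = best;           /* next round looks strictly below */
      }
      return pick;
  }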


Personally, I'm partial to #3, but other approaches are welcome :)

Thanks,

-C

On May 6, 2009, at 10:13 PM, Amos Jeffries wrote:



On May 6, 2009, at 8:14 PM, Amos Jeffries wrote:


Hi,

I've noticed a behavior in CARP failover (on 2.7) that I was wondering
if someone could explain.

In my test environment, I have a non-caching squid configured with
multiple CARP parent caches - two servers, three per box (listening on
ports 1080/1081/1082, respectively), for a total of six servers.

When I fail a squid instance and immediately afterwards run GETs to
URLs that were previously directed to that instance, I notice that the
request goes to a different squid, as expected, and I see the
following in the log for each request:

May  6 11:43:28 cdce-den002-001 squid[1557]: TCP connection to http-cache-1c.den002 (http-cache-1c.den002:1082) failed

And I notice that the request is being forwarded to a different, but
consistent, parent.

After ten of the above requests, I see this:

May  6 11:43:41 cdce-den002-001.den002 squid[1557]: Detected DEAD
Parent: http-cache-1c.den002

So, I'm presuming that after ten failed requests, the peer is
considered DEAD. So far, so good.

The problem is this: During my test GETs, I noticed that immediately
after the "Detected DEAD Parent" message was generated, the parent
server that the request was being forwarded to changed - as if there's
an "interim" decision made until the peer is officially declared DEAD,
and then another hash decision made afterwards. So while consistent
afterwards, it's apparent that during the failover, the parent server
for the test URL changed twice, not once.

Can someone explain this behavior?


Do you have 'default' set on any of the parents?
It is entirely possible that multiple paths are selected as usable and
only the first taken.



No, my cache_peer config options are

cache_peer http-cache-1a.den002 parent 1080 0 carp http11 idle=10
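
(For context, the full set presumably looks something like the
following -- only the 1a line and the 1c host name actually appear in
this thread, so the other names are guesses along the same pattern:

  cache_peer http-cache-1a.den002 parent 1080 0 carp http11 idle=10
  cache_peer http-cache-1b.den002 parent 1081 0 carp http11 idle=10
  cache_peer http-cache-1c.den002 parent 1082 0 carp http11 idle=10

plus the matching three lines for the second box.)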





During the period between death and detection the dead peer will
still be attempted, but failover happens to send the request to
another location. When death is detected the hashes are actually
re-calculated.



OK, correct me if I misread, but my understanding of the spec is that
each parent cache gets its own hash value, each of which is then
combined with the URL's hash to come up with a set of values. The
parent cache corresponding to the highest result is the cache
chosen. If that peer is unavailable, the next-best peer is selected,
then the next, etc etc.
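
(As a toy illustration with made-up scores: say URL u ranks the
parents A=0.9, B=0.5, C=0.2 while URL v ranks them B=0.8, A=0.4,
C=0.3. Then u maps to A and v to B; if A dies, u fails over to its
next-best, B, while v is untouched -- only URLs whose top choice was
the dead parent should move.)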

If that is correct, what hashes are re-calculated when a dead peer is
detected? And why would those hashes produce different results than
the pre-dead-peer run of the algorithm?

And more importantly, will that recalculation result in URLs being
re-mapped that weren't originally pointed to the failed parent? I
thought avoiding s