Re: Child panics on OpenSolaris

2010-02-10 Thread Poul-Henning Kamp
In message <282e72051002100305m5d7a0d1fj3a1afac6ea7dd...@mail.gmail.com>, Paul 
Wright writes:


Hi Paul,

We have a number of tickets on this issue already (626,615,588).

The problem is that EBADF return indicates that Varnish by mistake
has closed a file descriptor, that should still be open.

The alternative explanation, that Solaris can return EBADF because
the other end closed a TCP connection, does not seem to have any
support in Solaris documentation.

We are trying to get a Solaris box running so we can figure this
out once and for all.

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: Child panics on OpenSolaris

2010-02-10 Thread Paul Wright
On 10 February 2010 13:00, Poul-Henning Kamp  wrote:
> In message <282e72051002100305m5d7a0d1fj3a1afac6ea7dd...@mail.gmail.com>, Paul
> Wright writes:
>
>
> Hi Paul,
>
> We have a number of tickets on this issue already (626,615,588).
>
> The problem is that EBADF return indicates that Varnish by mistake
> has closed a file descriptor, that should still be open.
>
> The alternative explanation, that Solaris can return EBADF because
> the other end closed a TCP connection, does not seem to have any
> support in Solaris documentation.
>
> We are trying to get a Solaris box running so we can figure this
> out once and for all.

Thanks for the explanation of what's going on.  Looking at those
tickets there are suggestions to try the poll waiter which we're
already using - are there any further tests we could try to help
narrow down this issue?  I'm happy to assist trying out patches.

Cheers,

Paul.
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: Child panics on OpenSolaris

2010-02-10 Thread Poul-Henning Kamp
In message <282e72051002100615x701a37a8o416c9af6d1d7f...@mail.gmail.com>, Paul 
Wright writes:

>Thanks for the explanation of what's going on.  Looking at those
>tickets there are suggestions to try the poll waiter which we're
>already using - are there any further tests we could try to help
>narrow down this issue?  I'm happy to assist trying out patches.

I can see three ways to nail this issue:

1. Catch a tcpdump, when it happens, showing that the client side
   did close, and Solaris (incorrectly) returns EBADF.

2. Catch a ktrace/systrace/dtrace, when it happens, that show
   that Varnish incorrectly closes the fd.

3. Setup some synthetic test to show that solaris returns EBADF
   when it shouldn't

If either of those are in your reach, by all means go for it...

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: Child panics on OpenSolaris

2010-02-11 Thread Paul Wright
On 10 February 2010 13:53, Poul-Henning Kamp  wrote:
> In message <282e72051002100615x701a37a8o416c9af6d1d7f...@mail.gmail.com>, Paul
> Wright writes:
>
>>Thanks for the explanation of what's going on.  Looking at those
>>tickets there are suggestions to try the poll waiter which we're
>>already using - are there any further tests we could try to help
>>narrow down this issue?  I'm happy to assist trying out patches.
>
> I can see three ways to nail this issue:
>
> 1. Catch a tcpdump, when it happens, showing that the client side
>   did close, and Solaris (incorrectly) returns EBADF.
>
> 2. Catch a ktrace/systrace/dtrace, when it happens, that show
>   that Varnish incorrectly closes the fd.
>
> 3. Setup some synthetic test to show that solaris returns EBADF
>   when it shouldn't
>
> If either of those are in your reach, by all means go for it...

I've had a go at 1.) and have two verbose `snoop` traces during child
panics.  I used sp.client from the backtrace to find out the port
number and then looked at just matching packets. From my (limited)
Wireshark comprehension they show the client establishing a connection
to Varnish, issue a GET, receive the response (200 OK).  Then the
client sends a RST packet, from there the connection disappears.
Would this cause the child to panic?

I can't post these traces publicly but are there any other details
that would help?

I've been racking my brains to think if there is any special in our
setup and the only thing that springs to mind is the firewall.  We
have an OpenBSD firewall using PF to redirect HTTP traffic from the
public IP address to the internal web servers which has worked without
issue for a number of years.  During testing the only firewall change
I've made redirects this traffic to Varnish instead.

Paul.
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: Child panics on OpenSolaris

2010-02-11 Thread Poul-Henning Kamp
In message <282e72051002110739y50bc0b16j687bf75692f42...@mail.gmail.com>, Paul 
Wright writes:

>I've had a go at 1.) and have two verbose `snoop` traces during child
>panics.  I used sp.client from the backtrace to find out the port
>number and then looked at just matching packets. From my (limited)
>Wireshark comprehension they show the client establishing a connection
>to Varnish, issue a GET, receive the response (200 OK).  Then the
>client sends a RST packet, from there the connection disappears.
>Would this cause the child to panic?

Questions:

Does the RST arrive after the entire response is sent ?

If so: does the client ACK the response, before the RST ?

If so: how long time elapses between the ACK and the RST ?


-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: Child panics on OpenSolaris

2010-02-12 Thread Paul Wright
Apologies if this is a duplicate.

On 11 February 2010 21:54, Poul-Henning Kamp  wrote:
> In message <282e72051002110739y50bc0b16j687bf75692f42...@mail.gmail.com>, Paul
> Wright writes:
>
>>I've had a go at 1.) and have two verbose `snoop` traces during child
>>panics.  I used sp.client from the backtrace to find out the port
>>number and then looked at just matching packets. From my (limited)
>>Wireshark comprehension they show the client establishing a connection
>>to Varnish, issue a GET, receive the response (200 OK).  Then the
>>client sends a RST packet, from there the connection disappears.
>>Would this cause the child to panic?
>
> Questions:

I've included the summary traces below in case my reading of them is
incorrect (times are relative to the start of snooping).  The first is
a single GET request, the second contains a number of KeepAlive GETs.
Both end in a RST, ACK packet.

> Does the RST arrive after the entire response is sent ?

Yes.

> If so: does the client ACK the response, before the RST ?

In Stream 1 the client ACK the response in the same packet as the RST.
 In Stream 2 there's an ACK and then an ACK, RST.

> If so: how long time elapses between the ACK and the RST ?

0.031s

Paul.

===Stream 1===
No. TimeSourceDestination   Protocol Info
  30700 10.878818   81.144.251.100192.168.0.35  TCP
  18763 > http [SYN] Seq=0 Win=65535 Len=0 MSS=1380
  30701 10.878838   192.168.0.35  81.144.251.100TCP
  http > 18763 [SYN, ACK] Seq=0 Ack=1 Win=64860 Len=0 MSS=1460
  30703 10.885352   81.144.251.100192.168.0.35  TCP
  18763 > http [ACK] Seq=1 Ack=1 Win=65535 Len=0
  30704 10.885758   81.144.251.100192.168.0.35  HTTP
  GET /flash/hpf.swf HTTP/1.0
  30705 10.885778   192.168.0.35  81.144.251.100TCP
  http > 18763 [ACK] Seq=1 Ack=818 Win=64860 Len=0
  30711 10.887222   192.168.0.35  81.144.251.100TCP
  [TCP segment of a reassembled PDU]
  30712 10.887231   192.168.0.35  81.144.251.100TCP
  [TCP segment of a reassembled PDU]
  30730 10.896227   81.144.251.100192.168.0.35  TCP
  18763 > http [ACK] Seq=818 Ack=1562 Win=65535 Len=0
  30731 10.896242   192.168.0.35  81.144.251.100HTTP
  HTTP/1.1 200 OK  (application/x-shockwave-flash)
  30798 10.938749   81.144.251.100192.168.0.35  TCP
  18763 > http [RST, ACK] Seq=818 Ack=2802 Win=0 Len=0


===Stream 2===
No. TimeSourceDestination   Protocol Info
  29059 10.157730   81.144.251.100192.168.0.35  TCP
  46174 > http [SYN] Seq=0 Win=65535 Len=0 MSS=1380
  29060 10.157765   192.168.0.35  81.144.251.100TCP
  http > 46174 [SYN, ACK] Seq=0 Ack=1 Win=64860 Len=0 MSS=1460
  29115 10.174576   81.144.251.100192.168.0.35  TCP
  46174 > http [ACK] Seq=1 Ack=1 Win=65535 Len=0
  29116 10.174716   81.144.251.100192.168.0.35  HTTP
  GET /i/home/inthepress/inthepress_22ctvodka.jpg HTTP/1.0
  29118 10.174750   192.168.0.35  81.144.251.100TCP
  http > 46174 [ACK] Seq=1 Ack=598 Win=64860 Len=0
  29120 10.174847   192.168.0.35  81.144.251.100TCP
  [TCP segment of a reassembled PDU]
  29121 10.174888   192.168.0.35  81.144.251.100TCP
  [TCP segment of a reassembled PDU]
  29156 10.183128   81.144.251.100192.168.0.35  TCP
  46174 > http [ACK] Seq=598 Ack=1567 Win=65535 Len=0
  29157 10.183144   192.168.0.35  81.144.251.100TCP
  [TCP segment of a reassembled PDU]
  29158 10.183156   192.168.0.35  81.144.251.100TCP
  [TCP segment of a reassembled PDU]
  29159 10.183166   192.168.0.35  81.144.251.100TCP
  [TCP segment of a reassembled PDU]
  29177 10.190576   81.144.251.100192.168.0.35  TCP
  46174 > http [ACK] Seq=598 Ack=5707 Win=65535 Len=0
  29178 10.190593   192.168.0.35  81.144.251.100TCP
  [TCP segment of a reassembled PDU]
  29179 10.190604   192.168.0.35  81.144.251.100TCP
  [TCP segment of a reassembled PDU]
  29180 10.190615   192.168.0.35  81.144.251.100TCP
  [TCP segment of a reassembled PDU]
  29181 10.190624   192.168.0.35  81.144.251.100TCP
  [TCP segment of a reassembled PDU]
  29200 10.197563   81.144.251.100192.168.0.35  TCP
  46174 > http [ACK] Seq=598 Ack=9847 Win=65535 Len=0
  29201 10.197580   192.168.0.35  81.144.251.100TCP
  [TCP segment of a reassembled PDU]
  29202 10.197592   192.168.0.35  81.144.251.100HTTP
  HTTP/1.1 200 OK  (JPEG JFIF image)
  29225 10.205114   81.144.251.100192.168.0.35  TCP
  46174 > http [ACK] Seq=598 Ack=12766 Win=65535 Len=0
  29305 10.232454   81.144.251.100192.168.0.35  HTTP
  GET /i/home/inthepress/inthepress_cuvee.jpg HTTP/1.0
  29306 10.232469   192.168.0.35   

Re: Child panics on OpenSolaris

2010-02-12 Thread Poul-Henning Kamp
In message <282e72051002120417w6c8ce8bcha2c96a809e9ba...@mail.gmail.com>, Paul 
Wright writes:

>> If so: how long time elapses between the ACK and the RST ?
>
>0.031s

Interesting.

Can you try to run with this patch:

http://phk.freebsd.dk/misc/poll.patch

And see what values it puts in your varnishlog for these connections ?

I wonder if Solaris has some kind of "I already told you it were
closed once" logic...

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: Child panics on OpenSolaris

2010-02-12 Thread Paul Wright
On 12 February 2010 12:37, Poul-Henning Kamp  wrote:
> In message <282e72051002120417w6c8ce8bcha2c96a809e9ba...@mail.gmail.com>, Paul
> Wright writes:
>
>>> If so: how long time elapses between the ACK and the RST ?
>>
>>0.031s
>
> Interesting.
>
> Can you try to run with this patch:
>
> http://phk.freebsd.dk/misc/poll.patch
>
> And see what values it puts in your varnishlog for these connections ?
>
> I wonder if Solaris has some kind of "I already told you it were
> closed once" logic...

Here's a snippet from varnishlog for one of these panics (also
attached to this email in case line wraps wreck formatting):

  143 ReqEnd   c 1290159921 1265995548.765989065
1265995548.766152620 0.39401 0.70572 0.92983
  143 Debugc "herding"
  143 Debugc "Poll: 1 / 1"
  143 ReqStart c 78.105.40.8 54467 1290159956
  143 RxRequestc GET
  143 RxURLc /i/template/2009/footer_bottom.gif
  143 RxProtocol   c HTTP/1.1
  143 RxHeader c Accept: */*
  143 RxHeader c Referer: http://www.firebox.com/?aff=1661
  143 RxHeader c Accept-Language: en-GB
  143 RxHeader c User-Agent: Mozilla/4.0 (compatible; MSIE 8.0;
Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR
3.5.30729; .NET CLR 3.0.30729)
  143 RxHeader c Accept-Encoding: gzip, deflate
  143 RxHeader c Host: media1.firebox.com
  143 RxHeader c Connection: Keep-Alive
  143 RxHeader c Cookie: user_session=;
locale=company_id%3A0%2Ccurrency_id%3A0%2Clanguage_id%3A0%2Ccountry%3AUnited+Kingdom;
xp_list=52%3D1; LS_timeEntered=2010-02-12+17%3A25%3A45;
LS_siteID=9zbw6_pUJdo-CEmBfO.mTMjnl23RueBRWg; intl_aff=111
  143 VCL_call c recv
  143 VCL_return   c lookup
  143 VCL_call c hash
  143 VCL_return   c hash
  143 Hit  c 1290147201
  143 VCL_call c hit
  143 VCL_return   c deliver
  143 Length   c 256
  143 VCL_call c deliver
  143 VCL_return   c deliver
  143 TxProtocol   c HTTP/1.1
  143 TxStatus c 200
  143 TxResponse   c OK
  143 TxHeader c Server: Apache
  143 TxHeader c Last-Modified: Mon, 11 May 2009 10:47:04 GMT
  143 TxHeader c ETag: "ae80a9-100-469a0b2e6da00"
  143 TxHeader c Cache-Control: max-age=2592000
  143 TxHeader c Expires: Sun, 14 Mar 2010 17:22:56 GMT
  143 TxHeader c Content-Type: image/gif
  143 TxHeader c Content-Length: 256
  143 TxHeader c Date: Fri, 12 Feb 2010 17:25:49 GMT
  143 TxHeader c X-Varnish: 1290159956 1290147201
  143 TxHeader c Age: 172
  143 TxHeader c Via: 1.1 varnish
  143 TxHeader c Connection: keep-alive
  143 ReqEnd   c 1290159956 1265995549.060598135
1265995549.060703278 0.294445515 0.2 0.49591

Child (19545) died signal=6
Child (19545) Panic message: Assert error in TCP_nonblocking(), tcp.c line 172:
  Condition((ioctl(sock, ((int)((uint32_t)(0x8000|(((sizeof
(int))&0xff)<<16)| ('f'<<8)|126))), &i)) == 0) not true.
errno = 9 (Bad file number)
thread = (cache-worker)
ident = -smalloc,-hcritbit,poll
Backtrace:
  4454cb: /opt/sbin/varnishd'pan_backtrace+0x1b [0x4454cb]
  4457d5: /opt/sbin/varnishd'pan_ic+0x1c5 [0x4457d5]
  fd7ff2fcdfec: /opt/lib/libvarnish.so.1.0.0'TCP_nonblocking+0x7c
[0xfd7ff2fcdfec]
  419091: /opt/sbin/varnishd'vca_return_session+0x1b1 [0x419091]
  42679d: /opt/sbin/varnishd'cnt_wait+0x2bd [0x42679d]
  42b98a: /opt/sbin/varnishd'CNT_Session+0x4ba [0x42b98a]
  44805b: /opt/sbin/varnishd'wrk_do_cnt_sess+0x19b [0x44805b]
  447654: /opt/sbin/varnishd'wrk_thread_real+0x854 [0x447654]
  447bb3: /opt/sbin/varnishd'wrk_thread+0x123 [0x447bb3]
  fd7ff653acf5: /lib/amd64/libc.so.1'_thrp_setup+0x8d [0xfd7ff653acf5]
sp = c84b38 {
  fd = 143, id = 143, xid = 0,
  client = 78.105.40.8:54467,
  step = STP_WAIT,
  handling = deliver,
  restarts = 0, esis = 0
  ws = c84ba8 {
id = "sess",
{s,f,r,e} = {c858b0,+18,+32786,+65536},
  },
  http[req] = {
ws = c84ba8[sess]
  "",
  "/i/template/2009/footer_bottom.gif",
  "HTTP/1.1",
  "Accept: */*",
  "Referer: http://www.firebox.com/?aff=1661";,
  "Accept-Language: en-GB",
  "User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1;
Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR
3.0.30729)",
  "Accept-Encoding: gzip, deflate",
  "Connection: Keep-Alive",
  "host: media.firebox.com",
  "X-Forwarded-For: 78.105.40.8",
  },
},

Paul.


varnish_child.143.log
Description: Binary data
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: Child panics on OpenSolaris

2010-02-12 Thread Poul-Henning Kamp
In message <282e72051002120949u56eb6914mbd55e5a355931...@mail.gmail.com>, Paul 
Wright writes:

>> I wonder if Solaris has some kind of "I already told you it were
>> closed once" logic...
>
>Here's a snippet from varnishlog for one of these panics (also
>attached to this email in case line wraps wreck formatting):

Interesting!

This time the EBADF comes in the original worker thread, before we
hand the file descriptor over to the waiter, eliminating that entire
ball of wax from the picture.

>  419091: /opt/sbin/varnishd'vca_return_session+0x1b1 [0x419091]
>  42679d: /opt/sbin/varnishd'cnt_wait+0x2bd [0x42679d]

I can find absolutely no trace of EBADF meaning "remote end closed"
in the Solaris docs or other docs on the web, but that as far as I
can tell that is indeed what happens here.

But as a kernel programmer, I can see where this might come from:

Receiving a TCP-RST means that the socket is never going to be
useful again.  Since you already have the socket/pcb locked, taking
it entirely out of its missery right away is cheap and more efficient,
than waiting for the process to notice and issue a close(2) on it,
and then have to relock the socket/pcb again etc. etc.

Next time you try to use the filedescriptor, there is no socket
and EBADF ensues.

Reasoning that most programs notice the return value, and call
strerror(3) not caring very much what the exact value of errno
is, you can get away with returning EBADF.

Varnish however, is written my a cranky old FreeBSD kernel hacker,
who has no pretentions about writing correct code the first time,
so 10% of the lines are asserts and yes I actually _do_ care about
the specific errno's returned.

And EBADF is not just any errorcode, it is the only errno which
has universally been recognized as meaning "programmer screwed up",
because you can only get it if you muck up your filedescriptors.

Or as one of the first hits Google gave me, when researching this
more politely but no less firmly describes it:

Bad file number (EBADF): The file descriptor references a
file that is either not open or is open for a conflicting
purpose. (eg, a read(2) is specified against a file that
is open for write(2) or vice-versa.) This is a programming
bug.

(http://www.princeton.edu/~unix/Solaris/troubleshoot/error.html)

If I had implemented the hack I suspect Solaris contains, I would
have found some bit somewhere, to make sure the errno would be the
correct, documented and expected:

#define ECONNRESET  54 /* Connection reset by peer */

Somebody with a Solaris service contract, if such things still
exist, should report this as a bug to them...

I will add a workaround to Varnish, with a suitable sarcastic
commentary...

Poul-Henning

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: Child panics on OpenSolaris

2010-02-12 Thread Michael Fischer
On Fri, Feb 12, 2010 at 10:26 AM, Poul-Henning Kamp wrote:

I can find absolutely no trace of EBADF meaning "remote end closed"
> in the Solaris docs or other docs on the web, but that as far as I
> can tell that is indeed what happens here.
>

Source code is at http://src.opensolaris.org/source/ if you feel like
digging through it.


> If I had implemented the hack I suspect Solaris contains, I would
> have found some bit somewhere, to make sure the errno would be the
> correct, documented and expected:
>
>#define ECONNRESET  54 /* Connection reset by peer */
>
> Somebody with a Solaris service contract, if such things still
> exist, should report this as a bug to them...


I have a feeling such a change will be nontrivial because TCP is implemented
under STREAMS under Solaris.  I don't intend to discourage anyone from
trying, though.

--Michael
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: Child panics on OpenSolaris

2010-02-13 Thread Poul-Henning Kamp
In message <89581.1265999...@critter.freebsd.dk>, "Poul-Henning Kamp" writes:
>In message <282e72051002120949u56eb6914mbd55e5a355931...@mail.gmail.com>, Paul 

>>Here's a snippet from varnishlog for one of these panics (also
>>attached to this email in case line wraps wreck formatting):
>
>Interesting!
>
>This time the EBADF comes in the original worker thread, before we
>hand the file descriptor over to the waiter, eliminating that entire
>ball of wax from the picture.

I'm still not 100% convinced that EBADF is not me screwing something
up, so I have gone over the poll-waiter and made it even more
paranoid and conservative than before.

Can I get you to give -trunk r4561 a swing, and see if that changes
anything ?

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: Child panics on OpenSolaris

2010-02-14 Thread Nils Goroll
Paul,

have I missed anything or have you not yet stated which version of OpenSolaris 
(uname -v) you're using? If you don't use a standard build, could you please 
give the hg changeset you're on?

Nils
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: Child panics on OpenSolaris

2010-02-14 Thread Nils Goroll
Hi Poul-Henning and all,

> If I had implemented the hack I suspect Solaris contains, I would
> have found some bit somewhere, to make sure the errno would be the
> correct, documented and expected:
> 
>   #define ECONNRESET  54 /* Connection reset by peer */
> 
> Somebody with a Solaris service contract, if such things still
> exist, should report this as a bug to them...
> 
> I will add a workaround to Varnish, with a suitable sarcastic
> commentary...

I can't tell at this point if this is a Solaris or a Varnish issue, but I can 
tell for sure that I have never seen it on 2.0.3 with a couple of additional 
fixes running a *high* traffic site. The main difference between this version 
and trunk with respect to TCP is, IIUC, the lack of tcp_linger. Maybe that 
gives 
a clue on where to look, but on the other hand this might be absolutely the 
wrong war to go.

Regarding Solaris, this piece of code here

http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/inet/tcp/tcp.c#11049

documents the errnos that SHOULD be set upon reception of an RST depending on 
various conditions.

Regarding sarcasm: Please add the comment when the current hypothesis proves 
true, but my experience is that in many cases the Solaris guys are quite fuzzy 
about documentation.

I volunteer to get this fixed in OpenSolaris should it turn out that the issue 
is root caused there.

Nils
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: Child panics on OpenSolaris

2010-02-15 Thread Paul Wright
On 14 February 2010 17:53, Nils Goroll  wrote:
> Paul,
>
> have I missed anything or have you not yet stated which version of
> OpenSolaris (uname -v) you're using? If you don't use a standard build,
> could you please give the hg changeset you're on?
>

You're right, I'd forgotten to mention it before now.

[r...@varnish bin]# uname -a
SunOS varnish 5.11 snv_111b i86pc i386 i86pc

Cheers,

Paul.
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: Child panics on OpenSolaris

2010-02-15 Thread Paul Wright
On 13 February 2010 20:26, Poul-Henning Kamp  wrote:

> I'm still not 100% convinced that EBADF is not me screwing something
> up, so I have gone over the poll-waiter and made it even more
> paranoid and conservative than before.
>
> Can I get you to give -trunk r4561 a swing, and see if that changes
> anything ?

I just ran r4561 for about 10 minutes and observed the same error with
a client halfway through a KeepAlive session:

Child (3504) died signal=6
Child (3504) Panic message: Assert error in TCP_nonblocking(), tcp.c line 172:
  Condition((ioctl(sock, ((int)((uint32_t)(0x8000|(((sizeof
(int))&0xff)<<16)| ('f'<<8)|126))), &i)) == 0) not true.
errno = 9 (Bad file number)
thread = (cache-worker)
ident = -smalloc,-hcritbit,poll
Backtrace:
  4457db: /opt/sbin/varnishd'pan_backtrace+0x1b [0x4457db]
  445ae5: /opt/sbin/varnishd'pan_ic+0x1c5 [0x445ae5]
  fd7ff2efdfec: /opt/lib/libvarnish.so.1.0.0'TCP_nonblocking+0x7c
[0xfd7ff2efdfec]
  419091: /opt/sbin/varnishd'vca_return_session+0x1b1 [0x419091]
  426aad: /opt/sbin/varnishd'cnt_wait+0x2bd [0x426aad]
  42bc3a: /opt/sbin/varnishd'CNT_Session+0x4ba [0x42bc3a]
  44835b: /opt/sbin/varnishd'wrk_do_cnt_sess+0x19b [0x44835b]
  447954: /opt/sbin/varnishd'wrk_thread_real+0x854 [0x447954]
  447eb3: /opt/sbin/varnishd'wrk_thread+0x123 [0x447eb3]
  fd7ff653acf5: /lib/amd64/libc.so.1'_thrp_setup+0x8d [0xfd7ff653acf5]
sp = 2b37678 {
  fd = 61, id = 61, xid = 0,
  client = 85.189.148.82:15824,
  step = STP_WAIT,
  handling = deliver,
  restarts = 0, esis = 0
  ws = 2b376e8 {
id = "sess",
{s,f,r,e} = {2b383f0,+20,+32788,+65536},
  },
  http[req] = {
ws = 2b376e8[sess]
  "",
  "/i/video_graphics/starb_0.gif",
  "HTTP/1.1",
  "Accept: */*",
  "Referer:
http://www.firebox.com/product/993/Firebox-of-Retro-Sweets?aff=1108";,
  "Accept-Language: en-gb",
  "UA-CPU: x86",
  "Accept-Encoding: gzip, deflate",
  "User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1;
.NET CLR 1.0.3705)",
  "Connection: Keep-Alive",
  "host: media.firebox.com",
  "X-Forwarded-For: 85.189.148.82",
  },
},

   61 ReqEnd   c 958037480 1266232359.325721025
1266232359.325860262 0.306401968 0.52452 0.86784
   61 Debugc "herding"
   61 Debugc "Poll: 0 / 1"
   61 ReqStart c 85.189.148.82 15824 958037491
   61 RxRequestc GET
   61 RxURLc /i/video_graphics/starb_0.gif
   61 RxProtocol   c HTTP/1.1
   61 RxHeader c Accept: */*
   61 RxHeader c Referer:
http://www.firebox.com/product/993/Firebox-of-Retro-Sweets?aff=1108
   61 RxHeader c Accept-Language: en-gb
   61 RxHeader c UA-CPU: x86
   61 RxHeader c Accept-Encoding: gzip, deflate
   61 RxHeader c User-Agent: Mozilla/4.0 (compatible; MSIE 7.0;
Windows NT 5.1; .NET CLR 1.0.3705)
   61 RxHeader c Host: media4.firebox.com
   61 RxHeader c Connection: Keep-Alive
   61 RxHeader c Cookie: user_session=;
locale=company_id%3A0%2Ccurrency_id%3A0%2Clanguage_id%3A0%2Ccountry%3AUnited+Kingdom;
xp_list=52%3D2; aff=1108; intl_aff=000%3D1108%252C1268824352;
__utma=64137912.663836520.1266233079.1266233079.1266
   61 VCL_call c recv
   61 VCL_return   c lookup
   61 VCL_call c hash
   61 VCL_return   c hash
   61 Hit  c 958027384
   61 VCL_call c hit
   61 VCL_return   c deliver
   61 Length   c 125
   61 VCL_call c deliver
   61 VCL_return   c deliver
   61 TxProtocol   c HTTP/1.1
   61 TxStatus c 200
   61 TxResponse   c OK
   61 TxHeader c Server: Apache
   61 TxHeader c Last-Modified: Mon, 26 Feb 2007 16:53:44 GMT
   61 TxHeader c ETag: "1708031-7d-42a63fbf35600"
   61 TxHeader c Cache-Control: max-age=2592000
   61 TxHeader c Expires: Wed, 17 Mar 2010 11:09:26 GMT
   61 TxHeader c Content-Type: image/gif
   61 TxHeader c Content-Length: 125
   61 TxHeader c Date: Mon, 15 Feb 2010 11:12:39 GMT
   61 TxHeader c X-Varnish: 958037491 958027384
   61 TxHeader c Age: 193
   61 TxHeader c Via: 1.1 varnish
   61 TxHeader c Connection: keep-alive
   61 ReqEnd   c 958037491 1266232359.455624580
1266232359.455763578 0.129764318 0.72718 0.66280

Cheers,

Paul.
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: Child panics on OpenSolaris

2010-02-15 Thread Nils Goroll
Hi Paul,

> You're right, I'd forgotten to mention it before now.
> 
> [r...@varnish bin]# uname -a
> SunOS varnish 5.11 snv_111b i86pc i386 i86pc

Good, thank you. Then at least this shouldn't be a bug introduced with the 
major 
changes of the ip datapath refactoring project 
http://hub.opensolaris.org/bin/view/Project+ip-refactor/WebHome

Nils
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: Child panics on OpenSolaris

2010-02-15 Thread Poul-Henning Kamp
In message <282e72051002150322l57063a9ds98143093910c2...@mail.gmail.com>, Paul 
Wright writes:
>On 13 February 2010 20:26, Poul-Henning Kamp  wrote:
>
>> I'm still not 100% convinced that EBADF is not me screwing something
>> up, so I have gone over the poll-waiter and made it even more
>> paranoid and conservative than before.
>>
>> Can I get you to give -trunk r4561 a swing, and see if that changes
>> anything ?
>
>I just ran r4561 for about 10 minutes and observed the same error with
>a client halfway through a KeepAlive session:

Is there any way (dtrace ?) you can log which systemcalls happen,
I would really like to be 100% sure that varnish does not close that
socket by mistake...

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: Child panics on OpenSolaris

2010-02-15 Thread Nils Goroll
Hi Poul-Henning,

Victor added a comment to http://varnish-cache.org/ticket/629:

http://pastie.org/791964

child (28980) Started
Child (28980) said Closed fds: 3 5 6 9 10 12 13
Child (28980) said Child starts
Child (28980) said managed to mmap 536870912 bytes of 536870912
Child (28980) died signal=6
Child (28980) Panic message: Assert error in vca_main(), cache_waiter_ports.c 
line 175:
   Condition((errno == EINTR) || (errno == ETIME)) not true.
errno = 9 (Bad file number)
Backtrace:
   42e405: /opt/extra/sbin/varnishd'pan_ic+0x95 [0x42e405]
   fd7ffefa4ee0: /lib/amd64/libc.so.1'_lwp_start+0x0 [0xfd7ffefa4ee0]

I definitely do not see this in my patched up 2.0.3 (varnish running for weeks 
on 7 servers now), so IMHO, this is another hint into the direction of some 
change in trunk.

Nils
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: Child panics on OpenSolaris

2010-02-15 Thread Nils Goroll
Poul-Henning,

> Child (3504) died signal=6
> Child (3504) Panic message: Assert error in TCP_nonblocking(), tcp.c line 172:
>   Condition((ioctl(sock, ((int)((uint32_t)(0x8000|(((sizeof
> (int))&0xff)<<16)| ('f'<<8)|126))), &i)) == 0) not true.
> errno = 9 (Bad file number)
> thread = (cache-worker)
> ident = -smalloc,-hcritbit,poll
> Backtrace:
>   4457db: /opt/sbin/varnishd'pan_backtrace+0x1b [0x4457db]
>   445ae5: /opt/sbin/varnishd'pan_ic+0x1c5 [0x445ae5]
>   fd7ff2efdfec: /opt/lib/libvarnish.so.1.0.0'TCP_nonblocking+0x7c
> [0xfd7ff2efdfec]
>   419091: /opt/sbin/varnishd'vca_return_session+0x1b1 [0x419091]
>   426aad: /opt/sbin/varnishd'cnt_wait+0x2bd [0x426aad]
>   42bc3a: /opt/sbin/varnishd'CNT_Session+0x4ba [0x42bc3a]
>   44835b: /opt/sbin/varnishd'wrk_do_cnt_sess+0x19b [0x44835b]
>   447954: /opt/sbin/varnishd'wrk_thread_real+0x854 [0x447954]
>   447eb3: /opt/sbin/varnishd'wrk_thread+0x123 [0x447eb3]

Just an idea from checking differences between the code I use and trunk: In 
cnt_wait, shouldn't we check pfd[0].revents for POLLERR and POLLHUP? Could it 
be 
that Solaris assumes that delivery an error once should suffice, so further use 
of the fd will return EBADF?

Again, I haven't investigated further, sorry for the noise if this turns out to 
be stupid.

Nils
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: Child panics on OpenSolaris

2010-02-15 Thread Poul-Henning Kamp
In message <4b793e98.2020...@schokola.de>, Nils Goroll writes:
>Poul-Henning,
>

>Just an idea from checking differences between the code I use and trunk: In 
>cnt_wait, shouldn't we check pfd[0].revents for POLLERR and POLLHUP? Could it 
>be 
>that Solaris assumes that delivery an error once should suffice, so further 
>use 
>of the fd will return EBADF?

That was one of my theories, but it does not fit the facs of the case,
and it would violate POSIX, which I doubt Solaris would do...

The best contender is still that varnish closes the fd by mistake,
but I'll be damned if I can find where...

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: Child panics on OpenSolaris

2010-02-15 Thread Nils Goroll
Hi Poul-Henning,

> That was one of my theories, but it does not fit the facs of the case,

I don't understand what contradicts this hypothesis?

> and it would violate POSIX, which I doubt Solaris would do...

Just out of interest: Do you have a reference why that would violate POSIX?

http://www.opengroup.org/onlinepubs/009695399/functions/poll.html doesn't tell.

> The best contender is still that varnish closes the fd by mistake,
> but I'll be damned if I can find where...

Please let's come back to the basics and your initial hypothesis one last time:
Suppose the RST gets received between the poll() and ioctl() in TCP_nonblocking.

Why should the ioctl not be allowed to return EBADF in that case? For instance,
this man page

Ioctl Requests   streamio(7I)



NAME
  streamio - STREAMS ioctl commands

SYNOPSIS
  #include 
  #include 
  #include 

  int ioctl(int fildes, int command, ... /*arg*/);

only mentions a hand full of error codes and then there is this:

http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/inet/proto_set.c#tli_errs

I'm not sure this is the exact mapping for this case, but it looks like it...

Nils

___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: Child panics on OpenSolaris

2010-02-16 Thread Poul-Henning Kamp
In message <4b783cd2.3030...@schokola.de>, Nils Goroll writes:

>Regarding Solaris, this piece of code here
>
>http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/inet/tcp/tcp.c#11049
>

Yes, I read that, and that's one of the reasons I am still not convinced
that this is indeed a Solaris issue.

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: Child panics on OpenSolaris

2010-02-16 Thread Poul-Henning Kamp

OK,

It took me a day to liberate a machine to install OpenSolaris on.

Then it took most of a day to remember why I really like FreeBSD
over Solaris.

Then I got distracted by the failing non-blocking connect to backends
and ran truss.

Then I found the root-cause of the problem.

We need CFLAGS to contain -mt on solaris, otherwise errno does not
get macro-fied to be per-thread.

That's where the: EBADF comes from, some entirely different
filedescriptor a long time ago, in the master process...

try:
env CFLAGS=-mt sh configure

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: Child panics on OpenSolaris

2010-02-17 Thread Nils Goroll
Hi Poul-Henning and all,

> We need CFLAGS to contain -mt on solaris, otherwise errno does not
> get macro-fied to be per-thread.
> 
> That's where the: EBADF comes from, some entirely different
> filedescriptor a long time ago, in the master process...

Fantastic. Thank you for putting so much effort into this.

But also: Stupid I didn't think about this. See how I'm compiling varnish since 
I started working with it:

## 64bit
LDFLAGS="-lpthread"
CFLAGS="-D_REENTRANT -m64" \
VCC_CC="exec gcc -fpic -D_REENTRANT -m64 -shared -o %o %s" \
./configure \
'--enable-debugging-symbols' \
'--enable-developer-warnings' \
'--enable-dependency-tracking' \


IIUC, this effectively has the same effect as -mt for SunCC:

User Commands   cc(1)



NAME
  cc - C compiler

[...]

  -mt  Use this option to compile and link multithreaded code.
   The -mt option assures that libraries are linked in the
   appropriate order.

   This option passes -D_REENTRANT to the preprocessor and
   passes -lthread in the correct order to ld.

   If you are using POSIX threads, you must link with the
   options -mt -lpthread.

Nils
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: Child panics on OpenSolaris

2010-02-17 Thread Poul-Henning Kamp
In message <4b7bbdfa.7080...@schokola.de>, Nils Goroll writes:
>Hi Poul-Henning and all,
>
>> We need CFLAGS to contain -mt on solaris, otherwise errno does not
>> get macro-fied to be per-thread.
>> 
>> That's where the: EBADF comes from, some entirely different
>> filedescriptor a long time ago, in the master process...
>
>Fantastic. Thank you for putting so much effort into this.
>
>But also: Stupid I didn't think about this. See how I'm compiling varnish 
>since 
>I started working with it:

Yeah, we clearly need to improve the autocrap magic a bit to get stuff
like this right.

I have added a runtime test now, which panics the child process if the
errno variable is not working properly.

>VCC_CC="exec gcc -fpic -D_REENTRANT -m64 -shared -o %o %s"

Right now our VCC_CC default is for the sun-compiler I think ("-Kpic"),

That's another piece of autocrap stuff that needs fixed...


-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: Child panics on OpenSolaris

2010-02-17 Thread Nils Goroll
Hi Poul-Henning,

> Yeah, we clearly need to improve the autocrap magic a bit to get stuff
> like this right.

Yes. I'm sorry for not having documented this a year ago. Would have saved a 
lot 
of pain and effort...

> I have added a runtime test now, which panics the child process if the
> errno variable is not working properly.

Great!

>> VCC_CC="exec gcc -fpic -D_REENTRANT -m64 -shared -o %o %s"
> 
> Right now our VCC_CC default is for the sun-compiler I think ("-Kpic"),

Yes. The reason I prefer gcc is simply that SunCC is not available everywhere 
and if I understand the licence correctly, it is still only free to use for 
developers and development purposes.

> That's another piece of autocrap stuff that needs fixed...

Do you want me to do it?

Nils
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: Child panics on OpenSolaris

2010-02-17 Thread Paul Wright
On 16 February 2010 20:48, Poul-Henning Kamp  wrote:
>
> OK,
>
> It took me a day to liberate a machine to install OpenSolaris on.
>
> Then it took most of a day to remember why I really like FreeBSD
> over Solaris.
>
> Then I got distracted by the failing non-blocking connect to backends
> and ran truss.
>
> Then I found the root-cause of the problem.
>
> We need CFLAGS to contain -mt on solaris, otherwise errno does not
> get macro-fied to be per-thread.
>
> That's where the: EBADF comes from, some entirely different
> filedescriptor a long time ago, in the master process...
>
> try:
>        env CFLAGS=-mt sh configure

Thanks so much for looking into this.  I spent yesterday attempting to
get some useful debug information out of dtrace with little success -
I either ended up with megabytes of debug dumps or empty files
depending on my D script.  Not exactly successful.

I've compiled with the additional -mt flag, here's my current
compilation process:

make clean
./autogen.sh
CFLAGS="-m64 -mt" CC=/opt/SunStudioExpress/bin/cc  ./configure --prefix=/opt
make
make install

And I'm starting up with:

newtask -p highfile /opt/sbin/varnishd -f /opt/etc/varnish/firebox.vcl -F \
-p cc_command='/opt/SunStudioExpress/bin/cc -Kpic -G -m64 -o %o %s' \
-T 127.0.0.1:9001 \
-s malloc,1G \
-p sess_timeout=5s \
-p max_restarts=12 \
-p waiter=poll \
-p connect_timeout=0s \
-p sess_workspace=65536

Varnish seems to be a *lot* happier and panics less regularly.  Panics
now all seem to suggest the error is "Connection reset by peer"
instead of EBADF

child (14052) Started
Child (14052) said Closed fds: 3 4 5 9 10 12 13
Child (14052) said Child starts
Child (14052) died signal=6
Child (14052) Panic message: Assert error in TCP_nonblocking(), tcp.c line 172:
  Condition((ioctl(sock, ((int)((uint32_t)(0x8000|(((sizeof
(int))&0xff)<<16)| ('f'<<8)|126))), &i)) == 0) not true.
errno = 131 (Connection reset by peer)
thread = (cache-worker)
ident = -smalloc,-hcritbit,poll
Backtrace:
  447adb: /opt/sbin/varnishd'pan_backtrace+0x1b [0x447adb]
  447de5: /opt/sbin/varnishd'pan_ic+0x1c5 [0x447de5]
  fd7ff181e6da: /opt/lib/libvarnish.so.1.0.0'TCP_nonblocking+0x8a
[0xfd7ff181e6da]
  41924e: /opt/sbin/varnishd'vca_return_session+0x1de [0x41924e]
  4276ea: /opt/sbin/varnishd'cnt_wait+0x2ea [0x4276ea]
  42ccd6: /opt/sbin/varnishd'CNT_Session+0x526 [0x42ccd6]
  44a7ef: /opt/sbin/varnishd'wrk_do_cnt_sess+0x1bf [0x44a7ef]
  449d62: /opt/sbin/varnishd'wrk_thread_real+0x882 [0x449d62]
  44a315: /opt/sbin/varnishd'wrk_thread+0x135 [0x44a315]
  fd7ff653acf5: /lib/amd64/libc.so.1'_thrp_setup+0x8d [0xfd7ff653acf5]
sp = 1742b68 {
  fd = 40, id = 40, xid = 0,
  client = 94.226.122.36:3380,
  step = STP_WAIT,
  handling = deliver,
  restarts = 0, esis = 0
  ws = 1742bd8 {
id = "sess",
{s,f,r,e} = {17438e0,+19,+32787,+65536},
  },
  http[req] = {
ws = 1742bd8[sess]
  "",
  "/pic/p2607_small.jpg",
  "HTTP/1.1",
  "Accept: */*",
  "Referer: http://www.firebox.com/?aff=1759/";,
  "Accept-Language: nl-be",
  "User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1;
Trident/4.0)",
  "Accept-Encoding: gzip, deflate",
  "Connection: Keep-Alive",
  "host: media.firebox.com",
  "X-Forwarded-For: 94.226.122.36",
  },
},


And I'm also seeing the same error occur in http_copyheader() - could
this be due to the Range: aspect of the client request?

Child (12008) died signal=6
Child (12008) Panic message: Assert error in http_copyheader(),
cache_http.c line 647:
  Condition(n < to->shd) not true.
errno = 131 (Connection reset by peer)
thread = (cache-worker)
ident = -smalloc,-hcritbit,poll
Backtrace:
  447adb: /opt/sbin/varnishd'pan_backtrace+0x1b [0x447adb]
  447de5: /opt/sbin/varnishd'pan_ic+0x1c5 [0x447de5]
  440791: /opt/sbin/varnishd'http_copyheader+0x1c1 [0x440791]
  4423b1: /opt/sbin/varnishd'http_FilterFields+0xdc1 [0x4423b1]
  429c4d: /opt/sbin/varnishd'cnt_fetch+0x11fd [0x429c4d]
  42cf3a: /opt/sbin/varnishd'CNT_Session+0x78a [0x42cf3a]
  44a7ef: /opt/sbin/varnishd'wrk_do_cnt_sess+0x1bf [0x44a7ef]
  449d62: /opt/sbin/varnishd'wrk_thread_real+0x882 [0x449d62]
  44a315: /opt/sbin/varnishd'wrk_thread+0x135 [0x44a315]
  fd7ff653acf5: /lib/amd64/libc.so.1'_thrp_setup+0x8d [0xfd7ff653acf5]
sp = 397bf38 {
  fd = 56, id = 56, xid = 1848010778,
  client = 82.71.124.65:60324,
  step = STP_FETCH,
  handling = pass,
  err_code = 206, err_reason = (null),
  restarts = 0, esis = 0
  ws = 397bfa8 {
id = "sess",
{s,f,r,e} = {397ccb0,+900,0,+65536},
  },
  http[req] = {
ws = 397bfa8[sess]
  "GET",
  "/xml/rss.top20.000.xml",
  "HTTP/1.1",
  "Host: www.firebox.com",
  "User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5;
en-GB; rv:1.9.2) Gecko/20100115 Firefox/3.6",
  "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
  "Accept-Language: en-gb,en;q=0.5",
  "Accept-Encoding: gzip,deflate",
  "Accept-Charset: 

Re: Child panics on OpenSolaris

2010-02-17 Thread Poul-Henning Kamp
In message <4b7bc0d5.7080...@schokola.de>, Nils Goroll writes:

>> That's another piece of autocrap stuff that needs fixed...
>
>Do you want me to do it?

Send patches!

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: Child panics on OpenSolaris

2010-02-17 Thread Poul-Henning Kamp
In message <282e72051002170230k7ae8e0c8hc2d5226ca9288...@mail.gmail.com>, Paul 
Wright writes:

>I've compiled with the additional -mt flag, here's my current
>compilation process:

Please pull a brand new -trunk, I have added a check for errno 
working and I would like to make sure that passes for you also.

>Child (14052) Panic message: Assert error in TCP_nonblocking(), tcp.c line =
>172:
>  Condition((ioctl(sock, ((int)((uint32_t)(0x8000|(((sizeof
>(int))&0xff)<<16)| ('f'<<8)|126))), &i)) =3D=3D 0) not true.
>errno =3D 131 (Connection reset by peer)


Now, _this_ errno I can actually belive, because that matches
the packet traces we have seen, and it is a plausible scenario.

The fact that Solaris docs does not mention ECONNRESET as a legal
error return for ioctl is a minor detail in that context.

The difference here is that the traditional BSD stack does not
return ECONNRESET until you try to move data on the socket,
giving you much simpler error checking on socket-state changes
(ioctl, fcntl, setsockopt, getsockopt etc)

So now that we have reached the root-cause, I need to go through
and do complex error-checking for all the socket-state calls.

Hopefully done later today...

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: Child panics on OpenSolaris

2010-02-17 Thread Paul Wright
On 17 February 2010 10:57, Poul-Henning Kamp  wrote:
> In message <282e72051002170230k7ae8e0c8hc2d5226ca9288...@mail.gmail.com>, Paul
> Wright writes:
>
>>I've compiled with the additional -mt flag, here's my current
>>compilation process:
>
> Please pull a brand new -trunk, I have added a check for errno
> working and I would like to make sure that passes for you also.

I have pulled the latest revision and can confirm that unlink("/")
failed as expected.  I ran it briefly and saw this panic:

Child (28554) died signal=6
Child (28554) Panic message: Assert error in http_copyheader(),
cache_http.c line 647:
  Condition(n < to->shd) not true.
thread = (cache-worker)
ident = -smalloc,-hcritbit,poll
Backtrace:
  447adb: /opt/sbin/varnishd'pan_backtrace+0x1b [0x447adb]
  447de5: /opt/sbin/varnishd'pan_ic+0x1c5 [0x447de5]
  440791: /opt/sbin/varnishd'http_copyheader+0x1c1 [0x440791]
  4423b1: /opt/sbin/varnishd'http_FilterFields+0xdc1 [0x4423b1]
  429c4d: /opt/sbin/varnishd'cnt_fetch+0x11fd [0x429c4d]
  42cf3a: /opt/sbin/varnishd'CNT_Session+0x78a [0x42cf3a]
  44a7ef: /opt/sbin/varnishd'wrk_do_cnt_sess+0x1bf [0x44a7ef]
  449d62: /opt/sbin/varnishd'wrk_thread_real+0x882 [0x449d62]
  44a315: /opt/sbin/varnishd'wrk_thread+0x135 [0x44a315]
  fd7ff653acf5: /lib/amd64/libc.so.1'_thrp_setup+0x8d [0xfd7ff653acf5]
sp = 18c9b78 {
  fd = 38, id = 38, xid = 789049749,
  client = 82.71.124.65:58190,
  step = STP_FETCH,
  handling = pass,
  err_code = 206, err_reason = (null),
  restarts = 0, esis = 0
  ws = 18c9be8 {
id = "sess",
{s,f,r,e} = {18ca8f0,+884,0,+65536},
  },
  http[req] = {
ws = 18c9be8[sess]
  "GET",
  "/xml/rss.top20.000.xml",
  "HTTP/1.1",
  "Host: www.firebox.com",
  "User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5;
en-GB; rv:1.9.2) Gecko/20100115 Firefox/3.6",
  "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
  "Accept-Language: en-gb,en;q=0.5",
  "Accept-Encoding: gzip,deflate",
  "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7",
  "Keep-Alive: 115",
  "Connection: keep-alive",
  "X-Moz: livebookmarks",
  "Cookie: user_session=;
locale=company_id%3A0%2Ccurrency_id%3A0%2Clanguage_id%3A0%2Ccountry%3AUnited+Kingdom;
xp_list=52%3D1;
__utma=64137912.1325032789.1266225177.1266397371.1266400910.11;
__utmc=64137912;
__utmz=64137912.1266225177.1.1.utmccn=(direct)|utmcsr=(direct)|utmcmd=(none)",
  "Range: bytes=15136-",
  "If-Range: "14b800b-678d-47fc91860e540"",
  "Cache-Control: max-age=0",
  "X-Forwarded-For: 82.71.124.65",
  },
  worker = fd7ff69fad30 {
ws = fd7ff69fae78 {
  id = "wrk",
  {s,f,r,e} = {fd7ff69e8c40,+323,0,+65536},
},
http[bereq] = {
  ws = fd7ff69fae78[wrk]
"GET",
"/xml/rss.top20.000.xml",
"HTTP/1.1",
"Host: www.firebox.com",
"User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5;
en-GB; rv:1.9.2) Gecko/20100115 Firefox/3.6",
"Accept:
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language: en-gb,en;q=0.5",
"Accept-Encoding: gzip,deflate",
"Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7",
"X-Moz: livebookmarks",
"Cookie: user_session=;
locale=company_id%3A0%2Ccurrency_id%3A0%2Clanguage_id%3A0%2Ccountry%3AUnited+Kingdom;
xp_list=52%3D1;
__utma=64137912.1325032789.1266225177.1266397371.1266400910.11;
__utmc=64137912;
__utmz=64137912.1266225177.1.1.utmccn=(direct)|utmcsr=(direct)|utmcmd=(none)",
"Range: bytes=15136-",
"If-Range: "14b800b-678d-47fc91860e540"",
"X-Forwarded-For: 82.71.124.65",
"X-Varnish: 789049749",
},
http[beresp] = {
  ws = fd7ff69fae78[wrk]
"HTTP/1.1",
"206",
"Partial Content",
"Date: Wed, 17 Feb 2010 10:48:30 GMT",
"Server: Apache",
"Last-Modified: Wed, 17 Feb 2010 10:13:01 GMT",
"ETag: "14b800b-678d-47fc91860e540"",
"Accept-Ranges: bytes",
"Content-Length: 11373",
"Content-Range: bytes 15136-26508/26509",
"Connection: close",
"Content-Type: application/xml",
},
},
vcl = {
  srcname = {
"input",
"Default",
  },
},
  obj = 6a0 {
xid = 789049749,
ws = 6c0 {
  id = "obj",
  {s,f,r,e} = {8a5,8a5,0,+220},
},
http[obj] = {
  ws = 6c0[obj]
"HTTP/1.1",
"206",
"Partial Content",
"Date: Wed, 17 Feb 2010 10:48:30 GMT",
"Server: Apache",
"Last-Modified: Wed, 17 Feb 2010 10:13:01 GMT",
"ETag: "14b800b-678d-47fc91860e540"",
},
len = 0,
store = {
},
  },
},


>>Child (14052) Panic message: Assert error in TCP_nonblocking(), tcp.c line =
>>172:
>>  Condition((ioctl(sock, ((int)((uint32_t)(0x8000|(((sizeof
>>(int))&0xff)<<16)| ('f'<<8)|126))), &i)) =3D=3D 0) not true.
>>errno =3D 131 (Connection reset by 

Re: Child panics on OpenSolaris

2010-02-17 Thread Poul-Henning Kamp
In message <282e72051002170310g565a26d8s98170fcebb9b2...@mail.gmail.com>, Paul 
Wright writes:
>On 17 February 2010 10:57, Poul-Henning Kamp  wrote:

>I have pulled the latest revision and can confirm that unlink("/")
>failed as expected.  I ran it briefly and saw this panic:
>
>Child (28554) died signal=3D6
>Child (28554) Panic message: Assert error in http_copyheader(),
>cache_http.c line 647:
>  Condition(n < to->shd) not true.

Sounds like you got too many headers from the backend, if so, 
consider increasing the parameter 'http_headers'


-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: Child panics on OpenSolaris

2010-02-17 Thread Paul Wright
On 17 February 2010 11:14, Poul-Henning Kamp  wrote:
> In message <282e72051002170310g565a26d8s98170fcebb9b2...@mail.gmail.com>, Paul
> Wright writes:
>>On 17 February 2010 10:57, Poul-Henning Kamp  wrote:
>
>>I have pulled the latest revision and can confirm that unlink("/")
>>failed as expected.  I ran it briefly and saw this panic:
>>
>>Child (28554) died signal=3D6
>>Child (28554) Panic message: Assert error in http_copyheader(),
>>cache_http.c line 647:
>>  Condition(n < to->shd) not true.
>
> Sounds like you got too many headers from the backend, if so,
> consider increasing the parameter 'http_headers'

I pulled r4574 with the new TCP_Assert() changes, I increased
http_headers to 128 and it ran like a champ for 25 minutes.  Then a
client made a Range request resulting in a "Partial Content" response
which produced the panic included below.  I've boiled down the request
to a simple curl command that consistently panics the child:

curl -v -H "Range: bytes=0-1" -H "Cookie: foo" http://192.168.0.35/robots.txt

(The cookie header ensures the request is passed through to the
backend. Curl will respond with "Failure when receiving data from the
peer".)

I note that ticket #466 mentions that Varnish doesn't support Range
requests and r3910 should stop such requests passing though - is this
expected behaviour?  Should I stop the backend from generating partial
content responses?  Or should there be a check for such requests and
they should be piped instead of passed?

Paul.

Child (21350) died signal=6
Child (21350) Panic message: Assert error in http_copyheader(),
cache_http.c line 647:
  Condition(n < to->shd) not true.
thread = (cache-worker)
ident = -smalloc,-hcritbit,poll
Backtrace:
  447b2b: /opt/sbin/varnishd'pan_backtrace+0x1b [0x447b2b]
  447e35: /opt/sbin/varnishd'pan_ic+0x1c5 [0x447e35]
  4407e1: /opt/sbin/varnishd'http_copyheader+0x1c1 [0x4407e1]
  442401: /opt/sbin/varnishd'http_FilterFields+0xdc1 [0x442401]
  429c9d: /opt/sbin/varnishd'cnt_fetch+0x11fd [0x429c9d]
  42cf8a: /opt/sbin/varnishd'CNT_Session+0x78a [0x42cf8a]
  44a83f: /opt/sbin/varnishd'wrk_do_cnt_sess+0x1bf [0x44a83f]
  449db2: /opt/sbin/varnishd'wrk_thread_real+0x882 [0x449db2]
  44a365: /opt/sbin/varnishd'wrk_thread+0x135 [0x44a365]
  fd7ff653acf5: /lib/amd64/libc.so.1'_thrp_setup+0x8d [0xfd7ff653acf5]
sp = 1140e08 {
  fd = 77, id = 77, xid = 133352112,
  client = 66.235.125.136:38373,
  step = STP_FETCH,
  handling = pass,
  err_code = 206, err_reason = (null),
  restarts = 0, esis = 0
  ws = 1140e78 {
id = "sess",
{s,f,r,e} = {1142400,+388,0,+65536},
  },
  http[req] = {
ws = 1140e78[sess]
  "GET",
  "/robots.txt",
  "HTTP/1.1",
  "TE: deflate,gzip;q=0.3",
  "Connection: TE, close",
  "From: rob...@pronto.com",
  "Host: www.firebox.com",
  "Range: bytes=0-19",
  "User-Agent: RedCarpet/1.4 (http://www.pronto.com/robots.html)",
  "Cookie: CFTOKEN=aeed9fb6cc27a0f7-8F680632-1517-0C84-6887BD4A85FF98A8;
CFID=128110309",
  "Cookie2: $Version="1"",
  "X-Forwarded-For: 66.235.125.136",
  },
  worker = fd7ff7813d30 {
ws = fd7ff7813e78 {
  id = "wrk",
  {s,f,r,e} = {fd7ff7801c40,+528,0,+65536},
},
http[bereq] = {
  ws = fd7ff7813e78[wrk]
"GET",
"/robots.txt",
"HTTP/1.1",
"From: rob...@pronto.com",
"Host: www.firebox.com",
"Range: bytes=0-19",
"User-Agent: RedCarpet/1.4 (http://www.pronto.com/robots.html)",
"Cookie:
CFTOKEN=aeed9fb6cc27a0f7-8F680632-1517-0C84-6887BD4A85FF98A8;
CFID=128110309",
"Cookie2: $Version="1"",
"X-Forwarded-For: 66.235.125.136",
"X-Varnish: 133352112",
},
http[beresp] = {
  ws = fd7ff7813e78[wrk]
"HTTP/1.1",
"206",
"Partial Content",
"Date: Wed, 17 Feb 2010 16:25:42 GMT",
"Server: Apache",
"Last-Modified: Fri, 11 Dec 2009 18:15:28 GMT",
"ETag: "19c0073-de-47a77e88b9000"",
"Accept-Ranges: bytes",
"Content-Length: 222",
"Content-Range: bytes 0-221/222",
"Connection: close",
"Content-Type: text/plain",
},
},
vcl = {
  srcname = {
"input",
"Default",
  },
},
  obj = 8c1d8a0 {
xid = 133352112,
ws = 8c1d8c0 {
  id = "obj",
  {s,f,r,e} = {8c1daa5,8c1daa5,0,+213},
},
http[obj] = {
  ws = 8c1d8c0[obj]
"HTTP/1.1",
"206",
"Partial Content",
"Date: Wed, 17 Feb 2010 16:25:42 GMT",
"Server: Apache",
"Last-Modified: Fri, 11 Dec 2009 18:15:28 GMT",
"ETag: "19c0073-de-47a77e88b9000"",
},
len = 0,
store = {
},
  },
},
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: Child panics on OpenSolaris

2010-02-17 Thread Poul-Henning Kamp
In message <282e72051002170852g532cd8acqb570e6db4b829...@mail.gmail.com>, Paul 
Wright writes:

>(The cookie header ensures the request is passed through to the
>backend. Curl will respond with "Failure when receiving data from the
>peer".)

How on earth did you get the Range header to be passed through to
the backend ?  It should have been filtered out...

Can you capture a varnishlog for this one ?

Poul-Henning


-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: Child panics on OpenSolaris

2010-02-18 Thread Paul Wright
On 17 February 2010 18:46, Poul-Henning Kamp  wrote:
> In message <282e72051002170852g532cd8acqb570e6db4b829...@mail.gmail.com>, Paul
> Wright writes:
>
>>(The cookie header ensures the request is passed through to the
>>backend. Curl will respond with "Failure when receiving data from the
>>peer".)
>
> How on earth did you get the Range header to be passed through to
> the backend ?  It should have been filtered out...
>
> Can you capture a varnishlog for this one ?

Here's the output:

0 CLI  - Rd ping
0 CLI  - Wr 200 PONG 1266486712 1.0
5 SessionOpen  c 192.168.0.30 60069 :80
5 ReqStart c 192.168.0.30 60069 1736091917
5 RxRequestc GET
5 RxURLc /robots.txt
5 RxProtocol   c HTTP/1.1
5 RxHeader c User-Agent: curl/7.19.4 (i386-pc-solaris2.11)
libcurl/7.19.4 OpenSSL/0.9.8a zlib/1.2.3 libidn/1.9
5 RxHeader c Host: 192.168.0.35
5 RxHeader c Accept: */*
5 RxHeader c Range: bytes=0-1
5 RxHeader c Cookie: foo
5 VCL_call c recv
5 VCL_return   c pass
5 VCL_call c hash
5 VCL_return   c hash
5 VCL_call c pass
5 VCL_return   c pass
9 BackendOpen  b martin 192.168.0.35 54372 192.168.0.31 80
5 Backend  c 9 dynamic_director martin
9 TxRequestb GET
9 TxURLb /robots.txt
9 TxProtocol   b HTTP/1.1
9 TxHeader b User-Agent: curl/7.19.4 (i386-pc-solaris2.11)
libcurl/7.19.4 OpenSSL/0.9.8a zlib/1.2.3 libidn/1.9
9 TxHeader b Host: 192.168.0.35
9 TxHeader b Accept: */*
9 TxHeader b Range: bytes=0-1
9 TxHeader b Cookie: foo
9 TxHeader b X-Forwarded-For: 192.168.0.30
9 TxHeader b X-Varnish: 1736091917
0 WorkThread   - fd7ff9e00d30 start
0 CLI  - Rd vcl.load "boot" ./vcl.ORk8t3RP.so
0 CLI  - Wr 200 Loaded "./vcl.ORk8t3RP.so" as "boot"
0 CLI  - Rd vcl.use "boot"
0 CLI  - Wr 200
0 CLI  - Rd start
0 Debug- "Acceptor poll space increased to 512"
0 Debug- "Acceptor is poll"
0 CLI  - Wr 200
0 WorkThread   - fd7ff9206d30 start
0 WorkThread   - fd7ff9007d30 start
0 WorkThread   - fd7ff8e08d30 start
0 WorkThread   - fd7ff8c09d30 start
0 WorkThread   - fd7ff8a0ad30 start
0 WorkThread   - fd7ff880bd30 start
0 WorkThread   - fd7ff860cd30 start
0 WorkThread   - fd7ff840dd30 start
0 WorkThread   - fd7ff820ed30 start

The VCL used hasn't changed substantially from the original included
at the beginning of this thread, only vcl_recv is defined and I don't
think it's doing anything stupid.

sub vcl_recv {
// normalise static requests
if ( req.http.host ~ "media([0-9]+).firebox.com" ) {
set req.http.host = "media.firebox.com";
}

//catch any relative image URLs that haven't been repointed to media
if ( req.http.host != "media.firebox.com" &&
 ( req.url ~ "^/pic/.+" || req.url ~ "^/i/.+" ) ) {
set req.http.host = "media.firebox.com";
}

// split traffic based on host name
if ( req.http.host == "media.firebox.com" ) {
remove req.http.cookie;
set req.backend = static_director;
} else {
// dynamic content that should be cached (ie no
cookies required)
// these patterns should match up with settings.inc on
the PHP side
if ( req.url ~ "^/styles/(.+).css$" ||
 req.url ~ "^/js/(.+).js$" ) {
remove req.http.cookie;
}
//default all requests to the dynamic backend
set req.backend = dynamic_director;
}
}

This is on r4574, compiled with Sun Studio.

Cheers,

Paul.

Child (2570) died signal=6
Child (2570) Panic message: Assert error in http_copyheader(),
cache_http.c line 647:
  Condition(n < to->shd) not true.
thread = (cache-worker)
ident = -smalloc,-hcritbit,poll
Backtrace:
  447b2b: /opt/sbin/varnishd'pan_backtrace+0x1b [0x447b2b]
  447e35: /opt/sbin/varnishd'pan_ic+0x1c5 [0x447e35]
  4407e1: /opt/sbin/varnishd'http_copyheader+0x1c1 [0x4407e1]
  442401: /opt/sbin/varnishd'http_FilterFields+0xdc1 [0x442401]
  429c9d: /opt/sbin/varnishd'cnt_fetch+0x11fd [0x429c9d]
  42cf8a: /opt/sbin/varnishd'CNT_Session+0x78a [0x42cf8a]
  44a83f: /opt/sbin/varnishd'wrk_do_cnt_sess+0x1bf [0x44a83f]
  449db2: /opt/sbin/varnishd'wrk_thread_real+0x882 [0x449db2]
  44a365: /opt/sbin/varnishd'wrk_thread+0x135 [0x44a365]
  fd7ff653acf5: /lib/amd64/libc.so.1'_thrp_setup+0x8d [0xfd7ff653acf5]
sp = 5eec58 {
  fd = 5, id = 5, xid = 1315520023,
  client = 192.168.0.30:41925,
  step = STP_FETCH,
  handling = pass,
  err_code = 206, err_reason = (null),
  restarts = 0, esis = 0
  ws = 5eecc8 {
id = "sess",
{s,f,r,e} = {5f0250,+257,0,+65536},
  },
  http[req] = {
ws = 5eecc8[sess

Re: Child panics on OpenSolaris

2010-02-18 Thread Paul Wright
On 18 February 2010 10:08, Paul Wright  wrote:
> On 17 February 2010 18:46, Poul-Henning Kamp  wrote:
>> In message <282e72051002170852g532cd8acqb570e6db4b829...@mail.gmail.com>, 
>> Paul
>> Wright writes:
>>
>>>(The cookie header ensures the request is passed through to the
>>>backend. Curl will respond with "Failure when receiving data from the
>>>peer".)
>>
>> How on earth did you get the Range header to be passed through to
>> the backend ?  It should have been filtered out...

Just as a sanity check I compiled r4574 on a Fedora x64 machine with
gcc and it appears that Range requests still make it through to the
backend, but no panics:

   11 SessionOpen  c 192.168.0.169 60860 192.168.0.87:80
   11 ReqStart c 192.168.0.169 60860 1989537926
   11 RxRequestc GET
   11 RxURLc /robots.txt
   11 RxProtocol   c HTTP/1.1
   11 RxHeader c User-Agent: curl/7.19.4
(universal-apple-darwin10.0) libcurl/7.19.4 OpenSSL/0.9.8l zlib/1.2.3
   11 RxHeader c Host: 192.168.0.87
   11 RxHeader c Accept: */*
   11 RxHeader c Range: bytes=0-1
   11 RxHeader c Cookie: foo
   11 VCL_call c recv
   11 VCL_return   c pass
   11 VCL_call c hash
   11 VCL_return   c hash
   11 VCL_call c pass
   11 VCL_return   c pass
   12 BackendClose - paul
   12 BackendOpen  b paul 192.168.0.83 54663 192.168.0.83 80
   11 Backend  c 12 dynamic_director paul
   12 TxRequestb GET
   12 TxURLb /robots.txt
   12 TxProtocol   b HTTP/1.1
   12 TxHeader b User-Agent: curl/7.19.4
(universal-apple-darwin10.0) libcurl/7.19.4 OpenSSL/0.9.8l zlib/1.2.3
   12 TxHeader b Host: 192.168.0.87
   12 TxHeader b Accept: */*
   12 TxHeader b Range: bytes=0-1
   12 TxHeader b Cookie: foo
   12 TxHeader b X-Forwarded-For: 192.168.0.169
   12 TxHeader b X-Varnish: 1989537926
   12 RxProtocol   b HTTP/1.1
   12 RxStatus b 206
   12 RxResponse   b Partial Content
   12 RxHeader b Date: Thu, 18 Feb 2010 12:00:51 GMT
   12 RxHeader b Server: Apache
   12 RxHeader b Last-Modified: Mon, 14 Dec 2009 11:00:28 GMT
   12 RxHeader b ETag: "a2ad-de-47aae2e6fcaa0"
   12 RxHeader b Accept-Ranges: bytes
   12 RxHeader b Content-Length: 2
   12 RxHeader b Content-Range: bytes 0-1/222
   12 RxHeader b Content-Type: text/plain
   12 RxHeader b X-Pad: avoid browser bug
   11 TTL  c 1989537926 RFC 120 1266494451 0 0 0 0
   11 VCL_call c fetch
   11 VCL_return   c pass
   11 ObjProtocol  c HTTP/1.1
   11 ObjStatusc 206
   11 ObjResponse  c Partial Content
   11 ObjHeaderc Date: Thu, 18 Feb 2010 12:00:51 GMT
   11 ObjHeaderc Server: Apache
   11 ObjHeaderc Last-Modified: Mon, 14 Dec 2009 11:00:28 GMT
   11 ObjHeaderc ETag: "a2ad-de-47aae2e6fcaa0"
   11 ObjHeaderc Content-Type: text/plain
   11 ObjHeaderc X-Pad: avoid browser bug
   12 BackendReuse b paul
   11 Length   c 2
   11 VCL_call c deliver
   11 VCL_return   c deliver
   11 TxProtocol   c HTTP/1.1
   11 TxStatus c 206
   11 TxResponse   c Partial Content
   11 TxHeader c Server: Apache
   11 TxHeader c Last-Modified: Mon, 14 Dec 2009 11:00:28 GMT
   11 TxHeader c ETag: "a2ad-de-47aae2e6fcaa0"
   11 TxHeader c Content-Type: text/plain
   11 TxHeader c X-Pad: avoid browser bug
   11 TxHeader c Content-Length: 2
   11 TxHeader c Date: Thu, 18 Feb 2010 12:00:51 GMT
   11 TxHeader c X-Varnish: 1989537926
   11 TxHeader c Age: 0
   11 TxHeader c Via: 1.1 varnish
   11 TxHeader c Connection: keep-alive
   11 ReqEnd   c 1989537926 1266494451.053643227
1266494451.055507183 0.000101566 0.001832485 0.31471
   11 SessionClose c EOF
   11 ReqEnd   c 0 1266494451.055849552 1266494451.055849552
0.000342369 0.0 0.0
   11 StatSess c 192.168.0.169 60860 0 1 1 0 1 1 306 2

Paul.
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: Child panics on OpenSolaris

2010-02-19 Thread Paul Wright
On 17 February 2010 18:46, Poul-Henning Kamp  wrote:
> In message <282e72051002170852g532cd8acqb570e6db4b829...@mail.gmail.com>, Paul
> Wright writes:
>
>>(The cookie header ensures the request is passed through to the
>>backend. Curl will respond with "Failure when receiving data from the
>>peer".)
>
> How on earth did you get the Range header to be passed through to
> the backend ?  It should have been filtered out...

I added the following to the top of vcl_recv()

if ( req.http.Range ) {
unset req.http.Range;
}

Varnish ran happily until tripping over the same bug (#649) as victori
(on r4573).  A couple of examples follow.

New personal best of 1 hour, 20 minutes uptime!

Paul.

Child (26791) died signal=6
Child (26791) Panic message: Assert error in VCA_Prep(),
cache_acceptor.c line 184:
  Condition(TCP_Check(setsockopt(sp->fd, 0x, 0x0080, &linger,
sizeof linger))) not true.
errno = 22 (Invalid argument)
thread = (cache-worker)
ident = -smalloc,-hcritbit,poll
Backtrace:
  447b2b: /opt/sbin/varnishd'pan_backtrace+0x1b [0x447b2b]
  447e35: /opt/sbin/varnishd'pan_ic+0x1c5 [0x447e35]
  41862a: /opt/sbin/varnishd'VCA_Prep+0x29a [0x41862a]
  42a466: /opt/sbin/varnishd'cnt_first+0xb6 [0x42a466]
  42cd6a: /opt/sbin/varnishd'CNT_Session+0x56a [0x42cd6a]
  44a83f: /opt/sbin/varnishd'wrk_do_cnt_sess+0x1bf [0x44a83f]
  449db2: /opt/sbin/varnishd'wrk_thread_real+0x882 [0x449db2]
  44a365: /opt/sbin/varnishd'wrk_thread+0x135 [0x44a365]
  fd7ff653acf5: /lib/amd64/libc.so.1'_thrp_setup+0x8d [0xfd7ff653acf5]
  fd7ff653afb0: /lib/amd64/libc.so.1'_lwp_start+0x0 [0xfd7ff653afb0]
sp = 10e8ae08 {
  fd = 73, id = 73, xid = 0,
  client = 94.196.213.123:50084,
  step = STP_FIRST,
  handling = deliver,
  restarts = 0, esis = 0
  ws = 10e8ae78 {
id = "sess",
{s,f,r,e} = {10e8c400,+21,0,+65536},
  },
  http[req] = {
ws = 10e8ae78[sess]
  "",
  "/pic/p2031b.jpg",
  "HTTP/1.1",
  "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US;
rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 (.NET CLR 3.5.30729)",
  "Accept: image/png,image/*;q=0.8,*/*;q=0.5",
  "Accept-Language: en-us,en;q=0.5",
  "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7",
  "Keep-Alive: 300",
  "Connection: keep-alive",
  "Referer:
http://www.affordablegiftsolutions.com/index.php?option=com_cmsshopbuilder&view=category&id=55&Itemid=5&limitstart=400";,
  "host: media.firebox.com",
  "X-Forwarded-For: 67.234.17.245",
  },
  worker = fd7ff2c96d30 {
ws = fd7ff2c96e78 {
  id = "wrk",
  {s,f,r,e} = {fd7ff2c84c40,fd7ff2c84c40,0,+65536},
},
},
},


Child cleanup complete
child (2199) Started
Child (2199) said Closed fds: 3 4 5 9 10 12 13
Child (2199) said Child starts

Child (8098) died signal=6
Child (8098) Panic message: Assert error in VCA_Prep(),
cache_acceptor.c line 184:
  Condition(TCP_Check(setsockopt(sp->fd, 0x, 0x0080, &linger,
sizeof linger))) not true.
errno = 22 (Invalid argument)
thread = (cache-worker)
ident = -smalloc,-hcritbit,poll
Backtrace:
  447b1b: /opt/sbin/varnishd'pan_backtrace+0x1b [0x447b1b]
  447e25: /opt/sbin/varnishd'pan_ic+0x1c5 [0x447e25]
  41862a: /opt/sbin/varnishd'VCA_Prep+0x29a [0x41862a]
  42a456: /opt/sbin/varnishd'cnt_first+0xb6 [0x42a456]
  42cd5a: /opt/sbin/varnishd'CNT_Session+0x56a [0x42cd5a]
  44a82f: /opt/sbin/varnishd'wrk_do_cnt_sess+0x1bf [0x44a82f]
  449da2: /opt/sbin/varnishd'wrk_thread_real+0x882 [0x449da2]
  44a355: /opt/sbin/varnishd'wrk_thread+0x135 [0x44a355]
  fd7ff653acf5: /lib/amd64/libc.so.1'_thrp_setup+0x8d [0xfd7ff653acf5]
  fd7ff653afb0: /lib/amd64/libc.so.1'_lwp_start+0x0 [0xfd7ff653afb0]
sp = ee89a8 {
  fd = 17, id = 17, xid = 0,
  client = 217.111.162.2:58987,
  step = STP_FIRST,
  handling = deliver,
  restarts = 0, esis = 0
  ws = ee8a18 {
id = "sess",
{s,f,r,e} = {ee9fa0,+20,0,+65536},
  },
  http[req] = {
ws = ee8a18[sess]
  "",
  "/v/2000/2381/98x74.jpg",
  "HTTP/1.1",
  "Accept: */*",
  "Referer:
http://www.firebox.com/product/2480/LEGO-Mindstorms-NXT-2.0?via=cat";,
  "Accept-Language: en-gb",
  "UA-CPU: x86",
  "User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0;
SLCC1; .NET CLR 2.0.50727; .NET CLR 3.0.04506; InfoPath.2)",
  "Host: media.firebox.com",
  "Cache-Control: max-stale=0",
  "Connection: Keep-Alive",
  "X-BlueCoat-Via: A1EDA8423E101D0C",
  "X-Forwarded-For: 89.189.78.18, 82.114.160.35",
  },
  worker = fd7ff83edd30 {
ws = fd7ff83ede78 {
  id = "wrk",
  {s,f,r,e} = {fd7ff83dbc40,fd7ff83dbc40,0,+65536},
},
},
},
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: Child panics on OpenSolaris

2010-02-22 Thread Paul Wright
On 19 February 2010 17:45, Paul Wright  wrote:
> On 17 February 2010 18:46, Poul-Henning Kamp  wrote:
>> In message <282e72051002170852g532cd8acqb570e6db4b829...@mail.gmail.com>, 
>> Paul
>> Wright writes:
>>
>>>(The cookie header ensures the request is passed through to the
>>>backend. Curl will respond with "Failure when receiving data from the
>>>peer".)
>>
>> How on earth did you get the Range header to be passed through to
>> the backend ?  It should have been filtered out...
>
> I added the following to the top of vcl_recv()
>
>        if ( req.http.Range ) {
>                unset req.http.Range;
>        }
>
> Varnish ran happily until tripping over the same bug (#649) as victori
> (on r4573).

For anyone else following along I've now had varnish running for over
5 hours without issue, here are the things I found out:

* add the Range unsetting code to ensure that such requests don't make
it through to the back end
* remove the TCP_Assert() that wraps the setsockopt() call on line 184
of bin/varnishd/cache_acceptor.c
* compile with gcc, not Sun Studio (there's still some sort of
funniness with TCP_(non)blocking() )

CC=/usr/bin/gcc CFLAGS="-O3 -L/lib/amd64 -pthreads -m64
-fomit-frame-pointer" LDFLAGS="-lumem -pthreads" ./configure
--prefix=/opt

* pass the right flags through to gcc when launching vanishd

newtask -p highfile /opt/sbin/varnishd -f /opt/etc/varnish/firebox.vcl -F \
-p 'cc_command=/usr/bin/gcc -fpic -shared -m64 -o %o %s' \
-T 127.0.0.1:9001 \
-s malloc,2G \
-p sess_timeout=5s \
-p max_restarts=12 \
-p waiter=poll \
-p connect_timeout=0s \
-p sess_workspace=65536

* keep checking http://letsgetdugg.com/2009/12/04/varnish-on-solaris/
for hints and suggestions

Many thanks to Poul-Henning Kamp and Victor Igumnov for getting us to
this point.

Cheers,

Paul.
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: Child panics on OpenSolaris

2010-03-04 Thread Paul Wright
On 22 February 2010 18:02, Paul Wright  wrote:
...
> For anyone else following along I've now had varnish running for over
> 5 hours without issue, here are the things I found out:
>
> * add the Range unsetting code to ensure that such requests don't make
> it through to the back end
> * remove the TCP_Assert() that wraps the setsockopt() call on line 184
> of bin/varnishd/cache_acceptor.c
> * compile with gcc, not Sun Studio (there's still some sort of
> funniness with TCP_(non)blocking() )
>
> CC=/usr/bin/gcc CFLAGS="-O3 -L/lib/amd64 -pthreads -m64
> -fomit-frame-pointer" LDFLAGS="-lumem -pthreads" ./configure
> --prefix=/opt
>
> * pass the right flags through to gcc when launching vanishd
>
> newtask -p highfile /opt/sbin/varnishd -f /opt/etc/varnish/firebox.vcl -F \
> -p 'cc_command=/usr/bin/gcc -fpic -shared -m64 -o %o %s' \
> -T 127.0.0.1:9001 \
> -s malloc,2G \
> -p sess_timeout=5s \
> -p max_restarts=12 \
> -p waiter=poll \
> -p connect_timeout=0s \
> -p sess_workspace=65536
>
> * keep checking http://letsgetdugg.com/2009/12/04/varnish-on-solaris/
> for hints and suggestions

Latest update, we're seeing "Connection refused" panics like the following:

Child (1955) died signal=6
Child (1955) Panic message: Assert error in TCP_blocking(), tcp.c line 164:
  Condition(TCP_Check(j)) not true.
errno = 146 (Connection refused)
thread = (cache-worker)
ident = -smalloc,-hcritbit,poll
Backtrace:
  42fb51: /opt/sbin/varnishd'pan_ic+0xb1 [0x42fb51]
  2f: [0x2f]
sp = 13570018 {
  fd = 47, id = 47, xid = 0,
  client = ?.?.?.?:?,
  step = STP_FIRST,
  handling = deliver,
  restarts = 0, esis = 0
  ws = 13570088 {
id = "sess",
{s,f,r,e} = {13570d90,13570d90,0,+65536},
  },
  http[req] = {
ws = 13570088[sess]
  "",
  "/i/nl/all/0.gif",
  "HTTP/1.1",
  "Host: media.firebox.com",
  "User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_2;
en-us) AppleWebKit/531.21.8 (KHTML, like Gecko)",
  "Accept: */*",
  "Accept-Language: en-us",
  "If-Modified-Since: Tue, 22 Sep 2009 15:34:33 GMT",
  "If-None-Match: "16a8070-2b-4742c55c1f440"",
  "Connection: keep-alive",
  "X-Forwarded-For: 85.189.102.193",
  },
  worker = fd7ff7bf1d80 {
ws = fd7ff7bf1ec8 {
  id = "wrk",
  {s,f,r,e} = {fd7ff7bdfcb0,fd7ff7bdfcb0,0,+65536},
},
},
},

Interesting things to note, we're confident that this request is a
cache hit which rules out the backend (handling = deliver).  Also the
client address appears to have been mangled:

  client = ?.?.?.?:?,

Would this cause varnish to attempt opening a connection which is then refused?

As a workaround would it be advisable to add a clause to TCP_Check (in
include/libvarnish.h) to skip over errno 146 (Connection refused)
along with the existing ECONNRESET and ENOTCONN clauses?

Cheers,

Paul.
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc