Re: Child panics on OpenSolaris
In message <282e72051002100305m5d7a0d1fj3a1afac6ea7dd...@mail.gmail.com>, Paul Wright writes: Hi Paul, We have a number of tickets on this issue already (626,615,588). The problem is that EBADF return indicates that Varnish by mistake has closed a file descriptor, that should still be open. The alternative explanation, that Solaris can return EBADF because the other end closed a TCP connection, does not seem to have any support in Solaris documentation. We are trying to get a Solaris box running so we can figure this out once and for all. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 p...@freebsd.org | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. ___ varnish-misc mailing list varnish-misc@projects.linpro.no http://projects.linpro.no/mailman/listinfo/varnish-misc
Re: Child panics on OpenSolaris
On 10 February 2010 13:00, Poul-Henning Kamp wrote: > In message <282e72051002100305m5d7a0d1fj3a1afac6ea7dd...@mail.gmail.com>, Paul > Wright writes: > > > Hi Paul, > > We have a number of tickets on this issue already (626,615,588). > > The problem is that EBADF return indicates that Varnish by mistake > has closed a file descriptor, that should still be open. > > The alternative explanation, that Solaris can return EBADF because > the other end closed a TCP connection, does not seem to have any > support in Solaris documentation. > > We are trying to get a Solaris box running so we can figure this > out once and for all. Thanks for the explanation of what's going on. Looking at those tickets there are suggestions to try the poll waiter which we're already using - are there any further tests we could try to help narrow down this issue? I'm happy to assist trying out patches. Cheers, Paul. ___ varnish-misc mailing list varnish-misc@projects.linpro.no http://projects.linpro.no/mailman/listinfo/varnish-misc
Re: Child panics on OpenSolaris
In message <282e72051002100615x701a37a8o416c9af6d1d7f...@mail.gmail.com>, Paul Wright writes: >Thanks for the explanation of what's going on. Looking at those >tickets there are suggestions to try the poll waiter which we're >already using - are there any further tests we could try to help >narrow down this issue? I'm happy to assist trying out patches. I can see three ways to nail this issue: 1. Catch a tcpdump, when it happens, showing that the client side did close, and Solaris (incorrectly) returns EBADF. 2. Catch a ktrace/systrace/dtrace, when it happens, that show that Varnish incorrectly closes the fd. 3. Setup some synthetic test to show that solaris returns EBADF when it shouldn't If either of those are in your reach, by all means go for it... -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 p...@freebsd.org | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. ___ varnish-misc mailing list varnish-misc@projects.linpro.no http://projects.linpro.no/mailman/listinfo/varnish-misc
Re: Child panics on OpenSolaris
On 10 February 2010 13:53, Poul-Henning Kamp wrote: > In message <282e72051002100615x701a37a8o416c9af6d1d7f...@mail.gmail.com>, Paul > Wright writes: > >>Thanks for the explanation of what's going on. Looking at those >>tickets there are suggestions to try the poll waiter which we're >>already using - are there any further tests we could try to help >>narrow down this issue? I'm happy to assist trying out patches. > > I can see three ways to nail this issue: > > 1. Catch a tcpdump, when it happens, showing that the client side > did close, and Solaris (incorrectly) returns EBADF. > > 2. Catch a ktrace/systrace/dtrace, when it happens, that show > that Varnish incorrectly closes the fd. > > 3. Setup some synthetic test to show that solaris returns EBADF > when it shouldn't > > If either of those are in your reach, by all means go for it... I've had a go at 1.) and have two verbose `snoop` traces during child panics. I used sp.client from the backtrace to find out the port number and then looked at just matching packets. From my (limited) Wireshark comprehension they show the client establishing a connection to Varnish, issue a GET, receive the response (200 OK). Then the client sends a RST packet, from there the connection disappears. Would this cause the child to panic? I can't post these traces publicly but are there any other details that would help? I've been racking my brains to think if there is any special in our setup and the only thing that springs to mind is the firewall. We have an OpenBSD firewall using PF to redirect HTTP traffic from the public IP address to the internal web servers which has worked without issue for a number of years. During testing the only firewall change I've made redirects this traffic to Varnish instead. Paul. ___ varnish-misc mailing list varnish-misc@projects.linpro.no http://projects.linpro.no/mailman/listinfo/varnish-misc
Re: Child panics on OpenSolaris
In message <282e72051002110739y50bc0b16j687bf75692f42...@mail.gmail.com>, Paul Wright writes: >I've had a go at 1.) and have two verbose `snoop` traces during child >panics. I used sp.client from the backtrace to find out the port >number and then looked at just matching packets. From my (limited) >Wireshark comprehension they show the client establishing a connection >to Varnish, issue a GET, receive the response (200 OK). Then the >client sends a RST packet, from there the connection disappears. >Would this cause the child to panic? Questions: Does the RST arrive after the entire response is sent ? If so: does the client ACK the response, before the RST ? If so: how long time elapses between the ACK and the RST ? -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 p...@freebsd.org | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. ___ varnish-misc mailing list varnish-misc@projects.linpro.no http://projects.linpro.no/mailman/listinfo/varnish-misc
Re: Child panics on OpenSolaris
Apologies if this is a duplicate. On 11 February 2010 21:54, Poul-Henning Kamp wrote: > In message <282e72051002110739y50bc0b16j687bf75692f42...@mail.gmail.com>, Paul > Wright writes: > >>I've had a go at 1.) and have two verbose `snoop` traces during child >>panics. I used sp.client from the backtrace to find out the port >>number and then looked at just matching packets. From my (limited) >>Wireshark comprehension they show the client establishing a connection >>to Varnish, issue a GET, receive the response (200 OK). Then the >>client sends a RST packet, from there the connection disappears. >>Would this cause the child to panic? > > Questions: I've included the summary traces below in case my reading of them is incorrect (times are relative to the start of snooping). The first is a single GET request, the second contains a number of KeepAlive GETs. Both end in a RST, ACK packet. > Does the RST arrive after the entire response is sent ? Yes. > If so: does the client ACK the response, before the RST ? In Stream 1 the client ACK the response in the same packet as the RST. In Stream 2 there's an ACK and then an ACK, RST. > If so: how long time elapses between the ACK and the RST ? 0.031s Paul. ===Stream 1=== No. TimeSourceDestination Protocol Info 30700 10.878818 81.144.251.100192.168.0.35 TCP 18763 > http [SYN] Seq=0 Win=65535 Len=0 MSS=1380 30701 10.878838 192.168.0.35 81.144.251.100TCP http > 18763 [SYN, ACK] Seq=0 Ack=1 Win=64860 Len=0 MSS=1460 30703 10.885352 81.144.251.100192.168.0.35 TCP 18763 > http [ACK] Seq=1 Ack=1 Win=65535 Len=0 30704 10.885758 81.144.251.100192.168.0.35 HTTP GET /flash/hpf.swf HTTP/1.0 30705 10.885778 192.168.0.35 81.144.251.100TCP http > 18763 [ACK] Seq=1 Ack=818 Win=64860 Len=0 30711 10.887222 192.168.0.35 81.144.251.100TCP [TCP segment of a reassembled PDU] 30712 10.887231 192.168.0.35 81.144.251.100TCP [TCP segment of a reassembled PDU] 30730 10.896227 81.144.251.100192.168.0.35 TCP 18763 > http [ACK] Seq=818 Ack=1562 Win=65535 Len=0 30731 10.896242 192.168.0.35 81.144.251.100HTTP HTTP/1.1 200 OK (application/x-shockwave-flash) 30798 10.938749 81.144.251.100192.168.0.35 TCP 18763 > http [RST, ACK] Seq=818 Ack=2802 Win=0 Len=0 ===Stream 2=== No. TimeSourceDestination Protocol Info 29059 10.157730 81.144.251.100192.168.0.35 TCP 46174 > http [SYN] Seq=0 Win=65535 Len=0 MSS=1380 29060 10.157765 192.168.0.35 81.144.251.100TCP http > 46174 [SYN, ACK] Seq=0 Ack=1 Win=64860 Len=0 MSS=1460 29115 10.174576 81.144.251.100192.168.0.35 TCP 46174 > http [ACK] Seq=1 Ack=1 Win=65535 Len=0 29116 10.174716 81.144.251.100192.168.0.35 HTTP GET /i/home/inthepress/inthepress_22ctvodka.jpg HTTP/1.0 29118 10.174750 192.168.0.35 81.144.251.100TCP http > 46174 [ACK] Seq=1 Ack=598 Win=64860 Len=0 29120 10.174847 192.168.0.35 81.144.251.100TCP [TCP segment of a reassembled PDU] 29121 10.174888 192.168.0.35 81.144.251.100TCP [TCP segment of a reassembled PDU] 29156 10.183128 81.144.251.100192.168.0.35 TCP 46174 > http [ACK] Seq=598 Ack=1567 Win=65535 Len=0 29157 10.183144 192.168.0.35 81.144.251.100TCP [TCP segment of a reassembled PDU] 29158 10.183156 192.168.0.35 81.144.251.100TCP [TCP segment of a reassembled PDU] 29159 10.183166 192.168.0.35 81.144.251.100TCP [TCP segment of a reassembled PDU] 29177 10.190576 81.144.251.100192.168.0.35 TCP 46174 > http [ACK] Seq=598 Ack=5707 Win=65535 Len=0 29178 10.190593 192.168.0.35 81.144.251.100TCP [TCP segment of a reassembled PDU] 29179 10.190604 192.168.0.35 81.144.251.100TCP [TCP segment of a reassembled PDU] 29180 10.190615 192.168.0.35 81.144.251.100TCP [TCP segment of a reassembled PDU] 29181 10.190624 192.168.0.35 81.144.251.100TCP [TCP segment of a reassembled PDU] 29200 10.197563 81.144.251.100192.168.0.35 TCP 46174 > http [ACK] Seq=598 Ack=9847 Win=65535 Len=0 29201 10.197580 192.168.0.35 81.144.251.100TCP [TCP segment of a reassembled PDU] 29202 10.197592 192.168.0.35 81.144.251.100HTTP HTTP/1.1 200 OK (JPEG JFIF image) 29225 10.205114 81.144.251.100192.168.0.35 TCP 46174 > http [ACK] Seq=598 Ack=12766 Win=65535 Len=0 29305 10.232454 81.144.251.100192.168.0.35 HTTP GET /i/home/inthepress/inthepress_cuvee.jpg HTTP/1.0 29306 10.232469 192.168.0.35
Re: Child panics on OpenSolaris
In message <282e72051002120417w6c8ce8bcha2c96a809e9ba...@mail.gmail.com>, Paul Wright writes: >> If so: how long time elapses between the ACK and the RST ? > >0.031s Interesting. Can you try to run with this patch: http://phk.freebsd.dk/misc/poll.patch And see what values it puts in your varnishlog for these connections ? I wonder if Solaris has some kind of "I already told you it were closed once" logic... -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 p...@freebsd.org | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. ___ varnish-misc mailing list varnish-misc@projects.linpro.no http://projects.linpro.no/mailman/listinfo/varnish-misc
Re: Child panics on OpenSolaris
On 12 February 2010 12:37, Poul-Henning Kamp wrote: > In message <282e72051002120417w6c8ce8bcha2c96a809e9ba...@mail.gmail.com>, Paul > Wright writes: > >>> If so: how long time elapses between the ACK and the RST ? >> >>0.031s > > Interesting. > > Can you try to run with this patch: > > http://phk.freebsd.dk/misc/poll.patch > > And see what values it puts in your varnishlog for these connections ? > > I wonder if Solaris has some kind of "I already told you it were > closed once" logic... Here's a snippet from varnishlog for one of these panics (also attached to this email in case line wraps wreck formatting): 143 ReqEnd c 1290159921 1265995548.765989065 1265995548.766152620 0.39401 0.70572 0.92983 143 Debugc "herding" 143 Debugc "Poll: 1 / 1" 143 ReqStart c 78.105.40.8 54467 1290159956 143 RxRequestc GET 143 RxURLc /i/template/2009/footer_bottom.gif 143 RxProtocol c HTTP/1.1 143 RxHeader c Accept: */* 143 RxHeader c Referer: http://www.firebox.com/?aff=1661 143 RxHeader c Accept-Language: en-GB 143 RxHeader c User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729) 143 RxHeader c Accept-Encoding: gzip, deflate 143 RxHeader c Host: media1.firebox.com 143 RxHeader c Connection: Keep-Alive 143 RxHeader c Cookie: user_session=; locale=company_id%3A0%2Ccurrency_id%3A0%2Clanguage_id%3A0%2Ccountry%3AUnited+Kingdom; xp_list=52%3D1; LS_timeEntered=2010-02-12+17%3A25%3A45; LS_siteID=9zbw6_pUJdo-CEmBfO.mTMjnl23RueBRWg; intl_aff=111 143 VCL_call c recv 143 VCL_return c lookup 143 VCL_call c hash 143 VCL_return c hash 143 Hit c 1290147201 143 VCL_call c hit 143 VCL_return c deliver 143 Length c 256 143 VCL_call c deliver 143 VCL_return c deliver 143 TxProtocol c HTTP/1.1 143 TxStatus c 200 143 TxResponse c OK 143 TxHeader c Server: Apache 143 TxHeader c Last-Modified: Mon, 11 May 2009 10:47:04 GMT 143 TxHeader c ETag: "ae80a9-100-469a0b2e6da00" 143 TxHeader c Cache-Control: max-age=2592000 143 TxHeader c Expires: Sun, 14 Mar 2010 17:22:56 GMT 143 TxHeader c Content-Type: image/gif 143 TxHeader c Content-Length: 256 143 TxHeader c Date: Fri, 12 Feb 2010 17:25:49 GMT 143 TxHeader c X-Varnish: 1290159956 1290147201 143 TxHeader c Age: 172 143 TxHeader c Via: 1.1 varnish 143 TxHeader c Connection: keep-alive 143 ReqEnd c 1290159956 1265995549.060598135 1265995549.060703278 0.294445515 0.2 0.49591 Child (19545) died signal=6 Child (19545) Panic message: Assert error in TCP_nonblocking(), tcp.c line 172: Condition((ioctl(sock, ((int)((uint32_t)(0x8000|(((sizeof (int))&0xff)<<16)| ('f'<<8)|126))), &i)) == 0) not true. errno = 9 (Bad file number) thread = (cache-worker) ident = -smalloc,-hcritbit,poll Backtrace: 4454cb: /opt/sbin/varnishd'pan_backtrace+0x1b [0x4454cb] 4457d5: /opt/sbin/varnishd'pan_ic+0x1c5 [0x4457d5] fd7ff2fcdfec: /opt/lib/libvarnish.so.1.0.0'TCP_nonblocking+0x7c [0xfd7ff2fcdfec] 419091: /opt/sbin/varnishd'vca_return_session+0x1b1 [0x419091] 42679d: /opt/sbin/varnishd'cnt_wait+0x2bd [0x42679d] 42b98a: /opt/sbin/varnishd'CNT_Session+0x4ba [0x42b98a] 44805b: /opt/sbin/varnishd'wrk_do_cnt_sess+0x19b [0x44805b] 447654: /opt/sbin/varnishd'wrk_thread_real+0x854 [0x447654] 447bb3: /opt/sbin/varnishd'wrk_thread+0x123 [0x447bb3] fd7ff653acf5: /lib/amd64/libc.so.1'_thrp_setup+0x8d [0xfd7ff653acf5] sp = c84b38 { fd = 143, id = 143, xid = 0, client = 78.105.40.8:54467, step = STP_WAIT, handling = deliver, restarts = 0, esis = 0 ws = c84ba8 { id = "sess", {s,f,r,e} = {c858b0,+18,+32786,+65536}, }, http[req] = { ws = c84ba8[sess] "", "/i/template/2009/footer_bottom.gif", "HTTP/1.1", "Accept: */*", "Referer: http://www.firebox.com/?aff=1661";, "Accept-Language: en-GB", "User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729)", "Accept-Encoding: gzip, deflate", "Connection: Keep-Alive", "host: media.firebox.com", "X-Forwarded-For: 78.105.40.8", }, }, Paul. varnish_child.143.log Description: Binary data ___ varnish-misc mailing list varnish-misc@projects.linpro.no http://projects.linpro.no/mailman/listinfo/varnish-misc
Re: Child panics on OpenSolaris
In message <282e72051002120949u56eb6914mbd55e5a355931...@mail.gmail.com>, Paul Wright writes: >> I wonder if Solaris has some kind of "I already told you it were >> closed once" logic... > >Here's a snippet from varnishlog for one of these panics (also >attached to this email in case line wraps wreck formatting): Interesting! This time the EBADF comes in the original worker thread, before we hand the file descriptor over to the waiter, eliminating that entire ball of wax from the picture. > 419091: /opt/sbin/varnishd'vca_return_session+0x1b1 [0x419091] > 42679d: /opt/sbin/varnishd'cnt_wait+0x2bd [0x42679d] I can find absolutely no trace of EBADF meaning "remote end closed" in the Solaris docs or other docs on the web, but that as far as I can tell that is indeed what happens here. But as a kernel programmer, I can see where this might come from: Receiving a TCP-RST means that the socket is never going to be useful again. Since you already have the socket/pcb locked, taking it entirely out of its missery right away is cheap and more efficient, than waiting for the process to notice and issue a close(2) on it, and then have to relock the socket/pcb again etc. etc. Next time you try to use the filedescriptor, there is no socket and EBADF ensues. Reasoning that most programs notice the return value, and call strerror(3) not caring very much what the exact value of errno is, you can get away with returning EBADF. Varnish however, is written my a cranky old FreeBSD kernel hacker, who has no pretentions about writing correct code the first time, so 10% of the lines are asserts and yes I actually _do_ care about the specific errno's returned. And EBADF is not just any errorcode, it is the only errno which has universally been recognized as meaning "programmer screwed up", because you can only get it if you muck up your filedescriptors. Or as one of the first hits Google gave me, when researching this more politely but no less firmly describes it: Bad file number (EBADF): The file descriptor references a file that is either not open or is open for a conflicting purpose. (eg, a read(2) is specified against a file that is open for write(2) or vice-versa.) This is a programming bug. (http://www.princeton.edu/~unix/Solaris/troubleshoot/error.html) If I had implemented the hack I suspect Solaris contains, I would have found some bit somewhere, to make sure the errno would be the correct, documented and expected: #define ECONNRESET 54 /* Connection reset by peer */ Somebody with a Solaris service contract, if such things still exist, should report this as a bug to them... I will add a workaround to Varnish, with a suitable sarcastic commentary... Poul-Henning -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 p...@freebsd.org | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. ___ varnish-misc mailing list varnish-misc@projects.linpro.no http://projects.linpro.no/mailman/listinfo/varnish-misc
Re: Child panics on OpenSolaris
On Fri, Feb 12, 2010 at 10:26 AM, Poul-Henning Kamp wrote: I can find absolutely no trace of EBADF meaning "remote end closed" > in the Solaris docs or other docs on the web, but that as far as I > can tell that is indeed what happens here. > Source code is at http://src.opensolaris.org/source/ if you feel like digging through it. > If I had implemented the hack I suspect Solaris contains, I would > have found some bit somewhere, to make sure the errno would be the > correct, documented and expected: > >#define ECONNRESET 54 /* Connection reset by peer */ > > Somebody with a Solaris service contract, if such things still > exist, should report this as a bug to them... I have a feeling such a change will be nontrivial because TCP is implemented under STREAMS under Solaris. I don't intend to discourage anyone from trying, though. --Michael ___ varnish-misc mailing list varnish-misc@projects.linpro.no http://projects.linpro.no/mailman/listinfo/varnish-misc
Re: Child panics on OpenSolaris
In message <89581.1265999...@critter.freebsd.dk>, "Poul-Henning Kamp" writes: >In message <282e72051002120949u56eb6914mbd55e5a355931...@mail.gmail.com>, Paul >>Here's a snippet from varnishlog for one of these panics (also >>attached to this email in case line wraps wreck formatting): > >Interesting! > >This time the EBADF comes in the original worker thread, before we >hand the file descriptor over to the waiter, eliminating that entire >ball of wax from the picture. I'm still not 100% convinced that EBADF is not me screwing something up, so I have gone over the poll-waiter and made it even more paranoid and conservative than before. Can I get you to give -trunk r4561 a swing, and see if that changes anything ? -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 p...@freebsd.org | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. ___ varnish-misc mailing list varnish-misc@projects.linpro.no http://projects.linpro.no/mailman/listinfo/varnish-misc
Re: Child panics on OpenSolaris
Paul, have I missed anything or have you not yet stated which version of OpenSolaris (uname -v) you're using? If you don't use a standard build, could you please give the hg changeset you're on? Nils ___ varnish-misc mailing list varnish-misc@projects.linpro.no http://projects.linpro.no/mailman/listinfo/varnish-misc
Re: Child panics on OpenSolaris
Hi Poul-Henning and all, > If I had implemented the hack I suspect Solaris contains, I would > have found some bit somewhere, to make sure the errno would be the > correct, documented and expected: > > #define ECONNRESET 54 /* Connection reset by peer */ > > Somebody with a Solaris service contract, if such things still > exist, should report this as a bug to them... > > I will add a workaround to Varnish, with a suitable sarcastic > commentary... I can't tell at this point if this is a Solaris or a Varnish issue, but I can tell for sure that I have never seen it on 2.0.3 with a couple of additional fixes running a *high* traffic site. The main difference between this version and trunk with respect to TCP is, IIUC, the lack of tcp_linger. Maybe that gives a clue on where to look, but on the other hand this might be absolutely the wrong war to go. Regarding Solaris, this piece of code here http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/inet/tcp/tcp.c#11049 documents the errnos that SHOULD be set upon reception of an RST depending on various conditions. Regarding sarcasm: Please add the comment when the current hypothesis proves true, but my experience is that in many cases the Solaris guys are quite fuzzy about documentation. I volunteer to get this fixed in OpenSolaris should it turn out that the issue is root caused there. Nils ___ varnish-misc mailing list varnish-misc@projects.linpro.no http://projects.linpro.no/mailman/listinfo/varnish-misc
Re: Child panics on OpenSolaris
On 14 February 2010 17:53, Nils Goroll wrote: > Paul, > > have I missed anything or have you not yet stated which version of > OpenSolaris (uname -v) you're using? If you don't use a standard build, > could you please give the hg changeset you're on? > You're right, I'd forgotten to mention it before now. [r...@varnish bin]# uname -a SunOS varnish 5.11 snv_111b i86pc i386 i86pc Cheers, Paul. ___ varnish-misc mailing list varnish-misc@projects.linpro.no http://projects.linpro.no/mailman/listinfo/varnish-misc
Re: Child panics on OpenSolaris
On 13 February 2010 20:26, Poul-Henning Kamp wrote: > I'm still not 100% convinced that EBADF is not me screwing something > up, so I have gone over the poll-waiter and made it even more > paranoid and conservative than before. > > Can I get you to give -trunk r4561 a swing, and see if that changes > anything ? I just ran r4561 for about 10 minutes and observed the same error with a client halfway through a KeepAlive session: Child (3504) died signal=6 Child (3504) Panic message: Assert error in TCP_nonblocking(), tcp.c line 172: Condition((ioctl(sock, ((int)((uint32_t)(0x8000|(((sizeof (int))&0xff)<<16)| ('f'<<8)|126))), &i)) == 0) not true. errno = 9 (Bad file number) thread = (cache-worker) ident = -smalloc,-hcritbit,poll Backtrace: 4457db: /opt/sbin/varnishd'pan_backtrace+0x1b [0x4457db] 445ae5: /opt/sbin/varnishd'pan_ic+0x1c5 [0x445ae5] fd7ff2efdfec: /opt/lib/libvarnish.so.1.0.0'TCP_nonblocking+0x7c [0xfd7ff2efdfec] 419091: /opt/sbin/varnishd'vca_return_session+0x1b1 [0x419091] 426aad: /opt/sbin/varnishd'cnt_wait+0x2bd [0x426aad] 42bc3a: /opt/sbin/varnishd'CNT_Session+0x4ba [0x42bc3a] 44835b: /opt/sbin/varnishd'wrk_do_cnt_sess+0x19b [0x44835b] 447954: /opt/sbin/varnishd'wrk_thread_real+0x854 [0x447954] 447eb3: /opt/sbin/varnishd'wrk_thread+0x123 [0x447eb3] fd7ff653acf5: /lib/amd64/libc.so.1'_thrp_setup+0x8d [0xfd7ff653acf5] sp = 2b37678 { fd = 61, id = 61, xid = 0, client = 85.189.148.82:15824, step = STP_WAIT, handling = deliver, restarts = 0, esis = 0 ws = 2b376e8 { id = "sess", {s,f,r,e} = {2b383f0,+20,+32788,+65536}, }, http[req] = { ws = 2b376e8[sess] "", "/i/video_graphics/starb_0.gif", "HTTP/1.1", "Accept: */*", "Referer: http://www.firebox.com/product/993/Firebox-of-Retro-Sweets?aff=1108";, "Accept-Language: en-gb", "UA-CPU: x86", "Accept-Encoding: gzip, deflate", "User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.0.3705)", "Connection: Keep-Alive", "host: media.firebox.com", "X-Forwarded-For: 85.189.148.82", }, }, 61 ReqEnd c 958037480 1266232359.325721025 1266232359.325860262 0.306401968 0.52452 0.86784 61 Debugc "herding" 61 Debugc "Poll: 0 / 1" 61 ReqStart c 85.189.148.82 15824 958037491 61 RxRequestc GET 61 RxURLc /i/video_graphics/starb_0.gif 61 RxProtocol c HTTP/1.1 61 RxHeader c Accept: */* 61 RxHeader c Referer: http://www.firebox.com/product/993/Firebox-of-Retro-Sweets?aff=1108 61 RxHeader c Accept-Language: en-gb 61 RxHeader c UA-CPU: x86 61 RxHeader c Accept-Encoding: gzip, deflate 61 RxHeader c User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.0.3705) 61 RxHeader c Host: media4.firebox.com 61 RxHeader c Connection: Keep-Alive 61 RxHeader c Cookie: user_session=; locale=company_id%3A0%2Ccurrency_id%3A0%2Clanguage_id%3A0%2Ccountry%3AUnited+Kingdom; xp_list=52%3D2; aff=1108; intl_aff=000%3D1108%252C1268824352; __utma=64137912.663836520.1266233079.1266233079.1266 61 VCL_call c recv 61 VCL_return c lookup 61 VCL_call c hash 61 VCL_return c hash 61 Hit c 958027384 61 VCL_call c hit 61 VCL_return c deliver 61 Length c 125 61 VCL_call c deliver 61 VCL_return c deliver 61 TxProtocol c HTTP/1.1 61 TxStatus c 200 61 TxResponse c OK 61 TxHeader c Server: Apache 61 TxHeader c Last-Modified: Mon, 26 Feb 2007 16:53:44 GMT 61 TxHeader c ETag: "1708031-7d-42a63fbf35600" 61 TxHeader c Cache-Control: max-age=2592000 61 TxHeader c Expires: Wed, 17 Mar 2010 11:09:26 GMT 61 TxHeader c Content-Type: image/gif 61 TxHeader c Content-Length: 125 61 TxHeader c Date: Mon, 15 Feb 2010 11:12:39 GMT 61 TxHeader c X-Varnish: 958037491 958027384 61 TxHeader c Age: 193 61 TxHeader c Via: 1.1 varnish 61 TxHeader c Connection: keep-alive 61 ReqEnd c 958037491 1266232359.455624580 1266232359.455763578 0.129764318 0.72718 0.66280 Cheers, Paul. ___ varnish-misc mailing list varnish-misc@projects.linpro.no http://projects.linpro.no/mailman/listinfo/varnish-misc
Re: Child panics on OpenSolaris
Hi Paul, > You're right, I'd forgotten to mention it before now. > > [r...@varnish bin]# uname -a > SunOS varnish 5.11 snv_111b i86pc i386 i86pc Good, thank you. Then at least this shouldn't be a bug introduced with the major changes of the ip datapath refactoring project http://hub.opensolaris.org/bin/view/Project+ip-refactor/WebHome Nils ___ varnish-misc mailing list varnish-misc@projects.linpro.no http://projects.linpro.no/mailman/listinfo/varnish-misc
Re: Child panics on OpenSolaris
In message <282e72051002150322l57063a9ds98143093910c2...@mail.gmail.com>, Paul Wright writes: >On 13 February 2010 20:26, Poul-Henning Kamp wrote: > >> I'm still not 100% convinced that EBADF is not me screwing something >> up, so I have gone over the poll-waiter and made it even more >> paranoid and conservative than before. >> >> Can I get you to give -trunk r4561 a swing, and see if that changes >> anything ? > >I just ran r4561 for about 10 minutes and observed the same error with >a client halfway through a KeepAlive session: Is there any way (dtrace ?) you can log which systemcalls happen, I would really like to be 100% sure that varnish does not close that socket by mistake... -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 p...@freebsd.org | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. ___ varnish-misc mailing list varnish-misc@projects.linpro.no http://projects.linpro.no/mailman/listinfo/varnish-misc
Re: Child panics on OpenSolaris
Hi Poul-Henning, Victor added a comment to http://varnish-cache.org/ticket/629: http://pastie.org/791964 child (28980) Started Child (28980) said Closed fds: 3 5 6 9 10 12 13 Child (28980) said Child starts Child (28980) said managed to mmap 536870912 bytes of 536870912 Child (28980) died signal=6 Child (28980) Panic message: Assert error in vca_main(), cache_waiter_ports.c line 175: Condition((errno == EINTR) || (errno == ETIME)) not true. errno = 9 (Bad file number) Backtrace: 42e405: /opt/extra/sbin/varnishd'pan_ic+0x95 [0x42e405] fd7ffefa4ee0: /lib/amd64/libc.so.1'_lwp_start+0x0 [0xfd7ffefa4ee0] I definitely do not see this in my patched up 2.0.3 (varnish running for weeks on 7 servers now), so IMHO, this is another hint into the direction of some change in trunk. Nils ___ varnish-misc mailing list varnish-misc@projects.linpro.no http://projects.linpro.no/mailman/listinfo/varnish-misc
Re: Child panics on OpenSolaris
Poul-Henning, > Child (3504) died signal=6 > Child (3504) Panic message: Assert error in TCP_nonblocking(), tcp.c line 172: > Condition((ioctl(sock, ((int)((uint32_t)(0x8000|(((sizeof > (int))&0xff)<<16)| ('f'<<8)|126))), &i)) == 0) not true. > errno = 9 (Bad file number) > thread = (cache-worker) > ident = -smalloc,-hcritbit,poll > Backtrace: > 4457db: /opt/sbin/varnishd'pan_backtrace+0x1b [0x4457db] > 445ae5: /opt/sbin/varnishd'pan_ic+0x1c5 [0x445ae5] > fd7ff2efdfec: /opt/lib/libvarnish.so.1.0.0'TCP_nonblocking+0x7c > [0xfd7ff2efdfec] > 419091: /opt/sbin/varnishd'vca_return_session+0x1b1 [0x419091] > 426aad: /opt/sbin/varnishd'cnt_wait+0x2bd [0x426aad] > 42bc3a: /opt/sbin/varnishd'CNT_Session+0x4ba [0x42bc3a] > 44835b: /opt/sbin/varnishd'wrk_do_cnt_sess+0x19b [0x44835b] > 447954: /opt/sbin/varnishd'wrk_thread_real+0x854 [0x447954] > 447eb3: /opt/sbin/varnishd'wrk_thread+0x123 [0x447eb3] Just an idea from checking differences between the code I use and trunk: In cnt_wait, shouldn't we check pfd[0].revents for POLLERR and POLLHUP? Could it be that Solaris assumes that delivery an error once should suffice, so further use of the fd will return EBADF? Again, I haven't investigated further, sorry for the noise if this turns out to be stupid. Nils ___ varnish-misc mailing list varnish-misc@projects.linpro.no http://projects.linpro.no/mailman/listinfo/varnish-misc
Re: Child panics on OpenSolaris
In message <4b793e98.2020...@schokola.de>, Nils Goroll writes: >Poul-Henning, > >Just an idea from checking differences between the code I use and trunk: In >cnt_wait, shouldn't we check pfd[0].revents for POLLERR and POLLHUP? Could it >be >that Solaris assumes that delivery an error once should suffice, so further >use >of the fd will return EBADF? That was one of my theories, but it does not fit the facs of the case, and it would violate POSIX, which I doubt Solaris would do... The best contender is still that varnish closes the fd by mistake, but I'll be damned if I can find where... -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 p...@freebsd.org | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. ___ varnish-misc mailing list varnish-misc@projects.linpro.no http://projects.linpro.no/mailman/listinfo/varnish-misc
Re: Child panics on OpenSolaris
Hi Poul-Henning, > That was one of my theories, but it does not fit the facs of the case, I don't understand what contradicts this hypothesis? > and it would violate POSIX, which I doubt Solaris would do... Just out of interest: Do you have a reference why that would violate POSIX? http://www.opengroup.org/onlinepubs/009695399/functions/poll.html doesn't tell. > The best contender is still that varnish closes the fd by mistake, > but I'll be damned if I can find where... Please let's come back to the basics and your initial hypothesis one last time: Suppose the RST gets received between the poll() and ioctl() in TCP_nonblocking. Why should the ioctl not be allowed to return EBADF in that case? For instance, this man page Ioctl Requests streamio(7I) NAME streamio - STREAMS ioctl commands SYNOPSIS #include #include #include int ioctl(int fildes, int command, ... /*arg*/); only mentions a hand full of error codes and then there is this: http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/inet/proto_set.c#tli_errs I'm not sure this is the exact mapping for this case, but it looks like it... Nils ___ varnish-misc mailing list varnish-misc@projects.linpro.no http://projects.linpro.no/mailman/listinfo/varnish-misc
Re: Child panics on OpenSolaris
In message <4b783cd2.3030...@schokola.de>, Nils Goroll writes: >Regarding Solaris, this piece of code here > >http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/inet/tcp/tcp.c#11049 > Yes, I read that, and that's one of the reasons I am still not convinced that this is indeed a Solaris issue. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 p...@freebsd.org | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. ___ varnish-misc mailing list varnish-misc@projects.linpro.no http://projects.linpro.no/mailman/listinfo/varnish-misc
Re: Child panics on OpenSolaris
OK, It took me a day to liberate a machine to install OpenSolaris on. Then it took most of a day to remember why I really like FreeBSD over Solaris. Then I got distracted by the failing non-blocking connect to backends and ran truss. Then I found the root-cause of the problem. We need CFLAGS to contain -mt on solaris, otherwise errno does not get macro-fied to be per-thread. That's where the: EBADF comes from, some entirely different filedescriptor a long time ago, in the master process... try: env CFLAGS=-mt sh configure -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 p...@freebsd.org | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. ___ varnish-misc mailing list varnish-misc@projects.linpro.no http://projects.linpro.no/mailman/listinfo/varnish-misc
Re: Child panics on OpenSolaris
Hi Poul-Henning and all, > We need CFLAGS to contain -mt on solaris, otherwise errno does not > get macro-fied to be per-thread. > > That's where the: EBADF comes from, some entirely different > filedescriptor a long time ago, in the master process... Fantastic. Thank you for putting so much effort into this. But also: Stupid I didn't think about this. See how I'm compiling varnish since I started working with it: ## 64bit LDFLAGS="-lpthread" CFLAGS="-D_REENTRANT -m64" \ VCC_CC="exec gcc -fpic -D_REENTRANT -m64 -shared -o %o %s" \ ./configure \ '--enable-debugging-symbols' \ '--enable-developer-warnings' \ '--enable-dependency-tracking' \ IIUC, this effectively has the same effect as -mt for SunCC: User Commands cc(1) NAME cc - C compiler [...] -mt Use this option to compile and link multithreaded code. The -mt option assures that libraries are linked in the appropriate order. This option passes -D_REENTRANT to the preprocessor and passes -lthread in the correct order to ld. If you are using POSIX threads, you must link with the options -mt -lpthread. Nils ___ varnish-misc mailing list varnish-misc@projects.linpro.no http://projects.linpro.no/mailman/listinfo/varnish-misc
Re: Child panics on OpenSolaris
In message <4b7bbdfa.7080...@schokola.de>, Nils Goroll writes: >Hi Poul-Henning and all, > >> We need CFLAGS to contain -mt on solaris, otherwise errno does not >> get macro-fied to be per-thread. >> >> That's where the: EBADF comes from, some entirely different >> filedescriptor a long time ago, in the master process... > >Fantastic. Thank you for putting so much effort into this. > >But also: Stupid I didn't think about this. See how I'm compiling varnish >since >I started working with it: Yeah, we clearly need to improve the autocrap magic a bit to get stuff like this right. I have added a runtime test now, which panics the child process if the errno variable is not working properly. >VCC_CC="exec gcc -fpic -D_REENTRANT -m64 -shared -o %o %s" Right now our VCC_CC default is for the sun-compiler I think ("-Kpic"), That's another piece of autocrap stuff that needs fixed... -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 p...@freebsd.org | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. ___ varnish-misc mailing list varnish-misc@projects.linpro.no http://projects.linpro.no/mailman/listinfo/varnish-misc
Re: Child panics on OpenSolaris
Hi Poul-Henning, > Yeah, we clearly need to improve the autocrap magic a bit to get stuff > like this right. Yes. I'm sorry for not having documented this a year ago. Would have saved a lot of pain and effort... > I have added a runtime test now, which panics the child process if the > errno variable is not working properly. Great! >> VCC_CC="exec gcc -fpic -D_REENTRANT -m64 -shared -o %o %s" > > Right now our VCC_CC default is for the sun-compiler I think ("-Kpic"), Yes. The reason I prefer gcc is simply that SunCC is not available everywhere and if I understand the licence correctly, it is still only free to use for developers and development purposes. > That's another piece of autocrap stuff that needs fixed... Do you want me to do it? Nils ___ varnish-misc mailing list varnish-misc@projects.linpro.no http://projects.linpro.no/mailman/listinfo/varnish-misc
Re: Child panics on OpenSolaris
On 16 February 2010 20:48, Poul-Henning Kamp wrote: > > OK, > > It took me a day to liberate a machine to install OpenSolaris on. > > Then it took most of a day to remember why I really like FreeBSD > over Solaris. > > Then I got distracted by the failing non-blocking connect to backends > and ran truss. > > Then I found the root-cause of the problem. > > We need CFLAGS to contain -mt on solaris, otherwise errno does not > get macro-fied to be per-thread. > > That's where the: EBADF comes from, some entirely different > filedescriptor a long time ago, in the master process... > > try: > env CFLAGS=-mt sh configure Thanks so much for looking into this. I spent yesterday attempting to get some useful debug information out of dtrace with little success - I either ended up with megabytes of debug dumps or empty files depending on my D script. Not exactly successful. I've compiled with the additional -mt flag, here's my current compilation process: make clean ./autogen.sh CFLAGS="-m64 -mt" CC=/opt/SunStudioExpress/bin/cc ./configure --prefix=/opt make make install And I'm starting up with: newtask -p highfile /opt/sbin/varnishd -f /opt/etc/varnish/firebox.vcl -F \ -p cc_command='/opt/SunStudioExpress/bin/cc -Kpic -G -m64 -o %o %s' \ -T 127.0.0.1:9001 \ -s malloc,1G \ -p sess_timeout=5s \ -p max_restarts=12 \ -p waiter=poll \ -p connect_timeout=0s \ -p sess_workspace=65536 Varnish seems to be a *lot* happier and panics less regularly. Panics now all seem to suggest the error is "Connection reset by peer" instead of EBADF child (14052) Started Child (14052) said Closed fds: 3 4 5 9 10 12 13 Child (14052) said Child starts Child (14052) died signal=6 Child (14052) Panic message: Assert error in TCP_nonblocking(), tcp.c line 172: Condition((ioctl(sock, ((int)((uint32_t)(0x8000|(((sizeof (int))&0xff)<<16)| ('f'<<8)|126))), &i)) == 0) not true. errno = 131 (Connection reset by peer) thread = (cache-worker) ident = -smalloc,-hcritbit,poll Backtrace: 447adb: /opt/sbin/varnishd'pan_backtrace+0x1b [0x447adb] 447de5: /opt/sbin/varnishd'pan_ic+0x1c5 [0x447de5] fd7ff181e6da: /opt/lib/libvarnish.so.1.0.0'TCP_nonblocking+0x8a [0xfd7ff181e6da] 41924e: /opt/sbin/varnishd'vca_return_session+0x1de [0x41924e] 4276ea: /opt/sbin/varnishd'cnt_wait+0x2ea [0x4276ea] 42ccd6: /opt/sbin/varnishd'CNT_Session+0x526 [0x42ccd6] 44a7ef: /opt/sbin/varnishd'wrk_do_cnt_sess+0x1bf [0x44a7ef] 449d62: /opt/sbin/varnishd'wrk_thread_real+0x882 [0x449d62] 44a315: /opt/sbin/varnishd'wrk_thread+0x135 [0x44a315] fd7ff653acf5: /lib/amd64/libc.so.1'_thrp_setup+0x8d [0xfd7ff653acf5] sp = 1742b68 { fd = 40, id = 40, xid = 0, client = 94.226.122.36:3380, step = STP_WAIT, handling = deliver, restarts = 0, esis = 0 ws = 1742bd8 { id = "sess", {s,f,r,e} = {17438e0,+19,+32787,+65536}, }, http[req] = { ws = 1742bd8[sess] "", "/pic/p2607_small.jpg", "HTTP/1.1", "Accept: */*", "Referer: http://www.firebox.com/?aff=1759/";, "Accept-Language: nl-be", "User-Agent: Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)", "Accept-Encoding: gzip, deflate", "Connection: Keep-Alive", "host: media.firebox.com", "X-Forwarded-For: 94.226.122.36", }, }, And I'm also seeing the same error occur in http_copyheader() - could this be due to the Range: aspect of the client request? Child (12008) died signal=6 Child (12008) Panic message: Assert error in http_copyheader(), cache_http.c line 647: Condition(n < to->shd) not true. errno = 131 (Connection reset by peer) thread = (cache-worker) ident = -smalloc,-hcritbit,poll Backtrace: 447adb: /opt/sbin/varnishd'pan_backtrace+0x1b [0x447adb] 447de5: /opt/sbin/varnishd'pan_ic+0x1c5 [0x447de5] 440791: /opt/sbin/varnishd'http_copyheader+0x1c1 [0x440791] 4423b1: /opt/sbin/varnishd'http_FilterFields+0xdc1 [0x4423b1] 429c4d: /opt/sbin/varnishd'cnt_fetch+0x11fd [0x429c4d] 42cf3a: /opt/sbin/varnishd'CNT_Session+0x78a [0x42cf3a] 44a7ef: /opt/sbin/varnishd'wrk_do_cnt_sess+0x1bf [0x44a7ef] 449d62: /opt/sbin/varnishd'wrk_thread_real+0x882 [0x449d62] 44a315: /opt/sbin/varnishd'wrk_thread+0x135 [0x44a315] fd7ff653acf5: /lib/amd64/libc.so.1'_thrp_setup+0x8d [0xfd7ff653acf5] sp = 397bf38 { fd = 56, id = 56, xid = 1848010778, client = 82.71.124.65:60324, step = STP_FETCH, handling = pass, err_code = 206, err_reason = (null), restarts = 0, esis = 0 ws = 397bfa8 { id = "sess", {s,f,r,e} = {397ccb0,+900,0,+65536}, }, http[req] = { ws = 397bfa8[sess] "GET", "/xml/rss.top20.000.xml", "HTTP/1.1", "Host: www.firebox.com", "User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-GB; rv:1.9.2) Gecko/20100115 Firefox/3.6", "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language: en-gb,en;q=0.5", "Accept-Encoding: gzip,deflate", "Accept-Charset:
Re: Child panics on OpenSolaris
In message <4b7bc0d5.7080...@schokola.de>, Nils Goroll writes: >> That's another piece of autocrap stuff that needs fixed... > >Do you want me to do it? Send patches! -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 p...@freebsd.org | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. ___ varnish-misc mailing list varnish-misc@projects.linpro.no http://projects.linpro.no/mailman/listinfo/varnish-misc
Re: Child panics on OpenSolaris
In message <282e72051002170230k7ae8e0c8hc2d5226ca9288...@mail.gmail.com>, Paul Wright writes: >I've compiled with the additional -mt flag, here's my current >compilation process: Please pull a brand new -trunk, I have added a check for errno working and I would like to make sure that passes for you also. >Child (14052) Panic message: Assert error in TCP_nonblocking(), tcp.c line = >172: > Condition((ioctl(sock, ((int)((uint32_t)(0x8000|(((sizeof >(int))&0xff)<<16)| ('f'<<8)|126))), &i)) =3D=3D 0) not true. >errno =3D 131 (Connection reset by peer) Now, _this_ errno I can actually belive, because that matches the packet traces we have seen, and it is a plausible scenario. The fact that Solaris docs does not mention ECONNRESET as a legal error return for ioctl is a minor detail in that context. The difference here is that the traditional BSD stack does not return ECONNRESET until you try to move data on the socket, giving you much simpler error checking on socket-state changes (ioctl, fcntl, setsockopt, getsockopt etc) So now that we have reached the root-cause, I need to go through and do complex error-checking for all the socket-state calls. Hopefully done later today... -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 p...@freebsd.org | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. ___ varnish-misc mailing list varnish-misc@projects.linpro.no http://projects.linpro.no/mailman/listinfo/varnish-misc
Re: Child panics on OpenSolaris
On 17 February 2010 10:57, Poul-Henning Kamp wrote: > In message <282e72051002170230k7ae8e0c8hc2d5226ca9288...@mail.gmail.com>, Paul > Wright writes: > >>I've compiled with the additional -mt flag, here's my current >>compilation process: > > Please pull a brand new -trunk, I have added a check for errno > working and I would like to make sure that passes for you also. I have pulled the latest revision and can confirm that unlink("/") failed as expected. I ran it briefly and saw this panic: Child (28554) died signal=6 Child (28554) Panic message: Assert error in http_copyheader(), cache_http.c line 647: Condition(n < to->shd) not true. thread = (cache-worker) ident = -smalloc,-hcritbit,poll Backtrace: 447adb: /opt/sbin/varnishd'pan_backtrace+0x1b [0x447adb] 447de5: /opt/sbin/varnishd'pan_ic+0x1c5 [0x447de5] 440791: /opt/sbin/varnishd'http_copyheader+0x1c1 [0x440791] 4423b1: /opt/sbin/varnishd'http_FilterFields+0xdc1 [0x4423b1] 429c4d: /opt/sbin/varnishd'cnt_fetch+0x11fd [0x429c4d] 42cf3a: /opt/sbin/varnishd'CNT_Session+0x78a [0x42cf3a] 44a7ef: /opt/sbin/varnishd'wrk_do_cnt_sess+0x1bf [0x44a7ef] 449d62: /opt/sbin/varnishd'wrk_thread_real+0x882 [0x449d62] 44a315: /opt/sbin/varnishd'wrk_thread+0x135 [0x44a315] fd7ff653acf5: /lib/amd64/libc.so.1'_thrp_setup+0x8d [0xfd7ff653acf5] sp = 18c9b78 { fd = 38, id = 38, xid = 789049749, client = 82.71.124.65:58190, step = STP_FETCH, handling = pass, err_code = 206, err_reason = (null), restarts = 0, esis = 0 ws = 18c9be8 { id = "sess", {s,f,r,e} = {18ca8f0,+884,0,+65536}, }, http[req] = { ws = 18c9be8[sess] "GET", "/xml/rss.top20.000.xml", "HTTP/1.1", "Host: www.firebox.com", "User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-GB; rv:1.9.2) Gecko/20100115 Firefox/3.6", "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language: en-gb,en;q=0.5", "Accept-Encoding: gzip,deflate", "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7", "Keep-Alive: 115", "Connection: keep-alive", "X-Moz: livebookmarks", "Cookie: user_session=; locale=company_id%3A0%2Ccurrency_id%3A0%2Clanguage_id%3A0%2Ccountry%3AUnited+Kingdom; xp_list=52%3D1; __utma=64137912.1325032789.1266225177.1266397371.1266400910.11; __utmc=64137912; __utmz=64137912.1266225177.1.1.utmccn=(direct)|utmcsr=(direct)|utmcmd=(none)", "Range: bytes=15136-", "If-Range: "14b800b-678d-47fc91860e540"", "Cache-Control: max-age=0", "X-Forwarded-For: 82.71.124.65", }, worker = fd7ff69fad30 { ws = fd7ff69fae78 { id = "wrk", {s,f,r,e} = {fd7ff69e8c40,+323,0,+65536}, }, http[bereq] = { ws = fd7ff69fae78[wrk] "GET", "/xml/rss.top20.000.xml", "HTTP/1.1", "Host: www.firebox.com", "User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-GB; rv:1.9.2) Gecko/20100115 Firefox/3.6", "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", "Accept-Language: en-gb,en;q=0.5", "Accept-Encoding: gzip,deflate", "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7", "X-Moz: livebookmarks", "Cookie: user_session=; locale=company_id%3A0%2Ccurrency_id%3A0%2Clanguage_id%3A0%2Ccountry%3AUnited+Kingdom; xp_list=52%3D1; __utma=64137912.1325032789.1266225177.1266397371.1266400910.11; __utmc=64137912; __utmz=64137912.1266225177.1.1.utmccn=(direct)|utmcsr=(direct)|utmcmd=(none)", "Range: bytes=15136-", "If-Range: "14b800b-678d-47fc91860e540"", "X-Forwarded-For: 82.71.124.65", "X-Varnish: 789049749", }, http[beresp] = { ws = fd7ff69fae78[wrk] "HTTP/1.1", "206", "Partial Content", "Date: Wed, 17 Feb 2010 10:48:30 GMT", "Server: Apache", "Last-Modified: Wed, 17 Feb 2010 10:13:01 GMT", "ETag: "14b800b-678d-47fc91860e540"", "Accept-Ranges: bytes", "Content-Length: 11373", "Content-Range: bytes 15136-26508/26509", "Connection: close", "Content-Type: application/xml", }, }, vcl = { srcname = { "input", "Default", }, }, obj = 6a0 { xid = 789049749, ws = 6c0 { id = "obj", {s,f,r,e} = {8a5,8a5,0,+220}, }, http[obj] = { ws = 6c0[obj] "HTTP/1.1", "206", "Partial Content", "Date: Wed, 17 Feb 2010 10:48:30 GMT", "Server: Apache", "Last-Modified: Wed, 17 Feb 2010 10:13:01 GMT", "ETag: "14b800b-678d-47fc91860e540"", }, len = 0, store = { }, }, }, >>Child (14052) Panic message: Assert error in TCP_nonblocking(), tcp.c line = >>172: >> Condition((ioctl(sock, ((int)((uint32_t)(0x8000|(((sizeof >>(int))&0xff)<<16)| ('f'<<8)|126))), &i)) =3D=3D 0) not true. >>errno =3D 131 (Connection reset by
Re: Child panics on OpenSolaris
In message <282e72051002170310g565a26d8s98170fcebb9b2...@mail.gmail.com>, Paul Wright writes: >On 17 February 2010 10:57, Poul-Henning Kamp wrote: >I have pulled the latest revision and can confirm that unlink("/") >failed as expected. I ran it briefly and saw this panic: > >Child (28554) died signal=3D6 >Child (28554) Panic message: Assert error in http_copyheader(), >cache_http.c line 647: > Condition(n < to->shd) not true. Sounds like you got too many headers from the backend, if so, consider increasing the parameter 'http_headers' -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 p...@freebsd.org | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. ___ varnish-misc mailing list varnish-misc@projects.linpro.no http://projects.linpro.no/mailman/listinfo/varnish-misc
Re: Child panics on OpenSolaris
On 17 February 2010 11:14, Poul-Henning Kamp wrote: > In message <282e72051002170310g565a26d8s98170fcebb9b2...@mail.gmail.com>, Paul > Wright writes: >>On 17 February 2010 10:57, Poul-Henning Kamp wrote: > >>I have pulled the latest revision and can confirm that unlink("/") >>failed as expected. I ran it briefly and saw this panic: >> >>Child (28554) died signal=3D6 >>Child (28554) Panic message: Assert error in http_copyheader(), >>cache_http.c line 647: >> Condition(n < to->shd) not true. > > Sounds like you got too many headers from the backend, if so, > consider increasing the parameter 'http_headers' I pulled r4574 with the new TCP_Assert() changes, I increased http_headers to 128 and it ran like a champ for 25 minutes. Then a client made a Range request resulting in a "Partial Content" response which produced the panic included below. I've boiled down the request to a simple curl command that consistently panics the child: curl -v -H "Range: bytes=0-1" -H "Cookie: foo" http://192.168.0.35/robots.txt (The cookie header ensures the request is passed through to the backend. Curl will respond with "Failure when receiving data from the peer".) I note that ticket #466 mentions that Varnish doesn't support Range requests and r3910 should stop such requests passing though - is this expected behaviour? Should I stop the backend from generating partial content responses? Or should there be a check for such requests and they should be piped instead of passed? Paul. Child (21350) died signal=6 Child (21350) Panic message: Assert error in http_copyheader(), cache_http.c line 647: Condition(n < to->shd) not true. thread = (cache-worker) ident = -smalloc,-hcritbit,poll Backtrace: 447b2b: /opt/sbin/varnishd'pan_backtrace+0x1b [0x447b2b] 447e35: /opt/sbin/varnishd'pan_ic+0x1c5 [0x447e35] 4407e1: /opt/sbin/varnishd'http_copyheader+0x1c1 [0x4407e1] 442401: /opt/sbin/varnishd'http_FilterFields+0xdc1 [0x442401] 429c9d: /opt/sbin/varnishd'cnt_fetch+0x11fd [0x429c9d] 42cf8a: /opt/sbin/varnishd'CNT_Session+0x78a [0x42cf8a] 44a83f: /opt/sbin/varnishd'wrk_do_cnt_sess+0x1bf [0x44a83f] 449db2: /opt/sbin/varnishd'wrk_thread_real+0x882 [0x449db2] 44a365: /opt/sbin/varnishd'wrk_thread+0x135 [0x44a365] fd7ff653acf5: /lib/amd64/libc.so.1'_thrp_setup+0x8d [0xfd7ff653acf5] sp = 1140e08 { fd = 77, id = 77, xid = 133352112, client = 66.235.125.136:38373, step = STP_FETCH, handling = pass, err_code = 206, err_reason = (null), restarts = 0, esis = 0 ws = 1140e78 { id = "sess", {s,f,r,e} = {1142400,+388,0,+65536}, }, http[req] = { ws = 1140e78[sess] "GET", "/robots.txt", "HTTP/1.1", "TE: deflate,gzip;q=0.3", "Connection: TE, close", "From: rob...@pronto.com", "Host: www.firebox.com", "Range: bytes=0-19", "User-Agent: RedCarpet/1.4 (http://www.pronto.com/robots.html)", "Cookie: CFTOKEN=aeed9fb6cc27a0f7-8F680632-1517-0C84-6887BD4A85FF98A8; CFID=128110309", "Cookie2: $Version="1"", "X-Forwarded-For: 66.235.125.136", }, worker = fd7ff7813d30 { ws = fd7ff7813e78 { id = "wrk", {s,f,r,e} = {fd7ff7801c40,+528,0,+65536}, }, http[bereq] = { ws = fd7ff7813e78[wrk] "GET", "/robots.txt", "HTTP/1.1", "From: rob...@pronto.com", "Host: www.firebox.com", "Range: bytes=0-19", "User-Agent: RedCarpet/1.4 (http://www.pronto.com/robots.html)", "Cookie: CFTOKEN=aeed9fb6cc27a0f7-8F680632-1517-0C84-6887BD4A85FF98A8; CFID=128110309", "Cookie2: $Version="1"", "X-Forwarded-For: 66.235.125.136", "X-Varnish: 133352112", }, http[beresp] = { ws = fd7ff7813e78[wrk] "HTTP/1.1", "206", "Partial Content", "Date: Wed, 17 Feb 2010 16:25:42 GMT", "Server: Apache", "Last-Modified: Fri, 11 Dec 2009 18:15:28 GMT", "ETag: "19c0073-de-47a77e88b9000"", "Accept-Ranges: bytes", "Content-Length: 222", "Content-Range: bytes 0-221/222", "Connection: close", "Content-Type: text/plain", }, }, vcl = { srcname = { "input", "Default", }, }, obj = 8c1d8a0 { xid = 133352112, ws = 8c1d8c0 { id = "obj", {s,f,r,e} = {8c1daa5,8c1daa5,0,+213}, }, http[obj] = { ws = 8c1d8c0[obj] "HTTP/1.1", "206", "Partial Content", "Date: Wed, 17 Feb 2010 16:25:42 GMT", "Server: Apache", "Last-Modified: Fri, 11 Dec 2009 18:15:28 GMT", "ETag: "19c0073-de-47a77e88b9000"", }, len = 0, store = { }, }, }, ___ varnish-misc mailing list varnish-misc@projects.linpro.no http://projects.linpro.no/mailman/listinfo/varnish-misc
Re: Child panics on OpenSolaris
In message <282e72051002170852g532cd8acqb570e6db4b829...@mail.gmail.com>, Paul Wright writes: >(The cookie header ensures the request is passed through to the >backend. Curl will respond with "Failure when receiving data from the >peer".) How on earth did you get the Range header to be passed through to the backend ? It should have been filtered out... Can you capture a varnishlog for this one ? Poul-Henning -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 p...@freebsd.org | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. ___ varnish-misc mailing list varnish-misc@projects.linpro.no http://projects.linpro.no/mailman/listinfo/varnish-misc
Re: Child panics on OpenSolaris
On 17 February 2010 18:46, Poul-Henning Kamp wrote: > In message <282e72051002170852g532cd8acqb570e6db4b829...@mail.gmail.com>, Paul > Wright writes: > >>(The cookie header ensures the request is passed through to the >>backend. Curl will respond with "Failure when receiving data from the >>peer".) > > How on earth did you get the Range header to be passed through to > the backend ? It should have been filtered out... > > Can you capture a varnishlog for this one ? Here's the output: 0 CLI - Rd ping 0 CLI - Wr 200 PONG 1266486712 1.0 5 SessionOpen c 192.168.0.30 60069 :80 5 ReqStart c 192.168.0.30 60069 1736091917 5 RxRequestc GET 5 RxURLc /robots.txt 5 RxProtocol c HTTP/1.1 5 RxHeader c User-Agent: curl/7.19.4 (i386-pc-solaris2.11) libcurl/7.19.4 OpenSSL/0.9.8a zlib/1.2.3 libidn/1.9 5 RxHeader c Host: 192.168.0.35 5 RxHeader c Accept: */* 5 RxHeader c Range: bytes=0-1 5 RxHeader c Cookie: foo 5 VCL_call c recv 5 VCL_return c pass 5 VCL_call c hash 5 VCL_return c hash 5 VCL_call c pass 5 VCL_return c pass 9 BackendOpen b martin 192.168.0.35 54372 192.168.0.31 80 5 Backend c 9 dynamic_director martin 9 TxRequestb GET 9 TxURLb /robots.txt 9 TxProtocol b HTTP/1.1 9 TxHeader b User-Agent: curl/7.19.4 (i386-pc-solaris2.11) libcurl/7.19.4 OpenSSL/0.9.8a zlib/1.2.3 libidn/1.9 9 TxHeader b Host: 192.168.0.35 9 TxHeader b Accept: */* 9 TxHeader b Range: bytes=0-1 9 TxHeader b Cookie: foo 9 TxHeader b X-Forwarded-For: 192.168.0.30 9 TxHeader b X-Varnish: 1736091917 0 WorkThread - fd7ff9e00d30 start 0 CLI - Rd vcl.load "boot" ./vcl.ORk8t3RP.so 0 CLI - Wr 200 Loaded "./vcl.ORk8t3RP.so" as "boot" 0 CLI - Rd vcl.use "boot" 0 CLI - Wr 200 0 CLI - Rd start 0 Debug- "Acceptor poll space increased to 512" 0 Debug- "Acceptor is poll" 0 CLI - Wr 200 0 WorkThread - fd7ff9206d30 start 0 WorkThread - fd7ff9007d30 start 0 WorkThread - fd7ff8e08d30 start 0 WorkThread - fd7ff8c09d30 start 0 WorkThread - fd7ff8a0ad30 start 0 WorkThread - fd7ff880bd30 start 0 WorkThread - fd7ff860cd30 start 0 WorkThread - fd7ff840dd30 start 0 WorkThread - fd7ff820ed30 start The VCL used hasn't changed substantially from the original included at the beginning of this thread, only vcl_recv is defined and I don't think it's doing anything stupid. sub vcl_recv { // normalise static requests if ( req.http.host ~ "media([0-9]+).firebox.com" ) { set req.http.host = "media.firebox.com"; } //catch any relative image URLs that haven't been repointed to media if ( req.http.host != "media.firebox.com" && ( req.url ~ "^/pic/.+" || req.url ~ "^/i/.+" ) ) { set req.http.host = "media.firebox.com"; } // split traffic based on host name if ( req.http.host == "media.firebox.com" ) { remove req.http.cookie; set req.backend = static_director; } else { // dynamic content that should be cached (ie no cookies required) // these patterns should match up with settings.inc on the PHP side if ( req.url ~ "^/styles/(.+).css$" || req.url ~ "^/js/(.+).js$" ) { remove req.http.cookie; } //default all requests to the dynamic backend set req.backend = dynamic_director; } } This is on r4574, compiled with Sun Studio. Cheers, Paul. Child (2570) died signal=6 Child (2570) Panic message: Assert error in http_copyheader(), cache_http.c line 647: Condition(n < to->shd) not true. thread = (cache-worker) ident = -smalloc,-hcritbit,poll Backtrace: 447b2b: /opt/sbin/varnishd'pan_backtrace+0x1b [0x447b2b] 447e35: /opt/sbin/varnishd'pan_ic+0x1c5 [0x447e35] 4407e1: /opt/sbin/varnishd'http_copyheader+0x1c1 [0x4407e1] 442401: /opt/sbin/varnishd'http_FilterFields+0xdc1 [0x442401] 429c9d: /opt/sbin/varnishd'cnt_fetch+0x11fd [0x429c9d] 42cf8a: /opt/sbin/varnishd'CNT_Session+0x78a [0x42cf8a] 44a83f: /opt/sbin/varnishd'wrk_do_cnt_sess+0x1bf [0x44a83f] 449db2: /opt/sbin/varnishd'wrk_thread_real+0x882 [0x449db2] 44a365: /opt/sbin/varnishd'wrk_thread+0x135 [0x44a365] fd7ff653acf5: /lib/amd64/libc.so.1'_thrp_setup+0x8d [0xfd7ff653acf5] sp = 5eec58 { fd = 5, id = 5, xid = 1315520023, client = 192.168.0.30:41925, step = STP_FETCH, handling = pass, err_code = 206, err_reason = (null), restarts = 0, esis = 0 ws = 5eecc8 { id = "sess", {s,f,r,e} = {5f0250,+257,0,+65536}, }, http[req] = { ws = 5eecc8[sess
Re: Child panics on OpenSolaris
On 18 February 2010 10:08, Paul Wright wrote: > On 17 February 2010 18:46, Poul-Henning Kamp wrote: >> In message <282e72051002170852g532cd8acqb570e6db4b829...@mail.gmail.com>, >> Paul >> Wright writes: >> >>>(The cookie header ensures the request is passed through to the >>>backend. Curl will respond with "Failure when receiving data from the >>>peer".) >> >> How on earth did you get the Range header to be passed through to >> the backend ? It should have been filtered out... Just as a sanity check I compiled r4574 on a Fedora x64 machine with gcc and it appears that Range requests still make it through to the backend, but no panics: 11 SessionOpen c 192.168.0.169 60860 192.168.0.87:80 11 ReqStart c 192.168.0.169 60860 1989537926 11 RxRequestc GET 11 RxURLc /robots.txt 11 RxProtocol c HTTP/1.1 11 RxHeader c User-Agent: curl/7.19.4 (universal-apple-darwin10.0) libcurl/7.19.4 OpenSSL/0.9.8l zlib/1.2.3 11 RxHeader c Host: 192.168.0.87 11 RxHeader c Accept: */* 11 RxHeader c Range: bytes=0-1 11 RxHeader c Cookie: foo 11 VCL_call c recv 11 VCL_return c pass 11 VCL_call c hash 11 VCL_return c hash 11 VCL_call c pass 11 VCL_return c pass 12 BackendClose - paul 12 BackendOpen b paul 192.168.0.83 54663 192.168.0.83 80 11 Backend c 12 dynamic_director paul 12 TxRequestb GET 12 TxURLb /robots.txt 12 TxProtocol b HTTP/1.1 12 TxHeader b User-Agent: curl/7.19.4 (universal-apple-darwin10.0) libcurl/7.19.4 OpenSSL/0.9.8l zlib/1.2.3 12 TxHeader b Host: 192.168.0.87 12 TxHeader b Accept: */* 12 TxHeader b Range: bytes=0-1 12 TxHeader b Cookie: foo 12 TxHeader b X-Forwarded-For: 192.168.0.169 12 TxHeader b X-Varnish: 1989537926 12 RxProtocol b HTTP/1.1 12 RxStatus b 206 12 RxResponse b Partial Content 12 RxHeader b Date: Thu, 18 Feb 2010 12:00:51 GMT 12 RxHeader b Server: Apache 12 RxHeader b Last-Modified: Mon, 14 Dec 2009 11:00:28 GMT 12 RxHeader b ETag: "a2ad-de-47aae2e6fcaa0" 12 RxHeader b Accept-Ranges: bytes 12 RxHeader b Content-Length: 2 12 RxHeader b Content-Range: bytes 0-1/222 12 RxHeader b Content-Type: text/plain 12 RxHeader b X-Pad: avoid browser bug 11 TTL c 1989537926 RFC 120 1266494451 0 0 0 0 11 VCL_call c fetch 11 VCL_return c pass 11 ObjProtocol c HTTP/1.1 11 ObjStatusc 206 11 ObjResponse c Partial Content 11 ObjHeaderc Date: Thu, 18 Feb 2010 12:00:51 GMT 11 ObjHeaderc Server: Apache 11 ObjHeaderc Last-Modified: Mon, 14 Dec 2009 11:00:28 GMT 11 ObjHeaderc ETag: "a2ad-de-47aae2e6fcaa0" 11 ObjHeaderc Content-Type: text/plain 11 ObjHeaderc X-Pad: avoid browser bug 12 BackendReuse b paul 11 Length c 2 11 VCL_call c deliver 11 VCL_return c deliver 11 TxProtocol c HTTP/1.1 11 TxStatus c 206 11 TxResponse c Partial Content 11 TxHeader c Server: Apache 11 TxHeader c Last-Modified: Mon, 14 Dec 2009 11:00:28 GMT 11 TxHeader c ETag: "a2ad-de-47aae2e6fcaa0" 11 TxHeader c Content-Type: text/plain 11 TxHeader c X-Pad: avoid browser bug 11 TxHeader c Content-Length: 2 11 TxHeader c Date: Thu, 18 Feb 2010 12:00:51 GMT 11 TxHeader c X-Varnish: 1989537926 11 TxHeader c Age: 0 11 TxHeader c Via: 1.1 varnish 11 TxHeader c Connection: keep-alive 11 ReqEnd c 1989537926 1266494451.053643227 1266494451.055507183 0.000101566 0.001832485 0.31471 11 SessionClose c EOF 11 ReqEnd c 0 1266494451.055849552 1266494451.055849552 0.000342369 0.0 0.0 11 StatSess c 192.168.0.169 60860 0 1 1 0 1 1 306 2 Paul. ___ varnish-misc mailing list varnish-misc@projects.linpro.no http://projects.linpro.no/mailman/listinfo/varnish-misc
Re: Child panics on OpenSolaris
On 17 February 2010 18:46, Poul-Henning Kamp wrote: > In message <282e72051002170852g532cd8acqb570e6db4b829...@mail.gmail.com>, Paul > Wright writes: > >>(The cookie header ensures the request is passed through to the >>backend. Curl will respond with "Failure when receiving data from the >>peer".) > > How on earth did you get the Range header to be passed through to > the backend ? It should have been filtered out... I added the following to the top of vcl_recv() if ( req.http.Range ) { unset req.http.Range; } Varnish ran happily until tripping over the same bug (#649) as victori (on r4573). A couple of examples follow. New personal best of 1 hour, 20 minutes uptime! Paul. Child (26791) died signal=6 Child (26791) Panic message: Assert error in VCA_Prep(), cache_acceptor.c line 184: Condition(TCP_Check(setsockopt(sp->fd, 0x, 0x0080, &linger, sizeof linger))) not true. errno = 22 (Invalid argument) thread = (cache-worker) ident = -smalloc,-hcritbit,poll Backtrace: 447b2b: /opt/sbin/varnishd'pan_backtrace+0x1b [0x447b2b] 447e35: /opt/sbin/varnishd'pan_ic+0x1c5 [0x447e35] 41862a: /opt/sbin/varnishd'VCA_Prep+0x29a [0x41862a] 42a466: /opt/sbin/varnishd'cnt_first+0xb6 [0x42a466] 42cd6a: /opt/sbin/varnishd'CNT_Session+0x56a [0x42cd6a] 44a83f: /opt/sbin/varnishd'wrk_do_cnt_sess+0x1bf [0x44a83f] 449db2: /opt/sbin/varnishd'wrk_thread_real+0x882 [0x449db2] 44a365: /opt/sbin/varnishd'wrk_thread+0x135 [0x44a365] fd7ff653acf5: /lib/amd64/libc.so.1'_thrp_setup+0x8d [0xfd7ff653acf5] fd7ff653afb0: /lib/amd64/libc.so.1'_lwp_start+0x0 [0xfd7ff653afb0] sp = 10e8ae08 { fd = 73, id = 73, xid = 0, client = 94.196.213.123:50084, step = STP_FIRST, handling = deliver, restarts = 0, esis = 0 ws = 10e8ae78 { id = "sess", {s,f,r,e} = {10e8c400,+21,0,+65536}, }, http[req] = { ws = 10e8ae78[sess] "", "/pic/p2031b.jpg", "HTTP/1.1", "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 (.NET CLR 3.5.30729)", "Accept: image/png,image/*;q=0.8,*/*;q=0.5", "Accept-Language: en-us,en;q=0.5", "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7", "Keep-Alive: 300", "Connection: keep-alive", "Referer: http://www.affordablegiftsolutions.com/index.php?option=com_cmsshopbuilder&view=category&id=55&Itemid=5&limitstart=400";, "host: media.firebox.com", "X-Forwarded-For: 67.234.17.245", }, worker = fd7ff2c96d30 { ws = fd7ff2c96e78 { id = "wrk", {s,f,r,e} = {fd7ff2c84c40,fd7ff2c84c40,0,+65536}, }, }, }, Child cleanup complete child (2199) Started Child (2199) said Closed fds: 3 4 5 9 10 12 13 Child (2199) said Child starts Child (8098) died signal=6 Child (8098) Panic message: Assert error in VCA_Prep(), cache_acceptor.c line 184: Condition(TCP_Check(setsockopt(sp->fd, 0x, 0x0080, &linger, sizeof linger))) not true. errno = 22 (Invalid argument) thread = (cache-worker) ident = -smalloc,-hcritbit,poll Backtrace: 447b1b: /opt/sbin/varnishd'pan_backtrace+0x1b [0x447b1b] 447e25: /opt/sbin/varnishd'pan_ic+0x1c5 [0x447e25] 41862a: /opt/sbin/varnishd'VCA_Prep+0x29a [0x41862a] 42a456: /opt/sbin/varnishd'cnt_first+0xb6 [0x42a456] 42cd5a: /opt/sbin/varnishd'CNT_Session+0x56a [0x42cd5a] 44a82f: /opt/sbin/varnishd'wrk_do_cnt_sess+0x1bf [0x44a82f] 449da2: /opt/sbin/varnishd'wrk_thread_real+0x882 [0x449da2] 44a355: /opt/sbin/varnishd'wrk_thread+0x135 [0x44a355] fd7ff653acf5: /lib/amd64/libc.so.1'_thrp_setup+0x8d [0xfd7ff653acf5] fd7ff653afb0: /lib/amd64/libc.so.1'_lwp_start+0x0 [0xfd7ff653afb0] sp = ee89a8 { fd = 17, id = 17, xid = 0, client = 217.111.162.2:58987, step = STP_FIRST, handling = deliver, restarts = 0, esis = 0 ws = ee8a18 { id = "sess", {s,f,r,e} = {ee9fa0,+20,0,+65536}, }, http[req] = { ws = ee8a18[sess] "", "/v/2000/2381/98x74.jpg", "HTTP/1.1", "Accept: */*", "Referer: http://www.firebox.com/product/2480/LEGO-Mindstorms-NXT-2.0?via=cat";, "Accept-Language: en-gb", "UA-CPU: x86", "User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; SLCC1; .NET CLR 2.0.50727; .NET CLR 3.0.04506; InfoPath.2)", "Host: media.firebox.com", "Cache-Control: max-stale=0", "Connection: Keep-Alive", "X-BlueCoat-Via: A1EDA8423E101D0C", "X-Forwarded-For: 89.189.78.18, 82.114.160.35", }, worker = fd7ff83edd30 { ws = fd7ff83ede78 { id = "wrk", {s,f,r,e} = {fd7ff83dbc40,fd7ff83dbc40,0,+65536}, }, }, }, ___ varnish-misc mailing list varnish-misc@projects.linpro.no http://projects.linpro.no/mailman/listinfo/varnish-misc
Re: Child panics on OpenSolaris
On 19 February 2010 17:45, Paul Wright wrote: > On 17 February 2010 18:46, Poul-Henning Kamp wrote: >> In message <282e72051002170852g532cd8acqb570e6db4b829...@mail.gmail.com>, >> Paul >> Wright writes: >> >>>(The cookie header ensures the request is passed through to the >>>backend. Curl will respond with "Failure when receiving data from the >>>peer".) >> >> How on earth did you get the Range header to be passed through to >> the backend ? It should have been filtered out... > > I added the following to the top of vcl_recv() > > if ( req.http.Range ) { > unset req.http.Range; > } > > Varnish ran happily until tripping over the same bug (#649) as victori > (on r4573). For anyone else following along I've now had varnish running for over 5 hours without issue, here are the things I found out: * add the Range unsetting code to ensure that such requests don't make it through to the back end * remove the TCP_Assert() that wraps the setsockopt() call on line 184 of bin/varnishd/cache_acceptor.c * compile with gcc, not Sun Studio (there's still some sort of funniness with TCP_(non)blocking() ) CC=/usr/bin/gcc CFLAGS="-O3 -L/lib/amd64 -pthreads -m64 -fomit-frame-pointer" LDFLAGS="-lumem -pthreads" ./configure --prefix=/opt * pass the right flags through to gcc when launching vanishd newtask -p highfile /opt/sbin/varnishd -f /opt/etc/varnish/firebox.vcl -F \ -p 'cc_command=/usr/bin/gcc -fpic -shared -m64 -o %o %s' \ -T 127.0.0.1:9001 \ -s malloc,2G \ -p sess_timeout=5s \ -p max_restarts=12 \ -p waiter=poll \ -p connect_timeout=0s \ -p sess_workspace=65536 * keep checking http://letsgetdugg.com/2009/12/04/varnish-on-solaris/ for hints and suggestions Many thanks to Poul-Henning Kamp and Victor Igumnov for getting us to this point. Cheers, Paul. ___ varnish-misc mailing list varnish-misc@projects.linpro.no http://projects.linpro.no/mailman/listinfo/varnish-misc
Re: Child panics on OpenSolaris
On 22 February 2010 18:02, Paul Wright wrote: ... > For anyone else following along I've now had varnish running for over > 5 hours without issue, here are the things I found out: > > * add the Range unsetting code to ensure that such requests don't make > it through to the back end > * remove the TCP_Assert() that wraps the setsockopt() call on line 184 > of bin/varnishd/cache_acceptor.c > * compile with gcc, not Sun Studio (there's still some sort of > funniness with TCP_(non)blocking() ) > > CC=/usr/bin/gcc CFLAGS="-O3 -L/lib/amd64 -pthreads -m64 > -fomit-frame-pointer" LDFLAGS="-lumem -pthreads" ./configure > --prefix=/opt > > * pass the right flags through to gcc when launching vanishd > > newtask -p highfile /opt/sbin/varnishd -f /opt/etc/varnish/firebox.vcl -F \ > -p 'cc_command=/usr/bin/gcc -fpic -shared -m64 -o %o %s' \ > -T 127.0.0.1:9001 \ > -s malloc,2G \ > -p sess_timeout=5s \ > -p max_restarts=12 \ > -p waiter=poll \ > -p connect_timeout=0s \ > -p sess_workspace=65536 > > * keep checking http://letsgetdugg.com/2009/12/04/varnish-on-solaris/ > for hints and suggestions Latest update, we're seeing "Connection refused" panics like the following: Child (1955) died signal=6 Child (1955) Panic message: Assert error in TCP_blocking(), tcp.c line 164: Condition(TCP_Check(j)) not true. errno = 146 (Connection refused) thread = (cache-worker) ident = -smalloc,-hcritbit,poll Backtrace: 42fb51: /opt/sbin/varnishd'pan_ic+0xb1 [0x42fb51] 2f: [0x2f] sp = 13570018 { fd = 47, id = 47, xid = 0, client = ?.?.?.?:?, step = STP_FIRST, handling = deliver, restarts = 0, esis = 0 ws = 13570088 { id = "sess", {s,f,r,e} = {13570d90,13570d90,0,+65536}, }, http[req] = { ws = 13570088[sess] "", "/i/nl/all/0.gif", "HTTP/1.1", "Host: media.firebox.com", "User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_2; en-us) AppleWebKit/531.21.8 (KHTML, like Gecko)", "Accept: */*", "Accept-Language: en-us", "If-Modified-Since: Tue, 22 Sep 2009 15:34:33 GMT", "If-None-Match: "16a8070-2b-4742c55c1f440"", "Connection: keep-alive", "X-Forwarded-For: 85.189.102.193", }, worker = fd7ff7bf1d80 { ws = fd7ff7bf1ec8 { id = "wrk", {s,f,r,e} = {fd7ff7bdfcb0,fd7ff7bdfcb0,0,+65536}, }, }, }, Interesting things to note, we're confident that this request is a cache hit which rules out the backend (handling = deliver). Also the client address appears to have been mangled: client = ?.?.?.?:?, Would this cause varnish to attempt opening a connection which is then refused? As a workaround would it be advisable to add a clause to TCP_Check (in include/libvarnish.h) to skip over errno 146 (Connection refused) along with the existing ECONNRESET and ENOTCONN clauses? Cheers, Paul. ___ varnish-misc mailing list varnish-misc@projects.linpro.no http://projects.linpro.no/mailman/listinfo/varnish-misc