Varnish 2.0.6 nuking all my objects?
Howdy,

We are finally getting around to upgrading to the latest version of varnish and are running into quite a weird problem. Everything works fine for a bit (1+ day), then all of a sudden Varnish starts nuking all of the objects from the cache:

About 4 hours ago there were 1 million objects in the cache, now there are just about 172k. This looks a bit weird to me:

sms_nbytes 18446744073709548694 . SMS outstanding bytes

Here are the options I am passing to varnishd:

/usr/local/sbin/varnishd -a 0.0.0.0: -f /etc/varnish/varnish.vcl -P /var/run/varnishd.pid -T 0.0.0.0:47200 -t 600 -w 1,200,300 -p thread_pools 4 -p thread_pool_add_delay 2 -p lru_interval 60 -h classic,59 -p obj_workspace 4096 -s file,/varnish/cache,150G

/varnish is 2 x 80GB Intel X-25M SSDs in a software RAID 0 array. OS is Debian Lenny 64-bit. There is plenty of space:

/dev/md0 149G 52G 98G 35% /varnish

Here is the output of varnishstat -1

uptime                            134971          .   Child uptime
client_conn                     12051037      89.29   Client connections accepted
client_drop                            0       0.00   Connection dropped, no sess
client_req                      12048672      89.27   Client requests received
cache_hit                       10161272      75.28   Cache hits
cache_hitpass                     133244       0.99   Cache hits for pass
cache_miss                       1750857      12.97   Cache misses
backend_conn                     1824594      13.52   Backend conn. success
backend_unhealthy                      0       0.00   Backend conn. not attempted
backend_busy                           0       0.00   Backend conn. too many
backend_fail                        3644       0.03   Backend conn. failures
backend_reuse                          0       0.00   Backend conn. reuses
backend_toolate                        0       0.00   Backend conn. was closed
backend_recycle                        0       0.00   Backend conn. recycles
backend_unused                         0       0.00   Backend conn. unused
fetch_head                          5309       0.04   Fetch head
fetch_length                     1816422      13.46   Fetch with Length
fetch_chunked                          0       0.00   Fetch chunked
fetch_eof                              0       0.00   Fetch EOF
fetch_bad                              0       0.00   Fetch had bad headers
fetch_close                            0       0.00   Fetch wanted close
fetch_oldhttp                          0       0.00   Fetch pre HTTP/1.1 closed
fetch_zero                             0       0.00   Fetch zero len
fetch_failed                          16       0.00   Fetch failed
n_srcaddr                              0          .   N struct srcaddr
n_srcaddr_act                          0          .   N active struct srcaddr
n_sess_mem                           578          .   N struct sess_mem
n_sess                               414          .   N struct sess
n_object                          172697          .   N struct object
n_objecthead                      173170          .   N struct objecthead
n_smf                             471310          .   N struct smf
n_smf_frag                         62172          .   N small free smf
n_smf_large                        67978          .   N large free smf
n_vbe_conn          18446744073709551611          .   N struct vbe_conn
n_bereq                              315          .   N struct bereq
n_wrk                                 76          .   N worker threads
n_wrk_create                        3039       0.02   N worker threads created
n_wrk_failed                           0       0.00   N worker threads not created
n_wrk_max                              0       0.00   N worker threads limited
n_wrk_queue                            0       0.00   N queued work requests
n_wrk_overflow                     25136       0.19   N overflowed work requests
n_wrk_drop                             0       0.00   N dropped work requests
n_backend                              4          .   N backends
n_expired                         771687          .   N expired objects
n_lru_nuked                       744693          .   N LRU nuked objects
n_lru_saved                            0          .   N LRU saved objects
n_lru_moved                      8675178          .   N LRU moved objects
n_deathrow                             0          .   N objects on deathrow
losthdr                               25       0.00   HTTP header overflows
n_objsendfile                          0       0.00   Objects sent with sendfile
n_objwrite                      11749415      87.05   Objects sent with write
n_objoverflow                          0       0.00   Objects overflowing workspace
s_sess                          12051007      89.29   Total Sessions
s_req                           12050184      89.28   Total Requests
s_pipe                              2661       0.02   Total pipe
s_pass                            134858       1.00   Total pass
s_fetch                          1821721      13.50   Total fetch
s_hdrbytes                    3932274894   29134.22   Total header bytes
s_bodybytes                 894452020319 6626994.10   Total body bytes
sess_closed                     12050925      89.29   Session Closed
sess_pipeline                          0       0.00   Session Pipeline
sess_readahead                         0       0.00   Session Read Ahead
sess_linger                            0       0.00   Session Linger
sess_herd                            160       0.00   Session herd
shm_records 610011852
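The two very large values in the output above (sms_nbytes and n_vbe_conn) sit just below 2^64, which is what an unsigned 64-bit counter looks like after it has been decremented below zero. Nobody in the thread confirms that reading, so treat it as an assumption. A minimal Python sketch that parses `varnishstat -1` style lines and shows such counters as signed values; the function names are made up for this illustration and are not part of Varnish:

```python
import re

U64 = 1 << 64          # 18446744073709551616
SIGN_BIT = 1 << 63


def parse_varnishstat(text):
    """Parse `name value ...` lines into a {name: int} dict."""
    counters = {}
    for line in text.splitlines():
        match = re.match(r"\s*(\w+)\s+(\d+)\b", line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def as_signed64(value):
    """Reinterpret an unsigned 64-bit value as two's-complement signed."""
    return value - U64 if value >= SIGN_BIT else value


def wrapped_counters(counters):
    """Return only the counters that look like they went below zero."""
    return {name: as_signed64(value)
            for name, value in counters.items()
            if value >= SIGN_BIT}


# Three lines copied from the varnishstat output in the post.
SAMPLE = """\
sms_nbytes 18446744073709548694 . SMS outstanding bytes
n_vbe_conn 18446744073709551611 . N struct vbe_conn
n_object 172697 . N struct object
"""

if __name__ == "__main__":
    for name, value in wrapped_counters(parse_varnishstat(SAMPLE)).items():
        print(name, value)
```

Read this way, sms_nbytes would be -2922 and n_vbe_conn would be -5, which points at an accounting problem in those counters rather than at real usage.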
Re: Varnish 2.0.6 nuking all my objects?
I have seen this happen.

I have a similar hardware setup, though I changed the multi-ssd raid into 3 separate cache file arguments.

We had roughly 240GB storage space total, after about 2-3 weeks and sm_bfree reached ~20GB. lru_nuked started incrementing, sm_bfree climbed to ~60GB, but lru_nuking never stopped.

On Wed, Feb 24, 2010 at 8:15 PM, Barry Abrahamson wrote:
> Howdy,
>
> We are finally getting around to upgrading to the latest version of varnish
> and are running into quite a weird problem. Everything works fine for a bit
> (1+day) , then all of a sudden Varnish starts nuking all of the objects from
> the cache:
>
> About 4 hours ago there were 1 million objects in the cache, now there are
> just about 172k. This looks a bit weird to me:
>
> sms_nbytes 18446744073709548694 . SMS outstanding bytes
>
> Here are the options I am passing to varnishd:
>
> /usr/local/sbin/varnishd -a 0.0.0.0: -f /etc/varnish/varnish.vcl -P
> /var/run/varnishd.pid -T 0.0.0.0:47200 -t 600 -w 1,200,300 -p thread_pools 4
> -p thread_pool_add_delay 2 -p lru_interval 60 -h classic,59 -p
> obj_workspace 4096 -s file,/varnish/cache,150G
>
> /varnish is 2 x 80GB Intel X-25M SSDs in a software RAID 0 array. OS is
> Debian Lenny 64-bit. There is plenty of space:
>
> /dev/md0 149G 52G 98G 35% /varnish
>
> Here is the output of varnishstat -1
>
> uptime 134971 . Child uptime
> client_conn 12051037 89.29 Client connections accepted
> client_drop 0 0.00 Connection dropped, no sess
> client_req 12048672 89.27 Client requests received
> cache_hit 10161272 75.28 Cache hits
> cache_hitpass 133244 0.99 Cache hits for pass
> cache_miss 1750857 12.97 Cache misses
> backend_conn 1824594 13.52 Backend conn. success
> backend_unhealthy 0 0.00 Backend conn. not attempted
> backend_busy 0 0.00 Backend conn. too many
> backend_fail 3644 0.03 Backend conn. failures
> backend_reuse 0 0.00 Backend conn. reuses
> backend_toolate 0 0.00 Backend conn. was closed
> backend_recycle 0 0.00 Backend conn. recycles
> backend_unused 0 0.00 Backend conn. unused
> fetch_head 5309 0.04 Fetch head
> fetch_length 1816422 13.46 Fetch with Length
> fetch_chunked 0 0.00 Fetch chunked
> fetch_eof 0 0.00 Fetch EOF
> fetch_bad 0 0.00 Fetch had bad headers
> fetch_close 0 0.00 Fetch wanted close
> fetch_oldhttp 0 0.00 Fetch pre HTTP/1.1 closed
> fetch_zero 0 0.00 Fetch zero len
> fetch_failed 16 0.00 Fetch failed
> n_srcaddr 0 . N struct srcaddr
> n_srcaddr_act 0 . N active struct srcaddr
> n_sess_mem 578 . N struct sess_mem
> n_sess 414 . N struct sess
> n_object 172697 . N struct object
> n_objecthead 173170 . N struct objecthead
> n_smf 471310 . N struct smf
> n_smf_frag 62172 . N small free smf
> n_smf_large 67978 . N large free smf
> n_vbe_conn 18446744073709551611 . N struct vbe_conn
> n_bereq 315 . N struct bereq
> n_wrk 76 . N worker threads
> n_wrk_create 3039 0.02 N worker threads created
> n_wrk_failed 0 0.00 N worker threads not created
> n_wrk_max 0 0.00 N worker threads limited
> n_wrk_queue 0 0.00 N queued work requests
> n_wrk_overflow 25136 0.19 N overflowed work requests
> n_wrk_drop 0 0.00 N dropped work requests
> n_backend 4 . N backends
> n_expired 771687 . N expired objects
> n_lru_nuked 744693 . N LRU nuked objects
> n_lru_saved 0 . N LRU saved objects
> n_lru_moved 8675178 . N LRU moved objects
> n_deathrow 0 . N objects on deathrow
> losthdr 25 0.00 HTTP header overflows
> n_objsendfile 0 0.00 Objects sent with sendfile
> n_objwrite 11749415 87.05 Objects sent with write
> n_objoverflow 0 0.00 Objects overflowing workspace
> s_sess 12051007 89.29 Total Sessions
> s_req 12050184 89.28 Total Requests
> s_pipe 2661 0.02 Total pipe
> s_pass 13485
Re: Varnish 2.0.6 nuking all my objects?
David Birdsong writes:

>We had roughly 240GB storage space total, after about 2-3 weeks and
>sm_bfree reached ~20GB. lru_nuked started incrementing, sm_bfree
>climbed to ~60GB, but lru_nuking never stopped.

We had a bug where we would nuke from one stevedore, but try to allocate from another.

Not sure if the fix made it into any of the 2.0 releases, it will be in 2.1

Poul-Henning

--
Poul-Henning Kamp | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc
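To see why a mismatch like the one Poul-Henning describes would keep nuking without ever making room, here is a toy Python model. It is not Varnish code: the `Stevedore` class, its methods, and the sizes are all invented for this sketch, and the thread does not establish that this is the exact bug either poster hit.

```python
class Stevedore:
    """Toy storage backend holding object sizes in LRU order (oldest first)."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.lru = []

    def used(self):
        return sum(self.lru)

    def try_alloc(self, size):
        """Store an object if it fits; report whether it did."""
        if self.used() + size > self.capacity:
            return False
        self.lru.append(size)
        return True

    def nuke_one(self):
        """Evict the least recently used object; False if nothing is left."""
        if not self.lru:
            return False
        self.lru.pop(0)
        return True


def alloc_with_nuking(alloc_from, nuke_from, size):
    """Allocate from one store, evicting from `nuke_from` until it fits.

    Returns the number of objects nuked. Raises MemoryError when
    `nuke_from` is empty and the allocation still does not fit.
    """
    nuked = 0
    while not alloc_from.try_alloc(size):
        if not nuke_from.nuke_one():
            raise MemoryError(
                f"no space in {alloc_from.name} after nuking {nuked} objects")
        nuked += 1
    return nuked


def full_store(name):
    """Build a store of capacity 100 filled with ten objects of size 10."""
    store = Stevedore(name, capacity=100)
    for _ in range(10):
        store.try_alloc(10)
    return store


if __name__ == "__main__":
    a = full_store("a")
    print("same store:", alloc_with_nuking(a, a, 10), "object nuked")

    a, b = full_store("a"), full_store("b")
    try:
        alloc_with_nuking(a, b, 10)
    except MemoryError as err:
        print("mismatched stores:", err, "| objects left in b:", len(b.lru))
```

When the nuke and the allocation target the same store, one eviction is enough. When they target different stores, the allocating store never gains space, so the other store is emptied object by object and the allocation still fails.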
Re: Varnish 2.0.6 nuking all my objects?
On Feb 25, 2010, at 2:26 AM, David Birdsong wrote:

> I have seen this happen.
>
> I have a similar hardware setup, though I changed the multi-ssd raid
> into 3 separate cache file arguments.

Did you try RAID and switch to the separate cache files because performance was better?

> We had roughly 240GB storage space total, after about 2-3 weeks and
> sm_bfree reached ~20GB. lru_nuked started incrementing, sm_bfree
> climbed to ~60GB, but lru_nuking never stopped.

How did you fix it?

--
Barry Abrahamson | Systems Wrangler | Automattic
Blog: http://barry.wordpress.com
Re: Varnish 2.0.6 nuking all my objects?
On Feb 25, 2010, at 3:54 AM, Poul-Henning Kamp wrote:

> David Birdsong writes:
>
>> We had roughly 240GB storage space total, after about 2-3 weeks and
>> sm_bfree reached ~20GB. lru_nuked started incrementing, sm_bfree
>> climbed to ~60GB, but lru_nuking never stopped.
>
> We had a bug where we would nuke from one stevedore, but try to allocate
> from another. Not sure if the fix made it into any of the 2.0 releases,
> it will be in 2.1

Thanks for the info - are the fixes in -trunk now?

--
Barry Abrahamson | Systems Wrangler | Automattic
Blog: http://barry.wordpress.com
Re: Varnish 2.0.6 nuking all my objects?
On Thu, Feb 25, 2010 at 8:41 AM, Barry Abrahamson wrote:
>
> On Feb 25, 2010, at 2:26 AM, David Birdsong wrote:
>
>> I have seen this happen.
>>
>> I have a similar hardware setup, though I changed the multi-ssd raid
>> into 3 separate cache file arguments.
>
> Did you try RAID and switch to the separate cache files because performance
> was better?

seemingly so.

for some reason enabling block_dump showed that kswapd was always writing to those devices despite there not being any swap space on them.

i searched around fruitlessly to try to understand the overhead of software raid to explain this, but once i discovered varnish could take on multiple cache files, i saw no reason for the software raid and just abandoned it.

>> We had roughly 240GB storage space total, after about 2-3 weeks and
>> sm_bfree reached ~20GB. lru_nuked started incrementing, sm_bfree
>> climbed to ~60GB, but lru_nuking never stopped.
>
> How did you fix it?

i haven't yet.

i'm changing up how i cache content, such that lru_nuking can be better tolerated.
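David does not post his command line, so the following is only a sketch of what "separate cache file arguments" could look like. It reuses the `-s file,<path>,<size>` form from Barry's original command and assumes varnishd accepts the option more than once, as David's description implies. The mount points and sizes are hypothetical placeholders.

```python
def storage_args(cache_files):
    """Build one `-s file,<path>,<size>` pair per cache file."""
    args = []
    for path, size in cache_files:
        args += ["-s", f"file,{path},{size}"]
    return args


if __name__ == "__main__":
    # Hypothetical layout: one cache file per SSD instead of one RAID 0 volume.
    cache_files = [
        ("/ssd0/varnish/cache", "80G"),
        ("/ssd1/varnish/cache", "80G"),
        ("/ssd2/varnish/cache", "80G"),
    ]
    print(" ".join(["varnishd"] + storage_args(cache_files)))
```

The printed line only covers the storage options; the remaining flags would stay as in the original invocation.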
Re: Varnish 2.0.6 nuking all my objects?
On Feb 25, 2010, at 12:47 PM, David Birdsong wrote:

> On Thu, Feb 25, 2010 at 8:41 AM, Barry Abrahamson wrote:
>>
>> On Feb 25, 2010, at 2:26 AM, David Birdsong wrote:
>>
>>> I have seen this happen.
>>>
>>> I have a similar hardware setup, though I changed the multi-ssd raid
>>> into 3 separate cache file arguments.
>>
>> Did you try RAID and switch to the separate cache files because performance
>> was better?
> seemingly so.
>
> for some reason enabling block_dump showed that kswapd was always
> writing to those devices despite there not being any swap space on
> them.
>
> i searched around fruitlessly to try to understand the overhead of
> software raid to explain this, but once i discovered varnish could
> take on multiple cache files, i saw no reason for the software raid
> and just abandoned it.

Interesting - I will try it out! Thanks for the info.

>>> We had roughly 240GB storage space total, after about 2-3 weeks and
>>> sm_bfree reached ~20GB. lru_nuked started incrementing, sm_bfree
>>> climbed to ~60GB, but lru_nuking never stopped.
>>
>> How did you fix it?
> i haven't yet.
>
> i'm changing up how i cache content, such that lru_nuking can be
> better tolerated.

In my case, Varnish took a cache of 1 million objects and purged 920k of them. When there were 80k objects left, the child restarted, thus dumping the remaining 80k :)

--
Barry Abrahamson | Systems Wrangler | Automattic
Blog: http://barry.wordpress.com
Re: Varnish 2.0.6 nuking all my objects?
On Feb 25, 2010, at 2:56 PM, Barry Abrahamson wrote:

> In my case, Varnish took a cache of 1 million objects, purged 920k of them.
> When there were 80k objects left the child restarted, thus dumping the
> remaining 80k :)

Happened again - here is the backtrace info:

AdvChild (7222) died signal=6
Child (7222) Panic message: Assert error in STV_alloc(), stevedore.c line 71:
  Condition((st) != NULL) not true.
thread = (cache-worker)
Backtrace:
  0x41d655: pan_ic+85
  0x433815: STV_alloc+a5
  0x416ca4: Fetch+684
  0x41131f: cnt_fetch+cf
  0x4125a5: CNT_Session+3a5
  0x41f616: wrk_do_cnt_sess+86
  0x41eb90: wrk_thread+1b0
  0x7f79f61e0fc7: _end+7f79f5b7a147
  0x7f79f5abb59d: _end+7f79f545471d
sp = 0x7f542e45a008 {
  fd = 9, id = 9, xid = 116896,
  client = 10.2.255.5:22276,
  step = STP_FETCH,
  handling = discard,
  restarts = 0, esis = 0
  ws = 0x7f542e45a080 {
    id = "sess",
    {s,f,r,e} = {0x7f542e45a820,+347,(nil),+16384},
  },

The request information shows that it was apparently fetching a 1GB file from the backend and trying to insert it into the cache.

--
Barry Abrahamson | Systems Wrangler | Automattic
Blog: http://barry.wordpress.com