Varnish 2.0.6 nuking all my objects?

2010-02-24 Thread Barry Abrahamson
Howdy,

We are finally getting around to upgrading to the latest version of Varnish and 
are running into quite a weird problem.  Everything works fine for a while 
(1+ day), then all of a sudden Varnish starts nuking all of the objects from 
the cache:

About 4 hours ago there were 1 million objects in the cache, now there are just 
about 172k.  This looks a bit weird to me:

sms_nbytes   18446744073709548694  .   SMS outstanding bytes
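(For what it's worth, that giant number looks like a 64-bit unsigned counter
that has been decremented below zero and wrapped around; a quick illustrative
check, not part of varnishstat itself:)

```python
# A 64-bit unsigned counter decremented below zero wraps around;
# reinterpreting the stored value as signed recovers the "real" value.
def as_signed64(u):
    return u - 2**64 if u >= 2**63 else u

print(as_signed64(18446744073709548694))  # sms_nbytes  -> -2922
print(as_signed64(18446744073709551611))  # n_vbe_conn  -> -5
```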

Here are the options I am passing to varnishd:

/usr/local/sbin/varnishd -a 0.0.0.0: -f /etc/varnish/varnish.vcl \
    -P /var/run/varnishd.pid -T 0.0.0.0:47200 -t 600 -w 1,200,300 \
    -p thread_pools 4 -p thread_pool_add_delay 2 -p lru_interval 60 \
    -h classic,59 -p obj_workspace 4096 -s file,/varnish/cache,150G

/varnish is 2 x 80GB Intel X-25M SSDs in a software RAID 0 array.  OS is Debian 
Lenny 64-bit.  There is plenty of space:

/dev/md0  149G   52G   98G  35% /varnish

Here is the output of varnishstat -1

uptime                 134971          .   Child uptime
client_conn          12051037        89.29 Client connections accepted
client_drop                 0         0.00 Connection dropped, no sess
client_req           12048672        89.27 Client requests received
cache_hit            10161272        75.28 Cache hits
cache_hitpass          133244         0.99 Cache hits for pass
cache_miss            1750857        12.97 Cache misses
backend_conn          1824594        13.52 Backend conn. success
backend_unhealthy           0         0.00 Backend conn. not attempted
backend_busy                0         0.00 Backend conn. too many
backend_fail             3644         0.03 Backend conn. failures
backend_reuse               0         0.00 Backend conn. reuses
backend_toolate             0         0.00 Backend conn. was closed
backend_recycle             0         0.00 Backend conn. recycles
backend_unused              0         0.00 Backend conn. unused
fetch_head               5309         0.04 Fetch head
fetch_length          1816422        13.46 Fetch with Length
fetch_chunked               0         0.00 Fetch chunked
fetch_eof                   0         0.00 Fetch EOF
fetch_bad                   0         0.00 Fetch had bad headers
fetch_close                 0         0.00 Fetch wanted close
fetch_oldhttp               0         0.00 Fetch pre HTTP/1.1 closed
fetch_zero                  0         0.00 Fetch zero len
fetch_failed               16         0.00 Fetch failed
n_srcaddr                   0          .   N struct srcaddr
n_srcaddr_act               0          .   N active struct srcaddr
n_sess_mem                578          .   N struct sess_mem
n_sess                    414          .   N struct sess
n_object               172697          .   N struct object
n_objecthead           173170          .   N struct objecthead
n_smf                  471310          .   N struct smf
n_smf_frag              62172          .   N small free smf
n_smf_large             67978          .   N large free smf
n_vbe_conn       18446744073709551611          .   N struct vbe_conn
n_bereq                   315          .   N struct bereq
n_wrk                      76          .   N worker threads
n_wrk_create             3039         0.02 N worker threads created
n_wrk_failed                0         0.00 N worker threads not created
n_wrk_max                   0         0.00 N worker threads limited
n_wrk_queue                 0         0.00 N queued work requests
n_wrk_overflow          25136         0.19 N overflowed work requests
n_wrk_drop                  0         0.00 N dropped work requests
n_backend                   4          .   N backends
n_expired              771687          .   N expired objects
n_lru_nuked            744693          .   N LRU nuked objects
n_lru_saved                 0          .   N LRU saved objects
n_lru_moved           8675178          .   N LRU moved objects
n_deathrow                  0          .   N objects on deathrow
losthdr                    25         0.00 HTTP header overflows
n_objsendfile               0         0.00 Objects sent with sendfile
n_objwrite           11749415        87.05 Objects sent with write
n_objoverflow               0         0.00 Objects overflowing workspace
s_sess               12051007        89.29 Total Sessions
s_req                12050184        89.28 Total Requests
s_pipe                   2661         0.02 Total pipe
s_pass                 134858         1.00 Total pass
s_fetch               1821721        13.50 Total fetch
s_hdrbytes         3932274894     29134.22 Total header bytes
s_bodybytes      894452020319   6626994.10 Total body bytes
sess_closed          12050925        89.29 Session Closed
sess_pipeline               0         0.00 Session Pipeline
sess_readahead              0         0.00 Session Read Ahead
sess_linger                 0         0.00 Session Linger
sess_herd                 160         0.00 Session herd
shm_records         610011852

Re: Varnish 2.0.6 nuking all my objects?

2010-02-25 Thread David Birdsong
I have seen this happen.

I have a similar hardware setup, though I changed the multi-ssd raid
into 3 separate cache file arguments.

We had roughly 240GB of storage space total. After about 2-3 weeks,
sm_bfree dropped to ~20GB and lru_nuked started incrementing; sm_bfree
climbed back to ~60GB, but the LRU nuking never stopped.

On Wed, Feb 24, 2010 at 8:15 PM, Barry Abrahamson  wrote:
> Howdy,
>
> We are finally getting around to upgrading to the latest version of Varnish 
> and are running into quite a weird problem.  Everything works fine for a while 
> (1+ day), then all of a sudden Varnish starts nuking all of the objects from 
> the cache:
>
> About 4 hours ago there were 1 million objects in the cache, now there are 
> just about 172k.  This looks a bit weird to me:
>
> sms_nbytes       18446744073709548694          .   SMS outstanding bytes
>
> [varnishd options and full varnishstat output snipped; identical to the
> original message above]

Re: Varnish 2.0.6 nuking all my objects?

2010-02-25 Thread Poul-Henning Kamp
In message , David Birdsong writes:

>We had roughly 240GB of storage space total. After about 2-3 weeks,
>sm_bfree dropped to ~20GB and lru_nuked started incrementing; sm_bfree
>climbed back to ~60GB, but the LRU nuking never stopped.

We had a bug where we would nuke from one stevedore, but try to allocate
from another.  Not sure if the fix made it into any of the 2.0 releases,
it will be in 2.1
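If I am reading the reports right, a toy model of that failure mode (purely
illustrative Python; the real stevedore code looks nothing like this) is:

```python
# Toy model: LRU nuking frees space in whichever storage backend the victim
# object lives in, while the allocator keeps retrying a different backend,
# so free space grows elsewhere and the allocation never succeeds.
class Stevedore:
    def __init__(self, free):
        self.free = free

    def alloc(self, size):
        if self.free >= size:
            self.free -= size
            return True
        return False

def alloc_with_nuke(target, lru, size):
    # Keep nuking LRU victims until the allocation in `target` succeeds.
    while not target.alloc(size):
        store, obj_size = lru.pop(0)  # victim may live in another stevedore
        store.free += obj_size        # ...so this frees the "wrong" space

a, b = Stevedore(free=0), Stevedore(free=100)
lru = [(b, 10)] * 10                  # every LRU victim lives in b
try:
    alloc_with_nuke(a, lru, 10)       # ...but we allocate from a
except IndexError:
    print("nuked the whole cache and still could not allocate")
```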

Poul-Henning

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
___
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc


Re: Varnish 2.0.6 nuking all my objects?

2010-02-25 Thread Barry Abrahamson

On Feb 25, 2010, at 2:26 AM, David Birdsong wrote:

> I have seen this happen.
> 
> I have a similar hardware setup, though I changed the multi-ssd raid
> into 3 separate cache file arguments.

Did you try RAID and switch to the separate cache files because performance was 
better?

> We had roughly 240GB of storage space total. After about 2-3 weeks,
> sm_bfree dropped to ~20GB and lru_nuked started incrementing; sm_bfree
> climbed back to ~60GB, but the LRU nuking never stopped.

How did you fix it?


--
Barry Abrahamson | Systems Wrangler | Automattic
Blog: http://barry.wordpress.com





Re: Varnish 2.0.6 nuking all my objects?

2010-02-25 Thread Barry Abrahamson

On Feb 25, 2010, at 3:54 AM, Poul-Henning Kamp wrote:

> In message , David Birdsong writes:
> 
>> We had roughly 240GB of storage space total. After about 2-3 weeks,
>> sm_bfree dropped to ~20GB and lru_nuked started incrementing; sm_bfree
>> climbed back to ~60GB, but the LRU nuking never stopped.
> 
> We had a bug where we would nuke from one stevedore, but try to allocate
> from another.  Not sure if the fix made it into any of the 2.0 releases,
> it will be in 2.1

Thanks for the info - are the fixes in -trunk now?

--
Barry Abrahamson | Systems Wrangler | Automattic
Blog: http://barry.wordpress.com





Re: Varnish 2.0.6 nuking all my objects?

2010-02-25 Thread David Birdsong
On Thu, Feb 25, 2010 at 8:41 AM, Barry Abrahamson  wrote:
>
> On Feb 25, 2010, at 2:26 AM, David Birdsong wrote:
>
>> I have seen this happen.
>>
>> I have a similar hardware setup, though I changed the multi-ssd raid
>> into 3 separate cache file arguments.
>
> Did you try RAID and switch to the separate cache files because performance 
> was better?
Seemingly so.

For some reason, enabling block_dump showed that kswapd was always
writing to those devices despite there not being any swap space on
them.

I searched around fruitlessly to try to understand the overhead of
software RAID to explain this, but once I discovered Varnish could
take multiple cache files, I saw no reason for the software RAID and
just abandoned it.
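For reference, splitting the storage is just a matter of passing multiple -s
arguments; a hypothetical sketch (the paths, sizes, and listen port here are
made up, not from either of our setups):

```shell
# One file stevedore per SSD instead of a single RAID-0 md device.
varnishd -a 0.0.0.0:80 -f /etc/varnish/varnish.vcl \
    -s file,/ssd0/cache,75G \
    -s file,/ssd1/cache,75G \
    -s file,/ssd2/cache,75G
```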

>
>> We had roughly 240GB of storage space total. After about 2-3 weeks,
>> sm_bfree dropped to ~20GB and lru_nuked started incrementing; sm_bfree
>> climbed back to ~60GB, but the LRU nuking never stopped.
>
> How did you fix it?
I haven't yet.

I'm changing how I cache content so that the LRU nuking can be
better tolerated.

>
>
> --
> Barry Abrahamson | Systems Wrangler | Automattic
> Blog: http://barry.wordpress.com
>
>
>
>


Re: Varnish 2.0.6 nuking all my objects?

2010-02-25 Thread Barry Abrahamson

On Feb 25, 2010, at 12:47 PM, David Birdsong wrote:

> On Thu, Feb 25, 2010 at 8:41 AM, Barry Abrahamson  
> wrote:
>> 
>> On Feb 25, 2010, at 2:26 AM, David Birdsong wrote:
>> 
>>> I have seen this happen.
>>> 
>>> I have a similar hardware setup, though I changed the multi-ssd raid
>>> into 3 separate cache file arguments.
>> 
>> Did you try RAID and switch to the separate cache files because performance 
>> was better?
> Seemingly so.
> 
> For some reason, enabling block_dump showed that kswapd was always
> writing to those devices despite there not being any swap space on
> them.
> 
> I searched around fruitlessly to try to understand the overhead of
> software RAID to explain this, but once I discovered Varnish could
> take multiple cache files, I saw no reason for the software RAID and
> just abandoned it.

Interesting - I will try it out!  Thanks for the info.


>>> We had roughly 240GB of storage space total. After about 2-3 weeks,
>>> sm_bfree dropped to ~20GB and lru_nuked started incrementing; sm_bfree
>>> climbed back to ~60GB, but the LRU nuking never stopped.
>> 
>> How did you fix it?
> I haven't yet.
> 
> I'm changing how I cache content so that the LRU nuking can be
> better tolerated.

In my case, Varnish took a cache of 1 million objects and purged 920k of
them.  When there were 80k objects left, the child restarted, thus dumping
the remaining 80k :)


--
Barry Abrahamson | Systems Wrangler | Automattic
Blog: http://barry.wordpress.com





Re: Varnish 2.0.6 nuking all my objects?

2010-02-25 Thread Barry Abrahamson

On Feb 25, 2010, at 2:56 PM, Barry Abrahamson wrote:

> In my case, Varnish took a cache of 1 million objects and purged 920k of
> them.  When there were 80k objects left, the child restarted, thus dumping
> the remaining 80k :)

Happened again - here is the backtrace info:

Child (7222) died signal=6
Child (7222) Panic message: Assert error in STV_alloc(), stevedore.c line 71:
  Condition((st) != NULL) not true.
thread = (cache-worker)
Backtrace:
  0x41d655: pan_ic+85
  0x433815: STV_alloc+a5
  0x416ca4: Fetch+684
  0x41131f: cnt_fetch+cf
  0x4125a5: CNT_Session+3a5
  0x41f616: wrk_do_cnt_sess+86
  0x41eb90: wrk_thread+1b0
  0x7f79f61e0fc7: _end+7f79f5b7a147
  0x7f79f5abb59d: _end+7f79f545471d
sp = 0x7f542e45a008 {
  fd = 9, id = 9, xid = 116896,
  client = 10.2.255.5:22276,
  step = STP_FETCH,
  handling = discard,
  restarts = 0, esis = 0
  ws = 0x7f542e45a080 {
id = "sess",
{s,f,r,e} = {0x7f542e45a820,+347,(nil),+16384},
  },

The request information shows that it was apparently fetching a 1GB file from 
the backend and trying to insert it into the cache.
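If it helps anyone else hitting this: one workaround I am considering (just a
sketch, untested, and the size threshold is arbitrary) is to refuse to cache
very large responses in vcl_fetch so they never reach the file storage:

```vcl
# Hypothetical VCL 2.0-style sketch (untested): pass responses whose
# Content-Length has 9+ digits (roughly 100MB or more) instead of caching.
sub vcl_fetch {
    if (obj.http.Content-Length ~ "^[0-9]{9,}") {
        pass;
    }
}
```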
--
Barry Abrahamson | Systems Wrangler | Automattic
Blog: http://barry.wordpress.com


