Re: Varnish 2.0.6 nuking all my objects?

2010-02-25 Thread Barry Abrahamson

On Feb 25, 2010, at 2:56 PM, Barry Abrahamson wrote:

> In my case, Varnish took a cache of 1 million objects and purged 920k of
> them.  When there were 80k objects left, the child restarted, dumping the
> remaining 80k :)

Happened again - here is the backtrace info:

Child (7222) died signal=6
Child (7222) Panic message: Assert error in STV_alloc(), stevedore.c line 71:
  Condition((st) != NULL) not true.
thread = (cache-worker)
Backtrace:
  0x41d655: pan_ic+85
  0x433815: STV_alloc+a5
  0x416ca4: Fetch+684
  0x41131f: cnt_fetch+cf
  0x4125a5: CNT_Session+3a5
  0x41f616: wrk_do_cnt_sess+86
  0x41eb90: wrk_thread+1b0
  0x7f79f61e0fc7: _end+7f79f5b7a147
  0x7f79f5abb59d: _end+7f79f545471d
sp = 0x7f542e45a008 {
  fd = 9, id = 9, xid = 116896,
  client = 10.2.255.5:22276,
  step = STP_FETCH,
  handling = discard,
  restarts = 0, esis = 0
  ws = 0x7f542e45a080 {
id = "sess",
{s,f,r,e} = {0x7f542e45a820,+347,(nil),+16384},
  },

The request information shows that it was apparently fetching a 1GB file from 
the backend and trying to insert it into the cache.
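
For what it's worth, the backtrace reads like the allocation loop ran out
of both free space and nukeable objects.  Here is a toy reconstruction of
that loop; this is my reading of the backtrace, with invented stand-ins
for the storage and LRU internals, not the actual stevedore.c code.
Running it aborts with SIGABRT (signal 6), just as the child did:

    #include <assert.h>
    #include <stddef.h>
    #include <stdlib.h>

    /* Invented stand-ins: a nearly-full storage pool and a short LRU. */
    static size_t free_bytes = 4096;   /* space left in the stevedore     */
    static int nukeable = 3;           /* objects the LRU can still evict */

    static void *
    toy_stv_alloc(size_t size)         /* stands in for stv->alloc()      */
    {
        return (size <= free_bytes ? malloc(size) : NULL);
    }

    static int
    toy_nuke_one(void)                 /* stands in for LRU nuking        */
    {
        if (nukeable == 0)
            return (-1);               /* nothing left to evict           */
        nukeable--;
        free_bytes += 4096;            /* reclaim a little space          */
        return (0);
    }

    int
    main(void)
    {
        size_t want = (size_t)1 << 30; /* a 1GB object, as in the panic   */
        void *st = NULL;

        for (;;) {
            st = toy_stv_alloc(want);  /* try to allocate                 */
            if (st != NULL)
                break;
            if (toy_nuke_one() == -1)  /* on failure, nuke LRU and retry  */
                break;                 /* ...until nothing is left        */
        }
        /* A 1GB request can never be satisfied here, so we fall through
         * with st == NULL, the same "Condition((st) != NULL) not true". */
        assert(st != NULL);
        return (0);
    }
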
--
Barry Abrahamson | Systems Wrangler | Automattic
Blog: http://barry.wordpress.com





Re: Varnish 2.0.6 nuking all my objects?

2010-02-25 Thread Barry Abrahamson

On Feb 25, 2010, at 12:47 PM, David Birdsong wrote:

> On Thu, Feb 25, 2010 at 8:41 AM, Barry Abrahamson wrote:
>> 
>> On Feb 25, 2010, at 2:26 AM, David Birdsong wrote:
>> 
>>> I have seen this happen.
>>> 
>>> I have a similar hardware setup, though I changed the multi-ssd raid
>>> into 3 separate cache file arguments.
>> 
>> Did you try RAID and then switch to the separate cache files because
>> performance was better?
> seemingly so.
> 
> for some reason enabling block_dump showed that kswapd was always
> writing to those devices despite there not being any swap space on
> them.
> 
> i searched around fruitlessly to try to understand the overhead of
> software raid to explain this, but once i discovered varnish could
> take on multiple cache files, i saw no reason for the software raid
> and just abandoned it.

Interesting - I will try it out!  Thanks for the info.


>>> We had roughly 240GB of storage space total.  After about 2-3 weeks,
>>> sm_bfree reached ~20GB and lru_nuked started incrementing; sm_bfree
>>> climbed to ~60GB, but lru_nuking never stopped.
>> 
>> How did you fix it?
> i haven't yet.
> 
> i'm changing up how i cache content, such that lru_nuking can be
> better tolerated.

In my case, Varnish took a cache of 1 million objects and purged 920k of
them.  When there were 80k objects left, the child restarted, dumping the
remaining 80k :)


--
Barry Abrahamson | Systems Wrangler | Automattic
Blog: http://barry.wordpress.com





Re: Varnish 2.0.6 nuking all my objects?

2010-02-25 Thread David Birdsong
On Thu, Feb 25, 2010 at 8:41 AM, Barry Abrahamson  wrote:
>
> On Feb 25, 2010, at 2:26 AM, David Birdsong wrote:
>
>> I have seen this happen.
>>
>> I have a similar hardware setup, though I changed the multi-ssd raid
>> into 3 separate cache file arguments.
>
> Did you try RAID and then switch to the separate cache files because
> performance was better?
seemingly so.

for some reason enabling block_dump showed that kswapd was always
writing to those devices despite there not being any swap space on
them.

i searched around fruitlessly to try to understand the overhead of
software raid to explain this, but once i discovered varnish could
take on multiple cache files, i saw no reason for the software raid
and just abandoned it.
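
For the record, the multi-file setup amounts to repeating -s, one cache
file per drive; the paths and sizes below are invented for illustration:

    /usr/local/sbin/varnishd ... \
        -s file,/ssd0/varnish/cache,70G \
        -s file,/ssd1/varnish/cache,70G \
        -s file,/ssd2/varnish/cache,70G

Each -s argument becomes its own stevedore, with no md/RAID layer between
Varnish and the disks.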

>
>> We had roughly 240GB of storage space total.  After about 2-3 weeks,
>> sm_bfree reached ~20GB and lru_nuked started incrementing; sm_bfree
>> climbed to ~60GB, but lru_nuking never stopped.
>
> How did you fix it?
i haven't yet.

i'm changing up how i cache content, such that lru_nuking can be
better tolerated.

>
>
> --
> Barry Abrahamson | Systems Wrangler | Automattic
> Blog: http://barry.wordpress.com
>
>
>
>


Re: Varnish 2.0.6 nuking all my objects?

2010-02-25 Thread Barry Abrahamson

On Feb 25, 2010, at 3:54 AM, Poul-Henning Kamp wrote:

> In message , David Birdsong writes:
> 
>> We had roughly 240GB of storage space total.  After about 2-3 weeks,
>> sm_bfree reached ~20GB and lru_nuked started incrementing; sm_bfree
>> climbed to ~60GB, but lru_nuking never stopped.
> 
> We had a bug where we would nuke from one stevedore, but try to allocate
> from another.  Not sure if the fix made it into any of the 2.0 releases;
> it will be in 2.1.

Thanks for the info - are the fixes in -trunk now?

--
Barry Abrahamson | Systems Wrangler | Automattic
Blog: http://barry.wordpress.com





Re: Varnish 2.0.6 nuking all my objects?

2010-02-25 Thread Barry Abrahamson

On Feb 25, 2010, at 2:26 AM, David Birdsong wrote:

> I have seen this happen.
> 
> I have a similar hardware setup, though I changed the multi-ssd raid
> into 3 separate cache file arguments.

Did you try RAID and then switch to the separate cache files because
performance was better?

> We had roughly 240GB of storage space total.  After about 2-3 weeks,
> sm_bfree reached ~20GB and lru_nuked started incrementing; sm_bfree
> climbed to ~60GB, but lru_nuking never stopped.

How did you fix it?


--
Barry Abrahamson | Systems Wrangler | Automattic
Blog: http://barry.wordpress.com





Re: Varnish 2.0.6 nuking all my objects?

2010-02-25 Thread Poul-Henning Kamp
In message , David Birdsong writes:

>We had roughly 240GB of storage space total.  After about 2-3 weeks,
>sm_bfree reached ~20GB and lru_nuked started incrementing; sm_bfree
>climbed to ~60GB, but lru_nuking never stopped.

We had a bug where we would nuke from one stevedore, but try to allocate
from another.  Not sure if the fix made it into any of the 2.0 releases;
it will be in 2.1.
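
A toy sketch of how that class of bug looks from the outside (illustrative
numbers and logic only, not the actual code or the fix): allocation keeps
targeting the full stevedore while the global LRU evicts from the other
one, so nuking frees space the allocator never sees.

    #include <stdio.h>

    int
    main(void)
    {
        long long free_a = 0;          /* stevedore A: full; allocs land here */
        long long free_b = 20LL << 30; /* stevedore B: ~20GB free             */
        long long want = 1LL << 20;    /* a 1MB allocation                    */
        long long nuked = 0;

        for (int i = 0; i < 1000; i++) {
            if (free_a >= want) {      /* allocate from A only                */
                free_a -= want;
                break;
            }
            free_b += want;            /* global LRU evicts from B instead    */
            nuked++;
        }
        printf("nuked %lld objects; A still full; B free: %lldGB\n",
            nuked, free_b >> 30);
        return (0);
    }

The output pattern matches the report: n_lru_nuked climbs without bound,
sm_bfree grows (toward ~60GB), and the nuking never stops.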

Poul-Henning

-- 
Poul-Henning Kamp   | UNIX since Zilog Zeus 3.20
p...@freebsd.org | TCP/IP since RFC 956
FreeBSD committer   | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.


Re: Varnish 2.0.6 nuking all my objects?

2010-02-25 Thread David Birdsong
I have seen this happen.

I have a similar hardware setup, though I changed the multi-ssd raid
into 3 separate cache file arguments.

We had roughly 240GB of storage space total.  After about 2-3 weeks,
sm_bfree reached ~20GB and lru_nuked started incrementing; sm_bfree
climbed to ~60GB, but lru_nuking never stopped.

On Wed, Feb 24, 2010 at 8:15 PM, Barry Abrahamson  wrote:
> Howdy,
>
> We are finally getting around to upgrading to the latest version of Varnish
> and are running into quite a weird problem.  Everything works fine for a bit
> (1+ day), then all of a sudden Varnish starts nuking all of the objects from
> the cache:
>
> About 4 hours ago there were 1 million objects in the cache; now there are
> just about 172k.  This looks a bit weird to me:
>
> sms_nbytes       18446744073709548694          .   SMS outstanding bytes
>
> Here are the options I am passing to varnishd:
>
> /usr/local/sbin/varnishd -a 0.0.0.0: -f /etc/varnish/varnish.vcl -P 
> /var/run/varnishd.pid -T 0.0.0.0:47200 -t 600 -w 1,200,300 -p thread_pools 4 
> -p thread_pool_add_delay 2 -p lru_interval 60 -h classic,59 -p 
> obj_workspace 4096 -s file,/varnish/cache,150G
>
> /varnish is 2 x 80GB Intel X-25M SSDs in a software RAID 0 array.  OS is 
> Debian Lenny 64-bit.  There is plenty of space:
>
> /dev/md0              149G   52G   98G  35% /varnish
>
> Here is the output of varnishstat -1
>
> uptime                 134971          .   Child uptime
> client_conn          12051037        89.29 Client connections accepted
> client_drop                 0         0.00 Connection dropped, no sess
> client_req           12048672        89.27 Client requests received
> cache_hit            10161272        75.28 Cache hits
> cache_hitpass          133244         0.99 Cache hits for pass
> cache_miss            1750857        12.97 Cache misses
> backend_conn          1824594        13.52 Backend conn. success
> backend_unhealthy            0         0.00 Backend conn. not attempted
> backend_busy                0         0.00 Backend conn. too many
> backend_fail             3644         0.03 Backend conn. failures
> backend_reuse               0         0.00 Backend conn. reuses
> backend_toolate             0         0.00 Backend conn. was closed
> backend_recycle             0         0.00 Backend conn. recycles
> backend_unused              0         0.00 Backend conn. unused
> fetch_head               5309         0.04 Fetch head
> fetch_length          1816422        13.46 Fetch with Length
> fetch_chunked               0         0.00 Fetch chunked
> fetch_eof                   0         0.00 Fetch EOF
> fetch_bad                   0         0.00 Fetch had bad headers
> fetch_close                 0         0.00 Fetch wanted close
> fetch_oldhttp               0         0.00 Fetch pre HTTP/1.1 closed
> fetch_zero                  0         0.00 Fetch zero len
> fetch_failed               16         0.00 Fetch failed
> n_srcaddr                   0          .   N struct srcaddr
> n_srcaddr_act               0          .   N active struct srcaddr
> n_sess_mem                578          .   N struct sess_mem
> n_sess                    414          .   N struct sess
> n_object               172697          .   N struct object
> n_objecthead           173170          .   N struct objecthead
> n_smf                  471310          .   N struct smf
> n_smf_frag              62172          .   N small free smf
> n_smf_large             67978          .   N large free smf
> n_vbe_conn       18446744073709551611          .   N struct vbe_conn
> n_bereq                   315          .   N struct bereq
> n_wrk                      76          .   N worker threads
> n_wrk_create             3039         0.02 N worker threads created
> n_wrk_failed                0         0.00 N worker threads not created
> n_wrk_max                   0         0.00 N worker threads limited
> n_wrk_queue                 0         0.00 N queued work requests
> n_wrk_overflow          25136         0.19 N overflowed work requests
> n_wrk_drop                  0         0.00 N dropped work requests
> n_backend                   4          .   N backends
> n_expired              771687          .   N expired objects
> n_lru_nuked            744693          .   N LRU nuked objects
> n_lru_saved                 0          .   N LRU saved objects
> n_lru_moved           8675178          .   N LRU moved objects
> n_deathrow                  0          .   N objects on deathrow
> losthdr                    25         0.00 HTTP header overflows
> n_objsendfile               0         0.00 Objects sent with sendfile
> n_objwrite           11749415        87.05 Objects sent with write
> n_objoverflow               0         0.00 Objects overflowing workspace
> s_sess               12051007        89.29 Total Sessions
> s_req                12050184        89.28 Total Requests
> s_pipe                   2661         0.02 Total pipe
> s_pass                 13485
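
Worth noting about the listing above: sms_nbytes (18446744073709548694)
is 2^64 - 2922, and n_vbe_conn (18446744073709551611) is 2^64 - 5.  Both
look like unsigned 64-bit gauges that were decremented more times than
they were incremented and wrapped around.  A minimal C demonstration:

    #include <inttypes.h>
    #include <stdio.h>

    int
    main(void)
    {
        uint64_t sms_nbytes = 0;
        uint64_t n_vbe_conn = 0;

        sms_nbytes -= 2922;  /* gauge driven slightly below zero...  */
        n_vbe_conn -= 5;     /* ...wraps to a huge unsigned value    */

        printf("%" PRIu64 "\n", sms_nbytes); /* 18446744073709548694 */
        printf("%" PRIu64 "\n", n_vbe_conn); /* 18446744073709551611 */
        return (0);
    }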