Hi,
Of course, the evening was fairly quiet, so I have no spurious output
to show (Schrödinger effect).
Anyway, here is the pastebin for the busiest period of the night:
https://pastebin.com/536LM9Nx
We use the std and directors vmods.
Btw: I found the right format for varnishncsa (varnishncsa -F '%h %r
%s %{Varnish:handling}x %{Varnish:side}x %T %D' does the job).
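Lines logged in that format can be post-processed to get a hit ratio
and a rough latency percentile. Below is a minimal Python sketch, not a
finished tool. It assumes every line carries exactly the seven fields of
the -F string above, that %D is the serve time in microseconds, and that
the handling field holds values such as hit, miss or pass. The function
names and the sample lines are made up for illustration.

```python
import math


def parse_line(line):
    """Split one line of '%h %r %s %{Varnish:handling}x %{Varnish:side}x %T %D'."""
    # %r (the request line) contains spaces, so peel the five trailing
    # fields off from the right and split the host off the remainder.
    head, status, handling, side, seconds, micros = line.rsplit(None, 5)
    host, request = head.split(None, 1)
    return {
        "host": host,
        "request": request,
        "status": status,
        "handling": handling,
        "side": side,
        "seconds": seconds,
        "micros": int(micros),
    }


def summarize(lines, pct=95):
    """Return request count, hit ratio and a nearest-rank percentile of %D
    for client-side ('c') records only."""
    records = [parse_line(l) for l in lines if l.strip()]
    client = [r for r in records if r["side"] == "c"]
    if not client:
        return {"requests": 0, "hit_ratio": 0.0, "pct_micros": None}
    hits = sum(1 for r in client if r["handling"] == "hit")
    durations = sorted(r["micros"] for r in client)
    # Nearest-rank percentile: smallest value with at least pct% at or below it.
    rank = max(1, math.ceil(pct / 100 * len(durations)))
    return {
        "requests": len(client),
        "hit_ratio": hits / len(client),
        "pct_micros": durations[rank - 1],
    }


# Invented sample lines, only to show the expected shape of the input.
SAMPLE = [
    "10.0.0.1 GET /a HTTP/1.1 200 hit c 0 150",
    "10.0.0.2 GET /b HTTP/1.1 200 miss c 0 42000",
    "10.0.0.1 GET /a HTTP/1.1 200 hit c 0 120",
    "10.0.0.3 GET /c HTTP/1.1 503 miss c 1 1200000",
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Comparing the percentile from a quiet period with one from a peak
should show whether response times really degrade when the CPU climbs.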
Side question: why not include hit/miss in the default output?
Thanks for the help.
Best,
--
Raphael Mazelier
On 14/11/2017 23:41, Guillaume Quintard wrote:
Hi,
Let's look at the usual suspects first, can we get the output of "ps
aux |grep varnish" and a pastebin of "varnishncsa -1"?
Are you using any vmod?
man varnishncsa will help craft a format line with the response time
(on mobile now, I don't have access to it)
Cheers,
--
Guillaume Quintard
On Nov 14, 2017 23:25, "Raphael Mazelier" <[email protected]> wrote:
Hello list,
First of all, despite my mail subject, I really appreciate Varnish.
We use it a lot at work (hundreds of instances) with success and,
unfortunately, some pain these days.
TL;DR: upgrading from Varnish 2 to Varnish 4 and 5 on one of our
infrastructures brought us serious trouble and instability on this
platform.
And we are a bit desperate/frustrated.
Long story.
A bit of context:
This is a very complex platform serving an IPTV service with some
traffic (8k req/s at peak, even more when it works well).
It is composed of a two-stage reverse proxy cache (3 x 2 Varnish
for stage 1, 2 Varnish for stage 2, so 8 in total) and a lot of
different backends (PHP applications, Node.js apps, remote backends
*sigh*, and even piped ones). This is a big historical spaghetti app.
We plan to rebuild it from scratch in 2018.
The first-stage Varnish instances are split into two pools handling
different topologies of clients.
A lot of the logic is in Varnish/VCL itself: lots of URL rewriting,
lots of header manipulation, backend selection, and even ESI
processing...
The VCL of the stage 1 Varnish is almost 3000 lines long.
But for now we have to live/deal with it.
History of the problem:
At the beginning all Varnish instances were on version 2.x. Things
worked fairly well.
This summer we needed to upgrade Varnish to handle very long headers
(a product requirement).
So after a short battle porting our VCL to VCL 4.0 we started using
Varnish 4.
Shortly after, things began to go very badly.
The first issue we hit was memory exhaustion on both stages, and
the oom-killer...
We tested a lot of things, and in the battle we upgraded to Varnish 5.
We fixed it by resizing the pool and switching to the file storage
backend (from memory before).
Memory is now stable (we have a large pool, 32G, and strangely we
never see objects being nuked, which may be good or bad, it depends).
We have also fixed a lot of things in our VCL.
The problem we are fighting now is only on the stage 1 Varnish,
and specifically on one pool (the busiest one).
When everything goes well the average CPU usage is 30%, memory
stabilizes around 12G, and the cache hit ratio is around 0.85.
The problem happens randomly (not every day) but during our peaks.
CPU usage quickly climbs to 350% (4 cores) and the load goes above 3.
When the problem is present Varnish still delivers requests (we didn't
see dropped or rejected connections) but our application begins to
lose users, which costs us a lot of business. I suspect this is
because timeouts are very aggressive on the client side and Varnish
is probably answering slowly.
- First question: how can we see the response time of requests on the
Varnish server? (varnishncsa something?)
I also suspect some kind of request queuing; stracing Varnish when it
happens also shows a lot of futex waits?!
The frustrating part is that restarting Varnish fixes the problem
immediately, and the CPU stays normal afterwards, even if the traffic
peak is not over.
So there is clearly something piling up in Varnish that causes our
problem.
- Second question: how can we see the number of stacked (queued)
connections, long-lived connections, and so on?
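A possible starting point for this question is the counters printed by
"varnishstat -1", compared between two snapshots. The Python sketch
below is only an illustration. It assumes the usual "name value ..."
column layout of that output and the counter names MAIN.threads,
MAIN.thread_queue_len, MAIN.sess_queued and MAIN.sess_dropped as found
in Varnish 4/5 (worth checking against your own version). The helper
names and the sample snapshots are invented.

```python
# Counters assumed to be relevant to thread/session queuing.
WATCHED = (
    "MAIN.threads",
    "MAIN.thread_queue_len",
    "MAIN.sess_queued",
    "MAIN.sess_dropped",
)


def parse_varnishstat(text):
    """Map counter name -> integer value from 'varnishstat -1' style text."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Expect at least a name and a value; skip blank or odd lines.
        if len(fields) >= 2 and fields[1].isdigit():
            counters[fields[0]] = int(fields[1])
    return counters


def queue_report(before_text, after_text):
    """Return {name: (current_value, change_since_before)} for WATCHED."""
    before = parse_varnishstat(before_text)
    after = parse_varnishstat(after_text)
    report = {}
    for name in WATCHED:
        if name in after:
            report[name] = (after[name], after[name] - before.get(name, 0))
    return report


# Invented snapshots, only to show the expected input shape.
BEFORE = """
MAIN.threads 400 . Total number of threads
MAIN.thread_queue_len 0 . Length of session queue
MAIN.sess_queued 10 0.00 Sessions queued for thread
MAIN.sess_dropped 0 0.00 Sessions dropped for thread
"""

AFTER = """
MAIN.threads 400 . Total number of threads
MAIN.thread_queue_len 25 . Length of session queue
MAIN.sess_queued 310 0.01 Sessions queued for thread
MAIN.sess_dropped 0 0.00 Sessions dropped for thread
"""

if __name__ == "__main__":
    for name, (value, delta) in queue_report(BEFORE, AFTER).items():
        print(f"{name}: {value} ({delta:+d})")
```

A queue length or queued-session count that grows during a peak while
the thread count stays flat would be consistent with the queuing
suspected above.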
At this stage we welcome any kind of help / hints for debugging (and
given the business impact, we can consider paying for professional
support).
PS: I always have the option to scale out by spinning up a lot of new
Varnish instances, but that seems very frustrating...
Best,
--
Raphael Mazelier
_______________________________________________
varnish-misc mailing list
[email protected]
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc