I think we just replicate the NCSA default format line.

--
Guillaume Quintard
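For reference, the varnishncsa default is the NCSA "combined" log line, which is why hit/miss is not in it; a rough sketch of extending it (assuming a Varnish 4/5 varnishncsa, untested here, see man varnishncsa for your version):

  varnishncsa -F '%h %l %u %t "%r" %s %b "%{Referer}i" "%{User-agent}i" %{Varnish:handling}x %D'

%{Varnish:handling}x logs hit/miss/pass/pipe/synth and %D is the total time taken to serve the request, in microseconds.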
On Nov 15, 2017 23:52, "Raphael Mazelier" <[email protected]> wrote:

> Hi,
>
> Of course the evening was quite quiet and I have no spurious output to
> show (Schrödinger effect).
>
> Anyway, here is the pastebin of the busiest period this night:
> https://pastebin.com/536LM9Nx
>
> We use the std and director vmods.
>
> Btw: I found the correct format for varnishncsa (varnishncsa -F '%h %r
> %s %{Varnish:handling}x %{Varnish:side}x %T %D' does the job).
> Side question: why not include hit/miss in the default output?
>
> Thanks for the help.
>
> Best,
>
> --
> Raphael Mazelier
>
> On 14/11/2017 23:41, Guillaume Quintard wrote:
>
> Hi,
>
> Let's look at the usual suspects first: can we get the output of
> "ps aux | grep varnish" and a pastebin of "varnishncsa -1"?
>
> Are you using any vmods?
>
> man varnishncsa will help craft a format line with the response time
> (on mobile now, I don't have access to it).
>
> Cheers,
>
> --
> Guillaume Quintard
>
> On Nov 14, 2017 23:25, "Raphael Mazelier" <[email protected]> wrote:
>
>> Hello list,
>>
>> First of all, despite my mail subject, I really appreciate Varnish.
>> We use it a lot at work (hundreds of instances) with success and,
>> unfortunately, some pain lately.
>>
>> TL;DR: upgrading from Varnish 2 to Varnish 4 and 5 on one of our
>> infrastructures brought us serious trouble and instability on this
>> platform, and we are a bit desperate/frustrated.
>>
>> Long story.
>>
>> A bit of context:
>>
>> This is a very complex platform serving an IPTV service with real
>> traffic (8k req/s at peak, even more when it works well).
>> It is composed of a two-stage reverse proxy cache (3 x 2 Varnish for
>> stage 1, 2 Varnish for stage 2, so 8 in total) and a lot of different
>> backends (PHP applications, Node.js apps, remote backends *sigh*, and
>> even piped ones).
>> This is a big historical spaghetti app. We plan to rebuild it from
>> scratch in 2018.
>> The stage 1 Varnish instances are split into two pools handling
>> different topologies of clients.
>>
>> A lot of the logic is in Varnish/VCL itself: lots of URL rewriting,
>> lots of header manipulation, backend selection, and even ESI
>> processing...
>> The VCL of the stage 1 Varnish instances is almost 3000 lines long.
>>
>> But for now we have to live with it and deal with it.
>>
>> History of the problem:
>>
>> At the beginning all Varnish instances were on 2.x and things worked
>> reasonably well.
>> This summer we needed to upgrade Varnish to handle very long headers
>> (a product requirement).
>> So, after a short battle porting our VCL to VCL 4.0, we started using
>> Varnish 4.
>> Shortly after, things began to go very badly.
>>
>> The first issue we hit was memory exhaustion on both stages, and the
>> OOM killer...
>> We tested a lot of things, and along the way we upgraded to Varnish 5.
>> We fixed it by resizing the pools and switching to the file storage
>> backend (memory before).
>> Memory is now stable (we have large pools, 32G, and, strangely, we
>> never see objects being nuked, which is good or bad depending on your
>> view).
>> We have also fixed a lot of things in our VCL.
>>
>> The problem we are fighting now is only on the stage 1 Varnish, and
>> specifically on one pool (the busiest one).
>> When everything goes well, the average CPU usage is 30%, memory
>> stabilizes around 12G, and the cache hit ratio is around 0.85.
>> The problem happens randomly (not every day), but during our peaks.
>> The CPU rises quickly to reach 350% (4 cores) and the load goes above 3.
>> When the problem is there, Varnish still delivers requests (we didn't
>> see dropped or rejected connections), but our application begins to
>> lose users, and a big chunk of business with them. I suspect this is
>> because timeouts are very aggressive on the client side and Varnish
>> must be answering slowly.
>>
>> - First question: how can we see the response time of requests on the
>> Varnish server? (varnishncsa something?)
>>
>> I also suspect some kind of request queuing; stracing Varnish when it
>> happens shows a lot of futex waits?!
>> The frustrating part is that restarting Varnish fixes the problem
>> immediately, and the CPU stays normal afterwards, even if the traffic
>> peak is not over.
>> So there is clearly something piling up in Varnish which causes our
>> problem.
>>
>> - Second question: how can we see the number of queued connections,
>> long connections and so on?
>>
>> At this stage we welcome any kind of help/hints for debugging (and
>> given the business impact we can consider professional support).
>>
>> PS: I always have the option to scale out, spinning up a lot of new
>> Varnish instances, but this seems very frustrating...
>>
>> Best,
>>
>> --
>> Raphael Mazelier
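On the second question in the quoted mail (queued connections) and the futex waits, the usual place to look is the thread and session counters. A rough sketch, assuming Varnish 4/5 counter names (check varnishstat -l on your version):

  # thread and queueing counters; sess_queued/sess_dropped growing during an
  # incident usually means the worker thread pools are exhausted
  varnishstat -1 | grep -E 'MAIN\.(threads|thread_queue_len|sess_queued|sess_dropped)'

  # current thread pool sizing
  varnishadm param.show thread_pools
  varnishadm param.show thread_pool_min
  varnishadm param.show thread_pool_max

If sess_queued climbs while the CPU spikes, raising thread_pool_min/thread_pool_max (or finding out what the worker threads are blocked on) is normally the next step.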
_______________________________________________
varnish-misc mailing list
[email protected]
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
