Re: [Wikitech-l] [WikimediaMobile] Caching Problem with Mobile Main Page?

2013-05-05 Thread Faidon Liambotis
On Fri, May 03, 2013 at 03:19:13PM -0700, Asher Feldman wrote:
 1) Our multicast purge stream is very busy and isn't split up by cache
 type, so it includes lots of purge requests for images on
 upload.wikimedia.org.  Processing the purges is somewhat cpu intensive, and
 I saw doing so once per varnish server as preferable to twice.

I believe the plan is to split up the multicast groups *and* to filter
based on predefined regexps on the HTCP-PURGE layer, via the
varnishhtcpd rewrite. But I may be mistaken, Mark and Brandon will know
more.
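
As a rough illustration of the regexp filtering idea (not the actual varnishhtcpd code), here is a minimal Python sketch. The per-cache-type patterns and the names PURGE_FILTERS and wanted_purges are assumptions made up for this example; the real regexps are not given in this thread.

```python
import re

# Hypothetical per-cache-type patterns; the real regexps are not in this thread.
PURGE_FILTERS = {
    "text": re.compile(r"^https?://[^/]+\.wikipedia\.org/"),
    "upload": re.compile(r"^https?://upload\.wikimedia\.org/"),
}


def wanted_purges(cache_type, urls):
    """Return only the purge URLs that are relevant to the given cache type."""
    pattern = PURGE_FILTERS[cache_type]
    return [u for u in urls if pattern.match(u)]


# A tiny stand-in for the shared multicast purge stream.
stream = [
    "http://en.wikipedia.org/wiki/Main_Page",
    "http://upload.wikimedia.org/wikipedia/commons/a/a9/Example.jpg",
]
text_purges = wanted_purges("text", stream)
upload_purges = wanted_purges("upload", stream)
```

Filtering before the purge reaches Varnish means a text cache never spends CPU on image purges, which is the cost Asher describes above.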

 There are multiple ways to approach making the purges sent to the frontends
 actually work such as rewriting the purges in varnish, rewriting them
 before they're sent to varnish depending on where they're being sent, or
 perhaps changing how cached objects are stored in the frontend.  I
 personally think it's all an unnecessary waste of resources and prefer my
 original approach.

Although the current VCL calls vcl_recv_purge after the rewrite step
(and hence rewrites purges too), unless I'm mistaken this is not
actually necessary. The incoming purges already match the way the
objects are stored in the cache: both are without the .m. (et al)
prefix, since normal desktop purges are matched against objects that had
their URLs rewritten in vcl_recv. Handling purges after the rewrite step
may be unnecessary, but that doesn't make it a bad idea: it doesn't hurt
much, and it also allows us to purge via the original .m. URL, which is
what a person might do instinctively.

Mobile purges were indeed broken recently, in a way similar to what you
guessed, by I77b88f[1] (Restrict PURGE lookups to mobile domains), but
they were fixed shortly afterwards by I76e5c4[2], a full day before the
frontend cache TTL was removed.

1: https://gerrit.wikimedia.org/r/#q,I77b88f3b4bb5ec84f70b2241cdd5dc496025e6fd,n,z
2: https://gerrit.wikimedia.org/r/#q,I76e5c4218c1dec06673aa5121010875031c1a1e2,n,z

What actually broke them again this time is I3d0280[3], which stripped
absolute URIs before vcl_recv_purge, despite the latter having code that
matches only against absolute URIs. This is my commit, so I'm
responsible for this breakage, although in my defence I have an even
score now for discovering the flaw last time around :)

I've pushed and merged I08f761[4] which moves rewrite_proxy_urls after
vcl_recv_purge and should hopefully unbreak purging while also not
reintroducing BZ #47807.
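
To make the ordering problem concrete, here is a minimal Python sketch. The two functions are stand-ins named after the VCL subroutines, not the real VCL; the request dict and the "purged" flag are assumptions made up for this example.

```python
import re

ABSOLUTE = re.compile(r"^http://[^/]+")


def rewrite_proxy_urls(req):
    """Strip scheme and host from an absolute request URL, leaving the path."""
    return dict(req, url=ABSOLUTE.sub("", req["url"]))


def vcl_recv_purge(req):
    """Stand-in for a purge handler that only acts on absolute URLs."""
    if req["method"] == "PURGE" and req["url"].startswith("http:"):
        return dict(req, purged=True)
    return req


purge = {"method": "PURGE", "url": "http://en.wikipedia.org/wiki/Main_Page"}

# Stripping first: the purge check never sees an absolute URL, so it is a no-op.
broken = vcl_recv_purge(rewrite_proxy_urls(purge))

# Checking first, then stripping: the purge is handled and the URL is still cleaned up.
fixed = rewrite_proxy_urls(vcl_recv_purge(purge))
```

In this sketch only the second ordering marks the request as purged, while both end up with the same relative URL.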

3: https://gerrit.wikimedia.org/r/#q,I3d02804170f7e502300329740cba9f45437a24fa,n,z
4: https://gerrit.wikimedia.org/r/#q,I08f7615230037a6ffe7d1130a2a6de7ba370faf2,n,z

As a side note, notice how rewrite_proxy_urls and vcl_recv_purge are
both flawed in the same way: the former exists solely to work around a
Varnish bug with absolute URIs, while the latter *depends* on that bug
manifesting in order to work at all. req.url should always be a
(relative) URL, and hence the if (req.url ~ '^http:') comparison in
vcl_recv_purge should normally always evaluate to false, making the
whole function a no-op.

However, due to the bug in question, Varnish doesn't special-handle
absolute URIs, in violation of RFC 2616. This, in combination with the
fact that varnishhtcpd always sends absolute URIs (due to an
RFC-compliant behavior of LWP's proxy() method), is why this seemingly
wrong VCL code actually works as intended.
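
For reference, the handling that RFC 2616 (section 5.2) asks of a server can be sketched in a few lines of Python. This is only an illustration of the rule, not Varnish's implementation; normalize_request and the example Host value are assumptions made up here.

```python
from urllib.parse import urlsplit


def normalize_request(request_uri, host_header):
    """Return (host, path) the way RFC 2616 section 5.2 describes."""
    parts = urlsplit(request_uri)
    if parts.scheme and parts.netloc:
        # Absolute URI: its host wins and the Host header is ignored.
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        return parts.netloc, path
    # Relative URL: the host comes from the Host header.
    return host_header, request_uri


host, path = normalize_request(
    "http://en.wikipedia.org/wiki/Main_Page", "varnish.example"
)
# After normalization the path no longer starts with "http:", so a check
# like req.url ~ '^http:' would not match.
still_absolute = path.startswith("http:")
```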

This Varnish bug was reported by Tim upstream[5] and the fix is
currently sitting in Varnish's git master[6]. It's simple enough that it
might be worth backporting, although it might be more trouble than it's
worth, considering how it will break purges with our current VCL :)

5: https://www.varnish-cache.org/trac/ticket/1255
6: https://www.varnish-cache.org/trac/changeset/2bbb032bf67871d7d5a43a38104d58f747f2e860

Cheers,
Faidon

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] [WikimediaMobile] Caching Problem with Mobile Main Page?

2013-05-05 Thread Asher Feldman
Faidon - thanks for the more accurate trackdown, and fix!


Re: [Wikitech-l] [WikimediaMobile] Caching Problem with Mobile Main Page?

2013-05-03 Thread Arthur Richards
+wikitech-l

I've confirmed the issue on my end; ?action=purge seems to have no effect
and the 'last modified' notification on the mobile main page looks correct
(though the content itself is out of date and not in sync with the 'last
modified' notification). What's doubly weird to me is that the
'Last-Modified' HTTP response header says:

Last-Modified: Tue, 30 Apr 2013 00:17:32 GMT

Which appears to be newer than when the content I'm seeing on the main page
was updated... Anyone from ops have an idea what might be going on?


On Thu, May 2, 2013 at 10:01 PM, Yuvi Panda yuvipa...@gmail.com wrote:

 Encountered
https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Issue_with_Main_Page_on_mobile.2C_viz._it_hasn.27t_changed_since_Tuesday

 Some people seem to be having problems with the mobile main page being
 cached too much. Can someone look into it?


 --
 Yuvi Panda T
 http://yuvi.in/blog

 ___
 Mobile-l mailing list
 mobil...@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/mobile-l




--
Arthur Richards
Software Engineer, Mobile
[[User:Awjrichards]]
IRC: awjr
+1-415-839-6885 x6687

Re: [Wikitech-l] [WikimediaMobile] Caching Problem with Mobile Main Page?

2013-05-03 Thread Asher Feldman
The problem is due to recent changes that were made to how mobile caching
works.  I just flushed the cache on all of the frontend varnish instances,
which appears to have cleared up the symptom, but the underlying problem
isn't actually fixed.  Note, the frontend instances just have 1GB of cache,
so only very popular objects (like the enwiki front page) avoid getting
LRU'd.  The backend varnish instances utilize the SSDs and perform the
heavy caching work.

When I originally built this, I had the frontends force a short (300s) ttl
on all cacheable objects, while the backends honored the times specified by
mediawiki.

I chose to only send purges to the backend instances (via wikia's old
varnishhtcpd) and let the frontend instances catch up with their short
ttls.  My reasoning was:

1) Our multicast purge stream is very busy and isn't split up by cache
type, so it includes lots of purge requests for images on
upload.wikimedia.org.  Processing the purges is somewhat cpu intensive, and
I saw doing so once per varnish server as preferable to twice.

2) Purges are for URLs such as en.wikipedia.org/wiki/Main_Page.  The
frontend varnish instance strips the m subdomain before sending the request
onwards, but still caches content based on the request url.  Purges are
never sent for en.m.wikipedia.org/wiki/Main_Page - every purge would need
to be rewritten to apply to the frontend varnishes.  Doing this blindly
would be more expensive than it should be, since a significant percentage
of purge statements aren't applicable.
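
To illustrate point 2, here is a minimal Python sketch of rewriting a desktop purge URL to its .m. form while skipping purges that have no mobile equivalent. The MOBILE_PROJECTS set and the to_mobile_purge helper are assumptions made up for this example, not part of varnishhtcpd or the production VCL.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical set of project domains that have an .m. variant.
MOBILE_PROJECTS = {"wikipedia.org", "wiktionary.org"}


def to_mobile_purge(url):
    """Return the .m. form of a desktop purge URL, or None if not applicable."""
    parts = urlsplit(url)
    labels = parts.netloc.split(".")
    if len(labels) != 3 or ".".join(labels[1:]) not in MOBILE_PROJECTS:
        return None  # e.g. upload.wikimedia.org: nothing to purge on mobile
    mobile_host = ".".join([labels[0], "m"] + labels[1:])
    return urlunsplit(parts._replace(netloc=mobile_host))


mobile = to_mobile_purge("http://en.wikipedia.org/wiki/Main_Page")
skipped = to_mobile_purge("http://upload.wikimedia.org/wikipedia/commons/a/a9/Example.jpg")
```

Returning None for non-applicable hosts is one way to avoid the "blind" rewriting cost mentioned above, since those purges can be dropped before they reach the frontend.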

I don't think my original approach had any fans.  Purges are now sent to
both varnish instances per host, and more recently, the 300s ttl override
was removed from the frontends.  But all of the purges sent to the
frontends are no-ops.

There are multiple ways to approach making the purges sent to the frontends
actually work such as rewriting the purges in varnish, rewriting them
before they're sent to varnish depending on where they're being sent, or
perhaps changing how cached objects are stored in the frontend.  I
personally think it's all an unnecessary waste of resources and prefer my
original approach.

-Asher
