Help trying to figure out why an output_filter is not called.

2011-01-05 Thread Joshua Marantz
One of the improvements mod_pagespeed is supposed to make to sites is to
extend the cache lifetime of their resources indefinitely by including a
content hash in the URL.  This is working well for a large number of sites,
but I encountered one today where it does not work.

To accomplish the cache extension, overriding any wildcarded or
directory-based expire settings a site admin has set for their resources,
mod_pagespeed inserts two output filters.

The first one does the HTML rewriting:

  ap_register_output_filter(
      kModPagespeedFilterName, instaweb_out_filter, NULL, AP_FTYPE_RESOURCE);

When instaweb_out_filter runs, it makes this transformation:

  before:  <script src="foo.js"></script>
  after:   <script src="foo.js.pagespeed.ce.HASH.js"></script>

The rewritten resource, foo.js.pagespeed.ce.HASH.js, is served by a hook:

  ap_hook_handler(instaweb_handler, NULL, NULL, APR_HOOK_FIRST - 1);

Knowing that mod_headers will later override Cache-Control, which we don't
want, our hook serves the .js file with our own header:

   X-Mod-Pagespeed-Repair: max-age=31536000

We use a second output filter to repair it:

  // We need our repair headers filter to run after mod_headers.
  // mod_headers, which is the filter that is used to add the cache
  // settings, is AP_FTYPE_CONTENT_SET. Using (AP_FTYPE_CONTENT_SET + 2)
  // to make sure that we run after mod_headers.
  ap_register_output_filter(
      InstawebContext::kRepairHeadersFilterName, repair_caching_header, NULL,
      static_cast<ap_filter_type>(AP_FTYPE_CONTENT_SET + 2));
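
The ordering argument in that comment can be illustrated with a small
standalone model.  This is only a sketch: ordered_filter and chain_insert are
hypothetical stand-ins, not httpd types or functions, and the handling of
filters with equal type is an assumption of the sketch.  The two numeric
constants are meant to mirror AP_FTYPE_RESOURCE and AP_FTYPE_CONTENT_SET from
httpd's util_filter.h.

```c
#include <assert.h>
#include <stddef.h>

/* Values intended to mirror ap_filter_type in httpd's util_filter.h. */
enum { FTYPE_RESOURCE = 10, FTYPE_CONTENT_SET = 20 };

/* Hypothetical stand-in for ap_filter_t: just enough to show ordering. */
typedef struct ordered_filter {
    const char *name;
    int ftype;
    struct ordered_filter *next;
} ordered_filter;

/* Insert f so the chain stays sorted by ascending ftype.  A new filter is
 * placed after existing filters of the same type (sketch assumption).
 * Returns the new head of the chain. */
static ordered_filter *chain_insert(ordered_filter *head, ordered_filter *f)
{
    ordered_filter **link = &head;

    assert(f != NULL);
    while (*link != NULL && (*link)->ftype <= f->ftype) {
        link = &(*link)->next;
    }
    f->next = *link;
    *link = f;
    return head;
}
```

In this model a filter registered at FTYPE_CONTENT_SET + 2 always ends up
behind one registered at FTYPE_CONTENT_SET, whatever order they are added in,
which is the property the repair filter depends on.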

This is added into the filter chain whenever we want to extend cache:

  apr_table_add(request->headers_out, "X-Mod-Pagespeed-Repair",
                cache_control);
  ap_add_output_filter("X-Mod-Pagespeed-Repair", NULL, request,
                       request->connection);

When working properly, this header is removed from request->headers_out by
repair_caching_header():

  const char* cache_control = apr_table_get(request->headers_out,
                                            "X-Mod-Pagespeed-Repair");
  if (cache_control != NULL) {
    SetCacheControl(cache_control, request);
    apr_table_unset(request->headers_out, kRepairCachingHeader);
  }

Where SetCacheControl also makes the Expires header consistent, etc.
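
To make the intended effect concrete, here is a minimal standalone sketch of
the repair step.  header_table, table_get, table_set, table_unset and
repair_cache_control are hypothetical stand-ins for the APR table calls and
for SetCacheControl; the sketch matches header names exactly and does not
touch Expires, both of which are simplifications.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_HEADERS 8

/* Hypothetical stand-in for an apr_table_t of response headers. */
typedef struct header_entry { const char *key; const char *value; } header_entry;
typedef struct header_table {
    header_entry entries[MAX_HEADERS];
    size_t count;
} header_table;

/* Return the value stored under key, or NULL if it is absent. */
static const char *table_get(const header_table *t, const char *key)
{
    size_t i;
    for (i = 0; i < t->count; i++) {
        if (strcmp(t->entries[i].key, key) == 0) return t->entries[i].value;
    }
    return NULL;
}

/* Remove every entry stored under key (entry order is not preserved). */
static void table_unset(header_table *t, const char *key)
{
    size_t i = 0;
    while (i < t->count) {
        if (strcmp(t->entries[i].key, key) == 0) {
            t->entries[i] = t->entries[t->count - 1];
            t->count--;
        } else {
            i++;
        }
    }
}

/* Replace any existing value for key with value. */
static void table_set(header_table *t, const char *key, const char *value)
{
    table_unset(t, key);
    assert(t->count < MAX_HEADERS);
    t->entries[t->count].key = key;
    t->entries[t->count].value = value;
    t->count++;
}

/* If the repair header is present, copy it over Cache-Control and drop it.
 * Returns 1 if a repair was applied, 0 otherwise. */
static int repair_cache_control(header_table *headers_out)
{
    const char *repair = table_get(headers_out, "X-Mod-Pagespeed-Repair");
    if (repair == NULL) return 0;
    table_set(headers_out, "Cache-Control", repair);
    table_unset(headers_out, "X-Mod-Pagespeed-Repair");
    return 1;
}
```

After a successful repair the table holds only the long-lived Cache-Control
value, so a response that still shows X-Mod-Pagespeed-Repair suggests this
step never ran.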

While this approach is complex, I've never seen it fail until today, on the
site http://law.thu.edu.tw/main.php . On that site, the X-Mod-Pagespeed-Repair
header is visible (it should have been removed) and the Cache-Control header
has the value set from the conf files (public, max-age=600).  So on this
server, the repair_caching_header filter is not being run, despite having
been programmatically inserted by our code in the same place where we add the
X-Mod-Pagespeed-Repair header.

What might be going wrong on this server to cause this to fail?  Could some
other filter be somehow finding our filter and killing it?  Or sending the
bytes directly to the network before our filter has a chance to run?

Thanks!
-Josh


Re: Help trying to figure out why an output_filter is not called.

2011-01-05 Thread Joshua Marantz
Thanks again for the fast response, Ben!

On Wed, Jan 5, 2011 at 8:57 AM, Ben Noordhuis i...@bnoordhuis.nl wrote:

 On Wed, Jan 5, 2011 at 14:45, Joshua Marantz jmara...@google.com wrote:
  other filter be somehow finding our filter and killing it?  Or sending the
  bytes directly to the network before our filter has a chance to run?

 Possibly, yes.


Can you elaborate?   Is this a common practice, to write bytes directly to
the network from an output filter?  What should I look for?  The owner of
the site where this is breaking sent me a few conf files, and they list
some of the modules loaded:

LoadModule perl_module modules/mod_perl.so
LoadModule mono_module modules/mod_mono.so
LoadModule bwlimited_module modules/mod_bwlimited.so
LoadModule bw_module modules/mod_bw.so
LoadModule jk_module modules/mod_jk.so

Does anyone know anything about these?  Could one of these have inserted an
output filter that spews bytes directly to the network?  I'll try to find
sources for those but if someone knows off-hand that would be helpful.


 By the way, why the complex setup? If you don't want the mod_headers
 filter to run, insert your filter before it, then remove it for each
 request that you handle.


This is an interesting idea.

I guess we should eliminate FIXUP_HEADERS_OUT, FIXUP_HEADERS_ERR, and
MOD_EXPIRES.  Are there any other similar header-mucking-filters I need to
kill?  I don't mind squirreling through the source code to find these names
(all are string literals in .c files) but I'm nervous they could change
without warning in a future version.

Moreover, expires_insert_filter runs as APR_HOOK_MIDDLE which means it runs
after my content-generator, which means that it won't have been inserted by
the time when I want to set my caching headers.

I guess that means I have to insert a new late-running hook that kills
undesirable output filters.  Does that wind up being simpler?

-Josh


Re: Help trying to figure out why an output_filter is not called.

2011-01-05 Thread Joshua Marantz
On Wed, Jan 5, 2011 at 10:43 AM, Ben Noordhuis i...@bnoordhuis.nl wrote:

   I guess we should eliminate FIXUP_HEADERS_OUT, FIXUP_HEADERS_ERR, and
  MOD_EXPIRES.


 Are there any other similar header-mucking-filters I need to kill?

  Moreover, expires_insert_filter runs as APR_HOOK_MIDDLE which means it runs
  after my content-generator, which means that it won't have been inserted by
  the time when I want to set my caching headers.

 You can remove it from your handler, scan
 r->output_filters->frec->name to find the filter.


I'm not following you.  mod_expires.c has this:

static void expires_insert_filter(request_rec *r) {
  ...ap_add_output_filter("MOD_EXPIRES", NULL, r, r->connection);...
}

static void register_hooks(apr_pool_t *p) {
    /* mod_expires needs to run *before* the cache save filter which is
     * AP_FTYPE_CONTENT_SET-1.  Otherwise, our expires won't be honored.
     */
    ap_register_output_filter("MOD_EXPIRES", expires_filter, NULL,
                              AP_FTYPE_CONTENT_SET-2);
    ap_hook_insert_error_filter(expires_insert_filter, NULL, NULL,
                                APR_HOOK_MIDDLE);
    ap_hook_insert_filter(expires_insert_filter, NULL, NULL,
                          APR_HOOK_MIDDLE);
}

So if I try to remove the 'expires' filter from my handler (which runs
early) then mod_expires will have a handler that runs later that inserts it
after my module has completed.  Hence:

 I guess that means I have to insert a new late-running hook that kills
  undesirable output filters.  Does that wind up being simpler?

 The above is probably easier but whatever ends up being the most
 readable / maintainable, right?


And also functional :) which, evidently, my current solution is not, at
least in the presence of mod_perl and mod_php.

Your solution has the advantage of being more robust when upstream filters
write directly to the network.  I just wish I didn't have to depend on my
copies of string literals from .c files I don't control.  But as you said
core-filters will hopefully not change internal string constants often.   I
just coded it up and it seems to work :)
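
For those following along, the by-name removal can be sketched as a
standalone model.  This is not the code that went into mod_pagespeed:
named_frec, named_filter, sketch_request and remove_output_filters_named are
hypothetical stand-ins whose fields only imitate the
r->output_filters->frec->name path, and a real module would call
ap_remove_output_filter() instead of unlinking by hand.  The comparison
ignores case on the assumption that httpd may normalize registered filter
names; that is worth checking against the httpd version in use.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-ins for ap_filter_rec_t, ap_filter_t and request_rec. */
typedef struct named_frec { const char *name; } named_frec;
typedef struct named_filter {
    named_frec *frec;
    struct named_filter *next;
} named_filter;
typedef struct sketch_request { named_filter *output_filters; } sketch_request;

/* Case-insensitive string equality: returns 1 if equal, 0 otherwise. */
static int same_name(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b)) return 0;
        a++;
        b++;
    }
    return *a == *b;
}

/* Unlink every output filter whose name matches; returns how many were
 * removed.  The filter structs themselves are left for the caller to own. */
static int remove_output_filters_named(sketch_request *r, const char *name)
{
    named_filter **link;
    int removed = 0;

    assert(r != NULL && name != NULL);
    link = &r->output_filters;
    while (*link != NULL) {
        if (same_name((*link)->frec->name, name)) {
            *link = (*link)->next;
            removed++;
        } else {
            link = &(*link)->next;
        }
    }
    return removed;
}
```

Calling it once per name (MOD_EXPIRES, FIXUP_HEADERS_OUT, FIXUP_HEADERS_ERR)
leaves the rest of the chain untouched, and a return value of 0 simply means
that filter was not in the chain for this request.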

-Josh


Re: Help trying to figure out why an output_filter is not called.

2011-01-05 Thread Ben Noordhuis
On Wed, Jan 5, 2011 at 20:40, Joshua Marantz jmara...@google.com wrote:
 So if I try to remove the 'expires' filter from my handler (which runs
 early) then mod_expires will have a handler that runs later that inserts it
 after my module has completed.

No, it's the other way around. mod_expires uses the insert_filter hook
to insert its filter before your handler is run (and how could it be
otherwise? Output filters are there to post-process the content your
handler generates).

Have a look at ap_invoke_handler() in config.c, that should give you a
handle on how the filter chain works. But don't hesitate to post your
questions if you have them, of course. :)
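
The ordering described above can be modelled in a few lines.  This is a
simplified sketch rather than the real ap_invoke_handler(): sketch_request,
sketch_hook, sketch_invoke_handler and the two example hooks are hypothetical
names, and the model only records which step ran when.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_EVENTS 8

typedef enum sketch_event { EV_INSERT_FILTER, EV_HANDLER } sketch_event;

/* Hypothetical stand-in for request_rec: a log of what ran, in order. */
typedef struct sketch_request {
    sketch_event events[MAX_EVENTS];
    size_t count;
} sketch_request;

typedef void (*sketch_hook)(sketch_request *r);

/* Append one event to the request's log. */
static void record(sketch_request *r, sketch_event ev)
{
    assert(r->count < MAX_EVENTS);
    r->events[r->count++] = ev;
}

/* Models an insert_filter hook such as the one mod_expires registers. */
static void sketch_expires_insert_filter(sketch_request *r)
{
    record(r, EV_INSERT_FILTER);
}

/* Models a content handler such as instaweb_handler. */
static void sketch_content_handler(sketch_request *r)
{
    record(r, EV_HANDLER);
}

/* Run every insert_filter hook first, then the content handler, so the
 * output filter chain is already populated when the handler starts. */
static void sketch_invoke_handler(sketch_request *r, sketch_hook *insert_hooks,
                                  size_t n_hooks, sketch_hook handler)
{
    size_t i;
    for (i = 0; i < n_hooks; i++) {
        insert_hooks[i](r);
    }
    handler(r);
}
```

Because the insert step comes first in this model, a handler that scans the
chain will already find the filter that the insert hook added, which is why
removing it from the handler is enough.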


Re: Help trying to figure out why an output_filter is not called.

2011-01-05 Thread Joshua Marantz
This has certainly gotten off topic :)

http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=nginx+vs+apache has
lots of interesting opinions on the subject.

Having said that, mod_pagespeed's initial target is Apache because that's
the dominant server stack driving the web.  I'm optimistic that nginx will
also be a compelling opportunity at some point and I'm anxious to learn
more.



Now -- back on topic -- this issue is tracked as
http://code.google.com/p/modpagespeed/issues/detail?id=179 for those
following along at home, and hopefully will be resolved shortly based on the
advice of the contributors to this thread.

Thanks again, everyone,
-Josh

On Wed, Jan 5, 2011 at 8:40 PM, Ray Morris supp...@bettercgi.com wrote:

  Just a quick note since you mentioned nginx.  Nginx is of
 course normally used by people wanting higher performance than
 they are getting from Apache, because certain tests seemed to
 suggest that nginx can significantly outperform Apache in some
 cases. If that's the case for you, we learned something very
 interesting.

  We wondered how nginx could possibly be much faster since the
 speed of the disk itself is normally the limiting factor.  Was
 there something to be learned from nginx which could be applied
 to Apache?  At the end of all of the testing, we learned what
 caused the large apparent difference.  noatime.  Nginx effectively
 skips atime updates, which can make a huge difference.  By simply
 mounting the directory with the noatime option, any reasonable
 Apache configuration will have about the same performance as nginx,
 which is basically the performance of the underlying storage.

   People build complex systems with nginx as a proxy to Apache,
 but the same or better performance, with better standards compliance
 and better reliability, can be obtained by just setting noatime
 directly rather than getting noatime accidentally as a
 side effect of nginx.

   With noatime set, one server or another might be 1% faster,
 but using TWO servers, with one as a proxy, will be slower than
 just simply using Apache, and in no case will nginx be SIGNIFICANTLY
 faster, when using noatime.
 --
 Ray Morris
 supp...@bettercgi.com

 Strongbox - The next generation in site security:
 http://www.bettercgi.com/strongbox/

 Throttlebox - Intelligent Bandwidth Control
 http://www.bettercgi.com/throttlebox/

 Strongbox / Throttlebox affiliate program:
 http://www.bettercgi.com/affiliates/user/register.php



 On 01/05/2011 03:16:21 PM, Ben Noordhuis wrote:

 On Wed, Jan 5, 2011 at 22:03, Joshua Marantz jmara...@google.com wrote:
  Right you are.  That's much simpler then.  Thanks!

 My pleasure, Joshua.

 Two quick questions, hope you don't mind: Is mod_pagespeed an official
 Google project? Or is it something you guys do on your day off? And
 are there plans for a nginx port?