Help trying to figure out why an output_filter is not called.
One of the improvements mod_pagespeed is supposed to do to sites is extend the cache lifetime of their resources indefinitely by including a content hash in the URL. This is working well for a large number of sites, but I encountered one today where it does not work. To accomplish the cache extension, overriding any wildcarded or directory-based expire settings a site admin has set for their resources, mod_pagespeed inserts two output filters. The first one does the HTML rewriting: ap_register_output_filter( kModPagespeedFilterName, instaweb_out_filter, NULL, AP_FTYPE_RESOURCE); When instaweb_out_filter runs, it makes this transformation: before: script src=foo.js/script after: script src=foo.js.pagespeed.ce.HASH.js/script The rewritten resource, foo.js.pagespeed.ce.HASH.js, is served by a hook: ap_hook_handler(instaweb_handler, NULL, NULL, APR_HOOK_FIRST - 1); Knowing that mod_headers will later override Cache-Control, which we don't want, our hook serves the .js file with our own header: X-Mod-Pagespeed-Repair: max-age=31536000 We a second output filter, to repair it: // We need our repair headers filter to run after mod_headers. The // mod_headers, which is the filter that is used to add the cache settings, is // AP_FTYPE_CONTENT_SET. Using (AP_FTYPE_CONTENT_SET + 2) to make sure that we // run after mod_headers. ap_register_output_filter( InstawebContext::kRepairHeadersFilterName, repair_caching_header, NULL, static_castap_filter_type(AP_FTYPE_CONTENT_SET + 2)); This is added into the filter chain whenever we want to extend cache: apr_table_add(request-headers_out, X-Mod-Pagespeed-Repair, cache_control); ap_add_output_filter(X-Mod-Pagespeed-Repair, NULL, request, request-connection); When working properly, this header is removed from request-headers_out by repair_caching_header(): const char* cache_control = apr_table_get(request-headers_out, X-Mod-Pagespeed-Repair); if (cache_control != NULL) { SetCacheControl(cache_control, request); apr_table_unset(request-headers_out, kRepairCachingHeader); } Where SetCacheControl also makes the Expires header consistent, etc. While this approach is complex, I've never seen it fail until today, on the site http://law.thu.edu.tw/main.php . On that site, the X-Mod-Pagespeed-Repair header is visible (it should have been removed) and the Cache-Control header has the value set from the conf files (public, max-age=600). So on this server, the repair_caching_header filter is not being run, despite having been programatically inserted by our code in the same place where we add X-Mod-Pagespeed-Repair header. What might be going wrong in his server to cause this to fail? Could some other filter be somehow finding our filter and killing it? Or sending the bytes directly to the network before our filter has a chance to run? Thanks! -Josh
Re: Help trying to figure out why an output_filter is not called.
Thanks again for the fast response, Ben! On Wed, Jan 5, 2011 at 8:57 AM, Ben Noordhuis i...@bnoordhuis.nl wrote: On Wed, Jan 5, 2011 at 14:45, Joshua Marantz jmara...@google.com wrote: other filter be somehow finding our filter and killing it? Or sending the bytes directly to the network before our filter has a chance to run? Possibly, yes. Can you elaborate? Is this a common practice, to write bytes directly to the network from an output filter? What should I look for? The owner of the site where this is breaking sent me a few conf files and it enumerates some of the modules inserted: LoadModule perl_module modules/mod_perl.so LoadModule mono_module modules/mod_mono.so LoadModule bwlimited_module modules/mod_bwlimited.so LoadModule bw_module modules/mod_bw.so LoadModule jk_module modules/mod_jk.so Does anyone know anything about these? Could one of these have inserted an output filter that spews bytes directly to the network? I'll try to find sources for those but if someone knows off-hand that would be helpful. By the way, why the complex setup? If you don't want the mod_headers filter to run, insert your filter before it, then remove it for each request that you handle. This is an interesting idea. I guess we should eliminate FIXUP_HEADERS_OUT, FIXUP_HEADERS_ERR, and MOD_EXPIRES. Are there any other similar header-mucking-filters I need to kill? I don't mind squirreling through the source code to find these names (all are string literals in .c files) but I'm nervous they could change without warning in a future version. Moreover, expires_insert_filter runs as APR_HOOK_MIDDLE which means it runs after my content-generator, which means that it won't have been inserted by the time when I want to set my caching headers. I guess that means I have to insert a new late-running hook that kills undesirable output filters. Does that wind up being simpler? -Josh
Re: Help trying to figure out why an output_filter is not called.
On Wed, Jan 5, 2011 at 10:43 AM, Ben Noordhuis i...@bnoordhuis.nl wrote: I guess we should eliminate FIXUP_HEADERS_OUT, FIXUP_HEADERS_ERR, and MOD_EXPIRES. Are there any other similar header-mucking-filters I need to kill? Moreover, expires_insert_filter runs as APR_HOOK_MIDDLE which means it runs after my content-generator, which means that it won't have been inserted by the time when I want to set my caching headers. You can remove it from your handler, scan r-output_filters-frec-name to find the filter. I'm not following you. mod_expires.c has this: static void expires_insert_filter(request_rec *r) { ...ap_add_output_filter(MOD_EXPIRES, NULL, r, r-connection);... } static void register_hooks(apr_pool_t *p) { /* mod_expires needs to run *before* the cache save filter which is * AP_FTYPE_CONTENT_SET-1. Otherwise, our expires won't be honored. */ ap_register_output_filter(MOD_EXPIRES, expires_filter, NULL, AP_FTYPE_CONTENT_SET-2); ap_hook_insert_error_filter(expires_insert_filter, NULL, NULL, APR_HOOK_MIDDLE); ap_hook_insert_filter(expires_insert_filter, NULL, NULL, APR_HOOK_MIDDLE); } So if I try to remove the 'expires' filter from my handler (which runs early) then mod_expires will have a handler that runs later that inserts it after my module has completed. Hence: I guess that means I have to insert a new late-running hook that kills undesirable output filters. Does that wind up being simpler? The above is probably easier but whatever ends up being the most readable / maintainable, right? And also functional :) which, evidently, my current solution is not, at least in the presence of mod_perl and mod_php. Your solution has the advantage of being more robust when upstream filters write directly to the network. I just wish I didn't have to depend on my copies of string literals from .c files I don't control. But as you said core-filters will hopefully not change internal string constants often. I just coded it up and it seems to work :) -Josh
Re: Help trying to figure out why an output_filter is not called.
On Wed, Jan 5, 2011 at 20:40, Joshua Marantz jmara...@google.com wrote: So if I try to remove the 'expires' filter from my handler (which runs early) then mod_expires will have a handler that runs later that inserts it after my module has completed. No, it's the other way around. mod_expires uses the insert_filter hook to insert its filter before your handler is run (and how could it be otherwise? Output filters are there to post-process the content your handler generates). Have a look at ap_invoke_handler() in config.c, that should give you a handle on how the filter chain works. But don't hesitate to post your questions if you have them, of course. :)
Re: Help trying to figure out why an output_filter is not called.
This has certainly gotten off topic :) http://www.google.com/search?sourceid=chromeie=UTF-8q=nginx+vs+apache has lots of interesting opinions on the subject. Having said that, mod_pagespeed's initial target is Apache because that's the dominant server stack driving the web. I'm optimistic that nginx will also be a compelling opportunity at some point and I'm anxious to learn more. Now -- back on topic -- this issue is tracked as http://code.google.com/p/modpagespeed/issues/detail?id=179 for those following along at home, and hopefully will be resolved shortly based on the advise of the contributors to this thread. Thanks again, everyone, -Josh On Wed, Jan 5, 2011 at 8:40 PM, Ray Morris supp...@bettercgi.com wrote: Just a quick note since you mentioned nginx. Nginx is of course normally used by people wanting higher performance than they are getting from Apache, because certain tests seemed to suggest that nginx can significantly outperform Apache in some cases. If that's the case for you, we learned something very interesting. We wondered how nginx could possibly be much faster since the speed of the disk itself is normally the limiting factor. Was there something to be learned from nginx which could be applied to Apache? At the end of all of the testing, we learned what caused the large apparent difference. noatime. Nginx effectively skips atime updates, which can make a huge difference. By simply mounting the directory with the noatime option, any reasonable Apache configuration will have about the same performance as nginx, which is basically the performance of the underlying storage. People build complex systems with nginx as a proxy to Apache, but the same or better performance, with better standards compliance and better reliability, can be obtained by just setting noatime directly rather than using getting noatime accidentally as a side effect of nginx. With noatime set, one server or another might be 1% faster, but using TWO servers, with one as a proxy, will be slower than just simply using Apache, and in no case will nginx be SIGNIFICANTLY faster, when using noatime. -- Ray Morris supp...@bettercgi.com Strongbox - The next generation in site security: http://www.bettercgi.com/strongbox/ Throttlebox - Intelligent Bandwidth Control http://www.bettercgi.com/throttlebox/ Strongbox / Throttlebox affiliate program: http://www.bettercgi.com/affiliates/user/register.php On 01/05/2011 03:16:21 PM, Ben Noordhuis wrote: On Wed, Jan 5, 2011 at 22:03, Joshua Marantz jmara...@google.com wrote: Right you are. That's much simpler then. Thanks! My pleasure, Joshua. Two quick questions, hope you don't mind: Is mod_pagespeed an official Google project? Or is it something you guys do on your day off? And are there plans for a nginx port?