Chris, Stew,

This was always going to be controversial. It's a very tough balancing act. In our view, we're not serving additional content to seed the search engines, and so it's reasonable. We are removing content that users find useful, but which might make it harder for the search engine to make a good judgement about the site overall. Yahoo explains why this is desirable:
Webpages often include headers, footers, navigational sections, repeated boilerplate text, copyright notices, ad sections, or dynamic content that is useful to users, but not to search engines. Webmasters can apply the "robots-nocontent" attribute to indicate to search engines any content that is extraneous to the main unique content of the page.
A few blogs have picked up on examples of prominent sites implementing cloaking to different extents, such as http://www.seroundtable.com/archives/021504.html and http://www.seoegghead.com/blog/seo/the-google-cloaking-hypocrisy-p32.html. Then there's also the First Click Free approach (http://www.google.com/support/webmasters/bin/answer.py?answer=74536), which many people might feel is a bit borderline.

I agree many people could use the code below to try to boost their search engine results by including lots of keywords or links, but I'm confident that there are many legitimate reasons to do this. I'd love not to do this in Varnish / HTTP, but there don't appear to be other widely supported solutions. In this case (and addressing Chris' point) it's not possible to use robots.txt as we're not trying to block the entire page, just a subset of it. There are ways of hinting to a Google Search Appliance to turn off indexing of a portion of the page (googleon/off tags, see http://perishablepress.com/press/2009/08/23/tell-google-to-not-index-certain-parts-of-your-page/), but these aren't supported by the normal googlebot, nor other search engines. Yahoo has a robots-nocontent class that can be added to HTML elements, but again, it's a single solution for just one search engine (http://help.yahoo.com/l/us/yahoo/search/webcrawler/slurp-14.html). I've heard of (but can't find a link) a discussion to add a specific attribute to any HTML element to the new HTML5 standard, but that this wasn't adopted.

Someone reading this might know a magic answer, but in the meantime, we'll be making minor page alterations to help ensure users find relevant results when searching Google, even if that involves a little cloaking to suppress a small portion of the page.


Rob


Chris Hecker wrote:

On that note, why not use robots.txt and a clear path name to turn off bots for the lists?

Chris

On 2010/08/11 08:25, Stewart Robinson wrote:
Hi,

Whilst this looks excellent, and I may use it to serve different
content to other types of users, I think you should read this URL
(if you haven't already), which discourages this sort of behaviour:

http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=66355

Great VCL though!
Stew


On 11 August 2010 16:20, Rob S <[email protected]> wrote:

Michael Loftis wrote:


--On Tuesday, August 10, 2010 9:05 PM +0100 Rob S <[email protected]> wrote:

Hi,

On one site we run behind Varnish, we've got a "most popular" widget
displayed on every page (much like http://www.bbc.co.uk/news/). However,
we have difficulties because this pollutes search engine results: searches
for a specific popular headline tend not to link directly to the article
itself, but to one of the index pages with high Google PageRank or
similar.

What I'd like to know is how other Varnish users might have served
different ESI content based on whether it's a bot or not.

My initial idea was to set an "X-Not-For-Bots: 1" header on the URL that
generates the most-popular fragment, then do something like (though
untested):


ESI goes through all the normal steps, so a <esi:include src="/esi/blargh">
is fired off starting with vcl_recv, looking exactly as if the browser had
hit the cache with that as the req.url -- the entire req object is the same.
I am *not* certain that headers you've added get propagated, as I've not
tested that (and all of my rules are built on the assumption that they
don't, just to be sure).

So there's no need to do it in vcl_deliver; in fact, you're far better off
handling it in vcl_recv and/or vcl_hash (actually you really SHOULD handle
it in vcl_hash and change the hash for these search-engine-specific objects,
else you'll serve them to regular users)...


For example -- assume vcl_recv sets X-BotDetector in the req header...
(not tested):


sub vcl_hash {
  // always take into account the url and host
  set req.hash += req.url;
  if (req.http.host) {
    set req.hash += req.http.host;
  } else {
    set req.hash += server.ip;
  }

  if (req.http.X-BotDetector == "1") {
    set req.hash += "bot detector";
  }
}
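
For completeness, a minimal sketch of the vcl_recv side the example above
assumes -- the user-agent patterns here are only illustrative, and a fuller
working version appears later in the thread:

sub vcl_recv {
  // Flag requests from known crawlers so vcl_hash can vary on them.
  if (req.http.user-agent ~ "Googlebot|Yahoo! Slurp|msnbot") {
    set req.http.X-BotDetector = "1";
  }
}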


You still have to do the detection inside of Varnish; I don't see any way
around that. The reason is that only Varnish knows who it's talking to, and
Varnish needs to decide which object to spit out. Working properly, what
happens is essentially that the webserver sends back a 'template' for the
page containing the page-specific stuff, plus pointers to a bunch of ESI
fragments. The ESI fragments are also cache objects/requests... so the
cache takes this template and fills in the ESI fragments (from cache if it
can, fetching them if it needs to, treating them just as if the web browser
had requested the ESI URL).


This is actually exactly how I handle menus that change based on a user's
authentication status. The browser gets a cookie. The ESI URL is formed as
either 'authenticated', 'personalized' or 'global' -- 'authenticated' means
it varies only on the client's login state, 'personalized' takes into
account the actual session we're working with, and 'global' means everyone
gets the same cache regardless. (We strip cookies going into these ESI URLs
and coming back from them in the vcl_recv/vcl_fetch code; the vcl_fetch
code looks for special headers that indicate recv has decided it needs to
ditch Set-Cookies -- this is mostly a safety measure to prevent a session
sticking to a client it shouldn't due to any bugs in code.)
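
To make the cookie-stripping part concrete, here is a minimal sketch in
Varnish 2.1-style VCL -- not the actual rules from the site above; the
/esi/global/ URL prefix and the X-Strip-Cookies marker header are just
assumptions for the example:

sub vcl_recv {
  // Global ESI fragments are identical for everyone, so drop the
  // client's cookie and mark the request for vcl_fetch.
  if (req.url ~ "^/esi/global/") {
    unset req.http.Cookie;
    set req.http.X-Strip-Cookies = "1";
  }
}

sub vcl_fetch {
  // Safety measure: never let a Set-Cookie from a shared fragment
  // stick a session to a client it shouldn't.
  if (req.http.X-Strip-Cookies == "1") {
    unset beresp.http.Set-Cookie;
  }
}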

The basic idea is borrowed from
<http://varnish-cache.org/wiki/VCLExampleCachingLoggedInUsers>  and
<http://varnish-cache.org/wiki/VCLExampleCacheCookies>

HTH!

Thanks.  We've proved this works with a simple setup:

sub vcl_recv {
       ....
       // Establish if the visitor is a search engine:
       set req.http.X-IsABot = "0";
       if (req.http.user-agent ~ "Yahoo! Slurp") {
               set req.http.X-IsABot = "1";
       }
       if (req.http.X-IsABot == "0" && req.http.user-agent ~ "Googlebot") {
               set req.http.X-IsABot = "1";
       }
       if (req.http.X-IsABot == "0" && req.http.user-agent ~ "msnbot") {
               set req.http.X-IsABot = "1";
       }
       ....
}
...
sub vcl_hash {
       set req.hash += req.url;
       if (req.http.host) {
               set req.hash += req.http.host;
       } else {
               set req.hash += server.ip;
       }

       if (req.http.X-IsABot == "1") {
               set req.hash += "for-bot";
       } else {
               set req.hash += "for-non-bot";
       }
       hash;
}

The main HTML has a simple ESI include, which loads a page fragment whose PHP reads:

if ($_SERVER["HTTP_X_ISABOT"]) {
       echo "<!-- The list of popular posts is not displayed to search engines -->";
} else {
       // calculate most popular
       echo "The most popular article is XYZ";
}
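
One piece worth spelling out for anyone copying this: the main HTML pages
also need ESI processing switched on in vcl_fetch, otherwise the
<esi:include> tag is passed through to the browser untouched. A minimal
sketch in Varnish 2.x syntax (the URL pattern is just an assumption for
illustration):

sub vcl_fetch {
       // Only run the ESI parser on pages that actually contain includes.
       if (req.url ~ "^/(index|articles)") {
               esi;
       }
}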



Thanks again.



_______________________________________________
varnish-misc mailing list
[email protected]
http://lists.varnish-cache.org/mailman/listinfo/varnish-misc
