Currently Nutch supports the <meta name="robots" content="noindex"> directive 
in the head of individual pages. I would like to extend this feature so that 
the value of "http.agent.name" is accepted as a meta name in addition to 
"robots". For example, if your nutch-site.xml file has the property

  <property>
    <name>http.agent.name</name>
    <value>examplebot</value>
  </property>

then for any page with the following meta directive in its head

  <meta name="examplebot" content="noindex,nofollow">

Nutch would neither index the page nor follow the links on it.

So, the place to put this code is in the parse-html plugin and the parse-tika 
plugin. Both of these have an HTMLMetaProcessor class that would need to be 
updated.
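
To make the idea concrete, here is a rough sketch of the matching logic I have 
in mind. It is not taken from the existing HTMLMetaProcessor; the class and 
method names (AgentMetaDirectives, handleMeta) are hypothetical, and it assumes 
the caller has already pulled the name and content attributes out of each 
<meta> element and read "http.agent.name" from the configuration.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch: collects noindex/nofollow flags
// from meta tags named either "robots" or the configured agent name.
class AgentMetaDirectives {
  private final Set<String> acceptedNames = new HashSet<>();
  private boolean noIndex = false;
  private boolean noFollow = false;

  // agentName is assumed to be the value of http.agent.name; may be null.
  AgentMetaDirectives(String agentName) {
    acceptedNames.add("robots");
    if (agentName != null && !agentName.trim().isEmpty()) {
      acceptedNames.add(agentName.trim().toLowerCase(Locale.ROOT));
    }
  }

  // Call once per <meta> element with its name and content attributes.
  void handleMeta(String name, String content) {
    if (name == null || content == null) {
      return;
    }
    if (!acceptedNames.contains(name.trim().toLowerCase(Locale.ROOT))) {
      return; // directive is addressed to some other agent
    }
    for (String token : content.toLowerCase(Locale.ROOT).split(",")) {
      String directive = token.trim();
      if (directive.equals("noindex")) {
        noIndex = true;
      } else if (directive.equals("nofollow")) {
        noFollow = true;
      } else if (directive.equals("none")) {
        // "none" is shorthand for noindex plus nofollow
        noIndex = true;
        noFollow = true;
      }
    }
  }

  boolean isNoIndex() {
    return noIndex;
  }

  boolean isNoFollow() {
    return noFollow;
  }
}
```

With "examplebot" as the agent name, calling handleMeta("examplebot", 
"noindex,nofollow") sets both flags, while a tag addressed to some other bot 
name is ignored. The sketch only ever turns restrictions on, so an 
agent-specific tag cannot loosen what a "robots" tag forbids; whether that is 
the right precedence is open for discussion.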

1) Has someone already written this patch?

2) Should the property "http.robots.agents" also be included as is done in the 
lib-http RobotRulesParser class?


Blessings,
TwP

P.S.  There is precedent for such a change: Googlebot already parses 
this type of meta directive 
<http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=93710&from=61050&rd=1>
