Currently Nutch supports the <meta name="robots" content="noindex"> directive
in the head of individual pages. I would like to extend this feature so that
the value of "http.agent.name" is accepted as a meta name in addition to
"robots".
For example, if your nutch-site.xml file contains the property
<property>
<name>http.agent.name</name>
<value>examplebot</value>
</property>
then any page with this meta directive in its head
<meta name="examplebot" content="noindex,nofollow">
would not be indexed by Nutch, and the links on that page would not be
followed.
The code for this belongs in the parse-html and parse-tika plugins. Each of
them has an HTMLMetaProcessor class that would need to be updated.
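To make the proposal concrete, here is a minimal sketch of the matching logic
in plain Java. It is not the actual HTMLMetaProcessor code: the class
AgentMetaSketch, its Directives holder, and the apply method are hypothetical
names, and it assumes the meta name/content pairs have already been pulled out
of the page head. It also assumes that a "robots" tag and an agent-specific
tag are both honored, with restrictions accumulating.

```java
import java.util.Locale;

// Hypothetical illustration only; not part of Nutch.
public class AgentMetaSketch {

    /** Flags a robots-meta handler would typically track. */
    public static final class Directives {
        public boolean noIndex;
        public boolean noFollow;
    }

    /**
     * Applies one meta tag to the flags if its name is "robots" or equals
     * the configured agent name (compared case-insensitively).
     * agentName stands in for the value of http.agent.name.
     */
    public static void apply(Directives d, String metaName,
                             String content, String agentName) {
        if (metaName == null || content == null) {
            return;
        }
        String name = metaName.trim().toLowerCase(Locale.ROOT);
        boolean matches = name.equals("robots")
            || (agentName != null
                && name.equals(agentName.trim().toLowerCase(Locale.ROOT)));
        if (!matches) {
            return; // tag is addressed to some other crawler
        }
        // Content is a comma-separated list, e.g. "noindex,nofollow".
        for (String token : content.toLowerCase(Locale.ROOT).split(",")) {
            String t = token.trim();
            if (t.equals("noindex") || t.equals("none")) {
                d.noIndex = true;
            }
            if (t.equals("nofollow") || t.equals("none")) {
                d.noFollow = true;
            }
        }
    }
}
```

With agentName set to "examplebot", a tag named "examplebot" with content
"noindex,nofollow" sets both flags, while a tag named "otherbot" is ignored.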
1) Has someone already written this patch?
2) Should the agent names in the "http.robots.agents" property also be
honored, as is done in the lib-http RobotRulesParser class?
Blessings,
TwP
P.S. There is precedent for such a change, as the googlebot already parses
these types of meta directives
<http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=93710&from=61050&rd=1>