Can't you use a robots.txt instead?
> Currently Nutch supports the <meta name="robots" content="noindex,nofollow">
> directive in the head of individual pages. I would like to extend this
> feature to allow the "http.agent.name" as a valid name in addition to the
> "robots" directive. For example, if in your nutch-site.xml file you have
> the property
>
>   <property>
>     <name>http.agent.name</name>
>     <value>examplebot</value>
>   </property>
>
> then any page with the meta directive
>
>   <meta name="examplebot" content="noindex,nofollow">
>
> in its head would cause Nutch to not index the page and to not
> follow links on the page.
>
> So, the place to put this code is in the parse-html plugin and the
> parse-tika plugin. Both of these have an HTMLMetaProcessor class that
> would need to be updated.
>
> 1) Has someone already written this patch?
>
> 2) Should the property "http.robots.agents" also be included, as is done in
> the lib-http RobotRulesParser class?
>
> Blessings,
> TwP
>
> P.S. There is precedent for such a change, as the googlebot already parses
> these types of meta directives:
> &from=61050&rd=1>
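The matching logic the quoted mail describes can be sketched roughly as below. This is a standalone illustration, not Nutch's actual HTMLMetaProcessor API; the class and method names (`AgentMetaDirective`, `parse`) are hypothetical, and the token handling assumes the common "robots" meta values (`noindex`, `nofollow`, `none`).

```java
import java.util.Locale;

// Hypothetical sketch: honoring a <meta name="examplebot" ...> directive
// in addition to the standard <meta name="robots" ...> directive.
public class AgentMetaDirective {

    /** Result of parsing a robots-style meta directive. */
    public static final class Directive {
        public final boolean index;
        public final boolean follow;
        Directive(boolean index, boolean follow) {
            this.index = index;
            this.follow = follow;
        }
    }

    /**
     * Returns index/follow flags if the meta tag's name is "robots" or
     * matches the configured http.agent.name value; null otherwise.
     */
    public static Directive parse(String metaName, String metaContent,
                                  String agentName) {
        String name = metaName.trim().toLowerCase(Locale.ROOT);
        if (!name.equals("robots") && !name.equalsIgnoreCase(agentName.trim())) {
            return null; // directive is aimed at some other crawler
        }
        boolean index = true;
        boolean follow = true;
        for (String token : metaContent.toLowerCase(Locale.ROOT).split(",")) {
            switch (token.trim()) {
                case "noindex":  index = false;  break;
                case "nofollow": follow = false; break;
                case "none":     index = false; follow = false; break;
                default: break; // ignore unknown tokens such as "noarchive"
            }
        }
        return new Directive(index, follow);
    }

    public static void main(String[] args) {
        Directive d = parse("examplebot", "noindex,nofollow", "examplebot");
        System.out.println(d.index + " " + d.follow);          // false false
        System.out.println(parse("otherbot", "noindex", "examplebot") == null); // true
    }
}
```

Whether comparisons against http.agent.name should be case-insensitive, and whether the "http.robots.agents" fallback list should also be consulted (as the mail's question 2 asks), are design decisions the sketch leaves open.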

