Can't you use a robots.txt instead?
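For comparison, a robots.txt rule can keep a crawler away from whole URL prefixes for a given agent, though unlike the per-page meta directive discussed below it cannot express "fetch but don't index" on a single page. A minimal sketch, reusing the "examplebot" agent name from the quoted message (the `/private/` path is only an illustrative placeholder):

```
User-agent: examplebot
Disallow: /private/
```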

> Currently Nutch supports the <meta name="robots"> directive in the head of
> individual pages. I would like to extend this feature to allow the
> "http.agent.name" as a valid name in addition to the "robots" directive.
> For example, if your nutch-site.xml file has the property
>
>   <property>
>     <name>http.agent.name</name>
>     <value>examplebot</value>
>   </property>
>
> then any page with this meta directive in the head
>
>   <meta name="examplebot" content="noindex,nofollow" />
>
> would cause Nutch to not index the page and to not follow links on the
> page.
>
> So, the place to put this code is in the parse-html plugin and the
> parse-tika plugin. Both of these have an HTMLMetaProcessor class that
> would need to be updated.
>
> 1) Has someone already written this patch?
>
> 2) Should the property "http.robots.agents" also be included, as is done
> in the lib-http RobotRulesParser class?
>
> Blessings,
> TwP
>
> P.S. There is precedent for such a change, as the googlebot already
> parses these types of meta directives.
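The matching logic the quoted message asks for could be sketched roughly as follows. This is not the actual Nutch HTMLMetaProcessor code, just a hedged illustration of the idea: treat a meta tag as addressed to this crawler if its name is "robots", the configured http.agent.name, or (per question 2) any name in the http.robots.agents list, and honor noindex/nofollow in its content.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the proposed agent-name matching; class and
// method names here are illustrative, not Nutch's actual API.
public class MetaAgentMatcher {
    private final Set<String> names = new HashSet<>();

    public MetaAgentMatcher(String agentName, String robotsAgents) {
        names.add("robots");                 // standard directive name
        names.add(agentName.toLowerCase());  // e.g. "examplebot"
        if (robotsAgents != null) {          // comma-separated agent list
            for (String n : robotsAgents.split(",")) {
                names.add(n.trim().toLowerCase());
            }
        }
    }

    /** True if a meta tag with this name addresses our crawler. */
    public boolean appliesTo(String metaName) {
        return names.contains(metaName.toLowerCase());
    }

    /** True if the directive's content asks us not to index the page. */
    public static boolean forbidsIndexing(String content) {
        String c = content.toLowerCase();
        return c.contains("noindex") || c.contains("none");
    }

    public static void main(String[] args) {
        MetaAgentMatcher m =
            new MetaAgentMatcher("examplebot", "examplebot,otherbot");
        System.out.println(m.appliesTo("robots"));       // true
        System.out.println(m.appliesTo("ExampleBot"));   // true
        System.out.println(m.appliesTo("googlebot"));    // false
        System.out.println(forbidsIndexing("noindex,nofollow")); // true
    }
}
```

In a real patch this check would live where the plugins' HTMLMetaProcessor classes already read the robots meta tag, with the two property values passed in from the Nutch configuration.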
