On Jul 11, 2011, at 4:22 PM, Markus Jelsma wrote:

> Can't you use a robots.txt instead?
> 

I have no control over the robots.txt

This is a client of ours with poor URL schemas. They would have to put 6 
million entries into their robots.txt in order to exclude our crawler. That 
leaves <meta> directives as the next logical place to exclude specific pages 
(CMS templates make this easy).

I would love to not have this problem. But alas, since the client's URL schema 
is very poor we cannot effectively use robots.txt or any of the regular 
expression filters.

I am comfortable patching parse-html for our needs. I was just wondering if 
someone else has already done this, and whether it would be a worthwhile 
contribution back to the Nutch community at large.
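For the curious, the core of such a patch is small. Below is a rough, self-contained sketch of the kind of check the HTMLMetaProcessor classes would need: it honors both the standard "robots" meta name and the crawler's configured http.agent.name (the "examplebot" value and the AgentMetaRules class name are my own illustration, not actual Nutch API).

```java
import java.util.Locale;

// Hypothetical sketch: track index/follow permissions for a page, honoring
// <meta> directives addressed to "robots" OR to our configured agent name.
public class AgentMetaRules {
    private boolean noIndex = false;
    private boolean noFollow = false;
    private final String agentName;

    public AgentMetaRules(String agentName) {
        // e.g. the value of http.agent.name from nutch-site.xml
        this.agentName = agentName.toLowerCase(Locale.ROOT);
    }

    // Call once for each <meta name="..." content="..."> tag in the head.
    public void processMetaTag(String name, String content) {
        if (name == null || content == null) return;
        String n = name.toLowerCase(Locale.ROOT);
        // Accept the generic "robots" name plus our own agent name.
        if (!n.equals("robots") && !n.equals(agentName)) return;
        for (String token : content.toLowerCase(Locale.ROOT).split(",")) {
            switch (token.trim()) {
                case "noindex":  noIndex = true; break;
                case "nofollow": noFollow = true; break;
                case "none":     noIndex = true; noFollow = true; break;
                default: // "all", "index", "follow" etc.: leave flags alone
            }
        }
    }

    public boolean canIndex()  { return !noIndex; }
    public boolean canFollow() { return !noFollow; }

    public static void main(String[] args) {
        AgentMetaRules rules = new AgentMetaRules("examplebot");
        rules.processMetaTag("examplebot", "noindex,nofollow");
        System.out.println(rules.canIndex());   // false
        System.out.println(rules.canFollow());  // false
        // A directive addressed to some other bot is ignored:
        AgentMetaRules other = new AgentMetaRules("examplebot");
        other.processMetaTag("otherbot", "noindex");
        System.out.println(other.canIndex());   // true
    }
}
```

The real patch would wire this into the existing meta-tag loop in parse-html (and parse-tika), where the "robots" comparison currently happens.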

Blessings,
TwP


>> Currently Nutch supports the <meta name="robots"> directive in the head of 
>> individual pages. I would like to extend this feature to allow the 
>> "http.agent.name" as a valid name in addition to the "robots" directive. 
>> For example, in your nutch-site.xml file if you have the property
>> 
>>   <property>
>>     <name>http.agent.name</name>
>>     <value>examplebot</value>
>>   </property>
>> 
>> then any pages with the meta directive in the head
>> 
>>   <meta name="examplebot" content="noindex,nofollow" />
>> 
>> would cause Nutch to not index the page and to not follow links on the 
>> page.
>> 
>> So, the place to put this code is in the parse-html plugin and the 
>> parse-tika plugin. Both of these have an HTMLMetaProcessor class that 
>> would need to be updated.
>> 
>> 1) Has someone already written this patch?
>> 
>> 2) Should the property "http.robots.agents" also be included, as is done 
>> in the lib-http RobotRulesParser class?
>> 
>> Blessings,
>> TwP
>> 
>> P.S. There is precedent for such a change, as the googlebot already 
>> parses these types of meta directives
>> <&from=61050&rd=1>
