I see. You can write an indexing filter that drops the document (returns null) 
if it encounters noindex. I'm not sure the meta tag is present in the metadata 
by default; if not, you must either write an HTML parse filter that adds the 
value of that meta tag to the metadata, or try an existing metadata plugin.
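For illustration, here is a minimal sketch of the drop decision such a filter could make. The class and method names are hypothetical; a real plugin would implement Nutch's IndexingFilter interface and return null from its filter(...) method to drop the document, after a parse filter has put the meta tag's value into the metadata:

```java
// Sketch of the noindex check an indexing filter could apply.
// In a real Nutch plugin this logic would live inside
// IndexingFilter.filter(...), which drops a document by returning null.
public class NoIndexCheck {

    /**
     * Decide whether a document should be dropped from the index, given
     * the content of its robots-style meta tag (or null if absent), the
     * name attribute of that tag, and the configured http.agent.name.
     */
    public static boolean shouldDrop(String metaContent, String metaName,
                                     String agentName) {
        if (metaContent == null) {
            return false; // no meta directive was found during parsing
        }
        // Honor the directive if it targets all robots or our agent name.
        boolean targetsUs = "robots".equalsIgnoreCase(metaName)
                || agentName.equalsIgnoreCase(metaName);
        return targetsUs && metaContent.toLowerCase().contains("noindex");
    }

    public static void main(String[] args) {
        System.out.println(shouldDrop("noindex,nofollow", "robots", "examplebot"));
        System.out.println(shouldDrop("noindex", "examplebot", "examplebot"));
        System.out.println(shouldDrop("index,follow", "robots", "examplebot"));
        System.out.println(shouldDrop(null, "robots", "examplebot"));
    }
}
```

The same predicate could also guard link extraction for nofollow, but that part belongs in the parse plugins rather than an indexing filter.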

> On Jul 11, 2011, at 4:22 PM, Markus Jelsma wrote:
> > Can't you use a robots.txt instead?
> 
> I have no control over the robots.txt
> 
> This is a client of ours with poor URL schemas. They would have to put 6
> million entries into their robots.txt in order to exclude our crawler.
> That leaves <meta> directives as the next logical place to exclude
> specific pages (CMS templates make this easy).
> 
> I would love to not have this problem. But alas, since the client's URL
> schema is very poor we cannot effectively use robots.txt or any of the
> regular expression filters.
> 
> I am comfortable patching parse-html for our needs. Was just wondering if
> someone else has done this and would it be a worthwhile contribution back
> to the Nutch community at large.
> 
> Blessings,
> TwP
> 
> > > Currently Nutch supports the <meta name="robots"> directive in the
> > > head of individual pages. I would like to extend this feature to
> > > allow the "http.agent.name" as a valid name in addition to the
> > > "robots" directive. For example, in your nutch-site.xml file if you
> > > have the property
> > >
> > >   <property>
> > >     <name>http.agent.name</name>
> > >     <value>examplebot</value>
> > >   </property>
> > >
> > > then any pages with the meta directive
> > >
> > >   <meta name="examplebot" content="noindex,nofollow">
> > >
> > > in the head would cause Nutch to not index the page and to not
> > > follow links on the page.
> > >
> > > So, the place to put this code is in the parse-html plugin and the
> > > parse-tika plugin. Both of these have an HTMLMetaProcessor class
> > > that would need to be updated.
> > >
> > > 1) Has someone already written this patch?
> > > 2) Should the property "http.robots.agents" also be included as is
> > >    done in the lib-http RobotRulesParser class?
> > >
> > > Blessings,
> > > TwP
> > >
> > > P.S. There is precedence for such a change as the googlebot already
> > > parses these types of meta directives.