JIRA 966 has been opened for this issue. And thank you, the custom indexing filter works perfectly.

From: ".: Abhishek :." <[email protected]>
To: [email protected]
Date: 02/08/2011 08:05 PM
Subject: Re: Nutch not respecting a NOINDEX,FOLLOW tag

Hi Julien,

Thanks! This actually answers the other question I asked sometime back :)

Cheers,
Abi

On Tue, Feb 8, 2011 at 6:14 PM, Julien Nioche <[email protected]> wrote:

> Hi Joshua
>
> you can circumvent that by creating a custom indexing filter e.g.
> MetaNoIndexingFilter below
>
> /**
>  * Prevents documents not allowing indexing in the meta to be indexed. By
>  * default Nutch simply empties the content and title fields but this is not
>  * enough to prevent documents to match e.g. on URL, metatags etc...
>  **/
> public class MetaNoIndexingFilter implements IndexingFilter {
>   public static final Log LOG =
>       LogFactory.getLog(MetaNoIndexingFilter.class);
>
>   private Configuration conf;
>
>   public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
>       CrawlDatum datum, Inlinks inlinks) throws IndexingException {
>     // should rely on doc or parse metadata but nothing stored
>     // by the html parser
>     String text = parse.getText();
>     String title = parse.getData().getTitle();
>     if ((text == null || text.equals(""))
>         && (title == null || title.equals(""))) {
>       // no text -> no indexing
>       return null;
>     }
>     return doc;
>   }
>
>   public void setConf(Configuration conf) {
>     this.conf = conf;
>   }
>
>   public Configuration getConf() {
>     return this.conf;
>   }
>
> }
>
> We should probably have a think about how to do that systematically as the
> current behaviour is slightly counter intuitive. Could you please open a
> JIRA for this?
>
> Thanks
>
> Julien
>
>
> On 7 February 2011 21:41, Joshua J Pavel <[email protected]> wrote:
>
> >
> > Running version 1.2.
> >
> > A very simple page I'm using to seed some URLs but don't want to return in
> > the index itself has this metatag:
> > <head><META http-equiv="Content-Type" content="text/html; charset=UTF-8"><META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW"></head>
> >
> > ...but the page keeps showing up in my index. Any thoughts on how I can
> > troubleshoot this or otherwise implement a page that I want to be crawled
> > but not indexed?
> >
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
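
For anyone reading this thread later: the rule the filter above applies can be tried out without a Nutch build. The sketch below is only an illustration of that condition. The class name NoIndexRule and the methods isNullOrEmpty and shouldIndex are made up for this example and are not part of Nutch. It assumes, as Julien's comment says, that Nutch empties the text and title of a NOINDEX page, so "both empty" is used as the signal to skip the document.

```java
// Standalone sketch of the skip rule used by MetaNoIndexingFilter.
// NoIndexRule, isNullOrEmpty and shouldIndex are hypothetical names,
// not Nutch classes or methods.
final class NoIndexRule {

    private NoIndexRule() {
        // static helpers only
    }

    /** Returns true when the string is null or has zero length. */
    static boolean isNullOrEmpty(String s) {
        return s == null || s.isEmpty();
    }

    /**
     * Mirrors the condition in the filter: a document is skipped only
     * when BOTH the parsed text and the title are null or empty.
     */
    static boolean shouldIndex(String text, String title) {
        return !(isNullOrEmpty(text) && isNullOrEmpty(title));
    }

    public static void main(String[] args) {
        // Text and title both emptied (the NOINDEX case): skipped.
        System.out.println(shouldIndex("", null));          // prints false
        // Text present, title empty: still indexed.
        System.out.println(shouldIndex("body text", ""));   // prints true
        // Title present, text missing: still indexed.
        System.out.println(shouldIndex(null, "Seed page")); // prints true
    }
}
```

One consequence of keying on empty fields rather than on the robots directive itself: a page that has no text and no title for any other reason would be dropped from the index as well.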

