Hi Joshua

you can circumvent that by creating a custom indexing filter, e.g. the
MetaNoIndexingFilter below:

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

/**
 * Prevents documents whose meta tags disallow indexing from being indexed.
 * By default Nutch simply empties the content and title fields, but that is
 * not enough to stop such documents from matching e.g. on URL, metatags etc.
 **/
public class MetaNoIndexingFilter implements IndexingFilter {
    public static final Log LOG =
            LogFactory.getLog(MetaNoIndexingFilter.class);

    private Configuration conf;

    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
            CrawlDatum datum, Inlinks inlinks) throws IndexingException {
        // should rely on doc or parse metadata, but nothing is stored
        // by the html parser
        String text = parse.getText();
        String title = parse.getData().getTitle();
        if ((text == null || text.equals(""))
                && (title == null || title.equals(""))) {
            // no text -> no indexing: returning null drops the document
            return null;
        }
        return doc;
    }

    public void setConf(Configuration conf) {
        this.conf = conf;
    }

    public Configuration getConf() {
        return this.conf;
    }

}
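Note that the class also has to be packaged as a Nutch plugin and activated before it takes effect. A minimal plugin.xml might look like the sketch below; the plugin id, jar name and ids are placeholders to adapt to your own build:

    <plugin id="index-noindex" name="Meta NoIndex Indexing Filter"
            version="1.0.0" provider-name="your.org">
       <runtime>
          <library name="index-noindex.jar">
             <export name="*"/>
          </library>
       </runtime>
       <requires>
          <import plugin="nutch-extensionpoints"/>
       </requires>
       <extension id="org.apache.nutch.indexer.metanoindex"
                  point="org.apache.nutch.indexer.IndexingFilter">
          <implementation id="MetaNoIndexingFilter"
                          class="MetaNoIndexingFilter"/>
       </extension>
    </plugin>

You'd then add the plugin id to the plugin.includes property in conf/nutch-site.xml so that the filter gets loaded at indexing time.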
We should probably have a think about how to do that systematically, as the
current behaviour is slightly counter-intuitive. Could you please open a
JIRA for this?

Thanks

Julien



On 7 February 2011 21:41, Joshua J Pavel <[email protected]> wrote:

>
> Running version 1.2.
>
> A very simple page I'm using to seed some URLs but don't want to return in
> the index itself has this metatag:
> <head><META http-equiv="Content-Type" content="text/html;
> charset=UTF-8"><META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW"></head>
>
> ...but the page keeps showing up in my index.  Any thoughts on how I can
> troubleshoot this or otherwise implement a page that I want to be crawled
> but not indexed?




-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
