Hi Joshua,
You can work around that by creating a custom indexing filter, e.g. the
MetaNoIndexingFilter below:
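import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

/**
 * Prevents documents whose robots meta tag disallows indexing from being
 * indexed. By default Nutch only empties the content and title fields, which
 * is not enough to stop such documents from matching on e.g. URL or metatags.
 */
public class MetaNoIndexingFilter implements IndexingFilter {
  // If the IndexingFilter interface in your Nutch version also declares
  // addIndexBackendOptions(Configuration), add an empty implementation.
  public static final Log LOG = LogFactory.getLog(MetaNoIndexingFilter.class);

  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    // Ideally this would check the robots directive in the doc or parse
    // metadata, but the HTML parser does not store it there.
    String text = parse.getText();
    String title = parse.getData().getTitle();
    if ((text == null || text.equals(""))
        && (title == null || title.equals(""))) {
      // no text and no title -> returning null skips indexing the document
      return null;
    }
    return doc;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public Configuration getConf() {
    return this.conf;
  }
}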
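If you want to sanity-check the skip rule without running a crawl, the condition can be pulled out into a small standalone helper. This is only a sketch: EmptyParseCheck and shouldSkip are hypothetical names that are not part of Nutch, and it assumes that Nutch has already emptied the text and title of NOINDEX pages as described above.

```java
// Standalone sketch of the skip rule used in MetaNoIndexingFilter.
// EmptyParseCheck and shouldSkip are illustrative names, not Nutch APIs.
class EmptyParseCheck {

    /** True when the string is null or has no characters. */
    public static boolean isEmpty(String s) {
        return s == null || s.isEmpty();
    }

    /** True when both the parsed text and the title are empty. */
    public static boolean shouldSkip(String text, String title) {
        return isEmpty(text) && isEmpty(title);
    }

    public static void main(String[] args) {
        // Text and title both emptied (assumed NOINDEX case): skipped.
        System.out.println(shouldSkip("", ""));
        // A title survives: the document is kept.
        System.out.println(shouldSkip(null, "Seed page"));
        // Body text survives: the document is kept.
        System.out.println(shouldSkip("body text", null));
    }
}
```

Note that the same rule also drops pages that are genuinely empty, not only those marked NOINDEX, since the filter cannot tell the two cases apart.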
We should probably have a think about how to handle this systematically, as
the current behaviour is slightly counterintuitive. Could you please open a
JIRA for this?
Thanks
Julien
On 7 February 2011 21:41, Joshua J Pavel <[email protected]> wrote:
>
> Running version 1.2.
>
> A very simple page I'm using to seed some URLs but don't want to return in
> the index itself has this metatag:
> <head><META http-equiv="Content-Type" content="text/html;
> charset=UTF-8"><META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW"></head>
>
> ...but the page keeps showing up in my index. Any thoughts on how I can
> troubleshoot this or otherwise implement a page that I want to be crawled
> but not indexed?
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com