Re: Nutch not respecting a NOINDEX,FOLLOW tag

Joshua J Pavel Wed, 09 Feb 2011 06:35:24 -0800

JIRA 966 has been opened for this issue.  And thank you, the custom
indexing filter works perfectly.




From:    ".: Abhishek :." <[email protected]>
To:      [email protected]
Date:    02/08/2011 08:05 PM
Subject: Re: Nutch not respecting a NOINDEX,FOLLOW tag





Hi Julien,

Thanks! This actually answers the other question I asked some time back :)

Cheers,
Abi


On Tue, Feb 8, 2011 at 6:14 PM, Julien Nioche <[email protected]> wrote:

> Hi Joshua
>
> You can work around that by creating a custom indexing filter, e.g.
> MetaNoIndexingFilter below:
>
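> import org.apache.commons.logging.Log;
> import org.apache.commons.logging.LogFactory;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.io.Text;
> import org.apache.nutch.crawl.CrawlDatum;
> import org.apache.nutch.crawl.Inlinks;
> import org.apache.nutch.indexer.IndexingException;
> import org.apache.nutch.indexer.IndexingFilter;
> import org.apache.nutch.indexer.NutchDocument;
> import org.apache.nutch.parse.Parse;
>
> /**
>  * Prevents documents whose meta tags disallow indexing from being indexed.
>  * By default Nutch simply empties the content and title fields, but this is
>  * not enough to prevent documents from matching, e.g. on URL, metatags etc.
>  */
> public class MetaNoIndexingFilter implements IndexingFilter {
>    public static final Log LOG = LogFactory.getLog(MetaNoIndexingFilter.class);
>
>    private Configuration conf;
>
>    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
>            CrawlDatum datum, Inlinks inlinks) throws IndexingException {
>        // should rely on doc or parse metadata but nothing stored
>        // by the html parser
>        String text = parse.getText();
>        String title = parse.getData().getTitle();
>        // both fields empty: treat the page as NOINDEX
>        if ((text == null || text.equals(""))
>                && (title == null || title.equals(""))) {
>            // no text -> no indexing; returning null drops the document
>            return null;
>        }
>        // otherwise pass the document through unchanged
>        return doc;
>    }
>
>    public void setConf(Configuration conf) {
>        this.conf = conf;
>    }
>
>    public Configuration getConf() {
>        return this.conf;
>    }
>
> }
>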
> We should probably have a think about how to do that systematically, as the
> current behaviour is slightly counterintuitive. Could you please open a
> JIRA for this?
>
> Thanks
>
> Julien
>
>
>
> On 7 February 2011 21:41, Joshua J Pavel <[email protected]> wrote:
>
> >
> > Running version 1.2.
> >
> > A very simple page I'm using to seed some URLs, but don't want returned
> > in the index itself, has this metatag:
> > <head><META http-equiv="Content-Type" content="text/html;
> > charset=UTF-8"><META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW"></head>
> >
> > ...but the page keeps showing up in my index.  Any thoughts on how I can
> > troubleshoot this or otherwise implement a page that I want to be crawled
> > but not indexed?
>
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>
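
For anyone reading this thread later, the rule that MetaNoIndexingFilter applies can be tried out without a Nutch install. The sketch below is only a stand-alone model of that rule, not Nutch code: the class NoIndexRule and its methods are hypothetical names, and the two String arguments stand in for parse.getText() and parse.getData().getTitle(). It assumes, as Julien describes, that Nutch has already emptied both fields for a NOINDEX page.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-alone model of the rule used by MetaNoIndexingFilter.
// No Nutch dependency: "text" and "title" stand in for
// parse.getText() and parse.getData().getTitle().
final class NoIndexRule {

    private NoIndexRule() {}

    /** True when the value is null or the empty string. */
    static boolean isNullOrEmpty(String s) {
        return s == null || s.isEmpty();
    }

    /** Mirrors the filter: skip a page only when both text and title are empty. */
    static boolean shouldIndex(String text, String title) {
        return !(isNullOrEmpty(text) && isNullOrEmpty(title));
    }

    public static void main(String[] args) {
        // Illustrative inputs only; each value is {text, title}.
        Map<String, String[]> pages = new LinkedHashMap<>();
        pages.put("seed page (fields emptied)", new String[] {"", ""});
        pages.put("normal page", new String[] {"body text", "A title"});
        pages.put("title only", new String[] {"", "A title"});

        for (Map.Entry<String, String[]> e : pages.entrySet()) {
            String[] v = e.getValue();
            String decision = shouldIndex(v[0], v[1]) ? "index" : "skip";
            System.out.println(e.getKey() + " -> " + decision);
        }
    }
}
```

One consequence of this rule is that a page which legitimately has neither text nor title is skipped as well, since the check cannot tell it apart from a NOINDEX page. That is presumably why the comment in the filter says it should rely on doc or parse metadata instead.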
