If you want to index only files, not directories, you can implement the
IndexingFilter interface, as the index-basic plugin does. Something like this:

public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
    CrawlDatum datum, Inlinks inlinks) throws IndexingException {
  // Returning null discards the document, so IndexingJob will skip it.
  // isDirectoryUrl is a placeholder for your own directory check.
  if (isDirectoryUrl(url.toString())) {
    return null;
  }
  return doc;
}
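
The directory check itself depends on how your URLs look. Here is a minimal
sketch, assuming directory URLs are the ones whose path ends with "/" (usually
true for server-generated listings, but that is an assumption about your
crawl). DirectoryUrlCheck and isDirectoryUrl are made-up names for
illustration and are not part of Nutch:

```java
import java.net.URI;
import java.net.URISyntaxException;

class DirectoryUrlCheck {

    // Heuristic (assumption): a URL is treated as a directory when its
    // path is empty or ends with '/'. Adjust to match your own site layout.
    static boolean isDirectoryUrl(String url) {
        try {
            String path = new URI(url).getPath();
            if (path == null) {
                return false; // opaque URI such as mailto:, not a directory
            }
            return path.isEmpty() || path.endsWith("/");
        } catch (URISyntaxException e) {
            return false; // unparsable URL: keep the document rather than drop it
        }
    }

    public static void main(String[] args) {
        System.out.println(isDirectoryUrl("http://example.com/docs/"));
        System.out.println(isDirectoryUrl("http://example.com/docs/report.pdf"));
    }
}
```

Inside the filter you would call it with url.toString() and return null when
it reports a directory.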


On Tue, Aug 13, 2013 at 8:13 PM, stone2dbone <[email protected]> wrote:

> UPDATE:
>
> I found that the anchor for any parent directory is blank.  In Solr I have
> been able to use the following to delete the parent directories:
>
> deleteByQuery( "-anchor:[* TO *]" )
>
> However, I would prefer to delete these with Nutch if possible.  Any
> suggestions would be appreciated.
>
> Regards.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Prevent-crawl-of-parent-URL-tp4080032p4084252.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>



-- 
Don't Grow Old, Grow Up... :-)
