If you want to index only files, not directories, you can implement the
IndexingFilter interface, as the index-basic plugin does. The code looks like this:
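public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
    CrawlDatum datum, Inlinks inlinks)
    throws IndexingException {
  // Check for a directory URL and return null for it, so that IndexingJob
  // skips the document (it is discarded by the indexing filters).
  if (url.toString().endsWith("/")) { // assumption: directory URLs end with "/"
    return null;
  }
  return doc; // everything else is indexed as usual
}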
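
The directory check itself could be sketched as a small standalone helper. This is only an illustration, not code from Nutch: the class name DirectoryUrlCheck and the method isDirectoryUrl are made up here, and it assumes that directory URLs in your crawl have a path ending in "/". If your server lists directories differently, the test would need to change.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper (not part of Nutch) for spotting directory URLs. */
public final class DirectoryUrlCheck {

    private DirectoryUrlCheck() {
        // Static utility class, no instances.
    }

    /**
     * Returns true if the URL looks like a directory.
     * Assumption: a directory URL has a path that ends with "/".
     */
    public static boolean isDirectoryUrl(String url) {
        try {
            String path = new URI(url).getPath();
            // getPath() is null for opaque URIs such as mailto: addresses.
            return path != null && path.endsWith("/");
        } catch (URISyntaxException e) {
            // Unparseable URLs are not treated as directories here.
            return false;
        }
    }
}
```

Inside filter(...) you would then return null when DirectoryUrlCheck.isDirectoryUrl(url.toString()) is true and return doc otherwise. The filter only runs if its plugin is enabled through the plugin.includes property in nutch-site.xml.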
On Tue, Aug 13, 2013 at 8:13 PM, stone2dbone <[email protected]> wrote:
> UPDATE:
>
> I found that the anchor for any parent directory is blank. In Solr I have
> been able to use the following to delete the parent directories:
>
> deleteByQuery( "-anchor:[* TO *]" )
>
> However, I would prefer to delete these with Nutch if possible. Any
> suggestions would be appreciated.
>
> Regards.
--
Don't Grow Old, Grow Up... :-)