Hello Dave,

If there is just one specific page you do not want Nutch to index, or Solr to
show, you can either create a custom IndexingFilter that returns null
(rejecting the document) for that URL, or add a filter query to the Solr
request, fq=-id:<SEED_URL>, which removes that URL from the results.
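
As a rough illustration of the Solr side, the sketch below builds the request
parameters with the extra filter query. It assumes the Solr id field holds the
page URL, as the fq above implies. The URL and the helper names (exclusion_fq,
select_params) are made up for the example. The value is quoted so that the
':' and '/' characters in the URL are not read as query syntax.

```python
from urllib.parse import urlencode


def exclusion_fq(url: str) -> str:
    """Return a negative Solr filter query hiding the document with this id."""
    # Escape backslashes and double quotes, then wrap the URL in quotes so it
    # is treated as one literal term.
    escaped = url.replace("\\", "\\\\").replace('"', '\\"')
    return f'-id:"{escaped}"'


def select_params(user_query: str, excluded_url: str) -> str:
    """Build the URL-encoded query string for a Solr select request."""
    return urlencode({"q": user_query, "fq": exclusion_fq(excluded_url)})


# Hypothetical seed page; replace with the real URL of the page to hide.
SEED_PAGE = "https://www.example.com/seeds.html"

print(exclusion_fq(SEED_PAGE))
print(select_params("nutch", SEED_PAGE))
```

The page stays in the index with this approach; it is only hidden from every
query that carries the fq parameter.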

If there are more than a few URLs you want to exclude from indexing, and they 
follow a pattern, you can use regular expressions in the IndexingFilter or in 
the Solr filter query.
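
For the pattern case, the rejection logic could look roughly like the sketch
below. It is plain Python, not the Nutch plugin API: a real IndexingFilter is a
Java class, and only the idea of returning null (None here) to reject a
document carries over. The patterns and the filter_document name are
hypothetical.

```python
import re
from typing import Optional

# Hypothetical URL patterns for pages that should never reach the index.
EXCLUDE_PATTERNS = [
    re.compile(r"^https?://www\.example\.com/seeds/.*$"),
    re.compile(r"^https?://www\.example\.com/overview-\d+\.html$"),
]


def filter_document(url: str, doc: dict) -> Optional[dict]:
    """Return None to reject the document, or the document unchanged."""
    if any(pattern.match(url) for pattern in EXCLUDE_PATTERNS):
        return None  # rejected: this page is not sent to the index
    return doc


print(filter_document("https://www.example.com/seeds/list.html", {"title": "Seeds"}))
print(filter_document("https://www.example.com/article.html", {"title": "Article"}))
```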

Both approaches require manual maintenance, so they are only practical if the 
set of URLs is small and does not change frequently. If that is not the case, 
you need more rigorous tools to detect and reject what we call hub pages or 
overview pages.

-----Original message-----
> From:Dave Beckstrom <dbeckst...@collectivefls.com>
> Sent: Thursday 10th October 2019 22:34
> To: user@nutch.apache.org
> Subject: Excluding individual pages?
> Hi Everyone,
> I searched and didn't find an answer.
> Nutch is indexing the content of the page that has the seed urls in it and
> then that page shows up in the SOLR search results.   We don't want that to
> happen.
> Is there a way to have nutch crawl the seed url page but not push that page
> into SOLR?  If not, is there a way to have a particular page excluded from
> the SOLR search results?  Either way I'm trying to not have that page show
> in search results.
> Thank you!
> Dave
> -- 
> *Fig Leaf Software is now Collective FLS, Inc.*
> *Collective FLS, Inc.* 
> https://www.collectivefls.com/ <https://www.collectivefls.com/> 
