Hi,

I am crawling a page on Wikipedia (e.g. https://en.wikipedia.org/wiki/List_of_free_and_open-source_software_packages). One problem I have is that I would like to keep Nutch from storing content and outlinks that come from tags without relevant content (in my example page, only the div#mw-content-text element should be considered). This matters for two reasons: it gives Solr better content to index, and it means Nutch only follows relevant outlinks (it would not follow contact_us, about_us, or plenty of other extraneous links).
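To make the intent concrete, here is a minimal sketch (in Python, purely for illustration; it is not Nutch code, and the div id is taken from my example page) of what I mean by restricting outlink extraction to a single container element:

```python
from html.parser import HTMLParser

class ScopedLinkExtractor(HTMLParser):
    """Collect href values only from inside a given <div id=...>."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0   # nesting depth inside the target div (0 = outside)
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth > 0:
            # Already inside the target div: track nested divs, keep links.
            if tag == "div":
                self.depth += 1
            if tag == "a" and "href" in attrs:
                self.links.append(attrs["href"])
        elif tag == "div" and attrs.get("id") == self.target_id:
            self.depth = 1   # entering the target div

    def handle_endtag(self, tag):
        if self.depth > 0 and tag == "div":
            self.depth -= 1

html = """
<html><body>
  <div id="nav"><a href="/about_us">About</a></div>
  <div id="mw-content-text">
    <p><a href="/wiki/GIMP">GIMP</a></p>
    <div><a href="/wiki/Blender">Blender</a></div>
  </div>
  <a href="/contact_us">Contact</a>
</body></html>
"""

extractor = ScopedLinkExtractor("mw-content-text")
extractor.feed(html)
print(extractor.links)  # links outside div#mw-content-text are dropped
```

The outcome I am after is the equivalent of this inside Nutch's parse step, so that /about_us and /contact_us never enter the crawldb in the first place.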
I have tried the index-blacklist-whitelist patch from https://issues.apache.org/jira/browse/NUTCH-585, but it does not seem to filter the original outlinks or content produced by Nutch; instead it creates a new field, "stripped_content", that you can then map to Solr for indexing. I am wondering what would be a good approach to this problem: do I need to write a new plugin that modifies or replaces the functionality of parse-html, or is there a better way to accomplish this?

All suggestions are welcome, thanks for all your help!

--
Camilo Tejeiro
Be honest, be grateful, be humble.
https://www.linkedin.com/in/camilotejeiro
http://camilotejeiro.wordpress.com

