Hi,

I am crawling a page on Wikipedia (e.g.
https://en.wikipedia.org/wiki/List_of_free_and_open-source_software_packages).
One problem I have is that I would like to keep Nutch from storing content
and outlinks from tags that don't contain relevant content (in my example
page, only the tag div#mw-content-text should be considered). This is
important both to provide better content to Solr for indexing and so that
Nutch only follows relevant outlinks (e.g. it would not follow contact_us,
about_us, or plenty of other extraneous links).

I have tried the index-blacklist-whitelist patch from
https://issues.apache.org/jira/browse/NUTCH-585, but it does not seem to
filter out the original content or outlinks created by Nutch; instead, it
creates a new field, "stripped_content", that you can then map to Solr for
indexing.

I am wondering what would be a good approach to solving this problem. Do I
need to write a new plugin that modifies/replaces the functionality of
parse-html to achieve this, or is there a better way to accomplish it?
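For context, the core logic I have in mind is something like the sketch
below. It is illustrative only: a real Nutch plugin would be written in
Java against the HtmlParseFilter extension point, but the subtree
extraction itself (keep text and outlinks only inside div#mw-content-text,
drop everything else) can be shown with Python's stdlib HTML parser. The
class name and the target id are my own choices for the example page.

```python
# Sketch of the filtering logic only (not a Nutch plugin): collect text
# and outlinks solely from inside div#mw-content-text, ignoring the rest.
from html.parser import HTMLParser

class ContentScopedParser(HTMLParser):
    """Keeps text and hrefs found inside div#mw-content-text only."""

    def __init__(self, target_id="mw-content-text"):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # div nesting depth inside the target subtree
        self.text = []      # text fragments worth indexing
        self.outlinks = []  # hrefs worth following

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth:
            # Already inside the target div: track nested divs and
            # collect any outlinks we encounter.
            if tag == "div":
                self.depth += 1
            if tag == "a" and "href" in attrs:
                self.outlinks.append(attrs["href"])
        elif tag == "div" and attrs.get("id") == self.target_id:
            self.depth = 1  # entered the target subtree

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1  # leaving a nested div, or the target itself

    def handle_data(self, data):
        if self.depth and data.strip():
            self.text.append(data.strip())

# Tiny demonstration: the about_us link outside the div is dropped.
html = ('<body><a href="/about_us">About</a>'
        '<div id="mw-content-text"><p>Relevant '
        '<a href="/wiki/GIMP">GIMP</a></p></div></body>')
p = ContentScopedParser()
p.feed(html)
print(p.text)      # ['Relevant', 'GIMP']
print(p.outlinks)  # ['/wiki/GIMP']
```

The same walk would be done over the DocumentFragment that Nutch's
parse-html hands to a parse filter, replacing the parse text and outlink
list before they are stored.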

All suggestions are welcome,

Thanks for all your help!

-- 
Camilo Tejeiro
Be honest, be grateful, be humble.
https://www.linkedin.com/in/camilotejeiro
http://camilotejeiro.wordpress.com
