Hi All,

I am indexing a web application with approximately 9,500 distinct URLs and
their contents using Nutch and Solr.

I use Nutch to fetch the URLs and links, crawling the entire web
application to extract the content of every page.

Then I run the solrindex command to send the content to Solr.

The problem I have now is that the first 1,000 or so characters of some
pages, and the last 400 characters of the pages, are showing up in the
search results.

These are contents of the common header and footer used in the site
respectively.

The only workaround I have now is to index everything and then go
through each document one at a time, removing the first 1,000 characters if
the Levenshtein distance between the first 1,000 characters of the page and
the common header is less than a certain value. The same applies to the
footer content common to all pages.
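For what it's worth, the post-processing workaround described above can be sketched roughly as follows. This is just an illustrative standalone sketch, not Nutch or Solr API code; the function names, the 1,000-character window, and the distance threshold are all assumptions to be tuned.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(curr[-1] + 1,        # insertion
                            prev[j] + 1,         # deletion
                            prev[j - 1] + cost)) # substitution
        prev = curr
    return prev[-1]


def strip_header(text: str, header: str, max_distance: int = 50) -> str:
    """Drop the leading len(header) characters of `text` if that prefix
    is within `max_distance` edits of the known common header.
    (A footer would be handled symmetrically on text[-len(footer):].)"""
    n = len(header)
    if levenshtein(text[:n], header) <= max_distance:
        return text[n:]
    return text
```

The same check, applied to the trailing slice of the document against the common footer, covers the 400-character footer case.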

Is there a way to ignore certain "stop phrases", so to speak, in the Nutch
configuration based on Levenshtein distance or Jaro-Winkler distance, so that
the parts of the fetched data that match these stop phrases will not be
parsed?

Any useful pointers would be highly appreciated.

Thanks in advance.


-- 
°O°
"Good Enough" is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.
http://www.israelekpo.com/
