Hi Israel,

You can use this: http://search-lucene.com/?q=boilerpipe&fc_project=Tika

Not sure if it's built into Nutch, though...

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message ----
> From: Israel Ekpo <israele...@gmail.com>
> To: solr-user@lucene.apache.org; u...@nutch.apache.org
> Sent: Mon, October 18, 2010 9:01:50 PM
> Subject: Removing Common Web Page Header and Footer from All Content Fetched by Nutch
>
> Hi All,
>
> I am indexing a web application with approximately 9500 distinct URLs and
> contents using Nutch and Solr.
>
> I use Nutch to fetch the URLs and links and then crawl the entire web
> application to extract all the content for all pages.
>
> Then I run the solrindex command to send the content to Solr.
>
> The problem that I have now is that the first 1000 or so characters of some
> pages and the last 400 characters of the pages are showing up in the search
> results.
>
> These are the contents of the common header and footer used in the site,
> respectively.
>
> The only workaround that I have now is to index everything and then go
> through each document one at a time to remove the first 1000 characters if
> the Levenshtein distance between the first 1000 characters of the page and
> the common header is less than a certain value. The same applies to the
> footer content common to all pages.
>
> Is there a way to ignore certain "stop phrases", so to speak, in the Nutch
> configuration based on Levenshtein distance or Jaro-Winkler distance, so
> that parts of the fetched data that match these stop phrases will not be
> parsed?
>
> Any useful pointers would be highly appreciated.
>
> Thanks in advance.
>
> --
> °O°
> "Good Enough" is not good enough.
> To give anything less than your best is to sacrifice the gift.
> Quality First. Measure Twice. Cut Once.
> http://www.israelekpo.com/
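
For reference, the workaround described in the quoted message (drop the leading and trailing characters of a page when they are within some Levenshtein distance of the known header and footer) could be sketched roughly as below. This is only an illustration in plain Python, not anything built into Nutch or Solr. The names `levenshtein`, `strip_boilerplate` and `threshold` are made up for the example, and it assumes you already have the common header and footer text and that they occupy a fixed number of characters at the start and end of each page.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping two rows in memory."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner loop over the shorter string
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def strip_boilerplate(text: str, header: str, footer: str, threshold: int) -> str:
    """Remove a leading header and trailing footer when they are close matches.

    Assumption: the boilerplate spans exactly len(header) characters at the
    start and len(footer) characters at the end of the page text.
    """
    if header and levenshtein(text[:len(header)], header) < threshold:
        text = text[len(header):]
    if footer and levenshtein(text[-len(footer):], footer) < threshold:
        text = text[:-len(footer)]
    return text


if __name__ == "__main__":
    # Hypothetical header and footer, for illustration only.
    header = "ACME Portal | Home | Products | Contact\n"
    footer = "\nCopyright ACME. All rights reserved."
    page = header + "Real page body." + footer
    print(strip_boilerplate(page, header, footer, threshold=5))
```

The cost is quadratic in the header length per document, so for 1000-character headers across ~9500 pages it is workable as a one-off batch step, but it would need to run on the extracted text before the content is sent to Solr.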