Hi Israel,

You can use Boilerpipe for this: http://search-lucene.com/?q=boilerpipe&fc_project=Tika
I'm not sure whether it's built into Nutch, though.

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



----- Original Message ----
> From: Israel Ekpo <israele...@gmail.com>
> To: solr-user@lucene.apache.org; u...@nutch.apache.org
> Sent: Mon, October 18, 2010 9:01:50 PM
> Subject: Removing Common Web Page Header and Footer from All Content Fetched by Nutch
> 
> Hi All,
> 
> I am indexing a web application with approximately 9500 distinct URLs and
> their contents using Nutch and Solr.
> 
> I use Nutch to fetch the URLs and links, and then crawl the entire web
> application to extract the content of all pages.
> 
> Then I run the solrindex command to send the content to Solr.
> 
> The problem that I have now is that the first 1000 or so characters of some
> pages and the last 400 characters of the pages are showing up in the search
> results.
> 
> These are the contents of the common header and footer, respectively, used
> on the site.
> 
> The only workaround that I have now is to index everything and then go
> through each document one at a time to remove the first 1000 characters if
> the Levenshtein distance between the first 1000 characters of the page and
> the common header is less than a certain value. The same applies to the
> footer content common to all pages.
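
The workaround described above could be sketched roughly as follows. This is
only an illustration in plain Python, not anything built into Nutch or Solr:
the names levenshtein and strip_boilerplate, the sample header and footer
strings, and the threshold value are all assumptions made up for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping only two rows."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def strip_boilerplate(text: str, header: str, footer: str,
                      max_distance: int) -> str:
    """Drop the leading/trailing block if it is close to the known boilerplate.

    Hypothetical helper: compares as many characters as the known header
    (or footer) contains and removes them when the distance is below
    max_distance.
    """
    if header and levenshtein(text[:len(header)], header) < max_distance:
        text = text[len(header):]
    if footer and levenshtein(text[-len(footer):], footer) < max_distance:
        text = text[:-len(footer)]
    return text


# Made-up sample data; a real header would be around 1000 characters.
HEADER = "ACME Corp | Home | About | Contact "
FOOTER = " Copyright 2010 ACME Corp"
page = "ACME Corp | Home | About | Contakt " + "Real page body." + FOOTER

cleaned = strip_boilerplate(page, HEADER, FOOTER, max_distance=3)
```

The comparison costs time proportional to the product of the two string
lengths, so for a 1000-character header it is noticeably more work per page
than an exact prefix match.
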
> 
> Is there a way to ignore certain "stop phrases", so to speak, in the Nutch
> configuration based on Levenshtein distance or Jaro-Winkler distance, so that
> the parts of the fetched data that match these stop phrases will not be
> parsed?
> 
> Any useful pointers would be highly appreciated.
> 
> Thanks in advance.
> 
> 
> -- 
> °O°
> "Good Enough" is not good enough.
> To give anything less than your best is to sacrifice the gift.
> Quality First. Measure Twice. Cut Once.
> http://www.israelekpo.com/
>
