Re: Removing Common Web Page Header and Footer from All Content Fetched by Nutch

Ken Krugler Tue, 19 Oct 2010 09:07:12 -0700


On Oct 19, 2010, at 1:45am, Markus Jelsma wrote:

Unfortunately, Nutch still uses Tika 0.7 in 1.2 and trunk. Nutch needs to be upgraded to Tika 0.8 (when it's released or just the current trunk). Also, the Boilerpipe API needs to be exposed through Nutch configuration, which extractor
can be used, which parameters need to be set etc.
Upgrading to Tika's trunk might be relatively easy but exposing Boilerpipe
surely isn't.

Boilerpipe has been integrated into Tika trunk, as a content handler that can be specified when calling the parse() method.


See https://issues.apache.org/jira/browse/TIKA-420

There's also TIKA-462, which is the related requirement of adding Boilerpipe to the Maven central repo - I haven't addressed that one yet.


-- Ken

On Tuesday, October 19, 2010 06:47:43 am Otis Gospodnetic wrote:
Hi Israel,

You can use this: http://search-lucene.com/?q=boilerpipe&fc_project=Tika
Not sure if it's built into Nutch, though...

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



----- Original Message ----
From: Israel Ekpo <[email protected]>
To: [email protected]; [email protected]
Sent: Mon, October 18, 2010 9:01:50 PM
Subject: Removing Common Web Page Header and Footer from All Content Fetched by Nutch

Hi All,
I am indexing a web application with approximately 9500 distinct URLs and
their contents using Nutch and Solr.

I use Nutch to fetch the URLs and links and then crawl the entire web
application to extract the content of all pages.

Then I run the solrindex command to send the content to Solr.
The problem that I have now is that the first 1000 or so characters of some pages and the last 400 characters of the pages are showing up in
the search results.

These are the contents of the common header and footer used in the site,
respectively.
The only workaround that I have now is to index everything and then go through each document one at a time to remove the first 1000 characters if the Levenshtein distance between the first 1000 characters of the page and the common header is less than a certain value. The same applies
to the footer content common to all pages.
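
The workaround described above could be sketched roughly as follows. This is only an illustration of the post-processing idea, not anything built into Nutch or Solr: the function names (levenshtein, strip_common_parts) and the max_ratio threshold are hypothetical, and the known header and footer strings are assumed to be available up front.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping only two rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner loop over the shorter string
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def strip_common_parts(text: str, header: str, footer: str,
                       max_ratio: float = 0.2) -> str:
    """Drop the leading/trailing chunk when it is close to the known
    header/footer. max_ratio is an assumed threshold (allowed edits as a
    fraction of the header/footer length), not a recommended value."""
    if header:
        head = text[:len(header)]
        if levenshtein(head, header) <= max_ratio * len(header):
            text = text[len(header):]
    if footer:
        tail = text[-len(footer):]
        if levenshtein(tail, footer) <= max_ratio * len(footer):
            text = text[:-len(footer)]
    return text
```

Comparing a fixed-length slice works when the header and footer vary only slightly in place; if their length changes a lot from page to page, the slice boundaries would need to be searched for rather than assumed.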
Is there a way to ignore certain "stop phrases", so to speak, in the Nutch configuration, based on Levenshtein distance or Jaro-Winkler distance, so that the parts of the fetched data that match these stop phrases
will not be parsed?

Any useful pointers would be highly appreciated.

Thanks in advance.
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350


--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
