Hi Hannes,

Thanks for the suggestion. I will have a look at the Wikipedia dumps. What is your 
advice on integrating the downloaded dump data with Lucene? Can I use Lucene to 
index it directly? My initial thought is to get the MySQL version of the 
Wikipedia dumps and then use LuSql to create a Lucene index of the 
MySQL data. 
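
As an alternative to going through MySQL, I could imagine filtering the XML 
dump for my subject first and only indexing the pages that match. Below is a 
rough sketch of that filtering step, not a tested pipeline. It assumes the 
usual MediaWiki XML export layout (page / title / revision / text); the names 
pages_mentioning, local_name and SAMPLE are made up for illustration, and the 
sample data stands in for a real dump file. The Lucene indexing step itself is 
not shown.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def pages_mentioning(source, term):
    """Yield (title, text) for each <page> whose wikitext contains `term`.

    `source` is a filename or a binary file object holding a MediaWiki
    XML export. Matching is a plain case-insensitive substring test.
    """
    needle = term.lower()
    # iterparse streams the file, so the whole dump is never parsed at once.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = "", ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text or ""
            elif name == "text":
                text = child.text or ""
        if needle in text.lower():
            yield title, text
        elem.clear()  # release this page's content once it has been handled


# Tiny stand-in for a dump file (assumed layout, illustrative content only).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Association football</title>
    <revision><text>Football is a team sport.</text></revision></page>
  <page><title>Chess</title>
    <revision><text>Chess is a board game.</text></revision></page>
</mediawiki>"""

for title, _text in pages_mentioning(io.BytesIO(SAMPLE), "Football"):
    print(title)
```

Each (title, text) pair that comes out could then be handed to whatever does 
the indexing, so that only the "Football" pages end up in the Lucene index.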


What is your take on this?

Many thanks

Best regards,
Kelvin



________________________________
From: Hannes Carl Meyer <[email protected]>
To: [email protected]; Kelvin <[email protected]>
Sent: Wednesday, 4 May 2011 11:37 PM
Subject: Re: Can I custom crawl using Nutch?

Hi,

I would rather use the Wikipedia dumps!

You should have a look at JWPL: http://code.google.com/p/jwpl/

BR

Hannes

On Wed, May 4, 2011 at 5:20 PM, Kelvin <[email protected]> wrote:

> Hello,
>
> I would like to crawl wikipedia using Nutch, but as it is too large, I
> would only like to crawl pages that are related to a particular subject.
>
> For example, I would like to crawl for webpages of wikipedia that contain
> the term "Football". Is this possible using Nutch?
>
> Thank you for your kind help.
>
