Hi Hannes,

Thanks for the suggestion; I will have a look at the Wikipedia dumps. What is your advice on integrating the downloaded data from the dumps with Lucene? Can I use Lucene to index it directly? My initial thought is to get the MySQL version of the Wikipedia dumps, then use LuSql to create a Lucene index of the MySQL data.
What is your take on this?

Many thanks

Best regards,
Kelvin

________________________________
From: Hannes Carl Meyer <[email protected]>
To: [email protected]; Kelvin <[email protected]>
Sent: Wednesday, 4 May 2011 11:37 PM
Subject: Re: Can I custom crawl using Nutch?

Hi,

I would rather use the wikipedia dumps! You should have a look at jwpl
http://code.google.com/p/jwpl/

BR

Hannes

On Wed, May 4, 2011 at 5:20 PM, Kelvin <[email protected]> wrote:

> Hello,
>
> I would like to crawl wikipedia using Nutch, but as it is too large, I
> would only like to crawl pages that are related to a particular subject.
>
> For example, I would like to crawl for webpages of wikipedia that contain
> the term "Football". Is this possible using Nutch?
>
> Thank you for your kind help.
>

