RE: Use nutch to crawl and Lucene to index and Search

McGibbney, Lewis John Thu, 18 Nov 2010 08:51:32 -0800

>Hello,



>I've been searching for three days and still haven't found a solution.
>What I need is to use Nutch from Java just to crawl a number of URLs and then
>use Lucene to index the pages that Nutch finds. I have to integrate this into
>my app, so I need to do it all from Java code. I know how to index with
>Lucene, but I don't know how to just crawl with Nutch,

There are numerous tutorials (try the Nutch wiki for starters) scattered across 
the net for this.

>do it programmatically

I don't really understand this terminology. Could you please be more specific?

>and from where get the urls to index with Lucene.

I assume you will specify the URLs you wish Nutch to fetch. It's then a case of 
specifying that the fetched pages will be indexed by Lucene. Adding the 
lucene-core(version).jar to your Nutch environment (i.e. its classpath) lets 
Nutch use Lucene as the indexer.
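
As a rough illustration of kicking off a crawl from Java, here is a minimal 
sketch that builds the one-step "crawl" command line as shown in the Nutch 1.x 
tutorial (bin/nutch crawl urls -dir crawl -depth 3 -topN 50). The class name, 
the /opt/nutch install path and the directory names are assumptions made for 
the example, and later Nutch releases may not offer this command in the same 
form, so check it against the version you have installed.

```java
import java.io.IOException;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Nutch: it only assembles and launches
// the command-line crawl described in the Nutch 1.x tutorial.
class NutchCrawlSketch {

    // Builds: <nutchHome>/bin/nutch crawl <seedDir> -dir <crawlDir> -depth N -topN M
    static List<String> crawlCommand(String nutchHome, String seedDir,
                                     String crawlDir, int depth, int topN) {
        return Arrays.asList(
                Paths.get(nutchHome, "bin", "nutch").toString(),
                "crawl", seedDir,
                "-dir", crawlDir,
                "-depth", String.valueOf(depth),
                "-topN", String.valueOf(topN));
    }

    // Runs the command and returns its exit code. This needs a real Nutch
    // install at nutchHome, so the demo below does not call it.
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        // Assumed locations: adjust to your own install and seed directory.
        List<String> cmd = crawlCommand("/opt/nutch", "urls", "crawl", 3, 50);
        System.out.println(String.join(" ", cmd));
    }
}
```

The seed directory ("urls" here) is expected to contain a plain text file 
listing the URLs to fetch, one per line.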

>Also, the urls will be dynamically added and removed.

Added and removed from where? crawl-urlfilter? regex-urlfilter? Or from Lucene 
once they have been indexed?
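
If the answer turns out to be "from the list of URLs handed to Nutch", one 
possible approach is to keep the seed file (one URL per line) up to date from 
your own code. The sketch below assumes exactly that; the SeedList class and 
the seed.txt file name are made up for the example and are not part of Nutch. 
Note that removing a URL from the seed file only affects what is given to 
Nutch on later runs. As far as I know it will not remove pages that have 
already been fetched or indexed, which would need to be deleted from the 
Lucene index separately.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: keeps a seed file in sync with an in-memory
// set of URLs, preserving insertion order and ignoring duplicates.
class SeedList {
    private final Path seedFile;
    private final Set<String> urls = new LinkedHashSet<>();

    // Loads any URLs already present in the seed file.
    SeedList(Path seedFile) throws IOException {
        this.seedFile = seedFile;
        if (Files.exists(seedFile)) {
            for (String line : Files.readAllLines(seedFile, StandardCharsets.UTF_8)) {
                String url = line.trim();
                if (!url.isEmpty()) {
                    urls.add(url);
                }
            }
        }
    }

    // Adds a URL and rewrites the file; returns false if it was already listed.
    boolean add(String url) throws IOException {
        boolean changed = urls.add(url.trim());
        if (changed) {
            save();
        }
        return changed;
    }

    // Removes a URL and rewrites the file; returns false if it was not listed.
    boolean remove(String url) throws IOException {
        boolean changed = urls.remove(url.trim());
        if (changed) {
            save();
        }
        return changed;
    }

    // Returns a copy of the current URLs in insertion order.
    List<String> snapshot() {
        return new ArrayList<>(urls);
    }

    private void save() throws IOException {
        Path parent = seedFile.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        Files.write(seedFile, urls, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempDirectory("urls").resolve("seed.txt");
        SeedList seeds = new SeedList(file);
        seeds.add("http://example.com/");
        seeds.add("http://example.org/");
        seeds.remove("http://example.com/");
        System.out.println(seeds.snapshot()); // prints [http://example.org/]
    }
}
```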

>Any help would be appreciated.



>Thanks!



