Using Nutch to crawl and use it as input to Solr

Kumar Krishnasami Fri, 22 Jan 2010 23:28:32 -0800

Hi All,

I am trying to decide if I could use Nutch for a project I am working on with the following requirements:


1. I need to build the ability to search a bunch of urls.

2. These urls are given to me and there is no need to crawl links from or to these urls.

3. From time to time new urls will be added to the original set of urls. I need to update the indexes as soon as I get a new url to be added to the original set of urls.

4. There is no need to rank these urls based on outside links etc.

Based on these requirements it seems that most of the capabilities of Nutch (crawling, Hadoop etc.) would be overkill for this project. There is no need for a linkdb etc.

Because of this, I am thinking that I could use Solr with some other component to feed it the appropriate data. If I use Solr, I would need a mechanism to fetch those urls and convert them to the format in which Solr expects data to be sent to it. Can I use Nutch for this by using just the Fetcher and building something that converts the html into the appropriate xml format for Solr? Is there something else I could use that anyone here is aware of?
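To make the fetch-and-convert step concrete, here is a minimal Python sketch of what such a component might look like, using only the standard library. It is an illustration under assumptions, not a tested setup: the field names `id`, `title` and `content` are hypothetical and would have to match the Solr schema, and the update address `http://localhost:8983/solr/update` is only a guess at a default local install. The helper names (`TextExtractor`, `html_to_solr_xml`, `index_url`) are made up for this example.

```python
import urllib.request
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the page title and visible text, skipping script/style bodies."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        if self._in_title:
            self.title_parts.append(data)
        elif data.strip():
            self.text_parts.append(data.strip())


def html_to_solr_xml(url, html):
    """Build a Solr <add><doc>...</doc></add> message for one page.

    Field names are assumptions; they must exist in the target schema.
    """
    parser = TextExtractor()
    parser.feed(html)
    parser.close()

    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    fields = {
        "id": url,  # the url doubles as the unique key here
        "title": "".join(parser.title_parts).strip(),
        "content": " ".join(parser.text_parts),
    }
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")


def index_url(url, solr_update_url="http://localhost:8983/solr/update"):
    """Fetch one url, post it to Solr, then commit (update address is assumed)."""
    with urllib.request.urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    body = html_to_solr_xml(url, html).encode("utf-8")
    for payload in (body, b"<commit/>"):
        req = urllib.request.Request(
            solr_update_url,
            data=payload,
            headers={"Content-Type": "text/xml; charset=utf-8"},
        )
        urllib.request.urlopen(req).close()
```

With something like this, adding a new url to the set would amount to calling `index_url(new_url)` once for that url, with no crawl db or linkdb involved.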

I am just starting out with Nutch and Solr and any help would be greatly appreciated.


Thanks,
Kumar.
