Well, there are two tutorials that I found.

http://thewiki4opentech.org/index.php/Nutch

http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/

As far as the benefits of Solr go, I am not entirely sure. Solr is a search
engine, but Nutch seems to have one of its own. You can either run solrindex
to index into Solr, or run index to index with Nutch's own search engine. The
regular index seems to work fine with Hadoop's task tracker and job tracker
daemons. Solrindex seems to have to run single threaded.

What I am doing right now in my script is starting up the Hadoop daemons
with one configuration to take advantage of multiple cores and threads, then
stopping the daemons right before the script kicks off solrindex, and
restarting them with the tasks set to one, using --config
/path/to/different/conf/dir in the hadoop script.
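
A rough sketch of that daemon swap, written in Python rather than copied from
my actual script. It only builds the command list and prints it unless you ask
it to run. The function names, the single bin directory holding stop-all.sh,
start-all.sh and nutch, and the argument order for solrindex (Solr URL,
crawldb, linkdb, then segments) are assumptions for illustration, so check
them against your own install before running anything.

```python
import subprocess
from pathlib import Path


def build_index_plan(bin_dir, parallel_conf, single_conf, solr_url,
                     crawldb, linkdb, segments):
    """Return the ordered commands for the daemon swap around solrindex.

    All paths are hypothetical placeholders supplied by the caller.
    """
    bin_dir = Path(bin_dir)
    return [
        # Stop the daemons that were started with the multi-task config.
        [str(bin_dir / "stop-all.sh"), "--config", str(parallel_conf)],
        # Restart them with the config that limits tasks to one.
        [str(bin_dir / "start-all.sh"), "--config", str(single_conf)],
        # Assumed solrindex argument order: URL, crawldb, linkdb, segments.
        [str(bin_dir / "nutch"), "solrindex", solr_url,
         str(crawldb), str(linkdb)] + [str(s) for s in segments],
    ]


def run_plan(plan, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in plan:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)
```

Calling run_plan(build_index_plan(...)) with the default dry_run=True just
prints the three commands, which is a cheap way to check the ordering before
letting it touch the daemons.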

The problem with that is solrindex is really slow on a Sun SPARC processor
if it is running single threaded. I just tried creating a processor set and
assigning all the nutch/hadoop processes to it, and that caused it to produce
errors like:

org.apache.solr.common.SolrException: no segments* file found in org.apache.lucene.store.FSDirectory@/nutchnew/crawl/index: files:
java.io.FileNotFoundException: no segments* file found in org.apache.lucene.store.FSDirectory@/search_fodors/nutchnew/crawl/index: files:
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:604)
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:272)
    at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1158)
    at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:938)
    at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:116)
    at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:122)
    at org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:167)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:221)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:59)
    at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:196)
    at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)

So I am still figuring out how to get solrindex to work faster.

On Thu, Oct 7, 2010 at 8:38 AM, Israel <[email protected]> wrote:

> Hi Steve, I don't have the answer to your question, but wanted to ask how
> to integrate SOLR to nutch 1.2 and what brings benefits. Or if you have
> your own tutorial.
>
