Hi,

I need to move our Solr-based search platform to a distributed setup, and 
therefore need to be able to write to multiple Solr servers from Nutch (I am 
working on the nutchgora branch, so this may be specific to that branch). Here 
is what I think I need to do...

Currently, SolrIndexerJob writes to Solr in the IndexerReducer, which converts 
each WebPage to a NutchDocument and passes it to the appropriate 
NutchIndexWriter (SolrWriter in this case). The SolrWriter adds the 
NutchDocument to a queue and, when the commit size is exceeded, writes out the 
queue and issues a commit (plus a final one in the shutdown step).
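
For reference, the batching behavior I am describing looks roughly like the 
sketch below. This is a paraphrase, not the actual Nutch source; the class 
name and constructor are mine, and I am assuming the SolrJ client API:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.common.SolrInputDocument;

public class SingleSolrWriterSketch {
  private final SolrServer server;          // single SolrJ client today
  private final List<SolrInputDocument> inputDocs =
      new ArrayList<SolrInputDocument>();
  private final int commitSize;

  public SingleSolrWriterSketch(SolrServer server, int commitSize) {
    this.server = server;
    this.commitSize = commitSize;
  }

  public void write(SolrInputDocument doc) throws IOException {
    inputDocs.add(doc);
    if (inputDocs.size() >= commitSize) {
      flush();                              // batch add + commit
    }
  }

  private void flush() throws IOException {
    try {
      server.add(inputDocs);
      server.commit();
    } catch (SolrServerException e) {
      throw new IOException(e);
    }
    inputDocs.clear();
  }

  public void close() throws IOException {
    flush();                                // final commit on shutdown
  }
}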

My proposal is to specify the SolrConstants.SERVER_URL parameter as a 
comma-separated list of URLs. The SolrWriter would split this parameter on "," 
and create an array of server URLs along with an equally sized array of 
inputDocs queues. For each page, it would then run the page's URL through a 
hashMod partitioner and write the document to the inputDocs queue selected by 
the partition.
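
A minimal sketch of what I have in mind, assuming SolrJ's 
CommonsHttpSolrServer and a SERVER_URL value like 
"http://solr1:8983/solr,http://solr2:8983/solr" (class and method names here 
are illustrative, not existing Nutch code):

import java.net.MalformedURLException;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ShardedSolrWriterSketch {
  private SolrServer[] servers;
  private List<List<SolrInputDocument>> queues;

  public void open(String serverUrlParam) throws MalformedURLException {
    String[] urls = serverUrlParam.split(",");
    servers = new SolrServer[urls.length];
    queues = new ArrayList<List<SolrInputDocument>>(urls.length);
    for (int i = 0; i < urls.length; i++) {
      servers[i] = new CommonsHttpSolrServer(urls[i].trim());
      queues.add(new ArrayList<SolrInputDocument>());
    }
  }

  // hashMod partitioner: mask the sign bit so the modulus is never
  // negative, then route the document by its page URL.
  public void write(String pageUrl, SolrInputDocument doc) {
    int partition = (pageUrl.hashCode() & Integer.MAX_VALUE) % servers.length;
    queues.get(partition).add(doc);
    // per-queue commitSize checks and flushes would go here, one
    // queue/server pair at a time, mirroring the current SolrWriter
  }
}

One nice property of hashing on the page URL (rather than, say, round-robin) 
is that a re-crawled page always hashes to the same shard, so an update 
overwrites the old version instead of leaving duplicates on other shards.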

My pages would then be split across a number of Solr servers, which I can 
query in a distributed fashion (according to the Solr docs, it is advisable to 
spread the documents across the servers in a random, roughly uniform manner so 
that the unreliable per-shard idf values do not influence scores from any one 
server too much).
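
Query-side, I believe this would use standard Solr distributed search via the 
shards parameter, something like the following (host names are placeholders; 
note that shard entries omit the http:// prefix):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DistributedQuerySketch {
  public static void main(String[] args) throws Exception {
    // any one shard can act as the aggregator for the request
    SolrServer front = new CommonsHttpSolrServer("http://solr1:8983/solr");
    SolrQuery query = new SolrQuery("content:nutch");
    query.set("shards", "solr1:8983/solr,solr2:8983/solr");
    QueryResponse rsp = front.query(query);
    System.out.println("hits: " + rsp.getResults().getNumFound());
  }
}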

Is this a reasonable way to go about this? Or is there a simpler method I am 
overlooking?

TIA for any help you can provide.

-sujit
