Hi, I need to move our Solr-based search platform to a distributed setup, and therefore need to be able to write to multiple Solr servers from Nutch (I'm working on the nutchgora branch, so this may be specific to that branch). Here is what I think I need to do...
Currently, SolrIndexerJob writes to Solr in the IndexerReducer, where it converts the WebPage to a NutchDocument and passes it to the appropriate NutchIndexWriter (SolrWriter in this case). The SolrWriter adds the NutchDocument to a queue; when the commit size is exceeded, it writes out the queue and does a commit (plus a final one in the shutdown step).

My proposal is to allow the SolrConstants.SERVER_URL parameter to be a comma-separated list of URLs. The SolrWriter would split this parameter on "," to build an array of server URLs and a same-sized array of inputDocs queues. For each document, it would run the page's URL through a hash-mod partitioner and write the document to the inputDocs queue for that partition. My pages would then be spread across a number of Solr servers, which I could query in a distributed fashion (according to the Solr docs, it is advisable to distribute documents randomly so that the (unreliable) idf values from any one server do not skew scores too much).

Is this a reasonable way to go about it? Or is there a simpler method I am overlooking? TIA for any help you can provide. -sujit
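In case it helps to see the idea in code, here is a rough sketch of what I have in mind. This is not the actual SolrWriter plumbing; ShardedSolrWriter, partition(), and the commitSize of 1000 are placeholders I made up for illustration, and I'm talking to SolrJ's CommonsHttpSolrServer directly rather than going through the NutchIndexWriter interface:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

/**
 * Sketch of a SolrWriter variant that shards documents across
 * several Solr servers by hash-mod of the page URL.
 */
public class ShardedSolrWriter {

  private CommonsHttpSolrServer[] servers;
  private List<List<SolrInputDocument>> queues;
  private int commitSize = 1000; // placeholder; would come from config

  // serverUrls would be the comma-separated SolrConstants.SERVER_URL value,
  // e.g. "http://solr1:8983/solr,http://solr2:8983/solr"
  public void open(String serverUrls) throws IOException {
    String[] urls = serverUrls.split(",");
    servers = new CommonsHttpSolrServer[urls.length];
    queues = new ArrayList<List<SolrInputDocument>>(urls.length);
    for (int i = 0; i < urls.length; i++) {
      servers[i] = new CommonsHttpSolrServer(urls[i].trim());
      queues.add(new ArrayList<SolrInputDocument>());
    }
  }

  // Pick a shard by hash-mod of the page URL; mask the sign bit so
  // the partition index is never negative.
  private int partition(String pageUrl) {
    return (pageUrl.hashCode() & Integer.MAX_VALUE) % servers.length;
  }

  public void write(String pageUrl, SolrInputDocument doc)
      throws IOException, SolrServerException {
    int shard = partition(pageUrl);
    List<SolrInputDocument> queue = queues.get(shard);
    queue.add(doc);
    if (queue.size() >= commitSize) {
      servers[shard].add(queue);
      servers[shard].commit();
      queue.clear();
    }
  }

  // Flush whatever is left in each queue and issue a final commit,
  // mirroring what SolrWriter does in its shutdown step.
  public void close() throws IOException, SolrServerException {
    for (int i = 0; i < servers.length; i++) {
      if (!queues.get(i).isEmpty()) {
        servers[i].add(queues.get(i));
        queues.get(i).clear();
      }
      servers[i].commit();
    }
  }
}
```

On the query side I'd then rely on Solr's built-in distributed search, i.e. hit any one of the servers with a shards parameter listing all of them, something like ?q=...&shards=solr1:8983/solr,solr2:8983/solr (assuming I've read the distributed search wiki page correctly).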

