Hi. There was an issue [0] opened for this some time ago, and it looks like, apart from the (bare minimum) description, no work has been done on it.
Would be a real nice feature to have.

[0] https://issues.apache.org/jira/browse/NUTCH-945

On Wed, Feb 22, 2012 at 6:12 AM, Julien Nioche <[email protected]> wrote:

> Hi Sujit,
>
> Sounds good. A nice way of doing it would be to make so that people can
> define how to partition over the SOLR instances in the way they want e.g.
> consistent hashing, URL range or crawldb metadata by taking a class name as
> parameter. Does not need to be pluggable I think. I had implemented
> something along these lines some time ago for a customer but could not
> release it open source.
>
> Feel free to open a JIRA to comment on this issue and attach a patch.
>
> Thanks
>
> Julien
>
> On 22 February 2012 03:45, SUJIT PAL <[email protected]> wrote:
>
> > Hi,
> >
> > I need to move the SOLR based search platform to a distributed setup, and
> > therefore need to be able to write to multiple SOLR servers from Nutch
> > (working on the nutchgora branch, so this may be specific to this branch).
> > Here is what I think I need to do...
> >
> > Currently, SolrIndexerJob writes to Solr in the IndexerReducer, where it
> > converts the WebPage to a NutchDocument, then passes the NutchDocument to
> > the appropriate NutchIndexWriter (SolrWriter in this case). The SolrWriter
> > adds the NutchDocument to a queue and when the commit size is exceeded,
> > writes out the queue and does a commit (and another one in the shutdown
> > step).
> >
> > My proposal is to specify the SolrConstants.SERVER_URL parameter as a
> > comma-separated list of URLs. The SolrWriter splits this parameter by ","
> > and creates an array of server URLs and the same size array of inputDocs
> > queue. It then takes the URL, runs it through a hashMod partitioner and
> > writes it out to the inputDocs queue pointed by the partition.
> >
> > Then my pages get split up into a number of SOLR servers, where I can
> > query them in a distributed fashion (according to the SOLR docs, it is
> > advisable to do this in a random manner to make sure the (unreliable) idf
> > values do not influence scores from one server too much).
> >
> > Is this a reasonable way to go about this? Or is there a simpler method I
> > am overlooking?
> >
> > TIA for any help you can provide.
> >
> > -sujit
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble

--
*Lewis*
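
For reference, the hash-mod routing Sujit describes in the quoted proposal could be sketched roughly as below. This is only an illustration under stated assumptions, not code from Nutch or from any patch on NUTCH-945: the class name HashModPartitioner and all of its methods are hypothetical, plain strings stand in for NutchDocument, and the queues are in-memory lists rather than real Solr clients. It splits a comma-separated server list, keeps one queue per server, and routes each page URL to the queue at hash(url) mod N.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, NOT part of Nutch: routes page URLs to one of several
// per-server queues using hash-mod partitioning.
class HashModPartitioner {
    private final List<String> serverUrls = new ArrayList<>();
    private final List<List<String>> queues = new ArrayList<>();

    // Takes the comma-separated server list (the proposed form of the
    // SolrConstants.SERVER_URL value) and creates one queue per server.
    HashModPartitioner(String commaSeparatedUrls) {
        for (String url : commaSeparatedUrls.split(",")) {
            String trimmed = url.trim();
            if (!trimmed.isEmpty()) {
                serverUrls.add(trimmed);
                queues.add(new ArrayList<>());
            }
        }
        if (serverUrls.isEmpty()) {
            throw new IllegalArgumentException("no server URLs given");
        }
    }

    int serverCount() {
        return serverUrls.size();
    }

    String serverUrl(int partition) {
        return serverUrls.get(partition);
    }

    // floorMod keeps the result in [0, N) even when hashCode() is negative.
    int partitionFor(String pageUrl) {
        return Math.floorMod(pageUrl.hashCode(), serverUrls.size());
    }

    // Adds the page to the queue of its partition and returns the partition.
    int enqueue(String pageUrl) {
        int partition = partitionFor(pageUrl);
        queues.get(partition).add(pageUrl);
        return partition;
    }

    List<String> queue(int partition) {
        return queues.get(partition);
    }

    public static void main(String[] args) {
        // Example server URLs are placeholders.
        HashModPartitioner p = new HashModPartitioner(
                "http://solr1:8983/solr, http://solr2:8983/solr");
        for (String page : new String[] {
                "http://example.com/a", "http://example.com/b", "http://example.org/"}) {
            int partition = p.enqueue(page);
            System.out.println(page + " -> " + p.serverUrl(partition));
        }
    }
}
```

One thing to keep in mind with plain hash-mod: changing the number of servers remaps most URLs, which is presumably why Julien mentions consistent hashing and a configurable partitioner class as alternatives.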

