Hi Sujit,

Sounds good. A nice way of doing it would be to let people define how
documents are partitioned over the SOLR instances, e.g. by consistent
hashing, URL range or crawldb metadata, by taking a class name as a
parameter. It does not need to be pluggable, I think. I implemented
something along these lines some time ago for a customer but could not
release it as open source.
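
To make the idea concrete, here is a minimal sketch of a partitioning
strategy chosen by class name, with the hash-mod scheme from the quoted
proposal as one implementation. All names in it (SolrPartitionSketch,
Partitioner, HashModPartitioner, splitServers, load) are hypothetical and
are not part of Nutch; the host names are placeholders. It only shows the
routing logic, not the actual SolrWriter changes.

```java
import java.util.ArrayList;
import java.util.List;

public class SolrPartitionSketch {

    /** Hypothetical strategy interface: maps a document URL to a server index. */
    public interface Partitioner {
        int partition(String docUrl, int numServers);
    }

    /** Hash-mod strategy, as described in the quoted proposal. */
    public static class HashModPartitioner implements Partitioner {
        @Override
        public int partition(String docUrl, int numServers) {
            // floorMod keeps the result in [0, numServers) even when hashCode() is negative.
            return Math.floorMod(docUrl.hashCode(), numServers);
        }
    }

    /** Splits a comma-separated server list, trimming whitespace and skipping empty entries. */
    public static String[] splitServers(String serverUrls) {
        List<String> servers = new ArrayList<>();
        for (String part : serverUrls.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                servers.add(trimmed);
            }
        }
        return servers.toArray(new String[0]);
    }

    /** Instantiates a partitioner from a class name, e.g. one read from the job configuration. */
    public static Partitioner load(String className) throws ReflectiveOperationException {
        return Class.forName(className)
                .asSubclass(Partitioner.class)
                .getDeclaredConstructor()
                .newInstance();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder server list standing in for a comma-separated SERVER_URL value.
        String[] servers = splitServers("http://host1:8983/solr, http://host2:8983/solr");
        Partitioner partitioner = load(HashModPartitioner.class.getName());

        // One pending-document queue per server, mirroring the per-server inputDocs idea.
        List<List<String>> queues = new ArrayList<>();
        for (int i = 0; i < servers.length; i++) {
            queues.add(new ArrayList<>());
        }

        String[] docUrls = {
            "http://example.com/a", "http://example.com/b", "http://example.org/c"
        };
        for (String docUrl : docUrls) {
            queues.get(partitioner.partition(docUrl, servers.length)).add(docUrl);
        }

        for (int i = 0; i < servers.length; i++) {
            System.out.println(servers[i] + " <- " + queues.get(i));
        }
    }
}
```

A consistent-hashing or URL-range strategy would be another implementation
of the same interface. Note that hash-mod reassigns most documents whenever
the number of servers changes.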

Feel free to open a JIRA issue to discuss this and attach a patch.

Thanks

Julien

On 22 February 2012 03:45, SUJIT PAL <[email protected]> wrote:

> Hi,
>
> I need to move the SOLR based search platform to a distributed setup, and
> therefore need to be able to write to multiple SOLR servers from Nutch
> (working on the nutchgora branch, so this may be specific to this branch).
> Here is what I think I need to do...
>
> Currently, SolrIndexerJob writes to Solr in the IndexerReducer, where it
> converts the WebPage to a NutchDocument, then passes the NutchDocument to
> the appropriate NutchIndexWriter (SolrWriter in this case). The SolrWriter
> adds the NutchDocument to a queue and when the commit size is exceeded,
> writes out the queue and does a commit (and another one in the shutdown
> step).
>
> My proposal is to specify the SolrConstants.SERVER_URL parameter as a
> comma-separated list of URLs. The SolrWriter splits this parameter on ","
> and creates an array of server URLs and a same-sized array of inputDocs
> queues. It then takes each document's URL, runs it through a hashMod
> partitioner and adds the document to the inputDocs queue selected by the
> resulting partition.
>
> Then my pages get split up across a number of SOLR servers, where I can
> query them in a distributed fashion (according to the SOLR docs, it is
> advisable to distribute documents randomly so that the (unreliable) idf
> values do not influence scores from one server too much).
>
> Is this a reasonable way to go about this? Or is there a simpler method I
> am overlooking?
>
> TIA for any help you can provide.
>
> -sujit
>
>


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
