Re: [nutchgora] - proposal to support distributed indexing

Markus Jelsma Wed, 22 Feb 2012 10:31:16 -0800

Hi,

We're in the process of testing the cloud features in Solr trunk, which 
recently gained initial work on distributed indexing. With it, there is no 
longer any need to do the partitioning on the client side, because Solr will 
forward the input documents to the proper shard. Solr uses the MurmurHash 
algorithm to decide the target shard, so I would stick to that in any case.
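
For anyone who still partitions on the client, here is a minimal sketch of 
hash-mod partitioning on the document URL with a 32-bit MurmurHash3. The class 
name, the method names and the seed of 0 are assumptions made for illustration. 
This is not Solr's routing code and it is not guaranteed to pick the same shard 
that Solr would.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: maps a document URL to one of N partitions.
final class MurmurModPartitioner {
    private final int numPartitions;

    MurmurModPartitioner(int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        this.numPartitions = numPartitions;
    }

    // Returns a partition index in [0, numPartitions).
    int partition(String url) {
        byte[] bytes = url.getBytes(StandardCharsets.UTF_8);
        // floorMod keeps the result non-negative for negative hash values.
        return Math.floorMod(murmur3_32(bytes, 0), numPartitions);
    }

    // 32-bit MurmurHash3 (x86 variant) over a byte array.
    static int murmur3_32(byte[] data, int seed) {
        final int c1 = 0xcc9e2d51;
        final int c2 = 0x1b873593;
        int h = seed;
        int len = data.length;
        int nblocks = len / 4;

        // Body: mix in one little-endian 4-byte block at a time.
        for (int i = 0; i < nblocks; i++) {
            int o = 4 * i;
            int k = (data[o] & 0xff)
                    | ((data[o + 1] & 0xff) << 8)
                    | ((data[o + 2] & 0xff) << 16)
                    | ((data[o + 3] & 0xff) << 24);
            k *= c1;
            k = Integer.rotateLeft(k, 15);
            k *= c2;
            h ^= k;
            h = Integer.rotateLeft(h, 13);
            h = h * 5 + 0xe6546b64;
        }

        // Tail: the remaining 1 to 3 bytes (cases fall through on purpose).
        int tail = nblocks * 4;
        int k = 0;
        switch (len & 3) {
            case 3:
                k ^= (data[tail + 2] & 0xff) << 16;
            case 2:
                k ^= (data[tail + 1] & 0xff) << 8;
            case 1:
                k ^= (data[tail] & 0xff);
                k *= c1;
                k = Integer.rotateLeft(k, 15);
                k *= c2;
                h ^= k;
        }

        // Finalization: force all bits of the hash to avalanche.
        h ^= len;
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }
}
```

With three servers, new MurmurModPartitioner(3).partition(url) always returns 
the same index in 0..2 for a given URL, so a document keeps landing on the same 
server for as long as the number of servers stays the same.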


Anyway, with Solr able to accept incoming documents on any node and distribute 
them appropriately, there is no need for client-side hashing at all. What we 
do need is to select a target server from a pool for each batch. Committing is 
not needed if soft autocommit is enabled, which is quite useful for Solr's new 
NRT features.
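
A minimal sketch of selecting a target server per batch, assuming the pool is 
configured as a comma-separated list of URLs and that simple round-robin is an 
acceptable selection policy. The class and method names are hypothetical and 
nothing here is existing Nutch or Solr API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical pool of Solr server URLs with round-robin selection.
final class SolrServerPool {
    private final List<String> urls;
    private final AtomicInteger next = new AtomicInteger();

    // Parses a comma-separated list of server URLs, skipping empty entries.
    SolrServerPool(String commaSeparatedUrls) {
        List<String> parsed = new ArrayList<>();
        for (String part : commaSeparatedUrls.split(",")) {
            String url = part.trim();
            if (!url.isEmpty()) {
                parsed.add(url);
            }
        }
        if (parsed.isEmpty()) {
            throw new IllegalArgumentException("no server URLs given");
        }
        this.urls = parsed;
    }

    // Call once per batch: returns the server that should receive it.
    String nextServerForBatch() {
        // floorMod keeps the index valid even after the counter overflows.
        int i = Math.floorMod(next.getAndIncrement(), urls.size());
        return urls.get(i);
    }
}
```

The writer would call nextServerForBatch() each time its queue is flushed and 
send the whole batch to the returned URL. Skipping servers that are offline is 
left out of this sketch.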

If Solr 4.0 is released in the coming months (and that's what it looks like), I 
would suggest patching Nutch to accept a list of Solr server URLs instead of 
doing the partitioning on the client side.

In our case we don't even need a pool of Solr servers in Nutch to select from 
because we pass the documents through a proxy that is aware of running and 
offline servers.

Markus

> Thanks Julien and Lewis.
> 
> Being able to specify the partitioner class sounds good - I am thinking
> that perhaps they could all be impls of the Hadoop
> org.apache.hadoop.mapreduce.Partitioner interface.
> 
> Would it be okay if I annotated NUTCH-945 saying that I am working on
> providing a patch for the NutchGora branch initially (I haven't looked at
> the head code yet, it's likely to be slightly different), and then try to
> port the change over to the head?
> 
> -sujit
> 
> On Feb 22, 2012, at 3:01 AM, Lewis John Mcgibbney wrote:
> > Hi.
> > 
> > There was an issue [0] opened for this some time ago and it looks like,
> > apart from the (bare minimum) description, there has been no work done on
> > it.
> > 
> > Would be a real nice feature to have.
> > 
> > [0] https://issues.apache.org/jira/browse/NUTCH-945
> > 
> > On Wed, Feb 22, 2012 at 6:12 AM, Julien Nioche <[email protected]> wrote:
> >> Hi Sujit,
> >> 
> >> Sounds good. A nice way of doing it would be to make it so that people can
> >> define how to partition over the SOLR instances in the way they want
> >> e.g. consistent hashing, URL range or crawldb metadata by taking a
> >> class name as parameter. Does not need to be pluggable I think. I had
> >> implemented something along these lines some time ago for a customer
> >> but could not release it open source.
> >> 
> >> Feel free to open a JIRA  to comment on this issue and attach a patch.
> >> 
> >> Thanks
> >> 
> >> Julien
> >> 
> >> On 22 February 2012 03:45, SUJIT PAL <[email protected]> wrote:
> >>> Hi,
> >>> 
> >>> I need to move the SOLR based search platform to a distributed setup,
> >>> and therefore need to be able to write to multiple SOLR servers from
> >>> Nutch (working on the nutchgora branch, so this may be specific to
> >>> this branch).
> >>> 
> >>> Here is what I think I need to do...
> >>> 
> >>> Currently, SolrIndexerJob writes to Solr in the IndexerReducer, where
> >>> it converts the WebPage to a NutchDocument, then passes the
> >>> NutchDocument to the appropriate NutchIndexWriter (SolrWriter in this
> >>> case). The SolrWriter adds the NutchDocument to a queue and when the
> >>> commit size is exceeded, writes out the queue and does a commit (and
> >>> another one in the shutdown step).
> >>> 
> >>> My proposal is to specify the SolrConstants.SERVER_URL parameter as a
> >>> comma-separated list of URLs. The SolrWriter splits this parameter on
> >>> "," and creates an array of server URLs and a same-sized array of
> >>> inputDocs queues. It then takes each document's URL, runs it through a
> >>> hashMod partitioner and adds the document to the inputDocs queue that
> >>> the partition points to.
> >>> 
> >>> Then my pages get split up into a number of SOLR servers, where I can
> >>> query them in a distributed fashion (according to the SOLR docs, it is
> >>> advisable to do this in a random manner to make sure the (unreliable)
> >>> idf values do not influence scores from one server too much).
> >>> 
> >>> Is this a reasonable way to go about this? Or is there a simpler method
> >>> I am overlooking?
> >>> 
> >>> TIA for any help you can provide.
> >>> 
> >>> -sujit
> >> 
> >> --
> >> Open Source Solutions for Text Engineering
> >> 
> >> http://digitalpebble.blogspot.com/
> >> http://www.digitalpebble.com
> >> http://twitter.com/digitalpebble
