In that case the algorithm doesn't matter as you still need to reindex the corpus if you upgrade to 4.x.
Cheers!

> Thanks Markus, I guess I'll probably still need to build Nutch-side
> partitioning for myself since I am on Solr 3.5. It would be throw-away
> code, to be changed when I get on to 4.x.
>
> -sujit
>
> On Feb 22, 2012, at 10:24 AM, Markus Jelsma wrote:
> > Hi,
> >
> > We're in the process of testing Solr trunk's cloud features, which
> > recently gained initial work on distributed indexing. With it, there
> > is no need anymore to do the partitioning client side, because Solr
> > will forward the input documents to the proper shard. Solr uses the
> > MurmurHash algorithm to decide the target shard, so I would stick to
> > that in any case.
> >
> > Anyway, with Solr being able to handle incoming documents on any node
> > and distribute them appropriately, there is no need for hashing at
> > all anymore. What we do need is to select a target server from a pool
> > per batch. Committing is not needed if soft autocommit is enabled,
> > which is quite useful for Solr's new NRT features.
> >
> > If Solr 4.0 is released in the coming months (and that's what it
> > looks like), I would suggest patching Nutch to allow for a list of
> > Solr server URLs instead of doing the partitioning on the client
> > side.
> >
> > In our case we don't even need a pool of Solr servers in Nutch to
> > select from, because we pass the documents through a proxy that is
> > aware of running and offline servers.
> >
> > Markus
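For illustration, the shard routing Markus describes comes down to hashing
each document's unique key and taking the result modulo the number of
shards. A minimal sketch in Java (the method name is hypothetical, and
String.hashCode stands in for the MurmurHash function Solr actually uses):

    // Pick the target shard for a document from a hash of its unique key.
    // Solr trunk uses MurmurHash; String.hashCode is only a stand-in here.
    public static int selectShard(String uniqueKey, int numShards) {
        // Mask the sign bit so the modulo result is never negative.
        int hash = uniqueKey.hashCode() & Integer.MAX_VALUE;
        return hash % numShards;
    }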
> >> Thanks Julien and Lewis.
> >>
> >> Being able to specify the partitioner class sounds good - I am
> >> thinking that perhaps they could all be impls of the Hadoop
> >> org.apache.hadoop.mapreduce.Partitioner interface.
> >>
> >> Would it be okay if I annotated NUTCH-945 saying that I am working
> >> on providing a patch for the NutchGora branch initially (I haven't
> >> looked at the head code yet; it's likely to be slightly different),
> >> and then try to port the change over to the head?
> >>
> >> -sujit
> >>
> >> On Feb 22, 2012, at 3:01 AM, Lewis John Mcgibbney wrote:
> >>> Hi.
> >>>
> >>> There was an issue [0] opened for this some time ago, and it looks
> >>> like, apart from the (bare minimal) description, there has been no
> >>> work done on it.
> >>>
> >>> Would be a really nice feature to have.
> >>>
> >>> [0] https://issues.apache.org/jira/browse/NUTCH-945
> >>>
> >>> On Wed, Feb 22, 2012 at 6:12 AM, Julien Nioche
> >>> <[email protected]> wrote:
> >>>> Hi Sujit,
> >>>>
> >>>> Sounds good. A nice way of doing it would be to make it so that
> >>>> people can define how to partition over the SOLR instances in the
> >>>> way they want, e.g. consistent hashing, URL range or crawldb
> >>>> metadata, by taking a class name as a parameter. It does not need
> >>>> to be pluggable, I think. I had implemented something along these
> >>>> lines some time ago for a customer but could not release it open
> >>>> source.
> >>>>
> >>>> Feel free to open a JIRA to comment on this issue and attach a
> >>>> patch.
> >>>>
> >>>> Thanks
> >>>>
> >>>> Julien
> >>>>
> >>>> On 22 February 2012 03:45, SUJIT PAL <[email protected]> wrote:
> >>>>> Hi,
> >>>>>
> >>>>> I need to move our SOLR-based search platform to a distributed
> >>>>> setup, and therefore need to be able to write to multiple SOLR
> >>>>> servers from Nutch (working on the nutchgora branch, so this may
> >>>>> be specific to this branch).
> >>>>>
> >>>>> Here is what I think I need to do...
> >>>>>
> >>>>> Currently, SolrIndexerJob writes to Solr in the IndexerReducer,
> >>>>> where it converts the WebPage to a NutchDocument, then passes the
> >>>>> NutchDocument to the appropriate NutchIndexWriter (SolrWriter in
> >>>>> this case). The SolrWriter adds the NutchDocument to a queue, and
> >>>>> when the commit size is exceeded, writes out the queue and does a
> >>>>> commit (and another one in the shutdown step).
> >>>>>
> >>>>> My proposal is to specify the SolrConstants.SERVER_URL parameter
> >>>>> as a comma-separated list of URLs. The SolrWriter splits this
> >>>>> parameter on "," and creates an array of server URLs and a
> >>>>> same-sized array of inputDocs queues. It then takes the page URL,
> >>>>> runs it through a hashMod partitioner, and writes the document to
> >>>>> the inputDocs queue selected by the partition.
> >>>>>
> >>>>> Then my pages get split up across a number of SOLR servers, where
> >>>>> I can query them in a distributed fashion (according to the SOLR
> >>>>> docs, it is advisable to do this in a random manner to make sure
> >>>>> the (unreliable) idf values do not influence scores from one
> >>>>> server too much).
> >>>>>
> >>>>> Is this a reasonable way to go about this? Or is there a simpler
> >>>>> method I am overlooking?
> >>>>>
> >>>>> TIA for any help you can provide.
> >>>>>
> >>>>> -sujit
> >>>>
> >>>> --
> >>>> Open Source Solutions for Text Engineering
> >>>>
> >>>> http://digitalpebble.blogspot.com/
> >>>> http://www.digitalpebble.com
> >>>> http://twitter.com/digitalpebble
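Concretely, the SolrWriter change Sujit describes might look something
like the sketch below. This is a rough illustration against the Solr 3.x
SolrJ API, not the actual Nutch code; the class name and the commitSize
handling are assumptions based on the description above.

    import java.io.IOException;
    import java.net.MalformedURLException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    /** Hypothetical sketch of the proposed multi-server SolrWriter. */
    public class PartitionedSolrWriter {

      private final CommonsHttpSolrServer[] servers;
      private final List<List<SolrInputDocument>> queues;
      private final int commitSize;

      // serverUrls is the comma-separated SolrConstants.SERVER_URL value.
      public PartitionedSolrWriter(String serverUrls, int commitSize)
          throws MalformedURLException {
        String[] urls = serverUrls.split(",");
        this.servers = new CommonsHttpSolrServer[urls.length];
        this.queues = new ArrayList<List<SolrInputDocument>>();
        this.commitSize = commitSize;
        for (int i = 0; i < urls.length; i++) {
          servers[i] = new CommonsHttpSolrServer(urls[i].trim());
          queues.add(new ArrayList<SolrInputDocument>());
        }
      }

      // hashMod partitioning: the page URL decides which server gets the
      // document, so a given URL always lands on the same server.
      public void write(String url, SolrInputDocument doc)
          throws IOException, SolrServerException {
        int partition = (url.hashCode() & Integer.MAX_VALUE) % servers.length;
        List<SolrInputDocument> queue = queues.get(partition);
        queue.add(doc);
        if (queue.size() >= commitSize) {
          servers[partition].add(queue);
          queue.clear();
        }
      }

      // Shutdown step: flush whatever is queued and commit on each server.
      public void close() throws IOException, SolrServerException {
        for (int i = 0; i < servers.length; i++) {
          if (!queues.get(i).isEmpty()) {
            servers[i].add(queues.get(i));
            queues.get(i).clear();
          }
          servers[i].commit();
        }
      }
    }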

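If the strategy is made pluggable via a class name parameter, as Julien
suggests, the hash-mod variant Sujit mentions could be as small as the
sketch below (class name hypothetical; Text keys are used for simplicity):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    /** Hypothetical hash-mod partitioner keyed on the page URL. */
    public class HashModUrlPartitioner extends Partitioner<Text, Text> {
      @Override
      public int getPartition(Text url, Text value, int numPartitions) {
        // Mask the sign bit so the result is a valid partition index.
        return (url.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }

Nutch could then instantiate whatever class is configured, for example via
Hadoop's ReflectionUtils.newInstance, so consistent hashing or URL-range
variants drop in without touching the indexing job itself.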

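On Markus's commit point: on the 4.x line, soft autocommit is a
solrconfig.xml setting, so clients need not send commits at all. A typical
fragment (the 1000 ms window here is an arbitrary example) looks like:

    <updateHandler class="solr.DirectUpdateHandler2">
      <!-- Reopen a near-real-time searcher at most every second. -->
      <autoSoftCommit>
        <maxTime>1000</maxTime>
      </autoSoftCommit>
    </updateHandler>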