Re: [nutchgora] - proposal to support distributed indexing

Lewis John Mcgibbney Wed, 22 Feb 2012 03:01:38 -0800

Hi.

There was an issue [0] opened for this some time ago, and it looks like,
apart from the (bare-minimum) description, no work has been done on it.


Would be a real nice feature to have.

[0] https://issues.apache.org/jira/browse/NUTCH-945

On Wed, Feb 22, 2012 at 6:12 AM, Julien Nioche <
[email protected]> wrote:

> Hi Sujit,
>
> Sounds good. A nice way of doing it would be to make it so that people can
> define how to partition over the SOLR instances in the way they want e.g.
> consistent hashing, URL range or crawldb metadata by taking a class name as
> parameter. Does not need to be pluggable I think. I had implemented
> something along these lines some time ago for a customer but could not
> release it open source.
>
> Feel free to open a JIRA to comment on this issue and attach a patch.
>
> Thanks
>
> Julien
>
> On 22 February 2012 03:45, SUJIT PAL <[email protected]> wrote:
>
> > Hi,
> >
> > I need to move the SOLR based search platform to a distributed setup, and
> > therefore need to be able to write to multiple SOLR servers from Nutch
> > (working on the nutchgora branch, so this may be specific to this branch).
> > Here is what I think I need to do...
> >
> > Currently, SolrIndexerJob writes to Solr in the IndexerReducer, where it
> > converts the WebPage to a NutchDocument, then passes the NutchDocument to
> > the appropriate NutchIndexWriter (SolrWriter in this case). The SolrWriter
> > adds the NutchDocument to a queue and when the commit size is exceeded,
> > writes out the queue and does a commit (and another one in the shutdown
> > step).
> >
> > My proposal is to specify the SolrConstants.SERVER_URL parameter as a
> > comma-separated list of URLs. The SolrWriter splits this parameter on ","
> > and creates an array of server URLs and an equally sized array of inputDocs
> > queues. For each document, it then takes the page URL, runs it through a
> > hashMod partitioner, and adds the document to the inputDocs queue selected
> > by the resulting partition.
> >
> > Then my pages get split across a number of SOLR servers, which I can
> > query in a distributed fashion (according to the SOLR docs, it is
> > advisable to distribute documents randomly so that the (unreliable) idf
> > values do not influence scores from one server too much).
> >
> > Is this a reasonable way to go about this? Or is there a simpler method I
> > am overlooking?
> >
> > TIA for any help you can provide.
> >
> > -sujit
> >
> >
>
>
> --
> *Open Source Solutions for Text Engineering*
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>
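
For illustration, the routing step in the proposal quoted above might be
sketched roughly as below. This is only a sketch under assumptions:
ShardedDocQueues and its methods are hypothetical names, not existing Nutch
classes, and a document is represented here only by its URL string rather than
a NutchDocument. The actual sending and committing to Solr is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Nutch): keeps one queue per Solr server
// and routes each document to a queue by hashing its page URL.
class ShardedDocQueues {
    private final String[] serverUrls;
    private final List<List<String>> queues = new ArrayList<>();

    // serverUrlParam mirrors the proposed comma-separated SERVER_URL value.
    ShardedDocQueues(String serverUrlParam) {
        this.serverUrls = serverUrlParam.trim().split("\\s*,\\s*");
        for (int i = 0; i < serverUrls.length; i++) {
            queues.add(new ArrayList<>());
        }
    }

    // Hash-mod partition. floorMod keeps the index non-negative even when
    // String.hashCode() returns a negative value.
    int partition(String pageUrl) {
        return Math.floorMod(pageUrl.hashCode(), serverUrls.length);
    }

    // Adds the document (here just its URL) to the queue of its partition
    // and returns the partition index that was chosen.
    int enqueue(String pageUrl) {
        int p = partition(pageUrl);
        queues.get(p).add(pageUrl);
        return p;
    }

    String serverFor(int partition) {
        return serverUrls[partition];
    }

    List<String> queueFor(int partition) {
        return queues.get(partition);
    }

    int shardCount() {
        return serverUrls.length;
    }
}
```

A writer could then flush queueFor(i) to serverFor(i) whenever that queue
exceeds the commit size. One caveat with plain hash-mod: changing the number of
servers changes the partition of most URLs, which is where the consistent
hashing option Julien mentions would help.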



-- 
*Lewis*
