Thanks Julien and Lewis.

Being able to specify the partitioner class sounds good - I am thinking that 
perhaps they could all be implementations of the Hadoop 
org.apache.hadoop.mapreduce.Partitioner abstract class. 
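
Something along these lines is what I have in mind (just a sketch on my 
part - the class name, the generic types, and keying on the document URL 
are my assumptions, not anything that exists in the branch yet):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;
    import org.apache.nutch.indexer.NutchDocument;

    /** Hypothetical hashMod partitioner: routes a page to one of
     *  numServers SOLR servers by hashing its URL. */
    public class HashModPartitioner extends Partitioner<Text, NutchDocument> {
      @Override
      public int getPartition(Text url, NutchDocument doc, int numServers) {
        // mask the sign bit so the modulus is never negative
        return (url.hashCode() & Integer.MAX_VALUE) % numServers;
      }
    }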

Would it be okay if I annotated NUTCH-945 saying that I am working on providing 
a patch for the NutchGora branch initially (I haven't looked at the head code 
yet; it's likely to be slightly different), and then try to port the change over 
to the head?

-sujit

On Feb 22, 2012, at 3:01 AM, Lewis John Mcgibbney wrote:

> Hi.
> 
> There was an issue [0] opened for this some time ago, and it looks like,
> apart from the (bare-minimum) description, no work has been done on it.
> 
> It would be a really nice feature to have.
> 
> [0] https://issues.apache.org/jira/browse/NUTCH-945
> 
> On Wed, Feb 22, 2012 at 6:12 AM, Julien Nioche <
> [email protected]> wrote:
> 
>> Hi Sujit,
>> 
>> Sounds good. A nice way of doing it would be to make it so that people can
>> define how to partition over the SOLR instances in the way they want, e.g.
>> consistent hashing, URL ranges, or crawldb metadata, by taking a class name
>> as a parameter. It does not need to be pluggable, I think. I had implemented
>> something along these lines some time ago for a customer but could not
>> release it as open source.
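>> 
>> For instance, something along these lines (a rough sketch - the property
>> name and the default are made up, and the usual Hadoop imports
>> (Configuration, Partitioner, ReflectionUtils, HashPartitioner) are
>> assumed):
>> 
>>   // load the partitioning strategy from a (hypothetical) config property
>>   Class<? extends Partitioner> clazz = conf.getClass(
>>       "solr.partitioner.class",   // hypothetical property name
>>       HashPartitioner.class,      // some sensible default
>>       Partitioner.class);
>>   Partitioner<Text, NutchDocument> partitioner =
>>       ReflectionUtils.newInstance(clazz, conf);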
>> 
>> Feel free to open a JIRA to comment on this issue and attach a patch.
>> 
>> Thanks
>> 
>> Julien
>> 
>> On 22 February 2012 03:45, SUJIT PAL <[email protected]> wrote:
>> 
>>> Hi,
>>> 
>>> I need to move the SOLR-based search platform to a distributed setup, and
>>> therefore need to be able to write to multiple SOLR servers from Nutch
>>> (working on the nutchgora branch, so this may be specific to this branch).
>>> Here is what I think I need to do...
>>> 
>>> Currently, SolrIndexerJob writes to Solr in the IndexerReducer, where it
>>> converts the WebPage to a NutchDocument, then passes the NutchDocument to
>>> the appropriate NutchIndexWriter (SolrWriter in this case). The SolrWriter
>>> adds the NutchDocument to a queue, and when the commit size is exceeded,
>>> writes out the queue and does a commit (and another commit in the
>>> shutdown step).
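>>> 
>>> In rough pseudocode (not the exact SolrWriter code, just my reading of
>>> it):
>>> 
>>>   inputDocs.add(doc);                    // queue up the document
>>>   if (inputDocs.size() >= commitSize) {  // commitSize from config
>>>     solr.add(inputDocs);                 // write out the queue
>>>     inputDocs.clear();
>>>   }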
>>> 
>>> My proposal is to specify the SolrConstants.SERVER_URL parameter as a
>>> comma-separated list of URLs. The SolrWriter splits this parameter on ","
>>> to create an array of server URLs and a same-sized array of inputDocs
>>> queues. It then takes each document's URL, runs it through a hashMod
>>> partitioner, and writes the document to the inputDocs queue that the
>>> partition points to.
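>>> 
>>> Roughly like this (a sketch only - field names and the SolrJ setup are
>>> illustrative, not the actual SolrWriter code):
>>> 
>>>   // in open(): parse the comma-separated URL list
>>>   String[] urls = conf.get(SolrConstants.SERVER_URL).split(",");
>>>   SolrServer[] servers = new SolrServer[urls.length];
>>>   List<List<SolrInputDocument>> queues =
>>>       new ArrayList<List<SolrInputDocument>>();
>>>   for (int i = 0; i < urls.length; i++) {
>>>     servers[i] = new CommonsHttpSolrServer(urls[i].trim());
>>>     queues.add(new ArrayList<SolrInputDocument>());
>>>   }
>>> 
>>>   // in write(): route each document by hashMod of its URL
>>>   int part = (url.hashCode() & Integer.MAX_VALUE) % servers.length;
>>>   queues.get(part).add(doc);
>>>   if (queues.get(part).size() >= commitSize) {
>>>     servers[part].add(queues.get(part));  // flush this shard's queue
>>>     queues.get(part).clear();
>>>   }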
>>> 
>>> Then my pages get split across a number of SOLR servers, and I can
>>> query them in a distributed fashion (according to the SOLR docs, it is
>>> advisable to distribute documents randomly, to make sure the (unreliable)
>>> idf values from any one server do not influence scores too much).
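>>> 
>>> (For example, a distributed query would list all the shards in one
>>> request, something like:
>>> http://server1:8983/solr/select?q=...&shards=server1:8983/solr,server2:8983/solr)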
>>> 
>>> Is this a reasonable way to go about this? Or is there a simpler method I
>>> am overlooking?
>>> 
>>> TIA for any help you can provide.
>>> 
>>> -sujit
>>> 
>>> 
>> 
>> 
>> --
>> Open Source Solutions for Text Engineering
>> 
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> http://twitter.com/digitalpebble
>> 
> 
> 
> 
> -- 
> *Lewis*
