Thanks Markus. I guess I'll probably still need to build Nutch-side partitioning for myself since I am on Solr 3.5 - it would be throw-away code, to be replaced when I move to 4.x.
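Something along these lines is what I have in mind for the throw-away version - only a minimal sketch, with made-up class and method names rather than actual Nutch code:

public class HashModPartitioner {

  private final String[] serverUrls;

  public HashModPartitioner(String serverUrlParam) {
    // e.g. "http://solr1:8983/solr,http://solr2:8983/solr"
    this.serverUrls = serverUrlParam.split(",");
  }

  // Hash the document URL and mod by the number of servers. The sign
  // bit is masked off so the index is always non-negative (Math.abs
  // overflows for Integer.MIN_VALUE).
  public int getPartition(String docUrl) {
    return (docUrl.hashCode() & Integer.MAX_VALUE) % serverUrls.length;
  }

  public String getServerUrl(String docUrl) {
    return serverUrls[getPartition(docUrl)];
  }
}

The SolrWriter would call getPartition() on each document's URL to pick the inputDocs queue (and hence the Solr server) to write to. String.hashCode() is used here only for brevity; to stay compatible with Solr 4.x routing, it could be swapped for MurmurHash as you suggest.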
-sujit

On Feb 22, 2012, at 10:24 AM, Markus Jelsma wrote:

> Hi,
>
> We're in the process of testing Solr trunk's cloud features, which recently
> gained initial work on distributed indexing. With it, there is no need
> anymore for doing the partitioning client side because Solr will forward the
> input documents to the proper shard. Solr uses the MurmurHash algorithm to
> decide the target shard, so I would stick to that in any case.
>
> Anyway, with Solr being able to handle incoming documents on any node and
> distribute them appropriately, there is no need anymore for hashing at all.
> What we do need is to select a target server from a pool per batch.
> Committing is not needed if soft autocommit is enabled, which is quite
> useful for Solr's new NRT features.
>
> If Solr 4.0 is released in the coming months (and that's what it looks
> like), I would suggest patching Nutch to allow for a list of Solr server
> URLs instead of doing partitioning on the client side.
>
> In our case we don't even need a pool of Solr servers in Nutch to select
> from, because we pass the documents through a proxy that is aware of
> running and offline servers.
>
> Markus
>
>> Thanks Julien and Lewis.
>>
>> Being able to specify the partitioner class sounds good - I am thinking
>> that perhaps they could all be impls of the Hadoop
>> org.apache.hadoop.mapreduce.Partitioner interface.
>>
>> Would it be okay if I annotated NUTCH-945 saying that I am working on
>> providing a patch for the NutchGora branch initially (I haven't looked at
>> the head code yet; it's likely to be slightly different), and then try to
>> port the change over to the head?
>>
>> -sujit
>>
>> On Feb 22, 2012, at 3:01 AM, Lewis John Mcgibbney wrote:
>>> Hi,
>>>
>>> There was an issue [0] opened for this some time ago, and it looks like,
>>> apart from the (bare minimal) description, no work has been done on it.
>>>
>>> Would be a real nice feature to have.
>>>
>>> [0] https://issues.apache.org/jira/browse/NUTCH-945
>>>
>>> On Wed, Feb 22, 2012 at 6:12 AM, Julien Nioche <
>>> [email protected]> wrote:
>>>> Hi Sujit,
>>>>
>>>> Sounds good. A nice way of doing it would be to make it so that people
>>>> can define how to partition over the Solr instances in the way they
>>>> want, e.g. consistent hashing, URL range, or crawldb metadata, by taking
>>>> a class name as a parameter. It does not need to be pluggable, I think.
>>>> I had implemented something along these lines some time ago for a
>>>> customer but could not release it open source.
>>>>
>>>> Feel free to open a JIRA to comment on this issue and attach a patch.
>>>>
>>>> Thanks
>>>>
>>>> Julien
>>>>
>>>> On 22 February 2012 03:45, SUJIT PAL <[email protected]> wrote:
>>>>> Hi,
>>>>>
>>>>> I need to move the Solr-based search platform to a distributed setup,
>>>>> and therefore need to be able to write to multiple Solr servers from
>>>>> Nutch (I am working on the nutchgora branch, so this may be specific
>>>>> to this branch).
>>>>>
>>>>> Here is what I think I need to do...
>>>>>
>>>>> Currently, SolrIndexerJob writes to Solr in the IndexerReducer, where
>>>>> it converts the WebPage to a NutchDocument, then passes the
>>>>> NutchDocument to the appropriate NutchIndexWriter (SolrWriter in this
>>>>> case). The SolrWriter adds the NutchDocument to a queue and, when the
>>>>> commit size is exceeded, writes out the queue and does a commit (and
>>>>> another one in the shutdown step).
>>>>>
>>>>> My proposal is to specify the SolrConstants.SERVER_URL parameter as a
>>>>> comma-separated list of URLs. The SolrWriter splits this parameter on
>>>>> "," and creates an array of server URLs and a same-sized array of
>>>>> inputDocs queues. It then takes each document's URL, runs it through a
>>>>> hashMod partitioner, and writes the document out to the inputDocs
>>>>> queue pointed to by the partition.
>>>>>
>>>>> My pages then get split up across a number of Solr servers, and I can
>>>>> query them in a distributed fashion (according to the Solr docs, it is
>>>>> advisable to do this in a random manner to make sure the (unreliable)
>>>>> idf values from any one server do not influence scores too much).
>>>>>
>>>>> Is this a reasonable way to go about this? Or is there a simpler method
>>>>> I am overlooking?
>>>>>
>>>>> TIA for any help you can provide.
>>>>>
>>>>> -sujit
>>>>
>>>> --
>>>> Open Source Solutions for Text Engineering
>>>>
>>>> http://digitalpebble.blogspot.com/
>>>> http://www.digitalpebble.com
>>>> http://twitter.com/digitalpebble
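For reference, a minimal sketch of the Partitioner idea floated in the thread, i.e. implementing org.apache.hadoop.mapreduce.Partitioner keyed on the page URL (the class name and key/value types here are illustrative assumptions, not from the Nutch codebase):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative sketch: route each page to a reducer, and hence to one
// Solr server per reducer, by hashing its URL key.
public class UrlHashPartitioner extends Partitioner<Text, Writable> {

  @Override
  public int getPartition(Text url, Writable page, int numPartitions) {
    // Mask the sign bit so the partition index is always non-negative.
    return (url.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

A job would select it with job.setPartitionerClass(UrlHashPartitioner.class); reading the class name from a configuration property, as Julien suggests, would let users substitute consistent hashing or URL-range schemes.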

