In that case the algorithm doesn't matter as you still need to reindex the corpus if you upgrade to 4.x.
Cheers!

> Thanks Markus, I guess I'll probably still need to build Nutch-side
> partitioning for myself since I am on Solr 3.5. It would be throw-away
> code, to be changed when I get on to 4.x.
>
> -sujit
>
> On Feb 22, 2012, at 10:24 AM, Markus Jelsma wrote:
> > Hi,
> >
> > We're in the process of testing Solr trunk's cloud features, which
> > recently gained initial work on distributed indexing. With it, there
> > is no need anymore to do the partitioning client side, because Solr
> > will forward the input documents to the proper shard. Solr uses the
> > MurmurHash algorithm to decide the target shard, so I would stick to
> > that in any case.
> >
> > Anyway, with Solr being able to handle incoming documents on any node
> > and distribute them appropriately, there is no need for hashing at
> > all anymore. What we do need is to select a target server from a pool
> > per batch. Committing is not needed if soft autocommit is enabled,
> > which is quite useful for Solr's new NRT features.
> >
> > If Solr 4.0 is released in the coming months (and that's what it
> > looks like), I would suggest patching Nutch to allow for a list of
> > Solr server URLs instead of doing the partitioning on the client
> > side.
> >
> > In our case we don't even need a pool of Solr servers in Nutch to
> > select from, because we pass the documents through a proxy that is
> > aware of running and offline servers.
> >
> > Markus
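For illustration, the shard routing Markus describes comes down to hashing
each document's unique key and taking the result modulo the number of
shards. A minimal sketch in Java (the method name is hypothetical, and
String.hashCode stands in for the MurmurHash function Solr actually uses):

    // Pick the target shard for a document from a hash of its unique key.
    // Solr trunk uses MurmurHash; String.hashCode is only a stand-in here.
    public static int selectShard(String uniqueKey, int numShards) {
        // Mask the sign bit so the modulo result is never negative.
        int hash = uniqueKey.hashCode() & Integer.MAX_VALUE;
        return hash % numShards;
    }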
> >> Thanks Julien and Lewis.
> >>
> >> Being able to specify the partitioner class sounds good - I am
> >> thinking that perhaps they could all be impls of the Hadoop
> >> org.apache.hadoop.mapreduce.Partitioner interface.
> >>
> >> Would it be okay if I annotated NUTCH-945 saying that I am working
> >> on providing a patch for the NutchGora branch initially (I haven't
> >> looked at the head code yet; it's likely to be slightly different),
> >> and then try to port the change over to the head?
> >>
> >> -sujit
> >>
> >> On Feb 22, 2012, at 3:01 AM, Lewis John Mcgibbney wrote:
> >>> Hi.
> >>>
> >>> There was an issue [0] opened for this some time ago, and it looks
> >>> like, apart from the (bare minimal) description, there has been no
> >>> work done on it.
> >>>
> >>> Would be a really nice feature to have.
> >>>
> >>> [0] https://issues.apache.org/jira/browse/NUTCH-945
> >>>
> >>> On Wed, Feb 22, 2012 at 6:12 AM, Julien Nioche
> >>> <[email protected]> wrote:
> >>>> Hi Sujit,
> >>>>
> >>>> Sounds good. A nice way of doing it would be to make it so that
> >>>> people can define how to partition over the SOLR instances in the
> >>>> way they want, e.g. consistent hashing, URL range or crawldb
> >>>> metadata, by taking a class name as a parameter. It does not need
> >>>> to be pluggable, I think. I had implemented something along these
> >>>> lines some time ago for a customer but could not release it open
> >>>> source.
> >>>>
> >>>> Feel free to open a JIRA to comment on this issue and attach a
> >>>> patch.
> >>>>
> >>>> Thanks
> >>>>
> >>>> Julien
> >>>>
> >>>> On 22 February 2012 03:45, SUJIT PAL <[email protected]> wrote:
> >>>>> Hi,
> >>>>>
> >>>>> I need to move our SOLR-based search platform to a distributed
> >>>>> setup, and therefore need to be able to write to multiple SOLR
> >>>>> servers from Nutch (working on the nutchgora branch, so this may
> >>>>> be specific to this branch).
> >>>>>
> >>>>> Here is what I think I need to do...
> >>>>>
> >>>>> Currently, SolrIndexerJob writes to Solr in the IndexerReducer,
> >>>>> where it converts the WebPage to a NutchDocument, then passes the
> >>>>> NutchDocument to the appropriate NutchIndexWriter (SolrWriter in
> >>>>> this case). The SolrWriter adds the NutchDocument to a queue, and
> >>>>> when the commit size is exceeded, writes out the queue and does a
> >>>>> commit (and another one in the shutdown step).
> >>>>>
> >>>>> My proposal is to specify the SolrConstants.SERVER_URL parameter
> >>>>> as a comma-separated list of URLs. The SolrWriter splits this
> >>>>> parameter on "," and creates an array of server URLs and a
> >>>>> same-sized array of inputDocs queues. It then takes the page URL,
> >>>>> runs it through a hashMod partitioner, and writes the document to
> >>>>> the inputDocs queue selected by the partition.
> >>>>>
> >>>>> Then my pages get split up across a number of SOLR servers, where
> >>>>> I can query them in a distributed fashion (according to the SOLR
> >>>>> docs, it is advisable to do this in a random manner to make sure
> >>>>> the (unreliable) idf values do not influence scores from one
> >>>>> server too much).
> >>>>>
> >>>>> Is this a reasonable way to go about this? Or is there a simpler
> >>>>> method I am overlooking?
> >>>>>
> >>>>> TIA for any help you can provide.
> >>>>>
> >>>>> -sujit
> >>>>
> >>>> --
> >>>> Open Source Solutions for Text Engineering
> >>>>
> >>>> http://digitalpebble.blogspot.com/
> >>>> http://www.digitalpebble.com
> >>>> http://twitter.com/digitalpebble
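Concretely, the SolrWriter change Sujit describes might look something
like the sketch below. This is a rough illustration against the Solr 3.x
SolrJ API, not the actual Nutch code; the class name and the commitSize
handling are assumptions based on the description above.

    import java.io.IOException;
    import java.net.MalformedURLException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    /** Hypothetical sketch of the proposed multi-server SolrWriter. */
    public class PartitionedSolrWriter {

      private final CommonsHttpSolrServer[] servers;
      private final List<List<SolrInputDocument>> queues;
      private final int commitSize;

      // serverUrls is the comma-separated SolrConstants.SERVER_URL value.
      public PartitionedSolrWriter(String serverUrls, int commitSize)
          throws MalformedURLException {
        String[] urls = serverUrls.split(",");
        this.servers = new CommonsHttpSolrServer[urls.length];
        this.queues = new ArrayList<List<SolrInputDocument>>();
        this.commitSize = commitSize;
        for (int i = 0; i < urls.length; i++) {
          servers[i] = new CommonsHttpSolrServer(urls[i].trim());
          queues.add(new ArrayList<SolrInputDocument>());
        }
      }

      // hashMod partitioning: the page URL decides which server gets the
      // document, so a given URL always lands on the same server.
      public void write(String url, SolrInputDocument doc)
          throws IOException, SolrServerException {
        int partition = (url.hashCode() & Integer.MAX_VALUE) % servers.length;
        List<SolrInputDocument> queue = queues.get(partition);
        queue.add(doc);
        if (queue.size() >= commitSize) {
          servers[partition].add(queue);
          queue.clear();
        }
      }

      // Shutdown step: flush whatever is queued and commit on each server.
      public void close() throws IOException, SolrServerException {
        for (int i = 0; i < servers.length; i++) {
          if (!queues.get(i).isEmpty()) {
            servers[i].add(queues.get(i));
            queues.get(i).clear();
          }
          servers[i].commit();
        }
      }
    }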

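If the strategy is made pluggable via a class name parameter, as Julien
suggests, the hash-mod variant Sujit mentions could be as small as the
sketch below (class name hypothetical; Text keys are used for simplicity):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    /** Hypothetical hash-mod partitioner keyed on the page URL. */
    public class HashModUrlPartitioner extends Partitioner<Text, Text> {
      @Override
      public int getPartition(Text url, Text value, int numPartitions) {
        // Mask the sign bit so the result is a valid partition index.
        return (url.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }

Nutch could then instantiate whatever class is configured, for example via
Hadoop's ReflectionUtils.newInstance, so consistent hashing or URL-range
variants drop in without touching the indexing job itself.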

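On Markus's commit point: on the 4.x line, soft autocommit is a
solrconfig.xml setting, so clients need not send commits at all. A typical
fragment (the 1000 ms window here is an arbitrary example) looks like:

    <updateHandler class="solr.DirectUpdateHandler2">
      <!-- Reopen a near-real-time searcher at most every second. -->
      <autoSoftCommit>
        <maxTime>1000</maxTime>
      </autoSoftCommit>
    </updateHandler>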