Thanks Markus. I guess I'll probably still need to build Nutch-side partitioning for myself since I am on Solr 3.5 - it would be throw-away code, to be replaced when I move to 4.x.
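Something along these lines is what I have in mind for the throw-away version - only a minimal sketch, with made-up class and method names rather than actual Nutch code:

public class HashModPartitioner {

  private final String[] serverUrls;

  public HashModPartitioner(String serverUrlParam) {
    // e.g. "http://solr1:8983/solr,http://solr2:8983/solr"
    this.serverUrls = serverUrlParam.split(",");
  }

  // Hash the document URL and mod by the number of servers. The sign
  // bit is masked off so the index is always non-negative (Math.abs
  // overflows for Integer.MIN_VALUE).
  public int getPartition(String docUrl) {
    return (docUrl.hashCode() & Integer.MAX_VALUE) % serverUrls.length;
  }

  public String getServerUrl(String docUrl) {
    return serverUrls[getPartition(docUrl)];
  }
}

The SolrWriter would call getPartition() on each document's URL to pick the inputDocs queue (and hence the Solr server) to write to. String.hashCode() is used here only for brevity; to stay compatible with Solr 4.x routing, it could be swapped for MurmurHash as you suggest.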
-sujit

On Feb 22, 2012, at 10:24 AM, Markus Jelsma wrote:

> Hi,
>
> We're in the process of testing Solr trunk's cloud features, which recently
> gained initial work on distributed indexing. With it, there is no need
> anymore for doing the partitioning client side because Solr will forward the
> input documents to the proper shard. Solr uses the MurmurHash algorithm to
> decide the target shard, so I would stick to that in any case.
>
> Anyway, with Solr being able to handle incoming documents on any node and
> distribute them appropriately, there is no need anymore for hashing at all.
> What we do need is to select a target server from a pool per batch.
> Committing is not needed if soft autocommit is enabled, which is quite
> useful for Solr's new NRT features.
>
> If Solr 4.0 is released in the coming months (and that's what it looks
> like), I would suggest patching Nutch to allow for a list of Solr server
> URLs instead of doing partitioning on the client side.
>
> In our case we don't even need a pool of Solr servers in Nutch to select
> from, because we pass the documents through a proxy that is aware of
> running and offline servers.
>
> Markus
>
>> Thanks Julien and Lewis.
>>
>> Being able to specify the partitioner class sounds good - I am thinking
>> that perhaps they could all be impls of the Hadoop
>> org.apache.hadoop.mapreduce.Partitioner interface.
>>
>> Would it be okay if I annotated NUTCH-945 saying that I am working on
>> providing a patch for the NutchGora branch initially (I haven't looked at
>> the head code yet; it's likely to be slightly different), and then try to
>> port the change over to the head?
>>
>> -sujit
>>
>> On Feb 22, 2012, at 3:01 AM, Lewis John Mcgibbney wrote:
>>> Hi,
>>>
>>> There was an issue [0] opened for this some time ago, and it looks like,
>>> apart from the (bare minimal) description, no work has been done on it.
>>>
>>> Would be a real nice feature to have.
>>>
>>> [0] https://issues.apache.org/jira/browse/NUTCH-945
>>>
>>> On Wed, Feb 22, 2012 at 6:12 AM, Julien Nioche <
>>> [email protected]> wrote:
>>>> Hi Sujit,
>>>>
>>>> Sounds good. A nice way of doing it would be to make it so that people
>>>> can define how to partition over the Solr instances in the way they
>>>> want, e.g. consistent hashing, URL range, or crawldb metadata, by taking
>>>> a class name as a parameter. It does not need to be pluggable, I think.
>>>> I had implemented something along these lines some time ago for a
>>>> customer but could not release it open source.
>>>>
>>>> Feel free to open a JIRA to comment on this issue and attach a patch.
>>>>
>>>> Thanks
>>>>
>>>> Julien
>>>>
>>>> On 22 February 2012 03:45, SUJIT PAL <[email protected]> wrote:
>>>>> Hi,
>>>>>
>>>>> I need to move the Solr-based search platform to a distributed setup,
>>>>> and therefore need to be able to write to multiple Solr servers from
>>>>> Nutch (I am working on the nutchgora branch, so this may be specific
>>>>> to this branch).
>>>>>
>>>>> Here is what I think I need to do...
>>>>>
>>>>> Currently, SolrIndexerJob writes to Solr in the IndexerReducer, where
>>>>> it converts the WebPage to a NutchDocument, then passes the
>>>>> NutchDocument to the appropriate NutchIndexWriter (SolrWriter in this
>>>>> case). The SolrWriter adds the NutchDocument to a queue and, when the
>>>>> commit size is exceeded, writes out the queue and does a commit (and
>>>>> another one in the shutdown step).
>>>>>
>>>>> My proposal is to specify the SolrConstants.SERVER_URL parameter as a
>>>>> comma-separated list of URLs. The SolrWriter splits this parameter on
>>>>> "," and creates an array of server URLs and a same-sized array of
>>>>> inputDocs queues. It then takes each document's URL, runs it through a
>>>>> hashMod partitioner, and writes the document out to the inputDocs
>>>>> queue pointed to by the partition.
>>>>>
>>>>> My pages then get split up across a number of Solr servers, and I can
>>>>> query them in a distributed fashion (according to the Solr docs, it is
>>>>> advisable to do this in a random manner to make sure the (unreliable)
>>>>> idf values from any one server do not influence scores too much).
>>>>>
>>>>> Is this a reasonable way to go about this? Or is there a simpler method
>>>>> I am overlooking?
>>>>>
>>>>> TIA for any help you can provide.
>>>>>
>>>>> -sujit
>>>>
>>>> --
>>>> Open Source Solutions for Text Engineering
>>>>
>>>> http://digitalpebble.blogspot.com/
>>>> http://www.digitalpebble.com
>>>> http://twitter.com/digitalpebble
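For reference, a minimal sketch of the Partitioner idea floated in the thread, i.e. implementing org.apache.hadoop.mapreduce.Partitioner keyed on the page URL (the class name and key/value types here are illustrative assumptions, not from the Nutch codebase):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative sketch: route each page to a reducer, and hence to one
// Solr server per reducer, by hashing its URL key.
public class UrlHashPartitioner extends Partitioner<Text, Writable> {

  @Override
  public int getPartition(Text url, Writable page, int numPartitions) {
    // Mask the sign bit so the partition index is always non-negative.
    return (url.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

A job would select it with job.setPartitionerClass(UrlHashPartitioner.class); reading the class name from a configuration property, as Julien suggests, would let users substitute consistent hashing or URL-range schemes.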

