Hi Jason,

I did something along the lines of what you are after and submitted the patch as NUTCH-945. Maybe you will find it useful.
https://issues.apache.org/jira/browse/NUTCH-945

The idea behind the patch is this: you set up a list of Solr servers in your configuration, then define and configure a partitioner that returns an index into that list given the document URL. In your case I think you will have to build a custom partitioner that uses the domain to decide the partition.

-sujit

On Mar 6, 2013, at 1:34 AM, Stubblefield Jason wrote:

> Well Lewis, I quite frankly disagree.
>
> I am asking how I can have more control over the slice process in the
> Nutch mergesegs operation.
>
> I think this could be a useful feature to many Nutch users.
>
> I can see that I won't get any more assistance here.
>
> Thanks,
>
> Jason
>
> On Mar 6, 2013, at 6:18 AM, Lewis John Mcgibbney <[email protected]>
> wrote:
>
>> Hi Jason,
>> There is nothing I can see here which concerns Nutch.
>> Try the Solr lists please.
>> Thank you
>> Lewis
>>
>> On Tuesday, March 5, 2013, Stubblefield Jason <[email protected]> wrote:
>>> I have several Solr 3.6 instances that, for various reasons, I don't
>>> want to upgrade to 4.0 yet. My index is too big to fit on one machine.
>>> I want to be able to slice the crawl so that I can have 1 slice per
>>> Solr shard, but also use the grouping feature on Solr. From what I
>>> understand, Solr grouping doesn't work properly when pages from a
>>> domain are spread across Solr shards.
>>>
>>> Basically I'm after something like this:
>>>
>>> slice1 (apache.org, linux.org) -> solr1
>>>
>>> slice2 (stackoverflow.com, wikipedia.org) -> solr2
>>>
>>> etc...
>>>
>>> I could upgrade to SolrCloud, or possibly use Elasticsearch, but it
>>> would be a fair amount of re-coding. I was just curious if I could
>>> manage the sharding manually.
>>>
>>> Suggestions would certainly be appreciated; it seems like I am faced
>>> with a massive upgrade or breaking the grouping functionality.
>>>
>>> ~Jason
>>>
>>> On Mar 5, 2013, at 11:02 PM, Markus Jelsma <[email protected]> wrote:
>>>
>>>> Hi
>>>>
>>>> You can't do this with -slice but you can merge segments and filter
>>>> them. This would mean you'd have to merge the segments for each
>>>> domain. But that's far too much work. Why do you want to do this?
>>>> There may be better ways of achieving your goal.
>>>>
>>>> -----Original message-----
>>>>> From: Jason S <[email protected]>
>>>>> Sent: Tue 05-Mar-2013 22:18
>>>>> To: [email protected]
>>>>> Subject: keep all pages from a domain in one slice
>>>>>
>>>>> Hello,
>>>>>
>>>>> I seem to remember seeing a discussion about this in the past but I
>>>>> can't seem to find it in the archives.
>>>>>
>>>>> When using mergesegs -slice, is it possible to keep all the pages
>>>>> from a domain in the same slice? I have just been messing around
>>>>> with this functionality (Nutch 1.6), and it seems like the records
>>>>> are simply split after the counter has reached the slice size
>>>>> specified, sometimes splitting the records from a single domain
>>>>> over multiple slices.
>>>>>
>>>>> How can I segregate a domain to a single slice?
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> ~Jason
>>
>> --
>> *Lewis*
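
To illustrate the domain-based partitioner suggested at the top of this message, here is a minimal sketch. It is not the NUTCH-945 code: the class name DomainPartitioner, its methods, and the Solr URLs are all hypothetical, and the domain extraction is a crude "last two host labels" rule that gets suffixes such as co.uk wrong. It only shows the core idea, which is that hashing the domain rather than the full URL sends every page of a domain to the same server.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch only; names and structure are NOT taken from NUTCH-945. */
class DomainPartitioner {
    private final List<String> solrServers;

    DomainPartitioner(List<String> solrServers) {
        if (solrServers.isEmpty()) {
            throw new IllegalArgumentException("need at least one Solr server");
        }
        this.solrServers = new ArrayList<>(solrServers);
    }

    /**
     * Crude domain extraction (assumption): keep the last two labels of the host.
     * A real implementation would want a public-suffix-aware lookup.
     */
    public static String domainOf(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return ""; // no host component; all such URLs share one partition
        }
        host = host.toLowerCase(Locale.ROOT);
        String[] labels = host.split("\\.");
        int n = labels.length;
        return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
    }

    /** Returns an index into the server list, determined only by the URL's domain. */
    public int partition(String url) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(domainOf(url).hashCode(), solrServers.size());
    }

    /** Convenience lookup: the server that should index this URL. */
    public String serverFor(String url) {
        return solrServers.get(partition(url));
    }

    public static void main(String[] args) {
        // Placeholder server URLs, for illustration only
        DomainPartitioner p = new DomainPartitioner(Arrays.asList(
                "http://solr1:8983/solr", "http://solr2:8983/solr"));
        for (String url : Arrays.asList(
                "http://www.apache.org/foundation/",
                "http://nutch.apache.org/",
                "http://stackoverflow.com/questions/1")) {
            System.out.println(url + " -> " + p.serverFor(url));
        }
    }
}
```

A hash gives no control over which domains share a shard. To get an explicit mapping like the slice1/slice2 example in the quoted message, the hash could be swapped for a lookup table from domain to server index, keeping the hash only as a fallback for unlisted domains.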

