Well Lewis, I quite frankly disagree. I am asking how I can get more control over the slicing process in the Nutch mergesegs operation.
I think this could be a useful feature to many Nutch users. I can see that I won't get any more assistance here.

Thanks,
Jason

On Mar 6, 2013, at 6:18 AM, Lewis John Mcgibbney <[email protected]> wrote:

> Hi Jason,
> There is nothing I can see here which concerns Nutch.
> Try solr lists please.
> Thank you
> Lewis
>
> On Tuesday, March 5, 2013, Stubblefield Jason <[email protected]> wrote:
>> I have several Solr 3.6 instances that, for various reasons, I don't want to upgrade to 4.0 yet. My index is too big to fit on one machine. I want to be able to slice the crawl so that I can have 1 slice per Solr shard, but also use the grouping feature in Solr. From what I understand, Solr grouping doesn't work properly when pages from a domain are spread across Solr shards.
>>
>> Basically I'm after something like this:
>>
>> slice1 (apache.org, linux.org) -> solr1
>>
>> slice2 (stackoverflow.com, wikipedia.org) -> solr2
>>
>> etc...
>>
>> I could upgrade to SolrCloud, or possibly use Elasticsearch, but it would be a fair amount of re-coding. I was just curious if I could manage the sharding manually.
>>
>> Suggestions would certainly be appreciated; it seems like I am faced with either a massive upgrade or breaking the grouping functionality.
>>
>> ~Jason
>>
>> On Mar 5, 2013, at 11:02 PM, Markus Jelsma <[email protected]> wrote:
>>
>>> Hi
>>>
>>> You can't do this with -slice but you can merge segments and filter them. This would mean you'd have to merge the segments for each domain. But that's far too much work. Why do you want to do this? There may be better ways of achieving your goal.
>>>
>>> -----Original message-----
>>>> From: Jason S <[email protected]>
>>>> Sent: Tue 05-Mar-2013 22:18
>>>> To: [email protected]
>>>> Subject: keep all pages from a domain in one slice
>>>>
>>>> Hello,
>>>>
>>>> I seem to remember seeing a discussion about this in the past but I can't seem to find it in the archives.
>>>>
>>>> When using mergesegs -slice, is it possible to keep all the pages from a domain in the same slice? I have just been messing around with this functionality (Nutch 1.6), and it seems like the records are simply split after the counter has reached the slice size specified, sometimes splitting the records from a single domain over multiple slices.
>>>>
>>>> How can I segregate a domain to a single slice?
>>>>
>>>> Thanks in advance,
>>>>
>>>> ~Jason
>>
>>
>
> --
> *Lewis*
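
The slice1/slice2 layout asked for in the thread amounts to assigning every URL to a shard based on its domain rather than on a running record counter. Below is a minimal Python sketch of that idea, not a Nutch feature or API: `domain_key`, `shard_for` and `slice_by_domain` are hypothetical helper names, the "last two hostname labels" rule is a naive assumption that breaks for suffixes such as co.uk, and a hash-based assignment will not produce evenly sized slices.

```python
import zlib
from collections import defaultdict
from urllib.parse import urlsplit


def domain_key(url):
    """Naive grouping key: the last two labels of the hostname.

    Assumption: good enough for hosts like nutch.apache.org; a real setup
    would need a public-suffix-aware lookup for names like example.co.uk.
    """
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host


def shard_for(url, num_shards):
    """Map a URL to a shard index using a stable CRC32 of its domain key."""
    return zlib.crc32(domain_key(url).encode("utf-8")) % num_shards


def slice_by_domain(urls, num_shards):
    """Group URLs into at most num_shards slices; a domain never spans two."""
    slices = defaultdict(list)
    for url in urls:
        slices[shard_for(url, num_shards)].append(url)
    return dict(slices)
```

Calling `slice_by_domain(urls, 2)` on a list of crawled URLs returns a dict keyed by shard index, and each per-shard list could then be used to drive a separate filtered merge and index run for solr1, solr2, and so on. An explicit domain-to-shard table could replace the hash if specific domains must land on specific shards.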

