Hi Jason,

I did something along the lines of what you are after and submitted the patch 
as NUTCH-945; maybe you will find it useful.

https://issues.apache.org/jira/browse/NUTCH-945

The idea behind the patch is this: you set up a list of SOLR servers in your 
configuration, then define and configure a partitioner that, given the 
document URL, returns an index into that list. In your case I think you will 
have to build a custom partitioner that uses the domain to decide the 
partition.
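
To make the idea concrete, here is a rough sketch of what a domain-based 
partitioner could look like. This is not the code from the patch: the class 
name, the method names and the way the server list is passed in are all made 
up for illustration, and the "domain" is approximated as the last two labels 
of the host name (a real version would probably want a proper public-suffix 
lookup).

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/**
 * Hypothetical sketch: picks an index into a list of Solr servers based on
 * the domain of the document URL, so all pages of a domain land on one shard.
 * None of these names come from NUTCH-945 or from Nutch itself.
 */
final class DomainPartitioner {

    private final List<String> solrServers;

    DomainPartitioner(List<String> solrServers) {
        if (solrServers == null || solrServers.isEmpty()) {
            throw new IllegalArgumentException("need at least one Solr server");
        }
        this.solrServers = new ArrayList<>(solrServers);
    }

    /**
     * Crude domain extraction (an assumption for this sketch): keep the last
     * two labels of the host, e.g. "www.apache.org" becomes "apache.org".
     * Returns "" when the URL has no parseable host.
     */
    static String domainOf(String url) {
        String host;
        try {
            host = new URI(url).getHost();
        } catch (URISyntaxException e) {
            host = null;
        }
        if (host == null) {
            return "";
        }
        host = host.toLowerCase(Locale.ROOT);
        String[] labels = host.split("\\.");
        int n = labels.length;
        return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
    }

    /** Index into the server list; the same domain always gives the same index. */
    int getPartition(String url) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(domainOf(url).hashCode(), solrServers.size());
    }

    /** Convenience lookup of the server URL chosen for a document URL. */
    String serverFor(String url) {
        return solrServers.get(getPartition(url));
    }
}
```

Note that hashing only guarantees that a domain stays on one server; it does 
not let you choose which domains share a server, and the assignment changes if 
the number of servers changes. If you need an explicit mapping such as 
"apache.org and linux.org go to solr1", a lookup table keyed by domain would 
replace the hash.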

-sujit

On Mar 6, 2013, at 1:34 AM, Stubblefield Jason wrote:

> Well Lewis, I quite frankly disagree.  
> 
> I am asking how I can have more control for the slice process in the nutch 
> mergesegs operation.
> 
> I think this could be a useful feature to many Nutch users.
> 
> I can see that I won't get any more assistance here.
> 
> Thanks,
> 
> Jason
> 
> 
> 
> On Mar 6, 2013, at 6:18 AM, Lewis John Mcgibbney <[email protected]> 
> wrote:
> 
>> Hi Jason,
>> There is nothing I can see here which concerns Nutch.
>> Try solr lists please.
>> Thank you
>> Lewis
>> 
>> On Tuesday, March 5, 2013, Stubblefield Jason <
>> [email protected]> wrote:
>>> I have several Solr 3.6 instances that for various reasons, I don't want
>> to upgrade to 4.0 yet.  My index is too big to fit on one machine.  I want
>> to be able to slice the crawl so that I can have 1 slice per solr shard,
>> but also use the grouping feature on solr.  From what I understand, solr
>> grouping doesn't work properly when pages from a domain are spread across
>> solr shards.
>>> 
>>> Basically I'm after something like this:
>>> 
>>> slice1 (apache.org, linux.org) -> solr1
>>> 
>>> slice2 (stackoverflow.com, wikipedia.org) -> solr2
>>> 
>>> etc...
>>> 
>>> I could upgrade to Solrcloud, or possibly use elasticsearch, but it would
>> be a fair amount of re-coding.  I was just curious if I could manage the
>> sharding manually.
>>> 
>>> Suggestions would certainly be appreciated; it seems like I am faced with
>> either a massive upgrade or breaking the grouping functionality.
>>> 
>>> ~Jason
>>> 
>>> On Mar 5, 2013, at 11:02 PM, Markus Jelsma <[email protected]>
>> wrote:
>>> 
>>>> Hi
>>>> 
>>>> You can't do this with -slice but you can merge segments and filter
>> them. This would mean you'd have to merge the segments for each domain. But
>> that's far too much work. Why do you want to do this? There may be better
>> ways to achieve your goal.
>>>> 
>>>> 
>>>> 
>>>> -----Original message-----
>>>>> From:Jason S <[email protected]>
>>>>> Sent: Tue 05-Mar-2013 22:18
>>>>> To: [email protected]
>>>>> Subject: keep all pages from a domain in one slice
>>>>> 
>>>>> Hello,
>>>>> 
>>>>> I seem to remember seeing a discussion about this in the past but I
>> can't seem to find it in the archives.
>>>>> 
>>>>> When using mergesegs -slice, is it possible to keep all the pages from
>> a domain in the same slice?  I have just been messing around with this
>> functionality (Nutch 1.6), and it seems like the records are simply split
>> after the counter has reached the slice size specified, sometimes splitting
>> the records from a single domain over multiple slices.
>>>>> 
>>>>> How can I segregate a domain to a single slice?
>>>>> 
>>>>> Thanks in advance,
>>>>> 
>>>>> ~Jason
>>> 
>>> 
>> 
>> -- 
>> *Lewis*
> 
