I agree that doing it at the Solr level is the most straightforward way. 
However, if possible, I would like to do it at the webpage table level. That way 
I would keep the original data and could reindex it at a later date, 
retroactively applying any improvements to the indexing. I am fairly certain I 
will screw up my Solr index or want to make changes to it at some point.

________________________________________
From: Dave Stuart [[email protected]]
Sent: Tuesday, October 09, 2012 7:22 PM
To: [email protected]
Subject: Re: Keeping History/Archive with Nutch 2.x

Are you pushing it into a search index of some sort?

As I mostly push things into Solr, I would modify the key to take the signature 
into account.
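
To illustrate the idea only, here is a rough Python sketch of a key that 
combines the URL with a content signature, so that each changed version of a 
page gets its own id instead of overwriting the previous one. It is not Nutch 
or Solr code: the function names, the "|" separator, and the plain MD5 over 
normalized text (standing in for TextProfileSignature) are all assumptions 
made for the example.

```python
import hashlib
import re


def content_signature(text: str) -> str:
    """Hypothetical stand-in for a content signature.

    Computes an MD5 hex digest over lowercased, whitespace-normalized text.
    This is NOT Nutch's TextProfileSignature algorithm, only a placeholder.
    """
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def versioned_key(url: str, text: str) -> str:
    """Build an id from URL plus signature (assumed '|' separator)."""
    return f"{url}|{content_signature(text)}"


# Same URL, trivially different whitespace/case: same key, so no new copy.
old = versioned_key("http://example.com/", "Hello world")
same = versioned_key("http://example.com/", "Hello   WORLD ")

# Same URL, changed content: different key, so both versions are kept.
new = versioned_key("http://example.com/", "Hello again")
```

With a key like this, a re-fetch of an unchanged page maps to the existing 
record, while a changed page produces a new record alongside the old one.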



On 9 Oct 2012, at 11:17, <[email protected]> wrote:

> Hi
>
> Rather than a wide crawl of the web keeping track of the current state of 
> sites (as I understand Nutch is currently optimized for) I am interested in 
> keeping copies of a more modest number of sites over time as they change. In 
> other words keeping copies of both the old webpages and the new pages as they 
> change. My overly optimistic wishful thinking is that I could get close 
> enough to this by simply adding the signature (TextProfileSignature in 
> particular) to the current id key. Any thoughts on whether this is feasible 
> and, if so, where in the codebase I should start looking to do that? I am 
> aware Heritrix specializes in archiving but I would really like to stick with 
> Nutch if possible unless it absolutely doesn't make sense.
>
> Thanks
>
> James
