I agree that doing it at the Solr level is the most straightforward approach. However, if possible, I would like to do it at the webpage table level. That way I would keep the original data and could reindex it at a later date, retroactively applying any improvements to the indexing. I am fairly certain I will either break my Solr index or want to make changes to it at some point.
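
To make the table-level idea concrete, here is a rough sketch of how a versioned row key might be composed. It is an illustration only and does not touch any Nutch classes: `VersionedKey`, `compositeKey`, and the `#` separator are made-up names and choices, the existing row key is treated as an opaque string, and MD5 in `main` is just a stand-in for whatever bytes TextProfileSignature would produce.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Hypothetical helper, not part of Nutch: builds "<existing key>#<hex signature>"
// so that each distinct version of a page would get its own row.
final class VersionedKey {

    // Appends the hex-encoded signature to the existing row key.
    // The '#' separator is an assumption made for this sketch.
    static String compositeKey(String rowKey, byte[] signature) {
        StringBuilder sb = new StringBuilder(rowKey).append('#');
        for (byte b : signature) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // MD5 of the page text stands in for the real content signature here.
        byte[] sig = MessageDigest.getInstance("MD5")
                .digest("page text, first version".getBytes(StandardCharsets.UTF_8));
        System.out.println(compositeKey("org.example:http/index.html", sig));
    }
}
```

With a key like this, an unchanged page would map to the same row on a refetch, while a changed page would land in a new row next to the old one. Whether the rest of the crawl cycle tolerates keys that are no longer plain URLs is something I have not checked.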
________________________________________
From: Dave Stuart [[email protected]]
Sent: Tuesday, October 09, 2012 7:22 PM
To: [email protected]
Subject: Re: Keeping History/Archive with Nutch 2.x

Are you pushing it into a search index of some sort? As I mostly push things into Solr I would modify the key to take signature into account.

On 9 Oct 2012, at 11:17, <[email protected]> wrote:

> Hi
>
> Rather than a wide crawl of the web keeping track of the current state of
> sites (as I understand Nutch is currently optimized for) I am interested in
> keeping copies of a more modest number of sites over time as they change. In
> other words keeping copies of both the old webpages and the new pages as they
> change. My overly optimistic wishful thinking is that I could get close
> enough to this by simply adding the signature (TextProfileSignature in
> particular) to the current id key. Any thoughts as to if this is feasible and
> if so where in the codebase I should start looking in order to do that? I am
> aware Heritrix specializes in archiving but I would really like to stick with
> Nutch if possible unless it absolutely doesn't make sense.
>
> Thanks
>
> James

