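Are you pushing it into a search index of some sort? Since I mostly push things into Solr, I would modify the document key to take the signature into account, so that each distinct version of a page gets its own entry instead of overwriting the previous one.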
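
To make the idea concrete, here is a rough sketch in Python rather than actual Nutch or Solr code. The names `content_signature` and `versioned_id` and the `|` separator are made up for illustration, and the MD5 over normalized text is only a simple stand-in for TextProfileSignature, which works differently.

```python
import hashlib
import re


def content_signature(text: str) -> str:
    """Hypothetical stand-in for a page signature.

    Lowercases the text and collapses whitespace before hashing, so
    trivial formatting changes do not produce a new version.
    This is NOT how TextProfileSignature is computed.
    """
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def versioned_id(url: str, text: str) -> str:
    """Build an index key from the URL plus the content signature.

    The '|' separator is an arbitrary choice for this sketch.
    """
    return f"{url}|{content_signature(text)}"


if __name__ == "__main__":
    url = "http://example.com/page"
    old_key = versioned_id(url, "Opening hours: 9 to 5")
    new_key = versioned_id(url, "Opening hours: 10 to 6")
    # Different content yields different keys, so both versions can coexist.
    print(old_key != new_key)
```

With a key like this, re-indexing an unchanged page overwrites the same entry, while a changed page is stored alongside the older copy. You would probably also want to keep the plain URL in a separate field so that all versions of a page can be queried together.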

On 9 Oct 2012, at 11:17, <[email protected]> wrote:

> Hi
>
> Rather than a wide crawl of the web keeping track of the current state of
> sites (as I understand Nutch is currently optimized for) I am interested in
> keeping copies of a more modest number of sites over time as they change. In
> other words keeping copies of both the old webpages and the new pages as they
> change. My overly optimistic wishful thinking is that I could get close
> enough to this by simply adding the signature (TextProfileSignature in
> particular) to the current id key. Any thoughts as to if this is feasible and
> if so where in the codebase I should start looking in order to do that? I am
> aware Heritrix specializes in archiving but I would really like to stick with
> Nutch if possible unless it absolutely doesn't make sense.
>
> Thanks
>
> James

