Hi James,

You could write a custom MapReduce job to copy the documents with a custom ID, as you just described. Another option would be to use Nutch 2 + HBase and set a large number of versions (http://hbase.apache.org/book/schema.versions.html) in the HBase schema, so that older copies of each page are kept alongside the newest one.
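To make the custom-ID idea a bit more concrete, here is a rough sketch in plain Python rather than actual Nutch code. Everything in it is an assumption made for illustration: the function names are hypothetical, the "|" separator is arbitrary, and an MD5 of the normalised text stands in for TextProfileSignature, which works differently. The point is only that appending a content signature to the URL gives each distinct version of a page its own key, while an unchanged page maps to the same key and is not stored twice.

```python
import hashlib


def content_signature(text: str) -> str:
    """Stand-in signature (NOT TextProfileSignature): MD5 of the
    lower-cased, whitespace-normalised text."""
    normalised = " ".join(text.lower().split())
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()


def versioned_id(url: str, text: str) -> str:
    """Hypothetical composite key: the URL plus the content signature."""
    return f"{url}|{content_signature(text)}"


def archive(store: dict, url: str, text: str) -> bool:
    """Store the page under its versioned key.

    Returns True if a new version was stored, False if this exact
    content (by signature) was already archived for the URL.
    """
    key = versioned_id(url, text)
    if key in store:
        return False
    store[key] = text
    return True


if __name__ == "__main__":
    store = {}
    archive(store, "http://example.com/", "Hello world")
    archive(store, "http://example.com/", "hello   WORLD")  # same signature, skipped
    archive(store, "http://example.com/", "Hello again")    # changed page, new key
    for key in sorted(store):
        print(key)
```

A real job would read the signature already computed by Nutch instead of recomputing one, and would have to decide how the composite key interacts with anything that expects the ID to be a plain URL.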
Julien

On 9 October 2012 11:17, <[email protected]> wrote:

> Hi
>
> Rather than a wide crawl of the web keeping track of the current state of
> sites (as I understand Nutch is currently optimized for) I am interested in
> keeping copies of a more modest number of sites over time as they change.
> In other words keeping copies of both the old webpages and the new pages as
> they change. My overly optimistic wishful thinking is that I could get
> close enough to this by simply adding the signature (TextProfileSignature
> in particular) to the current id key. Any thoughts as to if this is
> feasible and if so where in the codebase I should start looking in order to
> do that? I am aware Heritrix specializes in archiving but I would really
> like to stick with Nutch if possible unless it absolutely doesn't make
> sense.
>
> Thanks
>
> James
>

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

