Hi James

You could write a custom MapReduce job that copies the documents under a
custom ID, as you just described. Another option would be to use Nutch 2 with
HBase and set a large maximum number of versions in the HBase schema
(http://hbase.apache.org/book/schema.versions.html), so that HBase itself
keeps the older copies of each page.
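
To make the custom ID idea concrete, here is a minimal sketch in plain Python.
It is not Nutch code: content_signature is only a stand-in for
TextProfileSignature (an MD5 over normalised text), and the names
versioned_id, archive and the "|" separator are hypothetical choices made
for illustration.

```python
import hashlib


def content_signature(text: str) -> str:
    """Stand-in signature: MD5 over lower-cased, whitespace-normalised text.

    This is an assumption for illustration only, not the actual
    TextProfileSignature algorithm.
    """
    normalised = " ".join(text.lower().split())
    return hashlib.md5(normalised.encode("utf-8")).hexdigest()


def versioned_id(url: str, text: str) -> str:
    """Build a document ID from the URL plus the content signature.

    The "|" separator is an arbitrary choice; use one that cannot occur
    in your normalised URLs.
    """
    return f"{url}|{content_signature(text)}"


def archive(store: dict, url: str, text: str) -> bool:
    """Store the page under its versioned ID.

    Returns True if a new version was added, False if this exact
    content was already archived for this URL.
    """
    key = versioned_id(url, text)
    if key in store:
        return False
    store[key] = text
    return True
```

With this scheme a page whose content is unchanged maps to the same ID and
is stored once, while any change that alters the signature produces a new
ID, so the old copy is kept next to the new one.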

Julien

On 9 October 2012 11:17, <[email protected]> wrote:

> Hi
>
> Rather than a wide crawl of the web keeping track of the current state of
> sites (which, as I understand it, is what Nutch is currently optimized for),
> I am interested in keeping copies of a more modest number of sites over time
> as they change. In other words, keeping copies of both the old and the new
> pages as they change. My overly optimistic wishful thinking is that I could
> get close enough to this by simply adding the signature
> (TextProfileSignature in particular) to the current id key. Any thoughts on
> whether this is feasible and, if so, where in the codebase I should start
> looking in order to do that? I am aware Heritrix specializes in archiving,
> but I would really like to stick with Nutch if possible, unless it
> absolutely doesn't make sense.
>
> Thanks
>
> James
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
