Good point Ferdy, thanks!

On 9 October 2012 18:10, Ferdy Galema <[email protected]> wrote:
> Hi,
>
> HBase with multiple versions is certainly an option, however the current
> HBaseStore implementation is implemented with a single version in mind. (I
> have not really tested what happens with multiple versions, I guess you get
> unexpected/undefined results). The exception to this case would be
> setting specific column families for multiple values (for example 'content'
> and put it into a separate column family). Storing new content would
> overwrite the old ones. You have to have an external process or implemented
> tool to retrieve earlier versions from the store. For information like maps
> (inlinks, outlinks, metadata) the results with multiple versions are a lot
> more confusing. There is still some work to do.
>
> In short, yes HBase would work but you definitely would have to hack a
> custom HBaseStore if you want to perfectly keep track of snapshots.
>
> Ferdy.
>
> On Tue, Oct 9, 2012 at 3:30 PM, Julien Nioche <[email protected]> wrote:
>
> > Hi James
> >
> > You could have a custom map reduce job to copy the documents with a
> > custom ID as you just described. Another option would be to use
> > Nutch 2 + HBase and set a large value of versions
> > (http://hbase.apache.org/book/schema.versions.html) in the HBase schema.
> >
> > Julien
> >
> > On 9 October 2012 11:17, <[email protected]> wrote:
> >
> > > Hi
> > >
> > > Rather than a wide crawl of the web keeping track of the current state
> > > of sites (as I understand Nutch is currently optimized for) I am
> > > interested in keeping copies of a more modest number of sites over time
> > > as they change. In other words keeping copies of both the old webpages
> > > and the new pages as they change. My overly optimistic wishful thinking
> > > is that I could get close enough to this by simply adding the signature
> > > (TextProfileSignature in particular) to the current id key.
> > > Any thoughts as to if this is feasible and if so where in the codebase
> > > I should start looking in order to do that? I am aware Heritrix
> > > specializes in archiving but I would really like to stick with Nutch if
> > > possible unless it absolutely doesn't make sense.
> > >
> > > Thanks
> > >
> > > James
> >
> > --
> > *
> > *Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble

--
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
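
PS: for anyone reading this thread later, the "custom ID" idea discussed above (URL plus content signature as the key) could look roughly like the sketch below. This is only an illustration and not Nutch code: the class name SnapshotKey, both method names and the "#" separator are made up for the example, and a plain MD5 of the page text stands in for TextProfileSignature, which computes its digest over a normalised text profile instead.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration only; nothing here is part of Nutch.
final class SnapshotKey {

    // Stand-in signature: hex-encoded MD5 of the page text.
    // (Assumption: the real job would plug in Nutch's signature bytes here.)
    static String signatureHex(String text) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Builds a key that changes whenever the content changes, so a new
    // version of a page is stored under a new key instead of overwriting
    // the old one. The "#" separator is an arbitrary choice.
    static String snapshotKey(String url, String text) {
        return url + "#" + signatureHex(text);
    }

    public static void main(String[] args) {
        String url = "http://example.com/";
        // Same URL, different content: the two keys differ.
        System.out.println(snapshotKey(url, "first version of the page"));
        System.out.println(snapshotKey(url, "second version of the page"));
    }
}
```

With a key like this an unchanged page maps to the same key on every crawl, while each changed version gets its own row. Something would still have to track which key is the latest for a given URL.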

