Good point Ferdy, thanks!
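
To make the single-version point concrete, here is a toy Python model of a versioned cell. It is not the HBase client API and not Nutch code: `VersionedCell`, `put` and `get` are names made up for illustration. It only mimics two HBase behaviours in simplified form: a column family keeps at most a configured number of versions per cell, and a read returns just the newest version unless more are requested.

```python
class VersionedCell:
    """Toy stand-in for one versioned cell (hypothetical, not the HBase API)."""

    def __init__(self, max_versions):
        self.max_versions = max_versions
        self._versions = []  # (timestamp, value) pairs, newest first

    def put(self, timestamp, value):
        # A write with an already-used timestamp replaces that version.
        self._versions = [(t, v) for t, v in self._versions if t != timestamp]
        self._versions.append((timestamp, value))
        self._versions.sort(key=lambda pair: pair[0], reverse=True)
        # Keep only the newest max_versions entries (evicted eagerly here
        # to keep the sketch short).
        del self._versions[self.max_versions:]

    def get(self, versions=1):
        # By default only the newest value is returned.
        return [value for _, value in self._versions[:versions]]


if __name__ == "__main__":
    cell = VersionedCell(max_versions=3)
    for ts, html in enumerate(["crawl-1", "crawl-2", "crawl-3", "crawl-4"], start=1):
        cell.put(ts, html)
    print(cell.get())            # newest snapshot only
    print(cell.get(versions=3))  # explicit multi-version read
```

A reader that only ever asks for the default single version would see just the latest snapshot; getting at the older ones needs an explicit multi-version read, which is the part a custom store or external tool would have to add.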

On 9 October 2012 18:10, Ferdy Galema <[email protected]> wrote:

> Hi,
>
> HBase with multiple versions is certainly an option; however, the current
> HBaseStore implementation was written with a single version in mind. (I
> have not really tested what happens with multiple versions; I guess you get
> unexpected/undefined results.) The exception would be to enable multiple
> versions only for specific column families (for example, put 'content'
> into a separate column family). Storing new content would overwrite the
> old content. You would need an external process or a custom tool to
> retrieve earlier versions from the store. For map-like information
> (inlinks, outlinks, metadata) the results with multiple versions are a lot
> more confusing. There is still some work to do.
>
> In short: yes, HBase would work, but you would definitely have to hack a
> custom HBaseStore if you want to keep perfect track of snapshots.
>
> Ferdy.
>
> On Tue, Oct 9, 2012 at 3:30 PM, Julien Nioche <[email protected]> wrote:
>
> > Hi James
> >
> > You could have a custom map reduce job to copy the documents with a custom
> > ID as you just described. Another option would be to use Nutch 2 + HBase
> > and set a large value of versions
> > (http://hbase.apache.org/book/schema.versions.html) in the HBase schema.
> >
> > Julien
> >
> > On 9 October 2012 11:17, <[email protected]> wrote:
> >
> > > Hi
> > >
> > > Rather than a wide crawl of the web keeping track of the current state of
> > > sites (as I understand Nutch is currently optimized for) I am interested in
> > > keeping copies of a more modest number of sites over time as they change.
> > > In other words keeping copies of both the old webpages and the new pages as
> > > they change. My overly optimistic wishful thinking is that I could get
> > > close enough to this by simply adding the signature (TextProfileSignature
> > > in particular) to the current id key. Any thoughts as to if this is
> > > feasible and if so where in the codebase I should start looking in order to
> > > do that? I am aware Heritrix specializes in archiving but I would really
> > > like to stick with Nutch if possible unless it absolutely doesn't make
> > > sense.
> > >
> > > Thanks
> > >
> > > James
> > >
> >
> >
> >
> > --
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble
> >
>
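
On James's original idea of adding the signature to the id key, a rough sketch of such a composite key could look like the following. This is only an illustration under stated assumptions: a plain MD5 over normalized text stands in for TextProfileSignature (which computes its signature differently), the `|` separator is arbitrary, and `content_signature` and `snapshot_key` are hypothetical names, not Nutch functions.

```python
import hashlib


def content_signature(text: str) -> str:
    """Stand-in signature: MD5 of lowercased, whitespace-normalized text.

    Assumption: this is NOT TextProfileSignature, just a simple placeholder.
    """
    normalized = " ".join(text.lower().split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def snapshot_key(url: str, text: str) -> str:
    """Build a per-snapshot key from the page id and its content signature."""
    # Hypothetical key layout: "<url>|<signature>".
    return f"{url}|{content_signature(text)}"


if __name__ == "__main__":
    print(snapshot_key("http://example.com/", "First version of the page"))
    print(snapshot_key("http://example.com/", "Second version of the page"))
```

With a key like this, unchanged content maps back to the same row on a re-crawl, while changed content gets a new row, so older copies are kept rather than overwritten.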



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
