Copying in user@

On Thu, Nov 6, 2014 at 6:37 PM, Lewis John Mcgibbney <[email protected]> wrote:

> Hi amit,
>
> On Thu, Nov 6, 2014 at 1:54 PM, <[email protected]> wrote:
>
>> I have a small question about Nutch 2.X source code, i hope this is the
>> right mailing list for that. i was unable to locate the following pieces
>> from the code:
>>
>> a) where does the linkdb get generated, which java file contains the code
>> for that
>
> There is currently no independent linkdb data structure such as the opaque
> object generated within Nutch 1.X.
>
>> b) i see the WebPage class being utilized for remembering the pages that
>> were gathered.
>
> Each URL is essentially a WebPage in Nutch 2.X. There is therefore one
> WebPage for every document which is fetched by Nutch.
>
>> It looks like the crawldb is a repository of these pages.
>
> You are using Nutch 1.X and 2.X terminology here interchangeably, I feel,
> and it is quite confusing. Nutch 2.X does in fact not have a crawldb
> either. It delegates all such data structures to Gora, which is an
> object-to-datastore mapping framework. Objects in Gora are associated with
> an object store. In Nutch both the WebPage store and the Host store are
> initialized within StorageUtils
>
> https://svn.apache.org/repos/asf/nutch/branches/2.x/src/java/org/apache/nutch/storage/StorageUtils.java
>
> So, I would say that the WebPage store is a datastore containing
> collections of Nutch WebPage and/or Host objects.
>
>> If that is the case then:
>>
>> -- it looks like WebPage remembers the contents of the page together
>> with the rest of the information about the page. How do we delete
>> content which is old and not changed for a while
>
> You do not need to do this. If a WebPage is refetched after some duration
> of time the content will be updated based on the new version.
>
>> -- it does not appear that Nutch 2.X has any concept of segments.
>
> Correct
>
>> How do we delete stuff that is older than 1 month so that we don't blow
>> out the disk space ?
>
> Well, seeing as you have no segments, you don't need to delete anything.
> All your data is flushed down into the datastore of your choice. Nutch 2.X
> does not rely upon the opaque Hadoop sequence file data structure used
> within Nutch 1.X. Not having to maintain segments is one feature, I
> suppose, of Nutch 2.X.
>
>> It seemed that Nutch 1.x had segments, and older segments were
>> removable
>
> Yes that is correct and also highly advised. Keeping an eye on your
> older segments is something which everyone should do IMHO.
> hth
> Lewis

--
*Lewis*

