Thank you so much. The reason I was asking is that in some data stores, such as Cassandra, updates are versioned. If a page is updated frequently, this creates a lot of versions and reduces the performance of the system.

Additionally, while it would be useful to keep the metadata of pages, if a page is hardly changing and the application has little interest left in it, it would be better to simply remove the page. Otherwise we will eventually run out of disk space.

Any suggestions are greatly appreciated.

Thanks

--------------------------------------------
On Thu, 11/6/14, Sebastian Nagel <[email protected]> wrote:

 Subject: Re: Nutch 2.X question
 To: [email protected]
 Date: Thursday, November 6, 2014, 12:51 PM

 Hi Amit,

 in Nutch 2.x there are no segments and there is no LinkDB.
 All data is held in one single "WebTable".

 Usually, you want to keep the most recent version of each
 document (one row in the table). Depending on the storage
 back-end and its configuration, there may be multiple versions
 stored for each cell in the table (e.g. for HBase).

 If you want to clean up old stuff, e.g. old 404 pages,
 you can do this using the administration tools of your
 storage (HBase shell, etc.)

 Sebastian

 On 11/06/2014 06:42 PM, amit sehas wrote:
 > Does Nutch 2.x have segments? I don't see the Java code for that. How do
 > we delete data which is older than a certain date?
 >
 > Does Nutch 2.x maintain a LinkDB? I am not able to locate the Java code
 > for that either ...
 >
 > Any help is greatly appreciated.
 >
 > Thanks
 >

