Thank you so much. The reason I was asking is that in some data stores,
such as Cassandra, updates get versioned. If a page is updated
frequently, this creates a lot of versions and reduces the performance
of the system.

Additionally, while it would be useful to keep the metadata of pages, if
a page is hardly changing and the application has little interest left
in it, it would be better to simply remove the page. Otherwise we will
eventually run out of disk space.

Any suggestions are greatly appreciated.

Thanks
--------------------------------------------
On Thu, 11/6/14, Sebastian Nagel <[email protected]> wrote:

 Subject: Re: Nutch 2.X question
 To: [email protected]
 Date: Thursday, November 6, 2014, 12:51 PM
 
 Hi Amit,
 
 In Nutch 2.x there are no segments and there is no LinkDB.
 All data is held in one single "WebTable".
 
 Usually, you want to keep the most recent version of each document
 (one row in the table). Depending on the storage back-end and its
 configuration, there may be multiple versions stored for each cell
 in the table (e.g. for HBase). If you want to clean up old stuff,
 e.g. old 404 pages, you can do this using the administration tools
 of your storage back-end (HBase shell, etc.).
 
 Sebastian
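
The clean-up described above can be sketched roughly as follows. This is
only an illustration of the selection step, not Nutch or HBase code: the
`PageRow` class, its `http_status` and `fetched_at` fields, and the
`stale_row_keys` helper are hypothetical names, and an in-memory list
stands in for the WebTable. The actual deletion would still be done with
the storage back-end's own tools.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PageRow:
    # Hypothetical stand-in for one WebTable row; not Nutch's actual schema.
    key: str
    http_status: int
    fetched_at: datetime


def stale_row_keys(rows, now, max_age, statuses=(404,)):
    """Return keys of rows whose status is in `statuses` and whose last
    fetch is older than `now - max_age`."""
    cutoff = now - max_age
    return [
        row.key
        for row in rows
        if row.http_status in statuses and row.fetched_at < cutoff
    ]


# Example data: one old 404, one recent 404, one old but healthy page.
now = datetime(2014, 11, 6)
rows = [
    PageRow("org.example:http/gone", 404, datetime(2014, 1, 1)),
    PageRow("org.example:http/fresh-404", 404, datetime(2014, 11, 1)),
    PageRow("org.example:http/", 200, datetime(2014, 1, 1)),
]

# Only the 404 page that has been stale for more than 90 days is selected.
to_delete = stale_row_keys(rows, now, timedelta(days=90))
print(to_delete)
```

Keeping the selection separate from the deletion makes it possible to
review the list of row keys before anything is removed from the store.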
 
 On 11/06/2014 06:42 PM, amit sehas wrote:
 > Does Nutch 2.x have segments? I don't see the Java code for that.
 > How do we delete data which is older than a certain date?
 >
 > Does Nutch 2.x maintain a LinkDB? I am not able to locate Java code
 > for that either ...
 >
 > Any help is greatly appreciated.
 >
 > Thanks
 >
 
