Copying in user@

On Thu, Nov 6, 2014 at 6:37 PM, Lewis John Mcgibbney <
[email protected]> wrote:

> Hi amit,
>
> On Thu, Nov 6, 2014 at 1:54 PM, <[email protected]> wrote:
>
>> I have a small question about the Nutch 2.X source code; I hope this is the
>> right mailing list for
>> that. I was unable to locate the following pieces in the code:
>>
>> a) Where does the linkdb get generated? Which Java file contains the code
>> for that?
>>
>
> There is currently no independent linkdb data structure such as the opaque
> object generated within Nutch 1.X.
>
>
>>
>> b) I see the WebPage class being used to remember the pages that were
>>   gathered.
>
>
> Each URL is essentially a WebPage in Nutch 2.X. There is therefore one
> WebPage for every document which is fetched by Nutch.
>
>
>> It looks like the crawldb is a repository of these pages.
>
>
> I feel you are using Nutch 1.X and 2.X terminology interchangeably here,
> which is quite confusing. Nutch 2.X does not in fact have a crawldb
> either. It delegates all such data structures to Gora, which is an
> object-to-datastore mapping framework. Objects in Gora are associated with
> an object store. In Nutch, both the WebPage and Host stores are
> initialized within StorageUtils:
>
> https://svn.apache.org/repos/asf/nutch/branches/2.x/src/java/org/apache/nutch/storage/StorageUtils.java
>
> So, I would say that the WebPage store is a datastore containing
> collections of Nutch WebPage and/or Host objects.
>
>
>> If that is
>>   the case then:
>>
>>   -- it looks like WebPage remembers the contents of the page together
>> with the
>>     rest of the information about the page. How do we delete content
>> which is
>>     old and has not changed for a while?
>>
>
> You do not need to do this. If a WebPage is refetched after some duration
> of time the content will be updated based on the new version.
>
>
>>
>> -- it does not appear that Nutch 2.X has any concept of segments.
>
>
> Correct
>
>
>> How do we
>>     delete stuff that is older than 1 month so that we don't blow out the
>> disk space?
>>
>
> Well, seeing as you have no segments, you don't need to delete anything.
> All your data is flushed down into the datastore of your choice. Nutch 2.X
> does not rely upon the opaque Hadoop sequence file data structures which
> are used within Nutch 1.X. Not having to maintain segments is, I suppose,
> one feature of Nutch 2.X.
>
>
>>    It seemed that Nutch 1.x had segments, and older segments were
>> removable
>>
>
> Yes, that is correct and also highly advised. Keeping an eye on your
> older segments is something which everyone should do IMHO.
> hth
> Lewis
>
>
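
To illustrate the point above that a refetched WebPage replaces its earlier
content rather than accumulating in segments, here is a minimal sketch. It
is NOT the Nutch or Gora API: PageStoreSketch, Page, put, get and size are
hypothetical names, and a plain in-memory map stands in for the datastore.
It only shows the "one record per URL, overwritten on refetch" idea.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a keyed datastore; NOT the Nutch or Gora API.
class PageStoreSketch {

    // One record per URL, loosely analogous to one WebPage per URL.
    static final class Page {
        final String content;
        final long fetchTime;

        Page(String content, long fetchTime) {
            this.content = content;
            this.fetchTime = fetchTime;
        }
    }

    // Keyed by URL, so the number of records never exceeds the number of URLs.
    private final Map<String, Page> pages = new LinkedHashMap<>();

    // Storing a page under an existing URL replaces the previous record.
    void put(String url, String content, long fetchTime) {
        pages.put(url, new Page(content, fetchTime));
    }

    Page get(String url) {
        return pages.get(url);
    }

    int size() {
        return pages.size();
    }

    public static void main(String[] args) {
        PageStoreSketch store = new PageStoreSketch();
        store.put("http://example.com/", "first fetch", 1L);
        // A later refetch of the same URL overwrites the old content.
        store.put("http://example.com/", "second fetch", 2L);
        System.out.println(store.size());
        System.out.println(store.get("http://example.com/").content);
    }
}
```

In this sketch the store grows with the number of distinct URLs, not with
the number of fetch rounds, which is why there is no segment-style cleanup
step to run.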


-- 
*Lewis*
