Hi, sorry for the late reply.
I once prepared an overview and also a flow diagram as part of
http://www.slideshare.net/sebastian_nagel/aceu2014-snagelwebcrawlingnutch

crawl_parse: all crawling-related data from the parsing step that is used
to update the CrawlDb: outlinks, scores, signatures, and metadata.

Feel free to add anything from the slides to the wiki. If you need the
images, just let me know.

Sebastian

On 10/03/2015 03:28 AM, Lewis John Mcgibbney wrote:
> Hi Folks,
>
> On Fri, Oct 2, 2015 at 4:33 PM, <[email protected]> wrote:
>
>> I have already been through the page, but it gives only technical
>> information about the directories and no information about how these
>> folders relate to each other and what they really mean in terms of
>> crawled output.
>
> I agree to an extent. I have therefore added further content to the
> page, linking every key and value data type back to the current
> Javadocs for both Hadoop and Nutch. My justification for doing this is
> that you can then follow the links to the Javadoc and read them. We are
> trying to build pages on the wiki which stand the test of time, hence
> the recent update of this page. I hope that the precise Javadoc
> locations are helpful to you.
>
>> For example: does crawl_parse contain all the crawled data parsed in
>> terms of HTML tags, or does it just contain all the URLs extracted
>> from the pages?
>
> This can be seen from reading the Javadocs. I would also highly
> encourage you to further populate the Java documentation for both Nutch
> and Hadoop if you feel it is lacking. This is an extremely valuable
> exercise for us all. Maybe the expanded sentiment/description that you
> are looking for would best be placed within a patch sent to augment the
> existing Javadoc? Just a thought.
> Thanks
> Lewis
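
To see concretely what crawl_parse holds, here is a minimal sketch that
dumps its records with Hadoop's SequenceFile reader. The class name and
the part-file path are made up for illustration, but the key/value types
(Text URL keys, CrawlDatum values) are what Nutch writes there:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;

    public class CrawlParseDump {
      public static void main(String[] args) throws Exception {
        // One part file inside a segment's crawl_parse directory, e.g.
        // crawl/segments/20151003123456/crawl_parse/part-00000
        Path part = new Path(args[0]);
        Configuration conf = new Configuration();
        try (SequenceFile.Reader reader =
                 new SequenceFile.Reader(conf, SequenceFile.Reader.file(part))) {
          Text url = new Text();               // keys are the URLs
          CrawlDatum datum = new CrawlDatum(); // status, score, signature,
                                               // fetch time, metadata
          while (reader.next(url, datum)) {
            System.out.println(url + "\t" + datum);
          }
        }
      }
    }

For a quick look without writing code, bin/nutch readseg -dump <segment>
<outdir> writes the segment contents (including crawl_parse) as plain
text, and bin/nutch updatedb <crawldb> <segment> is the step that merges
this data back into the CrawlDb.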

