Hi, sorry for the late reply.
I once prepared an overview and also a flow diagram as part of
http://www.slideshare.net/sebastian_nagel/aceu2014-snagelwebcrawlingnutch

crawl_parse: all crawling-related data from the parsing step that is used
to update the CrawlDb: outlinks, scores, signatures, and metadata.

Feel free to add anything from the slides to the wiki. If you need the
images, just let me know.

Sebastian

On 10/03/2015 03:28 AM, Lewis John Mcgibbney wrote:
> Hi Folks,
>
> On Fri, Oct 2, 2015 at 4:33 PM, <[email protected]> wrote:
>
>> I have already been through the page, but it gives only technical
>> information about the directories and no information about how these
>> folders relate to each other and what they really mean in terms of
>> crawled output.
>
> I agree to an extent. I have therefore added further content to the
> page, linking every key and value data type back to the current
> Javadocs for both Hadoop and Nutch. My justification for doing this is
> that you can then follow the links to the Javadoc and read them. We are
> trying to build pages on the wiki which stand the test of time, hence
> the recent update of this page. I hope that the precise Javadoc
> locations are helpful to you.
>
>> For example: does crawl_parse contain all the crawled data parsed in
>> terms of HTML tags, or does it just contain all the URLs extracted
>> from the pages?
>
> This can be seen from reading the Javadocs. I would also highly
> encourage you to further populate the Java documentation for both Nutch
> and Hadoop if you feel it is lacking. This is an extremely valuable
> exercise for us all. Maybe the expanded sentiment/description that you
> are looking for would best be placed within a patch sent to augment the
> existing Javadoc? Just a thought.
> Thanks
> Lewis
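
To see concretely what crawl_parse holds, here is a minimal sketch that
dumps its records with Hadoop's SequenceFile reader. The class name and
the part-file path are made up for illustration, but the key/value types
(Text URL keys, CrawlDatum values) are what Nutch writes there:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;

    public class CrawlParseDump {
      public static void main(String[] args) throws Exception {
        // One part file inside a segment's crawl_parse directory, e.g.
        // crawl/segments/20151003123456/crawl_parse/part-00000
        Path part = new Path(args[0]);
        Configuration conf = new Configuration();
        try (SequenceFile.Reader reader =
                 new SequenceFile.Reader(conf, SequenceFile.Reader.file(part))) {
          Text url = new Text();               // keys are the URLs
          CrawlDatum datum = new CrawlDatum(); // status, score, signature,
                                               // fetch time, metadata
          while (reader.next(url, datum)) {
            System.out.println(url + "\t" + datum);
          }
        }
      }
    }

For a quick look without writing code, bin/nutch readseg -dump <segment>
<outdir> writes the segment contents (including crawl_parse) as plain
text, and bin/nutch updatedb <crawldb> <segment> is the step that merges
this data back into the CrawlDb.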

